Alan’s blog

August 3, 2009

data types and interpretation in RDF

Filed under: academic, web development — alan @ 1:46 pm

After following a link from one of Nad’s tweets, I read Jeni Tennison’s “SPARQL & Visualisation Frustrations: RDF Datatyping”. Jeni had been having problems processing RDF of MPs’ expense claims, because the amounts were plain RDF strings rather than typed numbers. She suggests some best practice rules for data types in RDF, based on the underlying philosophy of RDF that it should be self-describing:

  • if the literal is XML, it should be an XML literal
  • if the literal is in a particular language (such as a description or a name), it should be a plain literal with that language
  • otherwise it should be given an appropriate datatype

These seem pretty sensible for simple data types.
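
As a rough illustration of the three rules, here is a minimal Python sketch that writes a literal in N-Triples syntax. The function name `literal` and its keyword arguments are my own invention for this example, not anything from Jeni's post; the sketch refuses to emit an untyped, untagged string rather than guess a datatype.

```python
from typing import Optional

XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_XML_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"


def _escape(text: str) -> str:
    """Escape backslashes and double quotes for an N-Triples string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def literal(value: str, *, is_xml: bool = False,
            lang: Optional[str] = None,
            datatype: Optional[str] = None) -> str:
    """Render one RDF literal in N-Triples syntax (hypothetical helper)."""
    quoted = '"' + _escape(value) + '"'
    if is_xml:
        # Rule 1: XML content becomes an XML literal.
        return quoted + "^^<" + RDF_XML_LITERAL + ">"
    if lang is not None:
        # Rule 2: natural-language text is a plain literal with a language tag.
        return quoted + "@" + lang
    if datatype is None:
        # Rule 3 needs an explicit datatype; do not silently fall back to a bare string.
        raise ValueError("non-XML, non-language literal needs a datatype")
    return quoted + "^^<" + datatype + ">"


print(literal("travel expenses", lang="en"))
print(literal("123.45", datatype=XSD + "decimal"))
```

The last branch is the one that would have helped with the expenses data: an amount has to be declared as, say, xsd:decimal before it can be written at all.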

In work on the TIM project with colleagues in Athens and Rome, we too had issues with representing data types in ontologies, but more to do with the status of a data type. Is a date a single thing “2009-08-03T10:23+01:00”, or is it a compound [[date year="2009" month="8" ...]]?

I just took a quick peek at how Dublin Core handles dates and see that the closest to standard references1 still include dates as ‘bare’ strings with implied semantics only, although one of the most recent docs does say:

“It is recommended that RDF applications use explicit rdf:type triples …”

and David McComb’s “An OWL version of the Dublin Core” gives an alternative OWL ontology for DC that does include an explicit type for dc:date:

<owl:DatatypeProperty rdf:about="#date">
  <rdfs:domain rdf:resource="#Document"/>
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#dateTime"/>
</owl:DatatypeProperty>

Our solution to the compound types has been to have “value classes” which do not represent ‘things’ in the world, similar to the way the RDF for vCard represents complex elements such as names using blank nodes:

<vCard:N rdf:parseType="Resource">
  <vCard:Family> Crystal </vCard:Family>
  <vCard:Given> Corky </vCard:Given>
  ...
</vCard:N>

From2

This is fine, and we can have rules for parsing and formatting dates as compound objects to and from, say, W3C datetime strings.  However, this conflicts with the desire to have self-describing RDF as these formatting and parsing rules have to be available to any application or be present as reasoning rules in RDF stores.  If Jeni had been trying to use RDF data coded like this she would be cursing us!
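
To make the parsing and formatting rules concrete, here is a small Python sketch of the round trip between a W3C datetime string and a compound value. The dictionary keys (`year`, `month`, `tz_minutes`, and so on) are assumptions made for this illustration, not the actual TIM value classes. It relies on `datetime.fromisoformat`, which in older Python versions does not accept a trailing "Z", so this covers only the numeric-offset form shown above.

```python
from datetime import datetime, timedelta, timezone


def to_compound(text: str) -> dict:
    """Parse a W3C datetime string into a compound 'value object' (illustrative keys)."""
    dt = datetime.fromisoformat(text)
    offset = dt.utcoffset()
    return {
        "year": dt.year, "month": dt.month, "day": dt.day,
        "hour": dt.hour, "minute": dt.minute,
        # Offset from UTC in minutes, or None if the string carried no zone.
        "tz_minutes": None if offset is None else int(offset.total_seconds() // 60),
    }


def from_compound(parts: dict) -> str:
    """Format the compound value back into a W3C datetime string (minute precision)."""
    tz = None
    if parts["tz_minutes"] is not None:
        tz = timezone(timedelta(minutes=parts["tz_minutes"]))
    dt = datetime(parts["year"], parts["month"], parts["day"],
                  parts["hour"], parts["minute"], tzinfo=tz)
    return dt.isoformat(timespec="minutes")


compound = to_compound("2009-08-03T10:23+01:00")
print(compound)
print(from_compound(compound))
```

The point of the sketch is where this code lives: every consumer of the compound form needs something like these two functions, which is exactly the knowledge that the RDF itself does not carry.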

This tension between representations of things (dates, names) and more semantic descriptions is also evident in other areas. Looking again at Dublin Core, the metamodel allows a property such as “subject” to have a complex object with a URI and possibly several string values.

Very semantic, but it hardly mashes well with sources that just say <dc:subject>Biology</dc:subject>. Again a reasoning store could infer one from the other, but we still have issues about where the knowledge for such transformations resides.

Part of the problem is that the ‘self-describing’ nature of RDF is a bit illusory. In (Peircean) semiotics the interpretant of a sign is crucial: representations are interpreted by an agent in a particular context assuming a particular language, etc. We do not expect human language to be ‘self-describing’ in the sense of being totally acontextual. Similarly in philosophy words and ideas are treated as intentional, in the (not standard English) sense that they refer out to something else; however, the binding of the idea to the thing it refers to is not part of the word, but separate from it. Effectively the desire to be self-describing runs the risk of ignoring this distinction3.

Leigh Dodds commented on Jeni’s post to explain that the reason the expense amounts were not numbers was that some were published in non-standard ways such as “12345 (2004)”. As an example this captures succinctly the perpetual problem between representation and abstracted meaning. If a journal article was printed in the “Autumn 2007” issue of a quarterly magazine, do we express this as <dc:date>2007</dc:date> or <dc:date>2007-10-01</dc:date>, attempting to give an approximation or inference from the actual represented date?
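
One way to live with this gap is to keep the published representation alongside any interpretation of it. The Python sketch below does that for amounts; the pattern only covers the single non-standard form quoted above, an amount followed by a bracketed year, and the function and field names are assumptions for illustration only.

```python
import re
from decimal import Decimal

# An amount, optionally followed by a bracketed four-digit year, e.g. "12345 (2004)".
_AMOUNT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:\((\d{4})\))?\s*$")


def read_amount(raw: str) -> dict:
    """Keep the published string and add an interpretation only when one is clear."""
    match = _AMOUNT.match(raw)
    if match is None:
        # Unrecognised form: record the representation, claim no meaning.
        return {"raw": raw, "amount": None, "year": None}
    year = int(match.group(2)) if match.group(2) else None
    return {"raw": raw, "amount": Decimal(match.group(1)), "year": year}


for text in ["12345", "12345 (2004)", "about 300"]:
    print(read_amount(text))
```

Anything the pattern does not recognise comes back with the raw string intact and no typed value, so a consumer can tell "not a number" apart from "a number we inferred".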

This makes one wonder whether what is really needed here is a meta-description of the RDF source (not simply the OWL, as one wants to talk about the use of dc:date or whatever in a particular context) that can say things like “mainly numbers, but also occasionally non-standard forms”, or “amounts sometimes refer to different years”. Of course to be machine mashable there would need to be an ontology for such annotation …


  1. see “Expressing Simple Dublin Core in RDF/XML”, “Expressing Dublin Core metadata using HTML/XHTML meta and link elements” and Stanford DC OWL
  2. Renato Iannella, Representing vCard Objects in RDF/XML, W3C Note, 22 February 2001.
  3. Doing a quick web seek, these issues are discussed in several places, for example: Glaser, H., Lewy, T., Millard, I. and Dowling, B. (2007) On Coreference and the Semantic Web, (Technical Report, Electronics & Computer Science, University of Southampton) and Legg, C. (2007). Peirce, meaning and the semantic web (Paper presented at Applying Peirce Conference, University of Helsinki, Finland, June 2007).

June 24, 2009

going SIOC (Semantically-Interlinked Online Communities)

Filed under: academic, web development — alan @ 2:31 pm

I’ve just SIOC enabled this blog using the SIOC Exporter for WordPress by Uldis Bojars.  Quoting from the SIOC project web site:

The SIOC initiative (Semantically-Interlinked Online Communities) aims to enable the integration of online community information. SIOC provides a Semantic Web ontology for representing rich data from the Social Web in RDF.

This means you can explore the blog as an RDF Graph including this post.

<sioc:Post rdf:about="http://www.alandix.com/blog/?p=176">
    <sioc:link rdf:resource="http://www.alandix.com/blog/?p=176"/>
    <sioc:has_container rdf:resource="http://www.alandix.com/blog/index.php?sioc_type=site#weblog"/>
    <dc:title>going SIOC (Semantically-Interlinked Online Communities)</dc:title>
    <sioc:has_creator>
        <sioc:User rdf:about="http://www.alandix.com/blog/author/admin/" rdfs:label="alan">
            <rdfs:seeAlso rdf:resource="http://www.alandix.com/blog/index.php?sioc_type=user&amp;sioc_id=1"/>
        </sioc:User>
    </sioc:has_creator>
...

April 19, 2009

RDF sequences … could they be more semantic?

Filed under: academic, web development — alan @ 12:00 pm

Although triples can in principle express anything (well anything computational), this does not mean they are particularly appropriate for everything1.

RDF sequences are one of the most basic structured types and I have always found the use of rdf:_1, rdf:_2 at best clunky.  In particular I don’t like the fact that the textual form embodies the meaning.

In the RDF schema, rdf:_1, rdf:_2, etc are all instances of the class rdfs:ContainerMembershipProperty and sub-properties of rdfs:member.  However, I was also looking to see if there was some (implicitly defined) property of each of them that said which index they represented.  For example:

<rdf:_3> <rdf:hasSequenceNumber> "3"

This would mean that the fact that rdf:_3 corresponded to the third element in a sequence was expressed semantically by rdf:hasSequenceNumber as well as lexically in the label “_3”.

Sadly I could find no mention of this or any alternative technique to give the rdf:_nnn properties explicit semantics :-(

This is not just me being a purist; having explicit semantics makes it possible to express queries such as gathering together contiguous pairs in a sequence:

<ex:a> ?r1 ?a .
<ex:a> ?r2 ?b .
?r1 <rdf:hasSequenceNumber> ?i .
?r2 <rdf:hasSequenceNumber> ?j .
FILTER (?j = ?i + 1)

Without explicit semantics, this would need to be expressed using string concatenation to create the labels for the relations – yuck!
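
For comparison, this is roughly what the string-handling route looks like when done in application code rather than in the query. It is only a sketch in Python: triples are plain tuples of strings, and the helper names `member_index` and `contiguous_pairs` are made up for the example. The index has to be dug out of the property's URI with a regular expression, which is the part I find clunky.

```python
import re

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# rdf:_1, rdf:_2, ... ; the index exists only as text inside the URI.
_MEMBER = re.compile(re.escape(RDF) + r"_([1-9][0-9]*)")


def member_index(predicate):
    """Return n for the predicate rdf:_n, or None for any other predicate."""
    m = _MEMBER.fullmatch(predicate)
    return int(m.group(1)) if m else None


def contiguous_pairs(triples, subject):
    """Return (a, b) pairs where a and b are adjacent members of `subject`."""
    by_index = {}
    for s, p, o in triples:
        n = member_index(p)
        if s == subject and n is not None:
            by_index[n] = o
    return [(by_index[n], by_index[n + 1])
            for n in sorted(by_index) if n + 1 in by_index]


triples = [
    ("ex:a", RDF + "_1", "first"),
    ("ex:a", RDF + "_2", "second"),
    ("ex:a", RDF + "_3", "third"),
    ("ex:a", RDF + "type", RDF + "Seq"),
]
print(contiguous_pairs(triples, "ex:a"))
```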

Have I missed something? Is there an alternative mechanism in the RDF world that is like this or better?

Mind you I don’t see what’s wrong with a[index] … but maybe that is just too simple?


  1. see also previous posts on “It-ness and identity: FOAF, RDF and RDMS” and “digging ourselves back from the Semantic Web mire”

October 23, 2008

web of data practitioner’s days

Filed under: HCI and usability, academic, web development — alan @ 9:54 am

I am at the Web of Data Practitioners Days (WOD-PD 2008) in Vienna.  Mixture of talks and guided hands-on sessions.  I presented first half of session on “Using the Web of Data” this morning with focus (surprise) on the end user. Learnt loads about some of the applications out there – in fact Richard Cyganiak .  Interesting talk from a guy at the BBC about the way they are using RDF to link the currently disconnected parts of their web and also archives.  Jana Herwig from Semantic Web Company has been live blogging the event.

Being here has made me think about the different elements of SemWeb technology and how they individually contribute to the ‘vision’ of Linked Data. The aim is to be able to link different data sources together. For this, having some form of shared/public vocabulary or ‘data definitions’ is essential, as is some relatively uniform way of accessing data. However, the implementation using RDF or use of SPARQL etc. seems to be secondary: useful for some data, but not for other forms where tabular data may be more appropriate. Linking these different representations together seems far more important than specific internal representations. So I am wondering whether there is a route to linked data that allows a more flexible interaction with existing data and applications as well as ‘sucking’ this data into the SemWeb. Can the vocabularies generated for SemWeb be used as meta information for other forms of information, and can query/access protocols be designed that leverage this but include a broader range of data types?

March 18, 2008

It-ness and identity: FOAF, RDF and RDMS

Filed under: academic, web development — alan @ 8:25 am

Issues of ‘sameness’ are the underpinnings of any common understanding; if I talk about America, bananas or Caruso, we need to know we are talking about the ‘same’ thing.

Codd’s relational calculus was unashamedly phenomenological – if two things have the same attributes they are the same. Of course in practice, we often have things which look the same and yet we know are different: two cans of beans, two employees called David Jones. So many practical SQL database designs use unique ids as the key field of a table effectively making sure that otherwise identical rows are distinct1.

The id gives a database record identity – it is a something independent of its attributes.
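
The difference is easy to try out. The following is a small sketch using Python's built-in sqlite3 module with an in-memory database; the table and column names are invented for the example. It shows a keyless table accepting the same row twice, a table keyed on all its attributes rejecting the duplicate, and a surrogate id giving two otherwise identical cans of beans separate identities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No key: the table behaves as a bag, so an identical row can be stored twice.
conn.execute("CREATE TABLE can_no_key (product TEXT, weight_g INTEGER)")
for _ in range(2):
    conn.execute("INSERT INTO can_no_key VALUES ('beans', 400)")
bag_count = conn.execute("SELECT COUNT(*) FROM can_no_key").fetchone()[0]

# All columns form the key: same attributes means same thing, so the duplicate is rejected.
conn.execute("CREATE TABLE can_all_key (product TEXT, weight_g INTEGER, "
             "PRIMARY KEY (product, weight_g))")
conn.execute("INSERT INTO can_all_key VALUES ('beans', 400)")
try:
    conn.execute("INSERT INTO can_all_key VALUES ('beans', 400)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Surrogate id: two cans with identical attributes are still distinct records.
conn.execute("CREATE TABLE can_with_id "
             "(id INTEGER PRIMARY KEY, product TEXT, weight_g INTEGER)")
for _ in range(2):
    conn.execute("INSERT INTO can_with_id (product, weight_g) VALUES ('beans', 400)")
ids = [row[0] for row in conn.execute("SELECT id FROM can_with_id ORDER BY id")]

print(bag_count, duplicate_rejected, ids)
```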

I usually call this quality ‘it-ness’ and have struggled to find an appropriate (probably German) philosophical term to refer to it. Before we can point at something and say ‘it is a chair’, it must be an ‘it’, something we can refer to. This it-ness must be there before we consider the properties of ‘it’ (legs, seat, etc.). It-ness is related to the substance/accident distinction important in medieval scholastic debate on transubstantiation, but different as the bread needs to be an ‘it’ before we can say that its real nature (substance) is different from its apparent nature (accidents).

In contrast RDF takes identity, as embodied in a URI, as its starting point. The origins of RDF are in web meta-data – talking about web pages … that is RDF is about talking about something else, and that something else has some form of (unique) identity. Although the word ‘ontology’ seems to be misused almost beyond recognition in computer science, here we are talking about true ontology. RDF assumes as a starting point it is discussing things that are, that exist, that have being. Given this of course several distinct things may have similar attributes2.

Whilst RDMS have problems talking about identity, and we often have to add artifices (like the id) to establish identity, in RDF the opposite problem arises. Often we do not have unique names even for web entities, and even less so when we have RDF descriptions of people, places … or books. Nad discusses some of the problems of cleaning up book data (MARC, RDF and FRBR), part of which is establishing unique names … and really books are ‘easy’ as librarians have already spent a long time thinking about identifying them.

FOAF (friend of a friend) is now widely used to represent personal relationships. In this WordPress blog, when I add blogroll entries it prompts for FOAF information: is this a work colleague, family, friend (but not foe or competitor … FOAF is definitely about being friendly!).

FOAF has an RDF format, but examples, both in practice … and in the XMLNS RDF specification, are not full of “rdf:about” links as typical RDF documents are. This is because, while people clearly do have unique identity, there is thankfully no URI scheme that uniquely and universally defines us3.

In practice FOAF says things like “there is a person whose name is John Doe”, or “the blog VirtualChaos is by a person who is a friend and colleague of the author of this blog”.

In terms of identity this is a blank node “the person who …”. The computational representation of the person is a placeholder, or a variable waiting to be associated with other placeholders.

In terms of phenomenological attributes, the values either do not uniquely identify an individual (there may be many John Does) or the individual may have several potential values for a given attribute (John Doe may not be the body’s only name, and a person may have several email addresses).

In order to match individuals in FOAF, we typically need to make assumptions: while I may have several email addresses, they are all personal, so if two people have the same email address they are the same person. Of course such reasoning is defeasible (some families share an email address), but it serves as a way of performing partial and approximate matching.
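
The email assumption can be sketched in a few lines of Python. This is only an illustration: person descriptions are plain dictionaries rather than parsed FOAF, the function name `merge_by_email` is invented, and the merge is deliberately naive, treating any shared address as proof of identity, which is exactly the defeasible step.

```python
def merge_by_email(people):
    """Group person descriptions that share at least one email address.

    `people` is a list of dicts with a "name" and a set of "emails".
    Returns a list of groups, each a sorted list of indices into `people`.
    """
    parent = list(range(len(people)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # email -> index of the first description that used it
    for i, person in enumerate(people):
        for email in person["emails"]:
            if email in first_seen:
                # Same address: assume the same person and join the two groups.
                parent[find(i)] = find(first_seen[email])
            else:
                first_seen[email] = i

    groups = {}
    for i in range(len(people)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


people = [
    {"name": "John Doe", "emails": {"jd@example.org"}},
    {"name": "J. Doe", "emails": {"jd@example.org", "john@example.com"}},
    {"name": "John Doe", "emails": {"other@example.net"}},
    {"name": "Johnny", "emails": {"john@example.com"}},
]
print(merge_by_email(people))
```

Note that the two descriptions named “John Doe” end up in different groups while “J. Doe” and “Johnny” are merged with the first: the name does not identify, the address (rightly or wrongly) does.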

I think to the semantic web purist the goal would be to have the unique personal URI. However, to my mind the incomplete, often vague and personally defined FOAF is closer to the way the real world works even when ontologically there is a unique entity in the world that is the subject. FOAF challenges simplistic assumptions and representations of both a phenomenological and ontological nature.


  1. Furthermore, if you do not specify a key, RDMS are likely to treat a relation as a bag rather than a set of tuples! Try inserting the same record twice.
  2. For those who know their quantum mechanics, RDMS records are like fermions and obey the Pauli exclusion principle, whilst RDF entities are like bosons and several entities can exist with identical attributes.
  3. As it says in The Prisoner “I am not a number” … although maybe one day soon we will all be biometrically identified and have a global URI :-/

February 23, 2008

practical RDF

Filed under: academic, web development — alan @ 3:13 pm

I just came across D2RQ, a notation (plus implementation) for mapping relational databases to RDF, developed over the last four years by Chris Bizer, Richard Cyganiak and others at Freie Universität Berlin. In a previous post, “digging ourselves back from the Semantic Web mire”, I worried about the ghetto-like nature of RDF and the need for “abstractions that make non-triple structures more like the Semantic Web”. D2RQ is exactly this sort of thing, allowing existing relational databases to be accessed (but not updated) as if they were RDF triple stores, including full SPARQL queries.

As D2RQ has clearly been around for years, I tried to do a bit of a web search to find things the other way around – more programmer-friendly layers on top of RDF (or XML) allowing it to be manipulated with IDL-like or other abstractions closer to ‘normal’ programming. ECMAScript for XML (E4X) seems to be just this, allowing reasonably easy access to XML (but I guess RDF would be ‘flat’ in this). E4X has been around a few years (standard since 2005), but as far as I can see is not yet in IE (surprise!). I guess for really practical XML it would be JSON, and there’s a nice discussion of different RDF in JSON representation issues on the n2 wiki, “RDF JSON Brainstorming”. However, both E4X and RDF in JSON are still just accessing RDF nicely, not adding higher level structure.

Going back to the beginning I was wondering about any tools that represent RDF as SQL / RDMS in order to make it available to ‘old technology’ … but then remembered that SPARQL creates tuples not triples so, I guess, one could say that is exactly what it does :-/

November 11, 2007

digging ourselves back from the Semantic Web mire

Filed under: academic, web development — alan @ 10:10 am

Discussions on the Talis Platform Advisory Group prompted me to look at some of the APIs of new Semantic-Web-like services such as Freebase1.

Freebase is interesting as its underlying representation is graph/relationship based like RDF, but its Metaweb Query Language (MQL) uses JSON which is a more programming-like whole and parts representation with arrays and slots. Facebook’s new Data Store API also has objects and associations, but does not use RDF or other obvious web technologies.

So the question is – if the closest things to Semantic Web apps on the internet don’t use SemWeb technology like RDF, SPARQL etc. … are these SemWeb technologies fit for purpose or indeed useful at all?

I think the answer is that (i) partly they are not fit for purpose – caught in a backwater by their history, but (ii) that is like all things and they are what we have got, and (iii) we can use some of the tools of computing to make them work …



  1. listen to Talis’ podcast interview with Jamie Taylor, Freebase’s ‘Minister of Information’ (sic).
