blog.literarymachine.net

Is there something missing in the link?

An inherent problem of the web (be it of documents or of data) as we know it today is that it has no notion of time. Both the protocol (HTTP) and the link syntax (in HTML, RDF and whatnot) assume that documents (or “information resources”, aka “representations of resources”) are stable and that the web is synchronous. In other words: it is not foreseen that information resources have a history, that they change over time. If you follow a link and get a “404 Not Found” response from a web server, there is no way (at least no standardized way) to find out whether the document has gone away, when it went away, or whether it was perhaps never there in the first place. And if a document is still available, you cannot (again, at least not by any standard) find out whether it has changed since the link was created.

Memento fixes this on the HTTP level by adding an “Accept-Datetime” header, thus implementing “Content-Negotiation in the datetime dimension”, as the Memento team phrases it. This makes it possible to request a document as it was at a certain point in time. The problem that remains, though, is how to figure out which version of a document a link points to, because a link implicitly always points to the most current version. This is something that Memento cannot solve, because it is not a problem at the HTTP level but rather at the level of encoding links. At least in scientific publications, web resources are usually cited something like this:

Memento - Adding Time to the Web http://www.mementoweb.org/ (Last accessed 2011-11-20)

This is obviously not ideal for machines to process. Assuming (perhaps unrealistically, but I’m an optimist) that browsers will support the “Accept-Datetime” header at some point, something like the following could be a simple way to add the missing datetime information a browser would need to make a precise request:

<a href="http://www.mementoweb.org/" reftime="2011-11-30T14:12+01:00">Memento</a>
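
To make this a bit more concrete, here is a minimal sketch of what a client could do with such a link. Keep in mind that reftime is only the attribute proposed above and not part of HTML, and that acceptDatetimeFor is a name made up for this example. The only thing taken from Memento is that “Accept-Datetime” carries an HTTP date in GMT. The sketch assumes a reasonably current PHP version.

```php
<?php
// Hypothetical helper: turn the proposed reftime value (ISO 8601 with
// minute precision and a UTC offset) into an Accept-Datetime header line.
function acceptDatetimeFor(string $reftime): string
{
    // The leading "!" resets all fields that are missing from the
    // format (here: the seconds) instead of using the current time.
    $time = DateTimeImmutable::createFromFormat('!Y-m-d\TH:iP', $reftime);
    if ($time === false) {
        throw new InvalidArgumentException("Unparsable reftime: $reftime");
    }
    // HTTP dates are always expressed in GMT.
    $utc = $time->setTimezone(new DateTimeZone('UTC'));
    return 'Accept-Datetime: ' . $utc->format('D, d M Y H:i:s') . ' GMT';
}

echo acceptDatetimeFor('2011-11-30T14:12+01:00'), "\n";
```

For the link above this prints “Accept-Datetime: Wed, 30 Nov 2011 13:12:00 GMT”. The resulting line could then be sent along with the request, for instance through cURL's CURLOPT_HTTPHEADER option.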

In RDF, the idea of a “context” comes to mind. Most triple stores actually store quads, adding a URI to identify the context of a triple. This context is often the URL of the RDF document where the triple stems from. Something similar could be used to identify the point in time when the triple was received from the source:

<http://lobid.org/resource/HT002948556> dcterms:creator <http://d-nb.info/gnd/135539897> \
<http://lobid.org/resource/page/HT002948556> "2011-11-30T14:12+01:00" .
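
What one could do with that extra piece of information can be sketched in a few lines of PHP. This is purely illustrative: the array layout, the key names and the function statementsAsOf are assumptions made for this example, not the API of any triple store.

```php
<?php
// The statement from above: triple, context and, as the hypothetical
// fifth element, the time it was retrieved from the source.
$statements = [
    [
        's' => 'http://lobid.org/resource/HT002948556',
        'p' => 'http://purl.org/dc/terms/creator',
        'o' => 'http://d-nb.info/gnd/135539897',
        'context' => 'http://lobid.org/resource/page/HT002948556',
        'retrieved' => '2011-11-30T14:12+01:00',
    ],
];

// Parse a timestamp with minute precision and a UTC offset.
function parseTime(string $value): DateTimeImmutable
{
    $time = DateTimeImmutable::createFromFormat('!Y-m-d\TH:iP', $value);
    if ($time === false) {
        throw new InvalidArgumentException("Unparsable time: $value");
    }
    return $time;
}

// Keep only the statements retrieved at or before the given time.
function statementsAsOf(array $statements, string $when): array
{
    $limit = parseTime($when);
    return array_values(array_filter(
        $statements,
        function (array $statement) use ($limit) {
            return parseTime($statement['retrieved']) <= $limit;
        }
    ));
}

echo count(statementsAsOf($statements, '2011-12-01T00:00+00:00')), "\n";
```

Asking for the state of the graph on December 1st returns the statement, while asking for any time before it was retrieved returns nothing.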

This is just a quick note on something that has been on my mind for a while now, and I haven’t done any in-depth research into whether this problem has already been solved some other way. I’d be glad about a hint if so (and I would also be glad to be proven wrong about this problem existing at all).

Dead Simple: RDF and SPARQL using PHP

The Linked Data/RDF/SPARQL Documentation Challenge finally convinced me that there is something to blog about that actually might be of interest to others (although it is rather unlikely that I will be doing this on a regular basis). Now, with this post’s title I do not want to imply that either RDF or SPARQL is trivial, but that it is dead simple to set up a PHP environment to get your feet wet.

I took my first steps fiddling with Linked Data using RDFLib for Python, but for some reason I always end up using PHP as the programming language of my choice. For me, there are basically two options when it comes to processing RDF with PHP5 (there obviously are more, such as RAP, I just never dug deep into them).

One of them is ARC2, and I used it for my first experiments. For several reasons I was never quite happy with it, though. It is closely tied to a MySQL database, and the last time I looked it was, for example, not possible to execute SPARQL queries against in-memory graphs. This is due to the fact that in-memory graphs are basically PHP arrays, which I didn’t like either. I always ended up with far too many isset calls, leading to verbose and ugly code.

Fortunately, at some point I discovered that librdf has language bindings for PHP, which turned out to be plain awesome. I started writing a simple object-oriented wrapper but eventually stumbled across an existing one (LibRDF, installed below), so I didn’t even have to do that. I have contacted the author: he hasn’t been working on it since 2006 and doesn’t plan to do so in the future. Since I found it extremely useful (and am amazed that it does not seem to be widely used), I created a git(hub) repository for the wrapper. So far, it only contains the original code.

Anyway, in the end, this is all you need to get started with RDF and SPARQL using PHP5 on Ubuntu (10.10 in my case):

$ sudo apt-get install php5-librdf librdf-storage-mysql
$ pear install http://reallylongword.org/projects/librdf-php/LibRDF-1.0.0.tgz

If you don’t like PEAR, you can of course also manually download the tarball or clone the git repository.

This installs the librdf module for PHP5, a storage backend for MySQL and the object oriented wrapper. Now all we need is some data. Let’s use, for no particular reason, Richard Cyganiak’s FOAF file. For a start, let’s find out who Richard knows by loading that file into a graph and executing a simple SPARQL query on it:

<?php
require_once('LibRDF/LibRDF.php');
 
// All models, i.e. graphs, reside in a storage. This defaults to
// memory.
$store = new LibRDF_Storage();
$model = new LibRDF_Model($store);
 
// Load some data into the model. The format must explicitly be
// declared for the parser, but using e.g. ARC's format detector
// should be easy to implement. Anyways, in this case we're
// dealing with an RDF/XML document:
$model->loadStatementsFromURI(
        new LibRDF_Parser('rdfxml'),
        'http://richard.cyganiak.de/foaf.rdf');
 
// Create a SPARQL query
$query = new LibRDF_Query("
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
SELECT ?name1 ?name2
WHERE
  {
    ?person1 foaf:knows ?person2 .
    ?person1 foaf:name ?name1 .
    ?person2 foaf:name ?name2 .
  }
", null, 'sparql');
 
// Execute the query. The results of a SPARQL SELECT provide
// array access by using the variables used in the query as keys:
$results = $query->execute($model);
foreach ($results as $result) {
    echo $result['name1'] . " knows " . $result['name2'] . "\n";
}

Here is what we get:

Richard Cyganiak knows Roman Szmidt
Richard Cyganiak knows Jan Kretzschmar
Richard Cyganiak knows Jörg Meltzer
Richard Cyganiak knows Chris Bizer
Richard Cyganiak knows Manuel Schulze
(...)
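
If you would rather work with a plain PHP data structure than iterate over the result object directly, the rows can be collected into an array. The following is a minimal sketch: collectKnows is a name made up for this example, and it only assumes what the loop above already relies on, namely that each row offers array access by variable name and that the values can be cast to strings. To keep it runnable without librdf, it is fed with plain arrays holding the first two rows of the output above. It assumes PHP 7.1 or later for the iterable type hint.

```php
<?php
// Hypothetical helper: group the ?name2 values by ?name1.
function collectKnows(iterable $rows): array
{
    $knows = [];
    foreach ($rows as $row) {
        $knows[(string) $row['name1']][] = (string) $row['name2'];
    }
    return $knows;
}

// Stand-in rows taken from the output above. With librdf you would
// pass the return value of $query->execute($model) instead.
$rows = [
    ['name1' => 'Richard Cyganiak', 'name2' => 'Roman Szmidt'],
    ['name1' => 'Richard Cyganiak', 'name2' => 'Jan Kretzschmar'],
];
print_r(collectKnows($rows));
```

The result is an associative array with one entry per person, each holding the list of names that person knows.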

If you don’t feel like learning SPARQL at all, you’re still not lost, at least for simple queries. librdf provides a couple of convenient methods for traversing a graph (findStatements, getSource, getArc and getTarget), and you can use them through the wrapper instead:

<?php
require_once('LibRDF/LibRDF.php');
 
$store = new LibRDF_Storage();
$model = new LibRDF_Model($store);
$model->loadStatementsFromURI(
        new LibRDF_Parser('rdfxml'),
        'http://richard.cyganiak.de/foaf.rdf');
 
$foafKnows = new LibRDF_URINode("http://xmlns.com/foaf/0.1/knows");
$foafName = new LibRDF_URINode("http://xmlns.com/foaf/0.1/name");
$results = $model->findStatements(null, $foafKnows, null);
foreach ($results as $result) {
    $person1 = $result->getSubject();
    $person2 = $result->getObject();
    $name1 = $model->getTarget($person1, $foafName);
    $name2 = $model->getTarget($person2, $foafName);
    echo "$name1 knows $name2\n";
}

This produces the exact same output as with the SPARQL query above.

So far, we have loaded our sample data from the web right into memory. But what if we wanted to persist the model to our own database? This is actually also extremely simple. All that has to be done is to not create the storage in memory, but in one of the available persistent backends. Let’s use MySQL:

// Create a new MySQL storage. The second parameter is NOT the
// name of the MySQL database to use, but the name of the
// triplestore. This makes it possible to create several
// triplestores within one MySQL database. The third parameter is
// a string containing the options for the actual MySQL database.
// They should speak for themselves, except for "new='yes'". If
// this option is given, the necessary table structure is created and
// any existing triples are dropped. You probably only want to use
// it in some kind of setup or installation procedure.
$store = new LibRDF_Storage("mysql", "db",
        "new='no',
        host='localhost',
        database='mydatabase',
        user='foo',
        password='bar'");

Now we can proceed as above. Statements loaded into the model will now end up in the MySQL database, and queries executed against the model will query that database.
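
Since the options are passed as one long string, it can be handy to build that string from an array, for example when the credentials come from a configuration file. Here is a minimal sketch; storageOptions is a helper made up for this example and not part of the wrapper. It simply rejects values containing a single quote rather than guessing at librdf's escaping rules.

```php
<?php
// Hypothetical helper: build a storage options string of the form
// "key='value',key='value'" from an associative array.
function storageOptions(array $options): string
{
    $parts = [];
    foreach ($options as $key => $value) {
        if (strpos((string) $value, "'") !== false) {
            throw new InvalidArgumentException(
                "Option $key must not contain a single quote");
        }
        $parts[] = "$key='$value'";
    }
    return implode(',', $parts);
}

echo storageOptions([
    'new' => 'no',
    'host' => 'localhost',
    'database' => 'mydatabase',
    'user' => 'foo',
    'password' => 'bar',
]), "\n";
```

The returned string can then be passed as the third parameter when creating the storage, in place of the literal string used above.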

As challenged, we have achieved the following:

  • Install an RDF store from a package management system on a computer running either Apple’s OSX or Ubuntu Desktop.
  • Install a code library (again from a package management system) for talking to the RDF store in either PHP, Ruby or Python.
  • Programmatically load some real-world data into the RDF datastore using either PHP, Ruby or Python.
  • Programmatically retrieve data from the datastore with SPARQL using either PHP, Ruby or Python.
  • Convert retrieved data into an object or datatype that can be used by the chosen programming language (e.g. a Python dictionary).

All of this with easy-to-install software that runs on standard LAMP systems and just a couple of lines of straightforward code. Of course there are a lot more details to write about, and I might do so at some point, but I’ll wrap it up for now. Let me know what you think or ask questions in the comments.