Category Archives: Databases

The Gargonza Experiment

Only a few weeks after SPARQLing Days I’m now finally ready to report some progress…

2005-05-07: Updated to version 0.3, see changelog below.

SPARQL

As is now common knowledge, one outcome of the great gathering in Tuscany is The Gargonza Experiment, which attempts to create a set of showcases for the Resource Description Framework (RDF) and the SPARQL Query Language for RDF. Already the “community theatre” has gained a bit of steam and an increasing number of followers, and I’m sure there’ll be more to come as people get the stuff they started working on completed. To keep up-to-date, I suggest y’all subscribe to Planet RDF or at least to the Gargonza Experiment category that Danny Ayers — my gracious host for a few days following the main event — is maintaining.


To assist in that goal, I’ve made a subset of my photo description database available for SPARQL queries via Sparqlette, a SPARQL demo query service. The subset currently contains only descriptions of the photos taken during SPARQLing Days, and not all photos have been through my annotation process yet; there are still some foaf:depicts triples to be made…


Triple Loading

Phil Dawes is working on an ifpstore to back his Veudas RDF editor. Yesterday he published some benchmarks for importing triples into ifpstore.

While there are quite a number of variables — the data structures differ a bit, the hardware used isn’t identical, etc… — I thought I’d try to redo the benchmarks with the Redland/MySQL storage backend.

At first I tried importing the four separate files from Wordnet, then the same triples serialised in NTriples syntax in a single file, and finally a single RDF/XML file generated naïvely from the NTriples one (naïvely meaning that the file is larger than the four raw files combined, even though it contains the same triples).

Then I repeated the tests, this time with the bulk loading features of the MySQL storage backend (table locking and temporarily disabled indices) turned on. The benchmark below includes the index rebuilding phase that takes place after the actual load.

Redland/MySQL Wordnet Benchmark (load time in seconds)

Input      Standard   Bulk
Separate   210        181
RDF/XML    225        169
NTriples   205        158

As the numbers suggest, the bulk loading features work, and raptor is faster at parsing NTriples than RDF/XML, hardly a surprise.
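
As a rough illustration of what "table locking and temporarily disabled indices" can look like on the MySQL side, the sketch below only builds the statements that would bracket a bulk load. It is not taken from the Redland source: the function name is hypothetical, the table names are assumptions borrowed from the names mentioned elsewhere on this page, and DISABLE KEYS only affects non-unique indices on MyISAM tables.

```python
def bulk_load_statements(tables):
    """Return (before, after) lists of MySQL statements for a bulk load.

    `tables` is a list of table names. The names passed in below are
    assumptions for illustration, not the exact Redland schema.
    """
    locks = ", ".join(f"{t} WRITE" for t in tables)
    before = [f"LOCK TABLES {locks}"]
    before += [f"ALTER TABLE {t} DISABLE KEYS" for t in tables]
    # ... the INSERT statements for the triples would run here ...
    # ENABLE KEYS triggers the index rebuild included in the timings.
    after = [f"ALTER TABLE {t} ENABLE KEYS" for t in tables]
    after.append("UNLOCK TABLES")
    return before, after


before, after = bulk_load_statements(
    ["Statements", "Resources", "Bnodes", "Literals"]
)
for sql in before + after:
    print(sql + ";")
```

The point of the ordering is that indices are rebuilt once at the end instead of being updated for every inserted row.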

What doesn’t show in the numbers is that the entire load process is CPU bound; disk access is not what takes the time. Also, the number of triples, 473589, isn’t enough to fill up the in-memory cache MySQL maintains, and not even importing a large dataset like Jim’s 6.7 million scuttered statements (converted into NTriples with jim2ntriples) seems to be. With bulk loading turned on, that entire process takes about 34 minutes, equivalent to about 3300 triples per second, compared to about 3000 triples per second for the best case above.
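
The throughput figures can be checked with a few lines of arithmetic. The sketch reads the benchmark numbers as seconds, which is an inference: it is the only reading under which the best case comes out at about 3000 triples per second.

```python
WORDNET_TRIPLES = 473_589

# Benchmark figures, read as seconds (an inference, see above).
timings = {
    ("Separate", "standard"): 210, ("Separate", "bulk"): 181,
    ("RDF/XML", "standard"): 225, ("RDF/XML", "bulk"): 169,
    ("NTriples", "standard"): 205, ("NTriples", "bulk"): 158,
}

rates = {key: WORDNET_TRIPLES / secs for key, secs in timings.items()}
for (source, mode), rate in rates.items():
    print(f"{source:9} {mode:9} {rate:7.0f} triples/s")

# Jim's dataset: about 6.7 million statements in about 34 minutes.
jim_rate = 6_700_000 / (34 * 60)
print(f"Jim's dataset: {jim_rate:.0f} triples/s")
```

The best Wordnet case (NTriples with bulk loading) works out to roughly 2997 triples per second, and the larger dataset to roughly 3284.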

librdfutil

Since last year I have been storing metadata about my photos, and other stuff like the twilight data, in a Redland triple store, backed by the MySQL storage implementation I wrote. I have been writing various PHP scripts for maintaining and querying the various graphs, and I found that there were a few basic tasks I kept implementing, writing the same code over and over again. At one point I finally started factoring them out into a common “library”, librdfutil (syntax highlighted version).

Most of the functions are only helpful in a PHP and/or MySQL environment, but a few of them might make it into the core Redland API at some point if other people find them useful and we can persuade Dave Beckett to include them.

The current version of this library is 0.0.1 (this entry will serve as the changelog).

  • librdfutil_mysql_cbd_original_lite: Add the original CBD (without reification) for a node object to a model.
  • librdfutil_mysql_cbd_lite: Add revised CBD (without reification) for a node object to a model.
  • librdfutil_strings_to_node: Create a new librdf_node from a set of strings.
  • librdfutil_stringset_to_statement: Create a new librdf_statement from a set of strings.
  • librdfutil_tuple_to_statement: Create a new librdf_statement from a database tuple.
  • librdfutil_model_to_string: Get a serialised representation of a model, in R3X or Turtle syntax.
  • librdfutil_stream_to_string: Get a serialised representation of a stream of statements, in R3X or Turtle syntax.
  • librdfutil_node_to_turtle_string: Generate Turtle syntax fragment for node.
  • librdfutil_node_to_hash: Get a string hash (the MySQL ID) of a node object.
  • librdfutil_strings_hash: Get a string hash (the MySQL ID) of a node.

A few additional notes:

  • The MySQL specific functions, the ones that start with librdfutil_mysql, require ADODB.
  • In case anyone is wondering about the naming of the CBD functions, I expect to implement the full specification(s) at a later date — or perhaps someone else will contribute them…
  • At some point, a pretty API overview with usage examples will be created, for easy reference.

Comments, bug reports, and suggestions are as always much appreciated. Thanks to Russell Cloran for initial thoughts and comments on the pre-release version.

Redland/MySQL utilities

Since it seems a number of people are using the MySQL backend storage for Redland, most notably crschmidt with his julie in #julie on freenode, now seems like a good time to “release” a few shell scripts I have put together.

Common to all of them is the use of the Unix principle of quiet operation iff successful, and non-zero return codes when not.
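
Because the scripts stay quiet on success and return a non-zero code on failure, they are easy to chain from another program. A minimal sketch, assuming nothing about the scripts beyond that convention (the helper name run_all is hypothetical):

```python
import subprocess
import sys


def run_all(commands):
    """Run commands in order and stop at the first one that fails.

    Returns 0 if every command succeeded, otherwise the exit code of
    the command that failed.
    """
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"failed ({result.returncode}): {' '.join(cmd)}",
                  file=sys.stderr)
            return result.returncode
    return 0


# Example (requires the scripts to be installed and on the PATH):
# run_all([["redland-mysql-clean", "db"],
#          ["redland-mysql-optimize", "db"]])
```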

redland-mysql-optimize (latest version: 0.2)
As with all databases, it’s a good idea to make sure the indices are up-to-date. The redland-mysql-optimize script will make sure that all relevant tables in a triple store have indices that reflect the contents of each table in the best way possible. The script depends on the mysql command line client.
Usage example:
redland-mysql-optimize db

Changes since 0.1:

  • Changed a $1 to $*, thanks Russell!

redland-mysql-clean (latest version: 0.2)
When a statement is deleted from a MySQL triple store through the Redland API, the associated node tables (Bnodes, Resources and Literals) are not updated — more on why that is in a separate post. Also, a statement can currently be asserted more than once in a single context.
Together, these issues can lead to the various tables containing more data than needed. The redland-mysql-clean script will, when run against a specific database, remove all orphaned nodes and duplicate statements in all models. It’s a good idea to run the redland-mysql-optimize script after a cleanup. The script depends on perl, the mysql command line client, and the join(1) utility.
Usage example:
redland-mysql-clean db
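
To illustrate what "orphaned nodes and duplicate statements" means, here is a toy version of the cleanup using Python's sqlite3 module. It is not how the script itself works (that uses perl, mysql and join). The schema is a simplified assumption with a single Statements table, and a Context of 0 is assumed to mean "no context".

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Statements (Subject INTEGER, Predicate INTEGER,
                         Object INTEGER, Context INTEGER NOT NULL DEFAULT 0);
CREATE TABLE Resources (ID INTEGER PRIMARY KEY, URI TEXT);
CREATE TABLE Bnodes    (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Literals  (ID INTEGER PRIMARY KEY, Value TEXT);
INSERT INTO Resources VALUES (1, 'ex:photo'), (2, 'foaf:depicts'), (3, 'ex:gone');
INSERT INTO Bnodes    VALUES (10, 'b1'), (11, 'b2');
INSERT INTO Literals  VALUES (20, 'left over from a deleted statement');
-- The same statement asserted twice in the same context.
INSERT INTO Statements VALUES (1, 2, 10, 0), (1, 2, 10, 0);
""")


def clean(con):
    # Duplicate statements: keep one row per (S, P, O, Context).
    con.execute("""
        DELETE FROM Statements WHERE rowid NOT IN (
            SELECT MIN(rowid) FROM Statements
            GROUP BY Subject, Predicate, Object, Context)""")
    # Orphaned nodes: IDs no longer referenced by any statement.
    for table in ("Resources", "Bnodes", "Literals"):
        con.execute(f"""
            DELETE FROM {table} WHERE ID NOT IN (
                SELECT Subject FROM Statements
                UNION SELECT Predicate FROM Statements
                UNION SELECT Object FROM Statements
                UNION SELECT Context FROM Statements)""")


clean(con)
for table in ("Statements", "Resources", "Bnodes", "Literals"):
    count = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(table, count)
```

After the cleanup the toy database holds one statement, the two resources and one bnode it references, and no literals.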

Changes since 0.1:

  • Fixed interpolation of $model, thanks to Russell – again!

redland-mysql-drop-model (latest version: 0.1)
The Redland API doesn’t include a method for removing a model once it has been created in a storage. This script makes up for that, helping out when a model is no longer needed. Be sure to run the redland-mysql-clean and redland-mysql-optimize scripts afterwards. The script depends on perl and the mysql command line client.
Usage example:
redland-mysql-drop-model model db

NOTE: This is a destructive script.

Concise Bounded Resource Descriptions in Redland/MySQL

While I’m not sure about the merits of the entire URIQA proposal by Patrick Stickler, it does introduce the very nice concept of CBDs.

The concept is similar to (actually a superset of) FOAF’s notion of a minimally identifying set of properties: the properties of a person needed to identify, display and get more information about that person, usually including a name (or nickname), at least one inverse functional property and a link, rdfs:seeAlso.

For this reason, and a few others, I decided to implement this in Redland and the Redland/MySQL storage engine as a method for the Model “class”, librdf_model_cbd_as_stream. Since I wanted to leave it up to each storage implementation how to do it, this turned out to require quite a few source file changes, but I will be handing them over to Dave Beckett for inclusion in the next version of Redland if he sees fit.

The definition of CBD is recursive: for each bnode object, the statements where it appears as a subject must be included in the result, and so on. Unbounded recursion of that kind cannot be expressed in a single SQL query here, so I decided to go with the following algorithm (node is the input resource for which a CBD is wanted):

list of nodes = (node)
count of nodes = 1
REPEAT
  last count of nodes = count of nodes
  list of nodes = SQL(bnode objects of statements with subject in list of nodes) + node
  count of nodes = COUNT(list of nodes)
UNTIL count of nodes = last count of nodes
RETURN statements with subject in list of nodes

The SQL generated for the query for bnode objects looks like this (operating on the most recent Redland/MySQL storage engine database schema):

select distinct ID
from Statements join Bnodes on Object=ID
where Subject=7972813756443468730 or Subject=10313337636846108089

While the algorithm works, and doesn’t put too much strain on the connection between the client and server, it does require at least one extraneous query, since the loop only ends when two consecutive queries yield the same result. Hints on improving this will be much appreciated.
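
The loop described above can be tried out on a toy database. The following is a sketch in Python with sqlite3, not the actual Redland/MySQL implementation: the function name cbd_statements is hypothetical, the schema is reduced to the columns used in the query above, and small integers stand in for the real 64-bit node IDs.

```python
import sqlite3


def cbd_statements(con, node):
    """Return the statements of the CBD (without reification) for node."""
    nodes = {node}
    while True:
        last_count = len(nodes)
        marks = ",".join("?" * len(nodes))
        rows = con.execute(
            "SELECT DISTINCT ID FROM Statements JOIN Bnodes ON Object = ID "
            f"WHERE Subject IN ({marks})", tuple(nodes))
        # Bnode objects reachable from the current list, plus the start node.
        nodes = {row[0] for row in rows} | {node}
        if len(nodes) == last_count:
            break  # nothing new was found, so the list is complete
    marks = ",".join("?" * len(nodes))
    return con.execute(
        "SELECT Subject, Predicate, Object FROM Statements "
        f"WHERE Subject IN ({marks}) "
        "ORDER BY Subject, Predicate, Object", tuple(nodes)).fetchall()


con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Statements (Subject INTEGER, Predicate INTEGER, Object INTEGER);
CREATE TABLE Bnodes (ID INTEGER PRIMARY KEY, Name TEXT);
INSERT INTO Bnodes VALUES (10, 'b1'), (11, 'b2');
-- 1, 2 and 3 are resources; 10 and 11 are bnodes.
INSERT INTO Statements VALUES
  (1, 100, 10), (1, 100, 2), (10, 101, 11), (11, 102, 2), (2, 103, 3);
""")

for statement in cbd_statements(con, 1):
    print(statement)
```

For node 1 the loop follows the bnodes 10 and 11 but stops at resource 2, so the statement with subject 2 is not part of the result. The last pass through the loop finds nothing new, which is the extraneous query mentioned above.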

Please note that I have left out step 3 of the CBD definition, the reification part. This is mostly because I don’t work with reification in my models, but also because I don’t see reification in the RDF sense as being of much use in practical implementations.

Also, in contrast to the CBD definition, this algorithm and implementation allows for CBDs for bnodes, not just URIs.