Triple Loading

Phil Dawes is working on an ifpstore to back his Veudas RDF editor. Yesterday he published some benchmarks for importing triples into ifpstore.

While there are quite a number of variables — the data structures differ a bit, the hardware used isn’t identical, etc… — I thought I’d try to redo the benchmarks with the Redland/MySQL storage backend.

At first I tried importing the four separate Wordnet files, then the same data serialised as NTriples in a single file, and finally a single RDF/XML file generated naïvely from the NTriples one (meaning that the file is larger than the four raw files combined, even though it contains the same triples).

Then I repeated the tests, this time with the bulk loading features — table locking and temporary disabling of indices — of the MySQL storage backend turned on. The benchmark below includes the index rebuilding phase that takes place after the actual load.
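To make the two bulk loading features a bit more concrete, here is a minimal sketch of the kind of MySQL statements that table locking and temporary index disabling boil down to. The helper name and the table names are made up for illustration, and the exact statements and ordering the Redland storage backend issues may well differ.

```python
def bulk_load_statements(tables):
    """Return (before, after) lists of MySQL statements to wrap a bulk load.

    Hypothetical helper: it only builds the SQL strings, it does not
    talk to a database.
    """
    # Stop maintaining non-unique indices while rows are being inserted.
    before = [f"ALTER TABLE {t} DISABLE KEYS" for t in tables]
    # Take write locks on every table in a single LOCK TABLES statement.
    before.append("LOCK TABLES " + ", ".join(f"{t} WRITE" for t in tables))

    # After the load: release the locks, then rebuild the indices.
    after = ["UNLOCK TABLES"]
    after += [f"ALTER TABLE {t} ENABLE KEYS" for t in tables]
    return before, after


# Table names below are assumptions, not taken from a real store.
before, after = bulk_load_statements(["Statements1", "Resources"])
for sql in before + ["-- ... insert triples here ..."] + after:
    print(sql)
```

The ENABLE KEYS step is where the index rebuilding happens, which is why it makes sense to count it as part of the load time.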

Redland/MySQL Wordnet Benchmark (load time in seconds)

            Standard   Bulk
Separate         210    181
RDF/XML          225    169
NTriples         205    158

As the numbers suggest, the bulk loading features work, and raptor is faster at parsing NTriples than RDF/XML, hardly a surprise.

What doesn’t show in the numbers is that the entire load process is CPU bound; disk access is not what takes the time. Also, the number of triples, 473589, isn’t enough to fill up the in-memory cache MySQL maintains (here), and not even importing a large dataset like Jim’s 6.7 million scuttered statements (converted into NTriples with jim2ntriples) seems to be. With bulk loading turned on, that entire process takes about 34 minutes, equivalent to about 3300 triples per second, compared to about 3000 triples per second for the best case above.
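For reference, the throughput figures can be rechecked with a few lines of arithmetic. This is only a sketch: the function name is made up, and the 6.7 million and 34 minute figures are the rounded values quoted above, so the second result is approximate.

```python
def triples_per_second(triples, seconds):
    """Average load rate over the whole run."""
    return triples / seconds


WORDNET_TRIPLES = 473589      # triples in the Wordnet data
BEST_BULK_SECONDS = 158       # NTriples load with bulk loading turned on

wordnet_rate = triples_per_second(WORDNET_TRIPLES, BEST_BULK_SECONDS)

# Rounded figures from the text: about 6.7 million statements in about 34 minutes.
scutter_rate = triples_per_second(6_700_000, 34 * 60)

print(round(wordnet_rate), round(scutter_rate))
```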

librdfutil

Since last year I have been storing metadata about my photos, and other stuff like the twilight data, in a Redland triple store, backed by the MySQL storage implementation I wrote. I have been writing various PHP scripts for maintaining and querying the various graphs, and I found that there were a few basic tasks I kept implementing, writing the same code over and over again. At one point I finally started factoring them out into a common “library”, librdfutil (syntax highlighted version).

Most of the functions are only helpful in a PHP and/or MySQL environment, but a few of them might make it into the core Redland API at some point if other people find them useful and we can persuade Dave Beckett to include them.

The current version of this library is 0.0.1 (this entry will serve as the changelog).

  • librdfutil_mysql_cbd_original_lite: Add the original CBD (without reification) for a node object to a model.
  • librdfutil_mysql_cbd_lite: Add revised CBD (without reification) for a node object to a model.
  • librdfutil_strings_to_node: Create a new librdf_node from a set of strings.
  • librdfutil_stringset_to_statement: Create a new librdf_statement from a set of strings.
  • librdfutil_tuple_to_statement: Create a new librdf_statement from a database tuple.
  • librdfutil_model_to_string: Get a serialised representation of a model, in R3X or Turtle syntax.
  • librdfutil_stream_to_string: Get a serialised representation of a stream of statements, in R3X or Turtle syntax.
  • librdfutil_node_to_turtle_string: Generate Turtle syntax fragment for node.
  • librdfutil_node_to_hash: Get a string hash (the MySQL ID) of a node object.
  • librdfutil_strings_hash: Get a string hash (the MySQL ID) of a node.

A few additional notes:

  • The MySQL specific functions, the ones that start with librdfutil_mysql, require ADODB.
  • In case anyone is wondering about the naming of the CBD functions, I expect to implement the full specification(s) at a later date — or perhaps someone else will contribute them…
  • At some point, a pretty API overview with usage examples will be created, for easy reference.

Comments, bug reports, and suggestions are as always much appreciated. Thanks to Russell Cloran for initial thoughts and comments on the pre-release version.

Copenhagen Dinner

There was a blogger dinner tonight in Copenhagen. While I should have stayed at home for the FOAF IRC meeting regarding the FOAFCommunityProcess document, I decided it’d be better to go local for once and spend an evening in the company of Dan Gillmor, Thomas Madsen-Mygdal, Christian Dalager, Lisbeth Klastrup, and the rest of the Danish bloggers at the Laundromat Café.

Also, I knew there’d be beer:

Knut Nägele, Joachim Oschlag, Jesper Balslev, Lisbeth Klastrup, Dan Gillmor, Guan Yang

It turned out to be quite a good idea, as there indeed was plenty of beer, and plenty of nice people. Discussed semweb and blogging with Henrik Føhns (of harddisken fame), cameras with Lisbeth (of royal fame), FOAF with Thomas (of reboot fame) and technology in general with everyone. A good time was had.

Redland/MySQL utilities

Since it seems a number of people are using the MySQL backend storage for Redland, most notably crschmidt with his julie in #julie on freenode, now seems like a good time to “release” a few shell scripts I have put together.

Common to all of them is the use of the Unix principle of quiet operation iff successful, and non-zero return codes when not.
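The scripts themselves are shell scripts, but the convention is easy to show in any language. Below is a minimal Python sketch of a hypothetical tool that follows it: nothing is printed on success, and a failure produces a message on stderr plus a non-zero return code. The tool name and the exit code 2 are assumptions for illustration only.

```python
import sys


def main(argv):
    """Hypothetical command: quiet on success, non-zero return code on failure."""
    if len(argv) != 2:
        # Complaints go to stderr, never stdout.
        print(f"usage: {argv[0]} db", file=sys.stderr)
        return 2
    # ... the real work against the database named in argv[1] would go here ...
    return 0  # success: no output at all


# A real script would finish with: sys.exit(main(sys.argv))
```

This makes the tools easy to chain from cron jobs or other scripts, where output is only expected when something needs attention.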

redland-mysql-optimize (latest version: 0.2)
As with all databases, it’s a good idea to make sure the indices are up-to-date. The redland-mysql-optimize script will make sure that all relevant tables in a triple store have indices that reflect the contents of each table in the best way possible. The script depends on the mysql command line client.
Usage example:
redland-mysql-optimize db

Changes since 0.1:

  • Changed a $1 to $*, thanks Russell!

redland-mysql-clean (latest version: 0.2)
When a statement is deleted from a MySQL triple store through the Redland API, the associated node tables (Bnodes, Resources and Literals) are not updated — more on why that is in a separate post. Also, a statement can currently be asserted more than once in a single context.
Together, these issues can lead to the tables containing more data than needed. When run against a specific database, the redland-mysql-clean script removes all orphaned nodes and duplicate statements in all models. It’s a good idea to run the redland-mysql-optimize script after a cleanup. The script depends on perl, the mysql command line client, and the join(1) utility.
Usage example:
redland-mysql-clean db
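To illustrate what "orphaned nodes" and "duplicate statements" mean, here is a small in-memory sketch in Python. It is not how the script works (that is done with perl, mysql and join against the real tables); the function name and the data layout, with statements as (subject, predicate, object, context) tuples of node IDs, are assumptions made for the example.

```python
def clean(statements, nodes):
    """Drop duplicate statements, then drop nodes no statement refers to.

    statements: list of (subject, predicate, object, context) node IDs,
                with None for "no context".
    nodes:      dict mapping node ID to node value.
    """
    seen = set()
    unique = []
    for st in statements:
        if st not in seen:          # keep the first copy only
            seen.add(st)
            unique.append(st)

    # Every ID still mentioned by a remaining statement is in use.
    referenced = {n for st in unique for n in st if n is not None}
    kept = {i: v for i, v in nodes.items() if i in referenced}
    return unique, kept


statements = [(1, 2, 3, None), (1, 2, 3, None), (1, 2, 4, None)]
nodes = {1: "a", 2: "b", 3: "c", 4: "d", 5: "left over from a deleted statement"}
unique, kept = clean(statements, nodes)
```

Here the second copy of the first statement is dropped, and node 5 goes away because nothing refers to it any more.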

Changes since 0.1:

  • Fixed interpolation of $model, thanks to Russell – again!

redland-mysql-drop-model (latest version: 0.1)
The Redland API doesn’t include a method for removing a model once it has been created in a storage. This script makes up for that, helping out when a model is no longer needed. Be sure to run the redland-mysql-clean and redland-mysql-optimize scripts afterwards. The script depends on perl and the mysql command line client.
Usage example:
redland-mysql-drop-model model db

NOTE: This is a destructive script.

SKOS Output from WordPress

Waiting for the bus to the medieval banquet at Dunguaire Castle for the FOAF Galway Workshop (photos), I remembered I had tweaked the FOAF Output Plugin to also output SKOS concepts — and possibly mappings to others’ categories.

Get it before your neighbour, the FOAF Output Plugin, version 1.11 — the Galway Release…

See also the SKOS development toolshed.

In passing, being here in Galway, Ireland is great: not only do I get to meet a lot of smart and interesting people, very much like at FOAF Camp, I also get to add another country to my list of visits.