Triple Loading

Phil Dawes is working on an ifpstore to back his Veudas RDF editor. Yesterday he published some benchmarks for importing triples into ifpstore.

While there are quite a number of variables — the data structures differ a bit, the hardware used isn’t identical, etc… — I thought I’d try to redo the benchmarks with the Redland/MySQL storage backend.

First I tried importing the four separate files from Wordnet; then the same data serialised as NTriples in a single file; and last a single RDF/XML file, generated naïvely from the NTriples one (meaning the file is larger than the four raw files combined, even though it contains the same triples).
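For reference, here is what one hypothetical triple looks like in the two serialisations (the URIs are made up, not actual Wordnet terms). NTriples spends one line per triple:

```
<http://example.org/wn/dog> <http://example.org/wn/hyponymOf> <http://example.org/wn/canine> .
```

while a naïve RDF/XML serialisation wraps every triple in its own rdf:Description element, which is why the file ends up larger:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:wn="http://example.org/wn/">
  <rdf:Description rdf:about="http://example.org/wn/dog">
    <wn:hyponymOf rdf:resource="http://example.org/wn/canine"/>
  </rdf:Description>
</rdf:RDF>
```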

Then I repeated the tests, this time with the bulk loading features — table locking and temporary disabling of indices — of the MySQL storage backend turned on. The benchmark below includes the index rebuilding phase that takes place after the actual load.
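The bulk-loading trick is plain MySQL/MyISAM machinery. A minimal sketch of the SQL statements that bracket such a load (the table name is made up; the actual backend issues the equivalent statements through its own connection):

```python
def bulk_load_sql(table="Statements"):
    """Return the SQL statements that bracket a bulk load on a MyISAM
    table: lock it for writing, switch off per-row maintenance of the
    non-unique indices, and rebuild them in one pass afterwards (that
    rebuild is the index-rebuilding phase included in the benchmark)."""
    before = [
        "LOCK TABLES %s WRITE" % table,          # keep other writers out
        "ALTER TABLE %s DISABLE KEYS" % table,   # stop updating indices per insert
    ]
    after = [
        "ALTER TABLE %s ENABLE KEYS" % table,    # rebuild indices in one pass
        "UNLOCK TABLES",
    ]
    return before, after

before, after = bulk_load_sql()
for stmt in before:
    print(stmt)
print("-- ... INSERT one row per triple here ...")
for stmt in after:
    print(stmt)
```

Deferring the index maintenance is what makes the "Bulk" column faster even though the rebuild itself is counted in the times.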


Redland/MySQL Wordnet Benchmark (seconds)

Input      Standard  Bulk
Separate   210       181
RDF/XML    225       169
NTriples   205       158

As the numbers suggest, the bulk loading features work, and Raptor is faster at parsing NTriples than RDF/XML, which is hardly a surprise.

What the numbers don’t show is that the entire load process is CPU bound; it’s not disk access that takes the time. Also, the number of triples, 473589, isn’t enough to fill up the in-memory cache MySQL maintains, and not even importing a large dataset like Jim’s 6.7 million scuttered statements (converted into NTriples with jim2ntriples) seems to be. With bulk loading turned on, that entire process takes about 34 minutes, equivalent to about 3300 triples per second, compared to the roughly 3000 triples per second for the best case above.
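The throughput figures are easy to recompute from the numbers above:

```python
# Best case in the table above: 473589 Wordnet triples in 158 seconds
# (NTriples with bulk loading).
wordnet_rate = 473589 / 158.0

# Jim's scuttered dataset: 6.7 million statements in about 34 minutes.
scuttered_rate = 6.7e6 / (34 * 60.0)

print("Wordnet:   %.0f triples/s" % wordnet_rate)    # roughly 3000
print("Scuttered: %.0f triples/s" % scuttered_rate)  # roughly 3300
```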

One thought on “Triple Loading”

  1. Bleh. I think all my julie operations are disk bound. I’m adding http://www.semanticweb.org/library/wordnet/wordnet_similar-20010201.rdf , and I’m seeing 40% idle reported from top (with the used time split between mysqld and python).

    I’m not sure if this means it’s time for a new disk, or if it’s just time to move my MySQL dbs to a different location: My disk is pretty full (6gig free / ~45), I wonder if that could be causing any of the problems.

    It’s especially frustrating when I’m doing queries: almost everything is I/O bound by a huge amount, meaning my CPU is at 10% and my disk is spinning like mad. In Windows, I’d say it was time to defragment, but I’m not sure how to do that in Linux, or whether I even need to.

    Then again, part of this may be because I’m doing the index building along with everything else: I suppose I’d have to turn that off temporarily and see how far that gets me in comparison. I still think that it’s my disk that’s bad, rather than anything else, which is annoying.

Comments are closed.