Category Archives: MySQL

http://www.mysql.com/

Redland Hacking

During the last few days, I’ve been hacking a bit on — and with — Redland.

First off, I verified that a bug and associated patch from Simon Cross regarding portability of the hash calculations in the MySQL storage engine was indeed working. When originally writing the code for this I hadn’t thought of the use case of accessing a storage on a different architecture, but that is of course an important one. The issue is now closed, Dave Beckett has applied the patch to CVS.

I also created an issue regarding the design decision to not look for hash collisions, 28: Hash collisions possible in MySQL storage engine. I don’t have a solution ready for this, but I thought it would be a good idea to get it out in the open, so people are aware of the problem.

Another minor issue with the MySQL storage was its excessive use of connections, especially visible when using Rasqal. I wrote a patch to make it use persistent connections, and Christopher Schmidt was kind enough to help me test it. It seems to be working fine — it does here as well, so I sent a message to redland-dev asking for comments, hopefully this will get into CSV soon as well.

Then came a bit of work on the long-running issue with the PHP interface to Redland. PHP has its own unique NULL-value, so when the Redland Bindings blindly returned a C NULL wrapped in a regular PHP object (in the case of an error), Redland would crash Apache/PHP upon trying to use that object. In the past, Dave has been kind enough to hack a bit here and there when I ran into problems, but I decided to try to close the issue more pemanently. Thus, as explained in 15: PHP binding functions should return a PHP null, I patched the pointer return function to always return a PHP NULL instead a C one. My first version of the patch seems to have been faulty, as Dave couldn’t apply it to CVS, but I created a new one that I hope will do a better job. Also, as a side effect of this change, it is now no longer possible to pass a C NULL into some of the Redland functions where needed, so it seems we have to create a few PHP helper functions to return a C NULL wrapped in a PHP object…

I’ve got more ideas for improvement to Redland, but they really can’t be considered as anything other than feature requests to be coded on a day (and night) with nothing else to do, so I haven’t created issues for these:

  • An option for the MySQL storage to prefix table names with a constant string, to make it possible to have more than one storage in the same database, inspired by the way WordPress does it, and to help out with Dan Brickley’s SparqlPress project.
  • Some builtin “reasoning” functions, to — among other things — make my Redland Smusher obsolete. I’ve discussed this a bit with Dave, but we still haven’t figured out the “right” or best way to implement it.
  • It seems the new version of 3store will store simple datatyped literals like integers in separate columns, to make it easier for the database enginge to work with the values and to better support SPARQL. I think I’d like to do the same for the Redland MySQL storage, but still have to figure out the implications.
  • A new MySQL storage enginge that reads — later on maybe writes as well — the Jena schema layout. This could perhaps be an option to the MySQL storage enginge, in which case it would be almost trivial to also add an option for storing in a simpler, denormalized layout, where all the information is in a single table instead of spread out over four.

Last, and in some sense also least, I hacked a little conversion service, CSV-SPARQLer, that simply takes a URI to a CSV file and turns it into SPARQL Variable Bindings Results format (example, show query, extra example, show extra query).
As the extra example shows, I wanted to be able to subscribe to the action that goes on in the Redland Issue Tracker, but all it made available was a CSV file, so there: A CSV file converted into SPARQL result format, then converted into RSS through SPARQL Conversions XSLT. The resulting RSS is not perfect, notably the titles are a bit generic, but it’s good enough.

Triple Loading

Phil Dawes is working on an ifpstore to back his Veudas RDF editor. Yesterday he published some benchmarks for importing triples into ifpstore.

While there are quite a number of variables — the data structures differ a bit, the hardware used isn’t identical, etc… — I thought I’d try to redo the benchmarks with the Redland/MySQL storage backend.

At first I tried importing the four separate files from Wordnet, then serialised in NTriples syntax in one file, and last a single RDF/XML file, generated naïvely (meaning that the size of the file is larger than the four raw files combined, even if it contains the same triples) from the NTriples one.

Then I repeated the tests, this time with the bulk loading features — table locking and temporary disabling of indices — of the MySQL storage backend turned on. The benchmark below includes the index rebuilding phase that takes place after the actual load.

[Redland/MySQL Wordnet Benchmark]

Redland/MySQL Wordnet Benchmark Standard Bulk
Separate 210 181
RDF/XML 225 169
Ntriples 205 158

As the numbers suggest, the bulk loading features work, and raptor is faster at parsing NTriples than RDF/XML, hardly a surprise.

What doesn’t show in the numbers is the fact that the entire load process is CPU bound, it’s not disk accesses that’s taking time. Also, the amount of triples, 473589, isn’t enough to fill up the in-memory cache MySQL maintains (here), not even importing a large dataset like Jim’s 6.7 million scuttered statements (converted into NTriples with jim2ntriples) seems to be. With bulk loading turned on, that entire process takes about 34 minutes, equivalent to about 3300 triples per second, as compared to the about 3000 triples per second for the best case above.

librdfutil

Since last year I have been storing metadata about my photos, and other stuff like the twilight data, in a Redland triple store, backed by the MySQL storage implementation I wrote. I have been writing various PHP scripts for maintaining and querying the various graphs, and I found that there were a few basic tasks I kept implementing, writing the same code over and over again. At one point I finally started factoring them out into a common “library”, librdfutil (syntax highlighted version).

Most of the functions are only helpful in a PHP and/or MySQL environment, but a few of them might make it into the core Redland API at some point if other people find them useful and we can persuade Dave Beckett to include them.

The current version of this library is 0.0.1 (this entry will server as the changelog).

  • librdfutil_mysql_cbd_original_lite: Add the original CBD (without reification) for a node object to a model.
  • librdfutil_mysql_cbd_lite: Add revised CBD (without reification) for a node object to a model.
  • librdfutil_strings_to_node: Create a new librdf_node from a set of strings.
  • librdfutil_stringset_to_statement: Create a new librdf_statement from a set of strings.
  • librdfutil_tuple_to_statement: Create a new librdf_statement from a database tuple.
  • librdfutil_model_to_string: Get a serialised representation of a model, in R3X or Turtle syntax.
  • librdfutil_stream_to_string: Get a serialised representation of a stream of statements, in R3X or Turtle syntax.
  • librdfutil_node_to_turtle_string: Generate Turtle syntax fragment for node.
  • librdfutil_node_to_hash: Get a string hash (the MySQL ID) of a node object.
  • librdfutil_strings_hash: Get a string hash (the MySQL ID) of a node.

A few additional notes:

  • The MySQL specific functions, the ones that start with librdfutil_mysql, require ADODB.
  • In case anyone is wondering about the naming of the CBD functions, I expect to implement the full specification(s) at a later date — or perhaps someone else will contribute them…
  • At some point, a pretty API overview with usage examples will be created, for easy reference.

Comments, bug reports, and suggestions are as always much appreciated. Thanks to Russell Cloran for initial thoughts and comments on the pre-release version.

Redland/MySQL utilities

Since it seems a number of people are using the MySQL backend storage for Redland, most notably crschmidt with his julie in #julie on freenode, now seems like a good time to “release” a few shell scripts I have put together.

Common to all of them is the use of the Unix principle of quiet operation iff successful, and non-zero return codes when not.

redland-mysql-optimize (latest version: 0.2)
As with all databases, it’s a good idea to make sure the indices are up-to-date. The redland-mysql-optimize script will make sure that all relevant tables in a triple store have indices that reflect the contents of each table in the best way possible. The script depends on the mysql command line client.
Usage example:
redland-mysql-optimize db

Changes since 0.1:

  • Changed a $1 to $*, thanks Russell!
redland-mysql-clean (latest version: 0.2)
When a statement is deleted from a MySQL triple store through the Redland API, the associated node tables (Bnodes, Resources and Literals) are not updated — more on why that is in a separate post. Also, a statement can currently be asserted more than once in a single context.
Together, these issues can lead to the various tables containing more data than needed. The redland-mysql-clean script will, when run against a specific database, remove all orphaned nodes and duplicate statements in all models. It’s a good idea to run the redland-mysql-optimize script after a cleanup. The script depends on perl, the mysql command line client, and the join (1) utility.
Usage example:
redland-mysql-clean db

Changes since 0.1:

  • Fixed interpolation of $model, thanks to Russell – again!
redland-mysql-drop-model (latest version: 0.1)
The Redland API doesn’t include a method for removing a model once it has been created in a storage. This script makes up for that, helping out when a model is no longer needed. Be sure to run the redland-mysql-clean and redland-mysql-optimize scripts afterwards. The script depends on perl and the mysql command line client.
Usage example:
redland-mysql-drop-model model db

NOTE: This is a destructive script.

Concise Bounded Resource Descriptions in Redland/MySQL

While I’m not sure about the merits of the entire URIQA proposal by Patrick Stickler, it does introduce the very nice concept of CBD‘s.

The concept is similar to — actually a superset of — FOAF’s notion of minimally identifying set of properties, the set of properties for a person that is needed to identify, display and get more information about the person, usually including a name (or nickname), at least one inverse functional property and a link, rdfs:seeAlso.

For this reason, and a few others, I decided to implement this in Redland and the Redland/MySQL storage engine as a method for the Model “class”, librdf_model_cbd_as_stream. Since I wanted to leave it up to each storage implementation how to implement it, it turned out to require quite a few source file changes, but I will be handing them over to Dave Beckett for inclusion in the next version of Redland if he sees it fit.

The definition of CBD is recursive, as for each bnode object the statements where it appears as a subject must be included in the result and so on, but implementing infinite recursive queries in SQL is impossible. To overcome this issue, I decided to go with the following algorithm (node is the input resource for which a CBD is wanted):

list of nodes = (node)
count of nodes = 1
REPEAT
  last count of nodes = count of nodes
  list of nodes = SQL(bnodes objects of statements with subject in list of nodes) + node
  count of nodes = COUNT(list of nodes)
UNTIL count of nodes = last count of nodes
RETURN statements with subject in list of nodes

The SQL generated for the query for bnode objects looks like this (operating on the most recent Redland/MySQL storage engine database schema):

select distinct ID
from Statements join Bnodes on Object=ID
where Subject=7972813756443468730 or Subject=10313337636846108089

While the algorithm works, and doesn’t put too much strain on the connection between the client and server, it does require at least one extraneous query, since the loop ends when two subsequent queries yield the same result. Hints on improving this will be much appreciated.

Please note that I have left out step 3 of the CBD definition, the reification part. This is mostly due to the reason that I don’t work with reification in my models, but also because I don’t see reification in the RDF sense to be of much use in practical implementations.

Also, in contrast to the CBD definition, this algorithm and implementation allows for CBDs for bnodes, not just URIs.