A common complaint about the RDF/XML syntax in the XML-literate communities is the lack of a simple PHP parser. While Redland with Raptor does the job perfectly, it almost demands root access to install, and doesn’t run on the Windows platform without Cygwin.
The best alternative for PHP is RAP, but it is often said to be too slow, or people have problems understanding and using its API.
In trying to help out, I won’t be writing an RDF/XML parser from scratch (perhaps someone else will port Sean B. Palmer’s rdfxml.py to PHP), but I have created a little wrapper class for RAP, SimpleRdfParser, that only gives access to the RDF/XML parser, and thus doesn’t need the entire library. Also, the exposed API is simply an array of triples (indexed by subject), and together these simplifications improve the parsing speed. There’s still room for improvement, though: RAP was started a while ago and is based on previous syntax specifications, so it contains support for a number of constructs that aren’t legal anymore.
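To make the "array of triples indexed by subject" idea concrete, here is a minimal sketch of such a structure. It is written in Python only to show the shape of the data; it does not call SimpleRdfParser or RAP, the exact array layout the class returns is not shown in this post, and the function name and example.org URIs are made up for illustration.

```python
from collections import defaultdict


def index_by_subject(triples):
    """Group (subject, predicate, object) tuples into a dict keyed by subject.

    Hypothetical helper: this only illustrates a subject-indexed layout,
    it is not the layout SimpleRdfParser is known to produce.
    """
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


# Made-up statements; the example.org URIs are placeholders.
triples = [
    ("http://example.org/#me",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/#me",
     "http://xmlns.com/foaf/0.1/name",
     "Morten Frederiksen"),
    ("http://example.org/#you",
     "http://xmlns.com/foaf/0.1/name",
     "Someone Else"),
]

graph = index_by_subject(triples)

# All statements about one resource are now a single lookup away.
for predicate, obj in graph["http://example.org/#me"]:
    print(predicate, obj)
```

Indexing by subject means that everything said about one resource can be read without scanning the whole list, which is probably the most common access pattern for small scripts.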
In addition to the parse method, string2triples, the class also contains a serialiser, triples2string, which turns the graph into a simple subset of RDF/XML, suitable for handling with a regular XML parser or XSLT, should anyone have those desires…
Examples:
The careful reader will notice that there is something missing in the output: The literal “Morten Frederiksen” should have a language of “en”, but it doesn’t. This is a bug in RAP, which has been reported and will likely be fixed in the next version.
Update: A small benchmark for parsing and reserialising approx. 800 statements (source) 100 times with Redland/Raptor, SimpleRdfParser, and RAP:
- Redland/Raptor: 8 seconds
- SimpleRdfParser: 23 seconds
- RAP: 50 seconds
It turns out Redland/Raptor is about 3 times as fast as SimpleRdfParser, which is about twice as fast as RAP.
Update 2: A more realistic benchmark, doing only the parsing, no serialising:
- Redland/Raptor: 2.7 seconds
- SimpleRdfParser: 17 seconds
- RAP: 25 seconds
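The ratios behind both benchmarks can be checked with a few lines of arithmetic. This is only a sketch over the timings listed above; the dictionary and function names are made up for illustration.

```python
# Timings in seconds, copied from the two benchmark lists above.
parse_and_serialise = {"Redland/Raptor": 8.0, "SimpleRdfParser": 23.0, "RAP": 50.0}
parse_only = {"Redland/Raptor": 2.7, "SimpleRdfParser": 17.0, "RAP": 25.0}


def slowdown(timings, baseline):
    """Return each tool's run time as a multiple of the baseline tool's time."""
    base = timings[baseline]
    return {name: seconds / base for name, seconds in timings.items()}


for label, timings in [("parse + serialise", parse_and_serialise),
                       ("parse only", parse_only)]:
    vs_raptor = slowdown(timings, "Redland/Raptor")
    vs_simple = slowdown(timings, "SimpleRdfParser")
    print(label)
    print("  SimpleRdfParser vs Redland/Raptor: %.1fx" % vs_raptor["SimpleRdfParser"])
    print("  RAP vs SimpleRdfParser: %.1fx" % vs_simple["RAP"])
```

In the parse-only run the gap to Redland/Raptor widens to roughly six times, while RAP's disadvantage against SimpleRdfParser shrinks to roughly one and a half times.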
For a while, I felt very much like everyone thought I was nuts for thinking RAP was slow. Looking at these numbers, it’s good to see that I’m not nearly as nuts as I was made to think I was.
This seems a lot closer to usable – cutting execution time in half gets me a lot closer to what I need. I’ll keep this in mind and try and do some work on integrating it into tools, rather than depending on the simple tool we use now for Drupal stuff.
However, there’s still the argument in many products that RAP is also a lot of code – for Drupal, the 256-file RAP package is larger than the “core” Drupal install.
I still want a small, quick, Perl RDF parser. ;)
Good points, Christopher.
However, I think the numbers are a bit skewed by the fact that Raptor is written in C and pretty much optimized for speed; it’s simply the reference implementation that no one is going to come close to. Also, when you’ll actually be using the results of the parsing, the parsing cost is reduced to a small part of the entire process.
On the size of RAP, you can see that only four of the files are needed to do the parsing, the rest is test files, documentation and the rest of the API.
I’ll be trying RAP for a number of projects, but this post brings up a lot of good questions. For example, has anyone here migrated a RAP application to a Raptor/Redland application?
RAP (the parser used) is too hard to understand and work with.
I did, however, have much, much more luck with XML_FOAF, a PEAR package designed for creating, and to a limited extent parsing, FOAF/RDF.