I mentioned SADDLE (which used to be a part of the SPARQL Protocol draft, but is no longer) in passing the other day, when describing OWL-S Maker and talking about service description in general.
Service description in this context — and in the context of Dion Hinchcliffe’s OWL-S-less overview of SDLs — is mostly about the interface, the inputs and outputs, not what’s in between.
In contrast, SADDLE originally entered that territory with properties like saddle:vocabulary, and the other day on dev@gargonza Damian Steer announced a nice little JavaScript hack for using source content descriptions. This is not about I/O, but about what a “service” contains information about.
Central to Damian’s hack is a source content description, containing OWL statements about which classes and properties are present in the SPARQL source. For example, his description shows that all objects of foaf:name statements (in this particular store) are literals.
While the above example was handmade, I realized this was getting close to what I’ve been meaning to do for generating simpler and cleaner UIs for triplestores (asking for a foaf:Person? It’s likely you’d also want a foaf:name then…), so I figured I should try to generate such an SCD (Source Content Description) automagically, as Damian hints to himself: “Ideally this information would [be] mined from the store.”
I’ve managed to come up with a single query that returns all the information necessary to construct an SCD, but since it’s quite complex, I’ll explain the steps I took on the way there.
First, let’s review what we need:
- Classes used in the store (objects of rdf:type statements)
- Predicates that are used with instances of each class (for the owl:onProperty restriction)
- Do all instances of each class have a statement with each predicate (for the owl:minCardinality restriction)?
- Is the type of object in a statement with each class/predicate combination always the same (for the owl:allValuesFrom restriction)?
That last piece of information could be expanded to include all types, and then described with owl:unionOf, but I’ll leave that as an exercise for the reader!
Classes used in the store
The first piece of information can be had by a simple SPARQL query:
SELECT DISTINCT ?Class WHERE { ?R rdf:type ?Class }
By way of my experimental SPARQL rewriter (so experimental, that this is the only query on this page it supports in full), this can be translated into SQL for the Redland MySQL storage engine:
SELECT DISTINCT IF(T0.Object,CONCAT_WS("",T0objectR.URI,T0objectB.Name,T0objectL.Value),NULL) AS Class FROM Statements AS T0 LEFT JOIN Resources AS T0objectR ON T0.Object=T0objectR.ID LEFT JOIN Bnodes AS T0objectB ON T0.Object=T0objectB.ID LEFT JOIN Literals AS T0objectL ON T0.Object=T0objectL.ID WHERE T0.Predicate=2982895206037061277
The large integer 2982895206037061277 is the internal hash value for rdf:type. The Redland MySQL storage engine uses 64-bit hashes based on MD5 as primary keys and foreign keys, as can be observed in the Redland MySQL storage engine database schema. Note also that the table named Statements only exists if the option merge=yes is set; otherwise replace it with a suitable table name.
As is painfully clear, the rewriter tries everything to satisfy the query without the knowledge of the semantics of rdf:type, so it can be optimized without much ado:
SELECT DISTINCT URI AS Class FROM Statements JOIN Resources ON Object=ID WHERE Predicate=2982895206037061277
This optimization is really too aggressive, as a class doesn’t need to be identified with a URIref, but for this purpose it’s only really handy with “properly” identified classes.
Predicates that are used with instances of each class
We don’t want to describe rdf:type properties, since they are implicit in the SCD model, but we do want to include instances that only have statements about their type, so we need to use a query with an optional graph part:
SELECT DISTINCT ?Class, ?Property WHERE { ?R rdf:type ?Class . OPTIONAL { ?R ?Property ?Object . FILTER ?Property != rdf:type } }
At this point my rewriter gives up, so the following SQL is actually an extended and tweaked version of the previously optimized query:
SELECT DISTINCT T0objectR.URI AS Class, T1predicateR.URI AS Property FROM Statements AS T0 JOIN Resources AS T0objectR ON T0.Object=T0objectR.ID LEFT JOIN Statements AS T1 ON T0.Subject=T1.Subject LEFT JOIN Resources AS T1predicateR ON T1.Predicate=T1predicateR.ID AND T1.Predicate != 2982895206037061277 WHERE T0.Predicate=2982895206037061277
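To see why the rdf:type exclusion sits in the ON clause rather than in WHERE, here is a minimal sketch that runs the same join on a toy store. It uses Python with SQLite instead of MySQL, a cut-down schema with only the Statements and Resources tables, and small made-up integer IDs in place of the 64-bit hashes. All of those are assumptions for illustration, not the real Redland setup.

```python
import sqlite3

RDF_TYPE = 1  # hypothetical small ID standing in for the rdf:type hash


def toy_store():
    """Build an in-memory store with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Resources (ID INTEGER PRIMARY KEY, URI TEXT);
        CREATE TABLE Statements (Subject INTEGER, Predicate INTEGER, Object INTEGER);
        INSERT INTO Resources VALUES
            (1, 'rdf:type'), (2, 'foaf:Person'), (3, 'foaf:knows'),
            (4, 'ex:Document'), (10, 'ex:alice'), (11, 'ex:bob'), (12, 'ex:doc');
        INSERT INTO Statements VALUES
            (10, 1, 2),   -- alice is a foaf:Person
            (11, 1, 2),   -- bob is a foaf:Person
            (10, 3, 11),  -- alice foaf:knows bob
            (12, 1, 4);   -- doc is an ex:Document and has no other statements
    """)
    return conn


def class_property_pairs(conn):
    """Return distinct (Class, Property) pairs; Property is None for the class row."""
    sql = """
        SELECT DISTINCT T0objectR.URI AS Class, T1predicateR.URI AS Property
        FROM Statements AS T0
        JOIN Resources AS T0objectR ON T0.Object = T0objectR.ID
        LEFT JOIN Statements AS T1 ON T0.Subject = T1.Subject
        LEFT JOIN Resources AS T1predicateR
               ON T1.Predicate = T1predicateR.ID AND T1.Predicate != ?
        WHERE T0.Predicate = ?
        ORDER BY Class, Property
    """
    return conn.execute(sql, (RDF_TYPE, RDF_TYPE)).fetchall()


pairs = class_property_pairs(toy_store())
```

On this toy data every class gets a row with a NULL property, because each instance's own rdf:type statement fails the != test in the ON clause and is kept by the LEFT JOIN. That is what keeps the type-only ex:doc in the result.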
Do all instances of each class have a statement with each predicate?
With this step it gets a bit hairy, as we need to count the number of instances of each class, as well as the number of predicate occurrences for that class. These counts are at different levels, so we need to resort to grouping on a value that is NULL at times, at least with this somewhat ancient version of MySQL (4.0.16):
SELECT DISTINCT T0objectR.URI AS Class, T1predicateR.URI AS Property, COUNT(DISTINCT T0.Subject) AS Count FROM Statements AS T0 JOIN Resources AS T0objectR ON T0.Object=T0objectR.ID LEFT JOIN Statements AS T1 ON T0.Subject=T1.Subject LEFT JOIN Resources AS T1predicateR ON T1.Predicate=T1predicateR.ID AND T1.Predicate != 2982895206037061277 WHERE T0.Predicate=2982895206037061277 GROUP BY Class, Property ORDER BY Class, Property
Partial example output:
http://xmlns.com/foaf/0.1/Person | NULL | 20 |
http://xmlns.com/foaf/0.1/Person | http://www.w3.org/2000/01/rdf-schema#seeAlso | 5 |
http://xmlns.com/foaf/0.1/Person | http://xmlns.com/foaf/0.1/homepage | 6 |
http://xmlns.com/foaf/0.1/Person | http://xmlns.com/foaf/0.1/knows | 4 |
http://xmlns.com/foaf/0.1/Person | http://xmlns.com/foaf/0.1/mbox_sha1sum | 13 |
http://xmlns.com/foaf/0.1/Person | http://xmlns.com/foaf/0.1/name | 16 |
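Post-processing is needed, but as NULLs are always put first by the sort operation, a single pass through the results is enough: the row with a NULL property carries the total number of instances of the class, so the cardinality of each predicate is computable as soon as its row is reached.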
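The single pass described above can be sketched in a few lines of Python. The function name and the result layout are made up for this illustration; the sample rows are taken from the partial output shown earlier. A property gets a minimum cardinality of 1 only when its count equals the class total.

```python
FOAF = "http://xmlns.com/foaf/0.1/"


def min_cardinalities(rows):
    """Single pass over (class, property, count) rows.

    Assumes the rows are sorted by class with the NULL-property row
    (the class total) first, as the ORDER BY in the query provides.
    """
    result = {}
    current_class = None
    total = None
    for cls, prop, count in rows:
        if prop is None:
            # Class total row: remember it for the property rows that follow.
            current_class, total = cls, count
            result[cls] = {}
            continue
        if cls != current_class:
            raise ValueError("property row seen before its class total: %r" % cls)
        # count is the number of distinct instances using this property.
        result[cls][prop] = 1 if count == total else 0
    return result


sample = [
    (FOAF + "Person", None, 20),
    (FOAF + "Person", FOAF + "knows", 4),
    (FOAF + "Person", FOAF + "mbox_sha1sum", 13),
    (FOAF + "Person", FOAF + "name", 16),
]
cardinalities = min_cardinalities(sample)
```

With the sample rows, no property reaches the class total of 20, so every minimum cardinality comes out as 0.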
Is the type of object in a statement with each class/predicate combination always the same?
Determining the type of objects is hard, as we essentially need to add another level of counting. However, with a bit of procedural magic in combination with counting, it’s possible to get around that. The granularity of object types is at the class typing level. It could be extended to deal with datatyping, but that would likely make the query slower, and it wouldn’t be possible to express it in OWL anyway, as datatyping is per statement instance.
SELECT DISTINCT T0objectR.URI AS Class, T1predicateR.URI AS Property, COUNT(DISTINCT T0.Subject) AS Count, IF(T1predicateR.ID,IF(COUNT(DISTINCT T1objectL.ID),IF(!COUNT(DISTINCT T1objectR.ID) AND !COUNT(DISTINCT T1objectB.ID),'http://www.w3.org/2000/01/rdf-schema#Literal',NULL),IF(COUNT(DISTINCT T2objectR.URI)=1 AND COUNT(DISTINCT T2.Object IS NOT NULL)=1,T2objectR.URI,IF(COUNT(DISTINCT T1objectR.ID) AND !COUNT(DISTINCT T1objectB.ID),'http://www.w3.org/2000/01/rdf-schema#Resource',NULL))),NULL) AS Type FROM Statements AS T0 JOIN Resources AS T0objectR ON T0.Object=T0objectR.ID LEFT JOIN Statements AS T1 ON T0.Subject=T1.Subject LEFT JOIN Resources AS T1predicateR ON T1.Predicate=T1predicateR.ID AND T1.Predicate!=2982895206037061277 LEFT JOIN Resources AS T1objectR ON T1.Object=T1objectR.ID LEFT JOIN Bnodes AS T1objectB ON T1.Object=T1objectB.ID LEFT JOIN Literals AS T1objectL ON T1.Object=T1objectL.ID LEFT JOIN Statements AS T2 ON T1.Object=T2.Subject AND T2.Predicate=2982895206037061277 LEFT JOIN Resources AS T2objectR ON T2.Object=T2objectR.ID WHERE T0.Predicate=2982895206037061277 GROUP BY Class, Property ORDER BY Class, Property;
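The nested IF() expression that computes the Type column is easier to follow as a procedural function. The following Python sketch is my reading of that expression, not code that is used anywhere. The input format (one (kind, types) pair per object) and the function name are assumptions made for the illustration.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def object_type(objects):
    """Decide the common type for the objects of one class/property pair.

    objects: list of (kind, types) pairs, where kind is "literal",
    "resource" or "bnode", and types is the set of class URIs the
    object is typed with (empty for literals and untyped nodes).
    Returns a type URI, or None when no single type fits.
    """
    if not objects:
        return None
    kinds = {kind for kind, _ in objects}
    if "literal" in kinds:
        # Literals must not be mixed with resources or bnodes.
        return RDFS + "Literal" if kinds == {"literal"} else None
    all_types = set().union(*(types for _, types in objects))
    every_object_typed = all(types for _, types in objects)
    if len(all_types) == 1 and every_object_typed:
        # Exactly one class is used, and every object carries it.
        return next(iter(all_types))
    if "resource" in kinds and "bnode" not in kinds:
        # Only URIref objects, but no single shared class.
        return RDFS + "Resource"
    return None


person = "http://xmlns.com/foaf/0.1/Person"
knows_type = object_type([("resource", {person}), ("bnode", {person})])
```

Note the order of the tests: the single-class case is checked before the bnode case, so a property whose objects are all typed bnodes still gets that class as its type.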
Finally, we have arrived at the finished query. It’s not exactly readable, but it does the job and contains the needed information. As a very last step, we add an extra set of useful information: labels for classes, with proper language identification, if present:
SELECT DISTINCT T0objectR.URI AS Class, IF(T1predicateR.ID,NULL,T3objectL.Value) AS Label, IF(T1predicateR.ID,NULL,T3objectL.Language) AS Language, T1predicateR.URI AS Property, COUNT(DISTINCT T0.Subject) AS Count, IF(T1predicateR.ID,IF(COUNT(DISTINCT T1objectL.ID),IF(!COUNT(DISTINCT T1objectR.ID) AND !COUNT(DISTINCT T1objectB.ID),'http://www.w3.org/2000/01/rdf-schema#Literal',NULL),IF(COUNT(DISTINCT T2objectR.URI)=1 AND COUNT(DISTINCT T2.Object IS NOT NULL)=1,T2objectR.URI,IF(COUNT(DISTINCT T1objectR.ID) AND !COUNT(DISTINCT T1objectB.ID),'http://www.w3.org/2000/01/rdf-schema#Resource',NULL))),NULL) AS Type FROM Statements AS T0 JOIN Resources AS T0objectR ON T0.Object=T0objectR.ID LEFT JOIN Statements AS T1 ON T0.Subject=T1.Subject LEFT JOIN Resources AS T1predicateR ON T1.Predicate=T1predicateR.ID AND T1.Predicate!=2982895206037061277 LEFT JOIN Resources AS T1objectR ON T1.Object=T1objectR.ID LEFT JOIN Bnodes AS T1objectB ON T1.Object=T1objectB.ID LEFT JOIN Literals AS T1objectL ON T1.Object=T1objectL.ID LEFT JOIN Statements AS T2 ON T1.Object=T2.Subject AND T2.Predicate=2982895206037061277 LEFT JOIN Resources AS T2objectR ON T2.Object=T2objectR.ID LEFT JOIN Statements AS T3 ON T0.Object=T3.Subject AND T3.Predicate=3108168581889151792 LEFT JOIN Literals AS T3objectL ON T3.Object=T3objectL.ID WHERE T0.Predicate=2982895206037061277 GROUP BY Class, Label, Language, Property ORDER BY Class, Label DESC, Language DESC, Property;
Sorting by label and language is done in descending order to keep the class total as the first record in each group. It’s possible that some classes will have labels in more than one language, but that’s not a problem as long as we look out for the records with a NULL property value.
Partial example output (complete, as XML from CSV-SPARQLer, show input):
http://xmlns.com/foaf/0.1/Person | Person | NULL | NULL | 20 | NULL |
http://xmlns.com/foaf/0.1/Person | NULL | NULL | http://www.w3.org/2000/01/rdf-schema#seeAlso | 5 | http://www.w3.org/2000/01/rdf-schema#Resource |
http://xmlns.com/foaf/0.1/Person | NULL | NULL | http://xmlns.com/foaf/0.1/homepage | 6 | http://www.w3.org/2000/01/rdf-schema#Resource |
http://xmlns.com/foaf/0.1/Person | NULL | NULL | http://xmlns.com/foaf/0.1/knows | 4 | http://xmlns.com/foaf/0.1/Person |
http://xmlns.com/foaf/0.1/Person | NULL | NULL | http://xmlns.com/foaf/0.1/mbox_sha1sum | 13 | http://www.w3.org/2000/01/rdf-schema#Literal |
http://xmlns.com/foaf/0.1/Person | NULL | NULL | http://xmlns.com/foaf/0.1/name | 16 | http://www.w3.org/2000/01/rdf-schema#Literal |
http://xmlns.com/foaf/0.1/Person | NULL | NULL | http://xmlns.com/foaf/0.1/weblog | 3 | http://www.w3.org/2000/01/rdf-schema#Resource |
As is seen from the output, the object of a foaf:knows statement is always a foaf:Person, and in general the store doesn’t use different object types for the same property (there are few NULL/unbound values in the right column for predicate rows, and the ones that are NULL, are most often objects with more than one type).
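Before any conversion, the six-column result can be folded into one record per class. This is a rough Python sketch of that step; the dictionary layout and names are invented for the example, and the rows are a subset of the output above. Rows with a NULL property are class rows (possibly one per label language), and all other rows describe a property.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def collect_descriptions(rows):
    """Fold (class, label, language, property, count, type) rows per class."""
    classes = {}
    for cls, label, language, prop, count, value_type in rows:
        entry = classes.setdefault(
            cls, {"instances": None, "labels": {}, "properties": {}}
        )
        if prop is None:
            # Class row: carries the instance total and, optionally, a label.
            entry["instances"] = count
            if label is not None:
                entry["labels"][language] = label
        else:
            entry["properties"][prop] = {
                "count": count,
                "all_values_from": value_type,
            }
    return classes


rows = [
    (FOAF + "Person", "Person", None, None, 20, None),
    (FOAF + "Person", None, None, FOAF + "knows", 4, FOAF + "Person"),
    (FOAF + "Person", None, None, FOAF + "name", 16, RDFS + "Literal"),
]
descriptions = collect_descriptions(rows)
```

Keying the labels by language means a class with labels in several languages simply contributes several class rows to the same record.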
Sparqlette Source Content Description
The final step is the actual conversion into OWL as RDF/XML. This is done with a bit of XSLT, sparql-scd.xsl, that can be applied directly through CSV-SPARQLer to produce the Sparqlette Source Content Description.
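The real conversion is done by the XSLT mentioned above and produces RDF/XML. Purely to illustrate the kind of statement that comes out, here is a small Python sketch that prints one restriction as Turtle instead. The exact shape of the restriction is an assumption based on the OWL properties listed at the start of this post, and the function name is made up.

```python
PREFIXES = (
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def restriction_turtle(cls, prop, min_cardinality, value_type=None):
    """Render one class/property restriction as a Turtle snippet (assumed shape)."""
    lines = [
        "<%s> rdfs:subClassOf [" % cls,
        "    a owl:Restriction ;",
        "    owl:onProperty <%s> ;" % prop,
        '    owl:minCardinality "%d"^^xsd:nonNegativeInteger' % min_cardinality,
    ]
    if value_type is not None:
        # Only state allValuesFrom when the query found a single object type.
        lines[-1] += " ;"
        lines.append("    owl:allValuesFrom <%s>" % value_type)
    lines.append("] .")
    return "\n".join(lines)


FOAF = "http://xmlns.com/foaf/0.1/"
snippet = PREFIXES + "\n" + restriction_turtle(
    FOAF + "Person", FOAF + "knows", 0, FOAF + "Person"
)
```

Leaving out owl:allValuesFrom when the type is unknown mirrors the NULL values in the Type column.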
A few notes about the resulting description:
- Stating that the range of a property is always a resource is not strictly necessary, as everything is always also a resource. Omitting the same statement for properties that always have bnode objects is likewise not strictly correct. However, in the absence of better terms, it’s actually not wrong, and the distinction could be useful under a closed world assumption, since sometimes a URIref is needed.
- Labels are autogenerated for classes without labels, based on the local name part of the class’s URI.
- There’s a reason the actual query isn’t live: it takes about 8 seconds to run it on the 30K statements in Sparqlette, and more than 14 minutes for my main store with 900K statements, and it uses almost 1 GB of temporary disk space…
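The label autogeneration mentioned in the notes above could work roughly as follows. This is a Python sketch of one plausible rule (take whatever follows the last #, / or : in the URI), not a transcription of what the stylesheet does; the function name is hypothetical.

```python
import re


def local_name_label(uri):
    """Derive a fallback label from the local name part of a class URI."""
    # Drop a trailing separator so that namespace-like URIs still yield a name.
    trimmed = uri.rstrip("#/")
    return re.split(r"[#/:]", trimmed)[-1]


label = local_name_label("http://xmlns.com/foaf/0.1/Person")
```

Both hash and slash namespaces are handled, since the split looks for whichever separator comes last.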
To demonstrate that it works, I copied Damian’s demo to Sparqlette, and lo and behold, it works (after changing the endpoint URI, replacing ex:hasClass with dc:subject following a chat with Damian, and upgrading the parser to the new XML result format): Canned Queries by Source Content Description!
Extended usage
The next step is obviously leveraging this information when processing SPARQL queries. That would make it possible to outright reject a query looking for e.g. { ?person foaf:name ?name . ?name rdfs:label ?label } in the store, since all the objects of a foaf:name statement are literals, and to limit the number of joins, since e.g. the object of a foaf:homepage statement is always a URIref, eliminating the need to look in anything but the Resources table for the actual value.
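Both ideas can be sketched in Python against a simple lookup of object types. As a simplification, the lookup here is keyed by property only, whereas the SCD describes class/property pairs. The function names and the triple pattern format are assumptions for the example.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
FOAF = "http://xmlns.com/foaf/0.1/"


def literal_subject_conflicts(patterns, ranges):
    """Find patterns whose subject variable is bound to a literal elsewhere.

    patterns: list of (subject, predicate, object) strings, with
    variables written as "?name". ranges: property URI -> object type.
    """
    literal_vars = {
        o for _, p, o in patterns
        if o.startswith("?") and ranges.get(p) == RDFS + "Literal"
    }
    return [pattern for pattern in patterns if pattern[0] in literal_vars]


def object_tables(prop, ranges):
    """Pick the tables that must be joined to resolve the object value."""
    value_type = ranges.get(prop)
    if value_type is None:
        return ["Resources", "Bnodes", "Literals"]  # nothing known
    if value_type == RDFS + "Literal":
        return ["Literals"]
    if value_type == RDFS + "Resource":
        return ["Resources"]  # only assigned when no bnodes were seen
    return ["Resources", "Bnodes"]  # class-typed objects may be bnodes


ranges = {FOAF + "name": RDFS + "Literal", FOAF + "homepage": RDFS + "Resource"}
query = [
    ("?person", FOAF + "name", "?name"),
    ("?name", RDFS + "label", "?label"),
]
conflicts = literal_subject_conflicts(query, ranges)
```

A non-empty conflict list means the query cannot match anything in this store, since a literal can never be the subject of a statement.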
The information can easily be kept in memory and used by the storage engine when presented with a query to be rewritten. To work correctly, the statistics on disk essentially need to be updated with every change to the graph, but with a bit of timestamping and invalidation, running the equivalent of COMPUTE STATISTICS would only be necessary every now and then.
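One way the timestamping and invalidation could fit together is sketched below in Python. Nothing like this exists in the storage engine as far as I know; the class, its methods and the maximum age policy are all invented for the illustration. The statistics are recomputed only when the graph has changed and the cached copy has reached a certain age.

```python
import time


class StatisticsCache:
    """Cache expensive statistics; recompute lazily after graph changes."""

    def __init__(self, compute, max_age_seconds, clock=time.monotonic):
        self._compute = compute          # callable running the heavy query
        self._max_age = max_age_seconds  # tolerated staleness after a change
        self._clock = clock
        self._value = None
        self._computed_at = None
        self._changed = False

    def mark_changed(self):
        """Call on every add or remove; cheap, it only sets a flag."""
        self._changed = True

    def get(self):
        now = self._clock()
        never_computed = self._computed_at is None
        too_old = (
            not never_computed
            and self._changed
            and now - self._computed_at >= self._max_age
        )
        if never_computed or too_old:
            self._value = self._compute()
            self._computed_at = now
            self._changed = False
        return self._value
```

The trade-off is that the description may lag behind the graph for up to the maximum age, so rejecting queries based on it is only safe if that staleness is acceptable.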
Perhaps I should open a new issue in the Redland Issue Tracker asking for this to be implemented?
A closing note: Being able to run a query like this is one of the reasons why I like having my triples in an SQL backed storage — it’s simply much more flexible, at least until SPARQL becomes as expressive as SQL.
See related post on Source Content Descriptions (SPARQL queries were ported to Versa)