One of the better arguments for RSS 1.0 over other syndication formats is the claim that its (meta)data plugs directly into the greater Semantic Web, making it possible to move back and forth between the two and, in effect, making them one. Unfortunately, most aggregators don’t really aggregate; at most they present a cached version of what’s currently offered. The result is a disconnect, as Bob DuCharme recently pointed out on rdf-interest (eventually leading to rdfdata.org).
However, archiving “items” from RSS feeds over time presents a few issues.
- Not all RSS items have their own globally unique identifier
- Some RSS feeds are “linkrolls” more than a list of recently created or updated resources. A linkroll references other resources directly, sometimes making incorrect statements about e.g. the creator or time of publication (example: del.icio.us/mortenf). Reliable identification is needed to be able to recognise items that are new, old or updated.
- The ambiguous definition of a channel
- The RSS 1.0 spec says the following about the rdf:about attribute on the channel element: “Most commonly, this is either the URL of the homepage being described or a URL where the RSS file can be found”. The right choice always seems to be the channel URI, the source of the statements, as that is what is commonly referred to by rdfs:seeAlso in e.g. blogrolls and personal FOAF files, and most often used as the identifier for provenance in a triple store.
- The rss:items/rdf:Seq construct
- Each item is associated with one or more channels through the rss:items property, referencing a sequence of the “current” items. The sequence of items is determined through the use of the RDF/XML syntactic construct rdf:li, which is expanded to rdf:_1, rdf:_2, and so on, in the RDF model. When a new item is added to a channel, it is added at the first position, rdf:_1, the existing items shift towards the end of the sequence, and the last item disappears from the sequence. In a naïve implementation, archiving a channel over time would lead to a “sequence” with each item being referenced more than once, and loss of actual temporal information: it’d be impossible to determine the actual order in which the items appeared. Note also that in an even more naïve implementation (one that doesn’t recognise that the two sequences should be seen as one), the result wouldn’t be an “invalid” sequence, but instead a channel with multiple rss:items properties, each with a perfectly fine sequence.
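
One way around the sequence problem could be to stop storing the sequence itself and instead record when each item was first seen. The following is only a minimal sketch of that idea, not a definitive implementation. It assumes that every item has a stable URI (which, as the first issue notes, is not always true) and that each fetched channel lists its items newest first. The function names and the example.org URIs are hypothetical.

```python
from datetime import datetime, timezone


def merge_snapshot(archive, snapshot, fetched_at):
    """Merge one fetched item sequence (newest first) into the archive.

    `archive` maps item URI -> time the item was first seen and is
    updated in place. Returns the URIs that were new, oldest first.
    """
    # Walk the snapshot from its oldest entry so that new items are
    # inserted into the archive in the order they appeared.
    new_items = [uri for uri in reversed(snapshot) if uri not in archive]
    for uri in new_items:
        archive[uri] = fetched_at
    return new_items


def archived_order(archive):
    """Return all archived item URIs, oldest first.

    Sorting is stable, so items first seen in the same fetch keep
    their insertion order, which is already oldest first.
    """
    return sorted(archive, key=archive.get)


# Two successive fetches of the same (hypothetical) channel.
archive = {}
first_fetch = datetime(2005, 1, 1, tzinfo=timezone.utc)
second_fetch = datetime(2005, 1, 2, tzinfo=timezone.utc)

merge_snapshot(
    archive,
    ["http://example.org/c", "http://example.org/b", "http://example.org/a"],
    first_fetch,
)
merge_snapshot(
    archive,
    ["http://example.org/d", "http://example.org/c", "http://example.org/b"],
    second_fetch,
)

for uri in archived_order(archive):
    print(archive[uri].isoformat(), uri)
```

Each item ends up in the archive exactly once, however many fetches it appeared in, and the first-seen time stands in for the temporal information that the shifting rdf:_n positions cannot carry. The order within a single fetch is still only as good as the order the channel itself reported.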