Aggregating and Archiving RSS Items

One of the better arguments for RSS 1.0 over other syndication formats is the claim that the (meta)data plugs directly into the greater Semantic Web, making it possible to move back and forth between the two and, in effect, treat them as one. Unfortunately, most aggregators don’t really aggregate; at most they present a cached version of what’s currently offered, resulting in a disconnect, as Bob DuCharme recently pointed out on rdf-interest (a discussion that eventually led to rdfdata.org).

However, archiving “items” from RSS feeds over time presents a few issues.

Not all RSS items have their own globally unique identifier
Some RSS feeds are “linkrolls” rather than lists of recently created or updated resources. A linkroll references other resources directly, sometimes making incorrect statements about e.g. the creator or time of publication (example: del.icio.us/mortenf). Reliable identification is needed in order to recognise which items are new, old or updated.
The ambiguous definition of a channel
The RSS 1.0 spec says the following about the rdf:about attribute on the channel element: “Most commonly, this is either the URL of the homepage being described or a URL where the RSS file can be found.” The right choice seems to always be the channel URI, the source of the statements, as that is what is commonly referred to by rdfs:seeAlso in e.g. blogrolls and personal FOAF files, and what is most often used as the identifier for provenance in a triple store.
The rss:items/rdf:Seq construct
Each item is associated with one or more channels through the rss:items property, which references a sequence of the “current” items. The order of the items is expressed with the RDF/XML syntactic construct rdf:li, which is expanded to rdf:_1, rdf:_2, and so on, in the RDF model. When a new item is added to a channel, it takes the first position, rdf:_1, the existing items shift towards the end of the sequence, and the last item drops out of it. In a naïve implementation, archiving a channel over time would lead to a “sequence” in which each item is referenced more than once, and the actual temporal information would be lost: it would be impossible to determine the order in which the items appeared. Note also that an even more naïve implementation (one that doesn’t recognise that the two sequences should be treated as one) wouldn’t produce an “invalid” sequence, but rather a channel with multiple rss:items properties, each with a perfectly fine sequence.
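
As a concrete illustration of the construct, here is a minimal sketch that reads the current item order out of a channel’s rss:items sequence. It uses Python and rdflib purely for illustration; the feed URL and function name are made up for the example and are not from the original post.

    # Minimal sketch: read the current item order from a channel's rss:items
    # sequence (rdf:_1, rdf:_2, ...). Uses rdflib; the feed URL is a placeholder.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    RSS = Namespace("http://purl.org/rss/1.0/")
    RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

    def current_items(graph, channel):
        """Return the channel's items in sequence order."""
        seq = graph.value(channel, RSS["items"])  # the rdf:Seq node
        items, n = [], 1
        while (item := graph.value(seq, URIRef(f"{RDF_NS}_{n}"))) is not None:
            items.append(item)
            n += 1
        return items

    g = Graph()
    g.parse("http://example.org/feed.rdf", format="xml")  # hypothetical feed
    channel = g.value(predicate=RDF.type, object=RSS["channel"])
    for item in current_items(g, channel):
        print(item)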


A general solution to the first issue, the identification of items, is not possible (although heuristics could help), but on a per-feed basis it is possible to transform a feed into a new feed that adheres to the rule of globally unique identifiers, URIs. One such solution is feed-normaliser.xsl, an XSLT that cleans up a link feed from del.icio.us and enhances it a bit with some FOAF and SKOS information (example output [original input]).
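
Purely as an illustration of the normalisation idea, a per-feed rewrite could mint a stable URI for each item from its link and date. The hashing scheme and names below are assumptions of mine, not what feed-normaliser.xsl actually does:

    # Hypothetical sketch: give every rss:item a stable URI derived from its
    # link and date, so later runs can recognise it again.
    import hashlib
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    RSS = Namespace("http://purl.org/rss/1.0/")
    DC = Namespace("http://purl.org/dc/elements/1.1/")

    def normalise_item_uris(graph, base):
        """Rename each rss:item to <base>#item-<sha1(link + date)>."""
        for item in list(graph.subjects(RDF.type, RSS["item"])):
            link = graph.value(item, RSS["link"]) or item
            date = graph.value(item, DC["date"]) or ""
            key = hashlib.sha1(f"{link} {date}".encode("utf-8")).hexdigest()
            new = URIRef(f"{base}#item-{key}")
            for s, p, o in list(graph.triples((item, None, None))):
                graph.remove((s, p, o))
                graph.add((new, p, o))
            for s, p, o in list(graph.triples((None, None, item))):  # e.g. rdf:_n references
                graph.remove((s, p, o))
                graph.add((s, p, new))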

The second issue, regarding the channel location, can be handled by simply making sure the URI is the same as the one from which the statements were retrieved, possibly ignoring the original URI or alternatively duplicating the information (this is also done by the above-mentioned XSLT).
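
In the same illustrative vein, and with hypothetical names again, rebasing the channel onto the retrieval URI could be sketched like this:

    # Sketch: make the channel URI identical to the URI the feed was fetched from.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    RSS = Namespace("http://purl.org/rss/1.0/")

    def rebase_channel(graph, feed_url):
        """Rename the rss:channel resource to feed_url."""
        old = graph.value(predicate=RDF.type, object=RSS["channel"])
        new = URIRef(feed_url)
        if old is not None and old != new:
            for s, p, o in list(graph.triples((old, None, None))):
                graph.remove((s, p, o))
                graph.add((new, p, o))
            for s, p, o in list(graph.triples((None, None, old))):
                graph.remove((s, p, o))
                graph.add((s, p, new))
        return new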

The third issue is by far the most complicated, and an implementation depends on the nature of the storage mechanism; in all cases, item identification is assumed to be a solved problem.

If the archive consists simply of a feed that is continually expanded as new items appear, an algorithm for generating the new archive could look something like this (a rough code sketch follows the list):

  1. Copy the channel information from the incoming feed recursively, thus also retrieving the current item list and items, making a note of the highest sequence number.
  2. For each item in the existing archive, in sequence order:
    1. If the item already exists in the new archive, skip it.
    2. Otherwise, include the item and add it to the channel sequence, with a sequence number one higher than the highest previously used in the new archive.
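
A rough sketch of that algorithm, assuming stable item URIs and a shared channel URI, could look like this in Python with rdflib; the post’s own implementation is the XSLT mentioned below, so treat this only as an illustration:

    # Sketch of the merge algorithm: copy the incoming feed, then append
    # archived items that are no longer in it, preserving their archived order.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    RSS = Namespace("http://purl.org/rss/1.0/")
    RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

    def seq_items(graph, seq):
        """Items of an rdf:Seq, in rdf:_1, rdf:_2, ... order."""
        items, n = [], 1
        while (item := graph.value(seq, URIRef(f"{RDF_NS}_{n}"))) is not None:
            items.append(item)
            n += 1
        return items

    def merge(archive, incoming):
        merged = Graph()
        # Step 1: copy the incoming feed wholesale -- channel, item list and items.
        for triple in incoming:
            merged.add(triple)
        channel = merged.value(predicate=RDF.type, object=RSS["channel"])
        seq = merged.value(channel, RSS["items"])
        current = seq_items(merged, seq)
        seen = set(current)
        next_pos = len(current) + 1  # one past the highest sequence number used so far
        # Step 2: append archived items missing from the incoming feed, in their
        # archived sequence order, assigning fresh sequence numbers.
        old_channel = archive.value(predicate=RDF.type, object=RSS["channel"])
        for item in seq_items(archive, archive.value(old_channel, RSS["items"])):
            if item in seen:
                continue
            merged.add((seq, URIRef(f"{RDF_NS}_{next_pos}"), item))
            next_pos += 1
            seen.add(item)
            for triple in archive.triples((item, None, None)):
                merged.add(triple)
        return merged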

In an XSLT implementation of the merging algorithm, feed-merger.xsl (example output, based on this input as the archive and this input as the newest version [original input]), keeping track of sequence numbers isn’t necessary, as the RSS 1.0 RDF/XML serialisation uses rdf:li, which numbers the sequence members implicitly.

If the archive consists of a triple store, one possible solution would be to extract and remove all the statements obtained from the feed, update the extracted feed as outlined above, and reassert it in the triple store. While simple, this approach is not very efficient; the design and implementation of a better algorithm is left as an exercise for the reader (or a later post on the matter)…
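
For illustration only, the simple extract/update/reassert approach could be sketched against a store that keeps one named graph per feed URI (rdflib’s Dataset is used here as a stand-in for the triple store, and merge() refers to the sketch above):

    # Sketch of the simple (inefficient) triple-store variant: extract the feed's
    # statements, remove them, merge with the incoming feed and reassert.
    from rdflib import Dataset, Graph, URIRef

    def refresh_feed(store, feed_url, incoming):
        ctx = store.graph(URIRef(feed_url))  # named graph holding the archived feed
        archived = Graph()
        for triple in ctx:                   # extract ...
            archived.add(triple)
        store.remove_graph(ctx)              # ... and remove the old statements
        merged = merge(archived, incoming)   # update as outlined above
        new_ctx = store.graph(URIRef(feed_url))
        for triple in merged:                # reassert
            new_ctx.add(triple)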

4 thoughts on “Aggregating and Archiving RSS Items”

  1. “Not all RSS items have their own globally unique identifier” – well they darn well should have!

    “The ambiguous definition of a channel…The right choice seems to always be the channel URI, the source of the statements…” – absolutely.

    “The rss:items/rdf:Seq construct” – yechh! The algorithm you describe sounds good for archiving. I’ve been bothered by the construct recently from a different angle, in the context of the feed diff technique (RFC3229 adapted for Atom), where it would help for the items to be as standalone as possible. I was wondering if per-item (in the item) rdfs:member properties might be more useful, alongside xxx:previous to retain the “natural” publication order.

  2. > …URL of the homepage being described or a URL where the RSS file can be found

    I wonder why the HTML representation and the RSS feed don’t share the same URI more often; after all, what’s content negotiation for?

  3. Danny,

    I’m not sure about the “member” status of items; I’ll have to think about that, and about the “previous” part as well.
    However, I remember, in the days of the Great RSS War two years ago, thinking that using Dublin Core properties, e.g. partOf, could make RSS completely unnecessary…

    Reto,

    You’re right, that could be done more often, but there’s still the issue of fragment identifier meaning. Also, the content type for RDF/XML was only recently officially defined in an RFC.
