Thursday, March 16, 2006

Feed caching using EHCache, Spring and ROME FeedFetcher

I have been re-examining my feed aggregator application (I wrote it over a year ago and use it most days). My intention is to repurpose it as a feed aggregator portlet, which I think would add some smarter behaviour to the typical RSS portlet. Proxying the feed acquisition also means I control the feed output format, so I could use the On-Demand JavaScript approach to output feeds in JSON or XML format, and this would enable dynamic AJAX-style feed updates.

I'm going to refactor parts of the original application completely, but I'm still very happy with the feed fetching mechanism itself.

The feed fetching process is quite simple and very powerful. I got some help producing it from Nick Lothian, the author of the ROME FeedFetcher. I thought it might be useful to reproduce the details of the mechanism here. I've also recently added some Spring Framework configuration to simplify the mechanism further (and because I wanted an excuse to play with the Spring Framework EHCache support).

Fundamentally the application uses ROME, ROME FeedFetcher and EHCache. I've spoken before about the need to use the conditional get mechanism, and ROME FeedFetcher supports feed retrieval using it. Most dynamically generated feeds, however, do not implement conditional get. In order to cache the feeds locally I added an EHCache-enabled storage mechanism to the standard ROME FeedFetcher. An obvious benefit of caching feeds locally is speed, but it also facilitates "merge processing" like Microsoft's RSS Platform does in Vista.

My feed fetcher re-implements two of the classes from the ROME FeedFetcher to make use of EHCache. It makes use of three separate caches.

  • URL_CACHE: A feed URL cache containing only the URL in String format. This expires after X minutes.
  • FEED_CACHE: A feeds cache containing the actual feed content (stored in ROME's SyndFeed class). This would never expire.
  • FEEDINFO_CACHE: A feed info cache which contains metadata about the feeds (stored in ROME FeedFetcher's SyndFeedInfo class). This includes the last modified date that is used to support the conditional get mechanism. This cache would never expire.
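
The only real difference between the three caches is their expiry policy. Here is a minimal in-memory sketch of that idea. It is not the actual EHCache setup: TtlCache and FeedCaches are hypothetical stand-ins I made up for illustration, the 30 minute value for X is an assumption, and plain String/Long values stand in for SyndFeed and SyndFeedInfo. In EHCache the same policies would be expressed with each cache's time-to-live and eternal settings.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for one cache: entries expire after ttlMillis, or never if ttlMillis <= 0. */
class TtlCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long storedAt;
        Entry(T value, long storedAt) {
            this.value = value;
            this.storedAt = storedAt;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so expiry can be exercised without sleeping

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        entries.put(key, new Entry<>(value, clock.getAsLong()));
    }

    /** Returns the cached value, or null if it is absent or has expired. */
    V get(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        if (ttlMillis > 0 && clock.getAsLong() - entry.storedAt >= ttlMillis) {
            entries.remove(key);
            return null;
        }
        return entry.value;
    }
}

/** The three caches from the list above, with assumed value types. */
class FeedCaches {
    static final long X_MINUTES = 30; // assumption: the post leaves X unspecified

    final TtlCache<String, String> urlCache;    // URL_CACHE: expires after X minutes
    final TtlCache<String, String> feedCache;   // FEED_CACHE: never expires (would hold SyndFeed)
    final TtlCache<String, Long> feedInfoCache; // FEEDINFO_CACHE: never expires (would hold SyndFeedInfo)

    FeedCaches(LongSupplier clock) {
        urlCache = new TtlCache<>(X_MINUTES * 60_000L, clock);
        feedCache = new TtlCache<>(0, clock);
        feedInfoCache = new TtlCache<>(0, clock);
    }
}
```

Passing System::currentTimeMillis as the clock gives wall-clock behaviour. Once the URL entry has expired, the feed content and feed info are still available, which is what makes the conditional get worthwhile.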

Every time I retrieve an updated version of a feed, all three of the above caches need to be updated on disk.

All feeds are cached for X minutes and when this has elapsed a conditional get is performed. Therefore a feed not supporting the conditional get mechanism (e.g. most dynamic feeds) would be automatically downloaded every X minutes whereas a feed supporting the conditional get mechanism would be polled every X minutes and only downloaded when it had actually changed.
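
For anyone unfamiliar with conditional get, here is roughly what the poll looks like using nothing but the JDK. ROME FeedFetcher does this for you, so this is only an illustration and not its implementation; the class and method names are made up.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

class ConditionalGetSketch {
    /**
     * Polls feedUrl with an If-Modified-Since header.
     * Returns the feed bytes, or null when the server answers 304 (not modified).
     */
    static byte[] fetchIfModified(String feedUrl, long lastModifiedMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(feedUrl).toURL().openConnection();
        try {
            // A value of 0 sends no header at all, i.e. an unconditional download.
            conn.setIfModifiedSince(lastModifiedMillis);
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return null; // unchanged: keep using the locally cached copy
            }
            try (InputStream in = conn.getInputStream()) {
                return in.readAllBytes();
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

A server that ignores the header simply answers 200 with the full body every time, which is why such feeds end up being downloaded in full every X minutes.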

The actual retrieveFeed() mechanism works like this:

if (feed url is cached in URL_CACHE)
    fetch existing feed from FEED_CACHE
else
    get connection response code (conditional get)
    if (304 - not modified)
        fetch from FEED_CACHE
        re-cache url in URL_CACHE
    else
        fetch feed afresh
        cache url in URL_CACHE
        cache feed in FEED_CACHE
        cache feed info in FEEDINFO_CACHE
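
The retrieveFeed() steps above can be sketched in plain Java as follows. This is a simplified illustration under stated assumptions and not the code from the distribution: ordinary maps stand in for the three EHCache caches, a String stands in for SyndFeed, a last-modified timestamp stands in for SyndFeedInfo, and the Transport interface is a hypothetical placeholder for the HTTP work that ROME FeedFetcher actually does.

```java
import java.util.HashMap;
import java.util.Map;

class RetrieveFeedSketch {
    /** Outcome of a conditional get; a 304 response carries no body. */
    static final class Response {
        final int status;
        final String body;
        final long lastModified;
        Response(int status, String body, long lastModified) {
            this.status = status;
            this.body = body;
            this.lastModified = lastModified;
        }
    }

    /** Hypothetical transport; 0 for ifModifiedSince means an unconditional fetch. */
    interface Transport {
        Response conditionalGet(String url, long ifModifiedSince);
    }

    private final Map<String, Long> urlCache = new HashMap<>();      // URL_CACHE: url -> expiry time
    private final Map<String, String> feedCache = new HashMap<>();   // FEED_CACHE: never expires
    private final Map<String, Long> feedInfoCache = new HashMap<>(); // FEEDINFO_CACHE: last modified
    private final Transport transport;
    private final long ttlMillis; // the "X minutes" from the text, in milliseconds

    RetrieveFeedSketch(Transport transport, long ttlMillis) {
        this.transport = transport;
        this.ttlMillis = ttlMillis;
    }

    String retrieveFeed(String url, long now) {
        Long expiresAt = urlCache.get(url);
        if (expiresAt != null && now < expiresAt) {
            return feedCache.get(url); // url still cached: no network traffic at all
        }
        long since = feedInfoCache.getOrDefault(url, 0L);
        Response response = transport.conditionalGet(url, since);
        urlCache.put(url, now + ttlMillis); // (re-)cache the url in both remaining branches
        if (response.status == 304) {
            return feedCache.get(url); // not modified: reuse the stored feed
        }
        feedCache.put(url, response.body); // fetched afresh: refresh feed and feed info
        feedInfoCache.put(url, response.lastModified);
        return response.body;
    }
}
```

The current time is passed in as a parameter purely to keep the sketch easy to exercise; a real implementation would leave expiry to the cache itself.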

Now I'm not saying this simple mechanism alone is as full-featured as Microsoft's RSS Platform (that is for you to say! ;)). In my original feed aggregator I also had support for processing XBEL and OPML format feed collections. Using Quartz I was able to implement cron-job-style backend fetching of feeds. Apart from the support for downloading enclosures, this pretty much mirrors the important facilities of the Microsoft RSS Platform (and is arguably much simpler to understand)!

I have created a zip file distribution (~2.5MB) containing the necessary source code, libraries and an example of the above mechanism. I hope you find it useful. It requires Jakarta Ant and Java (of course!).

Incidentally, there is a feed persistence project using ROME technologies called ROME Aqueduct. I originally wrote my persistence implementation before Aqueduct came into existence.