Friday, March 31, 2006

Arrrghhh, SourceForge developer CVS service is offline!

After messing about with PuTTY and WinCVS (etc.) for far too long, whilst chanting the mantra "it must be my fault" to myself, I discovered that the SourceForge developer CVS is actually offline today. Grrr...

I'll keep a closer eye on the SourceForge status page in future!

Tuesday, March 21, 2006

A Feed Aggregator web application

Regular visitors to my blog (as if!) will have noted that I have a rather unhealthy fixation with my newsfeed aggregator application. I wrote it about a year ago and despite the passage of time (a year is a long time in Java development) I am still very proud of it.

A scribbly application architecture diagram (UML, I wish I understood you).

It was originally modelled after the JavaRSS approach. At the time I had also written a 3 pane view newsfeed aggregator application which is currently part of our University portal. I personally prefer this one page view. I have tried to tweak it to work with a larger selection of newsfeeds, GoFetch, but the expand/collapse paradigm doesn't quite work for me on this occasion somehow.

Anyhoo, I have been revisiting some of the code recently and have tidied it up considerably. I have always been happy to share the code with those that asked but it came with a big disclaimer that it was somewhat messy. Having sorted it out, I now think I have something approaching professional. I may even set up a SourceForge project for it although that arena is somewhat overcrowded already. An application shortly to join this throng is Microsoft's RSS Platform.

My feed aggregator application does have some very nice features which I will describe shortly. The simplicity of the application has come at the cost of being reliant on numerous third party libraries, namely Spring, EHCache, EHCache Constructs, Quartz, ROME, ROME Fetcher, XMLWriter, JSTL, Xalan, Xerces and the usual suspects from the Jakarta Commons.

The application is now very much simpler than before and at its heart it only makes use of six Java classes.

XBEL based feed collection management

The basis of my feed collection management is the XBEL file format (I could have easily used OPML also). The collection of feeds is stored in XBEL and I create what I call a "populated" XBEL output. This populated XBEL includes the actual feed items that I want to display. For example:

XBEL


<?xml version="1.0"?>
<xbel>
<folder>
<title>Blogs</title>
<bookmark title="Mark McLaren's Weblog" href="http://blog.mark-mclaren.info"/>
</folder>
</xbel>

becomes

XBEL populated


<?xml version="1.0"?>
<xbel>
<folder>
<title>Blogs</title>
<folder>
<bookmark title="Mark McLaren's Weblog" href="http://blog.mark-mclaren.info"/>
<folder>
<bookmark title="Feed caching using EHCache, Spring and ROME FeedFetcher" href="http://blog.mark-mclaren.info/2006/03/feed-caching-using-ehcache-spring-and.html" info="1142964250859"/>
<bookmark title="The RSS Platform concept" href="http://blog.mark-mclaren.info/2006/03/rss-platform-concept.html" info="1142964250859"/>
<bookmark title="Keith Donald answers my, rather stupid, questions on Spring Web Flow" href="http://blog.mark-mclaren.info/2006/03/keith-donald-answers-my-rather-stupid.html" info="1142964250859"/>
<bookmark title="Yet another Google Suggest Clone - Take 2" href="http://blog.mark-mclaren.info/2006/02/yet-another-google-suggest-clone-take-2.html" info="1142964250859"/>
<bookmark title="How did I miss Type-Ahead behaviour?" href="http://blog.mark-mclaren.info/2006/02/how-did-i-miss-type-ahead-behaviour.html" info="1142964250859"/>
</folder>
</folder>
</folder>
</xbel>

One source of complexity in my original application was that I used Castor to support the XBEL format (I had this code lying around as I was using this elsewhere). I decided that I didn't actually need Castor or an object representation of XBEL in this case. I now use DOM/SAX directly to traverse and extract data from XBEL files and I make use of XMLWriter to create populated XBEL files.
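To give a flavour of the DOM approach, here is a minimal sketch that pulls the feed URLs out of an XBEL document using only the parser that ships with Java. The class and method names (XbelReader, bookmarkUrls) are made up for illustration and are not the ones in my application.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: lists the feed URLs held in an XBEL document. */
class XbelReader {

    /** Returns the href of every bookmark element, in document order. */
    static List<String> bookmarkUrls(String xbel) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xbel)));
            // getElementsByTagName searches the whole tree, so nested folders are covered
            NodeList bookmarks = doc.getElementsByTagName("bookmark");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < bookmarks.getLength(); i++) {
                Element bookmark = (Element) bookmarks.item(i);
                urls.add(bookmark.getAttribute("href"));
            }
            return urls;
        } catch (Exception e) {
            // Parser configuration, SAX and I/O problems all end up here
            throw new IllegalStateException("Could not read XBEL", e);
        }
    }
}
```

Fed the small XBEL example above, this would return a list holding the single weblog URL.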

EHCache and ROME Fetcher powered feed fetch mechanism

I talked about this in my last entry. The feed fetching engine is ROME and ROME Fetcher. I have created an instance of ROME Fetcher that makes use of EHCache. In reviewing the code, I noticed that although technically I had three caches, I only actually needed two EHCaches (since SyndFeedInfo includes the SyndFeed object). I also noticed that I had been caching the entire SyndFeed instead of just the URL string. So fixing these issues I have some performance improvements right there! Moving to a Spring Framework powered EHCache implementation has also nicely reduced the complexity of the code involved.

Quartz based scheduling

I wrote a ServletListener that begins polling the feeds in the background when you start the application up. I did look into replacing this with a Spring powered implementation but on this occasion there weren't any great advantages to be had in doing so. It would mean replacing two classes with two alternative classes (and this would add an unnecessary additional dependency and external configuration file maintenance).
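As a rough illustration of the background polling idea, here is a sketch that uses the JDK's ScheduledExecutorService as a stand-in for Quartz, so it is not how my application does it. FeedPoller and its methods are hypothetical names; in a web application start() and stop() would be called from the listener's startup and shutdown callbacks.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical stand-in for the listener: polls every feed on a fixed schedule. */
class FeedPoller {
    private final List<String> feedUrls;
    private final Consumer<String> fetcher; // e.g. a call into the feed fetcher
    private ScheduledExecutorService scheduler;

    FeedPoller(List<String> feedUrls, Consumer<String> fetcher) {
        this.feedUrls = feedUrls;
        this.fetcher = fetcher;
    }

    /** One polling pass; a failing feed must not stop the others. */
    void pollOnce() {
        for (String url : feedUrls) {
            try {
                fetcher.accept(url);
            } catch (RuntimeException e) {
                System.err.println("Failed to fetch " + url + ": " + e);
            }
        }
    }

    /** Starts polling immediately, then again after each period has elapsed. */
    synchronized void start(long period, TimeUnit unit) {
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "feed-poller");
            t.setDaemon(true); // do not keep the JVM alive just for polling
            return t;
        });
        scheduler.scheduleWithFixedDelay(this::pollOnce, 0, period, unit);
    }

    /** Stops polling; safe to call even if start() was never called. */
    synchronized void stop() {
        if (scheduler != null) {
            scheduler.shutdownNow();
        }
    }
}
```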

EHCache and EHConstruct based dynamic page caching

My feed aggregator view is cached via a page caching filter. This means it makes use of the conditional get mechanism itself. When this page is accessed it is cached in the browser for a period of time which reduces the load on the backend processing.

The final rendering and JavaScript enabled functionality

The final rendering is achieved via an XSL transformation of the populated XBEL (I use JSTL to do this but I could have easily used a servlet). HTML DOM processing is performed by JavaScript which, with the aid of cookies, highlights new feeds to the user. I had to bend the XHTML standards slightly to achieve this to support a custom attribute.
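For the curious, the servlet-style alternative might look something like this sketch, which runs an XSL transformation over the populated XBEL using the standard javax.xml.transform API. The stylesheet here is a cut-down invention for illustration (my real one does rather more), and XbelRenderer is a hypothetical name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical renderer: turns (populated) XBEL into headings and link lists. */
class XbelRenderer {

    // Illustrative stylesheet only: folder titles become headings, bookmarks become links
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/xbel'><div><xsl:apply-templates select='folder'/></div></xsl:template>"
      + "<xsl:template match='folder'>"
      + "<xsl:if test='title'><h2><xsl:value-of select='title'/></h2></xsl:if>"
      + "<xsl:if test='bookmark'><ul><xsl:apply-templates select='bookmark'/></ul></xsl:if>"
      + "<xsl:apply-templates select='folder'/>"
      + "</xsl:template>"
      + "<xsl:template match='bookmark'>"
      + "<li><a href='{@href}'><xsl:value-of select='@title'/></a></li>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XBEL and returns the resulting HTML fragment. */
    static String render(String xbel) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xbel)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }
}
```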

I'm happy for people to download and use my code. I'd be very interested to hear what you are using it for and any modifications you make to it.

I'm sure I could improve it still further by removing more of the hard-coded variable references, but it does the job.

I now plan to take it in a JSON powered portlet direction...

Thursday, March 16, 2006

Feed caching using EHCache, Spring and ROME FeedFetcher

I have been re-examining my feed aggregator application (I wrote it over a year ago and use it most days). My intention is to repurpose it as a feed aggregator portlet. I think this would add some smarter behaviour to the typical RSS portlet. Proxying the feed acquisition also means I control the feed output format. Therefore I could use the On-Demand JavaScript approach to output feeds in JSON or XML format and this would enable dynamic AJAX style feed updates.

I'm going to re-factor parts of the original application completely but the feed fetching mechanism itself I'm still very happy with.

The feed fetching process is quite simple and very powerful. I got some help producing it from Nick Lothian who is the author of the ROME FeedFetcher. I thought it might be useful to reproduce the details of the mechanism here. I've also recently added some Spring Framework configuration to the mechanism in order to further simplify it (and because I wanted an excuse to play with the Spring Framework EHCache support).

Fundamentally the application uses ROME, ROME FeedFetcher and EHCache. I've spoken before about the need to use the conditional get mechanism, and ROME FeedFetcher supports feed retrieval using it. Most dynamically generated feeds, however, do not implement the conditional get mechanism. In order to locally cache the feeds I added an EHCache enabled storage mechanism to the standard ROME FeedFetcher. An obvious benefit of caching feeds locally is speed but it also facilitates "merge processing" like Microsoft's RSS Platform does in Vista.

My feed fetcher re-implements two of the classes from the ROME FeedFetcher to make use of EHCache. It makes use of three separate caches.

  • URL_CACHE: A feed url cache containing only the URL in String format. This expires after X minutes.
  • FEED_CACHE: A feeds cache containing the actual feed content (stored in ROME's SyndFeed class). This would never expire.
  • FEEDINFO_CACHE: A feed info cache which contains meta data about the feeds (stored in ROME FeedFetcher's SyndFeedInfo class). This includes the last modified date that is used to support the conditional get mechanism. This cache would never expire.

Every time I retrieve an updated version of a feed I would need to modify the above three caches on disk.

All feeds are cached for X minutes and when this has elapsed a conditional get is performed. Therefore a feed not supporting the conditional get mechanism (e.g. most dynamic feeds) would be automatically downloaded every X minutes whereas a feed supporting the conditional get mechanism would be polled every X minutes and only downloaded when it had actually changed.

The actual retrieveFeed() mechanism works like this:


if (feed url is cached in URL_CACHE)
{
    fetch existing feed from FEED_CACHE
}
else
{
    get connection response code
    if (304 - not modified)
    {
        fetch from FEED_CACHE
        re-cache url in URL_CACHE
    }
    else
    {
        fetch feed afresh
        cache url in URL_CACHE
        cache feed in FEED_CACHE
        cache feed info in FEEDINFO_CACHE
    }
}
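To make the pseudocode a bit more concrete, here is a minimal Java sketch of the same decision logic. It is not the code in the zip file: plain maps stand in for the three EHCaches, the feed is held as a String rather than a SyndFeed, and the Http interface and Response class are hypothetical stand-ins for the real HTTP handling in ROME FeedFetcher.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the retrieveFeed() logic with plain maps standing in for the three EHCaches. */
class CachingFeedFetcher {

    /** Hypothetical HTTP result: status code, feed body and Last-Modified timestamp. */
    static final class Response {
        final int status;
        final String body;
        final long lastModified;

        private Response(int status, String body, long lastModified) {
            this.status = status;
            this.body = body;
            this.lastModified = lastModified;
        }

        static Response ok(String body, long lastModified) {
            return new Response(200, body, lastModified);
        }

        static Response notModified() {
            return new Response(304, null, 0L);
        }
    }

    /** Hypothetical transport; a real one would send If-Modified-Since when the value is above zero. */
    interface Http {
        Response get(String url, long ifModifiedSince);
    }

    private final Map<String, Long> urlCache = new HashMap<>();      // URL_CACHE: url -> expiry time (ms)
    private final Map<String, String> feedCache = new HashMap<>();   // FEED_CACHE: url -> feed content
    private final Map<String, Long> feedInfoCache = new HashMap<>(); // FEEDINFO_CACHE: url -> last modified
    private final Http http;
    private final long ttlMillis;

    CachingFeedFetcher(Http http, long ttlMillis) {
        this.http = http;
        this.ttlMillis = ttlMillis;
    }

    /** Returns the feed for the url, touching the network only when the URL entry has expired. */
    String retrieveFeed(String url, long now) {
        Long expiry = urlCache.get(url);
        if (expiry != null && now < expiry) {
            return feedCache.get(url); // still fresh: no network traffic at all
        }
        Long since = feedInfoCache.get(url);
        Response response = http.get(url, since == null ? 0L : since);
        if (response.status == 304) {
            urlCache.put(url, now + ttlMillis); // not modified: just re-cache the url
            return feedCache.get(url);
        }
        // Changed (or no conditional get support): refresh all three caches
        feedCache.put(url, response.body);
        feedInfoCache.put(url, response.lastModified);
        urlCache.put(url, now + ttlMillis);
        return response.body;
    }
}
```

Passing the current time in as a parameter is purely a convenience for trying the logic out; the real caches handle expiry themselves.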

Now I'm not saying this simple mechanism alone is as full featured as Microsoft's RSS Platform (that is for you to say! ;)). In my original feed aggregator I also had support for processing XBEL and OPML format feed collections. Using Quartz I was able to implement cron-job style backend fetching of feeds. Apart from the support for downloading enclosures this pretty much mirrors the important facilities of the Microsoft RSS Platform (and is arguably much simpler to understand)!!!

I have created a zip file distribution (~2.5MB) containing the necessary source code, libraries and an example of the above mechanism. I hope you find it useful. It requires Jakarta Ant and Java (of course!).

Incidentally, there is a feed persistence project using ROME technologies called ROME Aqueduct. I originally wrote my persistence implementation before Aqueduct came into existence.

Wednesday, March 15, 2006

The RSS Platform concept

In some ways it is good to see that Microsoft are finally catching up on the feed aggregator idea (especially since Python and Java chaps have been banging on about this kind of thing for years). There has been a flurry of blog activity lately in response to the IE7 beta distribution and Microsoft's RSS Platform activities.

I've also seen various snapshots of the presentation given by Amar Gandhi at PDC05 in quite a few places around the web now. Amar's presentation was about Windows Vista: Building RSS Enabled Applications [PPT] (there is also a ~130Mb version including video available here).

I have long subscribed to the idea that the way to approach syndicated feed collection is via a single collection and storage mechanism. In the past I have even created my own modest feed aggregator application (inspired by JavaRSS.com). This seems a remarkably similar idea to the Microsoft common feedlist concept. I eventually chose to use ROME for my multi-format aware feed collection, whereas Microsoft are implementing their own feed collection API.

I thought it was interesting that in Amar's presentation there was reference to the difference between time-based feeds and lists (e.g. News vs Top 10, Wish lists, Playlists, Bestsellers etc). It is good that it has been recognised that the way time-based feeds and list feeds are handled and merged for output purposes will need to differ.

When I first started looking at my modest feed aggregator my initial thought was to use an XML database as the storage mechanism. I played around with Apache Xindice for a bit but it was just a little too awkward for me to use at that time and so I chose to use EHCache backed storage instead. I hope Microsoft have opted for an XML database backed approach, as I think it would be a mistake to concentrate too much on the RSS formats in current circulation. It would be nice to hand your XML storage mechanism a DTD or XML Schema and ask it to collect feeds that validate to those specifications. I really hope that we don't see the proliferation of too many future proprietary Microsoft XML formats (especially since MS Office will kick out XML).

Of course, there has been talk of embedding databases into Firefox, so why not an XML database? It could be a real shame that Oracle recently bought Sleepycat as I think the Berkeley DB XML would probably be an avenue worth exploring for Firefox integration.

It is a shame that the likes of Google, Sun, Mozilla, Oracle and Sleepycat won't have time to collectively come up with a world beating Firefox based RSS Platform of their own. Microsoft's IE7 looks like it might become the world's most installed feed aggregator software by default. Maybe Google could knock up a version of Google Desktop that works as a feed aggregator with database backing and search facility.

Then again, if Microsoft's feed aggregator implementation is sufficiently flawed in IE7 then maybe there is still an opportunity for somebody! Firefox hybrids already exist (e.g. Flock) and maybe feed aggregation will become the mechanism around which future Firefox hybrids are developed.

Thursday, March 02, 2006

Keith Donald answers my, rather stupid, questions on Spring Web Flow

Shortly after writing a blog entry that talked about various webflow options in web applications I received an e-mail from Keith Donald. I hope Keith doesn't mind my reproducing the content of his e-mails here as I'm sure they would be useful to the world and especially to fellow stupid developers! Keith said:

I noticed a blog entry of yours where you noted Spring Web Flow as "promising" but it didn't [look] like it could easily do some things you needed. If you could let me know what you would like to accomplish with the framework, perhaps I can provide some insight there.

After telling everybody I knew that THE Keith Donald had e-mailed me, and receiving mixed responses from my colleagues, I tried to string together an intelligent response to his request based on the immediate questions I had been facing.

  1. How to use Spring Web Flow with Struts
  2. At stages in my "wizard" web application form selections are made which govern the path that the rest of the wizard is going to take. The examples of Spring Web Flow I have seen have always been couched in terms of success or error outcomes (rather than optionA, optionB or error).
  3. Similarly I didn't see an example of how best to include a page that references itself (but not as an error). For example, let's say I had a form in my wizard app that was the basis of an RSS feed and had submit buttons that would result in adding/deleting news items to the page (okay an odd example but you get the idea). I actually have quite a few occurrences of this kind of behaviour in my applications which I currently handle using the Struts nested taglib and the LookupDispatchAction.

These are Keith's responses to my questions:

Response to Question 1

The birthdate sample app shows Spring Web Flow + Struts integration. It's in PR5 and will also be in the upcoming 1.0 rc1.

Response to Question 2

Spring Web Flow can route differently on the type of event that occurs:

For example:


<view-state id="displayForm" view="form">
<transition on="submit" to="determinePath">
<!-- bind and validate input on the form bean on submit -->
<action bean="formAction" method="bindAndValidate"/>
</transition>
</view-state>

<!-- after binding, determine path -->
<decision-state id="determinePath">
<if test="${flowScope.formBean.option == 'optionA'}" then="optionAPath"/>
<if test="${flowScope.formBean.option == 'optionB'}" then="optionBPath"/>
<if test="${flowScope.formBean.option == 'error'}" then="error"/>
</decision-state>

As an alternative to the decision-state above, you could also use an action state there, if you don't like putting expressions in XML:


<action-state id="determinePathAlternate">
<action bean="formAction" method="determinePath"/>
<transition on="optionA" to="optionAPath"/>
<transition on="optionB" to="optionBPath"/>
<transition on="error" to="error"/>
</action-state>
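(An aside from me rather than Keith: the determinePath method on the formAction bean is where the choice gets made. The following is a library-free sketch of the idea only. A real Spring Web Flow action would work with the framework's own request context and event types rather than plain strings, and WizardRouting is a name I have made up.)

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Library-free sketch of the action-state routing; not a real Spring Web Flow action. */
class WizardRouting {

    // Event id -> target state id, mirroring the transition elements above
    static final Map<String, String> TRANSITIONS = new LinkedHashMap<>();
    static {
        TRANSITIONS.put("optionA", "optionAPath");
        TRANSITIONS.put("optionB", "optionBPath");
        TRANSITIONS.put("error", "error");
    }

    /** Decides which event to signal from the option chosen on the form. */
    static String determinePath(String option) {
        if ("optionA".equals(option)) {
            return "optionA";
        }
        if ("optionB".equals(option)) {
            return "optionB";
        }
        return "error"; // anything unexpected, including null, is treated as an error
    }

    /** Looks up the state the flow would move to for the given option. */
    static String nextState(String option) {
        return TRANSITIONS.get(determinePath(option));
    }
}
```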

Response to Question 3

Going back to the same view-state (to re-render the same page) is straightforward:


<view-state id="displayItems" view="newsItems">
<transition on="add" to="addItem"/>
<transition on="remove" to="removeItem"/>
</view-state>

<action-state id="addItem">
<action bean="formAction" method="addItem"/>
<transition on="success" to="displayItems"/>
</action-state>

<action-state id="removeItem">
<action bean="formAction" method="removeItem"/>
<transition on="success" to="displayItems"/>
</action-state>