Tuesday, March 21, 2006

A Feed Aggregator web application

Regular visitors to my blog (as if!) will have noted that I have a rather unhealthy fixation with my newsfeed aggregator application. I wrote it about a year ago and despite the passage of time (a year is a long time in Java development) I am still very proud of it.

A scribbly application architecture diagram (UML, I wish I understood you).

It was originally modelled after the JavaRSS approach. At the time I had also written a three-pane newsfeed aggregator application which is currently part of our University portal. I personally prefer this one-page view. I have tried to tweak it to work with a larger selection of newsfeeds (GoFetch), but the expand/collapse paradigm doesn't quite work for me on this occasion somehow.

Anyhoo, I have been revisiting some of the code recently and have tidied it up considerably. I have always been happy to share the code with those that asked but it came with a big disclaimer that it was somewhat messy. Having sorted it out, I now think I have something approaching professional. I may even set up a SourceForge project for it although that arena is somewhat overcrowded already. An application shortly to join this throng is Microsoft's RSS Platform.

My feed aggregator application does have some very nice features which I will describe shortly. The simplicity of the application has come at the cost of being reliant on numerous third party libraries, namely Spring, EHCache, EHCache Constructs, Quartz, ROME, ROME Fetcher, XMLWriter, JSTL, Xalan, Xerces and the usual suspects from the Jakarta Commons.

The application is now very much simpler than before and at its heart it only makes use of six Java classes.

XBEL based feed collection management

The basis of my feed collection management is the XBEL file format (I could have easily used OPML also). The collection of feeds is stored in XBEL and I create what I call a "populated" XBEL output. This populated XBEL includes the actual feed items that I want to display. For example:

XBEL


<?xml version="1.0"?>
<xbel>
<folder>
<title>Blogs</title>
<bookmark title="Mark McLaren's Weblog" href="http://blog.mark-mclaren.info"/>
</folder>
</xbel>

becomes

XBEL populated


<?xml version="1.0"?>
<xbel>
<folder>
<title>Blogs</title>
<folder>
<bookmark title="Mark McLaren's Weblog" href="http://blog.mark-mclaren.info"/>
<folder>
<bookmark title="Feed caching using EHCache, Spring and ROME FeedFetcher" href="http://blog.mark-mclaren.info/2006/03/feed-caching-using-ehcache-spring-and.html" info="1142964250859"/>
<bookmark title="The RSS Platform concept" href="http://blog.mark-mclaren.info/2006/03/rss-platform-concept.html" info="1142964250859"/>
<bookmark title="Keith Donald answers my, rather stupid, questions on Spring Web Flow" href="http://blog.mark-mclaren.info/2006/03/keith-donald-answers-my-rather-stupid.html" info="1142964250859"/>
<bookmark title="Yet another Google Suggest Clone - Take 2" href="http://blog.mark-mclaren.info/2006/02/yet-another-google-suggest-clone-take-2.html" info="1142964250859"/>
<bookmark title="How did I miss Type-Ahead behaviour?" href="http://blog.mark-mclaren.info/2006/02/how-did-i-miss-type-ahead-behaviour.html" info="1142964250859"/>
</folder>
</folder>
</folder>
</xbel>

One source of complexity in my original application was that I used Castor to support the XBEL format (I had this code lying around as I was using this elsewhere). I decided that I didn't actually need Castor or an object representation of XBEL in this case. I now use DOM/SAX directly to traverse and extract data from XBEL files and I make use of XMLWriter to create populated XBEL files.
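To illustrate the kind of direct DOM traversal described above, here is a minimal sketch using only the JDK's built-in XML parser. It is not the application's actual code: the class name XbelFeedLister and the method listFeeds are hypothetical, and it assumes a plain (unpopulated) XBEL collection in which every bookmark element is a feed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not one of the application's classes.
class XbelFeedLister {

    /** Returns feed title -> feed URL for every bookmark element, in document order. */
    static Map<String, String> listFeeds(String xbel) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xbel.getBytes(StandardCharsets.UTF_8)));

            Map<String, String> feeds = new LinkedHashMap<>();
            NodeList bookmarks = doc.getElementsByTagName("bookmark");
            for (int i = 0; i < bookmarks.getLength(); i++) {
                Element bookmark = (Element) bookmarks.item(i);
                feeds.put(bookmark.getAttribute("title"), bookmark.getAttribute("href"));
            }
            return feeds;
        } catch (Exception e) {
            // Parser configuration, I/O and malformed XML all end up here in this sketch.
            throw new IllegalArgumentException("Could not read XBEL", e);
        }
    }

    public static void main(String[] args) {
        String xbel = "<?xml version=\"1.0\"?>"
                + "<xbel><folder><title>Blogs</title>"
                + "<bookmark title=\"Mark McLaren's Weblog\" href=\"http://blog.mark-mclaren.info\"/>"
                + "</folder></xbel>";
        listFeeds(xbel).forEach((title, href) -> System.out.println(title + " -> " + href));
    }
}
```

Each URL returned this way could then be handed to the feed fetcher, with the resulting items written back out as nested bookmark elements.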

EHCache and ROME Fetcher powered feed fetch mechanism

I talked about this in my last entry. The feed fetching engine is ROME and ROME Fetcher. I have created an instance of ROME Fetcher that makes use of EHCache. In reviewing the code, I noticed that although technically I had three caches, I only actually needed two EHCaches (since SyndFeedInfo includes the SyndFeed object). I also noticed that I had been caching the entire SyndFeed instead of just the URL string. Fixing these issues gave me some performance improvements right there! Moving to a Spring Framework powered EHCache implementation has also nicely reduced the complexity of the code involved.
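The general idea of a feed cache looked up by URL string can be sketched without any of the libraries above. In the sketch below a ConcurrentHashMap stands in for EHCache and the type parameter F stands in for ROME's SyndFeedInfo; the class UrlKeyedFeedCache, its method names and the time-to-live policy are all assumptions made for illustration, not the application's code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical stand-in for an EHCache-backed feed cache; F stands in for SyndFeedInfo.
class UrlKeyedFeedCache<F> {

    private static final class Entry<F> {
        final F feed;
        final long storedAtMillis;

        Entry(F feed, long storedAtMillis) {
            this.feed = feed;
            this.storedAtMillis = storedAtMillis;
        }
    }

    private final Map<String, Entry<F>> entries = new ConcurrentHashMap<>();
    private final long timeToLiveMillis;

    UrlKeyedFeedCache(long timeToLiveMillis) {
        this.timeToLiveMillis = timeToLiveMillis;
    }

    /** Returns the cached feed for this URL, calling the fetcher only if the entry is missing or stale. */
    F get(String feedUrl, long nowMillis, Function<String, F> fetcher) {
        Entry<F> entry = entries.get(feedUrl);
        if (entry == null || nowMillis - entry.storedAtMillis >= timeToLiveMillis) {
            entry = new Entry<>(fetcher.apply(feedUrl), nowMillis);
            entries.put(feedUrl, entry);
        }
        return entry.feed;
    }

    public static void main(String[] args) {
        UrlKeyedFeedCache<String> cache = new UrlKeyedFeedCache<>(60_000L);
        long now = System.currentTimeMillis();
        // The second call is answered from the cache, so the fetcher runs only once.
        System.out.println(cache.get("http://blog.mark-mclaren.info", now, url -> "fetched " + url));
        System.out.println(cache.get("http://blog.mark-mclaren.info", now + 1, url -> "fetched again"));
    }
}
```

Passing the current time in as a parameter keeps the expiry rule easy to test; a real cache such as EHCache handles expiry and eviction through its own configuration.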

Quartz based scheduling

I wrote a ServletListener that begins polling the feeds in the background when you start the application up. I did look into replacing this with a Spring powered implementation but on this occasion there weren't any great advantages to be had in doing so. It would mean replacing two classes with two alternative classes (and this would add an unnecessary additional dependency and external configuration file maintenance).

EHCache and EHConstruct based dynamic page caching

My feed aggregator view is cached via a page caching filter. This means it makes use of the conditional get mechanism itself. When this page is accessed it is cached in the browser for a period of time which reduces the load on the backend processing.
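For readers unfamiliar with conditional GET, the decision at its core is small: if the browser's copy is at least as new as the page, answer 304 Not Modified instead of sending the page again. The sketch below shows only that decision, using the JDK's date classes. It is not the EHCache Constructs filter's code, and the class ConditionalGet and method statusFor are hypothetical names.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical illustration of the conditional GET decision; not the page caching filter itself.
class ConditionalGet {

    static final int OK = 200;
    static final int NOT_MODIFIED = 304;

    /**
     * Chooses the response status from the If-Modified-Since request header
     * (null when absent) and the time the cached page was last built.
     */
    static int statusFor(String ifModifiedSince, Instant pageLastModified) {
        if (ifModifiedSince == null) {
            return OK; // the browser has no copy yet
        }
        try {
            Instant clientCopy = ZonedDateTime
                    .parse(ifModifiedSince, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            // HTTP dates only have one-second resolution, so compare whole seconds.
            return pageLastModified.getEpochSecond() <= clientCopy.getEpochSecond()
                    ? NOT_MODIFIED
                    : OK;
        } catch (DateTimeParseException e) {
            return OK; // an unreadable header is treated as if it were absent
        }
    }

    public static void main(String[] args) {
        Instant built = Instant.ofEpochMilli(1142964250859L);
        System.out.println(statusFor(null, built));
        System.out.println(statusFor("Tue, 21 Mar 2006 18:04:10 GMT", built));
    }
}
```

A 304 response carries no body, which is where the saving in backend processing and bandwidth comes from.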

The final rendering and JavaScript enabled functionality

The final rendering is achieved via an XSL transformation of the populated XBEL (I use JSTL to do this but I could have easily used a servlet). HTML DOM processing is performed by JavaScript which, with the aid of cookies, highlights new feeds to the user. I had to bend the XHTML standards slightly to achieve this to support a custom attribute.
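As a rough idea of what an XSL transformation of populated XBEL involves when done from plain Java rather than JSTL, here is a sketch using the JDK's javax.xml.transform API. The stylesheet is a deliberately tiny, hypothetical one that turns every bookmark into a list item link; the real stylesheet and the custom attribute mentioned above are not shown, and the class name XbelRenderer is an assumption.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical renderer for illustration; the application itself uses JSTL for this step.
class XbelRenderer {

    // Hypothetical stylesheet: every bookmark element becomes <li><a href="...">title</a></li>.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"xml\" indent=\"no\" omit-xml-declaration=\"yes\" encoding=\"UTF-8\"/>"
          + "<xsl:template match=\"/\">"
          + "<ul><xsl:apply-templates select=\"//bookmark\"/></ul>"
          + "</xsl:template>"
          + "<xsl:template match=\"bookmark\">"
          + "<li><a href=\"{@href}\"><xsl:value-of select=\"@title\"/></a></li>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XBEL text and returns the resulting markup. */
    static String render(String populatedXbel) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(populatedXbel)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not transform XBEL", e);
        }
    }

    public static void main(String[] args) {
        String xbel = "<xbel><folder><title>Blogs</title><folder>"
                + "<bookmark title=\"The RSS Platform concept\""
                + " href=\"http://blog.mark-mclaren.info/2006/03/rss-platform-concept.html\""
                + " info=\"1142964250859\"/>"
                + "</folder></folder></xbel>";
        System.out.println(render(xbel));
    }
}
```

A servlet could call something like render and write the result to the response, which is the "could have easily used a servlet" alternative in miniature.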

I'm happy for people to download and use my code. I'd be very interested to hear what you are using it for and any modifications you make to it.

I'm sure I could improve it still further, for example by removing more of the hard coded variable references, but it does the job.

I now plan to take it in a JSON powered portlet direction...

8 comments:

Mark McLaren said...

Given the limitations/bugs in the java.net API package and the far superior Jakarta HTTPClient, is there any reason you use the java.net API?

I've got into all kinds of problems with java.net.


Note: Comment imported. Original by Anonymous at 2006-12-04 15:27

Mark McLaren said...

I found ROME FeedFetcher easy to use and did not feel that it limited me.

Firstly, it supports the conditional get mechanism by default.

Secondly, it supports working with most common syndication formats.

Thirdly, the cache mechanism was a nice fit to what I planned to do.

AFAIK, none of these are available by default when using Jakarta HTTPClient. Jakarta Commons FeedParser is the closest thing to ROME that Jakarta produce but it is still in the sandbox.

I really like ROME but there are several other possible alternatives that you could use.

I hope this answers your question.


Note: Comment imported. Original by markmc website: http://content.mark-mclaren.info/ at 2006-12-05 07:57

Mark McLaren said...

I find your application well coded and really helpful...

thank you sir.
Note: Comment imported. Original by vivek beniwal at 2007-05-21 09:43

Mark McLaren said...

Mark,

Thanks for making your code freely available. It's great to be able to use a Java app rather than PHP for web based aggregation.

I've played around with the source over the past day and am delighted to be able to finally replace Drupal at:

http://syn.indica-et-buddhica.org/

Best,

Richard
Note: Comment imported. Original by Richard MAHONEY website: http://www.indica-et-buddhica.org/ at 2007-12-09 03:56

Mark McLaren said...

Hi Richard,

I am very glad that you found my code useful. Your site looks great. If you spot any problems with the code or have any ideas for improvements it would be great if you could let me know.

Cheers,

Mark


Note: Comment imported. Original by markmc website: http://content.mark-mclaren.info/ at 2007-12-09 10:17

Mark McLaren said...

I'm wondering how we could have the application display UTF-8 characters in the aggregated material. I've used all of the following but still receive `?' in some titles:

<%@page pageEncoding="UTF-8"%>

<META http-equiv="Content-Type" content="text/html;charset=UTF-8">

<%@page contentType="text/html;charset=UTF-8"%>

<?xml version="1.0" encoding="UTF-8"?>

Would we need to use something such as:

request.setCharacterEncoding("UTF-8");


Note: Comment imported. Original by Richard Mahoney website: http://indica-et-buddhica.org/ at 2007-12-17 00:12

Mark McLaren said...

Hi Richard,



Your page output is UTF-8 (in Firefox you can check this by right clicking on the page and clicking on Page Info).





I don't think that your problem relates to this but you can specify both pageEncoding and contentType as these do subtly different things (the encoding of the JSP file and the encoding of the output, respectively).





<%@page pageEncoding="UTF-8" contentType="text/html;charset=UTF-8" %>





I have always found correctly handling encoding to be problematic! There are several things you can try - you probably will not need to try all of these!





Specify the output encoding of the XSL, e.g.:

<xsl:output method="xml" version="1.0" indent="yes" omit-xml-declaration="yes" encoding="UTF-8"/>





You might be able to explicitly define the character encoding for the JSTL import tag (for both XSL and XML). e.g.:

<c:import url="xbel.jsp" var="XML" charEncoding="UTF-8" />





Update your ROME Jars to the latest versions (0.9).





In SyndFeedFetcherWithCacheAndMerge.java in the getSyndFeedFromStream method instead of:

final BufferedReader reader = new BufferedReader(new InputStreamReader(is, ResponseHandler.getCharacterEncoding(connection)));
final SyndFeedInput input = new SyndFeedInput();
feed = input.build(reader);

Try something like:

XmlReader reader = new XmlReader(connection);
SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = input.build(reader);





Clean and rebuild the app. ROME's XmlReader "attempts to handle all the necessary Voodo to figure out the charset encoding of the XML document within the stream" - I don't think XmlReader was available when I first wrote this application.






Note: Comment imported. Original by markmc website: http://content.mark-mclaren.info/ at 2007-12-17 11:57

Mark McLaren said...

Thank you for this - it's absolutely awesome (and just what I was looking for)!
Note: Comment imported. Original by Rich L. at 2009-11-10 10:07