Tuesday, July 19, 2005

Efficient parsing of largish XML with JSTL and the XMLFilter

I discovered whilst browsing through Google Print that Shawn Bayern (in his excellent book JSTL in Action) describes using an XMLFilter with JSTL's <x:parse>. An XMLFilter lets you discard the parts of an XML document that you don't need before any further processing takes place. The XMLFilter is SAX-based rather than DOM-based, which can make it more efficient in both processing time and memory use. A DOM parser has to build the entire document in memory before any processing can occur, whereas a SAX parser streams through the document and reports it as a sequence of events, so a filter can drop the unwanted parts before they are ever built into a tree.
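To make the idea of discarding parts of a document at the SAX level concrete, here is a minimal sketch of a hand-written filter in plain Java. It is not taken from Shawn's book: the class name DropElementFilter, the helper parseWithout and the tiny sample document are all made up for illustration. It extends the standard XMLFilterImpl and simply swallows the events for one named element and everything inside it, so the DOM built afterwards never contains that subtree.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical example class: drops every element with the given name,
// together with its whole subtree, before the events reach the consumer.
class DropElementFilter extends XMLFilterImpl {
    private final String dropName;
    private int skipDepth = 0; // > 0 while we are inside a dropped subtree

    DropElementFilter(String dropName) {
        this.dropName = dropName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0 || qName.equals(dropName)) {
            skipDepth++; // swallow this start tag
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--; // swallow the matching end tag
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    // Parses the XML through the filter and builds a DOM from whatever survives.
    static Document parseWithout(String xml, String dropName) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        DropElementFilter filter = new DropElementFilter(dropName);
        filter.setParent(reader);

        // An identity transformer turns the filtered SAX events into a DOM tree.
        DOMResult result = new DOMResult();
        TransformerFactory.newInstance().newTransformer().transform(
                new SAXSource(filter, new InputSource(new StringReader(xml))), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample in the style of the play markup used later on.
        String xml = "<SPEECH><SPEAKER>BERNARDO</SPEAKER>"
                + "<LINE>Who's there?</LINE><STAGEDIR>Enter</STAGEDIR></SPEECH>";
        Document doc = parseWithout(xml, "STAGEDIR");
        System.out.println("STAGEDIR elements: " + doc.getElementsByTagName("STAGEDIR").getLength());
        System.out.println("LINE elements: " + doc.getElementsByTagName("LINE").getLength());
    }
}
```

The parser still has to read the whole input; the saving, if any, comes from never building tree nodes for the parts that were dropped.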

In Shawn's example he created a taglib called SPath that produces an XMLFilter from a limited subset of XPath. Part of SPath is generated with JavaCC. This all looked a bit too complicated for my own purposes. I then discovered that it is actually quite simple to create an XMLFilter from an XSL transformation document.
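Outside of a JSP, the XSL-to-XMLFilter step can be sketched in plain Java roughly as follows. This is only an illustration under some assumptions: the class name XsltFilterSketch, the filter method and the miniature PLAY document are invented for the example, and it relies on the default TransformerFactory supporting the XMLFilter feature, which it checks before casting.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;

// Hypothetical example class: wraps a stylesheet in an XMLFilter and
// serialises whatever comes out of the filter.
class XsltFilterSketch {
    // Keep only the first scene of the first act.
    static final String XSL =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/'>"
        + "<xsl:copy-of select='//PLAY/ACT[1]/SCENE[1]'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String filter(String xml) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        // Not every factory is required to support XMLFilter creation.
        if (!tf.getFeature(SAXTransformerFactory.FEATURE_XMLFILTER)) {
            throw new IllegalStateException("TransformerFactory cannot create XMLFilters");
        }
        SAXTransformerFactory stf = (SAXTransformerFactory) tf;
        XMLFilter xsltFilter = stf.newXMLFilter(new StreamSource(new StringReader(XSL)));

        // The filter sits between the real SAX parser and the consumer.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        xsltFilter.setParent(reader);

        // An identity transformer serialises the filter's output events.
        Transformer serializer = tf.newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(
                new SAXSource(xsltFilter, new InputSource(new StringReader(xml))),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up miniature play: two acts, three scenes.
        String xml = "<PLAY>"
                + "<ACT><SCENE><TITLE>one</TITLE></SCENE><SCENE><TITLE>two</TITLE></SCENE></ACT>"
                + "<ACT><SCENE><TITLE>three</TITLE></SCENE></ACT>"
                + "</PLAY>";
        System.out.println(filter(xml));
    }
}
```

In the JSP version the serialising step is not needed, because <x:parse> consumes the filter directly through its filter attribute.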

For small XML documents you are unlikely to gain anything from adding an XMLFilter, but with larger documents a filter can make quite a difference.

As an example I obtained a copy of Hamlet in XML format (I removed the DTD reference from it as it was causing trouble). For the filtered example I discarded all but the first scene of the first act; in the unfiltered example I parsed the whole play and then selected the same scene with XPath (apologies for the scriptlets).


<%@ taglib prefix="c" uri="http://java.sun.com/jstl/core" %>
<%@ taglib prefix="x" uri="http://java.sun.com/jstl/xml" %>
<%@ taglib prefix="fmt" uri="http://java.sun.com/jstl/fmt" %>

<%@ page import="org.xml.sax.XMLFilter" %>
<%@ page import="javax.xml.transform.sax.SAXTransformerFactory" %>
<%@ page import="javax.xml.transform.Source"%>
<%@ page import="javax.xml.transform.stream.StreamSource"%>
<%@ page import="java.io.StringReader"%>
<%@ page import="javax.xml.transform.TransformerFactory"%>

<c:set var="xsl" scope="page">
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
<xsl:copy-of select="//PLAY/ACT[1]/SCENE[1]"/>
</xsl:template>

</xsl:stylesheet>
</c:set>

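<%
// Read the stylesheet text captured by <c:set> above.
String xslt = (String) pageContext.getAttribute("xsl");
Source xsltSource = new StreamSource(new StringReader(xslt));
// The cast assumes the default factory is SAX-capable
// (see SAXTransformerFactory.FEATURE_XMLFILTER).
TransformerFactory tfactory = TransformerFactory.newInstance();
SAXTransformerFactory stf = (SAXTransformerFactory) tfactory;
// Wrap the stylesheet in an XMLFilter and expose it to <x:parse>.
XMLFilter filter = stf.newXMLFilter(xsltSource);
pageContext.setAttribute("filter", filter);
%>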

<c:import url="hamlet.xml" var="feed" />

<jsp:useBean id="before" class="java.util.Date" />
<h1>
<fmt:formatDate value="${before}" type="both" pattern="HH:mm:ss:SSS" />
</h1>

<x:parse var="a" filter="${filter}"><c:out value="${feed}" escapeXml="false" /></x:parse>

<x:forEach select="$a//SPEECH" var="speech">
<p style="font-size: 8pt;">
<b><x:out select="$speech/SPEAKER"/></b><br />
<x:forEach select="$speech/LINE" var="line">
<x:out select="$line"/><br />
</x:forEach>
</p>
</x:forEach>

<jsp:useBean id="after" class="java.util.Date" />
<h1>
<fmt:formatDate value="${after}" type="both" pattern="HH:mm:ss:SSS" />
</h1>

<x:parse var="b"><c:out value="${feed}" escapeXml="false" /></x:parse>

<x:forEach select="$b//PLAY/ACT[1]/SCENE[1]/SPEECH" var="speech">
<p style="font-size: 8pt;">
<b><x:out select="$speech/SPEAKER"/></b><br />
<x:forEach select="$speech/LINE" var="line">
<x:out select="$line"/><br />
</x:forEach>
</p>
</x:forEach>

<jsp:useBean id="after2" class="java.util.Date" />
<h1>
<fmt:formatDate value="${after2}" type="both" pattern="HH:mm:ss:SSS" />
</h1>

The result is that processing the filtered document appears to be faster than processing the unfiltered one, going by the timestamps the page prints. Granted, this is not a very scientific test, and I'm not claiming that it is the most efficient use of either SAX or DOM techniques. It does, however, highlight a relatively easy way to make use of an interesting and little-known feature of JSTL.
