Wednesday, February 28, 2007

Unmarshalling KML with XStream and Polycythemic Data Models

As I mentioned before, I am usually only interested in about 20 percent of what a KML file can contain, just the 2D geometry and nothing else. I want to extract this geometry collection and store it in Java objects for further processing. If I was interested in capturing all the XML information into objects I would normally turn to some kind of XML binding library like Castor, JAXB or XML Beans. I do not want to capture the entire document in Java objects so what are my options?

I could pre-process the XML document with an XSL and create an abridged version of the original XML schema containing only the elements I am interested in. Then I could use XML binding on this abridged XML schema. Although this is possible, it feels a little too much like hard work, performance may suffer (unless I used a streaming transformation engine like STX rather than XSL) and generally it is not a very flexible approach. Therefore, I will not be doing that.

Then I discovered XStream. XStream is often mentioned in the XML binding context but it is subtly different to Castor, JAXB and XML Beans. XStream has a streaming nature (much like SAX): it parses XML and can map whatever it finds into Java classes (and vice versa). Essentially, you need to do a little groundwork in order to establish how your XML data maps to Java objects. XStream makes cunning use of reflection so that the amount of groundwork you need to do is actually relatively slight. XStream does not require that your Java objects implement any specific methods; member variables are all it requires to populate objects. This appears a little strange at first and it can affect how you write your object methods. For example, you cannot assume that any class constructor was ever called when XStream populated your Java objects. This seems a little odd, as you end up with a pre-populated object without all the usual procedures being followed, but at the same time it is quite ingenious.
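
The "no constructor was called" behaviour is easy to trip over, so here is a rough sketch of it. XStream is not part of the JDK, so this sketch swaps in standard Java serialization, which has the same property: the constructor and field initialisers of a Serializable class are not run when an object is read back. The class and field names (ConstructorSkipDemo, Placemark, notes) are made up for illustration and are not from my proof of concept.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: uses JDK serialization as a stand-in for XStream.
final class ConstructorSkipDemo {

    static final class Placemark implements Serializable {
        private static final long serialVersionUID = 1L;

        String name;
        // Field initialisers run as part of construction, so this one is
        // skipped when the object is rebuilt without calling a constructor.
        transient List<String> notes = new ArrayList<>();

        Placemark(String name) {
            this.name = name;
        }
    }

    /** Writes the object to a byte array and reads it straight back. */
    static Placemark roundTrip(Placemark in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream oin = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Placemark) oin.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Placemark copy = roundTrip(new Placemark("Home"));
        System.out.println("name  = " + copy.name);
        System.out.println("notes = " + copy.notes);
    }
}
```

The practical consequence is that methods on such objects should not rely on anything a constructor would normally have set up; a null check or lazy initialisation in the accessor is one way to cope.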

I have created a simple proof of concept that converts KML into a Java object representation of Kml (containing only the 2D geometry elements); you can download the Java source code here, you will also need the XStream library.

In the main program XStream is used to read a KML sample file into simple, hand crafted, Java objects. XStream can then very easily return this tree of objects back into KML (of sorts). To prove to the more cynical (like myself) that this actually works I have also included a simple traversal of the newly created Kml object tree. That was scarily simple wasn't it?

Elsewhere, I have attempted to further extend this example code making use of the Geotools and the JTS project code (and possibly also WKB4J). I am attempting to create FeatureCollections upon which I can then use a bounding box filter and similar things.

My problem is that the Geotools/JTS code seems to be very focussed on conducting GML based XML parsing itself and binding the results to its own Java object tree. XStream has already produced a Java object tree for me, so I need to be able to skip to the good bits. Alas, so far, my approach of using XStream in order to convert KML into JTS or GML is getting a little too complicated for me, even if it is fun to play with. I have managed to create a hideous fusion hybrid of KML and GML; if this turns out to be quite useful I will follow in the master's footsteps and not release the XML schema anytime soon. ;)

I am about to go off on one for the next paragraph so consider yourself warned (this is not criticism, just the ramblings of a madman, I send nothing but positive vibes :) ). As an enthusiastic Java programmer and amateur follower of GIS, I really, really want to like Geotools and JTS and find them a joy to work with. However, sometimes I find it all very complicated. There is great work in there but IMHO it is difficult to get at. When working with Java in the GIS arena there is a real danger of creating what I am calling Polycythemic data models. Polycythemia is the exact opposite of Anemia (hence my use!). Martin Fowler warns of the dangers of creating Anemic data models; these do very little except act as containers for data. Conversely, Polycythemic models try to do far too much! E.g. the idea of "Spatial DB in Box" seemed quite good to me until I sat down and thought about what it was suggesting. What makes spatially aware databases like PostGIS good is that they are highly optimised. The approach of a Spatial DB in Box sounds like a good idea because it would turn a non-spatially aware database into one that was spatially aware; it hides the innate complexity of the GIS aware layer, but at what cost to performance??? You have to be very careful about how you layer complexity upon complexity. Although it is very tempting to start experimenting with storing WKB objects in Derby or H2 blobs, I suspect that way madness lies...

Friday, February 23, 2007

Tomcat, sprechen Sie UTF-8?: Finally, my Portlet talks UTF-8

I finally got my Tomcat hosted portlet to communicate in UTF-8 throughout. I have had a pretty productive day and I am fairly light-headed and pleased with the relative ease and elegance of the method that finally succeeded, so please forgive me if I start rambling incoherently (for those only interested in the Java stuff, I'll try to confine my inane ramblings to sections denoted with italics).

Incidentally, I did consider calling this entry "Herr Tomcat sprechen sie UTF-8?" but I thought the gender police would come get me. I once read that the reason that men are drawn towards creative pursuits like programming is because they cannot physically give birth. I'm sure ladies who drone on about the pain of childbirth haven't experienced the pain of getting a web application using extensive CSS working across all browsers, okay, perhaps a tad glib, I digress...

Here is a screenshot of my working newsfeeds portlet, showing UTF-8 and all:

screenshot of my newsfeed app

So I have been writing a newsfeeds portlet based on my JSR168 bookmarks portlet and reusing a lot of the same code. Essentially, thanks to my decision to use Struts Bridge and Spring Framework together I think that my bookmarks portlet is pretty well architected. Converting the bookmarks portlet to handle newsfeeds is basically a case of adding a little ROME and AJAX magic into the equation.

The other day I was moaning about Unicode to my long-suffering partner (who does not work with computers)

Me: Unicode...terrible...complicated...grumble, grumble, grumble
Partner: That sounds nice
Me: What sounds nice?
Partner: Working with Unicorns

Er, yeah, well anyway...

Part 1: Database eat my UTF-8

Beating Oracle Database into submission, you will store my UTF-8 damn you.

The first part of the saga was to get my Oracle database to store UTF-8. But Oracle supports UTF-8 you say, well that is true but this is real life, I do not look after the database, I am far from being the only user and it is enterprise wide, has been in service for years and the format the database currently uses is 7bit US-ASCII. This is due to change in the near future but for now at least I have to deal with it. After looking long and hard at various methods to round trip between UTF-8 and 7bit US-ASCII, none of which I really understood, I found a solution I liked. This is supposed to be only a temporary fix until the Oracle databases are upgraded to UTF-8.

My solution to getting my UTF-8 in and out of a US-ASCII format database is to use base64 encoding. My XBEL XML string, which I store inside CLOBs in both bookmarks and newsfeeds portlets, is readily encoded using base64 (using a utility class from commons codec). I have not noticed any additional performance problems introduced by converting to and from base64 format yet; if anything it almost seems a little faster! Plus, since I know I am only dealing with well formed XML and base64 encoded strings, I can check to see if the first character of the string is the "<" character (the base64 alphabet does not contain this character), which identifies my string as XML format. Using this telltale XML signal, I can introduce the base64 encoding/decoding inside my DAO implementation layer and it will continue to work with existing data stored in XML format. My DAO can now easily deal with both XML and base64 as appropriate. Also, when the database is upgraded to use UTF-8, I can use the same clue to gradually convert all the encoded strings back to raw XML string format.
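
The "<" check can be sketched roughly like this. It is only an illustration of the idea, not the code from my DAO: it uses java.util.Base64 from the JDK (Java 8 and later) in place of the commons codec utility class, and the class and method names (XbelCodec, toStored, fromStored) are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: the names are illustrative, not from the portlet source.
final class XbelCodec {

    private XbelCodec() {
    }

    /** Encodes XML as base64 so only 7-bit ASCII characters reach the database. */
    static String toStored(String xml) {
        return Base64.getEncoder().encodeToString(xml.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns the XML for a stored value, whether it is legacy raw XML or base64. */
    static String fromStored(String stored) {
        if (stored == null || stored.isEmpty()) {
            return "";
        }
        // '<' is not in the base64 alphabet, so a leading '<' marks legacy raw XML.
        if (stored.charAt(0) == '<') {
            return stored;
        }
        return new String(Base64.getDecoder().decode(stored), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String xml = "<xbel><title>\u65b0\u95fb</title></xbel>";
        String stored = toStored(xml);
        System.out.println(stored);
        System.out.println(fromStored(stored).equals(xml));
    }
}
```

The check assumes the stored XML has no leading whitespace before its first "<"; trimming before the test would relax that assumption.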

Part one, success, a means to store UTF-8 in a non-UTF-8 compatible database and it will work with existing XML data and the method will be reversible in the future.

Part 2: Getting my application to display and receive UTF-8 correctly.

As I mentioned, this portlet is essentially a Struts application, including numerous JSP view pages. To ensure as much as possible that my Struts application outputs UTF-8 format I made several changes. Maybe not all of these changes are absolutely necessary but since I did not know what was stopping it working, I changed everything UTF-8 related that I could.

In struts-config.xml for my Struts bridge portlet I changed the controller entry to look like this:



<controller pagePattern="$M$P" inputForward="false" processorClass="org.apache.portals.bridges.struts.PortletRequestProcessor" contentType="text/html;charset=UTF8"/>

In web.xml I added additional init-param on servlets for my Struts bridge portlet:



<servlet>
<servlet-name>action</servlet-name>
<servlet-class>org.apache.portals.bridges.struts.PortletServlet</servlet-class>
<init-param>
<param-name>config</param-name>
<param-value>/WEB-INF/struts-config.xml</param-value>
</init-param>
<init-param>
<param-name>content</param-name>
<param-value>text/html;charset=UTF8</param-value>
</init-param>
</servlet>

in the XML and XHTML producing JSPs



<%@ page contentType="text/html; charset=utf-8" pageEncoding="UTF-8"%><%--
--%><?xml version="1.0" encoding="UTF-8"?>

in my struts tag produced forms I added



<html:form action="/Action.do" acceptCharset="UTF-8">

I read somewhere that I could apply page encodings in JSP2.0 applications using <jsp-property-group> in web.xml but this seemed to suggest that this would only work with JSP pages and in my Struts application my JSPs mostly live under WEB-INF and therefore do not have any directly addressable URL so I didn't think this technique could be applied.

At this point, having liberally sprinkled UTF-8 references throughout my application, the feeds fetched using AJAX that contained UTF-8 characters displayed correctly (hurray!) BUT this was not the case for the feed titles. The title and URL of each feed were submitted by the user via a form, so something was messing up the encoding between the browser and the server.

I started looking round a bit more and found that several people were saying that a UTF-8 filter seemed to be of help, however, this was unlikely to be of help in my particular situation. This is a JSR168 portlet and portlets, without considerable effort, pay no heed to servlet filters.

In several of the pages I visited, it was suggested that Tomcat 5.5.X was responsible for the poor way in which UTF-8 character encoding was being handled. I could believe that, since I had gone to great lengths in my attempts to ensure everything produced UTF-8 and my application was half working (at least the parts that had not been submitted via forms). A workaround for Tomcat's deficiency was to add:



if(request.getCharacterEncoding() == null){
request.setCharacterEncoding("UTF-8");
}

before attempting to retrieve any request parameters. I tried this out and it seemed to work but for a moment I thought I would need to add this code to all my Struts actions (or at least extend the Struts action class to perform this). Then I remembered something I was reading the other day. I have started using ServletContextListeners quite a bit recently, these are called when an application first starts up, the Spring Framework uses them to establish contexts and they are very handy for initialising databases and such. Well, there are other Listeners available besides ServletContextListeners and the one I remembered reading about was a ServletRequestListener. A ServletRequestListener is called every time a request is created and destroyed. Therefore, I could place the above code inside a ServletRequestListener and it would solve my UTF-8 problems. There was an added bonus in using a ServletRequestListener, where a filter would not work in a portlet a ServletRequestListener does!



package somewhere.web;

import java.io.UnsupportedEncodingException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletRequestEvent;
import javax.servlet.ServletRequestListener;

public class UTF8EncodingRequestListener implements ServletRequestListener {

    public void requestDestroyed(ServletRequestEvent servletRequestEvent) {
        // Nothing to clean up.
    }

    public void requestInitialized(ServletRequestEvent servletRequestEvent) {
        ServletRequest request = servletRequestEvent.getServletRequest();
        String enc = request.getCharacterEncoding();
        // Only set the encoding when the client did not declare one.
        if (enc == null) {
            try {
                request.setCharacterEncoding("UTF-8");
            } catch (UnsupportedEncodingException ex) {
                ex.printStackTrace();
            }
        }
    }

}

One last thing, to test UTF-8 I got some Arabic, Russian and Chinese RSS newsfeeds from the BBC OPML feeds list (see above screenshot). All was working fine on my Win2K workstation but when I tried this on Windows XP my Chinese characters did not work. I tried to access Google China and again the characters did not work. It turns out that East Asian fonts are not installed by default with Windows XP and if you want them you will need to install them yourself (If you have a spare 230 megabytes on your hard disk and care about such things).


Wednesday, February 21, 2007

Extracting coordinates from KML with XSL (e.g. for Google Maps)

Now that the Google Earth KML 2.1 format has an XML Schema, we can use XML validators to say for definite if a given XML file follows the rules of the KML format (i.e. it validates). An XML Schema gives an authoritative description of an XML format. XML Schema is not an easy format for humans to read. Worse still, KML 2.1 is a very complicated format and due to this, the XML Schema for KML 2.1 is also complex.

Many people who find my blog are looking to release the data that they have trapped inside KML files. The main problem they face is getting the "coordinates" data out of the KML. Mostly, people want to do this because they want to create Google Maps (GMaps) with that data. There are other reasons that we might want to extract this data and what I am about to present is a generic approach to extract this data (even if you do not plan to use it to produce Google Maps).

Firstly, thanks to the XML Schema, I can now examine the structure of the KML elements with more confidence. I spent a little time extracting what I considered to be the important parts of the KML 2.1 file format. There can be a great deal of information inside a KML file but I discovered that I am only interested in about 20% of what a KML file can currently offer. I am intentionally ignoring any of the format that I consider to be specifically useful to Google Earth (e.g. NetworkLinks, Overlays, style information, schema extension mechanisms, 3d models and anything involving altitude). This left me with this:

<kml> - top level root of the XML document
<kml> can contain any number of Feature elements.

Feature elements are <Document>, <Folder>, <Placemark>.
Feature elements can contain <name>, <address>, <description> and other sub-elements.

Geometry elements are <MultiGeometry>, <Point>, <LineString>, <LinearRing>, <Polygon>.

Feature elements

<Document> can contain any number of Feature elements.
<Folder> can contain any number of Feature elements.
<Placemark> can contain any number of Geometry elements.

Geometry elements

<MultiGeometry> can contain any number of Geometry elements.
<Point> contains a single <coordinates> element.
<LineString> contains a single <coordinates> element.
<LinearRing> contains a single <coordinates> element
<Polygon> contains a maximum of one <outerBoundaryIs> element.
<Polygon> contains any number of <innerBoundaryIs> elements

<outerBoundaryIs> elements contain a single <LinearRing> element
<innerBoundaryIs> elements contain a single <LinearRing> element

<coordinates> elements contain space-separated coordinate triples, each written as comma-separated values (e.g. "x1,y1,z1 x2,y2,z2"). If the element is contained inside a <Point>, this string is most likely to contain a single coordinate triple.
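
Once the <coordinates> text has been pulled out, splitting it up is straightforward. The following is a minimal sketch in Java, assuming only the 2D values are wanted; the class name KmlCoordinates and its parse method are hypothetical. It splits on whitespace first and then on commas, and ignores any third (altitude) value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class KmlCoordinates {

    private KmlCoordinates() {
    }

    /** Parses "x1,y1,z1 x2,y2,z2" into {x, y} pairs, ignoring any third value. */
    static List<double[]> parse(String text) {
        List<double[]> points = new ArrayList<>();
        if (text == null) {
            return points;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return points;
        }
        // Tuples are separated by whitespace, values within a tuple by commas.
        for (String tuple : trimmed.split("\\s+")) {
            String[] parts = tuple.split(",");
            if (parts.length < 2) {
                throw new IllegalArgumentException("Not a coordinate tuple: " + tuple);
            }
            points.add(new double[] {
                Double.parseDouble(parts[0]),
                Double.parseDouble(parts[1])
            });
        }
        return points;
    }

    public static void main(String[] args) {
        for (double[] p : parse(" -0.1,51.5,0   2.35,48.85,0 ")) {
            System.out.println("X: " + p[0] + " Y: " + p[1]);
        }
    }
}
```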

Using this abridged description of KML, I can construct an XSLT that can extract co-ordinate information from any KML 2.1 format file (providing that the KML file does not extend the KML schema).


<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:kml="http://earth.google.com/kml/2.1">
<xsl:output method="text" version="1.0" omit-xml-declaration="yes" />

<xsl:variable name="cr"><xsl:text>
</xsl:text></xsl:variable>

<xsl:template match="/ | @* | * | comment() | processing-instruction() | text()">
<xsl:apply-templates select="@* | * | comment() | processing-instruction() | text()" />
</xsl:template>

<xsl:template match="kml:Document | kml:Folder | kml:Placemark">
<xsl:value-of select="name()"/>
<xsl:if test="string-length(kml:name) > 0">
Name: <xsl:value-of select="kml:name"/>
</xsl:if>
<xsl:if test="string-length(kml:address) > 0">
Address: <xsl:value-of select="kml:address"/>
</xsl:if>
<xsl:if test="string-length(kml:description) > 0">
Description: <xsl:value-of select="kml:description"/>
</xsl:if>
<xsl:value-of select="$cr"/>
<xsl:apply-templates />
</xsl:template>

<xsl:template match="kml:MultiGeometry | kml:LineString | kml:Point | kml:LinearRing | kml:Polygon">
<xsl:value-of select="name()"/><xsl:value-of select="$cr"/>
<xsl:apply-templates />
</xsl:template>

<xsl:template match="kml:outerBoundaryIs | kml:innerBoundaryIs">
<xsl:value-of select="name()"/><xsl:value-of select="$cr"/>
<xsl:apply-templates />
</xsl:template>


<xsl:template match="kml:coordinates">
<xsl:call-template name="split">
<xsl:with-param name="str" select="normalize-space(.)" />
</xsl:call-template>
</xsl:template>

<xsl:template name="split">
<xsl:param name="str" />
<xsl:choose>
<xsl:when test="contains($str,' ')">
<xsl:variable name="coord"><xsl:value-of select="substring-before($str,' ')" /></xsl:variable>
<xsl:variable name="first"><xsl:value-of select="substring-before($coord,',')" /></xsl:variable>
<xsl:variable name="second"><xsl:value-of select="substring-before(concat(substring-after($coord,','),','),',')" /></xsl:variable>
X: <xsl:value-of select="$first" />
Y: <xsl:value-of select="$second" /><xsl:value-of select="$cr"/>
<xsl:call-template name="split">
<xsl:with-param name="str" select="normalize-space(substring-after($str,' '))" />
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:if test="string-length($str) > 0">
<xsl:variable name="first"><xsl:value-of select="substring-before($str,',')" /></xsl:variable>
<xsl:variable name="second"><xsl:value-of select="substring-before(concat(substring-after($str,','),','),',')" /></xsl:variable>
X: <xsl:value-of select="$first" />
Y: <xsl:value-of select="$second" /><xsl:value-of select="$cr"/>
</xsl:if>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

</xsl:stylesheet>

With a little XSL know-how, the above skeletal XSLT can be modified to insert Google Maps JavaScript as appropriate.
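
For anyone wanting to run a stylesheet like this from Java rather than from a command line tool, the JAXP classes in the JDK (javax.xml.transform) are enough. The sketch below is an assumption-laden illustration: to stay self-contained it embeds a cut-down stand-in stylesheet that only prints each coordinates string, plus a made-up KML sample, and the class name KmlXsltDemo is hypothetical. In practice you would load the full stylesheet and the KML from files.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class for illustration only.
final class KmlXsltDemo {

    // Cut-down stand-in for the full stylesheet: suppresses all other text
    // and prints each coordinates string on its own line.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:kml='http://earth.google.com/kml/2.1'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='text()'/>"
        + "<xsl:template match='kml:coordinates'>"
        + "<xsl:value-of select='normalize-space(.)'/><xsl:text>&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Made-up sample document in the KML 2.1 namespace.
    static final String KML =
        "<kml xmlns='http://earth.google.com/kml/2.1'><Document><name>Demo</name>"
        + "<Placemark><name>A</name><Point>"
        + "<coordinates> -0.1,51.5,0 </coordinates>"
        + "</Point></Placemark></Document></kml>";

    /** Applies the stylesheet text to the XML text and returns the output. */
    static String transform(String xsl, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(transform(XSL, KML));
    }
}
```

Note that the templates must match elements in the KML namespace (the kml: prefix); a stylesheet without the namespace declaration will silently match nothing.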

There are approximate parallels between KML and GMaps, e.g.:

I would stop short of creating a universal KML to GMaps solution because I have found specific requirements vary greatly. Larger quantities of co-ordinate data need special handling. The presentation of data, colours and style should be the decision of the developer. Although I have not produced a technique that people without XML, XSLT and JavaScript knowledge can use, I hope this is still useful to somebody.

Wednesday, February 14, 2007

Invalid XML Characters: when valid UTF8 does not mean valid XML

I was working on a Java application recently when I got the following exception

org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1a) was found in the element content of the document.

I was using Castor in an attempt to unmarshal (XML->Java Objects) an XML string. The XML originated from a bookmarks file that was uploaded into the application by the user, tidied up, transformed and then stored in an Oracle database. It was at the point at which the XML string was returned from the database and was being converted into Java objects that this error occurred.

I immediately assumed it to be some kind of character encoding conversion problem.

With some databases (e.g. MySQL) it is possible to set data to be passed as UTF8 by passing certain settings via the JDBC url. I do not use MySQL so I do not know whether this would fix my problem; although I suspect it would not.

I tried various methods to convert my XML string into valid UTF8 and I was pretty sure that I had achieved satisfactory UTF8 conversion but I still got the error.

This is when I discovered that not all valid UTF8 characters are valid XML characters, which probably makes sense (what with control characters and such) but I have never had to think about this before.

After spending several hours messing with numerous UTF8 conversion techniques I eventually found a solution. I found it in the Xalan mailing list. I am reproducing this solution here because it was not mentioned in the context of the "Unicode: 0x1a" error; had it been, I would have found the solution more quickly. The XML standard specifies which Unicode characters are valid in XML documents, so it is possible to take a UTF8 document and filter out all the invalid characters using a method like this:


/**
 * This method ensures that the output String has only
 * valid XML unicode characters as specified by the
 * XML 1.0 standard. For reference, please see
 * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
 * standard</a>. This method will return an empty
 * String if the input is null or empty.
 *
 * @param in The String whose non-valid characters we want to remove.
 * @return The in String, stripped of non-valid characters.
 */
public String stripNonValidXMLCharacters(String in) {
    if (in == null || ("".equals(in))) return ""; // vacancy test.
    StringBuilder out = new StringBuilder(in.length()); // Used to hold the output.

    // Walk the string by code point rather than by char. A char is only
    // 16 bits wide, so a single char can never fall in 0x10000-0x10FFFF and
    // characters stored as surrogate pairs would otherwise be stripped.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i); // The current code point.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.appendCodePoint(current);
        }
        i += Character.charCount(current); // Step over one or two chars.
    }
    return out.toString();
}
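
For what it is worth, the same filter can also be written as a single regular expression. This is only a sketch of an alternative, not the method from the Xalan mailing list: the class name XmlCharFilter is hypothetical and it assumes Java 7 or later for the \x{...} escape. Java's regex engine matches by code point, so characters above U+FFFF are kept while the 0x1a control character is removed.

```java
import java.util.regex.Pattern;

// Hypothetical alternative for illustration only.
final class XmlCharFilter {

    // Matches any code point that is NOT allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
        "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    private XmlCharFilter() {
    }

    /** Returns the input with every invalid XML 1.0 character removed. */
    static String strip(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        return INVALID_XML_CHARS.matcher(in).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "book" + (char) 0x1A + "marks";
        System.out.println(strip(dirty));
    }
}
```

Compiling the pattern once as a constant avoids re-parsing it on every call, which matters if the filter runs on every document read from the database.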

I hope somebody else finds this useful (and it saves them a few hours of head scratching), alternatively if people reading this know of a better solution then please do let me know!

Thursday, February 08, 2007

POW: the embedded web server Firefox plugin has arrived!

Server Side JavaScript (SSJS) is back with a vengeance, except now the server *is* the client!

David Kellog's POW Firefox Plugin (Plain Old Webserver) [See also: Wiki and Forum thread] is a work of near genius. I have blogged before about how we are on the brink of a POW is evidence of this. I have written about Embedding Databases, Web Servers, App Servers in the Browser, we have seen Sun's impressive LAJAX applet embedded Java DB (Apache Derby) demo, my own prize winning ;) experiments with Jetty and Firefox 2.0 support for persistent storage using SQLite.

Enough playing, even at this early stage of development, POW looks like the real thing. I do not really understand how it works (XUL and XPCOM being complete mysteries to me) but I know a good thing when I see it. Since POW uses the Firefox embedded JavaScript engine, it should also support E4X and anything else that should come along in later JavaScript releases.

If there is any justice in the world, POW should be HUGE...

Sunday, February 04, 2007

Bookmarks Portlet Updated

I have just updated my bookmarks portlet on SourceForge.

I had good fun doing this! I used it as an excuse to try Maven out again. Wow! Maven2 is much better than Maven1. That said, I had considerable difficulty in getting the maven-war-plugin to behave as I wanted. I think Maven2 isn't quite ready to knock Ant off of the top spot (it is very close though!). Mevenide worked brilliantly in Netbeans (at least until my PermGen errors started up!).