Friday, February 23, 2007

Tomcat, sprechen Sie UTF-8?: Finally, my Portlet talks UTF-8

I finally got my Tomcat-hosted portlet to communicate in UTF-8 throughout. I have had a pretty productive day, and I am fairly light-headed and pleased with the relative ease and elegance of the method that finally succeeded, so please forgive me if I start rambling incoherently (for those only interested in the Java stuff, I'll try to confine my inane ramblings to the sections in italics).

Incidentally, I did consider calling this entry "Herr Tomcat sprechen sie UTF-8?" but I thought the gender police would come and get me. I once read that the reason men are drawn towards creative pursuits like programming is that they cannot physically give birth. I'm sure ladies who drone on about the pain of childbirth haven't experienced the pain of getting a web application using extensive CSS working across all browsers. Okay, perhaps a tad glib. I digress...

Here is a screenshot of my working newsfeeds portlet, showing UTF-8 and all:

screenshot of my newsfeed app

So I have been writing a newsfeeds portlet based on my JSR 168 bookmarks portlet, reusing a lot of the same code. Thanks to my decision to use Struts Bridge and the Spring Framework together, I think my bookmarks portlet is pretty well architected. Converting the bookmarks portlet to handle newsfeeds is basically a case of adding a little ROME and AJAX magic to the equation.

The other day I was moaning about Unicode to my long-suffering partner (who does not work with computers):

Me: Unicode...terrible...complicated...grumble, grumble, grumble
Partner: That sounds nice
Me: What sounds nice?
Partner: Working with Unicorns

Er, yeah, well anyway...

Part 1: Database eat my UTF-8

Beating Oracle Database into submission, you will store my UTF-8 damn you.

The first part of the saga was to get my Oracle database to store UTF-8. But Oracle supports UTF-8, you say. Well, that is true, but this is real life: I do not look after the database, I am far from being the only user, and it is enterprise-wide, has been in service for years and currently uses 7-bit US-ASCII. This is due to change in the near future, but for now at least I have to deal with it. After looking long and hard at various methods to round-trip between UTF-8 and 7-bit US-ASCII, none of which I really understood, I found a solution I liked. It is supposed to be only a temporary fix until the Oracle databases are upgraded to UTF-8.

My solution for getting my UTF-8 in and out of a US-ASCII database is to use base64 encoding. My XBEL XML string, which I store inside CLOBs in both the bookmarks and newsfeeds portlets, is readily encoded using base64 (using a utility class from Commons Codec). I have not yet noticed any additional performance problems introduced by converting to and from base64; if anything, it almost seems a little faster! Plus, since I know I am only dealing with well-formed XML and base64-encoded strings, I can check whether the first character of the string is "<". The base64 alphabet does not contain this character, so its presence identifies the string as raw XML. Using this telltale XML signal, I can introduce the base64 encoding/decoding inside my DAO implementation layer and it will continue to work with data already stored as XML. My DAO can now easily deal with both XML and base64 as appropriate. Also, when the database is upgraded to use UTF-8, I can use the same clue to gradually convert all the encoded strings back to raw XML.
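The DAO-level check described above might look something like the following. This is only a minimal sketch, not my actual DAO code: it uses the JDK's `java.util.Base64` (Java 8+) rather than Commons Codec so that it is self-contained, and the class and method names (`XmlBase64Codec`, `toStored`, `fromStored`) are made up for illustration. It also assumes stored XML always begins with "<", with no leading whitespace or byte order mark.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper illustrating the "<" check; not the original DAO code.
public class XmlBase64Codec {

    /** Encodes the XML's UTF-8 bytes as base64, so only ASCII reaches the database. */
    public static String toStored(String xml) {
        return Base64.getEncoder().encodeToString(xml.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns legacy raw XML unchanged; anything else is decoded as base64. */
    public static String fromStored(String stored) {
        // "<" is not in the base64 alphabet, so it marks raw (legacy) XML.
        if (stored.startsWith("<")) {
            return stored;
        }
        return new String(Base64.getDecoder().decode(stored), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String xml = "<xbel><title>\u4e2d\u6587</title></xbel>";
        String stored = toStored(xml);
        System.out.println(stored);
        System.out.println(fromStored(stored).equals(xml));
    }
}
```

Because `fromStored` passes raw XML through untouched, rows written before the change keep working, and the same test can later be used to find rows that still need converting back.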

Part one: success. I have a means to store UTF-8 in a non-UTF-8 database that works with existing XML data and can be reversed in the future.

Part 2: Getting my application to display and receive UTF-8 correctly.

As I mentioned, this portlet is essentially a Struts application, including numerous JSP view pages. To ensure as much as possible that my Struts application outputs UTF-8 format I made several changes. Maybe not all of these changes are absolutely necessary but since I did not know what was stopping it working, I changed everything UTF-8 related that I could.

In struts-config.xml for my Struts bridge portlet I changed the controller entry to look like this:



<controller pagePattern="$M$P" inputForward="false" processorClass="org.apache.portals.bridges.struts.PortletRequestProcessor" contentType="text/html;charset=UTF8"/>

In web.xml I added an additional init-param to the servlet for my Struts bridge portlet:



<servlet>
    <servlet-name>action</servlet-name>
    <servlet-class>org.apache.portals.bridges.struts.PortletServlet</servlet-class>
    <init-param>
        <param-name>config</param-name>
        <param-value>/WEB-INF/struts-config.xml</param-value>
    </init-param>
    <init-param>
        <param-name>content</param-name>
        <param-value>text/html;charset=UTF8</param-value>
    </init-param>
</servlet>

In the XML- and XHTML-producing JSPs I added:



<%@ page contentType="text/html; charset=utf-8" pageEncoding="UTF-8"%><%--
--%><?xml version="1.0" encoding="UTF-8"?>

In the forms produced by my Struts tags I added:



<html:form action="/Action.do" acceptCharset="UTF-8">

I read somewhere that I could apply page encodings in JSP 2.0 applications using <jsp-property-group> in web.xml. However, what I read seemed to suggest that this would only work with JSP pages. In my Struts application the JSPs mostly live under WEB-INF and therefore do not have any directly addressable URL, so I didn't think this technique could be applied.

At this point, having liberally sprinkled UTF-8 references throughout my application, the feeds fetched using AJAX that contained UTF-8 characters displayed correctly (hurray!). BUT this was not the case for the feed titles: the title and URL of each feed were submitted by the user via a form, and something was messing up the encoding between the browser and the server.

I started looking round a bit more and found that several people were saying that a UTF-8 filter seemed to be of help, however, this was unlikely to be of help in my particular situation. This is a JSR168 portlet and portlets, without considerable effort, pay no heed to servlet filters.

In several of the pages I visited, it was suggested that Tomcat 5.5.x was responsible for the poor way in which UTF-8 character encoding was being handled. I could believe that, since I had gone to great lengths to ensure everything produced UTF-8 and my application was half working (at least the parts that had not been submitted via forms). A workaround for Tomcat's deficiency was to add:



if (request.getCharacterEncoding() == null) {
    request.setCharacterEncoding("UTF-8");
}

before attempting to retrieve any request parameters. I tried this out and it seemed to work, but for a moment I thought I would need to add this code to all my Struts actions (or at least extend the Struts action class to perform it). Then I remembered something I had been reading the other day. I have started using ServletContextListeners quite a bit recently. These are called when an application first starts up; the Spring Framework uses them to establish contexts, and they are very handy for initialising databases and such. Well, there are other listeners available besides ServletContextListeners, and the one I remembered reading about was the ServletRequestListener. A ServletRequestListener is called every time a request is created and destroyed. Therefore, I could place the above code inside a ServletRequestListener and it would solve my UTF-8 problems. There was an added bonus in using a ServletRequestListener: where a filter would not work in a portlet, a ServletRequestListener does!



package somewhere.web;

import java.io.UnsupportedEncodingException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletRequestEvent;
import javax.servlet.ServletRequestListener;

/** Defaults the request character encoding to UTF-8 when the request declares none. */
public class UTF8EncodingRequestListener implements ServletRequestListener {

    public void requestDestroyed(ServletRequestEvent servletRequestEvent) {
        // Nothing to clean up.
    }

    public void requestInitialized(ServletRequestEvent servletRequestEvent) {
        ServletRequest request = servletRequestEvent.getServletRequest();
        String enc = request.getCharacterEncoding();
        // Only override when the request does not declare an encoding itself.
        if (enc == null) {
            try {
                // Must be set before any request parameters are read.
                request.setCharacterEncoding("UTF-8");
            } catch (UnsupportedEncodingException ex) {
                ex.printStackTrace();
            }
        }
    }

}
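Why the missing request encoding garbles form input can be illustrated outside the container. The following is a minimal sketch, assuming the server falls back to ISO-8859-1 when a request declares no encoding (the servlet specification's default for request bodies). The class and method names are made up for illustration and nothing here is Tomcat code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: what happens when UTF-8 form bytes are read as ISO-8859-1.
public class FormEncodingDemo {

    /** Simulates a server decoding UTF-8 form bytes with the ISO-8859-1 default. */
    public static String decodeAsLatin1(String submitted) {
        byte[] wire = submitted.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the original text by reversing the wrong decoding step. */
    public static String repair(String garbled) {
        byte[] wire = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A seven-letter Russian feed title, written with escapes to keep the source ASCII.
        String title = "\u041d\u043e\u0432\u043e\u0441\u0442\u0438";
        String garbled = decodeAsLatin1(title);
        // Each Cyrillic letter is two UTF-8 bytes, so the garbled string is twice as long.
        System.out.println(title.length() + " -> " + garbled.length());
        System.out.println(repair(garbled).equals(title));
    }
}
```

Setting the request encoding up front, as the listener does, avoids the wrong decoding in the first place, which is cleaner than repairing each parameter afterwards.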

One last thing: to test UTF-8 I got some Arabic, Russian and Chinese RSS newsfeeds from the BBC OPML feeds list (see the screenshot above). All was working fine on my Win2K workstation, but when I tried this on Windows XP my Chinese characters did not display. I tried to access Google China and again the characters did not display. It turns out that East Asian fonts are not installed by default with Windows XP, and if you want them you will need to install them yourself (if you have a spare 230 megabytes on your hard disk and care about such things).


Comments:

ismjml said...

It has to be "Sie" (= you) instead of "sie" (= they). A comma after Tomcat would help, too.
Note: Comment imported. Original by Anonymous at 2007-02-24 13:05

ismjml said...

danke schön (thank you very much)
Note: Comment imported. Original by markmc website: http://content.mark-mclaren.info/ at 2007-02-24 14:36