Wednesday, June 22, 2005

Scraping Google News with JSP and the Regexp taglib

I was inspired by this Google Hack, a script written in Perl to scrape Google News. This is the sort of thing that Perl is very well suited to, and it brought back fond memories of my time as a Perl programmer; I seemed to spend most of my time writing stuff like that.

Anyhoo, I thought I'd see how easy it would be to do likewise with JSP instead of Perl. I found that Jakarta has a Regexp tag library, so the task looked fairly simple: I could virtually crib the regexp from the Perl script.

It is never simple, is it? Those clever guys at Google refuse access to Google News based on the User-Agent string, so I can't use the JSTL import tag and must fetch the Google News page by other means (below, a scriptlet; I know it is bad practice). I can't help but think that I must be doing something wrong, but surely I can't be if the means to do this is published in an O'Reilly book?

The following JSP page scrapes the news items and their respective URLs, which could easily be massaged into XML feeds or whatever.


<%@ taglib uri="http://jakarta.apache.org/taglibs/regexp-1.0" prefix="rx" %>
<rx:text id="test">
<%= httpget("http://news.google.co.uk/news?lr=&tab=wn&ned=uk&topic=n") %>
</rx:text>

<rx:regexp id="rx1">m/<a href="([^"]+)" id="?r-(?:[0-9_]+)"?>(.+?)</a>/mi</rx:regexp>
<rx:regexp id="rx2">s/<[^>]*>//gmi</rx:regexp>

<rx:match id="match" regexp="rx1" text="test">
<rx:text id="test2"><rx:group number="2"/></rx:text>
<a href="<rx:group number="1"/>">
<rx:substitute regexp="rx2" text="test2"/>
</a>
<br />

</rx:match>

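<%@ page import="java.lang.StringBuffer" %>
<%@ page import="java.net.HttpURLConnection" %>
<%@ page import="java.net.URL" %>
<%@ page import="java.io.InputStream" %>
<%!
// Fetches a page as a string, sending a browser-like User-Agent header.
String httpget( String url ) {
    HttpURLConnection hc = null;
    InputStream is = null;
    try {
        StringBuffer sb = new StringBuffer(500);
        URL href = new URL(url);
        hc = (HttpURLConnection) href.openConnection();
        String ua = "Mozilla/4.0 (compatible; MSIE 6.0; WINDOWS; .NET CLR 1.1.4322)";
        hc.setRequestProperty("User-Agent", ua);
        hc.setRequestMethod("GET");
        hc.connect();

        is = hc.getInputStream();
        int i;
        // Each byte is cast straight to a char, so the response is in effect decoded as ISO-8859-1.
        while ( (i = is.read()) != -1 ) {
            sb.append((char) i);
        }
        return sb.toString();
    } catch (Exception e) {
        // Report the failure as an HTML comment instead of breaking the page.
        return "\r\n<!-- Error:=> " + e.toString() + "-->";
    } finally {
        // Release the stream and the connection even when the read fails.
        if (is != null) {
            try { is.close(); } catch (Exception ignored) { }
        }
        if (hc != null) {
            hc.disconnect();
        }
    }
}
%>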
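
For comparison, here is a rough sketch of the same two regexps (rx1 and rx2 above) in plain Java with java.util.regex, which might be a starting point for moving the logic out of the page. The class name NewsLinkExtractor, the extract method and the sample markup are all made up for illustration; the sample only imitates the shape of link that rx1 expects and is not real Google News output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original page.
public class NewsLinkExtractor {
    // Same pattern as rx1: group 1 is the href, group 2 the inner HTML of the link.
    private static final Pattern LINK = Pattern.compile(
        "<a href=\"([^\"]+)\" id=\"?r-(?:[0-9_]+)\"?>(.+?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Same pattern as rx2: matches any tag so it can be stripped from the link text.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns one {url, title} pair per matching link found in the HTML.
    static List<String[]> extract(String html) {
        List<String[]> items = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String title = TAG.matcher(m.group(2)).replaceAll("");
            items.add(new String[] { m.group(1), title });
        }
        return items;
    }

    public static void main(String[] args) {
        // Made-up sample in the shape rx1 expects.
        String sample =
            "<a href=\"http://example.com/story\" id=r-0_1><b>Sample</b> headline</a>";
        for (String[] item : extract(sample)) {
            System.out.println(item[0] + " -> " + item[1]);
        }
    }
}
```

The list of URL and title pairs could then be written out in whatever format is wanted, rather than being printed as links in the page.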

2 comments:

Mark McLaren said...

Hi Mark

Do you think this tag library would be useful for scraping Google ranks as well?

Do you have any first-hand experience of this?

Nice to know about this.

I am fed up with my Perl script :)

Note: Comment imported. Original by NastyKid at 2007-07-20 11:23
