JavaBlogs Weekly Top 10 and Java HTML parsing

I took some time to continue my little JavaBlogs analysis, and I now have a page summarizing the top 10 most-read blog entries of the last week. The page is regenerated every 24 hours (which is why there is no ‘best progression’ as of today).

I also fixed some bugs related to HTML in RSS 2.0. I understand a bit better why an RSS 1.0 co-author decided to remove the possibility of HTML descriptions for RSS 3.0. It often does not make sense to keep all that information about styles, fonts, etc. from different sources. What I do is rewrite the HTML, allowing only the b, i, a, p and br tags, with the style information stripped. I found the open-source htmlparser Java library quite helpful for that.
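The whitelist idea can be sketched in a few lines. The version below is not the htmlparser-based code: it is a simplified, regex-based stand-in that uses only the JDK, and the names HtmlWhitelist and sanitize are made up for illustration. It assumes that the only attribute worth keeping is href on a tags.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keeps only b, i, a, p, br and drops every attribute except href on <a>.
class HtmlWhitelist {
    private static final Set<String> ALLOWED = Set.of("b", "i", "a", "p", "br");

    // group 1: "/" for a closing tag, group 2: tag name, group 3: raw attribute text
    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>");

    // Only quoted href values are recognized in this sketch.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*(\"[^\"]*\"|'[^']*')", Pattern.CASE_INSENSITIVE);

    static String sanitize(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            String replacement = ""; // tags outside the whitelist are dropped, their text is kept
            if (ALLOWED.contains(name)) {
                if (name.equals("br")) {
                    replacement = closing ? "" : "<br/>";
                } else if (closing) {
                    replacement = "</" + name + ">";
                } else if (name.equals("a")) {
                    Matcher h = HREF.matcher(m.group(3));
                    replacement = h.find() ? "<a href=" + h.group(1) + ">" : "<a>";
                } else {
                    replacement = "<" + name + ">";
                }
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A regex cannot cope with everything a real parser handles (comments, script bodies, unquoted attributes), which is one reason a proper HTML parser is the safer choice for feeds coming from many different sources.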

May 2006 Update

I now post the top 10 to my blog every week, and I wrote a little piece on how to interact with the Blogger API in Java. This saves me from having to maintain an extra site.

HTTP request handling is done with the commons-httpclient library, which gives more control over how requests to javablogs.com are performed. commons-httpclient is also useful for posting to Blogger. As for the parsing with htmlparser, I changed my approach: I used to rely on a simple Lexer, but I now use a NodeVisitor, as it lets me parse with finer granularity more easily, even though it is probably slower. I needed that to update href elements so that they are XHTML compliant.

You will find concrete code using htmlparser in a more recent post, just follow the link.

Spam In Blog Comments

Like many others, I was a victim of comment spam. It is a stupid thing to do on Blogger.com, since links in comments are not followed by search engines (they carry a special rel="nofollow" attribute for that) and therefore do nothing to improve PageRank.

Fortunately, Blogger.com provides a word verification step if you want to avoid random spam. However, I am a bit disappointed that they force Blogger.com users to go through that word verification as well. This time I find it stupid on Blogger.com's part. They have control over their users, so they could ban the spamming ones, and for everybody else on Blogger.com this would be one less step. I am always a bit annoyed at measures that solve a problem caused by a handful of people by making things more annoying for the majority.

JavaBlogs Daily Analysis

I was wondering what blog entries were the most interesting on Javablogs. I decided to write a small application to do that. It was not much more complex to put it online for others to look at as well. It is currently running on http://gopix.net:8081/javabuzz

It also presents Javablogs a bit differently (I like it better that way).

Please note that it is currently the result of just one (full) day of work. I hope to have a bit of time to improve it. For example, I would like to add some graphs about popularity, some weekly stats, and comments on blog entries.

Commons-Beanutils Is Slow

BeanUtils.populate(bean, map) can be very handy, but if you care about performance, it is quite slow. I ran a micro benchmark on my machine (Centrino 1.8 GHz, JDK 1.5) and found that BeanUtils is up to 40x slower than a hand-coded solution (where I assign each bean field manually). I was a bit surprised to find such a difference. I suppose there is a big penalty for using reflection and another big one for the BeanUtils abstraction (automatic casting, etc.). I did another test without BeanUtils, comparing if/else statements with HashMap.get, and found that a chain of if/else string.equals(…) statements can degrade performance by about 10x. The HashMap appears to perform very well, even with just a few elements in it.
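A micro benchmark of this kind can be sketched as follows. This is not the original benchmark: plain java.lang.reflect stands in for BeanUtils (so the cost of the BeanUtils abstraction itself is not measured), and the Person bean, the method names and the iteration count are assumptions chosen for illustration.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Hypothetical benchmark: hand-coded assignment versus reflective setter lookup.
class PopulateBenchmark {
    public static class Person {
        private String name;
        private int age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // Hand-coded version: one explicit assignment per field.
    static void populateByHand(Person p, Map<String, Object> values) {
        p.setName((String) values.get("name"));
        p.setAge((Integer) values.get("age"));
    }

    // Reflective version: finds each public setXxx(value) method and calls it
    // when the map holds a value for the matching property name.
    static void populateByReflection(Object bean, Map<String, Object> values) {
        try {
            for (Method m : bean.getClass().getMethods()) {
                String n = m.getName();
                if (n.startsWith("set") && n.length() > 3 && m.getParameterCount() == 1) {
                    String property = Character.toLowerCase(n.charAt(3)) + n.substring(4);
                    if (values.containsKey(property)) {
                        m.invoke(bean, values.get(property));
                    }
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not populate bean", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> values = new HashMap<>();
        values.put("name", "Duke");
        values.put("age", 10);
        int iterations = 1_000_000; // arbitrary; large enough to give the JIT time to warm up
        Person p = new Person();

        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) populateByHand(p, values);
        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) populateByReflection(p, values);
        long t2 = System.nanoTime();

        System.out.printf("by hand: %d ms, reflection: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

The numbers depend heavily on the JVM version and on warm-up, so the ratio on a recent JDK may be quite different from the one measured on JDK 1.5.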

In conclusion, if you need more performance, consider hand coding or code generation (using AspectJ or annotations, for example) rather than BeanUtils. Actually, nowadays, using annotations where you used BeanUtils probably makes much more sense.

Generate your RSS feed in Java

There are some open-source projects that can help you generate or read RSS feeds in Java. I found only two reasonably mature libraries; other code is often embedded in other open-source products (JRoller, for example):
  • Informa: Does various RSS formats and Atom 0.3. Documentation is better than its alternative, but less focused (has some hibernate helper thingy, some lucene helper, etc.).
  • Sandler: There is no working homepage as I am writing this. But the code is of decent quality and supports Atom 0.3 and RSS 1.0. It is easy to use. However, in reality it is not much more than a wrapper around an XML parser, specialized in generating an RSS or an Atom structure.
  • Ooops, I forgot another important one, Rome. This RSS/Atom framework with a catchy name is very similar to Informa, has good documentation and good looking code. Under the hood it makes use of jdom.
I personally use dom4j since I only need to generate RSS, and RSS and Atom are just XML. I don't find dom4j particularly verbose for that, and it is very flexible.

If you need to parse feeds, then those libraries might make sense and save you a bit of time. For generating, I think their main interest is to abstract you from the differences in formats. So if you need to handle different formats, a framework will allow you to do it through only one API, which can be a big time-saver.
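To illustrate the "it is just XML" point, here is a minimal sketch of feed generation without any feed library. It uses the DOM and Transformer classes bundled with the JDK instead of dom4j, and it emits RSS 2.0 only because that format needs the fewest elements; the class name RssWriter and the shape of the items argument (title and link pairs) are assumptions made for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper that builds a small RSS 2.0 document with the JDK's XML APIs.
class RssWriter {
    // Creates <name>text</name> under parent and returns the new element.
    private static Element child(Document doc, Element parent, String name, String text) {
        Element e = doc.createElement(name);
        if (text != null) {
            e.setTextContent(text); // special characters are escaped on serialization
        }
        parent.appendChild(e);
        return e;
    }

    // Each entry of items is {title, link}.
    static String feed(String title, String link, String description, String[][] items) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element rss = doc.createElement("rss");
            rss.setAttribute("version", "2.0");
            doc.appendChild(rss);

            Element channel = child(doc, rss, "channel", null);
            child(doc, channel, "title", title);
            child(doc, channel, "link", link);
            child(doc, channel, "description", description);
            for (String[] item : items) {
                Element i = child(doc, channel, "item", null);
                child(doc, i, "title", item[0]);
                child(doc, i, "link", item[1]);
            }

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build feed", e);
        }
    }
}
```

Supporting a second format this way means writing a second method with a different element layout, which is exactly the duplication a framework such as Rome or Informa hides behind one API.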

The Evil Port 80

I was writing an Atom feed generator for my current project. I chose to support Atom 1.0 since it looks like it has what it takes to establish itself as the next standard. Unfortunately, I quickly saw that it was quite hard to test in the real world (beyond the good feedvalidator), as almost nobody seems to accept Atom 1.0 feeds yet, even if this is rapidly changing (there is support for it in the Firefox CVS version).

So I decided to support RSS as well. The big question was: which RSS version? After gathering lots of information on the subject, I opted for 1.0 again (more flexible, and more different from Atom). It was actually quick to support RSS, but then in the real world, neither Google Desktop nor My Yahoo was willing to accept my feed. I looked at every bit of my XML and fiddled with the Tomcat configuration in every possible way, until I saw that no request from Yahoo or Google was reaching my server at all. And finally I thought, hmm, maybe it's the port. I restarted my server on port 80, and yup, it worked!



I wonder why Google Desktop and My Yahoo don't support ports other than 80 for RSS feeds.

Google Sidebar Hotkey Activation

I like the new Google Desktop with the sidebar. It shares similarities with Konfabulator, recently bought by Yahoo. They both allow easy access to custom little widgets that I would call "active". They are active because they are refreshed periodically with new information (processor usage, news, scratch pad, emails, etc.). But while Konfabulator chose to emphasize visual effects, Google prefers a more standard presentation of information. This shows as well in their choice of technologies:
  • Google Sidebar plug-ins are just Windows applications that can take advantage of Google interfaces. That makes them quite powerful in theory, but programming them is less accessible.
  • Konfabulator plug-ins are JavaScript+XML. The JavaScript is not just regular client-side JavaScript: it can use the Konfabulator API (containing many effects and rudimentary network access) and COM objects. That makes them very focused on presentation and on the Internet.
In the long run, Google's choice makes sense: the forthcoming Avalon will make visual effects very accessible to Windows developers.

Now back to the subject: in the Google Sidebar, I missed the 'activation on hotkey' feature from Konfabulator. Fortunately, I have found a powerful little open-source program, AutoHotkey, that allowed me to add it very quickly. Here is the script I use. It is a hack, since it relies on the toolbar size (but not its position), but I like the default position and it works. It only handles the floating deskbar; I will let you figure out the non-floating version.
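; F12: click the sidebar's toolbar button, then move the pointer back.
F12::
MouseGetPos, X, Y                 ; remember the current pointer coordinates
if WinExist("ahk_class _GD_Sidebar")
{
    WinActivate                   ; acts on the window found by WinExist
    BlockInput, On                ; ignore user input during the simulated click
    MouseClick, left, 160, 16     ; hard-coded button offset (depends on toolbar size)
    MouseMove, X, Y
    BlockInput, Off
    return
}
else if WinExist("ahk_class ATL:0044A4C8")
{
    WinActivate
    BlockInput, On
    MouseClick, left, 158, 16
    MouseMove, X, Y
    BlockInput, Off
    return
}
return                            ; neither window found: end the hotkey here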

I Need Another DB Framework!

I am currently facing a problem that neither Hibernate nor iBatis solves nicely. I also looked at other ORMs and plain DB frameworks, without success.

What I would need is a framework that generates PreparedStatements through a Query by Criteria-like API. I have many queries that are similar but vary according to different input parameters. iBatis can handle this, but for complex queries and scenarios the XML becomes completely unreadable, and you therefore lose any advantage that iBatis brings by externalizing SQL statements in XML. The other issue I have with iBatis is that, for another part of my project, automatic generation of SQL statements à la Hibernate is useful. Hibernate has a very nice Query by Criteria API, but it lacks just a tiny bit of flexibility in customizing queries. For example, I could not find a way to put a “USE INDEX(index_name)” in the generated SQL, after the SELECT FROM xxx and before the rest of the query. Nor did I find a way to request a “STRAIGHT_JOIN” instead of an INNER JOIN. These are all MySQL-specific issues, but those little things are extremely useful for improving the performance of some of my queries. Hard coding N SQL queries is not a good option, since N can be quite big, which is why I am using Query by Criteria in the first place.
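To make the wish concrete, here is a rough sketch of what such an API could look like. It is purely hypothetical: SelectBuilder and its methods are invented for this example and do not belong to Hibernate, iBatis or any other framework. Criteria with a null value are simply skipped, which is how one builder can cover many similar queries.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical criteria-style builder that emits MySQL-flavoured SQL with bind parameters.
// Table, column and index names are concatenated as-is, so they must come from trusted code.
class SelectBuilder {
    private final String table;
    private String indexHint;
    private final List<String> joins = new ArrayList<>();
    private final List<String> conditions = new ArrayList<>();
    private final List<Object> params = new ArrayList<>();

    SelectBuilder(String table) {
        this.table = table;
    }

    // Adds "USE INDEX (name)" right after the main table.
    SelectBuilder useIndex(String indexName) {
        this.indexHint = indexName;
        return this;
    }

    // Joins with STRAIGHT_JOIN instead of INNER JOIN.
    SelectBuilder straightJoin(String otherTable, String onClause) {
        joins.add(" STRAIGHT_JOIN " + otherTable + " ON " + onClause);
        return this;
    }

    // Adds "column = ?" only when a value is supplied.
    SelectBuilder eq(String column, Object value) {
        if (value != null) {
            conditions.add(column + " = ?");
            params.add(value);
        }
        return this;
    }

    String sql() {
        StringBuilder sb = new StringBuilder("SELECT * FROM ").append(table);
        if (indexHint != null) {
            sb.append(" USE INDEX (").append(indexHint).append(")");
        }
        for (String join : joins) {
            sb.append(join);
        }
        if (!conditions.isEmpty()) {
            sb.append(" WHERE ").append(String.join(" AND ", conditions));
        }
        return sb.toString();
    }

    List<Object> params() {
        return params;
    }

    // Binds the collected values in the order their placeholders appear.
    PreparedStatement prepare(Connection connection) throws SQLException {
        PreparedStatement ps = connection.prepareStatement(sql());
        for (int i = 0; i < params.size(); i++) {
            ps.setObject(i + 1, params.get(i));
        }
        return ps;
    }
}
```

With made-up table names, new SelectBuilder("orders").useIndex("idx_customer").eq("orders.status", "OPEN").eq("orders.region", null) yields a statement with a single placeholder, because the null criterion is left out.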

So is there a need for yet another DB framework?

del.icio.us toolbar customized

When I bookmark articles with delicious, I like to keep the content on my hard drive, because pages sometimes change or are removed, or because I want to do local searches. I believe this is one reason some people like furl (furl keeps a copy on its server that only you can read, but does not allow search).

A combination of slogger and delicious could partially solve the problem. But it is not integrated: I can't get to my local version from delicious, so I lose the tagging, listing and all the other pluses of delicious.

I added my own feature to the delicious toolbar, which I like very much. This new toolbar automatically saves the page you bookmark (with the + button) and adds a link in your delicious home to the local version (if it exists).

I have it publicly accessible at http://perso.wanadoo.fr/logos01/deltoolbar.html

It is not meant to be used by everybody as it is not official. But the page will give you an idea of what it does. If you think it is useful, I will improve it, otherwise it will stay the way it is because it fits my use.


Inside the Java Virtual Machine

I am reading an old book, Inside the Java Virtual Machine. Some old books don't age, and this is one of them. The chapter on the Java Virtual Machine is just excellent and should be read by every Java developer. It explains each step a JVM does when you run a Java program, very clearly.

You could get plenty of stupid interview questions from it, like: How is the Java stack used? Among the method area, heap, pc register and stack, which ones are shared among threads?
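For the record, the answer to the second question is that the method area and the heap are shared by all threads, while each thread gets its own pc register and Java stack. A small sketch makes the difference visible; the class name SharedVsLocal and its parameters are made up for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: one heap object shared by all threads, one local variable per thread.
class SharedVsLocal {
    static int run(int threads, int incrementsPerThread) {
        // The AtomicInteger object lives on the heap, so every thread sees the same instance.
        AtomicInteger shared = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                // 'local' lives in a frame on this thread's own Java stack;
                // no other thread can read or modify it.
                int local = 0;
                for (int i = 0; i < incrementsPerThread; i++) {
                    local++;
                    shared.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        // Each thread counted to incrementsPerThread privately;
        // the shared counter received the increments of all of them.
        return shared.get();
    }
}
```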

The authors also saw the full potential of Java quite early on (1997). They explain how the JVM specs allow for very different implementations that can run in very different environments: simplifying a bit, the low-memory embedded world on one end and the memory-rich mainframe world on the other. It is no accident that Microsoft chose a very similar design for the CLI of .NET: they have been trying to get into the embedded area for quite some time, and apparently they are making good progress.
