Benchmarking Languages Is Difficult

Wed, Jun 3, 2009 I often looked at the famous computer languages shootout for fun. Recently I noticed they had the infamous thread ring test. I posted not very long ago several blog entries about it showing how silly this test was.

Looking at the existing Java implementation for the test I decided to try to submit the tricky one using a pool of thread, and pooling message processing rather creating 1 thread per node. To my surprise, it was accepted without questions and I did have the best score for a Java program for a while. Shortly after someone else copied my program and got rid of various stuff not useful for the particular benchmark (breaking the interesting part of the design) and got accepted as well with of course a better result.

I decided to see if I could make an even more silly program - tailored for the test only. I managed to be orders of magnitude faster - 1 thread, no synchronization, everything processed in a FIFO (linkedlist) queue. This is actually a standard way to reimplement recursion. But I was honest enough not to hide that I consider that kind of program to cheat the test and got my entry in the "interesting alternatives".

In reality there is no difference in the "cheating" between my new program and the program that got accepted in the official list, they both cheat by using only 1 thread and process everything 1 by 1. There is not 1 thread per node in any of the program, and they can avoid any concurrency issues. One "looks" better because it uses a pool of 503 threads (but really use only 1 or 2 threads) and the other does not hide its use of 1 thread for processing. But this is not evident to people accepting the programs.

When I look at the haskell code, I can not really tell if it is creating 503 threads in the language or a pool or ..., you have to know each language quite well and sometimes it is not that easy to define what cheating is. Therefore this kind of benchmark is a bit disappointing. One should force the use of the same algorithm. But can you do so (a functional language won't use the same algo as a procedural one)?

Benchmarking Languages Is Difficult

Comments