The CUDA Performance Myth

There is an interesting article on how to generate efficiently the inverse of the normal cumulative distribution on the GPU. This is useful for Monte-Carlo simulations based on normally distributed variables.

Another result of the paper is a method (breakless algorithm) to compute it apparently faster than the very good Wichura’s AS241 algorithm on the CPU as well keeping a similar precision. The key is to avoid branches (if-then) at the cost of not avoiding log() calls. As the algorithm is very simple, I decided to give it a try in Java.

Unfortunately I found out that in Java, on 64 bit machines, the breakless algorithm is actually around twice slower. Intrigued, I tried in Visual C++ the same, and found this time it was 1.5x slower. Then I tried in gcc and found that it was 10% faster… But the total time was significant, because I had not applied any optimization in the gcc compilation. With -O3 flag AS241 became much faster and we were back at the Java result: the breakless algorithm was twice slower. The first result is that the JITed java code is as fast as optimized C++ compiled code. The second result is that the authors did not think of compiling with optimization flag.

Then I decided to benchmark the CUDA float performance of similar algorithms. The CUDA program was 7x faster than the double precision multithreaded CPU program. This is comparing Nvidia GT330m vs Core i5 520m and float precision vs double precision on a naturally parallel problem. This is very far from the usually announced x80 speedup. Of course if one compares the algorithms with GCC single threaded no optimization, we might attain x50, but this is not a realistic comparison at all. I have heard that double precision is 8x slower on the GPU when compared to float precision: the difference then becomes quite small. Apparently Fermi cards are much faster, unfortunately I don’t have any. And still I would not expect much better than 10x speedup. This is good but very far from the usually advertised speedup.

Firefox 4 Is Great

I temporarily abandonned Firefox for Chrome/Chromium. I am now back at using Firefox as Firefox 4 is as fast or faster than Chrome and seems more stable, especially under linux. Also it does not send anything to Google and there is bookmark sync independently of Google.

I am impressed that Mozilla managed to improve Firefox that much.

Another Look at Java Matrix Libraries

A while ago, I was already looking for a good Java Matrix library, complaining that there does not seem any real good one where development is still active: the 2 best ones are in my opinion Jama and Colt.

Recently I tried to price options via RBF (radial basis functions) based on TR-BDF2 time stepping. This is a problem where one needs to do a few matrix multiplications and inverses (or better, LU solve) in a loop. The size of the matrix is typically 50x50 to 100x100, and one can loop between 10 and 1000 times.Out of curiosity I decided to give ojalgo and MTJ a chance. I had read benchmarks (here about jblas and here the java matrix benchmark) where those libraries performed really well.On my core i5 laptop under the latest 64bit JVM (Windows 7), I found out that for the 100x100 case, Jama was actually 30% faster than MTJ, and ojalgo was more than 50% slower. I also found out that I did not like ojalgo API at all. I was quite disappointed by those results.

So I tried the same test on a 6-core Phenom II (ubuntu 64bit), Jama was faster than MTJ by 0-10%. Ojalgo and ParallelColt were slower than Jama by more than 50% and 30%.

This does not mean that ojalgo and ParallelColt are so bad, maybe they behave much better than the simple Jama on large matrices. They also have more features, including sparse matrices. But Jama is quite a good choice for a default library, MTJ can also be a good choice, it can be faster and use less memory because most methods take the output matrix/vector as a parameter. Furthermore MTJ can use the native lapack and blas libraries for improved performance. The bigger the matrices, the most difference it will make.

RunJamaMTJMTJ native
10.1600.2400.140
20.0860.2000.220
100.0830.0890.056

(On a Phenom II under Ubuntu 10.10 64-bit)

Java enum Is Evil

Before Java 1.5, I never really complained about the lack of enum keyword. Sure the old enum via class pattern was a bit verbose at first (N.B.: Java 1.5 enums can also be verbose once you start adding methods to them). But more importantly, you would often use the table lookup pattern in combination.

The problem with Java 1.5 enum is that it is not Object-Oriented. You can't extend an enum, you can't add an element in an existing enum. Many will say "but that's what enum is for, a static list of things". In my experience,  the list of things often changes with time, or needs to be extended at one point. Furthermore, most people (including me when I am very lazy) end up writing switch statements on enum values. Enum promotes bad programming practices.

Think twice about using enum, this is often not what you want.

A Very Interesting Feature of Scala

I tried Scala a few years ago. There are several good ideas in it, but I found the language to be a bit too complicated to master. But I recently stubbled upon a paper on Scala generics that might change my mind about using Scala.

Scala Generics used to work in a similar way as Java Generics: via type erasure. One main reason is compatibility with Java, another is that C++ like templates make the code base blow up. Scala Generics offered some additional behavior (the variance/covariance notion). C++ templates, however, have some very interesting aspects: one is that everything is done at compile time, the other is  performance. If the generics are involved in any kind of computation intensive task, all the Java type conversion will create a significant overhead.

Now Scala has @specialized (since Scala 2.8). Annotating a generic type with @specialized will generate code. One has the choice to accept the performance penalty or to get all the performance but accept the code blow up. I think this is very useful.

If you read the paper you will see that the performance implications of this are not always small.

UPDATE: I thank the readers for pointing that this work only with primitive types to avoid autoboxing. It is still valuable but less than I first thought.

Street Fighting Mathematics Book

The MIT has a downloadable book on basic mathematics: Street Fighting Mathematics. I liked the part focused on the geometrical approach. It reminded me of the early greek mathematics.

Overall it does look like a very American approach to Maths: answering a multiple choices questions test by elimination. But it is still an interesting book.

Bye Bye Firefox

I have been a long user of Firefox, mostly thanks to the adblock extension. But recently, Firefox decided to change the way arrows work on the web pages, they don’t make the page scroll anymore. Meanwhile Chrome has now a good adblock plugin (that filters ads on load, not after load like it use to be) and is really much much faster than Firefox. So there is no more reason not to use it.

Hello Chrome, bye bye Firefox. Google has won the web browsers war.

Diffusion Limited Aggregation Applet

Yes, I wrote an applet. I know it is very 1990s but, amazingly, it still does the job quite well. Ok, next time I should really use Flash to do this.


The Applet simulates Diffusion Limited Aggregation as described in Chaos And Fractals from Peitgen, Juergens, and Saupe. It represents ions randomly wandering around (in a Brownian motion) until they are caught by an attractive force in electrochemical deposition experiment. This kind of phenomenon occurs at all scales, for example it happens in the distribution of galaxies. You can play around with the applet at http://31416.appspot.com/dla.vm

Java & 3D Surface

I have been looking all around the web for a Java library that can draw a simple 3D surface. And I did not find any. Most charting library, including the well known JFreeChart, can only draw 2D charts.

I am quite shocked that something that has been in Excel for 15 years is still not available in Java. And it’s not easy to make your own.

double[][] Is Fine

In my previous post, I suggest that keeping a double[] performs better than keeping a double[][] if you do matrix multiplications and other operations.

This is actually not true. I benchmarked 3 libraries, Colt (uses double[]), Apache Commons Math (uses double[][]) and Jama (uses double[][] cleverly). At first it looks like Jama has a similar performance as Colt (they avoid [][] slow access by a clever algorithm). But once hotspot hits, the difference is crazy and Jama becomes the fastest (Far ahead).




JDK 1.6.0 Linux 1000x1000 matrix multiplication on Intel Q6600
loop indexColtCommons MathJama
111.88074824.4551259.828977
211.87497524.2651029.848916
39.77261614.3741539.826572
49.75967914.3681052.655915
59.79962215.2389282.649129
69.78055614.7418632.668104
79.7283115.5099092.646811
89.7983815.7243482.646069
99.72614315.9887622.646052
109.78450515.1217822.644572
We don't include matrix construction time, and fetching the result. Only the multiplication is taken into account.

The difference is less pronounced on smaller matrices, but still there. Jama looks very good in this simple test case. In more real scenarios, the difference is not so obvious. For example Commons Math SVD is faster than Jama one.

Previous

Next