Today I found out that replacing Math.pow(x, y) with Math.exp(y * Math.log(x)) gave me a 50% performance gain in my program. Of course, both x and y are doubles in my case. I find this quite surprising; I expected better from Math.pow.
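For the record, here is the kind of drop-in replacement I mean; a minimal sketch, keeping in mind that the identity only holds for x > 0 (Math.pow also handles zero, negative bases and special values, which this shortcut does not):

    // x^y = exp(y * ln(x)), valid only for x > 0.
    // Math.pow additionally handles x <= 0, NaN and infinities; this shortcut does not.
    static double fastPow(double x, double y) {
        return Math.exp(y * Math.log(x));
    }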
I just miraculously received some AKG Q701 headphones. I had never tried really high-end headphones before; I was used to my Sennheiser HD555, which I already thought were quite good.
I always believed there was not much difference between, say, a $100 headphone and a $500 one, and that only real audiophiles would be able to notice the difference: people like Jake who can distinguish artefacts in 320kbps MP3s, but not people like me. I was wrong. These headphones make a real difference: everything sounds very clear, and it doesn't feel like listening to music through headphones at all; it is much closer to the sound of a good hi-fi stereo.
Since 2007, there has been a new kind of Mersenne Twister (MT) that exploits SIMD architectures: the SFMT. The Mersenne Twister has set quite a standard in random number generation for Monte-Carlo simulations, even though it has flaws.
I was wondering whether SFMT improves performance over MT in a Java implementation. There is actually a decent Java port of the original algorithm on the same page. When I ran it, it ended up more than 20% slower than the classical 32-bit Mersenne Twister on a 64-bit JDK 1.6.0_23 for Windows.
This kind of result is not too surprising, as the code is more complex. But it shows that Java is unable to use SIMD instructions properly, which is still a bit disappointing.
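For reference, the comparison was done with a simple throughput loop along these lines (a sketch only: the generator classes are placeholders for whichever MT and SFMT ports you benchmark, assumed here to expose nextInt() through java.util.Random):

    // Rough throughput measurement; run it twice and keep the second
    // result so the JIT has warmed up.
    static long timeNextInt(java.util.Random rng, int n) {
        long blackhole = 0; // keeps the loop from being optimized away
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            blackhole += rng.nextInt();
        }
        long elapsed = System.nanoTime() - start;
        System.out.println(rng.getClass().getSimpleName() + ": "
                + elapsed / 1000000 + " ms (" + blackhole + ")");
        return elapsed;
    }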
Nvidia uses the XorWow random number generator in its CURAND library. It is a simple and fast random number generator with a reasonably long period, and it can be parallelized relatively easily. Nvidia suggests it passes L'Ecuyer's TestU01, but is not very explicit about it. So I decided to see how it performed on TestU01.
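For reference, the generator itself is tiny. Here is a Java sketch of Marsaglia's xorwow (five 32-bit xorshift words combined with a Weyl sequence); the seeds below are arbitrary nonzero values, and CURAND's seeding scheme may well differ:

    // Marsaglia's xorwow: a 160-bit xorshift state plus a Weyl sequence.
    public final class XorWow {
        private int x = 123456789, y = 362436069, z = 521288629,
                    w = 88675123, v = 5783321;
        private int d = 6615241; // Weyl sequence counter

        public int nextInt() {
            int t = x ^ (x >>> 2);
            x = y; y = z; z = w; w = v;
            v = (v ^ (v << 4)) ^ (t ^ (t << 1));
            d += 362437;
            return d + v;
        }
    }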
I found it very simple to test a new random number generator with TestU01: the documentation is great and the examples are helpful. Basically, there is just a simple C file to create, compile and run.
XORWOW passes the SmallCrush battery, and according to Marsaglia, it also passes DieHard. But it fails on 1 test in Crush and 3 in BigCrush. By comparison, Mersenne-Twister fails on 2 tests in Crush and 2 in BigCrush. Here are the results of the failures:
    ========= Summary results of Crush =========

     Version:               TestU01 1.2.3
     Generator:             Xorwow
     Number of statistics:  144
     Total CPU time:        00:50:15.92
     The following tests gave p-values outside [0.001, 0.9990]:
     (eps  means a value < 1.0e-300):
     (eps1 means a value < 1.0e-15):

           Test                          p-value
     ----------------------------------------------
     72  LinearComp, r = 29              1 - eps1
     ----------------------------------------------
     All other tests were passed

    ========= Summary results of BigCrush =========

     Version:               TestU01 1.2.3
     Generator:             Xorwow
     Number of statistics:  160
     Total CPU time:        06:12:38.79
     The following tests gave p-values outside [0.001, 0.9990]:
     (eps  means a value < 1.0e-300):
     (eps1 means a value < 1.0e-15):

           Test                          p-value
     ----------------------------------------------
      7  CollisionOver, t = 7            1.6e-6
     27  SimpPoker, r = 27               eps
     81  LinearComp, r = 29              1 - eps1
     ----------------------------------------------
     All other tests were passed
There is an interesting article on how to efficiently evaluate the inverse of the normal cumulative distribution function on the GPU. This is useful for Monte-Carlo simulations based on normally distributed variables.
Another result of the paper is a method (the breakless algorithm) that apparently computes it faster than Wichura's very good AS241 algorithm on the CPU as well, while keeping similar precision. The key is to avoid branches (if-then) at the cost of not avoiding log() calls. As the algorithm is very simple, I decided to give it a try in Java.
Unfortunately, I found out that in Java, on 64-bit machines, the breakless algorithm is actually around twice as slow. Intrigued, I tried the same in Visual C++ and found that this time it was 1.5x slower. Then I tried gcc and found it was 10% faster... But the total time was large, because I had not applied any optimization in the gcc compilation. With the -O3 flag, AS241 became much faster and we were back to the Java result: the breakless algorithm was twice as slow. The first conclusion is that JITed Java code is as fast as optimized, compiled C++ code. The second is that the authors probably did not think of compiling with optimization flags.
Then I decided to benchmark the CUDA float performance of similar algorithms. The CUDA program was 7x faster than the double-precision multithreaded CPU program. This is comparing an Nvidia GT330m against a Core i5 520m, and float precision against double precision, on a naturally parallel problem. It is very far from the usually announced 80x speedups. Of course, if one compares against single-threaded gcc without optimization, we might reach 50x, but that is not a realistic comparison at all. I have heard that double precision is 8x slower than float on the GPU: the gap then becomes quite small. Apparently Fermi cards are much faster, but unfortunately I don't have any, and even then I would not expect much better than a 10x speedup. This is good, but very far from the usually advertised speedups.
I temporarily abandoned Firefox for Chrome/Chromium. I am now back to using Firefox, as Firefox 4 is as fast as or faster than Chrome and seems more stable, especially under Linux. Also, it does not send anything to Google, and bookmark sync works independently of Google.
I am impressed that Mozilla managed to improve Firefox that much.
A while ago, I was already looking for a good Java matrix library, complaining that there did not seem to be any really good one under active development: the two best ones, in my opinion, are Jama and Colt.
Recently I tried to price options via RBF (radial basis functions) based on TR-BDF2 time stepping.
This is a problem where one needs to do a few matrix multiplications and inversions (or better, LU solves) in a loop. The size of the matrix is typically 50x50 to 100x100, and one can loop between 10 and 1000 times. Out of curiosity, I decided to give ojalgo and MTJ a chance; I had read benchmarks (here about jblas and here the Java matrix benchmark) where those libraries performed really well. On my Core i5 laptop under the latest 64-bit JVM (Windows 7), I found out that for the 100x100 case, Jama was actually 30% faster than MTJ, and ojalgo was more than 50% slower. I also found out that I did not like the ojalgo API at all. I was quite disappointed by those results. The kind of loop I am timing is sketched below.
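A minimal sketch with Jama, using random matrices in place of the actual TR-BDF2 operators (the recurrence is only there to exercise the multiplication and the LU solve):

    import Jama.LUDecomposition;
    import Jama.Matrix;

    public class MatrixBench {
        public static void main(String[] args) {
            int n = 100;      // matrix size, 50 to 100 in practice
            int steps = 1000; // number of time steps, 10 to 1000
            Matrix a = Matrix.random(n, n);
            Matrix x = Matrix.random(n, 1);

            long start = System.nanoTime();
            for (int i = 0; i < steps; i++) {
                Matrix rhs = a.times(x);               // matrix multiplication
                x = new LUDecomposition(a).solve(rhs); // LU solve
            }
            long elapsed = System.nanoTime() - start;
            System.out.println("Jama: " + elapsed / 1000000 + " ms, x[0]=" + x.get(0, 0));
        }
    }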
So I tried the same test on a 6-core Phenom II (Ubuntu 64-bit): Jama was faster than MTJ by 0-10%, and ojalgo and ParallelColt were slower than Jama by more than 50% and 30% respectively.
This does not mean that ojalgo and ParallelColt are bad; maybe they behave much better than the simple Jama on large matrices, and they also have more features, including sparse matrices. But Jama is quite a good choice for a default library. MTJ can also be a good choice: it can be faster and use less memory, because most methods take the output matrix/vector as a parameter. Furthermore, MTJ can use the native LAPACK and BLAS libraries for improved performance. The bigger the matrices, the more difference that will make.
Before Java 1.5, I never really complained about the lack of an enum keyword. Sure, the old enum-via-class pattern was a bit verbose at first (N.B.: Java 1.5 enums can also become verbose once you start adding methods to them). But more importantly, you would often use it in combination with the table lookup pattern, as in the sketch below.
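A minimal sketch of that pre-1.5 pattern, with hypothetical constants just for illustration: a class with a protected constructor, public constants, and a lookup table keyed by name.

    import java.util.HashMap;
    import java.util.Map;

    // The old "typesafe enum" via class, combined with a lookup table.
    // Unlike a language enum, it can be subclassed and new constants
    // can be registered later.
    public class ModelType {
        private static final Map<String, ModelType> LOOKUP =
                new HashMap<String, ModelType>();

        public static final ModelType BLACK_SCHOLES = new ModelType("BlackScholes");
        public static final ModelType BACHELIER = new ModelType("Bachelier");

        private final String name;

        protected ModelType(String name) {
            this.name = name;
            LOOKUP.put(name, this);
        }

        public static ModelType valueOf(String name) {
            return LOOKUP.get(name);
        }

        public String toString() {
            return name;
        }
    }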
The problem with the Java 1.5 enum is that it is not object-oriented. You can't extend an enum, and you can't add an element to an existing enum. Many will say "but that's what an enum is for, a static list of things". In my experience, the list of things often changes with time, or needs to be extended at some point. Furthermore, most people (including me when I am very lazy) end up writing switch statements on enum values, as in the hypothetical example below. Enum promotes bad programming practices.
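A hypothetical illustration of the switch habit versus pushing the behaviour into the enum itself: adding a new constant silently breaks every switch, whereas the constant-specific method keeps the behaviour next to the constant, at the cost of the verbosity mentioned above.

    public class EnumSwitchExample {
        enum OptionType { CALL, PUT }

        // The tempting version: every new constant requires hunting down
        // and updating switches like this one across the code base.
        static double payoff(OptionType type, double spot, double strike) {
            switch (type) {
                case CALL: return Math.max(spot - strike, 0.0);
                case PUT:  return Math.max(strike - spot, 0.0);
                default:   throw new IllegalArgumentException("unknown type " + type);
            }
        }

        // The object-oriented version: the behaviour lives with the constant.
        enum OptionTypeOO {
            CALL { double payoff(double spot, double strike) { return Math.max(spot - strike, 0.0); } },
            PUT  { double payoff(double spot, double strike) { return Math.max(strike - spot, 0.0); } };

            abstract double payoff(double spot, double strike);
        }
    }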
Think twice before using an enum; it is often not what you want.
I tried Scala a few years ago. There are several good ideas in it, but I found the language a bit too complicated to master. However, I recently stumbled upon a paper on Scala generics that might change my mind about using Scala.
Scala generics used to work in a similar way to Java generics: via type erasure. One main reason is compatibility with Java; another is that C++-style templates make the code base blow up. Scala generics offer some additional behaviour (the variance/covariance notion). C++ templates, however, have some very interesting aspects: one is that everything is done at compile time, the other is performance. If the generics are involved in any computation-intensive task, all the boxing and type conversion will create a significant overhead.
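To illustrate that overhead in Java terms (a hypothetical example, not taken from the paper): with erasure the type parameter is really Object, so a generic accumulator pushes every double through boxing, while a hand-specialized version stays on primitives.

    // Hypothetical illustration of the cost that specialization removes.
    public class BoxingOverhead {
        interface Accumulator<T> {
            T combine(T a, T b);
        }

        // Generic path: each call boxes/unboxes its arguments and result.
        static <T> T reduce(T[] values, T zero, Accumulator<T> acc) {
            T result = zero;
            for (T v : values) {
                result = acc.combine(result, v);
            }
            return result;
        }

        // Hand-specialized path: stays on primitive doubles throughout.
        static double reduce(double[] values) {
            double result = 0.0;
            for (double v : values) {
                result += v;
            }
            return result;
        }

        public static void main(String[] args) {
            int n = 1000000;
            Double[] boxed = new Double[n];
            double[] raw = new double[n];
            for (int i = 0; i < n; i++) {
                raw[i] = i * 1e-6;
                boxed[i] = raw[i];
            }
            System.out.println(reduce(boxed, Double.valueOf(0.0), new Accumulator<Double>() {
                public Double combine(Double a, Double b) { return a + b; }
            }));
            System.out.println(reduce(raw));
        }
    }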
Now Scala has @specialized (since Scala 2.8). Annotating a generic type parameter with @specialized makes the compiler generate specialized code. One has the choice of accepting the performance penalty or getting the full performance and accepting the code blow-up. I think this is very useful.
If you read the paper you will see that the performance implications of this are not always small.
UPDATE: I thank the readers for pointing out that this only works with primitive types, in order to avoid autoboxing. It is still valuable, but less so than I first thought.
MIT has a downloadable book on basic mathematics: Street-Fighting Mathematics. I liked the part focused on the geometrical approach; it reminded me of early Greek mathematics.
Overall it does look like a very American approach to maths: answering a multiple-choice test by elimination. But it is still an interesting book.