Good & Popular Algorithms are Simple

I recently tried to minimise a function according to some constraints. One popular method to minimise a function in several dimensions is Nelder-Mead Simplex. It is quite simple, so simple that I programmed it in Java in 1h30, including a small design and a test. It helped that the original paper from Nelder-Mead is very clear:
However the main issue is that it works only for unconstrained problems. Nelder and Mead suggested to add a penalty, but in practice this does not work so well. For constrained problems, there is an adaptation of the original idea by Box (incredible name for a constrained method) that he called the Complex method, a deliberate pun to the Nelder-Mead Simplex method. The basic idea is to reset the trial point near the fixed boundary if it goes outside. Now this took me much longer to program, as the paper is not as clear, even if the method is still relatively simple. But worst, after a day of deciphering the paper and programming the complex method, I find out that it does not works so well: it does not manage to minimise a simple multidimensional quadratic with a simple bound constraint: f(X)=sum(X)^2 with X >= 0 where X is an N-dimensional vector. In the end, I don't want to work with such a simple function, but it is a good simple test to see if the method is really working or not. Here is how the function looks with N=2:
And if we restrict to x,y >= 0 it becomes:

I suspected an error in my program, so I decided to try with scilab, that has also the Box method as part of their neldermead_search functionality. Scilab also failed to compute the minimum in 8 dimensions of my simple quadratic.
I tried various settings, without ever obtaining a decent result (I expect to find a value near 0).

There is another algorithm that can find a global minimum, also very popular: the Differential Evolution. At first, being a genetic algorithm, you would think it would be complicated to write. But no, the main loop is around 20 lines.

Those 20 lines are a bit more difficult than Nelder-Mead, but still, when I saw that, I understood that "this is a classic algorithm". And it does work with constraints easily. How to do this is explained well in K. Price book "Differential Evolution", and takes only a few lines of code. Here is the result I got in dimension 29:
dim=29 min=1.2601854176573729E-12 genMax=412 x=[3.340096901536317E-8, 8.889252404343621E-8, 7.163904251807348E-10, 9.71847877381699E-9, 2.7423324674150668E-8, 2.4022269439114537E-9, 1.7336434478718816E-11, 7.244238163901778E-9, 1.0013136274729337E-8, 7.412154679865083E-9, 5.4694460144807974E-9, 2.3682413086417524E-9, 4.241739250073559E-7, 4.821920889534676E-10, 2.115396281722523E-9, 8.750883007882899E-8, 2.512011485133975E-9, 4.811507109129279E-9, 1.0752997894113096E-7, 5.120475258343548E-9, 8.404448964497456E-9, 4.1062290228305595E-9, 1.7030766521603753E-8, 5.589430643552073E-9, 8.237098544820173E-10, 3.5796523161196554E-9, 5.186299547466997E-9, 2.430326342762937E-7, 5.493850433494286E-9]

It works well, although there is quite a bit of parameters. I noticed that the strategy was especially important.

Gnome 3, Unity, Crap

After the upgrate to Ubuntu 11.04 I was directly on Unity. Having a dock on the left side is nice, but unfortunately, it has various bugs which makes it sometimes annoying. Also the menu on top like Mac Os X is not a bad idea, but it breaks many apps (for example Picasa). Then there is the scrollbar insanity, it’s almost impossible to click on to scroll, and again breaks in some apps (for example Eclipse). Fortunately one can disable all of that and go back to the old Gnome 2.

Then I decided to try Gnome 3 the past 2 weeks. I was curious that Linus found it so bad. At first it looked nice, probably because it’s different. But very quickly, it becomes annoying: the full screen changed just to switch window with the mouse is a stupid idea. And then installing Gnome 3 in Ubuntu is not such a great idea: Rhythmbox does not work well anymore with it, all my “file open” mapping was gone and replaced with garbage: why does it propose GIMP to open a PDF is beyond me.

Gnome 2 + a nice standard dock + widgets (or the nice unity indicators) + an indexer would have been perfect. Now I am back to KDE 4 which provides all of that, even if it feels a bit bloated.

Carmack & GPGPU programming

Finally someone who shares the same opinion on the current state of GPGPU programming.

John Carmack: On the other hand, we have converted all of our offline processing stuff to ray tracing. For years, the back-end MegaTexture generation for Rage was done with… we had a GPGPU cluster with NVIDIA cards and it was such a huge pain to keep. It was an amazing pain where one system would be having heat problems and would be behaving weird even though we thought they had identical drivers. Something would always be wrong with render farm 12, and whenever we wanted to put in new features it was like “Okay, writing new fragment programs to go into this.” Now, granted I did this just when CUDA was in its infancy. If I did re-implement it with OpenCL or CUDA we wouldn’t have some of these problems, but when I converted all these over to ray tracing there was a number of things that got a lot better. Things that we deal with, for example shadows and reflections that have to be approximated, and were so used to doing with rasterization… we sometimes forget how big of hacks these are. To be able to say I really just want that ray, and tell me what it hit; not do a projection with feathering shadowed edges and whatever the heck else we’re doing there, so much of the code got so much easier. If it’s a choice of… now that we have these awesome multi-core x86 CPUs where we can get 24 threads in commodity boxes… it’s true that one GPU card can do more ray tracing than one 24 thread x86 box, but it’s not multiples more and if it’s just a matter of buying more $2000 boxes, it makes the development, maintenance, and upkeep much better. While everyone in high performance computing is all “rah-rah” GPUs right now, I’ve come full circle back around to saying the fact that we can get massive amounts of x86 cores and threads… it wont win on FLOPS/watt or FLOPS/volume, but in terms of results per developer hour it is much, much better.

exp(y*log(x)) Much Faster than Math.pow(x,y)

Today I found out that replacing Math.pow(x,y) by Math.exp(yMath.log(x))* made me gain 50% performance in my program. Of course, both x and y are double in my case. I find this quite surprising, I expected better from Math.pow.

Best Headphones Ever?

I just miraculously received some AKG Q701 headphones. I never tried really high end headphones before those. I was used to my Sennheiser HD555, which I thought were quite good already.

I always thought there was not much difference between let’s say a $100 headphone and a $500 one, and that only real audiophiles would be able notice the difference: people like Jake who can distinguish artefacts in 320kbps mp3, but not people like me. I was wrong. Those headphones make a real difference, everything seems very clear, and it doesn’t feel like listening music through headphones at all, it is much closer to a good hi-fi stereo sound on the ear.

SIMD and Mersenne-Twister

Since 2007, there is a new kind of Mersenne-Twister (MT) that exploits SIMD architecture, the SFMT. The Mersenne-Twister has set quite a standard in random number generation for Monte-Carlo simulations, even though it has flaws.

I was wondering if SFMT improved the performance over MT for a Java implementation. There is actually on the same page a decent Java port of the original algorithm. When I ran it, it ended up slower by more than 20% than the classical Mersenne-Twister (32-bit) on a 64-bit JDK 1.6.0.23 for Windows.

This kind of result is not too unsurprising as the code is more complex. But this shows that Java is unable to use SIMD instructions properly, which is still a bit disappointing.

XORWOW L'ecuyer TestU01 Results

Nvidia uses XorWow random number generator in its CURAND library. It is a simple and fast random number generator with a reasonably long period. It can also be parallelized relatively easily. Nvidia suggests it passes L'Ecuyer TestU01, but is not very explicit about it. So I've decided to see how it performed on TestU01.

I found very simple to test a new random number generator on TestU01, the documentation is great and the examples helpful. Basically there is just a simple C file to create & compile & run.

XORWOW passes the SmallCrush battery, and according to Marsaglia, it also passes DieHard. But it fails on 1 test in Crush and 3 in BigCrush. By comparison, Mersenne-Twister fails on 2 tests in Crush and 2 in BigCrush. Here are the results of the failures:


========= Summary results of Crush =========

 Version:          TestU01 1.2.3
 Generator:        Xorwow
 Number of statistics:  144
 Total CPU time:   00:50:15.92
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
 72  LinearComp, r = 29             1 - eps1
 ----------------------------------------------
 All other tests were passed

========= Summary results of BigCrush =========

 Version:          TestU01 1.2.3
 Generator:        Xorwow
 Number of statistics:  160
 Total CPU time:   06:12:38.79
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
  7  CollisionOver, t = 7            1.6e-6
 27  SimpPoker, r = 27                eps 
 81  LinearComp, r = 29             1 - eps1
 ----------------------------------------------
 All other tests were passed

The CUDA Performance Myth

There is an interesting article on how to generate efficiently the inverse of the normal cumulative distribution on the GPU. This is useful for Monte-Carlo simulations based on normally distributed variables.

Another result of the paper is a method (breakless algorithm) to compute it apparently faster than the very good Wichura’s AS241 algorithm on the CPU as well keeping a similar precision. The key is to avoid branches (if-then) at the cost of not avoiding log() calls. As the algorithm is very simple, I decided to give it a try in Java.

Unfortunately I found out that in Java, on 64 bit machines, the breakless algorithm is actually around twice slower. Intrigued, I tried in Visual C++ the same, and found this time it was 1.5x slower. Then I tried in gcc and found that it was 10% faster… But the total time was significant, because I had not applied any optimization in the gcc compilation. With -O3 flag AS241 became much faster and we were back at the Java result: the breakless algorithm was twice slower. The first result is that the JITed java code is as fast as optimized C++ compiled code. The second result is that the authors did not think of compiling with optimization flag.

Then I decided to benchmark the CUDA float performance of similar algorithms. The CUDA program was 7x faster than the double precision multithreaded CPU program. This is comparing Nvidia GT330m vs Core i5 520m and float precision vs double precision on a naturally parallel problem. This is very far from the usually announced x80 speedup. Of course if one compares the algorithms with GCC single threaded no optimization, we might attain x50, but this is not a realistic comparison at all. I have heard that double precision is 8x slower on the GPU when compared to float precision: the difference then becomes quite small. Apparently Fermi cards are much faster, unfortunately I don’t have any. And still I would not expect much better than 10x speedup. This is good but very far from the usually advertised speedup.

Firefox 4 Is Great

I temporarily abandonned Firefox for Chrome/Chromium. I am now back at using Firefox as Firefox 4 is as fast or faster than Chrome and seems more stable, especially under linux. Also it does not send anything to Google and there is bookmark sync independently of Google.

I am impressed that Mozilla managed to improve Firefox that much.

Another Look at Java Matrix Libraries

A while ago, I was already looking for a good Java Matrix library, complaining that there does not seem any real good one where development is still active: the 2 best ones are in my opinion Jama and Colt.

Recently I tried to price options via RBF (radial basis functions) based on TR-BDF2 time stepping. This is a problem where one needs to do a few matrix multiplications and inverses (or better, LU solve) in a loop. The size of the matrix is typically 50x50 to 100x100, and one can loop between 10 and 1000 times.Out of curiosity I decided to give ojalgo and MTJ a chance. I had read benchmarks (here about jblas and here the java matrix benchmark) where those libraries performed really well.On my core i5 laptop under the latest 64bit JVM (Windows 7), I found out that for the 100x100 case, Jama was actually 30% faster than MTJ, and ojalgo was more than 50% slower. I also found out that I did not like ojalgo API at all. I was quite disappointed by those results.

So I tried the same test on a 6-core Phenom II (ubuntu 64bit), Jama was faster than MTJ by 0-10%. Ojalgo and ParallelColt were slower than Jama by more than 50% and 30%.

This does not mean that ojalgo and ParallelColt are so bad, maybe they behave much better than the simple Jama on large matrices. They also have more features, including sparse matrices. But Jama is quite a good choice for a default library, MTJ can also be a good choice, it can be faster and use less memory because most methods take the output matrix/vector as a parameter. Furthermore MTJ can use the native lapack and blas libraries for improved performance. The bigger the matrices, the most difference it will make.

RunJamaMTJMTJ native
10.1600.2400.140
20.0860.2000.220
100.0830.0890.056

(On a Phenom II under Ubuntu 10.10 64-bit)

Previous

Next