Monte-Carlo Parallelization: to vectorize or not?
When writing a Monte-Carlo simulation to price financial derivative contracts, the most straightforward approach is to code a loop over the number of paths, in which each path is fully computed. Inside the loop, a payoff function takes the path and computes the present value of the contract on that path. The present values are recorded to produce the Monte-Carlo statistics (mean, standard deviation). I ignore here any potential callability of the payoff, which may still be addressed with some work-arounds in this setup. The idea can be schematized by the following Go code:
for i := 0; i < numSimulations; i++ {
    pathGenerator.ComputeNextPath(&path) // path contains an array of n time-steps of float64
    pathEvaluator.Evaluate(&path, output)
    statistics.RecordValue(output)
}
The alternative is to vectorize across the paths: compute all the paths first, and evaluate the payoff on all of them at once.
pathGenerator.ComputeAllPaths(&paths) // paths contains an array of n time-steps of vectors of size numSimulations
pathEvaluator.EvaluateAll(&paths, output)
statistics.RecordValues(output)
The latter approach sounds nice, but may easily blow up the memory if all the paths are stored in memory. One solution is to store only packages of paths, and run the simulation multiple times, being careful about where we restart the random number sequence along the way. It is possible to optimize memory further by not keeping each full path in memory, but only the current path values, along with a list of vector variables which track the payoff path dependency. This requires a special Brownian-bridge implementation in the context of Sobol quasi-random numbers, as described in Jherek Healy's book.
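As a sketch of the packages idea, in the same schematic style as the snippets above (the SkipTo and ComputePaths methods are hypothetical):
packageSize := 1024 // hypothetical package size, chosen so that one package of paths fits in memory
for start := 0; start < numSimulations; start += packageSize {
    size := packageSize
    if start+size > numSimulations {
        size = numSimulations - start
    }
    pathGenerator.SkipTo(start)              // hypothetical: position the random sequence at path number 'start'
    pathGenerator.ComputePaths(&paths, size) // hypothetical: fill 'paths' with 'size' paths at each time-step
    pathEvaluator.EvaluateAll(&paths, output)
    statistics.RecordValues(output[:size])
}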
We can thus summarize the various approaches as (A) iterative on scalars, (B) fully vectorized, and (C) iterative on vectors. I intentionally abuse the language here, as the scalars are not true scalars but one path comprised of values at all relevant times; in this terminology, a scalar means a single path.
Which one is most appropriate?
I first heard about this debate at a QuantLib user meeting a long time ago. Alexander Sokol was arguing for (B), as it simplified the implementation of AAD, while Peter Caspers was more of an (A) kind of guy. At the time, I had not fully grasped the appeal of (B), as it seemed more complex to put in place properly.
It is interesting to think about those in terms of parallelization. At first, it may seem that (A) is simple to parallelize: you just process N/M packages, where M is the number of processors and N the total number of paths. Unfortunately, such a design implies a stateful payoff, which is not necessarily trivial to make thread-safe (actually very challenging). So the first drawback is that you may have to limit the parallelization to the path generation and perform the payoff evaluation in a single thread. And if you think about it some more, you notice that while the first package is being computed, the payoff evaluation will just wait there. Sure, other packages will be computed in parallel, but this may again blow up the memory, as you essentially keep all the paths of each package in memory. In practice, we would thus need to compute each path in some sort of thread pool, and push the paths to a blocking queue of fixed length, read by a single thread which takes a path from the queue, evaluates the payoff on it, and pushes it back to a free queue in order to recycle the path. Alternatively, some sort of ring buffer could be used for the paths. One also has to be quite careful that the random number generator skips to the correct position, in order to keep the same sequence of random numbers as in a fully single-threaded approach.
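A rough sketch of that design in Go, where the blocking path queue and the free queue of recycled path buffers map naturally onto channels; this is a fragment in the same schematic style as above (the sync package is assumed imported, and NewPath, Clone and SkipTo are hypothetical):
queueCapacity := 2 * numWorkers              // arbitrary; bounds the number of paths alive in memory
pathQueue := make(chan *Path, queueCapacity) // fixed-length blocking queue of computed paths
freeQueue := make(chan *Path, queueCapacity) // recycled path buffers
for i := 0; i < queueCapacity; i++ {
    freeQueue <- NewPath(numTimeSteps) // hypothetical constructor pre-allocating a path buffer
}
var wg sync.WaitGroup
for w := 0; w < numWorkers; w++ {
    wg.Add(1)
    go func(worker int) { // path generation runs in a pool of goroutines
        defer wg.Done()
        gen := pathGenerator.Clone() // hypothetical: one generator per worker
        for i := worker; i < numSimulations; i += numWorkers {
            gen.SkipTo(i) // keep the same random sequence per path as in the single-threaded run
            path := <-freeQueue
            gen.ComputeNextPath(path)
            pathQueue <- path
        }
    }(w)
}
go func() { wg.Wait(); close(pathQueue) }()
for path := range pathQueue { // payoff evaluation stays single-threaded
    pathEvaluator.Evaluate(path, output)
    statistics.RecordValue(output)
    freeQueue <- path // recycle the path buffer
}
Note that the paths come off the queue in a nondeterministic order; the mean and standard deviation do not depend on the order, and the skipping keeps the random numbers of each individual path identical to the single-threaded run.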
Taking parallelization into account, (A) is not such a trivial approach anymore. In contrast, (B) or (C) parallelize the payoff evaluation by design, making the parallelization of the full simulation much easier to implement. Arguably, the package-of-paths approach may consume less memory than (C), as (1) the path generation usually involves many intermediate time-steps which may be discarded (kept only in per-core caches), while in a vectorized approach the full vector of those intermediate time-steps may be needed (typically for a classical Brownian-bridge variance reduction), and (2) the payoff evaluates scalars (its state is thus comprised of scalars). A clever vectorized payoff evaluation would however reuse most vectors, so the memory reduction may not be real in practice.
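For illustration, a vectorized payoff evaluation can be split across cores along the simulation dimension; EvaluateRange below is a hypothetical method evaluating the paths in [from, to) and writing to the corresponding slice of output:
chunk := (numSimulations + numWorkers - 1) / numWorkers
var wg sync.WaitGroup
for w := 0; w < numWorkers; w++ {
    from := w * chunk
    if from >= numSimulations {
        break
    }
    to := from + chunk
    if to > numSimulations {
        to = numSimulations
    }
    wg.Add(1)
    go func(from, to int) {
        defer wg.Done()
        pathEvaluator.EvaluateRange(&paths, from, to, output) // hypothetical: evaluates paths in [from, to), writing to output[from:to]
    }(from, to)
}
wg.Wait()
statistics.RecordValues(output)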
Now, if we include callability or particle-method kinds of evaluation, full vectorization, keeping only the payoff (vector) variables in memory and building up the path step by step, may be the most effective, as described in Jherek Healy's book.
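A very rough sketch of that step-by-step vectorized style, with hypothetical AdvanceTo, UpdateState and Payoffs methods; only the current vector of path values and the payoff vector variables live in memory at any time:
current := make([]float64, numSimulations) // current values of the underlying across all paths
payoffState.Reset(numSimulations)          // hypothetical: allocates the vector variables tracking the path dependency
for t := 0; t < numTimeSteps; t++ {
    pathGenerator.AdvanceTo(t, current) // hypothetical: fills 'current' with the simulated values at time-step t
    payoffState.UpdateState(t, current) // hypothetical: updates the payoff vector variables from 'current'
}
payoffState.Payoffs(output) // present value of the contract on each path
statistics.RecordValues(output)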
To conclude, not vectorizing the path generation and payoff evaluation would be a design mistake nowadays.