The use of erfcx instead of direct erfc or CDF in a Black-Scholes implied volatility solver leads to gain in accuracy and performance in general. But which erfcx should we use?
This note compares practical erfcx implementations for Rust implied volatility solvers:
Cody: the Cody rational approximation exposed by the jaeckel = 0.2.0 crate during this comparison, originally coming from Netlib.
Johnson: Steven G. Johnson’s Faddeeva implementation, as exposed by errorfunctions = 0.2.0, faddeeva-sys = 0.1.0, and Julia SpecialFunctions.erfcx(::Float64)
The Commons-style implementation is a local Rust port of the Apache Commons Numbers/BoostErf approach. It computes the small central region through the local erf rational approximation and uses exp_m1(x*x) to avoid cancellation in exp(x^2) * erfc(x) near zero. For positive inputs it uses several rational subintervals: a central polynomial/rational erf branch, erfc-style rational branches around 0.5..1.5, 1.5..2.5, and 2.5..4, and an asymptotic rational in 1/x^2 beyond that. For very large positive x, it returns 1 / (sqrt(pi) * x). For negative x, it uses the reflected form with a split-square exponential, returns Inf below the finite overflow cutoff, and otherwise computes 2 exp(x^2) - erfcx(|x|) in a way that preserves more bits near the overflow boundary.
The Cody implementation used here is the jaeckel crate’s Cody/CALERF-style implementation. It has three main positive-side rational regions: AB for |x| <= 0.46875, CD for 0.46875 < |x| <= 4, and PQ as an asymptotic rational in 1/x^2 for |x| > 4. It uses smoothed exponentials by splitting x onto a 1/16 grid before applying the residual exponential. For negative x, it reflects with erfcx(-x) = 2 exp(x^2) - erfcx(x) and clamps below its XNEG threshold to f64::MAX.
Cody near-overflow spike.
The rare Cody spike in the finite negative overflow-boundary window is not caused by using the wrong rational interval but is deliberate. Cody/CALERF does use the PQ asymptotic rational for the positive |x| > 4 part, but the negative-side fix-up is still the reflected 2 exp(x^2) - erfcx(|x|) form with an XNEG overflow guard. The published IEEE double CALERF constant is the rounded XNEG = -26.628D0; the public Rust crate uses a refined XNEG = -26.6287357137514 with a source comment noting that the original value was -26.628. The plot spike is therefore boundary/guard behavior inherited from the CALERF-style negative branch, with a refined but still conservative cutoff near the true finite Float64 overflow boundary. A literal Cody reference would clamp even earlier rather than use a separate high-accuracy negative-tail expansion in that range.
errorfunctions = 0.2.0 is a close Rust translation of Steven G. Johnson’s Faddeeva C/C++ package. Its real erfcx implementation uses:
a 100-interval Chebyshev table after the map y = 4 / (4 + x) for 0 <= x <= 50
a continued-fraction expansion for x > 50
the reflection identity erfcx(-x) = 2 exp(x^2) - erfcx(x) for negative x
Julia SpecialFunctions.erfcx(::Float64) calls Faddeeva_erfcx_re from libopenspecfun, and faddeeva-sys binds the same Johnson C symbol. On the sampled Float64 grid below, errorfunctions, faddeeva-sys, and Julia SpecialFunctions were bit-identical.
Generic Erfcx Accuracy
Reference: Julia BigFloaterfcx(x) at 256-bit precision, rounded back to Float64. The grid used 7021 total samples, of which 6272 had finite rounded Float64 references.
backend
samples
invalid
max_ulp
max_rel
max_abs
worst_x
Commons
6272
0
1
1.2e-16
4.990e291
-2.6600845000e1
Cody
6272
0
337
3.7e-14
6.726e294
-2.6628735714e1
Johnson
6272
0
491
5.5e-14
1.614e212
-2.2769845000e1
The large absolute errors occur where erfcx(x) itself is enormous. The relative errors are still small, but the ULP differences matter because the IV tests include very sensitive tiny and saturated cases.
512-ULP Local Window Accuracy
The coarse grid above is useful for broad coverage, but it does not answer the more local question: what happens across consecutive representable inputs near the sensitive regions? I ran a temporary Rust verifier that generated 512 consecutive Float64 inputs around 37 representative centers, including the old coarse-grid worst tail locations and the finite overflow boundary. References are Julia BigFloaterfcx(x) at 256-bit precision rounded back to Float64. ULP error is the positive-Float64 bit-distance from that rounded reference.
The windows covered 18,944 input values. The finite-reference tables below use 18,176 values; 768 inputs from the overflow-boundary windows round to Inf and are excluded from the finite-error percentiles.
Global local-window stats:
backend
samples
invalid
exact_bits
p50
p95
p99
max
max_rel
worst_x
Commons
18176
0
13682
0
1
1
2
3.6e-16
2.5000000000e0
Cody
18176
0
10375
0
2
3
41238
4.6e-12
-2.6628735714e1
Johnson
18176
0
6490
1
395
509
511
5.7e-14
-2.6628700000e1
The p95/p99 picture is clearer than the max alone. Commons-style stays within 1 ulp at p99 and within 2 ulps at max. Cody is usually close, but has a rare near-overflow-tail spike. Johnson is the opposite shape: no huge single outlier here, but the negative tail is broadly hundreds of ulps away.
In the mid negative tail around x = -24, Johnson is consistently about 510 ulps away while Cody and Default/Commons stay close to the rounded BigFloat reference.
region
backend
samples
p95
p99
max
negative near-overflow tail
Commons
256
1
1
1
negative near-overflow tail
Cody
256
19083
36126
41238
negative near-overflow tail
Johnson
256
487
506
511
negative tail
Commons
512
1
1
1
negative tail
Cody
512
4
5
5
negative tail
Johnson
512
510
510
510
Zone definitions:
neg-near-overflow x < -26
neg-tail -26 <= x < -6.1
neg-transition -6.1 <= x < -0.5
central -0.5 <= x <= 0.5
pos-core 0.5 < x < 4
pos-tail 4 <= x <= 50
pos-far-tail x > 50
Per-zone local-window stats:
zone
backend
samples
invalid
exact_bits
p50
p95
p99
max
max_rel
worst_x
neg-near-overflow
Commons
3584
0
2682
0
1
1
1
1.9e-16
-2.6e1
neg-near-overflow
Cody
3584
0
1611
1
2
3
41238
4.6e-12
-2.7e1
neg-near-overflow
Johnson
3584
0
34
218
460
496
511
5.7e-14
-2.7e1
neg-tail
Commons
3584
0
2885
0
1
1
1
2.1e-16
-6.1e0
neg-tail
Cody
3584
0
2069
0
2
4
5
6.3e-16
-2.4e1
neg-tail
Johnson
3584
0
872
26
509
510
510
5.7e-14
-2.4e1
neg-transition
Commons
2048
0
1446
0
1
1
2
2.3e-16
-5.0e-1
neg-transition
Cody
2048
0
1155
0
1
2
3
3.4e-16
-5.0e-1
neg-transition
Johnson
2048
0
1291
0
15
24
26
3.6e-15
-6.1e0
central
Commons
2049
0
1649
0
1
1
2
3.6e-16
5.0e-1
central
Cody
2049
0
1313
0
1
2
3
3.6e-16
-5.0e-1
central
Johnson
2049
0
902
1
3
4
5
7.2e-16
2.5e-1
pos-core
Commons
2559
0
1772
0
1
1
2
3.6e-16
2.5e0
pos-core
Cody
2559
0
982
1
3
4
5
6.6e-16
2.5e0
pos-core
Johnson
2559
0
793
1
2
3
4
7.2e-16
5.0e-1
pos-tail
Commons
2049
0
1294
0
1
1
1
2.0e-16
5.0e1
pos-tail
Cody
2049
0
1291
0
1
1
1
2.0e-16
5.0e1
pos-tail
Johnson
2049
0
1047
0
1
2
2
4.1e-16
2.0e1
pos-far-tail
Commons
2303
0
1954
0
1
1
1
1.9e-16
1.0e8
pos-far-tail
Cody
2303
0
1954
0
1
1
1
1.9e-16
1.0e8
pos-far-tail
Johnson
2303
0
1551
0
1
1
2
3.1e-16
1.0e2
Local-window timing on the same 18,944 samples, best of five passes with 1000 sweeps per pass:
backend
ns/call
Commons
9.86
Cody
10.04
Johnson
6.28
So yes: the tail is the right place to look. The worst Johnson behavior is the negative tail and near-overflow tail. Cody has a rare sharper spike right at the finite overflow boundary. Commons is the most stable over these consecutive-ULP windows.
Generic Erfcx Timing
Mixed grid, 24003 samples, best of five timing passes, 100 sweeps per pass.
backend
ns/call
Commons
6.29
Cody
6.55
Johnson/errorfunctions
6.05
Johnson/faddeeva-sys
4.92
Julia SpecialFunctions
4.62
The Julia number is measured inside one Julia process launched by the Rust diagnostic program; it is a Julia-loop function-call timing, not per-call Rust-to-Julia FFI overhead.
Impact on ThiopheneIV Implied Volatility Solver
Values are max error in ULPs of reference total volatility on a specific set of options (see the ThiopheneIV paper for the definition of the exact zones). The rust implementation of the solver is available here
ThiopheneIV max ulp
backend
CLY-3D
CLY-20
CLY-80
Jaeckel
Market
Corners
Stress
HighVol
Cody
190
52
8
142
520
557
285
2
Commons
133
49
8
63
177
329
138
2
Johnson
400
71
16
97
1210
537
484
3
ThiopheneIV+ max ulp
backend
CLY-3D
CLY-20
CLY-80
Jaeckel
Market
Corners
Stress
HighVol
Cody
23
5
4
14
30
41
33
2
Commons
23
5
4
13
29
41
33
2
Johnson
22
5
7
13
29
41
32
3
The polished ThiopheneIV+ path is robust across erfcx choices. The unpolished fast solver is where Commons is clearly better for benchmark.
Impact on ThiopheneIV Performance
ThiopheneIV ns/call
backend
CLY-3D
CLY-20
CLY-80
Jaeckel
Market
Corners
Stress
HighVol
Cody
146
146
148
146
147
155
155
150
Commons
155
147
147
156
153
159
157
151
Johnson
179
180
188
211
203
185
204
161
ThiopheneIV+ ns/call
backend
CLY-3D
CLY-20
CLY-80
Jaeckel
Market
Corners
Stress
HighVol
Cody
199
200
199
185
193
210
205
187
Commons
202
203
197
197
200
214
210
189
Johnson
233
229
245
249
251
240
254
210
Cody is slightly faster than Commons for ThiopheneIV, but Commons is close and much more accurate. Johnson is slower in the solver context despite competitive standalone erfcx timing, most likely because its branch/table structure interacts differently with the full pricing paths.
Conclusion
Use the Commons erfcx implementation. It gives the best generic erfcx accuracy against the BigFloat reference among the tested Rust-callable implementations, the most stable 512-consecutive-ULP local-window profile, the best unpolished ThiopheneIV accuracy, and passes the strict reference/corner suite. Cody remains a good fast approximation, but it is of slighly lower accuracy (which impacts non-polished ThiopheneIV implied vols) and has a near-overflow-tail spike (very likely without impact on my use case).
Choi, Huh and Su have a very good paper entitled Tighter uniform bounds for Black–Scholes implied volatility and the applications to root-finding. What’s particularly great is that it gives both a decent lower bound and a proof a monotone convergence using Newton’s method starting from this lower bound.
The industry standard for solving the Black-Scholes implied volatility is Peter Jäckel Let’s be rational. I was wondering if one could not create a robust mathematically backed solver at least as fast and accurate as Jäckel’s algorithm.
It turns out that it is not so difficult to extend the monotone convergence proof to the Euler-Chebyshev method on the logarithm of normalized price. This is however not numerically robust enough when the normalized call option price (c) is close to 1: c is between 0 and 1, and when close to 1, the misbehaves numerically due to the limited floating point accuracy. It is not a convergence issue, but a true floating point arithmetic issue. A remedy is to switch to the log of the complementary normalized price (1-c). But then the Euler-Chebyshev method is not monotone on it. Fortunately, it turns out that Halley’s method is monotone on the logarithm of the complementary normalized price. Overall, when combined, the methods lead to a much faster pricing: 3 iterations are enough to reach machine epsilon level of accuracy.
Newton, Euler-Chebyshev and Halley's methods from Iterative Methods for the Solution of Equations by Traub (1964).
For a truly robust solver, this is not enough. There are more floating point or practical convergence issues: a Bachelier based refinement for tiny prices is necessary, and polishing via the very accurate Jäckel Black-Scholes price function allows to reach closer to machine epsilon accuracy. This last polishing step end up relatively expensive compared to the rest (20% of computational time).
Above, I have described the journey behind creating this new solver. The reverse journey also tells an interesting tale. We could start from the Black-Scholes option price, and notice that a really accurate Black-Scholes formula implementation (close to machine epsilon) is not trivial. The textbook formula based on the normal CDF, often used in various software systems used at banks or hedge funds, is not very accurate in many cases. One should really use Peter Jäckel’s implementation everywhere, otherwise it is pointless to attempt to solve with such a high accuracy (in the paper, I thus make the Jäckel polishing step optional).
Error in the Black-Scholes price with different formulas: textbook CDF, more advanced erfcx, and Jäckel's expansion. On those realistic examples, we hit Jäckel's expansion zone.
I named the resulting solver with the somewhat cryptic name Thiophene, as Thiophene is a chemical compound with formula \[ C_4 - H_4 - S \], and C-H-S are the initials of Choi, Huh and Su.
Several years ago, I had explored accuracy and performance of different ways to imply the Black-Scholes volatility. Jherek Healy proposed some improvements over my naive algorithm on his blog. Recently, a Linkedin post mentioned a new paper from Wolfgang Schadner which presents an almost explicit formula for the implied volatility. Almost because it actually relies on some implementation of the quantile function for the inverse Gaussian distribution: it requires a specialized numerical algorithm in practice, such as the one from Giner and Smyth.
The idea is actually not new, the same formula has been used back in 2016 by Jaehyuk Choi in his github repository.
If we have a good inverse Gaussian distribution function, the algorithm is clean and simple:
using Distributions
export impliedVolatilityIG
const N = Normal()
function impliedVolatilityIG(isCall::Bool, price::T, forward, strike, timeToExpiry, df) where {T}
c, ex = normalizePrice(isCall, price, forward, strike, df)
if c >=1/ ex || c >=1 throw(DomainError(c, string("price higher than intrinsic value ", 1/ ex)))
elseif c <=0 throw(DomainError(c, "price is negative"))
end x = log(ex)
returnif abs(x) <4* eps(T)
2* quantile(N, (c +1) /2) / sqrt(timeToExpiry)
else mu =2/ abs(x)
u = cquantile(InverseGaussian(mu), c)
2/ sqrt(u * timeToExpiry)
endend
function normalizePrice(isCall::Bool, price, f, strike, df)
c = price / (f * df)
ex = f / strike
if!isCall
if ex <=1 c = c +1-1/ ex # put call parityelse#duality + put call parity c = ex * c
ex =1/ ex
endelseif ex >1# use c(-x0, v) c = (f * (c -1) + strike) / strike # in out duality, c = ex*c + 1 - ex ex =1/ ex
endendreturn c, ex
end
I was wondering how accurate it was especially for very out-of-the-money options, and how fast it was in practice. Below I compare various implementations in Julia:
Peter Jaeckel, which is arguably the industry standard.
Jherek Healy, which consists in (log-)Householder iterations on top of a good explcit initial guess.
The inverse Gaussian approach using Julia’s Distribution.jl package (as in the above code).
The inverse Gaussian approach using a basic port of statsmod R algorithm, adapted to compute the complementary quantile function. This corresponds to a different implementation of cquantile(InverseGaussian(mu),c).
I consider call optiun prices for a fixed strike K = 200 for a forward F = 100, and a time to maturity T = 1.0 year, varying the volatility from 0.02 to 4.0 in steps of 0.01. This constitute 399 samples.
Algorithm
vol RMSD
vol MAXAD
Price MAXAD
Rel. Price MAXAD
Time
Jaeckel
8.66e-16
4.00e-15
4.26e-14
1.81e-12
104 μs
Healy
1.57e-15
9.77e-15
8.53e-14
3.02e-10
107 μs
IG Distribution
7.37e-16
5.41e-15
2.84e-14
8.99e-11
246 μs
IG statsmod
7.77e-16
7.55e-15
8.24e-13
3.03e-10
311 μs
The inverse Gaussian approach is competitive but still twice slower than Peter Jaeckel’s algorithm on this example. How does the error in volatility actually look like?
Error in impled volatility of the various algorithms
The accuracy of the inverse Gaussian looks as good as Jackel’s algorithm, when using Julia’s package Distibutions.jl (it is slightly less accurate towards low vol, but slightly more accurate for high vols).
It is revealing to also look at the case of varying strikes for a fixed volatility. We consider a volatility fixed to 0.1 (or 10%) and vary the strike from 100 to 500 by steps of 1.
Algorithm
vol RMSD
vol MAXAD
Price MAXAD
Rel. Price MAXAD
Time
JAECKEL
6.50e-16
2.73e-15
7.11e-15
6.59e-12
154 μs
HEALY
6.56e-16
2.73e-15
3.55e-14
6.59e-12
110 μs
IG Distribution
1.21e-15
7.76e-15
1.78e-14
1.95e-11
1704 μs
IG statsmod
8.68e-16
4.27e-15
1.06e-14
1.32e-11
288 μs
On this example the Julia Distributions.jl leads to a performance degradation by a factor of 10 and is less accurate. It would be worth investigating exactly where lies the bottleneck, as statsmod custom implementation keeps a good performance profile.
Error in impled volatility of the various algorithms
Because it originates from the CG community, I had assumed that this was faster than the more classic scrambling ACM Algorithm 823 by Hickernell and Hong. I was wrong. It may be faster for specific use cases, but in general it is (much) slower, especially for the kind of (quasi) Monte-Carlo simulation that we use in quantitative finance.
The slowness is because we can’t just generate the next number from the sequence, we must jump to each number instead. The shuffling is implemented by jumping to some hash based location. And jumping is somewhat significantly slower than getting the next number. Furthermore, the grey code optimization can not be applied anymore.
In my code, for arbirary dimension (on the example to integrate successively a function from dimension 1 to dimension 5000 using a 100 steps on the dimension side), the Burley implementation (actually optimized to skip faster a la Quantlib instead of the original code from Burley) is 3 times slower than Algorithm 823.
It is also not obviously better on the example of Ilya M. Sobol’, Danil Asotsky in Construction and Comparison
of High-Dimensional Sobol’ Generators (Equation 6.1 and Figures 8 and 9), also reproduced by Peter Caspers.
$$
I_1 = \int_{[0,1]^d} \prod_{i=1}^d \left( 1 + 0.01(x_i - 0.5) \right) dx_1 dx_2 \dots dx_d .
$$
Absolute error in integral I1 computed with different Sobol sequences
In the Figure above, I did not use 30031 points of the sequence as in the original test, but the closest power of 2, that is 32768, which is supposed to be more optimal (the choice of 30031 is strange but actually does not really change the outcome). Except for Sobol-Harase, all use Joe Kuo new-joe-kuo-6.21201 direction numbers. Sobol-Harase uses Shin Harase niederreiter-nut-s21201.txt, which interestingly, does not fare as well as the more classic Joe Kuo set.
The results vary quite a bit depending on the random number generator used, or its seed. In the above, the Well1024a generator with default seed for the seeds of Burley’s algorithm has the lowest maximum error, but if we change the seed to 123456, this is not true anymore. Algorithm 823 (Owen and Owen-Faure-Tezuka) is not worse, and its results are also dependent on the seed (although I am under the impression that the impact is not as large). This is something that was not shown in Peter Caspers Blog.
I struggled a bit having Jack Audio Connection Kit working in Opensuse Tumbleweed.
My error was to install the jack package. The solution is actually extremely simple: use pipewire-jack instead of jack.
The state of the art of Sobol scrambling has changed slightly recently, thanks to the paper from Brent Burley of Walt Disney Studios Practical Hash-based Owen Scrambling.
Before that, ACM Algorithm 823 by Hickernell and Hong was the usual reference. Brent Burley’s algorithm is supposedly both faster and with better properties. In particular, it performs both shuffling and scrambling.
Peter Caspers implemented the new algorithm in Quantlib. It’s neat to have some decent implementation in CPP. In applying the algorithm of the paper, a few sub-optimal choices were made:
The base RNG for seeds (for the shuffling and scrambling) is Mersenne-Twister. Contrary to what I thought at the beginning, it is not used for skipping/jumping as the seeds are only set in the initialization, and are then constant for all numbers in the sequence. The skipping/jumping logic is however absent from the Quantlib CPP code, which is a bit unfortunate, but is almost trivial to code (just update the nextSequenceCounter_ variable).
There is a 4-by-4 seeding logic which looks odd. The paper code processes 4 dimensions as it is specialized to CG, although even then, it processes those 4 dimensions slightly differently: a much simpler hash function is used (it is given in the code URL inside the paper) and the seed value is not updated by the hash function: the seed is merely hashed with the dimension number, for 4 dimensions.
I don’t really see what there is to gain to follow this 4-by-4 seeding. It would be simpler and less concerning to simply use the RNG to produce N seeds where N is the number of dimensions (instead of N/4). The only possible gain with the 4-by-4 seeding is some memory (as only N/4 seeds are kept in memory), but the total amount is not significant: for 21000 dimensions, keeping all in memory takes around 656 KB (yes, 640 KB is not all we need!). The cost of 4-by-4 seeding is a potentially less random scrambling.
On the same subject, there is another new low discrepancy sequence which appears to have some good properties as well as being as fast to generate as Sobol, it’s called SZ Sequence.
Recently, I have spent some time on simple neural networks. The idea is to employ them as universal function approximators for some problems appearing in quantitative finance. There are some great papers on it such as the one from Liu et al. (2019) or Horvath et al. (2019) Deep Learning Volatility or Rosenbaum & Zhang (2021).
Incidentally, I met Liu back when I was finishing my PhD in TU Delft around 2020.
I thought I would try out what Julia offers in terms of library for neural networks. Being a very trendy subject, and Julia a modern language for the scientific community, I had imagined the libraris to be of good quality (like the many I have been using in the past). Surprisingly, I was wrong.
First, I tried SimpleChain, which just crashed (core dumped!) on a very simple example. I did not bother finding the root cause, I decided to look for another library. I then tried LUX. The execution kept going forever without returning with the @compile keyword. I probably was not being lucky, even though my code was only 10s of lines long. So I decided to use Flux, which actually is as simple to use as Pytorch, a well-known library in the Python world.
Things works with Flux, and I do manage to do many experiments. But the performance is not great. This was another surprise: Pytorch was actually faster for many tasks on the CPU (no GPU). For example, to train a multi-layer-perceptron with 4 hidden layers of 200 neurons, Julia was taking several hours, until I hit CTRL+C and launched the same training in Pytorch (which took 15 minutes). I think it may be something with AdamW optimizer and relatively wide (but not really compared to LMM stuff) networks.
Overall, the experience is pretty disappointing. Maybe it shows how much effort has been put in the Python ecosystem.
The optimization is on the x’s instead of the y’s, meaning the strike axis is fixed.
An original approach to avoid spurious modes, although I would have liked more details on it, with concrete examples of the penalty. Penalties are often challenging to get right.
A nice analysis of the asymptotic behavior in the wings, showing that an exponentially quadratic form for extrapolation corresponds to linear slopes in implied variance.
I had also explored a fixed strike axis initially (even back in 2014 - I uploaded some old notes on SSRN around this). I found it to be somewhat unstable back then, but it may well be that I did not put enough efforts into it, especially since optimizing the y’s instead of the x’s allowed for a straightforward use of B-splines, which is very attractive. It is very direct to impose monotonicity with B-splines.
The main advantage of optimizing on the x’s instead of the y’s is that the knots are defined in the more usual strike space.
Now I find interesting that someone else stumbled upon essentially the same idea starting from another point of view. There may be more merits to this parameterization than I thought.
The challenges I found with stochastic collocations were:
spurious spike when the collocation function derivative is close to zero which happens on occasion. This is not really like an extra mode - it is not smooth. Penalty or minimal slope are possible workarounds, but challenging to make robust.
dependency on the choice of knots, especially if nearly exact interpolation is required.
more minor, non-predictable runtime: the non-linear optimization is sometimes very fast, sometimes much slower. This is the case of most arbitrage-free interpolation techniques.
Leif Andersen and Mark Lake recently proposed the use of Non-Uniform Fast Fourier Transform for option pricing via the characteristic function. Fourier techniques are most commonly used for pricing vanilla options under the Heston model, in order to calibrate the model. They can be applied to other models, typically with known characteristic function, but also with numerically solved characteristic function as in the rough Heston model, and to different kind of payoffs, for example variance/volatility swaps, options. The subject has been vastly explored already so what’s new with this paper?
At first, it was not obvious to me. The paper presents 3 different approaches, two where the probability density and cumulative density is first computed at a (not so sparse) set of knots and the payoff is integrated over it. The remaining approach is the classic one presented in the Carr and Madan paper from 1999 as well as in most of the litterature on the subject. The only additional trick is really the use of NUFFT along with some clever adaptive quadrature to compute the integral.
I thought the main cost in the standard Fourier techniques was the evaluation of the characteristic function at many points, because the characteristic function is usually relatively complicated. And, in practice, it mainly is. So what does the new NUFFT algorithm bring? Surely the characteristic function must still be evaluated at the same points.
The NUFFT becomes interesting to compute the option price at many many strikes at the same time. The strikes do not need to be equidistant and can be (almost) arbitrarily chosen. In Andersen and Lake paper, even a very fast technique such as the COS method reaches a threshold of option per second throughput, as the number of strikes is increased, mainly because of the repeated evaluation of the sum over different strikes, so the cost becomes proportional to the number of strikes.
Evaluating at many many points becomes not much more expensive than evaluating at a few points only. This is what opens up the possibilities for the first 2 approaches based on the density.
It turns out that it is not very difficult to rewrite the COS method so that it can make use of the NUFFT as well. I explain how to do it here. Below is the throughput using NFFT.jl Julia package:
Throughput on the Heston model with M=256 points.
While a neat trick, it is however not 100% clear to me at this point where this is really useful: calibration typically involve a small number of strikes per maturity.
According to the litterature, the initial forward variance curve is typically built from the implied volatilities through the variance swap replication: for each maturity, the price of a newly issued variance swap is computed to this maturity, using the implied volatilities for this maturity. Then we may compute the forward variance by interpolating linearly the variance swap prices and differentiating. This leads to the plot below.
Forward variance curves for SPX500 as of October 2024.
One immediate question is how much does the truncation range play a role in this forward variance curve? Above, we plot the replications prices based on filtering the input data up to 10 Delta / 90 Delta for each maturity. Beyond this range, liquidity is usually much lower and the reliability of the input quotes may not be as good. The short end, and not so short forward variance up to 4 years is quite different. The long end is more aligned with something that looks like a basis spread between the two curves.
We may also calibrate the forward variance curve model directly to the implied volatilities during the model calibration, at the same time as the other stochastic volatility parameters (using a time-dependent correlation and vol-of-vol). This time-dependent Heston parameterization has effectively 3 parameters per expiry. In this case, we may want to also filter the input data to focus on the 10 Delta / 90 Delta range as we expect the stochastic volatility model to fit well where it matters the most - where the vanillas are liquid.
In the figure above, we also plot the forward variance curve implied by the model, through a calibration against vanilla options. We can see it is again quite different.
If we were to use the true replication based forward variance curve, we would effectively attempt to fit far away vols in the wings, which somehow seems wrong. Now if the model was able to fit everything well there would be no issue, but with 3 parameters per maturity, a choice has to be made. And the choice of the full replication does not look particularly great, as we fit exactly something that is not liquid.