Which erfcx?

The use of erfcx instead of direct erfc or CDF in a Black-Scholes implied volatility solver leads to gain in accuracy and performance in general. But which erfcx should we use?

This note compares practical erfcx implementations for Rust implied volatility solvers:

  • Commons: the local Rust port of Apache Commons Numbers BoostErf.erfcx
  • Cody: the Cody rational approximation exposed by the jaeckel = 0.2.0 crate during this comparison, originally coming from Netlib.
  • Johnson: Steven G. Johnson’s Faddeeva implementation, as exposed by errorfunctions = 0.2.0, faddeeva-sys = 0.1.0, and Julia SpecialFunctions.erfcx(::Float64)

The Commons-style implementation is a local Rust port of the Apache Commons Numbers/BoostErf approach. It computes the small central region through the local erf rational approximation and uses exp_m1(x*x) to avoid cancellation in exp(x^2) * erfc(x) near zero. For positive inputs it uses several rational subintervals: a central polynomial/rational erf branch, erfc-style rational branches around 0.5..1.5, 1.5..2.5, and 2.5..4, and an asymptotic rational in 1/x^2 beyond that. For very large positive x, it returns 1 / (sqrt(pi) * x). For negative x, it uses the reflected form with a split-square exponential, returns Inf below the finite overflow cutoff, and otherwise computes 2 exp(x^2) - erfcx(|x|) in a way that preserves more bits near the overflow boundary.

The Cody implementation used here is the jaeckel crate’s Cody/CALERF-style implementation. It has three main positive-side rational regions: AB for |x| <= 0.46875, CD for 0.46875 < |x| <= 4, and PQ as an asymptotic rational in 1/x^2 for |x| > 4. It uses smoothed exponentials by splitting x onto a 1/16 grid before applying the residual exponential. For negative x, it reflects with erfcx(-x) = 2 exp(x^2) - erfcx(x) and clamps below its XNEG threshold to f64::MAX.

Cody near-overflow spike.

The rare Cody spike in the finite negative overflow-boundary window is not caused by using the wrong rational interval but is deliberate. Cody/CALERF does use the PQ asymptotic rational for the positive |x| > 4 part, but the negative-side fix-up is still the reflected 2 exp(x^2) - erfcx(|x|) form with an XNEG overflow guard. The published IEEE double CALERF constant is the rounded XNEG = -26.628D0; the public Rust crate uses a refined XNEG = -26.6287357137514 with a source comment noting that the original value was -26.628. The plot spike is therefore boundary/guard behavior inherited from the CALERF-style negative branch, with a refined but still conservative cutoff near the true finite Float64 overflow boundary. A literal Cody reference would clamp even earlier rather than use a separate high-accuracy negative-tail expansion in that range.

errorfunctions = 0.2.0 is a close Rust translation of Steven G. Johnson’s Faddeeva C/C++ package. Its real erfcx implementation uses:

  • a 100-interval Chebyshev table after the map y = 4 / (4 + x) for 0 <= x <= 50
  • a continued-fraction expansion for x > 50
  • the reflection identity erfcx(-x) = 2 exp(x^2) - erfcx(x) for negative x

Julia SpecialFunctions.erfcx(::Float64) calls Faddeeva_erfcx_re from libopenspecfun, and faddeeva-sys binds the same Johnson C symbol. On the sampled Float64 grid below, errorfunctions, faddeeva-sys, and Julia SpecialFunctions were bit-identical.

Generic Erfcx Accuracy

Reference: Julia BigFloat erfcx(x) at 256-bit precision, rounded back to Float64. The grid used 7021 total samples, of which 6272 had finite rounded Float64 references.

backend samples invalid max_ulp max_rel max_abs worst_x
Commons 6272 0 1 1.2e-16 4.990e291 -2.6600845000e1
Cody 6272 0 337 3.7e-14 6.726e294 -2.6628735714e1
Johnson 6272 0 491 5.5e-14 1.614e212 -2.2769845000e1

The large absolute errors occur where erfcx(x) itself is enormous. The relative errors are still small, but the ULP differences matter because the IV tests include very sensitive tiny and saturated cases.

512-ULP Local Window Accuracy

The coarse grid above is useful for broad coverage, but it does not answer the more local question: what happens across consecutive representable inputs near the sensitive regions? I ran a temporary Rust verifier that generated 512 consecutive Float64 inputs around 37 representative centers, including the old coarse-grid worst tail locations and the finite overflow boundary. References are Julia BigFloat erfcx(x) at 256-bit precision rounded back to Float64. ULP error is the positive-Float64 bit-distance from that rounded reference.

The windows covered 18,944 input values. The finite-reference tables below use 18,176 values; 768 inputs from the overflow-boundary windows round to Inf and are excluded from the finite-error percentiles.

Global local-window stats:

backend samples invalid exact_bits p50 p95 p99 max max_rel worst_x
Commons 18176 0 13682 0 1 1 2 3.6e-16 2.5000000000e0
Cody 18176 0 10375 0 2 3 41238 4.6e-12 -2.6628735714e1
Johnson 18176 0 6490 1 395 509 511 5.7e-14 -2.6628700000e1

The p95/p99 picture is clearer than the max alone. Commons-style stays within 1 ulp at p99 and within 2 ulps at max. Cody is usually close, but has a rare near-overflow-tail spike. Johnson is the opposite shape: no huge single outlier here, but the negative tail is broadly hundreds of ulps away.

In the mid negative tail around x = -24, Johnson is consistently about 510 ulps away while Cody and Default/Commons stay close to the rounded BigFloat reference.

region backend samples p95 p99 max
negative near-overflow tail Commons 256 1 1 1
negative near-overflow tail Cody 256 19083 36126 41238
negative near-overflow tail Johnson 256 487 506 511
negative tail Commons 512 1 1 1
negative tail Cody 512 4 5 5
negative tail Johnson 512 510 510 510

Zone definitions:

  • neg-near-overflow x < -26
  • neg-tail -26 <= x < -6.1
  • neg-transition -6.1 <= x < -0.5
  • central -0.5 <= x <= 0.5
  • pos-core 0.5 < x < 4
  • pos-tail 4 <= x <= 50
  • pos-far-tail x > 50

Per-zone local-window stats:

zone backend samples invalid exact_bits p50 p95 p99 max max_rel worst_x
neg-near-overflow Commons 3584 0 2682 0 1 1 1 1.9e-16 -2.6e1
neg-near-overflow Cody 3584 0 1611 1 2 3 41238 4.6e-12 -2.7e1
neg-near-overflow Johnson 3584 0 34 218 460 496 511 5.7e-14 -2.7e1
neg-tail Commons 3584 0 2885 0 1 1 1 2.1e-16 -6.1e0
neg-tail Cody 3584 0 2069 0 2 4 5 6.3e-16 -2.4e1
neg-tail Johnson 3584 0 872 26 509 510 510 5.7e-14 -2.4e1
neg-transition Commons 2048 0 1446 0 1 1 2 2.3e-16 -5.0e-1
neg-transition Cody 2048 0 1155 0 1 2 3 3.4e-16 -5.0e-1
neg-transition Johnson 2048 0 1291 0 15 24 26 3.6e-15 -6.1e0
central Commons 2049 0 1649 0 1 1 2 3.6e-16 5.0e-1
central Cody 2049 0 1313 0 1 2 3 3.6e-16 -5.0e-1
central Johnson 2049 0 902 1 3 4 5 7.2e-16 2.5e-1
pos-core Commons 2559 0 1772 0 1 1 2 3.6e-16 2.5e0
pos-core Cody 2559 0 982 1 3 4 5 6.6e-16 2.5e0
pos-core Johnson 2559 0 793 1 2 3 4 7.2e-16 5.0e-1
pos-tail Commons 2049 0 1294 0 1 1 1 2.0e-16 5.0e1
pos-tail Cody 2049 0 1291 0 1 1 1 2.0e-16 5.0e1
pos-tail Johnson 2049 0 1047 0 1 2 2 4.1e-16 2.0e1
pos-far-tail Commons 2303 0 1954 0 1 1 1 1.9e-16 1.0e8
pos-far-tail Cody 2303 0 1954 0 1 1 1 1.9e-16 1.0e8
pos-far-tail Johnson 2303 0 1551 0 1 1 2 3.1e-16 1.0e2

Local-window timing on the same 18,944 samples, best of five passes with 1000 sweeps per pass:

backend ns/call
Commons 9.86
Cody 10.04
Johnson 6.28

So yes: the tail is the right place to look. The worst Johnson behavior is the negative tail and near-overflow tail. Cody has a rare sharper spike right at the finite overflow boundary. Commons is the most stable over these consecutive-ULP windows.

Generic Erfcx Timing

Mixed grid, 24003 samples, best of five timing passes, 100 sweeps per pass.

backend ns/call
Commons 6.29
Cody 6.55
Johnson/errorfunctions 6.05
Johnson/faddeeva-sys 4.92
Julia SpecialFunctions 4.62

The Julia number is measured inside one Julia process launched by the Rust diagnostic program; it is a Julia-loop function-call timing, not per-call Rust-to-Julia FFI overhead.

Impact on ThiopheneIV Implied Volatility Solver

Values are max error in ULPs of reference total volatility on a specific set of options (see the ThiopheneIV paper for the definition of the exact zones). The rust implementation of the solver is available here

ThiopheneIV max ulp

backend CLY-3D CLY-20 CLY-80 Jaeckel Market Corners Stress HighVol
Cody 190 52 8 142 520 557 285 2
Commons 133 49 8 63 177 329 138 2
Johnson 400 71 16 97 1210 537 484 3

ThiopheneIV+ max ulp

backend CLY-3D CLY-20 CLY-80 Jaeckel Market Corners Stress HighVol
Cody 23 5 4 14 30 41 33 2
Commons 23 5 4 13 29 41 33 2
Johnson 22 5 7 13 29 41 32 3

The polished ThiopheneIV+ path is robust across erfcx choices. The unpolished fast solver is where Commons is clearly better for benchmark.

Impact on ThiopheneIV Performance

ThiopheneIV ns/call

backend CLY-3D CLY-20 CLY-80 Jaeckel Market Corners Stress HighVol
Cody 146 146 148 146 147 155 155 150
Commons 155 147 147 156 153 159 157 151
Johnson 179 180 188 211 203 185 204 161

ThiopheneIV+ ns/call

backend CLY-3D CLY-20 CLY-80 Jaeckel Market Corners Stress HighVol
Cody 199 200 199 185 193 210 205 187
Commons 202 203 197 197 200 214 210 189
Johnson 233 229 245 249 251 240 254 210

Cody is slightly faster than Commons for ThiopheneIV, but Commons is close and much more accurate. Johnson is slower in the solver context despite competitive standalone erfcx timing, most likely because its branch/table structure interacts differently with the full pricing paths.

Conclusion

Use the Commons erfcx implementation. It gives the best generic erfcx accuracy against the BigFloat reference among the tested Rust-callable implementations, the most stable 512-consecutive-ULP local-window profile, the best unpolished ThiopheneIV accuracy, and passes the strict reference/corner suite. Cody remains a good fast approximation, but it is of slighly lower accuracy (which impacts non-polished ThiopheneIV implied vols) and has a near-overflow-tail spike (very likely without impact on my use case).

A Faster Monotone Implied Volatiltty Solver

Choi, Huh and Su have a very good paper entitled Tighter uniform bounds for Black–Scholes implied volatility and the applications to root-finding. What’s particularly great is that it gives both a decent lower bound and a proof a monotone convergence using Newton’s method starting from this lower bound.

The industry standard for solving the Black-Scholes implied volatility is Peter Jäckel Let’s be rational. I was wondering if one could not create a robust mathematically backed solver at least as fast and accurate as Jäckel’s algorithm.

It turns out that it is not so difficult to extend the monotone convergence proof to the Euler-Chebyshev method on the logarithm of normalized price. This is however not numerically robust enough when the normalized call option price (c) is close to 1: c is between 0 and 1, and when close to 1, the misbehaves numerically due to the limited floating point accuracy. It is not a convergence issue, but a true floating point arithmetic issue. A remedy is to switch to the log of the complementary normalized price (1-c). But then the Euler-Chebyshev method is not monotone on it. Fortunately, it turns out that Halley’s method is monotone on the logarithm of the complementary normalized price. Overall, when combined, the methods lead to a much faster pricing: 3 iterations are enough to reach machine epsilon level of accuracy.

Newton, Euler-Chebyshev and Halley's methods from Iterative Methods for the Solution of Equations by Traub (1964).

For a truly robust solver, this is not enough. There are more floating point or practical convergence issues: a Bachelier based refinement for tiny prices is necessary, and polishing via the very accurate Jäckel Black-Scholes price function allows to reach closer to machine epsilon accuracy. This last polishing step end up relatively expensive compared to the rest (20% of computational time).

Above, I have described the journey behind creating this new solver. The reverse journey also tells an interesting tale. We could start from the Black-Scholes option price, and notice that a really accurate Black-Scholes formula implementation (close to machine epsilon) is not trivial. The textbook formula based on the normal CDF, often used in various software systems used at banks or hedge funds, is not very accurate in many cases. One should really use Peter Jäckel’s implementation everywhere, otherwise it is pointless to attempt to solve with such a high accuracy (in the paper, I thus make the Jäckel polishing step optional).

Error in the Black-Scholes price with different formulas: textbook CDF, more advanced erfcx, and Jäckel's expansion. On those realistic examples, we hit Jäckel's expansion zone.

I named the resulting solver with the somewhat cryptic name Thiophene, as Thiophene is a chemical compound with formula \[ C_4 - H_4 - S \], and C-H-S are the initials of Choi, Huh and Su.

Conclusion

A mathematical proof of convergence is great to have, but far from enough to obtain a robust solver. The paper on the Thiophene implied volatlity solver is available at https://arxiv.org/abs/2605.22427 and minimal Java code here and Rust code here.

Almost Explicit Implied Volatility

Several years ago, I had explored accuracy and performance of different ways to imply the Black-Scholes volatility. Jherek Healy proposed some improvements over my naive algorithm on his blog. Recently, a Linkedin post mentioned a new paper from Wolfgang Schadner which presents an almost explicit formula for the implied volatility. Almost because it actually relies on some implementation of the quantile function for the inverse Gaussian distribution: it requires a specialized numerical algorithm in practice, such as the one from Giner and Smyth.

The idea is actually not new, the same formula has been used back in 2016 by Jaehyuk Choi in his github repository. If we have a good inverse Gaussian distribution function, the algorithm is clean and simple:

using Distributions
export impliedVolatilityIG
const N = Normal()
function impliedVolatilityIG(isCall::Bool, price::T, forward, strike,  timeToExpiry, df) where {T}
    c, ex = normalizePrice(isCall, price, forward, strike, df)
    if c >= 1 / ex || c >= 1
        throw(DomainError(c, string("price higher than intrinsic value ", 1 / ex)))
    elseif c <= 0
        throw(DomainError(c, "price is negative"))
    end
    x = log(ex)
    return if abs(x) < 4 * eps(T)
        2 * quantile(N, (c + 1) / 2) / sqrt(timeToExpiry)
    else
        mu = 2 / abs(x)
        u = cquantile(InverseGaussian(mu), c)
        2 / sqrt(u * timeToExpiry)
    end
end

reusing the price normalization idea of Li.

function normalizePrice(isCall::Bool, price, f, strike, df)
    c = price / (f * df)
    ex = f / strike
    if !isCall
        if ex <= 1
            c = c + 1 - 1 / ex # put call parity
        else #duality + put call parity            
            c = ex * c
            ex = 1 / ex
        end
    else
        if ex > 1 # use c(-x0, v)            
            c = (f * (c - 1) + strike) / strike # in out duality, c = ex*c + 1 - ex 
            ex = 1 / ex
        end
    end
    return c, ex
end

I was wondering how accurate it was especially for very out-of-the-money options, and how fast it was in practice. Below I compare various implementations in Julia:

  • Peter Jaeckel, which is arguably the industry standard.
  • Jherek Healy, which consists in (log-)Householder iterations on top of a good explcit initial guess.
  • The inverse Gaussian approach using Julia’s Distribution.jl package (as in the above code).
  • The inverse Gaussian approach using a basic port of statsmod R algorithm, adapted to compute the complementary quantile function. This corresponds to a different implementation of cquantile(InverseGaussian(mu),c).

I consider call optiun prices for a fixed strike K = 200 for a forward F = 100, and a time to maturity T = 1.0 year, varying the volatility from 0.02 to 4.0 in steps of 0.01. This constitute 399 samples.

Algorithm vol RMSD vol MAXAD Price MAXAD Rel. Price MAXAD Time
Jaeckel 8.66e-16 4.00e-15 4.26e-14 1.81e-12 104 μs
Healy 1.57e-15 9.77e-15 8.53e-14 3.02e-10 107 μs
IG Distribution 7.37e-16 5.41e-15 2.84e-14 8.99e-11 246 μs
IG statsmod 7.77e-16 7.55e-15 8.24e-13 3.03e-10 311 μs

The inverse Gaussian approach is competitive but still twice slower than Peter Jaeckel’s algorithm on this example. How does the error in volatility actually look like?

Error in impled volatility of the various algorithms

The accuracy of the inverse Gaussian looks as good as Jackel’s algorithm, when using Julia’s package Distibutions.jl (it is slightly less accurate towards low vol, but slightly more accurate for high vols).

It is revealing to also look at the case of varying strikes for a fixed volatility. We consider a volatility fixed to 0.1 (or 10%) and vary the strike from 100 to 500 by steps of 1.

Algorithm vol RMSD vol MAXAD Price MAXAD Rel. Price MAXAD Time
JAECKEL 6.50e-16 2.73e-15 7.11e-15 6.59e-12 154 μs
HEALY 6.56e-16 2.73e-15 3.55e-14 6.59e-12 110 μs
IG Distribution 1.21e-15 7.76e-15 1.78e-14 1.95e-11 1704 μs
IG statsmod 8.68e-16 4.27e-15 1.06e-14 1.32e-11 288 μs

On this example the Julia Distributions.jl leads to a performance degradation by a factor of 10 and is less accurate. It would be worth investigating exactly where lies the bottleneck, as statsmod custom implementation keeps a good performance profile.

Error in impled volatility of the various algorithms

Owen Scrambling a la Burley

In my last post, I had a look at Quantlib implementation of a new scrambling method for Sobol due to Brent Burley of Walt Disney Studios Practical Hash-based Owen Scrambling.

Because it originates from the CG community, I had assumed that this was faster than the more classic scrambling ACM Algorithm 823 by Hickernell and Hong. I was wrong. It may be faster for specific use cases, but in general it is (much) slower, especially for the kind of (quasi) Monte-Carlo simulation that we use in quantitative finance.

The slowness is because we can’t just generate the next number from the sequence, we must jump to each number instead. The shuffling is implemented by jumping to some hash based location. And jumping is somewhat significantly slower than getting the next number. Furthermore, the grey code optimization can not be applied anymore.

In my code, for arbirary dimension (on the example to integrate successively a function from dimension 1 to dimension 5000 using a 100 steps on the dimension side), the Burley implementation (actually optimized to skip faster a la Quantlib instead of the original code from Burley) is 3 times slower than Algorithm 823.

It is also not obviously better on the example of Ilya M. Sobol’, Danil Asotsky in Construction and Comparison of High-Dimensional Sobol’ Generators (Equation 6.1 and Figures 8 and 9), also reproduced by Peter Caspers. $$ I_1 = \int_{[0,1]^d} \prod_{i=1}^d \left( 1 + 0.01(x_i - 0.5) \right) dx_1 dx_2 \dots dx_d . $$

Absolute error in integral I1 computed with different Sobol sequences

In the Figure above, I did not use 30031 points of the sequence as in the original test, but the closest power of 2, that is 32768, which is supposed to be more optimal (the choice of 30031 is strange but actually does not really change the outcome). Except for Sobol-Harase, all use Joe Kuo new-joe-kuo-6.21201 direction numbers. Sobol-Harase uses Shin Harase niederreiter-nut-s21201.txt, which interestingly, does not fare as well as the more classic Joe Kuo set.

The results vary quite a bit depending on the random number generator used, or its seed. In the above, the Well1024a generator with default seed for the seeds of Burley’s algorithm has the lowest maximum error, but if we change the seed to 123456, this is not true anymore. Algorithm 823 (Owen and Owen-Faure-Tezuka) is not worse, and its results are also dependent on the seed (although I am under the impression that the impact is not as large). This is something that was not shown in Peter Caspers Blog.

Jack Audio in Opensuse Tumbleweed

I struggled a bit having Jack Audio Connection Kit working in Opensuse Tumbleweed. My error was to install the jack package. The solution is actually extremely simple: use pipewire-jack instead of jack.

sudo zypper in pipewire-jack qjackctl

Owen Scrambling in Quantlib

The state of the art of Sobol scrambling has changed slightly recently, thanks to the paper from Brent Burley of Walt Disney Studios Practical Hash-based Owen Scrambling. Before that, ACM Algorithm 823 by Hickernell and Hong was the usual reference. Brent Burley’s algorithm is supposedly both faster and with better properties. In particular, it performs both shuffling and scrambling.

Peter Caspers implemented the new algorithm in Quantlib. It’s neat to have some decent implementation in CPP. In applying the algorithm of the paper, a few sub-optimal choices were made:

  • The base RNG for seeds (for the shuffling and scrambling) is Mersenne-Twister. Contrary to what I thought at the beginning, it is not used for skipping/jumping as the seeds are only set in the initialization, and are then constant for all numbers in the sequence. The skipping/jumping logic is however absent from the Quantlib CPP code, which is a bit unfortunate, but is almost trivial to code (just update the nextSequenceCounter_ variable).
  • There is a 4-by-4 seeding logic which looks odd. The paper code processes 4 dimensions as it is specialized to CG, although even then, it processes those 4 dimensions slightly differently: a much simpler hash function is used (it is given in the code URL inside the paper) and the seed value is not updated by the hash function: the seed is merely hashed with the dimension number, for 4 dimensions.

I don’t really see what there is to gain to follow this 4-by-4 seeding. It would be simpler and less concerning to simply use the RNG to produce N seeds where N is the number of dimensions (instead of N/4). The only possible gain with the 4-by-4 seeding is some memory (as only N/4 seeds are kept in memory), but the total amount is not significant: for 21000 dimensions, keeping all in memory takes around 656 KB (yes, 640 KB is not all we need!). The cost of 4-by-4 seeding is a potentially less random scrambling.

On the same subject, there is another new low discrepancy sequence which appears to have some good properties as well as being as fast to generate as Sobol, it’s called SZ Sequence.

Deep Neural Networks and Julia

Recently, I have spent some time on simple neural networks. The idea is to employ them as universal function approximators for some problems appearing in quantitative finance. There are some great papers on it such as the one from Liu et al. (2019) or Horvath et al. (2019) Deep Learning Volatility or Rosenbaum & Zhang (2021). Incidentally, I met Liu back when I was finishing my PhD in TU Delft around 2020.

I thought I would try out what Julia offers in terms of library for neural networks. Being a very trendy subject, and Julia a modern language for the scientific community, I had imagined the libraris to be of good quality (like the many I have been using in the past). Surprisingly, I was wrong.

First, I tried SimpleChain, which just crashed (core dumped!) on a very simple example. I did not bother finding the root cause, I decided to look for another library. I then tried LUX. The execution kept going forever without returning with the @compile keyword. I probably was not being lucky, even though my code was only 10s of lines long. So I decided to use Flux, which actually is as simple to use as Pytorch, a well-known library in the Python world.

Things works with Flux, and I do manage to do many experiments. But the performance is not great. This was another surprise: Pytorch was actually faster for many tasks on the CPU (no GPU). For example, to train a multi-layer-perceptron with 4 hidden layers of 200 neurons, Julia was taking several hours, until I hit CTRL+C and launched the same training in Pytorch (which took 15 minutes). I think it may be something with AdamW optimizer and relatively wide (but not really compared to LMM stuff) networks.

Overall, the experience is pretty disappointing. Maybe it shows how much effort has been put in the Python ecosystem.

Stochastic Collocation - Old And New

Thomas Roos recently put a preprint on SSRN called Simple, Flexible, Analytic, Arbitrage Free Volatility Interpolation. Being interested in the subject, I had a detailed look at it. It turns out that Thomas stumbled upon spline stochastic collocation without realizing it.

There are a few differences in his approach:

  • The optimization is on the x’s instead of the y’s, meaning the strike axis is fixed.
  • An original approach to avoid spurious modes, although I would have liked more details on it, with concrete examples of the penalty. Penalties are often challenging to get right.
  • A nice analysis of the asymptotic behavior in the wings, showing that an exponentially quadratic form for extrapolation corresponds to linear slopes in implied variance.

I had also explored a fixed strike axis initially (even back in 2014 - I uploaded some old notes on SSRN around this). I found it to be somewhat unstable back then, but it may well be that I did not put enough efforts into it, especially since optimizing the y’s instead of the x’s allowed for a straightforward use of B-splines, which is very attractive. It is very direct to impose monotonicity with B-splines.

The main advantage of optimizing on the x’s instead of the y’s is that the knots are defined in the more usual strike space.

Now I find interesting that someone else stumbled upon essentially the same idea starting from another point of view. There may be more merits to this parameterization than I thought.

The challenges I found with stochastic collocations were:

  • spurious spike when the collocation function derivative is close to zero which happens on occasion. This is not really like an extra mode - it is not smooth. Penalty or minimal slope are possible workarounds, but challenging to make robust.
  • dependency on the choice of knots, especially if nearly exact interpolation is required.
  • more minor, non-predictable runtime: the non-linear optimization is sometimes very fast, sometimes much slower. This is the case of most arbitrage-free interpolation techniques.

NUFFT and COS

Leif Andersen and Mark Lake recently proposed the use of Non-Uniform Fast Fourier Transform for option pricing via the characteristic function. Fourier techniques are most commonly used for pricing vanilla options under the Heston model, in order to calibrate the model. They can be applied to other models, typically with known characteristic function, but also with numerically solved characteristic function as in the rough Heston model, and to different kind of payoffs, for example variance/volatility swaps, options. The subject has been vastly explored already so what’s new with this paper?

At first, it was not obvious to me. The paper presents 3 different approaches, two where the probability density and cumulative density is first computed at a (not so sparse) set of knots and the payoff is integrated over it. The remaining approach is the classic one presented in the Carr and Madan paper from 1999 as well as in most of the litterature on the subject. The only additional trick is really the use of NUFFT along with some clever adaptive quadrature to compute the integral.

I thought the main cost in the standard Fourier techniques was the evaluation of the characteristic function at many points, because the characteristic function is usually relatively complicated. And, in practice, it mainly is. So what does the new NUFFT algorithm bring? Surely the characteristic function must still be evaluated at the same points.

The NUFFT becomes interesting to compute the option price at many many strikes at the same time. The strikes do not need to be equidistant and can be (almost) arbitrarily chosen. In Andersen and Lake paper, even a very fast technique such as the COS method reaches a threshold of option per second throughput, as the number of strikes is increased, mainly because of the repeated evaluation of the sum over different strikes, so the cost becomes proportional to the number of strikes.

Evaluating at many many points becomes not much more expensive than evaluating at a few points only. This is what opens up the possibilities for the first 2 approaches based on the density.

It turns out that it is not very difficult to rewrite the COS method so that it can make use of the NUFFT as well. I explain how to do it here. Below is the throughput using NFFT.jl Julia package:

Throughput on the Heston model with M=256 points.

While a neat trick, it is however not 100% clear to me at this point where this is really useful: calibration typically involve a small number of strikes per maturity.

Forward Variance Models and Calibration

The modern rough volatility models adopt a forward variance curve terminology (see for example this paper on a rational approximation for the rough Heston, or this presentation on affine forward variance models or this paper on affine forward variance models). In this form, the rough Heston model reads:

According to the litterature, the initial forward variance curve is typically built from the implied volatilities through the variance swap replication: for each maturity, the price of a newly issued variance swap is computed to this maturity, using the implied volatilities for this maturity. Then we may compute the forward variance by interpolating linearly the variance swap prices and differentiating. This leads to the plot below.

Forward variance curves for SPX500 as of October 2024.

One immediate question is how much does the truncation range play a role in this forward variance curve? Above, we plot the replications prices based on filtering the input data up to 10 Delta / 90 Delta for each maturity. Beyond this range, liquidity is usually much lower and the reliability of the input quotes may not be as good. The short end, and not so short forward variance up to 4 years is quite different. The long end is more aligned with something that looks like a basis spread between the two curves.

We may also calibrate the forward variance curve model directly to the implied volatilities during the model calibration, at the same time as the other stochastic volatility parameters (using a time-dependent correlation and vol-of-vol). This time-dependent Heston parameterization has effectively 3 parameters per expiry. In this case, we may want to also filter the input data to focus on the 10 Delta / 90 Delta range as we expect the stochastic volatility model to fit well where it matters the most - where the vanillas are liquid.

In the figure above, we also plot the forward variance curve implied by the model, through a calibration against vanilla options. We can see it is again quite different. If we were to use the true replication based forward variance curve, we would effectively attempt to fit far away vols in the wings, which somehow seems wrong. Now if the model was able to fit everything well there would be no issue, but with 3 parameters per maturity, a choice has to be made. And the choice of the full replication does not look particularly great, as we fit exactly something that is not liquid.

Next