The Role of Erfcx and Erfc in the Black-Scholes Formula

This note compares special-function choices at the level that matters to the implied volatility solvers: the normalized Black beta price, not standalone erfcx(x).

For x = log(F/K) <= 0 and total volatility s = sigma sqrt(T), the beta-space OTM call price is

B(x, s) = 0.5 * (exp(x/2) * erfc(q1) - exp(-x/2) * erfc(q2))
q1 = -(x/s + s/2) / sqrt(2)
q2 = -(x/s - s/2) / sqrt(2)

The corresponding forward-normalized OTM-call price is c = B(x, s) / sqrt(ex), with ex = exp(x).

Implementations Compared

  • Default/current: Jaeckel Region I/II expansions and falls back to the Apache Commons mixed erfc/erfcx formula in the middle region.
  • Jaeckel acc libm+Commons: the same Jaeckel region-dispatched, included into the temporary harness, using system erf/erfc and the local Apache Commons erfcx.
  • Jaeckel acc Cody+Cody: the same Jaeckel region-dispatched formula, using Cody erf, erfc, and erfcx from jaeckel = 0.2.0.
  • Jaeckel acc Johnson+Johnson: the same Jaeckel region-dispatched formula, using Johnson/Faddeeva erf, erfc, and erfcx through errorfunctions = 0.2.0.
  • Jaeckel acc libm+Cody and Jaeckel acc libm+Johnson: the same formula with system erf/erfc fixed and only erfcx swapped, to isolate the scaled-erfc backend effect.
  • Jaeckel I/II + pure Commons: Jaeckel Region I/II expansions are kept, but the Region III/middle fallback is replaced by the always-erfcx identity with the local Apache Commons erfcx.
  • IG statmod upper: the inverse-Gaussian upper-tail representation of the Black beta price, using a scalar Rust port of the CRAN statmod 1.5.2 .pinvgauss mean-one upper-tail log-probability core, with the final log(1-exp(r)) evaluated by a Rust log1mexp helper near cancellation. This is not a full statmod port: it omits R vectorization, NA/Inf handling, finite/infinite-mean special cases, and R’s exact pnorm implementation. The public-wrapper small-CV gamma approximation is also omitted because, under the Black mapping here, it would require dispersion=-2/x < 1e-14, or x < -2e14, far outside the finite Float64 x=log(ex) domain. No purpose-built Rust inverse-Gaussian crate was found in cargo search inverse-gaussian / cargo search invgauss; statrs 0.18.0 source also did not contain an inverse-Gaussian/Wald distribution implementation.
  • Commons erfcx-mixed: the Cody/Jaeckel threshold formula with the local Apache Commons Numbers erfcx port and system erfc.
  • Cody erfcx-mixed: the same threshold formula using jaeckel = 0.2.0 erfcx_cody.
  • Johnson erfcx-mixed: the same threshold formula using errorfunctions = 0.2.0 Johnson/Faddeeva erfcx.
  • libm erfc direct: the direct formula above, using system C erfc through FFI. Rust std does not provide erfc or erfcx.
  • Commons/Cody/Johnson pure-erfcx: the always-erfcx identity 0.5 * exp(-0.5*(h^2+t^2)) * (erfcx(q1) - erfcx(q2)), included to show why the mixed formula avoids some high-volatility tail problems.

Mixed Erfc/Erfcx Fallback Logic

The mixed formula starts from the direct beta-space Black expression:

2B = exp(x/2) * erfc(q1) - exp(-x/2) * erfc(q2)

with

h = x / s
t = s / 2
q1 = -(h + t) / sqrt(2)
q2 = -(h - t) / sqrt(2)

For either term, the stable tail identity is:

exp( x/2) * erfc(q1) = exp(-0.5 * (h^2 + t^2)) * erfcx(q1)
exp(-x/2) * erfc(q2) = exp(-0.5 * (h^2 + t^2)) * erfcx(q2)

because q1^2 = 0.5 * (h + t)^2, q2^2 = 0.5 * (h - t)^2, and x = h*s = 2*h*t. The mixed implementation uses that identity term by term instead of forcing both terms through one representation.

The branch threshold is Cody’s 0.46875, the same split used by the classic error-function approximations. Arguments below it are kept in the ordinary erfc representation; arguments above it use scaled erfcx so that the small tail probability is not represented as a nearly-underflowed erfc multiplied by a compensating exponential.

if q1 < 0.46875 and q2 < 0.46875:
    2B = exp(x/2) * erfc(q1)
       - exp(-x/2) * erfc(q2)

if q1 < 0.46875 and q2 >= 0.46875:
    2B = exp(x/2) * erfc(q1)
       - exp(-0.5 * (h^2 + t^2)) * erfcx(q2)

if q1 >= 0.46875 and q2 < 0.46875:
    2B = exp(-0.5 * (h^2 + t^2)) * erfcx(q1)
       - exp(-x/2) * erfc(q2)

if q1 >= 0.46875 and q2 >= 0.46875:
    2B = exp(-0.5 * (h^2 + t^2)) * (erfcx(q1) - erfcx(q2))

On the valid Black domain, the third formal case is actually unreachable. Since t = s/2 >= 0,

q2 - q1 = sqrt(2) * t >= 0,

so q1 >= rho implies q2 >= rho as well. The erfcx, erfc branch would require q1 >= rho and q2 < rho at the same time; that would need negative t, not merely positive h = x/s. It is kept in the branch table only as the symmetric algebraic case.

So “mixed” does not mean that erfc is a fallback after erfcx fails. It means each of the two Black terms is routed to the representation that is locally better conditioned. Negative or central arguments stay with erfc, where erfc(q) is order one and cheap. Positive tail arguments switch to erfcx, where the scaled value remains order 1/q and avoids underflow.

This explains two patterns in the tables below. In the low-volatility OTM tail, direct erfc loses many ulps because it subtracts two tiny tail probabilities after ordinary scaling; the mixed erfcx versions are much better. In the high-volatility upper-price window, the pure-erfcx identity is worse because it exposes the cancellation in erfcx(q1) - erfcx(q2) even when ordinary erfc would have kept one side well behaved.

Reference And Local Windows

Reference values are Julia BigFloat at 512-bit precision, rounded back to Float64. ULP error is the positive-Float64 bit distance to that rounded reference. Each window contains 512 consecutive Float64 values in either s or x, with the other coordinate fixed.

id zone axis x s
central-atm central s -1.000000e-2 2.000000e-1
tiny-near-atm tiny-near-atm s -1.000000e-8 1.000000e-4
erfc-threshold branch-transition s -1.000000e-1 1.364500e-1
moderate-otm moderate-otm s -1.000000e0 5.000000e-1
low-vol-otm lower-tail s -1.000000e-1 2.000000e-2
deep-otm lower-tail s -3.000000e0 7.000000e-1
far-otm lower-tail s -6.000000e0 1.000000e0
high-vol upper-price s -1.000000e-2 1.000000e1
very-high-vol upper-price s -1.000000e0 5.000000e1
near-overflow-vol upper-price s -1.000000e0 7.500000e1
x-central x-window x -1.000000e-2 2.000000e-1
x-lower-tail x-window x -4.000000e0 8.000000e-1

Global Accuracy

backend samples invalid exact_bits p50 p95 p99 max max_rel worst_window
Default/current 6144 0 2301 1 15 24 40 7.4e-15 far-otm
Jaeckel acc libm+Commons 6144 0 2301 1 15 24 40 7.4e-15 far-otm
Jaeckel acc Cody+Cody 6144 0 2291 1 14 22 40 7.4e-15 far-otm
Jaeckel acc Johnson+Johnson 6144 0 2296 1 15 25 42 7.8e-15 far-otm
Jaeckel acc libm+Cody 6144 0 2291 1 14 22 40 7.4e-15 far-otm
Jaeckel acc libm+Johnson 6144 0 2296 1 15 25 42 7.8e-15 far-otm
Jaeckel I/II + pure Commons 6144 0 1277 4 686 1202 1570 2.9e-13 near-overflow-vol
Commons erfcx-mixed 6144 0 1212 5 2823 3019 3068 5.2e-13 tiny-near-atm
IG statmod upper 6144 0 1158 7 6372 6568 7666 1.5e-12 low-vol-otm
Cody erfcx-mixed 6144 0 1180 6 2823 3019 3068 5.2e-13 tiny-near-atm
Johnson erfcx-mixed 6144 0 1161 6 2823 3019 3068 5.2e-13 tiny-near-atm
libm erfc direct 6144 0 1116 10 3049 5952 10329 2.0e-12 low-vol-otm
Commons pure-erfcx 6144 0 255 13 1776 1972 2021 3.4e-13 tiny-near-atm
Cody pure-erfcx 6144 0 174 15 9968 10164 10213 1.7e-12 tiny-near-atm
Johnson pure-erfcx 6144 0 257 19 59120 59316 59365 1.0e-11 tiny-near-atm

The headline is different from the raw erfcx comparison. In the normalized Black formula, the largest errors in these local windows are not the raw negative-erfcx overflow tail. They are tiny near-ATM and low-volatility OTM cancellation zones. The default is much better there because it uses Jaeckel Region I/II expansions before falling back to the mixed erfc/erfcx formula. Once that region-dispatched formula is kept, backend choice barely moves the result: all Jaeckel-accurate variants have global p50 = 1 ulp and max = 40 to 42 ulps on this sample set.

Jaeckel Accurate Backend Sensitivity

The Jaeckel accurate path is not just the mixed formula with a different erfcx. It routes first through:

  • Region I: asymptotic expansion times the normalized vega. This path is independent of erf, erfc, and erfcx.
  • Region II: small-t expansion times the normalized vega. This path uses erfcx only in the central y_prime branch; deeper left-tail y_prime is rational.
  • Region III/middle: the mixed Cody/Jaeckel erfc/erfcx formula described above.

That dispatch is why the special-function backend differences collapse compared with the generic mixed or pure-erfcx formulas.

Selected Jaeckel-accurate rows:

backend global p50 p95 p99 max tiny max low-vol max erfc-threshold max far-otm max
Default/current 1 15 24 40 1 17 2 40
Jaeckel acc libm+Commons 1 15 24 40 1 17 2 40
Jaeckel acc Cody+Cody 1 14 22 40 1 17 2 40
Jaeckel acc Johnson+Johnson 1 15 25 42 1 17 2 42
Jaeckel acc libm+Cody 1 14 22 40 1 17 2 40
Jaeckel acc libm+Johnson 1 15 25 42 1 17 2 42

The paired Cody rows match the libm+Cody rows in aggregate, and the paired Johnson rows match the libm+Johnson rows in aggregate. In these windows, the small residual backend sensitivity comes from erfcx, not from swapping the ordinary erf/erfc implementation. Cody is fractionally better than Commons by p95/p99 in the far-OTM-dominated aggregate; Johnson is fractionally worse, with a 42-ulp maximum instead of 40. Those differences are tiny next to the thousands of ulps lost by the non-Jaeckel mixed/direct formulas in the tiny near-ATM and low-vol OTM windows.

Jaeckel I/II With Pure-Erfcx Fallback

A tempting simplification is to keep Jaeckel Regions I and II, but replace the Region III/middle fallback with the pure-erfcx identity:

B = 0.5 * exp(-0.5 * (h^2 + t^2)) * (erfcx(q1) - erfcx(q2))

With Apache Commons erfcx, this is not as accurate as the default (Apache Commons erfc and erfcx with all Jaeckel zones) overall:

backend global p50 p95 p99 max tiny max low-vol max high-vol max very-high max near-overflow max
Default/current 1 15 24 40 1 17 1 0 0
Jaeckel acc libm+Commons 1 15 24 40 1 17 1 0 0
Jaeckel I/II + pure Commons 4 686 1202 1570 1 17 60 745 1570

The simplification keeps the important Region II tiny near-ATM behavior and the Region I/II lower-tail behavior, so those windows match the default. The loss comes from middle/fallback cases, especially the high-volatility upper-price windows. There the pure-erfcx identity exposes cancellation in erfcx(q1) - erfcx(q2), while the mixed fallback keeps the locally better erfc representation for negative or central arguments.

It is not meaningfully faster in this diagnostic either. On the same optimized local-window timing run, the pure fallback row was 24.83 ns/call, compared with 24.80 ns/call for the same included Jaeckel formula with the mixed Commons fallback and 25.07 ns/call for the Default/current path. That difference is inside normal microbenchmark noise, and the generic pure-erfcx kernel was slower than the generic mixed Commons kernel (21.32 vs 19.80 ns/call).

Inverse Gaussian Formula Experiment

For x < 0, the Black beta price can be written as an inverse-Gaussian upper tail. In the statmod mean-one parameterization, let

q = -2x / s^2
dispersion = -2 / x
Y ~ IG(mean = 1, dispersion)

Then

B(x, s) = exp(x/2) * P(Y > q).

The statmod upper-tail formula is

P(Y > q) = Phi(-(q-1)/sqrt(dispersion*q))
         - exp(2/dispersion) * Phi(-(q+1)/sqrt(dispersion*q)).

Substituting q=-2x/s^2 and dispersion=-2/x gives the ordinary Black expression in a different parameterization. The temporary Rust harness ports the statmod .pinvgauss core branch for this scalar mean-one upper tail, including its asymptotic right-tail override, and evaluates the normal log tails with the local Apache Commons erfcx support.

One Rust-side numerical improvement was kept: statmod writes the upper-tail combination as a + log1p(-exp(b-a)); the harness uses the equivalent a + log1mexp(b-a), switching to log(-expm1(b-a)) when b-a is close to zero. That slightly reduces cancellation in the tiny near-ATM and low-volatility OTM windows, but it does not change the conclusion.

The one public-wrapper branch that looks relevant at first glance is statmod’s small-CV gamma approximation:

if mean * dispersion < 1e-14:
    pgamma(q, shape=1/(mean*dispersion), scale=(mean*dispersion)*mean)

For this Black identity mean=1 and dispersion=-2/x, so the branch would need x < -2e14. A finite Float64 normalized Black input derived from ex=exp(x)>0 has x >= log(5e-324) ~= -744.44. The gamma approximation therefore cannot trigger on the valid Black domain tested here; implementing it would not change these results.

After installing R, the temporary harness was checked against the actual R code by sourcing the CRAN statmod 1.5.2 R/invgauss.R file. The statmod package was not installed as an R library on this machine, so this is a source-level check of the same public R implementation rather than a namespace-loaded package check. On the 6144 Black-mapped local-window samples, R’s public pinvgauss(..., lower.tail=FALSE, log.p=TRUE) and the literal .pinvgauss core agree exactly in price space, confirming that no wrapper special case is active on this domain.

The remaining Rust-vs-R price differences come from the normal log-tail backend and the final cancellation. Comparing prices directly:

comparison exact p50 p95 p99 max max_rel worst
Rust log1mexp price vs R pinvgauss 3020 1 648 2406 7348 1.4e-12 low-vol-otm
Rust literal price vs R core literal 4671 0 64 2461 7421 1.4e-12 low-vol-otm
R wrapper vs R core literal price 6144 0 0 0 0 0.0e0 central-atm
R log1mexp core vs R literal price 4115 0 648 648 648 1.1e-13 tiny-near-atm

Against the Julia BigFloat references, literal R/statmod does not improve the Black price accuracy versus the literal Rust port; it lands on the same conclusion as the Rust experiment:

backend/formula exact p50 p95 p99 max max_rel worst
Rust IG with log1mexp 1158 7 6372 6568 7666 1.5e-12 low-vol-otm
Rust literal statmod 1174 7 7017 7212 7739 1.5e-12 low-vol-otm
R pinvgauss literal statmod 1691 8 7017 7214 7739 1.5e-12 low-vol-otm
R core with log1mexp 1665 8 6373 6570 7666 1.5e-12 low-vol-otm

So R fidelity is good for the relevant scalar statmod formula, and the only measurable Rust-side deviation kept here is the log1mexp substitution. It slightly improves the IG formula relative to literal R/statmod, but the formula remains far less accurate than the Jaeckel Region I/II expansions in the sensitive Black windows.

This representation is not Jaeckel-accurate on the sensitive windows:

backend global p50 p95 p99 max tiny max low-vol max high-vol max very-high max near-overflow max
Default/current 1 15 24 40 1 17 1 0 0
Commons erfcx-mixed 5 2823 3019 3068 3068 593 1 0 0
IG statmod upper 7 6372 6568 7666 6608 7666 1 0 0

The inverse-Gaussian form is fine in the high-volatility upper-price windows, but it is worse than the mixed erfc/erfcx formula in the tiny near-ATM and low-volatility OTM cancellation windows. It is also slower in this scalar Rust port: 63.00 ns/call on the same local-window timing harness, compared with about 25 ns/call for the Jaeckel-dispatched default and about 20 ns/call for the generic Commons mixed formula.

Selected Windows

Black formula local window examples.

Selected 512-ULP window stats:

window = tiny-near-atm

backend p50 p95 p99 max
Default/current 0 1 1 1
Commons erfcx-mixed 2864 3048 3064 3068
IG statmod upper 6404 6588 6604 6608
Cody erfcx-mixed 2864 3048 3064 3068
Johnson erfcx-mixed 2864 3048 3064 3068
libm erfc direct 2864 3048 3064 3068
Johnson pure-erfcx 59161 59345 59361 59365

window = low-vol-otm

backend p50 p95 p99 max
Default/current 6 14 16 17
Commons erfcx-mixed 146 429 548 593
IG statmod upper 2299 5108 7130 7666
Cody erfcx-mixed 254 683 885 1345
Johnson erfcx-mixed 180 493 649 831
libm erfc direct 3526 6657 7242 10329

window = erfc-threshold

backend p50 p95 p99 max
Default/current 1 2 2 2
Commons erfcx-mixed 7 18 23 28
IG statmod upper 9 24 28 32
Cody erfcx-mixed 8 22 28 31
Johnson erfcx-mixed 20 43 57 74
libm erfc direct 7 21 27 34

window = high-vol

backend p50 p95 p99 max
Default/current 1 1 1 1
Commons erfcx-mixed 1 1 1 1
IG statmod upper 1 1 1 1
Cody erfcx-mixed 1 1 1 1
Johnson erfcx-mixed 1 1 1 1
libm erfc direct 1 1 1 1
Johnson pure-erfcx 39 56 56 57

Interpretation by region:

  • Tiny near-ATM: backend choice is almost irrelevant for the mixed formula; all mixed/direct forms lose about 3000 ulps. The default Region II expansion stays within 1 ulp.
  • Low-volatility OTM tail: direct erfc is worst because it subtracts two close tail probabilities. The erfcx-mixed formulas are much better, and Commons is best among the three erfcx backends in this window. The default expansion is better again.
  • Branch transition: this is where backend differences are visible but modest. Commons and Cody are close; Johnson is weaker but still far from the direct-erfc lower-tail failure.
  • Upper-price high-volatility: the mixed formula and direct-erfc formula are all within 1 ulp in this window. The pure-erfcx identity is worse because it exposes cancellation/rounding in erfcx(q1) - erfcx(q2) instead of routing negative-side arguments through erfc.

Timing

Timing was measured on the same 6144 local-window samples, best of five passes with 2000 sweeps per pass, using cargo run --release.

backend ns/call
Default/current 25.07
Jaeckel acc libm+Commons 24.80
Jaeckel acc Cody+Cody 26.55
Jaeckel acc Johnson+Johnson 25.39
Jaeckel acc libm+Cody 23.44
Jaeckel acc libm+Johnson 24.28
Jaeckel I/II + pure Commons 24.83
Commons erfcx-mixed 19.80
IG statmod upper 63.00
Cody erfcx-mixed 17.92
Johnson erfcx-mixed 19.27
libm erfc direct 26.89
Commons pure-erfcx 21.32
Cody pure-erfcx 19.20
Johnson pure-erfcx 19.59

These timings are for the formula kernels on this artificial local-window mix, not for the full IV solvers. The temporary Jaeckel backend rows include a small runtime selector used only by the diagnostic harness. The default is slower than the generic mixed formulas because it does more routing and uses the Region I/II expansion machinery, but the accuracy gain in tiny near-ATM and low-volatility OTM zones is large.

Takeaways

  • A raw erfcx(x) benchmark overstates the relevance of the negative-erfcx tail for the stable Black price. The mixed formula avoids much of that tail by using erfc on negative-side arguments.
  • The most important Black-formula accuracy zones here are tiny near-ATM and low-volatility OTM, where formula structure dominates backend choice.
  • For a generic erfcx-mixed Black formula, Apache Commons is the best of the three tested erfcx backends on the low-vol OTM window, Cody is usually close, and Johnson is somewhat weaker near the branch transition.
  • For the full Jaeckel accurate formula, all tested backend combinations are essentially tied: global p50 = 1 ulp, max = 40 to 42 ulps, and the sensitive tiny near-ATM window stays within 1 ulp for every backend.
  • In these sampled windows, swapping Cody or Johnson erf/erfc together with erfcx gives the same aggregate result as keeping libm erf/erfc and swapping only erfcx.
  • Keeping Jaeckel Regions I/II but using a pure Commons-erfcx fallback is simpler, but not accuracy-equivalent to the default: it matches the default in tiny near-ATM and lower-tail windows, then loses up to 1570 ulps in the high-volatility upper-price fallback region.
  • The pure-erfcx fallback is not a clear speed win either; in this release-mode diagnostic it was effectively tied with the mixed Jaeckel fallback, while the generic pure-erfcx kernel was slower than the generic mixed Commons kernel.
  • The inverse-Gaussian/statmod representation is a valid Black formula, but not an accuracy replacement for Jaeckel’s expanded evaluator here: even with a Rust log1mexp cancellation tweak, it loses up to 6608 ulps in the tiny near-ATM window and 7666 ulps in the low-volatility OTM window, and the straightforward scalar port is much slower.
  • The default remains the accuracy-favored choice because the Jaeckel Region I/II expansions reduce errors from thousands of ulps to tens of ulps or better in the sensitive windows.

References

  • For the Cody Netlib Slatec code Rational Chebyshev approximations for the error function by W. J. Cody, Math. Comp., 1969, PP. 631-638
  • For the Apache Commons Number code, this issue description gives a good overview of the implementation, which is a mix of Boost rational functions and Cody rational function to obtain the best accuracy.
  • Steven G. Johnson mentions in his code the following:

(…) I originally used an erfcx routine derived from DERFC in SLATEC, but I have since replaced it with a much faster routine written by me which uses a combination of continued-fraction expansions and a lookup table of Chebyshev polynomials. For speed, I implemented a similar algorithm for Im[w(x)] of real x, since this comes up frequently in the other error functions.

Which erfcx?

The use of erfcx instead of direct erfc or CDF in a Black-Scholes implied volatility solver leads to gain in accuracy and performance in general. But which erfcx should we use?

This note compares practical erfcx implementations for Rust implied volatility solvers:

The Commons-style implementation is a local Rust port of the Apache Commons Numbers/BoostErf approach. It computes the small central region through the local erf rational approximation and uses exp_m1(x*x) to avoid cancellation in exp(x^2) * erfc(x) near zero. For positive inputs it uses several rational subintervals: a central polynomial/rational erf branch, erfc-style rational branches around 0.5..1.5, 1.5..2.5, and 2.5..4, and a rational function taken from Rational Chebyshev approximations for the error function by W. J. Cody, Math. Comp., 1969, PP. 631-638 beyond that. For very large positive x, it returns 1 / (sqrt(pi) * x). For negative x, it uses the reflected form with a split-square exponential, returns Inf below the finite overflow cutoff, and otherwise computes 2 exp(x^2) - erfcx(|x|) in a way that preserves more bits near the overflow boundary.

The Cody implementation used here is the jaeckel crate’s Cody/CALERF-style implementation. It has three main positive-side rational regions: AB for |x| <= 0.46875, CD for 0.46875 < |x| <= 4, and PQ as an asymptotic rational in 1/x^2 for |x| > 4. It uses smoothed exponentials by splitting x onto a 1/16 grid before applying the residual exponential. For negative x, it reflects with erfcx(-x) = 2 exp(x^2) - erfcx(x) and clamps below its XNEG threshold to f64::MAX.

Cody near-overflow spike.

The rare Cody spike in the finite negative overflow-boundary window is not caused by using the wrong rational interval but is deliberate. Cody/CALERF does use the PQ asymptotic rational for the positive |x| > 4 part, but the negative-side fix-up is still the reflected 2 exp(x^2) - erfcx(|x|) form with an XNEG overflow guard. The published IEEE double CALERF constant is the rounded XNEG = -26.628D0; the public Rust crate uses a refined XNEG = -26.6287357137514 with a source comment noting that the original value was -26.628. The plot spike is therefore boundary/guard behavior inherited from the CALERF-style negative branch, with a refined but still conservative cutoff near the true finite Float64 overflow boundary. A literal Cody reference would clamp even earlier rather than use a separate high-accuracy negative-tail expansion in that range.

errorfunctions = 0.2.0 is a close Rust translation of Steven G. Johnson’s Faddeeva C/C++ package. Its real erfcx implementation uses:

  • a 100-interval Chebyshev table after the map y = 4 / (4 + x) for 0 <= x <= 50
  • a continued-fraction expansion for x > 50
  • the reflection identity erfcx(-x) = 2 exp(x^2) - erfcx(x) for negative x

Julia SpecialFunctions.erfcx(::Float64) calls Faddeeva_erfcx_re from libopenspecfun, and faddeeva-sys binds the same Johnson C symbol. On the sampled Float64 grid below, errorfunctions, faddeeva-sys, and Julia SpecialFunctions were bit-identical.

Generic Erfcx Accuracy

Reference: Julia BigFloat erfcx(x) at 256-bit precision, rounded back to Float64. The grid used 7021 total samples, of which 6272 had finite rounded Float64 references.

backend samples invalid max_ulp max_rel max_abs worst_x
Commons 6272 0 1 1.2e-16 4.990e291 -2.6600845000e1
Cody 6272 0 337 3.7e-14 6.726e294 -2.6628735714e1
Johnson 6272 0 491 5.5e-14 1.614e212 -2.2769845000e1

The large absolute errors occur where erfcx(x) itself is enormous. The relative errors are still small, but the ULP differences matter because the IV tests include very sensitive tiny and saturated cases.

512-ULP Local Window Accuracy

The coarse grid above is useful for broad coverage, but it does not answer the more local question: what happens across consecutive representable inputs near the sensitive regions? I ran a temporary Rust verifier that generated 512 consecutive Float64 inputs around 37 representative centers, including the old coarse-grid worst tail locations and the finite overflow boundary. References are Julia BigFloat erfcx(x) at 256-bit precision rounded back to Float64. ULP error is the positive-Float64 bit-distance from that rounded reference.

The windows covered 18,944 input values. The finite-reference tables below use 18,176 values; 768 inputs from the overflow-boundary windows round to Inf and are excluded from the finite-error percentiles.

Global local-window stats:

backend samples invalid exact_bits p50 p95 p99 max max_rel worst_x
Commons 18176 0 13682 0 1 1 2 3.6e-16 2.5000000000e0
Cody 18176 0 10375 0 2 3 41238 4.6e-12 -2.6628735714e1
Johnson 18176 0 6490 1 395 509 511 5.7e-14 -2.6628700000e1

The p95/p99 picture is clearer than the max alone. Commons-style stays within 1 ulp at p99 and within 2 ulps at max. Cody is usually close, but has a rare near-overflow-tail spike. Johnson is the opposite shape: no huge single outlier here, but the negative tail is broadly hundreds of ulps away.

In the mid negative tail around x = -24, Johnson is consistently about 510 ulps away while Cody and Default/Commons stay close to the rounded BigFloat reference.

region backend samples p95 p99 max
negative near-overflow tail Commons 256 1 1 1
negative near-overflow tail Cody 256 19083 36126 41238
negative near-overflow tail Johnson 256 487 506 511
negative tail Commons 512 1 1 1
negative tail Cody 512 4 5 5
negative tail Johnson 512 510 510 510

Zone definitions:

  • neg-near-overflow x < -26
  • neg-tail -26 <= x < -6.1
  • neg-transition -6.1 <= x < -0.5
  • central -0.5 <= x <= 0.5
  • pos-core 0.5 < x < 4
  • pos-tail 4 <= x <= 50
  • pos-far-tail x > 50

Per-zone local-window stats:

zone backend samples invalid exact_bits p50 p95 p99 max max_rel worst_x
neg-near-overflow Commons 3584 0 2682 0 1 1 1 1.9e-16 -2.6e1
neg-near-overflow Cody 3584 0 1611 1 2 3 41238 4.6e-12 -2.7e1
neg-near-overflow Johnson 3584 0 34 218 460 496 511 5.7e-14 -2.7e1
neg-tail Commons 3584 0 2885 0 1 1 1 2.1e-16 -6.1e0
neg-tail Cody 3584 0 2069 0 2 4 5 6.3e-16 -2.4e1
neg-tail Johnson 3584 0 872 26 509 510 510 5.7e-14 -2.4e1
neg-transition Commons 2048 0 1446 0 1 1 2 2.3e-16 -5.0e-1
neg-transition Cody 2048 0 1155 0 1 2 3 3.4e-16 -5.0e-1
neg-transition Johnson 2048 0 1291 0 15 24 26 3.6e-15 -6.1e0
central Commons 2049 0 1649 0 1 1 2 3.6e-16 5.0e-1
central Cody 2049 0 1313 0 1 2 3 3.6e-16 -5.0e-1
central Johnson 2049 0 902 1 3 4 5 7.2e-16 2.5e-1
pos-core Commons 2559 0 1772 0 1 1 2 3.6e-16 2.5e0
pos-core Cody 2559 0 982 1 3 4 5 6.6e-16 2.5e0
pos-core Johnson 2559 0 793 1 2 3 4 7.2e-16 5.0e-1
pos-tail Commons 2049 0 1294 0 1 1 1 2.0e-16 5.0e1
pos-tail Cody 2049 0 1291 0 1 1 1 2.0e-16 5.0e1
pos-tail Johnson 2049 0 1047 0 1 2 2 4.1e-16 2.0e1
pos-far-tail Commons 2303 0 1954 0 1 1 1 1.9e-16 1.0e8
pos-far-tail Cody 2303 0 1954 0 1 1 1 1.9e-16 1.0e8
pos-far-tail Johnson 2303 0 1551 0 1 1 2 3.1e-16 1.0e2

Local-window timing on the same 18,944 samples, best of five passes with 1000 sweeps per pass:

backend ns/call
Commons 9.86
Cody 10.04
Johnson 6.28

So yes: the tail is the right place to look. The worst Johnson behavior is the negative tail and near-overflow tail. Cody has a rare sharper spike right at the finite overflow boundary. Commons is the most stable over these consecutive-ULP windows.

Generic Erfcx Timing

Mixed grid, 24003 samples, best of five timing passes, 100 sweeps per pass.

backend ns/call
Commons 6.29
Cody 6.55
Johnson/errorfunctions 6.05
Johnson/faddeeva-sys 4.92
Julia SpecialFunctions 4.62

The Julia number is measured inside one Julia process launched by the Rust diagnostic program; it is a Julia-loop function-call timing, not per-call Rust-to-Julia FFI overhead.

Impact on ThiopheneIV Implied Volatility Solver

Values are max error in ULPs of reference total volatility on a specific set of options (see the ThiopheneIV paper for the definition of the exact zones). The rust implementation of the solver is available here

ThiopheneIV max ulp

backend CLY-3D CLY-20 CLY-80 Jaeckel Market Corners Stress HighVol
Cody 190 52 8 142 520 557 285 2
Commons 133 49 8 63 177 329 138 2
Johnson 400 71 16 97 1210 537 484 3

ThiopheneIV+ max ulp

backend CLY-3D CLY-20 CLY-80 Jaeckel Market Corners Stress HighVol
Cody 23 5 4 14 30 41 33 2
Commons 23 5 4 13 29 41 33 2
Johnson 22 5 7 13 29 41 32 3

The polished ThiopheneIV+ path is robust across erfcx choices. The unpolished fast solver is where Commons is clearly better for benchmark.

Impact on ThiopheneIV Performance

ThiopheneIV ns/call

backend CLY-3D CLY-20 CLY-80 Jaeckel Market Corners Stress HighVol
Cody 146 146 148 146 147 155 155 150
Commons 155 147 147 156 153 159 157 151
Johnson 179 180 188 211 203 185 204 161

ThiopheneIV+ ns/call

backend CLY-3D CLY-20 CLY-80 Jaeckel Market Corners Stress HighVol
Cody 199 200 199 185 193 210 205 187
Commons 202 203 197 197 200 214 210 189
Johnson 233 229 245 249 251 240 254 210

Cody is slightly faster than Commons for ThiopheneIV, but Commons is close and much more accurate. Johnson is slower in the solver context despite competitive standalone erfcx timing, most likely because its branch/table structure interacts differently with the full pricing paths.

Conclusion

Use the Commons erfcx implementation. It gives the best generic erfcx accuracy against the BigFloat reference among the tested Rust-callable implementations, the most stable 512-consecutive-ULP local-window profile, the best unpolished ThiopheneIV accuracy, and passes the strict reference/corner suite. Cody remains a good fast approximation, but it is of slighly lower accuracy (which impacts non-polished ThiopheneIV implied vols) and has a near-overflow-tail spike (very likely without impact on my use case).

A Faster Monotone Implied Volatiltty Solver

Choi, Huh and Su have a very good paper entitled Tighter uniform bounds for Black–Scholes implied volatility and the applications to root-finding. What’s particularly great is that it gives both a decent lower bound and a proof a monotone convergence using Newton’s method starting from this lower bound.

The industry standard for solving the Black-Scholes implied volatility is Peter Jäckel Let’s be rational. I was wondering if one could not create a robust mathematically backed solver at least as fast and accurate as Jäckel’s algorithm.

It turns out that it is not so difficult to extend the monotone convergence proof to the Euler-Chebyshev method on the logarithm of normalized price. This is however not numerically robust enough when the normalized call option price (c) is close to 1: c is between 0 and 1, and when close to 1, the misbehaves numerically due to the limited floating point accuracy. It is not a convergence issue, but a true floating point arithmetic issue. A remedy is to switch to the log of the complementary normalized price (1-c). But then the Euler-Chebyshev method is not monotone on it. Fortunately, it turns out that Halley’s method is monotone on the logarithm of the complementary normalized price. Overall, when combined, the methods lead to a much faster pricing: 3 iterations are enough to reach machine epsilon level of accuracy.

Newton, Euler-Chebyshev and Halley's methods from Iterative Methods for the Solution of Equations by Traub (1964).

For a truly robust solver, this is not enough. There are more floating point or practical convergence issues: a Bachelier based refinement for tiny prices is necessary, and polishing via the very accurate Jäckel Black-Scholes price function allows to reach closer to machine epsilon accuracy. This last polishing step end up relatively expensive compared to the rest (20% of computational time).

Above, I have described the journey behind creating this new solver. The reverse journey also tells an interesting tale. We could start from the Black-Scholes option price, and notice that a really accurate Black-Scholes formula implementation (close to machine epsilon) is not trivial. The textbook formula based on the normal CDF, often used in various software systems used at banks or hedge funds, is not very accurate in many cases. One should really use Peter Jäckel’s implementation everywhere, otherwise it is pointless to attempt to solve with such a high accuracy (in the paper, I thus make the Jäckel polishing step optional).

Error in the Black-Scholes price with different formulas: textbook CDF, more advanced erfcx, and Jäckel's expansion. On those realistic examples, we hit Jäckel's expansion zone.

I named the resulting solver with the somewhat cryptic name Thiophene, as Thiophene is a chemical compound with formula \[ C_4 - H_4 - S \], and C-H-S are the initials of Choi, Huh and Su.

Conclusion

A mathematical proof of convergence is great to have, but far from enough to obtain a robust solver. The paper on the Thiophene implied volatlity solver is available at https://arxiv.org/abs/2605.22427 and minimal Java code here and Rust code here.

Almost Explicit Implied Volatility

Several years ago, I had explored accuracy and performance of different ways to imply the Black-Scholes volatility. Jherek Healy proposed some improvements over my naive algorithm on his blog. Recently, a Linkedin post mentioned a new paper from Wolfgang Schadner which presents an almost explicit formula for the implied volatility. Almost because it actually relies on some implementation of the quantile function for the inverse Gaussian distribution: it requires a specialized numerical algorithm in practice, such as the one from Giner and Smyth.

The idea is actually not new, the same formula has been used back in 2016 by Jaehyuk Choi in his github repository. If we have a good inverse Gaussian distribution function, the algorithm is clean and simple:

using Distributions
export impliedVolatilityIG
const N = Normal()
function impliedVolatilityIG(isCall::Bool, price::T, forward, strike,  timeToExpiry, df) where {T}
    c, ex = normalizePrice(isCall, price, forward, strike, df)
    if c >= 1 / ex || c >= 1
        throw(DomainError(c, string("price higher than intrinsic value ", 1 / ex)))
    elseif c <= 0
        throw(DomainError(c, "price is negative"))
    end
    x = log(ex)
    return if abs(x) < 4 * eps(T)
        2 * quantile(N, (c + 1) / 2) / sqrt(timeToExpiry)
    else
        mu = 2 / abs(x)
        u = cquantile(InverseGaussian(mu), c)
        2 / sqrt(u * timeToExpiry)
    end
end

reusing the price normalization idea of Li.

function normalizePrice(isCall::Bool, price, f, strike, df)
    c = price / (f * df)
    ex = f / strike
    if !isCall
        if ex <= 1
            c = c + 1 - 1 / ex # put call parity
        else #duality + put call parity            
            c = ex * c
            ex = 1 / ex
        end
    else
        if ex > 1 # use c(-x0, v)            
            c = (f * (c - 1) + strike) / strike # in out duality, c = ex*c + 1 - ex 
            ex = 1 / ex
        end
    end
    return c, ex
end

I was wondering how accurate it was especially for very out-of-the-money options, and how fast it was in practice. Below I compare various implementations in Julia:

  • Peter Jaeckel, which is arguably the industry standard.
  • Jherek Healy, which consists in (log-)Householder iterations on top of a good explcit initial guess.
  • The inverse Gaussian approach using Julia’s Distribution.jl package (as in the above code).
  • The inverse Gaussian approach using a basic port of statsmod R algorithm, adapted to compute the complementary quantile function. This corresponds to a different implementation of cquantile(InverseGaussian(mu),c).

I consider call optiun prices for a fixed strike K = 200 for a forward F = 100, and a time to maturity T = 1.0 year, varying the volatility from 0.02 to 4.0 in steps of 0.01. This constitute 399 samples.

Algorithm vol RMSD vol MAXAD Price MAXAD Rel. Price MAXAD Time
Jaeckel 8.66e-16 4.00e-15 4.26e-14 1.81e-12 104 μs
Healy 1.57e-15 9.77e-15 8.53e-14 3.02e-10 107 μs
IG Distribution 7.37e-16 5.41e-15 2.84e-14 8.99e-11 246 μs
IG statsmod 7.77e-16 7.55e-15 8.24e-13 3.03e-10 311 μs

The inverse Gaussian approach is competitive but still twice slower than Peter Jaeckel’s algorithm on this example. How does the error in volatility actually look like?

Error in impled volatility of the various algorithms

The accuracy of the inverse Gaussian looks as good as Jackel’s algorithm, when using Julia’s package Distibutions.jl (it is slightly less accurate towards low vol, but slightly more accurate for high vols).

It is revealing to also look at the case of varying strikes for a fixed volatility. We consider a volatility fixed to 0.1 (or 10%) and vary the strike from 100 to 500 by steps of 1.

Algorithm vol RMSD vol MAXAD Price MAXAD Rel. Price MAXAD Time
JAECKEL 6.50e-16 2.73e-15 7.11e-15 6.59e-12 154 μs
HEALY 6.56e-16 2.73e-15 3.55e-14 6.59e-12 110 μs
IG Distribution 1.21e-15 7.76e-15 1.78e-14 1.95e-11 1704 μs
IG statsmod 8.68e-16 4.27e-15 1.06e-14 1.32e-11 288 μs

On this example the Julia Distributions.jl leads to a performance degradation by a factor of 10 and is less accurate. It would be worth investigating exactly where lies the bottleneck, as statsmod custom implementation keeps a good performance profile.

Error in impled volatility of the various algorithms

Owen Scrambling a la Burley

In my last post, I had a look at Quantlib implementation of a new scrambling method for Sobol due to Brent Burley of Walt Disney Studios Practical Hash-based Owen Scrambling.

Because it originates from the CG community, I had assumed that this was faster than the more classic scrambling ACM Algorithm 823 by Hickernell and Hong. I was wrong. It may be faster for specific use cases, but in general it is (much) slower, especially for the kind of (quasi) Monte-Carlo simulation that we use in quantitative finance.

The slowness is because we can’t just generate the next number from the sequence, we must jump to each number instead. The shuffling is implemented by jumping to some hash based location. And jumping is somewhat significantly slower than getting the next number. Furthermore, the grey code optimization can not be applied anymore.

In my code, for arbirary dimension (on the example to integrate successively a function from dimension 1 to dimension 5000 using a 100 steps on the dimension side), the Burley implementation (actually optimized to skip faster a la Quantlib instead of the original code from Burley) is 3 times slower than Algorithm 823.

It is also not obviously better on the example of Ilya M. Sobol’, Danil Asotsky in Construction and Comparison of High-Dimensional Sobol’ Generators (Equation 6.1 and Figures 8 and 9), also reproduced by Peter Caspers. $$ I_1 = \int_{[0,1]^d} \prod_{i=1}^d \left( 1 + 0.01(x_i - 0.5) \right) dx_1 dx_2 \dots dx_d . $$

Absolute error in integral I1 computed with different Sobol sequences

In the Figure above, I did not use 30031 points of the sequence as in the original test, but the closest power of 2, that is 32768, which is supposed to be more optimal (the choice of 30031 is strange but actually does not really change the outcome). Except for Sobol-Harase, all use Joe Kuo new-joe-kuo-6.21201 direction numbers. Sobol-Harase uses Shin Harase niederreiter-nut-s21201.txt, which interestingly, does not fare as well as the more classic Joe Kuo set.

The results vary quite a bit depending on the random number generator used, or its seed. In the above, the Well1024a generator with default seed for the seeds of Burley’s algorithm has the lowest maximum error, but if we change the seed to 123456, this is not true anymore. Algorithm 823 (Owen and Owen-Faure-Tezuka) is not worse, and its results are also dependent on the seed (although I am under the impression that the impact is not as large). This is something that was not shown in Peter Caspers Blog.

Jack Audio in Opensuse Tumbleweed

I struggled a bit having Jack Audio Connection Kit working in Opensuse Tumbleweed. My error was to install the jack package. The solution is actually extremely simple: use pipewire-jack instead of jack.

sudo zypper in pipewire-jack qjackctl

Owen Scrambling in Quantlib

The state of the art of Sobol scrambling has changed slightly recently, thanks to the paper from Brent Burley of Walt Disney Studios Practical Hash-based Owen Scrambling. Before that, ACM Algorithm 823 by Hickernell and Hong was the usual reference. Brent Burley’s algorithm is supposedly both faster and with better properties. In particular, it performs both shuffling and scrambling.

Peter Caspers implemented the new algorithm in Quantlib. It’s neat to have some decent implementation in CPP. In applying the algorithm of the paper, a few sub-optimal choices were made:

  • The base RNG for seeds (for the shuffling and scrambling) is Mersenne-Twister. Contrary to what I thought at the beginning, it is not used for skipping/jumping as the seeds are only set in the initialization, and are then constant for all numbers in the sequence. The skipping/jumping logic is however absent from the Quantlib CPP code, which is a bit unfortunate, but is almost trivial to code (just update the nextSequenceCounter_ variable).
  • There is a 4-by-4 seeding logic which looks odd. The paper code processes 4 dimensions as it is specialized to CG, although even then, it processes those 4 dimensions slightly differently: a much simpler hash function is used (it is given in the code URL inside the paper) and the seed value is not updated by the hash function: the seed is merely hashed with the dimension number, for 4 dimensions.

I don’t really see what there is to gain to follow this 4-by-4 seeding. It would be simpler and less concerning to simply use the RNG to produce N seeds where N is the number of dimensions (instead of N/4). The only possible gain with the 4-by-4 seeding is some memory (as only N/4 seeds are kept in memory), but the total amount is not significant: for 21000 dimensions, keeping all in memory takes around 656 KB (yes, 640 KB is not all we need!). The cost of 4-by-4 seeding is a potentially less random scrambling.

On the same subject, there is another new low discrepancy sequence which appears to have some good properties as well as being as fast to generate as Sobol, it’s called SZ Sequence.

Deep Neural Networks and Julia

Recently, I have spent some time on simple neural networks. The idea is to employ them as universal function approximators for some problems appearing in quantitative finance. There are some great papers on it such as the one from Liu et al. (2019) or Horvath et al. (2019) Deep Learning Volatility or Rosenbaum & Zhang (2021). Incidentally, I met Liu back when I was finishing my PhD in TU Delft around 2020.

I thought I would try out what Julia offers in terms of library for neural networks. Being a very trendy subject, and Julia a modern language for the scientific community, I had imagined the libraris to be of good quality (like the many I have been using in the past). Surprisingly, I was wrong.

First, I tried SimpleChain, which just crashed (core dumped!) on a very simple example. I did not bother finding the root cause, I decided to look for another library. I then tried LUX. The execution kept going forever without returning with the @compile keyword. I probably was not being lucky, even though my code was only 10s of lines long. So I decided to use Flux, which actually is as simple to use as Pytorch, a well-known library in the Python world.

Things works with Flux, and I do manage to do many experiments. But the performance is not great. This was another surprise: Pytorch was actually faster for many tasks on the CPU (no GPU). For example, to train a multi-layer-perceptron with 4 hidden layers of 200 neurons, Julia was taking several hours, until I hit CTRL+C and launched the same training in Pytorch (which took 15 minutes). I think it may be something with AdamW optimizer and relatively wide (but not really compared to LMM stuff) networks.

Overall, the experience is pretty disappointing. Maybe it shows how much effort has been put in the Python ecosystem.

Stochastic Collocation - Old And New

Thomas Roos recently put a preprint on SSRN called Simple, Flexible, Analytic, Arbitrage Free Volatility Interpolation. Being interested in the subject, I had a detailed look at it. It turns out that Thomas stumbled upon spline stochastic collocation without realizing it.

There are a few differences in his approach:

  • The optimization is on the x’s instead of the y’s, meaning the strike axis is fixed.
  • An original approach to avoid spurious modes, although I would have liked more details on it, with concrete examples of the penalty. Penalties are often challenging to get right.
  • A nice analysis of the asymptotic behavior in the wings, showing that an exponentially quadratic form for extrapolation corresponds to linear slopes in implied variance.

I had also explored a fixed strike axis initially (even back in 2014 - I uploaded some old notes on SSRN around this). I found it to be somewhat unstable back then, but it may well be that I did not put enough efforts into it, especially since optimizing the y’s instead of the x’s allowed for a straightforward use of B-splines, which is very attractive. It is very direct to impose monotonicity with B-splines.

The main advantage of optimizing on the x’s instead of the y’s is that the knots are defined in the more usual strike space.

Now I find interesting that someone else stumbled upon essentially the same idea starting from another point of view. There may be more merits to this parameterization than I thought.

The challenges I found with stochastic collocations were:

  • spurious spike when the collocation function derivative is close to zero which happens on occasion. This is not really like an extra mode - it is not smooth. Penalty or minimal slope are possible workarounds, but challenging to make robust.
  • dependency on the choice of knots, especially if nearly exact interpolation is required.
  • more minor, non-predictable runtime: the non-linear optimization is sometimes very fast, sometimes much slower. This is the case of most arbitrage-free interpolation techniques.

NUFFT and COS

Leif Andersen and Mark Lake recently proposed the use of Non-Uniform Fast Fourier Transform for option pricing via the characteristic function. Fourier techniques are most commonly used for pricing vanilla options under the Heston model, in order to calibrate the model. They can be applied to other models, typically with known characteristic function, but also with numerically solved characteristic function as in the rough Heston model, and to different kind of payoffs, for example variance/volatility swaps, options. The subject has been vastly explored already so what’s new with this paper?

At first, it was not obvious to me. The paper presents 3 different approaches, two where the probability density and cumulative density is first computed at a (not so sparse) set of knots and the payoff is integrated over it. The remaining approach is the classic one presented in the Carr and Madan paper from 1999 as well as in most of the litterature on the subject. The only additional trick is really the use of NUFFT along with some clever adaptive quadrature to compute the integral.

I thought the main cost in the standard Fourier techniques was the evaluation of the characteristic function at many points, because the characteristic function is usually relatively complicated. And, in practice, it mainly is. So what does the new NUFFT algorithm bring? Surely the characteristic function must still be evaluated at the same points.

The NUFFT becomes interesting to compute the option price at many many strikes at the same time. The strikes do not need to be equidistant and can be (almost) arbitrarily chosen. In Andersen and Lake paper, even a very fast technique such as the COS method reaches a threshold of option per second throughput, as the number of strikes is increased, mainly because of the repeated evaluation of the sum over different strikes, so the cost becomes proportional to the number of strikes.

Evaluating at many many points becomes not much more expensive than evaluating at a few points only. This is what opens up the possibilities for the first 2 approaches based on the density.

It turns out that it is not very difficult to rewrite the COS method so that it can make use of the NUFFT as well. I explain how to do it here. Below is the throughput using NFFT.jl Julia package:

Throughput on the Heston model with M=256 points.

While a neat trick, it is however not 100% clear to me at this point where this is really useful: calibration typically involve a small number of strikes per maturity.

Next