Maximum Implied Variance Slope

The paper The Moment Formula for Implied Volatility at Extreme Strikes by Roger Lee redefined how practitioners extrapolate the implied volatility, by showing that the total implied variance can be at most linear in the wings, with a slope below 2.

Shortly after, the SVI model of Jim Gatheral, with its linear wings, started to become popular.

In a recent paper in collaboration with Winfried Koller, we show that the asymptotic bounds are usually overly optimistic. This is somewhat expected, as the bound only holds as the log-moneyness goes to infinity. What is less expected is that, even at very large strikes (nearly up to the limit of what SVI allows numerically), arbitrages may appear even though the bounds are respected.

In practice, this manifests itself as increasing call option prices far away in the wings (increasing call prices constitute an arbitrage opportunity), which could be problematic for any kind of replication-based pricing, such as variance swap pricing.
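To make the check concrete, here is a small Julia sketch which prices undiscounted Black calls along an SVI slice at very large log-moneyness and looks for an increasing section (a call-spread arbitrage). The SVI parameters below are hypothetical placeholders, not the calibrated sets from the paper, and whether the arbitrage actually appears depends on them.

using Distributions

# SVI total variance w(k) = a + b*(rho*(k - m) + sqrt((k - m)^2 + sigma^2))
svi_variance(k; a, b, rho, m, sigma) = a + b * (rho * (k - m) + sqrt((k - m)^2 + sigma^2))

# undiscounted Black call price in terms of the forward f, log-moneyness k and total variance w
function black_call(f, k, w)
    sqrtw = sqrt(w)
    d1 = -k / sqrtw + sqrtw / 2
    return f * (cdf(Normal(), d1) - exp(k) * cdf(Normal(), d1 - sqrtw))
end

params = (a = 0.01, b = 0.9, rho = -0.3, m = 0.0, sigma = 0.1)  # hypothetical, wing slopes below 2
ks = range(5.0, 25.0, length = 2001)                            # very large log-moneyness
prices = [black_call(1.0, k, svi_variance(k; params...)) for k in ks]
println(any(diff(prices) .> 0) ? "increasing call prices found" : "call prices are decreasing")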

Increasing call prices for various calibrated SVI parameters respecting the asymptotic boundary constraints.

The paper on arxiv.

The Return of the Arbitrage in the Perfect Volatility Surface

In a Wilmott article from 2018 (Wilmott magazine no. 97) titled “Arbitrage in the perfect volatility surface”, Uwe Wystup points out some interesting issues on seemingly innocuous FX volatility surfaces:

  • a cubic spline tends to produce artificial peaks/modes in the density.
  • SVI is not arbitrage-free even on seemingly trivial input.

The examples provided are indeed great and the remarks very valid. There is more to it however:

  • a cubic spline on strikes or log-moneyness does not produce the artificial peak.
  • SVI with a=0 is arbitrage-free on this example.

For the first point, the denominator of the Dupire formula expressed in terms of the total variance as a function of log-moneyness gives some clues: it constitutes a vega-scaled version of the probability density, with a direct link to the total variance and its derivatives. In particular, it is a simple function of the value, first and second derivatives of the total variance, without involving any non-linear function, and the second derivative only appears as a linear term. As such, a low-order polynomial representation of the variance in log-moneyness may be adequate. In contrast, the delta-based representation introduces a strong non-linearity through the cumulative normal distribution function.
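To make this explicit, here is a small Julia sketch of that denominator (the g function of Gatheral’s butterfly-arbitrage condition) written directly in terms of the total variance w(k) and its first two derivatives; the commented lines only suggest one possible way to obtain the derivatives from a smooth parameterization, via ForwardDiff.

# Denominator of the Dupire local variance formula in total variance w(k), k being the log-moneyness.
# The risk-neutral density is this quantity times a strictly positive (vega-like) factor,
# so a negative value signals a butterfly arbitrage.
function dupire_denominator(w, wp, wpp, k)
    return (1 - k * wp / (2w))^2 - (wp^2 / 4) * (1 / w + 1 / 4) + wpp / 2
end

# For a smooth w(k) (spline on log-moneyness, SVI, ...), the derivatives are cheap, e.g.:
# using ForwardDiff
# wp(k) = ForwardDiff.derivative(w, k)
# wpp(k) = ForwardDiff.derivative(wp, k)
# g(k) = dupire_denominator(w(k), wp(k), wpp(k), k)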

Splines on AUD/NZD 1w options.

For the second point, it is more of a curiosity. Everybody knows that SVI is not always arbitrage-free, and examples of arbitrages abound. Finding such a minimal example is nice. I noticed that many of the real-life butterfly spread arbitrages are avoided if we follow Zeliade’s recommendation to set a >= 0, where a is the base ordinate parameter of SVI. Yes, the fit is slightly worse, but the increased calibration robustness and the decrease in the number of smiles with arbitrage are probably worth it.

SVI on AUD/NZD 1w options.

SVI on AUD/NZD 1w options.

Even if I am no particular fan of SVI or of the cubic spline to represent implied volatilities, the subject merited some clarifications.

extract of the Wilmott article.

There is a third point which I found quite surprising. The axes are inconsistent between the first two plots of Figure 1. 10 delta means that we are out of the money. For a call, the strike is thus larger than the forward, and the corresponding log-moneyness must be positive. But on the figure, the 10 delta call volatility of the first plot does not correspond to the highest log-moneyness point of the second plot. One of the axes is reversed. This is particularly surprising for an expert like Uwe.

Ooohhhh!

It is reassuring that this kind of mistake also happens to the best. Every FX quant has Uwe Wystup’s book, next to the Clark and the Castagna.

Princeton Fintech and Quant conference of December 2022

I recently presented my latest published paper On the Bachelier implied volatility at extreme strikes at the Princeton Fintech and Quant conference. The presenters had quite diverse backgrounds. The first presentations were much more business-oriented, with lots of AI keywords but relatively little technical content, while the last presentation was about parallel programming. Many were pitches to recruit employees.

The diversity was interesting: it was refreshing to hear about quantitative finance from vastly different perspectives. The presentation from Morgan Stanley about their Scala annotation framework to ease parallel programming was enlightening. The main issue they were trying to solve is the necessity for boilerplate code to handle concurrency, caching and robustness, which significantly obfuscates the business logic in the code. This is an old problem. Decades ago, rule engines were the trend for similar reasons. Using the example of pricing bonds, the presenters put the issues in evidence very well, and I found them very relevant.

While writing my own presentation, I realized I had not tried to plot the Bachelier implied volatility fractals. The one corresponding to the Householder method is pretty, and not too far off from the Black-Scholes case, possibly because the first, second and third derivatives of the price with respect to the volatility are similar between the two models.
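For reference, here is a small Julia sketch of one third-order Householder step for the Bachelier implied volatility; the first three derivatives of the Bachelier price with respect to the volatility only involve the normal density. This is a sketch of the iteration only, not of the fractal rendering.

using Distributions

function bachelier_price_and_derivatives(sigma, f, strike, tte)
    sqrtT = sqrt(tte)
    d = (f - strike) / (sigma * sqrtT)
    phi = pdf(Normal(), d)
    price = (f - strike) * cdf(Normal(), d) + sigma * sqrtT * phi
    cp = sqrtT * phi                                 # dC/dsigma (the Bachelier vega)
    cpp = sqrtT * phi * d^2 / sigma                  # d2C/dsigma2
    cppp = sqrtT * phi * d^2 * (d^2 - 3) / sigma^2   # d3C/dsigma3
    return price, cp, cpp, cppp
end

function householder_step(sigma, target, f, strike, tte)
    price, cp, cpp, cppp = bachelier_price_and_derivatives(sigma, f, strike, tte)
    h = -(price - target) / cp
    return sigma + h * (1 + cpp / (2cp) * h) / (1 + cpp / cp * h + cppp / (6cp) * h^2)
end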

Iterations of the Householder method on the problem of solving the Bachelier implied volatility for a given strike and maturity. Each point corresponds to an initial guess in the complex plane.

Of course, this is purely for fun and has no practical importance, since we can invert the Bachelier price with near machine accuracy through a fast analytic formula.

Slides of the presentation.

Desktop Linux in 2022

I have been a long-time user of Fedora at work and have been quite happy with it. Around 6 months ago, I moved to Manjaro under VirtualBox in a Windows host, because the company imposes the use of a VPN client that does not run on Linux. It’s much less great, for several reasons:

  • VirtualBox makes everything graphics-related much slower. It is still way better than WSL2, which I found to be a poor man’s Linux, where graphics goes through some half-broken X11 client for Windows. With VirtualBox, programming is OK, watching videos is not (you need to use the host for that). Something relatively neat is that the VM can be run on any other host, including a Linux host, by just copying the VM file.
  • I picked Manjaro out of curiosity, and because it was perhaps faster/more optimized than Fedora. I don’t have too many issues with it, but I found out that AUR packages would frequently break on upgrades. Furthermore, it is updated less frequently than Fedora, which is a bit surprising for a rolling release. I also installed it with ext4. In hindsight, I think both choices were poor: Fedora with btrfs is the way to go, or some Ubuntu variant.
  • Work also imposes various antivirus programs which tend to always keep the cpu busy.

Overall, I am still much more productive than with raw Windows, which is surprisingly slow for many software development tasks. Having to deal more with Windows showed me how slow the system is, and how poor the experience is. Before using it again (after many years completely off it), I had started thinking it was good (well, it does crash much less than back in the day).

I otherwise use Fedora with either KDE or Gnome on my personal desktop and laptop, with very few issues overall. Gnome seems much better since version 42, even if Gnome Shell has not been the most stable in the past. Steam runs great. I also encountered a grub bootloader bug recently, which disappeared once I upgraded my BIOS.

Roughness of the Implied Volatility

This is a follow-up to my previous post on rough volatility. I recently tried to reproduce the results of the paper Rough Volatility: Fact or Artefact?, as I was curious to apply the technique to different inputs. The 5-minute SPX realized volatility is freely available in CSV format at the Oxford-Man Institute of Quantitative Finance, and it is thus relatively straightforward to reproduce the numbers presented in the paper.

Using a sampling of K=75 and L=75x75, I obtain a roughness index H=0.181. The paper uses K=70 and L=70x70, and its Figure 19 states H=0.187. It turns out that there are more than L observations in the time series, and, with K=70, I end up with a roughness index H=0.222 when I start from the first observation (year 2000) up to observation L+1. But it is possible to slide this window and compute the roughness index at each starting point. The results are enlightening.
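As an illustration of the sliding-window idea, here is a Julia sketch. As a stand-in for the estimator of Cont and Das, it uses the simpler regression of Gatheral, Jaisson and Rosenbaum (slope of log m(q, Δ) against log Δ, divided by q) applied to the log realized volatility, so the numbers it produces are not directly comparable to the ones quoted above.

using Statistics

# Hurst estimate by regression of log m(q, delta) on log delta, m being the mean of |log-vol increments|^q
function hurst_gjr(logvol::AbstractVector; q = 2.0, deltas = 1:10)
    logm = [log(mean(abs.(@view(logvol[1+delta:end]) .- @view(logvol[1:end-delta])) .^ q)) for delta in deltas]
    x = log.(deltas)
    slope = sum((x .- mean(x)) .* (logm .- mean(logm))) / sum((x .- mean(x)) .^ 2)
    return slope / q
end

# slide a window of L observations over the whole series and estimate H at each starting point
sliding_hurst(logvol, L; step = 5) =
    [hurst_gjr(@view(logvol[s:s+L-1])) for s in 1:step:(length(logvol)-L+1)]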

SPX historical vol roughness index sliding estimate with K=70.

density of SPX historical vol roughness index estimate with K=70. The mean is 0.158 with a standard deviation of 0.034

The mean estimated roughness is 0.158 with a standard deviation of 0.034. The results of the paper are within one standard deviation. But it shows it may be dangerous to use a single window for the estimate. If we use a smaller window size K=50, L=50x50, we can see regimes of roughness in time.

SPX roughness index sliding estimate with K=50.

density of SPX roughness index estimate with K=50. The mean is 0.145 with a standard deviation of 0.048

I also double checked that the SPX closing prices resulted in an estimate H=0.5, for the two window sizes. The SPX price process follows the standard Brownian motion statistics.

The paper explains that using the 5-minute realized volatility is not necessarily a good proxy for measuring the roughness of the true underlying instantaneous volatility. I wondered then what happens if we use the daily quotes of the short-term at-the-money (ATM) implied volatility as a proxy. Indeed, in a given stochastic volatility model, the short-term ATM implied volatility is approximately the instantaneous volatility of the stochastic volatility process. It turns out that there is some freely available data from Nasdaq for MSFT, but only for the period 2015-01-02 to 2018-12-31 (four years). This constitutes 1006 samples. For the 30-day implied volatility, with K=28, the data implies a roughness index close to 0.5 (0.448).

density of MSFT 30-day implied volatility roughness index estimate. The mean is 0.448 with a standard deviation of 0.059

density of MSFT 10-day implied volatility roughness index estimate. The mean is 0.350 with a standard deviation of 0.049

Roughness is apparent in the 10-day implied volatility, with H=0.35. For the 60-day implied volatility, I measure a roughness H=0.41 with a standard deviation of 0.06, so again not significantly different from the standard Brownian motion roughness. The roughness of the implied volatility skew, also available in this data, is more pronounced.

Another interesting analysis is the roughness of the VIX index, which can be mapped to the price of a newly issued 30-day variance swap on the SPX index. I am actually slightly surprised that the various papers on rough volatility have not looked at it already. I just took the quotes from Yahoo. With K=51, I find H=0.347 with a standard deviation of 0.026. This is close to the 10-day implied volatility roughness measured previously.

VIX roughness index sliding estimate with K=51.

density of VIX roughness index estimate with K=51. The mean is 0.347 with a standard deviation of 0.026

What about the Heston model? What is the estimate of the roughness index on a simulated stochastic variance path? Using one time-step per day for 5 years, and K=25, we find H=0.505 with a standard deviation of 0.032 for the roughness of the instantaneous variance, as well as for that of the one-month at-the-money implied volatility (using v(t) and constant Heston parameters). This is, as expected, the standard roughness of the underlying Brownian motion.
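For a sanity check along the same lines, here is a Julia sketch which simulates a Heston variance path with a full-truncation Euler scheme (one step per business day for 5 years, hypothetical parameters) and feeds its log-volatility to the estimator sketched earlier; the estimate should come out close to H=0.5.

function heston_variance_path(; v0 = 0.04, kappa = 2.0, theta = 0.04, xi = 0.3,
                                dt = 1.0 / 252, n = 5 * 252)
    v = Vector{Float64}(undef, n + 1)
    v[1] = v0
    for i in 1:n
        vplus = max(v[i], 0.0)   # full truncation: use v+ in both drift and diffusion
        v[i+1] = v[i] + kappa * (theta - vplus) * dt + xi * sqrt(vplus * dt) * randn()
    end
    return max.(v, 1e-12)
end

h = hurst_gjr(log.(sqrt.(heston_variance_path())))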

To conclude, the Hurst index H of the implied volatility and of the VIX index appears to be not as low as that of the 5-minute historical volatility. The estimated roughness H=0.35 is much closer to the standard Brownian motion roughness. In fact, for the 30-day implied volatility (with daily quotes), the measure is even closer to 0.5.

Looking at the 5-minute historical volatility may make sense for pricing intraday options, but it is not clear how important it is for traditional derivatives pricing (with maturities longer than a week). Does the historical microstructure really matter there?

After writing all that, I found the paper by Livieri et al. (2017) titled Rough volatility: evidence from option prices, which also measures the roughness of the implied volatilities. Interestingly, I did not reproduce some of the claims in that paper: I find a large bias in the implied volatility proxy even for a low H value. The large bias, combined with the uncertainty in the measurement, makes the implied volatility proxy not very practical for estimating the true Hurst index of the underlying volatility process. Another discrepancy with the paper is around the Heston model: I do not find any apparent roughness at all in the Heston model.

I added this post, with more details, as a note on arXiv, just in case someone needs to reference it. I intend to place some of the Julia code I wrote in the context of the paper on GitHub, although I have not found the time yet.

Volatility: Rough or Not? A Short Review

It is well-known that the assumption of constant volatility in the Black-Scholes model for pricing financial contracts is wrong and may lead to serious mispricing, especially for any exotic derivative contracts. A classic approach is to use a deterministic local volatility model to take into account the variation both in the time dimension and in the underlying asset price dimension. But the model is still not realistic in terms of forward smile (the implied volatilities of forward starting options). A stochastic volatility component must be added to correct for it. More recently, the concept of rough volatility emerged in many academic papers. Instead of using a classic Brownian motion for the stochastic volatility process, a fractional Brownian motion is used. The idea of using a fractional Brownian motion for financial time-series can be traced back to Mandelbrot, but it is only relatively recently that it has been proposed for the volatility process (and not the stock process).

After many published papers on the subject, Rama Cont and Purba Das recently added their preprint Rough Volatility: Fact or Artefact? on SSRN, raising the question of whether using a rough volatility process makes sense from a practical point of view. Indeed, they show that the Hurst index measured is different for the realized volatility (even based on very short sampling intervals) and for the instantaneous volatility. The authors conjecture that this is all due to the discretization error (a.k.a. the microstructure noise), and that, in fact, a standard Brownian motion is compatible with the low Hurst index of the realized volatility.

As mentioned in their conclusion, they are not alone in showing that the rough volatility assumption may not be justified: LCG Rogers also published a note in 2019, Things we think we know, with the same conclusion.

In a 2019 paper, Masaaki Fukasawa, Tetsuya Takabatake and Rebecca Westphal also attempt to answer the question Is volatility rough?. They present the same problem with the estimation method of Rosenbaum and Gatheral, namely that their measure may actually not measure the true Hurst index, and that this could be why it is always close to H=0.1, regardless of the market index. It is shown that, for specific models, the Rosenbaum and Gatheral measure depends on the period length considered (5 minutes vs. 1 minute lead to different Hurst indices). The authors therefore devise a more robust (and much more complex) way to estimate the Hurst index of a time series, which is not so sensitive to the period length considered. They find that the market volatility, measured through the realized variance proxy, is rough, with H=0.04, rougher than what Jim Gatheral suggested.

To conclude, the study of Gatheral and Rosenbaum, which motivated the use of rough volatility, was based on “biased” estimates of the Hurst index of the realized volatility of high-frequency data, as evidenced in the three papers cited above. Clearly then, this justification does not hold. But Fukasawa et al. suggest the volatility is rougher, and it is slightly surprising that Rama Cont and Purba Das do not take those results more into account. They briefly discuss them, but then go on to propose another biased estimate (one which seems to lead to the same results as Rosenbaum and Gatheral). To their credit, Rama Cont and Purba Das show that the microstructure noise is far from IID, while Fukasawa et al. assume it is close to IID in their method. Perhaps their indirect conclusion is that the Fukasawa et al. estimates are also biased (i.e. measuring something else), although the results presented in their paper seem solid.

Now, there are also other interesting features of the rough volatility models, such as the variation of the at-the-money skew in time, which mimics the market implied skews in practice, a feature not easily achieved by stochastic volatility models (it typically requires a multidimensional stochastic volatility process to approach it even crudely).

Monte-Carlo Parallelization: to vectorize or not?

When writing a Monte-Carlo simulation to price financial derivative contracts, the most straightforward approach is to code a loop over the number of paths, in which each path is fully calculated. Inside the loop, a payoff function takes this path and computes the present value of the contract on the given path. The present values are recorded to produce the Monte-Carlo statistics (mean, standard deviation). I ignore here any callability of the payoff, which may still be addressed with some workarounds in this setup. The idea can be schematized by the following Go code:

for i := 0; i < numSimulations; i++ {
	pathGenerator.ComputeNextPath(&path) //path contains an array of n time-steps of float64.
	pathEvaluator.Evaluate(&path, output)
	statistics.RecordValue(output)
}

A Python programmer would likely not write a simulation this way, as the Python code inside the large loop would be slow. Instead, the Python programmer would write a vectorized simulation, generating all paths at the same time:

pathGenerator.ComputeAllPaths(&paths) //paths contains an array of n time-steps of vectors of size numSimulations
pathEvaluator.EvaluateAll(&paths, output)
statistics.RecordValues(output)

The latter approach sounds nice, but may easily blow up the memory if all paths are stored in memory. One solution is to store only packages of paths, and to run the simulation multiple times, being careful as to where we start the random numbers along the way. It is possible to further optimize memory by not keeping each full path in memory, but just the current path values, along with a list of vector variables which track the payoff path dependency. This requires a special Brownian-bridge implementation in the context of Sobol quasi-random numbers, as described in Jherek Healy’s book.

We can thus summarize the various approaches as (A) iterative on scalars, (B) fully vectorized, (C) iterative on vectors. I intentionally abuse the language here, as the scalars are not really scalars but one path composed of values at all relevant times: in this terminology, scalar means a single path.

Which one is most appropriate?

I first heard about this debate at a QuantLib user meeting a long time ago. Alexander Sokol was arguing for (B), as it simplified the implementation of AAD, while Peter Caspers was more of an (A) kind of guy. At the time, I had not fully grasped the interest of (B), as it seemed more complex to put in place properly.

It is interesting to think about those in terms of parallelization. At first, it may seem that (A) is simple to parallelize: you just process N/M packages, where M is the number of processors and N the total number of paths. Unfortunately, such a design implies a stateful payoff, which is not necessarily trivial to make thread-safe (actually very challenging). So the first drawback is that you may have to limit the parallelization to the path generation and perform the payoff evaluation in a single thread. And then, if you think more about it, you notice that while the first package is being computed, the payoff evaluation will just wait there. Sure, other packages will be computed in parallel, but this may again blow up the memory, as you essentially keep all the paths of each package in memory. In practice, we would thus need to compute each path in some sort of thread pool and push those to a path queue of fixed (blocking) length, which is read by a single thread that takes a path from the queue, evaluates the payoff on this path, and pushes it back to a free queue in order to recycle the path, as in the sketch below. Alternatively, some sort of ring buffer structure could be used for the paths. One also has to be quite careful about the random number generator skipping to the correct position, in order to keep the same sequence of random numbers as in a fully single-threaded approach.
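Below is a sketch of this queueing design in Julia (the snippets above are in Go, but the idea is language-agnostic): paths are generated in parallel, pushed to a bounded queue, evaluated by a single consumer, and the path buffers are recycled through a free queue to cap memory. compute_next_path! and evaluate_payoff are placeholders, not a real API.

function simulate(n_simulations, n_steps; n_buffers = 2 * Threads.nthreads())
    indices = Channel{Int}(n_buffers)                        # simulation indices to process
    free = Channel{Vector{Float64}}(n_buffers)               # recycled path buffers
    full = Channel{Tuple{Int,Vector{Float64}}}(n_buffers)    # generated paths waiting for evaluation
    for _ in 1:n_buffers
        put!(free, Vector{Float64}(undef, n_steps))
    end
    feeder = Threads.@spawn begin
        for i in 1:n_simulations
            put!(indices, i)
        end
        close(indices)
    end
    producers = map(1:Threads.nthreads()) do _
        Threads.@spawn for i in indices
            path = take!(free)
            compute_next_path!(path, i)   # placeholder: skip the RNG to position i, fill the path
            put!(full, (i, path))
        end
    end
    Threads.@spawn begin
        foreach(wait, producers)
        close(full)
    end
    sum_pv = sum_pv2 = 0.0
    for (_, path) in full                 # payoff evaluation stays single-threaded
        pv = evaluate_payoff(path)        # placeholder: may be stateful, no locking needed
        sum_pv += pv
        sum_pv2 += pv * pv
        put!(free, path)                  # recycle the path buffer
    end
    wait(feeder)
    mean_pv = sum_pv / n_simulations
    return mean_pv, sqrt(max(sum_pv2 / n_simulations - mean_pv^2, 0.0) / n_simulations)
end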

Taking parallelization into account, (A) is not such a trivial approach anymore. In contrast, (B) or (C) parallelize the payoff evaluation by design, making the implementation of the full simulation parallelization much easier. Arguably, the package-of-paths approach may consume less memory than (C), as (1) the path generation usually involves many intermediate time-steps which may be discarded (kept only in cache per core), while in a vectorized approach the full vector of those intermediate time-steps may be needed (typically for a classical Brownian-bridge variance reduction), and (2) the payoff evaluates scalars (and its state is thus comprised of scalars). A clever vectorized payoff evaluation would however reuse most vectors, so the memory reduction may not be real in practice.

Now, if we include callability or particle method kind of evaluation, full vectorization, but keeping only the payoff (vector) variables in memory and building up the path step by step may be the most effective as described in Jherek Healy’s book.

To conclude, not vectorizing the path generation and payoff evaluation would be a design mistake nowadays.

More Automatic Differentiation Awkwardness

This blog post from Jherek Healy presents some not-so-obvious behavior of automatic differentiation, when a function is decomposed into the product of two parts, one going to infinity and the other to zero, while we know the overall result must go to zero (or to some other specific number). This decomposition may be relatively simple to handle for the value of the function, but becomes far less trivial to think of in advance at the derivative level.

I played around myself with Julia’s excellent ForwardDiff package, and found a completely different example which illustrates a simple and easy trap with automatic differentiation. The example is a typical cubic spline evaluation routine:

function evaluate(self, z)
  i = searchsortedfirst(self.x, z)  # x[i-1]<z<=x[i]
  if z == self.x[i]
    return self.a[i]
  end
  if i > 1
    i -= 1
  end
  h = z - self.x[i]
  return self.a[i] + h * (self.b[i] + h * (self.c[i] + h * (self.d[i])))
end

ForwardDiff will lead to a derivative equal to 0 at z = x[i], while in reality it is not: the derivative is b[i]. The case z == x[i] is just a (not so clever) optimization. Granted, it is wrong only at the spline knots, but if those correspond to key points we evaluate as part of a minimization, it might become very problematic.
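One possible AD-friendly rewrite, assuming the usual per-knot coefficient layout a, b, c, d: dropping the equality shortcut means that, at a knot, the polynomial of the left interval is evaluated at its right end, where both the value and the derivative match a[i] and b[i] by continuity of the spline.

function evaluate_ad_friendly(self, z)
  # interval index such that x[i] <= z <= x[i+1], clamped so that the end polynomials extrapolate
  i = clamp(searchsortedfirst(self.x, z), 2, length(self.x)) - 1
  h = z - self.x[i]
  return self.a[i] + h * (self.b[i] + h * (self.c[i] + h * self.d[i]))
end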

Another interesting remark to make on automatic differentiation is on the choice of backward (reverse) vs. forward (tangent) automatic differentiation: we often read that backward is more appropriate when the dimension of the output is low compared to the dimension of the input, and forward when the dimension of the output is large compared to the dimension of the input. This is however not strictly true. A good example is the case of a least-squares minimization. The objective could be seen as a one-dimensional output: the sum of squares of the values of an underlying function. But in reality, it is really N-dimensional, where N is the number of terms in the sum, as one sum involves N calculations of the underlying function. And thus, for least-squares, backward automatic differentiation will likely be slower than forward since the number of terms in the sum is typically larger than the number of input parameters we are trying to fit.
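A small Julia illustration of that point: the gradient of the sum of squares is twice the transposed Jacobian of the residuals times the residual vector, and with p parameters and N >> p residuals, the N x p Jacobian is cheap in forward mode. Here model, xs and ys are assumed to be defined elsewhere.

using ForwardDiff

residuals(p) = [model(p, x) - y for (x, y) in zip(xs, ys)]   # N entries, N >> length(p)

function sumsq_gradient(p)
    r = residuals(p)
    J = ForwardDiff.jacobian(residuals, p)   # N x p, roughly p forward passes
    return 2 .* (J' * r)                     # gradient of sum(abs2, residuals(p))
end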

Interestingly, the Matlab documentation mentions explicitly this use case:

lsqnonlin defaults to forward AD when the number of elements in the objective vector is greater than or equal to the number of variables. Otherwise, lsqnonlin defaults to reverse AD.

Quadprog in Julia

As described on Wikipedia, a quadratic programming problem with n variables and m constraints is of the form $$ \min_x \left(-d^T x + \frac{1}{2} x^T D x\right) $$ subject to the constraints \( A^T x \geq b_0 \), where \(D\) is an \(n \times n\)-dimensional real symmetric matrix, \(A\) is an \(n \times m\)-dimensional real matrix, \( b_0 \) is an \(m\)-dimensional vector of constraints, \( d \) is an \(n\)-dimensional vector, and the variable \(x\) is an \(n\)-dimensional vector.

Solving convex quadratic programming problems happens to be useful in several areas of finance. One of the applications is to find the set of arbitrage-free option prices closest to a given set of market option prices as described in An arbitrage-free interpolation of class C2 for option prices (also published in the Journal of Derivatives). Another application is portfolio optimization.
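To fix ideas, here is a simplified Julia sketch of how the closest arbitrage-free call prices problem maps to this standard form (it only keeps the monotonicity and convexity constraints, not the full set of conditions of the paper). It only builds the QP data (D, d, A, b0); plug them into the quadratic programming solver of your choice.

using LinearAlgebra

# minimize 1/2 x' D x - d' x  subject to  A' x >= b0, x being the repaired call prices
function closest_convex_decreasing_qp(strikes, prices, weights)
    n = length(strikes)
    D = Matrix(Diagonal(weights))      # least-squares objective sum_i w_i * (x_i - c_i)^2, up to a constant
    d = weights .* prices
    constraints = Vector{Float64}[]
    for i in 1:n-1                     # monotonicity: x[i] - x[i+1] >= 0
        row = zeros(n); row[i] = 1.0; row[i+1] = -1.0
        push!(constraints, row)
    end
    for i in 2:n-1                     # convexity (butterfly): second divided difference >= 0
        h0 = strikes[i] - strikes[i-1]; h1 = strikes[i+1] - strikes[i]
        row = zeros(n); row[i-1] = 1 / h0; row[i] = -(1 / h0 + 1 / h1); row[i+1] = 1 / h1
        push!(constraints, row)
    end
    A = reduce(hcat, constraints)      # n x m, so that the constraints read A' x >= b0
    return D, d, A, zeros(length(constraints))
end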

In a blog post dating from 2018, Jherek Healy found that the sparse solve_qp solver of Scilab was the most efficient across various open-source alternatives. The underlying algorithm is actually Berwin Turlach’s quadprog, originally coded in Fortran and available as an R package. I had used this algorithm to implement the techniques described in my paper (and even proposed a small improvement regarding the machine-epsilon accuracy treatment, now included in the latest version of quadprog).

Julia, like Python, offers several good convex optimizers. But those support a richer set of problems than the basic standard quadratic programming problem, and as a consequence, they are not optimized for our simple use case. Indeed, I have ported the algorithm to Julia and found a 100x performance increase over the COSMO solver on the closest arbitrage-free option prices problem. Below is a table summarizing the results (also detailed on the GitHub repo main page).

Solver                 RMSE                    Time (ms)
GoldfarbIdnaniSolver   0.0031130349998388157   0.205
COSMO                  0.0031130309597602370   21.485
SCS                    0.0031130429769795193   8.381

Positive Stochastic Collocation

In the context of my thesis, I explored the use of stochastic collocation to capture the marginal densities of a positive asset. Indeed, most financial asset prices must be non-negative, but the classic stochastic collocation toward a normally distributed random variable is not.

A simple tweak, proposed early on by Grzelak, is to assume absorption and to use the put-call parity to price put options (which otherwise depend on the left tail). This sort of works most of the time, but a priori, there is no guarantee that we will end up with a positive put option price. As an extreme example, we may consider the case where the collocation price formula leads to \(V_{\textsf{call}}(K=0) < f \), where \(f \) is the forward price to maturity. The put-call parity relation applied at \(K=0 \) leads to \(V_{\textsf{put}}(K=0) = V_{\textsf{call}}(K=0)-f < 0 \). This means that for some strictly positive strike, the put option price will be negative, which is nonsensical. In reality, it thus implies that absorption must happen earlier, not at \(S=0 \), but at some strictly positive asset price. And then it is not so obvious to choose the right value in advance.

In the paper, I look at alternative ways of expressing the absorption, which do not have this issue. Intuitively however, one may wonder why we would go through the hassle of considering a distribution which may end up negative to model a positive price.

In finance, the assumption of a lognormal distribution of price returns is very common. The most straightforward approach would thus be to collocate toward a lognormal variate (instead of a normal one), and to use an increasing polynomial map from \([0, \infty) \) to \([0, \infty) \). There is no real numerical challenge in implementing the idea. However, it turns out not to work well, for reasons explained in the paper, one of them being a lack of invariance with regard to the lognormal distribution volatility.
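To make the idea concrete, here is a minimal Julia sketch of pricing a call by collocating the asset onto a lognormal variate Y = exp(mu + nu Z) through an increasing polynomial map g on \([0, \infty) \), using plain quadrature; the polynomial coefficients and parameters are hypothetical placeholders, and no moment or market matching is enforced here.

using QuadGK, Distributions

g(y, coeffs) = evalpoly(y, coeffs)   # increasing polynomial map from [0, Inf) to [0, Inf)

function collocation_call(strike; coeffs = (0.0, 0.9, 0.05), mu = 0.0, nu = 0.2)
    integrand(z) = max(g(exp(mu + nu * z), coeffs) - strike, 0.0) * pdf(Normal(), z)
    price, _ = quadgk(integrand, -8.0, 8.0)
    return price
end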

This did not make it directly to the final thesis, because it was not core to it. Instead, I explored absorption with the Heston driver process (although, in hindsight, I should have just considered mapping 0 to 0 in the monotonic spline extrapolation). I recently added the paper on the simpler case of positive collocation with a normal or lognormal process to arXiv.
