More Automatic Differentiation Awkwardness
This blog post from Jherek Healy presents some not-so-obvious behavior of automatic differentiation: a function is decomposed into the product of two parts, where one part goes to infinity and the other to zero, while we know the overall result must go to zero (or to some other specific number). Such a decomposition may be relatively simple to handle for the value of the function, but it becomes far less trivial to think of in advance at the derivative level.
I played around myself with Julia’s excellent ForwardDiff package and found a completely different example, which illustrates a simple trap that is easy to fall into with automatic differentiation. The example is a typical cubic spline evaluation routine:
function evaluate(self, z)
    i = searchsortedfirst(self.x, z) # index i such that x[i-1] < z <= x[i]
    if z == self.x[i]
        return self.a[i] # shortcut: the spline value at a knot is simply a[i]
    end
    if i > 1
        i -= 1 # step back to the interval [x[i], x[i+1]] that contains z
    end
    h = z - self.x[i]
    return self.a[i] + h * (self.b[i] + h * (self.c[i] + h * self.d[i]))
end
ForwardDiff will report a derivative of 0 at z = x[i], while in reality it is not zero: the derivative at a knot is b[i]. The early return at z = x[i] is just a (not so clever) optimization: it returns the plain value a[i], which does not depend on z, so the dual part is lost. Granted, the derivative is wrong only at the spline knots, but if those happen to be key points we evaluate as part of a minimization, it might become very problematic.
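The trap is easy to reproduce. Below is a minimal sketch, assuming the evaluate routine above; the spline object is faked with a NamedTuple and the knots and coefficients are made up so that the true derivative at the interior knot z = 1 is b[2] = 20.

using ForwardDiff

# Illustrative piecewise cubic with knots 0, 1, 2; the NamedTuple stands in for `self`.
spline = (x = [0.0, 1.0, 2.0],
          a = [1.0, 10.0, 31.0],
          b = [2.0, 20.0, 0.0],
          c = [3.0, 1.0, 0.0],
          d = [4.0, 0.0, 0.0])

ForwardDiff.derivative(z -> evaluate(spline, z), 0.5) # 8.0, correct inside an interval
ForwardDiff.derivative(z -> evaluate(spline, z), 1.0) # 0.0, although the true derivative at the knot is b[2] = 20

At z = 1, the comparison z == self.x[i] holds for the dual number as well, and the routine returns the constant a[2], hence the zero derivative.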
Another interesting remark on automatic differentiation concerns the choice between backward (reverse) and forward (tangent) mode: we often read that backward is more appropriate when the dimension of the output is low compared to the dimension of the input, and forward when the dimension of the output is large compared to the dimension of the input. This is however not strictly true. A good example is least-squares minimization. The objective could be seen as a one-dimensional output: the sum of squares of the values of an underlying function. But in reality it is N-dimensional, where N is the number of terms in the sum, as one evaluation of the objective involves N evaluations of the underlying function. Thus, for least-squares, backward automatic differentiation will likely be slower than forward, since the number of terms in the sum is typically larger than the number of input parameters we are trying to fit.
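To make this concrete, here is a small sketch with ForwardDiff, assuming a made-up exponential-decay model with two parameters and one hundred observations; the model, data and names are purely illustrative.

using ForwardDiff

# Made-up data for an exponential-decay fit: N = 100 observations, 2 parameters.
ts = range(0.0, 1.0, length = 100)
ys = 2.0 .* exp.(-3.0 .* ts) .+ 0.01 .* randn(length(ts))
residuals(p) = p[1] .* exp.(-p[2] .* ts) .- ys

p0 = [1.0, 1.0]
# Forward mode propagates only length(p0) = 2 dual directions through the
# N-dimensional residual vector: the full Jacobian comes out of very few passes here.
J = ForwardDiff.jacobian(residuals, p0)   # 100 x 2 Jacobian
grad = 2.0 .* (J' * residuals(p0))        # gradient of sum(abs2, residuals(p0))
# Reverse mode on the scalar objective sum(abs2, residuals(p)) would instead have to
# record and sweep backwards through all N residual evaluations.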
Interestingly, the Matlab documentation mentions this use case explicitly:
lsqnonlin defaults to forward AD when the number of elements in the objective vector is greater than or equal to the number of variables. Otherwise, lsqnonlin defaults to reverse AD.