# The Neural Network in Your CPU

Aug 6, 2017 · 3 minute read · programming

Machine learning and artificial intelligence are the current hype (again). In its new Ryzen processors, AMD advertises Neural Net Prediction. It turns out this was already used in the older (2012) Piledriver architecture, found for example in the AMD A10-4600M. It is also present in recent Samsung processors such as the one powering the Galaxy S7. What is it really?

The basic idea can be traced to a paper by Daniel Jimenez and Calvin Lin, “Dynamic Branch Prediction with Perceptrons”, and is described more precisely in the subsequent paper “Neural methods for dynamic branch prediction”. Branches typically occur in `if-then-else` statements. Branch prediction consists in guessing which code branch, the `then` or the `else`, the program will execute, so that the processor can start executing the predicted branch speculatively instead of waiting for the condition to be evaluated.

Jimenez and Lin rely on a simple single-layer perceptron whose inputs are the branch outcome histories (global, or a hybrid of local and global) and whose output predicts whether the branch will be taken. In reality, because there is a single layer, the output y is simply a weighted sum of the inputs (the \( x_i \) and the constant 1):

$$ y = w_0 + \sum_{i=1}^n x_i w_i $$

\( x_i = +1 \) if the i-th branch in the history was taken and \( x_i = -1 \) if it was not. \( y > 0 \) predicts that the branch is taken.
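
The formula above can be sketched in a few lines of Python. This is only an illustration: the function names and the layout of the weight list (bias \( w_0 \) first) are assumptions made for this example, not something taken from the papers or from any real hardware.

```python
def perceptron_output(weights, history):
    """Compute y = w_0 + sum(x_i * w_i).

    weights[0] is the bias weight w_0 (its input is the constant 1),
    weights[1:] are w_1..w_n, and history holds x_1..x_n as +1 (taken)
    or -1 (not taken). This layout is an assumption of this sketch.
    """
    return weights[0] + sum(w * x for w, x in zip(weights[1:], history))


def predict_taken(weights, history):
    """Predict 'taken' when the output is strictly positive."""
    return perceptron_output(weights, history) > 0


# Example: 1 + 2*1 + (-3)*1 + 1*(-1) = -1, so the branch is predicted not taken.
example_weights = [1, 2, -3, 1]
example_history = [1, 1, -1]
example_prediction = predict_taken(example_weights, example_history)
```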

Ideally, each static branch is allocated its own perceptron. In practice, a hash of the branch address is used.
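
As a rough sketch of what sharing perceptrons through a hash could look like, the code below keeps a small table of weight vectors and selects one from the branch address. The table size, the history length and the hash itself (drop the two low address bits, then take the remainder) are all hypothetical choices made for this example; real processors do not document theirs here.

```python
HISTORY_LENGTH = 8   # hypothetical number of history bits
TABLE_SIZE = 64      # hypothetical number of perceptrons


def make_table(table_size=TABLE_SIZE, history_length=HISTORY_LENGTH):
    """One weight vector (bias + one weight per history bit) per table entry."""
    return [[0] * (history_length + 1) for _ in range(table_size)]


def table_index(branch_address, table_size=TABLE_SIZE):
    """Hypothetical hash: ignore the two low bits, then wrap around the table."""
    return (branch_address >> 2) % table_size


table = make_table()
# Two different branches can land on the same perceptron (aliasing):
same_slot = table_index(0x400100) == table_index(0x400200)
```

Because the table is much smaller than the number of possible addresses, distinct branches sometimes share (and disturb) the same weights, which is the price of not giving each static branch its own perceptron.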

Training consists in updating each weight according to the actual branch outcome \( t = \pm 1 \): \( w_i = w_i + 1 \) if \( x_i = t \), otherwise \( w_i = w_i - 1 \). But this is done only if the magnitude of the output \( |y| \) does not exceed the training (stopping) threshold or if the branch was mispredicted. The threshold prevents overtraining and allows the predictor to adapt quickly to changing behavior.
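
The training rule can be sketched as follows. The function name, the in-place update of a Python list and the threshold value used in the example are assumptions for illustration; the rule is applied to the magnitude of the output, and the bias weight is treated as a weight whose input is always 1.

```python
def train(weights, history, taken, threshold):
    """Apply one training step in place and return the weights.

    weights[0] is the bias (input fixed to 1), history holds +1/-1 values,
    taken is the actual outcome, threshold is the training threshold.
    """
    t = 1 if taken else -1
    y = weights[0] + sum(w * x for w, x in zip(weights[1:], history))
    mispredicted = (y > 0) != taken
    # Train only on a misprediction or when the output is not yet confident.
    if mispredicted or abs(y) <= threshold:
        weights[0] += t  # the bias input is 1, so it moves toward t
        for i, x in enumerate(history, start=1):
            weights[i] += 1 if x == t else -1
    return weights


w = [0, 0, 0]
train(w, [1, -1], taken=True, threshold=5)   # mispredicted: w becomes [1, 1, -1]
train(w, [1, -1], taken=True, threshold=5)   # correct but |y| = 3 <= 5: [2, 2, -2]
train(w, [1, -1], taken=True, threshold=5)   # correct and |y| = 6 > 5: unchanged
```

Once the output is beyond the threshold the weights stop growing, so a later change in the branch behavior needs only a few mispredictions to flip the prediction.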

The perceptron is one of those algorithms created by a psychologist. In this case, the culprit is Frank Rosenblatt. Another more recent algorithm created by a psychologist is particle swarm optimization, by James Kennedy. As in the case of particle swarm optimization, there is not a single well-defined perceptron, but many variations around some key principles. A reference seems to be the perceptron from H.D. Block, probably because he describes the perceptron in terms closer to code, while Rosenblatt was really describing a perceptron machine.

The perceptron from H.D. Block is slightly more general than the perceptron used for branch prediction:

- the output can be -1, 0 or 1. The output is zero if the weighted sum is below a threshold (a different constant from the training threshold of the branch prediction perceptron).
- reinforcement is not done on inactive connections, that is for \( x_i = 0 \).
- a learning rate \( \alpha \) is used to update the weight: \( w_i \leftarrow w_i + \alpha t x_i \).
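
The three differences listed above can be sketched as below. This is a loose illustration, not Block's formulation: the names are made up for this example, and reading "below a threshold" as "the magnitude of the weighted sum is below the threshold" is an assumption.

```python
def block_output(weights, inputs, output_threshold):
    """Three-valued output: 0 inside the threshold band, otherwise the sign."""
    s = sum(w * x for w, x in zip(weights, inputs))
    if abs(s) < output_threshold:  # assumption: threshold applies to |s|
        return 0
    return 1 if s > 0 else -1


def block_update(weights, inputs, target, alpha):
    """Return new weights w_i + alpha * t * x_i.

    Inactive connections (x_i == 0) are left unchanged automatically,
    since their update term is zero.
    """
    return [w + alpha * target * x for w, x in zip(weights, inputs)]


weights = [0.5, -0.25, 1.0]
inputs = [1, 0, -1]                      # the middle connection is inactive
undecided = block_output(weights, inputs, output_threshold=0.6)
updated = block_update(weights, inputs, target=1, alpha=0.5)
```

With \( \alpha = 1 \) and inputs restricted to \( \pm 1 \), the update reduces to the \( \pm 1 \) increments of the branch prediction perceptron.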

The perceptron used for branch prediction is quite different from the neural networks of the deep learning fad, which have many more layers and sometimes feedback loops. The challenge with those is the training: when many layers are added to the perceptron, the gradients of each layer's activation function multiply in the backpropagation algorithm. This makes the “effective” gradient at the first layers very small, which translates to tiny changes in the weights, making training not only very slow but also likely to get stuck in a sub-optimal local minimum. Besides the vanishing gradient problem, there is also the catastrophic interference problem to pay attention to. Today, those issues are dealt with through specific strategies to train and structure the network, combined with raw computational power that was unavailable in the 90s.
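
To get a feel for how fast the multiplied gradients shrink, here is a small numerical sketch. It assumes sigmoid activations and ignores the weights entirely, so it only shows the contribution of the activation derivatives; the function names are made up for this example.

```python
import math


def sigmoid_derivative(z):
    """Derivative of the sigmoid, s(z) * (1 - s(z)); its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_scale(num_layers, z=0.0):
    """Product of the activation derivatives across num_layers layers.

    Assumption: every layer sits at the same pre-activation z, which is
    the most favorable case when z = 0.
    """
    return sigmoid_derivative(z) ** num_layers


one_layer = gradient_scale(1)     # 0.25
ten_layers = gradient_scale(10)   # 0.25 ** 10, below one millionth
```

Even in this best case, ten layers scale the gradient reaching the first layer by less than a millionth, which is why the single-layer perceptron of the branch predictor does not face this problem.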