Understanding Backpropagation

I think it is useful to learn backpropagation outside the context of neural nets first, until you have built up some intuition for it. Once you have that intuition, the equations involved in backpropagating error through a neural net are much easier to understand.

Backpropagation is nothing more than the chain rule from calculus.

Chain Rule

For any nested function f(g(x)), the derivative f' is equal to the derivative of the “inner function”, g'(x), multiplied by the derivative of the “outer function” evaluated as if the inner function were just a variable.

Consider the function:

f(x) = (2x + 3)^5

This function can really be thought of as two nested functions, f(x) = h(g(x)), where:

g(x) = 2x + 3

h(x) = x^5

So, according to the chain rule, to find f' we find the derivative of g(x):

g'(x) = 2

and multiply it by the derivative of the outer function, h, evaluated as if the inner function were just a variable:

f'(x) = 2 * 5(2x + 3)^4 = 10(2x + 3)^4
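If you want to sanity-check this result, a quick numerical comparison works well. Here is a minimal Python sketch (the function names are mine, just for illustration) that compares the chain-rule derivative against a finite-difference estimate:

def f(x):
    return (2 * x + 3) ** 5

def f_prime(x):
    # chain rule: g'(x) * h'(g(x)) = 2 * 5 * (2x + 3)^4
    return 2 * 5 * (2 * x + 3) ** 4

x, eps = 1.5, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)  # central difference estimate
print(numeric, f_prime(x))                       # the two values should agree closely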

Backpropagation with a simple function

Let’s define a function, L , that is a function of many variables:

L = ((a * b) + c) * f

Let’s define some intermediate variables that are the result of operations on the variables we just defined:

  • a * b = e
  • e + c = d

With these new intermediate variables we can rewrite L as:

L = (e + c) * f

L = d * f

We can now assign values to our variables and evaluate L. This is analogous to initializing the weights and biases of a neural net, and doing a forward pass:

  • a = 2
  • b = -3
  • c = 10
  • f = -2

L = ((2 * -3) + 10) * -2 = -8
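For reference, here is that forward pass written out as plain Python; it is just the arithmetic above, nothing neural-net specific:

a, b, c, f = 2, -3, 10, -2

e = a * b   # -6
d = e + c   #  4
L = d * f   # -8
print(L)    # prints -8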

Here is a visual representation of our expression graph:

As you can see by looking at this graph, L can be written in terms of d and f like this:

L = d * f

To find the derivative of L with respect to d and f we can treat the other variable like a constant; the derivative of a constant times a variable is just that constant:

\dfrac{\partial L}{\partial d} = f = -2

\dfrac{\partial L}{\partial f} = d = 4

Let’s fill those gradients in on our graph:

Now we want to find the gradients for e and c. Let’s start with c.

Taking a look at our expression graph we can see two important facts:

  • d = e + c
  • L = d * f

In other words, d is a function of c, and L is a function of d. These are nested functions. So, to find the derivatives we can apply the chain rule.

We computed the derivative of the “outer function” in the previous step.

\dfrac{\partial L}{\partial d}

We only need to calculate the derivative of the “inner function” and multiply it by \dfrac{\partial L}{\partial d}.

\dfrac{\partial L}{\partial c} = \dfrac{\partial d}{\partial c} *  \dfrac{\partial L}{\partial d}

\dfrac{\partial L}{\partial c} =  1 *-2 = -2

The gradient for d propagates backward into our calculation of the gradients for c and e. The gradient for e will, in turn, propagate backward into our calculation of the gradients for a and b.

First we can find the gradient for e:

\dfrac{\partial L}{\partial e} = \dfrac{\partial d}{\partial e} * \dfrac{\partial L}{\partial d}

\dfrac{\partial L}{\partial e} = 1 *-2 = -2

Filling in those gradients on our expression graph:

Let’s finish this off and find the gradients for a and b. Taking another look at our graph:

  • e = a * b
  • L = (e + c) * f

These are nested functions and we can find their gradients using the chain rule. The derivative of the “outer function” this time is \dfrac{\partial L}{\partial e}, which we just computed. We only need to find the local derivatives and multiply them by it:

\dfrac{\partial L}{\partial a} = \dfrac{\partial e}{\partial a} * \dfrac{\partial L}{\partial e}

\dfrac{\partial L}{\partial a} =  -3 *-2 = 6

\dfrac{\partial L}{\partial b} = \dfrac{\partial e}{\partial b} * \dfrac{\partial L}{\partial e}

\dfrac{\partial L}{\partial b} =  2 *-2 = -4

Here’s the final expression graph:

We just found all of the gradients in this expression graph using backpropagation. In case it was not obvious, I called this function L because this is exactly what we will do for our loss function. It will be a more complicated function, but the calculations are very similar and they are all derived from the chain rule.
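If it helps to see the whole thing in one place, here is a short Python sketch of the forward and backward passes over this expression graph, following exactly the chain-rule steps above (the variable names mirror the graph):

a, b, c, f = 2, -3, 10, -2

# forward pass
e = a * b   # -6
d = e + c   #  4
L = d * f   # -8

# backward pass: chain rule, one node at a time
dL_dd = f           # -2
dL_df = d           #  4
dL_dc = 1 * dL_dd   # -2, since d = e + c
dL_de = 1 * dL_dd   # -2, since d = e + c
dL_da = b * dL_de   #  6, since e = a * b
dL_db = a * dL_de   # -4, since e = a * b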

The Four Fundamental Equations of Backpropagation

To make things easy for the initial explanation I will pretend that our neural net has only 3 neurons: 1 input, 1 hidden layer neuron, and 1 output neuron.

I think it is easier to understand how the equations are derived in the context of a simple net like this. Once you understand that, the only difference between it and a more complicated net is using vectors and matrices to group things together. The equations are essentially the same.

The First Fundamental Equation of Backpropagation

The real implementation of backpropagation I will show works on a layer-by-layer basis, but this is just for efficiency. In the toy example above we broke each operation down into its simplest possible expression. In either case the gradients you get will be the same; it does not matter how fine-grained you choose to break your nested functions down.

For each layer we will be calculating a value called the error, denoted by the Greek symbol \delta. The error is the partial derivative of the loss with respect to the weighted sum of the neuron:

\delta = \dfrac{\partial L}{\partial z}

First, we must calculate the error of the output layer.

The loss function is a function of the output. Let’s call the output, o:

L(o)

The output is a function of the weighted sum. It is the weighted sum after activation:

o = \sigma(z)

I hope this is starting to look familiar at this point. This is another nested function:

L(\sigma(z))

We can apply the chain rule to find the gradient of z:

\dfrac{\partial L}{\partial z} = \dfrac{\partial o}{\partial z} *  \dfrac{\partial L}{\partial o}

In the case of the quadratic loss function we are using, the derivative of the loss function with respect to the output is:

\dfrac{\partial L}{\partial o} = o - l

Where l is the label. If you were using a different loss function, your derivative would be different.

To find the derivative of the “inner function” we really just pass the weighted sum into the derivative of the activation function:

\dfrac{\partial o}{\partial z} = \sigma'(z)

Putting these two together yields our first fundamental equation:

\delta_{output} = (o - l) * \sigma'(z)
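As a sketch of what this might look like in code, assuming a sigmoid activation and the quadratic loss above (o, z, and label would come from your forward pass; the function names are mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(z, o, label):
    # first fundamental equation: delta_output = (o - l) * sigma'(z)
    return (o - label) * sigmoid_prime(z)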

The Second Fundamental Equation of Backpropagation

Now we need some way of using this error to find the error in the previous layers.

The weighted sum of any layer in the network is a function of the weighted sum of the layer before it. For any layer l, the weighted sum of the next layer is defined in terms of the weighted sum of the current layer by this equation:

z_{l + 1}= w_{l + 1} * \sigma(z_l) + b_{l + 1}

This is another application of the chain rule.

\dfrac{\partial L}{\partial z_l} =  \dfrac{\partial z_{l + 1}}{\partial z_l} * \dfrac{\partial L}{\partial z_{l + 1}}

For our hidden layer, we already know the derivative of the loss with respect to the weighted sum of the output layer. We calculated that in the first equation. It’s \delta_{output}. This is the derivative of our “outer function”. All we need to do is find the derivative of the “inner function”.

\dfrac{\partial z_{l+1}}{\partial z_l} = w_{l + 1} * \sigma'(z_l)

Putting these two together gives us:

\delta_l=  w_{l + 1} * \delta_{l + 1} * \sigma'(z_l)

This is the second fundamental equation.

The \delta_{l + 1} used in this equation does not need to be \delta_{output}. It only needs to be the error from the next layer in the net. This equation can be used to backpropagate error through any number of preceding layers.
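Continuing the same sketch (and reusing sigmoid_prime from above), the second equation is one line of code. Everything here is still a scalar because we only have one neuron per layer, so the transpose discussed at the end of the post is not needed yet:

def layer_error(w_next, delta_next, z_l):
    # second fundamental equation: delta_l = w_{l+1} * delta_{l+1} * sigma'(z_l)
    return w_next * delta_next * sigmoid_prime(z_l)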

The Third Fundamental Equation of Backpropagation

Now that we have the error for all of our layers we can use it to find the gradients of the weights and biases.

How do we find the gradient of the biases using the error? The error is the gradient of the bias. This is really no surprise because the weighted sum is a function of the bias:

z = w \cdot x + b

This is another application of the chain rule:

\dfrac{\partial L}{\partial b} = \dfrac{\partial z}{\partial b} * \dfrac{\partial L}{\partial z}

The derivative of the “outer function” is \delta:

\dfrac{\partial L}{\partial z} = \delta

Since b is just a constant being added to the product of the weights and inputs, any unit change in b will be a unit change in z. In other words:

\dfrac{\partial z}{\partial b} = 1

Putting it together:

\dfrac{\partial L}{\partial b} = 1 * \delta = \delta

The Fourth Fundamental Equation of Backpropagation

All that’s left is to find the gradient of the weights using the error. Once again we apply the chain rule:

\dfrac{\partial L}{\partial w} = \dfrac{\partial z}{\partial w} * \dfrac{\partial L}{\partial z}

\dfrac{\partial L}{\partial z} = \delta

Now we just need to find the local derivative, \dfrac{\partial z}{\partial w}. Remember how z is computed:

z = w * x + b

\dfrac{\partial z}{\partial w} = x

Putting these two together gives us:

\dfrac{\partial L}{\partial w} = x * \delta
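The third and fourth equations translate almost directly into code. In this sketch, x stands for the input feeding into the neuron whose weight we are updating:

def bias_gradient(delta):
    # third fundamental equation: dL/db = delta
    return delta

def weight_gradient(x, delta):
    # fourth fundamental equation: dL/dw = x * delta
    return x * delta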

Conclusion

That’s how we use backpropagation to find the gradients of the weights and biases in the network.

The only three changes we need to make to the equations to accommodate a neural net of arbitrary size are:

  1. All of the zs, bs, and \deltas in these equations would actually be column vectors. All of the ws would be matrices.
  2. For the second fundamental equation, the w_{l + 1} in the equation is actually the transpose of the weight matrix of the next layer, which would be more accurately written as (w_{l + 1})^T. In our one-neuron-per-layer example it didn’t matter, and I think the relationship between this equation and the chain rule is more obvious when w is just a scalar. In a real scenario you take the transpose of the next layer’s weight matrix to make sure that each error is multiplied by the same weights that the outputs from your layer were multiplied by when entering the next layer (see the sketch after this list).
  3. The multiplication of vectors in these equations represents the Hadamard product. The Hadamard product is just the result of entry-wise multiplication of two vectors.
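Putting those three changes together, here is a hedged NumPy sketch of the vectorized equations for a single hidden layer. W_next is the weight matrix of the next layer, delta_next its error (a column vector), z the weighted sums of the current layer, and a_prev the activations feeding into it; the names are mine, not from any particular library:

import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backprop_layer(W_next, delta_next, z, a_prev):
    # second equation: delta_l = (W_{l+1})^T delta_{l+1}, Hadamard product with sigma'(z_l)
    delta = (W_next.T @ delta_next) * sigmoid_prime(z)
    grad_b = delta               # third equation: dL/db = delta
    grad_W = delta @ a_prev.T    # fourth equation: dL/dW = delta * a_prev^T (outer product)
    return delta, grad_W, grad_b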

In the next blog post I will show how to actually implement the four fundamental equations of backpropagation in code.

