Backpropagation calculus | Deep Learning Chapter 4

TL;DR

Backpropagation gradients come from applying the chain rule along the computation path from each parameter to the cost.

Briefing Cornell Notes

Briefing

Backpropagation’s calculus boils down to one practical question: how much does the cost change when a single weight or bias nudges a network’s internal numbers? The core insight is that the derivative of the cost with respect to a parameter can be written as a product of simpler derivatives using the chain rule—turning a messy network into a sequence of local sensitivities. That product tells learning algorithms exactly which parameters most strongly affect the error, enabling gradient descent to step “downhill” on the cost surface.

The walkthrough starts with an intentionally tiny network: one neuron per layer, so the last layer depends on just one weight w(L) and one bias b(L). The last neuron’s activation a(L) comes from a nonlinear function (sigmoid or ReLU) applied to a weighted sum z(L) = w(L)·a(L-1) + b(L). For a single training example with target y, the cost is C0 = (a(L) − y)^2. To find how sensitive C is to w(L), the method treats the computation as a chain: a tiny change in w(L) nudges z(L), which nudges a(L), which then changes the cost. The derivative dC/dw(L) becomes the chain rule product of three ratios: (dC/da(L))·(da(L)/dz(L))·(dz(L)/dw(L)).

Each factor has a clear meaning. The derivative dC/da(L) = 2(a(L) − y) scales with the output error: if the network output is far from the target, even small parameter tweaks can swing the cost a lot. The middle term da(L)/dz(L) is the derivative of the chosen nonlinearity, capturing how responsive the activation is to changes in its input. The last term dz(L)/dw(L) = a(L-1) shows that the weight’s influence depends on how strongly the previous neuron is “on”—a direct nod to the “neurons that fire together wire together” intuition.

The same logic applies to biases: since z(L) = w(L)·a(L-1) + b(L), the derivative dz(L)/db(L) is 1, making the bias sensitivity almost the same as the weight case. Backpropagation also reveals how error signals propagate backward: the derivative of z(L) with respect to the previous activation a(L-1) equals the weight w(L). That backward sensitivity lets the calculation iterate layer by layer, producing partial derivatives for every weight and bias.

The transcript then generalizes to multiple neurons per layer. Activations become indexed (a(L)j), and weights become edge-specific (w(L)jk) connecting neuron k in layer L−1 to neuron j in layer L. The cost sums squared output errors across neurons in the final layer. The derivative with respect to a particular weight keeps the same chain-rule structure, but the derivative with respect to an activation in layer L−1 changes because that activation affects the cost through multiple outgoing connections—so contributions from all relevant paths add together. Once those activation sensitivities are known, the same backward process yields gradients for all incoming weights and biases. The result is a systematic calculus engine for building the gradient vector that gradient descent uses to minimize cost across training data.

Cornell Notes

The key move in backpropagation calculus is rewriting dC/d(parameter) as a chain-rule product of local derivatives. In a one-neuron-per-layer network, the last layer cost is C0 = (a(L) − y)^2 with a(L) = sigmoid/ReLU(z(L)) and z(L) = w(L)a(L−1) + b(L). The derivative dC/dw(L) factors into dC/da(L) = 2(a(L)−y), da(L)/dz(L) = derivative of the nonlinearity, and dz(L)/dw(L) = a(L−1). Bias derivatives are nearly identical because dz(L)/db(L)=1. With multiple neurons, indexing expands (a(L)j, w(L)jk), and an activation’s influence on the cost sums across all paths to the output layer.

Why does the derivative of the cost with respect to a weight become a product of three terms?

Because the computation from weight to cost is a chain: a small change in w(L) changes z(L), which changes a(L), which changes C. Using the chain rule, dC/dw(L) = (dC/da(L))·(da(L)/dz(L))·(dz(L)/dw(L)). Each factor corresponds to one link in the computation graph: cost sensitivity to activation, activation sensitivity to the nonlinearity input, and that input’s sensitivity to the weight.

What does dC/da(L) = 2(a(L) − y) tell you about learning?

It ties gradient magnitude directly to output error. If the network output a(L) is far from the target y, then (a(L) − y) is large, making dC/da(L) large in magnitude. That means small parameter nudges can produce bigger cost changes when the model is currently wrong.

How does the choice of activation function enter the gradient?

Through da(L)/dz(L), the derivative of the nonlinearity used to compute a(L) from z(L). If sigmoid or ReLU is used, its slope determines how responsive the activation is to changes in its input. This factor modulates the gradient before it reaches earlier layers.

Why is dz(L)/dw(L) equal to a(L−1) in the one-neuron-per-layer example?

Because z(L) = w(L)·a(L−1) + b(L). Differentiating z(L) with respect to w(L) treats a(L−1) as a constant multiplier, giving dz(L)/dw(L) = a(L−1). So the weight’s effect depends on how strongly the previous neuron is activated.

In a multi-neuron layer, why do derivatives with respect to earlier activations require summing multiple paths?

An activation in layer L−1 can influence several neurons in layer L, each of which contributes to the final cost. For example, an activation a(L−1)k affects a(L)0 and a(L)1 (and more in general), so the total sensitivity of the cost to a(L−1)k is the sum of contributions through each outgoing connection. That’s why the chain-rule structure remains, but the “activation-to-cost” derivative aggregates multiple routes.

Review Questions

For the one-neuron-per-layer network, write the chain-rule factorization for dC/dw(L) and identify what each factor represents.
How does the gradient with respect to a bias b(L) differ from the gradient with respect to the weight w(L) in the simplified setup?
In the multi-neuron case, what changes in the derivative with respect to an activation in layer L−1, and why does it require summing over multiple downstream neurons?

Key Points

1
Backpropagation gradients come from applying the chain rule along the computation path from each parameter to the cost.
2
In the last-layer one-neuron example, dC/da(L) = 2(a(L) − y), so gradient size scales with output error.
3
The activation function’s derivative appears as da(L)/dz(L), directly controlling how strongly errors propagate through nonlinearities.
4
For z(L) = w(L)a(L−1) + b(L), the weight derivative is dz(L)/dw(L) = a(L−1), linking learning to the previous layer’s activation strength.
5
Bias gradients are nearly the same as weight gradients because dz(L)/db(L) = 1 in the simplified model.
6
With multiple neurons, indexing expands (a(L)j, w(L)jk), and earlier activations affect the cost through multiple outgoing paths, so their sensitivities add up.

Highlights

The cost derivative with respect to a weight factors into three interpretable pieces: output-error sensitivity, nonlinearity slope, and the previous activation level.

Bias sensitivity is almost identical to weight sensitivity in the simplified last-layer setup because the weighted sum’s derivative with respect to bias is 1.

Backpropagation’s “backward” step comes from derivatives like dz(L)/da(L−1) = w(L), which pass sensitivity to earlier activations.

In multi-neuron layers, an activation’s influence on the cost is the sum of contributions through every downstream neuron it feeds.

Topics

Backpropagation Calculus
Chain Rule
Gradient Descent
Neural Network Derivatives
Activation Functions