
Backpropagation in CNN | Part 1 | Deep Learning

CampusX · 5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Training a CNN via backpropagation reduces to computing gradients of the loss with respect to trainable parameters and updating them to minimize loss.

Briefing

Backpropagation for a simple CNN is built from a clear chain of derivatives: start with the loss from the final prediction, then push gradients backward through the fully connected (flattened) layer and into the convolution layer weights and biases. The practical payoff is that training reduces to computing four key gradients—∂L/∂W1, ∂L/∂b1, ∂L/∂W2, and ∂L/∂b2—then using them to update parameters so the loss drops.

The walkthrough begins with a minimal CNN flow: an input image X passes through a convolution filter to produce a feature map Z1, which an activation transforms into A1; max pooling then downsamples A1 into a smaller representation F. After flattening, a final linear layer produces a single prediction neuron (for binary classification). In the example, the feature map shrinks from 3×3 after convolution to 2×2 after pooling, and the flattened vector feeds the final neuron, which outputs a scalar prediction A2 (treated as a logit or probability depending on the exact setup).
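As a concrete companion to this flow, here is a minimal NumPy sketch of the forward pass. The specific choices — a 5×5 input, ReLU as the activation, 2×2 max pooling with stride 1, and a sigmoid on the output neuron — are assumptions chosen to reproduce the example shapes (3×3 conv output, 2×2 pooled map, 1×4 final weights); the transcript does not pin them down.

```python
import numpy as np

def conv2d_valid(X, W1, b1):
    """Valid cross-correlation of a single 2D input with one 2D filter plus a scalar bias."""
    kh, kw = W1.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    Z1 = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            Z1[i, j] = np.sum(X[i:i + kh, j:j + kw] * W1) + b1
    return Z1

def max_pool(A1, size=2, stride=1):
    """Max pooling with a small window; stride 1 turns a 3x3 map into 2x2."""
    oh = (A1.shape[0] - size) // stride + 1
    ow = (A1.shape[1] - size) // stride + 1
    F = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            F[i, j] = A1[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return F

rng = np.random.default_rng(0)
X  = rng.normal(size=(5, 5))              # input image (assumed size)
W1 = rng.normal(size=(3, 3)); b1 = 0.0    # convolution filter + scalar bias
W2 = rng.normal(size=(1, 4)); b2 = 0.0    # final layer: 4 flattened features -> 1 neuron

Z1 = conv2d_valid(X, W1, b1)              # 3x3 feature map
A1 = np.maximum(Z1, 0.0)                  # ReLU activation (assumed)
F  = max_pool(A1)                         # 2x2 pooled map
f  = F.reshape(4, 1)                      # flatten to a column vector
Z2 = W2 @ f + b2                          # linear step, shape (1, 1)
A2 = 1.0 / (1.0 + np.exp(-Z2))            # sigmoid prediction (assumed)
```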

Only two places contain trainable parameters in this simplified architecture: the convolution filter weights W1 and bias b1, and the final fully connected weights W2 and bias b2. The convolution filter has a size like 3×3, and the bias is a single scalar per filter. The final layer’s weight matrix W2 connects the flattened pooled features to the single output neuron; its shape depends on the flattened feature size (for instance, 1×4 if the pooled feature map becomes 2×2).

For the loss, the setup targets binary classification (e.g., detecting a dog vs. cat). The loss function is binary cross-entropy, computed per instance. For a batch, the loss becomes the average over N images, which matters because gradients scale accordingly.
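A short sketch of that loss, assuming the standard logistic form and a plain mean over the batch (the names bce, A2_batch, and Y_batch are illustrative, not from the transcript):

```python
import numpy as np

def bce(a2, y, eps=1e-12):
    """Binary cross-entropy for a prediction a2 in (0, 1) and a label y in {0, 1}."""
    a2 = np.clip(a2, eps, 1.0 - eps)  # guard against log(0)
    return -(y * np.log(a2) + (1.0 - y) * np.log(1.0 - a2))

# Per-instance loss for one prediction/label pair:
l_single = bce(0.9, 1)

# For a batch of N images the loss is the mean of the per-instance losses,
# which is why every gradient later picks up a 1/N factor.
A2_batch = np.array([0.9, 0.2, 0.7])
Y_batch  = np.array([1, 0, 1])
l_batch  = bce(A2_batch, Y_batch).mean()
```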

A logical diagram is used to formalize the forward pass: Z1 = X * W1 + b1, A1 = activation(Z1), pooling produces F, flattening feeds into the final linear step to produce A2, and the loss L is computed from A2 and the ground-truth label Y. Backpropagation then follows the chain rule. Gradients like ∂L/∂W2 are derived by tracking how changing W2 changes A2, and how changing A2 changes L. The same logic applies to b2, and then the gradients are propagated backward to the flattened layer and further toward W1 and b1.

A key conceptual emphasis is the “indirect connection” problem: parameters don’t connect to the loss directly; they influence intermediate tensors (Z2, A2, etc.). The derivative is therefore expressed as a product of partial derivatives along the computational path (e.g., ∂L/∂W2 = ∂L/∂A2 · ∂A2/∂W2). The transcript also flags that convolution and pooling backpropagation require special derivative forms, which will be handled in later parts.
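Written out for the final layer, the chain looks like the following, assuming a sigmoid output with binary cross-entropy (which makes ∂L/∂Z2 collapse to A2 − Y); F_flat denotes the flattened pooled features and is notation introduced here, not in the transcript:

```latex
% Chain rule for the final-layer parameters (sigmoid output + binary cross-entropy assumed)
\frac{\partial L}{\partial W_2}
  = \frac{\partial L}{\partial A_2}\,
    \frac{\partial A_2}{\partial Z_2}\,
    \frac{\partial Z_2}{\partial W_2}
  = (A_2 - Y)\, F_{\mathrm{flat}}^{\top},
\qquad
\frac{\partial L}{\partial b_2}
  = \frac{\partial L}{\partial A_2}\,
    \frac{\partial A_2}{\partial Z_2}\,
    \frac{\partial Z_2}{\partial b_2}
  = A_2 - Y
```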

Finally, the batch case is addressed: when using mini-batches (say 32 or 64 images), the gradient shapes incorporate the batch dimension. Because the averaged loss is divided by the number of images M, gradients effectively carry a 1/M factor, and the derivative matrices gain a batch axis (e.g., per-example W2 gradients stacking into a 1×4×M array before aggregation). The session ends by previewing the next steps: backpropagation through the convolution, flatten, and max pooling layers in subsequent parts, and then applying the full process to more complex CNN architectures.

Cornell Notes

The core idea is to train a CNN by computing gradients of the loss with respect to its trainable parameters, then updating those parameters to reduce loss. In this simplified CNN, only four parameters matter: convolution weights W1 and bias b1, plus final fully connected weights W2 and bias b2. Backpropagation uses the chain rule to connect how a change in each parameter affects intermediate tensors (like Z1, A1, pooled features, A2) and ultimately the binary cross-entropy loss. The transcript derives gradients for the final layer first (through the flatten-to-output path), then explains how those gradients flow backward toward the convolution layer. It also notes that mini-batch training changes gradient scaling because the batch loss is an average over M images.

How does the simplified CNN produce a single prediction from an image, and where do the trainable parameters sit?

The image X goes through convolution to form Z1 = X * W1 + b1, then an activation to produce A1. Max pooling downsamples A1 into a smaller feature map F, which is flattened into a vector. A final linear step with weights W2 and bias b2 maps the flattened features to a single output neuron, yielding the prediction A2. Trainable parameters appear only in two places: W1 and b1 in the convolution layer, and W2 and b2 in the final (flatten-to-output) layer.
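For the example dimensions above, the trainable parameters and their shapes can be summarized as follows (a sketch; the 3×3 filter and 1×4 final weights come from the example, the dict itself is just illustrative bookkeeping):

```python
# Trainable parameters in the simplified CNN and their shapes
param_shapes = {
    "W1": (3, 3),   # convolution filter weights
    "b1": (),       # one scalar bias per filter
    "W2": (1, 4),   # flattened 2x2 pooled map -> single output neuron
    "b2": (),       # scalar bias of the output neuron
}
```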

Why does backpropagation rely on the chain rule rather than a direct derivative from parameters to loss?

Parameters influence the loss indirectly through intermediate computations. For example, changing W2 changes the final pre-activation/linear output (often denoted Z2) and then changes A2, which then changes the loss L. So ∂L/∂W2 is written as a product of partial derivatives along the path, such as ∂L/∂A2 · ∂A2/∂W2. This “path-based” view handles indirect connections through tensors like Z2 and A2.
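One way to make the path-based view concrete is to check the chain-rule gradient for W2 numerically against finite differences. The sketch below assumes a sigmoid output with binary cross-entropy and uses toy values; none of the specific numbers come from the transcript.

```python
import numpy as np

rng = np.random.default_rng(1)
f  = rng.normal(size=(4, 1))     # flattened pooled features
W2 = rng.normal(size=(1, 4))     # final-layer weights
b2, Y = 0.0, 1.0                 # scalar bias, ground-truth label

def loss(W2):
    """Binary cross-entropy of the sigmoid prediction for the toy example."""
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ f + b2)))
    return (-(Y * np.log(A2) + (1.0 - Y) * np.log(1.0 - A2)))[0, 0]

A2 = 1.0 / (1.0 + np.exp(-(W2 @ f + b2)))
analytic = (A2 - Y) * f.T        # chain-rule result dL/dW2, shape (1, 4)

eps = 1e-6
numeric = np.zeros_like(W2)
for j in range(W2.shape[1]):     # central finite difference, one weight at a time
    Wp, Wm = W2.copy(), W2.copy()
    Wp[0, j] += eps
    Wm[0, j] -= eps
    numeric[0, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # expected: True
```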

What loss function is used for the binary classification setup, and how does batch averaging affect gradients?

The setup uses binary cross-entropy for a single instance: L = −[Y·log(A2) + (1−Y)·log(1−A2)], computed from the prediction A2 and the label Y. For a batch, the overall loss is the average over N images (the sum of per-instance losses divided by N). Because the loss is averaged, gradients inherit a scaling factor of 1/N (or 1/M in the later notation), which affects gradient magnitudes and batch-wise shapes.
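Compactly, the averaging step and its effect on the gradients can be written as follows, where θ is just a stand-in for any of W1, b1, W2, or b2:

```latex
% Averaged batch loss over N images and the resulting gradient scaling
L = \frac{1}{N}\sum_{i=1}^{N} L_i
\qquad\Longrightarrow\qquad
\frac{\partial L}{\partial \theta}
  = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial L_i}{\partial \theta}
\quad\text{for any trainable parameter } \theta
```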

How is ∂L/∂W2 (and ∂L/∂b2) obtained in the simplified architecture?

The transcript treats the final layer as a linear mapping from the flattened features to the prediction neuron. It computes ∂L/∂Z2 (or the equivalent derivative with respect to the final linear output) and then uses the linear relationship to get ∂L/∂W2 and ∂L/∂b2. Concretely, the gradient with respect to W2 becomes an outer-product-like combination of the upstream gradient (from the loss) and the input to that layer (the flattened features), while ∂L/∂b2 becomes the upstream gradient itself. The key is matching tensor shapes (e.g., W2 shaped like 1×4 if the flattened vector has length 4).
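A minimal sketch of those two gradients, assuming a sigmoid output with binary cross-entropy so that the upstream term simplifies to A2 − Y (the function name and variable names are illustrative):

```python
import numpy as np

def final_layer_grads(f, A2, Y):
    """Gradients dL/dW2 and dL/db2 for a single example.

    f  : flattened pooled features, shape (4, 1)
    A2 : sigmoid prediction, shape (1, 1)
    Y  : ground-truth label, 0 or 1
    Assumes sigmoid + binary cross-entropy, so dL/dZ2 = A2 - Y.
    """
    dZ2 = A2 - Y          # upstream gradient at the linear output, shape (1, 1)
    dW2 = dZ2 @ f.T       # outer product with the layer input, shape (1, 4) like W2
    db2 = dZ2             # bias gradient is just the upstream gradient
    return dW2, db2
```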

What changes when moving from single-image training to mini-batch training?

With mini-batches, predictions and intermediate activations are computed for M images simultaneously, so tensors gain a batch dimension. Since the loss is averaged across the batch, the gradient expressions include batch scaling (effectively dividing by M). The transcript notes that the corresponding derivative matrices reflect batch-wise dimensions (e.g., per-example gradients stacking into shapes like 1×4×M before aggregation), ensuring updates reflect the mean loss over the batch.
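A batched version of the final-layer gradients might look like the sketch below, assuming the flattened features are stacked column-wise into a (4, M) array and that the sigmoid + binary cross-entropy simplification still applies; the 1/M factor comes directly from averaging the loss.

```python
import numpy as np

def final_layer_grads_batched(F_flat, A2, Y):
    """Batched gradients dL/dW2 and dL/db2.

    F_flat : flattened pooled features for M images, shape (4, M)
    A2     : sigmoid predictions, shape (1, M)
    Y      : labels, shape (1, M)
    """
    M = F_flat.shape[1]
    dZ2 = (A2 - Y) / M                        # averaging the loss puts a 1/M factor on every gradient
    dW2 = dZ2 @ F_flat.T                      # (1, M) @ (M, 4) -> (1, 4), same shape as W2
    db2 = dZ2.sum(axis=1, keepdims=True)      # (1, 1), summed over the batch
    return dW2, db2
```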

Review Questions

  1. In the simplified CNN, list the four gradients that must be computed to train the model, and explain which layer each gradient belongs to.
  2. Why does the gradient of the loss with respect to W2 involve intermediate derivatives (like ∂L/∂A2), and what does that represent conceptually?
  3. How does averaging the loss over a mini-batch change the scaling or shape of gradients compared with single-image training?

Key Points

  1. Training a CNN via backpropagation reduces to computing gradients of the loss with respect to trainable parameters and updating them to minimize loss.

  2. In the simplified architecture, only convolution parameters (W1, b1) and final-layer parameters (W2, b2) are trainable, so gradients focus on these four quantities.

  3. Backpropagation uses the chain rule to handle indirect parameter-to-loss influence through intermediate tensors like Z1, A1, pooled features, and A2.

  4. Binary cross-entropy is used for the binary classification example, and mini-batch training averages per-instance losses, introducing 1/M scaling in gradients.

  5. A logical forward-pass diagram (convolution → activation → max pooling → flatten → linear output → loss) makes the gradient flow easier to track.

  6. Batch training adds a batch dimension to activations and gradients, so tensor shapes and scaling must be handled consistently when computing updates.

Highlights

The simplified CNN’s training boils down to four gradients: ∂L/∂W1, ∂L/∂b1, ∂L/∂W2, and ∂L/∂b2, derived through chain-rule backpropagation.
Even when a parameter doesn’t touch the loss directly, gradients still flow by multiplying partial derivatives along the computational path.
Binary cross-entropy drives the final-layer gradient, while max pooling and convolution require specialized backprop steps reserved for later parts.
Mini-batch averaging changes gradient scaling: the loss is averaged over M images, so gradients effectively include a 1/M factor.
Flattening turns pooled feature maps into a vector, creating the bridge where final-layer gradients can be computed before propagating backward further.
