Backpropagation in CNN | Part 1 | Deep Learning
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Training a CNN via backpropagation reduces to computing gradients of the loss with respect to trainable parameters and updating them to minimize loss.
Briefing
Backpropagation for a simple CNN is built from a clear chain of derivatives: start with the loss from the final prediction, then push gradients backward through the fully connected (flattened) layer and into the convolution layer weights and biases. The practical payoff is that training reduces to computing four key gradients—∂L/∂W1, ∂L/∂b1, ∂L/∂W2, and ∂L/∂b2—then using them to update parameters so the loss drops.
The walkthrough begins with a minimal CNN flow: an input image X passes through a convolution filter to produce a feature map Z1, which is transformed by an activation to A1, then downsampled by max pooling to a smaller representation F. After flattening, a linear layer produces a single prediction neuron (for binary classification). In the example, the convolution yields a 3×3 feature map that pooling reduces to 2×2, and the flattened vector feeds a final neuron that outputs a scalar prediction A2 (treated as a probability or logit depending on the exact setup).
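The forward pass above can be sketched in NumPy. The input size, ReLU activation, sigmoid output, and 2×2 stride-1 pooling are assumptions chosen so the shapes match the walkthrough (3×3 feature map, 2×2 pooled map, 1×4 final weights); the transcript does not fix these details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: 5x5 input -> 3x3 conv filter -> 3x3 feature map
# -> 2x2 max pool (stride 1) -> flatten to 4 -> single output neuron.
X  = rng.standard_normal((5, 5))   # input image (size assumed for illustration)
W1 = rng.standard_normal((3, 3))   # convolution filter weights
b1 = 0.0                           # one scalar bias per filter
W2 = rng.standard_normal((1, 4))   # fully connected weights (1 x flattened size)
b2 = 0.0                           # output bias

def conv2d_valid(x, w):
    """Valid cross-correlation (the 'convolution' used in CNNs)."""
    k = w.shape[0]
    h = x.shape[0] - k + 1
    out = np.zeros((h, h))
    for i in range(h):
        for j in range(h):
            out[i, j] = np.sum(x[i:i+k, j:j+k] * w)
    return out

def maxpool_2x2_stride1(a):
    """2x2 max pooling with stride 1: 3x3 -> 2x2."""
    h = a.shape[0] - 1
    out = np.zeros((h, h))
    for i in range(h):
        for j in range(h):
            out[i, j] = np.max(a[i:i+2, j:j+2])
    return out

relu    = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

Z1 = conv2d_valid(X, W1) + b1       # 3x3 feature map
A1 = relu(Z1)                       # activation
F  = maxpool_2x2_stride1(A1)        # 2x2 pooled representation
f  = F.reshape(-1, 1)               # flatten to (4, 1)
Z2 = W2 @ f + b2                    # (1, 1) pre-activation
A2 = sigmoid(Z2)                    # scalar prediction in (0, 1)
```

The only trainable quantities here are W1, b1, W2, and b2; everything else is a deterministic function of them and the input.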
Only two places contain trainable parameters in this simplified architecture: the convolution filter weights W1 and bias b1, and the final fully connected weights W2 and bias b2. The convolution filter has a size like 3×3, and the bias is a single scalar per filter. The final layer’s weight matrix W2 connects the flattened pooled features to the single output neuron; its shape depends on the flattened feature size (for instance, 1×4 if the pooled feature map becomes 2×2).
For the loss, the setup targets binary classification (e.g., detecting a dog vs. a cat). The loss function is binary cross-entropy, computed per instance. For a batch, the loss becomes the average over M images, which matters because gradients scale accordingly.
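The per-instance binary cross-entropy and its batch average can be written directly; the clipping epsilon is a standard numerical-stability assumption, not something the transcript specifies.

```python
import numpy as np

def bce(y, a, eps=1e-12):
    """Per-instance binary cross-entropy: -(y*log(a) + (1-y)*log(1-a))."""
    a = np.clip(a, eps, 1.0 - eps)   # avoid log(0)
    return -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

# Hypothetical labels and predictions for a batch of M = 3 images.
y = np.array([1.0, 0.0, 1.0])
a = np.array([0.9, 0.2, 0.6])

per_instance = bce(y, a)            # one loss value per image
batch_loss = per_instance.mean()    # the 1/M factor that carries into gradients
```

Because the batch loss is a mean, every gradient derived from it inherits the 1/M scaling mentioned later in the notes.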
A logical diagram is used to formalize the forward pass: Z1 = X * W1 + b1, A1 = activation(Z1), pooling produces F, flattening feeds into the final linear step to produce A2, and the loss L is computed from A2 and the ground-truth label Y. Backpropagation then follows the chain rule. Gradients like ∂L/∂W2 are derived by tracking how changing W2 changes A2, and how changing A2 changes L. The same logic applies to b2, and then the gradients are propagated backward to the flattened layer and further toward W1 and b1.
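The final-layer gradients can be made concrete with a small sketch. It assumes a sigmoid output and binary cross-entropy (consistent with the setup above), under which the product ∂L/∂A2 · ∂A2/∂Z2 collapses to the familiar error term (A2 − y); the specific numbers are illustrative only.

```python
import numpy as np

# Hypothetical flattened pooled features and final-layer parameters.
f  = np.array([[0.5], [1.2], [0.0], [0.7]])   # (4, 1) flattened F
W2 = np.array([[0.1, -0.3, 0.8, 0.2]])        # (1, 4)
b2 = 0.05
y  = 1.0                                       # ground-truth label

Z2 = W2 @ f + b2                               # (1, 1) pre-activation
A2 = 1.0 / (1.0 + np.exp(-Z2))                 # sigmoid prediction

# Chain rule: dL/dW2 = (dL/dA2) * (dA2/dZ2) * (dZ2/dW2).
# With sigmoid + binary cross-entropy the first two factors simplify to (A2 - y).
dZ2 = A2 - y          # (1, 1) output "error"
dW2 = dZ2 @ f.T       # (1, 4): same shape as W2
db2 = dZ2             # scalar gradient for the output bias
dF  = W2.T @ dZ2      # (4, 1): gradient flowing back into the flattened layer
```

`dF` is the quantity that then propagates backward through flatten, pooling, and the convolution toward W1 and b1.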
A key conceptual emphasis is the “indirect connection” problem: parameters don’t connect to the loss directly; they influence intermediate tensors (the final pre-activation Z2, the prediction A2, and so on). The derivative is therefore expressed as a product of partial derivatives along the computational path (e.g., ∂L/∂W2 = ∂L/∂A2 · ∂A2/∂W2). The transcript also flags that convolution and pooling backpropagation require special derivative forms, which will be handled in later parts.
Finally, the batch case is addressed: when using mini-batches (say 32 or 64 images), the gradient shapes incorporate the batch dimension. The derivative of the averaged loss introduces scaling by the number of images M, so gradients effectively carry factors like 1/M and the resulting gradient matrices reflect batch-wise dimensions (e.g., A2 gradients becoming shaped like 1×4×M before aggregation). The session ends by previewing the next steps: computing backpropagation through convolution, flatten, and max pooling layers in subsequent parts, and then applying the full process to more complex CNN architectures.
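The batch case amounts to adding an image dimension and dividing by M. A minimal sketch, again assuming a sigmoid output with binary cross-entropy and hypothetical feature values, where each column of `F_flat` is one image's flattened pooled features:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 32                                   # mini-batch size (e.g., 32 or 64)

F_flat = rng.standard_normal((4, M))     # flattened features, one column per image
W2 = rng.standard_normal((1, 4))
b2 = 0.0
Y  = rng.integers(0, 2, size=(1, M))     # labels for the batch

Z2 = W2 @ F_flat + b2                    # (1, M): one pre-activation per image
A2 = 1.0 / (1.0 + np.exp(-Z2))           # (1, M) predictions

# The averaged loss L = (1/M) * sum_i BCE_i introduces the 1/M factor here:
dZ2 = (A2 - Y) / M                       # (1, M) per-image errors, pre-scaled
dW2 = dZ2 @ F_flat.T                     # (1, 4): batch dimension summed out
db2 = dZ2.sum()                          # scalar bias gradient
```

The matrix product over the batch dimension performs the per-image gradient aggregation that the notes describe, so `dW2` already matches the shape of W2.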
Cornell Notes
The core idea is to train a CNN by computing gradients of the loss with respect to its trainable parameters, then updating those parameters to reduce loss. In this simplified CNN, only four parameters matter: convolution weights W1 and bias b1, plus final fully connected weights W2 and bias b2. Backpropagation uses the chain rule to connect how a change in each parameter affects intermediate tensors (like Z1, A1, pooled features, A2) and ultimately the binary cross-entropy loss. The transcript derives gradients for the final layer first (through the flatten-to-output path), then explains how those gradients flow backward toward the convolution layer. It also notes that mini-batch training changes gradient scaling because the batch loss is an average over M images.
How does the simplified CNN produce a single prediction from an image, and where do the trainable parameters sit?
Why does backpropagation rely on the chain rule rather than a direct derivative from parameters to loss?
What loss function is used for the binary classification setup, and how does batch averaging affect gradients?
How is ∂L/∂W2 (and ∂L/∂b2) obtained in the simplified architecture?
What changes when moving from single-image training to mini-batch training?
Review Questions
- In the simplified CNN, list the four gradients that must be computed to train the model, and explain which layer each gradient belongs to.
- Why does the gradient of the loss with respect to W2 involve intermediate derivatives (like ∂L/∂A2), and what does that represent conceptually?
- How does averaging the loss over a mini-batch change the scaling or shape of gradients compared with single-image training?
Key Points
1. Training a CNN via backpropagation reduces to computing gradients of the loss with respect to trainable parameters and updating them to minimize loss.
2. In the simplified architecture, only convolution parameters (W1, b1) and final-layer parameters (W2, b2) are trainable, so gradients focus on these four quantities.
3. Backpropagation uses the chain rule to handle indirect parameter-to-loss influence through intermediate tensors like Z1, A1, pooled features, and A2.
4. Binary cross-entropy is used for the binary classification example, and mini-batch training averages per-instance losses, introducing 1/M scaling in gradients.
5. A logical forward-pass diagram (convolution → activation → max pooling → flatten → linear output → loss) makes the gradient flow easier to track.
6. Batch training adds a batch dimension to activations and gradients, so tensor shapes and scaling must be handled consistently when computing updates.