
Training Convnet - Deep Learning and Neural Networks with Python and Pytorch p.6

sentdex · 5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Build the ConvNet with Conv2D layers followed by max pooling and ReLU, then reshape the conv output before the first linear layer.

Briefing

Convolutional neural networks can be trained end-to-end in PyTorch by building the model architecture, working out the “flattened” size between the convolutional and fully connected layers, and then running a basic training loop with an optimizer and loss function. The practical bottleneck in this workflow is not the training itself but determining the correct input dimension for the first linear layer after multiple Conv2D and max-pooling operations, because PyTorch does not provide a simple “flatten” layer whose output size can be read off, the way some other frameworks do.

The tutorial starts by defining a custom neural network class that inherits from nn.Module. It constructs three 2D convolution layers (Conv2d) with a 5×5 kernel and increasing channel depth: the first maps the single grayscale input channel to 32 feature maps, and subsequent layers expand to 64 and then 128 channels. After each convolution, max pooling with a 2×2 window and a ReLU activation are applied to reduce spatial dimensions while keeping the model expressive. This convolutional stack produces a tensor that still has height and width, so it must be reshaped before feeding it into the linear layers.
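
As a rough illustration of why that reshape is needed, here is what the conv/pool stack does to a single 50×50 grayscale input; the layer names are illustrative, while the 5×5 kernels and the 1→32→64→128 channel progression follow the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 32, 5)    # single grayscale channel in, 32 feature maps out, 5x5 kernel
conv2 = nn.Conv2d(32, 64, 5)
conv3 = nn.Conv2d(64, 128, 5)

x = torch.randn(1, 1, 50, 50)                 # one 50x50 grayscale "image"
x = F.max_pool2d(F.relu(conv1(x)), (2, 2))    # ReLU + 2x2 max pooling after each conv
x = F.max_pool2d(F.relu(conv2(x)), (2, 2))
x = F.max_pool2d(F.relu(conv3(x)), (2, 2))
print(x.shape)  # torch.Size([1, 128, 2, 2]) -- still has height and width
```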

To handle that reshape, the code performs a “dummy forward pass” during initialization. A random input tensor shaped like the dataset images (batch size 1, channels 1, height 50, width 50) is passed through the convolution+pooling pipeline to measure the resulting output shape. The flattened dimension is then computed from that shape and used to set the input size of the first fully connected layer (FC1). The second linear layer (FC2) outputs two logits, matching a binary classification setup (e.g., cat vs. dog). The forward method then runs the same convolution/pooling stack, flattens the result with view, applies FC1 with ReLU, and finally returns FC2 outputs.
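
The summary does not reproduce the exact code, but a minimal sketch of the class might look like the following. The 1×1×50×50 dummy input, the 5×5 kernels, the channel progression, and the two output logits follow the description above; the attribute and method names and the 512-unit hidden width of FC1 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 5)   # 1 input channel -> 32 feature maps, 5x5 kernel
        self.conv2 = nn.Conv2d(32, 64, 5)
        self.conv3 = nn.Conv2d(64, 128, 5)

        # Dummy forward pass: push one random 50x50 "image" through the
        # conv/pool stack just to measure the flattened output size.
        self.fc1_input_dim = None
        x = torch.randn(50, 50).view(-1, 1, 50, 50)
        self.convs(x)

        self.fc1 = nn.Linear(self.fc1_input_dim, 512)  # 512 hidden units is an assumed width
        self.fc2 = nn.Linear(512, 2)                    # two logits: cat vs. dog

    def convs(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv3(x)), (2, 2))
        if self.fc1_input_dim is None:
            # channels * height * width of the conv output, per example
            self.fc1_input_dim = x[0].numel()
        return x

    def forward(self, x):
        x = self.convs(x)
        x = x.view(-1, self.fc1_input_dim)  # flatten everything except the batch dimension
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```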

Once the model is defined, training begins with standard PyTorch components: an Adam optimizer with learning rate 0.001 and mean squared error (MSE) loss. The dataset is converted into tensors, image pixel values are scaled from 0–255 down to 0–1 by dividing by 255, and the data is split into training and validation sets using a 10% validation fraction. Training uses mini-batches (batch size 100) and a single epoch on CPU, iterating through slices of the training tensors. Each step zeroes gradients, computes outputs, calculates loss, backpropagates with loss.backward(), and updates parameters via optimizer.step().
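
A sketch of that training setup, under the assumptions stated in the summary (Adam at lr=0.001, MSE loss against one-hot labels, pixel scaling by 255, a 10% validation split, batch size 100, one epoch). The `training_data` stand-in and variable names are illustrative; the `Net` class is the one sketched above.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in for the dataset built in the previous part: each entry is a
# (50x50 pixel array with 0-255 values, one-hot cat/dog label) pair.
training_data = [(np.random.randint(0, 256, (50, 50)), np.eye(2)[np.random.randint(0, 2)])
                 for _ in range(1000)]

X = torch.tensor(np.array([i[0] for i in training_data]), dtype=torch.float32).view(-1, 50, 50)
X = X / 255.0                                   # scale pixel values from 0-255 down to 0-1
y = torch.tensor(np.array([i[1] for i in training_data]), dtype=torch.float32)

VAL_PCT = 0.1                                   # hold out 10% for validation
val_size = int(len(X) * VAL_PCT)
train_X, test_X = X[:-val_size], X[-val_size:]
train_y, test_y = y[:-val_size], y[-val_size:]

net = Net()                                     # the class sketched above
optimizer = optim.Adam(net.parameters(), lr=0.001)
loss_function = nn.MSELoss()                    # MSE against the one-hot labels

BATCH_SIZE = 100
EPOCHS = 1

for epoch in range(EPOCHS):
    for i in range(0, len(train_X), BATCH_SIZE):
        batch_X = train_X[i:i + BATCH_SIZE].view(-1, 1, 50, 50)
        batch_y = train_y[i:i + BATCH_SIZE]

        optimizer.zero_grad()                   # clear gradients from the previous step
        outputs = net(batch_X)                  # forward pass
        loss = loss_function(outputs, batch_y)  # compute the loss
        loss.backward()                         # backpropagate
        optimizer.step()                        # update parameters
    print(f"epoch {epoch}, loss {loss.item():.4f}")
```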

Evaluation runs with torch.no_grad() and computes accuracy by taking argmax over predicted outputs and comparing to ground-truth labels. The reported accuracy is around 64%, better than random guessing but not yet strong. The tutorial closes by emphasizing that CPU training is slow and that the next step is moving the same code to GPU for faster experimentation, plus adding more epochs and better model analysis to decide when to stop training and how to compare architectures.
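
Continuing the sketch above, the evaluation pass described here might look like the following per-example loop (variable names are illustrative):

```python
correct = 0
total = 0
with torch.no_grad():                                    # no gradient tracking during evaluation
    for i in range(len(test_X)):
        real_class = torch.argmax(test_y[i])             # true class from the one-hot label
        net_out = net(test_X[i].view(-1, 1, 50, 50))[0]  # two logits for this example
        predicted_class = torch.argmax(net_out)
        if predicted_class == real_class:
            correct += 1
        total += 1
print("Accuracy:", round(correct / total, 3))
```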

Cornell Notes

The core task is building and training a ConvNet in PyTorch: define Conv2D + max-pooling layers, then connect them to fully connected layers for classification. The hardest engineering detail is figuring out the flattened size after convolutions and pooling, because the input dimension to the first linear layer depends on those operations. The solution used here is a dummy forward pass during initialization: send a random tensor through the conv/pool stack, read the resulting tensor shape, compute the flattened dimension, and set FC1 accordingly. Training then follows a standard loop with Adam (lr=0.001), MSE loss, mini-batches, and an accuracy check using argmax. This matters because correct shape wiring is required before any meaningful training can happen.

Why does the model need a dummy forward pass to size FC1 in PyTorch?

After multiple Conv2D and max-pooling layers, the output tensor still has spatial dimensions (height and width). The first linear layer must know how many features it will receive once that tensor is flattened. Instead of trying to manually compute the resulting dimensions (which is error-prone because pooling and convolution change sizes in non-obvious ways), the code runs a single random input through the conv/pool pipeline during initialization, inspects the output shape, and uses that to set FC1’s input size.

How does the forward pass transform data from images to class scores?

The forward method applies the three convolution blocks in sequence; each block runs a convolution followed by a ReLU activation and 2×2 max pooling. After the final convolution/pooling, the tensor is reshaped with view (flattening) into a 1D feature vector per example, then passed through FC1 with ReLU, and finally through FC2 to produce two output logits for the two classes.

What does “flatten” mean here, and how is it implemented?

Flattening converts the conv output tensor of shape (batch, channels, height, width) into (batch, channels*height*width). In this code, flattening is done with x.view(-1, self.fc1_input_dim) (conceptually: keep batch dimension, collapse the rest). The flattened dimension is computed earlier from the dummy forward pass and stored so the view operation matches the actual conv output size.
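
As a toy illustration, assuming the conv stack produces a 128×2×2 output per example (the shape the dummy pass would report for 50×50 inputs with this architecture):

```python
import torch

x = torch.randn(100, 128, 2, 2)        # (batch, channels, height, width) from the conv stack
flat = x.view(-1, 128 * 2 * 2)         # collapse channels*height*width for each example
print(flat.shape)                      # torch.Size([100, 512])
```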

How is accuracy computed during evaluation?

Evaluation runs under torch.no_grad() to avoid gradient tracking. For each batch, the model outputs two scores (logits). The predicted class is torch.argmax(outputs, dim=1), and the true class comes from the label tensor. Correct predictions are counted and accuracy is computed as correct/total.
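
A tiny self-contained example of the argmax comparison, using made-up logits and labels:

```python
import torch

outputs = torch.tensor([[0.2, 0.8],     # predicted class 1
                        [0.9, 0.1]])    # predicted class 0
labels = torch.tensor([1, 1])           # ground-truth classes
preds = torch.argmax(outputs, dim=1)    # dim=1 picks the higher of the two scores per row
accuracy = (preds == labels).float().mean().item()
print(preds, accuracy)                  # tensor([1, 0]) 0.5
```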

What training loop mechanics are essential in PyTorch for this ConvNet?

Each mini-batch step does: (1) optimizer.zero_grad() (or net.zero_grad()), (2) outputs = net(batch_X), (3) loss = loss_function(outputs, batch_Y), (4) loss.backward() to compute gradients, and (5) optimizer.step() to update parameters. Without these steps in order, the model won’t learn.

Why scale image pixels by dividing by 255?

The dataset pixel values start in the 0–255 range. Dividing by 255 rescales them to 0–1, which typically stabilizes optimization and makes gradients more manageable. The code applies this scaling when converting training inputs into tensors.
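
For instance, with a random 8-bit array standing in for an actual image:

```python
import numpy as np
import torch

raw = np.random.randint(0, 256, (50, 50), dtype=np.uint8)   # 8-bit pixel values in 0-255
img = torch.tensor(raw, dtype=torch.float32) / 255.0        # floats in [0, 1]
print(img.min().item(), img.max().item())
```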

Review Questions

  1. What specific tensor shape information must be known to define FC1, and how does the dummy forward pass provide it?
  2. Describe the order of operations in the forward method from convolution to pooling to flattening to linear layers.
  3. During evaluation, why is argmax taken along dim=1, and what does that correspond to in a two-class problem?

Key Points

  1. Build the ConvNet with Conv2D layers followed by max pooling and ReLU, then reshape the conv output before the first linear layer.
  2. Determine FC1’s input dimension by running a dummy forward pass and reading the resulting tensor shape, then flattening that size.
  3. Use a standard PyTorch training loop: zero gradients, forward pass, compute loss, backpropagate, and optimizer.step().
  4. Scale image inputs from 0–255 to 0–1 to improve training stability.
  5. Split data into training and validation sets (here, 90/10) and compute accuracy using argmax over the two output scores.
  6. Expect CPU training to be slow; moving to GPU is the next practical step for more epochs and faster experimentation.

Highlights

  • The biggest practical hurdle is wiring shapes correctly between convolution/pooling outputs and the first fully connected layer; the code solves it by measuring the flattened size via a dummy forward pass.
  • The model outputs two logits for binary classification and uses argmax(dim=1) to convert those logits into predicted class labels.
  • Training follows the canonical PyTorch pattern: optimizer.zero_grad(), forward, loss.backward(), optimizer.step().
  • Pixel scaling (divide by 255) is applied before training, turning 8-bit image values into normalized floats.
  • Accuracy improves beyond random guessing (around 64% reported), but CPU training speed limits how much experimentation can be done next.

Topics

  • ConvNet Architecture
  • PyTorch Shape Debugging
  • Training Loop
  • Max Pooling
  • GPU Acceleration

Mentioned

  • Ralph Rodriguez
  • David Lavoie
  • Anand