Building our Neural Network - Deep Learning and Neural Networks with Python and Pytorch p.3

sentdex · 4 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Define the model by subclassing `nn.Module` and call `super().__init__()` so PyTorch can register layers correctly.

Briefing

The core work in this installment is building a complete feed-forward neural network in PyTorch: defining a model class, wiring fully connected layers, specifying how data flows through them, and producing class-probability outputs via log-softmax. The practical payoff is immediate—once the forward pass is in place, random “image-like” tensors can be pushed through the network to generate 10-class predictions, setting the stage for training in the next tutorial.

The model is implemented as a subclass of `nn.Module`, with an `__init__` method that constructs four linear layers. The input size is set to 784 because the network expects flattened MNIST-style images: 28×28 pixels are reshaped into a single vector of length 784. The hidden layers are configured as three stages of 64 neurons each (`784 → 64 → 64 → 64`), using `nn.Linear` for each fully connected transform. The final layer maps to 10 outputs (`64 → 10`), matching ten classes labeled 0 through 9.
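
A minimal sketch of that constructor, using the tutorial's conventional names (`Net`, `fc1` through `fc4`); the exact code in the video may differ slightly:

```python
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()                 # register layers with nn.Module
        self.fc1 = nn.Linear(28 * 28, 64)  # 784 flattened pixels -> 64
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 64)
        self.fc4 = nn.Linear(64, 10)       # 64 -> 10 class scores
```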

A key PyTorch detail is the need to call `super().__init__()` during initialization. Omitting it triggers a cryptic error (“cannot assign module before Module.__init__() call”), which the tutorial uses as a cautionary example of what goes wrong when the parent `nn.Module` initialization isn’t executed.

Data flow is defined in a `forward(self, X)` method. The input tensor is passed through `fc1`, `fc2`, and `fc3`, with a ReLU activation applied after each linear layer using `F.relu`. The output layer (`fc4`) is treated differently: it does not use ReLU. Instead, the network returns `F.log_softmax(X, dim=1)` to produce a log probability distribution over classes. The tutorial emphasizes that ReLU is appropriate for hidden layers to prevent values from exploding, while the output layer should be constrained to a probability-like interpretation suitable for multi-class classification.
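
Putting the constructor and forward pass together, a sketch in the spirit of the tutorial (not a verbatim transcription):

```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 64)
        self.fc4 = nn.Linear(64, 10)

    def forward(self, X):
        X = F.relu(self.fc1(X))         # ReLU after each hidden layer
        X = F.relu(self.fc2(X))
        X = F.relu(self.fc3(X))
        X = self.fc4(X)                 # output layer: no ReLU
        return F.log_softmax(X, dim=1)  # log probabilities over 10 classes
```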

To prove the wiring works, the tutorial generates random data shaped like a 28×28 image. Passing a raw 28×28 tensor into the model causes a size mismatch because the network expects flattened vectors of length 784. The fix is reshaping with `view(-1, 28*28)`, where `-1` tells PyTorch to infer the batch dimension automatically. After reshaping, the model produces outputs for each of the 10 classes.
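
Roughly what that sanity check looks like, assuming the `Net` class sketched above:

```python
import torch

net = Net()                # the class sketched above
X = torch.rand((28, 28))   # random "image-like" data
X = X.view(-1, 28 * 28)    # flatten to [1, 784]; -1 infers the batch size
output = net(X)
print(output.shape)        # torch.Size([1, 10]): one log probability per class
```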

Finally, the tutorial notes that the network’s first passes are effectively untrained: weights aren’t meaningfully initialized yet, so predictions are not reliable. Still, the forward pass now returns something usable for later steps—computing loss and gradients—so the next stage can adjust weights to improve accuracy. It also highlights PyTorch’s flexibility: logic can be embedded inside `forward`, enabling more complex conditional architectures later, while gradients are handled automatically.

Cornell Notes

A PyTorch neural network is built by subclassing `nn.Module`, defining four fully connected layers, and implementing a `forward` method that controls how tensors move through the network. The input is flattened from 28×28 into 784 features, then passed through hidden layers of size 64 with ReLU activations. The output layer produces 10 class scores, converted into log probabilities using `F.log_softmax(X, dim=1)` for multi-class classification. A common pitfall is forgetting `super().__init__()`, which causes module initialization errors. Another pitfall is feeding unflattened image tensors, which triggers a size-mismatch error until the input is reshaped with `view(-1, 28*28)`.

Why does the network expect 784 inputs instead of 28×28 images directly?

Each `nn.Linear` layer operates on flat feature vectors. For MNIST-style inputs, 28×28 pixels are reshaped into a single vector of length 784 (28*28). The tutorial explains this as flattening rows of pixels into one long row, so the first linear layer can accept the data as `784 → 64`.
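
An illustrative check of that flattening (not from the video; `img` and `flat` are made-up names):

```python
import torch

img = torch.rand((28, 28))                 # one MNIST-style image
flat = img.view(28 * 28)                   # rows laid end to end: 784 values
print(flat.shape)                          # torch.Size([784])
print(torch.equal(flat, img.reshape(-1)))  # True: same values, new shape
```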

What does `super().__init__()` do in a PyTorch `nn.Module` subclass, and what happens if it’s omitted?

`super().__init__()` runs the parent `nn.Module` initialization. Without it, PyTorch can’t properly register submodules like `nn.Linear` layers, leading to an error such as “cannot assign module before Module.__init__() call.” The tutorial uses this as a concrete example of a common initialization mistake.
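
A minimal reproduction of the mistake; the class name `Broken` is hypothetical:

```python
import torch.nn as nn

class Broken(nn.Module):
    def __init__(self):
        # super().__init__() deliberately omitted
        self.fc1 = nn.Linear(784, 64)

# Broken() raises:
# AttributeError: cannot assign module before Module.__init__() call
```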

How is the forward pass structured, and where does ReLU belong?

The forward pass applies linear transforms and activations in the hidden layers: `X = F.relu(self.fc1(X))`, then `X = F.relu(self.fc2(X))`, then `X = F.relu(self.fc3(X))`. The final layer (`self.fc4`) is not followed by ReLU; instead it feeds into `F.log_softmax` so the outputs become a class-wise log probability distribution for multi-class prediction.

Why use `F.log_softmax` on the output layer instead of ReLU?

ReLU is meant for hidden layers to keep activations from exploding. The output layer needs a constrained interpretation across classes—ideally one class is most likely, with others having smaller probabilities. `F.log_softmax(X, dim=1)` produces log probabilities over the 10 class scores, making the output suitable for multi-class classification.
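
A small illustration of the difference, with made-up scores:

```python
import torch
import torch.nn.functional as F

scores = torch.randn(1, 10)  # raw output-layer scores for one sample
print(F.relu(scores))        # clipped at 0 but unbounded above; no distribution
probs = torch.exp(F.log_softmax(scores, dim=1))
print(probs.sum(dim=1))      # tensor([1.0000]): a proper distribution over classes
```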

What does `dim=1` mean in `F.log_softmax(X, dim=1)`?

`dim` selects which axis gets normalized into a probability distribution. With batched outputs shaped like `[batch_size, 10]`, `dim=1` normalizes across the 10 class scores for each item in the batch. The tutorial contrasts this with `dim=0`, which would normalize across the batch dimension instead.
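
Seen concretely (illustrative values):

```python
import torch
import torch.nn.functional as F

out = torch.randn(4, 10)  # batch of 4 samples, 10 class scores each
print(torch.exp(F.log_softmax(out, dim=1)).sum(dim=1))  # 4 ones: normalized per sample
print(torch.exp(F.log_softmax(out, dim=0)).sum(dim=0))  # 10 ones: normalized across the batch
```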

Why does reshaping with `view(-1, 28*28)` fix the size mismatch error?

The model’s first linear layer expects input vectors of length 784. Feeding a tensor shaped like `[28, 28]` doesn’t match that expectation. Reshaping to `[batch_size, 784]` (using `-1` to infer batch size automatically) aligns the tensor dimensions with the network’s `784` input requirement.
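
The failure and the fix side by side, again assuming the `Net` sketch from earlier:

```python
import torch

net = Net()                   # model sketched earlier
bad = torch.rand((28, 28))
try:
    net(bad)                  # [28, 28] does not match [batch, 784]
except RuntimeError as err:
    print("size mismatch:", err)

good = bad.view(-1, 28 * 28)  # [1, 784]: -1 infers the batch dimension
print(net(good).shape)        # torch.Size([1, 10])
```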

Review Questions

  1. What exact tensor shape must be fed into the network for the first linear layer to work, and how does `view(-1, 28*28)` ensure it?
  2. Explain why ReLU is applied after `fc1`, `fc2`, and `fc3` but not after the final `fc4` layer.
  3. In `F.log_softmax(X, dim=1)`, what axis is normalized, and why is that axis the correct one for multi-class outputs?

Key Points

  1. Define the model by subclassing `nn.Module` and call `super().__init__()` so PyTorch can register layers correctly.

  2. Use `nn.Linear` layers to map `784 → 64 → 64 → 64 → 10`, where 784 comes from flattening 28×28 images.

  3. Implement a `forward(self, X)` method that applies ReLU activations after hidden layers to control activation growth.

  4. Convert final class scores into log probabilities with `F.log_softmax(X, dim=1)` for multi-class classification.

  5. Flatten inputs before passing them to the network; unflattened 28×28 tensors cause size mismatch errors.

  6. Reshape batches with `view(-1, 28*28)` so the batch dimension is flexible while features match the expected 784 input size.

Highlights

  • The network’s output layer uses `F.log_softmax(X, dim=1)` rather than ReLU, turning raw scores into a class-wise log probability distribution over 10 categories.
  • For MNIST-style inputs, flattening 28×28 into 784 features is mandatory because `nn.Linear` expects vectors, not images.
  • Forgetting `super().__init__()` produces a module initialization error that prevents layer assignment from working.
  • A size mismatch during inference is often just a shape issue: the fix is reshaping with `view(-1, 28*28)` before calling the model.

Topics

  • Neural Network Construction
  • PyTorch nn.Module
  • Fully Connected Layers
  • Forward Pass
  • Log Softmax

Mentioned

  • `nn` (PyTorch’s `torch.nn` module)
  • `F` (the conventional alias for `torch.nn.functional`)