
Neural Networks from Scratch - P.4 Batches, Layers, and Objects

sentdex · 5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Batching improves training speed through parallel computation and improves generalization by reducing sensitivity to individual samples.

Briefing

Neural-network training shifts from single examples to mini-batches because it improves both speed and learning stability—and the mechanics of that shift force a key change in how matrix multiplication is done. The core move is converting one input sample (a 1D feature vector) into a batch (a 2D matrix of many samples), then updating the math so the dot product still lines up correctly with the weight matrix.

The first reason for batching is computational efficiency: with batches, many operations can run in parallel, which is why training is typically done on GPUs rather than CPUs. The second reason is generalization. When a neuron sees one sample at a time, its fitted line “wiggles” as it chases individual points; showing multiple samples at once smooths that behavior, making the learned relationship less sensitive to any single training example. The transcript uses a visualization with 512 theoretical samples and shows that increasing batch size (e.g., from 1 to 4 to 16) reduces the movement of the fit line. It also warns against extremes: feeding all samples at once can encourage overfitting to in-sample data, harming performance on unseen data. Practical batch sizes are often around 32, with ranges like 32–64 common and 128 sometimes used.

After establishing why batches matter, the tutorial converts a single input list into a list-of-lists representing multiple samples, while keeping weights and biases tied to neurons rather than to batch size. That’s where a classic shape problem appears: once inputs become a batch matrix, the dot product no longer matches dimensions unless the weight matrix is transposed. The fix is to transpose weights so that each input row vector can multiply the appropriate columns of the weight matrix. With the corrected shapes, the result becomes a batch of outputs—one output vector per sample—followed by adding the bias row vector to every row of the matrix product.
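The shape fix is easy to see directly in NumPy. The sketch below uses a batch of 3 samples with 4 features feeding 3 neurons (values in the spirit of the tutorial's running example); `np.dot(inputs, weights)` would raise a shape error, while transposing `weights` aligns the dimensions:

```python
import numpy as np

# Batch of 3 samples, 4 features each — shape (3, 4).
inputs = np.array([[1.0, 2.0, 3.0, 2.5],
                   [2.0, 5.0, -1.0, 2.0],
                   [-1.5, 2.7, 3.3, -0.8]])

# 3 neurons, each with 4 weights — shape (3, 4).
weights = np.array([[0.2, 0.8, -0.5, 1.0],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])

biases = np.array([2.0, 3.0, 0.5])

# (3, 4) · (3, 4) fails; (3, 4) · (4, 3) works — one output row per sample.
outputs = np.dot(inputs, weights.T) + biases
print(outputs.shape)  # (3, 3)
```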

The next step is stacking layers. Adding a second dense layer is straightforward conceptually: create another set of weights and biases, compute layer 2 outputs from layer 1 outputs, and repeat. But manually copying and editing arrays quickly becomes unmanageable as layer counts grow. To address that, the transcript introduces an object-oriented “dense layer” class that stores weights and biases and provides a forward pass.

Inside the dense layer, weights are initialized as small random values (using a NumPy random generator with a fixed seed) and biases start at zeros. The initialization takes two parameters: the number of inputs (feature count per sample) and the number of neurons (output size). A key design choice is shaping weights as (n_inputs, n_neurons) so the forward pass can use inputs dot weights directly without transposing every time. The forward method computes inputs × weights + biases, producing an output matrix with one row per sample in the batch.

Finally, two dense layer objects are created (layer 1 with 4 inputs and 5 neurons; layer 2 with 5 inputs and 2 neurons). Passing a batch through layer 1 yields a (batch_size, 5) output, which becomes the input to layer 2, producing a (batch_size, 2) output. The transcript closes by noting that activation functions come next, followed by loss calculation and optimization of weights and biases.
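That pipeline can be sketched with plain matrices (hypothetical random data; the shapes match the 4→5→2 configuration described above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))  # batch of 3 samples, 4 features each

# Layer 1: 4 inputs -> 5 neurons; Layer 2: 5 inputs -> 2 neurons.
w1, b1 = 0.10 * rng.standard_normal((4, 5)), np.zeros((1, 5))
w2, b2 = 0.10 * rng.standard_normal((5, 2)), np.zeros((1, 2))

out1 = X @ w1 + b1     # (3, 5): layer 1 output
out2 = out1 @ w2 + b2  # (3, 2): layer 1 output becomes layer 2 input
print(out1.shape, out2.shape)  # (3, 5) (3, 2)
```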

Cornell Notes

Batching turns a single feature vector into a matrix of many samples, enabling parallel computation on GPUs and improving generalization by reducing the “wiggle” caused by fitting one example at a time. Once inputs become a batch matrix, matrix multiplication requires careful shape alignment; transposing the weight matrix fixes dimension mismatches. Biases are added as a row vector to every output row. To scale beyond hand-coded layers, a Dense layer class stores weights and biases and implements a forward pass computing inputs × weights + biases. Weights are initialized with small random values and biases with zeros, and weights are shaped to avoid repeated transposes during forward passes.

Why does converting from single samples to batches help training, and what trade-off comes with larger batches?

Batches help in two ways. First, they enable parallel computation: larger batches mean more matrix operations can run simultaneously, which is why GPUs are used instead of CPUs. Second, batches improve generalization: showing multiple samples at once makes the fitted relationship less sensitive to any single point, reducing “wiggling” in the learned fit. The trade-off is that using all samples at once can encourage overfitting to in-sample data, so batch sizes are typically moderate (often around 32, sometimes 64, with 128 less common).

What changes in the math when inputs become a batch matrix?

Inputs move from a vector shape to a matrix shape (multiple samples). That changes the dot-product requirements: the inner dimensions must match, i.e., the column count (index-1 dimension) of the first operand must equal the row count (index-0 dimension) of the second. When inputs and weights are both shaped for single-sample multiplication, the batch version triggers a shape error. The fix is to transpose weights so that each input row vector multiplies the correct columns of the weight matrix, producing one output row per sample.

Why don’t weights and biases need to change when batch size changes?

Weights and biases belong to neurons, not to the number of samples being processed. Increasing batch size only changes how many input rows are fed through the same neuron connections. The layer’s output size (number of neurons) stays the same; only the number of output rows grows with the batch size.

How does adding biases work with batched outputs?

After computing the matrix product (inputs × weights), biases are added as a row vector to every row of the output matrix. Concretely, each output element at a given row and column is the matrix-product value plus the corresponding bias value for that neuron, so the same bias offsets apply across all samples in the batch.
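NumPy's broadcasting handles this automatically; a small illustration with made-up numbers:

```python
import numpy as np

product = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])  # matrix product: 2 samples x 3 neurons
biases = np.array([2.0, 3.0, 0.5])     # one bias per neuron

# The (3,) bias row broadcasts across both rows of the (2, 3) product.
outputs = product + biases
print(outputs)  # rows: [3. 5. 3.5] and [6. 8. 6.5]
```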

What design choice in the Dense layer avoids repeated transposes during forward passes?

The Dense layer initializes weights with shape (n_inputs, n_neurons). With that layout, the forward pass can compute inputs dot weights directly, producing outputs with shape (batch_size, n_neurons). This eliminates the need to transpose weights every time forward propagation runs.

How are weights and biases initialized, and why does initialization matter?

Weights are initialized as small random values (e.g., around 0 with a narrow range such as ±0.1) to keep activations from exploding as values propagate through layers. Biases are initialized to zeros, but the transcript notes a potential pitfall: if biases and weights lead to neurons outputting zeros initially, the network can become “dead” (all-zero outputs). In such cases, starting biases non-zero can help.
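A quick demonstration of why the scale matters, using hypothetical layer sizes: pushing the same input through several linear layers, small weights keep values bounded (they can even shrink toward zero, the "dead network" risk noted above), while large weights blow up within a few layers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64))
small, large = x.copy(), x.copy()

# Push the same input through 10 linear layers at two weight scales.
for _ in range(10):
    small = small @ (0.01 * rng.standard_normal((64, 64)))  # stays tiny
    large = large @ (5.0 * rng.standard_normal((64, 64)))   # explodes

print(np.abs(small).max(), np.abs(large).max())
```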

Review Questions

  1. When inputs become a batch matrix, what specific dimension mismatch causes the dot-product shape error, and how does transposing weights resolve it?
  2. Why does batch size affect generalization, and why can an excessively large batch size increase overfitting risk?
  3. In the Dense layer class, what are the shapes of inputs, weights, and outputs, and how do those shapes determine whether a transpose is needed?

Key Points

  1. Batching improves training speed through parallel computation and improves generalization by reducing sensitivity to individual samples.

  2. Batch size should be moderate; very large batches (including all samples) can increase overfitting to in-sample data.

  3. When inputs become a batch matrix, matrix multiplication requires careful shape alignment; transposing weights fixes common dimension mismatches.

  4. Biases are added as a row vector to every row of the batched matrix-product output.

  5. Stacking layers means layer 1 outputs become layer 2 inputs; doing this manually scales poorly, so a Dense layer object is introduced.

  6. A Dense layer forward pass computes inputs × weights + biases, and shaping weights as (n_inputs, n_neurons) avoids repeated transposes.

  7. Weights are initialized with small random values and biases with zeros to keep early activations in a stable range.

Highlights

Batching reduces the “wiggling” of a neuron’s fit line by exposing it to multiple samples at once, which supports generalization.
The moment inputs become a batch, dot products often fail due to shape mismatches—transposing weights restores the required dimension alignment.
Biases don’t become batch-specific; they’re broadcast across all samples by adding the same bias row vector to every output row.
A Dense layer class turns repeated manual layer math into reusable forward propagation, making multi-layer networks manageable.
Initializing weights with small random values helps prevent activation blow-ups as values propagate through layers.

Topics

  • Neural Networks From Scratch
  • Batches
  • Dense Layers
  • Matrix Transpose
  • Forward Pass