Neural Networks from Scratch - P.4 Batches, Layers, and Objects
Based on sentdex's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
Neural-network training shifts from single examples to mini-batches because batching improves both speed and learning stability, and the mechanics of that shift force a key change in how the matrix multiplication is set up. The core move is converting one input sample (a 1D feature vector) into a batch (a 2D matrix of many samples), then updating the math so the dot product still lines up correctly with the weight matrix.
The first reason for batching is computational efficiency: with batches, many operations can run in parallel, which is why training is typically done on GPUs rather than CPUs. The second reason is generalization. When a neuron sees one sample at a time, its fitted line “wiggles” as it chases individual points; showing multiple samples at once smooths that behavior, making the learned relationship less sensitive to any single training example. The transcript uses a visualization with 512 theoretical samples and shows that increasing batch size (e.g., from 1 to 4 to 16) reduces the movement of the fit line. It also warns against extremes: feeding all samples at once can encourage overfitting to in-sample data, harming performance on unseen data. Practical batch sizes are often around 32, with ranges like 32–64 common and 128 sometimes used.
After establishing why batches matter, the tutorial converts a single input list into a list-of-lists representing multiple samples, while keeping weights and biases tied to neurons rather than to batch size. That’s where a classic shape problem appears: once inputs become a batch matrix, the dot product no longer matches dimensions unless the weight matrix is transposed. The fix is to transpose weights so that each input row vector can multiply the appropriate columns of the weight matrix. With the corrected shapes, the result becomes a batch of outputs—one output vector per sample—followed by adding the bias row vector to every row of the matrix product.
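A minimal sketch of that shape fix, assuming NumPy; the sample values are illustrative placeholders in the spirit of the series, not quoted verbatim from the video:

```python
import numpy as np

# A batch of 3 samples, each with 4 features: shape (3, 4)
inputs = [[1.0, 2.0, 3.0, 2.5],
          [2.0, 5.0, -1.0, 2.0],
          [-1.5, 2.7, 3.3, -0.8]]

# One row of weights per neuron: 3 neurons x 4 inputs -> shape (3, 4)
weights = [[0.2, 0.8, -0.5, 1.0],
           [0.5, -0.91, 0.26, -0.5],
           [-0.26, -0.27, 0.17, 0.87]]

biases = [2.0, 3.0, 0.5]  # one bias per neuron

# (3, 4) dot (4, 3) -> (3, 3): transposing weights aligns the inner dimensions,
# and the bias row vector broadcasts onto every row of the product.
layer_outputs = np.dot(inputs, np.array(weights).T) + biases
print(layer_outputs)  # one output row per sample
```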
The next step is stacking layers. Adding a second dense layer is straightforward conceptually: create another set of weights and biases, compute layer 2 outputs from layer 1 outputs, and repeat. But manually copying and editing arrays quickly becomes unmanageable as layer counts grow. To address that, the transcript introduces an object-oriented “dense layer” class that stores weights and biases and provides a forward pass.
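Before the class appears, the manual version of a second layer looks roughly like this (the layer 1 outputs and the weights2/biases2 values are placeholders; the point is only that layer 1's output becomes layer 2's input):

```python
import numpy as np

# Output of layer 1 for a batch of 3 samples: shape (3, 3), placeholder values
layer1_outputs = np.array([[4.8, 1.21, 2.385],
                           [8.9, -1.81, 0.2],
                           [1.41, 1.051, 0.026]])

# Layer 2: 3 neurons, each taking layer 1's 3 outputs
weights2 = [[0.1, -0.14, 0.5],
            [-0.5, 0.12, -0.33],
            [-0.44, 0.73, -0.13]]
biases2 = [-1.0, 2.0, -0.5]

# (batch_size, 3) dot (3, 3) -> (batch_size, 3); same transpose-and-add pattern
layer2_outputs = np.dot(layer1_outputs, np.array(weights2).T) + biases2
```

Copying and renaming these arrays for every additional layer is exactly the bookkeeping the class removes.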
Inside the dense layer, weights are initialized as small random values (using a NumPy random generator with a fixed seed) and biases start at zeros. The initialization takes two parameters: the number of inputs (feature count per sample) and the number of neurons (output size). A key design choice is shaping weights as (n_inputs, n_neurons) so the forward pass can use inputs dot weights directly without transposing every time. The forward method computes inputs × weights + biases, producing an output matrix with one row per sample in the batch.
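A sketch of such a class, following the description above (the class name, the 0.10 scaling factor, and the fixed seed value are assumptions; the shapes are the point):

```python
import numpy as np

np.random.seed(0)  # fixed seed so runs are reproducible


class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        # Shape (n_inputs, n_neurons) so forward() needs no transpose
        self.weights = 0.10 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        # (batch_size, n_inputs) dot (n_inputs, n_neurons) -> (batch_size, n_neurons)
        self.output = np.dot(inputs, self.weights) + self.biases
```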
Finally, two dense layer objects are created (layer 1 with 4 inputs and 5 neurons; layer 2 with 5 inputs and 2 neurons). Passing a batch through layer 1 yields a (batch_size, 5) output, which becomes the input to layer 2, producing a (batch_size, 2) output. The transcript closes by noting that activation functions come next, followed by loss calculation and optimization of weights and biases.
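Continuing the Layer_Dense sketch above, the two-layer pass might look like this (X is a placeholder batch of samples with 4 features each):

```python
X = [[1.0, 2.0, 3.0, 2.5],
     [2.0, 5.0, -1.0, 2.0],
     [-1.5, 2.7, 3.3, -0.8]]

layer1 = Layer_Dense(4, 5)   # 4 features in, 5 neurons out
layer2 = Layer_Dense(5, 2)   # must accept layer 1's 5 outputs

layer1.forward(X)
layer2.forward(layer1.output)
print(layer2.output)  # shape (batch_size, 2)
```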
Cornell Notes
Batching turns a single feature vector into a matrix of many samples, enabling parallel computation on GPUs and improving generalization by reducing the “wiggle” caused by fitting one example at a time. Once inputs become a batch matrix, matrix multiplication requires careful shape alignment; transposing the weight matrix fixes dimension mismatches. Biases are added as a row vector to every output row. To scale beyond hand-coded layers, a Dense layer class stores weights and biases and implements a forward pass computing inputs × weights + biases. Weights are initialized with small random values and biases with zeros, and weights are shaped to avoid repeated transposes during forward passes.
Why does converting from single samples to batches help training, and what trade-off comes with larger batches?
What changes in the math when inputs become a batch matrix?
Why don’t weights and biases need to change when batch size changes?
How does adding biases work with batched outputs?
What design choice in the Dense layer avoids repeated transposes during forward passes?
How are weights and biases initialized, and why does initialization matter?
Review Questions
- When inputs become a batch matrix, what specific dimension mismatch causes the dot-product shape error, and how does transposing weights resolve it?
- Why does batch size affect generalization, and why can an excessively large batch size increase overfitting risk?
- In the Dense layer class, what are the shapes of inputs, weights, and outputs, and how do those shapes determine whether a transpose is needed?
Key Points
1. Batching improves training speed through parallel computation and improves generalization by reducing sensitivity to individual samples.
2. Batch size should be moderate; very large batches (including all samples) can increase overfitting to in-sample data.
3. When inputs become a batch matrix, matrix multiplication requires careful shape alignment; transposing weights fixes common dimension mismatches.
4. Biases are added as a row vector to every row of the batched matrix-product output.
5. Stacking layers means layer 1 outputs become layer 2 inputs; doing this manually scales poorly, so a Dense layer object is introduced.
6. A Dense layer forward pass computes inputs × weights + biases, and shaping weights as (n_inputs, n_neurons) avoids repeated transposes.
7. Weights are initialized with small random values and biases with zeros to keep early activations in a stable range.