Neural Networks from Scratch - P.6 Softmax Activation
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Softmax activation is designed for classification output layers because it converts raw scores into per-sample probability distributions that sum to 1.
Briefing
Softmax activation is introduced as the missing piece for classification networks: it turns raw output scores into a normalized probability distribution so training can measure “how wrong” predictions are. Instead of relying on per-neuron activations like ReLU, which clips negatives to zero and so destroys the relative differences between negative scores, softmax uses exponentiation plus normalization to produce outputs that sum to 1. In the ideal case, the correct class gets probability 1.0 while every other class drops to 0.0, making it possible to quantify error in a consistent, learnable way.
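In symbols (the standard softmax definition, added here for reference rather than quoted from the transcript), for sample i and class j out of C classes:

$$\mathrm{softmax}(z)_{i,j} = \frac{e^{z_{i,j}}}{\sum_{k=1}^{C} e^{z_{i,k}}}$$

The denominator is the sum over all classes for that same sample, which is what guarantees each sample's probabilities sum to 1.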
The core mechanics start with a simple vector of layer outputs. For prediction alone, picking the largest value works, but training requires a formal way to compare outputs relatively across classes. Softmax addresses two problems with naive approaches: (1) ReLU destroys negative information by clipping, so values like −20 and −1,000,000 become indistinguishable once zeroed; and (2) workarounds such as taking absolute values or squaring discard the sign of a score, which undermines stable optimization and backpropagation. Exponentiation resolves the negativity issue by mapping any input x to e^x, producing strictly positive values while still reflecting magnitude differences: e^1.1 is about 3.004, while e^−1.1 is about 0.3329.
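A quick check of those numbers in Python (math.exp is the standard-library exponential; the inputs are illustrative):

```python
import math

# Exponentiation maps any score to a strictly positive value
# while preserving relative magnitude.
print(math.exp(1.1))    # ~3.0042
print(math.exp(-1.1))   # ~0.3329

# Unlike ReLU, which would zero both of these, different negative
# scores remain distinguishable after exponentiation.
print(math.exp(-0.5))   # ~0.6065
print(math.exp(-2.0))   # ~0.1353
```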
After exponentiating, softmax normalizes by dividing each exponentiated score by the sum of all exponentiated scores in the output layer. That yields a probability distribution. The transcript walks through both a raw Python implementation (explicit loops) and a NumPy implementation (vectorized operations). It then extends the method from a single vector to a batch of outputs, emphasizing the importance of summing along the correct axis. On a 2D batch matrix, softmax must normalize each sample independently, so the sum is taken across classes (axis=1), and dimensions are preserved (keepdims=True) to ensure correct broadcasting during division.
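A sketch of both versions, in the spirit of the video's code (the sample numbers are illustrative layer outputs):

```python
import math
import numpy as np

# --- Raw Python: softmax for a single output vector ---
layer_outputs = [4.8, 1.21, 2.385]
exp_values = [math.e ** output for output in layer_outputs]
norm_base = sum(exp_values)
norm_values = [value / norm_base for value in exp_values]
print(norm_values)       # ~[0.895, 0.025, 0.080]
print(sum(norm_values))  # 1.0

# --- NumPy: softmax for a batch, one row per sample ---
batch_outputs = np.array([[4.8,  1.21,  2.385],
                          [8.9, -1.81,  0.2],
                          [1.41, 1.051, 0.026]])
exp_values = np.exp(batch_outputs)
# Sum across classes (axis=1); keepdims=True keeps a column shape
# so broadcasting divides each row by its own sum.
probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
print(probabilities)              # each row is a probability distribution
print(probabilities.sum(axis=1))  # [1. 1. 1.]
```

Summing with axis=0 instead would normalize down the columns, mixing scores from different samples, which is exactly the batch pitfall the transcript warns about.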
A practical numerical stability fix comes next: exponentiation can overflow when inputs are large. The solution is to subtract the maximum value in each sample’s class scores before exponentiating. This doesn’t change the final softmax probabilities because the same constant shift applies to every class score within a sample; it only prevents overflow by ensuring the largest adjusted score becomes 0, keeping every exponentiated value in the range (0, 1].
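A minimal sketch of the fix; the large scores are contrived to force overflow without the shift:

```python
import numpy as np

scores = np.array([[999.0, 1000.0, 1001.0]])

# Naive exponentiation overflows: np.exp(1001.0) -> inf
# (NumPy emits a RuntimeWarning and the later division yields nan).
naive = np.exp(scores)  # [[inf inf inf]]

# Stable version: subtract each sample's max so the largest shifted
# score is 0 and every exponential lands in (0, 1].
shifted = scores - np.max(scores, axis=1, keepdims=True)  # [[-2. -1.  0.]]
exp_values = np.exp(shifted)
probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
print(probabilities)  # [[0.0900 0.2447 0.6652]]
```

The shifted result is identical to what unshifted softmax would produce for any inputs small enough not to overflow, since the constant e^(−max) cancels in the numerator and denominator.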
Finally, the softmax activation class is implemented in the existing framework: it exponentiates the inputs minus each sample’s maximum, then divides by the per-sample sum to produce probabilities. The model is then assembled with two dense layers and a softmax output layer, using spiral data with three classes. With random initialization, the output probabilities start near an even split (roughly one-third per class), setting the stage for the next step, loss functions and training, covered in a subsequent video.
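Putting it together, a sketch of the assembled model in the style of the series; it assumes the nnfs helper package (pip install nnfs) for the spiral dataset and re-declares the dense layer and ReLU classes from earlier parts:

```python
import numpy as np
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()  # fixes the random seed and default dtype for reproducibility

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.10 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)

class Activation_Softmax:
    def forward(self, inputs):
        # Subtract the per-sample max for numerical stability,
        # then normalize by the per-sample sum across classes.
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)

X, y = spiral_data(100, 3)  # 300 points in 3 spiral-arm classes

dense1 = Layer_Dense(2, 3)   # 2 input features (x, y) -> 3 neurons
activation1 = Activation_ReLU()
dense2 = Layer_Dense(3, 3)   # 3 inputs -> 3 class scores
activation2 = Activation_Softmax()

dense1.forward(X)
activation1.forward(dense1.output)
dense2.forward(activation1.output)
activation2.forward(dense2.output)

print(activation2.output[:5])  # each row near [0.33, 0.33, 0.33] before training
```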
Cornell Notes
Softmax activation converts raw class scores from the output layer into probabilities that sum to 1 for each input sample. It does this by exponentiating each score (so negatives remain meaningful and outputs stay positive) and then normalizing by the sum of exponentials across classes. For batches, normalization must happen per sample by summing along the class axis (axis=1) while keeping dimensions for correct broadcasting. To avoid numerical overflow, softmax subtracts the maximum score in each sample before exponentiating; this shift leaves the final probabilities unchanged. The transcript ends by wiring softmax into a small two-layer network on spiral data and showing that random weights produce near-uniform class probabilities before training begins.
Why can’t a classification network just use ReLU in the output layer and then normalize?
How does softmax turn arbitrary scores into a probability distribution?
What goes wrong when implementing softmax on a batch if the wrong axis is summed?
Why subtract the maximum value before exponentiating, and does it change results?
What does the softmax output look like right after random initialization?
Review Questions
- In softmax, what exact computation ensures outputs sum to 1, and across which dimension should that sum be taken for a batch?
- Explain why subtracting max(inputs) before exponentiation prevents overflow without changing the final probabilities.
- Compare how ReLU and softmax treat negative output scores and why that matters for learning.
Key Points
1. Softmax activation is designed for classification output layers because it converts raw scores into per-sample probability distributions that sum to 1.
2. Exponentiation preserves relative magnitude information even for negative scores, unlike ReLU, which clips negatives to zero.
3. Normalization divides each exponentiated class score by the sum of exponentiated scores across all classes for that same sample.
4. Batch implementations must sum along the class axis (axis=1) and use keepdims=True to keep shapes compatible for broadcasting.
5. Numerical stability is improved by subtracting the maximum score per sample before exponentiating; this prevents overflow while leaving probabilities unchanged.
6. A two-layer dense network with a softmax output produces near-uniform class probabilities at initialization, which is expected before training and loss computation begin.