
Neural Networks from Scratch - P.6 Softmax Activation

sentdex · 4 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Softmax activation is designed for classification output layers because it converts raw scores into per-sample probability distributions that sum to 1.

Briefing

Softmax activation is introduced as the missing piece for classification networks: it turns raw output scores into a normalized probability distribution so training can measure “how wrong” predictions are. Unlike per-neuron activations such as ReLU, which clips negatives to zero and destroys the meaning of relative differences, softmax uses exponentiation plus normalization to produce outputs that sum to 1. In the ideal case, the correct class gets probability 1.0 while all other classes drop to 0.0, making it possible to quantify error in a consistent, learnable way.

The core mechanics start with a simple vector of layer outputs. For prediction alone, picking the largest value works, but training requires a formal way to compare outputs relatively across classes. Softmax addresses two problems with naive approaches: (1) ReLU destroys negative information by clipping, so values like −20 and −1,000,000 become indistinguishable once zeroed; and (2) linear rescaling, absolute values, or squaring discard or distort the sign of a score, which undermines stable optimization and backpropagation. Exponentiation resolves the negativity issue by mapping any input x to e^x, producing strictly positive values while still reflecting magnitude differences: e^1.1 ≈ 3.004, while e^−1.1 ≈ 0.333.
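A quick sketch of the exponentiation behavior described above (the −20 / −50 comparison is an illustrative stand-in for the transcript's extreme negatives):

```python
import numpy as np

# Exponentiation keeps negatives meaningful: every score maps to a
# strictly positive value whose magnitude still reflects the input.
pos = np.exp(1.1)    # ~3.004
neg = np.exp(-1.1)   # ~0.333
print(pos, neg)

# Unlike ReLU's clipping, exp() still distinguishes -20 from -50:
# both are tiny, but their ordering and ratio survive.
print(np.exp(-20.0), np.exp(-50.0))
```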

After exponentiating, softmax normalizes by dividing each exponentiated score by the sum of all exponentiated scores in the output layer. That yields a probability distribution. The transcript walks through both a raw Python implementation (explicit loops) and a NumPy implementation (vectorized operations). It then extends the method from a single vector to a batch of outputs, emphasizing the importance of summing along the correct axis. On a 2D batch matrix, softmax must normalize each sample independently, so the sum is taken across classes (axis=1), and dimensions are preserved (keepdims=True) to ensure correct broadcasting during division.
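The two implementations described above can be sketched as follows; the specific score values are illustrative, chosen in the spirit of the transcript's examples:

```python
import math
import numpy as np

# --- Raw Python (explicit loops): one sample's scores ---
layer_outputs = [4.8, 1.21, 2.385]
exp_values = [math.e ** output for output in layer_outputs]
norm_values = [value / sum(exp_values) for value in exp_values]
print(norm_values, sum(norm_values))  # a probability distribution summing to 1

# --- NumPy, extended to a batch of outputs ---
batch_outputs = np.array([[4.8, 1.21, 2.385],
                          [8.9, -1.81, 0.2],
                          [1.41, 1.051, 0.026]])
batch_exp = np.exp(batch_outputs)
# Sum across classes (axis=1); keepdims=True makes the denominator
# shape (batch, 1) so it broadcasts row-wise during division.
probabilities = batch_exp / np.sum(batch_exp, axis=1, keepdims=True)
print(probabilities.sum(axis=1))  # each row sums to 1
```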

A practical numerical stability fix comes next: exponentiation can overflow when inputs are large. The solution is to subtract the maximum value in each sample’s class scores before exponentiating. This doesn’t change the final softmax probabilities because the same constant shift applies to every class score within a sample; it only prevents overflow by ensuring the largest adjusted score becomes 0, keeping exponentiated values in a safer range (between 0 and 1).
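A minimal sketch of the stability fix, using deliberately large assumed scores that would overflow a naive `exp()`:

```python
import numpy as np

# Scores this large would overflow exp() directly (exp(1010) -> inf).
scores = np.array([[1000.0, 1010.0, 990.0]])

# Subtract the per-sample max so the largest adjusted score is 0 and
# every exponential lands in the safe range (0, 1].
shifted = scores - np.max(scores, axis=1, keepdims=True)  # [[-10, 0, -20]]
exp_values = np.exp(shifted)
probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
print(probabilities)  # finite probabilities, no overflow
```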

Finally, the softmax activation class is implemented in the existing framework: it computes exponentials of (inputs − max(inputs)) and then divides by the per-sample sum to produce probabilities. The model is then assembled with two dense layers and a softmax output layer, using spiral data with three classes. With random initialization, the output probabilities start near an even split (roughly one-third per class), setting the stage for the next step—loss functions and training—covered in a subsequent video.
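A sketch of how the pieces might fit together. The class names follow the transcript's framework, but the weight-initialization scale is an assumption, and random placeholder data stands in for the transcript's three-class spiral dataset:

```python
import numpy as np

np.random.seed(0)

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        # Small random weights (0.01 scale is an assumption here).
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)

class Activation_Softmax:
    def forward(self, inputs):
        # Subtract the per-sample max for numerical stability,
        # then exponentiate and normalize per sample.
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)

# Random stand-in for the spiral data: 300 samples, 2 features, 3 classes.
X = np.random.randn(300, 2)

dense1 = Layer_Dense(2, 3)
activation1 = Activation_ReLU()
dense2 = Layer_Dense(3, 3)
activation2 = Activation_Softmax()

dense1.forward(X)
activation1.forward(dense1.output)
dense2.forward(activation1.output)
activation2.forward(dense2.output)

print(activation2.output[:5])  # rows near [1/3, 1/3, 1/3] before training
```

With untrained random weights the raw scores are near zero, so the softmax output hovers around an even split, which matches the transcript's observation.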

Cornell Notes

Softmax activation converts raw class scores from the output layer into probabilities that sum to 1 for each input sample. It does this by exponentiating each score (so negatives remain meaningful and outputs stay positive) and then normalizing by the sum of exponentials across classes. For batches, normalization must happen per sample by summing along the class axis (axis=1) while keeping dimensions for correct broadcasting. To avoid numerical overflow, softmax subtracts the maximum score in each sample before exponentiating; this shift leaves the final probabilities unchanged. The transcript ends by wiring softmax into a small two-layer network on spiral data and showing that random weights produce near-uniform class probabilities before training begins.

Why can’t a classification network just use ReLU in the output layer and then normalize?

ReLU clips negative values to zero. If an output score is negative—whether it is −20 or −1,000,000—it becomes 0, so the model loses information about how strongly a class was scored. After that clipping, normalization can’t recover the original relative differences, making learning based on “how wrong” predictions become unreliable.
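The clipping problem can be seen in two lines (the moderate-negative pair in the second comparison is illustrative, since exp(−1,000,000) underflows to zero):

```python
import numpy as np

# ReLU zeroes both scores; how negative each one was is lost.
scores = np.array([-20.0, -1_000_000.0])
relu_out = np.maximum(0, scores)
print(relu_out)  # [0. 0.] -- indistinguishable after clipping

# Exponentiation of moderate negatives keeps them distinct and ordered.
moderate = np.array([-20.0, -1.0])
print(np.exp(moderate))  # two different positive values
```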

How does softmax turn arbitrary scores into a probability distribution?

Given class scores x, softmax computes e^x for each class and then divides each exponentiated value by the sum of all exponentiated values: probabilities = exp(x) / sum(exp(x)). The result is strictly positive outputs that sum to 1 across classes for each sample, enabling a consistent notion of correctness.
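That formula renders directly as a single-sample function; this hypothetical `softmax` helper is just the expression above wrapped for reuse:

```python
import numpy as np

def softmax(x):
    """Single-sample softmax: exponentiate, then normalize by the sum."""
    e = np.exp(x)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # strictly positive values that sum to 1
```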

What goes wrong when implementing softmax on a batch if the wrong axis is summed?

On a 2D batch matrix shaped like (batch_size, num_classes), softmax must normalize each row (each sample) independently. Summing along axis=0 would sum across samples (columns), mixing normalization across different inputs. Summing along axis=1 correctly sums across classes per sample. keepdims=True is used so the denominator keeps the right shape for elementwise division.
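The wrong-axis failure mode is easy to demonstrate with a small assumed batch:

```python
import numpy as np

outputs = np.array([[4.8, 1.21, 2.385],
                    [8.9, -1.81, 0.2]])
exp_values = np.exp(outputs)

# Wrong: axis=0 sums down each column, mixing the two samples together.
wrong = exp_values / np.sum(exp_values, axis=0, keepdims=True)
print(wrong.sum(axis=1))  # rows do NOT sum to 1

# Right: axis=1 sums across classes within each row (per sample).
right = exp_values / np.sum(exp_values, axis=1, keepdims=True)
print(right.sum(axis=1))  # [1. 1.]
```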

Why subtract the maximum value before exponentiating, and does it change results?

Exponentiating large numbers can overflow. Subtracting the per-sample maximum shifts all class scores by the same constant, making the largest adjusted score 0 and reducing exponentiation magnitude. Because the same shift applies to every class within a sample, the final normalized probabilities remain identical—only overflow risk is reduced.
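The shift-invariance claim can be checked numerically on moderate scores, where both the plain and shifted versions are computable:

```python
import numpy as np

scores = np.array([[2.0, 1.0, 0.1]])

# Plain softmax (safe here because the scores are small).
plain = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)

# Max-shifted softmax: subtract the per-sample max first.
shifted_scores = scores - np.max(scores, axis=1, keepdims=True)
shifted = np.exp(shifted_scores) / np.sum(np.exp(shifted_scores),
                                          axis=1, keepdims=True)

print(np.allclose(plain, shifted))  # the shift cancels in the ratio
```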

What does the softmax output look like right after random initialization?

With random weights, the network’s output scores are essentially random, so softmax probabilities tend to be near-uniform across classes. In the spiral dataset example with three classes, the transcript notes outputs are roughly one-third per class before training.

Review Questions

  1. In softmax, what exact computation ensures outputs sum to 1, and across which dimension should that sum be taken for a batch?
  2. Explain why subtracting max(inputs) before exponentiation prevents overflow without changing the final probabilities.
  3. Compare how ReLU and softmax treat negative output scores and why that matters for learning.

Key Points

  1. Softmax activation is designed for classification output layers because it converts raw scores into per-sample probability distributions that sum to 1.
  2. Exponentiation preserves relative magnitude information even for negative scores, unlike ReLU which clips negatives to zero.
  3. Normalization divides each exponentiated class score by the sum of exponentiated scores across all classes for that same sample.
  4. Batch implementations must sum along the class axis (axis=1) and use keepdims=True to keep shapes compatible for broadcasting.
  5. Numerical stability is improved by subtracting the maximum score per sample before exponentiating; this prevents overflow while leaving probabilities unchanged.
  6. A two-layer dense network with a softmax output produces near-uniform class probabilities at initialization, which is expected before training and loss computation begin.

Highlights

Softmax replaces “pick the largest score” with a learnable probability distribution by exponentiating and normalizing class scores.
Clipping negatives with ReLU destroys information (e.g., −20 and −1,000,000 both become 0), undermining training signals.
For batches, softmax must normalize each sample independently by summing across classes (axis=1).
Subtracting the per-sample maximum before exponentiation prevents overflow without changing the final softmax probabilities.
After random initialization on three-class spiral data, softmax outputs start close to one-third per class, reflecting untrained weights.

Topics

  • Softmax Activation
  • Classification Probabilities
  • Numerical Stability
  • Batch Vectorization
  • NumPy Implementation