Neural Networks from Scratch - P.8 Implementing Loss
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Categorical cross-entropy for training uses −log of the predicted probability assigned to the true class, computed per sample and averaged across the batch.
Briefing
Categorical cross-entropy loss gets upgraded from a single-sample calculation to a batch-ready, numerically stable implementation—complete with support for both sparse (scalar) and one-hot encoded targets. The core shift is moving from “take the softmax confidence for the correct class and compute −log(confidence)” for one example, to doing the same operation across a whole batch, then averaging the per-sample losses to produce a single training signal.
In the batch setting, softmax outputs arrive as a 2D array: one row per sample, one column per class. Targets arrive either as scalar class indices (e.g., [0, 1, 1] for dog/cat/cat) or as one-hot vectors (e.g., [[1,0,0],[0,1,0],[0,1,0]]). For each sample, the loss needs the predicted probability assigned to the target class. With scalar targets, that means indexing each softmax row at the target index. With one-hot targets, it means multiplying the softmax row by the one-hot vector and summing across classes—leaving only the probability of the correct class.
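A minimal NumPy sketch of both extraction paths described above (the softmax values here are illustrative, not from the video):

```python
import numpy as np

# Hypothetical batch of softmax outputs: 3 samples x 3 classes.
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])

# Sparse targets: one class index per sample (dog=0, cat=1, cat=1).
class_targets = np.array([0, 1, 1])

# Index each row at its target column.
sparse_confidences = softmax_outputs[range(len(softmax_outputs)), class_targets]

# One-hot targets for the same labels.
one_hot_targets = np.array([[1, 0, 0],
                            [0, 1, 0],
                            [0, 1, 0]])

# Multiply elementwise and sum across classes; only the correct-class
# probability survives because every other entry is multiplied by 0.
onehot_confidences = np.sum(softmax_outputs * one_hot_targets, axis=1)

print(sparse_confidences)   # [0.7 0.5 0.9]
print(onehot_confidences)   # [0.7 0.5 0.9]
```

Both paths produce the same per-sample confidences, which is why the loss class can support either target format behind one interface.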
The implementation also tackles a practical failure mode: if the model assigns a probability of exactly 0 to the correct class, then −log(0) becomes infinite. Even a single infinite value can poison the batch mean, turning the entire batch loss into infinity. The fix is to clip predicted probabilities away from 0 and 1 using a small epsilon (the transcript uses 1e−7). Values are clipped to the range [1e−7, 1 − 1e−7], preserving the intent of the loss while preventing numerical blow-ups.
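A short sketch of the clipping fix, using the transcript's epsilon of 1e−7 (the probability values are made up to show the failure case):

```python
import numpy as np

# Hypothetical correct-class probabilities; the 0.0 is the failure mode:
# -log(0) is infinite and would make the batch mean infinite too.
correct_confidences = np.array([1.0, 0.0, 0.3])

# Clip away from both 0 and 1 with epsilon = 1e-7.
clipped = np.clip(correct_confidences, 1e-7, 1 - 1e-7)

losses = -np.log(clipped)      # all finite; the clipped zero gives ~16.12
batch_loss = np.mean(losses)   # finite, instead of inf
```

Clipping the upper end to 1 − 1e−7 as well keeps the transformation symmetric, so a perfect prediction contributes a tiny positive loss rather than exactly zero.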
To make this fit into a larger neural-network framework, the transcript introduces a base Loss class with a calculate method that computes sample losses via a forward method, then reduces them to a batch loss using the mean. A derived Loss_CategoricalCrossentropy class implements forward by clipping predictions, extracting correct confidences differently depending on whether targets are scalar or one-hot (checked via the shape of y_true), and then computing negative log-likelihoods as −log(correct_confidences). The resulting vector of per-sample losses is then averaged to yield the final loss for the batch.
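The class structure described above can be sketched as follows (the class and method names follow the transcript; the example batch at the bottom is illustrative):

```python
import numpy as np

class Loss:
    # Base class: reduce per-sample losses (from a subclass's forward)
    # to a single batch loss via the mean.
    def calculate(self, output, y):
        sample_losses = self.forward(output, y)
        return np.mean(sample_losses)

class Loss_CategoricalCrossentropy(Loss):
    def forward(self, y_pred, y_true):
        samples = len(y_pred)
        # Clip to avoid -log(0) = inf poisoning the batch mean.
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        if len(y_true.shape) == 1:       # sparse class indices
            correct_confidences = y_pred_clipped[range(samples), y_true]
        elif len(y_true.shape) == 2:     # one-hot encoded vectors
            correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)
        # Negative log-likelihood per sample.
        return -np.log(correct_confidences)

# Usage with a hypothetical batch:
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
class_targets = np.array([0, 1, 1])

loss_function = Loss_CategoricalCrossentropy()
loss = loss_function.calculate(softmax_outputs, class_targets)
```

Keeping the mean reduction in the base class means any future loss (e.g., for regression) only needs to implement its own `forward`.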
After loss is in place, accuracy is briefly addressed as a companion metric: take argmax over softmax outputs to get predicted class indices, compare them to target indices, and average the matches. Accuracy is treated as useful for monitoring, but loss remains the primary optimization target because it provides a continuous measure of how wrong predictions are—setting the stage for the next steps: updating weights and biases through optimization in later videos.
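The accuracy companion metric can be sketched in a few lines (sample values are illustrative):

```python
import numpy as np

# Hypothetical batch of softmax outputs and sparse targets.
softmax_outputs = np.array([[0.7, 0.2, 0.1],
                            [0.5, 0.1, 0.4],
                            [0.02, 0.9, 0.08]])
class_targets = np.array([0, 1, 1])

# Predicted class per sample = index of the highest probability.
predictions = np.argmax(softmax_outputs, axis=1)   # [0, 0, 1]

# Fraction of predictions that match the targets.
accuracy = np.mean(predictions == class_targets)
print(accuracy)   # 0.6666666666666666
```

Note that accuracy is flat almost everywhere (a prediction is simply right or wrong), which is exactly why the continuous loss, not accuracy, drives optimization.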
Cornell Notes
The transcript shows how to implement categorical cross-entropy loss for batches, not just single samples. It computes per-sample loss as −log(p_correct), where p_correct is the softmax probability assigned to the true class. Because −log(0) is infinite, predictions are clipped to [1e−7, 1 − 1e−7] to keep batch loss finite. The code handles two target formats: scalar class indices (shape length 1) and one-hot encoded vectors (shape length 2), extracting correct confidences via indexing or via elementwise multiply-and-sum. Finally, batch loss is the mean of sample losses, and accuracy can be computed using argmax comparisons.
- How does categorical cross-entropy change when moving from one sample to a batch?
- Why is clipping needed, and what goes wrong without it?
- How are correct confidences extracted when targets are scalar class indices?
- How are correct confidences extracted when targets are one-hot encoded?
- What does the base Loss class do in the framework?
- How is accuracy computed from softmax outputs and targets?
Review Questions
- What exact numerical issue occurs when −log is applied to a predicted probability of 0, and how does clipping prevent it?
- Given scalar targets versus one-hot targets, how do you extract the correct class confidence from a batch of softmax outputs?
- Why is batch loss computed as the mean of sample losses, and how does that relate to the optimization goal?
Key Points
1. Categorical cross-entropy for training uses −log of the predicted probability assigned to the true class, computed per sample and averaged across the batch.
2. Batch softmax outputs are treated as a 2D array (samples × classes), and the loss must extract one correct-class confidence per sample.
3. Numerical stability is essential: predicted probabilities must be clipped to avoid −log(0) producing infinite loss values.
4. Scalar targets (class indices) require indexing each softmax row at the target index to get correct confidences.
5. One-hot targets require elementwise multiplication with the one-hot vectors and summing across classes to isolate the correct-class confidence.
6. A base Loss class can standardize batch reduction (mean of sample losses) while derived losses implement the forward() logic.
7. Accuracy can be computed via argmax over softmax outputs and comparing predicted class indices to target indices, but loss remains the primary training metric.