Neural Networks from Scratch - P.7 Calculating Loss with Categorical Cross-Entropy
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Softmax produces a probability distribution over classes, so training needs a loss that uses those probabilities rather than only the final predicted label.
Briefing
Training a classifier neural network needs more than “right vs. wrong.” After the softmax layer turns raw scores into a probability distribution over classes, the training process requires a loss function that quantifies how wrong those probabilities are—especially when the model is correct but not confident.
Accuracy can’t do that. Optimizing directly for accuracy would treat predictions as binary outcomes, discarding useful information contained in the predicted probabilities. A model that assigns 87% confidence to the correct class should be rewarded more than one that assigns 50.2%, even though both may still predict the correct label via arg max. Loss functions keep that gradient-friendly information by measuring error continuously rather than categorically.
For classification with softmax outputs, categorical cross-entropy becomes the standard choice. It compares two probability distributions: the target distribution from the training label and the predicted distribution from the network. The general formula takes the negative sum over classes of (target probability × log(predicted probability)). With one-hot encoded targets, the target distribution has a single 1 at the true class index and 0 elsewhere, which collapses the full expression into a simple form: categorical cross-entropy reduces to the negative log of the predicted probability for the target class.
One-hot encoding is the mechanism behind that simplification. For a problem with n classes, the target label becomes an n-length vector filled with zeros except for a 1 at the index of the correct class. When categorical cross-entropy multiplies this one-hot vector by the log of the predicted probabilities, every term with a 0 target probability vanishes, leaving only the log term for the correct class.
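That collapse is easy to verify numerically. A minimal sketch (using the [0.7, 0.1, 0.2] softmax output from the worked example below, with class 0 as the target) shows that the full summation and the simplified negative-log form give the same number:

```python
import math

# Softmax output for one sample (example values from the text)
softmax_output = [0.7, 0.1, 0.2]
# One-hot target: class 0 is the correct class
target_output = [1, 0, 0]

# Full categorical cross-entropy: -sum over classes of target * log(predicted)
full_loss = -sum(t * math.log(p) for t, p in zip(target_output, softmax_output))

# Simplified form: the zeros in the one-hot vector wipe out every other term,
# leaving only -log of the predicted probability at the target index
simplified_loss = -math.log(softmax_output[0])

print(full_loss, simplified_loss)  # both ≈ 0.3567
```

Both expressions evaluate to -log(0.7), which is why implementations can skip the full sum entirely.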
The logarithm matters because it turns "confidence" into a penalty that grows sharply as the model becomes less certain. In this context, "log" means the natural log (base e, also written ln). The transcript also walks through what a logarithm is conceptually (solving for x in e^x = b) and demonstrates the relationship with Euler's number, noting that any mismatch in numeric checks comes from floating-point precision.
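That conceptual check can be reproduced in a couple of lines. A sketch (with 5.2 as an arbitrary example value, not a number from a network):

```python
import math

b = 5.2
x = math.log(b)        # natural log: solves e**x == b
print(x)               # the exponent that raises e to b
print(math.e ** x)     # ≈ 5.2, recovered up to floating-point precision
```

An exact equality check like `math.e ** x == b` can come back False purely because of floating-point rounding, which is the precision caveat the transcript mentions.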
A concrete worked example uses softmax outputs like [0.7, 0.1, 0.2] with the target class at index 0. Under one-hot encoding, the loss becomes -log(0.7). If the model instead predicted 0.5 for the target class, the loss increases to -log(0.5). In other words: higher predicted probability for the correct class yields lower loss; lower probability yields higher loss.
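The direction of that relationship is worth confirming directly. A two-line sketch of the comparison in the paragraph above:

```python
import math

confident = -math.log(0.7)  # loss when the model assigns 0.7 to the true class
uncertain = -math.log(0.5)  # loss when it assigns only 0.5

print(confident, uncertain)  # the less confident prediction incurs the larger loss
```

-log(0.7) ≈ 0.357 while -log(0.5) ≈ 0.693, so halving confidence in the true class roughly doubles the penalty here.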
That negative-log-of-the-target-probability form is the key takeaway that sets up the next step: implementing categorical cross-entropy cleanly in code, including batching, without getting lost in the original “full” summation formula.
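As a preview of that next step, the simplified form extends naturally to a batch with NumPy fancy indexing. A sketch, assuming sparse integer labels (a class index per sample) rather than full one-hot vectors, with made-up softmax outputs:

```python
import numpy as np

# Batch of softmax outputs: one row per sample (assumed example values)
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
# True class index for each sample (sparse labels instead of one-hot vectors)
class_targets = [0, 1, 1]

# Grab each row's predicted probability for its true class, then apply -log
confidences = softmax_outputs[range(len(softmax_outputs)), class_targets]
losses = -np.log(confidences)

print(losses)           # per-sample loss
print(np.mean(losses))  # average loss across the batch
```

The row/column indexing pulls out exactly the one probability per sample that survives the one-hot multiplication, so no explicit summation over classes is needed.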
Cornell Notes
Softmax outputs a probability distribution across classes, so training needs a loss that measures how wrong those probabilities are—not just whether the arg max label matches. Accuracy throws away confidence information, which is why categorical cross-entropy is used for softmax classifiers. With one-hot encoded targets, the categorical cross-entropy formula simplifies dramatically: it becomes the negative log of the predicted probability assigned to the true (target) class. Using natural log (“log” meaning base e) turns lower confidence in the correct class into a larger penalty. This simplified form is what gets implemented next, including for batches, because it stays mathematically correct while being easier to code.
Why isn’t accuracy a good loss function for training a softmax classifier?
What does categorical cross-entropy measure for classification with softmax?
How does one-hot encoding simplify categorical cross-entropy?
What does “log” mean in this context, and why does it matter?
How does the loss change when the model’s predicted probability for the true class changes?
Review Questions
- Given a softmax output vector and a one-hot target label, can you compute categorical cross-entropy using only -log(predicted_probability_of_true_class)?
- Why does arg max-based accuracy fail to distinguish between high-confidence and low-confidence correct predictions?
- What role does the natural log (ln) play in shaping the loss penalty as predicted probability for the true class decreases?
Key Points
1. Softmax produces a probability distribution over classes, so training needs a loss that uses those probabilities rather than only the final predicted label.
2. Accuracy discards confidence information; two correct predictions with different softmax confidence should not be treated equally during optimization.
3. Categorical cross-entropy compares target and predicted probability distributions using a negative log-based penalty.
4. With one-hot encoded targets, categorical cross-entropy simplifies to the negative log of the predicted probability for the true class.
5. "log" in this context means natural log (base e), which is used for both mathematical convenience and practical backpropagation behavior.
6. Higher predicted probability for the correct class yields lower loss; lower predicted probability yields higher loss.
7. The simplified -log(p_true) form is the foundation for implementing categorical cross-entropy efficiently, including across batches.