Neural Networks from Scratch - P.7 Calculating Loss with Categorical Cross-Entropy
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Softmax produces a probability distribution over classes, so training needs a loss that uses those probabilities rather than only the final predicted label.
Briefing
Training a classifier neural network needs more than “right vs. wrong.” After the softmax layer turns raw scores into a probability distribution over classes, the training process requires a loss function that quantifies how wrong those probabilities are—especially when the model is correct but not confident.
Accuracy can’t do that. Optimizing directly for accuracy would treat predictions as binary outcomes, discarding useful information contained in the predicted probabilities. A model that assigns 87% confidence to the correct class should be rewarded more than one that assigns 50.2%, even though both may still predict the correct label via arg max. Loss functions keep that gradient-friendly information by measuring error continuously rather than categorically.
For classification with softmax outputs, categorical cross-entropy becomes the standard choice. It compares two probability distributions: the target distribution from the training label and the predicted distribution from the network. The general formula takes the negative sum over classes of (target probability × log(predicted probability)). With one-hot encoded targets, the target distribution has a single 1 at the true class index and 0 elsewhere, which collapses the full expression into a simple form: categorical cross-entropy reduces to the negative log of the predicted probability for the target class.
One-hot encoding is the mechanism behind that simplification. For a problem with n classes, the target label becomes an n-length vector filled with zeros except for a 1 at the index of the correct class. When categorical cross-entropy multiplies this one-hot vector by the log of the predicted probabilities, every term with a 0 target probability vanishes, leaving only the log term for the correct class.
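That collapse is easy to verify numerically. A minimal sketch (using the [0.7, 0.1, 0.2] softmax output from the worked example below, with class 0 as the target) shows that the full summation and the simplified negative-log form give the same number:

```python
import math

# Softmax output for one sample (example values from the text)
softmax_output = [0.7, 0.1, 0.2]
# One-hot target: class 0 is the correct class
target_output = [1, 0, 0]

# Full categorical cross-entropy: -sum over classes of target * log(predicted)
full_loss = -sum(t * math.log(p) for t, p in zip(target_output, softmax_output))

# Simplified form: the zeros in the one-hot vector wipe out every other term,
# leaving only -log of the predicted probability at the target index
simplified_loss = -math.log(softmax_output[0])

print(full_loss, simplified_loss)  # both ≈ 0.3567
```

Both expressions evaluate to -log(0.7), which is why implementations can skip the full sum entirely.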
The logarithm matters because it turns "confidence" into a penalty that grows sharply as the model becomes less certain. In this context, "log" means the natural log (base e, also written ln). The transcript also walks through what a logarithm is conceptually (solving for x in e^x = b) and demonstrates the relationship with Euler's number, noting that any mismatch in numeric checks comes from floating-point precision.
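That conceptual check can be reproduced in a couple of lines. A sketch (with 5.2 as an arbitrary example value, not a number from a network):

```python
import math

b = 5.2
x = math.log(b)        # natural log: solves e**x == b
print(x)               # the exponent that raises e to b
print(math.e ** x)     # ≈ 5.2, recovered up to floating-point precision
```

An exact equality check like `math.e ** x == b` can come back False purely because of floating-point rounding, which is the precision caveat the transcript mentions.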
A concrete worked example uses softmax outputs like [0.7, 0.1, 0.2] with the target class at index 0. Under one-hot encoding, the loss becomes -log(0.7). If the model instead predicted 0.5 for the target class, the loss increases to -log(0.5). In other words: higher predicted probability for the correct class yields lower loss; lower probability yields higher loss.
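The direction of that relationship is worth confirming directly. A two-line sketch of the comparison in the paragraph above:

```python
import math

confident = -math.log(0.7)  # loss when the model assigns 0.7 to the true class
uncertain = -math.log(0.5)  # loss when it assigns only 0.5

print(confident, uncertain)  # the less confident prediction incurs the larger loss
```

-log(0.7) ≈ 0.357 while -log(0.5) ≈ 0.693, so halving confidence in the true class roughly doubles the penalty here.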
That negative-log-of-the-target-probability form is the key takeaway that sets up the next step: implementing categorical cross-entropy cleanly in code, including batching, without getting lost in the original “full” summation formula.
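As a preview of that next step, the simplified form extends naturally to a batch with NumPy fancy indexing. A sketch, assuming sparse integer labels (a class index per sample) rather than full one-hot vectors, with made-up softmax outputs:

```python
import numpy as np

# Batch of softmax outputs: one row per sample (assumed example values)
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
# True class index for each sample (sparse labels instead of one-hot vectors)
class_targets = [0, 1, 1]

# Grab each row's predicted probability for its true class, then apply -log
confidences = softmax_outputs[range(len(softmax_outputs)), class_targets]
losses = -np.log(confidences)

print(losses)           # per-sample loss
print(np.mean(losses))  # average loss across the batch
```

The row/column indexing pulls out exactly the one probability per sample that survives the one-hot multiplication, so no explicit summation over classes is needed.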
Cornell Notes
Softmax outputs a probability distribution across classes, so training needs a loss that measures how wrong those probabilities are—not just whether the arg max label matches. Accuracy throws away confidence information, which is why categorical cross-entropy is used for softmax classifiers. With one-hot encoded targets, the categorical cross-entropy formula simplifies dramatically: it becomes the negative log of the predicted probability assigned to the true (target) class. Using natural log (“log” meaning base e) turns lower confidence in the correct class into a larger penalty. This simplified form is what gets implemented next, including for batches, because it stays mathematically correct while being easier to code.
Why isn’t accuracy a good loss function for training a softmax classifier?
What does categorical cross-entropy measure for classification with softmax?
How does one-hot encoding simplify categorical cross-entropy?
What does “log” mean in this context, and why does it matter?
How does the loss change when the model’s predicted probability for the true class changes?
Review Questions
- Given a softmax output vector and a one-hot target label, can you compute categorical cross-entropy using only -log(predicted_probability_of_true_class)?
- Why does arg max-based accuracy fail to distinguish between high-confidence and low-confidence correct predictions?
- What role does the natural log (ln) play in shaping the loss penalty as predicted probability for the true class decreases?
Key Points
1. Softmax produces a probability distribution over classes, so training needs a loss that uses those probabilities rather than only the final predicted label.
2. Accuracy discards confidence information; two correct predictions with different softmax confidence should not be treated equally during optimization.
3. Categorical cross-entropy compares target and predicted probability distributions using a negative log-based penalty.
4. With one-hot encoded targets, categorical cross-entropy simplifies to the negative log of the predicted probability for the true class.
5. "log" in this context means natural log (base e), which is used for both mathematical convenience and practical backpropagation behavior.
6. Higher predicted probability for the correct class yields lower loss; lower predicted probability yields higher loss.
7. The simplified -log(p_true) form is the foundation for implementing categorical cross-entropy efficiently, including across batches.