
Neural Networks from Scratch - P.8 Implementing Loss

sentdex · 4 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Categorical cross-entropy for training uses −log of the predicted probability assigned to the true class, computed per sample and averaged across the batch.

Briefing

Categorical cross-entropy loss gets upgraded from a single-sample calculation to a batch-ready, numerically stable implementation—complete with support for both sparse (scalar) and one-hot encoded targets. The core shift is moving from “take the softmax confidence for the correct class and compute −log(confidence)” for one example, to doing the same operation across a whole batch, then averaging the per-sample losses to produce a single training signal.

In the batch setting, softmax outputs arrive as a 2D array: one row per sample, one column per class. Targets arrive either as scalar class indices (e.g., [0, 1, 1] for dog/cat/cat) or as one-hot vectors (e.g., [[1,0,0],[0,1,0],[0,1,0]]). For each sample, the loss needs the predicted probability assigned to the target class. With scalar targets, that means indexing each softmax row at the target index. With one-hot targets, it means multiplying the softmax row by the one-hot vector and summing across classes—leaving only the probability of the correct class.
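Both extraction paths can be illustrated with a small NumPy sketch, using example values in the style of the video (variable names here are illustrative):

```python
import numpy as np

# Softmax outputs for a batch of 3 samples, 3 classes (one row per sample)
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])

# Sparse targets: index each row at its target class index
class_targets = np.array([0, 1, 1])  # dog, cat, cat
confidences_sparse = softmax_outputs[range(len(softmax_outputs)), class_targets]

# One-hot targets: elementwise multiply, then sum across the class axis
one_hot_targets = np.array([[1, 0, 0],
                            [0, 1, 0],
                            [0, 1, 0]])
confidences_onehot = np.sum(softmax_outputs * one_hot_targets, axis=1)

print(confidences_sparse)  # [0.7 0.5 0.9]
print(confidences_onehot)  # [0.7 0.5 0.9]
```

Both paths yield the same vector: one correct-class probability per sample.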

The implementation also tackles a practical failure mode: if the model assigns a probability of exactly 0 to the correct class, then −log(0) becomes infinite. Even a single infinite value can poison the batch mean, turning the entire batch loss into infinity. The fix is to clip predicted probabilities away from 0 and 1 using a small epsilon (the transcript uses 1e−7). Values are clipped to the range [1e−7, 1 − 1e−7], preserving the intent of the loss while preventing numerical blow-ups.
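A quick sketch of the failure mode and the fix with `np.clip` (values chosen for illustration):

```python
import numpy as np

y_pred = np.array([1.0, 0.0, 0.5])

# Without clipping: -log(0) is inf, and a single inf poisons the batch mean
with np.errstate(divide="ignore"):
    print(np.mean(-np.log(y_pred)))        # inf

# Clip both ends: 0 -> 1e-7 avoids log(0); 1 -> 1 - 1e-7 keeps the loss symmetric
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
print(np.mean(-np.log(y_pred_clipped)))    # finite (about 5.6 here)
```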

To make this fit into a larger neural-network framework, the transcript introduces a base Loss class with a calculate method that computes sample losses via a forward method, then reduces them to a batch loss using the mean. A derived Loss_CategoricalCrossentropy class implements forward by clipping predictions, extracting correct confidences differently depending on whether targets are scalar or one-hot (checked via the shape of y_true), and then computing negative log-likelihoods as −log(correct_confidences). The resulting vector of per-sample losses is then averaged to yield the final loss for the batch.
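The structure described above can be sketched in NumPy, following the class names from the transcript:

```python
import numpy as np

class Loss:
    # Common base class: reduce per-sample losses to a single batch loss
    def calculate(self, output, y):
        sample_losses = self.forward(output, y)
        data_loss = np.mean(sample_losses)
        return data_loss

class Loss_CategoricalCrossentropy(Loss):
    def forward(self, y_pred, y_true):
        samples = len(y_pred)
        # Clip away from 0 and 1 so -log never sees exactly 0
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

        if len(y_true.shape) == 1:    # sparse targets: scalar class indices
            correct_confidences = y_pred_clipped[range(samples), y_true]
        elif len(y_true.shape) == 2:  # one-hot encoded targets
            correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)

        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
class_targets = np.array([0, 1, 1])

loss = Loss_CategoricalCrossentropy().calculate(softmax_outputs, class_targets)
print(loss)  # about 0.385
```

Keeping the mean reduction in `calculate` means future loss classes only need to implement their own `forward`.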

After loss is in place, accuracy is briefly addressed as a companion metric: take argmax over softmax outputs to get predicted class indices, compare them to target indices, and average the matches. Accuracy is treated as useful for monitoring, but loss remains the primary optimization target because it provides a continuous measure of how wrong predictions are—setting the stage for the next steps: updating weights and biases through optimization in later videos.
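The accuracy calculation is a short NumPy one-liner pair (example values illustrative):

```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.2, 0.1],
                            [0.5, 0.1, 0.4],
                            [0.02, 0.9, 0.08]])
class_targets = np.array([0, 1, 1])

predictions = np.argmax(softmax_outputs, axis=1)  # predicted class per sample
accuracy = np.mean(predictions == class_targets)  # fraction of matches
print(accuracy)  # 0.666... (2 of 3 correct)
```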

Cornell Notes

The transcript shows how to implement categorical cross-entropy loss for batches, not just single samples. It computes per-sample loss as −log(p_correct), where p_correct is the softmax probability assigned to the true class. Because −log(0) is infinite, predictions are clipped to [1e−7, 1 − 1e−7] to keep batch loss finite. The code handles two target formats: scalar class indices (shape length 1) and one-hot encoded vectors (shape length 2), extracting correct confidences via indexing or via elementwise multiply-and-sum. Finally, batch loss is the mean of sample losses, and accuracy can be computed using argmax comparisons.

How does categorical cross-entropy change when moving from one sample to a batch?

For a batch, softmax outputs become a 2D array (samples × classes). Instead of computing −log(p_correct) once, the method extracts p_correct for every sample (one value per row), computes a vector of negative log-likelihoods, and then averages those values to get the batch loss.

Why is clipping needed, and what goes wrong without it?

If the model predicts exactly 0 probability for the correct class, then −log(0) becomes infinite. Even one infinite value can make the mean of the batch losses infinite, effectively breaking training. Clipping predictions to a small epsilon range (the transcript uses 1e−7) prevents log(0) while minimally altering near-zero probabilities.

How are correct confidences extracted when targets are scalar class indices?

When y_true is scalar (e.g., [0, 1, 1]), correct confidences are gathered by indexing each softmax row at the target index. Conceptually: for sample i, take y_pred_clipped[i, y_true[i]]. This yields one probability per sample corresponding to the true class.

How are correct confidences extracted when targets are one-hot encoded?

When y_true is one-hot (shape length 2), correct confidences are computed by multiplying y_pred_clipped by y_true elementwise and summing across the class axis (axis=1). Because one-hot vectors contain a single 1 and the rest 0s, the sum leaves exactly the predicted probability for the true class.

What does the base Loss class do in the framework?

It centralizes reduction logic: calculate() calls a forward() method to produce sample losses, then computes batch loss as the mean of those sample losses. The forward() implementation varies by loss type, but calculate() stays consistent.

How is accuracy computed from softmax outputs and targets?

Predicted classes come from argmax over softmax outputs along the class axis (axis=1). Those predicted indices are compared to the target indices; matches count as correct (1) and mismatches as incorrect (0). Accuracy is the mean of that correctness vector.

Review Questions

  1. What exact numerical issue occurs when −log is applied to a predicted probability of 0, and how does clipping prevent it?
  2. Given scalar targets versus one-hot targets, how do you extract the correct class confidence from a batch of softmax outputs?
  3. Why is batch loss computed as the mean of sample losses, and how does that relate to the optimization goal?

Key Points

  1. Categorical cross-entropy for training uses −log of the predicted probability assigned to the true class, computed per sample and averaged across the batch.
  2. Batch softmax outputs are treated as a 2D array (samples × classes), and the loss must extract one correct-class confidence per sample.
  3. Numerical stability is essential: predicted probabilities must be clipped to avoid −log(0) producing infinite loss values.
  4. Scalar targets (class indices) require indexing each softmax row at the target index to get correct confidences.
  5. One-hot targets require elementwise multiplication with the one-hot vectors and summing across classes to isolate the correct-class confidence.
  6. A base Loss class can standardize batch reduction (mean of sample losses) while derived losses implement the forward() logic.
  7. Accuracy can be computed via argmax over softmax outputs and comparing predicted class indices to target indices, but loss remains the primary training metric.

Highlights

Clipping predictions to [1e−7, 1 − 1e−7] prevents batch loss from turning infinite when any correct-class probability hits 0.
Scalar targets use direct indexing into the softmax output; one-hot targets use multiply-and-sum to isolate the correct-class probability.
Batch loss is the mean of per-sample negative log-likelihoods, turning many individual errors into one optimization signal.