Neural Networks from Scratch - P.9 Introducing Optimization and derivatives

sentdex · 5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Randomly sampling weights and biases can slightly reduce loss on an easy dataset but yields minimal accuracy gains and poor performance on more complex data like the spiral dataset.

Briefing

Optimization is the missing ingredient between “we can change weights and biases” and “the model actually improves.” Early attempts that randomly search weight and bias combinations can nudge loss downward, but progress is slow and accuracy barely moves—especially once the task becomes more than a toy problem. Even when tweaks are made from the current best parameters (a “keep the change if loss drops” strategy), the method still struggles on the spiral dataset: loss stays above 1.0 and accuracy hovers around 40%. The core problem is that random search faces an enormous (effectively infinite) search space, treats every parameter as equally influential, and often gets stuck in local minima.
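The loop below is a minimal sketch of that “keep the change if loss drops” idea, not the tutorial's actual code: the single-layer model, the toy data, and the 0.05 step size are hypothetical stand-ins chosen only to make the loop runnable.

```python
import numpy as np

# Hypothetical stand-in for a forward pass plus categorical cross-entropy loss.
def forward_pass_loss(weights, biases, X, y):
    scores = X @ weights + biases                      # one toy dense layer
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(y)), y] + 1e-7).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                          # toy inputs: 2 features
y = rng.integers(0, 3, size=300)                       # toy targets: 3 classes

best_w = 0.01 * rng.normal(size=(2, 3))
best_b = np.zeros(3)
best_loss = forward_pass_loss(best_w, best_b, X, y)

for _ in range(10_000):
    # Tweak-and-keep: nudge the current best parameters by a small random amount.
    w = best_w + 0.05 * rng.normal(size=best_w.shape)
    b = best_b + 0.05 * rng.normal(size=best_b.shape)
    loss = forward_pass_loss(w, b, X, y)
    if loss < best_loss:                               # keep the change only if loss drops
        best_w, best_b, best_loss = w, b, loss

print("best loss after random tweaking:", best_loss)
```

Because the toy labels here are random, the loss will not drop far, which loosely mirrors the text's point: accepting only loss-decreasing tweaks works on easy data but provides no principled direction on harder problems.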

To navigate that landscape intelligently, the training process needs a way to measure how each parameter affects the loss. That leads to calculus—specifically derivatives—as the tool for estimating the “impact” of a variable on a function. The discussion starts with a simple function, f(x)=2x, where the impact is constant and equals the slope (2). Then it moves to a non-linear example, f(x)=2x², where the impact changes depending on where you are on the curve. For non-linear functions, the relevant quantity becomes the slope of the tangent line at a point—an “instantaneous slope.” Because true infinitely close points aren’t practical, the approach uses numerical differentiation: approximate the derivative by taking two points extremely close together (using a tiny delta like 0.0001) and computing the slope between them. The tutorial walks through how to visualize the curve, compute the approximate derivative, and draw the corresponding tangent line using the line form y = mx + b, where b can be found via algebra (b = y − mx).
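Here is a small sketch of that numerical-derivative and tangent-line procedure, assuming the f(x) = 2x² example and the 0.0001 delta mentioned above; variable names are illustrative.

```python
def f(x):
    return 2 * x ** 2

delta = 0.0001        # tiny step: the two points are "nearly infinitely close"
x = 1.0

# Approximate derivative: slope between (x, f(x)) and (x + delta, f(x + delta)).
m = (f(x + delta) - f(x)) / delta      # ~4.0002; the true derivative at x=1 is 4

# Tangent line y = m*x + b, with the intercept recovered via b = y - m*x.
b = f(x) - m * x

tangent_line = lambda xs: m * xs + b   # touches the curve at (1, 2)
print(m, b, tangent_line(x))
```

Plotting `f` and `tangent_line` over a small range around x = 1 reproduces the tangent-line visualization described in the text.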

The practical payoff is clear: if derivatives tell how loss changes when a parameter changes, optimization can update weights and biases in the direction that reduces loss rather than guessing. But numerical differentiation is computationally expensive for neural networks. With millions of parameters, estimating derivatives by brute-force perturbation would require many forward passes: for each weight (and each bias), add a small delta, run a forward pass to compute loss, revert, and repeat—also doing this per training sample. That’s still better than pure random search, but it scales poorly.
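To make that cost concrete, the sketch below brute-forces a numerical gradient by perturbing one parameter at a time; `loss_fn` stands in for a full forward pass plus loss, so each element of the returned array costs one extra pass. This is an illustration of the scaling problem, not the tutorial's code.

```python
import numpy as np

def numerical_gradient(loss_fn, weights, delta=0.0001):
    """Estimate d(loss)/d(weight) for every weight by perturbing each one in turn."""
    grad = np.zeros_like(weights)
    base_loss = loss_fn(weights)                 # one forward pass for the baseline
    it = np.nditer(weights, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        original = weights[idx]
        weights[idx] = original + delta          # nudge a single parameter
        grad[idx] = (loss_fn(weights) - base_loss) / delta   # another full forward pass
        weights[idx] = original                  # revert before the next parameter
        it.iternext()
    return grad                                  # total passes: roughly one per parameter
```

With millions of weights and biases, a single call to this function already means millions of forward passes, before even repeating it across iterations and training samples.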

The next step is to replace brute-force numerical differentiation with analytical partial derivatives, which give the sensitivity of the loss to each parameter of a multivariable function without perturbing one parameter at a time. The transcript frames this as the bridge from “calculus concepts” to “an implementable optimization method,” setting up the analytical derivative work to come in the following tutorial.

Cornell Notes

The transcript explains why naive optimization fails and how calculus provides a path forward. Randomly sampling weights and biases can slightly reduce loss on an easy dataset, but it barely improves accuracy and performs poorly on the spiral dataset (loss > 1.0, accuracy ~40%). A “tweak-and-keep” strategy improves the easy case but still suffers from the huge search space, equal treatment of parameters, and getting trapped in local minima. To optimize effectively, the method needs a way to quantify how changing a parameter changes the loss. Derivatives—starting with numerical differentiation using a small delta—estimate the slope of the loss with respect to a variable, but doing this with many weights would require too many forward passes, motivating partial derivatives next.

Why does random search over weights and biases fail to produce reliable learning?

Random search repeatedly samples new weight/bias combinations, evaluates loss via a forward pass, and keeps the best result. On a nearly linearly separable dataset it can reduce loss slightly, but accuracy barely improves. On the spiral dataset, loss remains above 1.0 and accuracy stays around 40%. The underlying issues are (1) an effectively infinite search space even for small models, (2) treating all weights and biases as equally influential when some parameters matter more, and (3) convergence to local minima from which random moves may not escape.

How does the “tweak-and-keep” approach differ from full random search, and why does it still have limits?

Instead of sampling entirely new weights and biases each iteration, the method starts from the current best parameters and applies small random adjustments. If the adjustment decreases loss, the new weights/biases are kept; otherwise they revert. This yields continued progress on the easy vertical-line dataset, reaching about 0.17 loss and ~93% accuracy. But on the spiral dataset, the same logic still learns slowly and plateaus with higher loss and low accuracy, because it still lacks a principled direction for updates and can still get stuck in local minima.

What does a derivative measure in this optimization context?

A derivative measures the instantaneous slope of a function at a point, which corresponds to how sensitively the output changes with respect to the input at that location. For f(x)=2x, the slope is constant (2). For a non-linear function like f(x)=2x², the slope varies by x, so the tangent line slope at the specific point is the relevant “impact” measure. In neural networks, the goal is analogous: determine how loss changes as weights or biases change.
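As a quick check (standard calculus, with the x=2 case added here for contrast): the derivative of f(x)=2x is 2 everywhere, while the derivative of f(x)=2x² is 4x, so the tangent slope is 4 at x=1 but 8 at x=2, which is why the “impact” of x depends on where it is measured.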

How does numerical differentiation approximate derivatives, and why use a small delta?

Numerical differentiation approximates the derivative by computing the slope between two very close points on the curve: (f(x+δ)−f(x))/δ. Using a tiny delta (e.g., 0.0001) makes the two points nearly “infinitely close,” reducing approximation error. The transcript illustrates this with f(x)=2x² at x=1, where the approximate derivative is close to the true derivative (4), with small discrepancy due to the finite delta.
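Working that example through with concrete numbers (a straightforward computation, not quoted verbatim from the transcript): f(1.0001) = 2·(1.0001)² = 2.00040002 and f(1) = 2, so the estimated slope is (2.00040002 − 2)/0.0001 = 4.0002, against a true derivative of exactly 4.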

Why is numerical differentiation too expensive for neural networks with many parameters?

To estimate the derivative of loss with respect to each weight using numerical differentiation, the method would perturb one parameter at a time by a small delta, run a forward pass to compute loss, revert, and repeat for every weight (and similarly for every bias). With millions of parameters and multiple samples, this implies an enormous number of forward passes—effectively brute-forcing the multivariable sensitivity—making it slower than ideal even if it beats random search.

What problem do partial derivatives solve compared with numerical differentiation?

Partial derivatives provide the sensitivity of a multivariable function (loss as a function of many weights/biases) with respect to each parameter without needing separate forward passes for every tiny perturbation. The transcript positions partial derivatives as the next step after introducing numerical differentiation, setting up analytical derivatives that scale to neural networks.
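As a one-line preview (standard calculus, not derived in this part of the transcript): for a multivariable function such as f(x, y) = x·y, the partial derivatives are ∂f/∂x = y and ∂f/∂y = x; each measures the impact of a single input while the others are held fixed, which is exactly the per-parameter sensitivity the optimizer needs.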

Review Questions

  1. How do random search and tweak-and-keep differ in how they choose candidate weights and biases, and how do those differences show up in the reported accuracy/loss outcomes?
  2. In the examples f(x)=2x and f(x)=2x², what changes about “impact” and why does the tangent line matter for non-linear functions?
  3. What computational bottleneck arises when using numerical differentiation to estimate derivatives for millions of weights, and how does that motivate partial derivatives?

Key Points

  1. Randomly sampling weights and biases can slightly reduce loss on an easy dataset but yields minimal accuracy gains and poor performance on more complex data like the spiral dataset.
  2. A tweak-and-keep strategy improves learning on the easy dataset by iteratively accepting only loss-decreasing parameter changes, but it still struggles on the spiral dataset.
  3. Optimization becomes difficult because the search space is effectively infinite, parameter influence is uneven, and local minima can trap naive update rules.
  4. Derivatives quantify how sensitively a function’s output changes with respect to an input, which is the needed signal for directing weight and bias updates toward lower loss.
  5. Numerical differentiation approximates derivatives by using a very small delta to estimate the slope between nearly adjacent points on a curve.
  6. Numerical differentiation scales poorly for neural networks because it would require many forward passes per parameter per sample.
  7. Partial derivatives are positioned as the scalable alternative for multivariable loss functions, enabling analytical gradient-based optimization next.

Highlights

On the spiral dataset, semi-random tweaking plateaus around 40% accuracy with loss above 1.0, illustrating how quickly naive optimization runs out of steam.
The transcript contrasts constant slope (f(x)=2x) with location-dependent slope (f(x)=2x²), motivating tangent-line derivatives as the right “impact” measure.
Numerical differentiation approximates derivatives using a tiny delta (e.g., 0.0001), but it still behaves like brute force when applied to millions of weights.
The computational cost of perturbing each weight/bias individually via forward passes makes numerical differentiation impractical for real neural networks.
The path forward is partial derivatives, which set up efficient gradient computation for multivariable optimization.
