Neural Networks from Scratch - P.9 Introducing Optimization and Derivatives
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Optimization is the missing ingredient between “we can change weights and biases” and “the model actually improves.” Early attempts that randomly search weight and bias combinations can nudge loss downward, but progress is slow and accuracy barely moves—especially once the task becomes more than a toy problem. Even when tweaks are made from the current best parameters (a “keep the change if loss drops” strategy), the method still struggles on the spiral dataset: loss stays above 1.0 and accuracy hovers around 40%. The core problem is that random search faces an enormous (effectively infinite) search space, treats every parameter as equally influential, and often gets stuck in local minima.
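The "keep the change if loss drops" strategy can be sketched in a few lines. This is an illustration, not code from the video: the `loss_for` callback (a forward pass returning loss) and the array shapes are assumptions.

```python
import numpy as np

# Hypothetical sketch of the tweak-and-keep strategy: perturb the current
# best parameters with small random noise and keep the change only when
# the loss decreases. `loss_for` is an assumed helper that runs a forward
# pass and returns the loss for a given (weights, biases) pair.
def tweak_and_keep(weights, biases, loss_for, iterations=1000, step=0.05):
    best_w, best_b = weights.copy(), biases.copy()
    best_loss = loss_for(best_w, best_b)
    for _ in range(iterations):
        # Small random nudge around the current best parameters.
        cand_w = best_w + step * np.random.randn(*best_w.shape)
        cand_b = best_b + step * np.random.randn(*best_b.shape)
        cand_loss = loss_for(cand_w, cand_b)
        if cand_loss < best_loss:  # accept only loss-decreasing changes
            best_w, best_b, best_loss = cand_w, cand_b, cand_loss
    return best_w, best_b, best_loss
```

On a simple loss surface this reliably drives loss down, but on something like the spiral dataset it inherits all of random search's problems: no sense of which parameter mattered, and no way out of a local minimum.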
To navigate that landscape intelligently, the training process needs a way to measure how each parameter affects the loss. That leads to calculus—specifically derivatives—as the tool for estimating the “impact” of a variable on a function. The discussion starts with a simple function, f(x)=2x, where the impact is constant and equals the slope (2). Then it moves to a non-linear example, f(x)=2x², where the impact changes depending on where you are on the curve. For non-linear functions, the relevant quantity becomes the slope of the tangent line at a point—an “instantaneous slope.” Because true infinitely close points aren’t practical, the approach uses numerical differentiation: approximate the derivative by taking two points extremely close together (using a tiny delta like 0.0001) and computing the slope between them. The tutorial walks through how to visualize the curve, compute the approximate derivative, and draw the corresponding tangent line using the line form y = mx + b, where b can be found via algebra (b = y − mx).
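The numerical-differentiation steps above can be sketched directly, using the tutorial's f(x) = 2x², a delta of 0.0001, and the intercept formula b = y − mx (the choice of x = 2 here is just an example point):

```python
# Approximate the derivative of f(x) = 2x**2 at a point by taking two
# points extremely close together and computing the slope between them,
# then recover the tangent line via y = mx + b, with b = y - m*x.
def f(x):
    return 2 * x ** 2

delta = 0.0001                 # tiny gap between the two points
x1 = 2.0                       # point where we want the tangent
x2 = x1 + delta

# Slope between nearly adjacent points ~ instantaneous slope at x1.
approx_derivative = (f(x2) - f(x1)) / (x2 - x1)   # true value at x=2 is 8

# Solve y = mx + b for b using the known point (x2, f(x2)).
b = f(x2) - approx_derivative * x2

def tangent_line(x):
    return approx_derivative * x + b
```

Plotting `f` alongside `tangent_line` over a small interval around x = 2 reproduces the tangent-line visualization the tutorial walks through.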
The practical payoff is clear: if derivatives tell how loss changes when a parameter changes, optimization can update weights and biases in the direction that reduces loss rather than guessing. But numerical differentiation is computationally expensive for neural networks. With millions of parameters, estimating derivatives by brute-force perturbation would require many forward passes: for each weight (and each bias), add a small delta, run a forward pass to compute loss, revert, and repeat—also doing this per training sample. That’s still better than pure random search, but it scales poorly.
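A back-of-the-envelope count makes the scaling problem concrete. The formula below is an illustration of the argument, not something stated in the video: it assumes one baseline forward pass plus one perturbed pass per parameter, repeated per sample.

```python
# Rough cost of brute-force numerical differentiation: for each parameter,
# add a small delta, run a forward pass, and revert -- so each update needs
# one baseline pass plus one pass per parameter, for every sample.
def forward_passes_needed(n_params, n_samples):
    return n_samples * (1 + n_params)

# Even a tiny network makes this explode, e.g. 1,000 parameters, 300 samples:
print(forward_passes_needed(1_000, 300))  # -> 300300 passes for one update
```

A network with millions of parameters would need millions of forward passes per sample per update, which is why the series moves to analytical partial derivatives instead.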
The next step is to replace numerical differentiation with partial derivatives, which can compute the sensitivity of the loss to each parameter more efficiently for multivariable functions. The transcript frames this as the bridge from “calculus concepts” to “an implementable optimization method,” setting up the analytical derivative work to come in the following tutorial.
Cornell Notes
The transcript explains why naive optimization fails and how calculus provides a path forward. Randomly sampling weights and biases can slightly reduce loss on an easy dataset, but it barely improves accuracy and performs poorly on the spiral dataset (loss > 1.0, accuracy ~40%). A “tweak-and-keep” strategy improves the easy case but still suffers from the huge search space, equal treatment of parameters, and getting trapped in local minima. To optimize effectively, the method needs a way to quantify how changing a parameter changes the loss. Derivatives—starting with numerical differentiation using a small delta—estimate the slope of the loss with respect to a variable, but doing this with many weights would require too many forward passes, motivating partial derivatives next.
- Why does random search over weights and biases fail to produce reliable learning?
- How does the “tweak-and-keep” approach differ from full random search, and why does it still have limits?
- What does a derivative measure in this optimization context?
- How does numerical differentiation approximate derivatives, and why use a small delta?
- Why is numerical differentiation too expensive for neural networks with many parameters?
- What problem do partial derivatives solve compared with numerical differentiation?
Review Questions
- How do random search and tweak-and-keep differ in how they choose candidate weights and biases, and how do those differences show up in the reported accuracy/loss outcomes?
- In the examples f(x)=2x and f(x)=2x², what changes about “impact” and why does the tangent line matter for non-linear functions?
- What computational bottleneck arises when using numerical differentiation to estimate derivatives for millions of weights, and how does that motivate partial derivatives?
Key Points
1. Randomly sampling weights and biases can slightly reduce loss on an easy dataset but yields minimal accuracy gains and poor performance on more complex data like the spiral.
2. A tweak-and-keep strategy improves learning on the easy dataset by iteratively accepting only loss-decreasing parameter changes, but it still struggles on the spiral dataset.
3. Optimization becomes difficult because the search space is effectively infinite, parameter influence is uneven, and local minima can trap naive update rules.
4. Derivatives quantify how sensitively a function’s output changes with respect to an input, which is the needed signal for directing weight and bias updates toward lower loss.
5. Numerical differentiation approximates derivatives by using a very small delta to estimate the slope between nearly adjacent points on a curve.
6. Numerical differentiation scales poorly for neural networks because it would require many forward passes per parameter per sample.
7. Partial derivatives are positioned as the scalable alternative for multivariable loss functions, enabling analytical gradient-based optimization next.