
What Textbooks Don't Tell You About Curve Fitting

Artem Kirsanov · 5 min read

Based on Artem Kirsanov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Least squares arises as the maximum-likelihood solution for a linear model with additive Gaussian noise, making the squared residual term a consequence of the noise assumption.

Briefing

Linear regression’s familiar “minimize squared vertical errors” rule isn’t just a convenient math trick—it drops out of a probabilistic assumption about noise. By treating observed outputs as a linear function of inputs plus random error, then choosing the parameters that make the observed data most likely, the optimization objective becomes least squares. The squared term appears because the noise is assumed to be Gaussian: under that assumption, maximizing the probability of the data is equivalent to minimizing the sum of squared differences between predicted and observed values.

That probabilistic framing also answers why “vertical distances” and “squares” show up in the first place. The model imagines a cloud of plausible outcomes around the predicted value for any given input, with the cloud’s shape determined by the noise distribution. When each data point is treated as an independent draw, the overall likelihood of the dataset is the product of per-point probabilities; taking logs turns that product into a sum, and the Gaussian form collapses neatly into the squared-error objective. In short: least squares is the maximum-likelihood solution for a linear model with additive Gaussian noise.
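As a compact restatement of the argument above (using the same linear-Gaussian model the video describes, with σ² standing in for the unspecified noise variance), the maximum-likelihood estimate coincides with the least-squares minimizer:

```latex
y_i = w^\top x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
\;\Longrightarrow\;
\hat{w}_{\text{MLE}} = \arg\max_w \prod_i p(y_i \mid x_i, w)
                     = \arg\min_w \sum_i \bigl(y_i - w^\top x_i\bigr)^2 .
```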

The discussion then pivots to what happens when multiple parameter settings fit the data equally well—or when “fit the data” conflicts with prior beliefs. A coin-bias example shows the limitation of pure maximum-likelihood: observing 4 heads out of 5 pushes the estimate toward 0.8, even though experience suggests most coins are near 0.5. The fix is to incorporate priors over parameters, selecting weights that jointly explain the data and align with expectations. Mathematically, this becomes maximizing the joint probability of data and parameters, which decomposes into a likelihood term (data fit) times a prior term (parameter plausibility).

From that joint-probability view, regularization stops being an arbitrary penalty and becomes a direct consequence of choosing a prior distribution over weights. Assuming weights follow a zero-centered Gaussian prior yields L2 regularization (ridge regression), which penalizes the squared magnitude of coefficients and gently shrinks them toward zero. The strength of the penalty is governed by the ratio between noise variance (how trustworthy the data are) and the prior variance (how strongly the prior expects weights to be small).

Assuming instead that weights are likely exactly zero except for a few important ones leads to a Laplace (double-exponential) prior, producing L1 regularization (lasso). Because L1 penalizes absolute values rather than squares, it more readily drives many coefficients to exactly zero, creating sparse models. The transcript links this sparsity preference to domains like genomics and neuroscience, where only a small subset of features or neurons typically matters for a given outcome.

Overall, the core takeaway is that least squares and common regularizers are not separate inventions: they are maximum-likelihood and maximum-a-posteriori solutions under specific noise and prior assumptions. Change those assumptions—Gaussian vs. non-Gaussian noise, or Gaussian vs. Laplace priors—and the resulting objectives change in predictable, principled ways. That same probabilistic logic extends beyond linear regression to much broader machine learning systems, where modeling assumptions often determine what “good behavior” looks like.

Cornell Notes

Linear regression’s least-squares objective emerges from maximum-likelihood estimation under a Gaussian noise model. Treat each observed target as a linear prediction plus additive error; assuming the error is normally distributed makes maximizing the probability of the data equivalent to minimizing the sum of squared prediction errors. When multiple parameter settings fit similarly—or when experience suggests parameters should behave a certain way—pure likelihood can be misleading, as illustrated by the biased-coin example. Incorporating prior beliefs leads to maximizing joint probability (likelihood × prior), which produces regularization. A Gaussian prior over weights yields L2 (ridge) regularization, while a Laplace prior yields L1 (lasso) regularization and encourages sparsity.

Why does least squares use squared vertical errors rather than absolute errors or other powers?

The squared-error form follows from assuming additive Gaussian noise. With a linear model y ≈ wᵀx + ε and ε drawn from a normal distribution, the likelihood of observing a particular y given w has a Gaussian form. Maximizing the product of per-point likelihoods (or its log) simplifies to minimizing the sum of squared residuals between predicted and observed values. The “squares” are therefore a direct consequence of the Gaussian noise assumption, not an arbitrary choice.
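To make that step explicit, here is the per-point likelihood under the Gaussian noise assumption and its negative log (dropping constants that do not depend on w):

```latex
p(y \mid x, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - w^\top x)^2}{2\sigma^2}\right),
\qquad
-\log p(y \mid x, w) = \frac{(y - w^\top x)^2}{2\sigma^2} + \text{const}.
```

Minimizing the summed negative log-likelihood therefore means minimizing the sum of squared residuals; an exponent other than 2 would correspond to a different noise distribution.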

How does the probabilistic view turn “fit the data” into an optimization objective?

For a fixed parameter vector w, each data point has a probability of being observed, computed from the noise model. Assuming independent samples, the dataset likelihood is the product of those probabilities. Taking the logarithm converts the product into a sum, making optimization tractable. Under Gaussian noise, the log-likelihood reduces to a quadratic penalty on residuals, yielding the standard least-squares objective.
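A quick numerical sketch of this equivalence (illustrative only; the simulated data, variable names, and use of NumPy are assumptions, not from the video): fit by ordinary least squares, then check that the fitted weights score a higher Gaussian log-likelihood than nearby alternatives.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a linear model with additive Gaussian noise: y = X w_true + eps
n, d, sigma = 200, 3, 0.5
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.7])
y = X @ w_true + rng.normal(scale=sigma, size=n)

# Least-squares fit (minimizes the sum of squared residuals)
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

def gaussian_log_likelihood(w):
    """Log-likelihood of the data under y_i ~ N(w^T x_i, sigma^2), i.i.d."""
    residuals = y - X @ w
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum(residuals**2) / (2 * sigma**2)

# The least-squares solution should beat randomly perturbed weights
print("log-likelihood at least-squares solution:", gaussian_log_likelihood(w_ls))
for _ in range(3):
    w_other = w_ls + rng.normal(scale=0.2, size=d)
    print("log-likelihood at perturbed weights:  ", gaussian_log_likelihood(w_other))
```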

What goes wrong with maximum-likelihood regression when parameters are underconstrained?

Maximum likelihood can overreact to limited evidence because it only asks which parameters make the observed data most probable. The coin example shows this: observing 4 heads out of 5 leads to an estimate near 0.8 even though typical coins are expected to be near 0.5. Without priors, the method ignores reasonable expectations about plausible parameter values.
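A small numerical illustration of the gap (the Beta prior below is a choice made here to encode "most coins are near 0.5"; the video does not commit to a specific prior):

```python
heads, flips = 4, 5

# Maximum likelihood: just the observed frequency
p_mle = heads / flips   # 0.8

# Maximum a posteriori with an assumed Beta(a, b) prior centered at 0.5.
# The posterior is Beta(heads + a, flips - heads + b); its mode is the MAP estimate.
a = b = 10.0
p_map = (heads + a - 1) / (flips + a + b - 2)   # ~0.565, pulled toward 0.5

print(f"MLE estimate: {p_mle:.3f}")
print(f"MAP estimate with Beta({a:.0f}, {b:.0f}) prior: {p_map:.3f}")
```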

How do priors over weights produce regularization?

Instead of maximizing only the likelihood P(data|w), the method maximizes the joint probability P(data, w). Using conditional probability, this becomes maximizing P(data|w) × P(w). The prior term P(w) becomes a penalty in the optimization objective. Different priors yield different regularizers: a Gaussian prior yields L2, while a Laplace prior yields L1.
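In symbols (writing D for the dataset), the decomposition and the resulting objective look like:

```latex
\hat{w}_{\text{MAP}} = \arg\max_w \, P(D, w)
                     = \arg\max_w \, P(D \mid w)\, P(w)
                     = \arg\max_w \, \bigl[\log P(D \mid w) + \log P(w)\bigr],
```

where the first log term measures data fit and the second acts as the regularization penalty.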

Why does a Gaussian prior lead to L2 regularization (ridge)?

A zero-centered Gaussian prior assigns higher probability to small-magnitude weights and lower probability to large ones. When the log of the joint objective is formed, the prior contributes a term proportional to the sum of squared weights. That term penalizes large coefficients, shrinking them toward zero. The regularization strength λ reflects the balance between noise variance (data reliability) and prior variance (how strongly weights are expected to be small).
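Written out under these assumptions (Gaussian noise with variance σ², and writing τ² for the prior variance on the weights), the MAP objective becomes, up to constants:

```latex
\hat{w}_{\text{MAP}}
= \arg\min_w \; \sum_i \bigl(y_i - w^\top x_i\bigr)^2 + \frac{\sigma^2}{\tau^2}\,\lVert w \rVert_2^2 ,
\qquad \lambda = \frac{\sigma^2}{\tau^2},
```

so noisier data (larger σ²) or a tighter prior (smaller τ²) both increase the amount of shrinkage.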

Why does a Laplace prior lead to L1 regularization (lasso) and sparsity?

A Laplace (double-exponential) prior has a sharp peak at zero with heavier tails than a Gaussian. When incorporated into the log-joint objective, it produces a penalty proportional to the sum of absolute weight values. Because absolute-value penalties more readily drive coefficients to exactly zero, L1 regularization tends to produce sparse solutions—many weights become zero while a smaller subset remains nonzero.
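A minimal sketch of the sparsity difference, using scikit-learn's Ridge and Lasso (the simulated data, penalty strengths, and library choice are illustrative assumptions, not from the video):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

# 50 features, but only 3 actually influence the target (a sparse ground truth)
n, d = 200, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[[0, 1, 2]] = [3.0, -2.0, 1.5]
y = X @ w_true + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: Gaussian prior on weights
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: Laplace prior on weights

print("Ridge coefficients exactly zero:", np.sum(ridge.coef_ == 0.0))  # typically none
print("Lasso coefficients exactly zero:", np.sum(lasso.coef_ == 0.0))  # typically most of the irrelevant ones
```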

Review Questions

  1. In what exact way does the Gaussian noise assumption mathematically force the residuals to be squared in the regression objective?
  2. How does maximizing P(data, w) differ from maximizing P(data|w), and how does that difference translate into regularization terms?
  3. What prior over weights would you choose if you believed most coefficients should be exactly zero, and why would that choice encourage sparsity?

Key Points

  1. Least squares arises as the maximum-likelihood solution for a linear model with additive Gaussian noise, making the squared residual term a consequence of the noise assumption.
  2. Assuming independent data points turns the dataset likelihood into a product of per-point probabilities; taking logs converts it into a sum that yields the standard quadratic objective.
  3. Pure maximum-likelihood can produce implausible parameter values when data are limited, because it ignores prior expectations about what parameters should look like.
  4. Regularization can be derived by maximizing joint probability (likelihood × prior), turning prior beliefs about weights into explicit penalty terms.
  5. A zero-centered Gaussian prior over weights yields L2 regularization (ridge), which shrinks all coefficients toward zero without forcing exact zeros.
  6. A Laplace prior over weights yields L1 regularization (lasso), which penalizes absolute values and more strongly encourages exact sparsity.
  7. The regularization strength λ reflects the trade-off between noise variance (how much to trust the data) and prior variance (how strongly to trust prior beliefs about weights).

Highlights

The squared-error objective isn’t arbitrary: it’s what you get when the noise term is modeled as Gaussian.
Least squares is maximum likelihood under a specific noise distribution; change the noise model and the objective changes.
The biased-coin example shows why likelihood alone can be misleading when prior plausibility matters.
L2 and L1 regularization correspond to different priors over weights: Gaussian gives ridge, Laplace gives lasso and sparsity.
