Simple Linear Regression | Lecture 49 | DSMP 2023

CampusX · 6 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Simple linear regression fits a line y = m x + b to map one input feature to a numeric output by minimizing overall prediction error.

Briefing

Simple linear regression is presented as the first practical machine-learning tool for turning a roughly linear relationship between one input and one numeric output into a usable prediction rule. The core idea is to fit a straight line of the form y = m x + b to data—where m is the slope and b is the intercept—so the line becomes a “best fit” that makes the smallest overall prediction errors. That fitted line then lets someone plug in a new input value (like a student’s GPA) and estimate the output (like expected placement package), even for inputs not explicitly present in the training set.

The lecture starts by placing linear regression inside supervised learning: regression problems predict numerical outputs, while classification problems predict labels. It then distinguishes three related model types: simple linear regression (one input feature), multiple linear regression (more than one input feature), and a brief mention of polynomial/“nonlinear” extensions as the next step when data isn’t well approximated by a straight line. A recurring example uses placement data: GPA as the input and package as the output. The data is plotted to show an approximately linear trend, but with real-world scatter—some students with similar GPAs end up with different packages due to factors like interview difficulty, company fit, or practice.

When the data isn’t perfectly linear, the lecture argues that regression’s job is not to force an exact line through every point, but to choose m and b so the line minimizes error. If the relationship were perfectly linear, a single line could pass through all points; with scatter, the algorithm searches for the line that stays closest to the points overall. The “best fit line” is described geometrically as the one that reduces the distance between predicted values on the line and the actual output values.

From there, the lecture moves into implementation and mathematics. It describes splitting data into training and testing sets, fitting a linear regression model on the training portion, and evaluating predictions on the test portion. It also shows how to compute predictions using the learned coefficients (m and b) and how to inspect the fitted slope and intercept from a model object. The lecture emphasizes that the next step—covered in later material—is deriving m and b from first principles.
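
A minimal sketch of that workflow, assuming scikit-learn and pandas; the placement.csv path and the cgpa/package column names are illustrative, not taken from the lecture's files:

```python
# Minimal sketch: split the data, fit a line, inspect m and b, predict.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv("placement.csv")           # hypothetical dataset path
X = df[["cgpa"]]                            # single input feature (2-D for sklearn)
y = df["package"]                           # numeric target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42    # 20% held out for evaluation
)

lr = LinearRegression()
lr.fit(X_train, y_train)                    # learns the slope and intercept

m = lr.coef_[0]                             # slope
b = lr.intercept_                           # intercept
print(f"fitted line: y = {m:.3f} * x + {b:.3f}")

y_pred = lr.predict(X_test)                 # predictions for the held-out set
```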

The “first principles” derivation frames linear regression as minimizing a loss function. Errors are squared so positive and negative deviations don’t cancel out, leading to mean squared error (MSE) as the standard objective. The lecture then discusses related loss/metric choices: mean absolute error (MAE), MSE, and root mean squared error (RMSE), noting differences in sensitivity to outliers and units. Finally, it introduces model-quality scores: R² (coefficient of determination) as a measure of how much variance in the output the model explains compared with a baseline mean predictor, and adjusted R² to penalize adding irrelevant features. The practical takeaway is that good performance depends on both fit and feature relevance, so adjusted R² is highlighted when multiple inputs are involved.
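
Carrying over y_test, y_pred, and X_test from the sketch above, these metrics can be computed directly; the adjusted R² line is the standard textbook formula rather than a scikit-learn built-in:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)       # average |error|, robust to outliers
mse = mean_squared_error(y_test, y_pred)        # average squared error
rmse = np.sqrt(mse)                             # back in the target's original units
r2 = r2_score(y_test, y_pred)                   # variance explained vs. mean baseline

n, k = X_test.shape                             # n samples, k input features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalizes irrelevant extra predictors
```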

Overall, the session builds from intuition (fit a line), to workflow (train/test split and prediction), to evaluation (loss functions and R²/adjusted R²), and ends with guidance to compute these metrics and compare them when building regression models on real datasets.

Cornell Notes

The lecture frames simple linear regression as supervised learning for predicting a numeric output from one numeric input by fitting a line y = m x + b. When data is only approximately linear, the model chooses m (slope) and b (intercept) to minimize prediction error, using a “best fit line” concept. It then connects the geometric idea of error distances to loss functions—especially squared error—leading to MSE as the common training objective. After fitting, it evaluates performance using MAE, MSE/RMSE, and R², with adjusted R² introduced to account for the number of input features and reduce misleading gains from irrelevant variables. This matters because it turns a scatter plot into a reliable prediction rule and provides metrics to judge whether the rule generalizes.

Why is linear regression treated as a supervised learning method, and what distinguishes regression from classification?

Supervised learning uses labeled data: inputs paired with known outputs. In regression, the output is numeric, so the model learns a numeric mapping (e.g., GPA → package). In classification, the output is a category/label (e.g., pass/fail). The lecture ties simple linear regression specifically to regression problems where the output column contains numerical values.

What does “best fit line” mean when the plotted data points don’t lie exactly on a straight line?

A perfect line would pass through every point, but real data is scattered. The “best fit line” is the line y = m x + b that minimizes overall error—predicted values on the line should be as close as possible to the actual output values. Geometrically, the lecture describes reducing the distances between points and the line, and it motivates squaring errors so deviations above and below the line don’t cancel out.
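
Written out, the squared-error objective and its closed-form solution for one feature are the standard ordinary-least-squares result this framing builds toward (stated here for reference; the lecture defers the full derivation to later material):

```latex
\min_{m,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2
\quad\Longrightarrow\quad
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
```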

How do m and b determine predictions in simple linear regression?

Once m (slope) and b (intercept) are learned, predictions follow y = m x + b. The intercept b represents the predicted output when x = 0, while the slope m controls how strongly the output changes as x increases. The lecture emphasizes that if m is large, small changes in x produce larger changes in predicted y; if m is small, predictions change more slowly with x.
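
A tiny illustration of that reading, with made-up values for m and b (not taken from the lecture's dataset):

```python
m, b = 0.55, 0.89                 # hypothetical slope and intercept

def predict_package(gpa):
    """Predicted package for a given GPA using y = m*x + b."""
    return m * gpa + b

print(predict_package(6.0))       # ≈ 4.19
print(predict_package(8.0))       # ≈ 5.29; each extra GPA point adds m = 0.55
```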

Why does the lecture focus on squared error (MSE) rather than raw absolute error?

Squared error is used because it penalizes larger deviations more strongly and avoids cancellation between positive and negative errors. The lecture contrasts MAE (mean absolute error) with MSE: MAE uses |error| and is more robust to outliers, while MSE uses error², which is differentiable and easier for calculus-based optimization. RMSE is then introduced as the square-rooted version of MSE to bring the metric back to the original output units.
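
A quick sketch of the outlier-sensitivity contrast, using hypothetical residuals:

```python
import numpy as np

# Hypothetical residuals (actual - predicted); the last one is an outlier.
errors = np.array([0.5, -0.3, 0.4, -0.2, 5.0])

mae = np.mean(np.abs(errors))           # 1.28  -> grows linearly with the outlier
rmse = np.sqrt(np.mean(errors ** 2))    # ≈ 2.26 -> squaring amplifies the outlier
print(mae, rmse)
```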

What is R² measuring, and why can adjusted R² be more reliable?

R² compares the model’s error to a baseline predictor that uses the mean of the target. It can be interpreted as the fraction of variance in the output explained by the input features. The lecture warns that adding more features can inflate R² even if new features are irrelevant. Adjusted R² compensates for the number of predictors, so it penalizes unnecessary complexity and better reflects whether added inputs truly improve explanatory power.
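
For reference, the standard formulas behind this comparison, with ŷᵢ the model's prediction, ȳ the mean of the target, n samples, and k predictors:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
```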

What is the practical workflow for fitting and evaluating a regression model?

The lecture describes plotting the data, then splitting it into training and test sets (often with a test size such as 20%). The model is trained on the training set, then predictions are made on the test set. Performance is checked by comparing predicted outputs to true outputs using metrics like MAE/MSE/RMSE and R²/adjusted R².
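
A sketch of the plotting step, assuming matplotlib and reusing df, X_test, and lr from the earlier sketch; column names and axis labels are illustrative:

```python
import matplotlib.pyplot as plt

# Scatter of the raw data plus the fitted line drawn over the test inputs.
plt.scatter(df["cgpa"], df["package"], alpha=0.6, label="students")
plt.plot(X_test["cgpa"], lr.predict(X_test), color="red", label="best fit line")
plt.xlabel("GPA")
plt.ylabel("Package")
plt.legend()
plt.show()
```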

Review Questions

  1. In simple linear regression, what roles do the slope m and intercept b play in the prediction equation y = m x + b?
  2. Why does squaring errors lead to MSE as a common objective function for linear regression training?
  3. How does adjusted R² address the limitation of R² when adding irrelevant input features?

Key Points

  1. Simple linear regression fits a line y = m x + b to map one input feature to a numeric output by minimizing overall prediction error.
  2. Regression is part of supervised learning because training uses input-output pairs; regression predicts numeric values while classification predicts labels.
  3. When data is scattered rather than perfectly linear, the “best fit line” is the one that stays closest to points in aggregate, not one that passes through every point.
  4. Model evaluation uses loss/metrics such as MAE, MSE, and RMSE, each with different sensitivity to outliers and different unit interpretations.
  5. R² measures how much variance in the target the model explains relative to predicting the mean, but it can be misleading when more features are added.
  6. Adjusted R² penalizes adding extra predictors, making it more trustworthy than raw R² when many input features are available.
  7. A standard workflow is to split data into train/test sets, fit on training data, predict on test data, and then compute metrics to judge generalization.

Highlights

The lecture’s central move is turning a scatter plot into a prediction rule by learning m and b for y = m x + b.
Squared error is used so positive and negative deviations don’t cancel, leading naturally to MSE as the optimization target.
R² can rise simply by adding features, so adjusted R² is emphasized to reduce the advantage of irrelevant inputs.
Loss metrics (MAE/MSE/RMSE) differ in outlier sensitivity and interpretability, so comparing them helps decide what “good” means for a specific problem.

Topics

Mentioned

  • MSE
  • MAE
  • RMSE
  • R²
  • SGD