Simple Linear Regression | Lecture 49 | DSMP 2023
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Simple linear regression fits a line y = m x + b to map one input feature to a numeric output by minimizing overall prediction error.
Briefing
Simple linear regression is presented as the first practical machine-learning tool for turning a roughly linear relationship between one input and one numeric output into a usable prediction rule. The core idea is to fit a straight line of the form y = m x + b, where m is the slope and b is the intercept, so that the line becomes a “best fit” that minimizes the overall prediction error. That fitted line then lets someone plug in a new input value (such as a student’s GPA) and estimate the output (such as the expected placement package), even for inputs not explicitly present in the training set.
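As a rough illustration of that idea, the sketch below fits a straight line (a degree-1 polynomial) to a handful of made-up GPA/package pairs and then predicts for a GPA not in the data. The numbers and the use of numpy.polyfit are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

# Hypothetical GPA/package pairs (made-up values, not the lecture's dataset)
gpa = np.array([6.5, 7.0, 7.8, 8.2, 8.9, 9.3])
package = np.array([3.0, 3.4, 4.1, 4.5, 5.2, 5.8])

# Fit y = m*x + b by ordinary least squares (degree-1 polynomial fit)
m, b = np.polyfit(gpa, package, deg=1)

# Predict the package for a GPA not present in the data
new_gpa = 8.0
predicted_package = m * new_gpa + b
print(f"m = {m:.3f}, b = {b:.3f}, prediction for GPA {new_gpa}: {predicted_package:.2f}")
```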
The lecture starts by placing linear regression inside supervised learning: regression problems predict numerical outputs, while classification problems predict labels. It then distinguishes three related model types: simple linear regression (one input feature), multiple linear regression (more than one input feature), and, briefly, polynomial/“nonlinear” extensions as the next step when the data isn’t well approximated by a straight line. A recurring example uses placement data: GPA as the input and package as the output. The data is plotted to show an approximately linear trend with real-world scatter: some students with similar GPAs end up with different packages because of factors like interview difficulty, company fit, or practice.
When the data isn’t perfectly linear, the lecture argues that regression’s job is not to force an exact line through every point, but to choose m and b so the line minimizes error. If the relationship were perfectly linear, a single line could pass through all points; with scatter, the algorithm searches for the line that stays closest to the points overall. The “best fit line” is described geometrically as the one that reduces the distance between predicted values on the line and the actual output values.
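That “closest overall” criterion can be written as a minimization problem. A standard way to state it is the ordinary least-squares objective and its closed-form solution (textbook notation, not necessarily the exact symbols used in the lecture):

```latex
\min_{m,\,b} \; \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2,
\qquad
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
```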
From there, the lecture moves into implementation and mathematics. It describes splitting data into training and testing sets, fitting a linear regression model on the training portion, and evaluating predictions on the test portion. It also shows how to compute predictions using the learned coefficients (m and b) and how to inspect the fitted slope and intercept from a model object. The lecture emphasizes that the next step—covered in later material—is deriving m and b from first principles.
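A minimal sketch of that workflow in scikit-learn is shown below. The file name and the column names cgpa and package are assumptions for illustration, not confirmed details from the lecture.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Assumed dataset: one input column (cgpa) and one target column (package)
df = pd.read_csv("placement.csv")          # hypothetical file name
X = df[["cgpa"]]                           # 2-D feature matrix with a single column
y = df["package"]

# Hold out part of the data to judge generalization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the line on the training portion
model = LinearRegression()
model.fit(X_train, y_train)

# Inspect the learned slope (m) and intercept (b)
print("m =", model.coef_[0], " b =", model.intercept_)

# Predict on the held-out test portion
y_pred = model.predict(X_test)
```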
The “first principles” derivation frames linear regression as minimizing a loss function. Errors are squared so positive and negative deviations don’t cancel out, leading to mean squared error (MSE) as the standard objective. The lecture then discusses related loss/metric choices: mean absolute error (MAE), MSE, and root mean squared error (RMSE), noting differences in sensitivity to outliers and units. Finally, it introduces model-quality scores: R² (coefficient of determination) as a measure of how much variance in the output the model explains compared with a baseline mean predictor, and adjusted R² to penalize adding irrelevant features. The practical takeaway is that good performance depends on both fit and feature relevance, so adjusted R² is highlighted when multiple inputs are involved.
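For reference, the usual definitions of these quantities (standard textbook forms, where ŷᵢ is the model’s prediction, ȳ the mean of the targets, n the number of samples, and k the number of input features):

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}

R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2},
\qquad
R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
```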
Overall, the session builds from intuition (fit a line), to workflow (train/test split and prediction), to evaluation (loss functions and R²/adjusted R²), and ends with guidance to compute these metrics and compare them when building regression models on real datasets.
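Continuing the workflow sketch above, these metrics might be computed as follows. Adjusted R² has no dedicated scikit-learn helper, so it is derived from R² here; the variables y_test, y_pred, and X_test come from the earlier (assumed) train/test split.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                      # same units as the target
r2 = r2_score(y_test, y_pred)

# Adjusted R^2 penalizes extra predictors: n samples, k input features
n, k = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f} adjR2={adj_r2:.3f}")
```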
Cornell Notes
The lecture frames simple linear regression as supervised learning for predicting a numeric output from one numeric input by fitting a line y = m x + b. When data is only approximately linear, the model chooses m (slope) and b (intercept) to minimize prediction error, using a “best fit line” concept. It then connects the geometric idea of error distances to loss functions—especially squared error—leading to MSE as the common training objective. After fitting, it evaluates performance using MAE, MSE/RMSE, and R², with adjusted R² introduced to account for the number of input features and reduce misleading gains from irrelevant variables. This matters because it turns a scatter plot into a reliable prediction rule and provides metrics to judge whether the rule generalizes.
- Why is linear regression treated as a supervised learning method, and what distinguishes regression from classification?
- What does “best fit line” mean when the plotted data points don’t lie exactly on a straight line?
- How do m and b determine predictions in simple linear regression?
- Why does the lecture focus on squared error (MSE) rather than raw absolute error?
- What is R² measuring, and why can adjusted R² be more reliable?
- What is the practical workflow for fitting and evaluating a regression model?
Review Questions
- In simple linear regression, what roles do the slope m and intercept b play in the prediction equation y = m x + b?
- Why does squaring errors lead to MSE as a common objective function for linear regression training?
- How does adjusted R² address the limitation of R² when adding irrelevant input features?
Key Points
1. Simple linear regression fits a line y = m x + b to map one input feature to a numeric output by minimizing overall prediction error.
2. Regression is part of supervised learning because training uses input-output pairs; regression predicts numeric values while classification predicts labels.
3. When data is scattered rather than perfectly linear, the “best fit line” is the one that stays closest to the points in aggregate, not one that passes through every point.
4. Model evaluation uses loss/metrics such as MAE, MSE, and RMSE, each with different sensitivity to outliers and different unit interpretations.
5. R² measures how much variance in the target the model explains relative to predicting the mean, but it can be misleading when more features are added.
6. Adjusted R² penalizes adding extra predictors, making it more trustworthy than raw R² when many input features are available.
7. A standard workflow is to split data into train/test sets, fit on training data, predict on test data, and then compute metrics to judge generalization.