
Statistics for Research - L24 || #Regression Analysis using #R

Research With Fawad
5 min read

Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Linear regression predicts a dependent variable from one or more independent variables using a least-squares line that minimizes distance to observed data points.

Briefing

Linear regression is presented as a practical way to predict a dependent outcome from one or more predictors, using a least-squares line that summarizes the pattern in data. After scatter plots and correlation analysis have already described direction, form, and strength of linear relationships, regression turns that relationship into a usable prediction tool: it estimates the dependent variable’s value based on independent variables, and it does so by choosing the line that stays as close as possible to the observed data points. The closer the line is to the points, the better the predictions—meaning the predicted y values land near the actual y values.

The lesson anchors the method with real-world examples: predicting house prices from size, bedrooms, and location; forecasting sales revenue from marketing spend, competitor activity, and economic indicators (using historical data from prior years); estimating probability of default from borrower characteristics like credit score, annual income, and debt-to-income ratio; and predicting student satisfaction from survey scores tied to university responsibility, service quality, and leadership. In each case, regression is framed as a bridge from observed relationships to forecasting outcomes.

A concrete walkthrough uses a scatter plot of “vision” versus “organizational performance.” As vision scores rise, organizational performance tends to increase, and the fitted line reflects that upward trend. To make a prediction, the process is illustrated by selecting an x value (vision score of 6), moving vertically to the regression line, and reading the corresponding y value (predicted organizational performance). The quality of the model is tied to how tightly most data points cluster around the line; when points sit near the line across the range of x values, the predictor is treated as useful.
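
As a minimal sketch of that read-off (with made-up vision and performance numbers purely for illustration), R's predict() returns the y value on the fitted line for a chosen x value:

```r
# Toy data for illustration only; the actual lesson reads data from a CSV.
data_s <- data.frame(
  vision = c(2, 3, 4, 5, 6, 7),
  op     = c(2.5, 3.1, 4.2, 4.8, 6.1, 6.7)
)
model <- lm(op ~ vision, data = data_s)            # least-squares line
predict(model, newdata = data.frame(vision = 6))   # predicted op at vision = 6
```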

The implementation in R follows a clear workflow. First, the stats package is loaded (it is part of R's base installation). Next, data are read from a CSV file in the same folder as the R script, with comma-separated values, and stored in an object (called data_s here, following the transcript). The linear model is then fit using the lm() function, specifying the dependent variable (op) and the independent variable (vision) with the formula structure lm(op ~ vision, data = data_s). Model results are extracted with summary(model), which returns residuals and key statistics.
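
Put together, the workflow looks roughly like this (the CSV filename is illustrative; op and vision follow the transcript's example):

```r
library(stats)                            # attached by default in base R

data_s <- read.csv("survey_data.csv")     # hypothetical filename; CSV beside the script
model  <- lm(op ~ vision, data = data_s)  # op = dependent, vision = independent
summary(model)                            # residuals, coefficients, R square, F statistic
```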

Interpretation centers on whether vision has a significant impact on organizational performance. Significance is assessed using the t value and p value: the transcript notes significance when the t value exceeds 1.96 and the p value is below 0.01, with “three stars” indicating very small p-values. Model fit is summarized with R square, described as the percentage of variation in the dependent variable explained by the predictors (38.34% in the example). The F statistic and its p value provide an overall test of the model's explanatory power.
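
These quantities can also be pulled directly out of the summary object, assuming the model fitted above:

```r
s <- summary(model)
s$coefficients    # Estimate, Std. Error, t value, Pr(>|t|) per predictor
s$r.squared       # e.g., 0.3834, i.e., 38.34% of variation explained
s$fstatistic      # overall F value with its degrees of freedom
```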

Finally, the lesson extends to multiple regression. With more than one independent variable, the lm() formula uses plus signs (e.g., op ~ TAA + vision), and the same summary-based approach retrieves coefficients, R square/adjusted R square, and the F statistic, as sketched below. The transcript also mentions exporting results to Excel and using packages like openxlsx to write regression outputs to an .xlsx file. The session ends by pointing to a next step: reporting regression results clearly from R output.
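
A sketch of the multiple-regression call, using the TAA and vision predictors named in the transcript:

```r
# Add predictors with "+" on the right-hand side of the formula.
model_m <- lm(op ~ TAA + vision, data = data_s)
summary(model_m)   # coefficients, R square, adjusted R square, F statistic
```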

Cornell Notes

Linear regression is used to predict a dependent variable from one or more independent variables by fitting a least-squares line through data. Predictions come from the fitted line: choose an x value (predictor), move to the line, and read the corresponding y value (predicted outcome). Model quality depends on how closely the line matches the observed points, and statistical output from R’s lm() and summary() helps judge significance and fit. Significance is assessed with t values and p values (with stars indicating very small p-values), while R square is interpreted as the share of variation in the dependent variable explained by the predictors. The same workflow extends to multiple regression by adding predictors with “+” in the lm() formula.

How does the least-squares regression line connect to prediction quality?

The regression line is chosen so it stays as close as possible to the observed data points (least squares). A “good prediction” means the predicted y value from the line is close to the actual y value for each data point. When most points cluster near the line—like the vision vs. organizational performance example—predictions are more reliable and the predictor is treated as useful.
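
In code terms (a sketch, assuming the fitted model object from the walkthrough), that closeness is exactly what the residuals measure:

```r
fitted(model)             # predicted y values on the line
residuals(model)          # observed y minus predicted y, point by point
sum(residuals(model)^2)   # the squared distance that least squares minimizes
```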

What does R’s lm() formula mean in the single-predictor example?

The transcript uses lm(op ~ vision, data = data_s). Here, op is the dependent variable (DV/effect), vision is the independent variable (IV/cause), and data = data_s tells R where the variables live. The fitted model is stored in a model object, and summary(model) extracts residuals, coefficients, t values, p values, R square, and the F statistic.
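
For reference, the fitted object also exposes the line itself (a sketch, assuming the model fitted earlier):

```r
coef(model)      # named vector: (Intercept) and the vision slope
formula(model)   # op ~ vision, the DV ~ IV structure described above
```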

How should significance of a predictor be interpreted from summary() output?

Significance is judged using the t value and p value. The transcript notes an example where vision is significant because the t value exceeds 1.96 and the p value is below 0.01. It also mentions that “three stars” correspond to very small p-values (around 0.001 or less, as implied by the transcript’s star notation).
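
A sketch of reading those two numbers for vision straight from the coefficient table:

```r
coefs <- summary(model)$coefficients
coefs["vision", "t value"]     # compare against the 1.96 rule of thumb
coefs["vision", "Pr(>|t|)"]    # the p value; stars flag very small values
```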

What does R square represent, and how is it interpreted numerically?

R square is described as the proportion of variation in the dependent variable explained by the independent variable(s). The transcript gives an example: multiplying R square by 100 yields 38.34%, meaning vision accounts for 38.34% of the variation in organizational performance in the model.
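
That percentage interpretation is a one-liner on the summary object:

```r
r2 <- summary(model)$r.squared
r2 * 100   # e.g., 38.34: percent of variation in op explained by vision
```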

How does multiple regression change the lm() call?

With multiple independent variables, predictors are added with plus signs in the formula. The transcript uses the structure lm(dependent ~ IV1 + IV2, data = ...). After fitting, summary() is again used to retrieve R square/adjusted R square, the F statistic, and other details for interpreting the combined model.
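
Assuming the multiple-regression fit sketched earlier (model_m), the combined-model statistics come from the same summary object:

```r
s_m <- summary(model_m)
s_m$adj.r.squared   # R square adjusted for the number of predictors
s_m$fstatistic      # overall F test of the model's explanatory power
```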

Why might results be exported to Excel in this workflow?

The transcript suggests copying regression output and converting number formats (e.g., formatting cells to three decimals) for readability. It also mentions using the openxlsx package (installing it if needed) to write regression results into an .xlsx file, which helps with reporting and documentation.
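
A minimal sketch of that export with openxlsx (the output filename is illustrative):

```r
# install.packages("openxlsx")   # run once if the package is not installed
library(openxlsx)

coef_table <- as.data.frame(summary(model)$coefficients)
write.xlsx(coef_table, file = "regression_results.xlsx", rowNames = TRUE)
```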

Review Questions

  1. In the vision vs. organizational performance example, how would you make a prediction for a specific vision score using the regression line?
  2. What statistics from summary(lm(...)) help determine (1) whether a predictor is significant and (2) how much variance the model explains?
  3. How do you modify the lm() formula to move from simple regression to multiple regression with two predictors?

Key Points

  1. Linear regression predicts a dependent variable from one or more independent variables using a least-squares line that minimizes distance to observed data points.

  2. Scatter plots and correlation analysis help describe linear relationships, but regression converts that relationship into explicit predictions.

  3. A prediction is made by selecting an x value on the predictor, moving to the regression line, and reading the corresponding y value.

  4. In R, fit a model with lm(dependent ~ independent, data = your_data) and extract results with summary(model).

  5. Significance is assessed using t values and p values; the transcript highlights significance when t > 1.96 and p < 0.01, with stars indicating very small p-values.

  6. Model fit is summarized with R square, interpreted as the percentage of variation in the dependent variable explained by the predictors.

  7. Multiple regression adds predictors in the lm() formula using plus signs (e.g., dependent ~ IV1 + IV2).

Highlights

Regression turns a visual pattern into a prediction rule by fitting a least-squares line through the data.
A “good” regression line is one that stays close to most data points, so predicted y values land near actual y values.
R’s summary output provides the tools to judge both significance (t and p values) and fit (R square and the F statistic).
Multiple regression in R is implemented by adding predictors with “+” inside the lm() formula.
