23. SEMinR Lecture Series | Step 4: Out of Sample Predictive Power | How to use PLSPredict
Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Predictive power should be assessed out of sample; R² only measures in-sample explanatory power.
Briefing
Relying on R² to judge predictive power can mislead: R² only reflects how well the model explains the data it was estimated on. PLSpredict is designed to measure how well a PLS path model forecasts unseen data using out-of-sample evaluation. Instead of treating in-sample fit as prediction quality, the method splits the dataset into a training portion (used to estimate the model parameters) and a holdout portion (used to test predictions). The holdout data is never used during estimation, so the resulting prediction errors reflect genuine forecasting ability on new observations.
PLSpredict operationalizes this idea through k-fold cross-validation. The full sample is divided into k roughly equal folds; in each round, one fold becomes the holdout sample while the remaining k-1 folds form the training sample. Predictions are generated for the holdout observations using the model estimated on the training data, and prediction errors are computed by comparing predicted values to actual values in the holdout set. The process repeats until every fold has served as the holdout sample, so each observation is predicted out of sample exactly once per repetition. Running multiple repetitions with different random splits helps avoid "abnormal solutions" that could arise from a single arbitrary train/test split.
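To make the fold rotation concrete, here is a minimal base-R sketch of the splitting logic; it illustrates the idea only (not the seminr internals), and the sample size and fold count are hypothetical.

```r
# Minimal sketch of the k-fold idea: every observation gets a fold label,
# and each fold is held out once while the remaining folds are used for estimation.
set.seed(123)
n <- 250                                      # hypothetical sample size
k <- 10                                       # number of folds
fold_id <- sample(rep(1:k, length.out = n))   # random fold label per observation

for (fold in 1:k) {
  holdout_rows  <- which(fold_id == fold)     # predicted, never used for estimation
  training_rows <- which(fold_id != fold)     # used to estimate the model parameters
  # estimate the PLS model on training_rows, predict the holdout_rows,
  # and store predicted vs. actual indicator values for the error metrics below
}
```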
Prediction quality is quantified using out-of-sample error metrics computed on the holdout predictions. Note that "error" here means a residual, the difference between the actual and the predicted value, not a mistake. The most common metric is RMSE (root mean square error), but MAE (mean absolute error) can be more appropriate when the distribution of prediction errors is highly skewed (e.g., when the residuals show a long left or right tail). To decide between RMSE and MAE, the workflow inspects the error distributions of the endogenous construct's indicators via plots; if the skewness is modest, RMSE is used.
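As a quick illustration of the two metrics, the helper functions below compute RMSE and MAE for one indicator; the holdout values are made up purely to show why squaring makes RMSE more sensitive to a single large residual.

```r
# Out-of-sample error metrics computed from holdout predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Hypothetical holdout values for one endogenous indicator
actual    <- c(5.0, 4.0, 6.0, 3.0, 7.0)
predicted <- c(4.8, 4.3, 5.7, 3.4, 4.0)   # the last case is badly under-predicted

rmse(actual, predicted)   # squaring amplifies the one large residual
mae(actual, predicted)    # less sensitive to that outlier
```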
To interpret the magnitude of these errors, PLSpredict compares them against a naive benchmark: a linear model (LM) baseline. The LM benchmark is obtained by regressing each indicator of the dependent construct directly on the indicators of the exogenous constructs in the PLS model. Predictive power then follows a guideline based on how many indicators have lower out-of-sample RMSE/MAE than the LM benchmark: if all indicators beat the benchmark, predictive power is high; if the majority do, it is medium; if only a minority do, it is low; and if none do, the model lacks predictive power.
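The classification rule can be written down directly. The RMSE values below are hypothetical and only illustrate the indicator-by-indicator comparison against the LM benchmark.

```r
# Hypothetical out-of-sample RMSE values for three endogenous indicators
pls_rmse <- c(y1 = 1.10, y2 = 0.95, y3 = 1.20)   # PLS path model
lm_rmse  <- c(y1 = 1.15, y2 = 1.02, y3 = 1.25)   # LM benchmark

share_better <- mean(pls_rmse < lm_rmse)          # share of indicators beating the LM

verdict <- if (share_better == 1) {
  "high predictive power"
} else if (share_better > 0.5) {
  "medium predictive power"
} else if (share_better > 0) {
  "low predictive power"
} else {
  "lacks predictive power"
}
verdict
```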
Models with mediator constructs add another decision point: predictions can be generated using either a direct antecedents (DA) approach or an earliest antecedents (EA) approach. DA uses both the antecedents and the mediator as predictors of the outcome, while EA excludes the mediator from the prediction step and uses only the constructs at the start of the causal chain. Simulation evidence cited in the lecture favors DA for higher predictive accuracy.
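In seminr, as I understand the interface, this choice is made through the `technique` argument of `predict_pls()`. The sketch below assumes `pls_model` is a model already estimated with `estimate_pls()` (a fuller workflow example follows in the next code block).

```r
library(seminr)

# Assumes pls_model is an already-estimated seminr model (see the workflow sketch below)
pred_da <- predict_pls(model = pls_model, technique = predict_DA)  # mediator included as predictor
pred_ea <- predict_pls(model = pls_model, technique = predict_EA)  # only the earliest antecedents
```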
In the R workflow, PLSpredict is run with the estimated PLS model, the chosen prediction technique, a number of folds (a default of 10 folds is mentioned), and a number of repetitions. The results are stored in a summary object, and the PLS out-of-sample error matrices are compared to the LM out-of-sample benchmark. In the example results, each indicator's out-of-sample RMSE under the PLS model is lower than the corresponding LM benchmark value, leading to the conclusion that the model has high out-of-sample predictive power.
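Below is a minimal sketch of how this workflow might look in seminr, using the mobi example dataset bundled with the package; the construct and indicator names follow the package's standard example, the structural paths are simplified for illustration, and the 10 folds and 10 repetitions mirror the settings mentioned in the lecture.

```r
library(seminr)

# Measurement and structural model for the bundled mobi example data
mobi_mm <- constructs(
  composite("Image",        multi_items("IMAG", 1:5)),
  composite("Expectation",  multi_items("CUEX", 1:3)),
  composite("Value",        multi_items("PERV", 1:2)),
  composite("Satisfaction", multi_items("CUSA", 1:3)),
  composite("Loyalty",      multi_items("CUSL", 1:3))
)
mobi_sm <- relationships(
  paths(from = c("Image", "Expectation"),  to = c("Value", "Satisfaction")),
  paths(from = c("Value", "Satisfaction"), to = "Loyalty")
)
mobi_pls <- estimate_pls(data = mobi,
                         measurement_model = mobi_mm,
                         structural_model  = mobi_sm)

# Out-of-sample prediction: direct-antecedents technique, 10 folds, 10 repetitions
predict_mobi <- predict_pls(model = mobi_pls,
                            technique = predict_DA,
                            noFolds = 10,
                            reps = 10)

sum_predict <- summary(predict_mobi)
sum_predict                              # PLS vs. LM, in-sample and out-of-sample RMSE/MAE
plot(sum_predict, indicator = "CUSL1")   # inspect the prediction-error distribution for skew
```

The decision rule from the briefing is then applied to the printed out-of-sample matrices: for each endogenous indicator, compare the PLS RMSE (or MAE, if the plotted errors are highly skewed) against the corresponding LM value.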
Cornell Notes
The lecture distinguishes in-sample fit from true predictive power and introduces PLSpredict as an out-of-sample evaluation method for PLS path models. Instead of relying on R², it estimates the model on a training sample and tests predictions on a holdout sample that is excluded from estimation. k-fold cross-validation implements this by repeatedly splitting the data into training and holdout folds, producing out-of-sample prediction errors for each endogenous indicator. Those errors are summarized with RMSE or MAE (chosen based on whether the prediction-error distribution is skewed) and compared against the LM benchmark. If all indicators have lower out-of-sample errors than the LM benchmark, predictive power is classified as high.
Why is R² not a reliable measure of predictive power in PLS path modeling?
How does PLSpredict create training and holdout samples?
What do RMSE and MAE measure in this context, and when should MAE be preferred?
How is predictive power determined using the LM benchmark?
What changes when the PLS path model includes a mediator construct, and why does DA matter?
What does the R implementation of PLSpredict require, and which outputs are used for the decision?
Review Questions
- In what way does out-of-sample evaluation address the shortcomings of using R² for predictive power?
- How do you decide between RMSE and MAE when using PLS predict?
- What rule determines whether predictive power is high, medium, low, or absent when comparing PLS errors to the LM benchmark?
Key Points
1. Predictive power should be assessed out of sample; R² only measures in-sample explanatory power.
2. PLSpredict estimates the model on a training sample and evaluates predictions on a holdout sample that is excluded from estimation.
3. k-fold cross-validation implements the training/holdout split by rotating which fold serves as the holdout set.
4. Use RMSE by default, but switch to MAE when the prediction errors are highly skewed (long left/right tails).
5. Compare each endogenous indicator's out-of-sample RMSE/MAE against the LM benchmark to classify predictive power as high, medium, low, or absent.
6. When mediators are present, generate predictions using the direct antecedents (DA) approach, which is more accurate than the earliest antecedents (EA) approach.
7. In R, run PLSpredict with the estimated PLS model, the DA technique, a chosen number of folds, and repetitions, then compare the PLS out-of-sample error matrices to the LM benchmarks.