3. SEM | SPSS AMOS Lecture Series - Data Screening and Imputation using SPSS
Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Data screening and missing-data handling sit at the foundation of any SEM workflow: before testing whether one construct influences another, researchers must verify that the observed indicators are coded correctly, that respondents didn’t produce unusable patterns, and that missing values are treated in a way that preserves statistical validity. The lecture frames this as a sequence—first catch errors, outliers, and survey noncompliance; then diagnose missingness; finally choose an imputation strategy when gaps exist—because downstream model estimates depend heavily on the quality of the input data.
The first priority is basic integrity checks. After data entry, an ID column should be created as an increasing sequence aligned with each row, making it easier to locate and audit specific cases later. For paper questionnaires, each form should be numbered so incorrect entries can be traced back to the original hard copy. To detect abandonment, the data can be sorted by the final questionnaire columns in ascending order; if a respondent stops answering near the end, the decision becomes whether the missing portion is small enough to keep. Missing only the last one or two items may be acceptable, but missing roughly 40–50% of the questionnaire typically warrants deleting that response.
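The lecture performs this check by sorting in SPSS and eyeballing the tail of the data. As an illustration outside SPSS, the same abandonment rule can be sketched in pandas (the data frame, column names, and 40% cutoff here are hypothetical stand-ins for your own survey):

```python
import pandas as pd
import numpy as np

# Hypothetical survey data: an ID column plus Likert items q1..q10,
# with NaN marking an unanswered item. Respondent 3 abandoned mid-survey.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "q1": [4, 5, 3], "q2": [4, 6, 3], "q3": [5, 6, 2], "q4": [4, 5, 3],
    "q5": [4, 6, np.nan], "q6": [5, 5, np.nan], "q7": [4, 6, np.nan],
    "q8": [np.nan, 5, np.nan], "q9": [np.nan, 6, np.nan], "q10": [4, 5, np.nan],
})

items = [c for c in df.columns if c != "id"]

# Proportion of the questionnaire each respondent left blank.
df["prop_missing"] = df[items].isna().mean(axis=1)

# Rule of thumb from the lecture: a small missing tail may be kept,
# but roughly 40-50% missing warrants deleting the response.
keep = df[df["prop_missing"] < 0.4]
```

Here respondent 1 (2 of 10 items missing, 20%) is retained, while respondent 3 (6 of 10 missing, 60%) is dropped; the exact cutoff should follow your own judgment about the survey.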
Respondent misconduct is treated as another major threat. On Likert-type scales (e.g., 1–7), a key red flag is straight-lining—marking the same option across all items. The lecture recommends adding attention checks (including reverse-coded questions) to verify that respondents are reading and responding thoughtfully. A practical diagnostic is to compute the standard deviation of each respondent’s answers across the Likert items: very low variation suggests the responses are not meaningfully differentiated. Because doing this directly in SPSS can be cumbersome, the workflow uses Excel to calculate row-wise standard deviation while excluding the ID column. As a rule of thumb, standard deviation below 0.25 should trigger strong scrutiny and likely deletion, though the threshold isn’t universal and should be adjusted to survey size and context.
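The lecture does this row-wise standard deviation step in Excel; an equivalent sketch in pandas (hypothetical data, same 0.25 rule of thumb) looks like this:

```python
import pandas as pd

# Hypothetical 1-7 Likert responses; respondent 2 straight-lines with all 4s.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "q1": [3, 4, 7], "q2": [5, 4, 1], "q3": [2, 4, 6],
    "q4": [6, 4, 2], "q5": [4, 4, 5],
})

items = [c for c in df.columns if c != "id"]

# Per-respondent standard deviation across the Likert items,
# excluding the ID column (mirrors the Excel step in the lecture).
df["resp_sd"] = df[items].std(axis=1)

# Rule of thumb: SD below 0.25 is a strong straight-lining warning.
flagged = df[df["resp_sd"] < 0.25]
```

Only respondent 2 (identical answers, SD of 0) is flagged; as the lecture notes, the 0.25 threshold is a heuristic and should be adjusted to the scale length and context.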
Next comes screening for impermissible values—cases where respondents entered values outside the allowed range. In SPSS, this is done via descriptive statistics (Analyze → Descriptive Statistics → Descriptives) with minimum and maximum options to confirm that observed extremes match expected bounds.
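The SPSS Descriptives dialog reports these extremes through its menus; the same minimum/maximum screen can be illustrated in pandas (data and the 1–7 bounds here are hypothetical):

```python
import pandas as pd

# Hypothetical 1-7 Likert data with one impermissible entry:
# q2 contains 17, e.g. a mistyped 1 or 7 during data entry.
df = pd.DataFrame({
    "q1": [4, 5, 3],
    "q2": [4, 17, 3],
    "q3": [5, 6, 2],
})

# Analogue of SPSS Descriptives with Minimum/Maximum checked:
# observed extremes per item.
summary = df.agg(["min", "max"])

# Flag any item whose observed range falls outside the allowed 1-7 bounds.
bad_items = [c for c in df.columns if df[c].min() < 1 or df[c].max() > 7]
```

The out-of-range case can then be traced back to the numbered paper questionnaire and corrected from the original hard copy.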
Missing data diagnosis then uses SPSS frequencies and descriptive statistics to confirm whether missingness exists in specific indicators. The lecture discourages listwise or pairwise deletion because it discards an entire case when even one item is missing, which reduces statistical power and can bias estimates. Instead, imputation is presented as the preferable alternative when missingness isn’t excessive. Imputation replaces missing entries with plausible numeric estimates; the lecture notes that series mean imputation is popular for its ease of use but can shrink variance and ignore individual differences. Linear interpolation is offered as a second SPSS-based approach: it estimates a missing value from the last valid value before the gap and the next valid value after it, assuming a linear progression between them. In SPSS, both methods are implemented through Transform → Replace Missing Values, where SPSS creates new imputed variables (e.g., with an underscore suffix) and the Method option in the same dialog switches between series mean and linear interpolation.
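The two SPSS methods can be illustrated outside SPSS with a pandas sketch on a hypothetical indicator series, which makes their tradeoff visible side by side:

```python
import pandas as pd
import numpy as np

# Hypothetical indicator with two gaps.
s = pd.Series([2.0, np.nan, 4.0, 5.0, np.nan, 7.0])

# Series mean imputation: every gap receives the variable's overall mean
# (easy, but it shrinks variance and ignores individual differences).
mean_imputed = s.fillna(s.mean())

# Linear interpolation: each gap is estimated from the last valid value
# before it and the next valid value after it, assuming a straight line.
interp = s.interpolate(method="linear")
```

Series mean fills both gaps with 4.5, flattening the pattern, while interpolation recovers 3.0 and 6.0, preserving the apparent progression. In SPSS itself, Transform → Replace Missing Values produces the same kind of new imputed variable for either method.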
Cornell Notes
Before SEM can test relationships among constructs, the data must be cleaned: create an ID column, check for abandonment and respondent misconduct, and verify that values fall within allowed ranges. Abandonment can be detected by sorting the last questionnaire columns and deciding whether missingness is small enough to keep. Misconduct is flagged by straight-lining on Likert scales; a practical check is computing each respondent’s standard deviation across Likert items, with values below 0.25 treated as a strong warning sign. Missing data should be diagnosed in SPSS, and deletion (listwise/pairwise) is discouraged because it throws away too much data. Imputation is presented as the alternative, with SPSS methods including series mean imputation and linear interpolation.
Why is creating an ID column and numbering questionnaires emphasized before analysis?
How can abandonment be detected, and what decision rule is suggested for keeping or deleting responses?
What indicators suggest respondent misconduct on a 1–7 Likert scale, and how is it measured?
What standard deviation threshold is used as a rule of thumb, and how should it be interpreted?
Why does the lecture discourage listwise/pairwise deletion for missing data?
How do series mean imputation and linear interpolation differ in SPSS, and what tradeoffs are noted?
Review Questions
- What steps should be completed before testing construct relationships in SEM, and why do they matter for model validity?
- Compare the logic of abandonment detection versus respondent misconduct detection—what patterns are each looking for?
- In SPSS, where would you implement Transform → Replace Missing Values, and what changes when switching from series mean to linear interpolation?
Key Points
1. Create an ID column to reliably track and audit individual cases after sorting or filtering.
2. Detect abandonment by sorting the last questionnaire items and decide deletion based on how much of the survey is missing (a small missing tail may be acceptable; large gaps of ~40–50% are not).
3. Flag respondent misconduct using straight-lining checks, attention checks (including reverse-coded questions), and per-respondent standard deviation across Likert items.
4. Use a standard deviation below 0.25 as a rule-of-thumb trigger for strong scrutiny, while recognizing that thresholds should be tailored to the survey context.
5. Screen for impermissible values by checking minimum and maximum ranges in SPSS descriptive statistics.
6. Prefer imputation over listwise/pairwise deletion when missingness isn’t excessive, because deletion discards too much data.
7. Implement imputation in SPSS via Transform → Replace Missing Values, choosing between series mean (easy but variance-reducing) and linear interpolation (gap-based estimation assuming linearity).