3. SEM | SPSS AMOS Lecture Series - Data Screening and Imputation using SPSS
Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Data screening and missing-data handling sit at the foundation of any SEM workflow: before testing whether one construct influences another, researchers must verify that the observed indicators are coded correctly, that respondents didn’t produce unusable patterns, and that missing values are treated in a way that preserves statistical validity. The lecture frames this as a sequence—first catch errors, outliers, and survey noncompliance; then diagnose missingness; finally choose an imputation strategy when gaps exist—because downstream model estimates depend heavily on the quality of the input data.
The first priority is basic integrity checks. After data entry, an ID column should be created as an increasing sequence aligned with each row, making it easier to locate and audit specific cases later. For paper questionnaires, each form should be numbered so incorrect entries can be traced back to the original hard copy. To detect abandonment, the data can be sorted by the final questionnaire columns in ascending order; if a respondent stops answering near the end, the decision becomes whether the missing portion is small enough to keep. Missing only the last one or two items may be acceptable, but missing roughly 40–50% of the questionnaire typically warrants deleting that response.
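The lecture performs this check by sorting in SPSS and eyeballing the tail of the data. As an illustration outside SPSS, the same abandonment rule can be sketched in pandas (the data frame, column names, and 40% cutoff here are hypothetical stand-ins for your own survey):

```python
import pandas as pd
import numpy as np

# Hypothetical survey data: an ID column plus Likert items q1..q10,
# with NaN marking an unanswered item. Respondent 3 abandoned mid-survey.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "q1": [4, 5, 3], "q2": [4, 6, 3], "q3": [5, 6, 2], "q4": [4, 5, 3],
    "q5": [4, 6, np.nan], "q6": [5, 5, np.nan], "q7": [4, 6, np.nan],
    "q8": [np.nan, 5, np.nan], "q9": [np.nan, 6, np.nan], "q10": [4, 5, np.nan],
})

items = [c for c in df.columns if c != "id"]

# Proportion of the questionnaire each respondent left blank.
df["prop_missing"] = df[items].isna().mean(axis=1)

# Rule of thumb from the lecture: a small missing tail may be kept,
# but roughly 40-50% missing warrants deleting the response.
keep = df[df["prop_missing"] < 0.4]
```

Here respondent 1 (2 of 10 items missing, 20%) is retained, while respondent 3 (6 of 10 missing, 60%) is dropped; the exact cutoff should follow your own judgment about the survey.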
Respondent misconduct is treated as another major threat. On Likert-type scales (e.g., 1–7), a key red flag is straight-lining—marking the same option across all items. The lecture recommends adding attention checks (including reverse-coded questions) to verify that respondents are reading and responding thoughtfully. A practical diagnostic is to compute the standard deviation of each respondent’s answers across the Likert items: very low variation suggests the responses are not meaningfully differentiated. Because doing this directly in SPSS can be cumbersome, the workflow uses Excel to calculate row-wise standard deviation while excluding the ID column. As a rule of thumb, standard deviation below 0.25 should trigger strong scrutiny and likely deletion, though the threshold isn’t universal and should be adjusted to survey size and context.
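The lecture does this row-wise standard deviation step in Excel; an equivalent sketch in pandas (hypothetical data, same 0.25 rule of thumb) looks like this:

```python
import pandas as pd

# Hypothetical 1-7 Likert responses; respondent 2 straight-lines with all 4s.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "q1": [3, 4, 7], "q2": [5, 4, 1], "q3": [2, 4, 6],
    "q4": [6, 4, 2], "q5": [4, 4, 5],
})

items = [c for c in df.columns if c != "id"]

# Per-respondent standard deviation across the Likert items,
# excluding the ID column (mirrors the Excel step in the lecture).
df["resp_sd"] = df[items].std(axis=1)

# Rule of thumb: SD below 0.25 is a strong straight-lining warning.
flagged = df[df["resp_sd"] < 0.25]
```

Only respondent 2 (identical answers, SD of 0) is flagged; as the lecture notes, the 0.25 threshold is a heuristic and should be adjusted to the scale length and context.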
Next comes screening for impermissible values—cases where respondents entered values outside the allowed range. In SPSS, this is done via descriptive statistics (Analyze → Descriptive Statistics → Descriptives) with minimum and maximum options to confirm that observed extremes match expected bounds.
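The SPSS Descriptives dialog reports these extremes through its menus; the same minimum/maximum screen can be illustrated in pandas (data and the 1–7 bounds here are hypothetical):

```python
import pandas as pd

# Hypothetical 1-7 Likert data with one impermissible entry:
# q2 contains 17, e.g. a mistyped 1 or 7 during data entry.
df = pd.DataFrame({
    "q1": [4, 5, 3],
    "q2": [4, 17, 3],
    "q3": [5, 6, 2],
})

# Analogue of SPSS Descriptives with Minimum/Maximum checked:
# observed extremes per item.
summary = df.agg(["min", "max"])

# Flag any item whose observed range falls outside the allowed 1-7 bounds.
bad_items = [c for c in df.columns if df[c].min() < 1 or df[c].max() > 7]
```

The out-of-range case can then be traced back to the numbered paper questionnaire and corrected from the original hard copy.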
Missing data diagnosis then uses SPSS frequencies and descriptive statistics to confirm whether missingness exists in specific indicators. The lecture discourages listwise or pairwise deletion because it discards an entire case when even one item is missing, which reduces statistical power and can bias estimates. Instead, imputation is presented as the preferable alternative when missingness isn’t excessive. Imputation replaces missing entries with plausible numeric estimates; the lecture notes that series mean imputation is popular for its ease of use but can shrink variance and ignore individual differences. Linear interpolation is offered as a second SPSS-based approach: it estimates a missing value from the last valid value before the gap and the next valid value after it, assuming a linear progression between them. In SPSS, both methods are implemented through Transform → Replace Missing Values, where SPSS creates new imputed variables (e.g., with an underscore suffix) and the Method option in the same dialog switches between series mean and linear interpolation.
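The two SPSS methods can be illustrated outside SPSS with a pandas sketch on a hypothetical indicator series, which makes their tradeoff visible side by side:

```python
import pandas as pd
import numpy as np

# Hypothetical indicator with two gaps.
s = pd.Series([2.0, np.nan, 4.0, 5.0, np.nan, 7.0])

# Series mean imputation: every gap receives the variable's overall mean
# (easy, but it shrinks variance and ignores individual differences).
mean_imputed = s.fillna(s.mean())

# Linear interpolation: each gap is estimated from the last valid value
# before it and the next valid value after it, assuming a straight line.
interp = s.interpolate(method="linear")
```

Series mean fills both gaps with 4.5, flattening the pattern, while interpolation recovers 3.0 and 6.0, preserving the apparent progression. In SPSS itself, Transform → Replace Missing Values produces the same kind of new imputed variable for either method.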
Cornell Notes
Before SEM can test relationships among constructs, the data must be cleaned: create an ID column, check for abandonment and respondent misconduct, and verify that values fall within allowed ranges. Abandonment can be detected by sorting the last questionnaire columns and deciding whether missingness is small enough to keep. Misconduct is flagged by straight-lining on Likert scales; a practical check is computing each respondent’s standard deviation across Likert items, with values below 0.25 treated as a strong warning sign. Missing data should be diagnosed in SPSS, and deletion (listwise/pairwise) is discouraged because it throws away too much data. Imputation is presented as the alternative, with SPSS methods including series mean imputation and linear interpolation.
Why is creating an ID column and numbering questionnaires emphasized before analysis?
How can abandonment be detected, and what decision rule is suggested for keeping or deleting responses?
What indicators suggest respondent misconduct on a 1–7 Likert scale, and how is it measured?
What standard deviation threshold is used as a rule of thumb, and how should it be interpreted?
Why does the lecture discourage listwise/pairwise deletion for missing data?
How do series mean imputation and linear interpolation differ in SPSS, and what tradeoffs are noted?
Review Questions
- What steps should be completed before testing construct relationships in SEM, and why do they matter for model validity?
- Compare the logic of abandonment detection versus respondent misconduct detection—what patterns are each looking for?
- In SPSS, where would you implement Transform → Replace Missing Values, and what changes when switching from series mean to linear interpolation?
Key Points
1. Create an ID column to reliably track and audit individual cases after sorting or filtering.
2. Detect abandonment by sorting the last questionnaire items and decide deletion based on how much of the survey is missing (a small missing tail may be acceptable; large gaps of ~40–50% are not).
3. Flag respondent misconduct using straight-lining checks, attention checks (including reverse-coded questions), and per-respondent standard deviation across Likert items.
4. Use a standard deviation below 0.25 as a rule-of-thumb trigger for strong scrutiny, while recognizing that thresholds should be tailored to the survey context.
5. Screen for impermissible values by checking minimum and maximum ranges in SPSS descriptive statistics.
6. Prefer imputation over listwise/pairwise deletion when missingness isn’t excessive, because deletion discards too much data.
7. Implement imputation in SPSS via Transform → Replace Missing Values, choosing between series mean (easy but variance-reducing) and linear interpolation (gap-based estimation assuming linearity).