Data Screening, Cleaning and How to Replace Missing Values in SPSS

TL;DR

Create an ID column early so each case can be traced after sorting and filtering.

Briefing

Before any model testing—such as checking whether one construct influences another—data must be screened for basic quality problems: entry errors, outliers, and signs that respondents didn’t answer thoughtfully. The workflow starts with setting up an ID column so each case can be traced back after sorting or filtering. From there, the first practical check is respondent abandonment: if someone stops answering near the end of a questionnaire, the record may be deleted only when the missing portion is large (for example, 40–50% of items), while small gaps (like missing one or two final questions) can be handled by keeping the rest of the responses.

Next comes respondent misconduct, especially in Likert-style items where answers should vary. A common red flag is marking the same option across nearly all questions on a 1–7 scale, which suggests the respondent may not be reading the items. Attention checks, such as reverse-coded questions or items that require selecting a specific number, can help detect this behavior. A quantitative alternative is to compute the standard deviation of each respondent’s Likert responses and flag cases with extremely low variation. The rule of thumb used here is that a standard deviation below 0.25 indicates too little variation and makes the case a candidate for deletion, though the threshold should ultimately be judged against survey size and what level of agreement or disagreement is plausible.
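The standard-deviation check can be sketched outside SPSS as follows. This is a minimal illustration, not the transcript's procedure: the respondent data, the name `flag_low_variation`, and the use of the sample standard deviation are all assumptions made for the example.

```python
from statistics import stdev

# Hypothetical data: respondent ID -> answers to twenty 1-7 Likert items.
responses = {
    1: [4, 5, 3, 6, 4, 5, 2, 6, 3, 5] * 2,  # varied answers
    2: [4] * 20,                            # pure straight-lining
    3: [5] * 19 + [4],                      # one answer differs
}

SD_THRESHOLD = 0.25  # rule of thumb from the text; adjust to the survey


def flag_low_variation(data, threshold=SD_THRESHOLD):
    """Return the IDs whose sample standard deviation is below the threshold."""
    return [case_id for case_id, answers in data.items()
            if stdev(answers) < threshold]


flagged_ids = flag_low_variation(responses)
print(flagged_ids)  # [2, 3]
```

Respondent 3 is flagged only because the questionnaire is long: with twenty items, a single differing answer gives a standard deviation of about 0.22, whereas the same pattern over seven items would give about 0.38 and pass. This is one reason the threshold has to be judged against the size of the survey.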

After handling misconduct and abandonment, the screening process shifts to impermissible values and missing data. Impermissible values are those outside an expected range, often caused by keying errors. In SPSS, this is checked through descriptive statistics by requesting minimum and maximum values for selected variables; if the observed extremes match expected bounds, the dataset passes this particular test.

Missing data is assessed using SPSS frequencies and descriptive statistics to confirm whether blanks or system-missing values exist. When missingness is present, the transcript contrasts two broad strategies: deletion versus imputation. Listwise or pairwise deletion is discouraged because it discards substantial data: listwise deletion drops an entire survey response if even one item is missing, and pairwise deletion still drops the case from every calculation that involves the missing item. Instead, imputation is presented as preferable when missingness is not excessive; the prior research cited suggests that imputation can recover parameter estimates even when 20–30% of values are missing.

Two imputation methods are emphasized. The most common is series mean imputation, which replaces each missing value with the mean of that indicator. It’s easy to implement but can reduce variance and ignore individual differences. The second method is linear interpolation, which estimates missing values by using the last valid value before the gap and the next value after it, assuming a linear progression. In SPSS, both methods are implemented via Transform → Replace Missing Values, selecting the variables with missing entries and choosing either the default series mean method or linear interpolation (using the Change option to switch methods). The overall message is that careful screening and thoughtful imputation choices protect the validity of later analyses.
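To make the two methods concrete, here is a small sketch of what each one computes for a single column. It illustrates the arithmetic only and is not SPSS's implementation; the function names and the sample column are hypothetical, and `None` stands in for a missing value.

```python
def series_mean_impute(values):
    """Replace each None with the mean of the observed values in the column."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def linear_interpolate(values):
    """Fill each gap on a straight line between its nearest valid neighbours.

    A gap at the very start or end has a valid value on only one side,
    so this sketch leaves it as None.
    """
    result = list(values)
    n = len(result)
    i = 0
    while i < n:
        if result[i] is not None:
            i += 1
            continue
        start = i - 1                  # last valid value before the gap
        end = i
        while end < n and result[end] is None:
            end += 1                   # first valid value after the gap
        if start >= 0 and end < n:
            step = (result[end] - result[start]) / (end - start)
            for k in range(i, end):
                result[k] = result[start] + step * (k - start)
        i = end
    return result


column = [2, None, 6, None, None, 3, 5]
print(series_mean_impute(column))  # [2, 4.0, 6, 4.0, 4.0, 3, 5]
print(linear_interpolate(column))  # [2, 4.0, 6, 5.0, 4.0, 3, 5]
```

The series mean gives every gap the same value (4.0), while interpolation gives values that depend on the neighbouring rows. The interpolated result therefore changes if the cases are re-sorted, which is worth keeping in mind for survey data where row order carries no meaning.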

Cornell Notes

The transcript lays out a practical pre-analysis checklist for survey data in SPSS: create an ID column, check for respondent abandonment, detect respondent misconduct, verify impermissible values, and then diagnose missing data. Abandonment is handled by deciding whether to delete a case based on how much of the questionnaire is unanswered (small gaps may be kept; large gaps like 40–50% may be deleted). Misconduct is flagged when Likert responses show almost no variation; standard deviation below 0.25 is given as a rule of thumb, supported by attention checks and reverse-coded items. For missing data, deletion (listwise/pairwise) is discouraged because it throws away too much information; imputation is recommended, especially when missingness is moderate (cited 20–30%). SPSS imputation can use series mean or linear interpolation via Transform → Replace Missing Values.

Why start with an ID column, and how does it help during screening?

An increasing ID column (from 1 through the last row) makes it easier to locate and verify specific cases after sorting by other variables. It also supports back-checking: if errors or suspicious patterns appear, the ID helps trace the record back to the original paper questionnaire for correction.
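The idea is independent of SPSS and can be shown in a few lines. The records and field names below are made up for illustration.

```python
# Hypothetical records in original entry order (one dict per questionnaire).
records = [
    {"age": 34, "q1": 5},
    {"age": 21, "q1": 7},
    {"age": 45, "q1": 2},
]

# Assign an increasing ID from 1 through the last row, before any sorting.
for case_id, record in enumerate(records, start=1):
    record["id"] = case_id

# Sorting by another variable scrambles the row order...
by_age = sorted(records, key=lambda r: r["age"])
print([r["id"] for r in by_age])    # [2, 1, 3]

# ...but the ID still identifies each case and restores the entry order.
restored = sorted(by_age, key=lambda r: r["id"])
print([r["id"] for r in restored])  # [1, 2, 3]
```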

How can respondent abandonment be detected, and when should a case be deleted?

Abandonment is checked by sorting the cases on the last few columns in ascending order, which groups blank entries together and shows whether a respondent stopped answering near the end. If the respondent missed only one or two final questions, the record can often be retained. If the respondent left a large portion unanswered, such as 40–50% of the questionnaire, the case becomes a candidate for deletion.
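The same decision can be expressed as a rule on the share of items left blank at the end of the questionnaire. The sketch below is one possible formalisation, not the transcript's procedure: the function name, the example cases, and the choice of 40% as the cut-off are assumptions.

```python
def trailing_missing_fraction(answers):
    """Fraction of items left blank in an unbroken run at the end."""
    trailing = 0
    for value in reversed(answers):
        if value is not None:
            break
        trailing += 1
    return trailing / len(answers)


DELETE_THRESHOLD = 0.40  # lower end of the 40-50% range mentioned in the text

# Hypothetical cases: ID -> answers in questionnaire order, None = blank.
cases = {
    101: [5, 4, 6, 3, 5, 4, 6, 5, 4, None],              # one final item blank
    102: [5, 4, 6, 3, 5, None, None, None, None, None],  # stopped halfway
}

decisions = {}
for case_id, answers in cases.items():
    abandoned = trailing_missing_fraction(answers) >= DELETE_THRESHOLD
    decisions[case_id] = "review for deletion" if abandoned else "keep"

print(decisions)  # {101: 'keep', 102: 'review for deletion'}
```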

What are the main signs of respondent misconduct in Likert-scale data, and how is it measured?

A key sign is straight-lining: marking the same option across many or all items on a 1–7 scale, which is unlikely if the respondent read and considered each question. Attention checks (including reverse-coded questions) can help detect inattentive responding. Quantitatively, the transcript recommends computing the standard deviation of each respondent’s Likert responses; values below 0.25 suggest too little variation and should prompt the analyst to consider deleting the case.

How are impermissible values screened in SPSS?

Impermissible values are detected by checking minimum and maximum values for each variable. In SPSS, this is done through Analyze → Descriptive Statistics → Descriptives, selecting variables and enabling minimum and maximum in Options. If the observed extremes fall within the expected bounds (e.g., a minimum of 1 and a maximum of 7 on a 1–7 scale), there are no obvious keying errors.
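The logic of the minimum/maximum check can be sketched as follows. The column names, the data, and the helper `out_of_range` are hypothetical, and the bounds assume a 1–7 scale.

```python
LIKERT_MIN, LIKERT_MAX = 1, 7  # expected bounds for a 1-7 scale

# Hypothetical item columns; None marks a missing answer.
columns = {
    "q1": [3, 5, 7, 1, 4],
    "q2": [2, 6, 77, 4, None],  # 77 is probably a keying error for 7
}


def out_of_range(data, low, high):
    """Return {variable: (min, max)} for variables whose extremes break the bounds."""
    report = {}
    for name, values in data.items():
        observed = [v for v in values if v is not None]
        smallest, largest = min(observed), max(observed)
        if smallest < low or largest > high:
            report[name] = (smallest, largest)
    return report


problems = out_of_range(columns, LIKERT_MIN, LIKERT_MAX)
print(problems)  # {'q2': (2, 77)}
```

An empty report means the variables pass this particular test; it does not rule out keying errors that happen to land inside the valid range.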

What’s the tradeoff between deletion and imputation for missing data?

Listwise or pairwise deletion discards data: under listwise deletion a single missing item causes the entire response to be dropped from analysis, and under pairwise deletion the case is still dropped from every calculation that involves the missing item. The transcript discourages this approach and instead recommends imputation when missingness is not excessive. It cites research suggesting imputation can remedy up to 20–30% missing data while still supporting good parameter estimates.
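A small count shows how quickly deletion eats into the sample. The five rows below are invented for illustration; three of the fifteen cells (20%) are missing.

```python
from itertools import combinations

# Hypothetical responses to three items; None marks a missing answer.
rows = [
    {"q1": 5, "q2": 4, "q3": 6},
    {"q1": 3, "q2": None, "q3": 2},
    {"q1": None, "q2": 6, "q3": 5},
    {"q1": 4, "q2": 5, "q3": None},
    {"q1": 6, "q2": 7, "q3": 7},
]
variables = ["q1", "q2", "q3"]

# Listwise: a case counts only if every variable is present.
listwise_n = sum(all(r[v] is not None for v in variables) for r in rows)

# Pairwise: for each pair of variables, a case counts if both are present.
pairwise_n = {
    (a, b): sum(r[a] is not None and r[b] is not None for r in rows)
    for a, b in combinations(variables, 2)
}

print(listwise_n)  # 2
print(pairwise_n)  # {('q1', 'q2'): 3, ('q1', 'q3'): 3, ('q2', 'q3'): 3}
```

With 20% of cells missing, listwise deletion keeps only two of five cases, and pairwise deletion keeps three per pair but bases each statistic on a different subset of respondents.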

How do series mean imputation and linear interpolation differ in SPSS?

Series mean imputation replaces each missing value with the mean of that indicator, using Transform → Replace Missing Values with the default method (series mean). The drawback is reduced variance and less sensitivity to individual differences. Linear interpolation estimates missing values between the last valid value before the gap and the next valid value after it, assuming a linear pattern; in SPSS the method is switched to Linear Interpolation using the Change option before clicking OK.
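The variance drawback of series mean imputation is easy to verify numerically. The numbers below are made up; the point is only that filling gaps with the mean leaves the mean unchanged while pulling the standard deviation down.

```python
from statistics import mean, stdev

observed = [2, 4, 6, 8]           # valid answers to one item
column = observed + [None, None]  # two respondents skipped the item

fill = mean(observed)             # the series mean of the observed answers
imputed = [fill if v is None else v for v in column]

sd_before = stdev(observed)       # spread of the observed answers only
sd_after = stdev(imputed)         # spread after filling gaps with the mean

print(round(sd_before, 3), round(sd_after, 3))  # 2.582 2.0
```

Every imputed value sits exactly on the mean, so it adds nothing to the sum of squared deviations while still increasing the case count, and the estimated spread shrinks accordingly.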

Review Questions

What screening steps should happen before testing relationships between constructs, and why does each step matter?
Under what conditions would a respondent’s incomplete survey be kept versus deleted?
How would you implement and choose between series mean imputation and linear interpolation in SPSS?

Key Points

1. Create an ID column early so each case can be traced after sorting and filtering.
2. Screen for respondent abandonment by checking whether unanswered items cluster at the end of the questionnaire; delete only when missingness is substantial.
3. Detect respondent misconduct by looking for straight-lining on Likert items and by using attention checks such as reverse-coded questions.
4. Use respondent-level standard deviation of Likert responses as a quantitative misconduct flag; treat values below 0.25 as a strong warning signal while still applying judgment.
5. Check for impermissible values by reviewing minimum and maximum values for each variable in SPSS descriptive statistics.
6. Assess missing data with SPSS frequencies/descriptives before deciding on a remedy.
7. Prefer imputation over listwise/pairwise deletion when missingness is moderate; implement series mean or linear interpolation via Transform → Replace Missing Values.

Highlights

Abandonment can be handled selectively: missing one or two final items may be acceptable, but leaving 40–50% of the questionnaire unanswered is a stronger case for deletion.

Straight-lining on a 1–7 Likert scale is treated as a misconduct red flag, and standard deviation below 0.25 is offered as a practical threshold for review.

Impermissible values are efficiently checked by requesting minimum and maximum values in SPSS descriptive statistics.

Imputation is positioned as a better default than deletion because it preserves data and can still yield reliable parameter estimates even with 20–30% missingness.

In SPSS, both series mean and linear interpolation imputation are implemented through Transform → Replace Missing Values, with linear interpolation requiring switching the method via Change.

Topics

Data Screening
Respondent Misconduct
Missing Data Imputation
SPSS Data Cleaning
Likert Scale Quality Checks

Mentioned

SPSS