CB-SEM using #SmartPLS4 - 4 - Data Screening and Imputation - New Insights
Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Before any CB-SEM or SmartPLS4 analysis, data quality work determines whether the model’s results are trustworthy—especially around respondent misconduct, impermissible values, and missing data. The workflow starts with a basic but strict screen: check for data entry errors, outliers, and signs of respondents not taking the survey seriously. A practical first step is adding an ID column (an increasing number per row) so any problematic case can be traced back later. If questionnaires were filled on paper, each hard copy should be numbered so incorrect entries can be verified against the original form.
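The ID step above can be sketched in a few lines of Python; the rows and the number of items are invented for illustration, and the same idea applies whether the data live in Excel, SPSS, or a CSV file:

```python
# Minimal sketch: prepend an increasing ID to each response row so that
# any case flagged later can be traced back to its source (sample data
# is hypothetical).
rows = [
    [5, 4, 6, 3],   # respondent answers across four Likert items
    [2, 2, 1, 3],
    [7, 6, 7, 5],
]

# IDs start at 1 so they can match the numbers written on paper questionnaires.
rows_with_id = [[i + 1] + row for i, row in enumerate(rows)]
print(rows_with_id[0])  # first case now carries ID 1
```

Doing this before any sorting or cleaning matters: once rows are re-ordered during screening, the ID is the only reliable link back to the original form.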
Next comes a targeted check for incomplete responses. Sorting the last columns in ascending order helps reveal whether someone abandoned the survey partway through. If a respondent missed only the final one or two items, keeping the case may be reasonable; if they skipped a large share—such as 40–50%—deleting that response becomes more defensible. The screening then shifts to respondent misconduct on Likert-type items (e.g., 1–7). A key red flag is near-identical answers across questions, which can indicate the respondent didn’t read items. To catch this, the transcript recommends adding attention checks (including reverse-coded items) and also computing each respondent’s standard deviation across their Likert responses. A rule of thumb is that standard deviation below 0.25 suggests too little variation and warrants deletion, though the threshold should be judged in context because survey size and design can affect what counts as “acceptable” variance.
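The per-respondent variation check can be sketched as follows. The 0.25 cutoff is the rule of thumb mentioned above; the sample responses and the use of Python's `statistics.stdev` (sample standard deviation) are assumptions for illustration:

```python
import statistics

# Sketch of the misconduct screen: compute each respondent's standard
# deviation across their Likert answers (1-7 scale) and flag rows whose
# variation falls below the 0.25 rule of thumb.
responses = {
    1: [4, 4, 4, 4, 4, 4],   # straight-lining: SD = 0 -> flag
    2: [5, 3, 6, 2, 7, 4],   # normal variation -> keep
}

SD_THRESHOLD = 0.25  # judgment call; adjust for survey size and design

flagged = {
    rid: statistics.stdev(answers)
    for rid, answers in responses.items()
    if statistics.stdev(answers) < SD_THRESHOLD
}
print(flagged)  # respondent IDs whose variation is suspiciously low
```

A flagged case should still be inspected manually (and cross-checked against any attention-check items) before deletion, since the threshold is context-dependent.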
After the misconduct and completeness checks, the focus turns to impermissible values and missingness. Impermissible values are screened by checking each indicator’s minimum and maximum in SPSS (Analyze → Descriptive Statistics → Descriptives) and confirming that the observed extremes fall within what the scale allows. Missing data are then assessed using SPSS frequencies or descriptive statistics (Analyze → Descriptive Statistics → Frequencies), checking whether any variables contain missing entries.
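Outside SPSS, the same range and missingness screens can be sketched in plain Python. The column names, the 1–7 scale bounds, and the use of `None` to mark missing entries are assumptions for illustration:

```python
# Sketch of the impermissible-value and missingness screens: verify that each
# indicator's observed min/max stays within the allowed 1-7 Likert range, and
# count missing entries (None) per indicator.
data = {
    "q1": [1, 4, 7, None, 3],
    "q2": [2, 9, 5, 6, 1],    # 9 is impermissible on a 1-7 scale
}

SCALE_MIN, SCALE_MAX = 1, 7

report = {}
for name, values in data.items():
    valid = [v for v in values if v is not None]
    report[name] = {
        "min": min(valid),
        "max": max(valid),
        "missing": len(values) - len(valid),
        "impermissible": [v for v in valid if not (SCALE_MIN <= v <= SCALE_MAX)],
    }
print(report)
```

An impermissible value (like the 9 above) usually signals a data entry error, which is where the ID column pays off: the case can be traced to its original form and corrected rather than deleted.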
When missing data appears, two broad strategies are contrasted: deletion (listwise or pairwise) versus imputation. Deletion is discouraged because it can discard large amounts of otherwise usable data; prior research cited in the transcript suggests imputation can correct up to roughly 20–30% of missing values while still supporting good parameter estimates. Imputation replaces missing entries with numeric guesses, and four methods are mentioned overall, with two emphasized as common in SPSS: series mean imputation and linear interpolation.
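The keep-versus-delete decision for incomplete cases can be sketched like this. The 40% cutoff is an assumption drawn from the 40–50% guidance above, and the sample cases are invented:

```python
# Sketch: compute each respondent's share of missing answers and mark cases
# above a cutoff for deletion; everything below it is a candidate for
# imputation instead.
cases = {
    1: [5, 4, None, 6, 3, 2, 4, 5, 6, 4],              # 10% missing
    2: [5, None, None, None, None, 3, None, 2, 4, 6],  # 50% missing
}

DELETE_CUTOFF = 0.40  # illustrative; the transcript suggests 40-50% as the zone

decisions = {}
for rid, answers in cases.items():
    share_missing = sum(a is None for a in answers) / len(answers)
    decisions[rid] = "delete" if share_missing > DELETE_CUTOFF else "impute"
print(decisions)
```

This mirrors the contrast in the text: wholesale listwise deletion would throw away case 1 entirely, even though 90% of its answers are usable.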
Series mean imputation fills missing values using the mean of the indicator (the default in SPSS’s “Transform → Replace Missing Values”). Its convenience comes with a cost: it reduces variance and can weaken individual differences. Linear interpolation instead estimates a missing value by using the last valid value before the gap and the next valid value after it, inserting a value between them under the assumption the data behave roughly linearly. In SPSS, switching the method to “linear interpolation” requires selecting the indicator(s) and using the Change option so the method actually updates. The overall takeaway is that careful screening and context-aware imputation choices protect the measurement model before moving into CB-SEM analysis in SmartPLS4.
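The two methods can be contrasted in plain Python. This is a sketch, not SPSS’s implementation: it handles a single interior gap only, and the series values are invented. It does make the variance trade-off visible, though:

```python
import statistics

# Sketch of the two imputation methods on one indicator's series, with None
# marking the gap. Assumes the gap is interior (valid values on both sides)
# and one entry wide.
series = [2.0, 4.0, None, 6.0, 3.0]

# Series mean: replace the gap with the mean of all valid values.
valid = [v for v in series if v is not None]
mean = statistics.mean(valid)                    # (2 + 4 + 6 + 3) / 4 = 3.75
by_mean = [mean if v is None else v for v in series]

# Linear interpolation: fill the gap from the last and next valid values.
by_interp = list(series)
for i, v in enumerate(by_interp):
    if v is None:
        prev_v = by_interp[i - 1]                # last valid value before gap
        next_v = next(x for x in by_interp[i + 1:] if x is not None)
        by_interp[i] = (prev_v + next_v) / 2     # point on the line between them

print(by_mean)    # [2.0, 4.0, 3.75, 6.0, 3.0]
print(by_interp)  # [2.0, 4.0, 5.0, 6.0, 3.0]
```

Series mean pulls the filled value toward the center of the distribution (3.75), shrinking variance; interpolation respects the local trend between neighbors (5.0), which is only sensible when the data plausibly move linearly between those points.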
Cornell Notes
CB-SEM work in SmartPLS4 depends on cleaning the dataset first: verify IDs, detect incomplete surveys, flag respondent misconduct, and confirm scale-consistent values. The transcript recommends checking respondent misconduct by looking for low variation across Likert items and using attention checks; a standard deviation below 0.25 is treated as a strong warning sign (with room for judgment). Impermissible values are screened in SPSS by checking the minimum and maximum for each indicator. Missing data should be handled with imputation rather than deletion when missingness is not excessive, since listwise/pairwise deletion can discard too much information. Two emphasized imputation methods are series mean (the SPSS default, but it reduces variance) and linear interpolation (which uses the last and next valid values to estimate a gap).
What’s the first practical step for making later data checks and corrections possible?
How can survey abandonment be detected quickly in a spreadsheet-style workflow?
What signals respondent misconduct on Likert scales, and how is it checked?
How are impermissible values screened in SPSS?
Why prefer imputation over listwise/pairwise deletion when missing data exists?
What are the trade-offs between series mean imputation and linear interpolation?
Review Questions
- What steps in the screening process help distinguish data entry errors from respondent misconduct and from missingness?
- When would deletion of a respondent’s response be more defensible than imputation, based on the transcript’s guidance?
- How do series mean imputation and linear interpolation differ in their assumptions and their impact on variance?
Key Points
1. Add an ID column before analysis so any suspicious or incorrect case can be traced and corrected later.
2. Check for abandonment by sorting the last questionnaire items; decide case by case whether missingness is small (keep) or large (delete).
3. Detect respondent misconduct by combining attention checks with a per-respondent standard deviation across Likert items; values below 0.25 are a strong warning sign.
4. Screen for impermissible values in SPSS by verifying each indicator’s minimum and maximum against what the scale allows.
5. Assess missing data using SPSS frequencies/descriptives before choosing a handling method.
6. Prefer imputation over listwise/pairwise deletion when missingness is not excessive, since deletion can discard too much data.
7. Use series mean imputation for simplicity but expect reduced variance; use linear interpolation when the pattern between neighboring valid values is plausibly linear.