CB-SEM using #SmartPLS4 - 4 - Data Screening and Imputation - New Insights
Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Before any CB-SEM or SmartPLS4 analysis, data quality work determines whether the model’s results are trustworthy—especially around respondent misconduct, impermissible values, and missing data. The workflow starts with a basic but strict screen: check for data entry errors, outliers, and signs of respondents not taking the survey seriously. A practical first step is adding an ID column (an increasing number per row) so any problematic case can be traced back later. If questionnaires were filled on paper, each hard copy should be numbered so incorrect entries can be verified against the original form.
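The ID step above can be sketched in a few lines of Python; the rows and the number of items are invented for illustration, and the same idea applies whether the data live in Excel, SPSS, or a CSV file:

```python
# Minimal sketch: prepend an increasing ID to each response row so that
# any case flagged later can be traced back to its source (sample data
# is hypothetical).
rows = [
    [5, 4, 6, 3],   # respondent answers across four Likert items
    [2, 2, 1, 3],
    [7, 6, 7, 5],
]

# IDs start at 1 so they can match the numbers written on paper questionnaires.
rows_with_id = [[i + 1] + row for i, row in enumerate(rows)]
print(rows_with_id[0])  # first case now carries ID 1
```

Doing this before any sorting or cleaning matters: once rows are re-ordered during screening, the ID is the only reliable link back to the original form.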
Next comes a targeted check for incomplete responses. Sorting the last columns in ascending order helps reveal whether someone abandoned the survey partway through. If a respondent missed only the final one or two items, keeping the case may be reasonable; if they skipped a large share—such as 40–50%—deleting that response becomes more defensible. The screening then shifts to respondent misconduct on Likert-type items (e.g., 1–7). A key red flag is near-identical answers across questions, which can indicate the respondent didn’t read items. To catch this, the transcript recommends adding attention checks (including reverse-coded items) and also computing each respondent’s standard deviation across their Likert responses. A rule of thumb is that standard deviation below 0.25 suggests too little variation and warrants deletion, though the threshold should be judged in context because survey size and design can affect what counts as “acceptable” variance.
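The per-respondent variation check can be sketched as follows. The 0.25 cutoff is the rule of thumb mentioned above; the sample responses and the use of Python's `statistics.stdev` (sample standard deviation) are assumptions for illustration:

```python
import statistics

# Sketch of the misconduct screen: compute each respondent's standard
# deviation across their Likert answers (1-7 scale) and flag rows whose
# variation falls below the 0.25 rule of thumb.
responses = {
    1: [4, 4, 4, 4, 4, 4],   # straight-lining: SD = 0 -> flag
    2: [5, 3, 6, 2, 7, 4],   # normal variation -> keep
}

SD_THRESHOLD = 0.25  # judgment call; adjust for survey size and design

flagged = {
    rid: statistics.stdev(answers)
    for rid, answers in responses.items()
    if statistics.stdev(answers) < SD_THRESHOLD
}
print(flagged)  # respondent IDs whose variation is suspiciously low
```

A flagged case should still be inspected manually (and cross-checked against any attention-check items) before deletion, since the threshold is context-dependent.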
After the misconduct and completeness checks, the focus turns to impermissible values and missingness. Impermissible values are screened by checking each indicator’s minimum and maximum in SPSS (Analyze → Descriptive Statistics → Descriptives) and confirming that the observed extremes fall within what the scale allows. Missing data are then assessed using SPSS frequencies or descriptive statistics (Analyze → Descriptive Statistics → Frequencies), checking whether any variables contain missing entries.
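Outside SPSS, the same range and missingness screens can be sketched in plain Python. The column names, the 1–7 scale bounds, and the use of `None` to mark missing entries are assumptions for illustration:

```python
# Sketch of the impermissible-value and missingness screens: verify that each
# indicator's observed min/max stays within the allowed 1-7 Likert range, and
# count missing entries (None) per indicator.
data = {
    "q1": [1, 4, 7, None, 3],
    "q2": [2, 9, 5, 6, 1],    # 9 is impermissible on a 1-7 scale
}

SCALE_MIN, SCALE_MAX = 1, 7

report = {}
for name, values in data.items():
    valid = [v for v in values if v is not None]
    report[name] = {
        "min": min(valid),
        "max": max(valid),
        "missing": len(values) - len(valid),
        "impermissible": [v for v in valid if not (SCALE_MIN <= v <= SCALE_MAX)],
    }
print(report)
```

An impermissible value (like the 9 above) usually signals a data entry error, which is where the ID column pays off: the case can be traced to its original form and corrected rather than deleted.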
When missing data appears, two broad strategies are contrasted: deletion (listwise or pairwise) versus imputation. Deletion is discouraged because it can discard large amounts of otherwise usable data; prior research cited in the transcript suggests imputation can correct up to roughly 20–30% of missing values while still supporting good parameter estimates. Imputation replaces missing entries with numeric guesses, and four methods are mentioned overall, with two emphasized as common in SPSS: series mean imputation and linear interpolation.
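The keep-versus-delete decision for incomplete cases can be sketched like this. The 40% cutoff is an assumption drawn from the 40–50% guidance above, and the sample cases are invented:

```python
# Sketch: compute each respondent's share of missing answers and mark cases
# above a cutoff for deletion; everything below it is a candidate for
# imputation instead.
cases = {
    1: [5, 4, None, 6, 3, 2, 4, 5, 6, 4],              # 10% missing
    2: [5, None, None, None, None, 3, None, 2, 4, 6],  # 50% missing
}

DELETE_CUTOFF = 0.40  # illustrative; the transcript suggests 40-50% as the zone

decisions = {}
for rid, answers in cases.items():
    share_missing = sum(a is None for a in answers) / len(answers)
    decisions[rid] = "delete" if share_missing > DELETE_CUTOFF else "impute"
print(decisions)
```

This mirrors the contrast in the text: wholesale listwise deletion would throw away case 1 entirely, even though 90% of its answers are usable.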
Series mean imputation fills missing values using the mean of the indicator (the default in SPSS’s “Transform → Replace Missing Values”). Its convenience comes with a cost: it reduces variance and can weaken individual differences. Linear interpolation instead estimates a missing value by using the last valid value before the gap and the next valid value after it, inserting a value between them under the assumption the data behave roughly linearly. In SPSS, switching the method to “linear interpolation” requires selecting the indicator(s) and using the Change option so the method actually updates. The overall takeaway is that careful screening and context-aware imputation choices protect the measurement model before moving into CB-SEM analysis in SmartPLS4.
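The two methods can be contrasted in plain Python. This is a sketch, not SPSS’s implementation: it handles a single interior gap only, and the series values are invented. It does make the variance trade-off visible, though:

```python
import statistics

# Sketch of the two imputation methods on one indicator's series, with None
# marking the gap. Assumes the gap is interior (valid values on both sides)
# and one entry wide.
series = [2.0, 4.0, None, 6.0, 3.0]

# Series mean: replace the gap with the mean of all valid values.
valid = [v for v in series if v is not None]
mean = statistics.mean(valid)                    # (2 + 4 + 6 + 3) / 4 = 3.75
by_mean = [mean if v is None else v for v in series]

# Linear interpolation: fill the gap from the last and next valid values.
by_interp = list(series)
for i, v in enumerate(by_interp):
    if v is None:
        prev_v = by_interp[i - 1]                # last valid value before gap
        next_v = next(x for x in by_interp[i + 1:] if x is not None)
        by_interp[i] = (prev_v + next_v) / 2     # point on the line between them

print(by_mean)    # [2.0, 4.0, 3.75, 6.0, 3.0]
print(by_interp)  # [2.0, 4.0, 5.0, 6.0, 3.0]
```

Series mean pulls the filled value toward the center of the distribution (3.75), shrinking variance; interpolation respects the local trend between neighbors (5.0), which is only sensible when the data plausibly move linearly between those points.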
Cornell Notes
CB-SEM work in SmartPLS4 depends on cleaning the dataset first: verify IDs, detect incomplete surveys, flag respondent misconduct, and confirm scale-consistent values. The transcript recommends checking respondent misconduct by looking for low variation across Likert items and using attention checks; a standard deviation below 0.25 is treated as a strong warning sign (with room for judgment). Impermissible values are screened in SPSS by checking the minimum and maximum for each indicator. Missing data should be handled with imputation rather than deletion when missingness is not excessive, since listwise/pairwise deletion can discard too much information. Two emphasized imputation methods are series mean (the SPSS default, but it reduces variance) and linear interpolation (which uses the last and next valid values to estimate a gap).
What’s the first practical step for making later data checks and corrections possible?
How can survey abandonment be detected quickly in a spreadsheet-style workflow?
What signals respondent misconduct on Likert scales, and how is it checked?
How are impermissible values screened in SPSS?
Why prefer imputation over listwise/pairwise deletion when missing data exists?
What are the trade-offs between series mean imputation and linear interpolation?
Review Questions
- What steps in the screening process help distinguish data entry errors from respondent misconduct and from missingness?
- When would deletion of a respondent’s response be more defensible than imputation, based on the transcript’s guidance?
- How do series mean imputation and linear interpolation differ in their assumptions and their impact on variance?
Key Points
1. Add an ID column before analysis so any suspicious or incorrect case can be traced and corrected later.
2. Check for abandonment by sorting the last questionnaire items; decide case by case whether missingness is small (keep) or large (delete).
3. Detect respondent misconduct by combining attention checks with a per-respondent standard deviation across Likert items; values below 0.25 are a strong warning sign.
4. Screen for impermissible values in SPSS by verifying each indicator’s minimum and maximum against what the scale allows.
5. Assess missing data using SPSS frequencies/descriptives before choosing a handling method.
6. Prefer imputation over listwise/pairwise deletion when missingness is not excessive, since deletion can discard too much data.
7. Use series mean imputation for simplicity but expect reduced variance; use linear interpolation when the pattern between neighboring valid values is plausibly linear.