Statistics for Research - L11 - What are Outliers and How to Solve the Issue using SPSS?

TL;DR

Outliers are unusually high or low observations that can shift results, especially by inflating or deflating the mean.

Briefing

Outliers—unusually high or low data points—can quietly distort research results by pulling summary statistics like the mean away from what most participants actually report. A common example given is a survey about daily coffee intake: if most respondents fall between 0 and 4 cups, but one person reports 17 cups, that extreme value is far from the typical pattern and can inflate the sample mean even though it doesn’t represent the average coffee drinker. The practical takeaway is that outliers aren’t just a statistical curiosity; they can change conclusions if they’re handled incorrectly.
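The effect on the mean can be checked with a few lines of Python. This is only an illustrative sketch: the individual responses below are hypothetical, and only the 0 to 4 range and the single 17-cup response come from the example above.

```python
from statistics import mean, median

# Hypothetical daily coffee intake (cups); values are made up for illustration.
typical = [0, 1, 1, 2, 2, 2, 3, 3, 4]
with_outlier = typical + [17]  # add the one extreme respondent

mean_typical = mean(typical)
mean_with_outlier = mean(with_outlier)
median_typical = median(typical)
median_with_outlier = median(with_outlier)

print("mean without / with outlier:  ", mean_typical, mean_with_outlier)
print("median without / with outlier:", median_typical, median_with_outlier)
```

In this made-up sample the mean moves from 2 to 3.5 cups while the median stays at 2, which is why the mean is the statistic singled out as vulnerable.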

The session lays out several ways to identify outliers in a dataset, then connects those findings to decision rules and remedies. On the identification side, it recommends starting with descriptive checks such as minimum and maximum values, then using visual tools like histograms and box plots to spot points that sit far outside the expected range. For a more standardized approach, the transcript describes converting responses into Z scores. Z scores standardize each value relative to the sample (mean 0, standard deviation 1), making it easier to compare across cases. A Z score beyond about ±3.3 is treated as a strong indicator of an outlier.
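The Z-score rule can be sketched in Python as follows. The sample is hypothetical (29 responses between 0 and 4 plus one response of 17), the 3.3 cutoff is the one quoted above, and the sample standard deviation (n - 1 denominator) is assumed.

```python
from statistics import mean, stdev

# Hypothetical sample: 29 responses in the 0-4 range plus one 17-cup response.
cups = [0] * 3 + [1] * 6 + [2] * 10 + [3] * 6 + [4] * 4 + [17]
THRESHOLD = 3.3  # rule of thumb quoted in the session

sample_mean = mean(cups)
sample_sd = stdev(cups)  # sample standard deviation (n - 1 denominator)

# Standardize every value: subtract the mean, divide by the standard deviation.
z_scores = [(x - sample_mean) / sample_sd for x in cups]

# Sort cases by absolute Z score, largest first, and keep those past the cutoff.
ranked = sorted(enumerate(z_scores, start=1),
                key=lambda pair: abs(pair[1]), reverse=True)
flagged = [(case, cups[case - 1], round(z, 2))
           for case, z in ranked if abs(z) > THRESHOLD]

for case, value, z in flagged:
    print(f"case {case}: value={value}, z={z}")
```

Case numbers start at 1 to mirror row numbers in a data editor. Note that in very small samples no value can reach a Z score of 3.3 at all, so the rule is only informative once the sample is reasonably large.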

Once outliers are found, the next step is deciding what to do with them—an issue framed as context-dependent rather than automatic. If an extreme value is clearly spurious (for example, inconsistent with the questionnaire context or without a plausible justification), removing it can be appropriate. If the value seems legitimate but is only slightly beyond the Z-score threshold, keeping it may be safer. Several handling options are presented: ignore outliers when they don’t materially affect results; trim them by removing extreme cases (especially when they stem from data entry or measurement errors); transform the data (such as applying a log transformation) to reduce outlier impact; or replace outliers using percentile-based rules—specifically substituting with the 5th or 95th percentile value.
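Two of these remedies, percentile replacement and the log transformation, can be illustrated with a short Python sketch. Several things here are assumptions rather than the session's procedure: the data are hypothetical, percentile replacement is implemented as clipping every value to the 5th to 95th percentile range, and `log1p` (the log of 1 + x) stands in for a plain log because the sample contains zeros.

```python
import math
from statistics import mean, quantiles, stdev

# Hypothetical sample: 29 responses in the 0-4 range plus one 17-cup response.
cups = [0] * 3 + [1] * 6 + [2] * 10 + [3] * 6 + [4] * 4 + [17]

def z_of_max(values):
    """Z score of the largest value, using the sample standard deviation."""
    return (max(values) - mean(values)) / stdev(values)

# Option A: percentile replacement (clip to the 5th and 95th percentiles).
cut_points = quantiles(cups, n=20)       # 19 cut points: 5%, 10%, ..., 95%
p5, p95 = cut_points[0], cut_points[-1]
replaced = [min(max(x, p5), p95) for x in cups]

# Option B: log transformation; log1p(x) = log(1 + x) tolerates zeros.
logged = [math.log1p(x) for x in cups]

z_raw = z_of_max(cups)
z_logged = z_of_max(logged)

print("mean before / after replacement:", mean(cups), mean(replaced))
print("max Z before / after log transform:", round(z_raw, 2), round(z_logged, 2))
```

With this sample the 95th percentile works out to 9.85 rather than something near 4, because the outlier itself pulls the upper percentile upward when the sample is small. That is one reason the choice of remedy deserves the deliberate, context-driven judgment described above.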

The transcript then demonstrates how to carry out these steps in SPSS using a variable labeled “Vision one.” The first method uses Analyze → Descriptive Statistics → Frequencies, where minimum and maximum values reveal extremes (for instance, a maximum of 16 with only a couple of observations at the high end). Those specific cases can be located via Edit → Find, then traced back to the original dataset or questionnaire to verify whether the value is a genuine response or an entry mistake. A second method uses Analyze → Descriptive Statistics → Explore with outlier statistics and a box plot, where “extreme outliers” appear as marked points (stars) and can be navigated directly to the corresponding cases. Finally, the session shows how to save standardized values as a new variable and sort them to quickly spot cases with Z scores beyond ±3.3, after which the analyst can delete those cases or replace the values using the percentile approach or other replacement strategies such as interpolation. The overarching message: outlier handling must be deliberate, because removing, replacing, or ignoring outliers can shift results and lead to incorrect conclusions.
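The box-plot marking can be approximated outside SPSS with the usual interquartile-range fences. The sketch below assumes the common convention that points more than 1.5 box lengths beyond the box are outliers and points more than 3 box lengths beyond it are extremes (the starred points). The data are hypothetical, and the quartiles come from Python's `statistics.quantiles`, which may differ slightly from the hinges SPSS uses for its plots.

```python
from statistics import quantiles

# Hypothetical sample: 29 responses in the 0-4 range plus one 17-cup response.
cups = [0] * 3 + [1] * 6 + [2] * 10 + [3] * 6 + [4] * 4 + [17]

q1, _, q3 = quantiles(cups, n=4)  # first quartile, median, third quartile
iqr = q3 - q1                     # the "box length"

def classify(value):
    """Label a value relative to the box: 'extreme', 'outlier', or 'normal'."""
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "outlier"
    return "normal"

# Case numbers start at 1 to mirror row numbers in a data editor.
labelled = [(case, value, classify(value))
            for case, value in enumerate(cups, start=1)]
extreme = [(case, value) for case, value, label in labelled if label == "extreme"]
mild = [(case, value) for case, value, label in labelled if label == "outlier"]

print("extreme cases:", extreme)
print("mild outliers:", mild)
```

Keeping the case number next to each flagged value mirrors the point of the SPSS workflow: a flag is only useful if it can be traced back to a specific response.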

Cornell Notes

Outliers are unusually high or low observations that can distort research findings by shifting statistics such as the mean. The transcript emphasizes identifying outliers using descriptive statistics (minimum/maximum), visual diagnostics (histograms and box plots), and standardized scores (Z scores). Values with Z scores beyond about ±3.3 are treated as strong outlier candidates. After identification, the key decision is whether outliers are spurious (e.g., data entry errors) or plausible responses; that context determines whether to delete, ignore, trim, transform (e.g., log), or replace using percentile rules like the 5th/95th percentiles. SPSS workflows are demonstrated for tracing extreme cases and saving standardized values to locate outliers quickly.

Why can a single extreme response change the results of a study?

Because outliers can pull summary measures away from the typical pattern. In the coffee example, most responses fall between 0 and 4 cups, but one participant reports 17. That value is far from the rest and increases the sample mean even though it doesn’t represent the majority’s behavior.

What are three practical ways to identify outliers in SPSS from the transcript?

First, use Analyze → Descriptive Statistics → Frequencies to check minimum and maximum values, then use Edit → Find to locate the specific extreme values (e.g., cases with values like 14 or 16). Second, use Analyze → Descriptive Statistics → Explore to generate box plots; extreme outliers appear as marked points (stars) and can be traced to specific cases. Third, save standardized values (Z scores) as a new variable, sort them, and flag cases with Z scores beyond about ±3.3. In SPSS the "Save standardized values as variables" option sits in the Analyze → Descriptive Statistics → Descriptives dialog rather than in Explore.

What does a Z score do, and what threshold is used as an outlier rule of thumb?

A Z score standardizes each observation relative to the sample by subtracting the sample mean and dividing by the sample standard deviation, which puts all values on a common scale with mean 0 and standard deviation 1. The transcript uses a rule of thumb: Z scores greater than 3.3 (or less than −3.3) indicate an outlier.

How should analysts decide whether to delete an outlier?

Context matters. If an extreme value doesn’t make sense for the questionnaire or is extreme without plausible justification, deletion is recommended. If the value is plausible and only slightly beyond the Z-score threshold, keeping it may be safer to avoid distorting conclusions.

What replacement and transformation strategies are mentioned for handling outliers?

Several options are listed: ignore outliers if they don’t affect results much; trim by removing outliers (especially when they come from data entry/measurement errors); transform the data using a log transformation to reduce outlier impact; or replace outliers using percentile-based values, specifically substituting with the 5th or 95th percentile score. The transcript also notes other replacement approaches such as interpolation (e.g., linear interpolation) and “serial mean” (the option SPSS labels "Series mean" in its Replace Missing Values dialog) as alternatives.
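The two alternative replacement approaches can be sketched in Python under a simplifying assumption: the outlier has first been set to missing (`None`), and the gap is then filled either with the mean of the remaining values or by linear interpolation between the nearest valid neighbors. The function names and data are hypothetical, and this is not a reproduction of SPSS's own routines.

```python
from statistics import mean

def series_mean_fill(values):
    """Replace each missing value with the mean of all non-missing values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def linear_interpolate_fill(values):
    """Fill interior gaps on a straight line between the valid neighbors."""
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            start = i - 1                  # last valid index before the gap
            end = i
            while end < n and filled[end] is None:
                end += 1                   # first valid index after the gap
            if start >= 0 and end < n:     # gaps at either edge are left as-is
                left, right = filled[start], filled[end]
                for k in range(i, end):
                    filled[k] = left + (right - left) * (k - start) / (end - start)
            i = end
        else:
            i += 1
    return filled

# Hypothetical responses; the third case was an outlier and is now missing.
responses = [2, 3, None, 1, 4]
by_mean = series_mean_fill(responses)
by_interpolation = linear_interpolate_fill(responses)
print(by_mean)
print(by_interpolation)
```

Interpolation depends on the order of the cases, so it suits ordered series better than survey rows whose order carries no meaning.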

How does the transcript show tracing an extreme value back to the original response?

After identifying extremes via min/max, the workflow uses Edit → Find to locate the specific value in the dataset (e.g., the observation with value 14 or 16). The case number is then used to check the original data source—either the collected dataset or the hard-form questionnaire—to confirm whether the value is a genuine response or a data entry mistake.
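The same check-then-locate idea can be expressed as a short Python sketch. The column values below are hypothetical apart from the 14 and 16 mentioned above, and `find_cases` is an illustrative helper standing in for the Edit → Find step.

```python
# Hypothetical "Vision one" column; only the values 14 and 16 come from the notes.
vision_one = [3, 4, 5, 14, 2, 4, 16, 5, 3, 4]

# Step 1: the minimum and maximum reveal whether anything looks extreme.
lowest, highest = min(vision_one), max(vision_one)

def find_cases(values, targets):
    """Return (case number, value) pairs whose value is one of the targets."""
    return [(case, v) for case, v in enumerate(values, start=1) if v in targets]

# Step 2: locate the suspicious values so they can be checked against
# the original questionnaire or source dataset.
suspects = find_cases(vision_one, {14, 16})

print("min, max:", lowest, highest)
print("cases to verify:", suspects)
```

The output is a list of case numbers to look up in the source records; deciding whether each value is an entry mistake or a genuine response still requires the original questionnaire.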

Review Questions

What are the main identification methods for outliers described here, and how does each method differ in what it reveals?
Under what circumstances would you delete an outlier versus keep it, according to the transcript’s guidance?
In SPSS, what steps allow you to locate the exact case number for an extreme value and verify it against the original questionnaire?

Key Points

1. Outliers are unusually high or low observations that can shift results, especially by inflating or deflating the mean.
2. Minimum/maximum checks help flag extreme values, but you must trace them back to specific cases to verify whether they’re errors or real responses.
3. Box plots in SPSS visually mark extreme outliers (e.g., stars), and those marks can be linked to case numbers for follow-up.
4. Z-score standardization enables a consistent outlier rule; values beyond about ±3.3 are treated as strong outlier candidates.
5. Outlier handling should be driven by context: remove clearly spurious values, but consider keeping plausible values that only slightly exceed thresholds.
6. The session presents multiple practical remedies: ignore, trim, transform (log), or replace using percentile rules like the 5th/95th percentiles.
7. Any decision to remove or adjust outliers can change conclusions, so the approach should be justified and carefully considered.

Highlights

A single extreme response (e.g., 17 cups of coffee when most are 0–4) can inflate the mean and misrepresent the typical participant.

Z scores standardize data so outliers can be flagged consistently; the transcript uses ±3.3 as a key threshold.

SPSS workflows combine statistical detection (min/max, standardized values) with case-level verification via Edit → Find and case tracing.

Topics

Outliers
Z Scores
SPSS
Box Plot
Data Cleaning

Mentioned

Z score