Statistics for Research - L12 - How to Identify and Deal with Outliers in R?

TL;DR

Outliers can distort results by inflating or deflating summary statistics like the mean, even when they represent only a small fraction of responses.


Briefing

Outliers, meaning unusually high or low responses, can distort research results by pulling summary statistics like the mean away from what most participants actually report. A classic example is a coffee-consumption survey where nearly everyone answers between 0 and 4 cups per day, but one participant reports 17 cups. That single extreme value inflates the sample mean and makes the "average" coffee drinker appear to consume more than the typical respondent does. The distortion occurs whether the 17 reflects a real condition or a data-entry mistake.
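The effect can be checked with a few lines of R. This is a minimal sketch with made-up responses (the vector `cups` is hypothetical, not the session's data): 19 values between 0 and 4 plus a single 17.

```r
# Hypothetical survey responses: 19 people report 0-4 cups, one reports 17
cups <- c(0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 1, 2, 0, 2, 3, 17)

mean_with    <- mean(cups)             # mean including the extreme value: 58 / 20 = 2.9
mean_without <- mean(cups[cups <= 4])  # mean of the 19 typical responses: 41 / 19
median_with  <- median(cups)           # the median is unaffected by the single 17

print(c(mean_with = mean_with,
        mean_without = mean_without,
        median_with = median_with))
```

In this made-up sample, one response out of twenty raises the mean from about 2.2 to 2.9 cups, while the median stays at 2.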

The session lays out practical ways to spot outliers before deciding what to do with them. One approach is to inspect descriptive statistics, especially minimum and maximum values, to see whether any observations fall outside the expected range. Visual checks also matter: histograms, box plots, and bar graphs can reveal points that sit apart from the rest. For a more standardized method, responses can be converted into Z scores: each value minus the variable's mean, divided by its standard deviation, so the rescaled values have a mean of 0 and a standard deviation of 1. In this framework, values whose Z score is above about 3.3 or below about -3.3 are treated as outliers. Because Z scores are unit-free, the same cutoff can be applied to variables measured on different scales.
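The three checks can be sketched side by side in R. The data are hypothetical, and the box-plot rule used here (points beyond 1.5 times the interquartile range from the hinges, the default in `boxplot.stats()`) is R's convention rather than something stated in the session.

```r
# Hypothetical responses: 19 values in the 0-4 range plus one extreme value
cups <- c(0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 1, 2, 0, 2, 3, 17)

# 1. Descriptive check: does the minimum or maximum look out of range?
value_range <- range(cups)

# 2. Box-plot check without drawing: values beyond the whiskers
flagged_box <- boxplot.stats(cups)$out

# 3. Z-score check, computed by hand: (value - mean) / standard deviation
z <- (cups - mean(cups)) / sd(cups)
flagged_z <- which(abs(z) > 3.3)   # positions of responses beyond +/- 3.3

print(value_range)
print(flagged_box)
print(flagged_z)
```

To see the same information graphically, `hist(cups)` and `boxplot(cups)` draw the corresponding plots.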

Once outliers are identified, the key decision is whether they are meaningful or spurious. If an extreme response makes no sense in context—such as a value that appears to be a mistake—or is unjustifiably extreme, removing it can be appropriate. If the response is plausible and only slightly beyond the threshold, keeping it may be safer. The session also lists several handling options: ignore outliers when they don’t materially affect results; trim them by removing erroneous observations (common when they stem from measurement or entry errors); reduce their influence using transformations like taking the log of the data; or replace them using percentile-based values (for example, substituting with the 5th or 95th percentile).
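The trimming, transformation, and percentile-replacement options might look like this in R. This is a sketch on hypothetical data: the cutoff of 4 used for trimming and the use of `log1p()` (log of 1 + x, chosen here only because the made-up data contain zeros) are assumptions for illustration, not choices taken from the session.

```r
# Hypothetical responses with one extreme value
x <- c(0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 1, 2, 0, 2, 3, 17)

# Trim: drop observations judged to be erroneous (assumed valid range: 0-4)
trimmed <- x[x <= 4]

# Transform: a log compresses large values; log1p() avoids log(0)
logged <- log1p(x)

# Replace: cap values at the 5th and 95th percentiles
lo <- quantile(x, 0.05, names = FALSE)
hi <- quantile(x, 0.95, names = FALSE)
capped <- pmin(pmax(x, lo), hi)

print(length(trimmed))
print(range(logged))
print(range(capped))
```

Trimming changes the sample size, while the transformation and the percentile replacement keep every observation and only reduce the influence of the extreme one.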

The practical R workflow is then demonstrated. Data is loaded from a CSV file using `read.csv` with headers enabled. After viewing and checking the data, `summary()` is used to find minimum and maximum values for a selected variable. If extremes appear, `which()` helps locate the exact observations that violate expected bounds (e.g., values greater than 7). To apply the Z-score rule, the session uses `scale()` to compute standardized scores and again uses `which()` to identify observations whose Z scores exceed the threshold.
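The detection steps can be sketched as follows. The session's CSV file is not available here, so the sketch first writes a small hypothetical file to a temporary path; the column name `Vision1`, the 1 to 7 response scale, and the positions of the bad entries are assumptions made for illustration.

```r
# Hypothetical stand-in for the survey file: 40 responses on a 1-7 scale,
# with two mistyped entries in rows 12 and 27
survey <- data.frame(Vision1 = rep(c(4, 5, 6), length.out = 40))
survey$Vision1[c(12, 27)] <- c(16, 14)
csv_path <- tempfile(fileext = ".csv")
write.csv(survey, csv_path, row.names = FALSE)

# Load the data with the first row treated as column headers
dat <- read.csv(csv_path, header = TRUE)

# Inspect minimum and maximum for the selected variable
print(summary(dat$Vision1))

# Locate observations outside the expected bounds of the scale
out_range <- which(dat$Vision1 > 7)

# Standardize and locate observations beyond the Z-score threshold;
# scale() returns a one-column matrix, so take its first column
z <- scale(dat$Vision1)
out_z <- which(abs(z[, 1]) > 3.3)

print(out_range)
print(out_z)
```

In this made-up file both checks point at the same two rows. With real data they can disagree, since the range check depends on the questionnaire's scale while the Z-score check depends on the spread of the sample.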

A worked example follows: after finding two problematic observations for `Vision one`, the script determines their column index and observation numbers, then corrects the entries by referencing the original questionnaire (changing values like 16 and 14 to 6 and 4). After writing the corrected data back to the file, rerunning the checks shows that the outliers disappear. The session closes with the takeaway that outlier handling should be deliberate—because removing or ignoring points without justification can lead to incorrect conclusions—and that R can support both detection and correction.
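A minimal sketch of the correction step, under the same assumptions as before: the file, the column name `Vision1`, and the row positions are hypothetical, and the replacement values 6 and 4 stand for whatever the original questionnaire actually shows.

```r
# Hypothetical stand-in for the survey file, with rows 12 and 27
# mistyped as 16 and 14 on a 1-7 scale
survey <- data.frame(Vision1 = rep(c(4, 5, 6), length.out = 40))
survey$Vision1[c(12, 27)] <- c(16, 14)
csv_path <- tempfile(fileext = ".csv")
write.csv(survey, csv_path, row.names = FALSE)

dat <- read.csv(csv_path, header = TRUE)

# Find the column index and the observation numbers of the bad entries
col_idx  <- which(names(dat) == "Vision1")
bad_rows <- which(dat[[col_idx]] > 7)

# Overwrite them with the values verified against the questionnaire
dat[bad_rows, col_idx] <- c(6, 4)

# Write the corrected data back and rerun both checks
write.csv(dat, csv_path, row.names = FALSE)
fixed <- read.csv(csv_path, header = TRUE)
remaining_range <- which(fixed$Vision1 > 7)
remaining_z     <- which(abs(scale(fixed$Vision1)[, 1]) > 3.3)

print(remaining_range)   # empty: no values above 7 remain
print(remaining_z)       # empty: no Z scores beyond +/- 3.3 remain
```

Overwriting the source file is what the session describes; keeping an untouched copy of the raw data alongside the corrected file is a common precaution.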

Cornell Notes

Outliers are unusually high or low responses that can skew results, especially summary measures like the mean. The session recommends detecting them using descriptive statistics (min/max), visual tools (histograms, box plots, bar graphs), and standardized Z scores. In the Z-score approach, values with Z scores beyond roughly ±3.3 are flagged as outliers. After identification, researchers should decide whether outliers are valid (keep) or spurious (remove, trim, transform, or replace). A concrete R workflow is provided: load CSV data, use `summary()` to find extremes, use `which()` to locate violating observations, compute Z scores with `scale()`, then correct or delete problematic entries and re-check until no outliers remain.

Why can a single extreme response (like 17 cups of coffee) change research conclusions?

Because extreme values can pull the mean upward even when most responses cluster in a normal range. If nearly all participants report 0–4 cups but one reports 17, that one value increases the sample mean and makes it look like the typical participant consumes far more than they actually do. The distortion happens even if the outlier is rare and may be due to a real condition or a data-entry error.

What are three complementary ways to identify outliers?

First, check descriptive statistics such as minimum and maximum values using `summary()`. Second, use visual diagnostics like histograms, box plots, or bar graphs to spot points that sit far from the main cluster. Third, apply a standardized method by converting values to Z scores so outlier detection is based on distance from the mean in standard-deviation units.

How does the Z-score method work, and what threshold is used here?

Each response can be transformed into a Z score using `scale()` so the standardized values have mean 0 and standard deviation 1. The session uses a rule of thumb: Z scores greater than about 3.3 (or less than -3.3) indicate an outlier. This standardization helps compare across cases and variables on different scales.
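To illustrate the point about variables on different scales, `scale()` can standardize several columns at once, and `which()` with `arr.ind = TRUE` then reports both the row and the column of each flagged value. The data frame below is hypothetical.

```r
# Two hypothetical variables measured in different units
df <- data.frame(
  cups  = c(0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 1, 2, 0, 2, 3, 17),
  hours = rep(c(6, 7, 8, 9), times = 5)
)

# Standardize every column: each ends up with mean 0 and SD 1
z <- scale(df)

# Row and column positions of all values beyond the +/- 3.3 threshold
flagged <- which(abs(z) > 3.3, arr.ind = TRUE)

print(round(colMeans(z), 10))   # both column means are 0 after scaling
print(apply(z, 2, sd))          # both standard deviations are 1
print(flagged)
```

Because every column is expressed in standard-deviation units, a single threshold serves for both `cups` and `hours`.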

What decision framework determines whether to remove, keep, or adjust outliers?

Outliers should be evaluated in context. If an extreme response is implausible or lacks justification—such as values that don’t match the questionnaire or appear to be entry errors—removal or trimming is favored. If the response is plausible and only slightly beyond the Z-score threshold, keeping it may be safer. The session also warns that removing or ignoring outliers without justification can lead to incorrect conclusions.

What R functions are used to locate outliers and to find which observations violate rules?

`summary()` is used to inspect min/max values. `which()` identifies the specific observation indices that meet a condition (e.g., values greater than 7). For Z-score detection, `scale()` computes standardized scores for a chosen column, and `which()` is then used again to find observations whose Z scores exceed the threshold.

How can outliers be handled in practice using the workflow shown?

The session demonstrates correcting erroneous entries rather than just deleting them. After identifying two outlier observations in a column (by observation number and column index), the values are replaced with corrected numbers sourced from the original questionnaire. The updated dataset is written back to the CSV, and rerunning the outlier checks confirms that no outliers remain.

Review Questions

What are the main risks of using the mean when outliers are present, and how does the coffee example illustrate that risk?
Describe how `summary()`, `which()`, and `scale()` work together in the outlier-detection workflow.
List at least three different strategies for handling outliers and explain when each would be appropriate.

Key Points

1. Outliers can distort results by inflating or deflating summary statistics like the mean, even when they represent only a small fraction of responses.
2. Minimum/maximum checks and visual plots (histograms, box plots, bar graphs) provide quick, practical ways to spot suspicious values.
3. Standardizing with Z scores enables consistent outlier detection across variables; values beyond roughly ±3.3 are flagged here.
4. Outlier handling should be context-driven: remove or trim only when responses are implausible or clearly spurious, and consider keeping values that are plausible.
5. R can identify exact problematic observations using `which()` after checking extremes with `summary()`.
6. Z-score outliers can be found by computing standardized scores with `scale()` and then filtering observations whose Z scores exceed the threshold.
7. Correcting data-entry errors (by updating the CSV with verified values) can eliminate outliers and preserve valid data rather than discarding it.

Highlights

A single extreme response (e.g., 17 cups when most are 0–4) can significantly inflate the mean and misrepresent the typical participant.

Z scores standardize responses so outlier detection can rely on distance from the mean; the session uses a ±3.3 rule of thumb.

The workflow pinpoints outliers to specific observation numbers and column indices, enabling targeted correction rather than blanket deletion.

Correcting erroneous questionnaire entries in the dataset can remove outliers entirely after re-running the checks.

Topics

Outliers
Z Scores
Data Cleaning
R Outlier Detection
Survey Data
