Statistics for Research - L12 - How to Identify and Deal with Outliers in R?
Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Outliers—unusually high or low responses—can distort research results by pulling summary statistics like the mean away from what most participants actually report. A classic example is a coffee-consumption survey where nearly everyone answers between 0 and 4 cups per day, but one participant reports 17 cups. That single extreme value can inflate the sample mean and make it look like the “average” coffee drinker consumes far more than the typical respondent—whether the outlier reflects a genuinely extreme caffeine habit or a data-entry mistake.
The session lays out practical ways to spot outliers before deciding what to do with them. One approach is to inspect descriptive statistics—especially minimum and maximum values—to see whether any observations fall far outside the expected range. Visual checks also matter: histograms, box plots, and bar graphs can reveal points that sit apart from the rest. For a more standardized method, responses can be converted into Z scores, which rescale values so they have a mean of 0 and a standard deviation of 1. In this framework, values with Z scores greater than about 3.3 (or less than -3.3) are treated as outliers, making it easier to compare across cases and variables.
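The Z-score rule can be sketched in a few lines of R. The data below are hypothetical (not the session's dataset), constructed to mirror the coffee example: most responses fall between 0 and 4, with one extreme value of 17.

```r
# Hypothetical coffee-consumption responses (cups per day);
# 30 ordinary answers plus one extreme value of 17
cups <- c(rep(0:4, times = 6), 17)

# scale() standardizes to mean 0 and standard deviation 1,
# returning a one-column matrix of Z scores
z <- scale(cups)

# Flag observations whose Z scores exceed the +/- 3.3 threshold;
# here the observation with value 17 is flagged
which(abs(z) > 3.3)
```

Note that the threshold only catches extremes relative to the sample's own spread; in very small samples even a glaring outlier may not reach |Z| > 3.3, which is why the min/max and visual checks remain useful complements.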
Once outliers are identified, the key decision is whether they are meaningful or spurious. If an extreme response makes no sense in context—such as a value that appears to be a mistake—or is unjustifiably extreme, removing it can be appropriate. If the response is plausible and only slightly beyond the threshold, keeping it may be safer. The session also lists several handling options: ignore outliers when they don’t materially affect results; trim them by removing erroneous observations (common when they stem from measurement or entry errors); reduce their influence using transformations like taking the log of the data; or replace them using percentile-based values (for example, substituting with the 5th or 95th percentile).
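Two of the listed handling options—log transformation and percentile-based replacement—can be sketched as follows, using a hypothetical skewed vector `x` (not data from the session):

```r
# Hypothetical responses with one extreme value
x <- c(rep(1:4, times = 8), 40)

# Option 1: reduce the outlier's influence with a log transform;
# log1p computes log(1 + x), which also handles zeros safely
x_log <- log1p(x)

# Option 2: replace extremes with percentile-based values
# (winsorizing at the 5th and 95th percentiles)
lo <- quantile(x, 0.05)
hi <- quantile(x, 0.95)
x_wins <- pmin(pmax(x, lo), hi)
```

The winsorized vector keeps every observation but caps them inside the chosen percentile band, whereas trimming would discard the extreme rows entirely.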
The practical R workflow is then demonstrated. Data is loaded from a CSV file using `read.csv` with headers enabled. After viewing and checking the data, `summary()` is used to find minimum and maximum values for a selected variable. If extremes appear, `which()` helps locate the exact observations that violate expected bounds (e.g., values greater than 7). To apply the Z-score rule, the session uses `scale()` to compute standardized scores and again uses `which()` to identify observations whose Z scores exceed the threshold.
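A minimal sketch of that workflow, assuming a placeholder file name (`survey.csv`) and column name (`Vision1`)—the session's actual file and variable names may differ:

```r
# Load the data; header = TRUE treats the first row as column names
data <- read.csv("survey.csv", header = TRUE)

# Inspect descriptive statistics, including min and max,
# for the variable of interest
summary(data$Vision1)

# Locate observations that violate the expected range (e.g., > 7)
which(data$Vision1 > 7)

# Z-score rule: standardize the variable, then flag |Z| > 3.3
z <- scale(data$Vision1)
which(abs(z) > 3.3)
```

`which()` returns the row numbers of the violating observations, which is what makes the later correction step possible.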
A worked example follows: after finding two problematic observations for `Vision one`, the script determines their column index and observation numbers, then corrects the entries by referencing the original questionnaire (changing values like 16 and 14 to 6 and 4). After writing the corrected data back to the file, rerunning the checks shows that the outliers disappear. The session closes with the takeaway that outlier handling should be deliberate—because removing or ignoring points without justification can lead to incorrect conclusions—and that R can support both detection and correction.
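The correction step might look like the sketch below, again using the placeholder column name `Vision1` and assuming (as in the session) that the questionnaire confirmed the true values were 6 and 4:

```r
# Correct the verified data-entry errors: 16 should be 6, 14 should be 4
data$Vision1[data$Vision1 == 16] <- 6
data$Vision1[data$Vision1 == 14] <- 4

# Write the corrected data back to the CSV, then re-run the checks
write.csv(data, "survey.csv", row.names = FALSE)
summary(data$Vision1)
```

After rerunning `summary()` and the Z-score check, the previously flagged observations should no longer appear.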
Cornell Notes
Outliers are unusually high or low responses that can skew results, especially summary measures like the mean. The session recommends detecting them using descriptive statistics (min/max), visual tools (histograms, box plots, bar graphs), and standardized Z scores. In the Z-score approach, values with Z scores beyond roughly ±3.3 are flagged as outliers. After identification, researchers should decide whether outliers are valid (keep) or spurious (remove, trim, transform, or replace). A concrete R workflow is provided: load CSV data, use `summary()` to find extremes, use `which()` to locate violating observations, compute Z scores with `scale()`, then correct or delete problematic entries and re-check until no outliers remain.
- Why can a single extreme response (like 17 cups of coffee) change research conclusions?
- What are three complementary ways to identify outliers before using Z scores?
- How does the Z-score method work, and what threshold is used here?
- What decision framework determines whether to remove, keep, or adjust outliers?
- What R functions are used to locate outliers and to find which observations violate rules?
- How can outliers be handled in practice using the workflow shown?
Review Questions
- What are the main risks of using the mean when outliers are present, and how does the coffee example illustrate that risk?
- Describe how `summary()`, `which()`, and `scale()` work together in the outlier-detection workflow.
- List at least three different strategies for handling outliers and explain when each would be appropriate.
Key Points
1. Outliers can distort results by inflating or deflating summary statistics like the mean, even when they represent only a small fraction of responses.
2. Minimum/maximum checks and visual plots (histograms, box plots, bar graphs) provide quick, practical ways to spot suspicious values.
3. Standardizing with Z scores enables consistent outlier detection across variables; values beyond roughly ±3.3 are flagged here.
4. Outlier handling should be context-driven: remove or trim only when responses are implausible or clearly spurious, and consider keeping values that are plausible.
5. R can identify exact problematic observations using `which()` after checking extremes with `summary()`.
6. Z-score outliers can be found by computing standardized scores with `scale()` and then filtering observations whose Z scores exceed the threshold.
7. Correcting data-entry errors (by updating the CSV with verified values) can eliminate outliers and preserve valid data rather than discarding it.