Statistics for Research - L8 - Descriptive Statistics using R

TL;DR

Central tendency summarizes what’s typical using mean (interval/ratio), median (ordinal), and mode (nominal).


Briefing

Descriptive statistics boil down to two goals: summarize what’s “typical” in a dataset and quantify how widely values spread. Central tendency answers what a typical value looks like—using the mean, median, or mode depending on the measurement scale—while dispersion (variability) shows whether those typical values are reliable or whether observations are scattered. Together, these measures form the baseline evidence researchers report in theses and papers.

Central tendency starts with the mean, the arithmetic average found by summing all observations and dividing by the number of cases. The mean is the go-to measure for interval and ratio variables, such as heights. When values are ordered but not evenly spaced (ordinal data), the median becomes the better choice; it is the middle value once the data are arranged from lowest to highest. For nominal variables, where categories have no natural order, the mode is used: the category value that appears most often. The transcript illustrates the mode with a small example where “male” (coded as 1) appears more frequently than “female” (coded as 2), making 1 the mode.
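The three measures can be sketched in a few lines of base R. The vectors below are hypothetical values invented for illustration, not data from the session, and the mode is found by counting with table(), which is one possible approach rather than the one the transcript uses.

```r
# Hypothetical data, invented for illustration only.
heights_cm   <- c(160, 165, 170, 175, 180)  # ratio scale
satisfaction <- c(1, 2, 2, 3, 5, 5, 4)      # ordinal ratings on a 1-5 scale
gender       <- c(1, 1, 2, 1, 2)            # nominal codes: 1 = male, 2 = female

# Mean: sum of the observations divided by the number of cases.
mean_height <- mean(heights_cm)             # 850 / 5 = 170

# Median: middle value after sorting (1, 2, 2, 3, 4, 5, 5).
median_satisfaction <- median(satisfaction) # 3

# Mode: count each category and take the label with the largest count.
gender_counts <- table(gender)
mode_gender <- names(gender_counts)[which.max(gender_counts)]  # "1"

print(mean_height)
print(median_satisfaction)
print(mode_gender)
```

Note that table() stores the category labels as text, so the mode comes back as the string "1" rather than the number 1.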

Dispersion then addresses a different question: even if there’s a typical value, how tightly do observations cluster around it? The range provides a quick sense of spread by subtracting the minimum from the maximum. For a more informative measure, standard deviation indicates how much values generally vary from the mean. A lower standard deviation means most observations sit close to the mean, while a higher standard deviation signals wider scattering. Standard error carries this idea over to inference by estimating how accurately a sample mean represents the true population mean; it’s computed by dividing the standard deviation by the square root of the sample size. A lower standard error therefore means a more precise estimate of the mean, and because of the division by the square root of the sample size, it shrinks as the sample grows.
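A minimal sketch of the three dispersion measures, using a hypothetical vector of scores. Note that R's sd() computes the sample standard deviation (it divides by n - 1), and the standard error is derived by hand from the formula described above, since base R has no dedicated function for it.

```r
# Hypothetical scores, invented for illustration only.
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Range: maximum minus minimum.
value_range <- max(scores) - min(scores)    # 9 - 2 = 7

# Standard deviation: sample SD, using the n - 1 denominator.
std_dev <- sd(scores)

# Standard error of the mean: SD divided by the square root of the sample size.
std_err <- std_dev / sqrt(length(scores))

print(value_range)
print(std_dev)
print(std_err)
```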

The session then shifts from concepts to implementation in R, positioning R as a free, open-source statistical computing language used to write and run data-analysis code. After recommending an IDE like RStudio, it walks through the workflow: download R and RStudio, create an R script, load a CSV dataset using read.csv, and verify the import with head. From there, summary statistics are produced with summary, which returns key distribution metrics such as minimum, quartiles, median, mean, and maximum for each variable.
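The import-and-inspect steps might look like the sketch below. To keep it self-contained, it first writes a tiny hypothetical dataset to a temporary file; in a real project csv_path would instead point to your own CSV file. The column names (age, gender, V1) and all values are assumptions made for illustration.

```r
# Write a small hypothetical dataset to a temporary file so the sketch runs anywhere.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("age,gender,V1",
             "23,1,10",
             "31,2,12",
             "27,1,15",
             "45,2,11"),
           csv_path)

# Load the CSV; header = TRUE treats the first row as column names.
data <- read.csv(csv_path, header = TRUE)

# Verify the import: head() shows up to the first six rows.
print(head(data))

# Overall statistics: min, quartiles, median, mean, and max for each numeric column.
print(summary(data))
```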

For specific measures, the transcript demonstrates mean(), median(), and standard deviation (sd() in base R) by selecting a variable from the dataset object with the $ operator (e.g., data$V1). It also notes a practical limitation: base R has no built-in function for the statistical mode (its mode() function reports an object's storage type, not the most frequent value), so the workflow uses an external package. Once the package is installed and loaded (via install.packages and library), a mode function is available for numeric variables and categorical codes (e.g., the mode of a gender variable); in DescTools, the package named in these notes, that function is spelled Mode() with a capital M. The result is a repeatable recipe for reporting central tendency and dispersion in research settings, with a promise of more advanced statistics in later sessions.
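As a rough illustration of selecting a variable with the $ operator, the sketch below builds a small hypothetical data frame in place of an imported CSV. The column names V1 and age follow the examples mentioned in these notes; the values are invented.

```r
# Hypothetical data frame standing in for an imported dataset.
data <- data.frame(V1  = c(10, 12, 15, 11),
                   age = c(23, 31, 27, 45))

# data$V1 extracts the V1 column as a plain numeric vector.
v1_mean    <- mean(data$V1)     # 48 / 4 = 12
age_median <- median(data$age)  # middle of 23, 27, 31, 45 is (27 + 31) / 2 = 29
v1_sd      <- sd(data$V1)       # sample standard deviation (n - 1 denominator)

print(v1_mean)
print(age_median)
print(v1_sd)
```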

Cornell Notes

The session defines descriptive statistics as a way to report two things about data: central tendency and dispersion. Central tendency identifies a typical value using mean for interval/ratio variables, median for ordinal variables, and mode for nominal variables. Dispersion measures how spread out observations are, using range, standard deviation, and standard error; standard deviation reflects variability around the mean, while standard error gauges how accurately a sample mean estimates the population mean. It then demonstrates how to compute these measures in R by loading a CSV file, checking it with head, using summary for overall statistics, and applying functions like mean() and median(). Because base R has no function for the statistical mode, it uses an external package (e.g., DescTools) to calculate it.

How do researchers decide whether to use mean, median, or mode for central tendency?

Mean is used for interval and ratio variables because it relies on meaningful numerical distances between values. Median is used for ordinal variables because it only requires an ordering (from lowest to highest) and picks the middle observation. Mode is used for nominal variables because it applies to categories without inherent order, selecting the most frequently occurring category value.

What’s the difference between standard deviation and standard error in measuring dispersion?

Standard deviation quantifies how much individual observations typically vary from the mean—low values mean observations cluster tightly around the mean, while high values indicate wide spread. Standard error estimates the accuracy of the sample mean as an estimate of the population mean, computed as standard deviation divided by the square root of the sample size; smaller standard error implies higher accuracy.
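The effect of sample size on standard error can be checked with simple arithmetic. The helper name standard_error and the numbers below are hypothetical, chosen so the results are easy to verify by hand: with the standard deviation held at 10, quadrupling the sample size halves the standard error.

```r
# Hypothetical helper: standard error of the mean from an SD and a sample size.
standard_error <- function(std_dev, n) {
  std_dev / sqrt(n)
}

se_25  <- standard_error(10, 25)   # 10 / 5  = 2
se_100 <- standard_error(10, 100)  # 10 / 10 = 1

print(se_25)
print(se_100)
```

The standard deviation describes the data and does not systematically shrink as more cases are added; the standard error describes the estimate of the mean and does.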

What does the range tell you, and why might it be less informative than standard deviation?

Range is the difference between the maximum and minimum values, giving a quick snapshot of spread. It doesn’t reflect how values are distributed between those extremes, so two datasets can share the same range while having very different clustering; standard deviation captures that clustering around the mean.
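A small sketch of this point, using two hypothetical datasets that share the same minimum, maximum, and mean but differ in how tightly the values cluster:

```r
# Two hypothetical datasets with identical range (0 to 10) and identical mean (5).
clustered <- c(0, 5, 5, 5, 5, 5, 10)    # most values sit at the mean
scattered <- c(0, 0, 0, 5, 10, 10, 10)  # most values sit at the extremes

# range() returns c(min, max); diff() turns that pair into max - min.
range_clustered <- diff(range(clustered))  # 10
range_scattered <- diff(range(scattered))  # 10

# The standard deviations tell the two datasets apart.
sd_clustered <- sd(clustered)
sd_scattered <- sd(scattered)

print(c(range_clustered, range_scattered))
print(c(sd_clustered, sd_scattered))
```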

What are the core steps shown for producing descriptive statistics in R from a CSV file?

The workflow is: create an R script, load data with read.csv (using the correct file path and header=TRUE), verify import with head, then compute overall statistics with summary. For targeted measures, compute mean/median/standard deviation by calling functions like mean(data$V1) or median(data$age), using the $ operator to select a variable.

Why is mode handled differently in R, and how is it computed anyway?

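Base R has no built-in function for the statistical mode; its mode() function returns an object's storage type (such as "numeric") rather than the most frequent value. The workaround is to install and load a package (the transcript uses DescTools), then call the package's mode function on the variable of interest. These notes write the call as mode(data$V1) or mode(data$gender), with gender treated as coded values; in DescTools the function is spelled Mode(), with a capital M.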
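One detail worth checking before running this step: base R's lowercase mode() reports an object's storage type, while the DescTools function for the most frequent value is Mode() with a capital M. The sketch below uses a hypothetical gender vector and only calls DescTools if it is already installed; otherwise it falls back to counting with table(), which is an assumed alternative and not part of the session.

```r
# Hypothetical coded variable: 1 = male, 2 = female.
gender <- c(1, 1, 2, 1, 2)

# Base R's mode() describes how the object is stored, not its most frequent value.
storage_type <- mode(gender)   # "numeric"

# Use DescTools::Mode() when the package is available; otherwise count by hand.
# To install it first: install.packages("DescTools")
most_frequent <- if (requireNamespace("DescTools", quietly = TRUE)) {
  DescTools::Mode(gender)
} else {
  as.numeric(names(which.max(table(gender))))
}

print(storage_type)
print(most_frequent)
```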

Review Questions

When would median be preferred over mean, and what assumption about the data scale makes that choice appropriate?
How do standard deviation and standard error respond differently when sample size changes?
In R, what is the purpose of using the $ operator when computing mean or median for a specific variable?

Key Points

1. Central tendency summarizes what’s typical using mean (interval/ratio), median (ordinal), and mode (nominal).
2. Dispersion measures how spread out values are using range, standard deviation, and standard error.
3. Standard deviation reflects variability around the mean; lower values indicate tighter clustering.
4. Standard error is standard deviation divided by the square root of sample size and indicates how accurately the sample mean estimates the population mean.
5. RStudio is recommended as an IDE for running R code and organizing scripts.
6. Descriptive statistics in R can be generated by loading CSV data with read.csv, inspecting with head, and using summary().
7. Mode requires a package in this workflow (e.g., DescTools) because base R has no function for the statistical mode; its mode() returns an object's storage type instead.

Highlights

Mean is computed by summing all observations and dividing by the number of cases; it’s appropriate for interval and ratio data.

Median is the middle value after sorting and fits ordinal variables where only rank matters.

Standard error equals standard deviation divided by √n, linking sample size to how trustworthy the sample mean is.

R’s summary() function provides min, quartiles, median, mean, and max for each variable in one step.

Base R has no function for the statistical mode; installing and loading DescTools provides Mode() (capital M) for numeric and coded categorical variables.

Topics

Central Tendency
Dispersion
Descriptive Statistics
R Programming
Mode Calculation