Statistics for Research - L8 - Descriptive Statistics using R
Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Descriptive statistics boil down to two goals: summarize what’s “typical” in a dataset and quantify how widely values spread. Central tendency answers what a typical value looks like—using the mean, median, or mode depending on the measurement scale—while dispersion (variability) shows whether those typical values are reliable or whether observations are scattered. Together, these measures form the baseline evidence researchers report in theses and papers.
Central tendency starts with the mean, the arithmetic average found by summing all observations and dividing by the number of cases. Mean is the go-to measure for interval and ratio variables, such as heights. When values are ordered but not evenly spaced—ordinal data—median becomes the better choice, defined as the middle value once data are arranged from lowest to highest. For nominal variables, where categories have no natural order, the mode is used: the category value that appears most often. The transcript illustrates mode with a small example where “male” (coded as 1) appears more frequently than “female” (coded as 2), making 1 the mode.
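The three measures can be sketched in a few lines of base R. This is a minimal illustration, not the code from the video: the vectors `heights`, `satisfaction`, and `gender` are hypothetical data invented here, and the mode is found by counting with `table()` rather than with a package.

```r
# Hypothetical sample data, invented for illustration only.
heights      <- c(150, 160, 165, 170, 180)   # ratio scale (cm)
satisfaction <- c(1, 2, 2, 3, 4, 4, 5)       # ordinal codes (1 = low, 5 = high)
gender       <- c(1, 1, 2, 1, 2)             # nominal codes (1 = male, 2 = female)

# Mean: sum of all observations divided by the number of cases.
mean_height <- mean(heights)

# Median: the middle value once the data are sorted.
median_satisfaction <- median(satisfaction)

# Mode: count how often each code occurs and keep the most frequent one.
# which.max() returns the first maximum, so ties go to the smallest code.
counts <- table(gender)
mode_gender <- as.numeric(names(counts)[which.max(counts)])

print(mean_height)          # 165
print(median_satisfaction)  # 3
print(mode_gender)          # 1
```

Code 1 occurs three times and code 2 twice, so the mode is 1, matching the male/female example in the text.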
Dispersion then addresses a different question: even if there’s a typical value, how tightly do observations cluster around it? The range provides a quick sense of spread by subtracting the minimum from the maximum, but because it uses only the two extreme values, a single outlier can distort it. For a more informative measure, standard deviation uses every observation to indicate how far values typically fall from the mean. Lower standard deviation means most observations sit close to the mean, while higher standard deviation signals wider scattering. Standard error carries this idea into inference by estimating how precisely a sample mean represents the true population mean; it’s computed by dividing the standard deviation by the square root of the sample size. A lower standard error implies a more precise estimate, and because of the square root in the denominator, it shrinks as the sample grows even when the standard deviation stays the same.
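The dispersion measures described above can be computed as follows. This is a sketch under stated assumptions: `scores` is a hypothetical vector, and `sd()` is the sample standard deviation (n - 1 in the denominator). Base R has no dedicated standard-error function, so it is computed directly from the definition. Note that `range()` in R returns the minimum and maximum as a pair, not their difference.

```r
# Hypothetical scores, invented for illustration only.
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Range as a single number: maximum minus minimum.
# (range(scores) itself returns c(min, max); diff(range(scores)) is equivalent.)
value_range <- max(scores) - min(scores)

# Sample standard deviation (uses n - 1 in the denominator).
sd_scores <- sd(scores)

# Standard error of the mean: standard deviation / sqrt(sample size).
se_scores <- sd_scores / sqrt(length(scores))

print(value_range)  # 7
print(sd_scores)
print(se_scores)
```

Doubling the sample size while keeping the same spread would leave `sd_scores` roughly unchanged but divide `se_scores` by about 1.41 (the square root of 2).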
The session then shifts from concepts to implementation in R, positioning R as a free, open-source statistical computing language used to write and run data-analysis code. After recommending an IDE like RStudio, it walks through the workflow: download R and RStudio, create an R script, load a CSV dataset using read.csv, and verify the import with head. From there, summary statistics are produced with summary, which returns key distribution metrics such as minimum, quartiles, median, mean, and maximum for each variable.
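The load, inspect, and summarize steps might look like the sketch below. The dataset here is hypothetical: the example writes a tiny CSV to a temporary file so it runs on its own, whereas in practice `read.csv()` would point at your own file. The column names `V1` and `gender` are illustrative choices, not the video's actual dataset.

```r
# Create a small hypothetical CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("V1,gender",
             "12,1",
             "15,2",
             "11,1",
             "18,1",
             "14,2"), csv_path)

# Load the file; read.csv() treats the first row as column names by default.
data <- read.csv(csv_path)

# Verify the import: head() shows the first six rows (here, all five).
print(head(data))

# Minimum, quartiles, median, mean, and maximum for each column.
print(summary(data))
```

In an interactive RStudio session the `print()` calls are optional, since typing `head(data)` or `summary(data)` at the console displays the result directly.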
For specific measures, the transcript demonstrates mean(), median(), and standard deviation (sd()) calculations by targeting a variable via the dataset object and the $ operator (e.g., data$V1). It also notes a practical limitation: base R has no function for the statistical mode (its built-in mode() reports an object’s storage type instead), so the workflow uses an external package. Installing and loading the package (via install.packages() and library()) provides a mode function, which in DescTools is Mode() with a capital M, to compute the most frequent value for numeric variables and categorical codes (e.g., the mode of a gender variable). The result is a repeatable recipe for reporting central tendency and dispersion in research settings, with a promise of more advanced statistics in later sessions.
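A sketch of the per-variable calculations follows, using a hypothetical data frame with columns `V1` and `gender`. It assumes the package is DescTools and uses its `Mode()` function. The installation line is left commented out, and a base R fallback based on `table()` is included so the example still runs where DescTools is not installed. The fallback is an addition here, not part of the demonstrated workflow.

```r
# Hypothetical data frame standing in for the imported CSV.
data <- data.frame(V1     = c(12, 15, 11, 18, 14),
                   gender = c(1, 2, 1, 1, 2))

# The $ operator selects one column of the data frame as a vector.
mean_v1   <- mean(data$V1)
median_v1 <- median(data$V1)
sd_v1     <- sd(data$V1)

# Base R's mode() is not the statistical mode: it reports the storage type.
storage_type <- mode(data$gender)   # "numeric"

# install.packages("DescTools")     # one-time install, run manually
if (requireNamespace("DescTools", quietly = TRUE)) {
  # DescTools::Mode() returns the most frequent value(s);
  # as.numeric() drops the attached frequency attribute.
  gender_mode <- as.numeric(DescTools::Mode(data$gender))
} else {
  # Fallback in base R: count each code and keep the most frequent.
  counts <- table(data$gender)
  gender_mode <- as.numeric(names(counts)[which.max(counts)])
}

print(c(mean = mean_v1, median = median_v1, sd = sd_v1))
print(storage_type)
print(gender_mode)   # 1
```

After `library(DescTools)` the call can be written as `Mode(data$gender)`. The capital M matters, because lowercase `mode()` silently returns the storage type rather than raising an error.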
Cornell Notes
The session defines descriptive statistics as a way to report two things about data: central tendency and dispersion. Central tendency identifies a typical value using mean for interval/ratio variables, median for ordinal variables, and mode for nominal variables. Dispersion measures how spread out observations are, using range, standard deviation, and standard error; standard deviation reflects variability around the mean, while standard error gauges how accurately a sample mean estimates the population mean. The session then demonstrates how to compute these measures in R by loading a CSV file, checking it with head(), using summary() for overall statistics, and applying functions like mean() and median(). Because base R has no function for the statistical mode (its mode() returns an object’s storage type), the session uses an external package (e.g., DescTools) to calculate it.
How do researchers decide whether to use mean, median, or mode for central tendency?
What’s the difference between standard deviation and standard error in measuring dispersion?
What does the range tell you, and why might it be less informative than standard deviation?
What are the core steps shown for producing descriptive statistics in R from a CSV file?
Why is mode handled differently in R, and how is it computed anyway?
Review Questions
- When would median be preferred over mean, and what assumption about the data scale makes that choice appropriate?
- How do standard deviation and standard error respond differently when sample size changes?
- In R, what is the purpose of using the $ operator when computing mean or median for a specific variable?
Key Points
1. Central tendency summarizes what’s typical using mean (interval/ratio), median (ordinal), and mode (nominal).
2. Dispersion measures how spread out values are using range, standard deviation, and standard error.
3. Standard deviation reflects variability around the mean; lower values indicate tighter clustering.
4. Standard error is standard deviation divided by the square root of sample size and indicates how accurately the sample mean estimates the population mean.
5. RStudio is recommended as an IDE for running R code and organizing scripts.
6. Descriptive statistics in R can be generated by loading CSV data with read.csv(), inspecting with head(), and using summary().
7. Base R has no function for the statistical mode (its mode() returns an object’s storage type), so this workflow relies on a package such as DescTools.