Statistics for Research - L21 - Correlation Analysis using R
Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Correlation coefficient r quantifies both direction (sign) and strength (magnitude) of a linear relationship between two quantitative variables.
Briefing
The correlation coefficient (r) is the go-to numerical measure for quantifying the direction and strength of a linear relationship between two quantitative variables, something a scatter plot can suggest but not reliably measure. The sign of r indicates direction: a negative value means one variable rises while the other falls (an inverse relationship), while a positive value means they move together. Because r is designed for linear patterns, it should be paired with a scatter plot to confirm that the relationship is approximately a straight line rather than curved or otherwise non-linear.
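The point about sign and linearity can be illustrated with a minimal sketch in base R. The data below are simulated purely for illustration (they do not come from the video), and cor() is base R's correlation function, which may differ from the function used in the session.

```r
# Simulated data, hypothetical and for illustration only
set.seed(1)
x <- seq(-3, 3, length.out = 100)

y_linear  <-  2 * x + rnorm(100, sd = 1)    # rises with x
y_inverse <- -2 * x + rnorm(100, sd = 1)    # falls as x rises
y_curved  <-  x^2   + rnorm(100, sd = 0.5)  # strong but non-linear (U-shaped)

r_linear  <- cor(x, y_linear)   # positive sign: same-direction movement
r_inverse <- cor(x, y_inverse)  # negative sign: inverse relationship
r_curved  <- cor(x, y_curved)   # close to 0 despite a clear pattern

print(round(c(linear = r_linear, inverse = r_inverse, curved = r_curved), 2))

# The scatter plot reveals the curved relationship that r alone misses
plot(x, y_curved, main = "Curved relationship", xlab = "x", ylab = "y")
```

The curved case is why the scatter plot matters: y depends heavily on x, yet r is close to zero because the dependence is not a straight line.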
Interpreting r comes with commonly used, though not universal, guidelines. Values near 0 indicate very weak linear association, while values closer to 1 (or -1) indicate stronger linear relationships. One set of thresholds described: |r| ≤ 0.1 is very weak; 0.1 to 0.3 is weak; 0.3 to 0.5 is moderate; 0.5 to 0.7 is strong; and above 0.7 is very strong. The transcript also flags a practical warning: correlations above about 0.85 can signal multicollinearity, meaning that two constructs are measuring essentially the same underlying concept.
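The thresholds above can be encoded as a small helper. This is a sketch only: the function names classify_r and flag_overlap are hypothetical, and assigning a boundary value such as exactly 0.3 to the lower band is an assumption, since the thresholds as described do not say which band a boundary value belongs to.

```r
# Hypothetical helper: label the strength of r using the thresholds above.
# Assumption: a value exactly on a boundary falls into the lower band.
classify_r <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1, na.rm = TRUE))
  cut(abs(r),
      breaks = c(0, 0.1, 0.3, 0.5, 0.7, 1),
      labels = c("very weak", "weak", "moderate", "strong", "very strong"),
      include.lowest = TRUE, right = TRUE)
}

# Hypothetical helper: flag correlations above about 0.85 (possible multicollinearity)
flag_overlap <- function(r, cutoff = 0.85) abs(r) > cutoff

r_values <- c(0.05, -0.45, 0.61, 0.90)
strength <- data.frame(
  r         = r_values,
  direction = ifelse(r_values < 0, "negative", "positive"),
  strength  = classify_r(r_values),
  overlap   = flag_overlap(r_values)
)
print(strength)
```

Strength is judged on the absolute value, so r = -0.45 is a moderate negative relationship, not a weak one.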
The session then moves from interpretation to implementation in R. After loading a dataset into a data frame, correlation is computed with a correlation function (cor() in base R). For two variables, the method defaults to Pearson (linear correlation for continuous data), but it can be switched to Spearman (rank-based) when variables are ordinal. In the example, composite scores are created by averaging multiple questionnaire items into constructs such as Vision (four items) and Organizational Performance (five items). The correlation between Vision and Organizational Performance is reported as a positive value around 0.61, which is consistent with a positive direction and a “strong” linear association under the provided thresholds.
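A minimal sketch of the composite-score step, assuming base R only. The item names (V1 to V4, OP1 to OP5), the simulated 1-to-5 responses, and the helper make_item are hypothetical stand-ins for the video's dataset, so the resulting r will not match the reported 0.61.

```r
# Simulated survey, hypothetical and for illustration only
set.seed(42)
n <- 200
latent <- rnorm(n)  # shared factor so the two constructs are related

# Hypothetical helper: one 1-5 Likert-style item driven by the shared factor
make_item <- function(loading) {
  raw <- 3 + loading * latent + rnorm(n, sd = 0.8)
  pmin(pmax(round(raw), 1), 5)
}

items  <- c(paste0("V", 1:4), paste0("OP", 1:5))
survey <- as.data.frame(sapply(items, function(i) make_item(0.7)))

# Composite scores: average the items that belong to each construct
survey$Vision  <- rowMeans(survey[, paste0("V", 1:4)])
survey$OrgPerf <- rowMeans(survey[, paste0("OP", 1:5)])

# Pearson is the default method; Spearman is rank-based
r_pearson  <- cor(survey$Vision, survey$OrgPerf)
r_spearman <- cor(survey$Vision, survey$OrgPerf, method = "spearman")

print(round(c(pearson = r_pearson, spearman = r_spearman), 2))
```

With real data that contain missing responses, rowMeans() accepts na.rm = TRUE and cor() accepts use = "complete.obs". Whether to average over partially answered constructs is a judgment call the source does not address.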
For multiple variables, the workflow expands to correlation matrices. A vector of variable names is assembled (e.g., perceived organizational support items and organizational performance items), a subset data frame is created, and the correlation function is run on it to produce a matrix of pairwise correlations. The matrix can be summarized, formatted, and exported to Excel. When reporting, the transcript emphasizes presenting the most meaningful relationships rather than every pairwise correlation if there are many constructs.
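The matrix workflow might look like the following sketch. The column names are hypothetical, the data are simulated, and the export uses base R's write.csv() to produce a file Excel can open. The source does not say which export function the video uses, so that choice is an assumption.

```r
# Simulated data, hypothetical and for illustration only
set.seed(7)
n <- 150
latent <- rnorm(n)
survey <- data.frame(
  Age  = round(runif(n, 20, 60)),  # unrelated column, left out of the subset
  POS1 = latent + rnorm(n),
  POS2 = latent + rnorm(n),
  POS3 = latent + rnorm(n),
  OP1  = 0.6 * latent + rnorm(n),
  OP2  = 0.6 * latent + rnorm(n)
)

# 1. Assemble the variable names, 2. subset, 3. correlate
vars      <- c("POS1", "POS2", "POS3", "OP1", "OP2")
subset_df <- survey[, vars]
cor_mat   <- cor(subset_df, use = "pairwise.complete.obs")

# Round for reporting
cor_rounded <- round(cor_mat, 2)
print(cor_rounded)

# Export to a CSV file that Excel can open (written to a temporary folder here)
out_file <- file.path(tempdir(), "correlation_matrix.csv")
write.csv(cor_rounded, out_file)
```

The matrix is symmetric with 1s on the diagonal, so a report usually needs only the lower or upper triangle.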
To add statistical significance, the Hmisc package is used to obtain p-values for each correlation in the matrix, enabling reporting beyond effect size alone. Finally, the session shows how to visualize correlation structure using correlation plots (via a dedicated library), producing a heat-map style display where circle size and color intensity reflect the strength and direction of relationships. The overall takeaway is a complete R-based pipeline: build composite variables, compute Pearson or Spearman correlations, interpret r using context-appropriate thresholds, extract p-values, and present results in tables and heat maps for clear research reporting.
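A hedged sketch of the last two steps follows. It assumes that Hmisc's rcorr() is the function used for p-values and that the plotting library is corrplot. The source names neither, so both are assumptions. The data are simulated, and a base R fallback using cor.test() runs if Hmisc is not installed.

```r
# Simulated data, hypothetical and for illustration only
set.seed(7)
n <- 150
latent <- rnorm(n)
df <- data.frame(A = latent + rnorm(n),
                 B = latent + rnorm(n),
                 C = rnorm(n))

if (requireNamespace("Hmisc", quietly = TRUE)) {
  # rcorr() expects a numeric matrix and returns r, n, and P (p-values)
  res   <- Hmisc::rcorr(as.matrix(df), type = "pearson")
  r_mat <- res$r
  p_mat <- res$P  # diagonal is NA
} else {
  # Base R fallback: one cor.test() per pair of variables
  r_mat <- cor(df)
  p_mat <- matrix(NA_real_, ncol(df), ncol(df),
                  dimnames = list(names(df), names(df)))
  for (i in seq_len(ncol(df) - 1)) {
    for (j in (i + 1):ncol(df)) {
      p_mat[i, j] <- p_mat[j, i] <- cor.test(df[[i]], df[[j]])$p.value
    }
  }
}

print(round(r_mat, 2))
print(signif(p_mat, 2))

# Heat-map style plot: circle size and color reflect strength and direction.
# corrplot is an assumed choice of package.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(r_mat, method = "circle", type = "upper")
}
```

Reporting r together with its p-value gives both the effect size and the evidence against a zero correlation. With many constructs, the number of tests grows quickly, which is one more reason to report selectively.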
Cornell Notes
Correlation coefficient r quantifies the direction and strength of a linear relationship between two quantitative variables. The sign of r indicates direction (negative = inverse movement; positive = same-direction movement), while the magnitude indicates strength, with common thresholds ranging from very weak near 0 to very strong above about 0.7. In R, correlations are computed with a correlation function (cor() in base R), using Pearson by default and switching to Spearman for ordinal/rank data. For multiple constructs, correlation matrices summarize all pairwise relationships, and Hmisc can provide p-values for significance. Correlation plots can then visualize the matrix as a heat map, making strong associations easier to spot and report.
How should r be interpreted in terms of direction and strength?
Why does correlation require checking linearity rather than relying on r alone?
When should Pearson vs Spearman correlation be used in R?
How are composite construct scores created before computing correlations?
What is the difference between a two-variable correlation and a correlation matrix, and how is each reported?
How are p-values and visualizations added to correlation results in R?
Review Questions
- What does the sign of r tell you, and how would you classify a correlation of r = -0.45 using the thresholds given?
- In what situation would Spearman correlation be preferred over Pearson, and why?
- If you compute a correlation matrix for many constructs, what reporting strategy helps avoid listing every pairwise result?
Key Points
1. Correlation coefficient r quantifies both direction (sign) and strength (magnitude) of a linear relationship between two quantitative variables.
2. Common interpretation thresholds classify |r| near 0 as very weak and values above ~0.7 as very strong, but context-specific judgment is still required.
3. Correlation in R defaults to Pearson; switching to Spearman is appropriate for ordinal/rank-based variables.
4. Composite variables are typically created by averaging multiple questionnaire items before running correlation analyses.
5. Correlation matrices summarize pairwise correlations across many variables, and results should be reported selectively when the number of constructs is large.
6. Hmisc can provide p-values for correlations, enabling reporting of both effect size and statistical significance.
7. Correlation plots can visualize the correlation matrix as a heat map, helping identify strong relationships quickly.