Statistics for Research - L21 - Correlation Analysis using R

TL;DR

Correlation coefficient r quantifies both direction (sign) and strength (magnitude) of a linear relationship between two quantitative variables.

Briefing

The correlation coefficient (r) is the standard numerical measure of the direction and strength of a linear relationship between two quantitative variables, something a scatter plot can suggest but not quantify. The sign of r indicates direction: a positive value means the variables tend to rise together, while a negative value means one tends to rise as the other falls (an inverse relationship). Because r captures only linear patterns, it should be paired with a scatter plot to confirm that the relationship is approximately a straight line rather than curved.
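For reference, the standard textbook definition of the Pearson sample correlation is given below. The transcript does not spell it out, so the notation is an assumption: n paired observations (x_i, y_i) with sample means \bar{x} and \bar{y}.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
```

The denominator is always positive, so the sign of r comes entirely from the numerator: it is positive when above-average x values tend to pair with above-average y values, and negative when they pair with below-average ones.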

Interpreting r relies on commonly used, though not universal, guidelines. Values near 0 indicate a very weak linear association, while values closer to 1 or -1 indicate a stronger one. The transcript describes one set of thresholds: |r| ≤ 0.1 is very weak; 0.1 to 0.3 is weak; 0.3 to 0.5 is moderate; 0.5 to 0.7 is strong; and above 0.7 is very strong. It also flags a practical warning: correlations above about 0.85 can signal multicollinearity, meaning two constructs are measuring essentially the same underlying concept.
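The thresholds can be turned into a small labelling helper. This is only a sketch: classify_r is a hypothetical name, not a function from any package, and placing each boundary value (0.1, 0.3, 0.5, 0.7) in the lower category is an assumption, since the transcript does not say which side the boundaries belong to.

```r
# Label correlations by the guideline thresholds quoted above.
# classify_r is a hypothetical helper, not part of base R or any package.
classify_r <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1, na.rm = TRUE))
  strength <- cut(
    abs(r),
    breaks = c(0, 0.1, 0.3, 0.5, 0.7, 1),
    labels = c("very weak", "weak", "moderate", "strong", "very strong"),
    include.lowest = TRUE,  # so |r| = 0 falls in the first bin
    right = TRUE            # upper bounds are inclusive (assumption)
  )
  direction <- ifelse(r > 0, "positive", ifelse(r < 0, "negative", "none"))
  data.frame(
    r = r,
    direction = direction,
    strength = as.character(strength),
    stringsAsFactors = FALSE
  )
}

classify_r(c(-0.45, 0.61, 0.05))
```

Because the guidelines are not universal, the breaks vector is the part to adjust when a field uses different cut-offs.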

The session then moves from interpretation to implementation in R. After loading a dataset into a data frame, correlation is computed using the correlation() function. For two variables, the method defaults to Pearson (linear correlation for continuous data), but it can be switched to Spearman (rank-based) when variables are ordinal. In the example, composite scores are created by averaging multiple questionnaire items into constructs such as Vision (four items) and Organizational Performance (five items). The correlation between Vision and Organizational Performance is reported as a positive value around 0.61, which is consistent with a positive direction and a “strong” linear association under the provided thresholds.
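A minimal sketch of the two-variable case follows. It assumes the function the transcript calls correlation() corresponds to base R's cor(), which has the same Pearson default and a method argument. The data are simulated, so the numbers will not match the 0.61 reported in the session, and the variable names are placeholders.

```r
set.seed(21)  # simulated data; not the dataset used in the session
n <- 200
vision   <- rnorm(n, mean = 3.5, sd = 0.6)
org_perf <- 0.6 * vision + rnorm(n, sd = 0.5)

# method = "pearson" is the default, so it can be omitted
r_pearson <- cor(vision, org_perf)

# Rank-based alternative for ordinal variables
r_spearman <- cor(vision, org_perf, method = "spearman")

# cor.test() returns the same estimate together with a p-value
test <- cor.test(vision, org_perf, method = "pearson")

c(pearson = r_pearson, spearman = r_spearman, p_value = test$p.value)
```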

For multiple variables, the workflow expands to correlation matrices. A vector of variable names is assembled (e.g., perceived organizational support items and organizational performance items), a subset data frame is created, and correlation() is run to produce a matrix of pairwise correlations. The matrix can be summarized, formatted, and exported to Excel. When reporting, the transcript emphasizes presenting the most meaningful relationships rather than every pairwise correlation if there are many constructs.
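The matrix workflow might look like the sketch below, again assuming base R's cor(). The construct names and data are simulated placeholders. The export step writes a CSV, which Excel opens directly; the transcript does not say which export function it uses, so this is a stand-in.

```r
set.seed(2)
# Simulated composite scores; the construct names are placeholders
df <- data.frame(
  POS     = rnorm(100),
  Vision  = rnorm(100),
  OrgPerf = rnorm(100),
  Tenure  = rnorm(100)   # extra column that is left out of the matrix
)

vars      <- c("POS", "Vision", "OrgPerf")   # vector of variable names
subset_df <- df[, vars]                      # subset data frame

# Pairwise deletion keeps as many complete pairs as possible when data are missing
cor_mat     <- cor(subset_df, use = "pairwise.complete.obs")
cor_rounded <- round(cor_mat, 2)             # two decimals for reporting

# Write to a temporary CSV file that Excel can open
out_file <- file.path(tempdir(), "correlation_matrix.csv")
write.csv(cor_rounded, out_file)

cor_rounded
```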

To add statistical significance, the Hmisc package is used to obtain p-values for each correlation in the matrix, enabling reporting beyond effect size alone. Finally, the session shows how to visualize correlation structure using correlation plots (via a dedicated library), producing a heat-map style display where circle size and color intensity reflect the strength and direction of relationships. The overall takeaway is a complete R-based pipeline: build composite variables, compute Pearson or Spearman correlations, interpret r using context-appropriate thresholds, extract p-values, and present results in tables and heat maps for clear research reporting.
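A sketch of the p-value step, on simulated data. Hmisc::rcorr() takes a matrix and returns a list with components r, n, and P. The fallback branch is an addition for machines without Hmisc installed and is not something the transcript shows.

```r
set.seed(3)
df <- data.frame(a = rnorm(80), b = rnorm(80), c = rnorm(80))  # simulated data

if (requireNamespace("Hmisc", quietly = TRUE)) {
  # rcorr() expects a matrix, not a data frame
  res   <- Hmisc::rcorr(as.matrix(df), type = "pearson")
  r_mat <- res$r
  p_mat <- res$P   # p-values; the diagonal is NA
} else {
  # Fallback without Hmisc: one cor.test() per pair of variables
  vars  <- names(df)
  r_mat <- cor(df)
  p_mat <- matrix(NA_real_, length(vars), length(vars),
                  dimnames = list(vars, vars))
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i != j) p_mat[i, j] <- cor.test(df[[i]], df[[j]])$p.value
    }
  }
}

round(r_mat, 2)
round(p_mat, 3)
```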

Cornell Notes

Correlation coefficient r quantifies the direction and strength of a linear relationship between two quantitative variables. The sign of r indicates direction (negative = inverse movement; positive = same-direction movement), while the magnitude indicates strength, with common thresholds ranging from very weak near 0 to very strong above about 0.7. In R, correlations are computed with correlation(), using Pearson by default and switching to Spearman for ordinal/rank data. For multiple constructs, correlation matrices summarize all pairwise relationships, and the Hmisc package can provide p-values for significance. Correlation plots can then visualize the matrix as a heat map, making strong associations easier to spot and report.

How should r be interpreted in terms of direction and strength?

r’s sign shows direction: r < 0 means one variable increases while the other decreases (inverse relationship), while r > 0 means they move together (positive relationship). Strength is based on |r| using common guidelines mentioned: ≤ 0.1 very weak; 0.1–0.3 weak; 0.3–0.5 moderate; 0.5–0.7 strong; > 0.7 very strong. The transcript also notes these are guidelines, not strict rules, and context matters.

Why does correlation require checking linearity rather than relying on r alone?

Correlation measures only the linear component of a relationship. The value of r alone cannot show whether the pattern is linear: a strongly curved relationship can produce an r near 0, and a sizeable r can still come from a curved pattern. The transcript stresses using a scatter plot to verify that points follow an approximately straight-line pattern before trusting r as a linear-association measure.
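A small constructed example (not from the transcript) shows why the scatter plot matters: y depends perfectly on x in both cases, but only the straight-line case is picked up by r.

```r
x       <- -10:10
y_curve <- x^2         # perfect but U-shaped dependence on x
y_line  <- 2 * x + 1   # perfect straight-line dependence on x

r_curve <- cor(x, y_curve)  # essentially 0: r misses the curved pattern
r_line  <- cor(x, y_line)   # 1, up to floating-point rounding

c(curve = r_curve, line = r_line)

# Plot before trusting r as a summary
plot(x, y_curve, main = "Curved relationship with r near 0")
```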

When should Pearson vs Spearman correlation be used in R?

Pearson is the default in correlation() and is appropriate for continuous variables with linear relationships. Spearman is recommended when variables are ordinal (rank-based association). The transcript explicitly mentions switching the method to Spearman when ordinal variables are involved.
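To make "rank-based" concrete: Spearman's coefficient is Pearson's r computed on the ranks of the data, with ties given average ranks. The sketch below checks this on simulated 5-point responses. It assumes base R's cor(), and the variable names are placeholders.

```r
set.seed(4)
# Simulated 5-point ordinal responses; variable names are placeholders
satisfaction <- sample(1:5, 60, replace = TRUE)
commitment   <- pmin(5, pmax(1, satisfaction + sample(-1:1, 60, replace = TRUE)))

rho <- cor(satisfaction, commitment, method = "spearman")

# Same quantity: Pearson's r applied to the ranks (rank() averages ties by default)
rho_by_ranks <- cor(rank(satisfaction), rank(commitment), method = "pearson")

all.equal(rho, rho_by_ranks)
```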

How are composite construct scores created before computing correlations?

Constructs like Vision and Organizational Performance are formed by averaging multiple questionnaire items. In the example, Vision is defined by four items and Organizational Performance by five items; each respondent’s composite score is computed as the mean of those items. The resulting composite variables are then used as inputs to correlation calculations.
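The averaging step could be sketched with rowMeans(). The item names (V1..V4, OP1..OP5) and the simulated 1-5 responses are hypothetical, since the transcript does not list the actual column names.

```r
set.seed(1)
# Hypothetical item names on a 1-5 scale; responses are simulated
items <- as.data.frame(matrix(sample(1:5, 50 * 9, replace = TRUE), nrow = 50))
names(items) <- c(paste0("V", 1:4), paste0("OP", 1:5))

vision_items <- paste0("V", 1:4)    # four Vision items
op_items     <- paste0("OP", 1:5)   # five Organizational Performance items

# Row-wise mean per respondent; na.rm = TRUE averages over the answered items only
items$Vision  <- rowMeans(items[, vision_items], na.rm = TRUE)
items$OrgPerf <- rowMeans(items[, op_items], na.rm = TRUE)

head(items[, c("Vision", "OrgPerf")])
```

Using na.rm = TRUE is a design choice rather than something the transcript specifies. Dropping it makes the composite NA for any respondent who skipped an item.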

What is the difference between a two-variable correlation and a correlation matrix, and how is each reported?

A two-variable correlation reports a single r value (e.g., the transcript cites a positive correlation around 0.61 between Vision and Organizational Performance). A correlation matrix reports pairwise correlations across multiple variables, producing a table of r values. When there are many variables, the transcript recommends reporting the most highly significant relationships individually and summarizing the rest as moderately significant or not significant.
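One way to pick out the relationships worth reporting is to flatten the matrix into a table with one row per pair, then sort it. The sketch uses simulated data. Ranking by |r| is an assumption made here for illustration; the p-values could be used to rank instead.

```r
set.seed(5)
# Simulated constructs with built-in strong, moderate, and absent relationships
df   <- data.frame(A = rnorm(100))
df$B <- 0.8 * df$A + rnorm(100, sd = 0.5)
df$C <- rnorm(100)
df$D <- -0.5 * df$A + rnorm(100)

cor_mat <- cor(df)

# The upper triangle holds each pair exactly once
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx],
  stringsAsFactors = FALSE
)

# Strongest relationships first, regardless of sign
pair_table <- pair_table[order(-abs(pair_table$r)), ]
head(pair_table, 3)
```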

How are p-values and visualizations added to correlation results in R?

To obtain p-values for correlations in a matrix, the Hmisc package is used, producing significance values for each pairwise correlation (accessible via the resulting object’s p-value components). For visualization, a correlation plot function (from an appropriate library) can generate a heat-map style display of the correlation matrix, where circle size/intensity reflects correlation strength and direction; the transcript also notes the need to load the library before calling the function.
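The transcript does not name the plotting library. The sketch below assumes the corrplot package, whose corrplot() function draws this kind of circle display, and falls back to a plain base-graphics grid if the package is not installed. The data are simulated.

```r
set.seed(6)
# Simulated data with one positive and one negative relationship
df   <- data.frame(A = rnorm(100))
df$B <- 0.7 * df$A + rnorm(100, sd = 0.7)
df$C <- -0.6 * df$A + rnorm(100, sd = 0.8)
cor_mat <- cor(df)

if (requireNamespace("corrplot", quietly = TRUE)) {
  # Circle size and colour intensity track |r|; colour hue tracks the sign
  corrplot::corrplot(cor_mat, method = "circle", type = "upper")
} else {
  # Base-graphics fallback: a coloured grid of the same matrix
  k <- ncol(cor_mat)
  image(seq_len(k), seq_len(k), t(cor_mat), zlim = c(-1, 1),
        axes = FALSE, xlab = "", ylab = "")
  axis(1, at = seq_len(k), labels = colnames(cor_mat))
  axis(2, at = seq_len(k), labels = rownames(cor_mat))
}
```

Calling the function as corrplot::corrplot() avoids a separate library(corrplot) line. With library() loaded first, as the transcript describes, the call can be written as plain corrplot(...).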

Review Questions

What does the sign of r tell you, and how would you classify a correlation of r = -0.45 using the thresholds given?
In what situation would Spearman correlation be preferred over Pearson, and why?
If you compute a correlation matrix for many constructs, what reporting strategy helps avoid listing every pairwise result?

Key Points

1. Correlation coefficient r quantifies both direction (sign) and strength (magnitude) of a linear relationship between two quantitative variables.
2. Common interpretation thresholds classify |r| near 0 as very weak and values above ~0.7 as very strong, but context-specific judgment is still required.
3. Correlation in R defaults to Pearson; switching to Spearman is appropriate for ordinal/rank-based variables.
4. Composite variables are typically created by averaging multiple questionnaire items before running correlation analyses.
5. Correlation matrices summarize pairwise correlations across many variables, and results should be reported selectively when the number of constructs is large.
6. The Hmisc package can provide p-values for correlations, enabling reporting of both effect size and statistical significance.
7. Correlation plots can visualize the correlation matrix as a heat map, helping identify strong relationships quickly.

Highlights

A negative r indicates an inverse relationship: as one variable increases, the other decreases.

r is meaningful for linear patterns only; scatter plots are needed to confirm linearity.

Pearson correlation is the default in R, while Spearman is used for ordinal variables.

Correlation matrices plus Hmisc p-values support reporting both strength and significance across multiple constructs.

Heat-map style correlation plots make strong positive/negative relationships stand out visually.

Topics

Correlation Coefficient
Pearson vs Spearman
Composite Scores
Correlation Matrix
P-Values
Heat Map Visualization