
01. SPSS Classroom Lectures| Basic Statistical Concepts (P1) | Reliability and Validity

Research With Fawad · 5 min read

Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Reliability is consistency of results under similar conditions; validity is whether the instrument measures the intended construct.

Briefing

Reliability and validity are the two core quality checks behind any questionnaire or measurement tool—reliability for consistency, validity for accuracy. A measurement is reliable when it produces the same numeric results repeatedly under similar conditions with the same subjects. Validity, by contrast, requires that the instrument actually measures the concept it claims to measure. The distinction matters because consistency alone can be misleading: an instrument can give stable results while still targeting the wrong construct.

A classic reliability-versus-validity example illustrates the risk. A wall clock that always shows 6:00 when someone enters the room is consistent, but it is not valid for telling the correct time. Likewise, a test that repeatedly returns the same score for a child's recall of the previous day's activities may be reliable, yet someone might incorrectly claim it measures the child's IQ. A related analogy compares constructs: using a job satisfaction scale to measure job commitment may yield consistent scores, but those scores could reflect something else entirely (such as memory or another unrelated trait) rather than the intended concept.

Reliability can be assessed through the test–retest approach: administer the same instrument to the same people twice under similar conditions, then correlate the two sets of results. Higher correlation indicates greater consistency. In practice, test–retest is difficult because repeated participation is hard to arrange, and subjects may no longer respond neutrally after the first exposure. The example of repeatedly taking a GMAT test highlights how familiarity can introduce bias, undermining the attempt to measure pure consistency.
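To make the test–retest idea concrete, the correlation between two administrations can be computed directly. The sketch below uses invented total scores for eight respondents measured twice; the data and sample size are assumptions for illustration, not from the lecture.

```python
# Minimal test-retest reliability sketch with hypothetical data.
import numpy as np
from scipy.stats import pearsonr

# Hypothetical total scores for 8 respondents at time 1 and time 2.
time1 = np.array([34, 41, 28, 45, 39, 30, 37, 42])
time2 = np.array([36, 40, 27, 44, 41, 29, 36, 43])

r, p_value = pearsonr(time1, time2)
print(f"test-retest correlation r = {r:.3f} (p = {p_value:.4f})")
# An r close to 1 suggests the instrument gives consistent results across
# the two administrations; a low r signals poor stability.
```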

Because test–retest is often impractical, researchers use other reliability techniques. For categorical data, Cohen’s Kappa coefficient is commonly used. For internal consistency across items in a scale, Cronbach’s Alpha is widely applied. In modern survey research, construct reliability is frequently assessed using composite reliability, which fits naturally within confirmatory factor analysis.
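As a hedged illustration of those two statistics, the sketch below computes Cronbach's Alpha from scratch for a hypothetical item matrix and obtains Cohen's Kappa from scikit-learn for two raters' categorical codes; all values are invented for the example.

```python
# Cronbach's alpha (internal consistency) and Cohen's kappa (categorical
# agreement), both on made-up data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scale scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Five respondents answering a 4-item Likert scale (hypothetical data).
scale = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
])
print(f"Cronbach's alpha = {cronbach_alpha(scale):.3f}")

# Two raters assigning the same categorical codes to six cases.
rater_a = ["yes", "no", "yes", "yes", "no", "no"]
rater_b = ["yes", "no", "yes", "no", "no", "no"]
print(f"Cohen's kappa = {cohen_kappa_score(rater_a, rater_b):.3f}")
```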

Validity is assessed by checking how accurately the measurement aligns with the underlying trait it is meant to represent. For job satisfaction, the underlying trait might be reflected through multiple indicators—salary, environment, co-workers, and job security—so validity asks whether those indicators truly capture job satisfaction rather than something adjacent. Face validity is the first check: the instrument should appear, to experts and respondents, to measure the intended concept. Beyond that, predictive validity tests whether the measure forecasts related outcomes (e.g., GMAT performance predicting MBA performance). Content validity checks whether the instrument covers the full intended domain of the construct; measuring only reading skills when the construct includes reading, writing, and listening would miss major parts of the domain.
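Predictive validity, in particular, lends itself to a simple statistical check: relate the measure to the outcome it is supposed to forecast. A minimal sketch, assuming invented GMAT scores and later MBA GPAs, regresses the outcome on the predictor.

```python
# Illustrative predictive-validity check on hypothetical data: do GMAT
# scores forecast later MBA GPA?
import numpy as np
from scipy.stats import linregress

gmat = np.array([650, 700, 590, 720, 680, 610, 640, 690])
mba_gpa = np.array([3.4, 3.7, 3.0, 3.8, 3.5, 3.1, 3.3, 3.6])

result = linregress(gmat, mba_gpa)
print(f"slope = {result.slope:.4f}, r = {result.rvalue:.3f}, "
      f"p = {result.pvalue:.4f}")
# A substantial positive slope and correlation would support the claim
# that the test predicts the outcome it is meant to predict.
```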

Construct validity, grounded in theory, looks for expected patterns of relationships among variables. It includes convergent validity—items intended to measure the same construct should correlate strongly—and discriminant validity—different constructs should relate differently, not collapse into one another. Statistical tools such as average variance extracted and the Fornell–Larcker criterion (or related methods) are used to support these claims. Reliability and validity together determine whether a measurement is both stable and meaningful for research conclusions.
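A minimal sketch of these construct-level checks, assuming standardized loadings from a confirmatory factor analysis are already in hand (all loadings and the inter-construct correlation below are invented): average variance extracted and composite reliability are computed per construct, then the Fornell–Larcker comparison follows.

```python
# Convergent/discriminant validity sketch from hypothetical CFA loadings.
import numpy as np

def ave(loadings: np.ndarray) -> float:
    """Average variance extracted: mean of squared standardized loadings."""
    return float((loadings ** 2).mean())

def composite_reliability(loadings: np.ndarray) -> float:
    """CR = (sum of loadings)^2 / ((sum)^2 + sum of error variances)."""
    s = loadings.sum()
    errors = (1 - loadings ** 2).sum()
    return float(s ** 2 / (s ** 2 + errors))

satisfaction = np.array([0.78, 0.82, 0.74, 0.80])  # invented loadings
commitment = np.array([0.71, 0.76, 0.69])

ave_sat, ave_com = ave(satisfaction), ave(commitment)
print(f"AVE(satisfaction) = {ave_sat:.3f}, "
      f"CR = {composite_reliability(satisfaction):.3f}")
print(f"AVE(commitment)   = {ave_com:.3f}, "
      f"CR = {composite_reliability(commitment):.3f}")

# Fornell-Larcker: sqrt(AVE) of each construct should exceed its
# correlation with every other construct (one invented correlation here).
r_sat_com = 0.52
print("discriminant validity holds:",
      np.sqrt(ave_sat) > r_sat_com and np.sqrt(ave_com) > r_sat_com)
```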

Cornell Notes

Reliability and validity are the two main quality standards for measurement instruments. Reliability means consistency: repeating the same test on the same subjects under similar conditions should yield similar results. Validity means accuracy: the instrument must measure the intended construct, not just produce stable numbers. Reliability is often assessed with test–retest correlations, though this can be hard due to non-neutral responses after repeated testing. Internal consistency measures like Cronbach’s Alpha and construct reliability via composite reliability are common alternatives. Validity is evaluated through face validity, predictive validity, content validity, and construct validity, including convergent and discriminant validity using theory-driven expected relationships and statistical checks.

Why can a measurement be reliable without being valid?

Reliability requires consistency, not correctness. An instrument can return the same numeric value every time while still measuring the wrong construct. For example, a wall clock that always shows 6:00 is consistent (reliable) but not necessarily accurate for telling the correct time (not valid). Similarly, a test that consistently measures a child's recall of the previous day's activities could be reliable, yet it would not be valid if someone claims it measures IQ.

How does the test–retest method assess reliability, and why is it difficult in practice?

Test–retest administers the same instrument to the same subjects twice under similar conditions, then correlates the two results. Higher correlation indicates greater consistency. It is difficult because researchers may not be able to re-contact the same respondents, and participants may become biased or less neutral after the first exposure—familiarity can change how they respond (e.g., repeatedly taking a GMAT test).

What do Cronbach’s Alpha and Cohen’s Kappa measure in reliability assessment?

Cohen’s Kappa coefficient is used for reliability with categorical data. Cronbach’s Alpha is used for internal reliability of a set of questions in a scale, checking whether items within the same instrument move together consistently.

How do face validity, predictive validity, and content validity differ?

Face validity checks whether the instrument appears to measure the intended concept, using qualitative judgment from experts and actual subjects. Predictive validity tests whether the measure forecasts related outcomes (e.g., GMAT performance predicting MBA performance). Content validity checks whether the instrument covers the entire intended domain of the construct; for instance, assessing only reading skills would lack content validity if English language skills also include writing and listening.

What is the difference between convergent and discriminant validity within construct validity?

Convergent validity examines whether items meant to measure the same construct are interrelated—for example, multiple job satisfaction items should correlate with each other. Discriminant validity checks that different constructs relate differently rather than collapsing into the same pattern; it tests whether measures of distinct constructs are empirically distinguishable. Statistical approaches mentioned include average variance extracted for convergent validity and the Fornell–Larcker criterion (or related methods) for discriminant validity.

Review Questions

  1. Give one example of how an instrument could be reliable but not valid, and explain the difference between consistency and accuracy.
  2. Describe how test–retest reliability works and list two practical problems that can weaken it.
  3. Match each validity type (face, predictive, content, construct) to what it tests and provide a brief example for one of them.

Key Points

  1. Reliability is consistency of results under similar conditions; validity is whether the instrument measures the intended construct.
  2. Stable scores do not guarantee correctness—an instrument can be reliable while measuring the wrong trait.
  3. Test–retest reliability uses two administrations and correlates results, but repeated testing can be impractical and can introduce bias.
  4. Cohen's Kappa supports reliability for categorical data, while Cronbach's Alpha supports internal consistency across scale items.
  5. Composite reliability is commonly used for construct reliability in survey research and aligns with confirmatory factor analysis.
  6. Validity is assessed through face validity, predictive validity, content validity, and construct validity, including convergent and discriminant validity.
  7. Construct validity relies on theory-driven expected relationships among variables and uses statistical checks such as average variance extracted and the Fornell–Larcker criterion.

Highlights

Reliability answers “Do we get the same numbers again?” while validity answers “Do we measure the right thing?”
A clock can be perfectly consistent and still fail validity if it doesn’t represent the correct time.
Test–retest correlations can be undermined because repeated exposure can change how respondents answer.
Content validity fails when the instrument covers only part of the construct’s domain (e.g., reading-only for a broader language-skills definition).
Convergent validity expects strong relationships among items for the same construct, while discriminant validity expects separation between different constructs.
