
Webinar - Scale Development and Validation: A thorough guide on how to develop and validate a scale

Research With Fawad · 5 min read

Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Start scale development with an operational definition that specifies how the construct will be measured in the study context; it governs item selection, dimension structure, and naming.

Briefing

Scale development is the process of turning an abstract, unobservable concept (a latent construct) into a measurable questionnaire score—then proving that the items work together reliably and validly. Because most quantitative research depends on measurement, researchers often start with existing scales, but when a scale doesn’t fit a specific context (like higher education), a new scale must be built from scratch. A scale is typically a standardized set of self-report items whose responses are summed (or otherwise combined) to represent the underlying construct—such as “university social responsibility” or “servant leadership”—through multiple dimensions.

The foundation comes first: an operational definition. Before writing or selecting items, researchers must define what the construct means in their specific study context and how it will be operationalized. That definition then guides every later decision—what items are included, how they are grouped into dimensions, and even how the dimensions are named. Without a clear operational definition, the entire scale development effort risks becoming misaligned with what the construct is supposed to measure.

Next comes item generation and refinement. A common approach begins with a literature search to collect existing items from prior scales, then expands the item pool using stakeholder input such as interviews, focus groups, or expert discussions. Items are then reviewed by an expert panel—often combining academics and practitioners—to judge whether each statement is relevant, clear, and truly measures the intended construct. Researchers may also use “validation items” (e.g., reverse-phrased or negating items) to detect careless responding. Through iterative revisions and pilot testing, the item pool is reduced to a workable set that still represents the construct’s dimensions.

Psychometric testing follows, starting with exploratory factor analysis (EFA). EFA acts as a data-reduction and structure-checking tool: it tests whether the items actually cluster together in the way the researcher hypothesized. Key diagnostics include KMO (sampling adequacy), Bartlett’s test (whether correlations are sufficient for factor analysis), eigenvalues (often retaining factors with values greater than 1), and factor loadings (how strongly each item relates to a factor). Items with weak loadings or cross-loadings—where an item loads on multiple factors—are removed or revised. The goal is a coherent factor structure that matches the conceptual model.
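The eigenvalue and Bartlett's-test diagnostics described above can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, not the video's dataset: the two-factor item structure, sample size, and the eigenvalue-greater-than-1 (Kaiser) rule are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate responses to 6 items forming two correlated clusters of 3
# (hypothetical data standing in for a pilot sample).
n = 300
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
items = np.column_stack(
    [f1 + rng.normal(scale=0.6, size=n) for _ in range(3)]
    + [f2 + rng.normal(scale=0.6, size=n) for _ in range(3)]
)

R = np.corrcoef(items, rowvar=False)  # item correlation matrix
p = R.shape[0]

# Kaiser criterion: retain factors with eigenvalue > 1.
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
n_factors = int(np.sum(eigenvalues > 1))

# Bartlett's test of sphericity: chi-square statistic and degrees of freedom
# (large chi-square relative to df suggests correlations suffice for EFA).
chi_sq = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
df = p * (p - 1) // 2

print(n_factors, round(chi_sq, 1), df)
```

With this simulated structure, the Kaiser rule recovers the two intended factors; in practice, tools like SPSS or jamovi report these diagnostics (plus KMO) directly.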

Confirmatory factor analysis (CFA) then tests whether the proposed measurement model fits the data. CFA evaluates factor loadings and model fit indices such as RMSEA and SRMR (with thresholds discussed), and it may use modification indices to improve fit—though only within constraints like correlating residuals for conceptually similar indicators. After the structure is accepted, reliability and validity are established. Reliability is assessed using composite reliability (and related measures), while convergent validity is checked via Average Variance Extracted (AVE). Discriminant validity is tested using the Fornell–Larcker criterion, comparing the square root of AVE for each construct against its correlations with other constructs.

A practical example in the transcript illustrates the workflow: starting from a large literature-derived item pool, adding interview-based items, using expert marking to cut down the number of items, piloting, then running EFA to remove problematic items until a stable seven-factor structure emerges. The process ends with CFA, followed by composite reliability, AVE, and discriminant validity checks—producing a scale that is both statistically defensible and conceptually grounded for its target context.

Cornell Notes

Scale development turns latent constructs into measurable scores by building a questionnaire with items that reflect an operational definition of the construct. The process starts with defining the construct in the study context, generating an item pool from literature and stakeholder input, and using expert review to ensure relevance and clarity. Pilot data then feed into exploratory factor analysis (EFA) to test whether items cluster into the intended dimensions, removing weak or cross-loading items. Confirmatory factor analysis (CFA) checks whether the proposed measurement model fits the data. Finally, reliability and validity are established using composite reliability, AVE for convergent validity, and the Fornell–Larcker criterion for discriminant validity.

Why does an operational definition come before item writing or item selection?

An operational definition specifies how the construct will be used and measured in the study context. It determines which items are appropriate, how items are grouped into dimensions, and how dimensions are named. The transcript emphasizes that without this alignment, later steps (item inclusion, dimension structure, and scoring) can drift away from what the construct is meant to represent, making the scale development effort ineffective.

How does EFA differ from CFA in scale development?

EFA is used to explore and verify the dimensional structure when the researcher’s grouping may be wrong. It tests whether items cluster together based on intercorrelations, using diagnostics like KMO, Bartlett’s test, eigenvalues (e.g., >1), and factor loadings. CFA comes after EFA and tests whether the hypothesized measurement model fits the data, using model fit indices (e.g., RMSEA, SRMR, and others mentioned) and examining standardized loadings and residuals.

What practical problems in EFA lead to item removal?

Items are removed when they show weak factor loadings (the transcript uses a rule of thumb around 0.5) or when they cross-load—loading on more than one factor. The transcript also describes iterative troubleshooting: sometimes items are removed one by one because removing one item can change the loading pattern of others. Another issue is when a factor ends up with too few items (the example notes keeping at least three items per factor/dimension).
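The weak-loading and cross-loading rules can be expressed as a simple screening pass over an EFA loading matrix. The loadings below are hypothetical, and the 0.5 (weak) and 0.4 (salient) cutoffs are common rules of thumb rather than fixed standards; studies vary in the exact thresholds they apply.

```python
import numpy as np

# Hypothetical EFA loading matrix: 5 items x 2 factors.
loadings = np.array([
    [0.78, 0.10],   # clean loading on factor 1
    [0.72, 0.05],   # clean loading on factor 1
    [0.41, 0.22],   # weak: highest loading below 0.5
    [0.55, 0.48],   # cross-loading: salient on both factors
    [0.08, 0.81],   # clean loading on factor 2
])

WEAK, SALIENT = 0.5, 0.4  # assumed rule-of-thumb cutoffs

flags = []
for i, row in enumerate(loadings):
    a = np.abs(row)
    if a.max() < WEAK:
        flags.append((i, "weak"))
    elif np.sum(a >= SALIENT) > 1:
        flags.append((i, "cross-loading"))

print(flags)  # items 2 and 3 are flagged for removal or revision
```

Consistent with the transcript's advice, flagged items would be removed one at a time and the EFA re-run, since dropping one item can change the loading pattern of the rest.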

How are reliability and validity established after the factor structure is accepted?

Reliability is assessed using composite reliability (and related approaches). Convergent validity is checked using Average Variance Extracted (AVE), calculated from CFA standardized loadings; AVE values above 0.5 are treated as evidence of convergent validity. Discriminant validity is tested with the Fornell–Larcker criterion: the square root of AVE for each construct should be higher than that construct’s correlations with other constructs.
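These quantities are straightforward to compute from standardized CFA loadings. The sketch below uses hypothetical loadings for two illustrative constructs and an assumed inter-construct correlation of 0.55; none of the numbers come from the video.

```python
import numpy as np

# Hypothetical standardized CFA loadings: two constructs, 3 indicators each.
loadings = {
    "USR": np.array([0.72, 0.80, 0.68]),  # e.g., university social responsibility
    "SL":  np.array([0.75, 0.70, 0.78]),  # e.g., servant leadership
}

def composite_reliability(lam):
    # CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)
    s = lam.sum() ** 2
    return s / (s + np.sum(1 - lam ** 2))

def ave(lam):
    # AVE = mean of squared standardized loadings; > 0.5 supports convergent validity
    return float(np.mean(lam ** 2))

# Fornell-Larcker: sqrt(AVE) of each construct must exceed its correlation
# with every other construct (0.55 here is an assumed inter-construct r).
r_between = 0.55
for name, lam in loadings.items():
    cr, v = composite_reliability(lam), ave(lam)
    print(name, round(cr, 3), round(v, 3), np.sqrt(v) > r_between)
```

With these assumed loadings, both constructs clear the conventional CR > 0.7 and AVE > 0.5 benchmarks, and each square root of AVE exceeds the inter-construct correlation, satisfying Fornell-Larcker.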

What role do expert panels and stakeholder interviews play beyond literature review?

Literature provides an initial item pool, but stakeholder input expands and contextualizes the item set. The transcript describes interviewing academics and students (or practitioners) to capture what “university social responsibility” means in that setting. Expert panels then evaluate each statement for whether it measures the intended latent construct and whether items and their categorization are clear and appropriate, often driving multiple revision cycles.

What does the transcript suggest about using modification indices in CFA?

Modification indices can flag where model fit can improve, often by suggesting correlated residuals between specific items. The transcript cautions that correlations should be conceptually appropriate—e.g., correlating residuals between indicators within a similar construct—rather than correlating items across unrelated dimensions (it gives an example of not correlating an item from one construct with an item from another).

Review Questions

  1. What steps in scale development depend most directly on the operational definition, and what goes wrong when it is unclear?
  2. In an EFA output, how would you decide whether an item should be removed due to weak loading versus cross-loading?
  3. Using the Fornell–Larcker criterion, what comparison must be true for discriminant validity to be supported?

Key Points

  1. Start scale development with an operational definition that specifies how the construct will be measured in the study context; it governs item selection, dimension structure, and naming.
  2. Build an item pool from existing literature scales, then expand it with stakeholder interviews or focus groups to ensure contextual relevance.
  3. Use an expert panel to judge each item's relevance and clarity and to validate item categorization into dimensions; expect multiple revision cycles.
  4. Run EFA on pilot data to test whether items cluster into the intended factors; remove items with weak loadings or cross-loadings and troubleshoot iteratively.
  5. Use CFA to confirm the proposed measurement model and assess fit using indices such as RMSEA and SRMR; apply modification indices only within conceptually defensible constraints.
  6. Establish reliability with composite reliability, and establish validity with AVE (convergent validity) and the Fornell–Larcker criterion (discriminant validity).
  7. Track item reduction from the initial literature pool through expert marking and pilot testing so the final scale remains defensible and aligned with the conceptual model.

Highlights

A scale is built by measuring latent constructs through multiple items that form dimensions; the operational definition determines what those items must represent.
EFA is a structure-checking and data-reduction step: weak loadings and cross-loadings are practical signals to revise or remove items.
CFA shifts from exploration to confirmation—testing whether the hypothesized factor model fits the data and whether residual patterns are acceptable.
Reliability and validity are not optional add-ons: composite reliability, AVE, and Fornell–Larcker discriminant validity checks are used to justify the final scale.
The workflow is iterative: literature → stakeholder input → expert review → pilot EFA → item refinement → main CFA → reliability/validity reporting.

Mentioned

  • EFA
  • CFA
  • KMO
  • SRMR
  • RMSEA
  • AVE
  • SPSS
  • AMOS
  • jamovi
  • Likert scale