Webinar - Scale Development and Validation: A thorough guide on how to develop and validate a scale
Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Scale development is the process of turning an abstract, unobservable concept (a latent construct) into a measurable questionnaire score—then proving that the items work together reliably and validly. Because most quantitative research depends on measurement, researchers often start with existing scales, but when a scale doesn’t fit a specific context (like higher education), a new scale must be built from scratch. A scale is typically a standardized set of self-report items whose responses are summed (or otherwise combined) to represent the underlying construct—such as “university social responsibility” or “servant leadership”—through multiple dimensions.
The foundation comes first: an operational definition. Before writing or selecting items, researchers must define what the construct means in their specific study context and how it will be operationalized. That definition then guides every later decision—what items are included, how they are grouped into dimensions, and even how the dimensions are named. Without a clear operational definition, the entire scale development effort risks becoming misaligned with what the construct is supposed to measure.
Next comes item generation and refinement. A common approach begins with a literature search to collect existing items from prior scales, then expands the item pool using stakeholder input such as interviews, focus groups, or expert discussions. Items are then reviewed by an expert panel—often combining academics and practitioners—to judge whether each statement is relevant, clear, and truly measures the intended construct. Researchers may also use “validation items” (e.g., reverse-phrased or negating items) to detect careless responding. Through iterative revisions and pilot testing, the item pool is reduced to a workable set that still represents the construct’s dimensions.
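To make the validation-item idea concrete, here is a minimal pandas sketch (not taken from the transcript) that reverse-scores a negatively phrased item and flags respondents whose answers to a matched positive/negative item pair are implausibly far apart. The column names, the 5-point scale, and the 2-point flagging threshold are illustrative assumptions.

```python
import pandas as pd

# Illustrative pilot responses on a 1-5 Likert scale (column names are hypothetical).
# "usr_pos" and "usr_neg" state the same idea in positive and negated form,
# so an attentive respondent should answer them in opposite directions.
df = pd.DataFrame({
    "usr_pos": [5, 4, 2, 5, 3],  # e.g., "My university supports the local community."
    "usr_neg": [1, 2, 4, 5, 3],  # e.g., "My university does not support the local community."
})

SCALE_MAX = 5  # top of the Likert scale

# Reverse-score the negated item so both columns point in the same direction.
df["usr_neg_rev"] = (SCALE_MAX + 1) - df["usr_neg"]

# Flag respondents whose paired answers disagree by more than 2 scale points;
# large gaps suggest careless or inconsistent responding (threshold is an assumption).
df["careless_flag"] = (df["usr_pos"] - df["usr_neg_rev"]).abs() > 2

print(df)
```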
Psychometric testing follows, starting with exploratory factor analysis (EFA). EFA acts as a data-reduction and structure-checking tool: it tests whether the items actually cluster together in the way the researcher hypothesized. Key diagnostics include KMO (sampling adequacy), Bartlett’s test (whether correlations are sufficient for factor analysis), eigenvalues (often retaining factors with values greater than 1), and factor loadings (how strongly each item relates to a factor). Items with weak loadings or cross-loadings—where an item loads on multiple factors—are removed or revised. The goal is a coherent factor structure that matches the conceptual model.
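As a hedged illustration of those diagnostics, the sketch below runs an EFA with the Python factor_analyzer package on a pilot-response table. The file name, the seven-factor target, and the .40/.30 loading cutoffs are assumptions for demonstration, not values taken from the webinar.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Pilot responses, one column per questionnaire item (file name is hypothetical).
items = pd.read_csv("pilot_items.csv")

# Sampling adequacy and sphericity checks before factoring.
chi_square, p_value = calculate_bartlett_sphericity(items)  # want p < .05
kmo_per_item, kmo_overall = calculate_kmo(items)            # want overall KMO around .60 or higher
print(f"Bartlett p = {p_value:.4f}, overall KMO = {kmo_overall:.2f}")

# Extract factors with an oblique rotation, since dimensions of one construct usually correlate.
fa = FactorAnalyzer(n_factors=7, rotation="promax")
fa.fit(items)

eigenvalues, _ = fa.get_eigenvalues()
print("Factors with eigenvalue > 1:", int((eigenvalues > 1).sum()))

# Flag weak and cross-loading items (.40 primary / .30 secondary cutoffs are assumptions).
loadings = pd.DataFrame(fa.loadings_, index=items.columns)
for item, row in loadings.iterrows():
    ranked = row.abs().sort_values(ascending=False)
    if ranked.iloc[0] < 0.40:
        print(f"{item}: weak loading ({ranked.iloc[0]:.2f}) -> candidate for removal")
    elif ranked.iloc[1] > 0.30:
        print(f"{item}: cross-loads on multiple factors -> candidate for removal or revision")
```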
Confirmatory factor analysis (CFA) then tests whether the proposed measurement model fits the data. CFA evaluates factor loadings and model fit indices such as RMSEA and SRMR (both commonly judged against cutoffs of roughly .08 or lower), and it may use modification indices to improve fit, though only within conceptually defensible constraints such as correlating residuals for similar indicators. After the structure is accepted, reliability and validity are established. Reliability is assessed using composite reliability (and related measures), while convergent validity is checked via Average Variance Extracted (AVE). Discriminant validity is tested using the Fornell–Larcker criterion: the square root of each construct's AVE should exceed that construct's correlations with the other constructs.
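As one possible way to run the CFA step in Python, the sketch below uses the semopy package with a lavaan-style model string; the data file, the two factor names, and the item names are placeholders, and only two hypothetical dimensions are shown. semopy reports RMSEA among its fit statistics; SRMR and modification indices may require other tooling (for example R's lavaan), so treat this as a sketch of the workflow rather than the webinar's exact procedure.

```python
import pandas as pd
import semopy

# Validation-sample responses, one column per item (file and column names are hypothetical).
data = pd.read_csv("validation_sample.csv")

# lavaan-style measurement model: each latent dimension is defined by its items.
model_desc = """
CommunityEngagement =~ ce1 + ce2 + ce3
EthicalPractice     =~ ep1 + ep2 + ep3
"""

model = semopy.Model(model_desc)
model.fit(data)

# Parameter estimates, including the factor loadings of each item.
print(model.inspect())

# Global fit statistics; RMSEA is commonly judged against a cutoff of about .08 or lower.
print(semopy.calc_stats(model).T)
```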
A practical example in the transcript illustrates the workflow: starting from a large literature-derived item pool, adding interview-based items, using expert marking to cut down the number of items, piloting, then running EFA to remove problematic items until a stable seven-factor structure emerges. The process ends with CFA, followed by composite reliability, AVE, and discriminant validity checks—producing a scale that is both statistically defensible and conceptually grounded for its target context.
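To show how those final numbers are usually computed, here is a small numpy/pandas sketch that derives composite reliability (CR), AVE, and a Fornell–Larcker comparison from standardized CFA loadings. The two constructs, their loadings, and the factor correlation are made-up illustration values; the formulas are the conventional ones, CR = (Σλ)² / [(Σλ)² + Σ(1 − λ²)] and AVE = Σλ² / k.

```python
import numpy as np
import pandas as pd

# Standardized CFA loadings per construct (illustrative values, not from the webinar).
loadings = {
    "CommunityEngagement": np.array([0.78, 0.81, 0.73]),
    "EthicalPractice":     np.array([0.70, 0.76, 0.82]),
}

# Estimated latent correlation between the two constructs (illustrative value).
factor_corr = 0.55

def composite_reliability(lam):
    # CR = (sum of loadings)^2 / [(sum of loadings)^2 + sum of error variances],
    # where a standardized indicator's error variance is 1 - loading^2.
    return lam.sum() ** 2 / (lam.sum() ** 2 + (1 - lam ** 2).sum())

def average_variance_extracted(lam):
    # AVE = mean of the squared standardized loadings.
    return (lam ** 2).mean()

results = pd.DataFrame({
    name: {
        "CR": composite_reliability(lam),
        "AVE": average_variance_extracted(lam),
        "sqrt_AVE": np.sqrt(average_variance_extracted(lam)),
    }
    for name, lam in loadings.items()
}).T

print(results.round(3))

# Fornell-Larcker check: the square root of each construct's AVE should exceed
# that construct's correlation with every other construct.
print("Discriminant validity supported:", bool((results["sqrt_AVE"] > factor_corr).all()))
```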
Cornell Notes
Scale development turns latent constructs into measurable scores by building a questionnaire with items that reflect an operational definition of the construct. The process starts with defining the construct in the study context, generating an item pool from literature and stakeholder input, and using expert review to ensure relevance and clarity. Pilot data then feed into exploratory factor analysis (EFA) to test whether items cluster into the intended dimensions, removing weak or cross-loading items. Confirmatory factor analysis (CFA) checks whether the proposed measurement model fits the data. Finally, reliability and validity are established using composite reliability, AVE for convergent validity, and the Fornell–Larcker criterion for discriminant validity.
- Why does an operational definition come before item writing or item selection?
- How does EFA differ from CFA in scale development?
- What practical problems in EFA lead to item removal?
- How are reliability and validity established after the factor structure is accepted?
- What role do expert panels and stakeholder interviews play beyond literature review?
- What does the transcript suggest about using modification indices in CFA?
Review Questions
- What steps in scale development depend most directly on the operational definition, and what goes wrong when it is unclear?
- In an EFA output, how would you decide whether an item should be removed due to weak loading versus cross-loading?
- Using the Fornell–Larcker criterion, what comparison must be true for discriminant validity to be supported?
Key Points
1. Start scale development with an operational definition that specifies how the construct will be measured in the study context; it governs item selection, dimension structure, and naming.
2. Build an item pool from existing literature scales, then expand it with stakeholder interviews or focus groups to ensure contextual relevance.
3. Use an expert panel to judge each item's relevance and clarity and to validate item categorization into dimensions; expect multiple revision cycles.
4. Run EFA on pilot data to test whether items cluster into the intended factors; remove items with weak loadings or cross-loadings and troubleshoot iteratively.
5. Use CFA to confirm the proposed measurement model and assess fit using indices such as RMSEA and SRMR; apply modification indices only within conceptually defensible constraints.
6. Establish reliability with composite reliability, convergent validity with AVE, and discriminant validity with the Fornell–Larcker criterion.
7. Track item reduction from the initial literature pool through expert marking and pilot testing so the final scale remains defensible and aligned with the conceptual model.