Understanding the Questionnaire/Scale Development Process (Edited Webinar)
Based on the Research With Fawad video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Briefing
Scale development is necessary when existing questionnaires fail to measure a concept in the specific way a study needs—especially for constructs that are not well represented in the literature. Researchers often start with the goal of establishing relationships between variables, but many key concepts (like job satisfaction, employee engagement, or community hostility) are not directly measurable. Instead, they are “latent variables” that require multiple questionnaire items to capture different aspects of the underlying construct. When no suitable scale exists—such as for community hostility, or for higher-education social responsibility—researchers must build a new scale tailored to their context rather than cobbling together mismatched items.
The process begins with defining the concept in operational terms, not just conceptually. An operational definition determines what the study will actually measure and, crucially, which items belong. Without it, a questionnaire can drift into measuring the wrong components, leading to reviewer rejection and a study that cannot be defended. The webinar emphasizes purposiveness: every measurement choice should follow from how the construct is operationalized. For example, higher education social responsibility must include elements like ethics, academics, and research; using a corporate-focused CSR instrument that omits those elements creates a mismatch. Similar examples show how organizational commitment can be operationalized as employees’ emotional desire to stay, or how perceptions of CSR image can be operationalized as consumers’ belief that a firm supports socially beneficial activities.
Once the construct is operationalized, scale development proceeds through a structured sequence. Researchers generate an item pool using existing literature when available, then refine it through expert review and/or interviews and focus groups. Items are written in a consistent format (often as statements rated on a Likert-type scale such as 5- or 7-point options). Next comes categorization into dimensions: items that reflect related themes are grouped into subdimensions (for instance, in a servant leadership in higher education project, items were organized into dimensions such as ethical behavior, development orientation, emotional healing, empowerment, humility, pioneering, relationship building, and wisdom). This dimensional structure is not assumed to be correct; it must be tested.
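To make the grouping-and-scoring step concrete, here is a minimal Python sketch of turning Likert responses into per-respondent dimension scores. The responses, item-to-dimension mapping, and dimension names (borrowed loosely from the servant leadership example) are hypothetical illustrations, not data from the webinar.

```python
import numpy as np

# Hypothetical 5-point Likert responses: 4 respondents x 6 items (columns).
responses = np.array([
    [5, 4, 5, 2, 3, 2],
    [4, 4, 3, 1, 2, 2],
    [5, 5, 4, 3, 3, 4],
    [3, 4, 4, 2, 1, 2],
])

# Hypothetical mapping of item columns to proposed subdimensions.
dimensions = {
    "ethical_behavior": [0, 1, 2],
    "emotional_healing": [3, 4, 5],
}

# Score each dimension as the mean of its items for every respondent.
scores = {name: responses[:, cols].mean(axis=1)
          for name, cols in dimensions.items()}
for name, vals in scores.items():
    print(name, vals)
```

Whether items are averaged or summed is a design choice; averaging keeps subscale scores on the original 1–5 response metric, which makes dimensions with different item counts comparable.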
After data collection, exploratory factor analysis (EFA) is used to check whether the expected grouping holds statistically. EFA functions as a data-reduction method, clustering highly correlated items into fewer factors so researchers can work with dimensions rather than dozens of individual items. Items may be removed if they load weakly, cross-load onto multiple dimensions, or appear to reflect something different from the intended construct—often because of wording problems or respondent misunderstanding. In one example, an initial set of 68 items was reduced to 37 after EFA, and one proposed dimension was eliminated entirely.
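The data-reduction logic of EFA can be sketched in a few lines with scikit-learn's `FactorAnalysis`. The synthetic data below (two latent factors each driving three items) and the 0.4 loading cutoff are illustrative assumptions; real studies choose extraction methods, rotations, and thresholds to fit their data.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 300
# Two hypothetical latent factors, each driving three questionnaire items.
f1, f2 = rng.normal(size=(2, n))
noise = rng.normal(scale=0.5, size=(n, 6))
X = np.column_stack([f1, f1, f1, f2, f2, f2]) + noise

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
loadings = fa.components_.T  # rows = items, columns = factors

# Flag items that load weakly everywhere or strongly on more than one factor.
for i, row in enumerate(np.abs(loadings)):
    if row.max() < 0.4:
        print(f"item {i}: weak loadings, candidate for removal")
    elif (row > 0.4).sum() > 1:
        print(f"item {i}: cross-loads, candidate for removal")
```

With this clean simulated structure, each item loads on exactly one factor and nothing is flagged; with real survey data, the flagged items are the ones researchers inspect for wording problems before deciding to drop them.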
Finally, researchers assess reliability and validity. Reliability checks whether the scale produces consistent results, while validity checks whether it measures what it claims to measure. The webinar also notes that some studies skip early pilot testing and rely on expert content/face validity before conducting EFA and confirmatory factor analysis (CFA) in the main study. Regardless of the exact route, the end goal is a defensible, psychometrically supported scale that can be used to test relationships between constructs in subsequent research.
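Internal-consistency reliability is commonly reported as Cronbach's alpha, which compares the sum of item variances to the variance of the total score. A minimal numpy implementation, with hypothetical 5-point responses for a four-item subscale, might look like:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents x items response matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)  # variance of summed score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical Likert responses (5 respondents, 4 items).
data = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
])
print(round(cronbach_alpha(data), 3))  # ~0.943 for this consistent pattern
```

Values around 0.7 or higher are conventionally read as acceptable internal consistency, though alpha alone does not establish validity, which requires separate evidence (e.g., content review and CFA).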
Cornell Notes
Scale development is required when existing questionnaires don’t adequately measure a construct in the specific context of a study. Because many constructs are latent variables (not directly observable), they must be measured with multiple items that are summed or averaged into a single score representing the underlying construct. The process starts with an operational definition that guides which items should be included; otherwise, questionnaires can omit core elements and fail reviewer scrutiny. Item generation draws on literature, expert input, and interviews or focus groups, followed by grouping items into dimensions. Exploratory factor analysis then tests whether the proposed dimensional structure holds, removing weak or cross-loading items, after which reliability and validity are assessed (often with EFA and CFA).
Why can’t researchers rely on a single item to measure many psychological or social constructs?
How does an operational definition protect a scale from becoming “off-target”?
What is the role of exploratory factor analysis (EFA) in scale development?
How do researchers generate and refine an item pool before EFA?
What does it mean to “categorize items into dimensions,” and why must that be tested?
What reliability and validity checks are expected after factor analysis?
Review Questions
- What consequences follow from failing to write and use an operational definition when selecting or adapting questionnaire items?
- Describe how EFA changes the number of variables researchers analyze and what kinds of items are typically removed during this step.
- Give an example of how a construct’s context (e.g., higher education vs. corporations) can require a different scale than what exists in the literature.
Key Points
1. Scale development is essential when existing questionnaires don’t measure a construct accurately for a study’s context and operational definition.
2. Most constructs in social science are latent variables, so they require multiple items whose responses are aggregated into a single score.
3. Operational definitions determine what the study measures and which items should be included; mismatches can invalidate the research.
4. Item generation commonly uses literature, expert review, and interviews or focus groups, then formats items with consistent response scales.
5. Dimensional structure must be tested empirically; EFA clusters correlated items into fewer factors and helps remove weak or cross-loading items.
6. Reliability and validity assessments are required to demonstrate that the final scale measures the intended construct consistently and correctly.
7. A defensible scale development write-up needs step-by-step justification, including sources for items and evidence for the resulting dimensions.