16. SPSS AMOS | Reporting Measurement Model (Part 2) | Reporting Reliability and Validity
Based on Research With Fawad's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Reliability and validity reporting in a confirmatory factor analysis (CFA) hinges on a clear sequence: document measurement quality first (model fit and factor loadings), then report construct reliability and convergent validity, and finally demonstrate discriminant validity. The practical takeaway is that acceptable measurement models aren’t just about overall fit—they also require indicator loadings meeting a threshold and reliability/validity statistics that clear commonly used benchmarks.
After rebuilding the CFA model and re-running estimates in IBM SPSS AMOS, the workflow starts with standardized factor loadings. Indicators with weak loadings are treated as poor reflections of their latent construct. In the example, one item (LS5) shows a standardized regression weight of 0.463, which falls below the 0.50 cutoff. That indicator is deleted from the diagram, and the model is re-estimated to improve measurement quality.
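This screening step can be sketched in a few lines of Python. The sketch assumes the standardized loadings have been copied out of the AMOS "Standardized Regression Weights" table into a dictionary; LS5's value of 0.463 comes from the example, while the other item names and values are invented for illustration.

```python
# Hypothetical standardized loadings copied from the AMOS output;
# only LS5 = 0.463 comes from the worked example.
loadings = {"LS1": 0.71, "LS2": 0.68, "LS3": 0.74, "LS4": 0.66, "LS5": 0.463}

CUTOFF = 0.50  # common retention threshold for indicators

# Indicators at or above the cutoff stay in the model.
retained = {item: lam for item, lam in loadings.items() if lam >= CUTOFF}
# Indicators below the cutoff are deleted from the AMOS diagram
# before the model is re-estimated.
dropped = [item for item, lam in loadings.items() if lam < CUTOFF]

print(dropped)
```

Running the check flags only LS5, matching the deletion described above.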
With the revised model in place, the reporting focus shifts to construct reliability and convergent validity. The transcript emphasizes composite reliability as the reliability metric (rather than Cronbach’s alpha, which is mentioned but not used here). Composite reliability values are reported per construct and compared against a benchmark of 0.70. The example results range from 0.813 (authentic leadership) to 0.918 (ethical leadership), with life satisfaction at 0.891—each above 0.70—supporting the conclusion that construct reliability is established.
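Composite reliability can be reproduced from the standardized loadings alone. The following sketch uses the standard formula CR = (Σλ)² / ((Σλ)² + Σ(1 − λ²)), which assumes uncorrelated measurement errors; the loadings in the demo call are invented, not the example's actual values.

```python
def composite_reliability(loadings):
    """Composite reliability from standardized loadings, assuming
    uncorrelated measurement errors:
    CR = (sum(lam))^2 / ((sum(lam))^2 + sum(1 - lam^2))."""
    s = sum(loadings)
    error_var = sum(1 - lam ** 2 for lam in loadings)
    return s ** 2 / (s ** 2 + error_var)

# Invented standardized loadings for a single construct.
cr = composite_reliability([0.71, 0.68, 0.74, 0.66])
print(cr >= 0.70)  # benchmark used in the write-up
```

A construct whose CR clears 0.70, as all three constructs do in the example, is reported as reliable.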
Convergent validity is then assessed using Average Variance Extracted (AVE), with a threshold of 0.50. AVE values are described as meeting the requirement for all constructs except authentic leadership. Even so, the transcript notes that authentic leadership still clears the reliability benchmark (composite reliability above 0.70), allowing a qualified conclusion: the construct can still be defended because its composite reliability indicates sufficient internal consistency, even though its indicators' average explained variance falls just short of the 0.50 mark.
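AVE is simply the mean squared standardized loading, so it can be checked with a one-line function. The loadings below are invented; they are chosen to show how a construct can narrowly miss the 0.50 AVE benchmark even when its loadings all exceed the 0.50 retention cutoff, which is exactly the situation described for authentic leadership.

```python
def average_variance_extracted(loadings):
    """AVE = sum(lam^2) / n: the average share of indicator variance
    explained by the latent construct (standardized loadings)."""
    return sum(lam ** 2 for lam in loadings) / len(loadings)

# Invented loadings: all above 0.50, yet AVE lands just under 0.50.
ave = average_variance_extracted([0.71, 0.68, 0.74, 0.66])
print(ave >= 0.50)
```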
Discriminant validity comes last, and the transcript contrasts two approaches. The Fornell–Larcker criterion is presented first: discriminant validity is supported when the square root of each construct's AVE exceeds its correlations with the other constructs. However, Fornell–Larcker is also flagged as increasingly criticized in the literature. As an alternative, the Heterotrait–Monotrait ratio (HTMT) is used, with discriminant validity supported when HTMT ratios fall below 0.85 (Henseler et al., 2015). In the example, all HTMT ratios remain under 0.85, leading to the conclusion that discriminant validity is established.
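Both checks can be illustrated with a small sketch. Nothing here is AMOS output: the item names, item correlations, AVE values, and construct correlation for the two hypothetical constructs A and B are all invented. HTMT is computed as the mean between-construct item correlation divided by the geometric mean of the within-construct means.

```python
import math

# Invented item correlations for two hypothetical constructs A and B.
corr = {
    ("a1", "a2"): 0.64, ("b1", "b2"): 0.49,
    ("a1", "b1"): 0.28, ("a1", "b2"): 0.28,
    ("a2", "b1"): 0.28, ("a2", "b2"): 0.28,
}

def r(i, j):
    """Look up a correlation regardless of key order."""
    return corr.get((i, j), corr.get((j, i)))

def htmt(items_a, items_b):
    """Heterotrait-Monotrait ratio: mean between-construct item
    correlation over the geometric mean of the within-construct means."""
    hetero = [r(a, b) for a in items_a for b in items_b]
    mono_a = [r(x, y) for k, x in enumerate(items_a) for y in items_a[k + 1:]]
    mono_b = [r(x, y) for k, x in enumerate(items_b) for y in items_b[k + 1:]]
    mean = lambda xs: sum(xs) / len(xs)
    return mean(hetero) / math.sqrt(mean(mono_a) * mean(mono_b))

# Fornell-Larcker: sqrt(AVE) must exceed the construct correlation.
ave_a, ave_b, construct_corr = 0.64, 0.55, 0.40  # invented values
fl_ok = math.sqrt(ave_a) > construct_corr and math.sqrt(ave_b) > construct_corr

ratio = htmt(["a1", "a2"], ["b1", "b2"])
print(fl_ok, ratio < 0.85)  # both checks support discriminant validity here
```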
For writing up results, the transcript recommends copying AMOS output into a spreadsheet, then building clean tables for loadings, reliability (including composite reliability), convergent validity (AVE), and discriminant validity (Fornell–Larcker and HTMT). The overall order—measurement model, construct reliability, convergent validity, then discriminant validity—keeps reporting consistent and defensible.
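A minimal sketch of that table-building step, using plain Python string formatting rather than a spreadsheet: the CR values are the ones reported in the example, but the AVE column holds invented placeholders, since the text states only which constructs passed the 0.50 threshold.

```python
# CR values from the example write-up; AVE values are invented
# placeholders consistent with authentic leadership falling short of 0.50.
results = [
    ("Authentic leadership", 0.813, 0.48),
    ("Ethical leadership",   0.918, 0.61),
    ("Life satisfaction",    0.891, 0.58),
]

# Fixed-width columns keep the table readable in a plain-text report.
header = f"{'Construct':<22}{'CR':>7}{'AVE':>7}"
rows = [f"{name:<22}{cr:>7.3f}{ave:>7.3f}" for name, cr, ave in results]
table = "\n".join([header] + rows)
print(table)
```

The same per-construct layout extends naturally to the loadings, Fornell–Larcker, and HTMT tables.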
Cornell Notes
The workflow for reporting a CFA measurement model’s quality starts with standardized factor loadings and model re-estimation. Indicators with loadings below 0.50 are removed; in the example, LS5 had a standardized loading of 0.463 and was deleted before re-running the model. Reliability is reported using composite reliability (benchmark ≥ 0.70), with values such as 0.813 for authentic leadership, 0.918 for ethical leadership, and 0.891 for life satisfaction. Convergent validity is assessed via AVE (benchmark ≥ 0.50); AVE met the threshold for all constructs except authentic leadership, a shortfall handled in the write-up by appealing to that construct's strong composite reliability. Discriminant validity is evaluated first with Fornell–Larcker and then more robustly with HTMT ratios (benchmark < 0.85), where all ratios in the example stayed below 0.85.
Why does the reporting process begin with factor loadings, and what threshold is used?
How is construct reliability reported in this workflow, and what benchmark is applied?
What statistic is used for convergent validity, and how is a partial AVE failure handled?
What is the Fornell–Larcker criterion for discriminant validity?
Why does the transcript also use HTMT, and what cutoff determines discriminant validity?
What table elements does the transcript recommend for reporting reliability and validity?
Review Questions
- What specific action is taken when a standardized factor loading falls below 0.50, and how does that affect subsequent reporting?
- How do composite reliability (CR) and AVE differ in what they validate, and what benchmarks are used for each?
- Under the HTMT approach, what numeric threshold indicates discriminant validity, and how does it relate to the Fornell–Larcker criterion?
Key Points
1. Re-estimate the CFA after removing indicators with standardized loadings below 0.50 to strengthen measurement quality.
2. Report construct reliability using composite reliability, with a benchmark of 0.70 or higher for each construct.
3. Assess convergent validity using AVE with a benchmark of 0.50, and explicitly note any construct that falls short.
4. Run discriminant validity checks in a defensible order: Fornell–Larcker first, then HTMT as the more robust alternative.
5. Apply the HTMT cutoff of 0.85 (Henseler et al., 2015): discriminant validity is supported when all ratios fall below the limit.
6. Present results in clean tables: loadings plus reliability/AVE for convergent validity, and Fornell–Larcker plus HTMT for discriminant validity.
7. Keep reporting organized by sequence: measurement model → construct reliability → convergent validity → discriminant validity.