31. SEMinR. How to Analyze Categorical Predictor Variables using SEMinR in R?
Based on Research With Fawad's video on YouTube.
Briefing
Categorical predictors in SEMinR can be handled by dummy coding each non-reference category and then estimating separate path effects against a chosen reference group. This makes it possible to test whether country differences translate into statistically significant differences in customer loyalty. In the example, “country” has three nominal categories (China, Pakistan, Italy), and the goal is to determine whether country affects “loyalty,” measured by six items (cl1–cl6). Because the categories have no natural order, the model treats them as purely nominal groups rather than ordinal levels.
The workflow starts by converting the country variable into dummy variables. Each non-reference country becomes its own indicator: Pakistan is coded as one dummy variable (country2) and Italy as another (country3), while China is omitted from the model to serve as the reference category. This reference-category setup matters because the estimated coefficients for Pakistan and Italy are interpreted relative to China. The transcript notes that dummy coding can be done manually in R or with the fastDummies library (via the dummy_cols function), and the same approach can be used for other categorical variables (the example briefly shows gender being dummy coded as well).
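The dummy-coding step can be sketched in base R as follows. This is a minimal illustration, not the video's exact code: the toy data frame `df` and its values are invented, and only the column names country2 and country3 follow the example. The optional fastDummies call runs only if the package is installed. `dummy_cols` names its columns after the category levels (for example country_Pakistan), so they would need renaming to match country2 and country3.

```r
# Toy data: the values are invented purely for illustration.
df <- data.frame(
  country = c("China", "Pakistan", "Italy", "China", "Italy", "Pakistan"),
  stringsAsFactors = FALSE
)

# Manual dummy coding. China is the omitted reference category,
# so it gets no indicator column of its own.
df$country2 <- as.integer(df$country == "Pakistan")  # 1 = Pakistan, 0 = otherwise
df$country3 <- as.integer(df$country == "Italy")     # 1 = Italy, 0 = otherwise

# Optional alternative with fastDummies (only runs if the package is installed).
# dummy_cols() creates one column per level, named "<variable>_<level>".
if (requireNamespace("fastDummies", quietly = TRUE)) {
  df_fd <- fastDummies::dummy_cols(df["country"], select_columns = "country")
  print(names(df_fd))
}

print(df)
```

Rows for China carry a 0 in both country2 and country3, which is what makes China the baseline in every later comparison.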
Next comes the measurement model and structural model setup in SEMinR. The measurement model defines loyalty as a latent construct indicated by cl1 through cl6, while the structural model specifies the relationships of interest: the path from each dummy-coded country indicator to loyalty. Pakistan and Italy are entered separately so their effects can be compared to the omitted reference group (China). The model is then estimated and summarized, producing path coefficients for the country-to-loyalty relationships.
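A model specification along these lines might look like the sketch below. Several parts are assumptions rather than details from the transcript: the data are simulated, the construct names Pakistan, Italy and Loyalty are hypothetical, and loyalty is specified with `composite()` because the source does not say whether a composite or reflective specification was used. The SEMinR part runs only if the package is installed.

```r
# Simulated survey data (an assumption for illustration; not the video's dataset).
set.seed(123)
n <- 300
country <- sample(c("China", "Pakistan", "Italy"), n, replace = TRUE)

# Simulated loyalty score, set lower for Pakistan and Italy than for China.
eta <- rnorm(n) - 0.6 * (country == "Pakistan") - 0.4 * (country == "Italy")

# Six noisy indicators cl1..cl6 of the loyalty score.
items <- sapply(1:6, function(i) eta + rnorm(n, sd = 0.7))
colnames(items) <- paste0("cl", 1:6)

survey <- data.frame(
  items,
  country2 = as.integer(country == "Pakistan"),  # dummy for Pakistan
  country3 = as.integer(country == "Italy")      # dummy for Italy
)

if (requireNamespace("seminr", quietly = TRUE)) {
  library(seminr)

  # Measurement model: each dummy is a single-item construct,
  # loyalty is measured by cl1..cl6.
  mm <- constructs(
    composite("Pakistan", single_item("country2")),
    composite("Italy",    single_item("country3")),
    composite("Loyalty",  multi_items("cl", 1:6))
  )

  # Structural model: one path per dummy; China is omitted as the reference.
  sm <- relationships(
    paths(from = c("Pakistan", "Italy"), to = "Loyalty")
  )

  pls_model <- estimate_pls(
    data = survey,
    measurement_model = mm,
    structural_model = sm
  )

  # Path coefficients for Pakistan -> Loyalty and Italy -> Loyalty.
  print(summary(pls_model)$paths)
}
```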
The results show negative path coefficients for both Pakistan and Italy. A negative coefficient means loyalty for consumers in that country is lower than loyalty in the reference category (China). The transcript gives an example interpretation consistent with a SmartPLS-style path coefficient (e.g., −0.334 for Pakistan), reinforcing that the direction of the effect is determined by the sign: positive would indicate higher loyalty than China, while negative indicates lower loyalty.
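The sign logic can be checked with an ordinary regression analog, sketched below on invented scores. With dummy coding, each dummy coefficient equals that group's mean minus the reference group's mean. PLS path coefficients are standardized, so their magnitudes will differ from these raw differences, but the interpretation of the sign is the same.

```r
# Invented loyalty scores: China mean = 6, Pakistan mean = 4, Italy mean = 5.
score   <- c(6, 7, 5,  4, 5, 3,  5, 6, 4)
country <- rep(c("China", "Pakistan", "Italy"), each = 3)

country2 <- as.integer(country == "Pakistan")
country3 <- as.integer(country == "Italy")

fit <- lm(score ~ country2 + country3)
print(coef(fit))
# (Intercept) = 6  : mean of the reference group (China)
# country2    = -2 : Pakistan mean minus China mean
# country3    = -1 : Italy mean minus China mean
```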
Finally, statistical significance is assessed using bootstrapping. Bootstrapping is run through the bootstrap_model function, and the summary output includes bootstrapped path estimates and t statistics. The transcript concludes that both differences are significant: loyalty is significantly lower in Pakistan than in China, and significantly lower in Italy than in China. Confidence intervals are also inspected, and the significance decision is tied to a one-tailed threshold (an absolute t value above 1.645), consistent with the directionality implied by the negative coefficients.
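The bootstrapping step might be sketched as follows, again on simulated data with hypothetical construct names. The choice of 500 resamples, `cores = 1` and the seed are assumptions made to keep the example quick and reproducible, not values from the transcript. The 1.645 threshold is the 95th percentile of the standard normal distribution, which `qnorm` reproduces. Passing `alpha = 0.10` to `summary` requests a 90% interval, which corresponds to a one-tailed test at the 5% level.

```r
# One-tailed 5% critical value from the standard normal distribution.
threshold <- qnorm(0.95)
print(round(threshold, 3))  # 1.645

# Simulated data (an assumption for illustration; not the video's dataset).
set.seed(123)
n <- 300
country <- sample(c("China", "Pakistan", "Italy"), n, replace = TRUE)
eta <- rnorm(n) - 0.6 * (country == "Pakistan") - 0.4 * (country == "Italy")
items <- sapply(1:6, function(i) eta + rnorm(n, sd = 0.7))
colnames(items) <- paste0("cl", 1:6)
survey <- data.frame(
  items,
  country2 = as.integer(country == "Pakistan"),
  country3 = as.integer(country == "Italy")
)

if (requireNamespace("seminr", quietly = TRUE)) {
  library(seminr)

  mm <- constructs(
    composite("Pakistan", single_item("country2")),
    composite("Italy",    single_item("country3")),
    composite("Loyalty",  multi_items("cl", 1:6))
  )
  sm <- relationships(paths(from = c("Pakistan", "Italy"), to = "Loyalty"))
  pls_model <- estimate_pls(data = survey,
                            measurement_model = mm,
                            structural_model = sm)

  # Resample the data and re-estimate the model on each resample.
  boot_model <- bootstrap_model(seminr_model = pls_model,
                                nboot = 500, cores = 1, seed = 123)

  # Bootstrapped path estimates, t statistics and confidence intervals.
  boot_summary <- summary(boot_model, alpha = 0.10)
  print(boot_summary$bootstrapped_paths)
}
```

A path would be judged significant in the one-tailed sense when the absolute value of its t statistic in the printed table exceeds the threshold.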
Overall, the approach turns a nominal categorical predictor into interpretable group comparisons within SEM: dummy code categories, omit one as reference, estimate separate paths to the endogenous construct, then use bootstrapping to confirm whether the observed differences versus the reference group are statistically reliable.
Cornell Notes
The SEMinR approach for a nominal categorical predictor (like Country with China, Pakistan, Italy) uses dummy coding and a reference category. Each non-reference category becomes its own dummy variable (Pakistan and Italy), while the omitted category (China) serves as the baseline for comparison. The measurement model defines the endogenous construct (Customer Loyalty) as a latent variable measured by six items (cl1–cl6). The structural model estimates separate paths from each dummy-coded country to loyalty, so coefficient signs indicate whether loyalty is higher or lower than China. Bootstrapping via bootstrap_model provides t statistics and confidence intervals to test whether the country differences versus the reference group are statistically significant (using a one-tailed threshold of 1.645 in the example).
Why must a nominal categorical variable like Country be dummy coded in SEMinR, and what does the reference category do?
How are dummy variables created in the example, and what tool is used?
What does a negative path coefficient from a country dummy to loyalty mean?
Why does the model include only two country dummies when there are three categories?
How is statistical significance for the country differences assessed?
Review Questions
- In a three-category nominal predictor, which category is omitted in the dummy-coded SEMinR model, and how does that choice affect interpretation of coefficients?
- If the path coefficient from the Pakistan dummy to loyalty is negative, what does that imply about loyalty relative to the reference category, and how would a positive coefficient change the interpretation?
- What role does bootstrapping play in determining whether country differences in loyalty are statistically significant in this SEMinR workflow?
Key Points
1. Dummy code nominal categorical predictors by creating separate indicator variables for each non-reference category.
2. Omit one category (the reference category) to avoid a singular matrix and to define comparisons against a baseline group.
3. Interpret each country dummy’s path coefficient as the difference in the endogenous construct relative to the reference category.
4. Negative coefficients indicate lower loyalty than the reference group; positive coefficients indicate higher loyalty.
5. Use SEMinR’s measurement model to define the latent endogenous construct (e.g., loyalty from cl1–cl6).
6. Estimate the structural model with separate paths from each dummy-coded category to the latent construct.
7. Run bootstrapping (bootstrap_model) and use t statistics and confidence intervals to test whether the observed differences versus the reference category are significant (one-tailed threshold of 1.645 in the example).
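The reason for omitting one category can be illustrated with a small base R check on invented data. If dummies for all three countries are included alongside a constant, the columns are perfectly collinear, because the three dummies always sum to 1. The design matrix then loses rank. The sketch uses an ordinary regression design matrix as an analog. In PLS the same redundancy arises because the variables are standardized.

```r
country <- c("China", "Pakistan", "Italy", "China", "Pakistan", "Italy")

d_china    <- as.integer(country == "China")
d_pakistan <- as.integer(country == "Pakistan")
d_italy    <- as.integer(country == "Italy")

# All three dummies plus a constant: 4 columns, but one is redundant.
X_all <- cbind(intercept = 1, d_china, d_pakistan, d_italy)

# Reference coding: China omitted, 3 columns, none redundant.
X_ref <- cbind(intercept = 1, d_pakistan, d_italy)

print(qr(X_all)$rank)  # 3, fewer than its 4 columns: rank deficient
print(qr(X_ref)$rank)  # 3, equal to its 3 columns: full rank
```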