Pitfalls in measurement invariance testing
Jan 4, 2019

In a new paper in the European Journal of Psychological Assessment, Timo Gnambs and I examined how soundly measurement invariance (MI) testing is reported in the context of multigroup confirmatory factor analysis (MGCFA). Of course, there are several good primers on MI testing (e.g., Cheung & Rensvold, 2002; Wicherts & Dolan, 2010) and textbooks that elaborate on the theoretical basis (e.g., Millsap, 2011), but a clearly written tutorial with example syntax showing how to implement MI testing in practice was still missing. In the first part of the paper, we demonstrate that a soberingly large number of reported degrees of freedom (df) do not match the df recalculated from the information given in the articles. More specifically, we both reviewed 128 studies comprising 302 MGCFA measurement invariance testing procedures from six leading peer-reviewed journals that focus on psychological assessment and report such analyses on a regular basis. Overall, about a quarter of all articles included at least one discrepancy, with some systematic differences between the journals. Interestingly, the metric and scalar steps of invariance testing were affected more frequently.
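
To make the recalculation concrete, here is a minimal sketch in Python of how the expected df can be counted for the four nested MI steps. It is not the code from the paper: the function name `mgcfa_df` and its arguments are made up for illustration, and it assumes the simplest case only (continuous indicators, a mean structure, every indicator loading on exactly one factor, no residual covariances, and the same model in all groups).

```python
def mgcfa_df(n_indicators, n_factors, n_groups):
    """Expected df for configural, metric, scalar, and strict MI.

    Hypothetical helper; assumes simple structure, no residual
    covariances, and a mean structure in every group.
    """
    p, m, g = n_indicators, n_factors, n_groups

    # Observed information per group: p(p+1)/2 (co)variances plus p means.
    moments = g * (p * (p + 1) // 2 + p)

    # Configural model, per group: p loadings, p intercepts,
    # p residual variances, m(m+1)/2 factor (co)variances, m factor means,
    # minus 2m identification constraints (scale and location per factor).
    # This count is the same for all three identification methods.
    params_per_group = 3 * p + m * (m + 1) // 2 + m - 2 * m
    configural = moments - g * params_per_group

    # Metric: equal loadings. Net gain of (p - m) df per additional group.
    metric = configural + (g - 1) * (p - m)

    # Scalar: equal intercepts, factor means freed in g - 1 groups.
    scalar = metric + (g - 1) * (p - m)

    # Strict: equal residual variances, p constraints per additional group.
    strict = scalar + (g - 1) * p

    return {
        "configural": configural,
        "metric": metric,
        "scalar": scalar,
        "strict": strict,
    }


if __name__ == "__main__":
    print(mgcfa_df(n_indicators=9, n_factors=3, n_groups=2))
```

For a model with nine indicators, three factors, and two groups, this gives 48, 54, 60, and 69 df for the configural, metric, scalar, and strict model, respectively. Cross-loadings or residual covariances each cost additional df, so the counts would have to be adapted for such models.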

In the second part of the manuscript, we elaborate on the restrictions necessary to test configural, metric, scalar, and strict measurement invariance. To this end, we provide syntax in lavaan and Mplus for a) the marker variable method (i.e., fixing the factor loading of a marker variable to one), b) the reference group method (i.e., fixing the variances of the latent variables to one), and c) the effects-coding method (i.e., constraining the mean of the loadings to one) by Little et al. (2006). We also identified two typical pitfalls in using these methods. First, in testing metric MI with the reference group method, researchers seem to neglect to free the factor variances in all groups but the reference group, thus estimating a model with invariant loadings and invariant variances. Second, in scalar MI the factor means are, for the first time in the nested MI testing procedure, freely estimated. However, some researchers keep the constraints on the factor means. As a result, potentially meaningful group differences in the latent means are wrongly counted as model misfit.
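
Both pitfalls show up directly in the df. The following Python sketch counts the free parameters of a metric or scalar model under the reference group method, with two switches that mimic the mistakes described above. The function `reference_group_df` and its arguments are hypothetical, and the count only covers the simple case (continuous indicators, no cross-loadings, no residual covariances).

```python
def reference_group_df(p, m, g, step, free_variances=True, free_means=True):
    """df of a metric or scalar MGCFA model (reference group method).

    p: indicators, m: factors, g: groups, step: "metric" or "scalar".
    Hypothetical helper for illustration only.
    """
    if step not in ("metric", "scalar"):
        raise ValueError("step must be 'metric' or 'scalar'")

    # Observed information: p(p+1)/2 (co)variances plus p means per group.
    moments = g * (p * (p + 1) // 2 + p)

    loadings = p                          # constrained equal, counted once
    residuals = g * p                     # free in every group
    covariances = g * m * (m - 1) // 2    # factor covariances, free per group

    # Factor variances are fixed to 1 in the reference group only and
    # should be freed in the remaining g - 1 groups (pitfall 1 if not).
    variances = (g - 1) * m if free_variances else 0

    if step == "metric":
        intercepts = g * p                # still free in every group
        means = 0                         # factor means fixed to 0 everywhere
    else:
        intercepts = p                    # constrained equal, counted once
        # Factor means should be freed in g - 1 groups (pitfall 2 if not).
        means = (g - 1) * m if free_means else 0

    n_params = (loadings + intercepts + residuals
                + covariances + variances + means)
    return moments - n_params


if __name__ == "__main__":
    print(reference_group_df(9, 3, 2, "metric"))
    print(reference_group_df(9, 3, 2, "metric", free_variances=False))
    print(reference_group_df(9, 3, 2, "scalar"))
    print(reference_group_df(9, 3, 2, "scalar", free_means=False))
```

With nine indicators, three factors, and two groups, the correctly specified metric model has 54 df, whereas forgetting to free the variances yields 57. The scalar model has 60 df, or 63 if the factor means stay fixed. A surplus of `(g - 1) * m` df in a reported model is therefore a hint that one of the two constraints was left in place.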

In the last part, we give some recommendations which apply to all parties involved in the publication process – authors, reviewers, editors, and publishers:

  1. Familiarize yourself with the constraints of MI testing using different identification strategies and pay attention to the aforementioned pitfalls. Furthermore, we encourage researchers to use the effects-coding method (Little et al., 2006), which allows researchers to estimate and test the factor loadings, variances, and latent means simultaneously. In contrast to other scaling methods, the effects-coding method does not rely on fixing single measurement parameters to identify the scale, which can cause problems in MI testing if these parameters function differently across groups but are constrained to be equal.

  2. Describe the measurement model in full detail (i.e., number of indicators, factors, cross-loadings, residual covariances, and groups) and explicitly state which parameters are constrained at the different MI steps, so that it is clear which models are nested within each other. In addition, use unambiguous terminology when referring to specific steps in MI testing. For example, label the invariance step by the parameters that have been fixed (e.g., “invariance of factor loadings” instead of “metric invariance”).

  3. In line with the current efforts of the Open Science Framework (Nosek et al., 2015) to make scientific research more transparent, open, and reproducible, we strongly advocate making the raw data and the model syntax available in a freely accessible data repository. If legal restrictions or ethical considerations prevent the sharing of raw data, it is possible to create synthesized data sets (Nowok, Raab, & Dibben, 2016). If you want me to cover this method in a future post, drop me a line.

  4. We encourage authors and reviewers to routinely use our online tool, where you can enter the number of indicators, latent variables, groups, etc., to double-check the df of your reported models. In this context, we welcome the recent effort of journals in psychology to include soundness checks such as statcheck at manuscript submission to improve the accuracy of statistical reporting.

  5. In our opinion, the results also indicate that statistics and methods need to be taught more rigorously at universities, especially in structured Ph.D. programs. Rigorous training should include both conceptual (e.g., Markus & Borsboom, 2013) and statistical work (e.g., Millsap, 2011). To bridge the gap between psychometric researchers and applied psychologists, a variety of teaching resources can be recommended that introduce invariance testing in general (Cheung & Rensvold, 2002; Wicherts & Dolan, 2010) or specific aspects of MI such as longitudinal MI (Geiser, 2013), MI in higher-order models (Chen, Sousa, & West, 2005), and MI with categorical data (Pendergast, von der Embse, Kilgus, & Eklund, 2017).


  • Chen, F. F., Sousa, K. H., & West, S. G. (2005). Testing measurement invariance of second-order factor models. Structural Equation Modeling: A Multidisciplinary Journal, 12, 471–492.
  • Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 9, 233–255.
  • Geiser, C. (2013). Data Analysis with Mplus. New York: Guilford Press.
  • Little, T. D., Slegers, D. W., & Card, N. A. (2006). A non-arbitrary method of identifying and scaling latent variables in SEM and MACS models. Structural Equation Modeling: A Multidisciplinary Journal, 13, 59–72.
  • Markus, K. A., & Borsboom, D. (2013). Frontiers of Test Validity Theory: Measurement, Causation, and Meaning. New York: Routledge.
  • Millsap, R. E. (2011). Statistical approaches to measurement invariance. New York, NY: Routledge.
  • Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., … Yarkoni, T. (2015). Promoting an open research culture. Science, 348(6242), 1422–1425.
  • Nowok, B., Raab, G. M., & Dibben, C. (2016). Synthpop: bespoke creation of synthetic data in R. Journal of Statistical Software, 74.
  • Pendergast, L., von der Embse, N., Kilgus, S., & Eklund, K. (2017). Measurement equivalence: A non-technical primer on categorical multi-group confirmatory factor analysis in school psychology. Journal of School Psychology, 60, 65–82.
  • Schroeders, U., & Gnambs, T. (in press). Degrees of freedom in multigroup confirmatory factor analyses: Are models of measurement invariance testing correctly specified? European Journal of Psychological Assessment.
  • Wicherts, J. M., & Dolan, C. V. (2010). Measurement invariance in confirmatory factor analysis: An illustration using IQ test performance of minorities. Educational Measurement: Issues and Practice, 29, 39–47.
