In 2009, I wrote a small chapter that was part of an EU conference book on the transition to computer-based assessment. Now and then I’m coming back to this piece of work - in my teaching and my publications (e.g., the EJPA paper on testing reasoning ability across different devices). Now I want to make it publically available. Hopefully, it will be interesting to some of you. The chapter is the (unaltered) preprint version of the book chapter, so if you want to cite it, please use the following citation:
Schroeders, U. (2009). Testing for equivalence of test data across media. In F. Scheuermann & J. Björnsson (Eds.), The transition to computer-based assessment. Lessons learned from the PISA 2006 computer-based assessment of science (CBAS) and implications for large scale testing (pp. 164-170). JRC Scientific and Technical Report EUR 23679 EN.
In order to stay abreast of social and technological changes and to capitalize on potential advantages of computer-based over paper-pencil testing, researchers are – in a first step of this transition process – concerned with moving already existing psychometric measures to computers. Therefore, testing for equivalence of a measure across test media becomes important in understanding whether computerizing measures affect the assessment of the underlying construct positively or adversely. In practical terms during the transition period equivalence is crucial in order to guarantee the comparability of test data and therewith the fairness for the test takers across media. After defining the term equivalence the available empirical evidence and proper statistical methods are discussed. It is argued that confirmatory factorial analysis is the soundest statistical tool for equivalence comparisons. The chapter concludes with some practical advices on what to do in order to adequately support the claim that a measure is equivalent across test media.
Given the potential advantages of computer-based testing (CBT) over paper-pencil-testing (PPT) - like computer adaptive testing (CAT, see Thompson & Weiss, this volume) or the potential to reduce costs of testing (see Latour, this volume) – educational and psychological testing is transferred considerably to a new test medium. Besides large scale studies (e.g., NAEP, see Bridgeman; CBAS, see Haldane, both this volume) there is a variety of small scale studies. In an initial step researchers are concerned to transfer readily available paper-based measures to computers. Subsequently, opportunities provided by the new test medium like multimedia extensions might catch a researcher’s interest and trigger additional changes of the instrument. These two trends reflect two different research strategies. This chapter addresses data analytic issues in the first research strategy, primarily, the issue of equivalence of measures across test media, and is divided into three sections: (A) What is equivalence?, (B) Is there equivalence?, and (C) How to test for equivalence? The chapter concludes with some practical recommendations on how to achieve equivalence.
Searching for the term equivalence in the “Standards for educational and psychological testing” (AERA, APA, & NCME, 1999) you will find several passages dealing with the issue. In the paragraphs about test administration (p. 62), score comparability (p. 57), and fairness of testing (p. 73) equivalence is immanent, but could easily be replaced by different labels like “unbiasedness” or “test fairness”. We will use the term “equivalence” following this working definition: The scores of measures are equivalent if they capture the same construct with the same measurement precision, providing interchangeable scores for individual persons. This definition suggests that two measures are equivalent if they are strict parallel, that is, test scores of such measures are solely dependent on the latent ability dimension and independent of test administration. Equivalence is given if the underlying source of all within group variance also accounts for the complete variance between the groups (PP vs. PC). Thus, equivalence is accurately described as measurement invariance (Drasgow & Kanfer, 1985). As we will see later on, there are different types of equivalence or measurement invariance. The next section will shed some light on the question whether evidence for equivalence can be found in the literature of educational and psychological testing.
Numerous studies try to clarify the question of equivalence across test media with respect to a) a specific measure (e.g., the computerized GRE, Goldberg & Pedulla, 2002), b) specific subgroups of testees (e.g., ethnic or gender groups, Gallagher, Bridgeman, & Cahalan, 2002) or c) specific soft- and hardware realizations (e.g., pen-based computer input, Overton, Taylor, Zickar, & Harms, 1996). However, the findings of these studies often remained unconnected and inconclusive. Mead and Drasgow (1993) attempted to connect these individual findings in their frequently cited – but by now outdated – meta-analytical study. Their synopsis endorses the structural equivalence of ability test data for power tests gathered through CBT versus PPT. The cross-mode correlation corrected for measurement error was r = .97 whereas this coefficient was only r = .72 for speed tests. The authors argue that the reason for the low cross-mode correlation among speed tests is substantiated in different motor skill requirements and differences in presentation (instructions, presenting of the items). By adjusting the requirements of a CBT to a PPT both artifacts should be eliminated and equivalence should be established. Consistent with this suggestion, Neuman and Baydoun (1998) demonstrated that the differences across media can be minimized for clerical speed tests if CBT follows the same administration and response procedures as PPT. The authors concluded that their tests administered on different media measure the same construct but with different reliability.
Kim (1999) presented a comprehensive meta-analysis featuring two enhancements over Mead and Drasgow’s earlier work: First, the sample of 51 studies including 226 effect sizes was more heterogeneous including studies on classroom tests and dissertations. Second, the authors corrected for within-study dependency in effect size estimation using a method recommended by Gleser and Olkin (1994), thus, avoiding both the overrepresentation of studies with many dependent measures and the inflation of false positive outcomes. According to Kim, in a global evaluation of equivalence of computer-based vs. paper-based tests the most important distinction concerns CAT vs. CBT. For CAT, mathematics and sort of publication are significant moderators in predicting the effect size, however, considering CBT alone no significant moderators could be found.
In the recent past, two more meta-analyses (Wang, Jiao, Young, Brooks, & Olson, 2007, 2008) for mathematics and English reading comprehension respectively for K-12 students cover the research results of the last 25 years. For mathematics 14 studies containing 44 independent data sets allowed a comparison of the scores from PPT and CBT measures. After excluding six data sets contributing strongly to deviance in effect size homogeneity the weighted mean effect size was statistically not different from zero. One moderator variable, the delivery algorithm (fixed vs. adaptive) used in computerized administration, contributed statistically significant to the prediction of the effect size, whereas all other moderator variables investigated (study design, grade level, sample size, type of test, Internet-based testing, and computer practice) had no salient influence. For English reading assessment the weighted mean effect size was also not statistically different from zero after excluding six from 42 datasets extracted from eleven primary studies in an attempt to eliminate effect size heterogeneity. In comparison to the meta-analysis in mathematics, the moderator variables differ: Four moderator variables (study design, sample size, computer delivery algorithm, and computer practice) affected the differences in reading comprehension scores between test media and three other postulated moderator variables (grade level, type of test, and Internet-based testing) had no statistically meaningful influence. Even though, on a mean level no differences between test media could be found for mathematic and reading comprehension, the postulation of differential moderator variables for both disciplines might indicate a capitalization on chance or the relevance of unconsidered moderators (e.g., year of the study). Obviously the small sample of studies in both domains limits the generalizability of the results.
Considering all evidence presented so far, the effects of the test medium on test performance are nil or small. However, meta-analyses on the equivalence of ability measures across test media have a conceptual flaw. In order to allow an adequate assessment of the equivalence of test data across administration modes a comparison of mean scores (and dispersions) is insufficient. Let us explain this point in more detail.
Horkay, Bennett, Allen, and Kaplan (2005) compared the performance of two nationally representative samples in a recent National Assessment of Educational Progress (NAEP) study in writing. One sample took the test on paper, the other sample worked on a computerized version. Albeit, means in both conditions were roughly the same, computer familiarity (consisting of a) hands-on computer proficiency, b) extent of computer use, and c) computer use for writing) added about 11% points over paper writing score to the prediction of writing performance in the PC-condition. Thus, students with greater hands-on skill achieved higher PC-writing scores when holding constant their performance on a paper-and-pencil writing test. Importantly, this difference in the construct measured by both instantiations would have remained undetected if the evaluation was solely based on tests of mean differences. So how does an appropriate procedure to test for equivalence between paper-based and computer-based measures look like?
Let us begin with the distinction between within- and between-subjects-designs in the context of cross-media-equivalence. Within the former the same subjects work both on paper and computer, within the latter different groups of subjects work either on paper or on computer. In both cases a potential test medium effect cannot be established by comparing or analyzing mean differences of the manifest or the latent scores. Strategies often applied in literature are based on the tacit assumption that the sources of within- and cross-media variance are actually the same. However, this assumption has to be tested by analyzing the means, variances and the covariances of the data. For the reflective measurement models used predominantly in educational and psychological measurement, the framework of confirmatory factor analysis (CFA) provides the best method for equivalence testing. In CFA the communality of many manifest variables is explained through a smaller number of underlying latent factors. This is achieved by, first, reproducing the covariance structure of the observed variables with the postulated covariance of a theoretical driven model, and second, evaluating the fit of the variable-reduced model to the empirical data. In case of a within-subject-design, the logic of testing is to check whether or not an additional, test medium specific factor accounts for unexplained variance and affects model fit beneficially. In case of a between-subject-design between-group comparisons and within-group comparisons are possible (for a detailed discussion see Lubke, Dolan, Kelderman, & Mellenbergh, 2003) by using multi-group confirmatory factor analysis (MGCFA).
Imagine the following within-subject-scenario: Subjects are tested on three different media (paper-pencil, notebook-computer, personal digital assistant (PDA)) with three reasoning tests covering the content domains verbal, numerical, and figural. The items might vary across media but could be considered as drawn from an item universe with equally constrained random selection. After completing the three tests on one medium subjects continue with the next medium. Ideally, the reasoning tests are parallel in difficulty and no learning effects occur between the tests. Sequence effects are controlled by having an additional between-groups factor in the design controlling for the six different sequences. As mentioned before, in order to test for equivalence in this example of a within-subject-design, first, a theoretical-based structural model has to be established. In our case this could be a model with three correlated content-specific-reasoning factors (verbal, numerical, figural) or a g-factor model (cp. Wilhelm, 2005). Through the introduction of a nested test medium factor one can try to tap construct-irrelevant variance in the covariance structure. Because the derived model with the nested medium-factor is nested within the original model (Bollen, 1989) the difference in model fit can be assessed with a conventional Chi-square-difference test.
In the multi-trait-multi-method-context different modeling strategies have been proposed to take method-specific variance – like the variance due to a test medium – into account. Figure 1 depicts a model with three traits and three methods – measured with nine indicators totally. Correlation within both traits and methods are allowed, but correlations between traits and methods are restricted to zero. Depending on theoretical considerations a number of competing models could be postulated, for instance, a correlated-trait-uncorrelated-method-model (CTUM-model, Marsh & Grayson, 1995). In case of inconsistent method artifacts on the indicators or an influence that is not unidimensional correlated errors should substitute method factors, resulting in two possible models: correlated-trait-correlated-uniqueness-model (CTCU-model) and the uncorrelated-trait-correlated-uniqueness-model (UTUC-model, Widaman, 1985). In order to solve identification problems with the CTCM model Eid (2000; Eid, Lischetzke, Nussbeck & Trierweiler, 2003) proposed the correlated-trait-correlated-method-minus-one-model (CTC(M-1)-model). In this model one method is chosen as a reference that is excluded from modeling. This implies that the method factors have to be interpreted in comparison to the reference method. In our example it would probably be sensible to choose the paper-pencil-condition as a reference method because we want to establish whether computers make a difference in comparison to the traditional method of using paper-based measures. Both method factors are correlated and this correlation could be interpreted as a computer-literacy-method factor that is orthogonal to the other factors in this model. One advantage of the CTC(M-1)-model is that the variance is totally decomposed into a trait-specific, a method-specific, and an error component. One disadvantage of this model architecture might be that content and method factors are expected to be uncorrelated. Once method factors in the context of ability testing are interpreted it frequently turns out that those method factors might also express individual differences in methods and given the ubiquitous positive manifold amongst ability measures considering these method factors to be orthogonal to other ability factors is implausible. To sum up, in order to ascertain equivalence of data across media in the within-subject-model it is pivotal to check if the introduction of a method factor is required.
In the between-subject-design an extension of CFA – the multi-group confirmatory factor analysis (MGCFA) – is a suitable method to check for equivalence of test data gathered with different test media. If you look on the (overarching) CFA approach in terms of a regression model the prediction of the observed score y for a person j contingent on the latent ability level η through an indicator i on a medium m is described by Y(i,m,j) = τ(i,m) + λ(i,m) ∙ η(m,j) + ε(i,m,j), where τ is the intercept, λ is the factor loading and ε is the residual term. In order to guarantee measurement invariance all these variables have to be equal across test media conditions.
Consider another example of a paper-based test measuring crystallized intelligence. The test is transferred to computers and the researcher is confronted with the question of equivalence of the test across media. The three graphs in figure 2 describe various possible scenarios of measurement invariance for the crystallized intelligence test administered on both media, PP and PC. In the first graph (A) the subtests differ with regard to their slope or factor loadings (λ(i,PP) > λ(i,PC)). In the second graph (B) the difference lies in the intercepts (τ(i,PP) > τ(i,PC); the difference between the intercepts amount to the level of overprediction or underprediction. This situation of constant over- or underprediction independent of the ability level is referred to as uniform bias (Mellenbergh, 1982). In the third graph (C) the variance around the expected value is unequal implying different variances in the residual term (ε(i,PP) ≠ ε(i,PC)). Even though the underlying ability distribution in both groups is the same, unequal model parameters cause differences in the observed scores or, put differently, produce measurement invariance or non-equivalence. A straight forward procedure to assess the level of equivalence across test media is to compare four models in a fixed order, from the least to the most restrictive model.
Continuous variables | Loadings | Intercepts | Residuals | Means |
---|---|---|---|---|
Configural invariance | * | * | * | Fixed at 0 |
Weak factorial invariance | Fixed | * | * | Fixed at 0 |
Residual variance invariance | Fixed | * | Fixed | Fixed at 0 |
Strict factorial invariance | Fixed | Fixed | Fixed | Fixed at 0/* |
Note. The asterisk (*) indicates that the parameter is freely estimated. Fixed = the parameter is fixed to equity across groups; Fixed at 0 = factor means are fixed to 0 in both groups. Fixed at 0/* = factor means are fixed to 0 in the first group and freely estimated in all other groups.
Table 1 lists the different steps in invariance testing. In step 1, testing for configural invariance, all measurement parameters (factor loadings, residual variances, and intercepts) are freely estimated in both conditions (PP and PC). In step 2, metric invariance, models are invariant with respect to their factor loadings whereas all other parameters (residual variances and intercepts) are freely estimated. If measurement invariance is established on this stage, administration mode does not affect the rank order of individuals. This condition is also referred to as metric or weak invariance and is a prerequisite for meaningful cross-group comparisons (Bollen, 1989). In step 3, residual variance invariance, on top of the restrictions in step 2 the residual variances between groups are fixed to equality. In the most restrictive model (step 4) all measurement parameters are equal. If this standard is met strict factorial invariance (Meredith, 1993) holds. Wicherts (2007) explains why – in the last step in testing for strict equivalence – it is essential to allow for differences in factor means while fixing the intercepts to equality. Neglecting this possibility would force any factor mean difference in the group into differences in the intercepts, thus, concealing group differences. Each of the above models is nested within the previous ones, for example, model C derives from model B by imposing additional constraints. Because of this nested character a potential deterioration in model fit is testable through a Chi-square-difference-test. Cheung and Rensvold (2002) evaluated different goodness-of-fit indices with respect to a) their sensitivity to model misfit and b) dependency on model complexity and sample size. Based on a simulation they recommend using Δ(Gamma hat) and Δ(McDonald’s noncentrality index) in addition to Δ(CFI) in order to evaluate measurement invariance. Although multi-group models are the method of choice in the between-subject scenario there are some interesting issues concerning: a) effect-sizes, b) location of invariance violation, c) application to ordinal measures, c) application to ordinal measures, and d) the modeling of non-invariance (Millsap, 2005).
In this chapter two methods have been presented that have a series of advantages over non-factorial approaches and clearly are more adequate than procedures frequently applied in the literature. In this discussion we want to focus on some heuristics on what can be done to achieve equivalence prior to collecting data? Because the testing is influenced by both software (e.g., layout design) and hardware aspects (e.g., display resolution) much effort has been devoted to answer this question from a technological perspective, for example, about the legibility of online texts depending on font characteristics, length and number of lines, and white spacing (cp. Leeson, 2006). Bearing in mind the rapid changes in soft- and hardware it seems hard to give long-lasting advice. Nevertheless, when you are testing on a variety of computers (e.g., using school facilities or testing unproctored via the Internet) try to implement a test environment that is independent of a specific operating system. In order to exclude as few candidates as possible keep the technical aspects of the testing simple. From a psychological perspective chances are enhanced to establish equivalence across test media if the PC-condition is handled and as thoroughly scrutinized as a parallel paper-based test form. However, even a sound construction does not immunize against violations of stronger forms of equivalence. In this case it is inevitable and advisable to account for the additional source of variance. One way to accomplish this is to survey potential moderators like computer experience, accessibility to computers, and hands-on skills. However, as long as we do not know exactly why essential properties of ability measures vary across test media, investigating the equivalence of computer- and paper-based test data is critical.
In a new paper in the European Journal of Psychological Assessment, Timo Gnambs and I examined the soundness of reporting measurement invariance (MI) testing in the context of multigroup confirmatory factor analysis (MGCFA). Of course, there are several good primers on MI testing (e.g., Cheung & Rensvold, 2002; Wicherts & Dolan, 2010) and textbooks that elaborate on the theoretical base (e.g., Millsap, 2011), but a clearly written tutorial with example syntax how to implement MI practically was still missing. In the first part of the paper, we demonstrate that a sobering large amount of reported degrees of freedom (df) do not match with the df recalculated based on information given in the articles. More specifically, we both reviewed 128 studies including 302 measurement invariance MGCFA testing procedures from six leading peer-reviewed journals that focus on psychological assessment and on a regular base. Overall, about a quarter of all articles included at least one discrepancy with some systematic differences between the journals. However, it was interesting to see that the metric and scalar step of invariance testing were more frequently affected.
In the second part of the manuscript, we elaborate on the different restrictions necessary to test configural, metric, scalar, and strict measurement invariance. To this end, we provide syntax in lavaan and Mplus for a) the marker variable method (i.e., setting a factor loading of a marker variable to one), b) the reference group method (i.e., setting the variance of the latent variables to one), and c) the effects-coding method (i.e., constraining the mean of the loadings to one) by Little (2006) . We also identified two typical pitfalls in using these methods: First, in testing metric MI with the reference group method, researchers seem to neglect to free the factor variances, thus, estimating a model with invariant loadings and variances. Second, in scalar MI the factor means are - for the first time in the nested MI testing procedure - freely estimated. However, some researchers keep the constraints on the factor means. Accordingly, potential meaningful group differences can wrongly deteriorate model fit.
In the last part, we give some recommendations which apply to all parties involved in the publication process – authors, reviewers, editors, and publishers:
Familiarize yourself with the constraints of MI testing using different identification strategies and pay attention to the aforementioned pitfalls. Furthermore, we encourage researchers to use the effects-coding method (Little et al., 2006), which allows to estimate and test the factor loadings, variances, and latent means simultaneously. In contrast to other scaling methods, the effects-coding method does not rely on fixing single measurement parameters to identify the scale, which might lead to problems in MI testing if these parameters function differently across groups, but are constrained to be equal.
Describe the measurement model in full detail (i.e., number of indicators, factors, cross-loadings, residual covariances, and groups) and explicitly state which parameters are constrained at the different MI steps, so that it is clear which models are nested within each other. In addition, use unambiguous terminology when referring to specific steps in MI testing. For example, label the invariance step by the parameters that have been fixed (e.g., “invariance of factor loadings” instead of “metric invariance”).
In line with the current efforts of the Open Science Framework (Nosek et al., 2015) to make scientific research more transparent, open, and reproducible, we strongly advocate to make the raw data and the model syntax available in a freely accessible data repository. If legal restrictions or ethical considerations prevent the sharing of raw data, it is possible to create synthesized data sets (Nowok, Raab, & Dibben, 2016). If you want me to cover this method in a future post drop me a line.
We encourage authors and reviewers to routinely use our online tool - https://ulrich-schroeders.de/fixed/df-mgcfa/ - where you can enter the number of indicators, latent variables, groups, etc. to double-check the df of your reported models. In this context, we welcome the recent effort of journals in psychology to include soundness checks on manuscript submission such as statcheck to improve the accuracy of statistical reporting.
In our opinion, the results also indicate that statistical and methodological courses need to be taught more rigorously in university teaching, especially in structured Ph.D. programs. A vigorous training should include both conceptual (e.g., Markus & Borsboom, 2013) and statistical work (Millsap, 2011). To bridge the gap between psychometric researchers and applied working psychologists, a variety of teaching resources can be recommended that introduce invariance testing in general (Cheung & Rensvold, 2002; Wicherts & Dolan, 2010) or specific aspects of MI such as longitudinal MI (Geiser, 2013), MI in higher-order models (Chen, Sousa, & West, 2005), and MI with categorical data (Pendergast, von der Embse, Kilgus, & Eklund, 2017).
Maybe you have seen my recent Tweet:
Please share this call and contribute to a new Special Issue on "New Methods and Assessment Approaches in Intelligence Research" in the @Jintelligence1, we are guest-editing together with Hülür, @HildePsych, and @pdoebler. More information: https://t.co/PevdPeyRgm pic.twitter.com/Y6hRllQa8m
— Ulrich Schroeders (@Navajoc0d3) November 11, 2018
Dear Colleagues,
Our understanding of intelligence has been—and still is—significantly influenced by the development and application of new computational and statistical methods, as well as novel testing procedures. In science, methodological developments typically follow new theoretical ideas. In contrast, great breakthroughs in intelligence research followed the reverse order. For instance, the once-novel factor analytic tools preceded and facilitated new theoretical ideas such as the theory of multiple group factors of intelligence. Therefore, the way we assess and analyze intelligent behavior also shapes the way we think about intelligence.
We want to summarize recent and ongoing methodological advances inspiring intelligence research and facilitating thinking about new theoretical perspectives. This Special Issue will include contributions that:
- take advantage of auxiliary data usually assessed in technology-based assessment (e.g., reaction times, GPS data) or take a mobile sensing approach to enriching traditional intelligence assessment;
- study change or development in (intensive) longitudinal data with time series analysis, refined factor analytic methods, continuous time modeling, dynamic structural equation models, or other advanced methods of longitudinal data modeling;
- examine the structure of and change in cognitive abilities with network analysis and similar recently popularized tools; and
- use supervised and unsupervised machine learning methods to analyze so-called Big Data in the field of intelligence research.
We invite original research articles and tutorials that use and explain the aforementioned and other innovative methods to familiarize readers with new ways to study intelligence. To this end, we appreciate reusable and commented syntax provided as online material. We especially welcome contributions from other disciplines, such as computer science and statistics. For your convenience, we have also compiled a list of free accessible intelligence data sets: https://goo.gl/PGFtv3.
Prof. Dr. Ulrich Schroeders,
Prof. Dr. Gizem Hülür,
Prof. Dr. Andrea Hildebrandt,
Prof. Dr. Philipp Doebler
Guest Editors
Manuscripts can be submitted until the deadline—June, 1st, 2019. If you are interest in contributing to the Special Issue on “New Methods and Assessment Approaches in Intelligence Research”, please send a title and a short abstract (about 100 words) to the Editorial Office of the Journal of Intelligence. Accepted papers will be published continuously in the journal and will be listed together on the special issue website.
We really look forward to exciting submissions!
Our meta-analysis – Steger, Schroeders, & Gnambs (2018) – comparing test-scores of proctored vs. unproctored assessment is now available as online first publication and sometime in the future to be published in the European Journal of Psychological Assessment. In more detail, we examined mean score differences and correlations between both assessment contexts with a three-level random-effects meta-analysis based on 49 studies with 109 effect sizes. We think this is a timely topic since web-based assessments are frequently compromised by a lack of control over the participants’ test-taking behavior, but researchers are nevertheless in the need to compare the data obtained through unproctored test conditions with data from controlled settings. The inevitable question is to what extent such a comparison is feasible.
To prevent participants from cheating, test administrators have invented a ragbag of countermeasures such as honesty contracts or follow-up verification tests which also made it into general testing guidelines (International Test Commission, 2006). For example, test-takers have to sign honesty contracts that threaten to punish cheating before the actual assessment take place to raise participants’ commitment and conscientiousness. So, do you think this strategy pan out? Unfortunately, our moderator analyses showed no significant effect for the implementation of countermeasures against cheating at all. Despite the vast body of research that advocates the implementation of countermeasures, we found no empirical evidence for their effectiveness. It is disheartening to think that they are still treated as cure against cheating in prominent testing guidelines.
However, this post is less about the findings of the meta-analysis, but a request to contribute your research. Although we did our best, to cover the complete range of studies - for example, by sending a request to various mailing lists (DGPs, AIR, GIR-L), ResearchGate, and the Psych Meth Discussion Group on Facebook - we think there still might be some studies out there that felt through the grid. And there are some interesting new development of interactive meta-analysis such as MetaLab we want to try out. Thus, we reached out to our fellow researchers on ResearchGate kindly asking for their assistance. And I want to reiterate the call on my website:
Dear colleagues,
It may seem like we can—now that the paper is done and published—file away this project and move on to new adventures. But truth be told, we’re still curious! So, in case you have a study (published, forthcoming, or in your file drawer) that fits our inclusion criteria, please let us know!
Here are our main two inclusion criteria again:
The study…
(a) reported a comparison of test scores obtained in a (remotely) proctored setting (e.g., lab, supervised classroom or test center) versus an unproctored setting (e.g., web-based, unsupervised assessment at home),
and
(b) administered cognitive ability measures.
If you think your study also met the criteria, then give us a hint. We are looking forward to hearing from you!
We’ll keep you posted of any new developments and insights!
Diana Steger, Ulrich Schroeders, & Timo Gnambs
You can contact any author you like.
In case of doubt, please write to Diana. 🙂
In a recent paper – Edossa, Schroeders, Weinert, & Artelt, 2018 – we came across the issue of longitudinal measurement invariance testing with categorical data. There are quite good primers and textbooks on longitudinal measurement invariance testing with continuous data (e.g., Geiser, 2013). However, at the time of writing the manuscript there wasn’t an application of measurement invariance testing in the longitudinal run with categorical data. In case your are interest in using such an invariance testing procedure, we uploaded the R syntax for all measurement invariance steps.
Basically, the testing procedure has the same parameter restrictions as the cross-sectional multi-group confirmatory factor analysis (for more information, see Schroeders & Wilhelm, 2011), with the exception of additional residual correlations across time points. In contrast, to measurement invariance testing with continuous data, one of the main difference is that with categorical data thresholds and factor loadings have to be varied in tandem (see Table 1), which is not always acknowledged. Consequently, the step of metric measurement invariance testing is dropped.
Continuous variables | Loadings | Intercepts | Residuals | Means |
---|---|---|---|---|
Configural invariance | * | * | * | Fixed at 0 |
Weak invariance | Fixed | * | * | Fixed at 0 |
Strong invariance | Fixed | Fixed | * | Fixed at 0/* |
Strict invariance | Fixed | Fixed | Fixed | Fixed at 0/* |
Categorical variables | Loadings | Thresholds | Residuals | Means |
Configural invariance | (* | *) | Fixed at 1 | Fixed at 0 |
Strong invariance | (Fixed | Fixed) | Fixed at 1/* | Fixed at 0/* |
Strict invariance | (Fixed | Fixed) | Fixed at 1 | Fixed at 0/* |
Note. The asterisk (*) indicates that the parameter is freely estimated. Fixed = the parameter is fixed to equity over time points; Fixed at 1 = the residual variances are fixed to 1 at all time points; Fixed at 0 = factor means are fixed at 0 at all time points. Fixed at 0/* = factor means are fixed at 0 at the first time point and freely estimated at the other time points. Fixed at 1/* = the residual variances are fixed to 1 at the first time point and freely estimated at the other time points. Parameters in parentheses need to be varied in tandem.
During the revision of the manuscript, Liu et al. (2016) published another approach for longitudinal measurement invariance testing with ordered-categorical data in Psychological Methods, which actually yields the same results and df.
I had the great chance to co-author two recent publications of Timo Gnambs, both dealing with the Rosenberg Self-Esteem Scale (RSES; Rosenberg, 1965). As a reminder, the RSES is a popular ten item self-report instrument measuring a respondent’s global self-worth and self-respect. But basically both papers are not about the RSES per se, rather they are applications of two recently introduced powerful and flexible extensions of the Structural Equation Modeling (SEM) Framework: Meta-Analytic Structural Equation Modeling (MASEM) and Local Weighted Structural Equation Modeling (LSEM), which will be described in more detail later on.
As a preliminary notion, a great deal of the psychological assessment literature is intrigued (or obsessed) with the question of dimensionality of a psychological measure. If you take a look at the table of content of any assessment journal (e.g., Assessment, European Journal of Psychological Assessment, Psychological Assessment), you’ll find several publications dealing with the factor structure of a psychological instrument. That’s in the nature of things. Or, as Reise, Moore, and Haviland (2010) formulated:
The application of psychological measures often results in item response data that arguably are consistent with both unidimensional (a single common factor) and multidimensional latent structures (typically caused by parcels of items that tap similar content domains). As such, structural ambiguity leads to seemingly endless “confirmatory” factor analytic studies in which the research question is whether scale scores can be interpreted as reflecting variation on a single trait.
To break this vicious circle, Reise et al. (2010) re-discovered the bi-factor model. Besides the appropriate way of modeling, another issue is that factors moderating the structure are left out.
MASEM is the integration of two techniques—Meta-Analysis and Structural Equation Modeling—that have a long-standing tradition, but with limited exchange between both disciplines. There are more or less technically written primers on MASEM (Cheung, 2015a, Cheung & Chang, 2005, Cheung & Cheung, 2016) and of course the books by Mike Cheung (2015b) and Suzanne Jak (2015), but the basic idea is rather simple. MASEM is a two-stage approach: In a first step, the correlation coefficients have to be extracted from primary studies, which are meta-analytically combined into a pooled correlation matrix. Often, this correlation is simply taken as input for a structural question model, but this approach is flawed in several ways (see Cheung & Chang, 2005 for a full account). For example, often the combined correlation matrix is equated with a variance-covariance, which leads to biased fit statistics and parameter estimates (Cudeck, 1989). Another issue concerns determination of the sample size, which is usually done by calculating the mean of the individual sample sizes. However, correlations that are based on more studies are estimated with more precision and should have a larger impact. The two-stage MASEM approach described by Cheung tackle these issue.
I think there are three assets of this study worth mentioning:
Reference. Gnambs, T., Scharl, A., & Schroeders, U. (2018). The structure of the Rosenberg Self-Esteem Scale: A cross-cultural meta-analysis. Zeitschrift für Psychologie, 226, 14–29. doi:10.1027/2151-2604/a000317
Abstract. The Rosenberg Self-Esteem Scale (RSES; Rosenberg, 1965) intends to measure a single dominant factor representing global self-esteem. However, several studies have identified some form of multidimensionality for the RSES. Therefore, we examined the factor structure of the RSES with a fixed-effects meta-analytic structural equation modeling approach including 113 independent samples (N = 140,671). A confirmatory bifactor model with specific factors for positively and negatively worded items and a general self-esteem factor fitted best. However, the general factor captured most of the explained common variance in the RSES, whereas the specific factors accounted for less than 15%. The general factor loadings were invariant across samples from the United States and other highly individualistic countries, but lower for less individualistic countries. Thus, although the RSES essentially represents a unidimensional scale, cross-cultural comparisons might not be justified because the cultural background of the respondents affects the interpretation of the items.
Local Weighted Structural Equation Modeling (LSEM) is a recently development SEM technique (Hildebrandt et al., 2009; Hildebrandt et al., 2016) to study invariance of model parameters over a continuous context variable such as age. Frequently, the influence of context variables on parameters in a SEM is studied by introducing a categorical moderator variable and by applying multiple-group mean and covariance structure (MGMCS) analyses. In MGMCS, certain measurement parameters are fixed to be equal across groups to test for different forms of measurement invariance. LSEM, however, allows studying variance–covariance structures depending on a continuous context variable.
There are several conceptual and statistical issues in categorizing context variables that are continuous in nature (see also MacCallum, Zhang, Preacher, & Rucker, 2002). First, in the framework of MGMCS, building subgroups increases the risk of overlooking nonlinear trends and interaction effects (Hildebrandt et al., 2016). Second, categorization leads to a loss in information on individual differences within a given group. When observations that differ across the range of a continuous variable are grouped, variation within those groups cannot be detected. Third, setting cutoffs and cutting a distribution of a moderator into several parts is frequently arbitrary. Thus, neither the number of groups nor the ranges of the context variables are unique. Critically, the selected ranges can influence the results (Hildebrandt et al., 2009; MacCallum et al., 2002). In summary, from a psychometric viewpoint there are several advantages in using LSEM to study measurement invariance depending on a continuous moderator variable.
In principle, LSEMs are traditional SEMs that weight observations around focal points (i.e., specific values of the continuous moderator variable) with a kernel function. The core idea is that observations near the focal point provide more information for the corresponding SEM than more distant observations, which is also depicted in the figure. Observations exactly at the focal point receive a weight of 1; observations with moderator values higher or lower than the focal point receive smaller weights. For example, if the difference between the focal point and moderator is |1/3|, the weight is about .50 (see the gray dashed lines).
Again, I want to mention two highlights of the paper:
Reference. Gnambs, T. & Schroeders, U. (2017). Cognitive abilities explain wording effects in the Rosenberg Self-Esteem Scale. Assessment. Advance online publication. doi:10.1177/1073191117746503
Abstract. There is consensus that the 10 items of the Rosenberg Self-Esteem Scale (RSES) reflect wording effects resulting from positively and negatively keyed items. The present study examined the effects of cognitive abilities on the factor structure of the RSES with a novel, nonparametric latent variable technique called local structural equation models. In a nationally representative German large-scale assessment including 12,437 students competing measurement models for the RSES were compared: a bifactor model with a common factor and a specific factor for all negatively worded items had an optimal fit. Local structural equation models showed that the unidimensionality of the scale increased with higher levels of reading competence and reasoning, while the proportion of variance attributed to the negatively keyed items declined. Wording effects on the factor structure of the RSES seem to represent a response style artifact associated with cognitive abilities.
In both studies, we argue for a bifactor model with a common RSES factor and a specific factor for all negatively worded items. Some more complex model (i.e., additional residual correlation between two of the positively worded items 3 and 4) had a better fit, but was totally data-driven. However, the model with the nested method factor for the negatively worded items is perfectly in line with the previous literature (e.g., Motl & DiStefano, 2002).
After several years running this website on WordPress, it’s time for a change. WordPress has become overloaded, sometimes the back-end is not responsive, and writing a blog post is too tedious—in a nutshell, WordPress isn’t right for me. Hugo is an open-source static site generator built around Google’s Go programming language, which is renowned for its speed. In contrast to dynamic websites that heavily rely on php-scripting and MySQL-databases that are used to store all the content, static websites consist of html, css, and js. Making static websites sounds retro, but is in fact up-to-date and comes with a lot of benefits—in short, it’s right for me.
This principle that the content shouldn’t be governed by the form, implies that content and layout are separated to a maximum degree. All the flashy things such as a Twitter time line, commentaries, etc. are nice to have, but they also distract from the essential purpose of this site, that is, providing some ideas about research on psychological assessment. Thus, from today, it is to renounce the formal aspects and spread the word. Hopefully, the change will help me writing posts on a more regular base.
I really love notepad++. All zero drafts of my papers (for the concept please take a look at Joan Bolker’s amazing and helpful primer on academic writing) are written in plain ASCII-text because of its simplicity and the text-per-page ratio. Using a simple text-editor instead of an overblown word processor (even the word is ridiculous) brings you faster into a Hemingway mode of writing (i.e., “write drunk, edit sober”). For the first revision of text files, I turn on the Notepad++ plugin DSpellCeck to get rid of typos. I also like writing in Markdown, a lightweight markup language because it is easy and intuitive, especially when you start your electronic life at the rise of the FidoNet and your favorite text tool was the MS-DOS version of WordPerfect 5.1. By the way, there’s is another Notepad++ plug-in called MarkdownViewer++ to preview your markdown document.
One of the biggest advantages of static websites is speed. According to PageSpeed Tools the computer version of the website scores 90/100 (mobile version 74/100) out of the box. This is pretty fast. Building all files takes Hugo less than 2 seconds, and uploading the < 4 MB of content including images to the server takes approximately 30 seconds. Lightning fast also refers to the fact that you can easily write new blog posts, no logging in, layout checking, and tweaking—simply write. Moreover, static websites are much more secure because they are prebuilt without any server code running on your website. Thus, it is immune to php vulnerabilities such as SQL injection or session hijacking.
Hugo provides all the basic functionality of a modern website. That means, syntax highlighting is built in, without the need to install additional plug-ins and to adopt css-files. It’s easy to build multi-language sites; just make an additional md-file with the language coded added to the file name (i.e., intro.md
for the default English version and intro.de.md
for the German version of the same page). Supporting multiple languages in WordPress has always been pain, even with decent non-commercial plug-in qTranslate X. Through Shortcodes you can add a lot more functionality: you can share slides of your talks via Speaker Deck, add Instagram photos, Twitter posts, Vimeo videos, etc. It is also possible to comment on your static website using Disqus, a third-party service that provides comment and community capabilities to static websites via js.
Building your website with Hugo is Set it and forget it! You neither have to update your Wordpress installation nor your plug-ins or themes. The latter was especially annoying when updating a theme overwrites your optimized css-file. Moreover, you do not have to manually back-up your content and plug-ins, since your complete website is stored in a human readable format on your local hard drive. Excellent!
…Hugo is the static website generator of my choice. But, of course, there are several static website generators out there such as Jekyll or Octopress (for a contemporary list). Let’s give it a try.