Correlational effect size benchmarks.
Frank A. Bosco, Herman Aguinis, Kulraj Singh, James G. Field, and Charles A. Pierce
Abstract:
Effect size information is essential for the scientific enterprise and plays an increasingly central role in the scientific process. We extracted 147,328 correlations and developed a hierarchical taxonomy of variables reported in Journal of Applied Psychology and Personnel Psychology from 1980 to 2010 to produce empirical effect size benchmarks at the omnibus level, for 20 common research domains, and for an even finer-grained level of generality. Results indicate that the usual interpretation and classification of effect sizes as small, medium, and large bear almost no resemblance to findings in the field, because distributions of effect sizes exhibit tertile partitions at values approximately one-half to one-third those intuited by Cohen (1988). Our results offer information that can be used for research planning and design purposes, such as producing better-informed non-nil hypotheses and estimating statistical power and planning sample size accordingly. We also offer information useful for understanding the relative importance of the effect sizes found in a particular study in relation to others and which research domains have advanced more or less, given that larger effect sizes indicate a better understanding of a phenomenon. Our study also offers information about research domains for which the investigation of moderating effects may be more fruitful and provides information that is likely to facilitate the implementation of Bayesian analysis. Finally, our study offers information that practitioners can use to evaluate the relative effectiveness of various types of interventions.
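A minimal R sketch of the benchmarking logic described above, assuming a hypothetical set of extracted correlations (the values below are simulated stand-ins, not the study's data): empirical benchmarks are the tertile cut points of the observed distribution of |r|, which can then be set against Cohen's (1988) conventions.

set.seed(1)
r_values <- abs(rnorm(10000, mean = 0.16, sd = 0.12))  # simulated stand-in for extracted |r| values
empirical_cuts <- quantile(r_values, probs = c(1/3, 2/3))  # cut points that partition the distribution into thirds
cohen_cuts <- c(small = 0.10, medium = 0.30, large = 0.50)  # Cohen's (1988) conventions for r
round(empirical_cuts, 2)
cohen_cuts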
Improving the measurement of group-level constructs by optimizing between-group differentiation.
Paul D. Bliese, Mark A. Maltarich, Jonathan L. Hendricks, David A. Hofmann, and Amy B. Adler
Abstract:
The ability to detect differences between groups partially impacts how useful a group-level variable will be for subsequent analyses. Direct consensus and referent-shift consensus group-level constructs are often measured by aggregating group member responses to multi-item scales. We show that current measurement validation practice for these group-level constructs may not be optimized with respect to differentiating groups. More specifically, a 10-year review of multilevel articles in top journals reveals that multilevel measurement validation primarily relies on procedures designed for individual-level constructs. These procedures likely miss important information about how well each specific scale item differentiates between groups. We propose that group-level measurement validation be augmented with information about each scale item’s ability to differentiate groups. Using previously published datasets, we demonstrate how ICC(1) estimates for each item of a scale provide unique information and can produce group-level scales with higher ICC(1) values that enhance predictive validity. We recommend that researchers supplement conventional measurement validation information with information about item-level ICC(1) values when developing or modifying scales to assess group-level constructs.
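A minimal R sketch of the item-level diagnostic the authors recommend, assuming a hypothetical wide-format data frame with one row per respondent, a grouping column, and one column per scale item. ICC(1) is computed for each item from a one-way ANOVA using the standard (MSB - MSW) / (MSB + (k - 1) * MSW) formula, with k taken here as the average group size.

icc1_by_item <- function(data, items, group) {
  k <- mean(table(data[[group]]))  # average group size (a simplification for unequal groups)
  sapply(items, function(it) {
    fit <- aov(data[[it]] ~ as.factor(data[[group]]))
    ms  <- summary(fit)[[1]][["Mean Sq"]]  # between-group then within-group mean squares
    (ms[1] - ms[2]) / (ms[1] + (k - 1) * ms[2])
  })
}

# Hypothetical usage: d has a grouping column "team" and items "item1" to "item5".
# icc1_by_item(d, items = paste0("item", 1:5), group = "team")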
The accuracy of dominance analysis as a metric to assess relative importance: The joint impact of sampling error variance and measurement unreliability.
Michael T. Braun, Patrick D. Converse, and Frederick L. Oswald
Abstract:
Dominance analysis (DA) has been established as a useful tool for practitioners and researchers to identify the relative importance of predictors in a linear regression. This article examines the joint impact of two common and pervasive artifacts—sampling error variance and measurement unreliability—on the accuracy of DA. We present Monte Carlo simulations that detail the decrease in the accuracy of DA in the presence of these artifacts, highlighting the practical extent of the inferential mistakes that can be made. Then, we detail and provide a user-friendly program in R (R Core Team, 2017) for estimating the effects of sampling error variance and unreliability on DA. Finally, by way of a detailed example, we provide specific recommendations for how researchers and practitioners should more appropriately interpret and report results of DA.
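The article's own R program is not reproduced here; the sketch below, with hypothetical variable and function names, illustrates the general dominance weights that DA rests on: each predictor's incremental R-squared is averaged over all subsets of the remaining predictors, within and then across subset sizes.

general_dominance <- function(data, y, predictors) {
  r2 <- function(preds) {
    rhs <- if (length(preds) == 0) "1" else paste(preds, collapse = " + ")
    summary(lm(as.formula(paste(y, "~", rhs)), data = data))$r.squared
  }
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    subsets <- list(character(0))  # start with the empty subset
    for (m in seq_along(others))
      subsets <- c(subsets, combn(others, m, simplify = FALSE))
    inc <- sapply(subsets, function(s) r2(c(s, p)) - r2(s))  # incremental R^2 of p given each subset
    sizes <- sapply(subsets, length)
    mean(tapply(inc, sizes, mean))  # average within each subset size, then across sizes
  })
}

# Hypothetical usage:
# general_dominance(d, y = "performance", predictors = c("x1", "x2", "x3"))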
What predicts within-person variance in applied psychology constructs? An empirical examination.
Nathan P. Podsakoff, Trevor M. Spoelma, Nitya Chawla, and Allison S. Gabriel
Abstract:
The attention paid to intraindividual phenomena in applied psychology has rapidly increased during the last two decades. However, the design characteristics of studies using daily experience sampling methods and the proportion of within-person variance in the measures employed in these studies vary substantially. This raises a critical question yet to be addressed: are differences in the proportion of variance attributable to within- versus between-person factors dependent on construct-, measure-, design-, and/or sample-related characteristics? A multilevel analysis based on 1,051,808 within-person observations reported in 222 intraindividual empirical studies indicated that decisions about what to study (construct type), how to study it (measurement and design characteristics), and from whom to obtain the data (sample characteristics) predicted the proportion of variance attributable to within-person factors. We conclude with implications and recommendations for those conducting and reviewing applied intraindividual research.
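A minimal sketch of the variance partition the study models, using simulated stand-in data rather than any reviewed study's measures: an intercept-only multilevel model (here via the lme4 package) splits total variance into between-person and within-person components, and the within-person proportion is the quantity summarized above.

library(lme4)

set.seed(42)  # stand-in experience sampling data: 100 people with 10 reports each
esm_data <- data.frame(
  person_id = rep(1:100, each = 10),
  y = rep(rnorm(100), each = 10) + rnorm(1000)
)

fit <- lmer(y ~ 1 + (1 | person_id), data = esm_data)
vc  <- as.data.frame(VarCorr(fit))
between <- vc$vcov[vc$grp == "person_id"]
within  <- vc$vcov[vc$grp == "Residual"]
within / (between + within)  # proportion of variance attributable to within-person factors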
Modeling congruence in organizational research with latent moderated structural equations.
Rong Su, Qi Zhang, Yaowu Liu, and Louis Tay
Abstract:
A growing volume of research has used polynomial regression analysis (PRA) to examine congruence effects in a broad range of organizational phenomena. However, conclusions from congruence studies, even ones using the same theoretical framework, vary substantially. We argue that conflicting findings from congruence research can be attributable to several methodological artifacts, including measurement error, collinearity among predictors, and sampling error. These methodological artifacts can significantly affect the estimation accuracy of PRA and undermine the validity of conclusions from primary studies as well as meta-analytic reviews of congruence research. We introduce two alternative methods that address this concern by modeling congruence within a latent variable framework: latent moderated structural equations (LMS) and reliability-corrected single-indicator LMS (SI-LMS). Using a large-scale simulation study with 6,322 conditions and close to 1.9 million replications, we showed how methodological artifacts affected the performance of PRA, specifically, its (un)biasedness, precision, Type I error rate, and power in estimating linear, quadratic, and interaction effects. We also demonstrated the substantial advantages of LMS and SI-LMS compared with PRA in providing accurate and precise estimates, particularly under undesirable conditions. Based on these findings, we discuss how these new methods can help researchers find more consistent effects and draw more meaningful theoretical conclusions in future research. We offer practical recommendations regarding study design, model selection, and sample size planning. In addition, we provide example syntax to facilitate the application of LMS and SI-LMS in congruence research.
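For orientation, the sketch below shows the standard PRA baseline the article evaluates, fit to simulated stand-in data; the LMS and SI-LMS alternatives are estimated in latent variable software and are not reproduced here.

set.seed(7)
n <- 500
x <- rnorm(n)  # e.g., person score (hypothetical)
y <- rnorm(n)  # e.g., environment score (hypothetical)
z <- 0.3 * x + 0.3 * y - 0.2 * (x - y)^2 + rnorm(n)  # outcome with a congruence-type effect

x_c <- as.numeric(scale(x, scale = FALSE))  # center predictors before forming higher-order terms
y_c <- as.numeric(scale(y, scale = FALSE))
pra <- lm(z ~ x_c + y_c + I(x_c^2) + I(x_c * y_c) + I(y_c^2))
summary(pra)  # the five coefficients used to evaluate surfaces along the x = y line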
The implications of unconfounding multisource performance ratings.
Duncan J. R. Jackson, George Michaelides, Chris Dewberry, Benjamin Schwencke, and Simon Toms
Abstract:
The multifaceted structure of multisource job performance ratings has been a subject of research and debate for over 30 years. However, progress in the field has been hampered by the confounding of effects relevant to the measurement design of multisource ratings and, as a consequence, the impact of ratee-, rater-, source-, and dimension-related effects on the reliability of multisource ratings remains unclear. In separate samples obtained from 2 different applications and measurement designs (N₁ [ratees] = 392, N₁ [raters] = 1,495; N₂ [ratees] = 342, N₂ [raters] = 2,636), we, for the first time, unconfounded all systematic effects commonly cited as being relevant to multisource ratings using a Bayesian generalizability theory approach. Our results suggest that the main contributors to the reliability of multisource ratings are source-related and general performance effects that are independent of dimension-related effects. In light of our findings, we discuss the interpretation and application of multisource ratings in organizational contexts.
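As a loose frequentist analogue only (not the authors' Bayesian generalizability model, and with a hypothetical data layout and column names), crossed random effects illustrate how ratee-, rater-, source-, and dimension-related variance components can be separated when the measurement design supports it.

library(lme4)

msr_variance_components <- function(data) {
  fit <- lmer(
    rating ~ 1 + (1 | ratee) + (1 | rater) + (1 | source) + (1 | dimension) +
      (1 | ratee:source) + (1 | ratee:dimension),
    data = data
  )
  as.data.frame(VarCorr(fit))  # one variance estimate per systematic effect plus residual
}

# Hypothetical usage with a long-format data frame of one row per rating:
# msr_variance_components(multisource_ratings)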
Selecting response anchors with equal intervals for summated rating scales.
Camron W. Casper, Bryan D. Edwards, Craig J. Wallace, Ronald S. Landis, and Dustin A. Fife
Abstract:
Summated rating scales are ubiquitous in organizational research, and there are well-delineated guidelines for scale development (e.g., Hinkin, 1998). Nevertheless, there has been less research on the explicit selection of the response anchors. Constructing survey questions with equal-interval properties (i.e., interval or ratio data) is important if researchers plan to analyze their data using parametric statistics. As such, the primary objectives of the current study were to (a) determine the most common contexts in which summated rating scales are used (e.g., agreement, similarity, frequency, amount, and judgment), (b) determine the most commonly used anchors (e.g., strongly disagree, often, very good), and (c) provide empirical data on the conceptual distance between these anchors. We present the mean and standard deviation of scores for estimates of each anchor and the percentage of distribution overlap between the anchors. Our results provide researchers with data that can be used to guide the selection of verbal anchors with equal-interval properties so as to reduce measurement error and improve confidence in the results of subsequent analyses. We also conducted multiple empirical studies to examine the consequences of measuring constructs with unequal-interval anchors. A clear pattern of results is that correlations involving unequal-interval anchors are consistently weaker than correlations involving equal-interval anchors.
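A minimal sketch of the overlap summary described above, assuming respondents' numeric estimates of two verbal anchors are approximately normal; the means and standard deviations below are hypothetical stand-ins, not values from the study.

anchor_overlap <- function(m1, s1, m2, s2) {
  f <- function(x) pmin(dnorm(x, m1, s1), dnorm(x, m2, s2))
  integrate(f, lower = -Inf, upper = Inf)$value  # proportion of shared area under the two densities
}

anchor_overlap(m1 = 3.1, s1 = 0.6, m2 = 3.4, s2 = 0.7)  # hypothetical pair of adjacent anchors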
Modeling (in)congruence between dependent variables: The directional and nondirectional difference (DNDD) framework.
Timothy C. Bednall and Yucheng Zhang
Abstract:
This article proposes a new approach to modeling the antecedents of incongruence between 2 dependent variables. In this approach, incongruence is decomposed into 2 orthogonal components representing directional and nondirectional difference (DNDD). Nondirectional difference is further divided into components representing shared and unique variability. We review previous approaches to modeling antecedents of difference, including the use of arithmetic, absolute, and squared differences, as well as the approaches of Edwards (1995) and Cheung (2009). Based on 2 studies, we demonstrate the advantages of the DNDD approach compared with other methods. In the first study, we use a Monte Carlo simulation to demonstrate the circumstances under which each type of difference arises, and we compare the insights revealed by each approach. In the second study, we provide an illustrative example of the DNDD approach using a field dataset. In the discussion, we review the strengths and limitations of our approach and propose several practical applications. Our article also proposes 2 extensions to the basic DNDD approach: modeling difference with a known target or “true” value, and using multilevel analysis to model nondirectional difference with exchangeable ratings.
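For context, the sketch below uses simulated stand-in variables to illustrate the conventional difference scores the article reviews (arithmetic, absolute, squared) as outcomes of an antecedent; the DNDD components themselves are not reproduced here.

set.seed(11)
n <- 300
x  <- rnorm(n)             # hypothetical antecedent
y1 <- 0.4 * x + rnorm(n)   # two dependent variables whose (in)congruence is of interest
y2 <- 0.1 * x + rnorm(n)

d_arith <- y1 - y2         # directional (signed) difference
d_abs   <- abs(y1 - y2)    # one conventional nondirectional operationalization
d_sq    <- (y1 - y2)^2     # another conventional nondirectional operationalization

summary(lm(d_arith ~ x))   # antecedent predicting directional difference
summary(lm(d_abs ~ x))     # antecedent predicting nondirectional difference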
Interpreting moderated multiple regression: A comment on Van Iddekinge, Aguinis, Mackey, and DeOrtentiis (2018).
Jeffrey B. Vancouver, Bruce W. Carlson, Lindsay Y. Dhanani, and Cassandra E. Colton
Abstract:
When data contradict theory, data usually win. Yet, the conclusion of Van Iddekinge, Aguinis, Mackey, and DeOrtentiis (2018) that performance is an additive rather than multiplicative function of ability and motivation may not be valid, despite applying a meta-analytic lens to the issue. We argue that the conclusion was likely reached because of a common error in the interpretation of moderated multiple-regression results. A Monte Carlo study is presented to illustrate the issue, which is that moderated multiple regression is useful for detecting the presence of moderation but typically cannot be used to determine whether or to what degree the constructs have independent or nonjoint (i.e., additive) effects beyond the joint (i.e., multiplicative) effect. Moreover, we argue that the practice of interpreting the incremental contribution of the interaction term when added to the first-order terms as an effect size is inappropriate, unless the interaction is perfectly symmetrical (i.e., X-shaped), because of the partialing procedure that moderated multiple regression uses. We discuss the importance of correctly specifying models of performance as well as methods that might facilitate drawing valid conclusions about theories with hypothesized multiplicative functions. We conclude with a recommendation to fit the entire moderated multiple-regression model in a single step rather than in separate steps, to avoid the interpretation error highlighted in this article.
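A minimal simulation in the spirit of the Monte Carlo argument above (all parameter values are arbitrary illustrations): when performance is generated as a purely multiplicative function of ability and motivation with nonzero means, the first-order terms alone capture most of the variance, so a small increment for the product term cannot by itself establish additive effects.

set.seed(99)
n <- 10000
ability    <- rnorm(n, mean = 2)  # nonzero means make the product correlate with its components
motivation <- rnorm(n, mean = 2)
performance <- ability * motivation + rnorm(n)  # purely multiplicative data-generating process

additive <- lm(performance ~ ability + motivation)
full     <- lm(performance ~ ability * motivation)

summary(additive)$r.squared                              # large despite no additive effects in the process
summary(full)$r.squared - summary(additive)$r.squared    # comparatively modest increment for the product term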
Assessing and interpreting interaction effects: A reply to Vancouver, Carlson, Dhanani, and Colton (2021).
Chad H. Van Iddekinge, Herman Aguinis, James M. LeBreton, Jeremy D. Mackey, and Philip S. DeOrtentiis
Abstract:
Van Iddekinge et al.'s (2018) meta-analysis revealed that ability and motivation have mostly an additive rather than an interactive effect on performance. One of the methods they used to assess the ability × motivation interaction was moderated multiple regression (MMR). Vancouver et al. (2021) presented conceptual arguments that ability and motivation should interact to predict performance, as well as analytical and empirical arguments against the use of MMR to assess interaction effects. We describe problems with these arguments and show conceptually and empirically that MMR (and the ΔR and ΔR² it yields) is an appropriate and effective method for assessing both the statistical significance and magnitude of interaction effects. Nevertheless, we also applied the alternative approach Vancouver et al. recommended for testing interactions to primary data sets (k = 69) from Van Iddekinge et al. These new results showed that the ability × motivation interaction was not significant in 90% of the analyses, which corroborated Van Iddekinge et al.'s original conclusion that the interaction rarely increments the prediction of performance beyond the additive effects of ability and motivation. In short, Van Iddekinge et al.'s conclusions remain unchanged and, given the conceptual and empirical problems we identified, we cannot endorse Vancouver et al.'s recommendation to change how researchers test interactions. We conclude by offering suggestions for how to assess and interpret interactions in future research.
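For reference, a minimal sketch of the MMR procedure being defended, applied to simulated stand-in data: the increment in R-squared from adding the product term, with its significance from a nested model comparison.

set.seed(2021)
n <- 500
ability    <- rnorm(n)
motivation <- rnorm(n)
performance <- 0.5 * ability + 0.3 * motivation + rnorm(n)  # additive stand-in process

step1 <- lm(performance ~ ability + motivation)
step2 <- lm(performance ~ ability + motivation + ability:motivation)

summary(step2)$r.squared - summary(step1)$r.squared  # delta R^2 for the interaction term
anova(step1, step2)                                  # F test of the interaction's incremental contribution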
Best Practices in Data Collection and Preparation: Recommendations for Reviewers, Editors, and Authors
Herman Aguinis, N. Sharon Hill, and James R. Bailey
Abstract:
We offer best-practice recommendations for journal reviewers, editors, and authors regarding data collection and preparation. Our recommendations are applicable to research adopting different epistemological and ontological perspectives—including both quantitative and qualitative approaches—as well as research addressing micro (i.e., individuals, teams) and macro (i.e., organizations, industries) levels of analysis. Our recommendations regarding data collection address (a) type of research design, (b) control variables, (c) sampling procedures, and (d) missing data management. Our recommendations regarding data preparation address (e) outlier management, (f) use of corrections for statistical and methodological artifacts, and (g) data transformations. Our recommendations address best practices as well as transparency issues. The formal implementation of our recommendations in the manuscript review process will likely motivate authors to increase transparency because failure to disclose necessary information may lead to a manuscript rejection decision. Also, reviewers can use our recommendations for developmental purposes to highlight which particular issues should be improved in a revised version of a manuscript and in future research. Taken together, the implementation of our recommendations in the form of checklists can help address current challenges regarding results and inferential reproducibility as well as enhance the credibility, trustworthiness, and usefulness of the scholarly knowledge that is produced.
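A minimal illustration, not drawn from the article, of reporting two of the data-preparation steps covered by the recommendations (missing data management and outlier management) transparently rather than applying them silently; the data and the |z| > 3 cutoff are arbitrary stand-ins, not prescribed values.

set.seed(5)
d <- data.frame(x = c(rnorm(98), NA, 8), y = rnorm(100))  # simulated stand-in data

colSums(is.na(d))                                         # report missingness per variable
z <- (d$x - mean(d$x, na.rm = TRUE)) / sd(d$x, na.rm = TRUE)
which(abs(z) > 3)                                         # flag (rather than silently delete) potential outliers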