What term refers to established criteria for comparing individual scores on a test?

Assessment

Yiyun Shou, ... Hui-Fang Chen, in Comprehensive Clinical Psychology (Second Edition), 2022

4.02.5.1.3 Criterion Validity

Criterion validity indicates how well the scores or responses of a test converge with criterion variables with which the test is supposed to converge (Cronbach and Meehl, 1955). There are numerous uses and definitions of criterion validity in the literature, depending on how one defines “criterion variables”, and the term is often conflated with several other key validity terms such as concurrent validity or convergent validity. For the purposes of the present article, criterion variables are defined as other measures of the same construct, conceptually relevant constructs, or conceptually relevant behaviors or performances. Criterion validity is concurrent when the scores of a test and the criterion variables are obtained at the same time (often called concurrent validity), or predictive/postdictive when the criterion variables are measured after/before the current test (often called predictive/postdictive validity) (Grimm and Widaman, 2012).

Criterion validity can be tested in various situations and for various purposes. For example, a psychologist may wish to propose a shorter form of a test to replace the original, longer test for the purpose of greater time efficiency. Criterion and concurrent validity of the short form can be demonstrated by its correlation with the original test. A psychologist may wish to evaluate a self-report test of a mental disorder, and the concurrent validity of the test can be assessed by comparing the scores of the test with a clinical diagnosis that is made at the same time. For predictive criterion validity, researchers often examine how the results of a test, such as intelligence or depression, predict a highly relevant outcome, such as educational achievement or suicide attempts, assessed at some point in the future. Bivariate correlations between the scores of the test and criterion variables are often used to evaluate criterion validity, whereas regression methods can be used if researchers would like to control for background variables, such as gender and age, when examining how well the test scores predict the criterion variable, as in the sketch below.
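
The following is a minimal sketch of both approaches mentioned above: a zero-order correlation between test and criterion, and a regression that controls for background variables. The data file and column names (test_score, criterion, age, gender) are hypothetical, for illustration only.

```python
# Minimal sketch of evaluating criterion validity, assuming a CSV with
# hypothetical columns: test_score (the new test), criterion (the
# criterion variable), age, and gender.
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

df = pd.read_csv("scores.csv")  # hypothetical data file

# Bivariate (zero-order) criterion validity: correlate test and criterion.
r, p = stats.pearsonr(df["test_score"], df["criterion"])
print(f"Pearson r = {r:.2f}, p = {p:.4f}")

# Regression approach: does the test still predict the criterion after
# controlling for background variables such as age and gender?
model = smf.ols("criterion ~ test_score + age + C(gender)", data=df).fit()
print(model.summary())
```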

URL: https://www.sciencedirect.com/science/article/pii/B9780128186978001102

Survey Research Methods

A. Fink, in International Encyclopedia of Education (Third Edition), 2010

Criterion Validity

Criterion validity compares responses to future performance or to responses obtained from other, more well-established surveys. Criterion validity is made up of two subcategories: predictive and concurrent. Predictive validity refers to the extent to which a survey measure forecasts future performance. A graduate school entry examination that predicts who will do well in graduate school has predictive validity. Concurrent validity is demonstrated when two assessments agree, or when a new measure compares favorably with one that is already considered valid. For example, to establish the concurrent validity of a new survey, the survey researcher can either administer the new and the validated measures to the same group of respondents and compare the responses, or administer the new instrument to the respondents and compare their responses to experts' judgment. A high correlation between the new survey and the criterion indicates concurrent validity. Establishing concurrent validity is useful when a new measure is created that claims to be better (shorter, cheaper, and fairer).

URL: https://www.sciencedirect.com/science/article/pii/B9780080448947002967

Validity, Data Sources

Michael P. McDonald, in Encyclopedia of Social Measurement, 2005

Criterion Validity

Criterion validity is the comparison of a measure against a single measure that is supposed to be a direct measure of the concept under study. Perhaps the simplest example of the use of the term validity is found in the efforts of the American National Election Study (ANES) to validate respondents' answers to the voting question on the post-election survey. Surveys, including the ANES, consistently produce turnout estimates that are unreliable and biased upward: a greater percentage of people respond that they voted than official government statistics of the number of ballots cast indicate.

To explore the reliability of the turnout measure, the ANES compared respondents' answers to the voting question against actual voting records; respondents' registrations were also validated. While this may sound like the ideal case of validating a fallible human response against an infallible record of voting, the actual records are not without measurement error. Some people refuse to provide names, or give incorrect names, either on registration files or to the ANES. Votes may be improperly recorded. Some people live outside the area where they were surveyed, so their records go unchecked. In 1984, the ANES even discovered voting records in a garbage dump. The ANES consistently could not find voting records for 12–14% of self-reported voters. In 1991, the ANES revalidated the 1988 survey and found that 13.7% of the revalidated cases produced results different from the initial 1989 validation. These discrepancies reduced confidence in the reliability of the ANES validation effort, and, given the high costs of validation, the ANES dropped validation from the 1992 survey.

The preceding example is one of criterion validity, where the measure to be validated is correlated with another measure that is a direct measure of the phenomenon of concern. A positive correlation between the measure and the measure it is compared against is all that is needed as evidence that the measure is valid. In some sense, criterion validity is without theory. “If it were found that accuracy in horseshoe pitching correlated highly with success in college, horseshoe pitching would be a valid measure of predicting success in college” (Nunnally, as quoted in the work of Carmines and Zeller). Conversely, no correlation, or worse, a negative correlation, would be evidence that a measure is not a valid measure of the same concept.

As the example of ANES vote validation demonstrates, criterion validity is only as good as the validity of the reference measure to which one is making a comparison. If the reference measure is biased, then valid measures tested against it may fail to find criterion validity. Ironically, two similarly biased measures will corroborate one another, so a finding of criterion validity is no guarantee that a measure is indeed valid.

Carmines and Zeller argue that criterion validation has limited use in the social sciences because often no direct measure exists to validate against. That does not mean, however, that criterion validation cannot be useful in certain contexts. For example, Schrodt and Gerner compared machine coding of event data against human coding to determine the validity of the computer coding. The validity of the machine coding is important to these researchers, who identify conflict events by automatically culling through large volumes of newspaper articles. As similar large-scale data projects emerge in the information age, criterion validation may play an important role in refining the automated coding process.
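
A minimal sketch of a criterion-validity check of this kind, in the spirit of the Schrodt and Gerner comparison described above: machine-assigned event codes are compared against human ("criterion") codes. The event labels below are hypothetical.

```python
# Compare automated event coding against a human criterion.
from sklearn.metrics import cohen_kappa_score, accuracy_score

human_codes   = ["conflict", "conflict", "cooperation", "neutral", "conflict"]
machine_codes = ["conflict", "neutral",  "cooperation", "neutral", "conflict"]

# Raw agreement overstates validity when some codes are very common,
# so chance-corrected agreement (Cohen's kappa) is usually reported too.
print("Agreement:", accuracy_score(human_codes, machine_codes))
print("Cohen's kappa:", cohen_kappa_score(human_codes, machine_codes))
```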

URL: https://www.sciencedirect.com/science/article/pii/B0123693985000463

The External Validity of Studies Examining the Detection of Concealed Knowledge Using the Concealed Information Test

Gershon Ben-Shakhar, Tal Nahari, in Detecting Concealed Information and Deception, 2018

Abstract

The criterion validity of the Concealed Information Test (CIT), namely its ability to differentiate between individuals possessing knowledge of critical items and those unaware of these items, has been demonstrated in many laboratory experiments. However, despite impressive validity estimates resulting from these studies, the external validity of CIT laboratory experiments has been repeatedly questioned. This chapter reviews various attempts to examine the external validity of CIT research, either through studies conducted in realistic settings or through controlled experiments manipulating factors that distinguish the typical laboratory environment from the realistic forensic application of the CIT. In most cases, the results of these studies support the external validity of the CIT.

URL: https://www.sciencedirect.com/science/article/pii/B9780128127292000033

Measures of Affect Dimensions

Gregory J. Boyle, ... Carroll E. Izard, in Measures of Personality and Social Psychological Constructs, 2015

Criterion/Predictive

Evidence for criterion validity has been provided for the precursor scales to the STPI, and the STPI itself would most likely exhibit similar criterion validity. For example, the STAI scales correlate with impaired performance and attentional bias (Eysenck & Derakshan, 2011). Trait anger is associated with elevated blood pressure (Spielberger & Reheiser, 2009). Matthews, Panganiban, and Hudlicka (2011) showed that under neutral mood conditions (N=60), STPI trait anxiety correlated .40 with viewing frequency of threat stimuli. Wrenn, Mostofsky, Tofler, Muller, and Mittleman (2013) conducted a prospective cohort study of 1968 survivors of myocardial infarction using the STPI anxiety and anger scales, and found that anxiety was associated with a higher mortality risk over 10 years. In a study of 103 overweight adolescents, Cromley et al. (2012) found that STPI trait anxiety and trait anger were associated with lower body satisfaction (odds ratios for the STPI predictors were .76 and .90, respectively).

URL: https://www.sciencedirect.com/science/article/pii/B9780123869159000085

EEG Evaluation of traumatic brain injury and EEG biofeedback treatment

Robert W. Thatcher Ph.D., in Introduction to Quantitative EEG and Neurofeedback (Second Edition), 2009

IX Predictive Validity of QEEG in the Evaluation of TBI—Neuropsychological

Predictive (or criterion) validity is closely related to hypothesis testing: the measure is subjected to a discriminant analysis, cluster analysis, or some other statistical analysis in order to separate a clinical sub-type. Nunnally (1978) gives a useful definition of predictive validity: “when the purpose is to use an instrument to estimate some important form of behavior that is external to the measuring instrument itself, the latter being referred to as criterion [predictive] validity.” For example, science “validates” the clinical usefulness of a measure by its false positive and false negative rates, and by the extent to which there are statistically significant correlations with other clinical measures and, especially, with clinical outcomes (Cronbach, 1971; Mas et al., 1993; Hughes and John, 1999).

One example of predictive validity is the ability to discriminate traumatic brain injured patients from age-matched normal control subjects at classification accuracies greater than 95% (Thatcher et al., 1989, 2001b; Thornton, 1999). Another example is the ability of QEEG normative values to predict cognitive functioning in TBI patients (Thatcher et al., 1998a, 1998b, 2001a, 2001b). Table 11.1 shows correlations between QEEG and a variety of neuropsychological tests, and serves as a further example of clinical predictive validity and content validity. As seen in Table 11.1, relatively strong correlations exist between QEEG measures and performance on neuropsychological tests. Also, as the severity of TBI increases, deviation from normal EEG values increases systematically, and this increase correlates with a systematic decrease in neuropsychological test performance (Thatcher et al., 1998a, 1998b, 2001a, 2001b). Such relationships between clinical measures and the EEG demonstrate the predictive validity of EEG in the evaluation of TBI as well as of normal brain functioning (Thatcher et al., 2003, 2005c). A sketch of this kind of discriminant classification is given after Table 11.1.

Table 11.1. Correlations between neuropsychological test scores and QEEG discriminant scores in TBI patients (N=108) (from Thatcher et al., 2001a).

Measure    Pearson product-moment correlation    Probability
WAIS Test—scaled scores
Vocabulary −0.416 0.05
Similarities −0.640 0.001
Picture arrangement −0.576 0.01
Performance −0.504 0.01
Digit symbol −0.524 0.01
BOSTON Naming Test
# of spontaneous correct responses −0.482 0.05
WORD Fluency Test—total correct words
COWA −0.568 0.01
Animals −0.630 0.001
Supermarket −0.709 0.001
ATTENTION Test—raw scores
Trail Making A—Response Time 0.627 0.001
Trail Making B—Response Time 0.627 0.001
Stroop—Word −0.427 0.05
Stroop—Color −0.618 0.001
Stroop—Color+Word −0.385 ns
WISC Test—executive functioning—raw scores
Perseverative responses 0.408 0.05
% Concept. level responses −0.200 ns
Categories completed −0.187 ns
Design fluency – # originals −0.454 0.05
Design fluency – # rule violations 0.304 ns
WECHSLER Memory Test—raw scores
Logical memory II −0.382 ns
Visual production II −0.509 0.01
Digit span (forward+backward) −0.336 ns
Digit span (forward) −0.225 ns
%-tile rank forward −0.300 ns
Digit span (backward) −0.213 ns
CVLT Test—raw scores
Recall — List A −0.509 0.01
Recall — List B −0.554 0.01
List A — short-delay free −0.518 0.01
Semantic Cluster ratio −0.162 ns
Recall Errors — free intrusions 0.409 0.05
Recall Errors — cued intrusions 0.520 0.01
Recognition hits −0.595 0.01
Recognition false positives 0.280 ns
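
As a minimal sketch of the discriminant-classification approach behind the >95% accuracy figure cited above: a linear discriminant is trained to separate two groups, and cross-validated predictions yield the accuracy and the false positive/negative rates used to "validate" the measure. The features and simulated data are hypothetical, not the Thatcher et al. variables.

```python
# Cross-validated discriminant classification of patients vs. controls.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # e.g., QEEG-style features (simulated)
y = np.r_[np.zeros(100), np.ones(100)]  # 0 = control, 1 = patient
X[y == 1] += 1.0                        # simulated group separation

pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=10)
tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
print("accuracy:", (tp + tn) / len(y))
print("false positive rate:", fp / (fp + tn))
print("false negative rate:", fn / (fn + tp))
```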

The reliability and stability of the QEEG discriminant function was evaluated by comparing the discriminant scores at baseline to the discriminant scores obtained upon repeated EEG testing at 6 months and 12 months after the initial baseline EEG test. No statistically significant differences were found between any of the post-injury periods up to 4 years post-injury, thus demonstrating high reliability even several years after injury (Thatcher et al., 2001a).

The results of a cross-validation analysis of the QEEG and TBI are shown in Fig. 11.2. In this study, quantitative EEG analyses were conducted on 503 confirmed TBI patients located at four different Veterans Affairs hospitals (Palo Alto, CA; Minneapolis, MN; Richmond, VA; and Tampa, FL) and three military hospitals (Balboa Naval Medical Center, Wilford Hall Air Force Medical Center, and Walter Reed Army Medical Center). Figure 11.2 shows histograms of the distribution of QEEG TBI discriminant scores in the 503 TBI subjects, who were tested 15 days to 4 years post-injury. It can be seen that the distribution of the QEEG discriminant scores, and thus the severity of the injury, varied across the hospitals. The VA patients exhibited more deviant QEEG scores than the active-duty military personnel, which was consistent with the clinical evaluations, including neuropsychological testing.

Figure 11.2. Histograms showing the QEEG discriminant score distribution from 503 TBI outpatients located at four different Veterans Affairs hospitals (A), and three military hospitals (B). Normal=0 and most severe TBI=10 (from Thatcher et al., 2001a).

Table 11.2 shows the results of multivariate analyses of variance in which statistically significant differences in neuropsychological performance were predicted by the QEEG discriminant score groupings. The group with lower EEG discriminant scores showed higher neuropsychological functioning than the group with higher EEG discriminant scores.

Table 11.2. Results of multivariate analyses of variance between low and high EEG discriminant score groups in a cross-validation study (Thatcher et al., 2001a).

Measure    F-ratio    Probability
WAIS Test—Scaled Scores
Vocabulary 8.7448 0.0038
Similarities 6.3690 0.0130
Picture arrangement 8.2771 0.0048
Performance 13.2430 0.0004
Digit symbol 21.0620 0.0001
Boston Naming Test
# of spontaneous correct responses 4.8616 0.0290
Word Fluency Test—total correct words
COWA 5.2803 0.0230
Animals 14.0170 0.0003
Supermarket 18.8370 0.0001
Attention Test—raw scores
Trail Making A—Response Time 7.6953 0.0064
Trail Making B—Response Time 4.6882 0.0324
Stroop—Word 16.5080 0.0001
Stroop—Color 9.6067 0.0024
Stroop—Color+Word 4.3879 0.0383
WISC Test—executive functioning-raw scores
Perseverative responses ns ns
% Concept. level responses ns ns
Categories completed ns ns
Design fluency — # originals ns ns
Design fluency — # rule violations ns ns
Wechsler Memory Test—raw scores
Logical memory II 3.9988 0.0484
Visual production II 7.1378 0.0089
Digit span (forward+backward) ns ns
Digit span (forward) ns ns
%-tile rank forward ns ns
Digit span (backward) ns ns
CVLT Test—raw scores
Recall — List A ns ns
Recall — List B ns ns
List A — short—delay free 7.0358 0.0089
Semantic cluster ratio ns ns
Recall errors — free intrusions ns ns
Recall errors — cued intrusions ns ns
Recognition hits ns ns
Recognition false positives ns ns

URL: https://www.sciencedirect.com/science/article/pii/B9780123745347000113

Measures of Ability and Trait Emotional Intelligence

Alexander B. Siegling, ... K.V. Petrides, in Measures of Personality and Social Psychological Constructs, 2015

Criterion/Predictive

Several studies attesting to the criterion validity of the EQ-i 2.0 are presented in the manual. Corporate job success was positively related to the EQ-i 2.0 total score, with comparisons between leaders and the normative average showing medium to large effects. EI was also higher for postgraduate students than for high school students (d=0.33), a difference further supported by higher scores on most of the composite scales and subscales for the university groups. An examination of clinical groups, defined by either a depressive/dysthymic or another clinical diagnosis, showed that they scored lower on the total EI score than the normative sample (d=0.57 and 0.45, respectively). This trend held for all composite scales except the Interpersonal scale.
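
A minimal sketch of the effect-size computation behind group comparisons like the d values reported above: Cohen's d from two groups' means and standard deviations, using a pooled SD. The numbers below are illustrative only, not the manual's data.

```python
# Cohen's d from summary statistics of two independent groups.
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference with pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# e.g., a clinical group scoring below a normative sample on total EI
print(round(cohens_d(95.0, 15.0, 80, 103.5, 15.0, 400), 2))
```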

URL: https://www.sciencedirect.com/science/article/pii/B9780123869159000140

Assessing anxiety sensitivity

R. Kathryn McHugh, in The Clinician's Guide to Anxiety Sensitivity Treatment and Assessment, 2019

Concurrent and criterion validity

Support for the concurrent and criterion validity of anxiety sensitivity is vast. Anxiety sensitivity is elevated among those with anxiety (Reiss et al., 1986; Rifkin et al., 2015; Taylor, Koch, & McNally, 1992), obsessive-compulsive (Deacon & Abramowitz, 2006), and traumatic stress disorders (Bardeen, Tull, Stevens, & Gratz, 2015; Taylor et al., 1992), relative to those without. Anxiety sensitivity is also correlated with anxiety symptom severity in both unselected (Keough, Riccardi, Timpano, Mitchell, & Schmidt, 2010) and clinical samples (Deacon & Abramowitz, 2006). Those with higher anxiety sensitivity show heightened responding to distress, such as pain (Zvolensky, Goodie, McNeil, Sperry, & Sorrell, 2001), carbon dioxide inhalation (i.e., panic provocation; McNally & Eke, 1996; Richey, Schmidt, Hofmann, & Timpano, 2010), and psychosocial and traumatic stressors (McHugh, Gratz, & Tull, 2017; Shostak & Peterson, 1990).

Anxiety sensitivity also has been validated with respect to physiological and neurological markers of heightened reactivity to threat-related and other affective stimuli. For example, anxiety sensitivity has been associated with heightened fear-potentiated startle (McMillan, Asmundson, Zvolensky, & Carleton, 2012) and skin conductance response to carbon dioxide challenge (Gregor & Zvolensky, 2008). Neuroimaging studies have indicated that anxiety sensitivity is associated with activation and volumetric differences in limbic structures central to emotional processing, such as the amygdala and the insula (Killgore et al., 2011; Rosso et al., 2010; Stein, Simmons, Feinstein, & Paulus, 2007). Consistent with the perspective that anxiety sensitivity reflects an interpretation of anxiety symptoms and sensations as dangerous or threatening, a recent study identified a role for prefrontal regions that are associated with the appraisal of stimuli (e.g., the anterior cingulate) in anxiety sensitivity (Harrison et al., 2015).

URL: https://www.sciencedirect.com/science/article/pii/B9780128134955000024

Validity

Paul F.M. Krabbe, in The Measurement of Health and Health Status, 2017

Criterion Validity

In contrast to content validity, criterion or predictive validity is determined analytically. The concept is only applicable if another existing instrument can be identified as superior. Correlation coefficients (product-moment, Spearman rank, intraclass) are often estimated between a criterion measure (the gold standard) and a competing measure to test for equivalence. By definition, the criterion must be a superior, more accurate measure of the phenomenon if it is to serve as a verifying norm. Widely applied and recognized depression scales such as the Hamilton Depression Rating Scale (Hamilton, 1960, 1980) and the Beck Depression Inventory (Beck et al., 1961, 1996) are possible gold standards for developing alternative (briefer) depression instruments.
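
A minimal sketch of the equivalence check described above: paired scores from the same respondents on a hypothetical brief scale and a gold-standard criterion measure, compared with product-moment and rank correlations. The data are illustrative only.

```python
# Correlate a brief scale with a gold-standard criterion measure.
import numpy as np
from scipy.stats import pearsonr, spearmanr

brief_scale   = np.array([10, 14,  9, 21, 17, 12, 25,  8, 19, 15])
gold_standard = np.array([12, 15, 10, 24, 18, 11, 27,  9, 22, 14])

r, p = pearsonr(brief_scale, gold_standard)          # product-moment
rho, p_rho = spearmanr(brief_scale, gold_standard)   # rank-based
print(f"Pearson r = {r:.2f} (p = {p:.3f}); Spearman rho = {rho:.2f}")
# Intraclass correlations additionally penalize systematic score shifts;
# they can be obtained from packages such as pingouin (not shown here).
```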

Elevating a single criterion to a “gold standard” is difficult, if not impossible, for assessing health instruments because no such standard exists in this setting. It is difficult even to imagine what would constitute a gold standard, since there is no generally accepted concept of health status, nor of health. If there is no gold standard, criterion validity cannot be proved. Therefore, this type of validity can be ignored in this setting and is not addressed elsewhere in this book.

URL: https://www.sciencedirect.com/science/article/pii/B9780128015049000076

Wechsler Individual Achievement Test

DONNA RURY SMITH, in Handbook of Psychoeducational Assessment, 2001

Validity

Evidence of three traditional types of validity (content, construct, and criterion) was evaluated in order to demonstrate that the WIAT measures what it is intended to measure. Even though the content validity was assessed by a panel of national curriculum experts and deemed representative, school curricula vary. Users should review achievement test items to determine how closely the items match what is taught in their school.

Evidence of construct validity includes the expected intercorrelations among subtests reported by age, intercorrelations with the Wechsler scales (see Table D.6 of the WIAT manual), studies of group differences between the standardization sample and various clinical groups as well as differences between the various age/grade groups, and a multitrait-multimethod study of the WIAT and other achievement tests (Roid, Twing, O'Brien, & Williams, 1992). In summary, there was a striking consistency in the correlations among scores on the reading, mathematics, and spelling subtests of the WIAT and those of the corresponding subtests on the other achievement measures.

Since the majority of school psychologists’ assessment time is spent with students with learning disabilities (Smith, Clifford, Hesley, & Leifgren, 1992), WIAT scores were correlated with school grades and group-administered achievement tests, and examined in clinical study groups. Flanagan (1997) notes that a strength of the WIAT is its demonstrated treatment validity because “data are reported that indicate that the WIAT effectively aids in diagnosis of educational/clinical concerns” (p. 84). Special study groups included children classified as gifted and children with mental retardation, emotional disturbance, learning disabilities, attention-deficit/hyperactivity disorder (ADHD), or hearing impairment. Mean composite scores ranged from 112.1 (SD = 9.9) to 117.8 (SD = 9.5) for gifted children, from 58.0 (SD = 10.2) to 66.3 (SD = 10.3) for children with mental retardation, and from 74.6 (SD = 12.0) to 92.8 (SD = 12.6) for children with learning disabilities. These results confirmed predicted expectations for achievement scores in each group.

Independent studies (Slate, 1994; Martelle & Smith, 1994; Saklofske, Schwean, & O'Donnell, 1996; Michalko & Saklofske, 1996) have provided additional evidence of WIAT validity. Saklofske, Schwean, and O'Donnell (1996) studied a sample of 21 children on Ritalin and diagnosed with ADHD and obtained subtest and composite means quite similar to those reported in the WIAT manual. Gentry, Sapp, and Daw (1995) compared subtest scores on the WIAT and the Kaufman Test of Educational Achievement (K-TEA) for 27 emotionally conflicted adolescents and found higher correlations between pairs of subtests (range of .79 to .91) than those reported in the WIAT manual. Because comparisons are often made between the WIAT and the WJPB-R, Martelle and Smith (1994) compared composite and cluster scores for the two tests in a sample of 48 students referred for evaluation of learning disabilities. WIAT composite score means were reported as 83.38 (SD = 10.31) on Reading, 89.32 (SD = 10.60) on Mathematics, 99.24 (SD = 11.84) on Language, and 80.32 on Writing. WJPB-R cluster score means were 87.67 (SD = 11.80) on Broad Reading, 92.09 (SD = 11.62) on Broad Mathematics, and 83.88 (SD = 8.24) on Broad Written Language. Although global scales of the WIAT and WJPB-R relate strongly to each other, mean WIAT composites were 3 to 6 points lower than mean scores on the WJPB-R clusters. Subtest analysis indicated some differences in the way skills are measured; for example, the two reading comprehension subtests (WIAT Reading Comprehension and WJPB-R Passage Comprehension) are essentially unrelated (r = .06). Study authors suggest that “for students with learning disabilities, the two subtests measure reading comprehension in different ways, resulting in scores that may vary greatly from test to test” (p. 7). In addition to a strong relationship between the WIAT Mathematics Composite and the WJPB-R Broad Mathematics cluster, WIAT Numerical Operations correlated equally well with Applied Problems (r = .63) and with Calculation (r = .57) on the WJPB-R, suggesting that WIAT Numerical Operations incorporates into one subtest those skills measured by the two WJPB-R subtests. At the same time, the WJPB-R Quantitative Concepts subtest does not have a counterpart on the WIAT. The Language Composite of the WIAT, however, is a unique feature of that test.

URL: https://www.sciencedirect.com/science/article/pii/B9780120585700500082

What term refers to established criteria for comparing individual scores on a standardized test?

Criterion-referenced tests compare a person's knowledge or skills against a predetermined standard, learning goal, performance level, or other criterion. With criterion-referenced tests, each person's performance is compared directly to the standard, without considering how other students perform on the test.

What is reliability and validity score?

Validity will tell you how good a test is for a particular situation; reliability will tell you how trustworthy a score on that test will be. You cannot draw valid conclusions from a test score unless you are sure that the test is reliable.

What is criterion validity in testing?

Criterion validity (or criterion-related validity) evaluates how accurately a test measures the outcome it was designed to measure. An outcome can be a disease, behavior, or performance. Concurrent validity measures tests and criterion variables in the present, while predictive validity measures those in the future.

Which type of validity is established by a comparison to another measurement?

Criterion-related validity (also called instrumental validity) is a measure of the quality of your measurement methods. The accuracy of a measure is demonstrated by comparing it with a measure that is already known to be valid.