Limitations Of Norms In Psychological Testing Research Paper

Limitations of Norms in Psychological Testing

Tests that are norm-referenced provide a number of benefits over non-norm-referenced tests. Psychological tests enable the gathering of valuable information about individual functioning in many different areas. Most norm-referenced tests are relatively quick to administer, such that a psychologist can obtain a sampling of behavior with a small investment of time and resources. A primary advantage of psychological testing is that it reveals rich and detailed information that would otherwise be unavailable to the psychologist. However, norm-referenced tests are far from perfect, and their quality, reliability, and validity vary substantially in some very important ways.

A number of assumptions are important to the construction of norms. The characteristic being measured must accommodate the ordering of individuals from low to high along an asymmetrical continuum that should at least be ordinal (Angoff, 1984). In addition, the relation of the scores must be transitive (Angoff, 1984). In the mathematical sense, a relation is transitive when, if a condition "applies between two successive members of a sequence, it must also apply between any two members taken in order," such that, for example, "if A is larger than B, and B is larger than C, then A is larger than C" ("Transitive," n.d.). The operational definition of the characteristic being measured must be reasonably clear and valid to a degree that it yields similar orderings of the characteristic in the individuals (Angoff, 1984). The range of scores for a characteristic must all evaluate that same characteristic (Angoff, 1984). There must be a good match between the group(s), the target characteristics, and the test design and purpose (Angoff, 1984). Norms are meaningful and useful only to the extent that they have been carefully defined. The norms population must be appropriate to the subject being tested and to the test; the challenge is to define the concept of appropriateness without conflating it with the concept of difficulty (Angoff, 1984). This means that a test or a subject can be difficult for many of the test takers, yet the test or the subject can still be considered appropriate for that population of test takers (Angoff, 1984).

Normative data should be developed for each distinct norms population for which it is meaningful to make comparisons with individuals or the group (Angoff, 1984). The test items themselves must be subject to pilot testing in which the data about the test items is drawn from samples of the population for which the test is being developed; that is to say, from the groups for which the norms will be provided (Angoff, 1984). Populations that serve as the basis for a set of norms should evidence homogeneity (Angoff, 1984). This means that all the individuals are clearly members of the group and are logical and/or actual "competitors" in the same arena (Angoff, 1984).

Overview of Norms in Psychological Testing

A variety of norms exist, including the following: National norms, local norms, age and grade equivalents, item norms, school mean norms, user-selected norms, special study norms, and norms that yield direct meaning. This discussion centers on norms that are used for psychological tests, for which the following section provides an overview of how norms are developed.

Standardization samples are generated for psychological tests so that tests can be referenced to a normal distribution that is used to compare scores on specific future tests. Standardization relies on the creation of a large sample of test takers who are representative of the larger population for which the test is being developed. This standardization sample is referred to as the norm group or norming group. The raw scores of a sample group are converted into percentiles, which can be associated with a constructed normal distribution that will be used to rank the relative standing of individuals who take the test some time in the future.
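The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not a procedure taken from any test manual: the norm-group scores are hypothetical, the function names are invented for this example, ties are counted as half (one common convention among several), and the T-score scale (mean 50, standard deviation 10) is a widely used convention that the sources cited here do not prescribe.

```python
from statistics import NormalDist


def percentile_rank(norm_scores, raw_score):
    """Percent of the norm group scoring below raw_score, counting ties as half."""
    below = sum(1 for s in norm_scores if s < raw_score)
    equal = sum(1 for s in norm_scores if s == raw_score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


def normalized_z(percentile):
    """Map a percentile rank onto a standard normal z-score.

    The percentile must lie strictly between 0 and 100; inv_cdf raises
    an error at the endpoints.
    """
    return NormalDist().inv_cdf(percentile / 100.0)


# Hypothetical raw scores for a tiny norm group (illustration only;
# real norm groups are far larger).
norm_group = [12, 15, 15, 18, 20, 21, 21, 21, 24, 30]

pr = percentile_rank(norm_group, 21)   # relative standing of a raw score of 21
z = normalized_z(pr)                   # position on the constructed normal curve
t_score = 50 + 10 * z                  # assumed T-score convention

print(pr, round(z, 2), round(t_score, 1))
```

A future test taker's raw score would be looked up against the stored norm group in the same way, which is why the composition of that group matters so much for interpretation.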

Norms function as frames of reference for the interpretation of test scores, but norms are not performance standards or clinical ideals. The size of norm groups varies widely, ranging from just a few hundred up to a hundred thousand people. As with other types of samples, the more individuals that are included in the norm group, the more closely the sample distribution approximates that of the population. Moreover, normative data illustrates how the dimensions of major population subgroups differ and the extent to which test variables are associated with the population classifications.
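One way to make the sample-size point concrete is the standard error of the mean, which shrinks with the square root of the norm group size. The sketch below assumes simple random sampling and a hypothetical score standard deviation of 15; neither assumption comes from the sources cited here, and a large norm group that is not representative can still be misleading.

```python
import math


def standard_error_of_mean(sd, n):
    """Standard error of a sample mean under simple random sampling."""
    return sd / math.sqrt(n)


ASSUMED_SD = 15.0  # hypothetical standard deviation of test scores

# Norm group sizes spanning "a few hundred" to "a hundred thousand".
for n in (300, 1000, 100000):
    print(n, round(standard_error_of_mean(ASSUMED_SD, n), 3))
```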

Limitations of Norms on Psychological Test Interpretation

That several different authoritative sources are involved in the determination of criteria for norms is an inherent complication that erodes efforts to ensure that norm-referenced tests conform to particular high standards. The process of establishing norms for indexing performance over time and comparing the performance of individuals in groups follows a specified course of action which may be periodically modified, as described below.

Frankeburg, et al. (1992) conducted a major revision and re-standardization of the popular Denver Developmental Screening Test. Test users had raised concerns about specific items and features over several years, and the test was revised after 23 years in use (Frankeburg, et al., 1992). Regression analysis, test-retest reliability, and inter-rater reliability were used to evaluate the test items (Frankeburg, et al., 1992). The new Denver II showed an 86% increase in language items, two new articulation items, a new age scale, a new category of item interpretation to accommodate milder developmental delays, a behavior rating scale, and some new training materials (Frankeburg, et al., 1992).

A good example of the importance of adjusting normative data through periodic reviews is evident in the research conducted by Vakil, et al. (2010) for the Rey Auditory Verbal Learning Test (AVLT). The Rey AVLT enables the derivation of several verbal memory measures, and the "simultaneous comparison of performance on several measures allows for a more comprehensive characterization of verbal memory than with a single measure" (Vakil, et al., 2010, p. 663). This test is differentially sensitive to the effects of age, brain trauma, gender, and psychiatric condition. New normative data was established, as a supplement to the existing norms, for the Rey AVLT. The norms were based on individual trials of cohort groups (943 children aged 8 to 17 years, and 528 adults aged 21 to 91 years), and were the result of changes in composite scores for the very young and the very old age groups, which were attributed to frontal lobe maturation and deterioration.

Reconciling Limitations and Appropriate Use of Norms

Several studies are included here to illustrate the problems psychologists face in the field when the standards for ensuring high-quality, consistent norm-referencing processes are lax. Sociodemographic factors can profoundly influence the accuracy of neuropsychological tests, as demonstrated by Ferrett, et al. (2014). In their research with Afrikaans-speaking and English-speaking adolescents from the Cape Town region of South Africa, Ferrett, et al. (2014) used ANCOVAs to demonstrate that, of the possible sociodemographic factors, quality of education and age had the biggest impact on test performance. Three tests endorsed by the World Health Organization (WHO) were used: the Grooved Pegboard Test (GPT), the Children's Color Trails Test (CCTT), and the WHO / UCLA version of the Auditory Verbal Learning Test (AVLT). The authors concluded that, "Comparisons between diagnostic interpretations made using foreign normative data vs. those using current local data demonstrates that it is imperative to use appropriately stratified normative data to guard against misinterpreting performance" (Ferrett, et al., 2014, p. 1).

Test developers do not always make a sufficient effort to ensure that tests they design have adequate psychometric properties (Kirk and Vigeland, 2014). For example, Kirk and Vigeland (2014) conducted a review of the psychometric properties of six different norm-referenced assessments that were intended to measure children's phonological error patterns. In this review, the researchers evaluated the normative sample, reliability, and validity by using the current recommendations and criteria in the literature (Kirk and Vigeland, 2014). The sample size was found to be inadequate, there was poor evidence of construct validity, and insufficient information was provided about diagnostic accuracy (Kirk and Vigeland, 2014).

Spaulding, et al. (2012) conducted research to determine if norm-referenced tests sanctioned by U.S. State Education Departments served to identify the severity of language impairment in children. The researchers evaluated the consistency across state criteria in test manuals, the intentions of the test developers, and the characteristics of the tests (Spaulding, et al., 2012). Manuals for 45 norm-referenced tests to assess the language of children were reviewed (Spaulding, et al., 2012). Only eight states were observed to publish guidelines specifying the use of norm-referenced tests for determining language impairment severity (Spaulding, et al., 2012). Not only was there wide variation in the severity determination cutoff-point criteria, but the cutoff-point criteria did not align with the severity cutoff points that were detailed in the test manuals (Spaulding, et al., 2012).…

Angoff, W.H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service. Retrieved from

Ferrett, H.L., Thomas, K.G., Tapert, S.F., Carey, P.D., Conradie, S., Cuzen, N.L., Stein, D.J., and Fein, G. (2014, June). The cross-cultural utility of foreign- and locally-derived normative data for three WHO-endorsed neuropsychological tests for South African adolescents. Metabolic Brain Disease, 29(2), 395-408. DOI: 10.1007/s11011.014.9495-6. Retrieved from

Frankeburg, W.K., Dodds, J.A., Shapiro, H., and Bresnick, B. (1992, January). The Denver II: A major revision and re-standardization of the Denver Developmental Screening Test. Pediatrics, 89(1), 91-97. Retrieved from

Kirk, C., and Vigeland, K.C. (2014, October). A psychometric review of norm-referenced tests used to assess phonological error patterns. Language, Speech, and Hearing Services in Schools, 45(4), 365-77. DOI: 10.1044/2014_LSHSS-13-0053. Retrieved from

Spaulding, T.J., Swartwout Szulga, M., and Figueroa, C. (2012, April). Using norm-referenced tests to determine the severity of language between U.S. policy makers and test developers. Language, Speech, and Hearing Services in Schools, 43(2), 365-77. DOI: 10.1044/-1461(2011/10-0103). Retrieved from

Transitive. (n.d.). Google. Retrieved from

Vakil, E., Greenstein, Y., and Blachstein, H. (2010). Normative data for composite scores for children and adults derived from the Rey Auditory Verbal Learning Test. Clinical Neuropsychologist, 24(4), 662-677. Retrieved from

