Robert Glaser's 1963 paper "Instructional Technology and the Measurement of Learning Outcomes" marked a watershed in psychometrics, the measurement of educational effectiveness. Glaser's innovation lay in distinguishing two particular means of comparing test outcomes, and his definitions continue to drive controversial change in the provision of education across the United States to this day. The No Child Left Behind Act of 2001 represents the maturation of a concrete, nationwide movement toward what Glaser termed "criterion-referenced measures" (Glaser 1963, p. 7), the measurement of individual student test results against absolute scores intended to demonstrate mastery of coursework, as opposed to "norm-referenced measures" (Glaser 1963, p. 8), which rank students' mastery of coursework relative to one another. Both types of measurement serve different purposes at the same time, often within the same instrument (Popham and Husek 1969, p. 19), even as one modern national trend elevates one over the other as if they were binary opposites rather than complementary methods of assessment.
The formal study of test performance measurement separates psychometric techniques into many categories. Glaser's distinction between criterion-referenced and norm-referenced measurement rested on a prior distinction between aptitude and achievement testing, where aptitude refers to a student's potential to learn in the future and achievement testing attempts to measure proficiency, that is, students' mastery of course content, although this distinction is sometimes blurred (Glaser 1963, p. 6). Criterion-referenced measurement, as opposed to norm-referenced, generally attempts to quantify achievement, either after presentation of target material or both before and after presentation, with the initial test providing a benchmark against which post-teaching comprehension can then be compared (so-called ipsative measurement, in which the student competes with herself (Neil, Wadley and Phinn 1999, p. 304)). Aptitude tests could be criterion-referenced if mastery could be predicted from specific knowledge already obtained, except that there is no way to test comprehension of material to which the student has not yet been exposed. Inferential statistics are widely used in both regimes to project future performance, whether overtly or tacitly, but norm-referenced assessment has traditionally been used for more formal aptitude or performance measurement, especially for so-called "high-stakes," or competitive, testing objectives (Popham and Husek 1969, p. 21).
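The ipsative comparison described above, in which each student's post-teaching score is measured only against her own pre-teaching benchmark, can be sketched in a few lines. All student names and scores below are invented purely for illustration; nothing here comes from the cited studies.

```python
# Hypothetical illustration of ipsative (pre/post) measurement:
# each student's post-test score is compared only with her own
# pre-test benchmark, not with a cohort norm or a fixed criterion.
pre_scores = {"student_a": 40, "student_b": 65, "student_c": 55}
post_scores = {"student_a": 75, "student_b": 70, "student_c": 90}

def ipsative_gain(pre, post):
    """Return each student's gain over her own pre-test score."""
    return {s: post[s] - pre[s] for s in pre}

gains = ipsative_gain(pre_scores, post_scores)
print(gains)  # {'student_a': 35, 'student_b': 5, 'student_c': 35}
```

Note that under this framing student_b's modest gain of 5 points says nothing about rank in the cohort or about reaching an absolute criterion; the only reference point is that student's own starting knowledge.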
Whereas ipsative measurement tests a student's comprehension against her own prior knowledge (really, the prior lack thereof) after the treatment has been applied, discussion of norm- or criterion-referenced psychometrics implies that students' performance is measured against each other or against a target benchmark, respectively. The difference lies in how the results are compared, because the two formats often appear similar in presentation and both depend on the validity of evidence justifying the appropriateness of individual test questions (Fernandez-Ballestros 1993, p. 283). The difference in how the outcomes are presented becomes significant because each form of comparison gives rise to widely different consequences for both the acquisition (learning) and the transmission (pedagogy) of subject content.
Structural Differences in Application and Interpretation
Colburn (2009) points out that while norm-referenced assessment is often used both to measure individual competency and to rank students, attempting to derive both types of results from the same instrument is inappropriate because the two types of measurement are structurally distinct. This position marks a departure from Popham's early argument that the same test can yield information for both styles of reference (1969, p. 36). Either way, certain constraints underlie both approaches however they are applied. Because norm-referenced assessment compares students' performance against each other to derive a hierarchical rank (best, second-best, third-best), a question that every student answers correctly is often rejected as useless: it reveals no difference between test subjects. Norm-referenced assessment therefore requires a series of questions long and difficult enough that not all students answer every item correctly; otherwise hierarchical ranking would be impossible and the assessment useless. Test subjects' ranks can be sliced into percentiles, and a particular rank designated as the line between pass and fail, but within cohorts norm-referenced assessment is powerless to distinguish between identical scores. Grading on the curve requires just that, a curve, not a spike where all students are the same.
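The structural distinction above can be made concrete with a minimal sketch: the same raw scores yield different pass/fail decisions depending on whether they are referenced to an absolute criterion or to the cohort's ranking. The scores, the 70-point criterion, and the "top 40%" cut are all invented assumptions for illustration, not values drawn from the literature cited.

```python
# Hypothetical sketch: identical raw scores interpreted two ways.
scores = {"a": 92, "b": 88, "c": 85, "d": 84, "e": 60}

# Criterion-referenced: pass/fail against an absolute cut score
# (70 is an assumed mastery threshold for this example).
CRITERION = 70
criterion_pass = {s for s, v in scores.items() if v >= CRITERION}

# Norm-referenced: rank students against each other and pass only
# an assumed top 40%; the cut depends on the cohort, not on mastery.
ranked = sorted(scores, key=scores.get, reverse=True)
norm_pass = set(ranked[: int(len(ranked) * 0.4)])

print(sorted(criterion_pass))  # ['a', 'b', 'c', 'd']: four reach mastery
print(sorted(norm_pass))       # ['a', 'b']: only the top two beat the curve
```

The sketch also shows why an item everyone answers correctly is rejected under norm referencing: it shifts every score equally, leaving `ranked` unchanged, whereas under criterion referencing it can legitimately move a student across the mastery line.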
Teaching under criterion-referenced assessment on the other hand could include a result where all students scored perfectly, because the designation between pass and fail would depend…