Distinguish Terms 'Criterion-Referenced Assessment' 'Norm-Referenced Assessment'

Last reviewed: February 28, 2011

Distinguish the terms 'criterion-referenced assessment' and 'norm-referenced assessment'.

Robert Glaser's 1963 paper "Instructional Technology and the Measurement of Learning Outcomes" marked a watershed in psychometrics, the measurement of educational effectiveness. Glaser's innovation lay in classifying two particular means of comparing test outcomes, and his definitions continue to drive controversial change in the provision of education across the United States to this day. The No Child Left Behind Act of 2001 represents the maturation of a concrete, nationwide movement toward what Glaser termed "criterion-referenced measures" (Glaser 1963, p. 7), the measurement of individual student test results against absolute scores intended to demonstrate mastery of coursework, as opposed to "norm-referenced measures" (Glaser 1963, p. 8), which rank students' mastery of coursework relative to each other. Both types of measurement are used for different purposes at the same time, often with the same instrument (Popham and Husek 1969, p. 19), even while one modern national trend seems to be a revaluation of the one over the other, as if they were binary opposites rather than complementary methods of assessment.

The formal study of test performance measurement separates psychometric techniques into many categories. Glaser's distinction between criterion-referenced and norm-referenced measurement rested on a prior distinction between aptitude and achievement testing, where aptitude testing addresses a student's potential to learn in the future and achievement testing attempts to measure proficiency, or students' mastery of course content, although this distinction is sometimes blurred (Glaser 1963, p. 6). Criterion-referenced measurement, as opposed to norm-referenced measurement, generally attempts to quantify achievement, either after presentation of the target material or both before and after presentation, with the initial test providing a benchmark against which post-teaching comprehension can then be compared (so-called ipsative measurement, where the student competes with herself (Neil, Wadley and Phinn 1999, p. 304)). Aptitude tests could be criterion-referenced if mastery could be predicted from specific knowledge already obtained, except that there would be no way to test comprehension of material to which the student has not yet been exposed. Inferential statistics are widely used in both regimes to infer future performance, whether overtly or implicitly, but norm-referenced assessment has traditionally been used for more formal aptitude or performance measurement, especially for so-called "high-stakes," or competitive, testing objectives (Popham and Husek 1969, p. 21).

Whereas ipsative measurement tests a student's comprehension after the treatment has been applied against their own prior knowledge (really, the prior lack thereof), discussion of norm- or criterion-referenced psychometrics implies that students' performance is measured against each other or against a target benchmark, respectively. The difference lies in how the results are compared, because the two formats often appear similar in presentation and both depend on the validity of evidence justifying the appropriateness of individual test questions (Fernandez-Ballestros 2003, p. 283). The difference in how the outcomes are presented becomes significant because each form of comparison gives rise to widely different consequences for both the acquisition (learning) and the transmission (pedagogy) of subject content.

Structural Differences in Application and Interpretation

Colburn (2009) points out that while norm-referenced assessment is often used both to measure individual competency and to rank students, attempting to derive both types of results from the same instrument is inappropriate because the two types of measurement are structurally distinct. This view departs from Popham's early argument that the same test can yield information for both styles of reference (1969, p. 36). Either way, certain constraints underlie both styles however they are applied. Since norm-referenced assessment compares students' performance against each other to derive a hierarchical rank (best, second-best, third-best), if all students answer a particular question correctly, that data point is often rejected as useless because it reveals no difference between test subjects. Norm-referenced assessment therefore requires a series of questions large and varied enough that not every student answers every question correctly; otherwise hierarchical ranking would be impossible and the assessment would be useless. Test subjects' ranks can be sliced into percentiles, and a particular rank designated as the line between pass and fail, but within cohorts, norm-referenced assessment is powerless to distinguish between identical scores. Grading on the curve requires just that: a curve, not a spike where all students are the same.
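The constraint described above can be illustrated with a minimal sketch. The function names, the 0/1 scoring of items, and the sample responses are all assumptions made for illustration; none of them come from the sources cited. The sketch drops items that every student answered identically, then ranks students on the remaining items, with tied students sharing a rank.

```python
def discriminating_items(responses):
    """Return indices of items on which students' 0/1 scores differ.

    `responses` is a list with one entry per student, each a list of
    item scores (1 = correct, 0 = incorrect). Hypothetical layout.
    """
    n_items = len(responses[0])
    keep = []
    for i in range(n_items):
        column = [student[i] for student in responses]
        # An item everyone got right (or wrong) separates no one.
        if len(set(column)) > 1:
            keep.append(i)
    return keep


def rank_students(totals):
    """Rank totals from best (1) downward; identical totals share a rank."""
    ordered = sorted(totals, reverse=True)
    return [ordered.index(t) + 1 for t in totals]


# Three hypothetical students, four items.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
kept = discriminating_items(responses)
totals = [sum(student[i] for i in kept) for student in responses]
ranks = rank_students(totals)
```

In this made-up data the first and last items are answered correctly by everyone, so only the two middle items contribute to the ranking. If every student earned the same total, `rank_students` would return the same rank for all of them, which is the "spike" that leaves norm-referenced ranking with nothing to say.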

Teaching under criterion-referenced assessment, on the other hand, could include a result where all students scored perfectly, because the designation between pass and fail would depend on the absolute number of correct answers for each student, even if those were expressed as a percentage. Every data point provides performance information because each student's score is compared against a predetermined benchmark separating pass from fail, in whatever array of gradation (e.g. basic / proficient / advanced (New York State 2002, p. 4), or Performance Levels 1-5 (California 2009)), where scores are completely independent of each other and duplication is irrelevant. All students in a class could earn a zero score if they boycotted a final exam, to construct an example, but no rank could be assigned under a norm-referenced assessment regime in such an event. While grades are often converted to percentages for ease of comparison under criterion-referenced testing, they still represent an absolute number of right answers divided by the number of possible right answers, and so are functionally different from a norm-referenced percentile rank, which indicates the percentage of students scoring worse than an individual regardless of the number of questions on the testing instrument, especially where all students answered some of the questions correctly. These similarities and differences rely on two major pre-established assumptions: that test questions are accurate and refer to the conditions tested for, and that they produce uniform results over time and across multiple individuals. Any discussion of the choice between norm- or criterion-referenced testing assumes construct validity and reliability, without which the choice of reporting style becomes moot (Fernandez-Ballestros 2003, p. 283).
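The difference between a percentage of correct answers and a percentile rank can be made concrete with a short sketch. The function names and the cut scores (65 and 85 percent) are hypothetical and chosen only for illustration; they are not the thresholds of New York State, California, or any other real scheme. The percentile rank here follows the definition used above: the percentage of students in the cohort who scored worse.

```python
def percent_correct(raw_score, n_items):
    """Criterion-referenced: right answers over possible right answers."""
    return 100.0 * raw_score / n_items


def performance_level(pct, cut_scores=((85, "advanced"),
                                       (65, "proficient"),
                                       (0, "basic"))):
    """Map a percentage to a level using hypothetical cut scores."""
    for cut, label in cut_scores:
        if pct >= cut:
            return label


def percentile_rank(raw_score, cohort):
    """Norm-referenced: percentage of the cohort scoring strictly lower."""
    below = sum(1 for s in cohort if s < raw_score)
    return 100.0 * below / len(cohort)


# A hypothetical class in which everyone answers all 20 items correctly.
perfect_class = [20, 20, 20]
pct = percent_correct(20, 20)
level = performance_level(pct)
pr = percentile_rank(20, perfect_class)

# A mixed class: the same raw score of 15 reads differently in each scheme.
mixed_class = [10, 15, 18, 12]
pct_mixed = percent_correct(15, 20)
pr_mixed = percentile_rank(15, mixed_class)
```

In the perfect class, every student is at 100 percent correct and lands in the top level, yet each has a percentile rank of 0 because no one scored lower. In the mixed class, a raw score of 15 out of 20 is 75 percent correct regardless of who else sat the test, while its percentile rank of 50 depends entirely on the other three scores.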

Implications for Pedagogy

While every student scoring perfectly or scoring zero probably happens very rarely, such hypothetical extremes demonstrate a very real consequence of the choice between norm-referenced and criterion-referenced assessment: the value of testing as an indicator of pedagogical adequacy, or of the effectiveness of coursework in progress or across different rounds. Since norm-referenced assessment often disregards scores where no variation occurs, i.e. where all students answer a question correctly or all answer it incorrectly, it has less value than criterion-referenced assessment as an indicator of the appropriateness of teaching or coursework. Under a curve-based grading regime, a teacher could and probably should derive a rough indication of how appropriate the speed, intensity or complexity of a course or exam may be for a particular class if all students pass handily or score uniformly poorly. Criterion-referenced assessment, however, provides a method for rigorous review and comparison of teaching methods and of the effectiveness of coursework presentation, and an opportunity to adapt pedagogy to the idiosyncratic learning abilities of specific classes. These factors emerge in several different ways.

Criterion-referenced assessment can be applied formatively or summatively, depending on where in the course of presentation the test is administered. Colburn (2009) likens formative criterion-referenced assessment to a medical diagnosis, where the educator tests a student's knowledge to assess strengths and weaknesses that can then be addressed over the rest of the module. I would add from my own experience that such assessment, while not scientific, is useful for tailoring classroom methods and content 'on the fly' toward the individual strengths and weaknesses of widely varying students.

Cite This Paper
PaperDue. (2011). Distinguish Terms 'Criterion -- Referenced Assessment' 'Norm-Referenced. PaperDue. https://www.paperdue.com/essay/distinguish-terms-criterion-referenced-49896
