This paper examines the critical distinction between validity and reliability in educational assessment instruments. It discusses how validity—the accuracy of what a test measures—and reliability—consistency of results—serve different purposes depending on assessment context. The paper analyzes content-based assessments, exploring evidence of content mastery, methods for determining whether assessments reflect learner knowledge, and limitations of content-focused testing. Special attention is given to cultural bias in standardized testing, the shift toward inquiry-based methods, and the challenge of designing assessments that measure both factual knowledge and higher-order thinking skills across diverse student populations.
When an assessment instrument has validity, it accurately measures what it is designed to measure. An instrument with reliability yields consistent results each time it is used under comparable conditions. Whether validity or reliability is more desirable in a test instrument depends on the purpose of the test and how the results will be used. A teacher who develops a test for an individual classroom will probably be more interested in the validity of the test, because the aim is to determine whether students are meeting learning goals and objectives. Assessments that are developed for large populations and used repeatedly, such as standardized tests, should be valid, measuring achievement as they are designed to do, but also reliable, providing a standard by which students are assessed at the district, state, or national level.
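The idea of reliability as consistency can be made concrete with a test-retest check, in which the same students take the same instrument twice and the two sets of scores are correlated. The sketch below is only an illustration under stated assumptions: the function name `pearson_r` and the two score lists are hypothetical and do not come from any study cited in this paper.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_a = sum(first) / len(first)
    mean_b = sum(second) / len(second)
    # Sum of cross-products and sums of squared deviations from each mean.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    if var_a == 0 or var_b == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_a * var_b)


# Hypothetical scores for five students on two sittings of the same test.
first_sitting = [72, 85, 90, 64, 78]
second_sitting = [70, 88, 91, 60, 80]

reliability = pearson_r(first_sitting, second_sitting)
print(f"test-retest correlation: {reliability:.2f}")
```

A correlation near 1 would suggest that the instrument ranks students consistently across sittings. It would not show that the test measures the intended learning objectives, which is the separate question of validity.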
Shank (2006, p. 5) asserts that tests are not the best way to determine the quantity and quality of learning that has taken place but are, nonetheless, practical and easy to use, and thus commonly employed. As Shank points out, "The optimal assessment type depends primarily on whether the objective is declarative (facts: name, list, state, match, describe, explain...) or procedural (task: calculate, formulate, build, drive, assemble, determine...). Research shows that there is a big difference between these two types—the difference between knowing about and knowing how (practical application to real-world tasks)."
In any case, validity of a test instrument speaks to its quality as an assessment tool. An instructor might not realize a test is not valid until after it is administered and the results are tabulated. If most students in the class do poorly, for example, the instructor needs to look at unit or course content, reflect on delivery methods, and try to figure out where the breakdown occurred. If most students fail to do reasonably well on a test, it is not a valid measure of the intended learning objectives.
A recent article in Education Digest points out that "assessments that accurately reflect traditional ways of knowing for a specific cultural group can provide richer and more valid results" (Culture and Assessment, 2011, p. 44). The authors cite as an example a question on a standardized test that asked students to write about the disadvantages of using laboratory animals for research. The answers of native Hawaiian students reflected the belief that there is "no such thing" as laboratory animals, that all animals are our human brothers and therefore not used for experimentation. This is but one example of the cultural bias that skews the validity of test results. In an individual classroom, and often within a school or district, creators of test instruments can take into account cultural norms and traditions and thus largely eliminate this kind of bias. For instruments administered at the national level, however, this is much more difficult because our population is so diverse socioeconomically, racially, culturally, and ethnically.
Testing is supposed to be a learning experience that focuses on what students know (Petress, 2007, n.p.). Guidelines for developing valid instruments are the same for instructors whether their students are in elementary school or in college. Test questions must be clear and unambiguous. There must be a connection between the material covered in class and the questions asked. Students must be able to prepare for the test by participating in class and working with the materials provided for instruction. For a first-grade test on addition, for example, instruction components would include group instruction, guided practice with manipulatives, and independent practice with worksheets. At the college level, instruction components would include lectures, class discussions, texts, and supplemental reading materials. In both cases, a valid test instrument would test students on the knowledge they developed through use of these materials. "Tests need to be clear in form and purpose, goal centered, assessed with learning in mind, be well connected to class discussions, text, outside readings, and class activities; and not come as a surprise to attentive students" (Petress, 2007, n.p.). For the classroom teacher, an instrument with validity will satisfy these parameters.
When teachers give content-based assessments, they are measuring how much information students have retained from lectures, discussions, readings and other learning experiences (e.g., homework, projects). In creating a content-based assessment, the teacher must look at all the learning materials and experiences that have taken place during the unit or course of study. The questions that are asked must accurately reflect this content so mastery can be assessed. Teachers have to ask the right questions to give students an opportunity to give the right answers.
Instructors must design test instruments that allow students to demonstrate their content knowledge and also put that knowledge into practice. It is not enough for students to remember facts; they must be able to put the facts in the greater context of what the unit or course is designed to teach them.
The Christian Science Monitor reported last year that American students lag behind their global counterparts in science and math (Paulson, 2010). The Programme for International Student Assessment (PISA) has long been used to demonstrate so-called failures in the American education system, though "some experts caution that comparing countries with vastly different populations is fraught with complexities, and that the rankings aren't as straightforward as they might seem" (Paulson, 2010). Nevertheless, recent attention has focused on increasing the use of inquiry-based methods, which are seen as reflecting learner knowledge better than content-based assessments do. As Day and Matthews (2008, p. 336) point out, science inquiry requires higher-order thinking skills, and these are difficult to measure with large-scale assessments. In individual classrooms, it is easier for teachers to move away from the traditional multiple-choice tests that largely test factual knowledge and comprehension of science content. Test designers in New York State, as in a handful of other states, have had some success designing more process-based assessments. For example, an item on the August 2004 exam (NYSPD, 2006, cited in Day & Matthews, 2008, p. 340) presented students with a hypothetical experiment and asked them to identify its flaws. As Day and Matthews conclude, this is "a great way of assessing both students' understanding of the inquiry process and their ability to use higher-order thinking skills."
"Why factual knowledge alone is insufficient for learning"