This paper examines what makes an academic assessment effective or ineffective by analyzing the core psychometric concepts of reliability and validity. Using the SAT and GRE as primary case studies, the paper evaluates how well these standardized tests predict student success at the undergraduate and graduate levels, respectively. It also considers the performance of standardized testing programs used in New York City elementary schools. The paper concludes that effective assessments must reflect holistic student achievement over time and that consistent, agreed-upon standards are essential to avoid misleading students, parents, and educators about academic proficiency.
For a test to be accepted within the academic community, it must be both reliable and valid. A reliable test produces consistent results, while a valid test measures what it purports to measure. Reliability is often illustrated with the "kitchen scale" analogy: if you weigh the same cup of flour and get 4 ounces, then 4.25 ounces, then 4.5 ounces on successive attempts, the scale is not reliable (Classroom Assessment, 2013, Florida Center for Instructional Technology). Similarly, a test that places the same student alternately above and below grade level when taken in rapid succession, with no additional preparation in between, raises serious questions about its reliability. "Generally, if the reliability of a standardized test is above .80, it is said to have very good reliability; if it is below .50, it would not be considered a very reliable test" (Classroom Assessment, 2013, Florida Center for Instructional Technology).
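To make the quoted cutoffs concrete, the following is a minimal sketch of one common way to estimate reliability: correlating scores from two sittings of the same test (test-retest reliability). The source does not say which reliability coefficient it has in mind, so the choice of Pearson correlation is an assumption, and the scores and function names are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def describe_reliability(r):
    """Apply the rule of thumb quoted above (.80 and .50 cutoffs)."""
    if r > 0.80:
        return "very good reliability"
    if r < 0.50:
        return "not very reliable"
    # The quoted rule does not label the range between the cutoffs.
    return "between the two quoted cutoffs"

# Hypothetical scores for six students on two sittings (illustration only).
first_sitting = [520, 610, 480, 700, 560, 640]
second_sitting = [530, 600, 490, 690, 570, 650]

r = pearson_r(first_sitting, second_sitting)
print(f"test-retest r = {r:.2f}: {describe_reliability(r)}")
```

Because each hypothetical student scores within 10 points of the first sitting, the correlation falls well above the .80 cutoff; scores that swung widely between sittings would pull it down.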
A test can be reliable but not valid. For example, a scale might show 4 ounces for a cup of flour on every attempt, but if the flour actually weighs 5 ounces, the scale is consistent yet not valid. Similarly, a test that consistently places a student above or below grade level, contrary to the findings of other accepted assessments, lacks validity. Validity, however, is harder to judge objectively when the thing being measured is human ability rather than weight.
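The scale analogy can be sketched numerically by separating two quantities: the spread of repeated readings (a stand-in for reliability) and their average distance from the true weight (a stand-in for validity). This is only an illustration; the function name is hypothetical, and the 5-ounce true weight is taken from the second scale example and merely assumed for the first.

```python
from statistics import mean, pstdev

def summarize_scale(readings, true_weight):
    """Return (spread, bias) for repeated readings of one object.

    spread: population standard deviation of the readings (0 = perfectly consistent)
    bias:   mean reading minus the true weight (0 = accurate on average)
    """
    spread = pstdev(readings)
    bias = mean(readings) - true_weight
    return spread, bias

TRUE_WEIGHT_OZ = 5.0  # assumed true weight of the cup of flour

scales = {
    "inconsistent scale": [4.0, 4.25, 4.5],        # readings drift between attempts
    "consistent but off scale": [4.0, 4.0, 4.0],   # same wrong reading every time
}

for label, readings in scales.items():
    spread, bias = summarize_scale(readings, TRUE_WEIGHT_OZ)
    print(f"{label}: spread = {spread:.2f} oz, bias = {bias:+.2f} oz")
```

The second scale has zero spread, so it is perfectly reliable, yet it is a full ounce off on every attempt, which is the sense in which it is not valid.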
A prominent example of the validity debate in standardized testing is the SAT, the exam many students must take to be considered for college admission. Contrary to what many people believe, the SAT is not an intelligence test; it is presented as a predictor of students' grades during their first year of college. "The College Board's Handbook for the SAT Program 2000–2001 claims the SAT-V and SAT-M have a correlation of .47 and .48, respectively, with freshman GPA (FGPA)" (SAT I: A Faulty Instrument for Predicting College Success, 2007, Fair Test).
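One standard way to read a correlation is to square it: in a simple linear relationship, r squared is the share of variance in the outcome that the predictor accounts for. The sketch below applies that to the two reported figures. It treats each SAT section on its own, which is an assumption, since the quoted passage does not give a combined correlation.

```python
def variance_explained(r):
    """Share of outcome variance accounted for by a single predictor with correlation r."""
    return r ** 2

# Correlations with freshman GPA as quoted above.
reported = {"SAT-V": 0.47, "SAT-M": 0.48}

for section, r in reported.items():
    share = variance_explained(r)
    print(f"{section}: r = {r:.2f}, r^2 = {share:.4f}")
```

Read this way, each section taken alone accounts for roughly 22 to 23 percent of the variation in first-year grades, leaving most of it unexplained.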
However, the College Board's claim is controversial. While the SAT's validity as a predictor of first-year grades was already debatable, whether it could validly assess a student's overall academic success in college was even more so. "After a three-year validity study analyzing the power of the SAT I, SAT II, and high school grades to predict success at the state's eight public universities, University of California (UC) President Richard Atkinson presented a proposal in February 2001 to drop the SAT I requirement for UC applicants. The results from the UC validity study, which tracked 80,000 students from 1996–1999, highlighted the weak predictive power of the SAT I" (SAT I: A Faulty Instrument for Predicting College Success, 2007, Fair Test). In other words, when a student's entire academic career was considered, SAT scores were poor predictors of performance, indicating that the test fell short as a valid predictor of overall college achievement. Critics have long complained that the SAT fails to measure "high-level intellectual strengths, imagination, judgment, inductive reasoning, and abilities to reflect, organize, and synthesize," qualities that are important in upper-level college coursework (Heller, 1997, p. 110).
The SAT has since been reformed to some degree, and data continue to accumulate on whether changes such as the addition of an essay portion have improved its reliability and validity. Nevertheless, this history illustrates that a test may be reliable without being valid, at least not for what admissions staff want it to indicate, namely overall future college performance. The old SAT appeared to be a reasonably reliable and valid predictor of first-year performance in college, but whether that single year deserves such weight in the admissions process remains debatable.
[Section omitted: assesses GRE validity for graduate school performance.]
[Section omitted: critiques inconsistent NYC elementary school test results.]
In summary, to create more effective assessments, it is essential that evaluations reflect holistic student achievement rather than focusing on a single year or a narrow set of skills. Furthermore, before subjecting students to high-stakes assessments, there must be meaningful agreement about standards, so that parents and students do not have the disorienting experience of being told one year that students are proficient, and the next year that they require remedial help and that their school is "failing." Consistency, transparency, and a broader view of student learning are essential to any assessment system that aims to be both reliable and genuinely valid.