This paper examines item analysis as a tool for improving educational tests, focusing on its core concepts (item difficulty, item discrimination, and internal consistency) and on the theoretical frameworks used to conduct it. The paper outlines arguments in favor of item analysis, including its ability to shorten tests, reduce bias, and align assessments with frameworks such as Bloom's Taxonomy. It also presents counterarguments, including concerns about over-reliance on individual test items and the limitations exposed by adaptive computer-based testing formats such as the revised GRE. The paper concludes that item analysis, while imperfect, offers teachers a standardized, evidence-based method for constructing fairer and more efficient assessments.
Item analysis is a technique used to shorten tests while still providing reliable information about student performance. It can also be used to clarify or improve questions students will be tested on in the future, as well as to eliminate questions that are not reflective of students' real abilities. An item analysis is conducted after the fact—that is, after a test is administered—and allows the teacher to improve and redesign the test based on the feedback received from student responses. With the rise of test analysis technology, teachers in classrooms as well as professional test designers can now use the method to improve their assessments.
A typical score report provides summary statistics such as the mean score and the standard deviation of scores around that mean ("Understanding item analysis reports," 2015). Item difficulty is also reported, along with a measure of how consistently the test's items assess students' understanding of the material being tested. A test with a high level of internal consistency will be more reliable, and is more likely to be valid, than one with a low level. Ideally, the difficulty level of a specific item (the proportion of students who answer it correctly) should be slightly greater than the midpoint between the chance score and a perfect score, which compensates for correct responses produced by random guessing ("Understanding item analysis reports," 2015).
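To make these statistics concrete, the following is a minimal Python sketch, not the procedure of any particular reporting tool. It assumes items are scored 1 (correct) or 0 (incorrect). The response data are invented and the function names are hypothetical. KR-20 is used here as one common internal-consistency index for right/wrong items, and the target difficulty follows the rule of thumb of taking the midpoint between the chance score and a perfect score.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical scored responses: one row per student, one column per item
# (1 = correct, 0 = incorrect). The numbers are made up for illustration.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]


def item_difficulty(rows):
    """Proportion of students answering each item correctly."""
    n_students = len(rows)
    return [sum(column) / n_students for column in zip(*rows)]


def target_difficulty(n_options):
    """Midpoint between the chance score (1 / n_options) and a perfect score."""
    chance = 1 / n_options
    return (chance + 1) / 2


def kr20(rows):
    """Kuder-Richardson 20 for 0/1 items, using population variances."""
    n_items = len(rows[0])
    proportions = item_difficulty(rows)
    total_variance = pvariance([sum(row) for row in rows])
    sum_pq = sum(p * (1 - p) for p in proportions)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


totals = [sum(row) for row in responses]
print("mean score:", mean(totals))
print("standard deviation:", pstdev(totals))
print("item difficulty:", item_difficulty(responses))
print("target difficulty, 4 options:", target_difficulty(4))
print("KR-20:", kr20(responses))
```

Under this rule of thumb, a four-option multiple-choice item has a chance score of 0.25, so the midpoint is 0.625; the guidance cited above places the ideal difficulty slightly above that value.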
A key concept behind item analysis is that of item discrimination: the extent to which a response to an item correlates with a high or low overall score on the test. For example, a difficult test question might be answered correctly mostly by students with high overall marks and incorrectly mostly by students with low marks. This would suggest an effective test question, as opposed to one that produces a relatively random pattern of answers (McDonald, 2013, p. 231). Conversely, a test question that stumps the otherwise highest-performing test-takers while being answered correctly by the lowest-performing ones would be a poor measure of ability.
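One way to quantify discrimination is to correlate each item with the score on the rest of the test. The sketch below is an illustration under assumptions rather than the method of any cited source: the data are invented, the function names are hypothetical, and the item's own score is excluded from the total so that it does not inflate its own correlation.

```python
from math import sqrt

# Hypothetical 0/1 responses, ordered roughly from stronger to weaker
# students. Item 4 (last column) is answered correctly only by weaker students.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]


def pearson(x, y):
    """Pearson correlation; returns NaN when either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def item_discrimination(rows):
    """Correlation of each item with the total score on the remaining items."""
    results = []
    for j in range(len(rows[0])):
        item_scores = [row[j] for row in rows]
        rest_scores = [sum(row) - row[j] for row in rows]
        results.append(pearson(item_scores, rest_scores))
    return results


for number, value in enumerate(item_discrimination(responses), start=1):
    print(f"item {number}: discrimination = {value:.2f}")
```

In this invented data set the first three items correlate positively with the rest of the test, while the fourth correlates negatively, which is the pattern described above as problematic.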
Testing time is finite, and item analysis allows tests to be shorter and more carefully designed to reflect the needs of teachers and school districts. Teachers can also classify items to ensure that a wide range of student needs and abilities is assessed. For instance, Bloom's Taxonomy can be used to rate various questions based on the types of higher-level thinking required to answer them ("Item analysis," 2015). Test questions answered correctly only by the most sophisticated thinkers in the class might highlight potential skills deficits in the student population as a whole, as well as problems with the test itself.
Teachers often use the same tests from year to year, but testing can be—and should always be—a work in progress. Test items must constantly be screened for confusing wording that does not address the desired content area; for bias against a specific population (such as along lines of race or gender); and for whether the phrasing of the question or answer nudges the reader too strongly toward a particular response (Krishnan, 2013, p. 7).
There are also established, peer-reviewed frameworks for screening potential biases and other problems: Classical Test Theory (CTT), also called Classical Measurement Theory (CMT), and Item Response Theory (IRT), of which the Rasch model is a well-known example (Krishnan, 2013, p. 2). CTT can be applied with smaller samples but, because it is sample-dependent, its results are not easily generalizable. IRT estimates, by contrast, can be used to assess the overall accuracy of items for test-takers at different levels of ability. In other words, a test item that is useful for a highly skilled population may not be equally useful for a less skilled one. CTT tends to be simpler and less costly to implement. When using CTT, the assumption is that if a sample population is randomly selected, errors will occur but will be normally distributed, uncorrelated with one another and with the true score, and will have an expected mean of zero across repeated trials (Krishnan, 2013, p. 11).
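The idea that an item's usefulness depends on the ability of the test-taker can be illustrated with the Rasch model, the one-parameter logistic form of IRT. The sketch below is an illustration only: the ability and difficulty values are hypothetical, both are expressed on the same logit scale, and the function names are not drawn from any library.

```python
from math import exp


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1 / (1 + exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Item information under the Rasch model: p * (1 - p)."""
    p = rasch_probability(ability, difficulty)
    return p * (1 - p)


# A hypothetical hard item (difficulty = 2.0) seen by three test-takers.
hard_item = 2.0
for ability in (-2.0, 0.0, 2.0):
    p = rasch_probability(ability, hard_item)
    info = item_information(ability, hard_item)
    print(f"ability {ability:+.1f}: P(correct) = {p:.3f}, information = {info:.3f}")
```

Information is highest when ability matches the item's difficulty, so in this sketch the hard item tells a teacher a good deal about strong students and very little about weak ones.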
"Critiques and limitations of the method"
Overall, item analysis is a useful technique, particularly given the pressures teachers face in the modern educational climate. It is not a perfect technique, and teachers must still be mindful of the aptitudes and needs of their students when constructing tests. However, given the increased pressure on teachers to create accurate assessments in short periods of time for diverse populations, item analysis offers a valuable tool. Tests that are not sufficiently difficult can produce an inflated pass rate and reduce motivation among students who are not working to their potential. Conversely, tests populated with too many difficult or biased items can dramatically decrease motivation even among conscientious high achievers. When properly applied, item analysis can help strike a productive balance.