This paper examines two foundational concepts in psychological and educational measurement: reliability and validity. It defines reliability as the consistency of test results over time and across conditions, then describes four major types — test-retest, inter-rater, parallel-forms, and internal consistency. The paper next addresses validity, covering face, content, criterion (concurrent and predictive), and construct validity, drawing on key scholars such as Joppe, Cherry, and Cook and Campbell. Additional sections explore what psychologists must do before administering a test to ensure adequate reliability and validity for a specific client, the ethical and legal obligations governing psychological assessment, and the importance of each validity type in educational institutions and mental health clinic settings.
Joppe (2000, p. 1) defines reliability as the extent to which results are consistent over time and accurately represent the population under study. If the outcome of a study can be reproduced using a similar methodology, then the instruments used in the research are said to be reliable.
Reliability thus involves both the replicability and the repeatability of observations or results. Kirk and Miller (1986, pp. 41–42) identified three types of reliability in quantitative research: the extent to which a given measure, if repeated, remains constant; the stability of the measure over time; and the similarity of measurements within a given time period.
Charles (1995) focuses on the consistency with which a given test item is answered. The test-retest method is one common way of assessing this kind of reliability, and the attribute of the instrument that it assesses is called stability. A stable measure produces similar results on repeated administrations, so a high level of stability indicates a high level of instrument reliability.
There is, however, a problem with the test-retest method that can make it unreliable to some degree. Joppe (2000) explained that the technique may sensitize respondents to the specific subject matter, thereby influencing their responses on the second administration. In general terms, reliability refers to the level of consistency of a given measure. In psychology, for example, if a test is designed to measure a trait such as introversion, every administration of that test to a given subject should yield approximately the same results. Reliability cannot be calculated exactly, though ways of estimating it do exist (Cherry, n.d.).
There are several types of reliability (PTI, 2006; Cherry, n.d.), as described below.
In test-retest reliability, the same test is administered twice at two distinct points in time (Cherry, n.d.). This approach assumes that the construct or quality being measured does not change between administrations. It is generally employed for characteristics that are stable over time, such as intelligence.
Inter-rater reliability is assessed by having two independent judges score the test. The scores obtained are then compared to determine the level of consistency between the raters' estimates. One technique for testing inter-rater reliability is to have each rater score items on a 1–10 scale and then calculate the correlation between the two sets of scores to determine the degree of agreement.
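The correlation step described above can be sketched in a few lines. This is a minimal illustration that assumes the Pearson correlation coefficient is the statistic used (the source does not name one); the function name `pearson_r` and the ratings are hypothetical, invented for the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a rater gives constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings of six responses on a 1-10 scale (illustrative data only).
rater_a = [7, 5, 9, 3, 6, 8]
rater_b = [6, 5, 9, 4, 6, 7]

r = pearson_r(rater_a, rater_b)
print(f"inter-rater correlation: {r:.2f}")
```

A coefficient close to 1 suggests that the two raters rank the responses in a similar way. A correlation captures consistency rather than exact agreement, so two raters who differ by a constant amount would still correlate perfectly.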
Parallel-forms reliability is determined by comparing two different tests that were created from the same content. It is achieved by generating a large set of test items aimed at measuring the same quality and then randomly dividing those items into two separate tests.
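The random division of an item pool into two forms might look like the following sketch. The function name `split_into_parallel_forms`, the item labels, and the pool size of 20 are hypothetical; the fixed seed is used only so that the split can be reproduced.

```python
import random


def split_into_parallel_forms(items, seed=None):
    """Randomly divide a pool of items into two forms of (near-)equal size."""
    rng = random.Random(seed)   # private generator; leaves global random state alone
    shuffled = list(items)      # copy so the original pool is not reordered
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical pool of 20 items written to measure the same quality.
item_pool = [f"item_{i:02d}" for i in range(1, 21)]

form_a, form_b = split_into_parallel_forms(item_pool, seed=42)
print(len(form_a), len(form_b))
```

Both forms would then be administered to the same group, and the two sets of scores compared to estimate how consistently the forms measure the same quality.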
Internal consistency reliability is used to judge the consistency of results across items on the same test. It involves comparing test items that all measure the same construct to determine whether they produce similar scores.
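One widely used index of internal consistency is Cronbach's alpha, which the source does not name; it is introduced here only as an illustrative assumption. For a test with k items, alpha equals k / (k - 1) multiplied by one minus the ratio of the summed item variances to the variance of the total scores. The function name `cronbach_alpha` and the response data below are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows, one row of item scores per respondent."""
    if len(responses) < 2 or len(responses[0]) < 2:
        raise ValueError("need at least two respondents and two items")
    k = len(responses[0])
    items = list(zip(*responses))                      # one column per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("alpha is undefined when all total scores are identical")
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical answers of five respondents to four items on a 1-5 scale.
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 3, 4],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Values closer to 1 indicate that the items vary together across respondents, which is what one would expect if they all measure the same construct.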
Joppe (2000) explained validity as a determination of whether a given research instrument truly measures what it is intended to measure, as well as the degree of truthfulness of the results. Wainer and Braun (1988) referred to validity more specifically as construct validity.
Validity in the context of psychology has been discussed extensively by Tebes (2000). The work of Cook and Campbell (1979) identified four major types of validity: internal validity, statistical conclusion validity, external validity, and construct validity. Internal validity addresses causal inferences between two variables. Statistical conclusion validity concerns inferences about covariations between two variables. External validity involves generalization to other settings, time periods, and populations. Construct validity involves generalization regarding the theoretical relationship between cause and effect.
Cherry (n.d.) defined face validity as a simple form of validity that involves determining whether the test appears to measure whatever it is meant to measure. In this approach, researchers take the validity of the test at "face value" by examining whether the test seems to measure the intended variable. For example, a researcher measuring happiness would say the test has face validity if it appears to measure happiness levels. The disadvantage of this approach is that it is not precise, since it captures only the superficial signs of a variable. Researchers therefore need to carry out further investigations beyond face validity alone.
Content validity refers to the degree to which the items in a given instrument reflect the content universe to which the instrument will be generalized (Straub et al., 2004). As Lewis (1995) pointed out, content validity generally entails evaluating a new instrument to ensure that it includes all items considered essential while eliminating those deemed irrelevant to the construct domain. When a test possesses content validity, its items represent the entire range of what the test should cover. Content validity has the disadvantage of potential bias, since it depends on the subjective opinions of the judges rating the items.
A test is said to possess criterion-related validity if it has demonstrated that it is effective in predicting the criterion or indicators of a given construct (Cherry, n.d.). Miller et al. (2003) noted that criterion-related validity is assessed when one needs to determine the relationship between test scores and a specific criterion — for example, the relationship between scores on an admissions test and grade point average. There are two subtypes of criterion validity:
Concurrent validity applies when the criterion measures are obtained at the same time as the test scores. It indicates the extent to which the test scores accurately estimate the current state of an individual or situation with respect to the criterion. For example, a depression test may be described as having concurrent validity if it accurately reflects the level of depression a subject is currently experiencing.