This paper explores the fundamental principles of measurement in social scientific research, examining how researchers decide to measure the presence, absence, or number of concepts. The paper discusses reliability and validity as core concerns, explains the four levels of measurement (nominal, ordinal, interval, and ratio), and addresses data recoding strategies. Through multiple exercises and examples—including measurement of military assertiveness, moral values, political ideology, and congressional voting patterns—the paper demonstrates how to select appropriate measurement schemes, evaluate validity, and apply these concepts to hypothesis testing across diverse research domains.
Measurement involves deciding how to measure the presence, absence, or number of concepts in a research project. Reliability and validity of measures are key concerns for any researcher developing a measurement scheme.
A reliable measure yields a consistent, stable result as long as the concept being measured remains unchanged. Measurement strategies that rely on memories, for example, may be quite unreliable, because the ability to remember specific information may vary depending on when the measurement is made and whether distractions are present. In contrast, valid measures correspond well with the meaning of the concept being measured. Researchers often develop elaborate schemes to measure complex concepts, requiring careful attention to both consistency and accuracy.
Level of measurement is an important aspect of any measurement scheme. There are four levels of measurement, ranging from lowest to highest: nominal, ordinal, interval, and ratio. Choosing the appropriate statistics for the analysis of data depends on knowing the level of measurement of your variables.
A variable can be measured using a variety of schemes. Choosing the scheme that uses the highest level possible provides the most information and is the most precise measure of a concept. However, the appropriate level depends on the nature of the concept being measured.
Nominal Level: This is the lowest level of measurement. Nominal measures simply categorize data without any ordering. Examples include employment sector (public or private), marital status (never married, married, widowed, divorced, separated), or tone of a news article (positive, mixed, negative).
Ordinal Level: At this level, categories can be ordered or ranked, but the distances between categories are unknown or not necessarily equal. Examples include volunteer work frequency (fewer than 5 hours, between 5 and 10 hours, more than 10 hours per month), education level (freshman, sophomore, junior, senior), or frequency of newspaper reading (every day, 5–6 days per week, down to less than 1 day per week).
Interval Level: This level includes ordered categories with meaningful distances between them, but no true zero point. Year of first election to public office is an example of an interval measure.
Ratio Level: The highest level of measurement, ratio scales have all the properties of interval scales plus a meaningful zero point. Examples include child poverty (percentage of children living in poverty), per-pupil education spending, and number of years served in Congress. These measures allow for meaningful comparisons of magnitude and proportion.
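The link between level of measurement and choice of statistics can be sketched in code. The following is a minimal illustration, not a complete rule set: it assumes the common textbook convention that the mode suits nominal data, the median suits ordinal data, and the mean suits interval and ratio data. The `Level` enum, the `central_tendency` helper, and all data values are hypothetical.

```python
from enum import IntEnum
from statistics import mean, median, mode


class Level(IntEnum):
    """The four levels of measurement, ordered from lowest to highest."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


def central_tendency(values, level):
    """Return a summary statistic that the given level supports.

    Assumed convention: mode for nominal, median for ordinal,
    mean for interval and ratio.
    """
    if level == Level.NOMINAL:
        return mode(values)      # categories only: most frequent value
    if level == Level.ORDINAL:
        return median(values)    # values must be numeric rank codes
    return mean(values)          # distances are meaningful


# Hypothetical data, invented for illustration only.
sector = ["public", "private", "private", "public", "private"]
reading_rank = [1, 2, 2, 3, 5]           # ordinal codes, 1 = lowest frequency
years_in_congress = [2, 4, 6, 12, 16]    # ratio-level counts

print(central_tendency(sector, Level.NOMINAL))
print(central_tendency(reading_rank, Level.ORDINAL))
print(central_tendency(years_in_congress, Level.RATIO))
```

Because the levels are cumulative, a statistic valid at a lower level remains valid at a higher one (the mode of a ratio variable is still meaningful), but not the reverse.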
Reliability and validity, while related, measure different aspects of a good measure. Reliability refers to consistency—whether a measure produces the same result repeatedly under unchanged conditions. Validity refers to accuracy—whether a measure actually captures what it claims to measure.
Exercise 5-2 illustrates this distinction well. When measuring discrimination experienced by racial and ethnic groups, asking respondents the exact number of times they experienced discrimination in the past three months will not yield reliable information. People cannot accurately remember precise frequencies over extended periods. Instead, asking respondents to categorize their experience as "very often," "fairly often," "once in a while," or "never" yields more reliable data. Although subjective, this categorical approach is easier to gauge than exact frequencies, and respondents can more consistently apply these categories across multiple items.
Validity takes different forms. Face validity refers to whether an item appears, on its surface, to measure the intended concept. For example, when measuring military assertiveness—defined as the inclination toward militant versus accommodative approaches to defending American interests—items directly addressing military strength, national defense, and military spending exhibit face validity. Items about obedience to authority or moral standards, while potentially correlated with assertiveness, do not directly address the construct and thus have weaker face validity.
Construct validity involves demonstrating that responses to an item relate to measures of other concepts in the ways that theory predicts. This requires empirical testing beyond the face value of the item. The distinction matters: face validity is a preliminary judgment, while construct validity requires evidence.
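One simple form of such evidence is a correlation between the item and a second measure that theory says should be related to it. The sketch below computes a Pearson correlation by hand. The variable names and the toy scores are hypothetical and invented purely for illustration; a real assessment would use actual survey data and usually more than one related measure.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)


# Hypothetical toy data: five respondents, each scored 1-5 on two measures.
assertiveness_item = [1, 2, 3, 4, 5]
defense_spending_support = [2, 1, 4, 3, 5]

r = pearson_r(assertiveness_item, defense_spending_support)
print(round(r, 2))
```

A strong correlation in the predicted direction is consistent with construct validity; a weak or reversed one suggests the item may be capturing something else.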
Researchers frequently recode data, which often changes (typically lowers) the level of measurement of a variable. Recoding allows researchers to collapse multiple categories into fewer, more manageable groups. Two primary strategies guide this process:
Theoretical Recoding: Choose categories that are meaningfully distinct, where theory would tell you that the differences between the categories are important or where you can see distinct clusters of scores or values. For example, when combining actual household income amounts into income levels, a researcher might consider what the official poverty level is and group all households with incomes below that level into the lowest income group. This approach ensures that category boundaries align with conceptually meaningful thresholds.
Equally Sized Categories: Choose categories so that each category has roughly an equal number of cases. In addition, limit the number of categories so that each category has at least ten cases. This approach facilitates statistical analysis by ensuring sufficient sample size within each category.
The choice between these strategies depends on the research question and available data. A frequency distribution of Senate voting records on labor issues, for instance, could be recoded into two ordinal categories either by dividing at the 50-point mark (0–49 and 50–100) or by identifying natural breaks in the distribution. The resulting categories represent support levels that researchers can use in subsequent analysis.
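The two recoding strategies can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the category labels, and the score list are all hypothetical, and the equal-sized version assigns cases by rank, so tied scores near a boundary may land in different categories.

```python
def recode_at_cutpoint(scores, cut=50):
    """Theory-driven recode: split scores at a meaningful threshold."""
    return ["low" if score < cut else "high" for score in scores]


def recode_equal_sized(scores, n_categories):
    """Recode so each category holds roughly the same number of cases.

    Returns category codes 0 .. n_categories - 1 in the original case order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    categories = [0] * len(scores)
    for rank, case in enumerate(order):
        categories[case] = rank * n_categories // len(scores)
    return categories


# Hypothetical 0-100 support scores, invented for illustration only.
scores = [0, 10, 35, 50, 85, 90, 95, 100]

print(recode_at_cutpoint(scores))
print(recode_equal_sized(scores, 2))
```

With these toy scores the fixed cut point places three cases in the low group and five in the high group, while the equal-sized recode places four in each. Which result is preferable depends on whether the threshold carries theoretical meaning.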
Complex social science concepts require careful operationalization—translating abstract ideas into measurable, observable behaviors or responses. Operationalization is necessary before theory can be tested or hypotheses examined.
Consider the concept of moral values. Before measuring moral values, researchers must conceptualize what this abstract notion means in concrete terms. A person with high moral values might "always tell the truth," "never jay-walk," or "not cheat people with whom they do business." Each of these observable behaviors provides an operationalization of the abstract construct.
To measure these operationalized behaviors, researchers typically use response scales such as Likert scales, which ask respondents to indicate how strongly they agree that each item describes them (strongly agree, agree, neither agree nor disagree, disagree, strongly disagree). This approach translates the construct of "moral values" into an operation that can be measured and analyzed. Summing the coded responses across items creates an index score that represents the respondent's moral value orientation.
Measurement strategy selection matters significantly. When measuring political ideology, for example, two different strategies produce different results. Strategy 1 asks respondents to rate the importance of various policy goals using a five-point scale and then adds the responses into an index score. However, this approach permits respondents to rate all goals as "very important," making it impossible to discriminate meaningfully between liberals and conservatives.
Strategy 2 uses forced-choice items, where respondents must choose between liberal and conservative alternatives. This approach produces a clearer categorization because respondents cannot endorse every option equally. By compelling respondents to make meaningful distinctions between competing values, the forced choice yields a measure that discriminates better between ideological positions.
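The difference between the two strategies can be made concrete with a small sketch. Everything here is hypothetical: the function names, the four-item format, and the two invented respondents, who are imagined to hold opposite ideologies but who both rate every goal as "very important" (coded 5) under Strategy 1.

```python
def rating_index(ratings):
    """Strategy 1: add up the five-point importance ratings."""
    return sum(ratings)


def forced_choice_index(choices):
    """Strategy 2: count how many times the liberal alternative was chosen."""
    return sum(choice == "liberal" for choice in choices)


# Strategy 1: both hypothetical respondents rate all four goals as 5.
liberal_ratings = [5, 5, 5, 5]
conservative_ratings = [5, 5, 5, 5]

# Strategy 2: the same respondents must pick one side on each of four items.
liberal_choices = ["liberal"] * 4
conservative_choices = ["conservative"] * 4

print(rating_index(liberal_ratings), rating_index(conservative_ratings))
print(forced_choice_index(liberal_choices), forced_choice_index(conservative_choices))
```

Under Strategy 1 the two respondents receive identical index scores, so the measure cannot separate them. Under Strategy 2 their scores fall at opposite ends of the scale.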
Frequency distributions and data tables provide concrete examples of measurement in practice. The American Federation of Labor-Congress of Industrial Organizations (AFL-CIO) rating system, which scores senators' voting alignment with labor priorities on a scale from 0 to 100, applies numeric measurement to legislative behavior. Because a score of 0 has a substantive meaning (no votes aligned with labor's position), the scale is arguably ratio-level rather than merely interval-level. By grouping these scores into categories (such as 0–49 and 50–100), researchers can transform the numeric scores into ordinal categories for comparative analysis.
Similarly, the League of Conservation Voters 2006 ratings of state House delegations illustrate how measurement schemes apply across different policy domains. These ratings show considerable variability across states, with some delegations averaging 0 (no conservation votes) and others averaging 100 (perfect conservation voting record). Converting these scores into four categories for a variable called "Support for LCV" allows researchers to classify state delegations by conservation orientation, trading some precision for categories that are easier to compare.
The final application of measurement principles involves hypothesis testing. For each hypothesis, researchers must identify variables, operationalize them, and develop valid and reliable measurement strategies.
"Connecting measurement design to empirical research questions"