This paper provides an accessible overview of the chi-square statistical test, tracing its origins to Karl Pearson's foundational 1900 publication and explaining its core purpose: analyzing categorical, qualitative data to determine relationships between variables. The paper distinguishes quantitative from qualitative data, introduces the concept of Pearson's Chi-Square test, and walks through a concrete example involving student graduation rates to illustrate how the formula is applied. It also explains supporting concepts such as degrees of freedom and probability tables, giving readers a clear foundation for understanding when and how chi-square analysis is appropriately used in research contexts.
There are many different types of information available in the world, and each type can be utilized in very different and highly specific ways depending on both the form of the information and the needs of those using it. From one perspective, these types of information can be classified into two broader categories: quantitative and qualitative. Quantitative information is information that can essentially be reduced to numeric form. It can arise out of either counting or measurement, leading to discrete or continuous data points that can be further analyzed and manipulated to yield deeper understandings of quantifiable phenomena and events. Qualitative data, on the other hand, cannot be reduced to numbers and must be analyzed through other means.
Statistics has developed as a field of mathematics that enables researchers to analyze both quantitative and qualitative information in ways that allow for comparison and interpretation across many different research contexts.
The chi-square analysis is one statistical tool developed specifically as a way of analyzing and manipulating qualitative data. The chi-square method was created in order to compare categorical data and determine what type of relationship exists between different qualitative variables (HWS, 2010). A drug trial, for instance, might need to compare how often symptoms improved among people receiving a drug with how often they improved in a control group not taking the drug. A chi-square test is one tool for judging whether the difference between the two groups is larger than chance alone would be expected to produce.
There are actually several different types of chi-square analysis that can be utilized depending on the needs and scope of the research, but the most common is the Pearson's Chi-Square test. Karl Pearson was a scientist, philosopher, and mathematician of considerable renown both during and after his lifetime. His development of a specific method for analyzing the goodness of fit of a sample distribution — and for testing the independence of certain variables or phenomena, as in the drug trial example above — is only one of his contributions to the worlds of science and data analysis (Plackett, 1983).
In 1900, Pearson began working with the Chair of Zoology at University College London, who supplied him with a great deal of data. At that time, his decade of work in correlation (methods of determining the degree to which separate observations occur together, or specifically in the other's absence, suggesting some relationship) and regression analysis (estimating how one or more variables relate to a dependent variable) was culminating in the method of data analysis that now bears his name, published that same year (Plackett, 1983).
Essentially, Pearson's formula translates qualitative data from a set of observations into a single number. Probability tables, organized by level of significance and by degrees of freedom, then give the probability of obtaining a chi-square statistic at least that large if the variables were in fact independent. The smaller that probability, the stronger the evidence that the variables are related.
The most straightforward example of a chi-square test uses two populations and one variable of examination with a binary ("yes/no") set of possibilities. One commonly cited example involves examining the high school graduation rate of students in a special program versus the graduation rate of a control group of students not involved in the program (Lane, 2010). If a grid is constructed to organize the data points, there would be two rows — one for each population — and two columns: one recording the number of students who graduated per population, and the other recording the number who did not (Lane, 2010).
Using Pearson's formula to develop the chi-square statistic for this two-by-two grid, the two rows and two columns are each summed separately, yielding four different totals. These totals, multiplied together, become the denominator of the fraction that constitutes the chi-square statistic. The numerator is the product of two factors. One is the sum of the four original data points, that is, the total number of observations. The other is derived by multiplying the diagonally opposite cells of the data grid (row 1, column 1 multiplied by row 2, column 2; and row 1, column 2 multiplied by row 2, column 1), subtracting one product from the other, and then squaring the result. Dividing the numerator by the denominator yields the chi-square statistic (HWS, 2010).
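The calculation described above can be sketched in a few lines of Python. This is only an illustration of the two-by-two shortcut, written without any continuity correction; the function name `chi_square_2x2` and the student counts are hypothetical and do not come from the cited sources.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 grid [[a, b], [c, d]].

    Rows are the two populations; columns are the yes/no outcomes.
    No continuity correction is applied.
    """
    total = a + b + c + d                # sum of the four data points
    row1, row2 = a + b, c + d            # row totals
    col1, col2 = a + c, b + d            # column totals
    denominator = row1 * row2 * col1 * col2
    if denominator == 0:
        raise ValueError("every row and column total must be nonzero")
    # Difference of the diagonal products, squared, times the total count.
    numerator = total * (a * d - b * c) ** 2
    return numerator / denominator


# Hypothetical counts, for illustration only:
# program group: 30 graduated, 10 did not; control group: 20 graduated, 20 did not.
statistic = chi_square_2x2(30, 10, 20, 20)
print(round(statistic, 3))
```

With these made-up counts the statistic is 16/3, or roughly 5.33. A grid in which both groups graduate at the same rate gives a statistic of zero.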
In order to use a chi-square probability table to find the probability associated with a given statistic, the degrees of freedom must be known. The degrees of freedom are the number of cell counts that remain free to vary once the row and column totals are fixed. A straightforward way to derive this figure is to subtract one from the number of rows and one from the number of columns, then multiply these two values together. In a 2×2 grid, there is therefore one degree of freedom, since (2 − 1) × (2 − 1) = 1.
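As a rough sketch, the degrees-of-freedom rule and the table lookup for the one-degree-of-freedom case can be written in Python using only the standard library. The function names are hypothetical. The lookup relies on the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable, so this shortcut does not apply to larger tables.

```python
import math


def degrees_of_freedom(rows, cols):
    """(rows - 1) * (cols - 1) for a grid of observed counts."""
    return (rows - 1) * (cols - 1)


def p_value_one_df(statistic):
    """Probability of a chi-square value at least this large, 1 degree of freedom.

    Valid only for one degree of freedom: the upper tail of the chi-square
    distribution then equals the two-sided tail of the standard normal.
    """
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(statistic / 2))


print(degrees_of_freedom(2, 2))          # a 2x2 grid
print(round(p_value_one_df(3.841), 3))   # 3.841 is the usual tabled 0.05 cutoff
```

A statistic above roughly 3.84 therefore corresponds to a probability below 0.05 when there is one degree of freedom, which is the same figure a printed table would show.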
The formula becomes somewhat more complex with larger data sets, but the underlying logic remains the same. As a result, chi-square analysis can be used to examine populations with many variables in just the same way it is applied to simple two-group comparisons, making it one of the most flexible and widely applicable tools available to researchers working with categorical data.
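The text does not spell out the formula for larger tables. A common general form, assumed here, compares each observed count with the count expected if rows and columns were independent (row total times column total, divided by the grand total) and sums the squared differences divided by the expected counts. The sketch below uses a hypothetical function name and made-up counts.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a grid of observed counts.

    `table` is a list of rows, each a list of counts of equal length.
    Assumes every row total and column total is nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical 3x2 table: three groups, yes/no outcome.
three_groups = [[10, 20], [20, 10], [15, 15]]
print(round(chi_square_statistic(three_groups), 3))
```

On a two-by-two grid this general form gives the same value as the shortcut built from the row totals, column totals, and diagonal products.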
From its origins in Karl Pearson's 1900 publication to its broad application across disciplines today, the chi-square test remains an essential method for analyzing relationships in categorical data. Whether applied to a simple two-group comparison or to a complex multi-variable data set, the core logic of the formula stays consistent: translating qualitative observations into a single statistic that can be evaluated against established probability tables. Understanding its history, its mathematical foundations, and its practical applications gives researchers and students alike a valuable tool for making sense of the qualitative world in quantitative terms.
HWS. (2010). The chi-square statistic. Hobart and William Smith College. Retrieved February 26, 2010, from
"Deriving degrees of freedom and reading probability tables"