This paper examines the role of variables in statistical modeling, prediction, and hypothesis testing. It explains the purpose of simple linear regression and scatter diagrams, walking through a worked example using tree girth and leaf count to illustrate the regression equation, slope, y-intercept, and residuals. The paper then extends the discussion to multiple regression analysis, explaining the coefficient of determination (R²) and partial regression coefficients. Practical applications are highlighted, including predicting atmospheric gas pressure and estimating coronary heart disease risk from multiple independent variables such as blood pressure, cholesterol, and BMI.
Statistics can be used to describe a phenomenon or to test hypotheses, but they can also be used to predict outcomes (Hanneman, Kposowa, & Riddle, 2012, p. 7). Making predictions requires variables in the statistical model, because variables allow researchers to model the mathematical relationship between two or more phenomena. For example, the relationship between tree circumference and the number of leaves could be described by two variables: girth and leaves. This relationship may be correlational without being causal. If a causal relationship is suspected, for instance that reducing the number of leaves inhibits the expansion of tree girth, then leaves would be the independent variable and girth the dependent variable. If the model is valid, then tree girth could be predicted for a given value of leaves. Variables therefore allow researchers to test hypotheses and make predictions.
Simple linear regression is a statistical tool for quantifying the relationship between two variables (Journal of Tropical Pediatrics, n.d., p. 3). It is used to predict the mean response of the dependent variable (Y) for a given value of the independent variable (X). Scatter diagrams are useful because they make the error inherent in the prediction model visible. When the best-fit line is included in the scatter plot, the magnitude of the error (residual) for each observed value of X, that is, the vertical distance between the observed point and the line, can be easily seen. Scatter plots are also useful for troubleshooting the prediction model, for example by revealing outliers or a curved pattern that a straight line cannot capture.
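The idea of a best-fit line and its residuals can be illustrated with a short Python sketch. The data points below are hypothetical and invented purely for illustration; they do not come from any cited source. The function name fit_line is likewise an assumption of this sketch. The code uses the standard ordinary least-squares formulas for the slope and intercept, then computes each residual as the observed Y minus the predicted Y.

```python
# Hypothetical (leaves, girth in cm) observations, invented for illustration only.
leaves = [100, 150, 200, 250, 300]
girth = [2.4, 5.2, 7.4, 10.1, 12.4]


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(leaves, girth)

# Residual = observed girth minus the girth predicted by the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(leaves, girth)]

for x, r in zip(leaves, residuals):
    print(f"leaves={x}: residual={r:+.3f} cm")
```

For a least-squares line with an intercept, the residuals sum to zero (up to rounding), which is a quick sanity check on the fit. Plotting the points together with the fitted line would give the scatter diagram described above.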
If X = leaves and Y = tree girth, then Y = bX + a, where b is the slope of the best-fit line and a is the y-intercept (Journal of Tropical Pediatrics, n.d., p. 5). This is the regression equation, and a and b are the regression coefficients. If the regression coefficients were a = −2.5 and b = 0.05 for a given tree species, with girth measured in centimeters, then the regression equation would be Y = 0.05X + (−2.5). A value of X, ideally within the range of the data used to fit the line, can be entered into this formula to determine the predicted value of Y. For example, if the number of leaves on a tree were 250, then the predicted girth would be 0.05 × 250 + (−2.5) = 12.5 − 2.5 = 10 cm.
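The worked example can be checked with a few lines of Python. This is a minimal sketch that simply encodes the regression equation with the coefficients given in the text (a = −2.5, b = 0.05); the function name predict_girth is an assumption made for illustration.

```python
def predict_girth(leaves, slope=0.05, intercept=-2.5):
    """Predicted girth in cm from Y = bX + a, using the example coefficients."""
    return slope * leaves + intercept


# The case from the text: a tree with 250 leaves.
predicted = predict_girth(250)
print(f"Predicted girth for 250 leaves: {predicted:.1f} cm")

# Entering X = 0 returns the y-intercept, which is negative here.
at_zero = predict_girth(0)
print(f"Predicted girth for 0 leaves: {at_zero:.1f} cm")
```

The second call shows why predictions far outside the observed range should be treated with caution: at zero leaves the equation returns the intercept, a negative girth, which is not physically meaningful.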
"Multiple predictors, R-squared, and real-world uses"