… power of statistical analysis is the power to define, interpret, and understand numerical data that represents patterns in the real world. Without the ability to measure statistical data, the theoretical world of educational models could not be checked against actual performance. While statistics has applications in many fields, statistical data is possibly most powerful when used to identify patterns in personal behavior, and in other fields of study that do not follow exact rules across a sampling group. For example, mathematical equations govern how a specific metal will respond to different loads and different conditions. However, there are no direct mathematical equations that govern the percentage of teenage drivers who will be involved in traffic accidents over a period of time. In order to interpret the factors that influence teen drivers, a statistical measurement of actual experience can be undertaken. Through statistical analysis, patterns and tendencies can be discovered, and decisions can be made based on real-life experience rather than theory and assumption.
For this review of statistical methods, the following data table will be used. The data measure the tar, nicotine, and carbon monoxide (CO) produced while a given cigarette brand is smoked, together with the weight of the cigarette. The data presented below are taken from Mendenhall and Sincich (1992) and are a subset of the data produced by the Federal Trade Commission. The data set was submitted by Lauren McIntyre, Department of Statistics, North Carolina State University.
Brand    Tar (mg)    Nicotine (mg)    Weight (g)    Carbon Monoxide (mg)
Alpine
Benson & Hedges
Bull Durham
Camel Lights
Carlton
Chesterfield
Golden Lights
Kent
Kool
L&M
Lark Lights
Marlboro
Merit
Multi-Filter
Newport Lights
Now
Old Gold
Pall Mall Light
Raleigh
Salem Ultra
Tareyton
True
Viceroy Rich Light
Virginia Slims
Winston Lights
Statistical data can be categorized into the following groups.
Statistical Data
Categorical Data
Can be divided into: Nominal Data, Ordinal Data
Also called: Non-metric Data, Qualitative Data, Nonparametric Data, Attribute Data
Continuous Data
Can be divided into: Interval Data, Ratio Data
Also called: Metric Data, Quantitative Data, Parametric Data, Variable Data
Nominal Data is data that can be categorized, but cannot be ranked by intensity or magnitude. Examples of nominal data include political parties, religions, and favorite flavors of ice cream. Ordinal Data is data that can be categorized and ranked by class, but whose magnitude cannot be measured. For example, ordinal data can be rated on a scale such as 'Excellent-Good-Fair-Poor-Bad.' Interval Data is data that can be categorized and ranked, and whose magnitude can be measured. For example, student Grade Point Averages and SAT scores can be measured, ranked, and compared across groups defined by the age, gender, or nationality of the student. Ratio Data is data that can be categorized and ranked, whose magnitude can be measured, and for which a score of zero is a valid score that represents the total absence of the trait being measured. For example, a person's height, or a temperature measured in kelvins, can be used in ratio data calculations.
A frequency distribution measures how often each value occurs across a given sample. A chart or table showing how often each value or range of values of a variable appears in a data set is considered a frequency distribution. For example, the number of accidents occurring within the population of teenage drivers would create a frequency distribution. Central tendency is a measure of the location of the middle or center of a distribution. The mean, or average value, is the most commonly used measure of central tendency. Calculated from the cigarette data above, the mean tar content of a cigarette is 12.216 milligrams.
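As a rough illustration of these two ideas, the short Python sketch below builds a frequency distribution and a mean. The readings in `tar_mg` are hypothetical values chosen for the example, not the Federal Trade Commission figures, and the variable names are assumptions made here.

```python
from collections import Counter
from statistics import mean

# Hypothetical tar readings in mg (illustrative only, not the FTC values).
tar_mg = [8, 12, 12, 15, 15, 15, 16, 17]

# Frequency distribution: how often each value appears in the sample.
frequency = Counter(tar_mg)
for value in sorted(frequency):
    print(f"{value:>2} mg: {'*' * frequency[value]}")

# Central tendency: the arithmetic mean of the readings.
tar_mean = mean(tar_mg)
print(f"mean tar = {tar_mean} mg")
```

The printed rows form a simple text histogram, which is one way of showing a frequency distribution as a chart.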
A weighted average is a measure that gives some observations more weight than others, for example according to the share of the population each one represents. Returning to our teenage driver example, an average measure of teen accidents per 1000 teen drivers may produce a general figure. A measurement that could be applied more accurately to all teens would use weighted averages that take into account factors such as drugs and alcohol, or the number of passengers in the vehicle, and how these factors weight the occurrence of accidents among teen drivers. From a weighted average computation of this type, probability distributions could be plotted regarding the likelihood of a teen accident, based on the additional factors.
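One way to picture a weighted average is the sketch below, which combines accident rates for subgroups of drivers and weights each rate by the size of its subgroup. The rates, the subgroup sizes, and the helper name `weighted_mean` are all hypothetical and were invented for this example.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical accident rates per 1000 teen drivers, by number of passengers.
rates = [20.0, 50.0, 80.0]      # no passengers, one passenger, two or more
drivers = [6000, 3000, 1000]    # hypothetical number of drivers in each subgroup

overall_rate = weighted_mean(rates, drivers)   # weighted by subgroup size
simple_rate = sum(rates) / len(rates)          # treats every subgroup as equal
print(overall_rate, simple_rate)
```

With these made-up figures the weighted rate is 35 per 1000 and the simple average is 50 per 1000. The gap appears because the largest subgroup has the lowest rate.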
A normal distribution for a data set takes the form of a bell-shaped curve, often just called the bell curve, in which most scores accumulate around the middle of the measurement range. The mean, median and mode are all equal in this type of distribution, and the scores at either end of the distribution, those which are extremely high or extremely low, occur less often. For example, a curve representing the results of an intelligence test would have the highest number of people in the middle, measuring within the 'average' intelligence range. The number of people decreases as the scores get farther away on either side of the average, thus creating a bell-shaped curve. Once the parameters are defined, the sampling process is used to test a hypothesis, or to determine the actual frequency of a given behavior or event.
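The shape described above can be checked numerically with the `NormalDist` class from Python's standard `statistics` module. The scale used here (mean 100, standard deviation 15) is an assumption made for the example, not a figure taken from this paper.

```python
from statistics import NormalDist

# Hypothetical test scale: mean 100, standard deviation 15.
scores = NormalDist(mu=100, sigma=15)

# For a normal distribution the mean, median and mode coincide.
print(scores.mean, scores.median, scores.mode)

# Share of scores within one and two standard deviations of the mean.
within_one_sd = scores.cdf(115) - scores.cdf(85)
within_two_sd = scores.cdf(130) - scores.cdf(70)

# Share of scores more than three standard deviations above the mean.
above_145 = 1 - scores.cdf(145)

print(f"within 1 SD: {within_one_sd:.3f}")
print(f"within 2 SD: {within_two_sd:.3f}")
print(f"above 145:   {above_145:.4f}")
```

Roughly 68% of scores fall within one standard deviation of the mean and roughly 95% within two, so extreme scores at either end are rare.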
A confidence interval is developed from a sample estimate and gives a range of values (an interval) that is expected to contain the true value of an unknown parameter. Each confidence interval is stated with a specific level of confidence, or probability, that the estimate will be correct (i.e. that an interval constructed in this way will in fact contain the true value of the parameter).
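A minimal sketch of an interval for a mean follows. It uses the normal approximation (a critical value of about 1.96 for 95% confidence), which is a simplifying assumption. For small samples a t-distribution would give a somewhat wider interval, but it is not available in the Python standard library. The sample values and the function name `mean_confidence_interval` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    center = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * standard_error, center + z * standard_error

# Hypothetical measurements (illustrative only).
sample = [10.0, 12.0, 14.0, 16.0, 18.0]
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

Raising the `confidence` argument widens the interval, and a larger sample narrows it.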
Regarding calculating and interpreting data, the mean of the tar quantities in the above cigarette data is 12.216 mg, the variance is 113, and the standard deviation is 10.64.
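The standard deviation is the square root of the variance, so any reported pair can be checked against each other. The sketch below shows this on hypothetical readings (not the FTC values) using Python's standard `statistics` module. It also shows the difference between population and sample versions of each measure.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance, stdev, variance

# Hypothetical readings in mg (illustrative only, not the FTC values).
readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

center = mean(readings)
population_variance = pvariance(readings)   # divides by n
population_sd = pstdev(readings)
sample_variance = variance(readings)        # divides by n - 1
sample_sd = stdev(readings)

print("mean:", center)
print("population variance and SD:", population_variance, population_sd)
print("sample variance and SD:", sample_variance, sample_sd)

# In both cases the standard deviation is the square root of the variance.
print(abs(sample_sd - sqrt(sample_variance)) < 1e-12)
```

The same check applies to the figures quoted above: 10.64 squared is about 113.2, which agrees with the stated variance of 113 after rounding.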
The chi-squared test of association allows two attributes in a sample of data to be compared in order to determine whether there is any relationship between them. The purpose behind this test is to compare the observed frequencies with the frequencies that would be expected if a hypothesis of no association (statistical independence) were true. By assuming the variables are independent, we can also predict an expected frequency for each cell in the contingency table. Because the test works on counts within categories, continuous measurements such as these would first have to be grouped into classes (for example, low and high). For the cigarette data above, the null hypothesis is that there is no relationship between the tar and nicotine quantities in any given cigarette. Since these are separate components of the tobacco, we could hypothesize that the quantities would not be related. However, by making a paired data set out of the tar and nicotine quantities for each brand measured, and plotting them, we come up with the following graph.
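The expected-frequency calculation behind the chi-squared test can be sketched as follows. The 2 x 2 table of counts is hypothetical: it assumes the brands have been split into low and high tar and low and high nicotine, and the counts are invented for illustration. The function name is also an assumption, and no continuity correction is applied.

```python
def chi_squared_statistic(table):
    """Return (statistic, degrees_of_freedom, expected) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

# Hypothetical counts of brands: rows = low/high tar, columns = low/high nicotine.
observed = [[10, 2],
            [3, 10]]
statistic, dof, expected = chi_squared_statistic(observed)

# Critical value of chi-squared for 1 degree of freedom at the 5% level.
CRITICAL_5_PERCENT_DF1 = 3.841
reject_independence = statistic > CRITICAL_5_PERCENT_DF1
print(f"chi-squared = {statistic:.3f}, dof = {dof}, reject = {reject_independence}")
```

With these made-up counts the statistic is well above the critical value, so the hypothesis of no association would be rejected at the 5% level.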
Similar results are obtained by plotting the tar vs. carbon monoxide content, and the nicotine vs. carbon monoxide content.
A linear regression analysis of this relationship finds that the slope of the line is close to 0.5, and that there is a direct linear relationship between the amount of tar in a cigarette and the amount of nicotine.
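For readers who want to see how such a slope is obtained, here is a minimal ordinary least squares sketch. The paired values are hypothetical, chosen so that the fitted slope comes out at 0.5. They are not the tar and nicotine measurements, and `least_squares_line` is a name made up for this example.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical paired measurements (illustrative only, not the FTC values).
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
slope, intercept = least_squares_line(x, y)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

A positive slope means the two quantities rise together. The slope itself gives the average change in y for each one-unit increase in x.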
Nonlinear trends in statistical data can be the most challenging to work with. When nonlinear relationships exist, there may be a mathematical relationship based on a logarithm, or on the influence of several factors. However, truly nonlinear data, such as the height and weight of a specific person who shops in a given department store, may leave the statistician without any usable relationship whatsoever. Nonlinear data can also be the result of data that is being acted on by an artificial, outside force. In this case, the statistician is able to verify the existence of an outside force, and then approach the process of identifying the force.
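Where a nonlinear trend follows a logarithmic or exponential rule, a common approach is to transform the data so that it becomes linear. The sketch below assumes a relationship of the form y = a * exp(b * x) and fits a straight line to the logarithm of y. The data are synthetic, generated from known values of a and b so that the result can be checked, and `fit_exponential` is a hypothetical helper name.

```python
from math import exp, log
from statistics import mean

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on (x, ln y); requires every y > 0."""
    log_ys = [log(y) for y in ys]
    x_bar, l_bar = mean(xs), mean(log_ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxl = sum((x - x_bar) * (l - l_bar) for x, l in zip(xs, log_ys))
    b = sxl / sxx                   # slope of the line in log space
    a = exp(l_bar - b * x_bar)      # intercept, transformed back from log space
    return a, b

# Synthetic data generated from y = 2 * exp(0.3 * x), so the fit can be checked.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * exp(0.3 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Real measurements would not fit this exactly. The approach also only applies when the chosen functional form is a reasonable description of the data.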