SPSS Data Analysis
American Heart Association Prediction of Stroke Risks
Over a ten-year study, the American Heart Association collected data on age, blood pressure, and smoking status in order to estimate the risk of stroke within the sample population. In this study, risk is interpreted as the probability (times 100) that the patient will have a stroke over the next ten-year period. Smoking status is coded as a dummy variable: 1 indicates a smoker, and 0 indicates a nonsmoker.
Data Set
Risk | Age | Blood Pressure | Smoker
A. Using the data, develop an estimated regression equation that relates the risk of a stroke to the person's age, blood pressure, and whether the person is a smoker.
With three independent variables representing the individual's age, blood pressure, and whether or not they smoke, the analysis calls for a multiple linear regression. The dependent variable is the numeric risk value for each individual, modeled as a function of age, blood pressure, and smoking status. The general form of the equation is:
Y = a + b1*X1 + b2*X2 + b3*X3
where X1 is age, X2 is blood pressure, and X3 is the smoking dummy variable. Running the regression on the data set above gives a constant of -91.759 and unstandardized coefficients (column B of the Coefficients table in part B) of 1.077 for age, 0.252 for blood pressure, and 8.740 for smoking. With the constant and coefficients plugged in, the equation becomes:
Y = -91.759 + 1.077*X1 + 0.252*X2 + 8.740*X3
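The way a statistics package arrives at the constant and coefficients can be sketched in a few lines. The example below is only an illustration under stated assumptions: the rows are hypothetical placeholder values (the study's data table is not reproduced here), the helper name `fit_ols` is invented for this sketch, and the fit solves the ordinary least-squares normal equations directly rather than reproducing SPSS's internal routine.

```python
def fit_ols(rows, y):
    """Least-squares fit of y = a + b1*x1 + ... + bk*xk via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend a column of ones for the constant
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(xr[i] * xr[j] for xr in X) for j in range(k)]
        + [sum(xr[i] * yi for xr, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * p for v, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical (age, blood pressure, smoker) rows, NOT the study data.
rows = [(57, 152, 0), (67, 163, 0), (58, 155, 0), (86, 177, 1),
        (59, 196, 0), (76, 189, 1), (56, 155, 1), (78, 120, 0)]
# Hypothetical "true" equation used to generate noise-free risk values.
true_coefs = [-90.0, 1.0, 0.25, 9.0]
y = [true_coefs[0] + true_coefs[1] * a + true_coefs[2] * bp + true_coefs[3] * s
     for a, bp, s in rows]

coefs = fit_ols(rows, y)  # [constant, b_age, b_blood_pressure, b_smoker]
print([round(c, 3) for c in coefs])
```

Because the placeholder risk values are generated without noise, the fit should recover the generating constant and coefficients, which makes the sketch easy to check before pointing it at real data.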
B. Use the regression analysis tool to obtain complete diagnostics.
Variables Entered/Removed (b)
Model | Variables Entered | Variables Removed | Method
1 | smoker patient, blood pressure, patient age (years) (a) | | Enter
a. All requested variables entered.
b. Dependent Variable: Risk of stroke (%)
Model Summary (b)
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | .935 (a) | .873 | .850 | 5.75657
a. Predictors: (Constant), smoker patient, blood pressure, patient age (years)
b. Dependent Variable: Risk of stroke (%)
ANOVA (b)
Model 1 | Sum of Squares | df | Mean Square | F | Sig.
Regression | | 3 | | 36.823 | .000 (a)
Residual | | 16 | 33.138 | |
Total | | 19 | | |
a. Predictors: (Constant), smoker patient, blood pressure, patient age (years)
b. Dependent Variable: Risk of stroke (%)
Coefficients (a)
Model 1 | Unstandardized B | Unstandardized Std. Error | Standardized Beta | t | Sig.
(Constant) | -91.759 | 15.223 | | -6.028 | .000
patient age (years) | 1.077 | .166 | .697 | 6.488 | .000
blood pressure | .252 | .045 | .553 | 5.568 | .000
smoker patient | 8.740 | 3.001 | .302 | 2.912 | .010
a. Dependent Variable: Risk of stroke (%)
Casewise Diagnostics (a)
Case Number | Std. Residual | Risk of stroke (%) | Predicted Value
a. Dependent Variable: Risk of stroke (%)
Residuals Statistics (a)
Statistic | Minimum | Maximum | Mean | Std. Deviation | N
Predicted Value | 4.4606 | 54.1511 | 26.9500 | 13.88058 | 20
Std. Predicted Value | -1.620 | 1.960 | .000 | 1.000 | 20
Standard Error of Predicted Value | 1.903 | 3.532 | 2.538 | .445 | 20
Adjusted Predicted Value | 4.8474 | 54.2600 | 26.8973 | 13.98313 | 20
Residual | -13.10645 | 8.55608 | .00000 | 5.28260 | 20
Std. Residual | -2.277 | 1.486 | .000 | .918 | 20
Stud. Residual | -2.418 | 1.678 | .004 | 1.016 | 20
Deleted Residual | -14.78714 | 10.90265 | .05268 | 6.48651 | 20
Stud. Deleted Residual | -2.940 | 1.790 | -.025 | 1.107 | 20
Mahal. Distance | 1.127 | 6.203 | 2.850 | 1.340 | 20
Cook's Distance | .000 | .193 | .057 | .070 | 20
Centered Leverage Value | .059 | .326 | .150 | .071 | 20
a. Dependent Variable: Risk of stroke (%)
Curve Fit
Case Processing Summary
Category | N
Total Cases | 20
Excluded Cases (a) | 0
Forecasted Cases | 0
Newly Created Cases | 0
a. Cases with a missing value in any variable are excluded from the analysis.
Variable Processing Summary
Count | patient age (years) (Dependent) | blood pressure (Dependent) | smoker patient (Dependent) | Risk of stroke (%) (Independent)
Number of Positive Values | 20 | 20 | 10 | 20
Number of Zeros | 0 | 0 | 10 | 0
Number of Negative Values | 0 | 0 | 0 | 0
Number of Missing Values, User-Missing | 0 | 0 | 0 | 0
Number of Missing Values, System-Missing | 0 | 0 | 0 | 0
Model Description
Model Name | MOD_1
Dependent Variable 1 | patient age (years)
Dependent Variable 2 | blood pressure
Dependent Variable 3 | smoker patient
Equation 1 | Linear
Independent Variable | Risk of stroke (%)
Constant | Included
Variable Whose Values Label Observations in Plots | Unspecified
Model Summary and Parameter Estimates
Dependent Variable: patient age (years)
(R Square through Sig. are the Model Summary; Constant and b1 are the Parameter Estimates.)
Equation | R Square | F | df1 | df2 | Sig. | Constant | b1
Linear | .423 | 13.186 | 1 | 18 | .002 | 58.104 | .421
The independent variable is Risk of stroke (%).
C. Is smoking a significant factor in the risk of a stroke? Explain. Use α = 0.05.
The regression analysis conducted above makes it possible to examine whether smoking is a significant factor in the risk of a stroke. The test looks only at the coefficient of the smoking variable in relation to the dependent numeric value of predicted risk of stroke, with age and blood pressure held constant. In the Coefficients table, the smoking variable has t = 2.912 with Sig. = .010, which is below α = 0.05.
The results therefore show a significant impact on risk if the individual smokes. Although the other independent variables, age and blood pressure, also play a part, smoking is associated with a significant increase in the predicted risk of a stroke among the individuals in the data set: an estimated 8.740 points for a smoker, other factors being equal. Thus, smoking can be regarded as a significant factor in an increased predicted risk of stroke.
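The significance check for the smoking coefficient can be reproduced by hand from the Coefficients table. The sketch below assumes the B and Std. Error values reported there for the smoker variable and compares the resulting t statistic with the two-tailed critical value for 16 degrees of freedom at α = 0.05, which is hard-coded from a standard t table rather than computed.

```python
# Values read from the smoker row of the Coefficients table.
b_smoker = 8.740   # unstandardized coefficient B
se_smoker = 3.001  # standard error of B
n, k = 20, 3       # number of observations and of predictors
df = n - k - 1     # residual degrees of freedom

t_stat = b_smoker / se_smoker

# Two-tailed critical t for alpha = 0.05 and df = 16 (from a t table).
T_CRIT_05_DF16 = 2.120

significant = abs(t_stat) > T_CRIT_05_DF16
print(f"t = {t_stat:.3f}, df = {df}, significant at 0.05: {significant}")
```

The computed ratio agrees with the t value SPSS reports for the smoker variable, so the same conclusion follows whether one reads the Sig. column or compares t against the critical value.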
D. What is the probability of a stroke over the next ten years for Thompson, a 68-year-old smoker who has a blood pressure of 175?
Unstandardized coefficients (B) from the SPSS regression analysis: constant -91.759, age 1.077, blood pressure .252, smoking 8.740.
The equation that computes the numeric risk value, Y = a + b1*X1 + b2*X2 + b3*X3, can now be applied to an individual who was not included in the original data set by plugging in the computed constant and coefficients along with that individual's values. The prediction uses the unstandardized coefficients (column B of the Coefficients table) rather than the standardized Beta values (.697, .553, .302), because the Betas describe variables rescaled to standard deviation units and cannot be combined with raw ages and blood pressure readings. Here X1 represents age, X2 represents blood pressure, and X3 represents smoking status, while Y is the dependent variable, the numeric risk value. Thus, for a 68-year-old smoker with a blood pressure of 175:
Y = -91.759 + 1.077 (age) + .252 (blood pressure) + 8.740 (smoking)
Y = -91.759 + 1.077 (68) + .252 (175) + 8.740 (1)
Y = -91.759 + 73.236 + 44.100 + 8.740
Y = 34.317
The risk level is therefore 34.317, which rounds to 34. Thompson has an estimated risk value of about 34 for having a stroke within the next ten years, meaning that the probability is estimated at roughly .34. His age, blood pressure, and smoking status all raise this estimate, and being a smoker alone adds 8.740 points relative to an otherwise identical nonsmoker.
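As a cross-check on the arithmetic, the prediction can be written as a small function. This is a sketch that assumes the unstandardized B values from the Coefficients table (constant -91.759, age 1.077, blood pressure .252, smoker 8.740); the function name `predicted_risk` is invented for illustration.

```python
def predicted_risk(age, blood_pressure, smoker):
    """Predicted ten-year stroke risk (probability times 100).

    Coefficients are the unstandardized B values from the Coefficients table;
    `smoker` is the dummy variable (1 = smoker, 0 = nonsmoker).
    """
    constant = -91.759
    b_age = 1.077
    b_blood_pressure = 0.252
    b_smoker = 8.740
    return (constant
            + b_age * age
            + b_blood_pressure * blood_pressure
            + b_smoker * smoker)


risk = predicted_risk(age=68, blood_pressure=175, smoker=1)
print(f"risk = {risk:.3f}, probability = {risk / 100:.2f}")
```

Calling the same function with `smoker=0` gives the estimate for an otherwise identical nonsmoker, which isolates the contribution of the smoking coefficient.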
Question 2
A. Fuel Additives and Mileage
Data Table
Sample a | Sample B
17.3 | 18.7
18.4 | 17.8
19.1 | 21.3
16.7 | 21
18.2 | 22.1
18.5 | 18.7
17.5 | 19.8
 | 20.7
 | 20.2
Data Rank
Rank a | Rank B
2 | 8.5
6 | 4
10 | 15
1 | 14
5 | 16
7 | 8.5
3 | 11
 | 13
 | 12
Data Set
Sum
Mean | 17.9571429 | 5.14285714
Variance | 0.6795238 | 2.2641071
Rank Sum
Rank Mean | 4.9 | 11.3
Combined Sum
Combined Median Rank | 8.1
Two separate fuel additives were tested in order to compare their effect on the mileage of the cars. One sample included seven cars, and the other nine cars. Their miles per gallon can then be used to determine whether there is a significant difference between the two additives. The two sets of data represent comparable observations with measurable central tendencies: both record the mileage of vehicles used to sample the fuel additives in question. Additionally, the two samples are independent of each other, and the observations within each sample are also independent of one another. Thus, the Mann-Whitney test is a viable option for comparing the two sets of data from the two different and independent fuel additives. The Mann-Whitney test shows how one sample fares in comparison to another by ranking all observations together, where the variances are assumed to be equal across both sample groups.
Thus, the following equation can be used to compute the Mann-Whitney test statistic:
$U_A = n_a n_b + \frac{n_a (n_a + 1)}{2} - T_A$
where
$n_a = 7$ and $n_b = 9$ are the two sample sizes (also used to look up the critical values for U),
$T_A$ = the sum of the ranks of Sample a, and
$n_a n_b + \frac{n_a (n_a + 1)}{2}$ = the maximum possible value of $T_A$.
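The ranking and the U computation can be traced step by step in code. The sketch below assumes the sixteen mileage values from the data table are split into Sample a (seven cars) and Sample B (nine cars) by reading the two columns alternately; that split is an inference from the flattened table, though it reproduces the reported Sample a mean of 17.957. Tied values receive the average of their ranks, and the helper name `average_ranks` is invented for this example.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


# Assumed split of the data table into the two samples.
sample_a = [17.3, 18.4, 19.1, 16.7, 18.2, 18.5, 17.5]
sample_b = [18.7, 17.8, 21.3, 21.0, 22.1, 18.7, 19.8, 20.7, 20.2]

ranks = average_ranks(sample_a + sample_b)
n_a, n_b = len(sample_a), len(sample_b)

t_a = sum(ranks[:n_a])                       # sum of the ranks of Sample a
u_a = n_a * n_b + n_a * (n_a + 1) / 2 - t_a  # formula given above
u_b = n_a * n_b - u_a                        # U_A + U_B always equals n_a * n_b
u = min(u_a, u_b)                            # the smaller value is reported as U

print(f"T_A = {t_a}, U_A = {u_a}, U_B = {u_b}, U = {u}")
```

Under that assumed split, the rank sum for Sample a and the smaller U value agree with the Sum of Ranks and Mann-Whitney U figures in the SPSS output.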
With these values, the following computations were made, including the value of U and the one-tailed and two-tailed probabilities P(1) and P(2), which can then be analyzed to show whether there is a significant difference between the two additives in how they affect the mileage of the vehicles they are used in.
Ranks
Variable | fuel additives (per m) | N | Mean Rank | Sum of Ranks
gas mileage | additive 1 | 7 | 4.86 | 34.00
 | additive 2 | 9 | 11.33 |
 | Total | 16 | |
Test Statistics (b)
Statistic | gas mileage
Mann-Whitney U | 6.000
Wilcoxon W | 34.000
Z | -2.701
Asymp. Sig. (2-tailed) | .007
Exact Sig. [2*(1-tailed Sig.)] | .005 (a)
a. Not corrected for ties.
b. Grouping Variable: fuel additives (per m)
P(1) = 0.004
P(2) = 0.008
Both P values (0.004 and 0.008) are very small, falling well below the conventional 0.05 level, so there is a significant difference between the two sample groups. The rank data point the same way: the mean rank of Sample B (11.33) is far above that of Sample a (4.86). Through the Mann-Whitney test, it is clear that the vehicles in Sample B achieved higher miles per gallon than the vehicles tested in Sample a. This significant difference can be interpreted to mean that the fuel additive used in Sample B is more effective at increasing mileage in its test vehicles.
B. Exercise and Calories Burnt
Data Table
Swimming | Tennis | Cycling
Data Rank
Rank a | Rank B | Rank C
8 | 9 | 5
4 | 14 | 1
11 | 13 | 3
6 | 10 | 7
12 | 15 | 2
Data Set
Sum | 2040
Mean
Variance
Rank Sum | 41 | 61 | 18
Rank Mean | 8.2 | 12.2 | 3.6
Combined Sum
Combined Median of Ranks | 8
Three separate exercises were observed three times a week for forty minutes each session. The data show the number of calories burnt by each activity within that regimen of forty-minute workouts three days a week. Using the Kruskal-Wallis test, the data can help determine whether there is a significant difference in calories burnt among the three activities. The test requires one nominal (grouping) variable and one measurement variable. In the context of this analysis, the ranked data are the set being computed. The test also depends on the k samples being random and independent, each drawn from a larger population. Because the test works on ranks, the populations are not required to be normally distributed, but they are expected to have similarly shaped distributions with similar variances. The equations for the analysis are as follows, with α = 0.05.
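$SS_{bg(R)} = \sum_{g} n_g (\bar{R}_g - \bar{R})^2$, where $n_g$ is the size of group g, $\bar{R}_g$ is the mean rank of the group, and $\bar{R}$ is the combined mean rank
$H = \frac{SS_{bg(R)}}{N(N+1)/12}$, where N is the total number of observations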
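The H statistic can be worked through directly from the rank sums reported in the data set summary (41, 61, and 18, with five observations per activity). The sketch below assumes those sums belong to swimming, tennis, and cycling in that order, which matches the mean ranks in the SPSS output. For the P value it uses the fact that, with df = 2, the upper-tail probability of the chi-square distribution has the closed form exp(-H/2); that shortcut applies only when df = 2.

```python
import math

# Rank sums per activity, as reported in the data set summary.
rank_sums = {"swimming": 41, "tennis": 61, "cycling": 18}
n_per_group = 5                       # observations in each activity
N = n_per_group * len(rank_sums)      # total observations (15)
combined_mean_rank = (N + 1) / 2      # mean of the ranks 1..N

# Between-groups sum of squared deviations of the mean ranks.
ss_bg = sum(
    n_per_group * (total / n_per_group - combined_mean_rank) ** 2
    for total in rank_sums.values()
)

h = ss_bg / (N * (N + 1) / 12)
df = len(rank_sums) - 1

# Chi-square upper-tail probability; closed form valid for df = 2 only.
p = math.exp(-h / 2)

print(f"SS_bg(R) = {ss_bg:.1f}, H = {h:.2f}, df = {df}, P = {p:.4f}")
```

The mean ranks enter the formula as squared deviations from the combined mean rank, so a group far below the mean (here the one with rank sum 18) contributes as much to H as a group equally far above it.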
Ranks
Variable | Activities | N | Mean Rank
calories burned | swimming | 5 | 8.20
 | Tennis | 5 | 12.20
 | cycling | 5 | 3.60
 | Total | 15 |
Test Statistics (a, b)
Statistic | calories burned
Chi-Square | 9.260
df | 2
Asymp. Sig. | .010
a. Kruskal Wallis Test
b. Grouping Variable: Activities
H = 9.26
df = 2
P = 0.0098
Within this data set, each sample has five observations, the minimum size at which the distribution of H is closely approximated by the chi-square distribution with df = k - 1. With H = 9.26 and P = 0.0098, which is below α = 0.05, the analysis shows that at least one sample population differs significantly from the others. The mean ranks suggest that Cycling is the activity that differs: it burns markedly fewer calories than swimming and tennis. Swimming and Tennis are much closer to each other, with less of a difference between them in the amount of calories burned within the workout regimen. Based on the mean ranks, however, Tennis burns the most calories of those two activities.
Question 3
Quality of Inpatient Treatment
In this data set, forty patients represent the sample used to determine the relationship between the frequency of visitation and the perceived quality of care, based on the opinion of the patient. The patients were divided into visitor categories, in which 1=frequent, 2=occasional, and 3=rare. Treatment was then rated on a scale of 1=good, 2=fair, and 3=poor. A Chi-square test was then performed on the data set to determine whether there was a significant relationship between the number of visits and the perceived quality of care within the given set of surveyed patients.