PUNE.
Analysis of Australian
athletes data set
Submitted by:
Name- Namdev Dnyandev Dongare
PRN- 1616
Subject- ST-206 (Multivariate Data Analysis)
Description Of The Data
These data were collected in a study of how various characteristics of the blood
varied with the sport, body size, and sex of the athlete. The dataset has 202
observations and 13 variables.
Data Source:
1. Website: https://vincentarelbundock.github.io/Rdatasets/datasets.html
2. R-Software
The Actual Data
Variable    N  N*    Mean  SE Mean   StDev  Minimum      Q1  Median      Q3  Maximum
Wcc       202   0   7.109    0.127   1.800    3.300   5.900   6.850   8.300   14.300
Ferr      202   0   76.88     3.34   47.50     8.00   41.00   65.50   97.50   234.00
Bmi       202   0  22.956    0.202   2.864   16.750  21.063  22.720  24.480   34.420
Ssf       202   0   69.02     2.29   32.57    28.00   43.80   58.60   90.55   200.80
pcBfat    202   0  13.507    0.436   6.190    5.630   8.532  11.650  18.095   35.520
Lbm       202   0  64.874    0.920  13.070   34.360  54.615  63.035  75.000  106.000
1. To reduce the dimensions of the data so it could be easily plotted and some patterns in
the data could be recognized.
2. Factor Analysis :
We will see if we can reduce the dimensions of the data and identify the factors
that might explain the dimensions of the data variability.
3. Cluster Analysis :
To classify observations into groups that are initially not known.
[1] 326.443
[1] 14.07809
[1] 1.959964
Hence, we reject H0 at the 5% level of significance.
The calculated values of Kappa 1 and Kappa 2 are both higher than their respective table
values. Hence we conclude that the data are not from a multivariate normal population.
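The statistics called Kappa 1 and Kappa 2 are not defined in this report. The following is a minimal sketch, assuming they are Mardia's multivariate skewness and kurtosis statistics. The function name mardia_test and the simulated matrix X are illustrative only; X is not the athletes data.

```r
# Sketch of Mardia's multivariate normality test (ASSUMED to be the test used above).
mardia_test <- function(X, alpha = 0.05) {
  X  <- as.matrix(X)
  n  <- nrow(X)
  p  <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)   # centre each column
  S  <- crossprod(Xc) / n                        # ML estimate of the covariance matrix
  G  <- Xc %*% solve(S) %*% t(Xc)                # g_ij = (x_i - m)' S^-1 (x_j - m)
  b1 <- sum(G^3) / n^2                           # multivariate skewness
  b2 <- sum(diag(G)^2) / n                       # multivariate kurtosis
  kappa1 <- n * b1 / 6                           # approx. chi-square under H0
  kappa2 <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)   # approx. N(0, 1) under H0
  crit1  <- qchisq(1 - alpha, df = p * (p + 1) * (p + 2) / 6)
  crit2  <- qnorm(1 - alpha / 2)
  list(kappa1 = kappa1, crit1 = crit1, kappa2 = kappa2, crit2 = crit2,
       reject = (kappa1 > crit1) || (abs(kappa2) > crit2))
}

set.seed(1)
X <- matrix(rexp(200 * 3), ncol = 3)   # simulated, clearly non-normal stand-in data
res <- mardia_test(X)
print(res)
```

With alpha = 0.05 the kurtosis table value is qnorm(0.975) = 1.959964, which matches the third value printed in the output above.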
Q-Q Plot
[Figure: Q-Q plot; x-axis: samq, y-axis: popq]
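A Q-Q plot of this kind can be sketched as follows, assuming that samq denotes the ordered squared Mahalanobis distances of the sample and popq the matching chi-square quantiles. The report does not define these names, so that reading, the function name chisq_qq and the simulated data are assumptions.

```r
# Chi-square Q-Q plot: ordered squared Mahalanobis distances vs chi-square quantiles.
chisq_qq <- function(X) {
  X  <- as.matrix(X)
  n  <- nrow(X)
  p  <- ncol(X)
  d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
  data.frame(samq = sort(d2),                                   # sample quantiles
             popq = qchisq((seq_len(n) - 0.5) / n, df = p))     # population quantiles
}

set.seed(2)
X <- matrix(rnorm(100 * 4), ncol = 4)   # simulated stand-in, NOT the athletes data
q <- chisq_qq(X)
plot(q$samq, q$popq, xlab = "samq", ylab = "popq", main = "Q-Q Plot")
abline(0, 1)   # points close to this line are consistent with multivariate normality
```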
Factor Analysis
For carrying out factor analysis, we consider the variables rcc, wcc, hc, hg, ferr, bmi,
ssf, pcBfat, lbm, ht, wt. Because these variables are measured in different units, we have
used the correlation matrix for the factor analysis.
> cortest.bartlett(D)
R was not square, finding R from data
$chisq
[1] 3884.211
$p.value
[1] 0
$df
[1] 55
Here the p-value is less than 0.05, so we reject H0; i.e., there is a statistically significant
interrelationship between the variables of the given data. This test tells us that at least some
of the variables are correlated with each other. Therefore, we can use factor analysis.
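The statistic behind Bartlett's test of sphericity can be reproduced in base R. The sketch below uses the usual formula, chi-square = -(n - 1 - (2p + 5)/6) log|R| with p(p - 1)/2 degrees of freedom; the function name bartlett_sphericity and the simulated data are illustrative assumptions.

```r
# Bartlett's test of sphericity computed directly from the correlation matrix.
bartlett_sphericity <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  R <- cor(X)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df    <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

set.seed(3)
z <- rnorm(150)
X <- cbind(a = z + rnorm(150, sd = 0.5),   # a and b share the common component z
           b = z + rnorm(150, sd = 0.5),
           c = rnorm(150))                 # simulated stand-in, NOT the athletes data
res <- bartlett_sphericity(X)
print(res)
```

For the 11 variables used here the degrees of freedom are 11 * 10 / 2 = 55, which agrees with the $df value in the output above.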
Scree Plot:
> scree(D)
[Figure: Scree plot of the eigenvalues of factors (FA) and components (PC); y-axis: Eigen values of factors and components]
The scree plot suggests that the optimal number of factors is three.
1) In the 1st factor, the weight of the variable wcc is very low compared to the other
variables, i.e., the contribution of wcc to the 1st factor is small.
2) Similarly, in the 2nd factor the variable ferr contributes very little, whereas the
variable ssf contributes more than the other variables.
3) From the communalities it is clear that about 96% of the information in the variable
lbm is explained by the 3 factors, whereas the 3 factors explain only about 25% of the
information in the variable ferr.
4) Also, the 3 factors together explain most of the total variation in the data (79.1%).
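Loadings, communalities and the proportion of variance explained can be computed as in the sketch below. It assumes principal component extraction from the correlation matrix without rotation; the report does not state the extraction method. The function name pc_factors and the simulated data are hypothetical.

```r
# Principal component factor extraction from the correlation matrix (ASSUMED method).
pc_factors <- function(X, m) {
  R <- cor(as.matrix(X))
  e <- eigen(R, symmetric = TRUE)
  # loadings = eigenvectors scaled by the square roots of their eigenvalues
  L <- e$vectors[, 1:m, drop = FALSE] %*% diag(sqrt(e$values[1:m]), nrow = m)
  dimnames(L) <- list(colnames(R), paste0("Factor", 1:m))
  list(loadings    = L,
       communality = rowSums(L^2),                    # share of each variable explained
       prop_var    = sum(e$values[1:m]) / ncol(R))    # share of total variance explained
}

set.seed(4)
f <- matrix(rnorm(120 * 2), ncol = 2)                 # two simulated latent factors
X <- cbind(f[, 1] + rnorm(120, sd = 0.4), f[, 1] + rnorm(120, sd = 0.4),
           f[, 2] + rnorm(120, sd = 0.4), f[, 2] + rnorm(120, sd = 0.4),
           rnorm(120))                                # stand-in, NOT the athletes data
colnames(X) <- paste0("v", 1:5)
res <- pc_factors(X, m = 2)
print(round(res$loadings, 3))
print(round(res$communality, 3))
print(round(res$prop_var, 3))
```

A communality near 1 (as reported for lbm) means the retained factors reproduce that variable almost fully, while a low value (as reported for ferr) means most of its variation is left unexplained.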
For our data, the 1st factor is computed from the original data using the
coefficients listed under Factor 1:
Factor 1 = 0.168 rcc + 0.034 wcc + 0.174 hc + 0.176 hg + 0.081 ferr + 0.115 bmi - 0.079
ssf - 0.106 pcBfat + 0.179 lbm + 0.132 ht + 0.151 wt
Note that in Factor 1 the coefficients of rcc, hc and hg are high compared to the other
variables. Also note that hemoglobin and hematocrit are parameters of red blood
cells. If these are in the normal range, then the fitness of the human body is good. Thus we
can label the 1st factor as Fitness of Body.
Factor 2 = -0.099 rcc + 0.092 wcc - 0.106 hc - 0.093 hg + 0.025 ferr + 0.265 bmi + 0.329
ssf + 0.296 pcBfat + 0.116 lbm + 0.124 ht + 0.240 wt
Factor 3 = 0.218 rcc + 0.592 wcc + 0.207 hc + 0.203 hg + 0.254 ferr + 0.065 bmi + 0.166
ssf + 0.177 pcBfat - 0.246 lbm - 0.398 ht - 0.186 wt
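Factor scores of this form are linear combinations of the variables. The sketch below shows one way such coefficients and scores can be obtained, assuming standardised variables and unrotated principal component extraction, neither of which the report confirms. The function name factor_scores and the simulated data are hypothetical.

```r
# Factor score coefficients and scores under ASSUMED principal component extraction.
factor_scores <- function(X, m) {
  Z <- scale(as.matrix(X))                       # standardise each variable
  e <- eigen(cor(Z), symmetric = TRUE)
  # coefficient matrix: eigenvectors divided by the square roots of their eigenvalues
  B <- e$vectors[, 1:m, drop = FALSE] %*% diag(1 / sqrt(e$values[1:m]), nrow = m)
  rownames(B) <- colnames(Z)
  list(coef = B, scores = Z %*% B)               # one row of scores per observation
}

set.seed(5)
X <- matrix(rnorm(100 * 4), ncol = 4) %*% matrix(runif(16, -1, 1), ncol = 4)
colnames(X) <- paste0("v", 1:4)                  # simulated stand-in, NOT the athletes data
fs <- factor_scores(X, m = 2)
print(round(fs$coef, 3))
print(round(cov(fs$scores), 3))                  # scores are uncorrelated with unit variance
```

The sign of each column of coefficients is arbitrary, so a factor may appear with all signs flipped depending on the software.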
Scree Plot:
[Figure: Scree Plot of rcc, ..., wt; x-axis: Factor Number, y-axis: Eigenvalue]
For our data, we can conclude that the first three factors account for most of the total
variability in the data. The remaining factors account for a very small proportion of the variability.
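The share of variability attributed to each factor follows from the eigenvalues of the correlation matrix. The sketch below tabulates them and draws a scree plot; the simulated matrix X is a stand-in, not the athletes data.

```r
set.seed(6)
# Simulated correlated data with 6 variables (stand-in, NOT the athletes data)
X <- matrix(rnorm(100 * 6), ncol = 6) %*% matrix(runif(36, -1, 1), ncol = 6)

ev   <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
prop <- ev / sum(ev)            # sum(ev) equals the number of variables
tab  <- data.frame(factor = seq_along(ev), eigenvalue = ev,
                   proportion = prop, cumulative = cumsum(prop))
print(round(tab, 3))

plot(ev, type = "b", xlab = "Factor Number", ylab = "Eigenvalue", main = "Scree Plot")
abline(h = 1, lty = 2)          # eigenvalue-greater-than-one guide line
```

The cumulative column is the quantity reported as 79.1% for three factors in this analysis.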
Score Plot
The plot of the factor scores provides a check on the assumption of normality. The score
plot shows that the points are not randomly distributed around zero. Therefore, the data are not
normally distributed.
Loading Plot:
The loading plot provides information about the loadings of the first two factors. For our
data, the variables rcc, hg, lbm, hc have large positive loadings on Factor 1 and the variables
pcBfat, bmi, ssf have large positive loadings on Factor 2.
[Figure: Loading Plot of rcc, ..., wt; x-axis: First Factor, y-axis: Second Factor]
Cluster analysis
Using R-software…..
> TWSS<-c()
> for(i in 1:10)
+ {
+ K<-kmeans(D,i)
+ TWSS[i]<-K$tot.withinss
+ }
> TWSS
[1] 772163.8 461735.4 326456.4 246097.8 218087.2 193601.0
[7] 168152.3 162824.8 143664.8 134860.9
> plot(1:10,TWSS,type="b",xlab="Number of Clusters",ylab="Total Within cluster SS")
Here, we observe that the Total Within Sum of Squares (TWSS) goes on decreasing as we
increase the number of clusters. But beyond 4 clusters there is no substantial
decrease in the TWSS.
Hence increasing the number of clusters further will not help much. Therefore, we can say
that the optimal number of clusters is 4.
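Because kmeans starts from random centres, the TWSS values can change from run to run. A minimal sketch of a reproducible version of the elbow computation is given below, using set.seed and several random starts. The function name elbow_twss and the simulated data are assumptions and not part of the original analysis.

```r
# Total within-cluster sum of squares for k = 1, ..., k_max, with several random starts.
elbow_twss <- function(X, k_max = 10, nstart = 25) {
  sapply(seq_len(k_max),
         function(k) kmeans(X, centers = k, nstart = nstart)$tot.withinss)
}

set.seed(7)
X <- rbind(matrix(rnorm(60 * 2, mean = 0),  ncol = 2),
           matrix(rnorm(60 * 2, mean = 6),  ncol = 2),
           matrix(rnorm(60 * 2, mean = 12), ncol = 2))   # simulated: three groups

TWSS <- elbow_twss(X)
# Percentage drop in TWSS when moving from k to k + 1 clusters
drop_pct <- 100 * (TWSS[-length(TWSS)] - TWSS[-1]) / TWSS[-length(TWSS)]
print(round(drop_pct, 1))
plot(1:10, TWSS, type = "b", xlab = "Number of Clusters",
     ylab = "Total Within cluster SS")
```

Since kmeans uses Euclidean distance, variables with large variances dominate the result; one option, not used in the original analysis, is to cluster scale(D) instead of D.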
[Figure: Total Within cluster SS plotted against Number of Clusters]
Here we can see that the within-cluster sums of squares are smallest for cluster 2 and cluster 4.
This shows that the observations within each of these two clusters are more similar to each other,
whereas cluster 1 has the highest within-cluster sum of squares, meaning that the observations
in this cluster are less similar to each other.
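The within-cluster sums of squares and cluster sizes can be read from a kmeans fit, as in the sketch below on simulated stand-in data. The avg_ss column is an added illustration: a larger cluster tends to have a larger sum of squares, so a per-observation value can make the comparison fairer.

```r
set.seed(8)
X <- rbind(matrix(rnorm(40 * 3, mean = 0, sd = 1), ncol = 3),
           matrix(rnorm(40 * 3, mean = 8, sd = 3), ncol = 3))   # simulated stand-in data

K <- kmeans(X, centers = 2, nstart = 25)
summary_tab <- data.frame(cluster  = seq_along(K$size),
                          size     = K$size,                # observations per cluster
                          withinss = K$withinss,            # within-cluster sum of squares
                          avg_ss   = K$withinss / K$size)   # sum of squares per observation
print(summary_tab)
```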
Cluster Centroids
Variable  Cluster1  Cluster2  Cluster3  Cluster4  Grand centroid
Rcc          4.619     4.409     4.918     4.991           4.719
Wcc          6.945     7.933     7.017     7.705           7.109
Hc          42.211    40.060    45.044    45.221          43.092
Hg          14.195    13.587    15.309    15.458          14.566
Ferr        47.930    73.333    99.611   188.737          76.876
Bmi         22.111    25.020    23.515    24.811          22.956
Ssf         69.570   140.927    50.539    61.495          69.022
pcBfat      14.236    25.295     9.486    11.263          13.507
Lbm         61.230    59.164    71.392    72.716          64.874
Ht         178.857   178.093   182.641   181.963         180.104
Wt          71.322    79.307    79.039    82.274          75.008
Distances Between Cluster Centroids
The distance between cluster 1 and cluster 2 is the lowest, meaning that these clusters are closest
to each other, whereas cluster 1 and cluster 4 are farthest from each other.
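Cluster centroids, the grand centroid and the distances between centroids can be obtained from a kmeans fit as sketched below. The data are simulated and the variable names v1 and v2 are placeholders, so the numbers do not correspond to the athletes data.

```r
set.seed(9)
X <- rbind(matrix(rnorm(30 * 2, mean = 0),  ncol = 2),
           matrix(rnorm(30 * 2, mean = 5),  ncol = 2),
           matrix(rnorm(30 * 2, mean = 20), ncol = 2))   # simulated stand-in data
colnames(X) <- c("v1", "v2")

K <- kmeans(X, centers = 3, nstart = 25)
centroids <- rbind(K$centers, Grand = colMeans(X))   # cluster centroids plus grand centroid
print(round(t(centroids), 3))                        # variables in rows, clusters in columns

D_cent <- as.matrix(dist(K$centers))                 # Euclidean distances between centroids
print(round(D_cent, 3))
```

The smallest off-diagonal entry of D_cent identifies the closest pair of clusters and the largest identifies the pair that is farthest apart.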
Dendrogram
[Figure: Dendrogram, Complete Linkage, Euclidean Distance; x-axis: Observations, y-axis: Similarity]
The final partition for our data contains 4 clusters of the 202 observations, classified by
similarity.
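A complete-linkage hierarchical clustering with Euclidean distance, cut into 4 clusters, can be sketched in R as follows. The data are simulated stand-ins and the object names tree and groups are illustrative.

```r
set.seed(10)
X <- rbind(matrix(rnorm(25 * 2, mean = 0), ncol = 2),
           matrix(rnorm(25 * 2, mean = 6), ncol = 2))   # simulated stand-in data

# Complete linkage on Euclidean distances
tree <- hclust(dist(X, method = "euclidean"), method = "complete")
plot(tree, labels = FALSE, main = "Dendrogram",
     sub = "Complete Linkage, Euclidean Distance", xlab = "Observations")

groups <- cutree(tree, k = 4)   # final partition into 4 clusters
print(table(groups))            # number of observations in each cluster
```

R draws the vertical axis as merge height (distance), whereas the dendrogram described above uses a similarity scale, so the two pictures are oriented differently.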
Conclusion:
1. According to the multivariate normality test, our data are not normally
distributed.
…………thank you……………