
Department of Statistics,

Savitribai Phule Pune University,

PUNE.

Analysis of Australian
athletes data set
Submitted by:
Name- Namdev Dnyandev Dongare
PRN- 1616
Subject- ST-206 (Multivariate Data Analysis)
Description Of The Data
These data were collected in a study of how various characteristics of the
blood vary with the sport, body size, and sex of the athlete. The dataset has 202
observations and 13 variables, namely:

rcc- red blood cell count, in 10^12 per liter
wcc- white blood cell count, per liter
hc- hematocrit, percent
hg- hemoglobin concentration, in g per deciliter
ferr- plasma ferritins, ng
bmi- body mass index, kg/m^2
ssf- sum of skin folds
pcBfat- percent body fat
lbm- lean body mass, kg
ht- height, cm
wt- weight, kg
sex- a factor with levels f, m (f = Female, m = Male)
sport- a factor with levels B_Ball, Field, Gym, Netball, Row, Swim, T_400m, T_Sprnt, Tennis, W_Polo

Data Source:

Websites: https://vincentarelbundock.github.io/Rdatasets/datasets.html

Statistical Software used :


1. Minitab 18

2. R-Software
The Actual Data

     rcc   wcc  hc    hg    ferr  bmi    ssf    pcBfat  lbm    ht     wt    sex  sport
1    3.96  7.5  37.5  12.3  60    20.56  109.1  19.75   63.32  195.9  78.9  F    B_Ball
2    4.41  8.3  38.2  12.7  68    20.67  102.8  21.3    58.55  189.7  74.4  F    B_Ball
3    4.14  5    36.4  11.6  21    21.86  104.6  19.88   55.36  177.8  69.1  F    B_Ball
4    4.11  5.3  37.3  12.6  69    21.88  126.4  23.66   57.18  185    74.9  F    B_Ball
5    4.45  6.8  41.5  14    29    18.96  80.3   17.64   53.2   184.6  64.6  F    B_Ball

Some descriptive statistics :

Variable  N    N*  Mean     SE Mean  St Dev  Minimum  Q1       Median   Q3       Maximum
Rcc       202  0   4.7186   0.0322   0.4580  3.8000   4.3675   4.7550   5.0325   6.7200
Wcc       202  0   7.109    0.127    1.800   3.300    5.900    6.850    8.300    14.300
Hc        202  0   43.092   0.258    3.663   35.900   40.550   43.500   45.600   59.700
Hg        202  0   14.566   0.0959   1.362   11.600   13.500   14.700   15.600   19.200
Ferr      202  0   76.88    3.34     47.50   8.00     41.00    65.50    97.50    234.00
Bmi       202  0   22.956   0.202    2.864   16.750   21.063   22.720   24.480   34.420
Ssf       202  0   69.02    2.29     32.57   28.00    43.80    58.60    90.55    200.80
pcBfat    202  0   13.507   0.436    6.190   5.630    8.532    11.650   18.095   35.520
Lbm       202  0   64.874   0.920    13.070  34.360   54.615   63.035   75.000   106.000
Ht        202  0   180.10   0.685    9.73    148.90   174.00   179.70   186.22   209.40
Wt        202  0   75.008   0.980    13.925  37.800   66.475   74.400   84.325   123.200


The Objectives of the Analysis :
The multivariate analysis techniques are applied to the data with the following
objectives in mind.

1. To reduce the dimensions of the data so that they can be plotted easily and patterns in
the data can be recognized.

2. To reduce complexity and facilitate interpretation.

3. To classify the observations into groups.

The Techniques Applied :


1. Multivariate Normality Test:
To check whether the data are multivariate normal.

2. Factor Analysis:

To see whether we can reduce the dimensions of the data and identify the factors
that explain the variability in the data.

3. Cluster Analysis:
To classify observations into groups that are not known in advance.

Multivariate Normality Assumption:


Mardia's Skewness Measure:
> K_1 # k_1: Kappa_1
[1] 1304.757

> tabval # Table Value

[1] 326.443

Mardia's Kurtosis Measure:


> K_2 # Kappa_2

[1] 14.07809

> tabval ## Table Value

[1] 1.959964

Using R software, we get:

H0: The data come from a multivariate normal population.

Kappa_1 calculated = 1304.757

Chi-square table value = 326.443

Here, Kappa_1 calculated > chi-square table value.

Hence, we reject H0 at the 5% level of significance.

Similarly, Kappa_2 calculated = 14.07809 exceeds its table value of 1.959964.

The calculated values of Kappa_1 and Kappa_2 are both significantly higher than their
respective table values. Hence we conclude that the data are not from a multivariate normal
population.
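
The two statistics above can be computed in base R roughly as follows. This is a minimal sketch under stated assumptions, not the script used for the report: the function name mardia_test and the simulated matrix X_demo are hypothetical, and the covariance matrix uses the divisor n. With 11 variables the skewness test has 11*12*13/6 = 286 degrees of freedom, which gives the 5% table value 326.443 quoted above, and 1.959964 is the two-sided 5% normal quantile.

```r
# Sketch of Mardia's multivariate skewness and kurtosis tests in base R.
# Assumption: X is a numeric matrix (rows = observations) and the covariance
# matrix uses the divisor n.
mardia_test <- function(X, alpha = 0.05) {
  X  <- as.matrix(X)
  n  <- nrow(X)
  p  <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)  # centre each column
  S  <- crossprod(Xc) / n                       # covariance with divisor n
  G  <- Xc %*% solve(S) %*% t(Xc)               # g_ij = (x_i - m)' S^-1 (x_j - m)
  b1 <- sum(G^3) / n^2                          # multivariate skewness
  b2 <- mean(diag(G)^2)                         # multivariate kurtosis
  df <- p * (p + 1) * (p + 2) / 6
  list(
    kappa1     = n * b1 / 6,                    # approx. chi-square(df) under H0
    chisq_crit = qchisq(1 - alpha, df),
    kappa2     = (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n),  # approx. N(0, 1)
    z_crit     = qnorm(1 - alpha / 2),
    df         = df
  )
}

# Hypothetical stand-in data: 202 rows, 11 deliberately skewed columns.
set.seed(1)
X_demo <- matrix(rexp(202 * 11), nrow = 202, ncol = 11)
res <- mardia_test(X_demo)
reject_skewness <- res$kappa1 > res$chisq_crit
reject_kurtosis <- abs(res$kappa2) > res$z_crit
```

To apply it to the athletes data, pass the 11 numeric columns in place of X_demo.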

Q-Q Plot
[Figure: Q-Q plot of popq (vertical axis) against samq (horizontal axis)]

Factor Analysis
For the factor analysis, we consider the variables rcc, wcc, hc, hg, ferr, bmi,
ssf, pcBfat, lbm, ht and wt. We use the correlation matrix because the variables
are measured in different units.

Bartlett's Test of Sphericity :


For this test and the scree plot, we use the "psych" package.
H0: There is no statistically significant interrelationship
between the variables of the given data.
vs.
H1: There is a statistically significant interrelationship
between the variables of the given data.

> cortest.bartlett(D)
R was not square, finding R from data
$chisq
[1] 3884.211
$p.value
[1] 0
$df
[1] 55

Here the p-value (reported as 0) is less than 0.05, so we reject H0; i.e., there is a statistically
significant interrelationship between the variables of the given data. The test tells us that
some of the variables are correlated with each other. Therefore, we can use factor analysis.
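
For reference, the statistic behind this test can be written out in a few lines of base R. This is a hedged sketch of the usual formula, chi-square = -(n - 1 - (2p + 5)/6) * log|R| with p(p - 1)/2 degrees of freedom (55 for p = 11, as in the output above). The names bartlett_sphericity and X_demo are hypothetical, and the data are simulated, not the athletes data.

```r
# Sketch of Bartlett's test of sphericity in base R.
# Assumption: X holds only the numeric variables (rows = observations).
bartlett_sphericity <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  R <- cor(X)
  log_det <- as.numeric(determinant(R, logarithm = TRUE)$modulus)  # log|R|
  chisq   <- -(n - 1 - (2 * p + 5) / 6) * log_det
  df      <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Hypothetical stand-in data: 11 variables sharing one common component,
# so the correlation matrix is clearly not an identity matrix.
set.seed(2)
common <- rnorm(202)
X_demo <- sapply(1:11, function(j) common + rnorm(202))
bt <- bartlett_sphericity(X_demo)
```

A small p-value in bt$p.value leads to the same decision rule as above: reject H0 and proceed with factor analysis.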

Scree Plot:
> scree(D)
[Figure: Scree plot; eigenvalues of factors (FA) and components (PC) against factor or component number]

The scree plot suggests that the optimal number of factors is three.

output of the Factor Analysis:


WITHOUT ROTATION:

Principal Component Factor Analysis of the Correlation Matrix

Unrotated Factor Loadings and Communalities

Variable  Factor1  Factor2  Factor3  Communality
Rcc        0.837   -0.254    0.252    0.828
Wcc        0.170    0.234    0.686    0.554
Hc         0.870   -0.271    0.240    0.887
Hg         0.880   -0.237    0.235    0.886
Ferr       0.404    0.064    0.294    0.254
Bmi        0.574    0.678    0.075    0.795
Ssf       -0.395    0.841    0.193    0.900
pcBfat    -0.531    0.757    0.205    0.897
Lbm        0.894    0.296   -0.285    0.967
Ht         0.657    0.317   -0.461    0.745
Wt         0.755    0.613   -0.216    0.993

Variance   4.9910   2.5576   1.1574   8.7059
% Var      0.454    0.233    0.105    0.791

1) In the 1st factor, the loading of wcc (0.170) is very low compared to the other
variables, i.e., wcc contributes little to the 1st factor.

2) Similarly, in the 2nd factor, ferr contributes very little (0.064), whereas ssf
contributes more than any other variable (0.841).

3) From the communalities, about 97% of the information in the variable lbm is
explained by the 3 factors, whereas the 3 factors explain only about 25% of the
information in the variable ferr.

4) Also, the 3 factors together explain most of the total variation in the data (79.1%).
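
The communalities and variances in the table follow directly from the loadings: a communality is the sum of squared loadings across the factors for one variable, and a factor's variance is the sum of squared loadings down its column. The short R check below re-enters the reported loadings (rounded to three decimals, so results agree only up to rounding). The object names are illustrative.

```r
# Reported unrotated loadings, one row per variable (Factor1, Factor2, Factor3).
vars <- c("rcc", "wcc", "hc", "hg", "ferr", "bmi",
          "ssf", "pcBfat", "lbm", "ht", "wt")
loadings <- matrix(c(
   0.837, -0.254,  0.252,   # rcc
   0.170,  0.234,  0.686,   # wcc
   0.870, -0.271,  0.240,   # hc
   0.880, -0.237,  0.235,   # hg
   0.404,  0.064,  0.294,   # ferr
   0.574,  0.678,  0.075,   # bmi
  -0.395,  0.841,  0.193,   # ssf
  -0.531,  0.757,  0.205,   # pcBfat
   0.894,  0.296, -0.285,   # lbm
   0.657,  0.317, -0.461,   # ht
   0.755,  0.613, -0.216    # wt
), ncol = 3, byrow = TRUE,
dimnames = list(vars, c("Factor1", "Factor2", "Factor3")))

communality <- rowSums(loadings^2)          # per-variable share explained
variance    <- colSums(loadings^2)          # per-factor variance (eigenvalue)
prop_var    <- variance / nrow(loadings)    # proportion of total variance
total_prop  <- sum(prop_var)                # share explained by all 3 factors
```

Dividing by nrow(loadings) = 11 works because, with a correlation matrix, each standardized variable contributes one unit of variance.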

Factor Score Coefficients:

Variable  Factor1  Factor2  Factor3
Rcc        0.168   -0.099    0.218
Wcc        0.034    0.092    0.592
Hc         0.174   -0.106    0.207
Hg         0.176   -0.093    0.203
Ferr       0.081    0.025    0.254
Bmi        0.115    0.265    0.065
Ssf       -0.079    0.329    0.166
pcBfat    -0.106    0.296    0.177
Lbm        0.179    0.116   -0.246
Ht         0.132    0.124   -0.398
Wt         0.151    0.240   -0.186

For our data, the 1st factor is computed from the original data using the
coefficients listed under Factor 1:

Factor 1 = 0.168 rcc + 0.034 wcc + 0.174 hc + 0.176 hg + 0.081 ferr + 0.115 bmi - 0.079
ssf - 0.106 pcBfat + 0.179 lbm + 0.132 ht + 0.151 wt

Note that in Factor 1 the coefficients of rcc, hc and hg are among the highest (together
with lbm). Also note that hemoglobin and hematocrit are parameters of red blood
cells. If these are in the normal range, the fitness of the human body is good. Thus we
label the 1st factor as Fitness of Body.

Factor 2 = -0.099 rcc + 0.092 wcc - 0.106 hc - 0.093 hg + 0.025 ferr + 0.265 bmi + 0.329
ssf + 0.296 pcBfat + 0.116 lbm + 0.124 ht + 0.240 wt

Factor 3 = 0.218 rcc + 0.592 wcc + 0.207 hc + 0.203 hg + 0.254 ferr + 0.065 bmi + 0.166
ssf + 0.177 pcBfat - 0.246 lbm - 0.398 ht - 0.186 wt

In Factor 2 the coefficients of ssf, pcBfat, bmi and wt are high compared to the other
variables. Also note that adequate body mass index, sum of skin folds and percent body
fat are checked for health maintenance. Thus, we label the 2nd factor as Health
Maintenance.
In Factor 3, wcc has by far the largest coefficient (0.592); ferr, rcc, hc, hg, pcBfat and ssf
have moderate positive coefficients, while ht, lbm and wt have negative coefficients.
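
For an unrotated principal component solution, each score coefficient equals the loading divided by the factor's variance (eigenvalue). The R sketch below checks this against the reported Factor 1 column. The final step, scoring one athlete, is an assumption for illustration only: it supposes the coefficients are applied to standardized values (z-scores), and the athlete is hypothetical.

```r
# Reported Factor 1 loadings and eigenvalue from the unrotated solution.
vars <- c("rcc", "wcc", "hc", "hg", "ferr", "bmi",
          "ssf", "pcBfat", "lbm", "ht", "wt")
loading_f1 <- c(0.837, 0.170, 0.870, 0.880, 0.404, 0.574,
                -0.395, -0.531, 0.894, 0.657, 0.755)
eigen_f1 <- 4.9910

# Score coefficient = loading / eigenvalue.
coef_f1 <- setNames(loading_f1 / eigen_f1, vars)

# Coefficients as reported in the Factor Score Coefficients table.
reported_f1 <- c(0.168, 0.034, 0.174, 0.176, 0.081, 0.115,
                 -0.079, -0.106, 0.179, 0.132, 0.151)
max_diff <- max(abs(coef_f1 - reported_f1))   # rounding error only

# Hypothetical athlete: one standard deviation above the mean on rcc,
# exactly average on everything else (assumed z-score inputs).
z <- setNames(rep(0, length(vars)), vars)
z["rcc"] <- 1
score_f1 <- sum(coef_f1 * z)
```

Here max_diff stays below 0.001, so the reported coefficients match loading / eigenvalue up to rounding.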

Scree Plot:
[Figure: Scree Plot of rcc, ..., wt; Eigenvalue against Factor Number (1 to 11)]

For our data, we conclude that the first three factors account for most of the total
variability in the data. The remaining eight factors together account for only about 21% of the
variability.

Score Plot
The score plot of the factors provides a check on the assumption of normality. The score
plot shows that the points are not randomly scattered around zero, which suggests that the
data are not normally distributed.

Loading Plot:
The loading plot shows the loadings on the first two factors. For our
data, the variables lbm, hg, hc and rcc have large positive loadings on Factor 1, and the
variables ssf, pcBfat and bmi have large positive loadings on Factor 2.
[Figure: Loading Plot of rcc, ..., wt; Second Factor against First Factor]

Cluster analysis

Using R software:

> TWSS<-c()
> for(i in 1:10)
+ {
+ K<-kmeans(D,i)
+ TWSS[i]<-K$tot.withinss
+ }
> TWSS
[1] 772163.8 461735.4 326456.4 246097.8 218087.2 193601.0
[7] 168152.3 162824.8 143664.8 134860.9
> plot(1:10,TWSS,type="b",xlab="Number of Clusters",ylab="Total Within cluster SS")
Here, we observe that the total within-cluster sum of squares (TWSS) keeps decreasing as we
increase the number of clusters. But beyond 4 clusters the decrease becomes much smaller
(from 246097.8 at 4 clusters to 218087.2 at 5, compared with 772163.8 to 461735.4 when going
from 1 to 2 clusters).
Hence increasing the number of clusters further will not help much. Therefore, we take the
optimal number of clusters to be 4.
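
Because kmeans() starts from random centres, the TWSS values can change from run to run. The sketch below is one possible variant of the loop, not the code used for the report: it fixes the seed, uses several random starts, and standardizes the variables first (a design choice the report does not make). The data D_demo are simulated stand-ins with four groups.

```r
# Elbow-method sketch with reproducible k-means runs.
set.seed(123)

# Hypothetical stand-in data: four well-separated groups in 3 dimensions.
centres <- rbind(c(0, 0, 0), c(6, 0, 0), c(0, 6, 0), c(0, 0, 6))
D_demo <- do.call(rbind, lapply(1:4, function(g) {
  matrix(rnorm(50 * 3), ncol = 3) +
    matrix(centres[g, ], nrow = 50, ncol = 3, byrow = TRUE)
}))

D_scaled <- scale(D_demo)   # put all variables on a common scale

# Total within-cluster sum of squares for k = 1, ..., 10.
TWSS <- sapply(1:10, function(k) {
  kmeans(D_scaled, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

decrease <- -diff(TWSS)     # drop in TWSS when adding one more cluster

plot(1:10, TWSS, type = "b",
     xlab = "Number of Clusters", ylab = "Total Within cluster SS")
```

Whether to standardize depends on the goal. On raw data, variables with large variances such as ferr and ssf dominate the Euclidean distances.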
[Figure: Total Within cluster SS against Number of Clusters]

Output of hierarchical clustering

Cluster Analysis of Observations: rcc, wcc, hc, hg, ferr, ... at, lbm, ht, wt
Euclidean Distance, Complete Linkage
Final Partition

          Number of     Within cluster  Average distance  Maximum distance
          observations  sum of squares  from centroid     from centroid
Cluster1  114           172416          37.1766           67.6575
Cluster2  15            27594           41.6011           63.0927
Cluster3  54            64942           32.0183           66.3866
Cluster4  19            34952           40.8939           77.7602

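Here we can see that the within-cluster sums of squares are smallest for cluster 2 and cluster 4,
but these are also the smallest clusters (15 and 19 observations), so the sum of squares alone
does not show that their observations are more similar. By average distance from the centroid,
cluster 3 is the most compact (32.02), while cluster 2 (41.60) and cluster 4 (40.89) are the most
spread out. Cluster 1 has the highest within-cluster sum of squares, largely because it contains
the most observations (114).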
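
The quantities in the final partition table (cluster size, within-cluster sum of squares, average and maximum distance from the centroid) can be reproduced in R roughly as follows. This is a sketch on simulated stand-in data using hclust() with complete linkage, not the Minitab procedure itself. D_demo, tree and summarise_cluster are hypothetical names.

```r
# Complete-linkage hierarchical clustering with a per-cluster summary.
set.seed(42)

# Hypothetical stand-in data: 100 observations on 2 variables.
D_demo <- rbind(matrix(rnorm(60 * 2, mean = 0), ncol = 2),
                matrix(rnorm(40 * 2, mean = 8), ncol = 2))

tree    <- hclust(dist(D_demo, method = "euclidean"), method = "complete")
cluster <- cutree(tree, k = 4)          # cut the dendrogram into 4 groups

# Size, within-cluster SS, and average / maximum distance from the centroid.
summarise_cluster <- function(rows) {
  centroid <- colMeans(rows)
  d <- sqrt(rowSums(sweep(rows, 2, centroid)^2))  # distance to centroid
  c(n = nrow(rows), within_ss = sum(d^2),
    avg_dist = mean(d), max_dist = max(d))
}

partition <- t(sapply(split(as.data.frame(D_demo), cluster),
                      function(g) summarise_cluster(as.matrix(g))))
```

Since within_ss is a sum over observations, it grows with cluster size. avg_dist is the fairer measure when comparing clusters of very different sizes.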

Cluster Centroids

Variable  Cluster1  Cluster2  Cluster3  Cluster4  Grand centroid
Rcc         4.619     4.409     4.918     4.991     4.719
Wcc         6.945     7.933     7.017     7.705     7.109
Hc         42.211    40.060    45.044    45.221    43.092
Hg         14.195    13.587    15.309    15.458    14.566
Ferr       47.930    73.333    99.611   188.737    76.876
Bmi        22.111    25.020    23.515    24.811    22.956
Ssf        69.570   140.927    50.539    61.495    69.022
pcBfat     14.236    25.295     9.486    11.263    13.507
Lbm        61.230    59.164    71.392    72.716    64.874
Ht        178.857   178.093   182.641   181.963   180.104
Wt         71.322    79.307    79.039    82.274    75.008

Distances Between Cluster Centroids

          Cluster1  Cluster2  Cluster3  Cluster4
Cluster1    0.000    77.088    56.958   142.059
Cluster2   77.088     0.000    96.498   141.642
Cluster3   56.958    96.498     0.000    89.897
Cluster4  142.059   141.642    89.897     0.000

The distance between cluster 1 and cluster 3 is the lowest (56.958), meaning that these clusters
are closest to each other, whereas cluster 1 and cluster 4 are the farthest apart (142.059).
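
The distance table can be checked from the cluster centroids: each entry is the Euclidean distance between two centroid columns. The R snippet below re-enters the reported centroids (rounded to three decimals) and recomputes the distances. The object names are illustrative.

```r
# Reported cluster centroids, one row per variable (Cluster1..Cluster4).
vars <- c("rcc", "wcc", "hc", "hg", "ferr", "bmi",
          "ssf", "pcBfat", "lbm", "ht", "wt")
centroids <- matrix(c(
    4.619,   4.409,   4.918,   4.991,   # rcc
    6.945,   7.933,   7.017,   7.705,   # wcc
   42.211,  40.060,  45.044,  45.221,   # hc
   14.195,  13.587,  15.309,  15.458,   # hg
   47.930,  73.333,  99.611, 188.737,   # ferr
   22.111,  25.020,  23.515,  24.811,   # bmi
   69.570, 140.927,  50.539,  61.495,   # ssf
   14.236,  25.295,   9.486,  11.263,   # pcBfat
   61.230,  59.164,  71.392,  72.716,   # lbm
  178.857, 178.093, 182.641, 181.963,   # ht
   71.322,  79.307,  79.039,  82.274    # wt
), ncol = 4, byrow = TRUE,
dimnames = list(vars, paste0("Cluster", 1:4)))

# dist() works on rows, so transpose to get one row per cluster.
d <- as.matrix(dist(t(centroids), method = "euclidean"))

closest  <- min(d[upper.tri(d)])   # smallest between-centroid distance
farthest <- max(d[upper.tri(d)])   # largest between-centroid distance
```

Most of each distance comes from ferr and ssf, whose centroid values differ far more between clusters than those of the other variables.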

Dendrogram
[Figure: Dendrogram, Complete Linkage, Euclidean Distance; Similarity (0.00 to 100.00) against Observations]

The final partition for our data contains 4 clusters covering the 202 observations,
characterised as follows:

Cluster (colour)            Observations  Similarity
Indigo (1st cluster)        114           Mostly Females
Dark Green (2nd cluster)    15            Both Males and Females
Dark Red (3rd cluster)      54            Both Males and Females
Dark Magenta (4th cluster)  19            Mostly Males

Conclusion:
1. According to the multivariate normality test, our data are not multivariate
normal.

2. According to the factor analysis, 3 factors explain 79% of the variation in
the data. The 1st factor describes the fitness of the body and the 2nd factor describes
health maintenance.

3. According to the cluster analysis, the complete linkage method groups the 202
observations into 4 clusters.

…………thank you……………
