Descriptive Statistics
Lecture-01
Saddam Hossain
Bio-Bio-1
Inferential Statistics:
Techniques that allow generalizations about population parameters based on sample statistics, that is, that infer properties of the population from which the data were collected.
Example question: Is the disease incidence significantly different between males and females?
What is R?:
Open source
Flexible
Extensible
Large number of statistical and numerical methods
High-quality visualization and graphical tools
Where to Get?:
www.r-project.org
RStudio:
www.rstudio.org
Short R History
1991: Ross Ihaka and Robert Gentleman begin work on R
1993: The first announcement of R
1995: R available by ftp
1997: The R core group is formed
2000: R 1.0.0 is released
2001: Bioconductor for the analysis of genomic data using R
2008: The Omegahat project to enable connectivity with other languages
2010: Revolution Analytics offers a commercial package around R
2011: The RStudio project provides a free, open-source IDE for R
Lecture Plan
Lecture 01 (Week 01):
Installation of R & RStudio
Descriptive Statistics
measures of central tendency (mean, median, mode)
quantiles (quantiles, percentiles, quantile-quantile plot (Q-Q plot))
measures of dispersion (min, max, range, variance, standard deviation, interquartile range (IQR), median absolute deviation (MAD))
measures of asymmetry and shape (skewness, kurtosis)
Survey on Anemia
Consider a study of anemia in women in a given clinic, where 20 cases are chosen at random from the full study. The following variables were recorded, the first two from a blood sample:
hemoglobin level (Hb) (12–15 g/dl is normal in adult females)
packed-cell volume (PCV), the percentage of blood volume that is occupied by red blood cells (also called hematocrit, Ht or HCT, or erythrocyte volume fraction, EVF; 38% to 46% is normal in adult females)
age in years
an indicator variable, hb.normal, for subjects who have normal hemoglobin levels (recall that 12–15 g/dl is normal in adult females), coded as 0 = no and 1 = yes
Survey on Anemia
hb:        11.1 10.7 12.4 14 13.1 10.5 9.6 12.5 13.5 13.9 15.1 13.9 16.2 16.3 16.8 17.1 16.6 16.9 15.7 16.5
pcv:       35 45 47 50 31 30 25 33 35 40 45 47 49 42 40 50 46 55 42 46
age:       20 22 25 28 28 31 32 35 38 40 45 49 54 55 57 60 62 63 65 67
hb.normal: 0 0 1 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0
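The slides do not show how the anemia object is created before it is inspected. One possible way, sketched here under the assumption that the four variables above are simply typed in as columns of a data frame, is:

```r
# Sketch: build the anemia data frame from the values listed on the slide.
# How the original lecture loaded the data is not shown; this is one option.
anemia <- data.frame(
  hb  = c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
          15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5),
  pcv = c(35, 45, 47, 50, 31, 30, 25, 33, 35, 40,
          45, 47, 49, 42, 40, 50, 46, 55, 42, 46),
  age = c(20, 22, 25, 28, 28, 31, 32, 35, 38, 40,
          45, 49, 54, 55, 57, 60, 62, 63, 65, 67),
  hb.normal = c(0, 0, 1, 1, 1, 0, 0, 1, 1, 1,
                0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
)

# Sanity check: hb.normal should flag exactly the Hb values in the 12-15 g/dl range
expected <- as.integer(anemia$hb >= 12 & anemia$hb <= 15)
stopifnot(identical(expected, as.integer(anemia$hb.normal)))
```

With the object defined this way, the dim() and str() calls that follow report 20 observations of 4 variables.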
dim(anemia)
str(anemia)
Frequency Density
par(bg="blue")
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)", col="red", prob=TRUE, col.axis="white")
stem(anemia$hb)
stripchart(anemia$hb~anemia$hb.normal, method="jitter", col='white')
Density Plot
Median
Left area = right area
A median is the number separating the higher half of the data from the lower half
For skewed data or data with outliers, the median can describe the central tendency better than the mean
The median is robust to the presence of outliers because it depends on the ranks of the observations, not on the values themselves.
Median
median(anemia$hb)
median(anemia$pcv)
median(anemia$age)
For a finite list of numbers, sort all the numbers in ascending or descending order, and then pick the middle one
For an odd number of observations, e.g., a < b < c, the median is b
For an even number of observations, e.g., a < b < c < d, the mean of b and c is taken as the median
For any probability distribution on the real line, a median m satisfies the inequalities $P(X \le m) \ge \tfrac{1}{2}$ and $P(X \ge m) \ge \tfrac{1}{2}$
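The sort-and-pick-the-middle rule above can be written out directly. The following is a minimal sketch; the function name median_by_hand is made up for illustration, and the Hb values are the ones from the survey slide.

```r
# Sketch: median computed by sorting and picking the middle value(s).
# median_by_hand is a hypothetical helper, not a built-in R function.
median_by_hand <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                  # odd n: the single middle value
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2   # even n: mean of the two middle values
  }
}

hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

median_by_hand(hb)   # should agree with median(hb)
```

With 20 observations the even-n branch applies, so the result is the mean of the 10th and 11th sorted values.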
Median
par(bg = 'blue')
plot(density(anemia$hb), col="green", lwd=4, xlab="Hb (g/dl)", ylab="Frequency Density", main="Hb Distribution Density", col.lab="white", col.main="white", col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
# mark the median with a vertical line
abline(v=median(anemia$hb), lwd=4, col="black")
Median
apply(anemia,2,median)
apply(anemia[,1:3],2,median)
par(bg = 'blue')
# three panels side by side
layout(matrix(1:3,nrow=1))
plot(density(anemia$hb), col="green", lwd=4, xlab="Hb (g/dl)", ylab="Frequency Density", main="Hb Distribution Density\n(median)", col.lab="white", col.main="white", col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=median(anemia$hb), lwd=4, col="black")
plot(density(anemia$pcv), col="green", lwd=4, xlab="pcv", ylab="Frequency Density", main="pcv Distribution Density\n(median)", col.lab="white", col.main="white", col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=median(anemia$pcv), lwd=4, col="black")
plot(density(anemia$age), col="green", lwd=4, xlab="age", ylab="Frequency Density", main="age Distribution Density\n(median)", col.lab="white", col.main="white", col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=median(anemia$age), lwd=4, col="black")
Mode
A mode is the value that occurs most frequently in a data set or a probability distribution (the global maximum)
The mode is the value associated with the maximal frequency
Not a very robust statistic: for small samples the distribution can be irregular, and the precise location of the mode depends on the choice of class boundaries.
The probability mass/density function may achieve its maximum value at several points, or have several local maxima (multimodal)
The mode may be very different from the mean and median for skewed distributions. However, the three statistics coincide in symmetric unimodal distributions, such as the normal distribution.
Mode
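No built-in function for the statistical mode! (R's mode() returns the storage mode of an object, not the most frequent value.) You have to do it yourself:
Step-01: Make a contingency table
Step-02: Identify the value that is repeated more often than any other value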
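The two steps can be sketched as follows, using the age values from the survey slide. The variable names age_table, max_count and age_mode are chosen here for illustration; the original slide code for these steps is not preserved.

```r
# Sketch of the two-step mode computation (variable names are illustrative).
age <- c(20, 22, 25, 28, 28, 31, 32, 35, 38, 40,
         45, 49, 54, 55, 57, 60, 62, 63, 65, 67)

# Step-01: contingency table of how often each value occurs
age_table <- table(age)

# Step-02: keep the value(s) with the highest count.
# table() stores the values as names (character), so convert back to numeric.
max_count <- max(age_table)
age_mode  <- as.numeric(names(age_table)[age_table == max_count])

age_mode
```

Comparing against the maximum count, rather than taking only the first maximum, returns every tied value when the data are multimodal.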
Mode
# mode holds the most frequent age, taken from the contingency table
mode <- as.numeric(names(which.max(table(anemia$age))))
par(bg = 'blue')
plot(density(anemia$age), col="green", lwd=4, xlab="age", ylab="Frequency Density", main="age Distribution Density\n(mode)", col.lab="white", col.main="white", col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=mode, lwd=4, col="white")
Q-Q Plot
Quantiles
par(bg='blue')
qqnorm(anemia$hb, col='red', pch=19)
qqline(anemia$hb, col='white', lwd=2)
Percentiles
Quantiles
quantile(anemia$hb, 0.32)
quantile(anemia$hb, 0.57)
quantile(anemia$hb, 0.98)
quantile(anemia$hb, c(.32, .57, .98))
Quantiles
quantile(anemia$hb, 0.25)   # Q1
quantile(anemia$hb, 0.50)   # Q2
quantile(anemia$hb, 0.75)   # Q3
quantile(anemia$hb, c(.25, .50, .75))
var(anemia$hb)
sd(anemia$hb)
Interquartile Range (IQR), Median Absolute Deviation (MAD): Measures of Dispersion
IQR(anemia$hb)
mad(anemia$hb)
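To make clear what these two functions return, the same quantities can be computed step by step. This is a sketch using the Hb values from the survey slide; the names iqr_by_hand, mad_raw and mad_by_hand are illustrative. Note that R's mad() by default multiplies the raw median absolute deviation by the constant 1.4826, which makes it comparable to the standard deviation for normally distributed data.

```r
# Sketch: IQR and MAD computed from their definitions (names are illustrative).
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

# IQR: distance between the third and first quartiles
q <- quantile(hb, c(0.25, 0.75), names = FALSE)
iqr_by_hand <- q[2] - q[1]

# MAD: median of the absolute deviations from the median,
# scaled by 1.4826 as mad() does by default
mad_raw     <- median(abs(hb - median(hb)))
mad_by_hand <- 1.4826 * mad_raw

c(IQR = iqr_by_hand, MAD = mad_by_hand)
```

Both statistics are built from medians and quartiles, so a few extreme readings influence them far less than they influence the variance or standard deviation.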
library(e1071)
kurtosis(anemia$hb)
Intuitively, the kurtosis is a measure of the peakedness of the data distribution. The value reported here is the excess kurtosis, measured relative to the normal distribution. Negative excess kurtosis indicates a flat data distribution, which is said to be platykurtic. Positive excess kurtosis indicates a peaked distribution, which is said to be leptokurtic. Incidentally, the normal distribution has zero excess kurtosis, and is said to be mesokurtic.
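For readers without the e1071 package, excess kurtosis can be sketched in base R from its moment definition, the fourth central moment divided by the squared second central moment, minus 3. The function name excess_kurtosis is made up for illustration. To our understanding, e1071's kurtosis() defaults to a small-sample variant (its type = 3, which divides by the fourth power of the sample standard deviation), so its result on small samples may differ slightly from the plain moment estimate; both are computed below.

```r
# Sketch: excess kurtosis from central moments (base R only).
# excess_kurtosis is a hypothetical helper, not part of any package.
excess_kurtosis <- function(x) {
  m2 <- mean((x - mean(x))^2)   # second central moment
  m4 <- mean((x - mean(x))^4)   # fourth central moment
  m4 / m2^2 - 3                 # subtract 3 so the normal distribution scores 0
}

hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

g2 <- excess_kurtosis(hb)

# Variant using the sample standard deviation (n - 1 denominator),
# assumed here to correspond to e1071's default
b2 <- mean((hb - mean(hb))^4) / sd(hb)^4 - 3

c(moment = g2, sample_sd = b2)
```

A negative result points toward a flatter, platykurtic shape and a positive result toward a more peaked, leptokurtic one.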
[Figure: box plot annotated with Q1, Q2, the IQR, and 1.5 IQR]
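The figure itself is not preserved, but its labels suggest the conventional box plot rule in which points lying more than 1.5 × IQR beyond the quartiles are flagged as potential outliers. Assuming that rule, the limits can be sketched for the Hb values as follows; lower_fence, upper_fence and outliers are illustrative names. R's boxplot() applies a similar rule, although it works from hinges, which can differ slightly from quantile() results.

```r
# Sketch: outlier limits ("fences") under the assumed 1.5 x IQR rule.
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

q1  <- quantile(hb, 0.25, names = FALSE)
q3  <- quantile(hb, 0.75, names = FALSE)
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations outside the fences would be drawn as individual points
outliers <- hb[hb < lower_fence | hb > upper_fence]
outliers
```

For these 20 readings no value falls outside the fences, so under this rule a box plot of hb would show no separately plotted points.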
summary(anemia[,1:3])
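# describe() is not part of base R; it is provided by add-on packages such as psych or Hmisc, which must be loaded first
describe(anemia[,1:3])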
summary() describe()
table() as.integer() as.numeric()
THANKX