
R for Bioinformatics

Descriptive Statistics
Lecture-01

Saddam Hossain

Bio-Bio-1

Central Dogma of Statistics

Explorative Data Analysis



Three Avenues of Statistics


Descriptive Statistics:
Procedures used to describe, summarize, organize, and simplify data without generalizing beyond the data at hand. Data visualization is very important at this phase.
Percentage of a disease incidence for a given period in a city

Inferential Statistics:
Techniques that allow for generalizations about population parameters based on sample statistics, or that infer properties of the population from which the data were collected.
Is the disease incidence significantly different between males and females?

Exploratory Data Analysis (EDA):


Discovering latent (hidden) properties of a data set, using methods such as clustering, multivariate analysis, distributional semantics, and advanced statistical modeling.
Group the districts of the country by similar outbreak intensity of this disease

What is R?:
Open source
Flexible
Extensible
Large number of statistical and numerical methods
High quality visualization and graphical tools

Where to Get?:
www.r-project.org

RStudio:
www.rstudio.org

Short R History
1991: Ross Ihaka and Robert Gentleman began developing R
1993: The first announcement of R
1995: R available by ftp
1997: The R core group is formed
2000: R 1.0.0 is released
2001: Bioconductor for the analysis of genomic data using R
2008: The Omegahat project to enable connectivity with other languages
2010: Revolution Analytics offers a commercial package around R
2011: The RStudio project provides a free, open source IDE for R

Lecture Plan
Lecture 01 (Week 01):
Installation of R & RStudio
Descriptive Statistics
measures of central tendency (mean, median, mode)
quantiles (quantiles, percentiles, quantile-quantile plot (Q-Q plot))
measures of dispersion (min, max, range, variance, standard deviation, interquartile range (IQR), median absolute deviation (MAD))
measures of asymmetry (skewness, kurtosis)

Lecture 02 (Week 02):


Inferential Statistics
correlation, regression, null hypothesis significance test (NHST)

Lecture 03 (Week 03):


Inferential Statistics
t-test, ANOVA

Lecture 04 (Week 04):


Explorative Data Analysis
clustering

Survey on Anemia
Consider a study of anemia in women in a given clinic, where 20 cases are chosen at random from the full study. From a blood sample the following readings were recorded:
hemoglobin level (Hb); 12 to 15 g/dl is normal in adult females
packed-cell volume (PCV), the percentage of blood volume that is occupied by red blood cells (also called hematocrit, Ht or HCT, or erythrocyte volume fraction, EVF); 38% to 46% is normal in adult females
age in years

The data set also contains an indicator variable, hb.normal, for subjects who have normal hemoglobin levels (recall that 12 to 15 g/dl is normal in adult females), coded as 0 = no and 1 = yes.

Survey on Anemia
hb      pcv   age   hb.normal
11.1    35    20    0
10.7    45    22    0
12.4    47    25    1
14      50    28    1
13.1    31    28    1
10.5    30    31    0
9.6     25    32    0
12.5    33    35    1
13.5    35    38    1
13.9    40    40    1
15.1    45    45    0
13.9    47    49    1
16.2    49    54    0
16.3    42    55    0
16.8    40    57    0
17.1    50    60    0
16.6    46    62    0
16.9    55    63    0
15.7    42    65    0
16.5    46    67    0
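
For readers who do not have the CSV file at hand, the readings listed above can be typed in directly. This is a minimal sketch, assuming the four column names shown with the data; the object name `anemia` matches the one used on the following slides, while `recomputed` is a name introduced here only for the consistency check.

```r
# Build the survey data by hand (values copied from the slide)
anemia <- data.frame(
  hb  = c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
          15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5),
  pcv = c(35, 45, 47, 50, 31, 30, 25, 33, 35, 40,
          45, 47, 49, 42, 40, 50, 46, 55, 42, 46),
  age = c(20, 22, 25, 28, 28, 31, 32, 35, 38, 40,
          45, 49, 54, 55, 57, 60, 62, 63, 65, 67),
  hb.normal = c(0, 0, 1, 1, 1, 0, 0, 1, 1, 1,
                0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
)

# Recompute the indicator from hb (normal range 12 to 15 g/dl)
# and compare it with the recorded hb.normal column
recomputed <- as.integer(anemia$hb >= 12 & anemia$hb <= 15)
all(recomputed == anemia$hb.normal)
```

Unlike the CSV route, this data frame has default row names (1 to 20), which does not affect any of the later commands.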

Read Anemia Data


# Read survey data from a csv file (file.choose() opens a file dialog)
anemia=read.csv(file.choose(), row.names = 1, header=T)
# Check the data
head(anemia)
str(anemia)
tail(anemia)
colnames(anemia)
dim(anemia)

Explore Distribution - Histograms


Histogram shows the frequency distributions
Continuous variable
Explore distribution
hist(anemia$hb)
hist(anemia$hb, main="Hb levels of women")
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)")
par(bg="blue")
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)")
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)", col="red")
# the argument is breaks, not break (break is a reserved word in R)
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)", col="red", breaks=10)
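
To see what a histogram actually counts, the bins can be computed without drawing anything. The sketch below is an illustration added here, assuming the Hb values from the survey slide; `h` is a name introduced for the example. A single number passed as `breaks` is only a suggestion, so the bin boundaries R settles on are worth inspecting.

```r
# Hb readings from the survey slide
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

# Compute the histogram without plotting it
h <- hist(hb, breaks = 10, plot = FALSE)

h$breaks   # bin boundaries actually used
h$counts   # number of observations falling in each bin

# Every observation lands in exactly one bin
sum(h$counts) == length(hb)
```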

Frequency Density
par(bg="blue")
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)", col="red", prob=T, col.axis="white")

stem(anemia$hb)

stripchart(anemia$hb, method="jitter", col='white')

stripchart(anemia$hb~anemia$hb.normal, method="jitter",col='white')


Density Plot

par(bg="blue")
plot(density(anemia$hb), col="green", lwd=4, main="Distribution of Hb", xlab="Hb (g/dl)", col.axis="white")

Density Plot

par(bg="blue")
# polygon() only adds to an existing plot, so draw the density curve first
plot(density(anemia$hb), main="Distribution of Hb", xlab="Hb (g/dl)", col.axis="white")
polygon(density(anemia$hb), col="red", border="white", lwd=4)

Histogram with Density Plot

par(bg="blue")
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)", col="red", prob=T, col.axis="white")
lines(density(anemia$hb), col="white", lwd=4)

Arithmetic Mean (mean)


The mean is the center of gravity of the distribution.
Beware: the mean is strongly influenced by outliers.
Statistical "outliers" are often biologically relevant objects (e.g. regulated genes).

Arithmetic Mean (mean)

mean(anemia$hb)
mean(anemia$pcv)
mean(anemia$age)

sum(anemia$hb)
length(anemia$hb)
# stored as hb.mean so that the name of the built-in mean() is not reused
hb.mean=sum(anemia$hb)/length(anemia$hb)
hb.mean

Arithmetic Mean (mean)


plot(density(anemia$hb))
plot(density(anemia$hb), col="green")
plot(density(anemia$hb), col="green", lwd=4)
par(bg = 'blue')
plot(density(anemia$hb), col="green", lwd=4,
     xlab="Hb (g/dl)", ylab="Frequency Density", main="Hb Distribution Density",
     col.lab="white", col.main="white", col.axis="white",
     cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=mean(anemia$hb), lwd=4, col="red")

Arithmetic Mean (mean)


# mean() does not average the columns of a data frame; colMeans() does
colMeans(anemia[,1:3])
par(bg = 'blue')
layout(matrix(1:3, nrow=1))
plot(density(anemia$hb), col="green", lwd=4, xlab="Hb (g/dl)", ylab="Frequency Density", main="Hb Distribution Density",
     col.lab="white", col.main="white", col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=mean(anemia$hb), lwd=4, col="red")
plot(density(anemia$pcv), col="green", lwd=4, xlab="pcv", ylab="Frequency Density", main="pcv Distribution Density",
     col.lab="white", col.main="white", col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=mean(anemia$pcv), lwd=4, col="red")
plot(density(anemia$age), col="green", lwd=4, xlab="age", ylab="Frequency Density", main="age Distribution Density",
     col.lab="white", col.main="white", col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=mean(anemia$age), lwd=4, col="red")

Median
Left area = right area.
A median is the number separating the higher half of the data from the lower half.
For skewed data or data with outliers, the median can describe the central tendency better than the mean.
The median is robust to the presence of outliers because it depends on the ranks of the values, not on the values themselves.
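
The robustness claim can be checked with a small experiment. This is a sketch added for illustration, assuming the Hb values from the survey slide; the extra reading of 60 g/dl is a made-up, implausible value used only to play the role of an outlier, and `hb.out` is a name introduced here.

```r
# Hb readings from the survey slide
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

# Same data plus one hypothetical extreme reading (not part of the survey)
hb.out <- c(hb, 60)

# The mean is pulled towards the extreme value ...
c(mean = mean(hb), mean.with.outlier = mean(hb.out))

# ... while the median barely moves
c(median = median(hb), median.with.outlier = median(hb.out))
```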

Median
median(anemia$hb)
median(anemia$pcv)
median(anemia$age)

For a finite list of numbers, sort all the numbers in ascending or descending order, and then pick the middle one.
For an odd number of observations, e.g., a < b < c, the median is b.
For an even number of observations, e.g., a < b < c < d, the mean of b and c is taken as the median.
For any probability distribution on the real line, a median m satisfies the inequalities P(X <= m) >= 1/2 and P(X >= m) >= 1/2.

sort(anemia$hb)
length(anemia$hb)
# n = 20 is even, so the median is the mean of the two middle values
hb.median=(sort(anemia$hb)[length(anemia$hb)/2] + sort(anemia$hb)[length(anemia$hb)/2+1])/2
hb.median
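
The hand computation on this slide covers only an even number of observations. A small helper that handles both the odd and the even case could look like the sketch below; `my.median` is a hypothetical name introduced here, not a function from R or from the lecture.

```r
# Median by hand: middle value for odd n, mean of the two middle values for even n
my.median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2
  }
}

my.median(c(3, 1, 2))      # odd number of observations
my.median(c(4, 1, 3, 2))   # even number of observations

# Compare with the built-in function on the Hb readings
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)
all.equal(my.median(hb), median(hb))
```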

Median
par(bg = 'blue')
plot(density(anemia$hb), col="green", lwd=4,
     xlab="Hb (g/dl)", ylab="Frequency Density", main="Hb Distribution Density",
     col.lab="white", col.main="white", col.axis="white",
     cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=median(anemia$hb), lwd=4, col="black")

Median
apply(anemia, 2, median)
apply(anemia[,1:3], 2, median)

par(bg = 'blue')
layout(matrix(1:3, nrow=1))
plot(density(anemia$hb), col="green", lwd=4, xlab="Hb (g/dl)", ylab="Frequency Density", main="Hb Distribution Density\n(median)",
     col.lab="white", col.main="white", col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=median(anemia$hb), lwd=4, col="black")
plot(density(anemia$pcv), col="green", lwd=4, xlab="pcv", ylab="Frequency Density", main="pcv Distribution Density\n(median)",
     col.lab="white", col.main="white", col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=median(anemia$pcv), lwd=4, col="black")
plot(density(anemia$age), col="green", lwd=4, xlab="age", ylab="Frequency Density", main="age Distribution Density\n(median)",
     col.lab="white", col.main="white", col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=median(anemia$age), lwd=4, col="black")

Mode
A mode is a value that occurs most frequently in a data set or a probability distribution (global maximum).
The mode is the value associated with the maximal frequency.
Not a very robust statistic: for small samples the distribution can be irregular, and the precise location of the mode depends on the choice of class boundaries.
The probability mass/density function may achieve its maximum value at several points, leading to multiple local maxima (multimodal).
The mode may be very different from the mean and median for skewed distributions. However, the three statistics coincide in symmetric unimodal distributions, such as the normal distribution.

Mode
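No built-in function for the statistical mode! You have to do it yourself. (R's own mode() reports the storage mode of an object, not the most frequent value.)
Step-01: Make a contingency table
Step-02: Identify the value that is repeated more than any other

Step-01
tbl=table(anemia$age)
tbl

Step-02
mode=names(tbl)[tbl == max(tbl)]
mode

# names() returns character strings, so convert the result to a number
as.integer(mode)
as.numeric(mode)
mode=as.numeric(mode)
mode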
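
The two steps can be wrapped into a reusable helper. This is a minimal sketch, assuming numeric input; `stat.mode` is a hypothetical name chosen here so that it does not clash with R's built-in `mode()`. When several values share the highest count, all of them are returned.

```r
# Statistical mode: tabulate the values, keep those with the highest count
stat.mode <- function(x) {
  tbl <- table(x)
  as.numeric(names(tbl)[tbl == max(tbl)])
}

# Ages from the survey slide; 28 is the only value that appears twice
age <- c(20, 22, 25, 28, 28, 31, 32, 35, 38, 40,
         45, 49, 54, 55, 57, 60, 62, 63, 65, 67)
stat.mode(age)

# A made-up vector with two modes, to show the multimodal case
stat.mode(c(1, 1, 2, 2, 3))
```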

Mode
par(bg = 'blue')
plot(density(anemia$age), col="green", lwd=4,
     xlab="age", ylab="Frequency Density", main="age Distribution Density\n(mode)",
     col.lab="white", col.main="white", col.axis="white",
     cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=mode, lwd=4, col="white")

Mean, Median & Mode

mean(anemia$age)
median(anemia$age)
temp=table(anemia$age)
mode=names(temp)[temp == max(temp)]
mode
# mode is still a character string here, so c() turns all three values into strings
c(mean=mean(anemia$age), median=median(anemia$age), mode=mode)

# converting to numeric gives a numeric vector
c(mean=as.numeric(mean(anemia$age)), median=as.numeric(median(anemia$age)), mode=as.numeric(mode))

Mean, Median & Mode


par(bg = 'blue')
plot(density(anemia$age), col="green", lwd=4,
     xlab="age", ylab="Frequency Density", main="age Distribution Density\n(mean, median, mode)",
     col.lab="white", col.main="white", col.axis="white",
     cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=mean(anemia$age), lwd=4, col="red")
abline(v=median(anemia$age), lwd=4, col="black")
abline(v=mode, lwd=4, col="white")
legend("topright", legend=c("mean", "median", "mode"), text.col=c('red','black','white'), ncol=1)

Q-Q Plot
Quantiles
par(bg='blue')
qqnorm(anemia$hb, col='red', pch=19)
qqline(anemia$hb, col='white', lwd=2)
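
A Q-Q plot pairs the sorted sample values with the quantiles a normal distribution would have at the same plotting positions. The sketch below, added for illustration, rebuilds those coordinates by hand and compares them with what `qqnorm()` returns when plotting is switched off; `theoretical`, `sample.q` and `qq` are names introduced here.

```r
# Hb readings from the survey slide
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)
n <- length(hb)

# x-axis: standard normal quantiles at n evenly spread probabilities
theoretical <- qnorm(ppoints(n))
# y-axis: the sample values in increasing order
sample.q <- sort(hb)

# qqnorm() returns the same coordinates (in the original data order)
qq <- qqnorm(hb, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), sample.q)
```

If the points (theoretical, sample.q) fall close to a straight line, the sample is roughly normally distributed.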

Percentiles
Quantiles
quantile(anemia$hb, 0.32)
quantile(anemia$hb, 0.57)
quantile(anemia$hb, 0.98)
quantile(anemia$hb, c(.32, .57, .98))

Quantiles
Quantiles
quantile(anemia$hb, 0.25)
quantile(anemia$hb, 0.50)
quantile(anemia$hb, 0.75)
quantile(anemia$hb, c(.25, .50, .75))

quantile(anemia$hb, c(0.0, .25, .50, .75, 1.0))

Q1, Q2 and Q3 are the 25%, 50% and 75% quantiles (the quartiles).
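
With only 20 observations, most requested quantiles fall between two data points, so `quantile()` has to interpolate. The sketch below reproduces R's default rule (type 7, linear interpolation between order statistics) by hand; `quantile7` is a hypothetical helper name introduced for this example.

```r
# Quantile by linear interpolation between order statistics (R's default, type 7)
quantile7 <- function(x, p) {
  s  <- sort(x)
  n  <- length(s)
  h  <- (n - 1) * p + 1      # fractional position in the sorted data
  lo <- floor(h)             # order statistic just below that position
  hi <- pmin(lo + 1, n)      # order statistic just above (capped at n)
  s[lo] + (h - lo) * (s[hi] - s[lo])
}

# Hb readings from the survey slide
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)
p <- c(0, 0.25, 0.50, 0.75, 1)

quantile7(hb, p)
all.equal(quantile7(hb, p), unname(quantile(hb, p)))
```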

Min, Max, Range


Measure of Dispersion

min(anemia$hb)
max(anemia$hb)
range(anemia$hb)
diff(range(anemia$hb))

Variance, Standard Deviation


Measure of Dispersion

var(anemia$hb)

sd(anemia$hb)
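
As with the mean, the variance and standard deviation can be recomputed from their definitions. This is a sketch added for illustration, assuming the Hb values from the survey slide; note that `var()` and `sd()` use the sample formula with n - 1 in the denominator. `hb.var` and `hb.sd` are names introduced here.

```r
# Hb readings from the survey slide
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)
n <- length(hb)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
hb.var <- sum((hb - mean(hb))^2) / (n - 1)
# Standard deviation: square root of the variance
hb.sd <- sqrt(hb.var)

all.equal(hb.var, var(hb))
all.equal(hb.sd, sd(hb))
```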


Interquartile Range (IQR), Median Absolute Deviation (MAD)
Measure of Dispersion

IQR(anemia$hb)

mad(anemia$hb)
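
Both measures are built from quantiles rather than from the mean, which is what makes them robust. The sketch below recomputes them by hand, assuming the Hb values from the survey slide; `hb.iqr` and `hb.mad` are names introduced here. By default `mad()` multiplies the raw median absolute deviation by the constant 1.4826, so that it is comparable to the standard deviation for normally distributed data.

```r
# Hb readings from the survey slide
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

# IQR: distance between the third and the first quartile
hb.iqr <- unname(diff(quantile(hb, c(0.25, 0.75))))

# MAD: median of the absolute deviations from the median,
# scaled by the default constant used by mad()
hb.mad <- 1.4826 * median(abs(hb - median(hb)))

all.equal(hb.iqr, IQR(hb))
all.equal(hb.mad, mad(hb))
```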


Skewness, Kurtosis
Measure of Asymmetry


install.packages("e1071")
library(e1071)
skewness(anemia$hb)

library(e1071)
kurtosis(anemia$hb)

Intuitively, the kurtosis is a measure of the peakedness of the data distribution.
Negative kurtosis indicates a flat data distribution, which is said to be platykurtic.
Positive kurtosis indicates a peaked distribution, which is said to be leptokurtic.
Kurtosis here means excess kurtosis (kurtosis minus 3), which is what kurtosis() reports; on this scale the normal distribution has zero kurtosis, and is said to be mesokurtic.
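
Both measures can also be computed in base R from the central moments, without installing a package. This is a sketch of the plain moment-based definitions; `moment.skewness` and `moment.kurtosis` are hypothetical names introduced here. The e1071 functions offer several estimator variants through a `type` argument, so their default output may differ slightly from these values for a small sample.

```r
# Moment-based skewness: third central moment over variance^(3/2)
moment.skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3
moment.kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Hb readings from the survey slide
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)
moment.skewness(hb)
moment.kurtosis(hb)

# A symmetric toy vector has zero skewness
moment.skewness(c(1, 2, 3, 4, 5))
```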

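Boxplot
Summary Statistics

boxplot(anemia$hb, main="Hb Boxplot", ylab="Hb", col='white')

[Figure: annotated boxplot. From top to bottom: outlier; max (upper whisker, at most 1.5 IQR above Q3); Q3; Q2; Q1; min (lower whisker, at most 1.5 IQR below Q1); outlier. The box from Q1 to Q3 is the IQR, so box and whiskers together span at most 4 IQR.]

boxplot(anemia$hb~anemia$hb.normal, main="Hb Boxplot", xlab="Hb Range", ylab="Hb", col="white")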
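
The 1.5 IQR rule behind the whiskers can be written out directly. The sketch below is an approximation added for illustration: it uses `quantile()` for Q1 and Q3, whereas `boxplot()` itself works with Tukey's hinges, which can differ slightly. The names `lower.fence`, `upper.fence`, `outliers` and `whiskers` are introduced here.

```r
# Hb readings from the survey slide
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

q   <- quantile(hb, c(0.25, 0.75), names = FALSE)   # Q1 and Q3
iqr <- q[2] - q[1]

# Points further than 1.5 IQR from the box are drawn as outliers
lower.fence <- q[1] - 1.5 * iqr
upper.fence <- q[2] + 1.5 * iqr
outliers <- hb[hb < lower.fence | hb > upper.fence]

# The whiskers end at the most extreme points inside the fences
whiskers <- range(hb[hb >= lower.fence & hb <= upper.fence])

outliers
whiskers

# The statistics boxplot() uses: whisker ends, hinges, median, and outliers
boxplot.stats(hb)
```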

Summary
Summary Statistics

summary(anemia$hb)
summary(anemia$pcv)
summary(anemia$age)

summary(anemia[,1:3])

Describe
Summary Statistics

install.packages("psych")
library(psych)
describe(anemia$hb)
describe(anemia$pcv)
describe(anemia$age)

describe(anemia[,1:3])

What Have We Learned?


read.csv() file.choose()
head() str() tail() colnames() dim()
par() hist() stem() stripchart() plot() density() polygon() lines() abline() legend() boxplot()
c() sum() length() sort()
mean() median()
table() as.integer() as.numeric()
qqnorm() qqline() quantile()
min() max() range() diff() var() sd() IQR() mad()
install.packages() library() skewness() kurtosis()
summary() describe()

THANKX
