R For Bioinformatics Lec 01 Descriptive Statistics

R for Bioinformatics
Descriptive Statistics
Lecture-01
Saddam Hossain
Bio-Bio-1
Central Dogma of Statistics
Explorative Data Analysis

Three Avenues of Statistics

Descriptive Statistics:
Procedures used to describe, summarize, organize, and simplify data without generalizing beyond the data at hand. Data visualization is very important at this phase.
Percentage of a disease incidence for a given period in a city
Inferential Statistics:
Techniques that allow generalizations about population parameters based on sample statistics, i.e. inferring properties of the population from which the data were collected
Is the disease incidence significantly different between males and females?
Exploratory Data Analysis (EDA):

Discovering latent (hidden) properties from a data set. Clustering, Multivariate analysis, distributional semantics, advanced statistical modeling
Group the districts as per similar out-break intensity of this disease over the country
What is R?:
Open source
Flexible
Extensible
Large number of statistical and numerical methods
High-quality visualization and graphical tools
Where to Get?:
www.r-project.org
RStudio:
www.rstudio.org
Short R History
1991: Ross Ihaka and Robert Gentleman begin work on R
1993: The first announcement of R
1995: R available by ftp
1997: The R core group is formed
2000: R 1.0.0 is released
2001: Bioconductor for the analysis of genomic data using R
2008: The Omegahat project to enable connectivity with other languages
2010: Revolution Analytics offers a commercial package around R
2011: The RStudio project provides a free, open-source IDE for R
Lecture Plan
Lecture 01 (Week 01):
Installation of R & RStudio
Descriptive Statistics
measures of central tendency (mean, median, mode)
quantiles (quantiles, percentiles, quantile-quantile plot (Q-Q plot))
measures of dispersion (min, max, range, variance, standard deviation, interquartile range (IQR), median absolute deviation (MAD))
measures of asymmetry (skewness, kurtosis)

Lecture 02 (Week 02):
Inferential Statistics
correlation, regression, null hypothesis significance test (NHST)

Lecture 03 (Week 03):
Inferential Statistics
t-test, ANOVA

Lecture 04 (Week 04):
Explorative Data Analysis
clustering
Survey on Anemia
Consider a study of anemia in women in a given clinic where 20 cases are chosen at random from the full study. The following readings were recorded for each case:
hemoglobin level (Hb) from a blood sample; 12 to 15 g/dl is normal in adult females
packed-cell volume (PCV), the percentage of blood volume that is occupied by red blood cells (also called hematocrit, Ht or HCT, or erythrocyte volume fraction, EVF); 38% to 46% is normal in adult females
age in years
an indicator variable, hb.normal, for subjects who have normal hemoglobin levels (recall that 12 to 15 g/dl is normal in adult females), coded as 0 = no and 1 = yes
Survey on Anemia
hb     pcv   age   hb.normal
11.1   35    20    0
10.7   45    22    0
12.4   47    25    1
14     50    28    1
13.1   31    28    1
10.5   30    31    0
9.6    25    32    0
12.5   33    35    1
13.5   35    38    1
13.9   40    40    1
15.1   45    45    0
13.9   47    49    1
16.2   49    54    0
16.3   42    55    0
16.8   40    57    0
17.1   50    60    0
16.6   46    62    0
16.9   55    63    0
15.7   42    65    0
16.5   46    67    0
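For readers who do not have the CSV file, the values listed above can be typed in directly. This is only a sketch: the column names follow the slide, the default row names 1 to 20 are an assumption (the original file keeps its row names in the first column), and the final line merely checks that hb.normal agrees with the 12 to 15 g/dl range stated earlier.

```r
# Survey values copied from the slide; row names default to 1..20 (assumption)
anemia <- data.frame(
  hb  = c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
          15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5),
  pcv = c(35, 45, 47, 50, 31, 30, 25, 33, 35, 40,
          45, 47, 49, 42, 40, 50, 46, 55, 42, 46),
  age = c(20, 22, 25, 28, 28, 31, 32, 35, 38, 40,
          45, 49, 54, 55, 57, 60, 62, 63, 65, 67),
  hb.normal = c(0, 0, 1, 1, 1, 0, 0, 1, 1, 1,
                0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
)

# Consistency check: hb.normal should be 1 exactly when 12 <= hb <= 15
derived <- as.integer(anemia$hb >= 12 & anemia$hb <= 15)
all(derived == anemia$hb.normal)
```

With this data frame in place, the later commands that use anemia$hb, anemia$pcv and anemia$age can be run without file.choose().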
Read Anemia Data

# Read survey data from a CSV file (the first column holds the row names)
anemia=read.csv(file.choose(), row.names = 1, header=TRUE)
# Check the data
colnames(anemia)
head(anemia)
tail(anemia)
dim(anemia)
str(anemia)
Explore Distribution - Histograms

Histogram shows the frequency distributions
Continuous variable: explore the distribution
hist(anemia$hb)
hist(anemia$hb, main="Hb levels of women")
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)")
par(bg="blue")
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)")
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)", col="red")
# the argument is breaks (break is a reserved word in R)
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)", col="red", breaks=10)
Frequency Density
par(bg="blue")
# prob=TRUE plots densities (total bar area = 1) instead of counts
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)", col="red", prob=TRUE, col.axis="white")
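The difference between a frequency histogram and a frequency density histogram can be checked numerically. The sketch below uses hist() with plot = FALSE to get the bins without drawing, then recomputes the densities by hand; the variable names width and manual_density are illustrative only.

```r
# Hb readings copied from the survey table
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

h <- hist(hb, plot = FALSE)      # compute the bins without drawing
n <- length(hb)
width <- diff(h$breaks)          # width of each bin

# density of a bin = count / (n * bin width)
manual_density <- h$counts / (n * width)
all.equal(manual_density, h$density)

# the bar areas of a density histogram add up to 1
sum(h$density * width)
```

Because the areas sum to 1, a density histogram can share its y-axis with a kernel density curve, which is why prob=T is used before overlaying lines(density(...)).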
stem(anemia$hb)
stripchart(anemia$hb, method="jitter", col='white')
stripchart(anemia$hb~anemia$hb.normal, method="jitter",col='white')
Density Plot
par(bg="blue")
plot(density(anemia$hb), col="green", lwd=4, main="Distribution of Hb", xlab="Hb (g/dl)", col.axis="white")
Density Plot
par(bg="blue")
# polygon() only adds to an existing plot, so draw the density curve first
plot(density(anemia$hb), main="Distribution of Hb", xlab="Hb (g/dl)", col.axis="white")
polygon(density(anemia$hb), col="red", border="white", lwd=4)
Histogram with Density Plot
par(bg="blue")
hist(anemia$hb, main="Hb levels of women", xlab="Hb (g/dl)", col="red", prob=TRUE, col.axis="white")
# overlay the kernel density estimate on the density histogram
lines(density(anemia$hb), col="white", lwd=4)
Arithmetic Mean (mean)

The mean is the center of gravity of the distribution
Beware: the mean is strongly influenced by outliers
Statistical "outliers" are often biologically relevant objects (e.g. regulated genes)
mean(anemia$hb)
mean(anemia$pcv)
mean(anemia$age)
# the same by hand: sum divided by the number of observations
sum(anemia$hb)
length(anemia$hb)
hb.mean=sum(anemia$hb)/length(anemia$hb)
hb.mean

plot(density(anemia$hb))
plot(density(anemia$hb),col="green")
plot(density(anemia$hb),col="green",lwd=4)
par(bg = 'blue')
plot(density(anemia$hb),col="green",lwd=4, xlab="Hb (g/dl)",ylab="Frequency Density",main="Hb Distribution Density", col.lab="white", col.main="white",col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
# mark the mean with a vertical line
abline(v=mean(anemia$hb), lwd=4, col="red")

# mean() does not accept a data frame; use colMeans() for column-wise means
colMeans(anemia[,1:3])
par(bg = 'blue')
# three panels side by side
layout(matrix(1:3,nrow=1))
plot(density(anemia$hb),col="green",lwd=4,xlab="Hb (g/dl)",ylab="Frequency Density",main="Hb Distribution Density", col.lab="white", col.main="white",col.axis="white",cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=mean(anemia$hb), lwd=4, col="red")
plot(density(anemia$pcv),col="green",lwd=4, xlab="pcv",ylab="Frequency Density",main="pcv Distribution Density", col.lab="white", col.main="white",col.axis="white",cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=mean(anemia$pcv), lwd=4, col="red")
plot(density(anemia$age),col="green",lwd=4,xlab="age",ylab="Frequency Density",main="age Distribution Density", col.lab="white", col.main="white",col.axis="white",cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=mean(anemia$age), lwd=4, col="red")
Median
Left area = right area
A median is the number separating the higher half of the data from the lower half
For skewed data, the median can describe the central tendency better than the mean
The median is robust to the presence of outliers because it depends on the ranks of the values, not on the values themselves
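The robustness claim can be illustrated with a small experiment. The sketch below appends one made-up, implausible Hb reading (45 g/dl, not part of the survey) and compares how far the mean and the median move.

```r
# Hb readings copied from the survey table
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

# Hypothetical outlier: 45 g/dl is invented for illustration only
hb_out <- c(hb, 45)

# How far does each statistic move when the outlier is added?
shift_mean   <- mean(hb_out) - mean(hb)
shift_median <- median(hb_out) - median(hb)
c(mean = shift_mean, median = shift_median)
```

The mean moves by more than one unit while the median barely changes, because only the rank of the new value matters for the median.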
Median
median(anemia$hb)
median(anemia$pcv)
median(anemia$age)
For a finite list of numbers, sort all the numbers in ascending or descending order, and then pick the middle one
For an odd number of observations, e.g., a < b < c, the median is b
For an even number of observations, e.g., a < b < c < d, the mean of b and c is taken as the median
For any probability distribution on the real line, a median m satisfies the inequalities P(X <= m) >= 1/2 and P(X >= m) >= 1/2
# by hand for the 20 Hb readings (even number of observations)
sort(anemia$hb)
length(anemia$hb)
hb.median=(sort(anemia$hb)[length(anemia$hb)/2] + sort(anemia$hb)[length(anemia$hb)/2+1])/2
hb.median
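The hand calculation above only covers an even number of observations. A minimal sketch of the full rule (odd and even cases) is given here; median_by_hand is a hypothetical helper name, and missing values are not handled.

```r
# Median following the sort-and-pick-the-middle rule (no NA handling)
median_by_hand <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                   # odd n: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2    # even n: mean of the two middle values
  }
}

hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

median_by_hand(hb)           # even number of observations
median_by_hand(c(3, 1, 2))   # odd number of observations
```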
Median
par(bg = 'blue')
plot(density(anemia$hb),col="green",lwd=4, xlab="Hb (g/dl)",ylab="Frequency Density",main="Hb Distribution Density", col.lab="white", col.main="white",col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
# mark the median with a vertical line
abline(v=median(anemia$hb), lwd=4, col="black")
Median
# column-wise medians (2 = apply over columns)
apply(anemia,2,median)
apply(anemia[,1:3],2,median)
par(bg = 'blue')
layout(matrix(1:3,nrow=1))
plot(density(anemia$hb),col="green",lwd=4,xlab="Hb (g/dl)",ylab="Frequency Density",main="Hb Distribution Density\n(median)", col.lab="white", col.main="white",col.axis="white",cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=median(anemia$hb), lwd=4, col="black")
plot(density(anemia$pcv),col="green",lwd=4,xlab="pcv",ylab="Frequency Density",main="pcv Distribution Density\n(median)", col.lab="white", col.main="white",col.axis="white",cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=median(anemia$pcv), lwd=4, col="black")
plot(density(anemia$age),col="green",lwd=4,xlab="age",ylab="Frequency Density",main="age Distribution Density\n(median)", col.lab="white", col.main="white",col.axis="white",cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=median(anemia$age), lwd=4, col="black")
Mode
A mode is a value that occurs most frequently in a data set or a probability distribution (the global maximum)
The mode is the value associated with the maximal frequency
Not a very robust statistic: for small samples the distribution can be irregular, and the precise location of the mode depends on the choice of class boundaries
The probability mass/density function may achieve its maximum value at several points, or have multiple local maxima (multimodal)
The mode may be very different from the mean and median for skewed distributions. However, the three statistics coincide in symmetric unimodal distributions, such as the normal distribution.
Mode
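No built-in function for the statistical mode (R's own mode() returns the storage type of an object), so you have to compute it yourself
Step-01: Make a frequency table
Step-02: Identify the value that is repeated more than any other
Step-01
tbl=table(anemia$age)
tbl
Step-02
# names() returns the tabulated values as character strings
mode=names(tbl)[tbl == max(tbl)]
mode
# convert back to a number
as.integer(mode)
as.numeric(mode)
mode=as.numeric(mode)
mode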
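The two steps can be wrapped into a small reusable function. This is a sketch for numeric vectors only; stat_mode is a hypothetical name, and the function returns every value tied for the highest frequency, so a multimodal sample yields more than one result.

```r
# Statistical mode(s) of a numeric vector via a frequency table
stat_mode <- function(x) {
  tbl <- table(x)                           # Step-01: frequency table
  as.numeric(names(tbl)[tbl == max(tbl)])   # Step-02: most frequent value(s)
}

age <- c(20, 22, 25, 28, 28, 31, 32, 35, 38, 40,
         45, 49, 54, 55, 57, 60, 62, 63, 65, 67)

stat_mode(age)                # single mode
stat_mode(c(1, 1, 2, 2, 3))   # two values tie, so both are returned
```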
Mode
par(bg = 'blue')
plot(density(anemia$age),col="green",lwd=4, xlab="age",ylab="Frequency Density",main="age Distribution Density\n(mode)", col.lab="white", col.main="white",col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
# mark the mode (computed on the previous slide) with a vertical line
abline(v=mode, lwd=4, col="white")
Mean, Median & Mode
mean(anemia$age)
median(anemia$age)
temp=table(anemia$age)
mode=names(temp)[temp == max(temp)]
mode
# mode is still a character string here, so c() coerces all three values to character
c(mean=mean(anemia$age),median=median(anemia$age),mode=mode)
# convert to numeric to get a numeric vector
c(mean=as.numeric(mean(anemia$age)), median=as.numeric(median(anemia$age)), mode=as.numeric(mode))
Mean, Median & Mode

par(bg = 'blue')
plot(density(anemia$age),col="green",lwd=4, xlab="age",ylab="Frequency Density",main="age Distribution Density\n(mean, median, mode)", col.lab="white", col.main="white",col.axis="white", cex.main=1.25, cex.lab=0.9, cex.axis=0.9)
abline(v=mean(anemia$age), lwd=4, col="red")
abline(v=median(anemia$age), lwd=4, col="black")
abline(v=mode, lwd=4, col="white")
# legend text colors match the three vertical lines
legend("topright", legend=c("mean", "median", "mode"), text.col=c('red','black','white'), ncol=1)
Q-Q Plot
Quantiles
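par(bg='blue')
# sample quantiles against theoretical normal quantiles
qqnorm(anemia$hb, col='red', pch=19)
# reference line through the first and third quartiles
qqline(anemia$hb, col='white', lwd=2)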
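What qqnorm() plots can be reproduced by hand: the sorted sample values against standard normal quantiles at evenly spread probabilities. The sketch below uses ppoints() for those probabilities, which is what qqnorm() itself uses; the names theoretical and observed are illustrative.

```r
# Hb readings copied from the survey table
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

n <- length(hb)
theoretical <- qnorm(ppoints(n))   # standard normal quantiles
observed    <- sort(hb)            # sample quantiles

# qqnorm() returns the same coordinates (in the original data order)
qq <- qqnorm(hb, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), observed)
```

If the data were roughly normal, plot(theoretical, observed) would follow a straight line, which is what the Q-Q plot is used to judge.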
Percentiles
Quantiles
# the 32nd, 57th and 98th percentiles
quantile(anemia$hb,0.32)
quantile(anemia$hb,0.57)
quantile(anemia$hb,0.98)
quantile(anemia$hb, c(.32, .57,.98))
Quantiles
Quantiles
quantile(anemia$hb,0.25)   # Q1
quantile(anemia$hb,0.50)   # Q2 (median)
quantile(anemia$hb,0.75)   # Q3
quantile(anemia$hb, c(.25, .50, .75))
# minimum, Q1, Q2, Q3, maximum
quantile(anemia$hb, c(0.0,.25, .50, .75,1.0))
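The numbers returned by quantile() come from linear interpolation between sorted values. The sketch below reimplements R's default rule (type 7 in the quantile() documentation) to show the arithmetic; quantile_type7 is a hypothetical helper name, and other quantile types would give slightly different results.

```r
# Linear-interpolation quantile, matching quantile()'s default (type 7)
quantile_type7 <- function(x, p) {
  s  <- sort(x)
  h  <- (length(s) - 1) * p + 1   # fractional position in the sorted data
  lo <- floor(h)
  hi <- ceiling(h)
  s[lo] + (h - lo) * (s[hi] - s[lo])
}

hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

probs <- c(0, 0.25, 0.50, 0.75, 1)
quantile_type7(hb, probs)
quantile(hb, probs, names = FALSE)   # built-in result for comparison
```

For Q1, the position is 19 * 0.25 + 1 = 5.75, so the result lies three quarters of the way from the 5th to the 6th sorted value.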
Min, Max, Range

Measure of Dispersion
min(anemia$hb)
max(anemia$hb)
# range() returns c(min, max); diff() turns it into a single width
range(anemia$hb)
diff(range(anemia$hb))
Variance, Standard Deviation

Measure of Dispersion
var(anemia$hb)
sd(anemia$hb)
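As with the mean, var() and sd() can be reproduced by hand. The sketch below uses the sample variance with the n - 1 denominator, which is what var() computes; hb_var and hb_sd are illustrative names.

```r
# Hb readings copied from the survey table
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

n <- length(hb)

# sample variance: sum of squared deviations from the mean, divided by n - 1
hb_var <- sum((hb - mean(hb))^2) / (n - 1)

# standard deviation: square root of the variance (same unit as the data)
hb_sd <- sqrt(hb_var)

c(variance = hb_var, sd = hb_sd)
```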
Interquartile Range (IQR), Median Absolute Deviation (MAD)
Measure of Dispersion
IQR(anemia$hb)
mad(anemia$hb)
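Both robust dispersion measures can be written in terms of functions already introduced. In this sketch the IQR is Q3 minus Q1, and the MAD is the median of the absolute deviations from the median, scaled by 1.4826, the default constant of mad() that makes it comparable to the standard deviation for normal data.

```r
# Hb readings copied from the survey table
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

# IQR: distance between the third and the first quartile
hb_iqr <- unname(diff(quantile(hb, c(0.25, 0.75))))

# MAD: scaled median of absolute deviations from the median
hb_mad <- 1.4826 * median(abs(hb - median(hb)))

c(IQR = hb_iqr, MAD = hb_mad)
```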
Skewness, Kurtosis
Measure of Asymmetry

install.packages("e1071")   # one-time installation
library(e1071)
skewness(anemia$hb)
kurtosis(anemia$hb)
Intuitively, the kurtosis is a measure of the peakedness of the data distribution. The value reported by kurtosis() is the excess kurtosis (kurtosis minus 3). Negative excess kurtosis indicates a flat distribution, which is said to be platykurtic. Positive excess kurtosis indicates a peaked distribution, which is said to be leptokurtic. The normal distribution has zero excess kurtosis and is said to be mesokurtic.
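Skewness and kurtosis are built from central moments, which can be sketched without any extra package. The helpers below (central_moment, skew_g1, kurt_g2 are hypothetical names) use the plain moment estimators; e1071 offers several estimator types, so its output may differ slightly from these values for a small sample.

```r
# k-th central moment: average k-th power of deviations from the mean
central_moment <- function(x, k) mean((x - mean(x))^k)

# moment-based skewness: third moment scaled by the variance^(3/2)
skew_g1 <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)

# moment-based excess kurtosis: fourth moment scaled by variance^2, minus 3
kurt_g2 <- function(x) central_moment(x, 4) / central_moment(x, 2)^2 - 3

hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

c(skewness = skew_g1(hb), excess_kurtosis = kurt_g2(hb))

# a perfectly symmetric sample has zero skewness
skew_g1(c(1, 2, 3, 4, 5))
```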
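Boxplot
Summary Statistics

boxplot(anemia$hb, main="Hb Boxplot", ylab="Hb", col='white')
Reading the boxplot from top to bottom:
outlier (more than 1.5 IQR above Q3)
upper whisker: maximum value within 1.5 IQR above Q3
Q3
Q2 (median)
Q1
lower whisker: minimum value within 1.5 IQR below Q1
outlier (more than 1.5 IQR below Q1)
The box spans the IQR (Q1 to Q3), so the box and both whiskers together cover at most 4 IQR
boxplot(anemia$hb~anemia$hb.normal, main="Hb Boxplot", xlab="Hb Range", ylab="Hb", col="white")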
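The whisker and outlier rule can be checked numerically. The sketch below follows what boxplot.stats() does: it takes the hinges from fivenum() (which can differ slightly from the quartiles returned by quantile()), builds fences 1.5 box lengths away, and flags anything outside them. The names hinge_spread, lower_fence and upper_fence are illustrative.

```r
# Hb readings copied from the survey table
hb <- c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
        15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5)

five <- fivenum(hb)               # min, lower hinge, median, upper hinge, max
hinge_spread <- five[4] - five[2] # length of the box

# fences: 1.5 box lengths beyond each hinge
lower_fence <- five[2] - 1.5 * hinge_spread
upper_fence <- five[4] + 1.5 * hinge_spread

# points outside the fences are drawn individually as outliers
outliers <- hb[hb < lower_fence | hb > upper_fence]

# whiskers end at the most extreme points inside the fences
whiskers <- range(hb[hb >= lower_fence & hb <= upper_fence])

bs <- boxplot.stats(hb)           # values that boxplot() actually uses
whiskers
bs$stats[c(1, 5)]
length(outliers)
```

For the Hb readings no value falls outside the fences, so the whiskers simply reach the minimum and the maximum.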
Summary
Summary Statistics

summary(anemia$hb)
summary(anemia$pcv)
summary(anemia$age)
# all three numeric columns at once
summary(anemia[,1:3])
Describe
Summary Statistics

install.packages("psych")   # one-time installation
library(psych)
describe(anemia$hb)
describe(anemia$pcv)
describe(anemia$age)
# all three numeric columns at once
describe(anemia[,1:3])
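If the psych package is not available, a comparable table can be assembled from the base functions covered in this lecture. This is only a sketch and not a replacement for describe(): describe_basic is a hypothetical helper, and it reports fewer statistics than psych does.

```r
# Survey values copied from the slide
anemia <- data.frame(
  hb  = c(11.1, 10.7, 12.4, 14, 13.1, 10.5, 9.6, 12.5, 13.5, 13.9,
          15.1, 13.9, 16.2, 16.3, 16.8, 17.1, 16.6, 16.9, 15.7, 16.5),
  pcv = c(35, 45, 47, 50, 31, 30, 25, 33, 35, 40,
          45, 47, 49, 42, 40, 50, 46, 55, 42, 46),
  age = c(20, 22, 25, 28, 28, 31, 32, 35, 38, 40,
          45, 49, 54, 55, 57, 60, 62, 63, 65, 67)
)

# One named vector of descriptive statistics per variable
describe_basic <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), median = median(x),
    mad = mad(x), min = min(x), max = max(x), range = diff(range(x)))
}

# sapply() applies the helper to each column and binds the results
res <- sapply(anemia[, c("hb", "pcv", "age")], describe_basic)
round(res, 2)
```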
What Have We Learned?

read.csv() file.choose() par() hist() stem() stripchart() plot() density() polygon() lines() abline() legend() boxplot() min() max() range() diff() install.packages() library() qqnorm() qqline() quantile() var() sd() IQR() mad() skewness() kurtosis()
head() str() tail() colnames() dim()
c() sum() length() sort()

mean() median()
summary() describe()
table() as.integer() as.numeric()
THANKX