
Introduction to Bio-statistics

National Institute for Cellular Biotechnology

Jai Prakash Mehta

Reviewed by: Padraig Doolan (6-10-05)

1st Day: Contents (1.5 hrs)


Population and Sample
Measure of central tendency
  Mean, Weighted mean, Median
Measure of Dispersion
  Mean deviation, Standard deviation, Variance, Coefficient of variation, Standard error
Normal Distribution
  Properties of normal distribution

1st Day: Contents (1 hr)

Test of significance
  NULL hypothesis, Types of error, Level of significance, Degree of freedom, t-test, Paired t-test, Multiple testing correction, Applications of the t-test

2nd Day: Contents (1.5 hrs)


Analysis of Variance
  One way ANOVA, Applications of ANOVA
Correlation and Regression
  Scatter diagram, Correlation, Spearman's (rank) correlation, Regression equation of first order, Multiple regression, Non-linear equation

2nd Day: Contents (0.5 hrs)


Experimental design
  Designing a good microarray experiment
Statistics related to microarray
  Cluster mathematics, Hierarchical clustering, K-means clustering, Self Organizing Maps

Population and sample


The population is the entire set of data that we want to study. Since the entire population can rarely be studied, we take a representative subset of it, which is termed the sample. E.g. to obtain a cell count, we examine only part of the area under the microscope. That area is the sample, and it should be representative of the whole population. The sample results will rarely be exactly equal to the population values.

Population and Sample


This difference is termed sampling error. It can be reduced by following proper sampling procedures and by incorporating a large number of replicates. However, it is practically impossible to get an exact estimate of the population from a sample. Correct sampling is key to getting accurate results.

Measure of central tendency


A measure of central tendency gives a single representative value for a group of data, such as a set of replicates. Some of the common measures of central tendency are the mean, weighted mean, median and mode.

Mean

The average of the data. It gives a representative value for the replicates. The mean is susceptible to outliers. E.g. the mean of 3, 4, 5, 4 is (3+4+5+4)/4 = 4. [Figure: the red lines represent the mean of the lines shown in black.]

Calculating mean using Excel


=AVERAGE(A1:A10) calculates the mean of the values in cells A1 to A10.

Weighted mean

Applicable when there are weights associated with the individual entries. E.g. we have different dilutions of a sample and a cell count for each dilution. To quantify the overall cell count we use the weighted mean, with the dilutions acting as their respective weights.
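
As a rough illustration, the weighted mean can be computed as the sum of each value times its weight, divided by the sum of the weights. The sketch below is a minimal Python example; the function name `weighted_mean`, the cell counts and the weights are all hypothetical and are not taken from the slides.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical cell counts and hypothetical weights (illustration only).
counts = [100, 120, 90]
weights = [1, 2, 1]

result = weighted_mean(counts, weights)
print(result)  # (100*1 + 120*2 + 90*1) / 4 = 107.5
```

With equal weights the function reduces to the ordinary mean.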

Median

The middle number in a set of ordered data. E.g. the median of {1,1,1,2,4,6,6} is 2, since 2 is the middle number when all of the numbers are placed in order. If there is an even number of values, the median is the mean of the two middle numbers. If there are outliers in the dataset, it is advisable to use the median.

Calculating median using excel


=MEDIAN(A1:A10) calculates the median of the values in cells A1 to A10.

Mode

The most frequently occurring value in a group of values. E.g. the mode of 1,1,2,2,3,3,3,4,5,5,6,7 is 3, since 3 occurs most often.

Calculating mode using excel


=MODE(A1:A10) calculates the mode of the values in cells A1 to A10.

When to use mean, median and mode


The mean is the most common measure.
The weighted mean is used when values are associated with different weights.
The median is to be used if there are outliers, e.g. a family-run business where the family members' salaries are very high and the workers' salaries are very low.
The mode is used if we want to find the most popular item, e.g. the most sought-after car in Dublin.
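
For readers working outside Excel, the same three measures are available in Python's standard `statistics` module. The sketch below reuses the small examples from the slides; the salary list is hypothetical and only illustrates how a single outlier pulls the mean but not the median.

```python
import statistics

# Examples taken from the slides.
mean_value = statistics.mean([3, 4, 5, 4])
median_value = statistics.median([1, 1, 1, 2, 4, 6, 6])
mode_value = statistics.mode([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6, 7])
print(mean_value, median_value, mode_value)  # 4 2 3

# Hypothetical salaries: four workers and one highly paid family member.
salaries = [20, 22, 21, 23, 200]
salary_mean = statistics.mean(salaries)      # pulled up by the outlier
salary_median = statistics.median(salaries)  # stays with the typical values
print(salary_mean, salary_median)
```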

Measure of Dispersion

A measure of dispersion quantifies the variability in the data. For replicates, it tells how much variation there is between them; for good replicates we would expect a low value. Some of the common measures of dispersion are the mean deviation from the mean, standard deviation and variance.

Mean deviation from Mean


Calculates the variability in the data. The higher the mean deviation, the greater the variation in the data.

$$\text{Mean deviation from mean} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$$

where n is the number of data points, $\bar{x}$ is the mean of the values and $x_i$ is the value of an individual data point. The modulus means taking the absolute value, e.g. the modulus of 3 is 3 and the modulus of -3 is also 3.

Mean deviation from Mean


Such measures are useful for estimating the noise in the system, i.e. the variability in the data. Mean deviation has an advantage over SD in that it does not give extra weight to extreme values.
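
A minimal Python sketch of the mean deviation from the mean follows. The function name `mean_deviation` and the data values 1 to 5 are chosen here for illustration only.

```python
def mean_deviation(data):
    """Average absolute deviation of the values from their mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)


# Illustrative values: the absolute deviations from the mean (3) are 2, 1, 0, 1, 2.
md = mean_deviation([1, 2, 3, 4, 5])
print(md)  # (2 + 1 + 0 + 1 + 2) / 5, i.e. about 1.2
```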

Standard deviation

SD measures the deviation of the values from their mean. It shows the variation in the data and tells how close the values of the replicates are to each other. [Figure: the vertical lines on the bar graph represent the SD of the values.] Good replicates will have a low SD.

SD calculations

$$S = \sqrt{\frac{\sum (X - M)^2}{n - 1}}$$

S = standard deviation
$\sum$ = sum of
X = individual score
M = mean of all scores
n = sample size (number of scores)

n - 1 is replaced by n if the whole population rather than a sample is studied.

SD calculations

Suppose X is the cell count, with values 1, 2, 3, 4, 5, and we want to calculate the mean and SD. The mean is (1+2+3+4+5)/5 = 3. Calculating SD: the squared deviations from the mean are 4, 1, 0, 1, 4, which total 10. Divide the total squared deviations by n-1, which gives 10/4 = 2.5. Take the square root of 2.5. The standard deviation equals 1.58.
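
The worked example above can be retraced step by step in Python. This is only a sketch of the same arithmetic; it assumes the cell counts are 1, 2, 3, 4, 5, as implied by the mean calculation on the slide, and cross-checks the result against the standard library.

```python
import math
import statistics

cell_counts = [1, 2, 3, 4, 5]

mean_count = sum(cell_counts) / len(cell_counts)             # 3.0
squared_devs = [(x - mean_count) ** 2 for x in cell_counts]  # 4, 1, 0, 1, 4
total_squared = sum(squared_devs)                            # 10.0
sample_variance = total_squared / (len(cell_counts) - 1)     # 10 / 4 = 2.5
sample_sd = math.sqrt(sample_variance)                       # about 1.58

# Dividing by n instead of n - 1 gives the population SD.
population_sd = math.sqrt(total_squared / len(cell_counts))

print(round(sample_sd, 2), round(population_sd, 2))
```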

Calculating SD using Excel


=STDEVA(A1:A10) calculates the sample standard deviation of the values in cells A1 to A10. For purely numeric cells, =STDEV(A1:A10) gives the same result.

Variance

Variance is the square of the standard deviation. It conveys the same information as SD. It is included here only because many publications report variance rather than SD.

SD may not be always appropriate


When comparing two datasets with very different means, the sample with the higher mean will often have a higher SD even though the spread relative to the mean is the same. In such cases it is better to compare using the coefficient of variation, which is the SD divided by the mean of the sample. This makes the two datasets comparable. Coefficient of variation is a better parameter than SD when comparing two such populations.

Coefficient of Variation

Coefficient of variation scales the standard deviation by the size of the mean, making it possible to compare variability across samples measured on different scales.
Coefficient of variation = standard deviation / mean
Sometimes the coefficient of variation is expressed as a percentage, where CV = (SD / mean) * 100, e.g. when comparing cell growth in a flask and in a fermenter.
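
To see why the coefficient of variation helps, the sketch below compares two hypothetical datasets whose values differ tenfold in scale. The numbers and the names `flask` and `fermenter` are invented for illustration; they are not measurements from the slides.

```python
import statistics


def coefficient_of_variation(data):
    """Sample SD divided by the mean (multiply by 100 for a percentage)."""
    return statistics.stdev(data) / statistics.mean(data)


# Hypothetical readings on two different scales.
flask = [10, 12, 14]
fermenter = [100, 120, 140]

sd_flask = statistics.stdev(flask)          # 2.0
sd_fermenter = statistics.stdev(fermenter)  # 20.0, ten times larger
cv_flask = coefficient_of_variation(flask)
cv_fermenter = coefficient_of_variation(fermenter)

print(sd_flask, sd_fermenter)
print(round(cv_flask * 100, 1), round(cv_fermenter * 100, 1))
```

The SDs differ by a factor of ten, while the two coefficients of variation agree, which is the behaviour the slide describes.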

Comparative analysis of Various Measure of Dispersion


SD is the more common measure; because it squares the deviations, it magnifies the variation among the data. Mean deviation is at times better than SD as it gives equal weight to all values, unlike SD, which gives more weight to extreme values because of the squaring. Coefficient of variation is better when the samples cover diverse ranges, as it removes the effect of a high or low mean.

Standard Error

Standard error tells the variability among sample means. It is the standard deviation of the sampling distribution of the mean. The formula for the standard error of the mean is

$$SE_m = \frac{SD}{\sqrt{N}}$$

where SD is the standard deviation of the original distribution and N is the sample size (the number of scores each mean is based upon).

Example of SE for mean


To find the cell density in a fermenter we take 10 samples and measure the cell density of each. The measurements will not all be the same, because there is some error associated with each sampling. To quantify that error, technically termed the SE of the mean, we calculate the standard deviation of the 10 measurements and divide it by the square root of 10. The SE of the mean gives an estimate of the variability of the sampling, i.e. the error associated with the sampling.
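
A short Python sketch of the standard error of the mean follows, using SE = SD / sqrt(N). The ten density values are hypothetical and serve only to show the calculation.

```python
import math
import statistics


def standard_error(data):
    """Sample SD divided by the square root of the number of measurements."""
    return statistics.stdev(data) / math.sqrt(len(data))


# Hypothetical cell densities from 10 samples of one fermenter (arbitrary units).
densities = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 5.1, 4.9]

se = standard_error(densities)
print(round(statistics.mean(densities), 2), round(se, 4))
```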

Normal Distribution

The normal distribution is symmetric, with mean = median = mode. Data from many biological systems approximately follow a normal distribution. Most of the analyses described here assume that your data is normally distributed.

A Skewed Distribution

A skewed distribution is not symmetrical; more data points are located on one side of the distribution. A skewed distribution may be due to experimental error. A log transformation often helps to bring a skewed distribution closer to a normal distribution.

Normal Distribution equation


$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

$\mu$ is the mean
$\sigma$ is the standard deviation

Standardizing data (z-scores)


We can put data on a common scale by subtracting the mean from each data point and then dividing by the standard deviation. The standardized values have mean 0 and SD 1; this rescales the data but does not by itself change the shape of its distribution. Suppose the values are 7.4, 4.5, 8.6, 4.5, 8.9. The mean is 6.78 and the sample SD is 2.16. The new values are (7.4-6.78)/2.16 and so on. Thus the new data are 0.29, -1.06, 0.84, -1.06, 0.98.
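
The standardization can be checked with a few lines of Python. The sketch uses the five values from the slide and the sample standard deviation (n - 1 in the denominator), which is an assumption about which SD the slide intends.

```python
import statistics

values = [7.4, 4.5, 8.6, 4.5, 8.9]

mean_value = statistics.mean(values)
sd_value = statistics.stdev(values)  # sample SD, n - 1 in the denominator

# Subtract the mean, then divide by the SD.
z_scores = [(v - mean_value) / sd_value for v in values]

print(round(mean_value, 2), round(sd_value, 2))
print([round(z, 2) for z in z_scores])
```

After the transformation the values have mean 0 and SD 1, but their ordering and relative spacing are unchanged.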

What if the data is not normally distributed?

There are certain tests which do not assume that the data is normally distributed, e.g. the Mann-Whitney test and the Kruskal-Wallis test. Such tests are known as non-parametric tests. For example, an algorithm that identifies genes as Present or Absent (e.g. Affymetrix MAS5) uses a non-parametric test of this kind (the Wilcoxon signed-rank test).

NULL Hypothesis

The hypothesis of no difference is termed the NULL hypothesis, e.g. H0: there is no drop in insulin production in response to glucose when cells move to a higher passage; there is no change in invasion or drug resistance.

Test of Significance

Sample estimates and population estimates may differ. The difference may have arisen due to errors of sampling, in which case the hypothesis is true. Or it may have arisen because the parent population of the sample is different from the population considered, in which case the hypothesis is false.

Test of Significance

If the difference between the sample estimate and the population estimate under the NULL hypothesis is due to errors of sampling, it will stay within some limiting value. A difference beyond that limit is unlikely to be due to errors of sampling alone.

Types of Error

A Type 1 error is committed when we reject the hypothesis when in reality it is true. The probability of committing a Type 1 error is represented by α. A Type 2 error is committed when we accept the hypothesis when in reality it is not true. The probability of committing a Type 2 error is generally represented by β. The Type 1 error is generally treated as more serious than the Type 2 error.

Examples of two types of error


Null hypothesis: the patient is not affected by cancer.
Type 1 error: the person does not have cancer and is diagnosed as having cancer.
Type 2 error: the person has cancer and is diagnosed as not having cancer.
In this example the missed diagnosis (Type 2 error) is arguably the more serious one, which shows that the relative seriousness of the two errors depends on the application.

Types of error

In practice it is not possible to avoid both types of error. Hence we fix the probability of one error (Type 1) and try to minimize the probability of the other (Type 2).

Critical Region

Differences due to chance alone are unlikely to fall beyond certain limits, and that is how the critical region is decided.

Critical Region

The selection of the critical region is based on the probabilities of the two types of error. The probability of Type 1 error is fixed, and the critical region is chosen to minimize the probability of Type 2 error.

Level Of Significance

The maximum probability of rejecting the hypothesis when it is true, or in other words the maximum probability of Type 1 error, is known as the level of significance. For most biological systems it is customary to use 0.05 (5%) as the level of significance. The lower the p-value cut-off, the more reliable the result, i.e. there is more signal and less noise, but the more likely it is that you will miss real changes.

Degree of freedom

The degree of freedom is n-1, where n is the number of elements. Suppose a + b + c + d = 50. We can freely assign values to any three of a, b, c, d, but the fourth is then fixed by the sum. Thus the degree of freedom here is 3. It is needed when we look up the p-value corresponding to a test statistic such as t.

t-test

A t-test is used to determine whether the scores of two groups differ on a single variable. The NULL hypothesis under investigation is H0: there is no difference between the means of the two groups. In simple terms, the t-test compares sample 1 to sample 2 and detects a difference between the two.

Two tailed vs. one tailed


The NULL hypothesis is the hypothesis of no difference. The alternate hypothesis is that the mean of one group is different from that of the other group. If the alternate hypothesis considers a difference in only one direction it is termed one-tailed; otherwise it is termed two-tailed. Most analyses use a two-tailed hypothesis.

Example of One tailed hypothesis


E.g. bacterial count in supply water. Here the hypothesis is that the bacterial count does not exceed the permissible level of 10/100 ml. When sampling and testing, the alternate hypothesis lies on one side only, i.e. we are not concerned if the bacterial count goes below 10/100 ml.

Example of Two tailed hypothesis


Most commonly analyzed data falls under the two-tailed hypothesis. The alternative hypothesis could be that the expression of a given gene is significantly higher or lower in cancer tissue when compared to normal tissue.

Paired t-test

The paired t-test is an extension of the normal t-test. The NULL hypothesis under investigation is that the mean difference between the individual pairs is 0. It adds a lot to the power of the t-test because differences between individuals are removed from the analysis. E.g. the effect on a cell line before and after drug treatment. The experimental design should support this type of analysis.

Calculating t-test using Excel


=TTEST(A2:A10,B2:B10,2,1)
A2:A10 is the first array.
B2:B10 is the second array.
Third argument: 2 if two-tailed, 1 if one-tailed.
Fourth argument: 1 if paired, 2 if two-sample with equal variance, 3 if two-sample with unequal variance.
The output is the p-value.
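
To make the paired test less of a black box, the sketch below computes only the paired t statistic and its degrees of freedom in plain Python. The function name `paired_t_statistic` and the before/after readings are hypothetical. Converting t to a p-value needs the t distribution, which is what Excel's TTEST does internally; in Python one option, assuming SciPy is installed, is `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired measurements."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    # t = mean difference divided by its standard error.
    t_value = mean_diff / (sd_diff / math.sqrt(n))
    return t_value, n - 1


# Hypothetical readings for the same five cultures before and after treatment.
before = [10, 12, 9, 11, 13]
after = [12, 15, 10, 14, 14]

t_value, dof = paired_t_statistic(before, after)
print(round(t_value, 3), dof)
```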

Equal variance vs. Unequal variance


The more common one is unequal variance. Equal variance is used when the sample sizes of the two groups are nearly the same and the variances are nearly equal. Unequal variance is used when the sample sizes are not the same and/or the variances are not equal.

Multiple testing correction


Multiple testing corrections adjust p-values derived from multiple statistical tests to correct for the occurrence of false positives. E.g. in microarray data analysis, false positives are genes that are found to be statistically different between conditions but in reality are not. They can also arise when insufficient replicates are used in the experiment.

Multiple testing correction


This type of adjustment matters most when the number of tests is large and the two groups of outcomes are very unequal in size, e.g. in a typical microarray experiment the number of genes which are not affected is many times larger than the number of genes which are affected. Such data generates a high number of false positives, because every one of the many unaffected genes has a small chance of passing the test.

Multiple testing correction


Some of the multiple testing correction methods are (in decreasing severity):
Bonferroni
Westfall and Young Permutation
Benjamini and Hochberg False Discovery Rate

Disadvantages of Multiple Testing Correction

Multiple testing correction also removes a large number of true positives, i.e. genes which are actually affected by the treatment are shown as not affected because of the adjusted p-value. For example, the Bonferroni correction divides the significance level by the number of genes (55,000 in plus arrays), so the cut-off becomes 0.05/55000. Hardly any genes will pass this criterion. Standard MTC: Benjamini and Hochberg.
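
The difference in severity between the two most common corrections can be sketched in plain Python. The functions below return adjusted p-values; the function names and the four example p-values are hypothetical, and microarray packages may implement further refinements that this sketch omits.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini and Hochberg false discovery rate adjusted p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values from four tests.
p_values = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print(bonf)
print(bh)

# Figures for the 55,000-gene case mentioned on the slide.
bonferroni_cutoff = 0.05 / 55000          # about 9.1e-07
expected_false_positives = 0.05 * 55000   # about 2750 if no gene is affected
```

On these four values Bonferroni leaves two tests below 0.05 while Benjamini and Hochberg leaves all four, which matches the ordering by severity given above.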

ANalysis Of VAriance (ANOVA)


A test of the statistical significance of the differences among the mean scores of two or more groups on one or more variables. The analysis assesses the total variation present and then apportions it between the various factors responsible for the variation. It is very similar to the t-test except that it can be applied to more than 2 groups.
A vs. B (t-test, as there are only two groups)
A vs. B, A vs. C, B vs. C (ANOVA, as 3 groups are involved)

Analysis of Variance

The NULL hypothesis here is: the means of the individual groups are the same. The alternate hypothesis is: the mean of one or more groups is different. The output is a p-value, and the judgment criterion is the same as for the t-test. For most biological systems a p-value of less than 0.05 is termed significant.

Steps in calculating ANOVA


1) The component of variation due to the different groups is calculated.
2) The component of variation due to replicates within the individual groups is calculated.
3) The component of variation due to error is calculated. In a one-way ANOVA this is the replicate (within-group) variation from step 2.
4) The F value is calculated as the component from step 1 divided by the component from step 3. The F value is a measure of the signal in relation to the noise.
5) The F value is converted to the corresponding p-value.
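
The steps can be sketched for a one-way ANOVA in plain Python. The sketch stops at the F value, since the conversion to a p-value needs the F distribution, which is normally left to a statistics package. The function name `one_way_anova_f` and the three groups of readings are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation due to the different groups (the signal).
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation among replicates within each group (the noise, or error).
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within, k - 1, n_total - k


# Hypothetical readings for three groups with three replicates each.
group_a = [1, 2, 3]
group_b = [2, 3, 4]
group_c = [5, 6, 7]

f_value, df_between, df_within = one_way_anova_f([group_a, group_b, group_c])
print(round(f_value, 2), df_between, df_within)
```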

Calculating ANOVA

For microarray data use Genesis/GeneSpring. For other data you can use SPSS. Alternatively, you can format your data in a way that Genesis accepts and then perform ANOVA there. It is fairly simple, as you only have to define the groups.

Advantages of ANOVA

It is a statistically sound method for finding differences between two or more groups. It is a lot better than fold change, as it takes account of the noise in the data.

Correlation

The relationship between two sets of data such that when one changes, the other is likely to make a corresponding change. If the changes are in the same direction, there is a positive correlation. If they are in opposite directions, there is a negative correlation. The correlation coefficient ranges from -1 (negatively correlated) to 1 (positively correlated). A value of 0 indicates no correlation.

Use of Correlation

This measure is used when we try to figure out the type of relationship between two variables, e.g. cell growth and carbon dioxide concentration. It is also important as a prerequisite to regression analysis: before fitting an equation, it is essential to know whether there is any relationship between the variables. There can be positive correlation, negative correlation or no correlation at all.

Pearson Correlation

It is generally considered the standard measure of correlation. The calculation is done on the original values, unlike rank correlation, where the correlation is calculated on ranks. Since it is calculated on the original values, extreme values can affect the correlation.

Calculating Pearson Correlation coefficient


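$$r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the means of the respective vectors, $r_{xy}$ is the correlation coefficient, $x_i$ and $y_i$ are the elements of the vectors, and $\sum$ is the summation.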

Calculating Pearson correlation using Excel


=PEARSON(B1:F1,B2:F2) calculates the Pearson correlation coefficient between the values in B1:F1 and B2:F2.
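
For comparison with the Excel function, here is a minimal plain-Python sketch of the Pearson coefficient: the sum of the products of deviations divided by the square root of the product of the summed squared deviations. The function name `pearson` and the sample vectors are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical vectors.
x = [1, 2, 3, 4, 5]
r_perfect = pearson(x, [2, 4, 6, 8, 10])    # same direction, exactly linear
r_negative = pearson(x, [10, 8, 6, 4, 2])   # opposite direction
r_partial = pearson(x, [2, 4, 5, 4, 5])     # positive but not perfect

print(round(r_perfect, 4), round(r_negative, 4), round(r_partial, 4))
```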

Rank (Spearman's) Correlation


This method is good if outliers are likely, because it works on ranks rather than on the original values, so extreme values cannot dominate the result. It works on the difference between the ranks of paired values, where each vector is ranked in ascending order. As it does not use the original values, the result may be less accurate for most analyses.

Rank (Spearman's) Correlation


$$R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$

d is the difference between the ranks of paired values after each vector is ranked in ascending order
n is the size of the vector
R is the correlation coefficient
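
A small Python sketch of the rank correlation follows, using the standard formula R = 1 - 6 * sum(d^2) / (n * (n^2 - 1)). It assumes there are no tied values, because ties need averaged ranks, which this sketch does not handle. The function names and the data, which include one deliberate extreme value, are hypothetical.

```python
def ranks(values):
    """Rank values in ascending order, starting at 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for vectors without tied values."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data; 1000 is an extreme value, but its rank is simply 5.
x = [10, 20, 30, 40, 1000]
y = [1, 2, 3, 5, 4]

rho = spearman(x, y)
print(ranks(x), ranks(y), rho)
```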

Regression equation

Regression equations are used when we want to quantify the relationship of one variable to another. It is like fitting an equation to the known values, e.g. cell mortality with respect to time of drug administration. The fitted equation can be used to predict an output at an unknown point.

Regression equation

Useful in studies such as:
Predicting the outcome at a time point for which the experiment was not conducted
Creating a generic model for the experiment in consideration
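
As an illustration of a first-order regression equation, the sketch below fits y = intercept + slope * x by ordinary least squares and then predicts the outcome at a point that was not measured. The function name `fit_line` and the data points are hypothetical.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements at time points 1 to 5.
time_points = [1, 2, 3, 4, 5]
response = [2, 4, 5, 4, 5]

slope, intercept = fit_line(time_points, response)

# Predict the response at time point 6, which was not measured.
predicted = intercept + slope * 6
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

Predictions far outside the measured range should be treated with caution, since the linear relationship may not hold there.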

Multiple regression

Multiple regression is used when more than two variables are involved in the study, e.g. cell growth as a function of time, temperature and CO2.

Non-linear equation

A non-linear regression equation is fitted when the relationship between the two variables is not linear, e.g. glucose level in blood after glucose consumption.

Experimental design (Microarray)


Always have a hypothesis before conducting a microarray experiment.
Have one main problem on which your experimental design is based.
Avoid confounding factors in your experimental design.
For the same cost, try to increase the number of biological replicates rather than technical replicates.

Biological vs. Technical replicates


Biological replicates are arrays that use RNA samples from different individual organisms, pools of organisms or flasks of cells, but that compare the same treatments or control/treatment combination. Technical replicates are arrays that use the same RNA samples and also the same treatments. Thus the only differences in measurements are due to technical differences in array processing. It is highly recommended that more biological replicates are done than technical replicates, especially when dealing with individual organisms rather than a cell line.

Some designs give you more power


Paired experimental design: this design removes the gene expression differences that are due to differences between individuals, e.g. normal and cancerous tissues from the same patient.
Time series experiment: apart from the other findings, it gives a deeper insight into how individual genes regulate other genes, e.g. gene expression changes at time points after drug administration.

Cluster mathematics (Distance)


Microarray data is n-dimensional. Nearly all clustering algorithms use the concept of distance. Euclidean distance, Manhattan (absolute) distance, correlation distance and covariance distance are some of the distance criteria used in these clustering algorithms.

Correlation Distance

Correlation distance is the most commonly used. With this distance, even genes that differ greatly in expression values but follow a similar trend will be grouped together. Pearson distance is the more commonly used form. Spearman (rank correlation) distance is used when outliers are likely, as this method of calculating distance reduces the influence of outlier samples.

Euclidean and Manhattan distance


Euclidean distance: this is also an important distance, but it will only capture genes which have a similar pattern both in terms of trend and actual values. This distance gives more weight to distant values.
Manhattan distance is the absolute distance and does not give more weight to the more distant values.
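
The three distances can be compared on a pair of hypothetical expression profiles that follow the same trend at very different levels. The sketch below assumes the common definition of correlation distance as 1 minus the Pearson coefficient; clustering packages may define it slightly differently. All function names and values are illustrative.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def correlation_distance(a, b):
    """1 minus the Pearson correlation (assumed definition)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return 1 - cov / math.sqrt(var_a * var_b)


# Hypothetical profiles: same trend, tenfold difference in expression level.
gene_a = [1, 2, 3, 4]
gene_b = [10, 20, 30, 40]

d_euclid = euclidean(gene_a, gene_b)
d_manhattan = manhattan(gene_a, gene_b)
d_corr = correlation_distance(gene_a, gene_b)
print(round(d_euclid, 2), d_manhattan, round(d_corr, 6))
```

The Euclidean and Manhattan distances are large because the values differ, while the correlation distance is essentially zero because the trend is identical.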

Hierarchical clustering

Represents the data as a tree. The samples/genes on nearby nodes are more similar than the samples/genes on distant nodes.

K-means clustering

Divides the data into a user-defined number of groups.
Starts with random seed points.
Each run may therefore give a slightly different result.
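
The idea can be sketched with a deliberately simplified one-dimensional k-means in plain Python. Real microarray tools cluster n-dimensional profiles and usually add convergence checks; the function name `kmeans_1d`, the fixed iteration count and the data are assumptions made for this illustration.

```python
import random


def kmeans_1d(values, k, seed, iterations=20):
    """Simplified k-means on a list of numbers; returns sorted cluster centres."""
    rng = random.Random(seed)
    # Random seed points: k of the data values chosen at random.
    centres = rng.sample(values, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assign every value to its nearest centre.
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster (keep it if empty).
        centres = [
            sum(c) / len(c) if c else centres[j] for j, c in enumerate(clusters)
        ]
    return sorted(centres)


# Hypothetical values forming two well-separated groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centres = kmeans_1d(values, k=2, seed=0)
print(centres)
```

With groups this well separated the centres settle near 1 and 5 whichever seed points are drawn; with overlapping groups, different seeds can lead to different final clusters, which is the behaviour noted above.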

SOM (Self Organizing Maps)

A neural network based unsupervised learning algorithm. Clusters are formed on a two-dimensional grid, and nearby clusters are more similar than distant ones. Works very efficiently on large data sets.
