Bio Statistics

Introduction to Bio-statistics
National Institute for Cellular Biotechnology
Jai Prakash Mehta
Reviewed by: Padraig Doolan (6-10-05)
1st Day: Contents (1.5 hrs)

Population and Sample
Measure of central tendency
  Mean
  Weighted mean
  Median
Measure of Dispersion
  Mean deviation
  Standard deviation
  Variance
  Coefficient of variation
  Standard error
Normal Distribution
  Properties of normal distribution

1st Day: Contents (1 hr)

Test of significance
  NULL hypothesis
  Types of error
  Level of significance
  Degree of freedom
  t-test
  Paired t-test
  Multiple testing correction
  Application of t-test for various applications

2nd Day: Contents (1.5 hrs)

Analysis of Variance
  One way ANOVA
  Applications of ANOVA
Correlation and Regression
  Scatter diagram
  Correlation
  Spearman's rank correlation
  Regression equation of first order
  Multiple regression
  Non-linear equation

2nd Day: Contents (0.5 hrs)

Experimental design
  Designing a good microarray experiment
Statistics related to microarray
  Cluster mathematics
  Hierarchical clustering
  K-means clustering
  Self Organizing Maps

Population and sample

A population is the entire set of data that we want to study. Since the entire population usually cannot be studied, we take a representative subset of the data and study that instead; this subset is termed the sample. E.g. to get a cell count, we study only part of the area under the microscope. That area is the sample, and it should be representative of the whole population. The sample results will rarely be exactly equal to the population values.

This difference is termed sampling error. It can be reduced by following proper sampling procedures and by incorporating a large number of replicates. However, it is practically impossible to get the exact value for the population from a sample. Correct sampling is key to getting accurate results.
Measure of central tendency

A measure of central tendency gives a single representative value for a group of data, e.g. for a set of replicates. Some of the common measures of central tendency are the mean, weighted mean, median and mode.
Mean
The mean is the average of the data. It gives a representative value for the replicates. The mean is susceptible to outliers. E.g. the mean of 3, 4, 5, 4 is (3+4+5+4)/4 = 4. [Figure: the red line represents the mean of the lines shown in black.]
Calculating mean using Excel

=AVERAGE(A1:A10) calculates the mean of the cells A1 to A10.
Weighted mean
The weighted mean is applicable when there are weights associated with the individual entries: $\bar{x}_w = \sum w_i x_i / \sum w_i$, where $x_i$ are the values and $w_i$ their weights. E.g. we have different dilutions of a sample and the individual cell counts for each dilution. To quantify the overall cell count we use this measure, with the dilutions acting as their respective weights.
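A minimal Python sketch of the weighted mean, sum(w * x) / sum(w). The function name `weighted_mean` and the numbers are illustrative assumptions, not data from the course; how the weights are derived from the dilutions depends on the experiment.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical estimates and weights, for illustration only.
estimates = [10.0, 20.0, 30.0]
weights = [1, 2, 1]

result = weighted_mean(estimates, weights)  # (10*1 + 20*2 + 30*1) / 4
print(result)
```

With equal weights the weighted mean reduces to the ordinary mean.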
Median
The median is the middle number in a set of ordered data. E.g. the median of {1,1,1,2,4,6,6} is 2, since 2 is the middle number when all of the numbers are placed in order. If there is an even number of values, the median is the mean of the two middle numbers, e.g. the median of {1,2,3,4} is (2+3)/2 = 2.5. If there are outliers in the dataset, it is advisable to use the median.
Calculating median using Excel

=MEDIAN(A1:A10) calculates the median of the cells A1 to A10.
Mode
The mode is the most frequently occurring value in a group of values. E.g. the mode of 1,1,2,2,3,3,3,4,5,5,6,7 is 3, since 3 occurs most often.
Calculating mode using Excel

=MODE(A1:A10) calculates the mode of the cells A1 to A10.
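Outside Excel, the same three measures can be checked with Python's standard `statistics` module. This is only a sketch; it reuses the example values from the slides above, and the variable names are arbitrary.

```python
import statistics

# Example values taken from the slides above.
mean_data = [3, 4, 5, 4]
median_data = [1, 1, 1, 2, 4, 6, 6]
mode_data = [1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6, 7]

mean_value = statistics.mean(mean_data)        # (3 + 4 + 5 + 4) / 4
median_value = statistics.median(median_data)  # middle of the ordered values
mode_value = statistics.mode(mode_data)        # most frequent value

# With an even number of values the median is the mean of the two middle ones.
even_median = statistics.median([1, 2, 3, 4])

print(mean_value, median_value, mode_value, even_median)
```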
When to use mean, median and mode

The mean is the most common measure. The weighted mean is used when values are associated with different weights. The median should be used if there are outliers, e.g. a family-run business where the family members' salaries are very high and the workers' salaries are very low. The mode is used to find the most popular item, e.g. the most sought-after car in Dublin.
Measure of Dispersion
A measure of dispersion quantifies the variability in the data. For replicates, it tells how much the replicates vary, and for good replicates we expect a low value. Some of the common measures of dispersion are the mean deviation from the mean, the standard deviation and the variance.
Mean deviation from Mean

The mean deviation quantifies the variability in the data. The higher the mean deviation, the greater the variation in the data.

$$\text{Mean deviation from mean} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$

where n is the number of data points, $\bar{x}$ is the mean of the values and $x_i$ is the value of an individual data point. The modulus means taking the absolute (non-negative) value, e.g. the modulus of 3 is 3 and the modulus of -3 is also 3.

Such measures are useful for estimating the noise in a system, i.e. the variability in the data. The mean deviation has an advantage over the SD in that it does not give more weight to extreme values.
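The mean deviation, sum(|x_i - mean|) / n, can be sketched in a few lines of Python. The function name `mean_deviation` and the input values 1 to 5 are assumptions chosen for illustration.

```python
def mean_deviation(values):
    """Mean of the absolute deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


# Illustrative values: mean is 3, absolute deviations are 2, 1, 0, 1, 2.
md = mean_deviation([1, 2, 3, 4, 5])
print(md)
```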
Standard deviation
The SD measures the deviation of the values from their mean. It shows the variation in the data and tells how close the values of the replicates are to each other. On a bar graph, the vertical error bars represent the SD of the values. Good replicates will have a low SD.
SD calculations

$$S = \sqrt{\frac{\sum (X - M)^2}{n - 1}}$$

where S is the standard deviation, $\sum$ means "sum of", X is an individual score, M is the mean of all scores and n is the sample size (number of scores). n - 1 is replaced by n if the whole population rather than a sample is studied.

Suppose X is the cell count, with values 1, 2, 3, 4, 5, and we want to calculate the mean and SD. The mean is (1+2+3+4+5)/5 = 3. Calculating the SD: subtract the mean from each value (-2, -1, 0, 1, 2), square the deviations (4, 1, 0, 1, 4) and add them up to get 10. Divide the total squared deviation by n - 1, which gives 10/4 = 2.5. Take the square root of 2.5. The standard deviation equals 1.58.
Calculating SD using Excel

=STDEVA(A1:A10) calculates the sample standard deviation of the cells A1 to A10. (For purely numeric data, =STDEV(A1:A10) gives the same result.)
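The worked SD example (cell counts 1 to 5) can be reproduced step by step in Python. This is a sketch using the sample formula with n - 1 in the denominator; the variable names are arbitrary, and the standard `statistics` module is used only as a cross-check.

```python
import math
import statistics

counts = [1, 2, 3, 4, 5]

mean = sum(counts) / len(counts)                      # 3.0
squared_deviations = [(x - mean) ** 2 for x in counts]  # 4, 1, 0, 1, 4

# Sample variance and SD divide by n - 1.
sample_variance = sum(squared_deviations) / (len(counts) - 1)  # 10 / 4
sample_sd = math.sqrt(sample_variance)

# Population SD divides by n instead of n - 1.
population_sd = math.sqrt(sum(squared_deviations) / len(counts))

print(round(sample_sd, 2), round(population_sd, 2))
```

The variance discussed on the next slide is simply `sample_variance`, the square of `sample_sd`.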
Variance
Variance is the square of the standard deviation. It represents the same information as the SD. It is included here only because many publications report the variance rather than the SD.
SD may not be always appropriate

When comparing two datasets with very different means, the dataset with the higher mean will usually have a higher SD even when the spread relative to the mean is the same. In such cases it is better to compare using the coefficient of variation, which is the SD divided by the mean of the sample. This makes the two datasets comparable.
Coefficient of Variation
The coefficient of variation scales the standard deviation by the size of the mean, making it possible to compare variability across samples measured on different scales. Coefficient of variation = standard deviation / mean. Sometimes the coefficient of variation is expressed as a percentage, CV = (SD / mean) * 100. E.g. comparing cell growth in a flask and in a fermenter.
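A short Python sketch of CV = (SD / mean) * 100. The flask and fermenter numbers below are hypothetical, chosen so that the two datasets have very different SDs but the same relative spread.

```python
import statistics


def coefficient_of_variation(values):
    """Sample SD divided by the mean, expressed as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical cell densities on two very different scales.
flask = [10, 12, 14]
fermenter = [100, 120, 140]

sd_flask = statistics.stdev(flask)          # 2.0
sd_fermenter = statistics.stdev(fermenter)  # 20.0

cv_flask = coefficient_of_variation(flask)
cv_fermenter = coefficient_of_variation(fermenter)

print(sd_flask, sd_fermenter, round(cv_flask, 2), round(cv_fermenter, 2))
```

The SDs differ tenfold, while the two CVs are equal, which is the situation the slide describes.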
Comparative analysis of Various Measure of Dispersion

The SD is the more common measure; squaring magnifies the variation among the data. The mean deviation is at times better than the SD as it gives equal weight to all the values, unlike the SD, which gives more weight to extreme values because of the squaring. The coefficient of variation is better when the samples cover very different ranges, as it removes the effect of a high or low mean.
Standard Error
The standard error describes the variability between samples. It is the standard deviation of the sampling distribution of the mean. The formula for the standard error of the mean is

$$SE_m = \frac{SD}{\sqrt{N}}$$

where SD is the standard deviation of the original distribution and N is the sample size (the number of scores each mean is based upon).
Example of SE for mean

To find the cell density in a fermenter we take 10 samples and measure the cell density of each. The values will not all be the same, as there is some error associated with each sampling. To quantify that error, technically termed the SE of the mean, we calculate the standard deviation of the 10 measurements and divide it by the square root of 10. The SE of the mean gives an estimate of the variability of the sampling, i.e. the error associated with the sampling.
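The fermenter example can be sketched in Python as SE = SD / sqrt(N). The ten density values are invented for illustration, and `standard_error` is a hypothetical helper name.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(N)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical cell densities from 10 samples of one fermenter.
densities = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 5.1, 4.9]

se = standard_error(densities)
print(round(statistics.mean(densities), 2), round(se, 3))
```

Because N appears under the square root, quadrupling the number of samples only halves the standard error.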
Normal Distribution
The normal distribution is symmetric, with mean = median = mode. Data from many biological systems approximately follow a normal distribution. Most of the analyses described here assume that your data is normally distributed.
A Skewed Distribution
A skewed distribution is not symmetrical: more data points are located on one side of the distribution. A skewed distribution may be due to experimental error. A log transformation often helps to bring a skewed distribution closer to a normal distribution.
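Normal Distribution equation

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.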
Standardizing data (z-scores)

We can standardize data to a mean of 0 and a standard deviation of 1 by subtracting the mean from each data point and then dividing by the standard deviation. This puts different datasets on a common scale, but it does not change the shape of the distribution: skewed data remain skewed. Suppose the values are 7.4, 4.5, 8.6, 4.5, 8.9. The mean is 6.78 and the sample SD is 2.16. The new values are (7.4-6.78)/2.16 and so on. Thus the new data are 0.29, -1.06, 0.84, -1.06, 0.98.
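A Python sketch of the subtract-the-mean, divide-by-SD transformation for the five values given above. It assumes the sample SD (n - 1 denominator), which the code computes itself (about 2.16 for these values); the variable names are arbitrary.

```python
import statistics

values = [7.4, 4.5, 8.6, 4.5, 8.9]

mean = statistics.mean(values)  # 6.78
sd = statistics.stdev(values)   # sample SD, n - 1 in the denominator

# Subtract the mean from each value, then divide by the SD.
z_scores = [(x - mean) / sd for x in values]

print(round(mean, 2), round(sd, 2))
print([round(z, 2) for z in z_scores])
```

After the transformation the values have mean 0 and SD 1; their ordering and the shape of their distribution are unchanged.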
What if the data is not normally distributed?
There are tests which do not assume that the data is normally distributed, e.g. the Mann-Whitney test, the Kruskal-Wallis test etc. Such tests are known as non-parametric tests. For example, the algorithm that calls genes as Present or Absent in MAS5 (Affymetrix) uses a non-parametric test (the Wilcoxon signed-rank test).
NULL Hypothesis
The hypothesis of no difference is termed the NULL hypothesis, e.g. H0: there is no drop in insulin production in response to glucose when cells move to a higher passage; no change in invasion; no change in drug resistance.
Test of Significance
Sample estimates and population estimates may differ. The difference may have arisen due to errors of sampling, in which case the hypothesis is true. Or it may have arisen because the parent population of the sample is different from the population considered, in which case the hypothesis is false.

If the difference between the sample estimate and the population estimate under a NULL hypothesis is due to errors of sampling, it will stay within some limiting value. Beyond that limit, the difference can no longer reasonably be attributed to sampling error.
Types of Error
A Type 1 error is committed when we reject the hypothesis when in reality it is true. The probability of committing a Type 1 error is represented by α. A Type 2 error is committed when we accept the hypothesis when in reality it is not true. The probability of committing a Type 2 error is generally represented by β. By convention the Type 1 error is treated as the more serious of the two.
Examples of two types of error

Null hypothesis: the patient is not affected by cancer. Type 1 error: the person does not have cancer and is diagnosed as having cancer. Type 2 error: the person has cancer and is diagnosed as not having cancer. Which error is more serious depends on the context; here a missed cancer (Type 2) is clearly costly, so the test should also keep the Type 2 error low.
Types of error
In practice it is not possible to avoid both types of error. Hence we fix the probability of one error (Type 1) and try to minimize the probability of the other (Type 2).
Critical Region
Errors due to chance are not expected beyond certain limits, and that is how the critical region is decided. The selection of the critical region is based on the probabilities of the two types of error: the probability of a Type 1 error is fixed, and the critical region is chosen to minimize the Type 2 error.
Level Of Significance
The maximum probability of rejecting the hypothesis when it is true, or in other words the maximum probability of a Type 1 error, is known as the level of significance. For most biological systems it is customary to use 0.05 (5%) as the level of significance. The lower the p-value cut-off, the more reliable the results that pass it, i.e. there is more signal and less noise, but the more likely it is that you will miss real changes.
Degree of freedom
The degree of freedom is n-1, where n is the number of elements. Suppose a + b + c + d = 50. We can freely assign values to a, b and c, but then d is fixed, so we cannot choose all 4. Thus the degree of freedom here is 3. It matters when we look up the p-value corresponding to a t or z statistic.
t-test
A t-test is used to determine whether the scores of two groups differ on a single variable. The NULL hypothesis under investigation is H0: there is no difference between the means of the two groups. In simple terms, a t-test compares sample 1 to sample 2 and detects changes between the two.
Two tailed vs. one tailed

The NULL hypothesis is the hypothesis of no difference. The alternate hypothesis is that the mean of one group is different from that of the other group. If the alternate hypothesis considers a difference in only one direction, it is termed one-tailed; otherwise it is termed two-tailed. Most analyses use a two-tailed hypothesis.
Example of One tailed hypothesis

E.g. bacterial count in supply water. Here the hypothesis is that the level of bacteria does not exceed the permissible level of 10/100ml. When sampling and testing, the alternate hypothesis lies on one side only, i.e. we are not concerned if the bacterial count goes below 10/100ml.
Example of Two tailed hypothesis

Most commonly analyzed data fall under a two-tailed hypothesis. The alternative hypothesis could be that the expression of a given gene is significantly higher or lower in cancer tissue when compared to normal tissue.
Paired t-test
The paired t-test is an extension of the normal t-test. The hypothesis under investigation is that the mean difference between the individual pairs is 0. It adds a lot to the power of the t-test because the differences between individuals are removed from the analysis. E.g. the effect on a cell line before and after drug treatment. The experimental design should support this type of analysis.
Calculating t-test using Excel

=TTEST(A2:A10,B2:B10,2,1)
A2:A10 is the first array.
B2:B10 is the second array.
Third argument: 2 if two-tailed, 1 if one-tailed.
Fourth argument: 1 if paired, 2 if two-sample with equal variance, 3 if two-sample with unequal variance.
The output is the p-value.
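To show what sits behind the paired and unequal-variance options, here is a hedged Python sketch that computes only the t statistic and the degrees of freedom. Converting t to a p-value needs the t distribution (a table or a statistics package), which is not shown. The function names `paired_t` and `welch_t` and all data are illustrative assumptions.

```python
import math
import statistics


def paired_t(before, after):
    """Paired t-test: t statistic and degrees of freedom (n - 1)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


def welch_t(x, y):
    """Two-sample t-test with unequal variance (Welch's approximation)."""
    vx = statistics.variance(x) / len(x)
    vy = statistics.variance(y) / len(y)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Hypothetical measurements before and after a treatment.
t_paired, df_paired = paired_t([10, 12, 14, 16], [11, 14, 15, 18])

# Hypothetical measurements from two independent groups.
t_welch, df_welch = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])

print(round(t_paired, 3), df_paired, round(t_welch, 3), round(df_welch, 1))
```

In the paired case only the within-pair differences enter the calculation, which is why differences between individuals do not inflate the noise.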
Equal variance vs. Unequal variance

The more common choice is unequal variance. Equal variance is used when the sample sizes in the two groups are nearly the same and the variances are nearly equal. Unequal variance is used when the sample sizes are not the same and/or the variances are not equal.
Multiple testing correction

Multiple testing corrections adjust p-values derived from multiple statistical tests to correct for the occurrence of false positives. E.g. in microarray data analysis, false positives are genes that are found to be statistically different between conditions but in reality are not. They can also arise when insufficient replicates are used in the experiment.

Such an adjustment matters most when a very large number of tests is performed and only a few of the tested elements are truly changed. E.g. in a typical microarray experiment the number of genes which are not affected is many times larger than the number of genes which are affected. This type of data generates a high number of false positives because of the large number of genes not affected by the treatment.

Some of the multiple testing correction methods are (in decreasing severity):
Bonferroni
Westfall and Young Permutation
Benjamini and Hochberg False Discovery Rate
Disadvantages of Multiple Testing Correction
Multiple testing correction also removes a large number of true positives, i.e. genes which are actually affected by the treatment are shown as not affected because of the adjusted p-value. For example, the Bonferroni correction divides the significance cut-off by the number of genes (55,000 in plus arrays), so the cut-off becomes 0.05/55000, about 9.1e-7. Thus hardly any genes will pass this criterion. Standard MTC: Benjamini and Hochberg.
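A sketch of how Bonferroni and Benjamini and Hochberg adjusted p-values can be computed, in plain Python. Multiplying each p-value by the number of tests is equivalent to dividing the 0.05 cut-off by that number. The function names and the four p-values are made up for illustration; microarray packages may differ in detail.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini and Hochberg FDR-adjusted p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]

bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)

print([round(p, 3) for p in bonf])
print([round(p, 3) for p in bh])
```

At a 0.05 cut-off, Bonferroni keeps two of these four genes while Benjamini and Hochberg keeps all four, which illustrates the difference in severity.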
ANalysis Of VAriance (ANOVA)

ANOVA is a test of the statistical significance of the differences among the mean scores of two or more groups on one or more variables. The statistical analysis aims at assessing the total variation present and then apportioning it between the various factors responsible for the variation. It is very similar to the t-test except that it can be applied to more than 2 groups.
A vs. B: t-test, as there are only two groups.
A vs. B, A vs. C, B vs. C: ANOVA, as 3 groups are involved.
Analysis of Variance
The NULL hypothesis here is: the means of the individual groups are the same. The alternate hypothesis is: the mean of one or more groups is different. The output is a p-value, and the judgment criterion is the same as for the t-test. For most biological systems a p-value of less than 0.05 is termed significant.
Steps in calculating ANOVA

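1) The component of variation due to the different groups is calculated.
2) The component of variation due to the replicates within the individual groups is calculated.
3) The component of variation due to error is calculated. In one-way ANOVA this is the replicate (within-group) variation from step 2.
4) The F value is calculated as the mean square from step 1 divided by the mean square from step 3, where each mean square is the component of variation divided by its degrees of freedom. The F value is the measure of signal in relation to noise.
5) The F value is converted to the corresponding p-value.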
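The steps can be sketched for a one-way ANOVA in plain Python. This sketch assumes that the error term is the within-group (replicate) variation, and it stops at the F value: converting F to a p-value needs the F distribution, which is not shown. The function name `one_way_anova_f` and the three groups are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation due to the different groups (signal).
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Variation among replicates within each group (noise / error).
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical measurements for three groups with three replicates each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value, df_between, df_within = one_way_anova_f(groups)
print(round(f_value, 2), df_between, df_within)
```

A large F means the differences between group means are large relative to the scatter of the replicates.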
Calculating ANOVA
For microarray data use Genesis/Genespring. For other data you can use SPSS. Alternatively, you can format your data in a way that Genesis accepts and then perform ANOVA there. It is fairly simple as you only have to define the groups.
Advantages of ANOVA
ANOVA is a statistically sound method for finding differences between two or more groups. It is a lot better than fold change as it takes the noise in the data into account.
Correlation
Correlation is the relationship between two sets of data such that when one changes, the other is likely to make a corresponding change. If the changes are in the same direction, there is a positive correlation. If they are in opposite directions, there is a negative correlation. The correlation coefficient ranges from -1 (negatively correlated) to 1 (positively correlated). A value of 0 indicates no correlation.
Use of Correlation
This measure is used when we try to figure out what type of relationship exists between two variables, e.g. cell growth and carbon dioxide concentration. It is also important as a prerequisite to regression analysis: before fitting an equation, it is essential to know whether there is any relationship between the variables. There can be positive correlation, negative correlation or no correlation at all.
Pearson Correlation
It is generally considered to be the best measure of linear correlation. The calculation is done on the original values, unlike rank correlation, where the correlation is calculated on ranks. Since it is calculated on the original values, extreme values can affect the correlation.
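Calculating Pearson Correlation coefficient

$$r_{xy} = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^2 \, \sum_{i} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the means of the respective vectors, $r_{xy}$ is the correlation coefficient, $x_i$ and $y_i$ are the elements of the vectors and $\sum$ is the summation.
Calculating Pearson correlation using Excel

=PEARSON(B1:F1,B2:F2) calculates the Pearson correlation coefficient between the cells B1 to F1 and B2 to F2.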
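The Pearson coefficient, the sum of products of deviations divided by the square root of the product of the summed squared deviations, can be sketched in plain Python as follows. The function name `pearson` and the example vectors are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator


# Illustrative vectors: a perfect positive and a perfect negative relation.
r_positive = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_negative = pearson([1, 2, 3], [3, 2, 1])
print(r_positive, r_negative)
```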
Rank (Spearman's) Correlation

This method is good if outliers are likely, as it does not use the extreme values themselves, which would otherwise contribute most to the differences. It works on the differences between the ranks of paired values, after the values in each vector have been ranked in ascending order. As it does not use the original values, the result may be less precise than the Pearson correlation for most analyses.

$$R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$

where d is the difference between the ranks of each pair of values after each vector is ranked in ascending order, n is the size of the vectors and R is the correlation coefficient.
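A plain-Python sketch of the rank correlation using R = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between the ranks of paired values. It assumes there are no tied values, since this shortcut formula is only exact without ties. The helper names `ranks` and `spearman` and the data are hypothetical.

```python
def ranks(values):
    """Rank values in ascending order, starting at 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the sum of squared rank differences."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical paired measurements.
r_rank = spearman([10, 20, 30, 40, 50], [1, 3, 2, 5, 4])

# An extreme value changes its own rank very little, so R is unaffected here.
r_outlier = spearman([1, 2, 3, 4, 1000], [1, 2, 3, 4, 5])

print(r_rank, r_outlier)
```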
Regression equation
Regression equations are used when we want to quantify the relationship of one variable to another. It is like fitting an equation to the known values, e.g. cell mortality with respect to time of drug administration. The fitted equation can be used to predict the output at an unmeasured point. Uses include predicting the outcome at a time point for which the experiment was not conducted, and creating a generic model for the experiment under consideration.
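A first-order regression equation y = intercept + slope * x can be fitted by ordinary least squares. The sketch below is plain Python; the function name `fit_line` and the time/response values are invented for illustration.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical time points and measured responses.
time_points = [0, 1, 2, 3]
response = [1, 3, 5, 7]

slope, intercept = fit_line(time_points, response)

# Predict the response at a time point that was not measured.
predicted = intercept + slope * 1.5
print(slope, intercept, predicted)
```

Predictions are most trustworthy inside the range of the measured points; extrapolating far beyond it assumes the same relationship continues to hold.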
Multiple regression
Multiple regression is used when more than two variables are involved, e.g. cell growth as a function of time, temperature and CO2.
Non-linear equation
A non-linear regression equation is fitted when the relationship between the two variables is not linear, e.g. glucose level in blood after glucose consumption.
Experimental design (Microarray)

Always have a hypothesis before conducting a microarray experiment. Have one main problem on which your experimental design is based. Avoid confounding factors in your experimental design. For the same cost, try to increase the number of biological replicates rather than technical replicates.
Biological vs. Technical replicates

Biological replicates are arrays that use RNA samples from different individual organisms, pools of organisms or flasks of cells, but yet compare the same treatments or control/treatment combination. Technical replicates are arrays that use the same RNA samples and also the same treatments. Thus the only differences in measurements are due to technical differences in array processing. It is highly recommended that more biological replicates are done than technical replicates, especially when dealing with individual organisms rather than a cell line.
Some designs give you more power

Paired experimental design: this design removes the gene expression differences due to individual differences, e.g. normal and cancerous tissues from the same patient. Time series experiment: apart from other findings, it gives a deeper insight into how individual genes regulate other genes, e.g. gene expression changes at several time points after drug administration.
Statistics related to microarray

Cluster mathematics (Distance)

Microarray data is n-dimensional data. Nearly all clustering algorithms use the concept of distance. Euclidean distance, Manhattan (absolute) distance, correlation distance and covariance distance are some of the distance measures used in these clustering algorithms.
Correlation Distance
Correlation distance is the most commonly used. With this distance, even genes whose expression values differ greatly but which follow a similar trend will be grouped together. Pearson distance is the more commonly used variant. Spearman (rank correlation) distance is used when outliers are likely, as this method of calculating distance reduces the influence of outlier samples.
Euclidean and Manhattan distance

Euclidean distance: this is also an important distance, but it will only group genes which have a similar pattern both in terms of trend and actual values. It gives more weight to distant values. Manhattan distance is the sum of absolute differences and does not give extra weight to the more distant values.
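The three distances can be compared on a small example in plain Python. Correlation distance is taken here as 1 minus the Pearson correlation, which is one common convention and an assumption of this sketch; the gene profiles are hypothetical.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def correlation_distance(a, b):
    """1 - Pearson correlation (assumed convention): 0 means same trend."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return 1 - num / den


# Hypothetical expression profiles: same trend, tenfold different level.
gene_a = [1, 2, 3, 4]
gene_b = [10, 20, 30, 40]

print(round(euclidean(gene_a, gene_b), 2))
print(manhattan(gene_a, gene_b))
print(correlation_distance(gene_a, gene_b))
```

The two profiles are far apart by Euclidean and Manhattan distance but have a correlation distance of zero, which is why correlation distance groups genes by trend rather than by absolute level.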
Hierarchical clustering
Represents the data as a tree. Samples/genes on nearby nodes are more similar than samples/genes on distant nodes.
K-means clustering
Divides the data into a user-defined number of groups. Starts from random seed points. Each run may therefore give a slightly different result.
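A minimal k-means sketch in plain Python, assuming squared Euclidean distance and initial seed points drawn at random from the data. The function name `k_means`, its parameters and the six points are hypothetical; clustering packages use more refined variants.

```python
import random


def k_means(points, k, seed=0, iterations=100):
    """Assign each point to one of k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random seed points taken from the data
    labels = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum(
                    (p - q) ** 2 for p, q in zip(point, centroids[c])
                ),
            )
            for point in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


# Hypothetical two-dimensional data.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]

labels, centroids = k_means(points, k=2, seed=0)
print(labels)
print(centroids)
```

Fixing `seed` makes a run repeatable; changing it changes the starting points, which is the reason different runs can give slightly different clusters.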
SOM
An unsupervised neural-network learning algorithm. Clusters are formed on a two-dimensional grid, and nearby clusters are more similar than distant ones. Works very efficiently on large data sets.
