Measure of Dispersion
Mean deviation Standard deviation Variance Coefficient of variation Standard error
Normal Distribution
Properties of normal distribution
Test of significance
NULL hypothesis Types of error Level of significance Degree of freedom t-test Paired t-test Multiple testing correction Application of t-test for various application
Analysis of Variance
One way ANOVA Applications of ANOVA for various applications
Experimental design
Designing a good microarray experiment
Contents
National Institute for Cellular Biotechnology
A population is the entire set of data that we want to study. Since the entire population usually cannot be studied, we take a representative subset and study that instead; this subset is termed a sample. E.g. to get a cell count, we study only part of the area under the microscope. That area is the sample, and it should be representative of the whole population. The sample results will rarely be exactly equal to the population values.
This difference is termed sampling error. It can be reduced by following proper sampling procedures and by including a large number of replicates. However, it is practically impossible to get an exact estimate of the population from a sample. Correct sampling is key to getting accurate results.
A measure of central tendency gives a single representative value for a group of data, e.g. for a set of replicates. Some of the common measures of central tendency are the mean, weighted mean, median and mode.
Mean
The mean is the average of the data. It gives a representative value for the replicates. The mean is susceptible to outliers. E.g. the mean of 3, 4, 5, 4 is (3+4+5+4)/4 = 4. [Figure: the red lines represent the mean of the lines drawn in black.]
Weighted mean
Applicable when there are weights associated with the individual entries, e.g. we have different dilutions of a sample and the individual cell counts of each dilution. To quantify the overall cell count we use this measure; the dilutions act as their respective weights.
Median
The middle number in a set of ordered data. E.g. the median of {1,1,1,2,4,6,6} is 2, since 2 is the middle number when all of the numbers are placed in order. If there is an even number of values, the median is the mean of the two middle numbers, e.g. the median of {1,2,4,6} is (2+4)/2 = 3. If there are outliers in the dataset it is advisable to use the median.
Mode
The most frequently occurring value in a group of values. E.g. the mode of 1,1,2,2,3,3,3,4,5,5,6,7 is 3, since 3 occurs most often.
The mean is the most common measure. The weighted mean is used when values are associated with different weights. The median should be used if there are outliers, e.g. a family-run business where the family members' salaries are very high and the workers' salaries are very low. The mode is used if we want to find the most popular item, e.g. the most sought-after car in Dublin.
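The four measures can be sketched in Python with the standard `statistics` module. The first three lists reuse the numbers from the slide examples; the weighted-mean readings and weights are hypothetical values chosen only for illustration.

```python
from statistics import mean, median, mode

counts = [3, 4, 5, 4]                          # values from the mean example
ordered = [1, 1, 1, 2, 4, 6, 6]                # values from the median example
votes = [1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6, 7]   # values from the mode example

def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

mean_value = mean(counts)
median_value = median(ordered)
mode_value = mode(votes)

# Hypothetical readings with weights 1, 2 and 1 (assumed for illustration).
weighted_value = weighted_mean([10, 20, 30], [1, 2, 1])

print(mean_value, median_value, mode_value, weighted_value)
```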
Measure of Dispersion
A measure of dispersion quantifies the variability in the data. For replicates, it tells how much they vary from one another; for good replicates we would expect a low value. Some of the common measures of dispersion are the mean deviation from the mean, the standard deviation and the variance.
The mean deviation quantifies the variability in the data: the higher the mean deviation, the greater the variation in the data. Mean deviation from mean = $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$, where n is the number of data points, $\bar{x}$ is their mean and $x_i$ is the value of an individual data point. The modulus | | means taking the positive value only, e.g. the modulus of 3 is 3 and the modulus of -3 is also 3.
Such measures are useful in estimating the noise in the system, i.e. the variability in the data. The mean deviation has an advantage over the SD in that it does not give more weight to extreme values.
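As a minimal sketch, the mean deviation from the mean can be computed as below. The function name and the replicate values are assumptions made for illustration.

```python
def mean_deviation(values):
    """Average absolute distance of each value from the mean."""
    n = len(values)
    centre = sum(values) / n
    return sum(abs(x - centre) for x in values) / n

replicates = [1, 2, 3, 4, 5]   # illustrative replicate values
md = mean_deviation(replicates)
print(md)
```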
Standard deviation
The SD measures the deviation of the values from their mean. It shows the variation in the data and tells how close the values of the replicates are to each other. [Figure: the vertical lines on the bar graph represent the SD of the values.] Good replicates will have a lower SD.
SD calculations
$S = \sqrt{\frac{\sum (X - M)^2}{n - 1}}$, where S = standard deviation, $\sum$ = sum of, X = individual score, M = mean of all scores, n = sample size (number of scores). n - 1 is replaced by n if the whole population, rather than a sample, is studied.
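Suppose X is the cell count, with values 1, 2, 3, 4, 5, and we want to calculate the mean and SD. The mean is (1+2+3+4+5)/5 = 3. Calculating the SD: the deviations from the mean are -2, -1, 0, 1, 2, so the squared deviations are 4, 1, 0, 1, 4 and their total is 10. Divide the total squared deviations by n-1, which gives 10/4 = 2.5. Take the square root of 2.5. The standard deviation equals 1.58.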
In Excel, =STDEVA(A1:A10) calculates the sample standard deviation (n-1 denominator) of the values in cells A1 to A10.
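The same worked example can be sketched in Python, first step by step and then with the standard library. `stdev` uses the n-1 (sample) denominator and `pstdev` uses the n (population) denominator; the variable names are illustrative.

```python
import math
from statistics import stdev, pstdev

cell_counts = [1, 2, 3, 4, 5]

# Step by step, following the worked example.
m = sum(cell_counts) / len(cell_counts)                 # mean = 3
squared_deviations = [(x - m) ** 2 for x in cell_counts]
sample_variance = sum(squared_deviations) / (len(cell_counts) - 1)
sample_sd = math.sqrt(sample_variance)

# Library equivalents.
library_sd = stdev(cell_counts)        # sample SD (n - 1)
population_sd = pstdev(cell_counts)    # population SD (n)

print(round(sample_sd, 2), round(library_sd, 2), round(population_sd, 2))
```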
Variance
Variance is the square of the standard deviation. It conveys the same information as the SD, but in squared units. It is included here only because many publications report variance rather than SD.
When comparing two datasets with very different means, the dataset with the higher mean will usually have the higher SD even when the relative spread is the same. In such cases it is better to compare them using the coefficient of variation, which is the SD divided by the mean of the sample. This makes the two datasets comparable, so the coefficient of variation is a better parameter than the SD when comparing two such populations.
Coefficient of Variation
The coefficient of variation scales the standard deviation by the size of the mean, making it possible to compare variability across samples measured on different scales. Coefficient of variation = standard deviation / mean. Sometimes the coefficient of variation is expressed as a percentage, CV = (SD / mean) * 100. E.g. comparing cell growth in a flask and in a fermenter.
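A small sketch of why the CV is useful. The flask and fermenter readings below are hypothetical; the fermenter values are simply the flask values multiplied by ten, so the SD differs tenfold while the CV is identical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample SD divided by the mean, expressed as a percentage."""
    return stdev(values) / mean(values) * 100

flask = [10, 12, 11, 13]             # hypothetical cell densities
fermenter = [100, 120, 110, 130]     # same relative spread, ten times the scale

cv_flask = coefficient_of_variation(flask)
cv_fermenter = coefficient_of_variation(fermenter)

print(stdev(flask), stdev(fermenter))   # the SDs differ by a factor of ten
print(cv_flask, cv_fermenter)           # the CVs agree
```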
Standard Error
The standard error describes the variability among sample means. It is the standard deviation of the sampling distribution of the mean. The formula for the standard error of the mean is SE_m = SD / sqrt(N), where sqrt is the square root, SD is the standard deviation of the original distribution and N is the sample size (the number of scores each mean is based upon).
To find the cell density in a fermenter we take 10 samples and measure the cell density of each. The values will not all be the same, because there is some error associated with each sampling. To quantify that error, technically termed the SE of the mean, we calculate the standard deviation of the 10 measurements and divide it by the square root of 10. The SE of the mean gives an estimate of the variability of the sampling, i.e. the error associated with the sampling.
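A minimal sketch of the SE calculation for ten samples. The density values are hypothetical and the function name is an assumption.

```python
import math
from statistics import stdev

def standard_error(values):
    """Sample SD divided by the square root of the number of values."""
    return stdev(values) / math.sqrt(len(values))

# Hypothetical cell densities from 10 samples of one fermenter.
densities = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.0]

se = standard_error(densities)
print(se)
```

Because the SD is divided by sqrt(N), taking more samples shrinks the SE even when the SD itself stays about the same.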
Normal Distribution
The normal distribution is symmetric, with mean = median = mode. Data from many biological systems approximately follow a normal distribution. Many analyses assume that your data are normally distributed.
A Skewed Distribution
A skewed distribution is not symmetrical: more data points are located in one area of the distribution. A skewed distribution may be due to experimental error. A log transformation often helps to make a skewed distribution closer to a normal distribution.
We can standardise data by subtracting the mean from each data point and then dividing by the standard deviation. This rescales the data to mean 0 and SD 1 so that datasets become comparable; it does not change the shape of the distribution. Suppose the values are 7.4, 4.5, 8.6, 4.5, 8.9. The mean is 6.78 and the sample SD is 2.16. The new values are (7.4-6.78)/2.16 and so on. Thus the new data are 0.29, -1.06, 0.84, -1.06, 0.98.
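A minimal sketch of this mean-centring and scaling step on the five values from the example. It assumes the sample SD (n-1 denominator); whatever the input, the rescaled values have mean 0 and SD 1.

```python
from statistics import mean, stdev

values = [7.4, 4.5, 8.6, 4.5, 8.9]

m = mean(values)
s = stdev(values)                      # sample SD, n - 1 denominator (assumed)
z_scores = [(x - m) / s for x in values]

print(round(m, 2), round(s, 2))
print([round(z, 2) for z in z_scores])
```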
There are certain tests which do not assume that the data are normally distributed, e.g. the Mann-Whitney test, the Kruskal-Wallis test and the Wilcoxon signed-rank test. Such tests are known as non-parametric tests. For example, an algorithm that identifies genes as Present or Absent (e.g. MAS5, Affymetrix) uses a test of this kind.
NULL Hypothesis
The hypothesis of no difference is termed the NULL hypothesis, e.g. H0: there is no drop in insulin production in response to glucose when cells move to a higher passage; or, no change in invasion or drug resistance.
Test of Significance
Sample estimates and population estimates may differ. The difference may have arisen due to errors of sampling, in which case the hypothesis is true. Or it may have arisen because the parent population of the sample is different from the population considered, in which case the hypothesis is false.
If the difference between the sample estimate and the population estimate under a NULL hypothesis is due to errors of sampling, it will stay within some limiting value; beyond that limit it can no longer be attributed to error of sampling.
Types of Error
A Type 1 error is committed when we reject the hypothesis when in reality it is true. The probability of committing a Type 1 error is represented by α. A Type 2 error is committed when we accept the hypothesis when in reality it is not true. The probability of committing a Type 2 error is generally represented by β. A Type 1 error is conventionally treated as more serious than a Type 2 error.
Null hypothesis: the patient is not affected by cancer. Type 1 error: the person does not have cancer and is diagnosed as having cancer. Type 2 error: the person has cancer and is diagnosed as not having cancer. Which error is more serious depends on the consequences; in this example the missed diagnosis (Type 2 error) is definitely the more serious one.
In practice it is not possible to avoid these two types of error. Hence we fix the probability of one error (Type 1 error) and try to minimize the probability of the other (Type 2 error)
Critical Region
Errors due to chance will not be present beyond certain limits, and that is how the critical region is decided.
Critical Region
The selection of the critical region is based on the probabilities of two types of error. The probability of Type 1 error is fixed and the critical region is chosen which minimizes Type 2 error
Level Of Significance
The maximum probability of rejecting the hypothesis when it is true, or in other words the maximum probability of a Type 1 error, is known as the level of significance. For most biological systems it is customary to use 0.05 (5%) as the level of significance. The lower the p-value cut-off, the more reliable the result, i.e. there is more signal and less noise, but the more likely it is that you will miss real changes.
Degree of freedom
The degree of freedom is n-1, where n is the number of elements. Suppose a + b + c + d = 50. In such a case we can freely assign values to a, b and c, but then d is fixed, so we cannot choose all 4. Thus the degree of freedom here is 3. It is important when we have to get the p-value for the corresponding test statistic, e.g. t.
t-test
A t-test is used to determine whether the scores of two groups differ on a single variable. The NULL hypothesis under investigation is H0: there is no difference between the means of the two groups. In simple terms, the t-test compares sample 1 to sample 2 and detects changes between the two.
The NULL hypothesis is the hypothesis of no difference. The alternate hypothesis is that the mean of one group is different from that of the other group. If the alternate hypothesis is tested in only one direction it is termed a one-tailed hypothesis; otherwise it is termed a two-tailed hypothesis. Most analyses use a two-tailed hypothesis.
E.g. bacterial count in supply water. Here the hypothesis is that the level of bacteria does not go above the permissible 10/100 ml. While sampling and testing, the alternate hypothesis lies on one side only, i.e. we are in no way concerned if the bacterial count goes below 10/100 ml.
Most commonly analysed data fall under a two-tailed hypothesis. E.g. the alternative hypothesis could be that the expression of a given gene is significantly higher or lower in cancer tissue when compared to normal tissue.
Paired t-test
The paired t-test is an extension of the normal t-test. The hypothesis under investigation is that the mean difference between the individual pairs in a group is 0. It adds a lot to the power of the t-test because differences between individuals are removed from the analysis. E.g. the effect on a cell line before and after drug treatment. The experimental design should support this type of analysis.
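A sketch of the paired t statistic, computed from the per-pair differences. The before/after readings are hypothetical and the function name is an assumption. Turning t into a p-value needs the t distribution with n-1 degrees of freedom, which the Python standard library does not provide, so that step is left to a statistics table or package.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Return (t, degrees of freedom) for paired measurements."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    se_of_mean_diff = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / se_of_mean_diff, n - 1

# Hypothetical readings for the same four cell lines before and after treatment.
before = [10, 12, 9, 11]
after = [12, 14, 10, 14]

t_paired, df_paired = paired_t(before, after)
print(t_paired, df_paired)
```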
=TTEST(A2:A10,B2:B10,2,1)
A2:A10 is the first array.
B2:B10 is the second array.
Third argument (tails): 2 if two-tailed, 1 if one-tailed.
Fourth argument (type): 1 if paired, 2 if two-sample with equal variance, 3 if two-sample with unequal variance.
The output is the p-value.
The unequal variance type is the more common one. Equal variance is used when the sample sizes in the two groups are nearly the same and the variances are nearly equal. Unequal variance is used when the sample sizes are not the same and/or the variances are not equal.
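For the unequal-variance case, the t statistic and its approximate degrees of freedom (the Welch-Satterthwaite formula) can be sketched as below. The group values and the function name are hypothetical; as with the paired test, converting t to a p-value requires the t distribution from a table or a statistics package.

```python
import math
from statistics import mean, variance

def welch_t(sample1, sample2):
    """Return (t, approximate degrees of freedom) without assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    q1 = variance(sample1) / n1      # squared standard error of mean 1
    q2 = variance(sample2) / n2      # squared standard error of mean 2
    t = (mean(sample1) - mean(sample2)) / math.sqrt(q1 + q2)
    df = (q1 + q2) ** 2 / (q1 ** 2 / (n1 - 1) + q2 ** 2 / (n2 - 1))
    return t, df

group_a = [1, 2, 3, 4, 5]       # hypothetical control values
group_b = [2, 4, 6, 8, 10]      # hypothetical treated values, larger variance

t_welch, df_welch = welch_t(group_a, group_b)
print(t_welch, df_welch)
```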
Multiple testing corrections adjust the p-values derived from multiple statistical tests to correct for the occurrence of false positives. E.g. in microarray data analysis, false positives are genes that are found to be statistically different between conditions but in reality are not. They can also arise when insufficient replicates are used in the experiment.
Such adjustment matters most when a very large number of tests is performed, e.g. in a typical microarray experiment the number of genes which are not affected is many times larger than the number of genes which are affected. This type of data generates a high number of false positives because of the large number of genes not affected by the treatment: at a 5% level of significance, about 500 out of 10,000 unaffected genes would pass the test by chance alone.
Some of the multiple testing correction methods are (in decreasing severity): Bonferroni; Westfall and Young permutation; Benjamini and Hochberg false discovery rate.
Multiple testing correction also removes a lot of true positives, i.e. genes which are actually affected by the treatment are shown as not affected because of the adjusted p-value. For example, the Bonferroni correction multiplies each p-value by the number of genes tested (55,000 in plus arrays), which is equivalent to lowering the significance cut-off from 0.05 to 0.05/55000, i.e. about 0.0000009. Hardly any genes will pass this criterion. Standard MTC: Benjamini and Hochberg.
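The difference in severity between Bonferroni and Benjamini and Hochberg can be sketched on a handful of p-values. The five p-values are hypothetical, and the functions are a minimal illustration of the two adjustments rather than the implementation used by any particular microarray package.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini and Hochberg FDR-adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.001, 0.008, 0.012, 0.041, 0.20]    # hypothetical raw p-values

bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
passed_bonf = sum(q < 0.05 for q in bonf)
passed_bh = sum(q < 0.05 for q in bh)
print(passed_bonf, passed_bh)
```

With these numbers the Benjamini and Hochberg adjustment keeps one more gene than Bonferroni, which illustrates why it is the less severe correction.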
Analysis of Variance One way ANOVA Applications of ANOVA for various applications Correlation and Regression Regression equation of first order Scatter Diagram Correlation Spearmans correlation Rank Correlation Regression equation of 1 st order Multiple regression Non-linear equation
ANOVA is a test of the statistical significance of the differences among the mean scores of two or more groups on one or more variables. The analysis aims at assessing the total variation present and then apportioning it between the various factors responsible for the variation. It is very similar to the t-test except that it is applicable to more than 2 groups.
A vs B: t-test, as there are only two groups.
A vs B, A vs C, B vs C: ANOVA, as 3 groups are involved.
Analysis of Variance
The NULL hypothesis here is: the means of the individual groups are the same. The alternate hypothesis is: the mean of one or more groups is different. The output is a p-value and the judgment criterion is the same as that of the t-test. For most biological systems a p-value of less than 0.05 is termed significant.
1) The component of variation due to the different groups is calculated (between-group mean square).
2) The component of variation due to the replicates within the individual groups is calculated.
3) The component of variation due to error is calculated; in one-way ANOVA this is the within-group mean square obtained from step 2.
4) The F value is calculated as 1 divided by 3. The F value is the measure of signal in relation to the noise.
5) The F value is converted to the corresponding p-value.
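The F value for a one-way ANOVA can be sketched as follows. The three groups are hypothetical and the function name is an assumption; converting F to a p-value needs the F distribution, which is not in the Python standard library and is therefore left out.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of replicates."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    k, n = len(groups), len(all_values)

    # Signal: variation of the group means around the grand mean.
    ss_between = sum(len(g) * (gm - grand_mean) ** 2
                     for g, gm in zip(groups, group_means))
    # Noise: variation of the replicates around their own group mean.
    ss_within = sum((x - gm) ** 2
                    for g, gm in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # hypothetical replicates for A, B, C
f_value, df_between, df_within = one_way_anova_f(groups)
print(f_value, df_between, df_within)
```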
Calculating ANOVA
For microarray data use Genesis/GeneSpring. For other data you can use SPSS. Alternatively, you can format your data in a way that Genesis accepts and then perform ANOVA there. It is fairly simple, as you only have to define the groups.
Advantages of ANOVA
It is a statistically sound method to find the difference between two or more groups. It is a lot better than fold change as it takes account of the noise in the data.
Correlation
Correlation is the relationship between two sets of data such that when one changes, the other is likely to make a corresponding change. If the changes are in the same direction, there is a positive correlation. If they are in opposite directions, there is a negative correlation. The correlation coefficient ranges from -1 (negatively correlated) to 1 (positively correlated). A value of 0 indicates no correlation.
Use of Correlation
This measure is used when we try to figure out the type of relationship between two variables, e.g. cell growth and carbon dioxide concentration. It is also important as a prerequisite to regression analysis: before fitting an equation, it is essential to know whether there is any relationship between the variables. There can be positive correlation, negative correlation or no correlation at all.
Pearson Correlation
It is the most commonly used measure of correlation. The calculation is done on the original values, unlike rank correlation, where the correlation is calculated on ranks. Since it is calculated on the original values, extreme values can affect the correlation.
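$r_{xy} = \frac{\sum (X_i - \bar{x})(Y_i - \bar{y})}{\sqrt{\sum (X_i - \bar{x})^2 \sum (Y_i - \bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the means of the respective vectors, $r_{xy}$ is the correlation coefficient, $X_i$ and $Y_i$ are the elements of the vectors and $\sum$ is the summation.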
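A minimal sketch of the Pearson correlation coefficient computed directly from the original values. The function name and the example vectors are assumptions made for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4]
r_positive = pearson_r(x, [2, 4, 6, 8])    # y rises with x
r_negative = pearson_r(x, [8, 6, 4, 2])    # y falls as x rises
print(r_positive, r_negative)
```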
Rank (Spearman) correlation is a good choice if outliers are likely, as it works on ranks rather than on the original values, so extreme values do not contribute disproportionately to the differences. It works on the differences between the ranks of paired values after each vector has been ranked in ascending order. As it does not use the original values, it discards information about their size, so the result may be less accurate for most analyses.
$R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$, where d is the difference between the ranks of each pair of values after the two vectors are ranked in ascending order, n is the size of the vector and R is the correlation coefficient.
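A sketch of rank correlation using the rank-difference formula R = 1 - 6*sum(d^2) / (n(n^2 - 1)). For simplicity it assumes there are no tied values; the helper names and data are illustrative.

```python
def ranks(values):
    """Rank each value from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_r(x, y):
    """Spearman rank correlation from the squared rank differences."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
r_outlier = spearman_r(x, [2, 4, 5, 9, 100])   # extreme last value, same ordering
r_shuffled = spearman_r(x, [1, 3, 2, 5, 4])    # two neighbouring pairs swapped
print(r_outlier, r_shuffled)
```

The extreme value 100 does not lower the coefficient, because only its rank is used.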
Regression equation
Regression equations are used when we want to quantify the relationship of one variable to another. It is like fitting an equation to the known values, e.g. cell mortality with respect to time of drug administration. The fitted equation can be used to predict the output at an unknown point.
Regression equation
Uses include predicting the outcome at a time point for which the experiment was not conducted, and creating a generic model for the experiment under consideration.
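A first-order regression equation (a straight line) can be fitted by least squares and then used to predict an unmeasured time point. The data below are hypothetical and were made exactly linear so the fitted line is easy to check by hand.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return slope, intercept

hours = [0, 1, 2, 3, 4]            # time after drug administration (hypothetical)
mortality = [5, 7, 9, 11, 13]      # % cell mortality (hypothetical)

slope, intercept = fit_line(hours, mortality)
predicted = intercept + slope * 2.5    # a time point that was not measured
print(slope, intercept, predicted)
```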
Multiple regression
Multiple regression is used when more than two variables are involved in the study, e.g. cell growth as a function of time, temperature and CO2.
Non-linear equation
A non-linear regression equation is fitted when the relationship between the two variables is not linear, e.g. glucose level in blood after glucose consumption.
Experimental design
Designing a good microarray experiment
Always have a hypothesis before conducting a microarray experiment. Have one main problem on which your experimental design is based. Avoid confounding factors in your experimental design. For the same cost, try to increase the number of biological replicates rather than technical replicates.
Biological replicates are arrays that use RNA samples from different individual organisms, pools of organisms or flasks of cells, but yet compare the same treatments or control/treatment combination. Technical replicates are arrays that use the same RNA samples and also the same treatments. Thus the only differences in measurements are due to technical differences in array processing. It is highly recommended that more biological replicates are done than technical replicates, especially when dealing with individual organisms rather than a cell line.
Paired experimental design: this design removes the gene expression differences due to individual differences, e.g. normal and cancerous tissues from the same patient. Time series experiment: apart from the other findings, it gives a deeper insight into how individual genes regulate other genes, e.g. gene expression changes at time points after drug administration.
Microarray data are n-dimensional. Nearly all clustering algorithms use the concept of distance. Euclidean distance, Manhattan (absolute) distance, correlation distance and covariance distance are some of the distance criteria used in these clustering algorithms.
Correlation Distance
Correlation distance is the most commonly used. With this distance, even genes whose expression values differ greatly but which follow a similar trend will be captured. Pearson distance is the more commonly used variant. Spearman (rank correlation) distance is used when outliers are likely, as this method of calculating distance reduces the influence of outlier samples.
Euclidean distance: this is also an important distance, but it will only capture genes which have a similar pattern both in terms of trend and actual values. It gives more weight to distant values. Manhattan distance is the absolute distance and does not give more weight to the more distant values.
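The three distances can be compared on two hypothetical genes that follow the same trend at very different expression levels. Correlation distance is taken here as 1 minus the Pearson correlation, which is one common convention and an assumption of this sketch.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(p - q) for p, q in zip(a, b))

def correlation_distance(a, b):
    """1 - Pearson correlation: 0 for identical trends, 2 for opposite trends."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    saa = sum((p - mean_a) ** 2 for p in a)
    sbb = sum((q - mean_b) ** 2 for q in b)
    return 1 - sab / math.sqrt(saa * sbb)

gene_1 = [1, 2, 3, 4]         # hypothetical expression profile
gene_2 = [10, 20, 30, 40]     # same trend, ten times the expression level

d_euclidean = euclidean(gene_1, gene_2)
d_manhattan = manhattan(gene_1, gene_2)
d_correlation = correlation_distance(gene_1, gene_2)
print(d_euclidean, d_manhattan, d_correlation)
```

The correlation distance is essentially zero although both value-based distances are large, so only the correlation distance would group these two genes together.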
Hierarchical clustering
Represents the data as a tree. The samples/genes on nearby nodes are more similar than the samples/genes on distant nodes.
K-means clustering
Divides the data into a user-defined number of groups. Starts with random seed points, so every run may give a slightly different result.
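A minimal one-dimensional sketch of the k-means idea: pick random seed points, assign each value to its nearest centre, move each centre to the mean of its cluster, and repeat. The function name, the fixed iteration count and the data are assumptions made for illustration; real microarray clustering works on many dimensions and usually stops when the centres no longer move.

```python
import random

def k_means_1d(points, k, iterations=20, seed=0):
    """Cluster 1-D points into k groups; returns (centres, clusters)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)   # random seed points taken from the data
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centres[c]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centres = [
            sum(c) / len(c) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
    return centres, clusters

expression = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]   # hypothetical values, two clear groups
centres, clusters = k_means_1d(expression, k=2)
print(sorted(centres))
```

Changing `seed` changes the starting points; on data with less clearly separated groups this can change the final clusters.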
SOM
SOM (self-organising map) is an unsupervised neural network learning algorithm. Clusters are formed on a two-dimensional grid, and nearby clusters are more similar to each other than distant ones. Works very efficiently on large data sets.