Correlation
While studying statistics, one comes across the concept of correlation. It is a statistical
method which enables the researcher to find whether two variables are related and to what
extent they are related. Correlation is often described as the sympathetic movement of two or
more variables: when a change in one variable is accompanied by a change in the others,
either in the same or in the opposite direction, the variables are said to be correlated.
Given data in which two or more variables take values, we may study the related variation
of these variables.
Example:
The relationship between yield and rainfall at a constant temperature is an example of partial correlation.
Linear Correlation
When a change in one variable results in a change in the other variable in a constant ratio, we
say the correlation is linear. When there is a linear correlation, the plotted points lie on a
straight line.
Example:
Consider the variables with the following values.
X: 10 20 30 40 50
Y: 20 40 60 80 100
Here, there is a linear relationship between the variables: the ratio X:Y is 1:2 at all points.
Also, if we plot them, the points lie on a straight line.
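This perfect linear relationship can be checked numerically. The sketch below computes Pearson's correlation coefficient for the data above using only the standard library (the function name pearson_r is our own):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient: covariance / (std_x * std_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

X = [10, 20, 30, 40, 50]
Y = [20, 40, 60, 80, 100]
r = pearson_r(X, Y)
print(r)  # close to 1.0: a perfect positive linear correlation
```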
Zero Order Correlation
One of the most common and basic techniques for analyzing the relationships between
variables is zero-order correlation. The value of a correlation coefficient can vary from -1 to
+1. A -1 indicates a perfect negative correlation, while a +1 indicates a perfect positive
correlation. A correlation of zero means there is no linear relationship between the two variables.
Non-Linear Correlation
When the amount of change in one variable is not in a constant ratio to the change in the
other variable, we say that the correlation is non-linear.
Example:
Consider the variables with the following values
X: 10 20 30 40 50
Y: 10 30 70 90 120
Here there is a non-linear relationship between the variables: the ratio between them is not
fixed at all points. Also, if we plot them on a graph, the points will not lie on a straight line;
they will form a curve.
Non-linear correlation is also known as curvilinear correlation.
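We can quantify this: for the data above, the ratio Y/X changes from point to point, and Pearson's coefficient, which measures only linear association, falls short of 1. A small self-contained check (the helper name is our own):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

X = [10, 20, 30, 40, 50]
Y = [10, 30, 70, 90, 120]

ratios = [y / x for x, y in zip(X, Y)]
r = pearson_r(X, Y)
print(ratios)  # not a fixed ratio, unlike the linear example
print(r)       # high, but strictly less than 1
```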
Simple Correlation
If there are only two variables under study, the correlation is said to be simple.
Example:
The correlation between price and demand is simple.
Multiple Correlation
When one variable is related to a number of other variables, the correlation is not simple. It is
multiple when there is one variable on one side and a set of variables on the other side.
Example:
The relationship of yield with both rainfall and fertilizer together is a multiple correlation.
Interpretation of the coefficient of correlation based on the probable error
1. If the coefficient of correlation is less than the probable error, it is not significant.
2. If the coefficient of correlation is more than six times the probable error, it is significant.
3. If the probable error is small and the coefficient of correlation is 0.5 or more, then the
coefficient of correlation is significant.
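These rules can be applied mechanically once the probable error is computed. The usual formula is PE = 0.6745 * (1 - r^2) / sqrt(n); the sketch below assumes that formula, and the function names are our own:

```python
import math

def probable_error(r, n):
    """Probable error of a correlation coefficient r from n paired observations."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

def interpret(r, n):
    """Apply the three rules of thumb for significance of r."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe:
        return "significant"
    return "inconclusive"

# Example: r = 0.8 observed from 25 pairs of observations
pe = probable_error(0.8, 25)
print(pe)                  # about 0.0486
print(interpret(0.8, 25))  # 'significant', since 0.8 > 6 * pe
```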
Regression
Regression analysis is an important tool for modelling and analyzing data. Here, we fit a
curve or line to the data points in such a manner that the sum of the squared distances of the
data points from the curve or line is minimized. We will explain this in more detail in the
coming sections.
Types of Regression
1. Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually among
the first few topics people pick up while learning predictive modeling. In this technique, the
dependent variable is continuous, the independent variable(s) can be continuous or discrete,
and the nature of the regression line is linear.
Linear regression establishes a relationship between a dependent variable (Y) and one or
more independent variables (X) using a best-fit straight line (also known as the regression
line).
The difference between simple linear regression and multiple linear regression is that
multiple linear regression has more than one independent variable, whereas simple linear
regression has only one independent variable.
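As a sketch of simple linear regression, the closed-form least-squares solution (slope = covariance / variance of x; intercept = mean of y minus slope times mean of x) can be coded directly; the function name is our own:

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = a + b*x; returns (intercept a, slope b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b

X = [10, 20, 30, 40, 50]
Y = [20, 40, 60, 80, 100]
a, b = fit_simple_linear(X, Y)
print(a, b)  # intercept 0.0, slope 2.0 for the Y = 2X data used earlier
```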
3. Logistic Regression
Logistic regression is used when the dependent variable is binary (presence or absence of a
characteristic). Instead of fitting Y directly, it models the log of the odds:
ln(odds) = ln(p/(1-p))
Above, p is the probability of presence of the characteristic of interest. A question you
should ask here is: why have we used the log in the equation?
Since the dependent variable follows a binomial distribution, we need to choose a link
function which is best suited for this distribution, and that is the logit function. In
the equation above, the parameters are chosen to maximize the likelihood of observing the
sample values rather than to minimize the sum of squared errors (as in ordinary regression).
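A minimal illustration of the logit link and its inverse (the sigmoid), which maps log-odds back to a probability; these helper names are our own:

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

p = 0.8
z = logit(p)
print(z)           # ln(0.8/0.2) = ln(4), about 1.386
print(sigmoid(z))  # recovers 0.8
```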
3. Polynomial Regression
y=a+b*x^2
In this regression technique, the best-fit line is not a straight line; it is a curve that fits
the data points.
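For the quadratic model above, a simple trick is to regress y on z = x^2, which reduces the problem to ordinary least squares on the transformed variable. A sketch assuming that transformation (function name and data are our own):

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Data generated from y = 3 + 2*x^2
X = [1, 2, 3, 4, 5]
Y = [3 + 2 * x ** 2 for x in X]

Z = [x ** 2 for x in X]  # transform: regress y on x^2
a, b = fit_simple_linear(Z, Y)
print(a, b)  # recovers a = 3.0, b = 2.0
```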
4. Stepwise Regression
This form of regression is used when we deal with multiple independent variables. In this
technique, the selection of independent variables is done with the help of an automatic
process, which involves no human intervention.
This feat is achieved by observing statistical values like R-square, t-stats and AIC metric to
discern significant variables. Stepwise regression basically fits the regression model by
adding/dropping co-variates one at a time based on a specified criterion. Some of the most
commonly used Stepwise regression methods are listed below:
Standard stepwise regression does two things: it adds and removes predictors as
needed at each step.
Forward selection starts with the most significant predictor in the model and adds a
variable at each step.
Backward elimination starts with all predictors in the model and removes the least
significant variable at each step.
The aim of this modeling technique is to maximize prediction power with the minimum
number of predictor variables. It is one of the methods for handling the higher
dimensionality of a data set.
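A heavily simplified sketch of one forward-selection step: score each candidate predictor by the R-square of a simple regression against y and greedily pick the best. Real stepwise procedures refit a multiple regression and use t-stats or AIC at each step; the data and function names here are made up for illustration:

```python
def r_squared(xs, ys):
    """R^2 of the simple least-squares regression of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

def forward_select_first(candidates, ys):
    """Pick the candidate predictor with the highest R^2 against y."""
    return max(candidates, key=lambda name: r_squared(candidates[name], ys))

y = [5, 9, 13, 17, 21]
predictors = {
    "rainfall": [1, 2, 3, 4, 5],  # y is an exact linear function of this
    "noise":    [3, 1, 4, 1, 5],  # unrelated
}
best = forward_select_first(predictors, y)
print(best)  # 'rainfall'
```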
5. Ridge Regression
Ridge regression is a technique used when the data suffers from multicollinearity
(independent variables are highly correlated). Under multicollinearity, even though the least
squares (OLS) estimates are unbiased, their variances are large, which pushes the observed
values far from the true value. By adding a degree of bias to the regression estimates, ridge
regression reduces the standard errors.
Above, we saw the equation for linear regression. Remember? It can be represented as:
y=a+ b*x
This equation also has an error term. The complete equation becomes:
y=a+b*x+e
where the error term e is the value needed to correct for the prediction error between the
observed and predicted value.
In a linear equation, prediction errors can be decomposed into two sub-components: the first
is due to bias and the second is due to variance. Prediction error can occur due to either
one of these components or both. Here, we will discuss the error caused by variance.
The ridge objective has two components. The first is the least squares term, and the other is
λ (lambda) times the summation of β² (beta squared), where β is the coefficient; in plain
text, minimize Σ(y − a − b*x)² + λ*Σβ². The penalty is added to the least squares term in
order to shrink the parameters and so obtain a very low variance.
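For simple (one-predictor) regression on centered data, the ridge slope has a closed form, b = Sxy / (Sxx + λ), which makes the shrinkage effect easy to see. A sketch under that assumption (function name is our own):

```python
def ridge_slope(xs, ys, lam):
    """Ridge estimate of the slope on centered data: Sxy / (Sxx + lambda)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / (sxx + lam)

X = [10, 20, 30, 40, 50]
Y = [20, 40, 60, 80, 100]

for lam in (0, 100, 1000):
    print(lam, ridge_slope(X, Y, lam))
# lambda = 0 gives the OLS slope 2.0; larger lambda shrinks the slope toward 0
```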
6. Lasso Regression
Similar to ridge regression, Lasso (Least Absolute Shrinkage and Selection Operator) also
penalizes the absolute size of the regression coefficients. In addition, it is capable of reducing
the variability and improving the accuracy of linear regression models.
Lasso regression differs from ridge regression in that it uses absolute values in the
penalty function instead of squares. This amounts to constraining the sum of the absolute
values of the estimates, which causes some of the parameter estimates to turn out exactly
zero. The larger the penalty applied, the further the estimates get shrunk towards zero. This
results in variable selection out of the given n variables.
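The "exactly zero" behaviour comes from soft-thresholding: in the simplified orthonormal-design case, the lasso estimate is sign(z) * max(|z| − λ, 0), where z is the OLS estimate. A sketch under that simplifying assumption (function name is our own):

```python
def soft_threshold(z, lam):
    """Lasso shrinkage of an OLS coefficient z under an orthonormal design."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

print(soft_threshold(2.0, 0.5))   # 1.5: shrunk but kept
print(soft_threshold(0.3, 0.5))   # 0.0: small coefficient eliminated entirely
print(soft_threshold(-2.0, 0.5))  # -1.5: shrinkage is toward zero from either side
```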
ElasticNet is a hybrid of the lasso and ridge regression techniques. It is trained with both the
L1 and L2 priors as regularizers. Elastic-net is useful when there are multiple features which
are correlated: lasso is likely to pick one of these at random, while elastic-net is likely to
pick both.
A practical advantage of trading off between lasso and ridge is that it allows elastic-net to
inherit some of ridge's stability under rotation.
Hypothesis
A hypothesis is a tentative statement about the relationship between two or more variables. A
hypothesis is a specific, testable prediction about what you expect to happen in your study.
For example, a study designed to look at the relationship between sleep deprivation and test
performance might have a hypothesis that states, "This study is designed to assess the
hypothesis that sleep deprived people will perform worse on a test than individuals who are
not sleep deprived."
Unless you are creating a study that is exploratory in nature, your hypothesis should always
explain what you expect to happen during the course of your experiment or research.
Remember, a hypothesis does not have to be correct. While the hypothesis predicts what the
researchers expect to see, the goal of research is to determine whether this guess is right or
wrong. When conducting an experiment, researchers might explore a number of different
factors to determine which ones might contribute to the ultimate outcome.
In many cases, researchers may find that the results of an experiment do not support the
original hypothesis. When writing up these results, the researchers might suggest other
options that should be explored in future studies.
Definitions
Hypotheses are single tentative guesses, good hunches, assumed for use in devising theory
or planning experiments, intended to be given a direct experimental test when possible. (Eric
Rogers, 1966)
A hypothesis is a tentative prediction about the nature of the relationship between two or
more variables.
Hypotheses are always in declarative sentence form, and they relate, either generally or
specifically, variables to variables.
Nature of Hypothesis:
Types of Hypotheses:
The null hypothesis represents a theory that has been put forward, either because it is
believed to be true or because it is to be used as a basis for argument, but has not been
proved.
Type I error
When the null hypothesis is true and you reject it, you make a type I error. The
probability of making a type I error is α, which is the level of significance you set for
your hypothesis test. An α of 0.05 indicates that you are willing to accept a 5% chance
that you are wrong when you reject the null hypothesis. To lower this risk, you must
use a lower value for α. However, using a lower value for alpha means that you will
be less likely to detect a true difference if one really exists.
Type II error
When the null hypothesis is false and you fail to reject it, you make a type II error.
The probability of making a type II error is β, which depends on the power of the test.
You can decrease your risk of committing a type II error by ensuring your test has
enough power. You can do this by ensuring your sample size is large enough to detect
a practical difference when one truly exists.
The probability of rejecting the null hypothesis when it is false is equal to 1 − β. This
value is the power of the test.
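The type I error rate can be checked by simulation: draw many samples with the null hypothesis true, run a z-test at α = 0.05, and confirm that roughly 5% of the samples are (wrongly) rejected. A sketch using only the standard library; the setup (known σ = 1, n = 30, critical value 1.96) is our own illustrative choice:

```python
import math
import random

random.seed(42)

def z_test_rejects(sample, mu0, sigma, z_crit=1.96):
    """Two-sided z-test of H0: mean = mu0, with sigma known."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return abs(z) > z_crit

trials = 5000
rejections = sum(
    z_test_rejects([random.gauss(0, 1) for _ in range(30)], mu0=0, sigma=1)
    for _ in range(trials)
)
rate = rejections / trials
print(rate)  # close to 0.05, the nominal type I error rate
```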
Null Hypothesis
The possible outcomes of a hypothesis test can be summarized as follows:
Reject the null hypothesis when it is true: Type I error (probability = α).
Fail to reject the null hypothesis when it is true: correct decision (probability = 1 − α).
Fail to reject the null hypothesis when it is false: Type II error (probability = β).
Reject the null hypothesis when it is false: correct decision (probability = 1 − β, the power).
Standard Error
If the population size is much larger than the sample size, then the sampling distribution
has roughly the same standard error whether we sample with or without replacement. On the
other hand, if the sample represents a significant fraction (say, 1/20) of the population size,
the standard error will be noticeably smaller when we sample without replacement.
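This is usually handled with the finite population correction factor sqrt((N − n)/(N − 1)), applied to the with-replacement standard error s/sqrt(n). A sketch (function names are our own):

```python
import math

def standard_error(s, n):
    """Standard error of the mean when sampling with replacement."""
    return s / math.sqrt(n)

def standard_error_fpc(s, n, N):
    """Standard error without replacement: apply the finite population correction."""
    return standard_error(s, n) * math.sqrt((N - n) / (N - 1))

s, n = 10.0, 100
print(standard_error(s, n))               # 1.0
print(standard_error_fpc(s, n, N=2000))   # noticeably smaller: sample is 1/20 of N
print(standard_error_fpc(s, n, N=10**6))  # essentially 1.0: N is much larger than n
```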
Student's t Distribution
According to the central limit theorem, the sampling distribution of a statistic (like a sample
mean) will follow a normal distribution, as long as the sample size is sufficiently large.
Therefore, when we know the standard deviation of the population, we can compute a z-
score, and use the normal distribution to evaluate probabilities with the sample mean.
But sample sizes are sometimes small, and often we do not know the standard deviation of
the population. When either of these problems occurs, statisticians rely on the distribution of
the t statistic (also known as the t score), whose values are given by:
t = [ x̄ − μ ] / [ s / sqrt( n ) ]
where x̄ is the sample mean, μ is the population mean, s is the standard deviation of the
sample, and n is the sample size. The distribution of the t statistic is called the t
distribution or the Student t distribution.
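Computing the t statistic from a small sample is a direct translation of the formula; the sample values and hypothesized mean below are chosen arbitrarily for illustration:

```python
import math
import statistics

def t_statistic(sample, mu):
    """t = (sample mean - mu) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # uses the n - 1 denominator
    return (x_bar - mu) / (s / math.sqrt(n))

sample = [12, 11, 13, 12, 12]
t = t_statistic(sample, mu=10)
print(t)  # about 6.32: the sample mean of 12 is far above mu = 10
```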
The t distribution allows us to conduct statistical analyses on certain data sets that are not
appropriate for analysis using the normal distribution.
The t distribution can be used with any statistic having a bell-shaped distribution (i.e.,
approximately normal), for example when the population itself is normal or approximately
so.
The t distribution should not be used with small samples from populations that are not
approximately normal.
F-Test
The object of the F-test is to discover whether two independent estimates of the
population variance differ significantly, or whether the two samples may be regarded as
drawn from the same normal population having the same variance. Since the F-test is based
on the ratio of two variances, it is also known as the variance ratio test.
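Computing the variance ratio from two samples, conventionally putting the larger variance in the numerator so F ≥ 1 (the sample values are our own):

```python
import statistics

def f_statistic(sample1, sample2):
    """Variance ratio: larger sample variance over the smaller."""
    v1 = statistics.variance(sample1)  # n - 1 denominator
    v2 = statistics.variance(sample2)
    return max(v1, v2) / min(v1, v2)

a = [23, 25, 28, 30, 32]
b = [24, 25, 26, 25, 24]
print(f_statistic(a, b))  # well above 1: the two variance estimates differ a lot
```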
Assumptions
Chi-square Test:
Chi-square is a statistical test commonly used to compare observed data with the data we
would expect to obtain according to a specific hypothesis. For example, if, according to
Mendel's laws, you expected 10 of 20 offspring from a cross to be male and the actual
observed number was 8 males, then you might want to know about the "goodness of fit"
between the observed and expected. Were the deviations (differences between observed and
expected) the result of chance, or were they due to other factors? How much deviation can
occur before you, the investigator, must conclude that something other than chance is at
work, causing the observed to differ from the expected? The chi-square test always tests
what scientists call the null hypothesis, which states that there is no significant difference
between the expected and observed results.
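The statistic itself is χ² = Σ (observed − expected)² / expected, summed over the categories. For the Mendel example above (8 males and 12 females observed, against 10 of each expected):

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [8, 12]   # males, females
expected = [10, 10]  # 1:1 ratio expected under Mendel's laws
print(chi_square(observed, expected))  # 0.8, a small deviation likely due to chance
```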
Conditions
1. The total number of observations must be reasonably large. As a rule, the chi-square
test should not be applied when N is less than 50.
2. The data must be expressed in original units.
3. The expected frequency of any cell must not be less than 5; if it is less than 5, pool it
with an adjacent frequency or cell in order to make it 5 or more. It is better to have
cell frequencies of 5 or more.
4. Each observation must be independent of the others.
5. The frequencies used must be in absolute, not relative, terms.
6. The chi-square test depends on the degrees of freedom.