Correlation

While studying statistics, one comes across the concept of correlation. It is a statistical
method which enables the researcher to find whether two variables are related and to what
extent they are related. Correlation can be thought of as the sympathetic movement of two or
more variables: when a change in one particular variable is accompanied by changes in other
variables as well, in either the same or the opposite direction, the variables are said to be
correlated. Given a data set in which two or more variables are measured, we may study the
related variation of these variables.

There are different types of Correlation. They are listed as follows:


Positive Correlation
A positive correlation is a correlation in the same direction.
Negative Correlation
A negative correlation is a correlation in the opposite direction.
Partial Correlation
The correlation is partial if we study the relationship between two variables keeping all other
variables constant.

Example:
The relationship between yield and rainfall at a constant temperature is a partial correlation.

Linear Correlation

When the change in one variable bears a constant ratio to the change in the other variable, we
say the correlation is linear. When there is a linear correlation, the points plotted will lie on a
straight line.
Example:
Consider the variables with the following values.

X: 10 20 30 40 50
Y: 20 40 60 80 100

Here, there is a linear relationship between the variables: the ratio between X and Y is 1:2 at
all points, and if we plot them the points will lie on a straight line.
Zero Order Correlation
One of the most common and basic techniques for analyzing the relationships between
variables is zero-order correlation. The value of a correlation coefficient can vary from -1 to
+1. A -1 indicates a perfect negative correlation, while a +1 indicates a perfect positive
correlation. A correlation of zero means there is no relationship between the two variables.
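Example:
The following is a minimal sketch (in Python, assuming SciPy is available) of computing the correlation coefficient for the linear-correlation data above; r comes out as +1, a perfect positive correlation.

from scipy.stats import pearsonr

x = [10, 20, 30, 40, 50]
y = [20, 40, 60, 80, 100]

r, p_value = pearsonr(x, y)                      # r always lies between -1 and +1
print(f"correlation coefficient r = {r:.2f}")    # 1.00 for this data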
Non Linear Correlation
When the amount of change in one variable is not in a constant ratio to the change in the
other variable, we say that the correlation is non-linear.
Example:
Consider the variables with the following values
X: 10 20 30 40 50
Y: 10 30 70 90 120
Here there is a non-linear relationship between the variables: the ratio between them is not
fixed at all points, and if we plot them on a graph the points will not lie on a straight line but
on a curve.
Non-linear correlation is also known as curvilinear correlation.
Simple Correlation
If there are only two variables under study, the correlation is said to be simple.
Example:
The correlation between price and demand is simple.
Multiple Correlation
When one variable is related to a number of other variables, the correlation is not simple. It is
multiple if there is one variable on one side and a set of variables on the other side.
Example:
The relationship of yield with rainfall and fertilizer together is a multiple correlation.
Interpretation of the coefficient of correlation based on the probable error (a numeric sketch follows the list)

1. If the coefficient of correlation is less than the probable error, it is not significant.
2. If the coefficient of correlation is more than six times the probable error, it is significant.
3. If the probable error is small and the coefficient of correlation is 0.5 or more, the
coefficient of correlation is significant.
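The probable error of r is commonly given as PE(r) = 0.6745 * (1 - r^2) / sqrt(N). Below is a minimal Python sketch applying the rules above; the values of r and N are illustrative assumptions, not taken from the text.

import math

r, N = 0.8, 64
pe = 0.6745 * (1 - r ** 2) / math.sqrt(N)   # probable error of r
print(f"PE = {pe:.4f}")                     # about 0.0304 here
print("significant" if r > 6 * pe else "not significant")   # rule 2 above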

Regression

Regression analysis is a form of predictive modelling technique which investigates
the relationship between a dependent (target) variable and independent variable(s) (predictors). This
technique is used for forecasting, time series modelling and finding the causal effect
relationship between the variables. For example, the relationship between rash driving
and the number of road accidents by a driver is best studied through regression.

Regression analysis is an important tool for modelling and analyzing data. Here, we fit a
curve / line to the data points, in such a manner that the distances of the data points from the
curve or line are minimized. I'll explain this in more detail in the coming sections.

Types of Regression

1. Linear Regression

It is one of the most widely known modelling techniques. Linear regression is usually among
the first few topics which people pick while learning predictive modelling. In this
technique, the dependent variable is continuous, the independent variable(s) can be continuous or
discrete, and the nature of the regression line is linear.

Linear Regression establishes a relationship between a dependent variable (Y) and one or
more independent variables (X) using a best-fit straight line (also known as the regression
line).

It is represented by the equation Y = a + b*X + e, where a is the intercept, b is the slope of the
line and e is the error term. This equation can be used to predict the value of the target variable
based on given predictor variable(s).

The difference between simple linear regression and multiple linear regression is that
multiple linear regression has more than one independent variable, whereas simple linear
regression has only one independent variable.
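The following is a minimal sketch (in Python, assuming NumPy is available) of fitting the best-fit line Y = a + b*X by least squares; the data points are illustrative.

import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

b, a = np.polyfit(X, Y, deg=1)   # slope b and intercept a of the best-fit line
print(f"Y = {a:.2f} + {b:.2f}*X")

Y_pred = a + b * X               # predicted values from the regression line
e = Y - Y_pred                   # the error term for each observation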

2. Logistic Regression

Logistic regression is used to find the probability of event = Success and event = Failure. We
should use logistic regression when the dependent variable is binary (0/1, True/False, Yes/
No) in nature. Here the value of Y ranges from 0 to 1, and it can be represented by the following
equations.

odds = p / (1 - p) = probability of event occurrence / probability of event non-occurrence

ln(odds) = ln(p / (1 - p))

logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk

Above, p is the probability of presence of the characteristic of interest. A question that you
should ask here is: why have we used log in the equation?

Since we are working here with a binomial distribution (dependent variable), we need to
choose a link function which is best suited for this distribution, and that is the logit function. In
the equation above, the parameters are chosen to maximize the likelihood of observing the
sample values rather than to minimize the sum of squared errors (as in ordinary regression).
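The following is a minimal sketch using scikit-learn's LogisticRegression (assumed to be installed); the hours-studied/passed data is a made-up illustration of a binary dependent variable.

import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # predictor
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])                  # binary outcome

model = LogisticRegression()
model.fit(hours, passed)          # parameters fitted by maximum likelihood

p = model.predict_proba([[4.5]])[0, 1]    # P(pass) for 4.5 hours of study
print(f"p = {p:.2f}, ln(odds) = {np.log(p / (1 - p)):.2f}")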

3. Polynomial Regression

A regression equation is a polynomial regression equation if the power of the independent
variable is more than 1. The equation below represents a polynomial equation:

y = a + b*x^2

In this regression technique, the best-fit line is not a straight line. It is rather a curve that fits
the data points.
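The following is a minimal sketch (in Python, assuming NumPy) of fitting such a curve; the data roughly follows y = 1 + 2*x^2 and is illustrative.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 9, 19, 33, 51])      # roughly y = 1 + 2*x^2

coeffs = np.polyfit(x, y, deg=2)      # coefficients, highest power first
y_fit = np.polyval(coeffs, x)         # evaluate the fitted curve
print(coeffs)                         # approximately [2, 0, 1]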

4. Stepwise Regression

This form of regression is used when we deal with multiple independent variables. In this
technique, the selection of independent variables is done with the help of an automatic
process, which involves no human intervention.

This feat is achieved by observing statistical values like R-squared, t-statistics and the AIC
metric to discern significant variables. Stepwise regression basically fits the regression model
by adding/dropping covariates one at a time based on a specified criterion. Some of the most
commonly used stepwise regression methods are listed below:

Standard stepwise regression does two things: it adds and removes predictors as
needed at each step.
Forward selection starts with the most significant predictor in the model and adds a variable
at each step.
Backward elimination starts with all predictors in the model and removes the least
significant variable at each step.

The aim of this modelling technique is to maximize the prediction power with the minimum
number of predictor variables. It is one of the methods for handling the higher dimensionality
of a data set.
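The following is a hedged sketch of forward selection driven by AIC, using statsmodels (assumed to be installed); the data and the stopping rule are illustrative assumptions, not a definitive implementation.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))                    # three candidate predictors
y = 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

selected, remaining, best_aic = [], [0, 1, 2], np.inf
while remaining:
    # try adding each remaining predictor; keep the one with the lowest AIC
    aic, j = min((sm.OLS(y, sm.add_constant(X[:, selected + [k]])).fit().aic, k)
                 for k in remaining)
    if aic >= best_aic:
        break                                  # no improvement: stop adding
    best_aic = aic
    selected.append(j)
    remaining.remove(j)

print("selected predictors:", selected)        # expect [0, 1] for this data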

5. Ridge Regression

Ridge Regression is a technique used when the data suffers from multicollinearity
(independent variables are highly correlated). Under multicollinearity, even though the least
squares (OLS) estimates are unbiased, their variances are large, which makes the observed
values deviate far from the true values. By adding a degree of bias to the regression estimates,
ridge regression reduces the standard errors.

Above, we saw the equation for linear regression. Remember? It can be represented as:

y = a + b*x

This equation also has an error term. The complete equation becomes:

y = a + b*x + e (error term), [the error term is the value needed to correct for the prediction
error between the observed and the predicted value]

=> y = a + b1*x1 + b2*x2 + ... + e, for multiple independent variables.



In a linear equation, prediction errors can be decomposed into two sub-components: the first
is due to bias and the second is due to variance. Prediction error can occur due to either of
these two components or both. Here, we'll discuss the error caused due to variance.

Ridge regression solves the multicollinearity problem through a shrinkage parameter λ
(lambda). Its objective can be written as:

sum of squared residuals + λ * Σ(β^2)

In this equation, we have two components. The first one is the least squares term and the other
one is λ times the summation of β^2 (beta squared), where β is the coefficient. This is added to
the least squares term in order to shrink the parameters so that they have a very low variance.
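The following is a minimal sketch using scikit-learn's Ridge (assumed to be installed); alpha plays the role of the shrinkage parameter λ, and the nearly collinear data is illustrative.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=100)

ridge = Ridge(alpha=1.0)                # larger alpha means stronger shrinkage
ridge.fit(X, y)
print(ridge.coef_)                      # shrunken, lower-variance coefficients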

6. Lasso Regression

Similar to Ridge Regression, Lasso (Least Absolute Shrinkage and Selection Operator) also
penalizes the absolute size of the regression coefficients. In addition, it is capable of reducing
the variability and improving the accuracy of linear regression models. Its objective can be
written as:

sum of squared residuals + λ * Σ|β|

Lasso regression differs from ridge regression in that it uses absolute values in the
penalty function instead of squares. This penalizes (or, equivalently, constrains the
sum of the absolute values of) the estimates, which causes some of the parameter
estimates to turn out exactly zero. The larger the penalty applied, the further the estimates are
shrunk towards zero. This results in variable selection out of the given n variables.
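The following is a minimal sketch using scikit-learn's Lasso (assumed to be installed) on illustrative data with five candidate predictors, of which only two actually matter; note how the irrelevant coefficients come out exactly zero.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                          # five candidate predictors
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)   # only two matter

lasso = Lasso(alpha=0.5)       # larger alpha means a stronger penalty
lasso.fit(X, y)
print(lasso.coef_)             # the three irrelevant coefficients are 0.0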

7. Elastic Net Regression



ElasticNet is a hybrid of the Lasso and Ridge Regression techniques. It is trained with both L1
and L2 priors as regularizers. Elastic-net is useful when there are multiple features which are
correlated: Lasso is likely to pick one of these at random, while elastic-net is likely to pick
both.

A practical advantage of trading off between Lasso and Ridge is that it allows Elastic-Net to
inherit some of Ridge's stability under rotation.
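The following is a minimal sketch using scikit-learn's ElasticNet (assumed to be installed); l1_ratio mixes the L1 (Lasso) and L2 (Ridge) penalties, and the correlated pair of features is illustrative.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=100)])  # correlated pair
y = 3 * x1 + rng.normal(size=100)

enet = ElasticNet(alpha=0.5, l1_ratio=0.5)   # half L1, half L2 penalty
enet.fit(X, y)
print(enet.coef_)    # elastic-net tends to keep both correlated features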

Hypothesis

A hypothesis is a tentative statement about the relationship between two or more variables. A
hypothesis is a specific, testable prediction about what you expect to happen in your study.
For example, a study designed to look at the relationship between sleep deprivation and test
performance might have a hypothesis that states, "This study is designed to assess the
hypothesis that sleep deprived people will perform worse on a test than individuals who are
not sleep deprived."

Unless you are creating a study that is exploratory in nature, your hypothesis should always
explain what you expect to happen during the course of your experiment or research.

Remember, a hypothesis does not have to be correct. While the hypothesis predicts what the
researchers expect to see, the goal of research is to determine whether this guess is right or
wrong. When conducting an experiment, researchers might explore a number of different
factors to determine which ones might contribute to the ultimate outcome.

In many cases, researchers may find that the results of an experiment do not support the
original hypothesis. When writing up these results, the researchers might suggest other
options that should be explored in future studies.

Definitions

"Hypotheses are single tentative guesses, good hunches assumed for use in devising theory
or planning experiments, intended to be given a direct experimental test when possible." (Eric
Rogers, 1966)

"A hypothesis is a conjectural statement of the relation between two or more variables."
(Kerlinger, 1956)

"Hypothesis is a formal statement that presents the expected relationship between an
independent and dependent variable." (Creswell, 1994)

A research question is essentially a hypothesis asked in the form of a question.

It is a tentative prediction about the nature of the relationship between two or more
variables.

"A hypothesis can be defined as a tentative explanation of the research problem, a possible
outcome of the research, or an educated guess about the research outcome." (Sarantakos,
1993: 1991)

Hypotheses are always in declarative sentence form, and they relate, either generally or
specifically, variables to variables.

"An hypothesis is a statement or explanation that is suggested by knowledge or observation
but has not, yet, been proved or disproved." (Macleod Clark J and Hockey L, 1981)

Nature of Hypothesis:

The hypothesis is a clear statement of what is intended to be investigated. It should be
specified before research is conducted and openly stated in reporting the results. This allows
the researcher to: identify the research objectives; identify the key abstract concepts involved
in the research; and identify its relationship to both the problem statement and the literature
review.

A problem cannot be scientifically solved unless it is reduced to hypothesis form. The
hypothesis is a powerful tool for the advancement of knowledge, consistent with existing
knowledge and conducive to further enquiry.

Types of Hypotheses:

NULL HYPOTHESIS: designated by H0 or HN, pronounced "H-oh" or "H-null".

ALTERNATIVE HYPOTHESIS: designated by H1 or HA.

The null hypothesis represents a theory that has been put forward, either because it is
believed to be true or because it is to be used as a basis for argument, but has not been
proved.

The alternative hypothesis is a statement of what a hypothesis test is set up to establish. It is
the opposite of the null hypothesis and is reached only if H0 is rejected. Frequently, the
alternative hypothesis is the actual desired conclusion of the researcher!

Type I and Type II Error

Type I error
When the null hypothesis is true and you reject it, you make a Type I error. The
probability of making a Type I error is α, which is the level of significance you set for
your hypothesis test. An α of 0.05 indicates that you are willing to accept a 5% chance
that you are wrong when you reject the null hypothesis. To lower this risk, you must
use a lower value for α. However, using a lower value for alpha means that you will
be less likely to detect a true difference if one really exists.

Type II error
When the null hypothesis is false and you fail to reject it, you make a Type II error.
The probability of making a Type II error is β, which depends on the power of the test.
You can decrease your risk of committing a Type II error by ensuring your test has
enough power. You can do this by ensuring your sample size is large enough to detect
a practical difference when one truly exists.

The probability of rejecting the null hypothesis when it is false is equal to 1 - β. This
value is the power of the test.
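The following is a hedged sketch (in Python, assuming NumPy and SciPy) simulating the Type I error rate: both samples are drawn from the same population, so the null hypothesis is true and we expect to reject it about α = 5% of the time.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, rejections, trials = 0.05, 0, 2000
for _ in range(trials):
    a = rng.normal(size=30)    # both samples come from the same
    b = rng.normal(size=30)    # population, so H0 is true
    _, p = ttest_ind(a, b)
    if p < alpha:
        rejections += 1        # rejecting a true H0 is a Type I error

print(f"observed Type I error rate: {rejections / trials:.3f}")   # close to 0.05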

Null Hypothesis

Decision       | True                                                               | False
Fail to reject | Correct decision (probability = 1 - α)                             | Type II error: failing to reject the null when it is false (probability = β)
Reject         | Type I error: rejecting the null when it is true (probability = α) | Correct decision (probability = 1 - β)

Standard Error

A standard error is the standard deviation of the sampling distribution of


a statistic. Standard error is a statistical term that measures the accuracy with which a sample
represents a population. In statistics, a sample mean deviates from the actual mean of a
population; this deviation is the standard error.

The standard error is a measure of the variability of a statistic. It is an estimate of the standard
deviation of a sampling distribution. The standard error depends on three factors:

N: The number of observations in the population.
n: The number of observations in the sample.
The way that the random sample is chosen.

If the population size is much larger than the sample size, then the sampling distribution has
roughly the same standard error whether we sample with or without replacement. On the
other hand, if the sample represents a significant fraction (say, 1/20) of the population size,
the standard error will be noticeably smaller when we sample without replacement.
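The following is a minimal sketch (in Python, assuming NumPy) of estimating the standard error of the sample mean as s / sqrt(n); the sample values are illustrative.

import numpy as np

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7])
n = len(sample)
s = sample.std(ddof=1)     # sample standard deviation
se = s / np.sqrt(n)        # standard error of the sample mean
print(f"mean = {sample.mean():.2f}, SE = {se:.3f}")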

Student's t Distribution

The t distribution (also known as Student's t-distribution) is a probability distribution that is
used to estimate population parameters when the sample size is small and/or when the
population variance is unknown.

Why Use the t Distribution?



According to the central limit theorem, the sampling distribution of a statistic (like a sample
mean) will follow a normal distribution, as long as the sample size is sufficiently large.
Therefore, when we know the standard deviation of the population, we can compute a
z-score, and use the normal distribution to evaluate probabilities with the sample mean.

But sample sizes are sometimes small, and often we do not know the standard deviation of
the population. When either of these problems occurs, statisticians rely on the distribution of
the t statistic (also known as the t score), whose values are given by:

t = (x̄ - μ) / (s / sqrt(n))

where x̄ is the sample mean, μ is the population mean, s is the standard deviation of the
sample, and n is the sample size. The distribution of the t statistic is called the t
distribution or the Student t distribution.

The t distribution allows us to conduct statistical analyses on certain data sets that are not
appropriate for analysis using the normal distribution.
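The following is a minimal sketch (in Python, assuming NumPy and SciPy) computing the t statistic from the formula above, together with a two-sided p-value from the t distribution; the sample and the hypothesized mean are illustrative.

import numpy as np
from scipy.stats import t as t_dist

sample = np.array([9.8, 10.2, 10.4, 9.9, 10.1, 10.6, 9.7, 10.3])
mu = 10.0                        # hypothesized population mean

n = len(sample)
t_stat = (sample.mean() - mu) / (sample.std(ddof=1) / np.sqrt(n))
p_value = 2 * t_dist.sf(abs(t_stat), df=n - 1)   # two-sided p-value
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")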

When to Use the t Distribution

The t distribution can be used with any statistic having a bell-shaped distribution (i.e.,
approximately normal). The sampling distribution of a statistic should be bell-shaped if any
of the following conditions apply.

The population distribution is normal.
The population distribution is symmetric, unimodal, without outliers, and the sample
size is at least 30.
The population distribution is moderately skewed, unimodal, without outliers, and the
sample size is at least 40.
The sample size is greater than 40, without outliers.

The t distribution should not be used with small samples from populations that are not
approximately normal.

F-Test

The object of the F-test is to discover whether two independent estimates of the
population variance differ significantly, or whether the two samples may be regarded as
drawn from the same normal population having the same variance. The F-test is based on the
ratio of two variances and hence it is also known as the variance ratio test. A numeric sketch
follows the assumptions below.

Assumptions

1. The values in each group are normally distributed

2. Variance within each group should be equal for all groups

3. Errors should be independent for each value
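The following is a hedged sketch (in Python, assuming NumPy and SciPy) of the variance ratio test described above: the ratio of the two sample variances is compared against the F distribution. The two samples are illustrative.

import numpy as np
from scipy.stats import f as f_dist

a = np.array([21.0, 23.5, 22.1, 24.0, 22.8, 23.2])
b = np.array([20.5, 25.1, 19.8, 26.0, 21.2, 24.4])

var_a, var_b = a.var(ddof=1), b.var(ddof=1)
F = max(var_a, var_b) / min(var_a, var_b)   # larger variance in the numerator
df1 = df2 = len(a) - 1                      # degrees of freedom
p_value = 2 * f_dist.sf(F, df1, df2)        # two-sided p-value
print(f"F = {F:.2f}, p = {p_value:.3f}")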

Chi-square Test:

Chi-square is a statistical test commonly used to compare observed data with data we would
expect to obtain according to a specific hypothesis. For example, if, according to Mendel's
laws, you expected 10 of 20 offspring from a cross to be male and the actual observed
number was 8 males, then you might want to know about the "goodness of fit" between the
observed and expected. Were the deviations (differences between observed and expected) the
result of chance, or were they due to other factors? How much deviation can occur before
you, the investigator, must conclude that something other than chance is at work, causing the
observed to differ from the expected? The chi-square test is always testing what scientists call
the null hypothesis, which states that there is no significant difference between the expected
and observed result.
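The following is a minimal sketch (in Python, assuming SciPy) of the goodness-of-fit chi-square test applied to the Mendel-style example above: 8 males and 12 females observed, versus 10 of each expected under the null hypothesis.

from scipy.stats import chisquare

observed = [8, 12]     # observed counts: males, females
expected = [10, 10]    # counts expected under the null hypothesis

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")   # high p: chance alone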

Characteristics of the chi-square test

1. The test is based on events or frequencies
2. The test is applied in testing hypotheses
3. The test can be used between the entire set of observed and expected frequencies
4. The sum of the differences between the observed and expected frequencies is zero
5. For every increase in the number of degrees of freedom, a new chi-square distribution is
formed
6. It is a general purpose test and as such is highly useful in research

Conditions

1. The total number of observations must be reasonably large. As a rule, the chi-square test
should not be applied when N is less than 50.
2. The data must be expressed in original units.
3. The expected frequency of any cell must not be less than 5; if it is less than 5, then
pool it with an adjacent cell in order to make the frequency 5 or more. It is better to have
every cell frequency at 5 or more.
4. Each observation must be independent of the others.
5. The frequencies used must be absolute, not relative.
6. The chi-square test is dependent on degrees of freedom.
