
Correlation

Definition:
1. Correlation describes the relationship between two quantitative variables; it does
not allow causal relationships to be inferred.
2. Correlation is a statistical technique used to determine the degree to which two
variables are related.

Explanation:
A correlation is expressed by a value, called a correlation coefficient, which lies
between -1 and +1. The further the correlation coefficient is from 0, the
stronger the relationship between the variables. A correlation coefficient of 0
means there is no linear relationship between the variables being measured.
An example of two variables that could produce a correlation coefficient of 0 is
illustrated below. In reality the correlation coefficient of these data points will be
very close to zero rather than exactly zero, as it is very rare that there is
absolutely no relationship between variables.

A perfect negative correlation has a coefficient of -1. A negative correlation
means that the high values of one of the data sets are matched with the small
values of the other data set, or as the values for one variable increase the values
for the other decrease, similar to the age and value of a car.

A perfect positive correlation has a coefficient of +1. A positive correlation means that
the high values of one data set are matched with the high values of the other data set,
or as the values for one variable increase so do the values for the other variable.
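
As a rough sketch, the coefficient can be computed directly from paired data. The code below assumes the coefficient meant here is the Pearson product-moment correlation, which these notes do not name explicitly. The function name pearson_r and the car figures are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: car age in years and resale value in thousands.
age = [1, 2, 3, 4, 5]
value = [20, 17, 15, 12, 10]

r_negative = pearson_r(age, value)             # strong negative, close to -1
r_positive = pearson_r(age, [2, 4, 6, 8, 10])  # exactly linear, so +1
print(round(r_negative, 3), round(r_positive, 3))
```

The sign of the result gives the direction of the relationship and its distance from 0 gives the strength.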

When interpreting correlation coefficients it is important to remember two points:

(a) The measurement is correlation, not causation. That two variables are correlated
does not mean that one causes the other.

(b) It is worth examining the pattern of the scatter plot to see if there is anything odd
about the data sets. The following chart shows a correlation coefficient of +0.5, but if
the last data point were removed the correlation coefficient would increase to +1.

It would seem from the chart that the last data point is suspect: it is either very
atypical or even a mistake, i.e. it is an outlier.
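
The effect of a single outlier can be sketched numerically. The data below are invented to mimic the situation described (they are not the values from the chart), and pearson_r is a helper written for this illustration, assuming the Pearson coefficient is the one intended.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (helper defined for this illustration)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the first five points lie exactly on y = 2x,
# and the last point (6, 4) falls well below that line.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 4]

r_all = pearson_r(x, y)                # roughly 0.5 with the outlier included
r_trimmed = pearson_r(x[:-1], y[:-1])  # +1 once the last point is dropped
print(round(r_all, 2), round(r_trimmed, 2))
```

Dropping a point in this way is only a diagnostic; whether the point is a mistake or a genuine observation still has to be judged from the data source.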

Regression:
Calculates the best-fit line for a given set of data.
The regression line makes the sum of the squares of the residuals smaller than
for any other line.
Regression Analysis
Regression: a technique concerned with predicting some variables by
knowing others.
The process of predicting variable Y using variable X.
The regression equation describes the regression line mathematically, in terms of its
intercept and slope.
Predicting from linear relationships
The simplest type of statistical prediction uses linear correlation between a variable of
interest and another variable that either directly affects it, or is at least correlated with it
(the explanatory variable).
Consider a scatter plot that shows the size of ten towns along the bottom (x-axis)
and the number of dentists in each of those towns up the side (y-axis).

So in the first town there is a population of 50,000 people and a total of 6 dentists. If
these two variables were perfectly correlated it would be very easy to predict how many
dentists would be in any size town. It would simply be a matter of tracing a vertical line
from the town size on the x-axis up to the line passing through all the points and
then taking a horizontal line from there over to the y-axis.

Unfortunately our variables are not perfectly correlated. So instead of using a line which
joins all the points we use what is called a least squares regression line. This is the
line that passes as close as possible to the data points, in the sense that it minimises
the sum of the squared vertical distances between the points and the line. This line is
also called the line of best fit and is illustrated using our original dentist/population
chart below.
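
A minimal sketch of how such a line can be found for one explanatory variable, using the standard closed-form least squares formulas. Only the first town (50,000 people, 6 dentists) comes from the text; the other nine towns, and the function names fit_line and sum_squared_residuals, are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical towns: population in thousands and number of dentists.
population = [50, 60, 80, 100, 120, 150, 170, 200, 220, 250]
dentists = [6, 5, 9, 10, 15, 14, 20, 19, 26, 25]

intercept, slope = fit_line(population, dentists)
best_ssr = sum_squared_residuals(population, dentists, intercept, slope)

# Any other line, such as one with a slightly steeper slope, has a larger sum.
other_ssr = sum_squared_residuals(population, dentists, intercept, slope + 0.01)
print(round(intercept, 3), round(slope, 4), best_ssr < other_ssr)
```

Squaring the distances stops positive and negative residuals from cancelling and penalises large misses more heavily than small ones.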

This line can also be represented by an equation known as the regression equation, or
model. This is a way of describing numerically the relationship between population size
and number of dentists.
Provided this model is reasonably accurate when the predictions are compared to the
actual values, then the unknown number of dentists in any similar town can be predicted
(or projected, or forecast) from its population size in this way.
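
Prediction from the regression equation, and the check against actual values, might look like the sketch below. The coefficients INTERCEPT and SLOPE are hypothetical values chosen for illustration (the notes do not give the fitted equation), and the town data apart from the first town are invented.

```python
# Hypothetical regression equation (not taken from the notes):
#   dentists = INTERCEPT + SLOPE * population, with population in thousands.
INTERCEPT = 0.2
SLOPE = 0.105

def predict_dentists(population_thousands):
    """Apply the regression equation to a town of the given size."""
    return INTERCEPT + SLOPE * population_thousands

def mean_absolute_error(xs, ys):
    """Average size of the gap between predicted and actual values."""
    errors = [abs(y - predict_dentists(x)) for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

# Hypothetical towns: population in thousands and actual number of dentists.
population = [50, 60, 80, 100, 120, 150, 170, 200, 220, 250]
dentists = [6, 5, 9, 10, 15, 14, 20, 19, 26, 25]

mae = mean_absolute_error(population, dentists)
forecast = predict_dentists(130)  # a similar town of 130,000 people
print(round(mae, 2), round(forecast, 2))
```

A prediction of this kind is only reasonable for towns similar to those used to build the model, for example within the same range of population sizes.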
Many relationships that one would wish to model in order to generate useful predictions
are, however, more complex than in this simple illustration. The number of dentists that
set up practices in different towns will, in reality, be governed by a whole suite of factors
such as the local economy and proximity to a neighbouring town, so the resulting regression
equation would contain an additional term for each of these variables. The multipliers
on these terms are known as regression coefficients.
Evaluating the model
The computer output for a regression model will include confidence intervals around the
estimates for the prediction and each regression coefficient, which allow us to judge the
reliability of these statistics.
The success of the regression model in describing the relationship or system can also
be judged by means of the R² (R-squared) statistic, which describes the proportion of the
total variability in the system that is explained by the model. This is expressed either as a
proportion or a percentage of the variation that is explained by the model (68% in the
example above).
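
One common way to compute this statistic is as 1 minus the ratio of the residual sum of squares to the total sum of squares, sketched below. The town data and the line used for the predictions are hypothetical, so the result does not reproduce the 68% quoted above; r_squared is a name chosen for this illustration.

```python
def r_squared(actual, predicted):
    """Proportion of the variability in `actual` accounted for by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    # Total variability of the observed values around their mean.
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    # Variability left unexplained by the model.
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total

# Hypothetical towns: population in thousands and actual number of dentists.
population = [50, 60, 80, 100, 120, 150, 170, 200, 220, 250]
dentists = [6, 5, 9, 10, 15, 14, 20, 19, 26, 25]

# Predictions from an assumed line close to the least squares fit for these data.
predicted = [0.2 + 0.105 * x for x in population]

r2 = r_squared(dentists, predicted)
print(f"{r2:.2f} as a proportion, {r2:.0%} as a percentage")
```

A value of 1 means the predictions match the data exactly, while a value of 0 means the model does no better than always predicting the mean.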
More of the variability in the data will be explained as more and more variables that truly
influence it are added to the analysis. However, increasing the complexity of the
analysis is often at the expense of its usefulness: complex models are harder to
understand, and more difficult to apply to generate useful predictions, because the
calculations are difficult, or because some of the explanatory variables are difficult to
measure.
So the goal when using regression modelling to generate useful predictions is to
achieve the best compromise between maximising the reliability and explanatory power
of the model, and minimising the number of variables that need to be used to achieve a
reliable model.
The success of the predictions generated by any model can only be determined
empirically, in other words by seeing how accurate they are in real life. In the case of a
spatial prediction concerning a statistic for locations where it would not normally be
measured, we can go to these locations and measure its actual value to test the
prediction. A prediction for some point in the future can be tested when it is eventually
measured at that time, but a backwards (retrospective) projection in time is only testable
if there is independent evidence of its value in the past.
