
Week 1 Regression Techniques

The relationship between any two numerical variables is quantified by the
correlation, computed with cor() in R.
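As a minimal sketch (the data here are made up for illustration):

```r
# Two made-up numerical variables
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Correlation coefficient R between x and y
r <- cor(x, y)
r  # close to +1: a strong positive linear association
```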

Later we will build models, perform model diagnostics, and finally draw inferences.

The response variable (y) is the one we are going to predict, and the explanatory
variable (x) is the one that helps in the prediction. Generally the equation is
y = mx + c, where y is the response variable, x is the explanatory variable, m is the slope and c is the
y-intercept.

First of all, we need to identify whether the relationship between the two variables is
linear. Then we check the direction of the linear association and its strength, from weak to very strong.

Correlation is denoted R. The magnitude, or absolute value, of the correlation coefficient R
defines the strength of the linear association between the two variables.

Correlation Properties
The sign of R denotes the direction of the association.
R is always between -1 and +1; -1 indicates a perfect negative linear association,
and +1 a perfect positive one.
R = 0 indicates no linear correlation.
R is unitless and is not affected by changes in the center or scale of either variable.
The correlation of X with Y is the same as that of Y with X.
R is sensitive to outliers.

Residuals

The residuals are the leftovers from the model fit: data = fit + residual.

The equation is e_i = y_i - ŷ_i (residual = observed - predicted).

The model may overestimate or underestimate depending on whether the observed value falls
above or below the regression line. If the observed value is above the line, the model
underestimates it, and if below, the model overestimates it.
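In R, residual = observed - predicted can be checked directly against residuals() (data invented for illustration):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
fit <- lm(y ~ x)

# residual = observed - predicted
e <- y - predict(fit)

# identical to the residuals stored in the model object
all.equal(unname(e), unname(residuals(fit)))
```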

Least squares

This is the most frequently used method for fitting the line, and it is easy to use.

The slope is calculated as m = R × sd(y)/sd(x), where R is the correlation coefficient.


The regression line always passes through the point of means (x̄, ȳ). The intercept
value is given by c = ȳ - m·x̄.

The intercept is the value of y when x is 0. It may be meaningless in the context of the
data and only serve to adjust the height of the line.

The slope is the expected change in y when x increases by 1 unit.
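These formulas can be verified against lm() (toy data, assumed for illustration):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(3, 5, 4, 6, 8)

m  <- cor(x, y) * sd(y) / sd(x)  # slope = R * sd(y)/sd(x)
c0 <- mean(y) - m * mean(x)      # line passes through (mean(x), mean(y))

fit <- lm(y ~ x)
coef(fit)  # intercept and slope match c0 and m
```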

A sample equation is: % poverty = 64.28 - 3.26 × (% high school grad).


Prediction and extrapolation
Prediction is simply plugging the x value into the equation and finding the y value.

Extrapolation is estimating values outside of the realm of the original data; such estimates are unreliable.
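In R, prediction is predict() on the fitted model; x values outside the observed range are extrapolation (toy data):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
fit <- lm(y ~ x)

# Prediction: x = 3.5 lies inside the observed range [1, 5]
predict(fit, newdata = data.frame(x = 3.5))

# Extrapolation: x = 50 is far outside the data; the estimate is unreliable
predict(fit, newdata = data.frame(x = 50))
```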

The conditions (assumptions) for linear regression are:


1. Linearity: the relationship between the explanatory variable and the response variable
should be linear. Check the scatterplot of the two variables, and check that the
residuals are scattered randomly around the zero line in the residuals plot.
2. Nearly normal residuals: the residuals should be nearly normally distributed around 0.
This condition may not be satisfied if the dataset contains outliers that do not
follow the trend of the rest of the data. Check with a histogram or a normal
probability (Q-Q) plot of the residuals.
3. Constant variability: the variability of the residuals should be roughly constant
across all values of the explanatory variable. Check the residuals plot for fanning
or changing spread.
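These checks are usually done visually on the fitted model, e.g. (simulated data):

```r
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)

hist(residuals(fit))                   # nearly normal residuals?
plot(fitted(fit), residuals(fit))      # linearity and constant variability
abline(h = 0, lty = 2)
qqnorm(residuals(fit)); qqline(residuals(fit))  # normal probability (Q-Q) plot
```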

R-squared
It explains the strength of the fit of the model and, for simple linear regression, is
calculated as the square of the correlation. It tells us what percent of the variability
in the response variable is explained by the explanatory variable. The remainder of the
variability is due to variables not included in the model. R-squared is always between 0 and 1.
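For simple linear regression, R-squared is literally the square of the correlation coefficient:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
fit <- lm(y ~ x)

r2 <- summary(fit)$r.squared
all.equal(r2, cor(x, y)^2)  # TRUE
```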

Categorical explanatory variable

For a two-level categorical explanatory variable we plug in 0 or 1. The level coded
as 0 is called the reference level.

poverty = 11.35 + 0.38 region:west
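A sketch with an invented two-level predictor; the numbers are chosen so the fit reproduces the equation above. R codes the reference level ("east" here) as 0 automatically:

```r
# Invented poverty rates by region; "east" is the reference level
region  <- factor(c("east", "east", "west", "west"))
poverty <- c(11.2, 11.5, 11.6, 11.86)

fit <- lm(poverty ~ region)
coef(fit)
# (Intercept) = 11.35, the mean for "east" (plugging in 0)
# regionwest  =  0.38, the difference in means for "west" (plugging in 1)
```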

Week 2 Outliers in Linear Regression


There are two types of outliers: 1. leverage points, and 2. influential points.

A leverage point does not tend to change the slope of the regression line, whereas an
influential point does. Always make a scatterplot showing the regression line with and
without the point, and ask whether the outlier is influential or not.
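A quick check for influence is to refit without the suspect point and compare slopes (invented data):

```r
set.seed(4)
x <- c(1:10, 20)
y <- c(2 * (1:10) + rnorm(10, sd = 0.1), 5)  # last point breaks the trend

slope_with    <- coef(lm(y ~ x))[2]
slope_without <- coef(lm(y[-11] ~ x[-11]))[2]
c(slope_with, slope_without)  # a large change marks the point as influential
```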

Inference for LR:

Null hypothesis H0: the explanatory variable is not a significant predictor of the
response variable. In other words, there is no relationship between the explanatory
variable and the response variable, so the slope is zero (b1 = 0).
The alternative hypothesis HA is the opposite of H0: there is a relationship, and the
slope is non-zero (b1 ≠ 0).
A t-statistic is used to test the slope, and hence the hypothesis.

The t value is given in the output of summary() on the lm() object.


A 95% confidence interval for the slope gives a plausible range for the effect of the
explanatory variable on the response variable. The formula is b1 ± t*_df × SE_b1, where
SE_b1 is the standard error of the slope, given in the output of the linear model object.
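Both the t value and the confidence interval can be read off the fitted model (simulated data; confint() applies b1 ± t* × SE internally):

```r
set.seed(2)
x <- 1:20
y <- 3 + 1.5 * x + rnorm(20)
fit <- lm(y ~ x)

s <- summary(fit)
s$coefficients["x", "t value"]   # t statistic for the slope

# 95% CI: b1 +/- t* x SE(b1), with df = n - 2
confint(fit, "x", level = 0.95)
```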

Points to be remembered:
1. Always remember the type of data we are working with, be it population data, a
random sample, or a non-random sample.
2. Statistical inference is meaningless if we already have the population data.
3. If we have a non-random sample (biased) then the output inferential results will be
unreliable.

Week 3 Multiple Regression


The response variable can be explained or predicted by more than one explanatory
variable. The explanatory variables may be numerical or categorical, or a mix of both.
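A minimal multiple regression sketch with one numerical and one categorical explanatory variable (all data simulated):

```r
set.seed(3)
n  <- 50
x1 <- rnorm(n)                                        # numerical predictor
x2 <- factor(sample(c("a", "b"), n, replace = TRUE))  # categorical predictor
y  <- 1 + 2 * x1 + (x2 == "b") + rnorm(n)

fit <- lm(y ~ x1 + x2)
coef(fit)  # one coefficient per explanatory variable, plus the intercept
```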

Later we will do model inference and model selection.
