The linear relationship between two numerical variables is measured by the correlation
coefficient, which can be computed with the cor() function in the R language.
After that we will fit a model, perform model diagnostics, and finally draw inferences.
The response variable (y) is the one we want to predict, and the explanatory
variable (x) is the one that helps in the prediction. For a straight-line model the equation is
y = mx + c, where y is the response variable, x is the explanatory variable, m is the slope, and c is the
y-intercept.
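As a minimal sketch of the equation above, the line can be fitted in R with lm(). The data here are hypothetical and constructed so that y is exactly 2x + 1; the variable names m_hat and c_hat are illustrative and not from the original notes.

```r
# Hypothetical toy data: y is exactly 2 * x + 1.
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 1

# Response on the left of ~, explanatory variable on the right.
fit <- lm(y ~ x)

# coef() returns a named vector with "(Intercept)" and "x".
coefs <- coef(fit)
c_hat <- unname(coefs["(Intercept)"])  # c, the y-intercept
m_hat <- unname(coefs["x"])            # m, the slope

print(c(slope = m_hat, intercept = c_hat))
```

With real data the points will not lie exactly on a line, so the fitted m and c are estimates rather than exact values.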
First, we need to check whether the relationship between the two variables is linear. Then we check the
direction of the association (positive or negative) and its strength, from weak to very strong.
Correlation Properties
The sign of R denotes the direction of the association.
The value of R is always between -1 and +1. R = -1 indicates a perfect negative linear
association and R = +1 a perfect positive one.
R = 0 indicates no linear association; a non-linear relationship may still exist.
R is unitless and is not affected by a change in the center or scale of either variable.
The correlation of X with Y is the same as that of Y with X.
R is sensitive to outliers.
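The properties above can be checked numerically with cor(). This is a small sketch on hypothetical data; the rescaling uses positive scale factors (multiplying one variable by a negative number flips the sign of R), and the outlier at (20, 0) is made up for illustration.

```r
# Hypothetical data with a strong positive linear trend.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

r <- cor(x, y)                        # Pearson correlation (the default)
r_swapped <- cor(y, x)                # symmetry: same as cor(x, y)
r_rescaled <- cor(10 * x + 3, y / 2)  # shift and positive rescale: unchanged
r_flipped <- cor(-x, y)               # reversing x reverses the direction

# One extreme point far from the trend changes R drastically.
r_outlier <- cor(c(x, 20), c(y, 0))

print(c(r = r, swapped = r_swapped, rescaled = r_rescaled,
        flipped = r_flipped, with_outlier = r_outlier))
```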
Residuals
Residuals are the leftovers from the model fit: data = fit + residual, so residual = observed value - predicted value.
The model underestimates or overestimates an observation depending on whether the point lies above or
below the regression line. If the observed value is above the line, the residual is positive and the model
underestimates; if it is below the line, the residual is negative and the model overestimates.
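A short sketch of the residual definition, using hypothetical data. It relies on the standard R functions fitted() and residuals(); the label vector named direction is an illustrative addition.

```r
# Hypothetical data that do not lie exactly on a line.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)
fitted_vals <- fitted(fit)   # predicted values on the regression line
res <- residuals(fit)        # observed minus predicted

# data = fit + residual
recon <- fitted_vals + res

# Positive residual: point above the line, model underestimates.
# Negative residual: point below the line, model overestimates.
direction <- ifelse(res > 0, "underestimate", "overestimate")

print(data.frame(x, y, fitted = fitted_vals, residual = res, direction))
```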
Least squares
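Least squares is a frequently used method for fitting the regression line: it chooses the line that
minimizes the sum of squared residuals, and it is easy to compute.
The intercept is the predicted value of y when x is 0. It may be meaningless in the context of the data and
serve only to adjust the height of the line.
Extrapolation means estimating values outside the range of the original data; such estimates are
unreliable because the linear trend may not continue there.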
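For a single explanatory variable, the least squares line can also be written in closed form: the slope is R times sd(y) / sd(x), and the intercept is mean(y) minus the slope times mean(x). The sketch below checks this against lm() on hypothetical data and flags an extrapolated prediction; the value new_x = 50 is an arbitrary illustration.

```r
# Hypothetical data.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Closed-form least squares estimates for simple linear regression.
slope <- cor(x, y) * sd(y) / sd(x)
intercept <- mean(y) - slope * mean(x)

# The same line fitted by lm().
fit <- lm(y ~ x)

# A prediction far outside the observed x range is an extrapolation.
new_x <- 50
is_extrapolation <- new_x < min(x) || new_x > max(x)
pred <- unname(predict(fit, newdata = data.frame(x = new_x)))

print(c(slope = slope, intercept = intercept))
print(is_extrapolation)
```

R will return a number for pred without any warning, so it is up to the analyst to notice that the prediction lies outside the data.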
R-squared
R-squared describes the strength of the fit of a linear model. For a model with one explanatory variable
it is calculated as the square of the correlation.
It tells us what percentage of the variability in the response variable is explained by the model, that is,
by the explanatory variable. The remainder of the variability is due to variables not included in the model
or to inherent randomness.
It is always between 0 and 1.
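A sketch of three equivalent ways to obtain R-squared for a one-variable model, on hypothetical data: squaring the correlation, reading it from summary(), and computing 1 minus the ratio of residual to total sum of squares.

```r
# Hypothetical data.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)

r2_from_cor <- cor(x, y)^2              # square of the correlation
r2_from_fit <- summary(fit)$r.squared   # value reported by lm

# Share of total variability not left in the residuals.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_from_ss <- 1 - ss_res / ss_tot

print(c(r2_from_cor, r2_from_fit, r2_from_ss))
```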
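For a categorical explanatory variable with two levels, we plug in 0 or 1 for x. The level coded as 0 is
called the reference level; the predicted y for that level is the intercept, and the slope is the difference
in predicted y between the two levels.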
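A minimal sketch of 0/1 coding with a two-level factor. The group names "control" and "treated" and the y values are hypothetical; R codes the first factor level as 0, which makes it the reference level.

```r
# Hypothetical two-group data; "control" is listed first, so it is coded 0.
group <- factor(c("control", "control", "control",
                  "treated", "treated", "treated"),
                levels = c("control", "treated"))
y <- c(10, 12, 11, 15, 17, 16)

fit <- lm(y ~ group)

# The design matrix shows the 0/1 indicator that lm builds from the factor.
indicator <- unname(model.matrix(fit)[, "grouptreated"])

# Intercept: predicted y for the reference level (indicator = 0).
# Slope: difference in predicted y between the two levels.
ref_level <- levels(group)[1]
intercept <- unname(coef(fit)["(Intercept)"])
slope <- unname(coef(fit)["grouptreated"])

print(ref_level)
print(c(intercept = intercept, slope = slope))
```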
A high-leverage point lies far from the center of the x values. It does not necessarily change the slope of
the regression line, whereas an influential point is a high-leverage point that does change the slope. Always
draw the scatter plot with the regression line fitted with and without the point, and ask whether the
outlier is influential.
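The comparison of fits with and without a point can be sketched as follows. The data and the two added points are hypothetical: (20, 40) is far out in x but follows the trend, while (20, 0) is far out in x and off the trend. The helper slope_of() is an illustrative name, not a built-in function.

```r
# Hypothetical base data with slope close to 2.
base <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                   y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2))

# Illustrative helper: slope of the least squares line for a data frame.
slope_of <- function(d) unname(coef(lm(y ~ x, data = d))["x"])

# High leverage, on the trend: the slope barely moves.
with_leverage <- rbind(base, data.frame(x = 20, y = 40))
# High leverage, off the trend: the slope changes a lot (influential).
with_influential <- rbind(base, data.frame(x = 20, y = 0))

slope_base <- slope_of(base)
slope_leverage <- slope_of(with_leverage)
slope_influential <- slope_of(with_influential)

print(c(base = slope_base, leverage = slope_leverage,
        influential = slope_influential))

# Scatter plot with both lines, drawn only in an interactive session.
if (interactive()) {
  plot(with_influential$x, with_influential$y, xlab = "x", ylab = "y")
  abline(lm(y ~ x, data = base))                       # without the point
  abline(lm(y ~ x, data = with_influential), lty = 2)  # with the point
}
```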
Points to be remembered:
1. Always keep in mind the type of data we are working with: population data, a random sample,
or a non-random sample.
2. Statistical inference is unnecessary if we already have the population data, because the population
quantities can be computed directly.
3. If we have a non-random (biased) sample, the inferential results will be
unreliable.