
CORRELATION AND REGRESSION

SIMPLE LINEAR REGRESSION
WHAT IS SIMPLE LINEAR REGRESSION?

• Simple linear regression is a statistical method that allows us to summarize and study the relationship between two continuous (quantitative) variables.
• One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
• The other variable, denoted y, is regarded as the response, outcome, or dependent variable.
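In symbols, the relationship described above is usually written as a straight line plus a random error. The block below is a sketch in standard least-squares notation; the symbols \beta_0, \beta_1, \varepsilon, \bar{x}, \bar{y} and n are not defined in these slides and are introduced here as assumptions.

```latex
% Simple linear regression model for observations i = 1, ..., n
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

% Least-squares estimates of the slope and intercept
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```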
EXAMPLE 1

The data set records house prices (in thousands) and house size in square feet (the variable SquareFit). It consists of 10 observations.
COMMANDS
1. Step one: Key in data
title 'Simple Linear Regression for House Price';
data House;
input HousePrice SquareFit @@;
datalines;
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
;
run;
proc print data=House;
run;
2. Step two

ods graphics on;

proc reg;
model Houseprice = Squarefit;
run;

ods graphics off;

The ODS GRAPHICS statement enables ODS to create graphics. By default, ODS Graphics is not enabled. You can enable ODS Graphics by using either of the following equivalent statements:

ODS GRAPHICS ON;
ODS GRAPHICS;
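The step-two call above relies on PROC REG picking up the most recently created data set. A slightly more explicit variant, offered only as a sketch, names the data set and ends the interactive procedure with QUIT. The CLB option (confidence limits for the estimates) is an optional extra that the slides do not use.

```sas
ods graphics on;

/* Name the input data set explicitly instead of relying on the last one created */
proc reg data=House;
   /* CLB adds confidence limits for the parameter estimates (optional) */
   model HousePrice = SquareFit / clb;
run;
quit;   /* PROC REG is interactive; QUIT ends it */

ods graphics off;
```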
Assumptions

i. Constant variance
Based on the figure above, the error terms are scattered randomly, so homoscedasticity holds. Therefore, the errors have constant variance.

ii. Normality
The histogram is bell-shaped, which indicates that the errors are normally distributed.
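The two checks above are read from the diagnostic plots. As a rough sketch of how to examine the residuals directly, they can be saved with an OUTPUT statement and passed to PROC UNIVARIATE. The data set name HouseOut and the variable names Pred and Resid are hypothetical names chosen for this example.

```sas
/* Save predicted values and residuals (HouseOut, Pred, Resid are made-up names) */
proc reg data=House noprint;
   model HousePrice = SquareFit;
   output out=HouseOut p=Pred r=Resid;
run;
quit;

/* NORMAL requests normality tests; HISTOGRAM and QQPLOT draw the plots */
proc univariate data=HouseOut normal;
   var Resid;
   histogram Resid / normal;
   qqplot Resid / normal(mu=est sigma=est);
run;
```

With only 10 observations, formal normality tests have little power, so the output is best treated as a rough guide alongside the plots.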

The model is a good fit because it fulfils all the assumptions.


There is a moderate positive linear relationship (r = 0.7621) between house price ('000) and square feet.
Based on the ANOVA table above, the model is significant because the p-value (0.0104) is less than the default α = 0.05.
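The correlation quoted above can be reproduced with PROC CORR. This is a minimal sketch that assumes the House data set from step one is still available in the session.

```sas
/* Pearson correlation between house price and size, with its p-value */
proc corr data=House;
   var HousePrice SquareFit;
run;
```

With a single predictor, the p-value of this correlation test matches the p-value of the ANOVA F test for the regression.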
The estimated model is:

HousePrice = 98.24833 + 0.10977 SquareFit
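As a worked illustration of how the fitted line is used, the calculation below predicts the price of a house with SquareFit = 2000. The value 2000 is an arbitrary choice made for this example (it lies inside the observed range of 1100 to 2450); it is not taken from the slides.

```latex
\widehat{\text{HousePrice}} = 98.24833 + 0.10977 \times 2000
                            = 98.24833 + 219.54
                            = 317.78833
```

Since prices are recorded in thousands, this corresponds to a predicted price of about 317.8 thousand. The slope means that each additional unit of SquareFit adds about 0.10977 thousand to the predicted price.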


58.08% of the total variation in house price is explained by square feet, while the remaining 41.92% is not explained by the model and may be due to other variables.
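In simple linear regression the coefficient of determination is the square of the correlation coefficient, which ties the 58.08% figure to the r reported earlier. A short check using the rounded values from the output:

```latex
R^2 = r^2 = (0.7621)^2 \approx 0.5808,
\qquad
1 - R^2 \approx 1 - 0.5808 = 0.4192
```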
MULTIPLE LINEAR REGRESSION

WHAT IS MULTIPLE LINEAR REGRESSION?

• A multiple regression model with \(Y\) as the dependent variable and \(X_1, X_2, \ldots, X_k\) as independent variables is written as
\(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon\)
• \(\beta_0\): constant term
\(\beta_1, \beta_2, \ldots, \beta_k\): regression coefficients of \(X_1, X_2, \ldots, X_k\)
\(\varepsilon\): random error term
• The model above is called a first-order multiple regression model, since each term contains a single independent variable raised to the first power.
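The coefficients of this model are usually estimated by ordinary least squares. The matrix form below is a sketch in standard notation that the slides do not introduce: \mathbf{y} is the n x 1 vector of responses and \mathbf{X} is the n x (k+1) design matrix whose first column is all ones.

```latex
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y},
\qquad
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k
```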
EXAMPLE 2

The data set describes sales of oil and contains 10 observations.

Sales = Sales of oil (in $100s)
Price = Price of oil per gallon
Advertising = Advertising (in $100s)
COMMANDS
1. Step one: Key in data
data work.sales1;
input Sales Price Advertising;
cards;
10 1.3 9
6 2 7
5 1.7 5
12 1.5 14
10 1.6 15
15 1.2 12
5 1.5 6
12 1.4 10
17 1 15
20 1.1 21
;
run;
proc print data=work.sales1;
run;
2. Step two: Testing the correlation coefficients

proc corr data = work.sales1;
var Price Advertising;
with Sales;
run;
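PROC CORR reports the Pearson correlation of Sales with each variable in the VAR list. As a hedged hand check, the formula below is applied to the 10 observations listed in step one. The sums of squares and cross products are computed by hand from that data, so the results should be treated as approximate and compared against the SAS output.

```latex
r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}

% Sales and Price
r = \frac{-11.36}{\sqrt{0.801 \times 233.6}} \approx -0.83

% Sales and Advertising
r = \frac{203.2}{\sqrt{222.4 \times 233.6}} \approx 0.89
```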

3. Step three: Plotting the scatter plot matrix

proc sgscatter;
matrix Sales Price Advertising;
run;
The scatter plot matrix shows a strong negative relationship between Sales and Price and a strong positive relationship between Sales and Advertising.
Plotting scatter plots

ods graphics on;

proc reg data = work.sales1;
model Sales = Price Advertising;
plot Sales*(Price Advertising) / pred;
run;

proc sgplot data=work.sales1;
reg y=sales x=price;
run;

proc sgplot data=work.sales1;
reg y=sales x=advertising;
run;
4. Step four

proc reg plots=all;
model Sales = Price Advertising;
run;
ASSUMPTIONS

• Normality

Based on the histogram above, the residuals have a bell-shaped form, which indicates that the errors are normally distributed.

Based on the Q-Q plot, most of the points lie approximately on the straight line. Therefore, we can conclude that the errors are normally distributed.
• Constant variance

Based on the plot of residuals versus predicted values, the errors have constant variance: the residuals are scattered randomly around zero and show no pattern.

Therefore, the model is a good fit since the assumptions are fulfilled.
Plotting residuals vs predicted values

ods graphics on;

proc reg;
model Sales = Price Advertising;
plot r.*p.;
run;
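The PLOT statement above draws the residual plot inside PROC REG. An alternative sketch saves the residuals and draws the same kind of plot with PROC SGPLOT. The names SalesOut, Pred and Resid are hypothetical and chosen only for this example.

```sas
/* Save predicted values and residuals (SalesOut, Pred, Resid are made-up names) */
proc reg data=work.sales1 noprint;
   model Sales = Price Advertising;
   output out=work.SalesOut p=Pred r=Resid;
run;
quit;

/* Residuals against predicted values, with a reference line at zero */
proc sgplot data=work.SalesOut;
   scatter x=Pred y=Resid;
   refline 0 / axis=y;
run;
```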
Both variables are significant because their p-values are less than α = 0.05.

The estimated model:

Sales = 15.03055 - 7.67389 Price + 0.62659 Advertising

\(\hat{\beta}_0 = 15.03055\): if price and advertising are both zero, the estimated sales are 15.03055 (in $100s).

\(\hat{\beta}_1 = -7.67389\): when the price per gallon increases by 1 unit, sales decrease by 7.67389 (in $100s), holding advertising constant.

\(\hat{\beta}_2 = 0.62659\): when advertising increases by 1 unit ($100), sales increase by 0.62659 (in $100s), holding price constant.
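To show how the fitted equation is used, the calculation below predicts sales for Price = 1.5 and Advertising = 10. Both input values are arbitrary choices for illustration that fall within the observed data range; they do not come from the slides.

```latex
\widehat{\text{Sales}} = 15.03055 - 7.67389 \times 1.5 + 0.62659 \times 10
                       = 15.03055 - 11.510835 + 6.2659
                       = 9.785615
```

Since Sales is recorded in $100s, this is a predicted value of roughly $979.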
Based on the ANOVA table above, the p-value (0.0002) is less than α = 0.05, so the model is significant.
91.82% of the total variation in Sales is explained by the price of oil and advertising, while the remaining 8.18% is not explained by the model and may be due to other factors.
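Because R^2 never decreases when predictors are added, the adjusted R^2 is often reported alongside it in multiple regression. The sketch below uses the rounded R^2 = 0.9182 with n = 10 observations and k = 2 predictors. The adjusted value and the F statistic are computed here by hand and are not quoted from the SAS output in the slides.

```latex
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
                 = 1 - 0.0818 \times \frac{9}{7}
                 \approx 0.8948

F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}
  = \frac{0.9182 / 2}{0.0818 / 7}
  \approx 39.3
```

An F value of this size on (2, 7) degrees of freedom is consistent with the small p-value reported in the ANOVA table.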
