
MTE3105 Statistics

Topic 6

Linear Regression

6.1 Synopsis

This topic discusses the relationship between two variables by using scatter diagrams and regression analysis. In scatter diagrams, we can investigate the relationship between two variables by looking at how the pairs of values are distributed in a graph. By using a regression model, we can evaluate the magnitude of change in one variable due to a certain change in another variable. For example, an economist can estimate the amount of change in food expenditure due to a certain change in the income of a household by using the regression model. A sociologist may want to estimate the increase in the crime rate due to a particular increase in the unemployment rate. Besides answering these questions, a regression model also helps to predict the value of one variable for a given value of another variable. For example, by using the regression line, we can predict the (approximate) crime rate of a city with a given unemployment rate.

6.2 Learning Outcomes

1. Understand the concept of dependent and independent variables.
2. Interpret scatter diagrams for bivariate data and draw the line of best fit.
3. Understand the concept of linear regression.
4. Calculate the equation of linear regression by using the method of least squares and use it to estimate values by interpolation and extrapolation.


6.3 Conceptual Framework

6.4 Dependent and Independent Variables

Let's say that an economist wishes to investigate the relationship between food expenditure and income. What factors or variables does a household have to consider when deciding how much money it should spend on food every week or month? Certainly, the income of the household is one factor. However, there are many other variables that also affect food expenditure. For instance, the size of the household, the preferences and tastes of household members and any special dietary needs of the household are some of the variables that influence a household's decision on food expenditure. These variables are called independent or explanatory variables because they all vary independently and they explain the variation in food expenditure among different households. In other words, these variables explain why different households spend different amounts of money on food. Meanwhile, food expenditure is called the dependent variable because its value depends on the independent variables.

6.5 Scatter Diagram

A scatter diagram is basically a plot of paired observations. Pairs of values (x, y) are plotted against the horizontal and vertical axes on graph paper, using appropriate scales. It is a useful way of examining the relationship between two variables: it gives an early impression of the relationship and thus enables the researcher to draw an early conclusion about it. The following diagrams show the various relationships between two variables.

If all points seem to lie near a line, the correlation is called linear correlation (diagrams (a), (b), (c) and (d)). If y tends to increase as x increases, the correlation is called positive correlation, as in diagrams (a) and (b). If y tends to decrease as x increases, the correlation is called negative correlation, as in diagrams (c) and (d). If no relationship is shown between the two variables, then there is no correlation between them, as in diagram (e).

(a) Perfect positive linear relationship

(b) Positive linear relationship

(c) Perfect negative linear relationship

(d) Negative linear relationship

(e) No relationship
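As a rough numerical companion to the five diagrams, the sketch below labels the direction of a relationship from the sample correlation coefficient r. The function name `describe_correlation` and the cut-off of 0.1 for "no relationship" are assumptions made purely for illustration; they are not part of this module, and the code assumes neither variable is constant.

```python
import math


def describe_correlation(xs, ys, weak_cutoff=0.1):
    """Label the linear relationship in paired data, as in diagrams (a) to (e).

    weak_cutoff is an arbitrary illustrative threshold, not a standard value.
    Assumes xs and ys each contain at least two distinct values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross products about the means
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    ss_yy = sum((y - mean_y) ** 2 for y in ys)
    r = ss_xy / math.sqrt(ss_xx * ss_yy)  # sample correlation coefficient

    if abs(r) < weak_cutoff:
        return "no relationship"
    direction = "positive" if r > 0 else "negative"
    if math.isclose(abs(r), 1.0, abs_tol=1e-9):
        return "perfect " + direction
    return direction


print(describe_correlation([1, 2, 3, 4], [2, 4, 6, 8]))
print(describe_correlation([1, 2, 3, 4], [8, 7, 4, 1]))
print(describe_correlation([1, 2, 3, 4], [2, 1, 1, 2]))
```

A label of this kind only summarises direction; the scatter diagram itself remains the better check on whether a straight line is a sensible description of the data.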

6.6 Linear Regression

When we study the effect of two or more independent variables on a dependent variable using regression analysis, it is called multiple regression. However, if we choose only one (usually the most important) independent variable and study the effect of that single variable on a dependent variable, it is called simple regression. Thus, a simple regression includes only two variables: one independent and one dependent. Note that whether it is a simple or a multiple regression analysis, it always includes one and only one dependent variable. It is the number of independent variables that differs between simple and multiple regression.

6.6.1 Simple Linear Regression

A regression model is a mathematical equation that describes the relationship between two or more variables. A simple regression model includes only two variables: one independent and one dependent. The dependent variable is the one being explained, and the independent variable is the one used to explain the variation in the dependent variable. Thus a simple linear regression model is a model that gives a straight-line relationship between two variables.

We know that the correlation coefficient can be used to measure the strength of the linear relationship between two variables; however, it cannot be used to make estimates or forecasts of the variables. To overcome this weakness, the most suitable line is drawn on a scatter diagram, and this line is called the regression line. In your previous work, you may have attempted to draw a line called a line of best fit by the Eye Method on the scatter diagram. Drawing a line of best fit by the Eye Method means placing it so that there are as many points above the line as below it, or as many points to the left of the line as to the right of it. It is the line that has the least total deviation from the actual data points: if you add up the distances between the points and the line, the total should be as small as possible. In other words, the line of best fit is drawn through the graph so that it divides the points on the scatter plot evenly. The line should also pass through the point ($\bar{x}$, $\bar{y}$), whose coordinates are the means of the two sets of data.

[Figure: Line of best fit drawn by "eye", passing through the mean point ($\bar{x}$, $\bar{y}$)]


However, drawing a line by the "eye" method is rather haphazard. There is a mathematical way of fitting a regression line, known as the method of least squares, as illustrated below.

6.6.2 Using formulae to find the equation of the least squares regression line

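For the least squares regression line $\hat{y} = a + bx$, the least squares values of a and b are computed using the following formulas:

\[ b = \frac{SS_{xy}}{SS_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}, \]

where $\bar{y}$ and $\bar{x}$ are the means of y and x,

\[ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \quad \text{and} \quad SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \]

and n is the number of pairs of observations. SS stands for "sum of squares", and the least squares regression line $\hat{y} = a + bx$ is called the regression of y on x.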
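The formulas above translate almost line for line into code. The following is a minimal sketch in plain Python; the function name `least_squares_line` and the small data set are illustrative assumptions, not part of the module.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # SSxy = sum(xy) - (sum x)(sum y)/n  and  SSxx = sum(x^2) - (sum x)^2/n
    ss_xy = sum_xy - sum_x * sum_y / n
    ss_xx = sum_x2 - sum_x ** 2 / n

    b = ss_xy / ss_xx               # slope
    a = sum_y / n - b * sum_x / n   # intercept: y_bar - b * x_bar
    return a, b


# Made-up points lying exactly on y = 1 + 2x, so the fit should recover a = 1, b = 2
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Using points that lie exactly on a known line is a quick way to check that the formulas have been entered correctly before applying them to real data.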

EXAMPLE 1. Find the least squares regression line for the data on incomes and food expenditures of the seven households given in Table 1 below. Use income as the independent variable and food expenditure as the dependent variable.

Table 1: Incomes and Food Expenditures of Seven Households

Income              Food Expenditure
(hundreds of RM)    (hundreds of RM)
35                  9
49                  15
21                  7
39                  11
15                  5
28                  8
25                  9

Solution:
We will now find the values of a and b for the regression model $\hat{y} = a + bx$.

Table 2 shows the calculations required for the computation of a and b. We denote the independent variable (income) by x and the dependent variable (food expenditure) by y.

Table 2

Income      Food Expenditure
x           y                   xy             x²
35          9                   315            1225
49          15                  735            2401
21          7                   147            441
39          11                  429            1521
15          5                   75             225
28          8                   224            784
25          9                   225            625
Σx = 212    Σy = 64             Σxy = 2150     Σx² = 7222
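The columns of Table 2 can be rebuilt from the raw data in Table 1. The sketch below is one way to do so; the variable names (`incomes`, `food`, `rows`) are illustrative assumptions.

```python
# Data from Table 1 (both variables in hundreds of RM)
incomes = [35, 49, 21, 39, 15, 28, 25]   # x
food = [9, 15, 7, 11, 5, 8, 9]           # y

# One row per household: (x, y, xy, x squared)
rows = [(x, y, x * y, x * x) for x, y in zip(incomes, food)]

# Column totals
sum_x = sum(r[0] for r in rows)
sum_y = sum(r[1] for r in rows)
sum_xy = sum(r[2] for r in rows)
sum_x2 = sum(r[3] for r in rows)

for row in rows:
    print("{:>4} {:>4} {:>6} {:>6}".format(*row))
print("totals:", sum_x, sum_y, sum_xy, sum_x2)
```

The totals agree with the column sums in Table 2: 212, 64, 2150 and 7222.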

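The following steps are performed to compute a and b.

STEP 1. Compute $\sum x$, $\sum y$, $\bar{x}$ and $\bar{y}$.

\[ \sum x = 212, \qquad \sum y = 64 \]

\[ \bar{x} = \frac{\sum x}{n} = \frac{212}{7} = 30.2857, \qquad \bar{y} = \frac{\sum y}{n} = \frac{64}{7} = 9.1429 \]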

STEP 2. Compute $\sum xy$ and $\sum x^2$.

To calculate $\sum xy$, we multiply the corresponding values of x and y and then sum all the products. The products of x and y are recorded in the third column of Table 2. To compute $\sum x^2$, we square each of the x values and then add them. The squared values of x are listed in the fourth column of Table 2. From these calculations,

\[ \sum xy = 2150 \quad \text{and} \quad \sum x^2 = 7222. \]

STEP 3. Compute $SS_{xy}$ and $SS_{xx}$.

\[ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 2150 - \frac{(212)(64)}{7} = 211.7143 \]

\[ SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 7222 - \frac{(212)^2}{7} = 801.4286 \]

STEP 4. Compute a and b.

\[ b = \frac{SS_{xy}}{SS_{xx}} = \frac{211.7143}{801.4286} = 0.2642 \]

\[ a = \bar{y} - b\bar{x} = 9.1429 - (0.2642)(30.2857) = 1.1414 \]

Thus, our estimated regression model $\hat{y} = a + bx$ is

\[ \hat{y} = 1.1414 + 0.2642x. \]
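The arithmetic of Steps 3 and 4 can be checked with a few lines of Python. This sketch starts from the column totals of the worked example and, as an assumption chosen to mirror the hand calculation, rounds b to four decimal places before computing a. The variable names are illustrative.

```python
n = 7
sum_x, sum_y = 212, 64
sum_xy, sum_x2 = 2150, 7222

# Step 3: sums of squares
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n

# Step 4: slope rounded to 4 decimal places first, as in the hand calculation
b = round(ss_xy / ss_xx, 4)
a = round(sum_y / n - b * sum_x / n, 4)

# Intercept when the unrounded slope is carried through instead
a_full = sum_y / n - (ss_xy / ss_xx) * sum_x / n

print(round(ss_xy, 4), round(ss_xx, 4))
print(a, b)
print(round(a_full, 4))
```

With the rounded slope the script reproduces a = 1.1414 and b = 0.2642. Carrying the unrounded slope through gives an intercept of about 1.1422, so the last decimal places of a depend on where rounding is applied. The fitted line used in the rest of this topic is ŷ = 1.1414 + 0.2642x.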

This regression line is called the least squares regression line. It gives the regression of food expenditure on income.

6.6.3 Making predictions using the regression line of y on x

The regression line of y on x gives the average value of y for a given value of x, so in certain circumstances it can be used to predict or estimate missing values. When the given value of x lies within the range of x values in the sample, this is known as interpolation. Interpolation is generally safe because the prediction is made within the range of values of the predictor in the sample used to generate the model. For example, from our estimated regression model, we can find the predicted value of y for any specific value of x. Suppose we randomly select a household whose monthly income is RM 3500, so that x = 35 (recall that x denotes income in hundreds of RM). The predicted value of food expenditure for this household is

\[ \hat{y} = 1.1414 + (0.2642)(35) = 10.3884 \text{ (hundreds of RM)} = \text{RM } 1038.84. \]

In other words, based on our regression line, we predict that a household with a monthly income of RM 3500 will spend about RM 1038.84 per month on food.
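The same prediction can be written as a small function. This is a sketch using the rounded coefficients of the fitted line; the names `A`, `B` and `predict_food_expenditure` are illustrative assumptions.

```python
# Rounded coefficients of the fitted line from Example 1
A = 1.1414   # intercept a
B = 0.2642   # slope b


def predict_food_expenditure(income_hundreds):
    """Predicted food expenditure, in hundreds of RM, from y_hat = a + b*x."""
    return A + B * income_hundreds


y_hat = predict_food_expenditure(35)   # monthly income of RM 3500
print(round(y_hat, 4), "hundred RM, i.e. RM", round(y_hat * 100, 2))
```

Because both variables are measured in hundreds of RM, the result is multiplied by 100 to express the prediction in RM.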


However, we must take great care when estimating values outside the range of the data. Making a prediction outside the range of values of the predictor in the sample used to generate the model is called extrapolation. The further the prediction is from the range of values used to fit the model, the riskier and less reliable it becomes, because there is no way to check whether the relationship between the dependent and independent variables continues to be linear.
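One simple safeguard is to check whether a new value of x falls inside the range of x values used to fit the line before trusting a prediction. The sketch below does this for the incomes in Example 1; the function name `prediction_type` and the income of 80 (RM 8000) are illustrative assumptions.

```python
def prediction_type(x_new, xs):
    """Return 'interpolation' if x_new lies within the observed range of xs,
    otherwise 'extrapolation'."""
    if min(xs) <= x_new <= max(xs):
        return "interpolation"
    return "extrapolation"


# Incomes (hundreds of RM) used to fit the line in Example 1: range 15 to 49
incomes = [35, 49, 21, 39, 15, 28, 25]

print(prediction_type(35, incomes))   # inside the range of the sample
print(prediction_type(80, incomes))   # well above the largest observed income
```

A check like this only flags extrapolation; it does not say how unreliable the prediction is.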

Exercise

1. A patient is given a drip feed containing a particular chemical, and its concentration in his blood is measured, in suitable units, at one-hour intervals for the next five hours. The doctor believes the figures to be subject to random errors, arising both from the sampling procedure and the subsequent chemical analysis, but that a linear model is appropriate.

Time, x (hours)    0      1      2      3      4      5
Concentration      2.4    4.3    5.2    6.8    9.1    11.8

a) Illustrate the data on a scatter diagram.
b) Determine the equation of the regression line by the "eye" method.
c) Find the equation of the regression line by the least squares method and compare the result with that in b).
d) Estimate the concentration of the chemical in the patient's blood (i) 3½ hours and (ii) 10 hours after the treatment started. Comment on the likely accuracy of your predictions.


2. The English speaking ability score, X, and the sales, Y, in a particular month for 10 salesmen are shown in the table below.

X    60    54    52    48    42    36    34    28    26    18
Y    74    76    66    76    70    68    62    54    64    54

a) Draw a scatter diagram for the data.
b) By using the least squares method, find the equation of the regression line of sales Y on English speaking ability score X.
c) What are the levels of sales for three particular salesmen A, B and C with English speaking ability scores of 40, 50 and 70 respectively?

Answer: b) Y = 46.52 + 0.4994X   c) 66.49, 71.49, 81.48

