
STA 6236

Regression Analysis
Fall 2013
1
STA 6236
Regression Analysis
!"#$ &' ()*+,)+-
."// 0123
Power point slides modified from those developed by C. Nachtsheim, 2007,
adapted 2008, 2009. Further modified 2010, 2011. (But not 2012)
2
STA 6236 Fall 2013
Syllabus Info:
3
QQs: Potential Quiz Dates

4
5
6
7
How to Get JMP10
MAP 130D or 130E (Math and Physics, Data Mining Lab); knock on the door to get Mr. Tau's attention
Free
Can be loaded on Mac and non-Mac machines
Bring laptop or PC to lab (must be done on site)
30-day trial license in the meantime
8
Site for downloading to be set up
Feel free to start reading the text
Tutorials on JMP
Data sets from JMP
Data sets with text
9
Brief introduction to JMP10
Tutorial as part of the package
Data files
ch1ta1
Will use in class for all calculations
You need it to produce output to bring to
quizzes and for assignments to be submitted
Play with the package, try it out on data,
become proficient
What does this part of the output tell me?
10
August 21, 2013 (Class #2)
Full-scale launch of regression
Will demonstrate some other things about JMP10 tonight
Quiz dates messed up (go to slide 4)
11
Status of Stuff
JMP 10 Site license in play? Thursday
Web site? Future name:
http://statistics.cos.ucf.edu/STA4222orSTA6236files
Power point slides
Special data files (those not in JMP; not part of text)
Videos?
Yes: Camtasia in the works
12
Loose Ends: SLCL.jmp file
Not sure what x, y represent!
File SLCL.jmp from the sample data sets in JMP
Slope of fitted line looked "Goofy" and potentially embarrassing
All unexpected results are learning opportunities.
600 data points? Seriously? Seriously?
Repeats or close-to-repeats
Finding the culprits…

13
Demo stuff
Analyze distribution mode of 62 (histogram not so clear)
Bubble plot
Stack x, y and jitter the points
14
Election Data 2008 + Weather Index
From C. Watson, WTC and KAC, and now http://tracking.enkiops.org/
Fields should be fairly clear: state and county FIPS code, the max WCI in the county; pctout is the percent turnout, margin is the margin by which Obama or McCain won, income is median income, pctpov is the percent below the poverty line, pctwhite is the percent white.
Results_2008 10aug2013
Election data tab on roster Excel sheet
15
Basic Premises
Goal is to do the analysis RIGHT (i.e., sans time constraints or need to take shortcuts)
(If we had 10,000 variables, …)
Brainpower + analytical tools
Track down data aberrations, issues, etc.
Election data file
Guide to cleanup

16
What is Regression?
1. Method of modeling relationships
between a response variable Y and one or more
predictors X. (also known as dependent/
endogenous variable Y and independent/
exogenous variables X)

2. A way of fitting a line (or curve) through data
17
Why regression analysis?
Objectives please.
Relationship between variables (response
variable with explanatory variables;
dependent with independent variables)
Prediction (BIG DATA driver)
Identify key variables of interest
Test scientific hypotheses
18
Regression: Two Extreme Schools
of Thought Bracket the Area
"Don't try this at home, I am a professional":
Tell me what you have done and I will gleefully point
out all of the errors, misunderstandings, and so forth.
Only experts should be permitted to apply regression
techniques, let alone use sophisticated software such
as JMP10. Otherwise, only junk will be produced.
"Give it a try, what can possibly go wrong":
One will always learn something from a detailed
regression analysis. Thank goodness for JMP10 to
eliminate the drudgery in the computations. Try your
best and you can always ask for forgiveness before
re-running your analysis with a friendly experts help.
19
Example 1:
Store Site Selection
Model sales Y at existing sites as a function of demographic
variables:

X1 = Population in store vicinity
X2 = Income in area
X3 = Age of houses in area
X4 = Unemployment rate
X5 = Traffic data
From equation, predict sales at new sites
20
Example 2:
Marketing Research
Model consumer response to a product on basis of product characteristics:

Y = Taste score on soft drink

X1 = Sugar level
X2 = Carbonation level
X3 = Ice/no ice
X4 = _______


21
Example 2:
Marketing Research
Model consumer response to a product on basis of product characteristics:

Y = Taste score on soft drink

X1 = Sugar level
X2 = Carbonation level
X3 = Ice/no ice
X4 = sugar content
X5 = cost


22
Example 3: Arson Forensics
Understand burning processes for baselines of arson
investigations

Ys = time until flame out, max temp in room, time until max temp reached, average temp in room overall and at individual locations, depth of char and bubble size throughout the room

X1 = Fuel type: Gasoline or Kerosene mix
X2 = Ignitable fuel amount
X3 = Ignitable fuel placement
X4 = Additional materials on sofa
X5 = Window 1 openness
X6 = Window 2



23
Example 4:
Real Estate Pricing
Y = Selling price of houses

X1 = Square footage
X2 = Taxes
X3 = Lot acreage
X4 = Houses in area foreclosed
X5 = Rating of neighborhood school
X6 = Distance/time to downtown
24
One-Predictor Regression
(Chris Nachtsheim example)
533 Homes Sold in Minnetonka, MN 2001
25
One-Predictor Regression
[Scatter plot of Price vs. SqFt with fitted line]
Price = -1957.83 + 158.950 SqFt
S = 79122.9   R-Sq = 67.2%   R-Sq(adj) = 67.1%
Regression Plot
26
Orlando Housing Market
Bad news on housing "diminishing," supposedly better the past year or so
Zillow site for recent sales
Last year, looked at previous 30 days, sold
price, square footage, #bedrooms, baths,
taxes
28
29
Little Demo in JMP10 with this data
Getting data into a data table (steps
skipped to go from Zillow to JMP)
Looking at the data
Sales price as a function of variables
Multivariate
Predictive model?
Model for understanding?
Extent of generalizing possible!
30
Demo:
Pull data into JMP
Regress (v.) Price on SqFt
Check out fit(s)
(with/without various data points, formula function)
31
Used ZILLOW for 2013 recent data
ZILLOW not great for downloads…
63 observations
4+ bedrooms
3+ bathrooms
non-missing lot size
Orlando area
"pool"
32
Regression Models
1. Answer "What is the relationship between the variables?"
2. Equation used
1 Numerical dependent (response) variable
1 or more numerical or categorical independent
(explanatory) variables
3. Used mainly for prediction & estimation
33
Another Reason:
Demonstrate No Relationship!
34
Brief Recap: Pep Talk for Regression
Syllabus/Text/JMP10 tied together
Regression analysis used extensively by
statisticians as well as non-statisticians
Exploratory mode (Y versus X or Xs)
Estimation
Prediction
Understanding
35
Reminder: Assignment
Get ahold of JMP10 (various options)
Take tutorials, practice loading data sets
Fit Y by X
Fit Model
Analyze distribution
Formula
Table manipulations
Note linkages
Read ahead in text
Review matrix analysis if rusty
One demo on YouTube using SLCL.jmp file
36
YouTube update + WEB Page at last
YouTube search on STA4222orSTA6236
Should be 3 videos
SLCL file
Stuff from Monday Aug. 28 (part 1)
Even more stuff from Aug. 28 (part 2)

http://statistics.cos.ucf.edu/mjohnson
37
YouTube video #2
Practice with "Genuine" Fake Made-up Data
Consider a model of the form
y = 10 + 3x + error, where x: 1, 2, 3, 4, 5, replicated
Check how you did
Fit it (true, with error)
Play with parameters, assumptions
Try it with 1000 replicated data sets
38
YouTube video #2, redo
Practice with "Genuine" Fake Made-up Data
Consider a model of the form
y = 10 + 3x + error, where x: 1, 2, 3, 4, 5, replicated
Check how you did
Fit it (true, with error)
26 !7+859 :7;75<<<< !!!
Need to create the groups and then fiddle with assumptions
Play with parameters, assumptions
Try it with 1000 replicated data sets
39
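A sketch of this simulation exercise in Python rather than JMP (illustrative only; NumPy and statsmodels are assumed available, and the choice of 5 replicates per x value and N(0, 1) errors is mine):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=1)  # fix a seed so the run is reproducible

# x = 1..5, replicated (here 5 copies each); error ~ N(0, 1)
x = np.tile(np.arange(1, 6), 5).astype(float)
y = 10 + 3 * x + rng.normal(0, 1, size=x.size)  # true model: y = 10 + 3x + error

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)  # estimates of (10, 3); compare with the truth

# "Try it with 1000 replicated data sets": sampling distribution of b1
b1 = [sm.OLS(10 + 3 * x + rng.normal(0, 1, x.size),
             sm.add_constant(x)).fit().params[1] for _ in range(1000)]
print(np.mean(b1), np.std(b1))  # mean near 3: b1 is unbiased for the true slope
```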
Homework assignment
Simulated data problem
See handout
40
41
42
43

Prediction Using Regression:
$633,843 = -$1,957.83 + $158.950*4,000

[Regression plot: Price = -1957.83 + 158.950 SqFt; S = 79122.9, R-Sq = 67.2%, R-Sq(adj) = 67.1%; fitted line with 95% PI, marked at SqFt = 4,000, Price ≈ 639,000]
44
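Checking the slide's arithmetic for the point prediction at X_h = 4,000 square feet:

$$\hat{Y}_h = b_0 + b_1 X_h = -1957.83 + 158.950 \times 4000 = 633{,}842.17 \approx \$633{,}843.$$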
Fit Y by X
45
Functional v. Statistical Relations
Functional Relation: One Y-value for each X-value

46
Statistical Relation
Example 1: Y = Year-end employee evaluation
X = Mid-year evaluation
47
Curvilinear Statistical Relation
Example 2: x = Age
y = Steroid level in blood
Regression Objectives: Characterize the statistical relation and/or predict new values
48
Statistical Relations
Have distribution of Y-values for each X
The mean changes systematically with X
49
Why Regression?
Regrettable term to some extent! Common usage!
Galton, late 1800s: Average height of sons, at a given father's height, tends to "regress" toward the mean of the population ("mediocrity")
[Scatter plot: Sons' heights vs. Fathers' heights, showing the population average, the line Y = X, and the regression line]
50
Simple Linear Regression Model:
The Assumptions
51
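The assumptions on this slide are an image in the original; as a reference, the simple linear regression model (model 1.1 in the text) and its assumptions can be written:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where the Xᵢ are known constants, β₀ and β₁ are parameters, and the εᵢ are random errors with E{εᵢ} = 0, Var{εᵢ} = σ², and εᵢ, εⱼ uncorrelated for i ≠ j.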
Notes on Simple Linear Regression
Model ("Regression lite")
1. Model is simple, because only one predictor (X)
2. Model is linear because parameters enter linearly
3. Since X = X1 (and X2, X3, etc. not present), the model is first-order
52
Homework problem set for Chapter 1

Not to be graded, but I may go through them in an extended class period following the usual lecture-format class (to be determined). You should be comfortable doing these types of problems. For a quiz situation, you should also be comfortable extracting relevant material from JMP output.
1.5, 1.6 (draw plot by hand), 1.7, 1.10, 1.11,
1.13, 1.16, 1.18, 1.19 (needs software and data
disc that comes with the book), 1.20 through
1.28, 1.29, 1.32, 1.33, 1.34, 1.35, 1.36,
1.39,1.43, 1.44, 1.45, 1.46

53
Features of Model
1. εᵢ is a random variable, so Yᵢ is also a random variable

2. Mean of Yᵢ is the regression function: E{Yᵢ} = β₀ + β₁Xᵢ

3. εᵢ is the vertical deviation of Yᵢ from the mean at Xᵢ

54
Features of Model (continued)
4. Variance is constant:
Var(Yᵢ) = Var(β₀ + β₁Xᵢ + εᵢ) = Var(εᵢ) = σ²
5. Yᵢ is uncorrelated with Yⱼ for i ≠ j
6. In summary, regression model (1.1) implies that responses Yᵢ come from probability distributions whose means are E(Yᵢ) = β₀ + β₁Xᵢ and whose variances are σ², the same for all levels of X. Any two responses Yᵢ and Yⱼ are uncorrelated.
55
Illustration of Simple Linear Regression
Error terms NOT assumed to be normally distributed; no distributional assumptions made other than on moments.
56
Section 1.4
Observational Data
Lung cancer impacts from smoking/smoking
cessation
Experimental Data
Feasible to control assignment of subjects to
treatments
Completely Randomized Design
Free of bias, though possibly not efficient
57
58
Least Squares Estimators: Properties
1. Linear: We will show they are each linear combinations of the Yᵢ's
2. Unbiased: E{b₀} = β₀ and E{b₁} = β₁
3. Best: Minimum variance (maximum precision) among all linear, unbiased estimators of these parameters.
4. Estimators
Gauss-Markov Theorem: If the assumptions hold, the LS estimators are BLUE:
59
60
61
62
Relationship to "Features of Model" slides given earlier (around slides 44-45)
Y = f(X) does not appear linear
εᵢ and εⱼ uncorrelated?
Presumption of declining values of y
Large drop in y suggests long duration until next y observed
Is X fixed?
63
Alternative Version of Model:
Use Centered Predictor(s)
Yᵢ = β₀* + β₁(Xᵢ − X̄) + εᵢ, where β₀* = β₀ + β₁X̄
Same slope, different intercept!
64
Estimating the Regression Function
Example: Persistence Study.

Each of 3 subjects given a difficult task. Yᵢ is the number of attempts before quitting.

Subject i:               1    2    3
Age Xᵢ:                 20   55   30
Number of Attempts Yᵢ:   5   12   10

65
Estimating the Regression Function
Scatter Plot:
[Scatter plot: Attempts vs. Age]
Hypothesis: E{Y} = β₀ + β₁X
How do we estimate β₀ and β₁?
66
Criteria for choice of β₀ and β₁
Sum of perpendicular distances (no)
Sum of vertical distances, absolute values (no)
Sum of vertical distances squared (yes: least squares)
Sum of horizontal distances (no)
67
Least Squares Criterion
Find the values of β₀ and β₁ that minimize the least squares objective function Q, given the sample (X₁, Y₁), …, (Xₙ, Yₙ).

Call those minimizing values b₀ and b₁.
68
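The objective function on this slide is an image; in the notation of the text it is

$$Q(\beta_0, \beta_1) = \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_i \right)^2,$$

and b₀, b₁ are the values of β₀, β₁ that minimize Q.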
Persistence Study
Who wins? Which fit is better?
69
How do we find b₀ and b₁?
Calculus:
1. Take partial derivatives with respect to β₀ and β₁, and set equal to zero
2. Get two equations and two unknowns; solve.
Denote solutions by b₀ and b₁:
70
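The equations referred to here are images in the original; the standard normal equations and their solutions are

$$\sum Y_i = n b_0 + b_1 \sum X_i, \qquad \sum X_i Y_i = b_0 \sum X_i + b_1 \sum X_i^2,$$

giving

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}.$$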
71
72
Least Squares Estimators: Properties
1. Best: Minimum variance (maximum precision)
2. Linear: We will show they are each linear combinations of the Yᵢ's
3. Unbiased: E{b₀} = β₀ and E{b₁} = β₁
4. Estimators
Gauss-Markov Theorem: If the assumptions hold, the LS estimators are BLUE:
73
Example 1:
Toluca Company Data
74
Toluca Company Fit
[Bivariate Fit of Work Hours by Lot Size: scatter plot with fitted line; Work Hours 100–550 vs. Lot Size 0–140]
75
JMP8 Output

Linear Fit: Work Hours = 62.365859 + 3.570202*Lot Size

Summary of Fit
  RSquare                      0.821533
  RSquare Adj                  0.813774
  Root Mean Square Error       48.82331
  Mean of Response             312.28
  Observations (or Sum Wgts)   25

Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
  Model       1   252377.58        252378        105.8757   <.0001*
  Error      23   54825.46         2384
  C. Total   24   307203.04

Parameter Estimates
  Term        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept   62.365859   26.17743    2.38      0.0259*
  Lot Size    3.570202    0.346972    10.29     <.0001*

Output from Fit Y by X Platform
76
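A sketch reproducing this output outside JMP, in Python with pandas and statsmodels (the file and column names for the textbook's Toluca data are assumptions on my part):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toluca data from the textbook's data disc; file/column names assumed here
toluca = pd.read_csv("CH01TA01.txt", sep=r"\s+", names=["LotSize", "WorkHours"])

fit = smf.ols("WorkHours ~ LotSize", data=toluca).fit()
print(fit.summary())
# Should match the JMP output above: intercept 62.3659, slope 3.5702,
# R-squared 0.8215, root MSE 48.82, n = 25
```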
JMP9 Output


Output from Fit Y by X Platform
77
JMP Pro 10 Output
78
Example 2:
Minnetonka Home Sales Again
79
One-Predictor Regression:
Assumptions met?
[Scatter plot of Price vs. SqFt with fitted line]
Price = -1957.83 + 158.950 SqFt
S = 79122.9   R-Sq = 67.2%   R-Sq(adj) = 67.1%
Regression Plot
80
Estimating the mean response at X
Regression Function:
Estimator:
Using centered-X model:
81
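The estimator formulas on this slide are images in the original; reconstructed from the text:

$$E\{Y\} = \beta_0 + \beta_1 X, \qquad \hat{Y}_h = b_0 + b_1 X_h,$$

and with the centered-X model,

$$\hat{Y}_h = b_0^* + b_1 (X_h - \bar{X}), \quad \text{where } b_0^* = \bar{Y}.$$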
Residuals!
Estimated residuals are key to assessing fit and
validity of assumptions
True Residuals (always unknown!):
εᵢ = Yᵢ − (β₀ + β₁Xᵢ)
Estimated Residuals:
eᵢ = Yᵢ − Ŷᵢ = Yᵢ − (b₀ + b₁Xᵢ)
82
More Properties related to b₀ and b₁
1. The residuals sum to zero: Σeᵢ = 0
2. Σeᵢ² is a minimum (that is the least squares criterion)
3. ΣXᵢeᵢ = 0
4. ΣŶᵢeᵢ = 0
5. The regression line passes through the point (X̄, Ȳ)
83
Estimating the variance, σ²
Single population (no Xs for now, just Ys):
s² = Σ(Yᵢ − Ȳ)²/(n − 1) = SSE/(n − 1)
Degrees of freedom is n − 1 here because the mean was estimated using one statistic, namely Ȳ.

For regression, the mean is estimated by Ŷ, which uses two statistics, b₀ and b₁.
84
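In regression, two parameters (b₀ and b₁) are estimated, so two degrees of freedom are lost:

$$s^2 = \text{MSE} = \frac{\text{SSE}}{n-2} = \frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}.$$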
s² is the mean square for error in Analysis of Variance (MSE = SSE/DF)
85
Toluca data
Lot Size (Xᵢ) and corresponding work hours Yᵢ
86
Demo on how to get the data: watch videos, tutorials
Try JMP
Try EXCEL and then copy (or open notepad)
87
90
91
92
93
94
"Good Stuff to Know" Guide
Understand terminology
Assumptions for simple linear regression
Least squares criterion; normal equations
Derive estimators b₀, b₁ for β₀, β₁, resp.
Sense in which the LS are BLUE
Be able to extract relevant numbers from JMP Pro 10 output (e.g., estimates; fitted model)
Properties of b₀, b₁

95
More Good Stuff to Know
Assumptions for normal error regression
LS versus MLE estimators
Yᵢ is normal; distribution of linear combinations of these Yᵢ's
Properties of the kᵢ's
Distribution, mean and variance of b₁
Difference between confidence interval on the regression function and prediction interval for future observations at Xₕ
SSTO = SSR + SSE and why we care about ANOVA
General linear test / full, reduced model
Definition and interpretation of R²
96
Normal Error Regression Model
Add to the assumptions one more item: εᵢ ~ independent N(0, σ²)
Notes:
1. N(0, σ²) implies normally distributed with mean zero and variance σ².
2. Uncorrelated implies independence for normal errors.
3. Normality is a strong assumption; might not be true!
97
One Rationale for Normality
Suppose the true model involves 21 weak predictors, X and Z₁, …, Z₂₀, so that:

Yᵢ = β₀ + β₁Xᵢ + β₂Z₁,ᵢ + β₃Z₂,ᵢ + … + β₂₁Z₂₀,ᵢ + εᵢ

But we use:

Yᵢ = β₀ + β₁Xᵢ + εᵢ

So that

εᵢ = β₂Z₁,ᵢ + β₃Z₂,ᵢ + … + β₂₁Z₂₀,ᵢ

Central Limit Theorem suggests normality of εᵢ.
(not implausible that error terms are normal)
98
Maximum Likelihood Estimation
Rationale: Use as estimates those values of the parameters that maximize the likelihood of the observed data.

Case 1: Single sample; estimate μ. Assume σ² = 100.
Data: n = 3; Y₁ = 250, Y₂ = 265, Y₃ = 259. Which is more likely, μ = 230 or μ = 259?
99
Maximum Likelihood Estimation
The vertical bars indicate the likelihood or density of each Yᵢ. Consider Y₁:
Densities (likelihoods) for all three observations for the two alternatives are:
100
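A quick numerical check of this comparison in Python (SciPy assumed available):

```python
from scipy.stats import norm

y = [250, 265, 259]
sigma = 10  # sigma^2 = 100, assumed known

for mu in (230, 259):
    dens = [norm.pdf(yi, loc=mu, scale=sigma) for yi in y]
    likelihood = dens[0] * dens[1] * dens[2]  # product of the three densities
    print(mu, likelihood)
# mu = 259 gives roughly 3.5e-05; mu = 230 gives roughly 2.8e-10
```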
Maximum Likelihood Estimation
The product of the individual likelihoods gives the sample likelihood.
Clearly, μ = 259 is more likely!

In general, we can write down the likelihood as a function of any value for μ:
101
Likelihood Function: L(μ)
The likelihood is maximized at about μ = 258.
(Note that the average of the 3 Ys is 258.)
102
Maximum Likelihood for Regression: Case 2
Persistence Study (n = 3). Assume σ = 2.5. Suppose β₀ = 0 and β₁ = 0.5. How likely are the observed values?
103
Calculate Likelihood
For the first observation, X = 20, so:

μ = E{Y} = 0 + 0.5(20) = 10

The likelihood is f₁ = 0.021596.
Likewise: f₂ = 0.7175×10⁻⁹ and f₃ = 0.021596
104
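The same densities computed in Python (SciPy assumed available) reproduce the slide's values:

```python
from scipy.stats import norm

ages, attempts = [20, 55, 30], [5, 12, 10]
b0, b1, sigma = 0.0, 0.5, 2.5  # trial parameter values from the slide

# density of each Y_i under mean b0 + b1*X_i
f = [norm.pdf(y, loc=b0 + b1 * x, scale=sigma) for x, y in zip(ages, attempts)]
print(f)                   # 0.021596, 7.175e-10, 0.021596
print(f[0] * f[1] * f[2])  # sample likelihood at (b0, b1) = (0, 0.5)
```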
Calculate Likelihood (continued)
So for the sample, the likelihood of β₀ = 0 and β₁ = 0.5 is the product L = f₁·f₂·f₃.
For unknown β₀, β₁, and σ², the likelihood function is:
105
Maximum Likelihood Estimators
The values of β₀, β₁, and σ² that maximize the likelihood function, namely β̂₀, β̂₁, and σ̂², are called the maximum likelihood estimators.
Some results:
106
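The "results" box is an image in the original; the standard results from the text are that the maximum likelihood estimators of β₀ and β₁ coincide with the least squares estimators b₀ and b₁, while

$$\hat{\sigma}^2_{\text{MLE}} = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n},$$

which divides by n and is therefore biased; the unbiased version is MSE = SSE/(n − 2).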
Some KEY Points from Appendix 1
(sorry, but material you should have seen at some point in your statistical education)
Note: Review Appendix 1 with special attention to

A.1: Summation and product notation
A.3: Random variables
A.4: Normal and related distributions
A.6: Inferences about population mean
A.7: Comparisons of population means
A.8: Inferences about population variance
107
Linear Combinations of
Random Variables
Let Y₁, …, Yₙ be random variables, and a₁, …, aₙ constants. Then:

Z = a₁Y₁ + … + aₙYₙ

is a linear combination of the random variables Y₁, …, Yₙ
108
Examples of Linear Combinations
1. Example 1: Difference of two random variables
X − Y
2. Example 2: The sample mean
X̄ = X₁/n + X₂/n + … + Xₙ/n
110
Expectation and Variance of Linear Combinations
1. Expectation (A.29a). Let E{Yᵢ} = μᵢ, for i = 1, 2, …, n, and let Z = a₁Y₁ + … + aₙYₙ. Then:

E{Z} = Σ aᵢμᵢ

2. Variance (A.31): In addition to the above, assume that the {Yᵢ} are mutually independent and σ²{Yᵢ} = σᵢ², i = 1, 2, …, n. Then:

σ²{Z} = a₁²σ₁² + … + aₙ²σₙ²
111
Examples of Linear Combinations
1. Example 1: Difference of two independent random variables
E(X − Y) = E(X) − E(Y)
Var(X − Y) = Var(X) + Var(Y)
2. Example 2: The sample mean
E(X̄) = (1/n)[E(X₁) + … + E(Xₙ)] = (1/n)(nμ) = μ
Var(X̄) = (1/n)²[σ² + … + σ²] = σ²/n
112
Expectation and Variance of
Linear Combinations: Examples
Example 4:

Let Z = Ȳ. Find E{Z} and σ²{Z}.
113
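Worked out using (A.29a) and (A.31), assuming the Yᵢ are independent with common mean μ and variance σ² (so aᵢ = 1/n):

$$E\{\bar{Y}\} = \sum_{i=1}^{n} \tfrac{1}{n}\,\mu = \mu, \qquad \sigma^2\{\bar{Y}\} = \sum_{i=1}^{n} \tfrac{1}{n^2}\,\sigma^2 = \frac{\sigma^2}{n}.$$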
114
t Distribution Examples
Example 5:

Find the t statistic corresponding to the sample average in a sample of size n.

Assume E{Yᵢ} = μ₀, for i = 1, 2, …, n

115
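The statistic itself (an image in the original) is the familiar one-sample t:

$$t = \frac{\bar{Y} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}, \qquad s^2 = \frac{\sum (Y_i - \bar{Y})^2}{n-1}.$$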
Linear Combinations of
Independent Normal RVs (A.40)
116
Chapter 2
117
