
Bivariate Distributions and Estimating Relations

ENGSTAT
Bivariate Distributions
One variable is dependent on one or more related variables.
Relationships between different variables
Detecting patterns
Linear relationships
Regression analysis
Correlation
Bivariate Distributions
Two-Way Table for Categorical Data
Projections on New Workers by Gender and Race

Race        Women   Men    Total
White        23%    24%     47%
Black         9%     6%     15%
Asian         7%     6%     13%
Hispanic     13%    12%     25%
Total        52%    48%    100%
Bivariate Distributions
Time Series

Number of industrial accidents recorded per week

Number of machine breakdowns per month

Quarterly production figures

Annual profits

Daily temperatures
Bivariate Distributions
Time Series [example time series plot]
Bivariate Distributions
Scatter Plots [example scatter plots]
Correlation: Estimating the Strength of a Linear Relation
Correlation Coefficient

r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

-1 \le r \le 1 always
A value of r near or equal to zero implies little or no linear relationship between x and y.
In contrast, the closer r is to 1 or -1, the stronger the linear relationship between y and x; if r = ±1, all the points fall exactly on the line.
A positive value of r implies that y increases as x increases.
A negative value of r implies that y decreases as x increases.
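As a quick illustration of this formula, here is a minimal NumPy sketch (the x and y arrays are made-up example values, not data from the lecture):

```python
import numpy as np

# Made-up example data (not from the lecture).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
# Sample standard deviations (ddof=1 gives the n - 1 denominator).
sx, sy = x.std(ddof=1), y.std(ddof=1)

# r = 1/(n-1) * sum of standardized x times standardized y
r = np.sum((x - x.mean()) / sx * (y - y.mean()) / sy) / (n - 1)

print(r)                        # hand-rolled value
print(np.corrcoef(x, y)[0, 1])  # should agree with NumPy's built-in
```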
Regression: Modeling Linear Relationships
Linear Model: y = \beta_0 + \beta_1 x
Fitting the Model: The Least-Squares Approach

Slope: \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}

y-intercept: \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

Sum of squared errors: SSE = \sum (y_i - \hat{y}_i)^2, where y_i is the actual value and \hat{y}_i is the predicted value
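A minimal NumPy sketch of these least-squares formulas, again using made-up example values:

```python
import numpy as np

# Made-up example data (not from the lecture).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 4.1, 5.8, 8.2, 9.9])

# Slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: ybar - b1 * xbar
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x                 # predicted values
sse = np.sum((y - y_hat) ** 2)      # sum of squared errors

print(b0, b1, sse)
```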
Regression: Modeling Linear Relationships
Properties of the Least-Squares Regression Line
The sum (and the mean) of the residuals is zero.
The variation in the residuals is as small as possible for the given dataset.
The line of best fit always passes through the point (\bar{x}, \bar{y}).
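These properties are easy to verify numerically; a short sketch, continuing with made-up data and the hand-rolled fit from above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data
y = np.array([2.3, 4.1, 5.8, 8.2, 9.9])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

print(np.isclose(residuals.sum(), 0.0))          # sum of residuals is ~0
print(np.isclose(b0 + b1 * x.mean(), y.mean()))  # line passes through (xbar, ybar)
```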
Residual Analysis: Assessing the Adequacy of the Model
Residual: the difference between the observed and predicted value of y for a given value of x

residual = y_i - \hat{y}_i

Conditions that must hold among the residuals for linear regression to work well:
The residuals appear to be nearly random quantities, unrelated to the value of x.
The variation in the residuals does not depend on x; that is, the variation in the residuals appears to be about the same no matter which value of x is being considered.
Residual Analysis: Assessing the Adequacy of the Model
Residual Plot
A lack of any trends or patterns can be interpreted as an indication of the random nature of the residuals.
A nearly constant spread of residuals across all values of x can be interpreted as an indication that the variation in the residuals does not depend on x.
If a model fits the data well, there will be no discernible pattern in the residual plot; the points will appear to be randomly scattered about the plane.
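As an illustration, a minimal Matplotlib sketch of such a residual plot, assuming made-up data and the same hand-rolled least-squares fit:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # made-up data
y = np.array([2.2, 4.3, 5.7, 8.4, 9.6, 12.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# Residuals vs. x: look for no trend and roughly constant spread.
plt.scatter(x, residuals)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.title("Residual plot")
plt.show()
```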
Residual Analysis: Assessing the Adequacy of the Model
Residual Plots [example residual plots]
Transformations
When data do not fit a simple regression model, transform the data and fit a linear model to the transformed data.
Exponential transformation: y = a e^{bx}, so \ln y = \ln a + bx
Power transformation: y = a x^b, so \ln y = \ln a + b \ln x
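A minimal sketch of the exponential case: fit a straight line to (x, ln y) and then recover a and b. The data are made up, and NumPy's polyfit is used for the straight-line fit:

```python
import numpy as np

# Made-up data that roughly follows y = a * exp(b * x).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.7, 7.3, 20.5, 54.0, 150.0])

# Fit ln y = ln a + b x as an ordinary straight line.
b, ln_a = np.polyfit(x, np.log(y), 1)   # returns slope, then intercept
a = np.exp(ln_a)

print(a, b)                   # estimated a and b
print(a * np.exp(b * x))      # back-transformed predictions
```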
Exercise: Daily peak load (y) for a power plant and the maximum outdoor temperature (x)

Day   Max. Temp (x)   Peak Power Load (y)
 1         95               214
 2         82               152
 3         90               156
 4         81               129
 5         99               254
 6        100               266
 7         93               210
 8         95               204
 9         93               213
10         87               150
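For reference, a quick NumPy check that fits the least-squares line to the tabulated data; the coefficients should match the Excel output shown below:

```python
import numpy as np

x = np.array([95, 82, 90, 81, 99, 100, 93, 95, 93, 87], dtype=float)        # max temperature
y = np.array([214, 152, 156, 129, 254, 266, 210, 204, 213, 150], dtype=float)  # peak load

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)   # expect roughly -419.85 and 6.72 (see the Excel summary output below)
```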

Regression Analysis Using EXCEL

Scatter Plot [peak power load (y) vs. maximum outdoor temperature (x)]

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.944093034
R Square             0.891311656
Adjusted R Square    0.877725613
Standard Error       16.17764189
Observations         10

                 Coefficients     Standard Error   t Stat      P-value
Intercept        -419.8491459     76.05777755      -5.52013    0.00056
X Variable 1        6.717477004    0.829350073      8.099688   3.99E-05

Multiple Regression Analysis

R-squared
cannot determine whether the coefficient estimates and predictions are biased; you must assess the residual plots
increases if you add a predictor to the model, even if the improvement is due to chance alone
a model may appear to have a better fit simply because it has more terms
with too many predictors, the model begins to fit the random noise in the data; this is called overfitting, and it produces misleadingly high R-squared values and a lessened ability to make predictions
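A small sketch of this behavior with made-up data: adding a pure-noise predictor still pushes R-squared up slightly (the helper function below is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x1 = rng.uniform(0, 10, n)
y = 3.0 + 2.0 * x1 + rng.normal(0, 2, n)   # y really depends only on x1
x2 = rng.normal(size=n)                    # pure-noise predictor

def r_squared(X, y):
    """R^2 for a least-squares fit of y on the columns of X (plus an intercept)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

print(r_squared(x1.reshape(-1, 1), y))            # model with x1 only
print(r_squared(np.column_stack([x1, x2]), y))    # adding noise: R^2 never decreases
```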

Multiple Regression Analysis

Adjusted R-squared
compares the explanatory power of regression models that contain different numbers of predictors
a modified version of R-squared that has been adjusted for the number of predictors in the model
increases only if a new term improves the model more than would be expected by chance
decreases when a predictor improves the model by less than expected by chance
is always lower than R-squared
In the example, you may want to include only 3 predictors.
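For reference, a minimal sketch of the adjustment, where n is the number of observations and p the number of predictors. The first call below uses the R-squared and sample size from the Excel output above; the second uses a hypothetical two-predictor model with only a marginal gain in R-squared:

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r_squared(0.891, 10, 1))   # one predictor (values from the Excel example)
print(adjusted_r_squared(0.893, 10, 2))   # hypothetical second predictor: adjusted value drops
```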

RESIDUAL OUTPUT

Observation   Predicted Y     Residuals        Standard Residuals
 1            218.3111695      -4.311169514    -0.282654656
 2            130.9839685      21.01603154      1.377881138
 3            184.7237845     -28.72378449     -1.883227135
 4            124.2664915       4.733508541     0.310344611
 5            245.1810775       8.81892247      0.578197977
 6            251.8985545      14.10144547      0.924537808
 7            204.8762155       5.123784494     0.335932405
 8            218.3111695     -14.31116951     -0.938288016
 9            204.8762155       8.123784494     0.532622413
10            164.5713535     -14.57135348     -0.955346545
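A sketch that reproduces this residual output from the fitted line. The "Standard Residuals" column appears to be each residual divided by sqrt(SSE / (n - 1)); that convention is assumed below:

```python
import numpy as np

x = np.array([95, 82, 90, 81, 99, 100, 93, 95, 93, 87], dtype=float)
y = np.array([214, 152, 156, 129, 254, 266, 210, 204, 213, 150], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

predicted = b0 + b1 * x
residuals = y - predicted
sse = np.sum(residuals ** 2)
n = len(y)

# Assumed convention: residual / sqrt(SSE / (n - 1)) reproduces the table above.
standard_residuals = residuals / np.sqrt(sse / (n - 1))

for i, (p, r, s) in enumerate(zip(predicted, residuals, standard_residuals), start=1):
    print(f"{i:2d}  {p:12.4f}  {r:12.4f}  {s:8.4f}")
```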

X Variable 1 Residual Plot [residuals plotted against X Variable 1]
