Betas: Standardized Variables in Regression

Paul E. Johnson
Department of Political Science
Center for Research Methods and Data Analysis, University of Kansas
2012
Outline
1 Introduction
2 Interpreting $\hat{b}_j$'s
3 Rescale Variables: Standardization
4 Standardized Data
5 Practice Problems
1 Introduction
Problem
Regression pops out slope estimates.
How can we make sense of them?
Can an automatic standardization of variables help?
2 Interpreting $\hat{b}_j$'s
Get Existential: What is Regression?
You theorize:
$$y_i = b_0 + b_1 x1_i + b_2 x2_i + \ldots + b_k xk_i + e_i, \quad i = 1, \ldots, N \qquad (1)$$
with which, and through some estimation procedure, you make estimates $\hat{b}_j$ to calculate predicted values:
$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x1_i + \hat{b}_2 x2_i + \ldots + \hat{b}_k xk_i, \quad i = 1, \ldots, N \qquad (2)$$
Everything else we do should be understood through this lens.
Yes, But What Do You DO with a Regression?
Compare 2 cases, with inputs $(x1_0, x2_0, \ldots, xk_0)$ and $(x1_1, x2_1, \ldots, xk_1)$.
The predicted values $\hat{y}_0$ and $\hat{y}_1$ are different; some of the $x$'s matter.
The focus is on developing substantively interesting comparisons!
We'd like to narrow our attention down, to concentrate on one predictor at a time:
$(x1_0, x2_0, \ldots, xk_0)$ and $(x1_0, x2_1, \ldots, xk_0)$
They differ only on $x2$, so the difference between predictions must be attributable to the change from $x2_0$ to $x2_1$.
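A minimal sketch of that comparison in R, under loud assumptions: the data are simulated, and the names `dat`, `x1`, `x2`, the comparison values 8 and 12, and the coefficients used to generate `y` are all hypothetical. It only illustrates that two cases differing on `x2` alone have predictions that differ by the `x2` slope times the change in `x2`.

```r
# Hypothetical simulated data: dat, x1, x2 and the generating coefficients
# are illustrative assumptions, not values from the slides.
set.seed(1234)
dat <- data.frame(x1 = rnorm(100, mean = 50, sd = 10),
                  x2 = rnorm(100, mean = 12, sd = 3))
dat$y <- 3 + 0.5 * dat$x1 + 2 * dat$x2 + rnorm(100, sd = 4)
mod <- lm(y ~ x1 + x2, data = dat)

# Two cases that differ only on x2 (8 versus 12); x1 is held at its mean.
cases <- data.frame(x1 = mean(dat$x1), x2 = c(8, 12))
pred <- predict(mod, newdata = cases)

# The gap between the two predictions equals the x2 slope times the change in x2.
gap <- unname(pred[2] - pred[1])
b2_times_change <- unname(coef(mod)["x2"]) * (12 - 8)
c(gap = gap, b2_times_change = b2_times_change)
```

Holding `x1` at its mean is just one convenient choice; any fixed value gives the same gap in a linear model without interactions.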
Substantively Interesting $x2_0$ and $x2_1$
The $\hat{b}_j$ are partial regression coefficients.
Linear formula: other things equal, a 1 unit increase in $x2_i$ causes an estimated $\hat{b}_2$ unit increase in the predicted value of $y_i$.
No reason to say a researcher can only compare variables by changing one unit at a time.
Know the problem's context, pick interesting values of $x2_0$ and $x2_1$ for comparison.
    x2 represents last year of school: $x2_0$ = 8th grade, $x2_1$ = high school
    x2 represents income: $x2_0$ = \$10,000, $x2_1$ = \$100,000
Linear and Continuous Xs: $\hat{b}$
Maybe the calculus says it best: $\partial \hat{y} / \partial x2 = \hat{b}_2$.
But there's no absolute scale for $\hat{b}_2$.
If the $x$'s or $y$ are numerically re-scaled, then the coefficients will change too.
3 Rescale Variables: Standardization
Recall Effect of Fiddling with Xs
If one re-scales $x2_i$, replacing it with $k \cdot x2_i$, then the regression coefficient is re-scaled to $\frac{1}{k}\hat{b}_2$.
If one adds or subtracts from $x2_i$, $\hat{b}_2$ is not changed, but the intercept $\hat{b}_0$ does change.
Both multiplication and addition are apparently harmless.
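Both claims are easy to check numerically. The sketch below uses simulated data; the variable names, `k = 10`, and the shift of 100 are arbitrary assumptions chosen only for illustration.

```r
# Simulated data; every name and number here is an illustrative assumption.
set.seed(42)
n <- 200
x1 <- rnorm(n, mean = 10, sd = 2)
x2 <- rnorm(n, mean = 5, sd = 1)
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
b_orig <- coef(lm(y ~ x1 + x2))

# Multiply x2 by k: its slope is divided by k, nothing else moves.
k <- 10
x2k <- k * x2
b_mult <- coef(lm(y ~ x1 + x2k))

# Add a constant to x2: the slope is unchanged, only the intercept shifts.
x2plus <- x2 + 100
b_add <- coef(lm(y ~ x1 + x2plus))

# Columns are: intercept, x1 slope, (transformed) x2 slope.
rbind(original = b_orig, multiplied = b_mult, shifted = b_add)
```

The shifted intercept equals the original intercept minus 100 times the `x2` slope, which is why the fitted line itself is the same in all three models.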
Consider Fiddling with y
What happens if one multiplies $y_i$ by 2?
    That doubles all the $\hat{b}$'s. That seems obvious.
What happens if $y_i$ has something added or subtracted?
    $\hat{b}_0$ changes.
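The same kind of check works for the outcome. This is a sketch on simulated data; the names and the constants 2 and 10 are assumptions for illustration only.

```r
# Simulated data; names and constants are illustrative assumptions.
set.seed(7)
x1 <- rnorm(150)
x2 <- rnorm(150)
y <- 4 + 1.5 * x1 - 0.5 * x2 + rnorm(150)
b <- coef(lm(y ~ x1 + x2))

# Multiply y by 2: every coefficient, intercept included, doubles.
y2 <- 2 * y
b_double <- coef(lm(y2 ~ x1 + x2))

# Add 10 to y: only the intercept moves, and it moves by exactly 10.
yplus <- y + 10
b_shift <- coef(lm(yplus ~ x1 + x2))

rbind(original = b, doubled = b_double, shifted = b_shift)
```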
Occupational Prestige Data from car
    library(car)
    # Rescale income by dividing it by 10
    Prestige$income <- Prestige$income / 10
    # Regress prestige on income, education and women
    presmod1 <- lm(prestige ~ income + education + women, data = Prestige)
My Professionally Acceptable Regression Table
                  Estimate    (S.E.)
    (Intercept)   -6.794*    (3.239)
    income         0.013*    (0.003)
    education      4.187*    (0.389)
    women         -0.009     (0.03)
    N              102
    RMSE           7.846
    R2             0.798
    adj R2         0.792
    * p ≤ 0.05
We are superficial, don't know much about the Prestige dataset.
How do we know what the slopes for income or women mean? Can they be compared?
One School of Thought: Get Inside the Data
Generally preferred by economists or political scientists (possibly statisticians).
Why are you trying to compare the effects of women and income?
Learn more about your data, look for meaningful comparison cases.
Make a predicted value table: (draw on board)
4 Standardized Data
Another School of Thought: Try to Convert Variables to a Common Metric
Preferred by psychologists (and many sociologists).
A standardized variable is calculated like so:
$$\text{standardized } y_i = \frac{y_i - \text{Observed mean}(y_i)}{\text{Observed Std.Dev.}(y_i)}$$
I'm not calling that a Z score because Z score presumes we know the TRUE mean and standard deviation.
By definition, all standardized variables have a mean of 0 and a standard deviation of 1. See why?
What is the common metric with standardized variables? (I'm asking, seriously)
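To "see why", it may help to standardize one variable by hand. The sketch below uses a simulated vector `y` (its mean and spread are arbitrary assumptions) and compares the hand calculation with base R's `scale()`, which also uses the sample standard deviation.

```r
# Simulated variable; the mean and sd passed to rnorm() are arbitrary assumptions.
set.seed(99)
y <- rnorm(50, mean = 46.8, sd = 17.2)

# Standardize by hand: subtract the observed mean, divide by the observed sd.
y_st <- (y - mean(y)) / sd(y)
c(mean = mean(y_st), sd = sd(y_st))   # 0 (up to rounding error) and 1

# scale() does the same thing; it returns a one-column matrix, so flatten it.
y_sc <- as.numeric(scale(y))
all.equal(y_st, y_sc)
```

Subtracting the mean forces the new mean to 0, and dividing by the standard deviation forces the new standard deviation to 1, whatever the original units were.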
Use Standardized Variables in Regression
Replace $y_i$ and $X1_i$ and $X2_i$ and $X3_i$ by standardized variables.
A standardized regression is like so:
$$\frac{y_i - \bar{y}}{s_y} = \beta_1 \frac{X1_i - \overline{X1}}{s_{X1}} + \beta_2 \frac{X2_i - \overline{X2}}{s_{X2}} + \beta_3 \frac{X3_i - \overline{X3}}{s_{X3}} + u_i \qquad (3)$$
The estimated coefficients are called standardized regression coefficients
    (slang: "Betas").
If ALL variables are standardized, then $\hat{\beta}_0 = 0$.
Standardize the Numeric Data
Unlike SPSS, R does not make standardization easy or automatic (not an oversight, probably).
    # Work on a copy so the original Prestige data stay untouched
    stPrestige <- Prestige
    # scale() centers a variable at its mean and divides by its standard deviation
    stPrestige$income <- scale(stPrestige$income)
    stPrestige$education <- scale(stPrestige$education)
    stPrestige$women <- scale(stPrestige$women)
    stPrestige$prestige <- scale(stPrestige$prestige)
    presmod1st <- lm(prestige ~ income + education + women, data = stPrestige)
standardize function in rockchalk will automate this
Recall:
    presmod1 <- lm(prestige ~ income + education + women, data = Prestige)
Output:
    All variables in the model matrix and the dependent variable
    were centered. The variables here have the same names as their
    non-centered counterparts, but they are centered, even constructed
    variables like `x1:x2` and poly(x1,2). We agree, that's probably
    ill-advised, but you asked for it by running standardize().

    Observe, the summary statistics of the variables in the design matrix.
              mean std.dev.
    prestige     0        1
    income       0        1
    education    0        1
    women        0        1

    Call:
    lm(formula = prestige ~ income + education + women, data = stddat)

    Residuals:
         Min       1Q   Median       3Q      Max
    -1.15229 -0.30999 -0.00793  0.29984  1.01744

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -1.388e-16  4.516e-02   0.000     1.00
    income       3.242e-01  6.855e-02   4.729 7.58e-06 ***
    education    6.640e-01  6.164e-02  10.771  < 2e-16 ***
    women       -1.642e-02  5.607e-02  -0.293     0.77
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.4561 on 98 degrees of freedom
    Multiple R2: 0.7982, Adjusted R2: 0.792
    F-statistic: 129.2 on 3 and 98 DF, p-value: < 2.2e-16
Side By Side: UnStandardized and Standardized Regression Estimates
                   Unstandardized         Standardized
                   Estimate    (S.E.)     Estimate    (S.E.)
    (Intercept)    -6.794*    (3.239)      0          (0.045)
    income          0.013*    (0.003)      0.324*     (0.069)
    education       4.187*    (0.389)      0.664*     (0.062)
    women          -0.009     (0.03)      -0.016      (0.056)
    N               102                    102
    RMSE            7.846                  0.456
    R2              0.798                  0.798
    adj R2          0.792                  0.792
    * p ≤ 0.05
Force yourself to stop and try to interpret those parameters.
Notice something interesting about the t statistics
                   Unstandardized         Standardized
                   Estimate   t value     Estimate   t value
    (Intercept)     -6.79      -2.10        0.00       0.00
    income           0.01       4.73        0.32       4.73
    education        4.19      10.77        0.66      10.77
    women           -0.01      -0.29       -0.02      -0.29
The estimated t values for the slopes are identical, unstandardized on the left and standardized on the right. Only the intercept's t value changes, because standardization forces the intercept to 0.
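The invariance of the slope t statistics is not special to the Prestige data. Here is a sketch on simulated data (the names `dat`, `x1`, `x2` and all generating values are assumptions) that standardizes every column with `scale()` and compares the two fits.

```r
# Simulated data; names and generating values are illustrative assumptions.
set.seed(2012)
n <- 120
dat <- data.frame(x1 = rnorm(n, mean = 600, sd = 400),
                  x2 = rnorm(n, mean = 11, sd = 3))
dat$y <- 5 + 0.01 * dat$x1 + 4 * dat$x2 + rnorm(n, sd = 8)

mod_raw <- lm(y ~ x1 + x2, data = dat)

# Standardize every column (outcome included), then refit the same model.
dat_st <- as.data.frame(scale(dat))
mod_st <- lm(y ~ x1 + x2, data = dat_st)

# Slope t statistics match; the standardized intercept is numerically zero.
t_raw <- coef(summary(mod_raw))[c("x1", "x2"), "t value"]
t_st <- coef(summary(mod_st))[c("x1", "x2"), "t value"]
cbind(t_raw, t_st)
coef(mod_st)["(Intercept)"]
```

Standardizing rescales each slope and its standard error by the same factor, so their ratio cannot change.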
Why Do Some People Like Standardized Coefficients?
I'm an outsider, looking in. It seems like:
They seek an easy comparison, like "a one standard deviation rise in X1 causes a $\hat{\beta}_1$-standard-deviation increase in y."
So, if X1 is measured in dollars and y is measured in "pounds of elephant fat per cubic yard of shipping container", or "bushels of wheat per year", the standardization TRIES to make them comparable.
Does Standardization Make education, income, and women Comparable?
                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max.   Std. Dev.
    education    6.380     8.445    10.540    10.738    12.648    15.970        2.73
    income        61.1     410.6     593.0     679.8     818.7    2587.9      424.59
    women        0.000     3.592    13.600    28.979    52.203    97.510       31.72
    prestige     14.80     35.23     43.60     46.83     59.27     87.20       17.20
Review education
[Figure: density of education (car::Prestige)]
Review income
[Figure: density of income (car::Prestige)]
Review women
[Figure: density of women (car::Prestige)]
Suppose there's a Categorical Predictor type
Step 1. Standardize income, education and women, but leave type as it is:
    presmod2 <- lm(prestige ~ income + education + women + type, data = stPrestige)
The type variable is not standardized, so the challenge is to compare a 1 unit change in education with a 1 unit (from top to bottom) change in type.
In rockchalk, meanCenter can be used
The ordinary, nothing standardized regression is:
    presmod1 <- lm(prestige ~ income + education + women + type, data = Prestige)
This use of the meanCenter function will standardize all numeric predictors and re-fit the regression:
    presmod2mc <- meanCenter(presmod2, centerOnlyInteractors = FALSE, standardize = TRUE)
Compare the factor's estimates with the Standardized Numeric Variables
                   Unstandardized         Partly Standardized
                   Estimate    (S.E.)     Estimate    (S.E.)
    (Intercept)    -0.814     (5.331)     -0.061      (0.108)
    income          0.01*     (0.003)      0.257*     (0.065)
    education       3.662*    (0.646)      0.581*     (0.102)
    women           0.006     (0.03)       0.012      (0.056)
    typeprof        5.905     (3.938)      0.343      (0.229)
    typewc         -2.917     (2.665)     -0.17       (0.155)
    N               98                     98
    RMSE            7.132                  0.415
    R2              0.835                  0.835
    adj R2          0.826                  0.826
    * p ≤ 0.05
Treatment of a Factor Variable
R sees type as a factor variable, so it creates the dummy contrast (0,1) predictors automatically:
    presmod2 <- lm(prestige ~ income + education + women + type, data = stPrestige)
Although I don't understand why this is meaningful, some programs (SPSS) will standardize dichotomous variables, and feed them to the regression.
In the past, I've done that manually (dealing with NA and so forth), but standardize() in rockchalk can handle it.
    presmod3st <- standardize(presmod2)
    summary(presmod3st)
    All variables in the model matrix and the dependent variable
    were centered. The variables here have the same names as their
    non-centered counterparts, but they are centered, even constructed
    variables like `x1:x2` and poly(x1,2). We agree, that's probably
    ill-advised, but you asked for it by running standardize().

    Observe, the summary statistics of the variables in the design matrix.
              mean std.dev.
    prestige     0        1
    income       0        1
    education    0        1
    women        0        1
    typeprof     0        1
    typewc       0        1

    Call:
    lm(formula = prestige ~ income + education + women + typeprof + typewc, data = stddat)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.86274 -0.26217  0.01824  0.30698  1.08206

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -2.064e-16  4.214e-02   0.000 1.000000
    income       2.579e-01  6.487e-02   3.976 0.000139 ***
    education    5.889e-01  1.039e-01   5.671 1.63e-07 ***
    women        1.183e-02  5.577e-02   0.212 0.832494
    typeprof     1.615e-01  1.077e-01   1.500 0.137127
    typewc      -7.269e-02  6.642e-02  -1.094 0.276626
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.4172 on 92 degrees of freedom
    Multiple R2: 0.8349, Adjusted R2: 0.826
    F-statistic: 93.07 on 5 and 92 DF, p-value: < 2.2e-16
Standardized Categorical Predictors Too
                   Unstandardized        Standardized (except type)    All Standardized
                   Estimate    (S.E.)    Estimate    (S.E.)            Estimate    (S.E.)
    (Intercept)    -0.814     (5.331)    -0.061      (0.108)            0          (0.042)
    income          0.01*     (0.003)     0.257*     (0.065)            0.258*     (0.065)
    education       3.662*    (0.646)     0.581*     (0.102)            0.589*     (0.104)
    women           0.006     (0.03)      0.012      (0.056)            0.012      (0.056)
    typeprof        5.905     (3.938)     0.343      (0.229)            0.161      (0.108)
    typewc         -2.917     (2.665)    -0.17       (0.155)           -0.073      (0.066)
    N               98                    98                            98
    RMSE            7.132                 0.415                         0.417
    R2              0.835                 0.835                         0.835
    adj R2          0.826                 0.826                         0.826
    * p ≤ 0.05
Note the summary stats in the standardize output
And the musical question is, DO YOU GAIN INSIGHT BY STANDARDIZING the categorical variables?
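Translate between $\hat{b}$ and $\hat{\beta}$
How does the beta, say $\hat{\beta}_1$, differ from the unstandardized coefficient, $\hat{b}_1$?
$$\hat{\beta}_1 = \frac{s_{X1}}{s_y} \hat{b}_1$$
You can prove this to yourself by multiplying (3) by $s_y$:
$$(y_i - \bar{y}) = \beta_1 \frac{s_y}{s_{X1}} (X1_i - \overline{X1}) + \beta_2 \frac{s_y}{s_{X2}} (X2_i - \overline{X2}) + \beta_3 \frac{s_y}{s_{X3}} (X3_i - \overline{X3}) + s_y u_i \qquad (4)$$
The coefficient on $X1_i$ in (4) is the unstandardized slope, so $\hat{b}_1 = \hat{\beta}_1 s_y / s_{X1}$.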
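The translation formula can be checked two ways. The sketch below first uses simulated data (all names and generating values are assumptions), then plugs in the rounded education numbers reported earlier in these slides: slope 4.187, standard deviation 2.73, and prestige standard deviation 17.20.

```r
# Simulated data; names and generating values are illustrative assumptions.
set.seed(101)
n <- 150
dat <- data.frame(x1 = rnorm(n, mean = 680, sd = 425),
                  x2 = rnorm(n, mean = 10.7, sd = 2.7))
dat$y <- -7 + 0.013 * dat$x1 + 4.2 * dat$x2 + rnorm(n, sd = 8)

# Unstandardized slopes, and betas from a regression on standardized data.
b <- coef(lm(y ~ x1 + x2, data = dat))[c("x1", "x2")]
beta_fit <- coef(lm(y ~ x1 + x2, data = as.data.frame(scale(dat))))[c("x1", "x2")]

# Betas from the formula: beta_j = b_j * s_xj / s_y.
s_x <- sapply(dat[c("x1", "x2")], sd)
beta_formula <- b * s_x / sd(dat$y)
rbind(beta_fit, beta_formula)

# Rounded numbers from the slides, education row: b * s_x / s_y.
4.187 * 2.73 / 17.20
```

The last line gives roughly 0.665, which agrees with the reported standardized education coefficient of 0.664 up to rounding of the inputs.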
Beta Problem 1: unknown $\sigma$
King argues Betas are a misguided effort to convert a difficult problem into an easy one.
1 We don't know the true standard deviations $\sigma_y$, $\sigma_{X1}$, $\sigma_{X2}$, etc. We estimate them by the sample standard deviations, $s_{X1}$, $s_{X2}$, ...
2 Suppose $\hat{b}_1 = \hat{b}_2$: two variables have the same effect, and they are measured on the same scale. If $s_{X1} \neq s_{X2}$, that will cause $\hat{\beta}_1$ and $\hat{\beta}_2$ to differ.
Re-subsetting the data changes the sample standard deviations, so it is virtually guaranteed to change the standardized coefficients.
This means that betas in a single sample are not so meaningful as we originally thought, and furthermore ...
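A sketch of both complaints on simulated data. Everything here is an assumption made for illustration: the two predictors share a true slope of 2, `x2` is given three times the spread of `x1`, and the "subset" keeps only cases with `x2` between -2 and 2.

```r
# Simulated data; names, slopes, spreads and the subset rule are assumptions.
set.seed(321)
n <- 2000
x1 <- rnorm(n, sd = 1)
x2 <- rnorm(n, sd = 3)          # same scale as x1, but more spread out
y <- 10 + 2 * x1 + 2 * x2 + rnorm(n, sd = 2)
dat <- data.frame(y, x1, x2)

b <- coef(lm(y ~ x1 + x2, data = dat))
beta <- coef(lm(y ~ x1 + x2, data = as.data.frame(scale(dat))))

# Restrict the sample to a narrower range of x2 and repeat.
sub <- dat[abs(dat$x2) < 2, ]
b_sub <- coef(lm(y ~ x1 + x2, data = sub))
beta_sub <- coef(lm(y ~ x1 + x2, data = as.data.frame(scale(sub))))

# Rows: slopes and betas in the full sample, then in the subset.
round(rbind(b = b[-1], beta = beta[-1],
            b_sub = b_sub[-1], beta_sub = beta_sub[-1]), 3)
```

The unstandardized slopes stay near 2 in both samples, while the betas differ from each other and shift again once the range of `x2` is restricted.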
Different y variances are a Problem Too
Suppose we have two groups of respondents, and the same slopes apply to both:
$$\text{group 1}: \; y_i = b_0 + b_1 x1_i + b_2 x2_i + e1_i, \quad e1_i \sim N(0, \sigma^2_{e1}) \qquad (5)$$
$$\text{group 2}: \; y_i = b_0 + b_1 x1_i + b_2 x2_i + e2_i, \quad e2_i \sim N(0, \sigma^2_{e2}) \qquad (6)$$
Heteroskedasticity usually does not affect the $\hat{b}_j$ estimates.
Standardization of $y_i$ has a multiplier effect across the whole line, so all of the coefficients will shrink or expand.
Standardization complicates the problem of comparing coefficients across samples.
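A sketch of the two-group setup in (5) and (6), on simulated data. The helper names `make_group`, `fit_b` and `fit_beta`, the slopes 2 and 3, and the error standard deviations 1 and 10 are all hypothetical choices for illustration.

```r
# Simulated groups; helper names and all numbers are illustrative assumptions.
set.seed(555)
n <- 1000
make_group <- function(sigma_e) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  data.frame(x1, x2, y = 1 + 2 * x1 + 3 * x2 + rnorm(n, sd = sigma_e))
}
g1 <- make_group(sigma_e = 1)    # small error variance
g2 <- make_group(sigma_e = 10)   # same slopes, much noisier outcome

# Unstandardized slopes and standardized betas (intercepts dropped).
fit_b <- function(d) coef(lm(y ~ x1 + x2, data = d))[-1]
fit_beta <- function(d) coef(lm(y ~ x1 + x2, data = as.data.frame(scale(d))))[-1]

round(rbind(b_g1 = fit_b(g1), b_g2 = fit_b(g2),
            beta_g1 = fit_beta(g1), beta_g2 = fit_beta(g2)), 3)
```

The unstandardized slopes estimate the same values in both groups (less precisely in the noisy one), but every beta in group 2 shrinks because the standard deviation of `y` is larger there.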
Beta Problems 2: Non-Numeric Predictors
Do you really think there is any way to formalize a comparison of Sex {0, 1} and income in dollars?
Consider standardizing a dichotomous variable. What does the mean mean? The standard deviation?
"A one standard deviation increase in the variable male causes...."
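To make the awkwardness concrete, here is a sketch with a simulated 0/1 variable (the name `male`, the sample size and the 0.3 probability are assumptions). The mean is just the proportion of ones, and the standardized variable still takes only two values.

```r
# Simulated dichotomous variable; name, n and probability are assumptions.
set.seed(10)
male <- rbinom(200, size = 1, prob = 0.3)

p <- mean(male)   # the mean of a 0/1 variable is the proportion of ones
s <- sd(male)     # equals sqrt(p * (1 - p) * n / (n - 1))

# After standardizing there are still only two distinct values.
male_st <- (male - p) / s
sort(unique(male_st))

# The only change that can actually happen, 0 to 1, is 1/s standard deviations.
1 / s
```

So "a one standard deviation increase" describes a change that no case in the data can undergo, unless 1/s happens to equal 1.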
5 Practice Problems
Standardized Regression Coefficients
1 Take any real life data set you want that has (at least) 3 numeric variables. For ease of exposition, I will call the DV y and the IVs x1, x2, and so forth, but you of course can use the real names when you describe the model.
    1 Regress y on x1. Do the usual chores: create a scatterplot, draw the regression line, write a sentence to describe the estimated relationship. From the line you drew, pick 2 interesting values of x1 and write a sentence comparing the predicted values.
    2 Create histograms for y and x1 and super-impose the kernel density curves in order to get a mental image of the distributions. Calculate the means and standard deviations.
    3 Create standardized variables yst and x1st. Run the regression of yst on x1st. Create a scatterplot of yst on x1st, draw the predicted line. For the 2 interesting values of x1 from the previous case, calculate the corresponding values of x1st and figure out what the predicted value of yst is for those particular values. Then write a sentence comparing the predicted values of yst for those two cases.
    4 In your opinion, did standardization improve your ability to interpret the effect of x1 and x1st?
2 Repeat the same exercise, except this time include two or more numeric predictors. When you conduct the first part, pick interesting values for all of your IVs, and make a predicted value table of this sort (I've included example interesting values for x1 and x2).
    value combinations
    x1     x2     predicted y
     9     3.2    ?
     9     4.6    ?
    32     3.2    ?
    32     4.7    ?
I could show you how to make a 3D scatterplot (see the Multicollinearity lecture), but it is probably not worth your effort.
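One way to fill in such a table is `expand.grid()` plus `predict()`. This is a sketch on simulated data: the names `dat`, `mod`, `newdat` and the generating values are assumptions, and it uses 3.2 and 4.6 as the two x2 values for both x1 values.

```r
# Simulated data; names and generating values are illustrative assumptions.
set.seed(77)
dat <- data.frame(x1 = runif(100, min = 5, max = 40),
                  x2 = runif(100, min = 2, max = 6))
dat$y <- 2 + 0.3 * dat$x1 + 1.5 * dat$x2 + rnorm(100)
mod <- lm(y ~ x1 + x2, data = dat)

# All combinations of the interesting values, one row per combination.
newdat <- expand.grid(x1 = c(9, 32), x2 = c(3.2, 4.6))

# Fill in the "?" column with the model's predicted values.
newdat$predicted_y <- predict(mod, newdata = newdat)
newdat
```

Note that `expand.grid()` varies the first variable fastest, so the rows come out in a different order than the table above.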
3 Find a dataset with a dichotomous predictor. Or create your own dichotomous predictor by categorizing a numeric variable (in R I use the cut function for that). Conduct the same exercise again. Try to describe the regression model with unstandardized data, and then conduct the standardized model.
Let's concentrate on categorical predictors with many categories. We need data with a numeric variable for y and a multi-category predictor. If x1 is type of profession, for example, then when R fits the regression of y on x1, it will create the dummy variables for g-1 categories. You can create your own dummy variables if you want, but in R there is an easier way because you can ask the regression model to keep the data for you after it is done fitting. So instead of just running
    mod1 <- lm(y ~ x, data = dat)
run this
    mod2 <- lm(y ~ x, data = dat, x = TRUE, y = TRUE)
After that, the dependent variable will be saved in the model object as mod2$y and the matrix of input variables will be saved as mod2$x. So you can grab those into a new data frame like so:
    myNewDF <- data.frame(mod2$y, mod2$x)
Here's a real life example I just ran to make sure that works.
    library(car)
    mod1 <- lm(prestige ~ type, data = Prestige, x = TRUE, y = TRUE)
    dat2 <- data.frame(mod1$y, mod1$x)
In dat2, the variables now are:
    mod1.y X.Intercept. typeprof typewc
But I can beautify the names like so:
    colnames(dat2) <- c("prestige", "int", "prof", "wc")
The baseline value of type is "bc". That category disappeared into the intercept, but we can re-create it easily:
    dat2$bc <- dat2$int - dat2$prof - dat2$wc
See what I mean? bc is what remains after you remove the prof and wc.
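The same recipe can be tried without installing car. The sketch below swaps in the built-in `iris` data as a stand-in (an assumption for convenience, not the example from these slides): `Species` plays the role of type, and `setosa` is the baseline category that disappears into the intercept.

```r
# Stand-in example with the built-in iris data (an assumption, used so that
# no extra package is needed). Species has 3 levels; setosa is the baseline.
mod <- lm(Sepal.Length ~ Species, data = iris, x = TRUE, y = TRUE)

# Pull the saved outcome and model matrix into a data frame, then rename.
dat2 <- data.frame(mod$y, mod$x)
colnames(dat2) <- c("sepal", "int", "versicolor", "virginica")

# Re-create the baseline dummy: what remains after removing the other two.
dat2$setosa <- dat2$int - dat2$versicolor - dat2$virginica

# Cross-check the rebuilt dummy against the original factor.
table(rebuilt = dat2$setosa, species = iris$Species)
```

With a factor that has missing values, as type does in Prestige, lm() drops those rows first, so the rebuilt data frame has fewer rows than the original data.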
After that, you can create standardized variables for prestige, bc, prof and wc and then run a regression with them. I'm a little worried that the separate standardization of the dummy variables prof and wc throws away the information that flows from the fact that they are indicators for the same variable. Do you know what I mean? When they are bc, prof and wc, we know that they are 0 or 1 in a logical pattern. I'll have to think harder on that when I get some free time. Or else, you will work it out for me and then I'll not have to do any hard thinking.
