
Data Analysis: Regression

Research Methods for Public Administrators
Dr. Gail Johnson


Dr. G. Johnson, www.researchdemystified.org

Making Sense of Regression


Regression analysis is an advanced analytical technique with the ability to consider many different variables that might explain something like differences in income or declining crime rates.

Making Sense of Regression


Why include regression in an introductory research methods textbook?
  Because regression results are often reported in the news
  Because regression is not hard to understand conceptually: it builds on what we know about relationships and measures of association, even if the actual equations are intimidating and unclear because so many symbols are used

Back to the Premise of Demystifying Statistics


When advocates of particular policies try to persuade, they often use statistics.
The fancier statistics might be appropriate but can also bedazzle or intimidate.
Having an insider's view about measuring relationships using quantitative data may demystify these statistical techniques.

Making Sense of Regression


The emphasis here is on:
  Understanding the key elements of regression
  Requirements
  Application
  Limitations

Regression Is A Powerful Analytical Technique


Enables researchers to do two things:
1. Determine the strength of the relationship
  The r-squared value
  Small r for regression with only one independent variable
  Capital R for regression with more than one independent variable

Regression Is A Powerful Analytical Technique


2. Determine the impact of the independent variable(s) on the dependent variable
  The regression coefficient is the predicted change in the dependent variable for every one-unit change in the independent variable
  Collectively, the regression coefficients enable researchers to estimate how the dependent variable will change under different scenarios for the independent variables

1. R-square And Its Companions


r = correlation coefficient (overall fit or measure of association; also called Pearson's r, the Pearson product-moment correlation coefficient, or the zero-order coefficient)
  We've seen this in a prior chapter
r-square = proportion of explained variance in the dependent variable (also called the coefficient of determination)
1 minus r-square = proportion of unexplained variance in the dependent variable
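
As a minimal sketch of how these three quantities relate, the Python below computes r, r-square, and 1 minus r-square for a small set of GRE scores and GPAs. The data and variable names are made up for illustration, and the NumPy library is assumed to be installed; nothing here comes from a real study.

```python
import numpy as np

# Hypothetical (made-up) data: GRE scores and graduate GPAs for eight students
gre = np.array([300, 305, 310, 315, 320, 325, 330, 335], dtype=float)
gpa = np.array([3.1, 3.6, 3.0, 3.4, 3.3, 3.8, 3.2, 3.7])

# Pearson's r: the correlation coefficient between the two variables
r = np.corrcoef(gre, gpa)[0, 1]

# r-square: proportion of the variance in GPA explained by GRE
r_squared = r ** 2

# 1 minus r-square: proportion of the variance left unexplained
unexplained = 1 - r_squared

print(f"r = {r:.2f}, r-square = {r_squared:.2f}, unexplained = {unexplained:.2f}")
```

Whatever the data, the explained and unexplained proportions always sum to 1, which is why reporting one of them is enough.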

Interpreting R-Square Is Easy


Or at least as easy as any measure of association
Fake example: Researchers look at GRE scores and academic performance in graduate school as measured by grade point average (GPA)
  The hypothesis is that people who have high GRE scores will also have high GPAs
  From an admissions committee perspective: the belief is that GRE scores are a good predictor of future academic success and are, therefore, a good criterion for admission decisions
  The researchers report an r-squared of .2

Interpreting R-Square Is Easy


R-square is similar to a measure of association: it varies from 0 to 1, with zero indicating no relationship and 1 indicating a perfect relationship
Except that it gives more information: it gives an estimate of how much of the variation in the dependent variable (in this case, GPAs) is explained by GRE scores
Interpretation of the prior slide: GRE scores explain 20 percent of the variation in GPAs
This means that 80 percent of the variation in GPAs is explained by other factors

Discussion
If you were making a recommendation to the admissions committee, how much emphasis should they give GRE scores in admission decisions?
Explain/defend your reasoning

A Different R-Squared, A Different Decision?


Suppose the researchers found an r-squared of .65?
What would you recommend? Why?
What other factors might be important in predicting academic success in graduate school?

Paradox of High R-squares


Researchers want to obtain results with a high R-square
  They want to build models that explain as much as possible about what affects the dependent variable
  That is, they want to discover good predictive models
But sophisticated users should be suspicious of results with a high R-squared

Generating High R-squares


Problem of multi-collinearity
  This means using independent variables that are highly correlated with each other
  Including both median income and poverty rates, for example, can throw off the mathematics and may give a falsely high r-squared
Aggregating data in ways that reduce sample size can generate high r-squares
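
The aggregation point can be illustrated with a small sketch. The data below are invented and deliberately constructed so that the noise cancels when cases are averaged in pairs; real data will behave less neatly, but the direction of the effect is the same. NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical data: 20 cases where y tracks x, plus noise that alternates +3 / -3
x = np.arange(20, dtype=float)
noise = np.tile([3.0, -3.0], 10)
y = x + noise


def r_squared(a, b):
    """Squared Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1] ** 2


# r-square using all 20 individual cases
r2_individual = r_squared(x, y)

# Aggregate: average each consecutive pair of cases, leaving only 10 data points
x_grouped = x.reshape(10, 2).mean(axis=1)
y_grouped = y.reshape(10, 2).mean(axis=1)
r2_grouped = r_squared(x_grouped, y_grouped)

print(f"individual cases: r-square = {r2_individual:.2f}")
print(f"grouped averages: r-square = {r2_grouped:.2f}")
```

The underlying relationship has not changed; averaging simply hides the case-to-case variation, so the grouped r-square looks much stronger than the individual-level one.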

Generating High R-squares


Researchers might decide to get rid of outliers: the data points that are really, really far away from the bulk of the data
  If the data point is truly incorrect (clearly someone typed it in wrong), it can be deleted
  Otherwise, researchers should accept the outliers as part of the way things are
For more information, see J. Scott Armstrong, 1985, Long-Range Forecasting, 2nd ed., p. 487.

2. Regression Wizardry: Predicting Change


Regression follows the same concepts of relationships, then takes it to the next level
  It allows researchers to predict the change in the dependent variable based on every unit change in the independent variable
  This is the regression coefficient (or partial regression coefficient in multiple regression analysis)
If the regression coefficient = .05, it means that for every one-unit increase in the GRE score, there will be a .05 increase in the predicted GPA
  Assuming, of course, that there is a strong relationship

Other Examples of the Regression Coefficient


For every one-unit change in years of education, there is a $2,000 change in yearly individual income.
For every one-unit change in the age of a plane, there is a $500 change in maintenance costs.
For every one-unit change in age, there is a .3 percent decrease in memory test scores among adults.
(Note: these are all fake data)

Regression Requirements
Requirements:
  Assumes a linear relationship
  Uses random sample or census data
  Works with interval/ratio level data
It is possible to convert a nominal variable into a dummy variable (which means that it only has two values: 0 and 1) to use as an independent variable
  For example: Gender: female 0, male 1
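
A minimal sketch of dummy coding, using the slide's scheme of female = 0 and male = 1. The list of responses is hypothetical.

```python
# Hypothetical survey responses for a nominal variable
gender = ["female", "male", "male", "female", "male"]

# Coding scheme from the slide: female = 0, male = 1
codes = {"female": 0, "male": 1}

# Dummy variable: each case is now a 0 or a 1 and can enter a regression
gender_dummy = [codes[g] for g in gender]

print(gender_dummy)  # [0, 1, 1, 0, 1]
```

Which category gets the 0 is arbitrary; it only changes the direction in which the resulting coefficient is read.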

Ordinary Least Squares Regression


There are many types of regression tools
  For our purposes, I am sticking with what is called ordinary least squares (OLS), which can only be used with interval/ratio level data (i.e., real numbers)
  There are other types to handle other data situations
  For example, logistic regression is used with a nominal dependent variable with only 2 categories
    For example: Drug Use: yes or no

The Concept of Least Squares


Regression analysis used here is based on the idea of least squares
The computer creates an imaginary "best" straight line through a set of data, such that for any value of X, the value of Y can be predicted

[Figure: scatterplot with a straight line. X axis: Age of Planes (5, 10, and 20 years marked). Y axis: Plane Maintenance Costs ($500 and $1,000 marked). The dots represent each plane's age and maintenance cost from the prior year; the line shows the predicted values if the relationship were perfect.]

The Concept of Least Squares


This line is selected because it yields the smallest total squared distance between the data points and the line.
  The distances are squared as part of the calculation, hence the name least squares
The line is useful to the extent that the difference between the predicted line and the actual data points is small
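
To make the least-squares idea concrete, here is a rough sketch that fits a line to invented plane data and compares it with a slightly different line. The ages, costs, and the helper name `sum_squared_distances` are assumptions for illustration; NumPy's `polyfit` is used to find the least-squares line.

```python
import numpy as np

# Hypothetical data: age of each plane (years) and last year's maintenance cost ($)
age = np.array([2, 5, 8, 10, 12, 15, 18, 20], dtype=float)
cost = np.array([520, 600, 640, 760, 750, 880, 900, 1010], dtype=float)


def sum_squared_distances(a, b):
    """Total squared vertical distance between the data and the line cost = a + b * age."""
    predicted = a + b * age
    return float(np.sum((cost - predicted) ** 2))


# np.polyfit with degree 1 returns the least-squares slope and intercept
b_best, a_best = np.polyfit(age, cost, 1)
best = sum_squared_distances(a_best, b_best)

# Any other line, such as one with a slightly steeper slope, fits worse
other = sum_squared_distances(a_best, b_best + 1.0)

print(f"least-squares line: cost = {a_best:.1f} + {b_best:.1f} * age")
print(f"sum of squared distances: best = {best:.0f}, other = {other:.0f}")
```

Trying other intercepts or slopes in place of the perturbed line gives the same result: no straight line has a smaller sum of squared distances than the fitted one.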

Simple Regression Equation


Y = a + bX + e
Where:
  Y = predicted value of the dependent variable
  a = the constant or Y intercept (where the imaginary line crosses the Y axis)
  b = the regression coefficient
  X = the independent variable
  e = error (the computer will estimate the likely error)

Applying Simple Regression


Researchers are asked to estimate maintenance costs for next year's budget
  This is a large state that has a fleet of planes used by public officials to make it easy to visit all parts of the state
Analysts believe that there is a relationship between maintenance costs and use of the planes (measured by the miles flown)
  Y = plane maintenance costs measured in dollars (the dependent variable)
  X = miles flown (the independent variable)

How It Is Applied
Analysts collect data over the past two years and crunch it.
The computer gives these results: Y = 100 + .020X
The constant is 100:
  If they do not fly at all, the computer estimates there is still a cost of $100
The .020 is the regression coefficient:
  This gets interpreted as: for every mile flown, there is a $.02 change in maintenance costs.

Simple Regression
Y = 100 + .020X
Interpreting the regression coefficient:
  For every mile flown, the maintenance cost goes up by 2 cents.
  For every 100 miles flown, costs go up by $2
  For every 1,000 miles, costs go up by $20
  For every 100,000 miles, costs go up by $2,000

Making Maintenance Cost Estimates


They can then solve the equation:
  Assuming 100,000 miles will be flown, how much will they need to budget for maintenance?
  100,000 multiplied by .020 = $2,000
  Y = $100 + $2,000 + error
They estimate maintenance will cost:
  $2,100 + error
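
The budget arithmetic can be checked with a short sketch of Y = 100 + .020X. The function name and the list of scenarios are illustrative assumptions, and the error term is left out. Since .020 times 100,000 is 2,000, the 100,000-mile scenario comes to $2,100 before error.

```python
def estimate_maintenance_cost(miles_flown, constant=100.0, coefficient=0.020):
    """Predicted maintenance cost in dollars from Y = a + bX, ignoring the error term."""
    return constant + coefficient * miles_flown


# Budget scenarios for different amounts of flying
for miles in (0, 1_000, 100_000):
    print(f"{miles:>7,} miles -> ${estimate_maintenance_cost(miles):,.2f}")
```

Changing the `miles_flown` argument is all it takes to run a different scenario, which is what makes the regression equation useful for budgeting.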

Yes, but
How strong is the relationship between miles flown and maintenance costs?
Before we put too much faith in these budget estimates, we will want to look at the r-squared
Like any measure of association, there is some choice about what is good enough, since it would be exceedingly rare to get an r-squared close to a perfect 1.

Simple Regression: Another Example


Hypothesis: If schools have a higher percentage of poor children, then they will have lower test scores.
A regression analysis shows:
  A regression coefficient of -.04
  An r-squared value of .25

Simple Regression
Interpretation?
Regression coefficient: For every one-point increase in the percent of children in poverty within a school, the average test score goes down by .04
R-squared: 25% of the variation in test scores is explained by the percent of children in poverty in the school
Researchers will ask: what other factors might explain differences in test scores in the schools?
  They will want to build a bigger model that will include more factors

Life More Complex


Rarely will any one single variable cause big changes in another variable, especially with complex phenomena
Warning bells should sound when anyone states that a single variable caused a complex problem
  The economic collapse is due to consumer debt
  The economic collapse is due to corporate greed

Discussion: Complexity of Public Policy Issues


What are the possible causes of the 2008 economic downturn?
What are the possible explanations for the declining crime rate from 1991 to 2004?
  The national violent crime rate was:
    1991: 753 per 100,000 population
    2004: 463 per 100,000 population

What Are the Possible Causes for Urban Decay?


Lack of jobs
High % of absentee landlords
Low % of homeowners
Poor quality of schools
Increased concentration of poor
Increase in drugs, crime
Aging housing stock
Flight of middle class to suburbs
Corruption
Aging infrastructure
Business flight to suburbs

Multiple Regression: Added Power


Multiple regression does four things:
  Provides an overall measure of the predictive strength of the model: the R-square
  Predicts the dependent variable based on the summed contributions of the independent variables
  Determines the impact of each independent variable on the dependent variable while controlling for the other variables (these are the partial regression coefficients)
  Determines the relative strength of each of the independent variables using the beta weights

Multiple Regression Equation


Y = a + b1X1 + b2X2 + b3X3 + b4X4 + e
  Y = dependent variable
  X1 = independent variable 1, controlling for X2, X3, X4
  X2 = independent variable 2, controlling for X1, X3, X4
  X3 = independent variable 3, controlling for X1, X2, X4
  X4 = independent variable 4, controlling for X1, X2, X3

Multiple Regression Equation


It has the same basic structure as simple regression
  Y is still the dependent variable
  There is still a constant (a) and some amount of error (e) that the computer calculates
  But there are more Xs to represent the multiple independent variables

Multiple Regression Equation


The b in front of each X is a partial regression coefficient
  It is the separate impact of that variable on the dependent variable, controlling for all the other independent variables (sometimes called holding them constant)

Multiple Regression: An Example


Hypothesis: Income is a function of education and seniority
  We suggest that income (the dependent variable) will increase as both education and seniority increase (two independent variables)
  Y (Income) = a + education + seniority + error
Based on Lewis-Beck example

Multiple Regression: Interpretation


Results: Y = 6000 + 400X1 (education) + 200X2 (seniority)
R-square = .67
First look at the R-square: this shows a strong relationship, so analysis can continue
Partial regression coefficients:
  For every year of education, holding seniority constant, income increases by $400.
  For every year of seniority, holding education constant, income increases by $200.

Multiple Regression: Application


Estimate the income of someone who has 10 years of education and 5 years of seniority
We solve the regression equation:
  Multiply the 10 years of education by the regression coefficient of 400: equals 4,000
  Multiply the 5 years of seniority by the regression coefficient of 200: equals 1,000
  Put it together with the constant and you have Y = 6000 + 400(10) + 200(5) + error
Y = $11,000 + error
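
The same calculation as a small sketch, using the coefficients reported on the previous slide. The function and variable names are illustrative, and the error term is omitted.

```python
def estimate_income(education_years, seniority_years):
    """Predicted income in dollars from Y = 6000 + 400*X1 + 200*X2, ignoring error."""
    constant = 6000
    b_education = 400   # partial regression coefficient for education
    b_seniority = 200   # partial regression coefficient for seniority
    return constant + b_education * education_years + b_seniority * seniority_years


# The slide's scenario: 10 years of education, 5 years of seniority
print(estimate_income(10, 5))  # 11000

# One more year of education with seniority held constant adds $400
print(estimate_income(11, 5) - estimate_income(10, 5))  # 400
```

The second line shows what "holding seniority constant" means in practice: only education changes between the two calls, so the difference is exactly the education coefficient.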

Multiple Regression: Beta Weights


Are contributions to political campaigns a function of age and income?
  Y = campaign contribution (dollars)
  X1 = age (years)
  X2 = income (dollars)

Multiple Regression
Contributions to political campaigns as a function of age and income.
Computer generates this equation:
  Y = 8 + 2X1 (age) + .010X2 (income)
Interpreting the partial regression coefficients:
  For every one-year increase in age, holding income constant, contributions go up by $2.
  For every dollar increase in income, holding age constant, contributions go up by $.01.

Multiple Regression: Beta Weights


But which is stronger?
  We cannot tell because age and income are measured differently (years versus dollars)
Need to look at the beta weights
  Beta weights are standardized, thus making all variables comparable
  But they have a very limited application

Beta Weights
Returning to age and income as predictors of campaign contributions, the computer gives us these beta weights:
  Age = .15
  Income = .45
Which is the stronger of the two?
  Income has the higher beta weight and is therefore the stronger of the two
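
One common way to obtain beta weights is to rescale each unstandardized coefficient by the ratio of standard deviations (predictor over outcome). The sketch below does that for invented donor data; the ages and incomes are assumptions, and the contributions are built to follow Y = 8 + 2X1 + .010X2 exactly, so the resulting beta weights will not match the .15 and .45 reported on the slide. NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical data for eight donors (not from the slides)
age = np.array([25, 35, 45, 55, 65, 30, 50, 60], dtype=float)
income = np.array([30000, 52000, 41000, 90000, 60000, 75000, 38000, 120000], dtype=float)
# Contributions built to follow the slide's equation exactly, so there is no error term
contribution = 8 + 2 * age + 0.010 * income

# Fit Y = a + b1*X1 + b2*X2 by ordinary least squares
X = np.column_stack([np.ones_like(age), age, income])
(a, b_age, b_income), *_ = np.linalg.lstsq(X, contribution, rcond=None)

# Beta weight = unstandardized coefficient * (std. dev. of X / std. dev. of Y)
beta_age = b_age * age.std(ddof=1) / contribution.std(ddof=1)
beta_income = b_income * income.std(ddof=1) / contribution.std(ddof=1)

print(f"b_age = {b_age:.3f}, b_income = {b_income:.3f}")
print(f"beta_age = {beta_age:.2f}, beta_income = {beta_income:.2f}")
```

Although the raw coefficient for age (2) is far larger than the one for income (.010), the ordering reverses once both are put on the same standardized scale, which is the reason for looking at beta weights at all.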

Takeaway Lesson
When reading research results about relationships, my best advice is to exercise healthy skepticism and ask the tough questions before asserting, or believing, that research results are irrefutable facts merely because of sophisticated mathematics.

Takeaway Lesson
Knowing how difficult it is to demonstrate causality or program impacts, be mindful when people present research asserting they have found a cause-effect relationship.
Be especially cautious when people claim they have found a single cause for a complex phenomenon, even when they use advanced statistical techniques.

Takeaway Lesson
At the same time, be cautious in believing variables are not connected or that programs do not have an impact based on data from one study.
  "More research is needed" is not a self-employment program for researchers
It is also important to know when statistics are just too frail to give a clear answer.

Ask the Tough Questions


Are they using data that is likely to be unknown or difficult to measure?
  Do the proxy measures they use make sense?
  Do they state all of the assumptions behind the measures used in their calculations?
Is the analysis appropriate to the situation?
Do they provide measures of association, and are they strong enough?

Ask the Tough Questions


Is their design strong enough to rule out possible rival explanations?
Even with fancy statistics, the basic principles of good research design still must be met, especially when attempting to answer cause-effect questions
  I might show a high r-square between stock market activity and sunspot activity, but I still need a good theory to explain why they are connected

Remember: It Is OK To Ask For Help


It is also important to recognize that statistics can be so technical that it is necessary to bring in experts to make sense of complex and confusing research results.
  No one expects you to know it all from one required research methods course, or remember it 10 years later
My point: remember that it really is OK to bring in the experts to make sense of research that focuses on issues that matter.

Creative Commons
This PowerPoint is meant to be used and shared with attribution
Please provide feedback
If you make changes, please share freely and send me a copy of the changes:
  Johnsong62@gmail.com
Visit www.creativecommons.org for more information