analytical technique with the ability to consider many different variables that might explain something like differences in income or declining crime rates
methods textbook?
Because
the news
Because regression is not hard to understand conceptually, building on what we know about relationships and measures of association, even if the actual equations are intimidating and unclear because so many symbols are used
Dr. G. Johnson, www.researchdemystified.org
persuade, they often use statistics. The fancier statistics might be appropriate but can also bedazzle or intimidate. Having an insider's view about measuring relationships using quantitative data may demystify these statistical techniques.
Application
Limitations
Small r for regression with only one independent variable
Capital R for regression with more than one independent variable
of association, which is also called r, Pearson's r, the Pearson Product Moment Correlation coefficient, or the zero-order coefficient.
dependent variable (also called the coefficient of determination)
1 minus r-square = proportion of unexplained variance in the dependent variable
The hypothesis is that people who have high GRE scores will also have high GPAs
From an admissions committee perspective: the belief that GRE scores are a good predictor of future academic success and are, therefore, a good criterion for admission decisions
The researchers report an r-squared of .2
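As a rough sketch of how to read that r-squared of .2: it splits the variance in GPA into an explained share and an unexplained share. With only one independent variable, r-squared is the square of r, so the size of r can be recovered too (its sign cannot). The variable names are illustrative assumptions, not part of the original slides.

```python
import math

# r-squared reported by the researchers in the GRE/GPA example
r_squared = 0.2

# Proportion of variance in GPA explained by GRE scores
explained = r_squared
# Proportion of variance in GPA left unexplained (1 minus r-square)
unexplained = 1 - r_squared
# With one independent variable, r-squared is r * r, so this is the size of r
r_magnitude = math.sqrt(r_squared)

print(f"explained: {explained:.0%}")
print(f"unexplained: {unexplained:.0%}")
print(f"size of r: about {r_magnitude:.2f}")
```

Read this way, GRE scores leave most of the differences in GPA unaccounted for.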
Discussion
If you were making a recommendation to the admissions committee, how much emphasis should they give GRE scores in admission decisions?
Explain/defend your reasoning
of .65?
What would you recommend? Why?
What other factors might be important in predicting academic success in graduate school?
high R-square
They want to build models that explain as much as possible about what affects the dependent variable
That is, they want to discover good predictive models
But sophisticated users should be suspicious
means using independent variables that are highly correlated with each other
Including median income and poverty rates, for example, will throw off the mathematics and may give a falsely high r-squared
Aggregating data in ways that reduce sample size can generate high r-squares
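One way to spot independent variables that are highly correlated with each other is to compute Pearson's r between them before building the model. The sketch below is a minimal illustration only: the neighborhood figures are invented, and the 0.8 cutoff is an assumed rule of thumb rather than a fixed standard.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical neighborhood data, invented for illustration only
median_income = [32000, 41000, 48000, 55000, 63000, 72000]
poverty_rate = [28.0, 22.0, 17.0, 14.0, 10.0, 6.0]

r = pearson_r(median_income, poverty_rate)
print(f"r between the two independent variables: {r:.2f}")

# 0.8 is an assumed rule-of-thumb cutoff, not a fixed standard
if abs(r) > 0.8:
    print("Highly correlated: consider using only one of them in the model")
```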
They
the data points that are really, really far away from the bulk of the data
If the data point is truly incorrect (clearly someone typed it in wrong), it can be deleted.
Otherwise, researchers should accept the outliers as part of the way things are
For more information, see Taken from J. Scott Armstrong, 1985, long-
It allows researchers to predict the change in the dependent variable based on every unit change in the independent variable
This is the regression coefficient (or partial regression coefficient in multiple regression analysis)
If the regression coefficient = .05, it means that for every one unit increase in the GRE score, there will be a .05 increase in the GPA score
Assuming, of course, that there is a strong relationship
there is a $2,000 change in yearly individual income.
For every one unit change in the age of a plane, there is a $500 change in maintenance costs.
For every one unit change in age, there is a .3 percent decrease in memory test scores among adults.
(Note: these are all fake data)
Regression Requirements
Requirements:
Assumes a linear relationship
Uses random sample or census data
Works with interval/ratio level data
It is possible to convert a nominal variable into a dummy variable, which means that it has only two values, 0 and 1, to use as an independent variable
For example: Gender: female 0, male 1
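The dummy-variable coding above can be sketched in a few lines. The 0/1 assignment follows the slide's example; the helper name `to_dummy` and the survey responses are hypothetical, made up for illustration.

```python
# Coding from the example above: female = 0, male = 1
GENDER_CODES = {"female": 0, "male": 1}

def to_dummy(values, codes):
    """Convert a list of category labels into their 0/1 codes."""
    return [codes[value] for value in values]

# Hypothetical survey responses, invented for illustration only
responses = ["female", "male", "male", "female"]
gender_dummy = to_dummy(responses, GENDER_CODES)
print(gender_dummy)  # [0, 1, 1, 0]
```

Once coded this way, the variable can be entered into the regression like any other independent variable.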
For our purposes, I am sticking with what they call ordinary least squares (OLS), which can only be used with interval/ratio level data (i.e., real numbers)
There are other types to handle other data situations
For example, logistic regression is used with a nominal dependent variable that has only 2 categories
For example: Drug Use: yes or no
the idea of least squares
The computer creates an imaginary "best" straight line through a set of data, such that for any value of X, the value of Y can be predicted
The dots represent each plane's age and maintenance cost from the prior year
[Figure: scatterplot of maintenance cost (vertical axis, labeled $500) by plane age (horizontal axis, labeled 20 years)]
total distance between every data point and this perfect line.
The distances are squared as part of the calculation, hence the name least squares
between the predicted line and the actual data points is small
Y = predicted value of the dependent variable
a = the constant or Y intercept (where the imaginary line crosses the Y axis)
b = the regression coefficient
X = the independent variable
e = error (the computer will estimate the likely error)
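To make the least squares idea concrete, here is a minimal sketch of what the computer does for one independent variable: it picks the constant (a) and regression coefficient (b) that make the sum of squared distances between the line and the data points as small as possible. The plane ages and costs are invented for illustration, and the function names are assumptions, not anything from the slides.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b: how much Y changes for each one unit change in X
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # a: where the line crosses the Y axis
    a = mean_y - b * mean_x
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Total squared distance between the data points and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: plane age in years and last year's maintenance cost
ages = [2, 5, 8, 12, 15, 20]
costs = [120, 180, 260, 310, 400, 480]

a, b = fit_line(ages, costs)
best = sum_squared_errors(ages, costs, a, b)
# Any other line, such as one with a steeper slope, has a larger total
worse = sum_squared_errors(ages, costs, a, b + 1)

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"squared error for fitted line: {best:.1f}")
print(f"squared error for steeper line: {worse:.1f}")
```

The error term (e) corresponds to the distances that remain after the best line has been chosen.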
This large state has a fleet of planes used by public officials to make it easy to visit all parts of the state
between maintenance costs and use of the planes (measured by the miles flown)
Y = plane maintenance costs measured in dollars (the dependent variable)
X = miles flown (the independent variable)
How It Is Applied
Analysts collect data over the past two years and crunch it.
The computer gives these results: Y = 100 + .020X
The constant is 100:
If they do not fly at all, the computer estimates there is still a cost of $100
The .020 is the regression coefficient:
This gets interpreted as: for every mile flown, there is a $.02 change in maintenance costs.
Simple Regression
Y = 100 + .020X
Interpreting the regression coefficient:
For every mile flown, maintenance costs go up by 2 cents.
For every 100 miles flown, costs go up by $2
For every 1,000 miles, costs go up by $20
For every 100,000 miles, costs go up by $2,000

If 100,000 miles will be flown, how much will they need to budget for maintenance?
100,000 multiplied by .020 = $2,000
Y = 100 + $2,000 + error
The estimated maintenance cost:
$2,100 + error
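The budget arithmetic can be checked with a short sketch that plugs miles flown into Y = 100 + .020X. The function and constant names are illustrative assumptions, and the error term is left out because the slides give no value for it.

```python
CONSTANT = 100          # a: estimated cost even if the planes do not fly
COST_PER_MILE = 0.020   # b: the regression coefficient

def predicted_maintenance_cost(miles_flown):
    """Predicted yearly maintenance cost in dollars, ignoring error."""
    return CONSTANT + COST_PER_MILE * miles_flown

for miles in (0, 100, 1_000, 100_000):
    cost = predicted_maintenance_cost(miles)
    print(f"{miles:>7,} miles -> ${cost:,.2f}")
```

At 2 cents per mile, 100,000 miles adds $2,000 to the $100 constant.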
Yes, but
How strong is the relationship between miles flown and maintenance costs?
Before we put too much faith in these budget estimates, we will want to look at the r-squared
Like any measure of association, there is some choice about what is good enough, since it would be exceedingly rare to get an r-squared close to a perfect 1.
percentage of poor children, then they will have lower test scores. A regression analysis shows:
A
Simple Regression
Interpretation?
Regression coefficient: For every one unit increase in the percent of children in poverty within a school, the average test score goes down by .04
R-squared: 25% of the variation in test scores is explained by the percent of children in poverty in the school
Researchers will ask: what other factors might explain differences in test scores in the schools?
They will want to build a bigger model that will include more factors
changes in another variable, especially complex phenomena
Warning bells should sound when anyone states that a single variable caused a complex problem
The economic collapse is due to consumer debt
The economic collapse is due to corporate greed
economic downturn?
What are the possible explanations for the declining crime rate from 1991 to 2004?
In
1991: 753 per 100,000 population
2004: 463 per 100,000 population
Increase in drugs, crime
Aging housing stock
Flight of middle class to suburbs
Corruption
Aging infrastructure
Business flight to suburbs
regression
Y is still the dependent variable
There is still a constant (a) and some amount of error (e) that the computer calculates
But there are more Xs to represent the multiple independent variables
b in front of the Xs will be the Partial Regression Coefficients
The separate impact on the dependent variable, controlling for all the other independent variables (sometimes called holding them constant)
For every year of education, holding seniority constant, income increases by $400.
For every year of seniority, holding education constant, income increases by $200.
Y = $11,000 + error
Multiple Regression
Relationship between contributions to political campaigns as a function of age and income.
Computer generates this equation: Y = 8 + 2X1 + .010X2, where X1 = age and X2 = income
Interpreting the partial regression coefficients:
For every one year increase in age, contributions go up by $2.
For every dollar increase in income, contributions go up by $.01.
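A short sketch can show what "holding the other variable constant" means for the equation Y = 8 + 2X1 + .010X2. The person below (age 40, income $50,000) is a hypothetical case invented for illustration, and the function name is an assumption.

```python
def predicted_contribution(age_years, income_dollars):
    """Predicted campaign contribution in dollars, ignoring error."""
    return 8 + 2 * age_years + 0.010 * income_dollars

# Hypothetical person, invented for illustration only
base = predicted_contribution(40, 50_000)

# One more year of age, income held constant
one_year_older = predicted_contribution(41, 50_000)
# One more dollar of income, age held constant
one_dollar_richer = predicted_contribution(40, 50_001)

print(f"baseline prediction: ${base:,.2f}")
print(f"effect of one more year of age: ${one_year_older - base:.2f}")
print(f"effect of one more dollar of income: ${one_dollar_richer - base:.2f}")
```

Each difference equals the matching partial regression coefficient, because only one independent variable changes at a time.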
cannot tell because age and income are measured differently (years versus dollars)
Beta Weights
Returning to age and income as predictors of campaign contributions, the computer gives us these beta weights
Age = .15
Income = .45
Which is the stronger of the two?
Income has the higher beta weight and is therefore the stronger of the two
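The slides do not show how the computer arrives at beta weights. A common definition rescales each partial regression coefficient by the ratio of standard deviations, which puts years and dollars on the same footing. The sketch below assumes that definition, and the three standard deviations are hypothetical values chosen only so that the results line up with the .15 and .45 reported above.

```python
def beta_weight(b, sd_x, sd_y):
    """Standardized coefficient: b rescaled by the ratio of standard deviations."""
    return b * sd_x / sd_y

# Partial regression coefficients from Y = 8 + 2X1 + .010X2
B_AGE = 2
B_INCOME = 0.010

# Hypothetical standard deviations, chosen for illustration only
SD_AGE = 15            # years
SD_INCOME = 9_000      # dollars
SD_CONTRIBUTION = 200  # dollars

beta_age = beta_weight(B_AGE, SD_AGE, SD_CONTRIBUTION)
beta_income = beta_weight(B_INCOME, SD_INCOME, SD_CONTRIBUTION)

print(f"beta weight for age: {beta_age:.2f}")
print(f"beta weight for income: {beta_income:.2f}")
stronger = "income" if abs(beta_income) > abs(beta_age) else "age"
print(f"stronger predictor: {stronger}")
```

Although age has the larger raw coefficient (2 versus .010), income comes out as the stronger predictor once both are expressed in standard deviation units.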
Takeaway Lesson
When reading research results about relationships, my best advice is to exercise healthy skepticism and ask the tough questions before asserting or believing that research results are irrefutable facts merely because of sophisticated mathematics.
Takeaway Lesson
Knowing how difficult it is to demonstrate causality or program impacts, be mindful when people present research asserting they have found a cause-effect relationship. Be especially cautious when people claim they have found a single cause for a complex phenomenon, even when they use advanced statistical techniques.
Takeaway Lesson
At the same time, be cautious in believing variables are not connected or that programs do not have an impact based on data from one study.
More
difficult to measure?
Do the proxy measures they use make sense? Do they state all of their assumptions in constructing their measures used in their calculations?
Is the analysis appropriate to the situation? Do they provide measures of association and are
rival explanations?
Even with fancy statistics, the basic principles of good research design still must be met, especially when attempting to answer cause-effect questions
I might show a high r-square between stock market activity and sunspot activity, but I still need a good theory to explain why they are connected
be so technical that it is necessary to bring in experts to make sense of complex and confusing research results.
No one expects you to know it all from one required research methods course or remember it 10 years later
My point: remember that it really is OK to bring in the experts to make sense of research that focuses on issues that matter.
Creative Commons
This powerpoint is meant to be used and shared with attribution
Please provide feedback
If you make changes, please share freely and send me a copy of changes:
Johnsong62@gmail.com
information