
Data Analysis (1/3)

Typical steps for a Statistical study


How this lecture fits in, up to this point:

1. Define the goals
2. Collect the data
3. Organize the data
4. Present the data
5. Describe the data
6. Analyze the data
7. Interpret results

Is there a correlation/association/relationship/interaction
between the variables?

Dependence:
o Relationship exists between variables
o Change in one variable is accompanied by change in the other variable (these two things seem to happen at the same time).

Independence:
o No relationship exists between variables
o Change in one variable is NOT accompanied by change in the other variable.

Positive relationship: As X increases, Y increases
Negative relationship: As X increases, Y decreases
No relationship: As X increases, Y doesn't change

To study a relationship, one variable is called the DEPENDENT variable and the other is called the INDEPENDENT variable.

Which statement makes more sense:
1) The age of a bus can influence the maintenance cost.
2) The maintenance cost can influence the age of the bus.

The dependent variable may be explained/predicted/influenced/understood by the independent variable.

The Y-axis is the DEPENDENT variable (Cost), which is influenced.
The X-axis is the INDEPENDENT variable (Age), which influences the dependent variable.

Data Analysis and Tools

The data and the type of research question you want answered will determine the most appropriate analytical procedure to select.

[Figure: grid of analysis tools by type of Independent Variable (X: Numerical or Categorical) and Dependent Variable (Y: Numerical or Categorical). Chi-Square is the tool highlighted for this lecture.]

Let's go!

There are two contingency table tests:

(1) Two-way table contingency test (also called test of independence)
(2) One-way table contingency test (also called goodness of fit test)

Two-way table contingency test (also called test of independence)

1. Is there a relationship between the variables?
   Visually (Stacked Bar Graph)
   Mathematically (statistical test)
2. If there is a relationship, how STRONG is the relationship?

Format convention
The independent variable is on the horizontal axis (X).
The dependent variable is on the vertical axis (Y).

Ex. Variables GENDER and VIEW OF LIFE
Which sentence makes more sense?
Does gender have an effect on view of life (Is life exciting, routine, or dull?)
Does view of life (Is life exciting, routine, or dull?) have an effect on gender?

Independent variable: Gender
Dependent variable: View of Life

We will study the differences in outcomes of the dependent variable (view of life) across the independent variable (gender). We will compare the responses of two groups (male and female).

[Figures: a "Bad Presentation" and a "Good Presentation" of the same data.]

Is there a relationship/difference between the variables?

No Relationship
The 100% stacked bar chart (%f) does NOT significantly CHANGE across the categories of the IV.

Relationship
The 100% stacked bar chart (%f) DOES significantly CHANGE across the categories of the IV (at least one category has to change for some relationship to be detected).

[Figure: example stacked bar charts. Dependent variable: Political Party. Independent variable: Area of residence.]

Is there a relationship between the variables?

A sample is taken and organized into a two-way contingency table.

                     House Location
House Style       Urban    Rural    Total
Split-Level          63       49      112
Ranch                15       33       48
Total                78       82      160

Research Question: Is there a difference in house styles (DV) at different locations (IV)? Is there a significant difference?

[Figure: 100% stacked bar chart of house style (Split, Ranch) for Urban and Rural locations.]

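How do we test this claim?

1. Form hypotheses.
   Ho (null hypothesis): No relationship exists (variables are Independent)
   Ha (alternative hypothesis): Relationship exists (variables are Dependent)

2. Calculate the chi-square ("kai square", $\chi^2$) statistic.
   Expected cell count: $E = \dfrac{\text{Row total} \times \text{Column total}}{n}$, where $n$ is the total sample size (the grand total of the table, not the number of cells)
   $\chi^2_{\text{statistic}} = \sum_{\text{each cell}} \dfrac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

3. Find the $\chi^2$ significant (critical) value in the table.
   $\chi^2_{\text{significant}} = \chi^2_{\alpha,\,df}$
   $\alpha = .05$ (the default value in most statistical programs); we will discuss this in detail soon!
   $df = (\#\text{rows} - 1)(\#\text{columns} - 1)$

4. Compare the $\chi^2$ statistic to the $\chi^2$ significant value.
   $\chi^2_{\text{statistic}} > \chi^2_{\text{significant}}$: REJECT Ho. Accept Ha.
   $\chi^2_{\text{statistic}} < \chi^2_{\text{significant}}$: FAIL TO REJECT Ho.

Remember to state your result in the context of the specific problem!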

Find the $\chi^2$ significant value in the table

$df = (\#\text{rows} - 1) \times (\#\text{columns} - 1)$

Solution

Step 1: Form Hypotheses
Ho (null hypothesis): No relationship exists (variables are Independent)
Ha (alternative hypothesis): Variables are related (Dependent)

Step 2: Calculate the chi-squared ($\chi^2$) statistic
The chi-square statistic compares the observed count in each cell of the table to the count that would be expected under the assumption of no association between the variables.

Expected count for each cell: $E = \dfrac{\text{Row total} \times \text{Column total}}{n}$, where $n$ is the total sample size. For example, for Split/Urban: $E = \dfrac{112 \times 78}{160} = 54.6 \approx 55$

Observed
                 Location
Style        Urban    Rural    Total
Split           63       49      112
Ranch           15       33       48
Total           78       82      160

Expected (rounded to whole numbers)
                 Location
Style        Urban    Rural    Total
Split           55       57      112
Ranch           23       25       48
Total           78       82      160

$\chi^2_{\text{statistic}} = \sum_{\text{each cell}} \dfrac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \dfrac{(63-55)^2}{55} + \dfrac{(49-57)^2}{57} + \dfrac{(15-23)^2}{23} + \dfrac{(33-25)^2}{25} = 7.62$

Step 3: Find $\chi^2$ significant
$df = (\#\text{rows} - 1)(\#\text{columns} - 1) = (2 - 1)(2 - 1) = 1$
$\alpha = .05$ (default value)
$\chi^2_{\text{significant}} = \chi^2_{\alpha,\,df} = \chi^2_{.05,\,1} = 3.841$

Step 4: Compare the $\chi^2$ statistic to $\chi^2$ significant
$\chi^2$ statistic (7.62) > $\chi^2$ significant (3.841)
Reject Ho. There is evidence of a relationship between the variables.
Remember to state your result in the context of the specific problem! The style of house differs depending on the location.
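The four steps can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `expected_counts` and `chi_square_statistic` are made up for this example and are not part of the lecture. It computes the statistic twice, once with expected counts rounded to whole numbers (as on the slide) and once with unrounded expected counts.

```python
def expected_counts(observed):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over every cell of the table."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: Split-Level, Ranch. Columns: Urban, Rural.
house = [[63, 49], [15, 33]]

exact = expected_counts(house)                       # unrounded expected counts
rounded = [[round(e) for e in row] for row in exact]  # whole numbers, as on the slide

chi2_rounded = chi_square_statistic(house, rounded)
chi2_exact = chi_square_statistic(house, exact)
df = (len(house) - 1) * (len(house[0]) - 1)

print(rounded)
print(round(chi2_rounded, 2), round(chi2_exact, 2), df)
```

With rounded expected counts the statistic is about 7.63 (reported as 7.62 above); with unrounded expected counts it is about 8.41. Both exceed 3.841, so the conclusion is the same either way.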

Your turn!
The market research group for Albers Brewery of Tucson, AZ, wants to know whether preferences for beer type (light, regular, dark) differ by gender (male, female).
If beer preference is independent of gender, one advertising campaign will be initiated. However, if beer preference depends on the gender of the beer drinker, the firm will tailor its promotions to different target markets.
Are the variables related?
A survey was conducted and the following data were collected:

[Figure: stacked bar graph of beer preference by gender. Male: 25% Light, 50% Regular, 25% Dark. Female: 43% Light, 43% Regular, 14% Dark.]

At the .05 level of significance, is there a statistically significant difference between beer preference for males and females? What about at the .01 level of significance?

Your turn! Solution

Step 1: Form Hypotheses
Ho (null hypothesis): No relationship exists (variables are Independent)
Ha (alternative hypothesis): Variables are related (Dependent)

Step 2: Calculate the chi-squared ($\chi^2$) statistic
Expected count for each cell: $E = \dfrac{\text{Row total} \times \text{Column total}}{n}$, where $n$ is the total sample size

$\chi^2_{\text{statistic}} = \sum_{\text{each cell}} \dfrac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \dfrac{(20-27)^2}{27} + \dfrac{(40-37)^2}{37} + \dfrac{(20-16)^2}{16} + \dfrac{(30-23)^2}{23} + \dfrac{(30-33)^2}{33} + \dfrac{(10-14)^2}{14} = 6.604$

Step 3: Find $\chi^2$ significant
$df = (\#\text{rows} - 1)(\#\text{columns} - 1) = (3 - 1)(2 - 1) = 2$
$\alpha = .05$ (default value): $\chi^2_{.05,\,2} = 5.991$
$\alpha = .01$: $\chi^2_{.01,\,2} = 9.210$

Step 4: Compare the $\chi^2$ statistic to $\chi^2$ significant
REJECT Ho at the .05 level of significance ($\chi^2$ statistic (6.604) > $\chi^2$ significant (5.991)). There is a difference in beer preferences between men and women. A larger share of females than males prefer light beer. Men prefer regular beer over light/dark, and females prefer light/regular over dark beer.
FAIL TO REJECT Ho at the .01 level of significance ($\chi^2$ statistic (6.604) < $\chi^2$ significant (9.210)). There is not enough evidence to reject Ho. Any differences in cell frequencies could be explained by chance.
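The two decisions can be reproduced without a printed table, because for df = 2 (and only for df = 2) the chi-square critical value has the closed form $-2\ln\alpha$. The sketch below relies on that special case; the helper name `critical_value_df2` is hypothetical.

```python
import math


def critical_value_df2(alpha):
    """Chi-square critical value for df = 2 only (closed form: -2 * ln(alpha))."""
    return -2.0 * math.log(alpha)


chi2_statistic = 6.604  # statistic from the beer preference solution

decisions = {}
for alpha in (0.05, 0.01):
    critical = critical_value_df2(alpha)
    if chi2_statistic > critical:
        decisions[alpha] = "REJECT Ho"
    else:
        decisions[alpha] = "FAIL TO REJECT Ho"
    print(alpha, round(critical, 3), decisions[alpha])
```

The same statistic leads to opposite decisions at the two levels, which is why the significance level should be chosen before looking at the result.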

What is Significance Level?

The significance level $\alpha$ is the probability of a Type I error: how often we would wrongly REJECT Ho when Ho is actually true.
The most common value is $\alpha = .05$. This means that, if there really is no relationship, there is a 5% chance that the test rejects Ho anyway. Conversely, there is a 95% chance that it correctly fails to reject Ho.
What should $\alpha$ be? It is subjective. Other common values for $\alpha$ are .01 (99% confident in our results) and .10 (90% sure of our results).

Important wording of conclusion: REJECT vs FAIL TO REJECT

Let's use the legal system as an example. The defendant is on trial for murder. A person is presumed innocent until proven guilty.
Ho = Innocent
Ha = Guilty
We assume Ho to be true. If there is enough evidence to support Ha, then we REJECT Ho and accept Ha.

Verdict: GUILTY
REJECT Ho.
There is enough evidence to convict (guilty).

Verdict: NOT GUILTY
FAIL TO REJECT Ho.
We do not say "Accept Ho". There is NOT enough evidence to convict, but we are not proving innocence.

How much evidence we need is related to how confident we want to be in our results. $\alpha$ (the level of significance) is how often we wrongly reject Ho (also called Type I error).
Small Claims Court for endangerment of a child: Less evidence needed to convict. $\alpha = .05$ means there is a 5% chance of convicting an innocent person. Casey Anthony verdict is Reject Ho (GUILTY).
Jury for 1st degree murder: More evidence needed to convict. $\alpha = .01$ means there is a 1% chance of convicting an innocent person. Casey Anthony verdict is Fail to Reject Ho (NOT GUILTY).

Statistically Significant
The value of $\alpha$ used depends on how confident you want to be in your results.
What is statistically significant to one person might not be to another.

Statisticians
A result has to have at least a 95% chance of being true to be considered worth telling people about (why $\alpha = .05$ is the default in most statistical programs).

Manager
If something has a 90% chance of being true ($\alpha = .1$), it is probably better to act as if it were true rather than false!

Type I and Type II errors

Unfortunately, neither the legal system nor statistical testing is perfect. Remember that Ho = Innocent and Ha = Guilty.

This is Type I error.
The jury finds the defendant GUILTY, but we are WRONG and an innocent person goes to jail!!!
That is, we are wrong to REJECT Ho.
If $\alpha$ = 5%, there is a 5% chance that this error will occur.

This is Type II error.
The jury finds the defendant NOT GUILTY, but we are WRONG and a guilty person is set free (ex. OJ Simpson, Casey Anthony)!!!
That is, we are wrong to FAIL TO REJECT Ho.
If $\beta$ = 5%, there is a 5% chance that this error will occur.

Which is worse? How you feel about it depends on the level of $\alpha$ that you choose. $\alpha$ and $\beta$ have an Inverse Relationship. If we increase $\alpha$, we decrease $\beta$, and vice versa!

Maybe this will help you remember

Type I and Type II errors

We will learn how to calculate Type II error in another lecture
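One way to see what a 5% Type I error rate means is to simulate many samples in which Ho is true by construction and count how often the test rejects anyway. The sketch below does this for 2x2 tables. The sample size, the 70/30 and 50/50 category probabilities, the seed, and the function names are all arbitrary choices for illustration, not values from the lecture.

```python
import random


def chi2_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total


def type_one_error_rate(trials=2000, n=160, critical=3.841, seed=1):
    """Share of simulated samples that reject Ho even though Ho is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        counts = [0, 0, 0, 0]
        for _ in range(n):
            # Row and column are drawn independently, so Ho is true by design.
            row = 0 if rng.random() < 0.7 else 1  # hypothetical 70/30 split
            col = 0 if rng.random() < 0.5 else 1  # hypothetical 50/50 split
            counts[2 * row + col] += 1
        if chi2_2x2(*counts) > critical:
            rejections += 1
    return rejections / trials


rate = type_one_error_rate()
print(rate)
```

The printed rate should land near .05: roughly one sample in twenty rejects Ho purely by chance when the critical value for $\alpha = .05$ is used.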

Chi-Square ($\chi^2$) Distribution

The chi-square distribution is defined by the degrees of freedom (df).
$df = (\#\text{outcomes of row variable} - 1)(\#\text{outcomes of column variable} - 1)$

As the number of possible outcomes of a variable increases (and df with it), the curve approaches a normal distribution.

Compare the $\chi^2$ statistic to $\chi^2$ significant, or the p-value to $\alpha$

REJECT Ho
statistic > significant
p-value < $\alpha$

FAIL TO REJECT Ho
statistic < significant
p-value > $\alpha$

The CHISQ.TEST Excel function returns the p-value.
Since .04 (p-value) < .05 ($\alpha$), Reject Ho
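For readers without Excel, the p-value can also be computed directly. The sketch below uses the closed-form upper-tail probability of the chi-square distribution for whole-number df, built only from Python's `math` module. The function name `chi_square_p_value` is made up for this example.

```python
import math


def chi_square_p_value(statistic, df):
    """Upper-tail probability P(X >= statistic) for a chi-square with integer df."""
    if df < 1 or statistic < 0:
        raise ValueError("df must be >= 1 and statistic must be >= 0")
    half = statistic / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        return math.exp(-half) * sum(
            half ** i / math.factorial(i) for i in range(df // 2)
        )
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{i=1}^{(df-1)/2} (x/2)^(i-1/2) / Gamma(i+1/2)
    tail = math.erfc(math.sqrt(half))
    tail += math.exp(-half) * sum(
        half ** (i - 0.5) / math.gamma(i + 0.5)
        for i in range(1, (df - 1) // 2 + 1)
    )
    return tail


p_house = chi_square_p_value(7.62, 1)   # house style example, df = 1
p_beer = chi_square_p_value(6.604, 2)   # beer preference example, df = 2
print(round(p_house, 4), round(p_beer, 4))
```

Comparing each p-value with $\alpha$ gives the same decisions as comparing the statistic with the table value: the house p-value is below both .05 and .01, while the beer p-value (about .037) is below .05 but above .01.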

Chi-square Assumptions
The sample size is large (the expected frequency of each cell is > 5).

Your turn! Is the assumption satisfied?
Yes! We satisfy the assumption. If we did not, we could not trust the results of this test!

What if Assumptions are not met?

Possibly combine categories to increase the values in each cell!
Here, there are substantially fewer older adults than any other group. We could combine the middle age and older adult categories into a "not young" category. Then we would have a 2x3 cross tab with larger n values.

            Young    Not Young
Music          14           12
News           46           23
Sports          7           12

Fisher's exact test can be used if E(x) < 5, but only for 2x2 tables.
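Fisher's exact test is not worked by hand in this lecture, so the following is only a sketch of the usual two-sided version for a 2x2 table. It sums the hypergeometric probabilities of every table (with the same row and column totals) that is at least as unlikely as the observed one. The function name and the small example table are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Add up all tables that are at least as unlikely as the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical small table: every expected count is 2, so chi-square is not appropriate.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))
```

For this made-up table the p-value is 34/70, about .49, so Ho would not be rejected.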

Effect size is a measure of the strength of a relationship

There are two families of effect sizes (r and d)

r family
Measuring the association (CORRELATION) between the variables.
How much of the change (variance) in one variable can be explained by the other?

d family
Quantifying the size of the difference between two groups.
How BIG is the difference?

[Figure: 100% stacked bar chart of house style (Split, Ranch) for Urban and Rural locations.]

Some important points

1. We are only concerned with effect size if the result of the (chi-square) test was statistically significant.
2. The size of the p-value is no indication of the strength of the association (ex. a small p-value does not imply a strong association).
3. We are covering only the most widely used statistical tools (there are many more, but this is a basic course in statistics and those tools are for another course).

Some common measures in the r family

Phi ("fi", $\phi$): 2x2 tables
Cramér's V: tables that are not 2x2

General guidelines for interpreting strength: a value of .1 is considered a weak (small) effect, .3 a moderate effect, and .5 a strong (large) effect.

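The common measures in the r family: Phi

$\chi^2_{\text{statistic}} = \sum_{\text{each cell}} \dfrac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \dfrac{(63-55)^2}{55} + \dfrac{(49-57)^2}{57} + \dfrac{(15-23)^2}{23} + \dfrac{(33-25)^2}{25} = 7.62$

$\phi = \sqrt{\dfrac{\chi^2}{n}} = \sqrt{\dfrac{7.62}{160}} = 0.22$

The relationship is WEAK!

Squaring phi will give you the variance that can be explained. Whether the house location is urban or rural explains (.22 x .22 = .05) 5% of the variance in the style of house built.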
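As a quick check, phi can be computed two ways: from the chi-square statistic, and directly from the four cell counts of a 2x2 table using $(ad - bc)/\sqrt{\text{product of the four margin totals}}$. The direct form is a standard identity rather than something shown in the lecture, and the function names are illustrative.

```python
import math


def phi_from_chi2(chi2, n):
    """Phi for a 2x2 table: sqrt(chi-square / sample size)."""
    return math.sqrt(chi2 / n)


def phi_from_counts(a, b, c, d):
    """Phi computed directly from the 2x2 table [[a, b], [c, d]]."""
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    return (a * d - b * c) / math.sqrt(margins)


phi = phi_from_chi2(7.62, 160)                # statistic from rounded expected counts
phi_direct = phi_from_counts(63, 49, 15, 33)  # no rounding of expected counts involved

print(round(phi, 2), round(phi_direct, 2), round(phi ** 2, 3))
```

The two values differ slightly (about 0.22 versus 0.23) only because the 7.62 statistic was computed from expected counts rounded to whole numbers. Either way the effect is weak.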
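Cramér's V

$\chi^2_{\text{statistic}} = \sum_{\text{each cell}} \dfrac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \dfrac{(20-27)^2}{27} + \dfrac{(40-37)^2}{37} + \dfrac{(20-16)^2}{16} + \dfrac{(30-23)^2}{23} + \dfrac{(30-33)^2}{33} + \dfrac{(10-14)^2}{14} = 6.604$

$V = \sqrt{\dfrac{\chi^2}{n \cdot \min(r - 1,\ c - 1)}} = \sqrt{\dfrac{6.604}{150 \times 1}} = 0.21$, where $r$ and $c$ are the numbers of rows and columns. For this 3x2 table the smaller of $r - 1 = 2$ and $c - 1 = 1$ is 1, which is not the same as the test's $df = 2$.

The relationship is WEAK!

Squaring Cramér's V will give you the variance that can be explained. The gender of a person explains (.21 x .21 = .044) 4.4% of the variance in beer preference.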
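A short sketch of the calculation, using the standard definition of Cramér's V, in which the chi-square statistic is divided by $n \times \min(\text{rows} - 1, \text{columns} - 1)$. The function name `cramers_v` is hypothetical. The beer table has 3 preference rows, 2 gender columns, and n = 150.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


v = cramers_v(6.604, 150, n_rows=3, n_cols=2)
print(round(v, 2), round(v ** 2, 3))
```

Under this definition V is about 0.21. Dividing by n x df instead (df = 2 for this table) gives about 0.15, so it matters which divisor is used. For a 2x2 table the divisor is n in both cases and V equals phi.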

The d family (amount of difference)

2x2 table
How BIG is the difference?
Measure of effect size: Odds ratio (OR)
[Figure: 100% stacked bar chart of house style for Urban and Rural locations.]

Not 2x2 table
The chi-squared test shows a relationship exists, but where does the relationship (difference) occur?
Measure of effect size: Adjusted standardized residuals
[Figure: stacked bar graph of beer preference. One bar: 25% Light, 50% Regular, 25% Dark. Other bar: 43% Light, 43% Regular, 14% Dark.]

Odds Ratio

              Group 1    Group 2    Total
Outcome 1        a          c        a+c
Outcome 2        b          d        b+d
Total           a+b        c+d     a+b+c+d

                     House Location
House Style       Urban    Rural    Total
Split-Level          63       49      112
Ranch                15       33       48
Total                78       82      160

$OR = \dfrac{ad}{bc} = \dfrac{63 \times 33}{15 \times 49} = \dfrac{2079}{735} = 2.83$

The odds of outcome 1 in group 1 are OR times the odds in group 2 (higher if OR > 1; lower if OR < 1).

The odds of a split-level house style in urban locations are 2.83 times the odds in rural locations.

There is no universal agreement regarding what constitutes a strong or weak association: OR > 2.0 is moderately strong; OR > 5.0 is strong.
Weak associations are more likely to be explained by undetected biases or confounders.
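The odds ratio can be checked in two equivalent ways: with the cross-product ad/bc, or by dividing the odds in one group by the odds in the other. A minimal sketch, with an illustrative function name:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio ad / bc (a, c = outcome 1 and b, d = outcome 2 in groups 1 and 2)."""
    return (a * d) / (b * c)


# Urban: 63 split-level, 15 ranch. Rural: 49 split-level, 33 ranch.
house_or = odds_ratio(a=63, b=15, c=49, d=33)

odds_urban = 63 / 15  # odds of split-level vs ranch in urban locations
odds_rural = 49 / 33  # odds of split-level vs ranch in rural locations

print(round(house_or, 2), round(odds_urban / odds_rural, 2))
```

Both routes give 2.83, which shows that the OR compares odds (split-level versus ranch), not the percentages of split-level houses.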

OR used to COMPARE STUDIES

[Figure: forest plot titled "Oral Contraceptive Use and Ovarian Cancer", showing the odds ratio from 21 studies published between 1977 and 1991, grouped as Hospital-Based Case-Control, Community-Based Case-Control, and Cohort. Horizontal axis: Odds Ratio, from 0.0 to 3.5, with regions labeled -ve Association and +ve Association.]

Hankinson SE et al. Obstet Gynecol. 1991;80:708-714.
www.contraceptiononline.org

Measure of effect sizes for medical studies

Question of interest: Is smoking related to cancer?

              Group 1    Group 2    Total
Outcome 1        a          c        a+c
Outcome 2        b          d        b+d
Total           a+b        c+d     a+b+c+d

Here Group 1 = smokers, Group 2 = non-smokers, and Outcome 1 = cancer, with a = 30, b = 70, c = 10, d = 90.

Odds ratio (OR)
$OR = \dfrac{ad}{bc} = \dfrac{30 \times 90}{70 \times 10} = 3.9$
People who smoke have odds of developing cancer that are 3.9 times the odds of those who don't smoke.

Relative Risk (RR)
$RR = \dfrac{a / (a + b)}{c / (c + d)} = \dfrac{30 / 100}{10 / 100} = 3$
Those who smoke are 3 times as likely to develop lung cancer as those who don't smoke.

Relative risk improvement/reduction (RRR)
$RRR = \dfrac{a / (a + b) - c / (c + d)}{c / (c + d)} = \dfrac{(30 / 100) - (10 / 100)}{10 / 100} = 2$
There is a 200% increase in lung cancer cases for smokers compared to those who don't smoke.

Absolute risk improvement/reduction (ARR)
$ARR = \dfrac{a}{a + b} - \dfrac{c}{c + d} = (30 / 100) - (10 / 100) = .2$
Out of every 10 people who smoke, on average 2 more will get cancer than out of every 10 people who don't smoke.

Number needed to treat/harm (NNT/NNH)
$NNT = 1 / ARR = 1 / .2 = 5$
For every 5 people who smoke, we expect to see one additional case of cancer develop.
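All five measures come from the same four cell counts, so they can be computed together. The sketch below uses the smoking example (30 of 100 smokers and 10 of 100 non-smokers developed cancer). The function name and dictionary keys are illustrative.

```python
def risk_measures(a, b, c, d):
    """Effect sizes for a 2x2 table: group 1 has a cases and b non-cases, group 2 has c and d."""
    risk_1 = a / (a + b)    # risk in group 1 (smokers)
    risk_2 = c / (c + d)    # risk in group 2 (non-smokers)
    arr = risk_1 - risk_2   # absolute risk difference
    return {
        "OR": (a * d) / (b * c),
        "RR": risk_1 / risk_2,
        "RRR": arr / risk_2,
        "ARR": arr,
        "NNT": 1 / arr,
    }


smoking = risk_measures(a=30, b=70, c=10, d=90)
for name, value in smoking.items():
    print(name, round(value, 2))
```

Note that RR = 3 and RRR = 2 describe the same data: a risk that is 3 times as large is a 200% increase over the baseline risk.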

ABSOLUTE vs RELATIVE Risk

Remember our calculation for the smoking example:
Those that don't smoke and got cancer: 10/100
Those that smoke and got cancer: 30/100

Smokers are 3 times (X 3) as likely to develop lung cancer as those that don't smoke (RR).
There is a 200% increase in lung cancer cases for smokers compared to those that don't smoke (RRR).
Out of every 10 people that smoke, on average 2 more will get cancer than out of 10 people that don't smoke, a difference of 20 percentage points (ARI).

There are different ways of describing the same risk, which can profoundly affect how we perceive it. Ultimately, when deciding whether to take a treatment, ideally you should decide with your doctor if the reduction in the ACTUAL (absolute) risk outweighs the risks, side-effects and costs of treatment.

Group A    Group B    RRI     ARI
10%        30%        200%    20%
1%         3%         200%    2%
.1%        .3%        200%    .2%

The RRR sounds better for marketing purposes!...

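The d family for not 2x2 tables

To determine which of the categories are major contributors to the statistical significance, the adjusted standardized residual is computed for each cell:

$\text{adjusted standardized residual} = \dfrac{O - E}{\sqrt{E \left(1 - \dfrac{n_{\text{row}}}{n_{\text{total}}}\right)\left(1 - \dfrac{n_{\text{column}}}{n_{\text{total}}}\right)}}$

For example, for males who chose Light beer (O = 20, row total 50, column total 80, n = 150):

$\dfrac{20 - 26.67}{\sqrt{26.67 \left(1 - \dfrac{50}{150}\right)\left(1 - \dfrac{80}{150}\right)}} = -2.3$

(using the unrounded expected count 26.67, which rounds to 27)

Adjusted standardized residuals
             male    female
Light        -2.3      2.3
Regular       0.9     -0.9
Dark          1.6     -1.6

A statistically significant relationship was found (the chi-square test rejected Ho). SPECIFICALLY, there is a statistical difference in male and female responses for those that chose Light beer (look for values that are above +2 or below -2).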
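The residual for every cell can be computed with one small function. In this sketch the beer counts are read from the observed values used in the chi-square calculation (males: 20 Light, 40 Regular, 20 Dark; females: 30, 30, 10), and the function name `adjusted_residuals` is made up for illustration. Expected counts are left unrounded.

```python
import math


def adjusted_residuals(observed):
    """Adjusted standardized residual for every cell of a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(observed):
        out_row = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count for this cell
            spread = math.sqrt(
                e * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            out_row.append((o - e) / spread)
        result.append(out_row)
    return result


# Rows: Light, Regular, Dark. Columns: male, female.
beer = [[20, 30], [40, 30], [20, 10]]
residuals = adjusted_residuals(beer)
for label, row in zip(("Light", "Regular", "Dark"), residuals):
    print(label, [round(r, 1) for r in row])
```

With only two columns, the two residuals in each row are equal in size and opposite in sign, so only the Light row (about -2.3 and +2.3) passes the +2 / -2 guideline.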

There are two contingency table tests:

(1) Two-way table contingency test (also called test of independence)
(2) One-way table contingency test (also called goodness of fit test)

1. Is there a relationship between the variable and a specific distribution?
   Visually (Bar Graph)
   Mathematically (statistical test)
2. What is the effect size?

One-way contingency table test is called a Goodness of Fit test

A hypothesis is a belief about the results of a statistical study.
The Goodness of Fit test checks whether the outcomes of a variable follow a hypothesized distribution (or, put another way, the relationship of the variable to a specific distribution).

Example hypotheses:
The Republican party is significantly larger than any other.
There is an equal number of individuals in each party.
The Democrat party is significantly larger than any other.

Example
Is it safer to fly in the front, middle, or back of the airplane?
Matt McCormick, a survival expert for the National Transportation Safety Board, told Travel Magazine that "There is no one safe place to sit."

In an effort to test this claim, United Airlines recorded the seat position for 87 fatalities.

Collected raw data must be organized into a frequency table.

Seat        f
Back       23
Middle     35
Front      29
Total      87

United Airlines compared these actual results to hypothesized results, which is the belief that fatality is the same whether one sits in the front, back, or middle of an airplane.

87/3 = 29, so the expected count is 29 for Front, 29 for Middle, and 29 for Back.
This is a uniform distribution!

How do we test this claim?

1. Form hypotheses.
   Ho (null hypothesis): The data are consistent with a specified distribution.
   Ha (alternative hypothesis): The data are NOT consistent with a specified distribution.

2. Calculate the chi-squared ($\chi^2$) statistic.
   $\chi^2 = \sum \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}}$

3. Find $\chi^2$ significant in the table.
   $\chi^2_{\alpha,\,df}$
   $\alpha = .05$ (default value)
   $df = \#\text{outcomes} - 1$

4. Compare the $\chi^2$ statistic to $\chi^2$ significant.
   $\chi^2$ statistic > $\chi^2$ significant: Reject Ho, Accept Ha.
   $\chi^2$ statistic < $\chi^2$ significant: Fail to Reject Ho.

Solution

Step 1: Form Hypotheses
Ho (null hypothesis): The data are consistent with a specified distribution. There is no one safe place to sit.
Ha (alternative hypothesis): The data are NOT consistent with a specified distribution. There is a safe place to sit.

Step 2: Calculate the chi-squared ($\chi^2$) statistic

Outcome    f (observed)    Hypothesized distribution f (expected)
Back            23                      29
Middle          35                      29
Front           29                      29
Total           87                      87

$\chi^2 = \sum \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}} = \dfrac{(23-29)^2}{29} + \dfrac{(35-29)^2}{29} + \dfrac{(29-29)^2}{29} = 1.24 + 1.24 + 0 = 2.48$

Step 3: Find $\chi^2$ significant
$df = \#\text{outcomes} - 1 = 3 - 1 = 2$
$\alpha = .05$ (default value)
$\chi^2_{\alpha,\,df} = \chi^2_{.05,\,2} = 5.991$

Step 4: Compare the $\chi^2$ statistic to $\chi^2$ significant
2.48 < 5.991
$\chi^2$ statistic < $\chi^2$ significant
Fail to Reject Ho.
There is not enough evidence to refute the claim that "there is no one safe place to sit"!
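The goodness-of-fit calculation is the same sum as before, applied to a single row of counts. A minimal sketch for the airplane example, assuming the uniform hypothesized distribution described above; the function name is illustrative.

```python
def goodness_of_fit_statistic(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


seats = {"Back": 23, "Middle": 35, "Front": 29}
observed = list(seats.values())
n = sum(observed)
expected = [n / len(observed)] * len(observed)  # uniform: 87 / 3 = 29 each

chi2 = goodness_of_fit_statistic(observed, expected)
df = len(observed) - 1
critical = 5.991  # table value for alpha = .05, df = 2

decision = "Reject Ho" if chi2 > critical else "Fail to Reject Ho"
print(round(chi2, 2), df, decision)
```

A non-uniform hypothesis works the same way: only the `expected` list changes, as long as it still sums to n.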

Your turn! Goodness of Fit

Is Sudden Infant Death Syndrome (SIDS) Seasonal? Data from King County, Washington regarding the number of deaths from SIDS for each season:

Season      f
Winter     78
Spring     71
Summer     87
Fall       86
Total     322

Step 1: Form Hypotheses
Ho (null hypothesis): Data follows the hypothesized distribution (uniform: SIDS deaths for all seasons are the same).
Ha (alternative hypothesis): Data doesn't follow the hypothesized distribution.

Step 2: Calculate the chi-squared ($\chi^2$) statistic

Season    fo     fe
Winter    78     322/4 = 80.5
Spring    71     80.5
Summer    87     80.5
Fall      86     80.5
Total    322     322

$\chi^2_{\text{statistic}} = \dfrac{(78-80.5)^2}{80.5} + \dfrac{(71-80.5)^2}{80.5} + \dfrac{(87-80.5)^2}{80.5} + \dfrac{(86-80.5)^2}{80.5} = 2.10$

Step 3: Find $\chi^2$ significant
$df = \#\text{outcomes} - 1 = 4 - 1 = 3$
$\alpha = .05$ (default value)
$\chi^2_{\alpha,\,df} = \chi^2_{.05,\,3} = 7.815$

Step 4: Compare the $\chi^2$ statistic to $\chi^2$ significant
$\chi^2$ statistic (2.10) < $\chi^2$ significant (7.815)
Fail to Reject Ho.

Conclusion: Sudden infant death syndrome proportions across seasons are not statistically different from what's expected by chance (i.e. all seasons being equal).

The CHISQ.TEST Excel function output compares the p value to $\alpha$:
statistic (2.10) < significant (7.815)
p value (.55) > $\alpha$ (.05)
Fail to Reject Ho

Goodness of Fit Test Assumptions

Test                                             Assumptions
Exact Binomial test (we didn't learn by hand)    2 outcomes; samples up to n=1000
Chi-square test                                  Large sample: E(x) > 5

Goodness of fit effect size

Just like the test of independence, use Cramér's V (more than 2 outcomes) or Phi (2 outcomes).
Unlike the test of independence, compute the effect size whether you Reject or Fail to Reject Ho.

The interpretation is different from the test of independence!
A value of .1 is considered a close to perfect fit, .3 a moderate effect, and .5 a weak fit.

Goodness of fit effect size

Is Sudden Infant Death Syndrome (SIDS) Seasonal? Data from King County, Washington regarding the number of deaths from SIDS for each season:

Season      f
Winter     78
Spring     71
Summer     87
Fall       86
Total     322

$\chi^2_{\text{statistic}} = \dfrac{(78-80.5)^2}{80.5} + \dfrac{(71-80.5)^2}{80.5} + \dfrac{(87-80.5)^2}{80.5} + \dfrac{(86-80.5)^2}{80.5} = 2.10$

Cramér's V
$V = \sqrt{\dfrac{\chi^2}{n \cdot df}} = \sqrt{\dfrac{2.10}{322 \times 3}} = .05$, with $df = \#\text{outcomes} - 1 = 3$

Interpretation:
A value of 0 indicates that the sample proportions are exactly equal (a perfect fit) to the hypothesized proportions (i.e., O = E). As V increases, the degree of departure from a perfect fit increases.
Since V = .05, there is a small effect, or small departure from fit.
Why are we learning this?

Testing whether a sample of data came from a population with a specific distribution is called a GOODNESS-OF-FIT TEST.

We can use the GOODNESS-OF-FIT TEST to validate the use of a specific distribution for:
SIMULATION
PROBABILITY

Typical steps for a Statistical study

Reminder: How this lecture fits in with everything we have learned so far...

1. Define the goals
   Research Question
2. Collect the data
   Research Designs
3. Organize the data
   Tables for each variable (f, %f, cf, %cf)
   Table for 2 qualitative variables (f contingency table, %f total, %f independent (column) variable)
4. Present the data
   Graphs of each variable (pareto, pie, bar, ogive, histogram, boxplot)
   Graphs for 2 qualitative variables (Stacked & Clustered bar graph)
   Graph for 1 quantitative variable and 1 qualitative variable (Comparative Boxplot)
5. Describe the data
   Statistics and Parameters
6. Analyze the data
   Statistical tests (chi-square, Fisher's Exact)
   Effect size (OR, Adjusted Standardized Residual, Phi, Cramér's V)
7. Interpret results

End of the Lecture!

Remember: If you need help, call me, see me, or email me.
