
Lecture 4

2/8/2016

Last week & this week


Significant figures
Accuracy & precision
Cs & Rs
MMM

Food for Thought: Obvious?


Taking aspirin helps prevent heart
attacks
People are more likely to buy jeans in
a particular month
Men have a lower resting pulse than
women
Listening to Mozart improves IQ
Mothers prefer holding newborns on
the left side

More
Robopolls underestimated the
support for Obama
NJ stood 22nd in releasing toxic gases
Smoking may cause lower IQ in kids
Marijuana impairs the brain
Detecting cheating?

Statistical Sleuths
Numbers vs Data?
1. Source or funding
2. Researcher who had contact with the participants
3. Objects and selection procedure
4. Measurement or question type (street people vs
homeless)
5. Setting or environment (anonymity or timing)
6. Variety of groups or factors of interest
7. Extent or size of any claimed effect
Case study: who is smarter?

The three Cs and the Three Rs


co-occurrence, causation,
counterfactuals
replication, randomization,
representativeness

Co-occurrence and causation


Many questions in scientific research are of the following nature
We want to know, Does A cause B? (causation)
We observe something (A) together with something else (B) (co-occurrence)
Causation usually requires co-occurrence, but co-occurrence does not
automatically imply causation.
Example: Light switch
You walk into a room and flip the switch on the wall. The lights come on. Can you
immediately conclude that A (flipping the switch) caused B (the lights go on)?
No, because you cannot be sure that the lights would not have gone on anyway.
Example: Global warming and atmospheric CO2
Average surface temperatures around the world have apparently been increasing. Levels
of carbon dioxide in the atmosphere have also been increasing. Can we conclude that
increasing CO2 levels cause global warming?
No, because you cannot be sure that global warming would not have taken place anyway.

Meaning of causation
Suppose A and B co-occur. What does it mean for A to have caused B?
A caused B means that B would not have occurred if A had not occurred.
Or, more generally, it means that B would have been less likely to occur if A
had not occurred.
Counterfactual outcomes
Causation, therefore, is really a statement about counterfactual outcomes.
If A and B occur together, the counterfactual outcome is what would have
happened (either B or not B) if A had not occurred.
Counterfactual outcomes cannot be observed. We cannot simultaneously
observe what happens under A and not A.
Yet, using careful experimentation and statistical analysis, we can
sometimes infer or prove beyond a reasonable doubt that A causes B.
How?

The Three Rs
To infer or prove causation in a scientifically defensible manner,
we rely on three Rs
replication: Use multiple subjects or experimental units
randomization: Randomly select some units to receive A, the
others to receive not A
representativeness: Make sure that the units in your
experiment are similar to (representative of) the population to
which you want to generalize
Example: Light switch experiment
Designate ten moments in time. Randomly select five and flip
the switch at those times, but not the others.
Suppose you observe that the light is on at the five moments at
which you flipped the switch, and off at the other five moments.
This is powerful evidence that flipping the switch caused the
light to go on.

Continued
Number of ways to select five units (without replacement) out of
ten = 252. If the lights were going to be on at five of the ten
occasions regardless of what you did, the probability that they
would have been on at exactly the five moments you selected is
1/252 = .00397 = 0.397%
Beyond a reasonable doubt!
The important thing is that we were able to get strong evidence
that A (flip switch) caused B (lights on) by using replication
(multiple moments in time) and randomization (randomly
selecting the moments at which to apply the treatment A).
In order to conclude that A causes B, not just at those ten
moments but in general (all the time), we must also believe that
the ten moments in time were not unusual, i.e., that they are
representative.
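
The count of 252 and the resulting probability can be checked by brute force. The following is a minimal sketch in Python, not part of the original slides; the names `moments` and `all_choices` are illustrative. It enumerates every way of choosing five of the ten moments and compares the count with the closed-form binomial coefficient.

```python
from itertools import combinations
from math import comb

moments = range(10)                            # ten designated moments in time
all_choices = list(combinations(moments, 5))   # every possible set of five flips

n_ways = len(all_choices)                      # number of equally likely selections
p_chance = 1 / n_ways                          # chance of hitting one specific set

print(n_ways, comb(10, 5))                     # the two counts should agree
print(round(p_chance, 5))
```

Enumerating the selections is only practical for small designs; for larger ones `math.comb` gives the count directly.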

Why the three Rs are important


Why replication?
Using multiple subjects or experimental units decreases the probability that
the effects we observe could be a product of mere chance.
1 out of 2: 1/2 = .5
5 out of 10: 1/252 = .00397
10 out of 20: 1/184756 = .00000541
Why randomization?
Using randomization to select the units to receive the alternative treatments
ensures that the units receiving A are, on average, no different from those not
receiving A.
Without randomization, the two groups could be systematically different even
before the treatments are applied. Then we could not be sure that the different
responses in the two groups were not caused by these prior differences.
Why representativeness?
Representativeness allows you to generalize from the units in your study to a
broader population.
Studies using units that are not representative of the population may be
criticized for lacking external validity.
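
The replication figures on this slide (1/2, 1/252, 1/184756) all follow the same pattern: one specific half of the units out of all equally likely halves. A short sketch, assuming a hypothetical helper named `chance_probability`, shows how quickly that probability shrinks as units are added.

```python
from math import comb

def chance_probability(n_units: int) -> float:
    """Probability that random selection picks one specific half of n_units."""
    return 1 / comb(n_units, n_units // 2)

for n_units in (2, 10, 20):
    print(f"{n_units // 2} out of {n_units}: {chance_probability(n_units):.3g}")
```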

Example
Global warming and atmospheric CO2
Should we conclude that increasing CO2 levels
have caused global warming?
From the standpoint of the three Rs, this will
be very difficult to prove or disprove, because it is
based on a sample of just one unit (our world).
Nevertheless, it is an important question!

Statistical Studies
A statistical study is an exercise in
which we collect, summarize and
interpret data. There are two major
classes of statistical studies.
randomized experiments
observational studies

Randomized experiments
In a randomized experiment, subjects or units are
randomly assigned by the researchers to receive
different treatments or interventions
The goal of a randomized experiment is to estimate
the causal effect of the treatment on one or
more outcomes.
Example: A clinical trial to compare the
effectiveness of two types of chemotherapy.
Outcomes of interest may include five-year
survival, side effects, quality-of-life measures, etc.

Observational studies
In an observational study, researchers merely
observe the characteristics of subjects or units.
No treatment or intervention is applied.
The purpose of an observational study is to
describe the characteristics of a
population, to detect patterns or
relationships, etc.
Example: A pre-election survey to estimate the
attitudes of likely voters and to predict their
voting behavior

Some comments
1. Researchers often try to infer causality from
observational studies. This can be misleading.
2. In some cases, a randomized experiment is
unethical or impossible. Inferring causality
from an observational study may be

How to conduct a statistical study?
The three concepts (representativeness, sample size
and type of study) are certainly important, but there
are many other issues as well.
The decision about what type of study to conduct
usually comes first.
The type of study usually determines what kind of
sample you can use
- With an observational study, you may be able to draw a
truly random (and therefore representative) sample
from a population
- With a randomized experiment, the subjects or units
available for your research may not be representative

Case studies
Does aspirin prevent heart attacks? 22,071 male physicians age 40-84
randomly assigned to take daily aspirin or placebo
Placebo group experienced 17.13 heart attacks per 1,000
Aspirin group experienced 9.42 heart attacks per 1,000
Convincing evidence of a causal effect among these subjects! But how well does this
result generalize to the broader population? (Almost any study can be criticized either
for a lack of randomization or for questionable representativeness.)
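
To make the size of the aspirin effect concrete, the two rates above can be turned into an absolute and a relative comparison. This is a small sketch; "risk difference" and "relative risk" are standard summaries that the slides do not name, so treat the labels as an assumption about how one might report the result.

```python
placebo_rate = 17.13 / 1000   # heart attacks per person, placebo group
aspirin_rate = 9.42 / 1000    # heart attacks per person, aspirin group

risk_difference = placebo_rate - aspirin_rate   # absolute reduction
relative_risk = aspirin_rate / placebo_rate     # aspirin risk relative to placebo

print(f"Absolute reduction: {risk_difference * 1000:.2f} per 1,000")
print(f"Relative risk: {relative_risk:.2f}")
```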

Mothers smoking during pregnancy associated with lower IQ in children


Suggests a causal relationship
Other factors could account for this association
Researchers controlled for other factors (diet, education, age, drug use, parents'
IQ, breastfeeding, etc.)
- There are different ways to control for these factors. One could always imagine
additional factors that were not controlled for
Unfortunately, with an observational study, we simply cannot make causal
conclusions
- One cannot ethically do this randomized experiment on humans
- We may make causal conclusions from a combination of careful analysis of
observational data and

Turning Data into Information: The distribution of the data
The histogram is one way to turn
data (like the numbers above) into
usable information.
The histogram is much easier to
interpret

Simple ways of representing a sample
Numerical ways
- Position by mean / mode / median
- Spread by variance / SD
Graphical ways
- Histograms
- Dot diagrams etc.

1. Histogram
The most common form of graphical
presentation of a frequency
distribution is the histogram.
Histogram: It is constructed of
adjacent rectangles; the heights of the
rectangles are the class frequencies
and the bases of the rectangles
extend between successive class
boundaries.

Frequency histogram:
Given a sample of data points, we divide the data into
equally-spaced intervals, and count the number of data
points that fall into each interval.
A histogram is a bar chart with the length of each bar
proportional to the number of observations in that interval.
How to select the interval?
A histogram for a sample will be an approximation
of the underlying probability distribution.
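
The counting step behind a frequency histogram can be sketched in a few lines of Python. The helper name `histogram_counts` is hypothetical, and the data are the humidity readings used later in this lecture, with equally spaced intervals of width 10 starting at 10 chosen as an assumption.

```python
def histogram_counts(data, start, width, n_bins):
    """Count the points falling in each interval [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        k = int((x - start) // width)   # index of the interval containing x
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

humidity = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
            17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

counts = histogram_counts(humidity, start=10, width=10, n_bins=5)
for k, c in enumerate(counts):
    low = 10 + 10 * k
    print(f"{low}-{low + 9}: {'#' * c}")   # bar length proportional to the count
```

Changing `width` shows how the choice of interval alters the apparent shape of the distribution.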

Contd
When a histogram is constructed from
a frequency table having classes of
unequal lengths, the height of each
rectangle must be changed to Height
= relative frequency / width.
The area of the rectangle then
represents the relative frequency for
the class and the total area of the
histogram is 1.
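
The rule Height = relative frequency / width can be checked numerically. In the sketch below the class boundaries and frequencies are illustrative assumptions (unequal classes formed by merging some of the sulfur oxide classes used later); the point is only that the rectangle areas sum to 1.

```python
# Illustrative classes of unequal width (boundaries and frequencies are assumed).
boundaries = [5, 9, 13, 21, 25, 33]
frequencies = [3, 10, 39, 17, 11]

total = sum(frequencies)
heights = []
for i, f in enumerate(frequencies):
    width = boundaries[i + 1] - boundaries[i]
    rel_freq = f / total
    heights.append(rel_freq / width)    # Height = relative frequency / width

# Area of each rectangle is height * width, i.e. the relative frequency.
area = sum(h * (boundaries[i + 1] - boundaries[i]) for i, h in enumerate(heights))
print([round(h, 4) for h in heights])
print(round(area, 10))
```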

Graph of cumulative distribution: here we plot
the cumulative frequencies at the class
boundaries instead of the ordinary
frequencies at the class marks. The resulting
points are then connected.

2. Dot Diagram
Visually summarizes individual data
Check for unusual patterns
Easily identifies outliers

Example 1: Data were collected on the
deviations of cutting speed from the target
value set by the controller.
Observed values of (cutting speed - target
speed) are as below:
3 6 2 4 7 4

This diagram visually summarizes the
information that the lathe is generally running
fast.

Example 2

Number of ways to roll a 2, 3, 4, 5, ..., 12
with a pair of dice.

The dot diagram is a very useful plot for displaying a small body of data
- say up to about 20 observations.
This plot allows us to see easily two features of the data; the location, or
the middle, and the scatter or variability.
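
Example 2 can be reproduced as a text-based dot diagram. The sketch below assumes two fair six-sided dice and prints one dot per way of rolling each total; it is an illustration, not the figure from the original slide.

```python
from collections import Counter
from itertools import product

# Count the ways each total can occur with two six-sided dice.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for total in range(2, 13):
    print(f"{total:>2} | {'.' * ways[total]}")   # one dot per way
```

The location (the peak at 7) and the scatter (totals from 2 to 12) are both visible at a glance.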

Example 3
The dot diagram is also very useful for
comparing sets of data.

3. Frequency Distribution
A frequency distribution is a tabular
arrangement of data whereby the data are
grouped into different intervals, and then
the number of observations that belong to
each interval is determined.
Data presented in this manner are
known as grouped data.

Example: 80 measurements of emissions (in tons) of sulfur
oxides from an industrial plant are given below:
15.8 26.4 17.3 11.2 23.9 24.8 18.7 13.9 9.0
13.2 22.7 9.8 6.2 14.7 17.5 26.1 12.8 28.6 17.6
23.7 26.8
22.7 18.0 20.5 11.0 20.9 15.5 19.4 16.7 10.7
19.1 15.2 22.9 26.6 20.4 21.4 19.2 21.6 16.9
19.0 18.5 23.0
24.6 20.1 16.2 18.0 7.7 13.5 23.5 14.5 14.4
29.6 19.4 17.0 20.8 24.3 22.5 24.6 18.4 18.1
8.3 21.9 12.3

Class limits    Frequency
5.0-8.9          3
9.0-12.9        10
13.0-16.9       14
17.0-20.9       25
21.0-24.9       17
25.0-28.9        9
29.0-32.9        2
Total           80

Class limit and width


Lower class limit: The smallest value that
can belong to a given interval
Upper class limit: The largest value that
can belong to the interval.
Class width: The difference between the
upper and lower class boundaries (equivalently,
between successive lower class limits), e.g.
9.0 - 5.0 = 4.0 for the first class above.
When designing the intervals to be used in a
frequency distribution, it is preferable that
the class widths of all intervals be the same.

Variants of frequency distribution


Cumulative frequency distribution: It is
obtained by computing the cumulative
frequency, defined as the total
frequency of all values less than the
upper class limit of a particular interval,
for all intervals.
Relative frequency: The ratio of the
number of observations in the interval
to the total number of observations
Percentage frequency distribution: It is
obtained by multiplying the relative
frequencies of each interval by 100.

Cumulative Frequency
Sometimes data are grouped using categories such as
"less than", "or less", "more than" and "or more"; this is
known as a cumulative distribution.
A cumulative "less than" distribution shows
the total number of observations that are less
than the given values.

Contd
If we convert the data mentioned before
into a cumulative distribution, it becomes:
Class limits    Cumulative frequency
Less than 5      0
Less than 9      3
Less than 13    13
Less than 17    27
Less than 21    52
Less than 25    69
Less than 29    78
Less than 33    80

One can also write "less than 4.95" or "4.9
or less" in place of "less than 5".

Percentage Distribution
Class limits    Frequency    Percentage distribution
(5.0, 9.0)       3            3.75 %
(9.0, 13.0)     10           12.5 %
(13.0, 17.0)    14           17.5 %
(17.0, 21.0)    25           31.25 %
(21.0, 25.0)    17           21.25 %
(25.0, 29.0)     9           11.25 %
(29.0, 33.0)     2            2.5 %
Total           80          100 %
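
Both the cumulative and the percentage distribution follow mechanically from the class frequencies. The sketch below uses the frequencies 3, 10, 14, 25, 17, 9, 2 for the seven classes, where the values 3, 9 and 2 are the ones implied by the percentages 3.75 %, 11.25 % and 2.5 % of 80; the variable names are illustrative.

```python
from itertools import accumulate

upper_bounds = [9, 13, 17, 21, 25, 29, 33]
frequencies = [3, 10, 14, 25, 17, 9, 2]

total = sum(frequencies)
cumulative = list(accumulate(frequencies))              # "less than" counts
percentages = [100 * f / total for f in frequencies]    # percentage distribution

for ub, c, p in zip(upper_bounds, cumulative, percentages):
    print(f"less than {ub:>2}: {c:>2}   ({p:.2f}% in this class)")
```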

4. Stem-and-leaf Display
Smaller sets of data
Does not lose any information
Class, as well as, actual data values are
displayed
Data values are listed to the right of
the classes
Let's take 10 numbers as an example:
12, 13, 21, 27, 33, 34, 35, 37, 40, 40

Contd
Frequency table
Class limits    Frequency
10-19           2
20-29           2
30-39           4
40-49           2

Contd
Stem-and-leaf: each row has a
stem and each digit on a stem
to the right of the vertical line is
a leaf.
The "stem" is the left-hand
column which contains the tens
digits.
The "leaves" are the lists in the
right-hand column, showing all
the ones digits for each of the
tens, twenties, thirties, and
forties.

Example
Suppose that, in a certain year, the humidity readings
(rounded to the nearest percent) in a city are
as follows:
29, 44, 12, 53, 21, 34, 39, 25, 48, 23, 17, 24,
27, 32, 34, 15, 42, 21, 28, 37
By grouping these data we get the following
distribution:
Humidity readings    Frequency
10-19                3
20-29                8
30-39                5
40-49                3
50-59                1

How to avoid loss of information?
Replace frequency with last digit of info!
1 | 2 7 5
2 | 9 1 5 3 4 7 1 8
3 | 4 9 2 4 7
4 | 4 8 2
5 | 3

The table is called a stem-and-leaf display (or simply
a stem-leaf display); each line is a stem and each
digit on a stem to the right of the vertical line is a
leaf.
To the left of the vertical line (1, 2, 3, 4, 5) are the stems.
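
A stem-and-leaf display is easy to build programmatically. The following sketch assumes two-digit values and a hypothetical helper `stem_and_leaf`; it splits each humidity reading into a tens digit (stem) and a ones digit (leaf), keeping the leaves in the order the readings were listed.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem), keeping ones digits (leaves) in input order."""
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    return dict(sorted(rows.items()))

humidity = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
            17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

display = stem_and_leaf(humidity)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

The number of leaves on each stem is the class frequency, so nothing from the frequency table is lost.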

Mean (M)
Most common measure of central
tendency
Characteristics:
Sensitive to all observed values
Highly stable; with larger n, M is
insensitive to subtle changes in
values
Can be highly sensitive to extreme
values (particularly in smaller
samples).

The Mean: Statistical notation
Some basic statistical notation:
X : Score on one variable for one
participant
n : Number of scores in the sample
$\Sigma$ : Sum of a set of scores
M : Mean; sum of scores divided
by n of scores: $M = \Sigma X / n$

The sample mean is the most
important single statistical value,
measuring the location of a sample
What is the common term for the sample mean?
The numerical average of the sample
observations
How is this calculated?
The sum of the observations divided by the
sample size n.
For a set of n observations, $x_1, x_2, \ldots, x_n$, the
sample mean is calculated as $\bar{x} = \left(\sum_{i=1}^{n} x_i\right)/n$

Other ways of finding the mean?
The trimmed mean is calculated by
eliminating the highest and lowest
values in the sample and taking the
mean of the remaining values.
For a 10% trimmed mean, the largest
10% and the smallest 10% are
eliminated.
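
A trimmed mean can be sketched as follows. The function name `trimmed_mean` is hypothetical, and rounding the number of dropped values down to a whole number is an assumption (conventions differ). The data are the ten IQ scores from the worked example later in this lecture.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)       # number dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

iq = [76, 80, 89, 96, 100, 102, 106, 108, 122, 130]
print(trimmed_mean(iq, 0.10))   # drops the lowest (76) and highest (130) score
print(trimmed_mean(iq, 0.0))    # no trimming: the ordinary mean
```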

Variance
Sample variability is critical to statistical
calculations.
Sample variance and standard deviation are the
most important measures of variability.
How to calculate variance?
For a set of n observations $x_1, x_2, \ldots, x_n$, the sample
variance, $s^2$, is calculated as follows: $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1)$
n-1 is called the degrees of freedom associated
with the variance

S
The standard deviation, S, is the square root of
the variance.
What are the units of the standard deviation?
What does it mean if the variance (and thus
the standard deviation) is large?

The range is another measure of sample variability.

Mode: Most frequent score in the distribution
Example: scores = 15, 20, 21, 20, 36, 15, 25, 15, 12
Show scores as a frequency distribution
15 is most common, and is considered the mode.

Characteristics:
can be used for all measurement scales, including categorical
insensitive to extreme values or range of scores
unstable; sensitive to small shifts in the number of cases
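
Showing the scores as a frequency distribution and reading off the mode can be done with the standard library. This is a minimal sketch using the scores from the example above; the variable names are illustrative.

```python
from collections import Counter

scores = [15, 20, 21, 20, 36, 15, 25, 15, 12]
frequency = Counter(scores)                   # score -> number of occurrences
mode, count = frequency.most_common(1)[0]     # most frequent score and its count

for score in sorted(frequency):
    print(score, frequency[score])
print("mode:", mode)
```

Removing a single 15 from the list would leave 15 and 20 tied, which illustrates why the mode is described as unstable.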

Median
The sample median is another measure of
location.
What is the median of a sample?
The observation separating the upper and the lower
halves of the sample.
The middle observation if n is odd. What if n is even?
The average of the two middle observations.
As an example, the median is commonly used for
housing prices. Why use the median rather than
the mean?
Outliers (multi-million dollar houses) skew the mean.

Median & Quartile
Let's take some data:
16 33 34 43 47 48 49 49 50 53 56 59 60 65 67 67 68 71 81 86

First, sort the data in ascending order (already done here)
Next, find the midpoint (sometimes between two
numbers, as in this case)
Median = (53 + 56)/2 = 54.5
Finally, find the medians of the bottom half and the top
half: the first and third quartiles.
1st quartile = 47.5; 3rd quartile = 67
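
The three steps above translate directly into code. The sketch below follows the slide's "median of each half" rule; other quartile conventions exist and can give slightly different values, so treat this as one reasonable choice rather than the only definition.

```python
from statistics import median

data = sorted([16, 33, 34, 43, 47, 48, 49, 49, 50, 53,
               56, 59, 60, 65, 67, 67, 68, 71, 81, 86])

half = len(data) // 2
lower, upper = data[:half], data[-half:]   # the middle value is excluded when n is odd

q2 = median(data)     # median of the whole sample
q1 = median(lower)    # first quartile: median of the bottom half
q3 = median(upper)    # third quartile: median of the top half
print(q1, q2, q3)
```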

Median
Mid-point of a distribution of scores: half are above, half are
below.
List scores in numerical order (interval or ratio scale)
Locate the score in the center of the sample.
First line up the scores: 12,15,15,15,20,20,21,25,36
The middle (5th out of 9) score = 20.
If there are an even number of scores, Median = mean of
the two middle scores
Characteristics:
Sensitive to the range of scores
More stable than the mode
Not sensitive to extreme scores (e.g., changing
highest score (36) to 100 would not change the
median)

Example

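1. Find the mean, variance & S for the following data

Student IQ
X 76
Y 80
Z 89
A 96
S 100
T 102
U 106
V 108
W 122
C 130
Mean = 100.9
Sum of squared deviations from the mean = 2592.9
Variance = 2592.9/10 = 259.29 when dividing by n; the sample variance
with n-1 is 2592.9/9 = 288.1
Standard deviation = square root of variance = $\sqrt{259.29}$ =
16.1 (or $\sqrt{288.1}$ = 16.97 with n-1)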
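
The figures in this example can be checked directly. The sketch below computes the variance both ways, dividing by n and by n-1, since the two denominators give noticeably different answers for a sample of only ten scores; the variable names are illustrative.

```python
from math import sqrt

iq = [76, 80, 89, 96, 100, 102, 106, 108, 122, 130]

n = len(iq)
mean = sum(iq) / n
sum_sq = sum((x - mean) ** 2 for x in iq)   # sum of squared deviations

var_n = sum_sq / n            # divide by n
var_n1 = sum_sq / (n - 1)     # divide by n - 1 (sample variance)

print(round(mean, 1), round(var_n, 2), round(sqrt(var_n), 2))
print(round(var_n1, 2), round(sqrt(var_n1), 2))
```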

World's tallest structures

STRUCTURE                HEIGHT IN METRES
Great Pyramid of Giza    139
Eiffel Tower             300
Empire State             381
Petronas Towers          452
Taipei 101               509
World Trade Centre       526
CN Tower                 553
KVLY-TV Mast             629
Warsaw Radio Mast        646
Burj Khalifa             830

Mean = 496.5 m
Sum of squared deviations from the mean = 337 086.5
Variance = 337 086.5/10 = 33 708.65 m^2 when dividing by n
(37 454.06 m^2 with n-1)
Standard deviation = square root
of variance = $\sqrt{33708.65}$ =
183.60 m (193.53 m with n-1)

M&M?

http://www.todayifoundout.com/index.php/2010/05/what-the-msstand-for-in-mms/

Done for the day!
