1.1
Lecture 1: An Introduction to MATH1041
Course aims
1.2
What is statistics?
1.3
An example
A study was conducted (Johns et al. 1993) using guinea pigs to address
this question. Ten pregnant guinea pigs were injected with nicotine
tartrate, and ten were not. Offspring were then given an intelligence
test, a maze through which they had to pass to find food.
Is there evidence that guinea pigs in the treatment group (those whose
mums were smokers) were slower learners, on average?
1.4
Skills to be developed
In this course, you will learn how to approach designing studies and analysing data to answer research questions like the above. In particular, at the end of this course, you will be able to:
1. Recognise which analysis procedure is appropriate for a given research problem involving one or two variables.
2. Understand principles of study design.
3. Apply probability theory to practical problems.
4. Apply statistical procedures on a computer using R/RStudio.
5. Interpret computer output for a statistical procedure.
6. Calculate confidence intervals and conduct hypothesis tests by hand
for small datasets.
7. Understand the usefulness of Statistics in your professional area.
1.5
1.6
Lecture 2: Graphs
During this lecture, we will meet common graphs used for visualising
data.
Quantitative or categorical?
1.7
Introduction: the role of graphs
Data → Information
1.8
Tools for Making Data Informative
1.9
Quantitative or categorical?
A categorical variable places an individual into one of several categories.
Which of the following variables are quantitative, and which are categorical?
gender
satisfaction with UNSW (from 0 to 10)
time travelling to UNSW
method of travelling to UNSW
1.10
Recommended graphical tools
If you want to summarise one variable:
and it is quantitative: a histogram or boxplot.
and it is categorical: a bar graph (or bar chart).
1.11
What sort of graph would you use to summarise:
gender of MATH1041 students
satisfaction with UNSW (from 0 to 10)
Time travelling to UNSW
method of travelling to UNSW
1.12
What to look for in a graph
the location (where most of the data are) and spread (or variability) of the data;
1.13
Change in location
[Figure: histograms of a variable (values 1–8) illustrating a change in location]
1.14
Spread
[Figure: histograms of a variable (values 0–10) illustrating larger and smaller spread]
1.15
Typical shapes: Symmetric
[Figure: histogram of a symmetric distribution (values 1–9)]
1.16
Typical shapes: Skewed to the left
[Figure: histogram of a left-skewed distribution (values 1–10)]
1.17
Typical shapes: Skewed to the right
[Figure: histogram of a right-skewed distribution (values 1–10)]
1.18
The following histogram depicts the scoring average of players from
the National Basketball Association (NBA) up to the 2008 season.
[Histogram: frequency (0–120) of points per game (2–30)]
1.19
Comment on the location, spread and shape of the histogram.
1.20
Identify the variable(s) involved in the following questions, whether
they are quantitative or categorical, and what sort of graph you would
use to answer the questions:
Does the amount you pay for a haircut depend on your gender?
1.21
Another way to think about it:
variable type:
  one variable: categorical, or quantitative
  two variables: both categorical; one categorical, one quantitative; or both quantitative
1.22
Identify the variable(s) involved in the following question, whether they
are quantitative or categorical, and what sort of graph you would use
to answer the question:
1.23
Some Other Graphical Tools
Time plots (e.g. page 2022 of Moore et al.). Suitable for time
ordered data. Common in financial pages of newspapers.
1.24
Statistics Packages
Graphs (and indeed most statistical procedures) are most easily implemented using a computer, and a statistics package specially developed for data analysis. Common programs used for statistics:
SAS
SPSS (PASW)
Excel
R/RStudio (we'll use both R and RStudio; these were used for most graphs in the lecture notes).
Minitab
S+ (S-PLUS)
1.25
Fancy graphs
1.26
Class Survey Data
During the remainder of the class, we will look at other results in the "getting to know the class" exercise.
1.27
1.28
Lectures 3–4: Numerical summaries
This lecture, we will meet common types of numerical summaries of data: ways of summarising the key properties of data using a few numbers.
Introduction
Five-number summaries
Outlier detection
Linear transformations
1.29
Introduction
From last lecture:
Data → Information
1.30
Tools for Making Data Informative
1.31
Data analysis for one or two variables
variable type:
  one variable: categorical, or quantitative
  two variables: both categorical; one categorical, one quantitative; or both quantitative
useful numbers: (this lecture)
1.32
Relationship to Textbook
Graphical tools:
Section 1.1 Displaying Distributions with Graphs
Numerical summaries:
Section 1.2 Describing Distributions with Numbers
1.33
Types of numerical summary
proportions or percentages
mean or average
median
standard deviation
1.34
Recommended numerical summaries
1.35
Data analysis for one or two variables
variable type and useful numbers:
  one variable, categorical: table of frequencies
  one variable, quantitative: mean and sd
  two variables, both categorical:
  two variables, one categorical, one quantitative:
  two variables, both quantitative:
1.36
Summaries of categorical variables
Consider the data from the class survey last lecture:
1.37
Numerical Summary of Gender
Gender Frequency %
Female 240 60
Male 160 40
1.38
Summaries of quantitative variables
Measures of location
Measures of location tell us how large (or small) the typical value is.
1.39
Satisfaction with UNSW
7.84
1.40
The mean is just another name for what is commonly called the average of a set of numbers.
1.41
Travel Times to UNSW
1.42
Labour cost
1.43
The Mean Can be Heavily Influenced by Outliers
If the entrepreneurs are removed (any value over $1,000), then the mean changes from
$842,655.86
to
$80.52
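The course uses R/RStudio for computation; purely as an illustration, here is a Python sketch with made-up numbers (not the actual survey data) showing how a single extreme value drags the mean while barely moving the median:

```python
from statistics import mean, median

# Hypothetical labour-cost replies: mostly modest values plus one joke answer
costs = [20, 35, 50, 60, 80, 5_000_000]

print(mean(costs))       # dominated by the single outlier
print(mean(costs[:-1]))  # mean after dropping the outlier
print(median(costs))     # the median barely notices the outlier
```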
1.44
The Median: an Alternative to the Mean
Even after removing big outliers, the mean is still not describing the
typical labour cost very well.
1.45
Definition of the median
1.46
Computing the Median
Data: 5, 5, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9
55
6
7777777777
888888888
9999
Answer:
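A quick check of this calculation in Python (the lectures use R/RStudio; this is the same computation):

```python
from statistics import median

# The 26 ordered values displayed above
data = [5, 5, 6] + [7] * 10 + [8] * 9 + [9] * 4

# With n = 26 even, the median averages the 13th and 14th ordered values
print(median(data))
```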
1.47
Often Mean and Median are Close
              x̄     M
UNSW satisf.  7.84   8
1.48
Medians and Boxplots
[Boxplot of UNSW satisfaction ratings (scale roughly 6–9)]
These correspond to the medians of the lower and upper half of the
data; and are called:
Q1 = first quartile
Q3 = third quartile
Again, recall the UNSW satisfaction dataset from a few slides ago:
55
6
7777777777
888888888
9999
Answer:
1.51
Measures of spread
Asthma example
[Side-by-side boxplots of FENO (0–50) for groups A and B]
1.54
Simple Measure of Spread
A simple measure of spread is the height of the box part of the boxplot.
From earlier slides, this is the distance from Q1 to Q3, the interquartile range (IQR).
1.55
IQR for Small Example
55
6
7777777777
888888888
9999
IQR = Q3 − Q1 = 8 − 7 = 1.
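The textbook's quartile rule (Q1 and Q3 as medians of the lower and upper halves of the data) can be sketched in Python; note this is an illustration, and different packages use slightly different quartile conventions:

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (textbook convention)."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]    # values below the median position
    upper = s[-half:]   # values above the median position
    return median(lower), median(upper)

data = [5, 5, 6] + [7] * 10 + [8] * 9 + [9] * 4
q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)
```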
1.56
Standard Deviation: another measure of spread
s = √( ((x1 − x̄)² + (x2 − x̄)² + ... + (xn − x̄)²) / (n − 1) )
1.57
Example
Recall our sample data on satisfaction with UNSW:
55
6
7777777777
888888888
9999
n = 26, with x1 = 5, x2 = 5, x3 = 6, x4 = 7, . . . , x26 = 9.
You can use your calculator to show that for this dataset,
x̄ ≈ 7.46
s = √( ((5 − 7.46)² + (5 − 7.46)² + . . . + (9 − 7.46)²) / 25 ) ≈ 1.067
The standard deviation for this data set is 1.067 (to 3 decimal places).
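The same calculation as a Python sketch (the course itself uses a calculator or R/RStudio; `statistics.stdev` divides by n − 1, matching the formula above):

```python
from statistics import mean, stdev

data = [5, 5, 6] + [7] * 10 + [8] * 9 + [9] * 4  # n = 26

print(round(mean(data), 2))   # sample mean, x-bar
print(round(stdev(data), 3))  # sample standard deviation, s
```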
1.58
Often IQR and s are Similar
IQR s
UNSW satisf. 2 1.3
1.59
The Standard Deviation Can be Heavily Influenced by Outliers
s = $7,632,285.18
But if the outliers (values greater than $1,000) are removed from the sample then we get
s = $141.89
This still seems a little high though... probably because there are still some people who replied with pretty large values.
1.60
IQR is Hardly Affected by Outliers
IQR = $67.50
IQR = $46.25 (after removing the outliers)
1.61
Five-Number Summaries
Textbook advocates the five-number summary:
Min. Q1 M Q3 Max.
where Min. and Max. are the smallest and largest values.
1.62
Class Exercise: Five-Number Summary
18 18
19 19 19 19 19 19
20 20
21 21 21 21
22
23
24 24
25
29
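A Python sketch of the five-number summary for the ages listed above, using the median-of-halves quartile rule (other software may give slightly different quartiles):

```python
from statistics import median

ages = [18, 18] + [19] * 6 + [20, 20] + [21] * 4 + [22, 23, 24, 24, 25, 29]
s = sorted(ages)
half = len(s) // 2

# (Min, Q1, M, Q3, Max)
summary = (min(s), median(s[:half]), median(s), median(s[-half:]), max(s))
print(summary)
```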
1.64
Data analysis for one or two variables
variable type and useful numbers:
  one variable, categorical: table of frequencies
  one variable, quantitative: mean and sd, or 5-number summary
  two variables, both categorical:
  two variables, one categorical, one quantitative:
  two variables, both quantitative:
1.65
Boxplot Terminology
Textbook (Moore et al.) uses the terms boxplot and modified boxplot.
1.66
Outlier Identification
How do we decide if an observation is an outlier?
1.67
Class Exercise: 1.5 × IQR Criterion
Is the observation 29 an outlier?
Min. Q1 M Q3 Max.
18 19 20.5 22.5 29
1.68
Answer:
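The 1.5 × IQR rule, sketched in Python using the five-number summary above:

```python
q1, q3 = 19, 22.5
iqr = q3 - q1

# Observations outside these fences are flagged as suspected outliers
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

print(low_fence, high_fence)
print(29 > high_fence)   # is 29 flagged by this rule?
```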
1.69
Linear Transformations
A linear transformation creates a new variable, xnew, as follows:
xnew = a + b x
where x is some quantitative measurement (e.g. travel time, height, temperature), and a and b are constants.
1.70
Linear Transformations: Change of Units
Suppose we measure travel time in minutes, but want it in hours. Then
xnew = a + b x where a = 0, b = 1/60.
1.71
Other examples
change of units     a        b
mins to hours       0        1/60
hours to mins       0        60
mm to metres        0        0.001
°C to °F            32       1.8
°F to °C            −160/9   5/9
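The temperature rows of the table can be checked in Python; a small sketch verifying that the °C → °F and °F → °C transformations are inverses of each other:

```python
def c_to_f(c):
    return 32 + 1.8 * c            # a = 32, b = 9/5

def f_to_c(f):
    return -160 / 9 + (5 / 9) * f  # a = -160/9, b = 5/9

for c in [-40, 0, 30, 100]:
    print(c, c_to_f(c), f_to_c(c_to_f(c)))  # round-trips back to c
```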
1.72
Linear Transformations Don't Change the Shape of the Data
Example: maximum daily temperatures in Melbourne, in both Celsius and Fahrenheit.
1.73
Histogram of max. Melbourne temperatures
[Two relative-frequency histograms over 0–120: the temperatures in °C and in °F, same shape on different scales]
1.74
How Linear Transformations Affect Central Location
Let x̄ = mean and M = median. If
xnew = a + b x
then
x̄new = a + b x̄ and Mnew = a + b M.
1.75
How Linear Transformations Affect Spread
If xnew = a + b x, then
snew = |b| s and IQRnew = |b| IQR.
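A quick numerical check of these rules in Python (a sketch with arbitrary data):

```python
from statistics import mean, median, stdev

x = [3, 7, 8, 12, 20]
a, b = 32, 1.8                  # e.g. a Celsius-to-Fahrenheit conversion
x_new = [a + b * xi for xi in x]

print(mean(x_new), a + b * mean(x))       # equal: location shifts and scales
print(median(x_new), a + b * median(x))   # equal
print(stdev(x_new), abs(b) * stdev(x))    # equal: only |b| matters for spread
```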
1.76
Class Exercise: Transformation of Numerical Summaries
        Celsius   Fahrenheit
x̄       30
M       25
s       10
IQR     15
Note: F = (9/5) C + 32.
1.77
Additional Exercise 1-1 (based on Exercise 1.45 of 4th Edition)
How much do users pay for Internet service? Here are the monthly
fees (in dollars) paid by a random sample of 50 users of commercial
Internet service providers in August 2000:
8 9 10 10 12 13 14 15 15 15
15 18 18 19 19 20 20 20 20 20
20 20 20 20 20 20 20 20 21 21
21 21 21 22 22 22 22 22 22 22
22 22 23 25 29 30 35 40 40 50
2.1
Lectures 1–2: Transformations and Relationships Between Variables
This lecture, we will learn about the role of transformation in statistics, and introduce some ideas that are useful for studying the relationship between two quantitative variables.
Transformations:
Introduction
Common transformations
Effect of transformation on location and scale
Relationships Between Variables:
Relationships between two variables when (at least) one is cate-
gorical
Relationships between two quantitative variables
Outliers in Scatterplots
Correlation (r)
Cautions about the use of r
2.2
Transformations
Consider xnew, formed as some function of a variable x. Examples:
xnew = x²    xnew = 32 + (9/5) x    xnew = log(x)
In general,
xnew = f(x)
where the function f() tells us how x has been transformed.
2.3
Types of transformation
xnew = x²
xnew = 32 + (9/5) x
xnew = 10 log(x)
2.4
Why transform data?
[Figure: rainfall from seeded clouds (0–3,000) is strongly right-skewed; on the log scale, log(seeded) is much more symmetric]
2.6
Common transformations
When your data take positive values (x > 0) and are right-skewed, these transformations might make them more symmetric:
xnew = √x
xnew = x^(1/4)
xnew = log x
These transformations are all concave down, hence they reduce the length of the right tail. They are in increasing order of strength: that is, for strongly skewed data, xnew = log x is more likely to work than xnew = √x.
They are also monotonically increasing: that is, as x gets larger, xnew gets larger.
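A small Python sketch, with made-up right-skewed data, showing the log transformation pulling in the right tail: before transforming, the mean sits well above the median; afterwards the two are much closer, a sign of improved symmetry:

```python
import math
from statistics import mean, median

skewed = [1, 1, 2, 2, 3, 4, 5, 10, 50, 400]   # long right tail
logged = [math.log10(v) for v in skewed]

print(mean(skewed) - median(skewed))   # large gap: mean dragged to the right
print(mean(logged) - median(logged))   # small gap: far more symmetric
```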
2.8
Log transformation
xnew = loga x
where a is the base; commonly log10 x, loge x, log2 x.
(The base doesn't matter: it only affects the scale, not the shape.)
Examples:
Wealth
Size
Profit
Population
2.10
Example: population of first-world countries
[Histogram: frequency (0–20) against log(Population) (0–6)]
2.11
Situations where we need a different transformation
2.12
Proportions: If data only take values between 0 and 1, try the logit transformation:
xnew = log( x / (1 − x) )
(This stretches data over the whole real line, from −∞ to ∞.)
Right-skewed with zeros: This often happens when data are counts. The problem is that you can't take logs because log 0 is undefined. Try:
xnew = log(x + 1)
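Both transformations in a short Python sketch (illustrative only; the course uses R/RStudio):

```python
import math

def logit(p):
    """Map a proportion in (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

print(logit(0.5))              # 0: the midpoint maps to zero
print(logit(0.9), logit(0.1))  # symmetric about zero

counts = [0, 1, 4, 20]
print([math.log(c + 1) for c in counts])  # log(x + 1) copes with the zeros
```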
2.13
Exercise
IQ
2.14
Survey exercise
Consider the survey data. What transformation (if any) would you
suggest for the following variables, to make them more symmetric?
Hair cut cost
Labour cost
2.15
Effect of transformation on location and scale
Recall that when we do a linear transformation:
xnew = a + b x  ⇒  x̄new = a + b x̄
(and similarly for other measures of location).
xnew = a + b x  ⇒  snew = |b| s
(and similarly for other measures of spread).
2.16
Effect of transformation on location and scale
Well, when we do a non-linear transformation this all falls apart:
xnew = f(x)  ⇒  x̄new = ???
(and similarly for other measures of location).
2.17
There is no rule for how measures of location and shape change under non-linear transformation: they can change in unpredictable ways.
2.18
Exercise
Take logs (base 10) of the data: 10 20 50 35 15.
Does 10^(x̄new) = x̄? Comment.
2.19
Relationships Between Variables
So far, we've talked about how to study one variable.
Now we'll think about how to study the relationship between two variables, in two situations:
When at least one of them is categorical.
When both are quantitative.
2.20
Data analysis for one or two variables
variable type and useful numbers:
  one variable, categorical: table of frequencies
  one variable, quantitative: mean and sd, or 5-number summary
  two variables, both categorical:
  two variables, one categorical, one quantitative:
  two variables, both quantitative:
This lecture: methods for two variables.
2.21
Relationships between two variables
when (at least) one is categorical
To study the relationship between one categorical variable and another
variable, break the data up by categories, summarise the data in each
category using the appropriate one-variable method, and compare!
2.22
Example: gender and hair cut cost
Compare!
2.23
A numerical summary of the relationship between gender and hair cost:
five-number summaries by group
Min. Q1 M Q3 Max.
Females 0 25 40 80 90909
Males 0 15 20 25 260
2.24
Example: study area and gender
...
Compare!
(Or you could do it the other way around, summarising study area for
each gender, with similar results)
2.25
A graphical summary of the relationship between study area and gender: clustered bar chart
[Clustered bar chart: frequency (0–80) by study area, with separate bars for females and males]
2.26
A numerical summary of the relationship between study area and gender: two-way table (of frequencies)
Females Males
Aviation 16 52
Life science 91 52
Other science 54 56
Other 99 74
2.27
Relationships between two
quantitative variables
Relationships between two quantitative variables are best explored
through a scatterplot.
From the scatterplot, we can get a sense for the nature of the relationship between the two variables.
2.28
Nature of Relationship
2.29
Nature of Relationships for Examples
relationship nature
elec. use vs. temp. decreasing; non-linear; reasonably strong
log(elec. use) vs. temp. decreasing; linear; reasonably strong
log(income) vs. age non-linear; mild strength
2.30
Temperature vs Electricity usage
[Scatterplot: electricity usage (kWh per month, 20–100) against temperature (−5 to 25 °C)]
2.31
Temperature vs log(Electricity usage)
[Scatterplot: log(usage) (2.5–4.5) against temperature (−5 to 25 °C)]
2.32
Age vs log(income)
[Scatterplot: log(income) (11–15) against age (20–60 years)]
2.33
Outliers in Scatterplots
Outliers could represent some unexplainable anomalies in the data.
The following plot is votes for Reform Party candidates in the 2000
presidential election versus 1996 for each county in Florida, USA.
2.34
Comparison of votes
[Scatterplot: Buchanan's 2000 votes (500–3,500) against Reform votes in 1996, by county; Palm Beach stands out near 3,500]
2.35
Correlation (r)
What is correlation?
2.36
Solution
2.37
Temperature vs log(Electricity)
[Scatterplot: log(Electricity usage) (2.5–4.5) against temperature (−5 to 25 °C)]
2.38
Formula for the correlation coefficient r
r = (1 / (n − 1)) × Σ [ ((xi − x̄) / sx) × ((yi − ȳ) / sy) ]
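The formula translated directly into Python (a sketch with hypothetical data; the course uses R/RStudio for this):

```python
from statistics import mean, stdev

def correlation(x, y):
    """r as the average product of standardised x and y values."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((xi - xbar) / sx * (yi - ybar) / sy
               for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4]
y = [2, 3, 5, 6]
print(correlation(x, y))  # strong positive linear association
```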
2.39
Properties of r
2.40
[Six example scatterplots with correlations of magnitude 0.96, 0.78, 0.95, 0.74, 0.01 and 0.01, from strong to essentially none]
2.41
Data analysis for one or two variables
variable type and useful numbers:
  one variable, categorical: table of frequencies
  one variable, quantitative: mean and sd, or 5-number summary
  two variables, both categorical: 2-way table of frequencies
  two variables, one categorical, one quantitative: 5-number summary for each group
  two variables, both quantitative: correlation
This lecture
2.42
Cautions about the use of r
In this section we discuss cautionary notes about r:
r is only useful for describing linear relationships; and
r is sensitive to outliers.
2.43
Correlation Does Not Describe Non-linear Relationships
The next slide shows some data from the National Basketball Association (NBA) 2007-08 season on:
mean points per game and age.
2.44
Age vs Average points per game (NBA 07/08)
[Scatterplot: average points per game (6–11) against age (20–34 years); r = 0.046]
2.45
Comments on Previous Slide:
2.46
Correlation is Sensitive to Outliers
[Scatterplot: pIL-13 (1,000–4,000) against endotoxin; r = 0.253]
2.47
Scatterplot of Endotoxin vs pIL-13
[Scatterplot: pIL-13 (1,000–4,000) against endotoxin; r = 0.0088]
2.48
endotoxin = a measure of poisonous content of dust
pIL-13 = a measure of immunological activity
Conclusion:
2.49
2.50
Least-Squares Regression
2.51
Lectures 3–4: Least-Squares Regression
Introduction
2.52
Introduction
Explanatory and response variables
[Scatterplot: log(income) (11–15) against age (20–60 years)]
2.55
What is Regression?
2.56
Explaining Electricity Usage
2.57
Temperature vs log(Electricity usage)
[Scatterplot: log(usage) (2.5–4.5) against temperature (−5 to 25 °C)]
2.58
Least-squares regression
Least-squares is a mathematical method for determining a line of
best fit through the scatterplot points.
2.59
Temperature vs log(Electricity usage)
[Scatterplot: log(usage) against temperature (−5 to 25 °C), with fitted least-squares line]
2.60
How is a least squares regression fitted?
2.61
Mathematical Representation of Regression Line
y = b0 + b1 x
where
2.62
Textbook Notation for Least-Squares Regression
Since y is used for the original data values of the y-axis variable, we use ŷ ("y-hat") for the values predicted by the regression line.
Answer:
2.64
Exercise: Prediction
Consider the results from the Men's Large Hill Ski Jump competition during the 2014 Sochi Olympic Games.
2.65
Men's Large Hill Ski Jump scores, Sochi 2014
[Scatterplot: Final score (250–280) against Round 1 score, with fitted line]
ŷ = 21 + 1.81x
where ŷ is the predicted score in the final round and x is the score in Round 1.
Suppose an athlete scored a 137 in Round 1 but had to drop out of the
Final because of an injury. Predict what the final round score would
have been had they competed.
Answer:
2.66
Data analysis for one or two variables
variable type and useful numbers:
  one variable, categorical: table of frequencies
  one variable, quantitative: mean and sd, or 5-number summary
  two variables, both categorical: 2-way table of frequencies
  two variables, one categorical, one quantitative: 5-number summary for each group
  two variables, both quantitative: correlation or regression
This lecture
2.67
Connection Between Regression and Correlation
Least-squares fits a line to the x and y points with a slope that is intimately connected to the correlation.
2.68
Slope and Correlation are Kind of the Same
2.69
Standardised: Temperature vs log(Electricity usage)
[Scatterplot of the standardised variables, with fitted line ẑy = 0 − 0.90 zx: intercept 0, slope equal to the correlation]
2.70
Least-Squares Line for General Scatterplots
ŷ = b0 + b1 x, with slope
b1 = r (sy / sx), where r = correlation coefficient.
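A Python sketch of the standard least-squares formulas (slope b1 = r sy/sx, and intercept b0 = ȳ − b1 x̄, so the line passes through the point of means), checked on hypothetical data lying exactly on the line y = 3 + 2x:

```python
from statistics import mean, stdev

def corr(x, y):
    xbar, ybar, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - xbar) * (b - ybar)
               for a, b in zip(x, y)) / ((len(x) - 1) * sx * sy)

x = [1, 2, 3, 4]
y = [3 + 2 * xi for xi in x]            # exactly linear: intercept 3, slope 2

b1 = corr(x, y) * stdev(y) / stdev(x)   # slope from the correlation
b0 = mean(y) - b1 * mean(x)             # line passes through (x-bar, y-bar)
print(b0, b1)
```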
2.71
How Regression Differs from Correlation
In regression
x = explanatory variable
y = response variable
Use regression when you want to try and explain or predict one variable
(y) using another (x).
2.72
Measuring the Strength of a Regression: r²
An important quantity throughout regression is the r² value. It measures the strength of the regression.
r² = square of the correlation r.
Note that:
0 ≤ r² ≤ 1.
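One way to see why r² is described as the fraction of variation explained: for a least-squares fit, the variance of the fitted values divided by the variance of y equals r². A Python sketch with hypothetical numbers:

```python
from statistics import mean, variance

x = [1, 2, 3, 4]
y = [2, 3, 5, 6]

# Least-squares fit via the usual sums of squares
xbar, ybar = mean(x), mean(y)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
b1 = sxy / sxx
fitted = [ybar + b1 * (a - xbar) for a in x]

r_squared = variance(fitted) / variance(y)
print(r_squared)   # fraction of the variation in y explained by the line
```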
2.73
Why r2?
2.74
r2 as Percentage Explained
2.75
The next few slides show how r² measures the strength of the regression.
2.76
Temperature vs log(Electricity usage)
[Scatterplot: data and fitted line, log(usage) (2.5–4.5) against temperature (−5 to 25 °C)]
2.77
log(Electricity usage) vs fitted values
[Scatterplot: log(Electricity usage) against the fitted values]
2.78
Comment on Previous Slides
The stronger the regression, the closer the agreement between the observed values y and the fitted values ŷ, and the higher r² is.
2.79
Aside: Using r² to Quantify Non-Linear Relationships
[Scatterplot: points per game (5–11) against age (20–34); r² = 0.002]
2.80
A small r² doesn't necessarily mean there is no relationship.
The r² value discussed so far in this course only assesses linear relationships.
2.81
Bonus regression material: Cautions about regression
Regression is not always appropriate
Residual plots
Lurking variables
Extrapolation
2.82
Regression is not always appropriate
Fitting the least-squares regression line to a set of data is not always
appropriate.
2.83
The Anscombe Examples
2.84
For all 4 data sets, the least-squares regression line is
ŷ = 3 + 0.5x
2.85
[Four scatterplots, panels A–D (x from 5 to 20, y from 4 to 12): the four Anscombe data sets]
2.86
[The four Anscombe scatterplots again, panels A–D (x from 5 to 20, y from 4 to 12)]
2.87
Regression Assumptions
y = b0 + b1x + error.
where error corresponds to random scatter about the line.
2.88
Residual plots
The residuals from a least-squares regression are obtained by subtracting the fitted values (also called predicted values) from the response values:
residual = y − ŷ
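A property worth knowing, and easy to check in Python: for a least-squares line with an intercept, the residuals always sum to zero. A sketch with hypothetical data:

```python
from statistics import mean

x = [1, 2, 3, 4]
y = [2, 3, 5, 6]

# Fit the least-squares line
xbar, ybar = mean(x), mean(y)
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
      / sum((a - xbar) ** 2 for a in x))
b0 = ybar - b1 * xbar

residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
print(residuals)
print(sum(residuals))   # zero (up to rounding) for a least-squares fit
```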
2.89
The residuals are the
length (and direction) of
the arrows
2.90
A residual plot is a scatterplot of the residuals against the explanatory variable x.
2.91
Residual plot
[Residual plot: residuals (−0.6 to 0.4) against temperature (−5 to 25 °C)]
2.92
Interpreting Residual Plots
If the regression line catches the overall pattern of the data, there
should be no pattern in the residuals.
2.93
Janka hardness example
2.94
A linear regression fit:
[Scatterplot: log(hardness) (6–8.2) against density (25–70) for the Janka hardness data, with fitted line]
2.95
The residual plot below has a rough arch-shaped pattern, indicating the linearity assumption is not appropriate.
[Residual plot: residuals (−0.3 to 0.2) against density, showing an arch-shaped pattern]
2.96
A quadratic regression fit is better:
[Scatterplot: log(hardness) (6–8.2) against density (25–70) for the Janka hardness data, with fitted quadratic curve]
2.97
The arch-shaped pattern in the residual plot is gone! Much better fit.
[Residual plot: residuals (−0.25 to 0.2) against density (25–70), with no obvious pattern]
2.98
Outliers and influential observations
Barbara runs a catering business and wants to buy a small fleet of used
Mitsubishi Lancer cars.
She collects 39 ads from the newspapers and arrives at the data set
plotted on the following slide.
Her goal is to predict price based on age so that she has a better idea
of what a fleet should cost.
2.99
Car price/age data
[Scatterplot: price ($0–20,000 AUD) against age (6–15 years) for the 39 cars]
2.100
Attaching a Special Cause to an Outlier
Barbara looks through the newspaper ads and finds that the ad for the
highest price Mitsubishi Lancer was as follows:
None of the other ads promised anywhere near as much as this one.
Answer:
2.101
Car price/age data
[Scatterplot: price ($0–8,000 AUD) against age (6–15 years), after removing the outlier]
2.102
Residual plot of car price/age data
[Residual plot: residuals (−$2,500 to $2,000) against age (6–15 years)]
2.103
When is it Alright to Remove Outliers?
If you don't have a good reason for removing it: try presenting both sets of results (with and without the outlier) or look into robust regression methods.
2.104
Lurking Variables
The next slide shows data on the relationship between steals per game and rebounds per game for the 2007–08 NBA season.
2.106
Rebounds vs Steals per game
[Scatterplot: steals per game (0.5–2.5) against rebounds per game (0–10)]
2.107
The results are surprising:
The relationship between steals per game and rebounds per game is either weak or non-existent. The r² is only 0.056.
The least-squares regression line indicates that steals per game only increases slightly as your number of rebounds per game increases.
A stronger relationship was expected: players who get more rebounds should also get more steals.
Explanation:
There are other variables lurking around, e.g. the position you play.
2.108
Rebounds vs Steals per game
[The same scatterplot, with points marked by position: Centre, Forward, Guard]
2.109
Recall that only 5.6% of variation in steals was explained by rebounds.
Position r2
Center 0.27
Forward 0.34
Guard 0.40
2.110
What to Do with Lurking Variables?
2.111
Extrapolation
2.112
Example: US Farm Population
2.113
U.S. Farm Population
[Time plot: U.S. farm population (2–34 million) against year (1940–1980), declining steadily]
2.114
The regression equation for predicting farm population (ŷ) from year (x) is
ŷ = 1166.93 − 0.587x
2.115
Take-home messages
Remember:
2.116
2.117
Design of experiments
Based on Moore et al.: Introduction and Section 3.1
3.1
Lectures 1–2: Design of experiments
Ways to obtain data
Types of experiments
3.2
Ways to obtain data
From Moore et al.:
If there is a problem with the way we have collected data, this leads
to problems with the analysis and interpretation of the data.
3.3
A strategy for using data in research:
Identify the key research question: the question you want to
discover the answer to.
3.4
There are a few ways to obtain data:
Anecdotal data: haphazardly collected data (such as data from your own experience).
Collect your own data! We will focus on issues that come up when collecting your own data.
3.5
Census vs sample?
When collecting your own data, you can take a census or take a
sample.
Usually, we cannot survey the whole population (or we just don't have the time!), so we take a sample.
3.6
Example
3.7
Observational Study or Experiment
Observational Study: Individuals are observed and variables of interest are measured, but there is no attempt to influence responses.
3.8
Observational studies and association
3.9
There are many possible explanations for an association:
Common response. e.g. Ice-cream sales and heat stroke cases.
3.10
Common response
[Diagram: temperature is a common cause of both variables, producing the association]
3.11
Causation
[Diagram: the position of the moon causes the tides; the cause produces the association]
3.12
Confounding
[Diagram: confounding causes (child diet, habits, other factors) are mixed together, producing the association]
3.13
Why do an experiment?
We can make an intervention (the cause) and see whether or not there
is an effect!
3.14
Example: walking is faster than catching a bus!
The MATH1041 survey reveals that students who walk to UNSW have
shorter travel times, on average, than people who catch a bus.
Does this mean that walking gets you to UNSW faster than catching
a bus?
Answer:
3.15
Principles of experimental design
We will use the following important definitions to describe experimental
designs:
Subjects: individuals on which the experiment is done.
Students are then asked if they intend to purchase the new smartphone.
1. What are the subjects?
2. What factors are considered?
3. What are the levels of each factor?
4. How many treatments are there in total? What are these treatments?
5. What is the response variable?
3.17
Answer:
3.18
We often use a diagram to summarise experimental designs, as below:
3.19
All experiments should employ the following principles:
3.20
Example: is yawning contagious?
http://dsc.discovery.com/tv-shows/mythbusters/videos/is-yawning-contagious-minimyth.htm
3.22
Choice of control group
The control group should differ from other treatments only in the
application of the treatment of interest.
e.g. the placebo effect: patients often feel better when a doctor gives them a treatment, no matter what the treatment is!
3.23
Example: effects of smoking during pregnancy
3.24
Types of experiment
We will meet a few common types of experiment.
3.25
Matched pairs design
We break subjects into pairs (that have similar properties) and apply
each of two treatments to one subject from each pair.
Matched pairs designs can produce much more precise results than
complete randomisation, because we are controlling for variation in
response across pairs.
Common examples:
Identical twin studies: allow us to control for genetics!
Before-after experiments: we take two measurements on each subject, and control for variation across subjects.
3.26
Example: driving skills and mobile phone use
3.28
Randomised block designs
3.29
Example: response to cancer treatment
Suppose there are three treatments for cancer being studied. White blood cell count is used to monitor response to treatment. It is believed that men and women may respond differently to the different treatments.
3.30
Cautions about experiments
Below are some common problems when designing experiments:
She would like to compare the water content of loaves across treatments, to draw conclusions about the effects of oven temperature on water content of bread.
Alex then compares the total biomass of cotton plants in the mite and no-mite treatments, to explore whether there is an effect of mites on cotton plant growth.
3.33
3.34
Sampling designs
and toward statistical inference
Based on Moore et al.: Section 3.2
3.35
Lectures 3–4: Sampling designs and towards inference
Introduction
3.36
Introduction
Consider the following question:
It is not feasible to ask all UNSW students what their travel time is, so we sample UNSW students.
3.37
Some definitions
3.38
The simple random sample (SRS)
The most common type of sample is a simple random sample.
3.39
To create a SRS of size n there are four steps:
1. Create a dataset with all elements of the population in the first
column.
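A simple random sample can be sketched in Python: `random.sample` picks n individuals without replacement so that every individual (and every possible sample of size n) is equally likely. The class roll of student IDs below is made up for illustration:

```python
import random

roll = [f"z{5000000 + i}" for i in range(400)]  # hypothetical class roll
random.seed(1)                                  # for a repeatable illustration

srs = random.sample(roll, 5)                    # a SRS of size n = 5
print(srs)
```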
3.40
Example: a SRS of MATH1041 students
3.41
Properties of simple random samples
3.42
Other sampling designs
We have met simple random samples.
3.43
Multi-stage sample: We sample successively smaller groups from the
population in stages, resulting in a sample that consists of clusters
of individuals.
3.44
Example
Answer:
3.45
Cautions about sample surveys
Here are some common problems to watch out for when sampling:
Undercoverage: Some groups in the population are not included in the sampling process.
e.g. MATH1041 students who enrolled since Monday (they aren't on the roll I just sampled from!).
Non-response: Some individuals can't be contacted / don't respond.
e.g. MATH1041 students who didn't attend this week's lectures!
Response bias: Interviewer technique may bias replies.
Questionnaire design: Wording is important!
e.g. Compare the following two questions:
Would you drink tertiary-treated sewage?
Would you drink recycled water?
3.46
Towards statistical inference
How long do MATH1041 students take to get to UNSW?
The average travel time from our survey was 52 minutes.
But is this the true average travel time of all MATH1041 students?
It's not, but it's an estimate of the average travel time of all MATH1041 students.
3.47
Some definitions:
Parameter: A number which describes some aspect of a population.
3.48
Sampling distributions
If we took a different SRS of 5 MATH1041 students, would we get
the same estimated average travel time?
3.49
The sampling distribution of a statistic is the distribution of values
taken by the statistic in all possible samples of the same size from the
same population.
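The definition can be brought to life by simulation; a Python sketch that draws many samples of size 5 from a hypothetical population of travel times and collects the sample means (the simulated sampling distribution of x̄):

```python
import random
from statistics import mean

random.seed(0)
population = list(range(5, 121))   # hypothetical travel times, in minutes

# 10,000 simple random samples of size 5; record each sample mean
sample_means = [mean(random.sample(population, 5)) for _ in range(10_000)]

print(mean(population))     # the population mean (a parameter)
print(mean(sample_means))   # the sampling distribution centres near it
```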
But:
Most of the time we won't have a population we can sample from in simulations. Without the survey data, we couldn't have done the simulation!
In many cases, such as those we will consider in weeks 7-9, we can use
probability theory to work out the sampling distribution.
3.50
Bias and variability
Two key features of the sampling distribution of a statistic are its bias
and variability:
Bias: concerns the centre of the sampling distribution.
3.51
3.52
How do we reduce bias?
3.53
3.54
3.55
3.56
Probability
Based on Moore et al.: Sections 4.1–4.2
4.1
Lectures 1–2: Probability
Randomness and probability
Probability models
Probability rules
Independence
4.2
Randomness and probability
A phenomenon is random if individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions.
e.g. When we toss a fair coin, the probability of heads and the proba-
bility of tails equal 0.5. That is, P (H) = 0.5 and P (T ) = 0.5.
4.3
Why study probability?
4.4
Probability Models
To describe a random phenomenon using a probability model, we require two components:
1. A description of all possible outcomes that can be observed. This is known as the sample space.
4.5
Important definitions
4.6
Example
Consider the following random phenomena:
1. A coin is tossed.
In each case, write down the sample space of all possible events.
4.7
Probability rules
The following rules follow from the definition of probability:
Rule 1 Any probability is a number between 0 and 1 (inclusive).
That is, for all events A,
0 ≤ P(A) ≤ 1.
4.9
Equally likely outcomes
P(A) = (count of outcomes in A) / k
4.10
Example reporting to the Board
4.11
Binomial coefficients: counting combinations
Looks messy! The best way to compute (n choose k) is using your
calculator. (Sometimes called nCr.)
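If you have software handy instead of a calculator, binomial coefficients are one function call away. A small sketch in Python (requires Python 3.8+ for math.comb):

```python
import math

# math.comb(n, k) is "n choose k" -- the nCr button on a calculator.
print(math.comb(5, 3))    # 10 ways to choose 3 items from 5
print(math.comb(40, 15))  # 40,225,345,056 -- they get big fast!
```

In R, the corresponding function is choose(n, k).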
4.12
Example
Warning: Binomial coefficients get very big (and messy) very quickly,
e.g. (40 choose 15) = 40,225,345,056.
Your calculator starts to struggle calculating (n choose k) when n and
k are in the hundreds.
4.13
Multiple choice quiz
How many ways could they have got three answers correct?
Answer:
4.14
OzLotto
In a game of OzLotto, you choose 7 numbers, out of the integers
from 1 to 45. Seven numbers are drawn at random, and if all are your
numbers, you win the Jackpot.
Answer:
4.15
Independence
An important idea in research is that of independence.
4.16
The idea of independence of events is defined formally in terms of
probability.
Rule 5 Two events A and B are independent if knowing whether one
occurs or not does not change the probability that the other occurs.
4.17
Example reporting to the Board (again!)
Two people out of Adam, Becca, Christina and Damian need to report
to the Board of Directors. They choose these two people at random
(as a simple random sample).
4.18
Example gender bias in promotion
Promoted?
         Yes    No     Total
Male     0.45   0.15   0.60
Female   0.30   0.10   0.40
Total    0.75   0.25   1.00
4.19
Additional Exercise
4.20
Answer:
4.21
4.22
General Probability Rules
Based on Moore et al.: Section 4.5
4.23
Lectures 3–4: General Probability
Rules
We will meet some useful probability rules, and the important idea of
conditional probability.
More addition rules
4.24
More addition rules
Addition rule for many disjoint events:
If A, B and C are disjoint events, then
4.25
Example UNSW student population
The UNSW website reported that at the start of 2004, amongst com-
mencing first year UNSW students, 87% had enrolled full-time, 15%
were international students, and 14% were both full-time and interna-
tional students.
1. What is the proportion of UNSW students that were full-time or
international?
4.26
Conditional probability defined
The probability of an event can change if we know some other event
has occurred.
For example, consider the following table of first year UNSW enrolment
data:
Full-time Part-time
Local 0.73 0.12
International 0.14 0.01
4.27
For two events A and B (where P(A) ≠ 0), the conditional proba-
bility of B given A is
P(B | A) = P(A and B) / P(A)
4.28
Example student enrolments
Full-time Part-time
Local 0.73 0.12
International 0.14 0.01
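Using the enrolment table, conditional probabilities are just ratios of the joint probabilities. A quick check in Python (variable names are mine, for illustration; the course uses R):

```python
# Joint probabilities from the enrolment table.
p_local_ft, p_local_pt = 0.73, 0.12
p_intl_ft, p_intl_pt = 0.14, 0.01

p_ft = p_local_ft + p_intl_ft     # P(full-time) = 0.87

# P(international | full-time) = P(international and full-time) / P(full-time)
p_intl_given_ft = p_intl_ft / p_ft
print(round(p_intl_given_ft, 3))  # about 0.161
```

Knowing a student is full-time raises the probability they are international from 0.15 to about 0.16.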
4.29
You can think about P(B | A) as a branch on a tree diagram: it is
the branch that leaves from the event A and goes to the event B:
4.30
A is the new S
This looks a bit like how you calculate probabilities from equally likely
outcomes (from last lecture):
P(B) = (count of outcomes in B) / (count of outcomes in S)
The essential difference is that we have replaced the sample space S
with A.
P(B | A) = P(A and B) / P(A)
[Diagram: refocus to the part of B inside the conditioning event A,
then rescale by dividing by P(A).]
4.32
Conditional probability and
independence
Recall that A and B are independent if
P (A and B) = P (A) P (B)
4.34
Conditional probability rules
There are three key rules to know about conditional probabilities, but
we only cover two in this course:
Multiplication rule
P (A and B) = P (A)P (B | A)
4.35
The multiplication rule
P (A and B) = P (A) P (B | A)
This rule follows from the definition of conditional probability (by re-
arranging it).
4.36
Example student enrolments
What is the proportion of UNSW students who were both full-time and
international?
4.37
On a tree diagram, the multiplication rule tells us that to find proba-
bilities of two events happening, just multiply the probabilities across
branches.
4.38
The multiplication rule for more than two
events
This rule can also be generalised to more than two events, e.g. for any
events A, B and C,
P (A and B and C) = P (A) P (B|A) P (C|A and B).
Wanna be a Wallaby?
What is the chance that a high school rugby player will get to play for
the Wallabies?
4.39
The Law of Total Probability
4.40
On a tree diagram, the law of total probability tells us that to find
the probability of the event B, we find all the outcomes involving B,
multiply across branches to find the probability of these outcomes, and
sum.
4.41
The law of total probability for multiple outcomes
The law of total probability extends to when there are more than two
possible disjoint events to sum.
4.42
4.43
4.44
Random variables
Based on Moore et al.: Section 4.3
5.1
Lectures 1–2: Random variables
Random variables defined
5.2
Random variables defined
A Random Variable is a variable whose value is a numerical outcome
of a random phenomenon.
Examples:
Number of children in a family.
Height of a person.
We usually use upper-case letters from near the end of the alphabet
to denote random variables, e.g. X or Y . We reserve Z for a special
type of random variable that we will meet next week.
5.3
Discrete random variables
A random variable X is discrete if X has a countable number of
possible values, say x1, x2, x3, x4, . . . .
Value of X:   x1  x2  x3  ...  xk
Probability:  p1  p2  p3  ...  pk
The probabilities must satisfy:
1. 0 ≤ pi ≤ 1 for each i.
2. p1 + p2 + ... + pk = 1.
5.4
Examples of discrete random variables:
Number of children in a family.
5.5
Probability histograms
5.6
Some examples
5.7
Example Rock-paper-scissors
Bill and Ben play rock-paper-scissors. Bill chooses one of rock, paper
and scissors at random. If Bill and Ben pick the same object, they play
again, until someone wins.
Let X be the number of times Bill and Ben play until there is a winner.
1. What is the chance that Bill and Ben pick different objects first
time around? That is, what's P(X = 1)?
2. What is the chance neither Bill nor Ben wins until the 4th time
they play?
5.8
Example Texas hold em
In the variation of poker known as Texas hold em, each player receives
two cards. Here is the probability distribution of the number of aces
you get in a two-card hand:
Number of aces 0 1 2
Probability 0.85 0.14 0.01
1. Verify this is a proper discrete random variable.
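One way to verify the table is to count two-card hands directly. A sketch in Python; the exact probabilities are about 0.8507, 0.1448 and 0.0045, which the table rounds:

```python
import math

hands = math.comb(52, 2)  # 1326 equally likely two-card hands

p0 = math.comb(48, 2) / hands                    # no aces
p1 = math.comb(4, 1) * math.comb(48, 1) / hands  # exactly one ace
p2 = math.comb(4, 2) / hands                     # two aces

# A proper distribution: every probability in [0, 1], summing to 1.
print(round(p0, 4), round(p1, 4), round(p2, 4))  # 0.8507 0.1448 0.0045
print(round(p0 + p1 + p2, 10))                   # 1.0
```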
5.9
Example left-handers
About 12% of Australians are left-handed. Three people are sitting
down at lunch (assume these are a random sample of Australians with
respect to handedness).
3. Find P(X = 2).
4. Find P(X ≤ 2).
This is an important example known as the binomial distribution.
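The calculation follows the binomial formula P(X = k) = C(n, k) p^k (1 − p)^(n − k). A sketch in Python with n = 3 people and p = 0.12:

```python
import math

n, p = 3, 0.12  # three people, each left-handed with probability 0.12

def binom_pmf(k):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

print(round(binom_pmf(2), 4))                          # P(X = 2)  -> 0.038
print(round(sum(binom_pmf(k) for k in [0, 1, 2]), 4))  # P(X <= 2) -> 0.9983
```

P(X ≤ 2) is most easily found as 1 − P(X = 3) = 1 − 0.12³.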
5.10
The Binomial Distribution
A special type of discrete random variable which is very useful in
practice is the binomial distribution. It is used as a model for counts
in a table of frequencies for a categorical variable that takes only two
values.
The key assumptions are that the trials we are counting need to be
independent and to have the same probability of success.
(As it turns out, this can be guaranteed by random sampling!)
5.12
Exercise: recognising binomial situations
3. A couple continues to have children until the first girl is born. Let
X be the total number of children.
5.13
Answer:
5.14
Exercise: Calculating binomial probabilities
5.15
A test consists of five True/False questions. A student decides to
answer the questions randomly True or False, without even reading
the question. Let X be the number of correct answers the students
gets out of 5.
5.16
Continuous random variables
A random variable X is continuous if it takes all values in some
interval of real numbers.
This means that a continuous random variable can take infinitely many
values: any value at all in the interval on which it is defined.
Discrete or continuous?
Consider the following random variables:
Number of children in a family.
Height of a person.
5.18
5.19
The density curve as a smoothed-out histogram.
5.20
Density curves are mathematical functions which can be used to
describe the probabilistic behaviour of measurements of interest.
5.21
Some example density curves:
birthweight survival time after heart transplant
5.22
Why all this talk about density curves?
Density curves are important, because if you know the density curve
of a variable, you can calculate any probability you want about this
variable!
e.g. If you know the density curve of the life expectancy of patients
using beta blockers (heart medication), then you could calculate the
proportion of patients using beta blockers who will live for another 40
years.
5.23
How do you calculate probabilities using a density curve?
The area under the density curve in the range of values you are
interested in tells you the probability of observing a value in that range.
5.24
Example uniform random numbers
2. Label the y-axis of your plot as appropriate (recalling that the area
under the whole density curve must be 1).
3. Find P(X = 1/4).
4. Find P(0 ≤ X ≤ 1/2).
5. Find P(1/4 ≤ X ≤ 1/2).
5.25
5.26
Means and variances
of random variables
Based on Moore et al.: Section 4.4
5.27
Lectures 3–4: Means and variances
of random variables
The mean of a random variable
5.28
Example betting on a coin toss
5.29
The mean of a random variable
One way to answer this question is:
How much money will you win per game in the long run?
5.30
Notation for Means
mean of a sample = x̄ (x-bar).
mean of a random variable = μ.
μ (pronounced "mew") is the Greek letter M, M for Mean!
5.31
Example betting on a coin toss
Value of X -2 2
Probability 0.5 0.5
If you play 100 games, how many wins would you expect? How many
losses? Whats the average?
5.32
The mean of a discrete random variable
Value of X:   x1  x2  x3  ...  xk
Probability:  p1  p2  p3  ...  pk
μ_X = x1 p1 + x2 p2 + ... + xk pk
    = Σ (from i = 1 to k) xi pi
5.33
Example Texas hold em
In the variation of poker known as Texas hold em, each player receives
two cards. Here is the probability distribution of the number of aces
you get in a two-card hand:
Number of aces 0 1 2
Probability 0.85 0.14 0.01
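The mean formula is a one-liner in code. A sketch in Python using the aces table above:

```python
values = [0, 1, 2]           # number of aces in a two-card hand
probs = [0.85, 0.14, 0.01]   # probabilities from the table

# mu_X = x1*p1 + x2*p2 + ... + xk*pk
mu = sum(x * p for x, p in zip(values, probs))
print(round(mu, 2))  # 0.16 aces per hand, on average
```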
5.34
Example rolling a fair die
Let X be the number that Xena gets when she rolls a 6-sided die
(which has equally likely outcomes 1, 2, 3, 4, 5 and 6).
5.35
The law of large numbers
The law of large numbers is that if you take a large number of inde-
pendent observations of a random variable, then the sample mean x
will be close to the true mean X . The larger the sample size, the
closer we expect x to be to the true mean X .
5.36
We can see the law of large numbers in action by looking at the hair
cut cost data.
Average hair cut cost for the population of all MATH1041 students:
$50.7
5.37
A graph of average hair cut cost against sample size:
5.38
Rules for means
If X is a random variable with mean μ_X, and Y is a variable with mean
μ_Y, and a and b are fixed numbers, then:
1. The variable a + bX has mean:
μ_(a+bX) = a + b μ_X
2. The variable X + Y has mean:
μ_(X+Y) = μ_X + μ_Y
5.39
Example rolling a fair die twice
Xena rolls her 6-sided die twice. Let Y = X1 + X2 be the sum of the
values from her two rolls.
What is mean of Y ?
Answer:
5.40
Example Texas hold em variation
Number of aces 0 1 2
Probability 0.85 0.14 0.01
Should you take him up on his offer, or will Steven win out in the long
run?
Answer:
5.41
Example betting big on a coin toss
5.42
Many people give different answers for the $2 and $2,000 games, even
though the mean is the same. Clearly the mean isn't everything
when it comes to investment decisions. There is something else going
on. . .
risk!
One way to measure risk is to think about how variable your gain is,
by studying the variance.
5.43
Variance of a random variable
The mean is a measure of the centre of the distribution of a random
variable. The usual measure of spread of a random variable is its
variance.
5.44
Suppose X is a discrete random variable whose probability distribution
is:
Value of X:   x1  x2  x3  ...  xk
Probability:  p1  p2  p3  ...  pk
The variance of X is
σ_X² = (x1 − μ_X)² p1 + (x2 − μ_X)² p2 + ... + (xk − μ_X)² pk
     = Σ (from i = 1 to k) (xi − μ_X)² pi
where μ_X is the mean of the random variable X.
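For the earlier $2 coin-toss game (win $2 or lose $2, each with probability 0.5), the formula gives mean 0 and variance 4. A sketch in Python:

```python
values = [-2, 2]    # dollars won on a single fair coin toss
probs = [0.5, 0.5]

mu = sum(x * p for x, p in zip(values, probs))               # 0.0
var = sum((x - mu) ** 2 * p for x, p in zip(values, probs))  # 4.0
print(mu, var, var ** 0.5)  # mean 0.0, variance 4.0, sd 2.0
```

The $2,000 version of the game has the same mean but a variance of 4,000,000, which is one way to quantify its extra risk.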
5.45
Examples
5.46
Rules for variances
Rule 1. If X is a random variable and a and b are fixed numbers, then:
σ_(a+bX)² = b² σ_X²
2. Find the standard deviation of the total amount you expect to lose
playing Stevens game as described previously.
5.48
Rules for variances
Rule 4. If X and Y are independent random variables and a and b
are fixed numbers, then:
σ_(aX+bY)² = a² σ_X² + b² σ_Y².
5.49
Example seed weights
A laboratory scale reports that its measurement error is 10 mg. That
is, if you put the same object on the scale multiple times, the standard
deviation of the measurements will be 10 mg.
3. Is Dan onto a good thing, weighing his seeds more than once?
5.50
5.51
Other rules for means and variances
Non-linear transformations
μ_Y = ???   σ_Y² = ???
e.g. if Y = log X then μ_Y ≠ log μ_X, and the actual answer for how μ_Y
relates to μ_X depends on the distribution of your data.
5.52
Mean and variance of continuous random variables
The definitions of means and variances in this lecture were for discrete
random variables only. There are corresponding definitions for
continuous random variables. Although these are given here, you will
not be expected to know them.
5.54
mean as the centre of gravity
5.55
The Normal Distribution
6.1
Lectures 1–2: The normal
distribution
This lecture, we will meet the normal distribution a very important
distribution in statistics. We will be using this distribution throughout
this course both to model data and to make inferences from data.
Normal curves
6.2
Density curves: revision
For any particular range of values, the area under the curve is the
probability of an observation falling in that range.
6.3
Normal Curves
Normal curves are a special type of density curve.
6.4
[Several normal curves with different means and standard deviations, plotted from −10 to 10.]
6.5
Why So Many Normal Curves?
6.6
Normal distribution shorthand
X has a normal distribution with mean 100 and standard deviation 15.
X ∼ N(100, 15)
6.7
Warning: The Moore et al. textbook normal distribution shorthand is
not universal!
6.8
The Normality Assumption
The normal distribution is a good model for a data set if its histogram
looks like a normal curve.
6.9
Measurements Often Approximately Normal:
birthweights
heights of adults
intelligence quotient (IQ)
final assessment scores
6.10
Warning: always check the normality assumption!
6.11
Is the normality assumption reasonable?
Melbourne max. temps.
[Histogram of Melbourne maximum temperatures (about 10–40 °C), with density on the vertical axis.]
6.12
Graphical Assessment of Normality Assumption
6.13
What a Normal Distribution Should Look Like
[Histogram of normally distributed data: a symmetric bell shape centred at 0, frequency on the vertical axis, values from −4 to 4.]
6.14
What a Normal Quantile Plot Should Look Like
[Normal quantile plot of normal data: sample quantiles against normal quantiles, with the points lying close to a straight line.]
6.15
Right Skewed Data
[Histogram of right-skewed data: a long tail to the right, values from 0 to 50.]
6.16
Right Skewed Data
[Normal quantile plot of right-skewed data: the points curve upwards away from the straight line at the right.]
6.17
Left Skewed Data
[Histogram of left-skewed data: a long tail to the left, values from 0 to 50.]
6.18
Left Skewed Data
[Normal quantile plot of left-skewed data: the points curve away from the straight line at the left.]
6.19
Normal Quantile Plot for Melbourne Temperatures
[Normal quantile plot of the Melbourne temperatures: sample quantiles (about 10–40 °C) against normal quantiles.]
6.20
How do Normal Quantile Plots work?
6.21
The normal quantiles (z-scores) are on the horizontal axis. These are
calculated as the n values that break the normal curve into a set of
shapes with equal area.
[Standard normal curve divided into regions of equal area; the z-scores are chosen so that they split the area equally.]
6.22
Normality Should Not Be Treated as Yes/No
excellent
good
fair
poor
hopeless
6.23
Extent of Normality for Class Survey Data
data normality
Melb. temp.
UNSW satisf.
travel time
hair cost
labour
6.24
The 68-95-99.7 Rule for Normal
Curves
For normal measurements:
About 68% of data fall within σ of the mean
(that is, within one standard deviation of the mean).
6.25
The 68-95-99.7 Rule for Normal
Curves
[Normal curve showing 68% of data within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3.]
6.26
Example: Intelligence Quotient (IQ)
6.27
Calculating Probabilities for the
Standard Normal Distribution
The standard normal distribution is the one for which
μ = 0 and σ = 1.
6.28
Probability that Z < 1.4
P (Z < 1.4)
[Standard normal curve with the area to the left of 1.4 shaded.]
6.29
It can be found by using statistical software such as R, with the
command pnorm():
pnorm(1.4) = P (Z < 1.4) = 0.9192
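If R isn't to hand, the same number comes from the standard normal CDF, which can be written with the error function. A Python sketch of a pnorm() equivalent (the helper name mirrors R's):

```python
import math

def pnorm(z):
    """P(Z < z) for Z ~ N(0, 1); plays the role of R's pnorm()."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(pnorm(1.4), 4))  # 0.9192, matching the R output above
```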
6.30
P(−1.39 < Z < 0.43)
[Standard normal curve with the area between −1.39 and 0.43 shaded.]
6.31
P(−1.39 < Z < 0.43) = P(Z < 0.43) − P(Z < −1.39)
= 0.6664 − 0.0823
= 0.5841
6.32
Probability/Proportion Calculations Concerning Z
6.33
Class Exercise: Probabilities Concerning Z
3. What is the probability that Z > 1.4? (Hint: total area is 1).
6.34
Answer:
6.35
Reverse Use of Standard Normal
Probabilities
Sometimes we need to use standard normal probabilities in reverse.
More concisely:
P(Z > c) = 0.1. What is c?
[Standard normal curve: area 0.9 to the left of c, area 0.1 to the right.]
6.36
Solution
R function: qnorm(0.9)
6.37
Class Exercise: Reverse Use of Standard Normal Probability Tables
Answer:
qnorm(0.33) = −0.44
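qnorm() inverts pnorm(). Without R, a simple bisection on the CDF recovers the same values; a Python sketch (helper names mirror R's):

```python
import math

def pnorm(z):
    """P(Z < z) for Z ~ N(0, 1)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def qnorm(prob):
    """The z with P(Z < z) = prob, found by bisection; mirrors R's qnorm()."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if pnorm(mid) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(qnorm(0.90), 2))  # 1.28, so P(Z > 1.28) = 0.1
print(round(qnorm(0.33), 2))  # -0.44
```

Bisection is slow but transparent; statistical libraries use faster rational approximations to the same function.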
6.38
Calculating Probabilities for any
Normal Variable
Now we know how to compute probabilities concerning Z, a N (0, 1)
variable.
6.39
Example
Let
X = Intelligence Quotient (IQ) of humans.
Assuming
X is N (100, 15)
6.40
Standardising Transformation
RESULT
If
X is N(μ, σ)
then
Z = (X − μ)/σ  is N(0, 1).
6.41
Link to Linear Transformations
Z = a + bX
where
a = −μ/σ and b = 1/σ.
6.42
Temperature Analogy
6.43
Back to the IQ Example
6.44
P (Z > 0.67)
[Standard normal curve with the area to the right of 0.67 shaded.]
6.45
Answer:
1 − 0.7486 = 0.2514.
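The standardising step for the IQ question can be checked numerically. A Python sketch (keeping z unrounded, so the probability comes out near 0.2525 rather than the 0.2514 obtained after rounding z to 0.67):

```python
import math

def pnorm(z):
    """P(Z < z) for Z ~ N(0, 1)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 100, 15      # IQ assumed N(100, 15)

z = (110 - mu) / sigma   # standardise: about 0.67
p = 1 - pnorm(z)         # P(IQ > 110) = P(Z > z)
print(round(z, 2), round(p, 3))  # 0.67, about 0.252
```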
6.46
Birthweight Example
6.47
Answer:
6.48
Additional Exercise 06-1 (based on Exercise 1.82 of 4th Edition)
6.49
6.50
Additional Exercise 06-2 (based on Exercise 1.90 of 4th Edition)
6.51
6.52
Sampling distribution
of counts and proportions
Based on Moore et al.: Introduction and Section 5.1
6.53
Lectures 3–4: Sampling distribution
of counts and proportions
We will study the sampling distribution of binomial counts and
sample proportions, important in understanding gender imbalance,
political polls and more . . .
The sampling distribution of a statistic
Sample proportions
6.54
The sampling distribution of a
statistic
A 2015 poll by the Lowy Institute of 1,200 randomly selected
Australians found that 63% of Australians think our government should
commit to significant emission reductions so that other countries will
be encouraged to do the same.
https://www.lowyinstitute.org/lowyinstitutepollinteractive/climate-change-and-energy/
6.55
When we have collected some data, we usually want to calculate a
statistic that summarises some key quantity of interest.
(Such as the 63% in the poll.)
6.56
The sampling distribution of a statistic is the probability distribution
of values taken by this random variable.
6.57
This week we will study the sampling distribution of binomial counts
and sample proportions, such as the poll example.
Does this mean that most Australians want the government to lead
the way internationally on climate change policy?
6.58
Mean and variance of a binomial
variable
Binomial Revision:
Recall the binomial distribution from last week: the most common
type of discrete random variable encountered in practice.
6.59
Consider the following variables:
# Australians (in a random sample of 1,200) wanting significant
emission reductions to help combat climate change.
6.60
Revision exercise:
Assume 20% of students watched the Aust. Open Women's Final.
You have a coffee with four students (assume they are a random
sample).
Let X be the number, out of the four, who watched the program.
1. What is the probability that none of the four students watched the
program?
2. What is the probability that exactly one of the four watched it?
6.61
Mean and variance of a binomial variable
There is a general rule that can be used for calculating means and
variances of binomial variables.
If X ∼ B(n, p) then
μ_X = np
σ_X² = np(1 − p)
σ_X = √(np(1 − p))
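For the coffee example (n = 4, p = 0.2), these formulas give the answers directly. A Python sketch:

```python
n, p = 4, 0.2            # four students, 20% chance each watched

mu = n * p               # mean: 0.8
var = n * p * (1 - p)    # variance: 0.64
sd = var ** 0.5          # standard deviation: 0.8
print(mu, round(var, 2), round(sd, 2))
```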
6.62
Assume 20% of students watched the Aust. Open Women's Final.
You have a coffee with four students (assume they are a random
sample).
Let X be the number, out of the four, who watched the Aust. Open
Women's Final.
1. What is the mean (expected value) of X?
2. What is the variance of X?
3. What is the standard deviation of X?
Answer:
6.63
Sample proportions
Recall that the binomial distribution has two parameters: n (the
number of trials) and p (the probability of success).
6.64
We can estimate p using our observations, by calculating the sample
proportion p̂:
p̂ = X / n
6.65
p̂ is a statistic, and we want to understand its sampling distribution
(so that we understand how reliably it can estimate p in practice).
6.66
Aust. Open Women's Final example (continued)
6.67
Exercises
Recall the following examples, for which you have already derived the
distribution of X in Week 4.
6.68
For each case,
1. list the values p̂ = X/n can take and their probabilities.
6.69
Mean & variance of sample proportions
We know that if X ∼ B(n, p), then:
μ_X = np,   σ_X² = np(1 − p),   σ_X = √(np(1 − p)).
Hence:
μ_p̂ = p,   σ_p̂² = p(1 − p)/n,   σ_p̂ = √(p(1 − p)/n).
6.70
Example Poll
6.71
1. What is the probability distribution of X?
6.72
The Normal approximation for
binomial counts
Let's say, for the poll example, we want to find out how likely it is that
as many as 63% of Australians in a sample of 1,200 say they want
a significant emission reduction commitment, if the true proportion p
were actually equal to 0.5. How could we calculate this probability?
6.73
Key result
If n is large enough, the distribution of
X B(n, p)
can be well approximated by the normal distribution with the same
mean and variance.
Y ∼ N(np, √(np(1 − p))).
6.74
Example: Poll
6.75
We can use this result to calculate probabilities for binomial problems
where n is large.
In the poll example, let's assume that the probability that a respondent
says they would pay more is actually 0.5
(i.e. there is no overall majority of Australians who would pay more).
6.76
How large does n need to be?
We can use a normal approximation to the binomial if n is large
enough . . . but when is n large enough?
if np ≥ 10 and n(1 − p) ≥ 10
6.77
Example
1. X B(5, 0.25)
2. X B(15, 0.25)
3. X B(40, 0.25)
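Checking the rule of thumb for each case is mechanical; a Python sketch:

```python
# Normal approximation is reasonable when np >= 10 and n(1-p) >= 10.
for n, p in [(5, 0.25), (15, 0.25), (40, 0.25)]:
    ok = n * p >= 10 and n * (1 - p) >= 10
    print(f"B({n}, {p}): np = {n * p:g}, n(1-p) = {n * (1 - p):g}, OK? {ok}")
```

Only B(40, 0.25) passes, and only just: np = 10 exactly, with n(1 − p) = 30.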
6.78
[Three probability histograms comparing X ∼ B(15, 0.25) with its normal approximation Y ∼ N(3.75, 1.677): the shaded areas P(Y ≥ 4), P(Y ≥ 3.5), and the exact P(X ≥ 4).]
6.79
Example leaky tanks
6.80
6.81
6.82
6.83
6.84
The Central Limit Theorem
and its applications
Based on Moore et al.: Sections 5.2, 6.1
7.1
In this week's lectures we will learn more about the sampling
distribution of X̄, the average. It is important to understand averages
because they are used a lot. . .
Examples:
In TV ratings: The Masterchef season average tumbled to 1.127
million from 1.873 million viewers
7.2
Example corn yields
In crop research into a new variety of corn, the yields in bushels per
acre for 15 plots are:
138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8
109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5
We want to estimate the average yield (μ) for this new variety, to see
if average yield is higher for this corn variety.
7.3
Lectures 1–2: The Central Limit
Theorem
In this lecture we study special aspects of the sample mean X̄,
leading to the Central Limit Theorem: the most important theorem in
statistics!
Revision: Mean and variance of linear combinations
The mean and standard error of X̄
Some examples
7.4
Revision: Mean and variance of
linear combinations
If X and Y are independent random variables, and a and b are con-
stants, what can we say about the linear combination aX + bY ?
In particular:
the mean;
the variance;
the distribution?
7.5
Example
2. the total: T = X1 + X2 + X3
3. the sample mean: X̄ = (1/3)(X1 + X2 + X3)
7.6
The mean and standard error of X̄
A special and very important case of linear combinations of random
variables is the sample mean X̄ for a random sample.
7.7
Consider the sample x1, x2, . . . , xn to be values of independent random
variables: X1, X2, . . . , Xn which have the same distribution as X.
X̄ = (X1 + X2 + ... + Xn)/n = (1/n)X1 + (1/n)X2 + ... + (1/n)Xn.
What value will X̄ take, on average?
How will the variance and standard deviation of X̄ compare to those
for X?
7.8
In general, for a random sample of size n:
μ_X̄ = μ,   σ_X̄² = σ²/n,   σ_X̄ = σ/√n.
We will refer to σ_X̄ as the standard error of X̄.
7.9
Example corn yields
In crop research into a new variety of corn, the yields in bushels per
acre for 15 plots are:
138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8
109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5
What is the standard error of X̄?
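Assuming σ = 10 bushels/acre (the value from previous studies used for this corn example), the standard error is σ/√n. A Python sketch:

```python
import math

sigma, n = 10, 15           # assumed population sd; 15 plots
se = sigma / math.sqrt(n)   # standard error of the sample mean
print(round(se, 3))         # about 2.582 bushels/acre
```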
7.10
The sampling distribution of X̄ for a normal random sample
Let X and Y be two independent normally distributed random
variables. Then the linear combination aX + bY has a normal
distribution, with mean and variance
μ_(aX+bY) = a μ_X + b μ_Y,   σ_(aX+bY)² = a² σ_X² + b² σ_Y².
7.11
Example
(iii) the mean: X̄ = (1/3)(X1 + X2 + X3)
7.12
Recall the sample mean X̄ is a linear combination of random variables. Hence
the following important result:
7.13
Example test scores
2. Take an SRS of 25 students who took the test. What is the mean
and variance of the sample mean?
Answer:
7.14
The Central Limit Theorem
In words:
7.15
In notation:
X̄ ≈ N(μ, σ/√n)
The result does apply even when the population is discrete
(e.g. the normal approximation to the binomial).
7.17
Why is the Central Limit Theorem (CLT) so
important?
7.18
Some examples
The following are some examples to help see where the Central Limit
Theorem is useful. These examples are just a starting point; we're
going to use the CLT plenty more times in the rest of this course!
7.19
Example corn yields
In crop research into a new variety of corn, the yields in bushels per
acre for 15 plots is found.
Previous studies for a similar variety suggest that the mean corn yield
should be 110 bushels/acre, and the standard deviation should be
σ = 10 bushels/acre.
7.20
Answer:
7.21
Example pH
7.22
Answer:
7.23
7.24
Lectures 3–4: The Central Limit
Theorem and Confidence intervals
Based on Moore et al.: Sections 5.2–6.1
In this lecture we will use simulations and examples from the Week 1
survey to demonstrate the Central Limit Theorem the single most
important theorem in statistics.
Then you will meet your first method of inference: the confidence
interval for .
7.25
Lectures 3–4: The Central Limit
Theorem and Confidence intervals
Revision The Central Limit Theorem
Confidence intervals
Example speeding?
7.26
Revision The Central Limit
Theorem
7.27
Examples real and simulated
7.28
Example 1: distribution of X̄
7.29
The Central Limit Theorem is about the sampling distribution of
the mean.
Example 2: distribution of X̄ for n = 2
If we take 100 samples, each of size n = 2, and take the mean of each
sample, what sort of distribution do we expect to see?
We will simulate some samples, and find their means and look at the
distribution of the sample values. (What do we expect?)
7.30
Example 3: distribution of X̄ for n = 4
If we take 100 samples, each of size n = 4, and take the means of each
sample, what sort of distribution do we expect to see?
We will simulate some samples, and find their means and look at the
distribution of the sample values. (What do we expect?)
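A simulation along these lines can be sketched in Python, using made-up uniform "day of month" values (1 to 31) as a clearly non-normal population:

```python
import random
import statistics

random.seed(42)

# Day-of-month values are roughly uniform on 1..31 -- far from normal.
def one_sample_mean(n):
    return statistics.mean(random.randint(1, 31) for _ in range(n))

means = [one_sample_mean(4) for _ in range(100)]

# Individual days are uniform, but the 100 sample means already start
# to pile up around the population mean of 16, as the CLT predicts.
print(round(statistics.mean(means), 1))
print(round(statistics.stdev(means), 1))  # roughly sigma / sqrt(4)
```

Increasing the sample size from 4 makes the histogram of `means` both narrower and more bell-shaped.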
7.31
Another example your birthday!
Recall that in the Week 1 survey you were asked what day of the month
you were born on.
Answer:
7.32
You were also asked what day of the month your mother and your
father were born on. Consider the average of your parents' birthdays.
Answer:
7.33
You were also asked what day of the month your youngest brother
(/sister/cousin) was born on. Consider the average of all 4 birthdays
for your family.
Answer:
7.34
Another example: travel time
How good is the normal approximation for average travel time (n = 5)?
7.35
Summary:
7.36
How large is a large enough n?
The Central Limit Theorem says that the average of a random sample
from any variable is approximately normal, if the sample size is large
enough.
7.37
From simulations using different distributions you may notice that:
As the sample size n gets bigger, the approximation gets more
accurate.
7.38
The Central Limit Theorem says that X̄ is approximately normal, if
sample size is large enough. Here are some rough rules of thumb
about whether sample size is large enough for X̄ to be well
approximated by a normal distribution.
How large a sample size you need depends on how close your data (X)
are to being normal:
If the normal approximation for your data is good, you don't need
many observations at all!
7.39
Confidence intervals
Now we will talk about an important application of the Central Limit
Theorem: constructing a confidence interval for μ when we don't
know anything about a variable except its standard deviation σ and
the sample we have collected.
7.40
Example corn yield
In crop research into a new variety of corn, the yields in bushels per
acre for 15 plots are
138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8
109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5
We want to estimate the average yield (μ) for this new variety, to see
if average yield is higher for this corn variety.
7.41
Note that for this sample, x̄ = 124.14.
7.42
Then, because of the central limit theorem, the sampling distribution
of the mean is normal with:
μ_X̄ = μ,   σ_X̄ = 10/√15
X̄ ∼ N(μ, 10/√15)
7.43
We can use what we know about normal distributions to find a range
of values, a confidence interval, that we are pretty sure contains μ.
95% of the time, X̄ is within 1.96 × 10/√15 of μ.
7.44
A confidence interval for μ, when σ is known
Assuming that:
we have a set of n independent observations from any variable
with a known standard deviation σ; and
X̄ is normally distributed,
then a 95% confidence interval for μ from a SRS of size n is:
(x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n).
7.45
More generally:
If:
we have a set of independent observations from any variable with
a known standard deviation σ; and
we can assume X̄ is normally distributed,
then a level C confidence interval for μ from a SRS of size n is obtained
from
(x̄ − z* σ/√n, x̄ + z* σ/√n),
where z* is the number such that if Z ∼ N(0, 1), then
P(−z* < Z < z*) = C.
Where did the 1.96 come from for a 95% confidence interval?
The goal is to find the value z* such that the middle 95% of the area
under a standard normal curve lies between −z* and z*.
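Putting the corn numbers through the formula (σ = 10 assumed known, n = 15, z* = 1.96) can be sketched in Python:

```python
import math

yields = [138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8,
          109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5]
sigma, zstar = 10, 1.96            # sd assumed known; 95% confidence

n = len(yields)
xbar = sum(yields) / n             # 124.14
m = zstar * sigma / math.sqrt(n)   # margin of error, about 5.06

print(round(xbar, 2))                          # 124.14
print(round(xbar - m, 1), round(xbar + m, 1))  # about (119.1, 129.2)
```

The whole interval sits above the old variety's 110 bushels/acre, which is what makes the yield comparison interesting.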
7.47
[Standard normal curve with the middle 95% of the area shaded, lying between −1.96 and 1.96.]
7.49
It is important to remember that average corn yield (μ) is a fixed
parameter that doesn't vary. μ is either in the 95% confidence interval
or it isn't, and if we repeated the estimation process lots of times,
95% of our confidence intervals would contain μ, while 5% wouldn't
contain μ.
7.50
Example speeding?
The speed limit in residential areas is 50 km/hr. A concerned resident
on Barker St measured the speeds of traffic during a 10-minute period
and obtained the following results:
46 54 54 45 48 46 61 50 46 54
51 50 43 59 46 41 38 54 58 50
How fast is the average car going? Is there evidence that on average,
cars exceed the speed limit?
7.51
Answer:
We have x̄ = 49.7 and n = 20.
7.52
Confidence intervals
and significance testing
Based on Moore et al.: Sections 6.1–6.2
8.1
Lectures 1–2: Understanding
confidence intervals
The margin of error
Checking assumptions
Extras
8.2
The margin of error
Recall that if we have a random sample from any variable with a
known standard deviation σ, then a level C confidence interval for μ
from a SRS of size n is obtained from:
(x̄ − z* σ/√n, x̄ + z* σ/√n)
where z* is the number such that if Z ∼ N(0, 1), then we have the
following probability
P(−z* < Z < z*) = C.
8.3
For example, because P(−1.96 < Z < 1.96) = 0.95, a
95% confidence interval for μ is:
(x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n).
8.4
We can rewrite this confidence interval as:
(x̄ − m, x̄ + m)
where m is the margin of error and
m = z* σ/√n.
8.5
The margin of error controls the width (or length) of the confidence
interval.
The smaller the margin of error, the more precisely we can estimate μ.
8.6
How to decrease the margin of
error?
Example body temperature
8.7
Answer:
Notice that the higher the desired confidence level, the larger the mar-
gin of error.
8.8
Example body temperature
Now consider what would happen if they had sampled n = 200 healthy
adults instead of n = 106, and the average body temperature of these
200 individuals was 36.78 degrees.
Find a 95% confidence interval for the mean body temperature and
compare it to the confidence interval when n = 106.
8.9
Answer:
Notice that the larger the sample size, the smaller the margin of error.
8.10
Example body temperature
Find a 95% confidence interval for the mean body temperature for
σ = 0.2 and compare it to the confidence interval when σ = 0.4.
8.11
Answer:
Notice that the smaller the standard deviation σ, the smaller the
margin of error.
8.12
There are three ways to reduce the margin of error

m = z*·σ/√n:

use a lower confidence level (smaller z*), a larger sample size n, or a
smaller standard deviation σ.
8.13
What sample size should I use?
The margin of error is often used in practice to determine the sample
size to use in a particular study.
By solving for n in our expression for the margin of error, we can decide
on an appropriate sample size, if we are given:
the margin of error that is desired;
the confidence level C; and
the standard deviation σ.
8.14
Example body temperature
How many healthy adults would need to be used in a study, for the
margin of error on our estimate of μ to be no more than 0.1 degrees?
8.15
Answer:
We need 0.1 = 1.96 × 0.4/√n.
Solving for n: √n = 7.84, so n = 61.5; sample at least 62 healthy adults.
8.16
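Solving m = z*·σ/√n for n gives n = (z*·σ/m)², rounded up. A Python sketch using the body-temperature values (σ = 0.4, desired m = 0.1):

```python
import math
from scipy.stats import norm

def required_n(sigma, m, C=0.95):
    """Smallest sample size whose margin of error is at most m."""
    z_star = norm.ppf((1 + C) / 2)
    return math.ceil((z_star * sigma / m) ** 2)

print(required_n(0.4, 0.1))    # 62, since (1.96 * 0.4 / 0.1)^2 is about 61.5
print(required_n(0.4, 0.05))   # halving m roughly quadruples the required n
```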
Checking assumptions
( x̄ − z*·σ/√n , x̄ + z*·σ/√n )
8.17
Example body temperature
8.18
We assume that:
The body temperatures of the adults are: independent
(Which is OK if the sample is random.)
8.19
[Figure: normal quantile plot of body temperatures, with Temperature
(36.0 to 37.5) plotted against z-score (−2 to 2).]
8.20
Confidence intervals: TRUE or FALSE?
8.21
8.22
Lectures 3-4: Tests of significance
Based on Moore et al.: Section 6.2
In this lecture you will meet the most widely used method of inference
about a population from a sample: significance testing, or hypothesis
testing.
Some examples
8.23
Often, you will come across P -values in your studies:
Paired t-tests indicated that there was no change in attitude (P >
0.05).
8.24
Some examples
1. Body temperature again
8.25
2. Corn again
In crop research into a new variety of corn, the yields in bushels per
acre for 15 plots are:
138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8
109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5
Average yield for the old corn variety was 110 bushels per acre. Is
there evidence that this new variety has higher yield?
8.26
3. Calcium, pregnancy and Central Americans
The level of calcium (in mg/dl) in healthy young adults varies with
mean μ = 9.5 and standard deviation σ = 0.4.
Is this an indication that the mean calcium level is different from 9.5
in these women?
8.27
The reasoning of hypothesis tests
In hypothesis testing, we have a particular claim about a population
parameter (a null hypothesis) that we want to test, and we have a
sample of data we can use to test the claim.
2. Mean yield for old corn variety was 110 bushels per acre (or μ =
110).
8.28
We use the sample data to test the null hypothesis by asking the
question: How much evidence do these data provide against the null
hypothesis?
To measure how much evidence the data provide against the null hy-
pothesis, we use probability we calculate a P -value, the probability
of observing a test statistic as or more extreme than the observed
outcome, if the null hypothesis were true.
8.29
Example 1 Body temperature
36.78) measures how unusual this sample mean is (if the true
P (X
mean were = 37).
We can calculate this probability if we know the distribution of X,
assuming that = 37.
8.30
The P-value here is P(X̄ ≤ 36.78).
In other words, the smaller this P-value is, the more evidence we have
that the original claim (μ = 37) is not true!
8.31
Procedure for hypothesis testing
There are several steps to complete when doing any hypothesis test:
State the null and alternative hypotheses.
Calculate the test statistic.
Calculate the P-value.
Conclusion.
8.32
The null and alternative hypotheses

But why did we use Z = (x̄ − μ0)/(σ/√n)?
8.34
In order to make inferences using this test statistic, we need to know
its null distribution: the sampling distribution of the test statistic
when H0 is true.
8.35
Calculate the P -value
It is defined as the probability that the test statistic would take a value
as extreme or more extreme than the value we actually observed, if H0
is true.
The smaller the P -value, the more extreme the test statistic is and so
the stronger the evidence against H0.
8.36
In the body temperature example:
If H0 : μ = 37 is actually true, then we expect z = (x̄ − 37)/(σ/√n) to be
close to 0, and only rarely to be a very negative value.

P-value = P( (X̄ − 37)/(σ/√n) < (36.78 − 37)/(0.4/√106) )
        = P( Z < −0.22/0.03885 )
        = P( Z < −5.66 )
        ≈ 7.5 × 10⁻⁹.
8.37
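The body-temperature P-value can be reproduced numerically; a sketch in Python (in MATH1041 you would use R):

```python
from scipy.stats import norm

xbar, mu0, sigma, n = 36.78, 37, 0.4, 106
z = (xbar - mu0) / (sigma / n ** 0.5)   # test statistic
p_value = norm.cdf(z)                   # lower tail, since Ha: mu < 37
print(round(z, 2), p_value)             # about -5.66 and 7.5e-09
```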
Conclusion
If our P -value is large, the test statistic is not extreme and so there is
no evidence against H0.
8.38
In the body temperature example:
That is, there is very strong evidence against the null hypothesis being
true (H0 : = 37) and we reject it in favour of Ha : < 37 the
alternative hypothesis.
8.39
The Z-test for a population mean
Often we want to test a hypothesis about the population mean of
some variable X.
Assuming that:
we have a set of n independent observations from any variable
with a known standard deviation σ; and
X̄ is approximately normal,
then we can use as a test statistic:

z = (x̄ − μ0)/(σ/√n).
8.40
2. Corn again
Average yield for the old corn variety was 110 bushels per acre. Is
there evidence that this new variety has higher yield?
8.41
Answer:
Recall from last week that a 95% confidence interval for μ was
(119.1, 129.2). Do the results of the hypothesis test coincide with
this interval?
8.42
P -values and significance
Recall that if the P -value is small then we conclude we have evidence
against H0 in favour of Ha.
If the P -value is not small then we conclude that our test statistic is
not extreme so we have no evidence against H0.
8.43
It is common to choose a significance level α and to use this as a
guide for drawing conclusions from P-values.
8.44
Below is an alternative way of interpreting P-values. Note that P-values
measure strength of evidence: the smaller P is, the more evidence there
is against H0.
8.45
One-sided vs two-sided testing
There are various possible types of alternative hypotheses, when H0 :
μ = μ0.
One-sided alternatives:
Ha : μ > μ0
Ha : μ < μ0
Two-sided alternative:
Ha : μ ≠ μ0
8.46
The alternative hypothesis Ha is important for determining the P -value,
because it tells us what sorts of departures from H0 we are interested
in measuring.
8.47
Calcium, pregnancy and Central Americans
The level of calcium (in mg/dl) in healthy young adults varies with mean
μ = 9.5 and standard deviation σ = 0.4.
Is this an indication that the mean calcium level is different from 9.5
in these women?
8.48
Answer:
8.49
Fast food
A fast food outlet claims that the average caloric content of its meals
is 800, and the standard deviation is 25.
A consumer protection group tested 12 meals and found the average
number of calories was 873.
8.50
Answer:
8.51
8.52
Inference about a
population proportion
Based on Moore et al.: Section 8.1
9.1
Lectures 1-2: Inference about a
population proportion
Today we will meet a set of practical tools for making inference about
binary (yes/no) variables.
Introduction
9.2
Data analysis for one or two variables
Useful numbers:
  one categorical variable: table of frequencies
  one quantitative variable: mean and sd, or 5-number summary
  two categorical variables: 2-way table of frequencies
  one categorical and one quantitative variable: 5-number summary for each group
  two quantitative variables: correlation or regression
Useful test (one quantitative variable): 1-sample Z-test for μ (if σ is known)
Useful for inference (one quantitative variable): CI for μ
Last week
9.3
Data analysis for one or two variables
Useful numbers:
  one categorical variable: table of frequencies
  one quantitative variable: mean and sd, or 5-number summary
  two categorical variables: 2-way table of frequencies
  one categorical and one quantitative variable: 5-number summary for each group
  two quantitative variables: correlation or regression
Useful test (one quantitative variable): 1-sample Z-test for μ (if σ is known)
Useful for inference (one quantitative variable): CI for μ
This lecture
9.4
Introduction
Out of 400 MATH1041 students, 108 of them were born in the spring.
Does this provide evidence that people are more likely to be born in
Spring?
9.5
2. TV ratings and MATH1041
According to OzTAM, 6.88% of people in Sydney watched the Australian
Open Women's Tennis Final. In contrast, out of 400 MATH1041
respondents, 49 students watched this event on television.
9.6
3. Poll
9.7
In all of these examples, we have observed a binary variable (a cate-
gorical variable with only two possible responses):
1. Birth month: Spring vs not Spring.
9.8
Equivalently, we can think of our data as n trials in which we have
counted the number of successes:
1. Birth month: n =400 trials, X =108 successes.
9.9
Week 6 revision
9.10
So if the true proportion of successes is p, then

(p̂ − p) / √( p(1 − p)/n )

has a standard normal distribution (approximately, by the CLT), that
is:

(p̂ − p) / √( p(1 − p)/n ) ≈ N(0, 1).
9.11
One-sample test for p
When studying a binary variable, the proportion of successes p is of
key interest. Usually p is unknown but we may have ideas (hypotheses)
about what it should be.
Assuming that:
We have a sample of n independent measurements of any binary
variable; and
p̂ is approximately normal,
then we can test:
H0 : p = p0
using the test statistic

z = (p̂ − p0) / √( p0(1 − p0)/n ).
9.13
Birth months and spring
Out of 400 MATH1041 students, 108 of them were born in the spring.
Does this provide evidence that people are more likely to be born in
Spring?
9.14
Answer:
9.15
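A sketch of this calculation in Python, taking p0 = 1/4 on the assumption that each of the four seasons is equally likely to contain a birthday:

```python
from scipy.stats import norm

n, x, p0 = 400, 108, 0.25
p_hat = x / n                          # 0.27
se0 = (p0 * (1 - p0) / n) ** 0.5       # standard error under H0
z = (p_hat - p0) / se0
p_value = norm.sf(z)                   # upper tail, since Ha: p > 0.25
print(round(z, 2), round(p_value, 3))  # 0.92 0.178
```

The P-value is large, so 108 spring births out of 400 provide no real evidence that spring birthdays are more common.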
Political poll
9.16
Answer:
9.17
Confidence interval for the true
proportion p
Assuming that:
We have a sample of n independent measurements of any binary
variable; and
p̂ is approximately normal,
then an approximate level C confidence interval for the probability of
success p is

( p̂ − z*·√( p̂(1 − p̂)/n ) , p̂ + z*·√( p̂(1 − p̂)/n ) )

where z* is the number such that if Z ~ N(0, 1), then
P(−z* < Z < z*) = C.
Note that this standard error involves p̂, whereas in the hypothesis test,
the standard error involved p0. Why?
9.19
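A Python sketch of the slide's interval, illustrated on the birth-month data (X = 108 successes out of n = 400); note that the standard error uses p̂:

```python
from scipy.stats import norm

def prop_ci(x, n, C=0.95):
    """Approximate level-C confidence interval for a proportion p."""
    p_hat = x / n
    z_star = norm.ppf((1 + C) / 2)
    se = (p_hat * (1 - p_hat) / n) ** 0.5   # uses p-hat, not p0
    return p_hat - z_star * se, p_hat + z_star * se

lo, hi = prop_ci(108, 400)
print(round(lo, 3), round(hi, 3))   # 0.226 0.314
```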
Political Poll
Answer:
9.20
Data analysis for one or two variables
Useful numbers:
  one categorical variable: table of frequencies
  one quantitative variable: mean and sd, or 5-number summary
  two categorical variables: 2-way table of frequencies
  one categorical and one quantitative variable: 5-number summary for each group
  two quantitative variables: correlation or regression
This lecture
9.21
The plus four confidence interval
The plus four formula is not widely used and we will not use it in
MATH1041; just stick with the slide 9.17 formula. If n is small, you
are better off using the binomial distribution directly for inference.
9.22
Some emerging patterns
In our three weeks studying methods of inference, you might have
noticed some patterns across the different methods. Here are two
patterns:
The test statistics have a common form; and
Confidence intervals and hypothesis tests are related.
9.23
The test statistics have a common form
You may have noticed that all the test statistics we have used so far
have had the form
(estimate − hypothesised value) / (standard error of estimate)
(In fact all Z statistics have this form. So do t statistics, which we will
meet next lecture!)
9.24
Confidence intervals and hypothesis tests are
related
9.25
TV ratings and MATH1041
9.26
Answer:
9.27
Declaring statistical significance at the 0.05 level for a two-sided test
is equivalent to a 95% confidence interval of the parameter of interest
not containing the null value.
Mathematically, the two procedures are equivalent (in the sense that
every two-sided hypothesis test has a matching confidence interval)!
Summary for 2-sided test examples from the last few weeks:
TV ratings: p0 = 6.88%
9.28
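The TV-ratings example illustrates this equivalence; a Python sketch with n = 400, X = 49 and null value p0 = 0.0688:

```python
from scipy.stats import norm

n, x, p0 = 400, 49, 0.0688
p_hat = x / n                                     # 0.1225

# Two-sided Z test of H0: p = p0
z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
p_value = 2 * norm.sf(abs(z))

# 95% confidence interval for p
se = (p_hat * (1 - p_hat) / n) ** 0.5
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se

print(round(z, 2), p_value < 0.05)        # 4.24 True: reject H0 at the 0.05 level
print(round(lo, 3), round(hi, 3))         # 0.090 0.155: interval excludes 0.0688
```

Both routes give the same verdict: |z| > 1.96 and, equivalently, the 95% interval does not contain the null value.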
When to use which procedure?
Today we will meet a set of practical tools for making inference about
the mean of a quantitative variable.
What to do when σ is not known?
The t Distribution
Assumptions
9.31
What to do when σ is not known?
For a quantitative random variable X with mean μ and variance σ², we
have established that you can estimate μ from a random sample.
9.32
To construct a confidence interval for μ, we used:

(X̄ − μ) / (σ/√n) ~ N(0, 1)

The histogram approximates the density function of (x̄ − μ)/(s/√n); it
was made using thousands of samples of size 6.
Its distribution is not standard normal (the red line): there is higher
density in the tails (because s is not exactly σ all the time).
9.34
[Figure: histogram of averages of 6 values from the normal distribution,
i.e. of (x̄ − μ)/(s/√n), with the Z ~ N(0, 1) density overlaid.]
9.35
Using the standard normal distribution to construct a confidence interval
for μ when σ is not known wouldn't work well for small samples.
It can actually be shown that in this case, sample means are farther
than 1.96 standard errors from the mean 10.6% of the time, not 5%
of the time!
9.36
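The quoted failure rate can be checked from the t distribution with n − 1 = 5 degrees of freedom (a Python sketch):

```python
from scipy.stats import t

# P(|T| > 1.96) when T ~ t(5), i.e. for samples of size n = 6
p = 2 * t.sf(1.96, df=5)
print(round(p, 3))   # just over 0.10: roughly double the nominal 5%
```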
Even for larger samples than n = 6,

(X̄ − μ) / (S/√n) is not exactly N(0, 1),

although the approximation gets better as n increases.
9.37
[Figure: histograms of (x̄ − μ)/(s/√n) for n = 6, n = 11 and n = 25,
each with the Z ~ N(0, 1) density overlaid.]
9.38
The t distribution
Instead of using the normal distribution to make inferences about μ,
we use the t distribution when σ is unknown.
If X ~ N(μ, σ), then the statistic

t = (x̄ − μ) / (s/√n)

has the t-distribution with n − 1 degrees of freedom, which we
write in shorthand as t(n − 1) or tₙ₋₁.
9.39
The t-distribution has a known density function. However, this density
function is a bit complicated and its integral does not have a closed
form solution, so, as for the normal distribution, we use tables or
software to find probabilities.
9.40
[Figure: density curves of the t-distribution with 1, 5 and 10 degrees of
freedom, each compared with the Z ~ N(0, 1) density on −4 to 4.]
9.41
Properties of the t-distribution:
It is actually a family of distributions, t1, t2, . . ., t∞.
9.42
Confidence Intervals for the Mean
(with σ not known)
We can construct confidence intervals for μ without having any
knowledge of σ.
Assuming that:
we have a set of n independent observations of a variable; and
the variable is normally distributed,
then a level C confidence interval for μ is

x̄ ± t*·s/√n

where t* is the value from the t(n − 1) distribution for which the area
between −t* and t* is C.
In practice, this formula works well even when our data are not normal,
as long as X̄ is approximately normal.
9.43
Example corn again
In crop research into a new variety of corn, the yields in bushels per
acre for 15 plots are:
138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8
109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5
9.44
Answer:
9.45
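For the corn yields, the t interval and test can be sketched in Python (the course itself uses R; scipy supplies the t distribution):

```python
import math
import statistics
from scipy import stats

yields = [138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8,
          109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5]
n = len(yields)
xbar = sum(yields) / n               # 124.14
s = statistics.stdev(yields)         # about 11.96

# 95% t confidence interval for mu
t_star = stats.t.ppf(0.975, df=n - 1)           # about 2.145
m = t_star * s / math.sqrt(n)
print(round(xbar - m, 1), round(xbar + m, 1))   # 117.5 130.8

# One-sided test of H0: mu = 110 vs Ha: mu > 110
t_stat, p_two = stats.ttest_1samp(yields, popmean=110)
print(round(t_stat, 2), p_two / 2 < 0.001)      # 4.58 True: strong evidence of higher yield
```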
Example lead in soil
A new method of extracting lead from soil has been developed, and
we would like to know what the average amount of lead extracted is
(in parts per million). When tried on 27 specimens of soil, it yielded a
mean of 83 ppm lead and a standard deviation of 10 ppm.
Answer:
9.46
The One-Sample t Test
Often we want to test a hypothesis H0 : μ = μ0 about the population
mean μ of some variable X, without any knowledge of σ.
Assuming that:
we have a set of n independent observations of a variable; and
the variable is normally distributed,
then under H0 the test statistic

t = (x̄ − μ0) / (s/√n)

comes from a t(n − 1) distribution.
In practice, this test works well even when our data are not normal,
as long as X̄ is approximately normal.
9.47
As always, the alternative hypothesis Ha is important for determining
the P -value. Ha tells us what sorts of departures from H0 we are
interested in measuring.
9.48
Example lead in soil
The amount of lead in a certain type of soil, when released by a
standard extraction method, has a mean of 86 parts per million (ppm).
A new and cheaper extraction method is tried on 27 specimens of the
soil, yielding a mean of 83 ppm lead and a standard deviation of 10
ppm.
Is there significant evidence that the new method frees less lead from
the soil?
Answer:
9.49
9.50
Assumptions
Note that for T = (X̄ − μ)/(S/√n) to have a t(n − 1) distribution, we
made two assumptions:
we have a set of n independent observations of a variable; and
the variable is normally distributed.
9.51
The t-distribution result works whenever the Central Limit Theorem
works. Hence Moore et al. say (page 432):
Small samples: Do not use t procedures if there are outliers present
or if the data are more than slightly skewed (data must have a good
approximation to normality).
9.52
What do you do if you're worried about non-normality?
Use another family of distributions to model the data. This is an
advanced topic outside the scope of MATH1041.
9.53
Example hair cut costs
Is there evidence that the hairdressers are overcharging? (In the sense
that they are charging more than the average amount a male student
would pay?)
What assumptions did you make to answer this question? Are these
assumptions reasonable?
9.54
Answer:
9.55
Additional exercise
9.56
9.57
9.58
9.59
9.60
Use and abuse of hypothesis
testing
Based on Moore et al.: Section 6.3
10.1
Lectures 1-2: Use and abuse of
hypothesis tests/Comparing Means
Using statistical tests wisely is not necessarily easy to do. We will work
through some important cautions when using a hypothesis test.
Comparing Means:
Introduction
10.3
Check your assumptions
In all of the above one-sample tests, there are two important
assumptions. Your conclusions may not be valid if the assumptions are
not met.
1. Observations are independent. You can guarantee that this
assumption is satisfied by taking a SRS.
2. The sample mean X̄ is approximately normally distributed.
10.4
Fast food again
10.5
10.6
Calcium, pregnancy and Central Americans again
10.7
Practical vs statistical significance
Statistical significance is attained when the P -value is below some
chosen significance level.
10.8
Example factory working hours
Hours worked in a day is recorded for 996 factory workers. The mean
hours worked is 7.02 hours, standard deviation 0.3 hours.
Does this sample provide evidence that factory workers work longer
than 7 hour days?
10.9
Answer:
Hypotheses: H0 : μ = 7 vs Ha : μ > 7
Test statistic: t = (x̄ − 7)/(s/√n) ≈ 2.10,
which comes from t995 if H0 is true.
P-value: P(T > 2.1) = 0.018; using tables alone, 0.01 < P(T > 2.1) < 0.02.
10.10
Textbook exercise
Answer:
10.11
Common misuses of hypothesis tests
Unfortunately, hypothesis testing is often misused in practice. Some
particularly common abuses of the method:
To conclude that the null hypothesis is true.
10.12
Concluding that the null hypothesis is true
Just as you can never prove a theory, you can never prove a null
hypothesis!
So clearly the true mean could be 86 (so H0 could be true), but the
true mean could also be 85, 84.5, or 80.2. . .
10.13
Have the results of a hypothesis test been misinterpreted in any of the
following statements?
Paired t-tests indicated that there was no change in attitude (P >
0.05)
10.14
When interpreting a large P-value, it is often useful to construct a
confidence interval for the key parameter of interest: this helps
understand whether or not a practically significant effect is possible
despite it not being statistically significant.
10.15
Searching for significance
Hypothesis testing is designed for the situation where you have a theory
or hypothesis you want to test; you then design an experiment or study
to test this hypothesis.
10.16
Exercise searching for significance
10.17
Comparing two means
Based on Moore et al.: Section 7.2
10.18
Data analysis for one or two variables
Useful numbers:
  one categorical variable: table of frequencies
  one quantitative variable: mean and sd, or 5-number summary
  two categorical variables: 2-way table of frequencies
  one categorical and one quantitative variable: 5-number summary for each group
  two quantitative variables: correlation or regression
This week
10.19
Do ravens fly to gunshots?
location 1 2 3 4 5 6 7 8 9 10 11 12
before 1 0 1 0 0 0 0 5 1 1 2 0
after 2 3 2 2 1 1 2 2 4 2 0 3
10.20
Pregnancy and smoking
10.21
Results:
10.22
Introduction
Hypothesis tests for data such as the above are often called two sam-
ple tests because they involve the comparison of two samples from
two different populations.
10.23
Another way to think about it: we have measured two variables on
each subject a quantitative variable and a binary variable.
10.24
The two-sample t-test
We have two samples: one from X1, one from X2. We want to test
H0 : μ1 = μ2.
If we assume that:
the two samples are independent of each other;
and measuring how far t is from 0. If this statistic is large and positive,
it suggests that μ1 > μ2; if it is large and negative, it suggests that
μ1 < μ2.
10.25
We know that if H0 is true, and if the above assumptions are satisfied,
then T ~ t(n1 + n2 − 2). So to calculate the P-value:
* In practice, this t-test works well even when our data are not normal,
as long as x̄1 − x̄2 comes from an approximately normal distribution.
10.26
Pregnancy and smoking
Answer:
10.27
10.28
Lectures 3-4: Comparing two means
Understanding the two sample t-test
Paired t vs two-sample t
10.29
Understanding the formula for the
two-sample t-test
The two-sample t-test is based on the following result:
If we take two samples, one from X1, which has mean μ1, and one
from X2, which has mean μ2, then

( x̄1 − x̄2 − (μ1 − μ2) ) / ( sp·√(1/n1 + 1/n2) )

has a t(n1 + n2 − 2) distribution.

If X̄1 ~ N(μ1, σ/√n1) and X̄2 ~ N(μ2, σ/√n2), and if X̄1 and X̄2 are
independent, the distribution of the sample mean difference is:

X̄1 − X̄2 ~ N( μ1 − μ2, σ·√(1/n1 + 1/n2) ).
10.32
Let

t = ( x̄1 − x̄2 − (μ1 − μ2) ) / ( sp·√(1/n1 + 1/n2) )

where

sp = √( ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ).

Using the same arguments as previously (e.g. slide 7.48) we can
construct a level C confidence interval for μ1 − μ2 using

x̄1 − x̄2 ± t*·sp·√(1/n1 + 1/n2)

where t* is the value from the t(n1 + n2 − 2) distribution for which the
area between −t* and t* is C.
10.33
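The pooled formulas translate directly into code. A Python sketch; the two sets of sample summaries in the example call are hypothetical, chosen only to illustrate the arithmetic:

```python
import math
from scipy.stats import t

def pooled_two_sample(xbar1, s1, n1, xbar2, s2, n2, C=0.95):
    """t statistic for H0: mu1 = mu2 and a level-C CI for mu1 - mu2."""
    df = n1 + n2 - 2
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)  # pooled sd
    se = sp * math.sqrt(1 / n1 + 1 / n2)
    t_star = t.ppf((1 + C) / 2, df)
    diff = xbar1 - xbar2
    return diff / se, (diff - t_star * se, diff + t_star * se)

# Hypothetical summaries: group 1 (n = 10, mean 23.4, sd 5.2)
# vs group 2 (n = 10, mean 18.1, sd 4.8)
t_stat, (lo, hi) = pooled_two_sample(23.4, 5.2, 10, 18.1, 4.8, 10)
print(round(t_stat, 2), round(lo, 1), round(hi, 1))   # 2.37 0.6 10.0
```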
Pregnancy and smoking
10.34
Answer:
10.35
Assumptions of two-sample t
procedures
In both the two-sample t-test and the confidence interval for μ1 − μ2,
we needed to make four assumptions, in order to be able to say that
our statistic has a t(n1 + n2 − 2) distribution:
1. the two samples are independent of each other
(ensured by sampling from different populations);
The only time you run into problems is when both n1/n2 and σ1/σ2
are very different from 1 (they differ from 1 by at least a factor of two,
say).
10.37
Two possible solutions to this problem:
Transform data so that the standard deviations are similar. This
simple approach often solves the problem (and it often removes
skew at the same time).
Use a test statistic that does not assume equal standard deviations.
We will not use this technique in MATH1041, but see page 447
454 of Moore et al. for details.
10.38
Pregnancy and smoking
Answer:
10.39
[Figure: normal quantile plot of the guinea pig control group; sample
quantiles range from about 10 to 50.]
[Figure: normal quantile plot of the guinea pig treatment group; sample
quantiles from about 20 to 90, against normal quantiles from −1.5 to 1.5.]
The paired t-test
Do ravens fly to gunshots?
(Ecology 2005, 86:1057-1060)
location 1 2 3 4 5 6 7 8 9 10 11 12
before 1 0 1 0 0 0 0 5 1 1 2 0
after 2 3 2 2 1 1 2 2 4 2 0 3
10.42
For this example, it is not appropriate to conduct a two-sample t-test,
because our two sets of samples are not independent.
10.43
The Paired t Test
10.44
Assuming that:
we have a set of n independent pairs of observations of a variable;
and
In practice, this formula works well even when our differences are not
normal, as long as X is approximately normal.
10.45
As always, the alternative hypothesis Ha is important for determining
the P -value. Ha tells us what sorts of departures from H0 we are
interested in measuring.
10.46
Do ravens fly to gunshots?
location 1 2 3 4 5 6 7 8 9 10 11 12
before 1 0 1 0 0 0 0 5 1 1 2 0
after 2 3 2 2 1 1 2 2 4 2 0 3
difference
Answer:
Assuming that:
we have a set of n independent pairs of observations of a variable;
In practice, this formula works well even when our paired differences
are not normal, as long as X is approximately normal.
10.48
Do ravens fly to gunshots?
location 1 2 3 4 5 6 7 8 9 10 11 12
before 1 0 1 0 0 0 0 5 1 1 2 0
after 2 3 2 2 1 1 2 2 4 2 0 3
difference 1 3 1 2 1 1 2 -3 3 1 -2 3
Answer:
10.49
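The paired analysis of the raven counts can be sketched in Python; scipy's ttest_rel is the paired t-test (in the course you would use R):

```python
from scipy.stats import ttest_rel

before = [1, 0, 1, 0, 0, 0, 0, 5, 1, 1, 2, 0]
after  = [2, 3, 2, 2, 1, 1, 2, 2, 4, 2, 0, 3]

# Paired t-test on the 12 before/after differences
t_stat, p_two = ttest_rel(after, before)
p_one = p_two / 2            # one-sided Ha: more ravens after the gunshot
print(round(t_stat, 2), round(p_one, 3))   # 2.0 0.036
```

A one-sided P-value just under 0.05 gives some evidence that ravens do fly towards gunshots.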
Speed cameras
Below is the number of people exceeding the speed limit (in a month)
before and after speed cameras were installed at four locations (data
from Sydney Morning Herald, 22/9/03).
10.50
Answer:
10.51
Paired t-test vs two sample t-test
Why collect paired data?
Using a matched pairs design can control for other factors that may
be important.
e.g. raven data: the response variable was number of ravens, which
can vary greatly with location. Which of the following approaches do
you think would be more efficient?
Taking before-after measurements at 12 locations (as in the ex-
ample given previously); and
10.52
OK, well why not analyse all two-sample datasets using a paired t-test?
If we have two balanced samples (n1 = n2), then why not always use
a paired t-test, to avoid confusion about which test to use?
10.53
Data analysis for one or two variables
Useful graphs:
  one categorical variable: bar chart
  one quantitative variable: histogram or boxplot
  two categorical variables: clustered bar chart
  one categorical and one quantitative variable: comparative boxplots
  two quantitative variables: scatterplot
Useful numbers:
  one categorical variable: table of frequencies
  one quantitative variable: mean and sd, or 5-number summary
  two categorical variables: 2-way table of frequencies
  one categorical and one quantitative variable: 5-number summary for each group
  two quantitative variables: correlation or regression
(Paired data? Analyse the differences as a single quantitative variable.)
This lecture
10.54
10.55
Relationships between
categorical variables
Based on Moore et al.: Chapter 9
11.1
Lectures 1-2: Data analysis for
two-way tables
Summarising the association between two categorical variables
Two-way tables
Simpson's paradox
Inference for two-way tables
Introduction
11.2
Data analysis for one or two variables
Useful graphs:
  one categorical variable: bar chart
  one quantitative variable: histogram or boxplot
  two categorical variables: clustered bar chart
  one categorical and one quantitative variable: comparative boxplots
  two quantitative variables: scatterplot
Useful numbers:
  one categorical variable: table of frequencies
  one quantitative variable: mean and sd, or 5-number summary
  two categorical variables: 2-way table of frequencies
  one categorical and one quantitative variable: 5-number summary for each group
  two quantitative variables: correlation or regression
(Paired data? Analyse the differences as a single quantitative variable.)
This week
11.3
Introduction
1. Gender and study area
11.4
2. Lateness of QANTAS planes
Is there evidence that one airline runs late more often than the other?
11.5
3. Cancers and where they occur
(From Roberts et al., Pathology (1981) 13:763-70.)
Cancer site
Cancer Type Head/neck Trunk Extremities
Superficial spreading 16 54 115
Nodular 19 33 73
Indeterminate 11 17 28
11.6
In all of the above examples, there were two categorical variables:
1. Gender and Study area
2. Airline and Lateness
11.7
In particular, we are interested in assessing whether or not there is an
association between the two categorical variables.
1. Is there an association between gender and study area?
11.8
Two-way tables
To numerically summarise the association between two categorical vari-
ables, we use a two-way table.
Make one variable the row variable (by listing its categories in
different rows).
Make the other variable the column variable (by listing categories
in different columns).
11.9
Example: area of study and gender
Female Male
Life Sci 38 31
Medicine 11 13
Science 51 40
Other 9 14
11.10
Example: On-time planes and airlines
11.11
Answer:
11.12
A two-way table can be thought of as a summary of the joint dis-
tribution of the two categorical variables (because we look at the
frequencies in the two variables jointly).
11.13
(One-way) table of frequencies for study area:
Life Sci 69, Medicine 24, Science 91, Other 23

(One-way) table of frequencies for gender:
Female 109, Male 98
11.14
These are sometimes called the marginal distributions of gender and
study area because they can be written into the margins of the two-way
table:
Note that each marginal total is just the sum of the corresponding
row/column of the two-way table.
11.15
Example: On-time planes and airlines
Estimate the marginal distributions of the Min. late and airline vari-
ables, and add them to your two-way table.
11.16
Conditional distributions
11.17
Example: Gender and study area
11.18
Conditional distribution of gender for Life Science students:
Female: 38/69 (55.1%)   Male: 31/69 (44.9%)
Conditional distribution of gender for Science students:
Female: 56.0%   Male: 44.0%
11.19
Visualising Two-Way Tables
It is not easy (but possible) to assess association by staring at the
numbers in a two-way table.
A better option is often to visualise the data using one of the following:
Multiple bar charts
11.20
Example: Gender and study area
11.21
Multiple bar charts of study area for each gender:
[Figure: two bar charts of counts by study area (Aviation, Life Sci,
Other, Science), one panel for females and one for males.]
11.22
Multiple bar charts of gender for each study area:
[Figure: four bar charts of counts by gender (female, male), one panel
for each of Life Sci, Science, Aviation and Other.]
11.23
A clustered bar chart of gender clustered by study area:
[Figure: counts by study area, with female and male bars side by side.]
11.24
A bar chart of conditional probabilities of being female:
[Figure: proportion of females (0 to 1) within each study area.]
11.25
Simpson's Paradox
We discussed lurking variables in Week 3 for quantitative variables,
and how relationships are distorted when lurking variables are ignored.
11.26
On-time airlines
On-time Late
Alaska Airlines 2062 317
America West 5041 476
11.27
The conditional probabilities of each airline being on-time are
11.28
But what happens when we consider the effects of the lurking variable:
Where the plane departed from
11.29
Consider on-time performance of planes from each airline, but calcu-
lated separately according to whether they departed from Seattle or
Phoenix:
Seattle Phoenix
On-time Late On-time Late
Alaska Airlines 1841 305 221 12
America West 201 61 4840 415
11.30
On-time performance, conditional on airline and where the plane de-
parted from, can be summarised as follows:
Seattle Phoenix
Alaska Airlines 85.8% on-time 94.8% on-time
America West 76.7% on-time 92.1% on-time
At both airports, Alaska Airlines planes were more likely to run on
time than America West!
11.31
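The reversal can be verified directly from the counts in the tables above (a Python sketch):

```python
# (on-time, late) counts by airline and departure city
counts = {
    ("Alaska Airlines", "Seattle"): (1841, 305),
    ("Alaska Airlines", "Phoenix"): (221, 12),
    ("America West", "Seattle"): (201, 61),
    ("America West", "Phoenix"): (4840, 415),
}

def rate(on, late):
    return on / (on + late)

for airline in ("Alaska Airlines", "America West"):
    on = sum(counts[(airline, city)][0] for city in ("Seattle", "Phoenix"))
    late = sum(counts[(airline, city)][1] for city in ("Seattle", "Phoenix"))
    print(airline, "overall:", round(100 * rate(on, late), 1))
    # Overall: Alaska Airlines 86.7% vs America West 91.4%...

for city in ("Seattle", "Phoenix"):
    for airline in ("Alaska Airlines", "America West"):
        print(city, airline, round(100 * rate(*counts[(airline, city)]), 1))
    # ...yet at each airport Alaska Airlines is ahead (85.8 vs 76.7; 94.8 vs 92.1)
```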
In summary:
11.32
An association or comparison that holds for all of several groups can
reverse direction when the data are combined to form a single group.
11.33
Why did this happen?
The reason why this happened in the airline example was that where a
plane departs from is an important lurking variable that is associated
with both airline and on-time performance:
Alaska Airlines sends most of its planes to airports like Seattle,
whereas America West sends most planes through Phoenix.
Planes are more likely to be late when they depart an airport like
Seattle (which doesn't have great weather) than when departing
an airport like Phoenix.
11.34
Useful diagram:
11.35
What does this mean for me?
If there are, you should include these in your analyses, to avoid
Simpson's paradox.
11.36
Inference for two-way tables:
Introduction
Gender and study area
In a survey of MATH1041 students, the gender of students and their
study area was recorded:
11.37
For this problem we need to test the hypothesis:
11.38
χ² tests for categorical data
To test for an association between two categorical variables:
Display observed counts in a two-way table.
Compute the expected counts under H0 (no association).
Compute the X² test statistic and its P-value.
11.39
This chi-square test is appropriate when:
The n observations are independent.
The sample size is large enough that all expected counts > 10.
11.40
Expected counts under H0
11.41
Example Gender and study area
Find the expected counts, under the hypothesis that study area is
independent of gender.
11.42
Answer:
11.43
Where does this formula come from?

P(Female) = 109/207

and if H0 is true then gender and study area are independent, i.e.
P(Female and Life Sci) = P(Female) × P(Life Sci).
So under H0, we expect that the number of students who are female
and in life sci is about

# Life Sci × P(Female) = 69 × 109/207 = (69 × 109)/207

which is just the formula

expected count = (row total × column total) / n.
11.44
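The expected counts and the full test can be computed from the survey table (a Python sketch using scipy; in the course you would use R):

```python
from scipy.stats import chi2_contingency

# Rows: Life Sci, Medicine, Science, Other; columns: Female, Male
observed = [[38, 31],
            [11, 13],
            [51, 40],
            [ 9, 14]]

chi2, p_value, df, expected = chi2_contingency(observed)
print(round(expected[0][0], 2))   # 36.33 = 69 * 109 / 207 (Female, Life Sci)
print(df)                          # 3 = (4 - 1)(2 - 1)
print(round(chi2, 2), round(p_value, 2))   # 2.72 0.44: no evidence of association
```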
Computing the chi-square test statistic
This statistic will be large if the observed counts are far from expected
counts (that is, the larger X 2 is, the more evidence there is against
H0).
11.45
Example Gender and study area
In a survey of MATH1041 students in week 1, the gender of students
and their study area was recorded.
11.46
Answer:
11.47
Finding the P -value for a chi-square test
Under H0, if we have a SRS for which all expected counts are larger
than 10, then X² has a chi-square distribution with degrees of freedom
(r − 1)(c − 1), where there are r rows and c columns in the two-way
table.
The P-value is

P(χ² > X²)

where χ² has a chi-square distribution with (r − 1)(c − 1) degrees of
freedom.
We write this as χ²(df) or χ²df, where df = (r − 1)(c − 1).
11.48
The chi-what distribution?
11.49
Some example chi-square density curves:
[Figure: density curves of the χ²₂, χ²₄ and χ²₈ distributions, each
plotted on 0 to 20.]
Example Gender and study area
In a survey of MATH1041 students in Week 1, the gender of students
and their study area was recorded.
11.51
11.52
Inference for two-way tables
(Continued)
Based on Moore et al.: Introduction and Section 9.2
11.53
Lectures 3-4: Inference for two-way
tables
Today we will continue to discuss the key methods for making infer-
ences about the association between two categorical variables.
χ² tests for categorical data: more exercises
11.54
Recall the strategy for chi-square tests for categorical data:
11.55
Cancers and where they occur
Cancer site
Cancer Type Head/neck Trunk Extremities
Superficial spreading 16 54 115
Nodular 19 33 73
Indeterminate 11 17 28
Answer:
11.56
Lateness of QANTAS planes
Is there evidence that one airline runs late more often than the other?
11.57
Answer:
First note that the expected counts in the "> 15 min late" category
will be too small (they must be at least 10), so we will combine these
with the "3-15 min late" category:
11.58
Comparing two proportions
When comparing two sample proportions, the data could be written as
a two-way table with two rows and two columns.
Unemployment rate
In April 2009, a Roy Morgan poll of 4,315 Australians found that 7.1%
were unemployed. A similar poll of 4,914 Australians more recently
found that 10.4% were unemployed.
11.59
First, consider the unemployment poll results expressed as the proportion of unemployed people:
Poll date       n      X    p̂ = X/n
April 2009    4,315   304     0.071
Recent poll   4,914   511     0.104
Answer:
11.60
When comparing proportions from two populations, there are two ways
we could do it:
11.61
Is there evidence that the unemployment rate changed?
Use a chi-square test to answer this question.
Answer:
11.62
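The same X² recipe applies here, with the polls written as a 2×2 table. A Python sketch of the calculation (the employed counts are obtained by subtracting the unemployed counts from the poll sizes given above):

```python
# The two polls as a 2x2 table: rows = poll, columns = (unemployed, employed).
observed = [[304, 4315 - 304],
            [511, 4914 - 511]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# X^2 using expected counts r * c / n in each cell.
x2 = sum((o - r * c / n) ** 2 / (r * c / n)
         for r, obs_row in zip(row_totals, observed)
         for c, o in zip(col_totals, obs_row))
print(round(x2, 1))  # 32.1, on (2-1)(2-1) = 1 degree of freedom
```

A value of about 32.1 on 1 df is far beyond the 5% critical value of 3.84, so this is strong evidence that the unemployment rate changed.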
11.63
11.64
Inference for regression
Based on Moore et al.: Chapter 10
12.1
Inference for Regression
Today we will discuss inference for linear regression, a key statistical
tool for making inferences about whether two quantitative variables
are related.
Introduction
12.2
Data analysis for one or two variables

One variable:
  categorical:       useful graph: bar chart;
                     useful numbers: table of frequencies.
  quantitative:      useful graph: histogram or boxplot;
                     useful numbers: mean and sd, or 5-number summary.

Two variables (paired data? analyse differences):
  both categorical:  useful graph: clustered bar chart;
                     useful numbers: 2-way table of frequencies.
  one categorical, one quantitative:
                     useful graph: comparative boxplots;
                     useful numbers: 5-number summary for each group.
  both quantitative: useful graph: scatterplot;
                     useful numbers: correlation or regression.
12.5
Drinking and blood alcohol levels
12.6
Linear regression revision
12.7
To use linear regression, we need the data to be linearly related, i.e.
something like one of the following:
12.8
r² revision
r² measures the strength of the linear regression, and takes values

    0 ≤ r² ≤ 1.

It is computed by

    r² = (variance of ŷ values) / (variance of y values).
12.9
In Week 2 we fitted a linear regression to our data:
y = b0 + b1x + error
where b0 is the y-intercept, b1 is the slope of the line, and we assume
the error is random scatter around the line.
12.10
Least squares regression
A good way to estimate the line that best fits the data (for predicting
Y from X) is to use least squares regression.
12.11
To fit a least squares regression line, calculate the intercept and slope
of the line using:

    b1 = r (sy / sx)   and   b0 = ȳ − b1 x̄.

The line constructed in this way will minimise the sum of the squared
errors as on the previous slide.
12.12
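These formulas can be checked numerically. A minimal Python sketch on a small made-up dataset (the numbers are illustrative only, not course data):

```python
import math

# Least squares by hand: b1 = r * sy/sx, b0 = ybar - b1 * xbar.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

b1 = r * sy / sx          # slope
b0 = ybar - b1 * xbar     # intercept
print(round(b1, 3), round(b0, 3))  # 1.96 0.14

# Check the r^2 formula: variance of fitted values over variance of y.
yhat = [b0 + b1 * xi for xi in x]
var = lambda v: sum((vi - sum(v) / n) ** 2 for vi in v) / (n - 1)
print(round(var(yhat) / var(y), 4), round(r ** 2, 4))  # 0.9976 0.9976
```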
Why inference for regression?
Recall that often we only have a sample, and we want to make inferences about the population.
For linear regression, we want to make inferences about the true
regression line:

    y = β0 + β1x

based on an estimate of this line from a sample:

    ŷ = b0 + b1x.
12.13
We need to take into account the sampling error in estimating the
regression line: how much error is there in our estimates of b0 and b1,
due to us only having a sample of data, rather than the whole
population?
12.14
The fit for a sample of 20 streams:
[Figure: scatterplot of Water Quality (IBI) against water catchment
area (square km), with fitted line ŷ = 49.79 + 0.46x.]
12.15
The fit for a different sample of 20 streams:
[Figure: scatterplot of Water Quality (IBI) against water catchment
area (square km), with fitted line ŷ = 55.35 + 0.45x.]
12.16
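The difference between the two fitted lines is sampling error. A small simulation illustrates this: repeatedly generate samples of 20 "streams" from an assumed true line (the values of β0, β1 and σ below are my own choices, loosely matching the stream plots), refit by least squares, and watch b1 vary from sample to sample:

```python
import random

random.seed(1)

beta0, beta1, sigma = 50.0, 0.45, 12.0   # assumed "true" line and error sd
slopes = []
for _ in range(1000):
    x = [random.uniform(0, 60) for _ in range(20)]   # 20 streams per sample
    y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
    xbar, ybar = sum(x) / 20, sum(y) / 20
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    slopes.append(b1)

# The estimated slopes scatter around the true slope 0.45.
print(min(slopes), max(slopes))
```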
Consider the following questions:
12.17
In each case we want to say something general about the relationship
between water catchment area and water quality across all streams,

    y = β0 + β1x

based on just our sample of 20 streams:

    ŷ = b0 + b1x.
12.18
Inferences about the slope β1
In linear regression, the slope β1 is usually of primary interest: β1 tells
us how Y and X are related.
12.19
Useful graphs:
12.20
Now we'll meet the key result for making inferences about β1.
Assume that:

    yi = β0 + β1xi + εi

where each εi is independently sampled from a normal distribution with
mean 0 and standard deviation σ.
12.21
If the linear regression model is appropriate,

    (b1 − β1) / SE(b1) ~ t(n − 2)

where SE(b1) is the estimated standard error (SE) for b1.
We will not discuss how to calculate SE(b1) by hand, but you will need
to know where to find it in computer output. . .
12.22
12.23
Testing for a relationship between Y and X
12.24
As always, the alternative hypothesis Ha is important for determining
the P -value. Ha tells us what sorts of departures from H0 we are
interested in measuring.
12.25
Note that the R linear regression output automatically calculates the
t-statistic and two-sided P-value for a test of this hypothesis.
Can you find the test statistic in the water quality output? (slides
12.37 and 12.38)
12.26
Answer:
12.27
Exercise: Supermarket display space
12.28
Answer:
12.29
Constructing a confidence interval for β1
12.30
Exercise: Supermarket display space
Consider again the experiment that was conducted to explore the relation between the amount of display space allotted to a brand of coffee and its weekly sales.
Answer:
12.31
Assumptions
Recall the linear regression model, which we assume is true when
making inferences about β1:

    yi = β0 + β1xi + εi

where each εi is independently sampled from a normal distribution with
mean 0 and standard deviation σ.
12.32
This model can be broken down into four key assumptions:
The mean of Y (μy) has a linear relationship with X.
The Yi are independent.
The errors from the line (εi) are normally distributed.
The errors from the line (εi) have the same variance at each x value.
How important are these assumptions?
What do we need to check for each assumption?
12.33
The importance of assumptions
Errors are normally distributed: it turns out that the CLT guarantees
that b1 is approximately normal even if errors are not normal, so this
is not an important assumption for large n. Use the checks outlined
on slide 7.37 to see if any departure from normality is potentially
important.
12.34
The Yi are independent: The subjects must be unrelated. How
could you guarantee that this is satisfied?
Errors from the line have the same spread for all x: The stan-
dard errors of parameter estimates can be biased if this assumption
is not satisfied.
We need to check the first, second and last assumptions for our data
(although the second is not important for large sample sizes).
12.35
Residual vs fits plot
You met residual vs fits plots in Week 2: they are useful for detecting
whether linear regression is reasonable for your data.
12.36
e.g. The water quality data:
[Figure: residuals plotted against fitted values for the water quality
regression.]
12.37
If there is a distinct pattern, we can't use standard methods to make
inferences about the regression line.
e.g. if there is a U-shaped pattern, the relationship is non-linear (maybe
quadratic):
[Figure: Y plotted against X with a curved trend, and the corresponding
U-shaped plot of residuals against fitted values.]
12.38
e.g. if there is a fan shape, then the spread increases with x:
[Figure: Y plotted against X with spread increasing in x, and the
corresponding fan-shaped plot of residuals against fitted values.]
and so we should not be assuming that the errors from the line have
the same spread for all x, and we need to use a different method of
inference (not covered in MATH1041).
12.39
Normal quantile plot of residuals
Because we are assuming that errors from the regression line are normally distributed, we can check this using a normal quantile plot.
12.40
Does this plot suggest that any linear regression model assumptions
are reasonable for the water quality data?
Normal Quantile plot for water quality data
[Figure: regression residuals plotted against theoretical quantiles.]
12.41
Sales/display space example
Is it reasonable to fit a linear regression? List the assumptions and
check them using the following plots.
[Figure: residuals plotted against fitted values, and regression residuals
plotted against theoretical quantiles, for the sales/display space data.]
12.42
Answer:
12.43
A bit more about regression inference
Standard error:
A formula for SE(b1) is:

    SE(b1) = s / (sx √(n − 1))

where s and sx are the standard deviations of the errors (ε) and X-
values, respectively.
As it turns out, the Central Limit Theorem doesn't just work for
averages and sums, but for weighted sums too:

    b1 = Σᵢ (xi − x̄)(yi − ȳ) / Σᵢ (xi − x̄)²

which is a weighted sum of the residuals (yi − ȳ), with weights
(xi − x̄) / Σᵢ (xi − x̄)².
12.45
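The equivalence of this SE(b1) formula with the direct definition SE(b1) = s/√(Σ(xi − x̄)²) can be checked numerically. A Python sketch on a made-up dataset (s is estimated from the residuals with n − 2 degrees of freedom):

```python
import math

# Check SE(b1) = s / (sx * sqrt(n - 1)) against SE(b1) = s / sqrt(Sxx).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# s estimates sigma, the sd of the errors, using n - 2 degrees of freedom.
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))

sx = math.sqrt(sxx / (n - 1))
se_b1 = s / (sx * math.sqrt(n - 1))
print(round(se_b1, 4))                          # 0.0554
print(math.isclose(se_b1, s / math.sqrt(sxx)))  # True: same formula
```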
Exercise Drinking and blood alcohol levels
A study was conducted on 22 staff/students at a lunch function at the
University of Western Sydney, to determine how many drinks you can
have in an hour and still have a blood alcohol content (BAC) under
the legal limit, 0.05.
12.47
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.004375 0.014402 -0.304 0.76441
Wine 0.010889 0.002988 3.644 0.00161
12.48
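From this output, the t value for the slope is just Estimate / Std. Error, and a 95% confidence interval for the slope is Estimate ± t* × Std. Error, where t* is the 97.5% point of t(n − 2) = t(20). The value t* ≈ 2.086 below is taken from standard t tables (check it against your own table):

```python
estimate, se = 0.010889, 0.002988   # Wine (slope) row of the R output
t_star = 2.086                      # 95% critical value for t(20), from tables

t_value = estimate / se
ci = (estimate - t_star * se, estimate + t_star * se)
print(round(t_value, 3))            # 3.644, matching the R output
print(round(ci[0], 4), round(ci[1], 4))
```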
Normal QQ plot for BAC data
[Figure: residuals plotted against fitted values, and regression residuals
plotted against theoretical quantiles, for the BAC data.]
12.49
Data analysis for one or two variables

One variable:
  categorical:       useful graph: bar chart;
                     useful numbers: table of frequencies.
  quantitative:      useful graph: histogram or boxplot;
                     useful numbers: mean and sd, or 5-number summary.

Two variables (paired data? analyse differences):
  both categorical:  useful graph: clustered bar chart;
                     useful numbers: 2-way table of frequencies.
  one categorical, one quantitative:
                     useful graph: comparative boxplots;
                     useful numbers: 5-number summary for each group.
  both quantitative: useful graph: scatterplot;
                     useful numbers: correlation or regression.
12.51
Which method do you use when?
12.52
See what the research question tells you about the data and the
analysis goal; then you will know how to analyse the data.
12.53
Example
Recall the following exercise:
What does the research question tell us about the data and the analysis
goal?
12.54
Answer:
Quantitative or categorical:
The analysis goal:
Descriptive study or making inferences?
12.55
The following slides summarise the methods of data analysis discussed
in this course, and where to find the lecture notes on each method.
Note that:
The columns of the graphic are about the data (how many variables,
quantitative or categorical).
The rows are about the analysis goal (numerical or graphical summary,
hypothesis test or confidence interval).
See if you can use these slides to work out which method of analysis
should be used for the hair cut example above.
Try using these slides to work out which analysis method to use for
each data analysis question in last year's exam!
12.56
Data analysis for one or two variables

One variable:
  categorical:       useful graph: bar chart;
                     useful numbers: table of frequencies;
                     useful test: 1-sample test for p (for a binary variable).
  quantitative:      useful graph: histogram or boxplot;
                     useful numbers: mean and sd, or 5-number summary;
                     useful test: 1-sample t-test (or Z-test if σ known).

Two variables (paired data? analyse differences):
  both categorical:  useful graph: clustered bar chart;
                     useful numbers: 2-way table of frequencies;
                     useful test: χ² test for independence.
  one categorical, one quantitative:
                     useful graph: comparative boxplots;
                     useful numbers: 5-number summary for each group;
                     useful test: 2-sample t-test (for binary + quantitative).
  both quantitative: useful graph: scatterplot;
                     useful numbers: correlation or regression;
                     useful test: test for the regression slope (β).
CONGRATULATIONS!
12.59