200052 INTRODUCTION TO ECONOMIC METHODS


LECTURE - WEEK 1
Required Reading:
Ref. File 1: Section 1.13
Ref. File 3: Introduction and Sections 3.1 to 3.4, 3.7
KEYS TO PASSING THIS UNIT:
(i) Undertake the required reading from the reference files each week. (It may be necessary to re-read some sections more than once.) Approximately 4 hours per week.
(ii) Carefully study lecture material and take notice of advice given in lectures.
(iii) Attempt tutorial exercises before tutorials and work out where you have difficulties, which hopefully can be resolved in tutorials.
(iv) Make a conscious effort to keep up with the material presented.
1. INTRODUCTION TO UNIT

1.1 How Can We Define Statistics?


Statistics, for our purposes, encompasses the following major activities:
(i) Collection and description of information, or data - descriptive statistics. We will normally be dealing with a subset of a larger collection or set of data. The subset is called a sample, the larger set a population.
(ii) Using sample data to make inferences about a population - statistical inference.
1.2 Why Study Statistics?
(i) (Major) It can be useful. It can help us to make decisions in the face of uncertainty.
(ii) People are bombarded with statistics all the time. Often statistics is used in ways that are not warranted. It is important not to be fooled by people who misuse statistics.
(iii) It is important to have a clear understanding of the strengths and limitations of statistical analysis.
1.3 Structure of the Subject
Descriptive Statistics:
How we summarise the characteristics of raw data (using graphs, summary measures, etc.)
Probability Theory and Probability Distributions (deductive statistics):
Rules (or axioms) for calculating probabilities of certain things (called events) happening.
Probability theory can be considered part of deductive statistics.
Here we will be concerned with making probability statements about a given population.
Sampling Theory and Sampling Distributions (the basis
of inductive statistics):
Here we will be concerned with making probability
statements about characteristics of samples, given
assumptions about the population from which the
sample was drawn.
Point and Interval Estimation:
Point Estimation - Here we will be concerned about
producing a particular estimate (a number), based on
sample data, of a characteristic of a population.
Interval Estimation - Here we will not give an
estimate of a population characteristic, but rather a
range in which we are confident (to some degree) the
true value of the population characteristic is.
Hypothesis Testing:
Under this heading we will be looking at ways of testing hypotheses about characteristics of populations, based on sample data.

Regression Analysis:
In this case we will be concerned with estimating
linear relationships between different variables, i.e.
linear equations.
We will go on to examine statistical tests associated
with estimated regression equations.

Introduction to Differential Calculus


2. DESCRIPTIVE STATISTICS

2.1 Some Basic Definitions Relating to Data
(i) Elementary Units and Frames:
Statistical data normally represents measurements or observations of a certain characteristic or variable of interest for each member of a set of objects or people.
Each object (or person) for which the characteristic is or can be measured is called an elementary unit.
The set or listing of all possible elementary units is called a frame.

(ii) Population/Sample:
A statistical population is the set of measurements or
observations of a characteristic of interest for all
elementary units in a frame.
A population may comprise a finite or infinite number of
elements (observations), depending on the context.
A statistical sample is a subset of a population.
(iii) Parameters/Statistics:
For our purposes, the numerical characteristics which describe a population are called parameters of the population.
The numerical values calculated from sample data are called sample statistics. These sample statistics can be thought of as describing or characterizing the sample.
(iv) Qualitative and Quantitative Variables:
Populations may be quantitative or qualitative. Data from
quantitative populations is called quantitative or interval
data.
Data from qualitative populations is called
qualitative, nominal or categorical data.
Data from a quantitative population can be expressed
numerically in a meaningful way. The variable (or
characteristic) associated with a quantitative population is
called a quantitative variable.

Data from qualitative populations cannot be expressed numerically in a meaningful way. The variable (or characteristic) associated with a qualitative population is called a qualitative or categorical variable.
Note: Just because we assign a numerical code to a
qualitative variable does not mean the variable is
quantitative.
(v) Discrete and Continuous Quantitative Variables:
A discrete quantitative variable can assume only certain
discrete numerical values (on the number line); i.e. there
are gaps between the various values. Depending on the
variable, there could be a finite or infinite number of these
discrete values.
A continuous quantitative variable can assume any value
in a specific range or interval. The interval can be of finite
or infinite width.
Note: By definition there are an infinite number of values
a continuous variable can take.
2.2 Frequency Distributions
(a) Introduction
Suppose we have a set of raw statistical data. At this stage
we will make no distinction as to whether we are talking
about a statistical population or sample.

In studying the data it is often useful to initially group the raw data into different classes or categories. A frequency distribution for a set of data lists the number of observations or data points in each class used for grouping (the class frequencies). The classes of a frequency distribution must be mutually exclusive (an observation cannot fall into two classes) and exhaustive (any observation must belong to a class).
(b) Frequency Distributions for Quantitative Data
Each class of a frequency distribution of quantitative data usually has a lower and an upper limit, although sometimes it is necessary or convenient to have open-ended classes, i.e. classes which have either an upper or a lower limit but not both.
Example:
Suppose we have data on the number of children in 100 households as follows:

Class                   Frequency
0 to under 2 children   30
2 to under 4 children   55
4 to under 6 children   13
6 or more children      2

The class width is the difference between successive lower class limits or successive upper class limits.
Note: An open-ended class has no class width.

General Advice for Forming Frequency Distributions:
The number of classes should generally be between 5 and 20.
Class widths are ideally equal, but this may not always be possible, and open-ended classes may be necessary.
Class limits should be chosen such that the class midpoint is close to the average of observations in the class. This is because in calculating summary statistics based on grouped data the midpoint is used as representative of all observations in the class.
(c) Relative, Cumulative and Cumulative Relative Frequency Distributions
A relative frequency distribution shows the proportion of all observations falling in each class. It is obtained by dividing the class frequencies ($f_i$) by the total number of observations in the data (n).
A cumulative frequency distribution shows, for each class i, the total of the first i frequencies.
A cumulative relative frequency distribution shows, for each class i, the total of the first i relative frequencies.

For the previous example we have:

Class (i)      Frequency (fi)  Cumulative Freq.  Relative Freq.  Cumulative Rel. Freq.
0 to under 2   30              30                0.30            0.30
2 to under 4   55              85                0.55            0.85
4 to under 6   13              98                0.13            0.98
6+ children    2               100               0.02            1.00
Total          100                               1.00
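The calculations in such a table are easy to automate. The following short Python sketch (not part of the original notes) reproduces the four columns from the raw class frequencies:

```python
# Relative, cumulative and cumulative relative frequencies
# for the household example above.
class_labels = ["0 to under 2", "2 to under 4", "4 to under 6", "6+ children"]
frequencies = [30, 55, 13, 2]

n = sum(frequencies)  # total number of observations (100)
cumulative = 0
for label, f in zip(class_labels, frequencies):
    cumulative += f  # running total of the first i frequencies
    print(f"{label:13s} f={f:3d} cum={cumulative:3d} "
          f"rel={f / n:.2f} cum_rel={cumulative / n:.2f}")
```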

An ogive is a graph of the cumulative relative frequency distribution.
2.3 Histograms
Histograms give us a convenient way of visualising the
distribution of observations over classes. They take the
form of a series of adjacent (contiguous) rectangles, one
for each class, with the base of each rectangle centred over
the corresponding class midpoint.
In a frequency histogram the areas of the rectangles are
proportional to the class frequencies, with the factor of
proportionality the same for all classes. Thus if all the
classes have the same width, each rectangle will have the
same base width and the class frequencies can be
represented by the rectangle heights.
In a relative frequency histogram the areas of the
rectangles are proportional to the relative frequencies.
Similarly cumulative and cumulative relative frequency
histograms can be defined.

Note: Frequency and relative frequency histograms will have the same shape.
Example:
Consider the following distribution:

Class             Frequ.  Rel. Freq.  Cum. Freq.
0.5 to under 2.5  10      0.1         10
2.5 to under 4.5  30      0.3         40
4.5 to under 6.5  50      0.5         90
6.5 to under 8.5  10      0.1         100

[Figures: frequency histogram (rectangle heights 10, 30, 50, 10), relative frequency histogram (heights 0.1, 0.3, 0.5, 0.1) and cumulative frequency histogram (heights 10, 40, 90, 100), each drawn over the class boundaries 0.5, 2.5, 4.5, 6.5, 8.5]

2.4 Shapes of Distributions


The frequency or relative frequency histogram gives us a
representation of the shape of the distribution of the data
being analysed.
There are several terms commonly used to describe the
shapes of distributions.
A distribution is described as negatively skewed (skewed
to the left) if it has the following shape
[Figure: a distribution that is skewed to the left; horizontal axis: variable value, vertical axis: relative frequency]

A distribution is positively skewed (skewed to the right) if
it has the following shape.
[Figure: a distribution that is skewed to the right; horizontal axis: variable value, vertical axis: relative frequency]

A distribution is symmetric if it has the following shape.


[Figure: a symmetric distribution; horizontal axis: variable value, vertical axis: relative frequency]

The above are all examples of unimodal distributions. A bimodal distribution has two peaks.

2.5 Bivariate Frequency Distributions
Often it is of interest to classify observations of elementary
units according to two variables (characteristics). This
allows one to gauge the relationship between the two
variables.
Example:
Consider the final results of 50 students in a particular subject. Each student's final grade and gender are recorded, allowing the derivation of the following bivariate frequency distribution.

                         Grade
Gender    HD   Dist.  Credit  Pass  Fail  Row Total
Male      5    4      10      6     2     27
Female    2    3      11      2     5     23
Column
Total     7    7      21      8     7     50

Each combination of grade and gender is represented by a cell in the bivariate frequency distribution, which contains the frequency of that combination in the data.
The row totals represent, in this example, the marginal frequencies of males and females in the class (27 and 23, respectively).
The column totals represent the marginal frequencies of final grades.

Marginal frequencies, represented by the row and column
totals, each refer to one variable only.
We can express the information in a bivariate frequency
distribution as a relative frequency distribution by
dividing each entry in the distribution by the total number
of observations.
Example:
For the previous example, the bivariate relative frequency
distribution is given by

                         Grade
Gender    HD     Dist.  Credit  Pass   Fail   Row Total
Male      0.10   0.08   0.20    0.12   0.04   0.54
Female    0.04   0.06   0.22    0.04   0.10   0.46
Col.
Total     0.14   0.14   0.42    0.16   0.14   1.00

The row and column totals in the above table are called
the marginal relative frequencies.

3. MEASURES OF CENTRAL TENDENCY AND DISPERSION

In this section we shall look at important ways of summarising data from both populations and samples.
We shall be concerned with measures of the
centre of a frequency distribution
dispersion of values in a frequency distribution
3.1 Summation Notation
Suppose we have n numbers. By labelling the numbers (1, 2, 3, ..., n), we can represent the numbers by $x_i$, $i = 1, ..., n$.
The sum of the numbers can be denoted
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$
$\sum_{i=1}^{n} x_i$ is a shorthand way of writing the sum.

Theorem (Basic Properties of Summation Notation)
Given c is some constant and $a_1, a_2, ..., a_n$ are n numbers:
(i) $\sum_{i=1}^{n} c a_i = c \sum_{i=1}^{n} a_i$
(ii) $\sum_{i=1}^{n} (a_i + c) = \sum_{i=1}^{n} a_i + nc$
(iii) $\sum_{i=1}^{n} (a_i + c)^2 = \sum_{i=1}^{n} a_i^2 + 2c \sum_{i=1}^{n} a_i + nc^2$
(iv) $\sum_{i=1}^{n} (a_i - c)^2 = \sum_{i=1}^{n} a_i^2 - 2c \sum_{i=1}^{n} a_i + nc^2$

Example:
Consider the following four labelled numbers.
$a_1 = 1$, $a_2 = 3$, $a_3 = 2$, $a_4 = \ldots$
Use property (iii) of the above theorem to calculate $\sum_{i=1}^{4} (a_i + 1)^2$.
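A quick numeric check of property (iii) in Python; since the value of $a_4$ was lost from the notes, $a_4 = 4$ below is an assumed illustration value, not from the source:

```python
# Verify property (iii): sum of (a_i + c)^2 equals
# sum of a_i^2 + 2c * (sum of a_i) + n * c^2.
a = [1, 3, 2, 4]  # a4 = 4 is a hypothetical value (lost in the notes)
c = 1

lhs = sum((ai + c) ** 2 for ai in a)
rhs = sum(ai ** 2 for ai in a) + 2 * c * sum(a) + len(a) * c ** 2
print(lhs, rhs)  # 54 54 -- the two sides agree
```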

3.2 Measures of Central Tendency
For each measure considered there are population and
sample versions. We will suppose here there are N values
in the population and n values in a sample.
Note that at this stage we are only concerned with
quantitative variables, and we assume the population
contains a finite number of values.
Definition (Mean of a Finite Quantitative Population)
If $x_1, x_2, x_3, ..., x_N$ represents a finite population of N quantitative data points, then the mean of this population is given by
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
($\mu$ is the Greek letter mu)

Definition (Mean of a Sample from a Quantitative Population)
If $x_1, x_2, x_3, ..., x_n$ represents a particular sample of size n from a quantitative population, then the mean of this sample is given by
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Definition (Mode of a Set of Data)
The mode is the data value that occurs most frequently in
a set of data (population or sample).
Definition (The Median of a Set of Data)
If quantitative data is arranged in ascending or
descending order, the middle value of data is called the
median. If there is an even number of data points, the
median is typically taken to be the arithmetic average of
the two middle values.
Example:
Consider the following set of data, which we can assume to be a sample from a population.

1    1    5    4    12   4
3    1    2    7    6    6
5    1    1    5    8    9
10   2    4    2    6    30

n = 24, $x_1 = 1$, $x_3 = 5$, $x_{11} = 6$, etc. (if we label across rows then down)
Comparison of the Mean, Median and Mode
The mean takes account of all observation values; therefore it can be affected by extreme values or outliers, i.e. values which differ greatly from the majority of values.
The median and mode are unaffected by extremely high or
low values.
The mode may not represent a central value in the
distribution, as in the above example, but it may be useful,
for example, for qualitative data.
If the frequency (or relative frequency) distribution is
perfectly symmetric and unimodal, the mean, median and
mode will coincide.
[Figure: a symmetric unimodal distribution; the mean, median and mode coincide at the centre]

If the distribution is skewed to the right (positively skewed) and unimodal, mode < median < mean.
[Figure: a distribution skewed to the right; the mode lies left of the median, which lies left of the mean]

If the distribution is skewed to the left (negatively skewed) and unimodal, mean < median < mode.
[Figure: a distribution skewed to the left; the mean lies left of the median, which lies left of the mode]

MAIN POINTS
A statistical population is a set of measurements or
characteristics of elementary units of interest.
Once a population is defined, a sample is a subset from
the population.
Parameters are numerical characteristics of a population.
Sample statistics are numerical characteristics of a
sample.
A frequency or relative frequency distribution describes
how data is distributed over different classes or categories.
A histogram shows graphically a frequency, relative
frequency or cumulative frequency distribution (the areas
of the contiguous rectangles are proportional to the
frequencies or relative frequencies).
The mean is affected by extreme values; the median and
the mode are not affected by extreme values.
The population mean is denoted $\mu$; the sample mean is denoted $\bar{x}$.
The median divides a set of quantitative data into two equal halves.

200052 INTRODUCTION TO ECONOMIC METHODS
LECTURE - WEEK 2
Required Reading:
Ref. File 1: Section 1.1
Ref. File 3: Sections 3.5(a)-(d), 3.5(f)
Ref. File 4: Introduction and Sections 4.1, 4.2
3. MEASURES OF CENTRAL TENDENCY AND DISPERSION CONTINUED

3.3 Measures of Dispersion


(a) The Range
Definition (Range of a Set of Data)
The range of a set of quantitative data is the difference
between the highest and lowest data values.
(b) The Mean Absolute Deviation
Definition (Deviation from the Mean)
Consider a particular value $x_i$ from a finite data set. The deviation from the mean of this value is defined as
$(x_i - \mu)$ if the population mean is known
$(x_i - \bar{x})$ if only a sample mean is available

Definition (Mean Absolute Deviation)
(i) If $x_1, x_2, x_3, ..., x_N$ represents a finite quantitative population, then the population mean absolute deviation is given by
$$\text{Population MAD} = \frac{\sum_{i=1}^{N} |x_i - \mu|}{N}$$
(ii) If $x_1, x_2, x_3, ..., x_n$ represents a sample from a quantitative population, then the sample mean absolute deviation is given by
$$\text{Sample MAD} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$

(c) The Standard Deviation and Variance


Another more mathematically convenient way of
analysing the deviations from the mean is to square them.
This leads to the definition of the variance.

Definition (Variance of a Finite Quantitative Population)
If $x_1, x_2, x_3, ..., x_N$ represent a finite population of N quantitative data points, then the variance of this population is given by
$$\text{(Finite) Population variance} = \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Definition (Variance of a Sample from a Quantitative Population)
If $x_1, x_2, x_3, ..., x_n$ represent a particular sample of size n from a quantitative population, then the variance of this sample is given by
$$\text{Sample variance} = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Alternatively we can equivalently write:
$$\text{Population variance} = \sigma^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2$$
$$\text{Sample variance} = s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$

The standard deviation is defined as the positive square root of the variance.
Definition (Finite Population and Sample Standard Deviations)
(i) If $x_1, x_2, x_3, ..., x_N$ represent a finite quantitative population, then the population standard deviation is given by
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
(ii) If $x_1, x_2, x_3, ..., x_n$ represent a sample from a quantitative population, then the sample standard deviation is given by
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

An advantage of the standard deviation over the variance is that it is expressed in the original units of measure.
Example:
Calculate $s^2$ and s for the previous 24-number example. (36.3315, 6.0276)
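A minimal sketch of the calculation (not from the notes), using the n − 1 divisor from the definition above:

```python
# Sample variance and standard deviation for the 24-number example.
data = [1, 1, 5, 4, 12, 4, 3, 1, 2, 7, 6, 6,
        5, 1, 1, 5, 8, 9, 10, 2, 4, 2, 6, 30]

n = len(data)
xbar = sum(data) / n                               # 5.625
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)  # sample variance
print(round(s2, 4), round(s2 ** 0.5, 4))           # 36.3315 6.0276
```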

3.4 The Coefficient of Variation

The coefficient of variation is useful for comparing the variability of data sets with means that differ significantly, or data sets based on different units of measure.
Definition (Coefficient of Variation)
(i) For a population with mean $\mu$ and standard deviation $\sigma$:
$$\text{Population coefficient of variation} = \frac{\sigma}{\mu}$$
(ii) For a sample with mean $\bar{x}$ and standard deviation s:
$$\text{Sample coefficient of variation} = \frac{s}{\bar{x}}$$

Example:
Suppose we wish to compare the variability of the weights
of a given sample of people with the variability of their
daily calorie intake. We are told
sample mean of weights = 68kg
sample standard deviation of weights = 5kg
sample mean of daily calorie intake = 1200 calories
sample standard deviation of daily calorie intake = 300
calories
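A one-line comparison in Python (a sketch, not from the notes) makes the point that the coefficient of variation is unit-free:

```python
# Sample coefficients of variation: s / xbar for each data set.
cv_weight = 5 / 68        # ≈ 0.074, i.e. about 7.4% of the mean weight
cv_calories = 300 / 1200  # = 0.25, i.e. 25% of the mean intake
print(cv_weight, cv_calories)  # calorie intake is relatively more variable
```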

3.5 Chebyshev's Theorem and the Empirical Rule

Theorem (Chebyshev's Theorem)
For any quantitative population with a finite variance, the proportion of data points less than c standard deviations from the mean is at least $1 - \frac{1}{c^2}$, where $c > 0$.

For hump-shaped or bell-shaped (unimodal) distributions, Chebyshev's theorem will give a conservative indication of the concentration of population data points around the mean. In such cases we can refer to the empirical rule.

The Empirical Rule
For a bell-shaped distribution of sample or population
data, it will be approximately true that
68% of the data points will lie within 1 standard
deviation of the mean.
95% of the data points will lie within 2 standard
deviations of the mean.
99.7% of the data points will lie within 3 standard
deviations of the mean.

4. INTRODUCTORY PROBABILITY THEORY

4.1 Basic Set Theory

A set is a collection of objects or elements.
Definitions (Sets)
The set of all elements of interest in a particular problem or context is called the universal set, which can be denoted by, say, s. Other basic definitions relating to sets are as follows:
(i) The null set, denoted ∅, contains no elements.
(ii) If an element denoted x is a member of a set a, this is commonly denoted x ∈ a; if x is not a member of set a, this can be denoted x ∉ a.
(iii) The intersection of sets a and b, denoted a ∩ b, is the set of elements in both a and b.
(iv) The union of sets a and b, denoted a ∪ b, is the set of elements in a and/or b.
(v) Set a is said to be a subset of set b, denoted a ⊆ b, if all elements in a are also in b; if a is not a subset of b, this can be denoted a ⊄ b.
(vi) The complement of set a, denoted ā, is the set of elements in s but not in a.
(vii) If a ∩ b = ∅, we say a and b are mutually exclusive or disjoint sets; they have no element in common.

Venn diagrams are often a convenient way of portraying sets and the relationship between them. An example is the following diagram.
[Venn diagram: two disjoint sets a and b inside the universal set s, illustrating a ∩ b = ∅]
Example
Suppose we have the set s = {1,2,3,4,5,6,7,8,9,10}.
Define a = {1,3,5,7,9}, b = {1,3}, c = {4,7}.
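These definitions map directly onto Python's built-in set operations; a quick sketch (not from the notes):

```python
# Set relations from Section 4.1, checked with Python sets.
s = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}  # universal set
a = {1, 3, 5, 7, 9}
b = {1, 3}
c = {4, 7}

print(a & b)           # intersection of a and b: {1, 3}
print(a | c)           # union of a and c: {1, 3, 4, 5, 7, 9}
print(b <= a)          # b is a subset of a: True
print(s - a)           # complement of a (in s): {2, 4, 6, 8, 10}
print(b & c == set())  # b and c are mutually exclusive: True
```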

4.2 Terminology Related to Statistical Experiments
An experiment, in a statistical sense, is an act or process
that leads to an outcome which cannot be predicted with
certainty.
Definition (Simple Events and Events)
A simple event of an experiment is an outcome that cannot
be decomposed into simpler outcomes. An event is a
collection or set of one or more simple events. An event is
said to have occurred if a simple event included in the
event occurs.
Definition (Sample Space of a Statistical Experiment)
The sample space of an experiment, which will be denoted
s, is the set of all possible simple events. It can be
described as the event consisting of all simple events.

Venn diagrams often provide a convenient way of depicting sample spaces and events.

Definition (Discrete Sample Space)
A discrete sample space consists of either a finite number
of simple events or a countable and infinite number of
simple events.
Definition (Continuous Sample Space)
A continuous sample space consists of simple events that
represent all the points in an interval on the real number
line. The interval could be of finite or infinite width.
4.3 Basic Concepts of Probability
(a) Probabilities of Events as Relative Frequencies
Definition (Probability of an Event)
If $f_E$ is the frequency with which event E occurs in n repetitions (trials) of an experiment under identical conditions/rules, P(E) is defined as
$$P(E) = \lim_{n \to \infty} \frac{f_E}{n}$$

(b) Definition of a Probability Distribution
Definition (Probability Distribution)
A probability model or probability distribution for an
experiment takes the form of either a list of probabilities
of simple events or some other representation of the
relative frequency distribution of the underlying
population associated with the experiment.
(c) Axioms of Probability
Suppose an experiment has a sample space s. Any assignment of probabilities to events in s (subsets of s) must satisfy the following axioms:
1. For any event E in s, 0 ≤ P(E) ≤ 1
2. P(s) = 1
3. The probability of an event that is the union of a collection of mutually exclusive events is given by the sum of the probabilities of these mutually exclusive events. (The additive property of probability)

(d) Assigning Probabilities to Simple Events in Discrete Sample Spaces
There are three broad approaches to assigning probabilities to events.
(i) The Underlying Population Relative Frequency Distribution is Known or Assumed
In this case the relative frequencies of the simple events can be considered the probabilities of these simple events.
As a special case, the classical or equally likely
approach to assigning probabilities is applicable in
experiments where it is reasonable to assume that each
simple event is equally likely. In this case, if there are n
simple events, each will occur with probability 1/n.
(ii) The Underlying Population Relative Frequency
Distribution is Not Known or Assumed, but the
Experiment is Repeatable
This approach relies on past observation of outcomes from
an experiment that allows an approximate determination
of relative frequencies of simple events and events.
In terms of this approach, the probability of an event is
approximated by the relative frequency of the event in a
large number of identical trials of the experiment
considered. This is often referred to as the empirical or
relative frequency approach to assigning probabilities.

(iii) The Underlying Population Relative Frequency Distribution is Not Known or Assumed, and the Experiment is Not Repeatable
In many circumstances an experiment may not be repeatable, i.e. it will only happen once. In such circumstances people assign subjective probabilities to the experiment outcomes which reflect their personal beliefs.
For two events A and B defined on a sample space:
P(A ∩ B) = probability of simple events in both A and B.
P(A ∪ B) = probability of simple events in A and/or B.
[Venn diagram: overlapping events A and B within a sample space]
Events A and B are said to be mutually exclusive if A ∩ B = ∅. It follows immediately that, if A and B are mutually exclusive,
P(A ∩ B) = 0
[Venn diagram: disjoint events A and B within the sample space s]

Two Approaches to Determining the Probability of an Event Defined on a Discrete Sample Space:
(i) Add up the probabilities of the simple events included in the event.
(ii) Use various probability rules and laws relating to unions, intersections and complements of events (considered later).

The first approach above can be formalised as performance of the following steps:

(i) Define the experiment.


(ii) List the simple events and assign probabilities to
them in a way consistent with the axioms of
probability.
(iii) Determine the simple events included in the event
of interest.
(iv) Sum the probabilities of the simple events in the
event of interest to find its probability.
Example:
Consider the experiment of tossing a fair die once and let
A be the event of obtaining an odd number of dots on the
upward facing side.

Example:
Suppose s = {1,2,3,4,5,6,7,8}, A = {1,3,5,6}, B = {2,3,4,5,8},
where s is the sample space of a statistical experiment and all the simple events are equally likely.

Example:
Suppose that for s = {1,2,3,4,5,6,7,8}:
P(1) = P(2) = P(4) = P(5) = 0.1
P(3) = P(6) = P(8) = 0.08
P(7) = 0.36
with A = {1,3,5,6}, B = {2,3,4,5,8}

MAIN POINTS
For a finite population,
$$\text{variance} = \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
For a sample,
$$\text{variance} = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

The standard deviation is the square root of the variance: it has the same units of measure as the data.
Chebyshev's theorem applies to all statistical populations. The empirical rule applies only to hump-shaped distributions.
The coefficient of variation measures dispersion relative to the mean. It allows us to compare the dispersions of data sets with different means and units of measure.
In set notation:
∩ means "and"
∪ means "and/or"
Ā means "not A"


In statistical experiments:
Simple events cannot be decomposed into simpler
outcomes.
The sample space is the set of all simple events.
Events are collections or sets of one or more simple events.
An event occurs if any of its included simple events
occur.
All statistical experiments can be thought of as sampling
from a statistical population.
Probabilities must obey certain axioms.

200052 INTRODUCTION TO ECONOMIC METHODS
LECTURE - WEEK 3
Required Reading:
Ref. File 4: Sections 4.3, 4.4, 4.6
4. PROBABILITY THEORY CONTINUED

4.4 Discrete Bivariate Probability Distributions


Definitions (Joint and Marginal Probabilities)
Suppose a statistical experiment for which simple events
take the form of intersections of outcomes with respect to
two or more variables. For such a statistical experiment:
The probabilities of the simple events are referred to
as joint probabilities
The probabilities of events representing outcomes
with respect to one of the variables only are called
marginal probabilities.
A listing or other representation of the joint
probabilities is called a joint probability distribution.

Example:
Suppose we have the following data on all 1950 first year students at a particular university.

Age in          Work Status
Years       Not Working  Part-Time  Full-Time  Row Total
Under 25    1200         200        250        1650
25 - 34     100          75         100        275
35 or over  10           5          10         25
Column
Total       1310         280        360        1950

Consider the experiment of selecting one of the students at random. Define the following events for the experiment:
A: Under 25
B: 25 - 34
C: 35 or over
D: Not working
E: Part-time worker
F: Full-time worker
Calculate the following probabilities:
P(A), P(C), P(D), P(D ∩ A), P(C ∪ F), P(C̄), P(C̄ ∩ E)
(11/13, 1/78, 131/195, 8/13, 5/26, 77/78, 11/78)
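Each of these probabilities is a cell count (or sum of cell counts) over 1950; a short Python sketch (not from the notes):

```python
# Joint frequencies: rows = age events (A, B, C), columns = work events (D, E, F).
table = {"A": {"D": 1200, "E": 200, "F": 250},
         "B": {"D": 100,  "E": 75,  "F": 100},
         "C": {"D": 10,   "E": 5,   "F": 10}}
n = 1950

p_A = sum(table["A"].values()) / n                 # 1650/1950 = 11/13
p_C = sum(table["C"].values()) / n                 # 25/1950 = 1/78
p_D = sum(row["D"] for row in table.values()) / n  # 1310/1950 = 131/195
p_D_and_A = table["A"]["D"] / n                    # 1200/1950 = 8/13
p_F = sum(row["F"] for row in table.values()) / n  # 360/1950
p_C_or_F = p_C + p_F - table["C"]["F"] / n         # 375/1950 = 5/26
print(p_A, p_C, p_D, p_D_and_A, p_C_or_F, 1 - p_C)
```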

Joint Relative Frequency Distribution (Bivariate Distribution)

Age in          Work Status
Years       Not Working  Part-Time  Full-Time  Row Total
Under 25    1200/1950    200/1950   250/1950   1650/1950
25 - 34     100/1950     75/1950    100/1950   275/1950
35 or over  10/1950      5/1950     10/1950    25/1950
Column
Total       1310/1950    280/1950   360/1950   1.00

4.5 Useful Counting Techniques
(a) The Multiplicative Rule
Theorem (Multiplicative Rule of Counting)
Suppose two sets of elements, sets a and b, consist of $n_A$ and $n_B$ distinct elements, respectively; $n_A$ and $n_B$ need not be equal. Then it is possible to form $n_A \times n_B$ distinct pairs of elements consisting of one element from set a and one element from set b, without regard to order within a pair.
Example:
If a take-away food store sells 10 different food items and 5 different types of drink, $5 \times 10 = 50$ distinct food/drink pairs are possible.
The multiplicative rule can be extended naturally. Thus $n_1 n_2 \cdots n_k$ different sets of k elements are possible if one selects an element from each of k groups consisting of $n_1, n_2, ..., n_k$ distinct elements, respectively.
Example:
Suppose we select 5 people at random. What is the probability that they were born on different days of the week, assuming an individual has an equal probability of being born on any of the seven days of the week? (Approx. 0.1499)
A simple event here is an ordered sequence of 5 elements, the first representing the day of the week the first person was born on, the second the day the second person was born on, and so forth.
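Counting ordered sequences directly gives the answer; a sketch (not in the notes):

```python
# 5 people born on 5 different days of the week: count ordered
# day-sequences with no repeats over all possible sequences.
favourable = 7 * 6 * 5 * 4 * 3  # multiplicative rule, no repeated days
total = 7 ** 5                  # each of 5 people has 7 possible days
print(favourable / total)       # ≈ 0.1499
```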

(b) Permutations
Definition (Permutations)
A permutation is an ordered sequence of elements.
Definition (Factorial Notation)
If N is a non-negative integer, we define:
$$N! = N(N-1)(N-2) \cdots (3)(2)(1)$$ (N-factorial)
and $0! = 1$
Theorem (Number of Permutations)
The total number of possible distinct permutations (ordered sequences) of R elements selected (without replacement) from N distinct elements, denoted $^N P_R$, is given by
$$^N P_R = \frac{N!}{(N-R)!}$$
Example:
Consider the numbers 1, 2, 3, 4. How many permutations of these four numbers taken 2 at a time can be found? (12)

(c) Combinations
Definition (Combinations)
A set of R elements selected from a set of N distinct elements without regard to order is called a combination.
Theorem (Number of Combinations)
The total number of possible combinations of R elements selected from a set of N distinct elements is given by
$$^N C_R = \frac{N!}{R!(N-R)!}$$
Example:
In how many ways can a committee of 4 people be chosen from a group of 7 people? (35)
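Python's standard library has both counts built in (math.perm and math.comb, available from Python 3.8); a quick check of the two examples:

```python
import math

print(math.perm(4, 2))  # ordered pairs from 4 numbers: 12
print(math.comb(7, 4))  # committees of 4 from 7 people: 35
```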

(d) Permutations of N Non-Distinct Elements

Theorem (Number of Permutations of N Non-Distinct Elements)
Consider a set of N elements of which $N_1$ are alike, $N_2$ are alike, ....., and $N_r$ are alike, where $N_i \geq 1$ ($i = 1, ..., r$) and $\sum_{i=1}^{r} N_i = N$. Then the number of distinct permutations of these N elements is given by
$$\frac{N!}{N_1! N_2! \cdots N_r!}$$
If the above result is specialized to the case where x is the number of distinct arrangements (or distinct permutations) of N objects where R are alike and $(N - R)$ are alike, then
$$x = \frac{N!}{R!(N-R)!}$$
Example:
Say we have 3 black flags and 2 red flags. How many
distinct ways are there of arranging these flags in a row?
(10)

Example:
Suppose there are 6 applicants for 2 similar jobs. As the personnel manager is too lazy he simply selects 2 of the applicants at random and gives them each a job. What is the probability that he selects 1 of the 2 best applicants and 1 of the 4 worst applicants? (8/15)
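Both counts follow from the formulas above; a sketch in Python (not from the notes):

```python
import math

# 3 black and 2 red flags in a row: N! / (R!(N - R)!)
print(math.factorial(5) // (math.factorial(3) * math.factorial(2)))  # 10

# 1 of the 2 best and 1 of the 4 worst applicants, out of C(6, 2) pairs
print(math.comb(2, 1) * math.comb(4, 1) / math.comb(6, 2))  # 8/15 ≈ 0.5333
```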

4.6 Conditional Probability
Definition (Conditional Probability)
The probability of event A occurring given that event B occurs, or the conditional probability of A given B (has occurred), is denoted P(A | B). Provided P(B) > 0, this conditional probability is defined to be
$$P(A | B) = \frac{P(A \cap B)}{P(B)}$$

Example:
Suppose that a survey of women aged 20-30 years suggests
the following joint probability table relating to marital
status and desire to become pregnant within the next 12
months.

                           Desire
Marital status   Pregnancy   No pregnancy   Total
Married          0.08        0.47           0.55
Unmarried        0.02        0.43           0.45
Total            0.10        0.90           1.00
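For instance, the conditional probability of desiring pregnancy given that the woman is married follows directly from the definition; a sketch (not in the notes):

```python
# P(pregnancy | married) = P(pregnancy and married) / P(married)
p_married = 0.55
p_pregnancy_and_married = 0.08

print(round(p_pregnancy_and_married / p_married, 4))  # ≈ 0.1455
# Compare with the marginal P(pregnancy) = 0.10: being married raises
# the conditional probability, so the two events are dependent.
```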

Theorem (Multiplicative Law of Probability)

Suppose events A and B defined on a sample space. Then
$$P(A \cap B) = P(A)P(B | A) = P(B)P(A | B)$$

Example:
Define events A and B in the following way:
A: A student achieves a mark of over 65% in a first year statistics exam
B: A student goes on to complete her bachelor's degree.
Suppose past experience indicates
P(A) = 0.7, P(B | A) = 0.88
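Applying the multiplicative law, the probability that a student both scores over 65% and completes her degree is then P(A ∩ B) = P(A)P(B | A) = 0.7 × 0.88 = 0.616.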

4.7 Independence of Events

Sometimes, whether an event B has occurred or not will have no effect on the probability of A occurring. In this case we say events A and B are independent.
Definition (Independent and Dependent Events)
Events A and B are said to be statistically independent if
$$P(A \cap B) = P(A)P(B)$$
If $P(A \cap B) \neq P(A)P(B)$, the events are said to be statistically dependent.

Alternative Definition (Independent and Dependent Events)
Events A and B are said to be statistically independent if
$$P(A | B) = P(A) \text{ and } P(B | A) = P(B)$$
Otherwise the events are said to be statistically dependent.

Example:
Consider the single die tossing experiment again and
define the following events:
A: an odd number of dots results
B: a number of dots greater than 2 results
Are A and B independent?
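A quick enumeration check (a sketch, not from the notes) under the equally likely assignment:

```python
# A = odd number of dots, B = more than 2 dots, on a fair six-sided die.
s = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}
B = {3, 4, 5, 6}

def p(event):
    # equally likely simple events
    return len(event) / len(s)

print(p(A & B), p(A) * p(B))  # both 1/3, so A and B are independent
```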

4.8 More Useful Probability Rules
(a) The Additive Law of Probability
Theorem (Additive Law of Probability)
For two events A and B defined on a sample space
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Example:
Again suppose that for s = {1,2,3,4,5,6,7,8}:
P(1) = P(2) = P(4) = P(5) = 0.1
P(3) = P(6) = P(8) = 0.08
P(7) = 0.36
with A = {1,3,5,6}, B = {2,3,4,5,8}
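A sketch applying the additive law to this example (the simple-event probabilities above are as reconstructed from the notes):

```python
# P(A or B) = P(A) + P(B) - P(A and B), summing simple-event probabilities.
p = {1: 0.1, 2: 0.1, 3: 0.08, 4: 0.1, 5: 0.1, 6: 0.08, 7: 0.36, 8: 0.08}
A = {1, 3, 5, 6}
B = {2, 3, 4, 5, 8}

p_A = sum(p[e] for e in A)       # 0.36
p_B = sum(p[e] for e in B)       # 0.46
p_AB = sum(p[e] for e in A & B)  # P({3, 5}) = 0.18
print(p_A + p_B - p_AB)          # P(A ∪ B) = 0.64 = 1 - P(7)
```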

(b) The Complementation Rule
Theorem 4.7 (Complementation Rule)
Suppose an event E and its complement Ē defined on some sample space s. Then
$$P(E) = 1 - P(\bar{E})$$

(c) The Law of Total Probability

Theorem (Law of Total Probability)
Suppose a sample space s and a set of k events $E_1, E_2, ..., E_k$ such that
$P(E_i) > 0$ ($i = 1, ..., k$)
$E_i \cap E_j = \emptyset$ ($i \neq j$) (i.e. the events are mutually exclusive)
$s = E_1 \cup E_2 \cup \cdots \cup E_k$ (i.e. the events are exhaustive on s)
Then for any event A defined on s:
$$P(A) = P(E_1 \cap A) + P(E_2 \cap A) + \cdots + P(E_k \cap A)$$
$$= P(E_1)P(A | E_1) + P(E_2)P(A | E_2) + \cdots + P(E_k)P(A | E_k)$$
$$= \sum_{j=1}^{k} P(E_j)P(A | E_j)$$

MAIN POINTS
In some statistical experiments the number of basic outcomes in the sample space or event of interest can be enumerated by using the multiplicative rule, permutation or combination formulae, depending on how a basic outcome can be represented most appropriately.
P(A | B) means the probability event A occurs given that event B has occurred. The conditional probability definition is
$$P(A | B) = \frac{P(A \cap B)}{P(B)}$$
Multiplicative law of probability:
$$P(A \cap B) = P(A)P(B | A) = P(B)P(A | B)$$
Events A and B are statistically independent if the probability of A occurring is not affected by whether B has occurred.
Events A and B are independent if
$$P(A \cap B) = P(A)P(B)$$
or equivalently $P(A | B) = P(A)$
Additive law of probability:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

200052 INTRODUCTION TO ECONOMIC METHODS
LECTURE - WEEK 4
Required Reading:
Ref. File 4: Sections 4.7 to 4.9
Ref. File 5: Introduction and Sections 5.1 to 5.4
4. PROBABILITY THEORY CONTINUED

4.9 Sampling With and Without Replacement


Definition (Random Sample from a Statistical Population)
A random sample of n elements from a statistical
population is such that every possible combination of n
elements from the population has an equal probability of
being in the sample.
Many experiments involve taking a random sample from a
finite population. If we sample with replacement, we
effectively return each observation to the population
before making the next selection. In this way the
population from which we are sampling remains the same
from one selection to the next; provided sampling is
random, the successive outcomes will be independent.
If we sample without replacement from a finite
population, the outcome of any one selection will depend
on the outcomes of all previous selections; the population
is reduced with each selection.

Example:
Suppose that in a given street 50 residents voted in the last election. Of these, 15 voted for party A, 30 voted for party B and 5 voted for neither party A nor B.
Suppose that one evening a candidate for the next election visits the residents of the street to introduce herself. What is the probability that the first two eligible voters she meets voted for party A at the last election? (3/35)
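Because sampling here is without replacement, the second probability depends on the first draw; a sketch (not from the notes):

```python
# First two voters both voted for party A, sampling without replacement.
from fractions import Fraction

p = Fraction(15, 50) * Fraction(14, 49)  # population shrinks after 1st draw
print(p)  # 3/35
```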

Example:
Consider the experiment of successively drawing 2 cards from a deck of 52 playing cards. Define the following events:
A1: ace on first draw
A2: ace on second draw
What is the probability of selecting 2 aces if sampling (drawing) is (i) without replacement, and (ii) with replacement? (1/221, 1/169)

Note: If we simultaneously select a sample of n elements,
we are effectively sampling without replacement.
4.10 Probability Trees
Tree diagrams can be a useful aid in calculating the
probabilities of intersections of events (i.e. joint
probabilities).
Example:
Greasy Mo's take-away food store offers special $10 meal deals consisting of a small pizza or a kebab, together with a can of soft drink, a milkshake or a cup of fruit juice. Past experience has shown that 60% of meal deal buyers choose a pizza (P), 40% choose kebabs (K), 75% choose soft drink (S), 20% choose a milkshake (M) and 5% choose fruit juice (J). Assume the events P and K are independent of the events S, M and J. What is the probability that a meal deal customer (chosen at random) will choose a pizza and fruit juice? (0.03)
The tree diagram for this example can be drawn as below.

[Tree diagram: first branch P (0.6) or K (0.4); second branch S (0.75), M (0.2) or J (0.05), giving the joint probabilities:]

P(P ∩ S) = 0.6(0.75) = 0.45
P(P ∩ M) = 0.6(0.2) = 0.12
P(P ∩ J) = 0.6(0.05) = 0.03
P(K ∩ S) = 0.4(0.75) = 0.30
P(K ∩ M) = 0.4(0.2) = 0.08
P(K ∩ J) = 0.4(0.05) = 0.02

5. PROBABILITY DISTRIBUTIONS OF DISCRETE RANDOM VARIABLES

5.1 Probability Distributions and Random Variables


A probability distribution can be considered a theoretical model for a relative frequency distribution of data from a real life population.
A probability distribution thus specifies the probabilities associated with the various outcomes of a statistical experiment. It can take the form of a table, a graph or some formula.
From now on we shall be concerned with the
characteristics of probability distributions. However, to
facilitate our study we shall now represent simple events
and events associated with statistical experiments by
values of random variables.
Definition (Random Variable)
A random variable X is a rule that assigns to each simple
event of a statistical experiment a unique numerical value.
The above definition can also be expressed in the following
slightly more mathematical way.

Alternative Definition (Random Variable)
A random variable X is a real valued function for which
the domain is the sample space of a statistical experiment.
In most statistical experiments of interest, outcomes give
rise to quantitative data that can be considered values of
the random variable being studied.

In experiments which give rise to categorical or qualitative data, a random variable can normally also be defined.
Example:
Consider the experiment of selecting a person at random
and noting their hair colour.

Definition (Discrete Random Variable)
A discrete random variable can only assume a finite or
infinite and countable number of values.
Definition (Continuous Random Variable)
A continuous random variable can assume any value in an
interval (finite or infinite).

Definition (Discrete Probability Distribution)


A discrete probability distribution lists a probability for, or
provides a means (e.g. a rule or formula) of assigning a
probability to, each value a discrete random variable can
take.
Suppose our random variable is called X. Then P ( X x )
represents the probability that the random variable takes
on the particular value x.
Properties of the Discrete Probability Distribution of a Random Variable X:
$0 \leq P(X = x) \leq 1$ for all values of x
$\sum_{\text{all } x} P(X = x) = 1$

Example:
Consider again the experiment of tossing a fair die once
and noting the number of dots on the upward facing side
(X).

Definition (Cumulative Distribution Function)

The cumulative distribution function of a random variable X, denoted F(x), is defined as
$$F(x) = P(X \leq x)$$
where x is any real number.

5.2 Expected Values of Random Variables
It is of interest to have a measure of the centre of the
probability distribution of a random variable X. This role
is filled by the expected value of X.

Definition (Expected Value of a Discrete Random Variable)
The expected value of a discrete random variable X is defined as
$$E(X) = \sum_{\text{all } x} x P(X = x)$$

If a statistical experiment considered generates values of the random variable that coincide with values in the population considered, and the theoretical probability distribution of the random variable and population relative frequency distribution are the same, the mean of the theoretical distribution of X will be the same as the population mean (i.e. $\mu$). That is, $E(X) = \mu$.
Example:
Suppose you buy a lottery ticket for $10. The sole prize in
the lottery is $100,000 and 100,000 tickets are sold. If the
lottery is fair (i.e. each ticket sold has an equal chance of
winning), what will be your expected gain from buying the
lottery ticket? (-9)
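The calculation behind the answer, as a sketch (not from the notes):

```python
# Expected gain: win $100,000 with probability 1/100,000, else nothing,
# having paid $10 for the ticket either way.
p_win = 1 / 100_000
expected_gain = 100_000 * p_win + 0 * (1 - p_win) - 10
print(expected_gain)  # -9.0
```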

Theorem (Expected Value of a Function of a Discrete Random Variable)
Suppose a function g(X) of a discrete random variable X. The expected value of this function, if it exists, is given by
$$E[g(X)] = \sum_{\text{all } x} g(x) P(X = x)$$

Theorem 5.2 (Various Properties of Expected Values)

If c is any constant then
$$E(c) = c$$
If c is any constant and g(X) is any function of a discrete or continuous random variable X then
$$E[cg(X)] = cE[g(X)]$$
If $g_i(X)$ ($i = 1, ..., k$) are k functions of a discrete or continuous random variable X then
$$E[g_1(X) + \cdots + g_k(X)] = E[g_1(X)] + \cdots + E[g_k(X)]$$
If h(X) and g(X) are two functions of a discrete or continuous random variable X such that $h(X) \leq g(X)$ for all X, then
$$E[h(X)] \leq E[g(X)]$$

5.3 The Variance of a Random Variable
To gauge the dispersion of a random variable X about its expected value or mean we can calculate the expected value of its squared distance $(X - E(X))^2$ from the mean. This is called the variance of the random variable X, denoted Var(X).
Definition (Variance of a Random Variable)
The variance of any random variable X (discrete or continuous) is given by
$$Var(X) = E[(X - E(X))^2]$$

Definition (Standard Deviation of a Random Variable)

The standard deviation of any random variable X (discrete or continuous) is given by
$$SD(X) = \sqrt{Var(X)} = \sqrt{E[(X - E(X))^2]}$$

Again assuming the probability distribution of X is an accurate representation of the population relative frequency distribution of X, we can write $Var(X) = \sigma^2$, where $\sigma^2$ is the population variance.

An alternative way of writing (and calculating) Var(X) is
$$Var(X) = E(X^2) - [E(X)]^2 = \sum_{\text{all } x} x^2 P(X = x) - [E(X)]^2 \quad \text{(if X is discrete)}$$

Example:
Suppose a lottery offers 3 prizes: $1,000, $2,000 and $3,000. 10,000 tickets are sold and each ticket has an equal chance of winning a prize. Calculate the variance and standard deviation of the random variable X representing the value of the prize won by a ticket. (1399.64, 37.4118)

x       P(X = x)     x P(X = x)   x²           x² P(X = x)
0       9997/10000   0            0            0
1,000   1/10000      0.1          1,000,000    100
2,000   1/10000      0.2          4,000,000    400
3,000   1/10000      0.3          9,000,000    900
Total                0.6                       1400
If we wish to determine the variance of a linear function $Y = g(X) = a + bX$ of a random variable X, the following rule can be used:
$$Var(Y) = Var(a + bX) = b^2 Var(X)$$

5.4 The Binomial Distribution
The binomial distribution is a discrete probability
distribution based on n repetitions of an experiment
whose outcomes are represented by a Bernoulli random
variable.
(a) Bernoulli Experiments
A Bernoulli experiment (or trial) is such that only 2 outcomes are possible. These outcomes can be denoted success (S) and failure (F), with probabilities p and (1 − p), respectively.
A Bernoulli random variable Y is usually defined so that it takes the value 1 if the outcome of a Bernoulli experiment is a success, and the value 0 if the outcome is a failure.
Thus
P(Y = 1) = p
P(Y = 0) = (1 − p)

The mean and variance of a Bernoulli random variable defined in the above way are
E(Y) = p
Var(Y) = p(1 − p)
(b) Binomial Experiments
Definition (Binomial Experiment)
A binomial experiment fulfils the following requirements:
(i) There are n repetitions or trials of a Bernoulli experiment for which there are only two outcomes, success or failure.
(ii) All trials are performed under identical conditions.
(iii) The trials are independent.
(iv) The probability of success p is the same for each trial.
(v) The random variable of interest, say X, is the number of successes observed in the n trials.

Theorem (The Binomial Probability Function)

Let X represent the number of successes in a binomial experiment consisting of n trials and with a probability p of success on each trial. The probability of x successes in such an experiment is given by
$$P(X = x) = {^n}C_x\, p^x (1-p)^{n-x} \quad \text{for } x = 0, 1, 2, 3, ..., n$$

Example:
A company that supplies reverse-cycle air conditioning
units has found from experience that 70% of the units it
installs require servicing within the first 6 weeks of
operation. In a given week the firm installs 10 air
conditioning units. Calculate the probability that, within 6
weeks
5 of the units require servicing (0.1029 approx.)
none of the units require servicing (0 approx.)
all of the units require servicing (0.0282 approx.)
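A sketch of the calculation using the probability function above (not part of the notes):

```python
# Binomial probabilities for n = 10 installed units, p = 0.7.
from math import comb

def binom_pmf(x, n, p):
    # P(X = x) = nCx * p^x * (1 - p)^(n - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.7
print(round(binom_pmf(5, n, p), 4))   # 0.1029
print(binom_pmf(0, n, p))             # ≈ 0.0000059, i.e. ≈ 0
print(round(binom_pmf(10, n, p), 4))  # 0.0282
```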

(c) Cumulative Binomial Probabilities
(Extract of Appendix 3)
CUMULATIVE BINOMIAL PROBABILITIES: P(X ≤ x | p, n)

                                         p
n   x    0.05    0.10    0.15    0.20    0.25    0.30    0.35    0.40   ...   0.70
1   0   0.9500  0.9000  0.8500  0.8000  0.7500  0.7000  0.6500  0.6000  ...  0.3000
    1   1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  ...  1.0000
2   0   0.9025  0.8100  0.7225  0.6400  0.5625  0.4900  0.4225  0.3600  ...  0.0900
    1   0.9975  0.9900  0.9775  0.9600  0.9375  0.9100  0.8775  0.8400  ...  0.5100
    2   1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  ...  1.0000
3   0   0.8574  0.7290  0.6141  0.5120  0.4219  0.3430  0.2746  0.2160  ...  0.0270
    1   0.9928  0.9720  0.9393  0.8960  0.8438  0.7840  0.7183  0.6480  ...  0.2160
    2   0.9999  0.9990  0.9966  0.9920  0.9844  0.9730  0.9571  0.9360  ...  0.6570
    3   1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  ...  1.0000
...
10  0   0.5987  0.3487  0.1969  0.1074  0.0563  0.0282  0.0135  0.0060  ...  0.0000
    1   0.9139  0.7361  0.5443  0.3758  0.2440  0.1493  0.0860  0.0464  ...  0.0001
    2   0.9885  0.9298  0.8202  0.6778  0.5256  0.3828  0.2616  0.1673  ...  0.0016
    3   0.9990  0.9872  0.9500  0.8791  0.7759  0.6496  0.5138  0.3823  ...  0.0106
    4   0.9999  0.9984  0.9901  0.9672  0.9219  0.8497  0.7515  0.6331  ...  0.0473
    5   1.0000  0.9999  0.9986  0.9936  0.9803  0.9527  0.9051  0.8338  ...  0.1503
    6   1.0000  1.0000  0.9999  0.9991  0.9965  0.9894  0.9740  0.9452  ...  0.3504
    7   1.0000  1.0000  1.0000  0.9999  0.9996  0.9984  0.9952  0.9877  ...  0.6172
    8   1.0000  1.0000  1.0000  1.0000  1.0000  0.9999  0.9995  0.9983  ...  0.8507
    9   1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  0.9999  ...  0.9718
    10  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  ...  1.0000

Example:
Referring to previous air conditioning unit example,
calculate the probability that within 6 weeks of installation
less than 8 of the air conditioners require servicing.
(0.6172 approx.)
4 or more of the air conditioners require servicing.
(0.9894 approx.)

Example:
Again referring to the previous air conditioning unit example, use the cumulative binomial tables to calculate the probability that within 6 weeks of installation
5 units require servicing (0.103)
10 units require servicing (0.0282)

(d) Characteristics of the Binomial Distribution

Theorem (Mean and Variance of a Binomial Random Variable)
Let X represent the number of successes in a binomial experiment consisting of n trials, and where the probability of success on each trial is p. Then
E(X) = np
Var(X) = np(1 − p)

Each combination of n and p gives a particular binomial distribution. We say n and p are the parameters of the binomial distribution.
If p = 0.5, the binomial distribution is symmetric.

Example
Suppose n = 5 and p = 0.5.
[Probability histogram: P(X=0) = P(X=5) = 0.0313, P(X=1) = P(X=4) = 0.1563, P(X=2) = P(X=3) = 0.3125]

The binomial distribution will be skewed to the left (i.e. negatively skewed) if p > 0.5, and skewed to the right (i.e. positively skewed) if p < 0.5. In either case the tendency to be skewed diminishes as n increases.

MAIN POINTS
If we sample without replacement from a finite
population, the outcome on any draw will depend on the
outcomes of all previous draws.
Sampling with replacement from a finite population is
equivalent to sampling from an infinite population.
Tree diagrams can facilitate the calculation of joint
probabilities (i.e. the probabilities of intersections of
events).
A probability distribution can be interpreted as a model
for the relative frequency distribution of some real
statistical population. In any given situation, the model
may or may not represent the relative frequency
distribution exactly.
It is convenient to associate the outcomes of a statistical
experiment with values of a random variable (e.g. X). We
can then think in terms of the probability distribution of
the random variable.
The mean (expected value) and variance of a discrete random variable are given by
$$E(X) = \sum_{\text{all } x} x P(X = x)$$
$$Var(X) = E[(X - E(X))^2] = \sum_{\text{all } x} (x - E(X))^2 P(X = x)$$

The binomial distribution is a model for the relative frequency (probability) distribution of numbers of successes in n trials of a Bernoulli experiment.
The binomial distribution can be represented by the probability function
$$P(X = x) = {^n}C_x\, p^x (1-p)^{n-x}$$
where n is the number of trials, x the number of successes and p the probability of success at each trial.

200052 INTRODUCTION TO ECONOMIC METHODS
LECTURE - WEEK 5
Required Reading:
Ref. File 6: Introduction and Sections 6.1 to 6.4
6. CONTINUOUS PROBABILITY DISTRIBUTIONS

6.1 Introduction
From now on we shall be mainly concerned with studying
the distributions of continuous random variables. As we
have noted, a continuous random variable can assume any
value in a given interval.
The probability distribution for a continuous random
variable X will have a smooth curve or line as its graphical
representation. The heights of the points on this curve will
be given by a function of x, denoted f ( x ) , which is
variously called the probability density function, the
probability distribution or simply the density function of
the random variable X.

Areas under a density function f(x) represent probabilities of X taking on values in the corresponding intervals.

[Figure: density curve y = f(x); the shaded area under the curve between a and b equals P(a ≤ X ≤ b)]

Properties of Density Functions

If f(x) is a valid density function, it satisfies the following two properties:
(i) $f(x) \geq 0$ for all x
(ii) $\int_{-\infty}^{\infty} f(x)\, dx = 1$

Note: For a continuous random variable the probability associated with any particular value of the variable is 0.
The mean and variance of a continuous random variable are normally determined using calculus.

6.2 The Uniform Distribution
If a random variable X can take on any value in a given finite interval $a \leq x \leq b$ and the probability of the variable taking a value in a given finite sub-interval is the same as the probability the variable takes a value in any other finite sub-interval of the same width, we say the variable X is uniformly distributed. We have the following formal definition.
Definition (Uniform Random Variable)
A continuous random variable X is said to be uniformly distributed over the finite interval $a \leq x \leq b$ if and only if its density function is given by
$$f(x) = \begin{cases} \frac{1}{b-a}, & \text{if } a \leq x \leq b \\ 0, & \text{if } x < a \text{ or } x > b \end{cases}$$

We can calculate probabilities with respect to the random variable X in the above definition from
$$P(c \leq X \leq d) = \frac{d - c}{b - a} \quad \text{for } c \geq a,\ d \leq b$$
[Figure: uniform density f(x), constant at height 1/(b−a) between a and b; total area = 1]

Theorem (Expected Value and Variance of a Uniform Random Variable)
Suppose the random variable X is uniformly distributed over the finite interval $a \leq x \leq b$. The expected value and variance of X are, respectively,
$$E(X) = \frac{a + b}{2}$$
$$Var(X) = \frac{(b - a)^2}{12}$$

Example:
The amount of petrol sold daily by a service station (say X)
is known to be uniformly distributed between 4,000 and
6,000 litres inclusive. What is the probability of sales on
any one day being between 5,500 and 6,000 litres? (0.25)
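A sketch for the petrol example (not from the notes):

```python
# Daily petrol sales X ~ uniform on [4000, 6000] litres.
a, b = 4000, 6000

def uniform_prob(c, d):
    # P(c <= X <= d) = (d - c) / (b - a), for a <= c <= d <= b
    return (d - c) / (b - a)

print(uniform_prob(5500, 6000))  # 0.25
print((a + b) / 2)               # E(X) = 5000 litres
print((b - a) ** 2 / 12)         # Var(X) ≈ 333333.33 litres²
```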

6.3 The Normal (Gaussian) Distribution
The normal distribution represents a family of bell-shaped distributions that are distinguished according to their mean and variance.
Definition (Normally Distributed Random Variable)
A random variable X is normally distributed if and only if it has a density function of the following form:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \quad \text{for all real } x$$
where:
$\mu$ and $\sigma^2$ are parameters of the distribution of X. They are used to represent E(X) and Var(X), respectively.
e is the irrational number that serves as the base for natural logarithms (e = 2.7182...)
$\pi$ is the irrational number representing the ratio of the circumference of a circle to its diameter ($\pi$ = 3.1415...)
A normal distribution with mean $\mu$ and variance $\sigma^2$ is usually denoted $N(\mu, \sigma^2)$.

The normal distribution has a positive density for all real x. Therefore it can strictly speaking never exactly match the distribution of a variable that only takes on non-negative values. However, even in such cases it can often give a very good approximation.
The normal distribution is symmetric about $\mu$.
[Figure: bell-shaped normal density y = f(x), symmetric about $\mu$]

For any normal distribution it will be the case that, approximately:
68% of its values will fall within one standard deviation ($\sigma$) of $\mu$.
95.5% of its values will fall within two standard deviations ($2\sigma$) of $\mu$.
99.7% of its values will fall within three standard deviations ($3\sigma$) of $\mu$.

Computing areas under a normal density function is difficult, but we can use a table showing probabilities associated with the standardised normal random variable (many calculators and Microsoft Excel are also able to calculate these probabilities).

The standard normal distribution has a mean of 0 and a variance (and standard deviation) of 1. A standard normal variable is often denoted Z. Thus
Z ~ N(0, 1)

Probabilities relating to $X \sim N(\mu, \sigma^2)$ can be calculated by first calculating the standardised Z scores corresponding to the value(s) of X and then using the standard normal probability table. This is formalized by the following theorem.
Theorem 6.2 (The Standardizing Transformation of Non-Standard Normal Probabilities)
A random variable X is normally distributed with mean $\mu$ and variance $\sigma^2$ if and only if $Z = \frac{X - \mu}{\sigma}$ is a standard normal random variable, that is
$$X \sim N(\mu, \sigma^2) \text{ if and only if } Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$

Also note that a linear function of a normal variable is also
normally distributed.
(Extract of Appendix 5)
AREAS UNDER THE STANDARD NORMAL DISTRIBUTION
The table below gives areas under the standard normal distribution between 0 and z.

z     0      1      2      3      4      5      6      7      8      9
0.0  .0000  .0040  .0080  .0120  .0160  .0199  .0239  .0279  .0319  .0359
0.1  .0398  .0438  .0478  .0517  .0557  .0596  .0636  .0675  .0714  .0754
0.2  .0793  .0832  .0871  .0910  .0948  .0987  .1026  .1064  .1103  .1141
0.3  .1179  .1217  .1255  .1293  .1331  .1368  .1406  .1443  .1480  .1517
0.4  .1554  .1591  .1628  .1664  .1700  .1736  .1772  .1808  .1844  .1879
0.5  .1915  .1950  .1985  .2019  .2054  .2088  .2123  .2157  .2190  .2224
0.6  .2258  .2291  .2324  .2357  .2389  .2422  .2454  .2486  .2518  .2549
0.7  .2580  .2612  .2642  .2673  .2704  .2734  .2764  .2794  .2823  .2852
0.8  .2881  .2910  .2939  .2967  .2996  .3023  .3051  .3078  .3106  .3133
0.9  .3159  .3186  .3212  .3238  .3264  .3289  .3315  .3340  .3365  .3389
1.0  .3413  .3438  .3461  .3485  .3508  .3531  .3554  .3577  .3599  .3621
1.1  .3643  .3665  .3686  .3708  .3729  .3749  .3770  .3790  .3810  .3830
1.2  .3849  .3869  .3888  .3907  .3925  .3944  .3962  .3980  .3997  .4015
1.3  .4032  .4049  .4066  .4082  .4099  .4115  .4131  .4147  .4162  .4177
1.4  .4192  .4207  .4222  .4236  .4251  .4265  .4279  .4292  .4306  .4319
1.5  .4332  .4345  .4357  .4370  .4382  .4394  .4406  .4418  .4429  .4441
1.6  .4452  .4463  .4474  .4484  .4495  .4505  .4515  .4525  .4535  .4545
1.7  .4554  .4564  .4573  .4582  .4591  .4599  .4608  .4616  .4625  .4633
1.8  .4641  .4649  .4656  .4664  .4671  .4678  .4686  .4693  .4699  .4706
1.9  .4713  .4719  .4726  .4732  .4738  .4744  .4750  .4756  .4761  .4767
...
3.8  .4999  .4999  .4999  .4999  .4999  .4999  .4999  .4999  .4999  .4999
3.9  .5000  .5000  .5000  .5000  .5000  .5000  .5000  .5000  .5000  .5000

Example:
If Z ~ N(0, 1) determine the following probabilities:
P(Z ≤ 0) (0.5)
P(Z ≤ −0.5) (0.3085)
P(−0.1 ≤ Z ≤ 0.9) (0.3557)
P(Z ≥ 1.64) (0.0505)

Example:
If X ~ N(12, 4), calculate P(X ≤ 6.26), P(7 ≤ X ≤ 13) and P(X > 15.5). (0.0021, 0.6853, 0.0401)
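The table look-ups in the last two examples can be verified numerically; a minimal Python sketch (illustrative only, assuming the scipy library):

from math import sqrt
from scipy.stats import norm

Z = norm(0, 1)
print(1 - Z.cdf(0.5))               # P(Z > 0.5) = 0.3085
print(Z.cdf(0.9) - Z.cdf(-0.1))     # P(-0.1 <= Z <= 0.9) = 0.3557
print(1 - Z.cdf(1.64))              # P(Z > 1.64) = 0.0505

# X ~ N(12, 4), so sigma = 2: standardise with z = (x - mu)/sigma
mu, sigma = 12, sqrt(4)
print(Z.cdf((6.26 - mu) / sigma))                          # P(X <= 6.26) = 0.0021
print(Z.cdf((13 - mu) / sigma) - Z.cdf((7 - mu) / sigma))  # P(7 <= X <= 13) = 0.6853
print(1 - Z.cdf((15.5 - mu) / sigma))                      # P(X > 15.5) = 0.0401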
Example:
From several years records, a fish market manager has
determined that the weight of deep sea bream sold in the
market ( X ) is approximately normally distributed with a
mean of 420 grams and a standard deviation of 80 grams.
Assuming this distribution will remain unchanged in the
future, calculate the expected proportions of deep sea
bream sold over the next year weighing
(a) Between 300 and 400 grams. (0.3345)

(b) Between 300 and 500 grams. (0.7745)

(c) More than 600 grams. (0.0122)

6.4 The Normal Approximation to the Binomial
The normal distribution can be used to approximate binomial probabilities if np ≥ 5 and n(1 − p) ≥ 5.

The approximation sets μ = np and σ = √(np(1 − p)).

A continuity correction is required since we are approximating probabilities associated with a discrete (binomial) random variable by probabilities associated with a continuous (normal) random variable. The correction is performed as follows:

Let Y ~ N(np, np(1 − p)).
Approximate:
P(X = x) on the binomial by P(x − 0.5 ≤ Y ≤ x + 0.5) on the normal distribution.
P(X ≤ x) on the binomial by P(Y ≤ x + 0.5) on the normal distribution.
P(X ≥ x) on the binomial by P(Y ≥ x − 0.5) on the normal distribution.

Example:
It is known that 60% of cars registered in a given town use
unleaded petrol. A random sample of 200 cars is selected.
Determine the probability that, of the cars in the sample:
exactly 130 use unleaded petrol. (0.021)
more than 130 use unleaded petrol. (0.0643)
less than 130 use unleaded petrol. (0.9147)
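A numerical sketch of the normal approximation with the continuity correction (illustrative only, assuming scipy):

from math import sqrt
from scipy.stats import norm

n, p = 200, 0.60
mu, sigma = n * p, sqrt(n * p * (1 - p))   # mu = 120, sigma ~= 6.93
Y = norm(mu, sigma)

print(Y.cdf(130.5) - Y.cdf(129.5))   # P(X = 130)  ~= 0.021
print(1 - Y.cdf(130.5))              # P(X > 130)  ~= 0.0643
print(Y.cdf(129.5))                  # P(X < 130)  ~= 0.9147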

MAIN POINTS
The graphical representation of a continuous random variable is the graph of its density function. This is the counterpart of the probability histogram for a discrete random variable.
The probability that a continuous random variable takes on a value in some range is given by an area under the density function.
The uniform distribution has a constant density function.
If X is normally distributed with a mean μ and variance σ², we can write this information as X ~ N(μ, σ²).
The standard normal random variable Z is such that Z ~ N(0, 1).
Areas under a normal density function can be calculated with reference to the standard normal table, making use of the symmetry of the distribution as needed.
The normal distribution can be used to approximate binomial probabilities provided np ≥ 5 and n(1 − p) ≥ 5 (with μ = np and σ² = np(1 − p)); the approximation can be improved by using a continuity correction.

200052 INTRODUCTION TO ECONOMIC METHODS
LECTURE - WEEK 6
Required Reading:
Ref. File 7: Introduction and Sections 7.1 to 7.4
7. INTRODUCTION TO ESTIMATION

7.1 Estimators and Their Properties


From now on we will mainly be concerned with random
samples of random variables.
Definition (Random Sample of Size n of a Random
Variable)
Consider a set of random variables X 1 , X 2 ,....., X n . This
set of random variables is said to represent a random
sample of size n of the random variable X if
(i) X 1 , X 2 ,....., X n are all statistically independent
And
(ii) X 1 , X 2 ,....., X n each have the same probability
distribution (or distribution function) as the random
variable X.
We will mostly use an upper case italicized letter to denote
a random variable, and a lower case non-italicized letter to
denote an actual realization or value of the variable.

Definition (Sample Statistic)
Suppose the random variables X 1 , X 2 ,....., X n are
associated with a sample of size n from a statistical
population. Then any function of (or formula containing)
X 1 , X 2 ,....., X n that does not depend on any unknown
parameter is called a sample statistic.
Definition (Estimator/Estimate of a Population Parameter)
Suppose the random variables X 1 , X 2 ,....., X n are
associated with a sample of size n from a statistical
population.
Then a sample statistic involving
X 1 , X 2 ,....., X n that is used to estimate a parameter of the
population or associated probability distribution is called
an estimator of the parameter, and a realization of the
sample statistic (an actual number) is called an estimate of
the parameter.

Definition (Sample Mean and Variance of a Random
Variable)
Suppose the random variables X 1 , X 2 ,....., X n represent a
random sample of size n of the random variable X. The
sample mean and variance of X are then defined as,
respectively
Sample Mean of X:   X̄ = Σ_{i=1}^{n} X_i / n

Sample Variance of X:   S² = Σ_{i=1}^{n} (X_i − X̄)² / (n − 1)

If an estimator is used to obtain a single value estimate of a parameter, this estimate is called a point estimate.
An interval estimate describes a range, or interval, of values in which the population parameter is believed to be. An interval estimate is normally centred around a point estimate.

Since estimators are functions of random variables, they will also be random variables whose values vary from sample to sample. The probability distribution of an estimator is called a sampling distribution.


Most statistical inference is based on a knowledge of the sampling distributions of estimators.

Properties of Estimators
Definition (Unbiased Estimator)
Consider an estimator θ̂ of some population parameter θ. θ̂ is an unbiased estimator of θ if E(θ̂) = θ. If E(θ̂) ≠ θ, θ̂ is said to be a biased estimator of θ, with the value of the bias given by B = E(θ̂) − θ.
(θ is the lower case version of the Greek letter theta)

Definition (Relative Efficiency of an Estimator)
If θ̂₁ and θ̂₂ are both unbiased estimators of a population parameter θ with unequal variances, θ̂₁ is said to be relatively more efficient than θ̂₂ if

Var(θ̂₁) < Var(θ̂₂)

Definition (Consistency of an Estimator)
An estimator θ̂ of some population parameter θ is said to be a consistent estimator of θ if, as the (random) sample size increases, the probability increases of the estimator yielding an estimate in some arbitrary fixed interval, however small, centred round the true parameter value.

Theorem (Sufficient Condition for Consistency of an Estimator)
An estimator θ̂ of some population parameter θ is a consistent estimator of this parameter if

lim_{n→∞} E(θ̂) = θ   and   lim_{n→∞} Var(θ̂) = 0
7.2 The Sampling Distribution of the Sample Mean
Example:
Suppose we know that in a large city 20% of households
possess no car, 60% possess one car and 20% possess two
cars. If we let X be the number of cars in a household we
can write the probability distribution of X as
x          0     1     2
P(X = x)   1/5   3/5   1/5

Determine the sampling distribution of X̄ based on random samples of size 2.

x₁   x₂   x̄     P((X₁ = x₁) ∩ (X₂ = x₂))
0    0    0     1/5 × 1/5 = 1/25
0    1    0.5   1/5 × 3/5 = 3/25
0    2    1     1/5 × 1/5 = 1/25
1    0    0.5   3/5 × 1/5 = 3/25
1    1    1     3/5 × 3/5 = 9/25
1    2    1.5   3/5 × 1/5 = 3/25
2    0    1     1/5 × 1/5 = 1/25
2    1    1.5   1/5 × 3/5 = 3/25
2    2    2     1/5 × 1/5 = 1/25


x̄           0      0.5    1      1.5    2
P(X̄ = x̄)    1/25   6/25   11/25  6/25   1/25
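The enumeration above can also be automated; a short Python sketch (illustrative only, not part of the unit material):

from itertools import product
from collections import defaultdict
from fractions import Fraction

# P(X = x) for the number of cars in a household
pX = {0: Fraction(1, 5), 1: Fraction(3, 5), 2: Fraction(1, 5)}

# Enumerate all samples (x1, x2) of size 2 and accumulate P(X-bar = x-bar)
dist = defaultdict(Fraction)
for x1, x2 in product(pX, repeat=2):
    dist[(x1 + x2) / 2] += pX[x1] * pX[x2]

for xbar in sorted(dist):
    print(xbar, dist[xbar])   # 0: 1/25, 0.5: 6/25, 1: 11/25, 1.5: 6/25, 2: 1/25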

Theorem (The Central Limit Theorem)
Consider a random sample X₁, X₂, ..., Xₙ of size n of a random variable X with a finite mean E(X) = μ and a finite variance Var(X) = σ². Then:
(i) If X is (exactly) normally distributed, the sample mean X̄ will be exactly normally distributed with a mean μ and a variance σ²/n.
(ii) If X is not normally distributed, the sample mean X̄ will be approximately normally distributed with a mean μ and a variance σ²/n for large sample sizes. This approximation is generally considered to be valid when n ≥ 30.

Note: Var(X̄) = σ²/n decreases as n increases and approaches zero in the limit. This, together with the fact that X̄ is unbiased, ensures that X̄ is a consistent estimator of μ.
Note: The standard deviation of an estimator is often called the standard error of the estimator, although often this term is used for an estimate of the standard deviation of an estimator.
Example:
A particular type of light bulb has a mean life of 6,000
hours and a standard deviation of bulb life of 400 hours.
What percentage of random samples made up of 100
observations of bulb lives will yield mean bulb lives
between 5,950 and 6,050 hours? (78.88%)
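A numerical check of this sampling-distribution calculation (an illustrative Python sketch, assuming scipy):

from math import sqrt
from scipy.stats import norm

mu, sigma, n = 6000, 400, 100
Xbar = norm(mu, sigma / sqrt(n))     # X-bar ~ N(6000, 40^2) by the CLT

print(Xbar.cdf(6050) - Xbar.cdf(5950))   # ~= 0.7887, i.e. 78.88% of samples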

8. INTERVAL ESTIMATION

8.1 Introduction and Terminology


A confidence interval not only comprises an interval of
possible population parameter values, but also some
measure of the degree of belief or confidence that the
interval does indeed contain the parameter in question.
The level of confidence associated with a confidence
interval is the probability that we will obtain a realization
of the interval that contains the population parameter, i.e.
before we actually take a sample. It is usually denoted (1 − α)100%, where α is the probability (0 < α < 1) of obtaining a realization of the interval that does not contain the population parameter.
Confidence intervals are constructed on the basis of
knowledge of the sampling distribution of the estimator
(or some function thereof) and a predetermined α.
The z_α Notation:
z_α is used to denote the value of the standard normal variable Z such that

P(Z > z_α) = α

By symmetry, P(Z < −z_α) = α. Similarly, z_{α/2} is such that

P(−z_{α/2} < Z < z_{α/2}) = 1 − α

8.2 Interval Estimation of the Population Mean with σ² Known

Suppose we wish to obtain an interval estimator for the mean μ of a random variable X with known variance σ², based on a random sample of size n. Two cases can be considered:

(i) X normally distributed (or approximately normally distributed)

In this case we know

X̄ ~ N(μ, σ²/n)

and therefore

Z = (X̄ − μ)/(σ/√n) ~ N(0, 1)

Therefore, for a given α

P(−z_{α/2} < Z < z_{α/2}) = 1 − α

P(−z_{α/2} < (X̄ − μ)/(σ/√n) < z_{α/2}) = 1 − α

P(X̄ − z_{α/2} σ/√n < μ < X̄ + z_{α/2} σ/√n) = 1 − α

Thus the (1 − α)100% confidence interval for an observed x̄ is given by

( x̄ − z_{α/2} σ/√n ,  x̄ + z_{α/2} σ/√n )

Example:
Suppose it is known from past experience that the
duration of phone calls (X) made by telephone subscribers
in a given city is approximately normally distributed with
a standard deviation of 10 minutes. A sample of 25 calls is
metered and the mean duration of these calls is found to
be 7.5 minutes. Construct a confidence interval for the
mean duration of calls (in minutes) based on this sample,
using a confidence level of 0.90. (4.21, 10.79)
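The interval can be reproduced numerically; a minimal Python sketch (illustrative only, assuming scipy):

from math import sqrt
from scipy.stats import norm

xbar, sigma, n, conf = 7.5, 10, 25, 0.90
z = norm.ppf(1 - (1 - conf) / 2)         # z_{alpha/2} = 1.645 for alpha = 0.10

half_width = z * sigma / sqrt(n)
print(xbar - half_width, xbar + half_width)   # (4.21, 10.79)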

(ii) X is not normally distributed but n ≥ 30

In this case by the central limit theorem X̄ will be approximately normally distributed and the confidence interval formula presented above can still be used.
It is also possible in this case to replace σ² by s² if it is unknown.

8.3 Properties of Confidence Intervals
The width of a confidence interval for the population mean, where we are justified in using the normal distribution, is given by

( x̄ + z_{α/2} σ/√n ) − ( x̄ − z_{α/2} σ/√n ) = 2 z_{α/2} σ/√n

For a given confidence level and a given σ, the confidence interval width decreases with increasing n. This leads to a criterion for choosing n.
If we wish to use a calculated x̄ to estimate μ to within D (units) with (1 − α)100% confidence, we should choose n such that

n ≥ ( z_{α/2} σ / D )²

(assuming a normally distributed population or an n ≥ 30)

Example:
A clothing shop located in a busy shopping arcade is
interested in estimating the mean age of people who
frequent the arcade. The shop intends to use this
information in determining the appropriate range of
clothing it should stock in order to maximize sales. A
sample of people is to be selected at random in the arcade
and questioned by the shop manager about their age.
What should the sample size be if the shop manager
wishes to use a calculated x to estimate the average age of
people who frequent the arcade to within 1.5 years, with
95% confidence, assuming the population standard
deviation is approximately 7.5? (97)
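The sample-size criterion can be checked as follows (an illustrative Python sketch, assuming scipy):

from math import ceil
from scipy.stats import norm

sigma, D, conf = 7.5, 1.5, 0.95
z = norm.ppf(1 - (1 - conf) / 2)     # 1.96

n = (z * sigma / D) ** 2             # 96.04; always round up
print(ceil(n))                       # 97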

MAIN POINTS
A random sample of a random variable is such that the
random variables representing the sample are
independently and identically distributed.
An estimator of a population parameter is a formula
containing the random variables representing sample
values.
The probability distribution of an estimator is called a
sampling distribution.
An unbiased estimator of a parameter has a mean equal
to the parameter value.
A consistent estimator has a probability distribution that
becomes more concentrated around the true parameter
value as n tends to infinity.
E(X̄) = μ and Var(X̄) = σ²/n (if the X_i's represent a random sample of the random variable X).
X̄ is an unbiased and consistent estimator of μ.

The central limit theorem says that even if we are sampling from a non-normal distribution, the distribution of the sample mean will be approximately normal provided the sample size is sufficiently large (i.e. greater than 30).


Interval estimation involves determining an interval in which we have a certain degree of confidence the population parameter lies; it is based on a point estimator and its sampling distribution.
z_α is the point on the standard normal distribution that cuts off an area of α in the right tail.

When calculating a confidence interval for μ:
(i) If the population is normal, σ² known: use the normal distribution
(ii) If the population is not normal, σ² known, n large: use the normal distribution (by the CLT)
(iii) If the population is normal or not normal, σ² unknown, n large: use the normal distribution and replace σ by s
The width of a confidence interval for μ:
Increases for a higher level of confidence
Decreases for a higher sample size
Increases for a larger population variance

200052 INTRODUCTION TO ECONOMIC METHODS
LECTURE - WEEK 9
Required Reading:
Ref. File 7: Sections 7.6 and 7.7
Ref. File 8: Introduction and Sections 8.1 to 8.5
9. HYPOTHESIS TESTING

9.1 Introduction and Terminology


(i) The General Idea of Hypothesis Testing

Typically a statistical hypothesis is a claim about the value of an unknown population parameter, such as μ. The claim is tested with reference to an estimate of the
claim is tested with reference to an estimate of the
parameter obtained from a sample. Armed with a
knowledge of the sampling distribution of the estimator
(e.g. X ) or some function of the estimator assuming the
claim is true, we must judge whether any discrepancy
between the claimed value of the parameter and the
sample estimate can be reasonably explained by sampling
variability. If this is not the case, we will be led to reject
the claim.

The basic process of statistical hypothesis testing can be
represented as follows:
Claim (hypothesis) about parameter
(plus knowledge of any other parameters needed to specify
the distribution of the population)

Specification of the sampling distribution of the estimator or some function of the estimator (assuming the claim is true)

A sample estimate of the parameter is obtained.

If this estimate is highly unlikely, we reject the claim; otherwise we do not reject the claim.

In the context of hypothesis testing, the sample estimate of


the parameter in question (or some function thereof) is
often called, for obvious reasons, the test statistic.
(ii) Specifying Hypotheses
In any hypothesis testing situation two types of hypotheses
are specified from the outset:
The null hypothesis, labelled H 0 , specifies the claim
regarding the population parameter that is to be tested.
The alternative hypothesis, labelled H 1 , specifies an
alternative to the claim made in H 0 .

Definition (Simple and Composite Hypotheses)
A hypothesis that specifies a single value of a parameter
and permits us to specify uniquely the distribution of the
population being sampled from is called a simple
hypothesis. All other hypotheses concerning a population
parameter are called composite hypotheses.
Typically the alternative hypothesis is a composite
hypothesis that specifies a range of alternative values of
the parameter of interest, unless the parameter in question
can only take a finite number of values.
If the null hypothesis specifies a range of values of the
parameter considered, our hypothesis testing procedure
will use the limit value of the specified range of values (see
reference file for explanation).
Example:
It is claimed that the mean weight of flour in boxes of a
particular brand of flour is 500 grams. Suppose we wish
to test this claim.

The alternative hypothesis can be specified in several
ways, depending on the type of non-random variation that
is of interest.
A two-sided alternative hypothesis is of the form
parameter ≠ some value specified by H₀

A two-sided alternative hypothesis leads us to a two-tailed test.
A one-sided alternative is of the form
parameter < some value specified by H 0 .
or
parameter > some value specified by H 0 .
A one-sided alternative leads to a one-tailed test.
Example:
Suppose that in the previous example we were only concerned with whether the mean weight in the boxes was greater than 500 grams.

(iii) Type I and Type II Errors
For any specification of H 0 and H 1 , two errors can occur
in testing H 0 .
Definition (Type I and Type II Errors)
A type I error occurs if the null hypothesis H 0 is rejected
when it is in fact true. A type II error occurs if the null
hypothesis H 0 is not rejected when it is in fact false.

The probability of a type I error is denoted α, and the probability of a type II error is denoted β. α is also called the significance level of the test.
The hypothesis testing procedure involves partitioning the
sampling distribution of the estimator (or some function of
the estimator), assuming H 0 is true, into non-rejection and
rejection regions. Traditionally this partitioning is done
by specifying α.

α is usually chosen to be some small value such as 0.1, 0.05 or 0.01.

The value (values for a two-tail test) of the test statistic that partition(s) the sampling distribution of the estimator into non-rejection and rejection regions is (are) called the critical value (or values) of the test. The choice of α will determine this (these) value(s).

9.2 Hypothesis Tests of the Mean When the Population is Normal and σ² is Known

Suppose we wish to test the hypothesis
H₀: μ = μ₀
against
H₁: μ > μ₀
at the α level of significance.

In this case, assuming H₀ is true

X̄ ~ N(μ₀, σ²/n)   (exactly)

and

Z = (X̄ − μ₀)/(σ/√n) ~ N(0, 1)   (exactly)
Say x̄_l is the critical value that cuts off an area of α in the upper tail of the distribution of X̄ under H₀. That is

P(X̄ > x̄_l) = α

This also implies

P( (X̄ − μ₀)/(σ/√n) > (x̄_l − μ₀)/(σ/√n) ) = α

where z_α = (x̄_l − μ₀)/(σ/√n) is the value of Z which cuts off an area of α in the upper tail of the standard normal distribution.

Since x̄_l = μ₀ + z_α σ/√n, and x̄ > x̄_l implies (realised) z > z_α, we have the following two equivalent decision rules:

reject H₀ if z > z_α
reject H₀ if x̄ > x̄_l = μ₀ + z_α σ/√n

Example:
A cereal manufacturer claims that the mean fat content of
its cereal packets is 2.2 grams. Assume the fat content per
packet is approximately normally distributed with a
standard deviation of 0.6 grams. A consumer organisation
suspects that the mean fat content per packet is higher
than the manufacturer's claim. It tests a random sample
of 25 packets of cereal and finds a sample mean fat
content of 2.4 grams. On the basis of this information, test
the manufacturer's claim against a suitable alternative at the α = 0.05 level of significance.

Step 1:
Label the random variable of interest and formulate the
null and alternative hypotheses.

Step 2:
Identify the appropriate sampling distribution of the test
statistic under the null hypothesis H 0 .

Step 3:
Find the critical value(s) of the test statistic.

Step 4:
State the decision rule.

Step 5:
Calculate the test statistic based on a realized sample.

Step 6:
Compare the realized value of the test statistic and the
critical value(s) to reach a conclusion.
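For reference, the whole test can be checked numerically once the steps have been worked by hand (an illustrative Python sketch, assuming scipy):

from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar, alpha = 2.2, 0.6, 25, 2.4, 0.05

z = (xbar - mu0) / (sigma / sqrt(n))   # realized test statistic ~= 1.667
z_crit = norm.ppf(1 - alpha)           # critical value = 1.645

print(z, z_crit, z > z_crit)           # True: reject H0 at the 5% level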

Analogously, to test
H₀: μ = μ₀
against
H₁: μ < μ₀
the decision rule would be

reject H₀ if z < −z_α (or x̄ < x̄_s = μ₀ − z_α σ/√n)

where x̄_s is the critical value that cuts off an area of α in the lower tail of the distribution of X̄.

For a two-tail hypothesis test where
H₀: μ = μ₀
H₁: μ ≠ μ₀

we must divide the rejection region into upper and lower tail parts. In this case the decision rule will be

reject H₀ if z < −z_{α/2} or z > z_{α/2}
(alternatively: x̄ < x̄_s or x̄ > x̄_l)

Example:
Suppose that in the previous example the consumer
organisation is as concerned about too low an average fat
content as too high an average fat content. Retaining all
the other details of the previous example, test the
manufacturer's claim against H₁: μ ≠ 2.2 (α = 0.05).
Step 1:
Label the random variable of interest and formulate the
null and alternative hypotheses.

Step 2:
Identify the appropriate sampling distribution of the test
statistic under the null hypothesis H 0 .


Step 3:
Find the critical value(s) of the test statistic.

Step 4:
State the decision rule.

Step 5:
Calculate the test statistic based on a realized sample.

Step 6:
Compare the realized value of the test statistic and the
critical value(s) to reach a conclusion.

9.3 Hypothesis Tests of the Mean When the Population is Non-Normal, σ² Known or Unknown, n ≥ 30
Again, suppose the null hypothesis is
H₀: μ = μ₀
In this situation, given the central limit theorem, we know, under H₀

X̄ ~ N(μ₀, σ²/n)   (approximately)

Z = (X̄ − μ₀)/(σ/√n) ~ N(0, 1)   (approximately)

We can then proceed as for (i) above, replacing σ by s in the formulae for z, x̄_s and x̄_l if σ is unknown.

Example:
Suppose a similar context as in the previous cereal packet
example except that
the population is non-normal with unknown variance
the sample size is 49
the sample standard deviation is 0.8
Again perform a two-tail test of the manufacturer's claim at the α = 0.05 level of significance, assuming x̄ = 2.4 as before.


Step 1:
Label the random variable of interest and formulate the
null and alternative hypotheses.

Step 2:
Identify the appropriate sampling distribution of the test
statistic under the null hypothesis H 0 .

Step 3:
Find the critical value(s) of the test statistic.

Step 4:
State the decision rule.

Step 5:
Calculate the test statistic based on a realized sample.

Step 6:
Compare the realized value of the test statistic and the
critical value(s) to reach a conclusion.

9.4 The T Distribution

If we have a small sample size (n < 30) from a normal population and σ² is unknown, we cannot consider (X̄ − μ)/(S/√n) as being approximately N(0, 1). In fact the true distribution of (X̄ − μ)/(S/√n) in this situation is that of the T distribution (or Student's T distribution).

The random variable T = (X̄ − μ)/(S/√n) is called the T score.
For large n the T distribution approaches the standard normal distribution. It is bell-shaped and symmetric about 0, but has fatter tails than the normal distribution.

T = (X̄ − μ)/(S/√n) is said to be T distributed with (n − 1) degrees of freedom.

The T score value that cuts off an area of α in the right-hand tail of a T distribution with ν (nu) degrees of freedom is denoted t_{α,ν}. Values of t_{α,ν} are tabulated for various common combinations of α and ν (see Reference file Appendix Table 4).

(Extract of Appendix 6)
CRITICAL VALUES OF THE T DISTRIBUTION
The table below gives critical values of T for given probability levels (right-tail areas).

Degrees of
Freedom, ν   t(.10)   t(.05)   t(.025)   t(.01)   t(.005)
  1          3.078    6.314    12.706    31.821   63.657
  2          1.886    2.920    4.303     6.965    9.925
  3          1.638    2.353    3.192     4.541    5.841
  4          1.533    2.132    2.776     3.747    4.604
  5          1.476    2.015    2.571     3.365    4.032
  6          1.440    1.943    2.447     3.143    3.707
  7          1.415    1.895    2.365     2.998    3.499
  8          1.397    1.860    2.306     2.896    3.355
  9          1.383    1.833    2.262     2.821    3.250
 10          1.372    1.812    2.228     2.764    3.169
 11          1.363    1.796    2.201     2.718    3.106
 12          1.356    1.782    2.179     2.681    3.055
 13          1.350    1.771    2.160     2.650    3.012
 ...
 40          1.303    1.684    2.021     2.423    2.704
 60          1.296    1.671    2.000     2.390    2.660
120          1.290    1.661    1.984     2.358    2.626
 ∞           1.282    1.645    1.960     2.326    2.576

Example:
Using the T table find t_{0.05,10} and t_{0.025,40}. (1.812, 2.021)

Armed with the T distribution we can construct a confidence interval for μ using a small sample size from a normal population with an unknown variance. Similar algebra and reasoning to that used previously gives the following (1 − α)100% confidence interval in these circumstances:

( x̄ − t_{α/2, n−1} s/√n ,  x̄ + t_{α/2, n−1} s/√n )

Example:
A large company employing 500 salespersons takes a
random sample of size 25 of these salespersons' expense
account amounts for a particular month. The sample
mean is found to be $210 with a sample standard deviation
of $30. Construct a 95% confidence interval for the mean
expense account amount for the month in question,
assuming the expense account amounts are approximately
normally distributed. ($197.62, $222.38)
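A numerical check of this small-sample interval (an illustrative Python sketch, assuming scipy):

from math import sqrt
from scipy.stats import t

xbar, s, n, conf = 210, 30, 25, 0.95
t_crit = t.ppf(1 - (1 - conf) / 2, df=n - 1)   # t_{0.025,24} = 2.064

half_width = t_crit * s / sqrt(n)
print(xbar - half_width, xbar + half_width)    # ($197.62, $222.38)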

MAIN POINTS
A hypothesis test involves testing some claim about a
population parameter. We will be led to reject the claim if
the sample result obtained (i.e. value of the test statistic) is
highly unlikely assuming the claim is true.
In the context of a hypothesis test:
Type I error: rejecting the null hypothesis (claim) when it
is true.
Type II error: not rejecting the null hypothesis when it is
false.
The rejection region for a hypothesis test is determined
by the significance level , and the distribution of the test
statistic assuming the null hypothesis is true.
If X is normally distributed with unknown variance, T = (X̄ − μ)/(S/√n) is a T distributed random variable with (n − 1) degrees of freedom (even if n is small).

200052 INTRODUCTION TO ECONOMIC METHODS
LECTURE - WEEK 10
Required Reading:
Ref. File 5: Section 5.5
Ref. File 7: Section 7.8
Ref. File 8: Sections 8.6, 8.7, 8.9
Ref. File 9: Sections 9.1 and 9.2
9.5 Hypothesis Tests about the Mean when the Population is Normal, σ² is Unknown and n < 30 (or larger)
Again assuming H₀: μ = μ₀ is true, we have that

T = (X̄ − μ₀)/(S/√n)

is distributed as a T random variable with (n − 1) degrees of freedom.
As before, define t_{α, n−1} and t_{α/2, n−1} as

P(T_{n−1} > t_{α, n−1}) = α
P(T_{n−1} > t_{α/2, n−1}) = α/2


Then the appropriate critical values will be given by (assuming a significance level of α):

t_{α, n−1} (or x̄_l = μ₀ + t_{α, n−1} s/√n) for an upper-tail test (H₁: μ > μ₀)

−t_{α, n−1} (or x̄_s = μ₀ − t_{α, n−1} s/√n) for a lower-tail test (H₁: μ < μ₀)

−t_{α/2, n−1} and t_{α/2, n−1} (or x̄_s = μ₀ − t_{α/2, n−1} s/√n and x̄_l = μ₀ + t_{α/2, n−1} s/√n) for a two-tail test (H₁: μ ≠ μ₀)

Example:
A manufacturer of radial tyres claims the mean tread life of its tyres is at least 60,000 kilometres. The tyre tread life is known to be approximately normally distributed. To test the manufacturer's claim, 16 of the tyres are selected at random and tested. The sample yields a mean tread life of 56,000 kilometres and a sample standard deviation of 5,250 kilometres. Perform the test of the manufacturer's claim against a lower tail alternative, assuming α = 0.05.

Step 1:
Label the random variable of interest and formulate the
null and alternative hypotheses.

Step 2:
Identify the appropriate sampling distribution of the test
statistic under the null hypothesis H 0 .

Step 3:
Find the critical value(s) of the test statistic.

Step 4:
State the decision rule.

Step 5:
Calculate the test statistic based on a realized sample.


Step 6:
Compare the realized value of the test statistic and the
critical value(s) to reach a conclusion.
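The tyre test can be verified numerically after working the steps (an illustrative Python sketch, assuming scipy):

from math import sqrt
from scipy.stats import t

mu0, n, xbar, s, alpha = 60000, 16, 56000, 5250, 0.05

t_stat = (xbar - mu0) / (s / sqrt(n))     # ~= -3.048
t_crit = -t.ppf(1 - alpha, df=n - 1)      # -t_{0.05,15} = -1.753

print(t_stat, t_crit, t_stat < t_crit)    # True: reject H0 at the 5% level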

9.6 P-Values (Prob-Values)


Definition (Probability Value of a Test)
The probability value or p-value of a test is the smallest
level of significance for which the observed test statistic
would indicate rejection of the null hypothesis.
For an upper-tail test,
p-value = P(test statistic ≥ sample value of test statistic)
For a lower-tail test,
p-value = P(test statistic ≤ sample value of test statistic)
For a two-tail test, if the observed test statistic is in the upper tail of the distribution of this statistic under the null hypothesis, then
p-value = 2 P(test statistic ≥ sample value of test statistic)
For a two-tail test, if the observed test statistic is in the lower tail of the distribution of this statistic under the null hypothesis, then
p-value = 2 P(test statistic ≤ sample value of test statistic)

For example, suppose the null hypothesis is H₀: μ = μ₀ and that x̄ is obtained from the sample. Then, assuming the null hypothesis is true:
For an upper-tail test, p-value = P(X̄ ≥ x̄)
For a lower-tail test, p-value = P(X̄ ≤ x̄)
For a two-tail test, p-value = 2 P(X̄ ≥ x̄) if x̄ is to the right of μ₀, and p-value = 2 P(X̄ ≤ x̄) if x̄ is to the left of μ₀.
Example:
Recall the previous example where a cereal manufacturer
claimed the mean fat content of cereal packets was 2.2
grams. The fat content per packet was approximately
normally distributed with a standard deviation of 0.6
grams. A consumer organization suspected that the mean fat content per packet was higher than the manufacturer's claim. It tested a random sample of 25 packets of cereal and found a sample mean fat content of 2.4 grams. Calculate the p-value for an upper tail test of the manufacturer's claim. (0.0475)
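The p-value can be computed directly (an illustrative Python sketch, assuming scipy):

from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar = 2.2, 0.6, 25, 2.4
z = (xbar - mu0) / (sigma / sqrt(n))   # ~= 1.667

print(1 - norm.cdf(z))   # upper-tail p-value ~= 0.0478 (0.0475 if z is rounded to 1.67 as in the table)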

10. INFERENCE ABOUT A POPULATION
PROPORTION
10.1 The Sample Proportion of Successes
If we divide a binomial random variable X by the number of trials n, we obtain the proportion of successes (in n trials) or the sample proportion, normally denoted p̂. p̂ can be considered an estimator of the binomial distribution parameter p, otherwise called the population proportion.
For given n and p, the probability distribution of the random variable p̂ has the same shape as the probability distribution of the binomial random variable X. We have

P(p̂ = x/n) = P(X = x)

E(p̂) = p

Also

Var(p̂) = p(1 − p)/n
Example:
Suppose 2% of integrated circuits produced by a
particular process are defective. A manufacturer of
radios purchases 20 untested circuits. What is the
probability that at most 5% of these circuits prove to be
defective? (0.9401)
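Since at most 5% of 20 circuits means X ≤ 1, the exact binomial probability can be checked as follows (an illustrative Python sketch, assuming scipy):

from scipy.stats import binom

print(binom.cdf(1, n=20, p=0.02))   # P(X <= 1) = 0.9401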

10.2 The Large Sample Distribution of the Sample Proportion
We have seen that probabilities relating to a binomial
random variable X can be approximated using the normal
distribution provided np 5 and n(1 p ) 5 . Under the
same circumstances probabilities relating to the sample
proportion p can be approximated using the normal
distribution, setting the mean and variance of the normal
distribution equal to the mean and variance of p .

We know already that

E(p̂) = p

and

Var(p̂) = p(1 − p)/n

Therefore, given that the conditions for using the normal approximation are valid

p̂ ~ N(p, p(1 − p)/n)   (approximately)

Example:
It is known that 10% of televisions produced by a given
manufacturer have a minor defect. A retailer buys 100
televisions from the manufacturer.
What is the
probability that more than 15% of these televisions have
minor defects? (0.0475)

10.3 Large Sample Confidence Intervals for Proportions
We have, if np ≥ 5 and n(1 − p) ≥ 5:

Z = (p̂ − p)/√(p(1 − p)/n) ~ N(0, 1)   (approximately)

Then for a given α, we have, if the conditions for the normal approximation hold

P( −z_{α/2} < (p̂ − p)/√(p(1 − p)/n) < z_{α/2} ) = 1 − α

or after rearrangement

P( p̂ − z_{α/2} √(p(1 − p)/n) < p < p̂ + z_{α/2} √(p(1 − p)/n) ) = 1 − α

Obviously, p is unknown, but for large samples it will be approximately true that

P( p̂ − z_{α/2} √(p̂(1 − p̂)/n) < p < p̂ + z_{α/2} √(p̂(1 − p̂)/n) ) = 1 − α

The required confidence interval is thus, for a particular p̂ obtained

( p̂ − z_{α/2} √(p̂(1 − p̂)/n) ,  p̂ + z_{α/2} √(p̂(1 − p̂)/n) )

Example:
Suppose a new product is to be launched on the market.
To gain an idea of the potential market size, 1,000 people
selected at random are shown the product and asked if
they would be willing to buy it. The proportion of these
people who would be willing to buy is found to be 0.065
(i.e. 65 people). Calculate a 90% confidence interval for
the proportion of all people who would buy the product.
(0.052, 0.078)
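The proportion interval can be reproduced numerically (an illustrative Python sketch, assuming scipy):

from math import sqrt
from scipy.stats import norm

p_hat, n, conf = 0.065, 1000, 0.90
z = norm.ppf(1 - (1 - conf) / 2)            # 1.645

half_width = z * sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - half_width, p_hat + half_width)   # (0.052, 0.078)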

It can be shown that the sample size required for one to be (1 − α)100% confident that a realized sample proportion p̂ is within D units of p is given by

n ≥ z_{α/2}² p(1 − p) / D²

Unfortunately this depends on p, which is unknown before the sample is taken. However, p(1 − p) reaches a maximum of 0.25 when p = 0.5. Hence choosing n such that

n ≥ z_{α/2}² (0.5)(0.5) / D²

will give the largest possible sample required.


Example:
In the context of the last example what sample size would
have allowed us to be 90% confident of a sampling error
D less than 0.01? (6,765)

10.4 Large Sample Tests of the Population Proportion

Suppose we wish to test a null hypothesis about a population proportion, H₀: p = p₀, against some appropriate alternative, using the sample proportion as the test statistic.

If np₀ ≥ 5 and n(1 − p₀) ≥ 5, then under the null hypothesis:

p̂ ~ N(p₀, p₀(1 − p₀)/n)   (approximately)

Z = (p̂ − p₀)/√(p₀(1 − p₀)/n) ~ N(0, 1)   (approximately)

Given α, we can then proceed to formulate one and two-tail tests for p in a similar fashion to that used for μ.

Example:
A public transport company claims that no more than 5%
of its customers are dissatisfied with its new electronic
ticketing system. 200 customers of the company are
selected at random and asked whether they are dissatisfied
with the ticketing system. Of these, 13 state they are
dissatisfied. Perform an upper tail hypothesis test of the company's claim using the α = 0.05 significance level.
Step 1:
Label the random variable of interest and formulate the
null and alternative hypotheses.

Step 2:
Identify the appropriate sampling distribution of the test
statistic under the null hypothesis H 0 .

Step 3:
Find the critical value(s) of the test statistic.

Step 4:
State the decision rule.

Step 5:
Calculate the test statistic based on a realized sample.

Step 6:
Compare the realized value of the test statistic and the
critical value(s) to reach a conclusion.
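A numerical check of this proportion test (an illustrative Python sketch, assuming scipy):

from math import sqrt
from scipy.stats import norm

p0, n, x, alpha = 0.05, 200, 13, 0.05
p_hat = x / n                                 # 0.065

z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)    # ~= 0.973
z_crit = norm.ppf(1 - alpha)                  # 1.645

print(z, z_crit, z > z_crit)   # False: do not reject the company's claim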

11. THE CHI-SQUARED DISTRIBUTION AND
INFERENCE CONCERNING A POPULATION
VARIANCE
11.1 The Chi-Square Distribution
The sum of squares of ν (nu) independent standard normal random variables is said to be Chi-square distributed with ν degrees of freedom, written χ²_ν.

A variable that is chi-square distributed can take any non-negative value. For small values of ν the chi-square distribution is positively skewed but becomes more symmetric and bell-shaped as ν becomes large.

The value of the chi-square random variable which cuts off an area of α in the right tail of a chi-square distribution with ν degrees of freedom is denoted χ²_{α,ν}. That is, χ²_{α,ν} is such that

P(χ²_ν > χ²_{α,ν}) = α

A table is used to determine χ²_{α,ν} for various commonly used values of α and ν.

(Extract of Appendix 7)
CRITICAL VALUES OF THE CHI-SQUARE DISTRIBUTION
The table below gives critical values of χ²_{α,ν} for the given probability levels (right-tail areas α).

  ν   χ²(.995)   χ²(.99)    χ²(.975)   χ²(.95)    χ²(.90)    χ²(.10)   χ²(.05)   χ²(.025)  χ²(.01)   χ²(.005)
  1   0.0000393  0.0001571  0.0009821  0.0039321  0.0157908  2.70554   3.84146   5.02389   6.63490   7.87944
  2   0.0100251  0.0201007  0.0506356  0.102587   0.210720   4.60517   5.99147   7.37776   9.21034   10.5966
  3   0.0717212  0.114832   0.215795   0.351846   0.584375   6.25139   7.81473   9.34840   11.3449   12.8381
  4   0.206990   0.297110   0.484419   0.710721   1.063623   7.77944   9.48773   11.1433   13.2767   14.8602
  5   0.411740   0.554300   0.831211   1.145476   1.61031    9.23635   11.0705   12.8325   15.0863   16.7496
  6   0.675727   0.872085   1.237347   1.63539    2.20413    10.6446   12.5916   14.4494   16.8119   18.5476
  7   0.989265   1.239043   1.68987    2.16735    2.83311    12.0170   14.0671   16.0128   18.4753   20.2777
  8   1.344419   1.646482   2.17973    2.73264    3.48954    13.3616   15.5073   17.5346   20.0902   21.9550
  9   1.734926   2.087912   2.70039    3.32511    4.16816    14.6837   16.9190   19.0228   21.6660   23.5893
 10   2.15585    2.55821    3.24697    3.94030    4.86518    15.9871   18.3070   20.4831   23.2093   25.1882
 11   2.60321    3.05347    3.81575    4.57481    5.57779    17.2750   19.6751   21.9200   24.7250   26.7569
 12   3.07382    3.57056    4.40379    5.22603    6.30380    18.5494   21.0261   23.3367   26.2170   28.2995
 13   3.56503    4.10691    5.00874    5.89186    7.04150    19.8119   22.3621   24.7356   27.6883   29.8194
 14   4.07468    4.66043    5.62872    6.57063    7.78953    21.0642   23.6848   26.1190   29.1413   31.3193
 15   4.60094    5.22935    6.26214    7.26094    8.54675    22.3072   24.9958   27.4884   30.5779   32.8013
 ...
 30   13.7867    14.9535    16.7908    18.4926    20.5992    40.2560   43.7729   46.9792   50.8922   53.6720
 40   20.7065    22.1643    24.4331    26.5093    29.0505    51.8050   55.7585   59.3417   63.6907   66.7659
 50   27.9907    29.7067    32.3574    34.7642    37.6886    63.1671   67.5048   71.4202   76.1539   79.4900
 60   35.5346    37.4848    40.4817    43.1879    46.4589    74.3970   79.0819   83.2976   88.3794   91.9517
 70   43.2752    45.4418    48.7576    51.7393    55.3290    85.5271   90.5312   95.0231   100.425   104.215
 80   51.1720    53.5400    57.1532    60.3915    64.2778    96.5782   101.879   106.629   112.329   116.321
 90   59.1963    61.7541    65.6466    69.1260    73.2912    107.565   113.145   118.136   124.116   128.299
100   67.3276    70.0648    74.2219    77.9295    82.3581    118.498   124.342   129.561   135.807   140.169

Example:
Find χ²_{0.01,15}, χ²_{0.975,15} and χ²_{0.025,10}. (30.5779, 6.26214, 20.4831)
11.2 Confidence Intervals for a Population Variance
Given a random sample of a random variable X of size n, we know

S² = Σ_{i=1}^{n} (X_i − X̄)² / (n − 1)

is an unbiased estimator of the population variance σ².

We have the following important result.

Theorem (The Sample Variance and the Chi-Square Distribution)
Suppose X₁, X₂, ..., Xₙ is a random sample of a normally distributed random variable X with a mean of μ (= E(X)) and variance σ² (= Var(X)). Then
(i) The sample mean X̄ and the sample variance S² will be independently distributed.
And
(ii) (n − 1)S²/σ² = Σ_{i=1}^{n} (X_i − X̄)²/σ² ~ χ²_{n−1}
We define:
χ²₁ = χ²_{1−α/2, n−1} : the value of the χ²_{n−1} variable which cuts off an area of α/2 in the left-hand tail of the distribution.
χ²₂ = χ²_{α/2, n−1} : the value of the χ²_{n−1} variable which cuts off an area of α/2 in the right-hand tail of the distribution.

Replacing the χ²_{n−1} variable by (n − 1)S²/σ² in the probability statement

P(χ²₁ < χ²_{n−1} < χ²₂) = 1 − α

gives

P( χ²₁ < (n − 1)S²/σ² < χ²₂ ) = 1 − α

By rearrangement we obtain

P( (n − 1)S²/χ²₂ < σ² < (n − 1)S²/χ²₁ ) = 1 − α

Then a (1 − α)100% confidence interval for σ² will be given by

( (n − 1)s²/χ²₂ ,  (n − 1)s²/χ²₁ )

Example:
A particular automotive engineering firm manufactures a
specific machined part used in motor vehicles. A random
sample of 15 of the parts from the firm yielded a diameter
sample variance of 0.0015 (diameter is measured in
centimetres). Assuming the diameter of the part is
approximately normally distributed, use this information
to calculate a 95% confidence interval for the variance of
the part. (0.0008, 0.0037)
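A numerical check of the variance interval (an illustrative Python sketch, assuming scipy):

from scipy.stats import chi2

s2, n, conf = 0.0015, 15, 0.95
alpha = 1 - conf

chi2_upper = chi2.ppf(1 - alpha / 2, df=n - 1)   # chi2_{0.025,14} = 26.1190
chi2_lower = chi2.ppf(alpha / 2, df=n - 1)       # chi2_{0.975,14} = 5.62872

print((n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower)   # (0.0008, 0.0037)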

11.3 Hypothesis Tests for a Population Variance
Suppose we wish to test the hypothesis
H₀: σ² = σ₀²
against
H₁: σ² ≠ σ₀²
where σ₀² is the hypothesised variance.

Assuming H₀ is true, and supposing a significance level of α, we can write

P( χ²_{1−α/2, n−1} < (n − 1)S²/σ₀² < χ²_{α/2, n−1} ) = 1 − α

Our decision rule will be:

Reject H₀: σ² = σ₀² if the test statistic χ² = (n − 1)s²/σ₀² is such that

χ² < χ²_{1−α/2, n−1}
or
χ² > χ²_{α/2, n−1}

Similarly we can formulate lower and upper tail tests for the population variance.

For testing H₀: σ² = σ₀² against H₁: σ² > σ₀², the decision rule is

Reject H₀ if (n − 1)s²/σ₀² > χ²_{α, n−1}

For testing H₀: σ² = σ₀² against H₁: σ² < σ₀², the decision rule is

Reject H₀ if (n − 1)s²/σ₀² < χ²_{1−α, n−1}

Example:
Suppose the firm manufacturing the machined engine part considered in the previous example claims its parts have a diameter variance σ² of no more than 0.0013 (again diameter is measured in centimetres). Again suppose a random sample of 15 of the parts from the manufacturer yielded a diameter sample variance of 0.0015. Given this information, perform a test of H₀: σ² = 0.0013 against H₁: σ² > 0.0013 at the 5% level of significance. Again assume the diameter of the part is approximately normally distributed.
Step 1:
Label the random variable of interest and formulate the
null and alternative hypotheses.


Step 2:
Identify the appropriate sampling distribution of the test
statistic under the null hypothesis H 0 .

Step 3:
Find the critical value(s) of the test statistic.

Step 4:
State the decision rule.

Step 5:
Calculate the test statistic based on a realized sample.

Step 6:
Compare the realized value of the test statistic and the
critical value(s) to reach a conclusion.
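A numerical check of this upper-tail variance test (an illustrative Python sketch, assuming scipy):

from scipy.stats import chi2

sigma0_sq, s2, n, alpha = 0.0013, 0.0015, 15, 0.05

chi2_stat = (n - 1) * s2 / sigma0_sq          # ~= 16.15
chi2_crit = chi2.ppf(1 - alpha, df=n - 1)     # chi2_{0.05,14} = 23.68

print(chi2_stat, chi2_crit, chi2_stat > chi2_crit)   # False: do not reject H0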

MAIN POINTS
The p-value of a test is the probability (under the null
hypothesis) of obtaining a value of the test statistic equal
to or more extreme (in the direction(s) of rejection) than
the value obtained.
When testing hypotheses about (or forming confidence intervals for) μ:
1. If the population is normal, σ² known: use the normal distribution
2. If the population is normal, σ² unknown: use the T distribution
3. If the population is not normal, σ² known, n large: use the normal distribution
4. If the population is normal or not normal, σ² unknown, n large: use the normal distribution and replace σ by s
The sample proportion of successes p̂ is an unbiased estimator of the population proportion (binomial parameter p).
The normal distribution can be used to approximate probabilities relating to p̂ in the same circumstances that the normal distribution is used to approximate the binomial distribution.
The sample proportion of successes can be used to form confidence intervals for p, and to test hypotheses about p.


The Chi-square distribution is the distribution of a sum of independent squared standard normal variables.
The Chi-square distribution is used to test hypotheses and form confidence intervals about/for a single population variance, under the assumption that the population is normally distributed.

200052 INTRODUCTION TO ECONOMIC METHODS
LECTURE - WEEK 11
Required Reading:
Ref. File 10: Sections 10.1, 10.3, 10.4(a), 10.4(c)
Ref. File 11: Introduction and Sections 11.1(a) and 11.1(b)
12. INTRODUCTION TO CORRELATION ANALYSIS
IN THE CONTEXT OF CROSS-SECTIONAL
DATA
Definition (Deterministic and Stochastic Relationships)
Given two jointly distributed and statistically dependent
random variables X and Y, the random variable Y is
deterministically related to X if Y can be expressed as an
exact function of X only. Otherwise Y is stochastically
related to X.
Our concern now is to assess any linear relationship
between two stochastically related jointly distributed
random variables.

Definition (Sample Covariance between Two Random
Variables)
Suppose the n pairs of random variables
( X 1 ,Y1 ),..., ( X n , Yn ) represent a random sample of two
jointly distributed random variables X and Y. Then the
sample covariance (a random variable) between X and Y
based on these observations is defined as

Sample Covariance between X and Y:

S_{X,Y} = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / (n − 1)

A sample covariance calculated from a particular sample is denoted s_{X,Y}.

Definition (Population Covariance)
The population covariance of two jointly distributed random variables X and Y with means E(X) = μ_X and E(Y) = μ_Y is defined as

Cov(X, Y) = σ_{X,Y} = E[(X − E(X))(Y − E(Y))] = E[(X − μ_X)(Y − μ_Y)]
          = E(XY) − E(X)E(Y)

The value of the covariance is dependent on the units of measure. To overcome this, the deviations from the mean in the covariance formula are standardized by dividing by the appropriate standard deviations, yielding the population and sample correlation coefficients.
Definition (Population Correlation Coefficient)
The population correlation coefficient ρ_{X,Y} (rho) of two jointly distributed random variables X and Y with finite and non-zero standard deviations is defined as

ρ_{X,Y} = E[(X − E(X))(Y − E(Y))] / (σ_X σ_Y) = Cov(X, Y) / (σ_X σ_Y)

where σ_X = SD(X) and σ_Y = SD(Y).

The population correlation coefficient indeed measures the degree of linear relationship between two random variables, in the sense that it measures the closeness of the population data points to a reference straight line (the best linear predictor, cf. reference file Section 10.4(b) (optional reading)).
Theorem (Properties of the Population Correlation Coefficient)
With respect to two jointly distributed random variables, say X and Y, both with finite non-zero variances:
(i) The population correlation coefficient between the variables can only take values between −1 and 1 inclusive (i.e. −1 ≤ ρ_{X,Y} ≤ 1).
(ii) If ρ_{X,Y} = −1, there is a deterministic negative linear relationship between the random variables X and Y, whilst if ρ_{X,Y} = 1, there is a deterministic positive linear relationship between X and Y.
(iii) If |ρ_{X,Y}| = 1 and thus there is a deterministic linear relationship between the variables, this relationship can be written as Y = a + bX, where a and b are constants given by

b = σ_{X,Y} / σ_X²   and   a = μ_Y − (σ_{X,Y} / σ_X²) μ_X
Definition (Sample Correlation Coefficient between Two Random Variables)
Suppose the n pairs of random variables (X₁, Y₁), ..., (Xₙ, Yₙ) represent a random sample of the two jointly distributed random variables X and Y. Then the sample correlation coefficient R_{X,Y} (a random variable) between X and Y based on these observations is defined as

R_{X,Y} = (1/(n−1)) Σ_{i=1}^{n} [(X_i − X̄)/S_X][(Y_i − Ȳ)/S_Y]
        = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / [(n − 1) S_X S_Y]
        = S_{X,Y} / (S_X S_Y)

Where, as before:

S_X = √[ Σ_{i=1}^{n} (X_i − X̄)² / (n − 1) ]   and   S_Y = √[ Σ_{i=1}^{n} (Y_i − Ȳ)² / (n − 1) ]

Under the assumption that both X and Y are bivariate normally distributed, it is possible to test whether the true population correlation coefficient ρ_{X,Y} is equal to zero.

Theorem (Distribution of a Sample Statistic to Test whether the Correlation Coefficient between Bivariate Normally Distributed Variables is Zero)
Suppose that we have at our disposal a random sample (X₁, Y₁), ..., (Xₙ, Yₙ) of bivariate normally distributed random variables X and Y. Then under the null hypothesis ρ_{X,Y} = 0 we have the distribution result

R_{X,Y} √(n − 2) / √(1 − R²_{X,Y}) ~ T_{n−2}   (i.e. under H₀: ρ_{X,Y} = 0)

where R_{X,Y} is the sample correlation coefficient between X and Y.
Example:
Suppose that the sample correlation coefficient between
results in first year economics (X) and accountancy (Y) at
a particular university is found to be 0.46, using a sample
of 42 students. Assuming X and Y are bivariate normally
distributed, test whether the population correlation
coefficient between these variables is equal to zero, using a two-tail test with α = 0.05.
Step 1:
Label the random variable of interest and formulate the
null and alternative hypotheses.

Step 2:
Identify the appropriate sampling distribution of the test
statistic under the null hypothesis H 0 .

Step 3:
Find the critical value(s) of the test statistic.

Step 4:
State the decision rule.

Step 5:
Calculate the test statistic based on a realized sample.

Step 6:
Compare the realized value of the test statistic and the
critical value(s) to reach a conclusion.
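A numerical check of the correlation test (an illustrative Python sketch, assuming scipy):

from math import sqrt
from scipy.stats import t

r, n, alpha = 0.46, 42, 0.05

t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)   # ~= 3.28
t_crit = t.ppf(1 - alpha / 2, df=n - 2)       # t_{0.025,40} = 2.021

print(t_stat, t_crit, abs(t_stat) > t_crit)   # True: reject H0: rho = 0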

13. INTRODUCTION TO REGRESSION ANALYSIS
IN THE CONTEXT OF CROSS-SECTIONAL
DATA
13.1 Introduction

Regression analysis is concerned with estimating the


relationship between the mean of a given variable Y, called
the dependent variable, and one or more independent
variables. We will confine our introduction to considering
the relationship between a dependent variable and one
independent variable. We will start by considering the
case where the two variables are jointly distributed
random variables related in a linear way.

An equation which specifies an exact relationship between
variables is called a deterministic model.

In situations where we feel there is a non-deterministic relationship between two random variables X and Y, we are commonly led to postulate a stochastic model of the form

Y = g(X) + U

where:
U is the random error or random disturbance, i.e. a random variable with a probability distribution.
E(Y | X) = g(X) is the deterministic component of the model, which specifies how the mean of Y is related to X. E(Y | X) is called the conditional expectation function (CEF) of Y given X.
E(Y | X) = g(X) implies that the expected value of the random disturbance in the model will be zero for all values of X.

We will be mainly concerned with linear stochastic models for which

E(Y | X) = β₁ + β₂X
13.2 Assumptions of the Neoclassical (Stochastic
Regressor) Simple Linear Regression (NSR) Model
(NSR-1) (Stochastic Regressor Assumption)
X and Y are jointly distributed random variables for
which a sample consisting of n paired observations
{( X 1 ,Y1 ), ( X 2 ,Y2 ),...., ( X n ,Yn )} is to be obtained.
(NSR-2) (Sample Variation in the Regressor Assumption)
There is variation in the values taken by X in the
observed sample.
(NSR-3) (Linear CEF Assumption)
E(Y_i | X₁, X₂, ..., Xₙ) = E(Y_i | X_i) = β₁ + β₂X_i   {i = 1, ..., n}
where β₁ and β₂ are constants.

The random disturbance (or random error) for observation i is defined as

U_i = Y_i − E(Y_i | X₁, X₂, ..., Xₙ)   {i = 1, ..., n}

(NSR-4) (Homoscedasticity or Constant Conditional Variance Assumption)
Var(U_i | X₁, X₂, ..., Xₙ) = σ²   {i = 1, ..., n}
where σ² is finite.

(NSR-5) (Non-Correlated Random Disturbances Assumption)
Cov(U_i, U_j | X₁, X₂, ..., Xₙ) = 0 for all i ≠ j   {i = 1, ..., n; j = 1, ..., n}

Although the NSR model as stated above does not assume that the observations necessarily represent a random sample of size n of the jointly distributed random variables X and Y (as assumed when introducing the correlation coefficient), for beginning students it is often more instructive to assume this is indeed the case.

13.3 The Least Squares Estimators of the NSR Model
In the context of the NSR model we wish to estimate the assumed linear conditional expectation function (CEF) (also called the population regression line)

E(Y | X) = β₁ + β₂X

Estimation of β₁ and β₂ in the linear CEF is based on a sample of n pairs of observations, which before they are actually observed can be represented by the pairs of random variables (X_i, Y_i) for (i = 1, ..., n) (cf. assumption (NSR-1)). If β̃₁ and β̃₂ denote any estimators of β₁ and β₂, the sample regression line can be written

Ỹ = β̃₁ + β̃₂X

The random quantity (i.e. before actual observations are taken) given by

Ũ_i = Y_i − Ỹ_i = Y_i − (β̃₁ + β̃₂X_i)

is called the random residual associated with the observation (X_i, Y_i).
The least squares estimators of β₁ and β₂, which we shall denote β̂₁ and β̂₂, are chosen such that the sum of the squared residuals is minimized. That is, β̂₁ and β̂₂ are the


optimal values of β̃₁ and β̃₂ chosen (using calculus) such that the sum of squared residuals (SSR)

Σ_{i=1}^{n} (Y_i − Ỹ_i)² = Σ_{i=1}^{n} (Y_i − β̃₁ − β̃₂X_i)²

is minimized. This leads to the following formulae for β̂₁ and β̂₂:

β̂₁ = Ȳ − β̂₂X̄

β̂₂ = [ Σ_{i=1}^{n} X_iY_i − n Ȳ X̄ ] / [ Σ_{i=1}^{n} X_i² − n X̄² ]
    = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^{n} (X_i − X̄)²
    = Σ_{i=1}^{n} (X_i − X̄)Y_i / Σ_{i=1}^{n} (X_i − X̄)²

We can also conveniently write

β̂₂ = [ Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ)/(n − 1) ] / [ Σ_{i=1}^{n} (X_i − X̄)²/(n − 1) ] = S_{X,Y} / S_X²

The least squares random residual associated with the observation (X_i, Y_i) is given by

Û_i = Y_i − Ŷ_i = Y_i − (β̂₁ + β̂₂X_i)

A scatter diagram depicts all the realized observations (x_i, y_i) in a sample. Normally the dependent variable is measured along the vertical axis.
The following example of a scatter diagram shows the relationship between realized y_i, ŷ_i and û_i (= y_i − ŷ_i).

[Figure: scatter diagram showing the sample regression line ŷ = β̂₁ + β̂₂x and the population regression line E(Y | X = x); for a sample point (x_i, y_i) the fitted point (x_i, ŷ_i) lies on the sample regression line, and the residual û_i is the vertical distance y_i − ŷ_i]

Assumptions (NSR-1) to (NSR-3) allow us to show the OLS estimators are unbiased.

Example:
Consider a retailer with stores in a number of different localities. Suppose the retailer's marketing manager believes there is a stochastic linear relationship between the amount her firm spends each month on local advertising (X) and monthly sales (Y) of a store in a particular locality. Fit a least squares regression line given the following local advertising and sales data (in thousands of dollars) for 12 randomly chosen stores of the firm in a particular month.

Store    Advertising (X)   Sales (Y)   x_i²     x_i y_i
1        5.0               250         25.00    1250.0
2        4.5               260         20.25    1170.0
3        7.0               280         49.00    1960.0
4        7.6               282         57.76    2143.2
5        5.0               265         25.00    1325.0
6        7.4               266         54.76    1968.4
7        9.0               280         81.00    2520.0
8        6.5               268         42.25    1742.0
9        6.2               265         38.44    1643.0
10       4.6               258         21.16    1186.8
11       5.8               263         33.64    1525.4
12       10.0              295         100.00   2950.0
Totals   78.6              3232        548.26   21383.8
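The least squares formulae above can be applied to these data directly; a short Python sketch (illustrative only, assuming the numpy library):

import numpy as np

x = np.array([5.0, 4.5, 7.0, 7.6, 5.0, 7.4, 9.0, 6.5, 6.2, 4.6, 5.8, 10.0])
y = np.array([250, 260, 280, 282, 265, 266, 280, 268, 265, 258, 263, 295])
n = len(x)

# Least squares slope and intercept from the formulae above
b2 = ((x * y).sum() - n * x.mean() * y.mean()) / ((x ** 2).sum() - n * x.mean() ** 2)
b1 = y.mean() - b2 * x.mean()
print(b1, b2)    # 227.3642..., 6.4075...: Y-hat = 227.3642 + 6.4075 X

# Residuals, SSR and the standard error of the regression (used below)
resid = y - (b1 + b2 * x)
ssr = (resid ** 2).sum()
print(ssr, np.sqrt(ssr / (n - 2)))   # SSR ~= 354.20, S_U ~= 5.95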

The unbiased least squares estimator of the disturbance standard deviation σ in the simple two variable neoclassical linear regression model is called the standard error of the regression and is given by

S_U = √( SSR / (n − 2) )

where as before

SSR = Σ_{i=1}^{n} Û_i² = Σ_{i=1}^{n} (Y_i − Ŷ_i)²

(the sum of the squared residuals)


The denominator in the formula for S_U is termed the degrees of freedom of SSR. It represents the sample size minus the number of parameters estimated (β₁ and β₂).
Example:
Calculate the realized value of S_U for the previous sales/advertising example.
We have Ŷ = 227.3642 + 6.4075 X

x_i    y_i    ŷ_i        y_i − ŷ_i    (y_i − ŷ_i)²
5.0    250    259.4017   -9.4017      88.39196
4.5    260    256.1980   3.80205      14.45558
7.0    280    272.2167   7.7833       60.57976
7.6    282    276.0612   5.9388       35.26935
5.0    265    259.4017   5.5983       31.34096
7.4    266    274.7797   -8.7797      77.08313
9.0    280    285.0317   -5.0317      25.31800
6.5    268    269.0130   -1.01295     1.026068
6.2    265    267.0907   -2.0907      4.371026
4.6    258    256.8387   1.1613       1.348618
5.8    263    264.5277   -1.5277      2.333867
10.0   295    291.4392   3.5608       12.67930
                         Sum (SSR):   354.1976

MAIN POINTS
The correlation coefficient measures the degree of linear relationship between two random variables.
Hypotheses about the population correlation coefficient can be tested using the sample correlation coefficient, assuming the two variables are bivariate normally distributed.
Regression analysis involves estimation of a stochastic model relating two or more variables.
The neoclassical simple linear regression (NSR) model (bivariate or stochastic regressor model) consists of a set of assumptions about a bivariate statistical population and sampling conditions that forms a basis for estimating an assumed linear CEF.
Least squares estimation involves using estimators of coefficients that minimize the sum of squared residuals (SSR), i.e. Σ(Y_i − Ŷ_i)².
The unbiased least squares estimator of the disturbance standard deviation σ in the NSR model is called the standard error of the regression and is given by

S_U = √( SSR / (n − 2) )
200052 INTRODUCTION TO ECONOMIC METHODS
LECTURE - WEEK 12
Required Reading:
Ref. File 2: Introduction and Sections 2.1(a), 2.2(a), 2.2(b)
Ref. File 11: Sections 11.3, 11.4(a), 11.6(a), 11.6(e), 11.6(f),
11.6(g)
13.4 The Explanatory Power of a Regression Equation
A measure of how superior the estimated regression line is to Ȳ in predicting the realized values of the dependent variable is the coefficient of determination, denoted R². R² is defined by decomposing the total variation of Y in the sample observations into "explained" and "unexplained" variations, where

Total variation = SST = Σ_{i=1}^{n} (Y_i − Ȳ)²   (Total Sum of Squares)

Explained variation = SSE = Σ_{i=1}^{n} (Ŷ_i − Ȳ)²   (Explained Sum of Squares)

Unexplained variation = SSR = Σ_{i=1}^{n} (Y_i − Ŷ_i)² = Σ_{i=1}^{n} Û_i²   (Residual Sum of Squares)


The following graph shows the relationship between the deviations (not the squared deviations) for a particular observation.

[Figure: scatter diagram with the sample regression line Ŷ = β̂₁ + β̂₂X and the horizontal line at ȳ; for an observation (x_i, y_i) the total deviation (y_i − ȳ) is split into the explained deviation (ŷ_i − ȳ) and the residual (y_i − ŷ_i)]

It can be shown that

SST = SSR + SSE

Then the sample coefficient of determination R² is defined as

R² = SSE/SST = 1 − SSR/SST

We will have

0 ≤ R² ≤ 1

If R² = 1 all the observations lie on a non-horizontal straight line. R² will equal 0 if β̂₂ = 0, i.e. if the sample regression line is horizontal.

Example:
Calculate R² for the sales/advertising example. (0.79 approx.)
We already know SSR = 354.1976 and ȳ = 269.3333.
We can proceed by calculating SST.

y_i    (y_i − ȳ)²
250    373.7764
260    87.1105
280    113.7785
282    160.4453
265    18.7774
266    11.1109
280    113.7785
268    1.7777
265    18.7775
258    128.4437
263    40.1107
295    658.7795
SST:   1726.6666

It can also be shown that for the simple 2-variable NSR model the coefficient of determination (R²) is simply the square of the sample correlation coefficient of X and Y. Hence, in the context of the NSR model, R and R² both show the strength of the linear relationship between X and Y.

13.5 Inference in the Neoclassical Simple Linear Regression Model
(a) Operational Distribution Results Related to β̂₁ and β̂₂
Theorem (Conditional Variances of the OLS Estimators of the NSR Model Intercept and Slope Parameters)
Under assumptions (NSR-1) to (NSR-5) of the neoclassical simple regression model the least squares estimators β̂₁ and β̂₂ have the following conditional variances:

Var(β̂₁ | X₁, ..., Xₙ) = σ²_{β̂₁|x} = σ² Σ_{i=1}^{n} X_i² / [ n Σ_{i=1}^{n} (X_i − X̄)² ]

Var(β̂₂ | X₁, ..., Xₙ) = σ²_{β̂₂|x} = σ² / Σ_{i=1}^{n} (X_i − X̄)²

(where x is used to denote (X₁, ..., Xₙ))

If we replace σ² in the above formulae by its least squares estimator S_U², we obtain corresponding estimators of the conditional variances as

S²_{β̂₁|x} = S_U² Σ_{i=1}^{n} X_i² / [ n Σ_{i=1}^{n} (X_i − X̄)² ]

S²_{β̂₂|x} = S_U² / Σ_{i=1}^{n} (X_i − X̄)²

To obtain exact distribution results for performing statistical inference with respect to the NSR model coefficients, we require an assumption regarding the distribution of the random disturbances. The commonly used assumption and distribution results are given in the following theorem.

Theorem 11.12
Suppose assumptions (NSR-1) to (NSR-5) of the
neoclassical simple linear regression model are satisfied,
and in addition the random disturbances $U_i$ ($i = 1,\dots,n$)
are multivariate normally distributed conditional on
$X_1, X_2, \dots, X_n$. Then
$$\frac{\hat{\beta}_1 - \beta_1}{S_{\hat{\beta}_1 \mid x}} \sim T_{n-2}$$
and
$$\frac{\hat{\beta}_2 - \beta_2}{S_{\hat{\beta}_2 \mid x}} \sim T_{n-2}$$

Addition of the assumption of multivariate normally
distributed random disturbances to the NSR model results
in what can be referred to as the neoclassical normal
simple linear regression model.
In this context we will confine ourselves to considering
only inference with respect to $\beta_2$.

(b) Confidence Intervals for $\beta_2$

We know that if $\beta_2$ is the true slope coefficient and the $U_i$
($i = 1,\dots,n$) are multivariate normally distributed
conditional on $X_1, X_2, \dots, X_n$ then
$$\frac{\hat{\beta}_2 - \beta_2}{S_{\hat{\beta}_2 \mid x}} \sim T_{n-2}$$
Thus
$$P\left(-t_{\alpha/2,\,n-2} \le \frac{\hat{\beta}_2 - \beta_2}{S_{\hat{\beta}_2 \mid x}} \le t_{\alpha/2,\,n-2}\right) = 1 - \alpha$$
After rearrangement, we obtain
$$P\left(\hat{\beta}_2 - t_{\alpha/2,\,n-2}\, S_{\hat{\beta}_2 \mid x} \le \beta_2 \le \hat{\beta}_2 + t_{\alpha/2,\,n-2}\, S_{\hat{\beta}_2 \mid x}\right) = 1 - \alpha$$
For a $\hat{\beta}_2$ and $s_{\hat{\beta}_2 \mid x}$ obtained from a particular sample
regression, the $(1-\alpha)100\%$ confidence interval will be
$$\left\{\hat{\beta}_2 - t_{\alpha/2,\,n-2}\, s_{\hat{\beta}_2 \mid x},\ \hat{\beta}_2 + t_{\alpha/2,\,n-2}\, s_{\hat{\beta}_2 \mid x}\right\}$$
(c) Testing Hypotheses About $\beta_2$

In the context of simple linear regression we commonly
wish to test the null hypothesis
$$H_0: \beta_2 = 0$$
against a one- or two-sided alternative.

Under $H_0$ we will have
$$\frac{\hat{\beta}_2 - 0}{S_{\hat{\beta}_2 \mid x}} = \frac{\hat{\beta}_2}{S_{\hat{\beta}_2 \mid x}} \sim T_{n-2}$$

Suppose $H_1: \beta_2 \ne 0$ and a significance level of $\alpha$. In this
case we reject $H_0$ if the calculated $t$ is such that
$$t < -t_{\alpha/2,\,n-2} \quad \text{or} \quad t > t_{\alpha/2,\,n-2}$$

One-tail tests can also be defined in the usual manner.

Example:
Reconsider the sales/advertising regression example with
the following results:
$$\hat{Y}_i = 227.3642 + 6.4075 X_i$$
$$n = 12, \qquad \sum_{i=1}^{12}(x_i - \bar{x})^2 = 33.43, \qquad s_U^2 = 35.41976$$
Perform a two-tail test of $H_0: \beta_2 = 0$ with $\alpha = 0.05$.
Test $H_0: \beta_2 = 0$ against $H_1: \beta_2 > 0$ with $\alpha = 0.05$.
Construct a 95% confidence interval for $\beta_2$. (4.1142,
8.7008)
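The numbers quoted above can be verified with a short sketch
(again assuming scipy, which the unit does not use):

```python
import math
from scipy import stats

n = 12
b2 = 6.4075                    # estimated slope
s2_U = 35.41976                # s_U^2, estimated disturbance variance
Sxx = 33.43                    # sum of (x_i - x_bar)^2
s_b2 = math.sqrt(s2_U / Sxx)   # estimated standard error of the slope

t_stat = b2 / s_b2                        # calculated t under H0: beta_2 = 0
t_crit = stats.t.ppf(0.975, df=n - 2)     # t_{0.025, 10}, about 2.228
print(t_stat, t_crit)                     # t_stat ~ 6.22 > 2.228 -> reject H0

# 95% confidence interval for beta_2
print(b2 - t_crit * s_b2, b2 + t_crit * s_b2)   # about (4.1142, 8.7008)
```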

First Test
Step 1:
Label the random variable of interest and formulate the
null and alternative hypotheses.

Step 2:
Identify the appropriate sampling distribution of the test
statistic under the null hypothesis H 0 .

Step 3:
Find the critical value(s) of the test statistic.

Step 4:
State the decision rule.

Step 5:
Calculate the test statistic based on a realized sample.

Step 6:
Compare the realized value of the test statistic and the
critical value(s) to reach a conclusion.

Second Test
Step 1:
Label the random variable of interest and formulate the
null and alternative hypotheses.


Step 2:
Identify the appropriate sampling distribution of the test
statistic under the null hypothesis H 0 .

Step 3:
Find the critical value(s) of the test statistic.

Step 4:
State the decision rule.

Step 5:
Calculate the test statistic based on a realized sample.

Step 6:
Compare the realized value of the test statistic and the
critical value(s) to reach a conclusion.

Regression results are often reported with the estimated
standard errors or T statistics given in brackets under the
corresponding estimated coefficients.
Note: The classical simple linear regression (CSR) model
involving a non-stochastic (fixed) regressor is discussed in
reference file Section 11.4. It is useful to compare this
model with the NSR model, although the CSR model is not
examinable in this unit. Although the interpretation of the
two models differs, the basic forms of the estimators
and inferential techniques are similar for the two models.

14. INTRODUCTION TO DIFFERENTIAL
CALCULUS
14.1 Limits
Terminology:
$x \to c^+$ : $x$ goes towards $c$ from the right (of $c$), i.e. $x > c$
$x \to c^-$ : $x$ goes towards $c$ from the left (of $c$), i.e. $x < c$

Definition (Limit of a Function)

The function $y = f(x)$ has the limit $l$ when $x \to c$,
written $\lim_{x \to c} f(x) = l$, if and only if the following conditions
are satisfied:
(i) $l$ is a finite number
(ii) $f(x) \to l$ as $x \to c$ from the left (left limit exists)
(iii) $f(x) \to l$ as $x \to c$ from the right (right limit
exists)

An infinite limit implies the limit does not exist.

14.2 Continuity of Functions of One Variable


Definition (Continuity of a Function)
A function $f(x)$ is continuous at $x = c$ if and only if the
following conditions are satisfied:
(i) $\lim_{x \to c} f(x)$ exists
(ii) $f(x)$ is defined at $x = c$ ($c$ is in the domain of the
function)
(iii) $\lim_{x \to c} f(x) = f(c)$

A function may be continuous at some points and


discontinuous at others.

14.3 Differentiation
(a) Basic Definitions
Definition (Slope of a Linear Function)
The slope or rate of change of a linear function
$f(x) = a + bx$ is the change in the value of the function per
1-unit change in $x$ (the argument of the function),
measured from any starting point. It is given by $b$.
Now consider any function $y = f(x)$ and a point $x = c$ in
the domain of $f(x)$.
If $x$ changes from $c$ to $(c + \Delta x)$, then we can write the
average rate of change of $f(x)$ between $x = c$ and
$x = (c + \Delta x)$ as
$$\frac{\Delta y}{\Delta x} = \frac{f(c + \Delta x) - f(c)}{\Delta x}$$

Definition (Derivative of a Function at a Point)

The derivative of a function $y = f(x)$ at $x = c$ is defined as
$$\lim_{\Delta x \to 0} \frac{f(c + \Delta x) - f(c)}{\Delta x} \qquad (*)$$

Alternative notations used for the above definition of a
derivative are


$$y' \ (\text{at } x = c), \qquad \frac{dy}{dx} \ (\text{at } x = c), \qquad f'(c)$$

The derivative as defined above can be interpreted as the
slope of the tangent to $f(x)$ at $x = c$.
We can generalize $(*)$ for any point $x$ on the graph of
$f(x)$, giving
$$y' = \frac{dy}{dx} = f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

[Graph: the tangent to $y = f(x)$ at $x = c$; its slope is the
derivative $f'(c)$.]

Note:
A differentiable function is continuous but the converse
need not be true.
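The limit definition can be illustrated numerically: as $\Delta x$
shrinks, the difference quotient settles down to the
derivative. A small sketch in Python (the function
$f(x) = x^2$ is just an illustrative choice, not from the notes):

```python
def diff_quotient(f, c, dx):
    """Average rate of change of f between c and c + dx."""
    return (f(c + dx) - f(c)) / dx

# f(x) = x^2 has derivative f'(x) = 2x, so f'(3) = 6.
# The quotient approaches 6 as dx approaches 0.
for dx in (0.1, 0.01, 0.000001):
    print(dx, diff_quotient(lambda x: x ** 2, 3.0, dx))
```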
(b) Differentiation Rules
Theorem (Basic Rules of Differentiation)

(i) $\dfrac{d(k)}{dx} = 0$ (where $k$ is any constant)

(ii) $\dfrac{d(x^n)}{dx} = n x^{n-1}$ (where $n$ is any real number;
for all values of $x$ for which both $x^n$ and $x^{n-1}$ are
defined)

Given $f(x)$ and $g(x)$ are differentiable functions:

(iii) $\dfrac{d[k\, f(x)]}{dx} = k\, \dfrac{df(x)}{dx}$ (where $k$ is any constant)

(iv) $\dfrac{d[f(x) \pm g(x)]}{dx} = \dfrac{df(x)}{dx} \pm \dfrac{dg(x)}{dx}$

(v) $\dfrac{d[f(x)\, g(x)]}{dx} = f(x)\, \dfrac{dg(x)}{dx} + g(x)\, \dfrac{df(x)}{dx}$ (product rule)
(vi) $\dfrac{d[f(x)/g(x)]}{dx} = \dfrac{g(x)\, \dfrac{df(x)}{dx} - f(x)\, \dfrac{dg(x)}{dx}}{[g(x)]^2}$
for $g(x) \ne 0$ (quotient rule)

(vii) $\dfrac{d(e^x)}{dx} = e^x$ (where $e$ is the natural base)

(viii) $\dfrac{d \ln x}{dx} = \dfrac{1}{x}$ (where $\ln x = \log_e x$)

(ix) If $y$ is a differentiable function of $g$ and $g$ is a
differentiable function of $x$, then
$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$
(chain rule, or function of a function rule)

(x) If $y = f(x)$ is a differentiable function of $x$ and its
inverse $x = f^{-1}(y)$ exists, then
$$\frac{dx}{dy} = \frac{1}{dy/dx}$$
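These rules can also be checked symbolically. A sketch using
sympy (an assumption of the sketch; the functions chosen are
arbitrary illustrations):

```python
import sympy as sp

x = sp.symbols('x')
f = x ** 3
g = sp.exp(x)

# Product rule (v): d(fg)/dx - [f g' + g f'] should simplify to 0
print(sp.simplify(sp.diff(f * g, x) - (f * sp.diff(g, x) + g * sp.diff(f, x))))

# Quotient rule (vi): d(f/g)/dx - [g f' - f g'] / g^2 should simplify to 0
print(sp.simplify(sp.diff(f / g, x) - (g * sp.diff(f, x) - f * sp.diff(g, x)) / g ** 2))

# Chain rule (ix): d/dx ln(x^3) = (1/x^3) * 3x^2 = 3/x
print(sp.diff(sp.log(x ** 3), x))
```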

MAIN POINTS
The coefficient of determination ($R^2$) measures the
proportion of total variation in the dependent variable
(in the sample) that is explained by variation in the
independent variable in the sample.
In the context of simple linear regression, the coefficient
of determination equals the square of the sample
correlation coefficient.
Assumptions about the distribution of the disturbances
allow us to perform statistical inference in the linear
regression model
If the random disturbances in the NSR model are
multivariate normally distributed, exact distribution
results concerning the least squares estimators of the
model's coefficients can be derived.
As the variance of the disturbances and hence the actual
variances of the least squares estimators are not known,
we are led to use the T distribution in performing
inference about the coefficients in the NSR model.
Continuity of a function at a point is a stronger condition
to satisfy than the limit of the function existing at the
point.


The first derivative at any point $x$ on the graph of $f(x)$
is given by
$$y' = \frac{dy}{dx} = f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

200052 INTRODUCTION TO ECONOMIC METHODS
LECTURE - WEEK 13
Required Reading:
Ref. File 2: Sections 2.3(a), 2.3(b), 2.3(d), 2.4, 2.5(a),
2.5(b), 2.7
14. INTRODUCTION TO DIFFERENTIAL
CALCULUS CONTINUED
14.4 Some Marginal Concepts in Economics
(a) Total and Marginal Revenue
Preliminary - Demand Functions
In its simplest form, the demand function expresses
quantity demanded ($q$) as a function of price ($p$). That
is
$$q = f(p)$$

Normally economists draw the demand relationship as the
inverse demand function $p = f^{-1}(q) = g(q)$, with $p$ on the
vertical axis.

Suppose an inverse demand function $p = g(q)$.

We know
$$\text{Total revenue} = TR(q) = pq = g(q)\,q$$

The calculus definition of marginal revenue is then given
by
$$\text{Marginal revenue} = MR(q) = \frac{d\,TR(q)}{dq}$$

Example:
Calculate $MR$ if $TR = 1100q - \dfrac{3q^2}{2}$

(b) Marginal Cost

Suppose we have a total cost function $TC(q)$.
The calculus definition of marginal cost is then given by
$$\text{Marginal cost} = MC(q) = \frac{d\,TC(q)}{dq}$$

Example:
Calculate $MC$ if $TC = \dfrac{q^3}{10} + 200q + 500$
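Both examples take one derivative each. A sketch in sympy
(the signs in TR and TC are as reconstructed above and should
be checked against the reference file):

```python
import sympy as sp

q = sp.symbols('q')

# Marginal revenue for TR = 1100q - 3q^2/2
TR = 1100 * q - sp.Rational(3, 2) * q ** 2
print(sp.diff(TR, q))    # MR(q) = 1100 - 3q

# Marginal cost for TC = q^3/10 + 200q + 500
TC = q ** 3 / 10 + 200 * q + 500
print(sp.diff(TC, q))    # MC(q) = 3q^2/10 + 200
```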

(c) Price Elasticity of Demand

The economic concept of elasticity of demand gives us a
way of gauging the responsiveness of quantity demanded
to changes in price that is independent of the units of
measure.
[Graph: a downward-sloping demand curve D, showing a price
change from $p_0$ to $p_1$ (a change of $\Delta p$) and the
corresponding quantity change from $q_0$ to $q_1$ (a change
of $\Delta q$).]

If we consider finite changes in quantity and price, we
could define point elasticity of demand as:
$$\frac{(q_1 - q_0)/q_0}{(p_1 - p_0)/p_0} = \frac{\Delta q / q_0}{\Delta p / p_0}$$
The above definition supposes the initial point is $(q_0, p_0)$.

Definition (Arc Elasticity of Demand with Respect to
Price)
Consider a demand function $q = f(p)$ for a particular
good, where $q$ is quantity demanded and $p$ is the price
per unit of the good, and two different points on the
function given by $(q_0, p_0)$ and $(q_1, p_1)$. Let $\Delta q = (q_1 - q_0)$
and $\Delta p = (p_1 - p_0)$. Then the arc elasticity of demand
with respect to price between the two given prices is given
by
$$E_{arc} = \frac{\Delta q / [(q_0 + q_1)/2]}{\Delta p / [(p_0 + p_1)/2]}$$

If we consider infinitesimally small changes in $p$, we have
the following calculus definition of point elasticity.

Definition (Point Elasticity of Demand with Respect to
Price)
Suppose a differentiable demand function $q = f(p)$ for a
particular good, where $q$ is quantity demanded and $p$ is
the price per unit of the good. The point elasticity of
demand with respect to price at any point $(q, p)$ on this
demand function is defined by
$$\eta = \frac{dq}{dp} \cdot \frac{p}{q} = \frac{1}{dp/dq} \cdot \frac{p}{q}$$
($\eta$ is the Greek letter eta)

Classification of elasticity of demand with respect to price:

$|\eta| < 1$ : Demand is inelastic: a proportional $\Delta p$ leads to a
less than proportional $\Delta q$

$|\eta| = 1$ : Demand is unit elastic: a proportional $\Delta p$ leads to
an equal proportional $\Delta q$

$|\eta| > 1$ : Demand is elastic: a proportional $\Delta p$ leads to a
more than proportional $\Delta q$

Example:
Calculate the elasticity of demand with respect to price if
$$q = 2500 - 8p - 2p^2$$
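A sketch of this example in sympy (the demand function's
signs are as reconstructed above; the evaluation price
$p = 10$ is an arbitrary illustration):

```python
import sympy as sp

p = sp.symbols('p', positive=True)
q = 2500 - 8 * p - 2 * p ** 2      # demand function from the example

eta = sp.diff(q, p) * p / q        # point elasticity (dq/dp)(p/q)
print(sp.simplify(eta))
print(eta.subs(p, 10))             # -8/37: demand is inelastic at p = 10
```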

14.5 Higher Order Derivatives
Definition (Second Derivative of a Function)
Suppose the (first) derivative of a single-variable function
$y = f(x)$ is differentiable. Then the derivative of the first
derivative is called the second order derivative or simply
the second derivative.

The second derivative of the function $y = f(x)$ is
variously denoted by
$$f''(x), \quad y'', \quad \frac{d^2y}{dx^2} \quad \text{or} \quad \frac{d^2f}{dx^2}$$

If successive derivatives are calculated, the $k$th
derivative of $y = f(x)$ is denoted
$$f^{(k)}(x), \quad y^{(k)}, \quad \frac{d^ky}{dx^k} \quad \text{or} \quad \frac{d^kf}{dx^k}$$

Example:
$$y = f(x) = 2x^3 + 5x^4$$

14.6 Maxima and Minima of Functions - Definitions
Definitions (Extreme Points of a Single-Variable Function)
Given a function $y = f(x)$ and a point $x = a$ in the domain
of $f(x)$:
$f(x)$ has a global or absolute maximum value $f(a)$ at
$x = a$ if and only if $f(a) \ge f(x)$ for all $x$ in the
domain of $f(x)$.
$f(x)$ has a global or absolute minimum value $f(a)$ at
$x = a$ if and only if $f(a) \le f(x)$ for all $x$ in the
domain of $f(x)$.
$f(x)$ has a local or relative maximum value $f(a)$ at
$x = a$ if and only if $f(a) \ge f(x)$ for all $x$ in some
interval, however small, around $x = a$ and in the
domain of the function.
$f(x)$ has a local or relative minimum value $f(a)$ at
$x = a$ if and only if $f(a) \le f(x)$ for all $x$ in some
interval, however small, around $x = a$ and in the
domain of the function.

14.7 Determination of Local Extreme Points
(a) The First Derivative Test
An extreme point of a function may occur at a point in the
domain of the function either where the first derivative
equals zero, or where the first derivative does not exist.
Points in the domain of the function where either of these
cases occurs are called critical points of the function.
Points where the first derivative is zero are also commonly
called stationary points.
The first derivative test finds local extreme points of a
function by determining the intervals on which the
function is increasing or decreasing.
The First Derivative Test
Step 1: Determine the points where $f'(x) = 0$ or $f'(x)$ is
not defined.
Step 2: For the intervals delineated by the points found in
Step 1, determine whether the function is
increasing ($f'(x) > 0$) or decreasing ($f'(x) < 0$).
Step 3: For each point determined in Step 1 at which the
function is continuous, note what happens to the
sign of $f'(x)$ as $x$ increases through the point.
If the sign of $f'(x)$ changes from positive to
negative, the point is a local maximum.
If the sign of $f'(x)$ changes from negative to
positive, the point is a local minimum.
If $f'(x)$ does not change sign, the point is neither
a local maximum nor a local minimum.
Note: If a function is constant in value between two
points on its graph, any points between these two points
will represent both local maximum and local minimum
points. The first derivative test as stated above is not
applicable to finding such local extreme points. Usually
the existence of intervals over which the value of the
function is unchanged can be determined by simple
inspection of the function considered, or by checking
whether the first derivative of the function equals zero
over any particular interval(s). (Reference File 2, p. 43)
Example:
$$y = f(x) = 45x - \frac{x^2}{2}$$
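A sketch of the first derivative test applied to this example
(sympy is assumed, as before):

```python
import sympy as sp

x = sp.symbols('x', real=True)
f = 45 * x - x ** 2 / 2

f1 = sp.diff(f, x)                     # f'(x) = 45 - x
print(sp.solve(sp.Eq(f1, 0), x))       # stationary point: x = 45

# Sign of f' on either side of x = 45: positive then negative,
# so x = 45 is a local maximum
print(f1.subs(x, 44), f1.subs(x, 46))  # 1 and -1
```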

(b) The Second Derivative Test
The Second Derivative Test
Suppose that for a twice differentiable function $y = f(x)$
we have $f'(a) = 0$; $x = a$ is a stationary point of the
function.
If $f''(a) < 0$, the function attains a local maximum
value of $f(a)$ at $x = a$.
If $f''(a) > 0$, the function attains a local minimum
value of $f(a)$ at $x = a$.
If $f''(a) = 0$, we must check whether $f'(x)$ changes
sign as $x$ increases through the point $x = a$ (as per
the first derivative test).
Note: As was the case with the first derivative test, the
second derivative test as stated above is not applicable to
determining local extreme points occurring on intervals
over which the value of a function is unchanged.
(Reference File 2, p. 46)
Inflection Points
Inflection points occur where $f''(x) = 0$ and $f''(x)$
changes sign. At such a point we may or may not have a
stationary point ($f'(x) = 0$).
Example:
$$y = f(x) = x^3$$
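For $y = x^3$ the second derivative test is inconclusive at
the stationary point, which is exactly the last case of the
test above. A sketch (sympy assumed):

```python
import sympy as sp

x = sp.symbols('x', real=True)
f = x ** 3

f1 = sp.diff(f, x)                   # f'(x) = 3x^2
f2 = sp.diff(f, x, 2)                # f''(x) = 6x
print(sp.solve(sp.Eq(f1, 0), x))     # stationary point at x = 0
print(f2.subs(x, 0))                 # f''(0) = 0 -> test inconclusive

# f' = 3x^2 does not change sign at 0 (no extremum), while
# f'' = 6x does change sign there, so x = 0 is an inflection point
print(f1.subs(x, -1), f1.subs(x, 1), f2.subs(x, -1), f2.subs(x, 1))
```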

Example of Simple Optimization:
Following the worst cyclone in 50 years, a banana
plantation owner in Northern Queensland has to replant
most of his holding. From research studies conducted in
the region the plantation owner knows that, on average, if
1,760 banana trees are planted per hectare, the annual
yield per tree will be 2 cartons of bananas; in addition, for
every additional tree planted per hectare (over 1,760), the
annual yield per tree falls by a thousandth of a carton.
How many trees should the plantation owner plant per
hectare to maximize the yield per hectare? What is the
maximum yield per hectare?
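A hedged sketch of one way to set the problem up (the yield
function below follows directly from the problem statement;
sympy is assumed):

```python
import sympy as sp

x = sp.symbols('x', positive=True)   # trees planted per hectare

# Yield per tree is 2 cartons at 1760 trees/ha and falls by 0.001
# cartons per extra tree, so yield per hectare is
# x * (2 - 0.001 * (x - 1760))
Y = x * (2 - sp.Rational(1, 1000) * (x - 1760))

stationary = sp.solve(sp.Eq(sp.diff(Y, x), 0), x)
print(stationary)                    # [1880] trees per hectare
print(sp.diff(Y, x, 2))              # -1/500 < 0, so this is a maximum
print(Y.subs(x, stationary[0]))      # 17672/5 = 3534.4 cartons per hectare
```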

MAIN POINTS
The calculus definitions of marginal revenue, marginal
cost and point elasticity of demand with respect to price
are given by, respectively
$$\text{Marginal revenue} = MR(q) = \frac{d\,TR(q)}{dq}$$
$$\text{Marginal cost} = MC(q) = \frac{d\,TC(q)}{dq}$$
$$\eta = \frac{dq}{dp} \cdot \frac{p}{q}$$
The determination of local extreme points can be carried
out using the first derivative test.
The second derivative test can be used to determine
whether a stationary point of a twice differentiable
function is an extreme point of the function.
