
CONTENTS

Lecture 1: Exploratory Data Analysis and Descriptive Statistics
    1.1  Studying one variable at a time
    1.2  Studying two variables at a time
    1.3  Studying more than two variables at a time
    1.4  Effect of Transformation
    1.5  Percentiles
    1.6  Exercises

Lecture 2: Probability
    2.1  Probability
    2.2  Operations with Probability
    2.3  Additive Rules of Probability
    2.4  Complement of Event A
    2.5  Conditional Probability
    2.6  Independent Events
    2.7  Intersection of events A and B
    2.8  Bayes Rule
    2.9  Exercises

Lecture 3: Discrete Random Variables
    3.1  Introduction
    3.2  Bernoulli Distribution
    3.3  Binomial Distribution
    3.4  Poisson Distribution
    3.5  Exercises

Lecture 4: Continuous Random Variables
    4.1  Introduction
    4.2  Exponential Distribution
    4.3  Exercises

Lecture 5: Normal Distribution
    5.1  Introduction
    5.2  Normal as an approximating distribution
    5.3  Exercises

Lecture 6: Random Sampling and Sampling Distributions
    6.1  Introduction
    6.2  Exercises

Lecture 7: Test of Hypotheses
    7.1  Introduction to Hypothesis Testing
    7.2  Procedure
    7.3  Confidence intervals for hypothesis testing
    7.4  Proportions
    7.5  Sample size
    7.6  Hypothesis Test for Proportions
    7.7  Exercises

Lecture 8: Type I and II Errors
    8.1  Introduction
    8.2  Type I and II Errors
    8.3  Exercises

Lecture 9: Further Hypothesis Tests
    9.1  Introduction
    9.2  Comparison of two population means
    9.3  The difference between two proportions
    9.4  Paired samples
    9.5  Exercises

Lecture 10: Inference for Variance
    10.1  Introduction
    10.2  Confidence Interval for σ²
    10.3  Confidence Interval for the Ratio of Two Variances
    10.4  Significance Test of Hypotheses about a Variance
    10.5  Significance Test of Hypotheses about Two Variances
    10.6  Exercises

Lecture 11: Chi-squared Test
    11.1  Goodness-of-fit Test
    11.2  Test for Homogeneity
    11.3  Continuity Correction
    11.4  Exercises

Lecture 12: Regression Analysis
    12.1  Introduction
    12.2  Correlation
    12.3  Regression
    12.4  Exercises

Lab Assessment one
Lab Assessment two
Lab Assessment three
Lab Assessment four
Lab Assessment five
Lab Assessment six
Lab Assessment seven
Lab Assessment eight
Lab Assessment nine
Lab Assessment ten

LECTURE 1
EXPLORATORY DATA ANALYSIS AND DESCRIPTIVE STATISTICS

Statistics can be divided into two major areas. Descriptive statistics comprises the statistical methods
dealing with the collection, tabulation and summarization of data, so as to present meaningful
information. Statistical inference, on the other hand, consists of the methods involved with the
analysis and interpretation of data that enable the statistician to draw meaningful inferences
about the population from which the data came. The two subfields are interrelated: descriptive
statistics organizes the collected data in a systematic manner, while statistical inference analyses
the data and enables one to draw conclusions from it.
A population is the totality of the observations with which a statistician is concerned. The
observations could refer to anything of interest, such as persons, animals or objects; they need not be
limited to people. The size of the population is defined to be the number of observations in the
population. In collecting data concerning a population, the statistician is often interested in arriving at
conclusions involving the entirety of the population.
A sample is a subset of a population. A random sample of n observations is a sample with n
observations, selected in such a way that every such sample of the population has the same probability
of being selected. These samples are considered to be unbiased.
Often, a sample of the population is taken, data collected from it, and inferences about the population
are made based on the analysis of the sample data.
1.1 Studying one variable at a time
A stem-and-leaf plot is a graphical display showing the frequency of values in specified
intervals. It is useful for small amounts of data as it retains the actual numerical values.
Example:

Stem | Leaf
1    | 345666
2    | 000011223345
3    | 2233444448
4    | 111234567
5    | 22336789
6    | 269
7    | 45
8    | 9

The stem is an integer and the leaf is a decimal value.


A histogram is a graphical way to display the shape of the distribution.

[Figure: Histogram of BMI (Mean = 25.74, StDev = 2.389, N = 250). Horizontal axis: class midpoints, from 20 to 32; vertical axis: frequency, from 10 to 40.]

A box-plot is a graphical summary of the distribution of a variable. The minimum, the 1st
quartile, the median, the 3rd quartile and the maximum are used to construct a box-plot.
This is called the five-number summary.
1. The ends of the box are at the quartiles.
2. Mark the median with a line.
3. Observations more than 1.5 * IQR outside the box, where IQR = Q3 - Q1 is the interquartile
range, are considered to be outliers and are marked with stars.
4. Whiskers extend from the ends of the box to the smallest and largest observations
that are not outliers.
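
The outlier rule in step 3 can be sketched in a few lines of Python. The function name `outlier_fences` and the quartile values used below are hypothetical, chosen only to illustrate the arithmetic; they are not taken from the BMI data.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a value counts as an outlier."""
    iqr = q3 - q1  # interquartile range: the length of the box
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical quartiles, chosen only for illustration.
lower, upper = outlier_fences(24.0, 27.0)
print(lower, upper)
```

Any observation below `lower` or above `upper` would be drawn as a star, and the whiskers would stop at the most extreme observations inside these limits.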

[Figure: Box plot of BMI. Axis scale from 20 to 34.]

Mean

The statistical mean of a set of observations is the average of the measurements in a set of data. The
population mean and sample mean are defined as follows:

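Given the set of data values $x_1, x_2, \ldots, x_N$ from a finite population of size $N$, the population mean $\mu$
is calculated as

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

Given the set of data values $x_1, x_2, \ldots, x_n$ from a sample of size $n$, the sample mean $\bar{x}$
is calculated as

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$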

The sample mean is often used as an estimator of the mean of the population from which the sample
was taken. In fact, the sample mean is an unbiased estimator of the population mean.
A trimmed mean of a set of values is a mean with a specified percentage of the largest and smallest
values excluded from the calculation.

Median

The median of a set of observations is that value that, when the observations are arranged in an
ascending or descending order, satisfies the following condition:
1. If the number of observations is odd, the median is the middle value.
2. If the number of observations is even, the median is the average of the two middle values.
The median is the same as the 50th percentile of a set of data.

Mode

The mode of a set of observations is the specific value that occurs with the greatest frequency. There
may be more than one mode in a set of observations, if there are several values that all occur with the
greatest frequency. A mode may also not exist; this is true if all the observations occur with the same
frequency.
Another measure of central location that is occasionally used is the midrange. It is computed as the
average of the smallest and largest values in a set of data.
Example 1.1: Given the following set of data

1.2   1.5   2.6   3.8   2.4   1.9   3.5   2.5   2.4   3.0

It can be sorted in ascending order:

1.2   1.5   1.9   2.4   2.4   2.5   2.6   3.0   3.5   3.8

The mean, median and mode are computed as follows:

$$\bar{x} = \frac{1.2 + 1.5 + 2.6 + 3.8 + 2.4 + 1.9 + 3.5 + 2.5 + 2.4 + 3.0}{10} = 2.48$$

$$\tilde{x} = \frac{2.4 + 2.5}{2} = 2.45$$

The mode is 2.4, since it is the only value that occurs twice.
The midrange is (1.2 + 3.8) / 2 = 2.5.
Note that the mean, median and mode of this set of data are very close to each other. This suggests
that the data is roughly symmetrically distributed.
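
The measures of location in Example 1.1 can be checked with Python's standard `statistics` module. This is only a sketch for verification; the variable names are illustrative, and the trimmed mean shown assumes that one value is dropped from each end (10 % of the ten observations).

```python
import statistics

data = [1.2, 1.5, 2.6, 3.8, 2.4, 1.9, 3.5, 2.5, 2.4, 3.0]

mean = statistics.mean(data)
median = statistics.median(data)         # average of the two middle values
modes = statistics.multimode(data)       # list of all most frequent values
midrange = (min(data) + max(data)) / 2   # average of smallest and largest

# Assumed 10 % trimmed mean: drop the smallest and the largest value.
trimmed = statistics.mean(sorted(data)[1:-1])

print(round(mean, 2), round(median, 2), modes, round(midrange, 2), round(trimmed, 3))
```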

Range

The range of a set of observations is the absolute value of the difference between the largest and
smallest values in the set. It measures the size of the smallest contiguous interval of real numbers that
encompasses all the data values.
Example 1.2: Given the following sorted data:

1.2   1.5   1.9   2.4   2.4   2.5   2.6   3.0   3.5   3.8

The range of this set of data is 3.8 - 1.2 = 2.6.

Variance and Standard Deviation

The variance of a set of data measures how spread out the values are. It is based on the squared
differences between each data value and the mean.
The population and sample variance are calculated as follows:
Given the set of data values $x_1, x_2, \ldots, x_N$ from a finite population of size $N$, the population variance
$\sigma^2$ is calculated as

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

Given the set of data values $x_1, x_2, \ldots, x_n$ from a sample of size $n$, the sample variance $s^2$ is
calculated as

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Note that the population variance is simply the arithmetic mean of the squares of the difference
between each data value in the population and the mean. The formula for the sample
variance is similar to the formula for the population variance, except that the denominator in the
fraction is $(n - 1)$ instead of $N$. With this denominator, the sample variance is an unbiased
estimator of the variance of the population to which the sample belongs.
The standard deviation of a set of data is the positive square root of the variance.
Example 1.3: Given the following sorted data:

1.2   1.5   1.9   2.4   2.4   2.5   2.6   3.0   3.5   3.8

$\bar{x} = 2.48$ as computed earlier

$$s^2 = \frac{1}{10 - 1} \left( (1.2 - 2.48)^2 + (1.5 - 2.48)^2 + (1.9 - 2.48)^2 + (2.4 - 2.48)^2 + (2.4 - 2.48)^2 + (2.5 - 2.48)^2 + (2.6 - 2.48)^2 + (3.0 - 2.48)^2 + (3.5 - 2.48)^2 + (3.8 - 2.48)^2 \right)$$

$$= \frac{1}{9} (1.6384 + 0.9604 + 0.3364 + 0.0064 + 0.0064 + 0.0004 + 0.0144 + 0.2704 + 1.0404 + 1.7424)$$

$$= 0.6684$$

$$s = (0.6684)^{1/2} = 0.8176$$


The sample variance can also be calculated as follows:

$$s^2 = \frac{1}{n(n - 1)} \left[ n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2 \right]$$
Example 1.4: Given the above data, we can calculate $s^2$ using the above formula:

$$\sum_{i=1}^{n} x_i^2 = 1.2^2 + 1.5^2 + 1.9^2 + 2.4^2 + 2.4^2 + 2.5^2 + 2.6^2 + 3.0^2 + 3.5^2 + 3.8^2$$

$$= 1.44 + 2.25 + 3.61 + 5.76 + 5.76 + 6.25 + 6.76 + 9.00 + 12.25 + 14.44 = 67.52$$

$$s^2 = \frac{1}{10 \times 9} \left( 10 \times 67.52 - 24.8^2 \right) = 0.6684$$
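
As a quick check that the two variance formulas agree, the following Python sketch computes the sample variance of the ten data values both ways. The variable names are illustrative only.

```python
data = [1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8]
n = len(data)
mean = sum(data) / n

# Definition: squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# Shortcut: uses only the sum of the values and the sum of their squares.
s2_shortcut = (n * sum(x * x for x in data) - sum(data) ** 2) / (n * (n - 1))

s = s2_definition ** 0.5  # standard deviation
print(round(s2_definition, 4), round(s2_shortcut, 4), round(s, 4))
```

The shortcut form is convenient for hand calculation because it avoids computing each deviation from the mean.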

1.2 Studying two variables at a time

A two-way frequency table gives the number of cases within each combination of
categories of two qualitative variables.
A scatter plot is a two-dimensional graphical display of two quantitative variables.

1.3 Studying more than two variables at a time

A multiway frequency table or multidimensional contingency table displays the number
of cases within each combination of categories of several qualitative variables.

1.4 Effect of Transformation


A transformation of a variable is a mathematical manipulation of each value of the variable. When
we make a transformation, we transform the original scale of measurement for the variable to a new
scale.
Many statistical techniques require that the data is approximately normally distributed so we often
apply a transformation to the data. If the data is skewed to the right, we can try the natural logarithms
or the square root. If the data is skewed to the left, we can try a power transformation greater than one.
All these transformations are non-linear.
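
The effect of a logarithmic transformation on right-skewed data can be illustrated with a small Python sketch. The data values are invented for this illustration, and the moment coefficient of skewness used to quantify asymmetry is an assumption of the sketch; it is not defined in these notes.

```python
import math


def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (positive means right-skewed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed data, invented for this illustration.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]
logged = [math.log(v) for v in raw]  # natural logarithm of each value

print(skewness(raw), skewness(logged))
```

For this data set the skewness after taking logarithms is smaller than before, which is the behaviour the paragraph above describes for right-skewed data.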


1.5 Percentiles
Percentiles are values in a given set of observations that divide the data into 100 equal parts. These
values can be denoted by $P_1, P_2, \ldots, P_{99}$, where
1 % of the data falls below (is less than or equal to) P1
2 % of the data falls below P2
:
:
99 % of the data falls below P99
Percentiles can be calculated using a sorted list of observations or the cumulative frequency
distribution table corresponding to the observations. In the latter method, it is assumed that the values
in a class interval are uniformly distributed within it; linear interpolation is then used to calculate the
percentiles. As this assumption is often untrue, percentile values can differ depending on whether raw
data or frequency distributions were used in the computation. Therefore, percentiles are often treated
as estimates for the value below which certain percentages of the observations fall.
Example 1.1: Given the following sorted list of observations:
0.7   0.8   0.9   1.1   1.2   1.4   1.9   2.2   2.2   2.3
2.5   3.1   3.2   3.3   3.4   3.8   3.9   4.0   4.1   4.2
4.3   4.6   4.7   5.0   5.2   5.5   5.6   5.8   5.9   6.1
6.4   6.6   6.8   7.0   7.7   8.2   8.9   9.2   9.5   9.9


P75 = 6.1, since 40 x 75 % = 30 and 6.1 is the 30th ranked value.


P45 = 4.0, since 40 x 45 % = 18 and 4.0 is the 18th ranked value.
P62 = 5.2, since 40 x 62 % = 24.8 and 5.2 is the 25th ranked value.
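
The ranking rule used in this example can be sketched in Python. The sketch assumes that a non-integer rank is rounded up to the next whole rank, as was done for P62 above; the function name `percentile_by_rank` is illustrative.

```python
def percentile_by_rank(sorted_values, p):
    """Return the p-th percentile (p an integer from 1 to 99) by the ranking rule."""
    n = len(sorted_values)
    rank = -(-n * p // 100)  # ceiling of n * p / 100 using integer arithmetic
    return sorted_values[rank - 1]  # ranks start at 1, list indices at 0


observations = [
    0.7, 0.8, 0.9, 1.1, 1.2, 1.4, 1.9, 2.2, 2.2, 2.3,
    2.5, 3.1, 3.2, 3.3, 3.4, 3.8, 3.9, 4.0, 4.1, 4.2,
    4.3, 4.6, 4.7, 5.0, 5.2, 5.5, 5.6, 5.8, 5.9, 6.1,
    6.4, 6.6, 6.8, 7.0, 7.7, 8.2, 8.9, 9.2, 9.5, 9.9,
]

for p in (75, 45, 62):
    print(p, percentile_by_rank(observations, p))
```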
This set of observations has the following cumulative frequency distribution:

Measurements   Cumulative Frequency   Relative Cumulative Frequency
0.0 - 1.0               3                       0.075
1.0 - 2.0               7                       0.175
2.0 - 3.0              11                       0.275
3.0 - 4.0              18                       0.450
4.0 - 5.0              24                       0.600
5.0 - 6.0              29                       0.725
6.0 - 7.0              34                       0.850
7.0 - 8.0              35                       0.875
8.0 - 9.0              37                       0.925
9.0 - 10.0             40                       1.000
Totals                 40                       1.000

The percentiles can also be calculated from the cumulative frequency distribution table, using
linear interpolation to arrive at estimates:

$$P_{75} = 6.0 + 1.0 \times \frac{0.75 - 0.725}{0.85 - 0.725} = 6.0 + \frac{0.025}{0.125} = 6.2$$

where 6.0 is the upper class limit of interval 5.0 - 6.0 with relative cumulative frequency 0.725, and 0.850 is
the relative cumulative frequency of the next interval, 6.0 - 7.0, with class width 1.0.

$P_{45} = 4.0$, since the interval 3.0 - 4.0 has a relative cumulative frequency of 0.45

$$P_{62} = 5.0 + 1.0 \times \frac{0.620 - 0.600}{0.725 - 0.600} = 5.0 + \frac{0.020}{0.125} = 5.16$$

where 5.0 is the upper class limit of interval 4.0 - 5.0 with relative cumulative frequency 0.600, and 0.725 is
the relative cumulative frequency of the next interval, 5.0 - 6.0, with class width 1.0.
The values of P75 and P62 differ between the two methods of calculation, while the values of P45 for both
methods are the same.
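
The table-based calculation can be sketched in Python as follows. The function name `percentile_from_table` and its parameters are illustrative; the sketch assumes, as the text does, that values are spread uniformly within each class interval.

```python
def percentile_from_table(upper_limits, rel_cum_freq, p, lower_bound=0.0):
    """Estimate the percentile at proportion p (0 < p <= 1) by linear interpolation."""
    prev_limit, prev_cum = lower_bound, 0.0
    for limit, cum in zip(upper_limits, rel_cum_freq):
        if p <= cum:
            # Fraction of the way through this class interval.
            fraction = (p - prev_cum) / (cum - prev_cum)
            return prev_limit + (limit - prev_limit) * fraction
        prev_limit, prev_cum = limit, cum
    raise ValueError("p must not exceed the final cumulative frequency")


upper_limits = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
rel_cum_freq = [0.075, 0.175, 0.275, 0.450, 0.600, 0.725, 0.850, 0.875, 0.925, 1.000]

for p in (0.75, 0.45, 0.62):
    print(p, round(percentile_from_table(upper_limits, rel_cum_freq, p), 2))
```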
Deciles are values in a given set of observations that divide the data into 10 equal parts. These values
can be denoted by $D_1, D_2, \ldots, D_9$, where
10 % of the data falls below D1
20 % of the data falls below D2
:
:
90 % of the data falls below D9
It is easy to see that

D1 = P10     D4 = P40     D7 = P70
D2 = P20     D5 = P50     D8 = P80
D3 = P30     D6 = P60     D9 = P90

Quartiles are values in a given set of observations that divide the data into 4 equal parts. These values
can be denoted by Q1, Q2 and Q3, where
25 % of the data falls below Q1
50 % of the data falls below Q2
75 % of the data falls below Q3
Again, it is obvious that $Q_1 = P_{25}$, $Q_2 = P_{50}$ and $Q_3 = P_{75}$.
Deciles and quartiles are calculated in the same manner as percentiles.
1.6 Exercises
1. Construct a box-plot using the five-number summary, minimum, Q1, median, Q3, maximum,
given as 48, 63, 70, 81 and 100 respectively.

2. Consider the following strength measurements.

 66  111   92  117   86  137  132   78   91  111
 96   84  107   93   96   85  101   97   89  102
100   79  110  105   91   95   97   96  104  138
 88  137  122   80  103  115  104  104   99  102
 94  103   99  106   85   91  105   96   92   84
 95   95   92   89   86  102  111  104  100  104
132   98   97   94   97   98   99  104  102  102
114  109  101  111   88  104   98   91  107   99
103  100  102  101   87   98   99   97   62   97
 92  101  100  102   96   98   98   94  100   98

(a) Construct a stem-and-leaf plot with stems in tens and leaves in units. Use two rows per
stem.
(b) Comment on the distribution.
3. A new chip can be reprogrammed without removing it from the microcomputer. Times (in
seconds) to reprogram a byte of memory on this chip are shown below (Design News, April
1983, page 26).
11.6 12.3 12.5 12.9 13.0 13.1 13.2 13.3 13.3 13.4 13.8 14.2 14.7 15.1 15.3
(a) Construct a dot-plot of these programming times.
(b) Calculate the mean and median of these values.
(c) Calculate the range, interquartile range, and standard deviation. What do these three
descriptive statistics measure?
(d) A company has claimed that a byte of memory on this chip can be reprogrammed in less
than 14 seconds. Does this claim seem reasonable?

4. Times to failure are listed below for tires manufactured with three different methods of
production.

Method 1:
10.03 10.47 10.58 11.48 11.60 12.41 13.03
13.51 14.48 16.96 17.08 17.27 17.90 18.21
19.30 20.10 21.51 21.78 21.79 25.34

Method 2:
10.10 11.01 11.20 12.95 13.19 14.81 16.03
17.01 18.96 24.10 24.15 24.52 26.05 26.44
28.59 30.24 31.03 33.51 33.61 40.68

Method 3:
19.07 19.51 19.62 20.47 20.78 21.37 22.08
22.61 23.47 26.02 26.23 26.47 27.07 27.43
28.28 29.10 29.66 30.67 30.81 34.36

(a) Represent the data graphically in an appropriate way.


(b) Calculate descriptive statistics for the three distributions of failure times separately.
(c) Compare the three distributions. In particular, compare location and variation for the three
sets of data.

5. In 1975-1976, 499,602 men and 41,786 women received bachelor's degrees in the USA,
165,474 men and 143,789 women received master's degrees, 26,010 men and 7,777 women
received doctorates, and 52,365 men and 9,720 women received first professional degrees.
In 1984-1985, 482,528 men and 496,949 women received bachelor's degrees, 143,390 men
and 142,861 women received master's degrees, 21,700 men and 11,243 women received
doctorates, and 50,455 men and 24,608 women received first professional degrees.
(a) Arrange this information in one or more frequency tables.
(b) Discuss the relationship between sex and degree, separately for the two academic years.
(c) Discuss the relationship between year and degree, separately for men and women.
1.7 Further Exercises (Probability and statistics: Walpole, Myers and Myers 8th edition)
Exercises - Page 52 - Question 1.13 & 1.14


LECTURE 2

PROBABILITY
So far we have used tools of data analysis to learn about a collection of information. In formal statistical
analysis, we go beyond the goals of data analysis. In general, statistical analysis (inference) involves
making probability statements about populations based on what we observe in our samples. The ideas
in probability that are needed for formal statistical inference are discussed in this lecture.
Statisticians use the word experiment to describe any process that generates a set of data.
An experiment is a process leading to a well-defined observation or outcome that generates a set of
data.
A simple example of a statistical experiment is the tossing of a coin. In this experiment there are only
two possible outcomes, heads and tails.
We are particularly interested in the observations obtained by repeating the experiment several times
under the same conditions. In most cases the outcomes will depend on chance and, therefore, cannot
be predicted with certainty. When a coin is tossed repeatedly, we cannot be certain that a given toss
will result in a head. However, we know the entire set of possibilities for each toss.
The sample space is the set of all possible outcomes of the experiment and is denoted by S. Each of
the possible outcomes is called an element or a member of the sample space, or simply a sample point.
If the sample space has a finite number of elements, we can list them.
For example, the sample space S of possible outcomes when a coin is tossed may be written as
S = {H, T}
where H and T correspond to heads and tails.
In some experiments it is helpful to list the elements of the sample space systematically by means of a
tree diagram.
Sample spaces with a large or infinite number of sample points are best described by a statement or a
rule. For example, if the possible outcomes of an experiment are the set of cities in the world with a
population over 1 million, the sample space is written
S = {x | x is a city with a population over 1 million}
A finite sample space is a sample space that contains a finite number of outcomes.
The sample spaces that contain the outcomes of tossing a coin, drawings from a bag of mixed-colour
balls, and dealings from a regular 52-card deck are examples of finite (and hence discrete) sample spaces.
A continuous sample space is a sample space that contains an interval of values.
Sample spaces that contain the outcomes of temperature readings, height measurements, and salaries
are examples of continuous sample spaces.
An event is a subset of the sample space and is denoted by E.
It may contain some, all or none of the outcomes comprising the sample space. If the event contains
only one sample point, it is a simple event. If the event contains two or more sample points, it is a
compound event. And if the event contains no sample points, it is known as the null space, denoted by $\emptyset$.

For any given experiment we may be interested in the occurrence of certain events rather than in the outcome
of a specific element in the sample space. For example, we may be interested in the event A that the outcome
when a die is tossed is divisible by 3. This will occur if the outcome is an element of the subset A =
{3,6} of the sample space S = {1,2,3,4,5,6} of the die-tossing experiment.
To each event we assign a collection of sample points, which constitute a subset of the sample space.
That subset represents all of the elements for which the event is true.
The complement of an event A with respect to S is the subset of all the elements of S that are not in
A. We denote the complement of A by the symbol A'.
For example, consider the sample space
S = {A, B, C, D, E}. Let A = {B, D}. Then A' = {A, C, E}.
The intersection of two events A and B, denoted by the symbol $A \cap B$, is the event containing all
elements that are common to A and B.
In the tossing of a die we might let A be the event that an even number occurs and B the event that a
number greater than 3 shows. Then the subsets A = {2,4,6} and B = {4,5,6} are subsets of the same
sample space S = {1,2,3,4,5,6}. Both A and B will occur on a given toss if the outcome is an element of
the subset {4,6}, which is the intersection of A and B. So $A \cap B = \{4, 6\}$.
For certain statistical experiments it is usual to define two events that cannot occur simultaneously.
Such events are said to be mutually exclusive.
Two events A and B are mutually exclusive, or disjoint, if $A \cap B = \emptyset$, that is, if A and B have no
elements in common.
The union of the two events A and B, denoted by the symbol $A \cup B$, is the event containing all the
elements that belong to A or B or both.

Example 2.1:
Consider tossing a die and observing the number that appears on the top face. This has a well-defined
outcome: the top face can be 1, 2, 3, 4, 5 or 6. So this can be taken as an experiment.
The sample space S of the experiment is S = {1,2,3,4,5,6}.

[Figure: Throwing the die, with the six possible outcomes 1, 2, 3, 4, 5 and 6.]

S consists of 6 definite outcomes. So S is a finite sample space.


Some events on this sample space can be identified as even number occurs, odd number occurs and
number greater than 3 occurs.
Let A be the event that an even number occurs, B that an odd number occurs, and C that a number
greater than 3 occurs. Then

A = {2,4,6}
B = {1,3,5}
C = {4,5,6}
$B \cup C = \{1, 3, 4, 5, 6\}$
$B \cap C = \{5\}$
$A \cap B = \emptyset$, so A and B are mutually exclusive.
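
Because events are sets, the operations in Example 2.1 map directly onto Python's built-in `set` type. The following is a small sketch for checking the results above; the variable names simply mirror the event names.

```python
S = {1, 2, 3, 4, 5, 6}   # sample space for one toss of a die
A = {2, 4, 6}            # an even number occurs
B = {1, 3, 5}            # an odd number occurs
C = {4, 5, 6}            # a number greater than 3 occurs

print(B | C)             # union of B and C
print(B & C)             # intersection of B and C
print(S - A)             # complement of A with respect to S
print(A.isdisjoint(B))   # True when A and B are mutually exclusive
```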
2.1 Probability
The probability of an event is the chance or likelihood of the event occurring.
In this chapter we consider only those experiments for which the sample space contains a finite number
of elements. The probability of an outcome or sample point is a real number, between 0 and 1 that
provides a measure of likelihood that the outcome or sample point will actually occur. A sample point
that absolutely cannot occur has a probability of 0, while a sample point that will always occur has a
probability of 1; all other sample points are assigned a probability based on this relative measure.
A probability function assigns a unique number or probability to each outcome.
The probability of an event A is the summation of the probabilities of all the sample points in A and
is denoted by P(A).
If event A is a subset of the sample space S, then $0 \le P(A) \le 1$. If $A = \emptyset$, then $P(A) = P(\emptyset) = 0$;
if A = S, then P(A) = P(S) = 1. Otherwise, the value of P(A) is between 0 and 1.

Example 2.2:
A coin is tossed twice. What is the probability that at least one head occurs?
Solution:
The sample space for this experiment is S = {HH, HT, TH, TT}. If the coin is balanced, each of these
outcomes would be equally likely to occur. Therefore, we assign a probability of w to each sample
point. Then 4w = 1, or w = 1/4. If A represents the event of at least one head occurring, then

$$A = \{HH, HT, TH\} \quad \text{and} \quad P(A) = \frac{1}{4} + \frac{1}{4} + \frac{1}{4} = \frac{3}{4}$$

If an experiment can result in any one of N different equally likely outcomes, and if exactly n of these
outcomes correspond to event A, then the probability of event A is

$$P(A) = \frac{n}{N}$$
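
For equally likely outcomes, the rule P(A) = n/N amounts to counting. The sketch below enumerates the sample space of Example 2.2 with `itertools.product` and uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All outcomes of tossing a coin twice: HH, HT, TH, TT.
sample_space = ["".join(tosses) for tosses in product("HT", repeat=2)]

# Event A: at least one head occurs.
event = [outcome for outcome in sample_space if "H" in outcome]

p = Fraction(len(event), len(sample_space))  # n / N
print(sample_space, event, p)
```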
Example 2.3:
A mixture of candies contains 6 mints, 4 toffees, and 3 chocolates. If a person makes a random selection
of one of these candies, find the probability of getting (a) a mint, or (b) a toffee or a chocolate.
Solution:
Let M, T, and C represent the events that the person selects, respectively, a mint, toffee, or chocolate
candy. The total number of candies is 13, all of which are equally likely to be selected.
(a) Since 6 of the 13 candies are mints, the probability of event M, selecting a mint at random, is
$$P(M) = \frac{6}{13}$$
(b) Since 7 of the 13 candies are toffees or chocolates, it follows that
$$P(T \cup C) = \frac{7}{13}$$
2.2 Operations with Probability
Often it is easier to calculate the probability of an event from the known probabilities of other events.
This may well be true if the event in question can be represented as the union or intersection of two
other events or as the complement of some event.
Just as events can be treated as sets, so can probabilities of an event (in a sense). The formulas used to
calculate the probability of unions, intersections and complements of events are similar to the ones
used for sets.


2.3 Additive Rules of Probability


Given that events A and B are subsets of the sample space S, the following rules hold.
Union of events A and B (Additive Rule of Probability)
If A and B are any two events, then
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
where $P(A \cup B)$ is the probability that event A or event B (or both) occurs and $P(A \cap B)$ is the
probability that both events A and B occur.
If A and B are mutually exclusive, then
$$P(A \cup B) = P(A) + P(B)$$
since, if events A and B are mutually exclusive (i.e. A and B cannot occur together), $P(A \cap B) = P(\emptyset) = 0$.
In general, if a set of events $A_1, A_2, A_3, \ldots, A_n$ are mutually exclusive, then
$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = P(A_1) + P(A_2) + \cdots + P(A_n)$$
2.4 Complement of event A
Since $A \cup A' = S$, and the event A and its complement A' are mutually exclusive,

$$P(S) = P(A \cup A') = P(A) + P(A') = 1$$

$$P(A') = 1 - P(A)$$
Example 2.4:
What is the probability of getting a total of 7 or 11 when a pair of dice are tossed?
Solution:
Let A be the event that 7 occurs and B the event that 11 comes up. Now, a total of 7 occurs for 6 of
the 36 sample points and a total of 11 occurs for only 2 of the sample points. Since all sample points
are equally likely, we have $P(A) = 1/6$ and $P(B) = 1/18$. The events A and B are mutually exclusive, since
a total of 7 and a total of 11 cannot both occur on the same toss. Therefore,

$$P(A \cup B) = P(A) + P(B) = \frac{1}{6} + \frac{1}{18} = \frac{2}{9}$$

This result could also have been obtained by counting the total number of points for the event $A \cup B$,
namely 8, and writing

$$P(A \cup B) = \frac{n}{N} = \frac{8}{36} = \frac{2}{9}$$
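
Both routes to the answer in Example 2.4 can be checked by enumerating the 36 sample points. This is a sketch only; the helper `prob` is a hypothetical name introduced for the illustration.

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely outcomes of tossing a pair of dice.
rolls = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a true/false test on one outcome."""
    favourable = [roll for roll in rolls if event(roll)]
    return Fraction(len(favourable), len(rolls))


p_seven = prob(lambda roll: sum(roll) == 7)
p_eleven = prob(lambda roll: sum(roll) == 11)
p_either = prob(lambda roll: sum(roll) in (7, 11))

# The additive rule for mutually exclusive events and direct counting agree.
print(p_seven, p_eleven, p_seven + p_eleven, p_either)
```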


2.5 Conditional Probability


The conditional probability of an event A is the probability that event A will occur, given that some
other event has occurred. The value assigned to the conditional probability of an event (given that
another event occurred) is based on a sample space that is different from the one used to derive the
simple probability.
Consider the event B of getting a perfect square when a die is tossed. The die is constructed so that the
even numbers are twice as likely to occur as the odd numbers. Based on the sample space
$S = \{1, 2, 3, 4, 5, 6\}$, B contains the two elements 1 and 4. With probabilities of 1/9 and 2/9 assigned,
respectively, to the odd and even numbers, the probability of B occurring is 1/3.
Now suppose that it is known that the toss of the die resulted in a number greater than 3. We are now
dealing with a reduced sample space $A = \{4, 5, 6\}$, which is a subset of S. To find the probability that B
occurs, relative to the space A, we must first assign new probabilities to the elements of A proportional
to their original probabilities such that their sum is 1. Assigning a probability of w to the odd number
in A and a probability of 2w to each of the two even numbers, we have 5w = 1, or w = 1/5. Relative to the space
A, we find that B contains the single element 4. Denoting this event by the symbol $B \mid A$, we write
$B \mid A = \{4\}$, and hence $P(B \mid A) = 2/5$.
This example shows that events may have different probabilities when considered relative to different
sample spaces.
We can also write

$$P(B \mid A) = \frac{2}{5} = \frac{2/9}{5/9} = \frac{P(A \cap B)}{P(A)}$$

where $P(A \cap B)$ and $P(A)$ are found from the original sample space S. In other words, a conditional
probability relative to a subspace A of S may be calculated directly from the probabilities assigned to
the original sample space S.

The conditional probability of event B, given that event A occurred, also known as the conditional
probability of B given A, is defined by

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} \quad \text{if } P(A) > 0.$$
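
The loaded-die calculation above can be reproduced with exact fractions. The sketch below assumes the weights stated in the text (1/9 for each odd face, 2/9 for each even face); the helper `prob` is a hypothetical name.

```python
from fractions import Fraction

# Probabilities on the original sample space S: even faces are twice as likely.
weights = {
    1: Fraction(1, 9), 2: Fraction(2, 9), 3: Fraction(1, 9),
    4: Fraction(2, 9), 5: Fraction(1, 9), 6: Fraction(2, 9),
}


def prob(event):
    """Probability of an event: the sum of the weights of its sample points."""
    return sum(weights[face] for face in event)


A = {4, 5, 6}   # a number greater than 3
B = {1, 4}      # a perfect square

p_b_given_a = prob(A & B) / prob(A)   # definition of conditional probability
print(prob(B), prob(A), p_b_given_a)
```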

Example 2.5:
The probability that a regularly scheduled flight departs on time is $P(D) = 0.83$; the probability
that it arrives on time is $P(A) = 0.82$; and the probability that it departs and arrives on time is
$P(D \cap A) = 0.78$. Find the probability that a plane (a) arrives on
time given that it departed on time, and (b) departed on time given that it has arrived on time.

Solution:
(a) The probability that a plane arrives on time given that it departed on time is

$$P(A \mid D) = \frac{P(D \cap A)}{P(D)} = \frac{0.78}{0.83} = 0.94$$

(b) The probability that a regularly scheduled flight departed on time given that it has arrived on
time is

$$P(D \mid A) = \frac{P(D \cap A)}{P(A)} = \frac{0.78}{0.82} = 0.95$$

2.6 Independent Events


Although conditional probability allows for an alteration of the probability of an event in the light of
additional information, it also helps to understand the concept of independent events. In the above
example $P(D \mid A)$ differs from $P(D)$. This suggests that the occurrence of A influenced D. However,
consider the situation where we have events A and B and $P(A \mid B) = P(A)$. In other words, the
occurrence of B had no impact on the occurrence of A. Here the occurrence of A is independent of the
occurrence of B.

Two events A and B are independent if and only if
$P(B \mid A) = P(B)$ and $P(A \mid B) = P(A)$. Otherwise, A and B are dependent.
2.7 Intersection of events A and B (Multiplicative Rules of probability)
If in an experiment the events A and B can both occur, then
$$P(A \cap B) = P(A) P(B \mid A)$$
Thus the probability that both A and B occur is equal to the probability that A occurs multiplied
by the probability that B occurs, given that A occurs. Since the events $A \cap B$ and $B \cap A$
are equivalent, it follows from the above rule that we can also write
$$P(A \cap B) = P(B \cap A) = P(B) P(A \mid B)$$

If two events A and B are independent, then
$$P(A \cap B) = P(A) P(B)$$
If, in an experiment, the events $A_1, A_2, A_3, \ldots, A_k$ can occur, then
$$P(A_1 \cap A_2 \cap A_3 \cap \cdots \cap A_k) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_1 \cap A_2) \cdots P(A_k \mid A_1 \cap A_2 \cap \cdots \cap A_{k-1})$$
If the events $A_1, A_2, A_3, \ldots, A_k$ are independent, then
$$P(A_1 \cap A_2 \cap A_3 \cap \cdots \cap A_k) = P(A_1) P(A_2) P(A_3) \cdots P(A_k)$$
Example 2.5:
A card is drawn from a regular deck of 52 cards. Event A is the event that the card drawn is a Jack.
Event B is the event that the card drawn is a diamond. Find the probability that the
a. card drawn is a diamond and a Jack.
b. card drawn is a Jack given that the card is a diamond.
c. card drawn is a diamond given that the card is a Jack.
Solution:
P(A) = 4 / 52 = 1 / 13 since there are 4 Jacks in the deck
P(B) = 13 / 52 = 1 / 4 since there are 13 diamonds in the deck
$P(A \cap B)$ = 1 / 52 since there is only 1 Jack of diamonds in the deck
$P(A \mid B) = P(A \cap B) / P(B)$ = (1 / 52) / (13 / 52) = 1 / 13 = P(A)
$P(B \mid A) = P(A \cap B) / P(A)$ = (1 / 52) / (4 / 52) = 1 / 4 = P(B)
As events A and B are independent, we can also obtain $P(A \cap B)$ from the product formula.
That is, $P(A \cap B) = P(A) P(B)$ = (1/13) (1/4) = 1/52
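
The independence of "Jack" and "diamond" can be confirmed by building the deck and counting. In this sketch the rank and suit labels and the helper names are illustrative choices.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs


def prob(event):
    """Probability that a randomly drawn card satisfies the given test."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def is_jack(card):
    return card[0] == "J"


def is_diamond(card):
    return card[1] == "diamonds"


p_a = prob(is_jack)
p_b = prob(is_diamond)
p_ab = prob(lambda card: is_jack(card) and is_diamond(card))

print(p_a, p_b, p_ab, p_ab == p_a * p_b)  # last value is True for independent events
```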

Example 2.6:
A bag contains 6 blue balls and 4 red balls. Two balls will be drawn from the bag. Calculate the
probability of either one of the balls is blue.
Solution:
Let event A be the event that the first ball is blue, and let event B be the event that the second ball is
blue. Then, the event A' will be the event that the first ball is red, and event B' will be the event that
the second ball is red.
Since there are 6 blue balls out of a total of 10 balls, the probability of choosing a blue ball in the first
drawing is 6/10. If a blue ball is taken out, then there will only be 5 blue balls and 9 total balls left;
the probability of choosing a blue ball will be 5/9. On the other hand, if the first ball is a red ball, then
there will be 6 blue balls and a total of 9 balls, in which case there would be a 6/9 (or 2/3) probability
of getting a blue ball.
Therefore
P(A) = 6/10
P(A') = 1 - 6/10 = 4/10
P(B|A) = 5/9
P(B|A') = 2/3
P(A ∩ B) = P(A) P(B|A) = (6/10)(5/9) = 1/3
P(A' ∩ B) = P(A') P(B|A') = (4/10)(2/3) = 4/15
Since A and A' are mutually exclusive events, (B ∩ A) and (B ∩ A') are also mutually exclusive
events. Thus, we can calculate P(B) as follows:
P(B) = P(B ∩ S)
     = P(B ∩ (A ∪ A')) = P((B ∩ A) ∪ (B ∩ A'))
     = P(B ∩ A) + P(B ∩ A')
     = 1/3 + 4/15
     = 9/15

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
         = 6/10 + 9/15 - 1/3
         = 13/15
(A ∪ B) is the event that at least one of the two balls drawn is blue. Equivalently, P(A ∪ B)
is the probability that the first ball is blue, plus the probability that the first ball is red and the second
ball is blue. Thus,
P(A ∪ B) = P(A) + P(A' ∩ B)
         = 6/10 + 4/15
         = 13/15
This is a different way to obtain the solution, but the result is the same. (Note that events A and B
are not independent here: P(B|A) = 5/9, whereas P(B) = 9/15.)
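The arithmetic of Example 2.6 can be checked with exact fractions. The following is a minimal sketch, not part of the original notes; the variable names are illustrative, and the draws are assumed to be made without replacement, as in the solution above.

```python
from fractions import Fraction

blue, red = 6, 4
total = blue + red

p_a = Fraction(blue, total)                    # P(A): first ball is blue
p_not_a = 1 - p_a                              # P(A'): first ball is red
p_b_given_a = Fraction(blue - 1, total - 1)    # P(B|A)
p_b_given_not_a = Fraction(blue, total - 1)    # P(B|A')

p_a_and_b = p_a * p_b_given_a                  # P(A ∩ B), multiplicative rule
p_not_a_and_b = p_not_a * p_b_given_not_a      # P(A' ∩ B)
p_b = p_a_and_b + p_not_a_and_b                # total probability
p_a_or_b = p_a + p_b - p_a_and_b               # additive rule

# Cross-check through the complement: 1 - P(both balls are red)
p_complement = 1 - Fraction(red, total) * Fraction(red - 1, total - 1)

print(p_b, p_a_or_b, p_complement)
```

Both routes give 13/15, and the fact that P(B|A) differs from P(B) shows directly that the two draws are dependent.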
2.8 Bayes Rule
If the set of events A1 , A2 , ....., An constitutes a partition of the sample space S, and event B is a
subset of S, then
B = B ∩ S
  = B ∩ (A1 ∪ A2 ∪ ... ∪ An)
  = (B ∩ A1) ∪ (B ∩ A2) ∪ ... ∪ (B ∩ An)
As the events A1, A2, ..., An are mutually exclusive, the events (B ∩ Ai), where i ∈ {1, 2, ..., n},
are also mutually exclusive. Assuming that none of the events A1, A2, ..., An is null, i.e.
P(Ai) ≠ 0 for all i ∈ {1, 2, ..., n},
P(B) = P(B ∩ A1) + P(B ∩ A2) + ... + P(B ∩ An)
     = P(A1) P(B | A1) + P(A2) P(B | A2) + ... + P(An) P(B | An)
Theorem of total probability
If the events A1, A2, ..., Ak constitute a partition of the sample space S such that P(Ai) ≠ 0 for
i = 1, 2, ..., k, then for any event B of S,
\[ P(B) = \sum_{i=1}^{k} P(B \cap A_i) = \sum_{i=1}^{k} P(A_i)\,P(B \mid A_i) \]

From the definition of conditional probability,
\[ P(A_i \mid B) = \frac{P(A_i \cap B)}{P(B)} = \frac{P(A_i)\,P(B \mid A_i)}{P(B)} \]

Thus we have derived Bayes' Rule, which states the following:


Bayes' Rule :
If the set of events A1, A2, ..., An constitutes a partition of the sample space S, P(Ai) ≠ 0 for i ∈ {1,
2, ..., n}, and event B is a subset of S with P(B) ≠ 0, then
\[ P(A_i \mid B) = \frac{P(A_i)\,P(B \mid A_i)}{P(A_1)\,P(B \mid A_1) + P(A_2)\,P(B \mid A_2) + \cdots + P(A_n)\,P(B \mid A_n)} \]

Example 2.7:
A family had plans to go fishing on a Sunday afternoon, but their plans were dependent on the weather
at noon Sunday. If it was sunny, then there was a 90 % chance that they would go fishing. If it was
cloudy, then the probability that they would go fishing would drop to 50 %. And if it was raining, the
chances dropped to 15 %. The weather prediction, which we can assume to be accurate, called for a
10 % chance of rain, a 25 % chance of clouds, and a 65 % chance of sunshine.

Set event F as the event that the family goes fishing,
S as the event that the weather is sunny at Sunday noon,
C as the event that the weather is cloudy at Sunday noon,
R as the event that the weather is rainy at Sunday noon.
Assuming that the family ends up going fishing, find the probability of each type of weather occurring.
Solution:
P(S) = 0.65, P(C) = 0.25, P(R) = 0.10
Note that P(S) + P(C) + P(R) = 1, and of course S, C and R are mutually exclusive events.
P(F|S) = 0.90, P(F|C) = 0.50, P(F|R) = 0.15
P(F) = P(F|S) P(S) + P(F|C) P(C) + P(F|R) P(R)
= (0.90)(0.65) + (0.50)(0.25) + (0.15)(0.10)
= 0.585 + 0.125 + 0.015
= 0.725
Assuming that the family ends up going fishing, the probability of each type of weather occurring is
P(S|F) = probability of sunny weather, given that the family went fishing.
\[ P(S \mid F) = \frac{P(F \mid S)\,P(S)}{P(F)} = \frac{(0.90)(0.65)}{0.725} = 0.807 \]
P(C|F) = probability of cloudy weather, given that the family went fishing.
\[ P(C \mid F) = \frac{P(F \mid C)\,P(C)}{P(F)} = \frac{(0.50)(0.25)}{0.725} = 0.172 \]
P(R|F) = probability of rainy weather, given that the family went fishing.
\[ P(R \mid F) = \frac{P(F \mid R)\,P(R)}{P(F)} = \frac{(0.15)(0.10)}{0.725} = 0.021 \]

Note that P(S|F) + P(C|F) + P(R|F) = 0.807 + 0.172 + 0.021 = 1.000
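The calculation in Example 2.7 follows the same pattern for any partition, so it can be sketched as a small function. This is an illustrative sketch only: the function name `bayes_posterior` and the dictionary layout are assumptions, not something defined in these notes.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(B) and the posteriors P(A_i | B) for a partition A_1..A_n.

    priors[k]      = P(A_k)
    likelihoods[k] = P(B | A_k)
    """
    joint = {k: priors[k] * likelihoods[k] for k in priors}   # P(A_k ∩ B)
    evidence = sum(joint.values())                            # total probability P(B)
    posterior = {k: v / evidence for k, v in joint.items()}   # Bayes' rule
    return evidence, posterior

# Numbers from Example 2.7
priors = {"sunny": 0.65, "cloudy": 0.25, "rainy": 0.10}
likelihoods = {"sunny": 0.90, "cloudy": 0.50, "rainy": 0.15}

p_fishing, posterior = bayes_posterior(priors, likelihoods)
print(round(p_fishing, 3))
for weather, p in posterior.items():
    print(weather, round(p, 3))
```

The posteriors sum to one by construction, because each is a joint probability divided by their common total.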

2.9 Exercises (Extracted from Schaums Series by Walpole & Mayer)

1. A pair of dice is tossed and the two numbers appearing on the top are recorded. Draw the sample
space and find the number of elements in each of the following events:
(a) A = { two numbers are equal }
(b) B = { sum is 10 or more }
(c) C = { 5 appears on first die }
(d) D = { 5 appears on at least one die }
2. Determine the probability p of each event:
(a) An even number appears in the toss of a fair die.

(b) At least one tail appears in the toss of 3 fair coins.


(c) A white marble appears in the random drawing of 1 marble from a box containing 4 white
marbles, 3 red marbles and 5 blue marbles.

3. A box contains 15 billiard balls, which are numbered from 1 to 15. A ball is drawn at random and
the number recorded. Find the probability P that the number is;
(a) Even
(b) Less than 5
(c) Even and less than 5
(d) Even or less than 5
4. A class contains 10 men and 20 women of which half the women and half the men have brown
eyes. Find the probability P that a person chosen at random is a man or has brown eyes.

5. A sample space S consists of 4 elements, that is, S = {a1, a2, a3, a4}. Under which of the following
functions P does S become a probability space?
(a) P(a1) = 0.4, P(a2) = 0.3, P(a3) = 0.2, P(a4) = 0.3
(b) P(a1) = 0.4, P(a2) = 0.2, P(a3) = 0.7, P(a4) = 0.1
(c) P(a1) = 0.4, P(a2) = 0.2, P(a3) = 0.1, P(a4) = 0.3
(d) P(a1) = 0.4, P(a2) = 0, P(a3) = 0.5, P(a4) = 0.1
6. Suppose A and B are events with P(A) = 0.6, P(B) = 0.3, and P(A ∩ B) = 0.2. Find the probability
that:
(a) A does not occur.
(b) B does not occur.
(c) A or B occurs.
(d) Neither A nor B occurs.
7. Three fair coins, a penny, a nickel, and a dime, are tossed. Find the probability p that they are all
heads if:
(a) The penny is heads
(b) At least one of the coins is heads,
(c) The dime is tails
8. A billiard ball is drawn at random from a box containing 15 billiard balls numbered 1 to 15, and
the number n is recorded.
(a) Find the probability p that n exceeds 10.
(b) If n is even, find the probability p that n exceeds 10.
9. In a certain college, 25 percent of the students failed mathematics, 15 percent failed chemistry, and
10 percent failed both mathematics and chemistry. A student is selected at random.
(a) If the student failed chemistry, what is the probability that he or she failed mathematics?
(b) If the student failed mathematics, what is the probability that he or she failed chemistry?
(c) What is the probability that the student failed mathematics or chemistry?
(d) What is the probability that the student failed neither mathematics nor chemistry?
10. Find P(B | A) if:
(a) A is a subset of B.
(b) A and B are mutually exclusive (disjoint). Assume P(A) > 0.

11. If the probabilities are, respectively, 0.09,0.15,0.21, and 0.23 that a person purchasing a new
automobile will choose the color green, white, red, or blue, what is the probability that a given
buyer will purchase a new automobile that comes in one of those colors?
12. Suppose that a factory has a fuse box containing 20 fuses, of which 5 are defective. If 2 fuses are
selected at random and removed from the box in succession without replacing the first, what is the
probability that both fuses are defective?
13. Items sampled on a production line may be classified as defective (D) or non-defective (N). List
elements in the sample space if sampling process terminates:
(a) After 4 items have been sampled.


(b) After 3 defectives in a row have been observed or 4 items have been sampled.
(c) When the first defective is observed.
Suppose, 5% of the products are defective.
(d) Find the probability of exactly 2 defective items if sampling process (a) is adopted.
(e) In the sampling process (c), what is the probability that the sampling process is terminated
before the 3rd item is sampled?
14. It is compulsory for the driver of a car to wear a seat belt while driving. The results of a survey
show that not all drivers are wearing seat belts.
Age      Driver wearing seat belt    Driver not wearing seat belt
< 40     375                         52
>= 40    425                         148

Use the data to estimate the probability that a randomly chosen driver
(a) Is wearing a seat belt.
(b) Is under 40 and wearing a seat belt.
(c) Suppose the randomly chosen driver is under 40. What is the probability that the driver is
wearing a seat belt?
15. In a certain region of the country it is known from past experience that the probability of selecting
an adult over 40 years of age with cancer is 0.05. If the probability of a doctor correctly diagnosing
a person with cancer as having the disease is 0.78 and the probability of incorrectly diagnosing a
person without cancer as having the disease is 0.06, what is the probability that a person is
diagnosed as having cancer?
16. In a certain assembly plant, three machines, B1, B2, and B3, make 30%,45%,and 25%,respectively
,of the products. It is known from past experience that 2%, 3%, and 2% of the products made by
each machine, respectively, are defective. Now, suppose that a finished product is randomly
selected. What is the probability that it is defective?


17. Suppose that three machines at a factory are used to produce a large quantity of identical parts. The
production machines have different capacities: Machine A has a large capacity and produces 60%
of the parts, while machines B and C produce 30% and 10% of the parts, respectively.
Historical data indicate that 10% of the parts produced by Machine A are defective, compared to
30% for Machine B and 40% for Machine C.

(a) Complete the following table.

Machine    Defective    Nondefective    Total
A
B
C
Total                                   100

(b) What are the conditional probabilities, updated in light of the evidence that the part is defective,
of machine A, B or C having produced it?

2.10 Further Exercises (Probability and statistics: Walpole, Myers and Myers 8th edition)
Exercises - Page 97 - Questions 2.109, 2.110, 2.111, 2.112, 2.129

LECTURE 3
DISCRETE RANDOM VARIABLES
3.1 Introduction

Definition: A random variable X is a numerically valued variable defined on the sample space S,
X: S → R
We say that X is a discrete random variable if it can take only a countable set of values, i.e. integer
or rational values.
Consider tossing a fair coin. We know that the outcome is either a head or a tail.
P(head) = 1/2, P(tail) = 1/2
If we denote the number of heads by X, then
P(X = 1) = 1/2, P(X = 0) = 1/2
X is an example of a random variable. Note that a random variable is usually labelled with a capital
letter (say X). The realised value of the random variable X is denoted by x.

Definition: If we have a discrete random variable X taking values x1, x2, ..., xn
with probabilities p1, p2, ..., pn respectively, where
p1 + p2 + p3 + ... + pn = 1 and
pi ≥ 0 for all i,
then this defines a discrete probability distribution for X. Although we have written the random
variable X as taking a finite set of values in this definition, it also holds for an X which takes an infinite
countable set of values, e.g. all non-negative integers.
We may write P(X = xi) as pi. This is sometimes referred to as the probability function for X.
Example 3.1: Two fair dice are thrown. Let X be the sum of the values on the faces turned uppermost.
Find the probability distribution for X.
The probability distribution of X can be tabulated as follows.

x          2     3     4     5     6     7     8     9     10    11    12
P(X = x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Note that X = 2 if and only if both dice show 1. Also, X = 3 if and only if one die shows 1 and the
other 2.
Note that the sum of the probabilities is one and all are positive so this is a valid probability distribution.
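The table in Example 3.1 can be reproduced by enumerating the 36 equally likely outcomes of the two dice. The sketch below is an illustration added for checking purposes; the names `counts` and `pmf` are arbitrary.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely (die1, die2) outcomes give each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability function P(X = x); Fraction reduces automatically (2/36 prints as 1/18).
pmf = {x: Fraction(n, 36) for x, n in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
print("total:", sum(pmf.values()))
```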
Example 3.2: The discrete random variable X has probability function given by
P(X = x) = cx^2, x = 1, 2, 3, 4. Find c.

x          1    2    3    4
P(X = x)   c    4c   9c   16c

We know that c + 4c + 9c + 16c = 1 and hence c = 1/30.

Definition: Suppose X is a discrete random variable taking values x1, x2, ..., xn, with probabilities
p1, p2, p3, ..., pn. Then the mean or expected value of X, written as μ or E[X], is given by
\[ E[X] = \sum_{i=1}^{n} p_i x_i \]

If X takes an infinite number of values the sum is taken over all values of i.
To justify this definition, suppose we had a sample of x values where x_i occurs with frequency f_i. Then
the sample mean would be
\[ \bar{x} = \frac{\sum_i f_i x_i}{\sum_i f_i} = \sum_i \frac{f_i}{\sum_j f_j}\, x_i \]
In the limit, if we collect enough data, the relative frequency f_i / Σ_j f_j tends to p_i.

Example 3.3: A die is thrown, what is the mean (or expected) score?
\[ E(X) = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + 3 \cdot \tfrac{1}{6} + 4 \cdot \tfrac{1}{6} + 5 \cdot \tfrac{1}{6} + 6 \cdot \tfrac{1}{6} = 3.5 \]
Note that the expected value of a random variable is not necessarily a value the random variable can
take. The expected score when we throw a fair die is 3 1/2, but a die cannot take this value. Think of
the expected value or mean of a random variable as a measure of where the distribution is centred.
The expectation of any function of a random variable, g(X) say, is defined in a similar way.

Definition: If X is a discrete random variable then the expectation of g(X) is given by
\[ E[g(X)] = \sum_{i=1}^{n} p_i\, g(x_i) \]

We can also define the variance and standard deviation of a random variable.
Definition: If X is a discrete random variable then its variance, written Var[X], is defined by
\[ \mathrm{Var}[X] = \sum_{i=1}^{n} p_i (x_i - \mu)^2 \]
The standard deviation of X is the positive square root of the variance of X.
By multiplying out the bracket it is straightforward to see that the variance is given by
\[ \mathrm{Var}[X] = \sum_{i=1}^{n} p_i x_i^2 - \mu^2 \quad \text{or} \quad \mathrm{Var}[X] = E[X^2] - (E[X])^2 \]

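Example 3.4: A die is thrown, what is the variance of the score?
\[ E[X^2] = 1^2 \cdot \tfrac{1}{6} + 2^2 \cdot \tfrac{1}{6} + 3^2 \cdot \tfrac{1}{6} + 4^2 \cdot \tfrac{1}{6} + 5^2 \cdot \tfrac{1}{6} + 6^2 \cdot \tfrac{1}{6} = \frac{91}{6} \]
\[ \mathrm{Var}[X] = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \]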
The variance gives an idea of how spread out the distribution is.
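The definitions of the mean and variance translate directly into a few lines of code. The helper below is a sketch with an assumed name (`mean_and_variance`); it takes a probability function as a dictionary mapping each value x to P(X = x) and uses exact fractions so the results match Examples 3.3 and 3.4.

```python
from fractions import Fraction

def mean_and_variance(pmf):
    """Return (E[X], Var[X]) for a discrete pmf given as {value: probability}."""
    mean = sum(p * x for x, p in pmf.items())                # E[X] = sum of p_i x_i
    second_moment = sum(p * x * x for x, p in pmf.items())   # E[X^2]
    return mean, second_moment - mean ** 2                   # Var[X] = E[X^2] - (E[X])^2

fair_die = {x: Fraction(1, 6) for x in range(1, 7)}
mean, variance = mean_and_variance(fair_die)
print(mean, variance)   # 7/2 35/12
```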

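3.2 Bernoulli Distribution

A random variable X has a Bernoulli distribution with parameter p if it takes only the values 0 and 1,
with probabilities

x       0       1
P(x)    1 - p   p

Example 3.5: Toss a coin once. Let p be the probability of getting a head and
X = 0 if T occurs
  = 1 if H occurs

Then X ~ Bernoulli(p).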

3.3 Binomial Distribution


If we have n independent Bernoulli trials with probability of a success equal to p then the probability
of r successes is given by the binomial probability
\[ P(X = r) = \binom{n}{r} p^{r} (1-p)^{n-r}, \qquad r = 0, 1, 2, \ldots, n. \]
Thus, if we consider the random variable X which is the number of successes in n Bernoulli trials, then
P(X = r) is given by the binomial probability with parameters n and p.
Statistical table 1 gives the probability of r or more successes in n independent trials with the
probability of success p. For example if we wanted the probability of obtaining 23 or more heads in
50 tosses of a fair coin we find that the answer is 0.76006.
Example 3.6: Suppose that 5% of the articles made by a factory are defective. What is the probability
of finding 1 defective in a sample of 10 from a very large batch? Since it is a large batch we may treat
this as sampling with replacement, and the number of defectives, X, will have a binomial distribution
with n = 10 and p = 0.05. Thus
\[ P(X = 1) = \binom{10}{1} (0.05)(0.95)^{9} = 0.315 \]
We can also find this quantity from the tables,
P(X = 1) = P(X ≥ 1) - P(X ≥ 2) = 0.40126 - 0.08614 = 0.31512
The tables are only given for some values of n and p so are not always useful, but you should know
how to use them. Note that although p is only given up to 0.5, we can always turn a problem where the
probability of a success is greater than 0.5 into a question about failures which will have probability
less than 0.5. An example of this is given next.
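When the tables do not cover the required n and p, the binomial probabilities can be computed directly from the formula. The sketch below uses only the standard library; the function names `binom_pmf` and `binom_upper_tail` are illustrative. The upper-tail function mirrors the layout of Statistical Table 1, which gives the probability of r or more successes.

```python
from math import comb

def binom_pmf(r, n, p):
    """P(X = r) for X binomial with parameters n and p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

def binom_upper_tail(r, n, p):
    """P(X >= r), the quantity tabulated in the binomial tables."""
    return sum(binom_pmf(k, n, p) for k in range(r, n + 1))

# Example 3.6: exactly one defective in a sample of 10 with p = 0.05
p_one_defective = binom_pmf(1, 10, 0.05)
via_tables = binom_upper_tail(1, 10, 0.05) - binom_upper_tail(2, 10, 0.05)

# 23 or more heads in 50 tosses of a fair coin
p_23_or_more_heads = binom_upper_tail(23, 50, 0.5)

print(round(p_one_defective, 5), round(via_tables, 5), round(p_23_or_more_heads, 5))
```

The direct formula and the difference of two tail probabilities agree, which is the identity P(X = r) = P(X ≥ r) - P(X ≥ r + 1) used with the tables.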

Example 3.7: Fifty seeds were planted and it is known that the probability of any seed germinating is
0.8. Assuming that the number of seeds germinating follows a binomial distribution, using tables find
the probabilities of the following events:
(a) exactly 40 seeds germinate,
(b) more than 12 seeds fail to germinate,
(c) more than 38 but fewer than 45 seeds germinate.

The tables only give values of p up to 0.5 so we have to convert the events to questions about failing
to germinate. The chance of a seed failing to germinate is 0.2. Let X be the number of seeds that
germinate and Y the number of seeds that fail to germinate so that X + Y = 50. Thus we require for (a)
P(X = 40) = P(Y = 10)
          = P(Y ≥ 10) - P(Y ≥ 11)
          = 0.55626 - 0.41644
          = 0.13982 ≈ 0.140

For (b), Y > 12 is the same as Y ≥ 13 and from the tables P(Y ≥ 13) = 0.18606 ≈ 0.186.
For (c), 38 < X < 45 is the same as 6 ≤ Y ≤ 11 and the probability of this is
P(Y ≥ 6) - P(Y ≥ 12) = 0.95197 - 0.28933
                     = 0.66264 ≈ 0.663
Note that we quote the final answers to no more than 3 decimal places.
When the tables do not exist there are approximations we may use. For example it can be shown that
as n → ∞ and p → 0 with np held fixed, a binomial distribution tends to a Poisson distribution (see later). Thus if n is
large and p is small we may approximate a binomial random variable by a Poisson one.

We next derive the mean and variance of a random variable X which has a binomial distribution with
parameters n and p.
Proposition 3.1: If X has a binomial distribution with parameters n and p then
E[X] = np and Var[X] = np(1 - p).
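Proof: By the definition of the mean, and substituting k = r - 1 in the third line from the end,
\[
\begin{aligned}
E[X] &= \sum_{r=0}^{n} r\,P(X = r) \\
     &= \sum_{r=0}^{n} r \binom{n}{r} p^{r} (1-p)^{n-r} \\
     &= \sum_{r=1}^{n} r \binom{n}{r} p^{r} (1-p)^{n-r} \\
     &= \sum_{r=1}^{n} r\,\frac{n!}{r!\,(n-r)!}\, p^{r} (1-p)^{n-r} \\
     &= \sum_{r=1}^{n} \frac{n\,(n-1)!}{(r-1)!\,(n-r)!}\, p^{r} (1-p)^{n-r} \\
     &= n \sum_{r=1}^{n} \binom{n-1}{r-1} p^{r} (1-p)^{n-r} \\
     &= np \sum_{r=1}^{n} \binom{n-1}{r-1} p^{r-1} (1-p)^{n-r} \\
     &= np \sum_{k=0}^{n-1} \binom{n-1}{k} p^{k} (1-p)^{n-1-k} \\
     &= np\,[p + (1-p)]^{n-1} \\
     &= np
\end{aligned}
\]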

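Recall that Var[X] = E[X^2] - (E[X])^2.
Now E[X^2] = E[X^2] - E[X] + E[X] = E[X(X - 1)] + E[X].
Then, substituting k = r - 2 in the third line from the end,
\[
\begin{aligned}
E[X(X-1)] &= \sum_{r=0}^{n} r(r-1)\,P(X = r) \\
          &= \sum_{r=2}^{n} r(r-1)\,\frac{n!}{r!\,(n-r)!}\, p^{r} (1-p)^{n-r} \\
          &= \sum_{r=2}^{n} \frac{n(n-1)\,(n-2)!}{(r-2)!\,(n-r)!}\, p^{r} (1-p)^{n-r} \\
          &= n(n-1) \sum_{r=2}^{n} \binom{n-2}{r-2} p^{r} (1-p)^{n-r} \\
          &= n(n-1)p^{2} \sum_{r=2}^{n} \binom{n-2}{r-2} p^{r-2} (1-p)^{n-r} \\
          &= n(n-1)p^{2} \sum_{k=0}^{n-2} \binom{n-2}{k} p^{k} (1-p)^{n-2-k} \\
          &= n(n-1)p^{2}\,[p + (1-p)]^{n-2} \\
          &= n(n-1)p^{2}
\end{aligned}
\]
Thus Var[X] = n(n - 1)p^2 + np - (np)^2 = np(1 - p).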


Example 3.8: The random variable X has a binomial distribution with parameters n=100 and p=0.8.
Find the mean and the variance of X.
The mean = np = 80, the variance is np (1- p) =16
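Proposition 3.1 can be sanity-checked numerically by summing over the whole binomial probability function. The sketch below is an illustration only (the function name `binomial_moments` is an assumption); it uses floating-point arithmetic, so agreement with np and np(1 - p) is up to rounding error.

```python
from math import comb

def binomial_moments(n, p):
    """Return (E[X], Var[X]) computed by direct summation over r = 0..n."""
    pmf = [comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(n + 1)]
    mean = sum(r * q for r, q in enumerate(pmf))                  # sum of r P(X = r)
    second_moment = sum(r * r * q for r, q in enumerate(pmf))     # sum of r^2 P(X = r)
    return mean, second_moment - mean**2

# Example 3.8: n = 100, p = 0.8
n, p = 100, 0.8
mean, variance = binomial_moments(n, p)
print(mean, n * p)                 # both close to 80
print(variance, n * p * (1 - p))   # both close to 16
```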
3.4 Poisson Distribution
Suppose events occur at random at an average rate λ per minute. Examples include radioactive decay
and arrivals in a queue. Then the number of events which occur in one minute is
said to have a Poisson distribution with parameter λ. If X has a Poisson distribution then
\[ P[X = r] = \exp(-\lambda)\,\frac{\lambda^{r}}{r!}, \qquad r = 0, 1, 2, \ldots \]
where λ > 0. Note that
\[ \sum_{r=0}^{\infty} \exp(-\lambda)\,\frac{\lambda^{r}}{r!} = \exp(-\lambda) \sum_{r=0}^{\infty} \frac{\lambda^{r}}{r!} = \exp(-\lambda)\exp(\lambda) = 1 \]
So this is a valid probability distribution.
It can be shown that E(X) = λ and V(X) = λ.
Statistical Table (3) gives the probability that a Poisson random variable with mean λ will be greater
than or equal to r, in the same way as the binomial tables. For example, suppose that X is a random variable
with Poisson distribution with mean 2.0.
Find (1) P(X = 2) (2) P(X ≥ 3) (3) P(X < 2)
P(X = 2) = P(X ≥ 2) - P(X ≥ 3) = 0.59399 - 0.32332 = 0.27067
P(X ≥ 3) = 0.32332
P(X < 2) = 1 - P(X ≥ 2) = 1 - 0.59399 = 0.40601
A property of the Poisson distribution is that if the number of events in one time unit is Poisson with mean λ,
then the number of events in k time units is Poisson with mean kλ. This can be useful in calculating
probabilities of numbers of events in a time period different to that for which information is given.
Note that if X has a binomial distribution with parameters n and p then E(X) = np and
Var[X] = np(1 - p). Now if p is small then 1 - p is close to one and np(1 - p) ≈ np. This suggests that if p
is small we may be able to approximate X by a Poisson random variable with mean np. So long as p
is small (say < 0.1) and n is large (say > 50) a binomially distributed random variable is well
approximated by a Poisson random variable of mean np.

Example 3.9: If X has a binomial distribution with n = 100 and p = 0.01, then from the tables
P(X ≥ 1) = 0.63397
P(X ≥ 2) = 0.26424
P(X = 1) = 0.36973
The corresponding quantities from the Poisson tables with λ = 1 are
P(Y ≥ 1) = 0.63212
P(Y ≥ 2) = 0.26424
P(Y = 1) = 0.36788
Example 3.10: The probability that a car has a defective gearbox is 0.02. If I check the gearboxes of
140 cars what is a suitable approximation to the probability that I find
(a) 2 defectives (b) more than 5 defectives (c) fewer than 4 defectives
Let X be the number of defective gearboxes that I find. Then X has a binomial distribution with n = 140
and p = 0.02. Since n is large and p is small, a Poisson random variable with mean λ = np = 2.8 will
give a good approximation to X. From the Poisson tables:
(a) P(X = 2) = P(X ≥ 2) - P(X ≥ 3) = 0.76892 - 0.53055 = 0.238
(b) P(X > 5) = P(X ≥ 6) = 0.065
(c) P(X < 4) = 1 - P(X ≥ 4) = 1 - 0.30806 = 0.692
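To see how close the Poisson approximation in Example 3.10 is, the three probabilities can be computed both from the Poisson formula and from the exact binomial distribution. This is a sketch under the example's stated values; the function names and the dictionary labels are illustrative.

```python
from math import comb, exp, factorial

def poisson_pmf(r, lam):
    return exp(-lam) * lam**r / factorial(r)

def binom_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 140, 0.02
lam = n * p   # Poisson mean np = 2.8

def three_events(pmf):
    """Probabilities of (a) exactly 2, (b) more than 5, (c) fewer than 4."""
    return {
        "exactly 2": pmf(2),
        "more than 5": 1 - sum(pmf(r) for r in range(6)),
        "fewer than 4": sum(pmf(r) for r in range(4)),
    }

poisson = three_events(lambda r: poisson_pmf(r, lam))
exact = three_events(lambda r: binom_pmf(r, n, p))

for event in poisson:
    print(event, round(poisson[event], 3), round(exact[event], 3))
```

The Poisson column reproduces the table-based answers above; the exact binomial values differ only slightly, which is what the rule of thumb (p small, n large) leads us to expect.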

3.5 Exercises
1.

A manufacturing process produces components which are free from any faults with probability
p. Find the probability that in a sample of size 50 from a large batch there are fewer than 4
faulty components when p = 0.95. Find the probability that in a sample of size 50 there are
fewer than 10 faulty when p = 0.75.

2.

Use the table to give a suitable approximation to the probability that X 5 where X is binomial
random variable with parameters p = 0.05 and n = 400.

3. A car-pooling study shows that the number of passengers, X, in a car (excluding the driver) is
likely to assume the values 0, 1, 2, 3 and 4 with probabilities given by the table.

x          0     1     2     3      4
P(X = x)   0.7   0.1   0.1   0.05   0.05

(a) Determine the probability of at least two passengers in a car.
(b) Find the cumulative distribution function of X and sketch it.
(c) Calculate
(i) E(X)
(ii) E(X^2)
(iii) V(X)

4. Suppose that in late summer, the Fremantle Surf Life Saving club makes an average of two surf
rescues per day. Use the Poisson probability distribution to determine the probability that
(a) More than two rescues are made on a particular day.
(b) Five surf rescues are made in a 3-day period.


3.6 Further Exercises (Probability and statistics: Walpole, Myers and Myers 8th Edition)
1. Exercises - Page 189 - Question 5.51 5.70


LECTURE 4
CONTINUOUS RANDOM VARIABLES
4.1 Introduction
A random variable X is a numerically valued variable defined on the sample space S,
X: S → R
We say that X is a continuous random variable if it is not discrete.
Definition: If X is a continuous random variable then there exists a non-negative function, f(x), called
the probability density function of X, such that
\[ \int_{-\infty}^{\infty} f(x)\,dx = 1 \]
and
\[ P(a \le X \le b) = \int_{a}^{b} f(x)\,dx \]

Note that any function, which is non-negative and integrates to one is a possible probability density
function for a random variable X. As with discrete random variables some density functions are
commonly used to model continuous random variables. It is also convenient to define the following
function.
Definition: The cumulative distribution function, F(x), of a continuous random variable X is defined by
\[ F(t) = P(X \le t) = \int_{-\infty}^{t} f(x)\,dx \]
Note that for a discrete random variable the cumulative distribution function
P(X ≤ x) will be a step function with steps of height P(X = x) at the points at which X is defined. The
continuous version can be thought of as a limiting case when all values of x in an interval are possible.
Note that the cumulative distribution function is always non-decreasing and
\[ \lim_{x \to -\infty} F(x) = 0, \qquad \lim_{x \to \infty} F(x) = 1 \]

We define the mean of a continuous random variable as follows.


Definition: If X is a continuous random variable with probability density function f(x) then the mean
or expected value of X, E[X] or μ, is defined by
\[ E[X] = \int_{-\infty}^{\infty} x f(x)\,dx \]
We define the expectation of a function of X in a similar way.

Definition: If X is a continuous random variable with probability density function f(x) then the
expected value of g(X) is defined by
\[ E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx \]
Similarly the variance is defined as follows.

Definition: If X is a continuous random variable with probability density function f(x) then the variance
of X, Var[X], is defined by
\[ \mathrm{Var}[X] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2 \]
We can also define the median of a continuous random variable.

Definition: If X is a continuous random variable with probability density function f(x) then the median
of X is the value m satisfying the equation
\[ \int_{-\infty}^{m} f(x)\,dx = \int_{m}^{\infty} f(x)\,dx = \frac{1}{2} \]

It is the value such that X is equally likely to be more than the median as less than it.
Example 4.1: A random variable X has probability density function
\[ f(x) = \begin{cases} c\,x^{2}(1 - x) & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
1. Determine c.
2. Find E[X].
3. Find Var[X].
4. Show that the median m satisfies the equation
\[ 6m^{4} - 8m^{3} + 1 = 0 \]

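Solution:
1. We know that \( \int_{-\infty}^{\infty} f(x)\,dx = 1 \), so
\[ \int_{0}^{1} c\,(x^{2} - x^{3})\,dx = 1 \]
\[ c \left[ \frac{x^{3}}{3} - \frac{x^{4}}{4} \right]_{0}^{1} = 1 \]
\[ c \cdot \frac{1}{12} = 1 \]
and hence c = 12.
2.
\[ E(X) = \int_{0}^{1} 12\,(x^{3} - x^{4})\,dx = 12 \left[ \frac{x^{4}}{4} - \frac{x^{5}}{5} \right]_{0}^{1} = \frac{12}{20} = \frac{3}{5} \]
3.
\[ E[X^{2}] = \int_{0}^{1} 12\,(x^{4} - x^{5})\,dx = 12 \left[ \frac{x^{5}}{5} - \frac{x^{6}}{6} \right]_{0}^{1} = \frac{12}{30} = \frac{2}{5} \]
Thus
\[ \mathrm{Var}[X] = \frac{2}{5} - \left( \frac{3}{5} \right)^{2} = \frac{1}{25} \]
4.
\[ \int_{0}^{m} 12\,(x^{2} - x^{3})\,dx = 0.5 \]
\[ 12 \left[ \frac{x^{3}}{3} - \frac{x^{4}}{4} \right]_{0}^{m} = 0.5 \]
\[ 12 \left( \frac{m^{3}}{3} - \frac{m^{4}}{4} \right) = 0.5 \]
\[ 4m^{3} - 3m^{4} = 0.5 \]
\[ 6m^{4} - 8m^{3} + 1 = 0 \]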
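The results of Example 4.1 can be verified with exact polynomial integration, and the median equation can be solved numerically. The sketch below is an illustration: the helper `integrate_poly` and the bisection routine are assumptions introduced here, not methods used in the notes. Bisection is valid because 6m^4 - 8m^3 + 1 is decreasing on [0, 1], positive at 0 and negative at 1.

```python
from fractions import Fraction

def integrate_poly(coeffs, a, b):
    """Exactly integrate sum(coeffs[k] * x**k) over [a, b] using fractions."""
    def antiderivative(x):
        return sum(Fraction(c) * x ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(Fraction(b)) - antiderivative(Fraction(a))

# x^2 (1 - x) = x^2 - x^3, stored as coefficients of 1, x, x^2, x^3
unnormalised = [0, 0, 1, -1]
c = 1 / integrate_poly(unnormalised, 0, 1)       # normalising constant
density = [c * k for k in unnormalised]          # f(x) = c (x^2 - x^3)

mean = integrate_poly([0] + density, 0, 1)                 # integral of x f(x)
second_moment = integrate_poly([0, 0] + density, 0, 1)     # integral of x^2 f(x)
variance = second_moment - mean**2

def median_by_bisection(tol=1e-12):
    """Solve 6 m^4 - 8 m^3 + 1 = 0 on [0, 1] by bisection."""
    g = lambda m: 6 * m**4 - 8 * m**3 + 1
    lo, hi = 0.0, 1.0            # g(lo) > 0 > g(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

median = median_by_bisection()
print(c, mean, variance, median)
```

The median comes out slightly above the mean of 3/5, and substituting it into F(m) = 4m^3 - 3m^4 gives 0.5 to within the bisection tolerance.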

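4.2 Exponential Distribution

The exponential distribution can be used to model the lifetimes of components. It is also linked to the
Poisson distribution. If events occur at random at rate λ, so that the number of events in a unit of time
has a Poisson distribution with mean λ, then the time between successive events follows
an exponential distribution with parameter λ.
The probability density function for an exponential distribution is
\[ f(x) = \begin{cases} \lambda \exp(-\lambda x) & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases} \]
We shall check first that this is a valid p.d.f. Clearly f(x) ≥ 0. Also
\[ \int_{0}^{\infty} \lambda \exp(-\lambda x)\,dx = \left[ -\exp(-\lambda x) \right]_{0}^{\infty} = 1 \]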

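To find the mean we use integration by parts:
\[
\begin{aligned}
E[X] &= \int_{0}^{\infty} x\,\lambda \exp(-\lambda x)\,dx \\
     &= \left[ -x \exp(-\lambda x) \right]_{0}^{\infty} + \int_{0}^{\infty} \exp(-\lambda x)\,dx \\
     &= \left[ -\frac{1}{\lambda} \exp(-\lambda x) \right]_{0}^{\infty} \\
     &= \frac{1}{\lambda}
\end{aligned}
\]
To find the variance we first find E[X^2]. This is also done by integration by parts:
\[
\begin{aligned}
E[X^{2}] &= \int_{0}^{\infty} x^{2}\,\lambda \exp(-\lambda x)\,dx \\
         &= \left[ -x^{2} \exp(-\lambda x) \right]_{0}^{\infty} + 2 \int_{0}^{\infty} x \exp(-\lambda x)\,dx \\
         &= \frac{2}{\lambda}\,E(X) \\
         &= \frac{2}{\lambda^{2}}
\end{aligned}
\]
Therefore
\[ \mathrm{Var}[X] = \frac{2}{\lambda^{2}} - \frac{1}{\lambda^{2}} = \frac{1}{\lambda^{2}} \]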

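The cumulative distribution function is given by
\[ F(x) = P(X \le x) = \begin{cases} 0 & \text{if } x < 0 \\ \int_{0}^{x} f(t)\,dt & \text{if } x \ge 0 \end{cases} \]
Now
\[ \int_{0}^{x} f(t)\,dt = \int_{0}^{x} \lambda \exp(-\lambda t)\,dt = \left[ -\exp(-\lambda t) \right]_{0}^{x} = 1 - \exp(-\lambda x) \]
The median m is given by F(m) = 1/2. Therefore, substituting into the cdf,
\[
\begin{aligned}
1 - \exp(-\lambda m) &= 1/2 \\
\exp(-\lambda m) &= 1/2 \\
-\lambda m &= \ln(1/2) \\
\lambda m &= \ln 2 \\
m &= \frac{1}{\lambda} \ln 2
\end{aligned}
\]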
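The closed-form results for the exponential distribution are easy to check numerically. In the sketch below the rate `lam = 0.5` is a hypothetical value chosen only for illustration, and the mean is approximated with a simple midpoint rule over a truncated range, so it agrees with 1/λ only up to a small numerical error.

```python
from math import exp, log

def exponential_pdf(x, lam):
    return lam * exp(-lam * x) if x >= 0 else 0.0

def exponential_cdf(x, lam):
    """F(x) = 1 - exp(-lam x) for x >= 0, and 0 otherwise."""
    return 1.0 - exp(-lam * x) if x >= 0 else 0.0

def exponential_median(lam):
    """m = ln(2) / lam, from solving F(m) = 1/2."""
    return log(2) / lam

def numeric_mean(lam, steps=100_000):
    """Midpoint-rule approximation of the integral of x f(x) over [0, 40/lam]."""
    h = (40.0 / lam) / steps
    return sum((i + 0.5) * h * exponential_pdf((i + 0.5) * h, lam) * h for i in range(steps))

lam = 0.5   # hypothetical rate, for illustration only
median = exponential_median(lam)
print(median, exponential_cdf(median, lam))   # the cdf at the median is 0.5
print(numeric_mean(lam), 1 / lam)             # both close to 2
```

Note that the median ln(2)/λ is smaller than the mean 1/λ, reflecting the right skew of the distribution.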
4.3 Exercises
1.

The random variable X has probability density function f ( x) c(2 x) 3 for 0 < x < 2 and zero
otherwise. Determine c. Find the cumulative distribution function F(x).


2.

Assume that the continuous random variable x has the probability density function
k (9 4 x 2 ) for 0 x 3 / 2
f ( x)
0
otherwise

(a) Calculate the value of k.
(b) Find the mean and variance of x.
(c) Find the cumulative distribution function of x.
(d) Find the median of x.
(e) Find P(1/2 ≤ x < 1).

3. The time (in hours) between successive calls has an exponential distribution with parameter
λ = 1/6. What is the probability of waiting more than 15 minutes between any two successive
calls?

4. Identify and name the continuous random variables from the following list of variables:
X: the number of automobile accidents per year in Virginia.
Y: the length of time to play 18 holes of golf.
M: the amount of milk produced yearly by a particular cow.
N: the number of eggs laid each month by a hen.
P: the number of building permits issued each month in a certain city.
Q: the weight of grain produced per acre.

4.4 Further Exercises (Probability and statistics: Walpole, Myers and Myers 8th Edition)
1. Exercises - Page 112- Question 3.7, 3.9, 3.12, 3.21


LECTURE 5

NORMAL DISTRIBUTION
5.1 Introduction
The normal, or Gaussian, distribution is the most commonly used distribution in statistics. A normally
distributed random variable with mean μ and variance σ^2 has its probability density function given by
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right] \quad \text{for } -\infty < x < \infty \]
It is denoted by X ~ N(μ, σ^2).
If X is normally distributed with mean 0 and variance 1, then we write X ~ N(0, 1). Its probability
density function is usually written as φ(x) and is given by
\[ \varphi(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^{2}}{2} \right) \quad \text{for } -\infty < x < \infty \]
The cumulative distribution function is denoted by Φ(x).
We can calculate probabilities for a normal distribution from the standard normal using
\[ \frac{X - \mu}{\sigma} \sim N(0, 1) \]
Statistical Table (4) gives the probability that a standard normal random variable, i.e. with mean zero
and variance 1, is larger than a specified value, i.e. 1 - Φ(x). In using the tables we utilise the symmetry
of the normal distribution, and the fact that P(Z < 0) = P(Z > 0) = 0.5.
Example 5.1: Calculate the probabilities of the following events.
(i) Z < -2.45
(ii) (Z < -2.1) ∪ (Z > 2.1)
(iii) 0 < Z < 1.2
Solution:
(i) By symmetry P(Z < -2.45) = P(Z > 2.45) = 0.00714
(ii) By symmetry P[(Z < -2.1) ∪ (Z > 2.1)] = 2 P(Z > 2.1) = 2 × 0.01786 = 0.03572
(iii) P[Z > 1.2] = 0.11507
P[Z < 1.2] = 1 - 0.11507
           = 0.88493
P[0 < Z < 1.2] = 0.88493 - 0.5
               = 0.38493
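The upper-tail probabilities read from Statistical Table (4) can also be computed from the complementary error function, since 1 - Φ(z) = erfc(z/√2)/2. The sketch below uses only the standard library; the function name `upper_tail` is an assumption made for this illustration.

```python
from math import erfc, sqrt

def upper_tail(z):
    """P(Z > z) for a standard normal Z, i.e. 1 - Phi(z)."""
    return 0.5 * erfc(z / sqrt(2))

# Example 5.1
p_i = upper_tail(2.45)             # P(Z < -2.45) = P(Z > 2.45) by symmetry
p_ii = 2 * upper_tail(2.1)         # both tails beyond 2.1
p_iii = 0.5 - upper_tail(1.2)      # P(0 < Z < 1.2)

print(p_i, p_ii, p_iii)
```

The computed values agree with the table-based answers to about four decimal places; small differences in the last digit arise because the tables are rounded before the arithmetic is done.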
Example 5.2: It is known that in a certain district the heights of adult males are normally distributed
with mean 175cm and standard deviation 7cm. Find the probability that a man selected at random from
this district will be
(a) over 182cm tall.
(b) between 170cm and 181cm tall.
(c) under 179cm tall.
Let X be the height of the selected man.
Then X ~ N(175, 7^2), so Z = (X - 175)/7 ~ N(0, 1).
(a) P(X > 182) = P(Z > (182 - 175)/7) = P(Z > 1) = 0.159
(b) P(170 < X < 181) = P(-5/7 < Z < 6/7)
                     = P(Z > -5/7) - P(Z > 6/7)
                     = 0.7625 - 0.1968 = 0.566
(c) P(X < 179) = P(Z < 4/7) = 1 - P(Z > 4/7) = 1 - 0.284 = 0.716

5.2 Normal as an Approximating Distribution


When n is large and p moderate we may use the normal distribution to approximate binomial
probabilities. Note that as we are approximating a discrete random variable by a continuous one, we
have to employ a continuity correction.
For an integer-valued random variable X, P(X < x) = P(X ≤ x - 1). We approximate these quantities by
P(Y < x - 1/2), where Y is the approximating normal random variable. We illustrate the technique in the
following example.
Example 5.3: A fair coin is tossed 150 times. Find a suitable approximation to the
probability of each of the following events.
(a) more than 70 heads
(b) fewer than 82 heads
(c) more than 72 but fewer than 79 heads.
Let X be the number of heads thrown, then X has a binomial distribution with n = 150 and p = 1/2. As
n is large and p moderate we may approximate X by Y, a normal random variable with mean np = 75
and variance np(1 - p) = 37.5.
(a) We require P(X > 70) but this is the same as P(X ≥ 71), so we approximate by P(Y > 70.5).
P(Z > (70.5 - 75)/√37.5) = P(Z > -0.735) = 0.769
(b) We require P(X < 82) but this is the same as P(X ≤ 81), so we approximate
by P(Y < 81.5). P(Z < (81.5 - 75)/√37.5) = P(Z < 1.06) = 1 - 0.145 = 0.855
(c) We require P(72 < X < 79) which is the same as P(73 ≤ X ≤ 78) and thus we
approximate by P(72.5 < Y < 78.5).
P(-0.408 < Z < 0.571) = 0.658 - 0.284 = 0.374
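The quality of the continuity-corrected approximation can be judged by comparing it with the exact binomial probability. The sketch below does this for part (a) of Example 5.3 only; the helper name `normal_upper_tail` is an assumption, and the exact value is obtained by summing binomial coefficients, which is feasible here because p = 1/2.

```python
from math import comb, erfc, sqrt

def normal_upper_tail(y, mean, var):
    """P(Y > y) for Y normal with the given mean and variance."""
    return 0.5 * erfc((y - mean) / sqrt(2 * var))

n, p = 150, 0.5
mean, var = n * p, n * p * (1 - p)    # 75 and 37.5

# (a) more than 70 heads: P(X > 70) = P(X >= 71), approximated by P(Y > 70.5)
approx_more_than_70 = normal_upper_tail(70.5, mean, var)

# Exact binomial probability; with p = 1/2 every outcome has probability 2**-n
exact_more_than_70 = sum(comb(n, r) for r in range(71, n + 1)) / 2**n

# For contrast: the same approximation without the continuity correction
uncorrected = normal_upper_tail(70, mean, var)

print(round(approx_more_than_70, 3), exact_more_than_70, uncorrected)
```

With the half-unit correction the normal value lies much closer to the exact binomial probability than the uncorrected value does, which is the reason for using it.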

We may similarly approximate a Poisson random variable by a normal one of the same mean and
variance so long as this mean is moderately large. We again have to use the continuity correction.

Example 5.4: A radioactive source emits particles at random at an average rate of 36 per hour. Find
an approximation to the probability that more than 40 particles are emitted in one hour.
Let X be the number of particles emitted in one hour. Then X has a Poisson distribution with mean 36
and variance 36. We can approximate X by Y which has a N(36, 36) distribution. We require P(X >
40). This is approximately P(Y > 40.5).
P(Y > 40.5) = P(Z > (40.5 - 36)/6)
            = P(Z > 0.75)
            = 0.2266

5.3 Exercises

1. The sample data consists of the values:
0.325 0.317 0.375 0.325 0.508 0.117 0.150 0.317 0.275 0.383
(i) What is the percentage of values within one standard deviation of the mean?
(ii) What is the percentage of values within two standard deviations of the mean?
Do they appear to come from a Normal Distribution? Justify your answer.

2.

Construct a Normal probability plot using SPSS for the data given in (1) and explain how it
could be used for checking normality.

3.
94 95 30 98 76 73 95 97 86 91 85 70 96
70 91 72 97 97 84 28 19 90 77 58 58 47
48 28 20 65

(a) Plot these data.
(b) Find the mean and the standard deviation for this data.
(c) Let X be a Gaussian (Normal) random variable with the mean and standard deviation you
calculated in part (b). Find the following probabilities.
(i) P(X < 30)
(ii) P(X > 90)
(iii) P(50 < X < 80)
(d) Find the proportion of data values that are
(i) less than 30
(ii) greater than 90
(iii) from 50 to 80
(e) Can the distribution of these values be approximated by a Normal distribution?
4.

The lengths of a batch of bolts are assumed normally distributed with mean 4cm and standard
deviation 0.1cm. What is the probability that a bolt selected at random will be more than
4.1655cm in length? (Give answer to 5 dp)

5. A coin is to be tossed 100 times.
(a) Assuming the coin is biased with P(head) = 0.6, use a normal approximation to estimate
the probability that between 56 and 63 heads occur.
(b) Assume P(head) = 0.99. Use a suitable approximation to estimate the probability that
exactly 99 heads occur. (Do not calculate the exact binomial probability.)

5.4 Further Exercises (Probability and statistics: Walpole, Myers and Myers 8th Edition)
1. Exercises - Page 209 - Questions 6.1 to 6.10


LECTURE 6

RANDOM SAMPLING AND SAMPLING DISTRIBUTIONS


6.1 Introduction
Definition: The sampling distribution of a random variable is the collection or distribution of all possible values of the random variable over all possible samples. If the sample is a random sample of size n from an infinite population, then $x_1, x_2, \ldots, x_n$ are independent random variables, each with the same distribution (i.e. the same p.d.f. or probability function) as the population, so that
$$E(x_i) = \mu, \qquad \mathrm{Var}(x_i) = \sigma^2.$$

Theorem 1 Averaging over all random samples of size n from an arbitrary population with mean $\mu$ and variance $\sigma^2$, the sample mean $\bar{x}$ and sample variance $s^2$ have the following three properties:
$E(\bar{x}) = \mu$, i.e. $\bar{x}$ is an unbiased estimator of $\mu$.
$\mathrm{Var}(\bar{x}) = \sigma^2/n$, i.e. the variability of $\bar{x}$ as an estimator decreases with n.
$E[s^2] = \sigma^2$, i.e. $s^2$ is an unbiased estimator of $\sigma^2$.
Thus $s^2/n$ is used as an unbiased estimate of the variability or variance of $\bar{x}$ as an estimator of $\mu$.
Example 6.1: An infinite population is described by an asymmetrical discrete distribution with just two values: -3 with probability 0.3 and +1 with probability 0.7. Thus we have
$$\mu = E[X] = (-3 \times 0.3) + (1 \times 0.7) = -0.2$$
$$\sigma^2 = \mathrm{Var}[X] = E[X^2] - \mu^2 = (-3)^2 \times 0.3 + (1)^2 \times 0.7 - (-0.2)^2 = 3.36$$

These are the values of the (usually unknown) population parameters. Let us now look at all samples of size 3. There are infinitely many such samples, but each contains one of only four distinct sets of observations, which we can tabulate as follows:

Sample observations    $\bar{x}$    $s^2$    P(sample)
-3, -3, -3             -3           0        $(0.3)^3 = 0.027$
-3, -3, 1              -5/3         32/6     $3(0.3)^2 \times 0.7 = 0.189$
-3, 1, 1               -1/3         32/6     $3(0.7)^2 \times 0.3 = 0.441$
1, 1, 1                1            0        $(0.7)^3 = 0.343$
Thus we see that $\bar{x}$ as an estimate of $\mu$ is 2.8 below the true value in 2.7% of samples, 1.2 above in 34.3% of samples, etc. Taking the average over all samples, or equivalently the expectation over the sampling distribution, we see that
$$E(\bar{x}) = -3 \times 0.027 - \frac{5}{3} \times 0.189 - \frac{1}{3} \times 0.441 + 1 \times 0.343 = -0.2 = \mu$$
exactly, confirming the first result of Theorem 1.
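Then
$$V[\bar{X}] = E[\bar{X}^2] - (E[\bar{X}])^2 = (-3)^2 \times 0.027 + \left(-\frac{5}{3}\right)^2 \times 0.189 + \left(-\frac{1}{3}\right)^2 \times 0.441 + (1)^2 \times 0.343 - (-0.2)^2 = 1.12 = \frac{\sigma^2}{3}$$
which confirms the second result. The third result is verified for this example by averaging over all possible values of $s^2$ thus:
$$E(s^2) = 0 \times 0.027 + \frac{32}{6} \times 0.189 + \frac{32}{6} \times 0.441 + 0 \times 0.343 = 3.36 = \sigma^2$$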
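The three checks in Example 6.1 can also be reproduced by brute force. The sketch below is an illustration written for this purpose, not taken from the notes: it enumerates all $2^3 = 8$ ordered samples of size 3, weights each by its probability, and averages $\bar{x}$, $\bar{x}^2$ and $s^2$.

```python
from itertools import product
from statistics import mean, variance

values = {-3: 0.3, 1: 0.7}    # population of Example 6.1: value -> probability
mu = sum(x * p for x, p in values.items())
sigma2 = sum(x * x * p for x, p in values.items()) - mu ** 2

n = 3
e_mean = e_mean_sq = e_s2 = 0.0
for sample in product(values, repeat=n):    # the 8 ordered samples of size 3
    prob = 1.0
    for x in sample:
        prob *= values[x]                   # independence: multiply probabilities
    xbar = mean(sample)
    e_mean += prob * xbar
    e_mean_sq += prob * xbar ** 2
    e_s2 += prob * variance(sample)         # sample variance, divisor n - 1

var_mean = e_mean_sq - e_mean ** 2
print(f"E(xbar) = {e_mean:.4f}   (mu = {mu:.4f})")
print(f"Var(xbar) = {var_mean:.4f}   (sigma^2/n = {sigma2 / n:.4f})")
print(f"E(s^2) = {e_s2:.4f}   (sigma^2 = {sigma2:.4f})")
```

Changing `n` or the probabilities in `values` gives a quick way to see that the three properties do not depend on the particular numbers used in the example.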

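Proof of Theorem 1
$$\bar{x} = \frac{1}{n}(x_1 + \cdots + x_n)$$
$$E[\bar{x}] = \frac{1}{n}\left(E[x_1] + \cdots + E[x_n]\right) = \frac{1}{n}(\mu + \cdots + \mu) = \frac{1}{n}(n\mu) = \mu$$
$$\mathrm{Var}[\bar{x}] = \frac{1}{n^2}\mathrm{Var}[x_1 + \cdots + x_n] = \frac{1}{n^2}\left(\mathrm{Var}[x_1] + \cdots + \mathrm{Var}[x_n]\right) = \frac{1}{n^2}(\sigma^2 + \cdots + \sigma^2) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

Note: the variance of the sum equals the sum of the variances because $x_1, \ldots, x_n$ are independent (random sample).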


The theorem shows that $\sigma/\sqrt{n}$ is the standard deviation of the sampling distribution of $\bar{x}$. A sample estimate of this variability is $s/\sqrt{n}$ and is called the (estimated) standard error of the (sample) mean.
Theorem 2 The Central Limit Theorem says that as $n \to \infty$ the sampling distribution of $\bar{x}$ tends to a Normal distribution with the same mean and variance (that is, mean $\mu$ and variance $\sigma^2/n$).

The importance of this result is that we do not need to know the form or type of the original population distribution if our sample size is sufficiently large. We can use instead the Normal distribution for statistical inference with the knowledge that the probabilities we calculate will be good approximations to the true (but generally unknown) probabilities:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{approximately for large } n.$$

Then using the properties of the Normal distribution we can say that for large n,
$$P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le u\right)$$
can be found approximately for any specified value u without knowing the original form of the population.
If, however, we do know the form of the population and it follows a Normal distribution, then for any sample size n > 1 it can be shown that
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
Thus
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
has a sampling distribution which is Standard Normal for any n (Table 4).
As the population standard deviation $\sigma$ is often unknown, replacing it by the corresponding sample quantity s changes the sampling distribution.
However, provided the underlying population is Normal, it can be shown that
$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$
has a Student's t-distribution with $\nu$ degrees of freedom, where $\nu = n - 1$ (named after W. S. Gosset, who took the pseudonym 'Student'). The percentiles of this distribution are given in Table 7.
Another distribution, which arises from random samples of Normal populations, is the chi-square distribution, whose percentage points are given in Table 8. It can be shown that
$$V = \frac{(n - 1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$
the chi-square distribution with n - 1 degrees of freedom, whatever the value of $\bar{X}$. Yet another distribution is the (Fisher) F-distribution, with percentage points in Table 9. The F and $\chi^2$ distributions are used for statistical inference on the variances of Normal populations as well as for wider application in Goodness-of-Fit tests.
Note:
Using SPSS it is possible to check these distributional results empirically by generating a sufficient number of random samples from a Normal population.
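The same kind of empirical check can be sketched outside SPSS. The Python fragment below is an illustration only: the population parameters, sample size, seed and number of replications are arbitrary choices, not values from the notes. It draws repeated Normal samples and compares the simulated mean and variance of $V = (n-1)s^2/\sigma^2$ with the values n - 1 and 2(n - 1) that a chi-square distribution with n - 1 degrees of freedom should have.

```python
import random
from statistics import mean, variance

random.seed(1)                  # fixed seed so that the run is repeatable
mu, sigma, n = 50.0, 5.0, 5     # hypothetical Normal population and sample size
reps = 20000                    # number of simulated samples

v_values = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    v_values.append((n - 1) * variance(sample) / sigma ** 2)

# A chi-square variable with n - 1 d.o.f. has mean n - 1 and variance 2(n - 1).
print(f"simulated mean of V     = {mean(v_values):.3f}  (theory {n - 1})")
print(f"simulated variance of V = {variance(v_values):.3f}  (theory {2 * (n - 1)})")
```

A histogram of `v_values` would give a fuller comparison with the chi-square density; only the first two moments are checked here to keep the sketch short.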

6.2 Exercises

1. The heights of 1000 students are approximately normally distributed with a mean of 174.5 cm and a standard deviation of 6.9 cm. If 200 random samples of size 25 are drawn from this population and the means recorded, determine
(a) the expected mean and standard deviation of the sampling distribution of the mean;
(b) the number of sample means that fall between 172.5 and 175.8 cm inclusive;
(c) the number of sample means that fall below 172 cm.

2. Show that the sample variance is unchanged if a constant is added to or subtracted from each value of the sample.

3. If the size of a sample is 36 and the standard error of the mean is 2, what must the size of the sample become if the standard error is to be reduced to 1.2?

4. The amount of time that a drive-through bank teller spends on a customer is a random variable with a mean $\mu = 3.2$ minutes and a standard deviation $\sigma = 1.6$ minutes. If a random sample of 64 customers is observed, find the probability that their mean time at the teller's counter is
(a) at most 2.7 minutes;
(b) more than 3.5 minutes;
(c) at least 3.2 minutes but less than 3.4 minutes.

5. If all possible samples of size 16 are drawn from a normal population with mean equal to 50 and standard deviation equal to 5, what is the probability that a sample mean $\bar{X}$ will fall in the interval from $\mu_{\bar{X}} - 1.9\sigma_{\bar{X}}$ to $\mu_{\bar{X}} - 0.4\sigma_{\bar{X}}$? Assume that the sample means can be measured to any degree of accuracy.

6.3 Further Exercises (Probability and statistics: Walpole, Myers and Myers 8th Edition)
1. Exercises - Page 275 - Questions 8.18 to 8.20


LECTURE 7

TEST OF HYPOTHESIS
7.1 Introduction to Hypothesis Testing
Definition 1 The null hypothesis H0 is a statement about the value of the parameter of interest. A simple
null hypothesis specifies the population distribution exactly. We examine the data to see whether they
support or provide evidence against the null hypothesis H0.
The alternative hypothesis H1 describes only the possibilities (there may be many) that we are prepared
to consider if H0 is not true.
Definition 2 The test statistic for $H_0$ versus $H_1$ is a random variable whose distribution is known (or approximately known) under $H_0$, that is, assuming $H_0$ to be true. The observed value of the test statistic can indicate departures from $H_0$ in favour of $H_1$.
Definition 3 The P-value gives the probability, under $H_0$, of observing a value of the test statistic at least as extreme as the value actually observed, where extreme values indicate departures from $H_0$ in favour of $H_1$. If the P-value is less than or equal to a chosen significance level $\alpha$, we say the result is statistically significant at level $\alpha$.
7.2 Procedure
Null and Alternative Hypotheses A clear statement of both should be given in terms of the population
parameter of interest, together with a short verbal interpretation.
Test Statistics: The formula in terms of sample statistics such as mean and standard deviation should
be stated with the (sampling) distribution under the null hypothesis. Then the observed value of the
test statistic should be calculated to at least three significant figures.
Assess evidence: The P-value should be used to form a verbal statement or conclusion regarding the
truth or otherwise of the null hypothesis. Finally a verbal interpretation of this conclusion should be
given for the non-statistician.
Depending on the conclusion reached (if any) the investigator may wish to quote a confidence interval
for the parameter at the desired level.
Example 7.1: Articles produced by a manufacturer should have mean length 4 cm and standard deviation 0.02 cm. A test sample of size 10 from a large batch of production has $\bar{x} = 4.01$. Is there evidence that the unknown mean length, $\mu$ say, of articles in the batch is unsatisfactory?

The null hypothesis is $H_0: \mu = 4$ (batch satisfactory), to be tested against the alternative $H_1: \mu \ne 4$ (batch unsatisfactory).
We need a test statistic whose distribution is known under the null hypothesis, i.e. assuming $H_0$ to be true. We know that in general for random samples from a Normal population
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
so, under $H_0$,
$$\bar{X} \sim N\left(4, \frac{(0.02)^2}{10}\right) \quad \text{and} \quad Z = \frac{\bar{X} - 4}{0.02/\sqrt{10}} \sim N(0, 1)$$
is standard Normal. Large values of Z (either positive or negative) indicate departures from $H_0$ in favour of $H_1$, and the observed value of Z is
$$z = \frac{4.01 - 4}{0.02/\sqrt{10}} = 1.58$$
So, the probability of observing a value of Z at least as extreme as this (the P-value) is
P(Z > 1.58) + P(Z < -1.58) = 2 x P(Z > 1.58) = 0.1141,
using the symmetry of the Normal distribution.
Thus there is an 11.4% chance of observing a sample result at least this extreme even if the batch is satisfactory. We therefore conclude that there is no evidence against the null hypothesis.
Note that this was a two-sided or two-tailed test as the alternative hypothesis is two-sided, namely $\mu \ne 4$. If there was a legal requirement of a maximum mean length of 4 cm, then we would not be concerned with the possibility that $\mu < 4$. Instead we would test $H_0: \mu = 4$ versus $H_1: \mu > 4$. We would ask whether there was sufficient evidence in the data to make us worry about failing the requirement, and the test statistic and observed value would be the same as before. Only large positive (and not negative) values of Z would indicate departures in favour of $H_1$, so the P-value is just P(Z > 1.58) = 0.057. Now we have slight evidence against $H_0$ in favour of $H_1$, i.e. slight evidence that the batch may fail to meet the legal requirement. This is called a one-tailed or one-sided test as the alternative hypothesis is one-sided, namely $\mu > 4$.
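The calculations of Example 7.1 can be sketched in a few lines of Python. This is an illustration rather than part of the notes; it uses `statistics.NormalDist` from the standard library in place of Table 4, and the variable names are arbitrary.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar = 4.0, 0.02, 10, 4.01    # values from Example 7.1

z = (xbar - mu0) / (sigma / sqrt(n))         # observed test statistic
std_normal = NormalDist()                    # N(0, 1)

p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))   # H1: mu != 4
p_one_sided = 1 - std_normal.cdf(z)              # H1: mu > 4

print(f"z = {z:.3f}")
print(f"two-sided P = {p_two_sided:.4f}, one-sided P = {p_one_sided:.4f}")
```

Because z is not rounded to 1.58 before the tail area is computed, the P-values may differ from the table-based figures in the last decimal place.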
However, the assumption that the population variance $\sigma^2$ is known is often unrealistic.
Example 7.2: A random sample of 5 men had a mean height $\bar{x}$ of 70 inches and a sample standard deviation s of 2 inches. Is there any evidence in these data against the (null) hypothesis that the mean of the population is 67 inches? To test $H_0: \mu = 67$ versus $H_1: \mu \ne 67$ we need a test statistic whose distribution is known under $H_0$. Such a statistic is
$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1} \quad \text{under } H_0,$$
that is, Student's t with 4 degrees of freedom. The observed value of T is $(70 - 67)/(2/\sqrt{5}) = 3.35$, so to calculate P we must refer this value to percentage points of the t-distribution with 4 degrees of freedom (d.o.f.). Now $t_4(0.025) = 2.776$ and $t_4(0.01) = 3.747$ lie on either side of our observed value, so the two-sided P-value lies between 0.02 and 0.05.
Alternatively we can use the $2\alpha$-values on the second row of Table 7 to arrive at the same answer. We cannot therefore say exactly what the probability of obtaining a value at least as extreme as the one observed is, but we can specify it within a suitable range and this is sufficient to enable us to conclude that there is moderate evidence against the null hypothesis. So even this small sample provides evidence.
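For Example 7.2 the P-value can be pinned down more closely than the tables allow. The sketch below is an illustration, not the method of the notes: it relies on the fact that the t-distribution with exactly 4 degrees of freedom has a closed-form distribution function, so the helper `t4_cdf` (a name introduced here) is valid for 4 d.o.f. only.

```python
from math import sqrt

def t4_cdf(t):
    """Distribution function of Student's t with exactly 4 degrees of freedom."""
    u = t / sqrt(1 + t * t / 4)
    return 0.5 + 0.375 * u * (1 - u * u / 12)

mu0, xbar, s, n = 67, 70, 2, 5               # values from Example 7.2

t_obs = (xbar - mu0) / (s / sqrt(n))         # observed test statistic
p_value = 2 * (1 - t4_cdf(abs(t_obs)))       # two-sided P-value

print(f"t = {t_obs:.3f}, two-sided P = {p_value:.4f}")
print(f"check: P(T <= 2.776) = {t4_cdf(2.776):.4f}")   # tabulated 97.5% point
```

For other degrees of freedom a statistics package or the printed table is needed; the closed form above does not generalise directly.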
7.3 Confidence Intervals for Hypothesis Testing
Often we may be asked to estimate the population mean, $\mu$, rather than testing a hypothesis about it. Or we may have performed a test and found evidence against the null hypothesis, casting doubt on our original hypothesised value. We can (and indeed must) give an estimate of uncertainty along with our best estimate of $\mu$, which is $\bar{x}$, the sample mean.
Whatever the value of $\mu$,
$$P\left[-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right] = 0.95,$$
cross multiplying we get
$$P[-1.96\,\sigma/\sqrt{n} \le \bar{X} - \mu \le 1.96\,\sigma/\sqrt{n}] = 0.95$$
Subtracting $\bar{X}$ gives
$$P[-\bar{X} - 1.96\,\sigma/\sqrt{n} \le -\mu \le -\bar{X} + 1.96\,\sigma/\sqrt{n}] = 0.95$$
and finally multiplying by -1 (which reverses the inequalities) gives
$$P[\bar{X} - 1.96\,\sigma/\sqrt{n} \le \mu \le \bar{X} + 1.96\,\sigma/\sqrt{n}] = 0.95$$
and this is true whatever the value of $\mu$, so we can say that the random interval $(\bar{X} - 1.96\,\sigma/\sqrt{n},\ \bar{X} + 1.96\,\sigma/\sqrt{n})$ has a probability of 0.95 of containing or covering the value of $\mu$; that is, 95% of all samples will give intervals (calculated according to this formula) which contain the true value of the population mean. This interval is called a 95% confidence interval for $\mu$. Note that the 95% refers to the procedure over repeated sampling: there is no guarantee that the interval calculated from any specific sample contains $\mu$.
In general a $100(1 - \alpha)\%$ confidence interval for $\mu$ is given by
$$\left[\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right]$$
where $z_{\alpha/2}$ denotes an upper percentage point of the standard Normal distribution, when $\sigma^2$ is known, and by
$$\left[\bar{x} \pm t_{n-1}(\alpha/2)\,\frac{s}{\sqrt{n}}\right]$$
when $\sigma^2$ is unknown.

Example 7.1 revisited: The above argument can be used for a 95% confidence interval as we are assuming that the population variance $\sigma^2$ is known. Thus
$$(\bar{X} - 1.96 \times 0.02/\sqrt{10},\ \bar{X} + 1.96 \times 0.02/\sqrt{10})$$
is a 95% confidence interval for $\mu$, and substituting the observed value $\bar{x} = 4.01$ we obtain (3.9976, 4.0224).
This interval includes 4 cm. Therefore, there is no evidence that the articles in the batch are unsatisfactory.
Example 7.2 revisited: When $\sigma^2$ is unknown, the random interval
$$(\bar{X} - 2.776\,s/\sqrt{5},\ \bar{X} + 2.776\,s/\sqrt{5})$$
contains $\mu$ with probability 0.95, as $t_4(0.025) = 2.776$ from Table 7, there being only four degrees of freedom with a sample of size 5. If we substitute the observed values of these statistics, the endpoints of this interval become
$$70 \pm 2.776 \times 2/\sqrt{5} = (67.517, 72.483).$$
The interval does not include 67 inches. Therefore, $H_0$ is rejected at the 5% level, and we conclude that the population mean is not 67 inches.
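Both intervals have the same form, estimate plus or minus multiplier times standard error, so one small helper covers them. The function name `mean_ci` is introduced here for illustration; the multipliers 1.96 and 2.776 are the table values quoted in the text.

```python
from math import sqrt

def mean_ci(xbar, spread, n, multiplier):
    """Interval xbar +/- multiplier * spread / sqrt(n).

    spread is sigma (known variance) or s (estimated variance);
    multiplier is the matching Normal or t percentage point.
    """
    half_width = multiplier * spread / sqrt(n)
    return xbar - half_width, xbar + half_width

ci_71 = mean_ci(4.01, 0.02, 10, 1.96)    # Example 7.1: sigma known, Normal point
ci_72 = mean_ci(70, 2, 5, 2.776)         # Example 7.2: s used, t point on 4 d.o.f.

print(f"Example 7.1: ({ci_71[0]:.4f}, {ci_71[1]:.4f})")
print(f"Example 7.2: ({ci_72[0]:.3f}, {ci_72[1]:.3f})")
```

Checking whether the hypothesised value (4 or 67) lies inside the printed interval reproduces the two-sided test decisions at the 5% level.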

7.4 Proportions
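When the data are counts rather than measurements, we make inferences about a proportion. The sample proportion $\hat{p} = X/n$ is an estimate of the unknown population proportion p. Since this relies on the large-sample Normal approximation to the Binomial distribution, we deal only with large samples.
Recall that if X is B(n, p) and n is large then X is approximately Normal with $\mu = np$ and $\sigma^2 = np(1 - p)$.
Equivalently, $\hat{p} = \bar{X}$ is the mean of n observations, each 0 or 1, from a population whose mean is the population proportion p and whose variance is $p(1 - p)$. Therefore $E(\hat{p}) = p$ (since $E(\bar{X})$ equals the population mean) and
$$V(\hat{p}) = \frac{p(1 - p)}{n}$$
(since $V(\bar{X})$ equals the population variance divided by n).
Using the central limit theorem,
$$\hat{p} \sim N\left(p, \frac{p(1 - p)}{n}\right) \quad \text{approximately.}$$
Since we are considering large samples and the variance is unknown, it is estimated by $\hat{p}(1 - \hat{p})/n$. Then
$$\hat{p} \sim N\left(p, \frac{\hat{p}(1 - \hat{p})}{n}\right) \quad \text{approximately.}$$
Hence the confidence interval is obtained from
$$P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = 1 - \alpha, \quad \text{where } Z = \frac{\hat{p} - p}{\sqrt{\hat{p}(1 - \hat{p})/n}}$$
$$P\left(-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{\hat{p}(1 - \hat{p})/n}} \le z_{\alpha/2}\right) = 1 - \alpha$$
$$P\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right) = 1 - \alpha$$
Therefore the $(1 - \alpha)100\%$ C.I. is
$$\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right)$$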
Example 7.3: As part of a quality improvement program, your mail-order company is studying the process of filling customer orders. According to company standards, an order is shipped on time if it is sent within 3 working days of the time it is received. You select an SRS of 100 of the 5000 orders received in the past month for an audit. The audit reveals that 86 of these orders were shipped on time. Find a 95% confidence interval for the true proportion of the month's orders that were shipped on time.
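One way to carry out the calculation for Example 7.3 is sketched below in Python. It applies the large-sample interval for a proportion to the audit figures; treating the sample as if drawn from an effectively infinite population (ignoring that 100 of only 5000 orders were sampled) is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist

x, n, conf = 86, 100, 0.95                       # audit data from Example 7.3

p_hat = x / n                                    # sample proportion
z = NormalDist().inv_cdf(1 - (1 - conf) / 2)     # upper alpha/2 point (about 1.96)
margin = z * sqrt(p_hat * (1 - p_hat) / n)       # margin of error
ci = (p_hat - margin, p_hat + margin)

print(f"p_hat = {p_hat:.2f}, margin of error = {margin:.4f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Changing `conf` to 0.90 or 0.99 shows how the interval narrows or widens with the confidence level.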


7.5 Sample Size


If $\hat{p}$ is used as an estimate of p we can be $(1 - \alpha) \times 100\%$ confident that the error will not exceed $z_{\alpha/2}\sqrt{\hat{p}(1 - \hat{p})/n}$. This is the margin of error, so we can find the required sample size for a given margin of error and given level of confidence, provided that we have an estimate of the proportion:
$$e = z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \quad \Rightarrow \quad n = \frac{z_{\alpha/2}^2\,\hat{p}(1 - \hat{p})}{e^2}$$
Example 7.4: A national opinion poll found that 44% of all American adults agree that parents should be given vouchers good for education at any public or private school of their choice. The results were based on a small sample. How large an SRS is required to obtain a margin of error of 0.03 (that is, 3%) in a 95% confidence interval?
If we have no idea what the proportion is, then in the formula $n = z_{\alpha/2}^2\,\hat{p}(1 - \hat{p})/e^2$ we must use the value of $\hat{p}$ which gives a maximum for n, so as to ensure that we meet the given specifications for margin of error and confidence level.
$$n = \frac{z_{\alpha/2}^2\,\hat{p}(1 - \hat{p})}{e^2}$$
$$\frac{dn}{d\hat{p}} = \frac{z_{\alpha/2}^2 (1 - 2\hat{p})}{e^2} = 0 \quad \text{for a maximum}$$
$$1 - 2\hat{p} = 0 \quad \Rightarrow \quad \hat{p} = \frac{1}{2}$$
$$\text{i.e.} \quad n = \frac{z_{\alpha/2}^2 (1/2)(1/2)}{e^2} = \frac{z_{\alpha/2}^2}{4e^2}$$
If you have some idea of p then use that as an estimate, as it will give a smaller sample size.
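The sample-size formula can be sketched as a small Python function. The name `sample_size` and its default of `p_guess=0.5` (the worst case derived above) are choices made for this illustration; the result is rounded up because n must be a whole number at least as large as the formula requires.

```python
from math import ceil
from statistics import NormalDist

def sample_size(margin, conf, p_guess=0.5):
    """Smallest n with z^2 * p(1 - p) / margin^2 <= n."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)

n_with_guess = sample_size(0.03, 0.95, p_guess=0.44)   # Example 7.4, using the 44%
n_worst_case = sample_size(0.03, 0.95)                 # no prior idea of p

print(n_with_guess, n_worst_case)
```

The two results differ only slightly here because 0.44 is close to 0.5; a guess further from 0.5 gives a larger saving.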

7.6 Hypothesis Test for Proportions


A hypothesis test for a proportion (when the sample is large) can be carried out using a normal approximation. This will be a good approximation provided n is large and p is not too close to 0 or 1.
If the null hypothesis is $H_0: p = p_0$ then the test statistic is
$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$
Note that $p_0$ is used in the standard error of the proportion, as the calculation is always performed assuming that $H_0$ is true.

Example 7.5: A commonly prescribed drug on the market for relieving nervous tension is believed to
be only 60% effective. Experimental results with a new drug administered to a random sample of 100
adults who were suffering from nervous tension showed that 70 received relief. Is this sufficient
evidence to conclude that the new drug is superior to the one commonly prescribed?
$H_0: p = 0.6$
$H_1: p > 0.6$
Under $H_0$,
$$z = \frac{0.7 - 0.6}{\sqrt{0.6(1 - 0.6)/100}} = 2.041$$

For z = 2.041 the one-sided P-value is about 0.02, which is less than 0.05. Therefore, we reject $H_0$ at the 5% significance level and conclude that the new drug is superior to the commonly prescribed drug.
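The test in Example 7.5 can be reproduced with a short Python sketch (an illustration, with arbitrary variable names). Note that the standard error is computed from the null value 0.6, not from the sample proportion.

```python
from math import sqrt
from statistics import NormalDist

p0, x, n = 0.6, 70, 100                 # null value and data from Example 7.5
p_hat = x / n

se0 = sqrt(p0 * (1 - p0) / n)           # standard error assuming H0 is true
z = (p_hat - p0) / se0
p_value = 1 - NormalDist().cdf(z)       # one-sided: H1 is p > 0.6

print(f"z = {z:.3f}, one-sided P = {p_value:.4f}")
```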
7.7 Exercises
1. For a random sample of size 7 from a Normal distribution with $\bar{x} = 3.47$,
(a) test the hypothesis that the population mean is 3, against a one-sided alternative that it is greater than 3, on the assumption that the population variance is 0.5;
(b) check this assumption if in fact $s^2 = 1$.
2. A soft-drink machine is regulated so that the amount of drink dispensed is approximately normally
distributed with a mean of 22 and a standard deviation of 1.5 ml. If a random sample of 36 drinks
had an average content of 22.5 ml, test whether the machine is functioning as expected.
3. It is claimed that an automobile is driven on the average less than 20,000 km per year. To test this
claim, a random sample of 100 automobile owners are asked to keep a record of the distance they
travel. Would you agree with this claim if the random sample showed an average of 23,500 km
and a standard deviation of 3900 km?
4. A random sample of 8 cigarettes of a certain brand has an average tar content of 18.6 mg and a standard deviation of 2.4 mg. Is this in line with the manufacturer's claim that the average tar content does not exceed 17.5 mg? Use a 0.01 level of significance and assume the distribution of tar contents to be normal.

5. The National Safety Council reported that 52 percent of American turnpike drivers are men. A sample of 300 cars traveling eastbound on the Ohio Turnpike yesterday revealed that 170 were driven by men. At the 0.01 significance level, can we conclude that a larger proportion of men were driving on the Ohio Turnpike than the national statistics indicate?
6. An English professor counted the number of misspelled words on an essay he recently assigned. For his class of 40 students, the mean number of misspelled words was 6.05 and the standard deviation 2.44. Construct a 95% confidence interval for the mean number of misspelled words in the population of students.
7. A manufacturer of compact disk players uses a comprehensive set of tests to assess the electrical function of its product. All compact disk players must pass all tests prior to being sold. A random sample of 500 disk players resulted in 15 failing one or more tests. Find a 90% confidence interval for the proportion of compact disk players from the population that pass all tests.
8. Compute a 98% confidence interval for the proportion of defective items in a process when it is found that a sample of size 100 yields 8 defectives.
9. A certain new rocket-launching system is being considered for deployment of small short-range launches. The existing system has p = 0.8 as the probability of a successful launch. A sample of 40 experimental launches is made with the new system and 34 are successful.
(a) Give a point estimate of the probability of a successful launch using the new system.
(b) Construct a 95% confidence interval for this probability.
(c) Does the evidence strongly indicate that the new system is better?
10. (a) A random sample of 500 cigarette smokers is selected and 86 are found to have a
preference for brand X. Find the 90% confidence interval for the fraction of the population of
cigarette smokers who prefer brand X.
(b) What can we assert with 90% confidence about the possible size of our error if we estimate
the fraction of cigarette smokers who prefer brand X to be 0.172?

7.8 Further Exercises (Probability and statistics: Walpole, Myers and Myers 8th Edition)
1. Exercises - Page 381 - Questions 10.19 to 10.23,
Page 390 - Questions 10.55 to 10.59


LECTURE 8
TYPE I AND TYPE II ERRORS
8.1 Introduction
A person is considered innocent until proved guilty. One can never be absolutely certain that the correct
decision has been made but the jury must be sure beyond reasonable doubt that a person is guilty
before passing judgment.
There are two types of errors possible.
Type I = decide guilty when in fact not guilty
Type II = decide not guilty when in fact guilty
These two types of errors are possible in hypothesis testing too.
8.2 Type I and II Errors
In statistical analysis, we will be concerned about keeping the probability of a Type I error, $\alpha$ = P{Rejecting $H_0$ | $H_0$ is true}, small.
However, for a fixed sample size, as $\alpha$ decreases the probability of a Type II error, $\beta$ = P{Accepting $H_0$ | $H_1$ is true}, increases.
An increase in the sample size will reduce both $\alpha$ and $\beta$.
The probability of a Type II error, $\beta$, is more difficult to calculate than the probability of a Type I error and is related to the power of the test, $1 - \beta$. A good test will have a low probability of committing a Type I error and a high power.
The power of a test is the probability of rejecting $H_0$ given that a specific alternative is true.
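These relationships can be illustrated numerically. The sketch below uses a setting chosen purely for illustration: the one-sided z-test of $H_0: \mu = 4$ against $H_1: \mu > 4$ with $\sigma = 0.02$ (as in Example 7.1), evaluated at the hypothetical specific alternative $\mu = 4.01$. The function name `type_ii_error` is introduced here and is not from the notes.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu1, sigma, n, alpha):
    """beta for the one-sided z-test of H0: mu = mu0 against H1: mu > mu0,
    evaluated at the specific alternative mu = mu1."""
    z_crit = NormalDist().inv_cdf(1 - alpha)      # reject H0 when Z > z_crit
    shift = (mu1 - mu0) / (sigma / sqrt(n))       # mean of Z when mu = mu1
    return NormalDist().cdf(z_crit - shift)       # P(do not reject | mu = mu1)

beta_n10 = type_ii_error(4.0, 4.01, 0.02, 10, alpha=0.05)
beta_n40 = type_ii_error(4.0, 4.01, 0.02, 40, alpha=0.05)
beta_n10_strict = type_ii_error(4.0, 4.01, 0.02, 10, alpha=0.01)

print(f"n = 10, alpha = 0.05: beta = {beta_n10:.3f}, power = {1 - beta_n10:.3f}")
print(f"n = 40, alpha = 0.05: beta = {beta_n40:.3f}, power = {1 - beta_n40:.3f}")
print(f"n = 10, alpha = 0.01: beta = {beta_n10_strict:.3f}")
```

With n fixed, tightening $\alpha$ from 0.05 to 0.01 raises $\beta$, while increasing n from 10 to 40 at the same $\alpha$ lowers it.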
8.3 Exercises
1. The proportion of adults living in a small town who are college graduates is estimated to be p = 0.3. To test this hypothesis, a random sample of 15 adults is selected. If the number of college graduates in our sample is anywhere from 2 to 7, we shall accept the null hypothesis that p = 0.3; otherwise, we shall conclude that $p \ne 0.3$. Evaluate $\alpha$ assuming p = 0.3. Evaluate $\beta$ for the alternatives p = 0.2 and p = 0.4. Is this a good test procedure?

2. In a large experiment to determine the success of a new drug, 400 patients with a certain disease are to be given the drug. If more than 300 but less than 340 patients are cured, we shall conclude that the drug is 80% effective. Find the probability of committing a Type I error. What is the probability of committing a Type II error if the new drug is only 70% effective?

3. A new cure has been developed for a certain type of cement that results in a compressive strength of 5000 kg/cm2 and a standard deviation of 120. To test the hypothesis that $\mu = 5000$ against the alternative that $\mu < 5000$, a random sample of 50 pieces of cement is tested. The critical region is defined to be a sample mean less than 4970. Find the probability of committing a Type I error. Evaluate $\beta$ for the alternatives $\mu = 4960$ and $\mu = 4970$.

4. A fabric manufacturer believes that the proportion of orders for raw material arriving late is p = 0.6. If a random sample of 10 orders shows that 3 or fewer arrived late, the hypothesis that p = 0.6 should be rejected in favor of the alternative p < 0.6. Use the binomial distribution.
(a) Find the probability of committing a Type I error if the true proportion is p = 0.6.
(b) Find the probability of committing a Type II error for the alternatives p = 0.3, p = 0.4, p = 0.5.
5. Repeat Exercise 4 when 50 orders are selected, and the critical region is defined to be x less than or equal to 24, where x is the number of orders in our sample that arrived late. Use the normal approximation.
8.4 Further Exercises (Probability and statistics: Walpole, Myers and Myers 8th Edition)
1. Exercises - Page 360 - Questions 10.1 to 10.5


LECTURE 9

FURTHER HYPOTHESES TESTS


9.1 Introduction
In Lecture 7, we studied hypothesis testing using examples involving a single mean or a single proportion. In this lecture, we compare two populations.
9.2 Comparison of Two Population Means
The two populations can be independent or paired. It is very important that you learn to recognise
their differences from a verbal description. Otherwise an inappropriate test may be applied.
The data may consist of

• Two independent random samples from two possibly different populations, or

• A single random sample of pairs of measurements, which could arise either from a random sample of individuals on each of which two possibly similar variables have been measured, or, for example, from a matched-pairs study with a random sample of pairs of similar individuals on which the same variable was measured.

Example 9.1: Twenty-eight heart-attack patients were measured for cholesterol level 2 days, 4 days
and 14 days after the attack and cholesterol level was also measured for a control group of 30 patients.
In comparing (the population means of) any measurement on the 28 patients with the control group we
have two independent samples, assumed to be random, but for the comparison (over time) of any two
measurements on the 28 patients, the paired structure clearly applies.
In general, we are testing the same hypothesis about two population means, denoted here $\mu_A$ and $\mu_B$:
$H_0: \mu_A = \mu_B$ versus $H_1: \mu_A \ne \mu_B$
or one-sided alternatives.

Notation
                           Population A    Population B
Population Mean            $\mu_A$         $\mu_B$         unknown
Population Variance        $\sigma_A^2$    $\sigma_B^2$    unknown
Population Distribution    Normal          Normal          (transformed)
Sample Size                $n_A$           $n_B$           both $\ge 2$
Sample Mean                $\bar{x}_A$     $\bar{x}_B$
Sample Variance            $s_A^2$         $s_B^2$

Example 9.2: Two random samples were independently drawn from two populations, A and B. Is there evidence in the following data to indicate a difference in the population means?

Sample                      A        B
Size                        6        5
$\sum_{i=1}^{n} x_i$        297      322
$\sum_{i=1}^{n} x_i^2$      16103    21978
Mean                        49.5     64.4
Variance                    280.3    310.3
S.E. Mean                   6.84     7.88

We can see that the observed difference in sample means is less than two standard errors: not a rigorous
test, but an indication that there is likely to be weak or no evidence against the null hypothesis of no
differences in population mean.
Assuming firstly (and perhaps unrealistically) that $\sigma_A^2$ and $\sigma_B^2$ are known, a test can be easily derived using the Normal distribution. As $\bar{X}_A \sim N(\mu_A, \sigma_A^2/n_A)$ and $\bar{X}_B \sim N(\mu_B, \sigma_B^2/n_B)$ are independent, $\bar{X}_A - \bar{X}_B \sim N(\mu_A - \mu_B,\ \sigma_A^2/n_A + \sigma_B^2/n_B)$, and under $H_0$ this is a known distribution with zero mean. A suitable test statistic is therefore
$$Z = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\dfrac{\sigma_A^2}{n_A} + \dfrac{\sigma_B^2}{n_B}}} \sim N(0, 1) \quad \text{under } H_0$$

If our assumed values for $\sigma_A$ and $\sigma_B$ are 18 and 15 respectively, then the observed value of Z is
$$z = \frac{49.5 - 64.4}{\sqrt{324/6 + 225/5}} = \frac{-14.9}{\sqrt{99}} = -1.4975$$

From Table 4, we obtain P(Z > 1.49) = 0.06811 and P(Z > 1.50) = 0.06681, so using linear interpolation P(Z > 1.4975) = 1/4 x 0.06811 + 3/4 x 0.06681 = 0.06713 approximately, and therefore P for a two-sided test is 2 x 0.06713 = 0.1342: no evidence for differences.
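The two-sample z calculation can be sketched in Python as follows. This is an illustration only; the standard deviations 18 and 15 are the assumed values from the text, and the Normal tail area is computed directly instead of by interpolating in Table 4.

```python
from math import sqrt
from statistics import NormalDist

n_a, n_b = 6, 5
mean_a, mean_b = 49.5, 64.4
sigma_a, sigma_b = 18, 15          # assumed (known) population standard deviations

se_diff = sqrt(sigma_a ** 2 / n_a + sigma_b ** 2 / n_b)   # sqrt(324/6 + 225/5)
z = (mean_a - mean_b) / se_diff
p_value = 2 * (1 - NormalDist().cdf(abs(z)))              # two-sided P-value

print(f"z = {z:.4f}, two-sided P = {p_value:.4f}")
```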
The assumption of known variances is clearly unrealistic yet can be tested. But, if there is any doubt
concerning our assumed values, we can proceed as follows.
Two Sample t-test with pooled variance estimator: This is based on the assumption that although the two population variances are unknown, they are in fact equal to each other, with a common value of $\sigma^2$, say. Then
$$Z = \frac{\bar{X}_A - \bar{X}_B}{\sigma\sqrt{1/n_A + 1/n_B}}$$
so we need only estimate $\sigma$. The appropriate estimator is a weighted average of the two unbiased estimators $s_A^2$ and $s_B^2$, called the pooled variance estimator $s_o^2$, where each estimate is weighted by its degrees of freedom:
$$s_o^2 = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}$$
It can be shown that replacing $\sigma$ by $s_o$ in Z gives a test statistic, T say, which has a t-distribution with $n_A + n_B - 2$ degrees of freedom (the sum of the two d.o.f.s).
From the data in the example,
$$s_o^2 = \frac{5(280.3) + 4(310.3)}{6 + 5 - 2} = 293.6$$
so $s_o = 17.14$. The observed value of T is therefore
$$t = \frac{49.5 - 64.4}{17.14\sqrt{1/6 + 1/5}} = -1.436$$
which we refer to Table 7, with $\nu = 6 + 5 - 2 = 9$ degrees of freedom.
For a one-sided alternative $H_1: \mu_A < \mu_B$, we are interested in large negative values of T. From the table $t_9(0.05) = 1.833$, so by symmetry the critical value is -1.833.
The observed value -1.436 is not in the rejection region, so $H_0$ is not rejected at the 5% level.
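The pooled calculation can be sketched in Python using the summary statistics of Example 9.2. The critical value 1.833 is taken from the text (Table 7); everything else is computed. Variable names are illustrative.

```python
from math import sqrt

n_a, n_b = 6, 5
mean_a, mean_b = 49.5, 64.4
var_a, var_b = 280.3, 310.3        # sample variances from Example 9.2

# Pooled variance: each sample variance weighted by its degrees of freedom.
df = n_a + n_b - 2
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df

t_obs = (mean_a - mean_b) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_crit = 1.833                     # t_9(0.05) as quoted in the text
reject = t_obs < -t_crit           # one-sided test of H1: mu_A < mu_B

print(f"pooled variance = {pooled_var:.1f}, pooled s.d. = {sqrt(pooled_var):.2f}")
print(f"t = {t_obs:.3f} on {df} d.o.f., reject H0 at 5%: {reject}")
```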

If we have no reason to believe that the population variances are different, then the above procedure is valid, but the assumption of equal variances should nevertheless be checked. So we test
$H_0': \sigma_A^2 = \sigma_B^2$ versus $H_1': \sigma_A^2 \ne \sigma_B^2$. Lecture 10 gives details about this test.
9.3 The Difference between Two Proportions
A confidence interval for the difference between two binomial parameters $p_1$ and $p_2$ can be calculated if we have two independent samples of size $n_1$ and $n_2$ respectively. Using the fact that $\hat{p}$ is approximately $N\left(p, \dfrac{p(1 - p)}{n}\right)$ and that the sum or difference of two independent normal random variables is also normal, we can state that $\hat{p}_1 - \hat{p}_2$ is approximately
$$N\left(p_1 - p_2,\ \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}\right)$$

A $(1 - \alpha) \times 100\%$ confidence interval for $p_1 - p_2$ is therefore given by
$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$

Example 9.3: A university financial aid office polled an SRS of undergraduate students to study their summer employment. Not all students were employed the previous summer. Here are the results for men and women:

                Men    Women
Employed        718    593
Not Employed     79    139
Total           797    732

Give a 99% confidence interval for the difference between the proportions of male and female students who were employed during the summer. Does the difference seem practically important to you?
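One way to compute the interval asked for in Example 9.3 is sketched below in Python, applying the large-sample interval for a difference of two proportions to the counts in the table. The variable names are illustrative, and the Normal percentage point is computed instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist

x_men, n_men = 718, 797            # employed / total, men
x_women, n_women = 593, 732        # employed / total, women
conf = 0.99

p1 = x_men / n_men
p2 = x_women / n_women
diff = p1 - p2

# Unpooled standard error, as used for the confidence interval.
se = sqrt(p1 * (1 - p1) / n_men + p2 * (1 - p2) / n_women)
z = NormalDist().inv_cdf(1 - (1 - conf) / 2)     # about 2.576 for 99%
ci = (diff - z * se, diff + z * se)

print(f"p1 = {p1:.4f}, p2 = {p2:.4f}, difference = {diff:.4f}")
print(f"99% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Whether a difference of this size is practically important is a judgement about the context, which the interval alone does not settle.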

A hypothesis test for the difference of two proportions will be carried out using the test statistic
$$z = \frac{\hat{p}_1 - \hat{p}_2 - (p_1 - p_2)}{\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}}$$
The null hypothesis in general will be $p_1 = p_2$. Remember that all calculations are done assuming that $H_0$ is true. Thus $p_1$ and $p_2$ can be replaced by a common proportion p, giving
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{p(1 - p)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
but p is not known, so it has to be replaced by the best estimate that can be obtained. This is called the pooled estimate of the proportion and is given by
$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$
where $x_1$ and $x_2$ are the numbers of successes in each of the two samples.
The test statistic, therefore, becomes
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

Example 9.4: Use the data for students employed during the summer to answer the following question.
Is there evidence that the proportion of male students employed during the summer differs from the
proportion of female students who were employed? State H o and H 1 , compute the test statistic, and
give its P-value.
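A sketch of the corresponding calculation in Python is given below, testing $H_0: p_1 = p_2$ against the two-sided alternative $H_1: p_1 \ne p_2$ with the pooled estimate of the common proportion. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 718, 797                  # men employed / men sampled
x2, n2 = 593, 732                  # women employed / women sampled

p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)   # pooled estimate of the common proportion

se0 = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))   # s.e. under H0
z = (p1 - p2) / se0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))                # two-sided

print(f"pooled p = {p_pooled:.4f}, z = {z:.2f}, P-value = {p_value:.2e}")
```

The pooled standard error is used here because the test is carried out assuming $H_0$ is true, whereas the confidence interval for $p_1 - p_2$ uses the separate sample proportions.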
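Summary: Inference on $\mu_A - \mu_B$

A. Both $\sigma_A^2$ and $\sigma_B^2$ known (or approximately known when both sample sizes $n_A$, $n_B$ are large, say $\ge 50$, using $s_A^2$ for $\sigma_A^2$ and $s_B^2$ for $\sigma_B^2$)

1. $H_0: \mu_A = \mu_B$ versus $H_1: \mu_A \ne \mu_B$

2. Test statistic
$$Z = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\sigma_A^2/n_A + \sigma_B^2/n_B}} \sim N(0, 1) \quad \text{under } H_0$$

3. P = 2 x P(Z > |z|) where z is the observed value (Table 4)

4. $100(1 - \alpha)\%$ Confidence Interval
$$\bar{x}_A - \bar{x}_B \pm z_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}$$
using Table 4 for the percentage points $z_{\alpha/2}$

B. Both $\sigma_A^2$ and $\sigma_B^2$ unknown, that is, the population variances are unknown but are assumed equal.

1. $H_0: \mu_A = \mu_B$ versus $H_1: \mu_A \ne \mu_B$

2. Test statistic
$$T = \frac{\bar{X}_A - \bar{X}_B}{s_o\sqrt{1/n_A + 1/n_B}} \sim t_{n_A + n_B - 2} \quad \text{under } H_0$$
where $s_o^2$ is the pooled estimate of the population variance given by
$$s_o^2 = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}$$

3. P = 2 x P(T > |t|) where t is the observed value (Table 7)

4. $100(1 - \alpha)\%$ Confidence Interval
$$\left[\bar{x}_A - \bar{x}_B \pm t_{n_A + n_B - 2}(\alpha/2)\, s_o\sqrt{1/n_A + 1/n_B}\right]$$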

9.4 Paired Samples


Example 9.5: Sixteen patients sampled at random were assigned as matched pairs to two treatments, treatment A being assigned to a random member of each pair. A response was measured and the data were:

A       B       X (difference)
14.0    13.2    +0.8
5.0     4.7     +0.3
8.6     9.0     -0.4
11.6    11.1    +0.5
12.1    12.2    -0.1
5.3     4.7     +0.6
8.9     8.7     +0.2
10.3    9.6     +0.7

If we had assumed that the two samples were independent, and performed a two-sample t-test, the
observed value of the test statistic would be 0.205 on 14 degrees of freedom so there would have been
certainly no evidence for difference in means (check this as an exercise). The large patient-to-patient
variability within each treatment group would have obscured or masked any difference in the means
(if indeed there were any differences.) But of course this (invalid) analysis ignores the valuable pairing
information.
As the parameter of interest is still the difference in population means $\mu_A - \mu_B$, we look at differences between the members of a pair. In this example, only two differences are negative, so do the data provide sufficient evidence against $H_0: \mu_A - \mu_B = 0$ in favour of $H_1: \mu_A - \mu_B \neq 0$? Note that the parameter of interest is equivalently the mean of the population of differences (defined on each pair).
So we test $H_0$ by a one-sample procedure based on the t-distribution (assuming the variance of the differences, $\sigma^2$ say, is unknown). Under $H_0$ and the Normality assumption (which can be checked by, say, a Normal probability plot of the differences), the differences are a random sample from a Normal population with zero mean and variance $\sigma^2$, so a suitable test statistic is
$$T = \frac{\bar{X}}{s/\sqrt{n}} \sim t_{n-1} \text{ under } H_0$$
where n is the number of pairs in the data, and the sample mean $\bar{X} = \bar{X}_A - \bar{X}_B$ and sample variance $s^2$ refer to the sample of differences X. Note that $s^2 \neq s_A^2 + s_B^2$ in general, because the two members of a pair are not independent.

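The observed value of T is
$$t = \frac{0.325}{0.413/\sqrt{8}} = 2.225$$
on 7 d.o.f., so referring to Table 7, we have P > 0.05, since the two-sided 5% critical value is 2.365. We conclude that there is no evidence at the 5% level to reject $H_0$. Even so, the statistic is far larger than the value of 0.205 given by the two-sample analysis, so by pairing we have achieved a more precise comparison of the two treatments.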
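The contrast between the two analyses of Example 9.5 can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `pooled_t` and `paired_t` are illustrative choices, not part of the course material.

```python
import math
from statistics import mean, variance

# Responses from Example 9.5 (eight matched pairs, treatments A and B).
a = [14.0, 5.0, 8.6, 11.6, 12.1, 5.3, 8.9, 10.3]
b = [13.2, 4.7, 9.0, 11.1, 12.2, 4.7, 8.7, 9.6]


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance (ignores the pairing)."""
    nx, ny = len(x), len(y)
    s0_sq = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    se = math.sqrt(s0_sq * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / se, nx + ny - 2


def paired_t(x, y):
    """One-sample t statistic computed on the within-pair differences."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    return mean(d) / math.sqrt(variance(d) / n), n - 1


t_ind, df_ind = pooled_t(a, b)
t_pair, df_pair = paired_t(a, b)
print(f"two-sample: t = {t_ind:.3f} on {df_ind} d.o.f.")
print(f"paired:     t = {t_pair:.3f} on {df_pair} d.o.f.")
```

Both statistics have the same numerator (0.325); only the standard error changes, which is why removing the patient-to-patient variability makes the paired statistic so much larger.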
A 95% confidence interval for $\mu_A - \mu_B$ is
$$\left[\bar{x}_A - \bar{x}_B \pm t_{n_A + n_B - 2}(\alpha/2)\, s_0\sqrt{1/n_A + 1/n_B}\right]$$
where α = 0.05. Using Table 4 and the sample data, we have
[−14.9 ± 17.14 × 2.262 × 0.6055] = [−14.9 ± 23.48] = (−38.4, 8.6). Note that this interval includes the value zero, confirming our previous result that there was no evidence for differences in the population means. We can think of any value within the interval as a plausible value, as it would not be rejected as a null hypothesis value with a significance level of α = 0.05.
Similarly for intervals under other assumptions (see the summary at the end of the last section).
In general the 100(1 − α)% confidence interval contains every value (and only those values) of the unknown parameter of interest such that the P-value is greater than or equal to α in a two-sided test of the null hypothesis that the unknown parameter takes that value.

9.5 Exercises

1.

Two machines A and B fill bottles with fluid. Six bottles were measured carefully from a large
production run for each machine. The results (in fluid ounces) were
A:
16.03 16.01 16.04 15.96 16.05 15.98
B:
16.02 16.03 15.97 16.04 15.96 16.01
Is there any evidence that the average amounts filled by the two machines differ? State any
assumptions you make. Perform a non-graphical test for one of these assumptions.

2. Ten patients who suffered from insomnia were examined in a medical study to determine the effect of a sedative. Each patient received both the sedative and a placebo (control drug) for a two-week period, the drugs being administered in random order, and there was a cooling-off period of one week in between the two two-week periods. Neither the patient nor the drug administrator knew which drug was being taken. The average number of hours of sleep per night was recorded for each patient for each drug and the results were:

Patient    1    2    3    4    5    6    7    8    9    10
Sedative   1.3  1.1  6.2  3.6  4.9  1.4  6.6  4.5  4.3  6.1
Placebo    0.6  1.1  2.5  2.8  2.9  3.0  3.2  4.7  5.5  6.2

(a) Is there any evidence of differences showing that the sedative has a beneficial effect on patients?
(b) Give a 99% confidence interval for the effect.
(c) State clearly the assumptions you make and include a check.

3. A sample of scores on an examination given in Statistics are:

Males     72  69  98  66  85  76  79  80  77
Females   81  67  90  78  81  80  76

At the 0.01 significance level, is the mean grade of the women higher than that of the men?
4. A nationwide sample of influential Republicans and Democrats was asked, as a part of a comprehensive survey, whether they favored lowering environmental standards so that high-sulfur coal could be burned in coal-fired power plants. The results were:

                   Republicans   Democrats
Number sampled     1000          800
Number in favor    200           168

At the 0.02 level of significance, can we conclude that there is a larger proportion of Democrats in favor of lowering the standards?
5.

The research department at the home office of New Hampshire Insurance conducts ongoing
research on the causes of automobile accidents, the characteristics of the drivers and so on. A
random sample of 400 policies written on single persons was selected. It was discovered that in
the previous three-year period, 120 of them had at least one accident. Similarly, a sample of 600
policies written on married persons revealed that 150 had been in at least one accident. At the 0.05
level, is there a significant difference in the proportions of single and married persons having an
accident during a three-year period?


6. A random sample of size 25 is taken from a normal population having a mean of 80 and a standard
deviation of 5. A second random sample of size 36 is taken from a different normal population
having a mean of 75 and standard deviation of 3. Find the probability that the sample mean
computed from the 25 measurements will exceed the sample mean computed from the 36
measurements by at least 3.4 but less than 5.9.

9.6 Further Exercises (Probability and Statistics: Walpole, Myers and Myers, 8th Edition)
1. Exercises, Page 321, Questions 9.35 to 9.40

LECTURE 10
INFERENCE FOR VARIANCE
10.1 Introduction
Questions about variability are very important in quality control, engineering and the sciences. If two
processes produce items with roughly equal means, we may be interested in the variability of the
process to distinguish between them. We can perform hypothesis testing to compare the variances of
the two populations. We can also construct a confidence interval for the variance of a single population or for the ratio of the variances of two populations.
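10.2 Confidence Interval for $\sigma^2$
Suppose we have a random sample from a Normal distribution and we want to make inferences about the variance $\sigma^2$ of that distribution.
A 100(1 − α)% confidence interval for $\sigma^2$ is given by
$$\left(\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}},\ \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}\right)$$
where $\chi^2_{n-1,\,p}$ is the point with area p to its right, so that we use both the upper and the lower percentage points of the chi-square distribution with (n − 1) degrees of freedom in Table 8. The larger (upper) percentage point gives the lower limit of the interval.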
Note: The chi-square distribution is not symmetrical and is always non-negative. The shape depends on the degrees of freedom ($\nu$). The mean of the distribution is $\nu$ and the variance is $2\nu$. Also, make sure that the data are Normally distributed, since the tests are valid only when the data are Normal or very close to Normal.
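Example 10.1: Let us calculate a 95% confidence interval for $\sigma^2$ when n = 5 and $s^2$ = 1.45. Since α = 0.05, $\chi^2_{4,\,0.975}$ = 0.484 and $\chi^2_{4,\,0.025}$ = 11.1433 (Table 8), so
$$\left(\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}},\ \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}\right) = \left(\frac{(5-1)\,1.45}{11.14},\ \frac{(5-1)\,1.45}{0.484}\right) = (0.52, 11.98)$$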
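The arithmetic of Example 10.1 can be reproduced with a short sketch. The helper name `variance_ci` is hypothetical, and the two chi-square percentage points are taken from the example rather than computed, so no statistics library is assumed.

```python
def variance_ci(n, s_sq, chi2_upper, chi2_lower):
    """Confidence interval for sigma^2 from a Normal sample.

    chi2_upper: point with area alpha/2 to its right (n - 1 d.o.f.)
    chi2_lower: point with area 1 - alpha/2 to its right (n - 1 d.o.f.)
    """
    ss = (n - 1) * s_sq
    # Dividing by the larger percentage point gives the lower limit.
    return ss / chi2_upper, ss / chi2_lower


# Values quoted in Example 10.1 (n = 5, s^2 = 1.45, 4 d.o.f., alpha = 0.05).
lower, upper = variance_ci(n=5, s_sq=1.45, chi2_upper=11.1433, chi2_lower=0.484)
print(f"95% CI for sigma^2: ({lower:.2f}, {upper:.2f})")
```

The interval is far from symmetric about $s^2$ = 1.45, which reflects the skewness of the chi-square distribution on only 4 degrees of freedom.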
10.3 Confidence Interval for the Ratio of Two Variances
If $s_1^2$ and $s_2^2$ are the variances of independent random samples of size $n_1$ and $n_2$ taken from Normal populations with variances $\sigma_1^2$ and $\sigma_2^2$ respectively, then
$$F = \frac{s_1^2/s_2^2}{\sigma_1^2/\sigma_2^2}$$
has an F-distribution with $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$ degrees of freedom.

Note: The F-distribution is non-negative, is not symmetrical and depends on $\nu_1$ and $\nu_2$ degrees of freedom (with the order being important). The table only gives F-values that leave an area such as 0.05 or 0.01 to the right. This is because
$$f_{\alpha}(\nu_1, \nu_2) = \frac{1}{f_{1-\alpha}(\nu_2, \nu_1)}$$
For example,
$$f_{0.99}(4,7) = \frac{1}{f_{0.01}(7,4)} = \frac{1}{14.976} = 0.067$$

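Then a (1 − α)100% confidence interval for the ratio $\sigma_1^2/\sigma_2^2$ is
$$\left(\frac{s_1^2}{s_2^2\, f_{\alpha/2}(\nu_1, \nu_2)},\ \frac{s_1^2}{s_2^2\, f_{1-\alpha/2}(\nu_1, \nu_2)}\right)$$
where, by the relation above, the upper limit can equivalently be written as $(s_1^2/s_2^2)\, f_{\alpha/2}(\nu_2, \nu_1)$.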

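Example 10.2: Let us calculate a 95% confidence interval for $\sigma_1^2/\sigma_2^2$ using $s_1^2$ = 1.33 and $s_2^2$ = 0.556, with $n_1 = n_2 = 10$. Here $f_{0.025}(9,9)$ = 4.03, so the interval is
$$\left(\frac{1}{4.03} \times \frac{1.33}{0.556},\ 4.03 \times \frac{1.33}{0.556}\right) = (0.59, 9.64)$$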
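The interval in Example 10.2 can be checked with a few lines of Python. This is a sketch only: the function name and argument names are hypothetical, and the F percentage point 4.03 is taken from the example rather than computed.

```python
def variance_ratio_ci(s1_sq, s2_sq, f_upper_12, f_upper_21):
    """Confidence interval for sigma1^2 / sigma2^2.

    f_upper_12: upper alpha/2 point of F(nu1, nu2)
    f_upper_21: upper alpha/2 point of F(nu2, nu1)
    """
    ratio = s1_sq / s2_sq
    # Lower limit divides by f(nu1, nu2); upper limit multiplies by f(nu2, nu1).
    return ratio / f_upper_12, ratio * f_upper_21


# Example 10.2: n1 = n2 = 10, so both percentage points are f_0.025(9, 9) = 4.03.
lower, upper = variance_ratio_ci(1.33, 0.556, f_upper_12=4.03, f_upper_21=4.03)
print(f"95% CI for the variance ratio: ({lower:.2f}, {upper:.2f})")
```

Because the interval contains 1, the data are consistent with equal population variances at the 5% level.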

10.4 Significance Test of Hypotheses about a Variance


1. The hypotheses are $H_0: \sigma^2 = \sigma_0^2$ and $H_A: \sigma^2 \neq \sigma_0^2$, where $\sigma_0^2$ is a specified number.
2. The test statistic is $(n-1)s^2/\sigma_0^2$.
3. Assume that we have a random sample from a Normal distribution with variance $\sigma^2$. Then, under the null hypothesis, the test statistic has the $\chi^2$ distribution with (n − 1) degrees of freedom.
4. Select significance level α.
5. Let X denote a random variable having the $\chi^2$ distribution with (n − 1) degrees of freedom. Find $c_1$ and $c_2$ from Table 8 such that $P(X < c_1) = \alpha/2$ and $P(X > c_2) = \alpha/2$. The acceptance region is the interval $(c_1, c_2)$.
6. The decision rule is:
If $c_1$ < test statistic < $c_2$, the results are consistent with the null hypothesis that the population variance equals $\sigma_0^2$. Therefore, $H_0$ is not rejected.
Otherwise, the results are inconsistent with the null hypothesis. Therefore, $H_0$ is rejected.
Example 10.3: A sample of size n = 11 gives $s^2$ = 1.5. Does this provide evidence against $H_0: \sigma^2 = 1$ versus $H_1: \sigma^2 > 1$?
The test statistic is
$$V = \frac{10 s^2}{1} = 15$$
and we refer the observed value of 15 to the percentage points of the $\chi^2_{10}$ distribution given in Table 8.
The P-value is P(V > 15) because large values of V indicate departures from $H_0$ in favour of $H_1$. Reading the values from the table on either side of the observed value, we get $\chi^2_{10}(0.025)$ = 20.4832 and $\chi^2_{10}(0.975)$ = 3.24697, so $H_0$ is not rejected.
There is no evidence to reject $\sigma^2 = 1$.
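When the degrees of freedom are even, the upper tail of the chi-square distribution has a simple closed form, $P(V > x) = e^{-x/2}\sum_{j=0}^{\nu/2 - 1} (x/2)^j / j!$, so the P-value of Example 10.3 can be evaluated without tables. The sketch below assumes that identity; the function name `chi2_sf_even_df` is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """P(V > x) for a chi-square variable with an even number of d.o.f."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    # Finite sum of (x/2)^j / j! for j = 0, ..., df/2 - 1.
    total = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * total


# Example 10.3: n = 11, s^2 = 1.5, H0: sigma^2 = 1 against H1: sigma^2 > 1.
n, s_sq, sigma0_sq = 11, 1.5, 1.0
v = (n - 1) * s_sq / sigma0_sq
p_value = chi2_sf_even_df(v, n - 1)
print(f"V = {v:.1f}, P(V > {v:.0f}) = {p_value:.3f}")
```

The one-sided P-value is well above 0.05, in agreement with the conclusion that $H_0$ is not rejected.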

10.5 Significance Test of Hypotheses about two Variances


1. The hypotheses are $H_0: \sigma_1^2 = \sigma_2^2$ and $H_A: \sigma_1^2 \neq \sigma_2^2$, where $\sigma_1^2$ and $\sigma_2^2$ denote the variances of the two populations sampled.
2. Test statistic = (larger sample variance) / (smaller sample variance).
3. Assume that we have two independent random samples from two Normal distributions. The first sample is of size $n_1$ and comes from a Normal distribution with variance $\sigma_1^2$. The second sample is of size $n_2$ and comes from a Normal distribution with variance $\sigma_2^2$. Let $s_1^2$ denote the first sample variance and $s_2^2$ the second sample variance.
4. Select significance level α.
5. Suppose $s_1^2$ is the larger sample variance, so the test statistic is $s_1^2/s_2^2$. Let F denote a random variable having the $F(n_1 - 1, n_2 - 1)$ distribution. Find the number c from Table 9 such that $P(F < c) = 1 - \alpha/2$.
6. The decision rule is:
If test statistic < c, the results are consistent with the null hypothesis that the two population variances are equal.
Otherwise, the results are inconsistent with the null hypothesis. Therefore, $H_0$ is rejected.

Example 10.4: Let us test the hypothesis that the two variances are the same, where $s_1^2$ = 1.33 and $s_2^2$ = 0.556, with $n_1 = n_2 = 10$.
Test statistic = 1.33/0.556 = 2.4.
For α = 0.05 the critical value c is the upper 2.5% point of F(9,9), which is 4.03.
2.4 < 4.03. Therefore, we have no reason to doubt that the variation is similar in the two populations.

10.6 Exercises
1. Find k for the following.
(a) $P(\chi^2_5 < k) = 0.95$
(b) $P(\chi^2_{10} < 23.2) = k$
(c) $P(F_{7,7} < k) = 0.9$
(d) $P(F_{10,15} > k) = 0.975$
(e) $P(F_{7,3} < 5.27) = k$
2. For the following data:

96   98   99   100  103  103  104  104  105  105
106  106  107  108  108  108  110  113  114  114

(a) Test the null hypothesis that the variance is 100.
(b) Calculate a 95% confidence interval for the variance.
(c) Calculate a 99% confidence interval for the variance.
(d) Comment on your results in (b) and (c).

3. An experiment was carried out by two analysts and 8 independent readings were obtained from each, as given below.

Analyst 1:  174  173  173  173  171  172  173  173
Analyst 2:  173  173  172  173  171  172  171  172

(a) Plot the observations.
(b) Test the null hypothesis that the variance of readings is the same for the two analysts.
(c) Calculate a 95% confidence interval for the ratio of the variances for the two analysts.
(d) Interpret the answer obtained in (c).

10.7 Further Exercises (Probability and Statistics: Walpole, Myers and Myers, 8th Edition)
1. Exercises, Page 289, Questions 8.39 to 8.42

LECTURE 11
CHI-SQUARED TEST
11.1 Goodness-of-fit Test

The aim is to test whether a population has a specified distribution. The question we are attempting to answer here is: Do the data fit the assumed or postulated distribution?
Example 11.1: In 116 randomly selected families with two children, 42 have no girls, 52 have one girl
and only 22 have two girls. Assuming births of either sex are equally likely, do these data conflict with
the hypothesis that the sexes of successive births are independent? If the hypothesis is true, then the
number of girls in any family of two children follows a Binomial distribution with parameters n0 = 2
and p = 1/2 (we reserve n for the sample size, the number of families, here 116). We can construct a
table of frequencies:

Number of girls k          0          1                2          Total
Observed frequency O_k     42         52               22         116
Probability (under H0)     (1/2)^2    2 x 1/2 x 1/2    (1/2)^2    1
Expected frequency E_k     29         58               29         116

Here, the expected frequencies are obtained by multiplying the total frequency or sample size by the
probability under H0. They need not be integers.
Do the discrepancies between observed and expected frequencies provide sufficient evidence to cast
doubt on H0?
One of two possible test statistics studied here, and the commonest one, is Pearson's $X^2$ statistic:
$$X^2 = \sum_k \frac{(O_k - E_k)^2}{E_k}$$
The statistic is also called, somewhat confusingly, (Pearson's) Chi-Square(d). Clearly, large values of $X^2$ indicate departures from $H_0$. We can show that, for large n, $X^2$ approximately follows a chi-square (sampling) distribution with degrees of freedom equal to the number of classes minus one.

Example 11.1 revisited:
$$X^2 = \frac{(42 - 29)^2}{29} + \frac{(52 - 58)^2}{58} + \frac{(22 - 29)^2}{29} = 8.14$$

Referring this observed value to Table 8, we see that $\chi^2_2(0.05)$ = 5.99, so P, which is as usual the probability of obtaining a value at least as extreme as the one actually observed, lies in the range P < 0.05. Thus there is evidence against the independence of the sexes of successive births, if we assume that births of either sex are equally likely.
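The goodness-of-fit calculation for Example 11.1 can be sketched in a few lines. The variable names are illustrative; the P-value line relies on the fact that a chi-square distribution with 2 degrees of freedom has upper tail $e^{-x/2}$, so no table or statistics library is needed for this particular case.

```python
import math

# Example 11.1: families with 0, 1 and 2 girls, and Binomial(2, 1/2) probabilities.
observed = [42, 52, 22]
probs = [0.25, 0.5, 0.25]

n = sum(observed)
expected = [n * p for p in probs]

# Pearson's X^2 = sum over classes of (O - E)^2 / E.
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Exact upper-tail probability for chi-square with 2 d.o.f. only.
p_value = math.exp(-x2 / 2)

print(f"expected = {expected}")
print(f"X^2 = {x2:.2f} on {df} d.o.f., P = {p_value:.3f}")
```

The computed P-value is below 0.05, consistent with comparing $X^2$ against the tabulated 5% point of 5.99.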

Example 11.2: The number of accidents in a month is observed over a period of ten years (Table 12.1). If these accidents occur randomly and at a uniform rate, then the data should follow a Poisson distribution.

Table 12.1
No. of accidents   Observed freq. O_k   Probability   Expected freq. E_k
0                  41                   0.30119       36.14
1                  40                   0.36144       43.37
2                  22                   0.21686       26.02
3                  10                   0.08674       10.41
4                  6                    0.02602       3.12
5                  0                    0.00625       0.75
6                  1                    0.00125       0.15
≥7                 0                    0.00025       0.03
Total              120                  1.00000       120

The above table contains both the observed data and the probabilities and expected frequencies calculated under the Poisson assumption. The unknown population parameter, which is the mean number of accidents per month, is estimated by
$$\hat{\lambda} = \frac{(41 \times 0) + (40 \times 1) + \cdots + (0 \times 7)}{120} = \frac{144}{120} = 1.20$$
So, we can use Table 2 to calculate the probabilities in the third column.
E.g. P(X = 0) = 1.00000 − 0.69881 = 0.30119
The observed value of Pearson's $X^2$ is then referred to Table 8.
There is, however, a further complication regarding the approximation of the sampling distribution of
X2 by the chi-square distribution. The approximation for large n is valid only if the expected
frequencies are sufficiently large. Just how large depends on the accuracy of approximation desired,
the sample size and the distribution under test. As we require a probability (the P-value) accurate to
say, only three significant figures, we will adopt two alternative conditions, either of which must be
satisfied to make the approximation valid.

The first condition states that all expected frequencies must be 5 or more and that to satisfy it,
merging of (neighbouring) classes may be necessary.


Example 11.2 revisited:
To satisfy the first condition, we must merge the last five classes so that the last class is 3 or more with $O_k$ = 17 and $E_k$ = 14.46. Thus
$$X^2 = \frac{(41 - 36.14)^2}{36.14} + \frac{(40 - 43.37)^2}{43.37} + \frac{(22 - 26.02)^2}{26.02} + \frac{(17 - 14.46)^2}{14.46} = 1.98$$

The degrees of freedom of the appropriate chi-square distribution is given by:
$$\nu = \text{no. of classes} - \text{no. of parameters estimated} - 1$$
where the classes are counted after merging.
Example 11.2 revisited: $\nu = 4 - 1 - 1 = 2$. Table 8 gives $\chi^2_2(0.05)$ = 5.99, so P > 0.05 and there is no evidence against the null hypothesis of a Poisson distribution, that is, no evidence against the hypothesis that accidents occur randomly.
The second, less stringent condition states that not more than 20% of all the $E_k$ can be less than 5, and that no $E_k$ may be less than 1.
Example 11.2 revisited: To satisfy the second condition, we need only merge the last four classes so the last class is now 4 or more with $O_k$ = 7 and $E_k$ = 4.05 < 5, and this class represents exactly 20% of the new number of classes. Now $X^2$ = 3.70 on $\nu = 5 - 1 - 1 = 3$ degrees of freedom, and $\chi^2_3(0.05)$ = 7.81, so again P > 0.05 and there is no evidence against randomness.
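The whole of Example 11.2 (estimating the mean, computing expected frequencies, merging the tail classes and forming $X^2$) can be sketched as follows. The helper names `merge_tail` and `pearson_x2` are hypothetical. The code works with unrounded expected frequencies, so its results can differ in the second decimal place from hand calculations based on the rounded table entries.

```python
import math

# Observed monthly counts for 0, 1, ..., 6 and "7 or more" accidents (Table 12.1).
observed = [41, 40, 22, 10, 6, 0, 1, 0]
n = sum(observed)

# Estimate the Poisson mean; the open-ended last class is counted as exactly 7,
# which is harmless here because its observed frequency is 0.
lam = sum(k * o for k, o in enumerate(observed)) / n

# Poisson probabilities for 0..6, with the remainder assigned to "7 or more".
probs = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(7)]
probs.append(1.0 - sum(probs))
expected = [n * p for p in probs]


def merge_tail(values, start):
    """Pool every class from index `start` onwards into one final class."""
    return values[:start] + [sum(values[start:])]


def pearson_x2(obs, exp):
    """Pearson's X^2 = sum of (O - E)^2 / E over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


results = {}
for start in (3, 4):  # last class "3 or more", then "4 or more"
    obs_m, exp_m = merge_tail(observed, start), merge_tail(expected, start)
    df = len(obs_m) - 1 - 1  # classes - estimated parameters - 1
    results[start] = (pearson_x2(obs_m, exp_m), df)
    print(f"last class {start}+: X^2 = {results[start][0]:.2f} on {df} d.o.f.")
```

Under either merging rule the statistic stays well below the tabulated 5% point, so the choice of condition does not affect the conclusion here.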

11.2 Test for Homogeneity


Here we ask: Is there any evidence in the data for association between two categorical variables? This
question is answered by a (chi-square) test of Independence but the tests of Homogeneity and
Similarity are formally identical so we will deal with all three together.

Example 11.3: A survey of smoking habits was carried out with 50 males and 40 females randomly selected (Table 12.2).

Sex      Non-smoker   Light-smoker   Heavy-smoker   Total
Male     16           20             14             50
Female   24           10             6              40
Total    40           30             20             90

Is there evidence of differences between the sexes? We are comparing two distributions (over smoking habits) so the test is one of SIMILARITY. The null hypothesis is that the population proportions of males and females in each smoking category are the same.
Example 11.4: In a study of migrant birds, nestlings were ringed in four different locations A to D. One year later, birds were recaptured at each location and the number of ringed birds noted. The data were:

                A     B     C     D     Total
Recovered       30    75    24    31    160
Not recovered   150   225   63    202   640
Total           180   300   87    233   800

Is there evidence for differences in the four recovery rates? We are comparing four proportions so the
test is one of HOMOGENEITY. The null hypothesis is that the proportion of recovered birds is the
same for the four locations.

Example 11.5: Two hundred and twenty seven (227) randomly selected males were classified by eye and hair colour:

                      Hair colour
Eye colour    Red/Fair   Brown   Black   Total
Blue          65         26      8       99
Grey/green    32         41      24      97
Brown         5          16      10      31

Is there evidence for association, or lack of independence, between the two factors (at three levels)?
Do the proportions (or probabilities) of the three eye colours differ among the sub-populations
comprising the three hair colours? Equivalently do the proportions (or probabilities) of the three hair
colours differ among the three eye colours? This is a test of INDEPENDENCE, sometimes
(confusingly) called a test of association. The null hypothesis is that for each pair of eye and hair
colours
P(eye colour and hair colour) = P(eye colour) x P(hair colour)
The data from all three of the above examples have the general form of an r x c contingency table with r rows, c columns (both with appropriate labels), row and column totals, and the observed frequencies $O_k$ in the cells of the table. The overall sample size appears as an overall total in the bottom right-hand corner. In order to use Pearson's $X^2$ or the log-likelihood ratio statistic $Y^2$, we need to calculate expected frequencies $E_k$ under the null hypotheses.
In each case we use
$$E_k = \frac{\text{row total} \times \text{column total}}{\text{overall sample size } (n)}$$

Example 11.3 revisited:
The method of sampling fixes the row totals, and under $H_0$
$$\frac{E_k}{\text{Row Total}} = \frac{\text{Column Total}}{n}$$
i.e. the row proportions or probabilities are the same for each row.

The table of expected frequencies is

Sex      Non-smoker   Light-smoker   Heavy-smoker   Total
Male     22.2         16.7           11.1           50
Female   17.8         13.3           8.9            40
Total    40.0         30.0           20.0           90

For example, the top left-hand cell is 22.2 = (50 x 40)/90. The observed value of $X^2$ is (summing over cells reading across rows)
1.73 + 0.65 + 0.76 + 2.16 + 0.82 + 0.94 = 7.06, which we refer to Table 8 with
$\nu = (2 - 1) \times (3 - 1) = 2$: $\chi^2_2(0.05)$ = 5.99146, so P < 0.05 and there is evidence that smoking habits differ between the sexes.
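The expected frequencies and Pearson's $X^2$ for the smoking survey of Example 11.3 can be computed directly from the row and column totals. The sketch below is illustrative only. It uses unrounded expected frequencies, which gives $X^2$ of about 7.11 rather than the 7.06 obtained from expected values rounded to one decimal place; the conclusion is the same.

```python
# Observed frequencies from Example 11.3 (rows: male, female;
# columns: non-smoker, light smoker, heavy smoker).
observed = [
    [16, 20, 14],
    [24, 10, 6],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# E = (row total x column total) / n for every cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson's X^2 summed over all cells.
x2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

for exp_row in expected:
    print([round(e, 1) for e in exp_row])
print(f"X^2 = {x2:.2f} on {df} d.o.f.")
```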

Example 11.4 revisited:
The method of sampling fixes the column totals, and under $H_0$
$$\frac{E_k}{\text{Column Total}} = \frac{\text{Row Total}}{n}$$
i.e. the column proportions are the same for each column.

Example 11.5 revisited:
The method of sampling fixes only the overall sample size, and under $H_0$
$$\frac{E_k}{n} = \frac{\text{Row Total}}{n} \times \frac{\text{Column Total}}{n}$$
For large n, the distribution of $X^2$ and $Y^2$ is approximately chi-square with degrees of freedom determined by
$$\nu = (\text{no. of rows} - 1)(\text{no. of cols.} - 1)$$
As with goodness-of-fit tests, the approximation is valid only if either Condition A or Condition B holds. If merging is necessary, we should merge only complete rows or columns.

11.3 Continuity Correction


For a 2 x 2 table, a better approximation (of the distribution of $X^2$ to a $\chi^2$ distribution) is achieved by using a continuity correction:
$$X^2 = \frac{n\left(|ad - bc| - \tfrac{1}{2}n\right)^2}{(a + b)(c + d)(a + c)(b + d)}$$
where a, b are the cell frequencies in the first row, c, d those in the second row, and n = a + b + c + d.

Example 11.3 revisited: If we collapse the original 2 x 3 table into a 2 x 2 table

Sex      Non-smoker   Smoker   Total
Male     16           34       50
Female   24           16       40
Total    40           50       90

then, using the formula with continuity correction, $X^2$ = 5.97, which, when referred to $\chi^2_1(0.05)$ = 3.84, gives P < 0.05, that is, strong evidence of a difference between the sexes. The differences originally observed seem to be due to the incidence of smoking rather than the amount of it, once an individual is a smoker.
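As a numerical check on the collapsed 2 x 2 table, the sketch below evaluates the statistic with and without the continuity correction. It assumes the standard Yates form, in which n/2 is subtracted from |ad − bc| before squaring; the function name `chi2_2x2` is hypothetical.

```python
def chi2_2x2(a, b, c, d, yates=True):
    """Pearson's X^2 for a 2 x 2 table with rows (a, b) and (c, d).

    With yates=True, n/2 is subtracted from |ad - bc| (continuity correction).
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2, 0.0)  # never let the correction overshoot zero
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Collapsed table from Example 11.3: male (16, 34), female (24, 16).
x2_corrected = chi2_2x2(16, 34, 24, 16)
x2_plain = chi2_2x2(16, 34, 24, 16, yates=False)
print(f"with correction:    X^2 = {x2_corrected:.2f}")
print(f"without correction: X^2 = {x2_plain:.2f}")
```

The correction always reduces the statistic, making the test slightly more conservative; both values here exceed the 5% point of 3.84 on 1 degree of freedom.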


11.3 Exercises
1. Four seeds were planted in each of one hundred pots under identical conditions, as part of an
experimental investigation into seed germination. After a fixed period of time the number of seeds
germinating in each pot was noted and the frequency table was as follows:
No. of seeds germinating   0    1    2    3    4
Number of pots             12   24   39   22   3
If seeds germinate independently under these conditions then the number germinating should
follow a Binomial distribution. Test this hypothesis using a goodness-of-fit statistic. Use an
alternative statistic and comment on the results of both your tests.
2.
(a) In the migrant birds study (Example 11.4), test the hypothesis that the probability of recovering a ringed bird after one year is constant over the four different locations.
(b) Test the hypothesis that eye and hair colour are independent characteristics in the survey of British males (Example 11.5).

3. In an experiment to study the dependence of hypertension on smoking habits, the following data were taken on 180 individuals.

                   Non-smokers   Moderate smokers   Heavy smokers
Hypertension       21            36                 30
No hypertension    48            26                 19

Test the hypothesis that the presence or absence of hypertension is independent of smoking habits. Use a 0.05 level of significance.
4. To determine the current attitudes about prayers in public schools, a survey was conducted in 4 Virginia counties. The following table gives the attitudes of 200 parents from Craig County, 150 parents from Giles County, 100 parents from Franklin County and 100 parents from Montgomery County:

                        County
Attitude      Craig   Giles   Franklin   Montgomery
Favor         65      66      40         34
Oppose        42      30      33         42
No opinion    93      54      27         24

Test the homogeneity of attitude among the 4 counties concerning prayer in the public schools. Use a P-value in your conclusion.
5. A random sample of 200 married men, all retired, was classified according to education and number of children:

                   No. of children
Education      0-1    2-3    Over 3
Elementary     14     37     32
Secondary      19     42     17
College        12     17     10

Test the hypothesis, at the 0.05 level of significance, that the size of a family is independent of the level of education attained by the father.
6. A survey was conducted in two Virginia cities to determine voter sentiment for two gubernatorial candidates in an upcoming election. Five hundred voters were randomly selected from each city and the following data were recorded.

                          City
Voter sentiment    Richmond   Norfolk
Favor A            204        225
Favor B            211        198
Undecided          85         77

At the 0.05 level of significance, test the null hypothesis that the proportions of voters favouring candidate A, candidate B, or undecided are the same for each city.
7. The marketing director for a metropolitan daily newspaper is studying the relationship between the type of community the reader lives in and the portion of the paper he or she reads first. For a sample of readers the following information was collected.

          National News   Sports   Comics
Urban     170             124      90
Rural     120             112      100
Farm      130             90       88

At the 0.05 significance level, can we conclude there is a relationship between the type of community where the person resides and the portion of the paper read first?
11.4 Further Exercises (Probability and Statistics: Walpole, Myers and Myers)
1. Exercises, Page 353, Questions 12 to 16

LECTURE 12
REGRESSION ANALYSIS
12.1 Introduction
We have already seen in Lecture 1 how to present data of this form (using MINITAB) and make a suitable comment. This descriptive analysis is now supplemented with more formal techniques, in particular estimation and hypothesis testing for the parameters of a linear relationship which is defined below.
The data consist of n pairs of measurements on two variables X and Y:
$(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots, (x_n, y_n)$
These data can have arisen from a random sample of n individuals from a population, or from an experiment in which one variable, conventionally X, is held fixed or controlled at certain chosen levels and independent measurements of the response variable, conventionally Y, are taken at each of these levels.
Example 12.1: Ten males were randomly selected from a well-specified population and their weight and height were measured:

Height (X)   63    71    72    68    75    66    68    76    71    70
Weight (Y)   145   158   156   148   163   155   153   158   150   154

These data are plotted below.

[Figure: scatter plot of WEIGHT (pounds) against HEIGHT (inches)]

Is there any relationship between the two variables? One way of characterising this is to say that the
mean weight for a given height seems to be an increasing linear function, within the range of heights
examined here.
Example 12.2: (data again taken from Chatfield's book). An experiment was conducted to investigate the variation in specific heat of a certain chemical with temperature. Two measurements of specific heat were taken at a series of six equally spaced temperatures in the range 50 to 100 °C.

The important feature of this experiment is that one variable, temperature, is controlled and only the
other, specific heat, is subject to random variation (mainly due to measurement error). It is not
necessary to have equally spaced values nor is it necessary to have equal replications at those values.

Temperature (°C) (X)            50     60     70     80     90     100
Specific heat (Y) (cal/gm/°C)   1.60   1.63   1.67   1.70   1.71   1.71
                                1.64   1.65   1.67   1.72   1.72   1.74

[Figure: Scatter plot of Specific Heat (cal/gm/°C) with Temperature (°C)]
The data are plotted in Figure 8.2. As in Example 12.1, there is a positive relationship, but now the function giving mean specific heat for a given temperature, although increasing, may be curvilinear, possibly quadratic.
You should read the suggested references for a discussion of other possible scatter plots and their
interpretation. The methods in this chapter are applicable only if the data show a linear relationship,
whether strong or weak, negative or positive. Either or both variables may need to be transformed to
achieve an approximate linear relationship.
12.2 Correlation
The only measure of association we will be covering here is Pearson's product moment linear correlation coefficient r (when referring to this statistic, all the words here can be omitted with the exception of linear correlation). The formula for r is
$$r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}}$$

Theorem 2: For any set of data (where neither the x-values nor the y-values are all the same)
$$-1 \leq r \leq 1$$
with equality if and only if X and Y are linear functions of each other for the data set.

What does the observed value of r tell us about the relationship between the two variables in the
population?
Assume that both samples are random and from Normal populations (easily checked with Normal probability plots), and that the joint sample of pairs is random from a Bivariate Normal distribution. (The first part of this assumption, which we can check, follows from the second part, which we cannot check, nor do we need to know the full meaning of the second part.) We can use r as a test statistic for the null hypothesis $H_0: \rho = 0$ (no linear association) versus $H_1: \rho \neq 0$, where $\rho$ is the population parameter measuring the correlation of X and Y. The sampling distribution of r is complicated but it can be shown that
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} \sim t_{n-2} \text{ under } H_0$$
We can test $H_0$ using Table (10), which gives critical values of r for $\nu = n - 2$ degrees of freedom, or by referring the observed value of t above to Table (7).

Example 12.1 revisited: You should always examine the scatter plot first before undertaking any calculations or tests. Here it clearly shows that there is weak linear association.
Using hand calculations, with X = height and Y = weight,
$$\sum_{i=1}^{n} x_i = 700,\quad \sum_{i=1}^{n} y_i = 1540,\quad \sum_{i=1}^{n} x_i^2 = 49140,\quad \sum_{i=1}^{n} y_i^2 = 237412,\quad \sum_{i=1}^{n} x_i y_i = 107946$$
hence
$$r = \frac{146}{\sqrt{140 \times 252}} = 0.777$$
which we refer to Table (10) with $\nu$ = 8. Critical values are 0.7646 (P = 0.01) and 0.8721 (P = 0.001) for a two-sided test, so we conclude that there is strong evidence for linear correlation as 0.001 < P < 0.01.
The same conclusion is obtained by calculating $t = 0.777\sqrt{8}/\sqrt{1 - 0.777^2} = 3.49$, which gives 0.01 > P > 0.005.
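The hand calculation for Example 12.1 can be reproduced as follows. This is a minimal sketch with illustrative variable names, using the corrected sums of squares and products that appear in the formula for r.

```python
import math

# Example 12.1: heights (inches) and weights (pounds) of ten males.
height = [63, 71, 72, 68, 75, 66, 68, 76, 71, 70]
weight = [145, 158, 156, 148, 163, 155, 153, 158, 150, 154]

n = len(height)
x_bar = sum(height) / n
y_bar = sum(weight) / n

# Corrected sums of products and squares.
sxy = sum(x * y for x, y in zip(height, weight)) - n * x_bar * y_bar
sxx = sum(x * x for x in height) - n * x_bar ** 2
syy = sum(y * y for y in weight) - n * y_bar ** 2

r = sxy / math.sqrt(sxx * syy)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"Sxy = {sxy:.0f}, Sxx = {sxx:.0f}, Syy = {syy:.0f}")
print(f"r = {r:.3f}, t = {t:.2f} on {n - 2} d.o.f.")
```

The t statistic is a monotone function of r for fixed n, which is why referring r to its own table and referring t to the t table lead to the same conclusion.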

12.3 Regression
Correlation analysis may establish a linear relationship but does not allow us to use it to, say, predict the value of one variable given the value of another. Regression Analysis allows us to do this and more. It is also applicable when one of the variables (X) is controlled.

We will assume that the scatter plot of Y versus X shows a roughly linear relationship and in addition
that the spread in the Y-direction is roughly constant with X. It may be necessary to transform one or
both variables to achieve this.
We postulate a linear model for the data (and the population from which the data have been drawn):
$$Y = \alpha + \beta X + \varepsilon$$
where Y is the response variable, X is the regressor or explanatory variable and $\varepsilon$ is a random error with zero mean and constant variance $\sigma^2$ (unknown) for each value of X. The unknown parameters $\alpha$ and $\beta$ represent the intercept and slope of the (unknown) population regression line $\alpha + \beta X$.
The estimate of this line is a + bX, where
$$b = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad a = \bar{y} - b\bar{x}$$
are unbiased estimators of $\beta$ and $\alpha$ respectively. (The derivation of these estimates by the Principle of Least Squares or by Maximum Likelihood is outside the scope of this course.)
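Example 12.2 revisited: n = 12, $\sum_{i=1}^{n} x_i = 900$, $\sum_{i=1}^{n} y_i = 20.16$, $\bar{x} = 75$, $\bar{y} = 1.68$.
The raw sums of squares and products are
$$\sum_{i=1}^{n} x_i y_i = 1519.9,\quad \sum_{i=1}^{n} x_i^2 = 71000,\quad \sum_{i=1}^{n} y_i^2 = 33.8894$$
so that
$$b = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{1519.9 - 12 \times 75 \times 1.68}{71000 - 12 \times 75^2} = \frac{7.9}{3500} = 0.00226$$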

This can be interpreted as the (estimated) average increase in specific heat per 1 °C increase in temperature.
$$a = 1.68 - 0.00226 \times 75 = 1.511$$
This has no interpretation as, say, the specific heat at 0 °C, as the data tell us nothing about the relationship outside the range of temperatures considered.
The fitted line Y = 1.511 + 0.00226X should now be plotted on the original scatter diagram, as a check on the calculations, and prior to checking the assumptions of the model.
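The least-squares estimates for Example 12.2 can be reproduced from the raw data with the sketch below. The variable names are illustrative, and the data are those tabulated for Example 12.2 (two specific heat readings at each of six temperatures).

```python
# Example 12.2: temperature (deg C) and specific heat (cal/gm/deg C).
temperature = [50, 50, 60, 60, 70, 70, 80, 80, 90, 90, 100, 100]
specific_heat = [1.60, 1.64, 1.63, 1.65, 1.67, 1.67,
                 1.70, 1.72, 1.71, 1.72, 1.71, 1.74]

n = len(temperature)
x_bar = sum(temperature) / n
y_bar = sum(specific_heat) / n

# Corrected sum of products and corrected sum of squares of x.
sxy = sum(x * y for x, y in zip(temperature, specific_heat)) - n * x_bar * y_bar
sxx = sum(x * x for x in temperature) - n * x_bar ** 2

b = sxy / sxx            # estimated slope
a = y_bar - b * x_bar    # estimated intercept

print(f"x_bar = {x_bar:.0f}, y_bar = {y_bar:.2f}")
print(f"b = {b:.5f}, a = {a:.3f}")
print(f"fitted value at 75 deg C: {a + b * 75:.3f}")
```

The fitted line always passes through the point of means, so the fitted value at the mean temperature of 75 °C equals the mean specific heat; this is a quick check on hand calculations.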
To make inferences on the unknown parameters , , 2 and to make predictions (with appropriate
confidence intervals) of the response variable we need to make further distributional assumptions on
the random errors in the model. We assume that the random errors are independent and normally
distributed (zero mean and constant variance already assumed).
The assumptions of the regression model can be checked by calculating and examining (by suitable plots) estimates of the random errors, called residuals. The $i$th residual is given by
$$e_i = y_i - a - bx_i = y_i - \hat{y}_i$$
where $\hat{y}_i$ is called the $i$th fitted value. Graphically, the residual is just the vertical distance of the observed $i$th point from the fitted line. If the assumptions are correct then these residuals should look approximately like a random sample from a Normal distribution with zero mean and variance $\sigma^2$.
They should be plotted (on scatter diagrams) against:
- the values of the regressor variable X, to indicate a departure from the model assumptions in the form of a non-linear term, and also to check whether the variance changes with X;
- the fitted values $\hat{y}_i$, to indicate a possibly non-constant variance (in many data sets the variance or spread is found to increase with the estimated mean or fitted value, and a transformation of the response variable is often the appropriate remedy);
- expected Normal order statistics, to indicate possible non-Normality in the error distribution (MINITAB: NSCORES);
- the values of any other variable observed on the sample units; a systematic pattern here indicates the need for a multiple regression, outside the scope of this course.
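For the Normality check, expected Normal order statistics can be approximated without special software. The sketch below uses Blom's approximation, which is an assumption made here for illustration and is not necessarily the exact rule a given package applies. The residuals are hypothetical values, and `normal_scores` is an illustrative name.

```python
from statistics import NormalDist


def normal_scores(values):
    """Approximate expected Normal order statistics for each value.

    Blom's approximation: the value with rank r (1 = smallest) out of n
    is assigned the standard Normal quantile at (r - 3/8) / (n + 1/4).
    Tied values simply receive consecutive ranks (a simplification).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        p = (rank - 0.375) / (n + 0.25)
        scores[idx] = NormalDist().inv_cdf(p)
    return scores


# Hypothetical residuals, for illustration only
residuals = [0.02, -0.05, 0.01, 0.04, -0.03, 0.00, -0.01, 0.03]
for e, z in zip(residuals, normal_scores(residuals)):
    print(f"{e:+.2f}  {z:+.3f}")
```

If the errors are Normal, a scatter diagram of the residuals against these scores should be roughly a straight line.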


Example 12.2 revisited:

 i    x_i   observed (y_i)   fitted (ŷ_i)   residual (e_i)
 1     50       1.60            1.587          +0.013
 2     50       1.44            1.587          -0.147
 3     60       1.63            1.627          +0.003
 4     60       1.65            1.627          +0.023
 5     70       1.67            1.667          +0.003
 6     70       1.67            1.667          +0.003
 7     80       1.70            1.707          -0.007
 8     80       1.72            1.707          +0.013
 9     90       1.71            1.747          -0.037
10     90       1.72            1.747          -0.027
11    100       1.71            1.787          -0.077
12    100       1.74            1.787          -0.047

Another purpose of calculating the residuals and/or plotting them is to detect possible outliers, i.e. points which do not seem to belong with the remaining data on account of their large residual. To assess the size of residuals we need to estimate the variation about the regression line. This variation is measured by the residual or error sum of squares, denoted SSE, and is given by
$$SSE = \sum_{i=1}^{n} e_i^2$$
(proof not required). It can be shown that the residual mean square, given by $s^2 = SSE/(n-2)$ (the same notation as for the sample variance), is an unbiased estimator of $\sigma^2$, and that $(n-2)s^2/\sigma^2$ has a chi-square sampling distribution with $n-2$ degrees of freedom.
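The residuals, SSE and residual mean square can be computed directly from the data. The sketch below takes the twelve (x, y) pairs exactly as tabulated for Example 12.2 and refits the line without rounding the coefficients. Note that the tabulated y values sum to 19.96 rather than the quoted 20.16, so the fit agrees with the SPSS output (constant 1.387) rather than with the hand calculation. Variable names are illustrative.

```python
from math import sqrt

# (x_i, y_i) pairs as tabulated in Example 12.2 revisited
x = [50, 50, 60, 60, 70, 70, 80, 80, 90, 90, 100, 100]
y = [1.60, 1.44, 1.63, 1.65, 1.67, 1.67, 1.70, 1.72, 1.71, 1.72, 1.71, 1.74]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares estimates of intercept a and slope b
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residuals, error sum of squares and residual mean square
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s2 = sse / (n - 2)

print(f"a = {a:.3f}, b = {b:.4f}")
print(f"SSE = {sse:.4f}, s = {sqrt(s2):.5f}")
```

The value of `s` printed here corresponds to the quantity SPSS labels "Std. Error of the Estimate".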
SPSS Output:

Model Summary
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .806 (a)   .649       .614                .05072
a. Predictors: (Constant), x

ANOVA (b)
Model 1        Sum of Squares   df   Mean Square   F        Sig.
Regression     .048              1   .048          18.485   .002 (a)
Residual       .026             10   .003
Total          .073             11
a. Predictors: (Constant), x
b. Dependent Variable: y

Coefficients (a)
               Unstandardized Coefficients   Standardized Coefficients
Model 1        B        Std. Error           Beta                        t        Sig.
(Constant)     1.387    .066                                             21.033   .000
x              .004     .001                 .806                        4.299    .002
a. Dependent Variable: y

12.4 Exercises
A study was made on the amount of converted sugar in a certain process at various temperatures. The
data were coded and recorded as follows:
Temperature, x    Converted Sugar, y
1.0                8.1
1.1                7.8
1.2                8.5
1.3                9.8
1.4                9.5
1.5                8.9
1.6                8.6
1.7               10.2
1.8                9.3
1.9                9.2
2.0               10.5

(a) Plot the data on a scatter diagram.
(b) Estimate the linear regression line.
(c) Estimate the mean amount of converted sugar produced when the coded temperature is 1.75.
(d) Plot the residuals versus temperature. Comment.

12.5 Further Exercises (Probability and statistics: Walpole, Myers and Myers, 8th Edition)
1. Exercises, page 421: Questions 11.1, 11.2, 11.4, 11.7


NAME: Lab Assessment One


UNIT: Exploratory Data Analysis
TITLE OF ASSIGNMENT: Describing Data Using Graphs & Summary Statistics
ISSUE DATE: Week One
DESCRIPTION:
Major League Baseball is known as America's pastime. The role of Major League Baseball has been ingrained into American culture. The heroic figures and memorable moments of Major League Baseball reflect the type of attitude that American culture is built on. Given below are some measurements observed in this significant sport during the 1998 season.
X1 = Team Attendance
(Average number of spectators for a match that the team plays)
X2 = Team Salary
(Earning of the team)
X3 = Years
(Years since the team has owned a stadium)
1. Identify the measurements of variables and enter the given data set into SPSS and specify the
variables properly.
2. Obtain the following for each variable
a. Box-Plot, Histogram and Stem-Leaf Plot.
b. Mean, Mode, Median and Standard Deviation.
c. First and Third Quartile.
d. Interquartile Range.
3. Manually find the outliers for each of the variables.
a. Beyond what point is a value considered an Outlier?
b. Do the outliers found match the outliers marked in the box-plot?
4. Write a brief summary regarding the distribution of each variable (Use both plots and summary
measures).

Note: Please explore the various options by clicking on any accessible menu item.
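For the manual outlier check in question 3, one common convention (assumed here, since the sheet does not state a rule) flags values lying more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies that rule to hypothetical data. Quartile definitions differ between packages, so the fences may not match SPSS exactly. The name `iqr_fences` is illustrative.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (q1, q3, lower_fence, upper_fence) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)   # default 'exclusive' method
    iqr = q3 - q1
    return q1, q3, q1 - k * iqr, q3 + k * iqr


# Hypothetical measurements, for illustration only
data = [12, 15, 17, 18, 19, 21, 22, 24, 25, 60]
q1, q3, low, high = iqr_fences(data)
outliers = [v for v in data if v < low or v > high]
print(f"Q1 = {q1}, Q3 = {q3}, fences = ({low}, {high})")
print("outliers:", outliers)
```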


Data Set Major League Baseball 1998

Team
Atlanta Braves
New York Mets
Philadelphia Phillies
Florida Marlins
Houston Astros
Chicago Cubs
St. Louis Cardinals
Cincinnati Reds
Milwaukee Brewers
Pittsburgh Pirates
San Diego Padres
San Francisco Giants
Colorado Rockies
Arizona Diamondbacks
New York Yankees
Boston Red Sox
Toronto Blue jays
Baltimore Orioles
Tampa Bay Devil
Cleveland Indians
Chicago White Sox
Kansas City Royals
Minnesota Twins
Detroit Tigers
Texas Rangers
Anaheim Angels
Seattle Mariners
Oakland Athletics
Montreal Expos
Los Angeles Dodgers

Team Attendance (X1)


3.361
2.288
1.716
0.914
1.750
2.450
2.623
3.195
1.794
1.812
1.561
2.556
1.926
3.089
3.789
3.603
2.950
2.344
2.454
3.685
2.506
3.467
1.391
1.495
1.166
1.409
2.927
2.519
2.644
1.232

Team Salary (X2)


59.536
49.518
34.370
9.162
33.434
40.629
49.433
52.575
21.995
32.393
13.352
45.368
40.571
47.970
47.435
30.572
63.461
51.647
48.666
68.988
25.318
59.584
36.840
32.963
26.183
22.725
55.305
38.702
52.027
20.063

Years (X3)
3
35
28
23
12
34
85
33
29
46
29
32
39
37
4
1
76
87
10
7
9
5
8
26
17
87
5
33
23
33

Goals: When you have completed this Assessment and Chapter 1 of the Hand Book, you will be able to:
- Calculate the arithmetic mean, median and mode.
- Explain the characteristics, uses, advantages, and disadvantages of each measure of location.
- Identify the position of the arithmetic mean, median, and mode for both symmetric and skewed distributions.
- Compute and interpret the range, the variance, and the standard deviation.
- Explain the characteristics, uses, advantages, and disadvantages of each measure of dispersion.
- Compute and interpret quartiles and the interquartile range.
- Diagnose a given dataset using Summary Statistics.


NAME: Assessment Two


UNIT: Exploratory Data Analysis
TITLE OF ASSIGNMENT: Frequency Distributions
ISSUE DATE: Week Two
DESCRIPTION:
Data Set One concerns the accommodation of a set of students studying at University College.
X1 = Age of Students
X2 = Gender
1 = Male.
2 = Female.
X3 = Accommodation
1 = Stays at Home.
2 = Boarded Students.
3 = Lodging.
1. Identify the measurements of variables and enter the given data set into SPSS. Code the
variables X2 and X3 as mentioned.
2. Analyse the data one variable at a time.
a. Use methods discussed in the first lab sheet to describe numerical variables.
b. Use tables/pie charts/bar charts to describe categorical variables.
3. Describe the gender and accommodation together using two-way tables.
4. Find the mean age for all the combinations of gender and accommodation.
Hint: Use the following format.
Gender           Male                       Female
Accommodation    Home   Boarded   Lodge     Home   Boarded   Lodge
Mean Age


Data Set One:


Age
(X1)
17
17
18
25
30
21
17
18
17
19
20
21
22
17
18
19
27
28
17
18
17
17
27
29
17
25
29
28
18
19
17
17
18
19
25
28
29
26
23
21
18
17
19
20
20
20
20

Sex
(X2)
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1

Accommodation
(X3)
2
2
3
1
2
1
3
2
2
2
1
2
3
2
3
3
2
3
2
3
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
1
2
2
2
2
2
2
3
2
3
3
3

Age
(X1)
17
18
18
18
18
19
17
18
19
17
25
25
29
27
30
30
30
30
31
32
28
28
27
18
19
19
19
19
19
19
17
19
18
19
17
18
19
21
21
28
28
28
28
27
29
28
27

Sex
(X2)
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2

Accommodation
(X3)
3
2
2
2
1
3
2
2
2
3
2
1
2
2
2
3
2
2
2
2
2
3
2
2
2
2
2
2
1
3
2
2
2
2
2
1
3
3
1
1
2
3
2
1
3
2
1


Data Set Two:


The numbers of shareholders for a selected group of large companies (in thousands) are:
Company
Pan American World Airways
General Public Utilities
Occidental Petroleum
Middle South Utilities
DaimlerChrysler
Standard Oil of California
Bethlehem Steel
Long Island Lighting
RCA
Greyhound Corporation
Pacific Gas & Electric
Niagara Mohawk Power
E.I. du Pont de Nemours
Westinghouse Electric
Union Carbide
BankAmerica
Northwest Utilities
Standard Oil (India)
Atlantic Richfield
Detroit Edition
Eastman Kodak
Dow Chemical
Pennsylvania Power
American Electric Power
Ohio Edition
Transamerica Corporation
Colombia Gas System
International Telephone
Union Electric
Virginia Electric and Power
Public Service Electric & Gas
Consumer Power

Number of Shareholders (thousands)


144
266
177
133
209
264
160
143
246
151
239
204
204
195
176
175
200
173
195
220
251
137
150
262
158
162
165
223
158
162
225
161

The number of shareholders is to be organized into a frequency distribution and several graphs drawn to portray the distribution.
1. Using seven classes and a lower limit of 130, construct a frequency distribution.
2. Portray the distribution in the form of a frequency polygon.
3. Portray the distribution in a less-than cumulative frequency polygon.
4. Based on the polygon, three out of four (75 percent) of the companies have how many shareholders or less?
5. Write a brief analysis of the number of shareholders based on the frequency distribution and graphs.
Goals: When you have completed this Assessment and Chapter 1 of the Hand Book, you will be able to:
- Organize data into a frequency distribution.
- Portray a frequency distribution in a histogram, frequency polygon, and cumulative frequency polygon.
- Develop a stem-and-leaf display.
- Present data using such graphic techniques as line charts, bar charts, and pie charts.

NAME: Assessment Three


UNIT: Discrete Random Variables
TITLE OF ASSIGNMENT: Binomial Distribution
ISSUE DATE: Week Three
DESCRIPTION:
Consider the experiment of tossing a fair coin 5 times.

1. Using statistical tables, find the following probabilities for the above-mentioned scenario.
a) P(X = 0)
b) P(X = 1)
c) P(X = 2)
d) P(X = 3)
e) P(X = 4)
f) P(X = 5)
g) P(X 3)
h) P(X 2)

2. Calculate the above probabilities using the probability mass function.

3. Suppose we want to simulate this experiment 1000 times. We can use the binomial distribution. Use the following SPSS syntax to generate 1000 random numbers from the binomial distribution with n = 5 and p = 1/2.
NEW FILE.
INPUT PROGRAM.
LOOP #I=1 TO 1000.
COMPUTE Binomial_COIN = RV.BINOMIAL(5,0.5).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
4. Calculate the same probabilities using the generated data.
5. Comment on the probabilities obtained in all 3 methods.
6. How can the accuracy of the probabilities calculated from the generated data be increased?
7. Plot a histogram for the generated data.
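For readers without SPSS, the same comparison of exact and simulated probabilities can be sketched in Python. This is an illustrative equivalent of the syntax above, not a translation of it. The function names and the seed are assumptions made for the example.

```python
import random
from collections import Counter
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def simulate_binomial(n, p, size, seed=None):
    """Draw `size` Binomial(n, p) values by counting successes in n trials."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for _ in range(size)]


n, p, runs = 5, 0.5, 1000
draws = simulate_binomial(n, p, runs, seed=1)  # the seed is arbitrary
counts = Counter(draws)

print(" k   exact   simulated")
for k in range(n + 1):
    print(f"{k:2d}  {binomial_pmf(k, n, p):.4f}  {counts[k] / runs:.4f}")
```

Increasing `runs` should bring the simulated relative frequencies closer to the exact values, which bears on question 6.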


NAME: Assessment Four


UNIT: Discrete Random Variables
TITLE OF ASSIGNMENT: Binomial/Poisson Distribution
ISSUE DATE: Week Three
DESCRIPTION:
The data given below consist of the number of defectives in each of 200 samples of motherboards, where each sample contains 50 motherboards.
DATA SET:
2 2 2 4 3 1 0 4 3 1 2 5 4 4 3 2 1 1 2 2
3 5 4 3 4 2 1 3 1 2 2 3 3 5 2 1 2 4 1 5
3 4 7 3 6 7 3 0 3 6 1 4 4 2 2 2 5 3 5 1
5 2 3 8 5 2 3 3 2 0 7 6 3 4 5 3 2 5 5 3
3 1 4 6 4 3 2 1 1 3 2 4 5 6 8 5 2 1 5 5
3 2 3 3 4 5 4 5 1 4 2 3 3 0 2 2 4 2 4 3
5 2 0 2 5 3 3 3 2 3 1 2 5 3 3 4 4 3 1 1
2 3 1 3 1 7 3 3 3 2 1 1 3 4 0 1 1 5 3 4
6 2 4 3 1 4 3 4 4 3 3 1 3 2 6 2 2 1 1 7
2 0 9 2 2 3 3 1 1 2 3 6 2 3 1 6 1 1 4 1

1. Calculate the mean and variance of the given data.
2. Represent the data graphically in an appropriate way.
3. Assume that the distribution of the above data is Binomial with p = 0.06. Calculate the probability of each of the following events:
a. Number of defectives is equal to 3
b. Number of defectives is less than 2
c. Number of defectives is greater than or equal to 2
4. Repeat Question 3 using the Poisson approximation to the Binomial distribution.
Hint: λ = 3.
5. Compare the values obtained in Question 3 and Question 4.
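Hand calculations for Questions 3 and 4 can be checked with a short script. The sketch below evaluates the three events under Binomial(50, 0.06) and under the Poisson approximation with mean np. The helper names and the `events` dictionary are illustrative.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** k / factorial(k)


n, p = 50, 0.06
lam = n * p  # Poisson mean used for the approximation

# Each event is written in terms of a generic pmf so both models can be used
events = {
    "P(X = 3)":  lambda pmf: pmf(3),
    "P(X < 2)":  lambda pmf: pmf(0) + pmf(1),
    "P(X >= 2)": lambda pmf: 1 - pmf(0) - pmf(1),
}

for label, prob in events.items():
    exact = prob(lambda k: binomial_pmf(k, n, p))
    approx = prob(lambda k: poisson_pmf(k, lam))
    print(f"{label:10s} binomial = {exact:.4f}  poisson = {approx:.4f}")
```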

NAME: Assessment Five


UNIT: Random Sampling
TITLE OF ASSIGNMENT: Properties of the Mean
ISSUE DATE: Week Four
DESCRIPTION:
The nicotine contents, in milligrams, of 40 cigarettes of a certain brand are recorded in the data set below.
1. Calculate the Population Mean and Population Variance.
2. Draw 30 random samples of size 5 and calculate the sample mean and sample variance for each sample.
3. Calculate the Mean and Variance of the Sample Means.
4. Compare the Population Mean with the Mean of the Sample Means and state the relationship (if any).
5. Compare the Population Variance with the Variance of the Sample Means and state the relationship (if any).

DATA SET:
1.09   1.92   2.31   1.79   2.28
1.74   1.47   1.97   0.85   1.24
1.58   2.03   1.70   2.17   2.55
2.11   1.86   1.90   1.68   1.51
1.64   0.72   1.69   1.85   1.82
1.79   2.46   1.88   2.08   1.67
1.37   1.93   1.40   1.64   2.09
1.75   1.63   2.37   1.75   1.69
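The sampling exercise can also be sketched in code. The example below treats the 40 values as the population and draws 30 samples of size 5 without replacement. The seed and variable names are arbitrary assumptions. The variance of the sample means is expected to be near the population variance divided by the sample size (somewhat smaller, because sampling without replacement from a finite population reduces it).

```python
import random
from statistics import fmean, pvariance, variance

# Nicotine contents (mg) of the 40 cigarettes, treated as the population
NICOTINE = [
    1.09, 1.74, 1.58, 2.11, 1.64, 1.79, 1.37, 1.75,
    1.92, 1.47, 2.03, 1.86, 0.72, 2.46, 1.93, 1.63,
    2.31, 1.97, 1.70, 1.90, 1.69, 1.88, 1.40, 2.37,
    1.79, 0.85, 2.17, 1.68, 1.85, 2.08, 1.64, 1.75,
    2.28, 1.24, 2.55, 1.51, 1.82, 1.67, 2.09, 1.69,
]

rng = random.Random(2024)          # arbitrary seed, for reproducibility only
N_SAMPLES, SAMPLE_SIZE = 30, 5

# Each sample is drawn without replacement from the population
samples = [rng.sample(NICOTINE, SAMPLE_SIZE) for _ in range(N_SAMPLES)]
sample_means = [fmean(s) for s in samples]
sample_vars = [variance(s) for s in samples]    # divisor n - 1

pop_mean = fmean(NICOTINE)
pop_var = pvariance(NICOTINE)                   # divisor N

print(f"population mean      = {pop_mean:.4f}")
print(f"mean of sample means = {fmean(sample_means):.4f}")
print(f"population variance  = {pop_var:.4f}")
print(f"variance of means    = {variance(sample_means):.4f}")
print(f"sigma^2 / n          = {pop_var / SAMPLE_SIZE:.4f}")
```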


Use the following format.

Population Mean:
Population Variance:
Mean of the Sample Means:
Variance of Sample Means:

Sample   Mean   Variance        Sample   Mean   Variance
1                               16
2                               17
3                               18
4                               19
5                               20
6                               21
7                               22
8                               23
9                               24
10                              25
11                              26
12                              27
13                              28
14                              29
15                              30


NAME: Assessment Six


UNIT: Hypothesis Testing
TITLE OF ASSIGNMENT: Tests and CI for Population Mean
ISSUE DATE: Week Five
DESCRIPTION:
The PE ratio of a share is the ratio of the market price of the share to the earnings per share. The mean PE ratio for the shares traded on a stock exchange was 16.2 in 2005. The following are the 2005 PE ratios of 15 randomly selected traded companies in the Resource Sector.

Company   PE ratio
1         15.2
2         12.1
3         18.7
4         16.8
5         14.8
6         16.3
7         14.9
8         16.8
9         17.5
10        16.7
11        13.8
12        15.5
13        12.8
14        12.2
15        11.8

1. Write down the following statistics for the data above:


a. Mean
b. Variance
c. First Quartile (Q1)
d. Median
e. Third Quartile (Q3)
f. 98% confidence interval for the mean PE ratio of the Resource sector


2. Suppose, using the above data, you wish to test your claim that the mean PE ratio
of the Resource Sector was significantly different from 16.2.
a. Write down the null and the alternative hypotheses for the test.
b. Give the value of the test statistic.
c. Give the p-value for the test.
d. State your conclusion at the 5% level of significance making sure that the
conclusion clearly relates to the context of the question.
3. The test in part (2) relies on an assumption of normality of the data. Comment on
the validity of this assumption by answering the following parts.
a. State which test you have used to test the assumption.
b. Give the p-value for the test.
c. Clearly state your conclusion at the 5% level of significance.
4. Suppose you wish to test your claim that the mean PE ratio of the Resource Sector
was significantly less than 16.2.
a. Give the p-value for the test.
b. What is your conclusion at 5% level of significance?

NAME: Assessment Seven


UNIT: Hypothesis Testing
TITLE OF ASSIGNMENT: Tests and CI for Mean and Variance
ISSUE DATE: Week Five
Part 1.
Scientists developed a new method of determining serum iron concentrations. To check the accuracy of the method, they made 20 analyses of control serum with a concentration of 105 μg serum iron per 100 millilitres. The determinations of serum iron concentration (μg/100 ml) are shown below.
 96    98    99   100   103   103   104   104   105
105   106   107   108   108   108   110   113   114

1. Test the null hypothesis that the average serum iron determination using the new method is 105 μg/100 ml. Use 5% significance.
2. Calculate a 95% confidence interval for the average serum iron determination using the new method. Discuss your findings.


Part 2.
Does diet restriction prolong life? In this experiment, researchers examined the influence of different diets on the aging process in rats (Berger, Boos, and Guess, 1988; Yu et al., 1982). Lifetimes (in days) of rats on a restricted diet and rats on an unrestricted diet are shown below.

Restricted diet
 105   193   211   236   302   363   389   390   391   403   530
 604   605   630   716   718   727   731   749   769   770   789
 804   810   811   833   868   871   875   893   897   901   906
 907   919   923   931   940   957   958   961   962   974   979
 982  1001  1008  1010  1011  1012  1014  1017  1032  1039  1045

Unrestricted diet
  89   104   387   465   479   494   496   514   532   533   536
 545   547   548   582   606   609   619   620   621   630   635
 639   648   652   653   654   660   665   667   668   670   675
 677   678   678   681   684   688   694   695   697   698   702
 704   710   711   712   715   716   717   720   721   730   731

1. Test the null hypothesis that the mean lifetime is the same for rats on the two diets. Use 5% significance.
2. Calculate a 95% confidence interval for the difference in mean lifetimes for rats on the two diets.
3. Discuss your findings.

Part 3.
Hydrocarbon emissions are known to have decreased dramatically during the 1980s. A study was conducted to compare the hydrocarbon emissions at idling speed, in parts per million (ppm), for automobiles of 1980 and 1990. Twenty cars of each model year were randomly selected and their hydrocarbon emission levels were recorded. The data:

1980 models:
141   359   247   940   882   494   306   210   105   880
200   223   188   940   241   190   300   435   241   380

1990 models:
140   160    20    20   223    60    20    95   360    70
220   400   217    58   235   380   200   175    85    65

Test the hypothesis that $\sigma_1 = \sigma_2$ against the alternative that $\sigma_1 \neq \sigma_2$. Assume both populations are normal and use 5% significance.

NAME: Assessment Eight


UNIT: Chi-Squared Test
TITLE OF ASSIGNMENT: Chi-Squared Test
ISSUE DATE: -
DESCRIPTION:
The following data file contains 200 observations from a sample of high school students, with demographic information about the students: gender (gender: 0-male, 1-female), socio-economic status (ses: 1-low, 2-middle, 3-high), school type (schtyp: 1-public, 2-private), ethnic background (race: 1-hispanic, 2-asian, 3-african, 4-white) and program type (prog: 1-general, 2-academic, 3-vocation).

1. Check the statistical significance of following relationships;


a. Between the type of school attended and students' gender.
b. Between gender and socio-economic status.
c. Between gender and program type.
d. Between schools attended and program type.

2. Consider the two variables, type of school and ethnic background.


a. Is it possible to conduct the Chi-Squared test to identify the existence of a
statistically significant relationship between the above two variables? (Hint:
Do a cross tabulation and see the expected cell frequencies.)
b. When the expected cell frequencies are less than five, we can try combining
neighbouring classes to make them 5 or more. Solve the issue and test for
the relationship. (Hint: Recode the variable race in a meaningful way to
have only two categories, e.g.: Whites and Non-whites)
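The expected-frequency check and the effect of combining categories can be illustrated as follows. The counts in `observed` are hypothetical and are not taken from the data file. The function names are likewise illustrative, and the p-value still has to be read from chi-squared tables or software.

```python
def expected_counts(table):
    """Expected cell frequencies under independence for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]


def chi_squared(table):
    """Pearson chi-squared statistic and its degrees of freedom."""
    expected = expected_counts(table)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical 2 x 4 table of counts (rows: school type, columns: race)
observed = [[20, 9, 18, 121],
            [4, 2, 2, 24]]

# Some expected counts fall below 5, so the test is unreliable as it stands
for row in expected_counts(observed):
    print([round(e, 2) for e in row])

# Collapse the four race columns into two: non-white (first three) and white
collapsed = [[sum(row[:3]), row[3]] for row in observed]
stat, df = chi_squared(collapsed)
print(f"chi-squared = {stat:.3f} on {df} df")
```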


id

gender

race

ses

schtyp

prog

id

gender

race

ses

schtyp

prog

id

gender

race

ses

schtyp

prog

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37

0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

4
4
4
4
4
4
3
1
4
3
4
4
4
4
3
4
4
4
4
4
4
4
3
1
1
3
4
4
4
2
4
4
4
4
4
4
4

1
2
3
3
2
2
2
2
2
2
2
2
3
3
1
1
3
2
3
2
2
2
2
3
2
2
3
2
3
1
2
3
3
2
3
3
2

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
2
2
1
1
1
2
1
2
1
2

1
3
1
3
2
2
1
2
1
2
3
2
2
2
2
1
2
1
2
1
1
3
2
2
3
3
2
3
2
1
1
2
2
3
2
1
2

38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

4
1
4
4
4
4
3
4
4
3
4
4
1
2
4
1
4
4
1
4
1
4
1
4
4
4
4
4
4
4
4
4
1
4
4
4
4

3
1
2
2
2
2
1
3
1
3
2
2
2
2
3
2
2
2
3
1
2
2
2
2
3
1
2
3
2
2
1
1
2
2
3
2
2

1
1
1
2
2
1
1
1
1
1
1
2
1
1
1
1
2
1
1
1
1
2
1
2
1
1
1
2
1
1
1
1
1
1
1
1
1

2
3
3
2
2
2
1
1
1
3
2
2
2
2
2
1
2
2
3
3
3
2
3
2
2
1
1
2
3
2
3
2
3
1
2
2
1

75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

4
1
4
4
4
1
4
4
4
1
4
4
4
4
4
4
2
4
4
1
4
4
4
4
1
4
4
4
3
4
4
4
4
4
3
4
4

2
1
3
3
2
3
3
1
2
1
2
3
3
3
2
3
2
1
3
1
1
1
2
3
1
3
3
3
1
3
2
2
3
1
1
3
2

1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
2
2
1

3
2
2
1
3
2
2
3
2
2
3
2
2
3
3
2
2
1
2
2
1
1
2
2
3
2
2
1
2
2
2
2
2
3
1
2
3


id

gender

race

ses

schtyp

prog

142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

4
4
1
4
4
4
4
1
3
3
4
4
1
4
4
4
4
4
3
4
4
4
4
4
4
4
4
4
4
4

3
2
2
2
3
2
2
1
1
1
3
2
2
2
2
2
2
2
3
1
2
3
1
2
1
2
1
2
1
1

1
1
1
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
2
1
1

2
3
2
2
2
3
2
2
2
2
2
3
1
1
2
3
3
1
2
2
2
2
2
2
3
2
1
2
3
1

prog

prog
2
2
2
2
3
1
2
2
3
1
1
2
3
3
2
2
1
3
3
2
2
3
3
2
2
2
3
3
2
1

schtyp

schtyp
1
1
1
2
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
2
1

ses

ses
1
3
1
3
2
3
3
1
1
1
2
2
2
1
3
2
2
3
1
2
1
2
2
1
3
2
2
2
2
1

race

race
1
4
4
1
4
4
4
4
3
1
4
4
4
3
4
4
2
4
3
4
2
4
4
4
4
4
3
1
3
1

gender

gender
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

id

id
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141

172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
189
190
191
192
193
194
195
196
197
198
199
200

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

1
3
2
3
4
4
4
4
4
4
4
4
4
2
2
4
4
3
4
4
4
2
4
2
4
4
4
4

2
3
3
1
1
2
2
3
1
2
2
3
2
3
1
2
3
1
1
3
2
3
2
2
2
2
2
3

1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
2
1
1
2
2
1
2
2
1
2
1
1

3
1
2
1
2
2
2
3
1
2
2
1
2
3
2
1
2
2
3
1
3
2
2
1
3
1
1
2


NAME: Assessment Nine


UNIT: Simple Linear Regression
TITLE OF ASSIGNMENT: Simple Linear Regression
ISSUE DATE: Week Six
DESCRIPTION:

Part 1.
The amounts of solids removed from a particular material when exposed to drying periods
of different lengths are as shown below. Use the data set to provide answers to the
following questions.
Drying Period (hours)    Solids Removed (grams)
4.4                      13.1   14.2
4.5                       9.0   11.5
4.8                      10.4   11.5
5.5                      13.8   14.8
5.7                      12.7   15.1
5.9                       9.9   12.7
6.3                      13.8   16.5
6.9                      16.4   15.7
7.5                      17.6   16.9
7.8                      18.3   17.2

(a) Draw the scatterplot for the above observations and comment on the plot.
(b) State the sample correlation coefficient.
(c) Test the hypothesis that there is no correlation between the two variables and
interpret the result.
(d) Do a regression analysis and state the regression model.
(e) Use the output to test if the slope of the regression is zero.
(f) State your conclusions from this analysis.


Part 2.
Suppose we wanted to predict a student's grade on a freshman college calculus midterm
based on his/her SAT score.
We must examine the SAT scores and calculus midterm scores achieved by former
students, in order to make any prediction as to how well a student would do on the calculus
midterm.
Student   SAT Score (X)   Calculus Midterm Score (Y)
1         1100            89
2         1300            92
3         1000            86
4         1100            92
5         1200            90
6         1200            93
7         1400            98
8         1300            95
9         1000            88
10        1400            95

(a) Draw the scatterplot for the above observations and comment on the plot.
(b) State the sample correlation coefficient.
(c) Test the hypothesis that there is no correlation between the two variables and
interpret the result.
(d) Do a regression analysis and state the regression model.
(e) Use the output to test if the slope of the regression is zero.
(f) State your conclusions from this analysis.
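The quantities asked for in parts (b) to (e) are linked by simple formulas, sketched below for the SAT data of Part 2. Variable names are illustrative. The statistic `t` is compared with the t distribution on n - 2 degrees of freedom, and in simple linear regression it serves both as the test of zero correlation and as the test of zero slope.

```python
from math import sqrt

# SAT scores (X) and calculus midterm scores (Y) from the table in Part 2
sat = [1100, 1300, 1000, 1100, 1200, 1200, 1400, 1300, 1000, 1400]
midterm = [89, 92, 86, 92, 90, 93, 98, 95, 88, 95]

n = len(sat)
x_bar = sum(sat) / n
y_bar = sum(midterm) / n

# Corrected sums of squares and products
s_xx = sum((x - x_bar) ** 2 for x in sat)
s_yy = sum((y - y_bar) ** 2 for y in midterm)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sat, midterm))

r = s_xy / sqrt(s_xx * s_yy)             # sample correlation coefficient
b = s_xy / s_xx                          # slope estimate
a = y_bar - b * x_bar                    # intercept estimate
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)   # test statistic on n - 2 df

print(f"r = {r:.4f}")
print(f"fitted line: y = {a:.2f} + {b:.4f} x")
print(f"t = {t:.3f} on {n - 2} df")
```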


NAME: Assessment Ten


UNIT: All units
TITLE OF ASSIGNMENT: Report Preparation
ISSUE DATE: -
DESCRIPTION:
The following are data on six variables, including salary, for 52 professors in a small college. The variables are:

sx = Sex (1 - female, 0 - male).
rk = Rank (1 - assistant professor, 2 - associate professor, 3 - full professor).
yr = Number of years in current rank.
dg = Highest degree (1 - doctorate, 0 - masters).
yd = Number of years since highest degree was earned.
sl = Academic year salary in dollars.

It is required to explore the factors affecting the salary of the professors. Carry out some meaningful statistical analysis using your knowledge from previous lab sessions and prepare a complete report.
sx       rk          yr   dg          yd   sl
male     full        25   doctorate   35   36350
male     full        13   doctorate   22   35350
male     full        10   doctorate   23   28200
female   full         7   doctorate   27   26775
male     full        19   masters     30   33696
male     full        16   doctorate   21   28516
female   full         0   masters     32   24900
male     full        16   doctorate   18   31909
male     full        13   masters     30   31850
male     full        13   masters     31   32850
male     full        12   doctorate   22   27025
male     associate   15   doctorate   19   24750
male     full         9   doctorate   17   28200
male     associate    9   masters     27   23712
male     full         9   doctorate   24   25748
male     full         7   doctorate   15   29342
male     full        13   doctorate   20   31114
male     associate   11   masters     14   24742
male     associate   10   masters     15   22906
male     full         6   masters     21   24450
male     assistant   16   masters     23   19175
male     associate    8   masters     31   20525
male     full         7   doctorate   13   27959
female   full         8   doctorate   24   38045
male     associate    9   doctorate   12   24832
male     full         5   doctorate   18   25400
male     associate   11   doctorate   14   24800
female   full         5   doctorate   16   25500
male     associate    3   masters      7   26182
male     associate    3   masters     17   23725
female   assistant   10   masters     15   21600
male     associate   11   masters     31   23300
male     assistant    9   masters     14   23713
female   associate    4   masters     33   20690
female   associate    6   masters     29   22450
male     associate    1   doctorate    9   20850
female   assistant    8   doctorate   14   18304
male     assistant    4   doctorate    4   17095
male     assistant    4   doctorate    5   16700
male     assistant    4   doctorate    4   17600
male     assistant    3   doctorate    4   18075
male     assistant    3   masters     11   18000
male     associate    0   doctorate    7   20999
female   assistant    3   doctorate    3   17250
male     assistant    2   doctorate    3   16500
male     assistant    2   doctorate    1   16094
female   assistant    2   doctorate    6   16150
female   assistant    2   doctorate    2   15350
male     assistant    1   doctorate    1   16244
female   assistant    1   doctorate    1   16686
female   assistant    1   doctorate    1   15000
female   assistant    0   doctorate    2   20300
