for

Engineering Students

Noraslinda Mohamed Ismail

Arifah Bahar

Ismail Mohamad

Muhammad Hisyam Lee

Norazlina Ismail

Norhaiza Ahmad

Universiti Teknologi Malaysia

Preface

In general, engineers develop new products, improve existing designs, build and test prototypes, troubleshoot ongoing manufacturing processes, and more. In each of these functions, engineers collect and analyze data as an integral part of their job. Thus, statistical methods are an inseparable part of how engineers solve engineering problems.

This text is an introductory statistics textbook designed for undergraduate students taking engineering programs at Universiti Teknologi Malaysia, Skudai. It provides sufficient material to cover the SSE2193 Engineering Statistics course over a 15-week semester. This text does not pretend to provide either a complete statistical toolkit or a review of all statistical methods in all aspects of engineering applications. It does, however, provide students with an easy start-up kit of key statistical methods, with various examples and tasks that students can solve either during class or in their own time.

We sincerely hope this text will be useful for students in acquiring the skills of handling observed data, drawing valid inferences and eventually making sound judgements and decisions.

Authors

September 2015

_________________________________________________________________________

Self-Review Quiz

Test your prior knowledge and understanding of basic statistics by answering the following questions.

1. The probability of an event is always

a) less than 0

b) in the range 0 to 1.0

c) greater than 0

2. Two equally likely events

a) have the same probability of occurrence

b) cannot occur together

c) have no effect on the occurrence of each other

3. Let S be the sample space, defined as S = {1, 2, 3, 4, 5, 6, 7}. Let A, B, C be subsets of the sample space, defined as

Given

i) P A B 0.5

P B C ' 0.25

a) i, ii, iii

4.

b) ii, iii, iv

c) i, iii, iv

d) ii, iv

a)

Yes b) No

a) have the same probability of occurrence

b) cannot occur together

c) have no effect on the occurrence of each other

7.

a)

b)

c) x

d)

8. Which of the following data can most likely be represented by a discrete random variable?

a) The height of mountains across south-east Asia.

b) The number of errors typed on a piece of paper.

c) The amount of time spent by engineers working offshore.

a) Binomial with $n = 25$ and $p = 0.6$

b) Normal with $\mu = 30$ and $\sigma^2 = 16$

c) Poisson with $\lambda = 7$

10. A random variable X follows a normal distribution with mean 16 and standard deviation

2. The probability of X being less than 15 can be calculated by finding

a)

15 16

P Z

22

b)

15 16

P Z

c)

d)

e)

16 15

P Z

15 16

P Z

15 16

P Z

Contents

Preface

Self-Review Quiz

1 Fundamental Topics

1.1 Descriptive Statistics and Inferential Statistics

1.1.1 Terms and definitions

1.1.2 Measures of central tendency

1.1.3 Measures of dispersion

1.1.4 The use of calculators

1.1.5 Types of Plots

1.2 Probability

1.2.1 Basic notation and definition

1.2.2 Classical definition of probability

1.2.3 Mutually exclusive event

1.2.4 Additive rule of probability

1.2.5 Conditional probability

1.2.6 Multiplication rule of probability

1.2.7 Independence

1.3 Random Variables

1.3.1 Discrete random variable

1.3.2 Continuous random variable

1.3.3 Cumulative distribution function

1.3.4 Mathematical expectation

1.3.5 Variance and standard deviation

1.4 Some Probability Distributions

1.4.1 Binomial distribution

1.4.2 Poisson distribution

1.4.3 Negative binomial distribution

1.4.5 Hypergeometric distribution

1.4.6 Normal distribution

1.4.7 Exponential distribution

1.4.8 Other continuous distributions

Exercise 1

2 Sampling Distributions

2.1 Introduction

2.2 Central Limit Theorem

2.4 Sampling Distribution of $\bar{X}_1 - \bar{X}_2$

2.6 Sampling Distribution of the Difference Between Two Proportions

2.7 t Distribution

2.8 $\chi^2$ Distribution

2.9 F Distribution

Exercise 2

3 Estimation

3.1 Introduction

3.2 Terminology

3.3 Point Estimate

3.4 Interval Estimate

3.5 CI on the Mean

3.6 CI for the Difference between Two Population Means

3.7 CI for the Population Proportion

3.8 CI for the Difference between Two Population Proportions

3.9 CI on the Normal Population Variance

Exercise 3

4 Tests of Hypotheses

4.1 Statistical Hypotheses

4.2 Test of Hypothesis for the Mean

4.3 Test of Hypothesis for the Variance

4.4 Test of Hypothesis for the Proportion

4.5 Test of Hypothesis for the Difference between the

4.5.1 Variances known

4.5.2 Variances unknown

4.6 Test of Hypothesis for the Difference between the Proportions

4.7 Test of Hypothesis for the Ratio of the Variances

Exercise 4

5 Chi-Square Tests

5.1 Introduction

5.2 Goodness-of-fit Test

5.3 Independence Test

5.4 Homogeneity Test

Exercise 5

6 Analysis of Variance

6.1 Introduction

6.2 One-Way ANOVA

6.3 Partitioning of Total Variability Into Components

6.4 Output

6.5 Computer Application - Using Excel

Exercise 6

7.1 Introduction

7.1.1 Regression analysis

7.1.2 Correlation coefficient

7.2 Simple Linear Regression

7.2.1 Simple linear regression model

7.2.2 Model assumptions

7.2.3 Fitted simple linear regression equation

7.3 Scatter Diagram

7.3.1 Data plotting

7.3.2 Draw by eye

___________________________________________________________________________

7.4.1 Errors and residuals

7.4.2 The sum of squared residuals

7.4.3 Normal equations

7.4.4 The least squares estimators

7.4.5 The fitted regression line and prediction

7.4.6 Finding the least squares estimates using a scientific calculator

7.5 Tests for Linearity of Regression

7.5.1 Testing procedures

7.5.2 Using a t-test approach

7.5.3 Using a one-way analysis of variance approach

7.6 Correlation

7.6.1 Product moment correlation coefficient, r

7.6.2 Properties of r

7.6.3 Interpretation of r values

7.7 Simple Linear Regression and Correlation using Excel

7.7.1 Excel procedures

7.7.2 Excel output and interpretation

Exercise 7

8 Nonparametric Statistics

8.1 Introduction

8.2 Sign Test

8.3 Run Test

8.4 Some Methods Based on Ranks

8.4.1 Introduction

8.4.2 Mann-Whitney Test

8.4.3 Wilcoxon Signed-Rank test for Two Dependent

8.5 Measure of Association

8.5.1 Spearman Rank Correlation Coefficient

Exercise 8

Answers

References

Chapter 1

Fundamental Topics

Learning Objectives:

At the end of this chapter, students should be able to

(a) distinguish between descriptive and inferential statistics.

(b) identify types of data.

(c) summarize data numerically and graphically.

(d) calculate the probability of an event using suitable properties.

(e) find the expected value and variance for discrete and continuous random variables.

(f) identify probability models and their distributional characteristics.

This chapter presents a brief refresher of the basic statistics that students are expected to have learnt at pre-undergraduate level. Although it does not cover a whole course of basic statistics, it provides the background needed for the succeeding chapters in this book.

1.1 Descriptive Statistics and Inferential Statistics

Statistics deals with the collection, analysis, presentation, and interpretation of data, and with making decisions based on the observed data. Engineers play a fundamental role in many aspects of the decision-making process, such as designing and developing new products, maintaining and controlling manufacturing processes, and improving existing systems and processes. Statistical methods are important tools in these activities: they give engineers both descriptive and analytical methods for handling the variability in observed data.

Statistics can be divided into two major areas, namely descriptive statistics and inferential statistics. Descriptive statistics deals with the collection and presentation of data. This involves collecting raw data, then classifying, interpreting and presenting the data as meaningful information for users. Inferential statistics, on the other hand, involves procedures used to draw inferences about a population from a sample. Here, probability models are used to quantify the risks involved in making any statistical inference.

1.1.1 Terms and definitions

(a) Population

A population is the complete set of items under study. The items could be anything, such as persons or objects. The number of individual items in the population is the population size.

(b) Sample

A sample is a subset of a population. Elements in a sample are drawn from a population. By using information from the sample, we can make inferences about the population.

(c) Random variable

Random relates to events that have no specific pattern and that occur by chance. Thus random implies that, in a process of selection, any individual object or element has an equal chance of being selected. A variable represents an unknown quantity that varies. Random variables are either measurable or non-measurable entities. Measurable or countable random variables are quantitative random variables, which are either discrete or continuous. In contrast, non-measurable random variables are qualitative random variables.

(d) Parameter

A parameter is a characteristic or measure that we obtain from a population.

(e) Data

A data set is a collection of facts or observations from which conclusions may be drawn. It can be in numerical (quantitative) or non-numerical (qualitative) form.

Quantitative data can be split into two types: discrete (having distinct and separate values, for example: 1, 2, 3, ...) and continuous (taking any value in an interval, including rational or decimal numbers). These data can be further classified into interval scale and ratio scale data.

Qualitative data, on the other hand, can be divided into two groups: nominal (which can be assigned a numerical code where the numbers are simply labels, such as races, for example: Malay = 1, Chinese = 2 and Indian = 3) and ordinal (which can be ranked, i.e. put in order, or have a rating scale attached, for example: first, second, and third place in a competition).

1.1.2 Measures of central tendency

A measure of central tendency of a set of data is a numerical value that indicates the middle of the data set. The most common measures of central tendency are the mean, median and mode.

(a) Mean

The mean or arithmetic mean of a list of observations is the sum of all observations divided by the number (or size) of the observations. The population mean is

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

and the sample mean is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

(b) Median

Median is the middle value that divides the higher half of the data from the lower half of the

data when the observations are arranged in ascending or descending order. If the number of

observations is odd, the median is the middle value, and if the number of observations is

even, the median is the average of the two middle values.

(c) Mode

The mode is the observation with the highest frequency. If several observations share the same highest frequency, then there is more than one mode in the set of data. A mode may not exist if all observations occur with the same frequency. Therefore, unlike the mean and median, the mode need not be unique.

1.1.3

Measures of dispersion

Measures of dispersion or variation are numerical values that indicate the variability of a set

of data. When the dispersion is large, the data are widely scattered. The simplest measure of

variation is range but the most used measures are variance and standard deviation.

(a) Range

Range of a data set is the difference between the largest and the smallest observations.

Range = Largest observation - Smallest observation

(b) Variance

The variance of a set of data is a measure of the spread or dispersion within the data. The population variance is denoted by $\sigma^2$ and the sample variance by $s^2$.

The population variance is given by

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2$$

where $N$ is the population size, $x_i$ is the i-th observation in the population and $\mu$ is the population mean.

The sample variance is given by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

where $n$ is the sample size, $x_i$ is the i-th observation and $\bar{x}$ is the sample mean.

The variance is never negative because the squared deviations are either positive or zero. The unit of variance is the square of the unit of the observations.

(c) Standard deviation

The standard deviation is the positive square root of the variance. Therefore the standard deviations for a population and a sample are

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2} \quad\text{and}\quad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

respectively.

1.1.4 The use of calculators

Simple summary statistics such as the mean and standard deviation of a sample of univariate data can be calculated by hand. However, this is often tedious and error-prone, especially for a large sample. To avoid this, it is useful to use a scientific calculator to obtain summary statistics such as the sample mean $\bar{x}$ and the sample standard deviation $s$ of a set of numbers.

The following example uses a Casio fx 570MS. Consult your calculator's instruction manual if it does not follow the same key sequences.

(1) Clear screen

Press Shift, press CLR, choose 1 (for clear screen, Scl), press =, press AC.

(2) Choosing SD mode

Press MODE, MODE, choose 1 (for standard deviation, SD), press =. (Note: SD should appear on the display screen.)

(3) Entering data, e.g. 1, 2, 3, 4

Press 1, press M+.

Press 2, press M+.

Press 3, press M+.

Press 4, press M+.

Shift 2, choose 1, gives the sample mean $\bar{x} = 2.5$.

Shift 2, choose 3, gives the sample standard deviation $s = 1.29$.

Shift 1, choose 1, gives $\sum x^2 = 30$.

Shift 1, choose 3, gives $n = 4$.

Example 1

___________________________________________________________________________

In a crash test, cars were tested to determine what impact speed was required to obtain bumper damage. The following data show the speeds (in km/h) of a sample of 10 cars. Find the mean, median, mode, range, variance and standard deviation for the cars manually, using the formulas. Then check whether you get the same mean and standard deviation using your calculator.

98, 101, 114, 90, 103, 93, 98, 105, 119, 89

Solution

$$\text{Mean} = \frac{1}{10}\sum_{i=1}^{10} x_i = \frac{1010}{10} = 101$$

To find the median, we first arrange the observations in ascending or descending order:

89 90 93 98 98 101 103 105 114 119

Since the number of observations is even, the median is the average of the two middle values:

$$\text{Median} = \frac{98 + 101}{2} = 99.5$$

Mode = 98, since it has the highest frequency, i.e. it appears most often in the data set.

Range = 119 - 89 = 30

As the data are taken from a sample, we calculate the sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{10}\left(x_i - \bar{x}\right)^2 = \frac{1}{9}\sum_{i=1}^{10}\left(x_i - 101\right)^2 = \frac{860}{9} = 95.56$$

$$s = \sqrt{95.56} = 9.775$$

___________________________________________________________________________
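The hand calculation in the crash-test example can be cross-checked with a short script. This is only a sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the text.

```python
import statistics

# Impact speeds (km/h) of the 10 sample cars from the crash-test example.
speeds = [98, 101, 114, 90, 103, 93, 98, 105, 119, 89]

mean = statistics.mean(speeds)            # sum of observations / number of observations
median = statistics.median(speeds)        # average of the two middle values when n is even
mode = statistics.mode(speeds)            # most frequent observation
data_range = max(speeds) - min(speeds)    # largest minus smallest observation
sample_var = statistics.variance(speeds)  # sample variance, divides by n - 1
sample_sd = statistics.stdev(speeds)      # square root of the sample variance

print(mean, median, mode, data_range)
print(round(sample_var, 2), round(sample_sd, 3))
```

Note that `statistics.variance` and `statistics.stdev` use the sample formulas (divisor n - 1); the population versions are `statistics.pvariance` and `statistics.pstdev`.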

1.1.5 Types of Plots

Data can be summarized not only numerically, using a measure of central tendency and a measure of dispersion, but also graphically, which may give an immediate idea of some characteristics of the data such as its distribution and skewness.

A suitable graphical summary for quantitative data is a histogram or a boxplot. For qualitative data, one can use a pie chart, bar chart or Pareto chart. In addition, one can use a scatter plot to summarize graphically the relationship between two quantitative variables.

1.2

Probability

In common usage, the word probability means the chance that a particular event will occur. In

statistics, probability is a numerical measure of the likelihood of the event. Before we go

further, it is better for us to understand a few terms that are connected to probability

1.2.1 Basic notation and definition

(a) Outcome

An outcome is a result of an experiment or trial.

(b) Sample space

A sample space is a set that contains all possible outcomes of an experiment as its elements. We usually denote the sample space by S. For example, a trial of tossing a die leads to S = {1, 2, 3, 4, 5, 6}.

(c) Event

An event is a subset of a sample space. Let an event A be defined as getting an odd number when tossing a die. Then A = {1, 3, 5}, which is a subset of the sample space S = {1, 2, 3, 4, 5, 6}.

1.2.2 Classical definition of probability

Classical probability uses the sample space to determine the numerical probability that an event will occur. It is also called theoretical probability. Let $S$ be a sample space and $E$ be an event, which is a subset of $S$. The probability of event $E$ occurring is

$$P(E) = \frac{\text{number of elements in } E}{\text{number of elements in } S} = \frac{n(E)}{n(S)}$$

This holds only if all outcomes are equally likely to occur.

There are some basic rules about probability:

(i) Any probability assigned must be a nonnegative real number between 0 and 1. Since it reflects the chance that an event occurs, a probability of 0 indicates that the event will never occur, while a probability of 1 means that the event is certain to occur. Therefore,

$$0 \le P(E) \le 1$$

(ii) The probability of the sample space is always unity, i.e. $P(S) = 1$.

(iii) The probability that an event does not occur is one minus the probability that it does occur. Therefore, if $E'$ is the complement of $E$, then

$$P(E') = 1 - P(E)$$

(iv) If $E_1, E_2, \ldots, E_n$ are mutually exclusive events, then

$$P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$$

i.e., $P(E_1 \cup E_2 \cup \cdots \cup E_n) = P(E_1) + P(E_2) + \cdots + P(E_n)$.

Example 2

___________________________________________________________________________

In an experiment, a box containing 5 green bulbs, 6 blue bulbs and 4 white bulbs is used. A bulb is chosen at random. What is the probability that (i) a white bulb, (ii) a non-white bulb is chosen?

Solution

The number of bulbs in the box is 15, so $n(S) = 15$.

Suppose event A is "the bulb obtained is white". The number of white bulbs in the box is 4, so $n(A) = 4$.

Hence,

$$P(\text{getting a white bulb}) = P(A) = \frac{4}{15}$$

and

$$P(\text{not getting a white bulb}) = P(A') = 1 - P(A) = 1 - \frac{4}{15} = \frac{11}{15}$$

1.2.3 Mutually exclusive event

When two events, say A and B, cannot occur at the same time, we call them mutually exclusive or disjoint events. The probability of both occurring at the same time is 0:

$$P(A \cap B) = 0$$

1.2.4 Additive rule of probability

The additive rule of probability can be used to determine the probability that event A occurs, or event B occurs, or both occur, written $A \cup B$. The general additive rule is

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

To explain this rule: when A and B are not mutually exclusive, there is an overlap or intersection between A and B. When we add $P(A)$ and $P(B)$, the probability of the intersection, $P(A \cap B)$, is therefore counted twice. To compensate for that double counting, the intersection needs to be subtracted once.

When A and B are mutually exclusive, $P(A \cap B) = 0$, and the additive rule becomes

$$P(A \cup B) = P(A) + P(B)$$

Example 3

_________________________________________________________________________

In a group of 30 engineering students, 4 out of the 7 women and 8 out of the 23 men wear

spectacles. What is the probability that a person chosen at random from the group is a woman

or someone who wears spectacles?

Solution

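Let W be the event "the person chosen is a woman" and S be the event "the person chosen wears spectacles".

We have

$$P(W) = \frac{7}{30}, \qquad P(S) = \frac{12}{30} \qquad\text{and}\qquad P(W \text{ and } S) = P(W \cap S) = \frac{4}{30}.$$

Thus,

$$P(W \text{ or } S) = P(W \cup S) = P(W) + P(S) - P(W \cap S) = \frac{7}{30} + \frac{12}{30} - \frac{4}{30} = 0.5$$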

___________________________________________________________________________
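As a quick check of the additive rule, the spectacles example can be reproduced with exact fractions. The sketch below uses Python's `fractions` module; the variable names are illustrative only.

```python
from fractions import Fraction

total = 30                      # students in the group
p_w = Fraction(7, total)        # P(W): the person chosen is a woman
p_s = Fraction(4 + 8, total)    # P(S): wears spectacles (4 women + 8 men)
p_w_and_s = Fraction(4, total)  # P(W and S): women who wear spectacles

# General additive rule: subtract the intersection once to undo the double counting.
p_w_or_s = p_w + p_s - p_w_and_s
print(p_w_or_s, float(p_w_or_s))
```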

1.2.5 Conditional probability

The probability of an event occurring given that another event has already occurred is called a conditional probability. The symbol $P(A \mid B)$ denotes the probability that event A will occur given that event B has occurred. The formula is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

where $P(A \cap B)$ is the probability that event A and event B both occur and $P(B)$ is the probability that event B occurs.

These probabilities are also referred to as Bayesian probabilities, named after the probability theorist Thomas Bayes (1702-61).

Bayes' theorem gives a general conditional probability formula. If $A_1, A_2, \ldots, A_n$ are mutually exclusive and exhaustive events and $P(B) > 0$, then

$$P(A_k \mid B) = \frac{P(A_k)\,P(B \mid A_k)}{\sum_{i=1}^{n} P(A_i)\,P(B \mid A_i)}$$

__________________________________________________________________________

Example 4

A quality control officer inspects an assembled product from machine A by randomly selecting one of its components from the assembly line. The probability that a defective component is found is 35%. If a defective component is found, the probability that machine A breaks down an hour after the officer's inspection is 0.64. On the other hand, if a non-defective component is found, the probability that machine A breaks down an hour after the officer's inspection is just 0.28.

(a) Find the probability that machine A breaks down an hour after the officer's inspection.

(b) If machine A breaks down an hour after inspection, what is the probability that a defective component was found earlier?

___________________________________________________________________________

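Solution:

P(Defective) = $P(D) = 0.35$

P(Breaks down | Defective) = $P(B \mid D) = 0.64$

P(Breaks down | Not defective) = $P(B \mid D') = 0.28$

(a) P(Breaks down) = $P(B) = P(D)P(B \mid D) + P(D')P(B \mid D') = 0.35(0.64) + 0.65(0.28) = 0.406$

(b) P(Defective | Breaks down) = $P(D \mid B) = \dfrac{P(D \cap B)}{P(B)} = \dfrac{0.64 \times 0.35}{0.406} = 0.552$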

___________________________________________________________________________
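The two steps of Example 4 (total probability, then Bayes' theorem) can be sketched in a few lines of Python. The variable names are illustrative and not part of the text.

```python
p_d = 0.35              # P(D): a defective component is found
p_b_given_d = 0.64      # P(B | D): breakdown given a defective component
p_b_given_not_d = 0.28  # P(B | D'): breakdown given a non-defective component

# (a) Total probability: weight each conditional probability by its case.
p_b = p_d * p_b_given_d + (1 - p_d) * p_b_given_not_d

# (b) Bayes' theorem: reverse the conditioning.
p_d_given_b = p_d * p_b_given_d / p_b

print(round(p_b, 3), round(p_d_given_b, 3))
```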

1.2.6 Multiplication rule of probability

The multiplication rule gives the probability that two events, A and B, both occur. It follows from the definition of conditional probability and is often written as follows, using set notation:

$$P(A \cap B) = P(A \mid B)\,P(B)$$

or

$$P(A \cap B) = P(B \mid A)\,P(A)$$

where

$P(A)$ is the probability that event A occurs,

$P(B)$ is the probability that event B occurs,

$P(A \cap B)$ is the probability that event A and event B both occur,

$P(A \mid B)$ is the probability that event A occurs given that event B has already occurred,

and $P(B \mid A)$ is the probability that event B occurs given that event A has already occurred.

The multiplication rule is easily understood from a tree diagram. In a tree diagram, (i) the branches represent the possible outcomes of a trial, and (ii) the probabilities on the branches leaving any one node sum to 1.

_______________________________________________________________________

Example 5

All raw components of a certain product must pass two production processes to become a finished product. The probability that a raw component passes the first production process is 0.72. The probability that the component passes the second production process after it passes the first is 0.8. What is the probability that a raw component becomes a finished product?

Solution

Let A be "a component passes the first production process" and B be "a component passes the second production process". Then,

P(finished product)

= P(component passes both production processes)

= $P(A \cap B)$

= $P(B \mid A)\,P(A)$

= 0.8(0.72)

= 0.576

___________________________________________________________________________

1.2.7 Independence

Two events A and B are independent if and only if

$$P(A \cap B) = P(A)\,P(B).$$

__________________________________________________________________________

Example 6

Two marbles are drawn (without replacement) from a bag containing 4 red and 6 blue marbles.

(a) What is the probability that both of them are blue?

(b) What is the probability of getting one red and one blue marble?

Solution

Let R represent a red marble and B represent a blue marble.

(a) $P(B \text{ and } B) = \dfrac{6}{10} \times \dfrac{5}{9} = \dfrac{1}{3}$

(b) $P(R \text{ and } B) = P(R \cap B) + P(B \cap R) = \dfrac{4}{10} \times \dfrac{6}{9} + \dfrac{6}{10} \times \dfrac{4}{9} = \dfrac{8}{15}$
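The marble probabilities can also be verified by brute-force enumeration of every ordered pair of draws, which is a direct application of the classical definition. This is a sketch; the helper `prob` is a hypothetical name introduced only for illustration.

```python
from fractions import Fraction
from itertools import permutations

bag = ["R"] * 4 + ["B"] * 6  # 4 red and 6 blue marbles

# All ordered pairs of distinct marbles (drawing without replacement): 10 * 9 = 90.
draws = list(permutations(range(len(bag)), 2))

def prob(event):
    """Classical probability: favourable outcomes / equally likely outcomes."""
    favourable = sum(1 for i, j in draws if event(bag[i], bag[j]))
    return Fraction(favourable, len(draws))

p_both_blue = prob(lambda first, second: first == "B" and second == "B")
p_one_each = prob(lambda first, second: first != second)

print(p_both_blue, p_one_each)
```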

EXERCISE

A motor company has 18 used cars, 11 of which are accident-free. For an accident-free car, the probability that the alarm system is not functioning is 0.3; for a car that is not accident-free, the probability that the alarm system is not functioning is 0.6.

(a) Mr. Ahmad wants to buy 2 used cars for his children. Find the probability that both cars are accident-free.

(b) Mr. Osman also wants to buy 2 used cars. What is the probability that only one of them is not accident-free?

(c) What is the probability that Miss Ani buys a car that is accident-free and its alarm

system is working?

(d) Ali wants to buy a used car. What is the probability that its alarm system is not

functioning?

(e) The alarm system for a used car bought by Madam Sheely is not functioning. What is

the probability that it is accident-free?

___________________________________________________________________________

1.3

Random Variables

A random variable, usually written as X , is a variable whose possible values are numerical

outcomes of a random phenomenon. There are two types of random variables, discrete and

continuous.

1.3.1 Discrete random variable

A discrete random variable is one which may take on only a countable number of distinct values, such as the number of children in a family, the number of goals scored in football games and the number of defective bulbs in a box.

The probability distribution of a discrete random variable (sometimes called the probability mass function) is a list of probabilities associated with each of its possible values. The probabilities $p_i$ must satisfy

(a) $0 \le p_i \le 1$, and

(b) $\sum_i p_i = 1$.

___________________________________________________________________________

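Example 7

Show that $P(X = x) = \dfrac{x^2}{30}$ is the probability mass function for X, for $x = 0, 1, 2, 3, 4$.

Solution:

We need to show that $0 \le p_i \le 1$ and $\sum_i p_i = 1$.

Now $P(0) = 0$, $P(1) = \dfrac{1}{30}$, $P(2) = \dfrac{4}{30}$, $P(3) = \dfrac{9}{30}$ and $P(4) = \dfrac{16}{30}$, each of which lies between 0 and 1.

Now, $\sum_x P(X = x) = 0 + \dfrac{1}{30} + \dfrac{4}{30} + \dfrac{9}{30} + \dfrac{16}{30} = 1$

Hence, it is shown that X is a discrete random variable and P(X = x) is the probability mass function for X.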

___________________________________________________________________________

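Example 8

Find the value of c if X is a discrete random variable with probability mass function

$$P(X = x) = \frac{cx}{2} \quad\text{for } x = 0, 1, 2, 3.$$

Solution

If X is a discrete random variable, then

$$\sum_x P(X = x) = 1$$

$$0 + \frac{c}{2} + \frac{2c}{2} + \frac{3c}{2} = 1$$

$$\frac{6c}{2} = 1$$

$$c = \frac{1}{3}$$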

__________________________________________________________________________
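The two conditions on a probability mass function can be checked mechanically. The sketch below verifies Example 7 and solves for the constant in Example 8 using exact fractions; the names `pmf`, `is_valid` and `c` are illustrative.

```python
from fractions import Fraction

# Example 7: P(X = x) = x^2 / 30 for x = 0, 1, 2, 3, 4.
pmf = {x: Fraction(x * x, 30) for x in range(5)}

# Condition (a): every probability lies in [0, 1]; condition (b): they sum to 1.
is_valid = all(0 <= p <= 1 for p in pmf.values()) and sum(pmf.values()) == 1

# Example 8: P(X = x) = c * x / 2 for x = 0, 1, 2, 3.
# The probabilities sum to c * (0/2 + 1/2 + 2/2 + 3/2), which must equal 1.
c = 1 / sum(Fraction(x, 2) for x in range(4))

print(is_valid, c)
```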

1.3.2 Continuous random variable

A continuous random variable can take all possible values over an interval of real numbers, such as weight, time, and height. The probability of a random variable X being in an interval [a, b] is defined as the area under a curve represented by a function $f(x)$, that is

$$P(a \le X \le b) = \int_a^b f(x)\,dx = F(b) - F(a)$$

The function $f(x)$ is called a probability density function and it satisfies the following conditions:

(a) The curve of $f(x)$ has no negative values ($f(x) \ge 0$ for all x)

(b) The total area under the curve is equal to 1

The function $F(\cdot)$ is the cumulative distribution function, which is discussed in the next section.

___________________________________________________________________________

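Example 9

Show that $f(x) = \dfrac{x^2}{21}$; $1 \le x \le 4$ is a pdf and find $P(2 \le X \le 3)$.

Solution

Since $f(x) \ge 0$ on the interval, we must show that $\int_1^4 f(x)\,dx = 1$:

$$\int_1^4 \frac{x^2}{21}\,dx = \left[\frac{x^3}{63}\right]_1^4 = \frac{64}{63} - \frac{1}{63} = \frac{63}{63} = 1 \quad\text{(shown)}$$

$$P(2 \le X \le 3) = \int_2^3 \frac{x^2}{21}\,dx = \left[\frac{x^3}{63}\right]_2^3 = \frac{27 - 8}{63} = \frac{19}{63}$$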

___________________________________________________________________________
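When an antiderivative is not at hand, the area under a density can be approximated numerically. The sketch below checks the example above with a simple midpoint rule; `integrate` is a hypothetical helper written for illustration, not a library function.

```python
def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width

def pdf(x):
    """f(x) = x^2 / 21 on [1, 4], and 0 elsewhere."""
    return x * x / 21 if 1 <= x <= 4 else 0.0

total_area = integrate(pdf, 1, 4)  # should be close to 1
p_2_to_3 = integrate(pdf, 2, 3)    # should be close to 19/63

print(round(total_area, 6), round(p_2_to_3, 6))
```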

1.3.3 Cumulative distribution function

The cumulative distribution function, denoted by $F(\cdot)$, is

$$F(x) = P(X \le x)$$

For a discrete random variable, the cumulative distribution function is the sum of the probabilities, that is

$$F(x) = P(X \le x) = \sum_{t \le x} P(X = t).$$

For a continuous random variable, the cumulative distribution function is found by integrating $f(t)$ from $-\infty$ to $x$, that is

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$

__________________________________________________________________________

Example 10

Find the cumulative distribution function of X if X is a discrete random variable having the following probability distribution:

X = x       1     2     3     4
P(X = x)    0.12  0.54  0.09  0.25

Solution

X = x       1     2     3     4
P(X = x)    0.12  0.54  0.09  0.25
F(x)        0.12  0.66  0.75  1.0

or

$$F(x) = \begin{cases} 0, & x < 1; \\ 0.12, & 1 \le x < 2; \\ 0.66, & 2 \le x < 3; \\ 0.75, & 3 \le x < 4; \\ 1, & x \ge 4. \end{cases}$$

___________________________________________________________________________
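The cumulative row of the table in Example 10 is just a running sum of the probabilities. A minimal sketch using `itertools.accumulate` follows; exact fractions are used to avoid floating-point round-off, and the function name `cdf` is illustrative.

```python
from fractions import Fraction
from itertools import accumulate

values = [1, 2, 3, 4]
probs = [Fraction("0.12"), Fraction("0.54"), Fraction("0.09"), Fraction("0.25")]

# Running totals give F at each value of X.
cumulative = list(accumulate(probs))

def cdf(x):
    """F(x) = P(X <= x): sum the probabilities of all values not exceeding x."""
    return sum((p for v, p in zip(values, probs) if v <= x), Fraction(0))

print([float(c) for c in cumulative])
print(float(cdf(2.5)))
```

The step-function form of F(x) is visible in `cdf`: any x between 2 and 3 returns the same value, 0.66.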

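Example 11

Find the cumulative distribution function if the probability density function for X is

$$f(x) = \begin{cases} 0.1, & 10 \le x \le 20 \\ 0, & \text{elsewhere} \end{cases}$$

Solution

For $x < 10$, $F(x) = 0$.

For $10 \le x \le 20$,

$$F(x) = \int_{10}^{x} f(t)\,dt = \int_{10}^{x} 0.1\,dt = 0.1x - 1$$

For $x > 20$, $F(x) = 1$.

Therefore

$$F(x) = \begin{cases} 0, & x < 10 \\ 0.1x - 1, & 10 \le x \le 20 \\ 1, & x > 20 \end{cases}$$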

___________________________________________________________________________

1.3.4 Mathematical expectation

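The expected value of a random variable indicates its average or central value.

(a) The expected value of a discrete random variable X is defined by

$$E(X) = \sum_{i=1}^{n} x_i\,P(x_i)$$

(b) The expected value of a continuous random variable X with probability density function $f(x)$ is defined by

$$E(X) = \int_{-\infty}^{\infty} x\,f(x)\,dx$$

Properties of expectation:

(a) The expected value of a constant is equal to the constant itself, that is $E(k) = k$.

(b) $E(kX) = kE(X)$, where k is a constant.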

___________________________________________________________________________

1.3.5 Variance and standard deviation

The variance and standard deviation are non-negative real values which give an idea of how widely spread the values of the random variable are likely to be. When the variance is large, the observations are more scattered around the mean. The variance of a random variable X is defined as

$$\mathrm{Var}(X) = \sigma^2 = E(X^2) - \left[E(X)\right]^2$$

where $E(X)$ and $E(X^2)$ both exist and $E(X)$ is the expected value of X.

(a) The variance of a constant is equal to zero, $\mathrm{Var}(k) = 0$.

(b) $\mathrm{Var}(kX) = k^2\,\mathrm{Var}(X)$.

The standard deviation of X is $\sigma = \sqrt{\mathrm{Var}(X)}$.

___________________________________________________________________________

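Example 12

X is a random variable taking the values {1, 2, 5, 10}, with the probability function P(X = x) defined by

P(X = 1) = 0.4, P(X = 2) = 0.3 and P(X = 10) = 0.2

(a) Find P(X = 5).

(b) Evaluate the mean E(X) and the variance Var(X).

Solution

(a) Since the probabilities must sum to 1,

0.4 + 0.3 + P(X = 5) + 0.2 = 1

P(X = 5) = 0.1

(b) Mean, $E(X) = \sum_{i=1}^{4} x_i\,P(x_i) = 1(0.4) + 2(0.3) + 5(0.1) + 10(0.2) = 3.5$

$E(X^2) = \sum_{i=1}^{4} x_i^2\,P(x_i) = 1(0.4) + 4(0.3) + 25(0.1) + 100(0.2) = 24.1$

$\mathrm{Var}(X) = E(X^2) - \left[E(X)\right]^2 = 24.1 - 12.25 = 11.85$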

___________________________________________________________________________
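The discrete expectation and variance formulas translate directly into sums over the probability table. The following sketch reproduces Example 12 with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Known probabilities from Example 12; P(X = 5) is recovered from the sum-to-one rule.
pmf = {1: Fraction("0.4"), 2: Fraction("0.3"), 10: Fraction("0.2")}
pmf[5] = 1 - sum(pmf.values())

mean = sum(x * p for x, p in pmf.items())         # E(X)
mean_sq = sum(x * x * p for x, p in pmf.items())  # E(X^2)
variance = mean_sq - mean ** 2                    # Var(X) = E(X^2) - [E(X)]^2

print(float(pmf[5]), float(mean), float(mean_sq), float(variance))
```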

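Example 13

The probability density function of a random variable X is f(x), defined as follows:

$$f(x) = \begin{cases} 0.1, & 2 \le x \le 6 \\ 0.2, & 8 \le x \le 11 \\ 0, & \text{elsewhere} \end{cases}$$

Find the mean and the variance of X.

Solution

$$\text{Mean, } E(X) = \int_2^6 0.1x\,dx + \int_8^{11} 0.2x\,dx = \left[\frac{0.1x^2}{2}\right]_2^6 + \left[\frac{0.2x^2}{2}\right]_8^{11} = 1.6 + 5.7 = 7.3$$

$$E(X^2) = \int_2^6 0.1x^2\,dx + \int_8^{11} 0.2x^2\,dx = \left[\frac{0.1x^3}{3}\right]_2^6 + \left[\frac{0.2x^3}{3}\right]_8^{11} = 6.93 + 54.6 = 61.53$$

$$\mathrm{Var}(X) = E(X^2) - \left[E(X)\right]^2 = 61.53 - 53.29 = 8.24$$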

__________________________________________________________________________________
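For a piecewise-constant density such as the one in Example 13, each moment reduces to a sum of elementary integrals of the form $\int_a^b c\,x^k\,dx = c\,(b^{k+1} - a^{k+1})/(k+1)$. The sketch below applies this with exact fractions; `pieces` and `moment` are hypothetical names used only for illustration.

```python
from fractions import Fraction

# Pieces of the density as (height, left end, right end).
pieces = [(Fraction(1, 10), 2, 6), (Fraction(1, 5), 8, 11)]

def moment(k):
    """E(X^k) for a piecewise-constant density, via the antiderivative x^(k+1)/(k+1)."""
    return sum(c * Fraction(b ** (k + 1) - a ** (k + 1), k + 1) for c, a, b in pieces)

total_area = moment(0)            # must equal 1 for a valid density
mean = moment(1)                  # E(X)
variance = moment(2) - mean ** 2  # E(X^2) - [E(X)]^2

print(total_area, float(mean), round(float(variance), 2))
```

The exact variance is 2473/300, which rounds to the 8.24 obtained by hand.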

1.4 Some Probability Distributions

In this section, we introduce some popular distributions for discrete and continuous random variables. Popular distributions for discrete random variables include the binomial, Poisson, negative binomial, hypergeometric and geometric distributions. Special distributions for continuous random variables include the normal, exponential, Erlang, gamma, Weibull and lognormal distributions.

1.4.1

Binomial distribution

The binomial distribution is a discrete probability distribution. It is used when there are exactly two mutually exclusive outcomes of a trial, appropriately labeled "success" and "failure". The binomial distribution gives the probability of observing $x$ successes in $n$ trials, where the probability of success on a single trial is denoted by $\pi$ (note that some references use $p$). The binomial distribution assumes that $\pi$ is fixed for all trials.

In general, if a random variable X follows the binomial distribution with parameters $n$ and $\pi$, we write

$$X \sim B(n, \pi)$$

and

$$P(X = x) = {}^{n}C_{x}\,\pi^{x}(1 - \pi)^{n - x}, \qquad x = 0, 1, 2, \ldots, n$$

where ${}^{n}C_{x} = \dbinom{n}{x} = \dfrac{n!}{x!\,(n - x)!}$. The mean and variance of the distribution are $n\pi$ and $n\pi(1 - \pi)$ respectively.

We can evaluate probabilities for a binomial distribution using either a scientific calculator or a statistical table. Some statistical tables provide the cumulative binomial probabilities, $P(X \le k)$.

___________________________________________________________________________

Example 14

(a) $P(X \le 4)$

(b) $P(X = 2)$

(c) $P(X < 3)$

(d) $P(X > 1)$

(e) $P(X \ge 3)$

Solution

(a) $P(X \le 4) = 0.9976$

(b) $P(X = 2) = P(X \le 2) - P(X \le 1) = 0.8369 - 0.5282 = 0.3087$

(c) $P(X < 3) = P(X \le 2) = 0.8369$

(d) $P(X > 1) = 1 - P(X \le 1) = 1 - 0.5282 = 0.4718$

(e) $P(X \ge 3) = 1 - P(X \le 2) = 1 - 0.8369 = 0.1631$

___________________________________________________________________________

Example

15

A pewter manufacturer produces souvenir mugs. Suppose that one of the machines breaks down and 8% of the mugs are found to be defective and cannot be sold. If 23 mugs are selected at random, find the probability that

(a) exactly 3 mugs are defective.

(b) between 8 and 10 mugs are defective.

(c) at least 1 mug cannot be sold.

Solution

Let X represent the number of defective mugs, then $X \sim B(23, 0.08)$.

(a) $P(X = 3) = {}^{23}C_{3}\,(0.08)^{3}(1 - 0.08)^{23-3} = 0.1711$

(b) Find the answer yourself and compare it with your neighbour's answer.

(c) $P(X \ge 1) = 1 - P(X = 0) = 1 - {}^{23}C_{0}\,(0.08)^{0}(1 - 0.08)^{23} = 1 - 0.1469 = 0.8531$
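As a check on parts (a) and (c) of the mug example, the same binomial probabilities can be computed directly. This is a minimal sketch, assuming n = 23 and a defect probability of 0.08 as stated in the example; the helper name `pmf` is an illustrative choice.

```python
from math import comb

# Values taken from the example: 23 mugs sampled, 8% defective.
n, pi = 23, 0.08

def pmf(x):
    """P(X = x) for X ~ B(23, 0.08)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

p_exactly_3 = pmf(3)         # part (a)
p_at_least_1 = 1 - pmf(0)    # part (c): complement of "no defective mug"
print(round(p_exactly_3, 4), round(p_at_least_1, 4))
```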

___________________________________________________________________________

1.4.2 Poisson distribution

The Poisson distribution is another discrete probability distribution. When we know the mean number of events that occur in a certain time interval or continuum of space, the Poisson distribution is a suitable distribution for finding the probability of exactly x occurrences in that interval. Generally, a discrete random variable X is said to follow a Poisson distribution with parameter $\lambda$, written as $X \sim Po(\lambda)$, if

$$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!} \qquad \text{for } x = 0, 1, 2, \ldots$$

where $\lambda$ is the mean number of events in the given time interval or continuum of space. Events in non-overlapping intervals must be statistically independent. The Poisson distribution has expected value $E(X) = \lambda$ and variance $\mathrm{Var}(X) = \lambda$.

We can evaluate Poisson probabilities using either a scientific calculator or a statistical table. Some statistical tables provide the cumulative Poisson probabilities, $P(X \le k)$.

If $X_1 \sim Po(\lambda_1), X_2 \sim Po(\lambda_2), \ldots, X_n \sim Po(\lambda_n)$ are independent, then

$$X_1 + X_2 + \cdots + X_n \sim Po(\lambda_1 + \lambda_2 + \cdots + \lambda_n)$$

Example

16

If $X \sim Po(2.4)$, find

(a) $P(X \le 6)$

(b) $P(X \ge 3)$

(c) $P(X < 8)$

(d) $P(X > 1)$

(e) $P(X = 4)$

Solution

(a) $P(X \le 6) = 0.9884$

(b) $P(X \ge 3) = 1 - P(X \le 2) = 1 - 0.5697 = 0.4303$

(c) $P(X < 8) = P(X \le 7) = 0.9967$

(d) $P(X > 1) = 1 - P(X \le 1) = 1 - 0.3084 = 0.6916$

(e) $P(X = 4) = P(X \le 4) - P(X \le 3) = 0.9041 - 0.7787 = 0.1254$

___________________________________________________________________________

Example

17

On average, Good Construction can build 8 units of playground during a 2-month period.

Find the probability that

(a) Good Construction can only build 3 units of playground during a 2-month period.

(b) Good Construction can build at most 10 units of playground during a 2-month period.

(c) Good Construction can build more than 20 units of playground during a 4-month period.

Solution

Let X be the number of playgrounds Good Construction can build during a 2-month period, then $X \sim Po(8)$.

(a) $P(X = 3) = \dfrac{e^{-8}\,8^{3}}{3!} = 0.0286$

(b) $P(X \le 10) = 0.8159$

(c) Let Y be the number of playgrounds Good Construction can build during a 4-month period, then $Y \sim Po(16)$.

$P(Y > 20) = 1 - P(Y \le 20) = 1 - 0.8682 = 0.1318$
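The playground example can be reproduced numerically. The sketch below assumes the rates given in the example (8 per 2 months, hence 16 per 4 months); the function names `poisson_pmf` and `poisson_cdf` are illustrative and not from the text.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam): e^(-lam) * lam^x / x!."""
    return exp(-lam) * lam**x / factorial(x)

def poisson_cdf(k, lam):
    """Cumulative probability P(X <= k)."""
    return sum(poisson_pmf(x, lam) for x in range(k + 1))

rate_2_months = 8                   # mean count in a 2-month period
rate_4_months = 2 * rate_2_months   # the mean scales with the interval length

p_a = poisson_pmf(3, rate_2_months)         # exactly 3 units in 2 months
p_b = poisson_cdf(10, rate_2_months)        # at most 10 units in 2 months
p_c = 1 - poisson_cdf(20, rate_4_months)    # more than 20 units in 4 months
print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```

Doubling the rate for part (c) relies on the property that the Poisson mean is proportional to the length of the interval.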

___________________________________________________________________________

1.4.3

Negative binomial distribution

A negative binomial experiment is a statistical experiment that has the following properties:

Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other, a failure.

The probability of success, denoted by p, is the same on every trial.

The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.

The experiment continues until r successes are observed, where r is specified in advance.

A negative binomial random variable is the number X of repeated trials needed to produce r successes in a negative binomial experiment. The probability distribution of a negative binomial random variable is called a negative binomial distribution, which is also known as the Pascal distribution.

The negative binomial probability refers to the probability that a negative binomial experiment results in r − 1 successes after trial x − 1 and r successes after trial x.

Definition: Suppose a negative binomial experiment consists of x trials and results in r successes. If the probability of success on an individual trial is p, then the negative binomial probability is:

$$b^{*}(x; r, p) = {}^{x-1}C_{r-1}\,p^{r}(1-p)^{x-r}$$

Note that

$${}^{x-1}C_{r-1} = \frac{(x-1)!}{(r-1)!\,(x-r)!}.$$

The mean and variance for a negative binomial random variable are

$$E(X) = \frac{r}{p} \quad \text{and} \quad \mathrm{Var}(X) = \frac{r(1-p)}{p^{2}}$$

respectively.
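The negative binomial probability can be computed directly from its definition. The sketch below is illustrative only: the function name `neg_binom_pmf` and the values r = 2, p = 0.05 and x = 6 are assumptions chosen for demonstration. It also checks the mean r/p numerically by summing x times the probability over a long, truncated range of x.

```python
from math import comb

def neg_binom_pmf(x, r, p):
    """Probability that the r-th success occurs on trial x."""
    if x < r:
        return 0.0  # r successes need at least r trials
    return comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

# Hypothetical values for illustration.
r, p = 2, 0.05
prob_6th_trial = neg_binom_pmf(6, r, p)

# Truncated sum approximating E(X); the tail beyond 2000 trials is negligible here.
approx_mean = sum(x * neg_binom_pmf(x, r, p) for x in range(r, 2000))
print(prob_6th_trial, approx_mean, r / p)
```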

_________________________________________________________________________

Task

1

Suppose that a call to Sinar FM gets connected with a probability of 0.05. Assume calls are

independent,

(a) what is the probability that the 6-th call made is the second call that gets connected?

[ 0.0102]

(b) what is the probability that more than four calls have to be made before getting

connected?

[0.8145]

___________________________________________________________________________

Task

2

Assume that a sample of 15 components is tested every hour. Suppose X denotes the number of components in the sample of 15 that require modification. Components are assumed to be independent with respect to modification. If the percentage of components that require modification remains at 1.5%, what is the probability that hour 8 is the third sample at which X exceeds 1?

[ $1.6894 \times 10^{-4}$ ]

___________________________________________________________________________

1.4.4

Geometric Distribution

The geometric distribution is a special case of the negative binomial distribution. It deals with the number of trials required for a single success. Thus, the geometric distribution is a negative binomial distribution where the number of successes (r) is equal to 1.

Definition: Suppose a negative binomial experiment consists of x trials and results in one success. If the probability of success on an individual trial is p, then the geometric probability is:

$$P(x; p) = p\,(1-p)^{x-1}$$

for $x = 1, 2, 3, \ldots$ and $0 < p \le 1$. Setting r = 1 in the negative binomial results, the mean and variance are

$$E(X) = \frac{1}{p} \quad \text{and} \quad \mathrm{Var}(X) = \frac{1-p}{p^{2}}$$

respectively.
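A geometric probability is a one-line calculation. The following minimal sketch uses an assumed success probability of 0.05 and asks for the first success on the 10th trial; the function name `geom_pmf` and these values are illustrative.

```python
def geom_pmf(x, p):
    """Probability that the first success occurs on trial x (x = 1, 2, 3, ...)."""
    return p * (1 - p)**(x - 1)

# Illustrative values: success probability 0.05, first success on trial 10.
p = 0.05
first_on_10th = geom_pmf(10, p)
expected_trials = 1 / p   # mean of the geometric distribution
print(round(first_on_10th, 4), expected_trials)
```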

__________________________________________________________________________

Task 3

The probability that a computer running a certain operating system crashes on any given day is 0.05. Find the probability that the computer crashes for the first time on the 10th day after the operating system is installed. Find the expected number of days the computer runs before it crashes for the first time.

[ 0.0315; 20 ]

1.4.5

Hypergeometric distribution

A hypergeometric experiment is a statistical experiment that has the following properties:

A sample of size n is randomly selected without replacement from a population of N items.

In the population, k items can be classified as successes, and N − k items can be classified as failures.

The following notation is used:

N: The number of items in the population.

k: The number of items in the population that are classified as successes.

n: The number of items in the sample.

x: The number of items in the sample that are classified as successes.

${}^{k}C_{x}$: The number of combinations of k items, taken x at a time.

P(x; N, n, k): hypergeometric probability, the probability that an n-trial hypergeometric experiment results in exactly x successes, when the population consists of N items, k of which are classified as successes.

A hypergeometric random variable is the number of successes that result from a hypergeometric experiment. The probability distribution of a hypergeometric random variable is called a hypergeometric distribution.

Definition: Suppose a population consists of N items, k of which are successes, and a random sample drawn from that population consists of n items, x of which are successes. Then the hypergeometric probability is:

$$P(x; N, n, k) = \frac{{}^{k}C_{x}\;{}^{N-k}C_{n-x}}{{}^{N}C_{n}}$$

The mean and variance are

$$E(X) = np \quad \text{and} \quad \mathrm{Var}(X) = np(1-p)\left(\frac{N-n}{N-1}\right)$$

respectively, where $p = k/N$.
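The hypergeometric probability is a ratio of combination counts, so it can be evaluated exactly with integer arithmetic. The sketch below is a minimal illustration; the function name `hypergeom_pmf` and the small population (N = 10, k = 4, n = 3) are assumptions chosen so the numbers are easy to verify by hand.

```python
from math import comb

def hypergeom_pmf(x, N, n, k):
    """P(x; N, n, k): x successes in a sample of n drawn without replacement."""
    if x < max(0, n - (N - k)) or x > min(n, k):
        return 0.0  # x is outside the feasible range
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Hypothetical small population: 10 items, 4 successes, sample of 3.
N, n, k = 10, 3, 4
probs = [hypergeom_pmf(x, N, n, k) for x in range(n + 1)]
mean = sum(x * q for x, q in enumerate(probs))

print(probs)             # probabilities for x = 0, 1, 2, 3
print(mean, n * k / N)   # both equal np with p = k/N
```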

__________________________________________________________________________

Task 4

A company employs 500 men under the age of 58. Suppose that 25% carry a marker on a

male chromosome that indicates an increased risk for high blood pressure.

a. If 20 men in the company are tested for the marker in this chromosome, what is the

probability that exactly half of them have the marker.

[0.0089 ]

b. If 15 men in the company are tested for the marker in this chromosome, what is the

probability that more than 1 has the marker?

[0.9229 ]

___________________________________________________________________________

1.4.6 Normal distribution

The normal distribution is the most important continuous distribution in statistics because normality arises naturally in many physical, biological, and social measurement situations. It is also called the Gaussian distribution, after Gauss, who derived its probability density function (pdf). The pdf of a normal random variable X is symmetric, bell-shaped and asymptotically approaches 0 as x goes to $-\infty$ or $\infty$.

A continuous random variable X with probability density function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right], \qquad -\infty < x < \infty$$

is said to be normally distributed with mean $\mu$ and variance $\sigma^{2}$, written $X \sim N(\mu, \sigma^{2})$.

Since the integration needed to find probabilities from this probability density function is nontrivial, we transform X into a standard normal variable Z, which has mean 0 and variance 1. The transformation is done using the following formula.

$$Z = \frac{X - \mu}{\sigma}, \qquad Z \sim N(0, 1)$$

We can evaluate standard normal probabilities using either a scientific calculator or a statistical table. A statistical table typically provides two types of tables associated with the standard normal distribution:

(i) a table that shows the probabilities for a standard normal distribution in the form $P(0 \le Z \le z)$, that is, the area under the standard normal curve between 0 and positive z values.

(ii) a table that shows the z values when $P(Z > z) = \alpha$, where $\alpha$ is the upper tail area of the standard normal distribution and $\alpha \le 0.5$.

Some properties of the normal distribution, for independent $X \sim N(\mu_x, \sigma_x^{2})$ and $Y \sim N(\mu_y, \sigma_y^{2})$ and a constant k:

(a) $kX \sim N(k\mu_x,\; k^{2}\sigma_x^{2})$

(b) $X + Y \sim N(\mu_x + \mu_y,\; \sigma_x^{2} + \sigma_y^{2})$

(c) $X - Y \sim N(\mu_x - \mu_y,\; \sigma_x^{2} + \sigma_y^{2})$

___________________________________________________________________________

Example

18

The lifetime of a ROAD tyre is normally distributed with mean 24000 km and standard deviation 4000 km.

(a) Find the probability that the lifetime of a ROAD tyre exceeds 27000 km.

(b) Find the probability that the lifetime of a ROAD tyre is between 22500 km and 26500 km.

(c) If the 10% of ROAD tyres with the shortest lifetimes are classed as having a low lifetime, find the maximum distance such a tyre can achieve.

Solution

Let X represent the lifetime of a ROAD tyre, then $X \sim N(24000, 4000^{2})$.

(a) $P(X > 27000) = P\left(Z > \dfrac{27000 - 24000}{4000}\right) = P(Z > 0.75) = 0.5 - 0.2734 = 0.2266$

(b) $P(22500 < X < 26500) = P\left(\dfrac{22500 - 24000}{4000} < Z < \dfrac{26500 - 24000}{4000}\right) = P(-0.375 < Z < 0.625) = 0.2357 + 0.148 = 0.3837$

(c) Let x be the maximum distance specified, then the question implies $P(X < x) = 0.1$, which is equivalent to $P(Z < -z_{0.1}) = 0.1$. From the table, $z_{0.1} = 1.2816$.

Thus, $-1.2816 = \dfrac{x - 24000}{4000}$

Hence, x = 24000 − 1.2816(4000)

x = 18873.6 km.
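The tyre example can be checked without a table using `statistics.NormalDist` from Python's standard library (Python 3.8 or later). This is a sketch under the example's stated parameters; the variable names are illustrative. Part (b) comes out near 0.380 rather than 0.3837 because the hand solution rounds the z-values to two decimals before using the table.

```python
from statistics import NormalDist

# Lifetime model from the example: mean 24000 km, standard deviation 4000 km.
lifetime = NormalDist(mu=24000, sigma=4000)

p_a = 1 - lifetime.cdf(27000)                    # P(X > 27000)
p_b = lifetime.cdf(26500) - lifetime.cdf(22500)  # P(22500 < X < 26500)
x_c = lifetime.inv_cdf(0.10)                     # distance below which 10% of tyres fall

print(round(p_a, 4), round(p_b, 4), round(x_c, 1))
```

`inv_cdf` performs the reverse table lookup of part (c) directly on the original scale, so no separate standardisation step is needed.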

___________________________________________________________________________


Example

19

The Cooper test time for a football player from Team A is normally distributed with mean 660 seconds and standard deviation 45 seconds. The Cooper test time for a football player from Team B is normally distributed with mean 690 seconds and standard deviation 25 seconds. A player is selected at random.

(a) What is the probability that a player from Team A can complete the test in less than 700 seconds?

(b) What is the probability that the time set by a Team A player is better (shorter) than the time set by a Team B player?

Solution

Let X represent the time set by a Team A player, $X \sim N(660, 45^{2})$, and let Y represent the time set by a Team B player, $Y \sim N(690, 25^{2})$.

(a) $P(X < 700) = P\left(Z < \dfrac{700 - 660}{45}\right) = P(Z < 0.89) = 0.5 + 0.3133 = 0.8133$

(b) $P(X < Y) = P(X - Y < 0) = P\left(Z < \dfrac{0 - (660 - 690)}{\sqrt{45^{2} + 25^{2}}}\right) = P(Z < 0.58) = 0.5 + 0.2190 = 0.7190$

___________________________________________________________________________

Task 5

A manufacturer produces bathroom tiles. The tiles are sold in boxes containing 25 tiles each. The probability that a piece of tile from a box is defective is 0.1. A box is selected at random.

(a) What is the probability that

(i) no tiles are defective? [ 0.0718 ]

(ii) more than 10 tiles are defective? [ 0.0001 ]

(iii) at least 7 tiles are defective? [ 0.0095 ]

(b) An interior decorating company purchases 10 boxes of tiles from the manufacturer. What is the probability that at least two of the boxes contain perfect tiles?

[ 0.1581 ]

__________________________________________________________________________

Task 7

In the 2006 World Cup tournament, the weight of the balls used is normally distributed with mean weight 435 grams and standard deviation 10 grams. A ball is selected at random.

(a) What is the probability the weight is between 400 grams and 450 grams? [ 0.933 ]

(b) What is the probability the weight is more than 460 grams? [ 0.0062 ]

(c) If 10% of the balls are considered heavy, what is the minimum weight of a ball in that category? [ 447.816 grams ]

___________________________________________________________________________

1.4.7

Exponential distribution

The exponential distribution describes the time between successive events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate.

Definition: Suppose a random variable X denotes the distance between successive events of a Poisson process with mean rate $\lambda$; then X is an exponential random variable with parameter $\lambda$, which has the following probability density function:

$$f(x) = \lambda e^{-\lambda x}$$

for $0 \le x < \infty$ and $\lambda > 0$. The parameter $\lambda$ is also called a rate parameter, whereas $1/\lambda$ is a scale parameter. This can be written as $X \sim Exp(\lambda)$. The mean and variance for X are

$$E(X) = \frac{1}{\lambda} \quad \text{and} \quad \mathrm{Var}(X) = \frac{1}{\lambda^{2}}$$

respectively. The cumulative distribution function for the exponential random variable is

$$F(x) = P(X \le x) = \begin{cases} 1 - e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases}$$

The figure below shows exponential probability density functions with different $\lambda$ values. It can be seen from the figure that all the pdfs are monotonically decreasing.

[Figure: exponential probability density functions for $\lambda$ = 0.5, 1 and 1.5]

___________________________________________________________________________

Example

20

Solution

2

2

Furthermore, $P(X > 1) = 1 - P(X \le 1) = 1 - \left[1 - \exp(-2(1))\right] = 0.1353$

_________________________________________________________________________

Task 8

The time between phone calls received by a telephonist is exponentially distributed with a

mean of 10 minutes.

a. What is the probability that there are no calls in one hour?

[Ans: 0.0025 ]

b. What is the probability that there are not more than four calls within one hour? [ 0.2851]

c. Determine x such that the probability that there are no calls within x hours is 0.02

[39.12 minute]

__________________________________________________________________________

An important property of the exponential distribution is that it is memoryless, which means that if a random variable X is exponentially distributed, its conditional probability satisfies

$$P(X > x_1 + x_2 \mid X > x_1) = P(X > x_2) \qquad \text{for all } x_1, x_2 \ge 0.$$
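The memoryless property follows from the survival function $P(X > x) = e^{-\lambda x}$, since $e^{-\lambda(x_1 + x_2)} / e^{-\lambda x_1} = e^{-\lambda x_2}$. The short sketch below verifies this numerically; the rate 2.0 and the times 0.7 and 1.3 are arbitrary illustrative values, and `survival` is a hypothetical helper name.

```python
from math import exp, isclose

def survival(x, lam):
    """P(X > x) for an exponential random variable with rate lam."""
    return exp(-lam * x)

# Arbitrary illustrative values.
lam, x1, x2 = 2.0, 0.7, 1.3

# P(X > x1 + x2 | X > x1) = P(X > x1 + x2) / P(X > x1)
conditional = survival(x1 + x2, lam) / survival(x1, lam)
unconditional = survival(x2, lam)

print(conditional, unconditional, isclose(conditional, unconditional))
```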

________________________________________________________________________

Task 9

The number of hits on a website follows a Poisson process with a rate of four per minute.

a. What is the probability that more than two minutes go by without a hit?

[ $3.35 \times 10^{-4}$ ]

b. If two minutes have gone by without a hit, what is the probability that a hit will occur in

the next minute?

[ 0.9817]

___________________________________________________________________________

1.4.8

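Other distributions for continuous random variables include the Erlang, gamma, Weibull and lognormal distributions. Unlike the normal distribution, these distributions assume that the variables are strictly non-negative. The probability density functions for these distributions are listed below:

1. Erlang

$$f(x) = \frac{\lambda^{r} x^{r-1} e^{-\lambda x}}{(r-1)!} \qquad \text{for } x > 0 \text{ and } r = 1, 2, \ldots$$

Note: If r = 1, then Erlang is simply an exponential distribution.

$$E(X) = \frac{r}{\lambda}, \qquad \mathrm{Var}(X) = \frac{r}{\lambda^{2}}$$

2. Gamma

$$f(x) = \frac{\lambda^{r} x^{r-1} e^{-\lambda x}}{\Gamma(r)} \qquad \text{for } x > 0 \text{ and } r > 0$$

Note: If r is an integer, then Gamma is simply an Erlang distribution.

$$E(X) = \frac{r}{\lambda}, \qquad \mathrm{Var}(X) = \frac{r}{\lambda^{2}}$$

3. Weibull

$$f(x) = \frac{\beta}{\delta}\left(\frac{x}{\delta}\right)^{\beta - 1} \exp\left[-\left(\frac{x}{\delta}\right)^{\beta}\right] \qquad \text{for } x > 0,\ \beta > 0 \text{ and } \delta > 0$$

Note: $\beta$ and $\delta$ are the shape and scale parameters. If $\beta = 1$, then Weibull is simply an exponential distribution with $\lambda = 1/\delta$.

$$E(X) = \delta\,\Gamma\left(1 + \frac{1}{\beta}\right), \qquad \mathrm{Var}(X) = \delta^{2}\,\Gamma\left(1 + \frac{2}{\beta}\right) - \delta^{2}\left[\Gamma\left(1 + \frac{1}{\beta}\right)\right]^{2}$$

4. Lognormal

$$f(x) = \frac{1}{x\,\omega\sqrt{2\pi}} \exp\left[-\frac{(\ln x - \theta)^{2}}{2\omega^{2}}\right] \qquad \text{for } 0 < x < \infty$$

where $X = e^{W}$ and $W \sim N(\theta, \omega^{2})$.

$$E(X) = e^{\theta + \omega^{2}/2}, \qquad \mathrm{Var}(X) = e^{2\theta + \omega^{2}}\left(e^{\omega^{2}} - 1\right)$$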

The shape of the above distributions for varying values of their parameters can be

investigated via computer software such as Matlab. Further information and examples for

these distributions can be found from Montgomery & Runger (2006).

__________________________________________________________________________

Exercise 1

1. Identify whether the following items are constants or variables. If it is a variable, determine

whether it is quantitative or qualitative, discrete or continuous

(a) The number of days in March.

(b) IC numbers for Malaysian citizen.

(c) The time taken to write an essay.

(d) The type of cars used by employees of a company.

(e) Temperature for each day in a month.

(f) Minimum age to take a driving licence

(g) The lengths of a specific type of bricks.

(h) The compressive strengths of 100 aluminium-lithium alloy specimens.

(i) The number of students registering for Engineering Statistics in the last five academic years.

(j) The breakdown time of an insulating fluid between electrodes.

(k) The grades achieved by engineering students in UTM.

2. A motor company has 18 used cars and 11 of them are accident-free. For the accident-free

car, the probability alarm system is not functioning is 0.3 and if the car was not accident-free,

the probability alarm system is not functioning is 0.6.

(a) Mr. Ahmad wants to buy 2 used cars for his children. Find the probability both cars are

accident-free.

(b) Mr. Osman also wants to buy 2 used cars. What is the probability that only one of them

is not accident-free.

(c) What is the probability that Miss Ani buys a car that is accident-free and its alarm

system is working?

(d) Ali wants to buy a used car. What is the probability that its alarm system is not

functioning?

(e) The alarm system for a used car bought by Madam Sheely is not functioning. What is

the probability that it is accident-free?

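3. The probability density function of a continuous random variable X is

$$f(x) = \begin{cases} k, & 0 \le x \le 1 \\ \dfrac{x}{4}, & 1 < x \le 2 \\ 0, & \text{elsewhere} \end{cases}$$

(a) Show that $k = \dfrac{5}{8}$.

(c) Find $P(1/2 \le X \le 3/2)$.

(d) Find the expected value and variance for X.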

independent of each other. The probability that any integrated circuit is defective is 0.03. The

product operates only when all integrated circuits work properly. What is the probability that

the product operates?

5. On average, IT Shop can sell 10 notebooks in 2 days. What is the probability

they can sell

(a) 13 notebooks in 2 days?

(b) at least 17 notebooks in 3 days?

(c) not more than 19 notebooks in 4 days?

and standard deviation 2 kg. The weight of a standard TV having the same screen width is

also normally distributed but with mean 31 kg and standard deviation 5 kg. What is the

probability that

(a) the weight of a flat screen LCD TV is between 13 kg and 16 kg?

(b) the weight of 2 standard TVs is greater than 65 kg?

(c) the weight of 2 LCD TVs is greater than the weight of a single standard TV?

14, 35, 15, 27, 23, 18, 50, 33, 36, 48, 25, 19, 29, 22, 42, 15

Use a scientific calculator to determine the mean and variance for the above data. Now assume that the data are sample data selected at random. Find the new mean and variance for the data. Comment on your answers.

8. 25 pieces of computer chips were tested and the proportion of any chip being

contaminated is 0.15. Find the probability that

(b) at least 20 chips are not contaminated.

(c) between 4 and 8 chips are contaminated.

(d) more than 2 chips are not contaminated.

A supplier delivers ten boxes, each containing 25 chips, to a customer. What is the probability

that the customer will receive at least two boxes containing at most two contaminated chips

each?

mean of RM2000 and a variance of 2500 RM squared. What is the probability that

(a) the production yield on any particular day exceeds RM2500.

(b) the production yield is less than RM1900 on each of the next two days, assuming the yields on different days are independent random variables.

(c) 5% of a day's production yield is considered profitable revenue to the company. What is the daily minimum yield to be considered profitable?

Chapter 2

Sampling Distributions

Learning Objectives:

At the end of this chapter, students should be able to

(a) understand the concepts of sample mean and proportion.

(b) understand and use the central limit theorem.

(c) compute and interpret the sample mean and proportion.

(d) explain the important role of normal distributions as sampling distributions.

(e) calculate the probabilities associated with sample mean and sample proportion.

2.1

Introduction

unless the population is small or a nationwide census is available. The population mean, ,

and standard deviation, , are examples of population parameters. Given the impracticality of

__

independent samples from the same population.

When we measure the entire population and calculate its mean or variance, we refer to this quantity as a parameter of the population. If we measure a sample instead, the mean or variance is referred to as a statistic. There are many statistics that we can use, including the mean, median, mode, standard deviation and so on. One reason we sample is to obtain an estimate of an unknown parameter of the population we sample from.

If we choose a sample of size n from a population and measure a statistic (mean, standard deviation, etc.), the sampling distribution is the resulting probability distribution of that statistic. For example, if the statistic is the sample mean, $\bar{x}$, of samples of size eight, then the sampling distribution is the probability distribution of the sample mean, $\bar{X}$. It lists the various values that $\bar{X}$ can take and the probability of each value of $\bar{X}$.

2.2

A very important and useful concept in statistics is the Central Limit Theorem (CLT).

The CLT says that if a large enough sample was drawn from a population, then the

distribution of the sample mean is approximately normal, regardless of the type of

distribution for the population the sample was drawn from.

The Central Limit Theorem states that

1. the mean of the sampling distribution of means is the same as the population mean,

2. the variance of the sampling distribution of means is the same as the population variance

divided by the size of the sample, and

3. if the population from which the sample is taken is normally distributed, then the sampling distribution of means will also be normal. If the population is not normally distributed, then the sampling distribution of means will be approximately normally distributed as the sample size gets larger, usually when $n \ge 30$.
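The three statements above can be illustrated by simulation. The sketch below repeatedly draws samples of size n = 30 from a clearly non-normal population and examines the resulting sample means. The choice of an exponential population with mean 1 and variance 1, the number of repetitions and the seed are all assumptions made for illustration.

```python
import random
from statistics import pvariance

random.seed(0)   # fixed seed so the run is repeatable

n = 30           # size of each sample
reps = 5000      # number of samples drawn

# Population: exponential with rate 1, so population mean = 1 and variance = 1.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

mean_of_means = sum(sample_means) / reps
var_of_means = pvariance(sample_means)

print(mean_of_means)          # close to the population mean, 1
print(var_of_means, 1 / n)    # close to population variance / n
```

A histogram of `sample_means` would look roughly bell-shaped even though the exponential population itself is strongly skewed.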

2.3

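The sample mean, $\bar{X}$, is the best estimator of the population mean, $\mu$. Suppose we have a set of independent random variables $X_1, X_2, \ldots, X_n$ where $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^{2}$. The sample mean is

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

and the sample variance is

$$S^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^{2}$$

The probability distribution of the sample mean, $\bar{X}$, is called the sampling distribution of $\bar{X}$.

The mean and variance of $\bar{X}$ are denoted by $\mu_{\bar{X}}$ and $\sigma^{2}_{\bar{X}}$:

$$\mu_{\bar{X}} = E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}(n\mu) = \mu$$

$$\sigma^{2}_{\bar{X}} = \mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^{2}}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{1}{n^{2}}\left(n\sigma^{2}\right) = \frac{\sigma^{2}}{n}$$

The sampling distribution for the sample mean is expressed as $\bar{X} \sim N\left(\mu, \dfrac{\sigma^{2}}{n}\right)$. The standardized variable

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

follows a standard normal distribution. If the population is normally distributed, the sampling distribution of the mean is normal regardless of the sample size. If the population distribution is unknown or not normal, then by the central limit theorem the sampling distribution of the sample mean is approximately normal when $n \ge 30$.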

___________________________________________________________________________

Example 1

A certain type of thread is manufactured with a mean tensile strength of 77.3 kg and a standard deviation of 6.4 kg. Assuming that the tensile strength follows a normal distribution, find the probability that the mean tensile strength of a random sample of 40 such threads is more than 75 kg.

Solution

Now n = 40 and $\bar{X} \sim N\left(77.3, \dfrac{6.4^{2}}{40}\right)$, therefore

$$P(\bar{X} > 75) = P\left(Z > \frac{75 - 77.3}{\sqrt{6.4^{2}/40}}\right) = P(Z > -2.27) = 0.5 + 0.4884 = 0.9884$$
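The thread example can be verified by building the sampling distribution of the mean directly. This is a sketch assuming the example's values (mean 77.3 kg, standard deviation 6.4 kg, sample size 40); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 77.3, 6.4, 40

# Sampling distribution of the mean: N(mu, sigma^2 / n),
# so its standard deviation (the standard error) is sigma / sqrt(n).
xbar = NormalDist(mu, sigma / sqrt(n))

p_more_than_75 = 1 - xbar.cdf(75)
print(round(p_more_than_75, 4))
```

Note that `NormalDist` takes a standard deviation as its second argument, not a variance.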

_________________________________________________________________________

Example

The number of customers arriving per hour at a certain automobile service facility is assumed

to follow a Poisson distribution with mean 12. If a random sample of 36 hour were taken,

what is the probability that the mean number of customers in an hour is less than 10?

Solution

Given $X \sim Po(12)$, by the CLT

$$\bar{X} \sim N\left(12, \frac{12}{36}\right)$$

Therefore,

$$P(\bar{X} < 10) = P\left(Z < \frac{10 - 12}{\sqrt{12/36}}\right) = P(Z < -3.46) = 0.5 - 0.4997 = 0.0003$$

___________________________________________________________________________

Task 1

The average life of a washing machine is 12 years with a standard deviation of 2 years. Assuming that the lives of these machines follow approximately a normal distribution, find

(a) the probability that the mean life of a random sample of 12 machines is greater than 10 years.

[ 0.9997 ]

(b) the probability that the mean life of a random sample of 9 machines falls between 9.4 and 12.2 years.

[ 0.6179 ]

__________________________________________________________________________

___________________________________________________________________________

Task 2

A random sample of size 35 is taken from a population which has a binomial distribution with

the number of trials 50 and the proportion of success 0.30. What is the probability that the

sample mean is at least 13.5?

[ 0.9969 ]

__________________________________________________________________________

2.4

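Sampling Distribution of $\bar{X}_1 - \bar{X}_2$

Suppose we have two independent populations, both normally distributed. Let the first population have mean $\mu_1$ and variance $\sigma_1^{2}$ and the second population have mean $\mu_2$ and variance $\sigma_2^{2}$.

If $\bar{X}_1$ and $\bar{X}_2$ are the sample means of two independent random samples of sizes $n_1$ and $n_2$, then

$$\bar{X}_1 \sim N\left(\mu_1, \frac{\sigma_1^{2}}{n_1}\right) \quad \text{and} \quad \bar{X}_2 \sim N\left(\mu_2, \frac{\sigma_2^{2}}{n_2}\right)$$

The difference $\bar{X}_1 - \bar{X}_2$ has mean

$$\mu_{\bar{X}_1 - \bar{X}_2} = E(\bar{X}_1) - E(\bar{X}_2) = \mu_1 - \mu_2$$

and variance

$$\sigma^{2}_{\bar{X}_1 - \bar{X}_2} = \mathrm{Var}(\bar{X}_1 - \bar{X}_2) = \mathrm{Var}(\bar{X}_1) + (-1)^{2}\,\mathrm{Var}(\bar{X}_2) = \mathrm{Var}(\bar{X}_1) + \mathrm{Var}(\bar{X}_2) = \frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}$$

thus,

$$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2,\; \frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}\right)$$

with

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^{2}}{n_1} + \dfrac{\sigma_2^{2}}{n_2}}}$$

If the two populations are not normally distributed and both samples have sizes of at least 30, then by the central limit theorem the sampling distribution of $\bar{X}_1 - \bar{X}_2$ is approximately normal with the same mean and variance.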

___________________________________________________________________________

Example

3

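A random sample of size 18 is selected from a normal population with a mean of 85 and a standard deviation of 8. A second random sample of size 10 is taken from another normal population with mean 80 and standard deviation 5. Let $\bar{X}_1$ and $\bar{X}_2$ be the two sample means. Find the probability

(a) that $\bar{X}_1$ exceeds $\bar{X}_2$.

(b) that the difference between the sample means is less than 6.

(c) that the difference between the means is more than 4.

Solution

We know that $\bar{X}_1 \sim N\left(85, \dfrac{8^{2}}{18}\right)$ and $\bar{X}_2 \sim N\left(80, \dfrac{5^{2}}{10}\right)$, therefore

$$\bar{X}_1 - \bar{X}_2 \sim N\left(85 - 80,\; \frac{8^{2}}{18} + \frac{5^{2}}{10}\right), \quad \text{that is,} \quad \bar{X}_1 - \bar{X}_2 \sim N(5, 6.0556)$$

(a) $P(\bar{X}_1 > \bar{X}_2) = P(\bar{X}_1 - \bar{X}_2 > 0) = P\left(Z > \dfrac{0 - 5}{\sqrt{6.0556}}\right) = P(Z > -2.03) = 0.5 + P(0 < Z < 2.03) = 0.5 + 0.4788 = 0.9788$

(c) The probability that the difference between the sample means is more than 4 is

$$P\left(|\bar{X}_1 - \bar{X}_2| > 4\right) = P(\bar{X}_1 - \bar{X}_2 > 4) + P(\bar{X}_1 - \bar{X}_2 < -4) = P\left(Z > \frac{4 - 5}{\sqrt{6.0556}}\right) + P\left(Z < \frac{-4 - 5}{\sqrt{6.0556}}\right)$$

$$= P(Z > -0.4) + P(Z < -3.66) = 0.6554 + 0.0001 = 0.6555$$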

Example

4

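A random sample of size 49 is taken from a binomial distribution with n = 60 and p = 0.4. Another random sample of size 32 is taken from another binomial distribution with n = 60 and p = 0.4. Find the probability that the difference between the two sample means is less than 1.

Solution

Given $X_1 \sim B(60, 0.4)$ and $X_2 \sim B(60, 0.4)$, each population has mean 60(0.4) = 24 and variance 60(0.4)(0.6) = 14.4. Therefore, by the CLT,

$$\bar{X}_1 \sim N\left(24, \frac{14.4}{49}\right) \quad \text{and} \quad \bar{X}_2 \sim N\left(24, \frac{14.4}{32}\right)$$

Hence, $\bar{X}_1 - \bar{X}_2 \sim N\left(24 - 24,\; \dfrac{14.4}{49} + \dfrac{14.4}{32}\right)$, that is, $\bar{X}_1 - \bar{X}_2 \sim N(0, 0.7438)$.

$$P\left(|\bar{X}_1 - \bar{X}_2| < 1\right) = P(-1 < \bar{X}_1 - \bar{X}_2 < 1) = P\left(\frac{-1 - 0}{\sqrt{0.7438}} < Z < \frac{1 - 0}{\sqrt{0.7438}}\right) = P(-1.16 < Z < 1.16) = 0.7540$$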
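The calculation for the difference between two sample means can be reproduced as follows. This sketch assumes the two binomial populations of the example (60 trials, success probability 0.4) and sample sizes 49 and 32; the variable names are illustrative. The result differs from 0.7540 only in the fourth decimal because the hand solution rounds z to 1.16.

```python
from math import sqrt
from statistics import NormalDist

# Both populations are B(60, 0.4) in the example.
n_trials, pi = 60, 0.4
pop_mean = n_trials * pi               # 24
pop_var = n_trials * pi * (1 - pi)     # 14.4

n1, n2 = 49, 32                        # the two sample sizes

# Variance of the difference of sample means: sigma1^2/n1 + sigma2^2/n2.
diff_var = pop_var / n1 + pop_var / n2
diff = NormalDist(pop_mean - pop_mean, sqrt(diff_var))

p_within_1 = diff.cdf(1) - diff.cdf(-1)   # P(-1 < difference < 1)
print(round(diff_var, 4), round(p_within_1, 4))
```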

___________________________________________________________________________

Task 3

picture tubes for use in their television sets. Type A tube has mean brightness of 100 and

standard deviation of 16, while type B tube has mean brightness of 110 and standard

deviation of 14. A random sample of 25 tubes from each type is selected. What is the probability that the difference in brightness between the two sample means is at least 5.5?

[ 0.8555 ]

___________________________________________________________________________

Task 4

A random sample of size 30 is taken from a population which is distributed from a Poisson

distribution with mean 54. Another random sample of size 32 is taken from a Poisson

distribution with mean 58. What is the probability that the difference between the means is less than 2?

[ 0.1461 ]

___________________________________________________________________________

2.5

The concept of proportion is the same as the concept of probability of success in a binomial

experiment. The probability of success in a binomial experiment represents the proportion of

the sample or population that possesses a given characteristic.

The population proportion, denoted by $\pi$, is obtained by taking the ratio of the number of elements in a population with a specific characteristic to the total number of elements in the population. The sample proportion, denoted by p, gives a similar ratio for a sample.

The population and sample proportions, denoted by $\pi$ and p respectively, are calculated as

$$\pi = \frac{X}{N} \quad \text{and} \quad p = \frac{x}{n}$$

where

N = total number of elements in the population

X = number of elements in the population that possess a specific characteristic

n = total number of elements in the sample

x = number of elements in the sample that possess a specific characteristic

and $\pi$ is a proportion of successes, not the constant 3.14159.... Each sample will give a different value of p; therefore the sample proportion is a random variable, symbolized as P.

To determine the reliability of the estimator P, we need to know its sampling distribution. When samples of size n are drawn from this population, each sample contains a certain number of observations with the given characteristic. The Central Limit Theorem (CLT) tells us that the relative frequency distribution of the sample mean for any population is approximately normal for sufficiently large samples ($n \ge 30$).

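Sampling Distribution of P

1. Mean of the Sample Proportion

The mean of the sample proportion, P, is denoted by $\mu_P$ and is equal to the population proportion, $\pi$.

$$\mu_P = E(P) = E\left(\frac{X}{n}\right) = \frac{1}{n}E(X) = \frac{1}{n}(n\pi) = \pi$$

2. Variance of the Sample Proportion

The variance of the sample proportion is denoted by $\sigma_P^{2}$ and given by the formula

$$\sigma_P^{2} = \mathrm{Var}(P) = \mathrm{Var}\left(\frac{X}{n}\right) = \frac{1}{n^{2}}\mathrm{Var}(X) = \frac{1}{n^{2}}\,n\pi(1-\pi) = \frac{\pi(1-\pi)}{n}$$

The standard deviation of the sample proportion is denoted by

$$\sigma_P = \sqrt{\frac{\pi(1-\pi)}{n}}$$

Therefore the sampling distribution of P has mean $\pi$ and variance $\dfrac{\pi(1-\pi)}{n}$, written as

$$P \sim N\left(\pi, \frac{\pi(1-\pi)}{n}\right)$$

with

$$Z = \frac{P - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}}$$

A continuity correction needs to be made when a continuous curve is being used to approximate a discrete probability distribution. We add or subtract $\dfrac{1}{2n}$ as the continuity correction factor according to the form of the probability statement as follows:

(a) $P(P = p) \xrightarrow{\text{c.c.}} P\left(p - \dfrac{1}{2n} < P < p + \dfrac{1}{2n}\right)$

(b) $P(P \ge p) \xrightarrow{\text{c.c.}} P\left(P \ge p - \dfrac{1}{2n}\right)$

(c) $P(P > p) \xrightarrow{\text{c.c.}} P\left(P > p + \dfrac{1}{2n}\right)$

(d) $P(P \le p) \xrightarrow{\text{c.c.}} P\left(P \le p + \dfrac{1}{2n}\right)$

(e) $P(P < p) \xrightarrow{\text{c.c.}} P\left(P < p - \dfrac{1}{2n}\right)$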

Example 5

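A manufacturer claims that 75% of its metal rods have a diameter within the specification. A random sample of 50 metal rods is chosen. Find the probability that

(a) at least 70% of the metal rods have a diameter within the specification.

(b) between 78% and 82% of the metal rods have a diameter within the specification.

(c) more than 90% of the metal rods have a diameter within the specification.

Solution

$\pi = 0.75$ and $\dfrac{\pi(1-\pi)}{n} = \dfrac{0.75(1 - 0.75)}{50} = 0.00375$, so

$$P \sim N(0.75, 0.00375)$$

(a) The probability that at least 70% of the metal rods have a diameter within the specification is

$$P(P \ge 0.70) \xrightarrow{\text{c.c.}} P\left(P \ge 0.70 - \frac{1}{2(50)}\right) = P(P \ge 0.69) = P\left(Z \ge \frac{0.69 - 0.75}{\sqrt{0.00375}}\right) = P(Z \ge -0.98) = 0.5 + 0.3365 = 0.8365$$

(b) The probability that between 78% and 82% of the metal rods have a diameter within the specification is

$$P(0.78 < P < 0.82) \xrightarrow{\text{c.c.}} P\left(0.78 + \frac{1}{2(50)} < P < 0.82 - \frac{1}{2(50)}\right) = P(0.79 < P < 0.81)$$

$$= P\left(\frac{0.79 - 0.75}{\sqrt{0.00375}} < Z < \frac{0.81 - 0.75}{\sqrt{0.00375}}\right) = P(0.65 < Z < 0.98) = P(0 < Z < 0.98) - P(0 < Z < 0.65) = 0.3365 - 0.2422 = 0.0943$$

(c) The probability that more than 90% of the metal rods have a diameter within the specification is

$$P(P > 0.90) \xrightarrow{\text{c.c.}} P\left(P > 0.90 + \frac{1}{2(50)}\right) = P(P > 0.91) = P\left(Z > \frac{0.91 - 0.75}{\sqrt{0.00375}}\right) = P(Z > 2.61) = 0.5 - 0.4955 = 0.0045$$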
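The three parts of the metal rod example follow the same pattern: apply the continuity correction 1/(2n), then use the normal approximation for P. The sketch below assumes the example's values (population proportion 0.75, sample size 50); variable names such as `cc` and `prop` are illustrative. Small differences from the table-based answers, most visible in part (b), come from rounding the z-values to two decimals in the hand solution.

```python
from math import sqrt
from statistics import NormalDist

pi, n = 0.75, 50   # population proportion and sample size from the example

# Sampling distribution of the sample proportion: N(pi, pi(1 - pi)/n).
prop = NormalDist(pi, sqrt(pi * (1 - pi) / n))
cc = 1 / (2 * n)   # continuity correction factor

p_a = 1 - prop.cdf(0.70 - cc)                      # at least 70%
p_b = prop.cdf(0.82 - cc) - prop.cdf(0.78 + cc)    # between 78% and 82%
p_c = 1 - prop.cdf(0.90 + cc)                      # more than 90%

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```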

__________________________________________________________________________

Task 5

In a chemical plant, 30% of the pipes show signs of serious corrosion. A survey is done and a random sample of 100 pipes from the plant is selected. Find the probability that

(a) more than 35% of the sampled pipes show signs of serious corrosion.

[ 0.1151 ]

(b) from 20% to 30% of the sampled pipes show signs of serious corrosion.

[ 0.5328 ]

Task 6

From a survey, we found that 90% of automobiles are not rejected because of machine failure. A random sample of 50 automobiles is selected. What is the probability that

(a) not less than 92% of the sampled automobiles are not rejected because of machine failure?

[ 0.4052 ]

(b) between 88% and 92% of the sampled automobiles are not rejected because of machine failure?

[ 0.1896 ]

__________________________________________________________________________

Task 7

$\frac{3}{100}$ of the rubber cushions will be rejected. A manufacturer is not satisfied with this result and conducts a survey. Among a sample of 100 rubber cushions, find the probability that the

(a) proportion of rubber cushions rejected exceeds 0.04.

(b) proportion of rubber cushions rejected is not more than 0.05.

___________________________________________________________________________

2.6

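Suppose we have two binomial populations with proportions of successes $\pi_1$ and $\pi_2$, and independent random samples of size $n_1$ and $n_2$ are taken from population 1 and population 2, respectively. Then $\hat P_1$ and $\hat P_2$ are the proportions from those samples. By the CLT, provided both $n_1$ and $n_2$ are large ($n_1 \ge 30$ and $n_2 \ge 30$), the sampling distributions of $\hat P_1$ and $\hat P_2$ are

$$\hat P_1 \sim N\left(\pi_1, \frac{\pi_1(1-\pi_1)}{n_1}\right), \qquad \hat P_2 \sim N\left(\pi_2, \frac{\pi_2(1-\pi_2)}{n_2}\right)$$

From these, the sampling distribution of $\hat P_1 - \hat P_2$ can be obtained. By the Central Limit Theorem, the mean is

$$\mu_{\hat P_1 - \hat P_2} = E(\hat P_1 - \hat P_2) = E(\hat P_1) - E(\hat P_2) = \pi_1 - \pi_2$$

and, because the samples are independent, the variance is

$$\sigma^2_{\hat P_1 - \hat P_2} = \operatorname{Var}(\hat P_1) + \operatorname{Var}(\hat P_2) = \frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}$$

The sampling distribution of the difference between two proportions, $\hat P_1 - \hat P_2$, has mean $\pi_1 - \pi_2$ and variance $\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}$, written as

$$\hat P_1 - \hat P_2 \sim N\left(\pi_1 - \pi_2,\ \frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}\right)$$

with

$$Z = \frac{(\hat P_1 - \hat P_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{\pi_1(1-\pi_1)}{n_1} + \dfrac{\pi_2(1-\pi_2)}{n_2}}}$$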

___________________________________________________________________________

Example 6

Two companies, M Chip and N Chip, produce micro computer chips and supply them to company ACERA. 25% of the micro computer chips produced by Company M Chip and 20% of the micro computer chips produced by Company N Chip are defective. A random sample of 100 chips is chosen from each company. Find the probability that

(a) the sample proportion of defective micro computer chips produced by Company M Chip is greater than the sample proportion of defective micro computer chips produced by Company N Chip.

(b) the sample proportions of defective micro computer chips differ by at least 6%.

(c) the difference between the sample proportion of defective micro computer chips produced by Company M Chip and the sample proportion of defective micro computer chips produced by Company N Chip is at most 4%.

Solution

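$$\hat P_M \sim N\left(0.25, \frac{0.25(1-0.25)}{100}\right) = N(0.25,\ 0.001875)$$

$$\hat P_N \sim N\left(0.20, \frac{0.20(1-0.20)}{100}\right) = N(0.20,\ 0.0016)$$

$$\hat P_M - \hat P_N \sim N(0.05,\ 0.003475)$$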

(a) The probability that the sample proportion of defective micro computer chips produced by Company M Chip is greater than the sample proportion of defective micro computer chips produced by Company N Chip is

$$\begin{aligned}
P(\hat P_M > \hat P_N) &= P(\hat P_M - \hat P_N > 0) \\
&= P\left(Z > \frac{0 - 0.05}{\sqrt{0.003475}}\right) \\
&= P(Z > -0.85) \\
&= 0.5 + P(0 < Z < 0.85) \\
&= 0.5 + 0.3023 \\
&= 0.8023
\end{aligned}$$

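(b) The probability that the sample proportions of defective micro computer chips differ by at least 6% is

$$\begin{aligned}
P(|\hat P_M - \hat P_N| \ge 0.06) &= P(\hat P_M - \hat P_N \le -0.06) + P(\hat P_M - \hat P_N \ge 0.06) \\
&= P\left(Z \le \frac{-0.06 - 0.05}{\sqrt{0.003475}}\right) + P\left(Z \ge \frac{0.06 - 0.05}{\sqrt{0.003475}}\right) \\
&= P(Z \le -1.87) + P(Z \ge 0.17) \\
&= [0.5 - P(0 < Z < 1.87)] + [0.5 - P(0 < Z < 0.17)] \\
&= [0.5 - 0.4693] + [0.5 - 0.0675] \\
&= 0.0307 + 0.4325 \\
&= 0.4632
\end{aligned}$$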

(c) The probability that the difference between the sample proportion of defective micro computer chips produced by Company M Chip and the sample proportion of defective micro computer chips produced by Company N Chip is at most 4% is

$$\begin{aligned}
P(\hat P_M - \hat P_N \le 0.04) &= P\left(Z \le \frac{0.04 - 0.05}{\sqrt{0.003475}}\right) \\
&= P(Z \le -0.17) \\
&= 0.5 - P(0 < Z < 0.17) \\
&= 0.5 - 0.0675 \\
&= 0.4325
\end{aligned}$$
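The calculations in Example 6 can be reproduced with a short script. This is a sketch only, assuming independent samples and the normal approximation described in this section; the function name `diff_dist` is hypothetical. Small differences from the table-based answers come from rounding $z$ to two decimal places.

```python
from statistics import NormalDist

def diff_dist(pi1, n1, pi2, n2):
    """Normal approximation for P1 - P2 from two independent samples."""
    var = pi1 * (1 - pi1) / n1 + pi2 * (1 - pi2) / n2
    return NormalDist(mu=pi1 - pi2, sigma=var ** 0.5)

d = diff_dist(0.25, 100, 0.20, 100)   # M Chip minus N Chip

p_a = 1 - d.cdf(0.0)                      # P(PM - PN > 0)
p_b = d.cdf(-0.06) + (1 - d.cdf(0.06))    # P(|PM - PN| >= 0.06)
p_c = d.cdf(0.04)                         # P(PM - PN <= 0.04)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```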

________________________________________________________________________________

Example 7

A manufacturer claims that some of the electrical parts produced by two machines are defective. He says that 90 out of 1500 electrical parts produced by machine 1 and 84 out of 1200 electrical parts produced by machine 2 are defective. If random samples of 50 electrical parts produced by machine 1 and 60 electrical parts produced by machine 2 are chosen, what is the probability that

(a) the proportion of defective electrical parts produced by machine 1 is smaller than the proportion of defective electrical parts produced by machine 2?

(b) the proportion of defective electrical parts produced by machine 1 is greater than the proportion of defective electrical parts produced by machine 2?

(c) the proportions of defective electrical parts differ by less than 0.02?

Solution

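$$\pi_1 = \frac{90}{1500} = 0.06, \qquad \pi_2 = \frac{84}{1200} = 0.07$$

$$\hat P_1 \sim N\left(0.06, \frac{0.06(1-0.06)}{50}\right) = N(0.06,\ 0.001128)$$

$$\hat P_2 \sim N\left(0.07, \frac{0.07(1-0.07)}{60}\right) = N(0.07,\ 0.001085)$$

$$\hat P_1 - \hat P_2 \sim N(-0.01,\ 0.002213)$$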

(a) The probability that the proportion of defective electrical parts produced by machine 1 is smaller than the proportion of defective electrical parts produced by machine 2 is

$$\begin{aligned}
P(\hat P_1 < \hat P_2) &= P(\hat P_1 - \hat P_2 < 0) \\
&= P\left(Z < \frac{0 - (-0.01)}{\sqrt{0.002213}}\right) \\
&= P(Z < 0.21) \\
&= 0.5 + P(0 < Z < 0.21) \\
&= 0.5 + 0.0832 \\
&= 0.5832
\end{aligned}$$

(b) The probability that the proportion of defective electrical parts produced by machine 1 is greater than the proportion of defective electrical parts produced by machine 2 is

$$\begin{aligned}
P(\hat P_1 > \hat P_2) &= P(\hat P_1 - \hat P_2 > 0) \\
&= P\left(Z > \frac{0 - (-0.01)}{\sqrt{0.002213}}\right) \\
&= P(Z > 0.21) \\
&= 0.5 - P(0 < Z < 0.21) \\
&= 0.5 - 0.0832 \\
&= 0.4168
\end{aligned}$$

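(c) The probability that the proportions of defective electrical parts differ by less than 0.02 is found using $\hat P_2 - \hat P_1$, which has mean $0.01$ and variance $0.002213$:

$$\begin{aligned}
P(|\hat P_2 - \hat P_1| < 0.02) &= P(-0.02 < \hat P_2 - \hat P_1 < 0.02) \\
&= P\left(\frac{-0.02 - 0.01}{\sqrt{0.002213}} < Z < \frac{0.02 - 0.01}{\sqrt{0.002213}}\right) \\
&= P(-0.64 < Z < 0.21) \\
&= P(0 < Z < 0.64) + P(0 < Z < 0.21) \\
&= 0.2389 + 0.0832 \\
&= 0.3221
\end{aligned}$$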

___________________________________________________________________________

Task 8

A production manager claims that his two machines fail under continuous operation and produce defective products. An investigation was done and the claim was found to be true: 50 of 500 products from machine A and 45 of 500 products from machine B are defective. 100 products from each machine are selected randomly. Find the probability that

(a) the sample proportion of defective products from machine A is smaller than the sample proportion of defective products from machine B.

[ 0.4052 ]

(b) the sample proportions of defective products differ by less than 1.8%.

[ 0.6730 ]

(c) the difference between the sample proportion of defective products from machine A and the sample proportion of defective products from machine B is at least 1%.

[ 0.5000 ]

__________________________________________________________________________

Task 9

A company purchases parts from two suppliers and has been having serious problems with scrap and rework with both suppliers. From previous records, 16% of the parts supplied by Supplier A and 14% of the parts supplied by Supplier B were found to be nonconforming. A quality engineer decides to investigate and takes a random sample of 100 parts from each supplier. What is the probability that

(a) the proportion of nonconforming parts supplied by Supplier A is greater than the proportion of nonconforming parts supplied by Supplier B?

[ 0.6554 ]

(b) the proportion of nonconforming parts supplied by Supplier A is more than the proportion of nonconforming parts supplied by Supplier B by at least 0.01?

[ 0.5793 ]

(c) the difference between the proportion of nonconforming parts supplied by Supplier A and the proportion of nonconforming parts supplied by Supplier B is more than 0.05? [ 0.2776 ]

___________________________________________________________________________

2.7 t Distribution

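Theorem 1 Let $Z$ be a standard normal variable and $V$ a chi-squared random variable with $v$ degrees of freedom. If $Z$ and $V$ are independent, then the distribution of the random variable $T$, where

$$T = \frac{Z}{\sqrt{V/v}}$$

is given by the density function

$$h(t) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\Gamma\left(\frac{v}{2}\right)\sqrt{\pi v}}\left(1 + \frac{t^2}{v}\right)^{-\frac{v+1}{2}}, \qquad -\infty < t < \infty.$$

This is known as the t distribution with $v$ degrees of freedom.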

Corollary 1 Let $X_1, X_2, \ldots, X_n$ be independent random variables that are all normal with mean $\mu$ and standard deviation $\sigma$. Let

$$\bar X = \frac{\sum_{i=1}^{n} X_i}{n} \qquad \text{and} \qquad S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar X)^2}{n-1}$$

Then the random variable

$$T = \frac{\bar X - \mu}{S/\sqrt{n}}$$

has a t distribution with $v = n - 1$ degrees of freedom and can be written as $T \sim t_{n-1}$.
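To make Corollary 1 concrete, the sketch below computes the statistic $T$ for a small sample. The data values and the value of $\mu$ are hypothetical, chosen only for illustration, and the function name `t_statistic` is an assumption.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu):
    """T = (xbar - mu) / (s / sqrt(n)); stdev() uses the n - 1 divisor."""
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / sqrt(n))

# Hypothetical measurements, assumed to come from a normal population
sample = [9.8, 10.2, 10.4, 9.9, 10.1, 10.0]
t_value = t_statistic(sample, mu=10.0)
dof = len(sample) - 1        # v = n - 1 degrees of freedom

print(round(t_value, 4), dof)
```

The value of `t_value` would then be compared with $t_{\alpha, v}$ from the t table using `dof` degrees of freedom.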

___________________________________________________________________________

Task 10

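Find $t_{\alpha, v}$ for:

(a) $\alpha = 0.001$, $v = 15$. [ $t_{0.001, 15} = 3.733$ ]

(b) $\alpha = 0.005$, $v = 20$. [ $t_{0.005, 20} = 2.845$ ]

(c) $\alpha = 0.010$, $v = 10$. [ $t_{0.010, 10} = 2.764$ ]

(d) $\alpha = 0.025$, $v = 30$. [ $t_{0.025, 30} = 2.042$ ]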

_______________________________________________________________________

2.8

$\chi^2$ Distribution

The continuous random variable $X$ has a chi-squared distribution, with $v$ degrees of freedom, if its density function is given by

$$f(x) = \frac{1}{2^{v/2}\,\Gamma\left(\frac{v}{2}\right)}\, x^{\frac{v}{2}-1} \exp\left(-\frac{x}{2}\right), \qquad x > 0$$

where $v$ is a positive integer, and can be written as $X \sim \chi^2_v$.

All chi-square distributions are skewed to the right. The symbol $\chi^2_{\alpha, v}$ denotes the number along the horizontal axis that cuts off to its right an area of $\alpha$ under the chi-square distribution with $v$ degrees of freedom.

Table 8 from Lee (2004) gives the values of $\chi^2_{\alpha, v}$ with $P(\chi^2 > \chi^2_{\alpha, v}) = \alpha$.
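The tabulated values can be checked against the density function directly. The sketch below integrates the chi-squared density numerically with Simpson's rule to obtain the right-tail area; the function names are assumptions, and the simple integration scheme is only intended for $v \ge 2$, where the density is finite at $x = 0$.

```python
from math import exp, gamma

def chi2_pdf(x, v):
    """Chi-squared density with v degrees of freedom (v >= 2 assumed)."""
    return x ** (v / 2 - 1) * exp(-x / 2) / (2 ** (v / 2) * gamma(v / 2))

def upper_tail(c, v, steps=20000):
    """P(X > c) = 1 - integral of the density over [0, c] (Simpson's rule)."""
    h = c / steps                      # steps must be even
    total = chi2_pdf(0.0, v) + chi2_pdf(c, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chi2_pdf(i * h, v)
    return 1 - total * h / 3

# Right-tail areas at two tabulated values
print(round(upper_tail(23.209, 10), 4))
print(round(upper_tail(24.996, 15), 4))
```

The two printed areas should be close to $0.01$ and $0.05$, which is consistent with reading $\chi^2_{\alpha, v}$ as an upper-tail point.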

___________________________________________________________________________

Task 11

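Find $\chi^2_{\alpha, v}$ for:

(a) $\alpha = 0.01$, $v = 10$. [ $\chi^2_{0.01, 10} = 23.209$ ]

(b) $\alpha = 0.05$, $v = 15$. [ $\chi^2_{0.05, 15} = 24.996$ ]

(c) $\alpha = 0.99$, $v = 12$.

(d) $\alpha = 0.995$, $v = 16$. [ $\chi^2_{0.995, 16} = 5.142$ ]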

___________________________________________________________________________

2.9

F Distribution

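Theorem 2 Let $U$ and $V$ be two independent random variables having chi-squared distributions with $v_1$ and $v_2$ degrees of freedom, respectively. Then the distribution of the random variable

$$F = \frac{U/v_1}{V/v_2}$$

is given by the density function

$$h(f) = \frac{\Gamma\left(\frac{v_1+v_2}{2}\right)\left(\frac{v_1}{v_2}\right)^{v_1/2}}{\Gamma\left(\frac{v_1}{2}\right)\Gamma\left(\frac{v_2}{2}\right)} \cdot \frac{f^{\frac{v_1}{2}-1}}{\left(1 + \frac{v_1 f}{v_2}\right)^{\frac{v_1+v_2}{2}}}, \qquad 0 < f < \infty.$$

Theorem 3 Writing $f_{\alpha, v_1, v_2}$ for $f_\alpha$ with $v_1$ and $v_2$ degrees of freedom, we obtain

$$f_{1-\alpha, v_1, v_2} = \frac{1}{f_{\alpha, v_2, v_1}}$$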
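As a worked illustration of Theorem 3, a lower-tail value such as $f_{0.975, 15, 9}$ can be obtained from an upper-tail table entry by swapping the degrees of freedom. The entry $f_{0.025, 9, 15} = 3.12$ used here is taken from a standard F table and is an assumption about the table being used.

```latex
f_{0.975,\,15,\,9} \;=\; \frac{1}{f_{0.025,\,9,\,15}} \;=\; \frac{1}{3.12} \;\approx\; 0.3205
```

Note that the degrees of freedom are reversed on the right-hand side: the numerator degrees of freedom $v_1 = 15$ become the denominator degrees of freedom.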

___________________________________________________________________________

___________________________________________________________________________

Task 12

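Find $f_{\alpha, v_1, v_2}$ for:

(a) $\alpha = 0.001$, $v_1 = 5$, $v_2 = 10$. [ $f_{0.001, 5, 10} = 10.48$ ]

(b) $\alpha = 0.010$, $v_1 = 10$, $v_2 = 10$.

(c) $\alpha = 0.975$, $v_1 = 15$, $v_2 = 9$. [ $f_{0.975, 15, 9} = 0.3205$ ]

(d) $\alpha = 0.950$, $v_1 = 12$, $v_2 = 20$. [ $f_{0.950, 12, 20} = 0.3937$ ]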

Exercise 2

1. A random sample of size 32 is drawn from a normal distribution with mean 30 and

standard deviation 9. What is the probability that the

(a) sample mean is at most 26?

(b) sample mean is smaller than 33?

2. A random sample of size 41 is taken from a population which is Poisson distributed with

mean 26. What is the probability that the

(a) sample mean is less than 27?

(b) sample mean is at least 29?

3. A random sample of size 16 is selected from a normal distribution with a mean of 92 and a

standard deviation of 11. Another random sample of size 12 is selected with mean 88 and

standard deviation 16. Find the probability that

(a) the difference between the sample means is more than 8?

(b) is less than by 18?

4. PVC pipe is manufactured with a mean length of 30.5 inch and a standard deviation of 2.8

inches. Find the probability that a random sample of n = 15 pipes will have a sample mean

length greater than 29 inches.

5. The probability that a machine produces defective parts is 0.02. A random sample of 15

parts was taken.

(a) What is the probability that the sample mean is more than 0.5 if a random sample of size 4

was taken?

(b) What is the probability that the sample mean is less than 0.8 if a random sample of size 9

was taken?

6. The mean amount of air blown by a JSM air conditioner is 5.5 m in a minute with a standard deviation of 1.2 m. For a DGM air conditioner, the mean amount of air blown is 4.9 m in a minute with a standard deviation of 1.1 m. 12 sets of each type of air conditioner are selected to run a test.

(a) What is the probability that the mean amount of air blown by the JSM air conditioners is greater than that of the DGM air conditioners?

(b) What is the probability that the difference between the mean amounts of air blown by the two types of air conditioner is less than 1?

7. The average content of a can of soda before the machine is serviced is 260 ml with a standard deviation of 11 ml. The average content of a can of soda after the machine is serviced is 250 ml with a standard deviation of 8 ml. 40 cans of soda filled before the service and 38 cans of soda filled after the service are chosen at random. Find the probability that the mean content of the cans sampled before the service exceeds the mean content of the cans sampled after the service by at least 5 ml.

8. The number of times the Max photostat machine and the JP photostat machine break down follows a Poisson distribution. An average of 8 breakdowns per day was recorded for the Max photostat machine. For the JP photostat machine, an average of 5 breakdowns per day was recorded.

(a) If a random sample of 15 days is taken, what is the probability that the mean number of breakdowns recorded in a day for the Max photostat machine is more than 10?

(b) If a random sample of 20 days is taken,

i. what is the probability that the mean numbers of breakdowns recorded in a day for the two machines differ by less than 4?

ii. what is the probability that the difference between the mean numbers of breakdowns recorded in a day is at least 5?

9. 15% of the paperclips do not follow the company's specifications. A QA inspector takes a random sample of 1000 paperclips for inspection. What is the probability that

(a) less than 15% of the paperclips do not follow the company's specifications?

(b) at most 12% of the paperclips do not follow the company's specifications?

(c) more than 17% of the paperclips do not follow the company's specifications?

10. A claim was made that 98% of the A4 papers produced by a company are of good quality. A survey was done and a random sample of 1000 A4 papers was selected. Find the probability that

(a) more than 97% of the sampled A4 papers are of good quality.

(b) between 97% and 99% of the sampled A4 papers are of good quality.

(c) up to 99% of the sampled A4 papers are of good quality.

11. A manufacturer claims that $\frac{3}{4}$ of the electrical components were found to be nondefective. 250 electrical components were selected randomly. What is the probability that

(a) at least $\frac{4}{5}$ of the electrical components are found to be nondefective?

(b) $\frac{37}{50}$ to $\frac{39}{50}$ of the electrical components are found to be nondefective?

(c) $\frac{7}{10}$ of the electrical components are found to be nondefective?

12. A safety engineer claims that of all industrial accidents are caused by the carelessness of

the employees. A survey is carried and randomly 250 of all industrial accidents were selected.

What is the probability that

(a) at most

1

of all industrial accidents are caused by the carelessness of the employees?

4

(c )

1

of all industrial accidents are caused by the carelessness of the employees?

5

9

11

to

of all industrial accidents are caused by the carelessness of the employees

50

50

13. From previous record, 1.2% of machines in a manufacturing factory will be serviced at

least 3 times in a month. A survey was done involving 100 machines. Find the probability of

the

(a) proportion of machines in a manufacturing factory will be serviced at least 3 times in a

month more than 0.013.

(b) proportion of machines in a manufacturing factory will be serviced at least 3 times in a

month less than 0.09.

(c) proportion of machines in a manufacturing factory will be serviced at least 3 times in a

month not more than 0.10.

14. From previous experience, 35% of the microchips are defective. An engineer was asked

to investigate and solve this problem. He took randomly 500 samples of the microchips. Find

the probability of the

(a) proportion of the microchips are defective less than 0.36.

(b) proportion of the microchips are defective not more than 0.32.

(c) proportion of the microchips are defective between 0.33 and 0.38, inclusive.

15. A company produces component parts for two types of engines, DOHC and SOHC. They claim that 96% of the component parts for DOHC and 95% of the component parts for SOHC meet specifications. A random sample of 100 parts is selected from each type of component. What is the probability that

(a) the proportion of the component parts for DOHC that meet specifications is less than the proportion of the component parts for SOHC that meet specifications?

(b) the proportions that meet specifications differ by more than 0.5%?

(c) the proportion of the component parts for DOHC that meet specifications exceeds the proportion of the component parts for SOHC that meet specifications by at least 1%?

16. A claim was made that 10 out of 1000 laptops and 5 out of 500 desktops produced by a company have been rejected. A survey was done and a random sample of 50 laptops and 40 desktops was selected. Find the probability that

(a) the sample proportion of rejected laptops is more than the sample proportion of rejected desktops.

(b) the difference between the sample proportion of rejected laptops and the sample proportion of rejected desktops is at least 0.01.

(c) the sample proportion of rejected laptops is smaller than the sample proportion of rejected desktops by at most 0.005.

17. A manufacturer of CD and DVD players uses a set of comprehensive tests to assess the electrical function of its products. All disk players must pass all tests prior to being sold. It was found that $\frac{4}{200}$ of the CD players and $\frac{3}{200}$ of the DVD players failed the tests. A quality engineer was asked to investigate the problem. A random sample of 150 was taken from each type of player. What is the probability that

1

failed the tests?

100

(b) the proportion of CD players that failed the tests is greater than the proportion of DVD players that failed the tests?

(c) the proportion of CD players that failed the tests is less than the proportion of DVD players that failed the tests by at most $\frac{2}{100}$?

18. A manufacturer claims that the products produced by his two different machines meet the customers' specifications. An investigation found that some of the products failed to meet the specifications and were rejected. Of 450 items from machine A, 27 failed to meet the specifications and were rejected, and of 500 items from machine B, 25 failed to meet the specifications and were rejected. 60 items from each machine are selected randomly. What is the probability that

(a) the proportion of rejected items from machine A is greater than the proportion of rejected items from machine B?

(b) the proportions of rejected items differ by more than 1.5%?

(c) the proportion of rejected items from machine B is less than the proportion of rejected items from machine A by at least 1%?

Chapter 3

Estimation

Learning Objectives:

At the end of this chapter, students should be able to

(a) distinguish between estimator and estimate for a given problem.

(b) describe the difference between inferential statistics and descriptive statistics.

(c) identify the best estimator for mean, proportion and standard deviation, and construct the confidence interval for mean, proportion and variance for a single population and for two populations correctly based on a given problem.

(d) interpret the confidence interval correctly.

3.1

Introduction

In the previous chapter we learnt about the sampling distributions of random variables. This knowledge will equip us to work with the core of inferential statistics. Do you know what inferential statistics is?

This chapter will introduce you firstly to the definition of inferential statistics, followed by the definitions of important terms that will be used intensively in this chapter, namely estimator, estimate, point and interval estimate, and confidence interval. Next, we will discover the procedure for estimating the true parameter of a population.

Lastly, we will construct the confidence intervals for mean, proportion and variance for cases of one population and two populations, with the correct interpretation.

Let us recap the definition of inferential statistics. It deals with the use of probabilities and data from a sample to infer or generalise about the underlying population. That is, it uses information about the sample to make decisions and draw conclusions about population characteristics. For example, by studying the average amount of top-up spent per month by a group of students in UTM, we can infer the average amount of top-up spent by all university students in our country. Can you guess what the sample and population in this example are? You can always think of a sample as a subset of a population. Does it help? Don't give up, you have tried your best! In statistics, we call all university students in our country a population, and the subset of this population, which is a group of students from UTM, is called a sample. In the next section we will start with the definitions of important terms in this chapter.

3.2

Terminology

1. Estimator is the sample statistic used to estimate a population parameter.

2. Estimate is the value assigned to a population parameter based on the value of a sample statistic.

3. Point estimate is the value of a sample statistic that is used to estimate a population parameter. Interval estimate is an interval constructed around the point estimate with the hope that this interval contains the corresponding population parameter.

4. Confidence interval, which we will learn about throughout this chapter, is defined as an interval constructed around a point estimate and associated with a level of confidence based on the procedure used to construct it. The confidence level is the proportion of times that the confidence interval will contain the true parameter, assuming that the estimation procedure is repeated a large number of times.

Next we will learn through examples how to determine the best estimator and hence construct the appropriate confidence interval according to the sample data that we have.

3.3

Point Estimate

We start with our previous example on the monthly amount of top-up by university students. The mean value of monthly top-up computed for the sample is called a sample mean, denoted by $\bar x$. This is a point estimate of the corresponding population mean $\mu$, i.e. the mean monthly top-up for university students in Malaysia. Suppose we select 1000 UTM students randomly and the mean monthly top-up is RM40. This RM40 is a point estimate for the true mean of monthly top-up for all university students in Malaysia. The statistician can then state that the mean monthly top-up for Malaysian university students is RM40. This is what we call point estimation.

For the above example the population mean $\mu$ is estimated using the sample mean $\bar x$, calculated as follows

$$\bar x = \frac{x_1 + x_2 + \cdots + x_{1000}}{1000}$$

where $x_1$ is the amount of monthly top-up by UTM student 1, $x_2$ is the amount of monthly top-up by UTM student 2, and so on.

Similarly, we can also estimate an unknown population variance, $\sigma^2$, using a point estimator $S^2$, and the numerical value assigned to it, for example $s^2 = 1.6$, is called the point estimate for $\sigma^2$.
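As a small illustration of point estimation, the sketch below computes the sample mean and sample variance for five monthly top-up amounts. The data are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from statistics import mean, variance

# Hypothetical monthly top-up amounts (RM) for five sampled students
topups = [35.0, 42.5, 38.0, 45.0, 39.5]

x_bar = mean(topups)      # point estimate of the population mean
s2 = variance(topups)     # point estimate of the population variance (n - 1 divisor)

print(x_bar, s2)
```

Here `variance` from the standard library divides by $n - 1$, matching the definition of $S^2$ used in this text.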

In engineering we often need to estimate the following:

The mean of a single population, $\mu$; for example the mean breakdown voltage of diodes.

The variance of a single population, $\sigma^2$ (or standard deviation, $\sigma$); for example the standard deviation of the inside diameter of certain plastic pipes.

The proportion, $\pi$, of items in a population that belong to a certain class of interest; for example the proportion of defective items for a particular production process.

The difference in means of two populations, $\mu_1 - \mu_2$; for example the difference in mean breakdown voltage of two diodes.

The difference in two population proportions, $\pi_1 - \pi_2$; for example the difference in proportions of nonconforming coils of brand A and B.

The ratio of two population variances, $\frac{\sigma_1^2}{\sigma_2^2}$; for example the ratio between the variances of two populations.

The following table summarises the point estimates of these parameters together with their

statistics.

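Table 3.1: Point Estimates and Statistics

______________________________________________

Unknown parameter    Statistic                                                Point estimate

______________________________________________

$\mu$                $\bar X = \frac{1}{n}\sum X_i$                           $\bar x$

$\sigma^2$           $S^2 = \frac{1}{n-1}\sum (X_i - \bar X)^2$               $s^2$

$\pi$                $\hat P = \frac{X}{n}$                                   $\hat p$

________________________________________________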

The best estimator $\hat\Theta$ (the most efficient estimator) of a parameter $\theta$ must

1. be unbiased, that is $E(\hat\Theta) = \theta$, and

2. have a variance that is as small as possible.

For further explanation of these properties, please refer to Montgomery, Runger and Hubele (2004), pages 131-133.

___________________________________________________________________________

3.4

Interval Estimate

Next, by extending our top-up example, instead of saying that the mean top-up for university

students in Malaysia is RM40, we may want to say it within a certain range. That is, by

subtracting a number from RM40 and adding the same number to RM40 will give us this

range. In illustrating this example, let the number to be subtracted from RM40 is RM5 and

add this number to RM40. Hence we obtain the range from RM35 to RM45. Then we can

state that the range from RM35 to RM45 is likely to contain the mean top-up for all

Malaysian university students.

In general, the interval estimate of the unknown parameter $\theta$ can be written as $(l, u)$, where $l$ is the lower limit and $u$ is the upper limit. So the corresponding interval estimate for the above example is RM(35, 45). Since different samples will produce different values of the sample mean, which result in different values of $l$ and $u$, these limits are values of the random variables $L$ (the lower limit) and $U$ (the upper limit). The probability associated with this interval estimate can be expressed as follows

$$P(L \le \theta \le U) = 1 - \alpha,$$

where $0 < \alpha < 1$. That is, we have a probability of $1 - \alpha$ of choosing a sample that will produce an interval containing the true value of $\theta$. The resulting interval estimate is called a $100(1-\alpha)\%$ confidence interval (CI) for the true parameter $\theta$.

Generally, a $100(1-\alpha)\%$ confidence interval (CI) for the true parameter $\theta$ means

$$P(L \le \theta \le U) = 1 - \alpha,$$

which can be interpreted as follows: if we collect infinitely many random samples and compute a $100(1-\alpha)\%$ CI for the true parameter $\theta$ from each sample, $100(1-\alpha)\%$ of these intervals will contain the true value of $\theta$.

However, in practice we only draw one random sample. The interpretation that we will use is that the observed interval $(l, u)$ contains the true value of $\theta$ with a $100(1-\alpha)\%$ confidence level.
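The repeated-sampling interpretation can be illustrated by simulation. The following sketch draws many samples from a normal population with a known standard deviation, builds a 95% interval from each, and reports the fraction of intervals that contain the true mean. The population values (`mu=40`, `sigma=10`), the sample size and the function names are all assumptions made for this illustration.

```python
import random
from statistics import NormalDist, mean

def z_interval(sample, sigma, conf=0.95):
    """CI for the mean when the population standard deviation is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z * sigma / len(sample) ** 0.5
    centre = mean(sample)
    return centre - half_width, centre + half_width

def coverage(mu=40.0, sigma=10.0, n=25, reps=2000, conf=0.95, seed=1):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma, conf)
        hits += low <= mu <= high
    return hits / reps

print(coverage())
```

The printed fraction should be close to 0.95, although any single interval either contains $\mu$ or does not.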

3.5

CI on the Mean

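In constructing a CI for the population mean $\mu$, there are three cases that we need to consider:

(a) population variance $\sigma^2$ is known,

(b) population variance $\sigma^2$ is unknown but the sample size is large ($n \ge 30$), and

(c) population variance $\sigma^2$ is unknown and the sample size is small ($n < 30$).

These considerations need to be taken into account because we need to know the sampling distribution for the sample mean $\bar X$. The use of this sampling distribution will be demonstrated as follows. Take the first case as an example. We know that the sampling distribution of $\bar X$ is normal with mean $\mu$ and variance $\frac{\sigma^2}{n}$. Thus, the statistic

$$Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}}$$

is distributed as a standard normal. In computing a $100(1-\alpha)\%$ CI for the population mean, we use

$$P\left(\bar X - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar X + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.$$

For an observed sample, the lower and upper limits are $\bar x - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ and $\bar x + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ respectively.

(a) A $100(1-\alpha)\%$ CI for the population mean, $\mu$, with known population variance, $\sigma^2$, can be written as

$$\bar x - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar x + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

or

$$\left(\bar x - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar x + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$

(b) A $100(1-\alpha)\%$ CI for the population mean, $\mu$, with unknown population variance, $\sigma^2$, and $n \ge 30$ can also be written as

$$\bar x - z_{\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \bar x + z_{\alpha/2}\frac{s}{\sqrt{n}}$$

as we can use the central limit theorem in this case, where $s$ is the sample standard deviation.

(c) A $100(1-\alpha)\%$ CI for the population mean, $\mu$, with unknown population variance, $\sigma^2$, and $n < 30$ can be written as

$$\bar x - t_{\alpha/2,\, n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar x + t_{\alpha/2,\, n-1}\frac{s}{\sqrt{n}}$$

with the assumption that the sample comes from a normal distribution.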

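Example 1

The compressive strength has a known population standard deviation of $\sigma = 2.18039 \times 10^5$ pascal. A sample of 16 specimens has been randomly selected, which gives the sample mean of $\bar x = 2.49978 \times 10^7$ pascal. Construct a 95% CI on the mean compressive strength.

Solution

This example is clearly case (a), where the population standard deviation $\sigma$ is known and equals $2.18039 \times 10^5$ pascal. The CI that we want to compute is the 95% CI for the mean compressive strength, $\mu$. From the sample, $\bar x = 2.49978 \times 10^7$ pascal and the sample size is $n = 16$.

Here $\alpha = 0.05$, so $z_{\alpha/2} = z_{0.025} = 1.96$.

$$2.49978 \times 10^7 - 1.96\,\frac{2.18039 \times 10^5}{\sqrt{16}} \le \mu \le 2.49978 \times 10^7 + 1.96\,\frac{2.18039 \times 10^5}{\sqrt{16}}$$

$$2.4891 \times 10^7 \le \mu \le 2.5105 \times 10^7$$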
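The interval in Example 1 can be reproduced with a few lines of code. This is a sketch for case (a) only, with known $\sigma$; the function name `z_ci` is an assumption, and the code uses the exact normal quantile rather than the rounded value 1.96, so the limits may differ slightly in the last digits.

```python
from statistics import NormalDist

def z_ci(xbar, sigma, n, conf):
    """CI for the mean with known population standard deviation sigma."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    half_width = z * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width

# Values from Example 1 (pascal)
low, high = z_ci(xbar=2.49978e7, sigma=2.18039e5, n=16, conf=0.95)
print(f"{low:.4e} <= mu <= {high:.4e}")
```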

________________________________________________________________________

Task 1

A random sample of 16 compact cars tested for fuel consumption gave a mean of 12.5 km per litre with a standard deviation of 0.83 km per litre. Assuming that the fuel consumption in km per litre of all compact cars has a normal distribution, construct a 99% confidence interval for the population mean of fuel consumption for compact cars.

[ 11.8885, 13.1115 ]

Task 2

Borneo Steel Corporation produces iron rings that are supplied to ARAAB Co Ltd. These rings are supposed to have a diameter of 60 cm. The machine that makes these rings does not produce each ring with a diameter of exactly 60 cm. The diameter of each of the rings varies slightly. It is known that when the machine is working properly, the rings made on this machine have a mean diameter of 60 cm. The quality control department takes a random sample of 35 such rings every week, calculates the mean of the diameters for these rings, and makes a 99% confidence interval for the population mean. If either the lower limit of this confidence interval is less than 59.938 cm or the upper limit of this confidence interval is greater than 60.063 cm, the machine is stopped and adjusted. A recent such sample of 35 rings produced a mean diameter of 60.038 cm with a standard deviation of 0.15 cm. Based on this sample, can you conclude that the machine needs an adjustment?

[(59.9727, 60.1033); yes]

___________________________________________________________________________

3.6

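To construct a CI for the difference between two population means, $\mu_1 - \mu_2$, we apply the knowledge from the previous section by choosing our statistic $Z$ as

$$Z = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

assuming we know both population variances. Again we compute a $100(1-\alpha)\%$ CI for the difference between the two population means, $\mu_1 - \mu_2$, so that

$$P\left((\bar X_1 - \bar X_2) - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar X_1 - \bar X_2) + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right) = 1 - \alpha.$$

There are three cases of $100(1-\alpha)\%$ CI for the difference between two population means $\mu_1 - \mu_2$:

(a) known population variances

$$(\bar x_1 - \bar x_2) - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar x_1 - \bar x_2) + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

(b) unknown population variances and $n_1, n_2 \ge 30$

i. with $\sigma_1^2 = \sigma_2^2$

$$(\bar x_1 - \bar x_2) - z_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le (\bar x_1 - \bar x_2) + z_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where $s_p = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ is a pooled standard deviation.

ii. with $\sigma_1^2 \ne \sigma_2^2$

$$(\bar x_1 - \bar x_2) - z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar x_1 - \bar x_2) + z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

(c) unknown population variances, $n_1, n_2 < 30$ and the normality assumption holds

i. with $\sigma_1^2 = \sigma_2^2$

$$(\bar x_1 - \bar x_2) - t_{\alpha/2,\, v}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le (\bar x_1 - \bar x_2) + t_{\alpha/2,\, v}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where $v = n_1 + n_2 - 2$ and $s_p = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$.

ii. with $\sigma_1^2 \ne \sigma_2^2$

$$(\bar x_1 - \bar x_2) - t_{\alpha/2,\, v}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar x_1 - \bar x_2) + t_{\alpha/2,\, v}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

where

$$v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$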

_______________________________________________________________________

Example 2

Suppose random samples of 49 Silver Tyres and 36 Dun Tyres were selected. The sample mean mileage that a Silver Tyre lasts is 119000 km with a standard deviation of 7700 km, and the sample mean mileage for Dun Tyres is 118000 km with a standard deviation of 6000 km. Compute a 90% CI on the difference of the two population means.

Solution

The 90% CI on the difference of the two population means is

$$(119000 - 118000) - 1.6449\sqrt{\frac{7700^2}{49} + \frac{6000^2}{36}} \le \mu_1 - \mu_2 \le (119000 - 118000) + 1.6449\sqrt{\frac{7700^2}{49} + \frac{6000^2}{36}}$$

$$-1445.32 \le \mu_1 - \mu_2 \le 3445.32$$
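The large-sample intervals for two means can be sketched in code as follows. The function names are assumptions made for illustration; `ci_unequal` corresponds to case (b) ii and `ci_pooled` to case (b) i. The code uses exact normal quantiles, so results may differ from hand calculations with rounded table values in the first decimal place.

```python
from math import sqrt
from statistics import NormalDist

def z_quantile(conf):
    """z_{alpha/2} for a two-sided interval with the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)

def ci_unequal(x1, s1, n1, x2, s2, n2, conf):
    """Large-sample CI for mu1 - mu2 without assuming equal variances."""
    half = z_quantile(conf) * sqrt(s1**2 / n1 + s2**2 / n2)
    return (x1 - x2) - half, (x1 - x2) + half

def ci_pooled(x1, s1, n1, x2, s2, n2, conf):
    """Large-sample CI for mu1 - mu2 using the pooled standard deviation."""
    sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    half = z_quantile(conf) * sp * sqrt(1 / n1 + 1 / n2)
    return (x1 - x2) - half, (x1 - x2) + half

# Tyre data from Example 2
lo90, hi90 = ci_unequal(119000, 7700, 49, 118000, 6000, 36, 0.90)
lo95, hi95 = ci_pooled(119000, 7700, 49, 118000, 6000, 36, 0.95)
print(round(lo90, 1), round(hi90, 1))
print(round(lo95, 1), round(hi95, 1))
```

For small samples (case (c)), the normal quantile would be replaced by $t_{\alpha/2, v}$ from the t table.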

___________________________________________________________________________

Task 3

Using Example 2, but now assuming that the population variances are equal, construct a 95% CI on the difference of the mean mileage the tyres last.

[-2026.0942, 4026.0942]

___________________________________________________________________________

Task 4

A car magazine is comparing the total repair costs incurred during the first three years on two mid-sized cars, the Pherry and the XPY. Random samples of 16 Pherrys and 9 XPYs are taken. All 25 cars are three years old and have similar mileages. The mean repair cost for the 16 Pherry cars is RM5000 for the first three years with a standard deviation of RM800. For the 9 XPY cars, this mean is RM7700 with a standard deviation of RM1000. Assume that the repair costs follow a normal distribution with the same population variance. Construct a 90% confidence interval for the difference between the two population means.

[-3324.7295, -2075.270]

___________________________________________________________________________

Task 5

A process engineer is comparing two different etching solutions for removing silicon from the backs of wafers. The etch rates follow a normal distribution and have equal population variances of $0.35^2$. Below are the observed etch rates from 10 wafers for each solution.

____________________________
Solution 1        Solution 2
____________________________
9.7   10.5        10.1   9.9
9.3   10.2        10.5   10.1
9.1   9.9         10.6   10.2
9.5   10.3        10.3   10.3
10.0  10.1        10.3   10.1
____________________________

Find a 90% CI for the difference in mean etch rates. [ -0.6375, -0.1225 ]

Task 6

Using the data in Task 5, construct a 95% CI for the difference in mean etch rates if the population variances are unknown and assumed to be unequal.

[ -0.7198, -0.0402 ]

___________________________________________________________________________

3.7

probability,

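$$P\left(-z_{\alpha/2} < \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} < z_{\alpha/2}\right) = 1 - \alpha.$$

$$P\left(P - z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}} < \pi < P + z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}\right) = 1 - \alpha$$

with

$$P - z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}} < \pi < P + z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}}.$$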

__________________________________________________________________________

Example 3

defective units produced. A random sample of 200 boards contains 1 defective. Find a 90% CI for the true proportion of defectives.

Solution

$$0.005 - 1.6449\sqrt{\frac{(0.005)(0.995)}{200}} < \pi < 0.005 + 1.6449\sqrt{\frac{(0.005)(0.995)}{200}}$$

$$-0.0032 < \pi < 0.0132$$
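The same calculation can be sketched in Python. The function name `proportion_ci` is hypothetical, and the normal approximation is assumed to be acceptable for the sample at hand.

```python
import math
from statistics import NormalDist

def proportion_ci(x, n, confidence):
    """Normal-approximation CI for a population proportion.

    x is the number of defectives observed in a sample of size n.
    """
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# 1 defective board in a sample of 200, 90% confidence
lower, upper = proportion_ci(1, 200, 0.90)
print(round(lower, 4), round(upper, 4))
```

With so few defectives the lower limit comes out negative. Since a proportion cannot be negative, in practice the lower limit would be reported as 0.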

___________________________________________________________________________

Task 7

A random sample of 200 diskettes was inspected and 17 defective diskettes were found. Find a 95% CI on the true proportion of defective diskettes.

[ 0.0463, 0.1237 ]

___________________________________________________________________________

Task 8

A random sample of 400 components was tested and 6.25 percent of the sample components failed to satisfy production specifications. Find a 90% CI on the true proportion of components that fail to satisfy the specifications.

[ 0.0426, 0.0824 ]

__________________________________________________________________________

3.8

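To construct the CI for $\pi_1 - \pi_2$, recall that the sampling distribution of $P_1 - P_2$ is normal with mean $\pi_1 - \pi_2$ and standard deviation

$$\sqrt{\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}}.$$

So the statistic

$$Z = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{\pi_1(1-\pi_1)}{n_1} + \dfrac{\pi_2(1-\pi_2)}{n_2}}}$$

is a standard normal random variable. Using the same approach as in the previous section, we obtain a $100(1-\alpha)\%$ CI for the difference between two proportions as

$$(P_1 - P_2) - z_{\alpha/2}\sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}} < \pi_1 - \pi_2 < (P_1 - P_2) + z_{\alpha/2}\sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}}$$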

___________________________________________________________________________

Example 4

In a factory, plastic parts are formed using two different injection-molding machines. Two random samples, each of size 200, are chosen and 5 defective parts are found in the sample from machine A whereas 6 defective parts are found in the sample from machine B. Construct a 99% CI on the difference in proportions of defective parts.

Solution

$$P_1 = \frac{5}{200} = 0.025; \quad P_2 = \frac{6}{200} = 0.03$$

$$(0.025 - 0.03) - 2.5758\sqrt{\frac{(0.025)(0.975)}{200} + \frac{(0.03)(0.97)}{200}} < \pi_1 - \pi_2 < (0.025 - 0.03) + 2.5758\sqrt{\frac{(0.025)(0.975)}{200} + \frac{(0.03)(0.97)}{200}}$$

$$-0.0471 < \pi_1 - \pi_2 < 0.0371$$
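A brief Python sketch of this interval is given below. The name `two_proportion_ci` is an assumption for illustration, and the counts are those of the two machines in this example.

```python
import math
from statistics import NormalDist

def two_proportion_ci(x1, n1, x2, n2, confidence):
    """Normal-approximation CI for the difference pi1 - pi2."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Machine A: 5 defectives in 200; machine B: 6 defectives in 200; 99% CI
lower, upper = two_proportion_ci(5, 200, 6, 200, 0.99)
print(round(lower, 4), round(upper, 4))
```

The interval contains zero, so at the 99% confidence level the data do not show a difference between the defective proportions of the two machines.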

_________________________________________________________________________

Task 9

A survey conducted by an independent Engineering Education Research Unit found that among teenagers aged 17 to 19, 20% of school girls and 25% of school boys wanted to study in an engineering discipline. Suppose that these percentages are based on random samples of 501 school girls and 500 school boys. Determine a 90% CI for the difference between the proportions of all school girls and all school boys who would like to study in an engineering discipline.

[-0.0933, -0.00666]

___________________________________________________________________________

3.9

$$\frac{(n-1)S^2}{\sigma^2}$$

interval in such a way that

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}.$$

__________________________________________________________________________________

Example 5

A study on an operating system for a portable computer has been carried out thoroughly to estimate the variance of response time. A random sample of 10 portable computers is chosen and gives a standard deviation of 8 milliseconds. Assuming that the response time follows a normal distribution, construct a 95% CI on the true variance of response time.

Solution

$\alpha = 0.05$, $\chi^2_{0.025,\,9} = 19.023$, $\chi^2_{0.975,\,9} = 2.7$

$$\frac{(10-1)8^2}{\chi^2_{0.025,\,10-1}} < \sigma^2 < \frac{(10-1)8^2}{\chi^2_{0.975,\,10-1}}$$

$$\frac{576}{19.023} < \sigma^2 < \frac{576}{2.7}$$

$$30.279 < \sigma^2 < 213.333$$
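The interval for a variance can be sketched in Python as below. The function name `variance_ci` is hypothetical, and the two chi-square critical values are assumed to be read from a table, as in the solution above, since the Python standard library has no chi-square quantile function.

```python
import math

def variance_ci(s, n, chi2_upper, chi2_lower):
    """CI for sigma^2 of a normal population.

    chi2_upper is chi^2_{alpha/2, n-1} and chi2_lower is chi^2_{1-alpha/2, n-1},
    both taken from a chi-square table.
    """
    sum_sq = (n - 1) * s**2
    return sum_sq / chi2_upper, sum_sq / chi2_lower

# Response-time data: n = 10, s = 8 ms, 95% confidence
lower, upper = variance_ci(8, 10, chi2_upper=19.023, chi2_lower=2.7)
print(round(lower, 3), round(upper, 3))

# A CI for the standard deviation follows by taking square roots
print(round(math.sqrt(lower), 3), round(math.sqrt(upper), 3))
```

Note that the larger critical value gives the lower limit, because it appears in the denominator.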

________________________________________________________________________________

Task 10

A random sample of 13 bolts is selected and the inside diameter is measured. The sample standard deviation of the bolt inside diameter is 0.018 mm. Construct a 90% CI for the standard deviation.

[0.0136, 0.0273]

__________________________________________________________________________

3.10

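$$F = \frac{S_2^2/\sigma_2^2}{S_1^2/\sigma_1^2}$$

in such a way that

$$P\left(f_{1-\alpha/2,\,n_2-1,\,n_1-1} < F < f_{\alpha/2,\,n_2-1,\,n_1-1}\right) = 1 - \alpha.$$

$$P\left(f_{1-\alpha/2,\,n_2-1,\,n_1-1} < \frac{S_2^2/\sigma_2^2}{S_1^2/\sigma_1^2} < f_{\alpha/2,\,n_2-1,\,n_1-1}\right) = 1 - \alpha.$$

Rearranging the above, we obtain a $100(1-\alpha)\%$ CI on the ratio of two variances of two normal distributions,

$$\frac{s_1^2}{s_2^2}\, f_{1-\alpha/2,\,n_2-1,\,n_1-1} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2}{s_2^2}\, f_{\alpha/2,\,n_2-1,\,n_1-1}.$$

Since

$$f_{1-\alpha/2,\,n_2-1,\,n_1-1} = \frac{1}{f_{\alpha/2,\,n_1-1,\,n_2-1}},$$

the interval can be written as

$$\frac{s_1^2}{s_2^2}\, \frac{1}{f_{\alpha/2,\,n_1-1,\,n_2-1}} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2}{s_2^2}\, f_{\alpha/2,\,n_2-1,\,n_1-1}.$$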

___________________________________________________________________________

Example 6

A quality engineer is studying the diameter of stainless steel rods manufactured on two different machines. Two random samples of 16 and 13 rods respectively are selected, which give diameter variances of 0.30 cm² and 0.40 cm² respectively. Assuming that the data were drawn from normal distributions, construct a 95% CI on the ratio of variances of the diameters.

Solution

$s_1^2 = 0.30$ cm², $s_2^2 = 0.40$ cm², $f_{0.025,\,16-1,\,13-1} = 3.18$, $f_{0.025,\,13-1,\,16-1} = 2.96$

$$\frac{s_1^2}{s_2^2}\, \frac{1}{f_{\alpha/2,\,n_1-1,\,n_2-1}} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2}{s_2^2}\, f_{\alpha/2,\,n_2-1,\,n_1-1}$$

$$\frac{0.3}{0.4}\left(\frac{1}{3.18}\right) < \frac{\sigma_1^2}{\sigma_2^2} < \frac{0.3}{0.4}(2.96)$$

$$0.2358 < \frac{\sigma_1^2}{\sigma_2^2} < 2.22$$
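The steps of this example can be sketched in Python. The function and parameter names are assumptions for illustration, and both F critical values are assumed to come from an F table.

```python
def variance_ratio_ci(s1_sq, s2_sq, f_n1_n2, f_n2_n1):
    """CI for sigma1^2 / sigma2^2 of two normal populations.

    f_n1_n2 is f_{alpha/2, n1-1, n2-1} and f_n2_n1 is f_{alpha/2, n2-1, n1-1},
    both read from an F table.
    """
    ratio = s1_sq / s2_sq
    return ratio / f_n1_n2, ratio * f_n2_n1

# Rod diameters: s1^2 = 0.30 (n1 = 16), s2^2 = 0.40 (n2 = 13), 95% confidence
lower, upper = variance_ratio_ci(0.30, 0.40, f_n1_n2=3.18, f_n2_n1=2.96)
print(round(lower, 4), round(upper, 2))
```

Since the interval contains 1, the data are consistent with the two machines having equal diameter variances.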

_____________________________________________________________________________

Task 11

An engineer is studying an axial load of aluminium cans. It is measured by using a plate

where an increasing pressure is applied on top of the can until it collapses. This maximum

weight that the sides of the can can support is the axial load. Two random samples of sizes 10

and 7 aluminium cans are selected and the standard deviations are 10.1 kg and 11.8 kg

respectively. Find a 90% CI on the ratio of variances of the loads.

[0.1787,2.4689]

___________________________________________________________________________

Exercise 3

1. When you construct a 90% confidence interval for $\mu$, what are you 90% confident about?

2. What happens to the width of a CI if we increase the sample size?

3. Can we consider the construction of a confidence interval to be part of inferential statistics? Why?

4. For a data set obtained from a sample, $n = 49$, $\bar{x} = 102.5$, and $s = 10.7$.

(a) What is the point estimate for $\mu$?

(b) Compute a 98% CI for $\mu$.

5. A 90% CI for $\mu$ can be interpreted as follows: if we take 1000 random samples of the same size and compute the confidence interval for each, then 900 of them

a. will contain $\mu$

c. will contain $\bar{x}$

6. Carbonated drink bottles are filled by an automated filling machine. Assume that the fill volume is normally distributed and that, from the previous production process, the variance of fill volume is 0.005 liter. A random sample of size 16 was drawn from this process, which gives a mean fill volume of 0.51 liter. Construct a 99% CI on the mean fill of all carbonated drink bottles produced by this factory.

7. A random sample of 12 wafers was drawn from a slider fabrication process, which gives the following photoresist thickness in micrometers: 10 11 9 8 10 10 11 8 9 10 11 12. Assume that the thickness is normally distributed. Construct a 95% CI for the mean thickness of all wafers produced by this factory.

8. The following is the result for diameter of 10 bearings selected randomly from a

production process.

0.5061 0.5083 0.5058 0.5075

0.5049

0.5037

(a) Construct a 90% CI for the mean of diameter of bearings.

(b) Construct a 95% CI for the mean of diameter of bearings.

(c) Comment on your interval estimates pertaining to their maximum error, which is defined as $t_{\alpha/2,\,n-1}\, s/\sqrt{n}$.

9. In the integrated circuit manufacturing industry, a basic process is to grow an epitaxial layer on polished silicon wafers. The wafers are mounted on a susceptor and positioned inside a specified jar. Chemical vapours are introduced through nozzles positioned near the top of the jar. The susceptor is rotated and heat at constant temperature is applied. The following are the thicknesses of the epitaxial layers (in μm) at low deposition time and at 59% arsenic flow rate.

13.925 13.909

14.057

14.068

14.006

13.893

14.005

(a) Construct a 90% CI for the mean thickness of epitaxial layers assuming that the thickness of epitaxial layer follows a normal distribution with variance of 0.0050 μm².

(b) Construct a 90% CI for mean thickness of all epitaxial layers assuming that the thickness

of epitaxial layer follows normal distribution.

(c) Comment on the interval estimates based on their practicality.

10. Using the data in question 9 and the following data on thickness of the epitaxial layers at high deposition time and at 59% arsenic flow rate,

14.295 14.095 15.505

15.806

15.106

14.839,

construct a 90% CI on the difference between the mean thicknesses of epitaxial layers, assuming that the thicknesses of epitaxial layers follow normal distributions with equal variances. Interpret your CI. Can you conclude that the true mean difference is zero?

11. A quality inspector inspected a random sample of 300 memory chips from a production line and found that 9 were defective. Construct a 99% confidence interval for the proportion of defective chips.

defect of his products. A random sample of size 800 batteries contains 10 defectives. Construct a 95% confidence interval for the proportion of defectives.

13. A manufacturer of computer chips inspected a random sample of 1000 chips. The

following are the number of defects according to its type.

holes too small

90

poor connections

chip oversize

chip undersize

25

10

2

1

(a) What is the point estimate of the proportion of defectives due to holes too small?

(b) Construct a 90% CI for the proportion of defectives for the production process due to

holes too small.

(c) What is the point estimate for proportion of defectives due to poor connection?

(d) Construct a 90% CI for the proportion of defectives for the production process due to poor

connection.

(e) If oversize and undersize chips can be classified as incorrect chip size, what is the point estimate of the proportion of defects due to incorrect chip size? Hence find a 95% interval estimate for the proportion of defective items due to incorrect chip size.

14. An optical firm is concerned about the variability of the refractive index of a typical glass that it will grind into lenses. The refractive index follows approximately a normal distribution. A random sample of 15 glasses is drawn from a large shipment, which gives a refractive index variance of $1.5 \times 10^{-4}$. Construct a 95% CI for the standard deviation of the refractive index of all glasses.

bumper guards. A random sample of 6 guards from each type were mounted on a compact car. Each car was then run into a concrete wall at 8 km per hour.

The following are the costs of repairs (in RM):

Bumper guard 1 : 305 420 363 485 300 360

Bumper guard 2 : 405 345 336 450 400 360

(a) Construct a 90% CI for the mean cost of repairs using bumper guard 1. State 3 conditions

(b) Assuming that all conditions in part (a) are satisfied, construct a 90% CI for the mean cost of repairs using bumper guard 2. What can you observe from these CIs?

(c) Assuming that the variances of cost of repairs are equal, construct a 95% CI on the difference in mean cost of repairs.

(d) What is the point estimate of the variance of cost of repair for bumper guard 1? Construct

a 95% CI for variance of cost of repair for bumper guard 1.

(e) What is the point estimate of the standard deviation of cost of repair for bumper guard 2?

Construct a 95% CI for the standard deviation of cost of repair for bumper guard 2.

(f) Find a 90% CI for the ratio of two variances for cost of repairs.

__________________________________________________________________________

Chapter 4

Tests of Hypotheses

Learning Objectives:

At the end of this chapter, students should be able to:

(a) structure science and/or engineering decision-making problems concerning one or two samples as hypothesis tests.

(b) test hypotheses concerning a population mean.

(c) test hypotheses concerning a population variance or standard deviation.

(d) test hypotheses concerning a population proportion.

(e) test hypotheses concerning the difference in two population means.

(f) test hypotheses concerning the ratio of two population variances or standard deviations.

4.1

Statistical Hypotheses

Many science and engineering problems require us to decide whether to accept or reject a statement about some parameter. That statement is called a hypothesis. A statistical hypothesis can arise from various fields of interest such as engineering, science, education, etc. A systematic procedure to decide whether to accept or reject a hypothesis is called hypothesis testing.

Definition 1 A statistical hypothesis is a statement about the parameters of one or more populations.

We cannot prove that a hypothesis is absolutely true or false. If the data sample supports the

hypothesis, then we do not reject it. If the data sample does not support the hypothesis, we

reject it.

The hypothesis being tested is referred to as the null hypothesis and denoted by $H_0$. The null hypothesis is set up primarily to see whether it can be rejected or not. We must also formulate an alternative hypothesis in order to know when to reject a null hypothesis. The alternative hypothesis, denoted by $H_1$, is the hypothesis which we accept when the null hypothesis can be rejected. Some authors use the notation $H_a$ or $H_A$ for the alternative hypothesis.

Definition 2 A null hypothesis, $H_0$, is an assertion about one or more population parameters. We hold this assertion as true until there is sufficient statistical evidence to conclude otherwise. The alternative hypothesis, $H_1$, is the assertion of all situations not covered by the null hypothesis.

Together, the null and the alternative hypotheses constitute a complete set of hypotheses that covers all possible values of the parameter or parameters under investigation. The value of the population parameter specified in the null hypothesis is usually determined in one of the following three ways:

1. from a model or theory regarding the process under investigation; the objective of hypothesis testing is then usually to verify the model or theory.

2. from knowledge of the process or previous tests or experiments; the objective of hypothesis testing is then to determine whether the parameter value has changed.

3. from external considerations, such as design or engineering specifications, or from contractual obligations; the objective of hypothesis testing is then conformance testing.

The hypothesis test is carried out using information obtained by random sampling.

For example, suppose that we are interested in the output voltage of a power supply used in a mobile phone; output voltage is a random variable that can be described by a probability distribution. Suppose that our interest focuses on the mean output voltage (a parameter of this distribution). Specifically, we are interested in deciding whether or not the mean output voltage is 6.00 V. We may express this formally as

$$H_0: \mu = 6.00 \text{ V}$$

$$H_1: \mu \neq 6.00 \text{ V} \qquad (4.1)$$

The statement $H_0: \mu = 6.00$ V in Equation (4.1) is called the null hypothesis¹, and the statement $H_1: \mu \neq 6.00$ V is called the alternative hypothesis. Since the alternative hypothesis allows values of $\mu$ either greater or less than 6.00 V, it is called a two-sided alternative hypothesis. When we formulate the hypotheses as

$$H_0: \mu = 6.00 \text{ V}$$

$$H_1: \mu < 6.00 \text{ V}$$

or

$$H_0: \mu = 6.00 \text{ V}$$

$$H_1: \mu > 6.00 \text{ V}$$

then the alternative hypothesis allows only values of $\mu$ less than 6.00 V or greater than 6.00 V, respectively, and it is called a one-sided alternative hypothesis².

Definition 3 A test statistic is a sample statistic computed from the data obtained by random sampling. The value of the test statistic is used in determining whether or not the null hypothesis should be rejected.

We decide whether or not to reject the null hypothesis by following a rule called the decision rule.

Definition 4 The decision rule of a statistical hypothesis test is a rule that specifies the conditions under which the null hypothesis may be rejected.

_______________________________________________________________

¹ Note that when choosing the null hypothesis one should bear in mind that it should nearly always be precise, or be easily reduced to a precise hypothesis. For example, when testing $H_0: \mu \ge 6$ V versus $H_1: \mu < 6$ V, the null hypothesis does not specify the value of $\mu$ exactly and so is not precise. But in practice we would proceed as if we were testing $H_0: \mu = 6$ V versus $H_1: \mu < 6$ V.

² Note that hypotheses are always statements about the parameters of one or more populations under investigation, not statements about the sample. So it is wrong to write $H_0: \bar{x} = 6$ V versus $H_1: \bar{x} \neq 6$ V or $H_1: \bar{x} < 6$ V.

Table 4.1 shows all four possible outcomes of a test of hypothesis. The conclusion columns refer to the action that he or she takes based on the results of the sampling experiment: he or she will conclude either that the alternative hypothesis $H_1$ is true or that the null hypothesis $H_0$ is true. The state of nature rows refer to the fact that either the alternative hypothesis $H_1$ is true or the null hypothesis $H_0$ is true. The true state of nature is unknown to him or her when conducting the test.

                       Statistical Conclusion
State of Nature        $H_1$ is true          $H_0$ is true
$H_0$ is true          Type I error           Correct conclusion
$H_1$ is true          Correct conclusion     Type II error

He or she will be making a wrong conclusion if he/she accepts the alternative hypothesis (equivalently, rejecting the null hypothesis) when in fact $H_0$ is really true. This type of wrong conclusion is called a Type I error.

Definition 5 Rejecting the null hypothesis $H_0$ (equivalently accepting the alternative hypothesis $H_1$) when it is true in state of nature is defined as a Type I error.

Also, he or she will be making a wrong conclusion if he/she accepts the null hypothesis (equivalently, rejecting the alternative hypothesis) when in fact $H_1$ is really true. This type of wrong conclusion is called a Type II error.

Definition 6 Failing to reject the null hypothesis $H_0$ (equivalently failing to accept the alternative hypothesis $H_1$) when it is false in state of nature is defined as a Type II error.

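Probabilities can be associated with the Type I and Type II errors because the conclusion is based on random variables. The probability of making a Type I error is denoted by $\alpha$ (the Greek letter alpha), that is

$$\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true}) \qquad (4.2)$$

The probability of making a Type II error is denoted by $\beta$ (the Greek letter beta), that is

$$\beta = P(\text{Type II error}) = P(\text{fail to reject } H_0 \mid H_0 \text{ is false}) \qquad (4.3)$$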
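The meaning of the Type I error probability can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: samples are drawn from a normal population for which the null hypothesis is true, a two-sided z test is applied to each sample, and the function name `type_i_error_rate` and all numerical settings are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

def type_i_error_rate(mu0, sigma, n, alpha, trials, seed=0):
    """Estimate how often a two-sided z test rejects a true H0: mu = mu0."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # H0 is true here: the data really come from N(mu0, sigma^2)
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1  # rejecting a true H0 is a Type I error
    return rejections / trials

rate = type_i_error_rate(mu0=6.0, sigma=0.5, n=20, alpha=0.05, trials=20000)
print(rate)  # should be close to alpha
```

The observed proportion of wrong rejections should settle near the chosen significance level as the number of trials grows.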

A decision will be made only when we know the probability of making the error that corresponds to that conclusion. When $\alpha$ is specified, we are able to reject $H_0$ (accept $H_1$) if the test statistic is in the rejection region. However, when $\beta$ is not specified, we should avoid the decision to accept $H_0$; instead, if the sample evidence does not support rejection, we should state that the sample evidence is insufficient to reject $H_0$. A Type I error is considered more important than a Type II error because we want to guard against wrongly rejecting a null hypothesis that is in fact true more than against the other type of error.

A procedure leading to a decision about a particular hypothesis is called a test of a

hypothesis. The general procedure used for testing a hypothesis is as follows:

1. Identify the parameter of interest.

2. Formulate a null hypothesis and an alternative hypothesis.

3. Choose a significance level $\alpha$.

4. Determine the distribution and state the rejection region of the test statistic.

5. Specify an appropriate test statistic and calculate the value of the test statistic from a random sample of data.

6. Decide whether to reject $H_0$ or fail to reject $H_0$ by comparing the calculated value of the test statistic with the values in the critical region.

Steps 1 to 4 should be completed prior to calculation of the test statistic from a random sample of data. This sequence of steps will be illustrated in subsequent sections.

___________________________________________________________________________

4.2

We now consider the case of hypothesis testing on the mean of a population under the

assumption of normality. The tests are also valid in cases where only approximate normality

exists. If it is not normal then the conditions of the central limit theorem apply.

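Suppose we wish to test the hypothesis that a random sample $X_1, X_2, \ldots, X_n$ of size $n$ comes from a population with mean $\mu = \mu_0$, where $\mu_0$ is a specified constant and we have assumed that the population variance $\sigma^2$ is known. Now consider testing the hypothesis

$$H_0: \mu = \mu_0$$

$$H_1: \mu \neq \mu_0 \qquad (4.4)$$

The test statistic is

$$Z_{test} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \qquad (4.5)$$

If the null hypothesis is true, $Z_{test}$ has a standard normal distribution, $N(0, 1)$. When we know the distribution of the test statistic we can locate the critical region to control the Type I error probability at the desired level. In this case we would use the $z_{\alpha/2}$ and $-z_{\alpha/2}$ percentage points as the boundaries of the critical region. We reject $H_0$ if

$$z_{test} > z_{\alpha/2} \qquad (4.6)$$

or

$$z_{test} < -z_{\alpha/2} \qquad (4.7)$$

and we fail to reject $H_0$ if

$$-z_{\alpha/2} \le z_{test} \le z_{\alpha/2} \qquad (4.8)$$

Equations (4.6) and (4.7) define the critical region or rejection region for the test. The Type I error probability for this test procedure is $\alpha$.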

The procedures for testing the mean when the variance is known are summarized in

Table 4.2.

Table 4.2: Testing the mean when variance is known

Test statistic: $Z_{test} = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$

Case   Null hypothesis       Alternative hypothesis    Rejection region
1      $H_0: \mu = \mu_0$    $H_1: \mu \neq \mu_0$     $z_{test} > z_{\alpha/2}$ or $z_{test} < -z_{\alpha/2}$
2      $H_0: \mu = \mu_0$    $H_1: \mu < \mu_0$        $z_{test} < -z_{\alpha}$
3      $H_0: \mu = \mu_0$    $H_1: \mu > \mu_0$        $z_{test} > z_{\alpha}$

___________________________________________________________________________

Example 1

Mobile phones are powered by battery. The output voltage of a power supply used in a mobile phone is an important product characteristic. Specifications require that the mean output voltage must be 6.00 V. We know that the standard deviation of output voltage is $\sigma = 0.5$ V. We decide to specify a Type I error probability or significance level of $\alpha = 0.05$. A random sample of $n = 20$ is collected and gives a sample mean output voltage of $\bar{x} = 6.80$ V.

Solution

We will follow the procedure outlined in Section 4.1 for testing a hypothesis:

1. The parameter of interest is the population mean, $\mu$, the mean output voltage.

2. The null and alternative hypotheses are
$H_0: \mu = 6.00$ V versus $H_1: \mu \neq 6.00$ V

3. $\alpha = 0.05$

4. Reject $H_0$ if $z_{test} > z_{\alpha/2} = 1.96$ or $z_{test} < -z_{\alpha/2} = -1.96$.

5. The test statistic is

$$Z_{test} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

$$z_{test} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{6.80 - 6.00}{0.5/\sqrt{20}} = 7.16$$

6. Since the value $z_{test} = 7.16$ exceeds 1.96, we reject $H_0: \mu = 6.00$ at the 0.05 level of significance. We can statistically conclude that the mean output voltage differs from 6 V, based on a sample of 20 measurements.
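The six steps of this example can be sketched as a small Python function. The name `z_test_mean` and the labels used for the `alternative` argument are assumptions made for illustration; the critical values are computed with the standard library instead of being read from a table.

```python
import math
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n, alpha, alternative="two-sided"):
    """z test for a mean with known sigma. Returns (z_test, reject_H0)."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    std_normal = NormalDist()
    if alternative == "two-sided":   # H1: mu != mu0
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    elif alternative == "less":      # H1: mu < mu0
        reject = z < -std_normal.inv_cdf(1 - alpha)
    elif alternative == "greater":   # H1: mu > mu0
        reject = z > std_normal.inv_cdf(1 - alpha)
    else:
        raise ValueError("alternative must be 'two-sided', 'less' or 'greater'")
    return z, reject

# Output voltage data: xbar = 6.80, mu0 = 6.00, sigma = 0.5, n = 20
z, reject = z_test_mean(6.80, 6.00, 0.5, 20, alpha=0.05)
print(round(z, 2), reject)
```

The `alternative` argument selects which of the three rejection regions is used, so the same function covers the one-sided tests as well.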

Suppose that we specify the hypotheses as

$$H_0: \mu = \mu_0$$

$$H_1: \mu < \mu_0 \qquad (4.9)$$

where the alternative hypothesis is one-sided. In defining the critical region for this test, we observe that a positive value of the test statistic $Z_{test}$ would never lead us to conclude that $H_0: \mu = \mu_0$ is false. Therefore, we would place the critical region in the lower tail of the standard normal distribution and reject $H_0$ if the calculated value $z_{test}$ is too small. We would reject $H_0$ if

$$z_{test} < -z_{\alpha}$$

Similarly, to test

$$H_0: \mu = \mu_0$$

$$H_1: \mu > \mu_0 \qquad (4.10)$$

we observe that a negative value of the test statistic $Z_{test}$ would never lead us to conclude that $H_0: \mu = \mu_0$ is false. Therefore, we would place the critical region in the upper tail of the standard normal distribution and reject $H_0$ if the calculated value of $z_{test}$ is too large. We would reject $H_0$ if

$$z_{test} > z_{\alpha}$$

_________________________________________________________________________

Task 1

A manufacturer claims that the battery life of model Z1 exceeds 90.0 hours. The life in hours of a battery is known to be approximately normally distributed, with standard deviation $\sigma = 8.5$ hours. A random sample of 18 batteries has a mean life of $\bar{x} = 95.5$ hours. Is there evidence to support the claim? Use $\alpha = 0.01$.

$z_{test} = 2.7452$; reject $H_0$

__________________________________________________________________________________

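$S^2$. If $n$ is large (normally $n \ge 30$) we can proceed to use the test procedure based on the normal distribution

$$Z_{test} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

with $\sigma$ replaced by $S$. However, when $n$ is small and the population is normally distributed, the statistic

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

has a $t$ distribution with $n - 1$ degrees of freedom.

Now consider testing the hypotheses in Equation (4.4). We will use the test statistic

$$T_{test} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

If $H_0$ is true, $T_{test}$ has a $t$ distribution with $n - 1$ degrees of freedom and we can locate the critical region to control the Type I error probability at the desired level. In this case we would use $t_{\alpha/2,\,n-1}$ and $-t_{\alpha/2,\,n-1}$ as the boundaries of the critical regions to reject $H_0: \mu = \mu_0$ if

$$t_{test} > t_{\alpha/2,\,n-1} \qquad (4.11)$$

or

$$t_{test} < -t_{\alpha/2,\,n-1} \qquad (4.12)$$

and we fail to reject $H_0$ if

$$-t_{\alpha/2,\,n-1} \le t_{test} \le t_{\alpha/2,\,n-1} \qquad (4.13)$$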

Table 4.3: Testing the mean when variance is unknown and n < 30

Test statistic: $T_{test} = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$, $v = n - 1$ degrees of freedom

Case   Null hypothesis       Alternative hypothesis    Rejection region
1      $H_0: \mu = \mu_0$    $H_1: \mu \neq \mu_0$     $t_{test} > t_{\alpha/2,\,n-1}$ or $t_{test} < -t_{\alpha/2,\,n-1}$
2      $H_0: \mu = \mu_0$    $H_1: \mu < \mu_0$        $t_{test} < -t_{\alpha,\,n-1}$
3      $H_0: \mu = \mu_0$    $H_1: \mu > \mu_0$        $t_{test} > t_{\alpha,\,n-1}$

Equations (4.11) and (4.12) define the critical region or rejection region for the test. The Type I error probability for this test procedure is $\alpha$.

The procedures for testing the mean when the variance is unknown are summarized in Table 4.3.

Table 4.2 and Table 4.3 are very similar except that $T_{test}$ is used as the test statistic instead of $Z_{test}$. Also, we use the $t$ distribution to define the critical region instead of using the standard normal distribution.

_____________________________________________________________________

Example 2

Referring to Example 1, suppose that the true variance is unknown. Ten determinations of the output voltage of a power supply yielded the following values:

6.05  6.06  6.03  5.95  6.00  5.98  6.04  5.98  6.02  6.03

Can we say that the average output voltage is equal to 6.00 V? Assume that the data are approximately normal.

Solution

The solution using the outline in Section 4.1 is as follows:

1. The parameter of interest is the population mean, $\mu$, the mean output voltage.

2. The null and alternative hypotheses are
$H_0: \mu = 6.00$ V versus $H_1: \mu \neq 6.00$ V

3. $\alpha = 0.05$

4. Reject $H_0$ if $t_{test} > t_{\alpha/2,\,n-1} = 2.262$ or $t_{test} < -t_{\alpha/2,\,n-1} = -2.262$.

5. The test statistic is

$$T_{test} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

With $\bar{x} = 6.014$ V and $s = 0.0353$ V,

$$t_{test} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{6.014 - 6.00}{0.0353/\sqrt{10}} = 1.254$$

6. Since the value $t_{test} = 1.254$ is between $-2.262$ and $2.262$, we are unable to reject $H_0: \mu = 6.00$, and there is no strong evidence to indicate that the mean output voltage is not equal to 6.00 V at the 0.05 level of significance. Based on a sample of 10 measurements, the data are consistent with a mean output voltage of 6.00 V.
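The calculation in this example can be reproduced with a short Python sketch. The function name `t_statistic` is hypothetical, and the critical value 2.262 is taken from the solution above rather than computed, since the Python standard library has no t quantile function.

```python
import math
from statistics import mean, stdev

def t_statistic(data, mu0):
    """One-sample t statistic (xbar - mu0) / (s / sqrt(n))."""
    n = len(data)
    return (mean(data) - mu0) / (stdev(data) / math.sqrt(n))

voltages = [6.05, 6.06, 6.03, 5.95, 6.00, 5.98, 6.04, 5.98, 6.02, 6.03]
t_test = t_statistic(voltages, 6.00)

t_crit = 2.262  # t_{0.025, 9}, from the t table used in the solution
reject = abs(t_test) > t_crit
print(round(t_test, 2), reject)
```

The third decimal place of the statistic may differ slightly from the hand calculation because the code does not round the sample standard deviation before dividing.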

___________________________________________________________________________

Task 2

Suppose you are a buyer of large supplies of mobile phone batteries. You want to test the manufacturer's claim that his mobile phone batteries last more than 900 hours. You test 40 batteries and find that the sample mean is 922 hours and the sample standard deviation is 68 hours. Should you accept the claim? Use $\alpha = 0.05$.

$z_{test} = 2.0462$; reject $H_0$

___________________________________________________________________________

Task 3

A manufacturer of transistors claims that its transistors will last an average of 2100 hours. To maintain this average, 20 transistors are tested each month. What conclusions should be drawn from a sample that has a mean of 2140 hours and a sample standard deviation of 87 hours? Assume that the distribution of the lifetime of the transistors is normal. Use $\alpha = 0.01$.

t test

_______________________________________________________________________________

4.3

Hypothesis tests on the population variance or standard deviation are as important as tests on the population mean. For example, we may wish to test whether a random sample is drawn from a normal population with a specific known variance, say $\sigma_0^2$, or equivalently, that the standard deviation is equal to $\sigma_0$. To test

$$H_0: \sigma^2 = \sigma_0^2$$

$$H_1: \sigma^2 \neq \sigma_0^2 \qquad (4.14)$$

If the null hypothesis $H_0: \sigma^2 = \sigma_0^2$ is true, the test statistic used is that given by the random variable

$$\chi^2 = \frac{(n-1)S^2}{\sigma_0^2} \qquad (4.15)$$

which has a chi-square, $\chi^2$, distribution with $n - 1$ degrees of freedom. We will use the test statistic

$$\chi^2_{test} = \frac{(n-1)s^2}{\sigma_0^2} \qquad (4.16)$$

and reject $H_0$ if $\chi^2_{test} > \chi^2_{\alpha/2,\,n-1}$ or $\chi^2_{test} < \chi^2_{1-\alpha/2,\,n-1}$, where $\chi^2_{\alpha,\,n-1}$ is the upper $100\alpha$ percentage point of the chi-square distribution with $n - 1$ degrees of freedom. Table 4.4 summarizes the critical regions needed for each of the three cases.

Table 4.4: Testing the variance, $\sigma^2$

Test statistic: $\chi^2_{test} = \dfrac{(n-1)S^2}{\sigma_0^2}$, $v = n - 1$ degrees of freedom

Case   Null hypothesis                  Alternative hypothesis              Rejection region
1      $H_0: \sigma^2 = \sigma_0^2$     $H_1: \sigma^2 \neq \sigma_0^2$     $\chi^2_{test} < \chi^2_{1-\alpha/2,\,n-1}$ or $\chi^2_{test} > \chi^2_{\alpha/2,\,n-1}$
2      $H_0: \sigma^2 = \sigma_0^2$     $H_1: \sigma^2 < \sigma_0^2$        $\chi^2_{test} < \chi^2_{1-\alpha,\,n-1}$
3      $H_0: \sigma^2 = \sigma_0^2$     $H_1: \sigma^2 > \sigma_0^2$        $\chi^2_{test} > \chi^2_{\alpha,\,n-1}$

Example 3

A drilling machine is used to drill metal plates used in batteries. A random sample of 25 plates results in a sample variance of hole diameter of $s^2 = 1.82$ mm². If the variance of hole diameter exceeds 1.00 mm², the drilling machine must be serviced. Is there evidence that the machine needs to be serviced? Use $\alpha = 0.01$, and assume that hole diameter has a normal distribution.

Solution

The solution using the outline in Section 4.1 is as follows:

1. The parameter of interest is the population variance, $\sigma^2$, the variance of hole diameter.

2. The null hypothesis and alternative hypothesis are
$H_0: \sigma^2 = 1.00$ mm² versus $H_1: \sigma^2 > 1.00$ mm²

3. $\alpha = 0.01$

Refer from Table 6 of Lee (2004).

5. The test statistic is

$$Z_{test} = \frac{P - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$$

Since $p = \dfrac{10}{200} = 0.05$,

$$z_{test} = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} = \frac{0.05 - 0.03}{\sqrt{(0.03)(0.97)/200}} = 1.6581$$

6. Since the value $z_{test} = 1.6581$ is between $-1.96$ and $1.96$, we are unable to reject $H_0: \pi = 0.03$, and there is no strong evidence to indicate that the percentage of defectives is not equal to 3% at the 0.05 level of significance. The data are consistent with a percentage of defective components of 3%.
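The test statistic for a proportion can be sketched in Python as follows. The function name `z_test_proportion` is an assumption for illustration; the two-sided decision at the 0.05 level uses a critical value computed with the standard library.

```python
import math
from statistics import NormalDist

def z_test_proportion(x, n, pi0):
    """z statistic for H0: pi = pi0, using the standard error under H0."""
    p = x / n
    return (p - pi0) / math.sqrt(pi0 * (1 - pi0) / n)

# 10 defectives in a sample of 200, hypothesised proportion 0.03
z = z_test_proportion(10, 200, 0.03)
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)  # about 1.96
print(round(z, 4), abs(z) > z_crit)
```

Note that the standard error is computed from the hypothesised value $\pi_0$, not from the sample proportion, because the distribution of the statistic is evaluated under the assumption that $H_0$ is true.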

___________________________________________________________________________

For small

probabilities.

___________________________________________________________________________

Task 6

An electrical company claimed that at least 90% of the parts which they supplied on a government contract conformed to specifications. A sample of 280 parts was tested, and 35 did not meet specifications. Can we accept the company's claim at a 0.05 level of significance?

z test

___________________________________________________________________________

Task 7

The manufacturer of electronic devices informed his buyer about the proportion of defective devices in its shipments. He claims that the proportion of all devices that are defective is less than 6%. A random sample of 100 electronic devices indicates that 5 are defective. Using $\alpha = 0.05$, test whether the buyer will accept the manufacturer's claim or not.

z test

______________________________

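4.5.1

Variance known

Let $X_{11}, X_{12}, \ldots, X_{1n_1}$ be a random sample of $n_1$ observations from the first population and $X_{21}, X_{22}, \ldots, X_{2n_2}$ be a random sample of $n_2$ observations from the second population, and assume that both population variances $\sigma_1^2$ and $\sigma_2^2$ are known. The test statistic used to test $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 \neq 0$ is the standard normal random variable

$$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

Because $Z$ has the standard normal distribution when $H_0$ is true, we would take $z_{\alpha/2}$ and $-z_{\alpha/2}$ as the boundaries of the critical region. This result and two other cases are included in Table 4.6.

Table 4.6: Testing $\mu_1 - \mu_2$ when variances $\sigma_1^2$ and $\sigma_2^2$ are known

Test statistic: $Z = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$

Case   Null hypothesis             Alternative hypothesis         Rejection region
1      $H_0: \mu_1 - \mu_2 = 0$    $H_1: \mu_1 - \mu_2 \neq 0$    $z_{test} > z_{\alpha/2}$ or $z_{test} < -z_{\alpha/2}$
2      $H_0: \mu_1 - \mu_2 = 0$    $H_1: \mu_1 - \mu_2 < 0$       $z_{test} < -z_{\alpha}$
3      $H_0: \mu_1 - \mu_2 = 0$    $H_1: \mu_1 - \mu_2 > 0$       $z_{test} > z_{\alpha}$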

___________________________________________________________________________

Task 8

A manufacturer is comparing the settings of two machines, M1 and M2, which should produce rods of the same length. Both have, over a long period, given rods whose lengths were normally distributed with variance 37 cm². Although the two machines are supposed to give the same length of rod, he suspects that this is not so. Examine this suspicion, if the total length of 15 rods from M1 is 513 cm, and the total length of 20 rods from M2 is 575 cm. Use $\alpha = 0.05$.

$z_{test} = 2.6231$; reject $H_0$
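The two-sample test statistic for known variances can be sketched in Python. The function name `z_test_two_means` and the parameter `delta0` (the hypothesised difference, taken as 0 here) are assumptions for illustration; the data are the rod totals from this task.

```python
import math
from statistics import NormalDist

def z_test_two_means(xbar1, xbar2, var1, var2, n1, n2, delta0=0.0):
    """z statistic for H0: mu1 - mu2 = delta0 with known variances."""
    se = math.sqrt(var1 / n1 + var2 / n2)
    return (xbar1 - xbar2 - delta0) / se

# Sample means from the totals: 513 cm over 15 rods, 575 cm over 20 rods
xbar1, xbar2 = 513 / 15, 575 / 20
z = z_test_two_means(xbar1, xbar2, 37, 37, 15, 20)

z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-sided test at alpha = 0.05
print(round(z, 4), abs(z) > z_crit)
```

A two-sided rejection region is used because the suspicion is only that the machines differ, not that one specific machine gives longer rods.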

________________________________________________________________________________

4.5.2 Variance unknown

If the sample sizes $n_1$ and $n_2$ are large (commonly, greater than or equal to 30), the normal distribution procedures in Section 4.5.1 can be used, replacing $\sigma_1^2$ and $\sigma_2^2$ with $S_1^2$ and $S_2^2$, respectively.

However, when the sample sizes $n_1$ and $n_2$ are small (commonly, $n < 30$) and the populations are normally distributed, our hypothesis testing will be based on the t distribution. Two different assumptions must be treated. Firstly, we assume that the variances of the two normal distributions are unknown but equal, $\sigma_1^2 = \sigma_2^2 = \sigma^2$. Secondly, we assume that the variances of the two normal distributions are unknown and not equal, $\sigma_1^2 \neq \sigma_2^2$.

(i) when $\sigma_1^2 = \sigma_2^2 = \sigma^2$

The variance of $\bar{X}_1 - \bar{X}_2$ is

$$\mathrm{Var}(\bar{X}_1 - \bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} = \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2} = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$$

and the standardized statistic is

$$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

Since $\sigma$ is unknown, we replace it with $S_p$, the pooled estimator of $\sigma$. The pooled estimator of $\sigma^2$, denoted by $S_p^2$, is defined by

$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$

Test statistic is

$$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

The procedure for testing $\mu_1 - \mu_2$ when variances $\sigma_1^2$ and $\sigma_2^2$ are unknown but equal is summarized in Table 4.7.

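Table 4.7: Testing $\mu_1 - \mu_2$ when variances $\sigma_1^2$ and $\sigma_2^2$ are unknown but equal

Test statistic: $T = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p\sqrt{1/n_1 + 1/n_2}}$, with $v = n_1 + n_2 - 2$ degrees of freedom and $\mu_1 - \mu_2$ set to the hypothesized difference $\Delta_0$

Case   Null hypothesis                   Alternative hypothesis               Rejection region
1      $H_0: \mu_1 - \mu_2 = \Delta_0$   $H_1: \mu_1 - \mu_2 \neq \Delta_0$   $t_{test} < -t_{\alpha/2, v}$ or $t_{test} > t_{\alpha/2, v}$
2      $H_0: \mu_1 - \mu_2 = \Delta_0$   $H_1: \mu_1 - \mu_2 > \Delta_0$      $t_{test} > t_{\alpha, v}$
3      $H_0: \mu_1 - \mu_2 = \Delta_0$   $H_1: \mu_1 - \mu_2 < \Delta_0$      $t_{test} < -t_{\alpha, v}$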

Example 5

A researcher wants to prove that brand X size AAA batteries last an average of at least 30 minutes longer than brand Y. Two normally distributed independent random samples of 10 of each brand are selected, and the batteries are run continuously until they are no longer functional. The sample mean life for brand X is found to be $\bar{x}_1 = 328$ minutes, and the sample standard deviation is $s_1 = 46$ minutes. The results for the brand Y batteries are $\bar{x}_2 = 472$ minutes and $s_2 = 52$ minutes. Is there evidence that brand X batteries last at least 30 minutes longer than brand Y batteries of the same size? Use $\alpha = 0.05$ and assume the two population variances are equal.

Solution

1. The parameters of interest are $\mu_1$ and $\mu_2$, the mean life of batteries.

2. H 0 : 1 2 30 versus H 1 : 1 2 30

3. $\alpha = 0.05$.

4. Reject H 0 if t test t

, n1 n2 2

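5. The pooled variance is

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(10 - 1)46^2 + (10 - 1)52^2}{10 + 10 - 2} = 2410$$

so that $s_p = \sqrt{2410} = 49.0918$, and the test statistic is

$$t_{test} = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{328 - 472 - 30}{49.0918\sqrt{\dfrac{1}{10} + \dfrac{1}{10}}} = -7.9255$$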

6.

brand X batteries last at least 30 minutes longer than brand Y batteries of the same size

___________________________________________________________________________
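The pooled-variance computation of Example 5 can be reproduced with a few lines of code. This is a minimal sketch under the equal-variance assumption of case (i); the function name `pooled_t` is illustrative and not from the text, and the critical value must still be read from the t table with $v = n_1 + n_2 - 2$ degrees of freedom.

```python
import math


def pooled_t(mean1, s1, n1, mean2, s2, n2, delta0=0.0):
    """Return (pooled variance, t statistic, degrees of freedom).

    Assumes two independent normal samples with equal population variances.
    """
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    standard_error = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t_stat = (mean1 - mean2 - delta0) / standard_error
    return sp2, t_stat, n1 + n2 - 2


# Example 5: brand X (mean 328, s = 46) versus brand Y (mean 472, s = 52),
# ten batteries each, hypothesized difference of 30 minutes.
sp2, t_test, df = pooled_t(328, 46, 10, 472, 52, 10, delta0=30)

print(f"sp^2 = {sp2:.0f}, sp = {math.sqrt(sp2):.4f}, t_test = {t_test:.4f}, v = {df}")
```

Note that with the sample means as printed in the example, the statistic is negative.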

Task 9

A problem solving test was given to two groups of 35 and 40 engineers, respectively. In the first group the mean score was 82 with a standard deviation of 5, while in the second group the mean score was 77 with a standard deviation of 10. Is there a significant difference between the performances of the two groups at the 5% level of significance? Assume the two population variances are equal.

$z_{test} = 2.6780$; reject $H_0$

___________________________________________________________________________

Task 10

An experiment is done to test the strength of two types of rock climbing ropes, namely R1 and R2. A sample of 15 pieces of rope R1 has a mean strength of 200 kg and a standard deviation of 5 kg. A sample of 10 pieces of rope R2 has a mean strength of 188 kg and a standard deviation of 6 kg. Assume the two population variances are equal. Test whether the mean strength of R1 is greater than that of R2 at the 1% level of significance.

$t_{test} = 5.4299$; reject $H_0$

_________________________________________________________________________________

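(ii) when $\sigma_1^2 \neq \sigma_2^2$

When we cannot assume the unknown variances $\sigma_1^2$ and $\sigma_2^2$ are equal, there is no exact test statistic for testing $H_0: \mu_1 - \mu_2 = \Delta_0$. However, if $H_0: \mu_1 - \mu_2 = \Delta_0$ is true, the statistic

$$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$

is distributed approximately as t with degrees of freedom given by

$$v = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{\left(S_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(S_2^2/n_2\right)^2}{n_2 - 1}} \qquad (4.17)$$

The procedure for testing $\mu_1 - \mu_2$ when variances $\sigma_1^2$ and $\sigma_2^2$ are unknown and unequal is summarized in Table 4.8.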

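Table 4.8: Testing $\mu_1 - \mu_2$ when variances $\sigma_1^2$ and $\sigma_2^2$ are unknown and unequal

Test statistic: $T = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$, with $v$ degrees of freedom from equation (4.17) and $\mu_1 - \mu_2$ set to the hypothesized difference $\Delta_0$

Case   Null hypothesis                   Alternative hypothesis               Rejection region
1      $H_0: \mu_1 - \mu_2 = \Delta_0$   $H_1: \mu_1 - \mu_2 \neq \Delta_0$   $t_{test} < -t_{\alpha/2, v}$ or $t_{test} > t_{\alpha/2, v}$
2      $H_0: \mu_1 - \mu_2 = \Delta_0$   $H_1: \mu_1 - \mu_2 > \Delta_0$      $t_{test} > t_{\alpha, v}$
3      $H_0: \mu_1 - \mu_2 = \Delta_0$   $H_1: \mu_1 - \mu_2 < \Delta_0$      $t_{test} < -t_{\alpha, v}$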

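Example 6

A scientist wants to determine how two catalysts will affect the mean yield of a chemical process. Two normally distributed independent random samples of $n_1 = 12$ for catalyst C1 and $n_2 = 10$ for catalyst C2 are selected. The sample mean yield for catalyst C1 is found to be $\bar{x}_1 = 152.25$ and the sample standard deviation is $s_1 = 3.44$. The results for catalyst C2 are $\bar{x}_2 = 150.85$ and $s_2 = 3.72$. Is there any difference between the mean yields? Use $\alpha = 0.01$ and assume the two population variances are unequal.

Solution

1. The parameters of interest are $\mu_1$ and $\mu_2$, the mean process yield.

2. $H_0: \mu_1 - \mu_2 = 0$ (or $H_0: \mu_1 = \mu_2$) versus $H_1: \mu_1 - \mu_2 \neq 0$ (or $H_1: \mu_1 \neq \mu_2$).

3. $\alpha = 0.01$.

4. We have $s_1 = 3.44$, $s_2 = 3.72$, $n_1 = 12$, $n_2 = 10$. The degrees of freedom of $t_{test}$ are found from equation (4.17) as

$$v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}} = \frac{\left(\dfrac{3.44^2}{12} + \dfrac{3.72^2}{10}\right)^2}{\dfrac{\left(3.44^2/12\right)^2}{12 - 1} + \dfrac{\left(3.72^2/10\right)^2}{10 - 1}} = 18.6489 \approx 19$$

Therefore, we reject $H_0$ if $t_{test} < -t_{\alpha/2, v}$ or $t_{test} > t_{\alpha/2, v}$, where $t_{\alpha/2, v} = t_{0.005, 19} = 2.861$.

5. The test statistic is

$$t_{test} = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{152.25 - 150.85 - 0}{\sqrt{\dfrac{3.44^2}{12} + \dfrac{3.72^2}{10}}} = 0.9094$$

6. Since $t_{test} = 0.9094$ is less than 2.861, we fail to reject $H_0$. We conclude that there is no evidence of a difference between the mean yields.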

___________________________________________________________________________
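The approximate degrees of freedom in equation (4.17) are tedious to evaluate by hand. The sketch below, with the illustrative name `welch_t` (not from the text), computes both the test statistic and $v$ for the unequal-variance case and applies them to the catalyst data of Example 6. The critical value 2.861 is the tabulated $t_{0.005, 19}$ quoted in the example.

```python
import math


def welch_t(mean1, s1, n1, mean2, s2, n2, delta0=0.0):
    """Return (t statistic, approximate degrees of freedom) for unequal variances."""
    a = s1**2 / n1  # estimated variance of the first sample mean
    b = s2**2 / n2  # estimated variance of the second sample mean
    t_stat = (mean1 - mean2 - delta0) / math.sqrt(a + b)
    v = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))
    return t_stat, v


# Example 6: catalyst C1 (n = 12) versus catalyst C2 (n = 10).
t_test, v = welch_t(152.25, 3.44, 12, 150.85, 3.72, 10)

t_crit = 2.861  # t_{0.005, 19} from the t table, as used in Example 6
reject_h0 = abs(t_test) > t_crit

print(f"t_test = {t_test:.4f}, v = {v:.4f}, reject H0: {reject_h0}")
```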

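4.6

Suppose that two independent random samples of sizes $n_1$ and $n_2$ are taken from two large populations and that $X_1$ and $X_2$ represent the observed number of successes in $n_1$ and $n_2$ trials, respectively. Then $\hat{P}_1 = X_1/n_1$ and $\hat{P}_2 = X_2/n_2$ are the observed proportions of successes. $\hat{P}_1$ is approximately normal with mean $\pi_1$ and variance $\pi_1(1 - \pi_1)/n_1$, if $n_1$ is relatively large and $\pi_1$ is not too close to either 0 or 1. As a rule of thumb, both $n_1\pi_1$ and $n_1(1 - \pi_1)$ must be greater than or equal to 5 to make use of the normal approximation to the binomial distribution. Similarly, this applies to $\hat{P}_2$.

To test the hypotheses

$H_0: \pi_1 = \pi_2$
$H_1: \pi_1 \neq \pi_2$

we use the statistic

$$Z = \frac{\hat{P}_1 - \hat{P}_2 - (\pi_1 - \pi_2)}{\sqrt{\dfrac{\hat{P}_1(1 - \hat{P}_1)}{n_1} + \dfrac{\hat{P}_2(1 - \hat{P}_2)}{n_2}}} \qquad (4.18)$$

When $H_0$ is true, we can substitute $\pi_1 = \pi_2 = \pi$ in the preceding formula for $Z$ to give the form

$$Z = \frac{\hat{P}_1 - \hat{P}_2}{\sqrt{\hat{P}(1 - \hat{P})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where

$$\hat{P} = \frac{X_1 + X_2}{n_1 + n_2}$$

is a pooled estimate of $\pi$. The statistic $Z$ is distributed approximately $N(0, 1)$.

Table 4.9: Testing $\pi_1 - \pi_2$

Test statistic: $Z = \dfrac{\hat{P}_1 - \hat{P}_2}{\sqrt{\hat{P}(1 - \hat{P})(1/n_1 + 1/n_2)}}$

Case   Null hypothesis          Alternative hypothesis      Rejection region
1      $H_0: \pi_1 - \pi_2 = 0$   $H_1: \pi_1 - \pi_2 \neq 0$   $z_{test} < -z_{\alpha/2}$ or $z_{test} > z_{\alpha/2}$
2      $H_0: \pi_1 - \pi_2 = 0$   $H_1: \pi_1 - \pi_2 > 0$      $z_{test} > z_{\alpha}$
3      $H_0: \pi_1 - \pi_2 = 0$   $H_1: \pi_1 - \pi_2 < 0$      $z_{test} < -z_{\alpha}$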

__________________________________________________________________________________

Example 7

A usual medication was given to a random sample of 180 patients from district A who have high fever. A new medication was given to a random sample of 200 patients from district B who also have high fever. If 144 and 180 patients recover from the fever, respectively, does the new medication cure the fever better? Use $\alpha = 0.05$.

Solution

1. The parameters of interest are $\pi_1$ and $\pi_2$, the proportions of patients who recover with the usual medication and the new medication, respectively.

2. $H_0: \pi_1 = \pi_2$ versus $H_1: \pi_1 < \pi_2$.

3. $\alpha = 0.05$

4. We reject $H_0$ if $z_{test} < -z_{\alpha} = -z_{0.05} = -1.6449$. Refer to Table 6 of Lee (2004).

5. We have

$$\hat{P}_1 = \frac{144}{180} = 0.80, \qquad \hat{P}_2 = \frac{180}{200} = 0.90, \qquad \hat{P} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{144 + 180}{180 + 200} = 0.8526$$

$$z_{test} = \frac{\hat{P}_1 - \hat{P}_2}{\sqrt{\hat{P}(1 - \hat{P})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.80 - 0.90}{\sqrt{(0.8526)(0.1474)\left(\dfrac{1}{180} + \dfrac{1}{200}\right)}} = -2.7456$$

6. Since $z_{test} = -2.7456$ is less than $-1.6449$, we reject $H_0$. There is strong evidence indicating that the new medication cures the fever better.

___________________________________________________________________________
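The pooled two-proportion statistic of Example 7 can be sketched as follows. The function name `two_proportion_z` is illustrative only, and the critical value comes from the standard-library normal distribution instead of Table 6 of Lee (2004). Because the code does not round the pooled proportion, its result differs from the hand calculation in the fourth decimal place.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Return (pooled proportion, z statistic) for H0: pi1 = pi2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return pooled, (p1 - p2) / standard_error


alpha = 0.05
# Lower-tailed test (H1: pi1 < pi2), so the boundary is -z_alpha.
z_lower = NormalDist().inv_cdf(alpha)

# Example 7: 144 of 180 recover with the usual medication, 180 of 200 with the new one.
pooled, z_test = two_proportion_z(144, 180, 180, 200)
reject_h0 = z_test < z_lower

print(f"pooled = {pooled:.4f}, z_test = {z_test:.4f}, boundary = {z_lower:.4f}")
```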

Task 11

A random sample of 150 students of UTM found that 102 were in favor of a new grading system, while another sample of 180 students of UKM found that 108 were in favor of the new system. Do the results indicate a significant difference in the proportion of UTM and UKM students who favor the new grading system? Use $\alpha = 0.01$.

z test

___________________________________________________________________________

Task 12

A geneticist is interested in the proportion of males and females in a population that have a certain minor blood disorder. He did a survey by taking a random sample of 100 males and 100 females. 31 of the males are found to be afflicted, whereas only 24 of the females appear to have the disorder. Can we conclude that the proportion of men in the population afflicted with this blood disorder is significantly greater than the proportion of women afflicted? Use level of significance $\alpha = 0.01$.

z test

___________________________________________________________________________

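4.7

Suppose a random sample of size $n_1$ is taken from a normal population with mean $\mu_1$ and variance $\sigma_1^2$, and a random sample of size $n_2$ is taken from a second normal population with mean $\mu_2$ and variance $\sigma_2^2$. Assume that both populations are independent. Let $S_1^2$ and $S_2^2$ be the sample variances. Then the ratio

$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \qquad (4.19)$$

has an F distribution with $n_1 - 1$ numerator degrees of freedom and $n_2 - 1$ denominator degrees of freedom. Under $H_0: \sigma_1^2 = \sigma_2^2$, the ratio reduces to $F = S_1^2/S_2^2$.

Table 4.10 summarizes the critical regions needed for each of the possible alternative hypotheses.

Table 4.10: Testing of ratio of two variances

Test statistic: $F = \dfrac{S_1^2}{S_2^2}$, with $v_1 = n_1 - 1$ and $v_2 = n_2 - 1$ degrees of freedom

Case   Null hypothesis                  Alternative hypothesis              Rejection region
1      $H_0: \sigma_1^2 = \sigma_2^2$   $H_1: \sigma_1^2 \neq \sigma_2^2$   $F_{test} < F_{1-\alpha/2, v_1, v_2}$ or $F_{test} > F_{\alpha/2, v_1, v_2}$
2      $H_0: \sigma_1^2 = \sigma_2^2$   $H_1: \sigma_1^2 > \sigma_2^2$      $F_{test} > F_{\alpha, v_1, v_2}$
3      $H_0: \sigma_1^2 = \sigma_2^2$   $H_1: \sigma_1^2 < \sigma_2^2$      $F_{test} < F_{1-\alpha, v_1, v_2}$

Table 9 in Lee (2004) contains only upper-tail percentage points of the F distribution. If we need the lower-tail percentage points $f_{1-\alpha, v_1, v_2}$, we use

$$f_{1-\alpha, v_1, v_2} = \frac{1}{f_{\alpha, v_2, v_1}} \qquad (4.20)$$

For example,

$$f_{0.999, 6, 12} = \frac{1}{f_{0.001, 12, 6}} = \frac{1}{17.99} = 0.0556$$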

___________________________________________________________________________

Example 8

Company A and company B can supply chemical material. The mean concentration for both companies is the same, but we suspect that the variability in concentration may differ between the two companies. The variance of concentration in a random sample of $n_1 = 8$ by company A yields $s_1^2 = 12.4$ grams per liter, while for company B, a random sample of $n_2 = 10$ yields $s_2^2 = 13.8$ grams per liter. Is there sufficient evidence to conclude that the two population variances differ? We assume that concentration is a normal random variable for both companies. Use $\alpha = 0.02$.

Solution

The solution using the procedure outlined in Section 4.1 is as follows:

1. The parameters of interest are the variances of chemical concentration, $\sigma_1^2$ and $\sigma_2^2$.

2. The null hypothesis and alternative hypothesis are
$H_0: \sigma_1^2 = \sigma_2^2$ versus $H_1: \sigma_1^2 \neq \sigma_2^2$

3. $\alpha = 0.02$

4. Reject $H_0$ if

$$f_{test} < f_{1-\alpha/2, v_1, v_2} = f_{1-0.02/2,\, 8-1,\, 10-1} = f_{0.99, 7, 9} = \frac{1}{f_{0.01, 9, 7}} = \frac{1}{6.72} = 0.1488$$

or if

$$f_{test} > f_{\alpha/2, v_1, v_2} = f_{0.02/2,\, 8-1,\, 10-1} = f_{0.01, 7, 9} = 5.61$$

5. The test statistic is

$$f_{test} = \frac{s_1^2}{s_2^2} = \frac{12.4}{13.8} = 0.8986$$

6. Since the value $f_{test} = 0.8986$ is between 0.1488 and 5.61, we are unable to reject $H_0: \sigma_1^2 = \sigma_2^2$ at the 0.02 level of significance. Therefore, there is no strong evidence to indicate that the two population variances differ.

__________________________________________________________________________
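The two-sided variance-ratio test of Example 8, including the reciprocal rule of equation (4.20) for lower-tail points, can be sketched as below. The helper names are illustrative and not from the text. The upper-tail values 6.72 and 5.61 are the tabulated points quoted in the example; they are passed in by hand because the Python standard library has no F distribution.

```python
def lower_tail_point(upper_point_swapped_df):
    """Equation (4.20): f_{1-a, v1, v2} = 1 / f_{a, v2, v1}."""
    return 1.0 / upper_point_swapped_df


def variance_ratio_test(s1_sq, s2_sq, lower_crit, upper_crit):
    """Return (f statistic, reject flag) for H1: sigma1^2 != sigma2^2."""
    f_stat = s1_sq / s2_sq
    return f_stat, (f_stat < lower_crit or f_stat > upper_crit)


# Example 8: s1^2 = 12.4 (n1 = 8), s2^2 = 13.8 (n2 = 10), alpha = 0.02.
f_upper = 5.61                    # f_{0.01, 7, 9} from the F table
f_lower = lower_tail_point(6.72)  # 1 / f_{0.01, 9, 7}

f_test, reject_h0 = variance_ratio_test(12.4, 13.8, f_lower, f_upper)

print(f"f_test = {f_test:.4f}, region: < {f_lower:.4f} or > {f_upper}, reject H0: {reject_h0}")
```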

Task 13

Two types of equipment for measuring the amount of carbon monoxide in the atmosphere are being compared in an air-pollution experiment. It is desired to determine whether the two types of equipment yield measurements having the same variability. A random sample of 10 from equipment E1 has a sample standard deviation of 0.10. A random sample of 16 from equipment E2 has a sample standard deviation of 0.09. Assume the populations of measurements to be approximately normally distributed. Test the hypothesis that $\sigma_{E1}^2 = \sigma_{E2}^2$ against the alternative that $\sigma_{E1}^2 \neq \sigma_{E2}^2$. Use $\alpha = 0.05$.

f test

___________________________________________________________________________

Task 14

The following data represents the times taken by two machines in producing an electrical

part:

Machine

Time (in milliseconds)

_______________________________________________

1

108

86

98

109

92

81

165

97

134

87

114

_______________________________________________

Assuming that the distributions of the times are approximately normal, can we conclude that there is a significant difference in variability of the times in producing an electrical part by machine 1 and machine 2 at $\alpha = 0.05$?

___________________________________________________________________________

EXERCISE 4

1. Test the hypothesis that the random sample
30.4 31.2 30.8 29.9 30.4 30.7 29.9 30.1
came from a normal population with mean 30.5. The standard deviation of the measurements is known to be 0.1. Use $\alpha = 0.05$.

2. A sample of size 60 yielded the values $\bar{x} = 46.7$ and $s^2 = 41.5$. Test the hypothesis that $\mu = 45$ against the alternative that it is greater. Use $\alpha = 0.05$.

3. Repeat question (1) without assuming that the standard deviation is known to be 0.1. In other words, estimate the population variance from the sample measurements. Use $\alpha = 0.05$.

4. A manufacturer claims that the standard mean volume per bottle of shampoo is 250 milliliter. Ten random samples are taken from a batch and the volume per bottle is measured. The ten measurements have a sample mean of 243 milliliter and a standard deviation of 7 milliliter. Assume approximate normality of data. Is this sample mean significantly below the claimed value? Use $\alpha = 0.01$.

5. The standard deviation of the breaking strengths of certain cables produced by a company is given as 240 kg. After a change was introduced in the process of manufacturing of these cables, the breaking strengths of a sample of 8 cables showed a standard deviation of 300 kg. Investigate the significance of the apparent increase in variability. Use $\alpha = 0.01$.

6. A semiconductor company claimed that at least 99% of the electronic components which they export are without defect. A sample of 150 electronic components was tested, and 12 were found with defect. Can we accept the company's claim at a 0.01 level of significance?

7. An opinion survey in district D1 found that 68% of people considered electricity tariffs to be too high. A random sample of 35 people in district D2 were asked the same question, and 21 thought electricity tariffs to be too high. Is this proportion significantly different from that of district D1? Use $\alpha = 0.05$.

__

1

__

1

sample means at the 0.01 level of significance? Assume that the two populations have equal variances.

9. Random samples of 200 screws manufactured by machine A and 100 screws manufactured by machine B showed 19 and 5 defective screws, respectively. Test the hypothesis that
(a) Machine B is performing better than machine A
(b) The two machines are showing different qualities of performance. Use $\alpha = 0.05$.

10. A vote is to be taken to determine whether a new housing area should be constructed. The housing area is near to a county site and also a short distance from a town. To determine if there is a significant difference in the proportion of county voters and town voters favoring the proposal, a poll is taken. In random samples, 93 of 150 county voters favor the proposal and 387 of 450 town voters also favor the proposal. Can we conclude that the proportion of county voters favoring the proposal is lower than the proportion of town voters? Use $\alpha = 0.05$.

11. A sample of males and a sample of females were polled on an issue. 120 of 250 males and 126 of 300 females vote yes on the issue. Can we conclude that more males than females favor the issue? Use $\alpha = 0.02$.

12. Repeat exercise 11 but using $\alpha = 0.10$.

13. Two types of soil namely S1 and S2 at certain district solutions were tested for their gamma radiation dose. A random sample of 6 measurements of S1 showed a mean of 7.52 with a standard deviation of 0.024. A random sample of 5 measurements of S2 showed a mean of 7.49 with a standard deviation of 0.032. Assume both population variances are equal.
(a) Determine whether the two types of soil have different gamma radiation doses. Use $\alpha = 0.05$.
(b) Determine whether the two types of soil differ in the variability of gamma radiation doses. Use $\alpha = 0.01$.

Chapter 5

Chi-Square Tests

Learning Objectives:

At the end of this chapter, students should be able to

(a) apply the goodness-of-fit test.

(b) summarize data in contingency table.

(c) apply the independence test.

(d) apply the homogeneity test.

5.1

Introduction

We have seen in previous chapters that some random variables follow certain distributions such as binomial, Poisson and normal distributions. We either make an assumption about the distribution, or we know that the random variables follow specific distributions.

In the next section of this chapter we introduce a method to test such an assumption, known as the goodness-of-fit test, which requires the data to be presented in a frequency distribution. In this chapter, we will also discuss two methods of data analysis in which a data set is presented in a contingency table. The two analyses are the independence test and the homogeneity test, discussed in sections 5.3 and 5.4 respectively.

5.2

Consider the result obtained from an experiment of tossing a die 300 times, as shown in Table 5.1 below:

Table 5.1: Frequency distribution
____________________________________________
Outcome      1    2    3    4    5    6
____________________________________________
Frequency    45   52   60   58   44   41
____________________________________________

There are six possible outcomes for each trial, i.e. obtaining number 1, 2, 3, 4, 5 or 6. These outcomes are also referred to as categories. The question we would like to answer is whether the die is a fair die. The results of the experiment are the evidence for concluding whether the die is fair or otherwise. We know that a fair die has the following characteristic:

$$P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}$$

If $X$ is a random variable representing the outcome obtained for each trial, then $X$ follows the uniform distribution with $P(X = x) = \frac{1}{6}$ for $x = 1, 2, 3, 4, 5, 6$. The objective is to test the hypothesis that the die is fair, which can be stated as below:

$H_0: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}$
$H_1: P(X = i) \neq P(X = j)$ for some $i, j = 1, 2, 3, 4, 5, 6$; $i \neq j$

The statement in $H_0$ is equivalent to the die being fair and the statement in $H_1$ is equivalent to the die not being fair. If the die is fair, we expect the frequency for the outcome $x_i$ or category $i$ to be

$$E_i = n \times P(X = i) \quad \text{for } i = 1, 2, 3, 4, 5, 6$$

where $n$ is the number of trials, so that

$$E_1 = E_2 = E_3 = E_4 = E_5 = E_6 = 300 \times \frac{1}{6} = 50$$

The observed frequencies are

$$O_1 = 45, \quad O_2 = 52, \quad O_3 = 60, \quad O_4 = 58, \quad O_5 = 44, \quad O_6 = 41$$

which differ from the expected frequencies if the die is a fair die.

The logic is that if the die is fair, the differences between the observed and the expected frequencies, $O_i - E_i$, should be small. The difference between the observed and the expected frequencies forms the statistic to test the hypothesis regarding the probability distribution of the random variable. The statistic is stated in the following theorem.

Theorem 1 The statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

follows the chi-squared distribution with $k - p - 1$ degrees of freedom, where $k$ is the number of categories and $p$ is the number of unknown parameters needed to be estimated from the data. If there is no unknown parameter, then $p = 0$ and the degrees of freedom is $k - 1$.

Note: This theorem is applicable if the least expected value $E_i$ is at least 5, i.e. $E_i \geq 5$ for all $i$.

We reject $H_0$ if

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} > \chi^2_{\alpha, k-p-1}$$

at significance level $\alpha$.

Now we show the procedure to calculate the statistic $\chi^2$. Since the statistic $\chi^2$ is calculated from the observed sample, we use the same convention as in the previous chapter, denoting the calculated statistic as $\chi^2_{test}$.

________________________________________________________
$O_i$         $E_i = n \times P(i)$               $(O_i - E_i)^2 / E_i$
________________________________________________________
$O_1 = 45$    $E_1 = 300 \times 1/6 = 50$    $(45 - 50)^2/50 = 0.50$
$O_2 = 52$    $E_2 = 300 \times 1/6 = 50$    $(52 - 50)^2/50 = 0.08$
$O_3 = 60$    $E_3 = 300 \times 1/6 = 50$    $(60 - 50)^2/50 = 2.00$
$O_4 = 58$    $E_4 = 300 \times 1/6 = 50$    $(58 - 50)^2/50 = 1.28$
$O_5 = 44$    $E_5 = 300 \times 1/6 = 50$    $(44 - 50)^2/50 = 0.72$
$O_6 = 41$    $E_6 = 300 \times 1/6 = 50$    $(41 - 50)^2/50 = 1.62$
________________________________________________________

So

$$\chi^2_{test} = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i} = 6.20$$

and we accept $H_0$ if $\chi^2_{test} < \chi^2_{0.05, 6-1} = 11.070$. Note that $v = k - 1$ since unknown parameters are absent.

Since $\chi^2_{test} = 6.2 < 11.070$, we accept $H_0$ and conclude that there is no evidence that the die is not fair.
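The goodness-of-fit calculation for the die can be written as a short routine. This is a minimal sketch: the function name `chi_square_gof` is illustrative, the check on expected frequencies mirrors the note to the theorem, and the critical value 11.070 is the tabulated point used in the text.

```python
def chi_square_gof(observed, probabilities):
    """Return (chi-square statistic, expected frequencies) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    # The chi-square approximation requires every expected frequency to be at least 5.
    if min(expected) < 5:
        raise ValueError("every expected frequency should be at least 5")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


# Table 5.1: 300 tosses of a die, tested against a fair die.
observed = [45, 52, 60, 58, 44, 41]
chi2_test, expected = chi_square_gof(observed, [1 / 6] * 6)

chi2_crit = 11.070  # tabulated chi-square point for alpha = 0.05, v = 6 - 1 = 5
reject_h0 = chi2_test > chi2_crit

print(f"chi2_test = {chi2_test:.2f}, reject H0: {reject_h0}")
```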

The test we have seen above is called the goodness-of-fit test. In general, we would observe the following table, where $O_i$ represents the observed frequency for category $i$, for $i = 1, 2, \ldots, k$, and $n = O_1 + O_2 + \cdots + O_k$.

Category     1       2       ...     k
Frequency    $O_1$   $O_2$   ...     $O_k$

The belief is that the probability of category $i$ occurring, $P(i)$, is stated in the null hypothesis $H_0$ as

$$H_0: P(i) = \pi_i \quad \text{for } i = 1, 2, \ldots, k.$$

With $E_i = n \times P(i)$ and with the help of Theorem 1, we can test the hypothesis stated in $H_0$.

___________________________________________________________________________

Example 1

The authority claims that the proportions of road accidents occurring in this country according to the categories User Attitude (A), Mechanical Fault (M), Insufficient Sign Board (I) and Fate (F) are 60%, 20%, 15% and 5% respectively. A study by an independent body shows the following data

Category     A     M     I     F     Total
Frequency    130   35    30    5     200

Solution

$n = 200$
$H_0: P(A) = 0.6,\ P(M) = 0.2,\ P(I) = 0.15,\ P(F) = 0.05$
$H_1$: At least one $P(i)$ differs for $i$ = A, M, I and F.

_______________________________________________________________
$O_i$          $E_i = n \times P(i)$                 $(O_i - E_i)^2 / E_i$
_______________________________________________________________
$O_A = 130$    $E_A = 0.6 \times 200 = 120$    $(130 - 120)^2/120 = 0.833$
$O_M = 35$     $E_M = 0.2 \times 200 = 40$     $(35 - 40)^2/40 = 0.625$
$O_I = 30$     $E_I = 0.15 \times 200 = 30$    $(30 - 30)^2/30 = 0.000$
$O_F = 5$      $E_F = 0.05 \times 200 = 10$    $(5 - 10)^2/10 = 2.500$
_______________________________________________________________

$v = 4 - 1 = 3$.

$\chi^2_{test} = 0.833 + 0.625 + 0.000 + 2.500 = 3.958$.

At $\alpha = 0.05$, reject $H_0$ if $\chi^2_{test} > \chi^2_{0.05, 3} = 7.815$. Thus we accept $H_0$ and conclude that we have no evidence against the authority's claim.

___________________________________________________________________________

Example 2

The number of students playing truant in a school over 200 school days is shown below

No. of truancy    0     1     2     3     4     5 or more
No. of days       12    32    45    50    35    26

If $X$ is a random variable representing the number of students playing truant per day, test the hypothesis that $X$ follows the Poisson distribution with mean 3 per day at $\alpha = 0.01$.

Solution

$n = 12 + 32 + 45 + 50 + 35 + 26 = 200$, $k = 6$

For $X \sim P_0(3)$,

$P(X = 0) = 0.0498$, $P(X = 1) = 0.1493$, $P(X = 2) = 0.2241$,
$P(X = 3) = 0.2240$, $P(X = 4) = 0.1681$, $P(X \geq 5) = 0.1847$

_______________________________________________________________
$O_i$         $E_i = n \times P(X = i)$    $(O_i - E_i)^2 / E_i$
_______________________________________________________________
$O_0 = 12$    9.96     $(12 - 9.96)^2/9.96 = 0.42$
$O_1 = 32$    29.86    $(32 - 29.86)^2/29.86 = 0.15$
$O_2 = 45$    44.82    $(45 - 44.82)^2/44.82 = 0.00$
$O_3 = 50$    44.80    $(50 - 44.80)^2/44.80 = 0.60$
$O_4 = 35$    33.62    $(35 - 33.62)^2/33.62 = 0.06$
$O_5 = 26$    36.94    $(26 - 36.94)^2/36.94 = 3.24$
_______________________________________________________________

$\chi^2_{test} = 0.42 + 0.15 + 0.00 + 0.60 + 0.06 + 3.24 = 4.47$

At $\alpha = 0.01$, reject $H_0$ if $\chi^2_{test} > \chi^2_{0.01, 5} = 15.086$. Thus, $H_0$ is accepted and we conclude that there is no evidence that the number of students playing truant per day does not follow the Poisson distribution with mean 3 per day.

___________________________________________________________________________
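The expected frequencies of Example 2 come from Poisson probabilities, with the last category collecting the whole upper tail. A minimal sketch follows; the helper names `poisson_pmf` and `poisson_category_probs` are illustrative and not from the text. Because the code keeps full precision instead of four-decimal probabilities, individual terms may differ slightly from the table, while the total agrees to two decimals.

```python
import math


def poisson_pmf(x, mean):
    """P(X = x) for a Poisson random variable with the given mean."""
    return math.exp(-mean) * mean**x / math.factorial(x)


def poisson_category_probs(mean, last_open_category):
    """Probabilities for categories 0, 1, ..., last-1 and 'last or more'."""
    probs = [poisson_pmf(x, mean) for x in range(last_open_category)]
    probs.append(1.0 - sum(probs))  # upper tail P(X >= last)
    return probs


# Example 2: days with 0, 1, 2, 3, 4 and 5-or-more truants over 200 school days.
observed = [12, 32, 45, 50, 35, 26]
n = sum(observed)
probs = poisson_category_probs(3, 5)
expected = [n * p for p in probs]
chi2_test = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

chi2_crit = 15.086  # tabulated chi-square point for alpha = 0.01, v = 5
print(f"chi2_test = {chi2_test:.2f}, reject H0: {chi2_test > chi2_crit}")
```

When the mean has to be estimated from the data, one degree of freedom is lost ($p = 1$) and a different critical value applies.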

Example 3

It is believed that the IQ scores of all adults follow the Normal distribution with mean 110 and standard deviation 10. The scores of an IQ test given to 250 adults are summarized below, where $X$ represents the IQ score.

IQ Score                  Frequency
$X < 90$                  2
$90 \leq X < 100$         30
$100 \leq X < 110$        85
$110 \leq X < 120$        90
$120 \leq X < 130$        40
$X \geq 130$              3
Total                     250

Solution

Let $X$ represent the IQ scores.
$H_0: X \sim N(110, 10^2)$

Assuming $H_0$ is correct, $Z = \dfrac{X - 110}{10}$.

_______________________________________________
IQ Score                  P
_______________________________________________
$X < 90$                  $P(Z < -2) = 0.0228$
$90 \leq X < 100$         $P(-2 \leq Z < -1) = 0.1359$
$100 \leq X < 110$        $P(-1 \leq Z < 0) = 0.3413$
$110 \leq X < 120$        $P(0 \leq Z < 1) = 0.3413$
$120 \leq X < 130$        $P(1 \leq Z < 2) = 0.1359$
$X \geq 130$              $P(Z \geq 2) = 0.0228$
_______________________________________________

$O_i$         $E_i = n \times P(X_i)$    $(O_i - E_i)^2 / E_i$
$O_1 = 2$     5.70     $(2 - 5.70)^2/5.70 = 2.40$
$O_2 = 30$    33.98    $(30 - 33.98)^2/33.98 = 0.47$
$O_3 = 85$    85.33    $(85 - 85.33)^2/85.33 = 0.00$
$O_4 = 90$    85.33    $(90 - 85.33)^2/85.33 = 0.26$
$O_5 = 40$    33.98    $(40 - 33.98)^2/33.98 = 1.07$
$O_6 = 3$     5.70     $(3 - 5.70)^2/5.70 = 1.28$

$\chi^2_{test} = 2.40 + 0.47 + 0.00 + 0.26 + 1.07 + 1.28 = 5.48$

At $\alpha = 0.05$, reject $H_0$ if $\chi^2_{test} > \chi^2_{0.05, 5} = 11.070$. Thus, we fail to reject $H_0$ and conclude that there is no evidence that the IQ scores do not follow the normal distribution with mean 110 and standard deviation 10.

___________________________________________________________________________
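For a continuous distribution such as the normal, the category probabilities are differences of the cumulative distribution function at the class boundaries. The sketch below applies this to Example 3 using the standard-library `NormalDist`; the helper name `interval_probs` is illustrative. Since the code does not round the probabilities to four decimals, the statistic comes out slightly below the hand-calculated 5.48, and the decision is unchanged.

```python
from statistics import NormalDist


def interval_probs(dist, boundaries):
    """Probabilities of (-inf, b1), [b1, b2), ..., [bk, +inf) under dist."""
    cdf_values = [0.0] + [dist.cdf(b) for b in boundaries] + [1.0]
    return [hi - lo for lo, hi in zip(cdf_values, cdf_values[1:])]


# Example 3: IQ classes X<90, 90-100, 100-110, 110-120, 120-130, X>=130.
observed = [2, 30, 85, 90, 40, 3]
n = sum(observed)
probs = interval_probs(NormalDist(mu=110, sigma=10), [90, 100, 110, 120, 130])
expected = [n * p for p in probs]
chi2_test = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

chi2_crit = 11.070  # tabulated chi-square point for alpha = 0.05, v = 5
print(f"chi2_test = {chi2_test:.2f}, reject H0: {chi2_test > chi2_crit}")
```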

Task 1

It is believed that the number of scratches on a compact disk produced by a process follows the Poisson distribution with mean 2.5 scratches per disk. The following data shows the number of disks with the corresponding number of scratches on them:

Number of scratches    0    1     2     3     4     5 or more
Number of disks        5    22    30    20    15    8

Test the belief at significance level 0.01.

$k = 6$, then $v = 5$; $\chi^2_{test} = 3.1523 < 15.086$; fail to reject $H_0$

Task 2

Repeat the question in Task 1 above, but without knowing the true mean value. What differences may you encounter?

$k = 6$, $p = 1$, then $v = 4$; $\chi^2_{test} = 3.1869 < 13.277$; fail to reject $H_0$

___________________________________________________________________________

5.3

Independence Test

_______________________________________
Student    Bespectacled    Result
_______________________________________
A          Yes             Excellent
B          No              Excellent
C          Yes             Good
D          Yes             Excellent
E          No              Good
F          No              Good
G          Yes             Excellent
_______________________________________

Summarizing these data gives the following 2 × 2 contingency table:

                     Maths Results
Bespectacled     Good     Excellent
Yes              1        3
No               2        1

The first number 2 means there are two rows for the row variable "Bespectacled" with categories Yes and No. The second number 2 means there are two columns for the column variable "Maths Results" with two categories Good and Excellent. The row and column variables are both nominal type of data. Each of the four boxes in the contingency table is called a cell. The numbers in each cell are the frequency of students having both the corresponding row and column categories, or simply referred to as the observed frequency.

Usually, the question we have in mind when dealing with data in a contingency table is whether the two variables are independent. Independence means the two variables do not influence each other. Thus in the example above we want to test whether being bespectacled or not influences the students' Maths results. This test is called the independence test, which capitalizes on the fact of independent events in probability study:

Two events A and B are independent if and only if

$$P(A \cap B) = P(A)P(B)$$

To understand this test further we introduce the two-dimensional contingency table in its general form.

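In general, a two-dimensional contingency table is of the form below

                                Column Variable
Row Variable     Category $B_1$   Category $B_2$   ...   Category $B_c$
Category $A_1$   $O_{11}$         $O_{12}$         ...   $O_{1c}$
Category $A_2$   $O_{21}$         $O_{22}$         ...   $O_{2c}$
...
Category $A_r$   $O_{r1}$         $O_{r2}$         ...   $O_{rc}$

The above contingency table is an $r \times c$ contingency table where $r$ denotes the number of categories of the row variable, $c$ denotes the number of categories of the column variable and $O_{ij}$ is the observed frequency in cell $(i, j)$, i.e. the observed frequency for the $i$th category of the row variable and the $j$th category of the column variable. Let $n_{i\cdot}$ denote the total of row $i$, $n_{\cdot j}$ the total of column $j$ and $n$ the grand total. Each cell corresponds to a joint event $A_i \cap B_j$:

                                Column Variable
Row Variable     Category $B_1$               Category $B_2$   ...   Category $B_c$               Total
Category $A_1$   $O_{11}$ $(A_1 \cap B_1)$    $O_{12}$         ...   $O_{1c}$ $(A_1 \cap B_c)$    $n_{1\cdot}$
Category $A_2$   $O_{21}$ $(A_2 \cap B_1)$    $O_{22}$         ...   $O_{2c}$ $(A_2 \cap B_c)$    $n_{2\cdot}$
...
Category $A_r$   $O_{r1}$ $(A_r \cap B_1)$    $O_{r2}$         ...   $O_{rc}$ $(A_r \cap B_c)$    $n_{r\cdot}$
Total            $n_{\cdot 1}$                $n_{\cdot 2}$    ...   $n_{\cdot c}$                $n$

If the row and column variables are independent, then

$$P(A_i \cap B_j) = P(A_i)P(B_j)$$

Most often, we do not know the true values of $P(A_i)$ or $P(B_j)$, but we know from the estimation Chapter 3 that the best estimator for a population proportion or probability is the sample proportion. Thus

$$\hat{P}(A_i) = \frac{n_{i\cdot}}{n} \quad \text{and} \quad \hat{P}(B_j) = \frac{n_{\cdot j}}{n}$$

so that

$$\hat{P}(A_i \cap B_j) = \hat{P}(A_i)\hat{P}(B_j) = \frac{n_{i\cdot}}{n} \cdot \frac{n_{\cdot j}}{n}$$

With this estimated joint probability, we can find the expected frequency in each cell, $E_{ij}$, if $A_i$ and $B_j$ are independent. The expected frequency in cell $(i, j)$ is

$$E_{ij} = n\,\hat{P}(A_i \cap B_j) = n\,\hat{P}(A_i)\hat{P}(B_j) = n \cdot \frac{n_{i\cdot}}{n} \cdot \frac{n_{\cdot j}}{n} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$$

Now, if $A_i$ and $B_j$ are truly independent, we anticipate that $O_{ij}$ and $E_{ij}$ do not differ, and if they differ the difference is not significant. The statistic $O_{ij} - E_{ij}$ forms the basis for the independence test which is stated in Theorem 2.

Theorem 2

The statistic

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

follows the chi-squared distribution with $(r - 1)(c - 1)$ degrees of freedom, where
$O_{ij}$ is the observed frequency in cell $(i, j)$, and
$E_{ij}$ is the expected frequency in cell $(i, j)$.

The theorem can be written simply as

$$\sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}.$$

The null hypothesis is

$H_0$: Row and column variables are independent.

This test is a one-tailed test on the right where $H_0$ is rejected if the calculated $\chi^2$ value is greater than $\chi^2_{\alpha, (r-1)(c-1)}$ at significance level $\alpha$. As before in this chapter, the calculated $\chi^2$ value is denoted by $\chi^2_{test}$. Thus, we reject $H_0$ if

$$\chi^2_{test} > \chi^2_{\alpha, (r-1)(c-1)}$$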

Example 4

Insomnia is a disease where a person finds it hard to sleep at night. A study is conducted to determine whether the two attributes, smoking habit and insomnia disease, are dependent. The following data set was obtained:

                    Insomnia
Habit           Yes     No
Non-smokers     10      70
Ex-smokers      8       32
Smokers         22      38

Solution

$H_0$: Smoking habit and Insomnia are independent.
$H_1$: Smoking habit and Insomnia are not independent.

$r = 3$, $c = 2$,
$n_{1\cdot} = 10 + 70 = 80$, $n_{2\cdot} = 8 + 32 = 40$, $n_{3\cdot} = 22 + 38 = 60$,
$n_{\cdot 1} = 10 + 8 + 22 = 40$, $n_{\cdot 2} = 70 + 32 + 38 = 140$,
$n = 10 + 70 + 8 + 32 + 22 + 38 = 180$.

$O_{ij}$           $E_{ij} = n_{i\cdot} n_{\cdot j} / n$           $(O_{ij} - E_{ij})^2 / E_{ij}$
$O_{11} = 10$    $E_{11} = (80)(40)/180 = 17.78$     $(10 - 17.78)^2/17.78 = 3.40$
$O_{12} = 70$    $E_{12} = (80)(140)/180 = 62.22$    $(70 - 62.22)^2/62.22 = 0.97$
$O_{21} = 8$     $E_{21} = (40)(40)/180 = 8.89$      $(8 - 8.89)^2/8.89 = 0.09$
$O_{22} = 32$    $E_{22} = (40)(140)/180 = 31.11$    $(32 - 31.11)^2/31.11 = 0.03$
$O_{31} = 22$    $E_{31} = (60)(40)/180 = 13.33$     $(22 - 13.33)^2/13.33 = 5.64$
$O_{32} = 38$    $E_{32} = (60)(140)/180 = 46.67$    $(38 - 46.67)^2/46.67 = 1.61$

$\chi^2_{test} = 3.40 + 0.97 + 0.09 + 0.03 + 5.64 + 1.61 = 11.74$.

The critical value at the 5% significance level is $\chi^2_{0.05, (3-1)(2-1)} = \chi^2_{0.05, 2} = 5.991$ and the rule is to reject $H_0$ if $\chi^2_{test} > 5.991$. Since $\chi^2_{test} = 11.74 > 5.991$, we reject $H_0$. There is sufficient evidence at the 5% significance level to conclude that smoking habit and insomnia disease are not independent.
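The expected frequencies $E_{ij} = n_{i\cdot} n_{\cdot j} / n$ and the chi-squared statistic can be computed for any $r \times c$ table with a short routine. This is a minimal sketch with the illustrative name `chi_square_independence`, applied to the smoking and insomnia data of Example 4. Evaluated without intermediate rounding, the statistic is about 11.73 (the cell for ex-smokers with insomnia contributes about 0.09), which exceeds the tabulated critical value 5.991.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, expected table, degrees of freedom).

    `table` is a list of rows of observed frequencies.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency in cell (i, j) under independence.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, expected, df


# Example 4: rows are non-smokers, ex-smokers, smokers; columns are insomnia yes/no.
insomnia = [[10, 70], [8, 32], [22, 38]]
chi2_test, expected, df = chi_square_independence(insomnia)

chi2_crit = 5.991  # tabulated chi-square point for alpha = 0.05, v = 2
print(f"chi2_test = {chi2_test:.2f}, v = {df}, reject H0: {chi2_test > chi2_crit}")
```

The same routine serves for the homogeneity test, since the computation is identical and only the sampling scheme and the wording of the hypotheses change.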

Task 3

A study is conducted to determine whether the academic performance of students is independent of their active involvement in co-curricular activities. The following data set was obtained:

                                Academic Performance
Co-curricular Activities    Low     Fair    Good
Inactive                    40      80      60
Active                      30      90      60

Use a 5% significance level to conduct the study.

$v = 2$; $\chi^2_{test} = 2.0168 < 5.991$; fail to reject $H_0$

_____________________________________________________________________

Task 4

A study is conducted to determine whether the management efficiency and the specialization sector are independent. The following data set was obtained:

                  Management Efficiency
Sector        Low     Fair    Good
Education     20      20      35
Health        15      25      40
Banking       15      30      80

Use a 1% significance level to conduct the study.

$v = 4$; $\chi^2_{test} = 9.7807 < 13.277$; fail to reject $H_0$

___________________________________________________________________________

5.4

Homogeneity Test

In the independence test each subject has the possibility of belonging to any of the

rc

cells. For further clarification, consider the following contingency table which shows the

frequency of students according to gender and their hand phone brands.

Hand phone brand

Male

Nokia

Samsung

Others

Total

80

60

30

170

Female

60

70

20

150

Total

140

130

50

320

If all 320 students are chosen at random regardless of their gender and hand phone brand, each student will be classified into one of the six joint categories and the test of independence is a valid test. In other words, each of the 320 students will belong to one and only one of the six cells of the contingency table. However, we may want to fix the number of male and female students in this study. For example, we may want to have 170 male students and 150 female students.

Thus a male student will belong to one of the joint categories (Male, Nokia), (Male, Samsung) or (Male, Others) of the male category, and not to any of the six joint categories. In other words, a male student will belong to one of the three cells of the male category. Similarly, a female student will belong to one of the three cells of the female category. Fixing the number of male and female students constrains the assignment of each subject to the relevant gender category. When we have such a constraint, we are actually comparing the distribution of hand phone brand preferences between the two genders. In this case, we fix the row totals $n_{i\cdot}$. This means we are testing whether the preferences for Nokia, Samsung or other brands of hand phone are the same for male and female students.

Alternatively, we may prefer to fix the column totals $n_{\cdot j}$, i.e. we select 140 Nokia users, 130 Samsung users and 50 users of other brands. Each user will be classified in the relevant cell, constrained by his or her brand preference. Thus, we are actually comparing the distribution of gender between the hand phone brands.

The relevant test is called the homogeneity test, in which we test the similarity of two or more populations with regard to the distribution of a certain characteristic. For the fixed numbers of male and female students, the hypotheses are

$H_0$: The proportions of students preferring the three hand phone brands are the same for male and female students.
$H_1$: The proportions of students preferring the three hand phone brands are not the same for male and female students.

For the fixed numbers of brand users, the hypotheses are

$H_0$: The proportions of male and female student users are the same for Nokia, Samsung and other brands.
$H_1$: The proportions of male and female student users are not the same.

The procedure for conducting the homogeneity test is the same as for the test of independence discussed earlier.
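Because the procedure is the same as for the test of independence, the test statistic for the hand phone table can be computed with a few lines of code. The following is a minimal sketch, assuming the counts are supplied as a list of rows; the function name `chi_square_statistic` is illustrative and does not come from the text. Expected frequencies are taken as (row total × column total) / grand total.

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected frequency under H0 (same proportions in every row)
            e_ij = row_totals[i] * col_totals[j] / grand_total
            statistic += (o_ij - e_ij) ** 2 / e_ij
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Hand phone table: rows are Male, Female; columns are Nokia, Samsung, Others.
phone_counts = [[80, 60, 30], [60, 70, 20]]
stat, dof = chi_square_statistic(phone_counts)
print(f"chi-square = {stat:.4f}, degrees of freedom = {dof}")
```

The computed statistic is then compared with the critical value from the chi-square table for the chosen significance level and the degrees of freedom returned by the function.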

Task 5

200 female owners and 200 male owners of Proton cars are selected at random and the colours of their cars are noted. The following data show the results:

                  Car Colour
Gender      Black     Dull     Bright
Male          40       110       50
Female        20        80      100

Use a 1% significance level to test whether the proportions of colour preferences are the same for males and females.

[$\nu = 2$; $\chi^2_{\text{test}} = 28.07 > \chi^2_{0.01,2} = 9.210$; reject $H_0$]

Exercise 5

1. A random sample of 200 printed boards has been collected and the following numbers of defects were observed:

Number of defects       0     1     2     3     4     5     6     7 and more
Observed frequency     10    40    54    45    32     8     6     5

Can we conclude that the number of defects follows the Poisson distribution with mean 2.6 at significance level $\alpha = 0.05$?

selected and the following number of defective components was recorded:

Number of defects     0     1     2     3     4     5     6 and more
Frequency             5    10    18    19    16    12    20

Can we conclude that the number of defective electrical components follows the Poisson distribution at significance level $\alpha = 0.01$?

3. A manufacturing engineer is testing a power supply used in a notebook computer. The complete table of observed frequencies is as follows:

Class interval               Observed frequency $O_i$
$x < 4.948$                         12
$4.948 \le x < 4.986$               14
$4.986 \le x < 5.014$               12
$5.014 \le x < 5.040$               13
$5.040 \le x < 5.066$               12
$5.066 \le x < 5.094$               11
$5.094 \le x < 5.132$               12
$x \ge 5.132$                       14

Test the hypothesis that the output voltage is adequately described by a normal distribution with mean 5.04 V and standard deviation 0.08 V at significance level $\alpha = 0.05$.

4. A machine is supposed to mix 40% peanuts, 30% hazelnuts, 20% cashews, and 10%

pecans. A can containing 500 of these mixed nuts was found to have 269 peanuts, 112

hazelnuts, 74 cashews, and 45 pecans. At the 0.05 level of significance, test the

hypothesis that the machine is mixing the nuts according to the required percentages.

5. It is believed that the ratio of Bumiputera, Orang Asli and other students in the intake of the Faculty of Engineering is 14:3:3. A sample of 500 students chosen at random shows the following data:

                       Bumiputera    Orang Asli    Others
Number of Students        345            78          77

6. A random sample of semiconductor devices is taken to observe the relationship

between classification and status for each device. The results are as follows:

Classification

Defective

Non Defective

80

20

40

60

Status

Rejected

Non Rejected

Test the hypothesis that the status and classification are independent at significance level $\alpha = 0.05$.

7. A study was conducted to determine whether the type of painkiller administered to patients influences the level of pain felt by the patients, and the following data set was obtained:

                    Painkiller
Level of Pain       A       B
No                 20      10
A little           30      35
Strong             10      15

Test whether the level of pain and the type of painkiller are independent at significance level $\alpha = 0.01$.

8. A total of 1000 PVC pipes are sampled and categorized with respect to both length and diameter specification. The results are presented in the following table:

                                        Length
Diameter               Too Short    Meet Specification    Too Long
Too Thick                  20              65                 35
Meet Specification        115             550                145
Too Wide                   15              45                 10

Test at the 1% significance level whether the length and the diameter of the PVC pipes are independent.

components produced by workers were the same for the day, evening, and night shifts.

The following data were collected:

Shift        Defective    Non defective
Day             100            150
Evening         200            200
Night           200            150

components are the same for all three shifts.

10. A QC inspector took a set of sample data to determine whether the proportions of

output components for two shifts produced by machine A, B and C were the same.

The following data were collected:

                Machine
             A       B       C
Shift 1     100     120     180
Shift 2     120     180     100

Use a 0.05 level of significance to determine if the proportions of output components

for shift 1 are the same for all three machines.

Chapter 6

Analysis of Variance

Learning Objectives:

At the end of this chapter, students should be able to

a) Identify treatment, response and levels of treatment.

b) Analyse data using one-way ANOVA.

c) Perform one-way ANOVA using Microsoft Excel.

6.1 Introduction

In Chapter 4, we compared two population means, or in other words two levels of a factor, to decide whether there was any difference between the population means from which the samples came. However, researchers often want to examine differences among three or more population means. For example, researchers might want to compare five different temperatures in developing a polymer to be used in removing toxic wastes from water. The procedure that can be used for testing the equality of the mean responses at the different temperatures is one-way analysis of variance, or one-way ANOVA. The five different temperatures are known as five levels of the factor, or five treatments. A factor (or treatment) is a property, or characteristic, that allows us to distinguish the different populations from one another. The number of levels of the factor is commonly denoted by k.

The term treatment is used because early applications of analysis of variance involved agricultural experiments in which different plots of farmland were treated with different fertilizers, seed types, insecticides and so on.

To understand how analysis of variance works and why it is called analysis of variance, consider the example above. We obtain a random sample at each temperature and, for each sample, measure the percentage of impurities removed by the treatment. The measurements at a given temperature will differ from one another. This is the variability within a group, that is, within a level of the factor.

In one-way analysis of variance, we partition the variability into two components: within-group variability and between-group variability. We then examine the ratio of the two, called the F ratio, obtained by dividing the between-group variability by the within-group variability. It is in this sense that ANOVA is an analysis of variance: the variance between groups is compared to the variance within groups.

After conducting a one-way analysis of variance, we might conclude that there is sufficient evidence to reject a claim of equal population means, but we cannot conclude from ANOVA which particular mean is different from the others.

The model deals with specific factor levels and involves testing the null hypothesis against the alternative hypothesis, stated below:

$H_0: \mu_1 = \mu_2 = \dots = \mu_k$
$H_1: \mu_i \ne \mu_j$ for at least one pair $(i, j)$

The one-way analysis of variance specifically allows us to compare several groups of

observations, all of which are independent but possibly with a different mean for each group.

A test of great importance is whether or not all the means are equal. Assume that we are

interested in comparing the means of k populations. In a one-way ANOVA, it is assumed that

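each of the populations is normally distributed with the same variance, $\sigma^2$. The model is

$$y_{ij} = \mu_i + \varepsilon_{ij}$$

where $y_{ij}$ is the jth observation from the ith factor level, $\mu_i$ is the ith mean and $\varepsilon_{ij}$ is the random error.

An alternative and preferred form of this equation is obtained by substituting

$$\mu_i = \mu + \tau_i$$

with the restriction

$$\sum_{i=1}^{k} \tau_i = 0,$$

which gives

$$y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \qquad i = 1, \dots, k.$$

In carrying out ANOVA, it is important to know the following data layout and notations:

Factor    Observations                                        Total
1         $y_{11}$   $y_{12}$   ...   $y_{1n_1}$              $y_{1\cdot}$
2         $y_{21}$   $y_{22}$   ...   $y_{2n_2}$              $y_{2\cdot}$
...       ...                                                 ...
i         $y_{i1}$   $y_{i2}$   ...   $y_{in_i}$              $y_{i\cdot}$
...       ...                                                 ...
k         $y_{k1}$   $y_{k2}$   ...   $y_{kn_k}$              $y_{k\cdot}$
                                                              $y_{\cdot\cdot}$

(i) $y_{i\cdot} = \sum_{j=1}^{n_i} y_{ij}$ is the total of the responses over a level.
(ii) $\bar{y}_{i\cdot} = y_{i\cdot} / n_i$ is the level mean.
(iii) $y_{\cdot\cdot} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}$ is the grand total of the responses.
(iv) $\bar{y}_{\cdot\cdot} = y_{\cdot\cdot} / N$ is the overall mean of the data, where $N = \sum_{i=1}^{k} n_i$.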

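ANOVA is a procedure in which the total variation in a measured response is partitioned into components that can be attributed to recognizable sources of variation. These individual components are useful in testing pertinent hypotheses. The total variability of the data, designated by the double summation $\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{\cdot\cdot})^2$, can be partitioned as

$$\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^{k} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 + \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2$$

that is,

SST = SSTrt + SSE

where
SST is the total sum of squares,
SSTrt is the sum of squares due to the levels, and
SSE is the sum of squares due to the errors.

The equation for the total sum of squares, which is a measure of the overall variability of the data, is

$$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}^2 - \frac{y_{\cdot\cdot}^2}{N}$$

The equation for the sum of squares for the levels, which measures the variability due to the levels of the factor, is

$$SSTrt = \sum_{i=1}^{k} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^{k} \frac{y_{i\cdot}^2}{n_i} - \frac{y_{\cdot\cdot}^2}{N}$$

With SST and SSTrt known, SSE can be calculated by the formula

$$SSE = SST - SSTrt$$

The SSE term measures the variability of the data due to random error.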

There are degrees of freedom associated with each of the sums of squares. The degrees of freedom for factor, error and total are $k - 1$, $N - k$ and $N - 1$, respectively.

Mean square values are calculated by dividing the sums of squares for the level and error by their respective degrees of freedom. These values represent the variances of the level and error components of the data. The mean squares for levels and errors are

$$MSTrt = \frac{SSTrt}{k - 1}, \qquad MSE = \frac{SSE}{N - k}$$

The test statistic is the ratio

$$F_0 = \frac{MSTrt}{MSE}$$

If

$$f_{\text{calculated}} > f_{\alpha, k-1, N-k}$$

we reject the null hypothesis and conclude that some of the variability of the data is due to differences in the factor levels.

6.4 Output

The general format for output for this type of analysis is an ANOVA table, which contains

basic information about the analysis:

Source of Variation          Sum of Squares    Degrees of Freedom    Mean Square    f calculated
Factor (between levels)      SSTrt             $k - 1$               MSTrt          MSTrt / MSE
Error (within levels)        SSE               $N - k$               MSE
Total                        SST               $N - 1$

Example 1

Three different types of alcohol can be used in a particular chemical process. The resulting yields (in %) from several batches using the different types of alcohol are given below:

        Alcohol
   1       2       3
  93      95      76
  95      97      77
  74      87      84

Test whether or not the three populations appear to have equal means using $\alpha = 0.01$.

Solution

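                     Alcohol
             1                2                3
            93               95               76
            95               97               77
            74               87               84
Total   $y_{1\cdot} = 262$   $y_{2\cdot} = 279$   $y_{3\cdot} = 237$     $y_{\cdot\cdot} = 778$

$N = 9$, $k = 3$

Hypothesis:

$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1: \mu_i \ne \mu_j$ for at least one pair $(i, j)$

$$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}^2 - \frac{y_{\cdot\cdot}^2}{N} = (93^2 + 95^2 + 74^2 + \dots + 76^2 + 77^2 + 84^2) - \frac{778^2}{9} = 67{,}914 - 67{,}253.7778 = 660.2222$$

$$SSTrt = \sum_{i=1}^{k} \frac{y_{i\cdot}^2}{n_i} - \frac{y_{\cdot\cdot}^2}{N} = \frac{262^2 + 279^2 + 237^2}{3} - \frac{778^2}{9} = 67{,}551.3333 - 67{,}253.7778 = 297.5555$$

$$SSE = SST - SSTrt = 660.2222 - 297.5555 = 362.6667$$

Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square              f calculated
Factor                 297.5555          3 - 1 = 2             297.5555 / 2 = 148.7778  148.7778 / 60.4445 = 2.4614
Error                  362.6667          9 - 3 = 6             362.6667 / 6 = 60.4445
Total                  660.2222          9 - 1 = 8

At $\alpha = 0.01$, from the statistical table for the f distribution, we have

$$f_{0.01,2,6} = 10.92$$

Since $f_{\text{calc}} = 2.4614 < f_{0.01,2,6} = 10.92$, we are unable to reject the null hypothesis and conclude that there is no evidence of a difference in the mean yields of the three types of alcohol at a significance level of $\alpha = 0.01$.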
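The arithmetic in Example 1 can be checked with a short script. This is a sketch only, using the computational formulas for SST and SSTrt given earlier in this chapter; the function name `one_way_anova` is illustrative, and the data use 74 as the third yield for alcohol 1, which is the value that appears in the solution's sums and totals.

```python
def one_way_anova(groups):
    """Return SST, SSTrt, SSE and the F ratio for a list of samples."""
    all_values = [y for group in groups for y in group]
    n_total = len(all_values)                      # N
    k = len(groups)                                # number of levels
    correction = sum(all_values) ** 2 / n_total    # y..^2 / N
    sst = sum(y ** 2 for y in all_values) - correction
    sstrt = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    sse = sst - sstrt
    # F ratio = MSTrt / MSE
    f_ratio = (sstrt / (k - 1)) / (sse / (n_total - k))
    return {"SST": sst, "SSTrt": sstrt, "SSE": sse, "F": f_ratio}


# Yields (in %) for alcohol types 1, 2 and 3
yields = [[93, 95, 74], [95, 97, 87], [76, 77, 84]]
for name, value in one_way_anova(yields).items():
    print(f"{name} = {value:.4f}")
```

The critical value still has to be read from the f table (or obtained from statistical software) before the decision is made.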

Task 1

An experiment was done to compare the amount of heat loss for three types of thermal panes. The inside temperature was kept at a constant 68°F, the outside temperature was kept at a constant 20°F, and heat loss was recorded for three different panes of each type:

     Pane Type
  1      2      3
 20     14     11
 14     12     13
 29     13     19
 16     12     15

Use ANOVA to test for differences in heat loss due to pane type at $\alpha = 0.05$. What can you conclude from this test?

[$f_{\text{calc}} = 2.3608 < f_{0.05,2,9} = 4.26$; fail to reject $H_0$; no differences.]

Task 2

An experiment was conducted to compare four formulations for a lens coating with regard to

its adhesive property. Four samples of each formulation were used, and the resulting

adhesions are given below:

difference in the mean formulation at

1

15

10

21

23

Formulation

2

3

29

33

60

59

91

49

20

21

4

26

34

28

46

evidence to indicate a

0.05

Task 3

To determine the effect of three phosphor types on the output of computer monitors, each

phosphor type was used in three monitors, and the coded results are given below:

Type

2

4

2

3

3

1

3

sufficient 7

2 evidence to conclude that there is a

3 among the three monitors? Test by

difference in the mean phosphor 5

7

6

using = 0.025

5

4

5

5

6

[$f_{\text{calc}} = 7.4495 > f_{0.025,2,12} = 5.10$; reject $H_0$; a difference exists.]

Do

the

data

provide

The Excel spreadsheet program has a tool for one-way Analysis of Variance, which simplifies our computational task considerably. The first step is to enter the data into an Excel worksheet. Each factor level should be in a separate column, and each column should have a heading identifying the level.

In an Excel 2007 worksheet, select Data in the main menu, followed by Data Analysis. If you use Excel 2003, go to Tools first and then select Data Analysis. If Data Analysis is not available, you must install the Data Analysis Tools as follows:

1. Select Add-Ins from the Tools menu.
2. Click on the box next to Analysis ToolPak to select it.
3. Click OK. You have now installed the ToolPak.

In either version, when you click Data Analysis, a pop-up menu will appear. Scroll down the Data Analysis menu and select Anova: Single Factor. Complete the Anova: Single Factor window as follows:

1. Enter $A$2:$C$7 in the Input Range: box (or enter that value automatically by clicking in the box and then selecting the range of cells A2 through C7).
2. Click the Columns button to indicate that the data are grouped by columns.
3. Click the Labels in first row box to indicate that we are using labels.
4. Enter the value of alpha in the Alpha: box.
5. Under Output Options, click the button for Output range: and enter $A$9 in the Output range: box (or click in the box and then click on the cell A9 to cause it to appear in the box).
6. Click OK.

An example of the Excel output summary from a one-way analysis of variance can be seen in Figure 6.1 below. Notice that the means for the three groups (as well as the count, sum, and variance for each group) can be seen in the summary table.

One way to interpret the output is to look at the P-value, defined as

$$P\text{-value} = P(F > f_{\text{calc}})$$

which for this output is

$$P\text{-value} = P(F > 5.178082192)$$

This P-value is then compared to a chosen level of significance, $\alpha$. The rules are:

If $P\text{-value} > \alpha$, then the sample data do not provide sufficient evidence to reject $H_0$.

However, if $P\text{-value} \le \alpha$, then the sample data provide sufficient evidence to reject $H_0$.

From the output above, $P\text{-value} = 0.023917$. Suppose we choose $\alpha = 0.05$. Since $P\text{-value} < 0.05$, we conclude that there exists a significant difference in the means at the 0.05 level of significance. However, if we choose $\alpha = 0.01$, then $P\text{-value} > 0.01$. Hence, we fail to reject $H_0$ and conclude that there is no significant difference in the means at the 0.01 level of significance.
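The P-value rule can be illustrated with a small script. This is a sketch under two stated assumptions: the error degrees of freedom are taken as 12, inferred from the input range $A$2:$C$7 (three columns of five observations below a label row), and the closed form used here, $P(F > f) = (1 + 2f/\nu_2)^{-\nu_2/2}$, holds only when the numerator degrees of freedom equal 2, that is, for three groups. The function name is illustrative. For other degrees of freedom, the P-value should be taken from Excel or statistical software.

```python
def p_value_two_numerator_df(f_calc, error_df):
    """P(F > f_calc) for an F distribution with (2, error_df) degrees of freedom."""
    return (1 + 2 * f_calc / error_df) ** (-error_df / 2)


p = p_value_two_numerator_df(5.178082192, 12)   # error_df = 12 is an assumption
print(f"P-value = {p:.6f}")

for alpha in (0.05, 0.01):
    decision = "reject H0" if p <= alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {decision}")
```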

Task 4

Conduct a one-way ANOVA for Tasks 1, 2, and 3 by using Excel. Identify the P-value for

each task and interpret the value.

Exercise 6

1. It was known that a toxic material was dumped in a river leading into a large saltwater commercial fishing area. Civil engineers studied the way the water carried the

toxic material by measuring the amount of the material (in parts per million) found in

oysters harvested at three different locations, ranging from the estuary out into the bay

where the majority of commercial fishing was carried out. The resulting data are

given below:

           Location
Site 1     Site 2     Site 3
  15         19         22
  26         15         26
  20         10         24
  20         26         26
  29         11         15
  28         20         17
  21         13         24
  26         15
             18

Test for a difference in the average parts per million of toxic material found in oysters harvested at the three sites. Use $\alpha = 0.05$.

2. A quality control engineer conducted an experiment to investigate the effect of experience on the time needed to complete an

assembly task. If experience is found to be a factor, a training program is planned for

new employees. The engineer randomly selected eight employees from groups who

had completed 1, 2, 3, and 4 years of work experiences, respectively.

The resulting data are given below:

among years of

assembly time.

b) Do the data

program might

1

40.3

25.4

28.2

41.6

28.8

38.7

29.4

37.7

Experience

2

3

34.2

26.3

25.4

29.2

30.2

24.6

28.9

29.1

39.2

34.8

29.5

32.3

29.0

36.0

25.6

25.6

4

26.6

21.2

23.2

27.0

27.1

27.3

34.2

33.3

significant differences

experience for average

Use = 0.05

suggest that a training

be productive?

3. The OPEC oil embargo made it evident that fuel economy in automobiles needed to

be improved. Newer lightweight materials were sought for use in automobile engines.

Comparisons of the density (in $\text{g/cm}^3$) were made among test material samples of steel, aluminium, and phenolic thermoset composites containing glass fibres, resulting in the following data:

               Materials
Steel      Aluminium      Phenolics
7.60         2.90           1.79
7.81         2.67           1.72
7.72         2.80           1.67
7.68         2.85           1.80
7.79         2.60           1.50
7.76         2.76           1.63

Using an analysis of variance, state the correct hypotheses for testing equality of means in density for the three materials and conduct the ANOVA test. State your conclusion. Use the $\alpha = 0.01$ level of significance.

different seasons.

Season 1    Season 2    Season 3    Season 4
  5.62        7.70        2.52        6.77
  6.12        8.31        5.44        6.65
  6.62        8.80        4.94        6.01
  6.21        8.24        2.99        6.26
  7.80        7.87        4.39        7.09
  5.36        7.44        4.44        6.06

variability at 0.05 level of significance.

5. Four different machines are used in manufacturing rubber seals. The machines are

being compared with respect to tensile strength of the product. A random sample of

seals from each machine is used to determine whether the mean tensile strength varies

from machine to machine. The following data are the tensile strength measurements in

kilograms per square centimeter x 101

Machine

1

2

3

4

17.5

19.2

15.8

18.6

16.4

16.8

20.9

18.9

20.3

18.5

17.2

20.5

14.6

21.4

16.4

19.5

21.5

16.9

18.1

20.1

Perform the analysis of variance at the 0.025 level of significance and indicate

whether or not the mean tensile strengths differ significantly for the four machines.

6. In a biological experiment, 4 concentrations of a certain chemical are used to enhance

the growth in centimeters of a certain type of plant over time. The growths of plants

are measured. The following output is from Excel.

b) Can we conclude at the $\alpha = 0.05$ level of significance that different concentrations

affect the growth of the plant?

7. A company is considering four brands of lightbulbs to choose from. Before the

company decides which lightbulbs to buy, they want to investigate if the mean

lifetimes of the four types of lightbulbs are the same. The company's research

department randomly selected a few bulbs of each brands and tested them. The

following results are based on the number of hours (in thousands) that each of the

bulbs lasted before being burned out. At 5% significance level, test the null hypothesis

that the mean lifetime of bulbs for each of these four brands is the same.

Chapter 7

Simple Linear Regression and Correlation

Learning Objectives:

At the end of this chapter, students should be able to

(b) Define the terms regression and correlation and highlight the differences between

the two terms.

(c) Write down a linear regression model correctly.

(d) Estimate unknown parameters in a linear regression model by using the method of

least squares.

(e) Use a scientific calculator and computer technology such as Microsoft Excel to get the

estimates of the unknown parameters in a linear regression model.

(f) Make a prediction based on a fitted regression model.

(g) Run a hypothesis test and make inferences on the existence of linearity in a linear

regression model.

(h) Compute a correlation coefficient and differentiate between different types of

relationship between two variables.

7.1 Introduction

In previous chapters, we focused only on the behaviour of population and sample characteristics, such as the mean, proportion and variance. Having learned about those characteristics, we can now move on to exploring the relationship between variables. Many problems arising in science and engineering involve exploring the relationship between two or more variables. In this chapter, we consider two statistical techniques that serve as a foundation for describing the relationship between such variables: first, regression analysis, and second, the correlation coefficient.

Regression analysis generally models the relationship between one or more response¹ variables and one or more predictor² variables. Three common classifications of regression analysis are listed below:

i. Simple linear regression, if there is only one response variable and one predictor variable.
ii. Multiple regression, if there is only one response variable and many predictor variables.
iii. Multivariate regression, if there are many response variables and one or more predictor variables.

There are many other types of regression analysis. In this chapter, we only deal with the first classification. Linear regression, in general, models the relationship between two or more random variables using a linear equation. In other words, it is a method of estimating the conditional expected value of one response variable given the values of some predictor variable or variables. Simply put, linear regression assumes the best estimate of the response variable is a linear function of some parameters (though not necessarily linear in the predictors).

The correlation coefficient, on the other hand, gives us a single value, rather than a model, that measures the relationship between variables. In this chapter, we concentrate only on the correlation coefficient that measures a linear relationship, particularly for quantitative data. This will be discussed in detail in Section 7.6.

¹ Response variables are also called dependent variables, explained variables, predicted variables, or regressands. In the case of a single response variable, it is usually denoted by Y.

² Predictor variables, on the other hand, are also called independent variables, explanatory variables, control variables, or regressors, and are usually denoted by $X_1, X_2, \dots, X_p$.

Task 1

1. Choose your pair. Next, discuss the difference between regression and correlation.

2. Choose a different pair. Next, list down

a) two possible response variables, and

b) two possible predictor variables.

from your engineering discipline.

As mentioned earlier, the main focus of this chapter is simple linear regression analysis. It involves a single response (or dependent) variable, commonly denoted by Y, and a single predictor (or independent) variable, commonly denoted by X.

Example 1

An engineering student is investigating if his carry marks for all subjects depend on the

number of revision hours he has spent on the subjects.

Solution

In this example, the response, or dependent, variable Y represents the engineering student's

carry marks for all subjects, whereas the predictor, or independent variable X represents the

number of revision hours the student has spent on each subject.

Example 2

An analyst is investigating if the increase in petrol price has an effect on the number of

customers at a petrol station.

Solution

The response variable Y represents the number of customers at the petrol station, whereas the

predictor variable X represents the increase in petrol price.

Once we have identified the response and predictor variables, we may select a random sample consisting of n pairs of observations. Given this set of paired data, $(x_i, y_i)$, $i = 1, 2, \dots, n$, we can write a probabilistic, or stochastic, model which consists of deterministic and random components, as follows:

$$y_i = \alpha + \beta x_i + \varepsilon_i \tag{7.1}$$

where $\alpha$ is an unknown regression coefficient representing the intercept³, $\beta$ is another unknown regression coefficient representing the slope, and $\varepsilon_i$ is the random error for the i-th pair.

Notice here that the deterministic component in the regression model above is in fact a simple linear, or straight line, model.

The assumptions underlying the simple linear regression model include the following:

1. The errors, $\varepsilon_i$, are normally distributed.
2. The errors have mean zero, that is $E(\varepsilon_i) = 0$.
3. The variance of the random errors is an unknown constant, $\sigma^2$.
4. The errors are uncorrelated, that is $\text{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \ne j$.

³ Readers are cautioned that this intercept, $\alpha$, is not the same as the level of significance in hypothesis testing, which is also denoted by $\alpha$. In addition, some references use $\beta_0$ instead of $\alpha$ in the regression model.

Re-expressing Equation 7.1 in terms of variables, instead of values, we get the following equation:

$$Y = \alpha + \beta x + \varepsilon \tag{7.2}$$

Computing the expected value of Y given a certain value of X, say $X = x$, turns Equation (7.2) into the following equation:

$$E(Y \mid X = x) = \mu_{Y \mid X = x} = \alpha + \beta x \tag{7.3}$$

We can see from Equation (7.3) that the best estimate of the response variable given a certain value of the predictor variable is simply a linear function of two unknown parameters, $\alpha$ and $\beta$. After estimating the two unknown parameters, the target fitted simple linear regression equation can be obtained and expressed as

$$\hat{Y} = \hat{\alpha} + \hat{\beta} x \tag{7.4}$$

Task 2

1. Determine the response and predictor variables in the following cases:

a. An investigation is carried out to study if the amount of certain chemical that

will dissolve in a given volume of water depends on the level of temperature.

b. A study is done to determine if Oxide of Nitrogen emission rate is influenced

by the load of an engine.

c. An engineer tries to predict the tensile strength of a specimen of cold drawn

copper from the Brinell hardness reading.

2. Without looking at your notes, re-write a simple regression model and state

assumptions related to the model. Next, check if you get the idea correct.

3. Similar to the above, re-write a fitted regression equation and check if you are on the

right track.

A scatter diagram can be used to plot the n randomly selected paired observations. This

diagram is a helpful tool in detecting a relationship between two variables.

The scatter diagram is a two-dimensional cartesian plot, with the x-axis representing the

predictor variable values and the y-axis representing the response variable values. Figure 7.1

shows two examples of scatter diagrams. From the scatter plots in Figure

7.1 below, we can detect a positive slope for the linear model between Y and X in plot (a) and

a negative slope for the linear model in plot (b).

We can draw, by eye, many straight lines through the points on the scatter diagram. These straight lines, however, are subject to an individual's judgment and consequently will give different estimated values of $\alpha$ and $\beta$. To arrive at a common estimated, or fitted, regression equation with common $\hat{\alpha}$ and $\hat{\beta}$, we can use the method of least squares to estimate the unknown parameters, which is discussed in the next section.

Task 3

1. Plot a scatter diagram that implies a very strong positive relationship between two

variables.

2. Plot a scatter diagram that implies a moderately weak negative relationship between

two variables.

The method of least squares is a classical method proposed by a German scientist named Karl Gauss (1777-1855). It is a method that estimates the unknown simple linear regression coefficients, $\alpha$ and $\beta$, by minimizing the sum of squared residuals. The resulting fitted line provides the best possible description, in the least squares sense, of the relationship between the response and the predictor variables.

Residuals are simply the errors in a set of sample data. These residuals can be seen as the vertical deviations of the observed values from the estimated regression line, as shown in Figure 7.2 below, and are denoted by $e_i$ for the ith observation, $i = 1, 2, \dots, n$, that is

$$e_i = y_i - \hat{y}_i \tag{7.5}$$

These residuals are a very useful tool in providing information about the adequacy of the fitted model.

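Recalling Equation 7.1, the population random error term can be re-expressed as

$$\varepsilon_i = y_i - \alpha - \beta x_i \tag{7.6}$$

The sum of squared deviations of the observations from the true regression line is then given by

$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 \tag{7.7}$$

By the method of least squares, we estimate the unknown parameters $\alpha$ and $\beta$ by minimizing the sum of squared errors with respect to these parameters, that is, by equating the partial derivatives of L with respect to $\alpha$ and $\beta$ to zero. The least squares estimates of $\alpha$ and $\beta$, that is $\hat{\alpha}$ and $\hat{\beta}$ respectively, must satisfy the following conditions.

$$\left. \frac{\partial L}{\partial \alpha} \right|_{\hat{\alpha}, \hat{\beta}} = -2 \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0, \qquad \left. \frac{\partial L}{\partial \beta} \right|_{\hat{\alpha}, \hat{\beta}} = -2 \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i) x_i = 0 \tag{7.8}$$

Simplifying the two equations in Equation 7.8 results in the following two equations:

$$\sum_{i=1}^{n} y_i = n \hat{\alpha} + \hat{\beta} \sum_{i=1}^{n} x_i, \qquad \sum_{i=1}^{n} x_i y_i = \hat{\alpha} \sum_{i=1}^{n} x_i + \hat{\beta} \sum_{i=1}^{n} x_i^2 \tag{7.9}$$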

Equations (7.9) are commonly called the least squares normal equations.
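The step from the normal equations to the estimators can be sketched as follows. This derivation is an added illustration, not part of the original text. It writes $\hat{\alpha}$ and $\hat{\beta}$ for the least squares estimates of the intercept and slope, and $S_{xy}$ and $S_{xx}$ for the sum of products and the sum of squares for X used in this section.

```latex
% First normal equation, divided by n:
\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x},
\qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad
       \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i .

% Substitute into the second normal equation:
\sum_{i=1}^{n} x_i y_i
  = (\bar{y} - \hat{\beta}\,\bar{x})\sum_{i=1}^{n} x_i
    + \hat{\beta}\sum_{i=1}^{n} x_i^2 .

% Collect the terms in \hat{\beta}:
\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i
  = \hat{\beta}\left[\sum_{i=1}^{n} x_i^2
      - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^{2}\right]
\quad\Longrightarrow\quad
\hat{\beta} = \frac{S_{xy}}{S_{xx}} .
```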

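Solving the least squares normal equations simultaneously yields the least squares estimators

$$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} \tag{7.10}$$

$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} \tag{7.11}$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$, whereby the sum of products, $S_{xy}$, and the total sum of squares for X, $S_{xx}$, are given below.

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)$$

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2$$

Another term that will be much in use later in this chapter is the total sum of squares of the response variable Y, denoted by $S_{yy}$ and given as follows.

$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2$$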

These sums of squares and sum of product are commonly available in any standard statistical

formula sheet.

Once $\alpha$ and $\beta$ are estimated, the fitted or estimated regression model can be expressed as a simple deterministic straight line equation, given in Equation 7.4 and re-expressed below:

$$\hat{Y} = \hat{\alpha} + \hat{\beta} x$$

where $\hat{Y}$ is the predicted value of the response for a given value of $X = x$. In other words, the predicted value of the response, or dependent, variable y for a given value of the independent variable x can simply be obtained by substituting the given value of x into the above equation. In short, the fitted line can be used to make predictions of Y for any value of X, as long as the X values are within the range of the observed data.

Most scientific calculators provide tools for obtaining the estimated regression coefficients

and hence the fitted regression line. The following steps require readers to use this kind of

calculator: CASIO fx-570MS.

Steps:

1. Choose the Regression mode

Mode

Mode

Shift CLR 1

Note that Step 3 is vitally important when storing a new data set so that the old data set will

be removed and will not be mixed with the new data set to ensure an accurate analysis.

Once the sample data are stored in the calculator, we can retrieve the available output

by pressing appropriate operators as shown in Table 7.2.

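Table 7.2: Output available from CASIO fx-570MS calculator

Operators              Output
Shift S-SUM 1          $\sum x^2$
Shift S-SUM 2          $\sum x$
Shift S-SUM 3          $n$
Shift S-SUM > 1        $\sum y^2$
Shift S-SUM > 2        $\sum y$
Shift S-SUM > 3        $\sum xy$
Shift S-VAR 1          $\bar{x}$
Shift S-VAR > 1        $\bar{y}$
Shift S-VAR > > 1      $\hat{\alpha}$ (displayed as A)
Shift S-VAR > > 2      $\hat{\beta}$ (displayed as B)
Shift S-VAR > > 3      $r$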

Notice that r in Table 7.2 is the product moment correlation coefficient which will be covered

in Section 7.6 of this chapter.

Example 3

Obtain the equation of the least squares regression line of y on x for the following data:

x    20  25  30  35  40  45  50  55  60  65
y    98  87  92  79  68  57  59  43  60  38

Solution

The least squares regression line of y on x is $\hat{y} = \hat{\alpha} + \hat{\beta} x$.

Follow the five steps in Table 7.1. At Step 3, before we store the new data set, we must always make sure that the old data set is already cleared. This is indicated by $n = 0$ on the calculator screen before the new data set is stored.

After storing the above data set, we should get the following output:

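Operators              Output
Shift S-SUM 1          $\sum x^2 = 20125$
Shift S-SUM 2          $\sum x = 425$
Shift S-SUM 3          $n = 10$
Shift S-SUM > 1        $\sum y^2 = 50125$
Shift S-SUM > 2        $\sum y = 681$
Shift S-SUM > 3        $\sum xy = 26330$
Shift S-VAR 1          $\bar{x} = 42.5$
Shift S-VAR > 1        $\bar{y} = 68.1$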

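By formula,

$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}$$

Substituting the values obtained from the calculator into the formula leads to

$$\hat{\beta} = \frac{26330 - \frac{1}{10}(425)(681)}{20125 - \frac{425^2}{10}} = \frac{-2612.5}{2062.5} = -1.2667 \text{ (to 4 d.p.)}$$

and

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} = 68.1 - (-1.2667)(42.5) = 121.9348$$

Hence, the least squares regression line is

$$\hat{y} = 121.9348 - 1.2667x$$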

Operators              Output
Shift S-VAR > > 1      $\hat{\alpha} = 121.9333$
Shift S-VAR > > 2      $\hat{\beta} = -1.2667$

Intuitively, we should get the same values for $\hat{\alpha}$ and $\hat{\beta}$ whether we calculate the estimates by formula or read them directly from the calculator. Nonetheless, in this example the value of $\hat{\alpha}$ calculated by the formula and the value obtained directly from the calculator are slightly different. This small discrepancy arises because the slope was rounded at an earlier stage of the calculation.

The notation (d.p.) is short for decimal places. We normally round the final answer to four decimal places.

For simplicity at the expense of accuracy, the least squares linear regression of y on x in this example is thus

$$\hat{y} = 121.93 - 1.27x$$

We will refer to this equation in the later examples and tasks. Noticeably, the estimated regression line in this example has a positive intercept and a negative slope. Note that $\hat{\alpha}$ and $\hat{\beta}$ can vary in $(-\infty, \infty)$.
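The formula-based calculation in Example 3 can also be checked with a few lines of Python. This is only a sketch using the standard library; the function name `least_squares_fit` is an illustrative choice and not part of the text. Because the code carries full precision, the intercept agrees with the calculator value (121.9333) rather than the hand value obtained from the rounded slope (121.9348).

```python
def least_squares_fit(x, y):
    """Return (alpha_hat, beta_hat, s_xx, s_yy, s_xy) for paired data."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    # Corrected sums of squares and products, as defined in this section
    s_xx = sum(v * v for v in x) - sum_x ** 2 / n
    s_yy = sum(v * v for v in y) - sum_y ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    beta_hat = s_xy / s_xx                          # Equation (7.11)
    alpha_hat = sum_y / n - beta_hat * sum_x / n    # Equation (7.10)
    return alpha_hat, beta_hat, s_xx, s_yy, s_xy


# Data from Example 3
x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]

alpha_hat, beta_hat, s_xx, s_yy, s_xy = least_squares_fit(x, y)
print(f"S_xx = {s_xx}, S_yy = {s_yy:.1f}, S_xy = {s_xy}")
print(f"fitted line: y = {alpha_hat:.4f} + ({beta_hat:.4f}) x")
```

Substituting a value of x within the observed range (20 to 65) into `alpha_hat + beta_hat * x` then gives the predicted response.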

Example 4

Refer to Example 3. Predict the value of y when $x = 58$.

Solution

When $x = 58$, the predicted value of y using the regression equation is

$$\hat{y} = 121.93 - 1.27(58) = 48.27$$

Task 4

1. An article in the Journal of Sound and Vibration (Vol. 151, 1991, pp. 383-394)

described a study which investigated the relationship between noise exposure and

hypertension. The noise exposure is measured by the sound pressure level (SPL) in

decibels, whereas hypertension is measured by the blood pressure rise (BPR) in

millimetres of mercury (mmHg). A representative data set reported is as follows:

SPL, x

BPR, y

60 63 65 70 70 70 80 90 80 80

1

0

1

2 5

1

4 6

2 3

SPL, x

BPR, y

85

5

89

4

90

6

90 90

8 4

5 7

9

7

6

Ans: $\hat{y} = -10.1315 + 0.1743x$

b) What can you infer from the estimated value of the slope?

Ans: A unit increase in x leads to a 0.1743 unit increase in y.

c) Predict the value of Y when $x = 58$.

Ans : 48.27

linearly related to the speed setting, X , of the machine. The data below were

collected from a recent quality control record.

x    140  165  210  215  245  265  305  325  355  395
y     29   23   26   36   47   59   68   72   73   85

(a) Obtain $\bar{x}$, $\bar{y}$, $S_{xx}$, $S_{yy}$ and $S_{xy}$.

(b) Hence, calculate $\hat{\alpha}$ and $\hat{\beta}$ using the formulas. Compare the calculated estimates with those given directly by a scientific calculator.

Ans: $-16.6914$, $0.2614$

(c) Write a fitted simple linear regression model for the above data.

Ans: $\hat{y} = -16.6914 + 0.2614x$

(d) Next, estimate the number of defective items produced by the machine if the speed is 380.

Ans: $\hat{y} \approx 83$

Testing statistical hypotheses about the model parameters is an important part of assessing the adequacy and significance of a linear regression model. In this chapter, we limit our discussion to hypothesis testing about the slope of the regression model; hypothesis testing about the intercept is not covered. Readers may refer to Montgomery et al. (2003) p. 274 and other references for further details. Prior to testing the hypotheses, we need to make the following assumptions:

a) The random errors, $\varepsilon_i$, have mean 0 and (unknown) variance $\sigma^2$.

b) The random errors, $\varepsilon_i$, are normally distributed.

c) The random errors corresponding to different observations are independent and uncorrelated.

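Furthermore, we also need to first observe the properties of $\hat{\beta}$, which may be viewed as a random variable. From the regression model in Equation (7.1), we can describe the properties of $\hat{\beta}$ as follows:

a) $E(\hat{\beta}) = \beta$

b) $Var(\hat{\beta}) = \dfrac{\sigma^2}{S_{xx}}$, which is estimated by $\dfrac{\hat{\sigma}^2}{S_{xx}}$, where

$$\hat{\sigma}^2 = \frac{1}{n-2}\left(S_{yy} - \hat{\beta} S_{xy}\right)$$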

Note that the proofs of these properties are not covered in this chapter. These properties are useful in computing the test statistic value in a hypothesis testing procedure.

Hypothesis testing procedures include writing the hypotheses, stating the decision rule, computing the selected test statistic and finally making a decision about the null hypothesis for a particular parameter value, as discussed below:

When testing the hypotheses about the slope, $\beta$, we actually test the linearity of the simple linear regression model. Appropriate hypotheses are:

$H_0: \beta = 0$

$H_1: \beta \neq 0$

These hypotheses relate to the significance of regression. If we fail to reject $H_0$, we may conclude that there is no evidence of a linear relationship between $X$ and $Y$. This may imply either of these two situations:

a) $X$ is of little value in explaining the variation in $Y$ and therefore the best estimator of $Y$ for any value of $X$ is simply $\hat{Y} = \bar{Y}$, or

b) the true relationship between $X$ and $Y$ is not linear.

However, if we reject $H_0$, this implies that $X$ is of importance in explaining the variability in $Y$.

Once we state the hypotheses, we may choose either the t-test or the one-way ANOVA approach, which uses the F-test, to carry out the test. This choice only applies to a two-sided test. We use the t-test rather than the z-test because the number of paired observations is small (n < 30) and the variance is unknown. Note that the F-test value is simply the t-test value squared.

For testing the significance of regression, the t-test is a two-sided test with two critical regions: one in the lower tail, bounded above by the negative critical value, and one in the upper tail, bounded below by the positive critical value. The F-test has a single critical region in the upper tail, because squaring the t statistic maps both tails onto large values of F. The decision depends on the location of the computed test statistic: we reject $H_0$ if the computed test statistic lies in a critical region.

It is worth noting that the t-test can be applied not only to a two-sided test but also to a one-sided test. The F-test, however, applies only to a two-sided test. In short, we have two options for carrying out a two-sided test but only one option for a one-sided test.

A test statistic is computed by assuming the value under H 0 is true. This is the reason why

under H 0 the equality sign is important. This is also applied when we have one-sided test.

After computing the chosen test statistic, this value is then compared with the critical value

stated in Step 2. A decision is made according to the location of the test statistic value. If the

test statistic value lies in a critical region, we reject H 0 and say that we have strong or

sufficient evidence from our sample information that H 0 is false. Otherwise, we are unable to

reject H 0 implying that the available information is insufficient to go against H 0 .

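We can test the linearity of a simple linear regression model by using a t-test. Why the t-test? We have assumed that the errors, $\varepsilon_i$, are independently and identically distributed (iid) with a Normal distribution having mean 0 and variance $\sigma^2$. It follows directly that the observations $Y_i$ are also independent and normally distributed, with mean $\alpha + \beta x_i$ and variance $\sigma^2$. Now, $\hat{\beta}$ is a linear combination of independent normal variables, and hence is $N(\beta, \sigma^2 / S_{xx})$ using the properties listed earlier. In addition,

$$\frac{(n-2)\hat{\sigma}^2}{\sigma^2} \qquad (7.12)$$

follows a chi-square distribution with $n - 2$ degrees of freedom and is independent of $\hat{\beta}$. As a result, the appropriate test statistic

$$T_{test} = \frac{\hat{\beta} - \beta}{\sqrt{\widehat{Var}(\hat{\beta})}} = \frac{\hat{\beta} - \beta}{\sqrt{\hat{\sigma}^2 / S_{xx}}} \qquad (7.13)$$

follows the t distribution with $n - 2$ df, with $\beta$ set to 0 under $H_0: \beta = 0$. The determination of critical regions, and hence critical values, will depend on the alternative hypothesis, $H_1$, and the level of significance, $\alpha$, as listed in Table 7.3.

Note that $t_{\alpha, n-2}$ is a critical value for testing at significance level $\alpha$ with $n - 2$ degrees of freedom.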

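Table 7.3 Tests of hypothesis for the slope, $\beta$, of linear regression model

Type of hypothesis testing: Two-sided test (Test for linearity)
Hypothesis: $H_0: \beta = 0$, $H_1: \beta \neq 0$
Rejection criteria: Reject $H_0$ if $t_{test} < -t_{\alpha/2, n-2}$ or $t_{test} > t_{\alpha/2, n-2}$ [i.e. if $|t_{test}| > t_{\alpha/2, n-2}$]

Type of hypothesis testing: Right-tailed test (Test for a positive slope)
Hypothesis: $H_0: \beta = 0$, $H_1: \beta > 0$
Rejection criteria: Reject $H_0$ if $t_{test} > t_{\alpha, n-2}$

Type of hypothesis testing: Left-tailed test (Test for a negative slope)
Hypothesis: $H_0: \beta = 0$, $H_1: \beta < 0$
Rejection criteria: Reject $H_0$ if $t_{test} < -t_{\alpha, n-2}$

Example 5

Refer to Example 3. Test for the linearity of the regression model at the level of significance $\alpha = 0.05$.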

Solution

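From the solution to Example 3, we have

$$\sum x = 425,\quad \sum x^2 = 20125,\quad \sum y = 681,\quad \sum y^2 = 50125,\quad \sum xy = 26330,\quad \hat{\beta} = -1.27,\quad n = 10.$$

Therefore,

$$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 20125 - \frac{425^2}{10} = 2062.5$$

$$S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 50125 - \frac{681^2}{10} = 3748.9$$

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 26330 - \frac{(425)(681)}{10} = -2612.5$$

Thus,

$$\widehat{Var}(\hat{\beta}) = \frac{\hat{\sigma}^2}{S_{xx}} = \frac{S_{yy} - \hat{\beta} S_{xy}}{(n-2) S_{xx}} = \frac{3748.9 - (-1.27)(-2612.5)}{(10-2)(2062.5)} = 0.02612 \text{ (to 4 s.f.)}$$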

Step 1: State the hypotheses.

We are to test the significance of regression given by the following hypotheses:

$H_0: \beta = 0$

$H_1: \beta \neq 0$

Step 2: Determine the rejection region and state a decision rule.

The significance level is $\alpha = 0.05$. The $\neq$ sign under $H_1$ indicates that the test is two-sided. Therefore, the area in each tail of the t distribution is

$$\alpha / 2 = 0.05 / 2 = 0.025$$

and

$$df = n - 2 = 10 - 2 = 8$$

From Table 7 in Lee (2004), the critical value is $t_{0.025,8} = 2.306$. Thus, the decision rule is that we will reject $H_0$ if $|t_{test}| > t_{0.025,8}\ (= 2.306)$.

Step 3: Calculate the value of the test statistic.

The value of the test statistic is calculated as follows:

$$t_{test} = \frac{\hat{\beta} - 0}{\sqrt{\widehat{Var}(\hat{\beta})}} = \frac{-1.27 - 0}{\sqrt{0.02612}} = -7.8581 \text{ (to 4 d.p.)}$$

Step 4: Make a decision and state the conclusion.

The value of the test statistic is $t_{test} = -7.8581$; therefore, $|t_{test}| > 2.306$ and $t_{test}$ falls in the critical region. Hence, we reject the null hypothesis and conclude that the data provide sufficient evidence that the slope is significantly different from zero at the 0.05 level of significance.
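The four steps of Example 5 can be sketched in Python as follows. The function name `slope_t_statistic` is hypothetical, and the critical value 2.306 is simply copied from the text rather than computed. The sketch keeps the slope at full precision instead of rounding it to -1.27, so it yields approximately -7.76 rather than the hand-computed -7.8581; the decision is unchanged.

```python
import math


def slope_t_statistic(x, y):
    """Return (t_test, df) for testing H0: beta = 0 in simple linear regression."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(v * v for v in x) - sum_x ** 2 / n
    s_yy = sum(v * v for v in y) - sum_y ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    beta_hat = s_xy / s_xx
    # Estimated error variance and estimated variance of the slope
    sigma2_hat = (s_yy - beta_hat * s_xy) / (n - 2)
    var_beta_hat = sigma2_hat / s_xx
    t_test = (beta_hat - 0) / math.sqrt(var_beta_hat)
    return t_test, n - 2


# Data from Example 3
x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]

t_test, df = slope_t_statistic(x, y)
T_CRITICAL = 2.306  # t_{0.025, 8}, taken from the text (Table 7 in Lee, 2004)
reject_h0 = abs(t_test) > T_CRITICAL
print(f"t_test = {t_test:.4f}, df = {df}, reject H0: {reject_h0}")
```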

The analysis of variance (ANOVA) method is an alternative approach to test the significance of regression. Using this approach, the total variability in the response variable is partitioned into two meaningful components as follows:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Symbolically, we have

$$SST = SSR + SSE$$

where SS denotes sum of squares and

a) $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = S_{yy}$ is the total corrected sum of squares of y.

b) $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\beta} S_{xy}$ is the regression sum of squares, which measures the variability in y explained by the regression line.

c) $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = SST - SSR$ is the error sum of squares, which measures the variability in y left unexplained by the regression line.

The corresponding degrees of freedom, df, associated with each SS are as follows:

a) $df_T = n - 1$.

b) $df_{reg} = 2 - 1 = 1$, since the model has two unknown parameters, $\alpha$ and $\beta$.

c) $df_E = df_T - df_{reg} = (n - 1) - 1 = n - 2$.

If we divide SSR and SSE by their respective degrees of freedom, we obtain the mean squared regression, denoted by MSR (= SSR/1), and the mean squared error, denoted by MSE (= SSE/(n - 2)), respectively. It can be shown that the test statistic

$$F_{test} = \frac{MSR}{MSE}$$

follows the F distribution with 1 and $n - 2$ degrees of freedom under the null hypothesis $H_0: \beta = 0$.

We can arrange the test procedure using this approach in an ANOVA table, as shown in Table 7.4.

Table 7.4

Source of Variation    Sum of Squares                 Degrees of Freedom    Mean Square    F_test
Regression             $SSR = \hat{\beta} S_{xy}$     1                     MSR            MSR / MSE
Error                  $SSE = SST - SSR$              $n - 2$               MSE
Total                  $SST = S_{yy}$                 $n - 1$

The hypotheses are

$H_0: \beta = 0$

$H_1: \beta \neq 0$

We will reject $H_0$ if $f_{test} > f_{\alpha,1,n-2}$ at level of significance $\alpha$, where $f_{\alpha,1,n-2}$ is the critical value which is tabulated in Table 9 of Lee (2004).

Example 6

Reconsider Example 3. Test $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ using the ANOVA approach.

Solution

Step 1: Calculate $\hat{\beta}$, $S_{yy}$, $S_{xy}$

From the solution in Example 5, we have $\hat{\beta} = -1.27$, $S_{yy} = 3748.9$ and $S_{xy} = -2612.5$.

Step 2: Compute all the sums of squares

By formula,

$$SST = S_{yy} = 3748.9$$

$$SSR = \hat{\beta} S_{xy} = (-1.27)(-2612.5) = 3317.875$$

$$SSE = SST - SSR = 3748.9 - 3317.875 = 431.025$$

By substitution, the complete ANOVA table is as follows:

Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square    f_test
Regression             3317.875          1                     3317.875       61.5811
Error                  431.025           10 - 2 = 8            53.8781
Total                  3748.9            9

The hypotheses statements: $H_0: \beta = 0$ versus $H_1: \beta \neq 0$.

The rejection criterion: We will reject $H_0$ if $f_{test} > f_{0.05,1,8}$ [= 5.32 from Table 9 of Lee (2004)].

Decision and Conclusion: From the ANOVA table, $f_{test} = 61.5811$, which is very far into the critical region, i.e. $f_{test} > 5.32$. Therefore, we reject $H_0$ and conclude that the data provide sufficient evidence to support the existence of linearity between $X$ and $Y$.
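The ANOVA table of Example 6 can be reproduced with a short Python sketch. The function name `regression_anova` and the dictionary layout are illustrative assumptions, and the critical value 5.32 is copied from the text. Since the code uses the unrounded slope, it gives F of about 60.20 instead of the hand value 61.5811 (which uses the slope rounded to -1.27); both are far beyond the critical value.

```python
def regression_anova(x, y):
    """Return the ANOVA quantities for a simple linear regression of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(v * v for v in x) - sum_x ** 2 / n
    s_yy = sum(v * v for v in y) - sum_y ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    beta_hat = s_xy / s_xx
    sst = s_yy                  # total corrected sum of squares
    ssr = beta_hat * s_xy       # regression sum of squares
    sse = sst - ssr             # error sum of squares
    msr = ssr / 1               # df_reg = 1
    mse = sse / (n - 2)         # df_E = n - 2
    return {"SST": sst, "SSR": ssr, "SSE": sse,
            "MSR": msr, "MSE": mse, "F": msr / mse}


# Data from Example 3
x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]

table = regression_anova(x, y)
F_CRITICAL = 5.32  # f_{0.05, 1, 8}, taken from the text (Table 9 in Lee, 2004)
for name, value in table.items():
    print(f"{name:>3} = {value:.4f}")
print("reject H0:", table["F"] > F_CRITICAL)
```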

Task 5

1. Without looking at any reading material, list down briefly steps involved in testing the

significance of regression. Check your list with your friend who sits next to you and

compare your answers.

2. Why is the t-test preferred to the z-test in testing the slope of a linear regression model?

Discuss with your neighbours.

3. Consider the data from Question 1 in Task 4, by using t-test approach, test the

hypothesis that the regression of blood pressure rise (BPR) on the sound pressure

level (SPL) is linear at the 0.05 level of significance.

[Ans: $t_{test} = 7.3145 > t_{0.025,18} = 2.101$, reject $H_0$, linearity significantly exists.]

H1 : 0 at the level of significance = 0:01. Write your conclusive decision

clearly.

appropriate test approach. Will your data provide enough evidence to reject H 0 ?

Verify your answer.

[Ans: $t_{test} = 9.8448 > t_{0.05,8} = 1.86$, reject $H_0$, positive linearity significantly exists.]

6. Repeat Question 3 but by using a one-way ANOVA approach. Compare your current decision with the previous one.

[Ans: $f_{test} = 53.5015 > f_{0.05,1,18} = 4.41$, reject $H_0$, linearity significantly exists.]

7. Repeat Question 4 but using a one-way ANOVA approach. What is your finding?

[Ans: $f_{test} = 96.9210 > f_{0.05,1,8} = 5.32$, reject $H_0$, linearity significantly exists.]

7.6 Correlation

from knowledge of the independent, or controlled, variable X. In this section, however, we

will consider the problem of measuring the relationship between two variables, X and Y. As

such, we have a correlation analysis which attempts to measure

1. the strength, and

2. the direction

of a relationship between two variables by means of a single number called a correlation

coefficient.

In particular, a linear correlation coefficient is a measure of the strength and direction of a linear relationship between two random variables, X and Y, denoted by $\rho$ for population data and r for sample data. Here, r is known as Pearson's product moment correlation coefficient, or simply the sample correlation coefficient, defined as

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \qquad (7.14)$$

It measures the extent to which the points on a scatter diagram cluster about a straight line. For example, if we construct a scatter diagram for a sample data having n pairs of measurements $\{(x_i, y_i) : i = 1, 2, \ldots, n\}$

7.6.2 Properties of r

Some properties of r include:

a) $-1 \le r \le 1$.

b) If r is close to 1, it implies that there is a strong positive linear relationship between X and Y. Furthermore, when r = 1, we have a perfect positive linear relationship.

c) On the other hand, if r is close to -1, it implies that there is a strong negative linear relationship between X and Y. Likewise, if we have r = -1, it means that we have a perfect negative linear relationship.

d) When r is close to zero, either from the positive or the negative direction, it implies that there is a weak or no linear relationship between X and Y.

Scatter diagrams below show three different positive linear relationships between X and Y, in an increasing order of strength:

[Figure: three scatter diagrams with (a) r = 0.60, (b) r = 0.85, (c) r = 1]

Meanwhile, the scatter diagrams below show examples of negative linear correlation between X and Y, in an increasing order of strength:

[Figure: three scatter diagrams with (a) r = -0.60, (b) r = -0.85, (c) r = -1]

Noticeably, the wider the scatter of the points around a straight line, the weaker the correlation will be and hence the closer r is to 0, either from the negative or the positive direction.

The two diagrams below display examples of the absence of a linear relationship between X and Y. For Figure (b) below, although r = 0 implies no linear relationship, the two variables do actually have a relationship which is nonlinear (in this case a quadratic relationship).

Example 7

Compute the product moment correlation coefficient to measure the relationship between the X and Y variables based on the sample data from Example 3. Comment on your answer.

Solution

The correlation coefficient computed based on the sample data is the sample correlation coefficient, r, given as

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{-2612.5}{\sqrt{(2062.5)(3748.9)}} = -0.9395$$

Since r is close to -1, there is a strong negative linear relationship between X and Y.

To obtain the value directly from the calculator, we may use the following operators:

Operators              Output
Shift S-VAR > > 3      $r = -0.9395$
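As a cross-check on Example 7, Equation (7.14) can be evaluated directly in Python. The helper name `pearson_r` is an illustrative choice; only the standard library is used.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(v * v for v in x) - sum_x ** 2 / n
    s_yy = sum(v * v for v in y) - sum_y ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)


# Data from Example 3
x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]

r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

The sign of r always matches the sign of the fitted slope, because both share the numerator $S_{xy}$.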

Task 6

1. Refer to the sample data from Question 1 in Task 4. Measure the strength of the relationship between blood pressure rise (BPR) and the sound pressure level (SPL).

[Ans: 0.8650; strong positive correlation]

2. Refer to the sample data from Question 2 in Task 4. Obtain the Pearson product moment correlation coefficient for the sample data. Comment on your result.

[ Ans : 0.9611; very strong positive correlation]

The steps listed below are the procedures for using Excel. In this case, we consider the sample data from Question 1 in Task 4.

a) First, store the data in an Excel worksheet as shown in Figure 7.3 overleaf.

Figure 7.3 Data storage in Excel worksheet for regression analysis

b) Next, click Tools from the menu bar and then choose Data Analysis from the pull-down menu followed by Regression from the pop-up menu.

7. The following table lists the measurements of the air velocity and evaporation coefficient of burning fuel droplets in an impulse engine:

Air Velocity (cm/sec)    Evaporation Coefficient ( /sec)
 20                       1.8
 60                       3.5
100                       3.7
140                       5.6
180                       7.5
220                       7.8
260                       9.8
300                      11.6
340                      13.7
380                      16.5
420                      18.6
460                      19.5

(a) Fit a straight line to these data by using the method of least squares.

(b) Estimate the evaporation coefficient of a droplet when the air velocity is 190 cm/sec.

(c) Test whether the evaporation coefficient of burning fuel droplets in an impulse engine is positively related to the air velocity at the 0.10 significance level.

(d) Find the Pearson correlation coefficient. Give your comment.

8.

(in RM100) of the recently university graduates in engineering is related to their CGPA.

The excel output is as follows. Assume that the data is normally distributed.

(a)

(b)

(c)

Predict the starting monthly salary if the CGPA is 3.6.

Does the data support the existence of a linear relationship between starting salaries

(d)

Find the Pearson correlation coefficient. What can you infer from the value?

9.

A manufacturing company bought a new cutting tool from company A and wanted to

investigate the useful life (in hours) related to the speed at which the tool is operated.

The Excel output follows for useful life of the tool (in hours) and speed (meters per

minutes).

(b) Predict the useful life if the speed is 55 m/min.

(c) Test the validity of the model built in part (a). Use $\alpha$ = 0.01.

(d) Find the correlation. Interpret the value.

10.

The following output from Excel gives information on the engine powers x (in

(a) Find the least square estimates of the regression line for the engine power against the

maximum speed.

(b) What does the estimate of $\beta$ imply?

(c) What is the predicted maximum speed if the engine power is 72 kilowatt?

(d) Is there any evidence that the data strongly suggest a linear association between the

engine power and the maximum speed at the 0.01 significance level?

(e) Find the correlation between the engine power and the maximum speed. Explain your

answer.


Chapter 8

Nonparametric Statistics

Learning Objectives:

At the end of this chapter, students should be able to:

a)

b)

c)

d)

e)

f)

understand and apply the sign test.

understand and apply the run test.

understand and apply the Mann-Whitney test.

understand and apply the Wilcoxon signed-rank test.

compute Spearman's rank correlation coefficient.

8.1 Introduction

There are four types of data, namely nominal, ordinal, interval scale and ratio scale data. An example of nominal data is gender, where male may be represented as 1 and female as 2. The numbers are used only to identify the categories of the gender variable. Data that can be ordered from the lowest to the highest value are ordinal data; an example is feeling towards school, which can be categorized and ordered as very unhappy, unhappy, somewhat happy, happy and very happy. To understand interval scale data, we start with an example: temperature. A reading of 0 °C does not mean there is no temperature, and 50 °C is not twice as hot as 25 °C. In contrast, for ratio scale data, a length of 0 m means there is no length and 50 m is twice the length of 25 m. Measurements of length, weight and density are some examples of ratio scale data. The statistical methods that we have discussed before, such as the t-test, ANOVA and regression, deal with interval scale or ratio scale data, and the data being analyzed are assumed to come from a population with a specific probability distribution. For example, in the t-test, the population from which a random sample is selected is assumed to be normally distributed with mean $\mu$ and variance $\sigma^2$. In general, these techniques are classed as parametric statistics. This chapter discusses an alternative to parametric statistics, namely non-parametric statistics (NPS). Parametric statistics is capable of analyzing interval scale and ratio scale data. The mean and variance of these data can be calculated, interpreted and used in the analysis. This is not so for nominal and ordinal data.

For example, consider the nominal data gender with categories male and female. Surely the mean of gender has no meaning. NPS is the method to use when dealing with such data.

In general, a statistical technique is categorized as NPS if it has at least one of the

following characteristics:

1. The method is used on nominal data.

2. The method is used on ordinal data.

3. The method is used on interval scale or ratio scale data but there is no assumption

regarding the probability distribution of the population where the sample is selected.

8.2

Sign Test

We have seen the test of population proportion that uses the sampling distribution

$$\hat{P} \sim N\left(\pi, \frac{\pi(1-\pi)}{n}\right)$$

for large sample size n. The sign test is a test of the population proportion for testing $\pi = 0.5$ in a small sample situation (usually for $n \le 20$).

To understand how the sign test works, let us look at this example.

A study is conducted to see the preference of hand-phone users towards two brands of hand-phones, A and B, by asking the views of 12 users. Specifically, this study is done to see if the preferences are the same towards the two brands.

If there is no difference in preference, then we can anticipate that the proportion of users who prefer brand A is the same as, or about equal to, the proportion of users who prefer brand B. Since there are only two brands being tested, the proportion of users preferring brand A is 0.5, and similarly for brand B, if there is no difference in brand preference.

If the proportion of users preferring brand A is greater than that of brand B, we can anticipate that the number of users preferring brand A will be a lot higher than the number of users preferring brand B. On the other hand, if the proportion of users preferring brand B is greater than that of brand A, we can anticipate that the number of users who prefer brand A will be a lot lower than the number of users preferring brand B.

This forms our hypotheses

$H_0: \pi = 0.5$

$H_1: \pi \neq 0.5$

where $\pi$ is the proportion of the population of users preferring brand A.

Now, we have 12 subjects who named their preferences. Let X be a random variable representing the number of users who prefer brand A. If $H_0$ is true, X follows the Binomial distribution with n = 12 and $\pi$ = 0.5, or simply

$$X \sim Bin(12, 0.5)$$

For notational purposes, let those who prefer brand A be represented by the sign + and those who prefer brand B by the sign -. Hence the name sign test. So the random variable X is redefined to represent the number of + signs, and $X \sim Bin(12, 0.5)$. Our alternative hypothesis $H_1: \pi \neq 0.5$ indicates that we have a two-tailed test with two rejection regions. Suppose this test is done at significance level $\alpha$ = 0.05. This means we would reject $H_0$ if $X \le a$ or $X \ge b$, i.e. we would reject $H_0$ if the number of + signs is at most a or at least b. The issue now is to find the values of a and b.

By the nature of a two-tailed test, we want $P(X \le a) + P(X \ge b) \approx 0.05$. Now for $X \sim Bin(12, 0.5)$,

$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$$

for x = 0, 1, 2, ..., 12, with n = 12 and p = 0.5. The probability for each value of x is shown in the table below:

X = x    P(X = x)
 0       0.0002
 1       0.0029
 2       0.0161
 3       0.0537
 4       0.1208
 5       0.1934
 6       0.2256
 7       0.1934
 8       0.1208
 9       0.0537
10       0.0161
11       0.0029
12       0.0002

If we decide to reject $H_0$ when $X \le 2$ or $X \ge 10$, the significance level is

$$P(X \le 2) + P(X \ge 10) = P(X=0) + P(X=1) + P(X=2) + P(X=10) + P(X=11) + P(X=12) = 0.0386$$

which is less than our chosen $\alpha = 0.05$.

If we decide to reject $H_0$ when $X \le 3$ or $X \ge 9$, we can see that the significance level is

$$P(X \le 3) + P(X \ge 9) = 0.146$$

which is a lot more than our chosen $\alpha = 0.05$.

Since the value 0.0386 is closer to 0.05 than 0.146, it is reasonable to make our decision rule as: reject $H_0$ if the number of + signs is at most 2 or at least 10. However, with this rule, our significance level is not exactly 0.05 but 0.0386.

If the observed number of + signs falls in this rejection region, we would reject $H_0$ and conclude that the data provide evidence that there is a difference in brand preference at significance level $\alpha = 0.05$.
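The binomial probabilities and the two candidate significance levels above can be verified with a small Python sketch. The function names `binomial_pmf` and `two_tailed_size` are illustrative and not from the text; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def binomial_pmf(x, n, p=0.5):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def two_tailed_size(a, b, n, p=0.5):
    """Actual significance level of the rule: reject H0 if X <= a or X >= b."""
    lower = sum(binomial_pmf(k, n, p) for k in range(0, a + 1))
    upper = sum(binomial_pmf(k, n, p) for k in range(b, n + 1))
    return lower + upper


n = 12
for k in range(n + 1):
    print(f"P(X = {k:2d}) = {binomial_pmf(k, n):.4f}")

size_2_10 = two_tailed_size(2, 10, n)   # rule: X <= 2 or X >= 10
size_3_9 = two_tailed_size(3, 9, n)     # rule: X <= 3 or X >= 9
print(f"size of rule (2, 10): {size_2_10:.4f}")
print(f"size of rule (3, 9):  {size_3_9:.4f}")
```

Summing the exact probabilities gives 0.0386 for the first rule; adding the four-decimal table entries instead gives 0.0384, which is only a rounding effect.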

The sign test uses the binomial distribution as the decision rule. In general, we have three choices for our hypotheses:

1. Choice 1

$H_0: \pi = 0.5$

$H_1: \pi \neq 0.5$

2. Choice 2

$H_0: \pi = 0.5$

$H_1: \pi > 0.5$

3. Choice 3

$H_0: \pi = 0.5$

$H_1: \pi < 0.5$

Choice 1: This is a two-tailed test with the rejection regions $X \le a$ or $X \ge b$. The value of a is such that $P(X \le a) \le \alpha/2$ and the value of b is such that $P(X \ge b) \le \alpha/2$.

Choice 2: This is a one-tailed test on the right with the rejection region $X \ge a$. The value of a is such that $P(X \ge a) \le \alpha$. The graph is shown in Figure 8.3.

Choice 3: This is a one-tailed test on the left with the rejection region $X \le a$. The value of a is such that $P(X \le a) \le \alpha$. The graph is shown in Figure 8.4.

Example 1

Ten engineering students went on a diet program in an attempt to lose weight, with the following results:

Name      Weight before    Weight after
Abu        69               58
Ah Lek     82               73
Sami       76               70
Kassim     89               71
Chong      93               82
Raja       79               66
Busu       72               75
Wong       68               71
Ali        83               67
Tan       103               73

Is the diet program an effective means of losing weight? Do the test at significance level $\alpha = 0.10$.

Solution

Let the sign + indicate Weight before - Weight after > 0, and the sign - indicate Weight before - Weight after < 0.

Thus

Name      Weight before    Weight after    Sign
Abu        69               58              +
Ah Lek     82               73              +
Sami       76               70              +
Kassim     89               71              +
Chong      93               82              +
Raja       79               66              +
Busu       72               75              -
Wong       68               71              -
Ali        83               67              +
Tan       103               73              +

$H_0: \pi = 0.5$

$H_1: \pi > 0.5$

Let X represent the number of + signs. Assuming $H_0$ is correct, $X \sim Bin(10, 0.5)$. The observed number of + signs is 8, and the probability of getting at least 8 + signs is

$$P(X \ge 8) = 1 - P(X \le 7) = 1 - 0.9453 = 0.0547$$

which is less than $\alpha = 0.10$. Thus, we reject $H_0$ and conclude that there is sufficient evidence that the diet program is an effective programme to reduce weight.
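The sign test of Example 1 can be sketched in Python as below. The function name `sign_test_upper_tail` is hypothetical, and dropping tied pairs is an assumption of this sketch (there are no ties in this data set).

```python
from math import comb


def sign_test_upper_tail(before, after):
    """Return (plus, n, p_value) for H1: pi > 0.5, where + means before > after."""
    diffs = [b - a for b, a in zip(before, after) if b != a]  # ties are dropped
    n = len(diffs)
    plus = sum(1 for d in diffs if d > 0)
    # P(X >= plus) for X ~ Bin(n, 0.5)
    p_value = sum(comb(n, k) for k in range(plus, n + 1)) / 2 ** n
    return plus, n, p_value


# Data from Example 1 (same order as the table)
before = [69, 82, 76, 89, 93, 79, 72, 68, 83, 103]
after = [58, 73, 70, 71, 82, 66, 75, 71, 67, 73]

plus, n, p_value = sign_test_upper_tail(before, after)
ALPHA = 0.10
print(f"number of + signs = {plus} out of {n}")
print(f"P(X >= {plus}) = {p_value:.4f}, reject H0: {p_value < ALPHA}")
```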

Example 2

Sixteen students were asked about their views on their college's new regulation of not allowing students to drive on campus. Thirteen of them oppose the ruling while 3 of them agree with it. Is there evidence to support the hypothesis that a minority of students support the new ruling at significance level $\alpha = 0.05$?

Solution

Let X represent the number of students supporting the ruling.

$H_0: \pi = 0.5$

$H_1: \pi < 0.5$

Assuming $H_0$ is correct, then $X \sim Bin(16, 0.5)$. The observed X is 3. Using the distribution above,

$$P(X \le 3) = 0.0106$$

which is less than $\alpha = 0.05$. Thus, we reject $H_0$ and conclude that there is sufficient evidence that a minority of students support the new ruling.

Example 2

A paint supplier claims that a new additive will reduce the drying time of its acrylic paint. To test this claim, 8 panels of wood are painted, one side of each panel with paint containing the new additive and the other side with paint containing the regular additive. The drying times, in hours, were recorded as follows:

Drying Times

Panel    New Additive    Regular Additive
1        6.4             6.6
2        5.8             5.8
3        7.4             7.8
4        5.5             5.7
5        6.3             6.0
6        7.8             8.4
7        8.6             8.8
8        8.2             8.4

Use the sign test at the 0.05 level to test the hypothesis that the new additive has the same drying time as the regular additive.

[Ans: $P(X \le 1) = 0.0625 > 0.025$ or $P(X \ge 1) = 0.9922 > 0.025$; fail to reject $H_0$ and conclude that the new and regular additives have the same drying time.]

In cases where the number of subjects is large (n > 20), the normal approximation can be used as a decision rule: if X is a random variable representing the number of + signs, then approximately

$$X \sim N(0.5n,\ 0.25n)$$

Consider a football team A with the following results in 12 games.

W W W W W W W W W W W W

It must be a good team to win 12 consecutive games; its wins are neither by chance nor random. Based on these results, we can easily predict the outcome of the next game.

Consider another football team B with the following results in 12 games.

W L W L W L W L W L W L

Based on these results, we can anticipate the result of the next game. The team's performance is predictable and the results are not random.

Consider another football team C with the following results in 12 games.

W

W L

Definition: A run is a sequence of one or more consecutive occurrences of the same outcome in a sequence of occurrences in which there are only two possible outcomes.

For team A, there is only one run, with 12 Ws and 0 Ls.

W W W W W W W W W W W W

For team B, there are 12 runs, with 6 Ws and 6 Ls.

W L W L W L W L W L W L

W

W

L

L

L

W L

W

W

H 0 : The outcome of the game is random

For team A, we see that the outcome is not random and the number of runs is the minimum, 1. For team B, we see that the outcome is not random and the number of runs is the maximum, 12. So, too many runs or too few runs indicate that the outcome is not random.

Let

R= The number of runs

n1 = number of W

n2 = number of L

n n1 n2

It is a tedious job to construct the probability distribution of R for higher values of $n_1$ and $n_2$. With the probability distribution, we are able to build the rule for rejecting or failing to reject $H_0$. As we have said earlier, a small value of R or a large value of R indicates that the outcome is not random; thus the test of randomness is a two-tailed test. This test of randomness is called the run test.
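Counting the number of runs R in a sequence of two possible outcomes is straightforward to automate. The following Python sketch uses a hypothetical helper `count_runs` and checks it against the team A and team B sequences described above.

```python
def count_runs(outcomes):
    """Count runs: maximal blocks of identical consecutive outcomes."""
    runs = 0
    previous = object()  # sentinel that never equals a real outcome
    for outcome in outcomes:
        if outcome != previous:
            runs += 1          # a new run starts whenever the outcome changes
            previous = outcome
    return runs


team_a = "W" * 12            # 12 consecutive wins
team_b = "WL" * 6            # strictly alternating wins and losses

for name, results in (("A", team_a), ("B", team_b)):
    n1, n2 = results.count("W"), results.count("L")
    print(f"team {name}: R = {count_runs(results)}, n1 = {n1}, n2 = {n2}")
```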

Since the run test is a two-tailed test, we would reject $H_0$ if the observed number of runs satisfies $R \le a$ or $R \ge b$. The values a and b are chosen in such a way that $P(R \le a) \le \alpha/2$ and $P(R \ge b) \le \alpha/2$. Otherwise, we fail to reject $H_0$ and conclude that there is no evidence that the outcome is not random.

It is quite a tedious job to construct the probability distribution of the number of runs R each time we perform a run test. Table 13 page 43 in Lee (2004) provides the critical values for rejecting or failing to reject $H_0$ at various significance levels.

Example 3

A machine cuts plywood with mean length 100 cm and standard deviation 1 cm. Fifteen pieces of plywood produced consecutively by this machine show the following lengths (in cm).

99.5

99.5

99

99.8

100.6

99.7

100.1

99.8

100.3

100.1

100.2

100.5

100.2

100.3

99.9

Can we conclude that the lengths of plywood cut by this machine fall randomly above and below the mean length of 100 cm at significance level $\alpha = 0.05$?

Solution

Let + indicate a length of plywood which is over 100 cm and - indicate a length which is below 100 cm. The outcome is thus,

++++++++

with n = 15, $n_1 = 8$, $n_2 = 7$, where $n_1$ is the number of + and $n_2$ is the number of -.

H 0 : The length is random

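The number of observed runs is $R = 9$. Using the statistical table, we would reject $H_0$ if R is at most the lower critical value or at least the upper critical value. Since the observed R lies between the two critical values, we fail to reject $H_0$ and conclude that there is no evidence that the length of plywood cut by the machine is not random.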

Task 3

+ + + + + ++ + +

where + indicates a price increase from the previous day and - indicates a price decrease from the previous day. Is the price increase or decrease a random event at significance level $\alpha = 0.05$?

[Ans: $5 < R = 11 < 15$, fail to reject $H_0$ and conclude that the price increase or decrease is a random event.]

Task 4

In an industrial production line, items are inspected daily for defective items. The

following is a sequence of defective items, D, and non-defective items, N, produced by this

production line:

D

Use the runs test to determine whether the defective items are occurring at random. Let $\alpha = 0.05$.

[Ans: $4 < R = 10 < 14$, we fail to reject $H_0$ and conclude that the defective items are occurring at random.]

applying the run test. The Normal approximation comes in handy with the following

statement.

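For large values of $n_1$ and $n_2$, the distribution of R (the number of runs in the sample) is approximately Normal with mean

$$\mu_R = \frac{2 n_1 n_2}{n_1 + n_2} + 1$$

and variance

$$\sigma_R^2 = \frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)},$$

i.e.

$$R \sim N\left( \frac{2 n_1 n_2}{n_1 + n_2} + 1,\ \frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)} \right)$$

and

$$Z = \frac{R - \left( \dfrac{2 n_1 n_2}{n_1 + n_2} + 1 \right)}{\sqrt{\dfrac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}}} \sim N(0, 1)$$

In this case we can use the standard Normal distribution to find the critical values of z for the given significance level $\alpha$.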
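A minimal sketch of the Normal approximation for the run test follows. The function name `runs_test_z` is hypothetical, and the inputs $R = 9$, $n_1 = 8$, $n_2 = 7$ are used only to show the arithmetic; samples of this size are small, so in practice the exact table would be preferred.

```python
import math
from statistics import NormalDist


def runs_test_z(R, n1, n2):
    """Return (z, mean, variance) for the Normal approximation of the run test."""
    n = n1 + n2
    mean = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / (n ** 2 * (n - 1))
    z = (R - mean) / math.sqrt(variance)
    return z, mean, variance


alpha = 0.05
z, mean, variance = runs_test_z(R=9, n1=8, n2=7)

# Two-tailed test: reject H0 when |z| exceeds the upper alpha/2 point of N(0, 1).
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(z) > z_critical
print(f"mean={mean:.4f} variance={variance:.4f} z={z:.4f} reject={reject}")
```

Because the test is two-tailed, the sketch compares $|z|$ with the upper $\alpha/2$ point rather than the upper $\alpha$ point.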

8.4

8.4.1 Introduction

Often enough we are dealing with data in the form of ranks as in the case of ordinal data. For

instance, a study may involve the feelings of students towards this subject which can be

categorized as Very Unhappy, Unhappy, Somewhat Happy, Happy and Very Happy.

The feelings can be ordered or ranked where rank 1 represents the lowest feeling Very

Unhappy, rank 2 the second lowest feeling Unhappy and so forth. This section describes

some statistical methods in dealing with such data.

8.4.2 Mann-Whitney Test

The Mann-Whitney test, sometimes referred to as the Wilcoxon rank-sum test, is used to test whether the location measures (such as means) of two different populations are identical. Two independent random samples are required, one from each population. Let $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$ be two random samples of sizes $n$ and $m$, where $n \le m$, from populations X and Y respectively. We wish to test the hypothesis that the two distributions X and Y are the same. The hypotheses are

$H_0: P(X) = P(Y)$

$H_1: P(X) \ne P(Y)$

Assign the ranks 1 to $n + m$ to the combined samples, where the smallest value from both samples is assigned rank 1, the second smallest value is assigned rank 2, and so on. The highest value is assigned rank $n + m$. Let $R(X_i)$ and $R(Y_j)$ denote the ranks assigned to $X_i$ and $Y_j$ for all $i$ and $j$. For convenience let $N = m + n$. The sum of the ranks assigned to the sample from population X can be used as a test statistic,

$$T = \sum_{i=1}^{n} R(X_i)$$

Sample   X   X   X   Y   Y   Y
Rank     1   2   3   4   5   6

We see that

$$T_1 = \sum_{i=1}^{3} R(X_i) = 1 + 2 + 3 = 6$$

and

$$T_2 = \sum_{j=1}^{3} R(Y_j) = 4 + 5 + 6 = 15$$

On one hand, when the sample sizes for both samples are the same, we would expect

$$T_1 = \sum R(X_i) \approx T_2 = \sum R(Y_j)$$

if both populations X and Y are the same. However, if they are significantly different, we would expect $T_1 = \sum R(X_i)$ to differ significantly from $T_2 = \sum R(Y_j)$, with one rank sum much smaller than the other.

On the other hand, when the sample sizes differ, a rather small $T_1$ or a rather large $T_1$ gives some indication that the populations differ. Comparing $T_1$ with $T_2$ is not appropriate with differing sample sizes, because the two sums add up different numbers of integer ranks. Thus, the inferential aspect must only consider either $T_1$ alone or $T_2$ alone.

Table A7 of W. J. Conover (1971) provides the critical values for rejection of $H_0$ for various values of $n$ and $m$. The table provides $w_p$ such that $P(T < w_p) \le p$. For example, consider $n = 5$ and $m = 7$. The value 15 corresponding to $p = 0.001$ means $P(T < 15) \le 0.001$ and the value 22 corresponding to $p = 0.05$ means $P(T < 22) \le 0.05$. Thus we would reject $H_0$ if the observed $T = \sum R(X_i) < 22$ at $\alpha = 0.05$, with 22 as the left critical value. The right-hand-side critical value is obtained by $n(N + 1) - w_p$. So the right-hand-side critical value is $5(5 + 7 + 1) - 22 = 43$, i.e. $P(T > 43) \le 0.05$. Thus we would reject $H_0$ if $T < 22$ or $T > 43$ at $\alpha = 0.10$, which corresponds to $p = 0.05$ for the two-sided test. However, when $n$ and $m$ are large,

$$T \sim N\left( \frac{n(N + 1)}{2},\ \frac{nm(N + 1)}{12} \right)$$
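The arithmetic behind the right-hand critical value and the large-sample parameters can be checked with a short sketch. It assumes the large-sample result is $T \sim N\left(n(N+1)/2,\ nm(N+1)/12\right)$ with $N = n + m$; the function names are hypothetical, and the lower critical value 22 is the table value quoted in the text for $n = 5$, $m = 7$.

```python
def right_critical_value(n, m, w_p):
    """Upper critical value obtained from the tabulated lower value w_p."""
    N = n + m
    return n * (N + 1) - w_p


def rank_sum_normal_parameters(n, m):
    """Mean and variance of the rank sum T under H0 for large n and m."""
    N = n + m
    mean = n * (N + 1) / 2
    variance = n * m * (N + 1) / 12
    return mean, variance


upper = right_critical_value(n=5, m=7, w_p=22)
mean, variance = rank_sum_normal_parameters(n=5, m=7)
print("upper critical value:", upper)
print("mean:", mean, "variance:", round(variance, 4))
```

The lower and upper critical values are symmetric about the mean of T, which is why the upper value can be recovered from the lower one.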

Example 4

Data below show the marks obtained by electrical engineering students in an examination:

Gender   Male   Male   Male   Male   Female   Female   Female   Female   Female
Marks    60     62     78     83     40       65       70       88       92

Can we conclude that the achievements of male and female students are identical at significance level $\alpha = 0.1$?

Solution

$H_0$: Male and Female achievements are the same.

Let the random variable X represent the gender Male and Y represent the gender Female.

Gender    Random Variable   Marks   Rank
Male      X                 60      2
Male      X                 62      3
Male      X                 78      6
Male      X                 83      7
Female    Y                 40      1
Female    Y                 65      4
Female    Y                 70      5
Female    Y                 88      8
Female    Y                 92      9

$n = 4$, $m = 5$.

$$T_1 = \sum_{i=1}^{4} R(X_i) = 2 + 3 + 6 + 7 = 18$$

and

$$T_2 = \sum_{j=1}^{5} R(Y_j) = 1 + 4 + 5 + 8 + 9 = 27$$

The achievements of Male and Female students are not significantly different.
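The ranking step of Example 4 can be sketched as follows. The variable names are illustrative, and the simple dictionary ranking assumes there are no tied marks, which holds for these data. Since the ranks 1 to 9 add up to 45, the two rank sums must be 18 and 27.

```python
male = [60, 62, 78, 83]          # sample X, n = 4
female = [40, 65, 70, 88, 92]    # sample Y, m = 5

# Rank the combined sample from smallest (rank 1) to largest (rank n + m).
combined = sorted(male + female)
rank_of = {value: position + 1 for position, value in enumerate(combined)}

T1 = sum(rank_of[x] for x in male)
T2 = sum(rank_of[y] for y in female)

N = len(combined)
# The ranks 1..N always add up to N(N + 1)/2, a quick check on the arithmetic.
assert T1 + T2 == N * (N + 1) // 2
print("T1 =", T1, "T2 =", T2)
```

With tied observations the dictionary approach would no longer be valid, and average ranks would be needed instead.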

Task 5

The petrol consumption (in km/liter of petrol) of several Proton Wira 1.5 cars for two brands of petrol, Petrobus and Procat, is shown below:

11.9, 12.5, 10.5, 10.4,

10.8, 8.9, 10.0, 9.5,

11.2

13.0, 10.7

Can we conclude that both brands of petrol give equal mileage at significance level $\alpha = 0.05$?

[Ans: $19 < T_1 = 35 < 41$, fail to reject $H_0$ and conclude that both brands of petrol give the same mileage.]

Task 6

The following data represent the number of hours that two different digital cameras operate before a recharge is required.

Camera A   5.2   5.4   6.2   6.5   6.3   5.8   6.2   5.4   5.8
Camera B   5.8   6.1   6.2   6.2   6.6   6.8   5.9   5.8   6.3

Use the Mann-Whitney test with $\alpha = 0.1$ to determine if camera A operates longer than camera B on a full battery charge.

[Ans: $T_1 = 70.5 < 100$, fail to reject $H_0$ and conclude that there is no significant evidence from the data, at $\alpha = 0.1$, that Camera A operates longer than Camera B on a full battery charge.]

The Wilcoxon signed-rank test for two dependent samples or paired samples is used to test

whether two populations from which these samples are drawn are identical. For example, we

might want to test whether the weight of persons before and after going through a diet

program is the same or not. Each person will have two weight measurements; before and after

going through the diet program. So we have one sample for the weight before going through

the diet program and one sample for the weight after going through the diet program. Since

the two measurements come from the same person, the samples are dependent which is also

known as paired samples. To understand this technique, we start with an example.

Example 5

Consider the following data, which record the weight (in kg) of 8 students before and after going through a diet program intended to reduce their weight.

Subject       A    B    C    D    E    F    G    H
Before (Y)    70   75   68   60   73   80   65   63
After (X)     62   70   58   61   61   60   54   66

First we compute the differences $d_i = y_i - x_i$. Then we rank the $d_i$, ignoring the negative sign (if any). This means we rank the absolute values $|d_i|$. Let these ranks be denoted by R. Next, we give each rank the sign of the corresponding $d_i$. Let these signed ranks be denoted by $R(d_i)$. So we would have

Subject             A    B    C    D    E    F    G    H
Before (Y)          70   75   68   60   73   80   65   63
After (X)           62   70   58   61   61   60   54   66
d_i = y_i - x_i     8    5    10   -1   12   20   11   -3
R                   4    3    5    1    7    8    6    2
R(d_i)              4    3    5    -1   7    8    6    -2

1. The distribution of $R(d_i)$ is symmetric.

2. The $R(d_i)$ are mutually independent.

3. The $R(d_i)$ have the same median.

$H_0$: The weight before and after is the same

Let $R^+(d_i)$ denote the $R(d_i)$ which are positive and $R^-(d_i)$ denote the $R(d_i)$ which are negative. The logic is, if both the populations of weight before and after are the same, then we can anticipate

$$T^+ = \sum R^+(d_i) \approx T^- = \sum \left| R^-(d_i) \right|$$

Since $R(d_i)$ is assumed to be symmetric, the mean of $R(d_i)$ is 0 and the hypotheses can be stated in terms of the median:

$H_0$: median of $R(d_i) = 0$

$H_1$: median of $R(d_i) \ne 0$

We can have the usual one-tailed test as

$H_0$: median of $R(d) = 0$

$H_1$: median of $R(d) > 0$

or

$H_0$: median of $R(d) = 0$

$H_1$: median of $R(d) < 0$

and the two-tailed test

$H_0$: median of $R(d) = 0$

$H_1$: median of $R(d) \ne 0$

rejection rule makes it simpler for us, as we would only need to consider the lower of $T^+$ and $T^-$ in our sample. For larger $n$ it is a tedious job to construct the probability distribution of $R(d)$. Table (Hisyam Lee's table) lists the critical points for accepting or rejecting $H_0$ for various values of $\alpha$.

Going back to the before-after weight example, we see that $T^+ = 33$ and $T^- = 3$. At significance level $\alpha = 0.05$, Table (Hisyam Lee's table) gives the critical point with $n = 8$ as 3. This means that we would reject $H_0$ if $T^+ \le 3$ or $T^- \le 3$. Since the lower of the two values is $T^- = 3$, which is exactly the same as the critical value 3, we reject $H_0$ and accept $H_1$. Thus we conclude that there is evidence that the weight before and after going through the diet program is not equal.
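The signed-rank computation for the before-after weight example can be sketched as below. The helper `average_ranks` is hypothetical; it gives tied absolute differences the average of the ranks they occupy, and dropping zero differences is an assumption of this sketch (neither case arises in these data).

```python
before = [70, 75, 68, 60, 73, 80, 65, 63]   # Y
after = [62, 70, 58, 61, 61, 60, 54, 66]    # X


def average_ranks(values):
    """Rank values from smallest to largest, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


# Differences d_i = y_i - x_i, with zero differences dropped (assumption).
d = [y - x for y, x in zip(before, after) if y != x]
ranks = average_ranks([abs(v) for v in d])

T_plus = sum(r for v, r in zip(d, ranks) if v > 0)
T_minus = sum(r for v, r in zip(d, ranks) if v < 0)
print("d =", d)
print("T+ =", T_plus, "T- =", T_minus)
```

The smaller of the two sums is the value compared with the tabulated critical point.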

Table below summarizes the various test procedures for both one-tailed and two tailed

test:

Task 7

their hand-insert ability speed after attending a course. The following table gives the hand-insert ability speed of 8 operators before and after they attended the course:

Before 74 65 78 81 55 61 80

After 62 83 100 68 59 105 66 87 65

Using the 2.5% significance level, can we conclude that attending the course increases the hand-insert ability speed of the operators?

[Ans: Since T = 25.5 < 33, we fail to reject $H_0$ and conclude that the course does not increase the operators' hand-insert ability speed.]

Task 8

The following data give the number of industrial accidents in ten manufacturing plants for one-month periods before and after an intensive promotion on safety:

Plant    1   2   3   4   5   6   7   8   9   10
Before   3   4   3   6   8   4   5   6   7   8
After    2   3   1   3   4   1   4   5   6   4

Do the data support the claim that the campaign was successful in reducing accidents? Use $\alpha = 0.05$.

[Ans: Since T = 55 > 44, we reject $H_0$ and conclude that the campaign was successful in reducing accidents at $\alpha = 0.05$.]

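In a Wilcoxon signed-rank test for two dependent samples, when the sample size is large ($n \ge 15$) the statistics $T^+$ and $T^-$ are approximately Normal with mean

$$\mu_T = \frac{n(n + 1)}{4}$$

and variance

$$\sigma_T^2 = \frac{n(n + 1)(2n + 1)}{24},$$

written as

$$T \sim N\left( \frac{n(n + 1)}{4},\ \frac{n(n + 1)(2n + 1)}{24} \right)$$

Thus,

$$Z = \frac{T - \dfrac{n(n + 1)}{4}}{\sqrt{\dfrac{n(n + 1)(2n + 1)}{24}}} \sim N(0, 1)$$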
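As a rough sketch of this large-sample result, the statistic can be standardized with mean $n(n+1)/4$ and variance $n(n+1)(2n+1)/24$. The function name `signed_rank_z` and the inputs $T = 50$, $n = 20$ are hypothetical values chosen only to show the arithmetic.

```python
import math


def signed_rank_z(T, n):
    """Standardize a Wilcoxon signed-rank sum T for n non-zero differences."""
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24
    return (T - mean) / math.sqrt(variance)


# Hypothetical example: n = 20 pairs with a rank sum of T = 50.
z = signed_rank_z(T=50, n=20)
print(f"z = {z:.4f}")
```

The resulting z is compared with the standard Normal critical value for the chosen significance level, one-tailed or two-tailed as the hypotheses require.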

8.5 Measure of Association

8.5.1

We have seen that the correlation coefficient $r$ measures the linear relationship between two continuous variables X and Y.

A measure of correlation for ranked data, based on the definition of the Pearson correlation coefficient and used where there are no ties or few ties, is called the Spearman rank correlation coefficient. It is denoted by $r_s$ and is given by

$$r_s = 1 - \frac{6T}{n(n^2 - 1)}$$

where

$$T = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} \left[ R(X_i) - R(Y_i) \right]^2$$

and

- $R(X_i)$ is the rank assigned to $x_i$.

- $R(Y_i)$ is the rank assigned to $y_i$.

- $d_i$ is the difference between the ranks assigned to $x_i$ and $y_i$.

- $n$ is the number of pairs of data.

Usually the value of $r_s$ is close to the value obtained by finding $r$ based on numerical measurements. The interpretation of $r_s$ is similar to the interpretation of $r$, in which a value of +1 or -1 indicates perfect association between X and Y. The plus sign indicates identical rankings and the minus sign indicates reverse rankings. When $r_s$ is zero or close to zero, we would conclude that the variables are uncorrelated.

Some advantages in using rs rather than r are:

1.

when the data possess a distinct curvilinear relationship, the rank correlation coefficient will likely be more reliable than the conventional measure $r$.

3. A meaningful numerical measurement for $r$ is not possible, such as when dealing with ordinal data, but rankings can nevertheless be established.

Example 6

The data below show the effect of the mole ratio of sebacic acid on the intrinsic viscosity of copolyesters.

Mole ratio   1.0    0.9    0.8    0.7    0.6    0.5    0.4    0.3
Viscosity    0.45   0.20   0.34   0.58   0.70   0.57   0.55   0.44

Find the Spearman rank correlation coefficient to measure the relationship between the mole ratio of sebacic acid and the viscosity of copolyesters.

Solution

Let X and Y represent the mole ratio of sebacic acid and the viscosity of copolyesters, respectively. First we assign ranks to each set of measurements. The rank of 1 is assigned to the lowest number in each set, the rank of 2 to the second lowest number in each set, and so forth, until the rank of 8 is assigned to the largest number. The table below shows the individual rankings of the measurements and the differences in ranks for the 8 pairs of observations.

Mole ratio   Viscosity   R(x_i)   R(y_i)   d_i   d_i^2
1.0          0.45        8        4        4     16
0.9          0.20        7        1        6     36
0.8          0.34        6        2        4     16
0.7          0.58        5        7        -2    4
0.6          0.70        4        8        -4    16
0.5          0.57        3        6        -3    9
0.4          0.55        2        5        -3    9
0.3          0.44        1        3        -2    4
                                                 T = 110

Thus,

$$r_s = 1 - \frac{6T}{n(n^2 - 1)} = 1 - \frac{6(110)}{8(64 - 1)} = -0.3095$$

which shows a weak negative correlation between the mole ratio of sebacic acid and the viscosity of copolyesters.
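The calculation in Example 6 can be reproduced with a short sketch. The function names are hypothetical, and the simple ranking helper assumes there are no tied values, which is true for these data.

```python
mole_ratio = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
viscosity = [0.45, 0.20, 0.34, 0.58, 0.70, 0.57, 0.55, 0.44]


def simple_ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman rank correlation r_s = 1 - 6T / (n(n^2 - 1))."""
    rank_x = simple_ranks(x)
    rank_y = simple_ranks(y)
    n = len(x)
    T = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * T / (n * (n ** 2 - 1)), T


rs, T = spearman_rs(mole_ratio, viscosity)
print("T =", T, "r_s =", round(rs, 4))
```

The same function can be applied directly to data that are already ranks, since ranking a set of distinct ranks leaves them unchanged.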

Example 7

The following data were collected and ranked during an experiment to determine the change in thrust efficiency, y, as the divergence angle of a rocket nozzle, x, changes:

Rank X   1   2   3   4   5   6   7   8   9    10
Rank Y   2   3   1   5   7   9   4   6   10   8

Find the Spearman rank correlation coefficient to measure the relationship between the divergence angle of a rocket nozzle and the change in thrust efficiency.

Solution

R(x_i)   R(y_i)   d_i = R(x_i) - R(y_i)   d_i^2
1        2        -1                      1
2        3        -1                      1
3        1        2                       4
4        5        -1                      1
5        7        -2                      4
6        9        -3                      9
7        4        3                       9
8        6        2                       4
9        10       -1                      1
10       8        2                       4
                                          T = 38

$$r_s = 1 - \frac{6T}{n(n^2 - 1)} = 1 - \frac{6(38)}{10(100 - 1)} = 0.7697$$

indicating a high positive correlation between the divergence angle of a rocket nozzle and the change in thrust efficiency.

Task 9

The grams of solids removed from a material (y) is thought to be related to the drying time (x). Ten observations obtained from an experimental study follow.

Drying time      2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0   6.5   7.0
Solids removed   4.3   1.5   1.8   4.9   4.2   4.8   5.8   6.2   7.0   7.9

Calculate the Spearman rank correlation coefficient to measure the relationship between the grams of solids removed from a material and the drying time.

[$r_s = 0.8788$]

Task 10

Two persons rank their preferences on 8 brands of automobile due to the rise of the price of petrol. The ranks are in the following order:

Brands     1   2   3   4   5   6   7   8
Person A   5   8   4   3   6   2   7   1
Person B   7   5   4   2   8   1   6   3

Calculate the Spearman rank correlation coefficient to measure the relationship between the rankings of the two persons.

[$r_s = 0.7143$]

Exercise 8

1. Briefly explain the meaning of categorical data and give two examples.

2. When does a statistical method become a non-parametric statistic?

3. At a college there are two cafeterias A and B where the students usually have their meals. A random sample of 12 students is taken; 5 of them prefer cafeteria A and the rest prefer cafeteria B. At the 5% significance level, can we conclude that the students at this college have equal preference for the two cafeterias?

4. Eight students went on a diet in an attempt to lose weight, with the following results:

Name                 Abu   Ali   Chen   Rama   Subra   Lim   Tan   Amin
Weight Before (kg)   78    86    69     83     78      74    80    90
Weight After (kg)    66    87    64     80     73      65    75    87

Use the sign test to test whether the diet is an effective means of losing weight at significance level $\alpha = 0.05$. Now use the Wilcoxon signed-rank test to test the same hypothesis at the same significance level.

5. In a library, there are two popular reading sections A and B where students normally do their favourite readings. A random sample of 14 students is taken and their preferences are shown below:

A B A A B A A A A B A B

At the 10% significance level, can we conclude that the students have equal preference for the two library reading sections?

6. Through the years the achievement award given to staff in a department has the following order according to gender:

M M M M F F M M F M F

where M represents Male and F represents Female. Is the award given according to gender a random event at significance level $\alpha = 0.05$?

7. In a study to determine whether accidents occur at random or not, the following data were gathered for 15 consecutive days:

+ + - + + + - + + - + - - - +

where + indicates the number of accidents for that day is above average and - indicates the number of accidents for that day is below average. Test the hypothesis at significance level $\alpha = 0.05$.

8. The following data give the cholesterol levels for seven adults before and after they completed a special dietary plan:

Before   210   180   195   220   231   199   224
After

Use the sign test at the 5% significance level to test whether the level of cholesterol is the same before and after completing the special dietary plan. Use the Wilcoxon signed-rank test at the 5% significance level to test whether the level of cholesterol is the same before and after completing the special dietary plan. Draw your conclusion.

9. The following table gives the recorded grades for 10 engineering students on carry marks and final examination in an Engineering Statistics course:

Student   Carry Marks   Final Examination
Ali       48            47
Bidin     46            45
Chua      38            42
Didi      43            40
Emily     36            38
Farouk    49            49
Gina      44            44
Hasan     42            46
Intan     34            37
Joe       40            34

Calculate the Spearman rank correlation coefficient to measure the relationship between carry marks and final examination.

10. Two panels test 12 brands of computer chips for overall quality. The ranks assigned by the panels are as follows:

Brand     A    B   C   D   E   F   G   H   I    J   K   L
Panel 1   10   6   1   7   3   8   2   5   9    4   7   9
Panel 2   9    3   4   5   6   7   8   2   10   1   8   6

Calculate the Spearman rank correlation coefficient to measure the relationship between the results given by panel 1 and panel 2.

11. An engineer wants to investigate the relationship between the fretting wear of mild steel and oil viscosity. Representative data follow, with x = oil viscosity and y = wear volume.

x   1.6   9.4   15.5   20.0   22.0   35.5   43.0   40.5   33.0
y   240   181   193    155    172    110    113    75     94

Calculate the Spearman rank correlation coefficient to measure the relationship between the fretting wear of mild steel and oil viscosity.

Answers

Answers to Self-Review Quiz

Question   Part A   Part B
1          b        FALSE
2          a        FALSE
3          d        TRUE
4          b        FALSE
5          b        TRUE
6          b        FALSE
7          c        FALSE
8          c        FALSE
9          b        TRUE
10         d        FALSE

Answers to Exercise 1

1.

(a) Constant

(b) Constant

(c) Variable, quantitative, continuous

(d) Variable, qualitative, nominal

(e) Variable, quantitative, interval-scaled

(f ) Constant

(g) Variable, quantitative, continuous

(i) Variable, quantitative, discrete

(j) Variable, quantitative, continuous

(k) Variable, qualitative, ordinal

2.

(a) 0.3595

(b) 0.5033

(c) 0.4278

(d) 0.4167

(e) 0.4396

(a) Straightforward

5

x ; 0 x 1

8

1 x2

;1 x 2

2 8

1 ; elsewhere

(b) f x

(c) 0.4688

4. 0.5438

5. (a) 0.0729

(b) 0.3359

(c) 0.4703

6.

(a) 0.5328

(b) 0.3372

(c) 0.0675

A possible comment: the means are the same for population and sample data, but

larger dispersion is observed if the data were sample data.

8.

(a) 0.9305

(b) 0.8385

(c) 0.2924

(d) 1

0.7642

9.

(a) 0

b) 5.1984 10 4

(c) RM2082.245

Answers to Exercise 2

1. a. 0.0060

b. 0.9706

2. a. 0.8962

b. 0.0001

3. a. 0.7757

b. 0.6129

4. 0.9808

5. a. 0.9993

b. 1.0000

6. a. 0.8997

b. 0.8020

7. 0.6772

8. a. 1.0000

b. i. 0.9998

ii. 0.0002

9. a. 0.4840

b. 0.0344

c. 0.0045

10. a. 0.9842

b. 0.9684

c. 0.9911

11. a. 0.0401

b. 0.5490

c. 0.9599

12. a. 0.9803

b. 0.4681

c. 0.6156

13. a. 0.2912

b. 1.0000

c. 1.0000

14. a. 0.6628

b. 0.0869

c. 0.7230

15. a. 0.3669

b. 0.8725

c. 0.5000

16. a. 0.5000

b. 0.3192

c. 0.5948

17. a. 0.4682

b. 0.6293

c. 0.9505

18. a. 0.4052

b. 0.7265

c. 0.5000

Answers to Exercise 3

1. The observed interval contains the true value of .

2. Shorter

3. Yes, because we are making use of the sample information to infer the population

parameter.

4. a. 102.5

b. (98.944, 106.056)

5. a.

6. (0.4645, 0.5555) liter

7. (9.1, 10.7) micrometer

8. a. (0.505441, 0.507519) cm

b. (0.504637, 0.508323) cm

as it is impractical to know the variance of normal population without knowing its mean.

10. (1.061, 0.460); the observed interval contains the true value of mean difference with 90% level of confidence, No.

11. (0.0107, 0.0493)

12. (0.0048, 0.0202)

13. a. 0.09 b. (0.0751, 0.1049)

c. 0.01

d.(0.0048,0.0152)

e.0.003;(0.0004,0.00639)

15. a. (314.033, 430.301); sample was drawn from a normal population, is unknown, and n

is small. b. (346.917, 418.417)

d. 70.6692; (1945.958, 30048.782) RM2

Answers to Exercise 4

1. z test = 2.3717; reject H 0 .

2. z test = 2.044; reject H 0 .

3. t test = 0.5167; fail to reject H 0 .

4. t test = 2.821; reject H 0 .

5. Fail to reject H 0 .

6. z test = 6.1546; reject H 0 .

c. RM(85.96, 64.96)

e. 43.459 ; RM(27.13, 29.78) f. (0.5236, 13.353)

8. z test = 4.0216; reject H 0 .

9. a. Fail to reject H 0 b. Fail to reject H 0 .

10. z test = 6.3640; reject H 0 .

11. z test = 1.4084; fail to reject H 0 .

12. z test = 1.4084; reject H 0 .

13. a. Fail to reject H 0

b. Fail to reject H 0 .

Answers to Exercise 5

1. k = 7, then $\nu = 6$; $\chi^2_{test} = 5.6807 < 12.592$; fail to reject $H_0$.

2. k = 6, p = 1 (estimated parameter 3.47), then $\nu = 4$; $\chi^2_{test} = 3.682 < 13.277$; fail to reject $H_0$.

3. k = 8, then $\nu = 7$; $\chi^2_{test} = 0.6333 < 14.067$; fail to reject $H_0$.

4. k = 4, then $\nu = 3$; $\chi^2_{test} = 40.692 > 7.815$; reject $H_0$.

5. k = 3, then $\nu = 2$; $\chi^2_{test} = 0.2448 < 9.21$; fail to reject $H_0$.

6. Independence test: $\nu = 1$; $\chi^2_{test} = 33.33 > 3.841$ (without Yates correction); reject $H_0$;

7. Independence test: $\nu = 2$; $\chi^2_{test} = 4.7179 < 9.21$; fail to reject $H_0$; Level of pains and type

8. Independence test: $\nu = 4$; $\chi^2_{test} = 13.3808 > 13.277$; reject $H_0$; Length and diameter are

9. Homogeneity test: $\nu = 2$; $\chi^2_{test}$

defective components are NOT the same, i.e. they are significantly not homogeneous at $\alpha = 0.05$.

10. Homogeneity test: $\nu = 2$; $\chi^2_{test} = 36.6753 > 5.991$; reject $H_0$; The proportions of output components for shift 1 are significantly not the same for all 3 machines.

Answers to Exercise 6

1. $f_{calc} = 4.9471 > f_{0.05, 2, 21} = 3.47$

(b) No

3. $f_{calc} = 5982.001 > f_{0.01, 2, 15} = 6.36$; Means are significantly different.

4. $f_{calc} = 29.7986 > f_{0.05, 3, 20} = 3.10$; Season has a significant impact on oxygen variability.

5. $f_{calc} = 2.1656 < f_{0.02, 3, 16} = 4.08$; The mean tensile strengths do not differ significantly.

6. (a) 6, 5, 4 and 6 respectively.

(b) P-value = 0.1827 > 0.05; Different concentrations do not affect the plant growth.

7. P-value = 0.00143 < 0.05; The mean lifetimes are significantly different.

Answers to Exercise 7

1. a.

0.6623 , 1.1256

2. a.

143.731 , 15.202

3. a.

0.2757 , 0.0255

4. a. 5.3066

c. Reject H 0

b. 3.98

b. 37.317

c. Reject H 0

d. 0.9939

d. - 0.9859

b. Reject H 0 c.0.9387

d. 0.9502

c. Accept H 0

b. 3.85

d. Accept H 0

5. a.

5.6 , 0.07

6. a.

2.8144 , 2.8622

b. 306.2076

c. Reject H 0

d. 0.8742

7. a.

0.0016 , 0.0415

b. 7.8866

c. Reject H 0

d. 0.9901

b. RM36.3 hundreds, or RM3630

c. Yes because Significance F < 0.05, or P-value for CGPA coefficient < 0.05

9. a. Life = 8.32975 - 0.085775 Speed

b. 3.6121 hours

c. Yes because Significance F < 0.01, or P-value for Speed coefficient < 0.01

d. r = Multiple R = 0.9339; very strong positive linear correlation between Useful Life and Speed.

b. implies that a unit increase in power would lead to about 2.3794 units increase in Maximum Speed.

c. 169.5146 km/h

d. Yes because Significance F < 0.01, or P-value for Power coefficient < 0.01

e. r = Multiple R = 0.7426; moderately strong positive linear correlation between

Maximum Speed and Power.

Answers to Exercise 8

3. $H_0: p = 0.5$ vs $H_1: p \ne 0.5$; $P(X \le 5) = 0.3872 > 0.025$ or $P(X \ge 5) = 0.8062 > 0.025$; fail to reject $H_0$; the students at this college have equal preference for the two cafeterias.

4.

program is effective.

$T^- = 1 < 5$ or $T^+ = 35 > 31$; reject $H_0$; the diet program is effective.

fail to reject H 0 ; the students have equal preference of the two library reading sections.

6. $3 < R = 6 < 11$; the award given according to gender is a random event.

8. (Sign test) $H_0: p = 0.5$ vs $H_1: p \ne 0.5$; $P(X \le 4) = 0.7734 > 0.025$ or $P(X \ge 4) = 0.5 > 0.025$;

$T^- = 6.5 > 2$ or $2 < T^+ = 21.5 < 26$; fail to reject $H_0$; the level of cholesterol is the same before and after completing the special dietary plan.

9. $r_s = 0.8182$; a strong positive correlation between carry marks and final exam scores.

10. $r_s = 0.6573$; a moderately strong positive correlation between results given by panel 1 and panel 2.

11. $r_s = -0.85$; a strong negative correlation between the fretting wear of mild steel and oil viscosity.

References

Lee, M. H. (2004). Statistical Tables and Formulae for Science and Engineering.

Skudai: UTM.

Montgomery, D. C. & Runger, G. C. (2006). Applied Statistics and Probability

for Engineers, 4th Ed. USA: John Wiley & Sons.

Montgomery, D. C., Runger, G. C. & Hubele, N. F. (2003). Engineering Statistics. USA: John Wiley & Sons.
