
Asshar

edon.
.
.

Sheldon and Juliet

Wednesday 2-3pm

Red Centre M010

QMB PASS Week 3 2010


Topics to be covered:
1. Introduction to Statistics
2. Descriptive techniques: Tabular, Graphical, Numerical
3. Introduction to Linear Regression
1. INTRO
Q: Can you explain the difference between the following pairs of terms?
a) descriptive statistics vs. inferential statistics?
b) population vs. sample?
c) parameter vs. statistic?
d) discrete vs. continuous random variable?
2. DESCRIPTIVE TECHNIQUES
a) Tabular
A variable is some characteristic of a population or sample.
o Values are possible observations of the variable.
o Data is the word we use for the actual observed values of a variable.
We can classify data into 3 different categories:
o i) interval/quantitative/numerical
o ii) nominal/qualitative/categorical
o iii) ordinal
For nominal data, we can draw up a table to describe the data with a column each for:
o a) class (one of a set of mutually exclusive categories into which the data are grouped)
o b) frequency (the number of observations that fall into each class)
o c) relative frequency (the number of observations in a class as a percentage of
the total number of observations)
b) Graphical
If we were to describe nominal data, then we would use either a bar chart (for
frequencies) or a pie chart (for relative frequencies).
But where we have interval data, we can consider using a histogram, the steps being:
o 1. Collect interval data.
o 2. Create classes / class limits for the data, for which you could consider using
Sturges' formula or the class width formula.
o 3. Plot the data on a graph with frequency on the Y axis.
It is important to be able to describe the shapes of histograms, a skill often tested in
tutorials and final exams. Let's look at that now.
Describing the shapes of histograms
a) Symmetry
o Your data may be symmetrical or non-symmetrical. Use common sense.

b) Skewness
Positive skew:
o Tail to the right
o Mode < Median < Mean
Negative skew:
o Tail to the left
o Mean < Median < Mode

c) Modal classes
o The modal class is the class with the largest number of observations.
o The 3 descriptions which could come in handy in describing your histogram are:
i) unimodal histogram = a histogram with one peak.
ii) bimodal histogram = a histogram with two peaks.
iii) bell-shaped histogram = a symmetric unimodal histogram.

Apart from the histogram, you should also be familiar with stem and leaf displays and
ogives.

c) A separate note on bivariate relations


Bivariate relations are an extension of univariate analyses to characterise relationships
between variables. You could represent them graphically using a scatter plot (or, where one
variable is time, a time series plot), or in tabular form using a contingency table.
d) Numerical
We can measure interval data in terms of central location:
o i) the mean / arithmetic mean = the average of the scores.
o ii) the median = the middle observation once the observations have been ordered.
o iii) the mode is the observation that occurs with the greatest frequency.
Question time...
1. Oh no! How do we find the median if we have an even number of observations?
2. What are the advantages and disadvantages of each method of measuring central
location?

We can also measure interval data with respect to variability (the spread of the data):
o i) range = largest observation − smallest observation
o ii) variance:
population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
o iii) standard deviation: simply take the square root of the variance.
Also recall the Empirical Rule and Chebysheff's Theorem when required to
interpret the standard deviation of your data.
o iv) coefficient of variation = standard deviation / mean
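As a rough check on these measures of variability, here is a minimal Python sketch. It uses the monthly pay figures from question 2 of the question bank; the variable names are illustrative only, and the standard library `statistics` module is assumed to be an acceptable stand-in for doing the sums by hand.

```python
import math
import statistics

# Monthly pay figures taken from question 2 of the question bank.
pay = [23_000, 36_500, 47_200, 20_200, 61_300]

mean = statistics.mean(pay)
data_range = max(pay) - min(pay)            # largest - smallest observation
sample_var = statistics.variance(pay)       # divides by n - 1
population_var = statistics.pvariance(pay)  # divides by N
sample_sd = math.sqrt(sample_var)           # square root of the variance
cv = sample_sd / mean                       # coefficient of variation

print(f"mean = {mean}")
print(f"range = {data_range}")
print(f"sample variance = {sample_var}")
print(f"sample standard deviation = {sample_sd:.2f}")
print(f"coefficient of variation = {cv:.4f}")
```

Note that the sample variance divides by n − 1 while the population variance divides by N, so the two differ noticeably for a sample this small.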

We can also measure with respect to relative standing.


o i) percentiles: the Pth percentile is the value for which:
P% of the observations are less than that value, and
(100 − P)% of the observations are greater than that value
location of a percentile, $L_P$:

$L_P = (n + 1) \frac{P}{100}$

o ii) as an extension of percentile theory, we can measure the interquartile range:
interquartile range = 75th percentile − 25th percentile = upper quartile − lower quartile
this measures the spread of the middle 50% of observations
At this point, make sure you've checked out what box plots and outliers are.
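The percentile location rule, location = (n + 1) × P / 100, can be sketched in Python as below. This is only one convention for percentiles (other textbooks and software interpolate differently), and the function name `percentile` is a hypothetical helper, not something from the lecture notes. The data are the ten numbers from question 1 of the question bank.

```python
def percentile(data, p):
    """Return the p-th percentile using the location rule L = (n + 1) * p / 100."""
    ordered = sorted(data)
    n = len(ordered)
    location = (n + 1) * p / 100      # 1-based position in the ordered data
    whole = int(location)             # position of the observation just below
    fraction = location - whole       # how far to move towards the next one
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    below = ordered[whole - 1]
    above = ordered[whole]
    return below + fraction * (above - below)


numbers = [2, 3, 3, 6, 8, 9, 14, 16, 17, 20]
lower_quartile = percentile(numbers, 25)
median = percentile(numbers, 50)
upper_quartile = percentile(numbers, 75)
iqr = upper_quartile - lower_quartile

print(lower_quartile, median, upper_quartile, iqr)
```

When the location is not a whole number, the sketch interpolates between the two neighbouring observations, which is how the 16.25 for the upper quartile arises.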

3. INTRODUCTION TO LINEAR REGRESSION


Recall from your lecture, or from earlier in this PASS class, the discussion of bivariate relations. In
particular, think of scatter plots and the types of relationships between the variables we
spot when we look at the plot. If we want to determine the intercept and slope of a
relationship between the X and Y axis variables, we need values that will give us the line of
best fit.
To do so, we need to minimise the residual sum of squares by utilising the least
squares method.
Much more on this and linear regression generally in the second half of the course, but
for the moment, take note of these equations and formulas:
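1. assume the regression equation: $\hat{y} = b_0 + b_1 x$
2. then: $b_1 = \frac{s_{xy}}{s_x^2}$ and $b_0 = \bar{y} - b_1 \bar{x}$
where $s_{xy}$ is the sample covariance of X and Y and $s_x^2$ is the sample variance of X.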
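To see the least squares method in action, here is a small Python sketch using the standard simple-regression results (slope = sample covariance of X and Y divided by sample variance of X, intercept = mean of Y minus slope times mean of X). The five data points are made up purely for illustration, and `least_squares` is a hypothetical helper name.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance of X and Y, and sample variance of X (both divide by n - 1).
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_xx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, chosen only so the arithmetic is easy to follow by hand.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]

b0, b1 = least_squares(x_values, y_values)
print(f"line of best fit: y = {b0:.2f} + {b1:.2f}x")
```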

QUESTION BANK
1. Using these numbers: 2 3 3 6 8 9 14 16 17 20, find the:
a. mean
b. median (8.5)
c. mode
d. lower quartile (3)
e. upper quartile (16.25)
f. interquartile range
2. You're an investment banker and work 22.5 hours a day. Your monthly pay in recent
months has looked like this cuz you're a money machine:
$23,000 $36,500 $47,200 $20,200 $61,300
a. What's the sample mean? ($37,640)
b. How about sample variance? (292,743,000 dollars squared)
c. And sample standard deviation? ($17,109.73)
3. A set of test scores has a mean of 890 and standard deviation of 120. What's the
coefficient of variation?
4. Check out these test scores: 88 76 67 90 98 68 75 86 82 90. Calculate:
a. sample mean
b. sample standard deviation
c. coefficient of variation

BES PASS Week 4


Aims:

To learn about Data Collection and Random Sampling


To understand Joint, Marginal and Conditional Probability
To learn the probability rules and apply them to sampling with/without replacement

1. Methods of Data Collection


Remember that data are the observed values of a variable, and that a variable is just something that is of
interest to us. We can use the following methods of data collection to observe these variables.

Direct Observation: measures the actual behaviour or outcomes
o E.g. Asking people whether they've bought a product because of an advertisement
Experimental Data: imposes a treatment and measures the resulting behaviour or outcomes
o E.g. Asking people to try aspirin and seeing whether they suffer fewer heart attacks or not
Surveys:
o Self-administered surveys: surveys sent to people, who then mail back their responses
o Personal interviews
o Telephone interviews

Q. What do you think are the pros and cons for each of the methods of data collection?
Hint: Think of the costs, response rate, purpose and biases that may arise
2. Random Sampling
The primary incentive for examining a sample rather than a population is cost. Compiling statistics is usually
expensive: imagine conducting an experiment on 10,000 people, asking them to take an aspirin every day
for 3 weeks and then bringing them all back for testing!
Main Concept: We can make inferences about the target population from a sample if the sample statistic
comes quite close to the parameter it is designed to estimate
There are 3 different types of sampling plans:
Simple Random Sample: a sample selected in such a way that every possible sample with the same
number of observations is equally likely to be chosen
o E.g. Drawing ticket stubs in a raffle to determine the winner
Stratified Random Sample: obtained by separating the population into strata and then drawing simple random
samples from each stratum
Cluster Sample: a simple random sample of groups or clusters of elements
From these samples of observations, two main types of error arise:
1. Sampling error: the difference between the sample and the population that exists only because of
the observations that happened to be selected for the sample
2. Non-sampling error: more serious than sampling error, and due to mistakes made in the acquisition
of data or to sample observations being selected improperly

Q. Discuss with the person next to you, examples of non-sampling errors.


3. Probability
Questions to think about...
a) Independence v Mutually Exclusive
b) Joint v Marginal Probability
c) Intersection v Union

Conditional probability is the probability of an event A occurring, given that another event B has occurred.
It is represented by:

P(A | B)
which is read as "Given that B has occurred, what is the probability of A occurring?"
Expanding this we get...
$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$

One of the reasons we compute conditional probability is to find whether two events are related, i.e. we
want to know whether they are independent events.
If they are independent, the probability of one event is not affected by the occurrence of the other event:

P(A | B) = P(A)
P(B | A) = P(B)
4. Other Rules
The Multiplication Rule: is used to calculate the joint probability of two events. It is obtained by taking the
conditional probability formula and multiplying both sides by P(B)
i.e.

P(A and B) = P(B).P(A|B)


For Independent events,

P(A and B) = P(A).P(B) since P(A|B) = P(A)

The Complement Rule:


The complement of event A (denoted $A^C$) is the event that occurs when event A does NOT occur
i.e.

$P(A^C) = 1 - P(A)$

The Addition Rule: allows us to calculate the union of two events


The probability that event A, OR event B, OR both occur is:

P(A or B) = P(A) + P(B) − P(A and B)


For Mutually exclusive events,

P(A or B) = P(A) + P(B)


5. Sampling with or without replacement
If we were to sample from a finite (limited-size) population, we could:
a) Select without replacement: each time you select an observation you remove it from the pile, so the
outcome of each selection will depend on the outcomes of previous selections because the size
of the population is getting smaller each time.
b) Select with replacement: each time you select an observation you place it back into the pile, so
effectively the population size stays the same and the outcomes of each selection
will be independent of one another.

GROUP EXERCISE 1
                    Male    Female    Row Total
High Distinction      75        61          136
Pass                 215       155          370
Column Total         290       216          506

1. P(Female)
2. P(High Dist)
3. P(Female ∪ High Dist)
4. P(Pass)
5. P(Pass | Male)
6. Which ones of the above are Marginal Probabilities?
7. Which ones are Joint Probabilities?
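If you want to check your working for Group Exercise 1, the following Python sketch computes marginal, joint, union and conditional probabilities directly from the table counts. The dictionary layout and helper names are illustrative assumptions; exact fractions are used so nothing is lost to rounding.

```python
from fractions import Fraction

# Counts from the Group Exercise 1 table, keyed by (grade, sex).
counts = {
    ("High Distinction", "Male"): 75,
    ("High Distinction", "Female"): 61,
    ("Pass", "Male"): 215,
    ("Pass", "Female"): 155,
}
total = sum(counts.values())


def joint(grade, sex):
    """P(grade and sex): one cell divided by the grand total."""
    return Fraction(counts[(grade, sex)], total)


def marginal_grade(grade):
    """P(grade): a row total divided by the grand total."""
    return sum(joint(g, s) for (g, s) in counts if g == grade)


def marginal_sex(sex):
    """P(sex): a column total divided by the grand total."""
    return sum(joint(g, s) for (g, s) in counts if s == sex)


p_female = marginal_sex("Female")
p_hd = marginal_grade("High Distinction")
p_pass = marginal_grade("Pass")
# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_female_or_hd = p_female + p_hd - joint("High Distinction", "Female")
# Conditional probability: P(A | B) = P(A and B) / P(B)
p_pass_given_male = joint("Pass", "Male") / marginal_sex("Male")

for name, value in [
    ("P(Female)", p_female),
    ("P(High Dist)", p_hd),
    ("P(Female or High Dist)", p_female_or_hd),
    ("P(Pass)", p_pass),
    ("P(Pass | Male)", p_pass_given_male),
]:
    print(f"{name} = {value} = {float(value):.4f}")
```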

GROUP EXERCISE 2

Probability Trees

Probability trees are a very neat and fast way for working out many probability problems.
Example: (QMB Final 99s2): An advertising executive is studying the television viewing habits of married men and
women during prime-time hours. The executive has determined that during prime-time, husbands are watching
television 60% of the time. It has also been determined that when the husband is watching television, 40% of the
time the wife is also watching. When the husband is not watching television, 30% of the time the wife is watching
television.

i. Find the probability that the wife is watching television. (0.36)
ii. Find the probability that, if the wife is watching television, the husband is also watching television.
(0.6667)
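The two branches of the probability tree in this example can be traced with a few lines of Python. This is just a sketch of the arithmetic (multiplication rule along each branch, then adding the branches that end in "wife watching"); the variable names are made up for illustration.

```python
# Probabilities given in the example.
p_husband = 0.60                 # P(husband watching)
p_wife_given_husband = 0.40      # P(wife watching | husband watching)
p_wife_given_not_husband = 0.30  # P(wife watching | husband not watching)

# Multiplication rule along each branch of the tree.
p_both = p_husband * p_wife_given_husband
p_wife_only = (1 - p_husband) * p_wife_given_not_husband

# i. Add the two branches in which the wife is watching.
p_wife = p_both + p_wife_only

# ii. Conditional probability: P(husband | wife) = P(both) / P(wife).
p_husband_given_wife = p_both / p_wife

print(f"P(wife watching) = {p_wife:.2f}")
print(f"P(husband watching | wife watching) = {p_husband_given_wife:.4f}")
```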

Useful Practice Questions


1.

17

a. find the mean and the mode of this data set (2 marks)
b. Find the median and the third quartile of this data set (2 marks)
2. Suppose A and B are mutually exclusive events. If P(A) = 0.4 and P(B) = 0.2, then P(A | B) = ?
3. 2 teams A and B are of equal ability, so each has a probability of 0.5 of defeating the other. Assume that the
outcome of any game is independent of the outcome of any other game. What is the probability that team A
wins 4 games in a row?
4. Approximately 30% of the sales representatives hired by a firm quit in less than 1 year. Suppose that two
sales representatives are hired and assume that the first sales representatives behaviour is independent of
the second sales representatives behaviour.
a. What is the probability that both quit within the year?
b. Find the probability that exactly one representative quits
5. A group of individuals concerned about environmental problems claims that 30% of the adults in a certain
town have been adversely affected by a new nuclear power plant that pollutes the air and causes lung
damage. To test their claim, you randomly select 4 adults of the town
a. If the environmental group is correct, what is the probability that all 4 people have been adversely
affected?
b. What is the probability that at least one of the 4 individuals has been adversely affected?

Answers
1. A) mean = 6, mode = 5
B) median = 5, 75% quartile = 7, observing that 50% of data points are below 5 and 75% below 7
2. 0
3. 0.0625
4. A) 0.09
B) 0.42
5. A) 0.0081
B) 0.7599

QMB PASS Week 5 2010


Topics to be covered:
1. Random variables & Probability Distributions
2. Bivariate Distributions
3. Applications in Finance: Portfolio Diversification & Asset Allocation
1. RANDOM VARIABLES & PROBABILITY DISTRIBUTIONS
Here are a few questions as a warm up.
1. What is a random variable?
2. Can you recall from Week 3 PASS the difference between a discrete and
continuous random variable?
If we are happy with this, we now approach the concept of a probability distribution,
which is a table, formula, or graph that describes the values of a random variable and the
probability associated with those values.
For discrete probability distributions, there are 2 requirements:
1. 0 ≤ P(x) ≤ 1, for all x.
2. Σ P(x) = 1, summing over all x.
Let's think about the methods/techniques we can use to describe the
population/probability distribution. From memory, or using your lecture notes, fill out the
following table with assistance from group members around you:
ANALYSING PROBABILITY DISTRIBUTIONS
Term                                         | Definition | Formula
Population Mean (aka. Expected Value of X)   |            |
Population Variance                          |            | (Full)
                                             |            | (Shortcut)
Population Standard Deviation                |            |

We also come across a new concept of the laws of expected value & variance. These are:
a) Expected Value
1. E(C) = C
2. E(X+C) = E(X) + C
3. E(CX) = C.E(X)

b) Variance
1. V(C) = 0
2. V(X+C) = V(X)
3. V(CX) = C²V(X)
Now, try these questions.
Q1. Sheldon has trouble sleeping at night because sometimes there is this one girl
who calls him up at like 3am for no reason. It means he's in a bad
mood the next day. It happens so much he could actually create a probability
distribution for it:
Number of times she calls Sheldon | Probability she will call Sheldon
1                                 | .05
2                                 | .12
3                                 | .20
4                                 | .30
5                                 | .15
6                                 | .10
7                                 | .08

Help Sheldon compute the mean and variance of the number of times the annoying
girl calls him. (Mean = 4, variance = 2.40)
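As a way of checking Q1, here is a short Python sketch of the usual expected value and variance calculations for a discrete probability distribution: the mean is the sum of x × P(x), the full variance formula is the sum of (x − mean)² × P(x), and the shortcut is the sum of x² × P(x) minus the mean squared. The variable names are illustrative.

```python
# Probability distribution from Q1.
calls = [1, 2, 3, 4, 5, 6, 7]
probs = [0.05, 0.12, 0.20, 0.30, 0.15, 0.10, 0.08]

# Requirement check: the probabilities must sum to 1.
assert abs(sum(probs) - 1) < 1e-9

# Expected value: weight each value by its probability.
mean = sum(x * p for x, p in zip(calls, probs))

# Variance, full formula: probability-weighted squared deviations from the mean.
variance_full = sum((x - mean) ** 2 * p for x, p in zip(calls, probs))

# Variance, shortcut formula: E(X^2) - [E(X)]^2.
variance_shortcut = sum(x ** 2 * p for x, p in zip(calls, probs)) - mean ** 2

standard_deviation = variance_full ** 0.5

print(f"mean = {mean:.2f}")
print(f"variance (full) = {variance_full:.2f}")
print(f"variance (shortcut) = {variance_shortcut:.2f}")
print(f"standard deviation = {standard_deviation:.4f}")
```

Both variance formulas give the same number, which is a handy way to catch arithmetic slips.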

Q2. Continuing on, this girl is crazy. Every time she walks past a Louis Vuitton
store, she has this burning temptation to buy a LV handbag. She used to buy, like, 2
or 3 at a time, but now that Sheldon dumped her, she's more reluctant to buy one
these days. This is the probability distribution for the number of LV handbags she
buys each time she goes out:
Number of LV handbags she wants to buy | Probability she buys that number of handbags
0                                      | .10
1                                      | .25
2                                      | .40
3                                      | .20
4                                      | .05

How many LV handbags should we expect her to buy on Thursday night? (1.85)

2. BIVARIATE DISTRIBUTIONS
Do you recall bivariate relations from Week 3 PASS? We now come across the concept of
bivariate distributions which provide the probabilities of combinations of 2 variables.
There are 2 measures that are important in describing the bivariate distribution.
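IMPORTANT FORMULAS FOR BIVARIATE DISTRIBUTIONS
Term                       | Formula
Covariance (Full)          | $COV(X, Y) = \sum_x \sum_y (x - \mu_X)(y - \mu_Y) P(x, y)$
Covariance (Shortcut)      | $COV(X, Y) = \sum_x \sum_y x y P(x, y) - \mu_X \mu_Y$
Coefficient of Correlation | $\rho = \frac{COV(X, Y)}{\sigma_X \sigma_Y}$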
Importantly, we also have laws of expected value & variance for the sum of 2 variables
too:
1. E(X+Y) = E(X) + E(Y)
2. V(X+Y) = V(X) + V(Y) + 2.COV(X,Y)
...noting that if X and Y are independent, then COV(X,Y) = 0.
Group Question: This question is quite long, so divide the parts up with your partner to get it
done in time.
Sheldon and Juliet are PASS leaders by day, and drug dealers by night. Let X and Y
be the weight in kilograms of drugs Sheldon and Juliet sell each night respectively.
Bivariate Probability Distribution:

           Y = 0    Y = 1    Y = 2    Total
X = 0       .12      .21      .07      .4
X = 1       .42      .06      .02      .5
X = 2       .06      .03      .01      .1
Total       .6       .3       .1      1.00

You are given the following information to assist you:


E(X) = .7
V(X) = .41
E(Y) = .5
V(Y) = .45
a) Calculate the covariance using either the full or shortcut formula. (-.15)
b) Calculate the coefficient of correlation between the kilograms of drugs sold by
Sheldon and Juliet. (-.35)
c) Draw an inference/conclusion regarding your findings in part b).
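Parts a) and b) can be checked with the Python sketch below, which computes the covariance by both the full and the shortcut formula and then the coefficient of correlation. It assumes the rows of the table are X and the columns are Y (this layout reproduces the given E(X) = .7 and E(Y) = .5); the dictionary structure and names are illustrative.

```python
import math

# Joint probabilities P(x, y), keyed by (x, y); rows of the table are X, columns are Y.
joint = {
    (0, 0): 0.12, (0, 1): 0.21, (0, 2): 0.07,
    (1, 0): 0.42, (1, 1): 0.06, (1, 2): 0.02,
    (2, 0): 0.06, (2, 1): 0.03, (2, 2): 0.01,
}

# Expected values from the joint distribution.
e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())

# Variances of X and Y.
var_x = sum((x - e_x) ** 2 * p for (x, y), p in joint.items())
var_y = sum((y - e_y) ** 2 * p for (x, y), p in joint.items())

# Covariance: full formula and shortcut formula should agree.
cov_full = sum((x - e_x) * (y - e_y) * p for (x, y), p in joint.items())
cov_shortcut = e_xy - e_x * e_y

# Coefficient of correlation.
rho = cov_full / math.sqrt(var_x * var_y)

print(f"E(X) = {e_x:.2f}, V(X) = {var_x:.2f}")
print(f"E(Y) = {e_y:.2f}, V(Y) = {var_y:.2f}")
print(f"COV(X, Y) = {cov_full:.2f} (shortcut: {cov_shortcut:.2f})")
print(f"rho = {rho:.2f}")
```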

3. APPLICATIONS IN FINANCE: PORTFOLIO DIVERSIFICATION & ASSET ALLOCATION
Here are some questions to consider with the person sitting next to you:
1. Why should we diversify our portfolio's investments?
2. How do we diversify?
3. Why did your lecturer include it in the lecture slides ie. how is diversification
related to statistics?
FORMULAS FOR A PORTFOLIO OF 2 STOCKS
Term     | Formula
Mean     | $E(R_p) = w_1 E(R_1) + w_2 E(R_2)$
Variance | $V(R_p) = w_1^2 \sigma_1^2 + w_2^2 \sigma_2^2 + 2 w_1 w_2 \rho \sigma_1 \sigma_2$

Question
Sheldon has also joined the recent craze of investing in English football clubs. This
is what his investment portfolio looks like:
Stock                  | Proportion of Portfolio | Mean | Standard Deviation
Manchester United (#1) | .30                     | .12  | .02
Liverpool (#2)         | .70                     | .25  | .15

For each of the following coefficients of correlation, calculate the expected value
and standard deviation of the portfolio.
a) ρ = .5 (.211, .1081)
b) ρ = .2 (.211, .1064)
c) ρ = 0 (.211, .1052)
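The portfolio question can be checked with a small Python sketch of the two-stock formulas: the expected return is the weighted average of the two means, and the variance is w1²σ1² + w2²σ2² + 2·w1·w2·ρ·σ1·σ2. The function name `portfolio` and its argument order are assumptions made for illustration.

```python
import math


def portfolio(w1, mean1, sd1, w2, mean2, sd2, rho):
    """Return (expected return, standard deviation) of a two-stock portfolio."""
    expected = w1 * mean1 + w2 * mean2
    variance = (
        w1 ** 2 * sd1 ** 2
        + w2 ** 2 * sd2 ** 2
        + 2 * w1 * w2 * rho * sd1 * sd2
    )
    return expected, math.sqrt(variance)


# Sheldon's portfolio: Manchester United (#1) and Liverpool (#2).
results = {}
for rho in (0.5, 0.2, 0.0):
    results[rho] = portfolio(0.30, 0.12, 0.02, 0.70, 0.25, 0.15, rho)
    expected, sd = results[rho]
    print(f"rho = {rho}: expected value = {expected:.3f}, standard deviation = {sd:.4f}")
```

Notice that the expected value does not depend on ρ, while the standard deviation falls as ρ falls, which is the statistical case for diversification.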

BES PASS Week 6


Aims:

To understand the last of the discrete probability distributions: the binomial distribution

To introduce continuous probability distributions: the uniform distribution

Question
Recall our discussion of discrete and continuous random variables.
Discrete = countable/finite
Continuous = range of values/infinite number of values in a given interval
Which of the following are discrete and which are continuous?
a) The number of goals scored in 20 attempts (discrete)
b) The time it takes to write an essay (continuous)
c) The number of people in a bar (discrete)
d) The temperature inside a room (continuous)
e) The amount of energy used by a computer (continuous)

1. Binomial Distribution
Let's recall the properties of a Binomial Experiment. There are 4, so give it a shot!
1) Fixed number of trials (n)
2) Two possible outcomes: success and failure
3) P(Success) = p and P(Failure) = (1-p)
4) Trials are independent
Examples: Flipping a coin 10 times, or drawing 5 cards with replacement from a shuffled deck and counting the hearts
Note: In a binomial experiment, there is an assumption of a sequence of Bernoulli trials, i.e. the random
variables are independently and identically distributed (iid)
Binomial Random Variable
The probability of x successes in a binomial experiment with n trials and probability of success p is

o X ~ Bin(n, p)
o $P(X = x) = {}^{n}C_{x} \, p^{x} q^{n-x}$, where q = 1 − p
N.B. Learn to use Binomial tables!!!
P(X = k): Individual binomial probability
P(X ≤ k): Cumulative binomial probability
P(X > k): Survivor probability
Also, from Perms and Combs,

${}^{n}C_{r} = \frac{n!}{r!(n - r)!}$

Which means we can also write our Binomial Function as:

$P(X = x) = \frac{n!}{x!(n - x)!} p^{x} (1 - p)^{n-x}$

Mean and Variance of a Binomial Distribution

$\mu = E(X) = np$
$\sigma^2 = Var(X) = np(1 - p)$
o $\sigma = \sqrt{np(1 - p)}$
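If you don't have binomial tables handy, individual and cumulative binomial probabilities can be computed directly. The Python sketch below uses the standard binomial formula with `math.comb` for the combinations; the helper names `binomial_pmf` and `binomial_cdf` are made up, and the values n = 6, p = 0.15 are borrowed from exercise 1 as an illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Individual probability P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binomial_cdf(k, n, p):
    """Cumulative probability P(X <= k) for X ~ Bin(n, p)."""
    return sum(binomial_pmf(x, n, p) for x in range(k + 1))


n, p = 6, 0.15

p_exactly_one = binomial_pmf(1, n, p)
p_at_least_three = 1 - binomial_cdf(2, n, p)  # survivor probability P(X > 2)
mean = n * p
variance = n * p * (1 - p)

print(f"P(X = 1) = {p_exactly_one:.4f}")
print(f"P(X >= 3) = {p_at_least_three:.4f}")
print(f"mean = {mean:.2f}, variance = {variance:.3f}")
```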

Exercises:
1. Sheldon knows that 15% of all the girls he goes out with want expensive presents during the first month of
dating. He decides to test this theory out and goes out with 6 girls. Assume the performances of the girls are
independent of one another. Whats the probability that:
a) All six girls will require expensive presents during the first month of dating? (0.0000)
b) 1 of them will demand an expensive present during the first month of dating? (0.3993)
c) At least 3 of them will require expensive presents during the first month of dating? (Hint: use cumulative
binomial probabilities) (0.0473)

2. The Koch Electric Company makes electric shavers. If the probability that an electric shaver is
defective is 0.01, what is the probability of the following in a shipment of 500 electric shavers
(the answers shown use the Poisson approximation with mean np = 5):
a) None are defective? (0.0067)
b) One is defective? (0.0337)
c) More than three are defective? (0.735)

3. A plumber installs six hot water heaters in a housing development. The probability that any
individual heater will last more than 10 years is 0.7, and their life lengths are independent. Let X
denote the number of water heaters that last more than 10 years.
a) Find the probability that more than 3 of the water heaters will last more than 10 years (0.7443)
b) Find the mean and variance of the random variable X (4.2; 1.26)

4. A quality control manager for a manufacturer has instituted acceptance sampling in order to monitor the
quality of incoming parts that are bought in bulk. The policy is that all incoming parts are checked by
selecting at random 10 parts and then determining whether each part contains any defects or not. If 2 or
more parts are found to have defects then the entire order is rejected and is returned to the supplier. What
is the probability that an order from a particular supplier is rejected if that supplier is known to have 5% of
parts with defects? (0.0861)

5. The probabilities that three independent members of a committee will vote in favour of electing a PASS
leader as president are 0.2, 0.3 and 0.5, respectively. The probability that at most one member of the
committee will elect a PASS leader is? (0.75)

Taking our focus to continuous random variables,


o P(X = x) = 0, since there are an infinite number of values that X can take
o P(X < k) = P(X ≤ k)
Probability Density Function
Consider a probability histogram for a random variable
o If we make the widths of the columns so small that they are approximately continuous, the
tops of the columns will form a smooth curve
That curve is described by a probability density function f(x), whose range is a ≤ x ≤ b, and which must satisfy:
1) f(x) ≥ 0 for all x between a and b
2) The total area under the curve between a and b is 1
2. Uniform Distribution
o Also known as a rectangular probability distribution (from its shape)
o A distribution where all values of the random variable within the range a ≤ X ≤ b are equally likely to
occur
o Defined by the function:
o $f(x) = \frac{1}{b - a}$, where a ≤ x ≤ b
Taking the example from your lecture notes...
Store deliveries 7-8am
Let X = no. of minutes after 7:00am

Some formulas for the uniform distribution:

E(X) = (a + b)/2
Median = (a + b)/2
Var(X) = (b − a)²/12
P(x1 ≤ X ≤ x2) = Area under graph = (x2 − x1) × 1/(b − a)
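The uniform distribution formulas are simple enough to wrap in a few Python helpers. The sketch below applies them to the store deliveries example (X = minutes after 7:00am, so a = 0 and b = 60); the interval from 15 to 30 minutes is a made-up illustration, and the function names are hypothetical.

```python
def uniform_pdf(x, a, b):
    """Height of the rectangle: 1 / (b - a) inside [a, b], zero outside."""
    return 1 / (b - a) if a <= x <= b else 0.0


def uniform_prob(x1, x2, a, b):
    """P(x1 <= X <= x2): width of the overlap with [a, b] times the height."""
    low = max(x1, a)
    high = min(x2, b)
    return max(high - low, 0) / (b - a)


def uniform_mean(a, b):
    return (a + b) / 2


def uniform_variance(a, b):
    return (b - a) ** 2 / 12


# Store deliveries between 7am and 8am: X = number of minutes after 7:00am.
a, b = 0, 60

height = uniform_pdf(10, a, b)
mean = uniform_mean(a, b)
variance = uniform_variance(a, b)
p_quarter_past_to_half_past = uniform_prob(15, 30, a, b)
p_whole_range = uniform_prob(a, b, a, b)

print(f"f(x) = {height:.4f}")
print(f"E(X) = {mean}, Var(X) = {variance}")
print(f"P(15 <= X <= 30) = {p_quarter_past_to_half_past}")
print(f"P(0 <= X <= 60) = {p_whole_range}")
```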

Exercises:
1. The time before a baby cries is a uniformly distributed random variable between 0 and 30 minutes
a) Find the probability density function (1/30)
b) Find the probability that a baby cries within 20 minutes (0.67)
c) Find the probability that a baby does not cry within 10 minutes (0.67)
d) Find the probability that a baby cries between 15 minutes and 20 minutes (0.17)
2. If the random variable X is uniformly distributed between 2 and 10:
a) Calculate the formula of the probability density function. What type of line would this be if drawn?
b) Calculate P(2 ≤ X ≤ 10), that is, find the probability that X will assume a value between 2 and 10 (P(2 ≤
X ≤ 10) equals 1, as the area under the curve must be equal to 1, that is, all the possible outcomes fall within
this range)
c) Find the mean and variance of X (6; 5.33)
d) Calculate P(2 ≤ X ≤ 8) (0.75)
e) Calculate P(X = 6) (0)

3. If X, a continuous random variable, is symmetric about μ, is P(X < μ − 2) equal to P(X > μ + 2)? (yes)
4. If X, a continuous random variable, is symmetric about X = 2, find P(X > 2) (0.5)

BES PASS Week 7 2010


Topics to be covered:
1. The Normal Distribution & Finding Probabilities
2. The Normal Approximation to the Binomial
3. Concepts of Estimation
1. THE NORMAL DISTRIBUTION
a) Just for starters: a warm-up question
In the space below, draw what a normal distribution looks like. Then, in a different colour,
show what happens when:
a) the mean increases/decreases
b) the standard deviation increases/decreases

b) Calculating normal probabilities


To calculate the probability that a normal random variable falls into any interval, we need
to compute the area in the interval under the curve. But since that's too hard (we need
calculus), we can use the probability tables, provided we standardise this random variable.
THE STANDARDISED NORMAL RANDOM VARIABLE
$Z = \frac{X - \mu}{\sigma}$

Class Example
You make an investment in stocks with an average return of 10%. Find the
probability that you will lose money:
a) if the standard deviation of returns is 5% (0.0228)
b) if the standard deviation of returns is 10% (0.1587)
Clue! Use the tables.
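If you'd like to check your table look-ups for the class example, the Python sketch below standardises with Z = (X − mean) / standard deviation and then reads the area to the left of Z from the standard normal curve. It assumes returns are normally distributed, as the example implies, and uses `statistics.NormalDist` from the standard library in place of the printed tables; `prob_below` is a hypothetical helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def prob_below(x, mu, sigma):
    """P(X < x) for a normal X, found by standardising to Z first."""
    z = (x - mu) / sigma
    return standard_normal.cdf(z)


# Losing money means the return is below 0%.
p_loss_a = prob_below(0, 10, 5)    # standard deviation of 5%
p_loss_b = prob_below(0, 10, 10)   # standard deviation of 10%

print(f"a) P(return < 0) = {p_loss_a:.4f}")
print(f"b) P(return < 0) = {p_loss_b:.4f}")
```

The computed values can differ from the tables in the fourth decimal place, because the tables round Z to two decimals first.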

c) Finding values of Z
Just then, we focused on working out Z, then using that to work out the probability of
something.
However, often questions can ask us to reverse engineer the process, by giving us a
probability first, then working out what Z is. This is the complete opposite of the previous
process.
FINDING VALUES OF Z, GIVEN A PROBABILITY

ZA = The value of Z such that the area to its right under the standard normal
curve is A.

ie. ZA = The value of a standard normal random variable such that


P ( Z > ZA ) = A

Question
a) Find Z0.025 (1.96)
b) Find Z0.05 (1.645)
d) ZA and percentiles
ZA & PERCENTILES

ZA = the 100(1 − A)th percentile of a standard normal random variable.

eg. Using question (b) from above, Z0.05 = 1.645 = the 95th percentile.
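Working backwards from a probability to a value of Z can also be checked in Python. Since ZA leaves area A to its right, it leaves area 1 − A to its left, so the sketch below asks the standard normal distribution for the point with cumulative probability 1 − A. `statistics.NormalDist.inv_cdf` stands in for reading the tables in reverse, and `z_value` is a made-up helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def z_value(area_to_right):
    """Return Z_A, the value with area A to its right under the standard normal curve."""
    return standard_normal.inv_cdf(1 - area_to_right)


z_025 = z_value(0.025)  # the 97.5th percentile
z_05 = z_value(0.05)    # the 95th percentile

print(f"Z0.025 = {z_025:.3f}")
print(f"Z0.05 = {z_05:.3f}")
```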

Let's do some questions


1. The amount of time students spend each week on Facebook (FB) is a normally
distributed random variable with a mean of 7.5 hours and a standard deviation of
2.1 hours.
a) What proportion of students go on FB for more than 10 hours per week? (.1170)
b) Find the probability that a student spends between 7 and 9 hours on FB. (.3559)
c) What proportion of students spend less than 3 hours on FB? (.0162)
d) Below what amount of time do only 5% of students spend on FB? (4.05 hours)

2. An analysis of the amount of interest paid monthly by Visa cardholders reveals that
the amount is normally distributed with a mean of $27 and a standard deviation of
$7.
a) What proportion of the cardholders pay more than $30 in interest? (.3336)
b) What proportion of the cardholders pay more than $40 in interest? (.0314)
c) What proportion of the cardholders pay less than $15 in interest? (.0436)
d) What interest payment is exceeded by only 20% of the cardholders? ($32.88)

2. THE NORMAL APPROXIMATION TO THE BINOMIAL


a) Why we approximate the binomial by the normal
Discrete distributions such as the binomial distribution are not that easy to draw inferences
from. But drawing inferences is the reason why we need sampling distributions.
Because of this, we approximate the binomial distribution by a normal distribution, by
drawing a bell-shaped curve to smooth out the ends of the rectangles in the histogram.
b) Nuts and bolts of normal approximation to the binomial
You should recall from last week that, for the binomial distribution:
BINOMIAL FORMULAS

mean: μ = np

standard deviation: $\sigma = \sqrt{np(1 - p)}$

Note, however, that we can't directly apply the normal to the binomial. We actually need a
continuity correction factor of 0.5 to adjust for the approximation. In particular:
USING THE CONTINUITY CORRECTION FACTOR
Let Y be the normal random variable approximating the binomial random variable X.

P(X = x) ≈ P(x − 0.5 < Y < x + 0.5)

P(X ≤ x) ≈ P(Y < x + 0.5)

P(X ≥ x) ≈ P(Y > x − 0.5)
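The continuity correction rules can be sketched in Python as follows. The approximating normal variable Y is given mean np and standard deviation √(np(1 − p)), and each binomial event is widened by 0.5 before the normal area is taken. The helper names are hypothetical, and n = 50, p = 0.10 are illustrative values (they match the second question in this section).

```python
from math import sqrt
from statistics import NormalDist


def approximating_normal(n, p):
    """Normal variable Y with the same mean and standard deviation as Bin(n, p)."""
    return NormalDist(n * p, sqrt(n * p * (1 - p)))


def approx_exactly(x, n, p):
    """P(X = x) is approximated by P(x - 0.5 < Y < x + 0.5)."""
    y = approximating_normal(n, p)
    return y.cdf(x + 0.5) - y.cdf(x - 0.5)


def approx_at_most(x, n, p):
    """P(X <= x) is approximated by P(Y < x + 0.5)."""
    return approximating_normal(n, p).cdf(x + 0.5)


def approx_at_least(x, n, p):
    """P(X >= x) is approximated by P(Y > x - 0.5)."""
    return 1 - approximating_normal(n, p).cdf(x - 0.5)


n, p = 50, 0.10

p_exactly_8 = approx_exactly(8, n, p)
p_at_least_3 = approx_at_least(3, n, p)
p_at_most_7 = approx_at_most(7, n, p)

print(f"P(X = 8)  is roughly {p_exactly_8:.4f}")
print(f"P(X >= 3) is roughly {p_at_least_3:.4f}")
print(f"P(X <= 7) is roughly {p_at_most_7:.4f}")
```

Expect small differences from table-based answers, since the tables round Z to two decimal places.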

c) Some questions to try


3. Juliet and Sheldon are stars of the next Batman movie, the Dark PASS Leader. Juliet
is Batwoman and Sheldon is Joker. We are in that moment close to the end of the
film where Sheldon as Joker flips the very special coin with probability of heads as
0.95. Juliet doesn't believe such a coin exists, so Sheldon, being Joker, messes
around and flips it 100 times. What is the probability that:
a) Juliet sees 100 heads flipped? (0.0133)
b) Juliet sees at least 90 heads flipped? (0.9941)
c) Juliet sees no more than 98 heads flipped? (0.9463)
d) Juliet saves Gotham city from Sheldon's destruction? (No chance at all, ever...)

4. Anyway...so...moving on...Sheldon and Juliet have now become avatars. Sheldon is
Corporal Jake Sully from Earth and Juliet is Neytiri of the Na'vi. They are fighting
against the evil humans who want to destroy their magic tree, from where they
generate energy to teach their PASS classes. Sheldon is actually from the human
team but betrays them and links up with Juliet to fight against the evil humans.
Over the past few centuries, they fight 50 wars, with the probability of the Na'vi
winning being a massive 10%. The possible outcomes of wars are only winning and
losing: there is no such thing as drawing a war. Calculate the probability that:
a) Na'vi wins the war 8 times? (0.0695)
b) Na'vi wins the war at least 3 times? (0.8810)
c) Na'vi wins the war no more than 7 times? (0.8810)
d) Sheldon as Corporal Jake Sully will officially turn into an avatar (110%)...I will
teach you in PASS class next week in the other world.

3. CONCEPTS OF ESTIMATION
a) Some chilled questions to consider with the person next to you...

What is the purpose of estimation?

What is the difference between a point estimator and an interval estimator?

b) The 3 desirable qualities of estimators


1. Unbiased-ness

2. Consistency

3. Relative efficiency

BES PASS Week 8


Aims:
Learn about the sampling distribution of the mean
Learn the Central Limit Theorem (CLT)
Learn about Confidence Interval
o Estimation of Population mean when Population variance is known
Selecting Sample size

Q. Recall the difference between:


a) population vs sample
b) parameter vs statistic

1. Sampling Distribution of the mean

The sampling distribution of a statistic is the distribution of all possible values that can be assumed by that
statistic, computed from samples of the same size drawn from the same population
i.e. it allows us to estimate the population parameter using a sample statistic
o The population of a random variable will have certain parameters
E.g. Mean μ and Variance σ²
o For a particular sample of size n, the sample statistic is unlikely to be the same as its population
parameter
Known as sampling error: the cost of sampling, which can be reduced by taking larger
samples. (NB: Standard deviation of the sampling distribution of the mean = standard error of the mean)
o Different samples (of size n) will have different sample statistics
i.e. Sample mean/variance will vary for each sample
o Taking repeated samples of size n, the distribution of this statistic can be computed

Properties of the Sampling Distribution of the Sample Mean


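1) $\mu_{\bar{x}} = \mu$
2) $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
3) If X is normal, $\bar{X}$ is normal
If X is non-normal, $\bar{X}$ is approximately normal for sufficiently large sample sizes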
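These properties translate into a short Python sketch: the sample mean of n observations is treated as a normal variable with the population mean and a standard deviation of σ/√n. The numbers are borrowed from the first additional question (mean 52 hours, standard deviation 6 hours, sample of 3), and `sample_mean_distribution` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def sample_mean_distribution(mu, sigma, n):
    """Sampling distribution of the sample mean: normal with mean mu and sd sigma / sqrt(n)."""
    return NormalDist(mu, sigma / sqrt(n))


# Population: mean 52 hours, standard deviation 6 hours; sample of size 3.
x_bar = sample_mean_distribution(52, 6, 3)

standard_error = x_bar.stdev
p_mean_above_60 = 1 - x_bar.cdf(60)

print(f"standard error = {standard_error:.4f}")
print(f"P(sample mean > 60) = {p_mean_above_60:.4f}")
```

The normal shape is exact here only because the population itself is normal; for a non-normal population it is an approximation that relies on a sufficiently large sample.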

2. Central Limit Theorem


Definition: The sampling distribution of the mean of a random sample drawn from any population is
approximately normal for a sufficiently large sample size. The larger the sample size, the more closely the
sampling distribution of $\bar{X}$ will resemble a normal distribution.

                                               Mean     Variance        Distribution
Population parameter                           $\mu$    $\sigma^2$      -
Samples (any size) from normal pop             $\mu$    $\sigma^2/n$    Normal
Samples (n ≥ 30) from non-normal population    $\mu$    $\sigma^2/n$    Approx normal re: CLT
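
The table can be checked by simulation. Below is a minimal Python sketch, assuming a hypothetical non-normal population (exponential with mean 2, so its standard deviation is also 2) and samples of size n = 30; the function name and the population are illustrative choices only.

```python
import random
import statistics

def simulate_sample_means(draw, n, repetitions, seed=0):
    """Draw `repetitions` samples of size n and return the list of sample means."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(repetitions)]

# Hypothetical population (an assumption for illustration): exponential, mean 2, sd 2.
mu, sigma, n = 2.0, 2.0, 30
means = simulate_sample_means(lambda rng: rng.expovariate(1 / mu), n, repetitions=5000)

# The mean of the sample means should sit near mu,
# and their standard deviation near sigma / sqrt(n).
print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means:  ", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):     ", round(sigma / n ** 0.5, 3))
```

Plotting a histogram of `means` should show a roughly bell-shaped curve even though the population itself is strongly skewed.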

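Group Exercise 1
A sample of 100 observations is drawn from a normal population, with $\mu = 1000$ and $\sigma = 200$.
Find the probability that the sample mean $\bar{X}$ lies beyond each of the following values:

a) 1050

b) 960

c) 1100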

Group Exercise 2
In a certain PASS community, 60% of all leaders are in favour of electing Sheldon as the genius. A random
sample of 200 leaders is taken. What is the probability that 100 or fewer of these leaders favour the election of
Sheldon as the one and only genius? (0.0025)
Group Exercise 3
Cadbury Yowie chocolates are known to have a mean weight of 27g and a variance of 6.25g
squared. If a random sample of 60 Yowies is examined, find the probability that its average is:
a) Below 26g (0.1075)

b) Between 27.50g and 28g (0.1601)

Group Exercise 4
A basketball coach is seeking tall recruits who are smart enough to be eligible for college. The
recruit must be at least 74 inches tall and have an IQ of 115 or above. Height and IQ are
independent of one another. IQ is normally distributed with mean 100 and standard deviation 12,
and height is normally distributed with mean 70 and standard deviation 2 inches. What percentage
of the population satisfies the coach's requirements? (0.24%)

Additional Questions for you to do...


1. The amount of time lawyers devote to their jobs per week is normally distributed with a mean of 52 hours
and standard deviation of 6 hours.
a.) Find the probability that the mean amount of work per week for three randomly selected lawyers
is more than 60 hours. (0.0104)
b.) If the strict boss finds out that the average time worked by his 7 employees is less than 48 hours, he
will fire them all. What is the probability they will be fired? (0.29454)

2. The time it takes for a statistics professor to mark his mid-session test is normally distributed with a mean
of 4.8 mins and a standard deviation of 1.3 mins. If there are 60 students in the class, what is the probability
that he needs more than 5 hours to mark all the mid-session tests? (0.1170)

3. Pierre's goose farm claims that its jars of foie gras have a mean weight of 250g and a standard deviation of 6g.
After buying 36 jars, before eating them on petits blinis with some fig jam, salt and pepper, you weighed
them and found them to have a mean of 245g. What general statement can we make about Pierre's
claim? (Pierre is a lying Frenchman)

4. The mean of a population is 18.75, and the standard deviation is 7.8:


a) If a sample of 50 values is taken, what is the probability that the sample mean is greater than 20?
(0.1292)
b) If a sample of 100 is taken, what is the probability that the sample mean is greater than 20? (0.0548)

3. Confidence Interval
Recall:
o Point Estimators produce a single estimate of the parameter of interest
o Interval Estimators produce a range of values and attach a degree of confidence with that interval
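Confidence Interval: an interval estimator defined by the confidence level $(1-\alpha)$
This implies that we start with the confidence level we want and then work out the width of the interval
Deriving it step by step...
1. Recall:
Standard Normal: $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
2. Employ the definition of a confidence interval
Symmetrical Interval: $P\left(-z_{\alpha/2} < \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1-\alpha$
3. Rearranging,

$P\left(\bar{X} - z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\right) = 1-\alpha$

Lower Confidence Limit (LCL): $\bar{X} - z_{\alpha/2}\,\sigma/\sqrt{n}$

Upper Confidence Limit (UCL): $\bar{X} + z_{\alpha/2}\,\sigma/\sqrt{n}$

Therefore the confidence interval estimator (confidence level $1-\alpha$) is:

$\bar{x} \pm z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}$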

How do we interpret this interval?


Confidence Level $(1-\alpha)$    $\alpha$    $\alpha/2$    $z_{\alpha/2}$
90%                              0.10        0.05          1.645
95%                              0.05        0.025         1.960
98%                              0.02        0.01          2.326
99%                              0.01        0.005         2.576
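
The table values and the interval itself can be reproduced with Python's standard library. This is a minimal sketch, assuming the population standard deviation is known; the function name `z_confidence_interval` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence):
    """Return (LCL, UCL) for the population mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    half_width = z * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Reproduce the z_{alpha/2} column of the table.
for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: z = {NormalDist().inv_cdf(1 - (1 - level) / 2):.3f}")

# Example with sample mean 136, sigma = 40, n = 160, 95% confidence.
lcl, ucl = z_confidence_interval(136, 40, 160, 0.95)
print(f"({lcl:.2f}, {ucl:.2f})")
```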

4. Selecting sample size


Previously, we determined our level of confidence first and then the width of the interval. However, sometimes
the width of the confidence interval may be determined before sampling.
To calculate the sample size based on a desired sampling error, we use the formula (solve for n from the above
confidence interval formula):

$n = \left(\dfrac{z_{\alpha/2}\,\sigma}{B}\right)^2$

where B = the sampling error, and is equal to

$B = z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}$

*Note that this is equal to half the width of the confidence interval
Consider what happens to the width of a CI when:
o Standard deviation changes
o Confidence level changes
o Sample size changes
o Sample mean changes
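
The sample size formula is easy to get wrong by forgetting to round up, so here is a minimal Python sketch, assuming a known $\sigma$ and a two-sided interval; `required_sample_size` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sigma, bound, confidence):
    """Smallest n whose sampling error B = z * sigma / sqrt(n) is at most `bound`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    return ceil((z * sigma / bound) ** 2)                # always round UP

# sigma = 18, sampling error no greater than 3, 90% confidence
print(required_sample_size(sigma=18, bound=3, confidence=0.90))   # 98
```

Playing with the arguments also answers the "consider what happens" questions: a larger $\sigma$ or a higher confidence level pushes n up, while a looser bound B brings it down. The sample mean does not appear in the formula at all.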

Group Exercise 5
If we know that $\sigma = 40$, and we obtain a sample mean of 136, construct a 95% confidence interval for the
population mean using a sample size of:
a) 20
(118.47, 153.53)
b) 160
(129.80, 142.20)

Group Exercise 6
If we know that $\sigma = 40$, and we obtain a sample mean of 136 using 25 values, construct a confidence interval
for the population mean having:
a) A confidence level of 99%
(115.392, 156.608)
b) A confidence level of 50%
(130.60, 141.40)
Group Exercise 7
John wants to estimate the average time it takes for customers to have lunch at his new cafe. He knows from
past experience that the standard deviation will be 18 minutes. John wants to use a confidence level of 90% and
have a sampling error no greater than 3 minutes. How many customers does he need to time? (98)

Additional Questions for you to do...


1. The average mark for an exam was between 65% and 75% using a confidence coefficient of 90%. Indicate
whether the following statements are true or false:
a) 90% of students scored a mark between 65% and 75%
b) If random samples were taken, then 90% of the samples would have a mean between 65% and 75%
c) The probability that the average population mark will be contained in this interval is 90%
d) The probability that the average population mark will fall between this interval is 90%
2. In an article about disinflation, various investments were examined.
The investments included stocks, bonds, and real estate. Suppose that a random sample of 200 rates of
return on real estate investments was computed and recorded. The sample mean was calculated to be
12.10% return. Assuming that the standard deviation of all rates of return on real estate investments is 2.1%,
estimate the mean rate of return on all real estate investments with 90% confidence. Interpret the estimate.
(11.86% : 12.34%)

3. An economist wants to estimate the mean annual income of households in a particular district. It is assumed
that the population standard deviation is $4000. The economist wants the estimate to be
within B = $500 of the true mean with a 95% level of confidence. Calculate the sample size required.
4. Starting annual salaries for university graduates with business degrees are believed to have a standard
deviation of approximately $1800. A 95% confidence interval estimate of the mean annual starting salary is
desired. How large a sample should be taken if we want to be 95% confident that the maximum sampling
error is:
a. $500

b. $200
5. A medical researcher wants to investigate the amount of time it takes for patients' headache pain to be
relieved after taking a new prescription painkiller. She plans to use statistical methods to estimate the mean
of the population of relief times. She believes that the population is normally distributed with a standard
deviation of 20 minutes. How large a sample should she take to achieve 90% confidence to within 1 minute?

BES PASS Week 9 2010


Topics to be covered:
1. Hypothesis Testing: Type I & Type II errors
2. Test about the mean when the population standard deviation is known
3. p-values
1. HYPOTHESIS TESTING
a) Setting up a model to help us think about hypothesis testing.
What exactly is a hypothesis test? We can try and simplify the issue by using the most
popular model of thinking: a criminal trial. Imagine Sheldon has been arrested for drink
driving...although this would never happen.
Suppose that we set up 2 hypotheses to test:
o H0: Sheldon is innocent (the null hypothesis)
o H1: Sheldon is guilty (the alternative hypothesis)
The testing procedure begins with the assumption that the null hypothesis is true.
o ie. Assume Sheldon is innocent: assume he is not a drink driver.
What is the goal of hypothesis testing? It is to determine whether there is enough
evidence to infer that the alternative hypothesis is true.
In statistics, we speak of the result of the hypothesis test in 1 of 2 ways:
a) rejecting the null hypothesis in favour of the alternative
b) or not rejecting the null hypothesis in favour of the alternative
Notice that we don't say that we accept the null hypothesis...why?

b) Errors induced when running hypothesis tests


There are 2 possible errors:
o A Type I error occurs when we reject a true null hypothesis.
ie. Sheldon is innocent (Juliet was the actual drink driver), but he is still
wrongly convicted.
P ( Type I error ) = $\alpha$
o A Type II error occurs when we do not reject a false null hypothesis.
ie. Sheldon is actually guilty of drink driving, but he is acquitted.
P ( Type II error ) = $\beta$
c) Group discussion exercise
In the following 2 scenarios, identify what the null and alternative hypotheses would be:
1. You are considering whether you should apply to be a PASS leader in 2011. If you
succeed, a life of fame, fortune and happiness awaits you. If you fail, no one will like
you. Should you apply?

2. You are faced with 2 investments. One is very risky, but the potential returns are high.
The other is safe, but the potential is quite limited. Which one should you choose?

2. TESTING THE POPULATION MEAN WHEN THE POPULATION STANDARD DEVIATION IS KNOWN.
Let's go through an example as a class (adapted from Keller) to illustrate the required
process.
a) Factual scenario
Sheldon and Juliet run an Asian mini-goods store where they sell, amongst other things,
Hello Kitty mobile phone chains and Totoro pillows. Juliet wants to introduce a new
profit strategy - selling Easyway drinks too.
They determine the new profit strategy will be cost-effective only if the mean monthly
account is more than $170.
A random sample of 400 accounts is drawn, for which the sample mean is $178.
Juliet knows the accounts are approximately normally distributed with a standard
deviation of $65.
Can we conclude the new strategy will be cost-effective, if we run the test at 95%
confidence?
b) Setting up the model
What is the null and alternative hypothesis?
H0:
H1:
NB: There are 2 methods in which we can proceed with this problem, using either:
a) the rejection region method
b) the p-value approach
We will consider both methods.
c) The rejection region method
The rejection region is a range of values such that if the test statistic falls into that range,
we decide to reject the null hypothesis in favour of the alternative hypothesis.
Show how we can use the rejection region method to solve this problem in the space below.
Draw diagrams where appropriate.
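
Once you have tried it by hand, the rejection region method can be checked with a minimal Python sketch. It assumes the hypotheses are H0: $\mu = 170$ against H1: $\mu > 170$ (an upper-tail test at $\alpha = 0.05$); the function name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_z_test(x_bar, mu0, sigma, n, alpha):
    """Return (z statistic, critical value, reject?) for H1: mu > mu0."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha)   # rejection region is z > z_alpha
    return z, z_crit, z > z_crit

# Scenario values: sample mean 178, sigma 65, n = 400.
z, z_crit, reject = upper_tail_z_test(x_bar=178, mu0=170, sigma=65, n=400, alpha=0.05)
print(f"z = {z:.2f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```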

d) The p-value approach


Why do we have 2 methods to do the same problem? Well...there are actually some
drawbacks to using the rejection region method. Can you think of some?

The p-value of a test is the probability of observing a test statistic at least as extreme as
the one computed given that the null hypothesis is true.
So in our example here, what would the p-value be?

e) Interpreting the p-value


Interpreting the p-value we just calculated, this means that the probability of observing
a sample mean at least as large as 178 from a population whose mean is 170 is _______,
which is very small.
o In other words, we have just observed an unlikely event, an event so unlikely that
we seriously doubt the assumption that began the process (that the null hypothesis
is true).
o Consequently, we have reason to reject the null hypothesis and support the
alternative.
However, one thing must be clear to you: the p-value is not the probability that the
null hypothesis is true. You cannot make a probability statement about a parameter; it
is not a random variable. This is a similar case to what we did last week with
confidence interval estimators.
You may notice that the further the sample mean is from the hypothesized mean, the
smaller the p-value is.
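
A minimal Python sketch of the p-value calculation, assuming an upper-tail test (H1: $\mu > \mu_0$) with known $\sigma$; the helper name is hypothetical. The loop shows the p-value shrinking as the sample mean moves further from the hypothesized mean of 170.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_p_value(x_bar, mu0, sigma, n):
    """P(sample mean at least as large as x_bar) when the true mean is mu0."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(z)

for x_bar in (172, 174, 176, 178):
    p = upper_tail_p_value(x_bar, mu0=170, sigma=65, n=400)
    print(f"sample mean {x_bar}: p-value = {p:.4f}")
```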
We can develop this further to use the idea of significance to describe the p-value.
Let's fill out the table below together:

DESCRIBING THE P-VALUE

Range of p-value    Amount of evidence to infer that       Term for level of
                    the alternative hypothesis is true     significance
< 0.01              Overwhelming                           ______
0.01 - 0.05         Strong                                 ______
0.05 - 0.10         Weak                                   ______
> 0.10              None                                   ______

f) The p-value and rejection region methods


Note that another way to make the rejection / non-rejection decision is to compare the
p-value with the selected value of the significance level, $\alpha$.
COMPARING P-VALUES WITH THE SIGNIFICANCE LEVEL
o If p-value < $\alpha$, the p-value is small enough to reject the null hypothesis.
o If p-value > $\alpha$, we do not reject the null hypothesis.

g) A note: be careful when you draw conclusions from hypothesis tests

HOW TO DRAW CONCLUSIONS FROM HYPOTHESIS TESTS
o If we reject the null hypothesis, we conclude that there is enough statistical
evidence to infer that the alternative hypothesis is true.
o If we do not reject the null hypothesis, we conclude that there is not
enough statistical evidence to infer that the alternative hypothesis is true.

NB: You cannot prove that either the null or the alternative hypothesis is true.
h) One and Two Tail tests
ONE VS TWO TAIL TESTS
o One Tail tests are of the form:
H0: $\mu = \mu_0$
H1: $\mu > \mu_0$ or H1: $\mu < \mu_0$
o Two Tail tests are of the form:
H0: $\mu = \mu_0$
H1: $\mu \neq \mu_0$

THE ADJUSTMENT REQUIRED FOR RUNNING TWO-TAIL TESTS

1. Change the rejection region into a two-tail rejection region of the form
$z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
2. See if the Z-score you calculate falls into either of these 2 rejection regions.
3. In terms of the p-value, you need to determine the p-value in both tails.
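
The three adjustments can be sketched in Python as follows. This is a minimal sketch, assuming known $\sigma$; the function name and the numbers in the usage line (sample mean 17.55 against a hypothesized 17.09, $\sigma$ = 3.87, n = 100) are chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_tail_z_test(x_bar, mu0, sigma, n, alpha):
    """Return (z, two-tail p-value, reject?) for H1: mu != mu0."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)     # alpha is split across both tails
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # add up the area in both tails
    return z, p_value, abs(z) > z_crit

z, p_value, reject = two_tail_z_test(x_bar=17.55, mu0=17.09, sigma=3.87, n=100, alpha=0.05)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

Working from an unrounded z gives a p-value that can differ in the third or fourth decimal place from one read off the normal table.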

3. PRACTICE QUESTIONS
1. Juliet gets really annoyed when her BES students take too long to finish her class tests.
She's shot a few of them in the leg before. To investigate further, she randomly samples
10 students and measures the amount of time they spend doing a BES test. The results
are listed below. Assuming that the times are normally distributed with a standard
deviation of 2 minutes, test to determine whether Juliet can infer at the 5%
significance level that the mean amount of time spent on the tests is greater than 6
minutes. Data: 8 11 5 6 7 8 6 4 8 3. (Answer: z = .95, p-value = .1711, no.)

2. Sheldon owns a telecommunications company called Shel-Tel which specializes in
providing cheap phone call rates back to Hong Kong. Suppose the mean and standard
deviation of monthly long-distance bills of customers are $17.09 and $3.87 respectively.
Sheldon takes a random sample of 100 customers and recalculates their monthly bills
using rates quoted by a leading competitor, obtaining a sample mean of $17.55. Can we conclude at the 5%
significance level that there is a difference between the average Shel-Tel bill and that of
the competitor? (Answer: z = 1.19, p-value = .2340, no.)

BES PASS WEEK 10


Aims:

Learning to calculate the probability of Type II errors and the Power of the Test
Hypothesis Testing when population variance is unknown
Sampling distribution of sampling proportion

1. Some clarification...
Sample Mean: $\bar{X} \sim (\mu_{\bar{x}}, \sigma^2_{\bar{x}})$, with subscripts to show this is different to the population
Hypothesis testing: testing where our $\bar{x}$ lies in relation to our hypothesised $\mu_0$
o Methods: Critical values for $\bar{x}$ (not a confidence interval) and critical values using z-scores
o State H0 with a strong equality sign (=) and your conclusion with the level of significance.
o Value of $\alpha$ is called our significance level
2. Type I and Type II errors
ERRORS
                    Reject H0           Do not reject H0
Given a true H0     Type I error        Correct Decision
Given a false H0    Correct Decision    Type II error

Type I error occurs when we reject a true null hypothesis

The significance level $\alpha$ is usually given as 0.01, 0.05 or 0.10
P (Type I error) = $\alpha$
P (Reject H0 | H0 is true) = $\alpha$

Type II error occurs when we do not reject a false null hypothesis

P (Type II error) = $\beta$
P (Do not reject H0 | H0 is false) = $\beta$

NB. There is a trade-off between the two types of errors. Changing our significance level $\alpha$ will produce
resultant changes in $\beta$.
Power of the test
The power of the test is the probability of correctly rejecting a false null hypothesis.
Power = $1 - \beta$
NB. $\beta \neq 1 - \alpha$ !!!!
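Steps:
1. Draw the distribution of $\bar{X}$ under the null hypothesis, H0 (centred on the hypothesized mean $\mu_0$)
2. Find the critical value, $\bar{x}_c$, and the rejection region for the level of significance $\alpha$ = P(Type I error).
For an upper tailed test, $\bar{x}_c = \mu_0 + z_\alpha\,\sigma/\sqrt{n}$
3. Draw a new distribution for the true population mean, $\mu_1$, in relation to H1, and find P(Type II error)
For an upper tailed test:
P(Type II error) = $P(\bar{X} < \bar{x}_c \mid \mu = \mu_1)$ - see diagram
For a lower tailed test:
P(Type II error) = $P(\bar{X} > \bar{x}_c \mid \mu = \mu_1)$

[Diagram: two sampling distributions of $\bar{X}$. The top curve is the distribution under H0, centred on the hypothesized mean, with P(Type I error) shaded in the rejection region. The bottom curve is the distribution centred on the actual mean when H0 is false: the area in the rejection region is correctly rejected, while the area in the non-rejection region is P(Type II error), where we do not reject but SHOULD reject.]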
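
The three steps translate directly into a short calculation. Here is a minimal Python sketch for an upper tailed test, assuming known $\sigma$; the function name and the illustrative numbers ($\mu_0 = 90$, true mean 100, $\sigma = 25$, n = 81, $\alpha = 0.05$) are chosen for the example.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_beta(mu0, mu1, sigma, n, alpha):
    """Return (critical value, beta, power) for H0: mu = mu0 vs H1: mu > mu0."""
    se = sigma / sqrt(n)
    # Steps 1-2: critical value under the H0 distribution.
    x_crit = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Step 3: area of the non-rejection region under the distribution centred on mu1.
    beta = NormalDist(mu1, se).cdf(x_crit)
    return x_crit, beta, 1 - beta

x_crit, beta, power = upper_tail_beta(mu0=90, mu1=100, sigma=25, n=81, alpha=0.05)
print(f"critical value = {x_crit:.2f}, beta = {beta:.4f}, power = {power:.4f}")
```

Re-running it with a smaller `alpha` shows the trade-off: the critical value moves right and `beta` grows.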

Problems for you to do...

1) N.S.W. Police are testing if vehicles are exceeding the speed limit of 90km/hr on South Dowling Road. A sample of
81 vehicles yields a mean driving speed of 98km/hr. If the population of vehicle speeds is normally distributed with a
standard deviation of 25 km/hour, test the hypothesis, at the 5% level of significance, H0: $\mu = 90$; H1: $\mu > 90$. If H0 is
rejected, calculate $\beta$, the probability of Type II error, given that the true $\mu = 100$. ($\beta$ = 0.0253; power of test = 0.9747;
we reject H0)
2) Miss Rose was researching dress sizes. She had thought the mean dress size was 9. But her suspicion is that it will
be larger than that. Thus, being the relative unknown and incredible mathematician she was, Miss Rose decided to
do a hypothesis test. She found the population to be normally distributed, with standard deviation of 4. If $\alpha = 0.05$
and the sample size was 64, calculate the power of the test if the mean was actually
a. 9.5 (0.1844)
b. 10 (0.6387)
3) What will be the answer for (a) and (b) in the above example if Miss Rose only suspected that the mean size was
not 9? (0.17, 0.516)
3. T-distribution
So far, the problems we have dealt with assume that the population variance $\sigma^2$ is known
This is unrealistic; we're more likely to know the sample variance $s^2$
Note that $s^2$ is an unbiased and consistent estimator of $\sigma^2$

For large sample sizes: (n > 30)
o By the CLT, $\bar{X}$ is approximately normal regardless of the population distribution
o Standardised test statistic remains approximately normal even when replacing $\sigma$ with s
o Use Z-scores and normal distribution table

For small sample sizes: (n < 30)
o Must use the t-distribution
o Similar to the normal distribution but with fatter tails
o Our variance is determined by our degrees of freedom, $\nu$
o T-distribution is only valid if the underlying distribution is normal

[Diagram: the Z (normal) curve compared with the flatter, fatter-tailed t-distribution curve]

We have a new t-statistic:

$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$

Where $P(t > t_{\alpha,\nu}) = \alpha$ and $\nu = n-1$ (degrees of freedom)

Confidence Intervals when $\sigma^2$ is unknown
The procedure is still the same. The only difference is that we replace our z-score with a t-score, and
$\sigma$ with an s!!!

$\bar{x} \pm t_{\alpha/2,\nu}\dfrac{s}{\sqrt{n}}$

Where $\alpha$ = significance level and $1-\alpha$ is the confidence level
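
A minimal Python sketch of the t-based interval. Python's standard library has no t-table, so the sketch assumes you look up $t_{\alpha/2,\nu}$ yourself and pass it in as `t_crit`; the function name and the example numbers are illustrative.

```python
from math import sqrt

def t_confidence_interval(x_bar, s, n, t_crit):
    """Return (LCL, UCL) for the mean when sigma is unknown.

    t_crit is t_{alpha/2, n-1}, read from a t-table by the caller.
    """
    half_width = t_crit * s / sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Example: n = 8, sample mean 1500, s = 300, 99% confidence.
# From the t-table with 7 degrees of freedom, t_{0.005, 7} = 3.499.
lcl, ucl = t_confidence_interval(x_bar=1500, s=300, n=8, t_crit=3.499)
print(f"({lcl:.2f}, {ucl:.2f})")
```

Swapping in the z value 2.576 for the same confidence level gives a noticeably narrower interval, which is exactly the "fatter tails" effect at small n.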

Hypothesis Testing when $\sigma^2$ is unknown

Procedure also remains the same except we replace our z-score with a t-score and $\sigma$ with s

Using the unstandardised method, our critical values for $\bar{X}$ are:

Assuming H0: $\mu = \mu_0$
If H1: $\mu > \mu_0$ then $c = \mu_0 + t_{\alpha,\nu}(s/\sqrt{n})$. If $\bar{X} > c$, reject H0, otherwise we do not reject
If H1: $\mu < \mu_0$ then $c = \mu_0 - t_{\alpha,\nu}(s/\sqrt{n})$. If $\bar{X} < c$, reject H0, otherwise we do not reject
If H1: $\mu \neq \mu_0$ then $c = \mu_0 \pm t_{\alpha/2,\nu}(s/\sqrt{n})$. If $\bar{X}$ < lower c or $\bar{X}$ > upper c, reject H0, otherwise we do not reject
Or using the standardised method, our critical values for t are:
$t_{\alpha,\nu}$ (one-tailed test)
or
$t_{\alpha/2,\nu}$ (two-tailed test)
where $\alpha$ = level of significance
$\nu = n-1$ (degrees of freedom)
Assuming H0: $\mu = \mu_0$
o If H1: $\mu > \mu_0$, reject H0 if $t > t_{\alpha,\nu}$
o If H1: $\mu < \mu_0$, reject H0 if $t < -t_{\alpha,\nu}$
o If H1: $\mu \neq \mu_0$, reject H0 if $t > t_{\alpha/2,\nu}$ or $t < -t_{\alpha/2,\nu}$
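
The standardised decision rules can be sketched in Python as below. As before, the critical value is assumed to come from a t-table and is passed in by the caller; the function name, the `tail` argument and the example numbers are illustrative choices.

```python
from math import sqrt

def t_test(x_bar, mu0, s, n, t_crit, tail):
    """Return (t statistic, reject?) using the standardised method.

    tail is "upper", "lower" or "two"; t_crit is t_{alpha, n-1} for a
    one-tailed test or t_{alpha/2, n-1} for a two-tailed test.
    """
    t = (x_bar - mu0) / (s / sqrt(n))
    if tail == "upper":
        return t, t > t_crit
    if tail == "lower":
        return t, t < -t_crit
    return t, abs(t) > t_crit

# Example: H0: mu = 60 vs H1: mu < 60, n = 16, sample mean 56, s = 5.25,
# alpha = 0.05, so t_{0.05, 15} = 1.753 from the t-table.
t, reject = t_test(x_bar=56, mu0=60, s=5.25, n=16, t_crit=1.753, tail="lower")
print(f"t = {t:.3f}, reject H0: {reject}")
```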

Problems for you to do

4) Sheldon owns a farm and needs to know the number of strawberries that can be picked on weekday mornings. On
a sample of 8 Monday mornings, the number of strawberries picked between 7am and 9am is counted. Assume
the population is normal. Construct a 99% confidence interval for the population mean if the sample mean is 1500
and the sample standard deviation is 300. (1128.88, 1871.12)
5) Petit Restaurant claims to sell at least 60 cakes per day. Assume Petit's sales are approximately normally
distributed. To test Petit's claims, 16 days are selected at random and tested. The sample yields a mean of 56 and a
sample standard deviation of 5.25. Perform the test of Petit's claims against a suitable alternative, assuming $\alpha$ =
0.05. (Reject null)
6) A car rental company is interested in the amount of time its vehicles are out of operation for repair work. A
random sample of 12 cars showed that, over the past year, the numbers of days each had been inoperative were as
follows: 15, 11, 19, 24, 6, 18, 20, 15, 18, 12, 14, 19. Given that the population is normally distributed, find a 99%
confidence interval for the actual mean. (11.618, 20.216)
7) In a study to determine the capability of the BES PASS leaders, Judith, the PASS coordinator, has to measure the
mean exam mark of the students who attend a leader's class. She takes a random sample of 9 students and their exam marks. The
sample mean and standard deviation were 80 percent and 4 marks respectively. Assuming that the marks are
normally distributed, calculate a 95% confidence interval for the true mean exam mark.

4. Sampling Distribution of a Sample Proportion


Recall the binomial distribution where X is the number of successes for a fixed number of trials
n = no. of trials
p = probability of successes
q = (1-p) = probability of failure
E(X) = np and Var(X) = npq
Similar to the distribution of a sample mean, if we take many samples of size n and calculate the sample
proportion of success ($\hat{p}$) for each of them, you will find:

$E(\hat{p}) = p$
$Var(\hat{p}) = pq/n$

What is the distribution of $\hat{p}$?

By the CLT, for large sample sizes X is approximately normal, therefore the sample proportion is also
approximately normal.

$\hat{p} \sim N\left(p, \dfrac{pq}{n}\right)$

and therefore our z-score is

$Z = \dfrac{\hat{p} - p}{\sqrt{pq/n}}$

Our confidence interval for the Population Proportion (p) is

$\hat{p} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}\hat{q}}{n}}$

Hypothesis Testing for the Population Proportion (p)

Assuming H0: $p = p_0$ our critical values are:
o If H1: $p < p_0$ then $p^* = p_0 - z_{\alpha}\sqrt{p_0 q_0/n}$
o If H1: $p > p_0$ then $p^* = p_0 + z_{\alpha}\sqrt{p_0 q_0/n}$
o If H1: $p \neq p_0$ then $p^* = p_0 \pm z_{\alpha/2}\sqrt{p_0 q_0/n}$

Or using the standardised test-statistic:
o If H1: $p < p_0$, reject H0 if $Z < -z_{\alpha}$
o If H1: $p > p_0$, reject H0 if $Z > z_{\alpha}$
o If H1: $p \neq p_0$, reject H0 if $Z < -z_{\alpha/2}$ or $Z > z_{\alpha/2}$
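
A minimal Python sketch of the proportion formulas, using the normal approximation throughout. The function names and the example numbers (4 successes in 10 trials, $p_0 = 0.6$) are illustrative; with n this small the approximation is rough, which is worth keeping in mind when reading the output.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z(successes, n, p0):
    """Standardised test statistic, using p0 (the H0 value) in the standard error."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

def proportion_ci(successes, n, confidence):
    """Confidence interval for p, using p_hat in the standard error."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

z = proportion_z(successes=4, n=10, p0=0.6)
z_crit = NormalDist().inv_cdf(0.95)          # z_alpha for a one-tail test at 5%
print(f"Z = {z:.3f}, reject H0 (H1: p < p0): {z < -z_crit}")
print("95% CI:", tuple(round(v, 3) for v in proportion_ci(4, 10, 0.95)))
```

Note the design choice: the test uses $p_0 q_0$ because it assumes H0 is true, while the interval uses $\hat{p}\hat{q}$ because no hypothesized value is available.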

Problems for you to do...again...last one!

8) The proportion of families buying milk from Company A in a certain city is p = 0.6.
A random sample of 10 families shows that 4 buy milk from Company A.
a) Conduct a hypothesis test with a null H0: p = 0.6 against the alternative H1: p < 0.6. Find the critical values using
both unstandardised and standardised methods at the 5% significance level. (Do not reject null)
b) Construct a 95% confidence interval for p. Does this interval include 0.6? [0.096, 0.704]
If we reject the null when 3 or fewer families buy milk from Company A:
c) Find the probability of committing a Type I error. (0.055)
d) If the true proportion of families buying milk from Company A is p = 0.5, what is the probability of committing a
Type II error based on the above decision rule? (0.828)
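
For a decision rule stated as a count ("reject when 3 or fewer"), the error probabilities come straight from the binomial distribution rather than the normal approximation. A minimal Python sketch, with `binom_cdf` as a hypothetical helper:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

n, cutoff = 10, 3                          # rule: reject H0 when X <= 3

# Type I error: we reject although H0 (p = 0.6) is true.
alpha = binom_cdf(cutoff, n, 0.6)

# Type II error: we fail to reject (X >= 4) although the true p is 0.5.
beta = 1 - binom_cdf(cutoff, n, 0.5)

print(f"P(Type I error)  = {alpha:.3f}")
print(f"P(Type II error) = {beta:.3f}")
```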

Sheldon & Juliet

Wednesday 10-11am

OMB229

BES PASS Week 11 2010


Topics to be covered:
1. Simple linear regression
2. Assumptions of the regression model
3. Methods of assessing and analysing the model
1. SIMPLE LINEAR REGRESSION: AN INTRODUCTION
Introducing regression
Regression analysis is used to predict the value of one variable on the basis of other variables.
The technique involves developing a mathematical equation/model that describes the relationship
between the dependent variable (Y) and the independent variables (X1, X2, X3, ..., Xk), where k = the
number of independent variables.
Regardless of why regression analysis is performed, we begin by developing this mathematical
equation/model that describes the relationship between the dependent variable and independent
variables.
The Simple Linear Regression Model (aka. the First-Order Linear Model)
THE SIMPLE LINEAR REGRESSION MODEL
$y = \beta_0 + \beta_1 x + \varepsilon$

In order to investigate the relationship between x and y, we need to estimate the value of the
coefficients $\beta_0$ and $\beta_1$ using the least squares method, with whom you had a friendly encounter in
Week 2.

Least Squares Method

Why is it called the least squares method? Recall that when we draw a line through a set of sample
data, we aim for the best line: the line of best fit. In particular, this line is the one which is closest to
the sample data points; the line that minimizes the sum of the squared differences between the points
and the line.
LEAST SQUARES LINE COEFFICIENTS

$b_1 = \dfrac{s_{xy}}{s_x^2}$

$b_0 = \bar{y} - b_1\bar{x}$

Class Example

The annual bonuses (millions) of 6 football players from Chelsea FC [the 2010 Premier League (clearly
dominating Man Utd) AND FA Cup Champions] with different years of experience are recorded as
follows. The manager, Carlo Ancelotti, has hired you as his private statistician to determine the
relationship between annual bonus and years of experience.
Years of experience (x)    1    2    3    4    5    6
Annual Bonus (y)           6    1    9    5   17   12

Frank Lampard has already performed some initial calculations for you:
SOME HELPFUL DATA FOR THIS QUESTION

$\sum_{i=1}^{n} x_i = 21$
$\sum_{i=1}^{n} y_i = 50$
$\sum_{i=1}^{n} x_i y_i = 212$
$\sum_{i=1}^{n} x_i^2 = 91$

WHAT YOU NEED TO CALCULATE

$s_{xy}$ =

$s_x^2$ =

$b_1$ =

$\bar{x}$ =

$\bar{y}$ =

$b_0$ =

Finally, the least squares simple regression line is:

$\hat{y}$ =

HOW TO INTERPRET THE SIMPLE REGRESSION LINE


Advise Carlo on the relationship between bonuses and years of experience at Chelsea FC.
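
Once you have worked the coefficients out by hand, you can check them with a minimal Python sketch of the least squares formulas. The function name is made up for illustration; the data are the six players from the class example.

```python
def least_squares(xs, ys):
    """Return (b0, b1) using b1 = s_xy / s_x^2 and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = s_xy / s_x2
    b0 = y_bar - b1 * x_bar
    return b0, b1

years = [1, 2, 3, 4, 5, 6]        # years of experience (x)
bonus = [6, 1, 9, 5, 17, 12]      # annual bonus in millions (y)

b0, b1 = least_squares(years, bonus)
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```

The slope `b1` is the estimated change in annual bonus (in millions) for each additional year of experience, which is the number Carlo is really asking about.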

2. ASSUMPTIONS OF THE REGRESSION MODEL


THE 7 ASSUMPTIONS OF THE REGRESSION MODEL
1.
2.
3.
4.
5.
6.
7.

3. METHODS OF ASSESSING AND ANALYSING THE MODEL


Introduction
Having established the required conditions for our assessment methods to be valid in the previous
section, we can now look at the methods to assess our regression model.
However, we need to look at the concept of the sum of squares for error, which forms the foundation
for all these methods.
Sum of squares for error
Recall that the least squares method determines the coefficients that minimize the sum of squared
deviations between the points and the line defined by the coefficients, aka. the sum of squares for
error (SSE).
SHORT-CUT FORMULA FOR SSE

$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = (n-1)\left(s_y^2 - \dfrac{s_{xy}^2}{s_x^2}\right)$

Method 1: Standard Error of the Estimate (SEE)


STANDARD ERROR OF ESTIMATE

$s_\varepsilon = \sqrt{\dfrac{SSE}{n-2}}$

QUESTION
1. Calculate the standard error of estimate for Chelsea FC. (4.503)
2. Interpret what it tells you about the model's fit.

Method 2: The t-test of the slope (a hypothesis test)


In this method of assessing the regression model, we look in particular at the slope of the simple
regression line and run a hypothesis test on it. Steps are below.

Step 1: Set up hypothesis test

H0: $\beta_1 = 0$ (ALWAYS)
H1: $\beta_1 \neq 0$, $\beta_1 > 0$ or $\beta_1 < 0$

Step 2: Find rejection region

If our test statistic falls within the rejection region,
we can conclude that the variables are linearly
related.
If $\beta_1 > 0$, then the variables are positively
related.
If $\beta_1 < 0$, they are inversely related.
Since $\beta_1$ is our X coefficient, this means that a
one unit change in X is associated with a $\beta_1$ change in Y, on average.

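Critical Values & Decision Rule (reject H0 if)

H1: $\beta_1 > 0$: $t > t_{\alpha,\,n-2}$
H1: $\beta_1 < 0$: $t < -t_{\alpha,\,n-2}$
H1: $\beta_1 \neq 0$: $|t| > t_{\alpha/2,\,n-2}$

Step 3: Calculate test statistic

$t = \dfrac{b_1 - \beta_1}{s_{b_1}}$, where $s_{b_1} = \dfrac{s_\varepsilon}{\sqrt{(n-1)s_x^2}}$

Step 4: Conclusion
If we don't reject H0, we conclude that there is not enough evidence
to infer that y is linearly related to x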

QUESTION
1. Perform a hypothesis t-test of the slope for Chelsea FC at 5% significance.
(t-stat = 1.964, do not reject null).
2. Interpret what it tells you about the model's fit.
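
The standard error of estimate and the slope t-statistic can be computed from the raw data with a minimal Python sketch. The function name is illustrative, and the critical value still has to come from a t-table with n - 2 degrees of freedom.

```python
from math import sqrt

def slope_t_statistic(xs, ys):
    """Return (s_epsilon, s_b1, t) for testing H0: beta_1 = 0."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = s_xy / s_x2
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))   # sum of squared residuals
    s_eps = sqrt(sse / (n - 2))                                   # standard error of estimate
    s_b1 = s_eps / sqrt((n - 1) * s_x2)                           # standard error of the slope
    return s_eps, s_b1, (b1 - 0) / s_b1

years = [1, 2, 3, 4, 5, 6]
bonus = [6, 1, 9, 5, 17, 12]
s_eps, s_b1, t = slope_t_statistic(years, bonus)
print(f"s_epsilon = {s_eps:.3f}, s_b1 = {s_b1:.3f}, t = {t:.3f}")
# Compare |t| with t_{0.025, 4} = 2.776 from the t-table for a two-tail test at 5%.
```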

Method 3: Coefficient of Determination


The coefficient of determination, $R^2$, allows us to determine the strength of a linear relationship.
$R^2$ = the proportion of variation in the dependent variable that is explained by variation in the
independent variable.
To fully understand this, we will need to break down the total variation in y, as follows:
COEFFICIENT OF DETERMINATION

$R^2 = \dfrac{s_{xy}^2}{s_x^2 s_y^2} = 1 - \dfrac{SSE}{\sum (y_i - \bar{y})^2} = \dfrac{\sum (y_i - \bar{y})^2 - SSE}{\sum (y_i - \bar{y})^2} = \dfrac{\text{EXPLAINED VARIATION}}{\text{VARIATION IN Y}}$

QUESTION
1. Calculate the coefficient of determination for Chelsea FC. (0.491)
2. Interpret what this tells you about the regression model.
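
Both forms of the $R^2$ formula should give the same number, which makes a handy self-check. A minimal Python sketch, with an illustrative function name, using the Chelsea FC data:

```python
def r_squared_two_ways(xs, ys):
    """Return R^2 computed as s_xy^2 / (s_x^2 s_y^2) and as 1 - SSE / total variation."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    total = sum((y - y_bar) ** 2 for y in ys)        # variation in y
    s_y2 = total / (n - 1)
    b1 = s_xy / s_x2
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return s_xy ** 2 / (s_x2 * s_y2), 1 - sse / total

years = [1, 2, 3, 4, 5, 6]
bonus = [6, 1, 9, 5, 17, 12]
from_covariance, from_sse = r_squared_two_ways(years, bonus)
print(f"R^2 = {from_covariance:.3f} (covariance form), {from_sse:.3f} (SSE form)")
```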

BES PASS Week 12


Aim:

Learn about the prediction in linear regression


Learn about the multiple regression model

Prediction in linear regression


We can use our model to forecast or estimate values of Y (dependent variable)
Point Prediction
o Use the fitted regression line to predict a value of Y for a given level of X
i.e. $\hat{y} = b_0 + b_1 x$
o NB. This prediction is less accurate if the value of X falls outside the range of X values used to fit the OLS line
o This point estimate does not provide any information on how close our predicted value is
to our true value.
Thus, we use:

Interval Prediction

Prediction Interval
Formula: $\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$
This prediction interval is used to predict a one-time occurrence
for a particular value of the dependent variable

Confidence Interval Estimator of the Expected Value of Y
Formula: $\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$
This is the confidence interval used to predict the mean of y or the
long-run average of y.

Here $x_g$ is the given value of x.

Why is there a missing 1 under the square root for the confidence interval estimator?
Ans. There is less error in estimating a mean value as opposed to predicting an individual value
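
The difference between the two intervals shows up clearly when both are computed from the same summary statistics. Below is a minimal Python sketch; the function name is illustrative, the summary statistics in the usage lines are hypothetical round numbers (not from any exercise), and the t critical value is assumed to come from a t-table.

```python
from math import sqrt

def interval_half_widths(x_g, n, x_bar, s_x2, y_bar, s_y2, s_xy, t_crit):
    """Return (y_hat, prediction half-width, confidence half-width) at x = x_g."""
    b1 = s_xy / s_x2
    b0 = y_bar - b1 * x_bar
    y_hat = b0 + b1 * x_g
    sse = (n - 1) * (s_y2 - s_xy ** 2 / s_x2)          # short-cut formula for SSE
    s_eps = sqrt(sse / (n - 2))
    spread = 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * s_x2)
    prediction = t_crit * s_eps * sqrt(1 + spread)     # individual value of y
    confidence = t_crit * s_eps * sqrt(spread)         # mean of y
    return y_hat, prediction, confidence

# Hypothetical summary statistics; t_{0.025, 8} = 2.306 from the t-table.
for x_g in (5, 7, 9):
    y_hat, pi, ci = interval_half_widths(x_g, n=10, x_bar=5, s_x2=4, y_bar=20,
                                         s_y2=30, s_xy=8, t_crit=2.306)
    print(f"x = {x_g}: y-hat = {y_hat:.1f}, prediction +/- {pi:.2f}, confidence +/- {ci:.2f}")
```

Both intervals widen as $x_g$ moves away from $\bar{x}$, and the prediction interval is always the wider of the two because of the extra 1 under the square root.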

Class Exercise
In television's early years, most commercials were 60 seconds long. Now, however, commercials can be any length.
The objective of commercials remains the same: to have as many viewers as possible remember the product in a
favorable way. A total of 60 participants were shown advertisements of varying length and each was given a test
score based on what they would remember. Using the data set (Keller, 16.06)
a) Determine the least squares line of test scores on the length of the advertisement.
b) Interpret the coefficients and their significance. Comment on the overall fit of the model.
c) Predict with 95% confidence the memory test score of a viewer who watches a 36 second commercial.
d) Estimate with 95% confidence the mean memory test score of people who watch 36 second commercials.
Also,

$\bar{x}$ = 38
$s_x^2$ = 193.90
$\bar{y}$ = 13.80
$s_y^2$ = 47.96
$s_{xy}$ = 57.86
n = 60

Multiple Regression
Recall the assumptions of a classical linear regression model
Problem? Only measured the effect of ONE variable on the model
All the other factors were omitted and included in the error term ($\varepsilon$)
This can cause confounding and omitted variable bias.
Bias occurs when both of the following hold:
Omitted variable is correlated with the explanatory variable (or another independent variable)
Omitted variable is a determinant of the dependent variable
This violates the zero conditional mean (ZCM) assumption and therefore OLS estimates are no longer unbiased.
Our new population regression model is:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$
Interpretation of $\beta_1$
Measures the effect of a change in X1 holding X2, X3, ... , Xk constant
Also known as the partial effect of X1 holding all other explanatory variables constant
What happens if the variables X2, X3, ... , Xk are omitted and these variables are correlated with X1?
Omitted variables will appear in the disturbance/error term
ZCM assumption will be violated (error term now correlated with independent variable)
Produces a biased estimator of $\beta_1$ (it will also include the effect of the other variables on Y)
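
A small simulation can make the bias visible. This is only a sketch: the coefficients, the 0.8 link between X1 and X2, and the sample size are all made-up values. When X2 is left out, the simple regression slope settles near $\beta_1 + \beta_2 \cdot Cov(X_1, X_2)/Var(X_1)$ rather than $\beta_1$.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 20000
beta0, beta1, beta2 = 1.0, 2.0, 3.0  # hypothetical true coefficients

x1 = [random.gauss(0, 1) for _ in range(n)]
# X2 is correlated with X1: X2 = 0.8 * X1 + noise.
x2 = [0.8 * a + random.gauss(0, 1) for a in x1]
y = [beta0 + beta1 * a + beta2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]


def simple_slope(x, y):
    """OLS slope of y on x alone: sum of cross-products over sum of squares."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


b1_short = simple_slope(x1, y)   # regression with X2 omitted
biased_target = beta1 + beta2 * 0.8  # beta1 plus the effect picked up from X2
print(f"slope with X2 omitted: {b1_short:.3f}")
print(f"true beta1: {beta1}, value the short regression converges to: {biased_target}")
```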


Essentially, the process of multiple regression remains the same as simple linear regression
Minimising SSE $= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ gives us $\hat{y} = b_0 + b_1 x_1 + \dots + b_k x_k$

Additional Assumption (to the 7 you already have)


No perfect (multi)collinearity or exact linear relationships between the explanatory variables
This is particularly important for dummy variables
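
To see why this matters for dummy variables, here is a minimal sketch using firm-size bands like those in the class exercise later in this handout. The five firm sizes are hypothetical. If all four dummies are included alongside an intercept, they add up to 1 for every observation, which is an exact linear relationship with the intercept column; dropping one category as the base group removes it.

```python
# Hypothetical firm sizes (number of lawyers) for five lawyers.
firm_sizes = [4, 20, 75, 300, 11]


def size_dummies(lawyers):
    """Return the four firm-size dummies as [D1, D2, D3, D4]."""
    return [
        1 if 1 <= lawyers <= 10 else 0,
        1 if 11 <= lawyers <= 50 else 0,
        1 if 51 <= lawyers <= 200 else 0,
        1 if lawyers >= 201 else 0,
    ]


all_four = [size_dummies(s) for s in firm_sizes]

# With all four dummies, every row sums to 1: identical to the intercept column.
row_sums = [sum(row) for row in all_four]

# Dropping D1 (the base group) means the remaining dummies no longer sum to 1 everywhere.
without_base = [row[1:] for row in all_four]
reduced_sums = [sum(row) for row in without_base]

print("sum of all four dummies per row:", row_sums)
print("sum after dropping the base group:", reduced_sums)
```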

Assessing the model


Standard Error of Estimate
$$s_\varepsilon = \sqrt{\frac{SSE}{n-k-1}}$$
where k is the number of explanatory variables

Hypothesis Testing
$$t = \frac{b_i - \beta_i}{s_{b_i}}$$
where v = n-k-1

Coefficient of Determination: Adjusted R2

Excel prints an additional R2 statistic, called the coefficient of determination adjusted for
degrees of freedom.
This is because adding an extra explanatory variable will never cause R2 to fall, even if that variable has no explanatory power.

$$\text{Adjusted } R^2 = 1 - \frac{SSE/(n-k-1)}{\sum (y_i - \bar{y})^2/(n-1)}$$

o If n is much larger than k, unadjusted and adjusted R2 will be similar
o If SSE is quite different from 0 and k is large relative to n, the values of unadjusted and
adjusted R2 will differ substantially
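
The two cases can be checked numerically. The sketch below uses made-up sums of squares and assumes k counts the explanatory variables (so the error degrees of freedom are n - k - 1).

```python
def r_squared(sse, sst):
    """Unadjusted coefficient of determination."""
    return 1 - sse / sst


def adjusted_r_squared(sse, sst, n, k):
    """Adjusted R2; k = number of explanatory variables, n = sample size."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


# Hypothetical sums of squares, chosen only for illustration.
sse, sst = 40.0, 100.0

unadjusted = r_squared(sse, sst)
large_n = adjusted_r_squared(sse, sst, n=1000, k=3)  # n much larger than k
small_n = adjusted_r_squared(sse, sst, n=10, k=6)    # k large relative to n

print(f"unadjusted R2: {unadjusted:.4f}")
print(f"adjusted R2 with n=1000, k=3: {large_n:.4f}")
print(f"adjusted R2 with n=10, k=6: {small_n:.4f}")
```

With a large sample the two measures are nearly identical, while with few observations per variable the adjusted value drops sharply and can even become negative.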

Class Exercise: (Adapted from QMB Final Exam S2 2007)


The Human Rights and Equal Opportunity Commission has asked you for further analysis on gender discrimination in
law firms (see Question 2), including an examination of the difference in income. In order to provide some
evidence on whether differences in the incomes of male and female lawyers are due to discrimination or some other
factors, a regression model is constructed based on human capital theory. The model is specified as follows:
$\text{HRINCOME}_i = \beta_0 + \beta_1 \text{EXP}_i + \beta_2 \text{FIRMSIZE2}_i + \beta_3 \text{FIRMSIZE3}_i + \beta_4 \text{FIRMSIZE4}_i + \beta_5 \text{PARTNER}_i + \beta_6 \text{FEMALE}_i + U_i$
HRINCOME = Hourly income from legal practice in dollars (i.e. total income divided by hours worked)
EXP = Experience measured as the number of years working as a lawyer
FIRMSIZE1 = Dummy variable equal to 1 if firm has between 1 - 10 lawyers inclusive, and 0 otherwise
FIRMSIZE2 = Dummy variable equal to 1 if firm has between 11 - 50 lawyers inclusive, and 0 otherwise
FIRMSIZE3 = Dummy variable equal to 1 if firm has between 51 - 200 lawyers inclusive, and 0 otherwise
FIRMSIZE4 = Dummy variable equal to 1 if firm has equal to or greater than 201 lawyers, and 0 otherwise
PARTNER = Dummy variable equal to 1 if lawyer is a partner, and 0 otherwise
FEMALE = Dummy variable equal to 1 if lawyer is female, and 0 otherwise.


The regression was estimated by Ordinary Least Squares and a portion of the EXCEL output is reproduced below in
Table 3:

a) The sample mean of HRINCOME for males is $59 and for females is $34. Why is this difference not necessarily
evidence of gender discrimination? [2 marks]
b) Use the regression output to conclude whether there is evidence of gender discrimination in hourly incomes.
Justify your answer. [3 marks]
c) Interpret the estimate for the EXP variable in terms of both economic and statistical significance. Is it consistent
with your expectations? Discuss. [3 marks]
d) Test the null hypothesis that $\beta_5$ is equal to zero against the alternative that it is greater than zero. Use a 1%
significance level. [1 mark]
e) What are the "Standard Error" and "R Square" statistics reported amongst the "Regression Statistics" in the EXCEL
output? Interpret the R Square result for this regression model. [3 marks]
f) Calculate the predicted hourly income for a male lawyer with 10 years' experience who works in a firm with 20
lawyers but who is not a partner. [1 mark]
Distributions thus far:
o Binomial Distribution (Week 6)
o Uniform Distribution (Week 6)
o Normal Distribution (Week 7)
o Distribution of the Sample Mean (Week 8)
o T-Distribution (Week 10)
o Distribution of the Sample Proportion (Week 10)
Next week....(Our last week!)
Chi-Squared Distribution
Revision on whatever we decided today...Confidence Interval, Hypothesis Testing?


BES PASS Week 13 2010


Topics to be covered:
1. Chi-squared distribution
2. Inferences about population variance
3. Chi-squared Goodness-of-Fit Test
4. Chi-squared Test of a Contingency Table
1. CHI-SQUARED DISTRIBUTION
WARM-UP QUESTION
1. What does the chi-squared distribution look like?
2. What is the effect of increasing v (the degrees of freedom)?

IMPORTANT STATS FOR THE CHI-SQUARED DISTRIBUTION

Chi-squared random variable: $\chi^2$

Mean: $E(\chi^2) = v$
Variance: $V(\chi^2) = 2v$

HOW TO DETERMINE CHI-SQUARED VALUES

$\chi^2 > 0$ (always)
the value with area A to its right is $\chi^2_{A,v}$
the value with area A to its left is $\chi^2_{1-A,v}$
use the table of values at the back of your yellow booklet
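
The mean and variance above can be checked with a quick simulation. This sketch relies on the fact that a chi-squared variable with v degrees of freedom is the sum of v squared standard normal variables; the choice of v = 5 and the number of draws are arbitrary.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
v = 5           # degrees of freedom (illustrative choice)
draws = 20000   # number of simulated chi-squared values

# Each chi-squared draw is a sum of v squared standard normal draws.
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(v)) for _ in range(draws)]

mean = sum(samples) / draws
variance = sum((s - mean) ** 2 for s in samples) / (draws - 1)

print(f"simulated mean {mean:.2f} (theory: v = {v})")
print(f"simulated variance {variance:.2f} (theory: 2v = {2 * v})")
```

Rerunning with a larger v shows the effect asked about in the warm-up: the distribution shifts to the right and spreads out.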

2. INFERENCES ABOUT A POPULATION VARIANCE


There are 2 ways of drawing inferences about a population variance:
1) Confidence Interval Estimator of $\sigma^2$
2) Run a hypothesis test on $\sigma^2$

1) Confidence Interval Estimator of $\sigma^2$

lower confidence limit (LCL) $= \frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}}$

upper confidence limit (UCL) $= \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}$

2) Running a hypothesis test on $\sigma^2$

Step 1: Define Hypothesis Test
H0: $\sigma^2 = 1$
H1: $\sigma^2 \neq, >, < 1$

Step 2: Establish Rejection Region (with $v = n - 1$)
If H1: $\sigma^2 > 1$, RR: $\chi^2 > \chi^2_{\alpha,v}$
If H1: $\sigma^2 < 1$, RR: $\chi^2 < \chi^2_{1-\alpha,v}$
If H1: $\sigma^2 \neq 1$, RR: $\chi^2 > \chi^2_{\alpha/2,v}$ or $\chi^2 < \chi^2_{1-\alpha/2,v}$

Step 3: State Decision Rule
If our test statistic falls within the rejection region,
there is sufficient evidence to suggest that the
population variance $\neq 1$ (or $> 1$, or $< 1$, matching H1)

Step 4: Calculate Test Statistic
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$
where $\sigma^2$ is the value specified in H0

Step 5: Conclusion
Do we have enough evidence to reject H0, that the population variance = 1?
PRACTICE QUESTIONS
1. The sample variance of a random sample of 50 observations from a normal population was found
to be $s^2 = 80$. Can we infer at the 1% significance level that $\sigma^2$ is less than 100? (No)

2. Estimate $\sigma^2$ with 90% confidence given that n = 15 and $s^2 = 12$. (7.0932, 25.5684)
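
Both practice questions can be checked with a few lines of arithmetic. The sketch below takes the chi-squared critical values as inputs because they come from a table; the specific table values used are assumptions noted in the comments, and for question 1 the v = 50 row is used as a stand-in for v = 49.

```python
def variance_interval(n, s2, chi2_upper_tail, chi2_lower_tail):
    """Confidence interval for the population variance from two table values."""
    lcl = (n - 1) * s2 / chi2_upper_tail
    ucl = (n - 1) * s2 / chi2_lower_tail
    return lcl, ucl


def variance_test_statistic(n, s2, sigma2_null):
    """Chi-squared test statistic (n - 1) * s^2 / sigma^2 under H0."""
    return (n - 1) * s2 / sigma2_null


# Question 2: n = 15, s^2 = 12, 90% confidence, v = 14.
# Assumed table values: chi2(0.05, 14) = 23.6848 and chi2(0.95, 14) = 6.5706.
lcl, ucl = variance_interval(15, 12, 23.6848, 6.5706)

# Question 1: H1 is sigma^2 < 100, so H0 is rejected only if the statistic
# falls below chi2(0.99, v). Assumed table value: chi2(0.99, 50) = 29.7067,
# used as an approximation because many tables skip v = 49.
stat = variance_test_statistic(50, 80, 100)
reject_h0 = stat < 29.7067

print(f"Q2 interval: ({lcl:.4f}, {ucl:.4f})")
print(f"Q1 test statistic: {stat:.1f}, reject H0: {reject_h0}")
```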

3. CHI-SQUARED GOODNESS OF FIT TEST


The purpose of a Chi-squared goodness of fit test is to examine whether observed and expected frequencies
are the same in a multinomial experiment. But first, let's check out some stats for multinomial experiments.
PROPERTIES OF MULTINOMIAL EXPERIMENTS

Fixed number of trials (n)


Outcome of each trial falls into one of k categories (cells)
p1 + p2 + p3 + ... + pk = 1
Each trial is independent.


FREQUENCY

Frequency = the number of outcomes falling into each of the k cells/categories.


It is denoted by f1, f2, f3, ..., fk (where fi = the observed frequency of outcomes falling into cell i)
f1 + f2 + f3 + ... + fk = n

How to run a Chi-squared Goodness-of-Fit test (another flow-chart by Shel)

Step 1: Check the Rule of Five
For each cell, $e_i \geq 5$, where $e_i = np_i$

Step 2: Define hypothesis test
H0: p1 = ..., p2 = ..., p3 = ..., etc.
H1: at least one of the $p_i \neq$ its specified value

Step 3: Critical Value, Rejection Region
Rejection region: $\chi^2 > \chi^2_{\alpha,k-1}$

Step 4: Decision Rule
If our test statistic falls within the rejection
region, there is sufficient evidence to suggest
that at least one observed frequency of the multinomial
variable differs from its expected value

Step 5: Calculate Test Statistic
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$

Step 6: Conclusion
Do we have enough evidence to reject H0 and conclude that at
least one of the $p_i \neq$ its specified value?

PRACTICE QUESTION
3. We would like to make inferences about the market shares of Dell, HP, Apple, and the rest at the
5% significance level. In a random sample of 200 computers, we find that 48 are Dell, 42 are HP, 12
are Apple and 98 are the rest.
Test the hypothesis that:
H0: p1=0.2, p2=0.2, p3=0.1, p4=0.5
H1: At least one pi is not equal to its specified value

(Answer: Don't reject H0 at 5% significance level)
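
The goodness-of-fit steps can be sketched for this question as follows. The critical value is an assumed table lookup for 3 degrees of freedom at the 5% level, and "Other" is just a label for "the rest".

```python
observed = {"Dell": 48, "HP": 42, "Apple": 12, "Other": 98}
null_shares = {"Dell": 0.2, "HP": 0.2, "Apple": 0.1, "Other": 0.5}

n = sum(observed.values())
expected = {brand: n * p for brand, p in null_shares.items()}

# Rule of five: every expected frequency should be at least 5.
rule_of_five_ok = all(e >= 5 for e in expected.values())

# Test statistic: sum of (f_i - e_i)^2 / e_i over the k cells.
chi2_stat = sum((observed[b] - expected[b]) ** 2 / expected[b] for b in observed)

# Assumed table value: chi2(0.05, k - 1 = 3) = 7.8147.
CRITICAL = 7.8147
reject_h0 = chi2_stat > CRITICAL

print(f"rule of five satisfied: {rule_of_five_ok}")
print(f"test statistic: {chi2_stat:.2f}, reject H0: {reject_h0}")
```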


4. CHI-SQUARED TEST OF A CONTINGENCY TABLE


The purpose of running a Chi-squared test of a contingency table is to determine whether there's enough
evidence to infer that
a) 2 nominal variables are related
b) differences exist between 2 or more populations of nominal variables
How to run a Chi-squared test of a Contingency Table (the final flow-chart Shel will ever make for you...)

Step 1: Define hypothesis test
H0: variables are independent
H1: variables are dependent.

Step 2: Rejection region, critical value
Rejection region: $\chi^2 > \chi^2_{\alpha,v}$
where v = (r - 1)(c - 1), r = number of rows and c = number of columns

Step 3: Decision rule
If our test statistic falls within the rejection
region, there is sufficient evidence to suggest
the variables are dependent.

Step 4: Calculate test statistic
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$
where the sum runs over all k cells of the table and
$$e_{ij} = \frac{(\text{total of row } i)(\text{total of column } j)}{\text{sample size}}$$

Step 5: Conclusion
Do we have enough evidence to reject H0 and conclude that
the variables are dependent?

PRACTICE QUESTION
4. Test the hypothesis that income and education are independent at the 1% significance level.

Education/Income   < $50k   $50k - $100k   > $100k   TOTAL
Secondary            40          30           12       82
Tertiary             30          40           20       90
Doctorate             1          12           15       28
TOTAL                71          82           47      200

(Answer: reject H0, i.e. the variables are not independent.)
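
A minimal sketch of the contingency-table calculation for this question, reading the table with education levels as rows and income bands as columns. The critical value is an assumed table lookup for 4 degrees of freedom at the 1% level.

```python
# Rows: Secondary, Tertiary, Doctorate; columns: < $50k, $50k - $100k, > $100k.
observed = [
    [40, 30, 12],
    [30, 40, 20],
    [1, 12, 15],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# e_ij = (total of row i) * (total of column j) / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Test statistic: sum of (f - e)^2 / e over every cell of the table.
chi2_stat = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Assumed table value: chi2(0.01, 4) = 13.2767.
CRITICAL = 13.2767
reject_h0 = chi2_stat > CRITICAL

print(f"df = {df}, test statistic = {chi2_stat:.2f}, reject H0: {reject_h0}")
```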
