Asshar
edon.
.
.
Wednesday 2-3pm
b) Skewness
o Positive skew
   Tail to the right
   Mode < Median < Mean
o Negative skew
   Tail to the left
   Mean < Median < Mode
c) Modal classes
o The modal class is the class with the largest number of observations.
o The 3 descriptions which could come in handy in describing your histogram are:
i) unimodal histogram = a histogram with one peak.
ii) bimodal histogram = a histogram with two peaks.
iii) bell-shaped histogram = a symmetric unimodal histogram.
Apart from the histogram, you should also be familiar with stem and leaf displays and
ogives.
We can also measure interval data with respect to variability, i.e. the spread of the data:
o i) range = largest observation − smallest observation
o ii) variance:
population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
o iii) standard deviation: simply take the square root of the variance.
Also recall the Empirical Rule and Chebysheff's Theorem when required to
interpret the standard deviation of your data.
o iv) coefficient of variation = standard deviation / mean
$L_P = (n + 1) \frac{P}{100}$
where $L_P$ is the location of the $P$th percentile in the ordered data.
QUESTION BANK
1. Using these numbers: 2 3 3 6 8 9 14 16 17 20, find the:
a. mean
b. median (8.5)
c. mode
d. lower quartile (3)
e. upper quartile (16.25)
f. interquartile range
2. You're an investment banker and work 22.5 hours a day. Your monthly pay in recent
months has looked like this because you're a money machine:
$23,000 $36,500 $47,200 $20,200 $61,300
a. What's the sample mean? ($37,640)
b. How about sample variance? (292,743,000 dollars squared)
c. And sample standard deviation? ($17,109.73)
3. A set of test scores has a mean of 890 and standard deviation of 120. What's the
coefficient of variation?
4. Check out these test scores: 88 76 67 90 98 68 75 86 82 90. Calculate:
a. sample mean
b. sample standard deviation
c. coefficient of variation
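As a quick check on Questions 1 and 2, here is a minimal Python sketch. The `percentile` helper is an illustrative name, not something from these notes; it assumes the percentile location formula L_P = (n + 1) × P / 100 with linear interpolation between neighbouring observations. The sample variance and standard deviation come from the standard library's `statistics` module, which divides by n − 1.

```python
import statistics


def percentile(data, p):
    """Value at the p-th percentile, using the location L_p = (n + 1) * p / 100."""
    ordered = sorted(data)
    n = len(ordered)
    location = (n + 1) * p / 100
    lower = int(location)          # whole part of the location
    fraction = location - lower    # fractional part, used to interpolate
    if lower < 1:
        return float(ordered[0])
    if lower >= n:
        return float(ordered[-1])
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# Question 1 data
numbers = [2, 3, 3, 6, 8, 9, 14, 16, 17, 20]
median = percentile(numbers, 50)
lower_quartile = percentile(numbers, 25)
upper_quartile = percentile(numbers, 75)

# Question 2 data (monthly pay)
pay = [23000, 36500, 47200, 20200, 61300]
sample_mean = statistics.mean(pay)
sample_variance = statistics.variance(pay)   # sample variance, divides by n - 1
sample_sd = statistics.stdev(pay)            # square root of the sample variance
```

Other textbooks and software use slightly different percentile conventions, so quartiles from a spreadsheet may not match these exactly.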
RC M010
Q. What do you think are the pros and cons for each of the methods of data collection?
Hint: Think of the costs, response rate, purpose and biases that may arise
2. Random Sampling
The primary incentive for examining a sample rather than a population is cost. Compiling statistics is usually
expensive: imagine conducting an experiment on 10,000 people, asking them to take an aspirin every day
for 3 weeks and then coming back to be tested!
Main Concept: We can make inferences about the target population from a sample if the sample statistic
comes quite close to the parameter it is designed to estimate
There are 3 different types of sampling plans:
Simple Random Sample: A sample selected in such a way that every possible sample with the same
number of observations is equally likely to be chosen
o E.g. Drawing ticket stubs in a raffle to determine the winner
Stratified Random Sample: Separating the population into strata and then drawing simple random
samples from each stratum
Cluster sampling: is a simple random sample of groups or clusters of elements
From these samples of observations, two main types of error arise:
1. Sampling Error is the difference between the sample and the population that exists only because of
the observations that happened to be selected for the sample
2. Non-sampling Error is more serious than sampling error, and is due to mistakes made in the acquisition
of data or to sample observations being selected improperly
Conditional Probability is the probability of an event A occurring, given that another event B has occurred.
It is represented by:
P(A | B)
which is read as "Given that B has occurred, what is the probability of A occurring?"
Expanding this we get...
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
One of the reasons we compute conditional probability is to find whether two events are related, i.e. we
want to know whether they are independent events.
If they are independent, the probability of one event is not affected by the occurrence of the other event:
$P(A \mid B) = P(A)$
$P(B \mid A) = P(B)$
4. Other Rules
The Multiplication Rule is used to calculate the joint probability of two events. It is based on the
conditional probability formula, multiplying both sides by P(B):
$P(A \cap B) = P(A \mid B) P(B)$
The Complement Rule: the probability that A does not occur is
$P(A^C) = 1 - P(A)$
GROUP EXERCISE 1
                   Male   Female   Row Total
High Distinction     75       61         136
Pass                215      155         370
Column Total        290      216         506
1. P(Female)
2. P(High Dist)
3. P(Female ∪ High Dist)
4. P(Pass)
5. P(Pass | Male)
6. Which ones of the above are Marginal Probabilities?
7. Which ones are Joint Probabilities?
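One way to organise the calculations for this table is sketched below in Python. The dictionary layout and function names are illustrative choices, not part of the exercise. Exact fractions are used so nothing is lost to rounding, and the union uses the addition rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

```python
from fractions import Fraction

# Cell counts from the Group Exercise 1 table: (grade, sex) -> count
counts = {
    ("High Distinction", "Male"): 75,
    ("High Distinction", "Female"): 61,
    ("Pass", "Male"): 215,
    ("Pass", "Female"): 155,
}
total = sum(counts.values())


def marginal_grade(grade):
    """Marginal probability of a grade: row total / grand total."""
    return Fraction(sum(c for (g, _), c in counts.items() if g == grade), total)


def marginal_sex(sex):
    """Marginal probability of a sex: column total / grand total."""
    return Fraction(sum(c for (_, s), c in counts.items() if s == sex), total)


def joint(grade, sex):
    """Joint probability of a single cell."""
    return Fraction(counts[(grade, sex)], total)


p_female = marginal_sex("Female")
p_high_dist = marginal_grade("High Distinction")
# Addition rule: subtract the overlap so it is not counted twice
p_female_or_high_dist = p_female + p_high_dist - joint("High Distinction", "Female")
p_pass = marginal_grade("Pass")
# Conditional probability: joint probability divided by the marginal of the condition
p_pass_given_male = joint("Pass", "Male") / marginal_sex("Male")
```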
GROUP EXERCISE 2
Probability Trees
Probability trees are a very neat and fast way for working out many probability problems.
Example: (QMB Final 99s2): An advertising executive is studying the television viewing habits of married men and
women during prime-time hours. The executive has determined that during prime-time, husbands are watching
television 60% of the time. It has also been determined that when the husband is watching television, 40% of the
time the wife is also watching. When the husband is not watching television, 30% of the time the wife is watching
television.
i.
ii.
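The tree for this example can be sketched in Python by multiplying along each branch. The question parts did not survive in these notes, so the quantities computed here (the probability that the wife is watching, and the probability that the husband is watching given that the wife is) are illustrative choices only.

```python
# Probabilities given in the example
p_h = 0.60               # husband is watching
p_w_given_h = 0.40       # wife is watching, given husband is watching
p_w_given_not_h = 0.30   # wife is watching, given husband is not watching

# Each end of the tree is a joint probability: multiply along the branch
branches = {
    ("H", "W"): p_h * p_w_given_h,
    ("H", "not W"): p_h * (1 - p_w_given_h),
    ("not H", "W"): (1 - p_h) * p_w_given_not_h,
    ("not H", "not W"): (1 - p_h) * (1 - p_w_given_not_h),
}

# Add the branch ends where the wife is watching
p_wife = branches[("H", "W")] + branches[("not H", "W")]

# Conditional probability read back off the tree
p_h_given_w = branches[("H", "W")] / p_wife
```

The four branch ends always sum to 1, which is a handy check that the tree has been drawn correctly.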
17
a. find the mean and the mode of this data set (2 marks)
b. Find the median and the third quartile of this data set (2 marks)
2. Suppose A and B are mutually exclusive events. If P(A) = 0.4 and P(B) = 0.2, then P(A | B) = ?
3. 2 teams A and B are of equal ability, so each has a probability of 0.5 of defeating the other. Assume that the
outcome of any game is independent of the outcome of any other game. What is the probability that team A
wins 4 games in a row?
4. Approximately 30% of the sales representatives hired by a firm quit in less than 1 year. Suppose that two
sales representatives are hired and assume that the first sales representative's behaviour is independent of
the second sales representative's behaviour.
a. What is the probability that both quit within the year?
b. Find the probability that exactly one representative quits
5. A group of individuals concerned about environmental problems claims that 30% of the adults in a certain
town have been adversely affected by a new nuclear power plant that pollutes the air and causes lung
damage. To test their claim, you randomly select 4 adults of the town
a. If the environmental group is correct, what is the probability that all 4 people have been adversely
affected?
b. What is the probability that at least one of the 4 individuals has been adversely affected?
Answers
1. A) mean = 6, mode = 5
B) median = 5, 75% quartile = 7, observing that 50% of data points are below 5 and 75% below 7
2. 0
3. 0.0625
4. A) 0.09
B) 0.42
5. A) 0.0081
B) 0.7599
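Formula
Population Variance (Full): $\sigma^2 = \sum_{\text{all } x} (x - \mu)^2 P(x)$
Population Variance (Shortcut): $\sigma^2 = \sum_{\text{all } x} x^2 P(x) - \mu^2$
Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$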
We also come across a new concept of the laws of expected value & variance. These are:
a) Expected Value
1. E(C) = C
2. E(X+C) = E(X) + C
3. E(CX) = C.E(X)
b) Variance
1. V(C) = 0
2. V(X+C) = V(X)
3. V(CX) = C².V(X)
Now, try these questions.
Q1. Sheldon has trouble sleeping at night because sometimes there is this one girl
who calls him up at like 3am for no reason. It means he's in a bad
mood the next day. It happens so much he could actually create a probability
distribution for it:
Number of times she calls Sheldon
1
2
3
4
5
6
7
Help Sheldon compute the mean and variance of the number of times the annoying
girl calls him. (Mean = 4, variance = 2.40)
Q2. Continuing on, this girl is crazy. Every time she walks past a Louis Vuitton
store, she has this burning temptation to buy a LV handbag. She used to buy, like, 2
or 3 at a time, but now that Sheldon dumped her, she's more reluctant to buy one
these days. This is the probability distribution for the number of LV handbags she
buys each time she goes out:
Number of LV handbags she
wants to buy
0
1
2
3
4
How many LV handbags should we expect her to buy on Thursday night? (1.85)
2. BIVARIATE DISTRIBUTIONS
Do you recall bivariate relations from Week 3 PASS? We now come across the concept of
bivariate distributions which provide the probabilities of combinations of 2 variables.
There are 2 measures that are important in describing the bivariate distribution.
IMPORTANT FORMULAS FOR BIVARIATE DISTRIBUTIONS
Term
Formula
(Full)
Covariance
(Shortcut)
Coefficient of
Correlation
Importantly, we also have laws of expected value & variance for the sum of 2 variables
too:
1. E(X+Y) = E(X) + E(Y)
2. V(X+Y) = V(X) + V(Y) + 2.COV(X,Y)
...noting that if X and Y are independent, then COV(X,Y) = 0.
Group Question: This question is quite long, so divide the parts up with your partner to get it
done in time.
Sheldon and Juliet are PASS leaders by day, and drug dealers by night. Let X and Y
be the weight in kilograms of drugs Sheldon and Juliet sell each night respectively.
Bivariate Probability Distribution:
                 Y
            0      1      2    Total
     0    .12    .21    .07      .4
X    1    .42    .06    .02      .5
     2    .06    .03    .01      .1
Total      .6     .3     .1    1.00
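A minimal Python sketch of how the covariance, the coefficient of correlation and V(X+Y) could be worked out from this table follows. It assumes the rows are values of X and the columns are values of Y (the column label did not survive in the table); the names `joint` and `expect` are illustrative.

```python
from math import sqrt

# Joint probabilities P(X = x, Y = y); rows assumed to be X, columns assumed to be Y
joint = {
    (0, 0): 0.12, (0, 1): 0.21, (0, 2): 0.07,
    (1, 0): 0.42, (1, 1): 0.06, (1, 2): 0.02,
    (2, 0): 0.06, (2, 1): 0.03, (2, 2): 0.01,
}


def expect(g):
    """Expected value of g(X, Y) under the joint distribution."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: x ** 2) - mean_x ** 2    # shortcut formula
var_y = expect(lambda x, y: y ** 2) - mean_y ** 2
cov_xy = expect(lambda x, y: x * y) - mean_x * mean_y  # shortcut formula
rho = cov_xy / sqrt(var_x * var_y)

# Laws for the sum of two variables
mean_sum = mean_x + mean_y
var_sum = var_x + var_y + 2 * cov_xy
```

Because the covariance here is not zero, V(X+Y) differs from V(X) + V(Y), which is the point of the second law.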
Variance
Question
Sheldon has also joined the recent craze of investing in English football clubs. This
is what his investment portfolio looks like:
Stock            Proportion of Portfolio   Mean   Standard Deviation
Liverpool (#2)   .70                       .25    .15
For each of the following coefficients of correlation, calculate the expected value
and standard deviation of the portfolio.
a) ρ = .5 (.211, .1081)
b) ρ = .2 (.211, .1064)
c) ρ = 0 (.211, .1052)
Question
Recall our discussion of discrete and continuous random variables.
Discrete = countable/finite
Continuous = range of values/infinite number of values in a given interval
Which of the following are discrete and which are continuous?
a) The number of goals scored in 20 attempts (discrete)
b) The time it takes to write an essay (continuous)
c) The number of people in a bar (discrete)
d) The temperature inside a room (continuous)
e) The amount of energy used by a computer (continuous)
1. Binomial Distribution
Let's recall the properties of a Binomial Experiment. There are 4, so give it a shot!
1) Fixed number of trials (n)
2) Two possible outcomes: success and failure
3) P(Success) = p and P(Failure) = (1-p)
4) Trials are independent
Examples: Flipping a coin 10 times, Drawing 5 cards (with replacement) out of a shuffled deck
Note: In a binomial experiment, there is an assumption of a sequence of Bernoulli trials, i.e. the random
variables are independently and identically distributed (iid)
Binomial Random Variable
The probability of x successes in a binomial experiment with n trials and the probability of success p is
o X ~ Bin(n,p)
o $P(X = x) = {}^{n}C_{x}\, p^{x} q^{n-x}$, where q = 1 − p
N.B. Learn to use Binomial tables!!!
P(X = k) Individual binomial probability
P(X ≤ k) Cumulative binomial probability
P(X > k) Survivor probability
Also, from Perms and Combs,
${}^{n}C_{r} = \frac{n!}{r!\,(n - r)!}$
(Sheldon, my Word won't type equations! I shall write this one out >_<.. I had to copy and paste all these
equations)
$\mu = E(X) = np$
$\sigma^2 = Var(X) = np(1-p)$
o $\sigma = \sqrt{np(1-p)}$
Exercises:
1. Sheldon knows that 15% of all the girls he goes out with want expensive presents during the first month of
dating. He decides to test this theory out and goes out with 6 girls. Assume the performances of the girls are
independent of one another. What's the probability that:
a) All six girls will require expensive presents during the month of dating? (0.0000)
b) 1 of them will demand an expensive present during the first month of dating? (0.3993)
c) At least 3 of them will require expensive presents during the first month of dating? (Hint: use cumulative
binomial probabilities) (0.0473)
2. The Koch Electric Company makes electric shavers. If the probability that an electric shaver is
defective is 0.01, what is the probability of the following in a shipment of 500 electric shavers that:
a) None are defective? (0.0067)
b) One is defective? (0.0337)
c) More than three are defective? (0.735)
3. A plumber installs six hot water heaters in a housing development. The probability that any
individual heater will last more than 10 years is 0.7, and their life lengths are independent. Let X
denote the number of water heaters that last more than 10 years.
a) Find the probability that more than 3 of the water heaters will last more than 10 years (0.7443)
b) Find the mean and variance of the random variable X (4.2; 1.26)
4. A quality control manager for a manufacturer has instituted acceptance sampling in order to monitor the
quality of incoming parts that are bought in bulk. The policy is that all incoming parts are checked by
selecting at random 10 parts and then determining whether each part contains any defects or not. If 2 or
more parts are found to have defects then the entire order is rejected and is returned to the supplier. What
is the probability that an order from a particular supplier is rejected if that supplier is known to have 5% of
parts with defects? (0.0861)
5. The probabilities that three independent members of a committee will vote in favour of electing a PASS
leader as president are 0.2, 0.3 and 0.5, respectively. The probability that at most one member of the
committee will elect a PASS leader is? (0.75)
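If you don't have binomial tables to hand, the individual and cumulative probabilities can be computed directly from the formula. The sketch below is in Python; `binom_pmf` and `binom_cdf` are illustrative helper names, applied to Exercises 1, 3 and 4.

```python
from math import comb


def binom_pmf(x, n, p):
    """Individual binomial probability P(X = x) = nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binom_cdf(k, n, p):
    """Cumulative binomial probability P(X <= k)."""
    return sum(binom_pmf(x, n, p) for x in range(k + 1))


# Exercise 1: n = 6, p = 0.15
p_exactly_one = binom_pmf(1, 6, 0.15)
p_at_least_three = 1 - binom_cdf(2, 6, 0.15)   # complement of "2 or fewer"

# Exercise 3: n = 6, p = 0.7
p_more_than_three = 1 - binom_cdf(3, 6, 0.7)
heater_mean = 6 * 0.7
heater_variance = 6 * 0.7 * (1 - 0.7)

# Exercise 4: order rejected if 2 or more of 10 parts are defective, p = 0.05
p_order_rejected = 1 - binom_cdf(1, 10, 0.05)
```

"At least" and "more than" questions are handled through the complement, which is exactly how the cumulative tables are used by hand.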
Exercises:
1. The time before a baby cries is a uniformly distributed random variable between 0 and 30 minutes
a) Find the probability density function (1/30)
b) Find the probability that a baby cries within 20 minutes (0.67)
c) Find the probability that a baby does not cry within 10 minutes (0.67)
d) Find the probability that a baby cries between 15 minutes and 20 minutes (0.17)
2.
3. If X, a continuous random variable, is symmetric about μ, is P(X < μ − 2) equal to P(X > μ + 2)? (yes)
4. If X, a continuous random variable, is symmetric about X = 2, find P (X > 2) (0.5)
Class Example
You make an investment of stocks with an average return of 10%. Find the
probability that you will lose money:
a) if the standard deviation of returns is 5% (0.0228)
b) if the standard deviation of returns is 10% (0.1587)
Clue! Use the tables.
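The class example can also be checked without tables. This is a minimal Python sketch, assuming returns are normally distributed and that "losing money" means a return below 0; `prob_of_loss` is an illustrative name.

```python
from statistics import NormalDist


def prob_of_loss(mean_return, sd_return):
    """P(return < 0), assuming returns are normally distributed."""
    return NormalDist(mu=mean_return, sigma=sd_return).cdf(0.0)


loss_a = prob_of_loss(0.10, 0.05)   # standardises to z = -2
loss_b = prob_of_loss(0.10, 0.10)   # standardises to z = -1
```

The larger standard deviation puts more of the distribution below zero, so the riskier investment has the higher probability of a loss even though the mean return is the same.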
c) Finding values of Z
Just then, we focused on working out Z, then using that to work out the probability of
something.
However, often questions can ask us to reverse engineer the process, by giving us a
probability first, then working out what Z is. This is the complete opposite of the previous
process.
FINDING VALUES OF Z, GIVEN A PROBABILITY
ZA = The value of Z such that the area to its right under the standard normal
curve is A.
Question
a) Find Z0.025 (1.96)
b) Find Z0.05 (1.645)
d) ZA and percentiles
ZA & PERCENTILES
eg. Using question (b) from above, Z0.05 = 1.645 = the 95th percentile.
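Working backwards from a probability to Z is the inverse of the usual table lookup. A small Python sketch, using the standard library's `NormalDist.inv_cdf`; the helper name `z_value` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1


def z_value(area_to_right):
    """Z_A: the z with area A to its right, i.e. area 1 - A to its left."""
    return standard_normal.inv_cdf(1 - area_to_right)


z_025 = z_value(0.025)
z_05 = z_value(0.05)

# Z_A as a percentile: the proportion of the distribution lying below it
percentile_of_z_05 = standard_normal.cdf(z_05)
```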
2. An analysis of the amount of interest paid monthly by Visa cardholders reveals that
the amount is normally distributed with a mean of $27 and a standard deviation of
$7.
a) What proportion of the cardholders pay more than $30 in interest? (.3336)
b) What proportion of the cardholders pay more than $40 in interest? (.0314)
c) What proportion of the cardholders pay less than $15 in interest? (.0436)
d) What interest payment is exceeded by only 20% of the cardholders? ($32.88)
mean: $\mu = np$
standard deviation: $\sigma = \sqrt{np(1-p)}$
Note, however, that we can't directly apply the normal to the binomial. We actually need a
continuity correction factor of 0.5 to adjust for the approximation. In particular:
USING THE CONTINUITY CORRECTION FACTOR
Let Y be the normal random variable approximating the binomial random variable X.
$P(X \le x) \approx P(Y < x + 0.5)$
$P(X \ge x) \approx P(Y > x - 0.5)$
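To see how close the approximation gets, the sketch below compares the exact binomial probability with the continuity-corrected normal probability. The values n = 200, p = 0.6 and x = 100 are illustrative, and the function names are my own.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf(k, n, p):
    """Exact P(X <= k) for a binomial random variable."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Approximate P(X <= k) by P(Y < k + 0.5), with Y normal(np, np(1 - p))."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return NormalDist(mu, sigma).cdf(k + 0.5)


n, p, x = 200, 0.6, 100
exact = binom_cdf(x, n, p)
approx = normal_approx_cdf(x, n, p)
```

The two values agree to about three decimal places here; the agreement generally gets worse when n is small or p is close to 0 or 1.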
3. CONCEPTS OF ESTIMATION
a) Some chilled questions to consider with the person next to you...
2. Consistency
3. Relative efficiency
The sampling distribution of a statistic is the distribution of all possible values that can be assumed by that statistic, computed from samples of the
same size drawn from the same population
i.e. it allows us to estimate the population parameter using a sample statistic
o The population of a random variable will have certain parameters
E.g. Mean μ and Variance σ²
o For a particular sample of size n, the sample statistic is unlikely to be the same as its population
parameter
Known as sampling error: the cost of sampling, which can be reduced by taking larger
samples. (NB: Standard Deviation of the sampling distribution of the mean = Standard Error)
o Different samples (of size n) will have different sample statistics
i.e. Sample mean/variance will vary for each sample
o Taking repeated samples of size n, the distribution of this statistic can be computed
                                        Mean   Variance   Distribution
Sampling distribution of the mean       μ      σ²/n       Normal
where μ and σ² are the population parameters
Group Exercise 2
In a certain PASS community, 60% of all leaders are in favour of electing Sheldon as the genius. A random
sample of 200 leaders is taken. What is the probability that 100 or less of these leaders favour the election of
Sheldon as the one and only genius? (0.0025)
Group Exercise 3
Cadbury Yowie chocolates are known to have a mean weight of 27g and a variance of 6.25g
squared. If a random sample of 60 Yowies is examined, find the probability that its average is:
a)
Group Exercise 4
A basketball coach is seeking tall recruits who are smart enough to be eligible for college. The
recruit must be at least 74 inches tall and have an IQ of 115 or above. Height and IQ are
independent of one another. IQ is normally distributed with mean 100 and standard deviation 12,
and height is normally distributed with mean 70 and standard deviation 2 inches. What percentage
of the population satisfies the coach's requirements? (0.24%)
2. The time it takes for a statistics professor to mark his mid-session test is normally distributed with a mean
of 4.8 mins and a standard deviation of 1.3 mins. If there are 60 students in the class, what is the probability
that he needs more than 5 hours to mark all the mid-session tests? (0.1170)
3. Pierre's goose farm claims that its jars of foie gras have a mean weight of 250g and a standard deviation of 6g.
After buying 36 jars, before eating them on petits blinis with some fig jam, salt and pepper, you weighed
them and found them to have a mean of 245g. What general statement can we make about Pierre's
claim? (Pierre is lying)
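Questions 2 and 3 both rest on the sampling distribution of the mean, which has standard deviation σ/√n. A hedged Python sketch follows; `sample_mean_dist` is an illustrative helper, and for the foie gras question it assumes the sample mean is approximately normal.

```python
from math import sqrt
from statistics import NormalDist


def sample_mean_dist(mu, sigma, n):
    """Sampling distribution of the mean: normal with mean mu and sd sigma / sqrt(n)."""
    return NormalDist(mu, sigma / sqrt(n))


# Question 2: more than 5 hours (300 minutes) for 60 tests means
# an average of more than 5 minutes per test
marking = sample_mean_dist(4.8, 1.3, 60)
p_over_five_hours = 1 - marking.cdf(300 / 60)

# Question 3: how many standard errors is 245g below the claimed 250g?
foie_gras = sample_mean_dist(250, 6, 36)
z_pierre = (245 - 250) / foie_gras.stdev
```

The computed probability for Question 2 differs from the table answer only in the fourth decimal place, because tables round z to two decimals first.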
3. Confidence Interval
Recall:
o Point Estimators produce a single estimate of the parameter of interest
o Interval Estimators produce a range of values and attach a degree of confidence with that interval
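Confidence Interval: an interval estimator defined by the confidence level (1 − α)
This implies that we start with the confidence level we want and then work out the width of the interval
Deriving it step by step...
1. Recall:
Standard Normal: $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
2. Employ the definition of a confidence interval
Symmetrical Interval: $P\left(-Z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < Z_{\alpha/2}\right) = 1 - \alpha$
3. Rearranging,
$P\left(\bar{X} - Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$
1 − α    α       α/2      Z_{α/2}
90%      0.1     0.05     1.645
95%      0.05    0.025    1.960
98%      0.02    0.01     2.326
99%      0.01    0.005    2.576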
Group Exercise 5
If we know that σ = 40, and we obtain a sample mean of 136, construct a 95% confidence interval for the
population mean using a sample size of:
a) 20
(118.47, 153.53)
b) 160
(129.80, 142.20)
Group Exercise 6
If we know that σ = 40, and we obtain a sample mean of 136 using 25 values, construct a confidence interval
for the population mean having:
a) A confidence level of 99%
(115.392, 156.608)
b) A confidence level of 50%
(130.60, 141.40)
Group Exercise 7
John wants to estimate the average time it takes for customers to have lunch at his new cafe. He knows from
past experience that the standard deviation will be 18. John wants to use a confidence interval of 90% and
have a sampling error no greater than 3 minutes. How many customers does he need to time? (98)
3. An economist wants to estimate the mean annual income of households in a particular district. It is assumed
that the population standard deviation is $4000. The economist wants the estimate to be
within D = $500 of the true mean with a 95% level of confidence. Calculate the sample size required.
4. Starting annual salaries for university graduates with business degrees are believed to have a standard
deviation of approximately $1800. A 95% confidence interval estimate of the mean annual starting salary is
desired. How large a sample should be taken if we want to be 95% confident that the maximum sampling
error is:
a. $500
b. $200
5. A medical researcher wants to investigate the amount of time it takes for patients' headache pain to be
relieved after taking a new prescription painkiller. She plans to use statistical methods to estimate the mean
of the population of relief times. She believes that the population is normally distributed with a standard
deviation of 20 minutes. How large a sample should she take to achieve 90% confidence to within 1 minute?
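The interval and sample-size calculations in these exercises follow the same two formulas each time: x̄ ± z σ/√n for the interval, and n = (z σ / B)² rounded up for the sample size, where B is the largest sampling error we will accept. A minimal Python sketch, with illustrative function names, applied to Group Exercises 5 and 7 and Question 3:

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_crit(confidence):
    """Z_{alpha/2} for a two-sided interval with the given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def mean_ci(x_bar, sigma, n, confidence):
    """Confidence interval for the population mean when sigma is known."""
    half_width = z_crit(confidence) * sigma / sqrt(n)
    return (x_bar - half_width, x_bar + half_width)


def sample_size(sigma, bound, confidence):
    """Smallest n whose sampling error is no greater than the bound."""
    return ceil((z_crit(confidence) * sigma / bound) ** 2)


ci_n20 = mean_ci(136, 40, 20, 0.95)
ci_n160 = mean_ci(136, 40, 160, 0.95)
customers_needed = sample_size(18, 3, 0.90)
households_needed = sample_size(4000, 500, 0.95)
```

The sample size is always rounded up, never to the nearest whole number, since rounding down would leave the sampling error slightly above the target.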
2. You are faced with 2 investments. One is very risky, but the potential returns are high.
The other is safe, but the potential is quite limited. Which one should you choose?
The p-value of a test is the probability of observing a test statistic at least as extreme as
the one computed given that the null hypothesis is true.
So in our example here, what would the p-value be?
Range of p-value
< 0.01
0.01 to 0.05
0.05 to 0.10
> 0.10
NB: You cannot prove that either the null or the alternative hypothesis is true.
h) One and Two Tail tests
ONE VS TWO TAIL TESTS
o
3. PRACTICE QUESTIONS
1. Juliet gets really annoyed when her BES students take too long to finish her class tests.
She's shot a few of them in the leg before. To investigate further, she randomly samples
10 students and measures the amount of time they spend doing a BES test. The results
are listed below. Assuming that the times are normally distributed with a standard
deviation of 2 minutes, test to determine whether Juliet can infer at the 5%
significance level that the mean amount of time spent on the tests is greater than 6
minutes. Data: 8 11 5 6 7 8 6 4 8 3. (Answer: z = .95, p-value = .1711, no.)
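The z statistic and p-value for this question can be reproduced with a few lines of Python. This is a sketch of the upper-tail z-test with known σ; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean

times = [8, 11, 5, 6, 7, 8, 6, 4, 8, 3]
sigma = 2      # known population standard deviation (minutes)
mu_0 = 6       # hypothesised mean under H0
alpha = 0.05   # significance level

# Test statistic: how many standard errors the sample mean lies above mu_0
z = (mean(times) - mu_0) / (sigma / sqrt(len(times)))

# Upper-tail test (H1: mean > 6), so the p-value is the area to the right of z
p_value = 1 - NormalDist().cdf(z)

reject_h0 = p_value < alpha
```

The p-value from software differs from the table answer in the fourth decimal place because tables round z to 0.95 before the lookup.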
Learning to calculate the probability of Type II errors and the Power of the Test
Hypothesis Testing when population variance is unknown
Sampling distribution of sampling proportion
1. Some clarification...
Sample Mean: $\bar{X} \sim N(\mu_{\bar{x}}, \sigma_{\bar{x}}^2)$; the subscripts show that this is different to the population
Hypothesis testing: testing where our x̄ lies in relation to our hypothesised μ0
o Methods: Critical values for x̄ (not a confidence interval) and critical values using z-scores
o State H0 with a strict equality sign (=) and your conclusion with the level of significance.
o Value of α is called our significance level
2. Type I and Type II errors
ERRORS
                    Reject H0           Do not reject H0
Given a true H0     Type I error        Correct Decision
Given a false H0    Correct Decision    Type II error
NB. There is a trade-off between the two types of errors. Changing our significance level α will produce
resultant changes in β.
Power of the test
The power of the test is the probability of correctly rejecting a false null hypothesis.
Power = 1 − β
NB. β ≠ 1 − α !!!!
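Steps:
1. Draw the distribution of x̄, centred on μ0, under the null hypothesis, H0
[Figure: two sampling distributions of x̄. The first is centred on the hypothesized mean (H0) and marks the rejection region beyond the critical value. The second is centred on the actual mean (when H0 is false) and is split into the area that is correctly rejected and the non-rejection region, where we do not reject but SHOULD reject.]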
1) N.S.W. Police are testing if vehicles are exceeding the speed limit of 90km/hr on South Dowling Road. A sample of
81 vehicles yields a mean driving speed of 98km/hr. If the population of vehicle speeds is normally distributed with a
standard deviation of 25 km/hr, test the hypothesis, at the 5% level of significance, H0: μ = 90; H1: μ > 90. If H0 is
rejected, calculate β, the probability of Type II error, given that the true μ = 100. (β = 0.0253; power of test = 0.9747;
we reject H0)
2) Miss Rose was researching dress sizes. She had thought the mean dress size was 9. But her suspicion is that it will
be larger than that. Thus, being the relative unknown and incredible mathematician she was, Miss Rose decided to
do a hypothesis test. She found the population to be normally distributed, with standard deviation of 4. If α = 0.05
and the sample size was 64, calculate the power of the test if the mean was actually
a. 9.5 (0.2595)
b. 10 (0.6387)
3) What will be the answer for (a) and (b) in the above example if Miss Rose only suspected that the mean size was
not 9? (0.17, 0.516)
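The power calculation for an upper-tail test can be sketched in two steps: find the critical value of x̄ under H0, then find the probability of landing beyond it when the true mean applies. The Python below assumes a known σ and a one-sided alternative (H1: μ > μ0), as in Questions 1 and 2; `power_upper_tail` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def power_upper_tail(mu_0, mu_true, sigma, n, alpha):
    """Power of a one-sided (upper-tail) z-test when the true mean is mu_true."""
    standard_error = sigma / sqrt(n)
    # Step 1: critical value of x-bar under H0
    critical = mu_0 + NormalDist().inv_cdf(1 - alpha) * standard_error
    # Step 2: probability of exceeding it under the true mean
    return 1 - NormalDist(mu_true, standard_error).cdf(critical)


# Question 1: H0 mean 90, true mean 100, sigma 25, n 81
power_police = power_upper_tail(90, 100, 25, 81, 0.05)
beta_police = 1 - power_police

# Question 2: H0 mean 9, sigma 4, n 64
power_rose_9_5 = power_upper_tail(9, 9.5, 4, 64, 0.05)
power_rose_10 = power_upper_tail(9, 10, 4, 64, 0.05)
```

The further the true mean is from the hypothesised one, the higher the power, which is why the second Miss Rose value is larger than the first.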
3. T-distribution
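So far, the problems we have dealt with assume that the population variance σ² is known
This is unrealistic; we're more likely to know the sample variance s²
Note that s² is an unbiased and consistent estimator of σ²
σ known: $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
σ unknown: $t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$, which follows the t-dist with $\nu = n - 1$ degrees of freedom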
Assuming H0: μ = μ0
If H1: μ > μ0 then c = μ0 + t_{α,ν}(s/√n). If X̄ > c, reject H0, otherwise we do not reject
If H1: μ < μ0 then c = μ0 − t_{α,ν}(s/√n). If X̄ < c, reject H0, otherwise we do not reject
If H1: μ ≠ μ0 then c = μ0 ± t_{α/2,ν}(s/√n). If X̄ < lower c or X̄ > upper c, reject H0, otherwise we do not reject
Or using the standardised method, our critical values for t are:
t_{α,ν} (one-tailed test)
or
t_{α/2,ν} (two-tailed test)
where α = level of significance
ν = n − 1 (degrees of freedom)
Assuming H0: μ = μ0
o If H1: μ > μ0, reject H0 if t > t_{α,ν}
o If H1: μ < μ0, reject H0 if t < −t_{α,ν}
o If H1: μ ≠ μ0, reject H0 if |t| > t_{α/2,ν}
$E(\hat{p}) = p$
$Var(\hat{p}) = pq/n$
$\hat{p} \sim N\left(p, \frac{pq}{n}\right)$ (approximately, for large n)
8) The proportion of families buying milk from Company A in a certain city is p = 0.6.
A random sample of 10 families shows that 4 buy milk from Company A.
a) Conduct a hypothesis test with a null H0: p = 0.6 against the alternative H1: p < 0.6. Find the critical values using
both unstandardised and standardised methods at the 5% significance level. (Do not reject null)
b) Construct a 95% confidence interval for p. Does this interval include 0.6? [0.096, 0.704]
If we reject the null when 3 or fewer families buy milk from Company A:
c) Find the probability of committing a Type I error. (0.055)
d) If the true proportion of families buying milk from Company A is p = 0.5, what is the probability of committing a
Type II error based on the above decision rule? (0.828)
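For parts (c) and (d) the sample is small, so the error probabilities come straight from the binomial distribution rather than the normal approximation. A minimal Python sketch with illustrative names:

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for a binomial random variable."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))


n = 10
reject_if_at_most = 3   # decision rule: reject H0 when 3 or fewer families buy

# Type I error: rejecting H0 when it is true (p = 0.6)
type_1_error = binom_cdf(reject_if_at_most, n, 0.6)

# Type II error: not rejecting H0 (4 or more families) when the true p = 0.5
type_2_error = 1 - binom_cdf(reject_if_at_most, n, 0.5)
```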
Wednesday 10-11am
OMB229
In order to investigate the relationship between x and y, we need to calculate the value of the
coefficients β0 and β1 using the least squares method, with whom you had a friendly encounter in
Week 2.
Why is it called the least squares method? Recall that when we draw a line through a set of sample
data, we aim for the best line: the line of best fit. In particular, this line is the one which is closest to
the sample data points; the line that minimizes the sum of the squared differences between the points
and the line.
LEAST SQUARES LINE COEFFICIENTS
$b_1 = \frac{s_{xy}}{s_x^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
Class Example
The annual bonuses (millions) of 6 football players from Chelsea FC [the 2010 Premier League (clearly
dominating Man Utd) AND FA Cup Champions] with different years of experience are recorded as
follows. The manager, Carlo Ancelotti, has hired you as his private statistician to determine the
relationship between annual bonus and years of experience.
Years of experience (x)   1   2   3   4   5    6
Annual Bonus (y)          6   1   9   5   17   12
Frank Lampard has already performed some initial calculations for you:
SOME HELPFUL DATA FOR THIS QUESTION
$\sum_{i=1}^{n} x_i = 21$
$\sum_{i=1}^{n} y_i = 50$
$\sum_{i=1}^{n} x_i y_i = 212$
$\sum_{i=1}^{n} x_i^2 = 91$
$s_{xy}$ =
$s_x^2$ =
$b_1$ =
$\bar{x}$ =
$\bar{y}$ =
$b_0$ =
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$s_\varepsilon = \sqrt{\frac{SSE}{n - 2}}$
QUESTION
1. Calculate the standard error of estimate for Chelsea FC. (4.503)
2. Interpret what it tells you about the model's fit.
H0: β1 = 0 (ALWAYS)
H1: β1 ≠, >, < 0
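$t = \frac{b_1 - \beta_1}{s_{b_1}}$, where $s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n - 1) s_x^2}}$
Rejection regions:
H1: β1 > 0: t > t_{α, n−2}
H1: β1 < 0: t < −t_{α, n−2}
H1: β1 ≠ 0: |t| > t_{α/2, n−2}
Step 4: Conclusion
If we don't reject H0, we cannot conclude that y is
linearly related to x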
QUESTION
1. Perform a hypothesis t-test of the slope for Chelsea FC at 5% significance.
(t-stat = 1.9642, do not reject null).
2. Interpret what it tells you about the model's fit.
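$R^2 = \frac{s_{xy}^2}{s_x^2 s_y^2} = 1 - \frac{SSE}{\sum (y_i - \bar{y})^2} = \frac{\sum (y_i - \bar{y})^2 - SSE}{\sum (y_i - \bar{y})^2} = \frac{\text{EXPLAINED VARIATION}}{\text{VARIATION IN Y}}$
QUESTION
1. Calculate the coefficient of determination for Chelsea FC. (0.491)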
2. Interpret what this tells you about the regression model.
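The whole Chelsea FC calculation can be traced in one short Python sketch, from the least squares coefficients through to the standard error of estimate, the t-statistic for the slope and R². It uses the six data points from the class example; the variable names mirror the formulas in these notes, and the test of the slope assumes H0: β1 = 0.

```python
from math import sqrt

# Class example data: years of experience (x) and annual bonus in millions (y)
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance and sample variances (divide by n - 1)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s_y2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)

# Least squares line coefficients
b1 = s_xy / s_x2
b0 = y_bar - b1 * x_bar

# Sum of squared errors and standard error of estimate
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_eps = sqrt(sse / (n - 2))

# t-statistic for the slope under H0: beta1 = 0
s_b1 = s_eps / sqrt((n - 1) * s_x2)
t_stat = (b1 - 0) / s_b1

# Coefficient of determination, computed two equivalent ways
r2 = s_xy ** 2 / (s_x2 * s_y2)
r2_from_sse = 1 - sse / ((n - 1) * s_y2)
```

The t-statistic is compared with the critical value from the t table with n − 2 = 4 degrees of freedom.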
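Interval Prediction
Formula
Prediction Interval (for an individual value of y at x = x_g): $\hat{y} \pm t_{\alpha/2, n-2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n - 1) s_x^2}}$
Confidence Interval Estimator (for the mean of y at x = x_g): $\hat{y} \pm t_{\alpha/2, n-2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n - 1) s_x^2}}$
Why is there a missing 1 under the square root for the confidence interval estimator?
Ans. There is less error in estimating a mean value as opposed to predicting an individual value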
Class Exercise
In television's early years, most commercials were 60 seconds long. Now, however, commercials can be any length.
The objective of commercials remains the same: to have as many viewers as possible remember the product in a
favorable way. A total of 60 participants were shown advertisements of varying length and each was given a test
score based on what they would remember. Using the data set (Keller, 16.06)
a) Determine the least squares line of test scores on the length of the advertisement.
b) Interpret the coefficients and their significance. Comment on the overall fit of the model.
c) Predict with 95% confidence the memory test score of a viewer who watches a 36 second commercial.
d) Estimate with 95% confidence the mean memory test score of people who watch 36 second commercials.
Also,
x̄ = 38
s_x² = 193.90
ȳ = 13.80
s_y² = 47.96
s_xy = 57.86
n = 60
Multiple Regression
Recall the assumptions of a classical linear regression model
Problem? We only measured the effect of ONE variable on the model
All the other factors were omitted and included in the error term (ε)
This can cause confounding and omitted variable bias.
Bias occurs when:
The omitted variable is correlated with an included explanatory (independent) variable
The omitted variable is a determinant of the dependent variable
This violates the assumption of the zero conditional mean (ZCM) and therefore OLS estimates are no longer unbiased.
Our new population regression model is:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$
Interpretation of β1
Measures the effect of a change in X1 holding X2, X3, ... , Xk constant
Also known as the partial effect of X1 holding all other explanatory variables constant
What happens if the variables X2, X3, ... , Xk are omitted and these variables are correlated with X1?
Omitted variables will appear in the disturbance/error term
ZCM assumption will be violated (error term now correlated with independent variable)
Produces a biased estimator of β1 (will also include the effect of other variables on Y)
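Essentially, the process of multiple regression remains the same as linear regression
Minimising SSE
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ gives us $\hat{y} = b_0 + b_1 x_1 + \dots + b_k x_k$
Hypothesis Testing
where v = n-k-1
Adjusted $R^2 = 1 - \frac{SSE / (n - k - 1)}{\sum (y_i - \bar{y})^2 / (n - 1)}$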
The regression was estimated by Ordinary Least Squares and a portion of the EXCEL output is reproduced below in
Table 3:
a) The sample mean of HRINCOME for males is $59 and for females is $34. Why is this difference not necessarily
evidence of gender discrimination? [2 marks]
b) Use the regression output to conclude whether there is evidence of gender discrimination in hourly incomes.
Justify your answer. [3 marks]
c) Interpret the estimate for the EXP variable in terms of both economic and statistical significance. Is it consistent
with your expectations? Discuss. [3 marks]
d) Test the null hypothesis that β5 is equal to zero against the alternative that it is greater than zero. Use a 1%
significance level. [1 mark]
e) What are the "Standard Error" and "R Square" statistics reported amongst the "Regression Statistics" in the EXCEL
output? Interpret the R Square result for this regression model. [3 marks]
f) Calculate the predicted hourly income for a male lawyer with 10 years' experience who works in a firm with 20
lawyers but who is not a partner. [1 mark]
Distributions thus far:
o Binomial Distribution (Week 6)
o Uniform Distribution (Week 6)
o Normal Distribution (Week 7)
o Distribution of the Sample Mean (Week 8)
o T-Distribution (Week 10)
o Distribution of the Sample Proportion (Week 10)
Next week....(Our last week!)
Chi-Squared Distribution
Revision on whatever we decided today...Confidence Interval, Hypothesis Testing?
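χ² > 0 (always)
$\chi^2_{A,\nu}$ = the value with area A to its right
$\chi^2_{1-A,\nu}$ = the value with area A to its left
use the table of values at the back of your yellow booklet
Confidence interval estimator of σ²:
LCL = $\frac{(n - 1) s^2}{\chi^2_{\alpha/2}}$
UCL = $\frac{(n - 1) s^2}{\chi^2_{1-\alpha/2}}$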
H0: σ² = σ0²
If H1: σ² > σ0², RR: χ² > χ²_{α,ν}
If H1: σ² < σ0², RR: χ² < χ²_{1−α,ν}
If H1: σ² ≠ σ0², RR: χ² > χ²_{α/2,ν} or χ² < χ²_{1−α/2,ν}
Test statistic: $\chi^2 = \frac{(n - 1) s^2}{\sigma_0^2}$, with ν = n − 1
Step 5: Conclusion
Do we have enough evidence to reject H0, that the population variance = σ0²?
PRACTICE QUESTIONS
1. The sample variance of a random sample of 50 observations from a normal population was found
to be s² = 80. Can we infer at the 1% significance level that σ² is less than 100? (No)
2. Estimate σ² with 90% confidence given that n = 15 and s² = 12. (7.0932, 25.5684)
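Practice Question 2 can be checked with the short Python sketch below. Python's standard library has no chi-squared quantile function, so the two critical values are typed in from a standard chi-squared table for 14 degrees of freedom; the constant and function names are illustrative.

```python
def variance_ci(n, s2, chi2_right_tail, chi2_left_tail):
    """Confidence interval for the population variance.

    chi2_right_tail is the table value with area alpha/2 to its right,
    chi2_left_tail is the table value with area alpha/2 to its left.
    """
    lower = (n - 1) * s2 / chi2_right_tail
    upper = (n - 1) * s2 / chi2_left_tail
    return (lower, upper)


# Table values for 14 degrees of freedom (90% confidence, so alpha/2 = 0.05)
CHI2_05_14 = 23.6848   # area 0.05 to the right
CHI2_95_14 = 6.57063   # area 0.95 to the right, i.e. 0.05 to the left

lcl, ucl = variance_ci(15, 12, CHI2_05_14, CHI2_95_14)
```

Note that the larger table value gives the lower limit, because it sits in the denominator.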
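$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$
where f_i is the observed FREQUENCY and e_i = n p_i is the expected frequency of category i
Step 6: Conclusion
Do we have enough evidence to reject H0 and conclude that at
least one of the pi is not equal to its specified value?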
PRACTICE QUESTION
3. We would like to make inferences about the market shares of Dell, HP, Apple, and the rest at the
5% significance level. In a random sample of 200 computers, we find that 48 are Dell, 42 are HP, 12
are Apple and 98 are the rest.
Test the hypothesis that:
H0: p1=0.2, p2=0.2, p3=0.1, p4=0.5
H1: At least one pi is not equal to its specified value
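The test statistic for Practice Question 3 can be computed as sketched below in Python. The critical value is typed in from a standard chi-squared table (5% significance, k − 1 = 3 degrees of freedom), since the standard library has no chi-squared quantile function; the names are illustrative.

```python
# Observed counts: Dell, HP, Apple, the rest
observed = [48, 42, 12, 98]
# Proportions specified under H0
hypothesised = [0.2, 0.2, 0.1, 0.5]

n = sum(observed)
# Expected frequency of each category if H0 is true
expected = [n * p for p in hypothesised]

# Chi-squared goodness-of-fit statistic
chi2_stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# Table value with area 0.05 to its right, 3 degrees of freedom
CHI2_CRIT_05_3 = 7.815
reject_h0 = chi2_stat > CHI2_CRIT_05_3
```

Each expected frequency here is well above 5, which is the usual rule of thumb for the chi-squared approximation to be reasonable.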
Rejection region: $\chi^2 > \chi^2_{\alpha,\nu}$
where v = (r − 1)(c − 1)
$\chi^2 = \sum_{i} \sum_{j} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$
where
$e_{ij} = \frac{(\text{total of row } i)(\text{total of column } j)}{\text{sample size}}$
Step 6: Conclusion
Do we have enough evidence to reject H0 and conclude that
the variables are dependent?
PRACTICE QUESTION
4. Test the hypothesis that income and education are independent at the 1% significance level.
Education/Income   Secondary   Tertiary   Doctorate   TOTAL
< $50k                    40         30           1      71
$50k - $100k              30         40          12      82
> $100k                   12         20          15      47
TOTAL                     82         90          28     200
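A Python sketch of the test of independence for this table follows. Expected frequencies come from (row total × column total) / sample size, and the critical value is typed in from a standard chi-squared table (1% significance, 4 degrees of freedom). The names are illustrative.

```python
# Observed frequencies: rows are income bands, columns are education levels
observed = [
    [40, 30, 1],    # < $50k
    [30, 40, 12],   # $50k - $100k
    [12, 20, 15],   # > $100k
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell if income and education are independent
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum over every cell of the table
chi2_stat = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)

degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)

# Table value with area 0.01 to its right, 4 degrees of freedom
CHI2_CRIT_01_4 = 13.2767
reject_h0 = chi2_stat > CHI2_CRIT_01_4
```

Rejecting H0 here means concluding that income and education are dependent.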