
CS2101: Research Methodology week 5 Sample, CLT & CI

CS2101: Research Methodology week 5


Sample, CLT & CI

by:
Divya Kumar
Assistant Professor

Department of CSE
Motilal Nehru National Institute of Technology Allahabad, Allahabad

1 Sampling Distributions
Central Limit Theorem

2 Confidence Intervals
Introduction
Margin of Error
Student's t-Distribution
Sampling Distributions
Central Limit Theorem

The concepts of Sample and Population

We cannot study the whole population.
Hence we collect and study samples.
For a sample of size n, the sample mean X̄ stands in for the population mean µ, and the sample standard deviation s stands in for σ.

Points to Ponder

When you take a sample of data, it's important to realize that the results will vary from sample to sample.
Statistical results based on samples should include a measure of how much those results are expected to vary from sample to sample.

Points to Ponder

A sampling distribution is the probability distribution of a statistic obtained from a large number of samples drawn from a specific population.
It describes how frequently each of the different possible values of that statistic occurs across those samples.
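The idea can be made concrete with a small simulation. This is only a sketch: the exponential population, the sample size of 40, the 2,000 repetitions and the function name are all assumptions chosen for illustration, not values from the lecture.

```python
import random
import statistics

def sampling_distribution_of_mean(population, sample_size, num_samples, seed=0):
    """Draw many random samples (with replacement) and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(num_samples)
    ]

# Hypothetical, deliberately skewed population (exponential, mean about 1.0).
_rng = random.Random(42)
population = [_rng.expovariate(1.0) for _ in range(50_000)]

sample_means = sampling_distribution_of_mean(population, sample_size=40, num_samples=2_000)

# Coarse text histogram of the sampling distribution of the mean.
lo, hi = min(sample_means), max(sample_means)
bins = [0] * 10
for m in sample_means:
    idx = min(int((m - lo) / (hi - lo) * 10), 9)
    bins[idx] += 1
for i, count in enumerate(bins):
    left_edge = lo + i * (hi - lo) / 10
    print(f"{left_edge:5.2f} | {'#' * (count // 20)}")

print("population mean     :", round(statistics.fmean(population), 3))
print("mean of sample means:", round(statistics.fmean(sample_means), 3))
```

The list `sample_means` is the (simulated) sampling distribution: one value of the statistic per sample.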

The Central Limit Theorem

The Central Limit Theorem (CLT) states that, given a sufficiently large sample size n from a population with finite variance, the mean of the sample means X̄ will be approximately equal to the population mean µ.
Furthermore, the sample means will approximately follow a normal distribution.
The standard deviation of the sample means (the standard error) is approximately the population standard deviation divided by the square root of the sample size, σ/√n; equivalently, their variance is σ²/n.
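The σ/√n claim can be checked numerically. The sketch below assumes a hypothetical uniform population and arbitrary sample sizes (5, 30, 100); it compares the observed spread of the sample means with the value the theorem predicts.

```python
import math
import random
import statistics

rng = random.Random(1)

# Hypothetical non-normal population: uniform on [0, 10).
population = [rng.uniform(0, 10) for _ in range(100_000)]
sigma = statistics.pstdev(population)

def observed_standard_error(n, num_samples=3_000):
    """Standard deviation of the means of num_samples random samples of size n."""
    means = [statistics.fmean(rng.choices(population, k=n)) for _ in range(num_samples)]
    return statistics.stdev(means)

results = {}
for n in (5, 30, 100):
    results[n] = (observed_standard_error(n), sigma / math.sqrt(n))
    print(f"n={n:>3}  observed={results[n][0]:.3f}  sigma/sqrt(n)={results[n][1]:.3f}")
```

The two columns should agree closely, and both shrink as n grows.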

Breaking Down CLT

According to the central limit theorem, the mean of a sample of data will tend to be closer to the mean of the overall population as the sample size increases, regardless of whether the underlying data are normally distributed.
As a general rule, sample sizes of 30 or more are considered sufficient for the central limit theorem to hold.
In statistical terms, the distribution of the sample means is then approximately normal.

Breaking Down CLT

Figure: Whatever the form of the population distribution, the sampling distribution tends to a Gaussian, and its dispersion is given by the Central Limit Theorem.

Example of CLT

If an investor wants to analyze the overall return of a stock index made up of 1,000 stocks, they can take random samples of stocks from the index to estimate the return of the total index.
The samples must be random, and at least 30 stocks must be evaluated in each sample.
The average returns from these samples approximate the return for the whole index and are approximately normally distributed.
The approximation holds even if the actual returns for the whole index are not normally distributed.

Sampling Error

A sampling error is a statistical error that occurs when an analyst does not select a sample that represents the entire population of data, so the results found in the sample do not represent the results that would be obtained from the entire population.
Sampling is an analysis performed by selecting a specific number of observations from a larger population, and this work can produce both sampling errors and nonsampling errors.
Confidence Intervals
Introduction

Correcting Your Guesses

A confidence interval (abbreviated CI) is used to estimate a population parameter (a single number that describes a population) by using statistics (numbers that describe a sample of data).
For example, you might estimate the average household income (parameter) based on the average household income from a random sample of 1,000 homes (statistic). However, because sample results will vary, you need to add a measure of that variability to your estimate. This measure of variability is called the margin of error, the heart of a confidence interval.
Margin of Error

Factors Affecting Margin of Error

Three factors affect the size of the margin of error:
The confidence level (through its critical value z)
The sample size (n)
The amount of variability in the population (σ)

\[ \text{CI for mean} = \bar{X} \pm z \, \frac{\sigma}{\sqrt{n}} \tag{1} \]
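Equation (1) translates directly into a few lines of code. The following is a minimal sketch, assuming σ is known; the function name and the numbers fed to it (mean 50, σ = 10) are hypothetical and serve only to show how the margin of error reacts to the sample size.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) for the population mean when sigma is known."""
    # Two-sided critical value z for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical numbers: only n changes, so only the margin of error changes.
for n in (25, 100, 400):
    low, high = z_confidence_interval(sample_mean=50.0, sigma=10.0, n=n)
    print(f"n={n:>3}  interval=({low:.2f}, {high:.2f})")
```

Quadrupling n halves the margin of error, because n enters through its square root.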

The Confidence Level z

The confidence level of a confidence interval corresponds to the percentage of the time your interval would contain the true parameter if you took numerous random samples.
Typical confidence levels are 95% or 99% (many others are also used).
The confidence level determines z, the number of standard errors you add and subtract to get the percentage confidence you want.
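The "percentage of the time" reading can be illustrated by simulation. In this sketch the normal population (µ = 10, σ = 2), the sample size of 50 and the 5,000 repetitions are all assumed values; each repetition builds a 95% interval with z = 1.96 and records whether it contains µ.

```python
import math
import random
import statistics

rng = random.Random(3)
MU, SIGMA, N, Z95 = 10.0, 2.0, 50, 1.96  # hypothetical population and sample size

def interval_covers_mu():
    """Draw one sample, build its 95% z-interval, and report whether it contains MU."""
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = statistics.fmean(sample)
    margin = Z95 * SIGMA / math.sqrt(N)
    return x_bar - margin <= MU <= x_bar + margin

trials = 5_000
coverage = sum(interval_covers_mu() for _ in range(trials)) / trials
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The printed fraction should land close to 0.95, which is what "95% confidence" refers to.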

Some z Values

Percentage Confidence    z value
80                       1.28
90                       1.64
95                       1.96
98                       2.33
99                       2.58
Table: Some Selected z Values
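These table entries can be reproduced from the standard normal distribution. A short sketch using Python's standard library follows; the helper name `z_value` is an assumption for illustration.

```python
from statistics import NormalDist

def z_value(confidence_percent):
    """Two-sided critical value of the standard normal for a confidence level in percent."""
    alpha = 1 - confidence_percent / 100
    return NormalDist().inv_cdf(1 - alpha / 2)

z_table = {c: round(z_value(c), 2) for c in (80, 90, 95, 98, 99)}
for c, z in z_table.items():
    print(f"{c}%  z = {z}")
```

Each z is the 1 − α/2 quantile, since the remaining probability α is split equally between the two tails.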

Example

Suppose you work for a High Performance Computing Lab. You want to estimate, with 95% confidence, the average completion time of the next lot of the job. Assume that the previous completion time data has σ = 2.3. State your estimate if you find an average completion time of 7.5 seconds on 100 sample lots.
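\[ \text{Confidence level} = 100(1 - \alpha) = 95 \;\Rightarrow\; \alpha = 0.05,\; z = 1.96 \tag{2} \]
\[ \text{Margin of error} = 1.96 \times \frac{2.3}{\sqrt{100}} = 1.96 \times \frac{2.3}{10} \approx 0.45 \tag{3} \]
\[ \text{Interval} = (7.5 - 0.45,\; 7.5 + 0.45) = (7.05,\; 7.95) \tag{4} \]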
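The arithmetic of this example can be checked in a few lines. The values are those stated in the example; the variable names are chosen here for illustration.

```python
import math

# Values from the example; z = 1.96 is the critical value for 95% confidence.
x_bar, sigma, n, z = 7.5, 2.3, 100, 1.96

margin = z * sigma / math.sqrt(n)
interval = (x_bar - margin, x_bar + margin)

print(f"margin of error = {margin:.2f}")
print(f"95% CI = ({interval[0]:.2f}, {interval[1]:.2f})")
```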

Did You Notice?

The standard deviation (σ) has been changed into the standard error (σ/√n).


Student's t-Distribution

What is the t-Distribution?

The t-distribution is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small (< 30) and the population standard deviation is unknown.
The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails, meaning that it is more prone to producing values that fall far from its mean.
Like the Z tables for the normal distribution, there are t tables for the t-distribution.
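The practical effect of the heavier tails shows up when σ has to be estimated from a small sample. The sketch below assumes a hypothetical standard normal population and samples of size 7; it compares how often nominal 95% intervals contain the true mean when built with the normal value 1.96 versus the t-table value 2.447 (0.975 quantile, 6 degrees of freedom).

```python
import math
import random
import statistics

rng = random.Random(11)
MU, SIGMA, N = 0.0, 1.0, 7   # hypothetical normal population, small sample
Z_975 = 1.96                 # normal critical value for 95%
T_975_6DF = 2.447            # t-table value for 95% with 6 degrees of freedom

def coverage(critical_value, trials=20_000):
    """Fraction of intervals x_bar +/- critical_value * s / sqrt(N) that contain MU."""
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
        x_bar = statistics.fmean(sample)
        s = statistics.stdev(sample)  # sigma is treated as unknown
        margin = critical_value * s / math.sqrt(N)
        hits += (x_bar - margin <= MU <= x_bar + margin)
    return hits / trials

cover_z = coverage(Z_975)
cover_t = coverage(T_975_6DF)
print(f"coverage with z = 1.96 : {cover_z:.3f}")
print(f"coverage with t = 2.447: {cover_t:.3f}")
```

With the z value the intervals come out too narrow and cover the mean less often than 95% of the time; the wider t-based intervals restore the nominal level.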

The t Curve

Figure: The t-Curve



t-Distribution

\[ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} \tag{5} \]

The t-distribution with ν = n − 1 degrees of freedom is the sampling distribution of the t-value when the samples consist of independent, identically distributed observations from a normally distributed population.
Thus, for inference purposes, t is a useful "pivotal quantity" when the mean µ and variance σ² are unknown population parameters, in the sense that the t-value then has a probability distribution that depends on neither µ nor σ².

Example

The difference in the processor times of two different implementations of the same algorithm was measured on seven similar workloads. The differences are 1.5, 2.6, −1.8, 1.3, −0.5, 1.7, 2.4. Can we say with 99% confidence that one implementation is superior to the other?
Confidence level: 100(1 − α) = 99
Significance level α = 0.01 ⟹ 1 − α/2 = 0.995
n = 7, X̄ = 7.20/7 = 1.03
Sample variance s² = 2.57, s = √2.57 = 1.60
ν = 7 − 1 = 6
t[0.995; 6] = 3.707
Confidence interval = 1.03 ± t (s/√n) = 1.03 ± 3.707 × 1.60/√7 = (−1.21, 3.27)
As the confidence interval includes 0, we cannot say with 99% confidence that the mean difference is significantly different from zero, so neither implementation can be declared superior.
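The steps of this example can be replayed in code. This is a sketch using only the data and the table value t[0.995; 6] = 3.707 quoted above; because it works with unrounded intermediates, its interval can differ from the hand calculation in the second decimal place.

```python
import math
import statistics

differences = [1.5, 2.6, -1.8, 1.3, -0.5, 1.7, 2.4]
T_0995_6DF = 3.707  # t[0.995; 6], taken from the example (t table)

n = len(differences)
mean_diff = statistics.fmean(differences)
s = statistics.stdev(differences)        # sample standard deviation (divides by n - 1)
margin = T_0995_6DF * s / math.sqrt(n)
low, high = mean_diff - margin, mean_diff + margin

print(f"mean = {mean_diff:.2f}, s = {s:.2f}")
print(f"99% CI = ({low:.2f}, {high:.2f})")
print("interval includes zero:", low <= 0 <= high)
```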

Example 2

Six similar workloads were used on two systems. The observations are (5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7).
Is one system better than the other at the 90% confidence level?
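The paired differences (first system minus second) are −13.7, 13.1, −2.8, −1.1, −3.0, 5.6, with mean −0.32 and sample standard deviation 9.03.
The 0.95 quantile of a t-variate with 5 degrees of freedom is 2.015.
The 90% confidence interval is −0.32 ± 2.015 × 9.03/√6 = (−7.75, 7.11). As it includes 0, the systems are not significantly different.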
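The interval can be reproduced from the raw pairs. A minimal sketch follows, assuming the differences are taken as first system minus second; the helper name `paired_t_interval` is hypothetical and the t value 2.015 is the one quoted in the example.

```python
import math
import statistics

def paired_t_interval(pairs, t_critical):
    """Confidence interval for the mean of the paired differences a - b."""
    differences = [a - b for a, b in pairs]
    mean_diff = statistics.fmean(differences)
    std_err = statistics.stdev(differences) / math.sqrt(len(differences))
    return mean_diff - t_critical * std_err, mean_diff + t_critical * std_err

pairs = [(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)]
T_095_5DF = 2.015  # 0.95 quantile of t with 5 degrees of freedom, from the example

low, high = paired_t_interval(pairs, T_095_5DF)
print(f"90% CI = ({low:.2f}, {high:.2f})")
print("interval includes zero:", low <= 0 <= high)
```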

Example 3

The processor time required to execute a task was measured on two systems. The times on system A were 5.36, 16.57, 0.62, 1.41, 0.64, 7.26. The times on system B were 19.12, 3.52, 3.38, 2.50, 3.60, 1.74. Are the two systems significantly different?
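The slide leaves this example open, and here the measurements are not paired. One possible way to approach it, sketched below, is Welch's unequal-variance t interval for the difference of two means. That choice, the 90% confidence level, and the use of the t-table value 1.812 (0.95 quantile, 10 degrees of freedom, after rounding the computed degrees of freedom) are all assumptions made for this sketch, not something the lecture specifies.

```python
import math
import statistics

system_a = [5.36, 16.57, 0.62, 1.41, 0.64, 7.26]
system_b = [19.12, 3.52, 3.38, 2.50, 3.60, 1.74]

mean_diff = statistics.fmean(system_a) - statistics.fmean(system_b)
var_a = statistics.variance(system_a) / len(system_a)  # s_a^2 / n_a
var_b = statistics.variance(system_b) / len(system_b)  # s_b^2 / n_b
std_err = math.sqrt(var_a + var_b)

# Welch-Satterthwaite effective degrees of freedom (comes out close to 10 here).
dof = (var_a + var_b) ** 2 / (
    var_a ** 2 / (len(system_a) - 1) + var_b ** 2 / (len(system_b) - 1)
)

# ASSUMPTION: 90% confidence, t-table value for 10 degrees of freedom.
T_095_10DF = 1.812
low = mean_diff - T_095_10DF * std_err
high = mean_diff + T_095_10DF * std_err

print(f"mean difference = {mean_diff:.2f}, effective dof = {dof:.1f}")
print(f"90% CI = ({low:.2f}, {high:.2f})")
print("interval includes zero:", low <= 0 <= high)
```

Under these assumptions the interval contains zero, so the data would not show a significant difference between the two systems.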
Queries

QUERIES
??
Thanks

THANKS !!

FOR KINDNESS AND SUPPORT
