
ANALYSIS OF DATA

Apratim Guha
apratim@iima.ac.in

Sessions 1-5

Basic Concepts-I
An experiment is an occurrence whose result, or outcome,
is uncertain.
The set of all possible outcomes is called the sample space
for the experiment.

Example

One Die

Two Dice
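The two sample spaces named above can be listed explicitly. The following is a small Python sketch for illustration only; the variable names are hypothetical and not from the slides.

```python
from itertools import product

# Sample space for one roll of a six-sided die.
one_die = list(range(1, 7))

# Sample space for two rolls: every ordered pair (first roll, second roll).
two_dice = list(product(one_die, repeat=2))

print(len(one_die), "outcomes for one die")
print(len(two_dice), "outcomes for two dice")
```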

Example 1
A readership survey conducted among the adult
population showed that 35% read Times, 15% read
Express and 25% read Herald; 10% read both Times and
Express, 8% read both Express and Herald, 5% read both
Times and Herald; 4% read all three publications.

Sample space =?

Basic Concepts-II
Given a sample space S, an event E is a subset of S.

The outcomes in E are called the favourable outcomes.

We say that E occurs in a particular experiment if the outcome of that experiment is one of the elements of E.

Example 2
When a fair die is rolled, probability of each outcome is 1/6.

So probability of E: Even numbers is P(E) =?

But a die does not need to be fair, and outcomes are not
always equally likely!

For example, suppose we have a loaded die with probabilities 0.25, 0.15, 0.15, 0.15, 0.15 and 0.15 for the outcomes 1, 2, 3, 4, 5 and 6 respectively.

Then P(E) = ?
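One way to check both answers is to add up the probabilities of the favourable outcomes. This is a minimal Python sketch; the helper name `prob` is hypothetical, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# Fair die: every face has probability 1/6.
fair = {face: Fraction(1, 6) for face in range(1, 7)}

# Loaded die from the slide: 0.25 for face 1 and 0.15 for each of faces 2 to 6.
loaded = {1: Fraction(25, 100)}
loaded.update({face: Fraction(15, 100) for face in range(2, 7)})

def prob(event, die):
    """P(event) is the sum of the probabilities of the outcomes in the event."""
    return sum(die[face] for face in event)

even = {2, 4, 6}
p_even_fair = prob(even, fair)
p_even_loaded = prob(even, loaded)
print("fair:", p_even_fair, " loaded:", p_even_loaded)
```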

Basic Concepts-III
For any event A:
i) $0 \le P(A) \le 1$.
ii) $P(\emptyset) = 0$, where $\emptyset$ is the null event (no outcomes).
iii) $P(S) = 1$, where S is the sample space (all outcomes).
iv) If A and B are two events with no common outcome (i.e. mutually exclusive), then $P(A \cup B) = P(A) + P(B)$.
v) If A and B are two mutually exclusive events such that $P(A \cup B) = 1$, then A and B are called complements of each other. We denote B as $\bar{A}$ or $A^C$.

Example 2 continued
Consider the fair die being rolled twice
P(sum is even) = ?
P(product is odd) =?
P(sum is even or product is odd) =?

General Formula

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

What Happens With More Than 2 Events?
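For three events, the same bookkeeping (add the single events, subtract the pairwise overlaps, add back the triple overlap) gives the standard inclusion-exclusion formula. It is stated here as a reference point rather than as the slide's own derivation:

```latex
P(A \cup B \cup C) = P(A) + P(B) + P(C)
                   - P(A \cap B) - P(A \cap C) - P(B \cap C)
                   + P(A \cap B \cap C)
```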



Example 2 continued
Consider the fair die being rolled twice
P(sum is even) = ?
P(product is odd) =?
P(sum is even or product is odd) =?

Re-compute the probabilities for the loaded die.
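These three probabilities can be checked by enumerating all 36 ordered pairs. This is a Python sketch that assumes the two rolls are independent; the function name `two_roll_probs` is hypothetical.

```python
from fractions import Fraction
from itertools import product

fair = {face: Fraction(1, 6) for face in range(1, 7)}
loaded = {1: Fraction(25, 100)}
loaded.update({face: Fraction(15, 100) for face in range(2, 7)})

def two_roll_probs(die):
    """Return P(sum even), P(product odd) and P(sum even or product odd)."""
    p_sum_even = p_prod_odd = p_both = Fraction(0)
    for a, b in product(die, repeat=2):
        p = die[a] * die[b]  # independence of the two rolls (assumed)
        sum_even = (a + b) % 2 == 0
        prod_odd = (a * b) % 2 == 1
        if sum_even:
            p_sum_even += p
        if prod_odd:
            p_prod_odd += p
        if sum_even and prod_odd:
            p_both += p
    # General formula: P(A or B) = P(A) + P(B) - P(A and B).
    p_either = p_sum_even + p_prod_odd - p_both
    return p_sum_even, p_prod_odd, p_either

fair_result = two_roll_probs(fair)
loaded_result = two_roll_probs(loaded)
print("fair:  ", fair_result)
print("loaded:", loaded_result)
```

An odd product forces both rolls to be odd, which makes the sum even, so the "or" event coincides with "sum is even" for either die.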



Example 1 (continued)
A readership survey conducted among the adult
population showed that 35% read Times, 15% read
Express and 25% read Herald; 10% read both Times and
Express, 8% read both Express and Herald, 5% read both
Times and Herald; 4% read all three publications.

Sample space =?
Probability of reading exactly 2 newspapers =?
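The "exactly 2" question can be worked out from the overlaps, since each pairwise figure also counts the people who read all three. This is a short Python sketch in percentage points; the variable names are hypothetical.

```python
# Survey percentages (T = Times, E = Express, H = Herald).
T, E, H = 35, 15, 25
TE, EH, TH = 10, 8, 5
TEH = 4

# Each pairwise overlap includes the triple readers once, so remove them.
exactly_two = (TE - TEH) + (EH - TEH) + (TH - TEH)

# Inclusion-exclusion for "reads at least one of the three".
at_least_one = T + E + H - TE - EH - TH + TEH
reads_none = 100 - at_least_one

print("exactly two:", exactly_two, "%")
print("at least one:", at_least_one, "%")
print("none:", reads_none, "%")
```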

Example 2 continued
Work out the probabilities
i) P(sum is even given the product is odd)
ii) P(product is odd given sum is even)

for the fair die as well as the loaded die.

How to proceed?

Conditional Probability
Let A, B be two events such that P(A) > 0. Then

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

is the conditional probability of B given A.
Conditional probability given an event A is computed by
restricting consideration only to A.
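"Restricting consideration only to A" can be made concrete for the two-dice questions: keep only the outcomes in A, then see what share of that probability also lies in B. This is a Python sketch that assumes independent rolls; `conditional`, `sum_even` and `prod_odd` are hypothetical names.

```python
from fractions import Fraction
from itertools import product

fair = {face: Fraction(1, 6) for face in range(1, 7)}
loaded = {1: Fraction(25, 100)}
loaded.update({face: Fraction(15, 100) for face in range(2, 7)})

def conditional(event_b, event_a, die):
    """P(B | A) = P(A and B) / P(A) for two independent rolls of `die`."""
    p_a = p_ab = Fraction(0)
    for roll in product(die, repeat=2):
        p = die[roll[0]] * die[roll[1]]
        if event_a(roll):
            p_a += p
            if event_b(roll):
                p_ab += p
    return p_ab / p_a

def sum_even(roll):
    return (roll[0] + roll[1]) % 2 == 0

def prod_odd(roll):
    return (roll[0] * roll[1]) % 2 == 1

for name, die in (("fair", fair), ("loaded", loaded)):
    print(name,
          "P(sum even | product odd) =", conditional(sum_even, prod_odd, die),
          "P(product odd | sum even) =", conditional(prod_odd, sum_even, die))
```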

Two Dice: sum is even?



Theorem 1.1: Law of total probability


Let A and B be two events such that 0 < P(A) < 1. Then

$$P(B) = P(B \mid A)P(A) + P(B \mid \bar{A})P(\bar{A}).$$

Note:
1. $P(A \cap B) = P(B \mid A)P(A)$.
2. $P(A \cap B \cap C) = P(C \mid A \cap B)P(B \mid A)P(A)$ (chain rule).

Theorem 1.2: Bayes' Theorem


Let A and B be two events such that 0 < P(A) < 1 and P(B) > 0.
Then

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid \bar{A})P(\bar{A})} = \frac{P(B \mid A)P(A)}{P(B)}.$$

Example : Bolt Factory


Machines A, B and C manufacture 25%, 35% and 40% of
total production respectively.
Of their output 5%, 4% and 2% are defective respectively.
A bolt is drawn at random from the produce and is found
to be defective.
What is the probability that it was manufactured by
machine A?
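The bolt question combines the law of total probability (for the overall defect rate) with Bayes' theorem (for the machine). This is a Python sketch using exact fractions; the dictionary names are hypothetical.

```python
from fractions import Fraction

# Share of total production and defect rate for each machine (from the slide).
share = {"A": Fraction(25, 100), "B": Fraction(35, 100), "C": Fraction(40, 100)}
defect_rate = {"A": Fraction(5, 100), "B": Fraction(4, 100), "C": Fraction(2, 100)}

# Law of total probability: P(defective) summed over the machines.
p_defective = sum(share[m] * defect_rate[m] for m in share)

# Bayes' theorem: P(machine | defective).
posterior = {m: share[m] * defect_rate[m] / p_defective for m in share}

print("P(defective) =", p_defective)
for m, p in posterior.items():
    print(f"P({m} | defective) = {p} (about {float(p):.4f})")
```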

Use a tree diagram



Disease Screening
A: individual has the disease
$\bar{A}$: individual does not have the disease

Rare disease: P(A) = p is very small

Test:
B: test result indicates individual has disease,

$P(B \mid A) = 1$ and $P(B \mid \bar{A}) = \epsilon$, a very small number.

P(A|B) = ?

Tree Diagram

Interpretation
Illustration:
P(A) = 0.0001
P(B|A) = 1
$P(B \mid \bar{A}) = 0.0001$

P(A|B) =?
What if P(A) = 0.01?
Conclusion: Test is not good enough unless p is much higher than the false-positive rate $P(B \mid \bar{A})$.

Better to look at the at-risk population only.
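The two illustrations can be evaluated with Bayes' theorem. This is a Python sketch; the function and parameter names are hypothetical, and the test is assumed to detect every true case, as on the slide.

```python
def posterior_disease(p, false_positive_rate, sensitivity=1.0):
    """P(disease | positive test) by Bayes' theorem.

    p                   : prevalence P(A)
    false_positive_rate : P(B | not A)
    sensitivity         : P(B | A), taken to be 1 as on the slide
    """
    true_positive = sensitivity * p
    false_positive = false_positive_rate * (1 - p)
    return true_positive / (true_positive + false_positive)

rare = posterior_disease(p=0.0001, false_positive_rate=0.0001)
common = posterior_disease(p=0.01, false_positive_rate=0.0001)
print(f"P(A|B) with P(A) = 0.0001: {rare:.4f}")
print(f"P(A|B) with P(A) = 0.01:   {common:.4f}")
```

When the prevalence equals the false-positive rate, a positive result is close to a coin flip. Raising the prevalence a hundredfold makes the same test far more informative.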



Independence
Two events A and B are called independent if

$$P(A \cap B) = P(A)P(B)$$

The idea comes from the fact that A and B being independent means
$P(A \mid B) = P(A)$
and/or
$P(B \mid A) = P(B)$.
What happens to the chain rule in case of independence?

Example
Randomly drawing from a pack of cards, getting a king is
independent of getting a spade.
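The card claim can be verified by counting over a standard 52-card deck. This is a Python sketch; the rank and suit labels are just illustrative names.

```python
from fractions import Fraction
from itertools import product

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))  # 52 equally likely cards

def prob(event):
    """Probability of an event when every card is equally likely."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

p_king = prob(lambda card: card[0] == "K")
p_spade = prob(lambda card: card[1] == "spades")
p_king_of_spades = prob(lambda card: card == ("K", "spades"))

# Independence means P(king and spade) = P(king) * P(spade).
print(p_king_of_spades, "=", p_king, "*", p_spade)
```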

Note
Mutually exclusive (non-null) events are never independent.
(Non-null) Sub-events are never independent of the
corresponding super-events, and vice-versa, (unless ?)

Mutual Independence versus Pairwise Independence
When are three events independent?

Example
Two fair dice are rolled.
A = getting a six on the first die
B = getting a six on the second die
C = getting equal outcomes on the two dice

A,B,C are pairwise independent


Are they independent?
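The question can be settled by enumeration: check the product rule for each pair, then for all three events together. This is a Python sketch; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

def A(r):
    return r[0] == 6      # six on the first die

def B(r):
    return r[1] == 6      # six on the second die

def C(r):
    return r[0] == r[1]   # equal outcomes

# Pairwise independence: the product rule holds for every pair.
pairwise = all(
    prob(lambda r, e=e, f=f: e(r) and f(r)) == prob(e) * prob(f)
    for e, f in [(A, B), (A, C), (B, C)]
)

# Mutual independence additionally needs the product rule for all three.
p_all = prob(lambda r: A(r) and B(r) and C(r))
mutual = pairwise and p_all == prob(A) * prob(B) * prob(C)

print("pairwise independent:", pairwise)
print("P(A and B and C) =", p_all, "vs product", prob(A) * prob(B) * prob(C))
print("mutually independent:", mutual)
```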

Risk
What would you prefer: Rs. 1000 for sure or an
investment that gives a return of Rs. 2000 with probability
0.5 and a loss of Rs. 1000 with probability 0.5?

If you are risk-neutral, it should not matter to you.
If you are risk-prone, you'll go for the investment.
If you are risk-averse, you'll prefer the sure amount.

Look at the Mektek problem. Solve it from the risk-neutral point of view.

DISTRIBUTIONS/
RANDOM VARIABLES

Random Variables
Random variable: Response of random experiments taking
different numerical values with certain probabilities.
The probability models describe the random variables.

EXAMPLES
Number of car crashes in Nagpur tomorrow
Amount of rainfall in India next month
Salary offered to a PGPN passout
Length of time a cancer patient survives after detection

Discrete Random Variables


Takes finitely/countably many values: typically integers

All examples that we are going to look at in this class are numbers of something:
Number of earthquakes in Japan in one year
Number of errors in each page of a text book
Number of heads in 1000 coin tosses

Formal Description
Value (x)    Probability (p(x))
1            p(1)
2            p(2)
...          ...
Total        1

The probability function above is called probability mass function (p.m.f.)


Values can, of course, be any other set of integers.

Distribution Function
A random variable X can be characterised by its (Cumulative) Distribution Function $F(x) = P(X \le x)$.

The distribution function tells us the chance that
The number of accidents is below a certain margin
Rainfall is below a certain amount
Salary is above a certain threshold
The patient survives more than some stipulated time limit

The last two cases look at 1-F(x), which is called the Survival Function.

Some Properties
1. $0 \le F(x) \le 1$.

2. F(x) is non-decreasing.

3. $\lim_{x \to -\infty} F(x) = 0$; $\lim_{x \to \infty} F(x) = 1$.

4. F(x) is right continuous.

Distribution function
The distribution function for a discrete random variable
looks like a step, and hence is called a step function.
Example
p(1) = 0.1, p(2) = 0.2 and p(5) = 0.7.
What is the distribution function?
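The step shape can be seen by evaluating F at a few points on either side of each jump. This is a Python sketch using exact fractions; the function name `cdf` is hypothetical.

```python
from fractions import Fraction

# p.m.f. from the example: p(1) = 0.1, p(2) = 0.2, p(5) = 0.7.
pmf = {1: Fraction(1, 10), 2: Fraction(2, 10), 5: Fraction(7, 10)}

def cdf(x):
    """F(x) = P(X <= x): add up the p.m.f. over all values not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

# F is flat between the values 1, 2 and 5 and jumps by p(value) at each of them.
for x in (0, 1, 1.5, 2, 4.99, 5, 10):
    print(f"F({x}) = {float(cdf(x))}")
```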

Expectation
Expectation of a random variable is the weighted average of its values, weighted by the corresponding probabilities:

The expectation (or mean/average) of a discrete random variable X taking values $x_1, x_2, \ldots, x_n$ with probabilities $p(x_1), p(x_2), \ldots, p(x_n)$ is given by the weighted average

$$E(X) = \sum_{i=1}^{n} x_i\,p(x_i).$$

Examples
1. p(1) = 0.1, p(2) = 0.2 and p(5) = 0.7. E(X) = ?

2. I usually buy either one or two packets of biscuits each time I go to the store with probabilities 0.4 and 0.4,
respectively. The packets cost Rs.10 each. However,
sometimes, (with 0.2 probability,) there is a sale when
the packets are sold at Rs.8 per pack, when I buy 10
packs. What is my expected cost?
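Both expectations follow directly from the weighted-average definition. In this Python sketch the biscuit example is read as three possible bills: Rs.10 (one packet), Rs.20 (two packets) and Rs.80 (ten packets at the sale price). That reading, and the names used, are assumptions.

```python
from fractions import Fraction

def expectation(pmf):
    """E(X) = sum of value * probability over the p.m.f."""
    return sum(x * p for x, p in pmf.items())

# Example 1: p(1) = 0.1, p(2) = 0.2, p(5) = 0.7.
example1 = {1: Fraction(1, 10), 2: Fraction(2, 10), 5: Fraction(7, 10)}

# Example 2: cost in rupees mapped to probability (assumed reading of the slide).
biscuit_cost = {10: Fraction(4, 10), 20: Fraction(4, 10), 80: Fraction(2, 10)}

print("E(X) =", expectation(example1))
print("Expected cost = Rs.", expectation(biscuit_cost))
```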

Properties of a Function
A biased coin is tossed. P(head) = p.
Let X = 1 if head, 0 if tail. E(X) = ? V(X)=?

Properties (True for all variables)


1. If X is a random variable, and a and b are two constants, then
E(aX) = aE(X), E(b+X) = b+E(X).
E.g. if USD 1 = Rs. a and the expected salary of a PGP passout is USD X, then in rupees it is Rs. aX. Then again, if everyone is paid a joining bonus of USD b, the expected salary is USD (b+X).
2. If X and Y are two random variables, then
E(X+Y) = E(X) + E(Y).
E.g. the expected sum of the salaries of two friends is the sum of their expectations.

Variance
The variance of a random variable measures its spread.
For a discrete random variable X, variance is given by the weighted average of the squared deviations from the expectation

$$V(X) = \sum_{i=1}^{n} (x_i - E(X))^2\,p(x_i) = \sum_{i=1}^{n} x_i^2\,p(x_i) - E(X)^2.$$

Examples: Compute the variance for our two examples.
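The two forms of the variance formula can be checked against each other on the two earlier examples. This Python sketch again reads the biscuit example as bills of Rs.10, Rs.20 and Rs.80, which is an assumption about the slide's intent; the function names are hypothetical.

```python
from fractions import Fraction

def expectation(pmf):
    return sum(x * p for x, p in pmf.items())

def variance_by_deviations(pmf):
    """Weighted average of squared deviations from the mean."""
    mean = expectation(pmf)
    return sum((x - mean) ** 2 * p for x, p in pmf.items())

def variance_by_shortcut(pmf):
    """E(X^2) - E(X)^2."""
    return sum(x * x * p for x, p in pmf.items()) - expectation(pmf) ** 2

example1 = {1: Fraction(1, 10), 2: Fraction(2, 10), 5: Fraction(7, 10)}
biscuit_cost = {10: Fraction(4, 10), 20: Fraction(4, 10), 80: Fraction(2, 10)}

for name, pmf in (("example 1", example1), ("biscuit cost", biscuit_cost)):
    print(name, variance_by_deviations(pmf), variance_by_shortcut(pmf))
```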



Properties (True for all variables)


1. If X is a random variable, and a and b are two constants, then
$V(aX) = a^2 V(X)$, $V(b+X) = V(X)$.
2. If X and Y are two independent random variables, then
V(X+Y) = V(X) + V(Y).
Otherwise there will be a cross-term. Let's not bother about that here.


Two Dice
Y = sum of the outcomes of two throws of a fair die

E(Y) =??? V(Y) =???


Any easier way to compute it?
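One easier route is to use the additivity properties: the mean of a sum is the sum of the means, and for independent throws the variances add too. This Python sketch compares that shortcut with a brute-force enumeration; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Brute force: build the p.m.f. of Y = sum of two fair throws.
pmf_sum = {}
for a, b in product(faces, repeat=2):
    pmf_sum[a + b] = pmf_sum.get(a + b, Fraction(0)) + Fraction(1, 36)

mean_y = sum(y * p for y, p in pmf_sum.items())
var_y = sum(y * y * p for y, p in pmf_sum.items()) - mean_y ** 2

# Shortcut: moments of a single throw, then add them up.
mean_one = Fraction(sum(faces), 6)
var_one = Fraction(sum(f * f for f in faces), 6) - mean_one ** 2

print("enumeration:", mean_y, var_y)
print("shortcut:   ", 2 * mean_one, 2 * var_one)
```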

Special Random Variables


When I repeat an exercise which results in only two possible
outcomes, how do I obtain the probability distribution of
number of occurrences of what I want?

Examples:
Probability distribution of the number of rainy days in Nagpur
this year.
Probability distribution of number of children out of 100
randomly chosen kids of age 10 who have dropped out of
school.

Example
10 shots fired at a target.
P(Success) = 0.2
P(at least 2 hits in 10 shots) =?
P(at least 2 hits in 10 shots| at least 1 hit in 10 shots) =?

Binomial Random Variable


A Bernoulli trial is a random event with only two possible
outcomes, say success and failure, with fixed
probabilities. We get to define success!
A binomial random variable is the number of successes
(or failures) in a fixed number of independent Bernoulli
trials.

Ber(p)
1 trial with probability of success p

Y = number of successes
= 0 if failure, 1 if success

P(Y=1) = p, P(Y=0) = 1-p.

E(Y) = p
Var(Y) = p(1-p)

Binomial(n,p)
n independent trials with probability of success p

X = number of successes

$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, 2, \ldots, n$$
WHY???

E(X) = np
Var(X) = np(1-p)
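One way to see why the formula holds: every particular sequence of n trials with x successes has probability p^x (1-p)^(n-x), and there are "n choose x" such sequences. This Python sketch checks that by listing all sequences for a small case (n = 4 and p = 1/5 are chosen only for illustration).

```python
from fractions import Fraction
from itertools import product
from math import comb

n, p = 4, Fraction(1, 5)

# Add up the probability of every success/failure sequence, grouped by
# the number of successes it contains (1 = success, 0 = failure).
by_enumeration = {x: Fraction(0) for x in range(n + 1)}
for sequence in product([0, 1], repeat=n):
    successes = sum(sequence)
    by_enumeration[successes] += p ** successes * (1 - p) ** (n - successes)

# The closed-form binomial p.m.f.
by_formula = {x: comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)}

print(by_enumeration == by_formula)
print(by_formula)
```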

R: dbinom(x, n, p) for individual probabilities (p.m.f.); use pbinom(x, n, p) for the c.d.f.

Example (#2 from Casemat)


P(Success) = 0.2
a) 10 shots fired at a target.
P(at least 2 hits in 10 shots) =?
P(at least 2 hits in 10 shots| at least 1 hit in 10 shots) =?
b) 5 more shots are fired.
P(at least 2 hits in 15 shots)=?
What if the shots were fired from a different angle?
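Parts (a) and (b) can be evaluated from the binomial p.m.f. This Python sketch assumes the shots are independent with the same hit probability throughout. That is exactly what the last question on the slide puts in doubt, so it covers only the unchanged-angle case. The function names are hypothetical.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def at_least(k, n, p):
    """P(X >= k) = 1 - P(X <= k - 1)."""
    return 1 - sum(binom_pmf(x, n, p) for x in range(k))

p_hit = 0.2
p_two_in_10 = at_least(2, 10, p_hit)
# "At least 2" implies "at least 1", so the joint event is just "at least 2".
p_two_given_one = p_two_in_10 / at_least(1, 10, p_hit)
p_two_in_15 = at_least(2, 15, p_hit)

print(f"P(at least 2 in 10)               = {p_two_in_10:.4f}")
print(f"P(at least 2 | at least 1, in 10) = {p_two_given_one:.4f}")
print(f"P(at least 2 in 15)               = {p_two_in_15:.4f}")
```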

Additive Property
If X~Bin(n, p) and Y~Bin(m, p) are independent, then
X+Y~Bin(n+m, p).

Example: Number of hits has distribution Bin(10, 0.2). Try 5 more times. Number of hits now has distribution Bin(15, 0.2).
Note: p has to be the same, and independence is needed.
What happens otherwise?

Example
X~Bin(10,0.2), Y~Bin(5,0.4) are independent.

P(X+Y>0) =?
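Because the two success probabilities differ, X+Y is not binomial, but the complement event is easy: X+Y = 0 only when both X and Y are 0. This is a Python sketch; the names are hypothetical.

```python
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Independence lets us multiply: P(X + Y = 0) = P(X = 0) * P(Y = 0).
p_zero = binom_pmf(0, 10, 0.2) * binom_pmf(0, 5, 0.4)
p_positive = 1 - p_zero

# Cross-check by summing the joint p.m.f. over every pair with x + y > 0.
p_positive_check = sum(
    binom_pmf(x, 10, 0.2) * binom_pmf(y, 5, 0.4)
    for x in range(11) for y in range(6) if x + y > 0
)

print(f"P(X + Y > 0) = {p_positive:.6f}")
```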

Example
Assume that a particular machine breaks down once every month on average. Assume that the number of breakdowns follows a Poisson distribution.
What is the probability that there is no breakdown in 3 months?
What is the probability of exactly one breakdown in 3 months?

Poisson Distribution
Used to model rare events
Is an approximation of a Bin(n,p) random variable with large n and small p such that $np = \lambda$ is moderate

X = number of events; $\lambda$ = rate

$$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots$$

$E(X) = \lambda$, $\mathrm{Var}(X) = \lambda$

R: dpois(x, lambda): pmf, ppois(x, lambda): cdf
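Applied to the machine-breakdown example (one breakdown per month on average, so a rate of 3 over three months), the p.m.f. can be evaluated directly. This is a Python sketch; `pois_pmf` is a hypothetical helper, not a library function.

```python
from math import exp, factorial

def pois_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** x / factorial(x)

rate_three_months = 3 * 1.0  # one breakdown per month, three months

p_none = pois_pmf(0, rate_three_months)
p_exactly_one = pois_pmf(1, rate_three_months)

print(f"P(no breakdown in 3 months)          = {p_none:.4f}")
print(f"P(exactly one breakdown in 3 months) = {p_exactly_one:.4f}")
```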



Additive Property
If $X \sim \text{Poisson}(\lambda_1)$ and $Y \sim \text{Poisson}(\lambda_2)$ are independent, then $X+Y \sim \text{Poisson}(\lambda_1+\lambda_2)$.

Note: Independence is needed.

Example
What is the probability that there is exactly one breakdown in the first two months and another one in the third month?

What is the probability that there is at least one breakdown in the first two months and none during the last two months during the three month period?
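Both questions can be handled by splitting the three months into non-overlapping pieces. This Python sketch assumes the earlier rate of one breakdown per month, and also that counts in disjoint periods are independent Poisson variables; the second assumption is not stated on the slide.

```python
from math import exp, factorial

def pois_pmf(x, lam):
    return exp(-lam) * lam ** x / factorial(x)

rate = 1.0  # breakdowns per month (from the earlier example)

# Exactly one breakdown in months 1-2 (rate 2) and exactly one in month 3 (rate 1).
p_one_then_one = pois_pmf(1, 2 * rate) * pois_pmf(1, 1 * rate)

# "At least one in months 1-2 and none in months 2-3" means:
# at least one in month 1 and none in months 2 and 3.
p_early_only = (1 - pois_pmf(0, 1 * rate)) * pois_pmf(0, 2 * rate)

print(f"{p_one_then_one:.4f}")
print(f"{p_early_only:.4f}")
```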
CONTINUOUS
DISTRIBUTIONS

Continuous Random Variables


A continuous response or random variable, X, is described in terms of a probability density function (p.d.f.) f(x):

$$P(a < X < b) = \int_a^b f(x)\,dx$$ (area under the p.d.f. from a to b)

for any a < b.

We need
1. $f(x) \ge 0$,
2. Total area = 1, i.e. $\int_{-\infty}^{\infty} f(x)\,dx = 1$.

The Cumulative Distribution Function


For a continuous random variable, the C.D.F. $F(x) = P(X \le x)$ is given by

$$F(x) = \int_{-\infty}^{x} f(y)\,dy.$$

Note that F(x) here is a continuous non-decreasing function.

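Mean and Variance of a Continuous Random Variable
The mean is given by

$$\mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx.$$

The variance is given by

$$\sigma^2 = E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2.$$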

Uniform Distribution U(a,b)


X is uniformly distributed on an interval (a,b) if it has the density function

$$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$

Typically used to model small errors, e.g. rounding off errors.

Notation: X~U(a,b).

Mean: $\dfrac{a+b}{2}$, Variance: $\dfrac{(b-a)^2}{12}$.

R: dunif(x,a,b): pdf, punif(x,a,b): cdf, qunif(p,a,b): pth quantile/100pth percentile.

Example
Suppose that the lengths of 1" screws are actually uniformly distributed between 0.99 and 1.02.
What is the probability that a screw is longer than 1.01?
In a pack of 120 screws, how many are expected to be
longer than 1.01?
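For a uniform variable the probability is the share of the interval lying beyond the cut-off, and the expected count in a pack is the pack size times that probability. This is a Python sketch; `unif_cdf` is a hypothetical helper that plays the role of R's punif.

```python
def unif_cdf(x, a, b):
    """F(x) for X ~ U(a, b)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

a, b = 0.99, 1.02
p_long = 1 - unif_cdf(1.01, a, b)   # P(X > 1.01)
expected_long = 120 * p_long        # expected number among 120 screws

print(f"P(longer than 1.01) = {p_long:.4f}")
print(f"Expected number in a pack of 120 = {expected_long:.1f}")
```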

Normal Distribution
Found almost everywhere: considered to be the natural
distribution for a number of features for large groups: e.g.
Height, Weight, Grades ...

Averages for large groups have approximately normal distributions.

Normal Distribution
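The p.d.f. of the normal distribution with mean $\mu$ and variance $\sigma^2$ is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

If X is normal with mean $\mu$ and variance $\sigma^2$, we write $X \sim N(\mu, \sigma^2)$.
If $\mu = 0$ and $\sigma^2 = 1$, then X has the standard normal distribution.
P.D.F. of standard normal: $\phi(\cdot)$
CDF: $\Phi(\cdot)$
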
Properties
1. If $X \sim N(\mu, \sigma^2)$, then $(X-\mu)/\sigma \sim N(0,1)$. Hence, we only need to tabulate the standard normal distribution values.
2. Normal p.d.f. is symmetric around its mean $\mu$, and its shape depends on the s.d. $\sigma$. The higher the s.d., the flatter the curve.

1. In particular, the standard normal density $\phi(\cdot)$ is symmetric around 0, i.e. $\phi(-z) = \phi(z)$.
2. This means $\Phi(-z) = 1 - \Phi(z)$.
3. We can re-write this as $\Phi(z) + \Phi(-z) = 1$, so $\Phi(0) = 0.5$.

Standard Normal Values ($F(z) = \Phi(z)$)

Here F(z) (or $\Phi(z)$) is the area under the standard normal pdf $\phi(\cdot)$ up to z.

Example

X~N(5,16). What is P(1<X<13)?


What is the 80-th percentile of X?

80th percentile = 0.8th quantile

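R: dnorm(x, mu, sigma): pdf, pnorm(x, mu, sigma): cdf, qnorm(p, mu, sigma): pth quantile/100pth percentile, where mu = $\mu$ and sigma = $\sigma$ (the s.d., not the variance).

Put mu = 0, sigma = 1 for standard normal.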
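The example can be evaluated with Python's standard-library `statistics.NormalDist`, used here as a stand-in for the R calls. Note that N(5,16) has variance 16, so the standard deviation passed in is 4.

```python
from statistics import NormalDist

# X ~ N(5, 16): mean 5, variance 16, hence standard deviation 4.
X = NormalDist(mu=5, sigma=4)

# P(1 < X < 13) = F(13) - F(1), i.e. Phi(2) - Phi(-1) after standardising.
p_between = X.cdf(13) - X.cdf(1)

# 80th percentile = 0.8th quantile.
q80 = X.inv_cdf(0.8)

print(f"P(1 < X < 13)   = {p_between:.4f}")
print(f"80th percentile = {q80:.4f}")
```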



Result
For two independent normal random variables X~N(a,v)
and Y~N(b,u),
i. X+Y is normal with mean a+b and variance v+u.
ii. X-Y is normal with mean a-b and variance v+u.

Example: X~N(5,16), Y~N(-5,9). What is P(X-Y>15)?
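By result (ii), X-Y here is normal with mean 5-(-5) = 10 and variance 16+9 = 25. This Python sketch uses the standard-library `statistics.NormalDist` and assumes X and Y are independent, as the result requires.

```python
from math import sqrt
from statistics import NormalDist

# X ~ N(5, 16) and Y ~ N(-5, 9), independent.
mean_diff = 5 - (-5)
var_diff = 16 + 9  # variances add even for a difference

diff = NormalDist(mu=mean_diff, sigma=sqrt(var_diff))
p_exceeds_15 = 1 - diff.cdf(15)

print(f"P(X - Y > 15) = {p_exceeds_15:.4f}")
```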



An Application of the Normal Distribution: Central Limit Theorem
Consider a number of random variables: $X_1, X_2, \ldots, X_n$ for some large value n
Let all of them be independent and identically distributed (IID)

How is the sample mean distributed?

Example
There are 3 lunch specials in a restaurant: A, B and C.
A student eats these specials with probabilities 60%, 20% and 20% respectively.
A costs Rs.100, B costs Rs.140 and C costs Rs.150.
a) Obtain the distribution of the student's daily expenditure on lunch.
b) Obtain the distribution of the student's average expenditure on lunch over two days.
c) Obtain the distribution of the student's average expenditure on lunch over 30 days.
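Parts (b) and (c) can be computed exactly by repeatedly combining one more day's spend with the running total. This Python sketch assumes the daily choices are independent; the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# (a) Daily expenditure: Rs.100, Rs.140, Rs.150 with prob. 0.6, 0.2, 0.2.
daily = {100: Fraction(6, 10), 140: Fraction(2, 10), 150: Fraction(2, 10)}

def total_spend_dist(daily_pmf, days):
    """p.m.f. of the total spend over `days` independent days."""
    dist = {0: Fraction(1)}
    for _ in range(days):
        updated = defaultdict(Fraction)
        for total, p in dist.items():
            for cost, q in daily_pmf.items():
                updated[total + cost] += p * q
        dist = dict(updated)
    return dist

def average_spend_dist(daily_pmf, days):
    """p.m.f. of the average daily spend over `days` days."""
    totals = total_spend_dist(daily_pmf, days)
    return {Fraction(total, days): p for total, p in totals.items()}

avg2 = average_spend_dist(daily, 2)    # (b)
avg30 = average_spend_dist(daily, 30)  # (c): many values, so not printed

for value in sorted(avg2):
    print(f"average = {float(value):6.1f}  probability = {float(avg2[value]):.2f}")
print("distinct values of the 30-day average:", len(avg30))
```

With 30 days the exact table is long, which is the motivation for approximating it instead.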

Why is there a distribution of $\bar{X}$?

Consider other students who go to the restaurant and eat there as well.
Assume there are many of them, but all of them follow the same daily demand distribution of A, B and C.
All of them have their own $\bar{X}$, hence we can talk of a distribution of $\bar{X}$.

Sampling Distribution

Distribution of $\bar{X}$

The Central Limit Theorem (CLT)


$X_1, \ldots, X_n$: IID sample with mean $\mu$ and variance $\sigma^2$

For large sample size n, $\dfrac{\sqrt{n}(\bar{X} - \mu)}{\sigma}$ is approximately N(0,1).

Or, equivalently:

For large sample size n, $\bar{X}$ is approximately $N(\mu, \sigma^2/n)$.

An alternative form of the CLT

Sum of samples:

For large n, $\sum_{i=1}^{n} X_i$ is approximately normal with mean $n\mu$ and variance $n\sigma^2$.

Distribution of Proportion

Indicator variables: denote whether an event occurs or not, and are typically valued 0 or 1.
Distributed as Bernoulli(p) $\equiv$ Binomial(1,p), where probability of event = p.

Let n = number of trials

Proportion $\hat{p}$ = Total no. of occurrences/n
Total no. of occurrences can be thought of as sum of indicator variables
Distribution of total no. of occurrences: Binomial(n, p).

Proportion can be thought of as sample mean of indicator variables

Distribution of Proportion and the CLT

Assume $np \ge 5$ and $n(1-p) \ge 5$. (Then we can say n is large.)

Then the following Central Limit Theorem may be used:

$$\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right)$$ approximately for large n

Alternative form:

No. of occurrences ~ N(np, np(1-p)) approximately for large n,

i.e. Bin(n,p) can be approximated by N(np,np(1-p)) for large n.

Normal Approximation to Binomial


For any value of p, a Bin(n,p) distribution can be approximated by a N(np, np(1-p)) distribution for large enough n.
Need $np \ge 5$ and $n(1-p) \ge 5$ for a reasonable approximation.

Issues With the CLT


When can we use it:
If samples are from IID distributions
If there is moderate or no skew

When we may not use it:
Don't use it if the distributions are not IID.
Errors may be large for small samples from skewed distributions

Example
Binomial(n,p) with p = 0.5, and various n.

Example
Binomial(n,p) with p = 0.1, and various n.

How good are the approximations?


Depends!
Larger skew: need larger sample size
No skew: even 15 is a reasonable sample size to use the
normal approximation
Moderate skew: need 30 or more
High skew: need 50 or more
Severe skew (Example: binomial with large n and small p
so that Poisson approximation holds): might need very
high sample size, in the range of several hundred or even
higher
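The effect of skew can be illustrated numerically by measuring how far the binomial c.d.f. is from its normal approximation at a fixed sample size. This Python sketch uses n = 30 and the three values of p from the example slides. No continuity correction is applied, and the error measure (largest c.d.f. gap at integer points) is a choice made here, not something the slides specify.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

def max_cdf_error(n, p):
    """Largest gap between the Bin(n, p) and N(np, np(1-p)) c.d.f.s over k = 0..n."""
    approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return max(abs(binom_cdf(k, n, p) - approx.cdf(k)) for k in range(n + 1))

errors = {p: max_cdf_error(30, p) for p in (0.5, 0.1, 0.01)}
for p, err in errors.items():
    print(f"p = {p}: largest c.d.f. error at n = 30 is {err:.3f}")
```

The symmetric case p = 0.5 gives the smallest error. The error grows as p moves towards 0 and the distribution becomes more skewed.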

Example
Binomial(n,p) with p = 0.01, and various n.

Back to our example


Work out $\mu_X$ and $\sigma_X$. Hence work out the approximate distribution of the sample mean for n = 30 using CLT.
What is the probability that over 30 days,
a) average spend is at least Rs.120?
b) average spend is between Rs.110 and Rs.130?
c) What is the probability that the total spend is not more than Rs.4000?
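A sketch of the CLT calculation for the lunch example: compute the daily mean and variance exactly, then treat the 30-day average and total as approximately normal. This Python sketch uses the standard-library `statistics.NormalDist`; the variable names are hypothetical and the answers are approximations, as the CLT itself is.

```python
from fractions import Fraction
from math import sqrt
from statistics import NormalDist

daily = {100: Fraction(6, 10), 140: Fraction(2, 10), 150: Fraction(2, 10)}

# Mean and variance of one day's spend.
mu = sum(x * p for x, p in daily.items())
var = sum(x * x * p for x, p in daily.items()) - mu ** 2

n = 30
# CLT: the sample mean is approximately N(mu, var / n).
mean_dist = NormalDist(mu=float(mu), sigma=sqrt(float(var) / n))
# CLT for sums: the total is approximately N(n * mu, n * var).
total_dist = NormalDist(mu=float(n * mu), sigma=sqrt(float(n * var)))

p_a = 1 - mean_dist.cdf(120)                   # average at least Rs.120
p_b = mean_dist.cdf(130) - mean_dist.cdf(110)  # average between Rs.110 and Rs.130
p_c = total_dist.cdf(4000)                     # total not more than Rs.4000

print("mean =", mu, " variance =", var)
print(f"a) {p_a:.4f}   b) {p_b:.4f}   c) {p_c:.5f}")
```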

Continuing With the Example


a) What proportion of days is the student expected to spend
more than Rs.110?
b) What is the (sampling) distribution of the number of times
the student spends over Rs.110 for lunch on a day?
c) What is the (sampling) distribution of the number of times
the student spends over Rs.110 for lunch in 5 days?
d) What is the (sampling) distribution of the number of times
the student spends over Rs.110 for lunch in 50 days?
e) What is the probability that the student spends over
Rs.110 at least 30 times in 50 days?
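The student spends over Rs.110 exactly when the lunch is B or C, so each day is a Bernoulli trial with p = 0.2 + 0.2 = 0.4 and the count over n days is Bin(n, 0.4). This Python sketch answers part (e) both exactly and with the normal approximation. Independence across days is assumed, and the continuity correction of 0.5 is an addition made here.

```python
from math import comb, sqrt
from statistics import NormalDist

p = 0.4  # P(daily spend > Rs.110) = P(B) + P(C)
n = 50

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Exact binomial tail: P(X >= 30) for X ~ Bin(50, 0.4).
p_exact = sum(binom_pmf(x, n, p) for x in range(30, n + 1))

# Normal approximation N(np, np(1-p)), with a continuity correction.
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
p_normal = 1 - approx.cdf(29.5)

print(f"expected proportion of days over Rs.110: {p}")
print(f"exact P(X >= 30)  = {p_exact:.5f}")
print(f"normal approx.    = {p_normal:.5f}")
```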

Quick questions
I collect a sample, without replacement, of 100 children of age 6 from Nagpur.
a) Does the CLT say that their height distribution is approximately normal? No.
b) Does the CLT say that the distribution of their average height is approximately normal? Yes*
c) I record the heights of 15 students of age 6 and 15 students of age 10. Is the distribution of the average height approximately normal? No. What if I recorded 50+50? Yes*
I record the heights of 20 students sampled, without replacement, from this class. Does the CLT say that the average height is approximately normal? No.