
An Introduction to

Information Theory
Shriram Nandakumar
Department of Electronics & Communication,
Amrita School of Engineering,
Amrita Vishwa Vidyapeetham University,
Amritapuri Campus.

Tribute to the Father


"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
- Claude Shannon, 1948

Thanks to...
Gilbert Strang - my role model in teaching. Accessible to all on the Internet through his video lectures!
Shri. R. Srinivasa Varadhan - Mathematics teacher.
Shri. R. Varadhan - Chemistry teacher.
Mrs. V. Uma Maheswari - introduced me to the world of Signal Processing in a lucid way.
Prof. Dimitris A. Pados - taught me Information Theory.
Dr. Sundararaman Gopalan - inspires me on how to stay simple & be rational.
Dr. Nithin Nagaraj - my role model for doing good research.
Finally... my parents!

Information Theory deals with


2 Fundamental, Antipodal Problems:
Source Coding: How to measure information content? How to compress data? AVOID REDUNDANCY.
Channel Coding: How to communicate perfectly over imperfect channels? INTRODUCE REDUNDANCY.

PROBABILITY PRIMER

Conditional Probability
How to view conditional probability?

Source: Intuitive Probability and Random Processes, Steven M. Kay
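The defining formula, in its standard form (the slide's equation did not survive extraction):

P(A | B) = P(A ∩ B) / P(B),   provided P(B) > 0.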

Independence of Events
Independent Events
Important view - conditioning an event A on some other event B doesn't alter P(A).

Famous view (yet important too!) - the product form written out below.

Another myth - independent events are mutually exclusive. NO!
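Both views in symbols (standard definitions, added because the slide's formulas are missing):

P(A | B) = P(A)          (conditioning on B changes nothing)
P(A ∩ B) = P(A) · P(B)   (the product form)

Note that two events with nonzero probabilities can never be both mutually exclusive and independent, since mutual exclusivity forces P(A ∩ B) = 0 ≠ P(A) · P(B).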

Probability Primer
Independence DOESN'T mean that one event has no effect on the other(s).

Independence of Events
Independence of 3 events (can be extended to N events) - the four conditions are written out below.

If the 4th condition is not satisfied, A, B & C are only pair-wise independent.
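The four conditions, in their standard form (the slide's equations are missing):

P(A ∩ B) = P(A) P(B),   P(A ∩ C) = P(A) P(C),   P(B ∩ C) = P(B) P(C)
P(A ∩ B ∩ C) = P(A) P(B) P(C)

The first three alone give pair-wise independence; all four together give mutual independence.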

Chain Rule for multiple events


Chain Rule: written out below.

What if A, B & C are independent?
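The chain rule in its standard three-event form (reconstructed, since the slide's equation is missing):

P(A ∩ B ∩ C) = P(A) · P(B | A) · P(C | A ∩ B)

If A, B & C are independent, every conditional equals the corresponding marginal and the product collapses to P(A) P(B) P(C).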

Bayes' Theorem
How to view Bayes' Theorem?
Please start viewing it as a tool for INFERENCE.
Assess the validity of an event when some other
event has been observed.
Assess your hypothesis in light of new evidence.

Prior Probability
Posterior Probability

A more detailed example coming later!
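Bayes' theorem in the inference reading (standard statement; the slide's equation did not survive extraction):

P(hypothesis | evidence) = P(evidence | hypothesis) · P(hypothesis) / P(evidence)

i.e., Posterior = Likelihood × Prior / Evidence.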

Discrete Random Variables (Bernoulli)

Source: Intuitive Probability and Random Processes, Steven M. Kay

Discrete Random Variables (Binomial)

Example: # of heads in M independent coin tosses


Source: Intuitive Probability and Random Processes, Steven M. Kay

Discrete Random Variables (Geometric)

Example: Occurrence of first head


Source: Intuitive Probability and Random Processes, Steven M. Kay
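For reference, the three PMFs from the last three slides, in their standard forms (the slides' equations are missing):

Bernoulli(p):    P(X = 1) = p,  P(X = 0) = 1 - p
Binomial(M, p):  P(X = k) = C(M, k) p^k (1 - p)^(M - k),  k = 0, 1, ..., M   (# of heads in M tosses)
Geometric(p):    P(X = k) = (1 - p)^(k - 1) p,  k = 1, 2, ...   (toss on which the first head occurs)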

Discrete Random Variables - Transformations

Eg: Input RV: X ~ {-1, 0, 1} with equal probabilities.

Output RVs: Y = 2X + 1 and Z = X^2
Again, how to view this?
Conserve probability masses (a small code sketch follows the steps below):
1. Find the values that output RV can take.
2. For every value of Y, find what value(s) of X
caused it by inverse function.
3. Transfer the probability masses of corresponding
value(s) of X to Y. If there is a many-to-one
mapping, add the probability masses.
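A minimal Python sketch of the three-step recipe above, using the example's PMF and transformations (the helper name transform_pmf is mine):

# Push a discrete PMF through a function by conserving probability mass.
def transform_pmf(pmf, g):
    # pmf: dict mapping x -> P(X = x); g: the transformation applied to X
    out = {}
    for x, p in pmf.items():
        y = g(x)
        out[y] = out.get(y, 0.0) + p   # many-to-one mapping: masses add up
    return out

pX = {-1: 1/3, 0: 1/3, 1: 1/3}                 # X ~ {-1, 0, 1}, equally likely
print(transform_pmf(pX, lambda x: 2 * x + 1))  # Y = 2X + 1: {-1: 1/3, 1: 1/3, 3: 1/3}
print(transform_pmf(pX, lambda x: x ** 2))     # Z = X^2:    {1: 2/3, 0: 1/3}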

Discrete Random Variables - Transformations


Another Example:

Expected Values of RVs


Expected value/ mean:

Myth - the expected value / mean is the most probable value.

NO! It also need not be located at the center of the PMF.

Variance:
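Written out in their standard forms (the slide's formulas are missing):

E[X] = Σ_x x · P(X = x)
Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2

For example, a fair coin coded as {0, 1} has E[X] = 0.5, a value the RV never takes - the mean need not be a probable value at all.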

Two Random Variables


Can be characterized by the Joint PMF - describes how the RVs jointly realize.
Also associated are Conditional PMFs and Marginal PMFs.
The Joint PMF cannot, in general, be determined from the Marginal PMFs.

Two Random Variables- Independence


One view:
Joint PMF = (Marginal of X) × (Marginal of Y)
Here × denotes an ordinary product, required to hold at every pair of values.

Two Random Variables- Independence


Another view:
Conditional of X given Y = Marginal of X
Conditional of Y given X = Marginal of Y
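In symbols (standard definitions, since the slides' formulas are missing): X and Y are independent iff

p_XY(x, y) = p_X(x) · p_Y(y)   for every pair (x, y),

equivalently p_X|Y(x | y) = p_X(x) and p_Y|X(y | x) = p_Y(y) wherever the conditioning probability is nonzero.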

Two Random Variables


Each RV can be associated with its expected value & its variance, but these do not completely characterize the pair.
Joint moments - covariance & correlation.
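The standard definitions of these joint moments:

Correlation:              E[XY]
Covariance:               Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X] E[Y]
Correlation coefficient:  ρ = Cov(X, Y) / (σ_X σ_Y)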

Two Random Variables

Can we calculate the covariances? Is there a case where the RVs are independent?

Two Random Variables


Independence implies zero covariance.
Zero covariance doesn't imply independence.

What can we tell about these RVs?


Independent? Correlated?

Two Random Variables


Experiment:
We have 2 coins. We pick one & toss N times.

Random variables:
X - the coin we pick: {0, 1}
Y - the number of heads: {0, 1, ..., N}

X = {0, 1} (2 FAIR coins, equally likely); with N = 3 tosses, Y = {0, 1, 2, 3}
X & Y are INDEPENDENT - why? (A small check in code follows.)
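One way to see it concretely is to tabulate the joint PMF and check whether it factors into the marginals. A sketch under the stated setup (the helper name joint_pmf is mine; the head probabilities are parameters so the bent-coin case further below can reuse it):

from math import comb

def joint_pmf(p_pick0, bias0, bias1, n_tosses=3):
    # P(X = x, Y = y): pick coin x (head probability = bias), toss it n_tosses times, count heads y
    pmf = {}
    for x, (p_pick, bias) in enumerate([(p_pick0, bias0), (1 - p_pick0, bias1)]):
        for y in range(n_tosses + 1):
            pmf[(x, y)] = p_pick * comb(n_tosses, y) * bias ** y * (1 - bias) ** (n_tosses - y)
    return pmf

p = joint_pmf(0.5, 0.5, 0.5)   # two FAIR coins, picked with equal probability
pX = {x: sum(v for (xx, y), v in p.items() if xx == x) for x in (0, 1)}
pY = {y: sum(v for (xx, yy), v in p.items() if yy == y) for y in range(4)}
print(all(abs(p[(x, y)] - pX[x] * pY[y]) < 1e-12 for (x, y) in p))   # True: the joint factors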

Joint Distribution

[3-D bar plot of the joint PMF P(X, Y)]

Marginal of Y (What about X?)


[Bar plot of the marginal PMF of Y]

Conditionals of Y (conditioned on X)
[Two bar plots: P(Y | X = 0) and P(Y | X = 1)]

X = {0, 1} (2 BENT coins, equally likely to be picked)

Pr(Heads | X = 0) = 0.8
Pr(Heads | X = 1) = 0.2
Y = {0, 1, 2, 3}

X & Y are DEPENDENT - why?
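Running the joint_pmf sketch added above with joint_pmf(0.5, 0.8, 0.2) makes the factorization check print False: once the two coins have different head probabilities, the number of heads carries information about which coin was picked, so the joint PMF no longer equals the product of the marginals.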

Joint PMF

[3-D bar plot of the joint PMF P(X, Y)]

Marginal PMF of Y (# of heads)


(what about marginal of X?)
[Bar plot of the marginal PMF of Y]

Conditional of Y (conditioned on X)
[Two bar plots: P(Y | X = 0) and P(Y | X = 1)]

X = {0, 1} (unequally likely, biases 0.8 & 0.2 respectively)
Y = {0, 1, 2, 3}
Again they are dependent.

Joint PMF

[3-D bar plot of the joint PMF P(X, Y)]

Comparison of conditionals

[Two bar plots: P(Y | X = 0) and P(Y | X = 1)]

What about this?

[3-D bar plot of a joint PMF P(X, Y)]

... and this? How do we find whether X is uniform? Are the biases the same?
[3-D bar plot of another joint PMF P(X, Y)]

Joint Distribution of Bi-grams in a Document

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Conditional Distribution of Bi-grams

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

A solidifying example!

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Application Example
Coding Theory (A Simple Error-Correcting Code for the Binary Symmetric Channel)

How to achieve perfect communication over an imperfect, noisy channel?

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

A flip probability of 0.1 is a big number, though!

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Channel coding (w/o Source Coding)

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Repetition Codes

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

How Does the Decoder Work?

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

How Does the Decoder Work?

It uses inference: posterior, prior, likelihood & majority voting.

Posterior ∝ Likelihood × Prior

How Does the Decoder Work?

Guess ŝ = 1 if P(s = 1 | r) > P(s = 0 | r); guess ŝ = 0 if vice-versa.
For equally likely hypotheses s = 0 & s = 1, maximizing the a posteriori probability is equivalent to maximizing the likelihood.

How Does the Decoder Work? (continued)

Performance of Repetition Codes

With majority-vote decoding (MVD), what is Pr(ŝ ≠ s)? This probability has to be far less than the raw flip probability f.
For R3, Pr(ŝ ≠ s) = Prob of 3 flips (less likely) + Prob of 2 flips (more likely)
= f^3 + 3 f^2 (1 - f)

Repetition Codes
Rate vs Error Probability

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay
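A minimal sketch of the trade-off, assuming a binary symmetric channel with flip probability f and the R3 majority-vote decoder at rate 1/3 (the function name r3_error_probability is mine):

import random

def r3_error_probability(f, trials=200_000):
    # Monte-Carlo estimate of Pr(decoded bit != sent bit) for R3 with majority-vote decoding
    errors = 0
    for _ in range(trials):
        flips = [random.random() < f for _ in range(3)]   # which of the three copies the BSC flips
        if sum(flips) >= 2:                               # majority of copies corrupted -> wrong decision
            errors += 1
    return errors / trials

f = 0.1
print(3 * f ** 2 * (1 - f) + f ** 3)   # analytic value: 0.028, versus f = 0.1 for the bare channel
print(r3_error_probability(f))         # the simulation agrees up to sampling noise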

Entropy & Information Measures

Measure of information of an outcome


You toss a fair coin & you observe a head. What is
the information you have gained? Ans: 1 bit

Shannon Information Content: h(x) = log2(1 / P(x)), measured in bits.

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Entropy of a Random Variable


Let us define an ensemble X as a triple (x, A_X, P_X): the outcome x is the value of a RV which takes on one of a set of possible values from the alphabet A_X = {a_1, ..., a_I}, having probabilities P_X = {p_1, ..., p_I}.

H(X) = Σ_x P(x) log2(1 / P(x)) - again measured in bits.

H(X) can be seen as the Weighted Average Information Content.
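A small sketch of H(X) as the probability-weighted average of the Shannon information contents (the helper name entropy is mine):

from math import log2

def entropy(pmf):
    # H(X) = sum over x of P(x) * log2(1 / P(x)), in bits; zero-probability terms contribute 0
    return sum(p * log2(1 / p) for p in pmf.values() if p > 0)

print(entropy({'H': 0.5, 'T': 0.5}))   # fair coin: 1.0 bit
print(entropy({'H': 0.9, 'T': 0.1}))   # bent coin: about 0.469 bits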

Binary Entropy Function


Example: A coin-toss

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

2 Important Properties

Why the logarithm? Why not some other function?

Shannon wanted his measure to satisfy three postulates:


Postulate #1: A larger number of potential outcomes means larger uncertainty.
Postulate #2: The relative likelihood of each outcome
determines the uncertainty.
Postulate #3: The weighted uncertainties of
independent events must add up.
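A quick worked check of Postulate #3: two independent fair coins form a joint experiment with 4 equally likely outcomes, and log2(4) = 2 bits = 1 bit + 1 bit - the uncertainties of the independent parts add, which is exactly the additivity the logarithm provides.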

Redundancy
Measures the fractional difference between H(X) and its maximum possible value.

Redundancy = 1 - H(X) / log2 |A_X|
What is the redundancy in a fair-coin toss?
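A quick worked answer under this definition: for a fair coin H(X) = 1 bit = log2(2), so the redundancy is 0; for a coin with P(heads) = 0.9, H(X) ≈ 0.469 bits and the redundancy is 1 - 0.469/1 ≈ 0.53.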

Designing Informative Experiments - The Weighing Problem

Please think about it from an information-theoretic perspective!

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Entropy as the minimum # of binary questions

What is the smallest # of Yes/No questions needed to identify an integer from 0 to 63 (64 equally likely values)?
Ans: 6 = log2(64). How?
What are the questions? (Not so important right now!)

Joint Entropy

If X & Y are independent:
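The standard definitions (the slide's formulas are missing):

H(X, Y) = Σ_{x, y} p(x, y) log2(1 / p(x, y)),

and for independent X & Y this splits as H(X, Y) = H(X) + H(Y).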

Conditional Entropy

Conditional Entropy H(X|Y) measures average uncertainty that remains about X when Y is known.

Joint, Conditional & Marginal Entropies are related
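The relations in their standard form (the slide's equations are missing):

H(X | Y) = Σ_y p(y) H(X | Y = y)
H(X, Y) = H(Y) + H(X | Y) = H(X) + H(Y | X)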

Quick Example:
You have 3 coins. One is fair, the 2nd is double-headed & the 3rd is double-tailed. You are blind-folded & made to pick a coin and toss it. You look at one of its faces & see a head. Does seeing one of the faces reduce the uncertainty regarding the coin you have picked? Evaluate all the entropies.
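One way to work this out is numerically from the joint PMF of (coin picked, face seen). A sketch under the stated setup (the names pXY and H are mine; H is the same entropy function as in the earlier sketch):

from math import log2

# Joint PMF of (X = coin picked, Y = face seen): fair, double-headed, double-tailed, each picked w.p. 1/3
pXY = {('fair', 'H'): 1/6, ('fair', 'T'): 1/6,
       ('2head', 'H'): 1/3, ('2head', 'T'): 0.0,
       ('2tail', 'H'): 0.0, ('2tail', 'T'): 1/3}

def H(pmf):
    return sum(p * log2(1 / p) for p in pmf.values() if p > 0)   # entropy in bits

pX = {x: sum(v for (xx, y), v in pXY.items() if xx == x) for x in ('fair', '2head', '2tail')}
pY = {y: sum(v for (xx, yy), v in pXY.items() if yy == y) for y in ('H', 'T')}
HX, HY, HXY = H(pX), H(pY), H(pXY)
print(HX, HY, HXY)      # about 1.585, 1.0, 1.918
print(HXY - HY)         # H(X|Y) about 0.918 < H(X): seeing a face does reduce the uncertainty
print(HX + HY - HXY)    # I(X;Y) about 0.667 bits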

Mutual Information

It measures the average reduction in uncertainty about X that results from learning the value of Y, or vice-versa.

Quick Example - evaluate I(X; Y) for the previous example.
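The standard identities behind that evaluation (the slide's formulas are missing):

I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X) = H(X) + H(Y) - H(X, Y) ≥ 0,

with equality iff X and Y are independent.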

Various Relationships

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Conditional Mutual Information

Some Illegal Information Measures!!

References
A Mathematical Theory of Communication, Claude E. Shannon, 1948 (the classic paper).
Information Theory, Inference & Learning Algorithms, David J.C. MacKay, Cambridge University Press.
Intuitive Probability and Random Processes Using MATLAB, Steven M. Kay, Springer.
A Light Discussion and Derivation of Entropy, tutorial paper, Jonathan Shlens, Systems Neurobiology Laboratory, Salk Institute for Biological Studies.

Thanks for attending!
