
An Introduction to

Information Theory
Shriram Nandakumar
Department of Electronics & Communication,
Amrita School of Engineering,
Amrita Vishwa Vidyapeetham University,
Amritapuri Campus.

Tribute to the Father


"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
- Claude Shannon, 1948

Thanks to...
Gilbert Strang - my role model in teaching. Accessible to all on the Internet through his video lectures!
Shri. R. Srinivasa Varadhan - Mathematics teacher.
Shri. R. Varadhan - Chemistry teacher.
Mrs. V. Uma Maheswari - introduced me to the world of Signal Processing in a lucid way.
Prof. Dimitris A. Pados - taught me Information Theory.
Dr. Sundararaman Gopalan - inspires me on how to stay simple & be rational.
Dr. Nithin Nagaraj - my role model for doing good research.
Finally... my parents!

Information Theory deals with


2 Fundamental, Antipodal Problems:
Source Coding: How to measure information content? How to compress data? AVOID REDUNDANCY.
Channel Coding: How to communicate perfectly over imperfect channels? INTRODUCE REDUNDANCY.

PROBABILITY PRIMER

Conditional Probability
How to view conditional probability?

Source: Intuitive Probability and Random Processes, Steven M. Kay
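The defining formula, in its standard form (the slide's equation did not survive extraction):

P(A | B) = P(A ∩ B) / P(B),   provided P(B) > 0.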

Independence of Events
Independent Events
Important view - conditioning an event A on some other event B doesn't alter P(A).

Famous view (yet important too!) - the product form written out below.

Another myth - independent events are mutually exclusive. NO!
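Both views in symbols (standard definitions, added because the slide's formulas are missing):

P(A | B) = P(A)          (conditioning on B changes nothing)
P(A ∩ B) = P(A) · P(B)   (the product form)

Note that two events with nonzero probabilities can never be both mutually exclusive and independent, since mutual exclusivity forces P(A ∩ B) = 0 ≠ P(A) · P(B).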

Probability Primer
Independence DOESN'T mean that one event has no effect on the other(s).

Independence of Events
Independence of 3 events (can be extended to N events) - the four conditions are written out below.

If the 4th condition is not satisfied, A, B & C are only pair-wise independent.
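The four conditions, in their standard form (the slide's equations are missing):

P(A ∩ B) = P(A) P(B),   P(A ∩ C) = P(A) P(C),   P(B ∩ C) = P(B) P(C)
P(A ∩ B ∩ C) = P(A) P(B) P(C)

The first three alone give pair-wise independence; all four together give mutual independence.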

Chain Rule for multiple events


Chain Rule: written out below.

What if A, B & C are independent?
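The chain rule in its standard three-event form (reconstructed, since the slide's equation is missing):

P(A ∩ B ∩ C) = P(A) · P(B | A) · P(C | A ∩ B)

If A, B & C are independent, every conditional equals the corresponding marginal and the product collapses to P(A) P(B) P(C).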

Bayes' Theorem
How to view Bayes' Theorem?
Please start viewing it as a tool for INFERENCE.
Assess the validity of an event when some other
event has been observed.
Assess your hypothesis in light of new evidence.

Prior Probability
Posterior Probability

A more detailed example coming later!
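Bayes' theorem in the inference reading (standard statement; the slide's equation did not survive extraction):

P(hypothesis | evidence) = P(evidence | hypothesis) · P(hypothesis) / P(evidence)

i.e., Posterior = Likelihood × Prior / Evidence.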

Discrete Random Variables (Bernoulli)

Source: Intuitive Probability and Random Processes, Steven M. Kay

Discrete Random Variables (Binomial)

Example: # of heads in M independent coin tosses


Source: Intuitive Probability and Random Processes, Steven M. Kay

Discrete Random Variables (Geometric)

Example: Occurrence of first head


Source: Intuitive Probability and Random Processes, Steven M. Kay
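For reference, the three PMFs from the last three slides, in their standard forms (the slides' equations are missing):

Bernoulli(p):    P(X = 1) = p,  P(X = 0) = 1 - p
Binomial(M, p):  P(X = k) = C(M, k) p^k (1 - p)^(M - k),  k = 0, 1, ..., M   (# of heads in M tosses)
Geometric(p):    P(X = k) = (1 - p)^(k - 1) p,  k = 1, 2, ...   (toss on which the first head occurs)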

Discrete Random Variables - Transformations

Eg: Input RV: X ~ {-1, 0, 1} with equal probabilities.

Output RVs: Y = 2X + 1 and Z = X^2
Again, how to view this?
Conserve probability masses (a small code sketch follows the steps below):
1. Find the values that output RV can take.
2. For every value of Y, find what value(s) of X
caused it by inverse function.
3. Transfer the probability masses of corresponding
value(s) of X to Y. If there is a many-to-one
mapping, add the probability masses.
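A minimal Python sketch of the three-step recipe above, using the example's PMF and transformations (the helper name transform_pmf is mine):

# Push a discrete PMF through a function by conserving probability mass.
def transform_pmf(pmf, g):
    # pmf: dict mapping x -> P(X = x); g: the transformation applied to X
    out = {}
    for x, p in pmf.items():
        y = g(x)
        out[y] = out.get(y, 0.0) + p   # many-to-one mapping: masses add up
    return out

pX = {-1: 1/3, 0: 1/3, 1: 1/3}                 # X ~ {-1, 0, 1}, equally likely
print(transform_pmf(pX, lambda x: 2 * x + 1))  # Y = 2X + 1: {-1: 1/3, 1: 1/3, 3: 1/3}
print(transform_pmf(pX, lambda x: x ** 2))     # Z = X^2:    {1: 2/3, 0: 1/3}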

Discrete Random Variables - Transformations


Another Example:

Expected Values of RVs


Expected value/ mean:

Myth - the expected value / mean is the most probable value.

NO! It also need not be located at the center of the PMF.

Variance:
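Written out in their standard forms (the slide's formulas are missing):

E[X] = Σ_x x · P(X = x)
Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2

For example, a fair coin coded as {0, 1} has E[X] = 0.5, a value the RV never takes - the mean need not be a probable value at all.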

Two Random Variables


Can be characterized by the Joint PMF - describes how the RVs jointly realize.
Also associated are Conditional PMFs and Marginal PMFs.
The Joint PMF cannot, in general, be determined from the Marginal PMFs.

Two Random Variables- Independence


One view:
Joint PMF = (Marginal of X) × (Marginal of Y)
Here × denotes an ordinary product, required to hold at every pair of values.

Two Random Variables- Independence


Another view:
Conditional of X given Y = Marginal of X
Conditional of Y given X = Marginal of Y
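In symbols (standard definitions, since the slides' formulas are missing): X and Y are independent iff

p_XY(x, y) = p_X(x) · p_Y(y)   for every pair (x, y),

equivalently p_X|Y(x | y) = p_X(x) and p_Y|X(y | x) = p_Y(y) wherever the conditioning probability is nonzero.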

Two Random Variables


Each RV can be associated with its expected value & its variance, but these do not completely characterize the pair.
Joint moments - covariance & correlation.
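The standard definitions of these joint moments:

Correlation:              E[XY]
Covariance:               Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X] E[Y]
Correlation coefficient:  ρ = Cov(X, Y) / (σ_X σ_Y)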

Two Random Variables

Can we calculate the covariances? Is there a case where the RVs are independent?

Two Random Variables


Independence implies zero covariance.
Zero covariance doesn't imply independence.

What can we tell about these RVs?


Independent? Correlated?

Two Random Variables


Experiment:
We have 2 coins. We pick one & toss N times.

Random variables:
X - the coin we pick: {0, 1}
Y - the number of heads: {0, 1, ..., N}

X = {0, 1} (2 FAIR coins, equally likely); with N = 3 tosses, Y = {0, 1, 2, 3}
X & Y are INDEPENDENT - why? (A small check in code follows.)
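One way to see it concretely is to tabulate the joint PMF and check whether it factors into the marginals. A sketch under the stated setup (the helper name joint_pmf is mine; the head probabilities are parameters so the bent-coin case further below can reuse it):

from math import comb

def joint_pmf(p_pick0, bias0, bias1, n_tosses=3):
    # P(X = x, Y = y): pick coin x (head probability = bias), toss it n_tosses times, count heads y
    pmf = {}
    for x, (p_pick, bias) in enumerate([(p_pick0, bias0), (1 - p_pick0, bias1)]):
        for y in range(n_tosses + 1):
            pmf[(x, y)] = p_pick * comb(n_tosses, y) * bias ** y * (1 - bias) ** (n_tosses - y)
    return pmf

p = joint_pmf(0.5, 0.5, 0.5)   # two FAIR coins, picked with equal probability
pX = {x: sum(v for (xx, y), v in p.items() if xx == x) for x in (0, 1)}
pY = {y: sum(v for (xx, yy), v in p.items() if yy == y) for y in range(4)}
print(all(abs(p[(x, y)] - pX[x] * pY[y]) < 1e-12 for (x, y) in p))   # True: the joint factors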

Joint Distribution

[3-D bar plot of the joint PMF P(X, Y)]

Marginal of Y (What about X?)


[Bar plot of the marginal PMF of Y]

Conditionals of Y (conditioned on X)
[Two bar plots: P(Y | X = 0) and P(Y | X = 1)]

X = {0, 1} (2 BENT coins, equally likely to be picked)

Pr(Heads | X = 0) = 0.8
Pr(Heads | X = 1) = 0.2
Y = {0, 1, 2, 3}

X & Y are DEPENDENT - why?
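Running the joint_pmf sketch added above with joint_pmf(0.5, 0.8, 0.2) makes the factorization check print False: once the two coins have different head probabilities, the number of heads carries information about which coin was picked, so the joint PMF no longer equals the product of the marginals.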

Joint PMF

[3-D bar plot of the joint PMF P(X, Y)]

Marginal PMF of Y (# of heads)


(what about marginal of X?)
[Bar plot of the marginal PMF of Y]

Conditional of Y (conditioned on X)
[Two bar plots: P(Y | X = 0) and P(Y | X = 1)]

X = {0, 1} (unequally likely, biases 0.8 & 0.2 respectively)
Y = {0, 1, 2, 3}
Again they are dependent.

Joint PMF

[3-D bar plot of the joint PMF P(X, Y)]

Comparison of conditionals

[Two bar plots: P(Y | X = 0) and P(Y | X = 1)]

What about this?

[3-D bar plot of a joint PMF P(X, Y)]

... and this? How do we find whether X is uniform? Are the biases the same?
[3-D bar plot of another joint PMF P(X, Y)]

Joint Distribution of Bi-grams in a Document

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Conditional Distribution of Bi-grams

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

A solidifying example!

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Application Example
Coding Theory (A Simple Error-Correcting Code for the Binary Symmetric Channel)

How to achieve perfect communication over an imperfect, noisy channel?

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

A flip probability of 0.1 is a big number, though!

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Channel coding (w/o Source Coding)

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Repetition Codes

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

How Does the Decoder Work?

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

How Does the Decoder Work?

It uses inference: posterior, prior, likelihood & majority voting.

Posterior ∝ Likelihood × Prior

How Does the Decoder Work?

Guess ŝ = 1 if P(s = 1 | r) > P(s = 0 | r); guess ŝ = 0 if vice-versa.
For equally likely hypotheses s = 0 & s = 1, maximizing the a posteriori probability is equivalent to maximizing the likelihood.

How Does the Decoder Work? (continued)

Performance of Repetition Codes

With majority-vote decoding (MVD), what is Pr(ŝ ≠ s)? This probability has to be far less than the raw flip probability f.
For R3, Pr(ŝ ≠ s) = Prob of 3 flips (less likely) + Prob of 2 flips (more likely)
= f^3 + 3 f^2 (1 - f)

Repetition Codes
Rate vs Error Probability

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay
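A minimal sketch of the trade-off, assuming a binary symmetric channel with flip probability f and the R3 majority-vote decoder at rate 1/3 (the function name r3_error_probability is mine):

import random

def r3_error_probability(f, trials=200_000):
    # Monte-Carlo estimate of Pr(decoded bit != sent bit) for R3 with majority-vote decoding
    errors = 0
    for _ in range(trials):
        flips = [random.random() < f for _ in range(3)]   # which of the three copies the BSC flips
        if sum(flips) >= 2:                               # majority of copies corrupted -> wrong decision
            errors += 1
    return errors / trials

f = 0.1
print(3 * f ** 2 * (1 - f) + f ** 3)   # analytic value: 0.028, versus f = 0.1 for the bare channel
print(r3_error_probability(f))         # the simulation agrees up to sampling noise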

Entropy & Information Measures

Measure of information of an outcome


You toss a fair coin & you observe a head. What is
the information you have gained? Ans: 1 bit

Shannon Information Content: h(x) = log2(1 / P(x)), measured in bits.

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Entropy of a Random Variable


Let us define an ensemble X as a triple (x, A_X, P_X): the outcome x is the value of a RV which takes on one of a set of possible values from the alphabet A_X = {a_1, ..., a_I}, having probabilities P_X = {p_1, ..., p_I}.

H(X) = Σ_x P(x) log2(1 / P(x)) - again measured in bits.

H(X) can be seen as the Weighted Average Information Content.
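A small sketch of H(X) as the probability-weighted average of the Shannon information contents (the helper name entropy is mine):

from math import log2

def entropy(pmf):
    # H(X) = sum over x of P(x) * log2(1 / P(x)), in bits; zero-probability terms contribute 0
    return sum(p * log2(1 / p) for p in pmf.values() if p > 0)

print(entropy({'H': 0.5, 'T': 0.5}))   # fair coin: 1.0 bit
print(entropy({'H': 0.9, 'T': 0.1}))   # bent coin: about 0.469 bits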

Binary Entropy Function


Example: A coin-toss

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

2 Important Properties

Why the logarithm? Why not some other function?

Shannon wanted his measure to satisfy three postulates:


Postulate #1: A larger number of potential outcomes means larger uncertainty.
Postulate #2: The relative likelihood of each outcome
determines the uncertainty.
Postulate #3: The weighted uncertainties of
independent events must add up.
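A quick worked check of Postulate #3: two independent fair coins form a joint experiment with 4 equally likely outcomes, and log2(4) = 2 bits = 1 bit + 1 bit - the uncertainties of the independent parts add, which is exactly the additivity the logarithm provides.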

Redundancy
Measures the fractional difference between H(X) and its maximum possible value.

Redundancy = 1 - H(X) / log2 |A_X|
What is the redundancy in a fair-coin toss?
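A quick worked answer under this definition: for a fair coin H(X) = 1 bit = log2(2), so the redundancy is 0; for a coin with P(heads) = 0.9, H(X) ≈ 0.469 bits and the redundancy is 1 - 0.469/1 ≈ 0.53.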

Designing Informative Experiments - The Weighing Problem

Please think about it from an information-theoretic perspective!

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Entropy as the minimum # of binary questions

What is the smallest # of Yes/No questions needed to identify an integer from 0 to 63 (64 equally likely values)?
Ans: 6 = log2(64). How?
What are the questions? (Not so important right now!)

Joint Entropy

If X & Y are independent:
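The standard definitions (the slide's formulas are missing):

H(X, Y) = Σ_{x, y} p(x, y) log2(1 / p(x, y)),

and for independent X & Y this splits as H(X, Y) = H(X) + H(Y).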

Conditional Entropy

Conditional Entropy H(X|Y) measures average uncertainty that remains about X when Y is known.

Joint, Conditional & Marginal Entropies are related
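The relations in their standard form (the slide's equations are missing):

H(X | Y) = Σ_y p(y) H(X | Y = y)
H(X, Y) = H(Y) + H(X | Y) = H(X) + H(Y | X)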

Quick Example:
You have 3 coins. One is fair, the 2nd is double-headed & the 3rd is double-tailed. You are blind-folded & made to pick a coin and toss it. You look at one of its faces & see a head. Does seeing one of the faces reduce the uncertainty regarding the coin you have picked? Evaluate all the entropies.
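One way to work this out is numerically from the joint PMF of (coin picked, face seen). A sketch under the stated setup (the names pXY and H are mine; H is the same entropy function as in the earlier sketch):

from math import log2

# Joint PMF of (X = coin picked, Y = face seen): fair, double-headed, double-tailed, each picked w.p. 1/3
pXY = {('fair', 'H'): 1/6, ('fair', 'T'): 1/6,
       ('2head', 'H'): 1/3, ('2head', 'T'): 0.0,
       ('2tail', 'H'): 0.0, ('2tail', 'T'): 1/3}

def H(pmf):
    return sum(p * log2(1 / p) for p in pmf.values() if p > 0)   # entropy in bits

pX = {x: sum(v for (xx, y), v in pXY.items() if xx == x) for x in ('fair', '2head', '2tail')}
pY = {y: sum(v for (xx, yy), v in pXY.items() if yy == y) for y in ('H', 'T')}
HX, HY, HXY = H(pX), H(pY), H(pXY)
print(HX, HY, HXY)      # about 1.585, 1.0, 1.918
print(HXY - HY)         # H(X|Y) about 0.918 < H(X): seeing a face does reduce the uncertainty
print(HX + HY - HXY)    # I(X;Y) about 0.667 bits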

Mutual Information

It measures the average reduction in uncertainty about X that results from learning the value of Y, or vice-versa.

Quick Example - evaluate I(X; Y) for the previous example.
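The standard identities behind that evaluation (the slide's formulas are missing):

I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X) = H(X) + H(Y) - H(X, Y) ≥ 0,

with equality iff X and Y are independent.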

Various Relationships

Source: Information Theory, Inference & Learning Algorithms, David J.C. MacKay

Conditional Mutual Information

Some Illegal Information Measures!!

References
A Mathematical Theory of Communication, Claude E. Shannon, 1948 (the classic paper).
Information Theory, Inference & Learning Algorithms, David J.C. MacKay, Cambridge University Press.
Intuitive Probability and Random Processes Using MATLAB, Steven M. Kay, Springer.
A Light Discussion and Derivation of Entropy, tutorial paper, Jonathan Shlens, Systems Neurobiology Laboratory, Salk Institute for Biological Studies.

Thanks for attending!
