
Information-Theoretic Modeling
Lecture 3: Source Coding: Theory

Jyrki Kivinen
Department of Computer Science, University of Helsinki

Autumn 2012


Outline

1 Entropy and Information
  Entropy
  Information Inequality
  Data Processing Inequality

2 Data Compression
  Asymptotic Equipartition Property (AEP)
  Typical Sets
  Noiseless Source Coding Theorem


Entropy

Given a discrete random variable X with pmf p_X, we can measure the amount of "surprise" associated with each outcome x ∈ X by the quantity

    I_X(x) = \log_2 \frac{1}{p_X(x)} .

The less likely an outcome is, the more surprised we are to observe it. (The point of the log scale will become clear shortly.)

The entropy of X measures the expected amount of "surprise":

    H(X) = E[I_X(X)] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} .

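As an illustration not on the original slides, here is a minimal Python sketch of the entropy formula above; the use of numpy, the function name entropy, and the example pmf are choices of this sketch, not part of the lecture.

```python
import numpy as np

def entropy(pmf):
    """H(X) in bits for a discrete pmf given as an array of probabilities."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                               # convention: 0 * log2(1/0) = 0
    return float(np.sum(p * np.log2(1.0 / p)))

# Example: a loaded four-sided die vs. a fair one.
print(entropy([0.5, 0.25, 0.125, 0.125]))      # 1.75 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))       # 2.0 bits (uniform case)
```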

Binary Entropy Function

For binary-valued X, with p = p_X(1) = 1 − p_X(0), we have

    H(X) = p \log_2 \frac{1}{p} + (1-p) \log_2 \frac{1}{1-p} .
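A short sketch of the binary case (again a hypothetical helper, not from the slides); the values illustrate how the entropy drops as the distribution becomes more skewed.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0                              # a certain outcome carries no surprise
    return p * np.log2(1.0 / p) + (1 - p) * np.log2(1.0 / (1 - p))

for p in (0.5, 0.9, 0.99):
    print(p, binary_entropy(p))                 # 1.0, ~0.469, ~0.081 bits
```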

More Entropies

1 The joint entropy of two (or more) random variables:

    H(X,Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y) \log_2 \frac{1}{p_{X,Y}(x,y)} ,

2 the entropy of a conditional distribution:

    H(X \mid Y=y) = \sum_{x \in \mathcal{X}} p_{X|Y}(x \mid y) \log_2 \frac{1}{p_{X|Y}(x \mid y)} ,

3 and the conditional entropy:

    H(X \mid Y) = \sum_{y \in \mathcal{Y}} p_Y(y) H(X \mid Y=y)
                = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y) \log_2 \frac{1}{p_{X|Y}(x \mid y)} .


More Entropies

The joint entropy H(X,Y) measures the uncertainty about the pair (X,Y).

The entropy of the conditional distribution H(X | Y = y) measures the uncertainty about X when we know that Y = y.

The conditional entropy H(X | Y) measures the expected uncertainty about X when the value of Y is known.
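A minimal sketch (not from the slides) of how these quantities can be computed from a joint pmf table; the joint pmf p_xy below is invented for illustration, and numpy is assumed for convenience.

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

# Invented joint pmf p_{X,Y}(x, y); rows index x, columns index y.
p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])

H_XY = entropy(p_xy)                           # joint entropy H(X,Y)
p_y = p_xy.sum(axis=0)                         # marginal of Y
# H(X | Y) = sum_y p(y) * H(X | Y = y), with p_{X|Y}(x|y) = p(x,y) / p(y)
H_X_given_Y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(p_xy.shape[1]))

print(H_XY, H_X_given_Y)
```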


Chain Rule of Entropy

Remember the chain rule of probability:

    p_{X,Y}(x,y) = p_Y(y) \cdot p_{X|Y}(x \mid y) .

For the entropy we have:

Chain Rule of Entropy

    H(X,Y) = H(Y) + H(X \mid Y) .

Proof.

    \log_2 \frac{1}{p_{X,Y}(x,y)} = \log_2 \frac{1}{p_Y(y)} + \log_2 \frac{1}{p_{X|Y}(x \mid y)}
    \Leftrightarrow E\left[\log_2 \frac{1}{p_{X,Y}(X,Y)}\right] = E\left[\log_2 \frac{1}{p_Y(Y)}\right] + E\left[\log_2 \frac{1}{p_{X|Y}(X \mid Y)}\right]
    \Leftrightarrow H(X,Y) = H(Y) + H(X \mid Y) .

The rule can be extended to more than two random variables:

    H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}) .

X and Y independent ⇔ H(X | Y) = H(X) ⇔ H(X,Y) = H(X) + H(Y). The logarithmic scale makes entropy additive.

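A quick numerical check of the chain rule, as a sketch (the joint pmf and helper names are the same invented ones as above, redefined here so the snippet runs on its own).

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])
p_y = p_xy.sum(axis=0)

H_XY = entropy(p_xy)
H_Y = entropy(p_y)
H_X_given_Y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(2))

# Chain rule: H(X,Y) = H(Y) + H(X | Y)
assert abs(H_XY - (H_Y + H_X_given_Y)) < 1e-12
```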


Mutual Information

The mutual information

    I(X ; Y) = H(X) - H(X \mid Y)

measures the average decrease in uncertainty about X when the value of Y becomes known.

Mutual information is symmetric (chain rule):

    I(X ; Y) = H(X) - H(X \mid Y) = (H(X) - H(X,Y)) + H(Y)
             = H(Y) - H(Y \mid X) = I(Y ; X) .

On the average, X gives as much information about Y as Y gives about X.


Relationships between Entropies

[Diagram: the joint entropy H(X,Y) split into three parts H(X | Y), I(X ; Y), and H(Y | X); H(X) covers H(X | Y) and I(X ; Y), while H(Y) covers I(X ; Y) and H(Y | X).]
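These relationships can be checked numerically. The sketch below (same invented joint pmf as before) computes I(X ; Y) from the diagram's decomposition and verifies the symmetry stated above.

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

I = entropy(p_x) + entropy(p_y) - entropy(p_xy)          # I(X;Y) = H(X)+H(Y)-H(X,Y)
H_X_given_Y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(2))
H_Y_given_X = sum(p_x[i] * entropy(p_xy[i, :] / p_x[i]) for i in range(2))

assert abs(I - (entropy(p_x) - H_X_given_Y)) < 1e-12     # I(X;Y) = H(X) - H(X|Y)
assert abs(I - (entropy(p_y) - H_Y_given_X)) < 1e-12     # symmetry: = H(Y) - H(Y|X)
```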


Information Inequality

Kullback-Leibler Divergence

The relative entropy or Kullback-Leibler divergence between (discrete) distributions p_X and q_X is defined as

    D(p_X \| q_X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{q_X(x)} .

(We consider p_X(x) \log_2 \frac{p_X(x)}{q_X(x)} = 0 whenever p_X(x) = 0.)

Information Inequality

For any two (discrete) distributions p_X and q_X, we have

    D(p_X \| q_X) \geq 0 ,

with equality iff p_X(x) = q_X(x) for all x ∈ X.

Proof. Gibbs!

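A small sketch of the divergence (not from the slides); it assumes q(x) > 0 wherever p(x) > 0, and the example distributions are made up.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                                # convention: 0 * log2(0/q) = 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
print(kl_divergence(p, q))                      # > 0, since p != q
print(kl_divergence(p, p))                      # 0.0, the equality case
```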

Kullback-Leibler Divergence

The information inequality implies

    I(X ; Y) \geq 0 .

Proof.

    I(X ; Y) = H(X) - H(X \mid Y)
             = H(X) + H(Y) - H(X,Y)
             = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y) \log_2 \frac{p_{X,Y}(x,y)}{p_X(x) p_Y(y)}
             = D(p_{X,Y} \| p_X p_Y) \geq 0 .

In addition, D(p_{X,Y} \| p_X p_Y) = 0 iff p_{X,Y}(x,y) = p_X(x) p_Y(y) for all x ∈ X, y ∈ Y. This means that the variables X and Y are independent iff I(X ; Y) = 0.

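The identity I(X ; Y) = D(p_{X,Y} || p_X p_Y) can also be verified numerically; below is a sketch reusing the same invented joint pmf and hypothetical helpers.

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float).ravel(), np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

I = entropy(p_x) + entropy(p_y) - entropy(p_xy)
D = kl_divergence(p_xy, np.outer(p_x, p_y))      # D(p_{X,Y} || p_X p_Y)
assert D >= 0 and abs(I - D) < 1e-12
```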


Properties of Entropy

Properties of entropy:

1 H(X) ≥ 0

Proof. p_X(x) ≤ 1 implies \log_2 \frac{1}{p_X(x)} \geq 0.

2 H(X) ≤ \log_2 |\mathcal{X}|

Proof. Let u_X(x) = \frac{1}{|\mathcal{X}|} be the uniform distribution over X. Then

    0 \leq D(p_X \| u_X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{u_X(x)} = \log_2 |\mathcal{X}| - H(X) .

A combinatorial approach to the definition of information (Boltzmann, 1896; Hartley, 1928; Kolmogorov, 1965):

    S = k \ln W .


[Photo: Ludvig Boltzmann (1844–1906)]


Properties of Entropy (continued)

3 H(X | Y) ≤ H(X)

Proof.

    0 \leq I(X ; Y) = H(X) - H(X \mid Y) .

On the average, knowing another r.v. can only reduce uncertainty about X. However, note that H(X | Y = y) may be greater than H(X) for some y ("contradicting evidence").

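A sketch illustrating properties 1–3 and the "contradicting evidence" remark; the joint pmf below is invented so that one particular observation y increases the uncertainty about X even though conditioning helps on average.

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

# Invented joint pmf: rows index x, columns index y.
p_xy = np.array([[0.7, 0.1],
                 [0.0, 0.2]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

H_X = entropy(p_x)                                    # ~0.722 bits
H_X_given_y1 = entropy(p_xy[:, 1] / p_y[1])           # ~0.918 bits > H(X): "contradicting evidence"
H_X_given_Y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(2))

assert 0.0 <= H_X <= np.log2(len(p_x))                # 0 <= H(X) <= log2 |X|
assert H_X_given_Y <= H_X                             # on average, conditioning cannot hurt
print(H_X, H_X_given_y1, H_X_given_Y)
```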

Chain Rule of Mutual Information

The conditional mutual information of variables X and Y given Z is defined as

    I(X ; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) .

Chain Rule of Mutual Information

For random variables X and Y_1, \ldots, Y_n we have

    I(X ; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X ; Y_i \mid Y_1, \ldots, Y_{i-1}) .

Independence among Y_1, \ldots, Y_n implies

    I(X ; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X ; Y_i) .


Data Processing Inequality

Let X, Y, Z be (discrete) random variables. If Z is conditionally independent of X given Y, i.e., if we have

    p_{Z|X,Y}(z \mid x,y) = p_{Z|Y}(z \mid y)  for all x, y, z,

then X, Y, Z form a Markov chain X → Y → Z.

For instance, if Y is a "noisy" measurement of X, and Z = f(Y) is the outcome of deterministic data processing performed on Y, then we have X → Y → Z.

This implies that

    I(X ; Z \mid Y) = H(Z \mid Y) - H(Z \mid Y, X) = 0 .

When Y is known, Z doesn't give any extra information about X (and vice versa).


Data Processing Inequality

Assuming that X → Y → Z is a Markov chain, we get

    I(X ; Y, Z) = I(X ; Z) + I(X ; Y \mid Z)
                = I(X ; Y) + I(X ; Z \mid Y) .

Now, because I(X ; Z | Y) = 0 and I(X ; Y | Z) ≥ 0, we obtain:

Data Processing Inequality

If X → Y → Z is a Markov chain, then we have

    I(X ; Z) \leq I(X ; Y) .

No data processing can increase the amount of information that we have about X.

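A numeric sketch of the inequality (not from the slides): X is passed through two binary symmetric channels in series, giving a Markov chain X → Y → Z; the channel parameters and helper names are invented for illustration.

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def mutual_information(p_joint):
    p_a, p_b = p_joint.sum(axis=1), p_joint.sum(axis=0)
    return entropy(p_a) + entropy(p_b) - entropy(p_joint)

def bsc(flip):
    """Transition matrix of a binary symmetric channel with the given flip probability."""
    return np.array([[1 - flip, flip],
                     [flip, 1 - flip]])

# X ~ Bernoulli(1/2); Y = X through BSC(0.1); Z = Y through BSC(0.2).
p_x = np.array([0.5, 0.5])
p_y_given_x, p_z_given_y = bsc(0.1), bsc(0.2)

p_xy = p_x[:, None] * p_y_given_x                     # joint p(x, y)
p_xz = p_x[:, None] * (p_y_given_x @ p_z_given_y)     # joint p(x, z)

I_xy, I_xz = mutual_information(p_xy), mutual_information(p_xz)
print(I_xy, I_xz)                                     # ~0.531 vs ~0.173 bits
assert I_xz <= I_xy + 1e-12                           # data processing inequality
```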

2 Data Compression
  Asymptotic Equipartition Property (AEP)
  Typical Sets
  Noiseless Source Coding Theorem

AEP

If X_1, X_2, … is a sequence of independent and identically distributed (i.i.d.) r.v.'s with domain X and pmf p_X, then

    \log_2 \frac{1}{p_X(X_1)}, \log_2 \frac{1}{p_X(X_2)}, \ldots

is also an i.i.d. sequence of r.v.'s.

The expected values of the elements of the above sequence are all equal to the entropy:

    E\left[\log_2 \frac{1}{p_X(X_i)}\right] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} = H(X)  for all i ∈ N.



AEP

The i.i.d. assumption is equivalent to

    p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_X(x_i) ,

and hence

    \frac{1}{n} \log_2 \frac{1}{p(x_1, \ldots, x_n)} = \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_X(x_i)} .

By the (weak) law of large numbers, the average on the right-hand side converges in probability to its mean, i.e., the entropy:

    \lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_X(X_i)} - H(X)\right| < \epsilon\right] = 1  for all ε > 0.

Asymptotic Equipartition Property (AEP)

For i.i.d. sequences, we have

    \lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \log_2 \frac{1}{p(X_1, \ldots, X_n)} - H(X)\right| < \epsilon\right] = 1  for all ε > 0.

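A simulation sketch (not part of the slides) showing the convergence the AEP describes: for an i.i.d. sample, the per-symbol codelength (1/n) log2 1/p(X_1, ..., X_n) approaches H(X) as n grows. The pmf and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.25, 0.125, 0.125])              # pmf of X; H(X) = 1.75 bits
H = float(np.sum(p * np.log2(1.0 / p)))

for n in (10, 100, 10_000):
    x = rng.choice(len(p), size=n, p=p)              # i.i.d. sample X_1, ..., X_n
    per_symbol = np.mean(np.log2(1.0 / p[x]))        # (1/n) log2 1/p(x_1, ..., x_n)
    print(n, per_symbol, H)                          # approaches H(X) as n grows
```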


AEP

The AEP states that for any ε > 0 and large enough n, we have

    \Pr\left[\left|\frac{1}{n} \log_2 \frac{1}{p(X_1, \ldots, X_n)} - H(X)\right| < \epsilon\right] \approx 1 .

The event inside can be rewritten step by step:

    \left|\frac{1}{n} \log_2 \frac{1}{p(X_1, \ldots, X_n)} - H(X)\right| < \epsilon
    \Leftrightarrow H(X) - \epsilon < \frac{1}{n} \log_2 \frac{1}{p(X_1, \ldots, X_n)} < H(X) + \epsilon
    \Leftrightarrow n(H(X) - \epsilon) < \log_2 \frac{1}{p(X_1, \ldots, X_n)} < n(H(X) + \epsilon)
    \Leftrightarrow 2^{-n(H(X)+\epsilon)} < p(X_1, \ldots, X_n) < 2^{-n(H(X)-\epsilon)} ,

so equivalently

    \Pr\left[p(X_1, \ldots, X_n) = 2^{-n(H(X) \pm \epsilon)}\right] \approx 1 .

Asymptotic Equipartition Property (informally)

"Almost all sequences are almost equally likely."


AEP

Technically, the key step in the proof was using the weak law of large numbers to deduce

    \lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_X(X_i)} - H(X)\right| < \epsilon\right] = 1  for all ε > 0.

In other words, with high probability the average "surprisingness" \log_2 \frac{1}{p_X(X_i)} over the sequence is close to its expectation.

Of course we could just leave out the logs and similarly use the law of large numbers to deduce

    \lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \sum_{i=1}^{n} p_X(X_i) - E[p_X(X_i)]\right| < \epsilon\right] = 1  for all ε > 0.

That is, with high probability the average probability of the elements is close to its expectation. However, this is less useful because the sum \sum_{i=1}^{n} p_X(X_i) has no clear connection to the probability p(X_1, \ldots, X_n) of the whole sequence.

We get the connection by taking logs, which converts products into sums, allowing us to then use the i.i.d. assumption.


Typical Sets

Typical Set

The typical set A_\epsilon^{(n)} is the set of sequences (x_1, \ldots, x_n) \in \mathcal{X}^n with the property

    2^{-n(H(X)+\epsilon)} \leq p(x_1, \ldots, x_n) \leq 2^{-n(H(X)-\epsilon)} .

The AEP states that

    \lim_{n \to \infty} \Pr\left[X^n \in A_\epsilon^{(n)}\right] = 1 .

In particular, for any ε > 0 and large enough n, we have

    \Pr\left[X^n \in A_\epsilon^{(n)}\right] > 1 - \epsilon .
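As a final sketch (not from the slides), the probability of the typical set can be estimated by Monte Carlo: draw many i.i.d. sequences and check how often the per-symbol codelength lies within ε of H(X). The pmf, ε, n, and trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

p = np.array([0.5, 0.25, 0.125, 0.125])
H = float(np.sum(p * np.log2(1.0 / p)))               # 1.75 bits
eps, n, trials = 0.1, 500, 10_000

samples = rng.choice(len(p), size=(trials, n), p=p)   # many i.i.d. sequences of length n
per_symbol = np.mean(np.log2(1.0 / p[samples]), axis=1)
in_typical = np.abs(per_symbol - H) < eps             # membership in the typical set

print(np.mean(in_typical))                            # close to 1 for large n, as the AEP predicts
```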
