
Information-Theoretic Modeling
Lecture 3: Source Coding: Theory

Jyrki Kivinen
Department of Computer Science, University of Helsinki

Autumn 2012


Outline

1 Entropy and Information
  Entropy
  Information Inequality
  Data Processing Inequality

2 Data Compression
  Asymptotic Equipartition Property (AEP)
  Typical Sets
  Noiseless Source Coding Theorem


Entropy

Given a discrete random variable X with pmf p_X, we can measure the amount of "surprise" associated with each outcome x ∈ X by the quantity

    I_X(x) = \log_2 \frac{1}{p_X(x)} .

The less likely an outcome is, the more surprised we are to observe it. (The point of the log scale will become clear shortly.)

The entropy of X measures the expected amount of "surprise":

    H(X) = E[I_X(X)] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} .

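As an illustration not on the original slides, here is a minimal Python sketch of the entropy formula above; the use of numpy, the function name entropy, and the example pmf are choices of this sketch, not part of the lecture.

```python
import numpy as np

def entropy(pmf):
    """H(X) in bits for a discrete pmf given as an array of probabilities."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                               # convention: 0 * log2(1/0) = 0
    return float(np.sum(p * np.log2(1.0 / p)))

# Example: a loaded four-sided die vs. a fair one.
print(entropy([0.5, 0.25, 0.125, 0.125]))      # 1.75 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))       # 2.0 bits (uniform case)
```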

Binary Entropy Function

For binary-valued X, with p = p_X(1) = 1 − p_X(0), we have

    H(X) = p \log_2 \frac{1}{p} + (1-p) \log_2 \frac{1}{1-p} .
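A short sketch of the binary case (again a hypothetical helper, not from the slides); the values illustrate how the entropy drops as the distribution becomes more skewed.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0                              # a certain outcome carries no surprise
    return p * np.log2(1.0 / p) + (1 - p) * np.log2(1.0 / (1 - p))

for p in (0.5, 0.9, 0.99):
    print(p, binary_entropy(p))                 # 1.0, ~0.469, ~0.081 bits
```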

More Entropies

1 The joint entropy of two (or more) random variables:

    H(X,Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y) \log_2 \frac{1}{p_{X,Y}(x,y)} ,

2 the entropy of a conditional distribution:

    H(X \mid Y=y) = \sum_{x \in \mathcal{X}} p_{X|Y}(x \mid y) \log_2 \frac{1}{p_{X|Y}(x \mid y)} ,

3 and the conditional entropy:

    H(X \mid Y) = \sum_{y \in \mathcal{Y}} p_Y(y) H(X \mid Y=y)
                = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y) \log_2 \frac{1}{p_{X|Y}(x \mid y)} .


More Entropies

The joint entropy H(X,Y) measures the uncertainty about the pair (X,Y).

The entropy of the conditional distribution H(X | Y = y) measures the uncertainty about X when we know that Y = y.

The conditional entropy H(X | Y) measures the expected uncertainty about X when the value of Y is known.
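A minimal sketch (not from the slides) of how these quantities can be computed from a joint pmf table; the joint pmf p_xy below is invented for illustration, and numpy is assumed for convenience.

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

# Invented joint pmf p_{X,Y}(x, y); rows index x, columns index y.
p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])

H_XY = entropy(p_xy)                           # joint entropy H(X,Y)
p_y = p_xy.sum(axis=0)                         # marginal of Y
# H(X | Y) = sum_y p(y) * H(X | Y = y), with p_{X|Y}(x|y) = p(x,y) / p(y)
H_X_given_Y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(p_xy.shape[1]))

print(H_XY, H_X_given_Y)
```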


Chain Rule of Entropy

Remember the chain rule of probability:

    p_{X,Y}(x,y) = p_Y(y) \cdot p_{X|Y}(x \mid y) .

For the entropy we have:

Chain Rule of Entropy

    H(X,Y) = H(Y) + H(X \mid Y) .

Proof.

    \log_2 \frac{1}{p_{X,Y}(x,y)} = \log_2 \frac{1}{p_Y(y)} + \log_2 \frac{1}{p_{X|Y}(x \mid y)}
    \Leftrightarrow E\left[\log_2 \frac{1}{p_{X,Y}(X,Y)}\right] = E\left[\log_2 \frac{1}{p_Y(Y)}\right] + E\left[\log_2 \frac{1}{p_{X|Y}(X \mid Y)}\right]
    \Leftrightarrow H(X,Y) = H(Y) + H(X \mid Y) .

The rule can be extended to more than two random variables:

    H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}) .

X and Y independent ⇔ H(X | Y) = H(X) ⇔ H(X,Y) = H(X) + H(Y). The logarithmic scale makes entropy additive.

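A quick numerical check of the chain rule, as a sketch (the joint pmf and helper names are the same invented ones as above, redefined here so the snippet runs on its own).

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])
p_y = p_xy.sum(axis=0)

H_XY = entropy(p_xy)
H_Y = entropy(p_y)
H_X_given_Y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(2))

# Chain rule: H(X,Y) = H(Y) + H(X | Y)
assert abs(H_XY - (H_Y + H_X_given_Y)) < 1e-12
```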


Mutual Information

The mutual information

    I(X ; Y) = H(X) - H(X \mid Y)

measures the average decrease in uncertainty about X when the value of Y becomes known.

Mutual information is symmetric (chain rule):

    I(X ; Y) = H(X) - H(X \mid Y) = (H(X) - H(X,Y)) + H(Y)
             = H(Y) - H(Y \mid X) = I(Y ; X) .

On the average, X gives as much information about Y as Y gives about X.


Relationships between Entropies

[Diagram: the joint entropy H(X,Y) split into three parts H(X | Y), I(X ; Y), and H(Y | X); H(X) covers H(X | Y) and I(X ; Y), while H(Y) covers I(X ; Y) and H(Y | X).]
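These relationships can be checked numerically. The sketch below (same invented joint pmf as before) computes I(X ; Y) from the diagram's decomposition and verifies the symmetry stated above.

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

I = entropy(p_x) + entropy(p_y) - entropy(p_xy)          # I(X;Y) = H(X)+H(Y)-H(X,Y)
H_X_given_Y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(2))
H_Y_given_X = sum(p_x[i] * entropy(p_xy[i, :] / p_x[i]) for i in range(2))

assert abs(I - (entropy(p_x) - H_X_given_Y)) < 1e-12     # I(X;Y) = H(X) - H(X|Y)
assert abs(I - (entropy(p_y) - H_Y_given_X)) < 1e-12     # symmetry: = H(Y) - H(Y|X)
```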


Information Inequality

Kullback-Leibler Divergence

The relative entropy or Kullback-Leibler divergence between (discrete) distributions p_X and q_X is defined as

    D(p_X \| q_X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{q_X(x)} .

(We consider p_X(x) \log_2 \frac{p_X(x)}{q_X(x)} = 0 whenever p_X(x) = 0.)

Information Inequality

For any two (discrete) distributions p_X and q_X, we have

    D(p_X \| q_X) \geq 0 ,

with equality iff p_X(x) = q_X(x) for all x ∈ X.

Proof. Gibbs!

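A small sketch of the divergence (not from the slides); it assumes q(x) > 0 wherever p(x) > 0, and the example distributions are made up.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                                # convention: 0 * log2(0/q) = 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
print(kl_divergence(p, q))                      # > 0, since p != q
print(kl_divergence(p, p))                      # 0.0, the equality case
```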

Kullback-Leibler Divergence

The information inequality implies

    I(X ; Y) \geq 0 .

Proof.

    I(X ; Y) = H(X) - H(X \mid Y)
             = H(X) + H(Y) - H(X,Y)
             = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y) \log_2 \frac{p_{X,Y}(x,y)}{p_X(x) p_Y(y)}
             = D(p_{X,Y} \| p_X p_Y) \geq 0 .

In addition, D(p_{X,Y} \| p_X p_Y) = 0 iff p_{X,Y}(x,y) = p_X(x) p_Y(y) for all x ∈ X, y ∈ Y. This means that the variables X and Y are independent iff I(X ; Y) = 0.

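The identity I(X ; Y) = D(p_{X,Y} || p_X p_Y) can also be verified numerically; below is a sketch reusing the same invented joint pmf and hypothetical helpers.

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float).ravel(), np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

I = entropy(p_x) + entropy(p_y) - entropy(p_xy)
D = kl_divergence(p_xy, np.outer(p_x, p_y))      # D(p_{X,Y} || p_X p_Y)
assert D >= 0 and abs(I - D) < 1e-12
```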


Properties of Entropy

Properties of entropy:

1 H(X) ≥ 0

Proof. p_X(x) ≤ 1 implies \log_2 \frac{1}{p_X(x)} \geq 0.

2 H(X) ≤ \log_2 |\mathcal{X}|

Proof. Let u_X(x) = \frac{1}{|\mathcal{X}|} be the uniform distribution over X. Then

    0 \leq D(p_X \| u_X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{u_X(x)} = \log_2 |\mathcal{X}| - H(X) .

A combinatorial approach to the definition of information (Boltzmann, 1896; Hartley, 1928; Kolmogorov, 1965):

    S = k \ln W .


[Photo: Ludvig Boltzmann (1844–1906)]


Properties of Entropy (continued)

3 H(X | Y) ≤ H(X)

Proof.

    0 \leq I(X ; Y) = H(X) - H(X \mid Y) .

On the average, knowing another r.v. can only reduce uncertainty about X. However, note that H(X | Y = y) may be greater than H(X) for some y ("contradicting evidence").

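A sketch illustrating properties 1–3 and the "contradicting evidence" remark; the joint pmf below is invented so that one particular observation y increases the uncertainty about X even though conditioning helps on average.

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

# Invented joint pmf: rows index x, columns index y.
p_xy = np.array([[0.7, 0.1],
                 [0.0, 0.2]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

H_X = entropy(p_x)                                    # ~0.722 bits
H_X_given_y1 = entropy(p_xy[:, 1] / p_y[1])           # ~0.918 bits > H(X): "contradicting evidence"
H_X_given_Y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(2))

assert 0.0 <= H_X <= np.log2(len(p_x))                # 0 <= H(X) <= log2 |X|
assert H_X_given_Y <= H_X                             # on average, conditioning cannot hurt
print(H_X, H_X_given_y1, H_X_given_Y)
```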

Chain Rule of Mutual Information

The conditional mutual information of variables X and Y given Z is defined as

    I(X ; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) .

Chain Rule of Mutual Information

For random variables X and Y_1, \ldots, Y_n we have

    I(X ; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X ; Y_i \mid Y_1, \ldots, Y_{i-1}) .

Independence among Y_1, \ldots, Y_n implies

    I(X ; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X ; Y_i) .


Data Processing Inequality

Let X, Y, Z be (discrete) random variables. If Z is conditionally independent of X given Y, i.e., if we have

    p_{Z|X,Y}(z \mid x,y) = p_{Z|Y}(z \mid y)  for all x, y, z,

then X, Y, Z form a Markov chain X → Y → Z.

For instance, if Y is a "noisy" measurement of X, and Z = f(Y) is the outcome of deterministic data processing performed on Y, then we have X → Y → Z.

This implies that

    I(X ; Z \mid Y) = H(Z \mid Y) - H(Z \mid Y, X) = 0 .

When Y is known, Z doesn't give any extra information about X (and vice versa).


Data Processing Inequality

Assuming that X → Y → Z is a Markov chain, we get

    I(X ; Y, Z) = I(X ; Z) + I(X ; Y \mid Z)
                = I(X ; Y) + I(X ; Z \mid Y) .

Now, because I(X ; Z | Y) = 0 and I(X ; Y | Z) ≥ 0, we obtain:

Data Processing Inequality

If X → Y → Z is a Markov chain, then we have

    I(X ; Z) \leq I(X ; Y) .

No data processing can increase the amount of information that we have about X.

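A numeric sketch of the inequality (not from the slides): X is passed through two binary symmetric channels in series, giving a Markov chain X → Y → Z; the channel parameters and helper names are invented for illustration.

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def mutual_information(p_joint):
    p_a, p_b = p_joint.sum(axis=1), p_joint.sum(axis=0)
    return entropy(p_a) + entropy(p_b) - entropy(p_joint)

def bsc(flip):
    """Transition matrix of a binary symmetric channel with the given flip probability."""
    return np.array([[1 - flip, flip],
                     [flip, 1 - flip]])

# X ~ Bernoulli(1/2); Y = X through BSC(0.1); Z = Y through BSC(0.2).
p_x = np.array([0.5, 0.5])
p_y_given_x, p_z_given_y = bsc(0.1), bsc(0.2)

p_xy = p_x[:, None] * p_y_given_x                     # joint p(x, y)
p_xz = p_x[:, None] * (p_y_given_x @ p_z_given_y)     # joint p(x, z)

I_xy, I_xz = mutual_information(p_xy), mutual_information(p_xz)
print(I_xy, I_xz)                                     # ~0.531 vs ~0.173 bits
assert I_xz <= I_xy + 1e-12                           # data processing inequality
```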

2 Data Compression
  Asymptotic Equipartition Property (AEP)
  Typical Sets
  Noiseless Source Coding Theorem

AEP

If X_1, X_2, … is a sequence of independent and identically distributed (i.i.d.) r.v.'s with domain X and pmf p_X, then

    \log_2 \frac{1}{p_X(X_1)}, \log_2 \frac{1}{p_X(X_2)}, \ldots

is also an i.i.d. sequence of r.v.'s.

The expected values of the elements of the above sequence are all equal to the entropy:

    E\left[\log_2 \frac{1}{p_X(X_i)}\right] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} = H(X)  for all i ∈ N.



AEP

The i.i.d. assumption is equivalent to

    p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_X(x_i) ,

and hence

    \frac{1}{n} \log_2 \frac{1}{p(x_1, \ldots, x_n)} = \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_X(x_i)} .

By the (weak) law of large numbers, the average on the right-hand side converges in probability to its mean, i.e., the entropy:

    \lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_X(X_i)} - H(X)\right| < \epsilon\right] = 1  for all ε > 0.

Asymptotic Equipartition Property (AEP)

For i.i.d. sequences, we have

    \lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \log_2 \frac{1}{p(X_1, \ldots, X_n)} - H(X)\right| < \epsilon\right] = 1  for all ε > 0.

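A simulation sketch (not part of the slides) showing the convergence the AEP describes: for an i.i.d. sample, the per-symbol codelength (1/n) log2 1/p(X_1, ..., X_n) approaches H(X) as n grows. The pmf and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.25, 0.125, 0.125])              # pmf of X; H(X) = 1.75 bits
H = float(np.sum(p * np.log2(1.0 / p)))

for n in (10, 100, 10_000):
    x = rng.choice(len(p), size=n, p=p)              # i.i.d. sample X_1, ..., X_n
    per_symbol = np.mean(np.log2(1.0 / p[x]))        # (1/n) log2 1/p(x_1, ..., x_n)
    print(n, per_symbol, H)                          # approaches H(X) as n grows
```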


AEP

The AEP states that for any ε > 0 and large enough n, we have

    \Pr\left[\left|\frac{1}{n} \log_2 \frac{1}{p(X_1, \ldots, X_n)} - H(X)\right| < \epsilon\right] \approx 1 .

The event inside can be rewritten step by step:

    \left|\frac{1}{n} \log_2 \frac{1}{p(X_1, \ldots, X_n)} - H(X)\right| < \epsilon
    \Leftrightarrow H(X) - \epsilon < \frac{1}{n} \log_2 \frac{1}{p(X_1, \ldots, X_n)} < H(X) + \epsilon
    \Leftrightarrow n(H(X) - \epsilon) < \log_2 \frac{1}{p(X_1, \ldots, X_n)} < n(H(X) + \epsilon)
    \Leftrightarrow 2^{-n(H(X)+\epsilon)} < p(X_1, \ldots, X_n) < 2^{-n(H(X)-\epsilon)} ,

so equivalently

    \Pr\left[p(X_1, \ldots, X_n) = 2^{-n(H(X) \pm \epsilon)}\right] \approx 1 .

Asymptotic Equipartition Property (informally)

"Almost all sequences are almost equally likely."


AEP

Technically, the key step in the proof was using the weak law of large numbers to deduce

    \lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_X(X_i)} - H(X)\right| < \epsilon\right] = 1  for all ε > 0.

In other words, with high probability the average "surprisingness" \log_2 \frac{1}{p_X(X_i)} over the sequence is close to its expectation.

Of course we could just leave out the logs and similarly use the law of large numbers to deduce

    \lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \sum_{i=1}^{n} p_X(X_i) - E[p_X(X_i)]\right| < \epsilon\right] = 1  for all ε > 0.

That is, with high probability the average probability of the elements is close to its expectation. However, this is less useful because the sum \sum_{i=1}^{n} p_X(X_i) has no clear connection to the probability p(X_1, \ldots, X_n) of the whole sequence.

We get the connection by taking logs, which converts products into sums, allowing us to then use the i.i.d. assumption.


Typical Sets

Typical Set

The typical set A_\epsilon^{(n)} is the set of sequences (x_1, \ldots, x_n) \in \mathcal{X}^n with the property

    2^{-n(H(X)+\epsilon)} \leq p(x_1, \ldots, x_n) \leq 2^{-n(H(X)-\epsilon)} .

The AEP states that

    \lim_{n \to \infty} \Pr\left[X^n \in A_\epsilon^{(n)}\right] = 1 .

In particular, for any ε > 0 and large enough n, we have

    \Pr\left[X^n \in A_\epsilon^{(n)}\right] > 1 - \epsilon .
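As a final sketch (not from the slides), the probability of the typical set can be estimated by Monte Carlo: draw many i.i.d. sequences and check how often the per-symbol codelength lies within ε of H(X). The pmf, ε, n, and trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

p = np.array([0.5, 0.25, 0.125, 0.125])
H = float(np.sum(p * np.log2(1.0 / p)))               # 1.75 bits
eps, n, trials = 0.1, 500, 10_000

samples = rng.choice(len(p), size=(trials, n), p=p)   # many i.i.d. sequences of length n
per_symbol = np.mean(np.log2(1.0 / p[samples]), axis=1)
in_typical = np.abs(per_symbol - H) < eps             # membership in the typical set

print(np.mean(in_typical))                            # close to 1 for large n, as the AEP predicts
```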
