
Information-Theoretic Modeling

Lecture 5: Source Coding: Algorithms

Jyrki Kivinen

Department of Computer Science, University of Helsinki

Autumn 2012


Outline

1 Codes
   Decodable Codes
   Prefix Codes
   Kraft-McMillan Theorem

2 Optimal Codes
   Entropy Lower Bound
   Shannon-Fano Coding
   Huffman


Extension Code

A (binary) symbol code C : X → {0,1}* is a mapping from the alphabet X to the set {0,1}* of finite binary sequences.

The extension of code C is the mapping C* : X* → {0,1}* obtained by concatenating the codewords C(x_i) for each input symbol x_i:

C*(x_1 x_2 … x_n) = C(x_1) C(x_2) … C(x_n).

[Figure: the extension C* encodes the string INPUT_STRING… one symbol at a time; with C(I) = 1001, C(N) = 0001, C(P) = 111001, C(U) = 10101, C(T) = 1111, C(_) = 01, the output begins 1001 0001 111001 10101 1111 01 …]
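To make the definition concrete, here is a minimal Python sketch of a symbol code and its extension (not part of the original slides; the codeword table is the one suggested by the figure, and the function name `extend` is my own):

```python
# A symbol code as a dictionary, and its extension C*: concatenate the codewords.
# Codeword table as suggested by the figure above (an assumption, for illustration).
code = {"I": "1001", "N": "0001", "P": "111001",
        "U": "10101", "T": "1111", "_": "01"}

def extend(code, symbols):
    """Encode a sequence of symbols with the extension C* of the symbol code."""
    return "".join(code[s] for s in symbols)

print(extend(code, "INPUT_"))  # -> 1001000111100110101111101
```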

(Other types of code)

For reference, some examples of codes that are not symbol codes:

Run-length encoding (RLE): for example, encode aaaaccabbb as (a,4),(c,2),(a,1),(b,3) (see the sketch below).

Adaptive codes, where the code of a symbol may change based on what symbols have appeared previously.

Note that coding blocks of b bits for some constant b is still a symbol code, with alphabet size |X| = 2^b.
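For concreteness, a minimal run-length encoder matching the example above (a sketch of the generic idea, not code from the course):

```python
from itertools import groupby

def rle(s):
    """Run-length encode a string as (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(s)]

print(rle("aaaaccabbb"))  # [('a', 4), ('c', 2), ('a', 1), ('b', 3)]
```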

Decodable Codes

Decodable Code: code C is (uniquely) decodable iff its extension C* is a one-to-one mapping, i.e., iff

(x_1, …, x_n) ≠ (y_1, …, y_n)  ⇒  C*(x_1, …, x_n) ≠ C*(y_1, …, y_n).

Examples:

A code with codewords {0, 1, 10, 11} is not uniquely decodable: what does 10 mean?

A code with codewords {00, 01, 10, 11} is uniquely decodable: each pair of bits can be decoded individually.

A code with codewords {0, 01, 011, 0111} is also uniquely decodable: what does 0011 mean? (A brute-force check of short inputs is sketched below.)
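The examples above can be probed with a small brute-force sanity check (my own sketch, not a method from the slides): it encodes all sequences of up to max_symbols codewords and looks for two different sequences with the same encoding. Finding a collision refutes unique decodability; finding none only says the code looks decodable up to the tested length.

```python
from itertools import product

def find_collision(codewords, max_symbols):
    """Return two different codeword sequences with the same concatenation, or
    None if no collision is found among sequences of up to max_symbols codewords."""
    seen = {}
    for n in range(1, max_symbols + 1):
        for seq in product(codewords, repeat=n):
            encoding = "".join(seq)
            if encoding in seen and seen[encoding] != seq:
                return seen[encoding], seq
            seen[encoding] = seq
    return None

print(find_collision(["0", "1", "10", "11"], 3))      # e.g. (('10',), ('1', '0'))
print(find_collision(["0", "01", "011", "0111"], 4))  # None at these lengths
```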

Prefix Codes

An important subset of decodable codes is the set of prefix(-free) codes.

Prefix Code: a code C : X → {0,1}* is called a prefix code iff no codeword is a prefix of another.

It is easily seen that all prefix codes are uniquely decodable: each symbol can be decoded as soon as its codeword has been read. Therefore, prefix codes are also called instantaneous codes.

Examples:

A code with codewords {0, 01, 011, 0111} is uniquely decodable but not prefix-free: e.g., 0 is a prefix of 01.

A code with codewords {0, 10, 110, 111} is prefix-free. (An instantaneous decoder for it is sketched below.)
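Instantaneous decoding is easy to sketch in Python for the prefix code {0, 10, 110, 111}; the symbol names a, b, c, d are labels chosen here for illustration:

```python
def decode_prefix(code, bits):
    """Decode a bit string with a prefix code, emitting each symbol as soon as
    its complete codeword has been read (instantaneous decoding)."""
    inverse = {codeword: symbol for symbol, codeword in code.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:      # prefix property: at most one codeword can match
            decoded.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("input ended in the middle of a codeword")
    return decoded

code = {"a": "0", "b": "10", "c": "110", "d": "111"}
print(decode_prefix(code, "010110111"))  # ['a', 'b', 'c', 'd']
```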

Kraft Inequality

The codeword lengths of a prefix code satisfy the following important property.

Kraft Inequality: the codeword lengths ℓ_1, …, ℓ_m of any (binary) prefix code satisfy

∑_{i=1}^{m} 2^{-ℓ_i} ≤ 1.

Conversely, given a set of codeword lengths that satisfy this inequality, there is a prefix code with these codeword lengths.
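The inequality is easy to check numerically; a sketch (the first two length sets correspond to codes from these slides, {0, 10, 110, 111} and {0, 1, 10, 11}, and the third is a generic fixed-length code):

```python
def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality for binary codeword lengths."""
    return sum(2.0 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> lengths of {0, 10, 110, 111}: achievable
print(kraft_sum([1, 1, 2, 2]))  # 1.5  -> lengths of {0, 1, 10, 11}: no prefix code exists
print(kraft_sum([2, 2, 2, 2]))  # 1.0  -> a fixed-length code uses exactly the whole budget
```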

[Figures: the Kraft "codeword budget" illustrated on the complete binary tree of all strings of length at most 4. The prefix code {0, 10, 110, 111} uses exactly the whole budget; a code whose lengths violate the Kraft inequality cannot be decodable; a fixed-length code is decodable and prefix-free; further example codes satisfy the inequality yet fail to be prefix-free, or fail to be even decodable.]

Kraft Inequality: proof

Proof. Assume first that we have a prefix code with codeword lengths ℓ_1, …, ℓ_m. Let L = max_i ℓ_i be the length of the longest codeword.

Create a binary tree with the codewords as leaves.

Add nodes to make a complete tree of depth L.

Assign each new leaf to the codeword above it. Notice that because of the prefix property, each leaf gets assigned at most once.

[Figure: the construction for the code C(a) = 0, C(b) = 100, C(c) = 11, C(d) = 101 with lengths ℓ_1 = 1, ℓ_2 = 3, ℓ_3 = 2, ℓ_4 = 3: first the code tree with the codewords as leaves, then the tree completed to depth L = 3, then each new leaf assigned to the codeword above it.]

A codeword of length ℓ gets 2^{L-ℓ} leaves assigned to it.

There are 2^L leaves in total.

Since the number of assigned leaves is at most the total number of leaves,

∑_{i=1}^{m} 2^{L-ℓ_i} ≤ 2^L,

that is, ∑_{i=1}^{m} 2^L · 2^{-ℓ_i} ≤ 2^L, and dividing by 2^L gives

∑_{i=1}^{m} 2^{-ℓ_i} ≤ 1.

This concludes the first half of the proof. (A numeric check of the counting step is sketched below.)
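A numeric check of the counting step for the code used in the figure (C(a) = 0, C(b) = 100, C(c) = 11, C(d) = 101); a sketch for illustration only:

```python
lengths = [1, 3, 2, 3]      # codeword lengths of {0, 100, 11, 101}
L = max(lengths)            # depth of the completed tree

assigned = sum(2 ** (L - l) for l in lengths)   # leaves claimed by the codewords
total = 2 ** L                                  # number of leaves at depth L

print(assigned, total)      # 8 8 -> assigned <= total, i.e. the sum of 2**-l is <= 1
```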

Assume now that we are given a set of integers ℓ_1, …, ℓ_m such that

∑_{i=1}^{m} 2^{-ℓ_i} ≤ 1.

Again let L = max_i ℓ_i. We construct a prefix code with codeword lengths ℓ_i.

We can assume that the lengths are sorted so that ℓ_1 ≤ ℓ_2 ≤ … ≤ ℓ_m, and that ℓ_1 > 0.

Now create a complete binary tree of depth L and associate codewords to nodes of the tree as previously. Repeat the following for i = 1, …, m (see the sketch below):

1 Choose the leftmost remaining node at depth ℓ_i (the depth of the root is 0) and assign the corresponding string as codeword i.

2 Remove the chosen node and its descendants from the tree.
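The construction can be written compactly without an explicit tree: keeping an integer counter for the leftmost free node at the current depth is equivalent to pruning the chosen node and its descendants. A sketch under the stated assumptions (lengths sorted, Kraft inequality satisfied):

```python
def prefix_code_from_lengths(lengths):
    """Given codeword lengths in non-decreasing order that satisfy the Kraft
    inequality, return binary codewords forming a prefix code with those lengths."""
    assert all(a <= b for a, b in zip(lengths, lengths[1:])), "sort the lengths first"
    assert sum(2.0 ** -l for l in lengths) <= 1.0, "Kraft inequality violated"
    codewords, node, depth = [], 0, 0   # 'node' = leftmost free node at 'depth'
    for l in lengths:
        node <<= (l - depth)            # descend to depth l, staying as far left as possible
        codewords.append(format(node, "0{}b".format(l)))
        node += 1                       # skip this node, and hence its whole subtree
        depth = l
    return codewords

print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```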

Kraft Inequality: example

[Figures: the construction applied to the sorted lengths ℓ_1 = 1, ℓ_2 = 2, ℓ_3 = 3, ℓ_4 = 3. Step by step the leftmost free node at each depth is chosen, yielding C(a) = 0, C(c) = 10, C(b) = 110, C(d) = 111; after each assignment the chosen node and its descendants are removed from the tree.]

The order of using up the tree is important.

[Figure: processing the same lengths in an unsorted order, e.g., assigning C(b) = 000, C(c) = 01 and C(d) = 100 first, leaves no codeword of length 1 available for a.]

Notice that since ℓ_i ≤ ℓ_{i+1}, we keep moving deeper into the tree.

Therefore the descendants of the node chosen for ℓ_i cannot include any descendants of the nodes chosen for ℓ_1, …, ℓ_{i-1}.

Therefore, when ℓ_i is assigned, exactly 2^{L-ℓ_i} leaves are removed from the tree.

By the same leaf-counting argument as before, the tree does not run out of leaves before we have assigned all the codewords. This concludes the second direction of the proof.

Kraft Inequality

Question: what if the inequality is satisfied strictly, i.e., the sum is less than one:

∑_{i=1}^{m} 2^{-ℓ_i} < 1 ?

Then it is possible to make some codewords shorter and still have a decodable (prefix) code.

[Figures: if not all of the budget is used, some codewords can be made shorter; a code that uses the whole budget (Kraft sum equal to 1) is called "Kraft tight", or complete.]

Kraft–McMillan Theorem

The Kraft inequality restricts the codeword lengths of prefix codes. Could we do much better if we required only decodability? In fact, it can be shown that we lose nothing at all!

Kraft-McMillan Theorem: the codeword lengths ℓ_1, …, ℓ_m of any uniquely decodable (binary) code satisfy

∑_{i=1}^{m} 2^{-ℓ_i} ≤ 1.

Conversely, given a set of codeword lengths that satisfy this inequality, there is a uniquely decodable (prefix) code with these codeword lengths.

[Figure: prefix codes form a subset of uniquely decodable codes, which form a subset of all codes; the Kraft inequality characterizes the codeword lengths achievable by both prefix codes and decodable codes.]

2 Optimal Codes: Entropy Lower Bound, Shannon-Fano Coding, Huffman

Codelengths and Probabilities

Let ℓ_1, …, ℓ_m be the codeword lengths of a uniquely decodable code C : X → {0,1}*. By the Kraft-McMillan theorem we have

c = ∑_{i=1}^{m} 2^{-ℓ_i} ≤ 1.

Define a probability mass function p : X → [0,1] as follows:

p_i = 2^{-ℓ_i} / c   ⇔   ℓ_i = log_2 (1 / (c p_i)),

where c is given above. The function p is indeed a pmf (see the sketch below):

1 Non-negative: p(x) ≥ 0 for all x ∈ X.

2 Sums to one: ∑_{x∈X} p(x) = ∑_{i=1}^{m} (1/c) 2^{-ℓ_i} = c/c = 1.
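A small sketch of the correspondence between codeword lengths and a pmf (the length set is the one from the tree example; for a code that is not Kraft tight, c < 1 and the normalization matters):

```python
import math

def pmf_from_lengths(lengths):
    """The pmf p_i = 2**(-l_i) / c induced by codeword lengths, and the constant c."""
    c = sum(2.0 ** -l for l in lengths)
    return [2.0 ** -l / c for l in lengths], c

p, c = pmf_from_lengths([1, 2, 3, 3])
print(c, p)                                    # 1.0 [0.5, 0.25, 0.125, 0.125]
print([math.log2(1 / (c * pi)) for pi in p])   # recovers the lengths: [1.0, 2.0, 3.0, 3.0]
```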

Assuming that the code is "Kraft tight", i.e., c = 1, then under the pmf p corresponding to the codeword lengths ℓ_1, …, ℓ_m, the expected codeword length is

E[ℓ(X)] = ∑_{i=1}^{m} 2^{-ℓ_i} ℓ_i = ∑_{i=1}^{m} p_i log_2 (1/p_i) = H(X).

This is the best we can hope for: the expected codelength of any uniquely decodable code is at least the entropy,

E[ℓ(X)] ≥ H(X).
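A quick numeric check (a sketch; the second source distribution is a hypothetical example of my own): when the source pmf matches the codeword lengths of a Kraft-tight code, the expected length equals the entropy; for any other source it is strictly larger.

```python
import math

def expected_length(p, lengths):
    return sum(pi * l for pi, l in zip(p, lengths))

def entropy(p):
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

lengths = [1, 2, 3, 3]                  # a Kraft-tight prefix code, e.g. {0, 10, 110, 111}
p_match = [0.5, 0.25, 0.125, 0.125]     # pmf matching the lengths
p_other = [0.4, 0.3, 0.2, 0.1]          # some other source pmf (hypothetical)

print(expected_length(p_match, lengths), entropy(p_match))  # 1.75 1.75
print(expected_length(p_other, lengths), entropy(p_other))  # 1.9  ~1.846: E[l] > H
```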

Entropy Lower Bound

E[ℓ(X)] ≥ H(X).

Proof.

E[ℓ(X)] − H(X) = ∑_{x∈X} p(x) ℓ(x) − ∑_{x∈X} p(x) log_2 (1/p(x))
              = ∑_{x∈X} p(x) log_2 (1/2^{-ℓ(x)}) − ∑_{x∈X} p(x) log_2 (1/p(x))
              = ∑_{x∈X} p(x) log_2 (p(x)/2^{-ℓ(x)})
              = ∑_{x∈X} p(x) ( log_2 (p(x)/q(x)) + log_2 (1/c) )      [ taking q(x) = 2^{-ℓ(x)}/c ]
              = D(p ‖ q) + log_2 (1/c) ≥ 0,

since D(p ‖ q) ≥ 0 by the information inequality and log_2 (1/c) ≥ 0 because c ≤ 1.
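A numeric check of the decomposition derived above (a sketch; the source pmf and the length set are hypothetical examples chosen so that c < 1):

```python
import math

p = [0.4, 0.3, 0.2, 0.1]            # source pmf (hypothetical example)
lengths = [2, 2, 3, 4]              # codeword lengths with c < 1

c = sum(2.0 ** -l for l in lengths)       # here c = 0.6875
q = [2.0 ** -l / c for l in lengths]      # q(x) = 2**(-l(x)) / c

expected_len = sum(pi * l for pi, l in zip(p, lengths))
H = sum(pi * math.log2(1 / pi) for pi in p)
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(expected_len - H, kl + math.log2(1 / c))  # both ~0.5536: the decomposition holds
```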
