### 582650 Information-Theoretic Modeling (Autumn 2012) Homework 3/Solutions

1. (Cover & Thomas, Ex. 5.3.) Consider a prefix code C : X → {0,1}^* where |X| = m, and assume the code lengths ℓ_1, ..., ℓ_m satisfy

       ∑_{i=1}^{m} 2^{-ℓ_i} < 1

   (i.e., the Kraft inequality is not tight). Show that there are arbitrarily long bit strings that are not codes for any sequence of symbols. That is, there is an infinite number of bit strings s ∈ {0,1}^* such that for all x_1 ... x_n ∈ X^* we have C^*(x_1 ... x_n) ≠ s, where C^* is the extension of C. (Hint: find a string s that is not a prefix of any codeword.)

Solution Let

       ∑_{i=1}^{m} 2^{-ℓ_i} = a < 1.

Then there must exist a string s ∈ {0,1}^* of length ℓ(s) ≥ -log_2(1-a) which is not a prefix of any codeword in C(X). Otherwise the code would either be complete, that is, Kraft's inequality would be tight, or there would need to be a string s' ∈ {0,1}^* of length ℓ(s') < -log_2(1-a) that is not a prefix of any codeword, which would imply ∑_{i=1}^{m} 2^{-ℓ_i} ≤ 1 - 2^{-ℓ(s')} < a. In either case we have a contradiction.

Alternatively, you may arrange the existing codewords into a tree (with branches labelled '0' and '1', each path from the root to a leaf reading as a codeword). Since Kraft's inequality is not tight, there must be some node having only one child; the missing branch then terminates a string s that is not a prefix of any codeword.

Since s is not a prefix of any codeword, and no codeword is a prefix of s (the path to s passes through no leaf of the code tree), no concatenation st, with t ∈ {0,1}^*, is the code of any sequence of symbols either. This yields arbitrarily long such strings.
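The tree argument can also be run mechanically. Below is a minimal Python sketch (the function name `uncovered_string` and the list-of-bit-strings representation are assumptions, not part of the exercise): it descends the code tree toward a subtree whose Kraft sum still has slack, until it reaches a string that neither extends nor is extended by any codeword.

```python
def uncovered_string(codewords):
    """Find a bit string s such that s is not a prefix of any codeword and
    no codeword is a prefix of s. Such a string exists whenever the Kraft
    sum of the (prefix-free) codewords is strictly below 1."""
    s = ""
    while True:
        for b in "01":
            t = s + b
            # t lies in a gap: it neither extends nor is extended by a codeword
            if not any(c.startswith(t) or t.startswith(c) for c in codewords):
                return t
        # otherwise descend into a branch whose subtree still has Kraft slack
        for b in "01":
            t = s + b
            if any(t.startswith(c) for c in codewords):
                continue  # this subtree is completely covered by a codeword
            below = [c for c in codewords if c.startswith(t)]
            if sum(2.0 ** (len(t) - len(c)) for c in below) < 1:
                s = t
                break
```

For example, for the code {0, 10} (Kraft sum 3/4) the sketch returns "11", and every extension of "11" is then undecodable.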

2. Consider a source alphabet X = {x_1, ..., x_m} of size m. Assume we are given probabilities p_1, ..., p_m for the source symbols, so Pr[X = x_i] = p_i. Recall that the Shannon-Fano code works as follows:

(a) Sort the symbols according to decreasing probability, so that we can assume p_1 ≥ p_2 ≥ ... ≥ p_m.

(b) Initialize all codewords w_1, ..., w_m as the empty string.

(c) Split the symbols into two sets, (x_1, ..., x_k) and (x_{k+1}, ..., x_m), so that the total probabilities of the two sets are as equal as possible, i.e., minimize the difference |(p_1 + ... + p_k) - (p_{k+1} + ... + p_m)|.

(d) Add the bit '0' to all codewords in the first set, w_i ↦ w_i 0 for all 1 ≤ i ≤ k, and '1' to all codewords in the second set, w_i ↦ w_i 1 for all k < i ≤ m.

(e) Keep splitting both sets recursively (Step (c)) until each set contains only a single symbol.

Simulate the Shannon-Fano code, either on paper or by computer, for a source with symbols A : 0.2, B : 0.22, C : 0.25, D : 0.15, E : 0.13, F : 0.05, where the numbers indicate the probabilities p_i.

Solution Sorting gives

       p_C = 0.25 > p_B = 0.22 > p_A = 0.2 > p_D = 0.15 > p_E = 0.13 > p_F = 0.05.

   symbol   codeword
   A        10
   B        01
   C        00
   D        110
   E        1110
   F        1111
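The steps (a)-(e) above can be sketched in Python (a sketch under the assumption that the source is given as a dict from symbols to probabilities; `shannon_fano` is a hypothetical name, not course code):

```python
def shannon_fano(probs):
    """Assign Shannon-Fano codewords; probs maps symbol -> probability."""
    # Step (a): sort symbols by decreasing probability.
    symbols = sorted(probs, key=probs.get, reverse=True)
    # Step (b): start from empty codewords.
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(probs[s] for s in group)
        # Step (c): find the split minimizing the difference of set totals.
        best_k, best_diff = 1, float("inf")
        running = 0.0
        for k in range(1, len(group)):
            running += probs[group[k - 1]]
            diff = abs(running - (total - running))
            if diff < best_diff:
                best_k, best_diff = k, diff
        # Step (d): append '0' to the first set, '1' to the second.
        for s in group[:best_k]:
            codes[s] += "0"
        for s in group[best_k:]:
            codes[s] += "1"
        # Step (e): recurse on both sets.
        split(group[:best_k])
        split(group[best_k:])

    split(symbols)
    return codes
```

Running it on the distribution above reproduces the codeword table.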

3. Take a piece of text and estimate the symbol occurrence probabilities from it. Then use them to encode the text using the Shannon-Fano code (on a computer). Compare the code length to the entropy H(X) of the symbol distribution.

Solution There are implementations readily available on the net, so I will not add another one to the lot. If you got one of them running instead of hacking your own, that is fine.

Suppose your text X contains n symbols s from an alphabet Σ of size m, appearing with frequencies f(s); then you should estimate their probabilities as p_s = f(s)/n. The entropy then is

       H(X) = -∑_{s∈Σ} p_s log_2 p_s = log_2 n - (1/n) ∑_{s∈Σ} f(s) log_2 f(s).

For the Shannon codelength we have (symbol-wise)

       ℓ(s) ≤ ⌈-log_2 p_s⌉ < -log_2 p_s + 1,

and therefore you should have achieved (altogether)

       H(X) ≤ (1/n) ℓ(X) < H(X) + 1.

Depending on the length of the encoded text and the number of symbols therein, the actual per-symbol codelength is probably quite close to the entropy.
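The entropy estimate above can be computed directly from the frequencies; a small sketch (assuming the text is available as a Python string, and `empirical_entropy` is a hypothetical name):

```python
import math
from collections import Counter

def empirical_entropy(text):
    """H(X) = log2 n - (1/n) * sum_s f(s) log2 f(s), in bits per symbol."""
    n = len(text)
    freqs = Counter(text)  # f(s) for each symbol s
    return math.log2(n) - sum(f * math.log2(f) for f in freqs.values()) / n
```

For instance, a text with two symbols in equal proportion has entropy 1 bit per symbol.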

4. (Cover & Thomas, Ex. 5.12.) Consider a random variable X that takes on four values, a, b, c, !, with respective probabilities 1/3, 1/3, 1/4, 1/12.
(a) Construct a Huffman code for this random variable.

Solution We first merge the two least probable symbols c and ! into a pseudo-node, which is assigned the sum of the two probabilities, 1/3. We are now left with three nodes of equal probability, out of which we may merge any two. Depending on whether we merge nodes a and b, or we merge either of them with the pseudo-node {c, !}, we get codes with codelengths as given in (b). The corresponding codewords can then be, e.g.,

   symbol   code 1   code 2
   a        1        00
   b        00       01
   c        010      10
   !        011      11

(b) Show that there exist two different sets of optimal (Huffman) lengths for the codewords; namely, show that the codeword length assignments (1, 2, 3, 3) and (2, 2, 2, 2) are both optimal.

Solution As we have seen in (a), both are optimal by construction. Tie-breaking is arbitrary in Huffman coding!
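The tie-breaking can be made concrete with a small Huffman sketch (hypothetical helper, not the textbook's pseudocode); it tracks only codeword lengths, which is all that optimality depends on:

```python
import heapq
from fractions import Fraction
from itertools import count

def huffman_lengths(probs):
    """Codeword lengths of one Huffman code for the given distribution.
    Ties between equal probabilities are broken by insertion order, so a
    different (equally optimal) set of lengths may result for other orders."""
    tie = count()  # tie-breaker so heapq never compares the dicts
    heap = [(p, next(tie), {s: 0}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # merging two subtrees adds one bit to every codeword below them
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]
```

With the probabilities 1/3, 1/3, 1/4, 1/12 given as exact `Fraction`s, this particular tie-breaking merges a with b after {c, !} and yields the (2, 2, 2, 2) assignment; either assignment gives expected length 2 bits.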

(c) Conclude that there are optimal codes with codeword lengths for some symbols that exceed the Shannon codelength ⌈log_2(1/p(x))⌉.

Solution In code 1, symbol c has code length 3 > 2 = ⌈log_2(1/(1/4))⌉ = ⌈log_2 4⌉.

5. Assume that a person x is picked at random according to a distribution p from a finite population X. In order to identify x, you are allowed to ask an oracle a sequence of yes-no questions like "Does x belong to the subset S_i = {x_{i,1}, x_{i,2}, ..., x_{i,n(i)}}?".

(a) What is the optimal strategy if you want to minimize the expected number of questions required? What can you say about the expected number of questions that are required?

(Hint: the sequence of answers can be regarded as the codeword of x.)

Solution Huffman coding! While the Shannon-Fano code is optimal symbol-wise (the Shannon codelength is never exceeded), Huffman coding is optimal for the whole set of symbols to be encoded, which is what we need when we want to minimize the expected codelength. Build the decision tree for the Huffman code. Each node represents a subset S_i as required in the problem setting, so each binary decision at any node can be phrased as a question "Does x belong to the subset represented by the left child?". The expected number of questions is lower-bounded by the entropy H(p) of the distribution and upper-bounded by H(p) + 1.

(b) What if the goal is to minimize the worst-case (maximum) number of questions?

Solution As in the worst case all (non-zero) probabilities are to be treated equally, we may simply replace p by a uniform distribution (possibly after having shrunk X to include only members getting non-zero probability under p). The minimax number of questions then becomes ⌈log_2 |X|⌉.

Bonus problem. Implement the Hamming(7,4) code. The decoder should try to find a sequence of data bits d_1, d_2, d_3, d_4 for which the difference between the received seven-bit sequence and the seven-bit codeword corresponding to the data bits is minimal (at most one bit). For example, in Lecture 2, the (Hamming) distance between the received sequence 1111010 and the codeword 1011010 is one, so the data bits are decoded as 1011. Test your code by encoding the sequence rat.txt. Simulate the binary symmetric channel, p = 0.01, by saying './bsc.py 0.01', then decode the noisy signal. You can create a PBM image file by the script 'txt2pbm.py', and display a PBM file by 'gimp file.pbm'. Compare the results with and without Hamming coding. Your experiment should be something like:

   cat rat.txt | ./bsc.py 0.01 | ./txt2pbm.py > ratnoisy.pbm
   enc < rat.txt | ./bsc.py 0.01 | dec | ./txt2pbm.py > ratham.pbm

where enc and dec are the encoding and decoding commands, respectively. The example file rat.txt and the scripts bsc.py and txt2pbm.py are available at the course web page.

Solution There are sufficiently many implementations available on the internet; Wikipedia explains encoding and decoding in great detail. In a nutshell, we read in four bits (b_1, b_2, b_3, b_4) at a time, which are copied to the seven-bit output as such. The remaining three output bits are the parity bits b_1+b_2+b_4, b_1+b_3+b_4 and b_2+b_3+b_4 (addition modulo 2). Recomputed at the receiver, the parity checks may take on 8 different combinations of values: all zero if no bit error occurred, and an encoding of the position of the flipped bit if exactly one error occurred. Two or more errors cannot be corrected.
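A minimal sketch of such an encoder and decoder (the bit order d_1 d_2 d_3 d_4 followed by the three parity bits matches the 1011 ↦ 1011010 example above; the function names are made up):

```python
def hamming74_encode(d):
    """Encode 4 data bits as d1 d2 d3 d4 p1 p2 p3 (parities as in the text)."""
    d1, d2, d3, d4 = d
    return [d1, d2, d3, d4,
            (d1 + d2 + d4) % 2,
            (d1 + d3 + d4) % 2,
            (d2 + d3 + d4) % 2]

def hamming74_decode(r):
    """Correct at most one flipped bit and return the 4 data bits."""
    r = list(r)
    # Recompute the three parity checks on the received word (the syndrome).
    s1 = (r[0] + r[1] + r[3] + r[4]) % 2
    s2 = (r[0] + r[2] + r[3] + r[5]) % 2
    s3 = (r[1] + r[2] + r[3] + r[6]) % 2
    # Each non-zero syndrome identifies the position of the flipped bit.
    syndrome_to_pos = {
        (1, 1, 0): 0, (1, 0, 1): 1, (0, 1, 1): 2, (1, 1, 1): 3,
        (1, 0, 0): 4, (0, 1, 0): 5, (0, 0, 1): 6,
    }
    pos = syndrome_to_pos.get((s1, s2, s3))
    if pos is not None:
        r[pos] ^= 1
    return r[:4]
```

Decoding the received sequence 1111010 from the example gives syndrome (1, 0, 1), so bit d_2 is flipped back and the data bits 1011 are recovered.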

Therefore, given a channel error rate of p, the probability that no uncorrectable error occurs in a block is

       P(success) = (1-p)^7 + 7p(1-p)^6 = (1+6p)(1-p)^6,

which is a lower bound on the probability of bit-wise success (the bit-wise success probability may be larger, as an erroneous decoding does not necessarily corrupt all four data bits decoded together).


For p = 0.01 this bound becomes P(success) ≈ 0.998, so Hamming encoding before transmission over the noisy channel should make your image at least five times cleaner. Also, the noise now comes in bursts of up to four pixels; see Figure 2.

Interestingly, already for p = 0.1 the bound P(success) ≈ 0.85 is smaller than 1 - p, and for large noise rates the image quality actually decreases with Hamming encoding. Typically, Hamming codes are used when low error rates can be assumed.
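The bound is a one-liner to evaluate (hypothetical helper name, using the factored form from above):

```python
def block_success(p):
    """Probability that a 7-bit block arrives with at most one flipped bit,
    i.e. that Hamming(7,4) decoding succeeds: (1-p)^7 + 7p(1-p)^6."""
    return (1 + 6 * p) * (1 - p) ** 6
```

It gives block_success(0.01) ≈ 0.998 and block_success(0.1) ≈ 0.85, matching the figures quoted above.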

Figure 1: ratnoisy.pbm

Figure 2: ratham.pbm
