• Ei tuloksia

582650 Information-Theoretic Modeling (Autumn 2012) Homework 3/Solutions


Academic year: 2022

Jaa "582650 Information-Theoretic Modeling (Autumn 2012) Homework 3/Solutions"




582650 Information-Theoretic Modeling (Autumn 2012) Homework 3/Solutions

1. (Cover & Thomas, Ex5.3.) Consider a prexix codeC:X → {0,1}where|X |=m, and assume the code lengths`1, . . . , `msatisfy




2−`i <1

(i.e., the Kraft Inequality is not tight.) Show that there are arbitrarily long bit strings that are not codes for any sequence of symbols. That is, there is an infinite number of bit strings s∈ {0,1}such that for allx1. . . xn ∈ Xwe haveC(x1. . . xn)6=s, whereCis the extension ofC. (Hint: find a stringsthat is not a prefix of any codeword.)

Solution Let




2−`i =a <1

Then there must exist a strings∈ {0,1}of length`(s)≥ −log2(1−a), which is not a prefix to any codeword inC(X). Otherwise the code would either be complete, that is Kraft’s inequality would be tight, or there needed to be a strings0∈ {0,1}of length`(s0)<−log2(1−a), which would implyPm

i=12−`i ≤1−2−`(s0)< a. In either case we have a contradiction.

Alternatively you may arrange the existing codewords into a tree (with branches labelled ’0’

and ’1’, each path from the root to a leaf reading as a codeword). Since Kraft’s inequality is not tight, there must be some node having only one child, the missing branch then terminates a stringsthat is no prefix to any codeword.

Sincesis not a prefix to any codeword, any concatenationst, witht∈ {0,1}, is not a codeword either.

2. Consider a source alphabet X = {x1, . . . , xm} of size m. Assume we are given propabilities p1, . . . , pm for the source symbols, so Pr[X = xi] = pi. Recall that the Shannon-Fano code works as follows:

(a) Sort the symbols according to decreasing probability so that we can assume p1 ≥ p2 ≥ . . .≥pm.

(b) Initialize all codewordsw1, . . . , wmas the empty string.

(c) Split the symbols in two sets,(x1, . . . , xk)and(xk+1, . . . , xm), so that the total probabilities of the two sets are as equal as possible, i.e., minimize the difference|(p1+. . .+pk)−(pk+1+ . . .+pm)|.

(d) Add the bit ‘0’ to all codewords in the first set, wi 7→wi0, for all 1≤i≤k, and ‘1’ to all codewords in the second set, wi 7→wi1 for allk < i≤m.

(e) Keep splitting both sets recursively (Step (c)) until each set contains only a single symbol.

Simulate the Shannon-Fano code, either on paper or by computer, for a source with symbols A : 0.2, B : 0.22, C : 0.25, D : 0.15, E : 0.13, F : 0.05, where the numbers indicate the probabilitiespi.


pC = 0.25> pB = 0.22> pA= 0.2> pD= 0.15> pE= 0.13> pF = 0.05


symbol codeword

A 10

B 01

C 00

D 110

E 1110

F 1111

3. Take a piece of text, estimate the symbol occurrence probabilities from it. Then use them to encode the text using the Shannon-Fano code (on a computer). Compare the code-length to the entropy of the symbol distributionH(X).

Solution There are implementations readily available on the net, so I will not add another one to the lot. If you got any of them running, instead of hacking your own, then that is fine.

Suppose your textX containsn symbols s from an alphabetΣ of size m appearing with fre- quencies f(s), then you should estimate their probabilities as ps = f(s)n . The entropy then is

H(X) =−X


pslog2ps= log2n−1 n



f(s) log2f(s)

For the Shannon codelength we have (symbol-wise)

`(s)≤ d−log2pse<−log2ps+ 1

and therefore you should have achieved (altogether) H(X)≤ 1

n`(X)< H(X) + 1

Depending on the length of the encoded text and the number of symbols therein, the actual per-symbol codelength is probably quite close to the entropy.

4. (Cover & Thomas, Ex5.12.) Consider a random variableX that takes on four values, a, b, c,!, with respective probabilities 13,13,14,121


(a) Construct a Huffman code for this random variable.

Solution We first merge the two least probable symbolscand!into a pseudo-node, which is assigned the sum of the two probabilities 13. We are now left with three nodes of equal probability, out of which we may merge any two. Depending on whether we merge nodesa andbor we merge either of them with the pseudo-node{c,!}, we get codes with codelengths as given in (b). The corresponding codewords can then be, e.g.,

symbol code 1 code 2

a 1 00

b 00 01

c 010 10

! 011 11

(b) Show that there exist two different sets of optimal (Huffman) lengths for the codewords;

namely, show that codeword length assignments(1,2,3,3)and(2,2,2,2)are both optimal.

Solution As we have seen in (a), both are optimal by construction. Tie breaking is arbitrary in Huffman coding!

(c) Conclude that there are optimal codes with code-word lengths for some symbols that exceed the Shannon code-length l

log2p(x)1 m .



Solution In code 1 symbolc has code length3>2 =l log2 11


m .

5. Assume that a personxis picked at random according to distributionpfrom a finite population X. In order to identify x, you are allowed to ask an oracle a sequence of any yes—no questions like “Doesxbelong to the subsetSi={xi,1, xi,2, . . . , xi,n(i)}?”.

(a) What is the optimal strategy if you want to minimize the expected number of questions required? What can you say about the expected number of questions that are required?

(Hint: the sequence of answers can be regarded as the codeword ofx.)

Solution Huffman Coding! While Shannon-Fano is optimal symbol-wise (the Shannon codelength can never be exceeded), Huffman is optimal for the whole set of symbols to be encoded, which we need when we want to minimize the expected codelength. Build the decision tree for the Huffman code. Each node represents a subset Si as required in the problem setting, therefore each binary decision at any can be set as a question “Does x belong to the subset represented by the left child?”. The expected numberof questions is lowerbounded by the entropyH(p)of the distribution and upperbounded byH(p) + 1.

(b) What if the goal is to minimize theworst-case (maximum) number of questions?

Solution As for the worst case all (non-zero) probabilities are to be treated equally, we may simply replacepby a uniform distribution (possibly after having shrunkX to include only members getting non-zero probability under p). The minimax number of questions becomes dlog2|X |e.

Bonus problem. Implement the Hamming(7,4)code. The decoder should try to find a sequence of data bitsd1, d2, d3, d4 for which the difference between the received seven-bit sequence and the seven-bit codeword corresponding to the data bits is minimal (at most one bit). For example, in Lecture 2, the (Hamming) distance between the received sequence1111010and the codeword 1011010 is one, so that the data bits is decoded as 1011. Test your code by encoding the sequencerat.txt. Simulate the binary symmetric channel,p= 0.01, by saying ‘bsc.py 0.01’, then decode the noisy signal. You can create a PBM image file by the script ‘txt2pbm.py’. You can display a PBM file by ‘gimp file.pbm’. Compare the result with and without Hamming coding. Your experiment should be something like:

cat rat.txt | ./bsc.py 0.01 | ./txt2pbm.py >ratnoisy.pbm enc <rat.txt | ./bsc.py 0.01 | dec | ./txt2pbm.py >ratham.pbm

where encand dec are the encoding and decoding commands, respectively. The example file rat.txtand the scriptsbsc.pyandtxt2pbm.pyare available at the course web page.

Solution There are sufficiently many implementations available on the internet. Wikipedia explains encoding and decoding in great detail. In a nutshell, we read in four bits (b1, b2, b3, b4) at a time, which are copied to the seven bit output as such. The remaining three output bits are the parity bits b1+b2+b4, b1+b3+b4 and b2+b3+b4 (addition modulo 2). These may then take on8 different combinations of values, all zero if no bit error occurred and an encoding of the flipped bit if exactly one error occurred. Two or more errors cannot be corrected.

Therefore, given a channel error rate ofp, the probability that no incorrectable error occurs is P(success) = (1−p)7+ 7p(1−p) = (1 + 6p)(1−p)6,

which is a lower bound to the probability of bit-wise success (which may be smaller, as erroneous decoding does not necessarily affect all four bits to be decoded together).



Forp= 0.01this bound becomesP(success)≈0.998, so Hamming encoding before transmission over the noisy channel should make your image at least five times cleaner. Also, the noise now comes in bursts of up to four pixels, see Figure 2.

Interestingly, already forp= 0.1the boundP(success)≈0.85is smaller than1−p, and for large noise rates the image quality actually decreases when Hamming encoding. Typically, Hamming codes are used when low error rates can be assumed.

Figure 1: ratnoisy.pbm

Figure 2: ratham.pbm




Find out what compression algorithms are used in the gadgets you use—DVD player, digital TV, CD, iPod, cell-phone, laptop (WinZip), etc.. You can simply list the names of

Solution First observe that the minimum number of weightings must be at least dlog 3 24e = 3, since there are 24 possible outputs (12 balls, each can be either light or heavy) and

The noiseless data is also a polynomial of degree 6, and the estimate based on the noisy data gets reasonably close, much closer than the observed noisy data points.. In this sense

• When you use a modifying word like ”set”, vector”, ”model” etc. before the

What is notable is the similar form of the curves: removing some individuals does not seem to have changed the optimal number of clusters... Figure 4.10: Adjusted mutual

assessment do not uncover what students actually want to say about sustainability, what they want to learn, or the impact of their learning.. In an effort to meet this

He wants to write a number in each of the remaining 5 circles such that the sums of the 3 numbers along each side of the pentagon are equal.. If you add the digits of the

If the original pixel values are of UINT-type, what kind of problems may arise with filtering.. Present how you calculated you answer, not just the