
582650 Information-Theoretic Modeling (Autumn 2012) Course examination, Thursday 18 October, 9:00–12:00 CK112 Examiner: Jyrki Kivinen

Answer all the problems. The maximum score for the exam is 36 points. You may answer in Finnish, Swedish or English.

1. [12 points]

(a) Let A ∈ {1,2,3,4} and B ∈ {1,2} be two random variables with the following joint distribution:

        A = 1   A = 2   A = 3   A = 4
B = 1    1/16    1/8     1/4    1/16
B = 2    1/4     1/8     1/16   1/16

Calculate the entropies H(A) and H(B), the joint entropy H(A, B) and the conditional entropy H(B | A).

(b) Assume X and Y are two random variables about which we only know that H(X) = 3 and H(Y) = 2. Under this assumption, what are the largest and smallest possible values for the mutual information I(X;Y)? Justify your claims, and show by example that your upper and lower bounds can actually be achieved.

(c) Define the Kullback-Leibler divergence (relative entropy) between two discrete distributions. Show that it is always non-negative.

In parts (a) and (b) you may use any properties of entropy etc. that you know from the course, but please state explicitly which properties you are using. In the proof for part (c), you may use only generally known mathematical results, including Jensen’s inequality.

Each subtask was worth 4 points.

For (a) you got one point for each correctly calculated entropy, and half a point for the correct approach with some error in calculation. The correct values are H(A) = (27/8) log 2 − (5/8) log 5 (or 27/8 − (5/8) log_2 5 bits), H(B) = log 2 (or 1 bit), H(A, B) = (11/4) log 2 (or 11/4 bits), and H(B | A) = H(A, B) − H(A) = (5/8) log 5 − (5/8) log 2 (or (5/8) log_2 5 − 5/8 bits).
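As a quick numerical check (this sketch is an addition for illustration, not part of the original answer key), the entropies can be computed directly from the joint table in Python:

```python
from math import log2

# Joint distribution of (A, B) from problem 1(a); keys are (a, b) pairs.
joint = {
    (1, 1): 1/16, (2, 1): 1/8, (3, 1): 1/4,  (4, 1): 1/16,
    (1, 2): 1/4,  (2, 2): 1/8, (3, 2): 1/16, (4, 2): 1/16,
}

def H(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Marginal distributions of A and B.
pA = {a: sum(p for (x, _), p in joint.items() if x == a) for a in range(1, 5)}
pB = {b: sum(p for (_, y), p in joint.items() if y == b) for b in (1, 2)}

print(H(pA))             # 27/8 - (5/8) log_2 5  ~ 1.924 bits
print(H(pB))             # 1 bit
print(H(joint))          # 11/4 = 2.75 bits
print(H(joint) - H(pA))  # H(B | A) = (5/8) log_2 5 - 5/8  ~ 0.826 bits
```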

For (b) you got 2 points for the correct bounds 0 ≤ I(X;Y) ≤ 2 and another two points for correct justification and examples. If, for example, X consists of 3 independent bits of maximum entropy (fair coin throws) and Y of two such bits, then I(X;Y) = 0 is achieved when all these bits are distinct and independent, and I(X;Y) = 2 when Y's bits are a subset of X's. Alternatively, it was enough to state that the minimum is achieved when X and Y are independent, and the maximum when the value of X completely determines the value of Y.
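Both extreme cases can also be checked numerically. The following Python sketch (again an added illustration, not part of the answer key) builds the two joint distributions described above and evaluates I(X;Y) = H(X) + H(Y) − H(X,Y):

```python
from itertools import product
from math import log2

def H(dist):
    """Shannon entropy in bits of {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint distribution {(x, y): p}."""
    pX, pY = {}, {}
    for (x, y), p in joint.items():
        pX[x] = pX.get(x, 0) + p
        pY[y] = pY.get(y, 0) + p
    return H(pX) + H(pY) - H(joint)

# X = three fair coin flips (H(X) = 3), Y = two fair coin flips (H(Y) = 2).
# Case 1: Y is independent of X  ->  I(X;Y) = 0.
indep = {(x, y): 1 / 32
         for x in product((0, 1), repeat=3)
         for y in product((0, 1), repeat=2)}
# Case 2: Y consists of the first two bits of X  ->  I(X;Y) = H(Y) = 2.
subset = {(x, x[:2]): 1 / 8 for x in product((0, 1), repeat=3)}

print(mutual_information(indep))   # 0.0
print(mutual_information(subset))  # 2.0
```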

For (c) we have D_KL(p || q) = Σ_{x ∈ X} p(x) log (p(x)/q(x)), which can be shown to be non-negative using Jensen's inequality together with the convexity of the minus-logarithm.
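Written out, the Jensen step goes as follows (a standard derivation added here for completeness; the sums run over those x with p(x) > 0):

```latex
-D_{\mathrm{KL}}(p \,\|\, q)
  = \sum_{x} p(x) \log \frac{q(x)}{p(x)}
  \le \log \sum_{x} p(x) \frac{q(x)}{p(x)}
  = \log \sum_{x} q(x)
  \le \log 1 = 0,
```

where the first inequality is Jensen's inequality for the concave logarithm (equivalently, the convexity of −log). Hence D_KL(p || q) ≥ 0, with equality if and only if p = q.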

Alternatively (and more simply), you could use Gibbs' inequality (as it was not explicitly forbidden to do so).

2. [12 points]

(a) What is meant by a decodable code, and a prefix code? What does the Kraft(-McMillan) Theorem tell us about such codes? Give both a precise mathematical formulation and a very brief intuitive explanation of how these concepts and results can be useful.

(b) Explain the basic idea of the Huffman coding algorithm. What are its input and output, and what guarantees do we have for the quality of the output?

What result does the algorithm give for a five-symbol alphabet {a, b, c, d, e} when P(a) = 0.05, P(b) = 0.35, P(c) = 0.15, P(d) = 0.25 and P(e) = 0.2?

You don’t need to give a step-by-step simulation, just the end result and perhaps a brief summary of some key points to help understand it.

Each subtask was worth 6 points.

For (a) you got 1.5 points for each of the following

• (uniquely) decodable code: an injective (one-to-one) function; each sequence of symbols is assigned a unique encoding, so from any (legal) encoding we can uniquely recover the original message.

• prefix(-free) code: no codeword is a prefix of another, therefore we can decode the encoding of any data sequence in one pass; this is because at each point we know whether a symbol's codeword terminates or continues. Prefix codes are automatically decodable.

• mathematical formulation: for any decodable code (Kraft) with corresponding symbol codeword lengths ℓ_i we have

Σ_i 2^(−ℓ_i) ≤ 1.

Conversely, for any set of integers ℓ_i satisfying the above, there exists a decodable code with these codelengths. Kraft-McMillan states that the same holds when we restrict to prefix codes. (A small sketch illustrating both directions follows this list.)

• intuitive explanation: Kraft(-McMillan) couples probabilities (p_i ≈ 2^(−ℓ_i)) to codelengths of decodable (prefix) codes. Any code can be interpreted as a probability (sub-)distribution, and conversely any probability distribution can be used to define a code that optimally (wrt. expected codelength) compresses data following that distribution. Furthermore, any slack in the inequality can be exploited to make the corresponding code shorter.
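To make the two directions concrete, here is a small Python sketch (an added illustration, not part of the answer key): it computes the Kraft sum of a set of codeword lengths and, when the sum is at most 1, builds a prefix code with exactly those lengths via the standard canonical-code construction.

```python
def kraft_sum(lengths):
    """Kraft sum of a set of codeword lengths: sum_i 2^(-l_i)."""
    return sum(2.0 ** -l for l in lengths)

def canonical_prefix_code(lengths):
    """Given integer lengths whose Kraft sum is at most 1, build a prefix code
    with exactly these lengths (the 'conversely' direction of the theorem)."""
    assert kraft_sum(lengths) <= 1 + 1e-12, "lengths violate the Kraft inequality"
    code, next_value, prev_len = [], 0, 0
    for l in sorted(lengths):
        next_value <<= (l - prev_len)        # extend the previous codeword boundary
        code.append(f"{next_value:0{l}b}")   # write the value as an l-bit string
        next_value += 1
        prev_len = l
    return code

# Example: the codeword lengths of the Huffman code in problem 2(b).
lengths = [2, 2, 2, 3, 3]
print(kraft_sum(lengths))              # 1.0 -> the inequality is tight (complete code)
print(canonical_prefix_code(lengths))  # ['00', '01', '10', '110', '111']
```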

For (b) you got 2 points for each of the following

• The correct result (e.g. a → 111, b → 00, c → 110, d → 01, e → 10)

• the basic idea: define a node for each symbol with weight equal to its associated probability, then iteratively join the two lightest nodes into a new node whose weight equals the sum of the two weights, until only one node is left. The resulting tree gives the codewords when we interpret all 'left edges' to read 0 and all 'right edges' to read 1. This is by construction a complete (tight Kraft inequality) prefix code. (A sketch of this construction follows this list.)

• quality guarantee: the Huffman code yields the shortest expected codelength wrt. the given distribution over the source alphabet X, and this expectation is upper-bounded by H(X) + 1 (trivially, it is also lower-bounded by H(X)).
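For completeness, a minimal Huffman sketch in Python (an added illustration; the tie-breaking and the 0/1 assignment may differ from the example codewords above, but the codeword lengths, and hence the expected length of 2.2 bits, are the same):

```python
import heapq

# Source distribution from problem 2(b).
probs = {'a': 0.05, 'b': 0.35, 'c': 0.15, 'd': 0.25, 'e': 0.2}

# Heap entries are (weight, tie-breaker, subtree); a subtree is either a
# symbol (leaf) or a pair (left, right) of subtrees.
heap = [(p, i, sym) for i, (sym, p) in enumerate(sorted(probs.items()))]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    w1, _, t1 = heapq.heappop(heap)  # lightest node
    w2, _, t2 = heapq.heappop(heap)  # second lightest node
    heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
    counter += 1

def codewords(tree, prefix=''):
    """Read codewords off the tree: left edge = 0, right edge = 1."""
    if isinstance(tree, str):
        return {tree: prefix or '0'}
    left, right = tree
    return {**codewords(left, prefix + '0'), **codewords(right, prefix + '1')}

code = codewords(heap[0][2])
print(code)  # lengths: b, d, e -> 2 bits; a, c -> 3 bits
print(sum(probs[s] * len(code[s]) for s in probs))  # expected length 2.2 bits
```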

3. [12 points] Two-part coding and model class selection. Explain how the minimum description length (MDL) principle with a two-part code can be applied to the model class selection problem. Your answer should include the following:

• what is the model class selection problem, including an example

• mathematical definition of the two-part code in the discretely parameterized case, and what is the standard way of dealing with a continuous parameter space

• what does the MDL principle mean in this context, and why using it might make sense

• in practice, what modelling and computational steps would be included in the process.

Each bullet was worth 3 points.

• The model class selection problem is the task of finding, given some data D, the model class (i.e. set of distributions sharing some parametric form) that is most suitable for that data among a family (set) of such classes.

Suitability is defined wrt. the task at hand, which can be compression, prediction, denoising and so on. From the exercises most of you remembered the curve fitting example.

• Two-part encoding means that, given a model class M, we encode the model parameters θ ∈ Θ (i.e., choose a distribution from that class) separately, and subsequently encode the data using the specified distribution. Having chosen the best θ, the total codelength becomes

ℓ_2p(D | M) = ℓ(θ) + ℓ(D | M(θ)) = ℓ(θ) − log p^M_θ(D).

When Θ is discrete, we often set ℓ(θ) = log |Θ| (i.e., use a uniform distribution), but other ways of encoding are allowed as well. If the parameter space is continuous, we need to discretize it and use the most suitable θ in the discretized space as an approximation to the 'true' value. It can be shown that for optimal compression the density of this quantization should be proportional to the square root of the data size.

• The MDL principle means that data compression is equivalent to learning, as anything that helps us compress the data is also a feature of it, and conversely any regularity we find in the data can be used to encode it in a more compact way. In this context this means minimizing the total codelength ℓ_2p(D | M), first over θ and then over M. Using Kraft's theorem to look at the definition of the two-part codelength probabilistically, we can interpret the two-to-the-minus of the parameter codelength, 2^(−ℓ(θ)), as a prior distribution, and the second part becomes the data probability. In model class selection, the prior 2^(−ℓ(θ)) serves as a complexity-penalizing term that prevents us from overfitting the data (i.e. modeling noise instead of data features). The resulting model trades off data fit against generalization capacity in a sensible way and should therefore be good, e.g. for prediction of future data from the same source.

• In practice, in order to perform the above, we need to define a set of promising model classes for D (ones that are likely to compress well), for each of them compute the MDL-optimal set of parameters θ, and then compare the resulting two-part codelengths against each other in order to find the MDL-best model class.
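As a toy illustration of these steps (my own sketch with a made-up model family, not the curve-fitting example from the exercises): take a binary sequence, let model class M_k split it into k equal blocks, each modelled as i.i.d. Bernoulli(θ_i), quantize every θ_i to a grid of about √n values, and pick the k with the smallest two-part codelength.

```python
import math
import random

def two_part_codelength(data, k):
    """Two-part codelength (in bits) for the toy model class M_k: the binary
    sequence is cut into k equal blocks, each modelled as i.i.d. Bernoulli(theta_i),
    with every theta_i quantized to a grid of ~sqrt(n) values and encoded with a
    uniform code over that grid."""
    n = len(data)
    m = round(math.sqrt(n))                   # quantization density ~ sqrt(data size)
    grid = [(j + 0.5) / m for j in range(m)]  # candidate parameter values in (0, 1)
    param_bits = k * math.log2(m)             # first part: encode the k parameters
    data_bits = 0.0                           # second part: encode the data given them
    block = n // k
    for i in range(k):
        chunk = data[i * block:] if i == k - 1 else data[i * block:(i + 1) * block]
        ones = sum(chunk)
        theta = min(grid, key=lambda t: abs(t - ones / len(chunk)))  # best quantized theta
        data_bits -= ones * math.log2(theta) + (len(chunk) - ones) * math.log2(1 - theta)
    return param_bits + data_bits

# Artificial data with two regimes: mostly zeros, then mostly ones.
random.seed(0)
data = [int(random.random() < 0.1) for _ in range(500)] + \
       [int(random.random() < 0.9) for _ in range(500)]

# MDL picks the class with the smallest total codelength; here k = 2 should win,
# since larger k pays more for its parameters than it gains in data fit.
for k in (1, 2, 5, 20):
    print(k, round(two_part_codelength(data, k), 1))
```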
