
Information-Theoretic Modeling

Lecture 10: MDL Principle — Part II

Jyrki Kivinen

Department of Computer Science, University of Helsinki

Autumn 2012

Outline

1 MDL for Gaussian Models
    Encoding Continuous Data
    Differential Entropy
    Linear Regression
    Subset Selection Problem
    Wavelet Denoising

2 MDL for Multinomial Models
    Universal Codes
    Fast NML Computation
    Histogram Density Estimation
    Clustering

Gaussian models

Density function:

$$\phi_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)},$$

and for an i.i.d. sample $x_1, \dots, x_n$,

$$\phi_{\mu,\sigma^2}(x_1, \dots, x_n) \overset{\text{i.i.d.}}{=} \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x_i-\mu)^2/(2\sigma^2)} = \left(2\pi\sigma^2\right)^{-n/2} e^{-\sum_{i=1}^n (x_i-\mu)^2/(2\sigma^2)}.$$

Mean: $\mu = E[X]$; variance: $\sigma^2 = E[(X-\mu)^2]$.

Maximum likelihood:

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2.$$

How to Encode Continuous Data?

To encode data using, say, the Gaussian density, we face the problem of how to encode continuous data.

We already know how to encode using models with continuous parameters:

    two-part codes with optimal quantization ($\approx \frac{k}{2}\log_2 n$ extra bits),
    mixture codes,
    NML.

It is obviously not possible to encode data with infinite precision. We have to discretize: encode $x$ only up to precision $\delta$.

Differential Entropy

What is the optimal rate for encoding (compressing) continuous data (up to precision $\delta$)?

The answer again involves an entropy. However, it is not the familiar kind of entropy, but the following.

Differential entropy
Let $X \in \mathbb{R}$ be a continuous random variable with probability density $f\colon \mathbb{R} \to \mathbb{R}^+$. The differential entropy of $X$ is defined as

$$h(X) = E_{X \sim f}\!\left[\log_2 \frac{1}{f(X)}\right] = \int f(x) \log_2 \frac{1}{f(x)}\, dx.$$

Differential Entropy

If $\delta > 0$ is small, the probability that $X \in [(t-\frac{1}{2})\delta, (t+\frac{1}{2})\delta]$ is well approximated by $f(t\delta)\delta$.

Hence, the minimum coding rate of the discretized random variable $X_\delta$ is given by

$$H(X_\delta) \approx \sum_{x = t\delta:\; t \in \mathbb{Z}} f(x)\,\delta \log_2 \frac{1}{f(x)\delta} \;\xrightarrow{\;\delta \to 0\;}\; \int_{-\infty}^{+\infty} f(x) \log_2 \frac{1}{f(x)}\, dx \;-\; \log_2 \delta.$$

Hence, the rate is approximately $H(X_\delta) \approx h(X) - \log_2 \delta$.
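
As a sanity check of $H(X_\delta) \approx h(X) - \log_2 \delta$, here is a small numerical sketch (not part of the original lecture; the function names and the truncation of the grid at $\pm 10\sigma$ are my choices). It discretizes a Gaussian with step $\delta$ and compares the resulting discrete entropy against the known closed form $h(X) = \frac{1}{2}\log_2(2\pi e \sigma^2)$ for a Gaussian:

```python
import numpy as np

def gaussian_diff_entropy_bits(sigma2):
    # Closed form for X ~ N(0, sigma2): h(X) = 0.5 * log2(2*pi*e*sigma2).
    return 0.5 * np.log2(2 * np.pi * np.e * sigma2)

def discretized_entropy_bits(sigma2, delta):
    # H(X_delta) on a grid of width delta; cell probability ~ f(t*delta)*delta.
    sigma = np.sqrt(sigma2)
    t = np.arange(-int(10 * sigma / delta), int(10 * sigma / delta) + 1)
    f = np.exp(-(t * delta) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    p = f * delta
    p = p / p.sum()  # renormalize over the truncated grid
    return float(-(p * np.log2(p)).sum())

sigma2, delta = 1.0, 0.01
print(discretized_entropy_bits(sigma2, delta))               # ~8.69 bits
print(gaussian_diff_entropy_bits(sigma2) - np.log2(delta))   # ~8.69 bits
```

With $\sigma^2 = 1$ and $\delta = 0.01$ both values come out near $8.69$ bits, as the approximation predicts.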

Differential Entropy

The minimum coding rate $h(X) - \log_2 \delta$ is achieved if and only if the code-word lengths are chosen according to

$$\ell(x) = \log_2 \frac{1}{f(x)\delta} = \log_2 \frac{1}{f(x)} + \log_2 \frac{1}{\delta}.$$

The term $\log_2(1/\delta)$ depends only on the precision we chose and is the same for all models. Therefore, we can ignore it for the purpose of comparing models.

Back to Gaussians

Recall the Gaussian density function:

$$\phi_{\mu,\sigma^2}(x_1, \dots, x_n) \overset{\text{i.i.d.}}{=} \left(2\pi\sigma^2\right)^{-n/2} e^{-\sum_{i=1}^n (x_i-\mu)^2/(2\sigma^2)}.$$

The code-length is then

$$-\log_2 \phi_{\mu,\sigma^2}(x_1, \dots, x_n) = \frac{n}{2}\log_2(2\pi\sigma^2) + \frac{1}{(2\ln 2)\,\sigma^2}\sum_{i=1}^n (x_i-\mu)^2.$$

Let's use the two-part code and plug in the maximum likelihood parameters

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2.$$

Since $\sum_{i=1}^n (x_i - \hat\mu)^2 = n\hat\sigma^2$, the second term equals $\frac{n}{2\ln 2}$, which does not depend on the data, and the code-length formula simplifies to

$$\frac{n}{2}\log_2 \hat\sigma^2 + \text{constant}.$$

We get the total (two-part) code-length formula:

$$\frac{n}{2}\log_2 \hat\sigma^2 + \frac{k}{2}\log_2 n + \text{constant}.$$

Since we have two parameters, $\mu$ and $\sigma^2$, we let $k = 2$.

Notice that depending on what exactly you are doing, you may or may not care about the constant.
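
As a small illustration of how this formula could be evaluated on a data vector, here is a sketch (the function name is mine, and the additive constant is simply dropped):

```python
import numpy as np

def gaussian_two_part_bits(x, k=2):
    # n/2 * log2(sigma_hat^2) + k/2 * log2(n), dropping the additive constant.
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu_hat = x.mean()                        # ML estimate of the mean
    sigma2_hat = np.mean((x - mu_hat) ** 2)  # ML variance: divide by n, not n-1
    return n / 2 * np.log2(sigma2_hat) + k / 2 * np.log2(n)

rng = np.random.default_rng(0)
print(gaussian_two_part_bits(rng.normal(5.0, 2.0, size=1000)))
```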

Linear Regression

A similar treatment can be given to linear regression models.

The model includes a set of regressor variables $x_1, \dots, x_p \in \mathbb{R}$, and a set of coefficients $\beta_1, \dots, \beta_p$.

The dependent variable, $Y$, is assumed to be Gaussian:

    the mean $\mu$ is given as a linear combination of the regressors: $\mu = \beta_1 x_1 + \cdots + \beta_p x_p = \beta^T x$,
    the variance is some parameter $\sigma^2$.

For a sample of size $n$, the matrix notation is convenient:

$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$

Then the model can be written as $Y = X\beta + \epsilon$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

The maximum likelihood estimators are now

$$\hat\beta = (X^T X)^{-1} X^T Y, \qquad \hat\sigma^2 = \frac{1}{n}\,\|Y - X\hat\beta\|_2^2 = \frac{\mathrm{RSS}}{n},$$

where RSS is the "residual sum of squares".

Since the errors are assumed Gaussian, our code-length formula applies. The number of parameters is now $p + 1$ ($p$ of the $\beta$s and $\sigma^2$), and since $\frac{n}{2}\log_2 \hat\sigma^2 = \frac{n}{2}\log_2 \mathrm{RSS} - \frac{n}{2}\log_2 n$ differs from $\frac{n}{2}\log_2 \mathrm{RSS}$ only by a term that is the same for all models, we get

$$\frac{n}{2}\log_2 \mathrm{RSS} + \frac{p+1}{2}\log_2 n + \text{constant}.$$
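
The same quantities in code, as a sketch (the function name is mine; numpy's least-squares solver computes the same $\hat\beta$ as the normal equations, but more stably):

```python
import numpy as np

def regression_code_length_bits(X, y):
    # n/2 * log2(RSS) + (p+1)/2 * log2(n), up to the additive constant.
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # ML estimate of beta
    rss = float(np.sum((y - X @ beta_hat) ** 2))      # residual sum of squares
    return n / 2 * np.log2(rss) + (p + 1) / 2 * np.log2(n)
```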

Subset Selection Problem

Often we have a large set of potential regressors, some of which may be irrelevant.

The MDL principle can be used to select a subset of them by comparing the total code-lengths:

$$\min_S \left\{ \frac{n}{2}\log_2 \mathrm{RSS}_S + \frac{|S|+1}{2}\log_2 n \right\},$$

where $\mathrm{RSS}_S$ is the RSS obtained by using subset $S$ of the regressors.

⇒ Exercise
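
A brute-force sketch of this selection (all names are mine; the search is exponential in $p$, so this is for illustration only):

```python
import numpy as np
from itertools import combinations

def mdl_subset_selection(X, y):
    # Minimize n/2*log2(RSS_S) + (|S|+1)/2*log2(n) over nonempty subsets S.
    n, p = X.shape
    best_score, best_S = np.inf, ()
    for size in range(1, p + 1):
        for S in combinations(range(p), size):
            Xs = X[:, list(S)]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = float(np.sum((y - Xs @ beta) ** 2))
            score = n / 2 * np.log2(rss) + (len(S) + 1) / 2 * np.log2(n)
            if score < best_score:
                best_score, best_S = score, S
    return best_S, best_score
```

For larger $p$, exhaustive enumeration is infeasible and one would fall back on greedy or stepwise search.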

Wavelet Denoising

One particularly useful way to obtain the regressor (design) matrix is to use wavelets.

[Figure: wavelet basis; image by Gabriel Peyré]

Main effort in constructing a universal code:

1 combines two-part, mixture, and NML universal codes,
2 bounds on NML normalization region required,
3 important lesson: remember to encode model class.

MDL for Multinomial Models

Multinomial Models

The multinomial model — the generalization of Bernoulli — is very simple:

$$p(x = j) = \theta_j, \quad \text{for } j \in \{1, \dots, m\}.$$

Maximum likelihood:

$$\hat\theta_j = \frac{\#\{i : x_i = j\}}{n}.$$

Two-part, mixture, and NML models are readily defined.

⇒ Exercises 5.1 & 5.2
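
A minimal sketch of the ML fit and the corresponding code length $-\log_2 p_{\hat\theta}(x^n)$, the data part of a two-part code (the symbol set $\{1, \dots, m\}$ and the function name are my conventions):

```python
import math
from collections import Counter

def multinomial_ml_bits(xs, m):
    # theta_hat_j = #{i : x_i = j} / n; code length -sum_j c_j * log2(c_j / n).
    n = len(xs)
    counts = Counter(xs)
    theta_hat = [counts.get(j, 0) / n for j in range(1, m + 1)]
    bits = -sum(c * math.log2(c / n) for c in counts.values())
    return theta_hat, bits

theta, bits = multinomial_ml_bits([1, 1, 2, 3, 3, 3], m=3)
print(theta, bits)  # empirical frequencies, and n times their entropy
```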

Fast NML for Multinomials

The naïve way to compute the normalizing constant in the NML model

$$p_{\mathrm{NML}}(x^n) = \frac{p_{\hat\theta(x^n)}(x^n)}{\mathcal{C}^m_n}, \qquad \mathcal{C}^m_n = \sum_{y^n \in \mathcal{X}^n} p_{\hat\theta(y^n)}(y^n),$$

takes exponential time ($\Omega(m^n)$).

The second most naïve way takes "only" polynomial time, $O(n^{m-1})$, but is still intractable unless $m \le 3$ (or maybe $m \le 4$).

There is a way — which is not naïve at all! — to do it in linear time, $O(n + m)$, using the following recursion:

$$\mathcal{C}^m_n = \mathcal{C}^{m-1}_n + \frac{n}{m-2}\,\mathcal{C}^{m-2}_n,$$

where $\mathcal{C}^m_n$ is the normalizing constant for an $m$-ary multinomial and sample size $n$.

The trick is to reduce the general case to $\mathcal{C}^1_n = 1$ and $\mathcal{C}^2_n$, the latter of which can be computed in linear time (using the second most naïve approach).

Kontkanen & Myllymäki, "A linear-time algorithm for computing the multinomial stochastic complexity", Information Processing Letters 103 (2007), 6, pp. 227–233.
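
A sketch of the recursion in code (plain floating point is my simplification; the Bernoulli base case $\mathcal{C}^2_n = \sum_{k=0}^n \binom{n}{k} (k/n)^k ((n-k)/n)^{n-k}$ is computed directly in $O(n)$):

```python
from math import comb

def multinomial_nml_constant(m, n):
    # C^m_n via the recursion C^m_n = C^{m-1}_n + (n/(m-2)) * C^{m-2}_n.
    if n == 0:
        return 1.0
    c_prev = 1.0  # C^1_n = 1
    c_curr = sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                 for k in range(n + 1))  # C^2_n; note 0**0 == 1 in Python
    if m == 1:
        return c_prev
    for j in range(3, m + 1):  # climb C^3_n, C^4_n, ..., C^m_n
        c_prev, c_curr = c_curr, c_curr + (n / (j - 2)) * c_prev
    return c_curr
```

The stochastic complexity of $x^n$ is then $-\log_2 p_{\hat\theta(x^n)}(x^n) + \log_2 \mathcal{C}^m_n$.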

Histogram Density Estimation

For a histogram density, we again get a code-length formula where $\log_2 \frac{1}{f(x)}$ is the only essential term.

Choosing the number and the positions of break-points can be done by MDL.

The code-length is equivalent (up to additive constants) to the code-length in a multinomial model.

⇒ The linear-time algorithm can be used.
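
To make the multinomial connection concrete, here is a sketch for a fixed set of break-points (all helper names are mine; a full MDL histogram would also encode the break-points themselves and minimize over them). The histogram density on bin $j$ of width $w_j$ is $(c_j/n)/w_j$, so the code length splits into a multinomial part plus a bin-width term:

```python
import bisect, math

def histogram_code_length_bits(xs, edges, C):
    # -log2 of the NML histogram code for data xs and fixed bin edges;
    # C(m, n) is the multinomial NML constant, e.g. multinomial_nml_constant.
    n, m = len(xs), len(edges) - 1
    counts = [0] * m
    for x in xs:  # assumes edges[0] <= x <= edges[-1]
        j = min(bisect.bisect_right(edges, x) - 1, m - 1)
        counts[j] += 1
    bits = -sum(c * math.log2(c / n) for c in counts if c > 0)  # multinomial fit
    bits += sum(c * math.log2(edges[j + 1] - edges[j])          # bin widths
                for j, c in enumerate(counts) if c > 0)
    return bits + math.log2(C(m, n))                            # NML normalization
```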

Clustering

Consider the problem of clustering vectors of (independent) multinomial variables.

This can be seen as a way to encode (compress) the data:

1 first encode the cluster index of each observation vector,
2 then encode the observations using separate (multinomial) models.

Again, the problem is reduced to the multinomial case, and the fast NML algorithm can be applied.
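
A sketch of the resulting total code length for one fixed cluster assignment (the search over assignments is not shown, and all names are mine). Both stages are multinomial stochastic complexities, so the linear-time constant from above can be reused:

```python
import math

def stochastic_complexity_bits(counts, C):
    # -log2 NML probability of a multinomial sample with the given counts.
    n, m = sum(counts), len(counts)
    fit = -sum(c * math.log2(c / n) for c in counts if c > 0)
    return fit + math.log2(C(m, n))

def clustering_code_length_bits(data, labels, m, K, C):
    # (1) encode the K-ary cluster index of each observation vector,
    # (2) encode each coordinate within each cluster with its own
    #     m-ary multinomial model.
    total = stochastic_complexity_bits(
        [sum(1 for l in labels if l == k) for k in range(K)], C)
    d = len(data[0])
    for k in range(K):
        rows = [x for x, l in zip(data, labels) if l == k]
        if not rows:
            continue
        for j in range(d):
            counts = [sum(1 for x in rows if x[j] == v)
                      for v in range(1, m + 1)]
            total += stochastic_complexity_bits(counts, C)
    return total
```

Minimizing this total over label assignments, for example by local search, yields an MDL clustering criterion.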
