
Information-Theoretic Modeling

Lecture 10: MDL Principle — Part II

Jyrki Kivinen

Department of Computer Science, University of Helsinki

Autumn 2012

Outline

1 MDL for Gaussian Models
    Encoding Continuous Data
    Differential Entropy
    Linear Regression
    Subset Selection Problem
    Wavelet Denoising

2 MDL for Multinomial Models
    Universal Codes
    Fast NML Computation
    Histogram Density Estimation
    Clustering

Gaussian models

Density function:

$$\phi_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)},$$

and for an i.i.d. sample $x_1, \dots, x_n$,

$$\phi_{\mu,\sigma^2}(x_1, \dots, x_n) \overset{\text{i.i.d.}}{=} \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x_i-\mu)^2/(2\sigma^2)} = \left(2\pi\sigma^2\right)^{-n/2} e^{-\sum_{i=1}^n (x_i-\mu)^2/(2\sigma^2)}.$$

Mean: $\mu = E[X]$; variance: $\sigma^2 = E[(X-\mu)^2]$.

Maximum likelihood:

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2.$$

How to Encode Continuous Data?

To encode data using, say, the Gaussian density, we face the problem of how to encode continuous data.

We already know how to encode using models with continuous parameters:

    two-part codes with optimal quantization ($\approx \frac{k}{2}\log_2 n$ extra bits),
    mixture codes,
    NML.

It is obviously not possible to encode data with infinite precision. We have to discretize: encode $x$ only up to precision $\delta$.

Differential Entropy

What is the optimal rate for encoding (compressing) continuous data (up to precision $\delta$)?

The answer again involves an entropy. However, it is not the familiar kind of entropy, but the following.

Differential entropy
Let $X \in \mathbb{R}$ be a continuous random variable with probability density $f\colon \mathbb{R} \to \mathbb{R}^+$. The differential entropy of $X$ is defined as

$$h(X) = E_{X \sim f}\!\left[\log_2 \frac{1}{f(X)}\right] = \int f(x) \log_2 \frac{1}{f(x)}\, dx.$$

Differential Entropy

If $\delta > 0$ is small, the probability that $X \in [(t-\frac{1}{2})\delta, (t+\frac{1}{2})\delta]$ is well approximated by $f(t\delta)\delta$.

Hence, the minimum coding rate of the discretized random variable $X_\delta$ is given by

$$H(X_\delta) \approx \sum_{x = t\delta:\; t \in \mathbb{Z}} f(x)\,\delta \log_2 \frac{1}{f(x)\delta} \;\xrightarrow{\;\delta \to 0\;}\; \int_{-\infty}^{+\infty} f(x) \log_2 \frac{1}{f(x)}\, dx \;-\; \log_2 \delta.$$

Hence, the rate is approximately $H(X_\delta) \approx h(X) - \log_2 \delta$.
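
As a sanity check of $H(X_\delta) \approx h(X) - \log_2 \delta$, here is a small numerical sketch (not part of the original lecture; the function names and the truncation of the grid at $\pm 10\sigma$ are my choices). It discretizes a Gaussian with step $\delta$ and compares the resulting discrete entropy against the known closed form $h(X) = \frac{1}{2}\log_2(2\pi e \sigma^2)$ for a Gaussian:

```python
import numpy as np

def gaussian_diff_entropy_bits(sigma2):
    # Closed form for X ~ N(0, sigma2): h(X) = 0.5 * log2(2*pi*e*sigma2).
    return 0.5 * np.log2(2 * np.pi * np.e * sigma2)

def discretized_entropy_bits(sigma2, delta):
    # H(X_delta) on a grid of width delta; cell probability ~ f(t*delta)*delta.
    sigma = np.sqrt(sigma2)
    t = np.arange(-int(10 * sigma / delta), int(10 * sigma / delta) + 1)
    f = np.exp(-(t * delta) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    p = f * delta
    p = p / p.sum()  # renormalize over the truncated grid
    return float(-(p * np.log2(p)).sum())

sigma2, delta = 1.0, 0.01
print(discretized_entropy_bits(sigma2, delta))               # ~8.69 bits
print(gaussian_diff_entropy_bits(sigma2) - np.log2(delta))   # ~8.69 bits
```

With $\sigma^2 = 1$ and $\delta = 0.01$ both values come out near $8.69$ bits, as the approximation predicts.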

Differential Entropy

The minimum coding rate $h(X) - \log_2 \delta$ is achieved if and only if the code-word lengths are chosen according to

$$\ell(x) = \log_2 \frac{1}{f(x)\delta} = \log_2 \frac{1}{f(x)} + \log_2 \frac{1}{\delta}.$$

The term $\log_2(1/\delta)$ depends only on the precision we chose and is the same for all models. Therefore, we can ignore it for the purpose of comparing models.

Back to Gaussians

Recall the Gaussian density function:

$$\phi_{\mu,\sigma^2}(x_1, \dots, x_n) \overset{\text{i.i.d.}}{=} \left(2\pi\sigma^2\right)^{-n/2} e^{-\sum_{i=1}^n (x_i-\mu)^2/(2\sigma^2)}.$$

The code-length is then

$$-\log_2 \phi_{\mu,\sigma^2}(x_1, \dots, x_n) = \frac{n}{2}\log_2(2\pi\sigma^2) + \frac{1}{(2\ln 2)\,\sigma^2}\sum_{i=1}^n (x_i-\mu)^2.$$

Let's use the two-part code and plug in the maximum likelihood parameters

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2.$$

Since $\sum_{i=1}^n (x_i - \hat\mu)^2 = n\hat\sigma^2$, the second term equals $\frac{n}{2\ln 2}$, which does not depend on the data, and the code-length formula simplifies to

$$\frac{n}{2}\log_2 \hat\sigma^2 + \text{constant}.$$

We get the total (two-part) code-length formula:

$$\frac{n}{2}\log_2 \hat\sigma^2 + \frac{k}{2}\log_2 n + \text{constant}.$$

Since we have two parameters, $\mu$ and $\sigma^2$, we let $k = 2$.

Notice that depending on what exactly you are doing, you may or may not care about the constant.
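
As a small illustration of how this formula could be evaluated on a data vector, here is a sketch (the function name is mine, and the additive constant is simply dropped):

```python
import numpy as np

def gaussian_two_part_bits(x, k=2):
    # n/2 * log2(sigma_hat^2) + k/2 * log2(n), dropping the additive constant.
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu_hat = x.mean()                        # ML estimate of the mean
    sigma2_hat = np.mean((x - mu_hat) ** 2)  # ML variance: divide by n, not n-1
    return n / 2 * np.log2(sigma2_hat) + k / 2 * np.log2(n)

rng = np.random.default_rng(0)
print(gaussian_two_part_bits(rng.normal(5.0, 2.0, size=1000)))
```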

Linear Regression

A similar treatment can be given to linear regression models.

The model includes a set of regressor variables $x_1, \dots, x_p \in \mathbb{R}$, and a set of coefficients $\beta_1, \dots, \beta_p$.

The dependent variable, $Y$, is assumed to be Gaussian:

    the mean $\mu$ is given as a linear combination of the regressors: $\mu = \beta_1 x_1 + \cdots + \beta_p x_p = \beta^T x$,
    the variance is some parameter $\sigma^2$.

For a sample of size $n$, the matrix notation is convenient:

$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$

Then the model can be written as $Y = X\beta + \epsilon$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

The maximum likelihood estimators are now

$$\hat\beta = (X^T X)^{-1} X^T Y, \qquad \hat\sigma^2 = \frac{1}{n}\,\|Y - X\hat\beta\|_2^2 = \frac{\mathrm{RSS}}{n},$$

where RSS is the "residual sum of squares".

Since the errors are assumed Gaussian, our code-length formula applies. The number of parameters is now $p + 1$ ($p$ of the $\beta$s and $\sigma^2$), and since $\frac{n}{2}\log_2 \hat\sigma^2 = \frac{n}{2}\log_2 \mathrm{RSS} - \frac{n}{2}\log_2 n$ differs from $\frac{n}{2}\log_2 \mathrm{RSS}$ only by a term that is the same for all models, we get

$$\frac{n}{2}\log_2 \mathrm{RSS} + \frac{p+1}{2}\log_2 n + \text{constant}.$$
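
The same quantities in code, as a sketch (the function name is mine; numpy's least-squares solver computes the same $\hat\beta$ as the normal equations, but more stably):

```python
import numpy as np

def regression_code_length_bits(X, y):
    # n/2 * log2(RSS) + (p+1)/2 * log2(n), up to the additive constant.
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # ML estimate of beta
    rss = float(np.sum((y - X @ beta_hat) ** 2))      # residual sum of squares
    return n / 2 * np.log2(rss) + (p + 1) / 2 * np.log2(n)
```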

Subset Selection Problem

Often we have a large set of potential regressors, some of which may be irrelevant.

The MDL principle can be used to select a subset of them by comparing the total code-lengths:

$$\min_S \left\{ \frac{n}{2}\log_2 \mathrm{RSS}_S + \frac{|S|+1}{2}\log_2 n \right\},$$

where $\mathrm{RSS}_S$ is the RSS obtained by using subset $S$ of the regressors.

⇒ Exercise
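
A brute-force sketch of this selection (all names are mine; the search is exponential in $p$, so this is for illustration only):

```python
import numpy as np
from itertools import combinations

def mdl_subset_selection(X, y):
    # Minimize n/2*log2(RSS_S) + (|S|+1)/2*log2(n) over nonempty subsets S.
    n, p = X.shape
    best_score, best_S = np.inf, ()
    for size in range(1, p + 1):
        for S in combinations(range(p), size):
            Xs = X[:, list(S)]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = float(np.sum((y - Xs @ beta) ** 2))
            score = n / 2 * np.log2(rss) + (len(S) + 1) / 2 * np.log2(n)
            if score < best_score:
                best_score, best_S = score, S
    return best_S, best_score
```

For larger $p$, exhaustive enumeration is infeasible and one would fall back on greedy or stepwise search.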

Wavelet Denoising

One particularly useful way to obtain the regressor (design) matrix is to use wavelets.

[Figure: wavelet basis; image by Gabriel Peyré]

Main effort in constructing a universal code:

1 combines two-part, mixture, and NML universal codes,
2 bounds on NML normalization region required,
3 important lesson: remember to encode model class.

MDL for Multinomial Models

Multinomial Models

The multinomial model — the generalization of Bernoulli — is very simple:

$$p(x = j) = \theta_j, \quad \text{for } j \in \{1, \dots, m\}.$$

Maximum likelihood:

$$\hat\theta_j = \frac{\#\{i : x_i = j\}}{n}.$$

Two-part, mixture, and NML models are readily defined.

⇒ Exercises 5.1 & 5.2
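
A minimal sketch of the ML fit and the corresponding code length $-\log_2 p_{\hat\theta}(x^n)$, the data part of a two-part code (the symbol set $\{1, \dots, m\}$ and the function name are my conventions):

```python
import math
from collections import Counter

def multinomial_ml_bits(xs, m):
    # theta_hat_j = #{i : x_i = j} / n; code length -sum_j c_j * log2(c_j / n).
    n = len(xs)
    counts = Counter(xs)
    theta_hat = [counts.get(j, 0) / n for j in range(1, m + 1)]
    bits = -sum(c * math.log2(c / n) for c in counts.values())
    return theta_hat, bits

theta, bits = multinomial_ml_bits([1, 1, 2, 3, 3, 3], m=3)
print(theta, bits)  # empirical frequencies, and n times their entropy
```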

Fast NML for Multinomials

The naïve way to compute the normalizing constant in the NML model

$$p_{\mathrm{NML}}(x^n) = \frac{p_{\hat\theta(x^n)}(x^n)}{\mathcal{C}^m_n}, \qquad \mathcal{C}^m_n = \sum_{y^n \in \mathcal{X}^n} p_{\hat\theta(y^n)}(y^n),$$

takes exponential time ($\Omega(m^n)$).

The second most naïve way takes "only" polynomial time, $O(n^{m-1})$, but is still intractable unless $m \le 3$ (or maybe $m \le 4$).

There is a way — which is not naïve at all! — to do it in linear time, $O(n + m)$, using the following recursion:

$$\mathcal{C}^m_n = \mathcal{C}^{m-1}_n + \frac{n}{m-2}\,\mathcal{C}^{m-2}_n,$$

where $\mathcal{C}^m_n$ is the normalizing constant for an $m$-ary multinomial and sample size $n$.

The trick is to reduce the general case to $\mathcal{C}^1_n = 1$ and $\mathcal{C}^2_n$, the latter of which can be computed in linear time (using the second most naïve approach).

Kontkanen & Myllymäki, "A linear-time algorithm for computing the multinomial stochastic complexity", Information Processing Letters 103 (2007), 6, pp. 227–233.
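
A sketch of the recursion in code (plain floating point is my simplification; the Bernoulli base case $\mathcal{C}^2_n = \sum_{k=0}^n \binom{n}{k} (k/n)^k ((n-k)/n)^{n-k}$ is computed directly in $O(n)$):

```python
from math import comb

def multinomial_nml_constant(m, n):
    # C^m_n via the recursion C^m_n = C^{m-1}_n + (n/(m-2)) * C^{m-2}_n.
    if n == 0:
        return 1.0
    c_prev = 1.0  # C^1_n = 1
    c_curr = sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                 for k in range(n + 1))  # C^2_n; note 0**0 == 1 in Python
    if m == 1:
        return c_prev
    for j in range(3, m + 1):  # climb C^3_n, C^4_n, ..., C^m_n
        c_prev, c_curr = c_curr, c_curr + (n / (j - 2)) * c_prev
    return c_curr
```

The stochastic complexity of $x^n$ is then $-\log_2 p_{\hat\theta(x^n)}(x^n) + \log_2 \mathcal{C}^m_n$.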

Histogram Density Estimation

For a histogram density, we again get a code-length formula where $\log_2 \frac{1}{f(x)}$ is the only essential term.

Choosing the number and the positions of break-points can be done by MDL.

The code-length is equivalent (up to additive constants) to the code-length in a multinomial model.

⇒ The linear-time algorithm can be used.
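
To make the multinomial connection concrete, here is a sketch for a fixed set of break-points (all helper names are mine; a full MDL histogram would also encode the break-points themselves and minimize over them). The histogram density on bin $j$ of width $w_j$ is $(c_j/n)/w_j$, so the code length splits into a multinomial part plus a bin-width term:

```python
import bisect, math

def histogram_code_length_bits(xs, edges, C):
    # -log2 of the NML histogram code for data xs and fixed bin edges;
    # C(m, n) is the multinomial NML constant, e.g. multinomial_nml_constant.
    n, m = len(xs), len(edges) - 1
    counts = [0] * m
    for x in xs:  # assumes edges[0] <= x <= edges[-1]
        j = min(bisect.bisect_right(edges, x) - 1, m - 1)
        counts[j] += 1
    bits = -sum(c * math.log2(c / n) for c in counts if c > 0)  # multinomial fit
    bits += sum(c * math.log2(edges[j + 1] - edges[j])          # bin widths
                for j, c in enumerate(counts) if c > 0)
    return bits + math.log2(C(m, n))                            # NML normalization
```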

Clustering

Consider the problem of clustering vectors of (independent) multinomial variables.

This can be seen as a way to encode (compress) the data:

1 first encode the cluster index of each observation vector,
2 then encode the observations using separate (multinomial) models.

Again, the problem is reduced to the multinomial case, and the fast NML algorithm can be applied.
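
A sketch of the resulting total code length for one fixed cluster assignment (the search over assignments is not shown, and all names are mine). Both stages are multinomial stochastic complexities, so the linear-time constant from above can be reused:

```python
import math

def stochastic_complexity_bits(counts, C):
    # -log2 NML probability of a multinomial sample with the given counts.
    n, m = sum(counts), len(counts)
    fit = -sum(c * math.log2(c / n) for c in counts if c > 0)
    return fit + math.log2(C(m, n))

def clustering_code_length_bits(data, labels, m, K, C):
    # (1) encode the K-ary cluster index of each observation vector,
    # (2) encode each coordinate within each cluster with its own
    #     m-ary multinomial model.
    total = stochastic_complexity_bits(
        [sum(1 for l in labels if l == k) for k in range(K)], C)
    d = len(data[0])
    for k in range(K):
        rows = [x for x, l in zip(data, labels) if l == k]
        if not rows:
            continue
        for j in range(d):
            counts = [sum(1 for x in rows if x[j] == v)
                      for v in range(1, m + 1)]
            total += stochastic_complexity_bits(counts, C)
    return total
```

Minimizing this total over label assignments, for example by local search, yields an MDL clustering criterion.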
