
Information-Theoretic Modeling

Lecture 9: The MDL Principle

Jyrki Kivinen

Department of Computer Science, University of Helsinki

Autumn 2012


Jorma Rissanen (left) receiving the IEEE Information Theory Society Best Paper Award from Claude Shannon in 1986.

IEEE Golden Jubilee Award for Technological Innovation (for the invention of arithmetic coding), 1998;
IEEE Richard W. Hamming Medal (for fundamental contribution to information theory, statistical inference, control theory, and the theory of complexity), 1993;
Kolmogorov Medal, 2006;
IBM Outstanding Innovation Award (for work in statistical inference, information theory, and the theory of complexity), 1988;
IEEE Claude E. Shannon Award, 2009; ...


Outline

1 Occam's Razor: House, Visual Recognition, Astronomy, Razor

2 MDL Principle: Rules & Exceptions, Probabilistic Models, Old-Style MDL, Modern MDL


House

Brandon has

1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.

Candidate causes, symptom by symptom:

1 pneumonia or common cold,
2 appendicitis or gout medicine,
3 food poisoning or gout medicine,
4 hemorrhage or gout medicine,
5 meningitis or common cold.

No single disease causes all of these. Each symptom can be caused by some (possibly different) disease... Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,
2 pharmacy error: cough medicine replaced by gout medicine.


Visual Recognition

Astronomy


William of Ockham (c. 1288–1348)


Occam's Razor

Entities should not be multiplied beyond necessity.

Isaac Newton: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances."

Diagnostic parsimony: Find the fewest possible causes that explain the symptoms.

(Hickam's dictum: "Patients can have as many diseases as they damn well please.")


MDL Principle

Minimum Description Length (MDL) Principle (2-part): Choose the hypothesis which minimizes the sum of

1 the codelength of the hypothesis, and

2 the codelength of the data with the help of the hypothesis.

How to encode data with the help of a hypothesis?


Encoding Data: Rules & Exceptions

Idea 1: Hypothesis = rule; encode exceptions.

Black box of size 25×25 = 625, white dots at $(x_1, y_1), \ldots, (x_k, y_k)$.

For an image of size $n = 625$ there are $2^n$ different images, and

$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$

different groups of $k$ exceptions.


$k = 1$: $\binom{n}{1} = 625 \ll 2^{625} \approx 1.4\cdot 10^{188}$. Codelength $\log_2(n+1) + \log_2\binom{n}{k} \approx 19$ vs. $\log_2 2^{625} = 625$.

$k = 2$: $\binom{n}{2} = 195\,000 \ll 2^{625}$. Codelength $\approx 27$ vs. $625$.

$k = 3$: $\binom{n}{3} = 40\,495\,000 \ll 2^{625}$. Codelength $\approx 35$ vs. $625$.

$k = 10$: $\binom{n}{10} \approx 2\,331\,354\,000\,000\,000\,000\,000 \ll 2^{625}$. Codelength $\approx 80$ vs. $625$.

$k = 100$: $\binom{n}{100} \approx 9.5\cdot 10^{117} \ll 2^{625} \approx 1.4\cdot 10^{188}$. Codelength $\approx 401$ vs. $625$.

$k = 300$: $\binom{n}{300} \approx 2.7\cdot 10^{186} < 2^{625} \approx 1.4\cdot 10^{188}$. Codelength $\approx 629$ vs. $625$.

$k = 372$: $\binom{n}{372} \approx 5.1\cdot 10^{181} < 2^{625} \approx 1.4\cdot 10^{188}$. Codelength $\approx 613$ vs. $625$.

For small $k$ the rule-plus-exceptions code is far shorter than the 625 bits of a literal pixel-by-pixel encoding; only when the number of exceptions approaches half of the image does the advantage disappear.
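These numbers are easy to check. Below is a minimal Python sketch (the function name rule_plus_exceptions_bits is ours, not from the lecture) that evaluates $\log_2(n+1) + \log_2\binom{n}{k}$ for the 25×25 image:

```python
from math import comb, log2

n = 25 * 25  # number of pixels in the image

def rule_plus_exceptions_bits(n: int, k: int) -> float:
    """Codelength of 'all black except k white pixels':
    log2(n+1) bits to transmit k, plus log2 C(n,k) bits
    to say which k pixels are the exceptions."""
    return log2(n + 1) + log2(comb(n, k))

for k in (1, 2, 3, 10, 100, 300, 372):
    print(f"k={k:3d}: {rule_plus_exceptions_bits(n, k):6.1f} bits  vs. {n} bits for a literal encoding")
```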


Encoding Data: Probabilistic Models

Idea 2: Hypothesis = probability distribution.

[Figure: Noisy, PSNR = 19.8; MDL (A-B), PSNR = 32.9]

Rissanen & Shannon: $\log_2 \frac{1}{p_{\hat\theta}(D)} + \frac{k}{2}\log_2 n$.


Polynomials

From P. Grünwald


Old-Style MDL

With the precision $1/\sqrt{n}$ the codelength for the data is almost optimal:

$\min_{\theta_q \in \{\theta^{(1)}, \theta^{(2)}, \ldots\}} \ell_{\theta_q}(D) \;\approx\; \min_{\theta \in \Theta} \ell_\theta(D) \;=\; \log_2 \frac{1}{p_{\hat\theta}(D)}.$

This gives the total codelength formula:

"Steam MDL"

$\ell_{\theta_q}(D) + \ell(\theta_q) \;\approx\; \log_2 \frac{1}{p_{\hat\theta}(D)} + \frac{k}{2}\log_2 n.$
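As a concrete illustration of this formula (our own toy example, not from the slides), take the Bernoulli model with its single parameter ($k = 1$): the data term is the negative log-likelihood at the maximum-likelihood parameter, and the parameter costs $\frac{1}{2}\log_2 n$ bits.

```python
from math import log2

def bernoulli_two_part_bits(data: str) -> float:
    """Old-style MDL codelength for a binary string under the Bernoulli
    model: log2 1/p_theta_hat(data) + (k/2) log2 n with k = 1."""
    n = len(data)
    ones = data.count("1")
    theta = ones / n  # maximum-likelihood parameter
    if theta in (0.0, 1.0):
        data_bits = 0.0  # degenerate ML model assigns the data probability 1
    else:
        data_bits = -(ones * log2(theta) + (n - ones) * log2(1 - theta))
    return data_bits + 0.5 * log2(n)

biased = "1" * 90 + "0" * 10
balanced = "10" * 50
for d in (biased, balanced):
    print(f"{bernoulli_two_part_bits(d):6.1f} bits vs. {len(d)} bits literal")
```

On the heavily biased string the two-part codelength is far below the literal $n$ bits, while on the balanced string it is slightly above it.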


Old-Style MDL

The $\frac{k}{2}\log_2 n$ formula is only a rough approximation, and works well only for very large samples.

MDL in the 21st century:

More advanced codes: mixtures, normalized maximum likelihood, etc.
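One standard mixture code for the Bernoulli model class is the Krichevsky-Trofimov mixture; the sketch below (our own illustration, not course code) accumulates its codelength sequentially using the predictive probability $(n_s + \frac{1}{2})/(t+1)$.

```python
from math import log2

def kt_mixture_bits(data: str) -> float:
    """Codelength of the Krichevsky-Trofimov mixture code for a binary
    string: predict the next bit with probability (count + 1/2) / (t + 1)
    and sum the -log2 of the predictions."""
    counts = [0, 0]
    bits = 0.0
    for t, symbol in enumerate(data):
        s = int(symbol)
        p = (counts[s] + 0.5) / (t + 1)
        bits += -log2(p)
        counts[s] += 1
    return bits

print(kt_mixture_bits("1" * 90 + "0" * 10))  # compare with 100 bits for a literal encoding
```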


MDL Model Selection

Perhaps the best known use for MDL is for model class selection (which is often called just model selection).

Suppose we have a family of model classes $M_1, \ldots, M_m$, each with their own parameter set $\Theta_1, \ldots, \Theta_m$.

We want to pick the one that seems best suited for our data $D$.

Typically $M_1 \subset \cdots \subset M_m$, so $M_m$ always achieves the best fit to the data, but may be overfitting (see Introduction to Machine Learning).

Therefore we use MDL and pick the $M_i$ that minimizes the total code length.


MDL Model Selection

Actually here we have a "three-part" code:

1 Encoding of the model class: $\ell(M_i)$, $i \in \{1, \ldots, m\}$.
2 Encoding of the parameter (vector): $\ell_i(\theta)$, $\theta \in \Theta_i$.
3 Encoding of the data: $\log_2 \frac{1}{p_\theta(D)}$, $D \in \mathcal{D}$.

If we have a finite family of $m$ model classes, we can do Part 1 with $\ell(M_i) = \log_2 m$ for all $i$.

This can be generalized to infinite families of model classes, in which case we often pick a $p$ which is "as uniform as possible" over $\mathbb{N}$ and set $\ell(M_i) = \log_2(1/p(i))$.
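A sketch of the three-part code on a toy family (our own choice of model classes, not from the slides: binary Markov chains of order 0, 1 and 2, with the $\frac{k}{2}\log_2 n$ approximation for the parameter part and a uniform $\log_2 m$ bits for the class index):

```python
from math import log2
from collections import defaultdict

def markov_three_part_bits(data: str, order: int, m: int) -> float:
    """Three-part codelength: log2 m for the model class index,
    (k/2) log2 n for the parameters (k = 2**order Bernoulli parameters),
    and the ML data codelength sum_c n_c * H(n_c1 / n_c)."""
    n = len(data)
    counts = defaultdict(lambda: [0, 0])  # context -> [count of next 0, count of next 1]
    for i in range(order, n):
        context = data[i - order:i]
        counts[context][int(data[i])] += 1
    data_bits = 0.0
    for c0, c1 in counts.values():
        for c in (c0, c1):
            if c > 0:
                data_bits += c * log2((c0 + c1) / c)
    k = 2 ** order
    return log2(m) + 0.5 * k * log2(n) + data_bits

data = "01" * 200  # strongly first-order structure
orders = (0, 1, 2)
for r in orders:
    print(f"order {r}: {markov_three_part_bits(data, r, len(orders)):7.1f} bits")
```

On this strongly periodic string the order-1 class wins: order 0 cannot compress it at all, and order 2 pays for parameters it does not need.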


MDL Model Selection

If we are interested in choosing a model class (and not the parameters), we can improve parts 2 & 3 by combining them into a better universal code than the two-part one:

1 Encoding of the model class index: $\ell(M_i)$, $i \in \mathbb{N}$.
2 Encoding of the data: $\ell_{M_i}(D)$, $D \in \mathcal{D}$, where $\ell_{M_i}$ is a universal code-length (e.g., mixture, NML) based on model class $M_i$.
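As a sketch of what such a universal code can look like, the normalized maximum likelihood (NML) codelength for the Bernoulli model class can be computed exactly for moderate $n$ (our own illustration; the helper names are not from the course):

```python
from math import comb, log2

def bernoulli_ml_prob(ones: int, n: int) -> float:
    """Maximized likelihood p_theta_hat(D) for a sequence with `ones` ones."""
    if ones in (0, n):
        return 1.0
    t = ones / n
    return t ** ones * (1 - t) ** (n - ones)

def bernoulli_nml_bits(data: str) -> float:
    """NML codelength: -log2 of p_theta_hat(D) normalized by the sum of
    maximized likelihoods over all length-n binary sequences."""
    n = len(data)
    normalizer = sum(comb(n, j) * bernoulli_ml_prob(j, n) for j in range(n + 1))
    ones = data.count("1")
    return -log2(bernoulli_ml_prob(ones, n) / normalizer)

print(bernoulli_nml_bits("1" * 90 + "0" * 10))  # well below 100 bits
print(bernoulli_nml_bits("10" * 50))            # about 100 bits plus the complexity term
```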


MDL Model Selection

MDL Explanation of MDL

The success in extracting the structure from data can be measured by the codelength.

In practice, we only find the structure that is “visible” to the used model class(es). For instance, the Bernoulli (coin flipping) model only sees the number of 1s.
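A quick way to see this (our own illustration): two very different binary strings with the same number of 1s receive exactly the same maximum-likelihood Bernoulli codelength.

```python
from math import log2

def bernoulli_ml_bits(data: str) -> float:
    """Codelength log2 1/p_theta_hat(data) under the Bernoulli model;
    it depends on the data only through the number of 1s."""
    n, ones = len(data), data.count("1")
    if ones in (0, n):
        return 0.0
    t = ones / n
    return -(ones * log2(t) + (n - ones) * log2(1 - t))

structured = "01" * 50            # perfectly alternating
blocky = "0" * 50 + "1" * 50      # same number of 1s, different structure
print(bernoulli_ml_bits(structured), bernoulli_ml_bits(blocky))  # identical: 100.0 bits each
```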


MDL & Bayes

The MDL model selection criterion

minimize $\ell(\theta) + \ell_\theta(D)$

can be interpreted (via $p = 2^{-\ell}$) as

maximize $p(\theta) \cdot p_\theta(D)$.

In Bayesian probability, this is equivalent to maximization of posterior probability:

$p(\theta \mid D) = \frac{p(\theta)\, p(D \mid \theta)}{p(D)},$

where the term $p(D)$ (the marginal probability of $D$) is constant with respect to $\theta$ and does not affect model selection.
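Spelling out the identification $p = 2^{-\ell}$ step by step:

$\arg\min_\theta \bigl[\ell(\theta) + \ell_\theta(D)\bigr] = \arg\min_\theta \bigl[-\log_2 p(\theta) - \log_2 p(D \mid \theta)\bigr] = \arg\max_\theta p(\theta)\, p(D \mid \theta) = \arg\max_\theta \frac{p(\theta)\, p(D \mid \theta)}{p(D)} = \arg\max_\theta p(\theta \mid D).$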

⇒ Probabilistic Modelling


Example: Denoising

Complexity = Information + Noise

= Regularity + Randomness

= Algorithm + Compressed file

Denoising means the process of removing noise from a signal.

The MDL principle gives a natural method for denoising since the very idea of MDL is to separate the total complexity of a signal into information and noise.

First encode a smooth signal (information), and then the difference to the observed signal (noise).
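The slides do not spell out the actual denoising method behind the figures. As a rough sketch of the idea only, the toy version below (entirely our own construction) fits polynomials of increasing degree to a noisy signal and keeps the degree with the smallest two-part codelength, treating the fitted curve as the information and the residual as the noise. The Gaussian residual codelength $\frac{n}{2}\log_2(2\pi e\,\mathrm{RSS}/n)$ is an added assumption, and a proper treatment of continuous data would also need a quantization step that this sketch glosses over.

```python
import numpy as np

def two_part_bits(y: np.ndarray, degree: int) -> float:
    """Old-style two-part codelength for 'polynomial + Gaussian residual':
    (k/2) log2 n for the k = degree+1 coefficients, plus the Gaussian
    negative log-likelihood of the residuals at the ML variance."""
    n = len(y)
    x = np.linspace(0.0, 1.0, n)
    coeffs = np.polyfit(x, y, degree)
    residual = y - np.polyval(coeffs, x)
    rss = float(np.sum(residual ** 2))
    data_bits = 0.5 * n * np.log2(2 * np.pi * np.e * rss / n)
    return 0.5 * (degree + 1) * np.log2(n) + data_bits

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
signal = np.sin(2 * np.pi * x)                        # the smooth "information"
noisy = signal + 0.3 * rng.standard_normal(len(x))    # plus "noise"

best = min(range(1, 16), key=lambda d: two_part_bits(noisy, d))
print("degree chosen by MDL:", best)
```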


[Figure: Noisy, PSNR = 19.8; MDL (A-B), PSNR = 32.9]


Coming next

We’ll see some more MDL:

MDL on continuous data

examples of MDL in applications.

Then we move on to Kolmogorov complexity.
