Information-Theoretic Modeling
Lecture 9: The MDL Principle
Jyrki Kivinen
Department of Computer Science, University of Helsinki
Autumn 2012
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Lecture 9: MDL Principle
Jorma Rissanen (left) receiving the IEEE Information Theory Society Best Paper Award from Claude Shannon in 1986.
IEEE Golden Jubilee Award for Technological Innovation(for the invention of arithmetic coding) 1998;IEEE Richard W. Hamming Medal(for fundamental
contribution to information theory, statistical inference, control theory, and the theory of complexity) 1993;Kolmogorov Medal2006;
IBM Outstanding Innovation Award(for work in statistical inference, information theory, and the theory of complexity) 1988;
IEEE Claude E. Shannon Award 2009; ...
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
1 Occam’s Razor House
Visual Recognition Astronomy
Razor
Probabilistic Models Old-Style MDL Modern MDL
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
1 Occam’s Razor House
Visual Recognition Astronomy
Razor
2 MDL Principle Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Jyrki Kivinen Information-Theoretic Modeling
House
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
House
Brandon has
1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.
1 pneumonia,
common cold,
2 appendicitis,
gout medicine,
3 food poisoning,
gout medicine,
4 hemorrhage,
gout medicine,
5 meningitis.
common cold.
No single disease causes all of these.
Each symptom can be caused bysome(possibly different) disease... Dr. House explains the symptoms with two simple causes:
1 common cold, causing the cough and fever,
2 pharmacy error: cough medicine replaced by gout medicine.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
House
Brandon has
1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.
1 pneumonia,
2 appendicitis,
3 food poisoning,
gout medicine,
4 hemorrhage,
gout medicine,
5 meningitis.
common cold.
No single disease causes all of these.
Each symptom can be caused bysome(possibly different) disease... Dr. House explains the symptoms with two simple causes:
1 common cold, causing the cough and fever,
2 pharmacy error: cough medicine replaced by gout medicine.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
House
Brandon has
1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.
1 pneumonia,
common cold,
2 appendicitis,
gout medicine,
3 food poisoning,
gout medicine,
4 hemorrhage,
gout medicine,
5 meningitis.
common cold.
No single disease causes all of these.
Each symptom can be caused bysome(possibly different) disease...
Dr. House explains the symptoms with two simple causes:
1 common cold, causing the cough and fever,
2 pharmacy error: cough medicine replaced by gout medicine.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
House
Brandon has
1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.
1 pneumonia,
common cold,
2 appendicitis,
3 food poisoning,
4 hemorrhage,
gout medicine,
5 meningitis.
common cold.
No single disease causes all of these.
Each symptom can be caused bysome(possibly different) disease...
Dr. House explains the symptoms with two simple causes:
1 common cold, causing the cough and fever,
2 pharmacy error: cough medicine replaced by gout medicine.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
House
Brandon has
1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.
1 pneumonia,
common cold,
2 appendicitis,
gout medicine,
3 food poisoning,
gout medicine,
4 hemorrhage,
gout medicine,
5 meningitis.
common cold.
No single disease causes all of these.
Each symptom can be caused bysome(possibly different) disease...
Dr. House explains the symptoms with two simple causes:
1 common cold, causing the cough and fever,
2 pharmacy error: cough medicine replaced by gout medicine.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
House
Brandon has
1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.
1 pneumonia,
common cold,
2 appendicitis,
gout medicine,
3 food poisoning,
gout medicine,
4 hemorrhage,
5 meningitis.
No single disease causes all of these.
Each symptom can be caused bysome(possibly different) disease...
Dr. House explains the symptoms with two simple causes:
1 common cold, causing the cough and fever,
2 pharmacy error: cough medicine replaced by gout medicine.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
House
Brandon has
1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.
1 pneumonia,
common cold,
2 appendicitis,
gout medicine,
3 food poisoning,
gout medicine,
4 hemorrhage,
gout medicine,
5 meningitis.
common cold.
No single disease causes all of these.
Each symptom can be caused bysome(possibly different) disease...
Dr. House explains the symptoms with two simple causes:
1 common cold, causing the cough and fever,
2 pharmacy error: cough medicine replaced by gout medicine.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
House
Brandon has
1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.
1 pneumonia,
common cold,
2 appendicitis,
gout medicine,
3 food poisoning,
gout medicine,
4 hemorrhage,
gout medicine,
5 meningitis.
common cold.
No single disease causes all of these.
Each symptom can be caused bysome(possibly different) disease...
Dr. House explains the symptoms with two simple causes:
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
House
Brandon has
1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.
1 pneumonia,
common cold,
2 appendicitis,
gout medicine,
3 food poisoning,
gout medicine,
4 hemorrhage,
gout medicine,
5 meningitis.
common cold.
No single disease causes all of these.
Each symptom can be caused bysome(possibly different) disease...
Dr. House explains the symptoms with two simple causes:
1 common cold, causing the cough and fever,
2 pharmacy error: cough medicine replaced bygout medicine.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
House
Brandon has
1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.
1 common cold,
2 appendicitis,
3 food poisoning,
4 hemorrhage,
gout medicine,
5 common cold.
No single disease causes all of these.
Each symptom can be caused bysome(possibly different) disease...
Dr. House explains the symptoms with two simple causes:
1 common cold, causing the cough and fever,
2 pharmacy error: cough medicine replaced bygout medicine.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
House
Brandon has
1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.
1 common cold,
2 gout medicine,
3 gout medicine,
4 gout medicine,
5 common cold.
No single disease causes all of these.
Each symptom can be caused bysome(possibly different) disease...
Dr. House explains the symptoms with two simple causes:
1 common cold, causing the cough and fever,
2 pharmacy error: cough medicine replaced bygout medicine.
Jyrki Kivinen Information-Theoretic Modeling
Visual Recognition
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
Visual Recognition
Jyrki Kivinen Information-Theoretic Modeling
Visual Recognition
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
Visual Recognition
Jyrki Kivinen Information-Theoretic Modeling
Visual Recognition
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
Visual Recognition
Jyrki Kivinen Information-Theoretic Modeling
Visual Recognition
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
Visual Recognition
Jyrki Kivinen Information-Theoretic Modeling
Astronomy
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
Astronomy
Jyrki Kivinen Information-Theoretic Modeling
Astronomy
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
William of Ockham (c. 1288–1348)
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
Occam’s Razor
Occam’s Razor
Entities should not be multiplied beyond necessity.
appearances.”
Diagnostic parsimony: Find the fewest possible causes that explain the symptoms.
(Hickam’s dictum: “Patients can have as many diseases as they damn well please.”)
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
Occam’s Razor
Occam’s Razor
Entities should not be multiplied beyond necessity.
Isaac Newton: “We are to admit no more causes of natural things than such as are both true and sufficient to explain their
appearances.”
Diagnostic parsimony: Find the fewest possible causes that explain the symptoms.
(Hickam’s dictum: “Patients can have as many diseases as they damn well please.”)
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
Occam’s Razor
Occam’s Razor
Entities should not be multiplied beyond necessity.
Isaac Newton: “We are to admit no more causes of natural things than such as are both true and sufficient to explain their
appearances.”
Diagnostic parsimony: Find the fewest possible causes that explain the symptoms.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
Occam’s Razor
Occam’s Razor
Entities should not be multiplied beyond necessity.
Isaac Newton: “We are to admit no more causes of natural things than such as are both true and sufficient to explain their
appearances.”
Diagnostic parsimony: Find the fewest possible causes that explain the symptoms.
(Hickam’s dictum: “Patients can have as many diseases as they damn well please.”)
Jyrki Kivinen Information-Theoretic Modeling
Visual Recognition
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
Visual Recognition
Jyrki Kivinen Information-Theoretic Modeling
Visual Recognition
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
House Visual Recognition Astronomy Razor
Visual Recognition
Jyrki Kivinen Information-Theoretic Modeling
1 Occam’s Razor House
Visual Recognition Astronomy
Razor
2 MDL Principle Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL Principle
Minimum Description Length (MDL) Principle (2-part) Choose the hypothesis which minimizes the sum of
1 the codelength of the hypothesis, and
2 the codelength of the data with the help of the hypothesis.
How to encode datawith the help of a hypothesis?
Jyrki Kivinen Information-Theoretic Modeling
MDL Principle
Minimum Description Length (MDL) Principle (2-part) Choose the hypothesis which minimizes the sum of
1 the codelength of the hypothesis, and
2 the codelength of the data with the help of the hypothesis.
How to encode datawith the help of a hypothesis?
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Encoding Data: Rules & Exceptions
Idea 1: Hypothesis = rule; encode exceptions.
Black box of size 25×25 = 625, white dots at (x1,y1), . . . ,(xk,yk).
For image of sizen= 625, there are 2n different images, and
n
k
= n!
k!(n−k)! different groups ofk exceptions.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Encoding Data: Rules & Exceptions
Idea 1: Hypothesis = rule; encode exceptions.
Black box of size 25×25 = 625, white dots at (x1,y1), . . . ,(xk,yk).
n
k
= n!
k!(n−k)! different groups ofk exceptions.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Encoding Data: Rules & Exceptions
Idea 1: Hypothesis = rule; encode exceptions.
Black box of size 25×25 = 625, white dots at (x1,y1), . . . ,(xk,yk).
For image of sizen= 625, there are 2n different images, and
n
k
= n!
k!(n−k)!
different groups ofk exceptions.
Jyrki Kivinen Information-Theoretic Modeling
Encoding Data: Rules & Exceptions
Idea 1: Hypothesis = rule; encode exceptions.
Black box of size 25×25 = 625, white dots at (x1,y1), . . . ,(xk,yk).
For image of sizen= 625, there are 2n different images, and
n
k
= n!
k!(n−k)!
different groups ofk exceptions.
k= 1 :
n
1
= 6252625 ≈1.4·10188. Codelength log2(n+ 1) + log2
n
k
≈19 vs. log22625= 625
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Encoding Data: Rules & Exceptions
Idea 1: Hypothesis = rule; encode exceptions.
Black box of size 25×25 = 625, white dots at (x1,y1), . . . ,(xk,yk).
For image of sizen= 625, there are 2n different images, and
n
k
= n!
k!(n−k)!
different groups ofk exceptions.
k= 2 :
n
2
= 195 0002625≈1.4·10188. Codelength log2(n+ 1) + log2
n
k
≈27 vs. log22625= 625
Jyrki Kivinen Information-Theoretic Modeling
Encoding Data: Rules & Exceptions
Idea 1: Hypothesis = rule; encode exceptions.
Black box of size 25×25 = 625, white dots at (x1,y1), . . . ,(xk,yk).
For image of sizen= 625, there are 2n different images, and
n
k
= n!
k!(n−k)!
different groups ofk exceptions.
k= 3 :
n
3
= 40 495 0002625≈1.4·10188. Codelength log2(n+ 1) + log2
n
k
≈35 vs. log22625= 625
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Encoding Data: Rules & Exceptions
Idea 1: Hypothesis = rule; encode exceptions.
Black box of size 25×25 = 625, white dots at (x1,y1), . . . ,(xk,yk).
For image of sizen= 625, there are 2n different images, and
n
k
= n!
k!(n−k)!
different groups ofk exceptions.
k= 10 :
n
10
= 2 331 354 000 000 000 000 0002625. Codelength log2(n+ 1) + log2
n
k
≈80 vs. log22625= 625
Jyrki Kivinen Information-Theoretic Modeling
Encoding Data: Rules & Exceptions
Idea 1: Hypothesis = rule; encode exceptions.
Black box of size 25×25 = 625, white dots at (x1,y1), . . . ,(xk,yk).
For image of sizen= 625, there are 2n different images, and
n
k
= n!
k!(n−k)!
different groups ofk exceptions.
k= 100 :
n
100
≈9.5·10117 2625≈1.4·10188. Codelength log2(n+ 1) + log2
n
k
≈401 vs. log22625= 625
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Encoding Data: Rules & Exceptions
Idea 1: Hypothesis = rule; encode exceptions.
Black box of size 25×25 = 625, white dots at (x1,y1), . . . ,(xk,yk).
For image of sizen= 625, there are 2n different images, and
n
k
= n!
k!(n−k)!
different groups ofk exceptions.
k= 300 :
n
300
≈2.7·10186 <2625 ≈1.4·10188. Codelength log2(n+ 1) + log2
n
k
≈629 vs. log22625= 625
Jyrki Kivinen Information-Theoretic Modeling
Encoding Data: Rules & Exceptions
Idea 1: Hypothesis = rule; encode exceptions.
Black box of size 25×25 = 625, white dots at (x1,y1), . . . ,(xk,yk).
For image of sizen= 625, there are 2n different images, and
n
k
= n!
k!(n−k)!
different groups ofk exceptions.
k= 372 :
n
372
≈5.1·10181 2625≈1.4·10188. Codelength log2(n+ 1) + log2
n
k
≈613 vs. log22625= 625
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Encoding Data: Probabilistic Models
Idea 2: Hypothesis = probability distribution.
Noisy PSNR=19.8 MDL (A-B) PSNR=32.9
Rissanen & Shannon: log2 1 pθˆ(D) +k
2 log2n.
Jyrki Kivinen Information-Theoretic Modeling
Encoding Data: Probabilistic Models
Idea 2: Hypothesis = probability distribution.
Noisy PSNR=19.8 MDL (A-B) PSNR=32.9
Rissanen & Shannon: log2 1 pθˆ(D)+ k
2 log2n.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Polynomials
From P. Gr¨unwald
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Old-Style MDL
With the precision √1n the codelength for data is almost optimal:
min
θq∈{θ(1),θ(2),...}`θq(D) ≈ min
θ∈Θ`θ(D) = log2 1 pθˆ(D) .
“Steam MDL”
`θq(D) +`(θq)≈log2 1 pθˆ(D) +k
2log2n .
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Old-Style MDL
With the precision √1n the codelength for data is almost optimal:
min
θq∈{θ(1),θ(2),...}`θq(D) ≈ min
θ∈Θ`θ(D) = log2 1 pθˆ(D) . This gives the total codelength formula:
“Steam MDL”
`θq(D) +`(θq)≈log2 1 pθˆ(D) +k
2log2n .
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Old-Style MDL
The k2log2n formula is only a rough approximation, and works well only for very large samples.
likelihood, etc.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Old-Style MDL
The k2log2n formula is only a rough approximation, and works well only for very large samples.
MDL in the 21st century:
More advanced codes: mixtures, normalized maximum likelihood, etc.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL Model Selection
Perhaps the best known use for MDL is formodel class selection (which is often called just model selection).
We want to pick the one that seems best suited for our dataD. TypicallyM1⊂ · · · ⊂ Mm, so Mm always achieves the best fit to data, but may beoverfitting(see Introduction to machine
learning).
Therefore we use MDL and pickMi that minimizes the total code length.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL Model Selection
Perhaps the best known use for MDL is formodel class selection (which is often called just model selection).
Suppose we have a family of model classesM1, . . . ,Mm, each with their own parameter set Θ1, . . . ,Θm.
We want to pick the one that seems best suited for our dataD.
TypicallyM1⊂ · · · ⊂ Mm, so Mm always achieves the best fit to data, but may beoverfitting(see Introduction to machine
learning).
Therefore we use MDL and pickMi that minimizes the total code length.
Jyrki Kivinen Information-Theoretic Modeling
MDL Model Selection
Perhaps the best known use for MDL is formodel class selection (which is often called just model selection).
Suppose we have a family of model classesM1, . . . ,Mm, each with their own parameter set Θ1, . . . ,Θm.
We want to pick the one that seems best suited for our dataD.
TypicallyM1⊂ · · · ⊂ Mm, so Mm always achieves the best fit to data, but may beoverfitting(see Introduction to machine
learning).
Therefore we use MDL and pickMi that minimizes the total code length.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL Model Selection
Actually here we have a “three-part” code:
1 Encoding of the model class: `(Mi), i ∈ {1, . . . ,m}.
2 Encoding of the parameter (vector): `1(θ), θ ∈Θi.
3 Encoding of the data: log2 1
pθ(D), D∈ D.
If we have a finite family ofm model classes, we can do Part 1 with`(Mi) = log2m for alli.
This can be generalized to infinite families of model classes, in which case we often to pickp which is “as uniform as possible” overNand`(Mi) = log2(1/p(i)).
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL Model Selection
Actually here we have a “three-part” code:
1 Encoding of the model class: `(Mi), i ∈ {1, . . . ,m}.
3 Encoding of the data: log2
pθ(D), D∈ D.
If we have a finite family ofm model classes, we can do Part 1 with`(Mi) = log2m for alli.
This can be generalized to infinite families of model classes, in which case we often to pickp which is “as uniform as possible” overNand`(Mi) = log2(1/p(i)).
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL Model Selection
Actually here we have a “three-part” code:
1 Encoding of the model class: `(Mi), i ∈ {1, . . . ,m}.
2 Encoding of the parameter (vector): `1(θ), θ ∈Θi.
3 Encoding of the data: log2 1
pθ(D), D∈ D.
If we have a finite family ofm model classes, we can do Part 1 with`(Mi) = log2m for alli.
This can be generalized to infinite families of model classes, in which case we often to pickp which is “as uniform as possible” overNand`(Mi) = log2(1/p(i)).
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL Model Selection
Actually here we have a “three-part” code:
1 Encoding of the model class: `(Mi), i ∈ {1, . . . ,m}.
2 Encoding of the parameter (vector): `1(θ), θ ∈Θi.
3 Encoding of the data: log2 1
pθ(D), D ∈ D.
This can be generalized to infinite families of model classes, in which case we often to pickp which is “as uniform as possible” overNand`(Mi) = log2(1/p(i)).
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL Model Selection
Actually here we have a “three-part” code:
1 Encoding of the model class: `(Mi), i ∈ {1, . . . ,m}.
2 Encoding of the parameter (vector): `1(θ), θ ∈Θi.
3 Encoding of the data: log2 1
pθ(D), D ∈ D.
If we have a finite family ofm model classes, we can do Part 1 with`(Mi) = log2mfor all i.
This can be generalized to infinite families of model classes, in which case we often to pickp which is “as uniform as possible”
overNand`(Mi) = log2(1/p(i)).
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL Model Selection
If we are interested in choosing a model class (and not the
parameters), we can improve parts 2 & 3 by combining them into a better universal code than two-part:
Mi Mi
universal code-length (e.g., mixture, NML) based on model classMi.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL Model Selection
If we are interested in choosing a model class (and not the
parameters), we can improve parts 2 & 3 by combining them into a better universal code than two-part:
1 Encoding of the model class index: `(Mi), i ∈N.
2 Encoding of the data: `Mi(D), D∈ D, where`Mi is a universal code-length (e.g., mixture, NML) based on model classMi.
Jyrki Kivinen Information-Theoretic Modeling
MDL Model Selection
If we are interested in choosing a model class (and not the
parameters), we can improve parts 2 & 3 by combining them into a better universal code than two-part:
1 Encoding of the model class index: `(Mi), i ∈N.
2 Encoding of the data: `Mi(D), D∈ D, where`Mi is a universal code-length (e.g., mixture, NML) based on model classMi.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL Model Selection
MDL Explanation of MDL
The success in extracting the structure from data can be measured by the codelength.
In practice, we only find the structure that is “visible” to the used model class(es). For instance, the Bernoulli (coin flipping) model only sees the number of 1s.
Jyrki Kivinen Information-Theoretic Modeling
MDL Model Selection
MDL Explanation of MDL
The success in extracting the structure from data can be measured by the codelength.
In practice, we only find the structure that is “visible” to the used model class(es). For instance, the Bernoulli (coin flipping) model only sees the number of1s.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL & Bayes
The MDL model selection criterion
minimize`(θ) +`θ(D) can be interpreted (viap = 2−`) as
maximizep(θ)·pθ(D) .
In Bayesian probability, this is equivalent tomaximization of posterior probability:
p(θ|D) = p(θ)p(D|θ) p(D) ,
where the termp(D) (the marginal probability ofD) is constant wrt.θ and doesn’t affect model selection.
⇒Probabilistic Modelling
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL & Bayes
The MDL model selection criterion
minimize`(θ) +`θ(D) can be interpreted (viap = 2−`) as
maximizep(θ)·pθ(D) .
In Bayesian probability, this is equivalent tomaximization of posterior probability:
p(θ|D) = p(θ)p(D |θ) p(D) ,
where the termp(D) (the marginal probability ofD) is constant wrt.θ and doesn’t affect model selection.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
MDL & Bayes
The MDL model selection criterion
minimize`(θ) +`θ(D) can be interpreted (viap = 2−`) as
maximizep(θ)·pθ(D) .
In Bayesian probability, this is equivalent tomaximization of posterior probability:
p(θ|D) = p(θ)p(D |θ) p(D) ,
where the termp(D) (the marginal probability ofD) is constant wrt.θ and doesn’t affect model selection.
⇒Probabilistic Modelling
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Example: Denoising
Complexity = Information + Noise
= Regularity + Randomness
= Algorithm + Compressed file
The MDL principle gives a natural method for denoising since the very idea of MDL is to separate the total complexity of a signal into information and noise.
First encode a smooth signal (information), and then the difference to the observed signal (noise).
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Example: Denoising
Complexity = Information + Noise
= Regularity + Randomness
= Algorithm + Compressed file
Denoisingmeans the process of removing noise from a signal.
The MDL principle gives a natural method for denoising since the very idea of MDL is to separate the total complexity of a signal into information and noise.
First encode a smooth signal (information), and then the difference to the observed signal (noise).
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Example: Denoising
Complexity = Information + Noise
= Regularity + Randomness
= Algorithm + Compressed file
Denoisingmeans the process of removing noise from a signal.
The MDL principle gives a natural method for denoising since the very idea of MDL is to separate the total complexity of a signal into information and noise.
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Example: Denoising
Complexity = Information + Noise
= Regularity + Randomness
= Algorithm + Compressed file
Denoisingmeans the process of removing noise from a signal.
The MDL principle gives a natural method for denoising since the very idea of MDL is to separate the total complexity of a signal into information and noise.
First encode a smooth signal (information), and then the difference to the observed signal (noise).
Jyrki Kivinen Information-Theoretic Modeling
Example: Denoising
Noisy PSNR=19.8 MDL (A-B) PSNR=32.9
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Example: Denoising
Jyrki Kivinen Information-Theoretic Modeling
Example: Denoising
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Example: Denoising
Jyrki Kivinen Information-Theoretic Modeling
Example: Denoising
Jyrki Kivinen Information-Theoretic Modeling
Outline Occam’s Razor MDL Principle
Rules & Exceptions Probabilistic Models Old-Style MDL Modern MDL
Example: Denoising
Jyrki Kivinen Information-Theoretic Modeling
Coming next
We’ll see some more MDL:
MDL on continuous data
examples of MDL in applications.
Then we move on toKolmogorov complexity.
Jyrki Kivinen Information-Theoretic Modeling