Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Information-Theoretic Modeling
Lecture 8: Universal Source Coding
Jyrki Kivinen
Department of Computer Science, University of Helsinki
Autumn 2012
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Lecture 8: Universal Source Coding
Moline Universal Model D, Little Casterton Working Weekend, 2006.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
1 Universal Source Codes Definitions
Universal Models
2 Two-Part Codes Discrete Parameters Continuous Parameters Asymptotics: k2logn
3 Advanced Universal Codes Mixture Codes
Normalized Maximum Likelihood Universal Prediction
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
1 Universal Source Codes Definitions
Universal Models
2 Two-Part Codes Discrete Parameters Continuous Parameters Asymptotics: k2logn
3 Advanced Universal Codes Mixture Codes
Normalized Maximum Likelihood Universal Prediction
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
1 Universal Source Codes Definitions
Universal Models
2 Two-Part Codes Discrete Parameters Continuous Parameters Asymptotics: k2logn
3 Advanced Universal Codes Mixture Codes
Normalized Maximum Likelihood Universal Prediction
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
Our basic setting is that we have somedata D = (x1, . . . ,xm) where the individual data pointsxi come from some domain X.
We writeDfor the set of all possible data. A typical situation is D=Xnwhere n may or may not be known in advance.
A probability distributionp overD is called amodel. A set of modelsM is called amodel class.
Model classes are oftenparametric: M={pθ |θ∈Θ} where Θ⊆Rk for somek andpθ is a model for each θ∈Θ.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
Our basic setting is that we have somedata D = (x1, . . . ,xm) where the individual data pointsxi come from some domain X. We writeDfor the set of all possible data. A typical situation is D=Xn where n may or may not be known in advance.
A probability distributionp overD is called amodel. A set of modelsM is called amodel class.
Model classes are oftenparametric: M={pθ |θ∈Θ} where Θ⊆Rk for somek andpθ is a model for each θ∈Θ.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
Our basic setting is that we have somedata D = (x1, . . . ,xm) where the individual data pointsxi come from some domain X. We writeDfor the set of all possible data. A typical situation is D=Xn where n may or may not be known in advance.
A probability distributionp overD is called amodel.
A set of modelsM is called amodel class.
Model classes are oftenparametric: M={pθ |θ∈Θ} where Θ⊆Rk for somek andpθ is a model for each θ∈Θ.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
Our basic setting is that we have somedata D = (x1, . . . ,xm) where the individual data pointsxi come from some domain X. We writeDfor the set of all possible data. A typical situation is D=Xn where n may or may not be known in advance.
A probability distributionp overD is called amodel.
A set of modelsM is called amodel class.
Model classes are oftenparametric: M={pθ |θ∈Θ} where Θ⊆Rk for somek andpθ is a model for each θ∈Θ.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
Our basic setting is that we have somedata D = (x1, . . . ,xm) where the individual data pointsxi come from some domain X. We writeDfor the set of all possible data. A typical situation is D=Xn where n may or may not be known in advance.
A probability distributionp overD is called amodel.
A set of modelsM is called amodel class.
Model classes are oftenparametric: M={pθ |θ∈Θ} where Θ⊆Rk for somek andpθ is a model for each θ∈Θ.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
ExampleLet pµ,σ2 be the normal distribution overX =Rwith meanµand varianceσ2.
We have a parametric familyM={pθ |θ∈Θ}where Θ =
(µ, σ2)∈R2 |σ2>0 .
We can extendpµ,σ2 into a distributionpµ,σ(n)2 overD=Rn by assuming independence: pµ,σ(n)2(x1, . . . ,xn) =pµ,σ2(x1). . .pµ,σ2(xn). We often abuse notation by just writingpθ(x1, . . . ,xn) instead of pθ(n)(x1, . . . ,xn).
However, keep in mind that we may also havep overD that does not satisfy the independence assumption.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
ExampleLet pµ,σ2 be the normal distribution overX =Rwith meanµand varianceσ2.
We have a parametric familyM={pθ |θ∈Θ}where Θ =
(µ, σ2)∈R2 |σ2>0 .
We can extendpµ,σ2 into a distributionpµ,σ(n)2 overD=Rn by assuming independence: pµ,σ(n)2(x1, . . . ,xn) =pµ,σ2(x1). . .pµ,σ2(xn). We often abuse notation by just writingpθ(x1, . . . ,xn) instead of pθ(n)(x1, . . . ,xn).
However, keep in mind that we may also havep overD that does not satisfy the independence assumption.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
ExampleLet pµ,σ2 be the normal distribution overX =Rwith meanµand varianceσ2.
We have a parametric familyM={pθ |θ∈Θ}where Θ =
(µ, σ2)∈R2 |σ2>0 .
We can extendpµ,σ2 into a distributionpµ,σ(n)2 overD=Rn by assuming independence: pµ,σ(n)2(x1, . . . ,xn) =pµ,σ2(x1). . .pµ,σ2(xn).
We often abuse notation by just writingpθ(x1, . . . ,xn) instead of pθ(n)(x1, . . . ,xn).
However, keep in mind that we may also havep overD that does not satisfy the independence assumption.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
ExampleLet pµ,σ2 be the normal distribution overX =Rwith meanµand varianceσ2.
We have a parametric familyM={pθ |θ∈Θ}where Θ =
(µ, σ2)∈R2 |σ2>0 .
We can extendpµ,σ2 into a distributionpµ,σ(n)2 overD=Rn by assuming independence: pµ,σ(n)2(x1, . . . ,xn) =pµ,σ2(x1). . .pµ,σ2(xn).
We often abuse notation by just writingpθ(x1, . . . ,xn) instead of pθ(n)(x1, . . . ,xn).
However, keep in mind that we may also havep overD that does not satisfy the independence assumption.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
ExampleLet pµ,σ2 be the normal distribution overX =Rwith meanµand varianceσ2.
We have a parametric familyM={pθ |θ∈Θ}where Θ =
(µ, σ2)∈R2 |σ2>0 .
We can extendpµ,σ2 into a distributionpµ,σ(n)2 overD=Rn by assuming independence: pµ,σ(n)2(x1, . . . ,xn) =pµ,σ2(x1). . .pµ,σ2(xn).
We often abuse notation by just writingpθ(x1, . . . ,xn) instead of pθ(n)(x1, . . . ,xn).
However, keep in mind that we may also havep overD that does not satisfy the independence assumption.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Information-theoretic modeling?
In what follows, it’s important to keep in mind that we don’t claim that we can find a “true” modelp that “really” generated the data D, or even that such a “true” model exists.
However, keeping in mind how codes and distributions are related, it seems reasonable to think that
If a code based on modelp is good at compressingD, then perhaps studyingp can tell us something useful aboutD.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Information-theoretic modeling?
In what follows, it’s important to keep in mind that we don’t claim that we can find a “true” modelp that “really” generated the data D, or even that such a “true” model exists.
However, keeping in mind how codes and distributions are related, it seems reasonable to think that
If a code based on modelp is good at compressingD, then perhaps studyingp can tell us something useful aboutD.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Information-theoretic modeling?
In what follows, it’s important to keep in mind that we don’t claim that we can find a “true” modelp that “really” generated the data D, or even that such a “true” model exists.
However, keeping in mind how codes and distributions are related, it seems reasonable to think that
If a code based on modelp is good at compressingD, then perhaps studyingp can tell us something useful aboutD.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
The model withinMthat achieves the shortest code-length for datax is themaximum likelihood (ML) model:
minθ∈Θlog2 1
pθ(D) = log2 1 pθˆ(D) .
Depends onD!
For modelq, the excess code-length or “regret” over the ML model inMis given by
log2 1
q(D) −log2 1 pθˆ(D) .
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
The model withinMthat achieves the shortest code-length for datax is themaximum likelihood (ML) model:
minθ∈Θlog2 1
pθ(D) = log2 1
pθˆ(D) . Depends onD!
For modelq, the excess code-length or “regret” over the ML model inMis given by
log2 1
q(D) −log2 1 pθˆ(D) .
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Definitions
The model withinMthat achieves the shortest code-length for datax is themaximum likelihood (ML) model:
minθ∈Θlog2 1
pθ(D) = log2 1
pθˆ(D) . Depends onD!
For modelq, the excess code-length or “regret” over the ML model inMis given by
log2 1
q(D) −log2 1 pθˆ(D) .
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
log2 1
pθˆ(D) ≤log2 1 pθ(D)
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
−log2 1
pθˆ(D) ≥ −log2 1 pθ(D)
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
log2 1
q(D) −log2 1
pθˆ(D) ≥log2 1
q(D) −log2 1 pθ(D)
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
ED∼pθ
log2 1
q(D) −log2 1 pθˆ(D)
≥ED∼pθ
log2 1
q(D) −log2 1 pθ(D)
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
ED∼pθ
log2 1
q(D) −log2 1 pθˆ(D)
≥ED∼pθ
log2 1 q(D)
−ED∼pθ
log2 1 pθ(D)
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
ED∼pθ
log2 1
q(D) −log2 1 pθˆ(D)
≥ED∼pθ
log2 1 q(D)
−X
D
pθ(D) log2 1 pθ(D)
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
ED∼pθ
log2 1
q(D) −log2 1 pθˆ(D)
≥ED∼pθ
log2 1 q(D)
−H(pθ)
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
ED∼pθ
log2 1
q(D) −log2 1 pθˆ(D)
≥ED∼pθ
log2 1 q(D)
−nH(p(1)θ )
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
1 nED∼pθ
log2 1
q(D) −log2 1 pθˆ(D)
≥ 1 nED∼pθ
log2 1 q(D)
−H(p(1)θ )
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
n→∞lim 1 nED∼pθ
log2 1
q(D) −log2 1 pθˆ(D)
≥ lim
n→∞
1 nED∼pθ
log2 1 q(D)
−H(p(1)θ )
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
0≥ lim
n→∞
1 nED∼pθ
log2 1 q(D)
−H(p(1)θ )
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
n→∞lim 1 nED∼pθ
log2 1 q(D)
≤H(p(1)θ )
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
n→∞lim 1 nED∼pθ
log2 1 q(D)
=H(p(1)θ ) (2)
This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
Universal model
A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:
n→∞lim 1 n max
D∈D
log2 1
q(D) −log2 1 pθˆ(D)
= 0 . (1)
n→∞lim 1 nED∼pθ
log2 1 q(D)
=H(p(1)θ ) (2) This is another (stochastic) definition of universality, equivalent to
1
nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
The typical situation might be as follows:
1 We know (think) that the source symbols are generated by a Bernoulli model with parameter p∈[0,1].
2 We’d like to encode data at rateH(p).
3 However, we do not knowp in advance.
Again, we don’t need to believe that data arereally generated by a Bernoulli model.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
The typical situation might be as follows:
1 We know (think) that the source symbols are generated by a Bernoulli model with parameter p∈[0,1].
2 We’d like to encode data at rateH(p).
3 However, we do not knowp in advance.
Again, we don’t need to believe that data arereally generated by a Bernoulli model.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
The typical situation might be as follows:
1 We know (think) that the source symbols are generated by a Bernoulli model with parameter p∈[0,1].
2 We’d like to encode data at rateH(p).
3 However, we do not knowp in advance.
Again, we don’t need to believe that data arereally generated by a Bernoulli model.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
The typical situation might be as follows:
1 We know (think) that the source symbols are generated by a Bernoulli model with parameter p∈[0,1].
2 We’d like to encode data at rateH(p).
3 However, we do not knowp in advance.
Again, we don’t need to believe that data arereally generated by a Bernoulli model.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Definitions Universal Models
Universal models
The typical situation might be as follows:
1 We know (think) that the source symbols are generated by a Bernoulli model with parameter p∈[0,1].
2 We’d like to encode data at rateH(p).
3 However, we do not knowp in advance.
Again, we don’t need to believe that data arereally generated by a Bernoulli model.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
1 Universal Source Codes Definitions
Universal Models
2 Two-Part Codes Discrete Parameters Continuous Parameters Asymptotics: k2logn
3 Advanced Universal Codes Mixture Codes
Normalized Maximum Likelihood Universal Prediction
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Two-Part Codes
LetM={pθ : θ∈Θ}be a parametric probabilistic model class.
If the parameter space Θ is discrete, we can construct a (prefix) codeC1 : Θ→ {0,1}∗ which maps each parameter value to a codeword of length`1(θ).
For any distributionpθ, the Shannon code-lengths satisfy
`θ(D) =
log2 1 pθ(D)
≈log2 1 pθ(D) .
Using parameter valueθ, the total code-length becomes (≈)
`1(θ) + log2 1 pθ(D) .
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Two-Part Codes
LetM={pθ : θ∈Θ}be a parametric probabilistic model class.
If the parameter space Θ is discrete, we can construct a (prefix) codeC1 : Θ→ {0,1}∗ which maps each parameter value to a codeword of length`1(θ).
For any distributionpθ, the Shannon code-lengths satisfy
`θ(D) =
log2 1 pθ(D)
≈log2 1 pθ(D) .
Using parameter valueθ, the total code-length becomes (≈)
`1(θ) + log2 1 pθ(D) .
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Two-Part Codes
LetM={pθ : θ∈Θ}be a parametric probabilistic model class.
If the parameter space Θ is discrete, we can construct a (prefix) codeC1 : Θ→ {0,1}∗ which maps each parameter value to a codeword of length`1(θ).
For any distributionpθ, the Shannon code-lengths satisfy
`θ(D) =
log2 1 pθ(D)
≈log2 1 pθ(D) .
Using parameter valueθ, the total code-length becomes (≈)
`1(θ) + log2 1 pθ(D) .
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Two-Part Codes
LetM={pθ : θ∈Θ}be a parametric probabilistic model class.
If the parameter space Θ is discrete, we can construct a (prefix) codeC1 : Θ→ {0,1}∗ which maps each parameter value to a codeword of length`1(θ).
For any distributionpθ, the Shannon code-lengths satisfy
`θ(D) =
log2 1 pθ(D)
≈log2 1 pθ(D) .
Using parameter valueθ, the total code-length becomes (≈)
`1(θ) + log2 1 pθ(D) .
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Two-Part Codes
Using the maximum likelihood parameter, the total code-length becomes
`two-part(D) =`1(ˆθ) + log2 1 pθˆ(D) .
Hence, the regretof the two-part code is
`two-part(D)−log2 1
pθˆ(D) =`1(ˆθ)
<cn for all c >0 and large n.
For discrete parameter modelsthe two-part code is universal.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Two-Part Codes
Using the maximum likelihood parameter, the total code-length becomes
`two-part(D) =`1(ˆθ) + log2 1 pθˆ(D) . Hence, theregret of the two-part code is
`two-part(D)−log2 1
pθˆ(D) =`1(ˆθ)
<cn for all c >0 and large n.
For discrete parameter modelsthe two-part code is universal.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Two-Part Codes
Using the maximum likelihood parameter, the total code-length becomes
`two-part(D) =`1(ˆθ) + log2 1 pθˆ(D) . Hence, theregret of the two-part code is
`two-part(D)−log2 1
pθˆ(D) =`1(ˆθ)<cn for all c >0 and large n.
For discrete parameter modelsthe two-part code is universal.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Two-Part Codes
Using the maximum likelihood parameter, the total code-length becomes
`two-part(D) =`1(ˆθ) + log2 1 pθˆ(D) . Hence, theregret of the two-part code is
`two-part(D)−log2 1
pθˆ(D) =`1(ˆθ)<cn for all c >0 and large n.
For discrete parameter modelsthe two-part code is universal.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Universality of Two-Part Codes
However, keep in mind that universality is not everything.
Since the two-part code is universal, its regret goes to zero, but there may be other codes for which regret goes to zerofaster. On the other hand, two-part codes have the advantage of being reasonably easy to understand.
Often they are also efficiently computable.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Universality of Two-Part Codes
However, keep in mind that universality is not everything.
Since the two-part code is universal, its regret goes to zero, but there may be other codes for which regret goes to zerofaster.
On the other hand, two-part codes have the advantage of being reasonably easy to understand.
Often they are also efficiently computable.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Universality of Two-Part Codes
However, keep in mind that universality is not everything.
Since the two-part code is universal, its regret goes to zero, but there may be other codes for which regret goes to zerofaster.
On the other hand, two-part codes have the advantage of being reasonably easy to understand.
Often they are also efficiently computable.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Universality of Two-Part Codes
However, keep in mind that universality is not everything.
Since the two-part code is universal, its regret goes to zero, but there may be other codes for which regret goes to zerofaster.
On the other hand, two-part codes have the advantage of being reasonably easy to understand.
Often they are also efficiently computable.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Continuous Parameters
What if the parameters are continuous (like polynomial
coefficients)? We can’t encode all continuous values with finite code-lengths!
Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.
Information Geometry!
If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Continuous Parameters
What if the parameters are continuous (like polynomial
coefficients)? We can’t encode all continuous values with finite code-lengths!
Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.
Information Geometry!
If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Continuous Parameters
What if the parameters are continuous (like polynomial
coefficients)? We can’t encode all continuous values with finite code-lengths!
Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.
Θ
Information Geometry!
If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Continuous Parameters
What if the parameters are continuous (like polynomial
coefficients)? We can’t encode all continuous values with finite code-lengths!
Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.
Θ
Information Geometry!
If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Continuous Parameters
What if the parameters are continuous (like polynomial
coefficients)? We can’t encode all continuous values with finite code-lengths!
Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.
Θ
Information Geometry!
If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Continuous Parameters
What if the parameters are continuous (like polynomial
coefficients)? We can’t encode all continuous values with finite code-lengths!
Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.
Θ
Information Geometry!
If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Continuous Parameters
What if the parameters are continuous (like polynomial
coefficients)? We can’t encode all continuous values with finite code-lengths!
Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.
Θ
Information Geometry!
If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Continuous Parameters
What if the parameters are continuous (like polynomial
coefficients)? We can’t encode all continuous values with finite code-lengths!
Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.
Θ
Information Geometry!
If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Continuous Parameters
What if the parameters are continuous (like polynomial
coefficients)? We can’t encode all continuous values with finite code-lengths!
Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.
Θ
Information Geometry!
If the points are sufficientlydense(in a code-length sense) then the
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
About Quantization
How many points should there be in the subsetθ(1), θ(2), . . .?
Intuition: Data does not allow us to tell apart θ1 andθ2 if
|θ1−θ2|<c 1
√n. ⇒ Don’t care about higher precision. Theorem
Optimal quantization accuracy is of order 1
√n.
⇒ number of points ≈√
nk =nk/2, wherek =dim(Θ).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
About Quantization
How many points should there be in the subsetθ(1), θ(2), . . .?
Intuition: Data does not allow us to tell apart θ1 and θ2 if
|θ1−θ2|<c 1
√n. ⇒ Don’t care about higher precision.
Theorem
Optimal quantization accuracy is of order 1
√n.
⇒ number of points ≈√
nk =nk/2, wherek =dim(Θ).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
About Quantization
How many points should there be in the subsetθ(1), θ(2), . . .?
Intuition: Data does not allow us to tell apart θ1 and θ2 if
|θ1−θ2|<c 1
√n. ⇒ Don’t care about higher precision.
Theorem
Optimal quantization accuracy is of order 1
√n.
⇒ number of points ≈√
nk =nk/2, wherek =dim(Θ).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
About Quantization
How many points should there be in the subsetθ(1), θ(2), . . .?
Intuition: Data does not allow us to tell apart θ1 and θ2 if
|θ1−θ2|<c 1
√n. ⇒ Don’t care about higher precision.
Theorem
Optimal quantization accuracy is of order 1
√n.
⇒ number of points ≈√
nk =nk/2, wherek =dim(Θ).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
About Quantization
How many points should there be in the subsetθ(1), θ(2), . . .?
Intuition: Data does not allow us to tell apart θ1 and θ2 if
|θ1−θ2|<c 1
√n. ⇒ Don’t care about higher precision.
Theorem
Optimal quantization accuracy is of order 1
√n.
⇒ number of points ≈√
nk =nk/2, wherek =dim(Θ).
The code-length for the quantized parameters becomes
`(θq)≈log2nk/2 = k
2log2n .
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Asymptotics:
k2log n
With the precision √1n the code-length for data is almost optimal:
min
θq∈{θ(1),θ(2),...}`θq(D) ≈ min
θ∈Θ`θ(D) = log2 1 pθˆ(D) .
The total code-length becomes then (≈) log2 1
pθˆ(D) +k
2 log2n , so that the regret is k
2 log2n.
Since log2n grows slower thann, thetwo-part code is universal also for continuous parameter models.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Asymptotics:
k2log n
With the precision √1n the code-length for data is almost optimal:
min
θq∈{θ(1),θ(2),...}`θq(D) ≈ min
θ∈Θ`θ(D) = log2 1 pθˆ(D) . The total code-length becomes then (≈)
log2 1 pθˆ(D) +k
2 log2n , so that the regret is k
2 log2n.
Since log2n grows slower thann, thetwo-part code is universal also for continuous parameter models.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Discrete Parameters Continuous Parameters Asymptotics: k2logn
Asymptotics:
k2log n
With the precision √1n the code-length for data is almost optimal:
min
θq∈{θ(1),θ(2),...}`θq(D) ≈ min
θ∈Θ`θ(D) = log2 1 pθˆ(D) . The total code-length becomes then (≈)
log2 1 pθˆ(D) +k
2 log2n , so that the regret is k
2 log2n.
Since log2n grows slower thann, thetwo-part code is universal also for continuous parameter models.
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Mixture Codes
Normalized Maximum Likelihood Universal Prediction
1 Universal Source Codes Definitions
Universal Models
2 Two-Part Codes Discrete Parameters Continuous Parameters Asymptotics: k2logn
3 Advanced Universal Codes Mixture Codes
Normalized Maximum Likelihood Universal Prediction
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Mixture Codes
Normalized Maximum Likelihood Universal Prediction
Mixture Universal Model
There are universal codes that are strictly better than the two-part code.
For instance, given a uniquely decodable code for the parameters, letw be a distribution over the parameter space Θ (quantized if necessary) defined as
w(θ) = 2−`(θ)
c , wherec =X
θ∈Θ
2−`(θ) ≤1. Letpw be a mixture distribution over the data-sets D∈ D, defined as
pw(D) =X
θ∈Θ
pθ(D)w(θ) ,
i.e., an “average” distribution, where eachpθ is weighted byw(θ).
Outline Universal Source Codes Two-Part Codes Advanced Universal Codes
Mixture Codes
Normalized Maximum Likelihood Universal Prediction
Mixture Universal Model
There are universal codes that are strictly better than the two-part code.
For instance, given a uniquely decodable code for the parameters, letw be a distribution over the parameter space Θ (quantized if necessary) defined as
w(θ) = 2−`(θ)
c , wherec =X
θ∈Θ
2−`(θ) ≤1.
Letpw be a mixture distribution over the data-sets D∈ D, defined as
pw(D) =X
θ∈Θ
pθ(D)w(θ) ,
i.e., an “average” distribution, where eachpθ is weighted byw(θ).