• Ei tuloksia

Autumn2012 Lecture8:UniversalSourceCodingJyrkiKivinen Information-TheoreticModeling

N/A
N/A
Info
Lataa
Protected

Academic year: 2022

Jaa "Autumn2012 Lecture8:UniversalSourceCodingJyrkiKivinen Information-TheoreticModeling"

Copied!
106
0
0

Kokoteksti

(1)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Information-Theoretic Modeling

Lecture 8: Universal Source Coding

Jyrki Kivinen

Department of Computer Science, University of Helsinki

Autumn 2012

(2)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Lecture 8: Universal Source Coding

Moline Universal Model D, Little Casterton Working Weekend, 2006.

(3)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

1 Universal Source Codes Definitions

Universal Models

2 Two-Part Codes Discrete Parameters Continuous Parameters Asymptotics: k2logn

3 Advanced Universal Codes Mixture Codes

Normalized Maximum Likelihood Universal Prediction

(4)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

1 Universal Source Codes Definitions

Universal Models

2 Two-Part Codes Discrete Parameters Continuous Parameters Asymptotics: k2logn

3 Advanced Universal Codes Mixture Codes

Normalized Maximum Likelihood Universal Prediction

(5)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

1 Universal Source Codes Definitions

Universal Models

2 Two-Part Codes Discrete Parameters Continuous Parameters Asymptotics: k2logn

3 Advanced Universal Codes Mixture Codes

Normalized Maximum Likelihood Universal Prediction

(6)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

Our basic setting is that we have somedata D = (x1, . . . ,xm) where the individual data pointsxi come from some domain X.

We writeDfor the set of all possible data. A typical situation is D=Xnwhere n may or may not be known in advance.

A probability distributionp overD is called amodel. A set of modelsM is called amodel class.

Model classes are oftenparametric: M={pθ |θ∈Θ} where Θ⊆Rk for somek andpθ is a model for each θ∈Θ.

(7)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

Our basic setting is that we have somedata D = (x1, . . . ,xm) where the individual data pointsxi come from some domain X. We writeDfor the set of all possible data. A typical situation is D=Xn where n may or may not be known in advance.

A probability distributionp overD is called amodel. A set of modelsM is called amodel class.

Model classes are oftenparametric: M={pθ |θ∈Θ} where Θ⊆Rk for somek andpθ is a model for each θ∈Θ.

(8)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

Our basic setting is that we have somedata D = (x1, . . . ,xm) where the individual data pointsxi come from some domain X. We writeDfor the set of all possible data. A typical situation is D=Xn where n may or may not be known in advance.

A probability distributionp overD is called amodel.

A set of modelsM is called amodel class.

Model classes are oftenparametric: M={pθ |θ∈Θ} where Θ⊆Rk for somek andpθ is a model for each θ∈Θ.

(9)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

Our basic setting is that we have somedata D = (x1, . . . ,xm) where the individual data pointsxi come from some domain X. We writeDfor the set of all possible data. A typical situation is D=Xn where n may or may not be known in advance.

A probability distributionp overD is called amodel.

A set of modelsM is called amodel class.

Model classes are oftenparametric: M={pθ |θ∈Θ} where Θ⊆Rk for somek andpθ is a model for each θ∈Θ.

(10)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

Our basic setting is that we have somedata D = (x1, . . . ,xm) where the individual data pointsxi come from some domain X. We writeDfor the set of all possible data. A typical situation is D=Xn where n may or may not be known in advance.

A probability distributionp overD is called amodel.

A set of modelsM is called amodel class.

Model classes are oftenparametric: M={pθ |θ∈Θ} where Θ⊆Rk for somek andpθ is a model for each θ∈Θ.

(11)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

ExampleLet pµ,σ2 be the normal distribution overX =Rwith meanµand varianceσ2.

We have a parametric familyM={pθ |θ∈Θ}where Θ =

(µ, σ2)∈R22>0 .

We can extendpµ,σ2 into a distributionpµ,σ(n)2 overD=Rn by assuming independence: pµ,σ(n)2(x1, . . . ,xn) =pµ,σ2(x1). . .pµ,σ2(xn). We often abuse notation by just writingpθ(x1, . . . ,xn) instead of pθ(n)(x1, . . . ,xn).

However, keep in mind that we may also havep overD that does not satisfy the independence assumption.

(12)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

ExampleLet pµ,σ2 be the normal distribution overX =Rwith meanµand varianceσ2.

We have a parametric familyM={pθ |θ∈Θ}where Θ =

(µ, σ2)∈R22>0 .

We can extendpµ,σ2 into a distributionpµ,σ(n)2 overD=Rn by assuming independence: pµ,σ(n)2(x1, . . . ,xn) =pµ,σ2(x1). . .pµ,σ2(xn). We often abuse notation by just writingpθ(x1, . . . ,xn) instead of pθ(n)(x1, . . . ,xn).

However, keep in mind that we may also havep overD that does not satisfy the independence assumption.

(13)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

ExampleLet pµ,σ2 be the normal distribution overX =Rwith meanµand varianceσ2.

We have a parametric familyM={pθ |θ∈Θ}where Θ =

(µ, σ2)∈R22>0 .

We can extendpµ,σ2 into a distributionpµ,σ(n)2 overD=Rn by assuming independence: pµ,σ(n)2(x1, . . . ,xn) =pµ,σ2(x1). . .pµ,σ2(xn).

We often abuse notation by just writingpθ(x1, . . . ,xn) instead of pθ(n)(x1, . . . ,xn).

However, keep in mind that we may also havep overD that does not satisfy the independence assumption.

(14)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

ExampleLet pµ,σ2 be the normal distribution overX =Rwith meanµand varianceσ2.

We have a parametric familyM={pθ |θ∈Θ}where Θ =

(µ, σ2)∈R22>0 .

We can extendpµ,σ2 into a distributionpµ,σ(n)2 overD=Rn by assuming independence: pµ,σ(n)2(x1, . . . ,xn) =pµ,σ2(x1). . .pµ,σ2(xn).

We often abuse notation by just writingpθ(x1, . . . ,xn) instead of pθ(n)(x1, . . . ,xn).

However, keep in mind that we may also havep overD that does not satisfy the independence assumption.

(15)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

ExampleLet pµ,σ2 be the normal distribution overX =Rwith meanµand varianceσ2.

We have a parametric familyM={pθ |θ∈Θ}where Θ =

(µ, σ2)∈R22>0 .

We can extendpµ,σ2 into a distributionpµ,σ(n)2 overD=Rn by assuming independence: pµ,σ(n)2(x1, . . . ,xn) =pµ,σ2(x1). . .pµ,σ2(xn).

We often abuse notation by just writingpθ(x1, . . . ,xn) instead of pθ(n)(x1, . . . ,xn).

However, keep in mind that we may also havep overD that does not satisfy the independence assumption.

(16)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Information-theoretic modeling?

In what follows, it’s important to keep in mind that we don’t claim that we can find a “true” modelp that “really” generated the data D, or even that such a “true” model exists.

However, keeping in mind how codes and distributions are related, it seems reasonable to think that

If a code based on modelp is good at compressingD, then perhaps studyingp can tell us something useful aboutD.

(17)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Information-theoretic modeling?

In what follows, it’s important to keep in mind that we don’t claim that we can find a “true” modelp that “really” generated the data D, or even that such a “true” model exists.

However, keeping in mind how codes and distributions are related, it seems reasonable to think that

If a code based on modelp is good at compressingD, then perhaps studyingp can tell us something useful aboutD.

(18)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Information-theoretic modeling?

In what follows, it’s important to keep in mind that we don’t claim that we can find a “true” modelp that “really” generated the data D, or even that such a “true” model exists.

However, keeping in mind how codes and distributions are related, it seems reasonable to think that

If a code based on modelp is good at compressingD, then perhaps studyingp can tell us something useful aboutD.

(19)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

The model withinMthat achieves the shortest code-length for datax is themaximum likelihood (ML) model:

minθ∈Θlog2 1

pθ(D) = log2 1 pθˆ(D) .

Depends onD!

For modelq, the excess code-length or “regret” over the ML model inMis given by

log2 1

q(D) −log2 1 pθˆ(D) .

(20)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

The model withinMthat achieves the shortest code-length for datax is themaximum likelihood (ML) model:

minθ∈Θlog2 1

pθ(D) = log2 1

pθˆ(D) . Depends onD!

For modelq, the excess code-length or “regret” over the ML model inMis given by

log2 1

q(D) −log2 1 pθˆ(D) .

(21)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Definitions

The model withinMthat achieves the shortest code-length for datax is themaximum likelihood (ML) model:

minθ∈Θlog2 1

pθ(D) = log2 1

pθˆ(D) . Depends onD!

For modelq, the excess code-length or “regret” over the ML model inMis given by

log2 1

q(D) −log2 1 pθˆ(D) .

(22)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(23)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

log2 1

pθˆ(D) ≤log2 1 pθ(D)

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(24)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

−log2 1

pθˆ(D) ≥ −log2 1 pθ(D)

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(25)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

log2 1

q(D) −log2 1

pθˆ(D) ≥log2 1

q(D) −log2 1 pθ(D)

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(26)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

ED∼pθ

log2 1

q(D) −log2 1 pθˆ(D)

≥ED∼pθ

log2 1

q(D) −log2 1 pθ(D)

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(27)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

ED∼pθ

log2 1

q(D) −log2 1 pθˆ(D)

≥ED∼pθ

log2 1 q(D)

−ED∼pθ

log2 1 pθ(D)

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(28)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

ED∼pθ

log2 1

q(D) −log2 1 pθˆ(D)

≥ED∼pθ

log2 1 q(D)

−X

D

pθ(D) log2 1 pθ(D)

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(29)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

ED∼pθ

log2 1

q(D) −log2 1 pθˆ(D)

≥ED∼pθ

log2 1 q(D)

−H(pθ)

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(30)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

ED∼pθ

log2 1

q(D) −log2 1 pθˆ(D)

≥ED∼pθ

log2 1 q(D)

−nH(p(1)θ )

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(31)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

1 nED∼pθ

log2 1

q(D) −log2 1 pθˆ(D)

≥ 1 nED∼pθ

log2 1 q(D)

−H(p(1)θ )

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(32)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

n→∞lim 1 nED∼pθ

log2 1

q(D) −log2 1 pθˆ(D)

≥ lim

n→∞

1 nED∼pθ

log2 1 q(D)

−H(p(1)θ )

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(33)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

0≥ lim

n→∞

1 nED∼pθ

log2 1 q(D)

−H(p(1)θ )

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(34)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

n→∞lim 1 nED∼pθ

log2 1 q(D)

≤H(p(1)θ )

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(35)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

n→∞lim 1 nED∼pθ

log2 1 q(D)

=H(p(1)θ ) (2)

This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(36)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

Universal model

A model (code) whose regret grows slower thann, for all data sequences, is said to be auniversal model(code) relative to model classM:

n→∞lim 1 n max

D∈D

log2 1

q(D) −log2 1 pθˆ(D)

= 0 . (1)

n→∞lim 1 nED∼pθ

log2 1 q(D)

=H(p(1)θ ) (2) This is another (stochastic) definition of universality, equivalent to

1

nD(pθkq)→0 for all θ∈Θ. It is weaker since (1) ⇒ (2).

(37)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

The typical situation might be as follows:

1 We know (think) that the source symbols are generated by a Bernoulli model with parameter p∈[0,1].

2 We’d like to encode data at rateH(p).

3 However, we do not knowp in advance.

Again, we don’t need to believe that data arereally generated by a Bernoulli model.

(38)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

The typical situation might be as follows:

1 We know (think) that the source symbols are generated by a Bernoulli model with parameter p∈[0,1].

2 We’d like to encode data at rateH(p).

3 However, we do not knowp in advance.

Again, we don’t need to believe that data arereally generated by a Bernoulli model.

(39)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

The typical situation might be as follows:

1 We know (think) that the source symbols are generated by a Bernoulli model with parameter p∈[0,1].

2 We’d like to encode data at rateH(p).

3 However, we do not knowp in advance.

Again, we don’t need to believe that data arereally generated by a Bernoulli model.

(40)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

The typical situation might be as follows:

1 We know (think) that the source symbols are generated by a Bernoulli model with parameter p∈[0,1].

2 We’d like to encode data at rateH(p).

3 However, we do not knowp in advance.

Again, we don’t need to believe that data arereally generated by a Bernoulli model.

(41)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Definitions Universal Models

Universal models

The typical situation might be as follows:

1 We know (think) that the source symbols are generated by a Bernoulli model with parameter p∈[0,1].

2 We’d like to encode data at rateH(p).

3 However, we do not knowp in advance.

Again, we don’t need to believe that data arereally generated by a Bernoulli model.

(42)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

1 Universal Source Codes Definitions

Universal Models

2 Two-Part Codes Discrete Parameters Continuous Parameters Asymptotics: k2logn

3 Advanced Universal Codes Mixture Codes

Normalized Maximum Likelihood Universal Prediction

(43)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Two-Part Codes

LetM={pθ : θ∈Θ}be a parametric probabilistic model class.

If the parameter space Θ is discrete, we can construct a (prefix) codeC1 : Θ→ {0,1} which maps each parameter value to a codeword of length`1(θ).

For any distributionpθ, the Shannon code-lengths satisfy

`θ(D) =

log2 1 pθ(D)

≈log2 1 pθ(D) .

Using parameter valueθ, the total code-length becomes (≈)

`1(θ) + log2 1 pθ(D) .

(44)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Two-Part Codes

LetM={pθ : θ∈Θ}be a parametric probabilistic model class.

If the parameter space Θ is discrete, we can construct a (prefix) codeC1 : Θ→ {0,1} which maps each parameter value to a codeword of length`1(θ).

For any distributionpθ, the Shannon code-lengths satisfy

`θ(D) =

log2 1 pθ(D)

≈log2 1 pθ(D) .

Using parameter valueθ, the total code-length becomes (≈)

`1(θ) + log2 1 pθ(D) .

(45)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Two-Part Codes

LetM={pθ : θ∈Θ}be a parametric probabilistic model class.

If the parameter space Θ is discrete, we can construct a (prefix) codeC1 : Θ→ {0,1} which maps each parameter value to a codeword of length`1(θ).

For any distributionpθ, the Shannon code-lengths satisfy

`θ(D) =

log2 1 pθ(D)

≈log2 1 pθ(D) .

Using parameter valueθ, the total code-length becomes (≈)

`1(θ) + log2 1 pθ(D) .

(46)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Two-Part Codes

LetM={pθ : θ∈Θ}be a parametric probabilistic model class.

If the parameter space Θ is discrete, we can construct a (prefix) codeC1 : Θ→ {0,1} which maps each parameter value to a codeword of length`1(θ).

For any distributionpθ, the Shannon code-lengths satisfy

`θ(D) =

log2 1 pθ(D)

≈log2 1 pθ(D) .

Using parameter valueθ, the total code-length becomes (≈)

`1(θ) + log2 1 pθ(D) .

(47)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Two-Part Codes

Using the maximum likelihood parameter, the total code-length becomes

`two-part(D) =`1(ˆθ) + log2 1 pθˆ(D) .

Hence, the regretof the two-part code is

`two-part(D)−log2 1

pθˆ(D) =`1(ˆθ)

<cn for all c >0 and large n.

For discrete parameter modelsthe two-part code is universal.

(48)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Two-Part Codes

Using the maximum likelihood parameter, the total code-length becomes

`two-part(D) =`1(ˆθ) + log2 1 pθˆ(D) . Hence, theregret of the two-part code is

`two-part(D)−log2 1

pθˆ(D) =`1(ˆθ)

<cn for all c >0 and large n.

For discrete parameter modelsthe two-part code is universal.

(49)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Two-Part Codes

Using the maximum likelihood parameter, the total code-length becomes

`two-part(D) =`1(ˆθ) + log2 1 pθˆ(D) . Hence, theregret of the two-part code is

`two-part(D)−log2 1

pθˆ(D) =`1(ˆθ)<cn for all c >0 and large n.

For discrete parameter modelsthe two-part code is universal.

(50)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Two-Part Codes

Using the maximum likelihood parameter, the total code-length becomes

`two-part(D) =`1(ˆθ) + log2 1 pθˆ(D) . Hence, theregret of the two-part code is

`two-part(D)−log2 1

pθˆ(D) =`1(ˆθ)<cn for all c >0 and large n.

For discrete parameter modelsthe two-part code is universal.

(51)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Universality of Two-Part Codes

However, keep in mind that universality is not everything.

Since the two-part code is universal, its regret goes to zero, but there may be other codes for which regret goes to zerofaster. On the other hand, two-part codes have the advantage of being reasonably easy to understand.

Often they are also efficiently computable.

(52)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Universality of Two-Part Codes

However, keep in mind that universality is not everything.

Since the two-part code is universal, its regret goes to zero, but there may be other codes for which regret goes to zerofaster.

On the other hand, two-part codes have the advantage of being reasonably easy to understand.

Often they are also efficiently computable.

(53)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Universality of Two-Part Codes

However, keep in mind that universality is not everything.

Since the two-part code is universal, its regret goes to zero, but there may be other codes for which regret goes to zerofaster.

On the other hand, two-part codes have the advantage of being reasonably easy to understand.

Often they are also efficiently computable.

(54)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Universality of Two-Part Codes

However, keep in mind that universality is not everything.

Since the two-part code is universal, its regret goes to zero, but there may be other codes for which regret goes to zerofaster.

On the other hand, two-part codes have the advantage of being reasonably easy to understand.

Often they are also efficiently computable.

(55)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Continuous Parameters

What if the parameters are continuous (like polynomial

coefficients)? We can’t encode all continuous values with finite code-lengths!

Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.

Information Geometry!

If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).

(56)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Continuous Parameters

What if the parameters are continuous (like polynomial

coefficients)? We can’t encode all continuous values with finite code-lengths!

Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.

Information Geometry!

If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).

(57)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Continuous Parameters

What if the parameters are continuous (like polynomial

coefficients)? We can’t encode all continuous values with finite code-lengths!

Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).

(58)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Continuous Parameters

What if the parameters are continuous (like polynomial

coefficients)? We can’t encode all continuous values with finite code-lengths!

Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).

(59)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Continuous Parameters

What if the parameters are continuous (like polynomial

coefficients)? We can’t encode all continuous values with finite code-lengths!

Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).

(60)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Continuous Parameters

What if the parameters are continuous (like polynomial

coefficients)? We can’t encode all continuous values with finite code-lengths!

Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).

(61)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Continuous Parameters

What if the parameters are continuous (like polynomial

coefficients)? We can’t encode all continuous values with finite code-lengths!

Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).

(62)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Continuous Parameters

What if the parameters are continuous (like polynomial

coefficients)? We can’t encode all continuous values with finite code-lengths!

Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficientlydense(in a code-length sense) then the code-length for data is still almost as short as minθ∈Θ`θ(D).

(63)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Continuous Parameters

What if the parameters are continuous (like polynomial

coefficients)? We can’t encode all continuous values with finite code-lengths!

Solution: Quantization. Choose a discrete subset of points, θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficientlydense(in a code-length sense) then the

(64)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

About Quantization

How many points should there be in the subsetθ(1), θ(2), . . .?

Intuition: Data does not allow us to tell apart θ1 andθ2 if

1−θ2|<c 1

√n. ⇒ Don’t care about higher precision. Theorem

Optimal quantization accuracy is of order 1

√n.

⇒ number of points ≈√

nk =nk/2, wherek =dim(Θ).

(65)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

About Quantization

How many points should there be in the subsetθ(1), θ(2), . . .?

Intuition: Data does not allow us to tell apart θ1 and θ2 if

1−θ2|<c 1

√n. ⇒ Don’t care about higher precision.

Theorem

Optimal quantization accuracy is of order 1

√n.

⇒ number of points ≈√

nk =nk/2, wherek =dim(Θ).

(66)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

About Quantization

How many points should there be in the subsetθ(1), θ(2), . . .?

Intuition: Data does not allow us to tell apart θ1 and θ2 if

1−θ2|<c 1

√n. ⇒ Don’t care about higher precision.

Theorem

Optimal quantization accuracy is of order 1

√n.

⇒ number of points ≈√

nk =nk/2, wherek =dim(Θ).

(67)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

About Quantization

How many points should there be in the subsetθ(1), θ(2), . . .?

Intuition: Data does not allow us to tell apart θ1 and θ2 if

1−θ2|<c 1

√n. ⇒ Don’t care about higher precision.

Theorem

Optimal quantization accuracy is of order 1

√n.

⇒ number of points ≈√

nk =nk/2, wherek =dim(Θ).

(68)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

About Quantization

How many points should there be in the subsetθ(1), θ(2), . . .?

Intuition: Data does not allow us to tell apart θ1 and θ2 if

1−θ2|<c 1

√n. ⇒ Don’t care about higher precision.

Theorem

Optimal quantization accuracy is of order 1

√n.

⇒ number of points ≈√

nk =nk/2, wherek =dim(Θ).

The code-length for the quantized parameters becomes

`(θq)≈log2nk/2 = k

2log2n .

(69)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Asymptotics:

k2

log n

With the precision 1n the code-length for data is almost optimal:

min

θq∈{θ(1)(2),...}`θq(D) ≈ min

θ∈Θ`θ(D) = log2 1 pθˆ(D) .

The total code-length becomes then (≈) log2 1

pθˆ(D) +k

2 log2n , so that the regret is k

2 log2n.

Since log2n grows slower thann, thetwo-part code is universal also for continuous parameter models.

(70)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Asymptotics:

k2

log n

With the precision 1n the code-length for data is almost optimal:

min

θq∈{θ(1)(2),...}`θq(D) ≈ min

θ∈Θ`θ(D) = log2 1 pθˆ(D) . The total code-length becomes then (≈)

log2 1 pθˆ(D) +k

2 log2n , so that the regret is k

2 log2n.

Since log2n grows slower thann, thetwo-part code is universal also for continuous parameter models.

(71)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Discrete Parameters Continuous Parameters Asymptotics: k2logn

Asymptotics:

k2

log n

With the precision 1n the code-length for data is almost optimal:

min

θq∈{θ(1)(2),...}`θq(D) ≈ min

θ∈Θ`θ(D) = log2 1 pθˆ(D) . The total code-length becomes then (≈)

log2 1 pθˆ(D) +k

2 log2n , so that the regret is k

2 log2n.

Since log2n grows slower thann, thetwo-part code is universal also for continuous parameter models.

(72)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Mixture Codes

Normalized Maximum Likelihood Universal Prediction

1 Universal Source Codes Definitions

Universal Models

2 Two-Part Codes Discrete Parameters Continuous Parameters Asymptotics: k2logn

3 Advanced Universal Codes Mixture Codes

Normalized Maximum Likelihood Universal Prediction

(73)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Mixture Codes

Normalized Maximum Likelihood Universal Prediction

Mixture Universal Model

There are universal codes that are strictly better than the two-part code.

For instance, given a uniquely decodable code for the parameters, letw be a distribution over the parameter space Θ (quantized if necessary) defined as

w(θ) = 2−`(θ)

c , wherec =X

θ∈Θ

2−`(θ) ≤1. Letpw be a mixture distribution over the data-sets D∈ D, defined as

pw(D) =X

θ∈Θ

pθ(D)w(θ) ,

i.e., an “average” distribution, where eachpθ is weighted byw(θ).

(74)

Outline Universal Source Codes Two-Part Codes Advanced Universal Codes

Mixture Codes

Normalized Maximum Likelihood Universal Prediction

Mixture Universal Model

There are universal codes that are strictly better than the two-part code.

For instance, given a uniquely decodable code for the parameters, letw be a distribution over the parameter space Θ (quantized if necessary) defined as

w(θ) = 2−`(θ)

c , wherec =X

θ∈Θ

2−`(θ) ≤1.

Letpw be a mixture distribution over the data-sets D∈ D, defined as

pw(D) =X

θ∈Θ

pθ(D)w(θ) ,

i.e., an “average” distribution, where eachpθ is weighted byw(θ).

Viittaukset

LIITTYVÄT TIEDOSTOT

The Extrinsic Object Construction must have approximately the meaning'the referent ofthe subject argument does the activity denoted by the verb so much or in

Let’s consider for a moment who they are, the ones we consider “founders”, “key figures”, or “big names” or the texts and books that comprise our “canon”, the

Outline Administrative issues Overview of Contents Compression..

Kahta

In the fourth work, we apply another two possible longitudinal models: (1) a multilevel model and (2) a mixed effect model to map a wood properties data set that is characterized by

Tytin tiukka itseluottamus on elämänkokemusta, jota hän on saanut opiskeltuaan Dallasissa kaksi talvea täydellä

Kun saaren korkeimmalla kohdalla sijaitseva avara huvilarakennus oli hel- posti seiniä puhkomalla ja ovia siirte- lemällä saatettu siihen kuntoon, että seura voi sinne

19 mm thick wood-fibre panel fronts with low formaldehyde emission CLASS E0, covered on 2 sides with melamine sheets [HRM], edge on 4 sides in 8/10 thick abs.. The external surface