Information-Theoretic Modeling, Autumn 2012. Lecture 8: Universal Source Coding. Jyrki Kivinen


Information-Theoretic Modeling

Lecture 8: Universal Source Coding

Jyrki Kivinen

Department of Computer Science, University of Helsinki

Autumn 2012

[Figure: Moline Universal Model D, Little Casterton Working Weekend, 2006.]

Outline

1 Universal Source Codes
    Definitions
    Universal Models

2 Two-Part Codes
    Discrete Parameters
    Continuous Parameters
    Asymptotics: $\frac{k}{2} \log n$

3 Advanced Universal Codes
    Mixture Codes
    Normalized Maximum Likelihood
    Universal Prediction


Definitions

Our basic setting is that we have some data $D = (x_1, \ldots, x_n)$, where the individual data points $x_i$ come from some domain $X$.

We write $\mathcal{D}$ for the set of all possible data. A typical situation is $\mathcal{D} = X^n$, where $n$ may or may not be known in advance.

A probability distribution $p$ over $\mathcal{D}$ is called a model. A set of models $\mathcal{M}$ is called a model class.

Model classes are often parametric: $\mathcal{M} = \{p_\theta \mid \theta \in \Theta\}$, where $\Theta \subseteq \mathbb{R}^k$ for some $k$ and $p_\theta$ is a model for each $\theta \in \Theta$.


Definitions

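Example. Let $p_{\mu,\sigma^2}$ be the normal distribution over $X = \mathbb{R}$ with mean $\mu$ and variance $\sigma^2$.

We have a parametric family $\mathcal{M} = \{p_\theta \mid \theta \in \Theta\}$, where $\Theta = \{(\mu, \sigma^2) \in \mathbb{R}^2 \mid \sigma^2 > 0\}$.

We can extend $p_{\mu,\sigma^2}$ into a distribution $p^{(n)}_{\mu,\sigma^2}$ over $\mathcal{D} = \mathbb{R}^n$ by assuming independence:

$$p^{(n)}_{\mu,\sigma^2}(x_1, \ldots, x_n) = p_{\mu,\sigma^2}(x_1) \cdots p_{\mu,\sigma^2}(x_n).$$

We often abuse notation by just writing $p_\theta(x_1, \ldots, x_n)$ instead of $p^{(n)}_\theta(x_1, \ldots, x_n)$.

However, keep in mind that we may also have $p$ over $\mathcal{D}$ that does not satisfy the independence assumption.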
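The independence extension can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values are made up, and for continuous data the quantity $-\log_2$ of a density is an idealized code-length that is defined only up to a discretization constant.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of the normal distribution with mean mu and variance sigma2 at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def iid_neg_log2_density(xs, mu, sigma2):
    """-log2 of the i.i.d. extension p^(n)(x_1..x_n) = prod_i p(x_i), summed in the log domain."""
    return sum(-math.log2(normal_pdf(x, mu, sigma2)) for x in xs)

xs = [0.3, -1.2, 0.8, 2.1]                                  # made-up sample
direct = math.prod(normal_pdf(x, 0.0, 1.0) for x in xs)     # product of the marginals
total = iid_neg_log2_density(xs, 0.0, 1.0)                  # same quantity, as a sum of logs
```

Summing logarithms instead of multiplying densities avoids numerical underflow when $n$ is large.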


Information-theoretic modeling?

In what follows, it’s important to keep in mind that we don’t claim that we can find a “true” model $p$ that “really” generated the data $D$, or even that such a “true” model exists.

However, keeping in mind how codes and distributions are related, it seems reasonable to think that:

If a code based on model $p$ is good at compressing $D$, then perhaps studying $p$ can tell us something useful about $D$.


Definitions

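The model within $\mathcal{M}$ that achieves the shortest code-length for data $D$ is the maximum likelihood (ML) model:

$$\min_{\theta \in \Theta} \log_2 \frac{1}{p_\theta(D)} = \log_2 \frac{1}{p_{\hat\theta}(D)}.$$

The ML parameter $\hat\theta$ depends on $D$!

For a model $q$, the excess code-length or “regret” over the ML model in $\mathcal{M}$ is given by

$$\log_2 \frac{1}{q(D)} - \log_2 \frac{1}{p_{\hat\theta}(D)}.$$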
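As a small numerical illustration of the regret, the sketch below takes the i.i.d. Bernoulli class as $\mathcal{M}$ and the uniform distribution on binary strings of length $n$ as $q$. Both choices, the function names, and the data are assumptions made for this example only.

```python
import math

def ml_code_length(bits):
    """log2 1/p_thetahat(D) for the i.i.d. Bernoulli class; thetahat is the fraction of ones."""
    n, k = len(bits), sum(bits)
    theta = k / n
    length = 0.0
    if k > 0:
        length -= k * math.log2(theta)
    if k < n:
        length -= (n - k) * math.log2(1 - theta)
    return length

def regret(q_code_length, bits):
    """Excess code-length of a model q over the ML model, both measured in bits."""
    return q_code_length - ml_code_length(bits)

D = [1, 1, 0, 1, 1, 1, 0, 1]        # made-up data: 6 ones out of 8
uniform_length = float(len(D))      # q(D) = 2^-n, i.e. one bit per symbol
r = regret(uniform_length, D)       # roughly 1.51 bits for this D
```

The uniform $q$ pays exactly $n$ bits for every sequence, so its regret is largest on sequences that the ML model compresses well, such as all ones.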


Universal models

Universal model

A model (code) $q$ whose regret grows slower than $n$, for all data sequences, is said to be a universal model (code) relative to model class $\mathcal{M}$:

$$\lim_{n\to\infty} \frac{1}{n} \max_{D \in \mathcal{D}} \left[ \log_2 \frac{1}{q(D)} - \log_2 \frac{1}{p_{\hat\theta}(D)} \right] = 0. \qquad (1)$$

Since $\hat\theta$ is the ML parameter, for every $\theta \in \Theta$ and every $D$ we have

$$\log_2 \frac{1}{p_{\hat\theta}(D)} \le \log_2 \frac{1}{p_\theta(D)},$$

$$-\log_2 \frac{1}{p_{\hat\theta}(D)} \ge -\log_2 \frac{1}{p_\theta(D)},$$

$$\log_2 \frac{1}{q(D)} - \log_2 \frac{1}{p_{\hat\theta}(D)} \ge \log_2 \frac{1}{q(D)} - \log_2 \frac{1}{p_\theta(D)}.$$

Taking expectations over $D \sim p_\theta$:

$$\begin{aligned}
E_{D\sim p_\theta}\left[ \log_2 \frac{1}{q(D)} - \log_2 \frac{1}{p_{\hat\theta}(D)} \right]
&\ge E_{D\sim p_\theta}\left[ \log_2 \frac{1}{q(D)} - \log_2 \frac{1}{p_\theta(D)} \right] \\
&= E_{D\sim p_\theta}\left[ \log_2 \frac{1}{q(D)} \right] - E_{D\sim p_\theta}\left[ \log_2 \frac{1}{p_\theta(D)} \right] \\
&= E_{D\sim p_\theta}\left[ \log_2 \frac{1}{q(D)} \right] - \sum_{D} p_\theta(D) \log_2 \frac{1}{p_\theta(D)} \\
&= E_{D\sim p_\theta}\left[ \log_2 \frac{1}{q(D)} \right] - H(p_\theta) \\
&= E_{D\sim p_\theta}\left[ \log_2 \frac{1}{q(D)} \right] - n H(p^{(1)}_\theta),
\end{aligned}$$

where the last step holds when $p_\theta$ is the i.i.d. extension of the single-symbol distribution $p^{(1)}_\theta$. Dividing by $n$ and letting $n \to \infty$:

$$\lim_{n\to\infty} \frac{1}{n} E_{D\sim p_\theta}\left[ \log_2 \frac{1}{q(D)} - \log_2 \frac{1}{p_{\hat\theta}(D)} \right] \ge \lim_{n\to\infty} \frac{1}{n} E_{D\sim p_\theta}\left[ \log_2 \frac{1}{q(D)} \right] - H(p^{(1)}_\theta).$$

The expected regret is at most the worst-case regret, so by (1) the left-hand side is at most 0:

$$0 \ge \lim_{n\to\infty} \frac{1}{n} E_{D\sim p_\theta}\left[ \log_2 \frac{1}{q(D)} \right] - H(p^{(1)}_\theta),$$

$$\lim_{n\to\infty} \frac{1}{n} E_{D\sim p_\theta}\left[ \log_2 \frac{1}{q(D)} \right] \le H(p^{(1)}_\theta).$$

No code can have expected code-length below the entropy, so equality holds:

$$\lim_{n\to\infty} \frac{1}{n} E_{D\sim p_\theta}\left[ \log_2 \frac{1}{q(D)} \right] = H(p^{(1)}_\theta). \qquad (2)$$

This is another (stochastic) definition of universality, equivalent to

$$\frac{1}{n} D(p_\theta \,\|\, q) \to 0 \quad \text{for all } \theta \in \Theta.$$

It is weaker since (1) ⇒ (2).


Universal models

The typical situation might be as follows:

1 We know (think) that the source symbols are generated by a Bernoulli model with parameter $p \in [0,1]$.

2 We’d like to encode data at rate $H(p)$.

3 However, we do not know $p$ in advance.

Again, we don’t need to believe that data are really generated by a Bernoulli model.
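To see why not knowing $p$ matters, the sketch below computes the expected rate when Bernoulli($p$) data are coded with a fixed guess $q$. The rate exceeds $H(p)$ by the Kullback-Leibler divergence $D(p \,\|\, q)$, which is zero only when $q = p$. The function names are made up for this illustration.

```python
import math

def entropy(p):
    """Binary entropy H(p) in bits."""
    return sum(-t * math.log2(t) for t in (p, 1 - p) if t > 0)

def expected_bits_per_symbol(p, q):
    """E_{x ~ Bernoulli(p)} log2 1/q(x): the rate when coding with a fixed guess 0 < q < 1."""
    return -p * math.log2(q) - (1 - p) * math.log2(1 - q)

def kl(p, q):
    """Per-symbol overhead D(p || q) of the fixed guess q, in bits."""
    return expected_bits_per_symbol(p, q) - entropy(p)

overhead = kl(0.1, 0.5)   # guessing q = 0.5 when p = 0.1 wastes about half a bit per symbol
```

A fixed $q$ therefore cannot reach rate $H(p)$ for every $p$ at once, which is what a universal code has to achieve asymptotically.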


Two-Part Codes

Let $\mathcal{M} = \{p_\theta : \theta \in \Theta\}$ be a parametric probabilistic model class.

If the parameter space $\Theta$ is discrete, we can construct a (prefix) code $C_1 : \Theta \to \{0,1\}^*$ which maps each parameter value to a codeword of length $\ell_1(\theta)$.

For any distribution $p_\theta$, the Shannon code-lengths satisfy

$$\ell_\theta(D) = \left\lceil \log_2 \frac{1}{p_\theta(D)} \right\rceil \approx \log_2 \frac{1}{p_\theta(D)}.$$

Using parameter value $\theta$, the total code-length becomes (≈)

$$\ell_1(\theta) + \log_2 \frac{1}{p_\theta(D)}.$$
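A minimal sketch of this total code-length for a Bernoulli class with a finite parameter set follows. The parameter set `THETAS`, the fixed-length parameter code, and the function names are assumptions made for illustration. With a fixed-length parameter code, the best $\theta$ for the total length is simply the likelihood maximizer within the set.

```python
import math

def data_bits(bits, theta):
    """Idealized Shannon code-length log2 1/p_theta(D) for i.i.d. Bernoulli(theta), 0 < theta < 1."""
    k, n = sum(bits), len(bits)
    return -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))

def two_part_bits(bits, thetas):
    """Return (best theta, l1(theta) + log2 1/p_theta(D)) over a finite parameter set."""
    l1 = math.ceil(math.log2(len(thetas)))               # assumed: same codeword length for every theta
    best = min(thetas, key=lambda t: data_bits(bits, t))
    return best, l1 + data_bits(bits, best)

THETAS = [0.1, 0.3, 0.5, 0.7, 0.9]    # hypothetical discrete parameter set
D = [1, 1, 0, 1, 1, 1, 0, 1]          # made-up data: 6 ones out of 8
theta_hat, total = two_part_bits(D, THETAS)
```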


Two-Part Codes

Using the maximum likelihood parameter, the total code-length becomes

$$\ell_{\text{two-part}}(D) = \ell_1(\hat\theta) + \log_2 \frac{1}{p_{\hat\theta}(D)}.$$

Hence, the regret of the two-part code is

$$\ell_{\text{two-part}}(D) - \log_2 \frac{1}{p_{\hat\theta}(D)} = \ell_1(\hat\theta) < cn \quad \text{for all } c > 0 \text{ and large } n.$$

For discrete parameter models the two-part code is universal.
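As a worked instance of this bound, assume (for illustration only, this case is not spelled out in the lecture) that $\Theta$ is finite with $K$ elements and that $C_1$ is a fixed-length code. The regret is then constant in $n$, so the per-symbol regret vanishes:

```latex
\ell_1(\hat\theta) = \lceil \log_2 K \rceil
\quad\Longrightarrow\quad
\frac{1}{n}\Bigl[\ell_{\text{two-part}}(D) - \log_2 \frac{1}{p_{\hat\theta}(D)}\Bigr]
= \frac{\lceil \log_2 K \rceil}{n} \;\longrightarrow\; 0 .
```

For example, with $K = 5$ and $n = 1000$ the overhead is $3/1000 = 0.003$ bits per symbol.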


Universality of Two-Part Codes

However, keep in mind that universality is not everything.

Since the two-part code is universal, its regret per symbol goes to zero, but there may be other codes for which it goes to zero faster.

On the other hand, two-part codes have the advantage of being reasonably easy to understand. Often they are also efficiently computable.


Continuous Parameters

What if the parameters are continuous (like polynomial coefficients)? We can’t encode all continuous values with finite code-lengths!

Solution: Quantization. Choose a discrete subset of points, $\theta^{(1)}, \theta^{(2)}, \ldots$, and use only them.

Information Geometry!

If the points are sufficiently dense (in a code-length sense) then the code-length for data is still almost as short as $\min_{\theta \in \Theta} \ell_\theta(D)$.


About Quantization

How many points should there be in the subset $\theta^{(1)}, \theta^{(2)}, \ldots$?

Intuition: Data does not allow us to tell apart $\theta_1$ and $\theta_2$ if

$$|\theta_1 - \theta_2| < c \frac{1}{\sqrt{n}}.$$

⇒ Don’t care about higher precision.

Theorem

Optimal quantization accuracy is of order $\frac{1}{\sqrt{n}}$.

⇒ number of points $\approx \sqrt{n}^{\,k} = n^{k/2}$, where $k = \dim(\Theta)$.

The code-length for the quantized parameters becomes

$$\ell(\theta^q) \approx \log_2 n^{k/2} = \frac{k}{2} \log_2 n.$$
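The counting argument can be checked with a short sketch. It assumes, purely for illustration, that each of the $k$ parameters ranges over a unit interval and is coded with a uniform code over a grid of spacing about $1/\sqrt{n}$. Constant factors are ignored, and the function name is made up.

```python
import math

def quantized_param_bits(n, k):
    """Approximate bits to name one grid point when each of k parameters has ~sqrt(n) levels."""
    points_per_dim = math.ceil(math.sqrt(n))   # grid spacing of about 1/sqrt(n) on a unit interval
    return k * math.log2(points_per_dim)       # log2 of (points_per_dim ** k)

bits = quantized_param_bits(1024, 2)           # (k/2) * log2(n) = 1 * 10 bits
```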

Asymptotics: $\frac{k}{2} \log n$

With the precision $\frac{1}{\sqrt{n}}$ the code-length for data is almost optimal:

$$\min_{\theta^q \in \{\theta^{(1)}, \theta^{(2)}, \ldots\}} \ell_{\theta^q}(D) \approx \min_{\theta \in \Theta} \ell_\theta(D) = \log_2 \frac{1}{p_{\hat\theta}(D)}.$$

The total code-length becomes then (≈)

$$\log_2 \frac{1}{p_{\hat\theta}(D)} + \frac{k}{2} \log_2 n,$$

so that the regret is $\frac{k}{2} \log_2 n$.

Since $\log_2 n$ grows slower than $n$, the two-part code is universal also for continuous parameter models.
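The asymptotics can be observed numerically for the Bernoulli class ($k = 1$). The sketch below is an illustration under stated assumptions: the grid of about $\sqrt{n}$ midpoints, the uniform parameter code, and the function names are not taken from the lecture. The regret should stay within a constant of $\frac{1}{2} \log_2 n$.

```python
import math

def data_bits(k, n, theta):
    """log2 1/p_theta(D) for an i.i.d. Bernoulli sequence with k ones out of n."""
    bits = 0.0
    if k > 0:
        bits -= k * math.log2(theta)
    if k < n:
        bits -= (n - k) * math.log2(1 - theta)
    return bits

def two_part_regret(k, n):
    """Regret of a two-part code whose parameter grid has about sqrt(n) points."""
    m = math.ceil(math.sqrt(n))
    grid = [(j + 0.5) / m for j in range(m)]         # midpoints, all strictly inside (0, 1)
    param_bits = math.log2(m)                        # uniform code over the grid
    best = min(data_bits(k, n, t) for t in grid)     # best quantized parameter for the data
    ml = data_bits(k, n, k / n)                      # unquantized ML code-length
    return param_bits + best - ml

for n in (100, 10_000, 1_000_000):
    k = int(0.3 * n)
    # columns: n, two-part regret in bits, (1/2) log2 n
    print(n, round(two_part_regret(k, n), 2), round(0.5 * math.log2(n), 2))
```

The regret divided by $n$ shrinks quickly, which is the universality claim in numerical form.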


Mixture Codes

Mixture Universal Model

There are universal codes that are strictly better than the two-part code.

For instance, given a uniquely decodable code for the parameters, let $w$ be a distribution over the parameter space $\Theta$ (quantized if necessary) defined as

$$w(\theta) = \frac{2^{-\ell(\theta)}}{c}, \quad \text{where } c = \sum_{\theta \in \Theta} 2^{-\ell(\theta)} \le 1.$$

Let $p^w$ be a mixture distribution over the data-sets $D \in \mathcal{D}$, defined as

$$p^w(D) = \sum_{\theta \in \Theta} p_\theta(D)\, w(\theta),$$

i.e., an “average” distribution, where each $p_\theta$ is weighted by $w(\theta)$.
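The comparison with the two-part code can be sketched for a Bernoulli class with a finite parameter set. The code lengths in `LENGTHS` are hypothetical and chosen to satisfy Kraft's inequality; the function names are made up. Since $p^w(D) \ge w(\theta)\, p_\theta(D)$ for every $\theta$ and $c \le 1$, the mixture code-length can never exceed the two-part code-length.

```python
import math

def likelihood(k, n, theta):
    """p_theta(D) for an i.i.d. Bernoulli sequence with k ones out of n."""
    return theta ** k * (1 - theta) ** (n - k)

def mixture_and_two_part_bits(k, n, code_lengths):
    """code_lengths maps theta -> l(theta); returns (mixture bits, best two-part bits)."""
    c = sum(2.0 ** -l for l in code_lengths.values())                  # Kraft sum, at most 1
    weights = {t: 2.0 ** -l / c for t, l in code_lengths.items()}      # w(theta)
    mixture = -math.log2(sum(likelihood(k, n, t) * w for t, w in weights.items()))
    two_part = min(l - math.log2(likelihood(k, n, t)) for t, l in code_lengths.items())
    return mixture, two_part

LENGTHS = {0.1: 3, 0.3: 3, 0.5: 2, 0.7: 3, 0.9: 3}   # hypothetical prefix-code lengths
mix, two = mixture_and_two_part_bits(6, 8, LENGTHS)   # made-up data: 6 ones out of 8
```

The sketch multiplies raw likelihoods, which is fine for small $n$; for long sequences the sum would need to be computed in the log domain to avoid underflow.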
