Model Averaging for Linear Regression
Erkki P. Liski
University of Tampere
Department of Mathematics, Statistics and Philosophy
Outline
• The Model
• Model selection
• Model average estimator (MAE)
• Why MAE?
• General structure of MAE
• Selecting the model weights
• Finite sample performance
Homoscedastic Linear Regression

Variables: the response y and the predictors x_1, x_2, . . .

The Model

    y = μ + ε,   μ = Σ_{j=1}^∞ β_j x_j,   E(ε | x) = 0,   E(ε² | x) = σ²,

where β_1, β_2, . . . and σ² are unknown parameters and x = (x_1, x_2, . . .). Further, E(μ²) < ∞ and Σ_{j=1}^∞ β_j x_j converges in mean square.
Model Selection

Covariates: K potential predictors x_1, . . . , x_K are available. We observe (y_1, x_1), . . . , (y_n, x_n), where x_i = (x_{i1}, x_{i2}, . . . , x_{iK}).

Approximating Linear Model

    y_i = Σ_{j=1}^K x_{ij} β_j + b_i + ε_i,   i = 1, 2, . . . , n,

where b_i = Σ_{j=K+1}^∞ β_j x_{ij} is the approximation error.

Multiple candidate models are present. Model m uses the predictors {x_j I{j ∈ m} | j = 1, 2, . . . , K} for a subset m ⊂ {1, 2, . . . , K}.
A Class of Approximating Models A

The M × K Incidence Matrix

    A = ⎡ 1 1 0 . . . 0 0 ⎤     ⎡ a_1ᵀ ⎤
        ⎢        ⋮        ⎥     ⎢  ⋮   ⎥
        ⎢ 1 0 1 . . . 0 1 ⎥  =  ⎢ a_mᵀ ⎥
        ⎢        ⋮        ⎥     ⎢  ⋮   ⎥
        ⎣ 1 1 1 . . . 0 1 ⎦     ⎣ a_Mᵀ ⎦

for the models in A. The 1's in row m indicate the predictors included in the m-th model.

The Regression Matrix of Model m

    X_m = X diag(a_m),

where a_m, the m-th row of A, is the vector of diagonal entries of diag(a_m) and X denotes the n × K regression matrix.
Approximating Model m

Model m takes the form

    y = X_m β_m + b_m + ε.

The LSE of β_m is

    β̂_m = (X_mᵀ X_m)⁺ X_mᵀ y,

and that of μ_m = X_m β_m is

    μ̂_m = H_m y,   m = 1, 2, . . . , M,

where

    H_m = X_m (X_mᵀ X_m)⁺ X_mᵀ

is a projector.
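As a concrete illustration (not part of the original slides), a minimal numpy sketch of X_m = X diag(a_m) and the projector H_m; the data, the dimensions, and the incidence row a_m are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 30, 4
X = rng.normal(size=(n, K))          # n x K regression matrix (invented data)

# Incidence row a_m for a model that includes predictors 1, 2 and 4
a_m = np.array([1, 1, 0, 1])
X_m = X @ np.diag(a_m)               # X_m = X diag(a_m): excluded columns zeroed

# H_m = X_m (X_m^T X_m)^+ X_m^T, built with the Moore-Penrose pseudoinverse
H_m = X_m @ np.linalg.pinv(X_m.T @ X_m) @ X_m.T

# H_m is a projector: symmetric, idempotent, with tr(H_m) = k_m predictors
k_m = int(round(np.trace(H_m)))
```

The pseudoinverse is used instead of a plain inverse so the same formula works whether or not excluded columns make X_mᵀX_m singular.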
Model Average Estimator (MAE)

• MAE is an alternative to model selection.
• A model selection procedure can be unstable.
• When is combining better than selection?
• How should the uncertainty in selection be measured?

Draper 1995 (JRSS B), Raftery et al. 1997 (JASA), Burnham & Anderson 2002 (book), Hjort & Claeskens 2003 (JASA), Hansen 2007 (Econometrica).

MAE of β and μ

    β̂ = Σ_{m=1}^M w_m β̂_m,   with weights w_m ≥ 0 and Σ_{m=1}^M w_m = 1,

    μ̂ = H y,   where H = Σ_{m=1}^M w_m H_m is the implied hat matrix.
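The averaging step can be sketched in numpy (an illustrative reconstruction, writing w_m for the model weights; the three nested candidate models, the weight values and the data are all invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 40, 3
X = rng.normal(size=(n, K))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

# Three nested candidate models, given as rows of the incidence matrix A
A = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 1, 1]])
w = np.array([0.2, 0.5, 0.3])        # illustrative weights: w_m >= 0, sum to 1

betas, hats = [], []
for a_m in A:
    X_m = X @ np.diag(a_m)
    G = np.linalg.pinv(X_m.T @ X_m)
    betas.append(G @ X_m.T @ y)      # LSE under model m (zeros for excluded x_j)
    hats.append(X_m @ G @ X_m.T)     # hat matrix H_m of model m

beta_hat = sum(wm * b for wm, b in zip(w, betas))    # MAE of beta
H = sum(wm * Hm for wm, Hm in zip(w, hats))          # implied hat matrix
mu_hat = H @ y                                       # MAE of mu
```

Because each β̂_m carries zeros in the excluded coordinates, X β̂_m = H_m y, so averaging the coefficient vectors and averaging the fitted values agree.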
The Algebraic Structure of MAE

Define

    A Aᵀ = ⎡ k_1   k_12  . . .  k_1M ⎤
           ⎢ k_21  k_2   . . .  k_2M ⎥
           ⎢  ⋮     ⋮     ⋱          ⎥
           ⎣ k_M1  k_M2  . . .  k_M  ⎦  =  K.

Properties of H, with model weights wᵀ = (w_1, . . . , w_M):

(i)   tr(H) = Σ_{m=1}^M w_m k_m,
(ii)  tr(H²) = wᵀ K w,
(iii) λ_max(H) ≤ 1.
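A quick numerical check of properties (i)–(iii) (a sketch with invented data; property (ii) is verified here for nested candidate models, where tr(H_m H_ℓ) equals the number of predictors the two models share):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
X = rng.normal(size=(n, 4))

# Nested models with 1, 2, 3, 4 predictors: lower-triangular incidence matrix
A = np.tril(np.ones((4, 4), dtype=int))
w = np.array([0.1, 0.2, 0.3, 0.4])

hats = []
for a_m in A:
    X_m = X[:, a_m.astype(bool)]
    hats.append(X_m @ np.linalg.pinv(X_m.T @ X_m) @ X_m.T)
H = sum(wm * Hm for wm, Hm in zip(w, hats))

Kmat = A @ A.T            # entry (m, l) counts predictors shared by models m, l
k = np.diag(Kmat)         # diagonal: model sizes k_m
```
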
The Risk under Squared-Error Loss

Squared-Error Loss

    L(w) = ‖μ − μ̂‖².

The Conditional Risk of μ̂

    R(w) = E(L(w) | x_1, . . . , x_n)
         = ‖(I − H)μ‖² + σ² wᵀ K w
         = wᵀ (B + σ² K) w,

where

    B = ⎡ b_11  b_12  . . .  b_1M ⎤
        ⎢  ⋮     ⋮     ⋱          ⎥
        ⎣ b_M1  b_M2  . . .  b_MM ⎦,   with b_mk = b_mᵀ (I − H_m)(I − H_k) b_k.

At least two weights are non-zero in the optimal w.

Example: Suppose M = 2 and wᵀ = (w, 1 − w). Then w ∈ (0, 1) unless b_11 = b_12 or b_22 = b_12.
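For the M = 2 example, the constrained minimizer has a simple closed form: writing Q = B + σ²K with entries q_mk, the risk (w, 1−w) Q (w, 1−w)ᵀ is quadratic in w with minimizer w* = (q_22 − q_12)/(q_11 − 2q_12 + q_22), clipped to [0, 1]. A sketch (the matrix Q below is invented for illustration):

```python
import numpy as np

def optimal_weight_2(Q):
    """Minimize (w, 1-w) Q (w, 1-w)^T over w in [0, 1] for symmetric psd Q."""
    denom = Q[0, 0] - 2 * Q[0, 1] + Q[1, 1]
    if denom <= 0:                               # risk effectively linear in w
        return 1.0 if Q[0, 0] <= Q[1, 1] else 0.0
    w = (Q[1, 1] - Q[0, 1]) / denom              # unconstrained minimizer
    return float(min(1.0, max(0.0, w)))

# Invented Q = B + sigma^2 K: neither model dominates, so w* is interior
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
w_star = optimal_weight_2(Q)                     # here w* = 1/3
```
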
Selecting the Model Weights

Mallows' Criterion (MMAE) for MAE (Hansen 2007)

    C(w) = ‖(I − H)y‖² + 2σ² k(w),   k(w) = Σ_{m=1}^M w_m k_m,

where σ² is replaced with an estimate. Select ŵ such that

    ŵ = argmin_w C(w).

Properties of C(w): E[C(w)] = E[L(w)] + nσ², and

    L(ŵ) / inf_w L(w) →_p 1   as n → ∞.
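One way to compute ŵ for two nested models is a simple grid search over w ∈ [0, 1] (a sketch with invented data, not Hansen's implementation, which solves a quadratic program; estimating σ² from the largest candidate model is a common but not mandated choice):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.8, 0.1]) + rng.normal(size=n)

# Two nested models: intercept only (k_1 = 1) and the full model (k_2 = 3)
H1 = np.full((n, n), 1.0 / n)                    # projector onto the constant
H2 = X @ np.linalg.pinv(X.T @ X) @ X.T
k = np.array([1, 3])

# sigma^2 estimated from the largest model
sigma2 = y @ (np.eye(n) - H2) @ y / (n - 3)

def mallows(w):
    """C(w) for weight w on model 1 and 1 - w on model 2."""
    H = w * H1 + (1 - w) * H2
    resid = y - H @ y
    return resid @ resid + 2 * sigma2 * (w * k[0] + (1 - w) * k[1])

grid = np.linspace(0.0, 1.0, 1001)
w_hat = grid[np.argmin([mallows(w) for w in grid])]
```
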
Model Averaging for Linear Regression Selecting the Model Weights
SmoothedAC and BC (SAC & SBC)
m=exp(−12ACm)/PM
=1exp(−12AC)
(Bucland 1997, Burnham & Anderson 2002),
m=exp(−12BCm)/PM
=1exp(−12BC),
theAC andBCcriteria for model mare ACm=ln ˆσ2
m+2km and BCm=ln ˆσ2
m+kmlnn. SmoothedMDL (SMDL)
m=exp(−MDLm)/
M
X
=1
exp(−MDL), where
MDLm=nln ˆs2
m+kmlnFm+ln[km(n−km)], Fm=kμˆmk2/ kmsˆ2
m. (Rissanen 2000 & 2007, Liski 2006)
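All three smoothed schemes are softmax-type transforms of a criterion score, which one small helper can compute (the criterion values below are invented; subtracting the minimum score before exponentiating is a standard numerical-stability device that cancels in the normalized ratio, not part of the formulas above):

```python
import numpy as np

def smoothed_weights(scores, scale=0.5):
    """Weights w_m proportional to exp(-scale * score_m) over candidate models."""
    s = np.asarray(scores, dtype=float)
    e = np.exp(-scale * (s - s.min()))   # shift by the min to avoid overflow
    return e / e.sum()

aic = np.array([102.3, 100.1, 105.7])            # invented AIC_m values
w_saic = smoothed_weights(aic)                   # SAIC: exp(-AIC_m / 2)
mdl = np.array([510.2, 508.9, 512.4])            # invented MDL_m values
w_smdl = smoothed_weights(mdl, scale=1.0)        # SMDL: exp(-MDL_m)
```

In each case the model with the smallest criterion value receives the largest weight, and the weights sum to one by construction.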
Finite Sample Performance

The simulation model is the infinite-order regression

    y_i = Σ_{j=1}^∞ β_j x_{ij} + ε_i,

• x_{ij} ∼ N(0, 1) iid (x_{i1} = 1), ε_i ∼ N(0, 1), and x_{ij} ⊥⊥ ε_i.
• β_j = c √(2α) j^{−α−1/2}, and the population R² = c² / (1 + c²).
• 50 ≤ n ≤ 1000 and M = 3n^{1/3}.
• 0.5 ≤ α ≤ 1.5; for larger α the coefficients β_j decline more quickly.
• c is selected such that 0.1 ≤ R² ≤ 0.9.

Performance is measured by the mean of the predictive loss ‖μ − μ̂‖² over the simulations.
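The data-generating design above can be sketched as follows (the truncation point J = 500 for the infinite sum and the particular n, α, c values are my assumptions for illustration, chosen within the stated ranges):

```python
import numpy as np

rng = np.random.default_rng(4)
n, alpha, c = 200, 1.0, 1.0          # invented values within the stated ranges
M = int(3 * n ** (1 / 3))            # number of candidate models, M = 3 n^(1/3)

J = 500                              # truncation point for the infinite sum
beta = c * np.sqrt(2 * alpha) * np.arange(1, J + 1) ** (-alpha - 0.5)
x = rng.normal(size=(n, J))          # x_ij ~ N(0, 1) iid
x[:, 0] = 1.0                        # x_i1 = 1
eps = rng.normal(size=n)             # eps_i ~ N(0, 1), independent of x
mu = x @ beta
y = mu + eps

R2 = c**2 / (1 + c**2)               # population R^2 of the design
```
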
Simulation Results

Comments

• AIC and Mallows' criterion behave similarly; MMAE is better than SAIC.
• SAIC has lower risk than AIC.
• MMAE is better than SBIC in most cases.

      Method   Performs well for
      SBIC     n small and R² large
      BIC      'small' models
      AIC      'large' models

• SMDL is better than MDL.
• SMDL emulates the best-performing criterion.
• SMDL has the best overall performance.
References
Buckland, S. T., Burnham, K. P. and Augustin, N. H. (1997), Model Selection: An Integral Part of Inference. Biometrics, 53, 603–618.

Burnham, K. P. and Anderson, D. R. (2002), Model Selection and Multimodel Inference, Springer.

Draper, D. (1995), Assessment and Propagation of Model Uncertainty. Journal of the Royal Statistical Society B, 57, 45–70.

Hansen, B. E. (2007), Least Squares Model Averaging. Econometrica, forthcoming.

Hjort, N. L. and Claeskens, G. (2003), Frequentist Model Average Estimators. Journal of the American Statistical Association, 98, 879–899.
Liski, E. P. (2006), Normalized ML and the MDL Principle for Variable Selection in Linear Regression. In: Festschrift for Tarmo Pukkila on His 60th Birthday, 159–172.

Raftery, A. E., Madigan, D. and Hoeting, J. A. (1997), Bayesian Model Averaging for Linear Regression Models. Journal of the American Statistical Association, 92, 179–191.

Rissanen, J. (2000), MDL Denoising. IEEE Transactions on Information Theory, IT-46, 2537–2543.

Rissanen, J. (2007), Information and Complexity in Statistical Modeling, Springer.