
Model Averaging for Linear Regression

Erkki P. Liski
University of Tampere
Department of Mathematics, Statistics and Philosophy

Outline

• The Model
• Model selection
• Model average estimator (MAE)
• Why MAE?
• General structure of MAE
• Selecting the model weights
• Finite Sample Performance

Homoscedastic linear regression

Variables: the response y and the predictors x_1, x_2, ...

The Model:

    y = μ + ε,   μ = Σ_{j=1}^∞ β_j x_j,   E(ε | x) = 0,   E(ε² | x) = σ²,

where β_1, β_2, ... and σ² are unknown parameters and x = (x_1, x_2, ...). Further, E(μ²) < ∞ and Σ_{j=1}^∞ β_j x_j converges in mean square.

Model Selection

Covariates: K potential predictors x_1, ..., x_K are available.
Observe (y_1, x_1), ..., (y_n, x_n), where x_i = (x_{i1}, x_{i2}, ..., x_{iK}).

Approximating Linear Model:

    y_i = Σ_{j=1}^K x_{ij} β_j + b_i + ε_i,   i = 1, 2, ..., n,

where b_i = Σ_{j=K+1}^∞ β_j x_{ij} is the approximation error.

Multiple candidate models are present: model m is identified with an index set I_m ⊆ {1, 2, ..., K} of the predictors it includes.

A Class of Approximating Models A

The M×K Incidence Matrix:

    A = [ 1 1 0 ... 0 0 ]   [ a_1^T ]
        [      ...      ]   [  ...  ]
        [ 1 0 1 ... 0 1 ] = [ a_m^T ]
        [      ...      ]   [  ...  ]
        [ 1 1 1 ... 0 1 ]   [ a_M^T ]

for the models in A. The 1's in row m indicate the predictors included in the m-th model.

The Regression Matrix of Model m:

    X_m = X diag(a_m),

where a_m is the vector of diagonal entries of diag(a_m), and X denotes the n×K regression matrix.
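The incidence-matrix construction can be sketched in a few lines of NumPy. The dimensions, the random design, and the particular nested family below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Illustrative setup: n = 8 observations, K = 3 candidate predictors.
rng = np.random.default_rng(0)
n, K = 8, 3
X = rng.normal(size=(n, K))          # the n x K regression matrix

# Incidence matrix A: each row is one candidate model in the class A;
# a 1 in column j means predictor j enters that model.
A = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 1, 1]])            # M = 3 nested models

# Regression matrix of model m: X_m = X diag(a_m) zeroes out the
# columns of the excluded predictors.
def model_matrix(X, a_m):
    return X @ np.diag(a_m)

X_1 = model_matrix(X, A[1])          # model with predictors 1 and 2
assert np.allclose(X_1[:, 2], 0.0)   # predictor 3 is zeroed out
assert np.allclose(X_1[:, :2], X[:, :2])
```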

Approximating Model m

Model m takes the form

    y = X_m β_m + b_m + ε.

The LSE of β_m is

    β̂_m = (X_m^T X_m)^+ X_m^T y,

and that of μ_m = X_m β_m is

    μ̂_m = H_m y  under model m, where

    H_m = X_m (X_m^T X_m)^+ X_m^T

is a projector.
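A minimal self-contained check of these formulas, with a made-up design; the pseudoinverse ('+' rather than an inverse) is needed because the zeroed columns make X_m rank-deficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 10, 3
X = rng.normal(size=(n, K))
y = rng.normal(size=n)

a_m = np.array([1, 1, 0])             # model m includes predictors 1 and 2
X_m = X @ np.diag(a_m)

# LSE via the Moore-Penrose pseudoinverse.
P = np.linalg.pinv(X_m.T @ X_m)
beta_m = P @ X_m.T @ y
H_m = X_m @ P @ X_m.T                 # the hat matrix of model m
mu_m = H_m @ y

assert np.allclose(H_m @ H_m, H_m)            # H_m is a projector
assert np.isclose(np.trace(H_m), a_m.sum())   # its rank is k_m = 2
assert np.allclose(mu_m, X_m @ beta_m)
```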

Model Average Estimator (MAE)

• MAE is an alternative to model selection
• A model selection procedure can be unstable
• When is combining better than selection?
• How to measure the uncertainty in selection?

Draper 1995 (JRSS B), Raftery et al. 1997 (JASA), Burnham & Anderson 2002 (book), Hjort & Claeskens 2003 (JASA), Hansen 2007 (Econometrica)

MAE of β and μ:

    β̂ = Σ_{m=1}^M w_m β̂_m,   with weights w_m ≥ 0 and Σ_{m=1}^M w_m = 1,

    μ̂ = H_w y,   where H_w = Σ_{m=1}^M w_m H_m is the implied hat matrix.
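A small sketch of the model average estimator. The weight symbol w, the tiny design, and the particular weight vector are illustrative assumptions; the check exploits that each β̂_m is zero in its excluded coordinates, so the coefficient-averaged fit X β̂ coincides with H_w y:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 12, 3
X = rng.normal(size=(n, K))
y = rng.normal(size=n)

A = np.array([[1, 0, 0], [1, 1, 0], [1, 1, 1]])   # candidate models
w = np.array([0.2, 0.3, 0.5])                     # w_m >= 0, sum = 1

betas, hats = [], []
for a in A:
    Xm = X @ np.diag(a)
    P = np.linalg.pinv(Xm.T @ Xm)
    betas.append(P @ Xm.T @ y)      # LSE beta_hat_m (zeros outside model m)
    hats.append(Xm @ P @ Xm.T)      # projector H_m

beta_bar = sum(wm * b for wm, b in zip(w, betas))   # MAE of beta
H_w = sum(wm * H for wm, H in zip(w, hats))          # implied hat matrix
mu_hat = H_w @ y                                     # MAE of mu

assert np.allclose(mu_hat, X @ beta_bar)   # the two MAE forms agree
```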

The Algebraic Structure of MAE

Define

    A A^T = [ k_1   k_12 ... k_1M ]
            [ k_21  k_2  ... k_2M ]
            [ ...   ...  ...      ]
            [ k_M1  k_M2 ... k_M  ]  =  K.

Properties of H_w, with model weights w^T = (w_1, ..., w_M):

(i)   tr(H_w) = Σ_{m=1}^M w_m k_m,
(ii)  tr(H_w²) = w^T K w,
(iii) λ_max(H_w) ≤ 1.
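These properties can be verified numerically. The sketch below uses a nested family (an assumption made for convenience), in which tr(H_m H_l) equals the number of shared predictors, so (ii) holds exactly with K = A A^T:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 15
X = rng.normal(size=(n, 3))
A = np.array([[1, 0, 0], [1, 1, 0], [1, 1, 1]])   # nested models
K_mat = A @ A.T                                   # the matrix K = A A^T
k = np.diag(K_mat)                                # model sizes k_m

def hat(a):
    Xm = X @ np.diag(a)
    return Xm @ np.linalg.pinv(Xm.T @ Xm) @ Xm.T

w = np.array([0.5, 0.3, 0.2])
H_w = sum(wm * hat(a) for wm, a in zip(w, A))

assert np.isclose(np.trace(H_w), w @ k)                 # property (i)
assert np.isclose(np.trace(H_w @ H_w), w @ K_mat @ w)   # property (ii)
assert np.linalg.eigvalsh(H_w).max() <= 1 + 1e-9        # property (iii)
```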

The Risk under Squared-Error Loss

Squared-Error Loss:  L(w) = ‖μ − μ̂_w‖².

The Conditional Risk of μ̂_w:

    R(w) = E(L(w) | x_1, ..., x_n)
         = ‖(I − H_w)μ‖² + σ² w^T K w
         = w^T (B + σ² K) w,

where

    B = [ b_11 b_12 ... b_1M ]
        [ ...  ...  ...      ]
        [ b_M1 b_M2 ... b_MM ],   with b_mk = b_m^T (I − H_m)(I − H_k) b_k.

At least two weights are non-zero in the optimal w.

Example: Suppose M = 2 and w^T = (w, 1 − w). Then the optimal w ∈ (0, 1) unless b_11 = b_12 or b_22 = b_12.
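The M = 2 example can be made concrete by minimizing the quadratic risk over a grid. The matrices B and K and the noise level below are made-up illustrative numbers chosen so that neither b_11 = b_12 nor b_22 = b_12 holds, forcing an interior optimum:

```python
import numpy as np

B = np.array([[4.0, 1.0],
              [1.0, 0.5]])      # bias terms b_mk (illustrative)
Kmat = np.array([[1, 1],
                 [1, 3]])       # K = A A^T for two nested models
sigma2 = 1.0
Q = B + sigma2 * Kmat

def risk(t):
    w = np.array([t, 1.0 - t])  # w^T = (w, 1 - w)
    return w @ Q @ w

grid = np.linspace(0.0, 1.0, 1001)
t_star = grid[np.argmin([risk(t) for t in grid])]
assert 0.0 < t_star < 1.0       # the optimal weight is strictly interior
```

For these numbers the risk is 4.5t² − 3t + 3.5, minimized at t = 1/3, so averaging the two models beats putting all weight on either one.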

Selecting the Model Weights

Mallows' Criterion (MMAE) for MAE (Hansen 2007):

    C(w) = ‖(I − H_w)y‖² + 2σ² k_w,   k_w = Σ_{m=1}^M w_m k_m,

where σ² is replaced with an estimate. Select ŵ such that

    ŵ = argmin_w C(w).

Properties of C(w):  E[C(w)] = E[L(w)] + nσ², and

    L(ŵ) / inf_w L(w) →_p 1  as n → ∞.
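A sketch of Mallows weight selection for two nested models, minimizing C(w) over a grid on the simplex. The design, the coefficients, and the convention of estimating σ² from the largest model are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
X = rng.normal(size=(n, 4))
X[:, 0] = 1.0                          # intercept column
beta = np.array([1.0, 0.8, 0.2, 0.05])
y = X @ beta + rng.normal(size=n)

def hat(cols):
    Xm = X[:, cols]
    return Xm @ np.linalg.inv(Xm.T @ Xm) @ Xm.T

H1, H2 = hat([0, 1]), hat([0, 1, 2, 3])          # small and large model
k1, k2 = 2, 4
sigma2 = y @ (np.eye(n) - H2) @ y / (n - k2)     # sigma^2 estimate

def mallows(t):
    # C(w) with w^T = (t, 1 - t): residual sum of squares + penalty.
    H_w = t * H1 + (1 - t) * H2
    resid = y - H_w @ y
    return resid @ resid + 2 * sigma2 * (t * k1 + (1 - t) * k2)

grid = np.linspace(0.0, 1.0, 201)
t_hat = grid[np.argmin([mallows(t) for t in grid])]
assert 0.0 <= t_hat <= 1.0
```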

Smoothed AIC and BIC (SAIC & SBIC)

    w_m = exp(−½ AIC_m) / Σ_{l=1}^M exp(−½ AIC_l)

(Buckland 1997, Burnham & Anderson 2002),

    w_m = exp(−½ BIC_m) / Σ_{l=1}^M exp(−½ BIC_l),

where the AIC and BIC criteria for model m are

    AIC_m = n ln σ̂²_m + 2k_m  and  BIC_m = n ln σ̂²_m + k_m ln n.

Smoothed MDL (SMDL)

    w_m = exp(−MDL_m) / Σ_{l=1}^M exp(−MDL_l),

where

    MDL_m = n ln ŝ²_m + k_m ln F_m + ln[k_m(n − k_m)],   F_m = ‖μ̂_m‖² / (k_m ŝ²_m).

(Rissanen 2000 & 2007, Liski 2006)
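Smoothed-AIC weights amount to a softmax over criterion values. The residual variances and model sizes below are made up; subtracting min(AIC) before exponentiating is a standard numerical-stability trick that leaves the weights unchanged:

```python
import numpy as np

n = 100
sigma2_hat = np.array([1.30, 1.05, 1.02])   # residual variances (made up)
k = np.array([1, 3, 6])                     # model sizes k_m

aic = n * np.log(sigma2_hat) + 2 * k
aic -= aic.min()                            # stabilize the exponentials
w = np.exp(-0.5 * aic)
w /= w.sum()                                # SAIC weights

assert np.isclose(w.sum(), 1.0) and np.all(w >= 0)
```

The SBIC and SMDL weights follow the same pattern with AIC_m replaced by BIC_m (and −½ by −1 for MDL_m).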

Finite Sample Performance

The simulation model is the infinite-order regression

    y_i = Σ_{j=1}^∞ β_j x_{ij} + ε_i,

• x_{ij} ~ N(0,1) iid (x_{i1} = 1), ε_i ~ N(0,1) and x ⊥⊥ ε.
• β_j = c √(2α) j^(−α−1/2) and the population R² = c²/(1 + c²).
• 50 ≤ n ≤ 1000 and M = 3n^(1/3).
• 0.5 ≤ α ≤ 1.5; for larger α the coefficients β_j decline more quickly.
• c is selected such that 0.1 ≤ R² ≤ 0.9.

The performance measure is the mean of the predictive loss ‖μ − μ̂‖² over simulations.
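One draw from this design can be sketched as follows. Truncating the infinite sum at a large J, and the particular choices of n, α, and c, are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n, J = 100, 500                           # truncate the infinite sum at J
alpha, c = 1.0, 1.0                       # population R^2 = c^2/(1+c^2) = 0.5

j = np.arange(1, J + 1)
beta = c * np.sqrt(2 * alpha) * j ** (-alpha - 0.5)   # decaying coefficients

X = rng.normal(size=(n, J))
X[:, 0] = 1.0                             # x_{i1} = 1
eps = rng.normal(size=n)
mu = X @ beta
y = mu + eps

M = int(3 * n ** (1 / 3))                 # number of candidate models
assert M == 13                            # 3 * 100^(1/3) ~ 13.9, truncated
```

A study would repeat such draws, compute μ̂ by each weighting scheme, and average the predictive loss ‖μ − μ̂‖².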

Simulation Results

Comments

• AIC and Mallows' C_p behave similarly; MMAE is better than SAIC.
• SAIC has lower risk than AIC.
• MMAE is better than SBIC in most cases.

  Method   Performs well for
  SBIC     R² small, n large
  BIC      'small' models
  AIC      'large' models

• SMDL is better than MDL.
• SMDL emulates the best performing criterion.
• SMDL has the best overall performance.

References

Buckland, S. T., Burnham, K. P. and Augustin, N. H. (1997), Model Selection: An Integral Part of Inference. Biometrics, 53, 603–618.

Burnham, K. P. and Anderson, D. R. (2002), Model Selection and Multimodel Inference, Springer.

Draper, D. (1995), Assessment and Propagation of Model Uncertainty. Journal of the Royal Statistical Society B, 57, 45–70.

Hansen, B. E. (2007), Least Squares Model Averaging. Econometrica, forthcoming.

Hjort, N. L. and Claeskens, G. (2003), Frequentist Model Average Estimators. Journal of the American Statistical Association, 98, 879–899.

Liski, E. P. (2006), Normalized ML and the MDL Principle for Variable Selection in Linear Regression. In: Festschrift for Tarmo Pukkila on His 60th Birthday, 159–172.

Raftery, A. E., Madigan, D. and Hoeting, J. A. (1997), Bayesian Model Averaging for Linear Regression Models. Journal of the American Statistical Association, 92, 179–191.

Rissanen, J. (2000), MDL Denoising. IEEE Transactions on Information Theory, 46, 2537–2543.

Rissanen, J. (2007), Information and Complexity in Statistical Modeling, Springer.
