
3.2 Framework for independent observations

3.2.1 Generalized linear models

Throughout this thesis the outcome variable Y has two values. Generally, the values of this type of binary variable Y are encoded as 1 and 0. The probability of Y being one is denoted by P(Y = 1) = p and the probability of Y being zero by P(Y = 0) = 1 − p. The ratio of probabilities, p/(1 − p), is called the odds, and the logarithm of the odds, logit(p) = log(p/(1 − p)), is called the logit.

The binary variable can be modelled by a generalized linear model (McCullagh and Nelder, 1989). In generalized linear models, the density/probability function of the independent random variable Y_i, i = 1, . . . , n, can be expressed as

\[
f_{Y_i}(y_i; \theta_i, \phi) = \exp\left\{ \frac{a_i\,[\,y_i\theta_i - b(\theta_i)\,]}{\phi} + c(y_i, \phi) \right\}, \tag{3.1}
\]

where a_i, i = 1, . . . , n, represent known weights, b and c are some known functions, and θ_i, i = 1, . . . , n, and φ are unknown parameters.

The distribution given by (3.1) belongs to the exponential family. The exponential family includes many of the most common distributions such as the normal, exponential, gamma, chi-squared, Bernoulli, Wishart and many others. Probability distributions of the outcome Y_i in generalized linear models are usually parametrized in terms of the mean µ_i and the dispersion parameter φ instead of the natural parameter θ_i. For example, for each observation i, for the binary Y_i ∼ Bin(1, p_i), p_i ∈ (0, 1), the distribution is P(Y_i = y_i) = p_i^{y_i}(1 − p_i)^{1−y_i}, y_i = 0, 1. The notation Y_i ∼ Bin(1, p_i) has the same meaning as Y_i ∼ B(p_i), which is a Bernoulli distribution with probability p_i. The distribution of the independent Bernoulli trials with probability p_i for observation i can be written as

\[
\begin{aligned}
P(Y_i = y_i) &= p_i^{y_i}(1 - p_i)^{1 - y_i} \\
&= \left(\frac{p_i}{1 - p_i}\right)^{y_i}(1 - p_i) \\
&= \exp\left\{\left[\log\frac{p_i}{1 - p_i}\right] y_i\right\}(1 - p_i) \\
&= \exp\left\{ y_i \log\frac{p_i}{1 - p_i} + \log(1 - p_i)\right\} \\
&= \exp\{ y_i\theta_i - \log(1 + \exp(\theta_i))\}, \qquad (3.2)
\end{aligned}
\]

where θ_i = log(p_i/(1 − p_i)).

The distribution (3.2) is a member of the exponential family. It follows from (3.1) by choosing φ = 1, a_i = 1, b(θ_i) = log(1 + exp(θ_i)) and c(y_i, φ) = 0.

The generalized linear model generalizes linear regression by allowing the linear model to be related to the response variable via an invertible link function g(·). The link function provides the relationship between the linear predictor η_i and the mean µ_i of the distribution, g(µ_i) = η_i. There are a number of popular link functions, such as the logit link function

\[
\eta_i = g(p_i) = \log\frac{p_i}{1 - p_i} = X_i\beta, \tag{3.3}
\]

where p_i ∈ (0, 1), i = 1, . . . , n, and X_i is a row of the design matrix X. The logit link has been chosen in this thesis because it is appropriate for the analysis and allows comparison with results from the literature. In Equation (3.3), β is the parameter vector (a column vector).
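To make the link concrete, here is a minimal numerical sketch of the logit link (3.3) and its inverse; the values of X_i and β are hypothetical, not from the thesis data:

```python
import numpy as np

def logit(p):
    """Logit link g(p) = log(p / (1 - p)) for p in (0, 1)."""
    return np.log(p / (1.0 - p))

def expit(eta):
    """Inverse logit g^{-1}(eta) = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

X_i = np.array([1.0, 0.5])            # one row of the design matrix (hypothetical)
beta = np.array([-1.0, 2.0])          # hypothetical coefficient vector
eta_i = X_i @ beta                    # linear predictor eta_i = X_i beta
p_i = expit(eta_i)                    # P(Y_i = 1)
assert np.isclose(logit(p_i), eta_i)  # the link maps p_i back to eta_i
```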

The mean and variance of the outcome variable Y_i from the exponential family of distributions (3.1) are obtained from the first and second derivatives of b with respect to θ_i; for a_i = φ = 1 they are E(Y_i) = b′(θ_i) = µ_i = g^{-1}(X_iβ) and var(Y_i) = b′′(θ_i). The primes denote derivatives with respect to θ_i. For the Bernoulli distribution

\[
E(Y_i) = b'(\theta_i) = \frac{e^{\theta_i}}{1 + e^{\theta_i}} = p_i
\]

and

\[
\operatorname{var}(Y_i) = b''(\theta_i) = \frac{e^{\theta_i}}{(1 + e^{\theta_i})^2} = p_i(1 - p_i).
\]

The column vector y contains elements y_i (the outcomes of Y), which are the observed {0, 1} values for each observation i. The joint probability function of the observations is

\[
f(y \mid \beta) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}, \tag{3.4}
\]

where p_i depends on β. This is a likelihood function with respect to β. The likelihood function can be expressed as

\[
L(\beta \mid y) = \prod_{i=1}^{n} \left(\frac{p_i}{1 - p_i}\right)^{y_i}(1 - p_i). \tag{3.5}
\]

Applying the exponential function to both sides of expression (3.3) and solving for p_i, we get

\[
\frac{p_i}{1 - p_i} = e^{X_i\beta} \;\Leftrightarrow\; p_i = \frac{e^{X_i\beta}}{1 + e^{X_i\beta}}.
\]

The solved terms p_i/(1 − p_i) and p_i are substituted into Equation (3.5):

\[
L(\beta \mid y) = \prod_{i=1}^{n} (e^{X_i\beta})^{y_i}\left(1 - \frac{e^{X_i\beta}}{1 + e^{X_i\beta}}\right) = \prod_{i=1}^{n} e^{y_i X_i\beta}\,(1 + e^{X_i\beta})^{-1}.
\]

Hence, the log-likelihood based on all data is

\[
l(\beta) = \sum_{i=1}^{n} y_i X_i\beta - \sum_{i=1}^{n} \log(1 + e^{X_i\beta}),
\]

and for unit i it is

\[
l(\beta) = y_i X_i\beta - \log(1 + e^{X_i\beta}). \tag{3.6}
\]

3.2.2 Outcome model and measurement error model

For independent observations the outcome model is defined as

\[
P(Y = 1 \mid x, Z) = \pi(x\beta_x + Z\beta_Z) = \frac{\exp(x\beta_x + Z\beta_Z)}{1 + \exp(x\beta_x + Z\beta_Z)}, \tag{3.7}
\]

where Y is a binary outcome variable which depends on the variables x : 1 × m and Z : 1 × m_Z through a logistic regression model, β_Z : m_Z × 1 is a coefficient vector and the parameter of interest is β_x : m × 1.
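As a small sketch (our own notation and test values, not the thesis data), the outcome model (3.7) and the per-unit log-likelihood (3.6) translate directly into code:

```python
import numpy as np

def outcome_prob(x, Z, beta_x, beta_Z):
    """P(Y = 1 | x, Z) of model (3.7): inverse logit of x beta_x + Z beta_Z."""
    eta = x @ beta_x + Z @ beta_Z
    return 1.0 / (1.0 + np.exp(-eta))

def loglik_unit(y_i, eta_i):
    """Per-unit logistic log-likelihood y_i eta_i - log(1 + exp(eta_i)), Eq. (3.6)."""
    return y_i * eta_i - np.log1p(np.exp(eta_i))

# One subject with x : 1 x m and Z : 1 x m_Z (hypothetical values).
x = np.array([0.3, -1.2]); Z = np.array([1.0])
beta_x = np.array([0.8, -0.5]); beta_Z = np.array([0.2])
print(outcome_prob(x, Z, beta_x, beta_Z),
      loglik_unit(1, float(x @ beta_x + Z @ beta_Z)))
```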

Below we will present the measurement error model for independent observations. The framework is extended for analysing cluster-correlated data in Section 3.3.

Instead of the variables x : 1 × m, sometimes surrogate variables X : 1 × m_X are observed. The surrogate variables X are related to the variables x by a multivariate normal linear model

\[
X = x\alpha_x + Z\alpha_Z + \varepsilon_X, \tag{3.8}
\]

where the measurement error is ε_X ∼ N(0, σ_X² I), the parameter matrix α_x : m × m_X includes the regression coefficients for x, and the matrix α_Z : m_Z × m_X is the regression coefficient matrix for Z. This model is the measurement error model.

Because the observed surrogate variables X are occasionally used to predict the unobserved variables x, a secondary study objective is to estimate the conditional distribution of x given X. We assume that x ∼ N(0, Σ_x). Under these assumptions the conditional distribution of x given X is multivariate normal, and x can be modelled as

\[
x = X\gamma_X + Z\gamma_Z + \varepsilon_x, \tag{3.9}
\]

where the measurement error is ε_x ∼ N(0, Σ_{x|X}) and the γ's are the matrices of regression coefficients γ_X : m_X × m and γ_Z : m_Z × m (Messer and Natarajan, 2008). Information on the measurement error parameters γ_X and γ_Z is obtained from a validation study. Conditional on x and Z, the distribution of Y is assumed independent of X, which means that the measurement error is non-differential. Recall that Z was assumed to be measured without error and for all study subjects. Later we will omit Z and use the distribution of x given X, i.e. we use the log-likelihood −½(log det(Σ_{x|X}) + (x − Xγ_X)Σ_{x|X}^{-1}(x − Xγ_X)^T).
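In practice, γ_X and Σ_{x|X} in model (3.9) can be estimated from validation data by multivariate least squares. A sketch with simulated stand-in data (Z omitted, as in the text; all names and values are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n_v, m_X, m = 200, 2, 2                 # validation size and dimensions (hypothetical)

gamma_true = rng.normal(size=(m_X, m))  # "true" coefficients for the simulation
X = rng.normal(size=(n_v, m_X))         # observed surrogate variables
x = X @ gamma_true + 0.3 * rng.normal(size=(n_v, m))  # model (3.9) with Z omitted

gamma_hat, *_ = np.linalg.lstsq(X, x, rcond=None)     # least-squares estimate of gamma_X
resid = x - X @ gamma_hat
Sigma_hat = resid.T @ resid / (n_v - m_X)             # estimate of Sigma_{x|X}
```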

As described earlier in Section 3.1, for an internal validation design, a data matrix (Y, x, X) is available on each subject in the validation sub-sample. In the main study sample, (Y, X) are available, but x is not. Therefore, concepts from the missing data literature can be applied when conceptualizing measurement error structures. We now define concepts and terminology for measurement error adjustment approaches for a study design where a main dataset and an internal validation dataset are available.

Definition 1. If validation study subjects constitute a simple random sample from main study subjects, the measurement error mechanism will be called measurement error completely at random (MECAR).

A more relaxed assumption about the missing data mechanism is missing at random. For measurement error, it can be defined as follows:

Definition 2. If the validation sample is stratified on covariates Z, the covariates Z are included in models (3.7) and (3.9), and the models are correct, the measurement error mechanism will be called measurement error at random (MEAR).

In both cases, MECAR and MEAR, the main point is that the conditional distribution of x|Y, X, Z will be the same for a validation study subject for which x is observed, as for a main study subject for which x is not observed.

In external validation designs, the main study and the validation study use independent samples. In the validation sample, (x, X) are observed. In the main study, (Y, X) are observed. Note that x and Y are not observed together for any subject because the main and validation studies use independent samples. The MEAR condition is now that the conditional distribution x|Y, X, Z should be the same for main study and validation study subjects, with a similar requirement for Y|x, X, Z. For an external validation design, no direct observations from either conditional distribution are available. Messer and Natarajan (2008, p. 6,338) have stated that in practice, the assertion that the data are MEAR may depend more strongly on a priori modelling assumptions for an external design than for an internal design.

In the simulation studies of Chapters 6, 7 and 8, an internal validation design is used. The missing data mechanism is controlled in the simulation experiments. In the first simulation, the missing data mechanism is MECAR. In the last two simulation experiments, different types of validation datasets are tested. In these experiments, the missing data mechanism is either MECAR or MEAR.

3.2.3 Likelihood functions

Likelihood functions are needed to obtain the MI, RC and ML estimates in later sections. Three likelihood functions l_1, l_2, l_3 are derived, as in the paper by Messer and Natarajan (2008), for the cases: a) (Y, x, X) observed, b) (Y, X) observed and x missing, and c) the full sample likelihood. The variables Z are observed on all subjects. It is also assumed that Z is measured without error. Models (3.7) and (3.8) include the variables Z explicitly, but often Z is omitted from the notation; we do so as well. We will assume that Y is an outcome variable with binary values, X : 1 × m_X is a vector of surrogate variables, and x : 1 × m is a vector of the original variables for internal validation and a matrix of the original variables for external validation (see Figure 3.1). The regression coefficients are β_x : m × 1 and γ_X : m_X × m, and Σ_{x|X} is the variance of the measurement error ε_x. Now we derive the likelihood functions for each of the cases a)–c).

a) Let (Y, x, X) be observed. The joint distribution of (Y, x) given X, f(Y, x|X), can be derived by noting that models (3.7) and (3.9) imply that the outcome variable Y is conditionally independent of the surrogate variables X whenever x is observed. Thus, the conditional distribution of the outcome variable Y given (x, X) is f(Y|x, X) = f(Y|x), and hence the joint distribution of (Y, x) given X is f(Y, x|X) = f(Y|x, X)f(x|X) = f(Y|x)f(x|X).

Hence, the log-likelihood for Y given (x, X) consists of two parts, l_1 and l_2:

\[
l(\beta_x, \gamma_X, \Sigma_{x|X};\, Y, x \mid X) = l_1(\beta_x;\, Y \mid x) + l_2(\gamma_X, \Sigma_{x|X};\, x \mid X), \tag{3.10}
\]

where l_1 is a logistic regression log-likelihood,

\[
l_1(\beta_x;\, Y \mid x) = Y x\beta_x - \log(1 + \exp(x\beta_x)),
\]

as was derived in Equation (3.6), and l_2 is a normal regression log-likelihood,

\[
l_2(\gamma_X, \Sigma_{x|X};\, x \mid X) = -\tfrac{1}{2}\left[\log\det(\Sigma_{x|X}) + (x - X\gamma_X)\Sigma_{x|X}^{-1}(x - X\gamma_X)^T\right].
\]
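The term l_2 also translates directly into code; a sketch for a single validation observation (function and variable names are ours):

```python
import numpy as np

def l2(gamma_X, Sigma, x_i, X_i):
    """Normal regression log-likelihood of the text:
    -1/2 [ log det(Sigma) + r Sigma^{-1} r^T ] with r = x_i - X_i gamma_X."""
    r = x_i - X_i @ gamma_X
    _, logdet = np.linalg.slogdet(Sigma)       # stable log-determinant
    return -0.5 * (logdet + r @ np.linalg.solve(Sigma, r))
```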

b) Let (Y, X) be observed and x missing. From the joint distribution f(Y, x|X) = f(Y|x, X)f(x|X), the marginal distribution of Y|X is obtained by integrating the unobserved variables x out of the joint distribution:

\[
f(Y \mid X;\, \beta_x, \gamma_X, \Sigma_{x|X}) = \int f(Y \mid x, X;\, \beta_x)\, f(x \mid X;\, \gamma_X, \Sigma_{x|X})\, dx. \tag{3.11}
\]

If Equation (3.9) is plugged into Equation (3.7), we get P(Y = 1|x, Z) expressed in terms of X, ε_x and the parameters γ and β_x, i.e. model (3.11), the likelihood for the logistic-normal effects model. A univariate integral is obtained when the variable v = ε_xβ_x is substituted in the log-likelihood

\[
\begin{aligned}
&l_3(\beta_x, \gamma_X, \Sigma_{x|X};\, Y \mid X) \\
&\quad = \log\left\{ (2\pi\,\beta_x^T\Sigma_{x|X}\beta_x)^{-1/2} \int \frac{\exp((X\gamma_X\beta_x + v)Y)}{1 + \exp(X\gamma_X\beta_x + v)} \exp\left(-\frac{1}{2}\, v^T(\beta_x^T\Sigma_{x|X}\beta_x)^{-1} v\right) dv \right\}. \tag{3.12}
\end{aligned}
\]
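The univariate integral in (3.12) can be approximated with Gauss-Hermite quadrature, in the spirit of the quadrature formulas cited in Section 4.1. A sketch for one main-study subject (names and the node count are ours):

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def l3(beta_x, gamma_X, Sigma, Y, X_i, n_nodes=30):
    """Gauss-Hermite approximation of the marginal log-likelihood (3.12).
    v = eps_x beta_x is N(0, s2) with s2 = beta_x' Sigma beta_x."""
    s2 = float(beta_x @ Sigma @ beta_x)
    mu = float(X_i @ gamma_X @ beta_x)         # X gamma_X beta_x
    t, w = hermgauss(n_nodes)                  # nodes/weights for int e^{-t^2} h(t) dt
    eta = mu + np.sqrt(2.0 * s2) * t           # change of variable v = sqrt(2 s2) t
    integrand = np.exp(eta * Y) / (1.0 + np.exp(eta))
    return float(np.log((w * integrand).sum() / np.sqrt(np.pi)))
```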

c) Full sample likelihood. For an internal design, the full sample log-likelihood is the sum over study subjects in the validation study and in the main study:

\[
\begin{aligned}
l(\beta_x, \gamma_X, \Sigma_{x|X}) ={}& \sum_{i \in \text{validation}} \left( l_1(\beta_x;\, Y_i \mid x_i) + l_2(\gamma_X, \Sigma_{x|X};\, x_i \mid X_i) \right) \\
&+ \sum_{i \in \text{main}} l_3(\beta_x, \gamma_X, \Sigma_{x|X};\, Y_i \mid X_i). \qquad (3.13)
\end{aligned}
\]
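Combining the pieces, the internal-design log-likelihood (3.13) is a sum of the three terms over the two subsamples. A schematic sketch, assuming the loglik_unit, l2 and l3 sketches above are in scope:

```python
def full_loglik(beta_x, gamma_X, Sigma, validation, main):
    """Full sample log-likelihood (3.13) for an internal design.
    `validation` holds triples (Y_i, x_i, X_i); `main` holds pairs (Y_i, X_i)."""
    total = 0.0
    for Y_i, x_i, X_i in validation:
        total += loglik_unit(Y_i, float(x_i @ beta_x))   # l1 term, Eq. (3.6)
        total += l2(gamma_X, Sigma, x_i, X_i)            # l2 term
    for Y_i, X_i in main:
        total += l3(beta_x, gamma_X, Sigma, Y_i, X_i)    # l3 term, Eq. (3.12)
    return total
```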

Because for an external design the variables x and the surrogate variables X are observed in the validation study only, the term l_1 becomes zero, and the full sample likelihood is

\[
l(\beta_x, \gamma_X, \Sigma_{x|X}) = \sum_{i \in \text{validation}} l_2(\gamma_X, \Sigma_{x|X};\, x_i \mid X_i) + \sum_{i \in \text{main}} l_3(\beta_x, \gamma_X, \Sigma_{x|X};\, Y_i \mid X_i). \tag{3.14}
\]

3.3 Framework for cluster-correlated data

Cluster-correlated data arise when there is a hierarchical or clustered structure in the sample or population. As a consequence, observations within a cluster are more alike than observations from different clusters. Intra-cluster correlation (ICC) describes how strongly units in the same group resemble each other. In the literature, there are several definitions and formulas for intra-cluster correlation, and many different ways of calculating it have been presented. A commonly used definition of ICC is based on a decomposition of the total variance into the within-cluster (s_w²) and between-cluster (s_b²) variabilities. Intra-cluster correlation is defined as

\[
\mathrm{ICC} = \frac{s_b^2}{s_b^2 + s_w^2}.
\]
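For balanced clusters, a common moment estimator obtains s_b² and s_w² from the one-way ANOVA mean squares. A sketch of one such estimator (this is only one of the several ICC formulas mentioned above, and it assumes equal cluster sizes):

```python
import numpy as np

def icc(values, clusters):
    """One-way ANOVA estimate of ICC = s_b^2 / (s_b^2 + s_w^2), balanced clusters."""
    groups = [values[clusters == c] for c in np.unique(clusters)]
    k, n = len(groups), len(groups[0])                 # clusters, units per cluster
    grand = values.mean()
    msb = n * sum((g.mean() - grand) ** 2 for g in groups) / (k - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (k * (n - 1))
    s2b = max((msb - msw) / n, 0.0)                    # between-cluster variance
    return s2b / (s2b + msw)                           # msw estimates s_w^2
```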

Measurement error, too, can obviously be intra-cluster correlated, and this should be taken into account in order to develop reliable and effective measurement error adjustment methods. As introduced in Section 3.2 for independent observations, two models are incorporated in the adjustment process: the measurement error model (3.8) and the outcome model (3.7). Measurement error models for cluster-correlated data are discussed in more detail in Section 3.3.3. Because we are working with a binary outcome variable, the outcome model is a logistic mixed model; it is presented in Section 3.3.2.

A generalized linear mixed model (GLMM) provides a useful approach for analysing a wide variety of data structures, including cluster-correlated data. It combines the properties of two statistical models: linear mixed models and generalized linear models (Nelder and Wedderburn, 1972; McCulloch and Searle, 2001; Demidenko, 2004). Linear mixed models are utilized when incorporating random effects. By using proper link functions, generalized linear models can handle non-normal data, e.g. binary data.

3.3.1 Linear mixed models

A linear mixed model contains both fixed effects and random effects. Using standard notation, the model can be represented as

\[
Y = X\beta + Wu + \varepsilon, \tag{3.15}
\]

where Y : n × 1 is the vector of observations, X : n × m is the design matrix for the fixed effects, β : m × 1 is the parameter vector of fixed effects, W : n × q is the design matrix for the random effects, u : q × 1 is the vector of random effects, and ε : n × 1 is the vector of random error terms.

We will assume that u and ε are normally distributed uncorrelated random variables with zero means and variance-covariance matrices var(u) = G and var(ε) = R, i.e.

\[
E\begin{bmatrix} u \\ \varepsilon \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad \operatorname{var}\begin{bmatrix} u \\ \varepsilon \end{bmatrix} = \begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix}, \qquad \operatorname{cov}(u, \varepsilon) = 0.
\]

Therefore, the variance of Y is

\[
V = \operatorname{var}(Y) = WGW^T + R \tag{3.16}
\]

and the expectation of Y is E(Y) = Xβ; the vector of observations will be assumed normally distributed, Y ∼ N(Xβ, WGW^T + R).

Estimation is more demanding in a linear mixed model than in a linear fixed-effect model. In addition to estimating β, the random effects u must be predicted and the unknown covariance matrices G and R estimated.

Estimates for the parameters β and u are obtained by solving the mixed model equations (Henderson, 1950):

\[
\begin{bmatrix} X^T R^{-1} X & X^T R^{-1} W \\ W^T R^{-1} X & W^T R^{-1} W + G^{-1} \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{u} \end{bmatrix} = \begin{bmatrix} X^T R^{-1} Y \\ W^T R^{-1} Y \end{bmatrix}
\]

to find the best linear unbiased estimator (BLUE) of β and the best linear unbiased predictor (BLUP) of u. The estimates are best in the sense that they minimize the sampling variance, and they are linear functions of Y (Puntanen et al., 2011). They are unbiased, i.e. E[BLUE(β)] = β and E[BLUP(u)] = u (Robinson, 1991).

The solutions of the mixed model equations can be written as

\[
\begin{aligned}
\hat{\beta} &= (X^T V^{-1} X)^{-1} X^T V^{-1} Y, \\
\hat{u} &= G W^T V^{-1} (Y - X\hat{\beta}),
\end{aligned}
\]

where V = var(Y) = WGW^T + R. The parameters G and R, the variances of the random effects and residuals, are usually unknown. Typically these parameters are estimated and plugged into the predictor. The word empirical is often added to indicate such an approximation; thus BLUE and BLUP become EBLUE (the empirical best linear unbiased estimator) and EBLUP (the empirical best linear unbiased predictor).
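For known G and R, the closed-form solutions translate directly into code; a small numpy sketch (the matrix inputs are assumed conformable):

```python
import numpy as np

def blue_blup(X, W, Y, G, R):
    """BLUE of beta and BLUP of u for known G and R:
    beta = (X' V^{-1} X)^{-1} X' V^{-1} Y,  u = G W' V^{-1} (Y - X beta),
    where V = W G W' + R."""
    V = W @ G @ W.T + R
    Vinv_X = np.linalg.solve(V, X)
    Vinv_Y = np.linalg.solve(V, Y)
    beta = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)
    u = G @ W.T @ np.linalg.solve(V, Y - X @ beta)
    return beta, u
```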

Because mixed models contain the unknown parameters u, G and R, ordinary least squares is not the best estimation method. A more appropriate method is generalized least squares (GLS), where

\[
(Y - X\beta)^T V^{-1} (Y - X\beta) \tag{3.17}
\]

is minimized. However, to solve Equation (3.17), knowledge of G and R is required to obtain the variance of Y (see Equation (3.16)). Thus, the aim is to find appropriate estimates of G and R. The variance-covariance parameters are usually estimated using maximum likelihood (ML) or restricted maximum likelihood (REML) approaches.

Many software packages allow the fitting of linear mixed models; one example is shown below. Computational tools are discussed in more detail in Section 5.3.
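For instance, in Python a random-intercept linear mixed model can be fitted with the statsmodels package; a sketch on simulated data (all values hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
cluster = np.repeat(np.arange(20), 10)             # 20 clusters of 10 units
u = rng.normal(0.0, 1.0, 20)[cluster]              # cluster random intercepts
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + u + rng.normal(0.0, 0.5, 200)
df = pd.DataFrame({"y": y, "x": x, "cluster": cluster})

# Random-intercept model; the variance components are estimated by REML.
fit = smf.mixedlm("y ~ x", data=df, groups=df["cluster"]).fit()
print(fit.summary())
```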

3.3.2 Generalized linear mixed models

A generalized linear mixed model is an extension of the generalized linear model discussed in Section 3.2.1. In the generalized linear mixed model, the linear predictor contains random effects in addition to the fixed effects. Just as the generalized linear model generalizes linear regression by relating the linear model to the response variable via a link function g(·), in the generalized linear mixed model the link function connects a linear predictor and the response variable, but the linear predictor is now a combination of the fixed and random effects.

In a generalized linear mixed model, the fixed and random effects are combined to form a linear predictor η = Xβ + Wu, where X : n × m_X is the design matrix for the fixed effects, β : m_X × 1 is a vector of fixed effects, W : n × q is the design matrix for the random effects, and u : q × 1 is a vector of random effects.

The relationship between the vector of observations Y : n × 1 and the linear predictor η is modelled through the conditional distribution of Y given u:

\[
Y \mid u \sim (g^{-1}(\eta),\, R),
\]

where E[Y|u] = g^{-1}(η) = g^{-1}(Xβ + Wu) is the mean and R is the covariance matrix of that conditional distribution.

For cluster-correlated data, a logistic mixed model is considered as the outcome model for the binary outcome variable. It is a special case of the generalized linear mixed model with inverse link function g^{-1}(·) = exp(·)/(1 + exp(·)). In our case, the fixed part of the linear predictor contains the data x : n_v × m and Z : n_v × m_Z, and the random part contains a vector u : q × 1 of random effects.

We assume that the conditional mean of Y is related to the variables x, Z and W through a monotonic differentiable link function g(·). Thus, the outcome model is a generalized linear mixed model of Y given x and Z, specified by

\[
E(Y \mid u) = \mu = g^{-1}(\eta) = g^{-1}(x\beta_x + Z\beta_Z + Wu), \tag{3.18}
\]

where η = xβ_x + Zβ_Z + Wu is the linear predictor, W : n × q is a design matrix for the random effects u : q × 1, β_x are the regression coefficients for x, Z are error-free data and β_Z are the regression coefficients for Z. In the binary case, var(Y_i|u) is determined by E(Y_i|u). Recall that β_x is the parameter of interest.
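To make model (3.18) concrete, the following sketch simulates cluster-correlated binary data with a random intercept per cluster (all dimensions and parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
C, n_per = 50, 8                         # number of clusters and cluster size
cluster = np.repeat(np.arange(C), n_per)

x = rng.normal(size=(C * n_per, 1))      # covariate entering the fixed part
Z = np.ones((C * n_per, 1))              # error-free covariate (here an intercept)
beta_x, beta_Z = np.array([1.0]), np.array([-0.5])
u = rng.normal(0.0, 0.7, C)              # cluster-level random intercepts

eta = x @ beta_x + Z @ beta_Z + u[cluster]  # linear predictor of (3.18)
p = 1.0 / (1.0 + np.exp(-eta))              # inverse logit link
Y = rng.binomial(1, p)                      # binary outcome
```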

Parameter estimation in generalized linear mixed models typically involves the method of maximum likelihood or some of its variants. The solutions are usually iterative and can be numerically quite intensive. Various methods have been proposed to approximate the likelihood when estimating generalized linear mixed model parameters, such as pseudo- and penalized quasi-likelihood (PQL) (Wolfinger and O'Connell, 1993; Schall, 1991; Breslow and Clayton, 1993), Markov chain Monte Carlo (MCMC) algorithms (Zeger and Karim, 1991) and Laplace approximations (Raudenbush et al., 2000).

3.3.3 Measurement error model

Next we will present the study design with a main data and validation data approach for cluster-correlated data. Methods for measurement error adjustment in generalized linear mixed models have been studied, for example, in Wang and Davidian (1996); Wang et al. (1998); Tosteson et al. (1998); Wang et al. (1999); Lin and Carroll (1999); Wang et al. (2000); Buonaccorsi et al. (2000); Zhong et al. (2002); Li et al. (2005); Shen et al. (2008); Bartlett et al. (2009); Li and Wang (2012).

In Section 3.2.2, the measurement error model for the case of independent observations was defined in Equations (3.8) and (3.9). For cluster-correlated data we must allow non-zero within-cluster correlation. For a measurement error model in the cluster-correlated case, we will introduce a mixed model for x : n_v × m, so that

\[
x = X\gamma_X + Z\gamma_Z + Bb + \varepsilon_x, \tag{3.19}
\]

where γ_X and γ_Z are matrices of regression coefficients, X : n_v × m_X, Z : n_v × m_Z and B : n_v × q are design matrices, b : q × m is a matrix of random effects and ε_x : n_v × m is a matrix of random errors, where each row is a vector with zero mean and covariance matrix Σ_x.

In the following sections, we consider a measurement error model that contains cluster-level random intercepts only. In this case B : n_v × C is a matrix of special structure, where C is the number of clusters. The rows of B correspond to units and the columns to clusters. Each row contains zeroes and a single 1, identifying the cluster to which the unit belongs. The matrix b : C × m consists of random effects; there is one row for each cluster, containing the random effects for all m variables.
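The random-intercept design matrix B is easy to construct from cluster labels; a sketch (the function name is ours):

```python
import numpy as np

def intercept_design(cluster_labels, C):
    """B : n_v x C for cluster-level random intercepts:
    row i has a single 1 in the column of unit i's cluster."""
    B = np.zeros((len(cluster_labels), C))
    B[np.arange(len(cluster_labels)), cluster_labels] = 1.0
    return B

print(intercept_design(np.array([0, 0, 1, 2, 2]), C=3))
# [[1. 0. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 0. 1.]]
```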

4 Methods for measurement error adjustment

In this chapter, we introduce three methods for adjusting for covariate measurement error: maximum likelihood, multiple imputation and regression calibration. The methods are first discussed for independent observations. Multiple imputation and regression calibration are then extended to the case of correlated observations.

4.1 Maximum likelihood for measurement error adjustment

We begin with the maximum likelihood (ML) method. For an internal validation design, the full sample likelihood function was defined in Equation (3.13), and for an external validation design in Equation (3.14). The likelihood models (3.13) and (3.14) are a mixture of exponential family models, including linear and logistic models. If the model is correct, the ML estimators of the parameters have desirable properties: they are consistent, asymptotically normal, and have the smallest asymptotic MSE among all 'regular' estimators (Messer and Natarajan, 2008).

Because of these properties of ML estimators, and motivated by some earlier publications (e.g. Spiegelman et al., 2000; Messer and Natarajan, 2008), the ML estimator is set as the reference method for comparison in the first simulation study, which is conducted on independent observations. The challenge in the computation of the full sample likelihood is to solve the integral term l_3 given in Equation (3.12). In the random effects literature, efficient numerical quadrature formulas for Equation (3.12) have been presented (Messer and Natarajan, 2008). Because the term l_3 is a normal-logistic random effect model, the ML estimate can be computed using standard software such as the procedure NLMIXED in SAS.

4.2 Multiple imputation

This section will focus on multiple imputation (MI) for measurement error adjustment. The notation and central concepts of MI are introduced. For a more extended treatment of the subject, see for example Rubin (1976); Little and Rubin (1987, 2002); Rubin (1996); Harel and Zhou (2007); Rässler et al. (2008); Yucel (2011).

Multiple imputation is usually applied to missing data problems. Missing data appear in almost all survey research. Non-response is the failure to obtain data from a selected sample unit. There are two principal types of non-response: unit non-response and item non-response. Unit non-response occurs when a respondent does not respond at all, e.g. does not fill out or return a data collection instrument. In the item non-response case, the respondent does not respond to one or more items on a survey.

4.2.1 Missing data mechanism

Many methods for handling missing data are based on assumptions about the missing data mechanism. A missing data mechanism describes the relationship between the measured variables and the probability of missing data. The key concepts of missing data mechanisms were originally formalized by Rubin (1976). Three different missing data mechanisms are presented: missing completely at random, missing at random and not missing at random.

If the missing data values are a simple random sample of all data values, the data are missing completely at random (MCAR) (Little and Rubin, 1987). In other words, missingness is completely unsystematic. The probability of missing data on some variable X is unrelated to other measured variables and to the value of X itself. This missing data mechanism is quite rare in practice and mainly a theoretical special case.

More common in practice is the missing at random (MAR) case, i.e. the probability that an observation is missing may depend on an observed part of the data but not on the missing part of the data (Little and Rubin, 1987).

If data are not MAR or MCAR, then the data are not missing at random (NMAR) (Little and Rubin, 1987). The case of NMAR is difficult to deal with in practice. Most strategies for dealing with missing data assume that missingness is MCAR or at least MAR. If missing data are MCAR or MAR
