
Adjustment for Covariate Measurement Errors in Complex Surveys: A Simulation Study of Three Competing Methods

Maria Valaste

Academic dissertation to be presented, by permission of the Faculty of Social Sciences of the University of Helsinki, in lecture room PIII, Porthania, Yliopistonkatu 3, on March 27, 2015, at 12.00.

Helsinki 2015


Picture of the cover: Kristiina Dammert

Printed version

ISBN 978-951-51-0845-6
Painotalo Kyriiri Oy, Helsinki 2015

PDF version

ISBN 978-951-51-0846-3

http://ethesis.helsinki.fi/


Abstract

In sample surveys, the uncertainty of parameter estimates comes from two main sources: sampling and measuring the study units. Some aspects of survey errors, such as sampling errors and nonresponse errors, are quite well understood and reported, but others, like measurement errors, are often neglected. This thesis studies measurement uncertainty in covariates.

The focus is on the adjustment for covariate measurement errors in logistic regression for cluster-correlated data. Three methods for the adjustment for covariate measurement errors in surveys are studied: Maximum Likelihood, Multiple Imputation and Regression Calibration. These methods require information obtained from a validation study.

The thesis consists of a theoretical part and extensive Monte Carlo simulation experiments. In the first simulation experiment, the study is conducted with artificial data and independent observations in order to test the three methods (MI, ML and RC) and to gain experience with them. The second and third simulation studies are performed with cluster-correlated data; the former uses artificial data and the latter real data.

In both of these simulations, the regression calibration and multiple imputation approaches are examined in various simulation designs.

The quality of the methods is assessed in terms of bias and accuracy. Bias is measured by absolute relative bias percentages (ARB%) and accuracy by relative root mean-squared error percentages (RRMSE%).
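For concreteness, these two quality measures can be computed from a set of Monte Carlo estimates as follows. This is a minimal numpy sketch of my own (the function names are illustrative); it assumes the standard definitions ARB% = |mean(θ̂) − θ| / |θ| · 100 and RRMSE% = sqrt(mean((θ̂ − θ)²)) / |θ| · 100:

```python
import numpy as np

def arb_percent(estimates, true_value):
    """Absolute relative bias percentage: |mean(estimates) - true| / |true| * 100."""
    estimates = np.asarray(estimates, dtype=float)
    return abs(estimates.mean() - true_value) / abs(true_value) * 100.0

def rrmse_percent(estimates, true_value):
    """Relative root mean-squared error percentage: RMSE / |true| * 100."""
    estimates = np.asarray(estimates, dtype=float)
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    return rmse / abs(true_value) * 100.0

# Example: two estimates that straddle the true value 1.0 have (essentially) zero
# bias but a nonzero RMSE.
print(arb_percent([1.1, 0.9], 1.0))    # approximately 0.0
print(rrmse_percent([1.1, 0.9], 1.0))  # approximately 10.0
```

Note that RRMSE% combines both bias and variance, so an estimator can have a small ARB% and still a large RRMSE%.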

The results suggest that additional information from validation (calibration) data enables more accurate estimates in terms of bias percentages.

Keywords: covariate measurement error, simulation, regression calibration, multiple imputation, maximum likelihood.


Acknowledgements

I would like to express my deepest gratitude to my advisors Professor Risto Lehtonen, Professor Emeritus Lauri Tarkkonen and University Lecturer Kimmo Vehkalahti. I could not have imagined having better advisors and mentors for my Ph.D. studies. Warmest thanks to Risto for his excellent guidance in my studies and also in working life. Sincere thanks to Lauri, who introduced me to the problem of measurement errors and to many other interesting things in life. I highly appreciate Kimmo’s endless support, positive spirit and guidance into the academic world.

I wish to thank the examiners of my thesis, Docent Imbi Traat and Professor Mervi Eerola, for many insightful comments and suggestions to improve my thesis. I am grateful to Professor Emeritus Hannu Niemi for commenting on my thesis when it was still a draft. Professor Juha Alho gave valuable comments at the research seminar in statistics.

I started this thesis while working at the Department of Mathematics and Statistics, University of Helsinki, and finalized it while working at the Social Insurance Institution of Finland (Kela). I wish to thank the people at both of these organisations. I am privileged to have had wonderful people around me. At Kela, I especially wish to thank “table number nine”, which has provided well-earned breaks during the working days.

Finally, I wish to thank my friends and family for supporting me throughout all my studies.

Helsinki, February 2015

Maria Valaste


Notation

General

E expectation (E, var and cov are often applied to vectors)
var variance
cov covariance
l likelihood function

For independent observations

n number of observations in the main data
n_v number of observations in the internal validation data
n_e number of observations in the external validation data
Y binary outcome variable (always observed in the main study)
y n × 1 vector of outcome variable values
x 1 × m vector of the original variables (observed in the validation data)
x n_v × m matrix of the original variable values
Z 1 × m_Z vector of covariates measured without error (observed in the main and validation data)
Z n × m_Z matrix of covariate variable values
X 1 × m_X vector of the surrogate variables
X n × m_X matrix of the surrogate variable values
β_x m × 1 parameter of interest in the outcome model
ε_x n_v × m matrix of measurement errors
ε_X n × m_X matrix of measurement errors
α_x m × m_X matrix of regression coefficients for x
α_Z m_Z × m_X matrix of regression coefficients for Z
γ_X m_X × m matrix of regression coefficients for X
γ_Z m_Z × m matrix of regression coefficients for Z
Σ_{x|X} variance of the measurement error ε_x

For cluster-correlated observations

Y n × 1 vector of outcome variable values
X n × m_X design matrix for fixed effects
β_X m_X × 1 parameter vector for fixed effects
W n × q design matrix for random effects
u q × 1 vector of random effects
ε n × 1 vector of random error terms in the linear mixed model
G var(u), q × q variance matrix for the random effects
R var(ε), n × n variance matrix for the error
V var(Y), n × n variance matrix for the outcome variable
η linear predictor, η = Xβ + Wu
g(·) monotonic differentiable link function
x n_v × m matrix of original variable values in the measurement error model
Z n_v × m_Z matrix of error-free covariate values in the measurement error model
B n_v × C design matrix for random effects in the measurement error model
b C × m matrix of random effects in the measurement error model
ε_x n_v × m matrix of measurement errors in the measurement error model
1 vector of ones
C number of clusters


Contents

1 Introduction
  1.1 Aim of the study
  1.2 Main results
  1.3 Structure of this thesis

2 Literature review
  2.1 Approaches to Measurement Error
  2.2 Measurement Errors in Surveys
  2.3 Theoretical aspects of measurement errors

3 Study designs and models
  3.1 Main and validation study designs
  3.2 Framework for independent observations
    3.2.1 Generalized linear models
    3.2.2 Outcome model and measurement error model
    3.2.3 Likelihood functions
  3.3 Framework for cluster-correlated data
    3.3.1 Linear mixed models
    3.3.2 Generalized linear mixed models
    3.3.3 Measurement error model

4 Methods for measurement error adjustment
  4.1 Maximum likelihood for measurement error adjustment
  4.2 Multiple imputation
    4.2.1 Missing data mechanism
    4.2.2 McMC for MI
    4.2.3 Multiple imputation for measurement error adjustment
    4.2.4 Combining the estimates
    4.2.5 Multiple imputation in cluster-correlated data
    4.2.6 Multiple imputation in measurement error adjustment for cluster-correlated data
  4.3 Regression calibration
    4.3.1 Regression calibration in independent observations
    4.3.2 Regression calibration in cluster-correlated data

5 Simulation experiments: background and goals
  5.1 Results from the literature
  5.2 Quality concepts
  5.3 Computation tools
  5.4 Study problems

6 Simulation experiments with independent observations
  6.1 Design for Monte Carlo experiments
  6.2 Simulation results
  6.3 Effect of the size of the main dataset
  6.4 Effect of the size of the validation dataset
  6.5 Overall performance
  6.6 Summary of simulation results

7 Simulation experiments with artificial cluster-correlated data
  7.1 Designs for Monte Carlo experiments
  7.2 Simulation results
  7.3 Consistency of regression calibration
  7.4 Overall performance
  7.5 Summary of simulation results

8 Simulation experiments with real cluster-correlated data
  8.1 Survey design of the ECHP survey
  8.2 Description of the original sample data
  8.3 Generating the population data
  8.4 Simulation setting
  8.5 Simulation results

9 Discussion
  9.1 Summary of empirical results
  9.2 Conclusions

References

Appendices
  A.1 Figures
  A.2 Tables
  A.3 Programs for simulations


1 Introduction

This thesis studies measurement uncertainty in covariates. The problem of inaccurate measurements is common throughout science. If the uncertainty of measurement is neglected, it may lead to biased estimates and consequently to erroneous conclusions.

In sample surveys, the uncertainty of parameter estimates comes from two main sources: sampling and measuring the study units. When estimating parameters from survey data, it is important to have control over the sources of uncertainty in the estimation procedure. Often the data are collected by a complex sampling design involving stratification, clustering and unequal inclusion probabilities. The first source of error then comes from the implementation of the complex sampling design and from generalizing the results to the population. Another source of error is present when measuring the study units.

In empirical research, the uncertainty of parameter estimates is treated in various ways. For example, in the natural sciences and engineering it is common to neglect measurement errors and subsume them in the sampling variation. In psychometrics, the effect of measurement errors is assessed more specifically. A number of statistical methods have been proposed to take measurement uncertainty into consideration. Three approaches to adjusting for measurement error in covariates are studied in this thesis: multiple imputation for measurement errors (MI), regression calibration (RC) and maximum likelihood (ML). Measurement error in covariates can cause bias in parameter estimation. Hence, the purpose of measurement error adjustment is to reduce the bias and thus to achieve more accurate results.
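The bias mechanism can be illustrated with a small simulation of my own (not taken from the thesis). A binary outcome is generated from a logistic model with true slope 1.0; the covariate is then observed with additive classical error, and a naive logistic fit on the error-prone covariate is compared with the fit on the true covariate. The helper `logit_fit` is a minimal Newton-Raphson logistic fitter written for this sketch:

```python
import numpy as np

def logit_fit(x, y, iters=25):
    """Minimal logistic regression (intercept + one covariate) via Newton-Raphson."""
    X = np.column_stack([np.ones(len(y)), x])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

rng = np.random.default_rng(42)
n, beta1 = 20_000, 1.0
x = rng.normal(size=n)                    # true covariate
X_surr = x + rng.normal(size=n)           # surrogate with classical error (reliability 0.5)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-beta1 * x)))

b_true = logit_fit(x, y)[1]        # slope estimated from the true covariate
b_naive = logit_fit(X_surr, y)[1]  # slope estimated from the error-prone surrogate
print(b_true, b_naive)             # the naive slope is attenuated toward zero
```

With these settings the naive slope is roughly halved, which is the kind of bias the adjustment methods studied in this thesis aim to remove.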

Obtaining accurate measurements may be difficult or impractical in some studies. For example, in diet studies, food diaries or food frequency questionnaires (FFQ) are used to estimate the usual intake of foods consumed, although these self-reported dietary assessments are usually measured with error. In studies where cost or practical considerations prevent researchers from obtaining accurate data for all study subjects, a study design with main data and validation data may be applied. These types of methods require gold-standard assessments to be available for some of the study subjects. For example, to assess memory function a researcher may use the mini-mental state examination (MMSE) for all study subjects. This instrument provides a rough screening of memory function and other cognitive abilities. For a smaller (random) sample, which will form the validation data, the researcher may utilize brain imaging. This will give more accurate results, but it is expensive.

Regression calibration is one of the most commonly used methods for measurement error adjustment for covariates in a study design with main and validation data. One of the principal advantages of RC is its simplicity, and it usually performs well in standard situations. Multiple imputation is well established for handling missing data problems. It has been proposed for dealing with measurement errors, but this use has not been studied extensively. The third method is maximum likelihood, whose use has been limited because of computational difficulties. All three methods utilize external data to make the measurement error adjustment.
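To make the regression calibration idea concrete, here is a hypothetical numpy sketch under simplified assumptions (one error-prone covariate, no error-free covariates Z, and a validation sample generated from the same model): the calibration model E(x | X) is fitted on the validation data, the surrogate X is replaced by its calibrated prediction in the main data, and the outcome model is then fitted on the prediction. All names and settings are illustrative, not the thesis's actual designs:

```python
import numpy as np

rng = np.random.default_rng(7)

def make_data(n, sigma_u=0.8):
    """Generate (true x, surrogate X, binary y) with classical covariate error."""
    x = rng.normal(size=n)
    X = x + rng.normal(scale=sigma_u, size=n)      # error-prone surrogate
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))  # outcome depends on the true x
    return x, X, y

x_v, X_v, _ = make_data(2_000)    # validation study: both x and X observed
_, X_m, y_m = make_data(20_000)   # main study: only the surrogate X observed

# Step 1: fit the calibration model E(x | X) on the validation data.
A = np.column_stack([np.ones_like(X_v), X_v])
gamma, *_ = np.linalg.lstsq(A, x_v, rcond=None)

# Step 2: replace X by its calibrated prediction in the main data.
x_hat = gamma[0] + gamma[1] * X_m

# Step 3: fit the outcome model with x_hat in place of X (Newton-Raphson logistic).
def logit_slope(z, y, iters=25):
    Z = np.column_stack([np.ones_like(z), z])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        beta += np.linalg.solve(Z.T @ ((p * (1.0 - p))[:, None] * Z), Z.T @ (y - p))
    return beta[1]

print(logit_slope(X_m, y_m))    # naive slope: clearly attenuated
print(logit_slope(x_hat, y_m))  # calibrated slope: much closer to the true value 1.0
```

The correction is approximate for logistic regression, which is one reason the relative performance of RC against MI and ML is worth studying by simulation.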

1.1 Aim of the study

This thesis focuses on the adjustment for covariate measurement errors in logistic regression for cluster-correlated data. Complexity in this situation arises from the correlation of observations due to cluster sampling. The aim of this thesis is to investigate the relative performance of three competing approaches for measurement error adjustment in a complex survey setting: multiple imputation, regression calibration and maximum likelihood. These methods are considered for logistic regression models where the binary outcome variable is assumed to be error-free and continuous covariates are measured with error. The ultimate goal is to estimate a logistic model as well as possible. For successful adjustment, some information about the measurement error must be available. This information is provided by validation study data. In this thesis, the relative performance of the three adjustment methods is assessed by simulation methods.

The statistical performance of the adjustment methods of interest cannot be studied analytically because of their complexity. Therefore, Monte Carlo experimental designs are used to conduct empirical simulation studies. The simulation approach in the first two Monte Carlo experiments is model-based: the simulations are based on artificial data generated by a statistical model. This enables the statistical properties of the methods to be studied in a completely controlled setting. The final, and most important, Monte Carlo experiment uses design-based simulation for real cluster-correlated data. In a design-based simulation experiment, repeated samples are drawn from a fixed population (Lehtonen et al., 2003; Molina and Rao, 2010).
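The design-based idea can be sketched in a few lines (an illustrative numpy toy of my own, not the thesis's actual setting): a finite cluster-structured population is fixed once, and the Monte Carlo variability comes solely from drawing repeated cluster samples from it:

```python
import numpy as np

rng = np.random.default_rng(3)

# A fixed finite population of C clusters with m units each (sizes are illustrative).
C, m = 200, 30
cluster_effect = rng.normal(scale=0.5, size=C)
pop = rng.normal(size=(C, m)) + cluster_effect[:, None]  # cluster-correlated values

def one_replication(n_clusters=40):
    """One design-based replication: a cluster sample from the FIXED population."""
    sampled = rng.choice(C, size=n_clusters, replace=False)
    return pop[sampled].mean()

true_mean = pop.mean()
estimates = np.array([one_replication() for _ in range(1_000)])

# Monte Carlo bias over repeated samples from the same fixed population.
print(estimates.mean() - true_mean)
```

In the thesis's experiments the estimator under study would be a measurement-error-adjusted logistic regression coefficient rather than a mean, but the replication structure is the same.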

In the first phase of this thesis (Chapter 6), limited Monte Carlo simulation studies are conducted to study the following aspects:

• the effect of various parametrizations on the relative performance of the measurement error adjustment methods

• the effect of different sizes of the main and validation datasets on the measurement error adjustment

• to test the calculation tools that are available, to gain experience with them, and to obtain material for creating further experimental designs.

Observations in the first simulation study are assumed independent. The three approaches (MI, RC and ML) to adjusting measurement errors in covariates are compared in different experimental designs. The bias and accuracy properties of the MI and ML approaches were investigated and preliminary results were introduced in Valaste et al. (2010a,b); Valaste and Lehtonen (2011).

In the last two Monte Carlo experiments (Chapters 7 and 8), the bias and accuracy of two selected adjustment methods (regression calibration and multiple imputation) are examined for cluster-correlated data. First, artificial cluster-correlated data are generated and selected simulation designs are applied in which the size of the clusters and other characteristics of the data are controlled. The aim is to study the effect of clustering in two cases: 1) cluster correlation is ignored, and 2) cluster correlation is taken into account in the estimation procedure.

The third simulation study uses real data from Statistics Finland. The data have a cluster-correlated structure arising from their hierarchical structure. In the first phase, a close-to-reality population is created. The data used in the simulations are drawn from this constructed close-to-reality population.

The regression calibration and multiple imputation approaches are studied in different simulation settings. This simulation study is design-based.

The detailed research problems are presented in Section 5.4. The purpose of this study is also to identify research questions for future work. Extended study questions for the future are discussed in Chapter 9.

1.2 Main results

The results of the Monte Carlo simulation experiment for independent observations showed that when the validation dataset is large, the bias and accuracy figures for the parameter of primary interest are smaller throughout than for smaller validation datasets. The effect of the size of the main dataset was quite similar in general for all three approaches. A large main dataset improves the performance of measurement error adjustment, although some inconsistency problems of regression calibration were detected. The results also showed that the regression calibration and multiple imputation methods were carried out quite easily in the simulation settings, but the calculations for maximum likelihood were time-consuming and the results were sensitive to the starting values.

Simulation experiments with cluster-correlated data revealed that the cluster structure of the data has to be taken into account both in the measurement error adjustment and in the modelling of the outcome variable. According to the bias values obtained from the simulation experiments, all adjustment methods improved the estimation compared to the naïve method in which the validation data were not used. The results for the choice of validation dataset, either a simple random sample or a cluster sample from the main data, were not clear-cut. For regression calibration, the results were consistent when the assumptions of the model were correct and when the validation dataset was a cluster sample from the main data. A high correlation between predictors also caused performance problems for cluster-correlated data.

The simulation studies presented in this thesis suggest that additional information from validation (calibration) data enables more accurate estimates.

1.3 Structure of this thesis

Chapter 2 gives a short literature review of survey errors and the developments of measurement error theory. Theoretical aspects of measurement errors are discussed.

A study design with main data and validation data is discussed in Chapter 3. The assumptions of the measurement error model, as well as the model that will be analysed in the simulation studies, are introduced. The first part of the chapter concentrates on the case of independent observations. Thereafter, a framework for analysing cluster-correlated data is described.

Chapter 4 discusses the selected adjustment methods. The multiple imputation, regression calibration and maximum likelihood methods for measurement error adjustment are presented.

Monte Carlo simulations with artificial data and independent observations are conducted in Chapter 6. Various designs are examined with the maximum likelihood, multiple imputation and regression calibration methods.

Chapters 7 and 8 include Monte Carlo simulations with cluster-correlated data. The former simulation uses artificial data and the latter real data. In both of these simulations, the regression calibration and multiple imputation approaches are examined in various simulation designs.

Chapter 9 gives a short summary of empirical results and the conclusions of the study.


2 Literature review

This chapter covers different approaches to measurement error. The developments of measurement error theory are summarized briefly in the context of survey methodology.

2.1 Approaches to Measurement Error

For historical reasons, there have been various ways to accommodate the uncertainty of observations. In the literature of measurement error modelling, there are two distinct developments of the theory (Biemer and Stokes, 1991): the sampling approach and the psychometric approach. Groves (1989) remarks that, in his opinion, in survey research there appear to be at least three major disciplines addressing similar measurement error problems in isolation from one another: statisticians, psychologists and econometricians. Other survey data users appear to employ concepts similar to those above.

The sampling approach to measurement error modelling began with the early work of Hansen, Hurwitz, and Madow in 1953 (Biemer and Stokes, 1991).

This approach focuses on the total survey error, where the relationships among the several types of survey errors are described with an expression of the mean square error (the sum of squared bias and variance).

The psychometric approach differs from the traditional sampling perspective. The aim is to describe the variance-covariance structure of the whole population rather than to describe the different types of errors in terms of their contributions to sampling bias and sampling variance.

There have been occasional attempts to synthesize the sampling and psychometric traditions (Stanley, 1971; Valaste et al., 2008). One attempt to combine the traditions is generalizability theory, which expands on the classical test theory idea of a single error term: the error can be due to multiple sources. Generalizability theory uses an analysis of variance framework for estimating the proper variance components (Cronbach et al., 1963, 1972; Brennan, 2001).

In the econometric approach the terminology of errors originates mainly from the language of estimation of a general linear model (Groves, 1989). In this approach, measurement error is considered in two cases: measurement error in the dependent variable and measurement error in the independent variable (Hausman, 2001). The problem of measurement errors is receiving increased attention due in part to the increased availability of microeconomic datasets (Schennach, 2004). Measurement errors are also examined in the context of nonlinear models (see e.g. Schennach, 2004; Chen et al., 2011).


We might exaggerate and generalize the essential differences between the three approaches in the following way. The main difference between the sampling and psychometric approaches is the assumption concerning the measurement. Sampling statisticians presume that an observable true value exists. From the psychometric perspective, the measurements are taken on attributes that (typically) cannot be observed directly, and perhaps the measurements cannot be observed by anyone other than the respondents themselves. Psychometrics and econometrics are more dedicated than the sampling approach to measuring relationships among measures (e.g. correlations, regression coefficients). Their language of errors is based on the statistics used to measure the covariation of characteristics. Sampling statisticians, in turn, frequently focus on estimates of means and totals (univariate statistics).

Regardless of the approach adopted, it is important to pay attention to measurement errors and their effect on statistical analysis. In this thesis, the focus is on the sampling approach and on some open questions within it.

2.2 Measurement Errors in Surveys

Survey quality is often described by variance, bias, accuracy, reliability and validity, but frequently these terms are used in ambiguous ways. Sometimes the terms relate to measurement errors only; in other cases, they are used in a much wider context, for example to denote the overall stability of survey results.

Groves (1989), Alwin (1991, 2007) and de Leeuw et al. (2008) specify four sources of error in surveys: coverage error, sampling error, nonresponse error and measurement error. Figure 2.1 demonstrates a sequence of sources of survey errors when estimating population parameters.

Two types of coverage errors exist: undercoverage and overcoverage. An undercoverage error arises when some population elements are not included in the sampling frame. An overcoverage error is present when a unit from the target population appears more than once in the sampling frame. Sampling errors exist in surveys because only a subset of the population elements is used to represent the population. A nonresponse error occurs when the survey fails to get a response to one, or possibly all, of the questions. A measurement error is a lack of measurement precision due to weakness in the measurement instrument. All four errors are present when the data are a sample of the population, but measurement errors are present in empirical studies regardless of whether the data collection involves sampling or not.

[Figure 2.1: A sequence of sources of survey errors (Alwin, 1991, 2007). The figure shows the chain from population parameters through coverage errors, sampling errors and nonresponse errors to measurement errors.]

Total survey error is a term used to refer to all sources of bias (systematic error) and variance (random error) that may affect the validity (accuracy) of survey data (Lavrakas, 2008). Groves and Lyberg (2010) point out that the term is not well defined. They state that different researchers include different components of error within it, and a number of typologies exist in the literature. The total survey error framework has a long history (Groves and Lyberg, 2010). Its developments follow the general progress of survey methodology.

The term survey measurement error refers to error in survey responses arising from data collection, the respondent, or the questionnaire (or other instrument) (Biemer et al., 1991). In this study, the focus is on measurement error in covariates. Assumptions on the measurement error are defined in Sections 3.2.2 and 3.3.3, and the measurement error models are given in Equations (3.8) and (3.19).

O’Muircheartaigh (1997) depicts the historical development of survey research under three distinct strands: governmental/official statistics, academic/social research and commercial/advertising market research. These strands have developed different terminologies and frameworks in survey research, mainly because of the different needs and approaches in their research fields. Thus, the conceptualization of error, especially measurement error, differs between these strands.

Kiaer, the director of the Norwegian bureau of statistics, activated official statisticians at the end of the 19th century by presenting a report of his experience with “representative investigations”. He also suggested replication when evaluating the survey results. The idea of repeated samples led to the concept of a sampling distribution. Neyman’s landmark paper on sampling in 1934 (Lessler and Kalsbeek, 1992; Fienberg and Tanur, 1996) stimulated statisticians and led to developments in this tradition. The term error meant a measure purely of variance or variability. Broadly stated, in this strand the term error describes any source of variation in the results, output or estimates of a survey (see Groves, 1989).

The second strand arose from the Social Policy and Social Research movements. The goal of this informal movement was social reform, and the mechanism was community description (O’Muircheartaigh, 1997). Some sample survey pioneers spanned both official statistics and social surveys.

In particular, in 1915 Bowley made a very important contribution to the development of measurement and sampling methods (Lessler and Kalsbeek, 1992).

The third strand responded to the increasing needs of commercial and advertising research. The influence of psychologists was strong: they provided a scientific basis for measurement. The term error was not used explicitly; instead, the terms validity and reliability were applied.

2.3 Theoretical aspects of measurement errors

Measurement error is defined in relation to a “true value”. At least two different approaches to conceptualizing the “true value” can be found. A true score (or value) is called Platonic if the concept of a true value is plausible, as in physical measurement (Biemer and Stokes, 1991). Platonic true scores encompass the idea that there is some “truth” out there to be discovered (Alwin, 2007). Non-Platonic true scores, used by psychometricians, are instead defined in statistical terms, such as the expected value of a hypothetical infinite set of observations (the response distribution for the measurement) for a fixed person (Lord and Novick, 1968).

Several measurement error models have been studied in the literature. Measurement error is generally equated with the term “observational errors” (Groves, 1989). We first present a classical model, the so-called true score model.

Let x be the observed score. The true score model of classical test theory is

x = τ + ε    (2.1)

where τ is the (unknown) true score and ε is the (random) measurement error, with E(ε) = 0 and cov(τ, ε) = 0. This means that the error is assumed to be statistically independent of the true variable. This type of measurement error is called classical measurement error; other types are nonclassical.

Another type of measurement error model, referred to as the Berkson error model (Berkson, 1950), assumes that the measurement error is statistically independent of the observed variable. This model is also called a controlled-variable model because Berkson originally described the error model for experimental situations in which the observed variable was controlled. The Berkson model is defined as

τ = x + ε    (2.2)

where E(ε) = 0 and cov(x, ε) = 0.
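The difference between the two error structures can be checked numerically. In the classical model the error is independent of the true score, so the observed variable is noisier than the truth; in the Berkson model the error is independent of the observed value, so the truth is noisier than the observation. A small numpy sketch of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Classical error (2.1): observed = true + error, error independent of the TRUE score.
tau = rng.normal(size=n)                  # true scores
eps = rng.normal(scale=0.5, size=n)       # classical measurement error
x_classical = tau + eps                   # observed scores
print(np.cov(tau, eps)[0, 1])             # ~0: error uncorrelated with the truth
print(x_classical.var() > tau.var())      # True: observed is noisier than the truth

# Berkson error (2.2): true = observed + error, error independent of the OBSERVED value.
x_obs = rng.normal(size=n)                # observed (e.g. controlled) values
e = rng.normal(scale=0.5, size=n)         # Berkson error
tau_berkson = x_obs + e                   # true values
print(np.cov(x_obs, e)[0, 1])             # ~0: error uncorrelated with the observation
print(tau_berkson.var() > x_obs.var())    # True: truth is noisier than the observation
```

This variance asymmetry is why the two models have different consequences for regression: classical error in a covariate attenuates slopes, whereas Berkson error in a linear model leaves the slope unbiased.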

Measurement error may also depend on other variables. There might be some third variable alongside the “true” value τ and the observed variable x, e.g. a response variable Y, which may or may not be measured with error. The measurement error model is non-differential (with respect to Y) if it is independent of the values of the response variable; otherwise the measurement error model is differential.

Measurement error models in this study

In the forthcoming chapters, we consider the situation where x is the variable which we would like to measure exactly but are not able to measure for all study units. The variable x is assumed to be error-free. Instead of obtaining the variable x, we measure a surrogate variable X, which is measured with error.

The measurement error model studied in this thesis is defined in Section 3.2.2, and the extension to data with a cluster-correlation structure is discussed in Section 3.3.3. Throughout this thesis, the measurement error model is assumed to be non-differential, and the measurement errors are assumed to follow the Berkson error model (2.2).

In the first two Monte Carlo simulation experiments, conducted in Chapters 6 and 7, the simulations are based on artificial data created by a model. Therefore, the uncertainty of the data is under control, and the measurement error is generated by a model.

A Monte Carlo simulation experiment with real data is conducted in Chapter 8. The simulations in this chapter are based on Finnish ECHP (European Community Household Panel) data from the year 1996. In the ECHP, some of the variables of interest, e.g. income-related variables, are measured both by interview and from administrative registers. This gives a unique opportunity to investigate methods of measurement error adjustment. Although the measurement processes are quite different between sample surveys and register-based surveys (Wallgren and Wallgren, 2007), in this experiment we assume that the register-based data serve as proxies for the “true values”, and the differences between a register-based value and the corresponding interviewed value are assumed to reflect measurement errors. The previous literature contains measurement error studies of this type, in which register-based data are taken as the “true value”. A report on the quality of income data, in particular in the European Community Household Panel Survey, is provided by Rendtel et al. (2004).


3 Study designs and models

The methods for measurement error adjustment examined here require validation (or calibration) data to be available. The following section begins by introducing a study design with main data and validation data.

General concepts are first considered for the case of independent observations.

The measurement error model used in this thesis is defined and the likelihood function for inference is discussed.

A framework for analysing cluster-correlated data is then examined. In particular, the measurement error model is extended to data that follow a generalized linear mixed-model structure.

3.1 Main and validation study designs

Exposure measurement error has been studied in several papers (Rosner et al., 1989; Kuha, 1994; Spiegelman et al., 2000, 2001; Cole et al., 2006; Messer and Natarajan, 2008; Freedman et al., 2008; Padilla et al., 2009; Buonaccorsi, 2009), especially in epidemiology. The exposure measurement error for an individual can be defined as the difference between the measured exposure and the true exposure (White et al., 2008). In epidemiology, the term exposure can be broadly applied to any variable that may be associated with an outcome variable.

Sometimes in the main study, surrogate variables are measured instead of the original variables. Reasons for measuring surrogate variables instead of the original variables might include, for example, measurement cost, or the measurements being too laborious to employ for large study datasets. To estimate the relation between the surrogate variables and the original variables (also called gold standard variables), a validation sub-study may be conducted. A validation study is usually a smaller study in which both the original variables and the surrogate variables are measured. We assume that the main and validation study datasets contain common variables. These variables are observed on all subjects, and it is also assumed that they are measured without error.

Denote an outcome variable by Y, a vector of original variables by x : 1×m, a vector of surrogate variables by X : 1×m_X, and one more vector of variables by Z : 1×m_Z. Z is measured without error. Note that our vectors of variables are row vectors. When measured on subject i, these vectors appear in row i of the respective data matrices. In the main study the variables (Y, X, Z) are measured on the study subjects. A data matrix (y, X, Z) with n rows is obtained, where y : n×1 stands for a vector of outcome observations, X : n×m_X is a matrix of the surrogate variable values and Z : n×m_Z a respective matrix of Z measurements.

In internal validation designs, the study subjects who contribute validation data are a (random) subsample of the main study. In the validation subsample of size n_v, the original variables x are measured and a data matrix x : n_v×m is obtained. These data are known together with the (y, X, Z) data, measured for all n subjects (see Figure 3.1).

In external validation designs, the main study and the validation study use independent samples. In the validation sample we have a data matrix (x, Z, X), whereas in the main study we have a data matrix (y, Z, X), so that x and Y are not observed together for any subject.

Figure 3.1 illustrates the study designs with main data and validation data. In the main study, data (y, X, Z) are measured on n study subjects. Internal validation study data contain n_v study subjects (n_v < n) and external validation study data contain n_e study subjects (n_e < n). The different types of lines between main study data and validation study data represent connections: internal validation data are a (random) subsample of the main study, and the external validation study uses an independent sample. A more detailed figure for the internal validation data design is given in Appendix Figure A.1.

Many statistical methods have been proposed for adjusting for covariate measurement error in main study and validation study designs. Three approaches to adjusting for measurement error in covariates are studied in this thesis: multiple imputation for measurement errors (MI), regression calibration (RC) and maximum likelihood (ML). The outcome variable Y is assumed to be dichotomous; thus a logistic regression model (a fixed-effects model or a mixed model) is applied depending on the structure of the data. In the forthcoming sections, an internal validation design is assumed.


[Figure 3.1: schematic of the three datasets. The main data contain (y, X, Z) for units 1, ..., n. The internal validation data contain (y, X, Z, x) for a subsample of n_v units drawn from the main data. The external validation data contain (X, Z, x) for a new sample of n_e units, independent of the main data.]

Figure 3.1: Main and validation study data.

3.2 Framework for independent observations

3.2.1 Generalized linear models

Throughout this thesis the outcome variable Y has two values. Generally, the values of this type of binary variable Y are encoded as 1 and 0. The probability of Y being one is denoted by P(Y = 1) = p and the probability of Y being zero by P(Y = 0) = 1 − p. The ratio of probabilities, p/(1 − p), is called the odds, and the logarithm of the odds, logit(p) = log[p/(1 − p)], is called the logit.

The binary variable can be modelled by a generalized linear model (McCullagh and Nelder, 1989). In generalized linear models, the density/probability function of the independent random variable Y_i, i = 1, ..., n, can be expressed as

f_{Y_i}(y_i; θ_i, φ) = exp{ a_i[y_i θ_i − b(θ_i)] / φ + c(y_i, φ) },   (3.1)

where a_i, i = 1, ..., n, represent known weights, b and c are some known functions, and θ_i, i = 1, ..., n, and φ are unknown parameters.

The distribution given by (3.1) belongs to the exponential family. The exponential family includes many of the most common distributions, such as the normal, exponential, gamma, chi-squared, Bernoulli, Wishart and many others. Probability distributions of the outcome Y_i in generalized linear models are usually parametrized in terms of the mean µ_i and the dispersion parameter φ instead of the natural parameter θ_i. For example, for each observation i, for the binary Y_i ∼ Bin(1, p_i), p_i ∈ (0, 1), the distribution is P(Y_i = y_i) = p_i^{y_i} (1 − p_i)^{1−y_i}, y_i = 0, 1. The notation Y_i ∼ Bin(1, p_i) has the same meaning as Y_i ∼ B(p_i), which is a Bernoulli distribution with probability p_i. The distribution of the independent Bernoulli trials with probability p_i for observation i can be written as

P(Y_i = y_i) = p_i^{y_i} (1 − p_i)^{1−y_i}
             = [p_i/(1 − p_i)]^{y_i} (1 − p_i)
             = exp{ log([p_i/(1 − p_i)]^{y_i}) } (1 − p_i)
             = exp{ y_i log[p_i/(1 − p_i)] + log(1 − p_i) }
             = exp{ y_i θ_i − log(1 + exp(θ_i)) },   (3.2)

where θ_i = log[p_i/(1 − p_i)]. The distribution (3.2) is a member of the exponential family. It follows from (3.1) by choosing φ = 1, a_i = 1, b(θ_i) = log(1 + exp(θ_i)) and c(y_i, φ) = 0.

The generalized linear model generalizes linear regression by allowing the linear model to be related to the response variable via an invertible link function g(·). The link function provides the relationship between the linear predictor η_i and the mean µ_i of the distribution, g(µ_i) = η_i. There are a number of popular link functions, such as the logit link function

η_i = g(p_i) = log[p_i/(1 − p_i)] = X_i β,   (3.3)

where p_i ∈ (0, 1), i = 1, ..., n, and X_i is a row of the design matrix X. The logit link has been chosen in this thesis because it is appropriate for the analysis and for comparison with results from the literature. In Equation (3.3), β is the parameter vector (a column vector).
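As a small illustration of the logit link and its inverse, the following sketch (the probability value is arbitrary and not from the thesis) checks the round trip numerically:

```python
import math

# logit link g(p) = log(p / (1 - p)) and its inverse, the logistic function
def logit(p):
    return math.log(p / (1 - p))

def inv_logit(eta):
    return math.exp(eta) / (1 + math.exp(eta))

p = 0.3
eta = logit(p)                           # value on the linear-predictor scale
print(abs(inv_logit(eta) - p) < 1e-12)   # round trip recovers p
```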

The mean and variance of the outcome variable Y_i from the exponential family of distributions (3.1) are obtained from the first and second derivatives of b with respect to θ_i; for a_i = φ = 1 they are E(Y_i) = b′(θ_i) = µ_i = g^{−1}(X_i β) and var(Y_i) = b″(θ_i). The primes denote derivatives with respect to θ_i. For the Bernoulli distribution

E(Y_i) = b′(θ_i) = e^{θ_i} / (1 + e^{θ_i}) = p_i

and

var(Y_i) = b″(θ_i) = e^{θ_i} / (1 + e^{θ_i})² = p_i (1 − p_i).
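These identities can be checked numerically. The sketch below (the evaluation point θ and step size h are arbitrary choices, not from the thesis) differentiates the cumulant function b(θ) = log(1 + e^θ) by finite differences and compares the results with p and p(1 − p):

```python
import numpy as np

# Cumulant function of the Bernoulli distribution in canonical form
def b(theta):
    return np.log1p(np.exp(theta))

theta, h = 0.7, 1e-4
p = np.exp(theta) / (1 + np.exp(theta))

# central finite differences approximate b'(theta) and b''(theta)
b_prime = (b(theta + h) - b(theta - h)) / (2 * h)
b_second = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2

print(abs(b_prime - p) < 1e-6, abs(b_second - p * (1 - p)) < 1e-4)
```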

The column vector y contains elements y_i (the outcomes of Y), which are the observed {0, 1} values for each observation i. The joint probability function of the observations is

f(y | β) = ∏_{i=1}^n p_i^{y_i} (1 − p_i)^{1−y_i},   (3.4)

where p_i depends on β. This is a likelihood function with respect to β. The likelihood function can be expressed as

L(β | y) = ∏_{i=1}^n [p_i/(1 − p_i)]^{y_i} (1 − p_i).   (3.5)

Applying the exponential function to both sides of expression (3.3) and solving for p_i, we get

p_i/(1 − p_i) = e^{X_i β}  ⇔  p_i = e^{X_i β} / (1 + e^{X_i β}).


The solved terms p_i/(1 − p_i) and p_i are substituted into Equation (3.5):

L(β | y) = ∏_{i=1}^n (e^{X_i β})^{y_i} [1 − e^{X_i β}/(1 + e^{X_i β})] = ∏_{i=1}^n e^{y_i X_i β} (1 + e^{X_i β})^{−1}.

Hence, the log-likelihood based on all data is

l(β) = Σ_{i=1}^n y_i X_i β − Σ_{i=1}^n log(1 + e^{X_i β}),

and for unit i it is

l(β) = y_i X_i β − log(1 + e^{X_i β}).   (3.6)

3.2.2 Outcome model and measurement error model

For independent observations the outcome model is defined as

P(Y = 1 | x, Z) = π(xβ_x + Zβ_Z) = exp(xβ_x + Zβ_Z) / [1 + exp(xβ_x + Zβ_Z)],   (3.7)

where Y is a binary outcome variable which depends on the variables x : 1×m and Z : 1×m_Z through a logistic regression model, β_Z : m_Z×1 is a coefficient vector, and the parameter of interest is β_x : m×1.

Below we present the measurement error model for independent observations. The framework is extended to analysing cluster-correlated data in Section 3.3.

Instead of the variables x : 1×m, sometimes surrogate variables X : 1×m_X are observed. The surrogate variables X are related to the variables x by a multivariate normal linear model

X = xα_x + Zα_Z + ε_X,   (3.8)

where the measurement error is ε_X ∼ N(0, σ_X² I), the parameter matrix α_x : m×m_X contains the regression coefficients for x, and α_Z : m_Z×m_X is the regression coefficient matrix for Z. This model is the measurement error model.

Because the observed surrogate variables X are occasionally used to predict the unobserved variables x, a secondary study objective is to estimate the conditional distribution of x given X. We assume that the variables x ∼ N(0, Σ_x). Under these assumptions the conditional distribution of x given X is multivariate normal, and x can be modelled as

x = Xγ_X + Zγ_Z + ε_x,   (3.9)

where the measurement error is ε_x ∼ N(0, Σ_{x|X}) and the γ's are matrices of regression coefficients, γ_X : m_X×m and γ_Z : m_Z×m (Messer and Natarajan, 2008). Information on the measurement error parameters γ_X and γ_Z is obtained from a validation study. Conditional on x and Z, the distribution of Y is assumed independent of X, which means that the measurement error is non-differential. Recall that Z was assumed to be measured without error and for all study subjects. Later we will omit Z and use the distribution of x given X, i.e. we use the log-likelihood −(1/2)[log det(Σ_{x|X}) + (x − Xγ_X) Σ_{x|X}^{−1} (x − Xγ_X)^T].
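A minimal sketch of this setup, with illustrative parameter values only, simulates a scalar version of the measurement error model and estimates the calibration coefficient γ_X by least squares on validation data. With α_x = 1, var(x) = 1 and error variance 0.25, the theoretical slope is cov(x, X)/var(X) = 1/1.25 = 0.8:

```python
import numpy as np

# Simulate validation data from the measurement error model X = x + eps_X
# (scalar case, zero means, so no intercept is needed; values illustrative)
rng = np.random.default_rng(5)
n_v = 5000
x = rng.normal(0.0, 1.0, n_v)             # true covariate, var(x) = 1
X = 1.0 * x + rng.normal(0.0, 0.5, n_v)   # surrogate with error sd 0.5

# OLS through the origin estimates gamma_X in the calibration model (3.9);
# the population value here is cov(x, X) / var(X) = 0.8
gamma_hat = (X @ x) / (X @ X)
print(abs(gamma_hat - 0.8) < 0.05)
```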

As described earlier in Section 3.1, for an internal validation design, a data matrix (Y, x, X) is available on each subject in the validation sub-sample. In the main study sample, (Y, X) are available, but x is not. Therefore the concepts of missingness can be applied to conceptualize measurement error structures. We now define concepts and terminology for measurement error adjustment approaches for a study design where a main dataset and an internal validation dataset are available.

Definition 1. If validation study subjects constitute a simple random sample from main study subjects, the measurement error mechanism will be called measurement error completely at random (MECAR).

A more relaxed assumption about the missing data mechanism is missing at random. The missing at random assumption can be defined as follows:

Definition 2. If the validation sample is stratified on covariates Z, the covariates Z are included in models (3.7) and (3.9), and the models are assumed correct, the measurement error mechanism will be called measurement error at random (MEAR).

In both cases, MECAR and MEAR, the main point is that the conditional distribution of x|Y, X, Z will be the same for a validation study subject for which x is observed, as for a main study subject for which x is not observed.

In external validation designs, the main study and the validation study use independent samples. In the validation sample (x, X) are observed; in the main study (Y, X) are observed. Note that x and Y are not observed together for any subjects because the main and validation studies use independent samples. The MEAR condition is now that the conditional distribution x | Y, X, Z should be the same for main study and validation study subjects, with a similar requirement for Y | x, X, Z. For an external validation design, no direct observations from either conditional distribution are available. Messer and Natarajan (2008, p. 6338) have stated that in practice, the assertion that the data are MEAR may depend more strongly on a priori modelling assumptions for an external design than for an internal design.

In the simulation studies of Chapters 6, 7 and 8, an internal validation design is used. The missing data mechanism is controlled in the simulation experiments. In the first simulation, the missing data mechanism is MECAR. In the last two simulation experiments, different types of validation datasets are tested; in these experiments, the missing data mechanism is either MECAR or MEAR.
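Under MECAR, creating a validation subsample amounts to drawing a simple random sample of main study units and observing x only there. A sketch (the sample sizes are illustrative, not those of the thesis experiments):

```python
import numpy as np

# MECAR validation design: the validation subsample is a simple random
# sample of the n main study units, and x is observed only inside it
rng = np.random.default_rng(6)
n, n_v = 1000, 200
x_true = rng.normal(size=n)                      # true covariate for all units

validation_idx = rng.choice(n, size=n_v, replace=False)
x_observed = np.full(n, np.nan)                  # x missing in the main data
x_observed[validation_idx] = x_true[validation_idx]

print(int(np.sum(~np.isnan(x_observed))) == n_v)  # x observed for n_v units only
```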

3.2.3 Likelihood functions

Likelihood functions are needed to obtain the MI, RC and ML estimates in later sections. Three likelihood functions l1, l2, l3 are derived, as in the paper by Messer and Natarajan (2008), for the cases: a) (Y, x, X) observed, b) (Y, X) observed and x missing, and c) the full sample likelihood. The variables Z are observed on all subjects, and it is also assumed that Z are measured without error. Models (3.7) and (3.8) include the variables Z explicitly, but Z is often omitted from the notation; we do the same here. We will assume that Y is an outcome variable with binary values, X : 1×m_X a vector of surrogate variables, and x : 1×m a vector of the original variables for internal validation, or a matrix of original-variable observations for external validation (see Figure 3.1). The regression coefficients are β_x : m×1 and γ_X : m_X×m, and Σ_{x|X} is the covariance matrix of the measurement error ε_x. We now derive the likelihood functions for each of the cases a)–c).

a) Let (Y, x, X) be observed. The joint distribution of (Y, x) given X, f(Y, x | X), can be derived by noticing that models (3.7) and (3.9) imply that the outcome variable Y is conditionally independent of the surrogate variables X whenever x is observed. Thus, the conditional distribution of the outcome variable Y given (x, X) is f(Y | x, X) = f(Y | x), and hence the joint distribution of (Y, x) given X is f(Y, x | X) = f(Y | x, X) f(x | X) = f(Y | x) f(x | X).

Hence, the log-likelihood for (Y, x) given X consists of two parts, l1 and l2:

l(β_x, γ_X, Σ_{x|X}; Y, x | X) = l1(β_x; Y | x) + l2(γ_X, Σ_{x|X}; x | X),   (3.10)

where l1 is a logistic regression likelihood,

l1(β_x; Y | x) = Y xβ_x − log(1 + exp(xβ_x)),

as was derived in Equation (3.6), and l2 is a normal regression likelihood,

l2(γ_X, Σ_{x|X}; x | X) = −(1/2)[log det(Σ_{x|X}) + (x − Xγ_X) Σ_{x|X}^{−1} (x − Xγ_X)^T].
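For a single validation subject with scalar x and X (all values below are illustrative), l1 and l2 can be evaluated directly. Note that l2 equals the Gaussian log-density only up to the additive constant −(1/2) log(2π), which does not depend on the parameters:

```python
import math

# Hypothetical one-subject validation record (scalar x and X; illustrative)
Y, x, X = 1, 0.8, 1.1
beta_x, gamma_X, sigma2 = 0.5, 0.7, 0.25

# l1: logistic regression log-likelihood for Y given x
l1 = Y * x * beta_x - math.log(1 + math.exp(x * beta_x))

# l2: normal regression log-likelihood for x given X, constant term omitted
resid = x - X * gamma_X
l2 = -0.5 * (math.log(sigma2) + resid**2 / sigma2)

# l2 differs from the full Gaussian log-density only by 0.5*log(2*pi)
logpdf = -0.5 * math.log(2 * math.pi * sigma2) - resid**2 / (2 * sigma2)
print(abs((l2 - logpdf) - 0.5 * math.log(2 * math.pi)) < 1e-12)
```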

b) Let (Y, X) be observed and x missing. From the joint distribution f(Y, x | X) = f(Y | x, X) f(x | X), the marginal distribution of Y | X is obtained by integrating the unobserved variables x out of the joint distribution:

f(Y | X; β_x, γ_X, Σ_{x|X}) = ∫ f(Y | x, X; β_x) f(x | X; γ_X, Σ_{x|X}) dx.   (3.11)

If Equation (3.9) is plugged into Equation (3.7), we get P(Y = 1 | x, Z) = P(Y = 1 | X, ε_x; γ_X, β_x), i.e. model (3.11), the likelihood for the logistic-normal effects model. A univariate integral is obtained when the variable v = ε_x β_x is substituted into the log-likelihood:

l3(β_x, γ_X, Σ_{x|X}; Y | X)
  = log[ (β_x^T Σ_{x|X} β_x)^{−1/2} ∫ {exp((Xγ_X β_x + v)Y) / (1 + exp(Xγ_X β_x + v))} exp(−(1/2) v^T (β_x^T Σ_{x|X} β_x)^{−1} v) dv ].   (3.12)
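Integrals of this logistic-normal type have no closed form, but Gauss-Hermite quadrature evaluates them accurately. The sketch below, with illustrative values for the mean µ and the variance τ² of v (standing in for β_x^T Σ_{x|X} β_x), compares a 40-node quadrature rule with a Monte Carlo estimate of E[exp(µ + v)/(1 + exp(µ + v))] for v ∼ N(0, τ²):

```python
import numpy as np

# Logistic-normal integral E[sigmoid(mu + v)], v ~ N(0, tau2); illustrative values
mu, tau2 = 0.4, 0.9

# Gauss-Hermite quadrature: the substitution v = sqrt(2*tau2)*t turns the
# Gaussian weight into the Hermite weight exp(-t^2), up to a factor 1/sqrt(pi)
nodes, weights = np.polynomial.hermite.hermgauss(40)
v = np.sqrt(2 * tau2) * nodes
gh = np.sum(weights * np.exp(mu + v) / (1 + np.exp(mu + v))) / np.sqrt(np.pi)

# Monte Carlo check of the same integral
rng = np.random.default_rng(1)
mc = np.mean(1 / (1 + np.exp(-(mu + rng.normal(0, np.sqrt(tau2), 200_000)))))

print(abs(gh - mc) < 0.005)
```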

c) Full sample likelihood. For an internal design, the full sample log-likelihood is the sum over study subjects in the validation study and in the main study:

l(β_x, γ_X, Σ_{x|X}) = Σ_{i∈validation} [l1(β_x; Y_i | x_i) + l2(γ_X, Σ_{x|X}; x_i | X_i)] + Σ_{i∈main} l3(β_x, γ_X, Σ_{x|X}; Y_i | X_i).   (3.13)

Because for an external design the variables x and the surrogate variables X are observed in the validation study only, the term l1 becomes zero, and the full sample log-likelihood is

l(β_x, γ_X, Σ_{x|X}) = Σ_{i∈validation} l2(γ_X, Σ_{x|X}; x_i | X_i) + Σ_{i∈main} l3(β_x, γ_X, Σ_{x|X}; Y_i | X_i).   (3.14)

3.3 Framework for cluster-correlated data

Cluster-correlated data arise when there is a hierarchical or clustered structure in the sample or population. As a consequence, observations within a cluster are more alike than observations from different clusters. Intra-cluster correlation (ICC) describes how strongly units in the same group resemble each other. In the literature, there are several definitions and formulas for intra-cluster correlation, and many different ways for its calculation have been presented. A commonly used definition of the ICC is based on a decomposition of the total variance into the within-cluster (s²_w) and between-cluster (s²_b) variabilities. Intra-cluster correlation is then defined as

ICC = s²_b / (s²_b + s²_w).

Because we are working with a binary outcome variable, the outcome model is a logistic mixed model and it is presented in Section 3.3.2.

A generalized linear mixed model (GLMM) provides a useful approach for analysing a wide variety of data structures including cluster-correlated data. It combines the properties of two statistical models: linear mixed models and gen- eralized linear models (Nelder and Wedderburn, 1972; McCulloch and Searle, 2001; Demidenko, 2004). Linear mixed models are utilized when incorporating random effects. By using proper link functions, generalized linear models can handle non-normal data, e.g. binary data.


3.3.1 Linear mixed models

A linear mixed model contains both fixed effects and random effects. Using standard notation, the model can be represented as

Y = Xβ + Wu + ε,   (3.15)

where Y : n×1 is the vector of observations, X : n×m is the design matrix for the fixed effects, β : m×1 is the parameter vector of fixed effects, W : n×q is the design matrix for the random effects, u : q×1 is the vector of random effects, and ε : n×1 is the vector of random error terms.

We will assume that u and ε are normally distributed, uncorrelated random variables with zero means, E(u) = 0 and E(ε) = 0, with variance and covariance matrices var(u) = G, var(ε) = R and cov(u, ε) = 0, so that the joint covariance matrix of (u, ε) is block-diagonal with blocks G and R. Therefore, the variance of Y is

V = var(Y) = WGW^T + R,   (3.16)

the expectation of Y is E(Y) = Xβ, and the vector of observations is assumed normally distributed, Y ∼ N(Xβ, WGW^T + R).

Estimation is more demanding in a linear mixed model than in a linear fixed-effects model. In addition to estimating β, there are the unknown quantities u, G and R to be estimated or predicted.

Estimates for the parameters β and u are obtained by solving the mixed model equations (Henderson, 1950):

[ X^T R̂^{−1} X      X^T R̂^{−1} W           ] [ β̂ ]   [ X^T R̂^{−1} Y ]
[ W^T R̂^{−1} X      W^T R̂^{−1} W + Ĝ^{−1} ] [ û ] = [ W^T R̂^{−1} Y ],

to find the best linear unbiased estimator (BLUE) of β and the best linear unbiased predictor (BLUP) of u. The estimates are best in the sense that they minimize the sampling variance, and they are linear functions of Y (Puntanen et al., 2011). The estimates are unbiased, i.e. E[BLUE(β)] = β and E[BLUP(u)] = u (Robinson, 1991).

The solutions of the mixed model equations can be written as

β̂ = (X^T V̂^{−1} X)^{−1} X^T V̂^{−1} Y,
û = Ĝ W^T V̂^{−1} (Y − Xβ̂),

where V = var(Y) = WGW^T + R. The parameters G and R, the variances of the random effects and residuals, are usually unknown. Typically these parameters are estimated and plugged into the predictor. The word empirical is often added to indicate such an approximation; thus BLUE and BLUP become EBLUE (the Empirical Best Linear Unbiased Estimator) and EBLUP (the Empirical Best Linear Unbiased Predictor).

Because mixed models contain the unknown parameters u, G and R, ordinary least squares is not the best estimation method. A more appropriate method is generalized least squares (GLS), where

(Y − Xβ)^T V^{−1} (Y − Xβ)   (3.17)

is minimized. However, to solve Equation (3.17), knowledge of G and R is required to obtain the variance of Y (see Equation (3.16)). Thus, the aim is to find appropriate estimates of G and R. The variance-covariance parameters are usually estimated using the Maximum Likelihood (ML) or Restricted Maximum Likelihood (REML) approach.

Many software packages allow fitting of linear mixed models. Computational tools are discussed in more detail in Section 5.3.

3.3.2 Generalized linear mixed models

A generalized linear mixed model is an extension of the generalized linear model discussed in Section 3.2.1. In the generalized linear mixed model, the linear predictor contains random effects in addition to the fixed effects. Just as the generalized linear model generalizes linear regression by allowing the linear model to be related to the response variable via a link function g(·), in the generalized linear mixed model the link function connects a linear predictor and the response variable, but the linear predictor is now a combination of the fixed and random effects.

In a generalized linear mixed model, the fixed and random effects are combined to form a linear predictor η = Xβ + Wu, where X : n×m_X is the design matrix for the fixed effects, β : m_X×1 is a vector of fixed effects, W : n×q is the design matrix for the random effects, and u : q×1 is a vector of random effects.

The relationship between the vector of observations Y : n×1 and the linear predictor η is modelled through the conditional distribution of Y given u:

Y | u ∼ (g^{−1}(η), R),

where E[Y | u] = g^{−1}(η) = g^{−1}(Xβ + Wu) is the mean and R is the covariance matrix of that conditional distribution.

For cluster-correlated data, a logistic mixed model is considered as the outcome model for the binary outcome variable. It is a special case of the generalized linear mixed model with inverse link function g^{−1}(·) = exp(·)/(1 + exp(·)). In our case, the fixed part of the linear predictor contains the data x : n_v×m and Z : n_v×m_Z, and the random part contains a vector u : q×1 of random effects.

We assume that the conditional mean of Y is related to the variables x, Z and W through a monotonic differentiable link function g(·). Thus, the outcome model is a generalized linear mixed model of Y given x and Z, specified by

E(Y | u) = µ = g^{−1}(η) = g^{−1}(xβ_x + Zβ_Z + Wu),   (3.18)

where η = xβ_x + Zβ_Z + Wu is a linear predictor, W : n×q is a design matrix for the random effects u : q×1, β_x are the regression coefficients for x, Z are error-free data and β_Z are the regression coefficients for Z. In the binary case, var(Y_i | u) is determined by E(Y_i | u). Recall that β_x is the parameter of interest.
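A logistic mixed model with cluster-level random intercepts can be simulated as follows (all parameter values are illustrative and unrelated to the thesis simulations). The positive association between the random intercepts and the cluster-level outcome proportions reflects the within-cluster correlation the random effects induce:

```python
import numpy as np

# Simulate binary outcomes from a logistic mixed model with cluster-level
# random intercepts (illustrative parameter values only)
rng = np.random.default_rng(4)
C, n_per = 500, 20
beta0, beta_x, sigma_u = -0.5, 1.0, 1.2

u = rng.normal(0.0, sigma_u, C)         # one random intercept per cluster
x = rng.normal(size=(C, n_per))         # covariate, here measured without error
eta = beta0 + beta_x * x + u[:, None]   # linear predictor
p = 1 / (1 + np.exp(-eta))              # inverse logit link
y = rng.binomial(1, p)

# clusters with a large random intercept should have a high share of ones
cluster_means = y.mean(axis=1)
print(np.corrcoef(u, cluster_means)[0, 1] > 0.5)
```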

Parameter estimation in generalized linear mixed models typically involves the method of maximum likelihood or some of its variants. The solutions are usually iterative and can be numerically quite intensive. Various methods have been proposed to approximate the likelihood when estimating generalized linear mixed model parameters, such as pseudo- and penalized quasi-likelihood (PQL) (Wolfinger and O'Connell, 1993; Schall, 1991; Breslow and Clayton, 1993), Markov chain Monte Carlo (MCMC) algorithms (Zeger and Karim, 1991) and Laplace approximations (Raudenbush et al., 2000).

3.3.3 Measurement error model

Next we will present the study design with a main data and validation data approach for cluster-correlated data. Methods for measurement error adjustment in generalized linear mixed models have been studied, for example, by Wang and Davidian (1996); Wang et al. (1998); Tosteson et al. (1998); Wang et al. (1999); Lin and Carroll (1999); Wang et al. (2000); Buonaccorsi et al. (2000); Zhong et al. (2002); Li et al. (2005); Shen et al. (2008); Bartlett et al. (2009); Li and Wang (2012).

In Section 3.2.2, the measurement error model for the case of independent observations was defined in Equations (3.8) and (3.9). For cluster-correlated data we must allow non-zero within-cluster correlation. As a measurement error model for the cluster-correlated case, we introduce a mixed model for x : n_v×m:

x = Xγ_X + Zγ_Z + Bb + ε_x,   (3.19)

where γ_X and γ_Z are matrices of regression coefficients, X : n_v×m_X, Z : n_v×m_Z and B : n_v×q are design matrices, b : q×m is a matrix of random effects, and ε_x : n_v×m is a matrix of random errors in which each row is a vector with zero mean and covariance matrix Σ_x.

In the following sections, we consider a measurement error model that contains cluster-level random intercepts only. In this case B : n_v×C is a matrix of special structure, where C is the number of clusters. The rows of B correspond to units and the columns to clusters. Each row contains zeroes and a single 1, identifying the cluster to which the unit belongs. The matrix b : C×m consists of random effects, with one row for each cluster; this row has random effects for all m variables.
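The structure of B and the expansion Bb can be illustrated with a small sketch (the cluster labels and effect values below are arbitrary):

```python
import numpy as np

# Build the cluster-intercept design matrix B from cluster labels:
# each row has a single 1 in the column of the unit's cluster
cluster = np.array([0, 0, 1, 1, 1, 2])   # cluster label of each unit
C = cluster.max() + 1
B = np.eye(C)[cluster]                   # n_v x C indicator matrix

# random intercepts for m = 2 error-prone variables, one row per cluster
b = np.array([[0.3, -0.1],
              [-0.2, 0.4],
              [0.1, 0.0]])

# B @ b expands the cluster effects to unit level, as in model (3.19)
print((B.sum(axis=1) == 1).all())        # exactly one 1 per row
print((B @ b)[2].tolist())               # unit 2 receives its cluster's effects
```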


4 Methods for measurement error adjustment

In this chapter, we introduce three methods for adjusting for covariate measurement error: maximum likelihood, multiple imputation and regression calibration. The methods are first discussed for independent observations. Multiple imputation and regression calibration are then extended to the case of correlated observations.

4.1 Maximum likelihood for measurement error adjustment

We begin with the Maximum Likelihood (ML) method. For the internal validation design, the full sample likelihood function was defined in Equation (3.13), and for the external validation design in Equation (3.14). Likelihood models (3.13) and (3.14) are a mixture of exponential family models, including linear and logistic models. If the model is correct, the ML estimators of the parameters have attractive properties: they are consistent, asymptotically normal, and have the smallest asymptotic MSE among all 'regular' estimators (Messer and Natarajan, 2008).

Because of these properties of ML estimators, and motivated by some earlier publications (e.g. Spiegelman et al., 2000; Messer and Natarajan, 2008), the ML estimator is set as the reference method for comparison in the first simulation study, which is conducted on independent observations. The challenge in the computation of the full sample likelihood is to solve the integral term l3 given in Equation (3.12). In the random effects literature, efficient numerical quadrature formulas for Equation (3.12) have been presented (Messer and Natarajan, 2008). Because the term l3 is a normal-logistic random effects model, the ML estimate can be computed using standard software such as the procedure NLMIXED in SAS.

4.2 Multiple imputation

This section focuses on Multiple Imputation (MI) for measurement error adjustment. The notation and central concepts of MI are introduced. For a more extended treatment of the subject, see for example Rubin (1976); Little and Rubin (1987, 2002); Rubin (1996); Harel and Zhou (2007); Rässler et al. (2008); Yucel (2011).

Multiple imputation is usually applied to missing data problems. Missing data appear in almost all survey research. Non-response is the failure to obtain data from a selected sample unit. There are two principal types of non-response: unit non-response and item non-response. Unit non-response occurs when a sampled unit does not participate in the survey at all, e.g. does not fill out or return a data collection instrument. In an item non-response case, the respondent does not respond to one or more items on a survey.

4.2.1 Missing data mechanism

Many methods for handling missing data are based on the assumptions of a missing data mechanism. A missing data mechanism describes the relationship between measured variables and the probability of missing data. Key concepts about missing data mechanisms were formalized originally by Rubin (1976).

Three different missing data mechanisms are presented: missing completely at random, missing at random and not missing at random.

If the missing data values are a simple random sample of all data values, the data are missing completely at random (MCAR) (Little and Rubin, 1987).

In other words, missingness is completely unsystematic. The probability of missing data on some variableX is unrelated to other measured variables and to the value ofX itself. This missing data mechanism is quite rare in practice and mainly a theoretical special case.

More common in practice is the missing at random (MAR) case, i.e. the probability that an observation is missing may depend on an observed part of the data but not on the missing part of the data (Little and Rubin, 1987).

If data are not MAR or MCAR, then the data are not missing at random (NMAR) (Little and Rubin, 1987). The case of NMAR is difficult to deal with in practice. Most strategies for dealing with missing data assume that missingness is MCAR or at least MAR. If missing data are MCAR or MAR then the missing data mechanism is ignorable.

The corresponding terminology for measurement error adjustment approaches, MECAR (measurement error completely at random) and MEAR (measurement error at random), was defined in Section 3.2.2. The assumed measurement error mechanism is applied when the main study and validation study data are created in the forthcoming simulation experiments (Chapters 6–8) with independent observations and cluster-correlated data.

4.2.2 McMC for MI

The Markov chain Monte Carlo (McMC) methods are a collection of tech- niques such as the Metropolis-Hastings algorithm, Gibbs sampling and data augmentation for generating multiple imputations in nontrivial problems. In
