
PART II: METHODOLOGICAL BACKGROUND

6.   METHODS TO ESTIMATE PARAMETERS

This chapter examines general model parameter estimation methods for later use in Chapter 9, which focuses on identifying MRF model parameters and in particular the Ising model. Bayes’ theorem provides the foundation on which all the probabilistic inference methods are ultimately based. The theorem offers a framework for updating probabilistic information about a system as new uncertain information arrives through measurements. In parameter estimation, Bayes’ theorem is applied by combining existing uncertain (a priori) information about the model parameters with (likelihood) information obtained by evaluating the parameterised model with a set of data observations. Combining the two pieces of information then leads to updated (a posteriori) information about the model parameters.

Even though Bayes’ theorem is not, in fact, directly applied here, it forms the basis for understanding all parameter estimation methods. Consequently, Bayes’ theorem is first discussed in Section 6.1. Section 6.2 introduces the maximum a posteriori (MAP) parameter estimation method, which exploits all the properties of Bayes’ theorem. Maximum likelihood (ML) estimation, introduced in Section 6.3, can be viewed as a special case of the MAP method without a priori information. Maximum pseudolikelihood (MPL) estimation, examined in Section 6.4, is the most specialised method, arising as an approximation of the ML method, and is especially suitable for identifying MRF model parameters.

6.1 Bayes’ Theorem

Bayes’ theorem (see, e.g., [18], [46]) combines uncertain information obtained through observations with uncertain prior system information to arrive at a posteriori system information. The approach carries all the uncertainty about the system parameters and is therefore formulated using probabilities. Hence, to estimate model parameters, Bayes’ theorem combines the prior probability of the parameters with the likelihood function of the parameters evaluated with the data observation values. Therefore, the Bayesian approach provides not only point estimates as the most probable parameter values, but also an entire probability distribution as the uncertainty information about the model parameters.

Let us now consider a system model with a multivariable state described by a random variable S and with a set of model parameters \theta. Let there be a set of observations D = \{d_1, \ldots, d_N\} of size N. Essentially, the Bayesian approach says that our information about the parameters is described with a probability distribution, and hence that the parameters are random variables. For the probability distribution of the model parameters, Bayes’ theorem reads [18]

p(\theta|D) = \frac{p(D|\theta)\, p(\theta)}{p(D)}.    (6.1)

Here p(D|\theta) is the probability of the observation data set D when the parameter values are \theta. This probability is also called the likelihood of the parameters, because once the data set is given, it can be used to find the parameter values that correspond to the largest probability of the observation set.

Assuming the observations statistically independent, we can write their likelihood as the product of the probabilities of the individual observations: p(D|\theta) = \prod_{n=1}^{N} p(d_n|\theta). The prior probability of the parameters is defined by p(\theta). The denominator in Eq. (6.1), p(D), is the marginal probability of the observation set. However, in view of parameter estimation, it is just a constant normalising the probabilities of the parameter values, and hence does not affect the relative conditional probabilities of the parameters or their estimation. If necessary, p(D) is obtained for continuous (discrete) \theta by integrating (summing) the numerator in Eq. (6.1) over the defined range of parameter values.
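As a concrete illustration of Eq. (6.1), the short Python sketch below evaluates the posterior of a single parameter on a discrete grid. The Bernoulli (coin-flip) observation model, the Gaussian-shaped prior and the data values are illustrative assumptions only, not taken from the text; on a grid, the evidence p(D) is simply the sum of the numerator of Eq. (6.1) over the candidate parameter values.

# Posterior of a single parameter on a discrete grid, Eq. (6.1).
# Illustrative assumptions: Bernoulli observations, Gaussian-shaped prior.
import numpy as np

theta = np.linspace(0.0, 1.0, 101)            # candidate parameter values
prior = np.exp(-0.5 * ((theta - 0.5) / 0.2) ** 2)
prior /= prior.sum()                          # p(theta), normalised on the grid

data = np.array([1, 1, 0, 1, 0, 1, 1, 1])     # independent observations d_1..d_N

# Likelihood p(D|theta) = prod_n p(d_n|theta) for Bernoulli observations
likelihood = np.prod(np.where(data[:, None] == 1, theta, 1.0 - theta), axis=0)

evidence = np.sum(likelihood * prior)         # p(D), the normalising constant
posterior = likelihood * prior / evidence     # p(theta|D), Eq. (6.1)
print("posterior mode:", theta[np.argmax(posterior)])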

6.2 Maximum a Posteriori

With Bayes’ theorem, the probability distribution of the model parameters can be obtained in parameter estimation. There is strong information-theoretical motivation, not discussed here, why the best point estimates of the parameters should be chosen as mode values of the posterior probability distribution. This distribution fully describes the uncertainty of the chosen point estimates. In MAP estimation [121], point estimates are obtained according to Bayes’ theorem by choosing the parameter values that correspond to the highest posterior probability. Assuming independent observations, we can formulate MAP parameter estimation as [121]

\theta_{MAP} = \arg\max_{\theta} p(\theta|D) = \arg\max_{\theta} \prod_{n=1}^{N} p(d_n|\theta)\, p(\theta),    (6.2)

where the normalisation constant p(D) is omitted, because, as a constant, it has no effect on the position of the maximum of the probability distribution or on the relative probabilities of the parameter values.

The logarithm is a monotonic function and thus does not change the values at which its argument assumes its maximum or minimum. When the logarithm is taken of the product of the probabilities of the individual observations in Eq. (6.2), the product of the terms transforms into a sum of the terms:

\theta_{MAP} = \arg\max_{\theta} \left[ \sum_{n=1}^{N} \log p(d_n|\theta) + \log p(\theta) \right],    (6.3)

from which the parameter values corresponding to the maximum probability are usually easier to obtain.
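The Python sketch below illustrates Eq. (6.3) for a toy problem in which the parameter is the mean of Gaussian observations with known noise variance and a Gaussian prior; all numerical values are illustrative assumptions. The grid argmax of the log-posterior is cross-checked against the conjugate closed-form posterior mode known for this particular model.

# MAP estimation by maximising the sum of log-likelihood and log-prior, Eq. (6.3).
# Illustrative assumptions: Gaussian observations with known variance, Gaussian prior.
import numpy as np

rng = np.random.default_rng(1)
sigma, mu_0, tau = 1.0, 0.0, 2.0              # noise std, prior mean, prior std
d = rng.normal(1.5, sigma, size=20)           # observations d_1..d_N

mu = np.linspace(-3.0, 3.0, 2001)             # candidate parameter values
# sum_n log p(d_n|mu) + log p(mu), dropping mu-independent constants
log_lik = -0.5 * ((d[:, None] - mu) ** 2 / sigma**2).sum(axis=0)
log_prior = -0.5 * (mu - mu_0) ** 2 / tau**2
mu_map = mu[np.argmax(log_lik + log_prior)]

# Conjugate closed form for the posterior mode of a Gaussian mean (cross-check)
mu_closed = (d.sum() / sigma**2 + mu_0 / tau**2) / (len(d) / sigma**2 + 1 / tau**2)
print(f"grid MAP {mu_map:.3f} vs closed form {mu_closed:.3f}")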

6.3 Maximum Likelihood

Based on Bayes’ theorem, finding MAP parameter estimates requires prior information about parameter values, and if this is not available, a guess should be made about the parameter probabilities. However, more often than not, prior information is lacking or a good guess is hard to make.

Therefore, all information is obtained by observation, and hence only the likelihood of the parameters can be used to estimate the parameters. In Bayes’ theorem, the lack of prior information can be interpreted as the prior probability being uniform, p(\theta) = c_2, for all parameter combinations. Because the prior probability is a constant, the posterior probability can thus be rewritten as [46]

p(\theta|D) = c_1\, p(D|\theta)\, c_2 = c\, p(D|\theta),    (6.4)

where the two constants, the normalisation constant c_1 = 1/p(D) and the constant prior c_2, are finally combined as c = c_1 c_2. If the parameters may assume infinitely many values, the prior distribution of constant probability is not well-defined. However, such improper priors can be used with Bayes’ theorem, because only the posterior distribution is then normalised. Without the constant c, Eq. (6.4) is called the likelihood L(\theta|D) of the parameters [144]: L(\theta|D) = p(D|\theta) = \prod_{n=1}^{N} p(d_n|\theta). Because the constants again do not affect the parameter values at which the maximum probability occurs, in the ML method parameter estimates are obtained by simply maximising the likelihood or its logarithm [144]:

\theta_{ML} = \arg\max_{\theta} p(D|\theta) = \arg\max_{\theta} \log p(D|\theta).    (6.5)

As a result, best point estimates are again obtained for the parameters, whereas the uncertainties related to these estimates can be studied through Eq. (6.4).
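As a sketch of Eq. (6.5), the Python snippet below estimates the rate parameter of an exponential distribution by maximising the log-likelihood over a grid; the simulated data and the model choice are illustrative assumptions, and the closed-form estimate 1/mean(d) serves only as a cross-check.

# ML estimation by maximising the log-likelihood, Eq. (6.5).
# Illustrative assumptions: exponentially distributed observations.
import numpy as np

rng = np.random.default_rng(2)
d = rng.exponential(scale=1.0 / 0.7, size=50)  # observations with true rate 0.7

lam = np.linspace(0.05, 3.0, 2951)             # candidate rate values
# log p(D|lambda) = sum_n [log lambda - lambda * d_n]
log_lik = len(d) * np.log(lam) - lam * d.sum()
lam_ml = lam[np.argmax(log_lik)]
print(f"grid ML {lam_ml:.3f} vs closed form {1.0 / d.mean():.3f}")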

6.4 Maximum Pseudolikelihood

The ML method is a general approach to estimating model parameters if no prior information is available. However, it is difficult to apply to estimating MRF model parameters, because the conditional probability distribution p(D|\theta) in Eq. (6.5) includes the partition function Z, which, as discussed in Chapter 4, is practically impossible to calculate in general. Therefore, Eq. (6.5) cannot be evaluated, and the parameters cannot be estimated. Yet again, we have exceptions, such as the Gaussian MRF model, discussed in Subsection 4.2.3, in which the normalisation constant is easy to calculate, and thus the ML method can be applied.

Maximum pseudolikelihood (MPL) [16], [17] is similar to ML, but the partition function need not be calculated for the whole joint probability distribution; instead, the joint probability is approximated as a product of the full conditionals of the variables (see Eq. (4.2)), i.e., the conditional probabilities of each variable given the remaining variables. Though all these conditional probability distributions include the normalisation term, the number of states over which the summation (or integration) runs is equal to the number of possible variable states rather than possible system states.

For example, in the Ising model, the normalisation term of a full conditional consists only of the sum of two terms, as shown by Eq. (4.14).

When indexing the variables with subscript i, and with subscript -i referring to the remaining variables except i, the approximation of the conditional probability in Eq. (6.5) at observation d_n can be written as [16], [17]

p(d_n|\theta) \approx \prod_{i} p(d_{n,i} \,|\, d_{n,-i}, \theta).    (6.6)

The Bayesian posterior probability of the parameters in approximate form now reads

p(\theta|D) \approx c \prod_{n=1}^{N} \prod_{i} p(d_{n,i} \,|\, d_{n,-i}, \theta).    (6.7)

Without the constant c, Eq. (6.7) is called the pseudolikelihood PL(\theta|D) of the parameters \theta: PL(\theta|D) = \prod_{n=1}^{N} \prod_{i} p(d_{n,i} \,|\, d_{n,-i}, \theta). As a constant, c can again be omitted from the parameter estimation, and when taking the logarithm of the pseudolikelihood, MPL parameter estimates are obtained as

\theta_{MPL} = \arg\max_{\theta} PL(\theta|D) = \arg\max_{\theta} \sum_{n=1}^{N} \sum_{i} \log p(d_{n,i} \,|\, d_{n,-i}, \theta),    (6.8)

and the pseudouncertainty of the parameters is given by Eq. (6.7).
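To connect Eq. (6.8) with the Ising model, the Python sketch below estimates the coupling parameter of a zero-field Ising model on a periodic two-dimensional lattice by maximising the log-pseudolikelihood over a one-dimensional grid. The energy convention p(s) ∝ exp(β Σ s_i s_j over neighbour pairs) with spins in {-1, +1}, the lattice size, the Gibbs-sampled toy data and the parameter grid are illustrative assumptions; the two-state full conditional corresponds to the two-term normalisation referred to in connection with Eq. (4.14).

# MPL estimation of the Ising coupling beta, Eq. (6.8).
# Illustrative assumptions: zero external field, periodic 32 x 32 lattice,
# toy data generated by checkerboard Gibbs sweeps at a chosen "true" beta.
import numpy as np

rng = np.random.default_rng(0)

def neighbour_sum(s):
    """Sum of the four nearest-neighbour spins at every site (periodic boundaries)."""
    return (np.roll(s, 1, 0) + np.roll(s, -1, 0) +
            np.roll(s, 1, 1) + np.roll(s, -1, 1))

def log_pseudolikelihood(beta, samples):
    """Sum over observations and sites of log p(s_i | s_-i, beta).

    For the Ising model the full conditional has only two states, so
    p(s_i | s_-i, beta) = sigmoid(2 * s_i * beta * m_i), with m_i the neighbour sum."""
    total = 0.0
    for s in samples:
        m = neighbour_sum(s)
        total += -np.logaddexp(0.0, -2.0 * beta * s * m).sum()  # stable log-sigmoid
    return total

def gibbs_sample(beta, size=32, sweeps=300):
    """Draw one configuration by checkerboard Gibbs sweeps (toy data only)."""
    s = rng.choice([-1, 1], size=(size, size))
    ii, jj = np.indices((size, size))
    for _ in range(sweeps):
        for mask in ((ii + jj) % 2 == 0, (ii + jj) % 2 == 1):
            p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * neighbour_sum(s)))
            s = np.where(mask & (rng.random(s.shape) < p_up), 1,
                         np.where(mask, -1, s))
    return s

beta_true = 0.3
samples = [gibbs_sample(beta_true) for _ in range(5)]

# Crude one-dimensional grid search for the maximiser of the log-pseudolikelihood
grid = np.linspace(0.0, 0.6, 61)
beta_mpl = grid[np.argmax([log_pseudolikelihood(b, samples) for b in grid])]
print(f"MPL estimate of beta: {beta_mpl:.3f} (true coupling {beta_true})")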