
PART III: DEVELOPMENT AND EVALUATION OF MODEL IDENTIFICATION

8.1 Network Node Dependencies

Dependency measures of variables were discussed in Chapter 5. For topology identification, mutual information (MI), the χ²-statistics approximation (CSS) of MI, and rank-correlation-based similarity measures can all be used to model the dependencies or similarities of network nodes. However, when calculated from data, the above measures are uncertain, their uncertainty depending on the number of observations available from the network. Therefore, more robust dependency measures are needed to compare node dependencies estimated with different-size observation sets, and node pairs with different properties. The method developed here, instead of applying the above dependency measures directly, exploits the statistical significance with which a measure indicates nonzero dependency.

The statistical significance of an estimated quantity is calculated by comparing the estimated value to the distribution of values of that quantity when the values are calculated from a same-size data set under a related null hypothesis. With the dependency measures considered here, the dependency value estimated from a data set for two node variables is thus compared to the distribution of values of the respective measure estimated from a same-size data set under the assumption that the two nodes are statistically independent. In topology identification, statistical significances of dependency measures are estimated for each node pair, and the values are then used as similarity values between nodes to derive the topology of the network. In the following subsections, statistical significances are derived for MI, CSS, and rank-correlation measures.

8.1.1 Statistical Significance of Mutual Information

MI was defined in Eq. (5.7) as a statistical dependency measure of two random variables $X$ and $Y$ as $I(X;Y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$. We now derive a more robust dependency measure, the statistical significance of MI (SSMI). The MI value estimated from a set of $L$ observations, $\hat{I}(X;Y \mid L)$, must be compared to the distribution of MI values estimated from same-size, generated data sets under the null hypothesis that the two variables are statistically independent.

However, we face the difficulty that the analytical form of the MI distribution is not known, and the probabilities under the null hypothesis must therefore be calculated numerically.

Let us assume that the two random variables under consideration are statistically independent, i.e., $p_0(x,y) = p(x)p(y)$. Here $p(x)$ and $p(y)$ are the marginal probability distributions of the variables, derived from the true joint probability distribution $p(x,y)$, and $p_0(x,y)$ is the joint distribution under the assumed null hypothesis. In theory, and when estimated from an infinite data set, MI is obviously zero under the assumed statistical independence: $I(X;Y) = \sum_x \sum_y p(x)p(y) \log \frac{p(x)p(y)}{p(x)p(y)} = 0$. In practice, however, when two variables are statistically independent, and when the marginal probabilities $\hat{p}(x)$ and $\hat{p}(y)$ are estimated from a finite data set of $L$ observations, the product of the marginals is not equal to the joint probability $\hat{p}(x,y)$ estimated from the same set of observations: $\hat{p}(x)\hat{p}(y) \neq \hat{p}(x,y)$. Hence MI assumes a non-zero positive value.
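This finite-sample effect is easy to demonstrate. The sketch below (hypothetical code, not part of the thesis) computes a plug-in MI estimate from observation pairs of two binary variables that are independent by construction; the function name `plugin_mi` and the simulated probabilities are illustrative choices only.

```python
import math
import random

def plugin_mi(pairs):
    """Plug-in (histogram) estimate of mutual information, in nats."""
    n = len(pairs)
    pxy, px, py = {}, {}, {}
    for x, y in pairs:
        pxy[(x, y)] = pxy.get((x, y), 0) + 1
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

random.seed(0)
L = 200
# Two independent binary variables: the true MI is exactly zero.
pairs = [(random.random() < 0.5, random.random() < 0.3) for _ in range(L)]
mi_hat = plugin_mi(pairs)
print(mi_hat)  # strictly positive despite the true independence
```

The estimate is positive because the empirical joint counts almost never factorize exactly into the empirical marginals, which is precisely why a significance-based measure is needed.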

Under the null hypothesis and $L$ observations, the distribution of MI values is estimated as follows. First, $M$ data sets are generated for $X$ and $Y$ under the null hypothesis assumption, each data set consisting of $L$ observations. These observations are easily generated for discrete variables from the multinomial distribution (binomial distribution for binary variables) by using the original data-estimated state probabilities $\hat{p}(x)$ and $\hat{p}(y)$. Then MI values are estimated from each observation set, leading to a histogram estimate of the MI distribution $P(\hat{I} \mid L, H_0)$; alternatively, a more sophisticated method can be used to estimate the distribution. Finally, the SSMI,

$S_{\mathrm{MI}}(X,Y)$, is obtained for variables $X$ and $Y$ as

$S_{\mathrm{MI}}(X,Y) = P\big(\hat{I} \le \hat{I}(X;Y \mid L) \,\big|\, H_0\big)$.   (8.1)

The probability that the null hypothesis is now erroneously discarded is $1 - S_{\mathrm{MI}}(X,Y)$. Because it is estimated from a simulated set of $M$ observation sets, SSMI is a random variable, with its uncertainty depending on both $L$ and $M$. Because the uncertainty of the generated MI values is determined by the fixed $L$, the uncertainty of the SSMI estimate can be reduced only by increasing $M$. Throughout the studies in this thesis, $M = 2000$. Since SSMI is a probability value, it is clearly interpreted and always lies between zero and one: $0 \le S_{\mathrm{MI}}(X,Y) \le 1$. These are desired properties for a similarity measure.
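The Monte Carlo procedure above can be sketched as follows. This is an illustrative implementation, not the thesis code: the function names `mi_est` and `ssmi` are hypothetical, null sets are drawn from the data-estimated marginals by sampling with replacement (equivalent to a multinomial draw), and a small `n_null` is used here for speed rather than the thesis value of 2000.

```python
import random
from collections import Counter
from math import log

def mi_est(xs, ys):
    """Plug-in MI estimate of two equal-length discrete observation sequences."""
    n = len(xs)
    cxy, cx, cy = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log((c / n) / ((cx[x] / n) * (cy[y] / n)))
               for (x, y), c in cxy.items())

def ssmi(xs, ys, n_null=2000, seed=0):
    """SSMI as in Eq. (8.1): the fraction of null-hypothesis MI values that
    fall below the MI estimated from the observed data."""
    rng = random.Random(seed)
    L, mi_obs = len(xs), mi_est(xs, ys)
    below = 0
    for _ in range(n_null):
        # Independent draws from the data-estimated marginals: a sample in
        # which the two generated variables are independent by construction.
        xs0 = rng.choices(xs, k=L)
        ys0 = rng.choices(ys, k=L)
        below += mi_est(xs0, ys0) <= mi_obs
    return below / n_null

rng = random.Random(1)
xs = [rng.randint(0, 1) for _ in range(200)]
ys_dep = [x if rng.random() < 0.8 else 1 - x for x in xs]  # dependent on xs
print(ssmi(xs, ys_dep, n_null=500))  # close to one for a clear dependency
```

For a strongly dependent pair the observed MI lies far above the whole null histogram, so the estimated significance saturates near one.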

In the literature, SSMI has been applied, e.g., to modelling interactions of perturbed genes in [45], where it is first calculated for all gene pairs, and a network based on the SSMI values is then constructed to represent genetic interactions. An estimation of mutual information and SSMI values from finite data sets is also proposed in [141].

8.1.2 Statistical Significance of χ2-Statistics

In Chapter 5, CSS was introduced as an approximation of MI and defined for two variables $X$ and $Y$ by Eq. (5.9) as $\chi^2(X;Y) = L \sum_x \sum_y \frac{(\hat{p}(x,y) - \hat{p}(x)\hat{p}(y))^2}{\hat{p}(x)\hat{p}(y)}$. The probability distribution of CSS is the incomplete gamma function. For $X$ and $Y$, the statistical significance of CSS (SSCSS), $S_{\mathrm{CSS}}(X,Y)$, when a CSS value $\chi^2(X;Y \mid L)$ is estimated from a data set of $L$ observations, and with $\nu$ degrees of freedom, is defined as in Eq. (8.2) [114].

The degrees of freedom parameter is calculated as $\nu = (n(X) - 1)(n(Y) - 1)$, where the function $n(\cdot)$ indicates the number of accessible states of its argument variable. The gamma function, $\Gamma$, is defined as $\Gamma(\nu/2) = \int_0^\infty t^{\nu/2 - 1} \exp(-t)\, dt$. Because the only random variable here is the estimate $\chi^2(X;Y \mid L)$, unlike with the SSMI estimate, no other uncertainties are related to the SSCSS estimate. On the other hand, the value $\chi^2(X;Y \mid L)$ is itself an approximation of the respective MI value $\hat{I}(X;Y \mid L)$, hence causing inaccuracies in the similarity estimate.

The advantage of CSS over MI is that, in the null hypothesis case, the analytical form of its distribution is always known. Thus SSCSS is computationally much less demanding to estimate than SSMI. Recently, some other approximations to MI and its distribution have also been introduced, e.g., in [56], [63], and [64]. In particular, the approximation of MI in [56] is similar to the respective CSS approximation: in [56] MI is approximated as a second-order Taylor series, whereas CSS is obtained by approximating the logarithm function as a first-order Taylor series (see Section 5.5). Hence the respective dependency test based on the MI distribution approximation, also presented in [56], is similar to the χ²-dependency test demonstrated here.
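Because the null distribution of CSS is the chi-squared distribution, SSCSS reduces to evaluating the regularized lower incomplete gamma function at the observed statistic. The sketch below is illustrative rather than the thesis implementation: the names `reg_lower_gamma`, `chi2_stat`, and `sscss` are hypothetical, and the incomplete gamma function is computed with its standard series expansion to keep the example dependency-free.

```python
from math import exp, gamma, log

def reg_lower_gamma(a, x, tol=1e-12, max_iter=10000):
    """Regularized lower incomplete gamma function P(a, x), series expansion."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_iter):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    return total * exp(a * log(x) - x) / gamma(a)

def chi2_stat(table):
    """Pearson chi-squared statistic and degrees of freedom of a count table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))
    return stat, (len(rows) - 1) * (len(cols) - 1)

def sscss(table):
    """SSCSS as in Eq. (8.2): the chi-squared CDF at the observed statistic."""
    stat, dof = chi2_stat(table)
    return reg_lower_gamma(dof / 2.0, stat / 2.0)

print(sscss([[30, 10], [10, 30]]))  # strong dependency: significance near one
```

Unlike the SSMI sketch, no simulation loop is needed here, which is exactly the computational advantage noted above.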

8.1.3 Statistical Significances of Rank-Correlation Measures

Approximations of the distributions of rank-correlation measures in the null hypothesis case are also known. For Spearman's rho, a test measure $t = \hat{\rho}_S(X,Y) \sqrt{(L-2)/(1 - \hat{\rho}_S^2(X,Y))}$, with the data-estimated correlation value $\hat{\rho}_S(X,Y)$ (see Eq. (5.10)), has been constructed to test the hypothesis. In the null hypothesis case of statistical independence, this test measure is approximately distributed according to Student's $t$ distribution with $L - 2$ degrees of freedom [114].

$S_{\mathrm{CSS}}(X,Y) = 1 - \dfrac{1}{\Gamma(\nu/2)} \displaystyle\int_{\chi^2(X;Y \mid L)/2}^{\infty} t^{\nu/2 - 1} \exp(-t)\, dt$.   (8.2)

With $\Gamma$ again denoting the gamma function, the statistical significance of an estimated value $\hat{\rho}_S(X,Y)$ with test measure $t$ is now obtained as

$S_{\rho}(X,Y) = \dfrac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\, \Gamma(\nu/2)} \displaystyle\int_{-\infty}^{t} \left(1 + \frac{u^2}{\nu}\right)^{-(\nu+1)/2} du, \qquad \nu = L - 2.$   (8.3)
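A Student-$t$ CDF of this form can be evaluated numerically without special-function libraries. The sketch below is illustrative only: the names `student_t_cdf` and `spearman_significance` are hypothetical, and plain trapezoidal integration of the density over a truncated range is used, which is adequate for moderate degrees of freedom.

```python
from math import gamma, pi, sqrt

def student_t_cdf(t, dof, steps=20000, lo=-40.0):
    """Student-t CDF by trapezoidal integration of the density in Eq. (8.3)."""
    c = gamma((dof + 1) / 2.0) / (sqrt(dof * pi) * gamma(dof / 2.0))
    h = (t - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        u = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0  # trapezoidal end-point weights
        total += w * (1.0 + u * u / dof) ** (-(dof + 1) / 2.0)
    return c * h * total

def spearman_significance(rho, L):
    """Significance of a data-estimated Spearman rho from L observations."""
    t = rho * sqrt((L - 2) / (1.0 - rho * rho))
    return student_t_cdf(t, L - 2)

print(spearman_significance(0.6, 30))  # clearly above 0.99 for this rho and L
```

As with SSMI and SSCSS, the result is a probability in [0, 1] and can be used directly as a node-similarity value.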

Under the null hypothesis of zero statistical dependency, Kendall's tau correlation coefficient $\hat{\tau}(X,Y)$ (see Eq. (5.11)) is known to be approximately distributed according to the Gaussian distribution, with a zero expectation value and with a variance of $\sigma_\tau^2 = (4L + 10)/(9L(L-1))$ [114]. Consequently, for an estimated value $\hat{\tau}(X,Y)$, the statistical significance is obtained as

$S_{\tau}(X,Y) = \dfrac{1}{\sqrt{2\pi\sigma_\tau^2}} \displaystyle\int_{-\infty}^{\hat{\tau}(X,Y)} \exp\!\left(-\frac{u^2}{2\sigma_\tau^2}\right) du,$   (8.4)

where the measure $\hat{\tau}(X,Y)$ now defines the upper limit of the integration.
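The Kendall case is the simplest of the three significances to compute, since it is just a Gaussian CDF with the null-hypothesis variance. A minimal sketch, with the hypothetical function name `kendall_significance` and the standard-library normal distribution standing in for the integral:

```python
from math import sqrt
from statistics import NormalDist

def kendall_significance(tau, L):
    """Significance of a data-estimated Kendall tau from L observations,
    using the null-hypothesis Gaussian with variance (4L + 10) / (9L(L - 1))."""
    var = (4 * L + 10) / (9 * L * (L - 1))
    return NormalDist(mu=0.0, sigma=sqrt(var)).cdf(tau)

print(kendall_significance(0.3, 50))  # far in the upper tail of the null Gaussian
```

Note that the null variance shrinks roughly as $1/L$, so the same estimated tau becomes more significant as the number of observations grows.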