
2.2 Data

2.2.3 Protein-protein interaction data

Besides protein-DNA interactions, PPIs also play a crucial regulatory role in many cellular processes, since the formation of protein complexes is required for most proteins to exert their functions [9; 54].

The yeast two-hybrid (Y2H) system [75; 76] is a commonly used technique for PPI detection.

The Y2H system uses a reporter gene to signal whether two proteins (called the 'prey' and the 'bait') interact [75; 76]. Specifically, the transcription factor (TF) that controls the expression of the reporter gene has two functional domains, namely the DNA-binding domain and the activation domain, which are split and fused with the 'bait' and the 'prey', respectively; when the two proteins interact, the two domains combine and activate the expression of the reporter gene [75; 76].

Large-scale Y2H screens can now be carried out using, e.g., a colony-array format [77–79].

Another widely applied technique for finding PPIs is affinity purification followed by mass spectrometry (AP-MS) [76; 80]. In such an experiment, the two potentially interacting proteins are also named 'bait' and 'prey'. In the AP step, the 'bait' is tagged and extracted together with its 'prey' through co-immunoprecipitation or tandem affinity purification [76; 80]. Then, the extracted proteins are identified by MS [76; 80], which determines the identity of a protein by ionizing the molecule and measuring its mass-to-charge ratio [81]. AP-MS can now be applied in a high-throughput fashion, e.g., in yeast [82–84].

Although Y2H and AP-MS are both used for PPI detection, each has its own strengths and is suitable for different tasks. For example, Y2H is able to find transient interactions, which are beyond AP-MS's reach; AP-MS is biased by protein abundance, whereas Y2H is not; AP-MS can be used to find one-to-many relationships, whereas Y2H is limited to detecting binary interactions; the interactions found by AP-MS exist in vivo, whereas those discovered by Y2H may not exist under physiological conditions; and AP-MS can find weak interactions that are not detectable by Y2H [76].

PPIs are often stored in databases such as DIP [85; 86], MINT [87], MIPS/MPact [88], IntAct [89], BioGRID [90], and HPRD [91]. In practice, information from multiple databases is often integrated [76], since large discrepancies and contradictions exist among different sources [92–97].

The PPIs used in this thesis were queried from six databases, i.e., DIP [85; 86], MINT [87], MIPS/MPact [88], IntAct [89], BioGRID [90], and HPRD [91], via a data integration platform, PINA [98].

Chapter 3

Method

This chapter presents the framework of BGBMM, its EM algorithm, and the approximation-based model selection criteria that were tested.

3.1 Clustering framework

In BGBMM, define $\theta = [\theta_1, \theta_2, \theta_3, \pi]^T$, $\pi = [\pi_1, \ldots, \pi_g]^T$, $\theta_1 = [\alpha_{11}, \ldots, \alpha_{gp_1}, \beta_{11}, \ldots, \beta_{gp_1}]^T$, $\theta_2 = [\mu_{11}, \ldots, \mu_{gp_2}, \sigma^2_{11}, \ldots, \sigma^2_{gp_2}]^T$, and $\theta_3 = [q_{11}, \ldots, q_{gp_3}]^T$, where $p_1$, $p_2$ and $p_3$ denote the dimensions of the observations in the beta, Gaussian and Bernoulli mixture models, respectively.

Also denote $X$, $Y$ and $Z$ as the observations of the beta, Gaussian and Bernoulli distributed data, respectively, and $f(\mathbf{x})$, $f(\mathbf{y})$ and $f(\mathbf{z})$ as the density functions of the beta, Gaussian and Bernoulli distributions, respectively, and let $\mathbf{o} = [\mathbf{x}^T, \mathbf{y}^T, \mathbf{z}^T]^T$.

BGBMM is a joint mixture model of beta, Gaussian and Bernoulli distributions, with the assumption that, for each component i, data of the three distributions are independent.
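That is, the density of each component factorizes across the three data types; using the notation above (with $\theta_i$ collecting the component-$i$ parameters, a shorthand introduced here for later use):

$$f_i(\mathbf{o}\,|\,\theta_i) = f_i(\mathbf{x}\,|\,\theta_{1i})\, f_i(\mathbf{y}\,|\,\theta_{2i})\, f_i(\mathbf{z}\,|\,\theta_{3i})$$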

In the beta mixture model (BMM) part, each component is assumed to be the product of $p_1$ independent beta distributions, whose probability density function is defined in Equation 3.1, where $\theta_{1i} = [\alpha_{i1}, \ldots, \alpha_{ip_1}, \beta_{i1}, \ldots, \beta_{ip_1}]$ and $\mathbf{x} = [x_1, \ldots, x_{p_1}]^T$.

$$f_i(\mathbf{x}\,|\,\theta_{1i}) = \prod_{u=1}^{p_1} \frac{x_u^{\alpha_{iu}-1}\,(1-x_u)^{\beta_{iu}-1}}{B(\alpha_{iu}, \beta_{iu})} \qquad (3.1)$$

Likewise, in the Gaussian mixture model (GMM), each component is assumed to follow a Gaussian distribution, whose probability density function for each gene is defined in Equation 3.2, where $\theta_{2i} = [\mu_{i1}, \ldots, \mu_{ip_2}, \sigma^2_{i1}, \ldots, \sigma^2_{ip_2}]$ and $\mu_i = [\mu_{i1}, \ldots, \mu_{ip_2}]$. Note that a diagonal covariance matrix, $V = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_{p_2}^2)$ (so that $|V| = \prod_{v=1}^{p_2} \sigma_v^2$), is used in the GMM part to reduce the number of parameters that need to be estimated, which is especially useful when dealing with high-dimensional data.
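With a diagonal covariance matrix, this density factorizes over dimensions; its standard form, consistent with the definitions above, is:

$$f_i(\mathbf{y}\,|\,\theta_{2i}) = \prod_{v=1}^{p_2} \frac{1}{\sqrt{2\pi\sigma_{iv}^{2}}} \exp\!\left(-\frac{(y_v - \mu_{iv})^{2}}{2\sigma_{iv}^{2}}\right) \qquad (3.2)$$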

In the Bernoulli mixture model (BerMM) part, each component is modeled as a Bernoulli distribution, with the probability density function for each gene defined in Equation 3.3, where $\theta_{3i} = [q_{i1}, \ldots, q_{ip_3}]$.
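Similarly, the BerMM component density factorizes over the $p_3$ binary dimensions; its standard product-Bernoulli form is:

$$f_i(\mathbf{z}\,|\,\theta_{3i}) = \prod_{w=1}^{p_3} q_{iw}^{z_w} (1 - q_{iw})^{1 - z_w} \qquad (3.3)$$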

A standard EM algorithm is applied to estimate the parameters, $\theta$, jointly and iteratively, where the data log-likelihood (the natural logarithm is used throughout this thesis) is written as Equation 3.4. Recall that $\mathbf{o}_j$ represents observation $j$ ($j = 1, \ldots, n$), $n$ is the number of genes, $g$ is the number of components in this model, and $\pi_i$ is the prior probability of drawing an observation from component $i$ ($0 < \pi_i \le 1$, $\sum_{i=1}^{g} \pi_i = 1$).
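Given the independence of the three blocks, this data log-likelihood has the standard mixture form:

$$L(\theta) = \sum_{j=1}^{n} \log \sum_{i=1}^{g} \pi_i\, f_i(\mathbf{x}_j\,|\,\theta_{1i})\, f_i(\mathbf{y}_j\,|\,\theta_{2i})\, f_i(\mathbf{z}_j\,|\,\theta_{3i}) \qquad (3.4)$$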

The direct maximization of Equation 3.4 is difficult, but the problem can be cast in the framework of incomplete data. Since data of the different distributions are assumed to be independent, the complete-data likelihood, $L_c$, can be factored as Equation 3.5,

$$L_c(\theta) = f(X\,|\,\mathbf{c}, \theta)\, f(Y\,|\,\mathbf{c}, \theta)\, f(Z\,|\,\mathbf{c}, \theta)\, f(\mathbf{c}\,|\,\theta). \qquad (3.5)$$

If we define $c_j \in \{1, \ldots, g\}$ as the cluster membership of $\mathbf{o}_j$, then the complete-data log-likelihood can be written as Equation 3.6, where $\chi(c_j = i)$ is the indicator function of whether $\mathbf{o}_j$ is from the $i$th component or not.
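With this indicator notation, the complete-data log-likelihood takes the standard form:

$$\log L_c(\theta) = \sum_{j=1}^{n} \sum_{i=1}^{g} \chi(c_j = i) \left[\log \pi_i + \log f_i(\mathbf{x}_j\,|\,\theta_{1i}) + \log f_i(\mathbf{y}_j\,|\,\theta_{2i}) + \log f_i(\mathbf{z}_j\,|\,\theta_{3i})\right] \qquad (3.6)$$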

In the EM algorithm, the E step computes the expectation of the complete-data log-likelihood, as shown in Equation 3.7 [99]:

$$Q(\theta\,|\,\theta^{(m)}) = E\!\left[\log L_c(\theta)\,\middle|\,\mathbf{o}, \theta^{(m)}\right] \qquad (3.7)$$

By computing the expectation, Equation 3.7 becomes Equation 3.8, where $\tau$ is calculated according to Bayes' rule [99].

$$Q(\theta\,|\,\theta^{(m)}) = \sum_{j=1}^{n} \sum_{i=1}^{g} \tau_{ji}^{(m)} \left[\log \pi_i + \log f_i(\mathbf{x}_j\,|\,\theta_{1i}) + \log f_i(\mathbf{y}_j\,|\,\theta_{2i}) + \log f_i(\mathbf{z}_j\,|\,\theta_{3i})\right] \qquad (3.8)$$
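Here $\tau_{ji}^{(m)}$ is obtained from Bayes' rule in the usual mixture-model way:

$$\tau_{ji}^{(m)} = \frac{\pi_i^{(m)}\, f_i(\mathbf{o}_j\,|\,\theta_i^{(m)})}{\sum_{i'=1}^{g} \pi_{i'}^{(m)}\, f_{i'}(\mathbf{o}_j\,|\,\theta_{i'}^{(m)})}$$

where $f_i(\mathbf{o}_j\,|\,\theta_i^{(m)}) = f_i(\mathbf{x}_j\,|\,\theta_{1i}^{(m)})\, f_i(\mathbf{y}_j\,|\,\theta_{2i}^{(m)})\, f_i(\mathbf{z}_j\,|\,\theta_{3i}^{(m)})$.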

Note that $\tau_{ji}^{(m)}$ is the estimated posterior probability of $\mathbf{o}_j$ coming from component $i$ at iteration $m$, and each $\mathbf{o}_j$ can be assigned to the component with the largest posterior probability, $\left\{\, i' : \tau_{ji'} = \max_i \tau_{ji} \,\right\}$. Equations 3.7 and 3.8 show that the assumption that the beta, Gaussian and Bernoulli distributed data are independent carries over to the expected log-likelihood as well.

Define $Q_1, \ldots, Q_4$ via Equations 3.8 to 3.11; then Equation 3.8 becomes Equation 3.12 [99].
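Given the additive structure of Equation 3.8, this decomposition presumably takes the form

$$Q(\theta\,|\,\theta^{(m)}) = Q_1(\theta_1) + Q_2(\theta_2) + Q_3(\theta_3) + Q_4(\pi), \qquad (3.12)$$

where each term collects the summands of Equation 3.8 that depend only on the corresponding parameter block.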

Now the problem is converted into a constrained optimization problem, with the Lagrangian function shown in Equation 3.13 [99],

$$L(\theta) = L(\theta_1, \theta_2, \theta_3, \pi) = Q_1(\theta_1) + Q_2(\theta_2) + Q_3(\theta_3) + Q_4(\pi) + \lambda\left(1 - \sum_{i=1}^{g} \pi_i\right), \qquad (3.13)$$

where $\lambda$ is the Lagrange multiplier enforcing $\sum_{i=1}^{g} \pi_i = 1$.

It can be seen from Equation 3.13 that the standard EM for BGBMM reduces to the standard EM for the beta, Gaussian, or Bernoulli mixture model, respectively, when the dimensions of the data of the other two distributions go to zero. In the EM algorithm of BGBMM, the parameters of the BMM are estimated with the Newton-Raphson method, while those of the GMM and BerMM are updated with closed-form formulas.
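To make the procedure concrete, below is a minimal sketch of one EM iteration for a BGBMM-style model, written in Python with NumPy/SciPy (an assumption of this illustration; the thesis itself refers to Matlab). The function names and array shapes are hypothetical, and the beta-block update is simplified to moment matching rather than the Newton-Raphson step of Equation 3.14, so this illustrates the E step and the closed-form M-step updates, not the exact algorithm of this chapter.

import numpy as np
from scipy.stats import beta as beta_dist, norm, bernoulli

def e_step(X, Y, Z, alpha, bet, mu, var, q, pi):
    # E step: posterior responsibilities tau (Bayes' rule, cf. Equation 3.8),
    # computed in log space for numerical stability.
    # X: (n, p1) in (0, 1); Y: (n, p2) real; Z: (n, p3) binary.
    n, g = X.shape[0], pi.size
    log_t = np.empty((n, g))
    for i in range(g):
        log_t[:, i] = (np.log(pi[i])
                       + beta_dist.logpdf(X, alpha[i], bet[i]).sum(axis=1)   # BMM block
                       + norm.logpdf(Y, mu[i], np.sqrt(var[i])).sum(axis=1)  # GMM block, diagonal V
                       + bernoulli.logpmf(Z, q[i]).sum(axis=1))              # BerMM block
    log_t -= log_t.max(axis=1, keepdims=True)        # stabilize before exponentiating
    tau = np.exp(log_t)
    return tau / tau.sum(axis=1, keepdims=True)

def m_step(X, Y, Z, tau):
    # M step: closed-form updates for pi, GMM and BerMM parameters;
    # the beta parameters use moment matching, a hypothetical stand-in
    # for the Newton-Raphson update of Equation 3.14.
    n, g = tau.shape
    w = tau.sum(axis=0)                              # effective component sizes
    pi = w / n                                       # prior probabilities
    mu = (tau.T @ Y) / w[:, None]                    # weighted means
    var = np.stack([(tau[:, i, None] * (Y - mu[i])**2).sum(axis=0) / w[i]
                    for i in range(g)])              # weighted variances (diagonal V)
    q = (tau.T @ Z) / w[:, None]                     # Bernoulli success rates
    mx = (tau.T @ X) / w[:, None]                    # weighted beta means
    vx = np.stack([(tau[:, i, None] * (X - mx[i])**2).sum(axis=0) / w[i]
                   for i in range(g)])               # weighted beta variances
    s = mx * (1 - mx) / vx - 1                       # method-of-moments factor
    return mx * s, (1 - mx) * s, mu, var, q, pi      # alpha, beta, mu, var, q, pi

A full run alternates e_step and m_step from an initial guess until the log-likelihood of Equation 3.4 stabilizes; replacing the moment-matching lines with a Newton-Raphson step would recover the update of Equation 3.14.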

Parameter estimation in BMM

Specifically, let $\theta_{1i} = (\alpha_i, \beta_i)$; then the new estimate $\theta_{1i}^{(m+1)}$ is obtained following Equation 3.14 [99], where $H^{-1}(\theta_{1i}^{(m)})$ is the inverse of the Hessian matrix evaluated at $\theta_{1i}^{(m)}$.

$$\theta_{1i}^{(m+1)} = \theta_{1i}^{(m)} - H^{-1}(\theta_{1i}^{(m)})\, \nabla_{\theta_{1i}} L(\theta^{(m)}) \qquad (3.14)$$

The derivation and formula [99] of $\theta_{1i}^{(m)}$ are given below, where $\Psi$ and $\Psi'$ represent the digamma (the first logarithmic derivative of the gamma function) and trigamma (the second logarithmic derivative of the gamma function) functions in Matlab, respectively, and $u = 1, \ldots, p_1$; the resulting expressions are collected below, after the GMM and BerMM updates.

Parameter estimation in GMM

The parameters of the GMM, $\theta_{2i}$, are estimated by the standard EM algorithm for a GMM with a diagonal covariance matrix. The derivations are shown below [16], which result in Equations 3.16 and 3.17 for the parameter updates.

Parameter estimation in BerMM

The update of the BerMM parameters, $\theta_{3i}$, is given by Equation 3.18.
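In standard form, and under the model assumptions of this chapter, the updates referenced above can be sketched as follows (the usual mixture-model formulas; the exact presentation in [99] and [16] may differ). For the BMM block, the gradient and Hessian entries entering Equation 3.14 are

$$\frac{\partial L}{\partial \alpha_{iu}} = \sum_{j=1}^{n} \tau_{ji}^{(m)} \left[\Psi(\alpha_{iu} + \beta_{iu}) - \Psi(\alpha_{iu}) + \log x_{ju}\right], \qquad \frac{\partial L}{\partial \beta_{iu}} = \sum_{j=1}^{n} \tau_{ji}^{(m)} \left[\Psi(\alpha_{iu} + \beta_{iu}) - \Psi(\beta_{iu}) + \log(1 - x_{ju})\right],$$

with diagonal second derivatives $\sum_{j} \tau_{ji}^{(m)} [\Psi'(\alpha_{iu} + \beta_{iu}) - \Psi'(\alpha_{iu})]$ and $\sum_{j} \tau_{ji}^{(m)} [\Psi'(\alpha_{iu} + \beta_{iu}) - \Psi'(\beta_{iu})]$, and cross term $\sum_{j} \tau_{ji}^{(m)} \Psi'(\alpha_{iu} + \beta_{iu})$. For the GMM block, the closed-form updates of Equations 3.16 and 3.17 are the weighted mean and weighted variance

$$\mu_{iv}^{(m+1)} = \frac{\sum_{j=1}^{n} \tau_{ji}^{(m)} y_{jv}}{\sum_{j=1}^{n} \tau_{ji}^{(m)}}, \qquad \left(\sigma_{iv}^{2}\right)^{(m+1)} = \frac{\sum_{j=1}^{n} \tau_{ji}^{(m)} \left(y_{jv} - \mu_{iv}^{(m+1)}\right)^{2}}{\sum_{j=1}^{n} \tau_{ji}^{(m)}},$$

and for the BerMM block, Equation 3.18 is the weighted frequency

$$q_{iw}^{(m+1)} = \frac{\sum_{j=1}^{n} \tau_{ji}^{(m)} z_{jw}}{\sum_{j=1}^{n} \tau_{ji}^{(m)}}.$$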


Parameter estimation of the prior probability

The prior probability, $\pi$, can be computed as in Equation 3.19 [99], obtained by maximizing

$$Q_4(\pi) = \sum_{j=1}^{n} \sum_{i=1}^{g} \tau_{ji}^{(m)} \log \pi_i$$

subject to $\sum_{i=1}^{g} \pi_i = 1$; solving the Lagrangian condition of Equation 3.13 gives

$$\pi_i^{(m+1)} = \frac{1}{n} \sum_{j=1}^{n} \tau_{ji}^{(m)}. \qquad (3.19)$$