
2.2 Data

2.2.3 Protein-protein interaction data

Besides protein-DNA interactions, PPIs also play a crucial regulatory role in many cellular processes, since the formation of protein complexes is required for most proteins to exert their functions [9; 54].

The yeast two-hybrid (Y2H) system [75; 76] is a commonly used technique for PPI detection.

The Y2H system uses a reporter gene to signal whether two proteins (called the 'prey' and the 'bait') interact [75; 76]. Specifically, the transcription factor (TF) that controls the expression of the reporter gene has two functional domains, namely the DNA-binding domain and the activation domain, which are split and fused with the 'bait' and the 'prey', respectively; when the two proteins interact, the two domains combine and activate the expression of the reporter gene [75; 76].

Large-scale Y2H screens can now be carried out using, e.g., a colony-array format [77–79].

Another widely applied technique for finding PPIs is affinity purification followed by mass spectrometry (AP-MS) [76; 80]. In such an experiment, the two potentially interacting proteins are also named 'bait' and 'prey'. In the AP step, the 'bait' is tagged and extracted together with its 'prey' through co-immunoprecipitation or tandem affinity purification [76; 80]. Then, the extracted proteins are identified by MS [76; 80], which determines the identity of a protein by ionizing the molecule and measuring its mass-to-charge ratio [81]. AP-MS can now be applied in a high-throughput fashion, e.g., in yeast [82–84].

Although Y2H and AP-MS are both used for PPI detection, each has its own strengths and is suitable for different tasks. For example, Y2H is able to find transient interactions, which are beyond AP-MS's reach; AP-MS is biased by protein abundance, whereas Y2H is not; AP-MS can be used to find one-to-many relationships, whereas Y2H is limited to detecting binary interactions; the interactions found by AP-MS exist in vivo, whereas those discovered by Y2H may not exist under physiological conditions; and AP-MS can find weak interactions that are not detectable by Y2H [76].

PPIs are often stored in databases such as DIP [85; 86], MINT [87], MIPS/MPact [88], IntAct [89], BioGRID [90], and HPRD [91]. In practice, information from multiple databases is often integrated [76], since large discrepancies and contradictions exist among different sources [92–97].

The PPIs used in this thesis were queried from six databases, i.e., DIP [85; 86], MINT [87], MIPS/MPact [88], IntAct [89], BioGRID [90], and HPRD [91], via a data integration platform, PINA [98].

Chapter 3

Method

This chapter presents the framework of BGBMM, its EM algorithm, and the approximation-based model selection criteria that were tested.

3.1 Clustering framework

In BGBMM, define $\theta = [\theta_1, \theta_2, \theta_3, \pi]^T$, $\pi = [\pi_1, \ldots, \pi_g]^T$, $\theta_1 = [\alpha_{11}, \ldots, \alpha_{gp_1}, \beta_{11}, \ldots, \beta_{gp_1}]^T$, $\theta_2 = [\mu_{11}, \ldots, \mu_{gp_2}, \sigma^2_{11}, \ldots, \sigma^2_{gp_2}]^T$, and $\theta_3 = [q_{11}, \ldots, q_{gp_3}]^T$, where $p_1$, $p_2$ and $p_3$ denote the dimensions of the observations in the beta, Gaussian and Bernoulli mixture models, respectively.

Also denote $X$, $Y$ and $Z$ as the observations of the beta, Gaussian and Bernoulli distributed data, respectively, and $f(\mathbf{x})$, $f(\mathbf{y})$ and $f(\mathbf{z})$ as the density functions of the beta, Gaussian and Bernoulli distributions, respectively, and let $\mathbf{o} = [\mathbf{x}^T, \mathbf{y}^T, \mathbf{z}^T]^T$.

BGBMM is a joint mixture model of beta, Gaussian and Bernoulli distributions, with the assumption that, for each component i, data of the three distributions are independent.
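That is, the density of each component factorizes across the three data types; using the notation above (with $\theta_i$ collecting the component-$i$ parameters, a shorthand introduced here for later use):

$$f_i(\mathbf{o}\,|\,\theta_i) = f_i(\mathbf{x}\,|\,\theta_{1i})\, f_i(\mathbf{y}\,|\,\theta_{2i})\, f_i(\mathbf{z}\,|\,\theta_{3i})$$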

In the beta mixture model (BMM) part, each component is assumed to be the product of $p_1$ independent beta distributions, whose probability density function is defined in Equation 3.1, where $\theta_{1i} = [\alpha_{i1}, \ldots, \alpha_{ip_1}, \beta_{i1}, \ldots, \beta_{ip_1}]$ and $\mathbf{x} = [x_1, \ldots, x_{p_1}]^T$.

$$f_i(\mathbf{x}\,|\,\theta_{1i}) = \prod_{u=1}^{p_1} \frac{x_u^{\alpha_{iu}-1}\,(1-x_u)^{\beta_{iu}-1}}{B(\alpha_{iu}, \beta_{iu})} \qquad (3.1)$$

Likewise, in the Gaussian mixture model (GMM), each component is assumed to follow a Gaussian distribution, whose probability density function for each gene is defined in Equation 3.2, where $\theta_{2i} = [\mu_{i1}, \ldots, \mu_{ip_2}, \sigma^2_{i1}, \ldots, \sigma^2_{ip_2}]$ and $\mu_i = [\mu_{i1}, \ldots, \mu_{ip_2}]$. Note that a diagonal covariance matrix, $V = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_{p_2}^2)$ (so that $|V| = \prod_{v=1}^{p_2} \sigma_v^2$), is used in the GMM part to reduce the number of parameters that need to be estimated, which is especially useful when dealing with high-dimensional data.
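With a diagonal covariance matrix, this density factorizes over dimensions; its standard form, consistent with the definitions above, is:

$$f_i(\mathbf{y}\,|\,\theta_{2i}) = \prod_{v=1}^{p_2} \frac{1}{\sqrt{2\pi\sigma_{iv}^{2}}} \exp\!\left(-\frac{(y_v - \mu_{iv})^{2}}{2\sigma_{iv}^{2}}\right) \qquad (3.2)$$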

In the Bernoulli mixture model (BerMM) part, each component is modeled as a Bernoulli distribution, with the probability density function for each gene defined in Equation 3.3, where $\theta_{3i} = [q_{i1}, \ldots, q_{ip_3}]$.
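Similarly, the BerMM component density factorizes over the $p_3$ binary dimensions; its standard product-Bernoulli form is:

$$f_i(\mathbf{z}\,|\,\theta_{3i}) = \prod_{w=1}^{p_3} q_{iw}^{z_w} (1 - q_{iw})^{1 - z_w} \qquad (3.3)$$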

A standard EM algorithm is applied to estimate the parameters, $\theta$, jointly and iteratively, where the data log-likelihood (the natural logarithm is used throughout this thesis) is written as Equation 3.4. Recall that $\mathbf{o}_j$ represents observation $j$ ($j = 1, \ldots, n$), $n$ is the number of genes, $g$ is the number of components in this model, and $\pi_i$ is the prior probability of drawing an observation from component $i$ ($0 < \pi_i \le 1$, $\sum_{i=1}^{g} \pi_i = 1$).
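Given the independence of the three blocks, this data log-likelihood has the standard mixture form:

$$L(\theta) = \sum_{j=1}^{n} \log \sum_{i=1}^{g} \pi_i\, f_i(\mathbf{x}_j\,|\,\theta_{1i})\, f_i(\mathbf{y}_j\,|\,\theta_{2i})\, f_i(\mathbf{z}_j\,|\,\theta_{3i}) \qquad (3.4)$$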

The direct maximization of Equation 3.4 is difficult, but the problem can be cast in the framework of incomplete data. Since data of the different distributions are assumed to be independent, the complete-data likelihood, $L_c$, can be factored as Equation 3.5,

$$L_c(\theta) = f(X\,|\,\mathbf{c}, \theta)\, f(Y\,|\,\mathbf{c}, \theta)\, f(Z\,|\,\mathbf{c}, \theta)\, f(\mathbf{c}\,|\,\theta). \qquad (3.5)$$

If we define $c_j \in \{1, \ldots, g\}$ as the cluster membership of $\mathbf{o}_j$, then the complete-data log-likelihood can be written as Equation 3.6, where $\chi(c_j = i)$ is the indicator function of whether $\mathbf{o}_j$ is from the $i$th component or not.
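With this indicator notation, the complete-data log-likelihood takes the standard form:

$$\log L_c(\theta) = \sum_{j=1}^{n} \sum_{i=1}^{g} \chi(c_j = i) \left[\log \pi_i + \log f_i(\mathbf{x}_j\,|\,\theta_{1i}) + \log f_i(\mathbf{y}_j\,|\,\theta_{2i}) + \log f_i(\mathbf{z}_j\,|\,\theta_{3i})\right] \qquad (3.6)$$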

In the EM algorithm, the E step computes the expectation of the complete-data log-likelihood, as shown in Equation 3.7 [99]:

$$Q(\theta\,|\,\theta^{(m)}) = E\!\left[\log L_c(\theta)\,\middle|\,\mathbf{o}, \theta^{(m)}\right] \qquad (3.7)$$

By computing the expectation, Equation 3.7 becomes Equation 3.8, where $\tau$ is calculated according to Bayes' rule [99].

$$Q(\theta\,|\,\theta^{(m)}) = \sum_{j=1}^{n} \sum_{i=1}^{g} \tau_{ji}^{(m)} \left[\log \pi_i + \log f_i(\mathbf{x}_j\,|\,\theta_{1i}) + \log f_i(\mathbf{y}_j\,|\,\theta_{2i}) + \log f_i(\mathbf{z}_j\,|\,\theta_{3i})\right] \qquad (3.8)$$
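Here $\tau_{ji}^{(m)}$ is obtained from Bayes' rule in the usual mixture-model way:

$$\tau_{ji}^{(m)} = \frac{\pi_i^{(m)}\, f_i(\mathbf{o}_j\,|\,\theta_i^{(m)})}{\sum_{i'=1}^{g} \pi_{i'}^{(m)}\, f_{i'}(\mathbf{o}_j\,|\,\theta_{i'}^{(m)})}$$

where $f_i(\mathbf{o}_j\,|\,\theta_i^{(m)}) = f_i(\mathbf{x}_j\,|\,\theta_{1i}^{(m)})\, f_i(\mathbf{y}_j\,|\,\theta_{2i}^{(m)})\, f_i(\mathbf{z}_j\,|\,\theta_{3i}^{(m)})$.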

Note that $\tau_{ji}^{(m)}$ is the estimated posterior probability of $\mathbf{o}_j$ coming from component $i$ at iteration $m$, and each $\mathbf{o}_j$ can be assigned to the component with the largest posterior probability, $\left\{\, i' : \tau_{ji'} = \max_i \tau_{ji} \,\right\}$. Equations 3.7 and 3.8 show that the assumption that the beta, Gaussian and Bernoulli distributed data are independent carries over to the expected log-likelihood as well.

Define $Q_1, \ldots, Q_4$ via Equations 3.8 to 3.11; then Equation 3.8 becomes Equation 3.12 [99].
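Given the additive structure of Equation 3.8, this decomposition presumably takes the form

$$Q(\theta\,|\,\theta^{(m)}) = Q_1(\theta_1) + Q_2(\theta_2) + Q_3(\theta_3) + Q_4(\pi), \qquad (3.12)$$

where each term collects the summands of Equation 3.8 that depend only on the corresponding parameter block.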

Now the problem is converted into a constrained optimization problem, with the Lagrangian function shown in Equation 3.13 [99],

$$L(\theta) = L(\theta_1, \theta_2, \theta_3, \pi) = Q_1(\theta_1) + Q_2(\theta_2) + Q_3(\theta_3) + Q_4(\pi) + \lambda\left(1 - \sum_{i=1}^{g} \pi_i\right), \qquad (3.13)$$

where $\lambda$ is the Lagrange multiplier enforcing $\sum_{i=1}^{g} \pi_i = 1$.

It can be seen from Equation 3.13 that the standard EM for BGBMM reduces to the standard EM for the beta, Gaussian, or Bernoulli mixture model, respectively, when the dimensions of the data of the other two distributions go to zero. In the EM algorithm of BGBMM, the parameters of the BMM are estimated with the Newton-Raphson method, while those of the GMM and BerMM are updated with closed-form formulas.
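To make the procedure concrete, below is a minimal sketch of one EM iteration for a BGBMM-style model, written in Python with NumPy/SciPy (an assumption of this illustration; the thesis itself refers to Matlab). The function names and array shapes are hypothetical, and the beta-block update is simplified to moment matching rather than the Newton-Raphson step of Equation 3.14, so this illustrates the E step and the closed-form M-step updates, not the exact algorithm of this chapter.

import numpy as np
from scipy.stats import beta as beta_dist, norm, bernoulli

def e_step(X, Y, Z, alpha, bet, mu, var, q, pi):
    # E step: posterior responsibilities tau (Bayes' rule, cf. Equation 3.8),
    # computed in log space for numerical stability.
    # X: (n, p1) in (0, 1); Y: (n, p2) real; Z: (n, p3) binary.
    n, g = X.shape[0], pi.size
    log_t = np.empty((n, g))
    for i in range(g):
        log_t[:, i] = (np.log(pi[i])
                       + beta_dist.logpdf(X, alpha[i], bet[i]).sum(axis=1)   # BMM block
                       + norm.logpdf(Y, mu[i], np.sqrt(var[i])).sum(axis=1)  # GMM block, diagonal V
                       + bernoulli.logpmf(Z, q[i]).sum(axis=1))              # BerMM block
    log_t -= log_t.max(axis=1, keepdims=True)        # stabilize before exponentiating
    tau = np.exp(log_t)
    return tau / tau.sum(axis=1, keepdims=True)

def m_step(X, Y, Z, tau):
    # M step: closed-form updates for pi, GMM and BerMM parameters;
    # the beta parameters use moment matching, a hypothetical stand-in
    # for the Newton-Raphson update of Equation 3.14.
    n, g = tau.shape
    w = tau.sum(axis=0)                              # effective component sizes
    pi = w / n                                       # prior probabilities
    mu = (tau.T @ Y) / w[:, None]                    # weighted means
    var = np.stack([(tau[:, i, None] * (Y - mu[i])**2).sum(axis=0) / w[i]
                    for i in range(g)])              # weighted variances (diagonal V)
    q = (tau.T @ Z) / w[:, None]                     # Bernoulli success rates
    mx = (tau.T @ X) / w[:, None]                    # weighted beta means
    vx = np.stack([(tau[:, i, None] * (X - mx[i])**2).sum(axis=0) / w[i]
                   for i in range(g)])               # weighted beta variances
    s = mx * (1 - mx) / vx - 1                       # method-of-moments factor
    return mx * s, (1 - mx) * s, mu, var, q, pi      # alpha, beta, mu, var, q, pi

A full run alternates e_step and m_step from an initial guess until the log-likelihood of Equation 3.4 stabilizes; replacing the moment-matching lines with a Newton-Raphson step would recover the update of Equation 3.14.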

Parameter estimation in BMM

Specifically, let $\theta_{1i} = (\alpha_i, \beta_i)$; then the new estimate $\theta_{1i}^{(m+1)}$ is obtained following Equation 3.14 [99], where $H^{-1}(\theta_{1i}^{(m)})$ is the inverse of the Hessian matrix evaluated at $\theta_{1i}^{(m)}$.

$$\theta_{1i}^{(m+1)} = \theta_{1i}^{(m)} - H^{-1}(\theta_{1i}^{(m)})\, \nabla_{\theta_{1i}} L(\theta^{(m)}) \qquad (3.14)$$

The derivation and formula [99] of $\theta_{1i}^{(m)}$ are given below, where $\Psi$ and $\Psi'$ represent the digamma (the first logarithmic derivative of the gamma function) and trigamma (the second logarithmic derivative of the gamma function) functions in Matlab, respectively, and $u = 1, \ldots, p_1$; the resulting expressions are collected below, after the GMM and BerMM updates.

Parameter estimation in GMM

The parameters of the GMM, $\theta_{2i}$, are estimated by the standard EM algorithm for a GMM with a diagonal covariance matrix. The derivations are shown below [16], which result in Equations 3.16 and 3.17 for the parameter updates.

Parameter estimation in BerMM

The update of the BerMM parameters, $\theta_{3i}$, is given by Equation 3.18.
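In standard form, and under the model assumptions of this chapter, the updates referenced above can be sketched as follows (the usual mixture-model formulas; the exact presentation in [99] and [16] may differ). For the BMM block, the gradient and Hessian entries entering Equation 3.14 are

$$\frac{\partial L}{\partial \alpha_{iu}} = \sum_{j=1}^{n} \tau_{ji}^{(m)} \left[\Psi(\alpha_{iu} + \beta_{iu}) - \Psi(\alpha_{iu}) + \log x_{ju}\right], \qquad \frac{\partial L}{\partial \beta_{iu}} = \sum_{j=1}^{n} \tau_{ji}^{(m)} \left[\Psi(\alpha_{iu} + \beta_{iu}) - \Psi(\beta_{iu}) + \log(1 - x_{ju})\right],$$

with diagonal second derivatives $\sum_{j} \tau_{ji}^{(m)} [\Psi'(\alpha_{iu} + \beta_{iu}) - \Psi'(\alpha_{iu})]$ and $\sum_{j} \tau_{ji}^{(m)} [\Psi'(\alpha_{iu} + \beta_{iu}) - \Psi'(\beta_{iu})]$, and cross term $\sum_{j} \tau_{ji}^{(m)} \Psi'(\alpha_{iu} + \beta_{iu})$. For the GMM block, the closed-form updates of Equations 3.16 and 3.17 are the weighted mean and weighted variance

$$\mu_{iv}^{(m+1)} = \frac{\sum_{j=1}^{n} \tau_{ji}^{(m)} y_{jv}}{\sum_{j=1}^{n} \tau_{ji}^{(m)}}, \qquad \left(\sigma_{iv}^{2}\right)^{(m+1)} = \frac{\sum_{j=1}^{n} \tau_{ji}^{(m)} \left(y_{jv} - \mu_{iv}^{(m+1)}\right)^{2}}{\sum_{j=1}^{n} \tau_{ji}^{(m)}},$$

and for the BerMM block, Equation 3.18 is the weighted frequency

$$q_{iw}^{(m+1)} = \frac{\sum_{j=1}^{n} \tau_{ji}^{(m)} z_{jw}}{\sum_{j=1}^{n} \tau_{ji}^{(m)}}.$$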


Parameter estimation of the prior probability

The prior probability, $\pi$, can be computed as in Equation 3.19 [99], obtained by maximizing

$$Q_4(\pi) = \sum_{j=1}^{n} \sum_{i=1}^{g} \tau_{ji}^{(m)} \log \pi_i$$

subject to $\sum_{i=1}^{g} \pi_i = 1$; solving the Lagrangian condition of Equation 3.13 gives

$$\pi_i^{(m+1)} = \frac{1}{n} \sum_{j=1}^{n} \tau_{ji}^{(m)}. \qquad (3.19)$$