Discriminant analysis can be used for dimension reduction (which was covered in Section 3.1.2) and for classification. Linear Discriminant Analysis (LDA) is used to find the features, or the linear combination of the features, that best discriminate between two or more groups. One way to represent a two-class classifier is in terms of a discriminant function g(x). The equation g(x) = 0 defines the decision boundary, which is a line that separates the samples of the two classes. A new unseen sample is categorized to the first class if g(x) > 0, and otherwise to the second class. This decision boundary is a hyperplane when g(x) is linear. [Bishop, 2006; Marsland, 2014]

There are two different approaches to LDA by two different researchers. For two-class classification, Fisher's and Welch's LDA are identical; only their approaches differ. Welch's LDA tries to minimize the total probability of misclassification, which depends on properties of the features, such as the normal distribution of the data and the class probabilities. Fisher's approach tries to find a linear combination of the features such that there is a large separation between the projected class means and a small variance within each class cluster. Fisher's LDA also has no normal distribution requirement on the data, and it can be generalized to more than two classes by small changes to the covariance matrices. [Kuhn, 2013]. Welch's LDA is explained in Section 3.3.1, parameter estimation for Welch's LDA in Section 3.3.2, and Fisher's LDA in Section 3.3.3.

3.3.1 Linear and Quadratic Discriminant Analysis

For the basis of classification, we use Bayes' theorem to get Eq. 11. This is the same equation that was used in the context of the naïve Bayes classifier, the only difference being that in the naïve Bayes classifier the features are presumed to be independent of each other. The density function $f_k$ for a class k is given in Eq. 12, where the parameters are: $\mu_k$, the mean vector of dimension p; $\Sigma_k$, the p × p covariance matrix; and $\pi_k$, the prior probability of the class k. According to Bayes' theorem, and presuming that all the parameters are known, $P(C_k \mid x)$ gives the probability that the case/vector x belongs to the class k. The prior probability $\pi_k$ is defined as the number of cases of the class k divided by the total number of cases in the training data. [Hastie et al., 2008]. For example, for a three-class problem, if the prior probabilities are $\pi_1 = 0.8$, $\pi_2 = 0.1$ and $\pi_3 = 0.1$ and the densities are $f_1 = 0.3$, $f_2 = 0.3$ and $f_3 = 0.4$, we get the following probabilities for the case x for the three classes:

\[
P(C_1 \mid x) = \frac{0.8 \cdot 0.3}{(0.8 \cdot 0.3) + (0.1 \cdot 0.3) + (0.1 \cdot 0.4)} \approx 0.77, \qquad P(C_2 \mid x) \approx 0.10, \qquad P(C_3 \mid x) \approx 0.13.
\]
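To make the arithmetic above easy to reproduce, here is a minimal Python sketch of this calculation; the variable names are chosen for illustration only and do not come from the sources.

```python
# Posterior probabilities for the three-class example above (Bayes' rule of Eq. 11).
# The priors and class-conditional densities f_k(x) are the values from the text.
priors = [0.8, 0.1, 0.1]
densities = [0.3, 0.3, 0.4]

evidence = sum(p * f for p, f in zip(priors, densities))            # denominator of Eq. 11
posteriors = [p * f / evidence for p, f in zip(priors, densities)]

print([round(p, 2) for p in posteriors])                            # [0.77, 0.1, 0.13]
```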

\[
P(C_k \mid x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{m} f_l(x)\,\pi_l}, \qquad k = 1, 2, \ldots, m \tag{11}
\]

\[
f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}, \qquad k = 1, 2, \ldots, m \tag{12}
\]
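For concreteness, Eq. 12 can be evaluated with a few lines of NumPy as sketched below; the function name and its arguments are this sketch's own assumptions, not anything prescribed by the sources.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate normal density f_k(x) of Eq. 12 for one class.

    x and mu are length-p vectors, sigma is the p x p covariance matrix.
    """
    p = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm_const
```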

When $P(C_k \mid x)$ is defined, we can find the class that has the maximum posterior probability for the case/vector x, as given in Eq. 13. The denominator of Eq. 11 can be left out because it is the same for every k, so we can equivalently maximize $f_k(x)\pi_k$ (lines 1-2). We can also equivalently maximize $\log(f_k(x)\pi_k)$ instead of $f_k(x)\pi_k$, because log is a strictly increasing function (lines 2-3). In the last line, the density function $f_k$ from Eq. 12 is inserted into the equation, using the common covariance matrix $\Sigma$ assumed by LDA, and the terms that are the same for every class are dropped.

\[
\begin{aligned}
\hat{G}(x) &= \arg\max_k P(C_k \mid x) \\
           &= \arg\max_k f_k(x)\,\pi_k \\
           &= \arg\max_k \log\big(f_k(x)\,\pi_k\big) \\
           &= \arg\max_k \Big[-\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k) + \log(\pi_k)\Big]
\end{aligned}
\tag{13}
\]

Quadratic discriminant analysis (QDA) is a generalized version of LDA. The only difference from LDA is that the covariances $\Sigma_k$ are not assumed to be equal for all classes, so a covariance matrix needs to be estimated for each class separately. Therefore, we maximize Eq. 14 instead of Eq. 13. QDA creates a quadratic decision surface, whereas LDA creates a linear one. [Hastie et al., 2008]

\[
\hat{G}(x) = \arg\max_k \Big[-\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log(\pi_k)\Big], \qquad k = 1, 2, \ldots, m \tag{14}
\]
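As an illustration, the discriminant scores of Eq. 14 (and of Eq. 13, when the same pooled covariance matrix is passed for every class) could be computed with NumPy roughly as follows; the function and parameter names are assumptions made for this sketch.

```python
import numpy as np

def discriminant_scores(x, means, covariances, priors):
    """Evaluate the objective of Eq. 14 for a single case x and every class k.

    means[k] is the mean vector, covariances[k] the covariance matrix and
    priors[k] the prior probability of class k. Passing the same (pooled)
    covariance matrix for every class corresponds to the LDA rule of Eq. 13.
    """
    scores = []
    for mu, sigma, pi in zip(means, covariances, priors):
        diff = x - mu
        scores.append(-0.5 * np.log(np.linalg.det(sigma))
                      - 0.5 * diff @ np.linalg.inv(sigma) @ diff
                      + np.log(pi))
    return np.array(scores)

# The predicted class is the one with the maximum score:
# predicted = np.argmax(discriminant_scores(x, means, covariances, priors))
```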

3.3.2 Parameter estimation

The parameters of the multivariate Gaussian distributions are usually unknown, so they need to be estimated from the training data. The estimates $\hat{\pi}_k$, $\hat{\mu}_k$ and $\hat{\Sigma}$ are explained next.

$\hat{\pi}_k$ is the maximum likelihood estimate of the prior probability of the class k. It is defined as $N_k$ divided by N, where $N_k$ is the number of cases of the class k and N is the total number of cases in the training data. [Hastie et al., 2008].

• $\hat{\pi}_k = \dfrac{N_k}{N}$

$\hat{\mu}_k$ is the mean vector for the class k. It is defined as the sum of the vectors $x_i$ belonging to the class k divided by the number of cases of the class k. [Hastie et al., 2008].

• $\hat{\mu}_k = \dfrac{\sum_{g_i = k} x_i}{N_k}$

$\hat{\Sigma}$ is the common covariance matrix for all the classes. In the inner sum, for each class k we multiply each case of the class, with the class mean subtracted, by its transpose and add the products together. The outer sum then runs over all the classes, and the result is normalized by the scalar quantity N − K, where N is the total number of cases and K is the total number of classes. [Hastie et al., 2008].

• $\hat{\Sigma} = \dfrac{\sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T}{N - K}$
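The three estimates above could be computed from a training set with a short NumPy routine such as the sketch below; the function name and the array layout (rows as cases) are assumptions made here for illustration.

```python
import numpy as np

def estimate_parameters(X, y):
    """Estimate the priors, class means and pooled covariance matrix described above.

    X is an N x p matrix of training cases, y holds the class label of each case.
    """
    classes = np.unique(y)
    N, p = X.shape
    K = len(classes)

    priors, means = {}, {}
    pooled = np.zeros((p, p))
    for k in classes:
        X_k = X[y == k]
        priors[k] = len(X_k) / N          # pi_hat_k = N_k / N
        means[k] = X_k.mean(axis=0)       # mu_hat_k
        centered = X_k - means[k]
        pooled += centered.T @ centered   # sum of (x_i - mu_hat_k)(x_i - mu_hat_k)^T
    pooled /= (N - K)                     # normalize by N - K
    return priors, means, pooled
```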

3.3.3 Fisher’s LDA

In a two-class problem, we want to find the linear combination of the features (Eq. 15) such that there is a large separation between the projected class means (between-class variance) and a small variance within each class (within-class variance). This is illustrated in Figure 2: there are cases of two classes, blue and red, their corresponding histograms, and the green line onto which the dots/cases are projected. The class separation is better than in Welch's LDA, in which the projection line is drawn between the class means. [Hastie et al., 2008; Li et al., 2014]

\[
y = \theta^T X \tag{15}
\]

Figure 2: Fisher’s LDA [Li et al., 2014]

We define the objective function in Eq. 16, in which $(\hat{m}_1 - \hat{m}_2)^2$ is the between-class variance and $s_1^2 + s_2^2$ is the within-class variance.

\[
J(\theta) = \frac{(\hat{m}_1 - \hat{m}_2)^2}{s_1^2 + s_2^2} \tag{16}
\]

Let us define the mean vector $m_k$ for the class k in x-space and the projected mean $\hat{m}_k$ for the class k in y-space. In addition, we define the variance $s_k^2$ for the class k by utilizing $\hat{m}_k$, so that $s_k^2$ is the sum of the squared differences between the projected samples of the class k and the projected mean $\hat{m}_k$. The difference $\hat{m}_1 - \hat{m}_2$ in y-space gives an estimate of the separability between the classes 1 and 2. [Li et al., 2014]

• $m_k = \dfrac{1}{N_k} \sum_{x_i \in C_k} x_i$, where k = 1, 2.

• $\hat{m}_k = \dfrac{1}{N_k} \sum_{y_i \in C_k} y_i = \dfrac{1}{N_k} \sum_{x_i \in C_k} \theta^T x_i = \theta^T m_k$, where k = 1, 2.

• $s_k^2 = \sum_{x_i \in C_k} \big(\theta^T x_i - \hat{m}_k\big)^2$, where k = 1, 2.

We want to transform Eq. 16 into a more convenient quadratic form that is easier to solve.

The term $(\hat{m}_1 - \hat{m}_2)^2$ can be expressed as $\theta^T S_B \theta$ by inserting $\hat{m}_k = \theta^T m_k$ into the equation. The matrix $S_B$, which is also called the between-class scatter matrix, is of the form $(m_1 - m_2)(m_1 - m_2)^T$. The sum $s_1^2 + s_2^2$ can likewise be rewritten as $\theta^T S_W \theta$. The matrix $S_W$ is known as the within-class scatter matrix and is of the following form:

• $S_W = \sum_{k=1}^{2} \sum_{x_i \in C_k} (x_i - m_k)(x_i - m_k)^T$

Finally, we obtain Eq. 17, where $S_B$ and $S_W$ are d × d matrices and d is the dimension of $\theta$.

\[
J(\theta) = \frac{\theta^T S_B \theta}{\theta^T S_W \theta} \tag{17}
\]

Maximizing the function gives us the projection in which the within-class scatter $S_W$ is as small as possible while the between-class scatter $S_B$ is as large as possible. This means that the classes are clustered tightly together and their class means are as far apart as possible. In the general case, finding the $\theta$ that maximizes J is known as an eigenvalue problem.

For a special case, when $S_W$ is invertible, $\theta$ can be calculated directly as $\theta = S_W^{-1}(m_1 - m_2)$. [Li et al., 2014]
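Under the assumption that $S_W$ is invertible, this special-case solution could be computed for two classes roughly as in the sketch below; the function name and the use of numpy.linalg.solve instead of an explicit matrix inverse are choices made for this sketch only.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher projection direction theta = S_W^{-1}(m_1 - m_2) for two classes.

    X1 and X2 hold the cases of class 1 and class 2, one case per row.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

    # Within-class scatter matrix S_W, summed over both classes.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

    # Solving S_W theta = m1 - m2 avoids forming the inverse explicitly.
    theta = np.linalg.solve(S_W, m1 - m2)
    return theta

# A case x is projected onto the discriminant direction with y = theta @ x (Eq. 15).
```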

3.4 Decision trees

There are two reasons why decision trees are a good choice as a machine learning algorithm. First, a decision tree is based on a binary tree structure, and the computational cost of querying it is low, as is the cost of building the tree. Second, a decision tree can be followed: it is easy to understand how its logic works, compared to hard-to-interpret models.

The most popular decision tree algorithm is ID3 (Iterative Dichotomiser 3), which uses information gain as the attribute selection measure. Information gain is presented in Section 3.4.3 and the ID3 algorithm in Section 3.4.4. Other popular algorithms are ID3's successor C4.5 and Classification and Regression Trees (CART). Decision trees are also used in ensemble learning, where the random forest algorithm uses them as the base classifier. Ensemble learning and random forests are explained in Section 3.5. [Han et al., 2012; Kelleher et al., 2018]

Decision trees work well when the input data is mostly nominal or ordinal. If the input data is numerical, a new branch would be created for each numerical value, which would make the tree enormous. It is possible to convert numerical data into ordinal data by defining intervals, but this can be difficult. Decision trees are sensitive to noise, because the tree is split after each decision, which lowers the number of cases left in each branch. If there are