Discriminant analysis can be used for dimension reduction (which was covered in Section 3.1.2) and for classification. Linear Discriminant Analysis (LDA) is used to find the features, or the linear combination of the features, that best discriminate between two or more groups. One way to represent a two-class classifier is in terms of a discriminant function g(x). The equation g(x) = 0 defines the decision boundary, which is a line that separates the samples of the two classes. A new unseen sample is categorized to the first class if g(x) > 0, and otherwise to the second class. This decision boundary is a hyperplane when g(x) is linear. [Bishop, 2006; Marsland, 2014]

There are two different approaches to LDA by two different researchers. For two-class classification, Fisher's and Welch's LDA are identical; only their approaches differ. Welch's LDA tries to minimize the total probability of misclassification, which depends on properties of the features, such as the normal distribution of the data and the class probabilities. Fisher's approach tries to find a linear combination of the features such that there is a large separation between the projected class means and a small variance within each class cluster. Fisher's LDA also has no normal distribution requirement on the data, and it can be generalized to more than two classes by small changes to the covariance matrices. [Kuhn, 2013]. Welch's LDA is explained in Section 3.3.1, parameter estimation for Welch's LDA in Section 3.3.2, and Fisher's LDA in Section 3.3.3.

3.3.1 Linear and Quadratic Discriminant Analysis

For the basis of classification, we use Bayes' theorem to get Eq. 11. This is the same equation that was used in the context of the naïve Bayes classifier, the only difference being that in the naïve Bayes classifier the features are presumed to be independent of each other. The density function $f_k$ for a class k is given in Eq. 12, where the parameters are: $\mu_k$, the mean vector of dimension p; $\Sigma_k$, the p × p covariance matrix; and $\pi_k$, the prior probability of the class k. According to Bayes' theorem, and presuming that all the parameters are known, $P(C_k \mid x)$ gives the probability that the case/vector x belongs to the class k. The prior probability $\pi_k$ is defined as the number of cases of the class k divided by the total number of cases in the training data. [Hastie et al., 2008]. For example, for a three-class problem, if the prior probabilities are $\pi_1 = 0.8$, $\pi_2 = 0.1$ and $\pi_3 = 0.1$ and the densities are $f_1 = 0.3$, $f_2 = 0.3$ and $f_3 = 0.4$, we get the following probabilities for the case x for the three classes:

\[
P(C_1 \mid x) = \frac{0.8 \cdot 0.3}{(0.8 \cdot 0.3) + (0.1 \cdot 0.3) + (0.1 \cdot 0.4)} \approx 0.77, \qquad P(C_2 \mid x) \approx 0.10, \qquad P(C_3 \mid x) \approx 0.13.
\]
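To make the arithmetic above easy to reproduce, here is a minimal Python sketch of this calculation; the variable names are chosen for illustration only and do not come from the sources.

```python
# Posterior probabilities for the three-class example above (Bayes' rule of Eq. 11).
# The priors and class-conditional densities f_k(x) are the values from the text.
priors = [0.8, 0.1, 0.1]
densities = [0.3, 0.3, 0.4]

evidence = sum(p * f for p, f in zip(priors, densities))            # denominator of Eq. 11
posteriors = [p * f / evidence for p, f in zip(priors, densities)]

print([round(p, 2) for p in posteriors])                            # [0.77, 0.1, 0.13]
```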

\[
P(C_k \mid x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{m} f_l(x)\,\pi_l}, \qquad k = 1, 2, \ldots, m \tag{11}
\]

\[
f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}, \qquad k = 1, 2, \ldots, m \tag{12}
\]
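For concreteness, Eq. 12 can be evaluated with a few lines of NumPy as sketched below; the function name and its arguments are this sketch's own assumptions, not anything prescribed by the sources.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate normal density f_k(x) of Eq. 12 for one class.

    x and mu are length-p vectors, sigma is the p x p covariance matrix.
    """
    p = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm_const
```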

When $P(C_k \mid x)$ is defined, we can find the class that has the maximum posterior probability for the case/vector x, as given in Eq. 13. The denominator of Eq. 11 can be left out because it is the same for every k, so we can equivalently maximize $f_k(x)\pi_k$ (lines 1-2). We can also equivalently maximize $\log(f_k(x)\pi_k)$ instead of $f_k(x)\pi_k$, because log is a strictly increasing function (lines 2-3). In the last line, the density function $f_k$ from Eq. 12 is inserted into the equation, using the common covariance matrix $\Sigma$ assumed by LDA, and the terms that are the same for every class are dropped.

\[
\begin{aligned}
\hat{G}(x) &= \arg\max_k P(C_k \mid x) \\
           &= \arg\max_k f_k(x)\,\pi_k \\
           &= \arg\max_k \log\big(f_k(x)\,\pi_k\big) \\
           &= \arg\max_k \Big[-\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k) + \log(\pi_k)\Big]
\end{aligned}
\tag{13}
\]

Quadratic discriminant analysis (QDA) is a generalized version of LDA. The only difference from LDA is that the covariances $\Sigma_k$ are not assumed to be equal for all classes, so a covariance matrix needs to be estimated for each class separately. Therefore, we maximize Eq. 14 instead of Eq. 13. QDA creates a quadratic decision surface, whereas LDA creates a linear one. [Hastie et al., 2008]

\[
\hat{G}(x) = \arg\max_k \Big[-\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log(\pi_k)\Big], \qquad k = 1, 2, \ldots, m \tag{14}
\]
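As an illustration, the discriminant scores of Eq. 14 (and of Eq. 13, when the same pooled covariance matrix is passed for every class) could be computed with NumPy roughly as follows; the function and parameter names are assumptions made for this sketch.

```python
import numpy as np

def discriminant_scores(x, means, covariances, priors):
    """Evaluate the objective of Eq. 14 for a single case x and every class k.

    means[k] is the mean vector, covariances[k] the covariance matrix and
    priors[k] the prior probability of class k. Passing the same (pooled)
    covariance matrix for every class corresponds to the LDA rule of Eq. 13.
    """
    scores = []
    for mu, sigma, pi in zip(means, covariances, priors):
        diff = x - mu
        scores.append(-0.5 * np.log(np.linalg.det(sigma))
                      - 0.5 * diff @ np.linalg.inv(sigma) @ diff
                      + np.log(pi))
    return np.array(scores)

# The predicted class is the one with the maximum score:
# predicted = np.argmax(discriminant_scores(x, means, covariances, priors))
```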

3.3.2 Parameter estimation

The parameters of the multivariate Gaussian distributions are usually unknown, so they need to be estimated from the training data. The estimates $\hat{\pi}_k$, $\hat{\mu}_k$ and $\hat{\Sigma}$ are explained next.

$\hat{\pi}_k$ is the maximum likelihood estimate of the prior probability of the class k. It is defined as $N_k$ divided by N, where $N_k$ is the number of cases of the class k and N is the total number of cases in the training data. [Hastie et al., 2008].

• $\hat{\pi}_k = \dfrac{N_k}{N}$

$\hat{\mu}_k$ is the mean vector for the class k. It is defined as the sum of the vectors $x_i$ belonging to the class k divided by the number of cases of the class k. [Hastie et al., 2008].

• $\hat{\mu}_k = \dfrac{\sum_{g_i = k} x_i}{N_k}$

$\hat{\Sigma}$ is the common covariance matrix for all the classes. In the inner sum, for each class k we multiply each case of the class, with the class mean subtracted, by its transpose and add the products together. The outer sum then runs over all the classes, and the result is normalized by the scalar quantity N − K, where N is the total number of cases and K is the total number of classes. [Hastie et al., 2008].

• $\hat{\Sigma} = \dfrac{\sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T}{N - K}$
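The three estimates above could be computed from a training set with a short NumPy routine such as the sketch below; the function name and the array layout (rows as cases) are assumptions made here for illustration.

```python
import numpy as np

def estimate_parameters(X, y):
    """Estimate the priors, class means and pooled covariance matrix described above.

    X is an N x p matrix of training cases, y holds the class label of each case.
    """
    classes = np.unique(y)
    N, p = X.shape
    K = len(classes)

    priors, means = {}, {}
    pooled = np.zeros((p, p))
    for k in classes:
        X_k = X[y == k]
        priors[k] = len(X_k) / N          # pi_hat_k = N_k / N
        means[k] = X_k.mean(axis=0)       # mu_hat_k
        centered = X_k - means[k]
        pooled += centered.T @ centered   # sum of (x_i - mu_hat_k)(x_i - mu_hat_k)^T
    pooled /= (N - K)                     # normalize by N - K
    return priors, means, pooled
```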

3.3.3 Fisher’s LDA

In a two-class problem, we want to find the linear combination of the features (Eq. 15) such that there is a large separation between the projected class means (between-class variance) and a small variance within each class (within-class variance). This is illustrated in Figure 2: there are cases of two classes, blue and red, their corresponding histograms, and the green line onto which the dots/cases are projected. The class separation is better than in Welch's LDA, in which the projection line is drawn between the class means. [Hastie et al., 2008; Li et al., 2014]

\[
y = \theta^T X \tag{15}
\]

Figure 2: Fisher’s LDA [Li et al., 2014]

We define the objective function in Eq. 16, in which $(\hat{m}_1 - \hat{m}_2)^2$ is the between-class variance and $s_1^2 + s_2^2$ is the within-class variance.

\[
J(\theta) = \frac{(\hat{m}_1 - \hat{m}_2)^2}{s_1^2 + s_2^2} \tag{16}
\]

Let us define the mean vector $m_k$ for the class k in x-space and the projected mean $\hat{m}_k$ for the class k in y-space. In addition, we define the variance $s_k^2$ for the class k by utilizing $\hat{m}_k$, so that $s_k^2$ is the sum of the squared differences between the projected samples of the class k and the projected mean $\hat{m}_k$. The difference $\hat{m}_1 - \hat{m}_2$ in y-space gives an estimate of the separability between the classes 1 and 2. [Li et al., 2014]

• $m_k = \dfrac{1}{N_k} \sum_{x_i \in C_k} x_i$, where k = 1, 2.

• $\hat{m}_k = \dfrac{1}{N_k} \sum_{y_i \in C_k} y_i = \dfrac{1}{N_k} \sum_{x_i \in C_k} \theta^T x_i = \theta^T m_k$, where k = 1, 2.

• $s_k^2 = \sum_{x_i \in C_k} \big(\theta^T x_i - \hat{m}_k\big)^2$, where k = 1, 2.

We want to transform Eq. 16 into a more convenient quadratic form that is easier to solve.

The term $(\hat{m}_1 - \hat{m}_2)^2$ can be expressed as $\theta^T S_B \theta$ by inserting $\hat{m}_k = \theta^T m_k$ into the equation. The matrix $S_B$, which is also called the between-class scatter matrix, is of the form $(m_1 - m_2)(m_1 - m_2)^T$. The sum $s_1^2 + s_2^2$ can likewise be rewritten as $\theta^T S_W \theta$. The matrix $S_W$ is known as the within-class scatter matrix and is of the following form:

• $S_W = \sum_{k=1}^{2} \sum_{x_i \in C_k} (x_i - m_k)(x_i - m_k)^T$

Finally, we obtain Eq. 17, where $S_B$ and $S_W$ are d × d matrices and d is the dimension of $\theta$.

\[
J(\theta) = \frac{\theta^T S_B \theta}{\theta^T S_W \theta} \tag{17}
\]

Maximizing the function gives us the projection in which the within-class scatter $S_W$ is as small as possible while the between-class scatter $S_B$ is as large as possible. This means that the classes are clustered tightly together and their class means are as far apart as possible. In the general case, finding the $\theta$ that maximizes J is known as an eigenvalue problem.

For a special case, when $S_W$ is invertible, $\theta$ can be calculated directly as $\theta = S_W^{-1}(m_1 - m_2)$. [Li et al., 2014]
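Under the assumption that $S_W$ is invertible, this special-case solution could be computed for two classes roughly as in the sketch below; the function name and the use of numpy.linalg.solve instead of an explicit matrix inverse are choices made for this sketch only.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher projection direction theta = S_W^{-1}(m_1 - m_2) for two classes.

    X1 and X2 hold the cases of class 1 and class 2, one case per row.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

    # Within-class scatter matrix S_W, summed over both classes.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

    # Solving S_W theta = m1 - m2 avoids forming the inverse explicitly.
    theta = np.linalg.solve(S_W, m1 - m2)
    return theta

# A case x is projected onto the discriminant direction with y = theta @ x (Eq. 15).
```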

3.4 Decision trees

There are two reasons why decision trees are a good choice as a machine learning algorithm. First, a decision tree is based on a binary tree structure, and the computational cost of querying it is low, as is the cost of building the tree. Second, a decision tree can be followed: it is easy to understand how its logic works, compared to hard-to-interpret models.

The most popular decision tree algorithm is ID3 (Iterative Dichotomiser 3), which uses information gain as the attribute selection measure. Information gain is presented in Section 3.4.3 and the ID3 algorithm in Section 3.4.4. Other popular algorithms are ID3's successor C4.5 and Classification and Regression Trees (CART). Decision trees are also used in ensemble learning, where the random forest algorithm uses them as the base classifier. Ensemble learning and random forests are explained in Section 3.5. [Han et al., 2012; Kelleher et al., 2018]

Decision trees work well when the input data is mostly nominal or ordinal. If the input data is numerical, a new branch would be created for each numerical value, which would make the tree enormous. It is possible to convert numerical data into ordinal data by defining intervals, but this can be difficult. Decision trees are sensitive to noise, because the tree is split after each decision, which lowers the number of cases left in each branch. If there are