
3. MACHINE LEARNING

3.2 Methods and metrics

When starting the training of a machine learning model, a dataset is needed. Datasets in this context are collections of examples, each example containing its own features [4].

The examples in a dataset are usually stored in a structure similar to a matrix or a vector [19]. In the case of a matrix, each row contains the features of one example, and there is one row for each example. With supervised learning models, each row also contains the desired target value for that example. In classification datasets the class names are transformed into corresponding integers so that the data can be processed more easily [19].
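As a small illustration of this structure (assuming Python with NumPy and scikit-learn, which the sources do not prescribe), a toy classification dataset could be stored like this:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Each row is one example, each column one feature.
X = np.array([
    [5.1, 3.5],   # features of example 1
    [4.9, 3.0],   # features of example 2
    [6.2, 2.9],   # features of example 3
])

# Target values for a supervised classification task.
y_names = ["apple", "apple", "orange"]

# Class names are mapped to integers for easier processing.
encoder = LabelEncoder()
y = encoder.fit_transform(y_names)   # e.g. array([0, 0, 1])
```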

Simple machine learning models that are trained on simple examples with only a few variables can possibly be taught with a dataset of hundreds or even just dozens of examples [4]. Training on complicated and multidimensional examples, such as images, might require tens of thousands of examples in a dataset to train a model with decent accuracy [21]. We will be using different types of datasets to evaluate different automated machine learning models, so we will have data on datasets of different volumes. The type, size and complexity of a dataset are major factors in choosing which machine learning method should be used [4].

To explain classification further, we will introduce a simple linear classification function and give a few examples of such algorithms [3]. Those algorithms will be introduced on a very high level, because although they are important to understand, they are not at the forefront of this study.

A large number of classification algorithms can be expressed in terms of a linear function that assigns a score to each possible category k by combining the feature vector of an example with a vector of weights, using a dot product [18]. The predicted category is the one with the highest score [22]. This kind of score function is known as a linear predictor function and has the following general form:

$\mathrm{score}(X_i, k) = \beta_k \cdot X_i$ (1)

where $X_i$ is the feature vector for instance i, $\beta_k$ is the vector of weights corresponding to category k, and $\mathrm{score}(X_i, k)$ is the score associated with assigning instance i to category k [22]. In discrete choice theory, where instances represent people and categories signify choices, the score is considered the utility associated with person i selecting category k [23].
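A minimal sketch of equation (1) in Python (the weight and feature values below are made up for illustration):

```python
import numpy as np

# One weight vector (beta_k) per category k; rows are categories.
beta = np.array([
    [0.5, -1.0,  2.0],   # weights for category 0
    [1.5,  0.3, -0.7],   # weights for category 1
])

x_i = np.array([1.0, 2.0, 0.5])  # feature vector of instance i

# score(X_i, k) = beta_k . X_i, computed for every category at once.
scores = beta @ x_i

# The predicted category is the one with the highest score.
predicted = np.argmax(scores)
```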

One of the ways we will be measuring performance in this paper is the receiver operating characteristic curve, or ROC curve [24]. It is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at numerous threshold settings. The true positive rate is also known as sensitivity, recall or probability of detection in machine learning [25]. The false positive rate is also known as the probability of false alarm and can be calculated as (1 − specificity) [25, 26]. A confusion matrix which demonstrates the relation between TPR and FPR is shown in Table 1.

Table 1 A confusion matrix. The target responses are on the left and the model’s predictions on the top.

              Positive          Negative
  Positive    True Positive     False Negative
  Negative    False Positive    True Negative

When using normalized units, the area under the curve (often referred to as simply the AUC) is equivalent to the probability that a classifier will rank a randomly selected positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative', which is what we want to happen) [25].
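This probabilistic interpretation can be verified directly on a small example: the fraction of positive/negative score pairs in which the positive instance outranks the negative one equals the AUC reported by, for example, scikit-learn (the labels and scores below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0])
scores = np.array([0.9, 0.7, 0.4, 0.6, 0.2])

# Probability that a random positive outranks a random negative
# (ties counted as one half).
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairs = pos[:, None] - neg[None, :]
pairwise_auc = (pairs > 0).mean() + 0.5 * (pairs == 0).mean()

assert np.isclose(pairwise_auc, roc_auc_score(y_true, scores))
```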

Figure 2 An example of a ROC curve with the area under it shaded grey.

The confusion matrix, shown in Table 1, is a table that can be used to visualize the performance of a machine learning algorithm. A confusion matrix is typically used for supervised learning such as classification [25]. The matrix can be presented so that each row represents the examples in a predicted class while each column represents the instances in an actual class, or the other way around, as in Table 1. The confusion matrix can thus be used to calculate the true positive rate and the false positive rate, plot the receiver operating characteristic curve, and obtain the area under that curve, which will be used as one of the performance metrics in our experiments [26]. An example of such a curve can be seen in Figure 2.
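As an illustration of how these metrics could be computed with, for example, scikit-learn (the labels and predicted probabilities below are invented), one possible sketch is:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Illustrative true labels and predicted probabilities.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_proba = np.array([0.8, 0.3, 0.6, 0.9, 0.4, 0.1, 0.2, 0.7])

# Confusion matrix at one fixed threshold (0.5).
y_pred = (y_proba >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # true positive rate (sensitivity, recall)
fpr = fp / (fp + tn)   # false positive rate (1 - specificity)

# The ROC curve sweeps over all thresholds; AUC is the area under it.
fprs, tprs, thresholds = roc_curve(y_true, y_proba)
area = auc(fprs, tprs)
```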

Loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems [27]. One of these loss functions is logistic loss, more commonly known as log-loss [28]. Raw log-loss values are hard to interpret, but log-loss is still a good metric for comparing models: for any given problem, a lower log-loss value means better predictions [27]. The log-loss function is defined mathematically as:

$L_{\log}(y, p) = -(y \log(p) + (1 - y) \log(1 - p))$ (2)

where a single sample's true label is $y \in \{0, 1\}$ and the probability estimate is denoted $p = \Pr(y = 1)$ [27]. Log-loss heavily punishes classifiers that are confident about an incorrect classification. For example, if for a particular observation the classifier assigns a very small probability to the correct class, then the corresponding contribution to the log-loss will be very large [28]. Log-loss will be used in this research to evaluate the accuracy of multiclass classification problems [27].
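Equation (2) translates directly into code. The following sketch (assuming NumPy and scikit-learn; the clipping constant is our own addition to avoid log(0)) demonstrates how confident wrong predictions are punished:

```python
import numpy as np
from sklearn.metrics import log_loss

def logloss(y, p, eps=1e-15):
    """Equation (2) for a single sample; p is clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident but wrong prediction is punished heavily ...
print(logloss(1, 0.01))   # ~4.61
# ... while a confident correct one costs almost nothing.
print(logloss(1, 0.99))   # ~0.01

# scikit-learn's log_loss averages this quantity over all samples.
print(log_loss([1, 0], [0.9, 0.2]))
```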

One more topic that can assist in the understanding of classification is a concrete algorithm. We will use the k-nearest neighbors algorithm as an example, because we think it clearly shows what classification is all about while still being a widely used algorithm [20]. The k-nearest neighbors algorithm (k-NN) is a non-parametric machine learning method where the input consists of the k closest training instances and the output is a class membership [21]. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its single nearest neighbor [20, 21].

In k-nearest neighbors the training examples are vectors in a multidimensional feature space, each with a class label [20]. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification stage, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point [21].

A drawback of the basic "majority voting" classification occurs when the class distribution is skewed [21]. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the k nearest neighbors due to their large number [20]. This can be overcome by weighting the neighbors, for example so that the closest neighbors affect the classification more than the furthest ones [20, 21]. An example of how the k-nearest neighbors classification works can be seen in Figure 3.
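The following is a minimal sketch of the unweighted majority-vote k-NN described above, assuming Euclidean distance and NumPy (the tiny training set is invented for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote of its k nearest
    training examples, using Euclidean distance."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]      # indices of the k closest
    votes = Counter(y_train[nearest])        # count class labels
    return votes.most_common(1)[0][0]        # most common class wins

# Tiny illustrative training set: two features, two classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```

The distance-weighted variant mentioned above would replace the raw vote counts with weights such as the inverse of each neighbor's distance.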

Figure 3 Example of a k-NN classification where the green circle needs to be classified as a red triangle or as a blue square. If k = 3, it will be classified as a red triangle, but if k = 5, it would be classified as a blue square.

Next, we will give an example of a simple classification task to show what type of problems can be solved using these methods. A typical classification problem that we chose to show here is distinguishing between different fruits according to given data. Usually this process roughly follows the following pattern: get the data to be used; analyze the data and do some data engineering around it, for example deciding what to do with missing values; extract features from the data and clean it so that there are no unnecessary columns; test out different models, design them and analyze their results; fine-tune the ones that perform well; and finally read the metrics of the results and evaluate them. [29]

In our example we will be using a dataset of apples, mandarins, oranges and lemons that was created at the University of Edinburgh and later altered by the University of Michigan. A few lines of the data can be seen in Table 2. Each row of the dataset represents one piece of fruit, as described by the several features in the table’s columns. We have 59 pieces of fruit and 7 features in the dataset.

Table 2 Data used in the classification example

The dataset itself is quite balanced, except for the mandarins, as we have 19 apples, 16 lemons, 19 oranges and 5 mandarins. Going through the data, there are some obvious correlations, such as width and mass, so we will use those as our features to make the prediction. We can also see that the values have to be scaled; that is something we will not go through in more detail. There are no empty values or much else to take into account regarding this particular dataset.

The data will then be split into a training set and a test set, with 80% in the training set and 20% in the test set. After going through other algorithms such as logistic regression and a support vector machine, we come to the conclusion that the k-nearest neighbors algorithm gives us the best results with k = 5. It gives us 100% accuracy on the test set as measured with a confusion matrix, so there is a slight chance of overfitting, although that is not a common problem with this algorithm. The decision boundary for the k-nearest neighbors classifier can be seen in Figure 4.
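A sketch of this kind of pipeline with scikit-learn could look as follows; the file name and column names are assumptions based on the description above, not necessarily the exact ones used in our experiments:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical file and column names; adapt to the actual dataset.
fruits = pd.read_table("fruit_data_with_colors.txt")
X = fruits[["mass", "width"]]    # the correlated features chosen above
y = fruits["fruit_label"]

# 80/20 train/test split; the scaler is fitted on the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
scaler = MinMaxScaler().fit(X_train)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)
print(knn.score(scaler.transform(X_test), y_test))
```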

Figure 4 Classification space of the classification example