
The combined results for the classification of the class accident category are given in Table 11, sorted from the highest total accuracy to the lowest. Accuracy alone does not tell everything about the performance of a classifier; therefore, confusion matrices were given for the best-performing methods in the previous sections. A minimal sketch of how such a cross-validated comparison can be run in MATLAB is given after the table.

Method                                   Accuracy
Random Forest (100 trees) LOO            0.77
Random Forest (100 trees) K=5            0.76
Random Forest (30 trees) LOO             0.74
Random Forest (30 trees) K=5             0.74
Random Forest (10 trees) LOO             0.73
Random Forest (10 trees) K=5             0.72
Discriminant Analysis, linear LOO        0.71
K-NN Spearman: K=9                       0.71
K-NN Cosine: K=7                         0.71
K-NN Correlation: K=7                    0.69
Discriminant Analysis, linear K=5        0.69
K-NN Jaccard: K=7                        0.69
K-NN Chi-squared distance: K=7           0.66
K-NN Mahalanobis: K=25                   0.66
K-NN Hamming: K=7                        0.65
K-NN Manhattan: K=25                     0.63
K-NN Euclidean: K=5                      0.63
K-NN Minkowski distance p=3: K=5         0.63
K-NN Minkowski distance p=35: K=5        0.62
Discriminant Analysis, quad LOO          0.56
Bayes (kernel) K=5                       0.51
Discriminant Analysis, quad K=5          0.50
Bayes (kernel) LOO                       0.49
Bayes (kernel) K=10                      0.48
K-NN Chebychev: K=3                      0.46

Table 11: The combined results
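
The accuracies in Table 11 were computed with MATLAB's Statistics and Machine Learning Toolbox. The sketch below shows how such a cross-validated comparison could be run; the variables X (feature matrix) and y (class labels) are placeholders, only two of the tested configurations are shown, and K=9 is assumed to be the neighbour count for the k-NN row.

% Minimal sketch of the cross-validated comparison (X = feature matrix,
% y = class labels; both are placeholder variable names).

% Random forest = bagged decision trees, 100 learners
rfModel = fitcensemble(X, y, 'Method', 'Bag', 'NumLearningCycles', 100);

% 5-fold cross-validation: accuracy = 1 - misclassification rate
accRF5   = 1 - kfoldLoss(crossval(rfModel, 'KFold', 5));

% Leave-one-out cross-validation
accRFloo = 1 - kfoldLoss(crossval(rfModel, 'Leaveout', 'on'));

% k-NN with the Spearman distance (K = 9 assumed to be the neighbour count)
knnModel = fitcknn(X, y, 'NumNeighbors', 9, 'Distance', 'spearman');
accKnn5  = 1 - kfoldLoss(crossval(knnModel, 'KFold', 5));

fprintf('Random Forest (100 trees) K=5: %.2f\n', accRF5);
fprintf('Random Forest (100 trees) LOO: %.2f\n', accRFloo);
fprintf('K-NN Spearman (K=9), 5-fold:   %.2f\n', accKnn5);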

8 Results, classification of class compensation decision

The class compensation decision (as given in Table 5) is very imbalanced for the classification task. The confusion matrix for the classification with random forest is given in Figure 18. The predicted accuracy for the class compensation decision is 88 %, but for such an imbalanced dataset accuracy alone is not a reliable metric. There are 272 cases of class "0" and 36 cases of class "1", giving a ratio of nearly 1:8 between the classes. Even though a sensitivity of 99.3 % was achieved for class "0", the sensitivity for class "1" is only 5.6 %; in practice, almost all cases are classified as class "0". The k-NN classifier was also tested in this classification task, with similar results.

Figure 18: Confusion matrix for the random forest classifier (100 trees)
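
The per-class sensitivities quoted above are the row-wise recalls of the confusion matrix. A short MATLAB sketch of this computation is given below; yTrue and yPred are placeholder names for the true and predicted labels.

% Per-class sensitivity (recall) from true and predicted labels.
C = confusionmat(yTrue, yPred);        % rows = true class, columns = predicted class
sensitivity = diag(C) ./ sum(C, 2);    % correctly classified / all cases of that class
fprintf('Sensitivity, class 0: %.1f %%\n', 100 * sensitivity(1));
fprintf('Sensitivity, class 1: %.1f %%\n', 100 * sensitivity(2));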

The SMOTE algorithm was used to generate synthetic minority-class cases to balance the dataset. The classification was then repeated with the random forest classifier (100 trees), which had performed best in the previous classification problem. This yielded better results. A sketch of the SMOTE idea is given below.
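
MATLAB does not provide SMOTE, so it was implemented separately (see Section 9). The sketch below illustrates the core idea, interpolating between a minority-class case and one of its k nearest minority-class neighbours; the function name, its arguments and the use of knnsearch are illustrative assumptions rather than the exact project code.

% Illustrative SMOTE sketch. Xmin contains only the minority-class rows,
% N is the number of synthetic samples generated per original case, and
% k is the number of nearest neighbours considered.
function Xsyn = smoteSketch(Xmin, N, k)
    idx = knnsearch(Xmin, Xmin, 'K', k + 1);   % k+1 because each point is its own nearest neighbour
    idx = idx(:, 2:end);                       % drop the self-matches
    [n, d] = size(Xmin);
    Xsyn = zeros(n * N, d);
    row = 1;
    for i = 1:n
        for j = 1:N
            nb  = Xmin(idx(i, randi(k)), :);   % pick one of the k neighbours at random
            gap = rand;                        % interpolation factor in (0, 1)
            Xsyn(row, :) = Xmin(i, :) + gap * (nb - Xmin(i, :));
            row = row + 1;
        end
    end
end

The synthetic rows are then appended to the minority class in the training data before the classifier is fitted again.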

Figure 19 illustrates the improved predictive performance for the minority class with random forest and SMOTE: after oversampling there is an equal number of 272 cases for both classes "0" and "1".

The sensitivity is 87.9 % for the class “0” and 88.6 % for the class “1”.

Figure 19: Confusion matrix for the random forest classifier (100 trees) with the SMOTE algorithm

9 Conclusion

A number of machine learning algorithms were covered in this work, but there are many others. For example, unsupervised learning, in which the dataset does not have class labels, was not discussed; clustering analysis and PCA are examples of this approach. Another direction is deep learning, which has applications for example in speech recognition and computer vision.

The beginning of the project was difficult because I did not know about MATLAB's extensive machine learning library and started to write the algorithms from scratch. MATLAB's ML library offers implementations of most of the popular ML algorithms, with good documentation and example code. Only the SMOTE algorithm needed to be implemented separately in MATLAB.

Despite the extra work, writing the algorithms myself was useful for gaining a deeper understanding of them.

There are also other options for the programming environment than the chosen MATLAB. Python, for example, is the most popular programming language used in classification tasks, and its scikit-learn machine learning library includes comprehensive support for many classification and clustering algorithms. [python.org, 2021]

This kind of classification based on text analysis has not been done in the field of neurology and psychiatry before. The first tests with k-NN did not give flattering results, but the results improved after experimenting with other ML algorithms and learning more about data preprocessing, dimensionality reduction and SMOTE. In the end, random forest provided better results than the other classifiers in the first classification task.

Neither the random forest nor the k-NN classifier performed well in the binary classification task because the dataset was very imbalanced. SMOTE improved the results for both classification tasks.

The dataset was quite limited in size (328 cases). In future work the classification could be done with a larger dataset. The categorization of the phrases into groups is also an essential part of the classification: the 35 phrase groups were created in one particular way for this work, but the division could have been done differently, and it would have been interesting to see how changing the groups and the assignment of phrases to them would have affected the results. The possibilities of deep learning in the text analysis part could also be explored.

10 References

[Alpaydin, 2016] Alpaydin, Ethem. 2016. Machine Learning: The New AI. The MIT Press

[Barber, 2012] Barber, David. 2012. Bayesian Reasoning and Machine Learning. Cambridge University Press

[Bishop, 2006] Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer Science+Business Media

[Breiman, 2001] Breiman, Leo. 2001. Random Forests. Machine Learning, 45, 5–32

[Chawla et al., 2002] Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 (2002) 321–357

[Cover et al., 2006] Cover, Thomas M., Joy A. Thomas. 2006. Elements of Information Theory. Second Edition. John Wiley & Sons

[Flach, 2012] Flach, Peter. 2012. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, New York

[Han et al., 2012] Han, Jiawei, Micheline Kamber, Jian Pei. 2012. Data Mining: Concepts and Techniques. Third Edition. Morgan Kaufmann Publishers

[Hastie et al., 2008] Hastie, Trevor, Robert Tibshirani, Jerome Friedman. 2008. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. Springer Series in Statistics

[Kampman, 2021] Kampman, Olli. 2021. Psykiatristen potilasvahinkojen luokittelu ja yleisyys Suomessa [Classification and prevalence of psychiatric patient injuries in Finland]. Research plan. Faculty of Medicine and Health Technology, 27.11.2020

[Kelleher et al., 2018] Kelleher, John D., Brendan Tierney. 2018. Data Science. The MIT Press, Cambridge, Massachusetts; London, England

[Kuhn and Johnson, 2013] Kuhn, Max, Kjell Johnson. 2013. Applied Predictive Modeling. Springer, New York

[Li et al., 2014] Li, Cheng, Bingyu Wang. 2014. Fisher Linear Discriminant Analysis. https://www.semanticscholar.org/paper/Fisher-Linear-Discriminant-Analysis-Li/1ab8ea71fbef3b55b69e142897fadf43b3269463 (Read on 30.9.2021)

[Marsland, 2009] Marsland, Stephen. 2009. Machine Learning: An Algorithmic Perspective. Chapman & Hall/CRC

[Marsland, 2014] Marsland, Stephen. 2014. Machine Learning: An Algorithmic Perspective. Second Edition. Chapman & Hall/CRC

[python.org, 2021] Python Programming Language Documentation. https://www.python.org/ (Read on 30.10.2021)

[Seni and Elder, 2010] Seni, Giovanni, John Elder. 2010. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool

[Urdan, 2010] Urdan, Timothy C. 2010. Statistics in Plain English. Third Edition. Routledge, Taylor & Francis Group, LLC

[Witten et al., 2011] Witten, Ian H., Eibe Frank, Mark A. Hall. 2011. Data Mining: Practical Machine Learning Tools and Techniques. Third Edition. Morgan Kaufmann