
PUBLICATIONS OF

THE UNIVERSITY OF EASTERN FINLAND Dissertations in Forestry and Natural Sciences

ISBN 978-952-61-3454-3

Dissertations in Forestry and Natural Sciences

DISSERTATIONS | JIAWEI YANG | OUTLIER DETECTION TECHNIQUES | No 383

JIAWEI YANG

OUTLIER DETECTION TECHNIQUES

PUBLICATIONS OF

THE UNIVERSITY OF EASTERN FINLAND

Outlier detection serves as a cornerstone of data science. In this thesis, we study a weakness shared by the existing outlier detectors developed over the decades and propose solutions that improve them.

JIAWEI YANG


OUTLIER DETECTION TECHNIQUES


Jiawei Yang

OUTLIER DETECTION TECHNIQUES

Publications of the University of Eastern Finland Dissertations in Forestry and Natural Sciences

No 383

University of Eastern Finland Joensuu

2020

Academic dissertation

To be presented by permission of the Faculty of Science and Forestry for public examination at the University of Eastern Finland, Joensuu, on September 10, 2020, at 12 o'clock noon.


Grano Oy Jyväskylä, 2020

Editors: Pertti Pasanen, Matti Vornanen, Jukka Tuomela, Matti Tedre

Distribution: University of Eastern Finland / Sales of publications www.uef.fi/kirjasto

ISBN: 978-952-61-3454-3 (Print)
ISBN: 978-952-61-3455-0 (PDF)
ISSN-L: 1798-5668
ISSN: 1798-5668 (Print)
ISSN: 1798-5676 (PDF)


Author's address:   Jiawei Yang
                    University of Eastern Finland, School of Computing
                    P.O. Box 111, 80101 JOENSUU, FINLAND
                    email: jiaweiyang@ieee.org

Supervisors:        Professor Pasi Fränti, Ph.D.
                    University of Eastern Finland, School of Computing
                    P.O. Box 111, 80101 JOENSUU, FINLAND
                    email: franti@cs.uef.fi

                    Professor Susanto Rahardja, Ph.D.
                    Northwestern Polytechnical University, School of Marine Science and Technology
                    127 West Youyi Road, XI'AN, Shaanxi, 710072, CHINA
                    email: susantorahardja@ieee.org

Reviewers:          Professor Tommi Kärkkäinen, Ph.D.
                    University of Jyväskylä, Faculty of Information Technology
                    P.O. Box 35 (Agora), FIN-40014, FINLAND
                    email: tommi.karkkainen@jyu.fi

                    Professor Arthur Zimek, Ph.D.
                    University of Southern Denmark, Department of Mathematics and Computer Science
                    Campusvej 55, DK-5230 Odense M, DENMARK
                    email: zimek@imada.sdu.dk

Opponent:           Associate Professor Giacomo Boracchi, Ph.D.
                    Politecnico di Milano, Dipartimento di Elettronica e Informazione
                    Via G. Ponzio, 34/5, 20133 Milano, ITALY
                    email: giacomo.boracchi@polimi.it


Yang, Jiawei

Outlier detection techniques

Joensuu: University of Eastern Finland, 2020
Publications of the University of Eastern Finland,
Dissertations in Forestry and Natural Sciences, No. 383
ISBN: 978-952-61-3454-3 (Print)
ISBN: 978-952-61-3455-0 (PDF)
ISSN-L: 1798-5668
ISSN: 1798-5668 (Print)
ISSN: 1798-5676 (PDF)

ABSTRACT

Over the decades, outlier detection has been a fundamental problem in data mining and machine learning. The typical outlier detection process contains two main steps: computing outlier scores for objects and thresholding the outlier scores to decide the outliers.

We first review how typical outlier detectors produce outlier scores for objects. To detect outliers when they are present in large numbers, we propose the mean-shift outlier detector (MOD) and its refined version MOD+. Our simulated experiments show that they outperform all the state-of-the-art outlier detectors we tested.

Second, we discuss outlier score post-processing techniques. We hypothesize that similar objects in the feature space should have similar outlier scores. Based on this hypothesis, we propose neighborhood averaging, which makes the outlier scores of neighboring objects more similar to each other than the original scores. Our experiments show it can improve the accuracy of all the detectors tested by 9% on average.

Third, we introduce outlier score thresholding techniques. We address the problem that the presence of outliers biases statistics-based threshold calculations. To solve this problem, we propose a two-stage thresholding method (2T). The results show that 2T improves all the thresholding techniques tested when the number of outliers is large.

Finally, we show two cases of outlier detection in time-series applications, GPS trajectories and heart-rate variability, wherein typical outlier detectors are not applicable directly. We demonstrate how to apply typical outlier detectors to GPS trajectories indirectly. For heart-rate variability, entropy methods are normally employed as outlier detectors, with the entropy values used as outlier scores. We propose Attention entropy. Our experiments show that Attention entropy outperforms all the state-of-the-art entropy methods tested.

Universal Decimal Classification: 004.62, 311.15, 519.222, 519.237.8

Library of Congress Subject Headings: Data mining; Data editing; Outliers (Statistics); Deviation (Mathematics); Filters (Mathematics); Standard deviations; Classification; Time-series analysis; Global Positioning System; Heart beat; Entropy (Information theory)

Yleinen suomalainen ontologia: tiedonhallinta; tiedonlouhinta; data; poikkeavuus; anomaliat; keskihajonta; aikasarjat; satelliittipaikannus; syke; entropia.


ACKNOWLEDGEMENTS

It is so logical to set the ACKNOWLEDGEMENTS before the thesis content, because when I start to write the thesis and recall what I have done in research, the first feeling that comes is that without my supervisor Professor Pasi Fränti's continuous help, support, guidance, and the freedom he gave me to work on a variety of problems, and without my co-supervisor Professor Susanto Rahardja's encouraging stories, support, leadership, and fruitful conversations, I could not have overcome the challenges in research smoothly.

First and foremost, I am tremendously grateful to my supervisor, Professor Pasi Fränti, for all his advice and support. He offered me the chance to do a Ph.D., which fulfilled a dream I have had since childhood: to contribute to the knowledge pool of the community. My soul is free from now on. He has been a never-ending source of wisdom and insight. I continuously learn from his commitment to research, his fortitude in exploring the unknown, his taste in research problems, and his professionalism. I could not have wished for a better advisor. I am very grateful to him for convincing Professor Susanto Rahardja to make a second attempt at offering me the chance to visit his university in March 2018, which was the beginning of my research path.

I am very fortunate to have had the opportunity to listen to Professor Susanto Rahardja's story of how he set and reached challenging research goals for himself when he was doing his Ph.D. I want to thank him for inspiring me to pursue a career in research. I always remember his constructive and considerate suggestions for the decisions in my life.

He has taught me how to be professional, which is vital everywhere, and I have been greatly influenced by his philosophy. I am very grateful to him for making a second attempt to offer me the chance to visit his university in March 2018, which was the beginning of my research path.

I thank Radu Mariescu-Istodor for his support, help, and assistance in courses, discussions, and co-authored papers. I thank Sami Sieranoja for the pleasant cooperation. I appreciate the collaborators and people from the machine learning group, Masoud Fatemi and Juho Kettunen, for the research discussions and the jokes we made about each other. I want to thank Oili Kohonen so much for her help, support, and lively conversations.

I owe many thanks to my friends Xiangcai Zhang, Qiaoxi Yang, Bin Liu, Qihao Zou, Lingyun Chen, Jian Zhang, Wenzheng Hu, Yu Chen, and Zhe Yang. I thank them for their support and company over the years. During my Ph.D. studies, I felt there was always an eye watching me and encouraging me to move forward no matter how dark the nights were. That eye comes exactly from them, built on our common characteristics, common dreams, and common memories.

I want to thank Hongxin Zhao and Zhe Yang. Decades have passed, yet the days and spirit of the Three Musketeers have always shone the brightest. I also want to thank Guangxiang Yang, the father of Zhe Yang, who was also my spiritual father in my childhood and who has a kind of spirit that I seem to share; that spirit is thereby strengthened. It profoundly influences and encourages me in every corner of my life, including in how to make breakthroughs in research.

I want to thank Sirong Li for her support and help throughout my middle school, high school, bachelor's, master's, and Ph.D. studies. It means so much to me, and I will keep all of it in my heart.

I want to thank Huajing Yang for the love, support, and company through thick and thin. I cannot imagine what Ph.D. life would have been without you. It is so good to have met you.

I owe many thanks to my parents, Ying Yang and Long Yang, who offered me a pleasant environment and a family full of love in which to grow up. They have always given me the freedom to try what I want and pursue what I like. They are always watching my back as my most solid cornerstone.

Finally, this thesis is dedicated to the ones who love or loved me, and the ones whom I love or loved, for all the years of love and support, and to my hometown Shidian (施甸), the land that raised me, for the happiest times spent with the people I met in childhood.

Joensuu, 15th October 2019 Thesis Author

Oh, I almost forgot! I think I must thank the Coca-Cola Company too, as Coke with ice is indeed a helper for freshening my mind while doing research and writing papers.

Joensuu, 2nd August 2020 Thesis Author


LIST OF ABBREVIATIONS

2T         Two-stage thresholding
ABOD       Angle-based outlier detection
IFOREST    Isolation-based anomaly detection
IQR        Interquartile range
k-NN       k nearest neighbors
KNN        An outlier detector
LOF        Local outlier factor
MAD        Median absolute deviation
MCD        Minimum covariance determinant
MOD/MOD+   Mean-shift outlier detection (plus)
NA         Neighbourhood averaging
NC         Reverse unreachability
OCSVM      One-class support vector machine
ODIN       Outlier detection using indegree of nodes
PCAD       Principal-component-analysis outlier detection
SD         Standard deviation


LIST OF ORIGINAL PUBLICATIONS

P1. P. Fränti and J.W. Yang, Medoid-shift noise removal to improve clustering, Int. Conf. Artificial Intelligence and Soft Computing (ICAISC), 604-614, 2018.

P2. J.W. Yang, S. Rahardja, and P. Fränti, Mean-shift outlier detection, Int. Conf. Fuzzy Systems and Data Mining (FSDM), 208-215, 2018.

P3. J.W. Yang, S. Rahardja, and P. Fränti, Mean-shift outlier detection and filtering, 2020 (submitted).

P4. J.W. Yang, S. Rahardja, and P. Fränti, Neighborhood averaging for improving outlier detection, 2020 (submitted).

P5. J.W. Yang, S. Rahardja, and P. Fränti, Outlier detection: how to threshold outlier scores?, Int. Conf. on Artificial Intelligence, Information Processing, and Cloud Computing, 37(6), 2019.

P6. J.W. Yang, R. Mariescu-Istodor, P. Fränti, Three rapid methods for averaging GPS segments, Applied Sciences, 9(22), 4899, 2019.

P7. J.W. Yang, G.I. Choudhary, S. Rahardja, and P. Fränti, Classification of interbeat interval time-series using attention entropy, 2020 (submitted).


AUTHOR’S CONTRIBUTION

The idea for outlier filtering in [P1] came from Prof. Pasi Fränti. All other ideas and their conceptual development originated from the thesis author. Prof. Pasi Fränti was the principal author of [P1], [P2], and [P6], while the thesis author was the principal author of the rest of the papers.

The thesis author was responsible for all implementations, testing, and most of the illustrations. Prof. Pasi Fränti and Prof. Susanto Rahardja acted as supervisors, giving advice, participating in method development, and revising the text.


CONTENTS

ABSTRACT

ACKNOWLEDGEMENTS

1 INTRODUCTION
1.1 Definition
1.2 Application
1.3 Detector Category

2 OUTLIER DETECTORS
2.1 Model-based Outlier Detectors
2.2 Modification-based Outlier Detectors
2.3 Summary

3 POST-PROCESSING OUTLIER SCORES
3.1 Ensembles
3.2 Neighborhood Averaging
3.3 Summary

4 THRESHOLDING OUTLIER SCORES
4.1 Thresholding Is Important
4.2 Statistical Methods
4.3 Iterative Methods
4.4 Summary

5 APPLICATION IN TIME SERIES
5.1 GPS Trajectories
5.2 Heart Rate Variability
5.3 Summary

6 SUMMARY OF CONTRIBUTIONS
6.1 Contribution of the Thesis
6.2 Summary of Results

7 CONCLUSION

8 BIBLIOGRAPHY


1 INTRODUCTION

“A thing is as precious as it is rare.” - Chinese idiom

Outlier detection is a traditional research area, and it has been receiving much attention due to its importance in robust statistics, data mining, and machine learning. For example, in data analysis, outliers may cause problems because they may strongly bias the results [226]-[229]. In this section, we provide definitions for outliers, introduce outlier detection applications, and review outlier detector categories.

1.1 DEFINITION

The exact definition of an outlier depends on the data structure and the model used by the detector. However, several definitions are general enough to accommodate various applications and detectors. According to Hawkins [1], an outlier is an observation that deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism. Barnett and Lewis [2] state that an outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs. Similarly, Johnson [3] defines an outlier as an observation in a dataset that appears to be inconsistent with the remainder of that set of data.

Hawkins’ definition is widely accepted; we use this definition in this thesis. Fig. 1 shows an example of outliers. We can see that outlier objects are located far away from normalities, defined as the points that form a cluster in the case of Fig. 1.

Fig. 1. An example of normalities (top left) and outliers (bottom right).


An example of outliers in the real world is the human gestation period (Fig. 2). Mrs. Hadlum gave birth to a child 349 days after Mr. Hadlum had left for military service, while the average human gestation period is 280 days (40 weeks). Statistically, 349 days is an outlier. The question is what process gave rise to it: an unusually long gestation, or adultery (as asserted by Mr. Hadlum)?

Fig. 2. Distribution of human gestation periods. [2, 4].

1.2 APPLICATION

Outliers are also referred to as abnormalities, deviants, or anomalies, in the literature on data mining and statistics [4]. Another related concept is novelty detection, which is usually used when we have insufficient

“abnormal” data to construct explicit models for abnormalities [234]. In some applications [5, 6], such as clustering, weak outliers are viewed as noise that is expected to be cleaned; strong outliers are viewed as anomalies expected to be detected and further analyzed. Aggarwal [4]

concluded that normal data, noise, anomalies, and outliers have different degrees of outlierness (Fig. 3). In [230], the author categorizes anomalies into three types, based on the relationship with surrounding objects. In most applications, data is produced by one or more generating process(es); this data can be observations of the property of a system, but when the generating process behaves unusually, it results in outliers [4].

Therefore, outliers can represent significant information about the abnormal properties of the generating process of the system. Outlier detection can be applied to various types of applications:

• fraud detection: credit card [7-13] and insurance [14, 15];

• intrusion detection systems: host-based [16-31] and network [32-52];


• interesting sensor events [53-60, 219];

• medical diagnosis [61-70];

• text and social media: event detection [71-76], spam email [77-80], noisy and spam links [81-85], and anomalous activity in social networks [86-92];

• earth science: sea surface temperature anomalies [93-97] and land cover anomalies [99-102];

• data cleaning [103-105, 218, 220];

• traffic and movement patterns [106-109];

• disease outbreaks [110], aviation safety [111];

• biological sequences [112];

• failure management of large computer clusters [113];

• astronomical data [114-116];

• disturbance events in terrestrial ecosystems [117];

• detection of malicious URLs [118, 119];

• internet routing updates [120];

• spatial linking of criminal incidents [121].

Fig. 3. The spectrum from normal data to outliers [4].

1.3 DETECTOR CATEGORY

The typical outlier detection process is shown in Fig. 4. We can see that the input X can be pre-processed by, for example, scaling or normalization [4, 122], before the outlier detector block. After pre-processing, the outlier detectors produce an outlier score for each object, to represent its outlierness. Post-processing techniques, such as ensembles [4, 123-126], can be used to combine multiple detectors to create a more robust detector. Thereafter, the outlier scores are thresholded [127] to decide the outliers and normalities.

Outlier detectors can be categorized along different dimensions, such as the model used [231] or the properties of the input and output. Table 1 shows the categories of outlier detectors. Some outlier detectors [128] do not produce outlier scores; instead, they produce an outlier-normality binary label for each object. However, binary labeling contains less information than a scoring mechanism does [4]. We can also see that the input can have various data types from various application domains.

Fig. 4. The typical outlier detection process.

Table 1. Categories of outlier detectors [4].

Input:          Time series, Numeric, Text, Mixed
Reference set:  k-nearest neighbor, Other
Model-based:    Probabilistic, Statistical, Information-theoretic
Output:         Binary labels, Outlier scores

The organization of this thesis follows the outlier detection process illustrated in Fig. 4. Section 2 reviews outlier detectors that produce outlier scores. Section 3 focuses on outlier score post-processing techniques. Section 4 discusses outlier score thresholding techniques. Section 5 introduces outlier detection in time-series applications. The contributions of the thesis are summarized in Section 6, and conclusions are drawn in Section 7.


2 OUTLIER DETECTORS

“The rare wonders of the world often lie in danger or the places where people are rare.”

- Anshi Wang

Two typical ways to create detectors exist: modeling and modifying.

In this section, we review nine typical, well-known, state-of-the-art outlier detectors, summarized in Table 2. LOF, ODIN, NC, KNN, ABOD, MCD, IFOREST, and OCSVM are model-based detectors; PCAD, MOD/DOD, and MOD+/DOD+ are modification-based detectors. An outlier detection example for these detectors is shown in Fig. 5. We can see that most detectors detect the outliers correctly with only default parameter settings.

Table 2. State-of-the-art outlier detectors.

Detectors Model k-NN Publication and year

LOF [129] Density-based YES ACM SIGMOD, 2000

ODIN [130] Graph-based YES ICPR, 2004

NC [131] Representation YES IEEE-TNNLS, 2018

KNN [132] Distance-based YES ACM SIGMOD, 2000

ABOD [133] Angle-based YES KDD, 2008

MCD [134] Statistical testing NO J. A. Stat. Assoc, 1984

IFOREST [135] Tree-based NO TKDD, 2012

OCSVM [136] SVM-based NO Neural computation, 2001

PCAD [137] PCA-based NO ACM, 2000

MOD/DOD [P2] Shifting-based YES FSDM, 2018
MOD+/DOD+ [P3] Shifting-based YES -


Fig. 5. Examples of well-known outlier detectors. Outliers are expected to have a big circle radius. (k=20 is used for all k-NN-based detectors, and the default settings in the literature are used for the other detectors.)

2.1 MODEL-BASED OUTLIER DETECTORS

The model-based approach to building detectors is to model the normal patterns of the reference set and compute the outlier score by evaluating the quality of the fit between the object and the model [4]. The reference set is formed by objects sampled from the original dataset. Different detectors make different assumptions about the normal patterns of the data. For example, some nearest-neighbor-based outlier detectors use the k-nearest neighbor (k-NN) objects as the reference set and model outlier objects in terms of the distribution of their k-NN distances, with the assumption that outlier objects are located far away from most of their neighbors. The model is therefore central; as stated in [1], the data model is everything. However, in [235], the author also considered the difference between reference sets and context sets; the former are used to compare models, while the latter are used to build an outlier model.

Distance-based outlier detectors assume that outliers are located far away from most of their k-NN objects. For example, in Fig. 6, point q is located further from its k-NN than point p is. The detector in [132], called KNN, calculates an outlier score as the distance between an object and its kth neighbor.
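To make the scoring mechanism concrete, the following minimal sketch (not code from [132]; it assumes NumPy and scikit-learn are available, and k is a user-chosen parameter) computes the KNN score as the distance to the k-th nearest neighbor:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_scores(X, k=20):
    # Ask for k + 1 neighbors because each point is returned as its own nearest neighbor.
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return dist[:, -1]  # distance to the k-th neighbor, used as the outlier score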

In [130], a method called outlier detection using indegree of nodes (ODIN) is defined, using the k-NN graph. ODIN uses the indegree in the graph as the outlier score and assumes that outliers have a lower indegree. An example of ODIN is shown in Fig. 7.

Fig. 6. KNN example. Fig. 7. ODIN example; the numbers are outlier scores1.

However, distance-based outlier detectors encounter problems if the data has areas of varying densities. Fig. 8 shows that the example points in C1 have larger distances to their k-NNs than the point o2 does, but o2 is more likely to be an outlier than the points in C1. Density-based detectors can solve the problem of varying densities in data. They assume that the density of an outlier is considerably lower than that of its neighbors. The local outlier factor (LOF) [129] calculates the density of an object relative to that of its k-NN and uses this ratio as the outlier score. LOF was the best detector among the twelve k-NN-based detectors compared in [122]. Many variants of LOF have been proposed and compared in [122], [233].

1 http://cs.uef.fi/pages/franti/cluster/Clustering-Part7-Outliers.pdf
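As a hedged illustration of LOF, scikit-learn's LocalOutlierFactor can be used; the library reports a negative outlier factor, so the sign is flipped below to follow the convention that a larger score means more outlying:

from sklearn.neighbors import LocalOutlierFactor

def lof_scores(X, k=20):
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)                               # builds the k-NN model and computes the factors
    return -lof.negative_outlier_factor_     # close to 1 for inliers, clearly larger for outliers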


Fig. 8. Data with varying densities [129]. Fig. 9. Dataset with varying densities [138].

However, density-based outlier detectors are problematic if clusters of different densities are not clearly separated; see Fig. 9. Point p will have a higher LOF outlier score than points q and r do, which is counter-intuitive. To overcome this problem, a representation-based detector called reverse unreachability (NC) is defined in [131]. A given object is represented by a linear combination of its k-NNs. These representations produce a weight matrix describing how much each neighbor contributes. The number of negative weights is the outlier score. NC assumes that outliers have more neighbors that contribute negative weights. An example of NC is shown in Fig. 10.

In high-dimensional data, distance-based and density-based outlier detectors become inefficient because of the so-called curse of dimensionality [133]: various phenomena appear in high-dimensional spaces that do not occur in low-dimensional spaces. The common theme of these phenomena is that the available data becomes sparse, because the volume of the space grows quickly with increasing dimensionality. This sparsity is problematic for any method that relies on statistical significance. To overcome this problem, angle-based outlier detection (ABOD) is proposed in [133]; it argues that angles are more stable than distances in high-dimensional spaces. It analyzes the angles between an object and its k-NN objects. Given a point p and any two other points x and y from its k-NN, the angle between the two vectors from p to x and from p to y is calculated. The outlier score is the variance of these angles. ABOD assumes that outliers have a smaller variance of angles than normalities do. Fig. 11 shows the variance of angles for outliers and normalities.


Fig. 10. An example of NC outlier scores [131]: example points (top) and their representation matrix W (bottom). The outlier score for each point is the number of negative values in its column: the outlier scores for {x1, x2, x3, x4, x5, x6} are {3, 2, 0, 2, 1, 0}. Hence, {x1, x2, x4} are more likely to be outliers.

Fig. 11. Angles between an object (outlier, top; normality, bottom) and the remaining objects2 [133].
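A simplified sketch of the angle-variance idea is given below; it is not the exact weighted formulation of [133], assumes NumPy and scikit-learn, and simply takes the variance of the angles over all pairs of neighbors:

import numpy as np
from itertools import combinations
from sklearn.neighbors import NearestNeighbors

def abod_scores(X, k=20):
    X = np.asarray(X, dtype=float)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    scores = np.empty(len(X))
    for i, neighbors in enumerate(idx[:, 1:]):            # drop the point itself
        angles = []
        for a, b in combinations(neighbors, 2):
            u, v = X[a] - X[i], X[b] - X[i]
            cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
            angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
        scores[i] = np.var(angles)                         # a small variance suggests an outlier
    return scores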

Some detectors are based on statistical analysis. They assume that normality objects follow the same distribution, whereas outliers do not.

The minimum covariance determinant (MCD) [134] fits a robust covariance estimate (an ellipse) to 50% of the objects. This 50% is the reference set. The ellipse is used as the data model. The outlier score is the distance between an object and the model. An example of MCD is shown in Fig. 12.

Fig. 12. An example of MCD3.
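A hedged usage sketch with scikit-learn's MinCovDet: the robust (Mahalanobis) distance to the fitted ellipse serves as the outlier score:

from sklearn.covariance import MinCovDet

def mcd_scores(X):
    mcd = MinCovDet().fit(X)      # robust covariance estimated from roughly half of the objects
    return mcd.mahalanobis(X)     # squared Mahalanobis distance to the robust model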

Tree-based detectors fundamentally differ from distance-based or density-based detectors. Isolation-based anomaly detection (IForest) [135] uses random subsamples of the data as the reference set to construct an isolation tree. It recursively splits the data by a randomly selected split value between the maximum and minimum of a randomly selected feature. An example of an isolation tree is shown in Fig. 13. A recursive partition of the set of points A, B, C, and D produces an isolation tree. Outliers tend to appear higher in the tree. An isolation forest is a collection of isolation trees. The average path length of an object defines its outlier score. Outliers have small outlier scores.

2 https://www.dbs.ifi.lmu.de/~zimek/publications/PAKDD2013/TutorialPAKDD2013Outlier-DetectionHighDim.pdf

3 https://archive.siam.org/meetings/sdm10/tutorial3.pdf

Fig. 13. An example of an Isolation Tree4.
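As a minimal usage sketch (assuming scikit-learn), IsolationForest exposes the average-path-length score; the sign is flipped so that larger values mean more anomalous:

from sklearn.ensemble import IsolationForest

def iforest_scores(X):
    forest = IsolationForest(random_state=0).fit(X)
    return -forest.score_samples(X)   # shorter average path -> larger value -> more likely an outlier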

Support vector machines can recognize patterns in data and can be used in classification tasks. The one-class support vector machine (OCSVM) [136] trains a support vector model using all objects as the reference set and treating them as a single class. The outlier score is the distance between an object and the model. Fig. 14 shows an example of the OCSVM learned model. Objects located outside the learned frontier are outliers; the further outside the frontier they are, the more likely they are to be outliers.

4 https://e3s-center.berkeley.edu/wp-content/uploads/2017/08/RET_CStanton-2015.pdf
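A hedged usage sketch with scikit-learn's OneClassSVM; the negated decision function measures how far outside the learned frontier an object lies:

from sklearn.svm import OneClassSVM

def ocsvm_scores(X):
    model = OneClassSVM(gamma="auto").fit(X)   # all objects treated as a single class
    return -model.decision_function(X)         # positive and growing outside the learned frontier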


Fig. 14. An example of the OCSVM learning model.

Fig. 15. An example of the PCAD outlier score. The distance between the original data (black) and projection data (blue) is the outlier score.

2.2 MODIFICATION-BASED OUTLIER DETECTORS

The modification-based approach to constructing detectors is to modify the data and calculate the outlier score by measuring the distance between the data before and after the modification. The assumptions behind the modification technique are the key to creating such detectors.

Principal component analysis (PCA) is an established technique in data mining. It reveals the variance of the data structure by extracting the principal features of a dataset. The PCA-based outlier detection method (PCAD) [137] modifies each object by projecting it with the eigenvectors. The distance before and after the modification, namely the projection, is calculated as the outlier score. Fig. 15 shows an example of computing outlier scores with PCAD. A similar idea, calculating the outlier score from the reconstruction error, is considered in [236].
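The projection idea can be sketched as follows (assuming scikit-learn's PCA; the number of retained components is a free choice, not a value from [137]):

import numpy as np
from sklearn.decomposition import PCA

def pcad_scores(X, n_components=2):
    X = np.asarray(X, dtype=float)
    pca = PCA(n_components=n_components).fit(X)
    X_rec = pca.inverse_transform(pca.transform(X))   # project onto the principal subspace and back
    return np.linalg.norm(X - X_rec, axis=1)          # distance before vs. after the projection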

The mean-shift outlier detector (MOD) [P2] employs mean-shift to modify the data. Mean-shift modifies the objects by replacing each object with the mean of its k-NN. This process can be iterated, as shown in Algorithm 1.

This process forces the objects to move towards denser areas, as illustrated in Fig. 16 and Fig. 17. We can see point P is moved to P*, a denser area. The distance of the movement can serve as evidence of being an outlier: an object with a larger movement is more likely to be an outlier. MOD repeats the mean-shift process for a few iterations (three iterations by default) and calculates the distance before and after the shift as the outlier score. The process is summarized in Algorithm 2.

Mean-shift is also used in [P1] to filter outliers in pre-processing, before clustering. From Fig. 17, we can see that, after applying mean-shift for three iterations to modify the data, most outliers are removed. Then any clustering algorithm can be applied to the modified data. Fig. 18 shows the clustering results of random swap [171] with and without applying mean-shift. We can see that, after applying mean-shift to filter outliers, the clustering results are improved. The benefit of using mean-shift before clustering is that one does not need to decide how many objects to remove from the data.

Algorithm 1: Mean-shift process MS(X, k, I)
Input: dataset X ∈ R^(d×n), k, I ∈ Z
Output: modified dataset X* ∈ R^(d×n)
REPEAT I TIMES:
    FOR each object Xi ∈ X:
        k-NN(Xi) ← find its k nearest neighbors from X
        Mi ← mean of the neighbors k-NN(Xi)
        X*i ← Mi (replace Xi by the mean Mi), X*i ∈ X*
    X ← X*

Algorithm 2: Mean-shift outlier detection MOD(X, k)
Input: dataset X ∈ R^(d×n), k ∈ Z
Output: outlier scores S ∈ R^(1×n)
Y ← MS(X, k, 3) (Algorithm 1)
S ← distances between X and Y

Algorithm 3: Mean-shift outlier detection + MOD+(X, k)
Input: dataset X ∈ R^(d×n), k ∈ Z
Output: outlier scores S ∈ R^(1×n)
Y ← MS(X, k, 3) (Algorithm 1)
FOR each Xi ∈ X:
    k-NN(Xi) ← find the k nearest neighbors of Xi
    Si ← Σj |Xj − Yj|, Xj ∈ k-NN(Xi), Yj ∈ Y, Si ∈ S
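The following NumPy/scikit-learn sketch is one possible reading of Algorithms 1-3 with the stated defaults (three iterations; neighbors searched in the current data, with the query point counting among its own neighbors). It is illustrative only, not the reference implementation of [P2]/[P3]:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_shift(X, k, iterations=3):
    # Algorithm 1: repeatedly replace every object by the mean of its k nearest neighbors.
    Y = np.asarray(X, dtype=float).copy()
    for _ in range(iterations):
        _, idx = NearestNeighbors(n_neighbors=k).fit(Y).kneighbors(Y)
        Y = Y[idx].mean(axis=1)
    return Y

def mod_scores(X, k):
    # Algorithm 2 (MOD): the outlier score is the distance each object has moved.
    X = np.asarray(X, dtype=float)
    Y = mean_shift(X, k)
    return np.linalg.norm(X - Y, axis=1)

def mod_plus_scores(X, k):
    # Algorithm 3 (MOD+): the outlier score is the summed movement of the object's k-NN,
    # with the neighbors searched in the original data X.
    X = np.asarray(X, dtype=float)
    Y = mean_shift(X, k)
    movement = np.linalg.norm(X - Y, axis=1)
    _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    return movement[idx].sum(axis=1)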


Fig. 16. The mean-shift process moves objects towards denser areas, illustrated on part of the A1 [139] dataset with an 8% outlier level. Point P is shifted to point P*.

Fig. 17. Example of the iterative mean-shift process on part of the A1 [139] dataset with 8% of outliers. Blue points are the centroids of clusters.

Fig. 18. An example of random swap clustering [171] results after each mean-shift iteration on part of the A1 [139] dataset with 8% of outliers. Red points are predicted clusters and blue points are ground-truth clusters.

Both the mean and the median have been used in the mean-shift clustering concept [140-144]. The median is less sensitive to outlier objects [144]. However, with multivariate data, how the median should be defined is not obvious. We use the medoid, which is the object in the set that has the minimum total distance to all other objects in the set. We call the detector using the medoid the medoid-shift outlier detector (DOD). DOD can also be applied to string datasets, using the edit distance. An example of applying DOD to string data is shown in Fig. 19.

However, when the number of outliers in the dataset is large, using objects only from the original data to form the reference set can bias the result, because too many of the neighbor objects are also outliers. An example is shown at the top of Fig. 20, where we can see that MOD failed to detect many outliers. To overcome this problem, a refined variant of MOD called MOD+ is proposed in [P3], based on the concept of the extended reference set. Compared to the reference set, which only contains objects from the original data, the extended reference set is formed by sampling objects from both the original data and the modified data. An example of the extended reference set is shown in Fig. 21. The advantage of this extended reference set is that it contains more information about normalities in the data. As illustrated in Fig. 21, the k-NN before the mean- or medoid-shift contain only outliers, whereas after the shift they also contain normalities. MOD+ calculates the outlier score as the sum of the distances before and after the shift of the objects in the k-NN, as summarized in Algorithm 3. An example of the difference between MOD and MOD+ is shown in Fig. 20. We can see that MOD failed to detect many outliers, but MOD+ succeeded.

Fig. 19. An example of medoid-shift outlier scores with k=3 when using the edit distance on strings from the Countries⁵ dataset. The left figure shows the medoid-shift process. The right figure shows the outlier scores calculated as the edit distance between the strings before and after three iterations of the medoid-shift. Black strings are the original strings (normalities), and the red (bold) ones are the outliers. [P3]

5 http://cs.uef.fi/sipu/string/countries/


Fig. 20. MOD failed to detect many outliers (crosses), but MOD+ succeeded (k=50).

Fig. 21. Point P (big red dot in the circle) and its k-NN (k=4, blue crosses in the circle) before and after the mean-shift process on part of the A1 [139] dataset with an 8% outlier level.

2.3 SUMMARY

Typical outlier detectors are created based on modeling or modifying. Model-based detectors compute outlier scores by evaluating the quality of the fit between the object and a model built on the reference set [4]. The model and the reference set are key in model-based detectors. Modification-based detectors calculate outlier scores by measuring the difference before and after modification. The modification technique is key in modification-based detectors.

Based on the modified data, we introduce a concept called the extended reference set, which contains objects from both the original and the modified data. The benefit of the extended reference set is that it can contain more information about normalities, especially when the number of outliers in the data is large.


3 POST-PROCESSING OUTLIER SCORES

“Give back the tears to the sea, emancipate the original light.”- Bidoan, UK.

Outlier score post-processing is an essential step before thresholding objects as outliers or normalities. Typical post-processing techniques use the strategy of combining outlier scores, for example scores from different detectors. In this section, we introduce two strategies. In an ensemble, we combine outlier scores of the same object from different outlier score production mechanisms; in Neighborhood Averaging, we combine outlier scores of different objects from the same outlier score production mechanism.

3.1 ENSEMBLES

The idea of combining the outputs of multiple algorithms to create a more robust solution to the underlying problem is referred to as an ensemble.

Ensembles are essential techniques in many data mining tasks, such as classification and clustering. Typical ensembles for classification are bagging, subsampling, boosting, and stacking [160, 161, 162]; those for clustering, such as cluster ensembles, can be found in [163]. Ensembles are a widely used strategy in data mining competitions, the most famous example being the winning ensemble in the Netflix Prize challenge for recommender systems [4, 164]. Post-processing outlier scores with calibration, standardization, or normalization is also explored in [237], [338] and used in [239], [240].

The basic idea of ensembles is that different algorithms work well on different subsets of a dataset. Therefore, the ensemble combination is often able to perform more robustly across the board, due to its ability to utilize the strengths of multiple algorithms.

Ensembles for outlier detection have been introduced in [123-126, 165, 221-225]. Two primary types of ensembles for mining outliers exist [4].

In sequential ensembles, multiple detectors are applied sequentially to either all or portions of the data. The core idea is that each application enables a more refined execution with either a modified detector or a modified dataset [4]. For example, boosting methods [4, 162] belong to sequential ensembles.

In independent ensembles, different detectors applied to all or different portions of the data are combined for outlier detection. Alternatively, the same detector may be applied with different initializations, parameter sets, or random seeds. The idea is to combine the results from multiple detectors to obtain a more robust outlier score. For example, the average ensemble [125] calculates the average outlier score from multiple baseline detectors.

The broad framework of independent ensembles and sequential ensembles is shown in Fig. 22, and the outlier detection process using ensembles is shown in Fig. 23. The final combined result of an ensemble can be a weighted combination, the mean, the median, or the maximum of the results from multiple detectors. At a fundamental level, outlier ensembles are not very different from classification ensembles in terms of the underlying theoretical foundations [4, 123]. As a result, many natural ensembles such as bagging and subsampling can easily be generalized to outlier detection, with minor variations.
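As an illustrative sketch (assuming NumPy; the detector functions are placeholders that map a dataset to one score per object, e.g. the sketches from Section 2), an independent average ensemble can be written as:

import numpy as np

def average_ensemble(X, detectors):
    combined = []
    for detect in detectors:
        s = np.asarray(detect(X), dtype=float)
        s = (s - s.min()) / (s.max() - s.min() + 1e-12)   # min-max normalize to [0, 1]
        combined.append(s)
    return np.mean(combined, axis=0)                      # averaged outlier score per object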

Fig. 22. The framework of sequential ensembles and independent ensembles [4].

Fig. 23. The outlier detection process using independent ensembles and sequential ensembles.

3.2 NEIGHBORHOOD AVERAGING

We hypothesize that similar objects in the feature space have similar outlier scores. To the best of our knowledge, all existing outlier detectors, including the detectors introduced in Section 2, and ensembles calculate the outlier score for each object independently, regardless of the outlier scores of the other objects; therefore, they do not guarantee that similar objects will have similar outlier scores. To overcome this problem, we propose a post-processing technique for outlier detectors, called Neighborhood Averaging (NA) [P4], which pays attention to the most similar objects in the neighborhood and gives them outlier scores that are more similar to each other than the original scores.

The NA technique is simple: once an outlier detector or ensemble has calculated the outlier score for each object, we post-process the score of every object by updating it with the average of its k-NN objects' scores. NA basically smoothens the result of the baseline outlier detector, making outlier scores in the neighborhood closer to each other. Given a dataset X and its outlier scores S from any given outlier detector or ensemble, the proposed NA works in two steps. First, for every object Xi, we find the set of its k nearest neighbors k-NN(Xi). Second, we calculate the revised outlier score S*i of Xi as the average of the outlier scores of Xi and its k nearest neighbors.

The pseudo-code of the proposed Neighborhood Averaging is shown in Algorithm 4 and demonstrated in Fig. 24. The algorithm works in two steps. Considering the red object, NA first finds its k nearest neighbors and then updates its score with the average of the scores of the object and its neighbors. This process is applied independently to all objects. Fig. 25 shows the results with and without NA. We can see that LOF falsely detects many outliers, but after processing the scores with NA, detection improves. NA can also be iterated several times to achieve an even stronger smoothing effect and further reduce the bias.

The beauty of the technique proposed is that it can be added after any existing outlier detector or ensemble technique, as an additional post-processing technique, as illustrated in Fig. 26. Compared to ensemble techniques, instead of merging the results from multiple detectors for a single object, NA merges the results of multiple objects from a single detector, as illustrated in Fig. 27. Therefore, the Neighborhood Averaging technique proposed is conceptually different from ensemble techniques. In fact, it complements ensemble techniques; ensemble techniques cannot guarantee that similar objects have similar outlier scores. Thus, we can apply NA after applying ensemble techniques.

Algorithm 4: Neighborhood Averaging NA(X, S, k)
Input: dataset X, outlier scores S, neighborhood size k
Output: revised outlier scores S*
FOR each Xi ∈ X:
    k-NN(Xi) ← find the k nearest neighbors of Xi
    S*i ← (1 / (k + 1)) ∗ (Si + Σj Sj), Xj ∈ k-NN(Xi), Sj ∈ S


Fig. 24. NA process illustration: given an object (red) and outlier scores, NA first finds its k-nearest neighbors and uses the average of its score and its neighbors' scores as its modified score. [P4]

Fig. 25. LOF falsely detected many boundary objects as outliers (crosses), evaluated on a noisy A1⁶ dataset [23], but detection improved after using NA. [P4]

Fig. 26. Outlier detection process with post-processing: NA and ensembles.

6 http://cs.uef.fi/sipu/datasets/


Fig. 27. Relationship between ensemble and NA techniques. Ensemble techniques utilize the information on the same object (object at the bottom) from multiple detectors, while the NA technique utilizes the information on different (neighboring) objects from a single detector. [P4]

3.3 SUMMARY

We hypothesize that similar objects in the feature space should have similar outlier scores. Based on this hypothesis, a very robust outlier score post-processing technique called Neighborhood Averaging (NA) is proposed to improve outlier detection results. NA combines the results of multiple objects from a single outlier detector.

However, the hypothesis, that similar objects in the feature space should have similar outlierness (outlier scores), leads to several unexplored directions. For example, how to define similar objects remains an open question. They could be objects that share common neighbors, objects that have a similar structure in a graph, objects that have similar depth, or objects that have similar contextual meaning. Moreover, it may be possible to fix the problem inside the detector, instead of fixing it after the detector, i.e. in post-processing.

Notably, a similar hypothesis—similar objects in the feature space should have similar prediction results—may be suitable for other data mining applications, such as classification, as well. This hypothesis can guide our future work.


4 THRESHOLDING OUTLIER SCORES

“True, a little learning is a dangerous thing, but it still beats total ignorance.”

– Abigail van Buren

Thresholding is an essential step in outlier detection; it directly decides whether an object is an outlier. In this section, we introduce several widely used thresholding techniques and the problems they have in common.

4.1 THRESHOLDING IS IMPORTANT

Scholars usually use no thresholding technique when evaluating outlier detectors, because the scores provided by different methods differ widely in their scale, range, and meaning [133]. Instead, the area under the receiver operating characteristic (ROC) curve (AUC) [217] is generally employed in the literature to measure outlier detectors. The ROC AUC relies on a dynamic thresholding process: the ROC curve plots the true positive rate against the false positive rate over the ranked outlier scores at various threshold values. The area under the ROC curve provides an accuracy measure that ranges from 0 to 1; the value for a perfect outlier prediction equals 1.

In this case, the true positive rate is plotted against the false positive rate. The true positive rate TPR(t) and the false positive rate FPR(t) follow the definitions in [29]:

TPR(t) = Recall(t) = |S(t) ∩ G| / |G|   (1)

FPR(t) = |S(t) \ G| / |D \ G|   (2)

where D is the dataset, G is the ground-truth outlier set, and S(t) is the predicted outlier set decided by a threshold t on the outlier scores.

For every threshold t, we can get a TPR-FPR pair. The ROC curve is drawn with the TPR against the FPR; see Fig. 28. ROC curve endpoints are always at (0,0) and (1,1). The area under the ROC curve is called ROC value, which ranges from 0 to 1, and the perfect prediction is 1.
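For illustration only (the labels and scores below are hypothetical), the ranking-based AUC evaluation can be reproduced with scikit-learn; y_true marks ground-truth outliers with 1, and larger scores mean more outlying:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 0, 1])                # hypothetical ground truth
scores = np.array([0.1, 0.2, 0.15, 0.9, 0.3, 0.7])   # hypothetical outlier scores
print(roc_auc_score(y_true, scores))                 # 1.0: every outlier ranked above every normality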


However, thresholding outlier scores to decide the outliers is a vital step of the outlier detection process in real-world applications, and the importance of thresholding techniques is usually ignored in the literature. In this section, we discuss several thresholding techniques.

Fig. 28. ROC AUC curve example7.

4.2 STATISTICAL METHODS

We randomly picked 100 publications from June 2016 to June 2018 via Google Scholar and studied which thresholding techniques were used. The results are summarized in Fig. 29. They show that standard deviation (SD), median absolute deviation (MAD), and interquartile range (IQR) are the most used techniques. Some of the remaining publications used a user-given parameter (parameter, top-N); some did not specify their choice of thresholding technique; still others had no need for thresholding. By parameter, we refer to a user-given constant for thresholding outlier scores. Top-N refers to a priori knowledge of how many outliers (%) are expected to be in the data.

7 http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html


Fig. 29. Usage frequency of different threshold selection techniques in outlier detection literature June 2016-June 2018. [P5]

SD, MAD, and IQR are based on statistics. They are formally described in the following equations.

T = mean(X) + a ∗ SD(X)   (3)

where mean(X) and SD(X) are the corresponding statistics of the outlier scores X, and a is a control parameter decided by the user. The smaller the value, the more objects are marked as outliers. The most common choice is a = 3 because the number of outliers is expected to be small [40].

T = median(X) + a ∗ MAD,   MAD = b ∗ median(|X − median(X)|)   (4)

where median(X) and MAD are the corresponding statistics of the outlier scores; a is decided by the user and usually set to 3. The value b is suggested to be 1.4826 in [40].

T = Q3 + c ∗ IQR,   IQR = Q3 − Q1   (5)

where c is decided by the user and usually set to 1.5 [40].
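A hedged NumPy sketch of Eqs. (3)-(5), using the default parameter values quoted above (a = 3, b = 1.4826, c = 1.5):

import numpy as np

def sd_threshold(scores, a=3.0):
    return np.mean(scores) + a * np.std(scores)          # Eq. (3)

def mad_threshold(scores, a=3.0, b=1.4826):
    med = np.median(scores)
    mad = b * np.median(np.abs(scores - med))
    return med + a * mad                                 # Eq. (4)

def iqr_threshold(scores, c=1.5):
    q1, q3 = np.percentile(scores, [25, 75])
    return q3 + c * (q3 - q1)                            # Eq. (5)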


Fig. 30. Effect of outliers on thresholding techniques. [P5]

However, none of these techniques works well when we have a large number of outlier objects. Faraway outliers cause the thresholds based on all three of these statistics to take very high values; see Fig. 30. As a result, all the existing statistical methods overestimate the threshold and fail to detect most of the outliers with smaller scores. The main problem is that the statistics used to set the threshold are unreliable due to the outliers themselves. An example of thresholding results is shown at the top of Fig. 31. We can see that SD, MAD, and IQR fail to threshold all outliers.

Fig. 31. An example of thresholding outlier scores.

4.3 ITERATIVE METHODS

As the presence of outliers biases the values of SD, MAD, and IQR [167], we propose two-stage thresholding (2T) [P5] to solve this problem: the most outlying scores are excluded from the process to reduce their effect on the threshold calculation. In the second stage, any standard thresholding can be applied, using the remaining scores. The purpose of the first stage (the cleaning step) is to make the second stage (the threshold calculation) more robust in the presence of many outlier objects. The 2T process is summarized in Algorithm 5. We note that the same parameter setup (a, b, or c) can be used in both stages. The proposed 2T, therefore, involves no additional parameters or design choices; the only exception is the decision whether to perform the cleaning only once or multiple times (iterated). Our recommendation is to do it only once. An example of thresholding with 2T is shown at the bottom of Fig. 30. We can see that both MAD and IQR succeeded after using 2T. SD failed in the first stage.

Algorithm 5: Two-stage thresholding (2T)
Input: outlier scores X ∈ R^(1×n)
Output: threshold T ∈ R
Calculate T over X via Eq. (3), Eq. (4), or Eq. (5);
REPEAT:
    Remove the scores that exceed T from X;
    Re-calculate T via Eq. (3), Eq. (4), or Eq. (5);
UNTIL StopCondition;
RETURN T;
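A minimal sketch of Algorithm 5 (assuming NumPy; the MAD rule of Eq. (4) is repeated here so that the snippet is self-contained, and a single cleaning pass is used, as recommended):

import numpy as np

def mad_threshold(scores, a=3.0, b=1.4826):
    med = np.median(scores)
    return med + a * b * np.median(np.abs(scores - med))

def two_stage_threshold(scores, base=mad_threshold, passes=1):
    s = np.asarray(scores, dtype=float)
    T = base(s)                     # stage 1: preliminary threshold over all scores
    for _ in range(passes):         # stage 2: recompute on the cleaned scores
        s = s[s <= T]
        T = base(s)
    return T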

A similar technique, called Clever SD [166], iteratively removes one score that exceeds the SD-based threshold and then recalculates the SD. 2T differs from Clever SD in three main ways. First, 2T is not limited to using SD; other existing techniques such as MAD and IQR can also be applied. Second, instead of eliminating only the most extreme outlier, we remove all outliers that exceed the preliminary threshold. Third, Clever SD iterates until convergence, whereas 2T stops after a few iterations, which makes it significantly faster. Similar ideas are also considered in [229] for robust statistics, where values are treated as outliers if they exceed a specified number of standard deviations from the center of the distribution.

4.4 SUMMARY

When the number of outliers is large, the presence of outliers biases thresholding techniques that are based on statistics. However, thresholding can be improved in a straightforward manner by applying the 2T technique proposed. In other words, we first exclude the scores that exceed the original threshold value and then recalculate the statistics.

The actual outlier removal is then performed by using the revised threshold.


5 APPLICATION IN TIME SERIES

“Two flowers in bloom, one for each branch.” - Dream of Red Mansions

A time series is a series of observations recorded in time order. Therefore, it has a natural temporal ordering, which makes its analysis different from that of normal data. This temporal ordering can carry essential information in the data; for example, different orderings of the elements of the same set can represent different passwords. In this section, we introduce outlier detection from time-series data with two cases: Global Positioning System (GPS) data and human heart signals.

5.1 GPS TRAJECTORIES

A GPS trajectory is a series of geographical points (latitude, longitude) recorded by vehicles, cellphones, smart bracelets, and smart watches equipped with GPS devices. Collecting and analyzing large-scale GPS trajectories can uncover hidden information about human behaviors [172-179, 208-213] and city dynamics [180-182], such as inferring destinations [183, 207], detecting taxi driving frauds [184, 185], calculating average routes for vehicle navigation [186-188], and extracting road networks [189-192, 206]. CNN⁸ reported a case in 2018 wherein the fitness tracking app Strava disclosed the locations of its users, such as army joggers. By analyzing the GPS trajectories recorded by joggers, the detailed road network of secret US army bases in Afghanistan could be constructed. Fig. 32 (left) shows how Strava broadcast patterns of movements of US personnel at American military facilities in Djibouti, and Fig. 32 (right) shows the road network extracted from similar information in Afghanistan. This kind of information sharing can cause serious privacy and security issues.

We normally need only two steps to extract road network data: detect splits (intersections) and extract road segments. In Fig. 33 (left), the intersections of roads are first detected [193], and then the segment between two intersections is constructed by averaging all the segmented trajectories between those intersections. However, in Fig. 34 (left), we can see data that contains an outlier segment, which can affect the quality of segment averaging.

8 https://edition.cnn.com/2018/01/28/politics/strava-military-bases-location/index.html


Fig. 32. (left) US Camp Lemonnier outside Djibouti City by Strava9. (right) A military base in Helmand Province, Afghanistan, with the route taken by joggers highlighted by Strava10.

In [P6], we consider detecting and removing outlier segments as a pre-processing step before segment averaging. The problem with traditional outlier trajectory detectors [184, 185, 232] is that they require long-distance GPS trajectories as input to work successfully, because they map each GPS trajectory into a set of 15 × 15 m cells in the pre-processing stage. After mapping, traditional outlier trajectory detectors detect an outlier trajectory by analyzing whether most of the other trajectories also contain the same cells. Moreover, the methods in Section 2 cannot be directly applied here because they require points as input, while a GPS trajectory is a series of points.

The method in [P6] processes trajectories by extracting their start and end points. This reduces the outlier detection problem back to normal pointwise detection. The method works in three steps. First, it identifies the start and the end point of each segment, to form the source set and the destination set, as shown in Fig. 34 (right). Second, any outlier detector introduced in Section 2 can be used to score the points in the source and destination sets individually, and the thresholding techniques from Section 4 can then be used to detect outlier points. Third, we remove any segment whose start or destination point is identified as an outlier.
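An illustrative sketch of this three-step strategy (not the [P6] reference code): segments is assumed to be a list of point arrays, and score_fn / threshold_fn can be any detector from Section 2 and any thresholding rule from Section 4, such as the knn_scores and mad_threshold sketches shown earlier:

import numpy as np

def filter_outlier_segments(segments, score_fn, threshold_fn):
    starts = np.array([seg[0] for seg in segments])      # source set
    ends = np.array([seg[-1] for seg in segments])       # destination set
    is_outlier = np.zeros(len(segments), dtype=bool)
    for points in (starts, ends):
        scores = score_fn(points)                        # step 2: score the endpoint sets
        is_outlier |= scores > threshold_fn(scores)      # ... and threshold the scores
    return [seg for seg, bad in zip(segments, is_outlier) if not bad]   # step 3: drop flagged segments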

9 https://www.theguardian.com/us-news/2018/jan/29/pentagon-strava-fitness-security-us-military

10 https://www.theguardian.com/world/2018/jan/28/fitness-tracking-app-gives-away-location-of-secret-us-army-bases


Fig. 33. Two steps to extract a road network: detect splits (intersections) and extract road segments [215].

Fig. 34. An example of an outlier segment (left) and detecting the outlier segment (right) [P6].

5.2 HEART RATE VARIABILITY

Heart rate (HR) and heart rate variability (HRV) have gained much attention in human health monitoring. HR measures how many times the heart beats in a minute, on average. Heart rate is used, for example, to monitor how physical training affects the cardiovascular system in real time [202]. HRV, on the other hand, measures the variation in heart rate.

HRV is best measured during a resting state [204]. It helps us understand overall health because HRV has a connection with the nervous system [194, 195], which controls and responds to many important body processes, such as digestion, respiration, metabolism, vision, hearing and smell, and sexual function. HRV can, therefore, be used to detect diseases such as diabetic neuropathy [196], liver cirrhosis [197], and cancer [200], as well as sudden cardiac death [198, 199] and congestive heart failure [149, 201].

The R-R interval or inter-beat interval (IBI), the time between successive beats measured in milliseconds (ms), is normally used in HRV measurement. Fig. 35 shows how the R-R interval is derived from the electrocardiogram (ECG) wave. Differentiating diseased subjects from healthy subjects in R-R interval time series is very challenging because the signal varies in a non-stationary and irregular manner [145-147], as illustrated in Fig. 36. Therefore, extracting features to distinguish the signals is challenging for the traditional detectors introduced in Section 2. The complexity-loss theory of aging and disease [148], proposed in 1992, hypothesized that disease and physiological aging are associated with a general loss of complexity in the dynamics of healthy organ system function. In other words, the healthier a system is, the more variation there is in the measured signal. Quantifying the complexity of signals can, therefore, help to detect disease.

Entropy methods [150-159] have been widely used in applications that aim to classify subjects by measuring the complexity of heart rate variability. The healthier a system is, the larger the entropy value. If we define subjects with diseases as outliers and healthy subjects as normalities, the entropy value can be used as an outlier score as such. The smaller the score, the more likely the subject is an outlier.

Fig. 35. Heart rate variability11.

11 https://hrvcourse.com/heart-rate-variability-vs-heart-rate/


Fig. 36. Sub-series of (R-R) heart rate interval time series in healthy subjects (young aged <55 and elderly aged >55) and patient subjects (congestive heart failure and atrial fibrillation) [146, 147, P7]

Given a series, typical entropy methods calculate entropy in three steps: construct sub-series, extract patterns from each sub-series, and calculate entropy based on the frequency of the patterns extracted. Four typical methods exist to construct sub-series from a series, to extract patterns from each sub-series constructed. The first method is to do nothing but treat each value as one sub-series; examples are Shannon entropy [150] and Rényi entropy [151]. The second method is to define template vectors, as in approximate entropy [152], sample entropy [153], permutation entropy [154], bubble entropy [155], and grouped horizontal visibility entropy [158]. The third method is to define delay vectors; examples are SVD entropy [157] and edge permutation entropy [205]. The fourth method is to convert the series into another series, as in spectral entropy [156], multiscale entropy [159], and average entropy [204]. The first method is a special case of the second and third methods, and the second method is a special case of the third method.

Typical entropy methods analyze complexity on the frequency distribution of artificial patterns extracted from the sub-series constructed. Different entropy methods define different artificial patterns. For example, Shannon entropy treats each observation in the series as one pattern. Bubble entropy defines the patterns as the number of swaps to sort values in the sub-series, using bubble sort. Permutation entropy defines patterns as the number of permutations of the ranks of the values in the sub-series. Their exact mathematical definitions can be found in [P7].

In [P7], we propose Attention entropy, which analyzes only key patterns and calculates entropy based on the frequency distribution of the intervals between successive patterns. Attention entropy works in three steps: (1) define key patterns; (2) find the locations of all key patterns in the series and calculate the intervals between two adjacent key patterns; (3) calculate the Shannon entropy of these intervals. The difference in calculation between Attention entropy and other entropy methods is shown in Fig. 37. Attention entropy considers only the key pattern, apple, and how often it appears. From Fig. 37, we can see that a typical frequency-based entropy such as Shannon entropy cannot separate Series 1 and Series 2, as both series have the same frequency distribution of patterns. Attention entropy can separate Series 1 and Series 2, as the distributions of the intervals between occurrences of the key pattern (apple) differ.

Fig. 37. Difference between Attention entropy and other entropy methods. [P7]

If we view each point in a time series as one state of the system, the change of states of the system can be viewed as adjustment to the environment. A complex system is supposed to have a complex process of state change when adapting to the environment. Peak points, defined as points bigger or smaller than both the point before and the point after, can be viewed as the approximate local upper and lower bounds of the state change of a system. We can therefore define peak points as key patterns. In medical terms, this is associated with the adaptive capacity of the system. Patients (outliers) have an adaptive capacity different from that of healthy subjects (normalities). Fig. 38 shows the frequency distribution of Attention entropy values of normalities and outliers. We can see that they have different kernel density estimates.
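A hedged sketch of these three steps with peak points as the key patterns (a simplified reading of [P7], assuming NumPy): locate the peaks, take the intervals between consecutive peaks, and compute the Shannon entropy of the interval distribution:

import numpy as np

def attention_entropy(series):
    x = np.asarray(series, dtype=float)
    # Steps 1-2: key patterns = peak points (strictly above or below both neighbors);
    # their positions give the intervals between adjacent key patterns.
    peaks = [i for i in range(1, len(x) - 1)
             if (x[i] > x[i - 1] and x[i] > x[i + 1]) or (x[i] < x[i - 1] and x[i] < x[i + 1])]
    intervals = np.diff(peaks)
    if len(intervals) == 0:
        return 0.0
    # Step 3: Shannon entropy of the interval frequency distribution.
    _, counts = np.unique(intervals, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())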

Fig. 38. The distribution of Attention entropy values of normalities (healthy subjects) and outliers (patient subjects with heart failure).

5.3 SUMMARY

Traditional outlier detectors, introduced in Section 2, cannot be directly applied to time-series data. In this section, we discussed two case studies on how to detect outliers from time-series data: GPS trajectories and heart-rate variability (HRV).

In [P6], we proposed a method that allows the traditional outlier detectors introduced in Section 2 to be indirectly applied to GPS trajectory data, to detect outliers. The method extracts the start and end points from each trajectory, to form two sets of points. Then any traditional outlier detector introduced in Section 2 can be applied to detect the outlier points in these two sets. Trajectories that contain one of these outlier points are detected as outlier trajectories. Simulated experiments show that this strategy works well in training data but has only a slight effect on the test data, probably because the test data has fewer outliers, compared to the training data.

In [P7], we defined patients with heart failure as outliers, based on their HRV data. The entropy value of the analyzed signal is used as an outlier score. Following the complexity-loss theory of aging and disease [148], an outlier subject is expected to have a smaller entropy value. Based on this, we proposed an entropy method in [P7] called Attention entropy. The experimental results show that it performs significantly better than the competing entropy methods.
