
Anomaly Detection Using Unsupervised Learning Technique

Ansari Mohammed Abdul Aziz

Master's Thesis

School of Computing Computer Science

March 2021


UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu, School of Computing, Computer Science

Mohammed Ansari: Anomaly Detection Using Unsupervised Learning Technique
Master's Thesis, 41 p.

Supervisors of the Master's Thesis: Keijo Haataja, Ph.D., Project Manager, and Taavi Saarelainen, BBA, Team Lead at Critical Infrastructure, ASSA ABLOY Global Solutions

March 2021

Keywords: Anomaly Detection, Clustering, Model, Machine Learning, Unsupervised Learning

Abstract: Anomaly detection is an important aspect of identifying problems that go unnoticed in software applications. An anomaly is data that deviates from normal data. Identifying anomalies in unlabeled data has its own challenges. It helps to identify unknown issues or anomalies in software applications so that preventive actions can eliminate such issues in the future. Early identification of anomalies helps organizations protect themselves from malicious attacks. In software systems, there is a high chance of attackers breaching security and accessing the API server. Anomaly detection helps to identify security threats and suspicious exceptions in software systems. Monitoring time-series data plays an important role in detecting unusual data patterns in software systems. Data analysis using different ML techniques helps to monitor the data efficiently. ML techniques are classified into two categories: supervised and unsupervised learning. As unsupervised learning works appropriately with unlabeled datasets, we apply different unsupervised ML techniques on an unlabeled API dataset and predict the result. The results extracted after applying unsupervised learning analysis on the data give us significant output for identifying correct anomalies and help to backtrack the problem. We examine the application of different unsupervised clustering algorithms on API data and identify their usefulness in different scenarios. We use different features of the API data to train models with unsupervised ML; the features utilized vary with the ML technique used to train the model. In this thesis, we use different unsupervised ML techniques, namely OCSVM, DBSCAN, and K-means, to detect anomalies. Each technique gives significant results in different areas. We use the OCSVM approach to train a model and detect anomalies based on the processing-time feature of the API; this helps us detect outliers in time-series API data and trace the origin or root cause of anomalies with the help of the time-series data. It also helps to monitor data sources that include logs from users, networks, and servers. The clustering-based techniques include DBSCAN and K-means. The DBSCAN clustering technique proves helpful for detecting anomalies in time-series data, whereas K-means gives significant results on non-time-series data.

DBSCAN performs best with categorical data and detects anomalies based upon the density of the clusters created from categorical features. K-means performs well with numeric data, since it uses the Euclidean distance formula to calculate the distance between two data points.


Foreword

The basis of the thesis originated from the need for a software system that helps to detect anomalies in API data. An anomaly detection system eases the maintenance of any modern software application. Anomaly detection using different ML techniques on large API datasets greatly helps developers to find the root cause of unnoticed problems. My passion for productive implementation led me to introduce a solution for an existing software application to detect anomalies and help the developers identify their root cause.

The thesis work was implemented at Abloy Oy in collaboration with the University of Eastern Finland (UEF), Kuopio Campus, DigiCenterNS project (Digiteknologian TKI-ympäristö; Digikeskus-hanke; Hankekoodi: A74338). I would like to extend my gratitude to all the team members, especially the company supervisor Taavi Saarelainen at Abloy Oy and the UEF supervisor Keijo Haataja, for the extensive support provided throughout the thesis work. Thank you all for the consistent guidance and support.


List of abbreviations

API      Application Programming Interface
CPS      Cyber-Physical System
DBSCAN   Density-Based Spatial Clustering of Applications with Noise
ML       Machine Learning
OCSVM    One-Class Support Vector Machine
SVM      Support Vector Machine


Contents

1 Introduction
2 Different Machine Learning (ML) Techniques
   2.1 Unsupervised ML Process for Anomaly Detection
   2.2 API Logs Anomaly Detection
   2.3 Unsupervised ML Techniques for Anomaly Detection
3 Results
   3.1 OCSVM
      3.1.1 Application of OCSVM Model
      3.1.2 Visualization of OCSVM Model Result
   3.2 DBSCAN
      3.2.1 Application of DBSCAN Model
      3.2.2 Visualization of DBSCAN Model Result
      3.2.3 DBSCAN: Yearly Analysis
      3.2.4 DBSCAN: Monthly Analysis
      3.2.5 DBSCAN: Weekday Analysis
      3.2.6 DBSCAN: Day Analysis
      3.2.7 DBSCAN: Day, Weekday, Month, and Year Analysis
   3.3 K-Means Clustering
      3.3.1 Application of K-Means Model
      3.3.2 Visualization of K-Means Model Result
4 Conclusion and Future Work
References


1 Introduction

Machine Learning (ML) is about how computer systems improve through previous experience. ML has shown great progress in the field of research, and its use has shifted from the laboratory to commercial applications. A learning problem is defined as improving some measure of performance at a task through training experience. For example, in credit card fraud detection, the learning is best when the classifier efficiently separates good data from bad data. In this case, bad data is anomaly data that corresponds to "fraud" and good data corresponds to "not fraud" [1]. Thus, we say that ML is a process that predicts results based on previous experience, which makes it useful for detecting fraud in a system. After completing an analysis using a machine-learning technique, the learning is good if the classifier separates anomaly data from normal data efficiently; on the other hand, the learning is bad if the classifier fails to do so.

Anomaly detection is an important data analysis task that detects anomalous or abnormal data in a given dataset [2]. Any data that deviates from normal data is an anomaly. Anomaly detection helps to identify unusual activity patterns in observed data [3].

The API data in our software system is classified into two categories: load balancer logs and application logs. The load balancer logs capture details of requests sent to the load balancer before they are forwarded to the target server. The application logs capture all the API-specific logs generated by the software application. Here, we focus on detecting anomalies in the load balancer logs, since the load balancer API data is larger and the machine-learning process performs more efficiently with a larger amount of data: the more data, the better the performance of the ML process. Anomaly detection is an integral part of any well-managed modern software system. Anomaly detection in the API data of software applications helps us to recognize performance issues, detect unnoticed problems, and identify the root cause of those problems. The API data divides into two categories: time-series data and non-time-series data. In today's modern software applications, an anomaly detection system has become a significant requirement for finding quicker resolutions to problems that occur unnoticed.


It is critical to identify such anomalies to prevent the resulting problems that can occur in the future.

A variety of learning techniques used for different intrusion detection problems are classified into two broad categories: supervised (classification) and unsupervised (anomaly detection and clustering) [3]. In our case, we are dealing with anomaly detection and clustering in a large API dataset. Thus, we apply different unsupervised learning techniques to detect anomalies in the API data. The advantage of unsupervised ML is that it does not require any understanding of the complexities of the target CPS; instead, it builds models solely from data logs that are ordinarily available from historians [4]. Thus, we use unsupervised ML techniques, since we train our model solely on the available API data logs and not on previous experience. In later sections, we discuss the different unsupervised learning techniques applied to the API data to detect anomalies in the dataset.

Scikit-learn provides a rich environment for the use of different ML algorithms [5]. The variety of ML algorithms provided by the scikit-learn Python library offered extensive support for the practical implementation of the anomaly detection system throughout the thesis work.

Section 2 discusses different ML techniques. Results of our experiments are explained in Section 3. Finally, Section 4 concludes the Thesis and sketches some future research work ideas.


2 Different Machine Learning (ML) Techniques

Anomaly detection using different ML techniques on an API dataset plays an important role in identifying abnormal data and backtracking the root cause of the problem. An ML model can be trained in either a supervised (inherent) or an unsupervised (unambiguous) way to classify the patterns or new information in the dataset [6]. Thus, we apply different ML techniques to our API dataset to detect anomalies and backtrack the root cause of any anomaly that occurred in the software system. Accurate anomaly detection leads to better performance of the software system in the future. The two basic categories of ML techniques for anomaly detection are as follows:

1 Supervised Learning: In the supervised learning process, a known label corresponds to each data point in the dataset. Each data point represents a set of features, which can be categorical or continuous. The model is trained on instances with known labels in the dataset; the data used to train the model is the training set [7]. Thus, the supervised learning technique uses a set of rules to predict anomalies in data. This set of rules consists of the class labels assigned to each data point in the training dataset, and supervised learning performs classification based upon these class labels. Class labels are also known as desired outputs. Given a sample of data and the desired output, the main objective of the supervised learning technique is to approximate a function that maps inputs to outputs. Thus, the accuracy of the result depends highly upon the model trained using the sample data and desired output. Figure 1 shows an example classification based upon the desired output using a supervised ML technique.


Figure 1. Classification by supervised learning.

2 Unsupervised learning: In unsupervised learning, the machine receives inputs x_1, x_2, x_3, ..., but obtains no desired output or reward from the environment. The machine does not have any knowledge about future inputs when learning the data. The unsupervised learning technique creates a representation of the input that helps in decision-making or in predicting future inputs. The two basic examples of unsupervised learning are clustering and dimensionality reduction. Unsupervised learning helps in detecting patterns in data, and any unusual pattern detected is noise [8]. Thus, the unsupervised learning technique learns from the data itself and significantly helps to detect noise in a large dataset. Noise is an anomaly in the dataset. In the supervised learning technique, we need to pass a desired output parameter during the learning process; in unsupervised learning, model training does not depend on a desired output or rewards gained from the environment. In our case, we will see in upcoming sections how unsupervised learning techniques using decision-making and clustering approaches give significant results for anomaly detection. Figure 2 shows a sequence diagram of an unsupervised learning process performing decision-making or clustering.

Figure 2. Decision-making/clustering by unsupervised learning approach.

The stages involved in the unsupervised learning approach are the following:

a) Input Raw Data: The input raw data consists of a mixture of N categories. At this stage, no processing of the data takes place and the size of the data is large, because it still consists of all of the features; feature selection takes place in the next stage. The result of applying an unsupervised learning process to the raw data is segregated categories, which is also known as clustering the data into N categories.

b) Feature Selection and Transformation: This stage involves selecting particular features from the entire dataset and transforming the dataset into the required format. For example, a time feature that consists of a date, month, and year needs to be converted into a different format. We extracted the features from the dataset and performed the transformation afterwards to train the model using an unsupervised learning algorithm. We will see in upcoming sections that transformation is also required for visualizing the result.

c) Train Model: At this stage, we applied an unsupervised learning algorithm to the raw dataset to train the model. In this stage, we see different unsupervised learning algorithms used to perform clustering and detect noise in the dataset. We selected the appropriate unsupervised learning algorithm according to the type of features used to train the model. We conclude that the DBSCAN algorithm fits well with categorical features for clustering the data, whereas the OCSVM and K-means algorithms fit well with numerical data. We also conclude that OCSVM and DBSCAN give useful results on time-series data. While training the model, we do not pass any desired output to the model; in unsupervised learning, the model trains solely on the input raw data.

d) Result: After completion of the unsupervised learning process, we obtain different clusters or categories from the data. Each cluster or category consists of data with similar features. The cluster with the least density is noise, i.e., an anomaly in the data; furthermore, data that does not fit a normal pattern is an anomaly or noise. The result is most useful for end-users when visualized with graphs, which lets us quickly find the exact root cause of the anomaly that occurred. A minimal end-to-end sketch of these four stages follows this list.
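To make these stages concrete, the following minimal sketch walks through stages a)-d) with scikit-learn. It is an illustration under our own assumptions (synthetic two-feature data standing in for real API logs), not the thesis implementation itself:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# a) Input raw data: synthetic two-feature data with a few far-away points
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
unusual = rng.normal(loc=8.0, scale=0.5, size=(5, 2))
raw = np.vstack([normal, unusual])

# b) Feature selection and transformation: scale the features to a common range
scaled = MinMaxScaler().fit_transform(raw)

# c) Train model: no desired output is passed, only the input data
labels = KMeans(n_clusters=2, n_init=10).fit_predict(scaled)

# d) Result: the least dense cluster is the candidate anomaly group
counts = np.bincount(labels)
print("cluster sizes:", counts, "-> least dense cluster:", counts.argmin())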

Thus, after studying both learning approaches (supervised and unsupervised), we observed that unsupervised learning works better with a large dataset and gives significant results for anomaly detection. Unsupervised learning is better at detecting noise in the dataset and does not need any external factor, such as a desired output, to train the model. It learns from the raw data itself and creates a representation of the input data that helps to detect unusual patterns in future inputs.

Section 2.1 discusses the unsupervised ML process for anomaly detection. The data source (API logs) is explained in Section 2.2. Section 2.3 presents different unsupervised ML techniques for anomaly detection.


2.1 Unsupervised ML Process for Anomaly Detection

The basic stages of anomaly detection techniques are quite similar to each other. These stages include parameterization, training, and detection. Parameterization means collecting raw data from a monitored environment; the raw data should be representative of the system to be modeled. The training stage involves the creation of a model based on the raw data, which the system uses to detect anomalies. The behavior of the model depends on the ML technique used [9]. Every unsupervised machine-learning process involves the basic stages of extracting raw data, data preparation, data transformation, training the model, and predicting results. We created transformed data from the raw API data so that the machine-learning model produces the desired results. In the training stage, using a particular ML technique, we created a model based on the raw API data. Finally, prediction takes place based on the trained model, which gives the result. We visualized the result of the learning process on a graph. In the upcoming sections, we explain the application of the different unsupervised ML techniques in detail.

2.2 API Logs Anomaly Detection

The API logs consist of different features that support anomaly detection in software systems using a variety of unsupervised learning models. We developed an application to retrieve these logs, process the data, and visualize the anomaly results from the API data with the help of different graph tools. Matplotlib is a portable 2D plotting and imaging package aimed primarily at the visualization of scientific, engineering, and financial data. Matplotlib helps to produce interactive graphs from the Python shell and can be called from Python scripts [10]. In our software system, we generated the different visualization graphs using the widely used matplotlib library. We also used the seaborn Python library, which internally uses the matplotlib library [11]; thus, seaborn depends on matplotlib.

The API data is time-series data. Thus, we also see the utilization of time-series data for anomaly detection in Section 3.1.1. There are several features obtained from the API logs. The major features extracted and transformed into an acceptable format for training models in unsupervised learning are the following:

• target_processing_time: The total elapsed time from when the request is sent to the target until the target starts processing the response.

• target_status_code: The status code value of the response received from the target.

• received_bytes: The size of the request received in bytes by the target from a client.

• time: The time when the response is generated for the client.
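As a hedged illustration of working with these features (the file name 'api_logs.csv' is hypothetical and assumes the logs have already been parsed into one column per field), they can be selected and inspected with pandas:

import pandas as pd

# Assumption: logs already exported to CSV with one column per field
logs = pd.read_csv('api_logs.csv')
features = logs[['target_processing_time', 'target_status_code',
                 'received_bytes', 'time']]
print(features.describe(include='all'))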

2.3 Unsupervised ML Techniques for Anomaly Detection

There are two broad unsupervised ML techniques applied for anomaly detection: small-percentage outlier detection and clustering. Which technique is applied to the raw API data depends upon the features of the data used. In the upcoming sections, we study the different unsupervised ML algorithms applied to our raw API data under different scenarios. We check whether DBSCAN clustering fits well with time-series categorical data and gives useful results for anomaly detection. Furthermore, we use the K-means clustering approach to analyze numerical features of the dataset. Figure 3 gives an overview of the different types of features in the data and the unsupervised ML algorithm applied for each.


Figure 3. Data hierarchy with the application of unsupervised ML algorithm.
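The selection logic summarized in Figure 3 can be sketched as follows; this helper and its names are our own illustration, not part of the thesis implementation:

def pick_algorithm(feature_type: str, time_series: bool) -> str:
    # Categorical features -> density clustering of category combinations
    if feature_type == 'categorical':
        return 'DBSCAN'
    # Numerical time-series features -> outlier detection over time
    if time_series:
        return 'OCSVM'
    # Numerical non-time-series features -> distance-based clustering
    return 'K-means'

print(pick_algorithm('numerical', time_series=False))  # -> K-means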


3 Results

In Sections 3.1-3.3, we go through a detailed explanation and the results of each unsupervised learning algorithm applied for anomaly detection and check how it works with different types of data.

3.1 OCSVM

The OCSVM is a specialized variant of the Support Vector Machine (SVM) geared toward anomaly detection. The SVM proposed by Vapnik [9] maps input vectors into a high-dimensional feature space and then obtains a separating hyperplane, or decision boundary, determined by support vectors. Positive values denote normal data and negative values represent abnormal data. OCSVM learns a hyperplane in a reproducing kernel Hilbert space that separates all the data points from the origin and maximizes the distance from this hyperplane to the origin. The origin represents a negative labeled instance, and all the other data points away from the origin represent positive labeled instances [10]. The hyperplane thus segregates the data into normal and abnormal data; the abnormal data, also known as anomalies, deviate from the normal pattern of the data. In Sections 3.1.1-3.1.2, we describe how OCSVM was utilized in our case to detect anomalies in login API data.
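Before turning to our data, a minimal sketch on synthetic values illustrates the +1/-1 output convention described above (all numbers here are illustrative assumptions):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(0.2, 0.05, size=(100, 1))   # mostly "normal" values
test = np.array([[0.21], [0.95]])              # one typical, one extreme value

model = OneClassSVM(nu=0.05, kernel='rbf', gamma='scale').fit(train)
print(model.predict(test))                     # +1 = normal, -1 = anomaly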

3.1.1 Application of OCSVM Model

We used the scikit-learn Python library for implementing the OCSVM approach. The OCSVM implementation in scikit-learn belongs to unsupervised outlier detection. OCSVM performs very well in the case of novelty detection, which is useful when outliers do not pollute the training data and we are interested in detecting unusual patterns in new data [5]. Thus, we used the scikit-learn OCSVM to detect unusual patterns in our time-series API data and produced significant results with it.

There are several parameters available for the OCSVM library function. The list of parameters used for training the OCSVM model is as follows [5]:


• nu: The parameter type is float and the default value is 0.5. It is an upper bound on the fraction of training errors and thus controls the proportion of outliers expected in the data. The parameter is optional; if not set, the default value 0.5 is used.

• kernel: Specifies the kernel type provided to the algorithm. It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a callable. By default, the value is set to 'rbf'.

• gamma: The gamma value must be 'scale' or 'auto'. It is the kernel coefficient for 'rbf', 'poly', and 'sigmoid'. If no value is set, 'scale' is used by default. For gamma='scale', the kernel coefficient is calculated as 1 / (n_features * X.var()), and for gamma='auto' it is calculated as 1 / n_features.
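As a quick, hedged check of the nu parameter's role (on synthetic data of our own choosing), the fraction of points flagged as outliers roughly tracks the nu value:

import numpy as np
from sklearn.svm import OneClassSVM

data = np.random.default_rng(1).normal(size=(1000, 1))
for nu in (0.01, 0.05, 0.10):
    preds = OneClassSVM(nu=nu, kernel='rbf', gamma='scale').fit_predict(data)
    print(nu, "->", (preds == -1).mean())      # flagged fraction is close to nu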

We applied the OCSVM approach to our API data by using the scikit-learn library.

Figure 4 shows a high-level implementation of the OCSVM approach on our API data.

Figure 4. OCSVM sequence block diagram.

As depicted in Figure 4, there are five stages in the OCSVM anomaly detection implementation:

1. Extract API Data: This stage includes importing the data from a specific location into the code for the subsequent data preparation. The pandas Python library offers a two-dimensional, size-mutable, potentially heterogeneous tabular data structure; it is a dictionary-like container of Series objects [12]. We used the pandas library to load the data into a DataFrame object and processed it in the data preparation stage. We imported the pandas library with the following code snippet:


import pandas as pd

We imported the API data with the following code snippet and converted it into a DataFrame object:

df = pd.read_csv('path_to_data_file')

OCSVMs are commonly used to separate one specific class, i.e., the target class, from other data. The OCSVM model is trained with data that contains far more positive data points than negative ones [13]. Thus, the OCSVM model fits our API data well, since it contains only a small amount of outliers, i.e., abnormal data.

2. Data Preparation: The data preparation stage involves selecting only the required features of the data for model training. In our case, we selected 'target_processing_time' and 'time' for the learning process. In a later stage, the 'time' feature is used for visualizing the time-series data on a graph. Below is the code snippet for selecting the required feature for model training:

data = df[['target_processing_time']]

3. Data Transformation: The data transformation phase includes converting the raw data into a format acceptable for the learning process and visualization. A standard date format is compatible with the pyplot visualization interface provided by matplotlib. We imported the 'datetime' library and converted the 'time' feature into the standard date format '%Y-%m-%d %H:%M:%S' using the following code:

import datetime

df['datetime'] = [datetime.datetime.strptime(d, "%Y-%m-%d %H:%M:%S") for d in df['time']]

4. Train Model: This stage involves training the model with the OCSVM methodology using the transformed data. We created an instance of the model by passing the necessary parameters to the OneClassSVM class provided by the scikit-learn library. The parameters passed to the class are 'nu', 'kernel', and 'gamma'. After creating the instance of the model and storing it in a variable, we called the fit() function of the created instance to train the model:

from sklearn.svm import OneClassSVM

model = OneClassSVM(nu=0.01, kernel="rbf", gamma=0.01)
model.fit(data)

5. Predict Result: After training the model, we predicted the outcome by applying the trained model to the input data. The result identifies the anomaly data found in the entire dataset, i.e., the data points that deviate from the normal pattern of the time-series data. The following code performs the prediction:

model.predict(data)

The predicted result was stored in a new DataFrame object 'df_anomaly' for visualizing the anomalies on a graph. We used the following code for storing the anomalies in the new DataFrame object:

df['anomaly'] = pd.Series(model.predict(data))
df_anomaly = df.loc[df['anomaly'] == -1, ['datetime', 'target_processing_time']]

In the above code snippet, we used the df.loc[] accessor to select all rows from the entire dataset whose anomaly value is '-1'.

3.1.2 Visualization of OCSVM Model Result

We visualized the anomaly result from the OCSVM model using the matplotlib library. Visualizing the anomaly result obtained from the time-series API data as a graph helps the user understand when the anomaly occurred, which makes it easier to identify and backtrack the anomaly. The following code snippet generates a scatter plot, coloring the anomaly data points red and the normal data points blue:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(18, 6))
ax.plot(df['datetime'], df['target_processing_time'], color='blue', label='Normal')
ax.scatter(df_anomaly['datetime'], df_anomaly['target_processing_time'], color='red', label='Anomaly')
ax.set(xlabel='Datetime', ylabel='Processing time', title='Anomaly detection for login api')
plt.legend()
plt.show()

Figure 5 illustrates the generated graph, in which the data points marked in red are anomalies. The graph shows the analysis from July to November 2020, plotting the 'datetime' feature on the X-axis and the 'target_processing_time' feature on the Y-axis. From the graph, we see that the extreme data points at both ends are marked as anomalies, because they deviate from the normal pattern of the time-series data. This analysis gives significant output, because there is a high chance of problems occurring with API calls that have the lowest or highest processing times. API calls with higher processing times may indicate security breach attempts or threats, visible through higher server utilization, whereas API calls with the lowest processing times may indicate an improper request to the API. Backtracking the anomaly data points from the graph helps us to find the root cause of the issue.


Figure 5. OCSVM result.

It is possible to save the anomaly data using the following code snippet for backtracking the exact issue and finding the source of the problem:

df[df['anomaly'] == -1].to_csv(r'anomaly_data')

The saved anomaly data showed that all the API calls with processing times of 0.003 seconds have status code 400 or 406. Status code 400 indicates a 'bad request', which occurs due to out-of-range values or missing data in the request. Status code 406 indicates a 'not acceptable' request, which occurs when the requested media type is not supported [14]. The retrieved anomaly data also shows that the API calls with the highest processing times have high sent-byte counts, i.e., processing a large response from the target to the client resulted in a higher processing time for the API call. Thus, API calls with the lowest or highest processing times deviate from the other API calls and are appropriately marked as anomalies by the OCSVM technique.
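A quick way to summarize the saved anomalies is to count them per status code and inspect the processing-time extremes; this snippet is a hedged sketch that assumes the DataFrame still carries the status code column:

anomalies = df[df['anomaly'] == -1]
print(anomalies['target_status_code'].value_counts())          # e.g. counts of 400/406
print(anomalies['target_processing_time'].agg(['min', 'max']))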

3.2 DBSCAN

The DBSCAN model uses simple minimum density level estimation, based on a threshold for the number of neighbors (minimum points) within a given radius. Core points are all the objects with more than the minimum number of neighbors within the radius; DBSCAN does not perform density estimation between points. All neighbors within the radius of a core point are considered part of the same cluster as that core point; such points are said to be directly density-reachable. If any of these neighbor points are themselves core points, their neighbors are transitively included. Points that do not belong to any cluster are noise; these points are not density-reachable. Non-core points on the edge of a cluster are border points, and they are density-connected [15]. Thus, border points have fewer than the minimum number of points within their radius, while core points have more, and objects with no cluster points within the radius at all are noise. Noise lies far away from both core points and border points.
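A minimal sketch on hand-picked points illustrates this behavior with scikit-learn's DBSCAN, where eps is the radius and min_samples is the minimum neighbor count for a core point (the values below are illustrative):

import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                   [5.0, 5.0]])                 # the last point is isolated
labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(points)
print(labels)                                   # [0 0 0 0 -1]: -1 marks noise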

Figure 6 shows an illustration of the DBSCAN cluster model. The objects marked in red are core points; here, the minimum number of points required within the radius of a core point is four. If an object has fewer than four neighbor points within its radius, it is a border point; all the border points are marked in yellow in Figure 6. The noise objects, marked in blue, reside far away from the core points and border points and do not include any core or border points within their radius.

Figure 6. Illustration of DBSCAN cluster model. [15]


3.2.1 Application of DBSCAN Model

We applied the DBSCAN technique to the API data, and the resulting anomaly output proved to be very significant. The process of applying the DBSCAN technique in our case took place in several stages. Figure 7 shows an overview of the stages involved in the entire process.

Figure 7. Overview of DBSCAN approach.

As Figure 7 illustrates, the stages involved in the DBSCAN approach are as follows:

1. Pre-Processing: The pre-processing stage involves data retrieval, data preparation, and data transformation. Data retrieval is fetching the data from a location into the software system. Data preparation is selecting the required fields as features to train a model using the DBSCAN technique. Data transformation involves converting the data into a format acceptable to the application for the subsequent stages of the process. We selected the 'time' feature using the pandas library and converted it into the standard date format '%Y-%m-%d %H:%M:%S'. After converting to a standard date format, we extracted the weekday, day, month, and year from the time. We also selected 'target_status_code' as a feature for training the DBSCAN model. We kept a backup copy of the original data by using the copy() function provided by the pandas DataFrame. The following code performs the pre-processing stage:

import calendar

df = pd.read_csv('path_to_data_file')
df = df[['target_status_code', 'time']]
df['datetime'] = [datetime.datetime.strptime(d, "%Y-%m-%d %H:%M:%S") for d in df['time']]
df['weekday'] = [calendar.day_name[d.weekday()] for d in df['datetime']]
df['day'] = [d.day for d in df['datetime']]
df['month'] = [d.month for d in df['datetime']]
df['year'] = [d.year for d in df['datetime']]
df_copy = df.copy()

2. Label Encoding: The label encoding process encodes the target label with values between 0 and n_classes-1 [5], where n_classes is the number of categories found within the label. We encoded the weekdays Monday to Sunday with values 0 to 6, since there are seven different categories within the 'weekday' label. The same process applies to 'day', 'month', and 'year'. We import the LabelEncoder class provided by sklearn and use its fit_transform() function to label-encode the data:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['weekday'] = le.fit_transform(df['weekday'])
df['day'] = le.fit_transform(df['day'])
df['month'] = le.fit_transform(df['month'])
df['year'] = le.fit_transform(df['year'])


3. OneHotEncoding: At this stage, we used the OneHotEncoding technique to convert the categorical values into dummy indicators [16]. An indicator denotes whether the categorical value is true or not: the value '0' indicates false and the value '1' indicates true. The OneHotEncoding process splits the different categories within a particular label into separate columns and assigns them indicator values; Figure 7 shows an example of OneHotEncoding at the bottom right corner. We used the get_dummies() function provided by the pandas library, which performs the OneHotEncoding process. The following code snippet applies the OneHotEncoding technique to the data and splits the different categories within the label 'target_status_code' into separate columns with assigned indicators:

df = df.join(pd.get_dummies(df['target_status_code'], prefix='status_code'))

4. Train the DBSCAN Model and Predict Results: After the OneHotEncoding process, we performed two steps: training and prediction. The DBSCAN technique finds core samples of high density and expands clusters from them, so it is well suited to data that contains clusters of similar density. We used the fit_predict() function of the DBSCAN class provided by sklearn to train the DBSCAN model and predict the results. The following code snippet performs the training and prediction for the DBSCAN technique:

from sklearn.cluster import DBSCAN

dbscan = DBSCAN()
model = dbscan.fit_predict(df[['status_code_200', 'status_code_400',
                               'status_code_401', 'status_code_405',
                               'status_code_406', 'status_code_415', 'year']])

In this code, we passed different combinations of features, such as weekday, day, month, and year, to the model. In the upcoming sections, we display the interesting results produced after passing these different feature combinations to the model; a hedged convenience wrapper for doing so is sketched after this list.


5. Result Processing: Based on the density levels, the DBSCAN model assigns a category number to each data point in the dataset, with anomalies marked as category '-1'. We created a color map that assigns a unique color to each category and black to the '-1' category. The visualization graph uses this color map as one of the parameters when displaying the scatter plot.

We used the following code snippet to create the color map for the different categories:

import random

df_model = pd.DataFrame({'categories': model})
df_copy = df_copy.join(df_model)
df_copy['categories'] = df_copy['categories'].map(str)
category_colors = ["#" + ''.join([random.choice('0123456789ABCDEF') for j in range(6)])
                   for i in range(df_copy['categories'].nunique())]
category_list = df_copy['categories'].unique()
colors = {category_list[index]: color for index, color in enumerate(category_colors)}
colors['-1'] = 'black'
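For convenience, the retraining for the different feature combinations used in Sections 3.2.3-3.2.7 can be wrapped in a small helper; this wrapper is our own hedged sketch, not part of the thesis code:

STATUS_COLUMNS = ['status_code_200', 'status_code_400', 'status_code_401',
                  'status_code_405', 'status_code_406', 'status_code_415']

def run_dbscan(df, time_features):
    # Cluster on the one-hot status codes plus the chosen time features
    return DBSCAN().fit_predict(df[STATUS_COLUMNS + list(time_features)])

model = run_dbscan(df, ['year'])                              # yearly analysis
# model = run_dbscan(df, ['month'])                           # monthly analysis
# model = run_dbscan(df, ['day', 'weekday', 'month', 'year']) # daily analysis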

3.2.2 Visualization of DBSCAN Model Result

For visualizing the DBSCAN model's anomaly prediction result, we used a combination of the matplotlib and seaborn libraries (seaborn internally uses matplotlib). We passed the color map created during the result processing stage as the 'palette' parameter to the scatterplot() function provided by seaborn. We removed the extra space in the graph by setting the y-axis limits with the set_ylim() function and by setting the parameters ncols=1 and nrows=2 in the subplots() function. The graph visualizes the time-series data by plotting the 'time' feature on the X-axis and the 'target_status_code' feature on the Y-axis. The following code snippet generates the scatter plot graph and displays the DBSCAN result:


import matplotlib.pyplot as plt
import seaborn

fig, (ax_top, ax_bottom) = plt.subplots(ncols=1, nrows=2, sharex=True, figsize=(14, 9))
ax = seaborn.scatterplot(data=df_copy, x="datetime", y="target_status_code",
                         hue="categories", palette=colors, linewidth=0, ax=ax_top)
ax = seaborn.scatterplot(data=df_copy, x="datetime", y="target_status_code",
                         hue="categories", palette=colors, linewidth=0, ax=ax_bottom)

# Limit each axis to one band of status codes to create a broken y-axis
ax_top.set_ylim(bottom=390)
ax_bottom.set_ylim(170, 230)
seaborn.despine(ax=ax_bottom)
seaborn.despine(ax=ax_top, bottom=True)

# Draw diagonal marks indicating the axis break
ax = ax_top
d = .015
kwargs = dict(transform=ax.transAxes, color='k', clip_on=False)
ax.plot((-d, +d), (-d, +d), **kwargs)
ax2 = ax_bottom
kwargs.update(transform=ax2.transAxes)
ax2.plot((-d, +d), (1 - d, 1 + d), **kwargs)

ax_top.legend(title='Categories', bbox_to_anchor=(1.09, 1), borderaxespad=0)
ax_top.set(title="Anomaly detection on year basis")
ax_bottom.legend_.remove()
plt.xticks(rotation=45)
plt.show()

In the upcoming sections, we see the different scatter plot graphs generated with this code for the different combinations of features used to train the DBSCAN model.

3.2.3 DBSCAN: Yearly Analysis

Let us check the description of each status code before studying the different visualization graphs. Table 1 explains the meaning of each status code.


Status code:  Status name:             Description:
200           Accepted/OK              The response has been processed successfully.
400           Bad Request              Invalid request, malformed request, required data missing, or syntax errors in expressions passed in the request.
401           Unauthorized             The request requires client authorization or the passed credentials are incorrect.
405           Method Not Allowed       The request is known to the server, but not supported by the target resource.
406           Not Acceptable           Requested media type not supported.
415           Unsupported Media Type   The entity passed in the request was not in a format supported by the server.

Table 1. Status codes with description [14, 16].

We generated the graph shown in Figure 8 by passing the 'target_status_code' feature and the 'year' feature extracted from the 'time' data to train the DBSCAN model. Based on Figure 8, we can say that the DBSCAN approach detects anomalies on a yearly basis. Considering the entire dataset, i.e., the data for the year 2020, the graph shows two anomalies detected between July and August; the number of anomalies identified on a yearly basis is small. These anomalies, marked in black, do not fit into any of the other clusters. With n_clusters as the total number of clusters identified in the DBSCAN analysis, the category number '-1' denotes anomalies and the other category numbers assigned to clusters run from '0' to 'n_clusters-1'. The cluster with category '0' contains the highest number of data points, all with target status code '200', which denotes a successful API response. Thus, over the one-year time span there were mostly successful responses from the API. The marked anomalies have status code '415', which indicates that the entity passed in the request is not supported by the API, i.e., the server is unable to understand the entity found in the request [14]. Thus, responses with status code '415' are unusual, and it is advisable to backtrack the anomaly data and identify the cause of the problem that occurred.


Figure 8. DBSCAN result on a yearly basis.

Table 2 shows the resulting anomalies from the DBSCAN approach. On a yearly basis, we observed anomalies in the months of July and August with status code '415'.

Month:    target_status_code of anomaly data:
July      415
August    415

Table 2. Anomalies identified on a yearly basis.

3.2.4 DBSCAN: Monthly Analysis

We generated the graph shown in Figure 9 by passing the 'target_status_code' feature and the 'month' feature extracted from the 'time' data to train the DBSCAN model. We performed the DBSCAN analysis on the dataset from July to November 2020. Based on Figure 9, we can say that the DBSCAN approach detects anomalies on a monthly basis. A total of 14 different clusters/categories were identified, excluding the anomaly category; all the data points displayed in black reside in the anomaly category, i.e., '-1'. The algorithm detects anomalies considering every month of all the years. Based on Figure 9, we observed that one cluster/category consists of all the data points with similar status codes within a month. For example, category '0' consists of all the data points with status code '200' for the month of July. Similarly, all the other data points with status code '200' fall into different clusters according to their month. The same approach to forming clusters applies to the data points with status codes 400, 401, 405, 406, and 415.

Figure 9. DBSCAN result on a monthly basis.

Table 3 gives an overview of all the anomalies identified on a monthly basis for the year 2020. As Table 3 shows, the DBSCAN algorithm marks the July data with status code '415' as an anomaly, since it has the least density compared to data with other status codes; this noise does not resemble the data in any of the clusters and is therefore not included in any of them. For the month of August, the anomaly data points have status codes '415', '400', and '406', which have the least density and are not included in any of the other clusters or categories. Similarly, for the months of September and October, data with status code '406' is marked as an anomaly.


Month:      target_status_code of anomaly data:
July        415
August      415, 400, 406
September   406
October     406

Table 3. Anomalies identified on a monthly basis.

3.2.5 DBSCAN: Weekday Analysis

We generated the graph shown in Figure 10 by passing the 'target_status_code' feature and the 'weekday' feature extracted from the 'time' data to train the DBSCAN model. We observed from Figure 10 that the DBSCAN approach detects anomalies on a weekday basis. A total of 19 different clusters/categories were identified, excluding the anomaly category. Category '-1' consists of all the anomaly data, marked in black on the graph. The algorithm detects anomalies considering every weekday of all the years.

Figure 10. DBSCAN result on a weekday basis.

According to Figure 10, a cluster/category consists of all the data points with similar status codes on a weekday basis. For example, category '3' consists of all the data points with status code '200' for the weekday 'Wednesday'. Similarly, all the other data points with status code '200' fall into different clusters based on the weekday. The same approach to forming clusters applies to the data points with status codes 400, 401, 405, 406, and 415.

Table 4 gives an overview of all the anomalies identified on a weekday basis for the year 2020, listing the status codes of the data marked as anomalies. The DBSCAN algorithm marks the data with the least density on each weekday as an anomaly. Based on Table 4, we observed that for all data falling on a Tuesday, the algorithm marks data with status codes '415' and '400' as anomalies, since they have the least density compared to data with other status codes. The same approach works for every weekday. No anomalies were identified on Mondays and Saturdays. For all Wednesdays, data with status code '405' is marked as an anomaly; for the API data recorded on Thursdays, status codes '415' and '406' are marked as anomalies; on Fridays, API data with status code '400' is marked as an anomaly; and on Sundays, data with status code '405' is marked as an anomaly. Thus, across all weekdays, the data with status code '200' appears at the highest density level and represents normal data.

Weekday:    target_status_code of anomaly data:
Monday      -
Tuesday     415, 400
Wednesday   405
Thursday    415, 406
Friday      400
Saturday    -
Sunday      405

Table 4. Anomalies identified on a weekday basis.


3.2.6 DBSCAN: Day Analysis

We generated the graph shown in Figure 11 by passing the 'target_status_code' feature and the 'day' feature extracted from the 'time' data to train the DBSCAN model. We observed that the number of anomalies increases compared to the graphs discussed earlier.

Figure 11. DBSCAN result on day basis.

A total of 62 clusters were formed, excluding the anomaly category '-1'; there are more clusters in Figure 11 than in the previously generated analysis graphs. The DBSCAN algorithm detects anomalies based on each day of every month in a year. Each cluster consists of data grouped by similar status codes, considering a given day across all months and years, and the data with the status code of least density on a given day is marked as an anomaly. For example, the prediction result forms clusters and detects anomalies based on every 7th day of all the months in a year, and the algorithm works the same way for all the other days of the month. Each category/cluster represents data with similar status codes, except for the anomaly category; for example, category '0' consists of data with status code '200' for every 7th day of all the months.


In the same way, the result shows the other categories created based on status code similarity.

Table 5 shows the anomaly results for 5 days across all the months in a year; we obtained similar anomaly results for all the other days. Table 5 lists the status codes marked as anomalies for each corresponding day. Based on each day of all the months, the DBSCAN algorithm marks the data with the least density as an anomaly. We observed from Table 5 that for every 7th day of all the months, data with status code '405' is marked as an anomaly, since it has the least density compared to data with other status codes. The processing works in a similar way for the 8th, 9th, 10th, and 11th days of all the months. On a day basis across all months, data with status code '200' has the highest density compared to the other status codes, so we can say that such data is normal data.

Day:   target_status_code of anomaly data:
7      405
8      406
9      405, 406
10     400
11     400, 405

Table 5. Anomalies identified on a day basis (5 days).

3.2.7 DBSCAN: Day, Weekday, Month, and Year Analysis

We generated the graph shown in Figure 12 by passing the 'target_status_code', 'day', 'weekday', 'month', and 'year' features extracted from the 'time' data to train the DBSCAN model. We observed from Figure 12 that the number of anomalies detected is the highest compared to the graphs discussed earlier. A total of 132 clusters were formed in the analysis, excluding the anomaly category '-1'. The DBSCAN algorithm detects anomalies for each day of every month in every year; in other words, it performs anomaly detection on a daily basis. The clusters consist of data grouped by similar status codes for each day of every month in every year, and the clusters with the least density of data are marked as anomalies.


Each category/cluster represents data with similar status codes found in the daily analysis. For example, category '1' consists of data with status code '200' for the 8th day of July in the year 2020. In the same way, the other categories represent data with a particular status code observed on a daily basis.

Figure 12. DBSCAN result on day, weekday, month, and year basis.

Based on Table 6, we know that on the 7th day of July 2020 the data with status codes '401' and '405' is marked as an anomaly. Similarly, on the 10th day of July 2020, the anomaly data has status code '400'. The anomaly detection works in a similar way for the other days, i.e., the 13th, 14th, and 19th of July 2020. The DBSCAN algorithm marks the data with the least density as an anomaly for each day of every month in each year.

Day:   Month:   Year:   target_status_code of anomaly data:
7      7        2020    401, 405
10     7        2020    400
13     7        2020    401
14     7        2020    401, 405
19     7        2020    401

Table 6. Anomalies identified on a daily basis (5 days).


3.3 K-Means Clustering

The K-means clustering technique groups objects into K disjoint clusters based on their feature values; objects that reside in the same cluster have similar feature values. The positive integer K defines the number of clusters and has to be provided prior to model training [17]. Thus, the K-means algorithm clusters the data based on feature similarity and the predefined number of clusters.

K-means is one of the simplest solutions to clustering problems. It is an unsupervised ML technique that classifies a given dataset into a predefined number of clusters. We define K centroids, one for each cluster, initially placed far away from each other for better results. The next step is to associate the points of the dataset with their nearest centroids. When no point remains to be associated with a nearest centroid, an initial set of groups is formed. Then the new locations of the K centroids are calculated from the newly created clusters/groups, and the points are again associated with the new centroid locations. This process continues until the centroid locations are fixed and no longer change [16]. Thus, the distance between the data points and the centroids is recalculated every time the centroid locations move; it is an ongoing process until the centroids find their fixed locations.

The K-means clustering algorithm has the following steps [17]:

1. Define the number of K clusters.

2. Initialize centroids for K-clusters. This can be achieved by dividing the dataset into K clusters and computing the centroids for each cluster.

3. Iterate over all the data points and calculate the distance from the data point to the centroids of all clusters. Assign all the data points to their nearest respective clusters.

4. Calculate the location of centroids for each cluster again.

5. Repeat steps 3 and 4 until the centroids no longer change their location.


A formula is required to calculate the distance between the data points and the centroids. The most common choice (see Formula 1) is the Euclidean distance [17]:

d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}    (1)

where x = (x_1, x_2, ..., x_m) and y = (y_1, y_2, ..., y_m) are two input vectors with m quantitative features [17]. As the distance is calculated using quantitative features, K-means clustering works best with numerical data, not categorical data. Thus, we used the K-means clustering algorithm for anomaly detection on the numerical features of our dataset.
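The following minimal sketch implements Formula 1 and one assignment/update iteration of the algorithm on toy one-dimensional data; the values are illustrative assumptions, not from our dataset:

import numpy as np

def euclidean(x, y):
    # Formula 1: d(x, y) = sqrt(sum over i of (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

points = np.array([[1.0], [1.2], [0.9], [8.0], [8.3]])
centroids = np.array([[0.0], [10.0]])           # k = 2 initial centroids

# Step 3: assign each point to its nearest centroid
assign = np.array([np.argmin([euclidean(p, c) for c in centroids])
                   for p in points])
# Step 4: recompute each centroid as the mean of its assigned points
centroids = np.array([points[assign == k].mean(axis=0) for k in range(2)])
print(assign, centroids.ravel())                # [0 0 0 1 1] [1.033... 8.15]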

Figure 13 gives an overview of the K-means clustering technique with k = 4 predefined clusters. Initially, we calculated the centroid locations for the initial partitions of the dataset. From Figure 13, we observe that after performing the K-means analysis there are four different clusters, each marked with its own color and containing its own centroid.

Figure 13. K-Means Clustering overview.

In Section 3.3.1 we utilize K-means clustering to identify anomalies using numerical features in our dataset. In Section 3.3.2 we visualize the achieved results.


3.3.1 Application of K-Means Model

We used the scikit-learn KMeans implementation to perform the K-means clustering algorithm on the API dataset. The function used for the K-means analysis takes the parameter 'n_clusters', which indicates the number of clusters to form as well as the number of centroids to generate [5]. We defined n_clusters=5 to cluster the data into 5 groups, each consisting of data with similar features. Figure 14 gives an overview of the stages required for performing the K-means analysis.

Figure 14. K-Means approach.

As illustrated in Figure 14, the stages involved in K-Means analysis are as follows:

1. Extract API Data: We extracted the API data from a specific location using the pandas library. We used the read_csv() function to extract the data and convert it into a DataFrame object:

df = pd.read_csv('path_to_data_file')

2. Data Preparation: At this stage, we selected the required features from the DataFrame object and converted the data into the required format. We selected the features 'target_status_code', 'time', and 'received_bytes'. We utilized the 'target_status_code' and 'time' features to plot the graph on the X- and Y-axis, respectively, and the 'received_bytes' feature to train the model using the K-means technique. The following code snippet selects the required features for the K-means analysis from the dataset:


df = df[['target_status_code', 'time', 'received_bytes']]

We also converted the 'time' feature into the required format for plotting the graph and stored it in a new DataFrame column:

df['datetime'] = [datetime.datetime.strptime(d, "%Y-%m-%d %H:%M:%S") for d in df['time']]

Finally, we transformed the selected data using the MinMaxScaler class from scikit-learn, which performs scaling on the dataset: it scales and translates each feature into the given range between 0 and 1. We applied MinMaxScaler to the 'received_bytes' feature to scale and translate the data:

from sklearn.preprocessing import MinMaxScaler

data = df[['received_bytes']]
mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)

3. Train K-Means Model: After the data preparation stage, we trained the K-means model using the prepared dataset. We defined n_clusters=5 and passed it to the K-means model as a parameter, then called the fit_predict() function with the prepared dataset as its argument. The following code snippet trains the K-means model:

from sklearn.cluster import KMeans

result = KMeans(n_clusters=5).fit_predict(data_transformed)


4. Results from K-Means Analysis: Finally, we extracted the result to visualize it on a scatter plot graph. We assigned each category in the result a unique color. We used the following code snippet to extract the result and define the unique colors for the categories:

df_model = pd.DataFrame({'categories': result})
# Join the predicted categories to a copy of the data, as in the DBSCAN result processing
df_copy = df.copy().join(df_model)
df_copy['categories'] = df_copy['categories'].map(str)
category_colors = ["#" + ''.join([random.choice('0123456789ABCDEF') for j in range(6)])
                   for i in range(df_copy['categories'].nunique())]
category_list = df_copy['categories'].unique()
colors = {category_list[index]: color for index, color in enumerate(category_colors)}

3.3.2 Visualization of K-Means Model Result

We used the same approach for visualizing the K-Means anomaly result as we used for the DBSCAN result earlier, and thus the same code snippet to generate the visualization. Figure 15 illustrates a scatter plot graph that shows the anomaly result of the K-Means analysis, in which ‘received_bytes’ is shown on the X-axis and ‘target_status_code’ on the Y-axis. The ‘received_bytes’ feature denotes the size of the request received by the target server. If the size of a request is unusually large or small and deviates from the normal data, then we can say that the particular data point is anomaly data. Figure 15 shows the different categories/clusters created by the K-Means analysis, each represented by a unique color. We see a total of five clusters/categories in the graph, and each category shows data points with similar features. Thus, we have clusters that consist of similar received bytes. We observed that category ‘2’ consists of the data with the highest received bytes. The data points that reside in category ‘2’ deviate from the other data and are hence considered an anomaly. Furthermore, the data points that lie under category ‘3’ consist of the lowest amount of received bytes compared with the other data points. Thus, these data points also deviate from the normal data and are considered anomaly data points.
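
Since that shared snippet is not repeated in the text, the following is a minimal sketch of the plotting step, assuming the dataframe ‘df’, the cluster labels in ‘df_model’, and the color mapping ‘colors’ from the previous snippets:

import matplotlib.pyplot as plt

# Scatter plot of received_bytes vs. target_status_code,
# colored by the K-Means cluster label of each row
plt.scatter(df['received_bytes'], df['target_status_code'],
            c=df_model['categories'].map(colors))
plt.xlabel('received_bytes')
plt.ylabel('target_status_code')
plt.show()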


Figure 15. K-Means anomaly result.

Next, we discuss in detail the anomaly results listed in Tables 7 and 8, which show the anomaly data from the least dense clusters. Table 7 provides details about target status codes whose received bytes from the client are very low compared with the other data. This type of data forms the category 3 cluster, which has a lower density of data than the other clusters/categories. From Table 7 we observed that the data with target status code ‘405’ shows fewer received bytes from the sender/client than the other data points. The status code ‘405’ denotes that the target server recognizes the request, but does not support the request method. Similarly, we observed from Table 7 that the data with target status code ‘400’ shows low received-byte counts of 114 and 238 from the client. The target status code ‘400’ denotes that the request is invalid or malformed. Thus, we noted that the category 3 data in Figure 15 is anomaly data.

target_status_code    received_bytes
405                   114, 113, 116, 244, 115
400                   114, 238

Table 7. Anomaly cluster/category 3.

Table 8 provides details about target status codes that consist of different received bytes from the client. This type of data forms the category 2 cluster, which has a lower density of data than the other clusters/categories. We observed from Table 8 that the data with target_status_code ‘200’ shows very high numbers of received bytes, 1742 and 1577, compared with the other data points. A high number of received bytes is an indication of intrusion: there is a possibility of an attacker sending unusually large requests to the server to breach the security of the software system. Similarly, we observed from Table 8 that the data with target status code ‘401’ shows a high number of received bytes, 1742, from the client. The target status code ‘401’ indicates unauthorized access to the API, meaning that the request sent by the client to the API contained invalid credentials. Hence, we can say that all data from category 2 is anomaly data. A short sketch after Table 8 shows how such least-dense clusters can be extracted programmatically.

target_status_code    received_bytes
200                   1742, 1577
401                   1742

Table 8. Anomaly cluster/category 2.
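
The following is a minimal sketch, not part of the original analysis, of how the rows belonging to the least-populated clusters could be listed programmatically, assuming ‘df’ and ‘df_model’ share the same row order as in the earlier snippets:

# Hedged sketch: treat the two smallest K-Means clusters as anomaly
# categories and print the corresponding API records
cluster_sizes = df_model['categories'].value_counts()
anomaly_categories = cluster_sizes.nsmallest(2).index
anomalies = df[df_model['categories'].isin(anomaly_categories).values]
print(anomalies[['target_status_code', 'received_bytes']])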


4 Conclusion and Future Work

After applying different unsupervised ML algorithms to the API dataset, we conclude that unsupervised learning gives significant results in detecting anomalies in a dataset. We utilized the DBSCAN and OCSVM algorithms, which work well with time-series data, to detect anomalies. The OCSVM algorithm helps in determining patterns in the processing time of the API and in identifying unusual patterns, i.e., anomalies, in a dataset. Thus, OCSVM helps to find the root cause of a sudden increase in the processing time of the API and to fix the problem so that the performance of the system can be maintained in the future.

DBSCAN gives interesting insights into the API dataset by clustering time-series data using the categorical features of the data and detecting anomalies based upon the time and target status code features. It makes it possible to perform anomaly detection within a particular time span; for example, by using the DBSCAN approach, we can perform anomaly detection on a yearly, monthly, weekday, or daily basis. K-Means clustering performs efficiently with the numeric features of non-time-series data and helps to detect anomalies based on the received bytes of the API. After applying the K-Means clustering analysis, we can say that the clusters formed with the least density are anomalies.

The future work involves developing a standalone anomaly-detection software system that utilizes the previously discussed unsupervised learning algorithms. The main objective of the application would be to accumulate different types of anomaly detection in one place and to provide a more user-friendly approach to detecting anomalies in the API dataset. The anomaly detection application would include the OCSVM, DBSCAN, and K-Means unsupervised learning techniques to detect anomalies in systems. By using the DBSCAN approach, we can apply filters based on a time span to detect anomalies in the API dataset. The OCSVM approach will help significantly to monitor data sources that include user logs, server logs, and network logs. It also helps significantly to detect unusual patterns in data and to identify unknown security threats in software systems.

We can also utilize K-Means to cluster numerical features of the dataset other than received bytes for anomaly detection. The results produced by the anomaly detection application would be displayed using third-party graph visualization libraries. This third-party graph visualization will help to make the anomaly detection software more user-friendly by providing features like hovering over the graph to learn more about the data.

