
Miro Heikkonen

AUTOMATED MACHINE LEARNING

Evaluating AutoML frameworks

Master of Science Thesis
Faculty of Information Technology and Communication Sciences

May 2021


Miro Heikkonen: Automated machine learning: Evaluating AutoML frameworks
Master of Science thesis
Tampere University
Computing Sciences
May 2021

The purpose of this thesis is to study automated machine learning and the tools that make it possible. Machine learning and automated machine learning are presented based on their theory, along with the most common concepts. Automated machine learning tools were compared with each other using mathematical functions and publicly available datasets intended for testing. These datasets are meant for classification, which means that the research concerns supervised learning. This research continues the work of an earlier study, and its goal is to help members of the automated machine learning community compare different frameworks with each other.

In this thesis the prior research was extended: more frameworks were added to the comparison, and the earlier results were confirmed by using and modifying the codebase of the initial research. Using these results, the tools were compared with each other and with selected traditional machine learning algorithms, which were treated as a baseline for an adequate result. Along with these results, some use cases are presented that show in what kinds of situations each of the tools could be used.

Keywords: machine learning, automated machine learning, supervised learning

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


Miro Heikkonen: Automated machine learning: Evaluating AutoML frameworks
Master of Science thesis
Tampere University
Information Technology
May 2021

The purpose of this thesis is to study automated machine learning and the tools that enable it. Machine learning and automated machine learning are presented based on their theory, along with common concepts. Automated machine learning tools were compared with each other using mathematical functions and public datasets intended for testing. These datasets are meant for classification, so the study concerns supervised learning. This thesis builds on earlier research, and its goal is to help the automated machine learning community compare different tools with each other.

In this thesis, more automated machine learning tools were added to the earlier research, and the earlier results were verified by using and modifying the codebase of that research. Based on these results, the tools were compared with each other and with selected traditional machine learning methods, which were treated as a baseline for a satisfactory result. The study also produces use cases intended to illustrate what kinds of situations each of the studied tools is suited for.

Keywords: machine learning, automated machine learning, supervised learning

The originality of this publication has been checked using the Turnitin OriginalityCheck service.


This thesis has been written for the Faculty of Information Technology and Communication Sciences at Tampere University. It was written during the academic year 2020-2021, and the research was done for the faculty under the supervision of Professor Tapio Elomaa.

I would like to thank Tampere University for the opportunity to study this fascinating field and for giving me the support and knowledge I need to succeed in my career. Without the faculty and all the professors I had the pleasure of learning from, this thesis would never have been possible, so thanks to all of you too.

Lastly, I would like to thank my family and friends for supporting me during my studies and driving me forward even when I really did not feel like it. Most of all I would like to thank Kunmitukset for making my time at Tampere University unforgettable and for giving me a better support network than I could have hoped for inside, outside and everywhere in between my student life.

Thank you

Helsinki, 30.5.2021 Miro Heikkonen


1. INTRODUCTION
2. RELATED WORK
3. MACHINE LEARNING
3.1 Basic concepts of machine learning
3.1.1 Supervised, reinforcement and unsupervised learning
3.1.2 Classification and regression
3.2 Methods and metrics
4. AUTOMATED MACHINE LEARNING
4.1 Hyperparameter optimization
4.1.1 Blackbox hyperparameter optimization
4.1.2 Multi-fidelity optimization
4.2 Meta-Learning
4.2.1 Model evaluations
4.2.2 Task properties
4.2.3 Learning from prior models
4.3 Neural architecture search
5. FRAMEWORK INTRODUCTION
5.1 AutoML Benchmark
5.2 Already tested systems re-evaluated
5.2.1 Auto-sklearn
5.2.2 TPOT
5.2.3 H2O AutoML
5.2.4 Random forest
5.3 New frameworks
5.3.1 Autokeras
5.3.2 MLBox
5.3.3 Lightautoml
5.3.4 Autogluon
5.3.5 Oboe
5.3.6 Mlplan
5.3.7 GAMA
5.3.8 Ludwig
6. TESTING SETUP
6.1 Testing environment
6.2 Testing dataset
6.3 Benchmark's functionalities
7. RESULTS
7.1 Overview
7.2 Comparison to Open AutoML benchmark
7.3 Analysis of the performance
7.4 Use cases for different frameworks


AUC Area under curve

AutoML Automated machine learning

BOHB Bayesian optimization and HyperBand

CSV Comma-separated values (text file type)

FPR False positive rate

GAMA General automated machine learning assistant

HPO Hyperparameter optimization

ICML International Conference on Machine Learning

k-NN k-nearest neighbors

ML Machine Learning

NAS Neural architecture search

ROC Receiver operating characteristic

sklearn scikit-learn

TPOT Tree-based pipeline optimization tool

TPR True positive rate

WEKA Waikato environment for knowledge analysis



1. INTRODUCTION

Machine learning techniques are deeply rooted in our everyday life, for example in recommendations when we are shopping online or listening to music, handwriting recognition when we use our cell phones, speech recognition in smart home applications, and image recognition in cell phones and other cameras. Furthermore, machine learning has achieved significant milestones. For example, AlphaGo defeated the human champion in the game of Go, ResNet surpassed human performance in image recognition, and Microsoft's speech system approached human-level performance in speech transcription. [1]

Machine learning is a field of information technology that is developing all the time, and it is almost impossible for anyone to keep up with all the new innovations happening in the field. Progress in machine learning is published faster than any single person can absorb it. This has driven the machine learning community to rely on automated machine learning frameworks. In addition, designing and tuning machine learning systems is laborious, intensive work that requires extensive knowledge of the newest machine learning trends and research.

While machine learning has many demonstrated benefits, its successful utilization requires a large effort from human experts, given that no algorithm can achieve good performance on all possible problems. That is why AutoML was invented: it makes it possible to use machine learning based predictions and systems without knowing everything about the subject. While the field of machine learning has existed for many years, automated machine learning with large datasets and complex models has only recently become a viable option thanks to the expansion in computational power available through specialized hardware and cloud computing services.

Automated machine learning tools allow novice users to create useful machine learning models, while experts can use them to free up valuable time for other tasks. To achieve this, automated machine learning aims to improve the current way of building machine learning applications through automation. Machine learning experts can profit from automated machine learning by automating tedious tasks like hyperparameter optimization (HPO), leading to higher efficiency. It also means that domain experts can build machine learning pipelines on their own without having to rely on a data scientist. In addition to HPO, automated machine learning can bring automation to other steps in the machine learning process, such as: raw data processing, feature engineering and feature selection, model selection, hyperparameter and parameter optimization, deployment with consideration for business and technology constraints, evaluation metric selection, monitoring and problem checking, and analysis of results.

Although some research on the subject exists, there still is no universally best AutoML approach. Hence, we need further comparisons of all the relevant frameworks to help practitioners select the right tools and to provide objective feedback to the research community and ultimately to the industry. Even the highly promising OpenML AutoML Benchmark [1] lacks some of the bigger, more recent frameworks such as Autokeras. This work will use some of the techniques of the OpenML AutoML Benchmark to evaluate these methods while also bringing my own point of view to the evaluations.

This thesis goes through the theory of machine learning as well as automated machine learning, introducing the parts essential for understanding the results. The results will be gathered by running different automated machine learning frameworks through a benchmark consisting of various tasks with different types of datasets. From the results we will draw conclusions about the different frameworks and compare their ability to perform in the tasks, using methods described in the theory part of this thesis. We will also attempt to clarify the usage of these frameworks, evaluate them through use cases and suggest which framework would be applicable for each use case.


2. RELATED WORK

There has been some work done on this exact topic, and our goal is to expand on it the best way we can. The best-known and most closely monitored research events are the yearly AutoML workshops hosted by automl.org [2]. Many research papers are submitted prior to each event, and the papers are carefully assessed before being accepted. The event itself includes a poster session where the researchers get to explain and show off their findings in the automated machine learning field to the community.

The AutoML community's workshops over the years have been followed closely during this research, and many of the papers referenced here come from their events, along with some of the most relevant literature of the field.

One of the most closely related works on this topic is the OpenML AutoML Benchmark [1] that we will be using as a reference point throughout this document. The research in that paper will be reviewed and extended by bringing in frameworks that are not touched upon in their work. In addition, we will verify their test results at least in part to make sure our new research is applicable. This quote from their paper matches what we found while going through other submissions, and it made their benchmarking seem like something we could expand upon: "Unfortunately, comparing different AutoML systems is hard and often done incorrectly. We introduce an open, ongoing, and extensible benchmark framework which follows best practices and avoids common mistakes. The framework is open-source and uses public datasets."

During this research there will hopefully be some interaction with the community of developers currently improving the framework and with the discussion boards of its GitHub repository. In the best case, some of the new frameworks added in this research can also be contributed to the benchmark's official version for future use.


3. MACHINE LEARNING

In this chapter we will introduce some of the basics of machine learning and focus on its theory. The basics are covered quite thoroughly, and the main machine learning approaches are introduced briefly. We will also introduce some basic classification methods, because those are the types of problems that will be tested in this work. In addition, we will investigate another common task, regression, to give a clearer picture of machine learning in general and what is achievable with it.

3.1 Basic concepts of machine learning

Machine learning is usually seen as a subset of artificial intelligence that uses data and computer algorithms to make predictions, learn iteratively and improve automatically through experience [3, 4]. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so [3]. Machine learning algorithms are widely used in modern applications like computer vision and email filtering, along with all kinds of prediction tasks based on datasets [5]. Machine learning is usually a good option in fields where it is difficult to develop conventional algorithms to solve the problems or perform the tasks that are needed [6]. An example of input for a machine learning task is a set of features, such as a vector of humidity values or a matrix of pixel values from a picture or video, that can be processed by the machine learning algorithm [3, 7].

Machine learning is typically divided into three categories: supervised learning, unsupervised learning and reinforcement learning [4, 7]. These categories are quite broad but at the same time distinct from one another. The categorization is based on what kind of feedback is available for the algorithm to make decisions and learn. More recently, deep learning has been considered a separate category due to its rise in popularity as an approach in the field [6].

3.1.1 Supervised, reinforcement and unsupervised learning

In supervised learning the computer is presented with example inputs and their desired outputs, given by a part of the program that works as a teacher of sorts, and the goal is to learn a general rule that maps inputs to outputs [8, 9, 10]. In unsupervised learning no labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself, for example when discovering hidden patterns in data, or a means towards an end, as in feature learning [11].

In reinforcement learning, on the other hand, a computer program interacts with a dynamic environment in which it must perform a certain goal, such as driving a vehicle or playing a game against an opponent [12]. As it navigates its problem space, the program is provided feedback that is analogous to rewards, which it tries to maximize, and in this way it makes more correct decisions on each iteration [13, 14].

Deep learning is unofficially the fourth category, and it is defined as a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input [4]. For example, in image processing, lower layers may identify edges, while higher layers may identify concepts relevant to a human, such as digits, letters or faces. These approaches are used to solve machine learning tasks, of which the most common ones are classification and regression, introduced in more detail next [15, 16].

3.1.2 Classification and regression

In a classification task, machine learning algorithms are used to learn how to assign a class label to examples from the specific problem domain [17]. The machine learning algorithm is given an input vector of values whose class is unknown, and it assigns it to one of the discrete classes [18]. Most algorithms describe the individual instance whose category is to be predicted using a feature vector of properties, but sometimes the output can also be represented as a probability distribution over the classes [19]. In machine learning, classification is considered an instance of supervised learning [3].

An easy to understand example is classifying emails as "spam" or "not spam". This is an example of binary classification because only two classes are involved. Another example could be identifying which animal is in a picture. This would be multi-class classification because there are more than two possible classes for each picture to belong to [19]. The corresponding unsupervised procedure is known as clustering and involves grouping data into categories based on some measure of similarity [20].

Unlike classification problems, regression tasks do not have a set of discrete classes as target outputs. As regression is also a supervised learning method, it can take similar data, but the output differs a lot, as regression algorithms give a continuous numerical value [3]. This means that regression can be used in applications whose data and results can be measured numerically [4]. One example of a regression task could be predicting a company's sales and the effect of marketing on them using the company's historical data on those budgets [4].

Figure 1 Example plotting of classification and regression

3.2 Methods and metrics

When starting the training of a machine learning model, a dataset is needed. Datasets in this context are collections of examples, each example containing its own features [4]. The examples in a dataset are usually contained in a structure similar to a matrix or a vector [19]. In the case of a matrix, each row contains the features of one example, and the matrix contains a row for each of the examples. With supervised learning models, each row also contains the target value wanted for the example. Classification datasets transform the class names into corresponding integer numbers so the data can be processed more easily [19].

Simple machine learning models that are trained on simple examples with only a few variables can possibly be taught with a dataset of hundreds or even just dozens of examples [4]. Training on complicated and multidimensional examples, such as images, might require tens of thousands of examples in a dataset to train a model with decent accuracy [21]. We will be using different types of datasets to evaluate different automated machine learning models, so we will have data on different volumes of datasets. The type, size and complexity of a dataset are major factors in what machine learning method should be used [4].


To explain classification further, we will introduce a simple linear classification function and give a few examples of such algorithms [3]. Those algorithms will be introduced at a very high level to avoid going too far into detail, because they are not at the forefront of this study, although they are still important to understand.

A large number of algorithms for classification can be expressed in terms of a linear function that assigns a score to each possible category k by combining the feature vector of an example with a vector of weights, using a dot product [18]. The predicted category is the one with the highest score [22]. This type of score function is known as a linear predictor function and has the following general form:

$\mathrm{score}(X_i, k) = \beta_k \cdot X_i$ (1)

where $X_i$ is the feature vector for instance i, $\beta_k$ is the vector of weights corresponding to category k, and $\mathrm{score}(X_i, k)$ is the score associated with assigning instance i to category k [22]. In discrete choice theory, where instances represent people and categories represent choices, the score is considered the utility associated with person i choosing category k [23].
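As a minimal illustration of equation (1), the following sketch computes the score of each category for one instance and predicts the category with the highest score. The weights and features are made-up numbers, not values from any of the evaluated frameworks:

```python
import numpy as np

# Hypothetical example: 3 categories, 4 features per instance.
beta = np.array([
    [0.2, -0.5, 1.0, 0.3],   # weight vector for category 0
    [0.7,  0.1, -0.2, 0.0],  # weight vector for category 1
    [-0.3, 0.4, 0.5, 0.9],   # weight vector for category 2
])

x_i = np.array([1.0, 0.5, -1.2, 2.0])  # feature vector of instance i

scores = beta @ x_i                    # score(x_i, k) = beta_k . x_i for each k
predicted_category = int(np.argmax(scores))  # category with the highest score

print(scores, predicted_category)
```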

One of the ways we will measure performance in this thesis is the receiver operating characteristic curve, or ROC curve [24]. It is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall or probability of detection in machine learning [25]. The false positive rate is also known as the probability of false alarm and can be calculated as (1 − specificity) [25, 26]. A confusion matrix, which demonstrates the TPR and FPR relation, is shown in Table 1.

Table 1 A confusion matrix. The target responses are on the left and the model’s predictions on the top.

                    Predicted positive     Predicted negative
Actual positive     True Positive          False Negative
Actual negative     False Positive         True Negative

When using normalized units, the area under the curve (often referred to as simply the AUC) is equivalent to the probability that a classifier will rank a randomly selected positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative', which is what we want to happen) [25].

Figure 2 An example of a ROC curve and the area under it marked grey.

The confusion matrix seen in Table 1 is a table that can be used to visualize the performance of a machine learning algorithm. A confusion matrix is typically used for supervised learning such as classification [25]. The confusion matrix can be presented so that each row of the matrix represents the examples in a predicted class while each column represents the instances in an actual class, or the other way around, as in Table 1. The confusion matrix can thus be used to calculate the true positive rate and the false positive rate, to plot the receiver operating characteristic curve and, from that, to obtain the area under the curve, which will be used as one of the performance metrics in our experiments [26]. An example of such a curve can be seen in Figure 2.
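As a small illustrative sketch (the labels and scores below are made up, not results from the experiments), the confusion matrix, TPR, FPR and the area under the ROC curve can be computed with scikit-learn as follows:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # actual classes
y_score = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.3, 0.7])   # predicted probabilities
y_pred = (y_score >= 0.5).astype(int)                          # predictions at threshold 0.5

# Confusion matrix at one threshold: entries are TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # true positive rate (sensitivity, recall)
fpr = fp / (fp + tn)   # false positive rate (1 - specificity)

# ROC curve over all thresholds and the area under it.
fpr_curve, tpr_curve, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(tpr, fpr, auc)
```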

Loss functions for classification are computationally feasible loss functions representing the price paid for inaccurate predictions in classification problems [27]. One of these loss functions is the logistic loss, more commonly called log loss [28]. Raw log-loss values are hard to interpret, but log loss is still a good metric for comparing models; for any given problem, a lower log-loss value means better predictions [27]. The log-loss function is defined mathematically as:

$L_{\log}(y, p) = -\bigl(y \log(p) + (1 - y)\log(1 - p)\bigr)$ (2)

where the true label of a single sample is $y \in \{0, 1\}$ and the probability estimate is $p = \Pr(y = 1)$ [27]. Log loss heavily punishes classifiers that are confident about an incorrect classification. For example, if for a particular observation the classifier assigns a very small probability to the correct class, then the corresponding contribution to the log loss will be very large [28]. Log loss will be used in this research to evaluate the accuracy of multiclass classification problems [27].
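A small sketch of equation (2), again with made-up labels and probabilities, computed both by hand and with the scikit-learn helper:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])            # true labels
p = np.array([0.9, 0.1, 0.8, 0.3, 0.2])       # predicted probability of class 1

# Per-sample logistic loss from equation (2), then averaged over the samples.
per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
manual_log_loss = per_sample.mean()

# The same value via scikit-learn (here given probabilities of the positive class).
sklearn_log_loss = log_loss(y_true, p)

print(manual_log_loss, sklearn_log_loss)  # both approximately 0.372
```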

One more topic that can assist in the understanding of classification is a concrete algorithm, which we introduce next. We will use the k-nearest neighbors algorithm as an example, because we think it shows clearly what classification is all about while still being an algorithm that is widely used [20]. The k-nearest neighbors algorithm (k-NN) is a non-parametric machine learning method where the input consists of the k closest training instances and the output is a class membership [21]. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor [20, 21].

In k-nearest neighbors the training examples are vectors in a multidimensional feature space, each with a class label [20]. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification stage, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning it the label that is most frequent among the k training samples nearest to that query point [21].

A drawback of the basic "majority voting" classification occurs when the class distribution is skewed [21]. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the k nearest neighbors due to their large number [20]. This can be overcome by weighting the neighbors, for example so that the closest neighbors affect the classification more than the furthest ones [20, 21]. An example of how k-nearest neighbors classification works can be seen in Figure 3.
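As a brief sketch of the idea (using randomly generated data rather than any dataset from this thesis), scikit-learn's k-NN classifier supports both plain majority voting and the distance weighting mentioned above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two synthetic classes in a 2-D feature space.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Plain majority vote among the 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Distance weighting: closer neighbors count more than distant ones.
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn_weighted.fit(X, y)

query = np.array([[1.5, 1.5]])
print(knn.predict(query), knn_weighted.predict(query))
```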


Figure 3 Example of a k-NN classification where the green circle needs to be classified as a red triangle or as a blue square. If k = 3, it will be classified as a red triangle, but if k = 5, it would be classified as a blue square.

Next, we will give an example of a simple classification task to show what type of problems can be solved using these methods. The typical classification problem we chose to show here is distinguishing between different fruits according to the given data. This process roughly follows the following pattern: get the data to be used; analyze the data and do some data engineering around it, for example deciding what to do with missing values; extract features from the data and clean it so that there are no unnecessary columns; test out and design different models; analyze their results and fine tune the ones that perform well; and finally read the metrics of the results and evaluate them. [29]

In our example we will be using a dataset of apples, mandarins, oranges and lemons that was created at the University of Edinburgh and later altered by the University of Michigan. A few rows of the data can be seen in Table 2. Each row of the dataset represents one piece of fruit, described by the features in the table's columns. We have 59 pieces of fruit and 7 features in the dataset.

Table 2 Data used in the classification example


The dataset itself is quite balanced except for the mandarins, as we have 19 apples, 16 lemons, 19 oranges and 5 mandarins. Going through the data there are some obvious correlations, like width and mass, so we will use those as our features to make the prediction. We can also see that the values have to be scaled; that is something we will not go through in more detail. There are no empty values or much else to take into account regarding this particular dataset.

The data will then be split into a training set and a test set, with 80% in the training set and 20% in the test set. After going through other algorithms, like logistic regression and a support vector machine, we come to the conclusion that the k-nearest neighbors algorithm gives us the best results with k = 5. It gives us 100% accuracy on the test set as measured with a confusion matrix, so there is a slight chance of overfitting, although that is not a common problem with this algorithm. The decision boundary for the k-nearest neighbors classifier can be seen in Figure 4.
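The workflow described above could look roughly like the following sketch. The file name and the column names (mass, width, fruit_label) are assumptions about the dataset's layout for illustration, not something fixed by this thesis:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Assumed file and column names for the fruit dataset.
fruits = pd.read_table("fruit_data_with_colors.txt")
X = fruits[["mass", "width"]]      # the two correlated features chosen above
y = fruits["fruit_label"]          # integer class label per fruit

# 80/20 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale the features, fitting the scaler only on the training data.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# k-NN with k = 5, as selected in the example.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
print("test accuracy:", knn.score(X_test_scaled, y_test))
```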

Figure 4. Classification space of the classification example


4. AUTOMATED MACHINE LEARNING

Automated machine learning, also referred to as automated ML or AutoML, is the process of automating the time-intensive, iterative tasks of machine learning model development. More precisely, it automates the selection, composition and parameterization of machine learning models. The place of automated machine learning in the machine learning pipeline can be seen in Figure 5. It allows data scientists, analysts and developers to create ML models with high scale, efficiency and productivity while sustaining model quality. Automating the process of applying machine learning end-to-end furthermore offers the advantages of simpler solutions, faster creation of those solutions, and models that often outperform hand-designed models. [30]

At the same time, it introduces new problems. Complex models can be hard to interpret, and because of this it is hard to tell when a model is introducing bias. AutoML worsens this black-box problem by hiding not only the mathematics of the model but also the data cleaning, feature selection, model selection and parameter selection performed in the background. [30]

Figure 5 Machine learning pipeline with AutoML highlighted

The most significant goal of automated machine learning is to deliver methods and procedures that make machine learning accessible to non-machine learning experts, to improve the efficiency of machine learning and to accelerate research on machine learning. This is done by automating the previously introduced tasks of the machine learning pipeline. As the difficulty of these tasks is often beyond people who are not machine learning experts, the fast growth of machine learning applications has created a demand for off-the-shelf machine learning methods that can be used effortlessly and without expert knowledge. The research area that targets this progressive automation of machine learning is called AutoML.


4.1 Hyperparameter optimization

Every machine learning system has hyperparameters, and the most basic task in AutoML is to automatically set these hyperparameters to optimize performance. Automated hyperparameter optimization (HPO) has several important use cases [31]. For example, it can decrease the human effort needed for applying machine learning, improve the performance of machine learning algorithms and improve the reproducibility and fairness of scientific studies. Because we are mostly interested in the first use case, reducing human effort, we will focus on it and on points related to it [32].

The domain of a hyperparameter can be real-valued (for example the learning rate), integer-valued (for example the number of layers), binary (for example whether to use early stopping or not), or categorical (for example the choice of optimizer). The configuration space can contain conditionality, which means that a hyperparameter may only be relevant if another hyperparameter or some combination of hyperparameters has a specific value [30]. In hyperparameter optimization we tune these hyperparameters so that the loss of the model found in evaluation is minimized [33].
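To make the different hyperparameter domains concrete, a configuration space for a hypothetical neural network tuner could be described with a structure along these lines; the parameter names are illustrative only:

```python
# A sketch of a configuration space with the four kinds of hyperparameter
# domains mentioned above, plus one conditional hyperparameter.
config_space = {
    "learning_rate": {"type": "real", "range": (1e-5, 1e-1), "log_scale": True},
    "num_layers": {"type": "integer", "range": (1, 8)},
    "early_stopping": {"type": "binary", "choices": [True, False]},
    "optimizer": {"type": "categorical", "choices": ["sgd", "adam", "rmsprop"]},
    # Conditional: only relevant when optimizer == "sgd".
    "momentum": {"type": "real", "range": (0.0, 0.99),
                 "condition": ("optimizer", "sgd")},
}
```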

4.1.1 Blackbox hyperparameter optimization

In principle, every blackbox optimization method can be applied to HPO. Due to the non-convex nature of the problem, global optimization algorithms are likely to be favored, but some locality in the optimization process is beneficial in order to make progress within the limited number of function evaluations that are usually available [33]. We will talk more specifically about model-free blackbox optimization and blackbox Bayesian optimization methods.

Grid search is the simplest and most basic HPO method. The user specifies a set of values for each hyperparameter, and grid search evaluates the Cartesian product of these sets [33]. The problem with this approach is that as the dimensionality of the configuration space increases, the required number of function evaluations grows exponentially [30].

An alternative to grid search is random search, which samples configurations at random until a certain limit for the search is reached. This works better than grid search when some hyperparameters are more important than others, which is true in many cases [32]. Random search is a valuable baseline because it makes no assumptions about the machine learning algorithm being optimized and, given enough resources, will in expectation achieve performance close to the optimum. Random search can also be helpful for initializing the search process, as it explores the entire configuration space and thus often finds settings with reasonable performance [34]. However, it is no silver bullet and often takes far longer than guided search methods to identify one of the best performing hyperparameter configurations.
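A minimal sketch of both strategies with scikit-learn, tuning a support vector classifier on a toy dataset; the parameter grid and distributions are only example values:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: evaluate the full Cartesian product of the given value sets.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)

# Random search: sample 20 configurations from the given distributions.
rand_search = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=20, cv=5, random_state=0)
rand_search.fit(X, y)

print(grid.best_params_, rand_search.best_params_)
```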

Population-based optimization methods are algorithms that maintain a population of configurations and improve this population to obtain a new generation of better configurations.

Bayesian optimization is a state-of-the-art framework for the global optimization of expensive blackbox functions [30]. It is an iterative algorithm with two key ingredients: a probabilistic surrogate model and an acquisition function that decides which point to evaluate next [30].

Bayesian optimization frameworks using information-theoretic acquisition functions allow decoupling the evaluation of the objective function and the constraints in order to dynamically choose which of them to evaluate next. This becomes advantageous when evaluating the function of interest and the constraints require vastly different amounts of time, such as evaluating a deep neural network's performance versus its memory consumption. [30]
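A compact sketch of the Bayesian optimization loop for a single hyperparameter, using a Gaussian process surrogate and the expected improvement acquisition function. The objective function below is a cheap stand-in for an expensive model evaluation, not part of any framework in this thesis:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    """Stand-in for an expensive evaluation, e.g. validation loss at hyperparameter x."""
    return np.sin(3 * x) + 0.1 * x ** 2


rng = np.random.default_rng(0)
X_obs = rng.uniform(-3, 3, size=(3, 1))   # a few initial random evaluations
y_obs = objective(X_obs).ravel()

for _ in range(15):
    # 1) Fit the probabilistic surrogate model to all observations so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y_obs)

    # 2) Maximize the acquisition function (expected improvement) over candidates.
    candidates = np.linspace(-3, 3, 500).reshape(-1, 1)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y_obs.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)].reshape(1, -1)

    # 3) Evaluate the expensive function at the chosen point and record the result.
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

print("best x:", X_obs[np.argmin(y_obs)], "best value:", y_obs.min())
```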

4.1.2 Multi-fidelity optimization

Increasing dataset sizes and increasingly complex models are a major obstacle for HPO, since they make blackbox performance evaluation more expensive. Training a single hyperparameter configuration on a large dataset can now easily exceed several hours and take up to several days [30]. A common way to speed up manual tuning is therefore to probe an algorithm/hyperparameter configuration on a small subset of the data, by training it only for a few iterations, by running it on a subset of features, by using only one or a few of the cross-validation folds, or by using down-sampled images in computer vision [38]. Multi-fidelity methods cast the same kinds of manual heuristics into formal algorithms, using so-called low-fidelity approximations of the actual loss function to be minimized.

Some multi-fidelity methods evaluate and model learning curves during HPO and then decide whether to add further resources or to stop the training procedure for a given hyperparameter configuration [30]. An example of a learning curve is the performance of the same configuration trained on increasing subsets of the dataset. Learning curve extrapolation is used for predictive termination, where a learning curve model is used to extrapolate a partially observed learning curve for a configuration, and the training procedure is stopped if the configuration is predicted not to reach the performance of the best model trained so far in the optimization process [39].

When combined with Bayesian optimization, the predictive termination criterion enabled lower error rates than off-the-shelf blackbox Bayesian optimization for optimizing neural networks. While this method is limited by not sharing information across different hyperparameter configurations, this can be achieved by using the basis functions as the output layer of a Bayesian neural network [30]. The parameters and weights of the basis functions, and thus the full learning curve, can then be predicted for arbitrary hyperparameter configurations [35].

Bandit-based algorithm selection methods try to determine the best algorithm out of a set of algorithms based on low-fidelity approximations of their performance. The bandit-based strategies successive halving and Hyperband have shown strong performance, especially for optimizing deep learning algorithms. [30]

Successive halving is an extremely simple yet very powerful, and therefore popular, strategy for multi-fidelity algorithm selection: for a given initial budget, query all algorithms for that budget; then remove the half that performed worst, double the budget per algorithm and repeat until only a single algorithm is left [30]. While successive halving is an effective approach, it suffers from the budget-versus-number-of-configurations trade-off. Given a total budget, the user must decide in advance whether to try many configurations and only assign a small budget to each, or to try only a few and assign them a larger budget. Assigning too small a budget can result in prematurely terminating good configurations, while assigning too large a budget can result in running bad configurations too long and thereby wasting resources [36].

Hyperband is designed to overcome this problem of successive halving when selecting from randomly sampled configurations. It divides the total budget into several combinations of number of configurations versus budget per configuration, and then calls successive halving as a subroutine on each set of random configurations. Hyperband's limitation is that the configuration proposal distribution cannot adapt to the function evaluations, which is why the newer approach BOHB was developed [30]. It combines Bayesian optimization and Hyperband to get the best of both components: strong anytime performance and strong final performance. BOHB has been shown to outperform numerous state-of-the-art HPO methods for tuning support vector machines, neural networks and reinforcement learning algorithms. [37]
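A bare-bones sketch of the successive halving loop described above, written against a hypothetical evaluate(config, budget) function that returns a validation loss for a configuration trained with the given budget:

```python
def successive_halving(configs, evaluate, initial_budget=1, rounds=4):
    """Repeatedly evaluate configurations, keep the better half, double the budget."""
    budget = initial_budget
    survivors = list(configs)
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        # Evaluate every surviving configuration with the current budget.
        losses = {cfg: evaluate(cfg, budget) for cfg in survivors}
        # Keep the half with the lowest loss and double the per-configuration budget.
        survivors = sorted(survivors, key=losses.get)[: max(1, len(survivors) // 2)]
        budget *= 2
    return survivors[0]


# Example usage with a toy "loss" that prefers configurations close to 0.42
# and improves as the budget grows.
best = successive_halving(
    configs=[0.1, 0.3, 0.42, 0.7, 0.9, 0.25, 0.5, 0.8],
    evaluate=lambda cfg, budget: abs(cfg - 0.42) + 1.0 / budget,
)
print(best)
```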


Fidelities can also be chosen adaptively, unlike in the previously introduced methods, which had a predefined schedule for fidelities. One such approach is multi-task Bayesian optimization. It uses a multi-task Gaussian process to model the performance of related tasks and to automatically learn their relationship during the optimization process. This method can switch between cheaper, low-fidelity tasks and the more expensive, high-fidelity target task based on the modelled function. Multi-task Bayesian optimization can also be used to transfer information from previous optimization tasks. [30]

4.2 Meta-Learning

Meta-learning is essentially learning to learn; it is the science of systematically observing how different machine learning approaches perform on a wide range of learning tasks, and then learning from this experience, or meta-data, to learn new tasks much more quickly than would otherwise be possible [38]. Meta-learning can speed up and improve the design of machine learning pipelines or neural architectures, and it also permits us to replace hand-engineered algorithms with novel approaches learned in a data-driven way [30].

The challenge in meta-learning is to learn from previous experience in a systematic, data-driven way. First, we need to gather meta-data that describe prior learning tasks and previously learned models. The meta-data comprise, for example, algorithm configurations, hyperparameter settings, accuracy, training time and model parameters, among numerous others [39]. Second, we need to learn from this prior meta-data, to extract and transfer knowledge that guides the search for optimal models for new tasks. [30]

We will look at meta-learning techniques in the following subchapters from three distinct angles. First, we will discuss how to learn from model evaluations. Next, we will focus on how to characterize tasks and build models that can learn relationships between task characteristics and performance. Finally, we discuss how to transfer learned model parameters between similar tasks. [40]

4.2.1 Model Evaluations

If we have access to evaluations on prior tasks, we can add these previous evaluations and use them to train a meta-learner. The meta-learner can then recommend configurations for a new task. Sometimes the search can also be warm-started with initial data generated by another method. [30]


In task-independent recommendations we do not have access to evaluations on the new task but still use a common function to produce configurations. These configurations are usually ranked and evaluated by success rates or other evaluation measures. Configuration space design is also independent of the task, but it uses prior evaluations to learn an improved configuration space [41]. It has turned out to be a very important part of AutoML system comparisons, and it focuses on learning optimal hyperparameter default settings.

Default values can be learned jointly for all hyperparameters of an algorithm by first training surrogate models for that algorithm on a large number of tasks [30]. Next, a large number of configurations are sampled, and the configuration that minimizes the average risk across all tasks is recommended as the default configuration. Finally, the importance of each hyperparameter is estimated by observing how much improvement can still be gained by tuning it. [30]

If we want to deliver recommendations for a specific task, we need additional information on how similar it is to prior tasks. One way to obtain this is to evaluate a number of recommended (or potentially random) configurations on the new task, yielding new evidence [41]. If we then observe that these evaluations are similar to the evaluations on a prior task, the two tasks can be considered intrinsically similar, based on empirical evidence. We can include this knowledge when training a meta-learner that predicts a recommended set of configurations. [30]

We can also extract meta-data about the training process itself, such as how fast the model performance improves when more training data is added. If we divide the training into steps, usually adding a specific number of training examples at every step, we can measure the performance of a configuration on a task after each step, yielding a learning curve across the time steps. Learning curves are also used to speed up hyperparameter optimization on a given task; in meta-learning, learning curve information is transferred across tasks. [30]

4.2.2 Task properties

Another rich source of meta-data are characterizations (meta-features) of the task at hand. Each task can be described as a vector of meta-features, which can be used to define a similarity measure between tasks. We can then transfer information from the most similar tasks to the new one. After this we can train a meta-learner to predict the performance of specific configurations on the new task. [30]

Some of the common meta-features in machine learning are, for instance, the number of instances, the number of features, the number of classes, class entropy, data consistency and information gain, just to name a few [42]. Each of these meta-features has its own meaning and reasoning for why it is important in optimizing a model. These reasons can be as simple as the speed of the model or scalability, or more complex, like feature interdependence or the noisiness of the data [30].

To build a meta-feature vector, one needs to select and further process these meta-features. Many meta-features are calculated on single features, or combinations of features, and need to be aggregated by summary statistics; one needs to carefully extract and aggregate them [43]. Beyond these general-purpose meta-features, many more specific ones have been formulated. For streaming data one can use streaming landmarks, for time series data one can compute autocorrelation coefficients or the slope of regression models, and for unsupervised problems one can cluster the data in different ways and extract properties of these clusters [30].
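A small sketch of computing a few of the simple meta-features mentioned above for an arbitrary classification dataset (here the iris dataset serves only as a stand-in):

```python
import numpy as np
from sklearn.datasets import load_iris


def simple_meta_features(X, y):
    """Compute a handful of basic dataset meta-features."""
    n_instances, n_features = X.shape
    classes, counts = np.unique(y, return_counts=True)
    class_probs = counts / counts.sum()
    class_entropy = -np.sum(class_probs * np.log2(class_probs))
    return {
        "n_instances": n_instances,
        "n_features": n_features,
        "n_classes": len(classes),
        "class_entropy": class_entropy,
    }


X, y = load_iris(return_X_y=True)
print(simple_meta_features(X, y))
```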

So far we have only talked about meta-features in general, but it is also possible to learn a joint representation for groups of tasks. One method is to build meta-models that produce a landmark-like meta-feature representation from other tasks' meta-features and train on that. In a simple form, we can take prior tasks and their configurations and test whether new configurations outperform the old ones. [42, 30]

We can also learn the complex relationship between a task's meta-features and the usefulness of specific configurations by building a meta-model that recommends the most beneficial configurations for a new task given its meta-features. Such meta-models can, for example, create a ranking of the best configurations or predict the performance of a configuration when they have access to the meta-features. [30]

4.2.3 Learning from prior models

The final type of meta-data we can learn from are prior machine learning models themselves, that is, their structure and learned model parameters. In this approach we want to train a meta-learner that learns how to train a learner for a new task given comparable tasks and the corresponding models. The learner can usually be defined by its parameters or its configuration. [30]

In transfer learning, we take models trained on one or more source tasks and use them as starting points for creating a model on a similar target task. This can be done by forcing the target model to be structurally or otherwise similar to the source model. This approach is widely applicable, and transfer learning methods have been used or at least proposed for Bayesian networks, clustering, kernel methods and reinforcement learning, which is the most interesting for our ultimate research [44]. Transfer learning is particularly well suited to neural networks. Meta-learning is certainly not limited to (semi-)supervised tasks and has been effectively applied to solve tasks as varied as reinforcement learning, active learning, density estimation and item recommendation. The base-learner may be unsupervised while the meta-learner is supervised, but other combinations are certainly possible as well [30].

We should never have to start entirely from scratch. Instead, we should systematically collect our 'learning experiences' and learn from them to build AutoML systems that continuously improve over time, helping us tackle new learning problems ever more efficiently. The more new tasks we encounter, and the more similar those new tasks are, the more we can draw from prior experience, to the point that most of the required learning has already been done earlier. [30]

4.3 Neural architecture search

Deep learning has enabled remarkable progress over the last years on a variety of tasks, such as image recognition, speech recognition and machine translation. One crucial aspect of this progress is novel neural architectures. Currently employed architectures have mostly been developed manually by human experts, which is a time-consuming and error-prone process [45]. Because of this, there is growing interest in automated neural architecture search methods. Here we will give a quick introduction to neural architecture search and its techniques [30].

Neural architecture search (NAS), the process of automating architecture engineering, is a logical next step in automating machine learning: along with the automation of feature engineering, there has been a need to automate architecture engineering, as ever more complex neural architectures are designed manually. NAS can be seen as a subfield of AutoML and has significant overlap with hyperparameter optimization and meta-learning [45]. We categorize methods for NAS according to three dimensions that we will look into further: search space, search strategy, and performance estimation strategy [30].

The search space defines which architectures can be represented in principle. Incorporating prior knowledge about properties well suited for a task can decrease the size of the search space and simplify the search. However, this also introduces a human bias, which may prevent finding novel architectural building blocks that go beyond current human knowledge. [30]

The search strategy details how to explore the search space. It involves the classical exploration-exploitation trade-off: on the one hand, it is desirable to find well-performing architectures quickly, while on the other hand, premature convergence to a region of suboptimal architectures should be avoided.

The objective of NAS is typically to find architectures that achieve high predictive performance on unseen data. Performance estimation refers to the process of estimating this performance: the simplest option is to perform standard training and validation of the architecture on the data, but this is unfortunately computationally expensive and limits the number of architectures that can be explored. Much recent research therefore focuses on developing methods that reduce the cost of these performance estimations. [46, 30]


5. FRAMEWORK INTRODUCTION

In this thesis we will be working with multiple automated machine learning frameworks, which are introduced in this chapter. The testing relies on the open-source AutoML Benchmark, which was introduced at the 2019 ICML AutoML Workshop [2]. In that paper, tests were already run for Auto-WEKA, auto-sklearn, TPOT and H2O AutoML, along with a random forest algorithm used as a baseline [1]. We will attempt to reproduce these results and document any differences and improvements in these frameworks. We will add our own additional frameworks to this benchmark, or at least reproduce the same kind of testing environment, and compare the new frameworks with the already tested ones.

The frameworks we will be looking to add to the test set are Autokeras, MLBox, GAMA and Ludwig by Uber.

If needed and if possible, we will also try to follow another paper from the AutoML workshop, this one from the 2020 ICML AutoML Workshop: On evaluation of AutoML systems [47]. In this paper Milutinovic, Schoenfeld, Martinez-Garcia, Ray, Shah and Yan discussed ways to evaluate an AutoML system, and we will try to add those ideas to our testing if we see fit.

To understand the tests and the benchmark, we have to look into the benchmark and its purposes as well as the frameworks themselves. The following subchapters introduce all of these components. These components are largely based on already implemented machine learning frameworks that have introduced an automated machine learning component later in their lifecycle, so we will also scratch the surface of the existing frameworks and their prevalence and usage in general. It should be noted that we have only taken into account frameworks runnable from Python, because widening the stack to other languages might cause a performance imbalance in favor of or against them, and the assessment of the results could easily become erroneous.

To make the assessment even clearer, we will introduce some use cases and evaluate the frameworks based on them. This will showcase the differences between the frameworks in terms of the experience level needed and what kinds of tasks could be completed using each framework.

5.1 AutoML Benchmark

The AutoML Benchmark provides an overview and comparison of open-source AutoML systems [1]. As one can guess, the AutoML Benchmark was created so that non-machine learning experts would be able to run machine learning systems and algorithms. Its goal was to find a universally best AutoML system and to help those users find the right tools for their project. But as its authors already state, many of the existing comparisons were lacking, and they intended to create a good comparison between the systems, which they admittedly did. At the present time, however, their own comparison is also lacking, as more AutoML systems have been developed and taken into use.

At the same time, they have to be given credit, as they increased and streamlined the usage of different datasets so that the same old datasets are not used every time. Using one and the same dataset could have led to overfitting on that particular dataset. The benchmark has thus made it simpler for us to test new AutoML systems, as the basis and suitable test datasets already exist. They also state that before their solution there was a possibility that not all of the methods were understood correctly, which might have led to not choosing the best possible machine learning system.

In their paper, Gijsbers and others state: "The benchmark is completely open source and allows anyone to extend it by adding or updating AutoML systems through pull requests. Finally, it is ongoing because we will update it with new benchmark datasets, run the experiments again when AutoML tools have substantial version updates". There have indeed been updates to the GitHub repository; they have kept it current, and new frameworks have been added, although these are not as carefully documented as the first ones as of the moment of writing this [1]. We will also include those newly added frameworks and test them in relation to the first ones.

The benchmark is designed so that each task consists of a dataset, one metric to optimize and the resources given to it. The 39 datasets used in the benchmark are taken from earlier AutoML research papers, machine learning benchmarks and AutoML competitions. The datasets vary in size, and some have characteristics, such as missing values, that others do not. The same dataset base will work fine for this test case as well, but as stated before, if deemed necessary we will add our own datasets for further testing.

For the results in their paper, the area under the receiver operating characteristic curve (AUROC) is used for binary classification problems and log loss for multi-class classification problems. We will use the same metrics, mainly because they are insightful, commonly used and supported by most AutoML tools, but also because it keeps the two studies consistent with each other. It is imperative that an AutoML system optimizes for the same metric it is evaluated on. The measures are estimated with ten-fold cross-validation.

The actual frameworks will be discussed more closely in the next subchapter, but it has to be mentioned that the AutoML tools were all used with their default hyperparameter values and search spaces, since most users will use them in this way. This will mostly be true for our research as well. The exceptions are hyperparameters that specify available resources, which were fixed to a specific number of cores, amount of memory and total runtime. This was done to allow a more practical comparison, and because it is practically impossible to homogenize the search spaces for each tool.

They also decided not to address meta-learning in any way, because not all of the frameworks use it. In this research meta-learning is fully allowed, even though it gives a large advantage to the frameworks that use it compared to those that do not. We will talk more about the results they gathered in their research when presenting the results of our own experiments, and the same goes for the datasets and what they consist of.

5.2 Already tested systems re-evaluated

The first prominent AutoML tool was Auto-WEKA, which used Bayesian optimization to select and tune the algorithms in a machine learning pipeline based on WEKA [17]. The WEKA workbench is a collection of machine learning algorithms and data preprocessing tools that provides support for experimenting with data mining, evaluating learning schemes and visualizing the results of learning. WEKA has a graphical interface, but it can also be used from a Python wrapper, which is what we will do in our research.

The workbench includes methods for the main data mining problems: regression, classification, clustering, association rule mining and attribute selection. Getting to know the data is an integral part of the work, and many data visualization facilities and data preprocessing tools are provided. All algorithms take their input in the form of a single relational table that can be read from a file or generated by a database query [48].

Auto-WEKA is an AutoML system based on the original WEKA, designed to help such users by automatically searching through the joint space of WEKA's learning algorithms and their respective hyperparameter settings to maximize performance. Each of the algorithms present in WEKA has its own hyperparameters that can drastically change its performance, and there is a staggeringly large number of possible alternatives overall. Auto-WEKA considers the problem of simultaneously selecting a learning algorithm and setting its hyperparameters, going beyond previous methods that address these issues in isolation. Auto-WEKA does this using a fully automated approach, leveraging recent innovations in Bayesian optimization. [49]


5.2.1 Auto-sklearn

Auto-sklearn is based on scikit-learn and adds meta-learning and warm-starting so that it can reuse the results of similar dataset problems. Scikit-learn is a machine learning library for Python. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. [50]

Auto-sklearn provides out-of-the-box supervised machine learning. Built around the scikit-learn machine learning library, auto-sklearn automatically searches for the right learning algorithm for a new machine learning dataset and optimizes its hyperparameters. Thus, it frees the machine learning practitioner from these tedious tasks and allows her to focus on the real problem. [51]

Auto-sklearn extends the idea, introduced with Auto-WEKA, of configuring a general machine learning framework with efficient global optimization. To improve generalization, auto-sklearn builds an ensemble of all models tested during the global optimization process. To speed up the optimization process, auto-sklearn uses meta-learning to identify similar datasets and use knowledge gathered in the past. Auto-sklearn wraps a total of 15 classification algorithms and 14 feature preprocessing algorithms, and takes care of data scaling, encoding of categorical parameters and missing values. [51]
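To illustrate how auto-sklearn is typically invoked, the following is a minimal sketch of its classification interface; the dataset and the time budgets are illustrative values, not the configuration used in the benchmark.

# Minimal auto-sklearn sketch; dataset and time limits are illustrative only.
import autosklearn.classification
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # total search budget in seconds
    per_run_time_limit=30         # budget per evaluated pipeline
)
automl.fit(X_train, y_train)
print(accuracy_score(y_test, automl.predict(X_test)))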

5.2.2 TPOT

The Tree-Based Pipeline Optimization Tool (TPOT) was one of the very first AutoML methods and open-source software packages developed for the data science community. The goal of TPOT is to automate the building of ML pipelines by combining a flexible expression tree representation of pipelines with stochastic search algorithms such as genetic programming. TPOT makes use of the Python-based scikit-learn library as its ML menu. In essence, it optimizes scikit-learn pipelines via genetic programming, starting with simple ones and evolving them over generations. [30]
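A minimal sketch of this evolutionary workflow is shown below; the dataset, generation count and population size are illustrative and not the defaults used in our experiments.

# Minimal TPOT sketch; generations and population_size are illustrative.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # writes the evolved scikit-learn pipeline as plain Python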


5.2.3 H2O AutoML

H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.

H2O’s data parser has built-in intelligence to guess the schema of the incoming dataset and supports data ingest from multiple sources in various formats. [52]

H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit. Stacked Ensembles – one based on all previously trained models, another one on the best model of each family – will be automatically trained on collections of individual models to produce highly predictive ensemble models. The developers promote that H2O’s models are often at the top of AutoML leaderboards. [52]
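As a rough sketch of the H2O AutoML workflow, the snippet below trains on a tabular file and prints the leaderboard; the file name, the assumption that the last column is the target, and the time budget are all hypothetical placeholders.

# Minimal H2O AutoML sketch; "train.csv" and its target column are placeholders.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")
x = train.columns[:-1]          # assume the last column is the target
y = train.columns[-1]
train[y] = train[y].asfactor()  # treat the target as categorical (classification)

aml = H2OAutoML(max_runtime_secs=300, seed=1)
aml.train(x=x, y=y, training_frame=train)
print(aml.leaderboard)          # ranked models, including the Stacked Ensembles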

5.2.4 Random forest

As a baseline, the original benchmark had random forest based methods, and we will be using the same ones because it keeps the research comparable. Random forests are a learning method for, for example, classification that works by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean/average prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.

Baseline methods include a constant predictor, which always predicts the class prior, an untuned Random Forest, and a tuned Random Forest for which up to eleven unique values of max features are evaluated with cross-validation (as time permits), with the final model refitted using the optimal max features value. [1]
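The tuned baseline can be approximated with scikit-learn roughly as follows; this is only a sketch of the idea, not the exact benchmark implementation, and the candidate max_features grid and dataset are illustrative.

# Sketch of a tuned random forest baseline: cross-validate over max_features only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

param_grid = {"max_features": [0.1, 0.25, 0.5, 0.75, 1.0, "sqrt", "log2"]}
search = GridSearchCV(RandomForestClassifier(n_estimators=500, random_state=1),
                      param_grid, cv=10, refit=True)
search.fit(X_train, y_train)  # refit=True retrains on the best max_features value
print(search.best_params_, search.score(X_test, y_test))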


5.3 New frameworks

The frameworks that are new to this research are presented here in the same manner as the previously evaluated ones.

5.3.1 AutoKeras

Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. It is designed to enable fast experimentation with deep neural networks, and it focuses on being user-friendly, modular, and extensible. Keras contains numerous implementations of commonly used neural network building blocks such as layers, objectives, activation functions and optimizers, and a host of tools that make working with image and text data easier, simplifying the code necessary for writing deep neural networks [53]. The use of TensorFlow as its backend allows it to tap into the computer’s GPU for more effective processing.

AutoKeras is the AutoML system based on Keras, and it is developed at Texas A&M University. Like AutoML in general, AutoKeras intends to bring machine learning closer to the user and make it easier to use. AutoKeras exposes the powerful TensorFlow backend to the user in a very simple way. Keras is one of the best known and most used machine learning libraries, but AutoKeras has not found comparable success, at least for now. [53]
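A minimal sketch of AutoKeras on tabular data is shown below; the dataset, the number of trials and the number of epochs are illustrative choices only.

# Minimal AutoKeras sketch for tabular classification; max_trials and epochs are illustrative.
import autokeras as ak
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = ak.StructuredDataClassifier(max_trials=10)  # number of candidate architectures to try
clf.fit(X_train, y_train, epochs=10)
print(clf.evaluate(X_test, y_test))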

5.3.2 MLBox

MLBox is a powerful automated machine learning Python library. MLBox claims to offer fast reading and distributed data preprocessing/cleaning/formatting, highly robust feature selection and leak detection, accurate hyperparameter optimization in high-dimensional spaces, state-of-the-art predictive models for classification and regression, and prediction with model interpretation [54]. To stand out, MLBox focuses on drift identification, entity embedding and hyperparameter optimization. MLBox does not support unsupervised learning, but luckily we will be testing classification, which is well supported in MLBox [54].
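The typical MLBox workflow, as we understand it from its documentation, reads the train and test files, removes drifting features and then optimizes and fits a pipeline. In the sketch below the file names, the target column name, the search space and the evaluation budget are all hypothetical.

# Sketch of the MLBox workflow; paths, target name, space and budgets are hypothetical.
from mlbox.preprocessing import Reader, Drift_thresholder
from mlbox.optimisation import Optimiser
from mlbox.prediction import Predictor

paths = ["train.csv", "test.csv"]
target = "class"

data = Reader(sep=",").train_test_split(paths, target)  # read and split the files
data = Drift_thresholder().fit_transform(data)          # drop features that drift

space = {"est__strategy": {"search": "choice",
                           "space": ["LightGBM", "RandomForest"]}}
best = Optimiser(scoring="accuracy", n_folds=5).optimise(space, data, max_evals=10)
Predictor().fit_predict(best, data)                     # fit the best pipeline and predict the test set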


5.3.3 LightAutoML

LightAutoML is a project from the Sberbank AI Lab AutoML group, and it is a framework for automatic classification and model creation, which makes it a good fit for our research setup. At the moment LightAutoML enables the creation of a pipeline that performs automatic hyperparameter tuning and processing and feature selection, and it offers some easy-to-use graphical interfaces [55]. LightAutoML is also a framework that has been added to the benchmark since their research was published, so it will be interesting to document that framework’s performance.
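The sketch below shows the LightAutoML tabular preset as we understand it from its documentation; the data files, the column named "target" and the timeout are placeholders.

# Sketch of LightAutoML's tabular preset; train_df/test_df and "target" are placeholders.
import pandas as pd
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

train_df = pd.read_csv("train.csv")   # hypothetical training data with a "target" column
test_df = pd.read_csv("test.csv")

automl = TabularAutoML(task=Task("binary"), timeout=300)
oof_pred = automl.fit_predict(train_df, roles={"target": "target"})
test_pred = automl.predict(test_df)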

5.3.4 AutoGluon

AutoGluon is another AutoML tool for Python that automates machine learning tasks, enabling the user to easily achieve strong predictive performance in their applications. It describes itself as very easy to use, and its example actually uses just five lines of code, including the data input. AutoGluon leverages automatic hyperparameter tuning, model selection, architecture search, and data processing [https://auto.gluon.ai/stable/index.html]. AutoGluon was originally created by Amazon for Amazon Web Services, but it has since been open sourced. [56]
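The advertised brevity looks roughly like the sketch below; the file names, the label column and the time limit are placeholders.

# Sketch of AutoGluon's tabular API; file names and the label column are placeholders.
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")
test_data = TabularDataset("test.csv")

predictor = TabularPredictor(label="class").fit(train_data, time_limit=300)
print(predictor.leaderboard(test_data))  # per-model scores on the held-out data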

5.3.5 Oboe

Oboe and TensorOboe are automated model selection systems that use collaborative filtering to find good models for supervised learning tasks within a user-specified time limit. Further hyperparameter tuning can be performed afterwards [57]. We will be using the regular Oboe version, because Oboe does not support pip package installation and the TensorOboe package is slightly inconvenient to install.

The following is a quotation of how Oboe works from the makers of Oboe, from their paper OBOE: Collaborative Filtering for AutoML Model Selection: “Oboe is a collaborative filtering method for time-constrained model selection and hyperparameter tuning. Oboe forms a matrix of the cross-validated errors of a large number of supervised learning models (algorithms together with hyperparameters) on a large number of datasets, and fits a low rank model to learn the low-dimensional feature vectors for the models and datasets that best predict the cross-validated errors. To find promising models for a new dataset, Oboe runs a set of fast but informative algorithms on the new dataset and uses their cross-validated errors to infer the feature vector for the new dataset. Oboe can find good models under constraints on the number of models fit or the total time budget”.

Oboe basically works by searching for promising estimators. This brings up its biggest weakness, which is that it needs a pre-processed dataset to work: all features need to be standardized to have zero mean and unit variance. Oboe is still largely under development, particularly on its documentation side. Oboe is also one of the frameworks already added to the benchmark, but it was not available in the original paper of the benchmark.
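Because Oboe’s interface is still evolving, the following is only a sketch of the required standardization step together with an AutoLearner call as we understand it from the project’s examples; the import path, constructor arguments and runtime limit may differ between versions and should be treated as assumptions.

# Sketch of Oboe usage; the AutoLearner arguments follow the project's examples
# and may differ between versions.
from oboe import AutoLearner
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Oboe expects standardized features: zero mean, unit variance.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

m = AutoLearner(p_type="classification", runtime_limit=30)
m.fit(X_train, y_train)
y_pred = m.predict(X_test)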

5.3.6 ML-Plan

ML-Plan is a Java-based AutoML framework that builds on WEKA and scikit-learn to provide automated machine learning for Java users through the Eclipse IDE. It has been integrated into the larger AILibs project. It has also been added to the benchmark later, so it should be operable from the Python world. This framework will largely be testing the same things that have already been tested with Auto-WEKA and auto-sklearn, so including it might prove to be redundant if the results do not vary in some significant way. [58]

5.3.7 GAMA

GAMA, or General Automated Machine learning Assistant, is another AutoML tool that has already been added to the benchmark repository. GAMA’s technique is to automatically find a good machine learning pipeline. GAMA defines the pipeline as data preprocessing steps, various machine learning algorithms, and their possible hyperparameter configurations. GAMA also provides a command line tool where you can load your dataset directly, but it supports only some of the functionality of the full Python package. On top of that it has a dashboard, which is also still under development. It is obvious that GAMA has taken a lot from the other AutoML frameworks, and we have high hopes for it as it has been developed by one of the authors of the AutoML benchmark, Pieter Gijsbers. [59]
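A minimal sketch of the GAMA Python interface is shown below; the dataset and the total time budget are illustrative.

# Minimal GAMA sketch; max_total_time is an illustrative budget in seconds.
from gama import GamaClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = GamaClassifier(max_total_time=300)
automl.fit(X_train, y_train)
print(accuracy_score(y_test, automl.predict(X_test)))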

5.3.8 Ludwig

Ludwig is a “code-free” deep learning toolbox, developed by Uber, that also offers AutoML usage. Ludwig has been built on top of TensorFlow, and its goal is to make it very easy for users to train and test deep learning models. It has been built entirely in Python, and thus it also provides an API that lets more code-oriented users like us get some research done. [60]

Ludwig has drawn inspiration from other machine learning and automated machine learning tools such as WEKA and scikit-learn, and its developers admit it, as they did not want to “re-invent the wheel”. Ludwig provides three main functionalities: training models, using them to predict, and evaluating them. It is based on datatype abstraction, so that the same data preprocessing and postprocessing are performed on different datasets that share datatypes, and the same encoding and decoding models can be reused across several tasks. Of course, Ludwig also suffers from the same issue as ML-Plan because it is built on top of other systems. Whether it provides additional value will be shown during the research. [60]
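The datatype-based configuration looks roughly like the sketch below; the feature names, their types, the file names and the return-value handling are assumptions for an arbitrary tabular dataset, not Ludwig’s exact defaults.

# Sketch of Ludwig's Python API; feature names, types and file names are hypothetical.
from ludwig.api import LudwigModel

config = {
    "input_features": [
        {"name": "x1", "type": "numerical"},
        {"name": "x2", "type": "numerical"},
    ],
    "output_features": [
        {"name": "y", "type": "binary"},
    ],
}

model = LudwigModel(config)
train_results = model.train(dataset="train.csv")    # declarative, datatype-based training
predict_results = model.predict(dataset="test.csv")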


6. TESTING SETUP

As stated earlier, the data generation for the testing leaned heavily on the Open Source AutoML Benchmark that has already been mentioned a couple of times. Some fine-tuning was needed to be able to use the benchmark, but it worked as the basis of the research. This chapter covers the functionality of the benchmark, the data used for the tests and the technical setup, including the hardware and the environment the AutoML frameworks run in. Some of the frameworks could not be fit into the constraints of the benchmark, so they were tested separately with the same parameters and environment as the benchmark uses. This meant having to go through the benchmark quite thoroughly, so it is explained here as closely as needed.

6.1 Testing environment

Because most of the AutoML tools covered in this report are optimized for Linux, we had to create a virtual machine running Ubuntu. We used the latest 20.04 version even though the benchmark suggested using 18.04. This did not seem to have any negative effect on running the frameworks through the benchmark, and all the external packages were fully supported on the latest Ubuntu as well. We allocated 200 GB of storage, the full use of an Intel i7-9850H CPU and 16 GB of RAM for the virtual machine, giving it enough computing power to complete the necessary tasks.

Inside the virtual machine we created a Python virtual environment where the required packages were installed for each tested framework. We tried to use the latest possible Python version, 3.8, for each framework, but some of them needed older versions, either due to package dependencies or because they were lacking more recent updates. The frameworks that were tested without the benchmark tooling were run in a separate Python virtual environment that was identical in every other aspect, to keep the research balanced.

This setup differs from the original Open Source AutoML Benchmark in that they used Amazon Web Services to host their runs. In that sense the numbers and speeds recorded from our runs should not be compared directly with the numbers from the original research. The emphasis should be more in comparing the
