
5. FRAMEWORK INTRODUCTION

5.2 Already tested systems re-evaluated

The first prominent AutoML tool was Auto-WEKA, which used Bayesian optimization to select and tune the algorithms in a machine learning pipeline based on WEKA [17]. The WEKA workbench is a collection of machine learning algorithms and data preprocessing tools that provides support for experimenting with data mining, evaluating learning schemes and visualizing the results of learning. WEKA has a graphical interface, but it can also be used from a Python wrapper, which is what we will be doing in our research.

The workbench includes methods for the main data mining problems: regression, classification, clustering, association rule mining, and attribute selection. Getting to know the data is an integral part of the work, and many data visualization facilities and data preprocessing tools are provided. All algorithms take their input in the form of a single relational table that can be read from a file or generated by a database query [48].

Auto-WEKA is an AutoML system built on WEKA, designed to help users by automatically searching through the joint space of WEKA’s learning algorithms and their respective hyperparameter settings to maximize performance. Each of the algorithms present in WEKA has its own hyperparameters that can drastically change its performance, and there is a staggeringly large number of possible alternatives overall. Auto-WEKA considers the problem of simultaneously selecting a learning algorithm and setting its hyperparameters, going beyond previous methods that address these issues in isolation. Auto-WEKA does this using a fully automated approach, leveraging recent innovations in Bayesian optimization. [49]
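The joint algorithm-selection-and-tuning problem can be illustrated with a minimal scikit-learn sketch. Auto-WEKA itself searches WEKA's algorithms with Bayesian optimization; here we only use plain random search over a hypothetical three-algorithm space to show what "one search space covering both choices" means:

```python
# Minimal sketch of the joint "select an algorithm AND tune its
# hyperparameters" problem that Auto-WEKA solves. Auto-WEKA uses
# Bayesian optimization over WEKA algorithms; this illustration uses
# random search over scikit-learn models instead.
import random

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

random.seed(0)

# The joint search space: each candidate is an algorithm plus a sampler
# for its hyperparameters (names and ranges chosen for illustration).
search_space = [
    (SVC, lambda: {"C": 10 ** random.uniform(-2, 2)}),
    (RandomForestClassifier, lambda: {"n_estimators": random.randint(10, 100)}),
    (KNeighborsClassifier, lambda: {"n_neighbors": random.randint(1, 15)}),
]

X, y = load_iris(return_X_y=True)
best_score, best_model = -1.0, None
for _ in range(20):  # 20 random trials drawn across ALL algorithms at once
    cls, sample = random.choice(search_space)
    params = sample()
    score = cross_val_score(cls(**params), X, y, cv=3).mean()
    if score > best_score:
        best_score, best_model = score, cls(**params)

print(best_model, round(best_score, 3))
```

The key point is that a single optimization loop ranges over algorithm choice and hyperparameter values together, instead of tuning each algorithm in isolation and comparing afterwards.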

5.2.1 Auto-sklearn

Auto-sklearn is based on scikit-learn and adds meta-learning and warm-starting so that it can use the results from similar dataset problems. Scikit-learn is a machine learning library created for Python. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. [50]

Auto-sklearn provides out-of-the-box supervised machine learning. Built around the scikit-learn machine learning library, auto-sklearn automatically searches for the right learning algorithm for a new machine learning dataset and optimizes its hyperparameters. Thus, it frees the machine learning practitioner from these tedious tasks and allows her to focus on the real problem. [51]

Auto-sklearn extends the idea, introduced with Auto-WEKA, of configuring a general machine learning framework with efficient global optimization. To improve generalization, auto-sklearn builds an ensemble of all models tested during the global optimization process. To speed up the optimization process, auto-sklearn uses meta-learning to identify similar datasets and reuse knowledge gathered in the past. Auto-sklearn wraps a total of 15 classification algorithms and 14 feature preprocessing algorithms, and takes care of data scaling, encoding of categorical parameters and missing values. [51]
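The ensemble-building step can be sketched in isolation. Auto-sklearn uses greedy ensemble selection over the validation predictions of the models it has already evaluated; the sketch below reproduces that idea with simulated prediction vectors instead of real fitted models (the accuracies and sizes are invented for illustration):

```python
# Sketch of greedy ensemble selection, the technique auto-sklearn uses
# to combine the models evaluated during optimization. The "models"
# here are simulated validation prediction vectors, not real estimators.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)          # binary validation labels
# Simulated validation predictions of 15 already-evaluated models, each
# correct with a different probability.
accs = rng.uniform(0.55, 0.75, size=15)
preds = np.array([np.where(rng.random(200) < a, y_true, 1 - y_true)
                  for a in accs])

def ensemble_error(idx):
    """Validation error of a majority-vote ensemble of the models in idx."""
    vote = preds[idx].mean(axis=0) >= 0.5
    return float(np.mean(vote != y_true))

# Greedy forward selection with replacement: repeatedly add whichever
# model lowers the ensemble's validation error the most, then keep the
# best ensemble size seen along the way.
ensemble, history = [], []
for _ in range(10):
    best = min(range(len(preds)), key=lambda m: ensemble_error(ensemble + [m]))
    ensemble.append(best)
    history.append(ensemble_error(ensemble))

single_best = min(float(np.mean(p != y_true)) for p in preds)
best_err = min(history)
print(best_err, single_best)
```

Because the first greedy pick is always the single best model, the selected ensemble can never be worse on the validation set than the best individual model, which is why this step tends to improve generalization.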

5.2.2 TPOT

The Tree-Based Pipeline Optimization Tool (TPOT) was one of the very first AutoML methods and open-source software packages developed for the data science community. The goal of TPOT is to automate the building of ML pipelines by combining a flexible expression tree representation of pipelines with stochastic search algorithms such as genetic programming. TPOT uses the Python-based scikit-learn library as its ML menu. In essence, it optimizes scikit-learn pipelines via genetic programming, starting with simple ones and evolving them over generations. [30]
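The evolutionary idea behind TPOT can be sketched with a toy loop. Real TPOT evolves full expression trees of scikit-learn operators; this simplified, hypothetical version only mutates a flat (preprocessor, classifier) pair and keeps the fitter half of a tiny population each generation:

```python
# Toy version of TPOT's approach: evolve scikit-learn pipelines with a
# genetic-style loop. Real TPOT evolves expression trees; this sketch
# only mutates a flat (preprocessor, classifier) genome.
import random

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

random.seed(0)
PREPROCESSORS = [StandardScaler, MinMaxScaler, lambda: PCA(n_components=5)]
CLASSIFIERS = [lambda: LogisticRegression(max_iter=1000), DecisionTreeClassifier]

X, y = load_breast_cancer(return_X_y=True)

def fitness(genome):
    """Cross-validated accuracy of the pipeline encoded by the genome."""
    pre, clf = genome
    return cross_val_score(make_pipeline(pre(), clf()), X, y, cv=3).mean()

def mutate(genome):
    """Randomly swap either the preprocessor or the classifier gene."""
    pre, clf = genome
    if random.random() < 0.5:
        return (random.choice(PREPROCESSORS), clf)
    return (pre, random.choice(CLASSIFIERS))

# Evolve a small population: keep the fitter half, refill with mutants.
population = [(random.choice(PREPROCESSORS), random.choice(CLASSIFIERS))
              for _ in range(6)]
for generation in range(4):
    population.sort(key=fitness, reverse=True)
    population = population[:3] + [mutate(g) for g in population[:3]]

best_score = fitness(population[0])
print(round(best_score, 3))
```

TPOT adds crossover, multi-objective selection (accuracy versus pipeline size) and a much richer operator set on top of this basic evolve-and-select loop.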

5.2.3 H2O AutoML

H2O is an open-source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows users to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.

H2O’s data parser has built-in intelligence to guess the schema of the incoming dataset and supports data ingest from multiple sources in various formats. [52]

H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit.

Stacked Ensembles – one based on all previously trained models, another one on the best model of each family – will be automatically trained on collections of individual models to produce highly predictive ensemble models. H2O promotes that its models are often at the top of AutoML leaderboards. [52]
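H2O trains its Stacked Ensembles through its own API; the stacking idea itself, however, can be shown with scikit-learn's `StackingClassifier`, which we use here purely as a stand-in. Base models' cross-validated predictions become the inputs of a meta-learner (H2O defaults to a GLM; logistic regression plays that role below):

```python
# Illustration of the stacked-ensemble idea (NOT H2O's API): base
# models' out-of-fold predictions feed a logistic-regression
# meta-learner via scikit-learn's StackingClassifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
]
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000))
score = cross_val_score(stack, X, y, cv=3).mean()
print(round(score, 3))
```

H2O's "all models" ensemble corresponds to stacking every model trained during the AutoML run, while its "best of family" ensemble stacks only the top model of each algorithm family.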

5.2.4 Random forest

The original benchmark used random forest based baselines, and we will use the same ones to keep our results comparable. Random forests are an ensemble learning method for tasks such as classification and regression; they work by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

Random decision forests correct for decision trees' habit of overfitting to their training set.

Baseline methods include a constant predictor, which always predicts the class prior; an untuned random forest; and a tuned random forest, for which up to eleven unique values of max_features are evaluated with cross-validation (as time permits), with the final model refit using the optimal max_features value. [1]
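The tuned baseline can be sketched with scikit-learn, which stands in here for the benchmark's actual implementation; the specific grid of max_features values below is illustrative, not the benchmark's exact grid:

```python
# Sketch of the tuned random forest baseline: evaluate up to eleven
# max_features values with cross-validation, then refit with the best
# one (GridSearchCV's refit=True default does the refitting).
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Up to eleven unique max_features values, spread over the feature range.
n_features = X.shape[1]
grid = {"max_features": sorted({max(1, int(f))
                                for f in np.linspace(1, n_features, 11)})}

search = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                      grid, cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_, round(search.score(X_te, y_te), 3))
```

The untuned baseline corresponds to simply fitting `RandomForestClassifier()` with its defaults, and the constant predictor to scikit-learn's `DummyClassifier(strategy="prior")`.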