
As stated earlier, the data generation for the tests leaned heavily on the Open Source AutoML Benchmark mentioned in the previous chapters. Some fine-tuning was needed before the benchmark could be used, but it served as the basis of the research. This chapter covers the functionalities of the benchmark, the data used for the tests, and the technical setup, including the hardware and the environment the AutoML frameworks run in. Some of the frameworks could not be fitted into the constraints of the benchmark, so they were tested separately with the same parameters and environment as the benchmark uses. This required going through the benchmark quite thoroughly, so it is explained here in as much detail as needed.

6.1 Testing environment

Because most of the AutoML tools covered in this report are optimized for Linux, we created a virtual machine running Ubuntu. We used the latest 20.04 version even though the benchmark suggested 18.04. This did not seem to have any negative effect on running the frameworks through the benchmark, and all the external packages were fully supported on the latest Ubuntu as well. We allocated 200 GB of disk space, full use of the Intel i7-9850H CPU, and 16 GB of RAM to the virtual machine so that it had enough computing power to complete the necessary tasks.

Inside the virtual machine, we created a Python virtual environment in which the required packages were installed for each tested framework. We tried to use the latest possible Python version, 3.8, for each framework, but some of them needed older versions, either because of package dependencies or because the framework itself had not been updated recently. The frameworks that were tested without the benchmark tooling were run in a separate Python virtual environment that was identical in every other respect, to keep the comparison balanced.
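
To illustrate the idea, the following minimal Python sketch creates one isolated virtual environment per framework and installs that framework into it; the directory layout and the example package name are placeholders, not the exact versions pinned in our setup.

    import subprocess
    import venv
    from pathlib import Path

    def make_framework_env(name: str, package: str) -> Path:
        """Create a dedicated virtual environment for one framework and install it there."""
        env_dir = Path("envs") / name
        venv.EnvBuilder(with_pip=True).create(env_dir)
        pip = env_dir / "bin" / "pip"  # path layout on Ubuntu/Linux
        subprocess.run([str(pip), "install", package], check=True)
        return env_dir

    # Example call (hypothetical package pin): make_framework_env("autosklearn", "auto-sklearn")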

This setup differs from the original Open Source AutoML Benchmark in that the original runs were hosted on Amazon Web Services. For that reason, the numbers and run times recorded from our runs should not be compared directly with those reported in the original research. The emphasis should instead be on comparing the frameworks with each other and on their relative performance in that paper and in our research. In addition, the random forest and the constant predictor are used as baselines in the same way as in the Open Source AutoML Benchmark research.
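
As a point of reference, the two baselines can be approximated with scikit-learn as in the sketch below; the benchmark ships its own baseline implementations, so this only mirrors the idea, and the exact settings (number of trees, dummy strategy) are assumptions.

    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import RandomForestClassifier

    # Constant predictor: ignores the features and predicts the class priors.
    constant_baseline = DummyClassifier(strategy="prior")

    # Random forest baseline with an assumed, fixed number of trees.
    forest_baseline = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)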

6.2 Testing dataset

To conduct the research, some data is needed to evaluate the frameworks. This data is taken from openml.org, where ready-made tasks with datasets are available. The same data was used in the Open Source AutoML Benchmark, but we had to choose only a subset of the tasks, as some of them required far too much time and computing power to run in our setup. The different tasks are introduced without going into too much detail, mainly explaining the basic idea of each task used and what its goal is.

The datasets are widely used in AutoML papers, which is how they were originally selected for the Open Source AutoML Benchmark paper. The datasets differ from each other enough that they have varied numbers of samples and features, and some contain missing values while others do not. There should not be any tasks that are too easy for AutoML systems to solve. There are 13 different datasets used in the actual research, as well as some others that were used to validate the usability of the frameworks and their ease of use in a typical AutoML situation.

For our tests, the “Helena” dataset was used somewhat differently from the others. This dataset crashed the setup a few times, so it was excluded from the overall comparison, but because it received a lot of attention in the Open Source AutoML Benchmark research, it had to be included in some way. “Helena” was a dataset on which all the AutoML frameworks of the original setup failed to outperform the random forest baseline. Because of this, the “Helena” task was included to test whether there has been improvement in AutoML systems in general.

The tasks built on these datasets are all classification tasks; some are binary classification problems and some are multi-class classification problems. The two groups use different evaluation metrics: binary tasks are evaluated with the area under the ROC curve (AUC) and multi-class tasks with the logarithmic loss function. These are the same metrics that were used in the Open Source AutoML Benchmark, so it felt fitting to use them here as well.
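
The two metrics can be computed with scikit-learn as in the small sketch below; the toy labels and probabilities are made up purely for illustration.

    import numpy as np
    from sklearn.metrics import log_loss, roc_auc_score

    # Binary task: AUC from the predicted positive-class probabilities.
    y_true = np.array([0, 1, 1, 0, 1])
    p_pos = np.array([0.2, 0.8, 0.6, 0.4, 0.9])
    print("AUC:", roc_auc_score(y_true, p_pos))

    # Multi-class task: log loss over the full probability matrix (lower is better).
    y_multi = np.array([0, 2, 1])
    proba = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.2, 0.5, 0.3]])
    print("log loss:", log_loss(y_multi, proba))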

6.3 Benchmark’s functionalities

As stated previously, the benchmark is designed to run OpenML tasks on different AutoML frameworks. The results from these classification tasks are stored in CSV format and evaluated with AUC or log loss, depending on whether the task is a binary or a multi-class classification task. Each framework deliberately tries to optimize this metric during its run. Furthermore, the benchmark uses ten-fold cross-validation to estimate these measures.
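
The evaluation scheme can be sketched as follows with scikit-learn; the synthetic data and the random forest stand in for an actual OpenML task and an AutoML framework, and the benchmark naturally has its own splitting and result-storage logic.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                             cv=folds, scoring="roc_auc")
    print("mean AUC over 10 folds:", scores.mean())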

All the frameworks were used in an “out-of-the-box” manner as far as possible, because that is the most likely way for any user to use them. This means that each framework uses its default hyperparameter optimization settings as well as its default search space. The exception is, of course, the parameters that had to be specified for each framework to work at all: the available resources, such as how many CPU cores and how much memory a task may use. We also limited the runtime of each task, both because of time constraints and because runtime is one of the criteria the frameworks are evaluated on; the comparison would not be fair if one framework could run for one hour and another for, say, four hours.
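
As an example of the kind of resource limits we mean, the sketch below shows how such constraints could be passed to one framework, Autosklearn; the parameter names and values here are illustrative assumptions and differ between frameworks and between library versions.

    from autosklearn.classification import AutoSklearnClassifier

    automl = AutoSklearnClassifier(
        time_left_for_this_task=2 * 60 * 60,  # two-hour budget for the whole task
        n_jobs=6,                             # CPU cores made available to the search
        memory_limit=3072,                    # memory limit (MB) per fitting job
        seed=0,
    )
    # automl.fit(X_train, y_train) would then run the default search space within
    # these limits; X_train and y_train are placeholders here.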

The benchmark’s original design had three different sets of tasks, labeled “small”, “medium” and “large”. We decided to use our own sample of these, based mostly on the “medium” set. This was done because the “small” set seemed a bit too easy for the AutoML frameworks to solve, and the “large” set took far too much time to produce any reasonable results, as we used 2 hours as the maximum time for each task. This differs slightly from the original Open Source AutoML Benchmark, which ran with 1 hour on the “small” set and 4 hours on the “medium” set. In our experiments, increasing the time from 2 hours to 4 hours did not affect the overall scores or the relative ranking of the frameworks too much, so we decided to save some time and use the shorter limit.

With the benchmark design, it is quite easy to run our planned tests on the different frameworks. Specifying the datasets we wanted to use was just a matter of creating a YAML file that lists the desired datasets from the OpenML dataset base. Adding frameworks was in some cases quite easy, but sometimes extra steps were needed, particularly when a framework did not directly support a Python API or required some data manipulation before execution. Of course, this should not be a problem with good AutoML tools, and it is taken into consideration in the evaluation and mentioned with the results.
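
For reference, an individual OpenML task can also be fetched directly in Python with the openml package, as in the hedged sketch below; task id 31 (“credit-g”) is used purely as an illustration and is not necessarily part of our selection.

    import openml

    task = openml.tasks.get_task(31)  # a supervised classification task on OpenML
    dataset = task.get_dataset()
    X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)
    print(dataset.name, X.shape)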

As stated earlier, Autosklearn, Autoweka, H2Oautoml and Tpot, along with the baselines constant predictor and random forest, had already been added to the benchmark, so for those it was only a matter of rerunning the tests. Some problems did occur with Autoweka, since it had not been updated in a while, so we needed to use an older version of Python as well as of Ubuntu. Otherwise the existing test setup worked quite well.

Gama, Oboe and Lightautoml had likewise already been implemented in the benchmark. Of these, Lightautoml and Gama worked fine as they were, but for Oboe we needed to change some minor settings in its configuration file, mainly to keep it from crashing the whole virtual machine. We therefore had to restrict its capabilities slightly, which hopefully does not show too much in the results; in any case, Oboe’s results should be read with an asterisk.

Autokeras, MLBox, Autogluon and Ludwig were the frameworks we needed to add to the tests ourselves. Autogluon and MLBox were quite easy to integrate, and while we were doing this work the online community was tackling the same task; in the current release of the benchmark both can already be found working.

Autokeras and Ludwig, on the other hand, had to be run on their own in similar environments. Ludwig was far too difficult to get working within the benchmark, as it is meant to be used through its own user interface, and for some reason splitting the data in the benchmark’s style could not easily be achieved with Autokeras.