4. AUTOMATED MACHINE LEARNING

4.2 Meta-Learning

Meta-learning is essentially learning to learn: it is the science of systematically detecting how different machine learning approaches perform on a wide range of learning tasks, and then learning from this experience, or meta-data, to learn new tasks much more quickly than otherwise possible [38]. Meta-learning can speed up and improve the design of machine learning pipelines or neural architectures, and it also permits us to replace hand-engineered algorithms with novel approaches learned in a data-driven way [30].

The challenge in meta-learning is to learn from previous experience in a methodical, data-driven way. First, we need to gather meta-data that describe prior learning tasks and previously learned models. The meta-data comprise, for example, algorithm configurations, hyperparameter settings, accuracies, training times and model parameters, among numerous others [39]. Second, we need to learn from this prior meta-data to extract and transfer knowledge that directs the search for optimal models for new tasks. [30]
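As a concrete illustration, the meta-data of a single prior evaluation can be stored as a simple record. The sketch below is only illustrative; the field names are hypothetical and merely mirror the kinds of quantities listed above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetaDataRecord:
    """One prior evaluation: which configuration was run on which task, and how it did."""
    task_id: str                 # identifier of the prior learning task
    algorithm: str               # e.g. "random_forest"
    hyperparameters: dict        # the evaluated configuration
    accuracy: float              # observed predictive performance
    training_time: float         # wall-clock seconds needed to train
    model_parameters: Optional[bytes] = None  # optionally, the serialized learned model

# a meta-dataset is then simply a collection of such records
meta_dataset: list = []
```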

We will be looking at meta-learning techniques in the subsequent subchapters from three distinct angles. First, we will discuss how to learn from model evaluations. Next, we will focus on how to characterize tasks clearly and build models that can learn the relationships between these characteristics and performance. Finally, we discuss how to transfer learned model parameters between similar tasks. [40]

4.2.1 Model Evaluations

If we have access to prior tasks, we can add their previous evaluations and use them to train a meta-learner. The meta-learner can then forecast recommended configurations for a new task. Sometimes the training can also be warm-started with some initial data generated by another method. [30]
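A minimal sketch of such warm-starting, assuming prior evaluations are available as (configuration, accuracy) pairs per task: the configurations that performed best on prior tasks are used as the first candidates evaluated on the new task. The function name and data layout are illustrative, not a fixed API.

```python
def warm_start_candidates(prior_evaluations, k=5):
    """Pick the k configurations with the best average accuracy across prior tasks.

    prior_evaluations: dict mapping a hashable configuration to a list of
    accuracies observed on prior tasks.
    """
    ranked = sorted(prior_evaluations.items(),
                    key=lambda item: sum(item[1]) / len(item[1]),
                    reverse=True)
    return [config for config, _ in ranked[:k]]

# usage: evaluate these k configurations first on the new task, then let the
# regular hyperparameter optimizer continue from their results
prior = {("C", 1.0): [0.81, 0.78], ("C", 10.0): [0.86, 0.84], ("C", 0.1): [0.70, 0.73]}
initial_configs = warm_start_candidates(prior, k=2)
```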

In task-independent recommendations we do not have access to previous evaluations of the task at hand, but we still use a common function to produce configurations. These configurations are usually ranked and evaluated by success rates or other evaluation measures. Configuration space design is likewise independent of the task, but it uses prior evaluations to learn an improved configuration space [41]. This has turned out to be a very important aspect in comparisons of AutoML systems, and it focuses on learning optimal default settings for hyperparameters.

Default values can be learned jointly for all hyperparameters of an algorithm by first training surrogate models for that algorithm on a large number of tasks [30]. Next, a large number of configurations is sampled, and the configuration that minimizes the average risk across all tasks is recommended as the default configuration. Finally, the importance of each hyperparameter is estimated by observing how much improvement can still be gained by tuning it. [30]
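The following sketch outlines this default-learning idea under simplifying assumptions: one random-forest surrogate is fitted per prior task on (configuration, accuracy) pairs, candidate configurations are scored by every surrogate, and the candidate with the lowest average predicted risk is returned. The function and argument names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def learn_default(config_matrix_per_task, scores_per_task, candidate_configs):
    """Pick the candidate configuration with the lowest average predicted risk.

    config_matrix_per_task: list of (n_i, d) arrays of evaluated configurations per prior task
    scores_per_task:        list of (n_i,) arrays of accuracies for those configurations
    candidate_configs:      (m, d) array of configurations to choose the default from
    """
    # 1. train one surrogate model per prior task
    surrogates = []
    for X, y in zip(config_matrix_per_task, scores_per_task):
        surrogates.append(RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y))

    # 2. predicted risk (1 - accuracy) of every candidate on every task
    risks = np.stack([1.0 - s.predict(candidate_configs) for s in surrogates])

    # 3. the default is the candidate with the lowest risk averaged over tasks
    return candidate_configs[np.argmin(risks.mean(axis=0))]
```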

If we want to deliver recommendations for a specific task, we need additional information on how similar that task is to prior tasks. One way to do this is to evaluate several recommended (or potentially random) configurations on the new task, yielding new evidence [41]. If we then observe that these evaluations are similar to the evaluations of the same configurations on a prior task, the two tasks can be considered intrinsically similar, based on empirical evidence. We can include this knowledge when training a meta-learner that predicts a recommended set of configurations for the new task. [30]
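A simplified sketch of this similarity estimate, assuming the same probe configurations have been evaluated on every prior task: the scores on the new task are correlated with the scores on each prior task, and the most highly correlated prior task is treated as the most similar. Plain Pearson correlation is used here purely for illustration.

```python
import numpy as np

def most_similar_task(probe_scores_new, probe_scores_prior):
    """Return the index of the prior task whose evaluations most resemble the new task.

    probe_scores_new:   (k,) scores of k probe configurations on the new task
    probe_scores_prior: (n_tasks, k) scores of the same k configurations on prior tasks
    """
    correlations = [np.corrcoef(probe_scores_new, prior)[0, 1]
                    for prior in probe_scores_prior]
    return int(np.argmax(correlations))
```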

We can also extract meta-data about the training process itself, such as how quickly the model performance improves when more training data is added. If we divide the training into steps, usually adding a specific number of training examples at every step, we can measure the performance of a configuration on a task after each step, yielding a learning curve across the time steps. Learning curves are also used to speed up hyperparameter optimization on a given task; in meta-learning, learning curve information is transferred across tasks. [30]
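One simple way to transfer learning curve information, sketched below under illustrative assumptions: train the configuration on the first few steps of the new task, find the prior task whose early learning curve is closest, and use the end of that task's full curve as an estimate of the final performance.

```python
import numpy as np

def extrapolate_from_curves(partial_curve_new, curves_prior):
    """Predict the final score on the new task from prior learning curves.

    partial_curve_new: (s,) scores on the new task after the first s training steps
    curves_prior:      (n_tasks, S) full learning curves on prior tasks, S >= s
    """
    s = len(partial_curve_new)
    # prior task whose early curve is closest (Euclidean distance) to the new one
    distances = np.linalg.norm(curves_prior[:, :s] - partial_curve_new, axis=1)
    nearest = np.argmin(distances)
    # use the end of that task's curve as the estimate of final performance
    return curves_prior[nearest, -1]
```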

4.2.2 Task properties

Another rich source of meta-data are characterizations (meta-features) of the task at hand. Each task can be described as a vector of meta-features, which can be used to define a similarity measure between tasks. We can then transfer information from the most similar prior task to the new one. After this, we can train a meta-learner to predict the performance of specific configurations on the new task. [30]
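The similarity measure itself can be as simple as a distance in meta-feature space. The sketch below, with hypothetical names, standardizes the meta-feature vectors and returns the prior tasks nearest to the new task.

```python
import numpy as np

def nearest_prior_tasks(meta_features_new, meta_features_prior, k=3):
    """Indices of the k prior tasks closest to the new task in meta-feature space.

    meta_features_new:   (d,) meta-feature vector of the new task
    meta_features_prior: (n_tasks, d) meta-feature vectors of prior tasks
    """
    # standardize each meta-feature so that no single one dominates the distance
    mean = meta_features_prior.mean(axis=0)
    std = meta_features_prior.std(axis=0) + 1e-12
    prior = (meta_features_prior - mean) / std
    new = (meta_features_new - mean) / std

    distances = np.linalg.norm(prior - new, axis=1)
    return np.argsort(distances)[:k]
```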

Some of the common meta-features in machine learning are, for instance, the number of instances, the number of features, the number of classes, class entropy, data consistency and information gain, just to name a few [42]. All of these meta-features have their own meaning and reasoning for why they are important in optimizing a model. These reasons can be as simple as the speed or scalability of the model, or more complex ones such as feature interdependence or the noisiness of the data [30].

To build a feature vector, one needs to select and further process these meta-features. Many meta-features are computed on single features, or combinations of features, and need to be aggregated by summary statistics; one needs to extract and aggregate them systematically [43]. Beyond these general-purpose meta-features, many more specific ones have been formulated. For streaming data one can use streaming landmarks, for time series data one can compute autocorrelation coefficients or the slope of regression models, and for unsupervised problems one can cluster the data in different ways and extract properties of these clusters [30].
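As a small illustration of both steps, the sketch below computes a few of the dataset-level meta-features named above and aggregates a per-feature statistic (skewness) into a single summary value. The selection of meta-features is purely an example.

```python
import numpy as np

def simple_meta_features(X, y):
    """Compute a small meta-feature vector for a classification dataset (X, y)."""
    n_instances, n_features = X.shape
    classes, counts = np.unique(y, return_counts=True)

    # class entropy of the label distribution
    p = counts / counts.sum()
    class_entropy = -np.sum(p * np.log2(p))

    # per-feature skewness, aggregated with a summary statistic (mean over features)
    feature_skew = np.mean(
        ((X - X.mean(axis=0)) ** 3).mean(axis=0) / (X.std(axis=0) ** 3 + 1e-12))

    return np.array([n_instances, n_features, len(classes), class_entropy, feature_skew])
```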

So far we have only talked about meta-features in general, but it is also possible to learn a joint representation for groups of tasks. One method is to build meta-models that generate a landmark-like meta-feature representation from the meta-features of other tasks, and train on that. In a simpler setting, we can take prior tasks and their configurations and test whether new configurations outperform the old ones. [42, 30]

We can also learn the complex relationship between a task's meta-features and the usefulness of specific configurations by building a meta-model that recommends the most beneficial configurations given the meta-features of the new task. Such meta-models can, for example, create a ranking of the best configurations or predict the performance of a configuration when given access to the meta-features. [30]
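A sketch of such a meta-model under simplifying assumptions: prior (task meta-features, configuration, performance) triples are used to fit a regressor, which then ranks candidate configurations for a new task by predicted performance. Names and the choice of regressor are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_meta_model(meta_features, configs, performances):
    """Fit a meta-model on prior tasks: (task meta-features + configuration) -> performance."""
    inputs = np.hstack([meta_features, configs])
    return RandomForestRegressor(n_estimators=200, random_state=0).fit(inputs, performances)

def rank_configurations(meta_model, meta_features_new, candidate_configs):
    """Rank candidate configurations for a new task by predicted performance (best first)."""
    repeated = np.tile(meta_features_new, (len(candidate_configs), 1))
    predicted = meta_model.predict(np.hstack([repeated, candidate_configs]))
    return np.argsort(predicted)[::-1]
```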

4.2.3 Learning from prior models

The final type of meta-data we can learn from are prior machine learning models themselves, that is, their structure and learned model parameters. In this approach we want to train a meta-learner that learns how to train a learner for a new task, given comparable tasks and the corresponding models. The learner can usually be defined by its parameters or its configuration. [30]

In transfer learning, we take models trained on one or more source tasks and use them as starting points for creating a model on a similar target task. This can be done by forcing the target model to be structurally or otherwise similar to the source model. The approach can be applied broadly, and transfer learning methods have been used, or at least proposed, for Bayesian networks, clustering, kernel methods and reinforcement learning, which is the most interesting for our eventual research [44]. Transfer learning is, however, particularly well suited to neural networks. Meta-learning is certainly not limited to (semi-)supervised tasks and has been effectively applied to solve tasks as varied as reinforcement learning, active learning, density estimation and item recommendation. The base-learner may be unsupervised while the meta-learner is supervised, but other combinations are certainly possible as well [30].
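For neural networks, parameter transfer is often a simple warm start: the learned weights of a source model are reused and only the task-specific parts are replaced and fine-tuned. The sketch below uses PyTorch with a hypothetical small feed-forward architecture purely for illustration.

```python
import copy
import torch.nn as nn

def transfer_model(source_model, n_new_classes):
    """Warm-start a target model from a source model trained on a related task.

    source_model is assumed to be an nn.Sequential whose last layer is the classifier.
    """
    target_model = copy.deepcopy(source_model)                 # reuse all learned parameters
    in_features = target_model[-1].in_features
    target_model[-1] = nn.Linear(in_features, n_new_classes)   # new task-specific output layer

    # optionally freeze the transferred layers and fine-tune only the new head
    for layer in list(target_model)[:-1]:
        for param in layer.parameters():
            param.requires_grad = False
    return target_model

# hypothetical source network: two hidden layers, trained on a 10-class source task
source = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                       nn.Linear(64, 64), nn.ReLU(),
                       nn.Linear(64, 10))
target = transfer_model(source, n_new_classes=3)
```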

We should never have to start entirely from scratch. Instead, we should systematically collect our 'learning experiences' and learn from them to build AutoML systems that continuously improve over time, helping us tackle new learning problems ever more efficiently. The more new tasks we encounter, and the more similar those new tasks are, the more we can gain from prior experience, to the point where most of the required learning has already been done earlier. [30]