On Search Space Constraining Methods for Automatic Composition and Optimisation of Machine Learning Pipelines
Created by W.Langdon from
gp-bibliography.bib Revision:1.8564
- @PhdThesis{Tien-Dung_Nguyen:thesis,
-
author = "Tien Dung Nguyen",
-
title = "On Search Space Constraining Methods for Automatic
Composition and Optimisation of Machine Learning
Pipelines",
-
school = "Faculty of Engineering and Information Technology,
University of Technology Sydney",
-
year = "2023",
-
address = "Australia",
-
month = mar,
-
keywords = "genetic algorithms, genetic programming, TPOT",
-
URL = "
https://opus.lib.uts.edu.au/handle/10453/171452",
-
URL = "
http://hdl.handle.net/10453/171452",
-
URL = "
https://opus.lib.uts.edu.au/bitstream/10453/171452/1/thesis.pdf",
-
size = "326 pages",
-
abstract = "Automated machine learning (AutoML) has been developed
and studied to automate the process of collecting,
preprocessing and integrating data, composing and
optimising ML pipelines, and deploying and maintaining
predictive models. A machine learning (ML) pipeline is
a workflow consisting of many components to perform
data transformation, data preprocessing, feature
engineering and classification/regression tasks. ML
pipelines have been used to build predictive models for
a variety of ML problems. One of the most important
research topics of AutoML is ML pipeline composition
and optimisation (PCO). PCO processes search for valid
and well-performing ML pipelines in a given search
space. A search space consists of ML components, i.e.
preprocessing and predictor/meta-predictor components,
component hyperparameters and pipeline structures that
link ML components. Due to the large size of search
spaces, it is challenging for PCO processes to find
valid and well-performing pipelines. The aim of the
dissertation is to study methods/strategies to
constrain a search space to enable PCO processes to
more easily and efficiently find valid and
well-performing pipelines. This dissertation has three
main contributions. The first contribution is a novel
method, AVATAR, to constrain search space to consider
only valid ML pipelines. The AVATAR eliminates invalid
pipelines by assessing ML pipeline validity using a
surrogate model, a Petri-net approach based on
consideration of simplified pipelines not requiring the
use of training data or its processing. The second
contribution is a critical evaluation of the so-called
opportunistic meta-knowledge based on previous
experience in the form of PCO runs and predictor
evaluations with default hyperparameters. Although such
meta-knowledge is inherently noisy and with high
statistical variability, we found that it is still very
useful for constraining search spaces with promising,
well-performing ML components. The third contribution
is an exploration of different search space reduction
strategies employing that meta-knowledge. The results
show that the reduction of search space should not be
extreme. In addition, reducing the search space based
on the knowledge of the problem itself enables PCO
processes to deliver the best-performing ML pipelines,
followed by the knowledge of strong general performers.
Finally, the results obtained in this study can form
the basis of and be extended in the future to
dynamically control search spaces to find better ML
pipelines by combining the knowledge of the ML problems
solved in the past and the knowledge of the problem
itself acquired from the run of the PCO processes.",
-
notes = "Supervisors: Bogdan Gabrys and Katarzyna Musial",
- }