STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning Pipeline Facilitating Data Analysis and Algorithm Comparison
Created by W.Langdon from
gp-bibliography.bib Revision:1.8051
@InProceedings{Urbanowicz:2022:GPTP,
  author =       "Ryan Urbanowicz and Robert Zhang and Yuhan Cui and
                 Pranshu Suri",
  title =        "STREAMLINE: A Simple, Transparent, End-To-End
                 Automated Machine Learning Pipeline Facilitating Data
                 Analysis and Algorithm Comparison",
  booktitle =    "Genetic Programming Theory and Practice XIX",
  year =         "2022",
  editor =       "Leonardo Trujillo and Stephan M. Winkler and
                 Sara Silva and Wolfgang Banzhaf",
  series =       "Genetic and Evolutionary Computation",
  pages =        "201--231",
  address =      "Ann Arbor, USA",
  month =        jun # " 2-4",
  publisher =    "Springer",
  keywords =     "genetic algorithms, genetic programming",
  isbn13 =       "978-981-19-8459-4",
  DOI =          "doi:10.1007/978-981-19-8460-0_9",
  abstract =     "Machine learning (ML) offers powerful methods for
                 detecting and modeling associations often in data with
                 large feature spaces and complex associations. Many
                 useful tools/packages (e.g. scikit-learn) have been
                 developed to make the various elements of data
                 handling, processing, modeling, and interpretation
                 accessible. However, it is not trivial for most
                 investigators to assemble these elements into a
                 rigorous, replicatable, unbiased, and effective data
                 analysis pipeline. Automated machine learning (AutoML)
                 seeks to address these issues by simplifying the
                 process of ML analysis for all. Here, we introduce
                 STREAMLINE, a simple, transparent, end-to-end AutoML
                 pipeline designed as a framework to easily conduct
                 rigorous ML modeling and analysis (limited initially to
                 binary classification). STREAMLINE is specifically
                 designed to compare performance between datasets, ML
                 algorithms, and other AutoML tools. It is unique among
                 other autoML tools by offering a fully transparent and
                 consistent baseline of comparison using a carefully
                 designed series of pipeline elements including (1)
                 exploratory analysis, (2) basic data cleaning, (3)
                 cross validation partitioning, (4) data scaling and
                 imputation, (5) filter-based feature importance
                 estimation, (6) collective feature selection, (7) ML
                 modeling with Optuna hyperparameter optimization across
                 15 established algorithms (including less well-known
                 Genetic Programming and rule-based ML), (8) evaluation
                 across 16 classification metrics, (9) model feature
                 importance estimation, (10) statistical significance
                 comparisons, and (11) automatically exporting all
                 results, plots, a PDF summary report, and models that
                 can be easily applied to replication data.",
  notes =        "Part of \cite{Banzhaf:2022:GPTP} published after the
                 workshop in 2023",
}