Current Challenges of Symbolic Regression: Optimization, Selection, Model Simplification, and Benchmarking
Created by W.Langdon from
gp-bibliography.bib Revision:1.8721
@PhdThesis{Aldeia:thesis,
author = "Guilherme Seidyo Imai Aldeia",
title = "Current Challenges of Symbolic Regression:
Optimization, Selection, Model Simplification, and
Benchmarking",
school = "Computer Science of the Federal University of ABC",
year = "2025",
address = "Santo Andre, Sao Paulo, Brazil",
month = "1 " # dec,
keywords = "genetic algorithms, genetic programming, symbolic
regression, simplification, e-lexicase selection,
non-linear optimization, multi-objective optimization,
ITEA, FEAT, Pareto, MOGP, NSGA2, PTC2",
URL = "https://arxiv.org/abs/2512.01682",
size = "192 pages",
abstract = "Symbolic Regression (SR) is a regression method that
aims to discover mathematical expressions that describe
the relationship between variables, and it is often
implemented through Genetic Programming, a metaphor for
the process of biological evolution. Its appeal lies in
combining predictive accuracy with interpretable
models, but its promise is limited by several
long-standing challenges: parameters are difficult to
optimize, the selection of solutions can affect the
search, and models often grow unnecessarily complex. In
addition, current methods must be constantly
re-evaluated to understand the SR landscape. This
thesis addresses these challenges through a sequence of
studies conducted throughout the doctorate, each
focusing on an important aspect of the SR search
process. First, I investigate parameter optimization,
obtaining insights into its role in improving
predictive accuracy, albeit with trade-offs in runtime
and expression size. Next, I study parent selection,
exploring epsilon-lexicase to select parents more
likely to generate good performing offspring. The focus
then turns to simplification, where I introduce a novel
method based on memoization and locality-sensitive
hashing that reduces redundancy and yields simpler,
more accurate models. All of these contributions are
implemented into a multi-objective evolutionary SR
library, which achieves Pareto-optimal performance in
terms of accuracy and simplicity on benchmarks of
real-world and synthetic problems, outperforming
several contemporary SR approaches. The thesis
concludes by proposing changes to a famous large-scale
symbolic regression benchmark suite, then running the
experiments to assess the symbolic regression
landscape, demonstrating that a SR method with the
contributions presented in this thesis achieves
Pareto-optimal performance.",
notes = "Supervisors: Fabricio Olivetti de Franca and William La
Cava",
}