abstract = "Code bloat, the excessive increase of code size, is an
important issue in Genetic Programming (GP). This paper
proposes a theoretical analysis of code bloat in the
framework of symbolic regression in GP, from the
viewpoint of Statistical Learning Theory, a well
grounded mathematical toolbox for Machine Learning. Two
kinds of bloat must be distinguished in this context,
depending on whether or not the target function lies in
the search space. Several important mathematical results
are then proved using classical tools from Statistical
Learning Theory. Namely, the Vapnik-Chervonenkis
dimension of programs is computed, and further results
from Statistical Learning make it possible to prove that a
parsimonious fitness ensures Universal Consistency (the
solution minimising the empirical error converges to the
best possible error as the number of examples goes to
infinity). However, it is proved that the standard
approach of choosing a maximal program size that depends
on the number of examples might still result in programs
whose size grows without bound as their accuracy
increases; a more involved modification of the fitness is
therefore proposed that theoretically avoids unnecessary
bloat while preserving Universal Consistency.",