abstract = "Recent research \cite{poli:2007:eurogp}\cite{1277277}
has enabled the accurate prediction of the limiting
distribution of tree sizes for Genetic Programming with
standard subtree-swapping crossover when GP is applied
to a flat fitness landscape. In that work, however,
tree sizes are measured in terms of number of internal
nodes. While the relationship between internal nodes
and length is one-to-one for the case of a-ary trees,
it is much more complex in the case of mixed arities.
So, in practice, the length bias of subtree crossover
remains unknown. This paper starts to fill this
theoretical gap by providing accurate estimates of the
limiting distribution of lengths approached by
tree-based GP with standard crossover in the absence of
selection pressure. The resulting models confirm that
short programs can be expected to be heavily resampled.
Empirical validation shows that this is indeed the
case. We also study empirically how the situation is
modified by the application of program length limits.
Surprisingly, the introduction of such limits further
exacerbates the effect. However, this has more profound
consequences than one might imagine at first. We
analyse these consequences and predict that, in the
presence of fitness, size limits may initially speed up
bloat, almost completely defeating their original
purpose (combating bloat). Indeed, experiments confirm
that this is the case for the first 10 or 15
generations. This leads us to suggest a better way of
using size limits. Finally, this paper proposes
sampling parsimony, a novel technique to counteract
bloat that applies a penalty to resampling.",
notes = "Also known as \cite{conf/eurogp/DignumP08a}