abstract = "we address the problem of defining a measure of
diversity for a population of individuals whose genome
can be subjected to major reorganisations during the
evolutionary process. To this end, we introduce a
measure of diversity for populations of strings of
variable length defined on a finite alphabet, and from
this measure we derive a semi-metric distance between
pairs of strings. The definitions are based on counting
the number of substrings of the strings, considered
first separately and then collectively. This approach
is related to the concept of linguistic complexity,
whose definition we generalise from single strings to
populations. Using the substring count approach we also
define a new kind of Tanimoto distance between strings.
We show how to extend the approach to representations
that are not based on strings and, in particular, to
the tree-based representations used in the field of
genetic programming. We describe how suffix trees can
allow these measures and distances to be implemented
with a computational cost that is linear in both space
and time relative to the length of the strings and the
size of the population. The definitions were devised to
assess the diversity of populations having genomes of
variable length and variable structure during
evolutionary computation runs, but applications in
quantitative genomics, proteomics, and pattern
recognition can be also envisaged.",
notes = "Section 3.4 (10 lines) suggests using Tanimoto
distance for tree GP p503. Page 507 says {"}Tanimoto
... O(lnn) complexity makes the direct implementation
... rapidly impractical for runtime diversity
assessment when the population size (n)
grows.{"}