abstract = "This dissertation uses genetic programming in text
categorization problems. Genetic programming algorithms
are applied to a set of news articles to evolve
programs that determine whether an article belongs to
a particular category. The programs are randomly
generated from the set of initial functions and
constants. Programs with the fewest false
assignments are favoured in the selection for
recombination in the subsequent iterations of the
genetic programming algorithm. The form of the solution
is not determined a priori as in other text
categorization methods. The basis set of functions and
constants used by the genetic analysis program is
specified in advance and may include the three basic
logical functions and a set of vocabulary words. Other
sets of basis functions can be supplied to the genetic
algorithm to obtain different programs. The form in
which these functions and constants are combined is
determined randomly by the genetic algorithm. The
results indicate that genetic programming methods are,
in the cases examined, as good as or slightly better than
other decision tree or rule induction methods described
by Apte et al. [Apte 1994]. The Genetic Programming
methods used a simpler set of features and functions:
no word stemming, no explicit stop word removal, a local
dictionary, and Boolean functions. The F1-measure of
categorization performance of 80 percent achieved by
Genetic Programming compares favorably with the 78.5 percent
break-even performance of traditional Boolean rule
induction methods. It is comparable with the 80.5 percent
break-even performance of the rule induction methods
with a more complex feature set such as word frequency
[Apte 1994]. Characteristics of Genetic Programming
text categorization were studied to understand the
sensitivity of Genetic Programming methods to
vocabulary size, population size, and training and testing
set selection methods. Temporal characteristics of the
Reuters Article Corpus [Lewis-21578] were studied. The
results are of interest to both Genetic Programming and
traditional categorization methods and may
point to significant future performance improvements in
both domains. In some cases these results were better
than Apte's.",