abstract = "In many real-world classification applications, such
as medical diagnosis, fraud detection, bioinformatics,
or fault diagnostics, it is common that one class has
only a limited number of training instances (called the
minority class), while the other class (called the
majority class) conceive the rest. Such types of data
sets are called unbalanced. In data classification,
machine learning (ML) methods can face a performance
bias when the nature of data sets is unbalanced. In
this case, the trained classifiers may have good
accuracy on the majority class but lower accuracy on
the minority class. Genetic Programming (GP) is an
optimistic machine learning method based on the
Darwinian theory of evolution to automatically emerge
computer programs to solve problems without any
domain-specific knowledge. Although GP has revealed
much success in developing reliable and precise
classifiers for typical classification jobs, GP, like
many other ML algorithms, can produce biased
classifiers when the nature of data is unbalanced. This
biasing is because traditional training standards such
as the overall success rate in the fitness function in
GP can be influenced by the more significant number of
instances from the majority class.
This research focuses on algorithmic methods that assume
the whole training data set is important and valuable
and that no sample should be removed from the training
process. A second consideration in this work is that the
proposed methods should be problem-independent and
should not require any a priori domain-specific or
expert knowledge. Thus, this research focuses on
developing GP-based approaches to unbalanced data-set
classification, based on internal cost alteration in the
GP fitness function, allowing the unbalanced data set to
be used “as is” in the training process. This work
demonstrates that, by designing such methods within GP,
we can evolve classifiers with good classification
performance on both the majority and the minority
classes. The developed methods are evaluated on publicly
available UCI-based binary benchmark classification
problems with varying levels of class imbalance.",