Development of Symbolic Expressions Ensemble for Breast Cancer Type Classification Using Genetic Programming Symbolic Classifier and Decision Tree Classifier
Created by W.Langdon from
gp-bibliography.bib Revision:1.8051
@Article{Andelic:2023:Cancers,
author = "Nikola Andelic and Sandi {Baressi Segota}",
title = "Development of Symbolic Expressions Ensemble for
Breast Cancer Type Classification Using Genetic
Programming Symbolic Classifier and Decision Tree
Classifier",
journal = "Cancers",
year = "2023",
volume = "15",
number = "13",
pages = "article no. 3411",
month = "29 " # jun,
email = "nandelic@riteh.hr",
keywords = "genetic algorithms, genetic programming, PCA, breast
cancer, genetic programming symbolic classifier, 5-fold
cross validation, random hyperparameter value search",
ISSN = "2072-6694",
URL = "https://www.mdpi.com/2072-6694/15/13/3411",
DOI = "doi:10.3390/cancers15133411",
size = "27 pages",
abstract = "Breast cancer is a type of cancer with several
sub-types. It occurs when cells in breast tissue grow
out of control. The accurate sub-type classification of
a patient diagnosed with breast cancer is mandatory for
the application of proper treatment. Breast cancer
classification based on gene expression is challenging
even for artificial intelligence (AI) due to the large
number of gene expressions. The idea in this paper is
to use genetic programming symbolic classifier (GPSC)
on the publicly available dataset to obtain a set of
symbolic expressions (SEs) that can classify the breast
cancer sub-type using gene expressions with high
classification accuracy. The initial problem with the
used dataset is a large number of input variables
(54676 gene expressions), a small number of dataset
samples (151 samples), and six classes of breast cancer
sub-types that are highly imbalanced. The large number
of input variables is solved with principal component
analysis (PCA), while the small number of samples and
the large imbalance between class samples are solved
with the application of different oversampling methods
generating different dataset variations. On each
oversampled dataset, the GPSC with random
hyperparameter values search (RHVS) method is trained
using 5-fold cross validation (5CV) to obtain a set of
SEs. The best set of SEs is chosen based on mean values
of accuracy (ACC), the area under the receiver
operating characteristic curve (AUC), precision,
recall, and F1-score values. In this case, the highest
classification accuracy is equal to 0.992 across all
evaluation metric methods. The best set of SEs is
additionally combined with a decision tree classifier,
which slightly improves ACC to 0.994.",
notes = "Department of Automation and Electronics, Faculty of
Engineering, University of Rijeka, Vukovarska 58, 51000
Rijeka, Croatia",
}