Pap smear diagnosis using a hybrid intelligent scheme focusing on genetic algorithm based feature selection and nearest neighbor classification

https://doi.org/10.1016/j.compbiomed.2008.11.006

Abstract

The term pap-smear refers to samples of human cells stained by the so-called Papanicolaou method. The purpose of the Papanicolaou method is to diagnose pre-cancerous cell changes before they progress to invasive carcinoma. In this paper, a metaheuristic algorithm is proposed to classify the cells. Two databases are used, constructed at different times by expert MDs, consisting of 917 and 500 images of pap-smear cells, respectively. Each cell is described by 20 numerical features, and the cells fall into 7 classes; the minimal requirement, however, is to separate normal from abnormal cells, which is a 2-class problem. To solve the feature subset selection problem, i.e., to find the best-performing subset of features, an effective genetic algorithm scheme is proposed. This algorithmic scheme is combined with a number of nearest neighbor based classifiers. The results show that the classification accuracy of the proposed scheme generally exceeds that of previously applied intelligent approaches.

Introduction

Intelligent methodologies for data classification have long been at the center of interest in modern artificial intelligence. New approaches appear in the literature every year, proposing intelligent methodologies that prove successful in a variety of real-world applications (financial decision making, medical diagnosis, fault diagnosis in engineering systems, management, etc.). Regarding the nature of the data under analysis, a classification problem can involve (a) datasets of high complexity, (b) different types of data (nominal or numerical), (c) two or more classes for data separation, (d) data abnormalities, such as incomplete or sparse datasets, etc. Classification methods, in turn, range from standard mathematical approaches to intelligent algorithmic techniques (e.g., Gustafson-Kessel or hard/fuzzy C-means clustering, nearest neighbor methods, inductive machine learning algorithms, genetic optimization, neural networks, etc.). Furthermore, hybrid intelligent methods are also proposed in the literature, i.e., effective combinations of more than one of the above-mentioned approaches, or combinations of “feature selection” (for complexity reduction) with standard or intelligent data classification approaches (see Fig. 1).

In this paper, we propose a novel hybrid intelligent approach which outperforms a number of other competitive techniques on the medical problem of pap-smear cell data classification. In this approach, we first solve the feature subset selection problem as a genetic optimization problem and then perform nearest neighbor based classification on the reduced feature set. The method proposed for solving the feature selection problem is a genetic algorithm (GA).
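The two-stage idea described above can be illustrated by scoring a candidate feature subset with a nearest neighbor classifier. The following is a minimal sketch, not the authors' implementation: the function name `loo_1nn_accuracy`, the binary-mask encoding of a feature subset, the squared Euclidean distance and the leave-one-out protocol are all assumptions made for illustration, since this excerpt does not state the fitness function actually used.

```python
import math

def loo_1nn_accuracy(samples, labels, mask):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier that
    only looks at the features j for which mask[j] == 1 (assumed encoding)."""
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, x in enumerate(samples):
        best_dist, best_label = math.inf, None
        for k, y in enumerate(samples):
            if k == i:
                continue  # leave the query sample out of the training set
            dist = sum((x[j] - y[j]) ** 2 for j in selected)
            if dist < best_dist:
                best_dist, best_label = dist, labels[k]
        correct += best_label == labels[i]
    return correct / len(samples)

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
toy_samples = [(0.0, 5.0), (0.1, -5.0), (1.0, 5.1), (1.1, -5.1)]
toy_labels = [0, 0, 1, 1]
print(loo_1nn_accuracy(toy_samples, toy_labels, [1, 0]))  # informative feature only
print(loo_1nn_accuracy(toy_samples, toy_labels, [1, 1]))  # noise feature included
```

On this toy data the subset containing only the informative feature scores higher than the full feature set, which is the kind of signal a feature selection search can exploit.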

GAs are search procedures based on the mechanics of natural selection and natural genetics. The first GA was developed by John H. Holland in the 1960s to allow computers to evolve solutions to difficult search and combinatorial problems, such as function optimization and machine learning [13]. GAs offer a particularly attractive approach for problems like feature subset selection, since they are generally quite effective for rapid global search of large, non-linear and poorly understood spaces. Moreover, GAs are very effective in solving large-scale problems. GAs [11] mimic the evolution process in nature: they imitate the biological process in which new and better populations among different species are developed during evolution. Thus, unlike most standard heuristics, GAs use information from a population of solutions, called individuals, when they search for better solutions. A GA is a stochastic iterative procedure that keeps the population size constant across iterations, each of which is called a generation. The basic operation of a GA is the mating of two solutions in order to form a new solution. To form a new population, a binary operator, called crossover, and a unary operator, called mutation, are applied [26], [27]. Crossover takes two individuals, called parents, and produces two new individuals, called offspring, by swapping parts of the parents.
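The operators described above can be sketched on binary feature masks as follows. This is an illustrative sketch only: one-point crossover, bit-flip mutation, binary tournament selection and the default mutation rate are assumed choices, and the function names are hypothetical; the excerpt does not specify which operator variants the proposed scheme uses.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    """Produce two offspring by swapping the tails of two equal-length
    bit lists at a random cut point (requires length >= 2)."""
    cut = rng.randint(1, len(parent_a) - 1)
    return (parent_a[:cut] + parent_b[cut:],
            parent_b[:cut] + parent_a[cut:])

def bit_flip_mutation(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]

def next_generation(population, fitness, rng, mutation_rate=0.05):
    """Build a new population of the same size as the old one
    (requires at least two individuals)."""
    def pick():
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    offspring = []
    while len(offspring) < len(population):
        child_a, child_b = one_point_crossover(pick(), pick(), rng)
        offspring.append(bit_flip_mutation(child_a, mutation_rate, rng))
        offspring.append(bit_flip_mutation(child_b, mutation_rate, rng))
    return offspring[:len(population)]  # keep the population size constant
```

In a feature selection setting, each individual would be a mask over the 20 cell features and `fitness` would be a classification score for the masked data; here `fitness` is left as any callable that maps an individual to a number.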

In the classification phase of the proposed algorithm, a number of variants of the nearest neighbor classification method are used [9]. In order to assess the efficacy of the proposed methodology, the algorithm is applied to the pap-smear cell classification task. The term “pap-smear” refers to samples of human cells stained by the so-called Papanicolaou method. The Papanicolaou method is a medical procedure to detect pre-cancerous cells in the uterine cervix. The performance of the proposed algorithm is tested using two datasets of images of pap-smear cells distributed unequally among 7 different classes. Each cell is described by 20 features extracted from pictures of single human cells. Several intelligent methodologies have previously been applied to this domain, with variable classification performance; see [4], [5], [8], [21], [23], [31].
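The nearest neighbor variants named in the Conclusions (1-nn, k-nn and wk-nn) can be sketched in a single function. This is a hedged illustration: the function name `knn_predict`, the Euclidean distance, and in particular the inverse-distance vote weighting used for the weighted variant are assumptions, because the excerpt does not define the weighting scheme of the wk-nn classifier.

```python
import math
from collections import defaultdict

def knn_predict(train_x, train_y, query, k=1, weighted=False):
    """Classify `query` by a vote among its k nearest training samples.

    k=1 gives the 1-nn rule; weighted=True weights each vote by
    1 / (distance + eps), an assumed stand-in for the wk-nn weighting.
    """
    neighbors = sorted(
        ((math.dist(x, query), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbors:
        votes[label] += 1.0 / (dist + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical one-feature example for the 2-class (normal/abnormal) problem.
train_x = [(0.0,), (1.0,), (1.1,)]
train_y = ["normal", "abnormal", "abnormal"]
print(knn_predict(train_x, train_y, (0.3,), k=1))
print(knn_predict(train_x, train_y, (0.3,), k=3))
print(knn_predict(train_x, train_y, (0.3,), k=3, weighted=True))
```

With k=3 the unweighted vote is decided by the two more distant samples, whereas the weighted vote lets the single close neighbor dominate; this is the usual motivation for distance weighting.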

Medical experts state that a definite presence of cancer is always clear and is therefore easily detected and correctly diagnosed by doctors. In contrast, the characterization of a precancerous stage is highly subjective and requires detailed discussion among experts before a final decision is reached. Nevertheless, an intelligent technique can in certain cases achieve even a 0% error (reflecting its ability to model and generalize accurately from carefully collected datasets), which is quite different from real medical practice and rather advantageous. Thus, the motivation for applying the previously mentioned method to the pap-smear cell classification problem was to show that methods of this kind can be used effectively in medical diagnosis, and that intelligent tools and techniques should be recommended exclusively as second-opinion assistants in medical decision making.

The rest of the paper is organized as follows. In the next section, a detailed analysis of algorithms for classification is presented. In Section 3, an analytical description of the proposed algorithm is given. In Section 4, the application of the proposed algorithm to the pap-smear cell classification task is presented. In Section 5, the computational results of the proposed method are presented, while the last section gives conclusions and directions for future research.

Algorithms for classification

In recent years, there has been an increasing need for novel data mining methodologies that can analyze and interpret large volumes of data. Selecting the right set of features for classification is one of the most important problems in designing a good classifier. The basic feature selection problem is an optimization problem in which a performance measure assigned to each subset of features quantifies its ability to classify the samples. The problem is to search through the space of feature subsets to

Application example

A brief description of pap-smear medical diagnosis is given below. A description of the pap-smear database characteristics (attributes, classes, etc.) follows, and the algorithmic parameters and settings of the proposed intelligent classification approach are briefly explained.

Presentation and discussion of results

In this section, the results obtained with the proposed methodology on both the new and the old pap-smear datasets are presented and discussed in detail. Previous intelligent approaches applied to the problem of pap-smear diagnosis are also presented and compared with the results of the proposed methodology. Finally, a comparison with the results of another metaheuristic algorithm for this feature selection problem, a tabu search based metaheuristic algorithm [20] is

Conclusions

In this paper, a classification algorithm was proposed, combining nearest neighbor techniques and a GA for solving the optimal feature subset selection problem. The proposed approach was then used for solving a real-world medical diagnosis problem, namely pap-smear cell classification. Different classifiers have been tried throughout the paper for the main classification task, based on the nearest neighbor classification rule, i.e., the 1-nn, the k-nn and the wk-nn for solving the pap-smear

Conflict of interest statement

None declared.

References (32)

  • R. Kohavi et al., Wrappers for feature subset selection, Artificial Intelligence (1997)
  • P.S. Shelokar et al., An ant colony classifier system: application to some process engineering problems, Computers and Chemical Engineering (2004)
  • A. Tsakonas et al., Evolving rule based systems in two medical domains using genetic programming, Artificial Intelligence in Medicine (2004)
  • D.W. Aha et al., A comparative evaluation of sequential feature selection algorithms
  • A. Al-Ani, Feature subset selection using ant colony optimization, International Journal of Computational Intelligence (2005)
  • A. Al-Ani, Ant colony optimization for feature subset selection, Transactions on Engineering, Computing and Technology (2005)
  • N. Ampazis et al., Efficient second order neural network training algorithms for the construction of a pap-smear classifier
  • Byriel, J., 1999. Neuro-Fuzzy Classification of Cells in Cervical Smears. Master's Thesis, Technical University of...
  • E. Cantu-Paz, Feature subset selection, class separability, and genetic algorithms, Genetic and Evolutionary Computation Conference (2004)
  • Cantu-Paz, E., Newsam, S., Kamath, C., 2004. Feature selection in scientific application, in: Proceedings of the 2004...
  • G. Dounias et al., Automated identification of cancerous smears using various competitive intelligent techniques, Oncology Reports (2006)
  • R.O. Duda et al., Pattern Classification and Scene Analysis (1973)
  • A.P. Engelbrecht, Computational Intelligence: An Introduction (2007)
  • D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning (1989)
  • Y.C. Ho et al., Simple explanation of the no-free-lunch theorem and its implications, Journal of Optimization Theory and Applications (2002)
  • J.H. Holland, Adaptation in Natural and Artificial Systems (1975)

Yannis Marinakis was born in Chania, Greece, in 1976. He received a Diploma in Production Engineering and Management from the Technical University of Crete, Greece, in 1999 and a Ph.D., from the same University, in 2005. Currently he is a Lecturer in the Technical University of Crete, Chania, Greece as well as a Research Associate in the University of the Aegean, Chios, Greece. His research interests focus on computational methods in optimization problems.

Georgios Dounias was born in Evia, Greece, in 1967. He received a Diploma in Production Engineering and Management from the Technical University of Crete, Greece, in 1989 and a Ph.D., from the same University, in 1995. Since 1999, he has been with the University of the Aegean, Department of Financial & Management Engineering, Chios, Greece, where he is Associate Professor in Management & Decision Engineering. His research interests focus on computational intelligence techniques in various engineering applications.

Jan Jantzen was born in Denmark, in 1953. He received an electrical engineering degree from the Denmark Technical University (DTU), Copenhagen, Denmark, in 1979 and a Ph.D., from the same University, in 1982. Since 1986, he has been with the DTU, Copenhagen, Denmark, where he is Associate Professor. His research interests focus on systems science and control theory using intelligent techniques.
