Lung cancer prediction using multi-gene genetic programming by selecting automatic features from amino acid sequences

https://doi.org/10.1016/j.compbiolchem.2022.107638

Highlights

  • Application of machine learning in health care for lung cancer prediction.

  • Classification of cancer vs. non-cancer protein amino acid sequences using supervised learning.

  • Transformation of protein amino acid sequences into various feature spaces.

  • Automatic feature selection with bio-inspired multi-gene genetic programming.

Abstract

Lung cancer is one of the leading causes of cancer-related deaths. Early diagnosis of lung cancer using automatic feature selection from a large number of features is a challenging task. Conventionally, cancer diagnosis approaches use physical features that appear in later stages, after harmful effects have already occurred due to abnormal somatic mutations. To extract useful novel patterns that efficiently predict cancer at early stages, we analyzed lung cancer-related mutated genes that reveal useful information in protein amino acid sequences. For this, we developed a new evolutionary learning technique based on a biologically inspired multi-gene genetic programming algorithm that uses discriminant information of protein amino acids. The proposed model efficiently selects 23 discriminant features out of 1500 features. It then optimally combines the selected features and related primitive functions for prediction of lung cancer. Hence, an efficient predictive model is constructed that helps in understanding the complex heterogeneous nature of lung cancer. The proposed system achieved area under ROC curve and accuracy values of 98.79% and 95.67%, respectively, outperforming related lung cancer prediction approaches.

Introduction

Lung cancer is one of the leading causes of cancer-related deaths (Torre et al., 2016, Key Statistics for Lung Cancer, 2018, Aareleid et al., 2017, Blandin Knight et al., 2017, Cheng et al., 2016). According to the World Health Organization, about 1.76 million people died of lung cancer in 2018. The main causes of lung cancer include smoking, air pollution, exposure to radon gas, asbestos and other carcinogens, alcohol consumption, and genetic susceptibility (Malhotra et al., 2016; Carreras-Torres et al., 2017; Chen et al., 2016). Clinically, various tests such as chest X-ray, CT scan, sputum cytology, and biopsy are recommended to diagnose lung cancer. However, signs and symptoms of lung cancer generally appear when the disease is at an advanced stage. Hence, the lung cancer mortality rate is increasing worldwide (Aareleid et al., 2017, Cheng et al., 2016). Likewise, most computer-aided diagnostic methods are based on physical/geometrical features of the tumor that appear in later stages. In this scenario, an effective lung cancer prediction system capable of early-stage detection is indispensable.

Life is based on the ability of cells to store, retrieve, and translate genetic information (Lieu et al., 2020). This information is carried by deoxyribonucleic acid (DNA) in all living things. DNA is a discrete code physically present in the cells of an organism. It can be thought of as a one-dimensional string of characters in which each three-character sequence of nucleotides (a codon) corresponds to one amino acid. Essentially, the instructions in DNA are first transcribed into ribonucleic acid (RNA), and the RNA is then translated into a protein sequence of amino acids (Lieu et al., 2020).
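The transcription and translation steps described above can be illustrated with a minimal sketch. It assumes a coding-strand DNA string and uses only a five-codon subset of the standard genetic code; the names `CODON_TABLE`, `transcribe`, and `translate` and the example sequence are hypothetical and do not come from the paper.

```python
# Partial standard genetic code: only the codons used in this example.
# A full table has 64 entries; restricting it here is an illustrative assumption.
CODON_TABLE = {
    "ATG": "M",  # methionine (start codon)
    "GCC": "A",  # alanine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop codon
}


def transcribe(dna):
    """Transcribe a coding-strand DNA string into mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate a coding-strand DNA string into a one-letter protein string."""
    dna = dna.upper()
    protein = []
    # Read the string three characters (one codon) at a time.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]  # KeyError outside the subset
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


example_dna = "ATGGCCAAATGGTAA"  # hypothetical toy sequence
example_mrna = transcribe(example_dna)
example_protein = translate(example_dna)
```

For the toy input, the five codons ATG, GCC, AAA, TGG, and TAA give the four-residue sequence MAKW, with the final stop codon contributing no residue.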

Since cancer is a heterogeneous genetic disease caused by different types of somatic mutations (Dimitrakopoulos and Beerenwinkel, 2017), new studies explain the important role of amino acids in cancer metabolism. A recent article outlines the diverse roles of amino acids within the tumor and its microenvironment (Lieu et al., 2020). The human genome is composed of nucleic acid sequences that are organized into chromosomes. The deoxyribonucleic acid (DNA) packaged in these chromosomes comprises protein-coding and non-coding genes. Specific genes maintain cell growth, division, and apoptosis in the human body (Zhan and Boutros, 2016, Chen et al., 2017, Liu et al., 2017). Some genes are responsible for tumor suppression (Zhan and Boutros, 2016), transcription activation, protein binding (Zhan and Boutros, 2016, Chen et al., 2017), regulation, phosphorylation-dependent ubiquitination (Huang et al., 2018), negative regulation (Kei‐Ichiro et al., 2018), and activation of signaling pathways. These coordinated activities must proceed normally for the smooth functioning of lung cells. Chromosomal DNA is continually at risk from somatic mutations. Generally, damage in DNA is repaired with no detrimental effects. However, when the DNA repair process fails, abnormal mutations accumulate in DNA. This abnormal process damages ribonucleic acid (RNA), and abnormal cell functions are replicated. Thus, cancer-related protein amino acid sequences are formed as a result of translation of the damaged RNA. For example, the tumor suppressor genes BRCA1, BRCA2, and p53 are responsible for controlling cell growth and division. However, normal lung function is degraded by cancer-driving somatic mutations, because somatic mutations within coding regions interrupt normal gene function. As a result, abnormal mutations deregulate cellular networks and transform normal cells into cancer cells (Amar et al., 2017, Forbes et al., 2017, Kuijjer et al., 2018, Vural et al., 2016).
Since somatic mutations drive abnormal biological processes that are reflected in tumor phenotype (Rios Velazquez et al., 2017), abnormal mutation profiles can be exploited to predict lung cancer. Accordingly, a somatic mutation-based prediction system can be developed to identify cancer patients at early stages (Blandin Knight et al., 2017). We created Fig. 1 to visually demonstrate the biological processes involved in the formation of lung cancer. In Fig. 1, the process starts at the top left corner, proceeds to the right following the arrows, and ends with the formation of amino acid sequences. At that point, a lung cancer prediction system can differentiate between healthy and potential lung cancer protein sequences.

The literature shows that the development of lung cancer prediction systems is an active research area. Various prediction systems have already been proposed using different learning approaches (Hosseinzadeh et al., 2013, Li et al., 2015, Li et al., 2018, Petousisa et al., 2016, Ramani and Jacob, 2013, Zhang et al., 2016, Ibáñez et al., 2016, Liu et al., 2018, Al-Thanoon et al., 2018, Liu et al., 2018, Abdar et al., 2018, Abdel-Nasser et al., 2017, Narayanan et al., 2017, Salem et al., 2017). Bayesian network models have been developed using structural and physicochemical properties of protein sequences (Ramani and Jacob, 2013). Support vector machine (SVM), logistic regression (LR), Naive Bayes (NB), and random forest (RF) algorithms have been employed using gene expression features (Li et al., 2015). Similarly, a dynamic Bayesian network has been developed with low-dose computed tomography features (Petousisa et al., 2016). A decision tree (DT) algorithm has been used to predict lung cancer from DNA methylation markers (Zhang et al., 2016). In another study, micro-RNA expression profiles were used to classify lung cancer (A S et al., 2016). Because primary amino acid sequences carry discriminant information, they have been used for prediction of various types of cancer (Ramani and Jacob, 2013, Ali et al., 2016). Recently, biologically inspired deep neural networks (DNNs) have gained popularity for classification modeling (Teramoto et al., 2017, Coudray et al., 2018). These deep learning approaches have automatic feature extraction and learning capability. In contrast to previous learning approaches, we propose a bio-inspired multi-gene genetic programming (MGGP) based data mining approach for lung cancer prediction using amino acid sequence information. The proposed approach is effective because it extracts novel and meaningful patterns and optimally combines the selected features in decision space.
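As an illustration only, the way a multi-gene individual combines selected features with primitive functions can be sketched as follows. The sketch assumes the common MGGP formulation in which the model output is a bias plus a weighted sum of several gene trees. The gene structures, primitive set, weights, threshold, and all names (`PRIMITIVES`, `evaluate_gene`, `multigene_output`, `predict`) are hypothetical and are not taken from the paper. The evolutionary search that would produce such genes is not shown.

```python
import math

# Hypothetical primitive function set; the paper's actual set is not given here.
PRIMITIVES = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "tanh": lambda a: math.tanh(a),
}


def evaluate_gene(node, features):
    """Recursively evaluate one gene tree stored as nested tuples.

    ("x", i) reads feature i, ("const", c) is a constant, and any other
    tuple applies a primitive function to its evaluated children.
    """
    op, *args = node
    if op == "x":
        return features[args[0]]
    if op == "const":
        return args[0]
    return PRIMITIVES[op](*(evaluate_gene(arg, features) for arg in args))


def multigene_output(genes, weights, bias, features):
    """Bias plus weighted sum of the outputs of all gene trees."""
    return bias + sum(w * evaluate_gene(g, features)
                      for w, g in zip(weights, genes))


def predict(genes, weights, bias, features, threshold=0.5):
    """Label a sample 1 (cancer) or 0 (non-cancer) by thresholding the output."""
    return 1 if multigene_output(genes, weights, bias, features) >= threshold else 0


# Hypothetical two-gene individual that uses only three selected features.
example_genes = [
    ("mul", ("x", 0), ("x", 1)),
    ("sub", ("x", 2), ("const", 1.0)),
]
example_weights = [0.5, 0.25]  # hand-set here for illustration
example_features = [2.0, 1.0, 3.0]
example_output = multigene_output(example_genes, example_weights, 0.1, example_features)
example_label = predict(example_genes, example_weights, 0.1, example_features)
```

In this representation, feature selection is implicit: a feature is selected if at least one gene tree references it, so an individual built over 1500 candidate features may end up using only a small subset.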

In this work, we computed various molecular descriptors using physicochemical properties of protein amino acid sequences. The descriptor information is transformed into various feature spaces: autocorrelation features, sequence order features, amino acid composition, dipeptide composition, split amino acid composition, series-pseudo amino acid composition, and parallel-pseudo amino acid composition. These feature spaces are reliable, as some of them have already been used in several other fields including pattern recognition, medicine, and bioinformatics (Ali et al., 2016, Mei and Zhao, 2018). Additionally, we constructed two new hybrid feature spaces and named them hybrid-FS1 and hybrid-FS2. The proposed model is evaluated using the performance measures of accuracy, sensitivity, specificity, F-measure, Matthews correlation coefficient, and area under ROC curve. The proposed approach proved its effectiveness by outperforming previous learning approaches for lung cancer prediction.
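Two of the feature spaces named above, amino acid composition and dipeptide composition, can be sketched in a few lines. This is a minimal illustration that assumes the usual definitions (residue or residue-pair counts normalized by the number of residues or pairs); the function names and the toy sequence are hypothetical, and the paper's exact preprocessing is not reproduced.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Return 20 features: the fraction of each standard residue in seq."""
    seq = seq.upper()
    # Assumption: non-standard letters count toward the length only.
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 features: the fraction of each ordered residue pair in seq."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]  # overlapping pairs
    counts = {}
    for pair in pairs:
        counts[pair] = counts.get(pair, 0) + 1
    return [counts.get(a + b, 0) / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]


toy_sequence = "MAKWAA"  # hypothetical sequence, not from the dataset
aac = amino_acid_composition(toy_sequence)
dc = dipeptide_composition(toy_sequence)
```

Both vectors have a fixed length regardless of sequence length (20 and 400 entries), which is what allows protein sequences of different lengths to be fed to the same classifier.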

The remainder of the paper is organized as follows. Section 2 describes the materials and methods, including the proposed MGGP prediction system. Section 3 describes the evaluation metrics. Section 4 presents the results, while discussion and comparison with other approaches are presented in Section 5. Finally, Section 6 concludes the study.

Materials and methods

This section describes the formation of the lung cancer dataset, the formation of the feature spaces, and the proposed prediction system.

Evaluation Metrics

Evaluation of machine learning algorithms requires the use of metrics. To comprehensively analyze the performance of our approach, we used multiple quality measures: accuracy, sensitivity, specificity, F-measure, Matthews correlation coefficient (MCC), and area under ROC curve (AUC). These measures are based, directly or indirectly, on the values of true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The ROC curve shows the association
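The measures listed in this section follow standard definitions in terms of TP, TN, FP, and FN, and can be sketched as below. The function names and the confusion-matrix counts in the example are hypothetical and are not results from the paper. The sketch assumes all denominators are non-zero and computes AUC as the fraction of positive-negative score pairs that are ranked correctly, with ties counted as half.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute threshold-based measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


def auc_from_scores(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise ranking of classifier scores."""
    correct = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half
    return correct / (len(positive_scores) * len(negative_scores))


# Hypothetical counts for illustration only.
example_metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
example_auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Unlike the other measures, AUC does not depend on a single decision threshold, which is why it takes raw scores rather than confusion-matrix counts.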

Results

The experiments were performed using 1000 examples, as described in the section Dataset Formation. The MGGP model was trained on a training set of 700 examples and tested on a test set of 300 examples. The following results were obtained in terms of the evaluation metrics (Table 5).

The model achieved sensitivities of 0.9733, 0.9733, 0.8733, 0.9133, 0.9333, 0.9267, 0.9067, 0.9667, and 0.9467 for the feature spaces Hybrid-FS2, Hybrid-FS1, SOF, AutoCr, PPseAAC, SPseAAC, DC,

Discussion

Lung cancer detection is an important research topic in computer-aided diagnostics, and a large number of methods have been proposed in the scientific literature. However, novel approaches are still being developed (Xie et al., 2021, Yin et al., 2021), indicating that existing approaches are not yet satisfactory. Indeed, systems that detect lung cancer at early stages are genuinely required. Similarly, somatic mutation plays a significant role in cancer metabolism. Mutated genes may

Conclusion

In this study, we developed a biologically inspired MGGP-based lung cancer prediction system using discriminant features of mutated genes. We expressed the discriminant features in different spaces after statistical analysis of mutated genes. These features incorporated useful information to predict lung cancer from protein amino acid sequences. The results demonstrated that the Hybrid-FS2 feature space has more discrimination power for accurate prediction. The proposed system has shown better

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Higher Education Commission (HEC) of Pakistan under the Indigenous PhD Fellowship for 5000 scholars, Phase-II (PIN No. 213-59474-2PS2-056).

References (57)

  • T. Aareleid et al.

    Divergent trends in lung cancer incidence by gender, age and histological type in Estonia: a nationwide population-based study

    BMC Cancer

    (2017)
  • M. Abdar et al.

    A new nested ensemble technique for automated diagnosis of breast cancer

    Pattern Recognit. Lett.

    (2018)
  • D. Amar et al.

    Utilizing somatic mutation data from numerous studies for cancer research: proof of concept and applications

    Oncogene

    (2017)
  • S. Blandin Knight et al.

    Progress and prospects of early detection in lung cancer

    Open Biol.

    (2017)
  • D.-S. Cao et al.

    propy: a tool to generate various modes of Chou’s PseAAC

    Bioinformatics

    (2013)
  • R. Carreras-Torres et al.

    Obesity, metabolic factors and risk of different histological types of lung cancer: a Mendelian randomization study

    PLoS ONE

    (2017)
  • L. Chen et al.

    Prediction and analysis of essential genes using the enrichments of gene ontology and KEGG pathways

    PLoS ONE

    (2017)
  • T.U. Consortium

    UniProt: the universal protein knowledgebase

    Nucleic Acids Res.

    (2017)
  • N. Coudray et al.

    Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning

    Nat. Med.

    (2018)
  • C.M. Dimitrakopoulos et al.

    Computational approaches for the identification of cancer genes and pathways

    Wiley Interdiscip. Rev. Syst. Biol. Med.

    (2017)
  • S.A. Forbes et al.

    COSMIC: somatic cancer genetics at high-resolution

    Nucleic Acids Res.

    (2017)
  • Genetic Scissors: a tool for rewriting the code of life (2020). Retrieved July 16, 2021, from...
  • F. Hosseinzadeh et al.

    Prediction of lung tumor types based on protein attributes by machine learning algorithms

    SpringerPlus

    (2013)
  • Y. Huang et al.

    S6K1 phosphorylation-dependent degradation of Mxi1 by β-Trcp ubiquitin ligase promotes Myc activation and radioresistance in lung cancer

    Theranostics

    (2018)
  • K. Ibáñez et al.

    A computational approach inspired by simulated annealing to study the stability of protein interaction networks in cancer and neurological disorders

    Data Min. Knowl. Discov.

    (2016)
  • Y. Jiao et al.

    Performance measures in evaluating machine learning based bioinformatics predictors for classifications

    Quant. Biol.

    (2016)