Lung cancer prediction using multi-gene genetic programming by selecting automatic features from amino acid sequences
Introduction
Lung cancer is one of the leading causes of cancer-related deaths (Torre et al., 2016, Key Statistics for Lung Cancer, 2018, Aareleid et al., 2017, Blandin Knight et al., 2017, Cheng et al., 2016). According to the World Health Organization, about 1.76 million people died of lung cancer in 2018. The main causes of lung cancer include smoking, air pollution, exposure to radon gas, asbestos and other carcinogens, alcohol consumption, and genetic susceptibility (Malhotra et al., 2016; Carreras-Torres et al., 2017; Chen et al., 2016). Clinically, various tests such as chest X-ray, CT scan, sputum cytology, and biopsy are recommended to diagnose lung cancer. However, the signs and symptoms of lung cancer generally appear only when the disease is at an advanced stage. Hence, the lung cancer mortality rate is increasing worldwide (Aareleid et al., 2017, Cheng et al., 2016). Moreover, most computer-aided diagnostic methods rely on physical or geometrical features of the tumor that appear only in later stages. In this scenario, an effective lung cancer prediction system capable of early-stage detection is indispensable.
Life is based on the ability of cells to store, retrieve, and translate genetic information (Lieu et al., 2020). This information is carried by deoxyribonucleic acid (DNA) in all living things. DNA is a discrete code physically present in the cells of an organism and can be thought of as a one-dimensional string of characters, in which each three-character sequence of nucleotides (a codon) corresponds to one amino acid. Essentially, the instructions in DNA are first transcribed into ribonucleic acid (RNA), and the RNA is then translated into a protein, that is, a sequence of amino acids (Lieu et al., 2020).
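The transcription and translation steps described above can be illustrated with a minimal Python sketch. It uses the standard genetic code; the function names `transcribe` and `translate` are illustrative choices and are not taken from this work, and the input is assumed to be a coding-strand DNA string read from its first base.

```python
from itertools import product

# Standard genetic code with bases ordered T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: thymine (T) is replaced by uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate(transcribe("ATGGCCTAA")))
```

In this sketch a single-base change in the DNA string can change the resulting amino acid, which is the kind of sequence-level difference a protein-based predictor would have to pick up.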
Cancer is a heterogeneous genetic disease caused by different types of somatic mutations (Dimitrakopoulos and Beerenwinkel, 2017), and new studies explain the important role of amino acids in cancer metabolism. A recent article outlines the diverse roles of amino acids within the tumor and its microenvironment (Lieu et al., 2020). The human genome is composed of nucleic acid sequences, in the form of deoxyribonucleic acid (DNA), that are packaged into chromosomes and comprise protein-coding and non-coding genes. Specific genes maintain cell growth, division, and apoptosis in the human body (Zhan and Boutros, 2016, Chen et al., 2017, Liu et al., 2017). Some genes are responsible for tumor suppression (Zhan and Boutros, 2016), transcription activation, protein binding (Zhan and Boutros, 2016, Chen et al., 2017), regulation, phosphorylation-dependent ubiquitination (Huang et al., 2018), negative regulation (Kei‐Ichiro et al., 2018), and activation of signaling pathways. These coordinated activities must proceed normally for lung cells to function smoothly. Somatic mutations put chromosomal DNA at constant risk of damage. Generally, DNA damage is repaired with no detrimental effects. However, when the DNA repair process fails, abnormal mutations accumulate in the DNA. This abnormal process damages ribonucleic acid (RNA), and abnormal cell functions are replicated. Thus, cancer-related proteins (amino acid sequences) are formed when the damaged RNA is translated. For example, the tumor suppressor genes BRCA1, BRCA2, and p53 are responsible for controlling cell growth and division. However, the normal function of the lung is degraded by cancer-driving somatic mutations, because somatic mutations within coding regions interrupt normal gene function. As a result, abnormal mutations deregulate cellular networks and transform normal cells into cancer cells (Amar et al., 2017, Forbes et al., 2017, Kuijjer et al., 2018, Vural et al., 2016).
Somatic mutations drive abnormal biological processes that are reflected in the tumor phenotype (Rios Velazquez et al., 2017). Hence, abnormal mutation profiles can be exploited to predict lung cancer, and a somatic-mutation-based prediction system can be developed to identify cancer patients at early stages (Blandin Knight et al., 2017). We have created an image (Fig. 1) to visually demonstrate the biological processes involved in the formation of lung cancer. In Fig. 1, the process starts at the top left corner, moves to the right following the arrows, and ends with the formation of amino acid sequences. At that point, a lung cancer prediction system can differentiate between healthy and potential lung cancer protein sequences.
The literature shows that the development of lung cancer prediction systems is an active research area. Various prediction systems have already been proposed using different learning approaches (Hosseinzadeh et al., 2013, Li et al., 2015, Li et al., 2018, Petousisa et al., 2016, Ramani and Jacob, 2013, Zhang et al., 2016, Ibáñez et al., 2016, Liu et al., 2018, Al-Thanoon et al., 2018, Abdar et al., 2018, Abdel-Nasser et al., 2017, Narayanan et al., 2017, Salem et al., 2017). Bayesian network models have been developed using structural and physiochemical properties of protein sequences (Ramani and Jacob, 2013). Support vector machine (SVM), logistic regression (LR), Naive Bayes (NB), and random forest (RF) algorithms have been employed with gene expression features (Li et al., 2015). Similarly, a dynamic Bayesian network has been developed with low-dose computed tomography features (Petousisa et al., 2016). A decision tree (DT) algorithm has been used to predict lung cancer from DNA methylation markers (Zhang et al., 2016). In another study, micro-RNA expression profiles were used to classify lung cancer (A S et al., 2016). Because primary amino acid sequences carry discriminant information, they have been used to predict various types of cancer (Ramani and Jacob, 2013, Ali et al., 2016). Recently, biologically inspired deep neural networks (DNN) have gained popularity for classification modeling (Teramoto et al., 2017, Coudray et al., 2018). These deep learning approaches offer automatic feature extraction and learning capability. In contrast to previous learning approaches, we propose a bio-inspired, multi-gene genetic programming (MGGP) based advanced data mining approach for lung cancer prediction using amino acid sequence information. The proposed approach is effective because it extracts novel and meaningful patterns and optimally combines the selected features in decision space.
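To make the idea of combining features in decision space concrete, the following minimal Python sketch shows how a multi-gene model is commonly structured in the general MGGP literature: the output is a bias plus a weighted sum of several evolved expression trees ("genes"), with the weights usually fitted by least squares. The three genes, the weights, and the 0.5 decision threshold here are hypothetical placeholders, not the evolved model reported in this work.

```python
import math

# Hypothetical genes: in a real MGGP run these expression trees are evolved,
# not hand-written. Each gene maps a feature vector x to a single number.
GENES = [
    lambda x: x[0] * x[1],
    lambda x: math.tanh(x[2]),
    lambda x: x[0] - x[3],
]


def mggp_output(x, weights, bias):
    """Bias plus the weighted linear combination of all gene outputs."""
    return bias + sum(w * gene(x) for w, gene in zip(weights, GENES))


def predict(x, weights, bias, threshold=0.5):
    """Map the continuous model output to a class label (1 = cancer, 0 = healthy)."""
    return int(mggp_output(x, weights, bias) >= threshold)


if __name__ == "__main__":
    features = [1.0, 2.0, 0.0, 0.5]   # hypothetical feature vector
    weights = [0.2, 1.0, 0.4]         # hypothetical gene weights
    print(predict(features, weights, bias=0.1))
```

Because each gene can use a different subset of the input features, evolving the genes also acts as a form of feature selection in this kind of model.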
In this work, we computed various molecular descriptors using physiochemical properties of protein amino acid sequences. The descriptor information is transformed into various feature spaces: autocorrelation features, sequence order features, amino acid composition, dipeptide composition, split amino acid composition, series-pseudo amino acid composition, and parallel-pseudo amino acid composition. These feature spaces are reliable, as some of them have already been used in several other fields, including pattern recognition, medicine, and bioinformatics (Ali et al., 2016, Mei and Zhao, 2018). Additionally, we have constructed two new hybrid feature spaces, named hybrid-FS1 and hybrid-FS2. The proposed model is evaluated using accuracy, sensitivity, specificity, F-measure, Matthews correlation coefficient, and area under the ROC curve. The proposed approach outperformed previous learning approaches for lung cancer prediction.
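Two of the feature spaces listed above, amino acid composition and dipeptide composition, have simple standard definitions and can be sketched in a few lines of Python. The sketch below assumes sequences contain only the 20 standard residues and uses illustrative function names; it does not reproduce the descriptor software or the hybrid feature spaces used in this work.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """20-dimensional vector: fraction of the sequence made up by each residue."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """400-dimensional vector: fraction of adjacent residue pairs of each type."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


if __name__ == "__main__":
    sequence = "MKWVTFISLLFLFSSAYS"  # hypothetical example sequence
    print(len(amino_acid_composition(sequence)))
    print(len(dipeptide_composition(sequence)))
```

Both vectors have a fixed length regardless of sequence length, which is what allows proteins of different sizes to be fed to the same classifier.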
The remainder of the paper is organized as follows. Section 2 describes the materials and methods, including the proposed MGGP prediction system. Section 3 describes the evaluation metrics. Section 4 presents the results, while Section 5 presents the discussion and a comparison with other approaches. Finally, Section 6 concludes the study.
Materials and methods
This section describes the formation of the lung cancer dataset, the formation of the feature spaces, and the proposed prediction system.
Evaluation Metrics
Evaluating machine learning algorithms requires the use of metrics. To analyze the performance of our approach comprehensively, we used multiple quality measures: accuracy, sensitivity, specificity, F-measure, Matthews correlation coefficient (MCC), and area under the ROC curve (AUC). These measures are based, directly or indirectly, on the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The ROC curve shows the association
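The measures named in this section have standard textbook definitions in terms of TP, TN, FP, and FN, which the following minimal Python sketch computes. The counts in the usage example are hypothetical and are not results from this work; the AUC helper uses the rank-based (pairwise comparison) formulation and assumes each example has a real-valued score. All denominators are assumed to be nonzero.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard measures derived from the confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # true positive rate (recall)
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(classification_metrics(tp=90, tn=80, fp=20, fn=10))
    print(auc([0.9, 0.8, 0.3], [0.4, 0.2]))
```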
Results
The experiments were performed using 1000 examples, as described in the section Dataset Formation. The MGGP model was trained on a training set of 700 examples and tested on a test set of 300 examples. The following results were obtained in terms of the evaluation metrics (Table 5).
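A 700/300 hold-out split of this kind can be sketched in Python as follows. The snippet shown here does not say whether the split was random or stratified by class, so the seeded random shuffle is an assumption made only for illustration, and `split_dataset` is a hypothetical helper name.

```python
import random


def split_dataset(examples, n_train=700, seed=0):
    """Shuffle a copy of the examples and cut it into training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    dataset = list(range(1000))  # stand-in for 1000 labelled sequences
    train_set, test_set = split_dataset(dataset)
    print(len(train_set), len(test_set))
```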
The model achieved sensitivities of 0.9733, 0.9733, 0.8733, 0.9133, 0.9333, 0.9267, 0.9067, 0.9667, and 0.9467 for the feature spaces Hybrid-FS2, Hybrid-FS1, SOF, AutoCr, PPseAAC, SPseAAC, DC,
Discussion
Lung cancer detection is an important research topic in computer-aided diagnostics, and a large number of methods have been proposed in the scientific literature. However, novel approaches are still being developed (Xie et al., 2021, Yin et al., 2021), which suggests that existing approaches are not yet fully satisfactory. Indeed, systems that detect lung cancer at early stages are genuinely required. Similarly, somatic mutation plays a significant role in cancer metabolism. Mutated genes may
Conclusion
In this study, we have developed a biologically inspired, MGGP-based lung cancer prediction system using discriminant features of mutated genes. We expressed the discriminant features in different spaces after statistical analysis of the mutated genes. These features incorporated useful information for predicting lung cancer from protein amino acid sequences. The results demonstrated that the Hybrid-FS2 feature space has more discrimination power for accurate prediction. The proposed system has shown better
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the Higher Education Commission (HEC) of Pakistan under the Indigenous PhD Fellowship for 5000 scholars, Phase-II (PIN No. 213-59474-2PS2-056).
References (57)
- et al.
Analyzing the evolution of breast tumors through flow fields and strain tensors
Pattern Recognit. Lett.
(2017)
- et al.
Can-CSC-GBE: developing Cost-sensitive Classifier with Gentleboost Ensemble for breast cancer classification using protein amino acids and imbalanced data
Comput. Biol. Med.
(2016)
- et al.
Tuning parameter estimation in SCAD-support vector machine using firefly algorithm with application in gene selection and cancer classification
Comput. Biol. Med.
(2018)
- et al.
Genetic risk can be decreased: quitting smoking decreases and delays lung cancer for smokers with high and low CHRNA5 risk genotypes — a meta-analysis
EBioMedicine
(2016)
- et al.
The international epidemiology of lung cancer: latest trends, disparities, and tumor characteristics
J. Thorac. Oncol.: Off. Publ. Int. Assoc. Study Lung Cancer
(2016)
- et al.
Adaptive multinomial regression with overlapping groups for multi-class classification of lung cancer
Comput. Biol. Med.
(2018)
- et al.
A prognosis-related based method for miRNA selection on liver hepatocellular carcinoma prediction
Comput. Biol. Chem.
(2021)
- et al.
Early lung cancer diagnostic biomarker discovery by machine learning methods
Transl. Oncol.
(2021)
- et al.
Smoking-associated DNA methylation markers predict lung cancer incidence
Clin. Epigenetics
(2016)
- A S, R A, S VCS (2016) SVM Based Lung Cancer Prediction Using microRNA Expression Profiling from NGS Data. Paper...
Divergent trends in lung cancer incidence by gender, age and histological type in Estonia: a nationwide population-based study
BMC Cancer
A new nested ensemble technique for automated diagnosis of breast cancer
Pattern Recognit. Lett.
Utilizing somatic mutation data from numerous studies for cancer research: proof of concept and applications
Oncogene
Progress and prospects of early detection in lung cancer
Open Biol.
propy: a tool to generate various modes of Chou’s PseAAC
Bioinformatics
Obesity, metabolic factors and risk of different histological types of lung cancer: a Mendelian randomization study
PLoS ONE
Prediction and analysis of essential genes using the enrichments of gene ontology and KEGG pathways
PLoS ONE
UniProt: the universal protein knowledgebase
Nucleic Acids Res.
Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning
Nat. Med.
Computational approaches for the identification of cancer genes and pathways
Wiley Interdiscip. Rev. Syst. Biol. Med.
COSMIC: somatic cancer genetics at high-resolution
Nucleic Acids Res.
Prediction of lung tumor types based on protein attributes by machine learning algorithms
SpringerPlus
S6K1 phosphorylation-dependent degradation of Mxi1 by β-Trcp ubiquitin ligase promotes Myc activation and radioresistance in lung cancer
Theranostics
A computational approach inspired by simulated annealing to study the stability of protein interaction networks in cancer and neurological disorders
Data Min. Knowl. Discov.
Performance measures in evaluating machine learning based bioinformatics predictors for classifications
Quant. Biol.