Created by W.Langdon from gp-bibliography.bib Revision:1.8081
Proteins of breast tissue generally reflect the initial changes caused by successive genetic mutations, which may lead to cancer. In this research, such changes in protein sequences are exploited for the early diagnosis of breast cancer. It is found that substantial variation of the amino acids Proline, Serine, Tyrosine, Cysteine, Arginine, and Asparagine in cancerous proteins offers high discrimination for cancer diagnosis. Molecular descriptors derived from physicochemical properties of amino acids are used to transform primary protein sequences into feature spaces of amino acid composition (AAC), split amino acid composition (SAAC), pseudo amino acid composition-series (PseAAC-S), and pseudo amino acid composition-parallel (PseAAC-P).
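As a rough illustration of the two simplest feature spaces named above, the following minimal sketch computes AAC as the relative frequency of each of the 20 standard amino acids, and SAAC as the concatenation of AAC vectors for the N-terminal, middle, and C-terminal segments of a sequence. The function names and the `terminal_len` parameter are assumptions made for illustration; the thesis abstract does not state the segment lengths it uses.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aac(sequence):
    """Return the 20-dimensional amino acid composition (relative frequencies)."""
    # Ignore any character that is not one of the 20 standard residues.
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def saac(sequence, terminal_len=25):
    """Return a 60-dimensional vector: AAC of N-terminal, middle, and C-terminal parts.

    terminal_len is a hypothetical default, not a value taken from the thesis.
    For short sequences the terminal length shrinks so the segments never overlap.
    """
    seq = sequence.upper()
    t = min(terminal_len, len(seq) // 3)
    n_term = seq[:t]
    middle = seq[t:len(seq) - t]
    c_term = seq[len(seq) - t:]
    return aac(n_term) + aac(middle) + aac(c_term)
```

Each protein then becomes a fixed-length numeric vector regardless of its sequence length, which is what allows conventional classifiers to be trained on it.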
The research work in this thesis is divided into two phases. In the first phase, a basic framework is established to handle imbalanced datasets in order to enhance true prediction performance. In this phase, conventional individual learning algorithms are employed to develop different prediction systems. Firstly, individual prediction systems are developed in conjunction with the oversampling-based Mega-Trend-Diffusion (MTD) technique. Secondly, the homogeneous ensemble systems CanPro-IDSS and Can-CSCGnB are developed using the MTD and cost-sensitive classifier (CSC) techniques, respectively. It is found that incorporating the MTD technique in the CanPro-IDSS system is superior to the CSC-based technique for handling imbalanced datasets of protein sequences. In this connection, a web-based CanPro-IDSS cancer prediction system is also developed. Lastly, a novel heterogeneous ensemble system called IDMS-HBC is developed for breast cancer detection.
The second phase of this research focuses on exploiting the variation of amino acid molecules in cancerous protein sequences using physicochemical properties. In this phase, unlike traditional ensemble prediction approaches, the proposed IDM-PhyChm-Ens ensemble system is developed by combining the decision spaces of a specific classifier trained on different feature spaces. This intelligent ensemble system is constructed using the diverse learning algorithms Random Forest (RF), Support Vector Machines, K-Nearest Neighbor, and Naive Bayes (NB). It is observed that the combined spaces of SAAC+PseAAC-S and AAC+SAAC possess the best discrimination using ensemble-RF and ensemble-NB. Lastly, a novel classifier-stacking-based evolutionary ensemble system, Can-Evo-Ens, is also developed, in which Genetic Programming is used as the ensemble method. This study revealed that the PseAAC-S feature space carries better discrimination power than the AAC, SAAC, and PseAAC-P based feature extraction strategies.
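The idea of training one classifier type on several feature spaces and then combining its decisions can be sketched as follows. This is only an illustration under stated assumptions: a toy nearest-centroid classifier stands in for RF, SVM, KNN, or NB, and a simple majority vote stands in for the combination rule, which the abstract does not specify (Can-Evo-Ens uses Genetic Programming instead). All class and function names here are hypothetical.

```python
from collections import Counter


class NearestCentroid:
    """Toy stand-in classifier: predicts the label of the closest class mean."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
        # Class centroid = per-feature mean of the training rows of that class.
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids, key=lambda label: dist2(row, self.centroids[label]))
            for row in X
        ]


def fit_ensemble(train_spaces, y):
    """Train one model of the same type per feature space (e.g. AAC, SAAC)."""
    return {name: NearestCentroid().fit(X, y) for name, X in train_spaces.items()}


def predict_ensemble(models, test_spaces):
    """Combine the per-space decisions by simple majority vote (an assumption)."""
    per_space = [models[name].predict(test_spaces[name]) for name in models]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*per_space)]
```

Here `train_spaces` and `test_spaces` map a feature-space name to a list of feature vectors for the same proteins in the same order. With an even number of feature spaces a vote can tie, so an odd number or a weighted rule would be a more careful choice.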
Intensive experiments are performed to evaluate the performance of the proposed intelligent decision making systems on cancer/non-cancer and breast/non-breast cancer datasets. The proposed approaches have demonstrated improvement over previous state-of-the-art approaches. The proposed systems may be useful to academia, practitioners, and clinicians for the early diagnosis of breast cancer using protein sequences. Finally, it is expected that the findings of this research will have a positive impact on the diagnosis, prevention, treatment, and management of cancer.
Supervisor: Abdul Majid
Co-Supervisor: Asifullah Khan
Genetic Programming entries for Safdar Ali