Feature Selection and Classification of High Dimensional Mass Spectrometry Data: A Genetic Programming Approach

Ahmed, Soha; Zhang, Mengjie; Peng, Lifeng

doi:10.1007/978-3-642-37189-9_5

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7833)

Included in the following conference series:

European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics

Abstract

Biomarker discovery from mass spectrometry (MS) data is valuable for disease detection and drug discovery. The process must start with feature selection, because the number of features in MS data is extremely large (e.g., thousands) while the number of samples is comparatively small. In this study, we propose the use of genetic programming (GP) for automatic feature selection and classification of MS data. This GP-based approach places the features selected by two feature selection metrics, namely information gain (IG) and Relief-F (REFS-F), in the terminal set. The feature selection performance of the proposed approach is examined and compared with that of IG and REFS-F alone on five MS data sets with different numbers of features and instances. Naive Bayes (NB), support vector machines (SVMs) and J48 decision trees (J48) are used in the experiments to evaluate the classification accuracy of the selected features. GP is also used as a classification method in the experiments, and its performance is compared with that of NB, SVMs and J48. The results show that GP as a feature selection method selects fewer features and achieves better classification performance than IG and REFS-F when the selected features are evaluated with NB, SVMs and J48. In addition, GP as a classification method outperforms NB and J48 and achieves comparable or slightly better performance than SVMs on these data sets.
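To make the filtering step concrete, here is a minimal sketch of ranking features by information gain and keeping the top k as candidate GP terminals. It is an illustration under stated assumptions, not the authors' implementation: the median-based binary split, the function names (`entropy`, `information_gain`, `top_k_features`) and the toy intensity values are all hypothetical, and the abstract does not say how the features were discretised or how many were kept.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels):
    """Information gain of one continuous feature.

    Assumption: the feature is discretised by a single binary split at its
    median. This is chosen for brevity and is not taken from the paper.
    """
    cut = median(values)
    low = [y for v, y in zip(values, labels) if v <= cut]
    high = [y for v, y in zip(values, labels) if v > cut]
    n = len(labels)
    remainder = (len(low) / n) * entropy(low) + (len(high) / n) * entropy(high)
    return entropy(labels) - remainder


def top_k_features(samples, labels, k):
    """Return the indices of the k features with the highest information gain."""
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in samples]
        scores.append((information_gain(column, labels), j))
    # Highest gain first; ties are broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Hypothetical toy data: 4 spectra (rows) with 3 intensity features (columns).
samples = [
    [1.0, 5.0, 0.3],
    [2.0, 1.0, 0.1],
    [8.0, 4.0, 0.2],
    [9.0, 2.0, 0.4],
]
labels = ["control", "control", "disease", "disease"]

# Indices of the features that would be offered to GP as terminals.
terminal_set = top_k_features(samples, labels, k=1)
print(terminal_set)
```

In this toy data only the first feature separates the two classes at its median, so it is the one retained. A Relief-F ranking could be substituted for `information_gain` in the same `top_k_features` loop.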

Author information

Authors and Affiliations

School of Engineering and Computer Science, Victoria University of Wellington, PO Box 600, Wellington, 6140, New Zealand
Soha Ahmed & Mengjie Zhang
School of Biological Sciences, Victoria University of Wellington, PO Box 600, Wellington, 6140, New Zealand
Lifeng Peng

Editor information

Editors and Affiliations

ISEGI, Universidade Nova de Lisboa, 1070-312, Lisboa, Portugal
Leonardo Vanneschi
Center for Human Genetics Research, Department of Biomedical Informatics, Vanderbilt University, 519 Light Hall, 37232, Nashville, USA
William S. Bush
Department of Veterinary Sciences, University of Torino, via Leonardo da Vinci 44, 10095, Grugliasco, TO, Italy
Mario Giacobini

About this paper

Cite this paper

Ahmed, S., Zhang, M., Peng, L. (2013). Feature Selection and Classification of High Dimensional Mass Spectrometry Data: A Genetic Programming Approach. In: Vanneschi, L., Bush, W.S., Giacobini, M. (eds) Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. EvoBIO 2013. Lecture Notes in Computer Science, vol 7833. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37189-9_5

DOI: https://doi.org/10.1007/978-3-642-37189-9_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37188-2
Online ISBN: 978-3-642-37189-9
eBook Packages: Computer Science, Computer Science (R0)
