Abstract
Proteins can be grouped into families according to features such as hydrophobicity, composition, or structure, with the aim of establishing common biological functions. This paper presents MAHATMA (memetic algorithm-based highly adapted tool for motif ascertainment), a system conceived to discover features (particular sequences of amino acids, or motifs) that occur very often in proteins of a given family but rarely in proteins of other families. These features can be used to classify unknown proteins, that is, to predict their function by analyzing their primary structure. Experiments were conducted with a set of enzymes extracted from the Protein Data Bank. The heuristic method was based on genetic programming, with operators specially tailored to the target problem. The final performance was measured using sensitivity, specificity, and hit rate. The best results obtained for the enzyme dataset suggest that the proposed evolutionary computation method is effective in finding predictive features (motifs) for protein classification.
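As an illustration of the evaluation measures named above, the following minimal Python sketch scores one candidate motif against a target family and a set of proteins from other families. It is not the MAHATMA implementation: the function name `motif_metrics`, the use of a regular expression to represent a motif, the toy sequences, and the reading of hit rate as the overall fraction of correctly classified sequences are all assumptions made for this example.

```python
import re


def motif_metrics(pattern, family, others):
    """Score a candidate motif (hypothetical helper, not from the paper).

    pattern: regular expression standing in for a motif (an assumption).
    family:  sequences from the target family (positive examples).
    others:  sequences from other families (negative examples).
    """
    motif = re.compile(pattern)

    # A sequence is predicted to belong to the family if the motif occurs in it.
    tp = sum(1 for seq in family if motif.search(seq))   # true positives
    fn = len(family) - tp                                # false negatives
    fp = sum(1 for seq in others if motif.search(seq))   # false positives
    tn = len(others) - fp                                # true negatives

    total = len(family) + len(others)
    sensitivity = tp / (tp + fn) if family else 0.0
    specificity = tn / (tn + fp) if others else 0.0
    # Hit rate is taken here to mean overall accuracy (an assumption).
    hit_rate = (tp + tn) / total if total else 0.0
    return sensitivity, specificity, hit_rate


# Toy sequences, invented purely for illustration.
family = ["MKGHSAC", "TTGHSLL", "AGHSPQW", "MMKLLPW"]
others = ["MKLLPQW", "TTAACCW", "GHSAAAA", "PPPPLLL"]

se, sp, hr = motif_metrics("GHS", family, others)
print(f"sensitivity={se:.2f} specificity={sp:.2f} hit_rate={hr:.2f}")
```

A motif that occurs often in the family and rarely elsewhere scores high on both sensitivity and specificity, so a fitness function in an evolutionary search could plausibly combine the two. How MAHATMA combines them is not stated in this abstract.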
Notes
Available at http://www.ncbi.nlm.nih.gov/blast.
Available at http://www.pdb.org/pdb/home/home.do.
Cite this article
Tsunoda, D.F., Freitas, A.A. & Lopes, H.S. A genetic programming method for protein motif discovery and protein classification. Soft Comput 15, 1897–1908 (2011). https://doi.org/10.1007/s00500-010-0624-9
DOI: https://doi.org/10.1007/s00500-010-0624-9