A genetic programming method for protein motif discovery and protein classification

  • Focus
  • Soft Computing

Abstract

Proteins can be grouped into families according to features such as hydrophobicity, composition or structure, with the aim of identifying common biological functions. This paper presents MAHATMA (memetic algorithm-based highly adapted tool for motif ascertainment), a system designed to discover features (particular sequences of amino acids, or motifs) that occur frequently in proteins of a given family but rarely in proteins of other families. These features can be used to classify unknown proteins, that is, to predict their function from their primary structure. Experiments were carried out on a set of enzymes extracted from the Protein Data Bank. The heuristic method is based on genetic programming, with operators specially tailored to the target problem. Performance was measured using sensitivity, specificity and hit rate. The best results obtained for the enzyme dataset suggest that the proposed evolutionary computation method is effective in finding predictive features (motifs) for protein classification.
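To make the idea of a family-specific motif concrete, the following Python sketch scores one candidate motif by its sensitivity (the fraction of family proteins that contain it) and its specificity (the fraction of non-family proteins that do not). It is an illustration only, not the MAHATMA implementation: the abstract does not describe how motifs are represented or how fitness is computed, so exact substring matching is an assumption, the function name motif_scores is hypothetical, and the sequences are made-up toy strings rather than real enzymes.

```python
from typing import Iterable, Tuple


def motif_scores(motif: str,
                 family: Iterable[str],
                 others: Iterable[str]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of a motif for one protein family.

    Assumption: a motif "occurs" in a protein when it appears as an exact
    substring of the amino-acid sequence. The paper may use a richer
    motif representation.
    """
    family = list(family)
    others = list(others)
    # True positives: family proteins that contain the motif.
    tp = sum(motif in seq for seq in family)
    # True negatives: proteins of other families that lack the motif.
    tn = sum(motif not in seq for seq in others)
    sensitivity = tp / len(family) if family else 0.0
    specificity = tn / len(others) if others else 0.0
    return sensitivity, specificity


# Toy data (invented for illustration, not taken from the Protein Data Bank).
family = ["MKTAYIAKQR", "GGTAYIAPLL", "TAYIAMMSDE", "QQQWERTYHH"]
others = ["MKLLPPGGAA", "TAYIACCDDE", "WWHHKKRRNN", "PLMNVVCSTA"]

sens, spec = motif_scores("TAYIA", family, others)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

A motif that scores high on both measures is frequent inside the family and rare outside it, which is the property the abstract describes. How the two measures are combined into a single fitness value for the evolutionary search is not stated in the abstract, so no combination is shown here.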

Notes

  1. Available at http://www.ncbi.nlm.nih.gov/blast.

  2. Available at http://www.pdb.org/pdb/home/home.do.

Author information

Corresponding author

Correspondence to Denise Fukumi Tsunoda.

About this article

Cite this article

Tsunoda, D.F., Freitas, A.A. & Lopes, H.S. A genetic programming method for protein motif discovery and protein classification. Soft Comput 15, 1897–1908 (2011). https://doi.org/10.1007/s00500-010-0624-9

  • DOI: https://doi.org/10.1007/s00500-010-0624-9
