Prediction of expected performance for a genetic programming classifier

Genetic Programming and Evolvable Machines

Abstract

The estimation of problem difficulty is an open issue in genetic programming (GP). The goal of this work is to generate models that predict the expected performance of a GP-based classifier when it is applied to an unseen task. Classification problems are described using domain-specific features, some of which are proposed in this work, and these features are given as input to the predictive models. These models are referred to as predictors of expected performance. We extend this approach by using an ensemble of specialized predictors (SPEP), dividing classification problems into groups and choosing the corresponding SPEP. The proposed predictors are trained using 2D synthetic classification problems with balanced datasets. The models are then used to predict the performance of the GP classifier on unseen real-world datasets that are multidimensional and imbalanced. This work is the first to provide a performance prediction of a GP system on test data, while previous works focused on predicting training performance. Accurate predictive models are generated by posing a symbolic regression task and solving it with GP. These results are achieved by using highly descriptive features and including a dimensionality reduction stage that simplifies the learning and testing process. The proposed approach could be extended to other classification algorithms and used as the basis of an expert system for algorithm selection.
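The idea of grouping problems and choosing a specialized predictor for each group can be illustrated with a minimal sketch. It is not the authors' implementation. The single descriptive feature, the threshold-based grouping rule, the group names, and the training values are all hypothetical, and an ordinary least-squares line is swapped in for the GP-evolved symbolic regression models described above.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (stand-in for a GP-evolved predictor)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # All feature values identical: fall back to a constant predictor.
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def assign_group(feature, threshold=0.5):
    """Hypothetical grouping rule: split problems on one descriptive feature."""
    return "low" if feature < threshold else "high"

def train_spep(problems):
    """Fit one specialized predictor per group.

    problems: list of (feature_value, observed_test_error) pairs.
    """
    groups = {}
    for feature, error in problems:
        groups.setdefault(assign_group(feature), []).append((feature, error))
    return {
        name: fit_line([f for f, _ in rows], [e for _, e in rows])
        for name, rows in groups.items()
    }

def predict(spep, feature):
    """Pick the predictor for the problem's group and apply it."""
    a, b = spep[assign_group(feature)]
    return a * feature + b

# Made-up training set standing in for the synthetic classification problems.
synthetic = [(0.1, 0.02), (0.2, 0.04), (0.3, 0.06),
             (0.6, 0.30), (0.8, 0.40), (1.0, 0.50)]
spep = train_spep(synthetic)
print(predict(spep, 0.25), predict(spep, 0.9))
```

In this toy setting each group gets its own fitted line, so an unseen problem is first assigned to a group and then scored by that group's predictor only.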

Notes

  1. Such a distance measure is a common fitness function for many application domains of GP, particularly for symbolic regression problems.

  2. Our PGPC and SPEP implementations were not optimized, so running times with other implementations might be substantially different. Nonetheless, we believe that these results give a sufficiently accurate estimate of the possible usefulness of our proposed methodology.

Acknowledgments

This research was supported by CONACYT Basic Science Research Project No. 178323, TecNM (México) Research Project 5621.15-P, and by the FP7-Marie Curie-IRSES 2013 European Commission program through project ACoBSEC with Contract No. 612689. The first author was supported by CONACYT doctoral scholarship No. 226981. The fourth author acknowledges funding provided by an ELEVATE Fellowship, the Irish Research Council’s Career Development Fellowship co-funded by Marie Curie Actions, and thanks the TAO group at INRIA Saclay and LRI, Univ. Paris-Sud and CNRS, Orsay, France, for hosting him during the outgoing phase of the ELEVATE Fellowship.

Author information

Corresponding author

Correspondence to Leonardo Trujillo.

About this article

Cite this article

Martínez, Y., Trujillo, L., Legrand, P. et al. Prediction of expected performance for a genetic programming classifier. Genet Program Evolvable Mach 17, 409–449 (2016). https://doi.org/10.1007/s10710-016-9265-9

  • DOI: https://doi.org/10.1007/s10710-016-9265-9
