Skip to main content

Filtering Outliers in One Step with Genetic Programming

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 11101))

Abstract

Outliers are one of the most difficult issues when dealing with real-world modeling tasks. Even a small percentage of outliers can impede a learning algorithm’s ability to fit a dataset. While robust regression algorithms exist, they fail when a dataset is corrupted by more than 50% of outliers (breakdown point). In the case of Genetic Programming, robust regression has not been properly studied. In this paper we present a method that works as a filter, removing outliers from the target variable (vertical outliers). The algorithm is simple, it uses a randomly generated population of GP trees to determine which target values should be labeled as outliers. The method is highly efficient. Results show that it can return a clean dataset when contamination reaches as high as 90%, and may be able to handle higher levels of contamination. In this study only synthetic univariate benchmarks are used to evaluate the approach, but it must be stressed that no other approaches can deal with such high levels of outlier contamination while requiring such small computational effort.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

  1. 1.

    In fact, while the reported experiments only consider up to this level of data contamination, it is straightforward to extend our approach to more severe scenarios.

  2. 2.

    While it would be relatively simple to parallelize the algorithm, since all samples are taken independently, the cost can still become quite high if the modelling is done with GP.

References

  1. Trujillo, L., et al.: Neat genetic programming: controlling bloat naturally. Inf. Sci. 333, 21–43 (2016)

    Article  Google Scholar 

  2. López, U., Trujillo, L., Martinez, Y., Legrand, P., Naredo, E., Silva, S.: RANSAC-GP: dealing with outliers in symbolic regression with genetic programming. In: McDermott, J., Castelli, M., Sekanina, L., Haasdijk, E., García-Sánchez, P. (eds.) EuroGP 2017. LNCS, vol. 10196, pp. 114–130. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-55696-3_8

    Chapter  Google Scholar 

  3. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)

    Article  MathSciNet  Google Scholar 

  4. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004). ISBN 0521540518

    Book  Google Scholar 

  5. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 15:1–15:58 (2009)

    Article  Google Scholar 

  6. Hodge, V.J., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22, 85–126 (2004)

    Article  Google Scholar 

  7. Pearson, R.: Mining Imperfect Data: Dealing with Contamination and Incomplete Records. Society for Industrial and Applied Mathematics. SIAM, Philadelphia (2005)

    Book  Google Scholar 

  8. Kotanchek, M.E., Vladislavleva, E.Y., Smits, G.F.: Symbolic regression via genetic programming as a discovery engine: insights on outliers and prototypes. In: Riolo, R., O’Reilly, U.M., McConaghy, T. (eds.) Genetic Programming Theory and Practice VII. Genetic and Evolutionary Computation, pp. 55–72. Springer, Boston (2010). https://doi.org/10.1007/978-1-4419-1626-6_4

    Chapter  Google Scholar 

  9. Miranda, L.F., Oliveira, L.O.V.B., Martins, J.F.B.S., Pappa, G.L.: How noisy data affects geometric semantic genetic programming. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 985–992. ACM, New York (2017)

    Google Scholar 

  10. Tran, C.T., Zhang, M., Andreae, P., Xue, B.: Genetic programming based feature construction for classification with incomplete data. In: Proceedings of the Genetic and Evolutionary Computation Conference. GECCO 2017, pp. 1033–1040. ACM, New York (2017)

    Google Scholar 

  11. Rousseeuw, P.J.: Least median of squares regression. J. Am. Stat. Assoc. 79(388), 871–880 (1984)

    Article  MathSciNet  Google Scholar 

  12. Alfons, A., Croux, C., Gelper, S.: Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann. Appl. Stat. 7(1), 226–248 (2013)

    Article  MathSciNet  Google Scholar 

  13. Giloni, A., Padberg, M.: Least trimmed squares regression, least median squares regression, and mathematical programming. Math. Comput. Model. 35(9), 1043–1060 (2002)

    Article  MathSciNet  Google Scholar 

  14. Bertsimas, D., Mazumder, R.: Least quantile regression via modern optimization. ArXiv e-prints (2013)

    Google Scholar 

  15. Meinshausen, N.: Quantile regression forests. J. Mach. Learn. Res. 7, 983–999 (2006)

    MathSciNet  MATH  Google Scholar 

  16. Hubert, M., Rousseeuw, P.J., Van Aelst, S.: Statist. Sci. High-breakdown robust multivariate methods 23, 92–119 (2008)

    Google Scholar 

  17. Torr, P.H., Zisserman, A.: MLESAC: a new robust estimator with application to estimating image geometry. Comput. Vis. Image Underst. 78(1), 138–156 (2000)

    Article  Google Scholar 

  18. Hast, A., Nysjö, J., Marchetti, A.: Optimal RANSAC-towards a repeatable algorithm for finding the optimal set (2013)

    Google Scholar 

  19. Gonçalves, I., Silva, S.: Balancing learning and overfitting in genetic programming with interleaved sampling of training data. In: Krawiec, K., Moraglio, A., Hu, T., Etaner-Uyar, A.Ş., Hu, B. (eds.) EuroGP 2013. LNCS, vol. 7831, pp. 73–84. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37207-0_7

    Chapter  Google Scholar 

  20. Spector, L.: Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report. In: Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation Conference Companion, GECCO Companion 2012, pp. 401–408. ACM (2012)

    Google Scholar 

  21. Kotanchek, M., Smits, G., Vladislavleva, E.: Pursuing the pareto paradigm: tournaments, algorithm variations and ordinal optimization. In: Riolo, R., Soule, T., Worzel, B. (eds.) Genetic Programming Theory and Practice IV. Genetic and Evolutionary Computation. Springer, Heidelberg (2007). https://doi.org/10.1007/978-0-387-49650-4_11

    Chapter  MATH  Google Scholar 

  22. Fortin, F.A.: DEAP: evolutionary algorithms made easy. J. Mach. Learn. Res. 13, 2171–2175 (2012)

    MathSciNet  MATH  Google Scholar 

Download references

Acknowledgments

This research was funded by CONACYT (Mexico) Fronteras de la Ciencia 2015-2 Project No. FC-2015-2:944, BioISI R&D unit, UID/MULTI/04046/2013 funded by FCT/MCTES/PIDDAC, Portugal, and first author supported by CONACYT graduate scholarship No. 573397.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Leonardo Trujillo .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

López, U., Trujillo, L., Legrand, P. (2018). Filtering Outliers in One Step with Genetic Programming. In: Auger, A., Fonseca, C., Lourenço, N., Machado, P., Paquete, L., Whitley, D. (eds) Parallel Problem Solving from Nature – PPSN XV. PPSN 2018. Lecture Notes in Computer Science(), vol 11101. Springer, Cham. https://doi.org/10.1007/978-3-319-99253-2_17

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-99253-2_17

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99252-5

  • Online ISBN: 978-3-319-99253-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics