
Symbolic Regression Via Genetic Programming as a Discovery Engine: Insights on Outliers and Prototypes

Genetic Programming Theory and Practice VII

Abstract

In this chapter we illustrate a framework based on symbolic regression to generate and sharpen the questions about the nature of the underlying system and provide additional context and understanding based on multi-variate numeric data.

We emphasize the necessity of performing data modeling as a global approach, iteratively applying data analysis and adaptation, model building, and problem reduction procedures. We illustrate it for the problem of detecting outliers and extracting significant features from CountryData, a data set of economic, political, social and geographic data. We present two complementary ways of extracting outliers from the data: the content-based and the model-based approach. The content-based approach studies the geometrical structure of the multi-variate data and uses data-balancing algorithms to sort the data records in order of decreasing typicalness, identifying the outliers as the least typical records before any modeling is applied to the data set. The model-based outlier detection approach uses symbolic regression via Pareto genetic programming (GP) to identify records that are systematically under- or over-predicted by diverse ensembles of (thousands of) global non-linear symbolic regression models.
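The two approaches can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the abstract does not specify the data-balancing algorithm, so the content-based part assumes mean distance to the k nearest neighbours as a stand-in atypicality score, and the model-based part assumes a tiny hand-written ensemble of lambdas in place of thousands of GP-evolved models. All function names, the `tol` and `agreement` parameters, and the toy data are hypothetical.

```python
import math

def knn_atypicality(records, k=2):
    """Mean Euclidean distance from each record to its k nearest neighbours.
    (Assumed stand-in score; higher means less typical.)"""
    scores = []
    for i, r in enumerate(records):
        dists = sorted(math.dist(r, other)
                       for j, other in enumerate(records) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

def order_by_typicalness(records, k=2):
    """Record indices sorted from most typical to least typical."""
    scores = knn_atypicality(records, k)
    return sorted(range(len(records)), key=scores.__getitem__)

def systematic_misfits(inputs, targets, models, tol, agreement=0.9):
    """Flag records that at least `agreement` of the ensemble misses
    in the same direction by more than `tol`."""
    flagged = {}
    for i, (x, y) in enumerate(zip(inputs, targets)):
        residuals = [m(x) - y for m in models]
        over = sum(r > tol for r in residuals) / len(models)
        under = sum(r < -tol for r in residuals) / len(models)
        if over >= agreement:
            flagged[i] = "over-predicted"
        elif under >= agreement:
            flagged[i] = "under-predicted"
    return flagged

# Content-based: four clustered toy records and one far-away record.
records = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
order = order_by_typicalness(records, k=2)

# Model-based: toy targets follow y = 2x except the third record.
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [2.0, 4.0, 9.0, 8.0]
models = [lambda x: 2.0 * x, lambda x: 2.0 * x + 0.1, lambda x: 1.95 * x]
flags = systematic_misfits(inputs, targets, models, tol=1.0)
```

In this toy setup the isolated record ends up last in `order`, and only the third record is flagged, because every ensemble member under-predicts it by more than the tolerance. Requiring directional agreement across the ensemble, rather than a large error from a single model, is what makes the flag "systematic" in the sense described above.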

Both approaches, applied to the CountryData, produce insights into the outlier vs. prototype division among world countries and into the driving economic properties that predict gross domestic product (GDP) per capita.




Copyright information

© 2010 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Kotanchek, M.E., Vladislavleva, E.Y., Smits, G.F. (2010). Symbolic Regression Via Genetic Programming as a Discovery Engine: Insights on Outliers and Prototypes. In: Riolo, R., O'Reilly, UM., McConaghy, T. (eds) Genetic Programming Theory and Practice VII. Genetic and Evolutionary Computation. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-1626-6_4


  • DOI: https://doi.org/10.1007/978-1-4419-1626-6_4


  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-1653-2

  • Online ISBN: 978-1-4419-1626-6

  • eBook Packages: Computer Science (R0)
