A new imputation method based on genetic programming and weighted KNN for symbolic regression with incomplete data

  • Methodologies and Application
Soft Computing

Abstract

Incomplete data is a common data quality problem in real-world machine learning tasks. Many studies have addressed it, but most focus on classification; only a few consider symbolic regression with missing values. This work proposes a new imputation method for symbolic regression with incomplete data that aims to improve both the effectiveness and the efficiency of imputing missing values. The method combines genetic programming (GP) with weighted K-nearest neighbors (KNN): it constructs GP-based models that use the other available features to predict the missing values of incomplete features, and it selects the instances used to construct these models with weighted KNN. Experimental results on real-world data sets show that the proposed method outperforms a number of state-of-the-art methods in imputation accuracy, symbolic regression performance, and imputation time.
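The two-stage idea in the abstract (select instances with weighted KNN, then fit a model on them to predict the missing value) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the weighting scheme, so per-feature weights on a Euclidean distance are assumed here, and a least-squares line on a single predictor stands in for the GP-evolved model. All function names and the toy data are hypothetical.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance over features observed in both rows."""
    total = 0.0
    weight_sum = 0.0
    for x, y, w in zip(a, b, weights):
        if x is None or y is None or w == 0:
            continue  # skip missing values and zero-weight features
        total += w * (x - y) ** 2
        weight_sum += w
    if weight_sum == 0:
        return math.inf  # no shared observed feature
    return math.sqrt(total / weight_sum)


def select_instances(rows, query, target, weights, k):
    """Pick the k rows nearest to `query` whose target feature is observed."""
    w = list(weights)
    w[target] = 0.0  # the feature being imputed must not affect the distance
    candidates = [r for r in rows if r is not query and r[target] is not None]
    return sorted(candidates, key=lambda r: weighted_distance(r, query, w))[:k]


def fit_line(xs, ys):
    """Least-squares line; an assumed stand-in for the GP model."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y  # constant predictor: fall back to the mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute(rows, query, target, predictor, weights, k):
    """Predict query[target] from query[predictor] using selected instances."""
    chosen = select_instances(rows, query, target, weights, k)
    chosen = [r for r in chosen if r[predictor] is not None]
    if not chosen:
        raise ValueError("no usable instances to fit the model")
    slope, intercept = fit_line([r[predictor] for r in chosen],
                                [r[target] for r in chosen])
    return intercept + slope * query[predictor]


# Toy data: feature 1 equals 2 * feature 0 + 1, except for one distant outlier.
rows = [[1.0, 3.0], [2.0, 5.0], [4.0, 9.0], [5.0, 11.0], [50.0, 0.0], [3.0, None]]
imputed = impute(rows, rows[-1], target=1, predictor=0, weights=[1.0, 1.0], k=4)
print(imputed)
```

In this toy case the instance selection keeps the four rows near the incomplete one and leaves out the distant outlier, so the stand-in model is fitted only on relevant instances. In the method described above, a GP-evolved expression over the other available features would take the place of `fit_line`.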

Availability of data and materials

The data used were obtained from the freely available public repositories UCI and OpenML.

Author information

Corresponding author

Correspondence to Baligh Al-Helali.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Code availability

The methods are implemented in Python, mainly using the open-source package DEAP.

About this article

Cite this article

Al-Helali, B., Chen, Q., Xue, B. et al. A new imputation method based on genetic programming and weighted KNN for symbolic regression with incomplete data. Soft Comput 25, 5993–6012 (2021). https://doi.org/10.1007/s00500-021-05590-y

  • DOI: https://doi.org/10.1007/s00500-021-05590-y
