Analytic Continued Fractions for Regression: A Memetic Algorithm Approach

https://doi.org/10.1016/j.eswa.2021.115018

Highlights

  • A regression method employing analytic continued fractions as a novel representation.

  • CFR ranked 1st among the 16 methods for generalisation performances on 94 datasets.

  • Statistically comparable to the best 3 algorithms (eplex-1m, xgboost & grad-boost).

  • CFR results have been obtained with a limited level of parameter tuning.

Abstract

We present an approach for regression problems that employs analytic continued fractions as a novel representation. Comparative computational results using a memetic algorithm are reported in this work. Our experiments included fifteen other machine learning approaches: five genetic programming methods for symbolic regression and ten machine learning methods. The comparison of training and test generalization was performed using 94 datasets of the Penn Machine Learning Benchmark. The statistical tests showed that the generalization results using analytic continued fractions provide a powerful and interesting new alternative in the quest for compact and interpretable mathematical models for artificial intelligence.

Introduction

Symbolic regression is a type of multivariate regression analysis in which the goal is to find a mathematical expression for an unknown multivariate target function $f:\mathbb{R}^n \to \mathbb{R}$ that fits a dataset $S=\{(x^{(i)}, y^{(i)})\}$, i.e. a set of input-output pairs of that function. It has been argued that, when analysing experimental data for decision making, symbolic regression methods should at least be used to complement standard multivariate analysis (Duffy and Engle-Warnick, 2002). Compared with the output of artificial neural network approaches, the models generated by symbolic regression, when relatively small, are perhaps more amenable to downstream studies via uncertainty propagation and sensitivity analysis and thus more “explainable” (Otte, 2013, Sun et al., 2019). The other benefit of symbolic regression is that it makes no assumptions about prior knowledge of the underlying process or mechanism that produced the observed data (Moscato and de Vries, 2019). This allows researchers to explore problem domains for which they have incomplete knowledge and to identify underlying trends and patterns without introducing human bias.

Over the past three decades, symbolic regression has produced an impressive number of results in many applications. For instance, symbolic regression has helped to extract physical laws from experimental data of chaotic dynamical systems without any knowledge of Newtonian mechanics (Schmidt and Lipson, 2009), which then motivated the data-driven discovery of hidden relationships in astronomy (Graham et al., 2013). More recent applications include prediction of friction system performance (Kronberger et al., 2018), identification of nonlinear relationships in fMRI data (Märtens et al., 2017), radiotherapy dose reconstruction for childhood cancer survivors (Virgolin et al., 2018), our own work in oncology on uncovering mechanisms of drug response in cancer cell lines using genomic and experimental data (Fitzsimmons and Moscato, 2018), prediction of wind farm output from weather data (Vladislavleva et al., 2013), energy consumption forecasting (Delgado et al., 2018), computer game scene generation (Frade et al., 2009), and Boolean classification (Muruzábal et al., 2000). Symbolic regression methods have also played a role in the elicitation of functional constructs from surveys (de Vries et al., 2014) and in the analysis of consumer and business data (Moscato and de Vries, 2019).

One common approach to implementing symbolic regression is Evolutionary Computation (EC). EC is a family of optimization algorithms inspired by biological evolution, in particular building upon Darwin’s theory of natural selection. In EC, a population of candidate solutions to a problem (generally posed as an optimization problem) is subjected to a set of heuristics and exact algorithms that produce new solutions, while less desirable solutions are removed from the population currently under consideration.

EC approaches to symbolic regression are commonly based on Genetic Programming (GP) with a tree-based representation. Karaboga et al. (2012) proposed Artificial Bee Colony Programming, which also uses the tree-based representation for symbolic regression and showed competitive performance against GP-based methods. Each solution (i.e. a mathematical expression) is written as a syntax tree, and new solutions are produced by exchanging subtrees of two solutions (crossover) or by modifying a syntax element, such as a binary operator (mutation) (Koza, 1990). Although this approach is highly popular, several researchers have noted that recombination methods based on sub-tree crossover have been shown to be no better than simple mutation of the sub-branches (Clegg et al., 2007). Clegg et al. (2007) cite previous contributions to this issue by Angeline (1997) and Luke and Spector (1997), and state that “due to findings like these, some people now implement their GP’s without using crossover at all, i.e. using mutation only.”
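The tree-based operators described above can be illustrated with a minimal sketch. It is not taken from any of the cited GP systems: the tuple encoding of syntax trees and the helper names (`paths`, `get`, `replace`, `crossover`, `mutate_operator`) are assumptions made for illustration only.

```python
import random

# An expression tree is either a leaf (variable name or numeric constant)
# or a tuple (operator, left_subtree, right_subtree).
OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}


def evaluate(tree, env):
    """Evaluate an expression tree for the variable values in env."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(tree, str):
        return env[tree]
    return tree  # numeric constant


def paths(tree, prefix=()):
    """List the path (sequence of child indices) to every node."""
    found = [prefix]
    if isinstance(tree, tuple):
        found += paths(tree[1], prefix + (1,))
        found += paths(tree[2], prefix + (2,))
    return found


def get(tree, path):
    """Return the subtree reached by following path from the root."""
    for index in path:
        tree = tree[index]
    return tree


def replace(tree, path, new_subtree):
    """Return a copy of tree with the node at path replaced."""
    if not path:
        return new_subtree
    children = list(tree)
    children[path[0]] = replace(tree[path[0]], path[1:], new_subtree)
    return tuple(children)


def crossover(parent_a, parent_b, rng):
    """Subtree crossover: graft a random subtree of parent_b into parent_a."""
    target = rng.choice(paths(parent_a))
    donor = get(parent_b, rng.choice(paths(parent_b)))
    return replace(parent_a, target, donor)


def mutate_operator(tree, rng):
    """Point mutation: swap the binary operator of one random internal node."""
    internal = [p for p in paths(tree) if isinstance(get(tree, p), tuple)]
    if not internal:
        return tree
    path = rng.choice(internal)
    op, left, right = get(tree, path)
    new_op = rng.choice([o for o in OPS if o != op])
    return replace(tree, path, (new_op, left, right))
```

For example, the expression x + 2y is encoded as `("+", "x", ("*", 2.0, "y"))`, and `crossover(tree_a, tree_b, random.Random(0))` returns a new tree without modifying either parent.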

We believe that this difficulty in symbolic regression could be addressed by using generic problem-domain information about function approximation to search for better models. Like GP, Memetic Algorithms (MAs) are a generic denomination for a population-based approach to solving optimization problems. However, MAs take explicit advantage of heuristic and exact methods in which solutions are individually optimized and also recombined and changed to improve the diversity of the population (Neri et al., 2012). Since the field started at the California Institute of Technology three decades ago (Moscato, 1989, Moscato, 2012), research in MAs has demonstrated that problem-domain information can be used to produce local search (LS) methods that can significantly accelerate the evolutionary process. Trujillo et al. (2017) recognize this fact and point out that, in contrast, local search has been underused in Genetic Programming. In their view, some of the problems faced by Genetic Programming are linked to the use of a tree-based representation of solutions. They conclude “that numerical LS and memetic search is seldom integrated in most GP systems” and that “The fact that memetic approaches have not been fully explored in GP literature, opens up several areas (for) future lines of inquiry”. We agree with this statement. In fact, of the 3918 publications about MAs that we found in the bibliographic database Web of Science (on 20/11/2019), we identified very few regarding the use of local search for symbolic regression. However, it is also true that some researchers have been trying to address the need to include individual optimization in existing Genetic Programming approaches, e.g. Cagnoni et al., 2005, Azad and Ryan, 2014, Ffrancon and Schoenauer, 2015, Semenkina and Semenkin, 2015, Kommenda et al., 2020. While this list is probably not comprehensive, it is recognized that introducing individual optimization steps into evolutionary algorithm (EA) methods based on current representations for solutions has been a challenge for symbolic regression approaches.
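The distinguishing feature of an MA, individual optimization combined with recombination and mutation, can be sketched on a toy continuous problem. This is a minimal illustration under stated assumptions, not the algorithm of this paper: the population here is flat rather than hierarchical, the coordinate-wise pattern search is a simple stand-in for the individual search step, and the names `local_search` and `memetic_search` are hypothetical.

```python
import random


def local_search(candidate, cost, step=0.5, min_step=1e-3):
    """Individual optimization: coordinate-wise pattern search (a stand-in)."""
    best, best_cost = list(candidate), cost(candidate)
    while step > min_step:
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                trial_cost = cost(trial)
                if trial_cost < best_cost:
                    best, best_cost, improved = trial, trial_cost, True
        if not improved:
            step /= 2  # refine the step once no neighbour is better
    return best


def memetic_search(cost, dim, generations=20, pop_size=8, seed=0):
    """Evolve a population of real vectors, refining each child locally."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(dim)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        a, b = rng.sample(population, 2)
        # Recombination: take each coordinate from one of the two parents.
        child = [rng.choice(pair) for pair in zip(a, b)]
        # Mutation: perturb one coordinate to keep the population diverse.
        child[rng.randrange(dim)] += rng.gauss(0, 1)
        # Individual optimization is what makes the search "memetic".
        child = local_search(child, cost)
        # Replacement: the child displaces the worst solution if it is better.
        worst = max(population, key=cost)
        if cost(child) < cost(worst):
            population[population.index(worst)] = child
    return min(population, key=cost)
```

Calling `memetic_search(lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2, dim=2)` returns a vector close to (1, -2). Removing the `local_search` call turns the loop into a plain evolutionary algorithm, which makes the contribution of the individual optimization step easy to compare.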

In this paper, we introduce a new approach to regression with a memetic algorithm and we analyze its performance against other existing implementations of symbolic regression and machine learning approaches. In particular, our contributions are as follows:

  • We introduce a novel method to represent mathematical expressions with analytic continued fractions, drawing inspiration from Padé approximants. We discuss the advantages of this particular representation over the more traditional syntax-tree-based representation.

  • We implement an MA for symbolic regression with the continued fraction representation, a hierarchical population structure to manage the quality of the population of solutions, and an individual search method based on the Nelder-Mead algorithm.

  • We compare our MA-based approach with 15 other state-of-the-art implementations of symbolic regression on 94 benchmark datasets. We demonstrate that our algorithm is able to extrapolate well-fitting relationships and that its performance is comparable to other methods.
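To make the continued fraction representation concrete, the sketch below evaluates a truncated continued fraction of the form g0(x) + h0(x) / (g1(x) + h1(x) / (g2(x) + ...)). The exact form is not given in this excerpt, so the sketch rests on an assumption: every g_i and h_i is an affine function of the input vector x. The function names are hypothetical.

```python
def affine(coeffs, bias, x):
    """Evaluate the affine function coeffs . x + bias."""
    return sum(c * xi for c, xi in zip(coeffs, x)) + bias


def eval_continued_fraction(g_terms, h_terms, x):
    """Evaluate g0 + h0 / (g1 + h1 / (g2 + ...)) truncated at a finite depth.

    g_terms has depth + 1 entries and h_terms has depth entries; each entry
    is a (coeffs, bias) pair describing an affine function of x (assumed form).
    """
    if len(g_terms) != len(h_terms) + 1:
        raise ValueError("need exactly one more g term than h terms")
    # Evaluate bottom-up, starting from the innermost denominator.
    value = affine(*g_terms[-1], x)
    for g, h in zip(reversed(g_terms[:-1]), reversed(h_terms)):
        value = affine(*g, x) + affine(*h, x) / value
    return value
```

With `g_terms = [([1.0], 0.0), ([1.0], 0.0)]` and `h_terms = [([0.0], 1.0)]` the model is x + 1/x. In a memetic setting, the coefficients and biases would be the quantities tuned by the individual search step. This sketch does not guard against a denominator evaluating to zero, which raises `ZeroDivisionError`.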

Following this introduction, the remainder of this paper is organized as follows. Section 2 describes the datasets and methods used; in particular, the memetic algorithm is described in Section 2.2, and Section 2.3 presents an illustrative one-dimensional example of using symbolic regression to approximate an important special function in mathematics, the Gamma function. The computational results are presented in Section 3, followed by their discussion in Section 4. Finally, Section 5 contains concluding remarks and discusses possible future work in the area.

Data and methods

In this section, we describe the proposed method and the datasets used to estimate its performance. We describe in detail the representation used by the proposed symbolic regression method, followed by the memetic algorithm for model identification and an illustrative example in which the proposed method approximates the gamma function. We then describe the experimental procedures and datasets used to measure the performance of the proposed method.

Results on Penn machine learning benchmarks datasets

We measured the MSE and NMSE scores obtained on the training set for each of the datasets. The model found during the training phase was also evaluated on the testing data. We tabulated the median MSE and NMSE scores of the algorithm in both training (t) and testing (T) over the 100 independent runs. Table 2 summarizes the scores for all benchmark datasets. We have sorted the results from the smallest to the largest median NMSE score achieved in testing.
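As a reference for the two scores, the following sketch computes the MSE and one common form of the NMSE. The definition of NMSE is not given in this excerpt, so normalising the MSE by the (population) variance of the observed targets is an assumption, and the function names are illustrative.

```python
from statistics import median


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def nmse(y_true, y_pred):
    """MSE divided by the variance of the observed targets (assumed form)."""
    mean_y = sum(y_true) / len(y_true)
    variance = sum((t - mean_y) ** 2 for t in y_true) / len(y_true)
    return mse(y_true, y_pred) / variance


def median_score(scores_per_run):
    """Summarise the scores of independent runs by their median."""
    return median(scores_per_run)
```

Under this definition, a model that always predicts the mean of the observed targets has an NMSE of 1, so values below 1 indicate a fit better than that baseline.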

Performance comparison with state-of-the-art algorithms

We compared our

Discussion

After analyzing the median ranking of state-of-the-art regression algorithms (both GP-based and ML-based), we found that our proposed approach achieves comparable performance when compared with all state-of-the-art approaches. The results show that the use of continued fractions is a promising new idea due to its representational power.
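A median ranking of this kind can be computed as in the following sketch, which ranks the methods on each dataset by test error and then takes the median rank of each method across datasets. The tie-handling rule (tied methods share the mean rank) and the data layout are assumptions, since the exact procedure is not described in this excerpt.

```python
from statistics import median


def rank_scores(scores):
    """Rank methods on one dataset (1 = lowest error); ties share the mean rank."""
    ordered = sorted(scores, key=scores.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of methods tied with the method at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for name in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


def median_ranks(results):
    """results maps dataset name -> {method name: test error}."""
    per_method = {}
    for scores in results.values():
        for name, rank in rank_scores(scores).items():
            per_method.setdefault(name, []).append(rank)
    return {name: median(ranks) for name, ranks in per_method.items()}
```

The per-dataset ranks produced by `rank_scores` are also the usual input to rank-based statistical tests for comparing several algorithms over multiple datasets.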

The statistical test on the results (in Fig. 6) showed that there are no statistically significant differences for the performance of CFR with the eplex-1m, xgboost and

Conclusion

In this paper, we present a comprehensive experimental comparison of an alternative method for multivariate regression problems that uses analytic continued fractions as a representation of the mathematical models. This has led to a challenging type of nonlinear optimization problem, and we have presented a memetic algorithm for this problem using a hierarchically structured population. A variant of the classical Nelder-Mead search was proposed as the individual optimization step. We compared the

CRediT authorship contribution statement

Pablo Moscato: Conceptualization, Methodology, Formal analysis, Investigation, Writing - original draft, Writing - review & editing, Supervision, Project administration, Funding acquisition. Haoyuan Sun: Methodology, Software, Validation, Formal analysis, Investigation, Writing - review & editing. Mohammad Nazmul Haque: Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft, Writing - review & editing, Visualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We thank Dr Markus Wagner, School of Computer Science at The University of Adelaide, Australia for his thoughtful comments that helped us to improve an earlier version of the manuscript. We also thank the members of Prof. Jason H. Moore’s research lab at the University of Pennsylvania, USA, for making both the source code of their experiments and the Penn Machine Learning Benchmarks datasets available. M.N.H. and P.M. thank Renata Sarmet from the Universidade Federal de São Carlos, Sao Paulo,

References (88)

  • Baker Jr., G. A. (2012). Padé approximant. Accessed April 15,...
  • Berretta, R. et al. Enhancing the performance of memetic algorithms by using a matching-based recombination algorithm.
  • Breiman, L. (2001). Random forests. Machine Learning.
  • Buriol, L. et al. (2004). A new memetic algorithm for the asymmetric traveling salesman problem. Journal of Heuristics.
  • Cagnoni, S., Rivero, D. & Vanneschi, L. (2005). A purely evolutionary memetic algorithm as a first step towards...
  • Calvo, B. & Santafé Rodrigo, G. (2016). scmamp: Statistical comparison of multiple algorithms in multiple problems. The...
  • Chaffy, C. (1986). How to compute multivariate pade approximants. In B. W. Char (Ed.), SYMSAC 1986, Proceedings of the...
  • Chen, T. & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD...
  • Clegg, J., Walker, J. A. & Miller, J. F. (2007). A new crossover technique for cartesian genetic programming. In...
  • Cotta, C. et al. Applying memetic algorithms to the analysis of microarray data.
  • Cotta, C., Mendes, A., Garcia, V., França, P. M. & Moscato, P. (2003b). Applying memetic algorithms to the analysis of...
  • Cotta, C. & Moscato, P. (2002). Inferring phylogenetic trees using evolutionary algorithms. In J. J. Merelo Guervós,...
  • Cotta, C. & Moscato, P. (2003). A memetic-aided approach to hierarchical clustering from distance matrices: application...
  • Crandall, R. E. (1994). Projects in scientific computation.
  • Delgado, R. R., Ruíz, L. G. B., Cuéllar, M. P., Calvo-Flores, M. D. & del Carmen Pegalajar Jiménez, M. (2018). A...
  • Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.
  • Dick, G. (2014). Bloat and generalisation in symbolic regression. In G. Dick, W. N. Browne, P. A. Whigham, M. Zhang, L....
  • Dolan, E. D. et al. (2002). Benchmarking optimization software with performance profiles. Mathematical Programming.
  • Drucker, H. (1997). Improving regressors using boosting techniques. In Proceedings of the fourteenth international...
  • Duffy, J. et al. Using symbolic regression to infer strategies from experimental data.
  • Efron, B. et al. (2004). Least angle regression. Annals of Statistics.
  • Eskridge, B. E. et al. Memetic crossover for genetic programming: Evolution through imitation.
  • Euler, L. (1748). Introductio in analysin infinitorum. Chapter 18. Vol. 1. Reprinted as Opera...
  • Fajfar, I. et al. (2016). Evolving a Nelder-Mead algorithm for optimization with genetic programming. Evolutionary Computation.
  • Ffrancon, R. & Schoenauer, M. (2015). Memetic semantic genetic programming. In Proceedings of the genetic and...
  • Fitzsimmons, J. & Moscato, P. (2018). Symbolic regression modelling of drug responses. In First IEEE conference on...
  • Frade, M., de Vega, F. F. & Cotta, C. (2009). Breeding terrains with genetic terrain programming: The evolution of...
  • Friedman, J. H. (2000). Greedy function approximation: A gradient boosting machine. Annals of Statistics.
  • Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association.
  • Graham, M. et al. (2013). Machine-assisted discovery of relationships in astronomy. Monthly Notices of the Royal Astronomical Society.
  • Harris, M., Berretta, R., Inostroza-Ponta, M. & Moscato, P. (2015). A memetic algorithm for the quadratic assignment...
  • Iman, R. L. et al. (1980). Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods.
  • Inostroza-Ponta, M. et al. (2011). Qapgrid: A two level qap-based approach for large-scale data analysis and visualization. PLOS One.
  • Kingma, D. P. & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint...

This work was supported by The University of Newcastle, Caltech SURF, the Maitland Cancer Appeal and the Australian Government through the Australian Research Council’s Discovery Projects funding scheme (project DP200102364).

1 https://www.newcastle.edu.au/profile/pablo-moscato.

2 https://www.newcastle.edu.au/profile/mohammad-haque.
