Analytic Continued Fractions for Regression: A Memetic Algorithm Approach☆
Introduction
Symbolic regression is a type of multivariate regression analysis in which the goal is to find a mathematical expression that fits a dataset, i.e. a set of input-output pairs sampled from an unknown multivariate target function. It has been argued that, when analysing experimental data for decision making, symbolic regression methods should at least be used to complement standard multivariate analysis (Duffy and Engle-Warnick, 2002). Compared with the output of artificial neural network approaches, the models generated by symbolic regression, when relatively small, are arguably more amenable to downstream studies via uncertainty propagation and sensitivity analysis, and thus more “explainable” (Otte, 2013, Sun et al., 2019). Another benefit of symbolic regression is that it makes no assumptions about, and requires no prior knowledge of, the underlying process or mechanism which produced the observed data (Moscato and de Vries, 2019). This allows researchers to explore problem domains for which they have incomplete knowledge and to identify underlying trends and patterns without introducing human bias.
Over the past three decades, symbolic regression has produced an impressive number of results in many applications. For instance, symbolic regression has helped to extract physical laws using experimental data of chaotic dynamical systems without any knowledge of Newtonian mechanics (Schmidt and Lipson, 2009), which then motivated the data-driven discovery of hidden relationships in astronomy (Graham et al., 2013). More recent applications include prediction of friction systems performance (Kronberger et al., 2018), identification of nonlinear relationships in fMRI data (Märtens et al., 2017), radiotherapy dose reconstruction of childhood cancer survivors (Virgolin et al., 2018), our own work, also in oncology, on uncovering mechanisms of drug response in cancer cell lines using genomic and experimental data (Fitzsimmons and Moscato, 2018), prediction of wind farm output from weather data (Vladislavleva et al., 2013), energy consumption forecasting (Delgado et al., 2018), computer game scene generation (Frade et al., 2009), and Boolean classification (Muruzábal et al., 2000). Symbolic regression has also played a role in the elicitation of functional constructs from surveys (de Vries et al., 2014) and in the analysis of consumer and business data (Moscato and de Vries, 2019).
One common approach to implementing symbolic regression is Evolutionary Computation (EC). EC is a family of optimization algorithms inspired by biological evolution, in particular building upon Darwin’s theory of natural selection. In EC, a population of candidate solutions to a problem (generally posed as an optimization problem) is subjected to a set of heuristics and exact algorithms to produce new solutions, while less desirable solutions are removed from the population currently under consideration.
EC approaches to symbolic regression are commonly based on Genetic Programming (GP) with a tree-based representation. Each solution (i.e. a mathematical expression) is written as a syntax tree, and new solutions are produced by exchanging subtrees of two solutions (crossover) or by modifying a syntax element, such as a binary operator (mutation) (Koza, 1990). Karaboga et al. (2012) proposed Artificial Bee Colony Programming, which also uses a tree-based representation for symbolic regression and showed competitive performance against GP-based methods. Although sub-tree crossover is highly popular, several researchers have noted that recombination methods based on it have been shown to be no better than simple mutation of the sub-branches (Clegg et al., 2007). Clegg et al. (2007) cite previous contributions to this issue by Angeline (1997) and Luke and Spector (1997), in which they stated that “due to findings like these, some people now implement their GP’s without using crossover at all, i.e. using mutation only.”
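The tree-based operators described above can be illustrated with a minimal sketch. It is not taken from any of the cited GP systems: the nested-tuple encoding, the operator set, and the helper names (`paths`, `get`, `replace`, `crossover`, `mutate`) are all assumptions chosen for brevity.

```python
import random

# Assumed encoding: an expression is a nested tuple, either
# (op, left, right) for a binary operator, or a leaf ("x",) / ("const", value).
BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def evaluate(tree, x):
    """Evaluate a syntax tree at the scalar input x."""
    if tree[0] == "x":
        return x
    if tree[0] == "const":
        return tree[1]
    return BINARY_OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))

def paths(tree, prefix=()):
    """Return the path (tuple of child indices) to every node of the tree."""
    found = [prefix]
    if tree[0] in BINARY_OPS:
        found += paths(tree[1], prefix + (1,))
        found += paths(tree[2], prefix + (2,))
    return found

def get(tree, path):
    """Return the subtree located at the given path."""
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    children = list(tree)
    children[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(children)

def crossover(a, b, rng):
    """Exchange one randomly chosen subtree of a with one of b."""
    pa, pb = rng.choice(paths(a)), rng.choice(paths(b))
    return replace(a, pa, get(b, pb)), replace(b, pb, get(a, pa))

def mutate(tree, rng):
    """Change the binary operator at one randomly chosen internal node."""
    internal = [p for p in paths(tree) if get(tree, p)[0] in BINARY_OPS]
    if not internal:
        return tree  # a single leaf has no operator to change
    p = rng.choice(internal)
    node = get(tree, p)
    new_op = rng.choice([op for op in BINARY_OPS if op != node[0]])
    return replace(tree, p, (new_op,) + node[1:])

rng = random.Random(0)
parent_a = ("+", ("x",), ("const", 1.0))   # x + 1
parent_b = ("*", ("x",), ("x",))           # x * x
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutate(parent_a, rng)
```

Because crossover only swaps subtrees, the total number of nodes across the two children equals that of the two parents; nothing in this sketch optimizes numerical constants, which is the gap that local search is meant to fill.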
We believe that this difficulty in symbolic regression could be addressed by using generic problem-domain information about function approximation to search for better models. Like GP, Memetic Algorithms (MAs) are a generic denomination for a population-based approach to solving optimization problems. However, MAs take explicit advantage of heuristic and exact methods, in which solutions are individually optimized and also recombined and changed to improve the diversity of the population (Neri et al., 2012). Since its beginnings at the California Institute of Technology three decades ago (Moscato, 1989, Moscato, 2012), research in MAs has demonstrated that problem-domain information can be used to produce local search (LS) methods that can significantly accelerate the evolutionary process. Trujillo et al. (2017) recognize this fact and point out that, in contrast, local search has been underused in Genetic Programming. In their view, some of the problems faced by Genetic Programming are linked to the use of a tree-based representation of solutions. They conclude “that numerical LS and memetic search is seldom integrated in most GP systems” and that “The fact that memetic approaches have not been fully explored in GP literature, opens up several areas (for) future lines of inquiry”. We agree with this statement. In fact, of the 3918 publications about MAs that we found in the bibliographic database Web of Science (on 20/11/2019), we identified very few regarding the use of local search for symbolic regression. However, it is also true that some researchers have been trying to address the need to include individual optimization in existing Genetic Programming approaches, e.g. Cagnoni et al., 2005, Azad and Ryan, 2014, Ffrancon and Schoenauer, 2015, Semenkina and Semenkin, 2015, Kommenda et al., 2020.
While this list is probably not comprehensive, it is recognized that introducing individual optimization steps into EA methods based on current representations for solutions has been a challenge for symbolic regression approaches.
In this paper, we introduce a new approach to regression with a memetic algorithm and we analyze its performance against other existing implementations of symbolic regression and machine learning approaches. In particular, our contributions are as follows:
- We introduce a novel method to represent mathematical expressions with analytic continued fractions, drawing inspiration from Padé approximants. We discuss the advantages of this particular representation over the more traditional syntax-tree-based representation.
- We implement an MA for symbolic regression with the continued fraction representation, a hierarchical population structure to manage the quality of the population of solutions, and an individual search method based on the Nelder-Mead algorithm.
- We compare our MA-based approach with 15 other state-of-the-art implementations of symbolic regression on 94 benchmark datasets. We demonstrate that our algorithm is able to extrapolate well-fitting relationships and that its performance is comparable to other methods.
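To make the continued fraction representation concrete, the following is a minimal sketch of how such a model could be evaluated. It assumes a truncated form f(x) = g0(x) + h0(x) / (g1(x) + h1(x) / (... + gd(x))) in which every g and h is an affine function of the input vector; this form, the `(coeffs, bias)` encoding, and the function names are illustrative assumptions rather than the paper's stated implementation.

```python
def linear(coeffs, bias, x):
    """Affine function of the input vector x: coeffs . x + bias."""
    return sum(c * xi for c, xi in zip(coeffs, x)) + bias

def eval_cfrac(g_terms, h_terms, x):
    """Evaluate g0(x) + h0(x) / (g1(x) + h1(x) / (... + gd(x))).

    g_terms holds d + 1 entries and h_terms holds d entries; each entry is
    an assumed (coeffs, bias) pair describing an affine function of x.
    """
    assert len(g_terms) == len(h_terms) + 1
    # Start at the innermost denominator and work outwards.
    value = linear(*g_terms[-1], x)
    for g, h in zip(reversed(g_terms[:-1]), reversed(h_terms)):
        # Raises ZeroDivisionError if a denominator vanishes at x.
        value = linear(*g, x) + linear(*h, x) / value
    return value

# Depth-1 example in one variable: f(x) = 1 + x / (1 + x)
g_terms = [([0.0], 1.0), ([1.0], 1.0)]
h_terms = [([1.0], 0.0)]
y = eval_cfrac(g_terms, h_terms, [1.0])
```

Under this encoding a model is a flat list of real coefficients, so a derivative-free local search such as Nelder-Mead can adjust all of them at once by minimizing a fitting error over the training data.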
Following this introduction, the remainder of this paper is organized as follows. The datasets and methods used are described in Section 2; in particular, the memetic algorithm is described in Section 2.2. We then present, in Section 2.3, an illustrative one-dimensional example of using symbolic regression to approximate an important special function in mathematics, the Gamma function. The computational results are presented in Section 3, followed by their discussion in Section 4. Finally, Section 5 contains concluding remarks and discusses the possibility of future work in the area.
Data and methods
In this section, we describe the proposed method and the datasets used to estimate its performance. We describe in detail the representation used by the proposed symbolic regression method, followed by the memetic algorithm for model identification and an illustrative example of the proposed method approximating the Gamma function. We then describe the experimental procedures and datasets used to measure the performance of the proposed method.
Results on Penn machine learning benchmarks datasets
We measured the MSE and NMSE scores obtained on the training set for each of the datasets. The model found during the training phase was also evaluated on the testing data. We tabulated the median MSE and NMSE scores of the algorithm in both training (t) and testing (T) over the 100 independent runs. Table 2 summarizes the scores for all benchmark datasets. We have sorted the results from the smallest to the largest median NMSE score achieved in testing.
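As a sketch of how such scores could be computed, the snippet below defines the MSE and a normalized variant, then takes the median over runs. The normalization used here (MSE divided by the population variance of the target values) is an assumption for illustration; this excerpt does not give the exact NMSE definition used in the experiments, and the per-run scores shown are made-up placeholders.

```python
from statistics import mean, median, pvariance

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def nmse(y_true, y_pred):
    """MSE divided by the variance of the targets (assumed normalization)."""
    return mse(y_true, y_pred) / pvariance(y_true)

# Hypothetical testing scores from three independent runs on one dataset.
run_scores = [0.2, 0.5, 0.1]
median_score = median(run_scores)
```

With this normalization, a model that always predicts the mean of the targets scores an NMSE of 1, which makes scores comparable across datasets whose targets have different scales.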
Performance comparison with state-of-the-art algorithms
We compared our
Discussion
After analyzing the median ranking of state-of-the-art regression algorithms (both GP-based and ML-based), we found that our proposed approach achieves comparable performance when compared with all state-of-the-art approaches. The results show that the use of continued fractions is a promising new idea due to its representational power.
The statistical test on the results (in Fig. 6) showed that there are no statistically significant differences for the performance of CFR with the eplex-1m, xgboost and
Conclusion
In this paper, we present a comprehensive experimental comparison of an alternative method for multivariate regression problems that uses analytic continued fractions as a representation of the mathematical models. This has led to a challenging type of nonlinear optimization problem and we have presented a memetic algorithm for this problem using a hierarchical structured population. A variant of the classical Nelder-Mead-based search was proposed as the individual optimization. We compared the
CRediT authorship contribution statement
Pablo Moscato: Conceptualization, Methodology, Formal analysis, Investigation, Writing - original draft, Writing - review & editing, Supervision, Project administration, Funding acquisition. Haoyuan Sun: Methodology, Software, Validation, Formal analysis, Investigation, Writing - review & editing. Mohammad Nazmul Haque: Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft, Writing - review & editing, Visualization.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
We thank Dr Markus Wagner, School of Computer Science at The University of Adelaide, Australia for his thoughtful comments that helped us to improve an earlier version of the manuscript. We also thank the members of Prof. Jason H. Moore’s research lab at the University of Pennsylvania, USA, for making both the source code of their experiments and the Penn Machine Learning Benchmarks datasets available. M.N.H. and P.M. thank Renata Sarmet from the Universidade Federal de São Carlos, Sao Paulo,
References (88)
- et al. Newton-Padé approximations for multivariate functions. Applied Mathematics and Computation (2018)
- et al. A memetic algorithm for a multistage capacitated lot-sizing problem. International Journal of Production Economics (2004)
- et al. Artificial bee colony programming for symbolic regression. Information Sciences (2012)
- Padé approximation and continued fractions. Applied Numerical Mathematics (2010)
- et al. Benchmarking a memetic algorithm for ordering microarray data. Biosystems (2007)
- et al. Predicting the energy output of wind farms based on weather data: Important variables and their correlation. Renewable Energy (2013)
- Subtree crossover: Building block engine or macromutation. Genetic Programming (1997)
- Arnaldo, I., Krawiec, K. & O’Reilly, U. -M. (2014). Multiple regression genetic programming. In Proceedings of the 2014...
- et al. A simple approach to lifetime learning in genetic programming-based symbolic regression. Evolutionary Computation (2014)
- Backeljauw, F. & Cuyt, A. A. M. (2009). Algorithm 895: A continued fractions package for special functions. ACM...
- Enhancing the performance of memetic algorithms by using a matching-based recombination algorithm
- Random forests. Machine Learning
- A new memetic algorithm for the asymmetric traveling salesman problem. Journal of Heuristics
- Applying memetic algorithms to the analysis of microarray data
- Projects in scientific computation
- Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research
- Benchmarking optimization software with performance profiles. Mathematical Programming
- Using symbolic regression to infer strategies from experimental data
- Least angle regression. Annals of Statistics
- Memetic crossover for genetic programming: Evolution through imitation
- Evolving a Nelder-Mead algorithm for optimization with genetic programming. Evolutionary Computation
- Greedy function approximation: A gradient boosting machine. Annals of Statistics
- The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association
- Machine-assisted discovery of relationships in astronomy. Monthly Notices of the Royal Astronomical Society
- Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods
- QAPgrid: A two level QAP-based approach for large-scale data analysis and visualization. PLOS One
Cited by (11)
Multiple regression techniques for modelling dates of first performances of Shakespeare-era plays
2022, Expert Systems with Applications
Citation excerpt: In 2019, a regression approach based on ‘Continued Fraction’ (CFR) was proposed; it views multivariate regression as a non-linear optimisation problem and the authors used a memetic algorithm to find approximations to the unknown target functions from experimental data (Sun & Moscato, 2019). Memetic algorithms are a population-based approach to solve computational problems that are posed as optimisation tasks and have been heavily used for other data analytics in combinatorial optimisation problems (Gabardo et al., 2020; Haque & Moscato, 2019; Zaher et al., 2019) and that are also showing impressive results for non-linear regression problems (Moscato, Haque et al., 2020; Moscato, Sun et al., 2020; Moscato et al., 2021; Sun & Moscato, 2019) and other machine learning problems (Moscato & Mathieson, 2019). Memetic Algorithms (MAs) is a type of population-based approach used for solving complex problems which are generally posed as an optimisation task with one or multiple objectives and constraints.
DoME: A deterministic technique for equation development and Symbolic Regression
2022, Expert Systems with Applications
Citation excerpt: This search space is explored through a grammar and a hashing scheme and the result is a deterministic algorithm for symbolic regression. Finally, one of the most important and recent approaches is called CFR (Continued Fraction Regression) (Moscato et al., 2021). This approach uses a memetic algorithm that uses continued fractions as a representation.
Mathematical Modelling of Peak and Residual Shear Strength of Rough Rock Discontinuities Using Continued Fractions
2024, Rock Mechanics and Rock Engineering
Approximating the Nuclear Binding Energy Using Analytic Continued Fractions
2023, Research Square
New alternatives to the Lennard-Jones potential
2023, Research Square
Continued fractions and the Thomson problem
2023, Scientific Reports
- ☆ This work was supported by The University of Newcastle, Caltech SURF, the Maitland Cancer Appeal and the Australian Government through the Australian Research Council’s Discovery Projects funding scheme (project DP200102364).
- 1 https://www.newcastle.edu.au/profile/pablo-moscato.
- 2 https://www.newcastle.edu.au/profile/mohammad-haque.