Elsevier

Neurocomputing

Volume 275, 31 January 2018, Pages 1973-1980

Block building programming for symbolic regression

https://doi.org/10.1016/j.neucom.2017.10.047

Abstract

Symbolic regression, which aims to detect underlying data-driven models, has become increasingly important for industrial data analysis. For most existing algorithms, such as genetic programming (GP), the convergence speed may be too slow for large-scale problems with many variables, and the situation worsens as the problem size grows. This difficulty limits symbolic regression in practical applications. Fortunately, in many engineering problems, the independent variables in the target models are separable or partially separable. This feature inspires us to develop a new approach, block building programming (BBP). BBP divides the original target function into several blocks, and further into factors; the factors are then modeled by an optimization engine (e.g. GP). In this way, BBP can greatly reduce the search space. The separability is detected by a dedicated method, block and factor detection. Two different optimization engines are applied to test the performance of BBP on a set of symbolic regression problems. Numerical results show that BBP achieves good structure and coefficient optimization with high computational efficiency.

Introduction

Data-driven modeling of complex systems has become increasingly important for industrial data analysis when the experimental model structure is unknown or wrong, or the concerned system has changed [1], [2]. Symbolic regression aims to find a data-driven model that describes a given system based on observed input-response data, and plays an important role in many areas of engineering, such as signal processing [3], system identification [4], industrial data analysis [5], and industrial design [6]. Unlike conventional regression methods that require a mathematical model of a given form, symbolic regression is a data-driven approach that extracts an appropriate model from a space S of all possible expressions defined by a set of given binary operations (e.g. +, −, ×, ÷) and mathematical functions (e.g. sin, cos, exp, ln). It can be described as
$$f^* = \arg\min_{f \in S} \sum_i \left\| f\left(\mathbf{x}^{(i)}\right) - y_i \right\|,$$
where $\mathbf{x}^{(i)} \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$ are sampled data, f is a candidate model, and f* is the resulting data-driven model. Symbolic regression is an NP-hard problem, since the structure and the coefficients of a target model must be optimized simultaneously. How to solve a symbolic regression problem efficiently remains an open question in this research field [7], [8], [9].
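The objective above can be made concrete with a minimal, hypothetical sketch: given sampled input-response data, pick from a (tiny, hand-made) expression space S the candidate that minimizes the sum of residuals. Real symbolic regression searches a vastly larger space; the candidate set and data here are illustrative assumptions only.

```python
import numpy as np

def best_candidate(candidates, X, y):
    """Return the candidate f in S minimizing sum_i |f(x_i) - y_i|."""
    errors = [np.sum(np.abs(f(X) - y)) for f in candidates]
    return candidates[int(np.argmin(errors))]

# Hypothetical data: the true model is sin(x) + x**2.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=100)
y = np.sin(X) + X**2

# A toy stand-in for the expression space S.
S = [lambda x: np.sin(x),
     lambda x: np.sin(x) + x**2,
     lambda x: np.exp(x)]

f_star = best_candidate(S, X, y)  # recovers the true model from S
```

In an actual symbolic regression engine the candidates are generated and refined (e.g. by GP) rather than enumerated, but the fitness evaluation follows this pattern.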

Genetic programming (GP) [10] is a classical method for symbolic regression. The core idea of GP is to apply Darwin’s theory of natural evolution to the artificial world of computers and modeling. Theoretically, GP can obtain accurate results, provided that the computation time is long enough. However, describing a large-scale target model with many variables remains a challenging task, and the situation worsens with increasing problem size (a growing number of independent variables and wider ranges for these variables). A target model with many variables leads to a large search depth and high computational costs for GP, so its convergence speed may become too slow. This makes GP inconvenient for engineering applications.

Apart from basic GP, two groups of methods for symbolic regression have been studied. The first group focuses on evolutionary strategies, such as grammatical evolution [11] and parse-matrix evolution [12]. These variants of GP simplify the coding process. Gan et al. [13] introduced a clone selection programming method based on an artificial immune system. Karaboga et al. [14] proposed an artificial bee colony programming method based on the foraging behavior of honeybees. However, these methods are still based on simulating biological processes, which does little to improve the convergence speed on large-scale problems.

The second group exploits strategies that reduce the search space. McConaghy [15] presented the first non-evolutionary algorithm, fast function eXtraction (FFX), based on pathwise regularized learning, which confines its search space to a generalized linear space. However, this computational efficiency comes at the cost of generality in the solution. More recently, Worm [16] proposed a deterministic machine-learning algorithm, prioritized grammar enumeration (PGE). PGE merges isomorphic chromosome presentations (equations) into a canonical form, which the author argues greatly reduces the search space. However, how this simplification affects the solving process remains under debate [17], [18], [19].

In many scientific or engineering problems, the target models are separable. Luo et al. [20] presented a divide-and-conquer (D&C) method for GP. The authors showed that detecting the correlation between each variable and the target function can accelerate the solving process. D&C decomposes a separable model into a number of sub-models and then optimizes them. The separability is probed by a special method, the bi-correlation test (BiCT). However, the D&C method is only valid for additively or multiplicatively separable target models (see Definition 1 in Section 2). Many practical models fall outside this scope (Eqs. (6) and (7)), which limits the D&C method in further applications.

In this paper, a more general separable model that may involve mixed binary operators, namely plus (+), minus (−), times (×), and division (÷), is introduced. To obtain the structure of the generalized separable model, a new approach for symbolic regression, block building programming (BBP), is also proposed. BBP reveals the target separable model using a block and factor detection process, which divides the original model into a number of blocks, and further into factors; the binary operators between them are determined at the same time. The method can be considered a bi-level D&C method. The separability is detected by a generalized BiCT method. Numerical results show that BBP obtains the target functions more reliably and produces very large accelerations of GP for symbolic regression.
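To illustrate the idea of a generalized separable model, a small hypothetical example (our own, not from the paper) combines blocks with mixed binary operators, where one block itself splits into factors over disjoint variable subsets:

```python
import numpy as np

# Hypothetical generalized separable model in the BBP sense:
#   f(x1, x2, x3, x4) = sin(x1)*cos(x2) - 0.5*x3 + exp(x4)
# Block 1: sin(x1)*cos(x2), which factors into sin(x1) and cos(x2);
# Block 2: 0.5*x3; Block 3: exp(x4).
# Blocks are joined by the mixed operators "-" and "+".
def f(x1, x2, x3, x4):
    return np.sin(x1) * np.cos(x2) - 0.5 * x3 + np.exp(x4)
```

BBP would aim to recover each block (and each factor within a block) separately, so the optimization engine only ever searches for low-dimensional sub-expressions instead of the full four-variable function.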

The remainder of this paper is organized as follows. Section 2 is devoted to the more general separable model. The principle and procedure of the BBP approach are described in Section 3. Section 4 presents numerical results, discussion, and an efficiency analysis of the proposed method. In the last section, conclusions are drawn and future work is outlined.

Section snippets

Examples

As previously mentioned, in many applications, the target models are separable. Below, two real-world problems are given to illustrate separability.

Example 1

When developing a rocket engine, it is crucial to model the internal flow of a high-speed compressible gas through the nozzle. The closed-form expression for the mass flow through a choked nozzle [21] is
$$\dot{m} = \frac{p_0 A^*}{\sqrt{T_0}} \sqrt{\frac{\gamma}{R}\left(\frac{2}{\gamma+1}\right)^{(\gamma+1)/(\gamma-1)}},$$
where p0 and T0 represent the total pressure and total temperature, respectively. A* is the sonic throat area. R is
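The choked-nozzle mass-flow expression can be evaluated numerically; a minimal sketch with purely illustrative values (not a physical design point), assuming SI units with p0 in Pa, T0 in K, A* in m², and R in J/(kg·K):

```python
import math

def mass_flow(p0, T0, A_star, gamma, R):
    """Choked-nozzle mass flow: (p0*A*/sqrt(T0)) * sqrt((g/R)*(2/(g+1))^((g+1)/(g-1)))."""
    return (p0 * A_star / math.sqrt(T0)) * math.sqrt(
        (gamma / R) * (2.0 / (gamma + 1.0)) ** ((gamma + 1.0) / (gamma - 1.0))
    )

# Illustrative values: 1 MPa total pressure, 3000 K total temperature,
# 0.01 m^2 throat, gamma = 1.4, R = 287 J/(kg K).
m_dot = mass_flow(p0=1.0e6, T0=3000.0, A_star=0.01, gamma=1.4, R=287.0)
```

Note the separable structure: the expression is a product of functions of p0, A*, and T0 with a factor depending only on the gas properties γ and R, which is exactly the feature BBP exploits.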

Bi-correlation test

The bi-correlation test (BiCT) method proposed in [20] is used to detect whether a concerned target model is additively or multiplicatively separable. BiCT is based on random sampling and the linear correlation method.
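The core of the bi-correlation idea can be sketched as follows (a simplified reading, not the authors' exact procedure; the fixed values 0.5 and 1.5 and the sampling range are assumptions). Vary one variable while holding the other at two fixed values: if the function is additively or multiplicatively separable in the varied variable, the two sampled vectors are perfectly linearly correlated.

```python
import numpy as np

def bict_correlation(f, n=200, seed=0):
    """Sample f(x1, x2) along x1 at two fixed x2 values; return their correlation."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(0.1, 2.0, size=n)
    fa = f(x1, 0.5)  # x2 fixed at one value
    fb = f(x1, 1.5)  # x2 fixed at another value
    return np.corrcoef(fa, fb)[0, 1]

# Additively separable: the two samples differ by a constant -> correlation 1.
r_add = bict_correlation(lambda x1, x2: np.sin(x1) + x2**2)

# Non-separable: the correlation departs from 1.
r_mix = bict_correlation(lambda x1, x2: np.sin(x1 * x2))
```

A multiplicatively separable function gives proportional samples and hence the same perfect linear correlation, which is why the test detects both forms.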

Block and factor detection

The additively or multiplicatively separable target function can be easily detected by the BiCT. However, how to determine each binary operator ⊗i of Eq. (8) is a critical step in BBP. One way is to recognize each binary operator ⊗i sequentially with random sampling and linear
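One way to read the sequential operator recognition is the following hypothetical simplification (our own sketch under stated assumptions, not the paper's exact algorithm): hold the other block's variable at two values; a constant sampled difference indicates an additive operator, a constant sampled ratio a multiplicative one.

```python
import numpy as np

def detect_operator(f, n=200, seed=1, tol=1e-8):
    """Guess the binary operator joining two single-variable blocks of f(x1, x2)."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(0.1, 2.0, size=n)
    fa, fb = f(x1, 0.5), f(x1, 1.5)
    if np.ptp(fa - fb) < tol:   # constant difference -> plus/minus
        return "+"
    if np.ptp(fa / fb) < tol:   # constant ratio -> times/division
        return "*"
    return "unknown"

op_add = detect_operator(lambda x1, x2: np.sin(x1) + x2)  # "+"
op_mul = detect_operator(lambda x1, x2: np.sin(x1) * x2)  # "*"
```

In BBP this kind of test would be applied between every pair of adjacent blocks, so that the operator chain of Eq. (8) is recovered alongside the blocks themselves.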

Numerical results and discussion

The proposed BBP is implemented in Matlab/Octave. To test the performance of BBP, two different optimization engines, LDSE [24] and GPTIPS [23], are used; for ease of use, a Boolean flag selects between the two. Numerical experiments are conducted on 10 cases of completely separable or partially separable target functions, as given in Appendix B. These cases evaluate BBP’s overall capability of structure and coefficient optimization. Computational efficiency is

Conclusion

We established a more general separable model with mixed binary operators. In order to obtain the structure of the generalized model, a block building programming (BBP) method is proposed for symbolic regression. BBP reveals the target separable model by a block and factor detection process, which divides the original model into a number of blocks, and further into factors. The method can be considered as a bi-level divide-and-conquer (D&C) method. The separability is detected by a generalized

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 11532014).

Chen Chen is currently a master candidate in the Institute of Mechanics, Chinese Academy of Sciences, Beijing, China. He received his bachelor's degree in Aircraft Design Engineering from Northwestern Polytechnical University, Xi'an, China in 2015. His research interests include fast mathematical modelling methods and their applications in aerodynamic forces and heating prediction.

References (33)

  • A. Garg et al., An integrated SRM-multi-gene genetic programming approach for prediction of factor of safety of 3-D soil nailed slopes, Eng. Appl. Artif. Intel. (2014)
  • A.H. Alavi et al., A new approach for modeling of flow number of asphalt mixtures, Arch. Civil Mech. Eng. (2017)
  • H. Kaydani et al., Permeability estimation in heterogeneous oil reservoirs by multi-gene genetic programming algorithm, J. Pet. Sci. Eng. (2014)
  • H.M.R. Ugalde et al., Computational cost improvement of neural network models in black box nonlinear system identification, Neurocomputing (2015)
  • C. Chen, C. Luo, Z. Jiang, Elite bases regression: a real-time algorithm for symbolic regression, in: Proceedings of...
  • L.F. dal Piccol Sotto et al., Studying bloat control and maintenance of effective code in linear genetic programming for symbolic regression, Neurocomputing (2016)


    Changtong Luo is an associate professor in the Institute of Mechanics, Chinese Academy of Sciences, Beijing, China. He received his Ph.D. degree from Jilin University in 2007 and worked at Nagoya University, Japan, from 2007 to 2009 as a COE researcher. His research interests include computational fluid dynamics, evolutionary computation, global optimization, and numerical algebra, and their applications in aerodynamics.

    Zonglin Jiang is a professor in the Institute of Mechanics, Chinese Academy of Sciences, Beijing, China. He received his Ph.D. degree from Peking University in 1993 and was selected for the "One Hundred Person Project" of the Chinese Academy of Sciences in 1999. He was the director of the State Key Laboratory of High Temperature Gas Dynamics, Institute of Mechanics, from 2001 to 2015. He was granted the 2016 Ground Testing Award by the American Institute of Aeronautics and Astronautics for his leadership in the development and successful commissioning of the world's largest shock tunnel, JF12. His research interests include shockwave and detonation physics, and supersonic and hypersonic experiments.
