Elsevier

Expert Systems with Applications

Volume 109, 1 November 2018, Pages 25-34

A multilevel block building algorithm for fast modeling generalized separable systems

https://doi.org/10.1016/j.eswa.2018.05.021

Highlights

  • Defined a generalized separability (GS) to handle repeated variables.

  • Proposed a multilevel block building (MBB) algorithm to determine the target function of a GS system.

  • MBB decomposes the target model into a number of blocks, further into minimal blocks and factors.

  • MBB is a promising algorithm for modeling engineering systems with separability features.

Abstract

Symbolic regression is an important application area of genetic programming (GP), aimed at finding an optimal mathematical model that can describe and predict a given system based on observed input-response data. However, GP convergence speed towards the target model can be prohibitively slow for large-scale problems containing many variables. With the development of artificial intelligence, convergence speed has become a bottleneck for practical applications. In this paper, based on observations of real-world engineering equations, generalized separability is defined to handle repeated variables that appear more than once in the target model. To identify the structure of a function with a possible generalized separability feature, a multilevel block building (MBB) algorithm is proposed in which the target model is decomposed into several blocks and then into minimal blocks and factors. The minimal factors are relatively easy to determine for most conventional GP or other non-evolutionary algorithms. The efficiency of the proposed MBB has been tested by comparing it with Eureqa, a state-of-the-art symbolic regression tool. Test results indicate MBB is more effective and efficient; it can recover all investigated cases quickly and reliably. MBB is thus a promising algorithm for modeling engineering systems with separability features.

Introduction

Symbolic regression seeks to identify an optimal mathematical model that can describe and predict a given system based on observed input-response data. Unlike conventional regression methods that require a preset explicit expression of the target model, symbolic regression can extract an appropriate function (model) from a space of all possible expressions $S$ defined by a set of given binary operations (e.g., +, −, ×, ÷) and mathematical functions (e.g., sin, cos, exp, ln), which can be described as follows:

$$f^{*} = \arg\min_{f \in S} \sum_{i} \left\| f\left(x^{(i)}\right) - y_i \right\|,$$

where $x^{(i)} \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$ are sample data, $f$ is the target model, and $f^{*}$ is the regression model.
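The minimization over the expression space can be made concrete with a deliberately tiny sketch. It assumes a hand-written candidate set standing in for S, synthetic data generated from an assumed target, and an absolute-value error; the names `CANDIDATES`, `total_error`, and `best_model` are hypothetical and do not come from the paper. A real symbolic regression engine searches a vastly larger space rather than enumerating four expressions.

```python
import math

# Hypothetical stand-in for the expression space S: four fixed candidates.
CANDIDATES = {
    "x0 + x1": lambda x: x[0] + x[1],
    "x0 * x1": lambda x: x[0] * x[1],
    "sin(x0) + x1": lambda x: math.sin(x[0]) + x[1],
    "exp(x0) - x1": lambda x: math.exp(x[0]) - x[1],
}


def total_error(f, samples):
    """Sum of absolute residuals |f(x_i) - y_i| over all samples."""
    return sum(abs(f(x) - y) for x, y in samples)


def best_model(candidates, samples):
    """Return the name of the candidate with the smallest total error."""
    return min(candidates, key=lambda name: total_error(candidates[name], samples))


# Synthetic input-response data from an assumed target sin(x0) + x1.
samples = [((0.1 * i, 0.5 * i), math.sin(0.1 * i) + 0.5 * i) for i in range(1, 11)]

best = best_model(CANDIDATES, samples)
print(best)
```

Exhaustive enumeration like this stops being feasible almost immediately as the number of variables and operators grows, which is the convergence problem the introduction goes on to discuss.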

Symbolic regression has been widely applied in many engineering sectors, such as industrial data analysis (e.g., Li, Zhang, Bailey, Hoagg, Martin, 2017; Luo, Hu, Zhang, Jiang, 2015), circuit analysis and design (e.g., Ceperic, Bako, Baric, 2014; Shokouhifar, Jalali, 2015; Zarifi, Satvati, Baradaran-nia, 2015), signal processing (e.g., Volaric, Sucic, Stankovic, 2017; Yang, Wang, Soh, 2005), empirical modeling (e.g., Gusel, Brezocnik, 2011; Mehr, Nourani, 2017), and system identification (e.g., Guo, Li, 2012; Wong, Yip, Li, 2008). Genetic programming (GP) (Koza, 1992) is a classical method of symbolic regression. Theoretically, GP can obtain an optimal solution provided that the computation time is sufficiently long. However, the computational cost of GP for large-scale problems with many input variables is still quite high. This situation can be further exacerbated by increasing problem size (i.e., the number of involved independent variables) and complexity of the target function.

GP has been refined in several ways. Some variants focus on the coding plan. For example, grammatical evolution (GE) (O’Neill & Ryan, 2001) suggests using a variable-length binary string as the genotype of a target function, and parse-matrix evolution (PME) (Luo & Zhang, 2012) suggests using a parse-matrix with integer entries to retain more information from the parse tree. Some other variants have tested different evolutionary strategies, such as clone selection programming (Gan, Chow, & Chau, 2009) and artificial bee colony programming (Karaboga, Ozturk, Karaboga, & Gorkemli, 2012). GP variants can simplify the coding process and provide alternative evolutionary strategies; however, these methods do little to improve convergence speed when solving large-scale problems.

In recent decades, increasing attention has been paid to reducing the search space. For instance, McConaghy (2011) presented the first non-evolutionary algorithm, fast function eXtraction (FFX), which confines its search space to a generalized linear space. However, this computational efficiency is gained by sacrificing the generality of the solution. More recently, Worm (2016) proposed a deterministic machine learning algorithm, prioritized grammar enumeration (PGE), in his thesis. PGE merges isomorphic chromosome representations (equations) into a canonical form, yet a debate is ongoing regarding how simplification affects the solving process (Kinzett, Johnston, Zhang, 2009; Kinzett, Zhang, Johnston, 2008; McRee, Software, Park, 2010).

Separability, a favorable feature for symbolic regression, has also been exploited recently, based on the fact that the target model is separable in many scientific and engineering problems (Luo, Chen, & Jiang, 2017). A divide-and-conquer (D&C) method for GP has been presented to make use of the separability feature. The solving process is accelerated by dividing the target function into a number of sub-functions. Compared to conventional GP, the D&C method can reduce computational effort (complexity) by orders of magnitude. Chen, Luo, and Jiang (2018) recently proposed an improved version of D&C, block building programming (BBP), in which the target function is partitioned into blocks and factors so that the complexity of the sub-functions is reduced further.
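To see how separability can be detected from samples alone, the sketch below probes a two-argument black-box function with an elementary four-point identity. This is not the detection method used in the cited D&C and BBP papers; it is a simple check swapped in for illustration. The function name `separability_type`, the sampling interval [1, 3], and the tolerance are all assumptions. If f(a, b) = g(a) + h(b), then f(a, b) + f(a', b') = f(a, b') + f(a', b); if f(a, b) = g(a)·h(b), the same holds with products in place of sums.

```python
import math
import random


def separability_type(f, n_trials=50, tol=1e-9, seed=0):
    """Classify f(a, b) as 'additive', 'multiplicative', or 'non-separable'.

    The four-point identities are necessary conditions checked at random
    points, so a positive answer is evidence rather than proof.
    """
    rng = random.Random(seed)
    additive = True
    multiplicative = True
    for _ in range(n_trials):
        a, a2, b, b2 = (rng.uniform(1.0, 3.0) for _ in range(4))
        f11, f22 = f(a, b), f(a2, b2)
        f12, f21 = f(a, b2), f(a2, b)
        # Additive identity: f(a,b) + f(a',b') == f(a,b') + f(a',b)
        if abs((f11 + f22) - (f12 + f21)) > tol * (1.0 + abs(f11) + abs(f22)):
            additive = False
        # Multiplicative identity: f(a,b) * f(a',b') == f(a,b') * f(a',b)
        if abs(f11 * f22 - f12 * f21) > tol * (1.0 + abs(f11 * f22)):
            multiplicative = False
    if additive:
        return "additive"
    if multiplicative:
        return "multiplicative"
    return "non-separable"


print(separability_type(lambda a, b: math.sin(a) + math.log(b)))
print(separability_type(lambda a, b: math.exp(a) * b * b))
print(separability_type(lambda a, b: math.sin(a * b)))
```

Once a split is detected, each sub-function can be regressed on its own smaller variable set, which is where the reduction in search effort comes from.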

However, the separability defined in Luo et al. (2017) and Chen et al. (2018) is limited in that it does not allow for recurrence of the same variable in different sub-functions; it would otherwise be considered non-separable. As a result, the sub-function size could still be large in many practical applications, which will be demonstrated in the following sections. This drawback motivates us to broaden the prospective applications of D&C and BBP in this work.

First, a generalized separability is defined to allow for recurrence of the same variable in different sub-functions. More specifically, the variables involved are classified into two types: repeated variables and non-repeated variables. The structure of the target function and the type of each variable (repeated or non-repeated) are identified by a newly proposed algorithm, multilevel block building (MBB), in which the blocks can be further decomposed into a higher level of blocks and factors until they are confirmed to be minimal blocks and factors. The sub-functions (i.e., minimal factors) may therefore be smaller and more easily identified. The minimal blocks and factors are then assembled to form the target function. The block building process is similar to that of BBP.
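As a small illustration of the distinction between repeated and non-repeated variables, consider a hypothetical three-variable function (not taken from the paper) in which $x_3$ occurs in both summands. The labels assume the BBP convention that blocks are the additively separable terms and factors are the multiplicatively separable parts of a block; the paper's own formal definitions take precedence over this sketch.

```latex
% Hypothetical example, chosen for illustration only.
f(x_1, x_2, x_3) =
  \underbrace{x_3 \sin x_1}_{\text{block 1}}
  + \underbrace{e^{x_3} \ln x_2}_{\text{block 2}}
% Repeated variable:      x_3  (appears in both blocks)
% Non-repeated variables: x_1, x_2  (each appears exactly once)
% Factors of block 1:     x_3  and  \sin x_1
% Factors of block 2:     e^{x_3}  and  \ln x_2
```

Under the earlier, stricter notion of separability, the variable sets of the two summands, $\{x_1, x_3\}$ and $\{x_2, x_3\}$, overlap, so the function would be treated as non-separable. Once $x_3$ is classified as a repeated variable, every remaining factor depends on a single variable.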

In short, the new algorithm is an improved version of BBP (Chen et al., 2018) with more general application potential. The efficiency of the proposed MBB has been compared with the results of Eureqa, a state-of-the-art symbolic regression tool. Numerical results show that the proposed algorithm is more effective and can recover all investigated cases quickly and reliably.

The rest of this paper is organized as follows. Section 2 analyzes different types of separability in practical engineering. Section 3 is devoted to establishing the mathematical model of the GS system. In Sections 4 and 5, we propose an MBB algorithm and illustrate it using a case study. Section 6 presents numerical results and discussions for the proposed algorithm. The paper concludes with Section 8, which provides remarks on future work.

Section snippets

Observation of separability types

Recall that the separability introduced in Luo et al. (2017) can be described as follows.

Definition 2.1 Separability

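A scalar function $f(X)$ with $n$ continuous variables $X = \{x_i : i = 1, 2, \ldots, n\}$ ($f: \mathbb{R}^n \to \mathbb{R}$, $X \subset \Omega \subset \mathbb{R}^n$, where $\Omega$ is a closed bounded convex set such that $\Omega = [a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_n, b_n]$) is said to be separable if and only if it can be written as

$$f(X) = c_0 \otimes_1 c_1 \varphi_1(X_1) \otimes_2 c_2 \varphi_2(X_2) \otimes_3 \cdots \otimes_m c_m \varphi_m(X_m),$$

where the variable set $X_i$ is a proper subset of $X$, such that $X_i \subset X$ with $\bigcup_{i=1}^{m} X_i = X$, $\bigcap_{i=1}^{m} X_i = \emptyset$, and the cardinal number of $X_i$ is denoted by card(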

Generalization of separability

As can be seen from Definition 2.1, each variable appears only once in the model function. However, as mentioned above, some variables might appear twice or more in practical applications. In that case, the standard D&C and BBP methods lose the basis on which they work and cannot be used to model such systems. In this section, to let the symbolic regression algorithm take more advantage of separability, variables are distinguished as repeated variables and non-repeated variables, and a more general

Multilevel block building

The function structure of a given system with standard separability is detected by BiCT (Luo et al., 2017), a statistical method in which the target function can be divided into a number of additively or multiplicatively separable sub-functions. However, due to the presence of repeated variables, the GS function f(X) is no longer separable in terms of standard BiCT; that is, the standard BiCT method cannot be used directly. It is necessary to carry out a deeper probe to determine the function

Case study

In this section, a toy example (Eq. (12)) will be used to illustrate the implementation of the proposed MBB algorithm. The target function involves six independent variables, two of which (x5 and x6) are repeated variables. f(x)=sin3x12(x5*x6)cosx2+ex6lnx3+x5x4.

Numerical results

In our implementation, LDSE (Luo & Yu, 2012) is chosen as the optimization engine. LDSE is a hybrid evolutionary algorithm for continuous global optimization. The efficiency of LDSE-powered MBB is tested by comparing the method with a state-of-the-art symbolic regression tool, Eureqa (Schmidt & Lipson, 2009), a proprietary A.I.-powered modeling engine based on GP, developed by Dr. Hod Lipson from the Computational Synthesis Lab at Cornell University. The efficiency is evaluated by the structure

Discussion

So far, the proposed method has been described using functions with explicit expressions. In fact, MBB only works if we have full control over the underlying system and are free to take samples, such as when attempting to identify a simple function to approximate an expensive computational fluid dynamics (CFD) simulation or to identify a more concise equivalent formula for a given symbolic expression (known as exact simplification and transformation; see Stoutemyer, 2012). This

Conclusion

Based on the observations of different separability types in practical engineering formulas, a more general concept of separability is defined to handle repeated variables that appear more than once in the target model. To identify the structure of a function with a possible GS feature, an MBB algorithm is proposed in which variables are distinguished as repeated variables and non-repeated variables and the target model is decomposed into a higher level of blocks and factors until they are

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 11532014). The authors would like to thank the anonymous reviewers for their valuable comments and suggestions on the earlier versions of this manuscript.

References (32)

  • I. Volaric et al. A data driven compressive sensing approach for time-frequency signal enhancement. Signal Processing (2017)
  • K.Y. Wong et al. Automatic identification of weather systems from numerical weather prediction data using genetic algorithm. Expert Systems with Applications (2008)
  • M.H. Zarifi et al. Analysis of evolutionary techniques for the automated implementation of digital circuits. Expert Systems with Applications (2015)
  • J.D. Anderson. Hypersonic and high-temperature gas dynamics (2006)
  • J.D. Anderson. Fundamentals of aerodynamics (2011)
  • J. Blazek. Computational fluid dynamics: Principles and applications (2015)