Elsevier

Information Sciences

Volume 561, June 2021, Pages 181-195

MSGP-LASSO: An improved multi-stage genetic programming model for streamflow prediction

https://doi.org/10.1016/j.ins.2021.02.011

Abstract

This paper presents the development and verification of a new multi-stage genetic programming (MSGP) technique, called MSGP-LASSO, which was applied to univariate streamflow forecasting in the Sedre River, an intermittent river in Turkey. MSGP-LASSO is a practical and cost-neutral improvement over classic genetic programming (GP) that increases modelling accuracy while decreasing model complexity by coupling the MSGP and multiple regression LASSO methods. The new model uses average mutual information to identify the optimum lags, and root mean-square technique to minimize forecasting error. Based on Nash-Sutcliffe efficiency and the bias-corrected Akaike information criterion, MSGP-LASSO is superior to GP, multigene GP, MSGP, and hybrid MSGP-least-square models. It is explicit and promising for real-life applications.

Introduction

Over the past few decades, extensive research has been conducted on the use of soft computing (SC) techniques to improve the understanding of complicated systems. Generally, SC is defined as the use of inexact solutions for computationally hard tasks. Fuzzy logic, artificial neural networks (ANN), evolutionary computing, and decision trees are some of the well-known SC techniques that have been used broadly to solve a wide range of problems. Engineers have mainly used SC techniques to design efficient control systems [33], [35] or to elucidate mathematical/logical relationships between empirically observed variables, an approach also known as system identification or black-box modelling [11], [15]. Once a model is elucidated and verified, it can be used to predict future values of the state variables of the system.

External disturbances, inconsistent observations, discrete objective functions, and multiple-source uncertainties, such as outliers and missing data, are common issues in practical applications that render classical SC techniques inefficient [33]. Therefore, non-classical SC methods are needed to achieve the best model for the desired system. As alternatives, metaheuristic optimization algorithms have been applied to tune the parameters of classical SC models [33], [36], robust filtering techniques have been coupled with SC methods [27], and unsupervised SC techniques may be preferred when labeling the observed data is difficult [35].

State-of-the-art genetic programming (GP) is one of the most popular evolutionary computing techniques that uses a Darwinian algorithm to solve problems. In any GP variant, a population of random programs (potential solutions) is created at first, and the genetic material of every solution is repeatedly improved via evolutionary operations until a desired state is reached. GP has received a great deal of attention over the past few decades and has been applied in many research areas [2], [10], [12], [24], [32]. Existing studies show that this method can be used satisfactorily to solve classification, pattern recognition, and time series modelling problems [14], [16], [21]. An inclusive review paper by Danandeh Mehr et al. [10] demonstrated how GP is beneficial for solving a variety of problems in water resources engineering, including prediction of hydro-meteorological variables, design of hydraulic structures, and recognition of hidden patterns in hydrological phenomena, such as rainfall-runoff, interaction between surface water and groundwater, and streamflow time series. The authors highlighted the great popularity of GP among hydrologists as a grey box model that allows the modeler to incorporate human knowledge into the SC algorithms.

Modelling the streamflow process is an important task for the planning, management, and operation of water resources systems. For example, precise forecasts are required for flood damage mitigation, food production, navigation management, and environmental protection. So far, many efforts have been made to identify and model the streamflow process. It is known to be a complex phenomenon and is typically modelled using either conceptual methods or SC techniques [7], [24], [36], [38]. The problem becomes more complicated when the flow pattern in a river is intermittent, for example, when the river or stream experiences occasional drought spells [6]. This type of event takes place frequently in intermittent rivers, particularly in the mountainous tributaries of snow-fed streams. Owing to low-density gauging networks in mountainous regions, it may not be possible to model the streamflow process conceptually [8]. In such cases, if a reliable set of streamflow records is available at a stream gauge, SC techniques are an appropriate alternative for modelling and forecasting the streamflow time series. Such an approach is typically called univariate modelling, in which only the antecedent streamflow records are employed to form a predictive model [20], [39].
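To make the idea of univariate modelling concrete, the following minimal sketch shows one way antecedent records could be arranged into input/target pairs for one-step-ahead forecasting. The function name `make_lagged_dataset`, the lag set, and the toy flow values are illustrative assumptions, not taken from the paper.

```python
def make_lagged_dataset(series, lags):
    """Build (inputs, targets) for one-step-ahead univariate forecasting.

    Each input row holds the values observed `lag` steps before time t,
    for every lag in `lags`; the matching target is the value at time t.
    """
    if not lags or min(lags) < 1:
        raise ValueError("lags must be a non-empty list of positive integers")
    max_lag = max(lags)
    inputs, targets = [], []
    # The first usable time step is the one with a full history available.
    for t in range(max_lag, len(series)):
        inputs.append([series[t - lag] for lag in lags])
        targets.append(series[t])
    return inputs, targets


# Toy intermittent series (hypothetical values, including zero-flow spells).
flows = [5.0, 3.0, 0.0, 0.0, 2.0, 7.0]
X, y = make_lagged_dataset(flows, lags=[1, 2])
```

Each row of `X` contains only past values of the same series, which is what distinguishes a univariate model from one driven by rainfall or other external predictors.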

Table 1 lists some of the GP variants that have been used for univariate streamflow prediction in recent years. The studies focused on both short- (up to daily) and long-lead time forecasting. A study presented by Danandeh Mehr et al. [10] revealed that the selection of an appropriate GP engine, identifying the effective inputs (here lags), and being cautious with respect to the common overfitting problem of the models are the most important issues in univariate streamflow modeling. Among the above-mentioned issues of practical applications, the outliers (i.e., rare streamflow observations) were addressed as peak floods which need to be considered during the modeling process. Any model capable of capturing such peaks is more favorable. By contrast, missing data and inconsistent measurements that commonly arise from instrumental/human errors must be addressed before simulation. To cope with the influence of external disturbances and various uncertainties, robust filtering techniques, such as wavelet transform, have been applied. When the performance of a standalone GP variant is not desirable, ensemble or hybrid models provide great improvement in terms of modeling accuracy. However, this imposes a huge computational burden on the model.

This short review also demonstrates the strong interest of hydrologists in applying GP-based models such as gene expression programming (GEP), linear GP (LGP), and multigene GP (MGGP) to streamflow forecasting. However, the skill of multi-stage GP (MSGP) in streamflow modeling and forecasting has not been explored. The main goal of this study is to apply MSGP to streamflow forecasting and assess its performance. The proposed technique was applied to a case study in Antalya, Turkey, and the results were compared with those of conventional methods such as GP and the state-of-the-art MGGP. Furthermore, a new hybrid MSGP algorithm that incorporates the least absolute shrinkage and selection operator (LASSO) is developed to increase streamflow forecasting accuracy without increasing the number of predictors.
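As background on the LASSO component, the sketch below fits a plain LASSO regression by coordinate descent, minimizing (1/(2n)) Σ(y − Xβ)² + λ Σ|β_j|. It is a generic illustration of how the L1 penalty shrinks coefficients and sets some of them exactly to zero; how the paper couples LASSO with MSGP is not shown in this excerpt, so the function names, the absence of an intercept, and the toy data are all assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * sum((y - X b)^2) + lam * sum(|b_j|), no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of column j with the partial residual
            z = 0.0    # mean squared value of column j
            for i in range(n):
                pred_without_j = sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta


# Hypothetical data: the target depends on the first column only (y = 2 * x1).
X = [[1.0, 0.5], [2.0, -1.0], [3.0, 0.3], [4.0, -0.2], [5.0, 1.0]]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

beta_sparse = lasso_coordinate_descent(X, y, lam=0.5)    # second weight is zeroed
beta_null = lasso_coordinate_descent(X, y, lam=100.0)    # penalty dominates
beta_ols = lasso_coordinate_descent(X, y, lam=0.0)       # no penalty
```

With a moderate penalty the irrelevant second column receives a weight of exactly zero while the first weight is only slightly shrunk, which is the selection behaviour that makes LASSO attractive when the number of predictors should not grow.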

Section snippets

Conventional GP technique

GP [19] is an emerging SC technique in which computer programs are generated to find solutions for problems. As previously mentioned, the GP algorithm is based on the principle of “survival of the fittest”. The GP programs, also known as genomes, are classically formed by tree elements, namely nodes and branches. Fig. 1 illustrates a genome including a root node (addition), the inner nodes of multiplication and sin functions, and the terminal nodes of X1, X2, and a random number 2.0. Each node in a GP tree can
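The tree representation described above can be sketched in a few lines. Fig. 1 is not reproduced in this excerpt, so the arrangement used here, add(mul(X1, 2.0), sin(X2)), is only one hypothetical genome built from the same node types; the nested-tuple encoding and the names `FUNCTIONS` and `evaluate` are likewise assumptions.

```python
import math

# Function set: name -> (arity, implementation).
FUNCTIONS = {
    "add": (2, lambda a, b: a + b),
    "mul": (2, lambda a, b: a * b),
    "sin": (1, math.sin),
}


def evaluate(node, variables):
    """Recursively evaluate a genome encoded as nested tuples.

    A tuple is a function node (name followed by child subtrees),
    a string is a variable terminal, and a number is a constant terminal.
    """
    if isinstance(node, tuple):
        name, *children = node
        arity, func = FUNCTIONS[name]
        if len(children) != arity:
            raise ValueError(f"{name} expects {arity} argument(s)")
        return func(*(evaluate(child, variables) for child in children))
    if isinstance(node, str):
        return variables[node]
    return float(node)


# Hypothetical genome: root 'add', inner nodes 'mul' and 'sin',
# terminals X1, X2, and the constant 2.0.
genome = ("add", ("mul", "X1", 2.0), ("sin", "X2"))
value = evaluate(genome, {"X1": 3.0, "X2": 0.0})
```

Because every subtree is itself a valid expression, evolutionary operators can swap or replace subtrees without breaking the program.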

Study area and data

The Antalya Basin covers approximately 19,500 km², which is about 2.5% of the territory of Turkey. The basin is surrounded by the Sultan Mountains in the north, Alanya District and the Taurus Mountains in the east, and the Beydağları and Katrancık Mountains in the west, and is bounded by the Gulf of Antalya in the south. There are 11 main rivers (from west to east: the Boğaçay, Düden, Aksu, Köprüçay, Manavgat, Karpuz, Alara, Kargı, Obaçay, Dim, and Sedre Rivers) and many lakes, such as Eğirdir and Karacaören

Performance evaluation criteria

After training the GP variants, the model that produces the best result in terms of prediction accuracy and simplicity is chosen as the best solution. To this end, we can use Nash-Sutcliffe coefficient of efficiency (NSE, Eq. (3)), root mean squared error (RMSE, Eq. (4)), and bias-corrected Akaike information criterion (AICc, Eq. (5)) measures. The NSE and RMSE are the performance evaluation criteria that have been frequently used in hydrological studies. The AICc is the sum of the conventional
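Eqs. (3) to (5) are not included in this excerpt, so the sketch below uses the commonly cited textbook forms of the three measures. In particular, the AICc variant shown (a least-squares AIC, n·ln(MSE) + 2k, plus the small-sample correction 2k(k+1)/(n−k−1)) is an assumption and may differ in detail from the paper's Eq. (5); the function names and toy values are illustrative.

```python
import math


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


def rmse(observed, predicted):
    """Root mean squared error, in the units of the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def aicc(observed, predicted, n_params):
    """Bias-corrected AIC (assumed least-squares form); lower is better.

    Requires a non-zero error and n > n_params + 1.
    """
    n = len(observed)
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    aic = n * math.log(mse) + 2 * n_params
    correction = 2 * n_params * (n_params + 1) / (n - n_params - 1)
    return aic + correction


# Hypothetical observations and a forecast that is off by 0.5 everywhere.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.5, 2.5, 3.5, 4.5, 5.5]
score_nse = nse(obs, pred)
score_rmse = rmse(obs, pred)
score_aicc = aicc(obs, pred, n_params=1)
```

NSE and RMSE reward accuracy only, whereas the AICc also penalizes the number of fitted parameters, which is why it is useful for comparing models of different complexity.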

Determination of the effective inputs

Determination of efficient input vectors (optimal number of lags) is the first phase in the proposed model. An optimum number of lags may lead an SC method to generate a robust model. On the other hand, inadequate or extra inputs may yield poor or complicated models [6]. Typically, the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the given time series are used for input identification in time series modelling [25], [28], [36]. However, these functions are based on the linear correlation among past and present streamflow
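The abstract states that the model uses average mutual information (AMI) to identify the lags. As a rough illustration of a measure that is not restricted to linear dependence, the sketch below estimates AMI between a series and its lagged copy with a simple equal-width histogram. The estimator, the bin count, and the function name are assumptions; the paper's exact procedure is not shown in this excerpt.

```python
import math
from collections import Counter


def average_mutual_information(series, lag, n_bins=4):
    """Histogram estimate (in bits) of the AMI between x(t - lag) and x(t)."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must satisfy 1 <= lag < len(series)")
    lo, hi = min(series), max(series)
    if hi == lo:
        return 0.0  # a constant series carries no information
    width = (hi - lo) / n_bins

    def bin_index(value):
        # The maximum value is clamped into the last bin.
        return min(int((value - lo) / width), n_bins - 1)

    pairs = [
        (bin_index(series[t - lag]), bin_index(series[t]))
        for t in range(lag, len(series))
    ]
    n = len(pairs)
    joint = Counter(pairs)
    past = Counter(a for a, _ in pairs)
    present = Counter(b for _, b in pairs)

    ami = 0.0
    for (a, b), count in joint.items():
        p_joint = count / n
        ami += p_joint * math.log2(p_joint / ((past[a] / n) * (present[b] / n)))
    return ami


# Hypothetical strictly alternating series: each value determines the next one.
toggle = [0.0, 1.0] * 10
ami_lag1 = average_mutual_information(toggle, lag=1, n_bins=2)
ami_flat = average_mutual_information([3.0] * 10, lag=1)
```

Computing this quantity over a range of candidate lags gives a dependence profile that can guide the choice of inputs; histogram estimates are sensitive to the bin count, so results on short records should be read with care.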

Conclusion

This study presented the first application of MSGP in streamflow forecasting and introduced a new hybrid MSGP model, called MSGP-LASSO, for univariate streamflow prediction. Many researchers have proven that GP and its variants work better than classic time series models or even SC models, such as ANN and support vector machine (SVM) for streamflow prediction [10]. The pertinent literature also revealed that they do not appear to be sufficiently accurate for streamflow modelling in intermittent

CRediT authorship contribution statement

Ali Danandeh Mehr: Conceptualization, Methodology, Software, Validation, Formal analysis, Data curation, Resources, Visualization, Investigation, Writing - original draft, Writing - review & editing. Amir H. Gandomi: Supervision, Methodology, Writing - review & editing, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors thank the editors and anonymous reviewers for their comments on this manuscript. We also thank Mostafa Gandomi for his contribution to developing the GEP-based models used in this study. The data used in this article are available at https://github.com/alidanandeh/MSGP-LASSO.

References (39)

  • S.S.M. Astarabadi et al. Genetic programming performance prediction and its application for symbolic regression problems. Inf. Sci. (2019)
  • M. Bender et al. Time-series modeling for long-range stream-flow forecasting. J. Water Resour. Plann. Manage. (1994)
  • Bhavita, K., Swathi, D., Manideep, J., Sandeep, D. S., & Rathinasamy, M. (2019). Regime-wise genetic programming model...
  • Box, G. E., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time series analysis: forecasting and control. John...
  • A.D. Danandeh Mehr et al. On the calibration of multigene genetic programming to simulate low flows in the Moselle River. Uludağ Univ. J. Faculty Eng. (2016)
  • Gandomi, A. H., Alavi, A. H., & Ryan, C. (Eds.). (2015). Handbook of genetic programming applications. Switzerland:...
  • Hrnjica, B., & Danandeh Mehr, A. (2019). Optimized Genetic Programming Applications: Emerging Research and...
  • B. Hrnjica et al. Genetic programming for turbidity prediction: hourly and monthly scenarios. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi (2019)
  • Jabeen, H., & Baig, A. R. (2010). Review of classification using genetic programming. International journal of...