Multi-objective Grammatical Evolution of Decision Trees for Mobile Marketing user conversion prediction

https://doi.org/10.1016/j.eswa.2020.114287

Highlights

  • We propose two novel methods (MGEDT and MGEDTL) to evolve decision trees.

  • The methods optimize the classification performance and model interpretability.

  • The methods were designed for the Mobile Performance Marketing domain.

  • A realistic experimentation was conducted using big data (6 million records).

  • Competitive results were obtained by the proposed MGEDT and MGEDTL methods.

Abstract

The worldwide adoption of mobile devices is raising the value of Mobile Performance Marketing, which is supported by Demand-Side Platforms (DSP) that match mobile users to advertisements. In these markets, monetary compensation only occurs when there is a user conversion. Thus, a key DSP issue is the design of a data-driven model to predict user conversion. To handle this nontrivial task, we propose a novel Multi-objective Optimization (MO) approach to evolve Decision Trees (DT) using a Grammatical Evolution (GE), under two main variants: a pure GE method (MGEDT) and a GE with Lamarckian Evolution (MGEDTL). Both variants evolve variable-length DTs and perform a simultaneous optimization of the predictive performance and model complexity. To handle big data, the GE methods include a training sampling and parallelism evaluation mechanism. The algorithms were applied to a recent database with around 6 million records from a real-world DSP. Using a realistic Rolling Window (RW) validation, the two GE variants were compared with a standard DT algorithm (CART), a Random Forest and a state-of-the-art Deep Learning (DL) model. Competitive results were obtained by the GE methods, which present affordable training times and very fast predictive response times.

Introduction

The massive usage of mobile devices (e.g., smartphones, tablets) is increasing the value of the mobile advertising industry, which was estimated at 100 billion dollars worldwide in 2016 (Du et al., 2016). In the particular domain of Mobile Performance Marketing, users are matched to advertisements through Demand-Side Platforms (DSPs), involving several types of mobile market players: users, publishers and advertisers. Publishers own popular digital spaces (e.g., news websites, online game services) that attract a vast audience of users to their content. Advertisers own marketing campaigns regarding products or services that they want to sell. DSPs allow the market to function by linking publishers to advertisers through a digital platform. Publishers can be funded by requiring users to click on a dynamic ad prior to accessing their contents. If a user activates a dynamic link, then a redirect data event is generated. In these markets, compensation only occurs when there is a conversion, i.e., when a product or service is acquired by the user. If there is a conversion, a portion of the advertiser’s profit is automatically returned to the publisher, and the DSP company also receives a base fee. Thus, the DSP goal is to perform a good match between users and advertisement campaigns, in order to increase conversions. In this context, a key issue for the implementation of a DSP expert system is the design of a prediction model for the user Conversion Rate (CVR), which is often modeled as a binary classification task (“sale”, “no sale”), aiming to estimate if a user will produce a conversion once a redirect has occurred.

The mobile user CVR prediction task is nontrivial for four main reasons (Matos et al., 2019): it involves big data, with millions of redirects being generated every day; most redirects (e.g., 99%) do not result in conversions (it is a highly unbalanced task); only a limited set of input features is available (due to privacy and technology issues); and most features are categorical and have a high cardinality (with hundreds or thousands of levels). Despite these issues, several attempts at CVR prediction have been made, using different types of Machine Learning (ML) models. The first attempts involved more rigid and linear models, such as Poisson regression and Logistic Regression (Chen et al., 2009). These models are easy to interpret but usually provide limited predictive performance. Thus, recent CVR prediction studies use more flexible ML algorithms, such as: Random Forests (Du et al., 2016); Gradient Boosting Decision Trees (Zhang et al., 2014); Bagging, Stacking and Voting ensembles (King et al., 2015); XGBoost (Matos et al., 2018); Deep Learning (Matos et al., 2019); and Neuroevolution (evolutionary optimization of neural network models) (Pereira et al., 2019). Yet, all these flexible ML approaches use black-box prediction models (Cortez & Embrechts, 2013), which are more difficult for Mobile Marketing domain experts to interpret. In effect, model interpretability, often termed Explainable Artificial Intelligence (XAI) (Arrieta et al., 2020), is a key element that helps to determine if a prediction model makes sense and can be trusted. Moreover, a human interpretable model can be used in posterior analysis (e.g., to help in the design of successful future marketing campaigns).

Decision Trees (DT) are well-known ML models, particularly used in classification tasks (e.g., CART, ID3, C4.5 algorithms), due to their fast training time and good interpretability (Hastie et al., 2017, Witten et al., 2017). However, in complex classification tasks their predictive performance is often lower than that achieved by other ML methods. To improve the classification results, one research direction was to propose DT ensembles, in which several trees are combined into a single model. This resulted in popular predictive models (e.g., Random Forest, XGBoost) but at the cost of losing interpretability. Another research approach is to adopt a single decision tree and improve the fitting algorithm in order to provide a higher predictive performance, for instance by using Evolutionary Computation (EC), as proposed in Chabbouh et al. (2019), Czajkowski and Kretowski (2019a), Fitzgerald et al. (2015), Rivera-López and Canul-Reich (2018) and Motsinger-Reif et al. (2010).

According to Barros et al. (2012), there are two main types of evolutionary design of DTs for classification: axis-parallel and oblique. In the former, a single attribute is used to split the data in each node, while in the latter a combination of two or more attributes is used in each split node. In some studies (Barros et al., 2011, Czajkowski and Kretowski, 2010), a combination of the two is used, named mixed trees, where each split node can contain either a single attribute or a combination of multiple attributes. Axis-parallel DTs are popular in the literature because they are easier to interpret than oblique DTs (Barros et al., 2012). This paper follows the research direction of using EC to evolve axis-parallel DTs. The associated state-of-the-art works are summarized in Table 1, ordered by publication year and described by the following characterizing elements:

    T – the type of DT (Axis-Parallel, Oblique or Mixed);

    EC – the type of EC algorithm used;

    V – if a variable-length DT representation was adopted;

    Goal – the optimization main goal;

    CM – the adopted DT complexity measure;

    MO – if a Multi-objective Optimization (MO) algorithm was used;

    LE – if a Lamarckian Evolution (LE) was included;

    Data – the highest dataset size (total number of instances).

This work, represented by the last row of Table 1, explores a Grammatical Evolution (GE) approach (Ryan et al., 1998). GE shares some similarities with Genetic Programming (GP) (Koza, 1993) and Gene Expression Programming (GEP) (Ferreira, 2001), since all these approaches optimize programs (Guogis & Misevicius, 2014). The main difference is that GE evolves programs in an arbitrary language based on a grammar. In past studies, GE has been applied to optimize ML models with different application purposes, including the creation of neural logic networks for bankruptcy prediction (Tsakonas et al., 2006) and the automatic design of Neural Networks for classification tasks (Ahmadizar et al., 2015). GE is particularly suited for variable-length solution representations, which is the case of a DT. Indeed, GE was used to evolve DTs in Motsinger-Reif et al. (2010) and Fitzgerald et al. (2015), outperforming standard DT algorithms (e.g., C4.5, CART) in several classification tasks. The advantage of using GE is that no limit on the tree size needs to be set a priori, which is a limitation of the fixed-length tree representations used in Rivera-López and Canul-Reich (2018) and Chabbouh et al. (2019). An important distinctive aspect of our work is the type of EC goal. Most related works from Table 1, including the ones that use GE (Fitzgerald et al., 2015, Motsinger-Reif et al., 2010), only focus on predictive performance and not interpretability. These two goals are usually conflicting and thus a trade-off often needs to be set. In Barros et al. (2011), Czajkowski and Kretowski (2010, 2013, 2019a) and Jankowski and Jackowski (2014), this issue was addressed by using a single fitness function with an additive weighted formula. The problem with this approach is that it is only possible to optimize a single trade-off on each run and the fitness weights need to be set in advance.
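To make the grammar-based representation concrete, the following is a minimal sketch of the usual GE genotype-to-phenotype mapping, in which each integer codon selects a production rule through the modulo of the number of available alternatives. The toy grammar, the feature names (`country`, `browser`), the wrapping limit and the function name `map_genome` are illustrative assumptions and are not the grammar or code used by MGEDT.

```python
# Hypothetical toy grammar for axis-parallel decision trees (illustration only).
# Every non-terminal has exactly two alternatives, so each expansion uses one codon.
GRAMMAR = {
    "<tree>": [
        ["<leaf>"],
        ["if ", "<cond>", " then (", "<tree>", ") else (", "<tree>", ")"],
    ],
    "<cond>": [["country == 'PT'"], ["browser == 'chrome'"]],
    "<leaf>": [["'sale'"], ["'no sale'"]],
}


def map_genome(genome, start="<tree>", max_wraps=2):
    """Map a list of integer codons to a tree expression, or None if invalid.

    The leftmost non-terminal is expanded first. The genome is reused
    ("wrapped") at most `max_wraps` times before the mapping is abandoned.
    """
    pending = [start]          # symbols still to be processed, left to right
    phenotype = []             # terminals emitted so far
    used = 0                   # number of codons consumed
    budget = len(genome) * (max_wraps + 1)
    while pending:
        symbol = pending.pop(0)
        if symbol not in GRAMMAR:
            phenotype.append(symbol)   # terminal: copy it to the output
            continue
        if used >= budget:
            return None                # ran out of codons: invalid individual
        codon = genome[used % len(genome)]
        used += 1
        options = GRAMMAR[symbol]
        # The codon picks one alternative; its symbols replace the non-terminal.
        pending = list(options[codon % len(options)]) + pending
    return "".join(phenotype)
```

For instance, under this toy grammar `map_genome([1, 0, 2, 5, 0, 0])` yields `if country == 'PT' then ('no sale') else ('sale')`, while a genome such as `[1]` keeps expanding `<tree>` and is rejected once the wrapping budget is exhausted. Because the tree size depends only on which rules the codons select, no fixed tree length has to be chosen beforehand.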

A more natural approach, followed in our work, is to adopt a MO using a Pareto front, simultaneously maximizing the predictive performance and minimizing the DT complexity. In effect, we adopted the MO GE proposed by Colmenar et al. (2011), which adapted the Non-dominated Sorting Genetic Algorithm II (NSGA-II) (Srinivas & Deb, 1994) to be the evolutionary engine of the GE. As far as we know, there are no studies that use a MO approach to optimize both classification performance and complexity for axis-parallel DTs. A MO was used in Chabbouh et al. (2019) to evolve DTs but it only optimized predictive performance measures (Precision and Recall), while in Czajkowski and Kretowski (2019b) a MO was adopted to optimize oblique and mixed DT model complexity and predictive performance for regression tasks. Moreover, LE can use a local learning procedure to accelerate evolution, where the improved solution is encoded back into the chromosome (Cortez et al., 2002). Our work is the only study that introduces a LE, which uses a fast local ML search to improve the GE solutions. In Mingo et al. (2013), a similar approach was used with GE but with an unsupervised local learning procedure applied in a reinforcement learning context. Finally, we note that all related work studies from Table 1 worked with datasets with a few hundred or thousands of examples. The specific mobile CVR prediction task addressed in this paper involves a higher order of magnitude, namely millions of training records. To cope with such “big data”, in the sense that the datasets are too big to be dealt with by standard evolutionary DT methods, our GE approach includes two specific mechanisms: the use of a balanced sampling over the training data during the fitness evaluation; and a parallel evaluation of the population individuals by means of multi-core processors. As shown in Section 3, this allows a timely and feasible solution to be deployed for the analyzed Mobile Marketing domain.
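The two ideas above, Pareto-based comparison of solutions and balanced sampling of the training data, can be sketched as follows. This is a simplified illustration under stated assumptions: each solution is represented as a hypothetical `(performance, complexity)` pair, only the first non-dominated front is extracted (NSGA-II additionally ranks the remaining fronts and uses crowding distance), and the per-class undersampling in `balanced_sample` is one plausible reading of "balanced sampling", not necessarily the exact procedure of the paper.

```python
import random


def dominates(a, b):
    """True if a = (performance, complexity) Pareto-dominates b.

    Performance is maximized and complexity is minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(solutions):
    """Return the solutions that are not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


def balanced_sample(rows, labels, per_class, seed=0):
    """Draw up to `per_class` rows from each class (assumed undersampling)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    sample = []
    for label, members in by_class.items():
        k = min(per_class, len(members))
        sample.extend((row, label) for row in rng.sample(members, k))
    rng.shuffle(sample)   # avoid class-ordered training data
    return sample


# Hypothetical (performance, number of nodes) pairs for five evolved trees.
candidates = [(0.80, 10), (0.75, 4), (0.70, 6), (0.80, 12), (0.60, 2)]
front = pareto_front(candidates)   # [(0.80, 10), (0.75, 4), (0.60, 2)]
```

Unlike a weighted-sum fitness, the front keeps several trade-offs from a single run, so the choice between a slightly more accurate tree and a smaller one can be deferred to the domain expert.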

The reason for minimizing the DT complexity is twofold. Firstly, a less complex tree will imply a better model interpretability, since it will be easier to understand by domain experts. Secondly, trees encoded by a lower number of genes are faster to build, thus reducing the prediction time used by the algorithm. This last reason is very important for the Mobile Marketing domain, since we aim to perform predictions in real-time.

In this paper, we propose a MO based GE to evolve DTs for the Mobile Performance Marketing domain. To measure the effect of a LE, we explore two main variants: a pure GE method (MGEDT) and a GE that uses local learning to further improve the evolved solutions (MGEDTL). Both MGEDT and MGEDTL are tested using recent real-world DSP data, provided by a marketing company and compared with traditional decision trees, a Random Forest and Deep Learning. The main contributions are:

    (i) we propose a MO approach that optimizes both the classification performance and model interpretability using GE under two main variants (MGEDT and MGEDTL);

    (ii) we adopt a robust experimentation procedure, using big data from a real-world Mobile Marketing DSP provider (with 6 million data records) and a realistic rolling window evaluation (Oliveira et al., 2017), with several training and test iterations;

    (iii) we compare the proposed GE methods with a standard decision tree, a Random Forest and a state-of-the-art Deep Learning method (Matos et al., 2019); and

    (iv) we discuss how the best evolved DT is useful in the analyzed application domain (Mobile Performance Marketing).
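The rolling window evaluation mentioned in contribution (ii) can be illustrated with a short sketch. It assumes time-ordered records, a fixed-size training window and a fixed slide step; the function name `rolling_window_splits` and the window sizes in the example are hypothetical and do not reproduce the settings of the paper.

```python
def rolling_window_splits(n_records, train_size, test_size, step):
    """Yield (train_range, test_range) index ranges over time-ordered records.

    Each iteration trains on `train_size` consecutive records and tests on the
    `test_size` records that immediately follow; the window then slides by
    `step` records, discarding the oldest data.
    """
    start = 0
    while start + train_size + test_size <= n_records:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


# Toy example: 10 records, train on 4, test on the next 2, slide by 2.
splits = list(rolling_window_splits(10, train_size=4, test_size=2, step=2))
n_iterations = len(splits)   # (10 - 4 - 2) // 2 + 1 = 3
```

Because every test window lies strictly after its training window, the model is always evaluated on data that is more recent than what it was trained on, which mimics how a deployed DSP model would be retrained and used over time.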

The paper is organized as follows. Section 2 presents the Mobile Performance Marketing data, the classification algorithms (including MGEDT and MGEDTL) and the evaluation procedure. The results are presented and analyzed in Section 3. Finally, Section 4 draws the main conclusions and suggestions for future work.

Mobile performance marketing data

This research was conducted during an R&D project that involved a worldwide Mobile Marketing company (OLAmobile). The collected data was retrieved from the company data center cloud system, containing two main data events: redirects and sales. A redirect event record is generated each time a user clicks on a dynamic link related to an advertisement. The user is forwarded to a mobile marketing campaign. A sale event only occurs if the redirect originates a conversion (e.g., product purchase or

Results

All experiments were conducted using code written in the Python programming language. The GE and DT were executed using a dedicated Linux Intel Xeon 1.7 GHz server, where 25 cores were used by each GE experiment. The DL experiments were conducted on a personal computer with an NVIDIA GeForce GTX 1060 GPU using the Keras and TensorFlow libraries, aiming to decrease the computational cost.

Fig. 3 presents the overall RW test data results in terms of the BEST (left) and TEST (right) traffic modes.

Conclusions

The Mobile Performance Marketing industry value is increasing due to the worldwide usage of mobile devices (e.g., smartphone, tablet). It consists of markets supported by Demand-Side Platforms (DSP), which match advertisements to dynamic links activated by users. In these markets, monetary compensation only occurs when there is a product or service acquisition (the conversion). A crucial DSP expert system design issue is the nontrivial task of user Conversion Rate (CVR) prediction, that consists

CRediT authorship contribution statement

Pedro José Pereira: Conceptualization, Methodology, Software, Investigation, Resources, Data curation, Writing - original draft. Paulo Cortez: Conceptualization, Methodology, Validation, Formal analysis, Writing - review & editing, Visualization, Supervision, Project administration, Funding acquisition. Rui Mendes: Conceptualization, Methodology, Validation, Formal analysis, Writing - review & editing, Visualization, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This article is a result of the project NORTE-01-0247-FEDER-017497, supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF). This work was also supported by FCT – Fundação para a Ciência e Tecnologia, Portugal within the Project Scope: UID/CEC/00319/2019. We wish to thank the OLAmobile company for providing the data and domain feedback. We would also like to thank the anonymous

References (50)

  • Tsakonas, A. et al. (2006). Bankruptcy prediction with neural logic networks by means of grammar-guided genetic programming. Expert Systems with Applications.

  • Arrieta, A.B. et al. (2020). Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion.

  • Barros, R.C. et al. (2012). A survey of evolutionary algorithms for decision-tree induction. IEEE Transactions on Systems, Man, and Cybernetics, Part C.

  • Beazley, D. (2001). PLY (Python Lex-Yacc).

  • Beume, N. et al. (2009). On the complexity of computing the hypervolume indicator. IEEE Transactions on Evolutionary Computation.

  • Bianchini, M. et al. (2014). On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems.

  • Breiman, L. (2001). Random forests. Machine Learning.

  • Campos, G.O. et al. (2016). On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study. Data Mining and Knowledge Discovery.

  • Chen, Y. et al. Large-scale behavioral targeting.

  • Colmenar, J.M. et al. Multi-objective optimization of dynamic memory managers using grammatical evolution.

  • Cortez, P. et al. (2019). Multi-step time series prediction intervals using neuroevolution. Neural Computing and Applications.

  • Cortez, P. et al. (2002). A Lamarckian approach for neural network training. Neural Processing Letters.

  • Czajkowski, M. et al. Globally induced model trees: An evolutionary approach.

  • Czajkowski, M. et al. Global induction of oblique model trees: An evolutionary approach.

  • Czajkowski, M. et al. (2019). A multi-objective evolutionary approach to Pareto-optimal model trees. Soft Computing.