1 Introduction

Exploratory data analysis (EDA) is a fundamental task in the data mining process, in which data scientists analyse the properties and characteristics of different features (or instances) in a dataset, and the relationships between them [1]. Simple linear feature relationships can be discovered through the use of statistical techniques such as Pearson’s or Spearman’s correlations.

Non-linear relationships are generally found by performing association rule mining (ARM) [2], a rule-based machine learning method that produces rules that represent relationships between discrete features in a dataset. In the case of continuous data, discretisation techniques are commonly applied before performing ARM, limiting the quality and increasing the complexity of rules.

Genetic programming (GP) is, perhaps, most known for its success in symbolic regression tasks: the canonical tree-based GP is intrinsically suited to representing non-linear regression models. The use of GP for interpretable symbolic regression—where a user can understand the operation of the evolved function—has also been very successful [3].

The above properties make GP a natural choice for discovering interpretable relationships between continuous variables in data. However, no such approach has yet been proposed; all existing uses of GP for ARM use either a rule-based grammar, or discretise the input space.

This paper aims to propose the first approach to mining feature relationships (FRs), which are symbolic representations of intrinsic relationships between features in a dataset. A new GP method will be developed which uses a fitness function that considers both the quality of the discovered FR, as well as the potential interpretability of the FR. A speciation-based approach will also be proposed to allow for multiple distinct and complementary FRs to be automatically found as part of a single evolutionary search.

2 Background

GP has seen significant success in recent years in feature analysis applications. Tree-based GP [4], in particular, has been widely used due to its functional structure, which is well-suited to mapping a set of input features to a new constructed feature [5,6,7].

The use of GP for feature construction in regression and unsupervised learning tasks is perhaps the area most closely related to this work: evolving feature relationships can be seen as a form of “unsupervised regression”. Several works have suggested the use of methods to limit model complexity in symbolic regression, either to improve interpretability or generalisability. These include parsimony pressure and other bloat control strategies [3], as well as complexity measures such as Rademacher complexity [8]. The discovery and combination of “subexpressions” in GP (i.e. feature construction) was also shown to improve performance on symbolic regression tasks [9]. GP has been used for unsupervised tasks such as clustering [10] and nonlinear dimensionality reduction [11, 12], often with a focus on interpretability [13].

2.1 Related Work

A number of evolutionary computation approaches to ARM have been proposed [14], with most using vector-based representations such as genetic algorithms (GAs) [15] or particle swarm optimisation (PSO) [16]. The small number of papers using GP for ARM can be categorised into two paradigms: those using Genetic Network Programming (GNP) [17, 18], and those using a grammar-guided (G3P) approach [19, 20]. Of these, only a handful address the task of mining association rules from continuous data [14] (known as quantitative association rule (QAR) mining). These QAR methods, however, are all still constrained by the use of a grammar or network programming structure, and so they are unable to represent relationships between continuous features in an intuitive and precise symbolic manner.

3 Proposed Method: GP-FRM

The proposed method, Genetic Programming for Feature Relationship Mining (GP-FRM), aims to evolve compact rules (trees) that reconstruct a feature of the dataset from other features. In this way, the learnt rule represents a relationship between a given (“target”) feature and a set of other features. A simple example is the tree \(f_2 = f_1 \times f_0\), which is a non-linear relationship that would not be discovered by association rule mining algorithms. Such relationships are common: for example, the Body Mass Index (BMI) is a well-known “target” feature in the medical domain which is based on a person’s mass (m) and height (h): \(BMI = \frac{m}{h^2}\). As GP-FRM is an unsupervised learning method, it is also crucial that it can discover the best target features automatically without a priori knowledge.
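To make the notion of an FR concrete, the following sketch (our own illustration, not the paper's implementation; all data and names are hypothetical) evaluates a candidate tree of the form \(m / h^2\) and measures how well it reconstructs a stored BMI feature:

```python
import numpy as np

# Hypothetical dataset: rows are instances, columns are features f0 (height), f1 (mass), f2 (BMI).
rng = np.random.default_rng(0)
height_m = rng.uniform(1.5, 2.0, size=100)
mass_kg = rng.uniform(50.0, 110.0, size=100)
bmi = mass_kg / height_m ** 2
X = np.column_stack([height_m, mass_kg, bmi])

# A candidate FR tree, expressed as a function of the feature matrix:
# it attempts to reconstruct the target feature f2 from f0 and f1.
def candidate_fr(X):
    return X[:, 1] / X[:, 0] ** 2  # f1 / f0^2

output = candidate_fr(X)
r = np.corrcoef(output, X[:, 2])[0, 1]  # Pearson's correlation with the target feature
print(f"|r| between tree output and target: {abs(r):.3f}")  # ~1.0 for a perfect FR
```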

3.1 Overall Algorithm

The overall GP-FRM algorithm is shown in Algorithm 1. A core component of the algorithm is the use of speciation: the population is split into a number of species, each of which shares a common target feature. This niching approach serves two main purposes: it encourages multiple diverse FRs to be produced in a single GP run (rather than only a single best individual), while also restricting crossover and mutation to occur only between individuals that share the same target feature, improving learning efficacy. The target feature of a given GP individual is automatically determined at each generation, based on the feature which gives the best fitness, i.e. the first feature according to Eq. (1). This allows GP individuals to change more readily over time, moving between species or discovering an entirely new species niche.

$$\begin{aligned} ClosestFeatures(x|F) = \mathop {\mathrm {argsort}}\limits _{f \in F} |r_{x,f}| \qquad \quad \,\, (\textit{Decreasing sort}) \end{aligned}$$
(1)
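A minimal numpy sketch of Eq. (1), assuming x is a tree's output vector over all instances and F is the feature matrix (function and variable names are ours, not the paper's):

```python
import numpy as np

def closest_features(x, F):
    """Eq. (1): feature indices sorted by decreasing |Pearson correlation| with the tree output x."""
    corrs = np.array([abs(np.corrcoef(x, F[:, j])[0, 1]) for j in range(F.shape[1])])
    corrs = np.nan_to_num(corrs)  # guard against constant features (undefined correlation)
    return np.argsort(-corrs)     # decreasing sort; the first entry is the individual's target feature
```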
[Algorithm 1: The overall GP-FRM algorithm]
[Algorithm 2: The speciation method]

The core of Algorithm 1 is similar to a standard evolutionary search, with the main difference being the use of speciation for breeding and elitism (Lines 8–14). The speciation algorithm is shown in Algorithm 2. The number of species (\(N_S\)) is a parameter of the algorithm, and is used to constrain the number of niches in the search space: having too many species would give many poorer-quality FRs and prohibit niche-level exploitation by a group of individuals. The population is sorted by fitness (best to worst), each individual's list of Closest Features (CF) is computed according to Eq. (1), and then each individual is considered in turn:

  1. if the individual's closest feature (\(CF_0\)) has already been selected as a species seed, the individual is added to that species;

  2. otherwise, if the number of species (\(N_S\)) has not been reached, a new species is created with the individual's closest feature (\(CF_0\)) as the seed;

  3. otherwise, the individual's list of closest features (\(P_{j}CF\)) is searched to find the first feature which is a species seed, and the individual is added to that species.

In this way, the species are always selected from the fittest individuals, and the species seed represents the best individual in that species. A species’ seed is always transferred to the next generation unmodified during the breeding process, as a form of elitism. When breeding a species, the number of offspring produced is \(\frac{|P|}{N_S}\) to ensure each species has equal weighting.
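A minimal sketch of this speciation procedure, assuming each individual already carries its fitness and its closest-features list CF from Eq. (1) (attribute names are ours):

```python
def speciate(population, n_species):
    """Group individuals into species keyed by target feature (a sketch of Algorithm 2)."""
    population = sorted(population, key=lambda ind: ind.fitness)  # best (lowest fitness) first
    species = {}  # maps a seed feature index to the list of members of that species
    for ind in population:
        cf0 = ind.CF[0]
        if cf0 in species:                      # 1. species for this feature already exists
            species[cf0].append(ind)
        elif len(species) < n_species:          # 2. room remains: create a new species
            species[cf0] = [ind]                #    this (fittest-so-far) individual is its seed
        else:                                   # 3. otherwise, join the closest existing species
            seed = next(f for f in ind.CF if f in species)
            species[seed].append(ind)
    return species

# Each species then breeds |P| / n_species offspring, with its seed copied over unmodified (elitism).
```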

3.2 Fitness Function

A simple approach to assess the quality of a tree would be to measure the error between its output and its target feature, for example, the mean absolute error (MAE):

$$\begin{aligned} MAE(x,f) = \frac{\sum _{i=1}^{|x|} |x_i - f_i|}{|x|} \end{aligned}$$
(2)

where x is the n-dimensional output of a given tree, f is the n-dimensional target feature, and n is the number of instances.

The MAE is sensitive to scale: if x was exactly 10 times the scale of f, it would give an error of nine times the mean magnitude of f. This presents two problems: firstly, it means that the GP algorithm must learn constant factors within a FR, which traditional GP algorithms struggle with due to their use of random mutation (Footnote 1). Secondly, the scale of the learnt FRs is not actually important in many cases: a relationship between weight and height, for example, is meaningful whether weight is measured in grams, kilograms, or pounds. With these issues in mind, we instead base our cost measure on Pearson's correlation, \(r_{x,f}\), given its scale invariance:

$$\begin{aligned} r_{x,f} = \frac{ \sum _{i=1}^{n}(x_i - \overline{x})(f_i - \overline{f})}{\sqrt{\sum _{i=1}^{n}(x_i-\overline{x})^2 } \sqrt{\sum _{i=1}^{n}(f_i-\overline{f})^2}} \end{aligned}$$
(3)

Pearson's correlation has a value between \(+1\) and \(-1\), where a value of \(+1\) represents a completely positive linear relationship between x and f, 0 represents no correlation at all, and \(-1\) represents a completely negative linear relationship. The magnitude of the correlation measures the degree of linearity in the relationship; the sign provides the directionality. We do not consider the directionality to be important in this work, as a negative feature relationship is equally as informative as a positive one. Therefore, we consider the absolute value of \(r_{x,f}\), which is in the range [0, 1], where 1 is optimal; the cost of an individual is accordingly defined as \(1-|r_{x,f}|\), so that a cost of 0 is optimal. Pearson's correlation has seen previous use in GP to encourage diversity and approximate fitness [21, 22].
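A small synthetic demonstration of this motivation (our own example, echoing the factor-of-ten scaling discussed above): the MAE grows with a scale mismatch while \(|r_{x,f}|\) does not.

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=200)   # target feature
x = 10 * f                 # tree output: correct relationship, wrong scale

mae = np.mean(np.abs(x - f))              # inflated purely by the scale mismatch
r = abs(np.corrcoef(x, f)[0, 1])          # unaffected by scale
print(f"MAE = {mae:.2f}, |r| = {r:.2f}")  # large MAE, |r| = 1.00
```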

If an evolved FR is to be realistically useful in understanding data, it must be sufficiently small and simple for a human to easily interpret. To achieve this, we introduce a penalty term into the fitness function, with an \(\alpha \) parameter that controls the trade-off between high correlation (low cost) and small tree size. In practice, \(\alpha \) is generally small, so this can be seen as a relaxed version of lexicographic parsimony pressure [23]. The proposed fitness function—which should be minimised—is shown in Eq. (4), for an individual x with target feature f.

$$\begin{aligned} Fitness(x|f) = \begin{cases} 1+|r_{x,f}|+\alpha \times size(x_{\text {Tree}}), & \text {if } f \in x_{\text {Tree}}\\ 1-|r_{x,f}|+\alpha \times size(x_{\text {Tree}}), & \text {otherwise} \end{cases} \end{aligned}$$
(4)

Equation (4) consists of two cases: one where the target feature f is used in the tree \(x_\text {Tree}\) and one where it is not. This is to penalise the evolution of naïve or self-referential trees such as \(f_1 = f_1\) or \(f_2 = f_1 \times \frac{f_2}{f_1}\). In the case where a target feature is used in \(x_\text {Tree}\), the fitness is penalised by the size of the linear correlation (i.e. in the range [1, 2], disregarding \(\alpha \)). When it is not, the fitness will be in the range [0, 1]. In this way, the fitness of an individual not using the target feature will always be better (lower) than another that does.
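Eq. (4) can be transcribed directly as below; tree.evaluate, tree.size, and tree.features_used are assumed helper methods of a GP individual and are not part of the paper:

```python
import numpy as np

def fitness(tree, target_idx, F, alpha):
    """Eq. (4): lower is better; trees using their own target feature are pushed into [1, 2]."""
    x = tree.evaluate(F)                              # tree output over all instances
    r = abs(np.corrcoef(x, F[:, target_idx])[0, 1])   # |Pearson correlation| with the target feature
    penalty = alpha * tree.size()                     # parsimony pressure on the node count
    if target_idx in tree.features_used():            # self-referential tree
        return 1 + r + penalty
    return 1 - r + penalty                            # otherwise fitness lies in [0, 1] (plus penalty)
```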

3.3 Preventing the Discovery of Naïve Relationships

In many real-world datasets, features will be highly linearly correlated with each other: either due to redundancy in the feature set, or due to other natural linearity. For example, weight measured as a feature in kg will be perfectly correlated with weight measured in lb. While it is not incorrect for GP to discover such relationships, they are not very useful, as they can be found in \(O(n^2)\) time for n features, by calculating the pairwise Pearson's correlation matrix.

We prevent GP from evolving such relationships by pre-computing a list of “matching features” for each feature. This list contains all the other features that are linearly correlated with the feature (Footnote 2). This list is used in place of f in the calculation of fitness (Eq. (4)), such that the fitness will be penalised if any matching features to the target feature appear in the GP tree. Note that this does not prevent any features from being used as a species seed.
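A sketch of this pre-computation; the correlation threshold below is a placeholder, as the exact value used is given in the paper's Footnote 2:

```python
import numpy as np

def matching_features(F, threshold=0.95):
    """For each feature, find the other features whose |Pearson correlation| exceeds the threshold."""
    corr = np.abs(np.corrcoef(F, rowvar=False))  # pairwise |r| matrix: O(n^2) in the number of features
    np.fill_diagonal(corr, 0.0)                  # a feature never "matches" itself
    return {j: set(np.flatnonzero(corr[j] > threshold)) for j in range(F.shape[1])}

# During fitness evaluation, a tree is penalised if it uses the target feature itself
# or any feature in matching_features(F)[target_idx].
```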

4 Experiment Design

To evaluate the potential of GP-FRM, we tested it on a range of real-world classification datasets (from different domains), which were selected due to having clearly human-meaningful features. These are summarised in Table 1, ordered according to the number of features. Some minor data cleaning was done, including the removal of missing values by removing whole features or instances as appropriate.

Table 1. Classification datasets used for experiments.

GP-FRM was tested at three \(\alpha \) values (0.01, 0.001, 0.0001) to evaluate the trade-off between correlation and tree size. On each dataset, 30 runs of GP-FRM were performed for each value of \(\alpha \). The parameter settings used for GP-FRM are shown in Table 2. A reasonably high population size and number of generations were used due to the cheap computational cost of the fitness function. In practice, GP-FRM tended to converge by about 400–500 generations. A small maximum tree depth of six was used to encourage interpretable trees, which also further reduced the computational cost. The number of species, \(N_S\), was set to 10 for all experiments based on initial tests. In future, we hope to allow the number of species to be dynamically determined during the evolutionary process.
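For clarity, the experimental setup can be summarised as a configuration sketch; the population size and generation count are left as placeholders since their exact values appear only in Table 2:

```python
config = {
    "alpha_values": [0.01, 0.001, 0.0001],  # parsimony coefficients tested
    "runs_per_setting": 30,                 # independent GP runs per dataset and alpha value
    "n_species": 10,                        # fixed for all experiments
    "max_tree_depth": 6,                    # small depth to encourage interpretable trees
    "population_size": None,                # see Table 2 ("reasonably high")
    "generations": None,                    # see Table 2; convergence typically by 400-500 generations
}
```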

Table 2. GP Parameter Settings.

5 Results

The mean fitness, cost, and number of nodes over 30 runs for each dataset and value of \(\alpha \) are shown in Table 3 (Footnote 3).

Table 3. Mean results of GP-FRM across the datasets

As \(\alpha \) is increased, there is a clear increase in mean cost and decrease in the mean number of nodes on most of the datasets in Table 3. In many cases, the proportional decrease in the number of nodes is much higher than the increase in cost: for example, on the Wine dataset, the number of nodes at \(\alpha =0.001\) is less than half that at \(\alpha =0.0001\), but the cost only increases from 0.149 to 0.172. Similar patterns are seen for WDBC, Dermatology, Spambase, and MFEAT. When \(\alpha \) is increased from 0.001 to 0.01, however, the increase in cost is often proportionally much higher, especially on WDBC and PC3. The Arrhythmia dataset appears to exhibit strange behaviour, as it has a high number of features which are simple multiplicative combinations of other features. Such relationships could be “filtered out” as a pre-processing step, using a similar approach to that of removing highly correlated features.

Generally, as the number of features in the dataset increases, the mean cost decreases. This is not surprising: the more features available, the more likely it is to find a stronger relationship between them. From a similar perspective, higher-dimensional datasets often require the use of fewer nodes, as there are a greater number of simple FRs to be found. A multiobjective GP approach [26] would likely help to find a balance between tree size (node count) and the number of FRs found.

To find more complex and interesting FRs, a greater number of species should be used on high-dimensional datasets. The \(\alpha \) parameter should be set based on initial tests of cost: on datasets such as PC3, Steel Plates Fault, and MFEAT, a high \(\alpha \) encourages small FRs while still achieving a very good cost. On other datasets such as Wine and Spambase, a low \(\alpha \) is needed to ensure that the FRs found are of sufficient quality.

Table 4. Relative Standard Deviation of GP-FRM across the datasets

The relative standard deviation (RSD: \(100\% \times \frac{SD}{mean}\)) of these results is presented in Table 4 to show the variation across the 30 runs. In general, the RSD is around 20–50%, aside from a few cases where it is much higher due to the measured values being very small. This level of variance is not unusual for GP, but could be reduced further in future work through the use of a more constrained search space, or the introduction of domain knowledge to give a higher-fitness initial population.
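The relative standard deviation reported in Table 4 is computed per dataset and \(\alpha \) value over the 30 runs; a one-line sketch:

```python
import numpy as np

def relative_standard_deviation(values):
    """RSD = 100% x (standard deviation / mean) of the per-run results."""
    values = np.asarray(values, dtype=float)
    return 100.0 * np.std(values) / np.mean(values)
```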

Fig. 1. Convergence analysis for the Wine (left) and Spambase (right) datasets. Top row is \(\alpha =0.0001\), middle is \(\alpha =0.001\), and bottom is \(\alpha =0.01\). Median values of fitness and size are plotted to represent the expected average performance of a single GP run.

5.1 Convergence Curves

The convergence curve for GP-FRM for each value of \(\alpha \) is shown for two representative datasets (Wine and Spambase) in Fig. 1. Clearly, when \(\alpha \) has a high value (0.01), convergence occurs much more quickly—the high pressure to use few nodes in a tree greatly restricts the size of trees, reducing the number of good individuals in the search space. The convergence curve is also much less granular, due to the restrictions on tree size. At a lower \(\alpha \), the size of individuals starts quite low, but then increases over the evolutionary process, before levelling off. This again reflects the difficulty of finding larger individuals which have sufficiently lower costs to out-perform simpler, but higher-cost individuals. Early in the evolutionary process, it is much “easier” to find small individuals that have a relatively good fitness than larger ones. In the future, it may be interesting to investigate dynamically adapting \(\alpha \) throughout evolution, to better guide the search based on whether the cost or size of individuals is sufficiently low.

6 Further Analysis

To better understand the potential of GP-FRM for mining feature relationships which are useful for providing insight in data, we further analyse a selection of the evolved FRs in this section.

6.1 Analysis of Selected Features

Of the eight tested datasets, the four which showed the biggest decrease in tree size from \(\alpha =0.0001\) to \(\alpha =0.01\) were Wine, WDBC, Dermatology, and Spambase. A decrease in tree size will generally give a decrease in the number of feature terminals, and hence decrease the occurrence of each feature in an evolved FR. This pattern can be seen in Fig. 2, which plots the histogram of the features used to produce FRs for the five most common (Footnote 4) target features in each of the datasets. As \(\alpha \) is increased (from left to right), the number of features (the area under the histogram) decreases significantly. On three of the datasets, there are features which are never selected to be in a GP tree when \(\alpha =0.01\). This shows that the parsimony pressure is not only encouraging GP to evolve smaller trees, but also simpler trees which use fewer distinct features.

Fig. 2. Histogram of the features used to produce feature relationships (FRs) for the five most common target features on four datasets. Each colour represents one target feature. \(\alpha \) varies from left-to-right, increasing the penalty for using more nodes in a tree. The x-axis represents each feature indexed in the order it appeared in the dataset.

Across the four datasets, the five target features utilise clearly distinct groups of features. For example, at \(\alpha =0.0001\) on Spambase, the purple target feature mostly uses features with indices between 25 and 30. There is also a clear spike on Spambase with features above index 50 being particularly popular for the blue and orange target features. On all datasets, we can see an increase in this niching-style behaviour as \(\alpha \) is increased. For example, on the Wine dataset, at \(\alpha =0.0001\), most features are commonly used across all the target features; at \(\alpha =0.01\), only a few different features are commonly used for each target feature. This pattern reinforces the benefit of a speciation approach (particularly at a high \(\alpha \)) in encouraging multiple distinct FRs to be learned simultaneously. Using a more complex parsimony pressure that considers the number of unique features in a tree is likely to further improve the performance of speciation.

Fig. 3. Best evolved trees on each dataset that use fewer than five unique features and ten nodes in total.

6.2 Analysis of Evolved Relationships

To further understand the usefulness of the evolved FRs, we selected the tree with the lowest cost on each dataset that used fewer than five unique features and no more than ten nodes overall. While other trees had slightly lower cost, their greater complexity makes them less useful for simple analysis. The eight trees for the eight datasets are shown in Fig. 3. We analyse a sample of these trees further to evaluate their meaning in the context of the features of the dataset.

The tree shown for the Wine dataset has the highest cost across the datasets, but a cost of 0.098 still gives a Pearson’s correlation of greater than 0.9, indicating a very strong correlation [27]. f5, f8, and f11 correspond to “total phenols”, “proanthocyanins” and “hue” respectively, with the target feature being flavanoids. This FR therefore shows that flavanoids have a high linear correlation with the greater of the amount of proanthocyanins and the product of total phenols and hue. This information could be useful to a food chemist in understanding how to control the amount of flavanoids in wine.

The target feature for WDBC is “worst area”: the largest cell nucleus area in the breast tissue sample. The GP tree is equivalent to the formula: \(\text {worst area} = \text {se area} - \text {mean radius} - (\frac{\text {worst concavity}}{\text {se concavity}}) \times \text {se area}\), where se is the standard error.

The tree evolved on the Dermatology dataset uncovers a relationship between the presence of a band-like pattern on the skin, and other skin attributes, including a clear relationship with the age of the patient: \(\text {band-like infiltrate} = \text {scalp involvement} \div (\frac{\text {oral mucosal involvement}}{\text {age}} + \frac{\text {polygonal papules}}{\text {age}})\). This could be very useful to dermatologists in understanding how the likelihood of different symptoms varies as a patient gets older.

On both the PC3 and Arrhythmia datasets, GP-FRM found very simple trees. For PC3, the discovered rule is \(\text {Halstead Effort} = \text {Halstead Volume} \times \text {Halstead Difficulty}\). The PC3 dataset measures various aspects of code quality of NASA software for orbiting satellites. The Halstead effort measures the “mental effort required to develop or maintain a program”, and indeed is defined in the original paper in this formulation [28]. The fact that GP-FRM discovered this (already known) relationship highlights its ability to find rules that ARM algorithms would not.

Finally, on the Spambase dataset, a high correlation is found between the number of times that the token “857” occurs in an email and a number of other tokens such as “650”, “telnet”, “lab”, and “address”. A security researcher analysing this dataset may be able to use this information to better understand common patterns in spam, in order to block it more accurately.

7 Conclusion

This paper proposed the first approach to automatically discovering feature relationships (FRs): symbolic functions which uncover underlying non-linear relationships between features of a large dataset. Our proposed GP-FRM method used a variation on Pearson’s correlation with a speciation-based genetic programming algorithm to automatically produce a set of distinct and meaningful feature relationships. Empirical testing across a range of real-world datasets demonstrated the ability of GP-FRM to find very strong relationships which used a small number of features, aided by the use of parsimony pressure as a secondary objective. Further analysis reinforced these findings and demonstrated how the learned relationships could be used in practice.

Future work will primarily focus on improving GP-FRM further through the use of more sophisticated parsimony pressure methods; development of approaches to minimise the number of distinct features used in a given species; and further refinements to the fitness function to better measure the interpretability and meaningfulness of feature relationships. Employing measures such as the Shapley value [29] or the Vapnik–Chervonenkis dimension [30] could give better measures of tree complexity than a simple count of nodes.