Genetic programming and frequent itemset mining to identify feature selection patterns of iEEG and fMRI epilepsy data
Introduction
Epilepsy is a neurological disorder that impairs millions worldwide with recurrent, often uncontrollable seizures (World Health Organization, 2012). Physicians at epilepsy centers acquire combinations of noninvasive and/or invasive brain signal modalities to diagnose the seizure onsets in patients, plan neurosurgical treatment, or fathom ictogenesis mechanisms and epilepsy symptoms (Bragin et al., 2010, Donaire et al., 2009a, Donaire et al., 2009b, Engel, 1993, Engel et al., 2010, Fried, 1995, Rosenow and Luders, 2001, Sierra-Marcos et al., 2013, Staba and Bragin, 2011). These modalities include scalp electroencephalography (EEG), intracranial electroencephalography (iEEG), magnetoencephalography (MEG), functional magnetic resonance imaging (fMRI), and other neuroimaging data. Traditionally, clinicians resort to subjective manual procedures when screening iEEG or fMRI for epilepsy patient diagnoses. However, signal processing techniques in pattern classification and recognition for fMRI and iEEG have become useful tools in epilepsy research to help clinicians more objectively discern differences between functional and dysfunctional brain regions and thereby identify diseased brain tissue for therapy (Ayoubian et al., 2012, Donaire et al., 2009a, Donaire et al., 2009b, Fernandez-Blanco et al., 2012, Gaspard et al., 2014, Gotman et al., 1995, Grewal and Gotman, 2005, Halford, 2009, Han et al., 2011, Keogh and Cordes, 2007, Lee et al., 2009, Navakatikyan et al., 2006, Osorio et al., 1995, Osorio et al., 1998, Qu and Gotman, 1997, Saab and Gotman, 2005, Tzallas et al., 2009, Tzallas et al., 2012, Wilson and Emerson, 2002, Worrell et al., 2012). With further development, validation, and acceptance across several research groups, such algorithms may translate to practical clinical use as decision-support tools for physicians.
One approach to developing semi-automated pattern classification and decision-support tools for epilepsy data has been to implement evolutionary computation techniques to select, combine, or create measures (extracted features) that quantify the difference between interictal biomarkers (e.g., pathological gamma oscillations in iEEG, resting-state blood oxygenation changes in fMRI) and interictal background or basal activity (Burrell et al., 2007a, Smart et al., 2007, Smart et al., 2011). Scant research has been published on the application of evolutionary computation to interictal resting-state fMRI signals from epilepsy patients (Burrell et al., 2007b). Instead, pattern detection applications for ictal (seizure) activity dominate prior research involving evolutionary computation techniques applied to iEEG and EEG, mostly genetic algorithms (Haydari et al., 2011, Hsu and Yu, 2010, Ocak, 2008, Patnaik and Manyam, 2008, Rivero et al., 2013, Shen et al., 2013), genetic programming (Sotelo et al., 2013a, 2013b), and harmony search optimization (Gandhi et al., 2012, Zainuddin et al., 2013), although some projects have focused on spike detection applications (Haydari et al., 2011, Kinnear et al., 1999, Marchesi et al., 1997a, Shen et al., 2013). Additional studies have used evolutionary computation in other ways for epilepsy data (Bandarabadi et al., 2011, Firpi et al., 2005a, Harikumar et al., 2004, Rivero et al., 2013, Wei et al., 2010). Only a few groups apply evolutionary computation techniques to gamma oscillation pattern detection for encephalography (Firpi et al., 2007, Smart et al., 2007, Smart et al., 2011).
We have focused on pattern classification research for interictal rather than ictal events of interest for two reasons: (1) certain analyses of interictal brain signals can lead to diagnostic and prognostic information for identifying seizure onset zones equivalent to or complementary to certain analyses of ictal brain signals (Bettus et al., 2011, Crepon et al., 2010, Heers et al., 2014, Korzeniewska et al., 2014, Lu et al., 2014, Matsumoto et al., 2013, Spencer et al., 2008, Thornton et al., 2011, Valentin et al., 2014, Worrell and Gotman, 2011, Zhang et al., 2014); and (2) some patients do not have seizures during epilepsy monitoring, so using brain signals not dependent on seizures provides a means to offer some clinical analysis for patients rather than send them home without any diagnosis. Because detecting interictal biomarkers within iEEG or fMRI signals is not a trivial problem, especially given the signal-to-noise ratio of each brain signal modality, we have used evolutionary computation techniques to search for robust, optimal, albeit relatively complex and somewhat human-intractable, solutions. Commonly referenced alternative approaches for interictal biomarker detectors using iEEG signals still involve a human verification stage to discard numerous false-positive detections (Crepon et al., 2010, Gardner et al., 2007, Worrell et al., 2008) despite some algorithm developments (Zelmann et al., 2012). Such human involvement counteracts the main purpose of the semi-automated approach and, depending on the false-positive rate of the detection method, might be as laborious as marking the actual true-positive events without running the pattern classification.
On the other hand, we demonstrated via our prior work on interictal biomarker detection algorithms that – for at least our epilepsy brain signal data – one may gain higher pattern classification performance for features selected using an evolutionary computation method than for features selected by conventional or popular-in-literature methods for iEEG (Firpi et al., 2007, Smart et al., 2007, Smart et al., 2011) and for fMRI (Burrell et al., 2007a).
Evolutionary computation is a discipline comprising the study and development of evolution-based (Darwin, 1978) search optimization algorithms: for instance, an initial search space contains a population of organisms (possible optimal solutions) that stochastically undergoes mutation and recombination before survival selection (i.e., choosing the fittest organisms) and generational (iterative) production of a new population of organisms from the prior population (refining the possible optimal solutions) until evolution ceases (optimization convergence). Since its pioneering inception in the 1950s through 1960s (Baeck et al., 1997, Barricelli, 1957, Barricelli, 1962, Barricelli, 1963, Fogel, 1998, Fogel et al., 1968, Fraser, 1960, Turing, 1950), evolutionary computation has spawned numerous computer science techniques that today fall into five classes or dialects of research areas: (1) evolutionary programming (Fogel and Fogel, 1986, Fogel et al., 1968, Sebald and Fogel, 1994), (2) genetic algorithms (Booker et al., 1989, Holland, 1992a, Holland, 1992b, Holland, 1995), (3) evolution strategy (Schwefel, 1981, Schwefel, 1995), (4) genetic programming (Koza, 1989, Koza, 1992, Koza, 1994, Koza, 1996, Koza, 1997, Koza, 1999, Koza et al., 2006), and (5) swarm intelligence (Beni, 2005, Beni and Wang, 1989, Blum and Merkle, 2008, Bonabeau et al., 1999, Dorigo and Gambardella, 1997, Kennedy et al., 2001). In particular, the genetic programming (GP) methodology is a global search optimization procedure that heuristically develops possible solutions (programs) to a predefined problem statement using biological evolution concepts (i.e., mutation, crossover, and selection) (Koza, 1989, Koza, 1992, Koza, 1994).
The GP algorithm (see Algorithm 1) initializes a set (population) of solutions (individuals) of size P, with each individual representing a mathematical operation on the GP input in the form of a tree structure. It then uses an objective function to compute an index (fitness) for each individual and executes the evolutionary processes to create new populations that optimize the fitness, yielding the best individual. The evolutionary processes have many variations in implementation but share the same basic concepts: the selection stage chooses a subset of the current population (intermediate population) based upon individual fitness; the crossover stage creates new individuals using combinations of paired individuals from the intermediate population, forming a new population; the mutation stage introduces diversity into the new population by randomly altering the makeup of a subset of its individuals; and the survival stage simply selects the fittest individuals from the new population, creating a new initial population of size P for subsequent GP iterations (generations). The algorithm ends upon attaining a predefined number of generations or a predefined fitness value. From this final population of solutions, one may select the best (optimal) solution according to the chosen fitness function.
Algorithm 1. Pseudocode for GP algorithm.
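As an illustration only, and not a reproduction of Algorithm 1, the four stages described above can be sketched in Python. Everything here is an assumption made for the sketch: trees are nested tuples whose leaves are integer feature indices, the function set is limited to add, sub, and mul, fitness is the accuracy obtained when a positive tree output is read as the biomarker class, selection is a binary tournament, survival keeps the fittest P individuals from parents and offspring together, and `MAX_DEPTH` is a crude guard against bloat.

```python
import operator
import random

FUNCS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
MAX_DEPTH = 5  # crude bloat control (an assumption of this sketch)

def random_tree(n_features, depth_left, rng):
    """Grow a random tree; a leaf is the integer index of an input feature."""
    if depth_left == 0 or rng.random() < 0.3:
        return rng.randrange(n_features)
    name = rng.choice(sorted(FUNCS))
    return (name,
            random_tree(n_features, depth_left - 1, rng),
            random_tree(n_features, depth_left - 1, rng))

def depth(tree):
    return 0 if isinstance(tree, int) else 1 + max(depth(tree[1]), depth(tree[2]))

def evaluate(tree, x):
    if isinstance(tree, int):
        return x[tree]
    name, left, right = tree
    return FUNCS[name](evaluate(left, x), evaluate(right, x))

def fitness(tree, X, y):
    """Accuracy when a positive tree output is read as 'biomarker' (label 1)."""
    hits = sum((evaluate(tree, x) > 0) == bool(label) for x, label in zip(X, y))
    return hits / len(y)

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path lists child positions."""
    yield path, tree
    if not isinstance(tree, int):
        yield from subtrees(tree[1], path + (1,))
        yield from subtrees(tree[2], path + (2,))

def replace_at(tree, path, new):
    """Return a copy of `tree` with the node at `path` replaced by `new`."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace_at(parts[path[0]], path[1:], new)
    return tuple(parts)

def crossover(a, b, rng):
    """Swap a random subtree of `b` into a random position of `a`."""
    path, _ = rng.choice(list(subtrees(a)))
    _, donor = rng.choice(list(subtrees(b)))
    child = replace_at(a, path, donor)
    return child if depth(child) <= MAX_DEPTH else a

def mutate(tree, n_features, rng):
    """Replace a random subtree with a freshly grown one."""
    path, _ = rng.choice(list(subtrees(tree)))
    child = replace_at(tree, path, random_tree(n_features, 2, rng))
    return child if depth(child) <= MAX_DEPTH else tree

def run_gp(X, y, pop_size=20, generations=10, p_mut=0.2, seed=0):
    rng = random.Random(seed)
    n_features = len(X[0])
    score = lambda t: fitness(t, X, y)
    pop = [random_tree(n_features, 3, rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: binary tournaments build the intermediate population.
        parents = [max(rng.sample(pop, 2), key=score) for _ in range(pop_size)]
        # Crossover: pair neighbouring parents to form new individuals.
        children = [crossover(parents[i], parents[(i + 1) % pop_size], rng)
                    for i in range(pop_size)]
        # Mutation: randomly alter a subset of the new individuals.
        children = [mutate(c, n_features, rng) if rng.random() < p_mut else c
                    for c in children]
        # Survival: keep the fittest P individuals for the next generation.
        pop = sorted(pop + children, key=score, reverse=True)[:pop_size]
        if score(pop[0]) == 1.0:  # predefined fitness value reached
            break
    return pop[0], score(pop[0])
```

Calling `best, fit = run_gp(X, y)` on a list of feature vectors `X` and binary labels `y` returns the fittest tree and its accuracy. Because the search is stochastic, different seeds can return different trees, which is the variability that repeated trials are meant to expose.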
For pattern classification problems that quantitatively discriminate non-biomarker (e.g., basal or baseline activity) and biomarker (e.g., spikes, seizures, pathological gamma oscillations (PGOs), abnormal cerebral blood flow (CBF) activations) brain signals from epilepsy patients, GP has been used for both noninvasive and invasive electroencephalography (Fernández-Blanco et al., 2013, Firpi et al., 2005b, Firpi et al., 2005c, Firpi et al., 2006, Guo et al., 2011, Lopes, 2007, Marchesi et al., 1997b, Smart et al., 2007, Sotelo et al., 2013a, Sotelo et al., 2013b), MEG (Georgopoulos et al., 2009, Theofilatos et al., 2009), and fMRI (Burrell et al., 2007b). However, GP is not a feature selection algorithm. Technically, its use in this way is a mischaracterized application, where algorithmic issues such as bloat, fitness function definition, and choices for the terminals (features) and functions can substantially affect ‘feature selection’ results. Also, the GP algorithm can output practically inelegant solutions since theoretical parsimony is not a guaranteed effect (Kelly, 1995). These limitations accentuate the importance of examining whether GP-based feature-selection solutions demonstrate reproducible results and useful patterns for pattern classification of epilepsy data. Consequently, we investigated GP-based feature selection (i.e., implicit feature selection with the GP algorithm) for interictal resting-state brain recordings with a focus on two main questions regarding the computed feature subsets. Across patients, do selected subsets exhibit the same feature subsets, indicating universal measures, or different subsets, indicating unconventional case-by-case measures? Per patient, are the selected subsets similar if not the same in content, indicating consistency in solutions?
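One way to read "implicit feature selection" is that the features appearing as terminals of the best evolved tree form the selected subset. The following minimal sketch assumes a tree stored as nested tuples with string leaves; the feature names are hypothetical and are not taken from the study.

```python
def selected_features(tree):
    """Return the set of terminal (feature) names that appear in a tree."""
    if isinstance(tree, str):          # a leaf names one input feature
        return {tree}
    _function_name, *children = tree   # an inner node: (function, child, ...)
    return set().union(*(selected_features(child) for child in children))

# Hypothetical best individual from one GP trial.
best = ("add",
        ("mul", "line_length", "energy"),
        ("sub", "energy", "kurtosis"))

print(sorted(selected_features(best)))  # ['energy', 'kurtosis', 'line_length']
```

A feature used twice in the tree (here `energy`) counts once, so the subset records which measures were used, not how the tree combines them.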
Since the confidence interval concept embodies the computation of consistency or reliability in an estimated value or parameter set (Neyman, 1937), we investigated these two questions under the same aim: construct and evaluate confidence intervals for GP-based feature selection. Since a feature subset list is qualitative rather than quantitative data, we considered frequent itemset mining (FIM) as an approach for confidence interval construction.
Frequent itemset mining is the first stage in association rule learning, an established data-mining method to discover highly replicable information within a multitude of data (Agrawal et al., 1993, Agrawal et al., 1996, Agrawal and Srikant, 1994, Agrawal and Srikant, 1995, Rakesh and Ramakrishnan, 1994, Rakesh and Ramakrishnan, 1995, Rakesh and Ramakrishnan, 1998, Rakesh et al., 1993, Zaki, 2000). As its name implies, an FIM algorithm discovers (i.e., mines) frequently occurring itemsets (i.e., collections of data variables) for pattern observation. An event (item) represents some variable of interest in the data-mining framework. A set of items (itemset), sometimes called a transaction, is an observed combination of co-occurring events. A collection of multiple itemsets (database) is the input for FIM analysis to identify patterns. The support, s, of an itemset is the percentage of database transactions that contain it and indicates its regularity or frequency within the database; an itemset with n events whose support reaches a support threshold (or likelihood level), λ (i.e., s≥λ), is termed a λ-frequent n-itemset. A maximal (max) λ-frequent n-itemset is a λ-frequent itemset such that any (n+1)-itemset of which it is a subset has s<λ. Alternatively stated, a max λ-frequent itemset is a frequent itemset whose proper supersets are all infrequent (s<λ). It is important to note that, for a given support threshold λ, FIM may output multiple itemsets as max frequent n-itemsets, and these max itemsets may range in cardinality (e.g., 2-itemsets and 4-itemsets without 3-itemsets) (Fig. 3). As illustrated (Fig. 3), given a database (upper left), the FIM algorithm computes frequent itemsets (gray rectangles) and max-frequent itemsets (black rectangles) with the support threshold λ, while excluding sub-threshold trials within the database (white rectangles) from the final output results. The max-frequent itemset with the highest occurrence and largest size in the database represents the final solution (e.g., Fig. 1D).
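The definitions of support, λ-frequent itemsets, and max λ-frequent itemsets can be made concrete with a brute-force sketch. The toy database and the threshold of 0.5 are illustrative assumptions, and exhaustive enumeration is used only because the example is tiny; it is not how a practical FIM implementation searches.

```python
from itertools import combinations

def support(itemset, database):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= transaction for transaction in database) / len(database)

def frequent_itemsets(database, lam):
    """Map each itemset with support >= lam to its support (exhaustive search)."""
    items = sorted(set().union(*database))
    return {frozenset(c): support(c, database)
            for n in range(1, len(items) + 1)
            for c in combinations(items, n)
            if support(c, database) >= lam}

def maximal(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

# Toy database of five transactions over the items A-D (illustrative only).
database = [frozenset(t) for t in (
    {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"A", "D"}, {"B", "C", "D"},
)]

frequent = frequent_itemsets(database, lam=0.5)
for itemset in maximal(frequent):
    print(sorted(itemset), frequent[itemset])
```

With this database, {A, B} and {B, C} are the two max 0.5-frequent itemsets (support 0.6 each). {A, B, C} occurs in only two of the five transactions, so it falls below the threshold.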
Because numerous potentially coincident event combinations must be evaluated to identify at least one pattern in a database, data-mining algorithms such as FIM provide efficient computational execution in terms of memory usage, disk access, and computational burden to search for putative patterns, aiming to avoid spurious results. Among many different FIM implementations (Agrawal et al., 1996, Borgelt, 2005, Zaki, 2000), the APRIORI algorithm (see Algorithm 2) is likely the best known and most often used approach over decades (Bodon, 2003). Yet, there exist few applications of FIM to epilepsy data (Bourien et al., 2005, Bourien et al., 2004, Exarchos et al., 2006, Smart et al., 2012), and none apply FIM in the same manner that we present with this work.
Algorithm 2. Pseudocode for APRIORI FIM algorithm.
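For orientation, the level-wise idea behind APRIORI can be sketched as follows. This is a textbook-style sketch under stated assumptions, not a reproduction of Algorithm 2: each transaction is taken to be the feature subset returned by one feature-selection trial, and the feature labels `f1` to `f5` and the threshold of 0.5 are hypothetical.

```python
from itertools import combinations

def apriori(database, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n_transactions = len(database)

    def supp(candidate):
        return sum(candidate <= t for t in database) / n_transactions

    # Level 1: frequent single items.
    level = {frozenset([item]) for t in database for item in t}
    level = {c for c in level if supp(c) >= min_support}
    frequent, k = {}, 1
    while level:
        frequent.update({c: supp(c) for c in level})
        # Join: merge pairs of frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: a candidate survives only if all of its k-subsets are frequent,
        # because no superset of an infrequent itemset can be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Count support only for the surviving candidates.
        level = {c for c in candidates if supp(c) >= min_support}
        k += 1
    return frequent

# Six hypothetical trials, each yielding one selected feature subset.
trials = [frozenset(t) for t in (
    {"f1", "f2", "f3"}, {"f1", "f2"}, {"f1", "f2", "f4"},
    {"f2", "f3"}, {"f1", "f2", "f3"}, {"f1", "f5"},
)]

result = apriori(trials, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

In this example the candidate {f1, f2, f3} is never counted against the database: its subset {f1, f3} is infrequent, so the prune step discards it. Avoiding such counts is the source of the efficiency that the level-wise search provides over exhaustive enumeration.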
We present a framework that, by applying the APRIORI FIM algorithm after several GP feature-selection trials, essentially computes confidence intervals for GP-based feature selection used to categorize epileptic biomarkers (not seizures but interictal resting-state activity) and brain activity not considered epileptic biomarkers. This approach (Fig. 1) transforms stochastic results of the GP analysis into more deterministic results via FIM. In Section 2, we explain the approach details: Section 2.1 for the acquisition of the iEEG and fMRI signals (Fig. 1A); Section 2.2 for computation of signal measures via a feature extraction process (e.g., Fig. 1B); Section 2.3 for selection of a subset of these measures using GP, a process that we repeated in multiple trials for application of FIM (Figs. 1C and 2); Section 2.4 for recognition of patterns in repeatedly selected measures (features) via FIM (Fig. 1D); and Section 2.5 for our three main computational experiments to apply and validate the framework.
Signal acquisition
We analyzed fMRI collected from one patient group and iEEG collected from another patient group. For each de-identified dataset, the Institutional Review Boards at the Georgia Institute of Technology, Emory University, and the University of Pennsylvania approved data analysis. For each dataset, a board-certified clinician annotated “gold standard” epileptic biomarkers, which provided classification labels (i.e., biomarker, non-biomarker) for the in silico experiments.
We retrospectively analyzed
Observed patterns with FIM features
In our first experiment, we observed whether any pattern resulted from mining the GP-selected features across patients and generations to evaluate the reproducibility and subset size of the GP-based feature-selection. We computed the max frequent itemsets, or the most frequently occurring feature subsets among the 100 trials, for each patient, biological data modality, and number of GP iterations (Table 4, Table 5). For each max frequent itemset, the first item occurred most and the last item
Discussion
Applying FIM to repeated GP-based feature selection, we found patterns in the cardinality of the selected feature subsets, reproducibility of the subsets, and correlations between infrequently selected measures as well as a validation of patient-specific feature subsets.
Conclusions
We developed a method to categorize biomarker and non-biomarker epileptic activity in iEEG and fMRI signals from epilepsy patients by combining GP and FIM techniques. We used FIM to compute qualitative confidence intervals for features selected via a GP algorithm. We observed within-subject consistency and across-subject variability for GP-based feature selection for both fMRI and iEEG signals. We concluded that the problem of detecting interictal biomarkers for each iEEG and fMRI signal
Acknowledgments
Grant funds from the United Negro College Fund Special Programs Corporation NASA Harriett G. Jenkins Pre-doctoral Fellowship Program to Dr. Smart and the National Institute of Neurological Disorders and Stroke (1R01NS048598-01A2) to both Drs. Burrell and Smart provided partial research support for this work. The authors thank the physicians from the Center for Functional Neuroimaging and Department of Neurology at the University of Pennsylvania, the Children’s Hospital of Philadelphia, and the
References (138)
- et al. (1989). Classifier systems and genetic algorithms. Artif. Intell.
- et al. (2005). A method to identify reproducible subsets of co-activated structures during interictal spikes. Application to intracerebral EEG in temporal lobe epilepsy. Clin. Neurophysiol.
- et al. (2009). Identifying the structures involved in seizure generation using sequential analysis of ictal-fMRI data. NeuroImage.
- et al. (2012). Automatic seizure detection based on star graph topological indices. J. Neurosci. Methods.
- (1995). Magnetic resonance imaging and epilepsy: neurosurgical decision making. Magn. Reson. Imaging.
- et al. (2012). Discrete harmony search based expert model for epileptic seizure detection in electroencephalography. Expert Syst. Appl.
- et al. (2007). Human and automated detection of high-frequency oscillations in clinical intracranial EEG recordings. Clin. Neurophysiol.
- et al. (2014). Automatic detection of prominent interictal spikes in intracranial EEG: validation of an algorithm and relationship to the seizure onset zone. Clin. Neurophysiol.
- et al. (2005). An automatic warning system for epileptic seizures recorded on intracerebral EEGs. Clin. Neurophysiol.
- et al. (2011). Automatic feature extraction using genetic programming: an application to epileptic EEG classification. Expert Syst. Appl.
- Computerized epileptiform transient detection in the scalp electroencephalogram: obstacles to progress and the example of computerized ECG interpretation. Clin. Neurophysiol.
- Features and futures: seizure detection in partial epilepsies. Neurosurg. Clin. N. Am.
- Detection of seizures in EEG using subband nonlinear parameters and genetic algorithm. Comput. Biol. Med.
- Genetic programming for epileptic pattern recognition in electroencephalographic signals. Appl. Soft Comput.
- Seizure detection algorithm for neonates based on wave-sequence analysis. Clin. Neurophysiol.
- Optimal classification of epileptic seizures in EEG using wavelet analysis and genetic algorithm. Signal Process.
- Intracranial EEG power and metabolism in human epilepsy. Epilepsy Res.
- Epileptic EEG detection using neural networks and post-classification. Comput. Methods Programs Biomed.
- Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data.
- Fast discovery of association rules, advances in knowledge discovery and data mining. Am. Assoc. Artif. Intell.
- Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases.
- Mining sequential patterns. In: Proceedings of the Eleventh International Conference on Data Engineering.
- Automatic seizure detection in SEEG using high frequency activities in wavelet domain. Med. Eng. Phys.
- Handbook of Evolutionary Computation.
- Wepilet, optimal orthogonal wavelets for epileptic seizure prediction with one single surface channel. Conf. Proc. IEEE Eng. Med. Biol. Soc.
- Numerical testing of evolution theories. Acta Biotheor.
- Numerical testing of evolution theories. Part II. Preliminary tests of performance, symbiogenesis and terrestrial life. Acta Biotheor.
- From swarm intelligence to swarm robotics. In: Proceedings of the 2004 International Conference on Swarm Robotics.
- Interictal functional connectivity of human epileptic networks assessed by intracerebral EEG and BOLD signal fluctuations. PLoS One.
- Swarm Intelligence: Introduction and Applications.
- A trie-based APRIORI implementation for mining frequent item sequences. In: Proceedings of the First International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations.
- Swarm Intelligence: From Natural to Artificial Systems.
- An implementation of the FP-growth algorithm. In: Proceedings of the First International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations.
- Mining reproducible activation patterns in epileptic intracerebral EEG signals: application to interictal activity. IEEE Trans. Biomed. Eng.
- High-frequency oscillations in human brain. Hippocampus.
- High-frequency oscillations in epileptic brain. Curr. Opin. Neurol.
- Graphical Methods for Data Analysis.
- Mapping interictal oscillations greater than 200 Hz recorded with intracranial macroelectrodes in human epilepsy. Brain.
- The Origin of Species by Means of Natural Selection.
- Adapting operator probabilities in genetic algorithms. In: Proceedings of the Third International Conference on Genetic Algorithms.
- Sequential analysis of fMRI images: a new approach to study human epileptic networks. Epilepsia.
- Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Trans. Evol. Comput.
- Pattern Classification.
- Surgical Treatment of the Epilepsies. New York, NY.
- Clinical neurophysiology, neuroimaging, and the surgical treatment of epilepsy. Curr. Opin. Neurol. Neurosurg.