Genetic programming neural networks: A powerful bioinformatics tool for human genetics
Introduction
One goal of genetic epidemiology is to identify genes associated with common, complex multifactorial diseases. Success in achieving this goal will depend on a research strategy that recognizes and addresses the importance of interactions among multiple genetic and environmental factors in the etiology of diseases such as essential hypertension [8], [15]. One traditional approach to modeling the relationship between discrete predictors such as genotypes and discrete clinical outcomes is logistic regression [7]. Logistic regression is a parametric statistical approach for relating one or more independent or explanatory variables (e.g. genotypes) to a dependent or outcome variable (e.g. disease status) that follows a binomial distribution. However, as reviewed by Moore and Williams [15], the number of possible interaction terms grows exponentially as each additional main effect is included in the logistic regression model. Thus, logistic regression is limited in its ability to deal with interactions involving many factors. Having too many independent variables in relation to the number of observed outcome events is a well-recognized problem [3], [17] and is an example of the curse of dimensionality [2].
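The growth in interaction terms described above can be made concrete with a short calculation. The sketch below assumes a model that may include every interaction of order two or higher among k main effects, which gives 2^k - k - 1 possible terms; the function name `interaction_term_count` is illustrative and does not come from the cited work.

```python
from math import comb


def interaction_term_count(k: int) -> int:
    """Count the possible interaction terms (orders 2 through k) among k main effects."""
    # One term for every subset of two or more main effects.
    return sum(comb(k, order) for order in range(2, k + 1))


if __name__ == "__main__":
    for k in (2, 5, 10, 20):
        # The count equals 2**k - k - 1, so it roughly doubles with each added main effect.
        print(f"{k} main effects: {interaction_term_count(k)} possible interaction terms")
```

Even at 20 main effects the count exceeds one million, far more than the number of outcome events available in a typical study.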
In response to this limitation, Ritchie et al. [19] developed a genetic programming optimized neural network (GPNN). Neural networks (NN) have been utilized in genetic epidemiology, but with little success. A potential weakness of the previous NN applications is the poor specification of the NN architecture. GPNN was developed in an attempt to improve upon the trial-and-error process of choosing an optimal architecture for a pure feed-forward back propagation neural network. The GPNN optimizes the inputs from a larger pool of variables, the weights, and the connectivity of the network, including the number of hidden layers and the number of nodes in the hidden layer. Thus, the algorithm attempts to generate an optimal neural network architecture for a given data set. This is an advantage over the traditional back propagation NN, in which the inputs and architecture are pre-specified and only the weights are optimized.
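To illustrate what it means for the inputs, weights, and connectivity to all be subject to optimization, the following minimal sketch encodes a small feed-forward network as a tree. This is a hypothetical representation for illustration only, not the encoding used by Ritchie et al. [19]; the class names `Input`, `Node`, and `Connection`, the sigmoid activation, and the example weights are all assumptions, and the genetic programming search itself (selection, crossover, mutation) is not shown.

```python
import math
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass
class Input:
    index: int  # which variable from the larger pool feeds this leaf


@dataclass
class Node:
    children: List["Connection"]  # incoming connections define the connectivity


@dataclass
class Connection:
    weight: float
    source: Union[Input, Node]


def evaluate(unit: Union[Input, Node], x: List[float]) -> float:
    """Evaluate the tree-encoded network on one observation x."""
    if isinstance(unit, Input):
        return x[unit.index]
    total = sum(c.weight * evaluate(c.source, x) for c in unit.children)
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid activation (assumed)


def selected_inputs(unit: Union[Input, Node]) -> Set[int]:
    """Return the indices of the variables the network actually uses."""
    if isinstance(unit, Input):
        return {unit.index}
    used: Set[int] = set()
    for c in unit.children:
        used |= selected_inputs(c.source)
    return used


# A candidate network that uses only variables 0 and 3 out of a pool of five.
hidden = Node([Connection(1.5, Input(0)), Connection(-2.0, Input(3))])
output = Node([Connection(1.0, hidden), Connection(0.5, Input(0))])

if __name__ == "__main__":
    observation = [1.0, 0.0, 2.0, 1.0, 0.0]
    print(sorted(selected_inputs(output)), evaluate(output, observation))
```

Because the leaves, the weights, and the tree shape are all part of the same structure, a search procedure can change which variables are used and how nodes are connected, whereas back propagation alone adjusts only the weights of a fixed structure.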
Although previous empirical studies suggest GPNN has excellent power for identifying gene–gene interactions, a comparison of GPNN with traditional statistical methods has not yet been performed. The goal of the present study was to compare the power of GPNN to that of stepwise logistic regression (SLR) and classification and regression trees (CART) for identifying gene–gene and gene–environment interactions using data simulated from a variety of interaction models. This study is motivated by the number of studies in human genetics where SLR and CART have been applied. We wanted to determine if GPNN is more powerful than the status quo in the field. We find that GPNN has higher power to detect gene–gene and gene–environment interactions than stepwise logistic regression and classification and regression trees. These results demonstrate that GPNN may be an important pattern recognition tool for future studies in genetic epidemiology.
A genetic programming neural network approach
GPNN was developed to improve upon the trial-and-error process of choosing an optimal architecture for a pure feed-forward back propagation neural network (NN) [19]. Optimization of NN architecture using genetic programming (GP) was first proposed by Koza and Rice [9]. The goal of this approach is to use the evolutionary features of genetic programming to evolve the architecture of an NN. The use of binary expression trees allows for the flexibility of the GP to evolve a tree-like structure that
Results
The results of this study are shown in Tables 11 and 12 and Figs. 3 and 4. Here, we list the 20 epistasis models sorted by number of genes, allele frequency, and heritability along the vertical axis. Table 11 and Fig. 3 report the power results of the three methodologies. Here, power refers to the method correctly identifying the functional genes. SLR has no power to detect the functional genes in any of the models studied. These results led to some skepticism that logistic regression (LR)
Discussion
Identifying disease susceptibility genes associated with common, complex multifactorial diseases is a major challenge for genetic epidemiology. One of the dominant factors in this challenge is the difficulty of detecting gene–gene and gene–environment interactions with currently available statistical approaches. To deal with this issue, new statistical approaches such as GPNN have been developed. GPNN has been shown to have higher power than a back propagation NN using simulated data
Acknowledgements
This work was supported by National Institutes of Health grants HL65234, HL65962, GM31304, AG19085, AG20135, AI59694, HD047447, and LM007450.
References (22)
- et al. A perspective on epistasis limits of models displaying no main effect. Am. J. Hum. Genet. (2002)
- et al. Genetic epidemiology of multistage carcinogenesis. Mutat. Res. (2001)
- et al. Routine discovery of high-order epistasis models for computational studies in human genetics. Appl. Soft Comput. (2004)
- et al. A simulation study of the number of events per variable in logistic regression analysis. J. Clin. Epidemiol. (1996)
- et al. Multifactor dimensionality reduction reveals high-order interactions among estrogen metabolism genes in sporadic breast cancer. Am. J. Hum. Genet. (2001)
- et al. Non-familial Alzheimer's disease is mainly due to genetic factors. J. Alzheimers Dis. (2002)
- Adaptive Control Processes (1961)
- et al. The risk of determining risk with multivariable models. Ann. Int. Med. (1996)
- et al. Pattern Classification (2000)
- et al. Applied Logistic Regression (2000)