Elsevier

Applied Soft Computing

Volume 7, Issue 1, January 2007, Pages 471-479

Genetic programming neural networks: A powerful bioinformatics tool for human genetics

https://doi.org/10.1016/j.asoc.2006.01.013

Abstract

The identification of genes that influence the risk of common, complex disease primarily through interactions with other genes and environmental factors remains a statistical and computational challenge in genetic epidemiology. This challenge is partly due to the limitations of parametric statistical methods for detecting genetic effects that are dependent solely or partially on interactions. We have previously introduced a genetic programming neural network (GPNN) as a method for optimizing the architecture of a neural network to improve the identification of genetic and gene–environment combinations associated with disease risk. Previous empirical studies suggest GPNN has excellent power for identifying gene–gene and gene–environment interactions. The goal of this study was to compare the power of GPNN to that of stepwise logistic regression (SLR) and classification and regression trees (CART) for identifying gene–gene and gene–environment interactions. SLR and CART are standard methods of analysis for genetic association studies. Using simulated data, we show that GPNN has higher power to identify gene–gene and gene–environment interactions than SLR and CART. These results indicate that GPNN may be a useful pattern recognition approach for detecting gene–gene and gene–environment interactions in studies of human disease.

Introduction

One goal of genetic epidemiology is to identify genes associated with common, complex multifactorial diseases. Success in achieving this goal will depend on a research strategy that recognizes and addresses the importance of interactions among multiple genetic and environmental factors in the etiology of diseases such as essential hypertension [8], [15]. One traditional approach to modeling the relationship between discrete predictors such as genotypes and discrete clinical outcomes is logistic regression [7]. Logistic regression is a parametric statistical approach for relating one or more independent or explanatory variables (e.g. genotypes) to a dependent or outcome variable (e.g. disease status) that follows a binomial distribution. However, as reviewed by Moore and Williams [15], the number of possible interaction terms grows exponentially as each additional main effect is included in the logistic regression model. Thus, logistic regression is limited in its ability to deal with interactions involving many factors. Having too many independent variables in relation to the number of observed outcome events is a well-recognized problem [3], [17] and is an example of the curse of dimensionality [2].
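As a rough illustration of the growth described above, the following Python sketch counts the possible interaction terms among a set of main effects. It assumes a full-factorial model in which every subset of two or more main effects contributes exactly one product term; the function name `interaction_term_count` is hypothetical and not taken from the paper. Coding each genotype with several dummy variables would make the counts larger still.

```python
from math import comb


def interaction_term_count(n_main_effects: int) -> int:
    """Count interaction terms of order 2 up to n among n main effects.

    Assumption: one product term per subset of size >= 2, which gives
    2**n - n - 1 terms in total.
    """
    return sum(comb(n_main_effects, k) for k in range(2, n_main_effects + 1))


if __name__ == "__main__":
    # Show how quickly the number of candidate terms grows.
    for n in (2, 5, 10, 20):
        print(f"{n} main effects: {interaction_term_count(n)} interaction terms")
```

Under this assumption the count roughly doubles with each added main effect, which is why the number of candidate terms quickly outstrips the number of observed outcome events.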

In response to this limitation, Ritchie et al. [19] developed a genetic programming optimized neural network (GPNN). Neural networks (NN) have been utilized in genetic epidemiology, but with little success. A potential weakness of previous NN applications is poor specification of the NN architecture. GPNN was developed in an attempt to improve upon the trial-and-error process of choosing an optimal architecture for a pure feed-forward back propagation neural network. The GPNN optimizes the inputs selected from a larger pool of variables, the weights, and the connectivity of the network, including the number of hidden layers and the number of nodes in the hidden layer. Thus, the algorithm attempts to generate an optimal neural network architecture for a given data set. This is an advantage over the traditional back propagation NN, in which the inputs and architecture are pre-specified and only the weights are optimized.
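To make the idea of an evolvable architecture more concrete, the sketch below encodes a small feed-forward network as a tree whose leaves are input variables and whose internal nodes are weights and neurons. It is not the implementation of Ritchie et al.; the class names (`Input`, `Weighted`, `Neuron`), the sigmoid activation, the variable names `SNP1` to `SNP7`, and all weight values are assumptions made only for illustration. A genetic programming search would alter such trees (swapping subtrees, changing weights, adding or removing inputs), which is how inputs, weights, and connectivity could all vary together.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Set, Union


@dataclass
class Input:
    name: str  # a variable chosen from the larger pool of candidates


@dataclass
class Weighted:
    weight: float
    child: "Node"  # the weighted signal comes from an input or a neuron


@dataclass
class Neuron:
    inputs: List[Weighted]  # weighted connections feeding this neuron


Node = Union[Input, Weighted, Neuron]


def evaluate(node: Node, sample: Dict[str, float]) -> float:
    """Recursively compute the output of the tree-encoded network."""
    if isinstance(node, Input):
        return sample[node.name]
    if isinstance(node, Weighted):
        return node.weight * evaluate(node.child, sample)
    # Neuron: sigmoid of the sum of its weighted inputs (assumed activation).
    total = sum(evaluate(w, sample) for w in node.inputs)
    return 1.0 / (1.0 + math.exp(-total))


def used_inputs(node: Node) -> Set[str]:
    """Collect the variables that actually appear in the tree."""
    if isinstance(node, Input):
        return {node.name}
    if isinstance(node, Weighted):
        return used_inputs(node.child)
    return set().union(*(used_inputs(w) for w in node.inputs))


# One hypothetical candidate: a single hidden neuron reading two of seven SNPs.
hidden = Neuron([Weighted(1.5, Input("SNP3")), Weighted(-2.0, Input("SNP7"))])
network = Neuron([Weighted(0.8, hidden)])

if __name__ == "__main__":
    sample = {f"SNP{i}": 0.0 for i in range(1, 8)}
    sample["SNP3"] = 1.0
    print("inputs used:", sorted(used_inputs(network)))
    print("network output:", evaluate(network, sample))
```

Because the tree itself determines which variables are read, variable selection falls out of the representation: any variable absent from the tree is simply ignored by that candidate network.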

Although previous empirical studies suggest GPNN has excellent power for identifying gene–gene interactions, a comparison of GPNN with traditional statistical methods has not yet been performed. The goal of the present study was to compare the power of GPNN to that of stepwise logistic regression (SLR) and classification and regression trees (CART) for identifying gene–gene and gene–environment interactions using data simulated from a variety of interaction models. This study is motivated by the number of studies in human genetics where SLR and CART have been applied. We wanted to determine if GPNN is more powerful than the status quo in the field. We find that GPNN has higher power to detect gene–gene and gene–environment interactions than stepwise logistic regression and classification and regression trees. These results demonstrate that GPNN may be an important pattern recognition tool for future studies in genetic epidemiology.

Section snippets

A genetic programming neural network approach

GPNN was developed to improve upon the trial-and-error process of choosing an optimal architecture for a pure feed-forward back propagation neural network (NN) [19]. Optimization of NN architecture using genetic programming (GP) was first proposed by Koza and Rice [9]. The goal of this approach is to use the evolutionary features of genetic programming to evolve the architecture of an NN. The use of binary expression trees allows for the flexibility of the GP to evolve a tree-like structure that

Results

The results of this study are shown in Table 11, Table 12, and Fig. 3, Fig. 4. Here, we list the 20 epistasis models sorted by number of genes, allele frequency, and heritability along the vertical axis. Table 11 and Fig. 3 report the power results of the three methodologies. Here, power refers to the method correctly identifying the functional genes. SLR has no power to detect the functional genes in any of the models studied. These results led to some skepticism that logistic regression (LR)

Discussion

Identifying disease susceptibility genes associated with common, complex, multifactorial diseases is a major challenge for genetic epidemiology. One of the dominating factors in this challenge is the difficulty in detecting gene–gene and gene–environment interactions with currently available statistical approaches. To deal with this issue, new statistical approaches have been developed such as the GPNN. GPNN has been shown to have higher power than a back propagation NN using simulated data

Acknowledgements

This work was supported by National Institutes of Health grants HL65234, HL65962, GM31304, AG19085, AG20135, AI59694, HD047447, and LM007450.

References (22)

  • S.L.R. Kardia. Context-dependent genetic effects in hypertension. Curr. Hypertens. Rep. (2000)