Adapting genetic regulatory models by genetic programming
Introduction
Bioinformatics is a field driven by the rapid accumulation of molecular biology data. Until recently, the flood of data becoming available mainly consisted of DNA and amino acid sequences. With the rapid advances made in sequencing technology, it became routine to sequence genes, and feasible to sequence even entire genomes. For many years, therefore, bioinformatics was focused on dealing with these sequence collections, and developing computer algorithms for tasks such as finding coding regions and exons in DNA, analysing evolutionary relationships by aligning sequences and identifying similarities, trying to predict protein 2D and 3D structure from amino acid sequence data, etc.
Today, however, bioinformatics has undergone a rapid, dramatic, and fundamental change of focus. The reason for this is that microarray technology, popularly referred to as “gene chips”, has become a mature technology, and it has become routine for molecular biologists to collect expression data for thousands of genes under varying conditions. Studying changes in the expression levels of genes in response to environmental change, medication, exposure to toxins, or other stimuli, has rapidly become one of the standard techniques for gaining insight into the function of the proteins encoded by these genes. In addition, microarray technology also opens up the possibility of understanding not only which genes are involved in the response to particular stimuli, but also the networks involved in regulating the expression of these genes.
This paper will focus on one of the exciting possibilities opened up by the advent of microarray technology, namely to utilize the availability of gene expression data to infer regulatory relationships between genes. If a gene has a regulatory impact on another gene, we can reasonably assume that, at least in some cases, this should be detectable from the expression data. It is now, therefore, of urgent interest to explore the possibility of developing methods for inferring large networks of gene interactions from gene expression data.
Kohane et al. (2003) point out that standard statistical techniques for elucidating relationships between multiple variables do not hold up well when applied to gene expression data sets because of their underdetermined nature. Such data sets contain measurements of very high dimensionality (on the order of thousands of variables) but only for a small number of cases (on the order of tens to hundreds), which means that multiple models fit the data equally well and that additional knowledge of the learning domain is required to resolve the ambiguities. The modeling of genomic data sets therefore requires new approaches.
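The underdetermination problem can be illustrated with a small sketch. It is not taken from the paper: the function names, the encoding of expression changes as +1/-1 per condition, and the random data are all assumptions made for illustration. With many candidate genes and few conditions, many candidates explain a target gene's profile equally well, either as a possible activator or as a possible repressor.

```python
import random


def consistent_regulators(target, candidates):
    """Return (name, sign) for every candidate whose profile explains the target.

    Profiles are tuples of +1/-1 (expression up/down per condition).
    sign=+1: the candidate moves with the target (possible activator).
    sign=-1: the candidate moves against the target (possible repressor).
    """
    matches = []
    for name, profile in candidates.items():
        if profile == target:
            matches.append((name, +1))
        elif all(p == -t for p, t in zip(profile, target)):
            matches.append((name, -1))
    return matches


def random_profiles(n_genes, n_conditions, seed=0):
    """Hypothetical data: independent random up/down profiles per gene."""
    rng = random.Random(seed)
    return {
        f"g{i}": tuple(rng.choice((-1, 1)) for _ in range(n_conditions))
        for i in range(n_genes)
    }


if __name__ == "__main__":
    target = (1, -1, 1, 1, -1, 1)  # hypothetical target gene, 6 conditions
    candidates = random_profiles(n_genes=2000, n_conditions=6)
    matches = consistent_regulators(target, candidates)
    # With 2000 random genes and 6 conditions, roughly 2000 * 2 / 2**6
    # candidates are expected to fit perfectly by chance alone.
    print(len(matches), "candidate regulators fit the target equally well")
```

Under these assumptions the data alone cannot single out one regulator, which is the ambiguity that additional domain knowledge is needed to resolve.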
Due to the underdetermined nature of expression data, inferring regulatory models without any bias towards plausible models is often not applicable to real-world data. Also, the size of the networks that currently can be inferred by such techniques seems far too small. We think that these issues call for an increased use of expert knowledge in the discovery of regulatory models as well as a preference for qualitative models over quantitative ones. The approach we have adopted aims at achieving that in an interactive environment that will let experts repeatedly state qualitative regulatory models, evaluate how the models fit the expression data, specify constraints on the search for revised models, search for revised models, and select revisions that they find plausible. Similar approaches have also been reported in Iba and Mimura (2002), and Shrager et al. (2002).
Section snippets
Modeling gene regulation
In selecting the type of regulatory model to fit to the expression data we conclude that qualitative models would be an appropriate choice for the reasons outlined above. Our regulatory models are qualitative in that they only specify directions of influence in a non-recurrent network of genes. Since the real biological networks that we model are believed to be highly recurrent at the lowest level of abstraction, our models only aim to explain highly abstract properties of those networks. In
Optimizing regulatory models
The method we have chosen to adapt the regulatory models uses an evolutionary algorithm (EA) to improve the models according to a quality measure. This allows experts to revise one or multiple working models through the seeding of the population. Domain knowledge can also be incorporated in the design of the evaluation function, representation, and variation operators.
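As a rough illustration of seeding, the sketch below evolves toy qualitative models with a minimal mutation-only EA. It is not the authors' GP system: the model encoding (each gene has at most one signed regulator with a lower index, which keeps the network non-recurrent), the fitness definition, and all names and parameters are assumptions chosen to keep the example short.

```python
import random


def random_gene(i, rng):
    """Gene i is unregulated (None) or has one (regulator, sign), regulator < i."""
    if rng.random() < 0.3:
        return None
    return (rng.randrange(i), rng.choice((-1, 1)))


def random_model(n_genes, rng):
    """Random acyclic model; gene 0 is always an unregulated input."""
    return [None] + [random_gene(i, rng) for i in range(1, n_genes)]


def fitness(model, data):
    """Assumed quality measure: number of regulated genes whose observed
    up/down change equals sign * regulator change in every condition."""
    score = 0
    for i, entry in enumerate(model):
        if entry is None:
            continue
        reg, sign = entry
        if all(row[i] == sign * row[reg] for row in data):
            score += 1
    return score


def mutate(model, rng):
    """Re-draw the regulator of one randomly chosen gene (never gene 0)."""
    child = list(model)
    i = rng.randrange(1, len(model))
    child[i] = random_gene(i, rng)
    return child


def evolve(data, seeds=(), pop_size=30, generations=50, seed=0):
    """Seed the population with expert models, fill the rest randomly,
    then apply binary tournament selection, mutation, and elitism."""
    rng = random.Random(seed)
    n_genes = len(data[0])
    population = [list(s) for s in seeds][:pop_size]
    while len(population) < pop_size:
        population.append(random_model(n_genes, rng))
    best = max(population, key=lambda m: fitness(m, data))
    for _ in range(generations):
        next_pop = [best]  # elitism: the best model always survives
        while len(next_pop) < pop_size:
            a, b = rng.sample(population, 2)
            parent = a if fitness(a, data) >= fitness(b, data) else b
            next_pop.append(mutate(parent, rng))
        population = next_pop
        best = max(population, key=lambda m: fitness(m, data))
    return best
```

Because the seeds enter the initial population and the best individual is always carried over, the returned model is never worse than the best expert-supplied model under this assumed fitness.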
All EAs require a fitness function as a solution quality measure. In our case the quality of a solution is given by its
GP design issues
The GP system we used is based on ECJ 9 (Luke, 2002). Unless explicitly stated below, methods and parameters of the system are those that come by default in ECJ 9.
Experiments
We conducted a series of experiments to evaluate our methods. Although the methods were designed to allow experts to revise working models through the seeding of the initial population, we decided to evaluate their ability to infer regulatory models without being provided with such models before evaluating their model revising ability. The initial population was therefore initialized with small random programs, which is common practice in GP. For the ease of evaluation, models were fitted to
Results
In the first experiment we tried to infer a 10 gene network. Fig. 4 shows the best and average fitness values of the population for 100 generations averaged over 10 runs. The error bars show the area in which the average is located with 95% confidence assuming a t-distribution. In this experiment an individual whose network fits the data perfectly would receive a fitness value of 10. In the most successful run, such an individual was found in generation 19. Fig. 5 shows that the target and best
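For reference, error bars of the kind described for Fig. 4 (a 95% confidence interval for a mean over 10 runs, assuming a t-distribution) can be computed as in the following sketch. The function name and the input values are hypothetical and are not the paper's data; the critical value 2.262 is the standard two-sided 95% value of Student's t with 9 degrees of freedom.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom,
# i.e. the value that applies to a sample of exactly 10 runs.
T_CRIT_95_DF9 = 2.262


def mean_ci95_ten_runs(values):
    """Return (low, high) bounds of the 95% confidence interval of the mean."""
    if len(values) != 10:
        raise ValueError("this sketch hard-codes the t value for 10 runs")
    mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    half_width = T_CRIT_95_DF9 * sem
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    # Hypothetical best-fitness values from 10 runs at one generation.
    runs = [8, 9, 10, 9, 8, 10, 9, 9, 10, 8]
    low, high = mean_ci95_ten_runs(runs)
    print(f"mean fitness lies in [{low:.2f}, {high:.2f}] with 95% confidence")
```

For a different number of runs the critical value would have to be replaced by the one for the corresponding degrees of freedom.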
Discussion
The best solutions found in the experiments had fitness values amounting to 100, 92, 80, 71, and 58% of the optimal values, in inferring networks with 10, 20, 40, 80, and 160 genes, respectively. To see how our methods' inference capabilities scale with the number of genes of the target networks, a more reliable measure is the percentage of the averaged best fitness in the final generation compared to the fitness of a perfect individual. Applying this measure to our results yields the values
References (19)
- et al., 2002. Inference of a gene regulatory network by means of interactive evolutionary computing. Inf. Sci.
- et al., 1999. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pac. Symp. Biocomput.
- Ando, S., Iba, H., 2001. Inference of gene regulatory model by genetic algorithms. In: Proceedings of the 2001 IEEE...
- et al., 1997. The hardwiring of development: organization and function of genomic regulatory systems. Development.
- Banzhaf, W., Nordin, P., Keller, R., Francone, F., 1998. Genetic Programming: An Introduction. Morgan Kaufmann...
- et al., 2000. Using Bayesian networks to analyze expression data. J. Comput. Biol.
- Gruau, F., 1992. Genetic synthesis of Boolean neural networks with a cell rewriting developmental process. In: Whitley,...
- et al., 2002. Making sense of microarray data distributions. Bioinformatics.
- Kohane, I.S., Kho, A.T., Butte, A.J., 2003. Microarrays for an Integrative Genomics. MIT...
Cited by (19)
Inferring gene regulatory networks with hybrid of multi-agent genetic algorithm and random forests based on fuzzy cognitive maps
2018, Applied Soft Computing Journal
Citation excerpt: For example, Ramteke et al. [37] used a real-coded genetic algorithm (GA) to enhance the performance of genetic algorithm, which was termed as simulated binary jumping gene. Eriksson et al. [10] proposed genetic programming for inferring discrete GRNs. Chao et al. [7] used GA to search feed forward regulatory genes, which was based on the recurrent neural network model.
Constructing gene regulatory networks from microarray data using GA/PSO with DTW
2012, Applied Soft Computing Journal
Citation excerpt: For example, Chan et al. used three computational intelligence methods including least angle regression (LARS), expectation maximization (EM) with Kalman filter (KF) and evolving fuzzy neural network (EFuNN) [15] to infer GRNs. Eriksson and Olsson inferred GRNs using genetic programming [16]. Tian proposed a stochastic model which is based on noise of the microarray experiments to predict GRNs [17].