Elsevier

Neurocomputing

Volume 73, Issues 4–6, January 2010, Pages 633-638

Structural analysis of regulatory DNA sequences using grammar inference and Support Vector Machine

https://doi.org/10.1016/j.neucom.2009.09.018

Abstract

Regulatory DNA sequences such as promoters or splicing sites control gene expression and are important for successful gene prediction. Such sequences can be recognized by certain patterns or motifs that are conserved within a species. These patterns have many exceptions, which makes the structural analysis of regulatory sequences a complex problem. Grammar rules can be used for describing the structure of regulatory sequences; however, the manual derivation of such rules is not trivial. In this paper, stochastic L-grammar rules are derived automatically from positive examples and counterexamples of regulatory sequences using genetic programming techniques. The fitness of grammar rules is evaluated using a Support Vector Machine (SVM) classifier. The SVM is trained on known sequences to obtain a discriminating function, which is then used to evaluate a candidate grammar ruleset by determining the percentage of generated sequences that are classified correctly. The combination of SVM and grammar rule inference can mitigate the lack of structural insight in machine learning approaches such as SVM.

Introduction

Recognition of specific functionally important DNA sequence fragments such as promoters (short sequences that precede the beginnings of genes) or splice-junction sites (boundary points between exons and introns where splicing occurs) is one of the most important problems in bioinformatics. Common sequence analysis methods such as pattern search and sequence alignment using dynamic programming (Needleman–Wunsch, Smith–Waterman algorithms) and substitution-matrix based methods (PAM, BLOSUM) cannot solve this problem because of noisy data and large variability of consensus sequences across different species. Machine learning methods can be used for sequence classification, because these methods can learn useful descriptions of genetic concepts from data instances only rather than from explicit definitions.

Modern sequence recognition tools use machine learning techniques such as Naive Bayes, Decision Trees, Hidden Markov Models, Neural Networks or Support Vector Machine (SVM) [1]. These techniques achieve up to 98% accuracy [2]; however, they do not provide any insight into the internal structure of the analyzed sequences.

Many types of DNA sequences have been shown to have a modular structure [3], [4]. The structural models of DNA sequences can be analyzed using computational approaches such as adaptive quality-based clustering [5] and the iterative expectation-maximization algorithm-based method (MEME) [6]. In this paper, we continue our research [7] on structural modelling of DNA sequences using a parallel formal stochastic grammar, and substitute the sequence recognition problem with grammar induction [8].

Formal grammars can provide a means of describing complex repeatable structures such as DNA. For example, Koza [9] demonstrates the possibility of discovering the rewrite rules for L-systems and the state transition rules for cellular automata using genetic programming techniques. For the study of protein formation, Marcus [10] considers so-called semi-Lindenmayer systems and discusses an isomorphism between the genetic and natural language. Jiménez-Montaño [11] proposes an algorithm to construct a short context-free grammar that generates a given sequence. Infante-Lopez and de Rijke [12] describe the inference of regular language grammar rules based on n-grams and minimization of the Kullback–Leibler divergence. O’Neill et al. [13] generate regular expressions for the promoter recognition problem using the Grammatical Swarm technique. Each individual swarm particle represents choices of program construction rules, where these rules are specified using a Backus–Naur Form (BNF) grammar. Denise et al. [14] generate genomic sequences and basic RNA secondary structures according to a given probability distribution and syntactical (grammatical) parameters. Formal biosequence linguistic research has also used stochastic grammars based on hidden Markov models [15], context-free grammars [16], grammars based on computational logic [17], String Variable Grammars [18], finite-state automata [19], Definite Clause Grammar and Prolog [20], [21], and Basic Gene Grammars based on a logic grammar formalism [22]. Stochastic context-free grammars induced from sample sets of sequences are also considered for modelling RNA sequences [23], [24].

The aim of this paper is to describe a method for structural analysis of regulatory DNA sequences. The method combines SVM classification with automatic derivation of stochastic L-grammar rules using genetic programming techniques. The structure of the paper is as follows. Section 2 provides an introduction to L-grammar. Section 3 describes the principles of SVM classification. Section 4 considers the grammar induction problem. Section 5 describes the derivation of stochastic L-grammar rules for Drosophila, vertebrate and monocot plant promoter datasets, as well as for primate splicing sites. Finally, Section 6 evaluates the results and presents conclusions.

Section snippets

Introduction into L-systems

Treating the genome as a language makes it possible to generalize the structural information contained in biological sequences and to investigate it using formal language theory methods. From the biological point of view, the components of any biological organism evolve simultaneously, so we cannot expect that biological processes could be modelled using a sequential approach. It is more likely that the cells that can reproduce simultaneously would be modelled by a mechanism that is based on the same behavioural
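The parallel, stochastic rewriting that characterizes L-systems can be illustrated with a minimal Python sketch. It assumes a context-free stochastic L-system in which every symbol of the current string is rewritten simultaneously, and each symbol chooses one of its productions at random according to the given probabilities. The names `RULES`, `rewrite_step` and `derive`, as well as the rules themselves, are hypothetical and chosen only for illustration; they are not rules derived in the paper.

```python
import random

# Hypothetical stochastic productions over the DNA alphabet (illustrative only):
# each symbol maps to a list of (replacement, probability) options.
RULES = {
    "T": [("TA", 0.5), ("T", 0.5)],
    "A": [("AT", 0.3), ("A", 0.7)],
}

def rewrite_step(sequence, rules, rng):
    """Apply one parallel derivation step: every symbol is rewritten at once."""
    out = []
    for symbol in sequence:
        options = rules.get(symbol)
        if options is None:
            out.append(symbol)  # symbols without a production are copied unchanged
            continue
        replacements = [r for r, _ in options]
        weights = [p for _, p in options]
        out.append(rng.choices(replacements, weights=weights, k=1)[0])
    return "".join(out)

def derive(axiom, rules, steps, seed=0):
    """Derive a string from the axiom by repeating the parallel rewrite step."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sequence = axiom
    for _ in range(steps):
        sequence = rewrite_step(sequence, rules, rng)
    return sequence
```

With deterministic productions (probability 1.0 for a single option) the same functions reproduce an ordinary L-system, for example Lindenmayer's classic `A -> AB`, `B -> A` system. This is why the sketch keeps the probabilities inside the rule table rather than in the rewriting loop.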

Classification using Support Vector Machine

Support Vector Machine (SVM) [34] is a structural risk minimization-based method for creating binary classification functions from a set of labelled training data. SVM requires that each data instance is represented as a vector of real numbers in feature space. Hence, categorical attributes must first be converted into numeric data. The SVM implicitly maps the training data into a (usually higher-dimensional) feature space. A hyperplane (decision surface) is then
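The requirement that categorical data be converted into real-valued vectors can be made concrete with a small Python sketch. It assumes a one-hot (orthogonal) encoding with four components per nucleotide, which is one common choice for DNA; the snippet above does not say which encoding the paper uses, so the function `one_hot` and the constant `ALPHABET` are illustrative assumptions.

```python
ALPHABET = "ACGT"  # assumed nucleotide ordering for the encoding

def one_hot(sequence):
    """Encode a DNA string as a flat list of floats, four per nucleotide."""
    vector = []
    for base in sequence.upper():
        if base not in ALPHABET:
            raise ValueError(f"unexpected symbol: {base!r}")
        # exactly one of the four positions is set to 1.0 for each base
        vector.extend(1.0 if base == letter else 0.0 for letter in ALPHABET)
    return vector
```

A sequence of length n becomes a vector of length 4n, so all sequences passed to the classifier need to have the same length under this encoding.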

Problem of grammar inference

Grammar inference (or grammatical induction) refers to the process of inducing a formal grammar (usually in the form of production rules) from a set of observations using machine learning techniques [35], [36]. Effective grammar inference algorithms exist only for regular languages; therefore, the construction of algorithms that learn context-free grammars is still an open problem [37].

The result of grammar inference is a model that reflects the characteristics of the observed objects.
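As the abstract describes, a candidate grammar ruleset is scored by the percentage of its generated sequences that the trained classifier labels correctly. A minimal Python sketch of such a fitness measure follows. The function `fitness` takes any callable as the classifier; `toy_classifier` is a hypothetical stand-in for a trained SVM decision function (a simple motif check) and is not the paper's model.

```python
def fitness(generated_sequences, classify):
    """Fraction of generated sequences that the classifier accepts as positive."""
    if not generated_sequences:
        return 0.0  # an empty sample carries no evidence for the ruleset
    hits = sum(1 for seq in generated_sequences if classify(seq))
    return hits / len(generated_sequences)

def toy_classifier(sequence):
    """Stand-in for a trained SVM (assumption): accept sequences containing TATA."""
    return "TATA" in sequence

# Example: two of the four candidate sequences contain the motif.
sample = ["GGTATAAC", "CCCCGGGG", "TATATATA", "ACGTACGT"]
score = fitness(sample, toy_classifier)
```

In a genetic programming loop, a score of this kind would be computed for each candidate ruleset and used to select which rulesets survive to the next generation.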

Derivation of L-grammar rules for promoters

Promoters are short regulatory DNA sequences that precede the beginnings of genes. They are common both in prokaryotic and eukaryotic genomes. Analysis of promoter structure is important for our understanding of gene regulation mechanisms and genome evolution process, elucidation of the mechanisms for transcriptional activation of genes, annotation of transcriptional regulatory elements, and development of efficient promoter prediction programs [40]. The crucial obstacle in analyzing promoters

Evaluation and conclusions

Lack of structural insight is one of the major drawbacks of machine learning methods such as SVM. Since the task of reliable recognition of regulatory DNA sequences is already solved sufficiently well by the underlying SVM (see the 99.7% promoter recognition accuracy on Drosophila test sequences in Table 1), the derived grammars are used for structural analysis of regulatory DNA sequences.

Classification results of the artificial promoter sequences generated using the derived L-grammar rules are almost

Damaševičius received his M.Sc. (2001) and Ph.D. (2005) degrees in informatics from Kaunas University of Technology (KTU), Kaunas, Lithuania. Currently he is an associate professor at the Software Engineering Department, KTU. He is also a member of the Design Process Automation Group at the Software Engineering Department. His main research interests include bioinformatics, formal grammars, intelligent data mining methods, and design automation. He is the author of more than 60 scientific papers.

References (46)

  • A.O. Schmitt et al., The modular structure of informational sequences, Biosystems (1996)
  • M.A. Jiménez-Montaño, On the syntactic structure of protein sequences and the concept of grammar complexity, Bull. Math. Biol. (1984)
  • G. Abramson et al., Fractal properties of DNA walks, Biosystems (1999)
  • M.I. Monteiro et al., Machine learning techniques for predicting Bacillus Subtilis promoters
  • R. Ranawana et al., A neural network based multiclassifier system for gene identification in DNA sequences, J. Neural Comput. Appl. (2005)
  • M.D. Cochran et al., Modular structure of the beta-globin and the TK promoters, EMBO J. (1984)
  • K. Florquin et al., Large-scale structural analysis of the core promoter in mammalian and plant genomes, Nucleic Acids Res. (2005)
  • U. Ohler et al., Computational analysis of core promoters in the Drosophila genome, Genome Biol. (2002)
  • R. Damaševičius, Derivation of context-free stochastic L-Grammar rules for promoter sequence modeling using Support...
  • O. Unold, Grammar-based classifier system for recognition of promoter regions
  • J.R. Koza, Discovery of rewrite rules in lindenmayer systems and state transition rules in cellular automata via...
  • S. Marcus, Linguistic structures and generative devices in molecular genetics, Cah. Ling. Theor. Appl. (1974)
  • G. Infante-Lopez, M. de Rijke, Alternative approaches for generating bodies of grammar rules, in: Proceedings of 42nd...
  • M. O’Neill, A. Brabazon, C. Adley, The automatic generation of programs for classification problems with grammatical...
  • A. Denise, Y. Ponty, M. Termier, Random Generation of structured genomic sequences, in: Proceedings of Seventh Annual...
  • R. Durbin et al., Biological Sequence Analysis (1998)
  • D. Fredouille, C.H. Bryant, Speeding up parsing of biological context-free grammars, in: Proceedings of 16th Annual...
  • D. Searls, Linguistic approaches to biological sequences, Bioinformatics (1997)
  • D. Searls, String variable grammar: a logic grammar formalism for the biological language of DNA, J. Logic Programming (1993)
  • E. Petre, Watson-Crick-Automata, J. Automata, Languages Combinatorics (2003)
  • J. Collado-Vides, Grammatical model of the regulation of gene expression, Proc. Natl. Acad. Sci. USA (1992)
  • D. Rosenblueth et al., Syntactic recognition of regulatory regions in Escherichia coli, Comput. Appl. Biosci. (1996)
  • S.W. Leung et al., Basic gene grammars and DNA-chart parser for language processing of Escherichia coli promoter DNA sequences, Bioinformatics (2001)