Structural analysis of regulatory DNA sequences using grammar inference and Support Vector Machine
Introduction
Recognition of specific functionally important DNA sequence fragments such as promoters (short sequences that precede the beginnings of genes) or splice-junction sites (boundary points between exons and introns where splicing occurs) is one of the most important problems in bioinformatics. Common sequence analysis methods such as pattern search and sequence alignment using dynamic programming (Needleman–Wunsch, Smith–Waterman algorithms) and substitution-matrix based methods (PAM, BLOSUM) cannot solve this problem because of noisy data and large variability of consensus sequences across different species. Machine learning methods can be used for sequence classification, because these methods can learn useful descriptions of genetic concepts from data instances only rather than from explicit definitions.
Modern sequence recognition tools use machine learning techniques such as Naive Bayes, Decision Trees, Hidden Markov Models, Neural Networks or Support Vector Machines (SVM) [1]. These techniques achieve up to 98% accuracy [2]; however, they do not provide any insight into the internal structure of the analyzed sequences.
Many types of DNA sequences have been shown to have a modular structure [3], [4]. The structural models of DNA sequences can be analyzed using computational approaches such as adaptive quality-based clustering [5] and the iterative expectation-maximization algorithm-based method (MEME) [6]. In this paper, we continue our research [7] on structural modelling of DNA sequences using a parallel formal stochastic grammar, and substitute the sequence recognition problem with grammar induction [8].
Formal grammars provide a means of describing complex repeatable structures such as DNA. For example, Koza [9] demonstrates the possibility of discovering the rewrite rule for L-systems and the state transition rules for cellular automata using genetic programming techniques. For the study of protein formation, Marcus [10] considers so-called semi-Lindenmayer systems and discusses an isomorphism between the genetic and natural language. Jiménez-Montaño [11] proposes an algorithm to construct a short context-free grammar that generates a given sequence. Infante-Lopez and de Rijke [12] describe the inference of regular language grammar rules based on n-grams and minimization of the Kullback–Leibler divergence. O’Neill et al. [13] generate regular expressions for the promoter recognition problem using the Grammatical Swarm technique, in which each individual swarm particle represents choices of program construction rules, and these rules are specified using a Backus–Naur Form (BNF) grammar. Denise et al. [14] generate genomic sequences and basic RNA secondary structures according to a given probability distribution and syntactical (grammatical) parameters. Formal biosequence linguistic research has also used stochastic grammars based on hidden Markov models [15], context-free grammars [16], grammars based on computational logic [17], String Variable Grammars [18], finite-state automata [19], Definite Clause Grammar and Prolog [20], [21], and Basic Gene Grammars based on a logic grammar formalism [22]. Stochastic context-free grammars induced from sample sets of sequences are also considered for modelling RNA sequences [23], [24].
The aim of this paper is to describe a method for structural analysis of regulatory DNA sequences. The method combines SVM classification with automatic derivation of stochastic L-grammar rules using genetic programming techniques. The structure of the paper is as follows. Section 2 provides an introduction to L-grammars. Section 3 describes the principles of SVM classification. Section 4 considers the grammar induction problem. Section 5 describes the derivation of stochastic L-grammar rules for Drosophila, vertebrate and monocot plant promoter datasets, as well as for primate splicing sites. Finally, Section 6 evaluates the results and presents conclusions.
Introduction into L-systems
Treating the genome as a language allows the structural information contained in biological sequences to be generalized and investigated using formal language theory methods. From the biological point of view, components of any biological organism evolve simultaneously, so we cannot expect that biological processes could be modelled using a sequential approach. It is more likely that the cells that can reproduce simultaneously would be modelled by a mechanism that is based on the same behavioural
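The parallel character of L-systems can be illustrated with a minimal sketch. The rules below are Lindenmayer's classic algae example, chosen here only for illustration; they are not the rules derived in this paper, and the function names are hypothetical. The point is that every symbol of the word is rewritten in the same derivation step, in contrast to a sequential grammar that rewrites one symbol at a time.

```python
def rewrite(word, rules):
    """Apply one parallel derivation step: every symbol is rewritten at once.

    Symbols without a rule are copied unchanged (identity production).
    """
    return "".join(rules.get(symbol, symbol) for symbol in word)


def derive(axiom, rules, steps):
    """Return the list of words produced from the axiom over the given steps."""
    words = [axiom]
    for _ in range(steps):
        words.append(rewrite(words[-1], rules))
    return words


# Illustrative deterministic rules (Lindenmayer's algae system), not from the paper.
ALGAE_RULES = {"A": "AB", "B": "A"}

derivation = derive("A", ALGAE_RULES, 4)
for step, word in enumerate(derivation):
    print(step, word)
```

Because all symbols are replaced simultaneously, the word lengths in this example grow as 1, 2, 3, 5, 8, which a one-symbol-at-a-time derivation would not produce in the same number of steps.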
Classification using Support Vector Machine
Support Vector Machine (SVM) [34] is a structural risk minimization-based method for creating binary classification functions from a set of labelled training data. SVM requires that each data instance is represented as a vector of real numbers in feature space. Hence, categorical attributes, such as nucleotides, first have to be converted into numeric data. SVM implicitly maps the training data into a (usually higher-dimensional) feature space. A hyperplane (decision surface) is then
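The conversion of categorical attributes into numeric data can be sketched as follows. This excerpt does not state which encoding the paper uses, so the one-hot (orthogonal) encoding below is an assumption, and the names NUCLEOTIDES and one_hot_encode are hypothetical. Each nucleotide becomes four indicator values, so a sequence of length n becomes a real-valued vector of length 4n that an SVM can accept as input.

```python
# Assumed alphabet order; any fixed order works as long as it is used consistently.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Map a DNA string to a flat list of floats, four indicator values per base."""
    vector = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected symbol: {base!r}")
        vector.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return vector


encoded = one_hot_encode("TATA")
print(len(encoded), encoded)
```

One-hot encoding avoids imposing an artificial ordering on the nucleotides, which a single integer code per base (for example A=1, C=2, G=3, T=4) would introduce.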
Problem of grammar inference
Grammar inference (or grammatical induction) refers to the process of inducing a formal grammar (usually in the form of production rules) from a set of observations using machine learning techniques [35], [36]. Effective grammar inference algorithms exist only for regular languages; therefore, the construction of algorithms that learn context-free grammars is still an open problem [37].
The result of grammar inference is a model that reflects the characteristics of the observed objects.
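To make concrete what such a model might look like for a stochastic L-grammar, the sketch below represents a grammar as a set of rewrite alternatives with probabilities and uses it to generate a sequence. The rules, probabilities and function names are invented for illustration only; they are not the rules inferred in this paper, and the sketch shows only how an already inferred grammar is applied, not the inference procedure itself.

```python
import random

# Hypothetical stochastic rules: symbol -> list of (replacement, probability).
RULES = {
    "A": [("AT", 0.5), ("A", 0.5)],
    "T": [("TA", 0.3), ("T", 0.7)],
}


def stochastic_step(word, rules, rng):
    """Rewrite every symbol in parallel, sampling one alternative per symbol."""
    out = []
    for symbol in word:
        options = rules.get(symbol)
        if options is None:
            out.append(symbol)  # no rule: the symbol is copied unchanged
            continue
        replacements = [replacement for replacement, _ in options]
        weights = [probability for _, probability in options]
        out.append(rng.choices(replacements, weights=weights, k=1)[0])
    return "".join(out)


def generate(axiom, rules, steps, seed=0):
    """Generate one sequence from the axiom; the seed makes the run repeatable."""
    rng = random.Random(seed)
    word = axiom
    for _ in range(steps):
        word = stochastic_step(word, rules, rng)
    return word


sample = generate("TATA", RULES, steps=3, seed=42)
print(sample)
```

Sequences generated this way could then be passed to a classifier to check whether they resemble the observed data, which is one way to judge how well a grammar reflects the characteristics of the observations.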
Derivation of L-grammar rules for promoters
Promoters are short regulatory DNA sequences that precede the beginnings of genes. They are common in both prokaryotic and eukaryotic genomes. Analysis of promoter structure is important for our understanding of gene regulation mechanisms and the genome evolution process, elucidation of the mechanisms for transcriptional activation of genes, annotation of transcriptional regulatory elements, and development of efficient promoter prediction programs [40]. The crucial obstacle in analyzing promoters
Evaluation and conclusions
Lack of structural insight is one of the major drawbacks of machine learning methods such as SVM. Since the task of reliable recognition of regulatory DNA sequences is already solved sufficiently well by the underlying SVM (see the 99.7% promoter recognition accuracy on Drosophila test sequences in Table 1), the derived grammars are used for structural analysis of regulatory DNA sequences.
Classification results of the artificial promoter sequences generated using the derived L-grammar rules are almost
Damaševičius received his M.Sc. (2001) and Ph.D. (2005) degrees in informatics from Kaunas University of Technology (KTU), Kaunas, Lithuania. Currently he is an associate professor at the Software Engineering Department, KTU. He is also a member of the Design Process Automation Group at the Software Engineering Department. His main research interests include bioinformatics, formal grammars, intelligent data mining methods and design automation. He is the author of more than 60 scientific papers.
References (46)
- et al., The modular structure of informational sequences, Biosystems (1996)
- On the syntactic structure of protein sequences and the concept of grammar complexity, Bull. Math. Biol. (1984)
- et al., Fractal properties of DNA walks, Biosystems (1999)
- et al., Machine learning techniques for predicting Bacillus Subtilis promoters
- et al., A neural network based multiclassifier system for gene identification in DNA sequences, J. Neural Comput. Appl. (2005)
- et al., Modular structure of the beta-globin and the TK promoters, EMBO J. (1984)
- et al., Large-scale structural analysis of the core promoter in mammalian and plant genomes, Nucleic Acids Res. (2005)
- et al., Computational analysis of core promoters in the Drosophila genome, Genome Biol. (2002)
- R. Damaševičius, Derivation of context-free stochastic L-Grammar rules for promoter sequence modeling using Support...
- Grammar-based classifier system for recognition of promoter regions
- Linguistic structures and generative devices in molecular genetics, Cah. Ling. Theor. Appl.
- Biological Sequence Analysis
- Linguistic approaches to biological sequences, Bioinformatics
- String variable grammar: a logic grammar formalism for the biological language of DNA, J. Logic Programming
- Watson-Crick-Automata, J. Automata, Languages Combinatorics
- Grammatical model of the regulation of gene expression, Proc. Natl. Acad. Sci. USA
- Syntactic recognition of regulatory regions in Escherichia coli, Comput. Appl. Biosci.
- Basic gene grammars and DNA-chart parser for language processing of Escherichia coli promoter DNA sequences, Bioinformatics