A genetic programming method for feature mapping to improve prediction of HIV-1 protease cleavage site

doi:10.1016/j.asoc.2018.06.045

Applied Soft Computing

Volume 72, November 2018, Pages 56-64

Highlights

• This article proposes a new method for predicting the cleavage of amino acid sequences by HIV-1 protease.
• The proposed method relies on a new encoding and a new model for predicting amino acid sequence cleavage by HIV protease.
• In the encoding stage, a combination of amino acids’ spatial and structural features is taken into account.
• The final prediction model is developed by combining genetic programming and a support vector machine.

Abstract

The human immunodeficiency virus (HIV) is the cause of acquired immunodeficiency syndrome (AIDS), which has profound implications in terms of both economic burden and loss of life. Modeling and examination of the HIV protease cleavage of amino acid sequences can contribute to control of this disease and production of more effective drugs. The present paper introduces a new method for encoding and characterization of amino acid sequences and a new model for the prediction of amino acid sequence cleavage by HIV protease. The proposed encoding scheme utilizes a combination of amino acids’ spatial and structural features in conjunction with 20 amino acid sequences to make sure that their physicochemical and sequencing features are all taken into account. The proposed HIV-1 amino acid cleavage prediction model is developed with the combination of genetic programming and support vector machine. The results of evaluations performed on various datasets demonstrate the superior performance of the proposed encoding and better accuracy of the proposed HIV-1 cleavage prediction model as compared to the state-of-the-art methods.

Introduction

The human immunodeficiency virus (HIV) is known as the cause of acquired immunodeficiency syndrome (AIDS). The Centers for Disease Control and Prevention (CDC) first recognized AIDS as a distinct syndrome in 1981 [1]. According to the latest statistics published by the World Health Organization (WHO), approximately 37.7 million people worldwide were infected with HIV [2]. According to this organization, 2.1 million new cases of HIV and 1.1 million deaths among these patients were reported in 2015. Despite successful efforts to reduce mortality due to HIV infection, the high morbidity and the risk of opportunistic infections make this virus one of the key topics of global research on infectious diseases. There are two types of this virus, HIV-1 and HIV-2. HIV-1, which originated from chimpanzees and apes of West Africa [3], is responsible for the majority of AIDS cases and is more pathogenic than HIV-2 [4]. As a result, this type of HIV has been the subject of many research works. Major efforts towards HIV treatment fall within four domains: inhibition of cell binding to lymphocytes, inhibition of cell entry into lymphocytes, inhibition of genome proliferation, and inhibition of cell regeneration.

One of the main obstacles to the progress of therapeutic efforts for HIV infection is its high genetic variability [5]. One of the promising approaches to the treatment of AIDS (and also hepatitis C) is the use of protease inhibitors. These drugs inhibit the virus protease, thus preventing the cleavage of amino acid chains and the generation of the proteins needed for the assembly of new virus variants. HIV protease performs a task essential for viral replication by binding to, entering, and cleaving the precursor polyproteins. HIV protease has two identical sites, but only one of them is active for binding and penetrating into the susceptible site of proteins. The susceptible sites of proteins may contain one fewer or one more subsite in the case of a heptapeptide or a nonapeptide, respectively [6]. Knowledge of the HIV protease-cleavable peptide sites can help scientists design more effective HIV protease inhibitors. After the scissile bond of such a peptide is modified by a routine procedure, the peptide loses its cleavability but can still bind to the active site of an HIV enzyme and prevent it from cleaving another protein. Hence, disabling the cleavage ability of HIV protease through the design of HIV protease inhibitors has been considered a new approach for AIDS therapy. To this end, information about HIV protease substrate specificity and the cleavable site of each protein is employed to design HIV protease inhibitors. In order to find effective inhibitors, it is necessary to have a proper method for predicting the cleavability of a peptide by HIV protease. Since each peptide is encoded as a sequence of eight amino acids, called an octamer, and there are 20 distinct amino acids, the total number of possible octamers is 20⁸ [6]. Traditional search strategies are therefore not applicable because of this huge search space. Modeling and examination of the cleavage function of HIV protease, by employing meta-heuristic methods, can contribute to the analysis of its behavior and the production of more effective drugs based on suitable protease inhibitors.
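To give a sense of scale for the 20⁸ figure above, the count can be worked out directly. The following is only an illustrative calculation; the function name and constants are introduced here and do not come from the paper.

```python
# Size of the octamer search space: one of 20 amino acids at each of 8 positions.
NUM_AMINO_ACIDS = 20   # distinct amino acids
OCTAMER_LENGTH = 8     # residues per octamer


def octamer_space_size(alphabet_size: int = NUM_AMINO_ACIDS,
                       length: int = OCTAMER_LENGTH) -> int:
    """Number of distinct sequences of `length` symbols over `alphabet_size` letters."""
    return alphabet_size ** length


total = octamer_space_size()
print(f"{total:,}")  # 25,600,000,000
```

At roughly 2.56 × 10¹⁰ candidates, testing every octamer experimentally is out of reach, which is the motivation for a predictive model.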

The present paper introduces a new encoding scheme for the characterization of amino acid sequences and a feature-learning based model for the prediction of amino acid sequence cleavage by the HIV-1 protease. The encoding scheme provided in this paper utilizes a combination of amino acid shape and sequence as a feature. This approach is based on the argument that the features are the products of the chemical properties of the amino acid sequence, and therefore both the shape of amino acids, which is a function of their chemical properties, and their sequence are associated with the recognition of cleavage. Since recent works such as Rögnvaldsson and You [7] have shown the linear nature of this problem, the final phase of the proposed method is developed with the combined use of an SVM with a linear kernel and Genetic Programming (GP).

The innovations of this paper fall within two areas: encoding and feature learning. As mentioned earlier, some researchers believe that this problem should be solved with the symmetry of the amino acid sequence considered as the primary determinant of cleavage, but others argue that the sequence of amino acids and how they are positioned are also important in this regard. Therefore, in this paper, features are encoded according to the sequences of amino acids such that the spatial shape of these sequences is implicitly taken into account. Another innovation of this paper is the development of a GP-based learning algorithm for the problem of amino acid sequence cleavage by the HIV-1 protease. Regardless of the linear or non-linear nature of this problem and how it is encoded, the use of genetic programming to map the problem space improves the accuracy of the classifier.

In the rest of this paper, the next section provides the literature review and some background for the proposed method. Section three describes the proposed approach for encoding amino acid sequences, and then the proposed feature learning method for the problem of amino acid sequence cleavage by the HIV-1 protease. The subsequent sections present the results of the proposed method and compare them with those of the best existing works, and the final section presents the conclusions.

Section snippets

Related works

Several studies in the last two decades have tried to model the cleavage function of HIV protease. These studies have mostly focused on predicting the location of cleavage in the amino acid sequence by the HIV-1 protease. However, not all researchers agree with this approach; for example, Prabu-Jeyabalan et al. [8] reported that HIV-1 protease recognition can be better characterized by the apparent symmetry of amino acids, and has little association with the amino acid sequence.

Besides the

Encoding

This section explains the proposed scheme for mapping a sequence of amino acids to a feature set. In this mapping, there will be a feature set for each amino acid in the octamer. In other words, each amino acid sequence consists of eight amino acid elements, called octamer AA: $AA = aa_{1}, aa_{2}, \dots, aa_{8}$

Each amino acid is characterized and encoded by 19 distinct features (f₁, f₂, …, f₁₉): $h(aa_{i}) = L(f_{1}(aa_{i}), f_{2}(aa_{i}), \dots, f_{19}(aa_{i}))$

In this equation, $aa_{i}$ represents one of the 20 amino acids. The extracted features (f₁ to f₁₉
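The shape of this encoding can be sketched as follows: eight residues with 19 features each give 8 × 19 = 152 values per octamer. This is a minimal sketch under stated assumptions, not the authors' implementation. The feature values below are hypothetical placeholders (the actual f₁ to f₁₉ are not listed in this excerpt), the operator L is assumed here to be simple concatenation, and the names `placeholder_feature_table` and `encode_octamer` are introduced only for illustration.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids
NUM_FEATURES = 19                     # f_1 .. f_19 per amino acid
OCTAMER_LENGTH = 8


def placeholder_feature_table() -> Dict[str, List[float]]:
    """HYPOTHETICAL feature values: arbitrary numbers standing in for f_1..f_19."""
    table = {}
    for idx, aa in enumerate(AMINO_ACIDS):
        table[aa] = [(((idx + 1) * (k + 1)) % 7) / 7.0 for k in range(NUM_FEATURES)]
    return table


def encode_octamer(octamer: str, table: Dict[str, List[float]]) -> List[float]:
    """Concatenate the per-residue feature lists h(aa_1), ..., h(aa_8)."""
    if len(octamer) != OCTAMER_LENGTH:
        raise ValueError(f"expected {OCTAMER_LENGTH} residues, got {len(octamer)}")
    vector: List[float] = []
    for aa in octamer:
        if aa not in table:
            raise ValueError(f"unknown amino acid code: {aa!r}")
        vector.extend(table[aa])  # assumption: L(...) is plain concatenation
    return vector


table = placeholder_feature_table()
features = encode_octamer("SQNYPIVQ", table)  # example octamer string
print(len(features))  # 152
```

With real physicochemical and structural feature values substituted into the table, the same function would produce the fixed-length input vector that a downstream classifier expects.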

Results

The performance of the proposed method was evaluated using four datasets: Data746 [33], Data1625 [34], DataSchilling [24] and DataImpens [35]. The characteristics of these datasets are presented in Table 2. The parameters of the GP and the list of mathematical operations employed in GP are presented in Table 3 and Table 4, respectively. The proposed method was implemented in MATLAB 2013, and its source code is publicly available (download from //github.com/rasool-sadeghi/Encoding-and-Prediction-of-Cleavage-of-Amino-Acid-Sequences-by-HIV-1

Discussion

As previously stated, this study used four datasets, namely Data746 [33], Data1625 [34], DataSchilling [24] and DataImpens [35], with respectively 746, 1625, 947, and 3272 octamers for performance evaluation. In many works which carried out their evaluation using these datasets, one of them has been used as training data and the rest have been used as test data, so that each set may serve at least once as training data. These datasets have similar octamers, which create a degree of redundancy.
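The rotation described above, where each dataset serves once as training data and the remaining ones as test data, can be sketched as below, together with a simple way to measure octamer overlap between two sets. This is an illustrative sketch only: the helper names are hypothetical, the toy octamers are made up for the example, and how the paper actually handles the redundancy is not stated in this excerpt.

```python
from typing import List, Set, Tuple

DATASET_NAMES = ["Data746", "Data1625", "DataSchilling", "DataImpens"]


def train_test_rotations(names: List[str]) -> List[Tuple[str, List[str]]]:
    """Each dataset is used once for training; all others are used for testing."""
    return [(train, [n for n in names if n != train]) for train in names]


def shared_octamers(a: Set[str], b: Set[str]) -> Set[str]:
    """Octamers present in both sets, i.e. the redundancy between two datasets."""
    return a & b


for train_name, test_names in train_test_rotations(DATASET_NAMES):
    print(f"train on {train_name}, test on {', '.join(test_names)}")

# Toy example with made-up octamers, only to show the overlap computation.
toy_a = {"AAAAAAAA", "CCCCCCCC", "DDDDDDDD"}
toy_b = {"CCCCCCCC", "EEEEEEEE"}
print(sorted(shared_octamers(toy_a, toy_b)))  # ['CCCCCCCC']
```

Measuring the overlap before evaluation makes it possible to report how much of a test set was already seen during training.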

Conclusion

In this paper, a GP-based feature learning algorithm was combined with an SVM classifier and a new encoding scheme to develop a new model for the prediction of amino acid sequence cleavage by the HIV-1 protease. In the proposed encoding scheme, the chemical features and the sequence structure of the amino acids were utilized to extract a 152-element vector as the initial feature of their octamer sequences. Then genetic programming was used for mapping this feature vector to a new N-element
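The idea of mapping the 152-element vector to a new N-element vector can be pictured as evaluating N expression trees on the input, one tree per output feature. The sketch below shows only that mapping step with randomly generated trees. It is an assumption-laden illustration, not the paper's method: the operator set (add, sub, mul, protected division) is hypothetical since Table 4 is not reproduced here, and the evolutionary loop (selection, crossover, mutation) and the SVM-based fitness are deliberately omitted.

```python
import random
from typing import Callable, Dict, List

# A tree is either ("x", index) for an input feature or (op_name, left, right).
Tree = tuple


def protected_div(a: float, b: float) -> float:
    """Division that returns 1.0 when the denominator is (near) zero."""
    return a / b if abs(b) > 1e-9 else 1.0


# HYPOTHETICAL operator set; the paper's actual operations are listed in its Table 4.
OPS: Dict[str, Callable[[float, float], float]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": protected_div,
}


def random_tree(rng: random.Random, num_inputs: int, depth: int) -> Tree:
    """Grow a random expression tree of at most `depth` operator levels."""
    if depth == 0 or rng.random() < 0.3:
        return ("x", rng.randrange(num_inputs))
    op = rng.choice(sorted(OPS))
    return (op,
            random_tree(rng, num_inputs, depth - 1),
            random_tree(rng, num_inputs, depth - 1))


def evaluate(tree: Tree, x: List[float]) -> float:
    """Recursively evaluate one tree on the input vector x."""
    if tree[0] == "x":
        return x[tree[1]]
    return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))


def map_features(trees: List[Tree], x: List[float]) -> List[float]:
    """Map the original vector to a new one with one value per tree."""
    return [evaluate(t, x) for t in trees]


rng = random.Random(0)
original = [rng.random() for _ in range(152)]               # stand-in for an encoded octamer
trees = [random_tree(rng, 152, depth=3) for _ in range(10)]  # N = 10 chosen arbitrarily
mapped = map_features(trees, original)
print(len(mapped))  # 10
```

In a full GP run, candidate sets of trees would be scored by how well a classifier trained on the mapped features performs, and the best candidates would be recombined over generations.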

References (35)

K.C. Chou
Review: prediction of HIV protease cleavage sites in proteins
Anal. Biochem.
(1996)
M. Prabu-Jeyabalan et al.
Substrate shape determines specificity of recognition for HIV-1 protease: analysis of crystal structures of six substrate complexes
Structure
(2002)
H. Ogul
Variable context Markov chains for HIV protease cleavage site prediction
BioSystems
(2009)
K.C. Chou
A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins
J. Biol. Chem.
(1993)
G. Kim et al.
An MLP-based feature subset selection for HIV-1 protease cleavage site analysis
Artif. Intell. Med.
(2010)
H.B. Shen
HIVcleave: a web-server for predicting HIV protease cleavage sites in proteins
Anal. Biochem.
(2008)
L. Guo et al.
Automatic feature extraction using genetic programming: an application to epileptic EEG classification
Expert Syst. Appl.
(2011)
Y. Zhang et al.
A generic optimising feature extraction method using multiobjective genetic programming
Appl. Soft Comput.
(2011)
A. Elola et al.
Hybridizing cartesian genetic programming and harmony search for adaptive feature construction in supervised learning problems
Appl. Soft Comput.
(2017)
M. Amir Haeri et al.
Statistical genetic programming for symbolic regression
Appl. Soft Comput.
(2017)

R.C. Gallo
A reflection on HIV/AIDS research after 25 years
Retrovirology
(2006)
World Health Organization, www.who.int, [Online Access: 2 February...
P.M. Sharp et al.
Origins of HIV and the AIDS Pandemic
Cold Spring Harb. Perspect. Med.
(2011)
J.D. Reeves et al.
Human immunodeficiency virus type 2
J. Gen. Virol.
(2002)
D.L. Robertson et al.
Recombination in AIDS viruses
J. Mol. Evol.
(1995)
T. Rögnvaldsson et al.
Why neural networks should not be used for HIV-1 protease cleavage site prediction
Bioinformatics
(2004)
T. Rögnvaldsson
How to find simple and accurate rules for viral protease cleavage specificities
BMC Bioinform.
(2009)
