Elsevier

Applied Soft Computing

Volume 72, November 2018, Pages 56-64
A genetic programming method for feature mapping to improve prediction of HIV-1 protease cleavage site

https://doi.org/10.1016/j.asoc.2018.06.045

Highlights

  • This article proposes a new method for predicting the cleavage of amino acid sequences by HIV-1 protease.

  • The proposed method relies on a new encoding and a new model for predicting amino acid sequence cleavage by HIV protease.

  • In the encoding stage, a combination of the amino acids’ spatial and structural features is taken into account.

  • The final prediction model is developed by combining genetic programming with a support vector machine.

Abstract

The human immunodeficiency virus (HIV) is the cause of acquired immunodeficiency syndrome (AIDS), which has profound implications in terms of both economic burden and loss of life. Modeling and examination of the HIV protease cleavage of amino acid sequences can contribute to control of this disease and production of more effective drugs. The present paper introduces a new method for encoding and characterization of amino acid sequences and a new model for the prediction of amino acid sequence cleavage by HIV protease. The proposed encoding scheme utilizes a combination of amino acids’ spatial and structural features in conjunction with 20 amino acid sequences to make sure that their physicochemical and sequencing features are all taken into account. The proposed HIV-1 amino acid cleavage prediction model is developed with the combination of genetic programming and support vector machine. The results of evaluations performed on various datasets demonstrate the superior performance of the proposed encoding and better accuracy of the proposed HIV-1 cleavage prediction model as compared to the state-of-the-art methods.

Introduction

The human immunodeficiency virus (HIV) is known as the cause of acquired immunodeficiency syndrome (AIDS). The Centers for Disease Control and Prevention (CDC) first recognized AIDS as a distinct syndrome in 1981 [1]. According to the latest statistics published by the World Health Organization (WHO), approximately 37.7 million people worldwide were infected with HIV [2]. According to this organization, 2.1 million new cases of HIV and 1.1 million deaths among these patients were reported in 2015. Despite successful efforts to reduce mortality due to HIV infection, the high morbidity and the risk of opportunistic infections make this virus one of the key topics of global research on infectious diseases. There are two types of this virus, HIV-1 and HIV-2. HIV-1, which originated from chimpanzees and apes of West Africa [3], is responsible for the majority of AIDS cases and is more pathogenic than HIV-2 [4]. As a result, this type of HIV has been the subject of many research works. Major efforts towards HIV treatment fall within four domains: inhibition of cell binding to lymphocytes, inhibition of cell entry into lymphocytes, inhibition of genome proliferation, and inhibition of cell regeneration.

One of the main obstacles to the progress of therapeutic efforts against HIV infection is the high genetic variability of the virus [5]. One promising approach to the treatment of AIDS (and also hepatitis C) is the use of protease inhibitors. These drugs inhibit the viral protease, thus preventing the cleavage of amino acid chains and the generation of the proteins needed for the assembly of new virus variants. HIV protease performs a task essential to viral replication by binding to the precursor polyproteins and cleaving them. HIV protease has two identical sites, but only one of them is active in binding to the susceptible site of a protein. The susceptible site may contain one subsite fewer or one more, corresponding to a heptapeptide or a nonapeptide, respectively [6]. Knowledge of the peptide sites that HIV protease can cleave can help scientists design more effective HIV protease inhibitors. When the scissile bond of such a peptide is modified by a routine procedure, the peptide loses its cleavability but can still bind to the active site of the HIV enzyme and prevent it from cleaving another protein. Hence, suppressing the cleavage ability of HIV protease through the design of HIV protease inhibitors has been considered a new approach to AIDS therapy. To this end, information about the substrate specificity of HIV protease and about the cleavable site of each protein is employed to design the inhibitors. Finding effective inhibitors therefore requires a reliable method for predicting whether a peptide can be cleaved by HIV protease. Since each peptide is encoded by a sequence of eight amino acids, called an octamer, and there are 20 distinct amino acids, the total number of possible octamers is 20^8, or about 2.56 × 10^10 [6]. Traditional search strategies are therefore impractical because of this huge search space. Modeling and examining the cleavage function of HIV protease with meta-heuristic methods can contribute to the analysis of its behavior and to the production of more effective drugs based on suitable protease inhibitors.

The present paper introduces a new encoding scheme for the characterization of amino acid sequences and a feature-learning based model for the prediction of amino acid sequence cleavage by the HIV-1 protease. The encoding scheme utilizes a combination of amino acid shape and sequence as features. This approach is based on the argument that the features are products of the chemical properties of the amino acid sequence, and therefore both the shape of the amino acids, which is a function of their chemical properties, and their sequence are associated with the recognition of cleavage. Since recent works such as Rögnvaldsson and You [7] have proven the linear nature of this problem, the final phase of the proposed method combines a support vector machine (SVM) with a linear kernel and Genetic Programming (GP).

The innovations of this paper fall within two areas: encoding and feature learning. As mentioned earlier, some researchers believe that this problem should be solved with the symmetry of the amino acid sequence considered as the primary determinant of cleavage, while others argue that the sequence of amino acids and how they are positioned are also important in this regard. Therefore, in this paper, features are encoded according to the sequences of amino acids such that the spatial shape of these sequences is implicitly taken into account. Another innovation of this paper is the development of a GP-based learning algorithm for the problem of amino acid sequence cleavage by the HIV-1 protease. Regardless of the linear or non-linear nature of this problem and how it is encoded, using genetic programming to map the problem space improves the accuracy of the classifier.
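To make the idea of mapping the problem space more concrete, the following minimal Python sketch shows how GP-style expression trees can turn an input feature vector into a new, shorter one. It is an illustration under stated assumptions, not the authors' implementation (which, per the Results section, was written in MATLAB): the operator set, the protected division, the tree depth, and all function names are hypothetical. Only the representation and evaluation of one individual are shown; selection, crossover, mutation, and a classifier-based fitness are omitted.

```python
import operator
import random
from typing import List, Sequence, Tuple, Union

# A tree is either a leaf (an index into the input vector) or (op_name, left, right).
Tree = Union[int, Tuple[str, "Tree", "Tree"]]


def protected_div(a: float, b: float) -> float:
    """Division that returns 1.0 for a near-zero denominator (a common GP convention)."""
    return a / b if abs(b) > 1e-9 else 1.0


# Assumed operator set; the operators actually used in the paper are not reproduced here.
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": protected_div,
}


def random_tree(n_inputs: int, depth: int, rng: random.Random) -> Tree:
    """Grow a random expression tree over the input features."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(n_inputs)  # leaf: pick one input feature
    op = rng.choice(sorted(OPS))
    return (op,
            random_tree(n_inputs, depth - 1, rng),
            random_tree(n_inputs, depth - 1, rng))


def evaluate(tree: Tree, x: Sequence[float]) -> float:
    """Evaluate one tree on one input vector."""
    if isinstance(tree, int):
        return x[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))


def map_features(trees: List[Tree], x: Sequence[float]) -> List[float]:
    """One GP individual = a list of N trees; it maps x to an N-element vector."""
    return [evaluate(t, x) for t in trees]


rng = random.Random(0)                      # fixed seed for repeatability
individual = [random_tree(152, 3, rng) for _ in range(5)]
x = [i / 152 for i in range(152)]           # placeholder 152-element input
print(map_features(individual, x))
```

In a full GP loop, each individual would be scored by training a classifier (for example a linear SVM) on the mapped features and using its accuracy as the fitness.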

In the rest of this paper, the next section provides the literature review and some background on the proposed method. Section three describes the proposed approach for encoding amino acid sequences, followed by the proposed feature learning method for the problem of amino acid sequence cleavage by the HIV-1 protease. The subsequent sections present the results of the proposed method and compare them with those of the best existing works, and the final section presents the conclusions.

Section snippets

Related works

Several studies in the last two decades have tried to model the cleavage function of HIV protease. These studies have mostly focused on predicting the location at which the HIV-1 protease cleaves an amino acid sequence. However, not all researchers believe in this approach; for example, Prabu-Jeyabalan et al. [8] reported that HIV-1 protease recognition can be better characterized by the apparent symmetry of amino acids, and has little association with the amino acid sequence.

Besides the

Encoding

This section explains the proposed scheme for mapping a sequence of amino acids to a feature set. In this mapping, there will be a feature set for each amino acid in the octamer. In other words, each amino acid sequence consists of eight amino acid elements, called an octamer AA:

$AA = \{aa_1, aa_2, \ldots, aa_8\}$

Each amino acid is characterized and encoded by 19 distinct features (f1, f2, …, f19):

$h(aa_i) = [f_1(aa_i), f_2(aa_i), \ldots, f_{19}(aa_i)]$

In this equation, aai represents one of the 20 amino acids. The extracted features (f1 to f19
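The mapping described in this section can be sketched in a few lines of Python. This is a hedged illustration only: the 19 per-residue features are not listed in this excerpt, so the table below is filled with arbitrary placeholder numbers, and the names `make_placeholder_table` and `encode_octamer` are hypothetical. What the sketch does show is the shape of the encoding: eight residues with 19 features each are concatenated into a vector of 8 × 19 = 152 elements.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids
N_FEATURES = 19                       # features per amino acid, as stated in the text
OCTAMER_LEN = 8


def make_placeholder_table() -> Dict[str, List[float]]:
    """Hypothetical stand-in for the per-residue feature table.

    The values are arbitrary deterministic numbers, NOT real
    physicochemical or structural properties.
    """
    table = {}
    for idx, aa in enumerate(AMINO_ACIDS):
        table[aa] = [((idx + 1) * (k + 1)) % 7 / 7.0 for k in range(N_FEATURES)]
    return table


def encode_octamer(octamer: str, table: Dict[str, List[float]]) -> List[float]:
    """Concatenate the feature vectors of the eight residues, in sequence order."""
    if len(octamer) != OCTAMER_LEN:
        raise ValueError(f"expected {OCTAMER_LEN} residues, got {len(octamer)}")
    vector: List[float] = []
    for aa in octamer:
        if aa not in table:
            raise ValueError(f"unknown amino acid code: {aa!r}")
        vector.extend(table[aa])
    return vector


FEATURE_TABLE = make_placeholder_table()
vec = encode_octamer("SQNYPIVQ", FEATURE_TABLE)  # example octamer string
print(len(vec))  # 152
```

Because residues are concatenated in order, position information is preserved: the first 19 entries always describe the first residue, the next 19 the second, and so on.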

Results

The performance of the proposed method was evaluated using four datasets: Data746 [33], Data1625 [34], DataSchilling [24] and DataImpens [35]. The characteristics of these datasets are presented in Table 2. The parameters of the GP and the list of mathematical operations employed in the GP are presented in Table 3 and Table 4, respectively. The proposed method was implemented in MATLAB 2013, and its source code is publicly available (download from //github.com/rasool-sadeghi/Encoding-and-Prediction-of-Cleavage-of-Amino-Acid-Sequences-by-HIV-1

Discussion

As previously stated, this study used four datasets for performance evaluation, namely Data746 [33], Data1625 [34], DataSchilling [24] and DataImpens [35], with 746, 1625, 947, and 3272 octamers, respectively. In many works that carried out their evaluation on these datasets, one dataset is used as training data and the rest are used as test data, so that each set serves at least once as training data. These datasets share similar octamers, which creates a degree of redundancy.
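The cross-dataset protocol described above can be sketched as follows. This is a minimal Python illustration under assumptions: the toy octamers and labels are made up, the function name `cross_dataset_splits` is hypothetical, and dropping test octamers that also occur in the training set is just one simple way of handling the redundancy, not necessarily the procedure used in this study.

```python
from typing import Dict, Iterator, List, Tuple

Sample = Tuple[str, int]  # (octamer, label); label 1 = cleaved, 0 = not cleaved


def cross_dataset_splits(
    datasets: Dict[str, List[Sample]],
) -> Iterator[Tuple[str, str, List[Sample], List[Sample]]]:
    """Yield (train_name, test_name, train, test) for every ordered dataset pair.

    Test octamers that also appear in the training set are dropped so that
    shared sequences do not inflate the measured accuracy (an assumed policy).
    """
    for train_name, train in datasets.items():
        seen = {octamer for octamer, _ in train}
        for test_name, test in datasets.items():
            if test_name == train_name:
                continue
            filtered = [(o, y) for o, y in test if o not in seen]
            yield train_name, test_name, train, filtered


# Made-up toy data standing in for the real datasets.
toy = {
    "A": [("AAAAAAAA", 1), ("CCCCCCCC", 0)],
    "B": [("AAAAAAAA", 1), ("DDDDDDDD", 0)],
    "C": [("EEEEEEEE", 1)],
}

splits = list(cross_dataset_splits(toy))
for train_name, test_name, train, test in splits:
    print(train_name, "->", test_name, "test size:", len(test))
```

With four datasets this yields twelve train/test pairs, and each dataset serves as training data three times.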

Conclusion

In this paper, a GP-based feature learning algorithm was combined with an SVM classifier and a new encoding scheme to develop a new model for the prediction of amino acid sequence cleavage by the HIV-1 protease. In the proposed encoding scheme, the chemical features and the sequence structure of the amino acids were utilized to extract a 152-element vector as the initial feature of their octamer sequences. Then genetic programming was used for mapping this feature vector to a new N-element

References (35)

  • R.C. Gallo, A reflection on HIV/AIDS research after 25 years, Retrovirology (2006)
  • World Health Organization, www.who.int, [Online Access: 2 February...
  • P.M. Sharp et al., Origins of HIV and the AIDS Pandemic, Cold Spring Harb. Perspect. Med. (2011)
  • J.D. Reeves et al., Human immunodeficiency virus type 2, J. Gen. Virol. (2002)
  • D.L. Robertson et al., Recombination in AIDS viruses, J. Mol. Evol. (1995)
  • T. Rögnvaldsson et al., Why neural networks should not be used for HIV-1 protease cleavage site prediction, Bioinformatics (2004)
  • T. Rögnvaldsson, How to find simple and accurate rules for viral protease cleavage specificities, BMC Bioinform. (2009)