An immune programming-based ranking function discovery approach for effective information retrieval

https://doi.org/10.1016/j.eswa.2010.02.019

Abstract

In this paper, we propose RankIP, the first immune programming (IP) based ranking function discovery approach. IP is a novel evolution-based machine learning algorithm built on the principles of immune systems. According to the experimental results in Musilek et al. (2006), its convergence is superior to that of Genetic Programming (GP).

However, this superiority of IP has mainly been demonstrated for optimization problems. RankIP adapts IP to the learning to rank problem, a typical classification problem. In doing so, the solution representation, affinity function, and high-affinity antibody selection require completely different treatments. In addition, two formulae for selecting the best antibody for testing are designed for learning to rank.

Experimental results demonstrate that the proposed RankIP significantly outperforms the state-of-the-art learning-based ranking methods in terms of P@n, MAP and NDCG@n.

Introduction

In an information retrieval (IR) system, a ranked list of documents is returned as a response for each query. Thus the ranking issue is critical to the effectiveness of such systems.

Several methods have been proposed to solve this problem, such as the boolean model, vector space model, probabilistic model, and language model, which can be regarded as empirical IR methods (Tsai, Liu, Qin, Chen, & Ma, 2007). In addition to these traditional IR approaches, machine learning techniques are becoming more widely used for the ranking problem of IR, referred to as “learning to rank”. It aims to design and apply methods to automatically learn a function from training data, such that the function can sort objects (e.g., documents) according to their degrees of relevance, preference, or importance as defined in a specific application (Joachims, Li, Liu, & Zhai, 2007). This has become an active and growing research area in both the information retrieval and machine learning communities, and many traditional classification methods have been adopted for it, e.g., (Cao et al., 2006, Freund et al., 2003, Joachims, 2002, Xu and Li, 2007), etc.

Meanwhile, evolutionary computation (EC) based methods, especially Genetic Programming (GP) based technologies, have recently been applied successfully to this problem with promising results, e.g., (Fan et al., 2000, Fan et al., 2004a, Fan et al., 2004b, Fan et al., 2005, Trotman, 2005). They have become an important branch of the “learning to rank” area.

EC is a family of effective search and optimization techniques that mimic the process of natural evolution in biology. In the theoretical and applied research on EC, there has recently been growing interest in methods inspired by immune systems and their principles and mechanisms (de Castro & Timmis, 2003). These artificial immune systems (AIS) have already been applied to numerous types of problems such as computer security, data analysis, clustering, pattern matching and parametric optimization (Dasgupta, Ji, & González, 2003). Immune programming (IP) (Musilek, Lau, Reformat, & Wyard-Scott, 2006) is an extension of immune algorithms, particularly the clonal selection algorithm in AIS. Musilek et al. (2006) demonstrate that, for optimization problems, the convergence of IP is superior to that of GP; that is, IP can find an ideal antibody/individual in fewer generations.

Thus we propose RankIP, which adapts IP to learning to rank, a classification problem. To validate our approach we performed experiments on the OHSUMED, TREC 2003 and TREC 2004 data collections. Results indicate that the use of our framework leads to effective ranking functions that significantly outperform the baselines, including RankSVM (Joachims, 2002), RankBoost (Freund et al., 2003) and BM25 (Robertson, 1997), in terms of MAP, NDCG@n and P@n.

To adapt IP, which was proposed for optimization problems, to the learning to rank problem, several adaptations need to be considered, including the solution representation, the affinity function, and high-affinity antibody selection. In addition, formulae for selecting the best antibody for testing should be designed for learning to rank.

This paper is organized as follows. In Section 2, the related work is summarized. In Section 3, background information on immune programming is provided. RankIP, a novel immune programming-based approach for optimizing the performance measures with respect to the training and validation data, is presented in Section 4. Experimental results and discussions are described in Section 5. Finally, Section 6 concludes the paper.

Learning to rank using traditional classification methods

In contrast to traditional IR methods, such as BM25 (Robertson, 1997) and LMIR (Zhai & Lafferty, 2001), “learning to rank” methods have recently been applied to ranking model construction, and some promising results have been obtained. Joachims (2002) develops RankSVM, a support vector machine (SVM) based approach that utilizes click-through data for training, namely the query log of the search engine in connection with the log of links the users clicked on in the presented ranking. Cao et

Background: immune programming

In this section we introduce immune programming (IP), a novel evolutionary computation (EC) approach for machine learning. IP is an extension of immune algorithms, particularly the clonal selection algorithm, inspired by biological immune systems and their principles and mechanisms.
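To make the clonal selection idea more concrete, the following is a minimal Python sketch of a generic clonal selection loop on a toy one-dimensional maximization problem. It is not the IP algorithm of Musilek et al. (2006) and not RankIP: antibodies here are plain real numbers rather than programs, and the function name clonal_selection, all parameters, and the mutation schedule are assumptions chosen only for illustration.

```python
import random
from typing import Callable, List


def clonal_selection(
    affinity: Callable[[float], float],
    low: float,
    high: float,
    pop_size: int = 20,
    n_select: int = 5,
    clone_factor: int = 4,
    n_replace: int = 2,
    generations: int = 50,
    seed: int = 0,
) -> float:
    """Toy clonal selection: return the antibody with the highest affinity found.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Initial repertoire: random antibodies in [low, high].
    population: List[float] = [rng.uniform(low, high) for _ in range(pop_size)]

    for _ in range(generations):
        # Selection: keep the n_select antibodies with the highest affinity.
        population.sort(key=affinity, reverse=True)
        selected = population[:n_select]

        clones: List[float] = []
        for rank, antibody in enumerate(selected, start=1):
            # Cloning: better-ranked antibodies receive more clones.
            n_clones = max(1, round(clone_factor * n_select / rank))
            # Hypermutation: worse-ranked antibodies are mutated more strongly.
            step = (high - low) * 0.1 * rank / n_select
            for _ in range(n_clones):
                mutant = antibody + rng.gauss(0.0, step)
                clones.append(min(high, max(low, mutant)))

        # Reselection: best of parents and clones survive (elitist).
        pool = sorted(population + clones, key=affinity, reverse=True)
        survivors = pool[: pop_size - n_replace]
        # Diversity: replace the remaining slots with fresh random antibodies.
        newcomers = [rng.uniform(low, high) for _ in range(n_replace)]
        population = survivors + newcomers

    return max(population, key=affinity)


if __name__ == "__main__":
    # Affinity peaks at x = 3; the search should end close to that value.
    best = clonal_selection(lambda x: -(x - 3.0) ** 2, low=-10.0, high=10.0)
    print(best)
```

In a program-evolving setting such as IP, the real-valued antibody would be replaced by a program representation and the Gaussian perturbation by structural mutation; those details are outside this sketch.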

Formal definitions

The problem of information retrieval can be formalized as follows. For a query q and a document collection D, the optimal retrieval system should return a ranking that orders the documents in D according to their relevance to the query q.
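As a rough illustration of this formalization, the Python sketch below orders documents by a scoring function and evaluates the ranking with P@n, average precision (whose mean over queries gives MAP) and NDCG@n. The function names, the toy labels, the gain 2^rel - 1 and the discount 1/log2(rank + 1) are assumptions: they follow one common convention and may differ from the exact definitions used in the paper. Average precision here also assumes the ranking contains every judged document for the query.

```python
import math
from typing import Callable, Dict, List


def rank_documents(score: Callable[[str], float], docs: List[str]) -> List[str]:
    """Order documents by descending score (ties keep their input order)."""
    return sorted(docs, key=score, reverse=True)


def precision_at_n(ranking: List[str], rel: Dict[str, int], n: int) -> float:
    """Fraction of the top-n documents with a relevance label above 0."""
    return sum(1 for d in ranking[:n] if rel.get(d, 0) > 0) / n


def average_precision(ranking: List[str], rel: Dict[str, int]) -> float:
    """Mean of the precision values at the rank of each relevant document."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if rel.get(doc, 0) > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def ndcg_at_n(ranking: List[str], rel: Dict[str, int], n: int) -> float:
    """DCG of the top-n documents divided by the DCG of an ideal ordering."""

    def dcg(labels: List[int]) -> float:
        # Assumed convention: gain 2^label - 1, discount 1 / log2(rank + 1).
        return sum(
            (2 ** g - 1) / math.log2(rank + 1)
            for rank, g in enumerate(labels, start=1)
        )

    actual = dcg([rel.get(d, 0) for d in ranking[:n]])
    ideal = dcg(sorted(rel.values(), reverse=True)[:n])
    return actual / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical graded labels and scores for a single query.
    rel = {"d1": 2, "d2": 0, "d3": 1}
    scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
    ranking = rank_documents(lambda d: scores[d], list(rel))
    print(ranking)
    print(precision_at_n(ranking, rel, 2))
    print(average_precision(ranking, rel))
    print(ndcg_at_n(ranking, rel, 3))
```

In a ranking function discovery setting, a measure of this kind computed on training queries is the sort of quantity that a fitness or affinity function could be built on.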

Let Q be the query set. For a given query q in Q, the relevance of each document in the training data is labeled with an integer; formally, it is defined as a function rel(q): D → N. For example, for the OHSUMED data collection, rel(q)(d)=0 means that the

Experiments

We use three data sets in the experiments: OHSUMED, a benchmark data set for document retrieval, and two TREC data sets obtained from the web tracks of TREC 2003 and 2004. These data collections are all provided on the Microsoft Research web site.

We compared the ranking accuracies of RankIP with those of three baseline methods: Ranking SVM, RankBoost and BM25. The ranking performances of both Ranking SVM and RankBoost are evaluated and reported in Liu et al. (2007). Table 2 shows the control

Conclusions

On the basis of the tree-based representation architecture, in this paper we presented RankIP, an approach for learning to rank with the goal of improving the accuracy of conventional IR and Web searching. In order to adapt IP to the learning to rank problem, we employed the mapping mechanism for RankIP to make sure that the affinity values of the antibodies are well-distributed over the range [0..1]. Besides, we introduced the deme technology to the IP algorithms. Furthermore, two formulae

Acknowledgments

Thanks are given to anonymous referees for the helpful suggestions and comments that they provided. The authors acknowledge that this research is supported by the Natural Science Fund of China No. 60970047 and the Key Science-Technology Project of Shandong Province of China No. 2008GG10001026.

References (30)

  • W. Fan et al. A generic ranking function discovery framework by genetic programming for information retrieval. Information Processing and Management (2004)
  • P. Musilek et al. Immune programming. Information Sciences (2006)
  • R. Baeza-Yates et al. Modern information retrieval (1999)
  • Cao, Y., Xu, J., Liu, T.-Y., Li, H., Huang, Y., & Hon, H.-W. (2006). Adapting ranking SVM to document retrieval. In...
  • Collins, R. J. (1992). Studies in artificial evolution. PhD thesis, Los Angeles, CA,...
  • Dasgupta, D., Ji, Z., & González, F. (2003). Artificial immune system (AIS) research in the last five years. In...
  • de Almeida, H. M., Gonçalves, M. A., Cristo, M., & Calado, P. (2007). A combined component approach for finding...
  • L.N. de Castro et al. Artificial immune systems as a novel soft computing paradigm. Soft Computing (2003)
  • L.N. de Castro et al. Learning and optimization using the clonal selection principle. IEEE Transactions on Evolutionary Computation (2002)
  • Fan, W., Gordon, M. D., & Pathak, P. (2000). Personalization of search engine services for effective retrieval and...
  • W. Fan et al. Discovery of context-specific ranking functions for effective information retrieval using genetic programming. IEEE Transactions on Knowledge and Data Engineering (2004)
  • W. Fan et al. The effects of fitness functions on genetic programming-based ranking discovery for web search. Journal of the American Society for Information Science and Technology (2004)
  • W. Fan et al. Genetic programming-based discovery of ranking functions for effective web search. Journal of Management Information Systems (2005)
  • Y. Freund et al. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research (2003)
  • K. Järvelin et al. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) (2002)