Elsevier

Decision Support Systems

Volume 47, Issue 4, November 2009, Pages 398-407
Genetic-based approaches in ranking function discovery and optimization in information retrieval — A framework

https://doi.org/10.1016/j.dss.2009.04.005

Abstract

An Information Retrieval (IR) system consists of a document collection, queries issued by users, and the matching/ranking functions used to rank documents in the predicted order of relevance for a given query. A variety of ranking functions have been used in the literature, but studies show that these functions do not perform consistently well across different contexts. In this paper we propose a two-stage integrated framework for discovering and optimizing ranking functions used in IR. The first stage, the discovery process, uses Genetic Programming techniques to leverage the structural and statistical information available in HTML documents and yield novel ranking functions. In the second stage, the optimization process, document retrieval scores of various well-known ranking functions are combined using Genetic Algorithms. The overall discovery and optimization framework is tested on the well-known TREC collection of web documents for both the ad-hoc retrieval task and the routing task. Using our framework, we observe a significant increase in retrieval performance compared to some of the well-known standalone ranking functions.

Introduction

As the cost of storage devices continues to decrease, databases of all sorts (relational, graphical, and textual) have grown enormously. The tremendous growth of the World Wide Web has also contributed to the explosive growth in the number of documents available. This has led to huge, fragmented, and unstructured document collections within organizations. Although it has become easier to collect and store information in document collections, it has become increasingly difficult to retrieve relevant information from them. This is true both on the World Wide Web (WWW) and in e-commerce. When users search the Web using popular search engines such as Google and Yahoo, they often either have to look at many webpages before finding a relevant one or cannot find all the relevant information they are looking for. In e-commerce, users may find it difficult to locate the products they want to buy. Retrieval performance is therefore paramount for users of the Web and e-commerce. Researchers have used various techniques to improve retrieval performance [8], [19], [23], [40].

An Information Retrieval (IR) system typically consists of three subsystems: Documents, Users with Queries, and Matching/Ranking Functions. Users have varying information requirements, in terms of both the breadth and the depth of the topics they are interested in. A document collection consists of documents about many different topics. Documents are represented in a form (typically using the vector space model [39]) that can be easily used by the matching function, taking care that this representation correctly reflects the author's intention. Users' information requirements are translated into queries that the system can process. Query formatting depends on the underlying retrieval model (Boolean models [6], vector space models [39], probabilistic models [37], fuzzy retrieval models [7], or models based on artificial intelligence techniques [10]).
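As an illustration of the vector space representation mentioned above, the following minimal sketch represents documents and a query as raw term-frequency vectors and matches them by cosine similarity. The whitespace tokenizer, the unweighted term frequencies, the toy documents, and the function names `term_vector` and `cosine_similarity` are all assumptions made for this example; they are not taken from the paper.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercase token -> raw term frequency."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy collection (hypothetical documents).
docs = {
    "d1": "genetic programming discovers ranking functions",
    "d2": "storage devices keep getting cheaper",
}
query = term_vector("ranking functions")

# One similarity score per document; higher means a closer match.
scores = {
    doc_id: cosine_similarity(query, term_vector(text))
    for doc_id, text in docs.items()
}
```

In practice the raw frequencies would usually be replaced by a weighting scheme, which is exactly the part that a ranking function determines.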

A system matching function matches the information in queries with that in the document representations and typically calculates a score called the ‘retrieval status value’ (RSV). The documents are presented to the user in decreasing order of RSV. The user rates these documents as either relevant or non-relevant to his/her information need. Various performance criteria, such as precision and recall, have been used to gauge the effectiveness of the system in meeting users' information requirements. Recall is the ratio of the number of relevant retrieved documents to the total number of relevant documents available in the document collection. Precision is the ratio of the number of relevant retrieved documents to the total number of retrieved documents. Relevance feedback is typically used by the system to improve document descriptions or queries, with the expectation that overall system performance will improve after such feedback.
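The definitions above can be made concrete with a short sketch: documents are ordered by decreasing RSV, and precision and recall are then computed from the retrieved list and the set of relevant documents. The RSV values, the document identifiers, and the function name `precision_recall` are hypothetical and serve only to illustrate the two ratios.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list and a relevant set."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant retrieved documents
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical retrieval status values produced by some matching function.
rsv = {"d1": 0.8, "d2": 0.3, "d3": 0.9, "d7": 0.5}

# Present documents in decreasing order of RSV.
ranked = sorted(rsv, key=rsv.get, reverse=True)

# Hypothetical user judgments: d5 is relevant but was never retrieved.
relevant = {"d1", "d2", "d5"}

precision, recall = precision_recall(ranked, relevant)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved, so the two measures differ even though they share a numerator.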

An IR system's performance can be affected by factors affecting any of the three subsystems: documents, queries, or ranking functions. Researchers have extensively looked at how to improve retrieval performance by manipulating the documents and the queries [3], [19], [22], [27], [32], [40]. In this paper we focus our attention on discovering and optimizing the ranking functions. Ranking functions, in the web scenario, typically exploit three characteristics of the documents: the content of the document, the links to the document, and the structure of the document. Content-based ranking functions [38], [42] make extensive use of lexical/syntactical statistics of words in a document collection (e.g. token frequency (tf), document frequency (df), document length, etc.) for ranking purposes. Link-based ranking functions [8], [30] use web interconnection information to boost ranking performance by identifying pages that are highly endorsed by others. Structure-based ranking functions exploit the structural properties of documents by assigning weights to words appearing in different structural positions, such as Title, Header, and Anchor, and use those weighting heuristics to improve ranking performance. Although the exact algorithms used by commercial search engines are not known for proprietary reasons, it is conjectured that these search engines typically use structural information in their ranking functions [1].

There are some other ranking functions that seek to combine the evidence at the content, link, and structure levels, as seen in the recent TREC web track competition [24], [25]. In the TREC competition it was clear that using link information alone does not improve performance much (in terms of measures such as precision and recall) compared to using content information alone. Ranking functions based on content alone are also still very successful. For example, Okapi [42], a ranking function based on content alone, was found to be very successful. We conducted a preliminary test with the Okapi function by adding keywords from the document title to the body text of the document. Adding structural information (such as the information from the document title) improved retrieval performance even with the same Okapi ranking function.
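The idea behind such a test can be sketched as follows: the same content-based scorer is run once on body text only and once on title plus body text. The scorer below is one widely used form of Okapi BM25 with the conventional defaults k1 = 1.2 and b = 0.75; the exact Okapi variant and parameters used in the preliminary test are not given here, so this choice, the toy pages, and the function name `bm25_scores` are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score tokenized documents against a query with a common BM25 variant."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization relative to the average document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Hypothetical pages with a separate title field.
pages = {
    "p1": {"title": "genetic ranking", "body": "we evolve scoring formulas for web search"},
    "p2": {"title": "storage costs", "body": "ranking of disk vendors by price"},
}
query = ["genetic", "ranking"]

body_only = {pid: page["body"].split() for pid, page in pages.items()}
title_and_body = {
    pid: (page["title"] + " " + page["body"]).split() for pid, page in pages.items()
}

scores_body = bm25_scores(query, body_only)       # p1 has no matching body term
scores_both = bm25_scores(query, title_and_body)  # title terms now contribute
```

In this toy collection the first page is only retrievable once its title terms are indexed. This shows the mechanism, not the size of the effect on a real collection.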

Ranking function tuning is very important for improving IR system performance (in terms of precision and recall). There is some prior research on using Genetic Programming (GP) for ranking function discovery [14] and on using Genetic Algorithms (GA) for ranking fusion [5], [34], [35], [36]. But, to the best of our knowledge, there is no research combining these two into a systematic, integrated framework. GP is known to be very powerful for discovering novel nonlinear functions, and GA is known to be suitable for parameterized nonlinear optimization. However, it remains to be explored whether the novel ranking functions discovered by GP can later be fused effectively with other well-known ranking functions by GA to further improve ranking performance. We believe that these two streams of ranking function improvement research can be integrated to yield improved retrieval performance. In this paper, we propose such an integrated two-stage framework. In the first stage, called the discovery or exploration stage, we exploit the structural information in documents along with their content information to discover new ranking functions, using GP for the discovery. In the second stage, called the optimization or exploitation stage, we combine the information provided by well-known ranking functions (including the ones discovered by GP) using an optimization technique such as GA to further improve retrieval performance.
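To give a feel for the second stage only, the sketch below uses a very small genetic algorithm to search for weights that combine the scores of several ranking functions into one fused score. Everything specific in it is an assumption chosen for brevity: the linear weighted sum, average precision as the fitness function, truncation selection, one-point crossover, Gaussian mutation, the toy score table, and the names `fused_ranking` and `evolve_weights`. It is not the encoding, operators, or fitness function used in the paper.

```python
import random


def average_precision(ranked_ids, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def fused_ranking(weights, score_table):
    """Rank documents by a weighted sum of their per-function scores."""
    fused = {
        doc_id: sum(w * s for w, s in zip(weights, scores))
        for doc_id, scores in score_table.items()
    }
    return sorted(fused, key=fused.get, reverse=True)


def evolve_weights(score_table, relevant, pop_size=20, generations=30, seed=0):
    """Search for fusion weights in [0, 1] with a minimal elitist GA."""
    rng = random.Random(seed)
    n_funcs = len(next(iter(score_table.values())))

    def fitness(weights):
        return average_precision(fused_ranking(weights, score_table), relevant)

    population = [[rng.random() for _ in range(n_funcs)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_funcs) if n_funcs > 1 else 0
            child = mom[:cut] + dad[cut:]  # one-point crossover
            i = rng.randrange(n_funcs)     # mutate one weight, clamped to [0, 1]
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Hypothetical scores from three ranking functions for four documents.
score_table = {
    "d1": (0.9, 0.2, 0.4),
    "d2": (0.8, 0.1, 0.3),
    "d3": (0.1, 0.9, 0.5),
    "d4": (0.3, 0.3, 0.9),
}
relevant = {"d1", "d3"}

best_weights = evolve_weights(score_table, relevant)
best_ap = average_precision(fused_ranking(best_weights, score_table), relevant)
```

Because the better half of each generation is carried over unchanged, the best fitness never decreases between generations. A real setup would evaluate fitness over many training queries rather than a single one.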

The paper is organized as follows. In Section 2 we will discuss related work in the area of ranking function discovery and adaptation. In Section 3 we will present our framework for ranking function discovery and optimization. The framework will be tested on a well-known web document collection by conducting experiments detailed in Section 4. In Section 5 we will discuss the results of the experiments, and Section 6 will conclude the paper.

Section snippets

Related work

In this section we will briefly review research related to our work in this paper. Specifically, we will first review the vector space model (VSM), the theoretical model upon which our integrated framework of ranking function discovery and optimization is based. Then we will review work on data fusion techniques as applied to information retrieval (IR), and finally we will review work in IR that uses GP and GA.

Ranking discovery and optimization framework

Evaluation studies [42], [45] on use of ranking functions have shown that no single ranking function performs best for all contexts of document collections and queries. The best method to pick a good ranking function for a given query is still an open question. Moreover, there may still be some good ranking functions which are yet to be discovered. In this section we present a framework to address these issues. The first part of the framework explores a variety of clues available in content and

Experiments

We test the effectiveness of our framework for retrieval by conducting various experiments. We now describe the data that was used, the exact process for GP discovery and GA optimization that was followed, the various ranking functions (apart from the newly discovered ones) that were used, and the fitness functions that were used in the discovery and optimization process.

Results and discussion

In this section we discuss the results of experiments done in the last section. We first report results for user-provided queries and then for relevance feedback queries. Within each of these we will report results for ad-hoc retrieval and then for routing retrieval. After that we will discuss the effects of using various fitness functions on the robustness of our results. Results have been provided as averages of performance measures obtained across all queries. This is standard practice for

Conclusion

In this paper we presented an integrated framework for using genetic-based approaches (specifically the Genetic Programming and Genetic Algorithms) to discover new ranking functions as well as to optimize the well-known existing ones. The first part of the framework uses GP to discover novel ranking functions. We used both the content as well as the structural information in HTML documents to discover such functions. It was observed that these newly discovered functions outperformed the

References (45)

  • ...
  • B. Bartell et al. Automatic combination of multiple ranked retrieval systems.
  • B. Bartell et al. Optimizing similarity using multi-query relevance feedback. Journal of the American Society for Information Science (1998).
  • H. Billhardt et al. Learning retrieval expert combinations with genetic algorithms. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (2003).
  • A. Bookstein. Probability and fuzzy set applications to information retrieval. Annual Review of Information Science and Technology (1985).
  • G. Bordogna et al. A fuzzy linguistic approach generalizing Boolean information retrieval: a model and its evaluation. Journal of the American Society for Information Science (1993).
  • L. Chen et al. WebMate: a personal agent for browsing and searching.
  • H. Chen et al. A smart itsy bitsy spider for the Web. Journal of the American Society for Information Science (1998).
  • H. Chen et al. A machine learning approach to inductive query by examples: an experiment using relevance feedback, ID3, genetic algorithms, and simulated annealing. Journal of the American Society for Information Science (1998).
  • O. Cordon et al. A new evolutionary algorithm combining simulated annealing and genetic programming for relevance feedback in fuzzy information retrieval systems. Soft Computing (2002).
  • W. Fan et al. The effects of fitness functions on genetic programming-based ranking discovery for web search. Journal of the American Society for Information Science and Technology (2004).
  • W. Fan et al. Discovery of context-specific ranking functions for effective information retrieval using genetic programming. IEEE Transactions on Knowledge and Data Engineering (2004).

    Dr. Weiguo (Patrick) Fan is an Associate Professor of Accounting and Information Systems at the Virginia Polytechnic Institute and State University (Virginia Tech). He received his Ph.D. in Information Systems from the Ross School of Business, University of Michigan, Ann Arbor, in July 2002, an M.Sc. in Computer Science from the National University of Singapore in 1997, and a B.E. in Information and Control Engineering from Xi'an Jiaotong University, P.R. China, in 1995.

    His research interests focus on the design and development of novel information technologies (information retrieval, data mining, text/web mining, social computing, personalization and knowledge management techniques) to support better business information management and decision making. He has published more than 90 refereed journal and conference papers. His research has appeared in many prestigious information technology journals such as ACM Transactions on Internet Technology, Communications of the ACM, Decision Support Systems, IEEE Transactions on Knowledge and Data Engineering, IEEE Intelligent Systems, Information Systems, Information Processing and Management, Journal of the American Society for Information Science and Technology, Journal of Management Information Systems, Pattern Recognition, etc., and in leading information technology conferences such as SIGIR, WWW, CIKM, HLT, ICIS, HICSS, AMCIS, DS, ICOTA, etc. His research has been funded by NSF and PWC.

    Dr. Praveen Pathak is an Associate Professor of Information Systems and Operations Management at the Warrington College of Business at the University of Florida. He received his PhD in Information Systems from the Ross School of Business, University of Michigan, Ann Arbor, in 2000. He also holds an MBA (PGDM) from the Indian Institute of Management, Calcutta, and an Engineering degree, B. Tech. (Hons.), from the Indian Institute of Technology, Kharagpur. His research interests include information retrieval, web mining, offshore outsourcing and business intelligence. His research has appeared in many journals such as Journal of Management Information Systems (JMIS), Decision Support Systems (DSS), IEEE Transactions on Knowledge and Data Engineering (TKDE), Information Processing and Management (IP&M), Journal of the American Society for Information Science and Technology (JASIST), and in leading information technology conferences such as ICIS, HICSS, WITS, and INFORMS.

    Dr. Mi Zhou is an Assistant Professor in the School of Management at Xi'an Jiaotong University, P.R. China. She received her Ph.D. in Management Science and Engineering from the School of Management at Xi'an Jiaotong University, P.R. China, in July 2007, an M.Sc. in Management Science and Engineering from Xi'an Jiaotong University, P.R. China, in 1998, and a B.E. in Mechanical Engineering from Nanjing Polytechnic University, P.R. China, in 1992. Her research interests focus on information management, social relationships, knowledge management (knowledge transfer, knowledge sharing, knowledge creation), and models of online knowledge communities.
