Improving the learning of Boolean queries by means of a multiobjective IQBE evolutionary algorithm

https://doi.org/10.1016/j.ipm.2005.02.006

Abstract

The Inductive Query By Example (IQBE) paradigm allows a system to automatically derive queries for a specific Information Retrieval System (IRS). Classic IRSs based on this paradigm [Smith, M., & Smith, M. (1997). The use of genetic programming to build Boolean queries for text retrieval through relevance feedback. Journal of Information Science, 23(6), 423–431] generate a single solution (Boolean query) in each run, that with the best fitness value, which is usually based on a weighted combination of the basic performance criteria, precision and recall.

A desirable feature of IRSs, especially those based on the IQBE paradigm, is the ability to obtain more than one query for the same information need, either with high precision and recall values or with different trade-offs between the two.

In this contribution, a new IQBE process is proposed that combines a previous basic algorithm to automatically derive Boolean queries for Boolean IRSs [Smith, M., & Smith, M. (1997). The use of genetic programming to build Boolean queries for text retrieval through relevance feedback. Journal of Information Science, 23(6), 423–431] with an advanced evolutionary multiobjective approach [Coello, C. A., Van Veldhuizen, D. A., & Lamont, G. B. (2002). Evolutionary algorithms for solving multiobjective problems. Kluwer Academic Publishers], which obtains several queries with different precision–recall trade-offs in a single run. The performance of the new proposal is tested on the Cranfield and CACM collections and compared with the well-known algorithm of Smith and Smith, showing that it improves the learning of queries and could thus better assist the user in the query formulation process.

Introduction

Information retrieval (IR) may be defined, in general, as the problem of selecting documentary information from storage in response to search questions provided by a user (Baeza-Yates and Ribeiro-Neto, 1999, Salton and McGill, 1983). Information retrieval systems (IRSs) are a kind of information system that deals with databases composed of information items (documents that may consist of textual, pictorial or vocal information) and processes user queries in order to give the user access to relevant information within an appropriate time interval. Nowadays, the development of the WWW has increased interest in the study of IRSs.

Many IRSs still rely on the Boolean IR model (Van Rijsbergen, 1979), based on Boolean queries in which the query terms are joined by the logical operators AND and OR. The user therefore needs a clear understanding of how to connect the query terms with the Boolean operators in order to build a query that defines his information needs. The difficulty that nonexpert users find in formulating these kinds of queries sometimes makes it necessary to design automatic methods for this task. The paradigm of Inductive Query by Example (IQBE) (Chen, Shankaranarayanan, She, & Iyer, 1998), in which a query describing the information contents of a set of documents provided by a user is automatically derived, can be useful to assist the user in the query formulation process. Focusing on the Boolean IR model, the best-known existing approach is that of Smith and Smith (1997), which is based on genetic programming (GP) (Koza, 1992), a kind of evolutionary algorithm (EA) (Bäck, Fogel, & Michalewicz, 1997). As is usual in the topic (Cordón, Herrera-Viedma, López-Pujalte, Luque, & Zarco, 2003), this approach is guided by a weighted fitness function combining two retrieval accuracy criteria, precision and recall. The main characteristic of this approach is that it provides a single query in each run.
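To make the ingredients of this paragraph concrete, the following minimal Python sketch evaluates a Boolean query, represented as an expression tree of the kind GP manipulates, against a toy collection, and then scores it with a weighted combination of precision and recall. The class and function names, the toy documents, and the weights `alpha` and `beta` are all illustrative assumptions; they are not taken from Smith and Smith's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple, Union


@dataclass(frozen=True)
class Term:
    """Leaf node: a single index term (hypothetical representation)."""
    word: str


@dataclass(frozen=True)
class Op:
    """Inner node: a Boolean operator, "AND" or "OR", over two subqueries."""
    kind: str
    left: "Query"
    right: "Query"


Query = Union[Term, Op]


def matches(query: Query, doc_terms: Set[str]) -> bool:
    """Return True if the document's term set satisfies the Boolean query."""
    if isinstance(query, Term):
        return query.word in doc_terms
    left = matches(query.left, doc_terms)
    right = matches(query.right, doc_terms)
    if query.kind == "AND":
        return left and right
    if query.kind == "OR":
        return left or right
    raise ValueError(f"unknown operator: {query.kind}")


def retrieve(query: Query, docs: Dict[str, Set[str]]) -> Set[str]:
    """Identifiers of all documents that satisfy the query."""
    return {doc_id for doc_id, terms in docs.items() if matches(query, terms)}


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant (0.0 if undefined)."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def weighted_fitness(precision: float, recall: float,
                     alpha: float = 0.5, beta: float = 0.5) -> float:
    """Scalar fitness; alpha and beta are assumed weights, not values from the paper."""
    return alpha * precision + beta * recall


# Toy collection: document id -> set of index terms (invented for illustration).
docs = {
    "d1": {"wing", "flow"},
    "d2": {"wing", "heat"},
    "d3": {"heat"},
    "d4": {"flow", "shock"},
}
relevant = {"d1", "d4"}  # documents the user marked as relevant

# wing AND (flow OR heat)
query = Op("AND", Term("wing"), Op("OR", Term("flow"), Term("heat")))
retrieved = retrieve(query, docs)
precision, recall = precision_recall(retrieved, relevant)
fitness = weighted_fitness(precision, recall)
```

In this toy case the query retrieves `d1` and `d2`, so precision and recall are both 0.5. Because the two criteria are folded into one number, a run guided by such a fitness returns only the single best-scoring query.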

Given that the retrieval performance of an IRS is usually measured in terms of these two criteria, precision and recall (Van Rijsbergen, 1979), the optimization of any of its components, and in particular the automatic learning of Boolean queries, is a clear example of a multiobjective problem. EAs have been commonly used for IQBE purposes, and their application in the area has usually been based on combining both criteria in a single scalar fitness function by means of a weighting scheme (Cordón, Herrera-Viedma, López-Pujalte, et al., 2003). However, there is a kind of EA specially designed for multiobjective problems, multiobjective evolutionary algorithms, which are able to obtain different nondominated solutions to the problem in a single run (Coello et al., 2002, Deb, 2001). In IR, and specifically in the IQBE paradigm, they would allow us to derive a number of queries with different precision–recall trade-offs in a single run of the IQBE algorithm, and in this way to offer users more help in formulating their queries.
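The notion of nondominated solutions used here can be illustrated with a short Python sketch. It assumes each candidate query has already been evaluated to a (precision, recall) pair, both to be maximized; the function names and the sample values are invented for illustration only.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, float]  # (precision, recall), both maximized


def dominates(a: Objectives, b: Objectives) -> bool:
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def nondominated(points: Sequence[Objectives]) -> List[Objectives]:
    """Distinct objective vectors that no other vector dominates, in sorted order."""
    unique = sorted(set(points))
    return [p for p in unique if not any(dominates(q, p) for q in unique)]


# Invented (precision, recall) pairs for six candidate queries.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.5, 0.5), (0.9, 0.1)]
front = nondominated(candidates)
```

Here `front` keeps three pairs, `(0.2, 0.9)`, `(0.5, 0.5)` and `(0.9, 0.2)`: each represents a different precision–recall trade-off, and none can be improved in one criterion without losing in the other.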

In this paper, we present a new evolutionary tool to learn Boolean queries, called multiobjective IQBE EA, that improves on Smith and Smith’s (1997) approach. We define it by extending Smith and Smith’s approach with Pareto-based evolutionary multiobjective components incorporated into GP. To do so, we consider one of the best-known and best-performing Pareto-based multiobjective EAs, SPEA (Zitzler & Thiele, 1999). The main feature of this EA is that it maintains the concept of elitism in a multiobjective evolutionary algorithm, which improves the performance of our multiobjective GP algorithm. In order to represent a real-world text retrieval IQBE environment, where a user provides a relatively small number of relevant and irrelevant documents, the experimental testbed is based on two of the best-known small IR benchmarks, the Cranfield and CACM document collections (Baeza-Yates and Ribeiro-Neto, 1999, Salton and McGill, 1983). With our proposal, we extend the assistance that evolutionary computation tools can offer users in the formulation of queries.
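As a rough illustration of how SPEA exploits its elitist external set, the sketch below follows the strength-based fitness assignment usually attributed to Zitzler and Thiele (1999): each archived nondominated solution receives a strength proportional to the number of population members it covers, and each population member receives one plus the sum of the strengths of the archive members covering it, with lower values being better. This is a simplified sketch under those assumptions. It omits the archive update and the clustering step, the function names and sample values are hypothetical, and it is not claimed to match the implementation used in this paper.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, float]  # (precision, recall), both maximized


def covers(a: Objectives, b: Objectives) -> bool:
    """a covers b if a is at least as good as b in every objective."""
    return all(x >= y for x, y in zip(a, b))


def spea_fitness(archive: Sequence[Objectives],
                 population: Sequence[Objectives]) -> Tuple[List[float], List[float]]:
    """Return (archive strengths, population fitness values); lower is better."""
    n = len(population)
    # Strength of an archive member: fraction of the population it covers, in [0, 1).
    strengths = [sum(covers(a, p) for p in population) / (n + 1) for a in archive]
    # Fitness of a population member: 1 plus the strengths of its covering archive members.
    pop_fitness = [
        1.0 + sum(s for a, s in zip(archive, strengths) if covers(a, p))
        for p in population
    ]
    return strengths, pop_fitness


# Invented objective vectors for a two-member archive and a three-member population.
archive = [(0.9, 0.2), (0.2, 0.9)]
population = [(0.8, 0.1), (0.1, 0.8), (0.1, 0.1)]
strengths, pop_fitness = spea_fitness(archive, population)
```

With these values each archive member covers two of the three population members, so both strengths are 0.5. The population member `(0.1, 0.1)` is covered by both archive members and gets the worst fitness, 2.0, while the other two get 1.5. Archive members always score below 1, which is how the elite set is favoured during selection.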

With this aim, this contribution is structured as follows. Section 2 introduces the preliminaries, including the basis of Boolean IRSs, the definition of the precision and recall criteria, the main aspects of IQBE techniques, a review of EAs and of their application to IR tasks, and finally, the main aspects of multiobjective EAs. Section 3 introduces the main aspects of Smith and Smith’s proposal and extends that algorithm to deal with the multiobjective problem of simultaneously optimizing precision and recall by means of the SPEA Pareto-based approach. The experiments developed to test the new proposal and the results obtained are shown in Sections 4 and 5, respectively. Finally, several concluding remarks are given in Section 6.

Section snippets

Boolean IRS

An IRS is basically constituted by three main components, as shown in Fig. 1.

The documentary data base. This component stores the documents and the representation of their information contents. It is associated with the indexer module, which automatically generates a representation for each document by extracting the document contents. Textual document representation is typically based on index terms (that can be either single terms or sequences) which are the content identifiers of the

A multiobjective IQBE EA to learn multiple Boolean queries

Our main objective is to improve on Smith and Smith’s results by obtaining several queries, instead of just one, in a single run. To do so, we will use a multiobjective approach, incorporating Pareto-based evolutionary multiobjective components into GP, whose good behaviour was demonstrated in Rodríguez-Vazquez, Fonseca, and Fleming (1997).

Firstly we will review Smith and Smith’s approach and then introduce our proposal.

Experiments developed

As mentioned, the experimental study has been developed using the Cranfield and CACM collections. Cranfield is composed of 1398 documents about Aeronautics, while CACM contains 3204 documents published in the journal Communications of the ACM between 1958 and 1979. In both collections, the textual documents have been automatically indexed in the usual way by first extracting the nonstop words and performing a stemming process, thus

Analysis of the Pareto sets derived

Tables 1 and 2 show several statistics corresponding to our multiobjective proposal. These tables collect data about the composition of the 10 Pareto sets generated for each query, always showing the averaged value and its standard deviation. From left to right, the columns contain the number of nondominated solutions obtained (#p), equal to the number of different objective vectors (i.e., precision–recall pairs) existing among them, and the values of the two multiobjective EA

Concluding remarks

The automatic derivation of Boolean queries has been addressed by incorporating a second-generation multiobjective evolutionary approach, SPEA, into an existing GP-based IQBE proposal. The proposed approach has performed appropriately on 35 queries, 17 from the well-known Cranfield collection and 18 from the CACM collection, in terms of both absolute retrieval performance and the quality of the obtained Pareto sets, allowing us to derive a set of queries with different precision–recall trade-offs.

In

Acknowledgement

This research has been supported by CICYT under projects TIC2003-07977 and TIC2003-00877 with FEDER funds.

References (47)

  • M. Boughanem et al.

    On using genetic algorithms for multimodal relevance optimization in information retrieval

    Journal of the American Society for Information Science and Technology

    (2002)
  • M. Boughanem et al.

    Multiple query evaluation based on an enhanced genetic algorithm

    Information Processing & Management

    (2003)
  • V. Chankong et al.

    Multiobjective decision making theory and methodology

    (1983)
  • H. Chen et al.

    A machine learning approach to Inductive Query by Examples: An experiment using relevance feedback, IDS, genetic algorithms, and simulated annealing

    Journal of the American Society for Information Science

    (1998)
  • Chen, Y., & Shahabi, C. (2001). Automatically improving the accuracy of user profiles with genetic algorithm. In...
  • C.A. Coello et al.

    Evolutionary algorithms for solving multi-objective problems

    (2002)
  • Cordón, O., Herrera-Viedma, E., Luque, M., Moya, F., & Zarco, C. (2003). Analyzing the performance of a multiobjective...
  • O. Cordón et al.

    A GA-P algorithm to automatically formulate extended Boolean queries for a fuzzy information retrieval system

    Mathware & Soft Computing

    (2000)
  • O. Cordón et al.

    A new evolutionary algorithm combining simulated annealing and genetic programming for relevance feedback in fuzzy information retrieval systems

    Soft Computing

    (2002)
  • O. Cordón et al.

    Automatic learning of multiple extended Boolean queries by multiobjective GA-P algorithms

  • K. Deb

    Multi-objective optimization using evolutionary algorithms

    (2001)
  • J. Fernández-Villacanas et al.

    Investigation of the importance of the genotype–phenotype mapping in information retrieval

    Future Generation Computer Systems

    (2003)
  • D. Fogel

    System identification through simulated evolution: A machine learning approach

    (1991)