Improving the learning of Boolean queries by means of a multiobjective IQBE evolutionary algorithm

doi:10.1016/j.ipm.2005.02.006

Information Processing & Management

Volume 42, Issue 3, May 2006, Pages 615-632

https://doi.org/10.1016/j.ipm.2005.02.006

Abstract

The Inductive Query By Example (IQBE) paradigm allows a system to automatically derive queries for a specific Information Retrieval System (IRS). Classic IRSs based on this paradigm [Smith, M., & Smith, M. (1997). The use of genetic programming to build Boolean queries for text retrieval through relevance feedback. Journal of Information Science, 23(6), 423–431] generate a single solution (Boolean query) in each run: the one with the best fitness value, which is usually based on a weighted combination of the two basic performance criteria, precision and recall.

A desirable feature of IRSs, especially of those based on the IQBE paradigm, is the ability to obtain more than one query for the same information need, either with high precision and recall values or with different trade-offs between the two.

In this contribution, a new IQBE process is proposed that combines a previous basic algorithm for automatically deriving Boolean queries for Boolean IRSs [Smith, M., & Smith, M. (1997). The use of genetic programming to build Boolean queries for text retrieval through relevance feedback. Journal of Information Science, 23(6), 423–431] with an advanced evolutionary multiobjective approach [Coello, C. A., Van Veldhuizen, D. A., & Lamont, G. B. (2002). Evolutionary algorithms for solving multiobjective problems. Kluwer Academic Publishers], which obtains several queries with different precision–recall trade-offs in a single run. The performance of the new proposal is tested on the Cranfield and CACM collections and compared to the well-known algorithm of Smith and Smith, showing that it improves the learning of queries and could thus better assist the user in the query formulation process.

Introduction

Information retrieval (IR) may be defined, in general, as the problem of selecting documentary information from storage in response to search questions provided by a user (Baeza-Yates and Ribeiro-Neto, 1999; Salton and McGill, 1983). Information retrieval systems (IRSs) are a kind of information system that deals with databases composed of information items (documents that may consist of textual, pictorial or vocal information) and processes user queries, trying to give the user access to relevant information within an appropriate time interval. Nowadays, the development of the WWW has increased the interest in the study of IRSs.

Many IRSs still consider the Boolean IR model (Van Rijsbergen, 1979), based on the use of Boolean queries where the query terms are joined by the logical operators AND and OR. The user therefore needs a clear understanding of how to connect the query terms using the Boolean operators in order to build a query that defines their information needs. The difficulty that nonexpert users have in formulating these kinds of queries sometimes makes it necessary to design automatic methods for this task. The paradigm of Inductive Query by Example (IQBE) (Chen, Shankaranarayanan, She, & Iyer, 1998), where a query describing the information contents of a set of documents provided by a user is automatically derived, can be useful to assist the user in the query formulation process. Focusing on the Boolean IR model, the best-known existing approach is that of Smith and Smith (1997), which is based on genetic programming (GP) (Koza, 1992), a kind of evolutionary algorithm (EA) (Bäck, Fogel, & Michalewicz, 1997). As usual in the topic (Cordón, Herrera-Viedma, López-Pujalte, Luque, & Zarco, 2003), this approach is guided by a weighted fitness function combining two retrieval accuracy criteria, precision and recall. The main characteristic of this approach is that it provides a single query in each run.
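As a minimal illustration of the ideas in this paragraph, the following Python sketch represents a Boolean query as an expression tree of AND/OR nodes, evaluates it against documents described by sets of index terms, and combines precision and recall into a single weighted score. All names (`Term`, `And`, `Or`, `matches`, `precision_recall`, `weighted_fitness`) and the weights `alpha` and `beta` are assumptions made for illustration; the exact fitness function used by Smith and Smith is not given in this excerpt.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple, Union


@dataclass(frozen=True)
class Term:
    """Leaf node: a single index term."""
    word: str


@dataclass(frozen=True)
class And:
    """Internal node: both sub-queries must match."""
    left: "Query"
    right: "Query"


@dataclass(frozen=True)
class Or:
    """Internal node: at least one sub-query must match."""
    left: "Query"
    right: "Query"


Query = Union[Term, And, Or]


def matches(query: Query, doc_terms: Set[str]) -> bool:
    """Return True if the document's index terms satisfy the query tree."""
    if isinstance(query, Term):
        return query.word in doc_terms
    if isinstance(query, And):
        return matches(query.left, doc_terms) and matches(query.right, doc_terms)
    if isinstance(query, Or):
        return matches(query.left, doc_terms) or matches(query.right, doc_terms)
    raise TypeError(f"unsupported query node: {query!r}")


def precision_recall(
    query: Query, docs: Dict[int, Set[str]], relevant: Set[int]
) -> Tuple[float, float]:
    """Precision = relevant retrieved / retrieved; recall = relevant retrieved / relevant."""
    retrieved = {doc_id for doc_id, terms in docs.items() if matches(query, terms)}
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def weighted_fitness(precision: float, recall: float,
                     alpha: float = 0.5, beta: float = 0.5) -> float:
    """Hypothetical scalar fitness: a weighted sum of the two criteria."""
    return alpha * precision + beta * recall
```

A single weighted score of this kind ranks all candidate queries on one axis, which is why an algorithm guided by it returns one best query per run.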

Given that the retrieval performance of an IRS is usually measured in terms of these two criteria, precision and recall (Van Rijsbergen, 1979), the optimization of any of its components, and in particular the automatic learning of Boolean queries, is a clear example of a multiobjective problem. EAs have been commonly used for IQBE purposes, and their application in the area has usually been based on combining both criteria in a single scalar fitness function by means of a weighting scheme (Cordón, Herrera-Viedma, López-Pujalte, et al., 2003). However, there is a kind of EA specially designed for multiobjective problems, multiobjective evolutionary algorithms, which are able to obtain different nondominated solutions to the problem in a single run (Coello et al., 2002; Deb, 2001). In IR, and specifically in the IQBE paradigm, they would allow us to derive a number of queries with different precision–recall trade-offs in a single run of the IQBE algorithm, and thereby to improve the assistance offered to users in the formulation of their queries.
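To make the notion of nondominated solutions concrete, the following minimal Python sketch filters a list of (precision, recall) pairs down to its Pareto set, assuming both objectives are to be maximized. The function names `dominates` and `nondominated` are illustrative and do not come from the paper.

```python
from typing import Iterable, List, Tuple

Objectives = Tuple[float, float]  # (precision, recall), both maximized


def dominates(a: Objectives, b: Objectives) -> bool:
    """a dominates b if a is no worse in every objective and strictly better in at least one."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def nondominated(points: Iterable[Objectives]) -> List[Objectives]:
    """Return the distinct objective vectors that no other vector dominates, sorted."""
    unique = set(points)
    return sorted(p for p in unique if not any(dominates(q, p) for q in unique))


# Example: (0.4, 0.4) is dominated by (0.5, 0.5); the duplicate is collapsed.
candidates = [(1.0, 0.2), (0.5, 0.5), (0.2, 1.0), (0.4, 0.4), (0.5, 0.5)]
pareto_set = nondominated(candidates)
```

Each surviving pair corresponds to a query offering a different precision–recall trade-off, none of which can be improved in one criterion without losing in the other.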

In this paper, we present a new evolutionary tool for learning Boolean queries, called multiobjective IQBE EA, that improves on the approach of Smith and Smith (1997). We define it by extending their approach with Pareto-based evolutionary multiobjective components incorporated into GP. To do so, we consider one of the best-known and best-performing Pareto-based multiobjective EAs, SPEA (Zitzler & Thiele, 1999). The main feature of this EA is that it maintains the concept of elitism within a multiobjective evolutionary algorithm, which improves the performance of our multiobjective GP algorithm. In order to represent a real-world text retrieval IQBE environment, where a user provides a relatively small number of relevant and irrelevant documents, the experimental testbed is based on two of the best-known small IR benchmarks, the Cranfield and CACM document collections (Baeza-Yates and Ribeiro-Neto, 1999; Salton and McGill, 1983). With our proposal, we improve and extend the possibilities for assisting users in query formulation by means of evolutionary computation tools.
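The elitism mentioned above can be sketched as follows. This is a minimal Python illustration that assumes the strength-based fitness assignment commonly described for the original SPEA: an external archive keeps the nondominated solutions found so far, each archive member receives a strength proportional to the number of population members it covers, and each population member receives a fitness of one plus the strengths of the archive members covering it (lower is better). The function names are hypothetical, the clustering step SPEA uses to bound the archive size is omitted, and the way the paper integrates these components into GP is not shown in this excerpt.

```python
from typing import Dict, List, Sequence, Tuple

Objectives = Tuple[float, float]  # (precision, recall), both maximized


def covers(a: Objectives, b: Objectives) -> bool:
    """a covers b if a is at least as good as b in every objective."""
    return all(x >= y for x, y in zip(a, b))


def dominates(a: Objectives, b: Objectives) -> bool:
    """a dominates b if a covers b and differs from it."""
    return covers(a, b) and a != b


def update_archive(archive: Sequence[Objectives],
                   population: Sequence[Objectives]) -> List[Objectives]:
    """Elitist step: keep only the nondominated vectors of archive and population."""
    candidates = set(archive) | set(population)
    return sorted(p for p in candidates
                  if not any(dominates(q, p) for q in candidates))


def spea_fitness(archive: Sequence[Objectives],
                 population: Sequence[Objectives]
                 ) -> Tuple[Dict[Objectives, float], List[float]]:
    """Return archive strengths and population fitness values (to be minimized)."""
    n = len(population)
    # Strength: fraction of the population covered, scaled to stay below 1.
    strength = {a: sum(covers(a, p) for p in population) / (n + 1)
                for a in archive}
    # Population fitness: 1 plus the strengths of all covering archive members.
    pop_fitness = [1.0 + sum(s for a, s in strength.items() if covers(a, p))
                   for p in population]
    return strength, pop_fitness


population = [(1.0, 0.2), (0.5, 0.5), (0.4, 0.4), (0.3, 0.1)]
archive = update_archive([], population)
strength, pop_fitness = spea_fitness(archive, population)
```

Because archive strengths are always below 1 and population fitness values are at least 1, archive members are always preferred during selection, which is how the best trade-offs found so far are preserved across generations.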

With this aim, this contribution is structured as follows. Section 2 introduces the preliminaries, including the basis of Boolean IRSs, the definition of the precision and recall criteria, the main aspects of IQBE techniques, a review of EAs and of their application to IR tasks, and finally, the main aspects of multiobjective EAs. Section 3 introduces the main aspects of Smith and Smith's proposal and extends that algorithm to deal with the multiobjective problem of simultaneously optimizing both precision and recall by means of the SPEA Pareto-based approach. The experiments developed to test the new proposal and the results obtained are shown in Sections 4 and 5, respectively. Finally, several concluding remarks are pointed out in Section 6.

Section snippets

Boolean IRS

An IRS is basically constituted by three main components, as shown in Fig. 1.

The documentary data base. This component stores the documents and the representation of their information contents. It is associated with the indexer module, which automatically generates a representation for each document by extracting the document contents. Textual document representation is typically based on index terms (that can be either single terms or sequences) which are the content identifiers of the

A multiobjective IQBE EA to learn multiple Boolean queries

Our main objective is to improve on Smith and Smith's results by obtaining several queries instead of just one in a single run. To do so, we will use a multiobjective approach, incorporating Pareto-based evolutionary multiobjective components into GP, whose good behaviour was demonstrated in Rodríguez-Vazquez, Fonseca, and Fleming (1997).

First, we will review Smith and Smith's approach, and then we will introduce our proposal.

Experiments developed

As mentioned above, the experimental study has been developed using the Cranfield and CACM collections. Cranfield is composed of 1398 documents about Aeronautics, while CACM contains 3204 documents published in the journal Communications of the ACM between 1958 and 1979. In both collections, the textual documents have been automatically indexed in the usual way by first extracting the non-stop words and performing a stemming process, thus

Analysis of the Pareto sets derived

Tables 1 and 2 show several statistics corresponding to our multiobjective proposal. These tables collect several data about the composition of the 10 Pareto sets generated for each query, always showing the averaged value and its standard deviation. From left to right, the columns contain the number of nondominated solutions obtained (#p), equal to the number of different objective vectors (i.e., precision–recall pairs) existing among them, and the values of the two multiobjective EA

Concluding remarks

The automatic derivation of Boolean queries has been addressed by incorporating a second-generation multiobjective evolutionary approach, SPEA, into an existing GP-based IQBE proposal. The proposed approach has performed appropriately on 35 queries, 17 from the well-known Cranfield collection and 18 from the CACM collection, in terms of both absolute retrieval performance and the quality of the obtained Pareto sets, allowing us to derive a set of queries with different precision–recall trade-offs.

Acknowledgement

This research has been supported by CICYT under projects TIC2003-07977 and TIC2003-00877 with FEDER funding.

References (47)

O. Cordón et al. A review of the application of evolutionary computation to information retrieval. International Journal of Approximate Reasoning (2003)

W. Fan et al. A generic ranking function discovery framework by genetic programming for information retrieval. Information Processing & Management (2004)

E. Herrera-Viedma et al. A model of fuzzy linguistic IRS based on multi-granular linguistic information. International Journal of Approximate Reasoning (2003)

J. Horng et al. Applying genetic algorithms to query optimization in document retrieval. Information Processing & Management (2000)

C. López-Pujalte et al. A test of genetic algorithms in relevance feedback. Information Processing & Management (2002)

D. Vrajitoru. Crossover improvement for the genetic algorithm in information retrieval. Information Processing & Management (1998)

R. Baeza-Yates et al. Modern information retrieval (1999)

G. Bordogna et al. Fuzzy approaches to extend Boolean information retrieval

M. Boughanem et al. Genetic approach to query space exploration. Information Retrieval (1999)

M. Boughanem et al. On using genetic algorithms for multimodal relevance optimization in information retrieval. Journal of the American Society for Information Science and Technology (2002)

M. Boughanem et al. Multiple query evaluation based on an enhanced genetic algorithm. Information Processing & Management (2003)

V. Chankong et al. Multiobjective decision making theory and methodology (1983)

H. Chen et al. A machine learning approach to Inductive Query by Examples: An experiment using relevance feedback, ID3, genetic algorithms, and simulated annealing. Journal of the American Society for Information Science (1998)

Chen, Y., & Shahabi, C. (2001). Automatically improving the accuracy of user profiles with genetic algorithm. In...

C.A. Coello et al. Evolutionary algorithms for solving multi-objective problems (2002)

Cordón, O., Herrera-Viedma, E., Luque, M., Moya, F., & Zarco, C. (2003). Analyzing the performance of a multiobjective...

O. Cordón et al. A GA-P algorithm to automatically formulate extended Boolean queries for a fuzzy information retrieval system. Mathware & Soft Computing (2000)

O. Cordón et al. A new evolutionary algorithm combining simulated annealing and genetic programming for relevance feedback in fuzzy information retrieval systems. Soft Computing (2002)

O. Cordón et al. Automatic learning of multiple extended Boolean queries by multiobjective GA-P algorithms

K. Deb. Multi-objective optimization using evolutionary algorithms (2001)

J. Fernández-Villacanas et al. Investigation of the importance of the genotype–phenotype mapping in information retrieval. Future Generation Computer Systems (2003)

D. Fogel. System identification through simulated evolution: A machine learning approach (1991)
