Abstract
This paper describes a method that uses Genetic Programming to automatically determine term weighting schemes for the vector space model. Based on a set of queries and their human-determined relevant documents, weighting schemes are evolved which achieve a high average precision. In Information Retrieval (IR) systems, useful information for term weighting schemes is available from the query, from individual documents and from the collection as a whole.
We evolve term weighting schemes in both the local (within-document) and global (collection-wide) domains, such that the two kinds of weighting combine effectively to achieve a high average precision. These weighting schemes are tested on well-known test collections and are compared to the traditional tf-idf weighting scheme and to the BM25 weighting scheme using standard IR performance metrics.
Furthermore, we show that the global weighting schemes evolved on small collections also increase average precision on larger TREC data. These global weighting schemes are shown to adhere to Luhn’s notion of resolving power, in that both high-frequency and low-frequency terms are assigned low weights. However, the local weightings evolved on small collections do not perform as well on large collections. We conclude that, in order to evolve improved local (within-document) weighting schemes, it is necessary to evolve them on large collections.
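As a rough illustration of the setting described in the abstract, the sketch below scores documents with a local weight (raw term frequency) multiplied by a global weight (a plain logarithmic idf) and evaluates a ranking with average precision. This is not the evolved scheme from the paper, nor its exact tf-idf or BM25 baselines; the function names, the toy collection and the specific idf formula are assumptions chosen for illustration only.

```python
import math
from collections import Counter

def idf(term, docs):
    """Global (collection-wide) weight: log(N / df). Illustrative variant only."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def score(query, doc, docs, local=lambda tf: tf, global_weight=idf):
    """Sum of local * global weights over the query terms present in the document."""
    tf = Counter(doc)
    return sum(local(tf[t]) * global_weight(t, docs) for t in query if tf[t] > 0)

def average_precision(ranking, relevant):
    """Mean of the precision values taken at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical toy collection: each document is a list of tokens.
docs = [
    ["genetic", "programming", "evolves", "programs"],
    ["term", "weighting", "in", "retrieval"],
    ["genetic", "term", "weighting", "weighting"],
]
query = ["term", "weighting"]

# Rank document ids by descending score.
ranking = sorted(range(len(docs)),
                 key=lambda i: score(query, docs[i], docs),
                 reverse=True)
```

In a Genetic Programming setup of the kind the abstract describes, candidate expressions would presumably take the place of the `local` and `global_weight` arguments, with average precision over the training queries serving as the fitness value; the exact function and terminal sets are not given in this text.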
Cite this article
Cummins, R., O’Riordan, C. Evolving local and global weighting schemes in information retrieval. Inf Retrieval 9, 311–330 (2006). https://doi.org/10.1007/s10791-006-1682-6