Abstract
We describe a novel method for using Genetic Programming to create compact classification rules based on combinations of N-Grams (character strings). Genetic programs acquire fitness by producing rules that are effective classifiers, in terms of precision and recall, when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from a classification task using the Reuters-21578 dataset. We also suggest that, because the induced rules are meaningful to a human analyst, they may have a number of other uses beyond classification and may provide a basis for text mining applications.
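To make the idea of scoring a rule by precision and recall concrete, the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not list the function and terminal set or say how precision and recall are combined, so the helper names (contains, any_of, all_of, fitness), the use of F1 as the combined score, and the toy documents are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# A rule maps a document's text to True (assign to category) or False.
Rule = Callable[[str], bool]


def contains(ngram: str) -> Rule:
    """Hypothetical terminal: true when the character n-gram occurs in the text."""
    return lambda doc: ngram in doc.lower()


def any_of(*rules: Rule) -> Rule:
    """Hypothetical function node: logical OR over sub-rules."""
    return lambda doc: any(r(doc) for r in rules)


def all_of(*rules: Rule) -> Rule:
    """Hypothetical function node: logical AND over sub-rules."""
    return lambda doc: all(r(doc) for r in rules)


def fitness(rule: Rule, labelled_docs: List[Tuple[str, bool]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of a rule on labelled training documents.

    Combining precision and recall as F1 is an assumption of this sketch.
    """
    tp = fp = fn = 0
    for text, in_category in labelled_docs:
        predicted = rule(text)
        if predicted and in_category:
            tp += 1
        elif predicted:
            fp += 1
        elif in_category:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy training set (invented for illustration, not taken from Reuters-21578).
docs = [
    ("Wheat exports rose sharply", True),
    ("Grain and wheat prices fell", True),
    ("Corn harvest delayed by rain", True),
    ("Bank raises interest rates", False),
    ("Oil traders await OPEC meeting", False),
]

# A candidate rule: the document contains "whea" OR "rai".
rule = any_of(contains("whea"), contains("rai"))
precision, recall, f1 = fitness(rule, docs)
print(precision, recall, round(f1, 3))
```

On this toy set the rule matches every in-category document but also fires on "raises", so recall is perfect while precision is not. A rule written this way also stays readable as a plain combination of character strings, which is the property the abstract highlights for a human analyst.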
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hirsch, L., Saeedi, M., Hirsch, R. (2005). Evolving Rules for Document Classification. In: Keijzer, M., Tettamanzi, A., Collet, P., van Hemert, J., Tomassini, M. (eds) Genetic Programming. EuroGP 2005. Lecture Notes in Computer Science, vol 3447. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31989-4_8
DOI: https://doi.org/10.1007/978-3-540-31989-4_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25436-2
Online ISBN: 978-3-540-31989-4
eBook Packages: Computer Science, Computer Science (R0)