An Adaptive Document Classification Agent



Chris Clack, Jonny Farringdon, Peter Lidwell and Tina Yu
Department of Computer Science
University College London
RN/96/45

Key words: genetic programming, machine learning, text information filter


Correspondence to: C.Clack@cs.ucl.ac.uk
This work has been undertaken as part of a collaborative research project with Friends of the Earth ("FOE") funded by the Department of Trade and Industry.


Abstract

Overview

This report discusses a new intelligent text filtering (classification) application [9] which utilises genetic programming methods [4, 5]. The development of an advanced text filter/classification system with learning capabilities is part of a broader research programme at UCL-CS and FOE.

This theoretical research combines genetic programming design decisions and scientific investigation with an intelligent text filter application. The needs of commercial exploitation are taken into account as the development of this system advances, while the more abstract work on the science of genetic programming is facilitated by the use of a functional programming test bed. The text filter application described here (while implemented in custom-built software at UCL/FOE) is self-contained and relies only upon an appropriate genetic programming platform [5], many of which are freely available [7].

The need for an adaptive text filter

The adaptive text filter system described here uses emerging technology to meet a rapidly increasing need - that of deriving structure and priority from the flood of information unleashed in this information age.

Keywords are Not Enough

Keywords are a favoured technique for classifying documents. However, a modern document archive can contain thousands of documents and it is infeasible to ask a user to provide these keywords.

Asking a busy user to make an exhaustive list of the keywords they look for in a document would require a significant amount of their time, and would certainly not cover all the keywords they actually judge by in practice. This is because enumerating, on demand, all the key words that interest and affect a person is a very unnatural act. When a person reads a document, the groups of features that are brought to mind depend on context.

Rather than asking the user to provide keywords, the system may derive them automatically:

Our text classifier currently uses a list of key-words supplied by the user, though we have started to implement automatic methods. For example, we invert a word-frequency approach and use a lexicon of the top 1,000 high-frequency words as an exclusion list. To illustrate: there are 4771 words in (the draft of) this document, 1362 of which are occurrences of the top 60 high-frequency words. Excluding just those top 60 words therefore reduces the candidate key-word list by almost 30%; using the top 1,000 words, this document could be rapidly reduced to a small key-word list.
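As a rough sketch of the exclusion-list idea (the tokenisation rule and the lexicon file name below are illustrative assumptions, not a description of our implementation), the candidate key-words of a document are simply its words minus those appearing in the high-frequency lexicon:

    import re
    from collections import Counter

    def candidate_keywords(text, exclusion_words):
        """Return the words of `text` that survive the high-frequency exclusion
        list, together with their frequencies, most frequent first."""
        words = re.findall(r"[a-z']+", text.lower())
        survivors = [w for w in words if w not in exclusion_words]
        return Counter(survivors).most_common()

    # Hypothetical usage: "top1000.txt" stands for a lexicon of the 1,000
    # highest-frequency English words (Appendix 2 lists the top 60).
    # exclusion = set(w.lower() for w in open("top1000.txt").read().split())
    # print(candidate_keywords(open("draft.txt").read(), exclusion)[:20])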

Finally, we note that keywords on their own provide an impoverished representation of the original text. Thus, we are investigating other analysis techniques such as word pairs and proximity.

Current Commercial Systems are Failing

Single key-word classification is not sufficient in current applications. This can be seen, for example, in the case of a Web "nanny" for children which used the single word "couples" to screen out sexually explicit material. This had the incidental effect of screening out the (not very salubrious) US White House pages amongst many others, where Bill and Hillary are described as a couple [2]. On CompuServe, a Usenet posting screen rejected any document containing the word "breast", instantly clobbering the Usenet cancer support groups for CompuServe members. Language is rich, and single key-word classification is self-evidently insufficient.

Information is Dynamic

Language, slang, jargon and hot topics are forever changing. To maintain accurate classification, a system must be able to adapt its search criteria over time; it is this ability to adapt that allows a genuinely responsive filter to be built for such a rapidly changing world.

The Text Filter Application

The Working Environment

An intelligent text filter could be put to good use in any environment where large volumes of (on-line) textual documents need attending to, and where the abstract nature of a document (say, its subject classification) can aid in prioritising or structuring working patterns. For example, in a busy office where each person receives hundreds of e-mails a day, it is very useful to have incoming messages automatically classified by importance. Knowing the subject area of a document means work can be distributed efficiently, and critical messages can make themselves known to people as soon as they arrive.

Equally, in a text retrieval system - say, when browsing the Web in search of a document relevant to some existing collection of documents or "bookmarks" - a text filter could rapidly scan entire documents and rate their relevance. It is envisaged that this process could be automated and integrated into current Web indexing and browsing systems, which becomes particularly important as the PICS system of Web content classification comes into use [1].

[figure 1 here]
Figure 1. First Time Use. Significant learning takes place upon the initial use of the system. This computationally expensive process is done in "one-shot".

How It Works

Figure 1 shows how the text filter agent learns to classify documents. Parse trees are generated which, when run, will provide a classification value. Figure 2 shows the parse trees being applied to new documents.
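A minimal sketch of that first-time learning cycle, assuming a standard generational genetic programming loop (the function names and parameters below are ours, not those of the UCL/FOE software; the tree-construction, fitness and evolution routines are passed in as arguments, the latter two being of the kind sketched in later sections):

    def learn_classifier(training_tables, gold_scores,
                         random_tree, fitness, evolve,
                         pop_size=200, generations=50):
        """One-shot learning: evolve a population of parse trees against
        pre-graded training documents and return the fittest tree found.
        `fitness` returns a distance from the gold standard (0 = perfect);
        `evolve` produces the next population from a ranked one."""
        population = [random_tree() for _ in range(pop_size)]
        best = None
        for _ in range(generations):
            ranked = sorted(population,
                            key=lambda t: fitness(t, training_tables, gold_scores))
            best = ranked[0]
            if fitness(best, training_tables, gold_scores) == 0.0:
                break                   # perfect classification of the training set
            population = evolve(ranked)
        return best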

Testing

Approximately 500 pre-graded documents per category are set aside for training at the development stage. Categories for the trials include {teaching, research, administration}, typical mail folders of a busy academic. In practice the training data would be live e-mail folders, Web bookmark lists, and so forth.

[figure 2 here] Figure 2. Everyday Use. New documents are classified by the agent, which triggers appropriate actions. The usual act of filing a document (in a mail folder, say) is used as feedback to the agent and can trigger top-up learning upon specific parse trees which under-perform.

The pre-processor

The pre-processor (a process too simple to qualify as a "parser") maps a text document to a table of words and numerical locations for those words. Thus far there is no data loss.
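A minimal sketch of this mapping (the whitespace tokenisation below is an assumption for illustration, not a description of the UCL/FOE pre-processor):

    from collections import defaultdict

    def preprocess(text):
        """Map a text document to a table: word -> list of the numerical
        positions (0, 1, 2, ...) at which the word occurs. Nothing is lost:
        the original word sequence can be rebuilt from the table, and the
        frequency of a word is simply the length of its position list."""
        table = defaultdict(list)
        for position, word in enumerate(text.split()):
            table[word].append(position)
        return dict(table)

    table = preprocess("the cat sat on the mat")
    assert table["the"] == [0, 4]      # "the" occurs twice, at positions 0 and 4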

The benefits of pre-processing

Space saving

The storage space required to hold the table will normally be far less than the (computational) space required to store the text document itself. This reduction in storage space is an incidental gain; a constraint under which the saving holds is shown in Appendix 1.

Speed

At the research stage one is not overly concerned with optimising any particular algorithm; nevertheless, all processing in the genetic programming system involves operators which access the document (in one representation or another), and applying a numerically based operator to a table is much faster than applying the equivalent textual operator to the original document. All the word-based parse tree operators used are (in the general case) executed faster on a table representation than on a text one, so the initial pre-processing overhead (in run-time as well as software development time) saves considerable execution time during the evolution (learning) stages. The decision to pre-process the data is therefore based upon its ease of implementation and the speed gain for the evolutionary system.

Data Loss

In addition to the 1-1 mapping which generates the table representation of a document, some data-loss filters can be applied; these are intended to simplify the table or reduce "noise". The data-loss filters considered are:

Removal of High Frequency Words

In this system the top 60 high-frequency words in the English language (current usage, Aug 1994, shown in Appendix 2) are removed from the table. Because such high-frequency words will be found in almost every document they hold little or no discriminating potential; their removal therefore reduces the search space (which is some increasing function of the number of different words in a document) while losing little or no classification power.

This has the effect of keeping the text representation simple, thus speeding things up. If these high frequency words were left in the document representation then the text filter's classification results should be exactly the same, only the search space would be larger and learning would take significantly longer.

Reduction to base form

This filter is not implemented in our system but is considered important enough to warrant documentation. Some words could be reduced (in English and some other languages) to their base form: plurals, past tenses, third-person forms, etc. For example, "cats" -> "cat", "editing" -> "edit". Data is lost by this process, but generality is gained. Such a filter may make a (desirable) difference to the effectiveness of the classifier. In the current implementation of the pre-processor, {speak, speaks, spoke} are identified as three different words.

This filter would have the effect of aiding the system in co-classifying separate documents which used (for example) these words - "budget", "budgeting", and "budgeted".
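A crude sketch of what such a (currently unimplemented) filter might look like; the suffix rules below are illustrative assumptions only, far simpler than a proper stemming algorithm:

    SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]  # illustrative only

    def base_form(word):
        """Reduce a word towards its base form by naive suffix stripping.
        Irregular forms (e.g. "spoke" -> "speak") are not handled."""
        for suffix, replacement in SUFFIX_RULES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)] + replacement
        return word

    assert base_form("cats") == "cat"
    assert base_form("budgeting") == "budget"
    assert base_form("budgeted") == "budget"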

Removal of e-mail signatures

If applied to an e-mail message, any ".signature" could be ignored, assuming the convention that everything after a line consisting only of "--" (the "RTN--RTN" delimiter) is a signature. This will filter out "noise" such as random quotes, ASCII artwork, place names and popular philosophy.
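A sketch of such a filter, assuming that widespread (but not universal) delimiter convention:

    def strip_signature(message):
        """Drop everything from the first line consisting only of "--" onwards,
        treating it (and what follows) as the e-mail signature."""
        body = []
        for line in message.splitlines():
            if line.strip() == "--":
                break
            body.append(line)
        return "\n".join(body)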

Special & Meta Pre-processing Information

Some extra information can be generated from a pre-processor which carries considerable classification potential. For example:

The feature detectors can make significant use of these items.

The Parse Trees


A parse tree in this system is a computer "programme" that will be evolved into a good classifier of documents.

In this system a parse tree is a programme that can be evaluated by an emulator which matches nodes of the tree to aspects of a document. During evaluation against a document the tree ultimately reduces to a single numerical value - the classification. The method used here has operators at each branch of the tree, all of which produce a number. Leaves of the tree (terminals) are nodes containing word operators and their arguments (words), or numerical-constant terminals. Thus any part of a tree can be interchanged with another part and the tree remains valid - perfect for flexible evolution and mutation. An example parse tree is shown in Figure 3.
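The sketch below illustrates the idea (the exact operator set shown is an assumption for illustration; the operators actually provided are listed in the following section). Every node, whether an internal numeric operator, a word-operator leaf with its bound word argument, or a numerical constant, evaluates to a single real number, which is why any subtree can stand in for any other:

    def evaluate(node, table):
        """Reduce a parse tree to a single real number against one document,
        represented as its pre-processed table (word -> list of positions)."""
        kind = node[0]
        if kind == "const":                  # numerical-constant terminal
            return float(node[1])
        if kind == "freq":                   # word operator: frequency of its word
            return float(len(table.get(node[1], [])))
        if kind in ("add", "mul"):           # numeric operators binding two subtrees
            left, right = evaluate(node[1], table), evaluate(node[2], table)
            return left + right if kind == "add" else left * right
        raise ValueError("unknown node: %r" % (node,))

    # An illustrative tree: classification = 2.0 * freq("complaint") + freq("refund")
    tree = ("add",
            ("mul", ("const", 2.0), ("freq", "complaint")),
            ("freq", "refund"))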

One drawback of this method is that word-operator nodes of the tree are strongly bound to their arguments, so the evolution operators cannot easily (in this numerically based approach) manipulate the words in a tree independently of the word operator. For example, a leaf of the parse tree may evaluate to the frequency of "complaint". During evolution it is not a simple matter to exchange the word "complaint" for another word; rather, the node frequency-of-"complaint" is exchanged in its entirety. This rigid approach is useful because an evolutionary exchange at a word-operator node may be not for another leaf but for a complete sub-tree.

[Figure 3 here]
Figure 3. An example parse tree.

Parse Tree Operators

Every parse tree operator reduces to a single (real) number, and thus a tree of these operators also reduces to one numerical value. It should be noted that while the following operators are made available to the system, there is no requirement for any particular one to be used.

Feature Detectors: Word Operators

Binding Groups of Features: Numeric Operators

Evolution Operators. Manipulating the Parse Trees

During learning, the evolution operators are applied by an evolution engine with access to the fitness value of each parse tree in the population. Little has so far been said about the evolution of these parse trees, apart from making the trees particularly easy to manipulate. Two particular schemes in the spirit of natural evolution (crossover and mutation) are popular in current genetic programming systems [4, 5, 6] and are used to evolve the parse trees.
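A sketch of how crossover and mutation might manipulate trees of the form used in the earlier evaluation sketch (these particular routines are illustrative assumptions; any standard genetic programming engine [4, 5, 6] provides equivalents):

    import random

    def subtrees(tree):
        """Yield every subtree of a parse tree, including the tree itself."""
        yield tree
        if tree[0] in ("add", "mul"):        # internal numeric operators
            for child in tree[1:]:
                yield from subtrees(child)

    def replace(tree, target, new):
        """Return a copy of `tree` with the chosen subtree `target` replaced by `new`."""
        if tree is target:
            return new
        if tree[0] in ("add", "mul"):
            return (tree[0],) + tuple(replace(child, target, new) for child in tree[1:])
        return tree

    def crossover(parent_a, parent_b):
        """Graft a randomly chosen subtree of one parent onto a random point of the other."""
        donor = random.choice(list(subtrees(parent_b)))
        target = random.choice(list(subtrees(parent_a)))
        return replace(parent_a, target, donor)

    def mutate(tree):
        """Replace a randomly chosen subtree with a fresh numerical-constant leaf."""
        target = random.choice(list(subtrees(tree)))
        return replace(tree, target, ("const", random.uniform(-10.0, 10.0)))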

Population Control

Traditional population control methods (with analogies in natural evolution, though very different from population control in humans) are available in many genetic programming systems. One standard approach is for a very few members of a population to change, at some rate X. An alternative (non-standard) method, which shall be explored later in this research programme, is for many members to change at a much smaller rate, delta-X.
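The sketch below captures only the "how many members change each step" half of that distinction (the replacement policy and fractions are illustrative assumptions; `breed` could, for instance, be the crossover routine sketched above):

    import random

    def next_generation(ranked, breed, change_fraction):
        """Replace the weakest `change_fraction` of a population (ranked fittest
        first) with children bred from the survivors. A small fraction gives the
        standard "few members change" scheme; a large fraction, the alternative."""
        keep = max(2, int(len(ranked) * (1.0 - change_fraction)))
        survivors = ranked[:keep]
        children = [breed(random.choice(survivors), random.choice(survivors))
                    for _ in range(len(ranked) - keep)]
        return survivors + children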

The Fitness Function

During learning the fitness function will score each parse tree in the population based upon the gold-standard (the pre-defined score for a document), the result of the parse tree, some user parameters, and some distance function.

In its simplest form, the fitness function could just take the difference between the correct (desired) score and the one generated by the programme. Here we use a least-squares distance between the gold standard and the parse tree's score, summed over all the test documents for each parse tree. Thus a parse tree giving perfect classification would have a sum distance of zero from the gold standard; as a parse tree becomes worse at classification the sum distance increases.
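In code, that simplest least-squares form might look like the following sketch (evaluate is the tree emulator sketched earlier; the argument layout is ours):

    def fitness(tree, training_tables, gold_scores):
        """Sum, over all training documents, of the squared distance between the
        gold-standard score and the score the parse tree produces. Zero means a
        perfect classifier; the worse the classification, the larger the sum."""
        return sum((gold - evaluate(tree, table)) ** 2
                   for table, gold in zip(training_tables, gold_scores))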

What can go wrong?

Messages can set out to confuse automated classification systems, or may do so inadvertently. Such messages may become more common as use of e-mail by libertarians and the paranoid expands. For such users there are facilities like "spook" in the (world's most) popular editor Emacs, which adds random text to a document in order to confuse the secret service or others monitoring e-mail with text filters.

This may be of no consequence to expected user groups in academia, home and business, but the serious point here is that any document may be mis-classified, either because it goes out of its way to disguise its true nature or content, or because some significant body of off-topic text confuses the classification. For example, this innocent postscript could easily confuse a classification: "PS: Did you see the great goal Arsenal scored on Saturday?".

As a second example, a random quote in an e-mail signature may bias a classification, thus signatures should be considered as noise.

Another situation in which erroneous results might occur is when the context for classification changes faster than the system can learn how those changes affect classification. Changes to the type of documents in a classification (an e-mail folder, say) probably happen in fits and starts, so there may be short periods when the classifier is actively (top-up) learning in an effort to catch up. It can be assumed that most of the time (under normal use) the classifications are both stable and maximal.

Customer Base

This list is far from exhaustive, but shows the broad need such a system would satisfy. We envisage a text filter product as an independent piece of (service) software whereby other software applications can query the filter; thus only one filter need be running, perhaps even centrally over a LAN, for the convenience of many other applications.

Work in Progress & Conclusions

The system outlined here is under current software development at UCL and FOE. Any software and results will be made available in due course from UCL Computer Science.

A specific area of interest for the future is a comparison of this adaptive genetic programming method (a so-called AI method) against the more traditional method of a key-word search. The software under development has not yet reached the testing stage, so comparisons with other classification methods such as key-word matching are not yet possible.

Current results involving the evolution of key-word parse trees are promising, with appropriately weighted trees being evolved rapidly; an example of one such tree is shown in Figure 3.

The need for an adaptive text classification agent grows every day (in proportion to the size of the Internet and the increased use of on-line documents). Interim results show the viability of implementing such an agent using genetic programming methods. An agent capable of initial and continued long term top-up learning would be of great benefit to a considerable population of computer users, and is within the bounds of current machine learning technology.


Appendix 1

Constraints such that a table representation requires less storage space than an original text document.

Key:
Average Word length	AWL
No. of Words		NW
No. of Different Words	NDW

Space required for the table: 		(AWL . NDW) + NW
Space required for the text document:	 AWL . NW

	      NDW <=       NW		by definition
=>	AWL . NDW <= AWL . NW


The table requires less space than the original document when

	AWL . NDW + NW < AWL . NW
        AWL . NDW      < AWL . NW - NW
	AWL . NDW      < (AWL - 1) NW 


	     NDW	   AWL - 1
     	   -------     <   -------
	     NW              AWL


      No. of Different Words            Average Word Length - 1
==   -----------------------    <      -------------------------
        No. of Words			Average Word Length

A quick look at this inequality for average word lengths of {4, 5, 6} shows the relationship between the number of different words and the total number of words.

AWL	NDW / NW must be less than
4	3/4
5	4/5
6	5/6

So, with an average word length of, say, 5, as long as fewer than 4 out of every 5 words are different, the table saves space.

A draft of this document, for example, has 1800:4771 different-words:total-words - approximately 9 different words out of every 24, well within the constraint, indicating that a table representation of this document would be significantly smaller than the document itself.


Appendix 2

Top 60 high frequency words, current English usage, [8]. The score indicates relative frequency.

THE 69971
OF 36411
AND 28852
TO 26149
IN 21341
THAT 10595
IS 10099
WAS 09816
HE 09543
FOR 09489
IT 08756
WITH 07289
AS 07250
HIS 06997
ON 06742
BE 06377
AT 05378
BY 05305
I 05173
THIS 05146
HAD 05133
NOT 04609
ARE 04393
BUT 04381
FROM 04369
OR 04207
HAVE 03941
AN 03747
THEY 03618
WHICH 03562
ONE 03292
YOU 03286
HER 03037
ALL 03001
SHE 02859
THERE 02724
WOULD 02714
THEIR 02670
WE 02653
HIM 02619
BEEN 02472
HAS 02439
WHEN 02331
WILL 02244
MORE 02216
NO 02201
IF 02199
OUT 02096
SO 01984
SAID 01961
WHAT 01908
UP 01895
ITS 01858
ABOUT 01815
INTO 01791
THEM 01789
THAN 01789
CAN 01772
ONLY 01747
OTHER 01702


References

[1]. Paul Resnick & James Miller. 1996. PICS: Internet Access Controls Without Censorship. World Wide Web Consortium, MIT, Cambridge, MA. URL:http://www.w3.org/

[2]. David Mazzarella (Ed.) 1996. "Editorial: Misfire on the 'Net." pp8A. USA Today/International Edition, February 28, 1996.

[3]. Clive Parker. 1996. "Parents seal off the no-go areas." Interface section pp8. The Times. March 13, 1996. London

[4]. David E Goldberg. 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.

[5]. Z Michalewicz. 1992. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag.

[6]. John R. Koza and James P. Rice. 1992. Genetic Programming - The Movie. MIT Press.

[7]. Koza's Genetic Programming in LISP, from the book Genetic Programming II: litlflet.lsp. Plus GP in C++, gpcpp4.zip and gpc++-0.4.tar.gz, are available with other implementations from the usual FTP archives.

[8]. P T Quinlan. 1992. Oxford psycholinguistic database.

[9]. Terry R Payne & Peter Edwards. 1995. Learning for Information Filtering Agents. University of Aberdeen, AUCS/TR9509. URL: http://www.csd.abdn.ac.uk/~pedwards/research.html


The moral right of Clack, Farringdon, Lidwell and Yu to be recognised as the authors of this work is asserted. June 19 1996.