An Adaptive Document Classification Agent
Chris Clack, Jonny Farringdon, Peter Lidwell and Tina Yu
Department of Computer Science
University College London
RN/96/45
Key words: genetic programming, machine learning, text information
filter
Correspondence to: C.Clack@cs.ucl.ac.uk
This work has been undertaken as part of a collaborative research project with
Friends of the Earth ("FOE") funded by the Department of Trade and Industry.
Abstract
The development of an intelligent text classification application that
utilises genetic programming methods is discussed. Learning capabilities are
used to build an adaptive system that meets the needs of dynamic-information
users. Target environments are discussed in which large volumes of (on-line)
textual documents are manipulated and structure and priority must be derived
from text.
Overview
A new intelligent text filtering (classification) application, [9], is
discussed here which utilises genetic programming methods, [4, 5]. The
development of an advanced text filter/classification system with learning
capabilities is part of a broader research programme at UCL-CS and FOE:
- Developing (useful) systems using Genetic Programming techniques.
- Developing a Genetic Programming system using Functional Programming
techniques.
- Developing novel Genetic Programming science and technology.
This theoretical research involves genetic programming design decisions
and scientific investigation based on an intelligent text filter application.
The needs of
commercial exploitation are taken into account as the development of this
system advances, while the more abstract work involving genetic programming
science is facilitated by the use of a functional programming test bed. The
text filter application described here (while implemented in custom-built
software at UCL/FOE) is self-contained and relies only upon an appropriate
genetic programming platform, [5], many of which are freely available, [7].
The need for an adaptive text filter
The adaptive text filter system described here uses emerging technology to
meet a rapidly increasing need - that of deriving structure and priority from
the flood of information unleashed in this information age.
Keywords are Not Enough
Keywords are a favoured technique for classifying documents.
However, a modern document archive can contain thousands of
documents, and it is infeasible to ask a user to provide the
keywords by which they should be classified.
Asking a busy user to make an exhaustive list of the keywords they look
for in a document would require a significant amount of their time, and
would certainly not cover all the keywords that they make judgments upon
in practice. This is because enumerating, on demand, all the key words
that interest and affect a person is a very unnatural act. When a person
reads a document, the groups of features that are brought to mind are
dependent on context.
Rather than ask the user to provide keywords, they may be introduced
automatically:
- a lexicon of low-frequency words (words rarely used in normal English)
can be used to identify unusual words that see a high level of use in a
given document.
- all words (and combinations) from a document can be used as the starting
point for a learning system to derive the important key-words itself.
However, Payne & Edwards [9] discuss this method and conclude that the
complexity of using all words and combinations makes machine learning ill
suited to the task.
Our text classifier currently uses a supplied list of key-words from the
user, though we have started to implement automatic methods. For example, we
invert the first of the above two methods and use a lexicon of the top 1,000
high-frequency words as an exclusion list. To illustrate: there are 4771
words in (the draft of) this document, 1362 of which are occurrences of the
top 60 high-frequency words. Excluding just those top 60 words reduces the
candidate key-word list by about 30%; using the top 1,000 words would
rapidly reduce this document to a small key-word list.
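By way of illustration, the following minimal Python sketch performs this
exclusion-list filtering. It is not the UCL/FOE implementation (which runs on
a genetic programming platform); the function name and the tiny stand-in
stop-word set are ours, and the real system would use the full top-1,000
lexicon.

    import re
    from collections import Counter

    # A handful of Appendix 2 words stand in for the full top-1,000 lexicon.
    STOP_WORDS = {"the", "of", "and", "to", "in", "that", "is", "was", "he", "for"}

    def candidate_keywords(text, exclude=STOP_WORDS):
        """Candidate key-words: the document's words minus the exclusion list."""
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(w for w in words if w not in exclude)
        return counts.most_common()      # most frequent candidates first

    print(candidate_keywords("The cat sat on the mat, and the cat slept."))
    # [('cat', 2), ('sat', 1), ('on', 1), ('mat', 1), ('slept', 1)]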
Finally, we note that keywords on their own provide an impoverished
representation of the original text. Thus, we are investigating other analysis
techniques such as word pairs and proximity.
Current Commercial Systems are Failing
Single key-word classification is not sufficient for current
applications. This can be seen, for example, in the case of the Web
"nanny" for children which used the single word "couples" to screen out
sexually explicit material. This had the incidental effect of screening
out the (not very salubrious) USA White House pages, amongst many
others, where Bill and Hillary are described as a couple [2]. On
CompuServe a Usenet post screen rejected any document containing the
word "breast", instantly clobbering the Usenet cancer support groups for
CompuServe members. Language is rich, and single key-word classification
is self-evidently insufficient.
Information is Dynamic
Language, slang, jargon and hot-topics are forever changing. To maintain
accurate classification a system must be able to adapt its search criteria.
The ability of an automated classification system to adapt over time
allows for a responsive filter to be developed. This technology is
appropriate in such a rapidly changing world.
The Text Filter Application
The Working Environment
An intelligent text filter could be put to good use in any environment where
large volumes of (on-line) textual documents need attending to, and where the
abstract nature of the document (say its subject classification) can aid in
prioritising or structuring working patterns. For example, in a busy office
where each person receives hundreds of e-mails a day, it is very useful to
have incoming messages automatically classified by importance. Knowing the
subject area of a document means work can be distributed efficiently, and
critical messages can make themselves known to people as soon as they arrive.
Equally, in a text retrieval system - say, when browsing the Web in search of
a document relevant to some existing collection of documents or "bookmarks" -
a text filter could rapidly scan entire documents and rate their relevance.
It is envisaged that this process could be automated and integrated into
current Web indexing and browsing systems; this becomes particularly
important as the PICS system of Web content classification comes into use [1].
[Figure 1 here]
Figure 1. First Time Use. Significant learning takes place upon the initial
use of the system. This computationally expensive process is done in
"one-shot".
How It Works
Figure 1 shows how the text filter agent learns to classify documents.
Parse trees are generated which, when run, will provide a classification
value. Figure 2 shows the parse trees being applied to new documents.
Testing
Approximately 500 pre-graded documents per category are set aside for training
at the development stage. Categories for the trials include {teaching,
research, administration}, typical mail folders of a busy academic. In
practice the training data would be live e-mail folders, Web bookmarks lists,
and so forth.
[Figure 2 here]
Figure 2. Everyday Use. New documents are classified by the agent, which
triggers appropriate actions. The usual act of filing a document (in a
mail folder, say) is used as feedback to the agent and can trigger top-up
learning upon specific parse trees which underperform.
The pre-processor
The pre-processor (a process too simple to qualify as a "parser") maps a text
document to a table of words and numerical locations for those words. Thus
far there is no data loss.
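A minimal sketch of such a pre-processor in Python (the function name and
the tokenisation are our own illustrative choices, not the UCL/FOE code):

    import re
    from collections import defaultdict

    def preprocess(text):
        """Map a document to a table: word -> list of numerical word positions.
        The word sequence is recoverable from the table, so no data is lost
        (the case-folding here is an illustrative simplification)."""
        table = defaultdict(list)
        for position, word in enumerate(re.findall(r"[a-z']+", text.lower())):
            table[word].append(position)
        return dict(table)

    table = preprocess("the cat sat on the mat")
    # {'the': [0, 4], 'cat': [1], 'sat': [2], 'on': [3], 'mat': [5]}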
The benefits of pre-processing
Space saving
The storage space required to hold the table will normally
be far less than the (computational) space required for storing the text
document. This reduction of storage space is an incidental gain; a
constraint relation for this space saving is shown in Appendix 1.
Speed
At the research stage one is not overly concerned with
optimising algorithms; nevertheless, all processing in the
genetic programming system involves operators which access the document
(in one representation or another). Applying numerically-based operators
to a table is much faster than applying equivalent textual operators to
the original document. The following two examples illustrate the
significant execution-time benefit of using a table representation, hence
applying the pre-processor is beneficial.
- Counting the frequency of a word. Each operator in the evolving
parse tree that requires the frequency of a word (perhaps thousands will) can
either scan the entire original text word by word, counting occurrences,
or simply look up the word in the table and calculate the frequency
directly without having to process any other part of the document.
- Existence. An operator that wishes to check the existence of a word in the
document can either scan the original document word by word (possibly having
to scan to the very last word in the document), or simply look up the word
in the table.
All the word-based parse tree operators used are executed faster using a table
representation than a text one (in the general case). Thus the initial
pre-processing overhead (in run-time as well as software development time)
will save considerable execution time during the evolution (learning) stages.
The decision to pre-process the data is thus based upon its ease of
implementation and the speed gain for the evolutionary system.
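Under the table representation sketched above, both example operators become
dictionary lookups rather than document scans. A hedged sketch (the function
names are ours):

    def frequency(table, word, doc_length):
        """Relative frequency via one table lookup: no scan of the document."""
        return len(table.get(word, [])) / doc_length

    def exists(table, word):
        """Existence is a constant-time hash lookup, not a word-by-word scan."""
        return word in table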
Data Loss
In addition to the 1-1 mapping which generates the table representation of a
document, some data loss filters can be applied; these are intended to
simplify the table or reduce "noise". The data loss filters considered are:
Removal of High Frequency Words
In this system the top 60 high-frequency words in the English language
(current usage, Aug 1994, shown in Appendix 2) are removed from the table.
Because such high-frequency words will be found in almost every document,
they hold little or no discriminating potential; removing them reduces the
search space (which is some increasing function of the number of different
words in a document) while losing little or no classification power.
This has the effect of keeping the text representation simple, thus
speeding things up. If these high frequency words were left in the
document representation then the text filter's classification results
should be exactly the same, only the search space would be larger and
learning would take significantly longer.
Reduction to base form
This filter is not implemented in our system but
is considered important enough to warrant documentation. Some words could
(in English and some other languages) be reduced to their base form,
removing plurals, past tense, third person, &c. For example, "cats" ->
"cat" and "editing" -> "edit". Data is lost by this process, but generality
is gained. Such a filter
may make a (desirable) difference to the effectiveness of the classifier. In
the current implementation of the pre-processor {speak, speaks, spoke} are
identified as three different words.
This filter would have the effect of aiding the system in co-classifying
separate documents which used (for example) these words - "budget",
"budgeting", and "budgeted".
Removal of e-mail signatures
If applied to an e-mail message, any
".signature" could be ignored, assuming the convention that everything
after a "RTN--RTN" separator is a signature. This will filter out "noise"
such as random quotes, ASCII artwork, place names and popular philosophy.
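A sketch, assuming the common Usenet convention that a line consisting of
"-- " separates the body from the signature:

    import re

    def strip_signature(message):
        """Drop everything after the first signature separator line ("-- ")."""
        return re.split(r"\n-- ?\n", message, maxsplit=1)[0]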
Special & Meta Pre-processing Information
Some extra information, carrying considerable classification potential,
can be generated by the pre-processor. For example:
- Total words. This is very useful for working out normalisations.
- {Subject, title}, To, {From, author}, &c. in e-mail are also significant
classifiers.
The feature detectors can make significant use of these items, as the
sketch below illustrates.
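For e-mail, Python's standard email parser can expose these items to the
feature detectors; a sketch assuming a plain (non-multipart) message:

    from email import message_from_string

    def meta_features(raw_message):
        """Extract header fields and a total-word count for normalisation."""
        msg = message_from_string(raw_message)
        body = msg.get_payload() or ""        # assumes a non-multipart body
        return {
            "subject": msg.get("Subject", ""),
            "sender": msg.get("From", ""),
            "total_words": len(body.split()),  # useful for normalisations
        }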
The Parse Trees
A parse tree in this system is a computer "programme" that will be evolved
into a good classifier of documents.
Such a tree is evaluated by an emulator which matches nodes of the tree to
aspects of a document. During evaluation against a document the tree
ultimately reduces to a single numerical value - the classification. The method
used here has operators at each branch of the tree, all of which produce
a number. Leaves of the tree (terminals) are nodes containing word
operators and arguments (words) or numerical-constant terminals. Thus
any part of a tree can be interchanged with another part and the tree
remains valid - perfect for flexible evolution and mutations. An example parse
tree is shown in Figure 3.
One drawback to this method is that word-operator nodes of the tree are
strongly bound to their arguments, so that evolution operators cannot
easily (in this numeric-based approach) manipulate the words in a tree
independently of the word operator. For example, a leaf of the parse
tree may evaluate to the frequency of "complaint". During
evolution it is not a simple matter to exchange the word "complaint" for
another word; rather, the node frequency of "complaint" is
exchanged in its entirety. This rigid approach is useful because an
evolutionary exchange of a word operator may not be for another leaf -
but for a complete sub-tree.
[Figure 3 here]
Figure 3. An example parse tree.
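The tuple encoding and small operator set below are illustrative assumptions,
not the UCL/FOE representation, but they show how an emulator reduces a tree
to a single number against a pre-processed document table:

    def evaluate(node, table, doc_length):
        """Recursively reduce a parse tree to one classification value."""
        op = node[0]
        if op == "const":             # numerical-constant terminal
            return node[1]
        if op == "frequency":         # word-operator leaf, bound to its word
            return len(table.get(node[1], [])) / doc_length
        if op == "exists":
            return 1.0 if node[1] in table else 0.0
        # branch operators: every child itself reduces to a number
        args = [evaluate(child, table, doc_length) for child in node[1:]]
        if op == "+":
            return args[0] + args[1]
        if op == "*":
            return args[0] * args[1]
        if op == "max":
            return max(args)
        raise ValueError("unknown operator: %r" % op)

    # e.g. 2.0 * frequency("complaint") + exists("refund")
    tree = ("+",
            ("*", ("const", 2.0), ("frequency", "complaint")),
            ("exists", "refund"))
    # evaluate(tree, {"complaint": [3, 17], "refund": [9]}, 100) -> 1.04

Because every node, leaf or branch, reduces to a number, any sub-tree can
legally replace any other: the property exploited by the evolution operators.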
Parse Tree Operators
The parse tree operators all reduce to a single (real) number. Thus a tree of
these operators reduces to one numerical value also. It should be noted that
while the following operators are made available to the system there is no
requirement for any particular one to be used.
Feature Detectors: Word Operators
- frequency {word}. Relative frequency (absolute-frequency/document-length)
gives a comparable measure between documents.
- exists {word} ( == frequency > 0 ). Testing for the existence of an
individual word gives the same power as many other systems which use
key-word-search and match for classification. Thus the evolved parse tree can
be at least as powerful as a key word search.
- pair-distance {word' word''}. Mean distance between the words in the
document. This distance is normalised so that results can be compared across
documents. The order in which the words are used in the document should not
matter; for example, "search space" and "space to search" would both figure
in the result of pair-distance(search, space). (Both pair operators are
sketched after this list.)
- adjacent-exists {word' word''}. Adjacent words exist in the document.
Pairs of words often form units in their own right, for example {search
space, world wide}.
- adjacent-frequency {word' word''}. Mean adjacent-words frequency. Pairs of
adjacent words are important discriminators (an order of magnitude more
complex than single word discriminators).
Binding Groups of Features: Numeric Operators
- not {number}. Negation, with respect to the range in use.
- magnify {number}. By using a squashing or expanding function (say the
exponential) the differences between numbers can be magnified.
This allows sub-trees to establish their importance.
- {+, -, *, /, =, <>, <, >=, Max, Min}. Standard numerical operators. The
comparison operators might be thought of as comparing the current document
with learnt discriminators, and the numerical operators as biasing the
importance of branches of the tree and combinations thereof.
Evolution Operators. Manipulating the Parse Trees
During learning the evolution operators are applied by an evolution engine
with access to the fitness value of each parse tree in the population. Little
has so far been discussed about the evolution of these parse trees, apart from
making the trees particularly easy to manipulate. Two particular schemes, in
the spirit of natural evolution, are popular in current genetic programming
systems, [4, 5, 6], and are used to evolve the parse trees.
- Mutations. These operators are applied to individual parse trees
independently, generating new "species". In nature, new "programme operators"
are generated this way. We have neither the time nor the population sizes to
rely upon this, hence we supply as many programme operators as possible
for the system to select from, and let the evolution process screen out
redundant ones as it sees fit.
- Breeding. Both breeding and crossover operators are carried out upon more
than one parse tree at a time. The familiar case uses two parents to generate
a number of children. In a computational environment it is clearly possible
to use methods involving more than two parents. Breeding generates children
whose (overall) genetic structure is consistent with the parents',
generating variation within that structure. As the parse trees for the text
filter vary considerably from one another, breeding is not used in this text
filter system.
- Crossover. Similar to breeding, effectively generating children from
parents, crossover does not ensure a consistent genetic structure and hence
is much more unpredictable (genetically). For the parse trees, where all
nodes are interchangeable, crossover allows complete sub-trees to be grafted
onto (into) other trees; a sketch follows this list. Both breeding and
crossover can generate new quasi-operators: new combinations of operator
application. In genetics it is crossover (recombination) that generates new
chromosomes.
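A sketch of crossover over the tuple-encoded trees of the earlier evaluator
sketch; because every node reduces to a number, any sub-tree may legally be
grafted at any point:

    import random

    def subtrees(node, path=()):
        """Yield every sub-tree of a tree together with its path from the root."""
        yield path, node
        for i, child in enumerate(node[1:], start=1):
            if isinstance(child, tuple):   # descend into sub-trees only
                yield from subtrees(child, path + (i,))

    def replace(node, path, new):
        """Copy of `node` with the sub-tree at `path` swapped for `new`."""
        if not path:
            return new
        i = path[0]
        return node[:i] + (replace(node[i], path[1:], new),) + node[i + 1:]

    def crossover(parent_a, parent_b):
        """Graft a random sub-tree of parent_b into a random point of parent_a."""
        path_a, _ = random.choice(list(subtrees(parent_a)))
        _, donor = random.choice(list(subtrees(parent_b)))
        return replace(parent_a, path_a, donor)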
Population Control
Traditional population control methods (with similarities in natural
evolution, but very different to population control in humans) are
available with many genetic programming systems. One standard scheme is for
very few members of a population to change, at rate X. An alternative
method (non-standard, but which shall be explored later in this research
programme) is for many members to change at rate delta-X. A sketch of the
standard scheme is given below.
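This sketch reuses the crossover sketch above; the parameter names are ours,
and fitness_fn is lower-is-better, as described in the next section:

    import random

    def steady_state_step(population, fitness_fn, replace_n=2):
        """Replace the `replace_n` least-fit trees with crossover offspring
        of the survivors (lower fitness distance = better classifier)."""
        ranked = sorted(population, key=fitness_fn)
        survivors = ranked[:-replace_n]
        children = [crossover(random.choice(survivors), random.choice(survivors))
                    for _ in range(replace_n)]
        return survivors + children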
The Fitness Function
During learning the fitness function will score each parse tree in the
population based upon the gold-standard (the pre-defined score for a
document), the result of the parse tree, some user parameters, and some
distance function.
In its simplest form, the fitness function could simply find the
difference between the correct (desired) score and the one generated by
the programme. We use a least-squares distance between
the gold standard and the parse tree's score, summed over all
the test documents for each parse tree. Thus a parse tree giving perfect
classification would have a sum distance of zero from the gold standard;
as a parse tree gets increasingly worse at classification (in some way)
the sum distance increases.
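A sketch of this least-squares fitness, reusing the evaluate emulator
sketched earlier; each graded document is assumed to be a (table, length,
gold-standard score) triple:

    def fitness(tree, graded_documents):
        """Sum of squared distances between the gold-standard scores and the
        tree's classifications; zero means perfect classification."""
        total = 0.0
        for table, doc_length, gold in graded_documents:
            total += (gold - evaluate(tree, table, doc_length)) ** 2
        return total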
What can go wrong?
Messages can set out to confuse automated classification systems, or may do
so inadvertently. Such messages may become more common as use of e-mail by
libertarians and the paranoid expands. For such users there are facilities
like "spook" in the (world's most) popular editor Emacs, which adds random
text to a document in order to confuse the secret service or others
monitoring e-mail with text filters.
This may be of no consequence to expected user groups in academia, home and
business, but the serious point here is that any document may be
mis-classified either because it goes out of its way to disguise its true
nature or content, or because some significant body of off-topic text confuses
the classification. For example, this innocent post script could easily
confuse a classification, "PS: Did you see the great goal Arsenal scored on
Saturday?".
As a second example, a random quote in an e-mail signature may bias a
classification, thus signatures should be considered as noise.
Another situation in which erroneous results might occur is
when the context for classification changes faster than the system can
learn how those changes affect classification. Changes to the type of
documents in a classification (like an e-mail folder) probably happen in
fits and starts, so there may be short periods when the classifier is
actively (top-up) learning in an effort to catch up. It can be assumed
that most of the time (under normal use) the classifications are both
stable and maximal.
Customer Base
The following list is far from exhaustive, and shows the broad need such a
system would satisfy. We envisage a text filter product as an independent
piece of (service) software whereby other software applications can query
the filter (thus only one filter need be running, perhaps even centrally
over a LAN, for the convenience of many other applications):
- For e-mail, Web page, and electronic document readers: academics,
business, family.
- Security products: for both the PICS database maintainers and PICS
users; products such as screens for sexually explicit sites for children,
on the Web, for Usenet posts & articles, &c. [1, 3]
- News gathering agencies, and their users, e.g. the PA, &c.
- Web index agencies.
- Browser developers, for e-mail and Web, e.g. Sun, Microsoft, Netscape, &c.
- Secret service agencies, for screening e-mail en masse.
Work in Progress & Conclusions
The system outlined here is under current software development at UCL and FOE.
Any software and results will be made available in due course from UCL
Computer Science.
A specific area of future interest is a comparison of this adaptive
genetic programming method (a so-called AI method) against the more
traditional method of a key-word search. The software under development has
not yet reached the testing stage, so comparisons with other classification
methods such as key-word match are not yet possible.
Current results involving the evolution of key-word parse trees are
promising, with appropriately weighted trees being evolved rapidly; an
example of one such tree is shown in Figure 3.
The need for an adaptive text classification agent grows every day (in
proportion to the size of the Internet and the increased use of on-line
documents). Interim results show the viability of implementing such an agent
using genetic programming methods. An agent capable of initial and continued
long term top-up learning would be of great benefit to a considerable
population of computer users, and is within the bounds of current machine
learning technology.
Appendix 1
Constraints such that a table representation requires less storage space
than an original text document.
Key:
Average Word Length       AWL
No. of Words              NW
No. of Different Words    NDW
Space required for the table: (AWL . NDW) + NW
Space required for the text document: AWL . NW
NDW <= NW by definition
=> AWL . NDW <= AWL . NW
The table requires less space than the original document when
AWL . NDW + NW < AWL . NW
AWL . NDW < AWL . NW - NW
AWL . NDW < (AWL - 1) NW
NDW / NW < (AWL - 1) / AWL
i.e. (No. of Different Words / No. of Words)
     < (Average Word Length - 1) / Average Word Length
A quick look at this inequality for average word lengths of {4, 5, 6} shows
the relationship between different words and total number of words:
AWL    (NDW / NW) must be less than
4      3/4
5      4/5
6      5/6
So, with an average word length of say 5, so long as fewer than 4 out of
every 5 words are different, the table saves space.
A draft of this document, for example, has 1800:4771
different-words:total-words. That is approximately 9 different words out of
every 24, well within the constraint, indicating that a table representation
of this document would be significantly smaller than the document itself.
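The check is quick to reproduce (AWL = 5 assumed as a typical English
average):

    ndw, nw, awl = 1800, 4771, 5
    assert ndw / nw < (awl - 1) / awl   # 0.377... < 0.8: the table saves space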
Appendix 2
Top 60 high frequency words, current English usage, [8]. The score indicates
relative frequency.
THE 69971
OF 36411
AND 28852
TO 26149
IN 21341
THAT 10595
IS 10099
WAS 09816
HE 09543
FOR 09489
IT 08756
WITH 07289
AS 07250
HIS 06997
ON 06742
BE 06377
AT 05378
BY 05305
I 05173
THIS 05146
HAD 05133
NOT 04609
ARE 04393
BUT 04381
FROM 04369
OR 04207
HAVE 03941
AN 03747
THEY 03618
WHICH 03562
ONE 03292
YOU 03286
HER 03037
ALL 03001
SHE 02859
THERE 02724
WOULD 02714
THEIR 02670
WE 02653
HIM 02619
BEEN 02472
HAS 02439
WHEN 02331
WILL 02244
MORE 02216
NO 02201
IF 02199
OUT 02096
SO 01984
SAID 01961
WHAT 01908
UP 01895
ITS 01858
ABOUT 01815
INTO 01791
THEM 01789
THAN 01789
CAN 01772
ONLY 01747
OTHER 01702
References
[1]. Paul Resnick & James Miller. 1996. PICS: Internet Access Controls
Without Censorship. World Wide Web Consortium, MIT, Cambridge, MA.
URL:http://www.w3.org/
[2]. David Mazzarella (Ed.) 1996. "Editorial: Misfire on the 'Net." pp8A.
USA Today/International Edition, February 28, 1996.
[3]. Clive Parker. 1996. "Parents seal off the no-go areas." Interface section
pp8. The Times. March 13, 1996. London
[4]. David E Goldberg. 1989. Genetic Algorithms in Search, Optimization,
and Machine Learning. Addison-Wesley.
[5]. Z Michalewicz. 1992. Genetic Algorithms + Data Structures = Evolution
Programs. Springer-Verlag.
[6]. John R. Koza and James P. Rice. 1992. Genetic Programming - The
Movie. MIT Press.
[7]. Koza's Genetic Programming in LISP, from the book Genetic Programming
II: litlflet.lsp. Plus GP in C++, gpcpp4.zip and gpc++-0.4.tar.gz, are
available with other implementations from the usual FTP archives.
[8]. P T Quinlan. 1992. The Oxford Psycholinguistic Database.
[9]. Terry R Payne & Peter Edwards. 1995. Learning for Information Filtering
Agents. University of Aberdeen, AUCS/TR9509. URL:
http://www.csd.abdn.ac.uk/~pedwards/research.html
The moral right of Clack, Farringdon, Lidwell and Yu to be recognised as
the authors of this work is asserted. June 19 1996.