PROCEEDINGS OF
AN INTERNATIONAL CONFERENCE ON
GENETIC ALGORITHMS
AND THEIR APPLICATIONS

July 24-26, 1985
at
Carnegie-Mellon University
Pittsburgh, PA

Sponsored By
Texas Instruments, Inc.

U.S. Navy Center for Applied Research
in Artificial Intelligence

(NCARAI)

John J. Grefenstette
Editor

 
 

PROCEEDINGS OF
AN INTERNATIONAL CONFERENCE ON
GENETIC ALGORITHMS

AND THEIR APPLICATIONS

July 24-26, 1985
at
Carnegie-Mellon University
Pittsburgh, PA.

Sponsored By
Texas Instruments, Inc.
U.S. Navy Center for Applied Research

in Artificial Intelligence
(NCARAI)

John J. Grefenstette
Editor

 
 

Copyright © 1985 John J. Grefenstette
 

PREFACE

It has been ten years since the publication of John Holland’s seminal book, Adaptation
in Natural and Artificial Systems. One of the major contribution of the book was the
formulation of a class of algorithms, now known as Genetic Algorithms (GA’s), which
incorporate metaphors from natural population genetics into artificial adaptive systems.
Since the publication of Holland's book, interest in GA’s has spread from the University
of Michigan to research centers throughout the U.S., Canada, and Great Britain. GA's
have been applied to a striking variety of areas, from machine learning to image
processing to combinatorial optimization. The great range of application attests to the
power and the generality of underlying approach. However, much of the GA research
has been reported only in Ph, D. theses and informal workshops. This Conference was
organized to provide a forum in which the diverse groups involved in GA research can
share results and ideas concerning this exciting area.

On behalf of the organizing committee, it is my pleasure to acknowledge the support of
Texas Instruments, Inc. and the U.S. Navy Center for Research in Applied Al. Special
thanks go to Dave Davis for his efforts in obtaining the TI Grant.

John J. Grefenstette
Program Chair

Conference Committee

John H. Holland, University of Michigan (Conference Chair)
Lashon B. Booker, NCARAI

Kenneth A. De Jong, NCARAI and George Mason University
John J. Grefenstette, Vanderbilt University (Program Chair)
Stephen F. Smith, CMU Robotics Institute (Local Arrangements)

 
TABLE OF CONTENTS

Wednesday, July 24, 1985

Session 1: 8:45 a.m - 10:15 a.m.
Chair: John Holland

Properties of the bucket brigade
John H. Holland, University of Michigan

Genetic algorithms and rule learning in dynamic system control
David E. Goldberg, University of Alabama

Knowledge growth in an artificial animal
Stewart W. Wilson, Rowland Institute for Science

Coffee Break: 10:15 a.m. - 10:45 a.m.

Session 2: 10:45 a.m, - 12:00 noon
Chair: Lashon Booker

 

Implementing semantic network structures using the classifier system
Stephanie Forrest, University of Michigan

The bucket brigade is not genetic
‘Thomas H. Westerdale, University of London

Genetic plans and the probabilistic learning system: synthesis and results
Larry Rendell, University of Illinois at Urbana-Champaign

Lunch: 12:00 noon - 2:00 p.m.

Session 3: 2:00 p.m. - 2:50 p.m.
Chair: Stephen Smith

Learning multiclass pattern discrimination
J. David Schaffer, Vanderbilt University

Improving the performance of genetic algorithms in classi fier systems
Lashon B. Booker, Navy Center for Applied Research in Al

Coffee Break: 2:50 p.m. - 3:15 p.m.

page 1

page 8

page 16

page 24

page 45

page 60

page 7%

page 80

 
 

Discussion: 3:15 p.m. - 4:30 p.m.
Topic: GA's and Machine Learning
Chair: John Holland

Thursday, July 25, 1985

Session 4: 9:00 a.m. - 10:15 a.m.
Chair: John Grefenstette

Multiple objective optimization with vector evaluated genctic algorithms
J. David Schaffer, Vanderbilt University

Adaptive selection methods for genetic algorithms
James E. Baker, Vanderbilt University

Genetic search with approximate function evaluations
John J. Grefenstette and J. Michael Fitzpatrick, Vanderbilt University

Coffee Break: 10:15 a.m. - 10:45 a.m.

Session 5: 10:45 a.m. - 12:00 noon
Chair: John Grefenstette

 

A connectionist algorithm for genetic search
David H. Ackley, Carnegie-Mellon University

Job shop scheduling with genetic algorithms
Lawrence Davis, Bolt, Beranek and Newman, Inc.

Compaction of symbolic layout using genetic algorithms
Michael P_ Fourman, Brunel University

Lunch: 12:00 noon - 2:00 p.m

 

Session 6: 2:00 p.m.
Chair: Ken De Jong

3:15 pm.

Alleles, loci, and the traveling salesman problem
David E. Goldberg and Robert Lingle, Jr., University of Alabama

Genetic algorithms for the traveling salesman problem
John J. Grefenstette, Rajeev Gopal, Brian J. Rosmaita and Dirk Van Gucht,
Vanderbilt University

 

 

page 99

page 101

page 112

page 121

page 196

page 141

page 154

page 160
Genetic algorithms: a 10 year perspective
Kenneth De Jong, George Mason University

 

Coffee Break

 

15 p.m. - 3:45 p.m.

Discussion: 3:45 p.m. - 4:30 p.m.
Topic: GA's as Search Algorithms
Chair: Kenneth De Jong

Friday, July 26, 1985

Session 7: 9:00 - 10:30 a.m.
Chair: John Grefenstette

Classt fier systems with long term memory
Hayong Zhou, Vanderbilt University

A representation for the adaptive generation of simple sequential programs
Nichael Lynn Cramer, Texas Instruments, Inc.

Adaptive "cortical" pattern recognition
Stewart W. Wilson, Rowland Institute for Science

Machine learning of visual recognition using genetic algorithms
Arnold C. Englander, Itran Corp.

Bin packing with adaptive search
Derek Smith, Texas Instruments, Inc.

Directed tress method for fitting a potential function
Craig Schaefer, Rowland Institute for Science

Coffee Break: 10:30 a.m. - 11:00 a.m.

Discussion: 11:00 a.m. - 12:00 noon
Topic: Summary and Future Directions
Chair: John Holland

 

page 169

page 178

page 189

page 188

page 197

page 202

page 207

 
PROPERTIES OF THE BUCKET BRIGADE ALGORITHM
John H Holland

The University of Michigan

The bucket brigade algorithm is designed to solve the apportronment
of credit problem for massively paraliel, message-passing, rule-based
systems. The apportionment of credit problem was recognized and
explored in one of the earliest significant works in machine learning
(Samuel [1959]) In the context of rule-based systems it is the problem of
deciding which of a set of early acting rules should receive credit for
“setting the stage” for later, overtly successful actions In the systems of
interest here, in which rules conform to the standard condition/action
paradigm, a rule's overall usefulness to the systern is indicated by a
parameter called its strength Each time arule is active, the bucket
brigade algorithm modifies the strength so that it provides a better
estimate of the rule's usefulness in the contexts in which it 1s activated.

The bucket brigade algorithm functions by introducing an element of
competition into the process of deciding which rules are activated
Normally, for a parallel message-passing system, all rules having
condition parts satisfied by some of the messages posted at a given time
are automatically activated at that time. However, under the bucket
brigade algorithm only sorne of the satisfied rules are activated. Each
Satisfied rule makes a 0/g, based In part on its strength, and only the
highest bidders become active (thereby posting the messages specified by
their action parts) The size of the bid depends upon both the rule's
strength and the specificity of the rule's conditions. (The rule's
Specificity is used on the broad assumption that, other things being equal,
the more information required by a rule's conditions, the more likely it is
to be “relevant” to the particular situation confronting it). In a specific
Version of the algorithm used for classifier systerns, the bid of classifier
Cat time t is given by

b(C,t) = cr(C)s(C,t),
where r(C) is the specificity of rule C (equal, for classifier systerns, to
the difference between the total number of defining positions in the
Condition and the number of “don’t cares” in the condition), s(C,t) 1s the
Strength of the rule at time t, and c 1s a constant considerably less than |

1

 
 

(eg, 1/4 or 1/8)

The essence of the bucket brigade algorithm is its treatment of each
rule as a kind of mid-level entrepreneur (a “middleman”) in a complex
enconorny When arule C wins the competition at time t, it must decrease
its strength by the amount of the bid. Thus its strength on tirne-step t+!,
after winning the competition, is given by

SiC, t+1) = SIC, t) - DIC, t) = (1 - er(C))SKC, t)
In effect C has paid for the privilege of posting its message. Moreover
this amount is actually paid to the classifers that sent messages
satisfying C’s conditions -~ in the simplest formulation the bid is split
equally arnongst ther These rnessage senders are C's suypp/rers, and each
receives its share of the payrnent from the consumer C. Thus, if C, has

posted a message that satisfies one of C's conditions, C has its strength

increased so that
S(Cy, t+1) = S(Cy, t) * DIC, L/MC, t) = C1 ~ cr(Cd/M(C,t SCC, t),

where n(C, t) is the nurnber of classifiers sending messages that satisfy C
at time t

In terms of the economic metaphor, the suppliers (C ;} are paid for

setting up a situation usable by consumer CC, on the next time step,
changes from consumer to supplier because it has posted its message !f
other classifiers then bid because they are satisfied by C’s message, and if
they win the bidding competition, then C in turn will receive some fraction
of those bids. C’s survival in the system depends upon its turning a profit
as an intermediary in these local transactions In other words, when C 1s
activated, the bid it pays to its suppliers must be less (or, at least, no
more) than the average of the sum of the payments it receives from its
consumers

it is important that this process involves no complicated
“pookkeeping” or memory over long sequences of action. When activated, C
simply pays out !ts bid on one time-step, and is immediately paid by its
consumers (if any) on the next time-step. The only variation on this
transaction occurs on time-steps when there is payoff from the
environment. Then, all classifiers active on that time-step receive equa!
fractions of the payoff in addition to any payments from classifiers active
on the next time-step in effect, the environment is the system's ultimate
consumer From a global point of view, a given classifier C is likely to be
 

profitable only if its usual consumers are profitable The profitability of
any chain of consumers thus depends upon their relevance to the ultimate
consumer. Stated more directly, the profitability of a classifier depends
upon its being coupled into sequences leading to payoff

As a way of illustrating the bucket brigade algorithm, consider a set
of 2-condition classifiers where, for each classifier, condition | attends
to messages from the environment and condition 2 attends to rnessages
from other classifrers in the set As above, let a given classifier C have a
bid fraction b(C) and strength s(C,t) at tine t. Note that condition | of C
defines an equivalence class £ in the environrnent consisting of those
environmental states producing messages satisfying the condition

Consider now the special case where the activation of C produces a
response r that transforms states in £ to states in another equivalence
class £’ having an (expected) payoff u Under the bucket brigade
algorithm, when C wins the competition under these circumstances its
strength will change from s(C,t) to

s(C,t+1) = s(C,t) - b(C)s(C,t) + u
+ (any bids C receives from classifiers active on
the next time-step)
Assuming the strength of C 1s small enough that its bid b(C)s(C,t) is
considerably less than u, the usual case for a new rule or for arule that
has only been activated a few times, the effect of the payoff isa
considerable strengthening of rule C

This strengthening of C has two effects First, C becornes more likely
to win future competitions when its conditions are satisfied. Second,
rules that send messages satisfying one (or more) of C’s conditions will
receive higher bids under the bucket brigade, because b(C)s(C,t+ 1) »
bIC)s(C,t).

Both of these effects strongly influence the developnient of the
system. The increased strength of C means that response r will be made
more often to states in £ when C competes with other classifiers that
Produce different responses If states in £’ are the only payoff states
accessible from £, andr is the only response that will produce the
required transformation from states in £ to states in £, then the higher
Probability of a win for C translates into a higher payoff rate to the
Classifier system

 
 

Of equal importance, C’s higher bids mean that rules sending
Messages satisfying C’s second condition will be additionally strengthened
because of C’s higher bids. Consider, for example, a classifier C, that

transforms environmental states in some class & to states in class £ by
evoking response ry, That is, Cy acts upon a causal relation in the
environment to “set the stage” for C. If Cy also sends a message that
satisfies C’s second condition, then Cy will benefit from the “stage
setting” because C's higher bid 1s passed to it via the bucket brigade.

It is instructive to contrast the "stage setting” case with the case
where some Classifier, say C,, sends a message that satisfies C but aves

not transform states in & (the environmental equivalence class defined
by its first condition) to states in & That is, C, attempts to “parasitize”

C, extracting bids from C via the bucket brigade without modifying the
environment in ways suitable for C’s action. Because C, is not

instrumental in transforming states in 4 to states in &, it will often
happen that activation of C; is not followed by activation of C on the

subsequent time-step because C's first (environmental) condition is not
satisfied. Every time C, is activated without a subsequent activation of C

it suffers a loss because it has paid out its bid b(C psC),t), without
receiving any income from C Eventually C;'s strength will decrease to the

point that it is no longer a competitor. (There is a more interesting case
where Co and C, manage to become active simultaneously, but that goes

beyond the confines of the present illustration).

One of the most important consequences of the bidding process is the
automatic emergence of default hierarchies in response to complex
environments For rule-based systems a “default” rule has two basic
properties:

1) It is a general rule with relatively few specified properties and

many “don't cares” in its condition part, and

2) when it wins a competition tt is often in error, but it still

Manages to profit often enough to survive.

It ts clear that a default rule is preferable to no rule at all, but, because it
is often in error, it can be improved One of the simplest improvements is
the addition of an “exception” rule that responds to situations that cause
 

the default rule to be in error. Note that, in attempting to identify the
error-causing situations, the condition of the exception rule specifies a
subset of the set of messages that satisfy the default rule. That is, the
condition part of the exception rule re/ines the condition part of the
default rule by using aaastiona/ identifying bits (properties). Because
rule discovery algorithms readily generate and test refinements of
existing strong rules, useful exception rules are soon added to the system

As a direct result of the bidding competition, an exception rule, once
in place, actually aids the survival of its parent default rule Consider the
case where the default rule and the exception rule atternpt to set a given
effector to a different values. In the typical classifier system this
conflict is resolved by letting the highest bidding rule set the effector
Because the exception rule is more specific than the default rule, and
hence makes a higher bid, it usually wins this competition In winning, the
exception rule actually prevents the default rule from paying its bid This
outcome saves the the default rule from a loss, because the usual effect
of an error, under the bucket brigade, is activation of consumers that do
not bid enough to return a profit to the default rule in effect the
exception protects the default from some errors. Similar arguments
apply, under the bucket brigade algorithm, when the default and the
exception only influence the setting of effectors indirectly through
intervening, coupled classifiers

Of course the exception rules may be imperfect themselves, selecting
Sore error-causing cases, but making errors in other cases. Under such
circumstances, the exception rules become default rules relative to more
detailed exceptions. iteration of the above process yields an ever more
refined, and efficient, default hierarchy. The process improves both
overall performance and the profitability of each of the rules in the
hierarchy. It also uses fewer rules than would be required if ail the rules
were developed at the most detailed level of the hierarchy (see Holland,
Holyoak, Nisbett, and Thagard [1986]). The bucket brigade algorithm
strongly encourages the top-down discovery and development of such
hierarchies (cf. Goldberg [1983] for a concrete example)

At first sight, consideration of long sequences of coupled rules would
Seem to uncover an important limitation of the bucket brigade algorithm
Because of its local nature, the bucket brigade algorithm can only
Propagate strength back along a chain of suppliers through repeated
activations of the whole sequence That 1s, on the first repetition of a

5

 
 

sequence leading to payoff, the increment in strength is propagated to the
immediate precursors of the payoff rule(s). On the second repetition it is
propagated to the precursors of the precursors, etc. Accordingly, it takes
on the order of n repetitions of the sequence to propagate the increments
back to rules that “set the stage” n steps before the final payoff. However,
this observation is misleading because certain kinds of rule can serve to
“pridge” long sequences.

The simplest “bridging action” occurs when a given rule remains
active over, say, T successive time-steps. Such a rule passes increments
back over an interval of T time-steps on the next repetition of the
sequence This qualification takes on importance when we think of arule
that shows persistent activity over an epoch -- an interval of time
characterized by a broad plan or activity that the system is attempting to
execute. For the activity to be persistent, the condition of the
epoch-marking rule must be general enough to be satisfied by just those
properties or cues that characterize the epoch. Such arule, if strong,
marks the epoch by remaining active for its duration.

To extract the consequences of this persistent activation, consider @
concrete plan involving a sequence of activities, such as a “going home”
plan. The sequence of coupled rules used to execute this plan on a given
day will depend upon variable requirements such as “where the car is
parked’, “what errands have to be run", etc These detailed variations will
call upon various combinations of rules in the system's repertoire, but the
epoch-marking “going home” rule D will be active throughout the execution
of each variant In particular, it will be active both at the beginning of the
epoch and at the time of payoff at the end of the plan (“arrival home") As
‘such it “bridges” the whole epoch

Consider now arule | that initiates the plan and ts coupled to (sends a
message satisfying) the general epoch-tnarking rule D The //rs¢t
repetition of the sequence initiated by | will result in the strength of 1
being incremented This comes about because D is strengthened by being
active at the time of payoff and, because it is a consumer of |'s message,
it passes this increment on to | the very next time | is activated. 0
“supports” | as an elernent of the “going home” plan The result 1s a kind of
one-shot learning in which the earliest elements in a plan are rewarded on
the very next use. This occurs despite the local nature of the bucket
brigade algorithm. It requires only the presence of a general rule -- a kind
of default -- that is activated when some general kind of activity or goal

6
 

 

is to be attained. An appropriate rule discovery algorithm, such as a
genetic algorithm, will soon couple more detailed rules to the
epoch-marking rule. And, much as in the generation of a default hierarchy,
these detailed rules can give rise to further refined offspring. The result
is an emergent plan hierarchy going from a high-level sketch through
progressive refinements yielding ways of combining progressively more
detailed components (rule clusters) to meet the particular constraints
posed by the current state of the environment in this way a limited
repertoire of rules can be combined in a variety of ways, and in parallel, to
meet the perpetual novelty of the environment

References.

Goldberg, D.E. Computer-aided Gas Pipeline Operation Using Genetic
Algorithms and Machine Learning. Ph 0, Dissertation (Civil Engineering)
The University of Michigan 1983.

Holland, J. H, Holyoak, K J,, Nisbett, R.E, and Thagard, P R= /nauction:
Learning, Discovery, and the Growth of Knowledge. (forthcoming, MIT
Press]

Samuel, AL “Some studies in machine learning using the game of
checkers.” /&/7 Journal of Research and Development, I 2\\-232, 1959

 
 

 

GENETIC ALGORITHMS AND RULE LEARNING
IW
DYNAMIC SYSTEM CONTROL
David E. Goldberg

Department of Engineering Mechanics
The University of Alabana

ABSTRACT

In this paper, recent research resuits
[2] are presented which denonstrate the
effectiveness of genetic algorithes in the
control of dynamite eyatens.” Genetic algo-
rithes are search algorithms based upon. the
nechantes of natural genetics. They combine
a survival-of-the-fletestanong string
Structures with a structured, yet randontzed,
infornation exchange to form a search algo”
rithn with sone of the innovative flair of
hnunan “search. While randomized, genetic
algorithns are no sinple random walk. ‘They
effictenciy explote Mstorical inforration to
speculate on new search pointe with inproved
performance,

 

Two applications of genetic algorithas
are considered. In the first, a triparcite
genetic algorithm is applied to a paraneter
optimization problem, the optimization of a
serial natural gas ‘pipeline with 10 con~
pressor stations. While solvable by other
methods (dynamic programming, gradient
search, etc.) the problem is interesting as a
straightforward engineering application of
genetic algorithms. Furthermore, a surpris-
ingly small number of function’ evaluations
are required (relative to the size of the
discretized search space) to achieve near-
optimal performance,

Inthe second application, a genetic
algorithm is used as the fundamental learning
algoritha in a more complete rule learning
systen called a learning classifier system.
‘The learning syste combines a complete
string rule and tessage system, an apportion
nent of credit algorithn modeled after a
competitive service econoay, and a genetic
algorithm to forn a systen which continually
evaluates its ‘present rules while forming
new, possibly better, rules from the bits and
pieces of the old. ‘tn an application to the
Control. of a natural gas’ pipeline, the
learning systen is trained to control the
pipeline under normal winter and  sunner
Conditions. It is also trained co detect che
presence or absence of a leak with increasing
accuracy.

INTRODUCTION

Many industrial tasks and machines that
once required human intervention have been
all but completely automated. Where once a
person tooled a part, a machine tools,
Senses, and tools again. Where once a person
controlled a machine, a computer controls,
senses, and continues its task, Repetitive

 

ks requiring a high degree of precision
have been most susceptible to these extrene
forns of automated control. Yet despite
these successes, there are still many tasks
and nechanisns that require the attention of
a huran operator. Plloting an afrplane,
controlling a pipeline, driving a car, and
fixing a machine are just a few examples of
ordinary tasks vhich have resisted a high
degree of automation. What ts it about these
tasks chat has prevented more autonomous,
automated control? Primarily, each of the
example tasks requires, not just a single
capability, but a broad range of skills for
successful’ performance. Furthermore, cach
task requires performance under circunstances
which have never been encountered before.
For example, a pilot must take off, navigate,
control speed and direction, operate auxilia~
Ty equipment, consunicate with tower control,
and Jand the aircraft. He may be called upon
to do any or all of these tasks under extrene
weather conditions or with equipment malfunc~
tions he has never faced before. Clearly,
the breadth and perpetual novelty of the
piloting task (and similarly complex task
environments) prevents the ordinary algo-
rithaic solution used in nore repetitive
chores. In other words, difficult environ-
ents are difficult because not every possi-
ble outcome can be anticipated in advance,
nor can every possible response be pre-
defined. This truth places a premium on
adaptation.

 

 

 

In this paper, we attack sone of these
issues by examining research results in two
distinct, but related problens. In the
first, the steady state control of a serial
gas pipeline is optimized using a genetic
algorithm. While the optimization problem
itself is unremarkable (a  straightforvard
paraneter optimization problem which has been
solved by other methods), the genetic algo-
rithm approach we adopt 1s noteworthy because
it draws from the most successful and longest
lived search algorith known to pan (natural
genetics + survival-of-thefittest). Further-
Tore, the GA approach is provably efficient
in tts exploitation of important sinilari-
ties, and thus connects to our om notions of
Snnovative or creative search. In the second
Problem, we use a genetic algorithm as a
prinary’ discovery mechanism in a larger rule
learning systen called learning
classifier system (LCS), In this particular
application the LCS learns to control a
simulated natural gas pipeline. Starting
from a random rule set the LCS learns appro
priate rules for high perfornence control
Under normal summer and winter conditions;
additionally it learns to detect simulated
leaks with increasing accuracy.

 

   
‘A TRIPARTITE GENETIC ALGORITHM

Genetic algorithns are different from
the normal search methods encountered in
engineering optimization in the following

1. GA's work with a coding of the paraneter
set not the paraneters themselves.

2, GA's search from @ population of points.

3, GA's use probabilistic not deterministic
transition rules.

Genetic algorithms require the natural
paraneter set of the optinization problen to
be coded as a finite length string. A
variety of coding schenes can and have been
used successfully, Because Gas work directly
with the underiying code they are difficult
to fool because they are not dependent upon
continuity of the parameter space and deriva-
tive existence,

In many optintzation methods, ve nove
gingerly from a single point in che decision
Space to the next using sone decision rule to
tell us how to get to the next point. This
poine-by-point method 1s dangerous because 4t
often locates false peaks in  multinodal
Search spaces. GA's work from a database of
points “simultaneously (a population of
strings) climbing many peaks in parallel,
thus reducing the probability of finding a
False peak.

Unlike many methods, GAs use probabi-
Listic decision rules to guide their search,
The use of probability does not suggest that
the method is simply a randon search, howev-
er. Genetic algorithns are quite rapid in
Tocating inproved performance.

For our work, ve pay consider the
serings in our ‘population of strings, to be
tapressed ip a. plcaty alphabet containing the
characters (0,1). Each string is of length 2

i the population contains total of n'such
erings.” Of course, each” string say be
decoded’ cova ‘set ot ‘physical. perancters
according to our design’ Additionatly, we
assune that with each string (parameter set)
te say evaluate a fitness values Fleness ts
defined as the non-negative figure of nerie
Weare maxinizing, “thus, the. fitness tn
Benetic algoritha’ work corresponds, to. the
Shjestive™ funetion “in jornal” optini sation

 

 

A simple genetic algorithm which gives
G00d results 1s composed of three operators:

1, Reproduction
2, Crossover

3. Mutation

With our simple genetic elgorithn we
view reproduction as a process. by which

individual strings are copied according to
their fitness. Highly fie strings receive
higher nusbers of copies in the nating pool.
There are nany ways to do this; we simply
give a proportionately higher probability of
Teproduction to those strings with higher
fitness (objective function value). Repro
duction is ‘thus the survival-of-the-fittest
or enphasis step of the genetic algoritha.
The best strings make nore copies for mating
than the worst.

After reproduction, simple crossover may
proceed in two steps. First, menbers of the
newly reproduced strings in ‘the mating pool
are mated at random. Second, each pair of
strings undergoes crossing over as follows:
an integer position k along the string is
selected uniformly at random on the interval
Q,g-1), Two new strings are created by
swapping all characters between positions 1
and k inclusively,

For example, consider two strings A and
B of length 7 nated at random fron the mating
pool created by previous reproduction:

A = al a2 a3 ab a5 a6 a7
B = bl b2 b3 bs b5 6 b7

Suppose the roll of a die turns up a four.
The resulting crossover yields two new
strings A' and Bt following the partial
exchange

At = DL b2 b3 bu a5 a6 a7
BY = al a2 a3 ab b5 b6 b7

The mechanics of the reproduction and
crossover operators are surprisingly simple,
involving nothing more complex than string
copies and partial string exchanges; however,
together the emphasis step of reproduction
‘and the structured, though randontzed,
information exchange’ of crossover give
genetic algorithas mich of their power. At
First this’ seens surprising. How can’ such
sinple (computationally trivial) operators
result in anything useful let along a rapid
and relatively robust. search - mechanisn?
Furthermore, doesn't it seen a little strange
that chance should play such a fundanental
role in a directed search process? The
answer to the second question was well
Fcopnized by the athenatician J. Hodanaré

We shall see a little later that
the possibility of imputing discov:
ery to pure chance is already
excluded....0n the contrary, that
there is’ an intervention of ‘chance
but also a necessary work of
unconsciousness, the latter imply-
ing and not ’ contradicting the
former.... Indeed, it is obvious
that invention or ‘discovery, be it
in mathematics or anywhere else,
takes place by conbining ideas.

 

 

‘The suggestion here is that while
discovery is not a result of pure chance, it

 
 

is almost certainly guided by directed
serendipity, Furthermore, Hadanard hints
that a proper role for chance is to cause the
juxtaposition of different notions. It is
Interesting that genetic algorithms adopt
Wadamard's mix of direction and chance in a
manner which efficiently builds new solutions
fron the best partial solutions of previous
trials,

 

To see this, consider a population of n
strings over sone appropriate aiphaber coded
so that each is a conplete IDEA or prescrip~
tion for performing a particular task (in our
coming example, each string is a description
of how to operate all 10 compressors ‘on a
natural gas pipeline.). Substrings within
each string (IDEA) contain various NOTIONS of
what's important or relevant to the task.
Viewed in this way, the population contains
not just a sample’ of n IDEAS, rather it
contéins a multitude of NOTIONS and rankings
of those NOTIONS for task performance.
Genetic algorithns carefully explote this
wealth of infornation about important NOTIONS
by 1) reproducing quality NOTONS according
to their perfornance and 2) crossing these
NOTIONS with many other high performance
NOTIONS from other strings. Thus, the act of
crossover with previous. reproduction specu-
Yates on new IDEAS constructed fron the high
perforeance buflding blocks (NOTIONS) of past
Era

 

 

 

If reproduction according to fitness
combined with crossover give genetic algo-
rithms the bulk of their processing pover,
what then is the purpose of the mutation

operator? Not surprisingly there is much
confusion about the role of mutation in
genetics (both natural and artificial).

Perhaps it is the result of too many B novies
detatling the exploits of mutant eggplants
that devour portions of Chicago, but whatever
the cause for the confusion, we find that
mutation plays a decidedly secondary role in
the operation of genetic algorithms. Muta~
tion {s needed because, even though reproduc-
tion and crossover effectively search and
reconbine extant NOTIONS, occasionally the:
nay becone overzealous and lose some poten
tially useful genetic material (2's or O's at
particular locations). The mutation operator
protects against such an unrecoverable loss.
In the simple tripartite GA, rutation is the
occasional random alteration of a string
position. In a binary code, this simply
‘beans changing a1 to a 0 and vice versa. By
itself, putation is a random walk through the
string’ space. ‘hen used sparingly with
reproduction and crossover it is an insurance
policy against prenature loss of important
NOTIONS.

 

 

That the mutation operator plays a
secondary role we simply note that the
Frequency of mutation to obtain good results
in enpirical genetic algorithn studies is on
the order of 1 mutation per thousand bit
(position) transfers. Mutation rates are
similarly small in natural populations which
leads us to conclude that mutation is

 

10

 

appropriately considered as a secondary
nechanisn,
The underlying processing power of

genetic algorithns is understood in nore
rigorous terms by considering the notion of a
NOTION more carefully. If two or more
strings (IDEAS) contain the sare NOTION there
are sin{larities between the strings at one
or more positions. To consider the number
and form of the possibie relevant sinilari-
ties we consider a schena [3] or similarity
teaplate; a similarity template is simply a
string over our original alphabet (1,0) with
the addition of a wild card or don't care
character *, For example, with string length
2 = 7 the schena is0*t* represents all
Strings with a 1 in the first position and a
© in the third position. A simple counting

argunent shows that while there are only 2%

strings, there are 3% well-defined schenata
or possible templates of similarity.
Furthermore, it is easy to show that a

particular string 1s itself a representative

of 2% different schemata. thy 1s this
interesting? The interesting part cones from
considering the effect of reproduction and
crossover on the multitude of schemata
contained in a population of n strings (at

most n+2% schemata). Reproduction on average
gives exponentially nore samples to. the
Observed best similarity patterns (a near~
optinal sampling strategy 1f we consider a
multi-arned bandit problen), Second, cross=

 

 

 

over, combines schenata fron different
strings so that only very long defining
length schemata (relative to the string

length) are interrupted. Thus, short defin-
ing length schemata are propagated generation
to generation by giving exponentially in-
creasing samples to the observed best, and
all this goes on in parallel with little
explicit book-keeping or special menory other
than the population of n strings. How many

of the n+2® schenata are usefully processed
per generation? Using a conservative esti-
Sate, Helland’ hos shotm that_O(n3) senenate
are ‘usefully sampled per generation. This
Compares favorably with the munber of func
tion evaluations (n), and because this
processing leverage 18 so important (and
apparently unique to genetic. algorithns)
Holland gives it a special nane, implicit
porallelisn. In the next section we exploit
this leverage in the optinization of a
natural gas pipeline. :

 

‘THE TRIPARTITE GENETIC ALGORITHM 31.
GAS PIPELINE OPTIMIZATION

NATURAL.

We apply the genetic algorithm to the
steady state serial natural gos pipeline
problem of Wong and Larson [4]. As rentioned
previously, the problen is not renarkable.
Wong and Larson successfully used a dynamic
progranning approach and gradient procedures
have also been used. Our goal here ts to
connect with extant optimization and control

 
 

 

literature. We also look at sone of the
issues we face in applying genetic algorithns
to more difficult problens where standard
techniques may be inappropriate.

We envision a serial system vith an
alternating sequence of 10 compressors and 10
pipelines. A fixed pressure source exists at
the inlet; gas is delivered at line pressure
to the delivery point. Along the way,
compressors boost pressure using fuel taken
fron the line, Nodeling relationships for
the steady flow of an ideal gas are well
studied. Ke adopt Wong and Larson's formula-
tion for consistency. The reader interested
in more modeling detail should refer to their
original work.

 

 

Along with the usual modeling rela
tionships, we must pose a reasonable objec-
tive function and constraints. For this
problem, we use Wong and Larson's objective
function and constraint specification.
Specifically, we minimize the sumed horse-
power over the 10 compressor stations in the
Serial line subject fo maximun and minizun
pressure constraints as well as maximum and
mininun pressure ratio constraints.  Con-
straints in these state variables are ad-
Joined to the problen using an exterior
Penalty method. Whenever @ constraint is
violated a penalty cost is added to the
objective function in proportion to the
square of the violation. As ve shall see in
a nonent, constraints in control variables
may be handled with the choice of some
appropriate finite coding.

 

 

As discussed in the previous section,
one of the necessary conditions for using &
genetic algorithm is the ability to code the
underlying parameter set as a finite length
string. This is no real limitation as every
user of a digital computer or calculator
knows; hovever, there 1s motivation for
constructing Special, relatively crude
codings. In this study, the full string is
formed from the concatenation of 10, four bit
substrings where each substring is’ a mapped
fixed point binary integer (precision = 1
part in 16) representing the difference in
squared pressure across each of the ten
compressor stations. This rather crude
Giscretization gives an average precision in
Pressure of 34 psi over the operating range
‘500-1000 psia,

The nodel, objective function, con-
straints, ‘and genetic. algorithn have been
Programed “in Pascal. ‘We exanine results
fron a nunber of independent trinis and
compare to published results. To initiave
Simlation, "a starting population of 50
trings 13 selected at randon, For” each
trial of the genetic algorities we run to
generation 60. "This represents 0 total of
50#61"3050 function evaluations per inde
Pendent ‘trial The resuice froa. three

als ure shom in Figure 1. This figure
shova the ‘cost of the best string of each
gereration “as the ‘solution ‘procéeds, — At
Firat, perforeance 12 poor. After suffictent

 

 

 

genetic action, near-optimal results are
obtained. In all three cases, near-optinal
results are obtained by generation 20 (1050
function evaluations).

 

Figure 1. Best-of-Generation
Results - Steady
Serial Problem

To better understand these results, we
compare the best solution obtained in’ the
first trial (run $8.1) to the optinal results
obtained by dynanic programming. A pressure
profile is presented in Figure 2. The GA
Tesults are very close to the dynanic
programming solution, with most of the
difference explained by the large discretiza-
tion errors associated with the CA solution.

   

i
t

= 7
STATION NUTBER

Figure 2, Pressure Profile -
Run $S.1 Steady
Serial Problen

To gain a feel for che search rapidity
of the genetic algorithm, we must compare the

 

 
 

nunber of points searched to the size of the
search space. Recall that in this problea,
near-optimal performance is obtained after
only 1050 function evaluations. To put this
in perspective, with a string of length 40,

there are 2° different possible solutions in

the search space (20 = 1.1612), Therefore,
we obtain near-optinal results after search
ing only le-7% of the possible alternatives.
If we were, for example, to search for the
best person anong the worlds 4.5 billion
people as rapidly as the genetic algorithn ve
Would only need to talk to 4 or 5 people
before making our near-optina) selection.

 

 

   

‘A LEARNING CLASSIFIER SYSTEM FOR DYNAMIC
SYSTEM CONTROL

In the remainder of this paper, we show
how the genetic algorithm's penchant for
discovery in string spaces may be usefully
applied to search for string rules in a
learning classifier system (LCS), Learning
classifier systens are the latest outgrowth
of Holland's continuing work on adaptive
systens [5]. Others have continued and
extended this work in a variety of areas
ranging from visual pattern recognition to
draw poker [6-8],

A learning classifier systen (LCS) {s an
artificial systen that learns rules, called
classifiers, to guide its interaction in an
arbitrary environment. It consists of three
main elements:

 

 

1, Rule and Message System
2. Apportionment of Credit Systen
3. Genetic Algorithm

A schenatic of an LCS is shown in Figure
3. In this schematic, we see that the rule
and ressage systen receives environmental
information through its sensors, called
detectors, which decode to sone” standard
pessage Tormat. This environsental nessage
Te placed on a pessage Lise along with a
finite number of “other “Internal ‘messages
generated from the previous cycle. Messages
on the message list may activate classifier:

rules in the classifier store If activated a
classifier may then be chosen to send a
nessage to the message list for the next
cycle. Additionally, certain messages may
call for external action through a nunber of
action triggers called effectors. In this
way, the rule and message systen combines
both external and internal data to guide
behavior and the state of mind in the next
state cycle,

 

 

  

 

In an LCS, it {s important to maintain
simple syntax in the primary units of infor-
pation, messages and classifiers. In the
current study messages are £-bit (binary)
strings and classifiers ,are 3%-position
strings over the alphabet {0,1,@). In this
alphabet the @ 1s a wild card, matching a 0

 

 

12

 

or a 1 in a given message. Thus, we maintain
powerful pattern recognition capability with
sinple structures,

 

ENVIRONMENT

Xoo Les

GA
reas §
i

MESSAGE
pe rons |] MESSACE
popes ust

[erssrien
stone

 

   

 

 

 

 

 

Figure 3, Schematic ~ Learning
Classifier System

In traditional rule-based expert sys-
tems, the value or rating of a rule relative
to other rules is fixed by the programmer in
conjunction with the expert or group of
experts being emulated. In a rule learning
system, we don't have this luxury. The
relative value of different rules is one of
the key pieces of information which must be
learned. To facilitate this type of learn-
ing, Holland has suggested that rules coexist
in a competitive service economy. A competi-
tion {s held anong classifiers where the
right to answer relevant messages goes to the
highest bidders with this payment serving as
a source of, incone to previously successful
message senders. In this way, a chain of
middlenen is formed from manufacturer (source
Message) to message consumer (environmental
action and payoff). The competitive nature
of the economy insures that the good rules
Survive and that bad rules die off.

 

 

In addition to rating existing rules, we
must also have a way of discovering new,
possibly better, rules. This, of course, 18
the appropriate role for our genetic algo-
rithm. In the learning classifier systen
application, we mist be less cavalier about
replacing entire string populations each
generation, and we should pay more attention
to the replacement of low perforners by new
strings; however, the genetic algorithn
adopted’ in the LES is very sinilar to the
simple tripartite algorithn described earli-

 

Taken together, the learning classifier
systen with a computationally complete and
convenient rule and message system, an
apportiontent of credit systen modeled after
 

@ competitive service economy, and the
fnnovative search of a genetic algorithm,
provides a unified framework for investiga”
ting the learning control of dynamic systems.
In the next section we examine the applica-
tion of an LCS to natural gas pipeline
operation and leak detection.

‘A LEARNING CLASSIFIER SYSTEM CONTROLS A
PIPELINE,

A pipeline model, load schedule, and
upset conditions are programmed and inter-
faced to the LCS. We briefly discuss this
environnental model and present results of
normal operations and upset tests.

 

A model of a pipeline has been developed
which accounts for Iinepack accumulation and
frictional resistance. User demand varies on
a daily basis and depends upon the weather.
Different patterns may be used for winter and
sumer operation. In addition to normal
sumer and winter conditions, the pipeline
nay be subjected to a leak upset. During any
given time step, a leak may occur with a
Specified leak probability. If a leak
occurs, the leak flow, @ specified value, is
extracted from the upstream junction ’and
persists for a specified number of time
steps.

The LCS receives a message about the
pipeline condition every time step. A
template for that message is show in Figure
4. “The system has conplete, albeit imperfect
and discrete, knowledge of ‘its state includ-
ing inflow, outflow, tnlet pressure, outlet
pressure, pressure rate change, season, tine
of day, ime of year, and current temperature
reading.

 

COO

 
    
  
  
  

Deseriptlon

 

 

  

rx | inter peesnece aol 2
© w]

 

Figure 4. Pipeline LCS Environmental
Message Template

13

Jn the pipeline task, the LCS has a
nunber of alternatives for actions it may
take. It may send out a flow rate chosen
fron’ one of four values, and it may send a
Bessage indicating Whether a leak {8 suspect=
ed or not.

 

 

‘The LCS receives reward fron its trainer
depending upon the quality of its action in
relation to the current state of the pipe-
line. To make the trainer evervigilant, a
computer subroutine has been written which
administers the reward consistently. ‘This is
not a necessary step, and revard can cone
from a hunan trainer.

Under normal operating conditions we
examine the performance of the learning
classifier systen with and without the
genetic algorithm enabled. Without the
genetic algorithn, the systen 1s forced to
rake do with its Original set of rules. The
results of a normal operating test are pre~
sented in Figure 5. Both runs with
the LCS outperform a randoa walk (through the
operating alternatives). Furthermore, the
run with genetic algoritha enabled {s superi~
or to the run vithout GA. In this figure, we
show time-averaged total evaluation versus
tine of simulation (naxinun reward per
timestep = 6).

 

 

 

ihe onrst

Figure 5. Time-averaged TOTALEVAL
vs. Time. Normal
Operations. Runs
POLCS.1 & POLCS.2

More dramatic performance differences
are noted when we have the possibility of
leaks on the system. Figure 6 shows the
time-averaged total evaluation versus time
for several runs with leak upsets. Once
agein the LCS {s initialized with’ random
rules and permitted to learn fron external
reward. Both LCS runs outperform the random
walk and the run with GA clearly beats the
run with no new rule learning. To understand
this, we take a look at sone auxilfary
performance measures. In Figure 7 we see

 

 

 
bl

the percentage of leaks alarmed correctly
versus time. Strangely, the run without GA
alarms a higher percentage of leaks than the
run with GA. This nay seem counterintuitive
until we examine the false alara statistics
in’ Figure 8. The run without GA is only
able to alarm a high percentage of leaks
correctly because it has so many false
alarms. “The run with GA decreases its false
alarm percentage, while increasing its leaks
correct percentage.

 

 

CET)
Tine (OAYS)

Figure 6, Time-averaged TOTALEVAL
vs, Time = Leak Runs -
POLCS.5 & POLCS.6

    
  
  
 

i: ee

ae

Ye

 

ET
Tine (oAys)

Figure 7. Percentage of Leaks
Correct vs. Time Runs
POLCS.5 & POLCS.6

CONCLUSTONS

In this paper, we exanined the perfor-
mance of a genetic algorithn in two appli-
cations. In the first, a tripartite genetic
algoritha consisting of reproduction, cross~
over, and mutation was applied ‘to the

 

4

 

  
  
 

 

F FRLSE pues

to we wae eae

Ha
Tine toAys)

Figure 8, Percentage of False
Alarns vs. Tire Runs
POLCS.5 & POLCS.6

optimization of a natural gas pipeline’
operation. The control space vas coded as 40
bit binary strings. Three initial popu~
lations of 50 strings were chosen at random,
The genetic algorithm was started and in ali
three cases, very near-optimal performance
was obtained after only 20 generations (1050
funetion evaluations).

In the second application, a genetic
algorithm was the primary discovery mechanism
in a larger rule-learning syste called a
learning classifier system, The LCS, con-

ng of @ syntactically simply rule and
ye system, an apportionment of credit
mechanism based on a competitive service
econony, and a genetic algorithm, vas taught
to operate a gas pipeline under winter and
sumer conditions. It also was trained to
alarm correctly for leaks while mininizing
the number of false alarms,

   

 

REFERENCES

1, Goldberg, D. E., "Computer-Aided Pipe-
Line Operation using Genetic Algorithas
and Rule Learning," Ph.D. Dissertation,
University of Michigan, Ann Arbor, 1983,

2, Hadamard, J., The Psychology of Inven=
thon inthe’ Mathematical Field Prince>
‘ton University Press, Princeton

ton University Press, Princeton, 1945,

3, Wolland, J. H., Adaptation in Natural
and “Artificial” sjotems, Untverstey oF
Htenigan Press, Amr Arbor, 1975.
4, Wong, P. J. and R, E, Larson, "Optimi-
zation of Watural Gas Pipeline Systems
via Dynamic Programming," IEEE Trans.

‘Auto. Control, vol. AC-13, no. 5, pp.
475-481, October, 1968.

 
 

 

 

Holland, J. H. and J. S. Reitman,
"Cognitive ‘Systens Based on Adaptive
Algorizhas,"" in Pattern-Directed Infer~
ence Systens, ‘Waterman, D. ke and Fe
Hayer-torh (ets), pp. 513-529, Acadentc
Press, New York, 1978.

 

 

Snith, S. F., "A Learning System Based
fon Genetic Adaptive Algorithns," Ph.D.
dissertation, University of Pittsburgh,
Pietsburgh, 1980,

Booker, L. B., “Intelligent Behavior as
an Adaptation to the Task Environnent,"
Ph.D, dissertation, University of
Michigan, Ann Arbor, ‘1982,

Wilson, S., "Adaptive ‘Cortical’ Pattern
Recognition," unpublished anuscript,
Rowland Institute cf Sctence, Canbridge,
MA, 1983.

15

 
 

KNOWLEDGE GROWTH IN AN ARTIFICIAL ANIMAL

by

Stewart W. Wilson

Rowland Institute for Science, Cambridge MA 02142

ABSTRACT

Results are presented of experiments with
ple artificial anima! mode! acting in a simulated en:
vironment containing food and other objects Proce-
dures within the model that lead to improved perfor:
mance and perceptual generalization are discussed.
‘The model is designed in the light of an explicit
definition of intelligence which appears to apply to
all animal Ife. Iv is suggested that study of aruf
cial animal models of increasing complexity would
contribute to understanding of natural and artificial
intelligence.

 

 

    

 

INTRODUCTION

The science of understanding and realizing in-
telligence in artificial systems needs a definition of
intelligence. Every science needs good definitions
of the problems it addresses. But in the artificial
intelligence field there has been a hesitancy about
defining intelligence. For example, on the first page
of a recent, widely used Al textbook we find: “A
definition in the usual sense seems impossible be-
cause intelligence appears to be an amalgam of so
many information-representation and information-
processing talents.”[1} For many Al goals, this oinis-
sion 3s nol important But the lack of a good work-
ing definition can lead to uncertainty in evaluating
progress toward understanding intelligence per 4,
even though results are in other respects substan-
tial.

 

This paper reports work using an artificial, be-
having, animal model to study intelhgence at a
primitive level An explicit definition of intelligence
is adopted, and guides construction of the model.
‘The definition has mtuitive appeal and apparent ap-
phicability to the range of hfe from human beings to
very primitive animals. Because of this range, some
results with the primitive animal model should pro-
vide insight into intelligence sn general

  

A DEFINITION OF INTELLIGENCE

A good definition should be relatively simple and
yet cover most of the things we regard as belonging
to the concept and few we regard as not belong-

 

 

16

ing. The psychological hterature offers a number of
useful similar efforts but the best definition of in-
telligence we have found is the following, from the
physicist van Heerden:

 

Intelligent behavior is to be repeatedly successful
in satisfying one’s psychological needs in diverse,
observably different, situations on the basis of
past experience.|2]

  

 

 

“This definition (vH) is suitable for the computer
study of intelhgence beeause i 1s comprehensive and
terms are not difficult to dehine coi
for experimental purposes. A high rate of recespt
of certain reward quantities can correspond to “re-
peatedly successful in satisfying one’s psychological
needs” {on the simplest level. somatic needs). To
“diverse, observably different, situations” can corre-
spond sets of distinct sensory input “vectors” with
each set having a particular implication for optim:
action, To “past experience” can correspond a suit
able internal record of earlier interactions with the
environment, and their results

   

 

   

 

THE ANIMAT MODEL

Computer modehng of human levels of intelli-
gence 1s complex VH's apparent applicabshty to
both simple animals and human beings (assuming
appropriate translations of its terms) suggests the
usefulness of the easier course of considering basic
problems that simple animals must solve, and con-
structing behaving models aimed at solving them.
Observation of the models should aid understand-
ing of all intelligence, and the construction of more
complex models

 

To define our model, we abstract four basic chai
acteristics of simple animals

 

1) The animal exists in a sea of sensory signals. At
any moment only some signals are significant;
the rest are irrelevant.

2) The animal is capable of actions (e.g. movement)
which tend to change these signals

 

3) Certain signals (e.g. those attendant on con-
sumption of food), oF certain signals’ absence
(e.g. absence of pain) have special status for him.
 

 

4) He acts, both externally and through internal
operations, so as approximately to optimize the
rate of occurrence of the special signals.

 

 

An aminal’s sensory-motor situation is described
in very general terms by (1) and (2). Characteristics
(3) and (4) are assumptions which provide a way
of making definite the notion of ‘needs” and their
satisfaction. Together, the four characteristics form
the basis of our artificial animal model. For brevity,
we call such a model an “ansmat”

 

  

 

 

We take as the animat’s basic problem the gen-
eration of rules which associate sensory signals with
appropriate actions so as to achieve the optimiza-
tion of (4), above For this, the major questions are
adaptive, namely:

1) How to discover and emphasize rules that work

 

 

 

2) Get rid of those that don't (since memory space
ts hmited and noise is undesirable), and

38) Optimally generalize the rules that are kept (since
space is limited).

 

‘There is some previous work along these hnes
Notable were Grey Walter's machina speculatriz,
which was a sort of sub-animat which chose actions
based on needs and the sensory situation, but did
not adapt its rules; and m_ docalis, which ‘could be
taught a conditioned response|$] More recently,
Holland and Reitman|4} exhibited successful perfor-
mance by a rule-adaptive animat-like system which
optimized its rate of satisfaction of two distinct
needs. Booker|5] experimented with an animat-like
“hypothetical organism” which adepted its rules in
a simple environment that contained both attr
tive and aversive stimuli; he also provides a review
of earlier systems. The present investigation 1s in-
debted to the last two works

 

  

 

 

 

IMPLEMENTATION

Within the above framework we make the model
definite by defining the animat’s environment, sen-
sory channels, repertoire of actions, its association,
rules, and then its performance and adaptation al-
gorithms.

Environmen

A rectangle on the computer terminal screen 18
rows by 58 columns and continued toroidally at its
edges defines the environmental space. Alphanu-
meric characters at various positions represent ob-
Jects; the animat itself is denoted by * Some, pos-
sibly many, positions are just blank

nels.

__In studies so far, * has been given the ability to
Pick up seusory signals from objects which happen
to be one step (row and/or column) away, in any of

 

nso}

 

 

the eight (including diagonal) directions, nothing ts
detected from more distant objects. Thus the “sense
vector” has eight positions With * located, for
ample, as shown below left, the sense vector would
be as shown at the right.

 

  

‘FP TTEbbbbb,

where b stands for blank To form the sense vec-
tor, the circle of positions surrounding * is mapped
clockwise starting at 12 o'elock, into a left-to-right
string

 

But this vector is not the final sensory input We
imagine that an object is ultimately sensed as the
outcome of measurements upon it by one or more
feature or attribute detectors Without loss of gen-
crality we assume each detector produces either a 0
or 1 output. If there are d detector types, an ob-
ject translates into a binary string d bits in length.
‘The sense vector as a whole thus translates into @
“detector vector” of 8d bits. Detector translations
or encodings of objects are fixed in 's “low-level”
sensory hardware ‘They are assigned at the begin-
ning of an experiment For example. in experiments
discussed here, “F” (food) 1s encoded as “117; “T"
(tree or obstacle) as “01”, and “b” (open space) as
00". {The first bit might. be thought of as the out-
put of a “food smell?" detector, the second, of an
“opacity” detector | Thus the above sense vector
translates into the detector vector.

01 01 11 00 00 00 00 00

 

 

The associative apperatus takes the detector vector
as input

 

“’s actions are restricted to single-step moves
1m each of the eight directions. The directions are
numbered 0-7 starting at 12 o'clock and proceeding
clockwise; for example, a move in direction $ would
be south-ensterly

The animat may move, or attempt to move, to 2
position occupied by an object. The environment’s
response for each kind of object is predefined. In
present experiments, if the move is into position
whose encoding is 00 (the blank object), there 1s
no response (though the new sense vector will in
general be different). If * steps into a space occupied
by an object whose encoding has the first bit equal,
to 1, * is regarded as having eaten the object and
receives a reward signal. If * tries to step toward an
adjacent object whose encoding is O1, the step is not
permitted to occur (2 collision-like banging may be
displayed)

 

 

The foregoing establish a semi-realistic situation
in which sensory signals carry partial, but uncertain
information about the location of food, and avail

 

 
En:
vironmental predictability can be varied through the
choice and arrangement of the objects. The number
of object types which may be experimented with 1s
Inmited only by the number of bits in the detector
encoding scheme.

Ass

 

 

ation Rules.

For its association rules, the amimat uses a rudi-
mentary form of Holland’s|6| “classifier” rule The
amimat’s rules each consist of a “taxon” and an “ac-
tion”. The taxon is a sort of template capable of
matching a certain set of detector vectors. The ac-
tion is some one of the available actions. The ani
mat’s classifier says, in effect, “if my taxon matches
the current detector vector, then consider taking
this action” It 1s a kind of hypothesis about what
to do given a certain sensory situation (class of de-
tector vectors). An example of a classifier would be

O¥ 01 14 0% 00 00 O# O# 2

The matching rule requires that for any taxon
position having a 0 or 2, the same value must occur
in the detector vector; taxon positions with # (don’t
care) match unconditionally. Because of the #’5,
which confer a kind of generality on the classifier,
the above taxon, for example, will match 32 possible
detector vectors, including the one discussed earher

 

 

 

 

 

  

It is worth making a few further observations
about this classifier. First, it is a pretty good one
because if food 1s present in direction 2 and the clas-
sifier matches the detector vector, the action rec-
‘ommended is to move in direction 2 and not some
other direction! Second, in directions 0, 3, 6, and 7,
the taxon only requires that the object be, in effect,
non-food, it being irrelevant whether these direc-
tions have obstacles or are blank. Directions 1, 4,
and 5 have not been so generalized Broadly speak-
ing, & classifier is more useful to the animat to the
extent itis general (matches many detector vectors)
without being so general that it makes too many
errors (re, that in certain matching situations its
recommended action is inappropriate)

Besides taxon and action, each classifier pos-
sesses a “strength”, a quantity serving as the prin
cipal measure of a classifier’s value to the animat.
There may be other associated quantities, as well.

‘The amimat keeps a classifier population |[P] of
fixed size Usually, |P 1s mtiahzed by filling all
the taxa with 0, I, and # according to some ran-
dom rule, actions are similarly filled in, As the an-
imat’s CRT “life” evolves, the classifier population
changes, as will be described.

 

 

PERFORMANCE ALGORITHM

"'s basic cycle is one “step”, within which events
having purely to do with immediate behavior are

18

 

very simple Virst, the current detector vector is cal-
culated Second, |P} 1s searched for classifiers which
match it, these form the “match set” [M| Third, a
classifier is selected from |M) using a probability dis
tribution over the strengths of [M,’s elassifier’s, that
is, the probability of selection of a particular clas-
sifier is equal to its strength divided by the sum of
strengths of classifiers im {M| Fourth,
cording to the action of the selected classifier, or
ines to. ‘The environment’s response to the move
will be as described earher

  

 

 

 

 

It can be seen that *’s move choice tends to be
the one having the greatest total strength among the
‘M/ classifiers advocating it ‘Thus, overall, * first
asks which classifiers of |P) “recognize” the current
sensory situation, then from these tends to pick the
move with the greatest associated strength ‘The
subset of M) consisting of classifiers whose action 1s
the same as the chosen action 1s called the “action
ser A

ADAPTATION ALGORITHM

‘The adaptation algorithm has three distinct ase
pects 1) reinforcement of classifier strengths; 2)
“genetic™ operations on classifiers yielding new clas-
sifiers, and 3) direct creation of classifiers.

 

Reinforcement.

As discussed in the last section, a classifier's
strength is a major determinant of its ability to influ-
ence *’s action and therefore performance. We con-
sequently want strength to reflect the performance
which tends to result when this classifier 1s in [A].
That would be straightforward if every step were
rewarded we could, for example, adjust the clas-
sifier’s strength by an amount proportional to the
reward Classifiers which got bigger rewards would
be stronger, thus more likely to be an (A), etc

 

Realistically, however, it 1s usually the case that
only some of an organism's actions receive a def-
inite reward from the environment. Actions lead~
ang up to, or setting the stage for, a rewarded ac-
tion are themselves not directly rewarded, but they
must somehow be encouraged or the final payoff
will not occur. Holland|7} addressed this problem
in proposing a “bucket-brigade” algorithm in which,
very briefly, 1) classifiers make payments out of their
strengths (o classifiers which were active on the pre-
ceding cycle, and 2) the same classifiers later corre-
spondingly receive payments from the strengths of
the next set of active classifiers External reward
goes to the final active set in the chain. In effect,
given amount of external reward will eventually’ flow
all the way back through a rehable chain, reinforcing
every precursor classifier

  

Our basic implementation of this idea is as fol
lows On each step:
 

 

1) all classifiers in AJ have a fraction ¢ of their
strengths removed;

2) the total strength thus removed from |] is dis-
tributed to the strengths of any classifiers in |A-
1), defined as the action set in the previous step,

3) * then moves and if external reward 1s received
iis distributed to the strengths of ‘Al; if exter-
nal reward 15 not received. the classifiers of |A\
replace those of |A-1

‘Thus every |A. participates in general in two trans-
actions, one paying out, the other receiving We can
write

 

$4 =Sa-eSa+p

where Sq 1s [Al's total strength on one step, Si its
total on the next, and p is the total payoff received
{either external reward or from the next (Al) If p is
the same over time, Sa approaches a constant value
given by p/e, so that under reasonably steady pay-
off conditions, Sq 1s an estimator of typical payoff
Similarly, the strength of any individual classifier is
fan estimator of its t pical payoff

The total payoffs to [A and A-I] are sn the sim-
plest case shared equally by the reciptent classifiers.

This has the consequence that the more classifiers
are im, say, /A/, the less payoff each gets

Genetic Operations:

Consider two classifiers which match sunilar si
vations:

 

 

   

 

Of 01 14 OF 00 00 OF O# 2
and

O# Of 11.01 00 0% O44 Og 2
Each is good. but each still lacks something in gener-
ality since, for example. the matching requirements
for 01 in bits 2-3 and 6-7 respectively, of each are

 

perhaps unnecessarily restrictive Suppose we make
& new classifier by combining bits 5-9 of the first
with bits O-4 and 10-15 of the second. The result
would be the slightly more general classifier.

O# OF 19 0H OOOH OOH / 2

‘The above operation on two classifiers resembles
kind of crossing-over or recombination of chromo-
some parts in genetics. It is an operation in which
two *parent” classifiers produce an offspring that is
possibly an improvement over both of them An
other “genetic” operation, this time using just one
Parent, would first clone the parent then mutate one
‘oF more of the clone’s taxon positions Other types
of operations on classifier structure can be imagined
{one will be discussed later). In each case the at
tempt is to use existing classifiers as the starting
Points for improved classifiers

  

But the crossover points above were chosen quite
carefully; otherwise the offspring might have been no

19

Improvement, or even a retrogression (Lo a classifier
nore specific than either parent). We do not expect
the animat to know where best to cut and mutate.
How can we expect genetic operations to be of any

Holland 8 presents a mathematical theory show-
tng that a population of individual symbol strings
im which cach string can be assigned a numerical
worth, will progressively increase in average worth
as its members undergo reproduction, genetic oper-
ations on or among the offspring, and deletion of in-
dividuals to maintain constant population size The
key requirement is that an individual's probability
of reproduction be proportional to its worth. Hol-
land extended the theory to include classifier s
tems. In employing genetic operations, our ani
constitutes an exploration and test of the theory

   

 

 

 

 

‘The specific algorithm employed is as follows

1) A first classifier ¢1 of P! is selected with proba-
bility proportional to is strength:

2) If et as merely to be reproduced, a copy of it
as made and added to 'P_ To make room some
classifier 1s deleted;

3) If el 1s to be crossed with another classifier. a
second, ¢2, is selected, also with probability pro-
portional to strength, but from the subset of P
of classifiers having the same action as cl. Two
cut points are chosen as above, but at random,
and an offspring ¢3 constructed out of the parts.
¢3 is added to P’ and some classifier is deleted.

Note that the parents are kept (unless one happens
to suffer the deletion, but this 1s unlikely) The
offspring, in effect, go into competition for payoff
with the parents. Better (higher strength) offspring
should proliferate more rapidly than their parents,
driving them out, for worse offspring, the reverse
should be the case

2Cr

 

 

 

’ Operations

 

Occasionally, as * executes the performance algo-
rigthm, a detector vector may occur that no cla
fier of [P_ matches, i.e , the situation is unrecognized
‘The animat’s response is to create a new. match-
tng, classifier A taxon is made by adding some 4’
at random to the detector vector, an action 1s cho-
sen randomly. The created classifier 1s added to P
and one is deleted The new classifier immediately
matches the previously unrecognized situation and
action occurs by the normal mechanism

 

 

EXPERIMENTAL PROCEDURE

‘The animat model was designed with the vIl-
intelligence definition as a guide In experiments
with the model we are interested in finding pro-
cedures and parameter values that seem to give *

 
 

 

greater rather than less vH-intelligence For this
two measures have been adopted. One is a perfor-
mance measure. given an environment, how many
steps does * take, on average, to find food objects?
The other 1s a generality measure docs * evolve
classifiers each tending to be useful in a number of
distinct situations? Generality 1s important because
it suggests that a high level of performance devel-
oped in one environment will carry over to a some-
what different environment

 

‘The experimental procedure is to fix *"s methods
and parameters, then have him do a large number
of “problems” in a particular environment E. The
measures of performance and generality are tracked
A “problem” always consists of starting * at a ran-
domly selected blank position in E, then * moves
until he eats some food, at which point the problem
ends The number of steps between start and food is
recorded a moving average of this quantity over the
previous 60 problems 1s the performance measure,
STPSAV

To track generality, we calculate a histogram
over the “periods” of all classifiers in P. The pe-
riod of a classifier 15 a moving average of the number
of steps by * between occurrences in A. of this clas-
sifier Thus a frequently used classifier will have a
low period [P| will then be general to the extent the
histogram of periods is largest at low period. As |P]
evolves we expect the histogram peak to move to-
ward lower period, if [P]'s generality is increasing.

  

 

 

 

 

 

 

Figure 1 The Environment “WOODS7*.

Aw environment used for many of the exper-

  

iments is “WOODS7", shown in Fig 1. Although
WOODS? may look ¢ contains a tor
tal of 92 distinct sense 's need to dis-

 

cover and generalize 1s substantial. To obtain per-
formance baselines, we can start * randomly, then
let him also move completely randomly until food
(F) 1s bumped into. For WOODS7, the long-term
average of the number of steps this takes is about 41

20

steps. We may also ask (9: what is the best possible
performance (if, say, the animat had human capa-
bilities)? For every starting position, the number of
steps to the nearest F can be found and averaged
over all starting positions. The result for WOODS7
is 2.2 steps.

RESULTS AND DISCUSSION

Fig 2 showsa performance curve for a combi-
nation of procedures and parameter settings that 1s
‘among the best so far found. There is an initial rapid
improvement within the first 1000 problems (un-
typically good during the first 100 problems, where
STPSAV usually stays above 15), followed by very
gradual improvement thereafter. The performance
at 8000 problems, between 4 and 5 steps, is quite
respectable compared with “perfect” (2 2 steps), es-
pecially since * has no information whatsoever until
he 1s next to a nonblank object.

 

 

  

“4

12

1

116

6

Average steps to food

4

      

Musber of problens x 1028

Figure 2. STPSAV (ragged line) and Period Av-
erage (broken line) for * to 8000 prob-
lems. Period values as marked.

  

For the same animat, Fig. 3 shows the histogram
of periods of [P] at 8000 problems. There is a defi-
nite bulge for low periods; the average period is 116.
For comparison, the broken line in Fig. 2 shows the
trend of the period averages at earlier epochs, indi-
cating gradual generalization in the sense we have
defined

 

Qualitatively, a * such as this one gives the im-
pression of “knowing” the Woods quite well. When
next to F, * nearly always takes it directly; occa-
sionally he will move one step sideways and take it
from that direction. When next to one or more T's,

 
of
d

 

116

 

but with no F immediately in sight, * quite reliably
steps around the obstacle(s) and finds the F_ When
* 1s “out im the open”, i.e, the sense vector consists
of blanks, he has no information about the best way
to go, as in a thick fog. One might expect *'s be-
havior to resemble a random walk but this is not the
case. Instead, the movements look more like a ge
eral “drift” 1n some direction, with some supert
posed randomness After several problems the drift
may shift to another direction

 

 

 

72

60

40

Number of Classifiers
30

 

‘amt

 

 

 

 

 

 

 

 

© Se 108 159 200 258 380 350 400 «so
Period

Figure 3. Histogram of classifier periods for the
* of Figure 2 at 8000 problems

Parameter Values

Parameter values for the animat of Fig 2 were
arrived at by experiment. Three basic parame-
ters are discussed in this section, with observations
about setting them reasonably

For Fig 2, [P| contained 400 classifiers. A suit-
‘able value for this number appears related to the
number of distinct sense vectors or “scenes” (here,
92) in the environment Too small a ratio of clas-
sifiers to scenes results in “forgetful” behavior in
which * keeps losing good moves that appeared well
learned. A small ratio means that for some scenes
deletion has a high probability of eliminating all
matching classifiers. For ratios above about four,
the forgetting is much less noticeable ‘To the extent

jeneralizes, more and more classifiers match each
sense vector, further reducing the problem

 

 

 

‘The “estimator fraction”, ¢, was set at 0.2, 1.€,
‘a classifier lost 20 percent of its strength each time
it entered [A]. In general, smaller values of ¢ mean
that a classifier’s strength reflects 2 weighted av-

erage of payoffs that reaches farther into the past.
Conversely, a larger value makes the strength more
sensitive to recent payoffs. It was found that e = 04
produced a noticeably more erratic STPSAV curve,
whereas changing from ¢ = 02 to 0.1 did not affect
the curve significantly Strength should accurately
estimate a classifier’s typical payoff. In this problem,
payoff fluctuations are apparently large enough so
that ¢ = 0-4 results in too short an averaging interval
for good estimation Ife 1s too small, though, newly
formed classifiers may get evaluated too slowly, we
therefore kept ¢ at 0.2

 

 

The rate at which genetic operations occurred
was set proportional to the problen rate  Specif-
ically, at the end of each problem, a single genetic
event (as described earlier) took place with probabil-
tty RGPROB Given the event, crossover occurred
with probability XPROB Settings were typically
0.25 and 0.50, respectively. These seemed to ensure
that, on average, classifiers would be fully evaluated
by the reinforcement process by the time they were
selected for a genetic operation (or deleted) Typ-
ically, a problem took five steps in which each set
AJ had about 10 members, giving about 50 evalu-
ations. The above value for RGPROB then implies
200 evaluations per genetic event. This seems ex-
cessive except that some classifiers are much more
frequently used than others and we wanted to allow
for the well-rewarded but infrequently called-upon
classifier. It is possible our results would have been
speeded up, without adverse side effects. by a lngher
genetic rate.

 

 

Distance Estimation.

 

Performance in the earliest animat experiments
was far below the level of Fig. 2. One defect was a
kind of “dithering” in which while * would tend to-
-d F's, the path would have unnecessary sidesteps
and wanderings. It was then realtzed that the ba-
sic reinforcement algorithm does not care whether =
path from point A to food is long or short, there 1s
nothing which preferentially reinforces the most ex-
peditious classifiers Any path, even a looping one,
will come to equilibriuin at a high strength level in
ts constituent classifiers

    

 

The solution had to be more subtle than simply
penalizing long paths. What 1s required 1s a tech-
mmigue that, at every position, tends to prefer the
most direct of several possible moves but does not
prevent the setting up of a long path if that is ac-
tually the shortest path available. Our solution was
twofold First cach classifier was niade to keep an
estimate of its distance (in steps) to food. This did
not require elaborate look-ahead Instead each clas-
sifier in A-1] adjusted its distance estimate accord-
ing to an average of the distance estimates of |Aj,
when reward was received, the members of |A’ were
similarly adjusted using the quantity 1. This tech-

 

 
 

nique, with each estimate an average over the last
fow updates, 1s quite satisfactory

 

The distances are employed as follows In the
performance cycle, selection from (M] is based on
probability proportional to strength, distance mstead
of just strength Consequently, a move tends to be
selected that 1s not only strong, but also ‘short™

Now comes the second part of the solution. At the
same ume as |) 1s formed, the set NOT/A] of the re-
maining classifiers in |M) 1s taxed Ly a small amount
(typically five percent): the “longer” classifiers thus
tend to incur a loss by not being selected This
“lateral inhibition™ induces a sort of catastrophe in
which the shorter classifiers become even more likely
to be picked and the longer become ever weaker, and
can disappear entirely. Note that the competition is
purely local and does not work against the setting
up of minimal long paths

 

 

This technique is very effective against “dither-
ing’; the progressive takeover of a match set by
a discovered shorter move has been repeatedly ob-
served Our solution is not perfect, however, be-
cause to suppress the special case of occasional loop-
ing situations we had to impose a small tax (five
percent) on [A]. Since [A] 1s the set which receives
payoff, the tax has hittle effect except if a loop is
taking place, and then the tax 1s soun very effective,
Still, 1m principal, even a small tax on [A] reduces
the strength flow in very long chains, putung them
at a reproductive disadvantage. This residual prob-
Jem may be an indication that as paths grow, they
should be “condensed” into umts of behavior longer
than one step.

 

Extension

 

A second area of changes which improved perfor-
mance had to do with the “Create” operations. As
discussed, Create at first only occured when {M! was
empty. It was found that * sometimes also got stuck
looping among situations with nonempty [M)’s. The
tax on [A] enabled recognition of these loops because
the total strengths in each (A) would tend to zero
We put in a threshold that triggered Create if the
strength of any [M] got too low. This suppressed
looping dramatically and improved performance

It was also found important to trigger Create
randomly, at a very low rate (typically, with prob-
ability 0.02 per step). * is engaged in path con-
struction, using the best available current evidence
This can lead to good but nevertheless suboptimal
paths which might be improved if * would only try
something different. Random Creates are one wa:
to introduce a new move direction Usually the new
classifier is no improvement. But when it is, and it
gets tried (gets in {A]), it will be (often heavily) rein-
forced and therefore given a good chance at eventual
reproductive success

 

22

A different type of Create was also found useful
Instead of randomly picking the action in a Created
classifier * may make an educated guess, as fol-
lows From its current position, * steps tentauvely:
into a randomly selected adjacent position. ‘There,
|M| 1s determined and the strength-weighted aver-
age of the distances of its classifiers, MNDIST|M],
is formed. ‘The same is done for several adjacent
positions ‘These values are then compared with
MNDIST M) for the starting position. Several de-
cision schemes are possible, with the general idea
of picking an action direction corresponding to the
shortest apparent path. If, however, none of the ad-
Jacent MNDIST|M|'s is better by more than 1 than
the current position’s value, it 1s preferable not to
create a new classifier. This technique 1s important
early in *’s existence, when very little is yet know:

but, interestingly, it appears that " should not rely
entirely upon it, Some suboptimal paths get set up
which tend not to be improved The problem goes
away if random Creates are also available

Effect of Genetic Operations

Finally, we shall discuss what. the experiments
suggest about the role of the genetic operations To
begin. at 1s helpful to define a “concept” as a set of
classifiers from |P| having exactly the same (axon
and action, and for which there 15 no other class
1m [P| with that taxon and action. The baste effect
of *'s genetic operations then appears to be to ex-
ert a pressure tending to increase the generahity of
{P\’s concepts. That 1s, with time, the periods of
the concepts in [P| tend to decrease The pressure
as restrained by the requirement that the concepts
be more or less correct (* must get the food expe-
ditiously). The precise point of balance appears to
depend on the parameter regime

An mmportant experiment is to evolve an antmat
th reinforcement and Create going as usual, but
th genetic operations turned off. The result is a
performance almost as good as Fig. 2. But signifi
cant generahzation does not occur, the curve of his-
togram averages remains essentially flat ata value
of about 270. There thus appears to be a division
of effort Create introduces the raw material, the
specific examples to be evaluated; and the genetic
operations produce more general concepts from the
examples

 

 

 

 

  

 

 

 

 

Its clear that crossover 1s capable of making a
more general classifier out of two less general par-
ents, this was illustrated earher We are not sure,
however, just why for * the more general concept
has a selective advantage. Somehow, greater gener-
ahty must lead to greater concept strength; there 1s
no other way to win out. Yet being active more fre-
quently does not mm itself result in greater strength
strength 1s an estimator typicel payoff, not payoff
rate.

 
 

Our tentative hypothesis stems from noting that
a more specific concept will always have to share
payoff with any more general offspring that comes
into existence This imbally weakens the specific
concept so that the number of classifiers making 1t
up tends to fall (at equilibrium, numbers are propor-
tional to total strength) Consequently, the specific
gets even less of the payoff. since payoff 1s shared
The result 1s a cascading situation in which the more
general concept wins out. The odds favor the gen-
eral because it has more than this one source of p:
off.

While general classifiers appear to have a selec-
tuve advantage, this is of no use unless such classi-
fiers can be formed and mtroduced in the first place
Crossover 1s adequate for some types of generaliza-
ton. But a natural operation for the purpose 1s
obviously intersection We have tmplemented this
operation as follows ‘Two parents are chosen and
anew taxon is formed by intersecting copies of the
parents’ taxa over a randomly selected interval In
that interval, if the parents differ at a position, the
new taxon gets a #, if not, the new taxon gets the
common value Outside the interval, the new taxon
1s filled in from parent 1

 

 

 

 

Intersection 1s a “hot” operation which should
be used cautiously because it can introduce # s
ft a high rate. Nevertheless, our results show in-
creased generalization with litle performance loss
when crossover and intersection are both available
to*.

 

 
 

 

Space remains only discuss the deletion tech-
nique. The simplest method, conceptually, is to
delete at random Then, to a first approxunation,
the equilibrium number of classifiers in a concept

or in any subsct of {P) whatsoever —is proportional
toiits total strength A drawback of random deletion
18 that a valuable concept that happens to consist
of one classifier 1s at considerable risk untal 1 re-
produces. This is not a problem on average if P
is large enough Yet one wonders whether “deleting
the weak” might not be better

 

Several methods have been tried, all but one
clearly worse than random deletion. ‘The possibly
better method 1s to delete with probability propor-
tional to the reciprocal of strength This has the
obvious effect of tending to protect the precious clas-
sifier just mentioned. It can also be shown that the
Probability that a concept {C_ will lose a member
under this type of deletion is proportional 10 the
square of its number, which places a strong restraint
on over-expansion

 

The * of Fig. 2 employed both intersection
(along with crossover) and inverse-strength deletion

CONCLUSION

In its simple way, * meets the definition of antel-
ligence stated at the beginning. * becomes good
at satisfying its need for food in a Woods of di
verse object configurations on the basis of expei
ence. Though not yet vested, *'s rule generaliza-
on over Lime suggests that performance would be
maintained in a somewhat different Woods. or if the
Woods slowly changed

 

While the present animat has numerous limi
tations (sensory, motor, memory ete) there docs
not seem to be any essential barrier to removal of
the limitations and to carryover of the present algo-
rithms to a more sophisticated model in more come
phicated environments,

 

 

ACKNOWLEI

 

EMENT

‘The author wishes to acknowledge valuable co
versations with CG Shaefer of the Rowland Insti
tute

 

 

REFERENCES

1, Winston, PH Artificial Intelligence, 2nd
Reading, Massachusetts. Addison-Wesley, 1984

2) van Heerden, PJ The Foundation of Empirical
Knowledge. Wassenaar, The Netherlands Wis-
uk, 1968

3) Walter, W.G. The Living Brain New York:
Norton. 1953.

i4, Holland, JH., & Reitman, JS Cognitive sys-
tems based on adaptive algorithms In Pattern-
Directed Inference Systems, Waterman, DA, &
Hayes-Roth, F., (eds). New York: Academic
Press, 1978

 

5 Booker, L Intelligent Behavior ws an Adapta-
tron to the Task Environment. Ph D. Dissertation
(Computer and Communication Sciences), The
University of Michigan, 1982

6 Holland, JH Adaptation In Progress tn Theo:
retical Biology, 4, Rosen, R, & Snell, F.M , (eds.)
New York Plenum 1976

— Genetic algorithms and adaptation
In Adaptive Control of Ill-Defined Systems, Self-
ridge, O.G Russland, EL, & Arbib, MA (eds)
New Vork Plenum 1984.

8 Adaptation in Natural and Artsfictal
Systems Ann Arbor University of Michigan
Press. 1975

9 Martha Gordon, personal con

 

munication.

 
 

IMPLEMENTING SEMANTIC NETWORK STRUCTURES
USING
THE CLASSIFIER SYSTEM
Stephame Forrest

‘The University of Michigan
‘Ann Arbor, Michigan

Introduction

One common cnticism of Classifier Systems is the low-level nature of their
representations In Classifier Systems information is stored as rules (classifiers) that have a
very constrained format (binary bit strings). Low-level binary bit string representations
support adaptive learning algorithms well (Holland, 75)(Holland, 80). However, it is difficult
to interpret the behavior of these systems without a high-level interpreter that can code and
de-code the ones and zeroes into more meaningful terms. In particular, although gross
behaviors can be measured at various intervals using some fitness function it is difficult to
chart how learning takes place or to determine what role is played by each component of the
system. This feature of low-level representauions makes 1 difficult to establish direct
connections between the behavior of Classifier Systems and more common high-level
symbolic representations used in artificial intelligence programs

The research described in this paper addresses this criticism by demonstrating that
Classifier Systems are capable of representing sophisticated high-level structures. This has
been accomplished by selecting one class of knowledge representation paradigms (semantic
networks) and showing how they can be implemented as a collection of Classifier System
rules. The described system takes high-level semantic network descriptions as input and
automatically translates them into a Classifier System representation. It also provides a
“query processor” that takes high-level queries about the semantic network, translates them

into a sequence of Classifier System operations, and translates the results of the queries back

24

 
into higher-level answers

In large scale parallel systems such as the Classifier System, the issue of control is

ct

 

tral Control issues arise in two ways for the Classifier System in deciding which
external classifiers are to be generated, and in deciding which external messages are to be
placed on the message list and when As the number of rules in the system increases, it
quickly becomes impossible to do control the system manually, There are at least two
possible ways to automate the process’ “learning” and “compiling.” Compilation can be
viewed as mapping high-level structures onto lower-level operations (“top down”). Likewise,
some kinds of learning (for example, genetic algorithms) can be viewed as the gradual

emergence of higher-level structures from a random assortment of low-level processes:

 

systems using these kinds of learning organize themselves from the “bottom up.” The
bottom-up approach is the one that has been studied previously for Classifier Systems
(Holland, 80) (Booker. 82) (Goldberg. 83) The top-down approach 1s being explored in this
paper.

The implementation takes the form of a compiler mapping “high-level” semantic
network definitions onto the Classifier System. In this context, the Classifier System is
properly viewed either as a lower-level target language or as a specification for an abstract
parallel machine One particular semantic network formalism was selected for this research
KL-ONE (Brachman. 78) (Schmolze and Brachman. 82) (Brachman and Schmolze. 85)
The KL-ONE family of languages 1s widely used. st contains most of the common semantic
network constructs (the most notable exception being cancel links). has been precisely
described, and includes sophisticated accessing functions as part of the design of the
language These characteristics make KL-ONE an excellent exemplar of the semantic
network representation paradigm

The remainder of this paper is divided into five sections (1) brief description of my
version of the Classifier System, (2) short introduction to KL-ONE (3) description of the

Classifier System implementation of KL~

 

INE. (4) discussion. and (5) conclusions

 
 

The Classifier System

Since there are several variants of Classifier Systems, I will describe below the one
used for this project. This particular system does not include those features that are specific
to the use of adaptive algorithms, such as bidding, support, etc. This 1s because 1 am
interested in showing what sorts of representations are possible, not how they can evolve.
The following view of the Classifier System emphasizes how it can be used to represent
higher-level structures and does not rely on any parucular hardware implementation. Thus,
it is appropriate to describe the language of possible programs for the Classifier System as a
formal grammar. The input to a Classifier program is the set of external messages (often
called detector messages) that are added to the message list during the program's execution
‘The output is the set of messages (called effector messages) read from the message list by an

external agent. Just as many traditional programs can be run interactively. a classifier

 

program can be thought of as receiving intermittent input from the external environment

and occasionally emitting output messages. The syntax for the Classifier System is as

follows:
<classifier system> ::= <classifier>

<classifier> <condition>~ => <action>

 

<condition>

 

<alphabet>" ~<alphabet>”

<action> = <alphabet>"

<alphabet> = 1 0 £

 

Each classifier, or production rule, consists of a condition part and an action part
The action part specifies exactly one action. while the condition part may contain many
conditions (pre-conditions of activation) Rules with more than one condition are referred to
as *multuple-condition classifiers.” A muluple-condition classifier must have each of 1ts pre-
conditions fulfilled in a single time step for it to be activated ‘The conditions and actions
are fixed length strings over the alphabet (1.0.4) where = denotes “don't care” and 1 and 0

are literals. The determination of whether or not a specific message matches a condition 1s a

26
— ii

logical bit comparison on the defined (1 or 0) bits. If a “not” condition is used, the
condition is fulfilled just in the case that no message on the message hist matches it. The
#'s in the condition part designate “don’t care” positions in the sense that they match
either 1 or 0. The action part of the classifier determines the message to be posted. All
defined bits appear directly in the output message. Any + symbols in the action part
indicate that the corresponding bit value in the activating message should be substituted for
the # symbol in the output message.! Actual messages are always completely defined in
that they do not contain “don’t care” symbols Separate conditions are placed on separate
lines, and the first condition (the distinguished condition) of a classifier is used to pass
through messages to the action part.

‘As a simple example, consider the following four bit (n = 4) classifier system

#00% => 1101

 

~Wi=> un
‘This classifier system has three classifiers. The second classifier illustrates muluple-
conditions, and the third contains a negative condition. If an initial message, “0000” is
placed on the message list at time TO, the pattern of activity shown below will be observed

on the message list

Time Step Message List Activating Classifier
To: 0000 external
TL 1101 first
ill third
T2 mu second
73
T4 lll third,

 

‘For multiple-condition classifiers this operation 1s ambiguous since it 1s not clear
what it means to simultaneously perform “pass through” on more than one condition. The
ambiguity is resolved by selecting one condition to be used for pass through. By convention,
this will always be the first condition. Another ambiguity arises sf more than one message
Matches the distinguished condition in one ume step. Again by convention, in my system I
Process all the messages that match this condition. The example illustrates this procedure.

 
 

The final two message lists (null and “1111") would continue alternating until the system
was turned off In TI. one message (1101) matches the first (distinguished) condition and ”
both messages match the second condition. Pass through is performed on the first
condition, producing one output message for time T2. If the conditions had been reversed
(##9#1 distinguished), the message list at ume T2 would have contained two identical

messages (1111)

KL-ONE

KL-ONE organizes descnptive terms into a multrlevel structure that allows
properties of a general concept, such as “mammal.” to be inherited by more specific
concepts, such as “zebra.” This allows the system to store properties that pertain to all
mammals (such as “warm-blooded”) in one place but to have the capability of associating
those properties with all concepts that are more specific than mammal (such as zebra) A
multilevel structure such as KL-ONE 1s easily represented as a graph where the nodes of
the graph correspond to concepts and the links correspond to relations between concepts
Such graphs, with or without property inheritance, are often referred to as semantic
networks.

KL-ONE resembles NETL [Fahiman. 79] and other systems with default hierarchies
1m its exploitation of the idea of structured inheritance of properties. It differs by taking the
definitional component of the network much more seriously than these other systems. In
KL-ONE, properties associated with a concept in the network are what constitute its
definition. This 1s a stronger notion than the one that views properties as predicates of a
“typical” element, any one of which may be cancelled for an “atypical” case. KL-ONE does
not allow cancellation of properties Rather. the space of definitions is seen as an infinite
lattice of all possible definitions: there are concepts to cover each “atypical” case All
concepts in a KL-ONE network are parually ordered by the “SUBSUMES" relation. This
relation, often referred to as “IS-A” in other systems. defines how properties are inherited

through the network. ‘That 1s, if a concept A is subsumed by another concept B, A inherits
all of B's properties. Included in the lattice of all possible definitions are contradictory
concepts that can never have an extension (instance) in any useful domain, such as ~a
person with two legs and four legs.” Out of this potentially infinite lattice, any parucular
KL-ONE network will choose to name a finite number of points (because they are of interest
in that application), always including the top element, often referred to as “THING ”

KL-ONE also provides a mechanism for using concepts whose definitions either
cannot be completely articulated or for which it is inconvenient to elaborate a complete
definition — the PRIMITIVE construct. For example, if one were representing abstract
data types and the operations that can be performed on them, it might be necessary to
mention the concept of “Addition.” However, it would be extremely tedious and not very
helpful in this context to be required to give the complete set-theoretic definition of
addition. In a case such as this, it would be useful to define addition as a primitive concept.
‘The PRIMITIVE construct allows a concept to be defined as having something special
about it beyond its explicit properties Concepts defined using the PRIMITIVE construct
are often indicated with “*" when a KL-ONE network is represented as a graph

While NETL stores assertional information (e.g., “Clyde is a particular elephant”) in
the same knowledge structure as that containing definitional information (for example
“typical elephant”), KL-ONE separates these two kinds of knowledge. A sharp distinction 1s
drawn between the definitional component, where terms are represented. and the assertional
component, where extensions (instances) described by these terms are represented. It 1s
possible to make more than one assertion about the same object in any world For example
it may be possible to assert that a certain obyect is both a “Building” and a “Fire Hazard”
In KL-ONE, the definitional component (and its attendant reasoning processes) of the
system is called the “terminological” space, and a collection of instances (and the reasoning
Processes that operate on it) is referred to as the “assertional” space. The feavures of KL-
ONE that are discussed here (structured inheritance. no cancellation of properties primutive
concepts, etc.) reside in the terminological component, while statements in the assertional

Component are represented as sentences in some defined logic. Reasoning in the assertional

art of the system is generally viewed as theorem proving

 
 

At the heart of knowledge acquisition and retrieval is the problem of classification

Given a new piece of information. classification 1s the process of deciding where to locate “
that information m an existing network and knowing how to retrieve it once it has been
entered. This information may be a single node (concept) or, more likely, it may be a
complex description built out of other concepts. Because KL-ONE maintains a strict notion
of definition, it is possible to formulate precise rules about where any new descripuon
(terminological) should be located in an existing knowledge base.

As an example of this classification process in KL-ONE, if one wants to elaborate a
new concept XXX that has the following characteristics,

XXXX takes place in Africa. and
XXXX involves hunting zebras,

1) XXXX 1s a kind of vacation,
i
there exists a precise way to determine which point in the lattice of possible definitions
should be elaborated as XXXX Finding the proper location for XXXX would involve
finding all subsumption relationships between XXXX and terms that share characteristics
with it

If the terminological space 1s implemented as a multi-level network, this process can
be described as that of finding those nodes that should be immediately above and
immediately below XXXX in the network The notions of ‘above” and “below” are
expressed more precisely by the relation “SUBSUMES* Deciding whether one concept
SUBSUMES another 1s the central issue of classification in KL-ONE, The subsumption
rules for a particular language are a property of the language definition
(Schmolze and Israel, 83)

In summary. there are two aspects to the KL-ONE system (1) data structures that
store information and (2) a sophisticated set of operations that control interactions with
those data structures. In the following sections. the first of these aspects is emphasized A

more detailed treatment of KL-ONE operations 1s contained in (Lipkis. 81)

 

More precisely, XXXX has a location role which 1s value restricted to the concept
Afmca, an activity role which 1s value restricted to concept HuntngZebras, and a SUPERC
link connecting it to the concept Vacation.

30
Classifier System Implementation of KL-ONE

In this section, a small subset of the KL-ONE language 1s introduced and the
corresponding representation in classifiers 1s presented. Then it 1s shown how simple queries
can be made to the Classifier System representation to retrieve information about the
semantic network representation. The simple queries that are discussed can be combined to
form more complex interactions with the network structure (Forrest, 83)

A KL-ONE semantic network can be viewed as a directed graph that contains a finite
number of link and node types Under this view, a Classifier System representation of the
graph can be built up using one classifier to represent every directed link in the graph The
condition part of the classifier contains the encoded name of the node that the link comes
from and the action part contains the encoded name of the node that the link goes to.
Tagging controls which type of link is traversed. In the following, two node types (concepts
and roles) and six link types (SUPERC, ROLE. VR, DIFF, MAX. and MIN) are discussed
‘These node and link types comprise the central core of most KL-ONE systems and are
sufficiently rich for the purposes of this paper

For the purposes of encoding, the individual bits of the classifiers have been
conceptually grouped into fields. The complete description of these fields appears below
The description of the encoding of KL-ONE is then presented in terms of fields and field
values, rather than ‘using bit values It should be remembered that each field value has a
corresponding bit pattern and that ultimately each condition and action 1s represented as a
string of length thirty-two over the alphabet {1.0,%} The word nil denotes “don’t care” for
fan entire field. There are several distinct ways in which the classifiers’ bits have been
interpreted. The use of tagging ensures that there is no ambiguity in the interpretations

used. The type definition facilities of Pascal-like languages provide a natural way to express

the the conceptual interpretations I have used. as shown below

 
 

wpe
tag = (NET.NUM.PRE)
link = (SUPERC,ROLE,DIFF,VRLIN!
direction (UP,DOWN);
compare = (AFIELD,BFIELD,CFIELD);
name = string:
message = string;
numeric = 0 .. 63;

   
 
 
 
  

.MAX.MIN,):

 

classifier pattern = record
case tag : tagfield

NET /* Structural Variant *
(vagfield name lnk direction);

NUM :/* Numeric Variant *
(tagfield name nil direction compare numeric);

PRE /* PreDefined Variant *
(tagfield message);

end;

This definition defines three patterns for constructing classifiers: structural, numeric,
and predefined. The structural pattern is by far the most important. It is used to represent
concepts and roles. The numeric pattern 1s used for processing number restrictions. ‘The
predefined pattern is used for control purposes; it has no don’t cares in it, providing reserved
words, or constants, to the system.

The structural pattern has been broken into four fields: tag, name, link, and direction.
‘The tag field is set to NET. the name field contams the coded name of a concept or role, the
link field specifies which link type is being traversed (SUPERC, DIFF etc.), and the
direction determines whether the traversal is up (specific to general) or down (general to
specific)

‘The Numeric pattern has six fields: tag, name, link, direction, compare, and number
In most cases the name, link. and direction fields are not relevant to the numeric processing
and are filled with don't cares. The tag field is always set to NUM, and the compare field is *
one of AFIELD. BFIELD. or CFIELD. The compare field is used to distinguish operands in
arithmenc operations. The number field contains the binary representation of the number
being processed.

The Predefined pattern has the value PRE in the tag field. The rest of the pattern 1s

assigned to one field These bits are always completely defined (even in conditions and

32

SSS
actions) as they refer to umique constant messages These messages provide internal control

information and they are they are used to initiate queries from the command processor

Concept Specialization

All concepts in KL-ONE are partially ordered by the “SUBSUMES” relation. One
concept, for example Surfing, is said to specialize another concept, say WaterSport, if
Surfing is SUBSUMEG by WaterSport. This means that Surfing inherits all of WaterSport’s
properties. The “SUBSUMES” relation can be inferred by inspecting the respective
properties of the two concepts, or Surfing can be explicitly defined as a specialization of
WaterSport Graphically the specialzation is represented by a double arrow (called a
SUPERC link) from the subsumed concept to the subsuming concept (see Figure 1). KL-
ONE’s SUPERC link is often called an ISA link in other semantic network formalisms.
Since the SUBSUMES relation is transitive, SUPERC hinks could be drawn to all of
WaterSport’s subsumers as well Traditionally, only the local links are represented

explicitly.

WaterSport

 

 

 

 

Surfing

Figure 1
Concept Specialization

‘Two classifiers are are needed to represent every explicit specialization in the network
This allows traversals through the network in either the UP (specific to general) or DOWN

(general to specific) direction. The classifiers form the link between the concept that is

being specialized and the specializing concept. The following two classifiers represent the

 
 

network shown in Figure 1

NORM-WaterSport-SUPERC-DOWN => NORM-Surfing-SUPERC-DOWN
NORM-Surfing-“SUPERC-UP => NORM-WaterSport-SUPERC-UP.

A role defines an ordered relation between two concepts. Roles in KL-ONE are
similar to slots in frame-based representations. The domain of a role is analogous to the
frame that contains the slot; the range of a role 1s analogous to the class of allowable slot-
fillers. In KL-ONE, the domain and range of a role are always concepts. Just as there is a
partial ordering of concepts in KL-ONE, so 1s there a partial ordering of roles. The relation
that determines this ordering is “differentiation.” Pictorially, the DIFFERENTIATES
relation between two roles is drawn as a single arrow (called a DIFF link). Roles are
indicated by a circle surrounding a square (see Figure 2). This allows roles to be defined in
terms of other roles similarly to the way that concepts are defined from other concepts. The
domain of a role is taken to be the most general concept at which it 1s defined, and, likewise,
the range is taken to be the most general concept to which the role 1s restricted (called a
value restriction), If there 1s no explicit value restriction in the network for some role, its
range is assumed to be the top element. THING

Roles are associated with a concept. and one classifier is needed to represent each
association (link) between a concept and its role, For example, the role Arm might be
associated with the concept Person (see Figure 2) and the following classifier would be
generated.

nil-Person-nil-ml-nil
PRE-RoleMessage => nil~Arm-DIFF-nil-nil.

Roles can be defined in terms of other roles using DIFF links. For example. the role
Sibling can be defined as a differenniater of “Relatives” (see Figure 3). Building on this
defimtion. the conjunction WealthySibling is defined by constructing DIFF links from

WealthySibling both to Sibling and to Wealthy as shown in Figure 3.

34
xii ccaicaaidemenemmtbitent

 

Figure 2
Concept and Role

Figure 3 shows how these would be drawn

Wealthy Sibling

WealthySibling
Figure 3
Role Differentiation

There are two links specified by this definition. Two classifiers are needed to
tepresent each link so that queries can be supported in both directions (UP or DOWN).
They are shown below:

NORM-Wealthy-DIFF-DOWN => NORM-WealthySibling-DIFF-DOWN

NORM-WealthySibling~DIFF-UP => NORM-Wealthy-DIFF-UP

NORM-Sibling-DIFF-DOWN => NORM-WealthySibling-DIFF-DOWN

NORM-WealthySibling-DIFF-UP => NOR!

 

{-Sibling-DIFF-UP
‘These classifiers control propagations along DIFF links. They could be used to query the

system about relations between roles.

Value Restrictions

Value restrictions limit the range of a role in the context of a particular concept In

 
 

frame/slot notation. this would correspond to constraining the class of allowable slot fillers

for a particular slot. To return to the sibling example. we might wish to define the concept
of a person all of whose siblings are sisters (Person WithOnlySisters). In this case the role,
Sibling, 1s a defining property of PersonWithOnlySisters The association between a concept.
and a role 1s indicated in the graph by a line segment connecting the concept with the role.
Value restrictions are indicated with a single arrow from the role to the value restriction (a

concept). Figure 4 illustrates these conventions.

Person

 

 
 
    

  

Person With
OnlySisters

   

Sibling

Figure 4
Value Restrictions

One classifier 1s needed for each explicitly mentioned value restriction This classifier
associates the local concept and the relevant role with their value restriction. The control
message. VR, ensures that the classifier 1s only activated when the system is looking for
value restrictions The following classifier is produced for the value restriction:

nil-Person WithOnlySisters-nil-nil-nil

|-Sibling-nil-nil-nil
PRE-VRMessage => nil-Ferale-SUPERC-nil-nil

 

It should be noted that the above definiuon does not require a PersonWithOnlySisters
to actually have any siblings It just savs that if there are any. they must be female. The

definition can be completed to require this person to have at least one sister by placing a

number restriction on the role

36
 

simatic

 

Pictorially, number restrictions are indicated at the role with (x,y), where x 1s the

4
i

lower bound and y is the upper bound Not surprisingly, these constructs place limitations
on the minimum and maximum number of role fillers that an instance of the defined concept
can have. In KL-ONE, number restrictions are limited to the natural numbers The default
MIN restriction for a concept is zero. and the default MAX restriction is infinity. Thus, in
the above example, the concept Person WithOnlySisters has no upper bound on the number

of siblings.

Child

 

 

 

 

 
    

(0.0)
OnlyChild
Sibling

Figure 5
Number Restrictions
Consider the defimtion of an only child shown in Figure 5. This expresses the
definition of OnlyChild as any child with no siblings The following two classifiers would be
generated for the number restriction:

nil-Sibling-nil-nil-ni
nil-OnlyChild-nil-ni

 

 

NUM-nil-MAX-nil-nil-0

  

- PRE-MaxMessagt

é nil-Sibling-nil-nil-nil
nil-OnlyChild-nil-mil-ml

fs PRE-MinMessage => NUM-nil-MI

 

Querving The System

Four important KL-ONE constructs and their corresponding representations in

 
 

classifiers have been described These are: concept specialization, role attachment and

differentiation, value restriction, and number restriction Once a Classifier System
representation for such a system has been proposed, it is necessary to show how such a
representation could perform useful computations. In particular, it will be shown how the
collection of classifiers that represent some network (as described above) can be queried to
retneve information about the network. An example of such a retrieval would be
discovering all the inherited roles for some concept.

In the context of the Classifier System, the only IO capability is through the global
message list. The form of a query will therefore be message(s) added to the message list
from some external source (a query processor) and the reply will likewise be some collection
of messages that can be read from the message list after the Classifier System has iterated
for some number of time steps.

‘As an example, consider the network shown in Figure 6 and suppose that one wanted
to find all the inherited roles for the concept HighRiskDriver, First, one new classifier must
be added to the rule set

NET-ail
~ PRE-ClearMessage => NET-nil.

This classifier allows network messages to stay on the message list until it is explicitly de-
activated by @ ClearMessage appearing on the message list.

The query would be performed in two stages. First, a message would be added to the
message list that would find all the concepts that HighRiskDriver specializes (to locate all

the concepts from which High

 

‘Driver can inherit roles) This query takes two tume steps
After the second time step (when the three concepts that HighRiskDriver specializes are on
the message list). the second stage 1s initiated by adding the “Role” message to the message
list. It is necessary at this point to ensure that the three current messages will not be
rewritten at the next time step so that the role messages will not be confused with the
concept messages. This is accomplished by adding the ClearMessage. which “turns off” the

one overhead classifier. Both stages of the query are shown below-*

 

*The -> symbol indicates messages that are written to the message hist from an

38
 

|
|

Limb}

Legs

 

 

 

 

 

 

Thing
Person Gender
Sex
Female Male”
Sex
Woman
Sex
Man
Sex to
HighRisk
Driver
Younghtan
Figure 6

Example KL-ONE Network,

 

external source

39

 
 

Time Step Message List
To ->  NET-HighRiskDnver-SUPERC-UP

Tk NET-HighRiskDriver-SUPERC-UP
NET-Person-SUPERC-UP

T2 NET-HighRiskDriver-SUPERC-UP
NET-Person-SUPERC-UP
NET-Thing-SUPERC-UP

-> _ PRE-RoleMessage
=> PRE-ClearMessage

73 NET-Sex-DIFF-UP
NET-Age-DIFF-UP
NET-Sex-DIFF-UP
NET-Limb-DIFF-UP

74 NET-Sex-DIFF-UP
NET-Age-DIFF-UP
NET-Limb-DIFF-UP.

 

The query could be continued by adding more messages after ume T4. For example. the
VRMessage could be added (with the ClearMessage) to generate the value restrictions for all
the roles on the list.

This style of parallel graph search is one example of the kinds of retrievals that can be
performed on a set of classifiers that represent a an inheritance network Other parallel
operations include: boolean combinations of simple queries. limited numerical processing,
and synchronization. An example of a query using boolean combinations would be to
discover all the roles that two concepts have in common. This 1s accomplished by
determining the inherited roles for each of the two concepts and then taking their
intersection Queries about number restrictions involve some numerical processing. Finally,
it is also possible to synchronize the progression of independent queries For these three

types of queries. additional overhead classifiers are required

Discussion

The techniques discussed in the previous section have been implemented and fully

described (Forrest, 85) These techniques are presented in the context of more complex KL-

40
 

ally

KL-

ONE operations such as classification and determination of subsumption.

‘The implemented system (excluding the Classifier System simulation) 1s divided into
four major parts: parser, classifier generator, symbol table manager. and external command
processor. The parser takes KL-ONE definitions as input, checks their syntax, and enters all
new terms (concepts or roles) into a symbol table. The classifier generator takes
syntactically correct KL-ONE definitions as mput and (using the symbol table) constructs
the corresponding classifier representation of the KL-ONE expression The parser and
classifier generator together may be thought of as a two pass compiler that takes as input
KL-ONE network definitions and produces “code” (a set of classifiers) for the Classifier
System. Additional classifiers that are mdependent of any given KL-ONE network (for
example, the overhead classifier described in the previous section) are loaded into the list of
network classifiers automatically. ‘These include classifiers to perform boolean set operations.
sorting, arithmetic operations, ete The symbol table contains the specific bit patterns used
to represent each term in a KL-ONE definition. One symbol table 1s needed for each KL-
ONE network. Thus. if new concepts are to be added to a network without recompilation,
the symbol table must be preserved after “compilation.” The external command processor
runs the Classifier System, providing mput (and reading output) from the “classifier
program.”

Several techniques for controlling the behavior of @ Classifier System have been
incorporated into the implementation Tagging, in which one field of the classifier 1s used as
a selector, is used to maintain groups of messages on the message list that are in distinct
states. This allows the use of specific operators that are defined for particular states This
specificity also allows additional layers of parallelism to be added by processing more than
‘one operation simultaneously In these situations the messages for each operation are kept
istinct on the global message list by the unique values of their cags.

Negative conditions activate and deactivate various subsystems of the Classifier
System. Negative conditions are used to terminate computations and to explicitly change
the state of a group of messages when a “trigger” message is added to the list. ‘The trigger

condition violates the negative condition and that classifier is effectively turned off.

 
 

Computations that proceed one bit at a time illustrate wwo techniques (1) using
control messages to sequence the processing of a computation. and (2) how to collect and

combine information from independent messages into one message. Sequencing will alw:

 

be useful when a computation 1s spread out over multiple ume steps instead of being
performed in one step. Collection is important because in the Classifier System it 1s easy to
“parallelize” information from one message into many messages that can be operated on
independently. This is most easily accomplished by having many classifiers that match the
same message and operate on various fields within the message The division of one message
into 1ts components takes one time step. However. the recombination of the new
components back into one message (for example. an answer) is more difficult. The collection
process must either be conducted in a pairwise fashion or a huge number of classifiers must
be employed. The computational tradeoff for n bits is 2" classifiers (one for each
combination of possible messages) in one time step versus n classifiers (one for each bit) that
are sequenced for n time steps. Intermediate solutions are also possible

‘Synchronization techniques allow one operation to be delayed until another operation
has reached some specific stage. Then both operations can proceed independently: until the
next synchronization point. Synchronization can be achieved by combining tagging with

negative conditions.

Conclusions

Classifier Systems are capable of representing complex high-level knowledge
structures. This has been shown by choosing one example of a common knowledge
representation paradigm (KL-ONE) and showing how it can be translated into @ Classifier
System rule set In the translation process the Classifier System 1s viewed as a low-level
target language into which KL-ONE constructs are mapped. The translation 1s described as
compilation from high-level KL~ONE constructs into low-level classifiers

Since this study has not mncorporated the bucket brigade learning algorithm, one

obvious direction for future study 1s exploration of how many of the structures described

42
here are learnable by the bucket brigade. This would test the efficacy of the learning
algorithm and 1 would allow an investigation of whether the translauions that I have
developed are good ones or whether there are more natural ways wo represent similar
structures. While the particular algorithms that I have developed might not emerge with
learning, the general techniques could be expected to manifest themselves It 1s possible
that some of these structures are not required to build real world models, but this seems
unlikely based on the evidence of KL-ONE and some miual investigations with the bucket
brigade These structures are for computations that are useful in many domains and could
be expected to play a role in most sophisticated models that are as powerful as KL-ONE.
Since they are useful in KL-ONE, this suggests that they mught be useful in other real world
models

A start has already been made in this direction. Goldberg (Goldberg, 83] and Holland
[Holland, 85] have shown that the bucket brigade is capable of building up default
hnerarchies, using tags using negative conditions as triggers, and limited sequencing
(chauning). In addition. 1 would look for synchronization. more sophisticated uses of tags
more extensive sequencing, and in the context of knowledge representation. the formation of
roles Roles are more complex than “properties” for two reasons First, they are two place
relations rather than one place predicates. and second. relations between roles (DIFF links)
are well defined. Of the other structures it 1s possible that some are so central to every
Tepresentation system that they should be “bootstrapped’ into a learning system ‘That 1s

they should be provided from the beginning as a “macro” package and not required to be

learned from the beginning every time

 
 

References

Booker, Laiton (1982) “Intellgent Behavior as an Adaptation to the Task Environment”.
Ph D. Dissertation (Computer and Communication Sciences) The University of
Michigan, Ann Arbor, Michigan.

Brachman, Ronald J (1978) “A Structural Paradigm for Representing Knowledge,”
Technical Report No. 3605, Bolt Beranek and Newman Inc., Cambridge, Ma.

Brachman, Ronald J and Schmolze, James G (1985), “An Overview of the KL-ONE
Knowledge Representation System,” Vol. 9, No. 2

Fahlman. Scott E. (1979), NETL: A System for Representing and Using Real-World
Knowledge. The MIT Press, Cambridge, Ma.

Forrest, Stephanie (1985), “A Study of Parallelism in The Classifier System and Its
Application to Classification m KL-ONE Semantic Networks”, Ph. D. Dissertation
(Computer And Communication Sciences) The University of Michigan, Ann
Arbor, Mi.

Goldberg, David (1983), Ph. D. Dissertation. The University of Michigan, Ann Arbor, Mi.

Holland, John H (1975) Adaptation m Nawural and Aruficial Systems. The University of
Michigan Press, Ant Arbon Mie

Holland, John H (1980). “Adaptive Algorithms for Discovering and Using General Patterns
in Growing Knowledge Bases”, International Journal of Policy Analysis and
Information Systems, Vol.4 No 3

Holland, John H. (1985), Personal Communication.

Lipkis, Thomas (1981), “A KL-ONE Classifier”, Consul Note #5. USC/Information Sciences
Institute, Marina del Rey, Ca.

Schmolze, James G. and Brachman, Ronald J. (1982) (editors) “Proceedings of the 1981 KL-
ONE Workshop.” Technical Report No. 4842, Bolt Beranek and Newman Inc..
Cambridge. Ma

Schmolze, James G. and Israel, David (1983). “KL-ONE: Semantics and Classification.” in
Sidner, C., et al., (editors) Techmical Report No. 5421, Bolt Beranek and Newman
Inc . Cambridge. Ma... pp 27-39.

44
The Bucket Brigade is not Genetic

T. H. WESTERDALE

Abstract -- Unlike genetic reward schemes, bucket brigade
schemes are subgoal reward schemes. Genetic schemes operat~
ing in parallel are here compared with a sequentially
operating bucket brigade scheme. Sequential genetic schemes
and parallel bucket brigade schemes are also examined in
order to highlight the non-genetic nature of the bucket bri-
gade.

1, INTRODUCTION

The Bucket Brigade can be viewed as a class of appor-

tionment of credit schemes for production systems. There is
an essentially different class of schemes which we call
genetic. Bucket Brigade schemes are subgoal reward schemes.

Genetic schemes are not.

For concreteness, let us suppose the environment of
each production system is a finite automaton, whose outputs
are non-negative real numbers called payoffs. (To simplify
our discussion, we are excluding negative payoff, but most
of our conclusions will hold for negative payoff as well.)
Each production's left hand side 1s a subset of the environ-
ment state set and each production's right hand sade 1s a
member of the environment's input alphabet. Associated with
each production is a positive real number called that
production's availability.

Probabilistic sequential selection systems are systems
in which the following four steps take place each time unit

(1) The state of the environment is examined and those pro-
ductions whose left hand sides contain this state form the
eligibility set. (2) A member of the eligibility set is
selected, probabilistically, each production in the set
being selected with probability proportional to its availa-
bility. (3) This production then fires, which means merely
that its right hand side is input into the environment caus~
ing an environment state transition and an output of payoff.
(6) A reward scheme (or apportionment of credit scheme

examines the payoff and on its basis adjusts the avaizlabili-
ties of the various productions. Thus the availabilities
are real numbers which are being continually changed by the
weward scheme. Probabilistic sequential selection systems
Giffer from one another in their differing reward schemes.

  

states eaisume that for any ordered pair of environment
@tes there is a sequence of productions which will take us

 
 

from the first state to the second.

The average payoff per unit time is a reasonable meas-
ure of how well the system is doing. If the availabalities
are held fixed, the system-environment complex becomes a
finite state Markov Chain, and the average payoff per unit
time (at equilibrium) is formally defined in the obvious
way. As the availabilities change, the average payoff per
unit time changes. Thus the average payoff per unit time
can be thought of as a function of the availabilities. The
object of the reward scheme is to change the availabilities
so as to increase the average payoff per unit time

The systems above have been simplified so as to more
easily illustrate the points we wish to make. In any usefu.
system the environment would output other symbols in addi-
tion to payoff, symbols whach we could call ordinary output
symbols. The left hand sides of the productions would then
be sets of ordinary output symbols. A useful system would
also contain some working memory (a “blackboard” or “message
list") which could be examined and altered by the produc-
tions. In the above systems the working memory is regarded
as part of the environment and instead of sets of output
symbols we have sets of (Moore type) automaton states which
produce those symbols. for illustrative purposes we have
simplified the system by removing various parts and leaving
only those parts on which the reward scheme operates.

 

In our systems the set of productions is fixed. We
want to study the reward scheme, and allowing generation of
new productions from old ones (E.9. [4]) will merely dis-
tract us.

II. GENETIC SYSTEMS WITH COMPLETE RECOMBINATION

At any given time, the production system can be thought
of as a population of productions, the availability of a
Production giving the number of copies of that production in
the population, or some fixed multiple of the number of
copies. Thus the process of probabilistic selection of the
production to fire can be thought of as randomly drawing
productions from the population, until one 1s drawn that i
in the eligibility set. In some systems the population is
held explicitly and the availabilities are implicit whereas
in others the availabilities are held explicitly and the
population is implicit.

1f the system is to be viewed as a population of pro-
ductions, then of course after each production is success-
fully selected from the population it as tested on the
environment with the environment in the state in which the
previously selected production left it.

46
It is easier to analyse systems in which the result of
a test of a production is independent of which productions
were tested previously. Such systems are usually unrealis
tic, but if the system is viewed as a population of produ
tion strings, rather than of individual productions, then it
is often realistic to view the test of a string as being
independent of whach strings were tested previously. Let us
look at a population system of this kind. The system wall
consist of a population of production strings. The popula
tion will change over time. Tame is viewed as divided into
large units called generations. During a generation, every
string in the population is tested against the environment
and, as a result of the tests, the reward scheme determines
the composition of the population in the next generation. A
system of this kind we call a string population system
Let's examine such a system and give its reward scheme in
detail. We shall call the system, System A. System A is a
genetic system with complete recombination.

 

Begin wath a set of productions, each with an availa-

 

bility. Let nm and N be large integers wath n much larger
than N. The set of availabilities define a population of
length n strings of productions (possibly with repeats) as
follows. Let v be the sum of all the availabilities. For

any length n string, the number of copies of that string in
the population is nv" times the product of the availabil
ities of its constituent productions. In each generation
the number of progeny of each string 1s given by testing the
string and summing the payoff obtained during the test. To
test a string one selects the first production in the strang
that 1s in the eligibility set, fires it, then moves on down
the string untal one finds the next production in the strang
that 1s now in the eligibility set, fires it etc. etc. until
N productions have fired. We will not worry here about the
few cases where one gets to the end of the string before N
productions have fired.

We are assuming that during a generation, every string
is tested against the environment. We are also assuming
that there is an “initia] state" of the enviranment and that
when each string 1s tested the test always begins with the
environment in the initial state, so that the results of a

string test are independent of which strings were previously
tested.

____The formation of progeny 1s followed by complete recom-
bination. In other words, each production's availability i

incremented by the number of times that production occurs in
the new progeny, and the next generation's population i

formed from the availabilities just as the previous
Seneration’s population was. (In effect, the strings are
brokén into individual productions and these productions

then re-combine at random to form a new population of
denath-n strings.)

 
 

We could have demanded that each string test begin with
the environment in the state in which the last string left
at, but if N and n are large then this demand will make
hardly any difference to the test results. This 1s because
the environment “forgets” what state it started in during a
long test. For example, suppose there is one production
whose left hand side is the set of all environment states
and whose right hand side 1s a symbol which resets the
environment to one particular state. Let's call this pro-
duction the resetting production. Then during any string
test, once the resetting production 1s encountered, the pay-
off for the rest of the test and the successive availability
sets are independent of the state the environment was in
when the test started. Thus each string has a value
independent of which strings were tested previously, except
for a usually small amount of payoff at the start of the
test before the first occurrence of the resetting produc-
tion. One can generalize these comments usefully to the
case where there is no resetting production [6], but we wall
mot do so formally here. The important thing to note 1s
that except for a usually small initial segment, the
sequence of successive eligibility sets would be independent
of which strings were tested previously (provided n and N
are large enough). Thus we do not lose anything important
af we assume that each test begins with the environment in
some initial state. So we can think of the tests in a gen-
eration as taking place sequentially or in parallel, it
makes no difference.

Let the value of a string be the sum of the payoffs

when the string is tested wath the environment begun in the
initial state. If there are x copies of a string in the
Population, and if the value of 7 15 y, then the number of
progeny of will be xy. If r 18a production which occurs
z times inthe string » then zxy will be the contribution of
the progeny of 7 to the increase 1n the availabality of r
This is obvious, and we have only re-stated matters in this
way to make it clear that we need not insist that x, y, and
the activations are integers. The formalism makes perfect
sense provided they are non negatave real numbers. If the
value of is 0.038 then every copy of 7 will have 0.038
Progeny. (But remember, we insist that ‘activations, and
hence x, are actually positive.

Note that the behavior of System A can be thought of as
a sequence of availability tuples. In any given generation
the population composition 1s given by the availabilities.
Just as in the probabilistic sequential selection systems,
the availabilities determine the average payoff per unit

time (averaged over the tests of all the strings in the gen-
eration).

System A is deterministic. Given a tuple of avaalabal-
ities it is completely determined what the next tuple of

48

i

 

 

cal.
duct
any
tie

wil.
cum:
mal.
equ

reg
To
per
set
sch
eve:
tem
two
tha
sys
tor
sys
alw
Sys
by
“th
sam.

wil
con
sel
the
tho
bil
cha
an
the
whe
po.
ize
jec
dir
the
tio
imp

sys
gen
all
availabilities (in the next generation) will be. We wall
call two string population systems equivalent if they pro-
duce the same change in the availabalities, that is if given
any tuple of availabilities, the next tuple of availabili-
ties will be the same whichever system we are examining.

Actually we need a weaker notion of equivalence. We
will also call two systems equivalent in several other car-
cumstances. We will describe these circumstances infor-
mally, but wall not give here a rigorous definition of

equivalence.

Let the set of all possible tuples of availabilities be
regarded as a subset of Euclidean space in the usual way.
To each point in the subset corresponds an average payoff
per unit time. System A defines for each point in the sub
set a vector giving the change an availabilities which its

scheme would produce. Two systems are equivalent if at
every point the change vector is the same for the two sys-
tems and the average payoff is also the same. We also call

two systems equivalent if there is a positive scalar k such
that at each point (1) the average payoff for the second
system is k times that of the first, and (2) the change vec-
tor of the two systems aims in the same direction. So a
system which was like System A but whose reward scheme
always gave just half as many progeny would be equivalent to
System A. If we define normalizing a vector as dividing it
by the sum of its components, then condition (2) becomes

“the normalized change vector of the two systems 1s the
same.”

 

For completeness I must mention a complication which
will not be important in our discussion. We need to loosen
condition (2) by normalizing the points in the space them-
selves. Normalizing a point in the space projects it onto
the normalized hyperplane. (Its components can then be
thought of as probabilities and it 1s of course these proba-
bilities that we are really interested in.) If we take a
change vector at a point, and think of the change vector as
an arrow with its tail at that point, then we can normalize
the point where its tail is and also normalize the point
where its head is. The arrow between the two normalized
Points 1s a projection of the change vector onto the normal-
ized hyperplane. We want condition (2) to say “the pro
jected change vector of the two systems aims in the same
direction", or “the normalized projected change vector of
the two systems 1s the same". Sorry about this complica-

fon. It does make sense, but the details will not be
important in our discussion.

 

Of course many schemes are probabilistic. Consider a
System (System 8) just like System A except that in each
Generation instead of its reward scheme giving progeny to
all strings in the population, the reward scheme randomly

 
 

selects just one string and gives only that string progeny
(the same number of progeny System A would give it). Now
the change in availabilities 1s probabilistic. At each
point there are many possible change vectors, depending on
which string 1s selected. When a system produces many pos
sible change vectors at a point, we simply average them,
weighting each possible change vector with the probability
that it would represent the change. It is the average
change vector that we then use in deciding system
equivalence (or rather, the normalized projected average
change vector). We call a scheme noisier the more the pos~
sible change vectors at a point differ from each other. So
System 8 1s equivalent to System A, though System B 1s much
noisier.

Fisher's fundamental theorem of natural selection [1)
(2] applies to Systems A and 8 so we know that for these
systems the expectation of the change in the average payoff
Per unit tame 1s non-negative. We call a system with this
Property safe. A safe system, then, 28 one in which at
every point, the average change vector aims in a direction
of non-decreasing average payoff. Clearly then, a system
that is equivalent to a safe system is also safe

Consader a system like System A except that the anitial
state (the state in which all string tests begin) 1s dif
ferent from the initial state in System A. Technically
this new system would not be equivalent to System A, but if
mn and N are large enough it is nearly equivalent. In decid-
ang system equivalence we will assume n and N are large
enough. More precisely, we note that as n and N increase, a
system's normalized projected average change vectors gradu-
ally change. At any point, the normalized projected average
change vector approaches a limit vector as n and N approach
infinity. It 28 this limit vector that we use as our nor-
malized projected average change vector in deciding system
equivalence. Thus the change in initial state produces a
new system that is equivalent to System A. In fact, a sys
tem like A or B which begins each string test with the
environment in the state the last string test left it is a
system equivalent to A and B.

In all the systems discussed in this paper, a tuple of
availabilities defines an average payoff per unit time, and
the reward scheme defines, for each such tuple, an average
change vector. This is true also an the probabilistic
sequential selection systems. Thus we can compare any two
of our systems and ask whether they are equavalent.

We ask if there is a reward scheme for a probabilistic
sequential selection system that makes the system equivalent
to Systems A and 8. The natural candidate is System C
defined by the following reward scheme: Reward every N pro-
ductions which fire by incrementing the availabilities of

50
these N productions by the sum of the payoffs over these N

firings. But System C is not equivalent to System A. In
the System A string tests, productions are skipped when they
are not in the eligibility set. System A rewards these

(ancrements their availabilities) whereas System C does not.
To make C equivalent to A we must do something about reward-
ang the productions that are not in the eligibality set.

Equivalently we can instead penalize the various pro-
ductions that are in the elagibility set. (See [5] for the
formal details of the argument in the remainder of this sec-
tion, including the effect of ancreasing string length.) The
idea 1s that whenever production r 1s rewarded (has its
availability incremented), the eligibility set R at the tame
x fired is penalized as follows. Let S$ be the sum of all
availabilities and R the sum of the availabilities of pro-
ductions an R. The absolute probability of r 1s the availa-
bility of r divided by 5. The problbility of r relative to
Ris the availability of r divided by R. If the reward 1s
x, the availability of ras first increased by x. Then the
availabilities of all members of R are adjusted to bring R
back down to what at was before the reward. The adjustment
is done proportionally: i.e. the adjustments do not change
the probabilities, relative to R, of the members of R. We
call these adjustments penalties since they penalize a pro

duction for being eligible.

 

 

Let System C' be System C with this penalty scheme
added. Then-System C’ 1s equivalent to Systems A and B

In fact we can easily make this penalty scheme more
sensible if we reward every time unit. The payoff in a time
unit becomes the reward of the last N productions that fired
(wath corresponding penalties for the eligibility sets)
This gives an equivalent, but more sensible scheme.

 

More sensibly, we can use an exponential weighting
function, so that the reward of the production that fired z
time units ago 1s c* times the payoff. (c is a constant and
O<c<1 . Instead of assuming N large, we assume c 1s
very close to 1.) This gives an equivalent scheme which 18
easy to implement because one only needs to keep for each
Production a count

 

(2).
Where X(z) 1s 1 af the production fired z time units ago and
O otherwise A second count, called the production's eligi-
bility count is also kept. This has the formula

#2)
Fz
Set
where F(z) is the sum of the availabilities of the

 

 

 

 
 

productions that were an the elagibility set z time units
ago and f(z) 1s the availability of the given production if
at was in the availability set z time units ago, and 0 oth-
erwise. The count and the eligibility count are particu-
larly easy to update. Each time unit, all productions are
rewarded by the product of the current payoff times the
difference between the count and the eligibility count.
Call the probabilistic sequential selection system using
this scheme System D.

System 0 15 equivalent to System A. (Provided, as we
said, we let nm and N approach anfinity and © approach 1.
Thus System 0 is safe (assymptotically safe as c->1) since
System A is. Unfortunately a system using a bucket brigade
scheme will not in general be safe, and it wall not be
equivalent to System A.

 

Since System P 15 equivalent to the genetic Systems A
and 8, we n call 0 also a genetic system. (Fisher's
theorem says that a genetic system must be safe.) We can
call the reward scheme of System 0 a genetic scheme for a
probabilistic sequential selection system.

 

IIL, THE BUCKET BRIGADE

Genetic schemes like the scheme of System 0 form one
ass of reward schemes for probabilistic sequential seli
tion systems. Another class is the class of bucket brigade
schemes.

    

We shall examine the following bucket brigade scheme.
Let Cand kK be constants, O<K<1, 0<C <1. For each pro-
duction the system holds two quantities, the availability
and the cash balance. Productions are chosen from the eli-
gibility set probabilistically on the basis of the availa-
bilities. Each time unit, the production that fires pays
Proportion C of its cash balance to the production that
fired in the previous time unit. The production that fired
then has the current payoff added to its cash balance, and
then its availability 21s increased by K times its cash bal-
ance. The members of the eligibility set are then penalized
as ain System C’ I know that Holland [4] employs a bidding
system with the bucket brigade, but that system 1s much more
difficult to analyze, so I shall use the probabilistic
sequential selection system described above.

 

Let System £ be a probabilistic sequential selection
system using the above bucket brigade scheme. System E
looks rather like System D. The most fundamental difference

 

however is the following. In the bucket brigade, a produc
tion r as rewarded if it 1s followed by a production whach
1s usually successful. In the genetic schemes of the last

52
section, r 1s rewarded only if it is followed by a produc-
tion which is successful this very time.

For this reason, a bucket brigade like the one in Sys-
tem E 1s not safe. Suppose the environment is such that the
productions must come in triples, each triple being either
hbc, dbe, or dfg. Suppose payoff is 10 for e, and zero for
the other productions. Suppose hbc and dbe are equiprob
able. The bucket brigade will pass b a cash payment of 5 on
average, whereas f will get 0, so dfg will become less prob
able vis a vis dbe. Nevertheless, h gets passed 5 on aver-
age, whereas d gets passed less than 5, since it is same
times followed by f, and g 1s broke.

The bucket brigade gives reward for achieving a
subgoal. In the production system context, the subgoal is
to put the environment in a state which will make eligible
some production t. t says (via its left hand sade) under
what conditions it thinks it can convert the situation into
one which will yield payoff. The subgoal is to provide
those conditions. If r achieves that subgoal then (provided
t as the production actually selected) r is rewarded, The
amount of reward quite properly depends on how good t 1s, on
how useful achieving that subgoal has been found in the
past. This 2s the essence of a subgoal reward scheme.

Now of course it may happen that although t can usually

achieve ultimate payoff, the particular state in which r
happens to place the environment is one which (though
included on the left hand side of t) t actually can never
convert into ultimate payoff. That as, there is something

slightly wrong with the left hand sade of t. The subgoal
(the left hand side of t) 1s badly formulated. In this case
r will still be rewarded handsomely for achieving the

 

Subgoal since t usually does well t of course will be
mildly penalized for its indescriminateness since it will be
in the eligibility set when it shouldn't be. Thus the

effect of poor subgoal formulation can be to penalize the
calling subroutine (t) for its poor formulation of the
Subgoal whale rewarding the subroutine called (r) for the
fact that it did what the calling routine asked. (1 believe
this 1s the correct analogy: the preceding productaon is the

called subroutine and the following one is the calling sub
routine.)

Now thas is not what happens in the genetic schemes.
In those schemes r would not get rewarded when followed by
£2 The called subroutine would only be rewarded for achiev-

#Mg a subgoal in the case where the ultimate result achieved
Payoff. Thus genetic schemes are not subgoal reward
"schemes.

At first sight it looks as if some reinterpretation of
System could reconcile the two approaches. Perhaps the

 

 
 

bucket bragade w21l look genetic 1f an allele 1s something
other than a sangle production. I can't prove that such a
reconciling reinterpretation 15 impossible, but I'm rather
convinced that the dichotomy between subgoal reward schemes
and non subgoal reward schemes is too fundamental to permit
of such a reinterpretation. What I shall do now is to try
to highlight that dichotomy via one of the obvious ways of
trying to make the bucket brigade look genetic. [t will
fail, and the way in which it fails will be illustrative.

Systems D and & are both probabilistic sequential
selection systems. System D 1s equivalent to System A, a
string population system We now construct a string popula-
tion system (System F) that 1s equivalent to System &. The
ways in which systems F and A differ will be informative.

In System A, the composition of the population in a
particular generation uld be determined by examining a
tuple of production availabilities. This will also be true
in System F, though the way in which the availabilities
determine the population composition will be different.

 

The trick in constructing System F 1s to find a way of
explicitly stringing together those productions to which a
bit of cash would be successively passed by the bucket bri
gade

 

We form a population of length M strings of productions
as follows by induction on M. If we have a population of
length M strings, then the population of length M+1 strings
is formed as follows. Take e length M string in turn
For each of these strings, note the eligibility set R at
the end of a test of 1t. (Each string test begins with the
environment in the initial state.) For each r in R make
10000y copies of the string pr , where y is the probability
of r relative to R. Put these copies in the population of
length M+1 strings.

 

 

We are interested in the case where M=N. These strings
can be tested with no skipping.

In such a population of length N strings, any given
production will occur many times, and may of course occur
many times in the same string. For concreteness, let us
call each production occurrence an allele, and let us say
two alleles are of the same type if they are occurrences of
the same production.

We are interested in two kinds of recombination. One
is ordinary complete recombination, in which the strangs are
broken into individual alleles and these then recombine at
random to form a new population of length N strings. The
other is what we will call crazy recombination. In this the
strings are broken into their individual alleles, but each

54
zoeron

   
  
   
  
 
     
  
  
   
  
  
  
  
  
   
   
   
   
  
   
    
   
  
  
   
 
 
  
  
     
  
  
  
  
   
   

allele remembers the type of the allele that preceded it in
ats string. Then these alleles recombine at random, but an
allele, in combining with others, must follow the same type
of allele that it followed in its original string. The
result of crazy recombination is a population of strings
which represent the paths that cash may take in being passed
by the bucket brigade.

System F works as follows. As described above, we use
the availabilities to make a population of length N strangs
which can be tested with no skipping. The payoff for each

of these strings is then determined, but it 1s not summed.
Instead the system tests each string, noting for each time
unit the eligibility set, the allele that fired, and the
payoff received. The system then physically attaches the
eligibility sets and payoffs to the alleles in the string,
like clothes attached to a clothesline. To each allele in
the string 1s attached the eligibility set from which that
allele was selected and the payoff that arrived when that

allele fired. Now we have a population of strings of
alleles with each allele in each string having a number and
a set attached to it. We now do crazy recombination, but

the alleles carry their numbers and sets with them so that
after recombination the strings still have numbers and sets
hung along them. Now and only now are the payoffs for each
string summed. Each production that occurs in the string is
rewarded with this cum. For each allele an the string, the
Production of which that allele 1s an rrence has the sum
added to ats availability; then the elagibility set attached
to the allele is penalized in the usual way. The next gen-
eration population is formed from the new avazlabalities.

 

It as the crazy recombination that implements the
bucket brigade notion that a production's reward depends on
the production which follows and on how much reward that
Production achieved during some entirely different test
Though System F is equivalent to System £, it 1s a bit

_ foisier because it is like a bucket brigade in which the

Cash is passed forward as well as backwards. Cash passed

forwards, however, doesn't affect the biases and so doesn’

affect equivalence in the sense in which we mean at

System F doesn't look much like System A, but we can
change System F slightly to improve matters. Carrying
around the eligibility sets looks rather un-genetic
Instead of carrying around these sets and penalizing them we
“gould carry around their complements and reward them
Equivalently (though with an increase in noise) we could
garry around strings of productions selected from the com

ements reward the productions in the strings much as

original genetic scheme.

 

The idea here is to build a population of Length n
Just as in System A The system then marks on each
 

string the first N alleles that will fire and attaches the
appropriate payoffs to these alleles. Crazy recombination
then proceeds as follows. (We can call this version insane
recombination.) Each string 18 broken into segments, the
breaks occurring at the N alleles. Each segment remembers
the type of the allele that preceded it. The segments then
recombine to form strings composed of N segments (the long
tails which contain none of the N alleles are thrown away --
or, equivalently, they are attached to the ends of the new

strings). In recombining, each segment must follow an
allele of the same type as the one it followed before recom-
bination. The new strings wall now not all be of the same

length. Each segment carries with it the payoff that was
attached to its terminal production before the insane recom-
bination. In each of the new strings the payoffs are summed
and the sum gives the number of progeny of the new strings.
Then the strings, including the new progeny are all broken
apart anto aindavidual alleles and the total number of
occurrences of a production (alleles) 1s 1ts new availabil-
aty. The population of the next generation 1s formed using
the new availabilities. Thus this final breaking apart and
formation of the next generation can be viewed as ordinary
complete recombination.

 

So, beginning with the population thus formed, a gen-
eration consists of the following steps: (1) Mark on ea

string the N alleles whach fare and attach to them the pay-
offs. (2) Do insane recombination, breaking at the N

alleles, and having each segment carry its payoff with it
(3) Sum the payoffs on each of the new strings and produce
the number of progeny given by the sum. (4) Do ordinary
complete recombination

Call the system using this scheme System G. System 6G
1s equivalent to System &.

The scheme of System G looks a bit like alternation of
generations, but it has a dissatisfying artificiality. It
18 possible to remove the almost Lamarkian oddity of carry-
ing payoffs attached to segments, but not in any particu
larly convincing fashion. Note that in step (3) we need not
sum all the payoffs on a string. Equivalently we could just
reward using one of those payoffs, or just a few. The pay-
offs we use need not be carried from before the insane
recombination. They could be re-calculated afterwards. We
need only take a re-combined string and run at, and then
note one of the payoffs during the run. Actually it's not
that travial, The payoff we use must be one that arrives as
the result of the firing of one of the N productions, not
the other intervening productions.

Still, the scheme contains insane recombination and it

as this that makes it inherently different from the genetic
schemes.

56

 
  
   
 
  
   
  
  
   
   
   
  
  
   
   
    
   
  
 
   
 
 
  
  
 
   
  
   
  
 
  
   
     
  
    
    

IV, FURTHER QUESTIONS

This raises several further questions which I find it
difficult to answer.

(1) The bucket bragade now begins to look rather silly. I
this just because of the bucket brigade version used here?
Holland's bidding system bucket brigade is rather different

Perhaps it is the penalty scheme that is at fault. In
genetic systems the penalty scheme preserves safety, but
systems employing a bucket brigade aren't safe anyway. If
the penalty scheme were removed we could go back to crazy
recombination, but without the attached eligibility sets

This as less insane looking than insane recombination, but
still not convincing. Nevertheless, we feel intuitively
that subgoal reward is good. It certainly is a sensible way
of combating scheme noise. $0 15 it the genetic schemes
that are the silly schemes? Or is there yet some way of
viewing the bucket brigade so that it looks genetic?

 

(2) If the genetic schemes and the bucket brigade are for-
mally different, 1s there anyway a biological system analo-
gue of the bucket brigade? One can imagine a crossing over

 

rule that implements insane recombination. But there are
two problems with this. One is that we have not merely
insane recombination, but insane recombination alternating
wath ordinary recombination Unfortunately the ordanary

recombination is required in order to retain equivalence
It may be possible to remove the ordinary recombination and
replace it with a phase which takes the strings and forgets
whach the special N productions are and then marks N new
productions by determining which of the productions would be
the first N to fare if that particular string were run. I
don't think, though, that this change would retain
equivalence.

(3) In the more general context, do we sve biological sys
tems with subgoal reward? If so, then perhaps the bucket
brigade has a sound biological basis, but it 1s merely that
Our usual population genetics formalism fails to capture
that basis For example, one might claim that a gene that
does its own job (achieves a subgoal) more efficiently
incurs lower cost, even if that job is useless for the
Current organism. The trouble is that if the job is con
verting metabolite A into 8 then the efficiency would prob
ably give lower cost only if accumulation of ® (or an
@quivalent) reduced gene expression. But then if B is not
at will accumulate and reduce the cost of even an
gene. Postulating leakage of B and other fid
doesn’t seem to help. The gene really only ends up
Tewarded when 8 ic useful, and fiddling with cost only
Adjusts how much it is rewarded, but doesn’t change the

 

         

     

   
 

basic fact that reward is much more when 8 is useful than
When it isn’t. So that doesn’t seem to work.

A possibility not discussed in this paper as that in
the bucket brigade the string of productions to which the
cash is passed is the analog of a metabolic pathway in whach
each metabolite inhibits the expression of the gene respon-
sible for the reaction that produces that metabolite (3).
Then each gene raises the expression rate (passes cash to)
the gene preceding 1t in the pathway. In this view, the
version of the bucket brigade described in this paper 1s
wrong. In the correct version the cash is passed by a
bucket brigade, the availabilities are adjusted by a genetic
scheme that pays no attention to cash balances (unless you
believe in Lamark), and the probabilities of the various
productions firing are proportional to the products of the
corresponding cash balances and availabilities. The biolog~
ical analogue of these probabilities is then the prevalence
of the various enzyme molecules rather than the prevalence
of the various alleles.

Of course one can look for ecological models which use
subgoal reward. This leads us into the quagmire of
altruism, where current formalisms seem to me unhelpful

 

It may be that examining some parasitic systems or sym-
biotic systems might be helpful. Parasites and endosym
bionts must regulate their reproduction rate $0 as not to

 

destroy a host In effect they are passing cash to the
host. But this 1s a situation where admittedly group selec-
tion is operating. Is it possible to regard two genes on

the same chromosome as symbionts, regulating their reproduc-
tion rate to help each other? Perhaps we should explicitly
implement group selection in our production systems (I don't
believe this would be difficult) and let productions with
various cash-passing schemes compete under such a group
selection scheme.

 

Vv. CONCLUSIONS

It looks as if genetic systems are inherently different
from systems employing the bucket brigade We usually view
genetic systems as population systems operating in parallel,
whereas the bucket brigade operates essentially sequen

tially. This is a superficial difference. The essential
difference appears to be that bucket brigade schemes are
subgoal reward schemes, whereas genetic schemes are not.

 

58

t

u
1
C2)
33

C43

5]

6]

REFERENCES

J. F. Crow and M. Kimura, An Introduction to Population
Genetics Theory. New York: Harper and Row, 1970.

R. A, Fasher, The Genetical Theory of Natural Selec-
tion. New York: Dover, 1958.

P. W. Hochachka and G. N. Somero, Biochemical Adapita
tion Pranceton: Pranceton University Press, 1984.

J. H. Holland, “Escaping brittleness: the possibili-
ties of general purpose learning algorithms applied to
parallel rule based systems,” to appear in Machine
Learning II (R. S. Michalski, J. G. Carbonell, and T.
M. Mitchell, eds.). Palo Alto: Tioga, 1984.

T. H. Westerdale, "A Reward Scheme for Production Sys-
tems with Overlapping Conflict Sets,” to appear in IEEE
Trans. Syst., Man, Cybern.

T. H. Westerdale, “An Automaton Decomposition for
Learning System Environments,” submitted.

 

 

 
     
   
 
    
   
 
  
    
 
    
  
  

ABSTRACT

This paper describes new conceptual and experi
mental results using the probabilistic Iearning
system PLS? PLS2 is designed for any task m
which overall performance can be measured, and
in which choice of task objects or operators
influences performance, ‘The system can manage
incremental learning and noisy domains

PLS? learns in two ways. Its lower “percep=
tual” layer clusters data into economical cells or
regions in augmented feature space. The upper
“genetic” level of PLS2 selects successful regions
(compressed genes) from multiple, parallel cases.
Intermediate between performance data and task
control structures, regions promote efficient and
effective learning.

Novel aspects of PLS2 include compressed
genotypes, credit localization and “population
performance”. Incipient principles of efficiency
and effectiveness are suggested. Analysis of the
system is confirmed by experiments demonstrat-
ing stabihty, efficiency, and effectiveness.

 

Figure 1 Layered learning system PLS? The pereep-
tual learning system PLS! serves as the performance
clement (PE) of the genetic system PLS2. The PB of
PLSI is some task. PLS2 activates PLSI with different
knowledge structures (“cumulative region scts") which
PLS? continually improves. The basis for improve-
‘ment is competition and credit localization.

 

60

GENETIC PLANS AND THE PROBABILISTIC LEARNING SYSTEM:
SYNTHESIS AND RESULTS
Larry Rendell
Departinent of Computer Science,
University of Hlinois at Urbana-Champaign,
1304 West Springfield Avenue, Urbana, Illinois 61801

1, INTRODUCTION

‘The author's probabilistic learning system PLS ws)
capable of efficient and effective generalization
learning in many domains [Re83a, Re834,
Re85a]_ Unlike other systems [La83, Mit83,
Mie 83a], PLS can manage noise, and learn incre:
mentally. While it can be used for “single con.
cept” learning, like the systems described im}
2 been developed and tested in the
jcult. domain of heuristic search, which |
Fequires not only noise management and incre-
ental learning, but also removal of bias
acquired during task performance [Re83a]. The
stem can discover optimal evaluation functions
(see Fig.2). PLS has introduced some novel
approaches, such as new kinds of clustering."

   

 

 

 

 

 

 

 

Figure 2, One use of PLS, In heuristic. search,
object is a state, and its vulity mght be the prob:
ity of contributing to success (appearing on a solution
path) Ex, for fy, this probability is 1/3 Here the |
pair (tg, Pg) is one of three regione which may be used |)
to create’ a heuristic evaluation function Region |
characteristics are determined by clustering. /

 

  

 

 

1, See [Re 83a} for det
for discussion of PLS's “conceptual clustering”
[Mic 83b] which began in [Re76, Re77]. PLS ‘utihty”
‘of domain objects provides “eategory cohesiveness”
[Me85]. [Re 85c} introduces *higher dimensional” clus-
tering which permits creation of structure. Appendix
A summarizes some of these terms, which will be ex-
panded in later sections of this paper.

ls and [Re85a, Re85b]

 
 
  
  
  
    
   
   
  
    
    
     
   
    
   
  
   
     
   
      
   
    
  
 
 
 
 
  
 
 
 
 
  
 
  
   

 

rovel

Another successful approach to adaptation
is genetic algorithms (GA's). Aside from their
ability to discover global optima, GA’s have
several other important characteristics, including
stability, efficiency, Mexibility, and extensibi
[H075, Ho81]. White the full behavior of genetic
algorithms is not yet known in detail, certain
characteristics have been established, and this
approach compares very favorably with other
methods of optimization [Be80, Br81, De80]
Because of their performance and potential, GA's
have been applied to various Al learning tasks
[Re83e, Sin 80, Sm 83}

In [Re83e] a combination of the above two
approaches was described: the doubly layered
learning system PLSe (sce Fig.t)* PLSI, the
lower level of PLS2, could be considered “percep-
ual”; it compresses goal-oriented information
(task “utility”) into a generalized, economical
and useful form (“regions” —-see Figs.2,4). The
upper layer is genetic, a competition of parallel
knowledge structures. In [Re83c}, each of these
components was argued to improve ellicacy and

efficiency

‘This paper extends and substantiates these
claims, conceptually and empirically. The next
section gives an example of a genetic algorithm
which is oriented toward the current context.
Section 3 describes the knowledge structure
(regions) from two points of view PLSI and PLS2
Section 4 examines the synthesis of these two
systems and considers some reasons for their
efficiency. Sections 5 and G present and analyze
the experimental results, which show the system
to be stable, accurate, and efficient. ‘The paper
closes with a brief summary and a glossary of
terms used in macl ng and genctic sys-
tems.

   

 

 

  

 

 

 

  

2. For the reader unfamiliar with learning sys-
tem and other terminology, Appendix B provides brief
explanations.

3. PLst is applicable to any domain for which
features and “usefulness” or utility of objects can be
fined [Re83d]. An object can represent a physical
“eatity or an operator over the set of entities. Domains
Ean be simple (e.g. “single concept” learning), of com=
(es. expert systems). State-space problems and
Bames have been tested in [Re 83a, Re 83d) The PLS
roach is uniform and can be deterministic or pro-

ba The only real difficulty with a new domain
Ja constructing features which bear a smooth rela-

ship to the utility (the system can evaluate and
iereen features presented to it)

2, GENETIC SYSTEMS: AN EXAMPLE,

This section describes a simple GA, to introduce
termmology and concepts, and to provide a basis
for comparison with the more complex PLS2.
‘The reader already familiar with GA's may wish
to omit all but the last part of this section.

 

   

2.

 

. Optimization

Many problems can be regarded as function
optimization In an Al application, this may
mean discovery of a good control structure for
executing some task. The function to be optim-
ized is then some measure of task success which
we may call the performance In the terminol-
ogy of optimization, wis the objective function.
In the context of genetic systems, p is the

fitness, payoff, or merit *

 

The merit p depends on some control
structure, the simplest example of which is a vee~
tor of weights b= (by, be, ., by). Frequently
the analytic form of 4(b) is not known, so exact
methods cannot be used to optimize it (this is the
case with most Al problems). But what often is
available (at some cost) is the value of w for a
given control structure. In our example, let us
suppose that 1 can be obtained for any desired
lue of b, by testing system performance. If p
5 a well behaved, smooth function of b, and if
there is just one peak in the pL surface, then this
local optimum is also a global optimum, which
can be efficiently discovered using hill climbing
techniques. However, the behavior of y. is often
unknown, and 2 may have numerous optima, in
these cases a genetic adaptive algorithm 1s
appropriate.

  

 

 

 

2.2, Genetic Algorithms

In a GA, a structure of interest, such as a
weight vector b, 15 called a phenotype. Fig 3
shows a simple example with just two weights, by
and by. The phenotype 1s normally coded as a
string of digits (usually bits) called the genotype
B. A single digit 1s a gene; gene values are
alleles, The position of a gene within the geno-
type is given by an index called the locus
Depending on the resolution desired, we might
choose a greater or lesser number of sequential
genes to code each bj. If we consider § bits to be

 

 

4p might also be called the “utility”, but we
reserve this term for another kind of quality measure
used by PLS1
  

sufficient, the length of the genotype B will be L
= 5n bits (see Fig 3)

Instead of searching weight space directly
for an optimal vector b, a GA searches gene
space, which has dimensionality L (gene space is
Hamming space if alleles are binary). A GA con-
ducts this search in parallel, using a set of indivi-
dual genotypes called a population or gene pool
By comparing the relative merits jt of individuals
in a population, and by mating only the better
individuals, a GA performs an informed search of
gene space This search is conducted iteratively,
over repeated generations In each new genera-
tion, there are three basic operations performed:
(1) selection of parents, (2) generation of
offspring, and (3) replacement of individuals. (1)
and (2) have been given more attention Parent
selection 1s usually stochastic, weighted in favor
of individuals having higher p values. Ofepring
generation relies on genetic operators which
modify parent genotypes Two natural examples
are mutation (which alters one allele), and cross-
over (which slices two genotypes at a common

 

 

 

 

 

locus and exchanges seginents— see Fig. 3)
POPULATION
Senne B Pheseyse Ment
ceonniii0 02) a
2 °3
cononion es) 17
to
cone 00 en) uM
OFFsPninc
Pareou Chiles B Caldera
19 s81n10 cxoio1ri00 é
amr tires” —eovontsi0 2

Figure 3. Simple genetic system. The upper part of
Unis diagram shows 3 small population of just seven
individuals. Here the set of characteristics (the pheno-
type) is a simple two element vector b. This is coded
by the genotype B. Each individual is associated with
its measured merit 2. On the basis of their w values,
pairs of individuals are stochastically chosen as
Parents for genetic recombination. Their genotypes
are modified by crossover to produce two new
offspring

 

   

 

Because the more successful parents are
selected for mating, and because limited opera-

62

  

tions are performed on them to produce
offspring, the effect is a combination of
knowledge retention and controlled search. Hols
land proved that, using binary alleles, the cross.
‘over operator, and parent selection proportional
toy, a GA is K® times more efficient thag
exhaustive search of gene space, where K is the
population size [Ho7S, Ho8t]. Several empirical
studies have verified the computational efficiency
of GA's compared with alterative procedures for
global optimization, and have covered,
interesting properties of GA's, such as effects of
varying K. For example, populations smaller
than 50 can cause problems [Br81, De 80).

 
 

2.3. Application in Heuristic Search

One AI use is search for solutions to prob-
lems, or for wins in games [Ni80].° Here we wish
to learn an evaluation function H as a combina-
tion of variables x4, Xp, ...,X_ called altributes or |
features (features are often used to describe |
states in search). In the simplest case, H is
expressed as the linear combination b,x, + beX,.|
+. + bx, = b x, where the b, are weights to |
be learned. We want to optimize the weight vec- q
tor b according to some measure of the perfor:

mance when I1 is used to control search.

    
    
     
      
  
     

A rational way to define 4 (which we shall
use throughout this paper) 1s related to the aver- 4
age number D of states or nodes developed in
solving a set of problems. Suppose D is observed
for a population of K heuristic functions If,
defined by weight vectors b,. Since the perfor-
mance improves with lower values of D, a good
definition of the merit of H; (te of b,) 1s the rela-

tive performance measure p, = D / D,, where D
ts the average over the population, ic. D =
D,/K_ This expression of merit could be used to
assess genotypes B representing weight vectors
b,, as depicted in Fig.3.

 

Instead of this simple genetic approach,
however, PLS2 employs unusual genotypes and
operators, some of which relate to PLSI. In the
remaining sections of this paper, we shall exam-
me the advantages of the GA resulting from the
combination of PLSI with PLS?

 

 

 

5, Notice that search takes place both at the
level of the task domain (for good problem solutions),
and at the level of the learning element (for 2 good
control structure Il).

 

      
   
  
   
  

 

 
  
   
  
   
  
 
 
   
  
    
  
   
   
   
  
   
   
   
  
   
  
   
 
   
        
 
 
 
  
 
  
   
 
 
 
 
 
 
   

 

ce 3. PLS INFORMATION STRUCTURING:
of DUAL VIEWPOINT

a The connection between PLSI perceptual learning
ms and PLS2 genetic adaptation is subtle and
al indirect, Basically PLS! deals with objects x
op (which can be just about anything), and their
tte relationships to task performance. Let us call
col the usefulness of an object x in some task
te domain its utiltty u(x).

ral Since the number of objects is typically
of ymense, even vast observation is incomplete,
er and generalization is required for prediction of u,

given a previously unencountered x A
significant step in generalization is usually the
expression of x asa vector of high-level, abstract
features X;, Xp) «-y Xp, $0 that x really represents
not just one object, but rather a large number of
similar objects (e.g. in a board game, x might be
a vector of features such as picce advantage,
center control, etc). A further step in generali-
zation is to classify or categorize x's which are
similar for current purposes.® Since the purpose
ts to succeed well in a task, PLSI classifies x's
having similar utihties u.

   
   

ina
aor
ibe

Class formation can be accomplished in
several ways, depending on the model assumed.
If the task domain and features permit, objects
having similar utilities may be clustered in
feature space, as illustrated in Figs.2&4, giving
‘a “region set” R.’ Another model is the linear
combination I! = bf of §2

It is at this point that a GA like PLS? can
aid the learning process. Well performing b's or
R's may be selected according to their merit

“Note that merit 2 is an overall measure of the
‘task performance, while utility u is a quality
measure localized to individual objects

 

 

   
 

   

The question now is what information
structures to choose for representing knowledge
about task utility, For many reasons, PLS incor-
| Porates the “region set” (Fig 4), which represents
"domain knowledge by associating an object with

.. Here to classify means to form classes,
| eategories, or concepts. This is difficult to automate

7. PLst initiated what has become known as

Eonceptual clustering — where not just feature values

Considered, but alo predetermined forms of clases

65. Fectangles), and the whole data environment (e.g.

tility). ‘See [Re70, Re 77, Re 83a, Re85a, Re85b}, and
alo Appendix A. een

points of view: as a PLS1 knowledge structure,
and as a PLS? genetic structure.

 

 

 

 

 

 

 

 

 

 

Tah 001
° x
oa 8 mw x

Figure 4. Dual interpretation of a region set R.A
region sct is a partition of feature space (here there
are 6 regions). Points are clustered into regions
according to their utility u in some task domain (e.g. u
‘= probability of contributing to task success— see
Fig.2) Here the u values are shown inside the rectan-
les. A region R is the triple (r,u,e), where ¢ 15 the
error in v. The tegion set R= {R} serves both as the
PLSI knowledge structure and as the PLS? genotype.
In PLS1, R is a discrete (step) function expressing vari
ation of utility u with features xj In PLS:
compressed version of the detailed genotype dlustrated
in Fig. 5

     

‘The Region as PLS1 Knowledge Struc-

 

In a feature space representation, an object
isa vector x = (xy, Xp) ++) Xp)* Ina problem
or game, the basic object is the state, frequently
expressed as a vector of features such as piece
advantage, center control, mobility, ete.” Obser-
ations made during the course of even many
problems or games normally cover just a fraction
of feature space, and generalization 1s required
for prediction

  

 

  

In generalization learning, objects are
abstracted to form classes, categories, or con-
cepts. This may take the form of a partition of
feature space, ie. a set of mutually exhaustive
local neighborhoods called clusters or celle
{An73, D182] Since the goal of clustering in PLS
is to aid task performance, the basis for general
n is some measure of the worth, quality, or

 

 

 

 

 

8 Feature spaces are sometimes avoided because
they cannot easily express structure. However, alter-
native representations, as normally used, are also
deficient for realistic generalization learning A new
scheme mechanizes of a very difficult inductive prob-
lem: feature formation [Re 83d, Re 85).

9. The object or event could ust as well be an
operator to be applied to a state, or 8 state-operator
pair See [Re 83d}.
  

utility of a state or cell, relative to the task. One
measure of utility is the probability of contribut-
ing to a solution or win, In Figs.2,4, probability
classes are rectangular cells (for economy) The
leftmost rectangle r has probability u = 0.2.1°
The rectangle r is a category generalizing the
conditions under which the utility u applies.

 

In PLS, a rectangle is associated not just
with its atility u, but also with the utility error
¢. This expression ¢ of uncertainty in u allows
quantification of the effect of noise and provides
an informed and concise means for weighting
various contributions to the value of u during
learning The triple R = (r,u,e), called a
region, is the main knowledge structure for PLSI.
A set R = {R} of regions defines a partition in
augmented feature space.

 

 

 

R may be used directly as a (discrete)
evaluation or heuristic function H = u(r) to
assess state x€r in search. For example, in
Fig4, there are six regions, which differentiate
states into sty utility classes. Instead of forming
a discrete heuristic, R may be used indirectly, as
data for determining the weight vector b in a
smooth evaluation function H = b x (employing
curve fitting techniques). We shall return to
these algorithmic aspects of PLS 1n fi.

   
   

 

 

Figure § Definition of maximally detailed genotype U
I the number of points in feature space is finite and a
value of the utility is associated with each point, com-
plete information can be captured in a detailed’ geno
type U of concatenated utilities u Us .. uy. Coordi
nates could be linearly ordered as shown here for the
two dimensional case. U is an fully expanded genotype
corresponding to the compresced version of Fig. 4

10, This could be expressed in other ways. The
production rule form is r=u, Using logic, + is
represented: (05 x, 54) 9 (05x; 5 3}

   

 

 

 

 

 

64

   

     
    
   
 

3.2, The Region Set as Compressed and
Unrestricted PLS2 Genotype

Now let us examine these information
Structures from the genetic viewpoint. The
weight vector b of evaluation function H could
be considered a GA phenotype. What might the
genotype be? One choice, a simple one, was
described in §2 and illustrated in Fig.3: here the
genotype B 1s just a binary coding of b. A
different possibility is one that captures exhaus-
tive information about the relationship between
utility wand feature vector x (see Fig.5). In th
case, the gene would be (x,u). If the number of
genes finite, they can be indexed and con-
catenated, to give a very detailed genotype U, |
which becomes a string of vahtes ujttg ... Uy, cod- |
ing the entire utility surface in augmented
feature space

  
   
 
      
   

‘This genotype U is unusual in some impor-
tant ways. Let us compare it with the earlier
example B of §2 (Fig.3). B is simply a binary |
form of weight vector b. One obvious difference |
between B and U ts that U is more verbose than
B.. This redundancy aspect will be considered
shortly, The other important difference between
B and U is that alleles within B may well interact
(to express feature nonlinearity), but alleles u,

in U cannot interact (since the u, express an
absolute property of feature vector x, ic. its util
ity for some task) As explamed in the next sec-
n, this freedom from gene interdependence
permits localization of eredit."!

 

  

 

 

The detailed genotype U codes the uti
surface, which may be very irregular at worst, or
very smooth at best, This surface may be locally
well behaved (it may vary slowly in some
volumes of feature space) In cases of local regu-
larity, portions of U are redundant. As shown mm
Fig.5, PLS2 compresses the genotype U, into the
region set R (examined in §31 from the PLSI
viewpoint) In PLS2, a single region R = (r,u,e)
1s a sel of genes, the whole having just one allele
u (we disregard the genetic coding of e) Unlike.
standard genotypes, which have a stationary
locus for each gene and a fixed number of genes,
the region set has no explicit loci, but rather a

 

 

 

 

11, While one of the strengths of a GA is its abil
ity to manage interaction of variables (by *co-
adapting” alleles), PLS? achieves efficient and concise
knowledge representation and acquisition by exible
sgene compression, and by certain other methods exam-
ined later in this paper

 
  

variable number of elements (regions), each
representing a variable number of genes A
Fegion compresses gene sets having similar utility
according to current knowledge.

4. KNOWLEDGE ACQUISITION:
SYNERGIC LEARNING ALGORITHMS

In this section we examine how R is used to pro-
ide barely adequate information about the util-
ity surface. This compact representation results
in economy of both space and time, and in
effective learning. Some reasons for this power
are considered.

 

The ultimate purpose of PLS is to discover
utility classes im the form of a region set R. This
knowledge structure controls the primary task
for example, in heuristic search, R = (R} =
{(r,u,e)} defines a discrete evaluation function
I(r) =u.

 

The ideal R would be perfectly accurate
and maximally compressed. Accuracy of utility
u determines the quality of task performance
Appropriate compression of R characterizes the
task domain concisely but adequately (sce
Figs 3,4), saving storage and ume, both during
ask performance and during learning.

 

These goals of accuracy and economy are
approached by the doubly layered learning sys-
tem PLS? (Fig.1). PLS! and PLS2 combine to
become effective rules for generalization (induc-
ion), specialization (differentiation), and reorgan-
ization. The two layers support each other in
various ways for example PLS2 stabilizes the per-
ceptual system PLSI, and PLS! maintains geno-
type diversity of the genetic system PLS2_ In the
following we consider details, first from the
standpoint of PLS1, then from the perspective of
PLS2,

   

 

 

 

 

 

 

4.1, PLS1 Revision and Differentiation

Even without a genetic component, PLSI is
flexible learning system which can be employed
in noisy domains requiring incremental learning
It can be used for simple concept learning lke
the systems in [Di82}, but most experiments have
involved state space problem solving and game

playing.!? Here we examine PLS in the context of

   

 

  

 

12, These experiments have led to unique results
such as discovery of locally optimal evaluation func-

tuons (see [Re 83a, Re 834).

65

these difficult tasks

As described in §31, the main PLSt
Knowledge structure is the region set R =
{(r,u,e)}. Intermediate between basic data
obtained during search, and a general heuristic
used to control search, R defines a feature space
augmented and partitioned by wand e Because
R is improved incrementally, it is called the
cumulative region sel. PLSI repeatedly performs
two basic operations on R. One operation 1s
not revision (of utility u and error e),
and the other is specialization, differentiation, or
refinement (of feature space cells r). These
operators are detailed in [Re83a, Re83d}; here
we simply outline their effects and note their him-
itations

 

    

 

Revision of u and ¢ For an established
region R = (r,u,e) € R, PLS! is able to modify wu
and to decrease by using new data This is
accomplished in a rough fashion, by comparing
established values within all rectangles r with
fresh values within the same t It is difficult or
impossible to learn the “true” values of u, since
data are acquired during performance of hard
tasks, and these data are biased in unknown
ways because of nontrivial search

 

  

Refinement of R. Alternately performing
then learning, PLS! acquires more and more
detail about the nature of variation of utility u
with features, ‘This information accumulates in
the region set R = {R} = {(r,u,e)}, where the
primary effect of clustering u is increasing resolu-
tion of R The number, sizes, and shapes of ree-
tangles in R reflect current knowledge resolution
AS this differentiation continues in successive
iterations of PLSI, attention focuses on more use-
ful parts of feature space, and heuristic power
improves

 

Unfortunately, so does the likehhood of
error. Further, errors are difficult to quantify
and hard to localize to individual regions.

In brief, while the incremental learning of
PLS1 is powerful enough to learn locally optima
heuristics under certain conditions, and while
PLSI feedback is good enough to control and
correct mild errors, the feedback can become
unstable in unfavorable situations: instead of
being corrected, errors can become more prow
nounced. Moreover, PLSI is sensitive to parame-
ter settings (see Appendix B). The system needs
support

 

 

  

 
  

4.2. PLS2 Genetic Operators

Qualities missing in PLSt can be provided
by PLS2. As §4.1 concluded, PLSI, with tts single
region set, cannot discover accurate values of
utilities u. PLS2, however, maintains an entire
population of region sets, which means that
several regions in all cover any given feature
space volume. The availabilty of comparable
regions ultimately permits greater accuracy in u,
and brings other benefits.

 

 

As §3.2 explained, a PLS2 genotype 1s the
region set R = {R}, and each region R= (r,u,¢)
€ Ris a compressed gene whose allele 1s the util
ity u. Details of an early version of PLS? are
given in [Re83c]. Those algorithms have been
improved; the time complexity of the operators
in recent program implementations 1s linear with
population size K. ‘The following discussion out-
lines the overall effects and properties of the
various genetic operators (compare to the more
usual GA of §2)

K-sexual mating is the operator analo-
gous to crossover. Consiler a population {IR} of
K different region sets R. Each set is composed
of a number of regions R which together cover
feature space A new region set R! 1s formed by
selecting individual regions (one at a time) from
parents R, with probability proportional to merit
b (menit is the performance of R defined at the
end of §2). Selection of regions from the whole
Population of region sets continues until the
feature space cover is approximately the average
cover of the parents. This ereates the offspring
region set Rt! which 1 generally not a partition.

     

  
      

   

 

Gene reorganization. For cconomy of
storage and time, the offspring region set R! 1s
repartitioned so that regions do not overlap in
feature space

Controlled mutation. Standard muta-
tion operators alter an allele randomly. In con-
(rast, the PLS2 operator analogous to mutation
changes an allele according to evidence arising in
the task domain The controlled mutation
operator for a region set R = {(r,u,e)} is the
utility revision operator of PLS! As described in
§4.1, PLS! modifies the utility u for each feature
space cell r.

 

Genotype expansion. ‘This operator is
also provided by PLSI. Recall the discussion of
§3.2 about the economy resulting from compress
ing genes (utility-feature vectors) into a region

66

    

R, and is carried

seri

igenotype diversity. Thus
premature convergence,

81, Mad].

rithms just ow
ing. principles
are briely di
Re85b, Re85e}).

Credit localization. The selection of
regions for K-sexual mating may use a single
merit value w for each region R within a given
set R. However, the value of, ys can just as well
he localized to single regions within R, by com-
paring R with similar regions in other sets. Since
regions estimate an absolute quantity (task-
related utility) in their own volume of feature
space, they are independent of each other. Thus
credit and blame may be assigned to feature
space cells (i.e. to gene sequences).

Assignment of credit to individual regions
within a cumulative set Ris straightforward, but
it would be difficult to do directly in the final
evaluation function H, since the components of
H, while appropriate for performance, omit infor-
mation relevant to learning (compare
Figs.2,4)"

Knowledge mediation. Successful sys-
tems tend to employ information structures
which medtate data objects and the ultimate
Knowledge form. These mediating structures
include means (o record growing assurance of
tentative hypotheses.

When used in heuristic search, the PLS
region set mediates large aumbers of states and a

 

 

 

 

13. There are various possibilities for the evalua-
tion function H, but all contain less useful information
than their determinant, the region set.R. The sim.
plect heuristic used in [Re 83a, Re 83d) is H = b,
where b is a vector of weights for the feature vector f’
(This linear combination is used exclusively in experi
ments to be described in §5.) The value of b is found
Using regions az data in linear regression [Re 83a,
Re 838],

 

 

 
very concise evaluation function 11 Retention
and continual improvement of this mediating
structure relieves the credit assignment problem
This view is unlike that of {Di8I,p.14, Di82]
learning systems often attempt to improve the
control structure itself, whereas PLS acquires
knowledge efficiently 1n an appropriate structure,
and utilizes this knowledge by compressing it
only temporarily for performance. In other
words, PLS does not directly search rule space for
a good H, but rather searches for good cumula-
tive regions from which H is constructed.

   

 

 

 

Full but controlled use of every datum
Samuel's checker player permitted each state
encountered to influence the heuristic H, and at
the same time no one datum could overwhelm
the system. The learning was stochastic: both
conservative and economic. In this respect PLS2
is similar (although more automated).

Schemata In learning systems and
genetic algorithms. A related efficienc:
both Samuel's systems and PLS is like the sche-
mata concept in a GA. Ina GA, a single indivi-
dual, coded as a genotype (a string of digits),
supports not only itself, but also all ats sub-
strings, Similarly, a single state arising in heuris-
tic search contains information about every
feature used to describe it. Thus each state can
be used to appraise and weight each feature.
(The effect is more pronounced when a state is
described in more elementary terms, and combi-
nations of primitive descriptors are assessed —sce
{Re8sc}).

 

 

  

5, EXPERIMENT AND ANALYSIS

PLS2 ts designed to work in a changing environ-
iment of increasingly difficult problems. This scc-
tion describes experimental evidence of effective
and efficient learning.

 

5.1. Experimental Conditions

The features used for
these experiments were the four of [Re&3a}_ The
relationship between utility and these features is
fairly smooth, so the full capability of a GA is not
tested, although the environment was dynamic.

Definition of merit #. As §4 described,
PLS2 choses regions from successful cumulative
sets and recombines them into unproved sets
For the experiments reported here, the selection
criterion was the global merit pi. the perfor-

Tame features.

  

 

 

 

mance of a whole region gel, without localization
of credit to individual regions. This measure »
was the average number of nodes developed D in
a training sample of 8 fifteen puzzles, divided
into the mean of all such averages in a popula-
tion of K sets, ic. » = D/D, where D is the

average over the population (D = © D,/ K)

  

 

 

Changing environment. For these exper-
iments, successive rounds of training were
repeated in incremental learning over several
iterations or generations. The environment was
altered in successive generations; it was specified
as problem difficulty or depth d (defined as the
number of moves from the goal in sample prob-
lems). As a sequence of specifications of problem
difficulty, this becomes a training difficulty vector
d= (d,,ds,. dy)

Here d was static, one known to be a good
progression, based on previous experience with
user training [Co84].!* In these experiments, d
was always (8,14,22,50,#,#,.) An integer
means random production of training problems
subject to this difficulty constraint, while “#"
demands production of fully random training
instances.

 

5.2. Discussion

Before we examine the experiments them-
selves let us consider potential differences
between PLSI and PLS? im terms of their
effectiveness and efficiency. We also need a cri-
terion for assessing differences between the two
systems.

Vulnerability of PLS1 With population
size K = 1, PLS? degenerates to the simpler PLS!
In this case, static training can result in utter
failure, since the process is stochastic and various
things can go wrong (sce Appendix B) The
worst is failure (0 solve any problems in some
generation, and consequent absence of any new
information If the control structure II is this
poor, it will not improve unless the fact 1s

 

 

   

 

11, PLS and similar systems for problems and
games are sometimes neither fully supervised nor fully
Unsupervised | The original PLSI was intermediate in
this respect Training problems were selected by a hu-
man, but from each training instance, a multitude of
individual nodes for learning are generated by the sys-
tem Each node can be considered a separate example
for concept learning [Re 83d]. {Co8i] describes experi
ments with an automated trainer.

 
detected and problem difficulty is reduced (i.e.
dynamic training is needed),

Even without this catastrophe, PLS! per-
forms with varying degrees of success depending
‘on the sophistication of its training and other
factors (explained in Appendix B). With miaimal
human guidance, PLSI always achieves a good
evaluation function H, although not always an
optimal one. With static training, PLSI succeeds
reasonably about half the time,

Stability of PLS2. In contrast, one would
expect PLS2 to have a much better success rate
Since PLS1 is here being run in parallel (Fig. 1),

*¢ PLS2 should reject hopeless cases (the
are small), a complete catastrophe (all H's
failing) should occur with probability p< qX,
where q is the probability of PLSI failure and K
is population size. If q is even as large as one
half, but K is 7 or more, the probability p of
catastrophe is less than 0.01,

 

   

 

 

Cost versus benefit: a measure, Failure
plays a part in costs, so PLS2 may have an
advantage, The ultimate criterion for system
quality is cost effectiveness: is PLS2 worth its
extra complexity? Since the main cost is in task
performance (here solving), the number of nodes
developed D to attain some performance is a
good measure of the expense.

 

It training results in catastrophic failure,
however, all effort is wasted, so a better measure
1s the expected cost D/p, where p 1s the probal
ity of success. For example, if D = 500 for
Viable control structures, but the probability of
finding solutions is only %. then the average cost
of useful mformation is 500 /% = 1000.

To extend this argument, probability p
depends on what is considered a success. Is suc-
cess the discovery of a perfect evaluation func-
tion H, or 1s performance satisfactory if D
departs from optimal by no more than 25%?

 

 

 

 

Results

Table I shows performances and costs with
various values of K. Here p is estimated using
roughly 36 trials of PLst in a PLS2:context (if K
=1, 36 distinct runs; if K =2, 18 runs; etc.).
Since variances in D are high, performance tests
were made over a random sample of 60 puzzles.
This typically gives 95% confidence of D * 40.

Accuracy of learning. Let us first com-
Pare results of PLSI versus PLS2 for four different
Success criteria, We consider the learning to be
successful if the resulting heuristic H approaches
optimal quality within a given margin (of 100%,
50%, 25%, and 10%).

Columns two to five in the table (the
second group) show the proportion of H's giving
performance D within a specified percentage of
the best known D (the best D is around 350
nodes for the four features used). For example,
the last row of the table shows that, of the 36
individual control structures H tested in (two

ferent) populations of size 19, all 36 were
within 100% of optimal D (column two). This
means that all developed no more than 700 nodes
before a solution was found. Similarly, column
five in the last row shows that 0.21 of the 36 H's,
or 8 of them, were within 10%, ie. required no
more than 385 nodes developed.

Cost of accuracy. Columns ten and
leven (the two rightmost columns of the fourth
group) show the estimated costs of achieving per-
formance within 100% and within 10% of
‘optimum, respectively. ‘The values are based on
the expected total number of nodes required (i.e.
D/p), with adjustments in favor of PLS1 for extra
PLS? overhead. (The unit is one thousand nodes
developed.) As K increases, the cost of a given
accuracy first increases. Nevertheless, with just
moderate K values, the genetic system becomes
cheaper, particularly for an accuracy of 10%,

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 
‘The expected cost benefit 1s not the only
advantage of PLS2,

Average performance, best perfor-
mance, and population performance. Con:
sider now the third group of columns in Table I
‘The sixth column gives the average D for all H's
in the sample (of 36) The seventh column gives
Dy for the beet Hh, in the sample. These two
measures, average and best performance, are
often used in assessing genetic systems [Br81].
The eighth column, however, is unuswal; it indi-
cates the population performance D, resulting
when all regions from every set in the population
are used together in a regression to determine
H,, This 1s sensible because regions are indepen
dent and estimate the utility, an absolute quan-
tity ([Re83e, ef [Br81, Ho75})

Several trends are apparent in Table I
First, whether the criterion is 100%, 50%, 25%,
or 10% of optimum (columns 2-5), the proportion
of good fI’s increases steadily as population size
K rises. Similarly, average, best, and population
performance measures D, Dy and Dy (columns 6-
8) also improve with K. Perhaps most important
is that the population performance D, is so relt
ably close to best, even with these low K values
‘This means that the whole population of regions
can be used (for 15,) without independent
verification of performance. In contrast, indivi-
dual H's would require additional testing to dis-
cover the best (column 7), and the other alterna-
tive, any II, is likely not as good as 11, (columns
6 and 8). Furthermore, the entire population of
regions can become an accurate source of mas-
sive data for determining an evaluation function
capturing feature interaction [Re83b].

   

 

 

This accuracy advantage of PLS2 is illus
trated in the final column of the table, where, for
a constant cost, rough estimates are given, of the
expected error in population performance Dp
relative to the optimal value.

tis interesting that such small populations
improve performance markedly; usually popula-
tuon sizes are 50 or more.

 

6. EFFICIENCY AND CAPABILITY

Based on these empirical observations for PLS2,
on other-comparisons for PLS1, and on various
conceptual differences, general properties of three
competing methods can be compared: PLSI,

 

 

PLS2, and traditional optimization. In [Re81],
PLS1 was found considerably more efficient than
standard optimization, and the suggestion was
made that PLS! made better use of available
information By studying such behaviors and
underlying reasons, we should eventually identify
principles of efficient learning. Some aspects are
considered below.

‘Traditional optimization versus PLS1.
First, let us consider efficiency of search for an
optimal weight vector b in the evaluation func-
tion I] = b.f, One good optimization method is
response surface filting (RSF). It can discover a
local optimum in weight space by measuring and
regressing the response (here number of nodes
developed D) for various values of b RSP util-
izes just a single quantity (i.c. D) for every prob-
lem solved. This seems hike a small amount of
information to extract from an entire search,
since a typical one may develop hundreds or
thousands of nodes, each possibly containing
relevant information. In contrast to this tradi-
tional statistical approach, PLSt, like [Sa63,
SaG7], uncovers knowledge about every feature
from every node (sce §43) PLSI, then, might be
expected to be more efficient than RSF. Expert
ments verify this [Re81}

‘Traditional optimization versus PLS2
As shown in §5, PLS2 1s more efficient still, We
can compare it, too, with RSF. The accuracy of
RSF is known to improve with VN, where N is
the number of data (here the number of of prob-
lems solved). As a first approximation, a parallel
method like PLS2 should also cause accuracy to
increase with the square root of the number of
data, although the data are now regions instead
of D values. If roughly the same number of
regions 15 present in each individual set R of a
population of size K, accuracy must therefore
improve as Vik. Since cach of these K structures
requires N problems in training, the accuracy of

PLS2 should increase as VN, like RSF

Obviously, though, PLS? involves much
more than blind parallelism: a genetic algorithm
extracts accurate knowledge and dismisses
incorrect (unfit) information, While it is impossi-
ble for PLS1 alone, PLS2 can refine merit by
localizing credit to individual regions [Re83c}
Planned experiments with this should show
further increases in efficiency since the additional
cost is small. Another inexpensive improvement

 

 
will attempt to reward good regions by decreas

ing their estimated errors. Even without these
refinements, PLS2 retains meritorious regions
(G4), and should exhibit accuracy improvement
better than VN Table I suggests this

PLS2 versus PLS2. As discussed in §41
and Appendix B, PLSI is limited, necessitating

uman tuning for optimum performance. In con-
trast, the second layer learning system PLS?
requires little human intervention. ‘The main
reason is that PLS2 stabilizes knowledge
automatically, by comparing region sets and
dismissing aberrant ones, Accurate cumulative
sets have a longer lifetime.

This ability to diserimmate merit and
retain successful data will likely be accentuated
ith the localization of credit to individual
regions (sce §1.2) Another improvement is to
aller dynamically the error of a region (estimated
PLS1) as a function of its merit (found by
PLS2). This will have the effect of protect
good region from imperfect PLSt utility revision;
‘once some parallel PLS! has succeeded in discov-
ering an accurate value, it will be more immune
to damage. A fil region will have a very long
lifespan.

 

 

 

 

 

     

Inherent differences in capability. RSF,
PLS1, and PLS2 can be characterized differently
From the standpoint of time costs: given a chal-
lenging requirement such as the location of a
local optimum within 10%, the ordering of these
methods in terms of efficiency 1s RSF S PLS =
PLS2. In terms of capabiltty, the same relation-
ship holds. RSF cannot handle feature interac-
tions without a more complex model (which
would increase its cost drastically) PLS, on the
other hand, ean provide some performance
improvement using piecewise linearity, with little
additional cost [Re83b} PLS? is more robust
than PLS1. While the original system 15 some-
what sensitive to training and parameters, PLS2
provides stability using competition to overcome
deficiencies, obviate tuning, and increase accu-
racy, all al once. PLS? buffers inadequacies
inherent in PLS1. Moreover, PLS2, being geneti-
cally based, may be able to handle highly
interacting features, and discover global optima
[Re83e]. This 1s very costly with RSF and seems
infeasible with PLS1 alone.

 

 

 

 

70

 

7. SUMMARY AND CONCLUSIONS

PLS2 is a gencral learning system [Re83a,
ReB8Qd]. Given a set of user-defined us and
Some measure of the utility (e.g. probability of
success im task performance), PLS? forms and
refines an appropriate knowledge structure, the
cumulative region get relating utility to
feature values, and permitting noise manage-
ment. This economical and flexible structure

mediates data objects and abstract heuristic
knowledge

 

Since individual regions of the cumulative
set Rare dependent of one another, both credit
localization and feature interaction are possible
simultaneously. Separating the task control
structure H from the main store of knowledge R
allows straightforward credit assignment to this
determinant R. of 1, while H. itself may incor-
porate feature nonlinearities without being
responsible for them,

A concise and adequate embodiment of
current heuristic knowledge, the cumulative
region set R was originally used in the learning
system PLSI [Re8%a}. PLS1 is the only system
shown to discover locally optimal evaluation
functions in an AI context. Clearly superior to
PLSI, its genetic successor PLS2 has been shown
to be more stable, more accurate, more efficient,
and more convenient. PLS? employs an unusual
genetic algorithm having the cumulative set Ras
acompressed genotype. PLS? extends PLS1's hii
ited operations of revision (controlled mutation)
and differentiation (genotype expansion), to
include generalization and other rules (I-sexual
mating and genotype reorganization) Credit
may be localized to individual gene sequences,

‘These improvements may be viewed as
effecting greater efficiency or as allowing greater
capability. Compared
of optimization, PLS! is more efficient [Re 85a},
but PLS2 does even better. Given a required
accuracy, PLS2 locates an optimum with lower
expected cost. In terms of capability, PLS? insu-
lates the system from herent inadequacies and
sensitivities of PLS! PLS? is much more stable
and can use the whole population of regions reli-
ably to create a highly informed heuristic (this
population performance is not meaningful in
standard genetic systems). This availability of
massive data has important implications for
feature interaction [Re83b}.

 

 

 

   

  

 

 
Additional refinements of PLS? may further
inerease efficiency and power These include
rewarding meritorious regions so they become
immune to damage Future experiments will
investigate nonlinear capability, ability to dis-
cover global optima, and efficiency and
effectiveness of localized credit assignment.

‘This paper has quantitatively affirmed some
principles believed to improve efficieney and
effectiveness of learning (eg credit localization).
‘The paper has also considered some simple but
Inttle explored ideas for realizing these capabili-
tues (e.g. full but controlled use of each datum).

REFERENCES

{An 73] Anderberg, MR., Cluster Analyers for Applica-
tone, Academie Press, 1973,

[Be80] Bethke, A.D., Genetic rithms ae function
optimizers, Ph.D Thesis, Univer of Michigan, 1980,

[Br81] Brindle, A., Genetic algorithms for function
optimization, C.S, Department Report TR8I-2 (PhD
Dissertation), University of Alberta, 1981.

[Bu78] Buchanan, B.G., Johnson, C.R., Mitchel,
T.M, and Smith, R.G , Models of learning systems, in
Belzer, J. (Fd }, Encyclopedia of Computer Science and
Technology 11 (1978), 24-51.

[Co8t] Coles, D. and Rendell, L.A., Some issues in
traning learning systems and an autonomous design,
Proc. Fifth Bienniel Conference of the Cenadion
Society for Computational Studice of Intelligence, 1984.

[De 80] DeJong, Adaptive system design: A
genetic approach, [EEE Tranractione on Systeme,
‘Man, and Cybernetice SMC: 10, (1980), 566-574.

[Di8!] Dieterich, T.G and Buchanan, BG , The role
of the critic in learning systems, Stanford University
Report STAN-CS-81-891, 1981

[Di82] Dietterich, TG, London, B., Clarkson, K , and
Dromey, G , Learning and inductive inference, STAN-
€S-82-913, Stanford University, also Chapter XIV of
The Handbook of Artificial Intelligence, Cohen, P.R.,
and Feigenbaum, B.A, (Ed.}, Kaufmann, 1982,

{I1075] Holland, J.H., Adaptation in Netural ond
Artyfieval Syetems, University of Michigan Press, 1975.

 

 

 

  

 

   

[Ho80} Holand, J.t1, Adaptive algorithms for discov-
ening and using general patterns in growing knowledge
bases, Intl. Journal on Poltey Anclyeis and Information
Syrteme 4, 2 (1980), 217-240

[Ho8t] Holland, J.1L, Genetic algorithms and adapta-
tion, Proc NATO Ade. Ree. Inst, Adaptive Control of

 

n

  

Medefined Systeme, 1981

{i083} Holland,
Second Inte
1983, 02-05.

{L383] Langley, P., Bradshaw, GL., and Simon, H.A.,
Rediscovering chemistry with the Bacon system,
Michalski, R.S., Carbonell, J.G, and Mitchell, TM.
(Ed), Machine Learning: An Artificial Intelligence
Approach, Tioga, 1983, 307-329.

[Ma8i] Mauldin, ML, Maintaining diversity in
ie search, Proc. Fourth Nattonal Conference on
Intelligence, 1984, 247-250,

[Me85] Medin, D.L. Wattenmaker, W1
Category cohesiveness, theories, and cognitive archeol-
ogy (a8 yet unpublished manuscript), Dept of Psychol
ogy, University of Illinois at Urbana Champaign, 1985

[Mic 83a] Michalski, R.S., A theory and methodology
of inductive learning, Artificiel Intelligence 20, 2
(10983), 111-161, reprinted in Michalski, R.S. et al (Ed },
Machine Learning: An Artificial Intelligence Approach,
Tioga, 1983, 83-134,

[Mic83b] Michalski, RS and Stepp, R.E., Learning
from observation: Conceptual clustering, in Michalski,
RSS ot al (Ed), Machine Learmng. An Artificial Intel-
ligence Approach, Tioga, 1983, 331-363

[Mit83] Muchell, TM, Learning and problem solving,
Proc. Eighth Internationel Joint Conference
Artificial Intelligence, 1983, 1130-1151.

[Ni80] Nilsson, NJ., Principles of Artificial Inteli-
gence, Tioga, 1980,

 

Proc.
hop,

J, Escaping brittleness,
ionel Machine Learning Wor

   

 

 

 

 

and

  

  

 

  

 

 

 

 

 

[Re76] Rendell, L.A., A method for automatic genera-
tion of heuristics for state-space problems, Dept of
Computer Science CS-76-10, University of Waterloo,
1976.

[Re77] Rendell, L.A., A locally optimal solution of the
fifteen puzzle produced by an automatic evaluation
function generator, Dept of Computer Science CS-77-
36, University of Waterloo, 197.

[Re81] Rendell, L.A., An adaptive plan for state-space
problems, Dept of Computer Scionce CS-81-13, (PhD
thesis), University of Waterloo, 1981.

{Re 83a] Rendell, L.A., A new basis for state-space
learning systems and a successful implementation,
Antificial Intelligence 20 (1983), 4, 369-302

[Re83b] Rendell, L.A. A learning system which
accommodates feature interactions, Proe. Eighth Inter-
national Joint Conference on Artifical Intelligence,
1983, 469-472,

  

 

 

  

  

 

 
 

[Re83c] Rendell, L.A A doubly layered, genetic
penetrance learning system, Proc Third National
Conference on Artificial Intelligence, 1983, 343-347.

[Re88d] Rendell, L.A., Conceptual knowledge acquis
tion in search, University of Guelph Report CIS-83-15,
Dept. of Computing and Information Science, Guelph,
Canada, 1983 (to appear in Bole, L. (ed), Knowledge
Bared Learning Systeme, Springer-Verlag).

[Re85a] Rendell, L.A., Utility patterns as criteria for
efficient generalization learning, Proe 1985 Conference
on Intelligent Syrteme and Machiner, (to appear), 1985

{Re85b] Rendell, L.A., A scientific approach to
applied induction, Proc. 1985 International Machine
Learning Workshop, Rutgers University (to appear),
1985,

[Re85e] Rendell,
tion using layered information compression: Tractable
feature formation in search, Proc. Ninth International
Joint Conference on Artificial Intelligence, (to appear),
1985.

{S263} Samuel, A.L,

ng using the game of checkers, m Feigenbaum, E.A.
and Feldman, 5. (Ed.), Computers and Thought,
MeGravellll, 1963, 71-105.

Substantial constructive indue-

 

  
   

» Some studies in machine learn-

 

{Sa67] Samuel, A.L., Some studies in machine learn-
ing using the game of checkers Il—recent progress,
IDM J. Ree, and Develop. 11 (1967) 601-617.

[Sm80] Smith, SF, A learning system based on
genetic adaptive algorithms, PhD Dissertation, Univer-
‘sity of Pittsburgh, 1980.

 

 
  
 

 

h, S.F., Flexible learning of problem solv-
ics through adaptive search, Proc. Eighth
jal Joint Conference on Artificral Intell-
gence, 1983, 422-425

APPENDIX A. GLOSSARY OF TERMS.

Clustering. Cluster analysis has long been
used as a tool for induction in statistics and pat-
tern recognition [An73}. (See “induction” )
Improvements to basic clustering techniques gen-
erally use more than just the features of a datum
(|An73,p 194] suggests “external criteria”)
External criteria in [Mi83, Re76, Re83a, Re85b]
involve prior specification of the forms clusters
may take (this has been called “conceptual clus-
tering” [Mi83}). Criteria in [Re76, Re83a,
Re85b] are based on the data environment (see

72

‘util

 

'y") below) 'S This paper uses clustering to
create economical, compressed genetic structures
(genotypes).

Feature. A feature is an attribute or pro-
Perty of an object. Features are usually quite
abstract (e.g. “center control” or “mobility”) in 2
board game. The utility (see below) varies
smoothly with a feature,

Genetic algorithm. In a GA, a the ch:
acter of an individual of a population is called a
phenotype. The phenotype is coded as a string of
digits called the genotype. A single digit is a
gene Instead of searching rule space directly
(compare “learning system"), a GA searches gene
a GA searches for good genes in the
n of genotypes). This search uses the
merit 2 of individual genotypes, selecting the
more successful individuals to undergo genetic
operations for the production of offspring. Sce
§2 and Fig.3.

Induction. Induction or generalization
learning is an important means for knowledge
acquisition. Information is actually created, as
data are compressed into classes or categories in
order to predict future events efficiently and
effectiely. Induction may create feature space
neighborhoods or clusters, See “clustering” and

a4

Learning System. Buchanan et al.
present a general model which distinguishes com-
ponents of a learning system [Bu78]. The perfor-
mance element PE is guided by a control struc-
ture IJ. Based on observation of the PE, the cri
tic assesses H, possibly localizing credit to parts
of H [Bu78, Di81}. The learning element LE uses
this information to improve H, for the next
round of task performance. Layered systems
have multiple PE’s, critics, and LE’s (e.g. PLS?
uses PLS! as its PE —see Fig.1). Just as a PE
searches for its goal in problem space, the LE
searches in rule space [Di82] for an optimal H to
control the PE

To facilitate this higher goal, PLS2 uses an
intermediate knowledge structure which divides
feature space into regions relating feature values
to object uftlty [Re 83d] and discovering a useful
subset of features (ef [$63]). In this paper, the
control structure H is a linear evaluation func-

 

  

 

 

 

 

 

 

 

 

 

 

15. A new learning system [Re 85c] introduces
higher-dimensional clustering for creation of structure.
tion [Ni80], and the “ules” are feature weights
for H. Search for accurate regions replaces direct
search of rule space; i.
and H. As explained in §3, sets of regions
become compressed GA “genotypes”. See also
“genetic algorithms”, “PLS”, “region”, and Fig.1.

Merit pt Also called payoff or fitness, this
4s the measure used by a genetic algorithm to
select parent genotypes for preferential reproduc-
tion of successful individuals. Compare “utility”,
also sce “genetic algorithms”.

Object. Objects are any data to be gen-
eralized into categories. Relationships usually
depend on task domain. See “utility”.

PLS. The probabilietic learning system
ean learn what are sometimes called “single con-
cepts” [D182], but PLS is capable of much more
difficult tasks, involving noise management,
incremental learning, and normahzation of biased
data, PLS1 uniquely discovered locally optimal
heuristics in search [Re83a], and PLS? is the
effective and efficient extension examined in tI
paper. PLS manipulates “regions” (sce below),
using various inductive operations described in

§4.

 

 

  

 

Region or Cell. Depending on one’s
siewpoint, the region ts PLS's basic structure for
clustering or for the genetic algorithm. The
region is a compressed representation of a utility
surface m ougmented feature space; it 1s also a
compressed genotype representing a utility func-
tion to be optimized. As explained in [e834],
the region representation 1s fully expressive, pro-
viding the features are Sce §3 and Figs. 3&4.

Utility uw This is any measure of the use-
fulness of an object in the performance of some
task. The uiility provides a link between the
task domain and PLS generalization algorithms.
Uulity can be a probability, as in Fig 2. Com-
pare merit, See §1,3.

 

 

 

 

 

APPENDIX B. PLS1 LIMITATIONS

PLSI alone is inherently limited. ‘The problems
relate to modification of the main knowledge
structure, the cumulative region set Ro =
{(r,u,e)}. As mentioned in §41, R undergoes
two basic alterations. PLSI gradually changes
the meaning of an established feature space rec-
tangle r by updating its associated utihty u
(along with u's error e). PLS1 also incrementally

73

refines the feature space, as rectangles r ate con-
tinually split.

Both of these modifications (utility revision
and region refinement) are largely directed by
search data, but the degree to which newer infor-
mation affects R depends on various choices of
system parameters (Re83a]. System parameters
influence estimates of the error e, and determine
the degree of region refinement. ‘These, in turn,
affect the relative importance of new versus esta-
blished knowledge.

Consequently, values of these parameters
influence task performance For example, there
1s a tradeoff between utility revision and region
refinement. If regions are refined too quickly,
accuracy suffers (this is theoretically predictable).
If, instead, utility revision predominates, regions
become inert (their estimated errors decline), but
sometimes incorrectly.

 

 

 

 

‘There are several other problems, including
ifficulties in training, limitation in the utility
revision algorithm, and inaccurate estimation of
various errors. As a result, utility estimations
are imperfect, and biased in unknown ways.

 

Together, the above uncertainties and sen-
sitivi explain the failure of PLS! always to
focate an optimum with static training (Table 1).
The net effect is that PLSI works fairly well with
no parameter tuning and unsophisticated train-
ing, and close to optimal with mild tuning and
informed training [Co8t}, as long as the features
are well behaved.

 

By nature, however, PLSI requires features
exhibiting no worse than mild interactions, This
15 a serious restriction, since feature nonlinearity
1s prevalent. On sts own, then, PLSI is inherently
Iumted. There is simply no wey to learn utility
accurately unless the effects of differing heuristic
functions §1 are compared, as in PLS2.

 

ACKNOWLEDGEMENTS,

I would hke to thank Dave Coles for his lasting
enthusiasm during the development, implementa-
tion and (esting of this system I appreciate the
helpful suggestions from Chris Matheus, Mike
Mauldin, and the Conference Reviewers.

 

 

16. Although system parameters are given by
domain-independent statistical analysis, tuning these
parameters nevertheless improves performance in some
cases, (This is not required in PLS?)

 

 
 

Learning Multiclass Pattern Discrimination

J. David Schaffer
Department of Electrical Engineering
Vanderbilt University.
Nashville, TN 37235

ABSTRACT

Genetic algorithas (GA's) are poverful,
general purpose adaptive search techniques which
have been use successfully in a variety of learning
systens. Previous implementations have tended to
use scalar feedback concerning the performance of
alternate knowledge structures on the task to be
learned. This approach vas found to be inadequate
vhen the task vas multiclass pattern
discrimination. By providing the GA with
multidimensional feedback, a problem of this type
vas successfully learned. In addition, a careful
balance of revard and punishnent vas found to be
necessary in order to guide the opportunistic GA to
@ correct solution of the problen.

 

 

Introduction

‘This paper presents cone results of
experinents vith a software system designed to
learn rules for multiclass pattern discrimination
fron examples of correctly classified patterns.

The original motivation for this research
arose fron attenpts to develop computer progracs
capable of intelligent signal analysis. One such
application donain is computer analysis of
bivelectric signals such as EMG's and EEG's.
Previous attempts to model the actions of an
electroencephalographer using variations of
traditional electrical engineering approaches had
met with some success, but complete agreanent vith
the human expert eluded us [2,4 ]. Attempts to
elicit the knovledge from the expert for use by an
expert system had similarly met with limited
success (5]. Nevertheless, it vas clear that the
expert vas able to reliably preform this complex
pattern discrimination, even if he vas unable to
completely articulate how he did it. Therefore, an
algorithn capable of inferring rules for
discrimination from examples of correctly
classified patterns, seened to hold promise.

A search of ‘the literature for nethods by
which machines. could learn rules fron example:
revealed a small nusber of currently active
approaches [11,12]. Of these, the methods based on
Holland's Genetic Adaptive Plans [9], of Genetic
Algorithms (GA's) seemed to hold the most promise
for the folloviag reasons. They have been shown
both theoretically 9] and empirically [6,7,8] to
take near optimal advantage of the information
gained during attempts to solve a problen. In
addition, the preferred coding for the example

  

 

 

 

 

 

 

 

74

 

patterns is as bit strings. This offer:
possibility that one may avoid the usual f
extraction processes, vhich, although capable of
considerable data reduction, also carry with then
the risk that the reduced feature set aay no longer
contain the information contained in the original
Signal which made the discrimination possible. A GA
might be developed that operates on the raw
digitized signals.

For the ‘remainder of this paper, an
understanding of the basics of GS's has been
assuned. They are vell described elsevhere
(3,6,7,8,9,10,131,

     

 

 

 

2. Background

‘There are tvo learning systeas based on
GA's in the literature uhich might be considered
imediately ancestral to the research described
herein. The Cognitive System One (CS-1) of Holland

Rietman [10] vas the first published account of
@ system vhich combined the computational pover of
a production systen (P5)with a GA-based learning
component. This system exhibited an ability to
learn a dual-revard linear maze. Tho Learning
Systes One (LS-1) of Smith [13] took this concept
further and demonstrated learning behavior in two
different problex domains, the maze problea and
draw poker playing.

LS-1 appeared to have an dnportant.
advantage over CS-1, CS-1 maintained a population
of knoviedge structures which vere individual PS
rules, which gave rise to the credit assignoent
problem. A heuristic nethod had to be devised to
Aistribute credit for revarded behavior anong the
rules which cooperated in producing that behavior.
LS-1 avoided this problea by using complete rule
sets, or PS programs, as the individuals in its
populations. A difficulty with this approach
involved the use of scalar evaluations for the
individuals, When the task to be learned is
sultidimensional, then scalarization of the
feedback to tha GA creates difficulties vhich will
be described below.

 

 
 

   

 

 

3. A Vector Extension of LS-1 (18-2)

Several attenpte to learn a multiclass
pattern “diserinination, problen with [S-1 vee
Shourcesstal Enaniation “of Om population "ot
ropcase in’ both the early and Late stages cf che
Searches revealed a coamon pattern. Knovledge ‘of
hov te recognise particular class vas frequent{y
Absent from the populations in the latter stages of

 
the search even vhen such knowledge was prasant in
jarlier populations. The hypothesis for this vas
that the scalar feedback vas forcing competition
between prograns whose knovledge vas conplenentary.

Consider this sizple exaaple. Suppose the
feedback to GA consists essentially of the
nunber of training cases correctly classified.
Suppose progran Pl contains rules which correctly
classify classes A and B while program P2
classifies only instances in class C. If all
classes are equally represented in the training
set, then Pl vill appear to be twice as "fit"
P2."In a survival-of-the-fittest selectio
specialized knoviedge such as that posse:
may die out.

The solution to this problem vas to sodify
the critic so that a vector of performances could
be reported back to the GA, one for each facet of
the probles (class of pattern to be learned, in
this case). The selection process of the GA
also wodified in such a vay that an independent
survival-of-the-fittest selection is perforned on
each dizension of the perforsance vector, each tine
selecting only the appropriate fraction of the
population. This selection is performed vith

eplacenent so that individuals with better that
average perforcance on more than one dimension have
the appropriate probability of sultiple selection
vhile simultaneously protecting individuals with
specialized knovledge from unfatr competition.
This represents a sicple generalization of the
traditional selection process vhich reverts to the
traditional process when the nunber of dimensions
of performance {s one.

 

   

    

 

 

 

   
 

   

 

   

 

 

4, The LS-2 Production Systes

4.1 Knowledge Representation

The binary coding scheme for the IF-THEN
rules was devised by Smith and called Knovle
Structure One (KS-1). On the IF side of each rule
nuaber of clauses vhich are sensitive to extern:
Agnals are alloved as well asa nunber of clauses
jensitive to internal signals i.e. signals froa
other rules. These clauses are sinply strings on
the alphabet (0,1,!) vhere 0 and 1 require an exact
match and # will catch either 0 or 1. For exanpl
the clause 10M would patch any 3-bit signal vhose
first bit is a1. On the THEN side, each rule has a
signal which {t deposits in short term memory if
the rule fires. From there, it becomes accessible
to other rules. Also on the THEN side 4s an action
which is perforeed if the rule fires and also
survives the conflict resolution at the end of the
production systes cycle. This general schene as
Well as an example of the binary coding of these
rules is shown in figure 1.

‘An individual for the purposes of the GA is
a set of rules in KS-1 format concatenated
together. For pattern discrizination problens, each
rule would have one detector nal signal)
clause and one message (internal signal) clausi
The length of the detector clause is sot to the
length of the patterns to be learned and the length
of the message clause would be set to the length of
the messages. The nessage length need only be a
function of the maxisun number of rules an
individual may contain, the only requirement being
that there should be enough bits go that every rule

 

 

 

 

 

 

 

 

   

 

 

   

 

 

 

 

Roe 1 (ROE fla, atcha (0

 

 

SEeSpeleeactor 2 (amore fing, sot fies, atelog (

 

Sees (ses wan
‘eeiee? (enone

  

 

4 atactar of tenet 2, (nat o2t)
LSet

     

mt ori 7 er

Pees

Figure 1. KS-1 Knowledge Representation Schene

 

ight have a unique signal. The actions vould
include one for each class of pattern to be
discrininated vith one for no-operation (nop). The
oop action ts a sinple device which allovs for the
evolution of sets of rules vhich perfor a chain
calculation leading to a final classification
decksion.

Tt say be noticed that che binary coding
scheno illustrated in figure 1 adaits a certain
redundancy. By having two binary codings for the
don't care symbol (W= 01 or 10), there are many
binary codings possible for any given rule, vith
the problen groving vorse as the maber of #'s
ince ‘This observation rise to the
conjecture that the knoviedge structures might be
coded directly in their natural ternary (0,1,H)
alphabet. To counter this conjecture, ts the known
auperiority of binary coding for the gathering of
information fron the hyperplanes of the search
space (see Holland (10,p71] or Saith (13,p561).
However, it vas unclear how these tvo argunents,
ight be coppared and the better coding selected.
‘Therefore, several tests ware conducted vherein the
saze probleas vere solved by L5-2 using first
binary coding and then ternary codiny

 

  

 

   

 

 

 

 

 

4.2 Recognize-act Cycle
© production system contained the usual
Fecognize-act cycle vith parallel firing of rules
loved in the folloving fashion. On each cycle,
every
clauses. Every rule vhich does catch,
the sense that its sessage is posted in short tere
senory, but its action 1s erely tallied in an
array ‘called the tedaction array. Only at
the end of the cycle is an action selected and
actually performed. If the number of "real" (non
oop) actions suggested is zero, then ancther cycle
is initiated. If one action is suggested, then it
is performed. If nore than one action is suggt
then a stochastic conflict resolution sch
invoked which randooly selects one of the suggested
actions vith the probability of selection being
proportional to the number of rules suggesting it.

 

 

 

    

 

  

 

 

  

 

 

 
 

4 3 The Halting Problen

The question of halting sucha
computational schene is quite a real one. Since
this systen will be executing prograss produced by
genetic search, one aust’ worry about the
Possibility of infinite loops. On the other hand,
an arbitrary stopping threshold in terns of the
nunber of cycles to allow mist be carefully chosen
$0 as not to render a problem unsolvable if it

 

requires nore lengthy couputation that the
threshold permits. The stopping _ procedure
implenented for 15-2, vhich differs from LS-1,

 

exanines four criteria in the folloving order: (1)
Stop if no rules fired this cycle. (2) Stop if the
number of consecutive noop cycles equals the nuaber
of rules. This allows for a vorst case chain
calculation which utilizes every rule in the
prograz. (3) Stop if the total nuaber of cycles is
N tines the nusber of rules. The setting of N is
the arbitrary threshold just centioned and {s
currently set to 3, but at least the actual
stopping threshold {s also a function of the nuaber
of rules. (4) Stop if the task has been learned.

 

 

   

5S. The Cette.

 

Ades vector performance and the stopping
criteria, the other major area of differen:
between LS-1 and 18-2 was that of the critic.
Smith anticipated a problen vith a critic vhich
only revarded task successes. In the early stages
of a genetic search, especially one started with a
random population, successes are likely to be rai
and thus such a critic vould be unable to give tl
GA any infornation about vhich population menbers
were more promising than others. In the absence of
Such information the GA vould revert to near randon
search. In addition, in the later stages of the
search when successes were plentiful, sone way of
identifying the better (e.g. more parsinonious)
individuals vould lead ore efficiently to good
solutions. Smith sought measures to provide this
kind of | information vhich vere also task
Andependent. He devised tuo classes, static
weasures vhich could be computed just by examining
the rule set, and dynanic seasures vhich could be
computed only by sonitoring the action of the rule
set on the task. The static seasures included such
itess as the asount of interrule communication
neasured by hov any rule sessages would match the
message clause of other rules, the ality of
the clauses measured by the nusber of  syabols and
0 on. The dynamic measures included the anount of
random behavior seasured by the activity of the
conflict resolution procedure and the percentage of
inactive rules. While these seasures have an
intuitive appeal, it is not clear chat they are

   

 

 

   

 

 

   

 

 

 

 

 

always associated vith superior performance
regardless of the task. In addition, some
prelininary experiments ‘with LS-2 using these

measures as “an independent dimension of the
performance vector revealed a poor correlation with
task success. These measures, then, were dropped
from LS-2 in favor of the critic described below.
This critic did incorporate the anount of random,
activity which vas called guessing behavior.

The properties sought for a critic for
pattern discrimination learning were similar to
those identified by Saith. In the early stages of

 

 

 

76

te act ts rfght

rovers

Af act Us rong

 

7

°

guessing factor

Figure 2. The Revard Function for the First Critic

the search, guessing should be encouraged so as to
locate as quickly as possible, sone knovledge which
seens to work. Hovever, the save critic must
discourage guessing in the later si

reliable classification becones the
critic revard function designed vith
properties is illustrated in figure 2. The guessing
factor is a measure of the uncertainty with which
the production system reaches a conclusion as
measured by the nunber of rules sugi
other conclusion. This ravard function vould
some credit for a vrong conclusion if the right
action vere at least suggested and would also give
less than naximua credit for the right action if it
vere only guessed. While this critic seened
sonable, it had a. subtle weakness vhich was,
led by oxperimentation. Better critics were
later designed, but they will be discussed vith the
results vhich lead to thes.

   

 

 

       

    

 

 

    
 

6. The Test Problens
6.1 The Parity Problen

fava disect teat that che GA {a capable of
sroducing” sore poverful prograns than perceptrons,
the parity probles vas” included. This probles
involves. the’ discrinination of tvo classes vhich
ave inexteteably aixed in the feature space as
Shown in figure 3. Although a discrisinator
Fequicing Linear separability in the feature space
Would. be unable. £0 solve this probles, a systen
ble co compute the parity (a derived feature) of
the given features’ vould be able to solve this
problen easily.

 

 

 

    

   

 

6.2 A Hulticlass Pattern Probl

‘As atest of aGAon a multiclass signal
discrimination learning task) &\saail problem vas
Selected fron the Literature vhich ‘used EMG
signals. Bekey at al [1] measured the neural firing
Patterns enervating six suscles in the lover leg
and then” reduced the signal from each muscle to a
two-bit string indicating that the suscle was
either on (1) or off (0) during tyo phases of the
it eyele. Thus, each subject tested yielded a

 

 

 

 

 
xl

 

x0
Figure 3. The Parity Probles

12-bit string which was classified by the clinician
as belonging to one of five classes of gait as
shown in figure 4.

The Il patterns shown in figure 4 vere
considered by Bekey and his colle:
typical of their classes and were
training set for a statistical di
experiment. This training
to recommend it as a test bed for a GA learning
experiment. The strings vere short and binary
coded. Also the training set is small implying a
saall computational burden. Although the problea
is underconstrained (there are sany possible
solutions), it 1s rich enough to frustrate a simple
Linear discriminator like @ perceptron.

 

   

 

  

 

2. Results
7.1 The Parity Problea
After trying 5040 different PS prograns
(evaluations )on the parity problem, LS-2 found the
solution shovn in figure 5. The training set for
this problem consisted of all 16 patterns shown in
figure 3 with x0 and x1 coded at tuo bits each. Thi
“ignore” in figure 5 indicates that the ness:
clauses in al} of the rules vere turned off. Thus,
the solution did not require any interrule
communication. Instead tvo rules vere evolved for
each combination of lov order bits in the feature
tings. The 2-bit strings to the right of th
bol in figure 5 are the signals placed in short
term wezory which are ignored by the other rules.
The rightaost nusber in each rule is the action
(classification asserted for the pattern if the
rule fires).

 

 

    

    

7.2 The ENG Probles

The first trials with the EMG probles
clearly indicated the superiority of the KS-1
binary coding over the ternary coding. Comparison
runs to discrisinate different pairs of classes and
one run to discriminate a 3-class subset of the
Bekey training set all learned the discrimination
faster vith binary coding than vith ternary coding.
Besides this series, all other experinents reported
in this paper utilized binary coding.

A solution to a 2-class subproblea of the
Bekey set vas located in 2520 evaluations. Hovever,
a J-class Subproblea required almost 30000
evaluations and the full 5-class problen remained
unsolved after 100000 evaluations This rapid

 

    

 

   

 

7

  

Ancrease in computational
attention to the critic as a possible cause. By
exanining sone of the individuals in different
populations, the survival strategy that LS-2 vas
iscovered. It vas observed that very

even in the initial

‘appeared who received
aximun credit for one class of patterns and zero
for all other classes. These vere nazed specialists
since they seened to have knowledge of how to
identify one pattern class without guessing, but no

effort soo

  
 

populations,

 

 

 

  
  

   

others. However, close examination revealed their
a single
over-general ruie which fired for every pattern,

 

calling class x. If the pattern was class x, the
critic assigned maximum credit, if not then’ the
critic assigned zero. Hovever, under the vector
survival-of-the-fittest, eaximun fitness on one
dimension was sufficient to assure survival. The:

ecialists also tended to persist since their
genes combined poorly, not producing offspring able
to perfora vell on more than one dimension. This
strategy of wild guessing vas encouraged by the
eritic; sone form of punishment for wrong actions

clearly needed.

‘The original critic vas then replaced by
fone vhich simply assigned tvo points for each
correct suggestion and minus one for each vrong
suggestion. More credit for right than vrong
appropriate in light of the need to pi
individuals with some good knovledge, even if
tuperfect. This critic also embodied a shift of
focus from the overt actions of the individual to
the suggested actions, thus rewarding the
“chinking” sore directly.

Under the nev critic, 1S-2 solved the
2-class problen in only 1440 evaluations, a 43%
reduction in search effort. However, this advanta;
did not hold for higher order problens. Again, an
examination of the individuals persisting in ‘the
population revealed the new survival strategy. With

 

 

 

  

 

 

 

 

 

 

 

 

    

4 2-class task, the expectation for wild guessing
under this critic, 4s one half the maxiun credit
40 long as each class in equally represented in the

   

training set. Hovever, as the nuaber of classes
+ this expectation decreases and even
becones Aogative for fairly modest tasks. When

 

 

class 1 (normal)
(0,2,2/2,1/1,1,0,0,0/0/0)
clas 2 (equinus)

 

(9,2, 2/242,1,910,9,0,0,0)
2/2/0,1,1,1,0,0,0,0,0)
(o11/1/3/0,2/2,9:070,0,0)
1)1)1,0,2,0,0,0/1,0,0)
3 (rlat footea}
(0,0,0,0,2,2,2,0,0,0,0,0)
(0,1,0,0,0,2,2,1,0,0,0,2)
class 4 (vai
(2,0,2,2,0,0,2,0,0,2,0,0)
1,1;1,0/1,2,0,0,0,0,0,0)
5 (valgus)
(0,0,2,0,1,2,2,0,0,0,1,2)
(0,2,2,0,0,1,2,0,0,0,0,)

   

 

el

 

   

el

 

Figure 4, The Bekey Training Cases

 
 

IF ~> THEN

    

#212 dgnore -> 02 2
fot ignore -> 01 2

ignore => 10 1.

rule ffo ignore => 10 2

Figure 5. The Solution to the Parity Probles

this occurs, LS-2 quickly begins evolving
individuals with no rules that fire at all. Doing
nothing at least scores zero vhich is better than
being punished. A balance of revard and punisheent
which will be ‘maintained as tasks increase in
complexity is needed so as to avoid the GA's
ability to quickly exploit this weakness in the
eritte function.

The next critic employed a computational
schene based on that used on the Scholastic
Aptitude Test and so vas called SAT scoring. The
main idea in the scoring of aultiple choice tests
is that indiscrisinant guessing should have an
expectation of zero, but that if a student can
eliminate sone of the choices ona question, then
he should be encouraged to guess by having the
expected score increase as the range of guessing
jecreases. For the SAT, this is achieved by
subtracting from the nuaber of correct ansvers, the
number of wrong anavers veighted by the inverse of
the number of choices minus one. This gives an
expectation vhich varies fron zero for vild
guessing to the eaximin score for no guessing. For

Lightly different expectation vas thought
je. Wild guessing was deened better than
doing nothing because this at least vould give the
GA sone active rules to deal with. So the designed
expectation vas that wild guessing (e.g. calling
every case the sane class) should score haif of the
naxizus.

‘At this point in the experimentation, an
effort vas also initiated to learn about’ the
sensitivity of LS-2 to changes in four of its main
parameters, population size, crossover, mutation
ind inversion rates. All exparinents reported so
far used a population size of 30 per dinension of
the performance vector, a crossover rate of .95, @
sutation rate of .01 and an inversion rate of .25.
These first three values vere suggested by
Greffenstette (8) and the inversion rate by Saith
{13}. Lizited resources prevented the best approach
which vould have been the seta-GA approach of
Greffenstetto, so different settings vere produced
by increasing the population size in steps of 10
per divension and simultaneously reducing the rates
Bore or less in unison. This process was continued
until the mean evaluations-to-solution stopped
improving. Means vere computed for three runs at
each setting with different random seed:

‘The SAT critic has the sane expectation
the previous critic for 2-class problens vith
balanced training so this task was not
repeated. A 3-c solved {n 6921
evaluations, a 77% 4 the original
critic. A ‘G-class subproblen vas solved in 26591
evaluations. Both of these results represented a

 

 

 

     

 

 

   

  

 

   

 

 

 

 

 

 

 

 

 

 

 

 

 

 
      

 

78

 

best parameter setting of 40, .90, .005 and .20 for
Population size per dimension, crossover, sutation
and inversion rates respectively.

One final improvenant vas made in LS-2,
this tine to the conflict resolution. In LS-1 Smith
had not persitted conflict resolution to consider
the nop action so long as a "real” action were
suggested. In all the 15-2 experisents so far, noop
competed equally vith tha "real" actions. The
argument for this was that for some task
environnents, doing nothing, or continuing te think
(cycle) was a decision and that {f the environnent
vere dynamic, then this sight vell affect
Performance. Hovever, sone counter argusents can
also be sade. The’ pattern discrimination tasks
considered so far are “not dymanie; the patterns
don't change while LS-2 1s trying to decide. Also,
this strategy allovs for sone stochastic effect to
remain in the eritic-reported values. By deciding
to cycle again vhen a "real" action had been
suggested, LS-2 postponed the coputation of the
credit in'a non-detersinistic vay. The critic was
only permitted to evaluate the suggested action
artay on the final cycle. I vould now argue that if
the task environsent 13 dynanic, and a do nothing
faction should be considered, then it
explicitly included as one of the "r
Noop should not be considered a do nothing action.

With this final isprovenent, LS-2 solved
the 3-, 4- and full S-class problems in 5647, 15938
and 44509 evaluations respectively. The ‘taning
effect of these  improvesents in 15-2 are
illustrated in figure 6.

 

 

 

 

 

 

  
 

 

    
  

 

“10°

BSN) “ESo0

mip inean-evel-Lo-solut ion}

 

‘Soo epee S00) wae
" winder of et

 

Figure 6. Inprovenents in LS-2 with Changing Critic
 

 

uss
‘The major finding of this research vas that
vector feedback {s essential to multicl:

    

discriminant learning. Vector selection provides
the necessary protection against unfair competition
while sioultaneously providing the proper pressure
for the evolution of the utopian individual capable
of high perforcance on all facets of the task.

Secondary to this major finding are a
number of observations vhich may contribute to
better understanding of GA's and how to effectively
utilize thes.

The solution of the parity problea clearly
demonstrates LS-2's ability to learn non-linear
discrimination.

Ternary coding of KS-1 was inferior to
binary coding, even vith the redundancy inherent in
the binary coding schese. A search for coding
Schenes which are binary and yet avoid this
redundancy might pay handsone dividends.

Grofenstette's finding (8] that genetic
rch may be very efficient with snaller
populations and higher aixing rates than previous
Wisdom suggested, seeas generally to have been
confirmed. Populations of 40 per dimension of
performance vith crossover rates of .7 to .9,
utation rates of .001 to .01 and inversion rai
of .1 to .2 provided the best perfornance on the
problems studied here. It should be noted, hovever,
that the search vas limited and began with
Grefenstette’s solution.

‘As Saith observed, the critic ts critical.
The GA is cay if exploiting the properties of
its critte and so good performance vas only
achieved when reward and punishoent vere carefully
balanced. The application of punishzent to a
performance vector has raised a question which did
ot oceur vith scalar performance systems. There
are tvo places vhere this punishment may be
applied. Suppose that a PS progran incorrectly
classifies a class 1 case as class 2. By
the punishsent to the class 1 slot
performance vector one ts punishing the failure to
do the right thing. By applying it to the class 2
slot, one is punishing the program for doing the
wrong thing. It is unknown which strategy, or both
leads to faster learning. The experiments reported
here applied the punishzent to the slot
corresponding to the case to be classified, thus
alvays punishing the failure to do the right thing.
Other approaches might be profitably studied,

‘The task independent measures proposed by
Saith did not seem to be sufficiently closely
associated with good performance to varrent their
use. Hovever, his strategy of disallowing noop
actions to Compete in conflict resolution was
superior to allowing it

‘A final observation is in order on the
original question of using a GA for intelligent
signal classification. The strategy used in LS-2
seems to be promising, but requires that a prior
decision be nade on the length and sampling rate
for the signal, The patterns must be "frozen" so
that the system can examine then. This feature
sens to impose undesirable lisitations. A gore
dynamic method of exanining the signal, bit by bit,
and only reporting a decision ‘when enoug!
information has been acquired to do so vith

 

  

 

 

 

 

 

 

     

    

 

 

 

 

    

 

 

 

 

 

confidence, seens to offer a sore robust approach.
REFERENCES

1. A.B. Bekey, C. Chang, J. Perry, and H.M.

Hoffer, "Pattern recognition of cultiple EG

ignals applied to the description of hunan

it," Proceedings of the IEEE,Vol. 65 No. 5,
May 1977.

 

 

2. J.R. Bourne, V. Jagannathan, B. Hanel, B.H.
Jansen, J.W. Ward, J.R. Hughes and C.W. Ervin
“Evaluation of a syntactic pattern recognition

proach to quantitative  electroencephalo-
graphic analysis,” Electroencephapography &

Clinical Neurophysiology, 52:57-64, 1981.

   

   

3. A. Brindle, Genetic Algorithas for function
optimization, Ph.D. Dissertation, University

 

   

 

 

 

 

of Alberta, Edmonton, Alberta, Canada, 1975,

4. (B.A. Giese, J.R. Bourne and J, Ward,
"syntactic analysis of the
electroencephalograa," IEEE Trans. Systens,
Man and Cybernetics, Vol. SHC-9 No. 8, Aug
1979.

5. V, Jagannathan, An artificial intelligence

 

approach to computerized electroencephalogras
analysis, Ph.D. Dissertation, Vanderbilt.
University, Nashville, Tennessee 1981.

 

 

6. Kenneth DeJong, Analysis of the behavior of a
class of genetic adaptive systens, Ph.D.
Dissertation, University of Michigen, Ana
Arbor, 1975.

 

 

7. Kenneth DeJong, “Adaptive systen design: a
genetic approach," IEEE Trans. Systeas, Man
and Cybernetics, Vol. SHC-10 No. 9 Sept 1980.

 

8. John J. Greffenstette, "Genetic algorithas for
ultilevel adaptive "systens," IEEZ Trans. on
Systess, Man and Cybernetics, in press.

9. John H. Holland, Adaptation in natural and
artifictal systeas, University of Michigan
Press, Ann Arbor, Michigan 1975.

10, J-H. Holland and J.S. Reitman, "Cognitive
systems based on adaptive algoriths,” in
Pattern-Directed Inference Systess, Waterman
and Hayes-Roth (Eds.), Acadesic Press, 1978.

  

Ll. (RS. Michalski, J.G. Carbonell and 1.
Mitchell, (Eds. ), Machine Learning, Tioga
Publishing Co.,Palo Alto, California 1983

 

12. RS. Michalski, J.G. Carbonell and T.M.
Mitchell, (Eds.), Proceedings of the
International Machine Learning Workshop,
University of Illinois, Urbana-Champaign,
Tilinots, 1983.

13. S.F. Saith, A learning system based on genetic
adaptive ‘algorithas, Ph.D. Dissertation,
University of Pittsburg, 1980.

 
 

IMPROVING THE PERFORMANCE OF GENETIC ALGORITHMS
IN CLASSIFIER SYSTEMS

Lashon B. Booker

Navy Center for Applied Research in AI
Naval Research Laboratory, Code 7510
Washington, D.C. 20375

ABSTRACT

Classifier systems must continuously infer useful categories and other generalizations — in the form
of classifier taxa — from the steady stream of messages received and transmitted This paper describes

ways to use the genetic algorithm more effectively in discov

 

1 such patterns. Two issues are

addressed. First, a flexible criterion is advocated for deciding when a message matches a classifier taxon.
This is shown to improve performance over a wide range of categorization problems. Second, a restricted
mating policy and crowding algorithm are introduced. These modifications lead to the growth and
dynamie management of subpopulations correlated with the various pattern categories in the environ-

ment.

INTRODUCTION

A classifier aystem is a special kind of
production system designed to permit non-
trivial modifications and reorganizations of
its rules as it performs a task (Holland,1976).
Classifier systems process binary messages.
Each rule or classifier is a fixed length string
whose activating condition, called a tazon, is
a string in the alphabet {0,1,#}. The
differences between classifier systems and
more conventional production systems are
discussed by Booker [1982] and Holland
{1983}.

One of the most important qualities of
classifiers systems as a computational para-
digm is their flexibility under changing
environmental conditions _[Holland,1983).
This is the major reason why these systems
are being applied to dynamic, real-world
problems like the control of combat systems
|Kuchinski,1985] and gas pipelines [Gold-
berg,1983]. Conventional rule-based systems
are brittle in the sense that they function

 

 

80

poorly, if at all, when the domain or under-
lying model changes slightly. Several factors
work together to enable classifier systems to
avoid this kind of brittleness: parallelism,
categorization, active competition of alterna-
tive hypotheses, system elements con-
structed from “building blocks” , ete

 

 

Perhaps the most important factor is
the direct and computationally efficient
implementation of categorization. Holland
[1983, p.92] points out that

Categorization is the system’s sine qua

non for combating the environment's

perceptual novelty.

 

Classifier systems must continuously infer
useful categories and other generalizations
— in the form of taxa — from the steady
stream of messages received and transmit
ted. This approach to pattern-directed
inference poses several difficulties. For
example, the number of categories needed to
function in a task environment is usually not
known in advance. The system must there-
fore dynamically manage its limited classifier
memory so that, as a whole, it accounts for
all the important pattern classes. Moreover,
sinee the categories created depend on
which messages are compared, the system
must also determine which messages should
be clustered into a category.

The fundamental inference procedure
for addressing these issues is the genetic
algorithm [Holland,1975]. While genetic
algorithms have been analyzed and empiri-
cally tested for years
[DeJong,1975;Bethke,1981], most of the
knowledge about how to implement them
.has come from applications in function
optimization There has been little work
done to determine the best implementation
for the problems faced by a classifier system.
This paper begins to formulate such an
understanding with respect to categoriza-
tion. In particular, two questions related to
genetic algorithins and classifiers systems are
examined:

(1) What kinds of performance measures
provide the most informative ranking
of classifier taxa, allowing the genetic

“algorithm to efficiently discover useful
patterns?

How can a population of classifier taxa
be dynamically partitioned into distin-
guishable, specialized subpopulations
correlated with the set of categories in
the message environment?

Finding answers to these and related ques-
tions is an important step toward improving
the categorization abilities of classifier sys-
nd, expanding the repertoire of prob-
lems these systems can be used to solve

   

THE CATEGORIZATION PROBLEM

In order to formulate these issues more
precisely, we begin by specifying a class of
categorization problems. Subsequently, 2
criterion is given for evaluating various solu-
tions to one of these problems.

81

Defining Message Categories

Hayes-Roth 1973] defines. a
“schematic” approach to characterizing pat-
tern categories that has proven useful in
building test-bed environments for classifier
systems [Booker,1982|. This approach
assumes, in the simplest case, that each pat-
tern category can be defined by a single
structural prototype or characteristic Bach
such characteristic is a schema designating a
set of features values required for category
membership. Unspecified values are
assumed to be irrelevant for determining
membership.

 

‘The obvious generalization of using just
one characteristic to define a category is to
permit several characteristics to define a
category disjunctively. Pattern generators
based on the schematic approach generate
exemplars by assigning the mandatory coin-
binations given by one or more of the pat-
tern characteristies and producing irrelevant
feature values probabilistically In this way,
each exemplar of a category manifests at
least one of the defining characteristics
The categorization problem can be very
difficult under the schematic approach since
any given item can instantiate the charac-
teristics of several alternative categories

Classifiers receive, process, and
transmit binary message strings We define
a category of binary strings by specifying a
set of pattern characteristics Each charac-
teristic is a string in the alphabet {1,0,*}
where the * is a place holder for irrelevant,
features. A characteristic is a template for
generating binary strings in the sense that
the 1 and O indicate mandatory values and
the * indicates values to be generated at
random. ‘Thus the characteristic 1*0* gen-
erates the four strings 1000, 1001, 1100, and
1101. When more than one characteristic is
associated with a category, one is selected ab
random to generate an exemplar ‘The
correspondence between the syntax of a
taxon and the designation of pattern charac-
teristics is obvious. The class of pattern

 
categories defined in this manner therefore
spans the full range of categorization prob-
lems solvable with a set of taxa.

An Evaluation Criterion

‘A set of taxa is a solution to a categori-
zation problem if it corresponds directly
with the set of characteristics defining the
category. In this sense, the set of taxa
models the structure of the category. One
way to evaluate how closely a set of taxa
models a set of characteristics is to define
what an “ideal” model would look like, then,
measure the discrepancy between the model
given by the set of taxa and that ideal.

 

More specifically, the structure of a
pattern category is given by its set of
characteristics. We first consider the case
involving only one characteristic. As the
genetic algorithm searches the space of taxa,
the collection of alleles and schemata in the
population become increasingly less diverse.
Eventually, the best schema and its associ-
ated alleles will dominate the population in
the sense that alternatives will be present
only in proportions roughly determined by
the mutation rate. A population with this
property will be called a perfect model of
the category. The taxon which corresponds
exactly with the characteristic will be called
the perfect taxon.

One way to describe the perfect model
quantitatively is in terms of the probability
of occurrence for the perfect taxon. An
exact value for this probability is difficult to
compute, but for our purposes it can be
approximated by the “steady state” proba-

bility! P(é)= [TP(é,) , where P(é,) is the

3
proportion of the allele occurring at the jth
position of the perfect taxon €. In the ideal
case, if u is the mutation rate, what we want
is P(€,) = 1-x for the alleles of € In order
to measure the discrepancy between an

} ‘The probability of occurrence
over wit fandom pairing,
operators

 

F repeated cross
the absence of other

  

82

 

arbitrary population and the perfect model,

we can use the following metric:

   

G = P(g) wo (ee + (1-P(€)) ul

a

probability of

Pre) P\

where P(é) 1s the ideal

occurrence for € and P(g) is é’s probability
of occurrence in the current population.
This information-theoretic measure is called

the directed divergence between the two
probability distributions [Kullback,1959].

is a non-negative quantity that approaches
zero as the “resemblance” between P and P!
increases. The G metric has proven useful
in evaluating other systems that generate
stochastic models of their environment (eg
Hinton et al. {1984]).

When a pattern category is defined by
more than one characteristic, we can use the
G metric to evaluate the population's model
of each characteristic separately This
involves identifying the subset of the popula-
tion involved in modeling each characteris-
tic; and, treating each subset as a separate
entity for the purpose of making measure-
ments. A method for identifying these sub-
sets will be discussed shortly.

 

 

MEASURES FOR RANKING TAXA

 

Given a class of categorization prob-
lems to be solved, and a criterion for
evaluating solutions, we are now ready to
examine the performance of the genetic
algorithm. ‘The starting point will be the
measures used to rank taxa. Only if the
taxa are usefully ranked can the genetic
algorithm, or any learning heuristic, have
hope of inferring the best taxon. In this sec-
tion we first pot out some deficiencies in
the most often used measure; then, alterna
tive measures are considered and shown to
provide significantly better performance.

 

Brittleness and Match Scores

‘The first step im the execution cycle of
every classifier system is a determination of
which classifiers are relevant to the current
set. of messages Most implementations
make this determination using the straight-
forward matching criterion first proposed by
Holland and Reitman [1978]. More

 

specifically, if M-=myms ~~~ m,, m, € {0,1his

a message and Cmeyeg* ++ ey, 6 € fossts

a classifier taxon, then the message Mf
satisfies or matches C if and only if m=,
wherever ¢, is O or 1. When c,=#, the value
of m, does not matter. Every classifier
matched by a message is deemed relevant.
Relevant classifiers are ranked according to
the specificity of their taxa, where specificity
is proportional to the number of non-#’s in
the taxon Holland and Reitman used a sim-
ple match score to measure relevance The
score is zero if the message does not match
the taxon; otherwise it is equal to the
number of non-# positions in the taxon

This simple match score — hereafter
called Mi — effectively guides the genetic
algorithm in its search of relevant taxa.
Because all non-relevant taxa are assigned a
score of zero, however, M1 is the source of a
subtle kind of brittleness Whenever a mes-
sage matches no taxon in the population, the
choice of which taxa are relevant must be
made at random ‘This can clearly have
undesirable consequences for the perfor-
mance of the classifier system; and, also for
the prospects of quickly categorizing that
message using the genetic algorithm.

In order to circumvent this difficulty,
Holland and Reitman use an initial popula-
tion of classifiers having a 90% proportion of
#'s at each taxon position. This makes it
very likely that relevant taxa will be avail-
able for the genetic algorithm to work with.
Unless the pattern categories in the environ-
ment are very broad, though, the brittleness
of this approach is still a concern. Suppose,
for example, a classifier system must

the pattern characteristic
A fairly well-adapted population of

 

 

classifiers will contain taxa such as
LIOLOFH#, 1HO1O¥H, 11741041, 11411070,
ete, As the categorization process under the
genetic algorithm continues, the variability
in the population decreases. It therefore
becomes unlikely that the population will
contain many taxa having four or more #’s.
Such taxa would have a mateh score too low
to compete over the long run and survive
Now suppose the environment changes
slightly so that the characteristic is **010
that is, the category has been expanded to
allow either a 0 or 1 in the first two posi-
tions In order to consistently match the
exemplars of the new category, the popula-
tion needs a taxon with four #'s at exactly
the right loci. There is no reason to expect
such good fortune since the combinations of
attribute values are no longer random. The
population will most likely have no taxon to
match new exemplars, and the genetic algo-
rithm will blindly search for a solution.

 

Another proposed resolution of this
dilemma is to simply insert the troublesome
message into the population as a taxon {Hol-
fand,1976], perhaps with a few #'s added to
it. The problem with this is that the rest of
the classifier must be chosen more or less at
random. By abandoning the “building
block” approach to generating classifiers,
this method introduces the brittleness
inherent in ad hoc constructions that cannot
make use of previous experience. What is
needed is a way of determining partial
relevance, so the genetic algorithm can dis-
cover useful building blocks even in taxa
that are not matched. In the example cited
above, such a capability would allow the
genetic algorithm to recogmize #101044
and 1/#010## as “near miss” categoriza-
tions and work from there rapidly toward
the solution ##010##.

 

Alternatives to M1

The brittleness associated with the
match score MI has a noticeable impact on
categorization in classifier systems. To
demonstrate this effect, a basic genetic
 

algorithm [Booker,1982] was implemented to
manipulate populations of classifier taxa
Taxa in this system are 16 positions long.
The effectiveness of a match score in identi-
fying useful building blocks is tested by
presenting the genetic algorithm with a
categorization problem. Each generation, a
binary string belonging to the category is
constructed and mateh scores are computed
for every taxon. The genetic algorithm then
generates a new population, using the match
score to rate individual taxa.

To test MI, three pattern categories
were selected:

Cl = MNNNnLNntLAa
C2 = LLL LS 484888

CB a [tttteeeeeeenene

 

‘These characteristics are representative of
the kinds of structural properties that are
used to define categories, from the very
specific to the very broad. Three sets of
tests were run, each set starting with an i
tial population containing a different propor-
tion of #’s. Each test involved a population
of size 50 observed for 120 generations, giv-
ing a total of 6000 match score computa-
tions? At the end of each run, a G value
was computed for the final population to
evaluate how well the characteristic had
been modeled. The results of these experi-
ments — averaged over 15 runs — are given
in Table 1. For each pattern category, there
are statistically significant? decreases in per-
formance as the proportion of #’s is changed
from 80% to 33% (Recall that the best G
value is zero) Given this quantitative evi-
dence of M1’s brittleness, it is reasonable to
ask if there are better performing alterna~
tives.

 

 

The primary criterion for an alterna-
tive to MI 1s that it identify useful building

 

2 6000 function evaluations 1 the observation
75| that bas become a standard
ing genetic m3

3 For all results preseated io this paper, this me
test was performed comparing the means of Uo gr

terval

   

 

 

ps. The

84

 

Table 1
Final Average G Value Using MI

LH

 

 

 

 

 

 

Category [uitial Percentage of #'s

80% 50% 33%
cl 7.83 10.28 12.25
c2 4.95 16.72 25.13
c3 598 13.67 36.57

 

 

 

blocks in non-matching taxa; and, that it
retain the strong selective pressure induced
by MI among matching taxa, One way to
achieve this 1s to design a score that 1s equal
to MI for matching taxa, but assigns no
matching taxa values between 0 and 1, The
question is, how should the non-matching
taxa be ranked?

 

If we are concerned with directly iden-
tifying useful alleles, the following simple
point system will suffice: award 1 point for
each matched 0 or 1, % point for each #,
and nothing for each position not matched.
The value for # is chosen to make sure it is
more valuable for matching a random bit in
a message than a 0 or 1, whose expected
value in that case would be % To convert
this point total into a value between 0 and
1, we divide by the square of the taxon
length. This insures that there is an order
of magnitude difference between the lowest
score for a matching taxon and all scores for
non-matebing taxa. More formally, if t is
the length of a taxon, n, is the number of
exactly matched 0's and I's and ng is the
number of #’s, we define a new match score «

 

MI if the mecoage matches the tozon
ny + %ng otherwiee

ra

MQ—

Another way to rank non-matching

taxa is by counting the number of

 

alpha level for each test was 0S
mismatched 0's and 1's. This approach
measures the Hamming distance between a
message and a taxon for the non-# positions.
A simple match score M3 can be defined to
plement this idea. If n is the number of
mismatched 0's and 1’s, then

 

M1 if the mesoage matches the tazon
Ln otherwise

ra

M3=

Now it must be determined if M2 and
MG usefully rank non-matching taxa; and, if
so, whether that gives them an advantage
over Mi. Accordingly, M2 and M3 were
tested on the same three patterns and types
of populations described above for Ml.
‘These experiments are summarized in
Tables 2 and 3. As before, all values are
averages from 15 runs. First consider the
final G values shown in Table 2 When the
population is initialized to 80% y's there 1s
little difference among the three match
scores. The only statistically significant
differences are with pattern C3, where both

 

M2 and M3 do better than MI. This is
interesting because C3 is a category that has
no generalizations other than the set of all
messages. MI operates by seizing upon
matching taxa quickly, then refining them to
fit the situation. This strategy is frustrated
when general taxa that consistently match
are hard to find. Since M2 and M3 can both
take advantage of other information, they do
not have this problem with C3. When the
population is initialized to 33% #'s the lia-
bilities of M1 become very obvious For
each pattern category, the performance of
M2 and M3 are both statistically significant
improvements over MI.

In order to further understand the
behavior of the match scores, we also com-
pare them using DeJong’s [1975] on-line per-
formance criterion. On-line performance
takes into account every new structure gen-
erated by the genetic algorithm, emphasi2-
ing steady and consistent. progress toward
the optimum value, The structures of
interest here are populations as models of
the pattern characteristic. The appropriate

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 2 Table 3
Comparison of Final G Values Comparison of On-line Performance
c 80% #f's 80% #'s -

Match Score
category a ta Category [hae | MS
cl 7.83 | 10.30 | 7.76 ca 25.75 22.93
ce 4.95 | 2.25 | 4.32 o2 14.06 13.45
C3. sos |_142 | 097 3 175 282

50% #'s 50% #'s
cl 1028 8.17 696 cl 34.41 26.3 2198
cz 1672} 7.03 | 439 oz 2709 | 2022 | 17.81
cs 13.67 |_ 8.67 |_ 913 2126 | 14.78 | 1354
33% #'s 33% #'s
cr 12.25 | 8.05 | 519 co 26.35 | 21.46
C2 25.13 | 13.99 | 10.37 C2 35.3 26.75
C3 36.57 | 11.41 | 7.28 G3 40.16 19.34

 

 

 

 

 

 

 

 

 

 

 

 

 

 
on-line measure is therefore

1) .
MT) =p) 2a(7), where T is the number

given by

of generations observed and G(t) is the @
value for the (th generation. The on-line
performance of the match scores is given in
Table 3 When there are 80% #'s, the only
statistically significant difference is the one
between M3 and Ml on category C3. In the
case of 50% #s, the statistically significant
differences occur on Cl, where both M2 and
M8 outperform M1; and, on C2, where only
M3 does better than Ml. Finally, in the
difficult case of 33% #’s, the differences
between M3 and MI are all statistically
significant. M2 is significantly better than
M1 only on category C3.

Taken together, these results suggest
that M3 is the best of the three match
scores. It consistently gives the best perfor-
mance over a broad range of circumstances.
Figure 1 shows that, even in the case of 38%
#'s, MB reliably leads the genetic algorithm
to the perfect model for all three categories.
Using M3 should therefore enhance the abil-
ity of classifier systems to categorize mes-
sages

 

  

How should a classifier system use M3
to identify relevant classifiers? The criterion
for relevance using a score ike M3 is cen-
tered around the idea of a variable thres-
hold. The threshold is simply the nurnber of
mismatched taxon positions to be tolerated.
Initially the threshold is set to zero and
relevance is determined as with M1. If there
are no matching classifiers, or not enough to
Bill the system’s channel capacity, the thres-
hold can be slowly relaxed until enough
classifiers have been found. Note that this
procedure is like the conventional one in
that it clearly partitions the classifiers
according to whether or not they are
relevant to a message. This means that
negated conditions in classifiers can be
treated as usual; namely, a negated condi-
tion is satisfied only when it is not relevant
to any message.

 

86

 

DISCOVERING MULTIPLE CATEGORIES °

In developing the match score M3, we
have enhanced the ability of the genctic
algorithm to discover the defining charac-
teristic for a given pattern category. What
Mf there is more than one category to learn,
or a single category with more than one
defining characteristic? In this section we
show how to modify the genetic algorithm to
handle this more general case First, two
modifications are proposed for the way indi-
viduals are selected to reproduce and to be
deleted. Then, the modified algorithm is
shown to perform as desired

An Ecological Analogy

 

The basic genetic algorithm is a reli-
able way to discover the defining charac-
teristic of a category When there is more
than one characteristic in the environment,
however, straightforward optimization of
match scores will not lead to the best set of
taxa, Suppose, for example, there are two
categories given by the characteristics
11°*...** and 00% The ideal popula-
tion for distinguishing these categories would
contain the classifier taxa HH... and
OO##...##; that is, two specialized sub-
populations , one for each category. The
genetic algorithm as described so far will
treat the two patterns as one category and
produce a population of taxa having good
performance in that larger category. In this
case, that means the taxon ##H##...### will
be selected as the best way to categorize the
messages. The problem is obvious. Requir-
ing each taxon to match each message
results in an averaging of performance that ”
is not always desirable.

 

Various strategies have been proposed
for avoiding this problem. When the
number of categories is known in advance,
the classifier system can be designed to have
several populations of classifiers (Holland
and Reitman,1978); or, a single population
with pre-determined partitions and operator
Froure o
MS CONVERGES TO THE PERFECT woDEL

 

 

 

 

so
40 4
c
e
a0 4
v °
a
t cs
v
zo 4
5
| ro 4
4
T T T T
° 100 z00 300 400

GENERATIONS

87

 
 

restrictions [Goldberg,1983]. Both of these
approaches involve building domain depen-
dencies into the system that lead to brittle-
ness. If the category structure of the
domain changes in any way, the system
must be re-designed.

It is preferable to have a non-brittle
method that automatically manages several
characteristics in one population. What is
needed is a simple analog of the speciation
and niche competition found in biological
populations. The genetic algorithm should
be implemented so that, for each charac-
teristic or “niche” , a “species” of taxa is
generated that has high performance in that
niche. Moreover, the spread of each species
should be limited to a proportion determined
by the “carrying capacity” of its niche.
What follows is a description of technical
modifications to the genetic algorithm that
implement this idea.

 

A Restricted Mating Strategy

If the genetic algorithm is to be used to
generate a population containing many spe-
cialized sub-populations, it is no longer rea-
sonable for the entire population to be
modified at the same time. Only those indi-
viduals directly relevant to the current
category need to be involved in the repro-
ductive process. Given that the overall
population size is fixed and the various sub-
populations are not physically separated,
two questions immediately are raised: Does
modifying only a fraction of the population
at a time make a difference in overall perfor-
mance? How is a sub-population identified?

DeJong [1975] experimented with
genetic algorithms in which only a fraction
of the population is replaced by new indivi-
duals each generation. His results indicate
that such a change has adverse effects on
overall plan performance. The problem is
that the algorithm generates fewer samples
of the search space at a time. This causes
the sampling error due to finite stochastic
effects to become more severe. An increase

 

88

 

in cumulative sampling error, in turn, makes
it more likely that the algorithm will con-
verge on some sub-optimal solution.

The strategy adopted here to reduce
the sampling error is to make sure that the
“productive” regions of the search space
consistently get most of the samples. In the
standard implementations of the genetic
algorithm, the search trajectory is uncon-
strained in the sense that any two individu-
als have some non-zero probability of mating
and generating new offspring (sample points)
via crossover. This means, in particular,
that taxa representing distinct characteris-
ties can be mated to produce taxa not hkely
to be useful for categorization. As a simple
example, consider the two categories given
by 1111**** and 0000****, Combining taxa
specific to each of these classes under ¢ross-
over will lead to taxa like 1100**** which
categorize none of the messages in either
category. ‘There is no reason why such func-
tional constraints should not be used to help
improve the allocation of samples It there-
fore seems reasonable to restrict the ability
of functionally distinct individuals to become
parents and mate with each other. This will
force the genetic algorithm to progressively
cluster new sample points in the more pro-
ductive regions of the search space. The
clusters that emerge will be the desired spe-
cialized subpopulations.

As for identifying these functionally
distinct individuals, any restrictive designa-
tion of parent taxa must obviously be based
on match scores. This ts because taxa
relevant to the same message have a similar
categorization function. Taken together,
these considerations provide the basis for a
restricted mating policy. Only those taxa
that are relevant to the same message will
be allowed to mate with each other. This
restriction is enforced by using the set of
relevant classifiers as the parents for each
invocation of the genetic algorithm.
Crowding

Under the restricted mating policy,
each set of relevant taxa designates 3
species. Each category characteristic desig.
nates a niche. Following this analogy, indi-
viduals that perform well in a given niche
will proliferate while those that do not do
well in any niche will become extinct. This
ecological perspective leads to an obvious
mechanism for automatically controlling the
size of each sub-population. Briefly, and
very simply, any ecological niche has limited
resources to support the individuals of a
species. The number of individuals that can
be supported in a niche is called the carry-
ing capacity of the niche. If there are too
many individuals there will not be enough
resources to go around. The niche becomes
“crowded,” there is an overall decrease in
fitness, and individuals die at a higher rate
until the balance between niche resources
and the demands on those resources is
restored Similarly, if there are too few indi-
viduals the excess of resources results in a
proliferation of individuals to fill the niche
to capacity.

The idea of introducing a crowding
mechanism into the genetic algorithm is not
new. DeJong [1975] experimented with such
a mechanism in bis function optimization
studies. Instead of deleting individuals at
random to make room for new samples, he
advocates selecting a small subset of the
population at random. ‘The individual in
that subset most similar to the new one is
the one that gets replaced. Clearly, the
more individuals there are of a given type,
the more likely it is that one of them will
turn up in the randomly chosen subset.
After a certain point, new individuals begin
to replace their own kind and the prolifera-
tion of a species is inhibited.

A. similar algorithm can be imple-
mented much more naturally here. Because
a message selects via mateh scores those
taxa that are similar, there is no need to
choose a random subset. Crowding pressure

can be exerted directly on the set of
relevant taxa. This can be done using the
atrength paraineter normally associated with
every classifier [Holland,1983]. The strength
of a classifier summarizes its value to the
system in generating behavior. Strength i
continuously adjusted using the bucket bri-
gade algorithm [Holland,1983] that treats the
system like a complex economy. Each
classifier’s strength reflects its ability to turn
a “profit” from its interactions with other
classifiers and the environment. One factor
bearing on profitability 1s the prevailing “tax
rate”. Taxation is the easiest way to intro-
duce crowding pressure Assume that a
classifier is taxed some fraction of its
strength whenever it is deemed to be
relevant to a message. Assume, further,
that all relevant classifiers share in a fixed
sized tax rebate. The size of the tax rebate
represents the limited resource available to
support a species in a niche. When there
are too many classifiers in a niche their
average strength decreases in a tax transac-
tion because they lose more strength than
they gain. Conversely, when there are too
few classifiers in a niche their average
strength will increase. The crowding pres-
sure is exerted by deleting classifiers in
inverse proportion to their strength ‘The
more individuals there are in a niche, the
Jess their average strength Merbers of this
species are therefore more likely to be
deleted. In a species with fewer members,
on the other hand, the average strength will
be relatively higher which means members
are more likely to survive and reproduce. In
this way, the total available space in the
population is automatically and dynamically
managed for every species. The number of
individuals in a niche increases or decreases
in relative proportion to the average
strength in alternative niches.

 

Testing the New Algorithm

Having described the restricted mating
policy and crowding algorithm, we now
examine how well they perform in an actual

 
implementation. The genetic algorithm used
in previous experiments was modified as
indicated above. The number of taxa in the
population was increased to 200, and each
taxon was given an initial strength of 320. A
taxation rate of 0.1 was arbitrarily selected,
and the tax rebate was fixed at 50°32; In
other words, whenever there are 50 relevant
taxa, the net tax transaction based on initial
strengths is zero. Each generation the tax
transaction is repeated 10 times to help
make sure the strengths used for crowding
are near their equilibrium values.

Four categorization tasks involving
multiple characteristics were chosen to test
the performance of the algorithm:

1) Lan tss494*

2) LILLTLL At 8eee8%
sees U0

3) LLL ELLLateeee*
sees LL

 

4) aan
00000000****9+"
eeeeeeen TLL

The frst task involves two categories that
are defined on the same feature dimensions
The second task contains categories defined
on different dimensions. In the third task
the categories share some relevant features

 

 

 

 

 

 

‘Table 4
Performance With Multiple Catego:
Task On-line . Avg. G value
: or all categories

1 1212 8.3

2 1091 841

3 12.77 789

4 15.75 11.64

 

 

 

 

 

in common. Finally, the fourth task involves
three categories to be discriminated.

Experiments were performed on each of
these tasks, running the genetic algorithm
enough generations to produce 6000 new
individuals per characteristic. Each genera-
tion, one of the characteristics was selected
and a message belonging to that category
was used to compute match scores. In the
first three tasks, at least 50 relevant taxa
were chosen per generation. Only 30 were
chosen on task 4 to avoid exceeding the lim-
ited capacity of the population. All popula-
tions were initialized with 80% #’s. The
results are summarized in Table 4 and show
that the algorithm behaves as expected.
The performance values are comparable to
those obtained with Ml working on a
simpler problem with a dedicated popula-
tion. More importantly, an inspection of the
populations revealed that they were parti-
tioned into speciazed sub-populations as
desired.

CONCLUSIONS

This research has shown how to
improve the performance of genetic algo-
rithms in classifier systems. A new match
score was devised that makes use of all of
the information available in a population of
taxa, This improves the ability of the
genetic algorithm to discover pattern
characteristics under changing conditions in

 
the environment Modifications to the algo-
rithm have been presented that transform it
from a function optimizer into a sophisti-
cated heuristic for categorization. The first
modification, a restricted mating policy,
results in the isolation and development of
clusters of taxa, or sub-populations, corre-
lated with the inferred structural charac-
teristics of the pattern environment. The
second modification, a crowding algorithm, is
responsible for the dynamic and automatic
allocation of space in the population among
the various clusters. Together, these
modifications produce a learning algorithm
powerful enough for challenging applica-
tions. As evidence of this claim, a full-scale
classifier system has been built along these
lines that solves difficult cognitive tasks
[Booker,1982].

Acknowledgements

+t The ideas in this paper were derived
from work done on the author's Ph.D.
dissertation That work was supported by
the Ford Foundation, the IBM Corporation,
the Rackham School of Graduate Studies,
and National Science Foundation Grant
MCS78.26016.

REFERENCES

Bethke, A.D. (1981), “Genetic Algorithms as
Function Optimizers”, Ph.D.
dissertation, University of
Michigan.

Booker, L.B. (1982), “Intelligent Behavior as
an Adaptation to the Task
Environment”, Ph.D. disser-
tation, University of Michi-
gan.

 

DeJong, K.A (1975), “Anslysis of the
Behavior of a Class of
Genetic Adaptive Systems”
Ph.D dissertation, University

  

of Michigan.

Goldberg, D.E. (1983), “Computer-Aided
Gas Pipeline Operation Using
Genetic Algorithms and Rule
Learning”, Ph D dissertation,
University of Michigan.

Hayes-Roth, F. (1973), “A Structural
Approach to Pattern Learn-
ing and the Acquisition of
Classificatory Power",
Proceedings of the First
International Joint. Confer-
ence on Pattern Recognition,
p. 343-355,

Hinton, G., Sejnowski, T., and Ackley, D.
(1984), “Boltzmann Machines:
Constraint Satisfaction Net-
works that Learn”, Technical
Report, CMU-CS-84-119,
Carnegie-Mellon University,.

 

Holland, J.H. (1975), Adaptation in Natural
and Artificial Systems,
University of Michigan Press,
Ann Arbor.

Holland, J.H. (1976), “Adaptation” , In Pro-
gress in Theoretical Biology 4
(Rosen, R. and Snell, Feds).
‘Academic Press, New York

Holland, J.H. (1983), “Escaping Brittleness”,
Proceedings of the Intern:
tional Machine — Learnit
Workshop, June 1983, Monti-
cello, Illinois, pp 92-95.

  

Holland, JH. and Reitman, JS. (1978),
“Cognitive Systems Based on
Adaptive Algorithms”, In
Pattern-Directed Inference
Systems, (Waterman, D. and
Hayes-Roth, F. eds), pp. 313-
329 Academic Press, New
York.

 
Kuchinski, M J. (1985), “Battle Management
Systems Control Rule Optim-
vation Using Artificial Intelli-
gence”, Technical Note,
Naval Surface Weapons
Center, Dablgren, VA.

Kullback, S. (1959), Information Theory and
Statistics, John Wiley and
Sons, New York.

 

92

 
Hultiple Objective optinization with Vector Evaluated Genetic Algorithas

J. David Schaffer
Departaent of Electrical Engineering
Vanderbilt University
Nashville, TH 37235

ABSTRACT

Genetic algorithas (GA's) have been shown
to be capable of searching for optise in function
spaces vhich cause difficulties for gradient
techniques. This paper presents a method by which
the power of GA's can be applied to the
optinization of multiobjective functions.

 

Js Introduction

‘There 43 currently considerable interest in
optisization techniques capable of handling
multiple non-cosensurable objectives. Hany.

| _probless are of this type where, fo
such factors as cost, safety | and
Performance must be taken into account.

‘A class of adaptive search procedures known
as genetic algorithas (GA's) have already
‘shown to possess desireable properties [3,10]
to out perfore gradient techniques ‘on sone
problens, particularly those of high order, with

     

 

 

Bultiple’ peaks or with noise disturbance [4,5,6).
This paper describes an extension of the
traditional GA which allows the searching of

parameter spaces where sultiple objectives are to
be optinized, The software systen implenenting
this procedure was called VEGA for Vector Evaluated
Genetic Algoritha,

The next’ section of this paper will
jescribe the basic GA and the vector extension.
Then some properties are described vaich sight
logically be expected of this method. Sos
preliminary experiments on soue simple problens are
then presented to illuminate these properties and
finally, VEGA {8 compared to an established
eultiobjective search technique on a set of sore
formidable probleas,

 

 

      

28

 

stor Genetic Algoritha

Unlike many other search techniques which
waintain @ single "current best" solution and try
to improve it, a GA maintains a set of possible
solutions called a population, This population 1s
improved by a cyclic two=step process consisting of
@ selection step (survival of the fittest) and
recoabination step (mating). Each cycle is usually
called a generation. Hore detailed descriptions of
‘these operations say be found in the literature
13,4,5,6,10).

 

   

   

 

  

‘The question addressed here 1s, how can
this process be applied to problens where fitness
is a vector and not @ scalar? How eight survival of
the fittest be implesented when there is gore than
fone way to be fit? We exclude scalerization
processes such a3 weighted sums or root eean square
by the assumption that the different dimensions of
the vector are non-coanensurabl

When comparing vector quantities, the usual
concepts eaploye¢ are those proposed’ by Pareto
[11,13]. For two vectors of the sane size, the

 

 

 

 

 

equality, less-than and greater-than relations
require ‘that these relations hold element by
elenent. Another relation, partially-less-than, 13

   

defined as follows: vector 'X = (x1, x2, 4. 4 x0)
4s said to be partially-less-than vector ¥ = (yl,
2s vee y yn) Aff x4 <= yt for all 4 ond for at
least one value of 4, xi < yi. Assuming that siniaa
fare sought, if X 4s partially-less-than Y, then Y
is said to be inferior to or doatnated by X. The
objective of a search for ainina in a vector-valued
space is, then, a search for the set of non-
Anferior menbers, or the mecbers not dominated by
any others, At least one mesber of this Pareto
iniual set will doninate each vector outside the
set, Dut among thesselves, none is doninated.
in mind, a simple
vector of the fittest process was
Smplenented. The selection step in each generation
decane @ loop, each tine through the loop the
‘appropriate fraction of the next generation was
selected on the basis of another elesent of the
fitness vector, This process, illustrated in figure
1, protects the survival of the best individuals on
each dimension of performance and, simultaneously,
provides the appropriate probabilities for aultiple
ction of individuals who are better than
average on more than one diaension..

 

 

 

     

   

 

 

  

2+ Some Anticipated Properties of VEGA

3.1 Multiple Solutions
One potential advantage of VEGA over other
optieization searches should now be clear. Since
the object of the search is @ set of solutions,
GA has built-in advantage by working with o
population of test solutions. By comparing each
Andividval in a population to every other, those
Who are dominated by any other/s can be flagged as
Anferior. The set of non-inferior individu:
ich generation is the current best guess at the

   

   

 

 
 

performance

 

parents

cee,

Generatton(te1)

 

 

 

 

. 2

 

1

 

.

 

select 0 7
Sebgrauns
‘sing each

Sleenston of
berforsance:
itor

 

 

 

 

 

 

popstze

souffle

oly
Geral
operators

 

 

 

 

 

popsize

 

 

Figure 1, Schenatic of VEGA Selection

Pareto-optinal (PO) set. By presenting a nuaber of
non-inferior solutions, VEGA provides the user vith
an idea of the tradeoffs required by his problea if
4 single solution must be selected. It should be
noted that VEGA's view of non-inferiority {s
strictly local; it is limited to the current
population. While a locally dominated individual is
also globally dominated, the converse is not
necessarily true. An individual who is  non-
dominated in one generation may becone dominated by
an individual who energes in a later generation.

 

 

3.2 Possible Speciation

There ts a potential probles vith this
vector selection process. Survival pressure is
applied favoring extrene perforsance on at least
fone dimension of perforsance. If a utopian
individual (i.e, one vho excels on all dimensions
of performance) exists, then he may be found by
genetic combinations of extrene parents, but for
many probleas this utopian solution does not exist.
For these probleas, the location of the Pareto-
optimal set or front is sought. This front will.
contain some menbers vith extrene performance on
each dimension and sone vith "aiddling” performance

 

 

 

 

 

 

 

on all dimensions. Frequently, these cospromise
solutions ara of most interest, but there may be
danger of their not surviving VEGA's selection

‘This sight give rise to the evolution of
"within the population which excel on
different aspects of performance. This danger is
expected to be nore severe for problens with a
concave PO front than for those with a convex one.
See figure 2.

Two methods for conbating this potential
property of VEGA vere conceived. One trick would
to provide a heuristic selection preference for
non-dosinated individuals in each generation. This

 

   

 

 

 

would provide extra protection for the "middling"
individuals.
‘Another, not necessarily — exclusivi

   

approach would be to try to encourage crossbreeding
among the "species" by adding sone sate selection

 

94

heuristics. In a traditional GA, mates are selected

 

at random. On the assuaption that utopian
individuals are more Likely to result from
crossbreeding than inbreeding, such heuristics

 

might speed the search.

4. Preliminary Experinents

4.1 The Test Functions

In order to test the properties of VEGA

‘a set of three simple functions (f1, £2 &

selected.

Fl was a single-valued quadratic function
variables. (i.e. f1(x1,x2,x3) = x12 +

x22 + x3#82). This function vas run to test

Vhether VEGA reverts to a traditional GA uhen the

perfornance vector has only one dizension.

F2 vas a tyo-valuod function of one
variable (i.e. £21(x) = x##2; £22(x) = (x-2)##2).
‘The initial random population for the search on
this function is illustrated in figure 3. In
addition to the locations of x, £21 and £22, this
figure also shows the dominated flag for each x (1
Af doninated, 0 if not). The PO region {s Ocex<e2.

P3 vas another tvo-valued function of one
but with tuo disjoint PO regions Ocaxc=2

 

    

 

  

  

4.2 Heuristics
In order to mitigate the anticipated loss
of “siddling" individuals a heuristic vas tested
which gave an extra selection preference to locally
non-dopinated individuals. This preference took the
form of nuseric adjustents to the performance
sagures which vere required by the selection
algoriths to sum to zero across the population.
Therefore, a snall penalty vas deducted fron each
infertor individual and the sum of these penalties
vas divided acong the non-inferior individuals.
Experizents vere also conducted to see if

 

 

 

    

the search for the PO front could be improved bj
pate-selection heuristics which encouraged
crossbreeding. Inbreeding, in this context, means a

 
convex

Performance 2

pac

e close |__t ©

|

 

 

 

 

 

Aas Performance 1
are close
i CONCAVE.
Bac |
are not i
close i
7
bop
Aue Performance 1

are not close

Figure 2. A Concave and Convex Pareto-Optinal Front.

wating between two individuals whose high
performance is on the sane divension. Two such
heuristics vere tested, both attempting to improve
upon the performance’ of VEGA vith randos mating.
Random mating vas implenented by shuffling the
population and rating pairs fron the top, shown
as step three in figure 1. Each heuristic proceeded
by selecting a individual at random and then
electing a nate whose distance in perforaance
space vas maxizus, Two distance measures vei
tested, Euclidian distance and "inprovenent”
distance vhich was computed ignoring those
divensions on which the proposed mate perforzed
worse.

 

 

 

 

 

 

4.3 Results

All of these experinents were conducted
with populations of 30 individuals per dinension of
the perforsance vactor, and crossover and sutation
rates of .95 and .0] respectively. This represents
asoaller population size and higher rates of

    

 

Figure 3. F2 Generation Zero

application of the
traditional (3,5).
sug

 

netic operators than has been
These setting vere, hovever,
sted by the ork of Grefenstette [8].

On £1, VEGA replicated a search previously
conducted on’this function by a traditional GA (7]
when started with the sane random seeds. Thus, VEGA
doce appear to be a vector generalization of a
scalar GA.

On £2, VEGA evolved the
illustrated in figure 4 in just three gene!
While not all the individuals are in the PO
(Ocexe=2), those which are outs!
doainated. This result, coabined vith similar
performance by VEGA on £3 yielded sone confidence
in the soundness of the VEGA approach.

However, during these experizents, a
dangerous property of the heuristic selection
Preference for non-dominated individuals va
discovered. It had a tendancy to produce sudden
Premature convergence of the population to a
Suboptimal solution. This occurred when, in at

generation, only one or tvo individuals
anaged to be non-doainated. Then, the sux of the
dominated penalties vas large and, vhen divided
anong very fev, gave then an overvhelning selection
advantage. This lead to subsequent generations
consisting only of offspring of a few parents with
too little genetic diversity, After this
observation, this heuristic was renoved. VEGA has,

far, not exhibited the anticipated loss of the
ddling” individuals fron the PO set. Perhaps
concave PO fronts are not a characteristic of many
practical probli

 

  

    

  

 

  

 

  

 

 

  

The nate-selection heuristics faired no
better. Randon sating proved superior to both of
then. This vas an encouraging finding for two-

 
 

 

 

 

   

Figure 4, F2 Generation Three

valued
inbreeding vith randoa mating decr

problems, since the probability of

ses as the
nunber of disensions of performance increases. All
subsequent experinents utilized neither of these

heuristics.

 

 

5. Comparison of VEGA with ARSO
‘Once some conf

    
  

acquired that VEGA
vas able to conduct a gent rch in spaces with
sultiple objectives, {t vas desired to compare the
performance of VEGA with that of an established
Technique for aultiple objective search.

 

 

S.A The ARSO Technique

For cosparisen purpo
Randos Search Optiaization  (ARSO) procedure,
pioneered by Beale (1,2], vas selected. — ARSO
Fequests a starting point in the paraneter space to
be searched and proceeds to try to inprove upon it
by randooly perturbing the paraneters. Statistics
(wean & variance) are aintained for all,
perturbations vhich produce inprovenents (defined
as a nev solution vhich dosinates the old one), and
these statistics are used to. guide the future
perturbations. Random perturbation techniques have
been shown to solve a large class of optisization
probleas faster than gradient techniques hen the
hnunber of paraseters exceeds four, and furthersore,
the convergence tine seens to incr only
linearly vith this number. ARSO had already
exhibited high perforsance in problens of the sort
tested here

 

, the Adaptive

 

 

 

 

 

96

5.2 Some Methodological Probleas

‘The comparison of tvo search procedure:
presents sone wethodological probless vhich are
complicated when there are qultiple objectives.
One approach is to run each procedure until the”
solution is within some tolerance of a  knovn
solution and then compare the computational effort.
‘This approach vas rejected since the true solution
vas not known for the test problems. It vas desired
to compare the methods on probless vhose solutions
ware not knovn so as to include in the comparison,
the stopping criterion of each method. ARSO has a
threshold on the nusber of perturbations tried
without finding an tmprovenent which forces a halt
to the search. VEGA has no such preset stopping
criterion and is stopped by the user vhen no
further daprovesent 1s evident.

Another approach 4s to run both procedures
for the same apount of computational effort and
then compare the quality of the solutions.
Comparing vector solutions is probably best done by
checking if any are doninated by those provided by
the other procedure. If not, then a tie eust be
declared. This approach say be ‘unfair since ARSO
reports only a single solution vhile VEGA nay
report several.

The approach adopted was to run each
Procedure to its natural stopping criterion. All
Proposed non-doninated solutions were then compared
and, if any were found to be inferior, they vere
rejected. (Included in this set vere solutions
provided by Hartley (9] who used a variant of ARSO,
but solved the scalar problen of the equally
weighted sum of the errors on all dimensions.)
Then, the number of “ultimately” non-dominat
solutions found by each procedure vas plotted
against computational effort (number of function
evaluations).

 

 

 

 

 

 

 

   

 

 

5.3 The Test Functions
A set of three problens, dravn from the
of control engineering vas contributed by a
Hartley [9]. All involved the
sioulation of a systen with a different integration
operator for each of the systes state variables.
The systens vere of orders 2, 3 and 7 respectively.
The object of the search vas an optisal set of
integrators, each characterized by three
parazeters, aking the dimensions of the paraneter
search spaces 6, 9 and 21, respectively. The
performance measures vera the rms error of the
simulated solution from a known solution, one for
ich of the state variables saking the disensions
of the performance spaces 2, 3 and 7.
‘ALL searches vere conducted using the sane
GA parapeters as were used for the preliminary
problens. The integrator parameter sets vere gray

  

 

   

 

 

coded (see Schaffer(12] or Bethke[3}) to 12 bit
Precision, making the binary search spaces
248(1246)" = 4.7 * 108021, 2#8(1289) = 3.5 108832

    

and 248(128821) = 7.2 #
aystens, respectively.

10875 for the three

5.4 Results

While the true system behavior vas assuned
known, the object of the search vas for optinal
integrators for the simulations, and these vere not
known, Thus the probles of when to stop searching
had to be faced. To illustrate, a scatter plot of
perforsance of the initial random generation for

 

 
the second order systen is shown in figure 5.
Figure 6 shows that considerable inprovesent had
been achieved in three generations. Figure 7 shovs
the leading edge of the population after 49
generations. Note that the axes have been expanded
three orders of magnitude. After running VEGA to
generation 110 no. substantial increase in
perforsance vas evident. See figure 8. There are
however, several nore points on what appears to be
the PO’ front. Thus, a decision to stop such a
search must be a judgement call based on a belief
that the PO front has beon located and that further
search effort vould be wasted,

‘The experiences were sinilar for the third
and seventh order systess, but scatter plots for
these high order systeas could not be dravn.

Before proceding to the comparison of VEGA
with ARSO, it say be instructive to illustrate one
of the ARSO searches in the second order system
probles. Figure 9 traces the improvesents in the
Solution found by ARSO and is presented on the sane
axes scales used for figures 5 to 8. ARSO found a
solution vhich vas judged “ultigately" —non-
dominated in 607 evaluations. ARSO's stopping
criterion halts if no improvenent 1s located after
1000 consecutive evaluations and so this run
continued until 1607 evaluations and halted.

‘A second run of ARSO vas initiated with one
of the tvo non-dominated individuals from the

VEGA. This run
bout 1300 evaluations, but its
solution vas inferior. VEGA, on the other hand,
did not locate its first "ultimately" non-doainated
solution until 2621 evaluations and by 6000 tt had
found eight. These results are shown in figure 10.

 

 

 

 

 

 

 
  

 

 

   

 

eae FO” Shee wees

eb eo

 

Figure 5. Second Order Systes — Generation Zero

 

 

Figure 6. Second Order Systea — Generation Three

 

Figure 7. Second or

 

 

Syst

 

 

Vag om a

== Generation 49
   
   
   

 

 

   

ary m
error’) 1

 

Second Order Systea — Generation 110

 

 

‘The tentative conclusion from these runs is that
ARSO is fast, but may get trapped on local extrena.
but ore robust.

   

two with VEGA and tyo vith ARSO. VEGA wi
initiated with a random population for one run and
given a whole population of clones of Hartley
Solution for the other. ARSO vas started with a
reasonable starting point analogous to the starting
point that lead to success on the 2nd order
Problem, and also the Hartley solution. The results
of these tests are presented in figure 11. ARSO
again found a good solution in under 2000
evaluations on its first run. When given a PO
solution, tt could only try for 1000 evaluations to
VEGA found a non-
lution quite early in its search (415
evaluations), but because it vas not sufficiently
extrene on any one dimension of performance, it did
not survive into future generations. Hore good
solutions energed later vith VEGA having five after
10000 evaluations. VEGA, unlike ARSO, vhen given a
PO solution, quickly located sany variants of it.
‘The same four searches vere run on the 7th
order systen. This tine neither VEGA nor ARSO
located any solutions which vere not dominated by
the Hartley solution. Hartley had used his
knovledge of the problen to start his search at a
close-to-PO point, but both VEGA and ARSO vere
started without this prior knowledge. The stopping
criteria for ARSO had been relaxed to lengthen the
search and a variance paraneter had also been
relaxed so as to broaden the search, but after
almost 12000 evaluations no non-doninated solutions
had been found. Its best solution at that tine vas

 

 

 

 

 

 

 

 

Figure 9. Trace of ARSO Search, Second Order systes

98
also known to be unstable. VEGA searched for alnost
36000 evaluations without locating any solutions
not doainated by Hartley's, howaver many of then
were stable. Again, when told where to look, VEGA
fed many nore PO solution:

The tentative conclusion, then, seems to
have been supported by the higher order searches.

$. Discussion

The major finding of this research vas that
ctorization of feedback and the
jection process of a GA can be successfully done.
This opens the domain of multiobjective
optisization probless to the already established
power of genetic search.

Heuristic modifications of the traditional
wethod to give selection preference to non-
dominated menbers of a population and to try to
Approve on random mating proved to be inferior to
the traditional method. The possibility that VEGA
nay have a veakness in the central region of a
concave PO front cannot be eliminated, but
sapirical evidence to date suggests that it aay not
rhous.

‘The comparisons of VEGA with ARSO contain
no anal amount of “apples versus oranges." The
methods differ in the nuaber of solutions presented
and in the way their searches are normally halted.
However, both contain stochastic elements, both
conduct multidisensional search and both are halted
when no further improvement is apparent. oth may
be started with random information, or say take
advantage of prior knovledg
‘about his s

 

 

   

 

   

 

 

 

 

 

  

 

 

 

 

 

 

 

   

eC

Figure 11, ARSO vs VEGA on Third Order Systes

 

 

 

 

Sah ake he ye tee
Cea ne oY eee

Figure 10, ARSO vs VEGA on Second or

 

5
ary

   

0 aE

Figure 12, ARSO v3 VEGA on Seventh Or

 

fF Systen

 
vas given several tires sore computational effort.
than vas ARSO, due largely to differences in the
ethods for stopping each search.

The general conclusion of the comparison
was that ARSO {s capable of very quickly locating
solutions to conplex sultidizensional problens, but
its preforsance may be less robust than VEGA's.
VEGA, on the other hand, takes longer to locate the
good regions of complex search spaces, but seens to
be able to do so nore reliably. This conclusion is
not dissinilar to previous results from comparison
of scalar genetic search vith gradient techniques

(5,6).

Finally, a simple eethod has been conceived
which say isprove both VEGA. and” ARSO.
saintaining a data structure "off to the aid
containing all non-dosinated solutions encountered
in the search, VECA vould be protected against. the
Joss of good but not extrene individuals, as
eccurred in the search on the 3rd order problen.
Sintlarly, ARSO would then have the pover to report
2 nusber’ of solutions instead of only one.
Furthermore, by monitoring the adding and
subtracting of menbers to this set, both techniques
Bight be given a nore rational stopping criterion.
Work on this addition to both methods will comence
in the near future.

 

 

  

 

 

 

   

 

REFERENCES
Guy 0, Beale Optinal Atrcrafe
Development by Adaptive

Qptinization, Ph.D. Dissertation,

of Virginia, Charlottesville, Virginia, May
1977.

2. Guy 0, Beale and Gerald Cook, “Optimal digital
Simulation of aircraft via random search
techniques," J. Guidance and Control, Vol. 1,
no. 4, July-Aug. 1971

 

 

 

 

‘Albert Donally Bethke Genetic Algorithns
Function Optimizers, Ph.D. Dissertation,
University of Michigan, Ann Arbor, Michigan,
Jan 1981.

   

Anne Brindle Genetic Algorithas for Function
Optimization, Ph.D. Dissertation, University
of Alberta, Ednonton, Alberta, Canada, Jan
1981.

 

 

Kenneth DeJong, Analvsis of the behavior of @

s of genetic adaptive svstens, Ph.
Dissertation, University of Michigan,

Arbor, Mich., 1975.

 

   

‘Ann

 

6. Kenneth DeJong, "Adaptive System Design: A
Genetic Approach," IEEE Trans. Systens, Man
and Cybernetics, Vol. SHC-10 No. 9, Sept 1980.

 

John Grefenstette, “A user's guide to

   

   

GENESIS," Tech. “Report CS~83-11, Computer
Science’ Dept., Vanderbilt University,
Nashville, Tenn., Aug. 1983.

John Grefenstette, “Genetic algorithas for
multilevel adaptive systeas,"" IEEE Trans. on

Systems, Han and Cybernetics, in press.

100

10.

a.

12.

3.

 

 

 

‘Thomas Hartley, Parellel Hethods for the Real,
Lise ‘SEAT Nonlinear Systens,
Ph.D. “Dissertation, Vanderbilt University,
Nashville, Tenn., Aug. 1984.

John H. Hollan tation in

Artificial Systeas, University
Press, Ann Arbor, Michigan 1975.

   

Natural
of Michigan

    

V. Pareto, Cour
Lausanne, Swit

’ Econgate Folitique,
nd, 1898.

 

J. David Schaffer Sone Experinents in

Machine
Eva ie

Vanderbilt
Dec 1984.

 

 

Algorithas, Ph.D. Dis
University, Nashville,

 

Thonss L. Vincent and Walter J. Granthan,
ooeieality Je Systess, John Wiley
SE Sona hee vor ToHT
Adaptive Selection Methods for Genetic Algorithms

James Edward Baker

‘Computer Science Department
Vanderbilt University

Abstract,

Premature convergence 1s a common problem in
Genetic Algorithms. This paper deals with
inhibiting premature convergence by the use of
adaptive selection methods Two new measures
for the prediction of convergence are presented
and their accuracy tested. Various selection
methods are described, experimentally tested and
compared

1. Introduction

In Genetic Algorithms, it 1 obviously desirable
to achieve an optimal solution for the particular
function being evaluated However, it 13 not
necessary or desirable for the entire population to
converge to a single genotype Rather the
population needs to be diverse, so that a
continuation of the search 1s possible ‘The loss of
an allele indicates a restriction on the explorable
search space Since the full nature of the
function being evaluated 1 not known, such a
restriction may prevent the optimal solution from
ever being discovered If convergence occurs too
rapidly, then valuable information developed in
Part of the population is often lost This paper
deals with the control of rapid convergence

  

Three measures are typically used to compare
genetic algorithms ‘They are the Onkne
Performance, the average of all individuals that
have been generated, the Offline Performance,
the average of the Best Individuals from each
generation, and the Best Indiuidual, the best
individual that has been generated We attempt
to optimize functions and therefore use the Best
Individual measure for comparison In order tu
improve this measure, we promote diversity
within the population and control rapid
convergence Increased diversity detrimentally
affects the Onhne Performance measure and
inhibited convergence detrimentally affects the
Offline Performance measure Improving these
two performance measures is not in the scope of
this paper

 

Methods for the prediction of a rapid
convergence are the topics of section 2. Section 3
will describe various algorithms with which to
slow down convergence, and section 4 will present
their results A conclusion section will follow the
results

2. Prediction of Rapid
Convergence
There are two different aspects to the control of
rapid convergence First, how can one tell that tt
has occurred and second, how can one predict
when it will occur

Recognizing rapid convergence after it has
occurred 1 rather straightforward By its very
meaning, a rapid convergence will result in a
dramatic rise in the number of lost and converged
alleles A lost allele occurs whenever the entire
population has the same value for a particular
gene Thus subsequent search with that gene 1s
impossible A converged allele, as defined by
DeJong [1], 15 2 gene for which at least 95% of
the population has the same value However, the
effects of rapid convergence are not limited to
‘only those alleles which are indicated by these
measures A rapid take over of the population
wall cause all genes tu suddenly lose much of thew
variance We define bias as the average percent
convergence of each gene Thus for binary genes,
this value will range between 50, for a completely
uniform distribution (in which for each gene there
are as many individuals with a one as a zero) and
100, for a totally converged population (in which
each gene has converged to a one or a zero). The
‘bias measure provides an indication of the entire
population's development without the
disadvantage of a threshold, such as the one
suggested by DeJong to indicale a converged
allele A threshold does not indicate the amount
by which individuals exceed it or the number of
individuals which fall just short We can

 

therefore monitor the sudden jumps in the lost,
converged or bias values to determine when a
sapid convergence has occurred
‘The prediction of rapid convergence 1s necessary
for selection algorithms to be able to adapt
accordingly Lost, converged or bias values
cannot be used for this purpose since their

measurement occurs after potentially _vital
information has been discarded. Two different
prediction methods will be described

 

A common cause of rapid convergence is the
existence of super indivrduals, that 1s individuals
which will be rewarded with a large number of
offspring in the next generation Since the
population sie is typically kept constant, the
number of offspring allocated to a super
individual will prevent some other individuals
from contributing any offspring to the next
generation. In one or two generations, a super
individual and its descendants may eliminate
many desirable alleles which were developed in
other individuals The first method addresses the
problem of super individuals by setting a
threshold on an individual's expected number of
offspring. If, after all individuals in a generation
have been evaluated, an individual has an
expected value above this threshold, then a rapid
convergence is deemed imminent.

 

 

 

A closer analysis of rapid convergence leads to
the second measure Rapid convergence is not
‘caused solely by an individual receiving too many
offspring, but also by the often related situation
of many individuals being denied any offspring
‘Thus rapid convergence may also be predicted
not by monitoring how many offspring the most
fortunate individual receives, but rather by
monitoring how many individuals receive no
offspring. We define percent involvement as the
percentage of the current population which
contributes offepring for the next generation
‘The percent involvement measure has the
advantage that it can predict a rapid convergence
caused by several individuals even when none of
them would exceed the threshold of the first
method. If that threshold were lowered to catch
this case, then it might predict a rapid
convergence when it was not occurring

3. Modifications to Selection

This section presents various methods designed
to prevent or control rapid convergence. There
are basically two choices: either develop a fixed

  

102

 

selection algorithm which avoids rapid
convergence, or develop a hybrid system, which
adapts ats selection algorithm to handle rapid
convergence when it occurs.

3.1. Standard Selection

The expected value model presented by DeJong
4s taken as the standard for comparison, since a
number of properties about its behavior have
been proven [1,3] This model evaluates each
individual and normalizes the value with respect
to the population’s average. The result for each
individual is called his expected value and
determines the number of offspring that he will
ree In our implementation [2], the actual
number of offspring will be either the floor or the
ceiling of the expected value. Thus, the number
of offspring attributed to a particular individual
is approximately directly proportional to that
individual’s performance ‘This direct
proportionality is necessary for Holland's
theorems to hold, however it is also the core of
this method's susceptibility to rapid convergence,
Since there is no constraint on an individual's
expected value, an individual can have as many
offspring as the population size will allow.
‘Therefore, the expected value model can exhibit
rapid convergence leading to sharp increases in
lost, converged and bias values as well as non-
optimal final results

 

  

 

Because of the theoretical advantages associated
with the expected value model, all of the hybrid
systems listed below use this selection algorithm
when rapid convergence 1s not indicated

3.2. Ranking

‘One way to control rapid convergence is to
control the range of trials allocated to any single
individual, so that no individual receives: many
offspring. The ranking system 1s one such
alternative selection algorithm In this algorithm,
each individual receives an expected number of
offspring which 1s based on the rank of his
performance and not on the magnitude, There
are many ways to assign a number of offspring

based on ranking, subject to the following two
constraints

1 the allocation of trials should be
monotonically inereasing with respect
to increasing performance values, to
provide for desirable rewarding,

2 the total of the individual allocation
of trials should be such that the
desired number of individuals will be
in the next generation

Determining the values for our ranking
experiments was done by taking a user defined
value, MAX, as the upper bound for the expected
values A linear curve through MAX was taken
such that the area under the curve equaled the
population size For this construction, several
values are easily derivable

 

1. lower bound,

MIN = 2.0- MAX ;

2. difference between "adjacent"
individuals,

INC = 20 * (MAX - 10) /

Population Size

3 lowest individual's expected value,
LOW = INC /20.

Hence for a population size of 50 and a MAX of
2.0 . MIN = 0.0, INC = 0.04, LOW = 0.02

However, ranking with MAX == 2.0 causes the
population to be driven to converge during every
generation, including very stable searching
periods, ie, all individuals being within 10% of
the mean. Ranking forces a particular percent
involvement rather than preventing low percent
involvement values from occurring Our
experiments show the desirable range for the
percent involvement value is between 94% and
100% The above settings force a percent
involvement value of approximately 75%, and
hence cause undesirable convergence To prevent
this, one must choose a MAX value of
approximately 1.1, which force the percentage
involvement value into the desirable range

 

3.3. Hybrid Systems
The following two systems use the expected
value model as the default. When rapid

convergence 1s predicted, the system will
temporarily switch to a different selection
method, designed to better handle the situation

3.3.1. Hybrid with Ranking

We have investigated two systems in which
Ranking 1s used as the alternative selection
method. The systems differ only in the way in
which a rapid convergence is predicted The first
fone uses a threshold on the maximum allowable
expected value, the second system uses a
threshold on the minimum allowable percent
involvement. Ranking was chosen since, as
described previously, it should work better during
periods of rapid convergence and the expected
value model should work better during the other
periods Thus, these two system's strengths and
weaknesses complement each other and should
create a good hybrid.

 

 

3.3.2. Dynamic Population

Recall that one cause of rapid convergence is
that super individuals prevent other individuals
from having offspring. This is due to the
enforcement of a constant population size, and
clearly results in a drop in the percent
involvement. If the population site were allowed
to grow, then a super individual would not force
the elimination of many other individuals

 

‘The dynamic population size method is
implemented by enforcing a lower bound on the
percent involvement. This is done by adding
individuals to the population until both the
orginal population size and the acceptable value
for the percentage involvement are reached. Due
to the requirements of crossover, additions to the
population are made in increments of two
Additional individuals are added to the system on
the same basis as before, that 1s by their expected
value

During periods of slow convergence, the size of
the population will be constrained toward the
original population site, since the lower bound on
the percent involvement will be satisfied before
the entire current population is chosen Although
the population size may grow as large as deemed
necessary (within physical memory limitations), it
will be guided back to the original population size
during periods of slow convergence, as long as the
lower bound value is set below 100%
Furthermore, on our system, the floor of the
individual's expected value was taken as a
minimum for his number of offspring This
Periodically Jed to percent involvement values

 
which were higher than the required lower bound

even during a time of high population size This
is a characteristic of our implementation and has
not yet been investigated for its desirability.

This method has good intuitive appeal and has
the advantage of using the expected value model
throughout, The advantage of reacting
differently to differing magnitudes of potential
rapid convergence is also present.

 

 

A possible disadvantage of this system is that a
super individual can obtain a large
percentage of the population very quickly; while
other individuals are not completely lost, their
effect on the population is tremendously
undermined

  

4, Results

All experiments were performed using the
Genesis System [2] at Vanderbilt University. The
initial population size was set to 50, the crossover
rate to 0.6, and the mutation rate to 0.001 Bach
curve in figures 1 through 10 were taken from
single, representative executions of | the
appropriate functions Each curve in figures 11
and 12 represent the average of five executions

 

4.1. Detection

In order to confirm the predicting capabihty of
the percent involvement and greatest expected
value, a function was designed on which a
standard GA would experience a rapid
convergence This function had a gentle slope
over more than 99.5% of the search space. A
steep, highly valued spike existed in the
remaining one half of one percent. To achi
the optimal result, the system needed to find the
spike and then to optimize withm it. The
outlying, gentle slope discouraged those alleles
necessary for the optimal result Thus when a
super individual occurred, that 1s one within the
spike, vital information was likely to be lost

 

 

‘This function was used with the expected value
model of selection For each generation, the
values of the percent volvement and greatest
expected value were output. For these values to
be useful as predictors, they must noticeably
change prior to a rapid increase in the lost,

  

104

 

converged and bias values Graphs comparing
these predictors with the lost, converged and bias
values can be seen in figures 1 - 6

Figure 1 shows that the percent involvement
value drops sharply, prior to the dramatic rise in
the lost value. A similar relationship exists for
the converged value in Figure 2. Figure 3 shows
the bias value chmbing before the percent
involvement value has reached its minimum
‘This occurs since the percent involvement values
are already below their normal range prior to
reaching its minimum The normal range for the
percent involvement has been found to be
between about 94% and 100% . The drop in the
Percent involvement which occurs around the
23rd generation causes no appreciable effect, since
(as the bias value indicates) the population is
already over 90% converged. Note the first
indication that a rapid convergence has occurred
is given by the bias value, and the last indication
by the lost value. This 1s seen by the primary
increase occurring in the fourth generation for the
bias value, in the fifth generation for the
converged value and in the sixth generation for
the lost value

  

 

The maximum expected value also experienced
a sharp change prior to the rapid convergence.
Figures 4, 5 and 6 show this clearly Note that
the spike in the maximum expected value
‘occurred one generation before the minimum
percent involvement value This shows clearly
the more global nature of the percent
involvement discussed eather. ‘That is for
generation number three, there was a single super
individual, evident from the maximum expected
value and percent involvement. However, the
largest loss of other members of the population
occurred during the following generation, when
this super individual's offspring were reproducing
This is seen in the fourth generation's percent
involvement value

Figures 1 - 6 show that both the percent
involvement and the maximum expected value
provide a good prediction of the occurrence of
rapid convergence in this example However, the
percent involvement appears to be superior in
general, since it can detect some rapid
convergence not caused by a super individual
4.2. Comparison of Methods

Recall that a superior method potentially
produces better Best Individuals by retaining
diversity in the population and controlling rapid
convergence Thus the lost, converged and bias
values should remain low with increases occurring,
only gradually

The various selection methods discussed in
section 3 were tested with a variety of functions
In all cases, Ranking uses 2 MAX of 1.1, as
discussed in section 3.2. Figure 7 compares their
lost alleles values for the same function used for
Figures 1-6. These results were chosen because
they are fairly representative of the various
functions tested The standard selection method
consistently experienced rapid convergence sooner
and more dramatically This can also be seen in
Figure 8, a comparison of the bias values for the
same function Just as consistently, the ranking
system did not rapidly converge

 

Figures 9 and 10 show the loss of alleles and
bias for Shekel’s “foxhole problem" studied by
DeJong [1,4]. ‘These figures also show the same
two characteristics: 1) standard selection
performing worst; and 2) ranking performing
best However, note the vast superiority of the
hybrid system which was based on the percent
involvement over the other hybrid system This
is probably the result of the percent
involvement’s supenor ability to predict rapid
convergence.

 

‘The population variance method performed no
worse than the standard system, but for this
function, at did not perform significantly better
The hybrid methods did experience convergence
However, it is delayed, it is extended over more
generations, and its magnitude is lessened
‘Therefore one should expect these systems to be
able to produce better final solution than the
standard system, although it may take longer

Figures 11 and 12 show the average Best
Individual versus trial number for the various
methods. These values represent the average for
five executions Figure 11 1s for the “foxhole
problem”. Note that all of the methods
performed at least as well as standard selection,
given a sufficient number of tals. Furthermore,

ndard selection has lost nearly half of its

 

105

alleles by the 200th trial © Ranking has
‘outperformed all the others and after 2000 trials
has not lost any alleles. Thus ranking has the
best final result and the best potential for
improvement, Note that ranking 1s the slowest of
the methods, not producing a "competitive
result until about the 900th trial This causes the
Offline Performance to be very bad, and the high
diversity causes the Online Performance to suffer

 

Figure 12 18 from a function which has a sharp
‘optimal region which the system must find. The
function also has various local optima which may
cause convergence before the discovery of this
region Figure 12 indicates the relative ability of
the systems to find the optimal region. Each
curve plotted represents the average over five
executions Ranking was able to find the region
four out of five times, but again was the slowest
an starting

5. Conclusions

To varying degrees, all of the methods discussed
in this paper were able to control rapid
convergence. The ranking method shows the
greatest promise It results in better solutions for
many functions experiencing rapid convergence
and it maintains virtually all of its alleles This
gives it the potential for continued search and
even further improvement on its solutions The
primary drawback to this system is that it
requires a larger number of trials to obtain these
results, especially for functions not exhibiting
rapid convergence. We have observed that in the
expected value model all of the individuals are
typically within 20% of the mean during non-
rapidly converging periods. Hence, the ranking
system, with MAX = 11 , should be roughly
equivalent to the expected value system during
these periods, yet has the advantage of being able
to control rapid convergence Of course, it also
has the disadvantage of avoiding rapid
convergence, even when the convergence 1s
desirable. Hence, ranking warrants further study
both for its robustness and its particular handling
of rapid convergence.

 

At present there 1s insufficient justification to
rank the other methods’ performance. However,
they all represent improvements over the
standard selection algorithm For some functions

 

 
they were able to significantly slow down the
rapid convergence and retain more diversity
within their genes. ‘This typically led to better
final results than the standard selection algorithm

would produce Standard selection did not
‘outperform any of the other methods on any of
the functions tested, given a sufficient number of
tals

 

Both the percent involvement and the expected
value provide a prediction of rapid convergence
and can be used to help control it However, the
prediction based on the expected value applies
only for rapid convergence caused by super
individuals Therefore, the percent involvement
value should be used as a more general predictor

Many other methods to control rapid
convergence should be studied and compared
Among them are

1 providing a simple upper bound on
the number of offspring allowable to
an individual,

2 limit super individuals to a very small
number of offspring (1 or 2) in
combination with the elitist strategy,

3a ranking system based on non-lnear
curves or adaptively changing curves,

4 providing multiple thresholds for the
varying population size method, to
‘use separate thresholds for growing
and for shrinking the population, ot
separate values for the prediction and
the processing phases

‘The methods presented in this paper should be
tested further on a larger number of functions
before defimtive conclusions can be made

References

LK A Delong, Analysts of the
behavior of a class of genetic
adaptive systems, PhD Thesis, Dept
Computer and Communication
Sciences, Univ. of Michigan (1975).

23 J Grefenstette, "A user's guide to
GENESIS," Tech Report CS-83-11,

106

 

Computer Science Dept, Vanderbilt
Univ (August 1983)

3.J H Holland, Adaptation in Natural
and Arttfictal Systems, Univ
Michigan Press, Ann Arbor (1975)

4.3. Shekel, "Test functions for
multimodal search techniques," Fifth
Ann Princeton Conf Inform Sct
Syst (1971)
oq Figure 2. 200
fF 90
L 80
207
L 70
An-tost,
L 60
‘B--percent
Anvolvenent
20-4 L 50

fF 40

 

204

A--converged

 

 

1004

 

An-Bias

4

 

 

 

Generation

Figure 2. y 200
[ 90
IK r 80
fF 70
| 60
Be-percent
sgvolvenent
+ 50
F 40
a ia
Figure too
1. 90
L eo
L 70
L 60
Be-percent
| gqvoivenent
f 40
a.

L 30

 

 

 
30., Figure,

 

ras

t-10

Be-maximum
expected
value

 

 

A--converged

 

T
Generation

 

1-10

Be-maximum
expected
value

 

 

 

 

Generation

 

10

Besmaxinum

expected
value

 

T
Generation

108

 
aoe Figure 7. a

 

 

 

 

 

204 PANINI ry
Lost
Alleles
207) a B c D 20
E
T T T T T t
10 20 30
Generation
1007 Figure 8. 100
90
80
Bias
70
60
30) T =a T T T T 50
10 20 30
Aq-Standara
Becliybrid/nax. exp. val. Generation

C--Hybrid/perct. invol.
D=-Pop. variance
E--Ranking

109

 

 
 

305

207]

Lost,

 

 

 

nd
>
2
oe £

t rT T 1 T T
a0 20 By

Generation

Soo Figure 10.

904

 

 

 

30

20

100

20

60

 

 

20
Ac-Standard
Be-Hybrid/sax. exp. val.
C--Hybrid/perct. invol.
D--Pop. variance
Ranking

 

110

50
Average
Best

Average

 

4

y
E
Ir

 

 

 

 

 

 

 

 

E
5
x
Le
34 Ls
24
2: i TT TT .
500 1000 1500 2000
‘Trial
405 Figure 12. 40
204 t- 30
204 c L20
>
10 Fe
E
© i a T e
sbo 11000 1sb0 TT 2000
As-Standara eiat

BesHybrid/max. exp. val.
G--Hybrid/perct. invol.

Pop. variance
E+-Ranking

  

m1

 
Genetic Search with Appro:

 

 

ate Function Evaluations

John J Grefenstette!
J Michael Fitzpatrick

Computer Science Department
‘Vanderbilt. University

Abstract

Genetic search requires the evaluation of many
candidate solutions to the given problem. The
evaluation of candidate solutions to complex
problems often depends on statistical sampling
techniques This work explores the relationship
between the amount of effort spent on individual
evaluations and the number of evaluations
performed by genetic algorithms It 1 shown
that in some cases more efficient search results
from less accurate individual evaluations

1. Introduction

Genetic algorithms (GA's) are direct search
algorithms which require the evaluation of many
points in the search space In some cases the
computational effort required for each evaluation
is large In a subset of these cases 1t 1s possible to
make an approximate evaluation quickly. In this
Paper we investigate how well GA’s perform with
approximate evaluations. This topic 1s motivated
1m part by the work of De Jong [5], who included
@ noisy function as part of his test environment
for GA's, but did not specifically study the
imphcations for using approximate evaluations on
the efficiency of GA's Our main question 1s
Given a fixed amount of computation time, 18 it
better to devote substantial effort to getting
Inghly accurate evaluations or to obtain quick,
rough evaluations and run the GA for many more
Generations? We assume that the evaluation of
each structure by the GA involves a Monte Carlo
sampling, and the effort required for each
evaluation 1s equal to the number of samples
performed,

 

  

 

Since the GA's we consider do not obtain
accurate evaluations during the search, the

 

traditional metrics, online performance and
offline performance, are not appropriate (or at
least not easily obtained). Instead, we assume
that the GA runs for a fixed amount of time,
after which it yields a single answer. The
performance measurement we use 1s the absolute
performance, that 1s, the exact evaluation of the
suggested answer after a fixed amount of time.

In section 2 we describe the statistical
evaluation technique In section 3 we describe
the result of testing on a simple example
evaluation function. In section 4 we describe the
result of testing on image comparison functions
In section 5 we present future directions of
research on approximate evaluations

2 The Statistical
Technique

Evaluation

In this work we investigate the optimization of
a function f(z) whose value can be estimated by
sampling The variable z ranges over the space
of structures representable to the GA We are
interested in functions for which an exact
evaluation requires a large investment in time but
for which an approximate evaluation can be
carried out quickly. Examples of such functions
appear in the evaluation of integrals of
complicated integrands over large spaces Such
integrals appear in many applications of physics
and engineering and are commonly evaluated by
Monte Carlo techniques [13,15]. An example
from the field of image processing, examined in
detail below, is the comparison of two digital
images Here the integrand is the absolute
difference between image intensities in two
images at a given point in the image and the
‘space is the area of the image.

 

‘Research supported in part by the National Science Foundation under Grant MCS-8305603,

12
‘Throughout our discussions it 18 convenient to
treat the function, f(z), to be optimized as the
mean of some random variable r(z), In terms of
the evaluation of an integral by the Monte Carlo
technique, f(z) would be the mean of of the
integrand’s value over the space and r(z) is
simply the set of values of the integrand over the
space The approximation of f(z) by the Monte
Carlo technique proceeds by selecting n random
‘sample from r(z) ‘The mean of the sample serves
as the approximation and to the extent that the
samples are random, the sample mean is
guaranteed by the law of large numbers to
converge to f(z) with increasing n Once f(z) is
approximated, the desired value of the integral
can be approximated by multiplymg the
approximation of f(z) by the volume of the space
There are many approaches to improving the
convergence of the sample mean and. the
confidence in the means for a fixed n [15] We
will not investigate these approaches Here we
will be concerned only with the sample mean’ and
an estimate of our confidence in that mean

 

 

The idea which we are exploring is to use as an
evaluation function in the GA optimization of
A(z), not f(z) itself, but an estimate, e(z), of (2)
obtained by taking n randomly chosen samples
from r(z) It 1s intuitive that e(z) approaches
Jz) for large n, From statistical sampling theory
itis known that if r(2) has standard deviation
(2) then the standard deviation of the sample
mean, 02), 18 given by

(1) of2) = o(2/Vn

 

 

In general o(z) will be unknown It 1s simple,

however, to estimate o(z) from the samples using
the unbiased estimate,

(2) off2) = (6 -el2))'/(n—1)
=
It is clear from equation (1) that reducing the
size of ¢,(z) can be expensive Reducing o fz) by
@ factor of two, for example, requires four umes
as many samples It 1s intuitive that the GA will
require more evaluations to reach a fixed level of
optimization for f(z) when oz) is larger
Concomitantly, it is intuitive that the GA will

113

achieve a less satisfactory level of optimization
for f(z) for a fixed number of evaluations when
¢,(2) 18 larger What is 1s not obvious is which
effect 18 more important here, the increase in the
number of evaluations required or the increase in
the time required per evaluation The following

experiments explore the relative importance of
these two effects

3. A Simple Experiment

As a simple example function we have chosen to
minimize

Kowa) = Zaye?

We imagine that f(z,y,z) 1s the mean of some
distribution which 1s parameterized by z, y and 2,

but instead of actually sampling such a function
to achteve the estimate e(z,y,z), we use

e(z.yz) = fle,

 

where norse

represents a pseudo-random
function chosen to be normally distributed and to
have ero mean The standard deviation,
¢,{2,4,2), of e(z,y,2) 15 in Uhis case equal to that of
the noise function and it is chosen artificially No
actual sampling is done The advantage of this
‘experimental scheme is that we can investigate
the effects of many different distributions and
sample sizes for each o,(z,y,2) we choose without
performing all the experiments

In order to get some idea of the effect of the
dependence of o,(x,y,2) on z, y and z, we perform
two different sets of experiments on f(z,y,2) (a)
o,%y.2) independent of z, y and s, and (b)
ozuz)=d,f(zy.2) The search space is
limited to z, y and z between -512 and +512
digitized to increments of 0.01 The GA
Parameters are the standard ones suggested by
De Jong {5] population size 50, crossover rate
06, mutation 0.001 For the experiments of type
(a) we determine for several values of o, the
number of evaluations necessary to find z, y and
2 such that f(zy,2) falls below a threshold of
005 For the experiments of type (b) we
determine for several values of d, the number of

 
evaluations
threshold

Necessary to achieve the 0.05

The results of 50 runs at each setting are shown
in Figure 2. It as immediately obvious that these
Graphs are approximately linear In fact linear
Tegression analysis produces a correlation
coefficient of 099 in each case The linearity of
these graphs simphifies their analysis considerably.
To see the relative importance of number of
evaluations versus time per evaluation we can
start with the equation for the straight. lines

(3) £, = 1244422680,
(4) £, = 18,020+9285.,

where £, and E, are the number of evaluations
required for case a and b, respectively We
Imagine that the evaluations were obtained by
sampling from a normal distribution whose
standard deviation is @ in case (a) and d/(z,y,2)
in case (b) In that case we can use Equation (1)
for o,(z,y,2) in both Equations (3) and (4) to get

(8) B, = 12444-22680/Vn
(8) , = 18,020+9285./Vn

 

 

These equations give the number of evaluations
Fequired to achieve the threshold as a function of
the number of samples taken per evaluation, but
they do not indicate the total effort required to
achieve the threshold The total time required
for the optimization procedure includes: the time
for the n samples taken at each evaluation and
the overhead incurred by the GA for each
evaluation. Taking these factors into
consideration we arrive at two equations for the
time necessary to achieve the threshold,

(7) t=(a +8 pp)(1244422680/Vn)
_ (8) 4=(a;+8,p)(18,020+9285./Vn)

where @ is the GA overhead per evaluation and
B18 the time per sample These equations allow
us to determine the optimal value for n, i.e, the
value which will minimize the time necessary to
reach the desired threshold in this sample

14

problem It can be seen that for large n each
expression for the time increases linearly with n
Thus, regardless of the relative size of the
‘overhead, the optimal value of n 3s, not
surprisingly, finite As n approaches zero each
expression approaches infinity, but the smallest
Possible value for n 1s one The optimal value of
n for either case can be found by finding the
‘minimum of the appropriate expression subject to
the restriction that m be an integer greater than
zero Further analysis requires some idea of the
size of a/8 Since the results apply only to the
Particular example evaluation function f(z,y,2) a
detailed analysis 1s not worthwhile We simply
note that in the case in which a is negligible, the
optimal value of n is 1, and as a increases the
optimal value will increase ‘Thus, at least for
small overhead the answer to the question:
concerning the relative importance of the number
of evaluations versus the time required for a
given evaluation is clear. The ume required for a
given evaluation 1s more important The
accuracy of the evaluation should be sacrificed in
order to obtain more evaluations Optimization
Proceeds more quickly with many rough
evaluations than with few precise evaluations

4. An Experiment on Image
Registration

‘The preceding simple example has the following
special characteristics (1) the function to be
optimized 1s simple, (2) r(z) has a normal
dietmbution, (3) the standard deviation of r(z) 18
@ known function, These characteristics make it
Possible to do simple experiments which are easy
to analyze In more general problems these
characteristics are nol guaranteed, but they are
not necessary to insure the efficacy of the
statistical approach To demonstrate the method
for practical problems, we describe here out
approach to a problem which has none of these
characteristics. ‘The problem is found in the
registration of digital images The functions
which are optimized in image registration are
measures of the difference between two images of
@ scene, in our case X-ray images of an area of a
human neck, which have been acquired at
different umes The images differ because of

 

 
motion which has taken place between the two
acquisition times, because of the injection of dye
into the arteries, and because of noise in the
Image acquisition process The registration of
such images 15 necessary for the success of the
process known as digital subtraction angiography
in which an image of the interior of an artery 1s
produced by subtracting a pre-injection image
from a post-injection image The details of the
process and the registration technique can be
found in [7] By performing a geometrical
transformation which warps one image relative to
the other it 18 possible to improve the registration
of the images so that the difference which 1s due
to motion 1s reduced. The function parameters
specify the transformation, and it 1s the goal of
the genetic algorithms to find the parameter
values which minimize the image difference.

The general problem of image re
important in such diverse fields as aerial
photography [8,16,17] and medical imaging
[1,7,12,14,18] General introductions to the field
of image registration and extensive bibliographies
may be found in {8,9,11] An image comparison
technique based on random sampling, different
from the method used here, is described in [2]
‘The class of transformations which we consider
includes elastic motion as well as rotation and
translation.

ration 1s

  

‘The transformations which are employed here
are illustrated in Figure 1 Two images are
selected and a square subimage, the region of
interest, 1s specified as image one -- imi A.
geometrically transformed version of that image
18 to be compared to a second image -- im2 The
transformation is specified by means of four
vectors ~ dl, d2, d3, and d4 -- which specify the
motion of the four corners of iml The
transformed image 1s called m3. The motion of
intermediate points 1s determined by means of
bilnear interpolation from the corner points
‘The magnitudes of the horizontal and vertical
components of the d vectors are limited to be less
than one-fourth of the width of the subimage to
avoid the” possibility of folding [8] (More
complicated warpings will require additional

115

 

vectors )

‘The images are represented digitally as square
arrays of numbers representing an approximate
map of image intensity Each such intensity 1s
called a przel ‘The image difference is defined to
be the mean absolute difference between the
pixels at corresponding positions in im? and 1m3
‘The exact mean can be determined by measuring
the absolute difference at each pixel position; an
estimate of the mean may be obtained by
sampling randomly from the population of
absolute pixel differences The effort required to
estimate the mean ts approximately proportional
to the number of samples taken; so, once again,
the question arises as to the relative importance
of number of evaluations used in the GA versus
the time required per evaluation

 

In general, the distribution of pixel differences
for a given image transformation 1s not normal
Its shape will, in fact, depend in an unknown way
fon the geometrical transformation parameters,
and consequently the standard deviation will
change in an unknown way. Thus, while the
experiments on f(2,y,2) suggest that better results
will be realized if less exact evaluations are made
iis nol clear how the level of accuracy should be
set We note that in the analysis of the
experiments on {(z,y,2) fixing the number of
samples, n, has the effect of fixing, either ¢, or

= 2/flzyz), given the assumed forms of
In the image registration case and in the general
case, however, fixing n fixes neither of these
quantities, since the o's behavior cannot in
general be expected to be so simple. We could,
however, fix either of these quantities
approximately by estimating @ using Equation (2)
as samples are taken during an evaluation and
continuing the sampling until n is large enough
such that the estimate of ©, obtamed from
Equation (1) 1s reduced to the desired value
Thus, the results from the previous experiments
suggest three experiments on image registration —
(1) try to determine an optimal fixed n, (2) try to
determine an optimal fixed o, (3) try to
determine an optimal fixed d,. We have
implemented the first idea and a variation of the

 
third idea The variation 1s motivated by noting
from statistical sampling theory that by fixing ),
we are equivalently fixing our confidence in the
accuracy of the sample mean as representative
the actual mean. If, for example, we require that
the sample mean be within (100p)% of the actual
mean with 95% confidence, we should sample
until we determine that X, 1s less than or equal to
P/1 96 [19] If we can fix only an estimate of d,,
as in the general case, then the (100p)% accuracy
at 95% confidence level requires that the estimate
of A, be less than or equal to p/t ,{n) Here t {n)
ts student's t at a confidence level of 100(1-9)%
and a sample size of n[4] ‘This t-test is exact
only if the distribution of the sample mean is
normal In order to assure that the sainple mean
is approximately normal the sample size, n,
should be at least 10 [4] Our variation on fixing
A, 15 to pick a confidence level of 95% (an
arbitrary choice) and then fix p, subject to
n> 10 to determine an optimal p

The experiments to determine an optimal value
of n and p for image registration and im the
general case differ from those described for
F(z,y,2) above in two ways First, because so
Iittle 1s known about the distributions in the
general case, actual sampling 1s necessary
Second, because so litle is known about the mean
which 1s to be optimized (minimized) it 1s difficult
to determine in the general case whether a
threshold has been reached, and therefore the
criterion for halting must be different. We have
considered two alternative halting critena (1)
determing an exact mean, or a highly accurate
estimate of the mean, of the structure whose
estimate is the best at each generation, halting
when that value reaches a threshold, and using as
@ measure of performance the total number of
samples taken, (2) halting after a fixed number of
samples have been taken and using as the
measure of performance the exact evaluation of
the structure whose estimate is the best at the
last generation The first allernative suffers from
the disadvantage that the additional evaluation
at each generation 1s expensive and tends to
offset the savings gained through approximate
evaluation The severity of the disadvantage 1,

116

 

fon the other hand, diminished as the size of the
generation is increased Therefore this method
suggests a new consideration in setting the
number of structures per generation We choose
in this work to avoid the question of the optimal

number of structures by choosing the simpler
alternative, (2)

The results of our experiments on image
registration are shown in Figure 3. The Figure
shows data resulling from 10 runs at each setting,
‘The subimage m1 1s 100 by 100 pixels, giving a
sample space of size 10,000 The motion of the
corners 1s limited to 8 pixels in the x and y
directions In each case the GA 1s halted after
the generation during which the total number of
samples taken exceed 200,000 The parameters
for the transformation comprise the x and y
components of the four d vectors The range for
each of these eight components is [-80, +80]
digitized to eight bit accuracy The GA
Parameters are set to optimize offline
performance, as suggested by [10] population size
80, crossover rate 0 45, mutation rate 0.10

In Figure 3a each GA takes a fixed number of
samples per evaluation It can be seen from the
Figure that the optimal sample size 1s
approximately 10 samples per evaluation
Apparently, taking one sample per evaluation
does not give the GA sufficient information to
carry out an efficient search The fact that
performance deteriorates when we take fewer
than 10 samples may indicate that Uhe underlying
distribution of pixel difference is not in general
normal, and so this appheation does not

correspond to the ideal experiments described in
section 3

In Figure 3b the estimated accuracy interval,
based on the t-test, 1s fixed subject to the
restriction that the sample size be at least 10
(Note that in Figure 3b, a 10% accuracy interval
means thal we are 95% confident that the sample
mean is within 10% of the true mean) These
experiments suggest that the optimal accuracy
interval at 95% confidence is nearly 100%, which
corresponds to taking on the average 10 samples
per evaluation Given that the performance level
is nearly identical whether we take exactly 10
samples per evaluation or we take on the average
10 samples, the first approach 1 preferable, since
it does not require the calculation of the t-test for
each sample

It should be pointed out that, as in the
experiment on f(z,y,2), the GA overhead is
ignored here If the overhead were included, the
‘optimal sample size would be somewhat larger
In any case, it 1s clear that a substantial
advantage is obtained in statistical evaluation by
reducing sampling sizes and accuracies, at least
for this case of image registration

5. Conclusions

GA's search by allocating trials to hyperplanes
based on an estimate of the relative performance
of the hyperplanes One result of this approach is
that the individual structures representing the
hyperplanes need not be evaluated exactly This
‘observation makes GA’s applicable to problems in
which evaluation of candidate solutions can only
be performed through Monte Carlo techniques
‘The present work suggests that in some cases the
overall efficiency of GA’s may be improved by
reducing the time spent on individual evaluations
and anereasing the number of generations
performed

‘This works suggests some topics which deserve
deeper study First, the GA incurs some
overhead in performing operations such as
selection, crossover, and mutation If the GA
runs for many more generations as a result of
performing quicker evaluations, this overhead
may offset the time savings Future studies
should account for this overhead in identifying
the optimal time to be spent on each evaluation
Second, 1t would be interesting to see how using
approximate evaluations effects the usual kinds of
performance metrics, such as online and offline
performance Finally, additional theoretical work
an this area work be helpful, since experimental
results concerning, say, the optimal sample size
can be expected to be highly application
dependent

17

References

1D. G. Barber, "Automatic Alignment of
Radionuclide Images," Phys. Med Brol
Vol. 27(3), pp 387-96 (1982)

2 Daniel I. Barnea and Harvey F. Silverman,
"A Class of Algorithms for Fast Digital
Image Registration," [EEE Trans Comp
Vol 21(2), pp 179-86 (Feb 1972)

3 Chaim Broit, Optrmal Registration of
Deformed Images, Ph D thesis, Computer
and Info. Sei, Univ of Pennsylvania (1981)

4 Chapman and Schaufele, Elementary
Probability Models and _Statistecal
Inference, Xerox College Publ. Co,

Waltham, MA (1970).

5 K A DeJong, Analysts of the behavior of
a class of genetic adaptive systems, Ph

 

D Thesis, Dept Computer and
Communication Sciences, Univ of Michigan
(1975)

6 J Michael Fitzpatrick and Michael
R Leuze, "A class of injective two
dimensional transformations," tobe
published

7 J M Fitspatrick, J J. Grefenstette, and
D Van Gucht, “image registration by
genetic search," Proceedings of IEEE
Southeastcon '84, pp 460-464 (April 1984)

8 Werner Frei, T Shibata, and © © Chen,
"Past Matching of Non-stationary Images
with False Fix Protection," Proc 5th Intl.
Conf Patt Recog Vol 1, pp.208-12, IBEE
Computer Society (Dee 1-4, 1980)

 

9 Ardesir Goshtasby, A Symbolically-
assisted Approach to Digital Image
Registration with Application in Computer
Vision, Ph D. thesis, Computer Science,
Michigan State Univ (1983)

10 J J. Grefenstette, "Optimization of control
parameters for genetic algorithms", to
appear in JEEE Trans Systems, Man,
and Cybernetics (1985)

 
AL

12.

13.

14,

15.

16

17.

18

19

. Ernest

L. Hall, Computer Image
Processing and Recognition Academte
Press, Inc., New York (1979).

KH Hobne and M. Bohm, *The
Processing and Analysis of Radiographic
Image Sequences," Froc 6th Intnl. Conf.
Patt. Recog Vol. 2, pp.884-897,
Computer Society Press (Oct 19-22, 1982).

FL Jai
practic
(1980)

“Monte Carlo theory and
‘Rep. Frog. Phys Vol 48, p73

  

J. H. Kinsey and B.D. Vannelhi, “Applic
of Digit. Image Change Detection to Diagn
and Follow-up of Cancer Involving the
Lungs," Proc Soc. Photo-optical Instrum.
Eng Vol. 70, pp.99-112, Society of Photo-
optical Instr. Eng. (1975)

B Lautrup, "Monte Carlo methods in
theoretical high-energy physics," Comm
ACM Vol 28, p 358 (April 1985)

James J Little, “Automatic Registration of
Landsat MSS Images to Digital Elevation
Models," Proc. Workshop Computer Viston
Representation and Control, pp 178-84
IEEE Computer Science Press (Aug 23-25,
1982)

 

Gerard G Medioni, *Matching Regions in
Aerial Images," Proc. Comp Viston and
Patt Recog., pp.364-65, IEEE Computer
Society Press (June 19-23, 1983)

Michael J Potel and David E Gustafson,
“Motion Correction for Digital Subtraction
Angiography," IEEE Proc Sth An. Conf
Eng in Med Biol. Soc, pp 166-9 (Sept
1983)

Murray R Spiegel, Theory and Problems

of Probability and Statistics, MeGraw-Hill,
New York (1975)

118

 

 

Figure 1a.
Subimage im1 is represented by the smaller

inner square. The arrows represent the four d-
vectors.

 

Figure 1b.

im? is the larger image. im3. is the inner image

formed by transforming iml according to the d-
vectors shown in Fig 1a
 

  

 

4500}
on
2
2
= “
S sooo} a
2 eats
ia -
a
aso0b
a 08 L2 Le zo
SIGHAs
Figure 2a.

EVALUATIONS

i

3

 

Evaluations Until Threshold vs. Absolute Error.

 
 
   

 

o-
”
ae
’
O.4 0.8 L2 2.0
LAMBDAS '
Figure 2b.

Evaluations Until Threshold vs. Relative Esror.

19

 
Nv

t

a
a
i,
ir
a
o
at
a
3
a

.
=
=
z
z
=
ing

FFERENCE
=

t

nv

FINAL AVE, PIXEL D,

 

Be

 

SO 100 150
SAMPLES PER EVALUATION

Figure 3a.
Performance vs. Fixed Sample Size.

25 50 75 100
ACCURACY INTERVAL

Figure 3b.
Performance vs. Accuracy Interval.

120
A connectionist algorithm for genetic search!

David I. Ackley
Department of Computer Science
Carnegie-Mellon University
Pittsburgh, PA 15213

Abstract

An architecture for function maximization is proposed The design is motivated
by genetic principles, but connectionist considerations dominate the implementa-
tion ‘The standard genetic operators do not appear explicitly in the model, and
the description of the model in genetic terms is somewhat intricate, but the imple-
mentation in a connectionist framework is quite compact ‘The learning algorithm
manipulates the gene pool via a symmetric converge /diverge reinforcement opera-
tor Preliminary simulation studies on illustrative functions suggest the model is
at least comparable in performance to a conventional genctic algorithm.

 

1 Overview

A new implementation of a genetic algorithm is presented. The possibility for it was noted
during work on learning evaluation functions for simple games (1) using a variation on a
recently developed connectionist architecture called a Boltzmann Machine [2] ‘The present
work abstracts away from game-playing and focuses on relationships between genetic al-
gorithms and massively parallel, neuron-like architectures.

‘This work takes function maximization as the task. The system obtains information by
supplying inputs to the function and receiving corresponding function values. By-assump-
tion, no additional information about the function is available. Finding the maximum of
‘a complex function possessing an exponential number of possible mputs is a formidable
problem under these conditions. No strategy short of enumerating all possible inputs can
always find the maximum value. Any unchecked point might be higher than those al-
ready examined. Any practical algorithm can only make plausible guesses, based on small
samples of the parameter space and assumptions about how to extrapolate them.

However, the function maximization problem avoids two further complexities faced
by more general formulations First, performing “associative learning” or “categorization”
can be viewed as finding maxima in specified subspaces of the possible input space. Second,
in the most general case, the function may change over tme, spontaneously or in response
to the system’s behavior. There the entire history of the search may affect the current
location of the maximum value.

Section 2 presents the model. For those familiar with genetic algorithms, highlights of
Section 2 are

« Real-valued vectors are used as genotypes instead of bit vectors. Reproduction and
crossover are continuous arithmetic processes, rather than discrete boolean processes.

‘This research 1s supported by the System Development Foundation

121

 
 

  

© The entire population is potentially
is not limited to contiguous port

invalved in each crossover oper:
of genes.

 

Lion, and crossover

 

   

 

# The reproductive potential of genotypes is not determined by comparison to the average
fituess of the population, but by comparison to a threshold. Adjusting the threshold
can induce rapid convergence or diverge an already converged population.

   

 

Section 3 describes simulation studies that have been performed. ‘The model is tested
on functions that are constructed to explore its behavior when faced with various hazards
First a simple convex function space is considered, then larger spaces with local maxima
are tried.

   

 

Section 4 discusses the model with respect to the framework of reproductive plans and
genetic operators developed in {10}. Possible implications for connectionist research are
not extensively developed in this paper.

 

Section 5 concludes the paper.

2 Development

The goal of this research was to satisfy both genctic and conncctionist constraints as
harmoniously as possible. As it turned out, the standard genetic operators appear only
implicitly, as parts of a good description of how the model behaves. On the other hand,
the implementation of the model in connectionist terms is not particularly intuitive. After
sketching a genctic algorithm, this section presents the model via a loose analogy to the
political process of a democratic society. The section concludes by detailing the implemen-
tation of this “election” model and drawing links between the genetic, the political, and
the connectionist descriptions.

2.1 Genetic algorithms. Genetic evolution as a computational technique was proposed
and analyzed by Holland [10). It has been elaborated and refined by a number of re-
searchers, e.g. [3, 4] and applied in various domains, e.g. (13, 6]. In its broadest formula-
tions it is a very general theory; the following description in terms of function maximization
is only one of many possible incarnations.

 

Genetic search can be used to optimize a function over a discrete parameter space,
typically the corners of an n dimensional hypercube, so that any point in the parameter
space can be represented as an n bit vector The technique manipulates a set of such
vectors to record information gained about the function. The pool of bit vectors is called
the population, an individual bit vector in the population is called a genotype, and the bit
values at each position of a genotype are called alleles. The function value of a genotype
is called the genotype’s fitness or figure of merit.

There are two primary operations applied to the population by a genetic algorithm.
Reproduction changes the contents of the population by adding copies of genotypes with
above-average figures of merit. The population is held at a fixed size, so below-average
genotypes are displaced in the process. No new genotypes are introduced, but changing
the distribution this way causes the average fitness of the population to rise toward that
of the most-fit existing genotype.

In addition to this “reproduction according to fitness,” it is necessary to generate
new, untested genotypes and add them to the population, else the population will simply

122
 

converge on the best one it started with. Crossover is the primary means of generat
plausible new genotypes for addition to the population. In a simple implementation of
crossover, two genotypes are selected at random from the population. Since the population
is weighted towards higher-valued genotypes, a random selection will be biased in the same
way The crossover operator takes some of the alleles from one of the “parents” and some
from the other, and combines them to produce a complete genotype. This “ollspring”
is added to the population, displacing some other genotype according to various criteria,
where it has the opportunity to flourish or perish depending on its fitness.

To perform a search for the maximum of a given function, the population is first ini-
tialized to random genotypes, then reproduction and crossover operations are iterated
Eventually some (hopefully maximal valued) genotype will spread throughout the popula-
tion, and the population is said to have “converged.” Once the population has converged to
a single genotype, the reproduction and crossover operators no longer change the makeup
of the population.

One technical issue is central to the development of the proposed model In addition to
reproduction and the crossover operator, most genetic algorithms include a “background”
mutation operator as well. In a typical implementation, the mutation operator provides a
chance for any allele to be changed to another randomly chosen value. Since reproduction
and crossover only redistribute existing alleles, the mutation operator guarantees that every
value in every position of a genotype always has a chance of occuring. If the mutation
rate is too low, possibly critical alleles missing from the initial random distribution (or lost
through displacement) will have only a small chance of getting even one copy (back) into
the population. However, if the probability of a mutation is not low enough, information
that the population has stored about the parameter space will be steadily lost to random
noise. In either of these situations, the performance of the algorithm will suffer.

    

 

 

 

  

2.2 A democratic society metaphor. Envision the democratic political process as a
gargantuan function maximization engine. The political leanings of the voting popula-
tion constitute the system’s store of information about maximizing the nebulous function
of “good government.” An election summarizes the contents of the store by computing
simple sums across the entire population and using the totals to fill each position in the
government. When the winners are known, voters informally express opinions about how
well they think the elected government will fare. The bulk of the time between elections
is spent estimating how well the government actually performs. By the next election, this
evaluation process has altered the contents of the store: better times favor incumbents;
worse times, challengers.

In society, the function being optimized is neither well-defined nor arbitrary, and the
final evaluation of a government must be left to history, but in the abstract realm of
function maximization the true value of a point supplied to any function can be determined
in a single operation. The immediacy and accuracy of this feedback creates an opportunity
for an explicit learning algorithm that would be difficult to formalize in a real democracy.
Credit and blame can be assigned to the voters based on how well their opinions about
the successive governments predict the results produced by the objective function. Voters
that approved of a high-scoring government can be rewarded by giving them more votes,
so their preferences become a bit more influential in the subsequent election. Voters in
such circumstances tend to favor the status quo. Voters whose preferences cause them to

 
 

  

approve of a low-scoring government lose voting power,
take a chance on something new ‘The proposed model
to learning.

become a bit more willing to
built around such an approach

An iteration of the algorithm consists of three phases which will be called “election,”
“reaction,” and “outcome.” The function maximization society is run by an n member
“government” corresponding to the n dimensions of the function being maximized. In
cach election all n “government positions” are contested. There are wo political parties,
“Plus” and “Minus.” A genotype represents a voter’s cturrent party preferences, recording
a signed, real-valued number of votes for each of the positions. Which party wins a position
depends on the net vote total for that position. A government represents a point in the
parameter space, with Plus signifying a 1 and Minus signifying a 0.

 

 

 

After an election is concluded, each voter chooses a reaction to the new government:
“satisfied,” “dissatisfied,” or “apathetic.” The complete state of a voter includes the
weights of its genotype plus its reaction In general, voters whose genotypes match well
with the government—i.e., most (or the most strongly weighted) of the positions have the
same signs as the genotype weights—will be satisfied and therefore share in the credit or
blame for the government's performance. Voters that got about half of their choices are
likely to be apathetic, and therefore are unaffected by any consequent reward or punish-
ment. Voters that got few of their choices are likely to be dissatisfied with the election
results. Dissatisfied voters share in the fate of the government, but with credit and blame
reversed in a particular way discussed below. Satisfied and dissatisfied voters are also
referred to as active, and apathetic voters are also referred to as inactive.

 

In the outcome phase, the performance of the government is tested by supplying the
corresponding point to the objective function and obtaining a function value. This value is
compared to the recent Instory of function values produced by previously elected govern-
ments to obtain a reinforcement signal. A positive result indicates a point scoring better
than usual and vice-versa. The reinforcement signal is used to adjust the preferenées of the
active voters. Positive reinforcement makes the reactions of the population more stable,
and negative reinforcement makes them more likely to change. Finally, the newly ob-
tained function value is incorporated into the history of function values, and an iteration
is complete.

Two points are worth making before considering the actual implementation. The first
point is that there is noise incorporated into both the election and the reaction processes.
If the sum of the vote for a given position is a landslide, the result will essentially always be
as expected, but as the vote total gets closer to zero the probability rises that the winner
of the position will not actually be the party that got the most votes. There are no ties
or runoff elections; if the sum of the vote for a position totals to exactly zero the winner
is chosen completely at random. Voter reactions are also stochastic, based on the net
degree of match over mismatch between each genotype and the elected point. Although
real election systems try to ensure that the winner got the most votes, in the proposed
model this nondeterminisin serves the crucial function of introducing mutation. Moreover,
unlike the constant-probability mutation operator mentioned in the previous section, it is
data dependent. Mutation is very likely in those positions where no consensus arises from
the population, but it will almost never upset a clear favorite.

The second point is that only the currently active voters participate in the election.

124
gore rument

positions

  
 
 
 
  
  
      

ested 1 OL
1 vale v = 877
“Ite 3 95.6 mw tevel 0 = 113

  

voters

 
   

as been passed to tH
1 so the r
fe whether the weights

 sigual ss apphed Ne

 

uch returned 1
{signal 1s negative The
ase (f)- decrease (|). oF re

    
 
       

Satisfied voters vote in the manner described above. Dissatisfied voters vote in a sign-
reversed manner: positive weights vote for Minus and negative weights vote for Plus.
Apathetic voters do not vote at all, but they react to each election and may become
active. Section 4 discusses a genetic interpretation of this strategy.

 

2.3 A connectionist implementation. The ever-increasing demand for computational
power and the continuing desire to understand the human brain has encouraged research
into massively parallel computational architectures that resemble the physiological picture
of the brain more closely than does the standard Von Neumann model. The basic as-
sumption of the connectionist approach (see, e.g., {5] or [7]), is that computation can be
accomplished collectively by large numbers of very siniple processing units that contain
very little storage. The bulk of the memory of the system is located in communication links
between the units, usually in the form of one or a few scalar values per link that control
the link’s properties. In terms of individual units and links, the Perceptron (12] typifies
the kinds of hardware considered: a unit is simple linear threshold device, adopting one of
two numeric output states based on a comparison between the sum of its input links and
its threshold; a link connects two units and contains a scalar variable that 1s multiplied by
the link input to produce the link output.

 

In terms of problem formulations, network organizations, and learning algorithms,
connectionist research has moved in many directions from the Perceptron; the proposed
model uses assumptions most closely related to those employed in {1, 2, 9, 11]. There is not
space to explicitly motivate all of the decision designs of the implementation, but analogies
to the political and genetic descriptions are discussed as they arise. Figure 1 sketches an
instance of the model and defines terminology.

The basic processing element of the model is called a unit. Each unit i has a ternary

125

 
 

-1000-800— 600-400-2000 20040000800 FoM0
AL;

 

 
 
  

Figure 2. A phase
ified by Eq. (2), pl

generate

erate 4

 

Feet ane epee
Tine
and points hetween the lines ge

     
  

100. (AE,

ow the dotted line generate

 
   

 

state variable s; € {+-1,0,—1}. Units communicate their current states to other units via
Jinks. A link between two units i and j has a real-valued weight w,,. All links between
units are bidirectional and have the same weight in both directions, ie. w,j = wy.

In the political analogy, groups of units represent both the government positions and
the voters. In the former case, s; represents the winner of position i, with s; = 1 —» Plus
and ; = ~1—» Minus, Parameters are set so that s, = 0 cannot occur for the position
units. In the latter case, s, represents the reaction of voter 2, with s; = 1— “satisfied,”
5, =0— “apathetic,” and s, = ~1 — “dissatisfied.”

A unit simply retains its current state until it is probed, at which time it checks
the states of the units it is connected to and the weights on those links and applies a
probabilistic decision rule to select a state. The quantity that sums up the current context
of a unit 7 is called “AZ,” and is defined as

 

 

Q)

 

where ranges over all the units in the network and w,, = 0 if units i and j are not
connected. Given AE; and a uniform random variable 0 < ¢ < 1, the decision rule is

1
1+ e-(AE,+a)/T

i. ()
ie < aa
0 otherwise.

+1 if €>

   

The boundaries between the unit states are plotted in Figure 2. The size of the model
parameter T > 0—the “temperature”—determines how sharply the boundaries slope as
AE; moves away from zero; it controls how “noisy” the system is. The model parameter
a > O controls the width of the “apathy window” when the voter units are probed.

In the political analogy, the election and reaction processes are both implemented by
the probe operation. An election is performed by probing each of the position units once.
Since position units connect only to voter units the ordering of the probes is irrelevant,
and the contests for each position can happen in parallel. When applied to a position

 

 

126
unit 1, the summation in Hq (1) totes up the effective vote count for the position. If a
voter unit j is apathetic, then s, - 0 and w,, does not affect the total for the position,
otherwise either wj; or —w,; is included in the total depending on whether the voter is
satisfied or dissatisfied. ‘The wminer of the position is then determined by Eq. (2), applied
with a = 0. As AF, becomes more positive, the likelihood of Plus winning Ue contest
increases, and vice-versa. If one takes the limit as T —> 0, Eq. (2) approaches a step
function corresponding to a deterministic clection based only on the sign of AE).

 

     

 

 

‘The voter reaction is assessed symmetrically, by probing each of the voter units once.
When applied to a voter unit i, the sunuation in Bq. (1) produces a net match score
between an elected government and the voter's preferences. The match score for the voter
increases when the state of position 7 has the same sign as w,, and decreases when the signs
differ. The voter’s reaction is then determined by Eq. (2), with a set as a model parameter.
A large positive AE, indicates a particularly good match between a government and a
voter, and generates a high probability that the voter will be satisfied and adopt s, = 1; a
large negative value indicates a particularly bad natch and strongly suggests s, = —1; and
a near-zero value indicates an ambiguous situation and generates the largest probability of
adopting s, = 0. The assumption of bidirectional links with symmetric weights guarantees
that a voter's behavior during elections and reactions will be consistent. If all of a voter's
preferred candidates are elected, for example, then in the zero temperature limit the voter
cannot be dissatisfied with the government.

 

 

 

In genetic terms, an election can be viewed as part of a generalized crossover operation.
If we imagine one satisfied voter in an otherwise apathetic population, the outcome of a
(sufficiently low temperature) election will be a direct expression of that voter’s genotype:
wherever the weight from the voter to a position is positive Plus will win and vice-versa. If
two voters are satisfied, some mixture of their genotypes will be expressed by the position
units, depending on the relative magnitudes of the weights to the positions where the voters
disagree. This situation bears a close resemblance to the standard crossover operator.
The difference is that standard crossover determines the winners of disputed positions by
a random choice of crossover point, whereas the proposed model exploits accumulated
performance data to bias each decision.” In the general case the crossover operation is
hard to see explicitly, considering the effects of many satisfied voters, the dissatisfied vote,
temperature, and the fact that the crossed-over genotype is not guaranteed admission to
the population.

The next steps in the algorithm are straightforward. The states of the position units
are translated into a binary vector J; the vector is passed to the objective function; a scalar
value v is returned. The function value has no meaning in itself since the possible range
of function values is unknown. A judgment must be made whether the value is “good” or
“bad,” assuming that whatever is deemed good will be made more probable in the future.
The expectation level 9 is used to produce the reinforcement signal

—sta is
Tye0-T

 

(3)

 

 

This statement is too strong if the model using standard crossover also uses inversion, since in
that case the grouping induced by the crossover point docs depend on the past performance of the model,
as recorded by the inversion operator. Section 4 discusses inversion and crossover further.

 
 

 

0. Initialization. Given unknown function {v = f(I)|Ic 2",v © R}. Select
model parameters Create 1 position units and 7m voter units. Link each
position unit to each voter unit. Set all nr link weights w,, = 0. Set all
nm unit states s, = 0. Set 0

1. Blection: Probe each position unit (Eqs. | and 2).

2. Reactios ‘robe each voter unit

3. Outcome:

3.1, Mitness test: Compute v = f(I).
3.2. Discount expectations: Compute r (Fa. 3)
3.3. Apportion credit: Update w,, (Iq. 4
3.4. Adjust expectations: Update 0 (Eq. 5).
4. Iterate: Go to step 1.

    

 

 

 

   

  

 

0 Size of population; number of voters.
‘Temperature of unit decisions.

Apathy window for voter reactions.
Payoff rate.

0 “Temperature” of reinforcement scaling.
1 Time constant for function averaging.

0 Excess expectation.

eoce

 

Figure 3. Algorithin summary and list of model parameters.

 

‘This employs the same basic sigmoid function used in the unit decision rule, but r is
bounded by :t1 and is used as an analog value rather than a probability The’ model
parameter T; scales the sensitivity around 0 = v.5 r is used to update the weights

 

 

Wij ter = Wye + krsys, (4)

where k > Ois the payoff rate. The change to each link weight depends on the product s;8,.
If the voter unit is apathetic the weight does not change, otherwise either kr or —kr 1s
added to the weight, depending if the voter and position units are in the same or different
states.

Ifr is positive, the net effect of this 1s that the AE of satisfied units becomes more pos-
itive and the AE of dissatisfied units becomes more negative, i.e., each active unit becomes
somewhat less likely to change state when probed. Consistency is encouraged; the incum-
bents are more likely to be reelected, the voters are less likely to change their reactions.
When r is negative the reverse happens Inconsistency is encouraged; victory margins
erode, voter reaction becomes more capricious. An updating of weights with positive r is
called “converging on a genotype,” with negative r, “diverging from a genotype.”

In genetic terms, the weight modification procedure both implements reproduction and
completes the implementation of the crossover operator. Only the crossed-over genotype
as expressed in the position units is eligible for reproduction, and then only if r > 0.
Otherwise the network diverges, and that genotype decreases its “degree of existence”
in the population. It 1s displaced, by some amount, but it is not replaced with other

3 ‘The precise form of Eq. (3) docs not appear essential to the model Several variations all searched
effectively, though they displayed different detailed behaviors.

128
 

 

members of the population—the total “voting power” of the population declines a bit
instead. Intuitively speaking, the space vacated by a diverged genotype is filled with noise.

‘The final implementation issue is the computation of the expectation level. A number
of workable ways to manipulate @ have been tried, but the simulations in the next section
all use a simple backward-averaging procedure

D1 = PO + (1~ p)(v +4) (5)

where 0 < p< 1is the “retention rate” governing how quickly 0 responds to changes in v.
Just allowing @ to track v is inadequate, however, for if the network completely converged
there would be no pressure to continue searching for a better value. A positive value for
the model parameter 5 avoids this complacency and ensures that a converged network will
receive more divergence than convergence, and eventually destabilize.

Figure 3 summarizes the algorithm and lists the seven model parameters.

 

3 Behavior
This section describes preliminary simulations of the election model. Most of the objective
functions considered here were explored during the design of the model, rather than being
chosen as independent tests after the design stabilized. The functions were created to
embody interesting characteristics of search spaces in general.

‘All of the simulations described in this paper use the following settings for the model

parameters
m=50 T=10n a=sn k=20

T,=10 p=0.75 6=40

Note that the temperature and the apathy are proportional to the dimensionality of the
given parameter space. For convenience, these are called the “standard” settings, but
significantly faster searching on a function of interest can be produced by fine-tuning the
parameters. The standard settings were chosen because they produce moderately fast
performances across the four selected functions, each tested at four dimensionalities.

‘The simulations count the average number of function evaluations before the model

evaluates the global maximum. Two other algorithms were implemented for comparison.
The first was the following hillclimbing algorithm

   

1, Select a point at random and evaluate it.

2. Evaluate all adjacent points. If no points are higher than the selected point,
go to step 1. Otherwise select the highest adjacent point, and repeat this step.

Iterated hillclimbing is a simple-minded algorithm that requires very little memory.
Its performance provides only a weak bound on the complexity of a parameter space. The
second algorithm was a basic version of Holland’s Ri reproductive plan {10}, using only
simple crossover and mutation. Considering the lack of sophisticated operators in the
implementation, and the author’s inexperience at tuning its parameters, the performance
of the Ri implementation should be taken only as an upper bound on the achievable
performance of a simple genetic algorithm.*

4

‘The Ri model parameter values were selected after a short period of trial and error on the test

129

 

rT
 

3.1 A convex space. Consider the following trivial function: Score 10 points for each
L bit, Return the sum. The global maximum equals 10n and occurs when all bits are
turned on. This “one max” function was tested because it can be searched optimally by _
hillclimbing, and the generality of a genetic search is unnecessary. Figure 4 tabulates the
simulation results for n = 8,12, 16,20. As expected, the hillelimbing algorithm found the
maximum more quickly than did the model, but it is encouraging that on all but the
smallest case the election model comes within a factor of two of hillclimbing’s efficiency on
this convex space. Observations made during the simulations suggest that the relatively
poorer performance of Ri arose primarily from the occasional loss of one or more critical
alleles, producing the occasional very long run. Although increasing the mutation rate
reduced the probability of such anomalies, it produced a costly rise in the length of typical
runs.

 

 

  

 

 

      

 

 

 

 

 

 

 

 

 

‘One max
n 8 | 12 16 20
‘Method Evaluations performed*
Hillclimb 3i_| 82 | 128 | 198
Election 73 (117 | 187 | 302
Holland Ri_| 195 | 674 | 1807 | 4161

 

 

 

 

* Rounded averages over 25 runs.

Figure 4. Comparative simulation results on the “one max” function, In all simulations, the
performance measure ix the number of objective function evalnations perforined before the global
maximum is evaluated.

3.2 A local maximum. Convex function spaces are very easy to search, but spaces of
interest most often have local maxima, or “false peaks” Consider this “two max” function:
Score 10 points for each 1 bit, score ~8 points for each 0 bit, and return the absolute value
of the sum This function has the global maximum when the input is all 1’s, but it also has
a local maximum when the input is all 0’s. Figure 5 summarizes the simulation results.
With this function, a simple hillelimber may get stuck on the local maximum, so multiple
starting points may be required.

 

 

 

 

 

 

 

 

 

 

Two max
n 8 12 | 16 20
‘Method Bualuations performed™
Hillclimb 37_[_97 | 186 | 230
Election 83 | 152 | 194 | 269
Holland R1_| 113 | 340 | 794 [ 1622

 

 

 

* Rounded averages over 25 runs.
Figure 5. Comparative simulation results on the “two max” function.
functions. Using the notation defined in {10}, the values were M = 50, Po = 1, Py = 0, 1Pyy = 0.5, and

¢ = (1/1)925*2/, where n is the dimensionality of the objective function. Constant offsets were added to
the functions where necessary to ensure non-negative function values.

130
 

 

 

Nonetheless, on this function also the hillclimber outperforms the model, although only
by a narrow margin on the larger cases. The sere existence of a local maximum does not
imply that a space will be hard to search by iterated hillclimbing. The regions surrounding
the two maxima of the function have a constant slope of 18 points per step toward the
nearer maximum. The slopes have the same magnitude, so the higher peak must be wider
at its base. With every random starting point, the hillchmber 1s odds on to start in the
“collecting area” of the higher peak, so it continues to perform well.

 

3.3 Fine-grained local maxima. Consider the following “porcupine” function: Score
10 points for cach 1 bit and compute the total. If the number of 1 bits is odd, subtract
15 points from the total. Return the total. Every point that has an even number of 1 bits
is a porcupine “quill,” surrounded on all sides by the porcupine’s “back”—lower valued
points with odd numbers of 1 bits. As the total number of | bits grows, the back slopes
upward; the task is to single out the quill extending above the highest point on the back.

 

 

 

 

 

 

 

Porcupine
n Ls] 2 16 | 20
Method . Evaluations performed*
Hillclimb [145 | 2474 | 41973 =|
Election | 160 | 211 241 | 495
Holland Ri | 163 | 739 | 1296 | 3771 |

 

 

 

 

 

 

* Rounded averages over 25 runs.

Figure 6. Comparative sunulation results on the *poreupme” function,

Unlike the first two functions, the porcupine function presents a tremendously rugged
landscape when one is forced to navigate it by changing one bit at time. Not surprisingly,
hillclimbing fails spectacularly here. Figure 6 displays the results. The landscape acts like
Aypaper, trapping the hillclimber after at most one move, and the resulting long simulation
times reflect the exponential time needed to randomly guess a starting point within a bit
of the global maximum. (The hillclimber was not run with n = 20 for that reason.) On
the other hand, the election model gains less than a factor of two over its performance on
the one max function. The strong global property of the space—the more 1’s the better,
other things being equal—is detected and exploited by both genetic algorithms.5

Although the porcupine function reduced hillclimbing to random combinatoric search,
in a sense it cheated to do so, by exploiting the hillclimber’s extremely myopic view of
possible places to move. A hillclimber that considered changing two bits at a time could
proceed directly to the highest quill. But increasing the working range of a hillclimber
exacts its price in added function evaluations per move, and can be foiled anyway by using
fewer, wider quills (e.g., subtract 25 points unless the number of ones is a multiple of

5 The concept of parity, which determines whether one lands on quill or back, is not detected or
exploited. All three algoritlins continue to try many odd parity points during the search. The general notion
of parity, independent of any particular pattern of bits, cannot be represented m such simple models; the
amport of this demonstration is that the genetie models can make good progress even when there are aspects
of the objective function that, from their point of view, are fundamentally unaccountable.

    

 

131

 
 

three.) Higher peaks may always be just “over the horizon” of an algorithm that searches
fixed distances outward from a single point.

3.4 Broad plateaus. The porcupine function was full of local maxima, but they were
all very small and narrow. A rather different sort of problem occurs when there are large
regions of the space in which all points have the same value, offering no uphill direction.
Consider the following “plateaus” function: Divide the bits into four equal-sized groups.
For each group, if all the bits are 1 score 50 points, if all the bits are 0 score —50 points,
and otherwise score 0 points. Return the sum of the scores for the four groups. In a group,
any pattern that includes both zeros and ones is on a plateau. Between the groups the
bits are completely independent of each other; within a group only the combined states of
all the units has any predictive power. When n = 8 there are only two bits in a group and
the function space is convex, because the sequence 00 ~+ {01,10} — 11 is strictly uphill.
However, since each group grows larger as n increases, this function rapidly becomes very
non-linear and difficult to maximize.

 

 

 

 

 

 

 

 

 

 

 

Plateaus
n 8 | 2 16 | 20
‘Method = Bvaluations performed”
Hillclimb 34 414 [| 2224 [13404
Election | 146 | 392 | 758 | 2304
Holland R1_[{ 228 | 607 | 2223 | 8197

 

 

 

* Rounded averages over 25 runs.

Figure 7. Comparative simulation results on the “plateaus” function.

4 Discussion

‘The proposed model was developed only recently, and has it has not yet been analyzed
or tested extensively. Although it would be premature to interpret the model and simula-
tions in a very broad scope, a few interesting consequences have been uncovered already.
This section touches on a number of relationships between the election model and the
analytic structure of schemata and generalized genetic operators developed by Holland in
Adaptation in Natural and Artificial Systems (ANAS) [10].

Given a population, computational effects related to simple crossover can be achieved
in many ways. For example, disputed positions could be resolved by random choices
between the parents, or by appealing to a third genotype as a tie-breaker. Like simple
crossover, both of these implementations perform the basic task of generating new points
that instantiate many of the same schemata as the parents. An appropriate crossover
mechanism interacts well with the other constraints of the model and the task domain.
For example, the information represented by a DNA molecule is expressed linearly, so the
sequential ordering of the alleles is critical. In these circumstances, the simple cut-and-
swap crossover mechanism is an elegant solution, since it is cheap to implement and it
preferentially promotes contiguous groups of co-adapted alleles.

In an unconstrained function optimization task, as little as possible should be presumed
a priori about how the external function will interpret the alleles. In these circumstances,

132
the sequential bias of the standard crossover mechanism is unwarranted. ANAS proposes
an inversion operator to compensate for it. The inversion operator tags each allele with
its position number in terms of the external function, so the ordering of the genotype can
be permuted to bring co-adapted alleles closer together and therefore shelter them from
simple crossover. However, if two chosen parents do not have their genotypes permuted in
the same way, a simple crossover between them may not produce a complete set of alleles.
ANAS offers two suggestions. If inversion is a rare event, sub-populations with matching
permutations can develop, and crossover can be applied only within such groups. But then
information about the linkages between alleles accumulates only slowly. Alternatively, one
of the parents can be temporarily permuted to match the other parent in order to allow
simple crossover to work, but then half of the accumulated linkage information is ignored
at each crossover.

The proposed model does not use the ordering of the alleles to carry information.
Linkage information is carried in the magnitudes of the genotype weights, in non-obvious
ways involving all three phases and the assumption of symmetric weights. For example,
the defining loci of a discovered critical schema are likely to be represented by relatively
large weights on a genotype, since those weights will receive systematically higher net
reinforcement than the non-critical links. Conversely, relatively large weights to a few
positions cause the designated alleles to behave in a relatively tightly coupled fashion.
In the election phase, large weights increase the chance that the alleles will be expressed
simultaneously and receive reproduction opportunities. In the reaction phase, the same
large weights increase the chance that the voter will be apathetic when the implied schema
is not expressed, since the genotype’s large weights will tend to cancel. Strongly coupled
alleles will be disrupted more slowly over successive outcome phases.

Although it is not discussed in ANAS, subsequent research found it useful to include
a “crowding factor” that affects how genotypes get selected for deletion to make room for
a new offspring [4]. The idea is to prefer displacing genotypes that are similar to the new
one, thus minimizing the loss of schemata. In the proposed model, note the interaction
between the reaction phase and the outcome phase. Only active voters are affected by
weight modification. Since voters tend to be satisfied or dissatisfied when they strongly
match or mismatch the government, and dissatisfied voters invert the sign of the weight
modifications, converging on a genotype preferentially displaces similar existing genotypes.

The representation of genotypes by real-valued vectors instead of bit vectors has
widespread consequences. One major difference concerns the displacement of genotypes as
a result of reproduction or crossover. When a bit vector is displaced from a conventional
population, the information it contained is permanently lost. In contrast, the proposed
reinforcement operator is an invertible function. Between a constant government and a
voter, any sequence of positive and negative reinforcements has the same effect as their
sum. Observations revealed that the election model exploits this property in an unan-
ticipated and useful way. The happenstance election of a surprisingly good government
often leads to a run of reelections and positive reinforcements, occasionally freezing the
network solid for a few iterations, until the expectation level catches up. If one examines
the signs of the genotype weights at such a point and interprets them as boolean vari-
ables, the population often looks nearly converged. But the expectation level soon exceeds
any fixed value, and weaker negative reinforcements begin to cancel out the changes and

 
 

to regenerate the pre-convergent diversity During such times, the government positions
with the smallest victory margins are the first to begin changing, which causes a period
of stochastic local search in an expanding neighborhood around the convergence point. If
farther improvement is discovered, the network will frequently converge on it, but often
the destabilization spreads until the government collapses entirely and a period of wide-
ranging global search ensues. It may be that much of the election model’s edge over the
Rt algorithm on the strict maximization-time performance metric used in this paper arises
from this tendency to hillelimb for a while in promising regions of the parameter space,
without irrevocably converging the population.

5 Conclusion

‘The architectural assumptions of the model—-the unit and link definitions, the decision
rule, and the weight update rule—were first explored for reasons unrelated to genetic algo-
rithms. The assumption of symmetric links between binary (:£1) threshold units was made
by Hopfield [11] because he could prove such networks would spontancously minimize a
particular “energy” function that was easily modifiable by changing link weights. Hopfield
used the modifiable “energy landscape” to implement an associative memory.

Hopfield’s deterministic decision rule was recast into a stochastic form by Hinton &
Sejnowski [8] because they could then employ mathematics from statistical mechanics
to prove such a system would satisfy an asymptotic log-linear relationship between the
probability of a state and the energy of the state. 0/1 binary units were used. They found
a distributed learning algorithm that would provably hillelimb in a global statistical error
measure. They used the system to learn probability distributions.

The weight update rule was investigated by the author because it provided a sim-
ple method of adjusting energies of states based on a reinforcement signal for a back-
propagation credit assignment algorithm [1]. +1 binary units were used. The connectionist
network was used as a modifiable evaluation function for a game-playing program. The
system learned to beat simple but non-trivial opponents at tic-tac-toe. Observations made
during simulations raised the possibility that genetic learning was occurring as the system
evolved. In that work, the government corresponds to the game board, and a voter, in
effect, specifies a sequence of moves and countermoves for an entire game. The model fre-
quently played out variations that looked like crossed-over “hybrid strategies.” The rapid
spread through the units of a discovered winning strategy was suggestive of a reproduction
process.

The research reported here focused on that possibility. The task was simplified to avoid
problems caused by legal move constraints, opposing play, and delayed reinforcement.
Given an appropriate problem statement, the basic election/reaction scheme seemed to
be the simplest approach. Extending the unit state and decision rule to three values
occurred to the author while developing the political analogy. In theory, apathy could
be eliminated, because a unit with a near-zero AE would pick +1 or —1 randomly, so
rewards and punishments irrelevant to that unit’s genotype would cancel out in the long
run. In practice, explicitly representing apathy improves the signal-to-noise ratio of the
reinforcement signal with respect to the genotype. The unit is not forced to take a position
and suffer the consequences when it looks like a “judgment call.” The performance of the
algorithm is generally faster and more consistent, but a percentage of the population is

 

134
 

 

ignored at each election. For the large populations implied by massively parallel models,
it appears to be an attractive space/time trade-off.

 

The connectionist model presented here has a much more sophisticated genetic de-
scription than was anticipated at the outset. Only reproduction, crossover and mutation
were intentionally “designed into” the model. It was a surprise to discover that the model
performed functions reminiscent of other genetic operators such as inversion and crowding
factors. As an emergent property, the model displays both local hillelimbing and global
genetic search, shifting between strategies at sensible times. More experience with the pro-
posed model is needed, but a crossing-over of genetic and connectionist concepts appears
to have produced a viable offspring.

References

[1] Ackley, D.H. Learning evaluation functions in stochastic parallel networks. Carnegie-
Mellon University Department of Computer Science thesis proposal. Pittsburgh, PA:
December 4, 1984.

[2] Ackley, D.H., Hinton, G.E., & Sejnowski, T.J. A learning algorithm for Boltzmann
Machines. Cognitive Science, 1985, 9(i), 147-169.

3] Bethke, A.D. Genetic algorithms as function optimizers. University of Michigan Ph.D.
‘Thesis, Ann Arbor, MI: 1981. .

[4] DeJong, K.A. Analysis of the behavior of a class of genetic algorithms. University of
Michigan Ph.D. Thesis, Ann Arbor, MI: 1975.

{5] Feldman, J., (Ed.) Special issue: Connectionist models and their applications. Cogni-
tive Science, 1985, 9(1).

6] Goldberg, D. Computer aided gas pipeline operation using genetic algorithms and rule
learning. University of Michigan Ph.D. Thesis (Civil engineering), Ann Arbor, MI:
1983.

[7] Hinton, G.E. & Anderson, J.A. Parallel Models of Associative Memory. Hillsdale, NJ:
Erlbaum, 1981.

[8] Hinton, G.E., & Sejnowski, T.J. Optimal perceptual inference. Proceedings of the
JEEE Computer Society Conference of Computer Vision and Pattern Recognition.
June 1983, Washington, DC, 448-453.

(9] Hinton, G.E., Sejnowski, T.J., & Ackley, D.H. Boltzmann Machines: Constraint satis
faction networks that learn. Technical report CMU-CS-84-119, Carnegie-Mellon Uni-
versity, May 1984,

{10] Holland, J.H. Adaptation in Natural and Artificial Systems. University of Michigan
Press, 1975.

(11) Hopfield, J.J. Neural networks and physical systems with emergent collective compu-
tational abilities. Proceedings of the National Academy of Sciences USA, 1982, 79,
2554-2558.

[12] Rosenblatt, F. Principles of neurodynamics: Perceptrons and the theory of brain mech-
anisms. Washington, DC: Spartan, 1961.

[13] Smith, S. A learning system based on genetic algorithms. University of Pittsburgh
Ph.D. Thesis (Computer science). Pittsburgh, PA: 1980.

  

135

 
 

Job Shop Scheduling with Genetic Algorithms

Dr. Lawrence Davis
Bolt Beranek and Newman Inc.

1. INTRODUCTION

The job shop scheduling problem is hard to solve well, for reasons outlined by Mark
Fox et al’, Their chief point is that realistic examples involve constraints that cannot be
represented in a mathematical theory like linear programming. In ISIS, the system that
Fox et al have built, the problem is attacked with the use of multiple levels of abstraction
and progressive constraint relaxation within a frame-based representation system. ISIS is a
deterministic program. however, and faced with a single scheduling problem it will produce
asingle result. Given the vast search space where such unruly problems reside, the chances
of being trapped on an inferior local minimum are good for a deterministic program. In
this paper, techniques are proposed for treating the problem non-deterministically, with
genetic algorithms.

2. JOB SHOP SCHEDULING: THE PROBLEM

A job shop is an organization composed of a number of work stations capable of
performing operations on objects. Job shops accept contracts to produce objects by putting
them through series of operations. for a fee. They prosper when the sequence of operations
required to fill their contracts can be performed at their work centers for less cost than
the contracted amount, and they languish when this is not done. Scheduling the day-to-
day workings of a job shop (specifying which work station is to perform which operations
on which objects from which contracts) is critical in order to maximize profit, for poor
scheduling may cause such problems as work stations standing idle, contract due dates not
being met, or work of unacceptable quality being produced.

The scheduling problem is made more difficult by that fact that factors taken into
account in one’s schedule may change: machines break down, the work force may be
unexpectedly diminished, supplies may be delayed, and so on. A job shop scheduling
system must be able to generate schedules that fill the job shop’s contracts, while keeping
profit levels as high as practicable. The scheduler must also be able to react quickly to
changes in the assumptions its schedules are based on.

In what follows, we shall consider a simple job shop scheduling problem, intended to
be instructive rather than realistic, and show how genetic algorithms can be used to solve
it.

3,_SJS-A SIMPLIFIED JOB SHOP
SJS Enterprises makes widgets and blodgets by contract. There are six work stations

in SJS. Centers 1 and 2 perform the grilling operation on the raw materials that are
delivered to the shop. Centers 3 and 4 perform the filling operation on grilled objects,

136
and centers 5 and 6 perform the final milling operation on filled objects. Widgets and
blodgets go through these three stages when they are manufactured. Thus, the sequence
of processes to turn raw materials into finished objects is this:

RAW MATERIALS - GRILLING - FILLING - MILLING - CUSTOMER.
SJS has collected a number of statistics about its operations. Due to differences in its

machinery and personnel, the expected time for a work station to complete its operation
on an object is as follows, in minutes:

WORK STATION WIDGETS BLODGETS
1 5 15
2 8 20
3 10 12
4 8 15
5 3 6
6 4 8

The cost of running each of the work stations at SJS is as follows, per hour:

WORK STATION IDLE ACTIVE
z 10 70
2 20 60
3 10 70
4 10 70
5 20 80
6 20 100

In addition, SJS has overhead costs of 100 units per hour.
Finally, it requires some time for a work station to change from making widgets to
making blodgets (or vice versa). The change time for each station is:

WORK STATION CHANGE TIME
30
10
20
20
9
18

oonene

4. A SCHEDULING PROBLEM .

Suppose SJS is beginning production with two orders, one for 10 widgets and one
for 10 blodgets How should it be scheduled so as to maximize profits from the point at

 
 

which operations begin, to the point at which both orders are filled? Let us consider three
schedules that address this problem.

In schedule 1, individual work stations are assigned their own contracts. We notice
that the production of blodgets takes longer than the production of widgets, and so we
make widgets with centers 2, 4, and 6, and make blodgets with centers 1, 3, and 5. If the
shop follows this schedule, the various work stations are occupied as follows:

STATION CONTRACT WORKING WAITING HRS-WORKED COST

1 dlodgets 0-150 30 3 210
2 widgets 0-80 40 2 140
3 blodgets 15-162 60 3 210
4 widgets 8-88 40 2 150
5 blodgets 27-168 120 3 240
6 widgets 16-92 80 2 220

In simulating the operation of the job shop under this plan. we note that some work
stations spend a good deal of time waiting for objects to work on. Work stations 5 and 6,
for example. spend from one to two hours waiting because they are faster than the centers
that feed objects to them. It is possible to let them stand idle for the first hour of the day
without delaying the filling of the orders, yielding a second schedule with cost 970, a 17
per cent reduction over the first schedule, achieved by giving these work stations an initial
idle hour. A different way to cut down on the waiting time would be to leave work station
6 idle throughout the day, performing all operations with work station 5 during the second
and third hours of the day. Work station 5 must start work on blodgets when it begins,
switch to widgets later on and finish them, then switch back to making blodgets at the
end. The cost of this schedule is 950, an 18.8 per cent reduction over the direct cost of the
first schedule.

It is interesting to note that a deterministic system would be likely to try one or
the other of the two optimizations on the first schedule, but not both. Each of these
optimizations brings the situation to a local minimum in cost, and advance predictions of
which such optimization will be best appear difficult to make.

 

  

5. AN AMENABLE REPRESENTATION OF THE PROBLEM

If we consider a schedule to be a literal specification of the activity of each work sta-
tion, perhaps of the form “Work station w performs operation o on object x from time
11 to time t2.” then one will be caught in a dilemma if one applies genetic techniques to
this problem. Either one will attempt to use CROSSOVER operations or not. If so, their
use will frequently change a legal schedule into an illegal one, since exchanging such state-
ments between chromosomes will cause operations to be ordered for which the necessary
previous operations have not been performed. As a result, one would acquire the benefits
of CROSSOVER operations at the cost of spending @ good deal of one’s time in a space
of illegal solutions to the problem. If one foregocs CROSSOVER operations, however, one

 

138
 

loses the ability to accelerate the search process, the very feature of the genetic method
that gives it its great power.

There is a solution to this dilemma’. It is to use an intermediary, encoded representa-
tion of schedules that is amenable to crossover operations, while employing a decoder that
always yields legal solutions to the problem. Let us consider the scheme of representations
and decoders that generated the second and third schedules above.

A complete schedule for the job shop was derived from a list of preferences for each
work station, linked to times. A preference list had an initial member~-a time at which
the list went into effect. The rest of the list was made up of some permutation of the
contracts available, plus the elements “wait” and “idle”. The decoding routine for these
representations was a simulation of the job shop’s operations, assuming that at any choice
point in the simulation, a work station would perform the first allowable operation from
its preference list. Thus, if work station 5 had a preference list of the form (60 contract]
contract2 wait idle), and it was minute 60 in the simulation, the simulator looked to see
whether there was an object from contract 1 for the work station to work on. If so, that
was the task the work station was given to perform. If not, the simulator looked to see
whether there was an object from contract 2 to work on. If so, it set the work station to
change status to work on contract 2. noting the elapsed time if contract 1 had been worked
on last, and then set it to work on the new object. If not. the station waited until an object
became available. By moving the “wait” element before contract2, one could cause the
work station to process objects from contract 1 only, never changing over to contract 2.

Representing the problem in this way guarantees that legal schedules will be produced,
for at cach decision point the simulator performs the first legal action contained on a work
station’s list of all available actions. The decoding routine is a projected simulation, and
the evaluation of a schedule is the cost of the work stations. performing the tasks derived
in the simulation. As we shall see, the simulation decoder also provides some information
that will guide operations to perform useful alterations of a schedule.

6. DET/

 

S$ OF OPERATION

The program used a population sized 30, and ran for 20 generations. The problem
was tried 20 times. It converged on variations of Schedule 2 14 times and a variation of
Schedule 3 6 times®. The operations used were derived from those optimizations made by
us as we tried to solve the problem deterministically:

RUN-IDLE: If a work station has been waiting for more than an hour insert a preference
list with IDLE as the second member at the beginning of the day, and move the previous
initial list to time 60. The probability of applying this operation was the percentage of
tine the work station spent waiting, divided by the total time of the simulation.

SCRAMBLE: Scramble the members of a preference list. Probability was 5 per cent
for each list at the beginning of the run, tapered to 1 per cent at the last generation.

CROSSOVER: Exchange preference lists for selected work stations. Probability was 40

 

per cent at the beginning of the run, tapered to 5 per cent at the last generation.

Each member of the initial population associated a list of five preference lists with
each work station. The preference lists were spaced at one-hour intervals, and each was a
random permutation of the legal actions.

 

 
 

The evaluation function summed the costs of simulating the run of the system for five
hours with the schedule encoded by an individual. (Although SJS overhead costs are not
included in the discussion of the three schedules earlier, the evaluation function included.
them.) If, at the end of five hours, the contracts were not filled, 1000 was added to the
Tun costs.

7. CONCLUDING OBSERVATIONS

The example discussed above is much simpler than those one would encounter in
real life, and the range of operations employed here would have to be widely expanded
if a realistic example were approached. In addition, the system here would have to be
extended to handle the sorts of phenomena that the ISIS team has handled: establishing
connections between levels of abstraction, finding useful operations, and building special
constraints into the system, for example.

My belief is that these things could be done if they are successfully done by a deter-
ministic program, for it has been our experience that a quick. powerful way to produce an
genetic system for a large search problem is to examine the workings of a good deterministic
program in that domain. Wherever the deterministic program produces an optimization
of its solution, we include a corresponding operation. Wherever it makes a choice based on
some measurement, we make a random choice, using each option’s measurement to weight
its chances of being selected. The result is a process of mimicry that, if adroitly carried
out, produces a system that will out-perform the deterministic predecessor in the same
environmental niche.

In the case of the schedules produced above, the genetic operators were just those
optimizations of schedules that seemed most beneficial when we attempted to produce
good schedules by hand. The crudeness of the approach stems from our lack of any
fully specified deterministic solution to more realistic scheduling problems. When fuller
descriptions of knowledge-based scheduling routines are available, it will be interesting to
investigate their potential for conversion into genetic scheduling systems.

FOOTNOTES

1 “ISIS: A Constraint-Directed Reasoning Approach to Job Shop Scheduling,” Mark
S. Fon. Bradley P. Allen, Stephen F. Smith, Gary A. Strohm. Carnegie-Mellon University:
Research Report. 1983.

2 The strategy of encoding solutions in an epistatic domain for operation purposes,
while decoding them for evaluation purposes, was worked out and applied to a number of
test cases by a group of researchers at Texas Instruments, Inc. The group included me,
Nichael Cramer, Garr Lystad, Derek Smith, and Vibhu Kalyan.

3. A number of variations in the scheduling that made no difference in the final
evaluation have been omitted in this summary.

140
Compaction of Symbolic Lagout
using Genetic Algorithms

Michael P. Fourman
Dept of Electrical and Electronic Engineering
Brunel University, Uxbridge Middx., UK.
michael Rbruser @ucl-cs.AC.UK

Introduction.

Design may be viewed abstractly as a problem of optimisation in the presence of
constraints. Such problems become interesting once the space of putative
solutions is too large to permit exhaustive search for an optimum, and the payoff
function too complex to permit algorithmic solutions. Evolutionary algorithms
[Holland 1975) provide a means of guiding the search for good solutions. These

algorithms may be viewed as embodying an informal heuristic for problem solving
along the lines of

“To find a better strategy try
variations on what has worked
well in the past.”

Here, a “strategy” is an attempt at a solution. A strategy will generally not
address all the constraints imposed by the problem. The algorithms we are
considering guide the search by comparing strategies. We represent this
comparison by the relation

abeats b

(which will usually be be a partial order, but need not be total). We call strategies
which satisfy all the constraints of the problem “solutions”. In general, solutions
should beat other strategies and, of course, some solutions will beat others.
Abstractly, the algorithms merely search for strategies which are well-placed in
this ordering.

Many problems in silicon design involve intractable optimisation problems, for
example, partitioning, placement, PLA folding and layout compaction. We say a

 
 

Problem is intractable when the combinatorial complexity of the solution space .
for the problem makes exhaustive search impossible, and the varied nature of the
constraints which must be satisfied makes {t unlikely that there is a constructive
algorithmic solution to the problem Automatic solution of such problems requires
efficient search of the solution space. Simulated annealing has been applied to the
first three problems [Kirkpatrick e¢ 2/ 1983], branch and bound techniques have
been applied to layout compaction [Schlag e¢ a/. 1983], In this paper we report on
the application of a genetic algorithm to layout compaction.

The first prototype solved a highly simplified version of the problem. it produced
jayouts of a given family of rectangles under the constraint that no two shall
overlap, with cost given by the area of a bounding box. A more realistic prototype
deals with the layout of a family of rectangular modules with a single level of
interconnect. These prototypes allow the designer to add his ideas to the evolving
population of layouts and thus supplement rather than replace his expertise.

Symbolic Layout.

Acircuit diagram conveys connectivity information:

1 Go
4
gL od
1 9

To manufacture the circuit this must be tranformed to a representation in terms
of layout elements, each layout element must be assigned an absolute mask |
Position. A layout diagram conveys this mask-making information. The passage
from a circuit diagram to a layout may be divided into three stages: firstly the
topology (relative positioning of layout elements) of the layout is designed and
represented by a symbolic layout, then a mask level is assigned to each wire in
the circuit - the design is now represented by a stick diagram, finally the mask
geometry (absolute size and positions) is created, Engineers commonly use these

142
 

 

intermediate notations to represent the intermediate stages in the design process.
Here is a mask layout for our circuit:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

LL

 

 

 

 

 

Here is a symbolic version of this layout:

Fo
at
ql TT
4
1

The corresponding stick diagram is.

 

OMA Ltt

en Ma EAE

 

143

 
 

A symbolic layout is a representation of a circuit design which includes some
Yayout information. The symbolic layout represents a number of design decisions
on the relative placement of circuit elements. A stick diagram may be regarded as
a symbolic layout with a greater variety of symbols.

The procedure leading from a symbolic layout to a mask layout is a form of
compaction. In general, there are many realisations of a given symbolic layout.
The aim of compaction is to produce a layout respecting the constraints implicit
in the symbolic layout while optimising performance and yield. Current compaction
algorithms require the designer to provide a layout as input. Compaction usually
consists of the modification of this layout by sliding elements closer together
while retaining the topology. Clearly, the order in which elements are moved
affects the result. Most algorithms simply compact in each coordinate direction in
turn.

Modern designs are modularised hierarchically The process of symbolic layout and
compaction may occur at any level of this hierarchy. The example we have used for
iNustration above is a leaf cell (a dynamic NMOS shift register cell) from the
bottom level of the hierarchy. Leaf cell layout provides great opportunities for
area reduction and yield enhancement, as these cells are replicated many times
and any small improvements at this level have a magnified effect on the chip.
Optimising leaf celi layout requires awareness of many interacting constraints
and complex cost functions (for example connectivity constraints given by the
circuit design, geometric constraints given by the process design rules, and the
cost functions arising from performance requirements and knowledge of yield
hazards). Because of this, constructive algorithmic solutions to this problem have
not proved efficient. Traditionally, this area of design has been left to human
experts.

We hope to apply genetic algorithms to leaf-cell compaction, and have
implemented two prototypes to explore the applicability of these methods in this
domain.

144,
 

penetic Aloott

Genetic algorithms are applicable to problems whose solution may be arrived at by
a process of successive approximations. This means that we need to be able to
modify strategies in such a way that modifications of good strategies are likely to
be better than randomly chosen strategies. A simple heuristic in this setting
would be to take a strategy, a, and randomly generate a modification , M(a), of it
which may, or may not, be accepted on a probabilistic basis. An algorithm
embodying this idea is simulated annealing [Kirkpatrick e¢ a/ 1983]. The
algorithm procedes by starting with a strategy and repeatedly modifying it in this
way, varying the acceptance procedure according to the value of a variable called
temperature . If M(a) beats a, the modification is accepted. if a beats M(a), the
modification may be accepted (the probability of this increases with temperature
and decreases !f M(a) !s badly beaten). The algorithm is run, starting at a high
temperature which Is gently lowered. This simulates the mechanism whereby a
physical system, gently cooled, tends to reach a low-energy equlibrium position.
Genetic algorithms apply where the strategies have more structure. (In fact, in
Most applications of simulated annealing, this extra structure {s available.)
Strategies are represented as conjunctions of elementary restrictions on the
search space, or decisions . The evolutionary algorithm produces a population of
strategies, rather than a single strategy. The Idea is that by combining some parts
of one good strategy with some parts of another, we are likely to produce a good
strategy. Thus in generating the progeny of a population, we allow not only
Modifications or mutation , but also reproduction which combines part of one
strategy with part of another. The basic step is to take a population and produce a
number of progeny using a combination of mutation and reproduction. The progeny
compete with the older generation, and each other, for the right to reproduce.

If reproduction is to maintain good performance, we need to be able to divide
strategies in such a way that decisions which cooperate are likely to stay
together. This Is accomplished in an indirect and subtle manner. Strategies are
represented as strings of decisions. The child, R(a,b), of a and b is generated by
randomly splitting a and b and joining part of one to part of the other. Thus,
decisions which are close together in the string are likely to stay together. To
allow cooperating decisions to become close together, we include inversions
(which merely choose some substring and reverse it) among the possible

Mutations. These act together with reproduction and selection, to move decisions
which cooperate closer to each other.

145
 

Nothing analogous to the temperature used in simulated annealing appears *
explicitly in the genetic algorithm. The likelihood that a nascent individual will
‘survive to reproduce depends on the degree of competition it experiences from the
rest of the population. As the population adapts, the competition hots up -which
has the same effect as the cooling in the simulation of annealing.

Although genetic algorithms may be seen as a generalisation of simulated
annealing, mutation plays a subsiduary réle to reproduction. The population at any
generation should be viewed as a repository of information summarizing the
results of previous evaluations. Individuals which perform well survive to
reproduce, Reproduction acts to propagate combinations of decisions occuring in
these individuals. The better an individual performs, the longer it will survive and
the more chances it has to reproduce. The relative frequencies with which various
groups of decisions occur in the population record the degree to which they have
been found to work well together. Holland has shown that (under appropriate
statistical assumptions) the effect of the genetic algorithm is to use this
information to effect an optimal allocation of trials to the various combinations
of genes.

yi ic algori ;

The genetic algorithm evolves populations of individuals. In our implementation,
each Individual 1s characterised by a chromosome which Is a string of genes. The
length of chromosomes Is not fixed. New Individuals are produced by a stochastic
mix of the classic genetic operators: crossover, mutation and inversion. Crossover
Picks two Individuals at random from the population, randomly cuts their
chromosomes and splices part of one with part of the other to form a new
chromosome. Mutation picks an individual from the population and, at a randomly
chosen number of points in its chromosome, may delete, create or replace a gene.
Inversion reverses some substring of a randomly selected chromosome.

ASimple Layout Problem.

The layout problem addressed by our first prototype may be thought of as a form of
2-dimenstonal binpacking: A collection of rectangles ts to be placed in the plane
to satisfy certain design rules and minimise some cost function.

146
 

The simplest version of this problem (the one we address) has rectangles of fixed
sizes, the design rule that distinct rectangles should not overlap, and cost given
by area of a bounding box. This version of the problem is already intractable:
Suppose we satisfy the constraint that the distinct rectangles, p,q, should not
overlap, by stipulating that one of the four elementary constraints

p above q

Pp below q

P left_of q
Pright_of q

is satisfied. Then, for a problem with n rectangles, we have N = n? - n pairs
and, a priori , 4" elements in our search space. In fact, this estimate of the size
of the problem is unreasonably large, there are ways of reducing the search space
significantly; for example, "branch and bound” procedures have been used [Schlag
et a/. 1983).

Layout Strategies.

We consider layout strategies which consist of consistent lists of elementary
constraints (as above). Given such a list, the rectangles are placed in the first
quadrant of the plane as close to the origin as is consistent with the list of
elementary constraints. (The procedure which interprets the constraints is very
unintelligent. For example, it interprets ‘p above q' by ensuring that the
y-coordinate of the bottom of p is greater than that of the top of q, even if p is
actually placed far to the right of q (because of other constraints). Any
inconsistent lists of constraints produced by the genetic operators are discarded.

5 , a

Populations of consistent lists of constraints are evolved using various orderings
for selection. When defining a selection criterion, various conflicting factors must
be addressed. For example, our simplest criterion attempts firstly to remove
design-rule violations and then to reduce the area of the layout. Strategies with
fewer violations beat those with more and, for those with the same numberof
violations, strategies with smaller bounding boxes win. This simple prioritising of
concerns led to the generation of some rather unpromising strategies; while the
selection criterion was busy removing design rule violations, for example, any

 
 

strategy with few such violations (compared to the current population norm) was
accepted. Typically, these would have large areas and redundant constraints. The
algorithm would later have to spend time refining these crude attempts. In an
attempt to mitigate this effect, we added a further selection, favouring shorter
chromosomes all other things being equal. Smith has pointed out that
implementations of the genetic algorithm allowing variable length chromosomes
tend to produce ever longer chromosomes (as chromosomes below a certain length
are selected against). We did not find this an overwhelming problem as longer
chromosomes were more likely to be rejected as inconsistent by the evaluation
function. Nevertheless, we did find that the performance of the algorithm was
improved by introducing a selection favouring shorter chromosomes.

‘We also experimented with trade-offs between the various criteria, established by
computing a composite score for each strategy and letting the strategy with the
better score win. We found that the genetic algorithm was remarkably robust in
optimising the various scoring functions we tried. However, the results were often
unexpected; the algorithm would find ways of exploiting the trade-offs provided in
unanticipated ways. We have not yet found a selection criterion of this type which
works uniformly well, over a range of examples. However, by tuning the selection
criterion to the example, good solutions have been obtained.

A better way of combining our various concerns was found. Rather than address the
concerns serially, or try to address all the concerns at once, we select a concern
randomly each time we have a selection to make. A number of predicates for
comparing two individuals were programmed. (For example, comparing areas of
bounding boxes, comparing areas of design rule violations, comparing the areas of
rectangles placed.) Each time we are asked to compare two individuals, we
non-deterministically choose one of these criteria and apply it, ignoring the
others. This works surprisingly well. It is easy to code in new criteria and to
adjust the algorithm by changing the relative frequencies with which the criteria
are chosen. The resulting populations show a greater variability than with
deterministic selection, and alleles which perform well in some respects, but _
would have been selected out with our earlier deterministic approach, are
retained.

Results,

Most of our experiments with this prototype have been based on problems with a
large amount of symmetry, for which it is easy (for us) to enumerate the optimal
solutions. If we actually wanted to solve these problems, other approaches

148
 

exploiting the symmetries available would certainly be more efficient. However,
for the purpose of evaluating the performance of the genetic algorithm, we claim
these examples are not too misleading. The algorithm is not provided with any
knowledge of the symmetries of the problem nor of the arithmetical relationships
between the sizes of the rectangles. For the purposes of evaluating the
applicability of the genetic algorithm to layout compaction, the prototype is
probably pessimistic. Real layout problems are far more constrained (by, for
example, connectivity constraints). This not only reduces the size of the search
space per se, but also appears to localise the interdependence of various genes
making the problem more suitable for the genetic algorithm.

Analve analysis of avery simple example is instructive. The example consists of
six rectangles, three 3 x 1 (horizontal) and three 1 x 3 (vertical). A minimal
solution of this problem was found (consistently) in under 50 generations with 20
progeny per generation (1000 points of the search space evaluated).

A solution to this problem must say how each of these rectangles is constrained,

both horizontally and vertically. Thus the search space has 6!? (about 2 x 10°)
points. The problem has 8 basic solutions and a symmetry group of order 36. There
are about 7.5 x 10° points/solution. Of these, we only examine some 10°.

Representing Layout.

Our first prototype deals with a problem which has little direct practical
significance for VLS! layout. (However, Rob Holte has pointed out that scheduling
Problems from operations research might be represented by minor variations on
our prototype problem.) As a next step towards a practical layout tool, we have
implemented a system which compacts a simple form of symbolic layout. The
Problem is to formalise the constraints implicit in the symbolic layout, and to
find a representation, suitable for the genetic algorithm, for layout strategies

We consider @ symbolic layout of blocks connected by wires. The rectangles
(blocks) are of fixed size and may be translated but not rotated. The
interconnecting lines (wires) are of fixed width but variable length. The
interconnections shown must be maintained, and no others are allowed. In addition,
there are design rules which prohibit unconnected pairs of tiles (wires or blocks)
from being placed too close together.

149

 
 

This form of the symbolic layout problem was introduced by [Schlag e¢ a/. 1983).

Here is their example of a simple symbolic layout:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

We represent the problem at two levels:

A surface level deals with tiles of three kinds - blocks, horizontal
wires and vertical wires. In addition to evolving layout constraints
dealing with the relative positions of tiles (above, right-of etc. as
before), we use a fixed list of structural constraints, to represent
the information in the symbolic layout, and fundamental
constraints which represent the size limitations on tiles.
Structural constraints have the following forms
vcrosses h,Nbv, Sbv,Ebh,Wbh

where v, h are vertical and horizontal wires and b is a block. These
constraints allow us to stipulate which wires cross (and hence are
connected) and which wires connect to which edges (North, South,
East or West) of which blocks.

At a deeper level, unseen by the user, the problem is represented in
terms of the primitive layout elements , north b, south b, east b,
west b, left h, right h, y_posn h, top v, btm v, x_posn v, whose names.
are self-explanatory. For each tile, we generate a list of fundamental

150
constraints expressing the relationship between the primitive layout
elements arising from it. This representation allows both blocks and
wires to stretch.

The example above is represented by declaring the widths of the wires and sizes
of the blocks and then specifying the following list of constraints. (We use a LISP

list syntax as it is more widely familiar, actually, our implementation is written
in ML.):

((E BI H2)
(crosses V3 H2)
(crosses V3 H3)
(crosses V4 H3)
(NB4 V4)
(WBS H3)

(SBI VI)
(crosses V1 H1)
(crosses V2 HI)
(N B2 V2)

(E B2 HS)
(WB3 HS)

(S B4 V6)
(crosses V6 H4)
(crosses VS H4)
(N B3 VS)

(S BS V7)
(N B6 V7))

Again, we evolve lists of layout constraints. These are compiled, together with the
fixed structural and fundamental constraints representing the symbolic layout to
give graphs of constraints on the primitive layout elements, whose positions are
thus determined. The number of design-rule violations and the area of the
| resulting layout are again used to select between rival strategies. Solutions to
this problem were found in around 200 generations of 20 progeny, and this was
reduced to around 150 generations when the algorithm was given a few “hints” in

151

 
 

the form of extra constraints. Watching the evolving populations showed that -
progress was rapid for around SO generations. Thereafter, the algorithm appeared
to get stuck for long periods on local minima (in the sense that one configuration
would dominate the population). This lack of variation in the population reduced
the usefulness of crossover. When mutation led to a promising new configuration,
there would be a period of experimentation leading rapidly to a new local
minimum. This might suggest that either the population size (100) or the
probability of mutation being used as an operator (0.1) is too small. We have not
yet experimented with variations on these parameters. We think that better
solutions would be either to introduce a further element of competition into the
genetic algorithm by penalising configurations which become too numerous
(implementing this is problemaical), or to evolve a number of populations allowing
a limited degree of “intermarriage” (We are currently implementing the latter
approach. If it is successful it will be a good candidate for parallel
implementation.)

Conclusions,

The genetic algorithm may be viewed as a (non-deterministic) machine which is
programmed by supplying it with a selection criterion - an algorithm for
comparing two lists of constraints. We have experimented with various selection
criteria based on combinations of the total intersection area, 1, of overlap
involved in design-rule violations, and the area, A, of a bounding rectangle.
Experiments were made to compare various performance criteria based on
combinations of the number of design-rule violations, and the area of a bounding
rectangle. From our experience with the prototype, It appears that the choice of a
selection criterion is an essential difficulty in applying the genetic algorithm to
layout. The problem is that we must evolve populations of partial solutions
(strategies), while the optimisation task is defined in terms of a cost function
defined on layouts (solutions). To extend a (technology imposed) cost-function, ©
defined on solutions, to the space of strategies, In such a way that the genetic
algorithm will produce a solution (rather than just a high-scoring strategy), Is a
non-trivial task.

We intend to experiment with our second prototype In various ways before going on
to tmplement a “real” system dealing with design-rules for a practical multi-layer
technology. We will continue to experiment with selection criteria and we are

152
implementing the idea of having several weakly interacting populations running in
parallel, described above. We also intend to integrate other, rule-based, methods
with the genetic algorithm, automating the provision of “hints”. Thus, a number of
suggestions for strategies would be generated and passed to the genetic algorithm
which would then explore combinations and variations of these.

Acknowledgements,

! would like to thank Steve Smith for introducing me to Genetic Algorithms, and
Robert Holte for many stimulating discussions, his criticism and encouragement
have been invaluable.

References,

Holland, John H. 1975. Adaptation in natural and artificial systems. Ann Arbor,
University of Michigan Press.

Kirkpatrick, S. C.D. Gelatt, and MP. Vecchi 1983. Optimisation by simulated
annealing. Sc/ence, 1983, 220, 671-680.

Schlag, M., Y.-Z. Liao, and CK. Wong 1983. An algorithm for optimal
two-dimensional compaction of VLSI layouts. /W7EGRATION, the VLSI journal
1 (1983) 179-209.

‘Smith, S.F. 1982. Implementing an adaptive learning system using a genetic
algorithm. Ph.D. thesis. U. of Pittsburgh 1982.

153

 
 

ALLELES, LOCT, AND THE TRAVELING SALESMAN PROBLEM

by

David E. Goldberg

and

Robert Lingle, Jr.

Department of Engineering Mechanics
‘The University of Alabama, University, AL 35486

INTRODUCTION

We start this paper by making several
seemingly not-too-related observations:

1) Simple genetic algorithns work well in

  

problens which can be coded so the
Underlying building blocks (highly fit,
short defining length schemata) lead to
improved performance.

2) There are problens (nore properly,
codings for problens) that are GA-Hard
s-difficult for the normal  reproduc~

tion+crossover+mutation processes of the
simple genetic algorithm,

3) Inversion {s the conventional answer
when genetic algorithnists are asked how
they intend to find good string order-
ings, but {nversion has never done much
in-empirical studies to date.

4) Despite nunerous rumored attempts, che
traveling salesman problem has not
succumbed to genetic algorithn-like
solution.

Our goal in this paper is to show that,
in fact, these observations are closely
related.’ Specifically, we show how our
atti to solve the traveling

   

 

Problen (TSP) with genetic elgorithes have
ied to a new type of crossover operator,
partially-mapped crossover (PMX), which
permits genetic algorithas to search for
better string orderings while still searching
for better allele combinations. The
partially-mapped crossover operator conbines
a mapping operation usually associated with
inversion and subsequent crossover between
non-homologous strings with a swapping
operation that preserves a full gene comple-
nent. The resultant is an operator which
enables both allele and ordering combinations
to be searched with the implicit parallelism
usually reserved for allele combinations in
ore conventional genetic algorithns.

 

 

 

In the remainder, we first examine and
question the conventional notions of gene and
locus. This leads us to consider the rechan-
ics of the partially-mapped crossover opera’
tor (PUX), This discussion is augrented by
the presentation of a sample implenentation
(for ordering-only problems) in Pascal.
Next, we consider the effect of PHX by
extending the normal notion of a schena by

 

 

 

154

introducing the __o-schenata_—_ (ordering
schemata) or locus templates, This leads to
simple counting arguments’ and survival

robabi lity calculations for o-schenata under

MK. These results show that with high
probability, low order o-schenata survive PHK
thus giving us a desirable result: an
operator which searches anong both orderings
and allele combinations that lead to good
fitness. Finally, we demonstrate the effec~
tiveness of this extended genetic algoritha
consisting of reproduction+P™X, by applying
tt to an ordering-only problen, the traveling
salesman problen (TSP). Coding the problen
as an n=permutation with no allele values, we
obtain optinal or very near-optimal results
in a well-known 10 elty problem. Our dis-
cussion concludes by discussing extensions in
problens with both ordering and value consi
dered.

  

   

 

 

‘THE CONVENTIONAL VIEW OF POSITION AND VALUE

In genetic algorithm work ve usually
take a decidely Nendelian view of our arti-
ficial chromosones and consider genes which

xy take on different values (alleles) and
positions (loci). Normally we assune that
alleles decode to our problen parameter set
(phenotype) in @ manner independent of locus:
Furthermore, we assure that our paraneter set
nay then be evaluated by a fitness function
(a non-negative objective function to be
maxinized). Symbolically, the fitness £
depends upon the parameter set x which in
turn depends upon the allele values v or nore
compactly f = £(x(v)). While this is cer
tainly conventional, we need to ask whether
this 1s the most’ general (or even nost
biological) vay to consider this mapping.
More to the point, shouldn't we also consider
the possible effect of a string's ordering o
‘on phenotype outcome and fitness. Mathenati~
cally there seems to be no good reason to
exclude this possibility which we may write

fe (x(0,v)).

   

 

     

 

 

While this generalization of our coding
techniques is attractive because it would
permit us to code ordering problens more
naturally, we must make sure we naintain che
implicit 'parallelisn of the reproductive
plans and genetic operators we apply to the
generalized structures, Furthermore, because
GA's are drawn from biological example we
should be careful to seek natural precedent
before committing ourselves to this

 
extension, To find biological precedent for
the importance of ordering as well as value
we need only consider the sublayer of struc-
ture beneath the chromosome and consider the
amino acid sequences that lead to particular
proteins. At this level, the values (anino
acids) are in no way tagged with neaning.
There are only anino acids and they cust
appear in just the right order to obtain «
useful outcone (a particular protein). Thus,
there is biological example of outcoses that
depend upon both ordering ané value, and we
do not risk the loss of the right flavor by
considering them both.

   

 

Then, wherein Iies our problen? If it
is ok to admit both ordering and value
{information into our fitness evaluation, what
is missing in our current thinking ‘about
genetic algorithms which prevents us from
exploiting both ordering and value inforna~
tion concurrently? In previous work where
ordering was considered at ll (primarily for
its effect on the creation of good, tightly
Linked, building blocks), the only’ ordering
operator considered was inversion, @ unary
operator which picks two points’ along a
single string at random ond inverts the
included substring (1). Subsequent crossover
between non-honologous (differently ordered)
strings occurred by mapping one string's
order to the other, crossing via simple
crossover, and unnapping the offspring. This
procedure’ is well and good for searching
among different allele combinations, but 1
does little to search for better orderings.
Clearly the only operator effecting string
order here is inversion, but the beauty of
genetic algorithms is’ contained in the
Structured, yet randomized information
exchange Of crossover--the combination of
highly fit notions from different strings.
With only a unary operator to search for
better string orderings, we have little hope
of finding the best ordering, or even very
good orderings, in strings of ony substantial
Tength. Just as cutation cannot be expected
to find very good allele schenata in reason-
able tine, inversion cannot be expected to
find good orderings in substantial problens.
What is needed is a binary, crossover-like
operator which exchanges both ordering and
value information anong different strings
In the next section, we present a new opera’
tor which does precisely this. Specifically,
we outline an operator we call partially~
mapped crossover (PMX) that exploits impor-
tant similarities in value and ordering
simultaneously when used with an appropriate
reproductive plan,

 

 

 

 

 

 

PARTIALLY-MAPPED CROSSOVER (PMX) =
MECHANICS.

To exchange ordering and value infor-
mation anong different strings we present a
new genetic operator with the proper flavor.
We ‘call - this operator partially-napped
crossover because a portion of one string
ordering {s mapped to a portion of another
and the renaining information is exchanged

155

after appropriate swapping operations. To
tie down these ideas ve also present a piece
of code used in the computational experiments
to be presented later.

To motivate the partially-napped cro:
over operator (PYX) we will consider differ-
ent orderings only and neglect any value
information carried with the ordering (this
is not a limitation of the method because
allele information can easily be tacked on to
city nane infornation). For example, consider
two permutations of 10 objects:

 

A= 98456713 210
Be@71230 9546

 

PHX proceeds as follows. First, two posi-
tions are chosen along the string uniformly
at random, The substrings defined from che
first number chosen to the second number
chosen are called the MAPPING SECTIONS.
Next, we consider each napping section
separately by mapping the other string to the
napping section through a sequence of swap-
ping operations. For example, if we pick two
random nurbers say 4 and 6, this defines the
two mapping sections, 5-6-7 in string A, and
2-3-10 in string B.' The mapping operation,
say B to A, is performed by swapping first
the 5 and the 2, the 6 and the 3, and the 7
and the 10, resulting in a well defined
offepring. | Simtlarty the napping, and svap-
ping operation of A to B results in the sw
Of the 2 and the 5, the 3 and the 6, ond the
10 and the 7. The resulting two new strings
are as follows:

    

     

Abe 9 84 23201
B= 8101567 9

 

The mechanics of PHX is a bit more complex
than simple crossover so to tie dom the
ideas completely we present a code excerpt
which implements the operator for order-
ing-only structures in Figure 1. In this
code, the string {s treated as a ring and
attention 1s paid to the order of selection
of the two mapping section endpoints.

 

The pover of effect of this operator, as
with sisple crossover, {s much more subtle
than {s suggested by ‘the simplicity of the
string aatching and \ swapping. Clearly,
however, portions of the string ordering ate
being propagated untouched as. we. should
expect." In the next section, we identify the
type of information being exchanged by
introducing the o-schenata (ordering. schena~
ta). We also consider the probability of
survival of particular ovschenata under PMX.

 

PARTIALLY-NAPPED CROSSOVER - POWER OF
EFFECT

In the analysis of a simple genetic
algorithn with  reproductionscrossover+auta’
tion, we consider allele schenata as the
underlying building blocks of future solu
tions. We also consider the effect of the
genetic operators on the survivability of

 

 
Dara Types and Constants

const mixackty = 1001

re

 

city
Nourseray =

Tencetons and Procedure:

function findaci tvici tyawne- mack twt ch tnt

varut
‘Stito:
Ht

 

 

 

 

until tone
finda cs ia )17
and?

procesura ruse.
One tanetct uy

 

(Etnd_etey, oom

 

Procedure crors tours

 

    
    

Sie tour:
var JteMtesttiotenert

besin

i

tou iourtseias
Agurtinws te tourgoldt

 

Flovcrors

Figure 1. Ps

Sheree” and

Jazt tas loncred!

 

soscroze

 

  

tty, cron

   

ae tours touraresy enter

 

hiccrossscttyt

 

214s tourlnatt tour? anaes tourareay ot

S hiceroze + it Af CMterbonectty) then nictesteeit

> maetarts than beste

+ Rach ty tour.

‘a1 Implementation of PMX ~ Partially Mapped

Crossover - procedure cross_tour.

important schemata. In a similar way, in our
current work we consider the o-schenata or
ordering schenata, and calculate the survival
probabilities of important o-schenata under
the PHX operator just discussed. As in the
previous section we will neglect any allele
information which may be carried along to
focus solely on the ordering infornation;
however, we recognize that we can always tack
on the allele information for problens where
it {s needed in the coding.

 

 

To motivate an o-schema consider tuo of
the 10-pernutations:

8 910

ce123%
5 9 810

567
Del235467
As with allele schemata (a-schenata) where we
appended a * (a meta-don't-care symbol) to
our k-nary alphabet to motivate a notation
for the schemata or similarity templates, so

 

156

do we here append a don't care sysbol (the !)
to mean that any of the renaining permuta-
tions will do in the don't care slots. Thus
in our example we have, anong others,
following o-schemata common anong structur
C and D:

 

  

 

 

Seer)

it
7!
at

To consider the nunder of
then with. no positions fixed, 2 ‘position
Fixed, 2 positions fixed, ete., and recognize
that the number of o-schemata with exactly j
positions fixed {s simply the product of the
hhunber of combinations of groups of J azong 2

objects, (4), tines the nuaber of permuta-

tions of groups of J among & objects.
Summing from 0'to & (the ‘string length) we

ovschemata, we take
 

obtain the number of o-schenata:

nw ai ay ay
os ~ LUE THT

While this expression has not been reduced to
closed form, it may be shown for large £ that
the nunber of ovschenata is certainly greater
than (21), Furthermore, it {8 eas{ly shown
that each particular string (permutation) is

a representative of 2" o-schemata and that 2
Population contains at most net o-schemata,

 

 

   

Next we consider the survival probabili-
ty of a particular o-schena under the
partially-mapped crossover operator. The
easiest vay to calculate this is to use
conditional probabilities over three mutually
exclusive events: the o-schena is entirely
contained within the match section (Event
Wewithin), the schena ts entirely outside the
match section (Event O-outside), or the
schena is cut by a cross point (Event C-cut).
Thus, the probabtlity of survival (Event
S-survival) may be given:

 

P(S) = P(S|w)P(H) + P(S|0)P(O) + P(s|c)P(C)

Since the probability of surviving a cut is
very low (P(S|C)20) we ignore this pi
sibility and focus on the other tvo event:
Assuming a cut length k, a defining length of
the schena 6(s), and an o-schena of order
(number of fixed positions) o(s), che overall
probability of survival (for iarge string
length 2) may be estimated:

  

   

 

is) = K$e2 , fake Ba gy Hay

Closer examination of this equation reveals
two nodes of survival. When the cut length {s
large with respect to the defining length,
relatively short defining length. schenat
survive with high probability. The second and
more subtle mode of survival occurs when
short, low order scherata survive, because a
snall’cut length dictates a snall probability
of interruption due to swapping. Together
the two modes combine to pass through short,
low order o-schenata 0 normal reproductive

these building blocks at
optimal rates. Hence, PHX permits the
Sane type of implicit paralielism to occur in
both orderings and alleles as we have already
witnessed using simple crossover on allele
information alone.

 

 

 

     

 

‘A PURE ORDERING PROBLEM - THE TRAVELING
SALESMAN PROBLEM (TSP)

In some sense we've presented this paper
in the reverse order of discovery. ke did
not 1) adnit ordering information, 2) dis-
cover PYX end o-schemata, and '3) apply
reproduction+PX to the traveling salesnan
problen. In fact, by trying to solve the TSP
with genetic algorithns, we were led to
PMX-Like operators, then o-schemata, and

 

157

 

finally POX, The traveling salesnan problen
is a pure ordering problem (2,3,4) where one
attempts to find the optinal tour (ninimun
cost path vhich visits each of n cities
exactly once). The TSP is possibly the most
celebrated coabinatorial optimization problen
of the past three decades, and despite
numerous exact (impractical) "and heuristic
(inexact) pethods already discovered, the
pethod remains an active research area in its
own right, partially because the problem is
part of a'class of problens considered to be
NP-complete for which no polynontal tine
solution {s believed to exist. Our interest
in the TSP sprung mainly fron a concern over
claims of genetic algorithm robustness. If
GA's are ‘robust, why have the rumored
attempts at "solving" the TSP with CA's
failed. This concern led us to consider many
schenes for coding the ordering inforaation,
with strange codes, penalty functions, and
the Like, but none of these had the appropri-
ate flavor--the bullding blocks didn't seen
right. This led us to consider the current
scheme, which does have appropriate butlding
Docks, and as we shall soon see, does (in
fone problem) lead to optinal or near-optinal
results.

 

 

 

 

 

The specific problen ve consider is Karg
‘and Thompson's well-studied 10 city problen
(4). While a 10 city problen {s no final
touchstone of success, it does contain 9
alternatives (the GA ‘knows nothing of the
problen's symmetry which reduces this number
to (91)/2), We code the problem as @ normal~
ized (city'l in the first position) 10-permu~
tation and apply reproduction and PAX to
successive populations. We use roulette
wheel reproduction with’ selection probabili-
tles set in the normal way, and fitnesses are
created from costs and scaled by subtracting
string cost fron population maximum cost,
£1" Cua ~ y+ We choose initial popula
popsize=200, at random, This nuaber
lected to obtain a rich spread of order
2 o-schenata in the population. This re-
quires a population size proportional to
n(n-1) or roughly n®, Tt might be useful to
have order 3 schenata as well, but this may
require larger populations than we are used
to working with.

 

 

    

We present the results of two runs on
the 10 ity problen in Figures 2 and 3.
Figure 2 shows the population average cost
with each successive generation. The cross~
over probability was set at 0.6 60 each
generation represents roughly 120 new func~
Elon evaluations (0.64200). Figure 3 shows
the population best results with successive

erations. As ve can see, run 1 reaches
the optimal (1!) result rather quickly, while
run 2 converges on a very near-optinal tour
(we only ran twenty generations--there was
still enough diversity left so inproverent
was possible in run 2). The best of run 1
was indeed the Karg and Thompson optinun,
tour 1-2-3-4-5-10-9-6-6-7 with cost=378, The
best of run 2 was a near-optimun, the tour
1-2-3-10-9-5-H-6-8-7 with coste381. We are

 

 

 

 

 

   
600.0

 

 

 

 

 

 

&
8 550.0,
iS ° °RUN 1
g
= 500.0 —— pun 2
=
3 450.0
S
.
Q
=
2 400.0
G
8

350.0

0 5 10 15 20
GENERATION
Figure 2, Generation Average Cost vs. Generation for 10 City 15?
currently working on a 20 city problem and a The new operator is tested in an

33 etty problem, although we need to do sone
Teprogranming to fit the large population
sizes into our TBM PC's, We also have built
in an inversion operator, but have not had a
chance to test its effect on average and best
results.

CONCLUSIONS

In this paper ve have examined a nev
type of crossover operator, partially-napped
crossover (PMX), for the exploration of
codings where ordering and allele information
nay directly or indirectly effect fitness
values. The rechanics of the operator have
en described, and an orderingronly {nple-
mentation has been presented in Pascal. The
power of effect of the new operator has been
analyzed using an extension to the concept of
called the o-schenata (ordering
Simple counting argusents have
been put forvard which show che vast anount
of information contained in the o-schenata,
and survival probabilities have been estina~
ted for o-schenata under the PHX operator.
The result is an operation which preserves
ordering butlding blocks (and allele building
blocks if they are attached) so orderings and
allele combinations aay be explored” with
implicit parallelism,

 

    

  

 

  

 

 

158

ordering-only problem, the traveling salesman
problen, Using reproduction+PHX in two runs,
optinal’ or very near optinal results are
found in a well-known 10 city problen after
exploring a small portion of the tour search
space. We are continuing our work by testing
the method in larger problens, but we are
encouraged with the GA-like’ performance
obtained on our first test.

This work has important inplteations for
inproving pore general GA-search in problens
where both allele combinations and ordering
information are. inportant. The binary
operation of PHX does permit’ the randonized,
yet structured, informacion exchange anong
doth “alleles an¢ ordering building blocks
which sinple crossover pronotes anong allele
schenata alone. This should assist us in our
efforts to successfully apply genetic algori-
thns to ever nore complex problens.

 

  

 

REFERENCES

  

1. Holland, J. Hy

and_Artifictal Syste
chigan . 5
450.0

 

 

 

 

=
&
8
2 o——* RUN 4
z
o
5 *—* BUN 2
=
5 400.0
a
8
7
w
S
ne
&
a
a
350 0
0 5 10 15 20
GENERATION

Figure 3. Best-of-Generation Cost for 10 City TSP

2 and G. L, Neuhauser, "Th
in Problem: A Survey,’
eration Research, vol. 16, May-June

1968, pp. 538-558,

3. Parker, R. G. and R. L. Rardin, "The
Traveling ‘Salesnan Probien: An Update
of Research," Naval Research Logistics
Quarterly, vol. 30, 1983, pp. 69-96.

    

 

 

4. Kare) R. L, and G. 1. ‘Thompson, "A
} Heuristic Approach to Soiving Travelling
| Salesman Problens,”" Managenent Science,
j woke 20) = 2) January” 2964, pp:
i meee
|

| 159

 
 

Genetic Algorithms for
the Traveling Salesman Problem

John Grefenstette!, Rayeev Gopal,
Brian Rosmaita, Dirk Van Gucht

Computer Science Department
Vanderbilt University

Abstract

This paper presents some approaches to the
application of Genetic Algonthms to the
Traveling Salesman Problem. A number of
representation issues are discussed along with
several recombination operators Some
prelimmary analysis of the Adjacency List
representation is presented, as well as some
promising experimental results.

  

1. Introduction

Genetic Algorithms (GA's) have been applied to
a variety of function optimization problems, and
have been shown to be highly effective in
searching large, complex response surfaces even in
the presence of difficulties such as high-
dimensionality, multimodality, discontinuity and
noise [4]. However, GA’s have not been applied
extensively to combinatorial problems. The
major obstacle is in finding an appropriate
representation. This paper presents some
approaches to the design of GA's for a well
known combinatorial optimization problem -- the
‘Traveling Salesman Problem (TSP) The TSP 1s
easily stated: Given a complete graph with N
nodes, find the shortest Hamiltonian path
through the graph. (In this paper, we will assume
Euclidean distances between nodes) The TSP 1s
NP-Hard, which probably means that any
algorithm which computes an exact solution of
the TSP requires an amount of computation time
which 1 exponential in N, the size of the problem
[5] In addition to many important
applications, the TSP 1s often used to illustrate
heunistic search methods {2,7,8], so it is natural to
investigate the use of GA's for this problem.

 

 

Choosing an appropriate representation 1s the
first step in applying GA’s to any optimization
problem If the problem mvolves searching an N-

 

dimensional space, the representation problem is
often solved by allocating a sufficient number of
bits to each dimension to achieve the desired
accuracy For the TSP, the search space is a
space of permutations and the representation
problem is more complex Consider a path
representation in which a tour 1s represented by a
Iist of cities (abcde f) The first problem is
that the representation is not unique each tour
has N representations. This can be solved by
fixing the initial city. Another problem is that
the crossover operator does not generally yield
offspring which are legal tours. For example,
suppose we cross tours (ab c de) and (a de cb)
between the third and fourth cities We get as
offspring (a b cc b) and (ade de), neither of
which are legal tours Finally, there 1s a problem
im applying the hyperplane analysis of GA’s to
this representation. The definition of a
hyperplane 18 unclear in this representation. For
example, (a # # # #) appears to be a first order
hyperplane, but st contains the entire space. The
problem 1s that in this representation, the
semantics of an allele in a given position depends
on the surrounding alleles. Intuitively, we hope
that GA’s will tend to construct good solutions
by identifying good building blocks and
eventually combining these to get larger building
blocks For the TSP, the basic building blocks
are edges, Larger building blocks correspond to
larger subtours ‘The path representation does
not lend itself to the description of edges and

longer subtours in ways which are useful to the
GA.

  

In section 2, we present two representations
which offer some improvements over the path
representation. Section 3 discusses the design of
a heuristic recombination operator for what we
consider to be the most promising representation
In section 4, some preliminary experimental

Research supported in part by the National Science Foundation under Grant MCS-8305603

160
 

results are described for the TSP.
discusses some future directions

Section 5

2. Representations for TSP

 

2.1, Ordinal Representation

In the ordinal representation, a tour is described
by a list of N integers in which the ith element
can range from 1 to (N-it1) Given a path
representation of a tour, we can construct the
ordinal representation TourList as follows’ Let
FreeList be an ordered list of the cities. For each
city in the tour, append the position of that city
in the FreeList to the TourList and delete that,
city from the FreeList. For example, the path
tour (a c e db) corresponds to an ordinal tour
(12.321) as shown:

  

   

   

Tourl,
0 (abede)
(1) (bede)
(12) (ode)
(123) (b d)
(12.32) (b)
(12321) 0

Note that it 1s necessary to fix the starting city
to avoid multiple representation of tours.

A sumilar procedure provides a mapping from
the ordinal representation back to the path
representation. In fact, the mapping between the
two representations is one-to-one.

The primary advantage of the ordinal
representation is that the classical crossover
operator may be freely applied to the ordinal
representation and will always produce the
ordinal representation of a legal tour However,
the results of crossover may not bear much
relation to the parents when translated to the
path representation For example, consider the
following two tours:

 

ordi fours path tours
(12321) faced)

161

(24111) (beacd)

Suppose that we cross the ordinal tours between

the second and third positions. We get the
following tours as offspring:
ordinal tours path tours
(12111) facbde)
(24321) (bedca)

The subtours corresponding to the genes in the
ordinal tours to the left of the crossover point do
not change. However, the subtours corresponding,
to genes to the right of the crossover points are
disrupted in a fairly random way Furthermore,
the closer the crossover point is to the front of
the tour, the greater the disruption of subtours in
the offspring

As predicted by the above consideration of
subtour disruptions, experimental results using
the ordinal representation have been generally
poor. In most cases, 2 GA using the ordinal
representation does no better than random search
on the TSP.

2.2, Adjacency Representation

In the adjacency representation, a tour is
described by a list of cities. There is an edge in
the tour from city i to city j iff the allele in
position 1 is j For example, the path tour
(1 3 5 4 2) corresponds to the adjacency tour

  

(3.1524) Note that any tour has exacily one

adjacency list representation

2.2.1. Crossover Operators

Unlike the ordinal representation, the adjacency
representation does not allow the classical
crossover operator. Several modified crossover
operators can be defined

Alternating Edges

Using the alternating edges operator, an
offspring 1s constructed from two parent tours as
follows. First choose an edge at random from one
parent Then extend the partial tour by choosing
the appropriate edge from the other parent

 

 
Continue extending the tour by choosing edges
from alternating parents If the parent's edge
would introduce a cycle into a partial tour, then
extend the partial tour by a random edge which
does not introduce a cycle. Continue until a
complete tour is constructed

For example, suppose we have

mom
dad =

234561)
251643)

 

‘Then we might get the following offspring.
kid =(254163)

where the only random edge introduced into the
offspring is the edge (4 1) All other edges were
inherited by alternately choosing edges from
parents, starting with the edge (1 2) from mom.

Experimental results with the alternating edges
operator have been uniformly discouraging. The
obvious explanation seems to be that good
subtours are often disrupted by the crossover
‘operator Ideally, an operator ought to promote
the development of coadapted alleles, or in the
TSP, longer and longer high performance
subtours. The next operator was motivated by
the desire to preserve longer parental subtours

Subtour Chunks

Using the subtour chunking operator, an
offspring 1s constructed from two parent tours as
follows First choose a subtour of random length
from one parent. ‘Then extend the partial tour
by choosing a subtour of random length from the
other parent. Continue extending the tour by
choosing subtours from alternating parents.
Dunng the selection of a subtour from a parent,
if the parent’s edge would introduce a cycle into
partial tour, then extend the partial tour by a
random edge which does not introduce a cycle.
Continue until a complete tour 1s constructed

 

Subtour chunking performed better than
alternating edges, as expected, but the absolute
performance was still unimpressive. An analysis
of the allocation of trials to hyperplanes provide a
partial explanation for the poor performance of

162

 

this operator

2.2.2, Hyperplane Analysis .
The primary advantage of the adjacency
representation is that it permits the kind of
hyperplane analysis which has been applied to the
Nedimensional function optimieation GA.
paradigm (1,3,6] Hyperplanes defined in terms of
a single defining position correspond to the
natural building blocks, ie, edges, for the TSP
problem. For example, the hyperplane
(# # # 2 #) is the set of all permutations in
which the edge (4 2) occurs. We briefly
summarize the main points of the classical
hyperplane analysis of GA's. In the absence of
recombination operators, selection of structures
for reproduction in proportion to the structure's
observed relative performance allocates trials to
all represented hyperplanes in the population
(roughly) according to the following formul

 

 

 

M(H,t+1) = M(H)" uf) / u(P,.))

where

M(H,t) = # of representatives of H at time t
u(H,t) = observed performance of H at time t

u(P,t) = mean performance of population at
time t.

‘The elements of any hyperplane partition
compete against the other elements of that
partition, with the better performing elements
eventually propagating through the population.
This in turn leads to a reduction in the
dimensionality of the search space, and the

construction of larger high performance building
blocks.

In the adjacency representation, a first order
hyperplane partition consists of all of the
hyperplanes which are defined on the same
position For example

(HALA), (HHH #) (HH HH),
(## #5 #))

is a first order hyperplane partition. Each
element of the partition contains an equal
number of tours. Selection is supposed to
distinguish among the elements of this partition
and to favor the high performance hyperplanes.
However, the following theorem shows that
selection has very little information on which to
allocate trials to competing first order
hyperplanes

Theorem 1 Suppose that H,, and H,. are
two first-order hyperplanes defined by the edg
(a b) and (a c), respectively, in a Euclidean TSP.
‘Then | u(H,,)- u(H,,) | < 4(ab + ac) where ab
and ac represent the lengths of the edges (a b)
and (ac), respectively.

 

 

Proof. We show that there is a one-to-one
mapping f between the tours in H,, and the tours

H,, such that if x is a tour in Hy, and y = f(x) 9
the corresponding tour in H,_, then

| Length(y) - Length(x) | < 4{ab+ac).
‘The theorem follows directly.

‘The followin

 

justrates the mapping f-

 

‘That 1s, y is obtained by exchanging the nodes
b and c in the tour x Using the triangle
inequality, it 1s easy to show that.

-(4ab + 2ac) < Distance(y) - Distance(x)
S (dac + 2ab)

163

So

| Distance(y) - Distance(x) | <  d{ab+ac)
QED.

In practice, the observed difference between
competing first order hyperplanes is usually an
order of magnitude less than the bounds in the
theorem. And since the overall tour length is
generally very large compared to the bound in
the theorem, there 1s generally no significant
difference between the mean relative performance
‘of any two competing first order hyperplanes
Our experimental studies have shown that the
difference in the observed performance of
competing first order hyperplanes in a TSP of
size 20 is generally less than 5% of the mean
population tour length In larger problems, this
difference can be expected to rapidly approach
ero

One might suspect that the TSP is not a
suitable problem for GA's, that the TSP
some sense GA-Hard Bethke(1) characterizes
some problems for which GA’s are unsuitable
Informally, Bethke shows that there are functions
and representations for which the low order
hyperplanes can mislead the GA into allocating
trials to suboptimal areas of the search space
However, Bethke’s techniques, which involve the
Walsh transform of the objective function, apply
to one-dimensional functions of a real variable
‘using a fixed-point representation A similar set
of results may be derivable for combinatorial
problems using the adjacency representation But
‘Theorem 1 does not indicate that the information
an the first order hyperplanes of the adjacency
representation 18 misleading, just that 1t is buried
In other words, measuring the fitness of a tour by
the tour length may be too crude a measure for
apportioning credit. We now describe a crossover
operator which performs a secondary
apportionment of credit at the level of individual
alleles

 

 

3. Heuristic Crossover

Theorem 1 shows that selection alone may not
be able to properly allocate trials to first order
hyperplanes, given our adjacency representation
for the TSP The heuristic crossover operator
attempts to perform a secondary apportionment

 
of credit at the allele level. This operator
constructs an offspring from two parent tours as
follows: Pick a random city as the starting point
for the child’s tour. Compare the two edges
leaving the starting city in the parents and
choose the shorter edge. Continue to extend the
partial tour by choosing the shorter of the two
edges in the parents which extend the tour If
the shorter parental edge would introduce a cycle
into the partial tour, then extend the tour by a
random edge. Continue until a complete tour is
generated.

 

In order to compare this operator with the
Previous two recombination operators, 1000
random pairs of parents were chosen for a TSP of
site 20. For each pair of parents, an offspring
was constructed according to each of the
crossover operators. For all three operators, the
offspring generally inherited about 30% of the
edges from each parent. The remaining 40%
were random edges introduced by the
recombination operator to create a legal tour
For the first two operators, the offspring
generally show no improvement in overall tour
length when compared to the better parent Not
surprisingly, the heuristic crossover produces
offspring which are, on average, about 10%
better than the better parent. It seems
reasonable that such an improvement should give
selection a way to promote the propagation of
good edges through the population. The next
section shows some experimental results which
confirm this expectation.

 

It is important to note that, with the proper
choice of data structures, the heuristic crossover
operator can be implemented to run as a linear
function of the length of the structures {9] ‘This
imphes that, if E is the number of trials and N is
the number of cities, our GA’s for the TSP run
with asymptotic complexity O(EN), the same as
pure random search

 

4. Experimental Results

This section describes some experiments with
the adjacency representation and the heuristic
crossover operator For each experiment, N cities
were randomly placed in a square Euclidean
space. The initial population consisted of
randomly generated tours The selection method

164

 

was based on the expected value model The
crossover rate was set at 50%, and there was no
explicit mutation operator. .

Figure 1 shows the results of a 50 city problem,
Figure 2 shows a 100 city problem and Figure 3
shows a 200 city problem. Each Figure shows a
representative tour from the initial population,
the best tour obtained part way through the
search, and the best tour obtained after the entire
search, along with a randomly selected tour in the
final population. It can be seen, especially in
Figues 2 and 3, that good subtours tend to
survive and to propagate. The figures also show
that there is still a good deal of diversity in the
final population

Statistical techniques [2] allow us to estimate
that the expected length of an optimal tour for
experiment 1 is approximately 37 45 The
optimal tour obtained by the GA differs from this
expected optimum by about 25% After an equal
number of trials, random search produces a best
tour of length 148.6, nearly 300% longer than the
optimal tour. The optimal tour obtained in
experiment 2 differs from the expected optimum
by 16% The optimal tour obtained in
experiment 3 differs from the expected optimum
by about 27%, ‘These results are encouraging and
suggest that further investigation of this
approach is warranted

 

Experiments show that GA’s which use heuristic
crossover but not selection perform better than
random search but significantly worse than GA’s
which use both selection and heuristic crossover
That 1s, there appears to be a symbiotic
relationship between the two level of credit
assignment performed by selection and heuristic
crossover We are currently working on
clarifying the relationship between selection and
the heuristic crossover operator

 

 

5. Future Directions

This papers presents some _ preliminary
observations and experiments Many more
questions about the TSP need to be investigated
Some interesting future projects include

Combining GA’s with other heuristics In
may be useful to heuristically choose the initial

 
population of tours. For example, the nearest
neighbor algorithm can generate a set of
relatively good tours when started from various
iitial cities For very large problems, nearest
neighbor can be approximated by choosing a
random set of cities and taking the one closest to
the current city. Heuristics could also be invoked
at the end of the GA to do some local
modifications to the tours in the final population
For example, the Figures shows many
opportunities for improving the final tour by
some local edge reversals.

Comparison with simulated annealing.
Simulated annealing is another randomized
heuristic algorithm which has applied to
very large (N > 1000) TSP’s. From the
published Iterature on simulated annealing (2,7,
it appears that our resulls are at least
competitive. A careful comparison of these two
techniques would be very interesting.

  

 

Effects of GA parameters. There are several
control parameters involved in any GA
implementation, such as population site,
crossover rate, etc. which may have an effect on
the performance of the system The proposed
GA's are sufficiently different from previous GA's
that it might be useful to investigate the effects
of these parameters for the TSP.

  

Other combinatorial appltcations. How do the
ideas developed thus far apply to combinatorial
problems other than the TSP?

References

1 A. D. Bethke, Genetic algortthms as
function optimizers, Ph D Thesis,
Dept. Computer and Communication
Sciences, Univ. of Michigan (1981).

2. E, Bonomi and J-L Luton, "The N-
city traveling salesman problem
statistical mechanics and the
Metropolis Algorithm," SIAM Review
Vol 26(4), pp. 551-569 (Oct 1984)

 

 

3K A. Dejong, Analysis of the
behavior of a class of genetic
adaptive systems, Ph D_ Thesis,
Dept Computer and Communication

Sciences, Univ. of Michigan (1975)

K A. Dejong, “Adaptive system
design’ a genetic approach, IEEE
Trans. Syst, Man, and Cyber. Vol.
SMC-10(9), pp. 556-574 (Sept 1980)

M. R. Garey and D S$ Johnson,
Computers and Intractability,
WH. Freeman Co, San Fransisco
(1979)

. J. H. Holland, Adaptation in Natural

and Artifictal Systems, Uni. of
Michigan Press, Ann Arbor (1975).

8. Kirkpatrick, C. D, Gelatt, and

M. P. Vecchi, "Optimization by
simulated annealing," Science Vol.
‘220(4598), pp. 671-680 (May 1983).

 

. J. Pearl, Heuristics, Addison-Wesley,

Menlo Park (1984)

B, J. Rosmaita, Exodus. An extension
of the the genetic algorithm to
problems dealing with permutations,
MS Thesis, Computer Science
Department, Vanderbilt University
(Aug 1985)

 
 

 

 

FIGRE Te FIGRE Ib

 

 

 

 

 

 

58 CITIES 58 CITIES
DISTANCE = 197.82 DISTANCE = 64.76
INITIAL POPULATION . GENERATION 38 1969 TRIALS
FIGE Te FIGRE 1é
58 CITIES 58 CITIES
DISTANCE = 68.32 DISTANCE = 46.84
FINAL POPULATION . GENERATION 234 14686 TRIALS

Figure 1.

166
 

 

 

FIGRE 2a
188 CITIES
DISTANCE = 547.12
INITIAL POPULATION

FIGRE 26
108 CITIES
DISTANCE = 118.47
GENERATION 125 6296 TRIALS

 

 

 

 

 

FIGHRE 2e
108 CITIES
DISTANCE = 99.84
FINAL POPULATION

Figure

167

 

FIGRE 24
108 CITIES
DISTANCE = 87.21
GENERATION 487 28338 TRIALS

 
 

 

 

 

 

 

 

 

TIGRE 3a

208 CITIES
DISTANCE = 1475.68
INITIAL POPULATION

FIGRE 36
208 CITIES
DISTANCE = 223.81

 

 

 

 

 

 

 

GENERATION 227 11373 TRIALS
FIGRE ae FIGRE 34
288 CITIES 208 CITIES
DISTANCE = 351.22 DISTANCE = 203.46
FINAL POPULATION GENERATION 483 24596 TRIALS

Figure 3,

168

 
Genetic Algorithms: A 10 Year Perspective

Kenneth De Jong
George Mason University
Fairfax, VA 22030

1, Introduction

In 1975 Holland's book, Adaptation in
Natural and Artificial Systems, was pub-
lished and provided a summary of the work
which Holland and his students had been
pursuing for some time. An important
theme in this wide ranging study of the pro-
perties of adaptive systems was that adapta-
tion can be usefully modeled as a form of
search through a space of structural changes
which one might make to a complex system
in an attempt to “improve"” its behavioral
characteristics. ‘This gave rise to a metho-
dology for studying existing (natural) adap-
tive systems and designing (artificial) adap-
tive systems which focused on answering key
questions such as: What are the legal struc-
tural changes one is allowed to make? How
is that space searched in an attempt to iden-
tity structural changes which improve
behavior? How does one ascertain that
resulting behavioral changes are, in fact, an
improvement?

‘As an example of the merit of this
approach, Holland specified the architecture
for and provided a theoretical analysis of a
class of adaptive systems in which the struc-
tural modification space is represented by
strings of symbols chosen from some alpha-
bet and the searching of this representation
space is accomplished by an unusual pro-
cedure called a genetic algorithm. I think it
is fair to say, at this point in time, that the
careful definition and theoretic analysis of
these genetic algorithms (GAs) was and con-
tinues to be one of the major contributions
of this effort. In the intervening ten years, a
good deal of interest and activity has
resulted in important new
and their potential applications, culminating
in this conference.

 

 

 

Unfortunately, as is the case in many
novel areas of research,
to find a forum

 

 

169

journal/conference structure for reporting
the wide ranging activities which have
resulted from Holland's provocative ideas
With only a few exceptions, much of this
work has been disseminated via unpublished
Master's and Ph.D. theses, personal com-
munications, and presentations at a series of
informal suinmer workshops.

1 am pleased to report that this sit
tion 1s changing for the better. In addition
to growing institutional support for research
in this area, the renewed interest in machine
learning in the AI community as well as the
continued interest in robust, flexible problem
solving strategies in many different contexts
has led to a dramatic increase in interest in
GAs during the last few years. There
remains, however, a fairly serious gap in the
coverage of GA research activities since
1975. Those who are new to the area find it
difficult to ascertain who has been doing
what and frequently get involved unneces-
sarily in rediscovering various aspects of
undocumented “wisdom” regarding the
implementation and application of GAs
This conference in general and this paper in
particular represent attempts to remedy such
perceived gaps, to suggest open research
issues, and to identify potential application
areas. The following sections summarize my
‘own personal perspective on the current
state of the art in this field.

 

 

 

 

 

 

Conceptual and Perceptual Issues

Most algorithms are developed with a
purpose in mind such as sorting, memory
management, tree traversal, ete. Genetic
algorithms, “however, represent a highly
idealized model of a natural process and as
such can be legitimately viewed as a simula-
tion at a very high level of abstraction. This
tends to raise some conceptual end percep-
tual difficulties when trying to understand
exactly what GAs do and how they might be

 

 
used.

Much of the early GA research, in an
attempt to simplify an already complicated
situation, focused on understanding how
Gas behaved when the structure space to be
searched was an N-dimensional space of
numerical parameters (corresponding to
independently settable dials on a control
panel) and the behavior of the system under
the new control settings (the fitness measure)
was ascertained by simply computing 2
memoryless function whose arguments were
the new control settings. By carefully choos-
ing functions which presented a variety of
well understood payoff surfaces, a great deal
of insight was obtained regarding how GAs
distribute trials in such spaces in response to
the feedback obtained from earlier trials.
This gave rise to a very natural question: Do
Gas provide @ new and important technique
for solving global function optimization
problems? A good deal of research
[DeJong75, Brindle80, Bethke8I] has and
continues to be done in this area
impressive results.

However, because of this historical
focus and emphasis on function optimization
applications, it is easy to fall into the trap of
perceiving GAs themselvee as optimization
algorithms and then being surprised and/or
disappointed when they fail to find an “obvi-
fous” optimum in a particular search space.
My suggestion for avoiding this perceptual
trap is to think of GAs as a (highly ideal-
ized) simulation of a natural process and as
such they embody the goals and purpose (if
any) of that natural process. I'm not sure if
anyone is up to the task of defining the goals,
and purpose of evolutionary systems; how-
ever, I think it’s fair to say that such sys-
tems are not generally perceived 2s function
optimizers.

The question that remains, then, is
how can one characterize what GAs do in a
way which is useful for understanding how
they might be best applied to difficult areas
such as global function optimization,
machine learning, NP-hard problems,
machine vision, etc I believe we still have a
long way to go in this area. I have
attempted to summarize recent advances as
well as identify some open issues in the next
section. To my mind the best perspective
currently available as to what GAs do is

 

 

   

 

 

 

 

 

Holland's characterization of them as simul-
taneously solving a large number of K-armed
bandit problems. (If you haven't read it or
didn't understand it, you should make an
effort to do so.) Although this characteriza-
tion leaves many unanswered questions,
armed with this viewpoint, one shouldn't be
surprised Chat 1) the best individual encoun-
tered so far may not even survive into the
next generation, 2) that the population itself
seldom converges to a global (or even local)
optima, or 3) that the ability of GAs to pro-
duce a steady stream of offspring that are
better than any seen eo far can vary from
quite impressive to dismal.

At the risk of summarizing the obvi-
ous, it is important to realize that GAs have
properties of their own independent of the
application area, and the key to a successful
application {including global function optim
ization) is to understand and exploit these
properties

 

  

3. Representation Issues

‘The strongest hyperplane analysis.
results assume that GAs use a very specific
form of selection, crossover, and mutation to
search a space of fixed length binary strings.
In order to take advantage of the power of
GAs as analyzed, the space to be searched in
& particular application must be mapped
onto a representation space of this form
Depending on the application, selecting an
appropriate mapping can range from a
trivial activity to a highly creative one.
There is now sufficient expenence to begin
to characterize search spaces with respect to
choosing a representation mapping. The fol-
lowing is an attempt to do so.

 

Searching Parameter Spaces

Typically, the simplest way to make a
complex process more flexible (adaptive) 1s
to identify a fixed set of parameters which
can be altered to improve behavior The
obvious mapping 1s to think of each of the N
parameters as a genes and assign each a gene
(string) position. If we then choose for each
parameter a set of unique symbols
representing the legal values of that parame-
ter, we have a very intuitive internal
representation as strings of length N. Cross-
over occurs between symbol boundaries and
produces “legal” offspring, and mutation

170

 
when applied to position i selects a new sym-
bol from the legal symbol set for that posi
tion. There is both theoretical and experi:
mental evidence to suggest that such direct
intuitive mappings are appropriate when the
number of legal values a parameter may
take on is quite small (ideally, 2) and inap-
propriate when they deviate much from the
ideal [Holland75}.

Although there are many interesting
problems which permit such direct mappings
(eg., feature spaces, certain NP-hard prob-

 

 

lems), most parameter modification problems

do not. An obvious solution is to map each
of the N symbol sets onto a set of fixed-
length binary strings, concatenate the
results, and apply GAs to this representation
space. While it is easy to demonstrate a
dramatic improvement in the behavior of
GAs in switching from a short length, high
cardinality representation of a problem to a
longer, but lower cardinality representation,
there are a several issues which arise for
which we do not have good answers. Fre-
quently the cardinality of a symbol set is not
a power of 2, requiring rounding up to the
next power of 2 and implying the symbol
map is into but not onto the set of binary
strings. In so doing, the size of the represen-
tation space ean be increased (in the
case) by a factor of 2N over the ori
search space. Since crossover and mutation
invariably produce some of these unas-
signed strings, there are any number of ways
to handle including discarding such
strings as illegal, assigning such strings low
payoff, or mapping such strings redundantly
into the symbol set. Each of these
approaches has been tried at various times
with no clear indication (either experimen-
tally or theoretically) of the overhead
incurred by such rounding or whether one
approach is consistently better than another.
Frequently the application permits
enough flexibility in defining the original
search space so that the set of legal values
each parameter can take on can easily and
naturally be powers of 2 (e g, most function
optimization problems) so that rounding up
issues are perceived as critical. There
remains, however, the problem of selecting
which of the M! ways M objects can be
mapped onto another set of M objects in
order to generate binary representations.
This issue came up early in the function

 

 

 

   

 

  

 

 

studies in that when presented
with certain relatively simple continuous
surfaces GAs appeared to “lack the killer
instinct” in the sense that they would
quickly find near-optimal points, but fail to
press on to better points near by. Further
analysis indicated that such behavior was
generally caused by artificial “representation
boundaries” introduced by mapping the ori-
ginal space onto a binary representation
space in such a way that ‘‘near-by-ness” had
not been preserved. Hence, at a representa-
tion boundary, a small change in the value
of a parameter is achieved only by a radical
change in the binary representation of that
parameter value. Since crossover and muta-
tion are operating at the bit level, only very
low probability sequences of events could
“bump” the search over such boundaries.
Experiments with alternative encodings such
‘as gray codes yielded clearly identifiable
improvements in cases where representation
boundaries appeared to be a problem, but
gave mixed results in others [Brindle80,
Bethke8l, | Another suggestion for which
there are no definite results is to redefine
mutation so that it works at the parameter
level, guaranteeing that at any point in time
each parameter value is equally likely to be
generated. The argument against such an
approach is the disruptive effect such an
operator would have on the proper allocation
of trials to hyperplanes at the bit level.

As a consequence, an important open
question is a better understanding of exactly
what has to be preserved when choosing a
mapping and how to find mappings with the
desired properties. The only hints and
‘suggestions along these lines that I am aware
of are Bethke’s use of Walsh transforms to
characterize when representation spaces are
“GA hard" [Bethke8]] Any new results in
this area would greatly improve our under-
standing and use of GAs.

 

 

 

3.2. Adaptive Representations

Since there may not be sufficient «
priori insight to select an appropriate
representation, an alternative approach
which has been discussed but for which there
i little theoretical or experimental insight
to allow GAs themselves to select the map-
ping as part of the adaptive process. One
strategy involves including extra "tag bits”
with each individual which identifies the

  

 

awn

 
mapping to be used. An interesting issue
here is whether GAs should be modified to
be aware of such tags bits (for example, by
‘only applying crossover to parents with
identical mappings) or whether GAs should
manipulate the tags bits in the usual way as
undistinguished members of a longer binary
string. In the former case, this introduces
the idea of subpopulations (species) for
which there is considerable support in
natural systems but for which there are no
analytic results. In the latter case, the
presumed usefulness of binary strings inher-
ited from one (and possibly both) parents
can be lost because they are interpreted in
totally different. way in an offspring unless
the parents had identical tag bits and muta-
tion left them unchanged.

Holland raised similar issues while
analyzing the disruptive eflects of crossover
‘on co-adapted sets of alleles which, because
of the particular representation chosen, hap-
pened to be far apart [Holland75]. His
suggestion was to introduce the inversion
operator as a mechanism for changing the
physical location of genes without changing.
their functional interpretation. As above,
left unresolved were issues such as whether
there should only be a few inversion patterns
(species) present in a population with mating
(crossover) occurring only within species or
whether crossover should be modified to
allow offspring to inherit an inversion pat-
tern from one parent but gene values from
both, Early experimental work {Franz72,
DeJong75] generated little evidence of any
significant improvement due to introducing
inversion in a function optimization context;
however, inversion proved to be effective in
later work using GAs to search spaces of
production system programs [Smnith80).

 

 

 

3.3. Context Sensitive Values

A related but more fundamental prob-
em arises when the application area has the
property that the legal values for one param-
eter are contezt sensitive in that they
depend on which values have been chosen at
other positions. While it is frequently con-
venient and natural to view such problems
as defining parameter spaces to be searched,
violating the assumption that values can be
selected independently can have dramatic
effects on the performance of GAs. A simple
example of this occurs if we try to represent

 

 

172

the unit circle with Cartesian coordinates
mapped onto fixed-length strings. GAs, by
independently choosing symbols at each
position, will distribute trials over the unit,
square. The usual “fix is to define the
payoff outside the unit circle to be excep
tionally low (a penalty function) and let the
Gas “learn” to keep new trials inside the
desired region. Suppose, however, we gen-
eralize the problem to that of representing
an Nedimensional hypersphere using Carte-
sian coordinates. If GAs distribute their tri-
als over the enclosing hypercube, as N gets
large, the volume of the hypersphere
becomes vanishingly small relative to the
hypercube and the search process becomes
hopelessly bogged down on a surface which
appears to be uniformly bad almost every-
where. In this ease, of course, it doesn't
take much insight to suggest a switch to
polar coordinates. However, there are other
cases in which alternate representations are
not so easy to find.

My favorite example of this is the
Traveling Salesman Problem (TSP), and 1
am delighted to see that it is well
represented at this conference. I continue to
believe that it captures in a simple, elegant
way many of the open GA issues. A good
deal of thought and discussion has gone into
the problem of representing TSPs in a form
amenable to GAs with very little success to
this point. Since the problem involves vi
ing each of N cities exactly once while
minimizing the total distance of a tour, the
most natural way to represent candidate
solutions is to list in order the cities visited.
Obviously, even though this representation
can be viewed as N parameters specifying
the Ith city to be visited, it is strongly con-
text sensitive in that once a city symbol is
used, it eannot be re-used in another post-
tion’ OF course, one can always permit the
GAS to construct illegal tours via crossover
and mutation and assign them a very low
payoff. Unfortunately, just as with hyper-
spheres, the space of interest here (the set of
all permutations of N symbols) becomes a
vanishingly small fraction of the the set of
all combinations as N increases. There have
been many alternative representations
invented and explored, but to my knowledge
none represent the set of permutations in an
efficient, context free way.

 

    

 

 
The alternative to finding a representa
tion which fits with the standard versions of
crossover and mutation is to change the
definition of crossover and mutation to fit
the representation. Inventing new mutation
operators is not too difficult in this case, the
most natural being low order permutation
operators. Crossover requires a bit more
creativity and usually involves taking a p:
tial tours from one parent and splicing in
whatever is legally possible from the second
parent. The results to date from this
approach have not been any more encourag-
ing than the previous ones using the stan-
dard versions of crossover and mutation on
inadequate representations. The problem in
this case is that, by altering the genetic
operators, we have altered the way in which
GAs distribute trials and the fundamental
theorems regarding efficient parallel search
need to be re-proved.

So we find ourselves “caught between
a rock and a hard place” with few places to
turn, I don’t claim to have the answer
either, but there are several observations
which would seem to provide some hints.
TSP problems fall into an equivalence class
of problems called NP-complete because
there are no known polynomial-time solu-
tions for any member of the class and if one
were found there are polynomial-time
transformations permitting all _ other
members to be solved in polynomial time.
The Boolean Satisfiability Problem (BSP) is
‘a member of this class and involves finding
truth value assignments to N boolean vari-
ables in such a way as to make an art
given boolean expression of these N variables
true. The most natural representation for
BSPs is precisely what is needed for use
GAs, namely a binary string of length N.
Crossover and mutation work precisely as
intended and problems of surprising size can
be solved. (Unfortunately, there isn't much
interest here in nearly correct assignments!)
What we have then are two problems which
are known to be equivalent in the NP-hard
sense, but are quite different in a GA-hard
sense.

The difference seems to hinge on a sort
of duality relationship between the two
problems. Fitness for BSPs is defined purely
in terms of the values of the symbols and
not their relative positions in the string.
This maps well onto our notion of

 

 

 

 

 

 

   

 

 

 

hyperplane and in these situations crossover
and mutation are effective mechanisms for
homing in on good value combinations. On
the other hond, TSP fitness is defined purely
in terms of the order of valueless genes
which represents being in city n Here inver-
sion seems most natural with crossover and
mutation inappropriate in their usual form
‘What seems to be needed is a definition of a
hyperplane in this dual space, Unfor-
tunately, our notions of hyperplanes are so
tightly bound to spaces represented by a
fixed number of independent axes that it's
hard to conceive of alternate definitions.
With an appropriate definition Uhere would
be a much clearer view of the duals to cross-
‘over and mutation, and hopefully 2 dual set
of analytic results,

3.4. Context Sensitive Interpretations

Another form of context sensitivity can
arise and cause difficulty when the same
value of a particular parameter has different
interpretations depending on the values of
other parameters. We have already seen
how this can occur when attempting to
select representations adaptively. Another
nice example arises in attempting to escape
from the context sensitive value representa-
tions of TSPs. One could imagine an N
parameter representation in which the first
parameter specified which of the N cities
should be visited first. Having deleted that
city from our list, the second parameter
always takes on a value in the range 1...N-1,
specifying Ly position on our list which of
the remaining cities is to be visited second,
and so on. Values for each of the parame-
ters can now be independently selected and
crossover and mutation always produce legal
tours. However, the performance of GAs on
this representation is not significantly better
than the previous ones. The difficulty
appears to Le that gene values to the right
of a crossover point or a mutation are inter-
preted quite differently (ie., specify totally
different su! tours) in an offspring than in the
parent, ing the concept of minimal
disruption of “building block” formation
What seems to be needed is a representation
which allows good subtours (co-adapted sets)
to form and be passed on in combination
other subtours, forming better tours,
and so on. With the traditional definition of
a hyperplane, this seems to rule out context

 

  

 

 

 
sensitive interpretations as bad representa-
tions. 1 am unaware of any alternatives
other than the hope that perhaps a more
general perspective on hyperplanes will clar-
ify these issues.

3.8, Varying Length Representations
So far we have been discussing issues
which appear in the context of searching
parameter spaces. There are, of course,
many other (generally more complex) kinds
of spaces which represent the set of permissi-
ble structural changes to an adaptive pro-
cess. In some cases strings are still a natural
representation, but there may be no notion
of a fixed length. A good example are
strings which specify structural changes via
“genes” which represent actions to be Laken.
One string may consist of only a few actions
while others require many. If we wish to use
standard GAs, the simplest (but somewhat
inefficient) approach is to assume some rea-
sonable upper bound on the length, throw in
a “'no-op" action, and require all strings to
be maximum length. Alternatively, eross-
over can be easily generalized to produce
offspring whose length is different (in gen-
eral) from either parent by choosing
independent crossover points in each parent

However, it is important to note that
neither approach is sufficient to guarantee
good GA performance on varying string
length spaces. To understand why requires
asking what the hyperplanes are in this con-
text. Both Holland [Holland75| and Smith
[Smith80] discuss the issues. I will not
repeat the discussions here, but just note
that there is considerable evidence that a
sufficient condition for good GA performance
is that the genes express their actions in a
position independent way.

  

 

 

  

3.6, Non-String Representations

What should one do when elements in
the space to be searched are most naturally
represented by more complex data structures
such as arrays, trees, digraphs, ete. Should
fone attempt to “linearize” them into a
string representation or are there ways to
creatively redefine crossover and mutation to
work directly on such structures. Iam
unaware of any progress in this area. How-
ever, the issues appear to be reasonably
clear. Any linear representations will have

 

 

174

to satisfy (he properties discussed in the
preceding sections in order to achieve
efficient GA search. Similarly, any attempts
to modify crossover and mutation will
require analogous hyperplane analysis results
to guarantee reasonable performance

3.7. Production System Spaces

One of the most intellectually pleasing
ways to effect changes in the behavior of a
complex process is to modify its knowledge
base. There has been a good deal of
research within the AT community regarding
appropriate ways to represent knowledge.
Production rules are frequently chosen when
learning is involved |Waterinan70, Newell77,
Buchanan78]. The GA community has also
maintained 2 long standing interest in pro-
duction system architectures because of their
amenability for use with GAs [Holland75,
Holland78, Smith80, Booker82|. From my
Perspective there are currently two main
approaches to searching production system
rule spaces with GAs.

The first is typified by the classifier
systems developed initially by Holland [Hol-
land78) and Booker [Booker82}. Here indivi-
duals in the population represent single pro-
duction rules (typically fixed length) and the
current population represents the entire set
of rules governing the behavior of the adap-
tive process. GAs play a subservient role
within a larger cognitive model and are
invoked intermittently to produce new rules
which replace existing rules in the popula-
tion,

The alternate approach is represented
by the LS-1 system developed by Smith
{Smith8o}. Individuals represent entire rule
sets to be plugged into the knowledge base
and evaluated. ‘The next generation of rule
sets is produced in the usual way by apply-
ing genetic operators to existing rule sets.

Both approaches have _ produced
encouraging results in quite different con-
texts. There is not enough experience, how-
ever, to understand precisely the strengths,
weaknesses, and tradeofis involved in either
of the approaches. My guess is that the
classifier approach will prove to be most use-
ful in an on-line, real-time environment in
which radical changes in behavior cannot. be
tolerated whereas the LS-1 approach will be
best suited for off-line environments in which

 
more leisurely exploration and more radical
behavioral changes are acceptable.

 

Fitness Functions

In addition to choosing an appropriate
representation on which to apply GAs, care-
ful thought must be given to the characteris-
ties of the payoff function used to provide
feedback regarding an individual's fitness to
produce offspring. The wealth of data from
GA function optimization studies simultane-
‘ously show a general robustness in perfor-
mance over widely varying classes of fune-
tions and intermittent dismal results. This
has lead to several informal characterizations
of the kinds of surfaces which are GA-hard.
Surfaces which are flat almost everywhere
except for an occasional spike present
difficult. search problems for any approach
including GAs. The intuitive explanation is
that, since there is (essentially) no
differential payoff among the competing
hyperplanes, such peaks will be found only
by chance. Unfortunately, it is not all that
difficult to inadvertently construct one in
applications like the hypersphere and BSP
examples discussed earlier.

This immediately suggests another way
to fool GAs: put misleading information in
the hyperplanes. Fortunately, this is much
more difficult to do because of the simul-
taneous sampling of many different hyper-
plane partition elements. Bethke [Bethke8!]
has a nice discussion of this using Walsh
transforms to characterize GA-hard func-
tions. However, much more work needs to
be done in this area,

It should be also noted that it is quite
easy to incorrectly blame GAs for poor per-
formance when the fault in fact lies else-
where. One classic case of this arises when
using GAs to improve the performance of a
complex process for which no payoff function
is given. Since one has to be constructed,
care must be taken to verify that high payo
values as seen by GAs corresponds to good
behavior as observed by watching the com-
plex process itself. Another case arises when
numeric parameter spaces are being
searched. Since there is typically some free-
dom in how finely to discretize a parameter
range, choosing too coarse a discretization
factor may inadvertently leave out optimal
points in the representation space being

 

 
 

 

 

 

 

searched by GAs and then blame the GAs
for not finding them!

Until recently, most GA research and
applications involved payoff functions which
return a single (scalar) payoff value. There
are situations in which it is more natural to
have the payoff function return a vector of
values reprisenting, for example, scores on
non-commensurate aspects of performance.
Rather than insisting that an artificial fune-
tion be created which combines such scores
into a single payoff value, it would be prefer-
able to have GAs work directly with multi-
valued payoffs. Schaffer (Schaffer85] has
explored this possibility recently and has
obtained promising results.

 

 

5. Genetic Operators

‘There certainly is nothing sacred about
the traditional operators defined and
analyzed by Holland. What is important is
that we have criteria from Holland's hyper-
plane analy is which operators should meet.
If changes are made to existing operators or
new ones are introduced, it is important to
verify that they aren't overly disruptive of
the process of distribution of trials according
to payoff and that they encourage the forma-
tion of building blocks. There are still some
interesting open questions along these lines
with respect to rather modest variations of
the standard operators.

It is pretty much standard procedure
now to view crossover as applying to circular
strings and selecting two crossover points,
‘the beginning and the end of the segment

the second parent. This
modification is well supported both theoreti-
cally and experimentally. What happens if
we continue along this vein and select two
segments from the second parent (via four
crossover points)? Is this helpful or too dis-
tuptive? ‘The answers are pretty clearly
negative by the time we have increased the
number of crossover points to the extent
‘that an offspring’s gene values are randomly
selected from its parents values. Perhaps
the number of crossover points should be a
function of the length of the strings
involved. Applying the traditional crossover
to strings with thousands of genes (which is
currently being done) seems to be intuitively
more disruptive than one with four or six
crossover points. If so, where does the law

 

 
of diminishing returns set int

The role of mutation as a background
operator which introduces new allele values
is fairly well understood and accepted in the
abstract. As discussed earlier, problems can
arise from our choice of representation in
which mutation (and crossover) are operat-
ing at the bit level, but our interpretation of
the search space is at a higher level. This
can lead to a frequently tried but rarely sue
cessful strategy of increasing the mutation
rate to improve GA performance. A better
approach in such situations is to think in
terms of both higher and lower level versions
of the genetic operators. Both Holland |Hol-
Jand75] and Smith [Smith80} discuss this,
but much more work needs to be done.

6. Selection

The technique of selecting parents for
reproduction with a frequency proportional
to observed fitness has strong theoretical
justification and considerable empirical sup-
port. However, there are occasions when
this process seems to break down when
implementing GAs with finite populations.
This has come to be known as “the scaling
problem" and can occur in a number of
ways. If a highly fit individual is encoun-
tered early in the search process among.
mediocre peers, selection will give it such
strong preference that it can dominate the
population in a few generations and cause
premature convergence. Similarly, late in
the search process the population can be leg-
iumately dominated by members with very
high payofls which differ on an absolute
scale, but when normalized to produce
expected number of offspring are equivalent,
out to the third or fourth decimal place.
The effect is that essentially every parent
contributes equally to subsequent popula-
tions in spite of fitness differences.

There have been a number of proposed
solutions including the introduction of scal-
ing factors and crowding factors (DeJong75|
and selection by rank [Wetzel83, Shaffer85|
However, I think it is fair to say that a gen-
eral solution still eludes us.

 

  

 

7. GA Parameters

One of the observations people are
quick to make 1s that GAs are themselves
complex processes which appear to have a

set of paraineters (crossover rate, mutation
rate, populstion size, ete.) which could be
tuned to iinprove performance. There is
considerable empirical support for the state-
ment that within reasonable ranges the
values of such parameters are not all that
critical [DeJong75, Grefenstette85]. As a
consequence most GA applications work
with fixed “accepted” parameter values.
However, there is also evidence to suggest
that additional performance improvements
could be obtained if such parameter values
could be dynamically modified. The
difficulty is in deciding when and how to
effect such changes. Should we have a two-
level GA complex with the top level GA
actively searching the parameter space of
the lower level GA and trying out new
parameter combinations? Are there simpler
signals such as allele loss which should
trigger parameter changes? Unfortunately,
the existing theory gives little guidance here.

 

  

 

  

8. Conclusion

In rervading the previous sections, I
became a little concerned that the reader
might infer a strong negative tone from this
long list of problems and open issues in GA
research. Nothing could be further from my
intent. [am enthusiastic about the pote
which GAs hold an ely involved in

    
 

 

GA research and applications It is that
enthusiasm which generated this paper and
this conference. I hope the result is that the

next time we get together my list will be
considerably shorter (or at least different)!

References

[Bethke80| Bethke, A., "Genetic Algorithms
as Function Optimizers", Doctoral Thesis,
CCS Department, University of Michigan,
1981.

[Booker82] Booker, L. B, "Intelligent
Behavior as an Adaptation to the Task
Environment”, Doctoral Thesis, CCS
Department, University of Michigan, 1982.

 

[Brindle80] Brindle, A., "Genetic Algorithms
for Function Optimization”, Doctoral Thesis,
Department of Computing Science, Univer-
sity of Alberta, 1980,

176

 
 

{Buchanan78] Buchanan, B., Mitchell, T.M.,
"Model-Directed Learning of Production
Rules", in Pattern-Directed Inference Sye-
tems, ‘eds. Waterman and Hayes-Roth,
‘Academie Press, 1978.

[DeJong75] De Jong, K., "The Analysis of
the Behavior of a Class of Genetic Adaptive
Systems”, Doctoral Thesis, CCS Depart
ment, University of Michigan, 1975.

 

[DeJong80a] De Jong, K., "A Genetic-based
Global Function Optimization Technique”,
TR 80-2, Department of Computer Science,
University of Pittsburgh, 1980.

[DeJong80b] DeJong, K., "Adaptive System
Design: A Genetic Approach”, IEEE Trans.
on Systems, Man and Cybernetics, 10,9,
Sept. 1980.

[DeJong81] De Jong, K. and Smith, T,
*Genetic Algorithms Applied to Information
Driven Models of US Migration Patterns”,
12th Annual Pittsburgh Conf. on Modelling.
and Simulation, April 1981.

 

[Frante72| Frantz, D. R., "Non-linearities in
Genetic Search", Doctoral The ccs
Department, University of Michi

   

[Grefenstette85] Grefenstette, J., "Genetic
Algorithms for Multilevel Adaptive Sys-
tems”, to appear in IBEE Trans. on Sys-
tems, Man and Cybernetics.

|Hedrick76] Hedrick, C.L., "Learning Pro-
duction Systems from Examples”, Artificial
Intelligence, Vol. 7, 1976.

|Holland75] J. H, Holland, Adoptotion in
Natural and Artificial Systems, University
of Michigan Press, 1975.

[Holland78] Holland, J.H., Reitman, J.,
"Cognitive Systems Based on Adaptive Algo-
rithms", in Pattern-Directed Inference Sye-
tems, eds. Waterman and Hayes-Roth,
‘Academic Press, 1978.

[Newell77] Newell, A. "Knowledge

Representation Aspects of Production Sys-
tems", Proc. 5th ICAI, 1977.

|Schaffer85] Schaffer, J. D., "Multiple Objec-
tive Optimization with Vector Evaluated
Genetie Algorithms”, to appear in Proc. Int'l
Conf. on Genetic ' Algorithms and their
Applications, July 1985,

[Smith80] Smith, S. F., "A Learning System
Based on Genetic Adaptive Algorithms”,
Doctoral Thesis, Department of Computer
Science, University of Pittsburgh, 1980.

 

[Smith83] Smith, S. F., "Flexible Learning of
Problem Solving Heuristics Through Adap-
tive Search”, Proc. 8th IJCAI, August 1983.

[Wetzel83] Wetzel, A., "Evaluation of the
Effectiveness of Genetie Algorithms to Com-
binatorial Optimi jon”, Doctoral Thesi
Department of Library and Information Sei-
ence, University of Pittsburgh, 1983.

   

 
 

Classifier System with Long-term Memory in Machine Learning

Hayong Zhou
Vanderbilt University

 

ABSTRACT
This paper dig ee the
advantages of classifier systema
with long-term memory and

includes a description of the basic
structure of auch a aystem. The
learning atrategy used here ia
twofold one. Firat, an analogical
learning strategy is employed to
inject the appropriate knowledge
into the population. Second, o
production system with a GA-based
learning component woked to
perform subsequent learning. The
proposed system has one overall
objective: It aeeka to increase the
efficiency and power of the
learning system over a long period
of time of use.

   

   

 

 

1. Introduction

A genetic algorithm (GA) is a problem-
solving and non-deterministic search algorithm
first introduced by Holland in 1975(3). It has been
shown, theoretically and empirically, that GAs
are robust and effective in various task domains,
even in the presence of difficulties such as noise,
high-dimensionality, multimodality
discontinuity(7].

The outgrowth of the continuing research
in this area evolved into a message-passing, rule-
based production system called classifier
tem{4]. A classifier system is a learning system
in which many
simultaneously. A classifier is a pattern sensitive
with condition/action Each
condition specifies the set of messages satisfying

 

and

  

classifiers are active

element form

178

it, and each action specifies the message to be
sent when its condition part is satisfied. In short,
& classifier system manipulates knowledge
structures (KSs) in response to performance via a
genetic algorithm. It provides a framework for
cognitive simulation 2]

Several published classifier systems which
incorporate transfer of learning knowledge from
fone task to another have been developed. In
1978, Holland and Reitman designed the first
classifier system called CS-1 tested on mate
problems An experiment was conducted to
demonstrate transfer of learning from a small
mase problem to a large but similar one(4]. The
‘experimental result showed that CS-1 was able to
solve the large maze problem much faster when
initially supphed with some learned knowledge
In 1982, Booker did in-depth
simulation study of classifier systems as cognitive
models[2]. He performed
demonstr:

 

structures

rveral experiments to
e the effects of prior knowledge
structures on learning in new situation. For
“positive transfer*(transfer of knowledge for

 

 

 

   

solving similar tasks), his results were very
encouraging
Before proceeding any further, the

“reversal learning task" needs to be descr
Schrier{6] trained a monkey on a reversal learning
task. Reward and punishment were reversed
repeatedly while keeping the input information to
the monkey unchanged. Performance of this
monkey was inefficient at the outset, but,
eventually, each new reversal could be learned

 

with a single trial.
In order to test the learning ability of
classifier systems, Booker ran his system on the
reversal learning task. Surprisingly, the resulting
performance was inconclusive. The reasons,
according to Booker, are that “the emphasis on
recency and short-term memory in the system is
too great" because “by the time the organism
had reached criterion on a given reversal, the
classifiers learned during the previous reversal
were likely to have been deleted - that is, become
“extinct* due to the drastic change in the
environment"(2}. In 1984, Schaffer completed the
LS-2 designed for the pattern discrimination task
domain|6}. He also gave the reversal learning task
to his system. The results obtained so far are not
encouraging either(private communication),

In sum, efforts to build powerful classifier
systems have met with impressive success over
the past The attempts to transfer learned
knowledge for solving similar tasks, though
manaully, have been shown to be useful and
effective. However, the failures in solving the
reversal learning task pose a question Is there
any way that classifier systems can keep
knowledge which is useful but irrelevant to the
current situation intact in order to increase the
efficiency and power of their learning ability? To
answer this question, this paper proceeds from a
general need for having a long-term memory to a
proposed prototype in the following sections.

2. Motivation for the design of classifier
system with long-term memory(CSLM)

We begin this section with several
assumptions which have been associated with
traditional classifier systems

‘© The domain of learning is concerned with a
single task.

©The changes in environments are slight,
smooth and gradual.

* The efficiency for solving similar tasks in a
long run is not important

If task domain satisfies these assumptions, it
would be unnecessary to augment a classifier
system with long-term memory. However, an
ideal learning system should be able to switch its
attention as needed while still preserving the
most useful knowledge gained in the past no
matter how its environment has been changed
By doing 10, the system would increase its
efficiency and power over time and improve its
learning ability os the number of learned tasks
grows

In short, the main concern of this paper is
to investigate how to accumulate and preserve
knowledge not only within a task, but also among
tasks It has been shown empirically that the sire
of a population should be chosen around
S0{aumber of knowledge structures) in order to
maximize computational efficiency(8). In practice,
tion
larger than 200. For auch small knowledge pools,
it is hard to imagine that a set of generalized
knowledge structures could be constructed, for
example, suitable for many pattern discrimination
tasks. A short-term memory, i.e the population
in a classifier system, can not be expected to meet
the challenges imposed by drastic environmental
changes Each knowledge structure in a
population is evaluated by the Critic designed for
the current task. It is very difficult, if not
impossible, to preserve those knowledge
structures which were perfect for some previous
tasks but not suitable for the current situation
We see this as a serious weakness of the current
model and as the major motivation for the design

of a classifier system with long-term memory
(CSLM).

most of classifier systems never use a pop

 

3. Overall description of CSLM.

In this section, an overall organization of
CSLM 1s outlined ‘The description is based on
the following diagram and 1s intended to be

 
 

instructive rather than specific In figure 1, an have one of the following three outcomes.

understanding of the basics of classifier systems eee

hhas been assumed. It is well described in (24,5) The wat cee step is to bring the learned -
Knowledge structures into the

Detectors population Heuristic initialitation of

Yorn or Receive the population is done.
decriptors (Di)

 

 

 

   
   

 

 

 

2. Partial matching
‘One similar task can be found. The

similarity between the incoming task
and the stored ones indicates that
there might exist some useful building
lene posarieeetiee) blocks in the stored knowledge
structures which, hopefully, can
ipepaet isso provide a promising direction to start,
I with, Thus the search space would be
pruned and the computational effort

{thr Learning might be reduced.

 

 

 

 

 

 

 

 

 
 

 

 

 

 

store the winners
(nav KSa)

3. No matching.
ai It tells us that no previous experience
regarding the incoming task is known,
or possibly has been forgotten. In this
* Deseriptora: Descriptors serve as indices case the CSLM has to start from
scratch, no worse than current

 

 

 

 

In simplest terms, we can visualise the main
components of CSLM as follows:

to learned knowledge structures. The “
classifier systems.
descriptors for various tasks could be very

general In fact, complete and precise Tepaterm Umemery:, The long-term
‘memory consists of two separated memories

called Episodic © Memory(EM) and

Knowledge Base(KB) respectively.

The EM stores all descriptors for previous

tasks. Each descriptor has one pointer

descriptor for a task is neither necessary
nor realistic. In practice, the descriptors
might use a low level language(a string of
bits) or a high level language(alphabet) to

express main characteristics of taske They
Pointing to its corresponding KSs in the

KB The content of the EM may be

considered as the indices for accumulated

may be produced automatically from
incoming tasks, or supplied by users,

* Matcher: The Matcher(a procedure)

performs two functions. ~—_ matching

knowledge structures
‘The KB preserves learned KSs

descriptors and initiating a population. We Whenever a task has been solved, the set of

discuss them together here Matching the soluions are stored ‘ia the tong-tarsi

descriptor of a incoming task with that of memory along with the associated pointer

ks i 7 a
solved tasks in a long-term memory might One of the basic learning strategies

180
employed in CSLM is “learning by analogy”
which appears to be a centr
human cognition and promises to be a powerful
mechanism in machine learning . Learning by
analogy consists of two phases. The first phase
called the "reminding phase" which identifies the
similarity between an incoming task and the
problems observed or solved before The second
phase involves the transfer of appropriate
knowledge obtained in the past into the new
situation Carbonell pointed out the importance
of learning by analogy:* In general, transfer of
experience among related problems appears to be
theoretically significant phenomenon as well as a
practical necessity in acquiring the task -
dependent expertise necessary to solve more
complex real world problems*{1).

The approach used in CSLM 1s to form
descriptors derived from the detector array to
categorize tasks. In the reminding process,
similarity could be determined by matching these
stored descriptors in a long-term memory with
the descriptor derived from an incoming task In
the next phase of analogical problem solving, the
related knowledge structures, if any, would be
brought into the population Notice that to
inject these learned KSs into a population 18 not
the end of our story. Instead, it should be viewed
as providing strong guidance for future search
‘The genetic algorithm will manipulate these
‘useful building blocks and transform them into a
form that would be appropriate for the current
task. In the next phase, the classifier system is
invoked to perform the subsequent learning which
will not been detailed here

inference method in

 

 

 

4. Solving the reversal learning task in
CSLM

First of all, we need to emphasize that the
interestingness of the reversal learning task is not
only because it represents a new class of learning

 

tasks, but also, more importantly, it tests the
learning ability of a system on how well it can
preserve useful knowledge from radical changes in
environments.

Let us see what will happen if a reversal
learning task is given to CSLM.

Suppose that a CSLM has created a set of
KSs for a given task and stored it along with its
associated descriptor in a long-term memory, as
shown in figure 2.2, When the second task with
the same appearance: but opposite
meaning(reversal task) is given, the CSLM, as

 

expected, is in the worst possible position to learn
the new task since the Matcher procedure would
have brought the learned KSs into the
population. In this case, the learned KS would
receive a low score and the classifier system
would have to develop a new KS for the reversal
task. However, after the CSLM has created two
sets of KSs for each reversal, it can solve
subsequent reversal learning tasks with a single
trial, As noted earher, the generality of »
descriptor for a task would guarantee the CSLM
to recognize the tasks with the same or similar
characteristics. ‘Thus the Matcher would be able
to pull two sets of KSs out of the long-term
memory based on the similarity measurement and
myect them into the population The initialized
population 1s shown in figure 2.b. Therefore, the
Critic would be able to choose the appropriate
KS for each reversal

 

 

 

 

 

 

vigure 2

Another significance of this demonstration

 
is to show what happens af a set of bad
knowledge structures has been used to initialize
the population. The full power of genetic
algorithms comes from the parallel nature of the
search and the immunity to false peaks,
Therefore, these injected KSs are only tentative,
and as such are subject to testing If some of
them prove useless or misleading, they will die
out in subsequent generations

There 1s a further point worth noting the
portion of a population to be heuristically
initiated should be judiciously decided 20 that
the premature convergence could be avoided

 

while still giving an opportunity for guiding

future search,

5. Summary and Future Research

‘This paper has discussed the advantages of
augmenting classifier systems with long-term
memory and described a prototype of CSLM
conceptually The process of solving the reversal
learning task was demonstrated as well The
driving force behind this paper 1s to extend the
current model in order to deal with more complex
task and make
environments have been drastically changed
Several difficulties which can be anticipated in
the design of CSLM are mentioned here:

consistent progress even if

‘* How to extract descriptors from tasks
reasonable accuracy and effort while
maintaining the delicate balance between
generality and specificity?

 

* How to update the content of a long-term
memory dynamically?

© How to best initialize a population?

In seeking the answer to these questions and to
test the feasibility of the proposed ideas, a
specific CSLM designed in the pattern
discrimination domain is to be implemented It 1s
hoped that the experimental results will be
available soon as an evidence of the improved
learning ability of the proposed system

182

 

Acknowledgements

‘The author would like to thank hs advisor,
Dr John Grefenstette for his guidance, and Dr.
David Schaffer for his encouragement during the
development of this paper

References

1. Carbonell, J.G. Learning By Analogy
Formulating And Generalizing Plans From
Past Experience. Machine learning,
137-159, Tioga publishing Co

2 Booker, L. Intelligent Behavior As An
Adaptation To The Task Environment
PhD dusertation, The university of
Michigan, 1982

3. Holland, J.H. Adaptation in Natural and
Artificial System The university of
Michigan press, 1975

4.Holland, J.H. and Reitman, JS.
"Cognitive Systems Based On Adaptive
Algorithms, ittern-directed inference
system, 313-329, 1978

5. Schaffer, J.D. Some Experiments in
Machine Learning Using Vector Evaluated
Genetic Algorithm PhD dissertation,
Vanderbilt University 1984

6 Schrier, A.M. Transfer By Macaque
Monkey Between Learning-set. and
Reperted-reversal Tasks Percept_ Mot
Skills, 23, 787-792

7, Dejong, K.A. Analysis of the Behavior of
a Class of Genetic Adaptive Systems,
PhD dissertationUmw of Michigan, 1975

8 Grefenstette, J.J. Optimization of
Control Parameters for Genetic Algorithm,

To _appear_in IBEE Trans Sys Man,
Cybn,1985

 

 

 
A Representation for the Adaptive Generation of Simple Sequential Programs

Nichael Lynn Cramer

Texas Instruments Inc.
PO Box 226015,MS 238
Dallas, TX 75266

ABSTRACT

‘An adaptive system for generating short sequential computer functions is described.
‘The created functions are written in the simple “number-string” language JB, and in TB, a
modified version of JB with a tree-like structure. These languages have the feature that they
can be used to represent well-formed, useful computer programs while still being amenable
to suitably defined genetic operators. The system is used to produce two-input, single-
output multiplication functions that are concise and well-defined. Future work, dealing
with extensions to more complicated functions and generalizations of the techniques, is
also discussed.

INTRODUCTION

The techniques of adaptive Genetic Algorithms 'GAs]! have been shown to be useful
in many areas. Initially, these systems involved the adjusting of a fixed set of parame-
ters in order to optimize the performance of a given algorithm?. Much work has been
done toward the goal of evolving the algorithms themselves, particularly in Production
System-like domains!(eh2P8).34, ‘This paper discusses work towards developing a sequen-
tial programming language that is suitable for manipulation by GAs so as to permit the
adaptive generation of simple computer functions from low-level computational primitives.

 
  

FUNCTIONAL REPRESENTATION

‘The scheme that we will follow is first to find a suitably powerful programming lan-
guage, and then encode the programs in this language in such a way as to make them
amenable to the standard Genetic Operators [GOs].

The basic language to be used is a variation of the algorithmic language PL having
the following operators:

(INC VAR) ;:add 1 to the variable VAR

(:ZERO VAR) ;;set the variable VAR to 0

(LOOP VAR STAT) :;perform the statement STAT VAR times

(:GOTO LAB) ::jump to the statement with label LAB

Programs in PL consist of an arbitrary number of globally-scoped (positive) integer
variables and statements containing operators of the above forms. Two simple example PL
Programs are:
et variable VO to have the value of V1
(:ZERO VO)
OOP V1 (:INC VO))
Multiply V3 to V4 and store the result in V5

  
 
  

(:ZERO V5)

 
(:LOOP V3 {:LOOP V4 (:INC V5)})

While PL can be shown to be Turing Equivalent 5, we will be interested in the language
subset PL-{: GOTO}. This language subset has two useful properties: first, while it is not
fully Turing Equivalent, it still comprises a powerful set of functions (specifically, the
set of primitive recursive functions)® and second, programs written in PL-{. GOTO} are
guaranteed to halt. Finally, we make two small extensions to the language. First, a :SET
operator, which accepts two variables and sets the value of the first variable equal to that of
the second (As can be seen in in the examples above, this operation is trivially definable
in PL-{: GOTO}; if so desired, it can be considered a macro or subroutine operator.)
Secondly, we define a :BLOCK operator that accepts two statements as arguments and
evaluates the two statements sequentially. (This is essentially just a grouping operation
that has no effect on the overall structure of the language.)

Now, the encoded representation for our programs should have two characteristic

(Goal 1) It should be amenable to the standard GOs.

(Goal 2) The representation should produce only well-formed programs, even when
subjected to the GOs While some representations, e.g. character-strings, might be well
suited for the mechanisms of GOs, the random generation and/or altering of characters
is not likely to produce, say, a useful FORTRAN program. Consequently, it is strongly
desirable that the chosen representation be such that all such generated programs stay in
the space of syntactically correct programs. Not all such generated programs would be
useful (adapation would be expected to correct that); it is only important at this point
that such programs be well formed.

This paper will consider lists of integers as a representation for these programs where
the object the integer represents (variable. operator, etc,) is determined by the integer’s
position in the list. Clearly such a representation satisfies Goal 1 above, the standard GOs
(Crossover. Mutation, Inversion) would be well defined on such a list. To satisfy Goal 2,
we need to define a decoding of an arbitrary hist into a well-formed program.

  

 

THE JB LANGUAGE

‘A first attempt at such a decoding is the language JB. The list of integers is first divided
into statements of some length large enough for the longest statement size, (three in the
present case). Any integers left over at the end of this list are ignored The first of these
statements is defined to be the Main Statement |MS and the remaining Nq, statements are
the Auxiliary Statements |AS,. Syntactically, these statements are interpreted as follows:

{0 42) -> (:BLOCK 4S, 452)

{16 0) -> (LOOP V% ASo)

(219) -> (‘SET V; Yo)

(3.17 8) -> (ZERO Vj;) ..the 8 1s ignored

(4.05) ~> (:INC Vo) :.the 5 is ignored

Here the symbols of the forms V7, and AS, represent. respectively. example Variables
and Auxiliary Statements.

This body of statements is embedded in an environment containing Ny, body-variables
(initialized to 0) and N,, input-variables. At the end of the execution of the program, any
of the Nutot = (Niv + Nov) available variables can be returned as ouput.

184

 
The function is entered by executing the MS, which, typically, will call on one or more
of the AS’s. An example JB program would be:

(00135813214345992)

‘This would be grouped into the following Statements:

(00 1) ;;main statement -> (:BLOCK ASo AS;)

(3.5 8) ;;auxiliary statement 0 -> (:ZERO Vs)

(13 2) ;;auxiliary statement 1 -> (:LOOP V3 AS2)

(1 4 3) ;;auxiliary statement 2 -> (:LOOP V4 AS3)

(459) ;jauxiliary statement 3 -> (:INC Vs)

This is the same as the PL multiplication program above.

‘As can be seen, virtually (see below) any list (of sufficient length) of integers chosen
from the range [0,Nand-1] can be used to generate a well-formed JB program. Where Nrand
= Nutot*Nas*Nop (Nop is the total number of operator types). A particular language object
(variable, AS, operator-type) needed for the program can then be extracted from a given
integer in the list by taking the modulus of that integer with respect to the respective
number above. This ensures random selection over all syntactic types. Two problems
arise from this straight forward use of the JB language. The first, a minor problem, is
that a JB integer-list will not define a correct program when a loop is created among
the Auxiliary Statements. In practice, with a moderate number of AS’s this is a rare
occurence. Moreover, it is easy to rernove such programs during the expansion of the body
of the program. (In any case, this problem will be removed in the TB language below.)

A second. more serious problem is that while the mechanisms of the applications of
the GOs are very simple in the JB language, the semantic implications of their use are
quite complicated. Because of the structure of JB, semantic positioning of a integer-list
clement is extremely sensitive to change. As a specific example, consider a large compli-
cated prograin beginning with a :BLOCK statement in the top-level Main Statement. A
single, unfortunate, mutation converting this operator to a :SET would destroy any useful
features of the program. Secondly, this strongly epistatic nature of JB seems incompatible
with Crossover, given Crossover’s useful-feature-passing nature. A useful JB substructure
shifted one integer to the right will almost certainly contain none of its previously useful
properties.

 

THE TB LANGUAGE

In an effort to alleviate these problems, we consider a modified version of JB. This
language, called TB, takes advantage of the implicit tree-like nature of JB programs.

TB is fundamentally the same as JB except that the Auxiliary Statements are no
longer used. Instead, when a TB statement is generated, cither at its initial creation or
as a result of the application of a GO (defined below), any subsidiary statements that the
generated statement contains are recursively expanded at that time. The TB programs
no longer have the simple list structure of JB, but instead are tree-like. Because we are
simply recursively expanding te internal statements without altering the actual structure
of the resulting program. the TB programs still satisfy Goal 2. Indeed, it can be seen that,
because of its tree-like structure, TB does not suffer from the problem of internal loops
described above. Thus. all possible program trees do indeed describe syntactically correct
programs.

  

 
 

 

An example of a TB program is:

{0 (3.5) (1.3 (1.4 (45))))

This expands to the same PL and JB multiplication programs given above.

‘The standard GOs are defined in the following way:

Random Mutation could be defined to be the random altering of integers in the pro-
gram tree. This would be valid but would encounter the same “catastrophic minor change”
problems as did JB. Instead, Random Mutation is restricted to the statements near the
fringe of the program tree. Specifically: 1) to leaf statements, i.e., those that contain
operators that do not themselves require statements as arguments (:INC, :SET, :ZERO).
And 2) to non-leaf statements (with operators :BLOCK, :LOOP) whose sub-statement ar-
guments are themselves leaf operators. Inside a statement, mutation of a variable simply
means randomly changing the integer representing that variable. Mutating an operator
involves randomly changing the integer representing the operator and making any nec-
essary changes to its arguments, keeping any of the integers as arguments that are still
appropriate, and recursively expanding the subsidiary statements as necessary.

Similarly, following Smith®, we restrict the points at which Crossover can occur.
Specifically, Crossover on TB is defined to be the exchange of subtrees between two parent
programs; this is well-defined and clearly embodies the intuitive notion of Crossover as the
exchange of (possibly useful) substructures. This method is also without the problems that
Crossover entails in JB. In a similar manner, we could define Inversion to be the exchange
of one or more subtrees within a given program.

 

EXAMPLE

As a concrete example, an attempt was made to “evolve” concise, two-input, one-
output multiplication functions from a population of randomly generated functions. As
discussed by Smith®(""P5) a major problem here is one of “hand-crafting” the evaluation
function to give partial credit to functions that, in some sense, exhibit multiplication-like
behavior, without actually doing multiplication.

After much experimentation, the following scheme for giving an evaluation score was
used. For a given program body to be scored, several instantiations of the function were
made, each having a different pair of input variables {IVs]. Each of these test functions
was given a number of pairs of input values and the values of all of the function’s variables
were collected as output variables |OVs). The resulting output values were examined and
compared against the various combinations of input values and IVs. The following types of
behavior were noted and each successive type given more credit: 1] OVs that had changed
from their initial values. (Is there any activity in the function?) 2] Simple Functional
dependence of an OV on an IV. (Is the function noticing the input?) 3] The value of |
an IV is a factor of the value of an OV. (Are useful loop-like structures developing?) 4]
Multiplication. (Is an OV exactly the product of two IVs.)

Furthermore, rather than accept input and/or output in arbitrary variables. scores
were given an extra weight if the input and/or output occurred in the specific target
variables. To ensure that the fenctions remain reasonably short, functions beyond a certain
length are penalized harshly. Finally, a limit is placed on the length of time a function is
permitted to run; any functios that has not halted within in this time is aborted.

 

 

 

 

 

  

186
 

‘A number of test runs were made for the system with a population size of fifty. These
‘were compared against a set of control runs. The control runs were the same as the regular
runs except that there was no partial credit given; all members of the population were given
a low, nominal score until they actually started multiplying correctly. All runs were halted
at the thirtieth generation. The system produced the desired multiplcation functions 72%
more often than the control sample.

FUTURE WORK

Finally, a number of questions remain concerning the present system and its various
extensions:

Extensions of the Present System: Generation of other types of simple arithmetic
operations seem to be the next step in this direction. Given the looping nature of the
underlying PL language it seems obvious that the system should be well suited for also
generating addition functions. However, it is less clear that it would do equally well
attempting to generate, e.g., subtraction or division functions, to say nothing of more
complicated mathematical functions. Indeed, the results of the control case above show
that it is difficult not to produce multiplication in this language; generation of other
types of functions would prove an interesting result. On the other hand, are there other,
comparably simple, languages that are better suited to other types of functions?

Concerning Extensions of the Language: A useful feature of the original JB language
is its suitability for the mechanisms of the GOs. Can some further modification be made
to the current TB language to bring it back into line with a more traditional bit-string
representation? Are these modifications, in fact, really desirable? Alternatively, would it
be useful to modify the languages to make GOs less standard? For example, would it be
productive to formalize the subroutine swapping nature of the present method of Crossover
and define a program as a structure comprising a number of subroutines, where the appli-
cation Crossover and Inversion was restricted to the swapping of entire subroutines, and
Random Mutation restricted to occurring inside the body of a subroutine?

ACKNOWLEDGEMENTS
I would like to thank Dr. Dave Davis for innumerable valuable discussions and Dr.
Bruce Anderson for preserving the environment that made this work possible.

REFERENCES

1. Holland, John H., Adaptation in Natural and Artificial Systems, Univerity of Michigan Press,
1975.

2. Bethke, A., Genetse Algorithms as Function Optimizers, Ph.D Thesis, University of
Michigan, 1980.

3. Smith, S.F., A Learning System Based on Genetic Adaptive Algorithms, Ph.D. Thesis,
Univ. of Pittsburghm, December, 1980.

4. Holland, J.H. and J. Reitznan, Cognitive Systems Based on Adaptive Algorithms, in
Pattern Directed Inference Systems. Waterman and Hayes-Roth, Ed. Academic Press, 1978.
5, Brainerd, W.S. and Landweber L.H., Theory of Computation, Wiley-Interscience, 1974
6. Smith, S.F., Flezible Learning of Problem Solving Hueristics through Adaptive Search,
Proc. IICAI-83, 1983.

 

 

 
 

ADAPTIVE *CORTICAL™ PATTERN RECOGNITION

by

Stewart W. Wilson

Rowland Institute for Science, Cambridge MA 02442

ABSTRACT

It ts shown that a certain model of the primate
retmno-cortical mapping “sees” all centered objects
with the same “obyect-resolution” , or number of dis-
tinct signals. independent of apparent size. In an
arpficial system, this property would permit recog-
mition of patterns using templates in a cortex-like
space It is suggested that with an adaptive produc-
tion system such as Holland's classifier system, the
recognition process could be made self-organizing.

  

 

INTRODUCTION

Templates are generally felt to have Innited use-
fulness for visual pattern recognition. Though they
provide a simple and compact description of shape,
templates cannot directly deal with objects that, as
1s common, vary in real or apparent (ie., imaged)
size However, the human visual system, in the step
from retina to cortex, appears to perform an auto-
matic srze-normalizing transformation of the retinal

 

image This suggests that pattern recognition using
templates may occur in the cortex, and that artsfi-
cial systems having a similar transformation should
be investigated. Properties of the retino-cortical
mapping which are relevant (o pattern recognition
are discussed in the first half of this paper. In the
second half, we outline how an adaptive production
system having template-hke conditions might recog-
nize patterns that had been transformed to a “cor-
tucal” space.

 

THE RETINO-CORTICAL MAPPING

Recent papers in image processing and display,
and in theoretical neurophysiology, have drawn at-
tention to a nonlinear visual field representation
which resembles the primate retno-cortical system.
Weiman and Chaikin [1) propose a computer archi-
tecture for picture processing based on the complex
logarithmic mapping, the formal properties of which
they analyze extensively. They and also Schwartz 2]

 

 

 

 

Figure 1. “Retina” consisting of
each connected to an “MSU” in the “cortex” of
Fig 2

 

188

Figure 2_ Each MSU receives signals from a data
field in Fig. 1 Letters indicate connection pat
tern,
 

present. physiological and perceptual evidence that
the mapping from retina to (striate) cortex embod-
ies the same function Wilson (3) discusses the map-
ping in the light of additional evidence and exam-
ines its potential for pattern recogmition Early re-
lated ideas in the pattern recognition literature can
be found in Harmon’s [4] recognizer and in certain
patents [5].

 

A hypothetical structure (adapted from |3)) sche-
matizing important aspects of the retino-cortical
(R-C) mapping 1s shown in Figures 1 and 2. The
“retina” of Figure 1 consists of “data fields" whose
size and spacing increase linearly with distance from
the center of vision The “cortex” of Figure 21s a
matrix of identical “message-sending units” (MSUs)
each of which receives signals from its own retinal
data field, processes the signals, and generates a rel-
atively simple output message that summarizes the
overall pattern of light stimulus falling on the data
field. The MSU's output message is drawn from
‘a small vocabulary, ie, the MSU's input-output
transform is highly information-reducing and prob-
ably spatially nonlinear

 

Further all MSUs are regarded as computing the
same transform, except for seale. That is, if two
data fields differ in stze by a factor of d, and their
luminance inputs have the same spatial pattern ex-
cept for a scale factor of d, then the output messages
from the associated MSUs will be identical. (Physi-
logically, the cortical hypercolumns (6] are hypoth-
esized in [3] to have the above MSU properties )

‘The pattern of connections from retina to cortex
is as suggested by the letters in Figures 1 and 2.
Data fields along a ray from center to periphery map
into a row of MSUs, and simultaneously, each ring
of data fields maps into a column of MSUs The
lefernost column corresponds to the innermost ring,
the 12 o'clock ray maps into the top row, and so
forth.

It is convenrent to describe position in retinal
space by the complex number z = re'®, where r and
¢ are polar coordinates We can denote cortical po-
ition by w = u + iv, where u is the column index
increasing from left to right and v is the row in-
dex increasing downwards For the mapping to have
complex logarithmic form, it must be true that the
position w of the MSU whose data field is at 2 satis-
fies w = log z or, equivalently, u = logr and v=¢

‘That the equations do hold can be seen from Fig-
ure 1 The distance Ar from one data field center
to the next is proportional to r itself, which implies
that u is logarithmic in r Similarly, the fact that
all rings have equal numbers of data fields directly
implies that v is linear in polar angle Thus (with
appropriate units) we have w = logz. (The sin-
gularity at z = 0 can be handled by changing the

 

 

 

 

function within some small radius of the origin. For
present purposes we are interested in the mapping’s
logarithmic property and will ignore this necessary
“te)

 

Figures 8-5 (at end of article) review three salient
properties of the R-C mapping that have been noted
by previous authors The photos on the left in each
figure are “retinal” (TV camera) images On the
right are crude “cortical” images obtained by the
expedient of sampling the retinal data field centers,
‘The mapping used has 64 MSUs per ring and per
ray

 

 

Figure 3 shows a clown seen at two distances
differing by a factor of three The cortical um-
ages, though “distorted”, are of constant size and
shape. Also shown is the result of rotating the clown
through 45 degrees again, cortical size and shape re-
main the same. The pictures show how retinal scale
change and rotation only alter the position of the
cortical image Figure 4 illustrates these effects for
a texture The cortical images are again the same
except for ashift The mapping thus brings about a
kind of size and rotation invariance which one would
expect to be useful for pattern recognition

 

 

Figure 5, m contrast, shows that the mapping
Incks translation invariance ‘The same clown is seen
at a constant distance but in three different posi-
tions with respect to the center of vision. Transla-
tion non-mvariance would appear to be a distinct
disadvantage for pattern recognition.

As the clown recedes from the center in Figure
5, its cortical image gets smaller and less defined.
‘The effect illustrates how in a sense the mapping
optimizes processing resources through a resolving
power which is highest at the center and decreases
toward the periphery. This variation is sometimes

sds

 

cited as a useful property of the eye, and
cussed in connection with an artificial retin:
structure by Sandint and Tagliasco (7).

 

 

OBJECT-RESOLUTION

‘The pattern recognition potential of the map-
ping’s size-normalizing property 1s best seen by defin-
ing a somewhat unusual notion of resolution. Recall
first that the resolving power p of a sensor is the
number of distinct signals per unit visual angle, in
the case of a linear sensor (such as a TV camera), p
15 a constant. Suppose we ask of a system: when its
sensor images a centered object of half-angle A, how

any distinct signals, corresponding to the object,
will the sensor produce? Let us name this quan-
tity the system's object-resolution, Rp. Then, in the
case of a linear system, it is clear that Re will be
proportional to p*A. ‘That is, Rp will depend on
the distance or “apparent size” of the object, or on
the relationship between perceiver and object.

 

 

 

 
The resulting amount of information may be in-
sufficient for recognition, it may be just mght, or
it may overload and therefore confuse the recogn
tion process. This uncertainty leads to the scale or
“grain” problem noted by Marr |8) and others and
to Marr and Hildreth’s |9| proposed solution of com-
jons at several resolutions which are Inter to be
combined. The grain problem 1s also a mousation
for the application of relaxation techniques {10} in
pattern recognition

  

 

Let us now ask what is the object-resolution of an
R-C system For such a system the resolving power
is p = ¢/r, with r the distance from the center of vi-
sion The constant ¢ can be defined as the number
of MSU outputs per unit visual angle at an eccen-
tricity of r = 1. Object-resolution Ro can be found
by taking a centered object of half-angle A and in-
egrating over the object from a small inner radius
cA (c< 1) out to A We have

 

e

[geet

independent of A.

Ro

   

nln A= 240%tn

Thus the mapping’s object-resolution or spatial
quantization of the seen object is independent of the
object's apparent size or distance, and independent
of its actual size as well. It depends only on ¢ (and

Given a fixed value of ¢, the system may be
to see every centered object, regardless of size,
equally well, independent of the perceiver-object re-
lationship. (Strictly speaking, the above integral in-
cludes only a fraction 1—¢? of the object, the “outer”
fraction. But if ¢ 1s very small the omitted fraction
€ wall contain an insignificant portion of the object's
pattern )

 

 

The object-resolution of the R-C mapping can
be thought of an terms of the number of data fields
per retinal ring. By mentally superimposing and
then expanding and contracting a centered object
on Figure 1, one can see that it 1s examined in an
equivalent way at any scale. In fact, it 1s convenient
to use the number of fields per ring as a measure of
Re.

‘The R-C mapping’s constant object-resolution is
the significant difference between it and a linear sys-
tem In the remainder of the paper we will develop
implications of this difference First, why in an im-
portant sense the “grain” problem disappears. Sec-
fond, why Gestalt-like templates are, cortically, suit-
able for pattern recognition. Third, in outline, how
the cortical approach with templates allows a sep
rate adaptive theory due to Holland (11) to be ap-
plied to pattern recognition—and in the process may
solve the mapping’s apparent problem of translation
non-invariance

 

 

 

THE “GRAIN" PROBLEM

Basically, a “grain” problem e
1 priori way to tell whether the size of the elements
with which the perceiver 1s looking is the same as
that of the optimally informative element of the ob-
Ject or scene. In the linear case, we found that the
information about an object may be insufficient, just
right, or overloading depending on (1) the perceiver-
‘object relationship and of course on (2) the amount
of detail m the obyect itself.

   

In the R-C mapping case, the information is
constant, dependent only on the perceiver. Thus
(1) above—uncertainty due to the perceiver-object,
relationship—disappears But the information may
still, st scems, be insufficient, just right, or overload-
ing—depending on object detail

 

 

We can develop a criterion for the latter as fol-
lows. Let an object's “object frequency spectrum”

  

be the two-dimensional Fourier spectrum of geo
metrically similar object of unit size, and let fo be
the highest significant (for diserimination) frequency.
in such a spectrum. Then, roughly, we may say that
a mapping with resolution Rg (in units of fields per

ring) provides sufficient information about an object
>

if Roz fo.

But this bound 1s not ultimately limiting It only
says whether information from one fization 1s suffi
cient for recognition. Peculiarly, by the mapping’s
constancy of information, any fixated local part of
an object is seen in as much detail as is the whole ob-
ject. Thus, if Ro < fo, the system can always gather
‘enough information by scanning, 1.e., by moving the
center of fixation to any part not seen clearly. Re is
therefore always sufficient, though several fixations
may be required

 

 

 

 

Can there be too much resolution? Only if ob-
jects turn out to be simpler than expected. But
often this can be known in advance. In contrast,
in the linear ease, superfluous resolution will always
occur whenever object images become large.

 

TEMPLATES

In any digital computer implementation, a tem-
plate for pattern matching consists of a finite (usu-_
ally rectangular) array of cells in each of which the
relative brightness (o be matched is specified. The
array has a fixed resolution since the number of cells,
is fixed.

One major traditional problem with templates is
a variation of the “grain” problem: Unless the tem-
plate’s resolution 1s the same as the system’s object
resolution, there is virtually no chance of getting a
correct match. The R-C mapping offers a solution
since the system's object-resolution 1s fixed, and the

190
 

resolution of all stored templates can be made ex-
actly commensurate. For instance, the system can
acquire its templates by copying its own cortical
MSU output images of identified objects. The same
objects when later presented in other sizes will be
“seen” in the same way.

 

‘Templates have other problems, e.g., orientation
and brightness variations may lead to mismatch.
‘These will be taken up later. Our analysis suggests,
however, that templates may yet have an important
role to play in general pattern recognition, provided
the matching occurs in a cortex-like space.

 

 

 

OUTLINE OF AN ADAPTIVE CORTICAL
PATTERN RECOGNITION SYSTEM

This section will outline a system concept com-
bining the R-C mapping, a production system based
‘on cortical templates, and the theory of adaptation
due to Holland,

A visual world mapped as in Figures 1 and 2
suggests a natural polarity between center and pe-
riphery The same centered object, as it grows big-
ger, expands toward the periphery, and its cortical
image, as noted, shifts as a unit from the left side
of the “cortex” toward the right side. The impli:
tion is strong that processing, in the cortex, should

 

 

right. The pattern of an object, whatever its degree
of shift from the left, will be encountered “sooner
or later” and thus be available for matching against
templates

Further reflection suggests that rather than work-
ing with two-dimensional templates, it might be
simpler to use one-dimensional column templates—
the identification of a pattern consisting of succes-
sive matching of the appropriate column templates.
Storage would be saved because a given column tem-
plate would often be a contributor in more than one
two-dimensional match.

  

An appropriate structure for performing the cor-
ion of successively matching column templates
form of production system in which (1) the con-
dition of each production includes a column tem-
plate pattern and one or more internal message pat-
terns, and (2) the action is an internal message to
be placed on the common message list. (These in-
ternal messages are distinct from the MSU output
messages To avoid confusion, the internal messages
will be called i-messages.)

 
  

 

In addition, a separate set of “effector” produc-
tons, whose conditions consisted only of i-message
patterns, would monitor the i-message list. When
an appropriate i-message appeared on the hist, the
effector would fire Its “action” would be (1) an ex-
ternal action such as moving the center of vision, or
(2) an “internal” action also modifying the system’s

 

 

frame of reference but. not directly observable from
the outside (more on this later), or (3) a signal to
the outside world denoting a pattern name.

Many details need to be filled in to make this an
‘operating system However, enough has been given
Co suggest a process in which starting at the left end
of the cortex, columns would be scanned and pro-
ductions would fire in dependent sequence (the de-
pendency based on i-messages as well as the column
information being matched), resulting ultimately: in
an effector firing whose signal named the object in
view.

 

 

Production systems have not usually been con-
sidered in connection with pattern recognition be-
‘cause production conditions typically deal with *nor-
malized” or logical variables and, given the gra
problem, patterns in linear vision are anything but
normalized. In cortical space, however, patterns are
normalized so that there the power of productioi
can potentially be exploited

But we can go farther One part of the adaptive
theory due to Holland is concerned with “cognitive
systems” based on sets of productions called “clas-
sifiers” The form of a classifier 1s, most generally, a
string whose condition part consists of a fixed length
“environmental detector pattern” together with one
for more i-message patterns, and whose action part
is an output i-message or effector action. The m-
portant point for us is that the “environmental de-
tector pattern” has exactly the form of the column
templates we have been considering, so that clas-
sifier systems and the adaptive theory may be di-
rectly applicable to “cortical” pattern recognition
It has been demonstrated [13-16] that given an ap-
propriate external reward regime a classifier system
can evolve a set of classifiers that is adapted to, or
“fit”, in its environment. This means in particular
that the conditions of the classifiers recognize what
matters, and the -messages and actions are appro-
priate Much further research must be done, but by
combining classifiers with R-C vision, a new path
would appear to be open to the objective of a self-
organizing visual pattern recognition system.

If the adaptive properties of the Holland sys-
tem be assumed, we can suggest how the produc-
tion structure given earlier might deal with non-
centered objects. They look different from their cen
tered forms" this 1s the mapping’s translation non-
invariance The problem would be solved if classi-
fiers existed which would react to the off-center form
and lead to an effector which would move the center
of vision so as to center the object (at which point
“standard” classifiers could recognize it).

 

 

 

 

 

 

At first sight, the evolution of this kind of se-
quence seems implausible: you would need classifiers
for every object in every peripheral position How-

 
ever, the mapping helps by reducing the detail seen
an an object. as it recedes toward the periphery; 1n
the linnit, every object becomes ust a “blob”. This
suggests that only a relatively small number of dis-
tinct classifiers would be needed to “acquire” any
object for standard (centered) inspection

 

 

‘There remains the problem, not of the isolated
object, but of the more-or-less centered one—such as
1 face—which is still not centered quite weil enough
to fire its standard classifiers How can an appro-
priate centering movement come about? For this
question, and related ones, we need to consider the
“internal effectors” mentioned earlicr

 

Three are important in the present discussion
Object-Resolution (OBRES), Azimuth (AZIM), and
Brightness Gain (BG AIN). OBRES 1s an effector (or
set of them} which, given appropriate i-messages,
Il alter the system’s obyect-resolution (in effect
changing the number of data fields per ring in Fig-
ure 1) This permits seeing an object (regardless,
of course. of its apparent size) in detail, or more
coarsely, depending on the iemessage list circum-
stances The evolution of OBRES effectors. ap:
propriate to different circumstances would occur
through the adaptive mechanisms

 

   

 

   

If we now recall the problem of the slightly off
conter face, it seems plausible that, given some re-
duced level of object-resolution, most different faces
with that degree of decentering could be matched
by a relatively small (and thus practical) set of clas-
sifiers These would lead to e movement command
bringing the face to the center, where it would be
recognized in detail (after, perhaps, restoration by
OBRES of a higher Re).

‘The AZIM internal effectors set the direction
the system regards as “up” In cortical space, this
amounts to shifting the input column vector along
its length by a definite amount before matching clas-
sifier template patterns against it The purpose of
AZIM 1s, of course, to allow a given set of classi-
fiers to be effective for recognition even if the object
is not in standard orientation. But how will the
right azimuth be set in such a case? We again have
recourse to the evolution of relatively coarse classi-
fiers which, given reduced object-resolution through
OBRES, will recognize the presence of a nonspe-
cific (“obtong”, say) object at a certain orientation.
‘These would lead to the right AZIM acting, and spe-
cific recognition could then occur

  

 

Finally, BGAIN is a set of internal effectors to
deal with the persistent problem of setting the right
brightness level for template matching The intent
is that the appropriate gain will be determined (via
the i-message list) by what is seen, and that the
evolution of an appropriate set of BGAIN effectors
will again be under adaptive control in the Holland

 

 

sense

‘The various internal effectors, and the external
one resulting 1n snovement, are concerned with the
system's “point of view” on its visual input, that
is, with systematic transformations which will allow
the system’s form detector set—the classifiers—to
function efficiently.

 

SUMMARY

We began this paper with the retino-cortical
mapping and showed how it “saw” centered objects
with a resolution independent of the object’s size
Constant object-resolution led to a renewed prospect
for template matching in general pattern recogm-
tion. Fixed size templates permitted the power of
production systems to be brought to bear. Finally,
the applicability of Holland’s adaptive theory to pro-
duction systems allowed us to suggest that a recog-
nition system based on the mapping might be made
self-organizing, in the process overcoming the map-
ping'’s “problem” of translation non-invariance.

  

REFERENCES

[1] Weiman, C.F.R. & Chaikin, G. Logarithmic
spiral grids for image processing and displ
Computer Graphics and Image Processing, 1A
197-226. 1979.

Schwartz, EL. Spatial mapping in the primate
sensory projection. Biological Cybernetice, 25,
181-194, 1977,

Wilson, S.W. On the retino-cortical mapping
Int J. Man-Machine Studies, 18, 361-389,
1983.

 

{2}

 

{3}

‘4] Harmon, LD. Line-drawing pattern recognizer

Electrontes, 39-48, Sept 2, 1960.
[5] Singer, J R Electronic recognition.

US. 3,255,437, Jan. 7, 1986.

Burckhardt, CB., et al Pattern recognition
apparatus utilizing complex spatial filterin
USS. 3,435,244, March 25, 1969.

McLaughlin, J-A , et al. Pattern recognition ap-
paratus and methods invariant to translation,
scale change, and rotation.

US. 3,614,736, October 19, 1971

Hubel D.H. & Wiesel, TN. Uniformity of mon-
key striate cortex: a parallel relationship be-
tween field size, scatter, and magnification f
tor J. Comp. Neurology, 188(3), 295-805.
1974

 

  

 

{6}

 

7] Sandini, G. & Taghasco, V_ An anthropomor-
phic retina-like structure for scene analysis. Com-
puter Graphics and Image Processing, 14, 365-

1980.

 

 

|8] Marr, D. Early processing of visual information

192

 
Philosophical Transactions of the Royal Society
of London B, 275, 483-524, 1976.

|9] Marr, D., & Hildreth, E. Theory of edge detec
tion. Proc. Royal Society of London B, 207,
187-219, 1980.

10] Davis, LS & Rosenfeld, A. Cooperating pro-
cesses for low-level vision a survey. Artificial
Intelligence, 17, 245-263, 1981.

11) Holland, J.H. Adaptation in Natural and Ar-
tificial Systems Ann Arbor U of Michigan
Press, 1975.

[12] Evidence and a model for scanning in humans
is presented in Wilson, SW , Strobe imagery:

scanning model. Submitted for publication.

 

 

 

EE

13) Holland, J.1L, & Reitman, J.S. Cognitive sys-
tem based on adaptive algorithms In Pattern-
Directed Inference Systems, Waterman, D.A &
Hayes-Roth, F. (eds.). New York: Academic
Press, 1978

[14] Booker, L. Intelligent behavior as an adapta-
tron to the task environment. Ph D. Disserta-
ton Sciences).

 

tion (Computer and Communic:
‘The University of Michigan, 1982

[15] Goldberg, DE Computer-arded gas pipeline
‘operation using genetic algorithms and rule learn-
ing Ph.D. Dissertation (Civil Engineering),
‘The University of Michigan, 1983.

[16° Wilson, S.WV. Knowledge growth in an artificial
animal. "These Proceedings

 

 

 
 

  

 

194
 

 

 

 

 
 

  

Fig 5
 

MACHINE LEARNING OF VISUAL RECOGNITION USING GENETIC ALGORITHMS

Arnold C. Englander
Itran Corporation, Manchester, N.H.

ABSTRACT

This paper briefly describes
preliminary work with an
application of genetic algo-
rithms. Genetic algorithms
are used as the mechanism by
which a vision recognition
system learns to classify dis-
torted examples of different
but similar classes of image
patterns. The system develops
increasingly effective collec-
tions of class specific
feature detectors producing
increasingly unambiguous,
hence reliable, recognition
performance. Algorithms and
early simulation results are
described.

Genetic algorithms are applied
to a special case of a diffi-
cult optimization problem
which is emerging in several
forms in computational vision
research. The general optimi-
zation problem has a
performance measure that is
easily formulated as an algo-
rithm involving the composi-
tion of both functionals and
logical operations. However,
the performance measure is not
itself a smooth, much less
convex, functional. This pre-
cludes the application of most
conventional optimization
techniques.

I. INTRODUCTION
A variety of techniques for

the machine recognition of
objects in images exist in the

literature and in demonstrated
machine vision technology
(1,2,3]. There is an image
recognition problem which is
difficult for all of these
techniques but which arises in
practical applications. The
problem combines two
troublesome characteristics.
First, pattern classes have
prototypes which correlate
highly with the prototypes of
different pattern classes.
Second, the pattern examples
(to be classified) are randomly
distorted and occluded.
Practical cases of this problem
arise in reading characters
stamped in certain industrial
materials such as rubber and
cast metal. Other examples are
found in robot vision "bin-
picking" applications involving
certain assortments of parts.
This paper describes the use of
genetic algorithms as the basis
of a machine vision system
which improves its own
performance with such
recognition problems by
learning from labeled examples.?

II. THE OPTIMIZATION PROBLEM

Experience in applying
conventional recognition
techniques to difficult

industrial vision problems has
led to this view: Robust
recognition performance relies
on the identification and use

1 For a general and thorough introduction to genetic

algorithms, including general analytical results, see the
pioneering book by Holland [4].

 
of a large set of local image
features having two
properties. First, important
local features are those
which, either alone or in
small groups, disambiguate the
recognition process by being
necessary and/or sufficient
("essential") evidence for

classification. Second, such
features and groups of
features must be likely

survivors of the distortion
and occlusion operations under
which image pattern examples
are generated from class
prototypes.

Obviously essential features
are application dependent.
They depend on the class
Prototypes and on the
distorting and occluding
Processes. The problem's
strong dependence on
application particulars leads
to the requirement that the
recognition system improve its
own performance by associative
learning from labeled
examples.

It is desirable to identify
many small features which are
essential when detected alone
or in a variety of groupings.
This way the features which
contribute to the recognition
process are likely to survive
the random distortions and
occlusions. The detections of
essential features should be
not only graded and combined
in weighted sums but combined
in ways which allow pieces of
evidence to "veto" the
significance of other pieces
of evidence. Intuitively, the
behavior of algorithms based
on such ideas will be
complicated by implicit non-

198

linear, "competitive" and
"cooperative" interactions
between the evidence derived
from the detections of
essential features.

II. USE OF GENETIC ALGORITHMS

Applying these views to machine
learning of visual recognition
leads to an optimization
problem over a space of
populations of 2-D detector
arrays where each array is a
composite of templates for the
detection of essential image
features, The overall
population of detector arrays
is divided into class specific
sub-populations each of which
is optimized to respond
maximally to examples of a
particular image pattern class.
The recognition algorithm
classifies unidentified images
by assigning them to the
detector array sub-population
producing the highest sum of
individual recognition
responses, The recognition
response of an individual
detector is the product of a
match between the detector and
the input image, and a term
called "strength". The
strength of a detector array is
indicative of the detector
array's past performance in
disambiguating recognition
decisions.

Optimization of a sub-
population of class specific
detector arrays means finding
detectors which strongly match
input image examples of the
specified class, but which only
weakly match input image
examples of other classes. This

 
 

is difficult because the
different image pattern
classes have prototypes which
are alike in the sense of
being highly cross-correlated.
This optimization problem
reflects the desired strategy
and intuitively seems simple.
However, it is not easy to
solve. The problem's per-
formance measure on individual
detector arryas is composed of
functionals and logical
operations, It is not itself
a smooth, much less convex,
functional. Such optimization
problems are unsolvable by
most conventional methods.
Because genetic algorithms
impose unusually few con-
straints on the formulation of
optimization problems they are

applicable to this problem.*

The match between detectors
and input images involves a
"matchscore" which is common
to most genetic algorithms.
The strength of detectors
develops iteratively. During
the associative learning phase
of the system, the strength of
each detector is increased
each time the detector's
response is above the average
response of all detectors and
the class origin of the input
image and the class
specificity assignment of the
detector are the same. The
strength of a detector is
decreased each time it
produces an above average
response to an input image

originating from a class other
than the class to which the
detector's sub-population is
being optimized to recognize.

Here, an image pattern is a 2-D
array of binary valued picture
elements, or "pixels". (This
corresponds to a 2-D map of the
zero crossings in a digital
image processed by convolution
with a difference of gaussians
(DOG) operator for the
detection of edges. The
resulting zero crossings are
useful in portraying the
boundaries of objects in the
scene.) The image patterns are
randomly distorted and occluded
examples of prototypes from one
of several distinct, but
similar, image pattern classes.

A detector array is a 2-D array
of pixels of the same size as
the image patterns. Here each
pixel takes one of three
symbols, {0,1,#} where {0,1}
indicate values taken by pixels
in image patterns and #
indicates the "don't care"
condition in the usual genetic
algorithm matchscore. A
standard matchscore is used in
mating image patterns to
detectors arrays by simply "un-
winding" the image patterns and
detector as taxa type character
strings (over {0,1} for image
patterns and over {0,1,#} for
detectors).

Genetic algorithms optimize the
class specific sub-populations
of detector arrays, indirectly,

2 Other cases of such optimation problems are emerging in

computational vision research [5].

One case involves the

goal of combining the information of various visual

processes (stereopsis, motion,

and "shape from-shading" for

example) into a single interpretation (of 3-D or "2-1/2-D" for
example), which is optimal under a performance measure which
combines functionals and logic.

applicable to such problems as well.

Genetic algorithms may be

 
by operating on the individual
detector arrays in each
separate, class-specific sub-
population. Restricting
"mating" and "replacement"
operations to taxa within the
same sub-population, two
"parents" are selected (in
each sub-population, at the
completion of each recognition
trial involving labeled
examples, hence changes in
strengths). The "parent" taxa
are selected according to the
detectors returning the two
highest recognition responses
(the product of the match with

the current input image
example and the detector
strength) or with

probabilities proportional to
the recognition responses.
The two "parents" generate two
"offspring" under genetic
operators and the "offspring"
each replace an "individual"
judged to be "weak" for having
one of the two lowest
strengths of the taxa in the
sub-population. The
"offspring" enter the sub-
population with strengths
which are a fraction of the
average strength of the two
"parents" and the strengths of
the "parents" are reduced to
match that of their
"offspring".

These selection rules reflect
heuristic arguments and
experimentation. "Parents"
are selected as to recognition
responses to ensure that they
are "strong" for having con-
tributed to disambiguation in
the past, and that they are
well matched to the current
input example. "Weak" indivi-
duals are "un-selected" by low
"strength" alone, rather than

200

 

by the current match-"strength"
product, to avoid losing
detector arrays which tend to
be useful but match poorly with
the current input example
(which is randomly distorted
and occluded).

Early simulations involved
standard operators of genetic
algorithms: "cloning", “cross—
over", "inversion", and "muta-
tion", chosen according to pro-
babilities which are fixed for

 

 

each experiment. As is
commonly believed, it is most
useful to assign "crossover"

the highest usage probability.
Experiments were also performed
using Wilson's "imprinting" and
"ternary intersection" opera-
tors, with low usage probabili-
ties. Wilson's operators seem
relevant and useful to this
problem [6].

III. EARLY SIMULATION RESULTS

Early simulation results are
promising in that self-optimi-
zation by genetic algorithms is
obvious. The recognition
system, operating in training
mode, clearly improves its
cumulative average of correct
recognitions from very low
initial percentages to
moderately high percentages
over a few hundred trials. In
simulations involving 4 pattern
classes of 2 prototypes each, 4
sub-populations of detector
arrays having 32 detector .
arrays each, and image and
detector arrays of 32 by 32
pixels, the system averaged
correct recognitions 25% of the
time for the first 100 or so
trials, rising exponentially to
78% correct recognitions after
 

1000 trials. In such simula-
tions the detectors were
initialized with pixels con-
taining 0,1,#, with equal pro-
bability and Wilson's genetic
operators were used randomly
with small probabilities. In
some simulations the system
improved its recognition per-
formance over correlation
based pattern recognition
techniques in a few thousand
training iterations.

As expected, over time, the
system evolves strong detector
arrays which partly resemble
the prototypes of the pattern
classes to which the detectors
are assigned. But the
resemblance is never complete
because detectors must match
features present in examples
of their assigned pattern
class but ignore features
which are also characteristic

of other classes. The
evolution of such detectors is
apparent in the simulations.

IV. CONCLUSION

Preliminary work with an
application of genetic
algorithms has been described.
Genetic algorithms are the
mechanism by which a vision
recognition system learns to
Classify distorted examples of
different but similar classes
of image patterns. This work
addresses an unconventional
optimization problem which
arises naturally from an
intuitive model of visual
learning. Early simulation
results indicate that the
proposed model can lead to the
design of an effective machine
vision system.

REFERENCES

1. R, Duda, P, Hart:
Wiley, New York, 1973.

2. &E. Hall:
Academic, New York, 1979.

3. J. Tou and R. Gonzalez:

p Classificati a

Addison-Wesley, Reading, MA, 1974.

4. Jd. Holland:

University of Michigan, Ann Arbor, 1975.

5. D. Terzopoulos:

"Multilevel Reconstruction of Visual

Surfaces: Variational Principles and Finite-Element

Representations", in

Multiresolution Image Processing
and Analysis, ed. A. Rosenfeld, Springer, New York, 1984

(see page 283).

6. S. Wilson:

"Knowledge Growth in an Artifical Animal",

in Proc. Fourth Yale Workshop on Applications of Adaptive
Systems Theory, New Haven, Conn., 1985.

 
Bin Packing Hith Adaptive Search
Derek Smith
Texas Instruments

1.0 INTRODUCTION

He have looked at the problem of bin packing arbitrarily dimensioned
rectangular boxes into a single orthogonal bin. Figure 1 shows a good bin
packing, the sort we are aiming for. Figure 2 shows a poor bin packing.

The problem is NP-hard in the strong sense, so there is little hope
of finding a polynonial time optimisation algorithm for it (1).
Reasonable approximation algorithms exist which can be guaranteed to be
within 22% of optimal (1).

   

Gur approach has been to use a wrinkle on genetic algorithms (3),
developed in the Texas Instruments Conputer Science Laboratory (2).

2.0 ADAPTIVE SEARCH

The epistatic domain of bin packing has traditionally not been
amenable to adaptive search techniques. This is because it is difficult
to represent a bin packing on which we can do crossover and mutation and
retain either a reasonable packing or a legal packing.

Consider a flip mutation (rotate through 90 degrees) of box 18 in
figure 1. The flip will either cause a illegal bin packing due to boxes
overlapping each other, or if we fracture the packing by moving the
neighbouring boxes away to make the flip legal, will produce a poor bin
packing.

 

Dur solution is to represent the bin packing as a list of the boxes
plus an algorithm for decoding the list into a bin packing. The list is
readily mutatable (flipping boxes), and is amenable to a modified form of
crossover. The decoding algorithm takes any list of boxes and forns a
legal packing. Hence we attempt to produce good bin packings using
Genetic Algorithns.

 

2.1 The Representation

As explained above our representation is a list with an associated
algorithm to apply to the list to produce a bin packing. For effective
search the algorithm must produce legal packings from any operation on the
list. Here we describe two such decoding algorithms.

The first algorithm we call SLIDE PACK. He take each box, in order,

from the list, place it in one corner of the bin, and let it fall to the
farthest corner away, as if under a gravity that only allowed it to nove

202

 
 

orthogonally. The effect is that a box will zigzag into a stable position
in the opposite corner from which it Kas placed. Box 2 in figure 3 shows
the SLIDE PACK algorithm.

SLIDE PACK is fast as there is no backtracking, and is simple to
conpute. Its time complexity is O(nkX2), where n is the number of boxes.
There are nl possible orderings of our list of n boxes. If we associate
a flipped state with each box, this gives us nl2Mn menbers in the set of
all encoded representations. Although we can contrive packings that SLIDE
PACK can never do, we believe that in general we can reach all of the
search space by operating on the list of boxes.

The second algorithm we call SKYLINE PACK. For each box in the list,
in order, we try the box in all stable positions, and in all its
orientations on the partially packed bin. A stable position is where the
box is tucked into a corner, or cave formed by other previously packed
boxes. The algorithm takes its name from the fact that it tours the
skyline formed by the previously packed boxes to find the position it fits
best. Figure 4 shous sone of the places that box 2 is being considered
for by the SKYLINE PACKer.

   

 

Again we have ni possible orderings of the list. However each tine
a we pack a box we try that box in many positions - we are covering nore
of the search space than in the SLIDE PACKing of a box. It is clear that
We can no longer generate all possible bin packings, as a poor placenent
of a box wil! be ignored in favour of a better placement somewhere else on
the skyline. A nore practical question is whether we can represent all
good bin packings. He believe so (again informally) but with less
conviction than with the SLIDE PACK. SKYLINE PACK has time complexity
O(nKKs) .

 

    

With @ randomly generated list SKYLINE PACK will tend to generate a
significantly denser packing than SLIDE PACK, however, it takes longer to
run. Figure 2 is a typical SLIDE PACKing of a randomly generated list,
whilst figure 5 is a typical SKYLINE PACKing. SLIDE PACK can produce good
packings as shown in figure 1 when we apply the adaptive search
techniques. The trade off is whether to run the adaptive search with
larger populations and for more generations using SLIDE PACK, or in the
sane amount of time use SKYLINE PACK for fewer generations. Our
experiments have shown that SKYLINE PACK is more favorable, however with a
better tuning of the adaptive search SLIDE PACK nay produce better
results.

2.2 The Genetic Operators

Our representation of a packing, as described, is the order of the
boxes presented to the packing algorithm. Traditional crossover cannot
operate on such a list. Consider a crossover of list (12345) with (S
4 3241) the crossover point being after the second elenent to produce (1
2321). The list now has boxes 1 and 2 duplicated and boxes 4 and §
missing.

 

 
Hence we use a MODIFIED CROSSOVER which takes the order of the boxes
before the splice from the first list, and the order of the boxes which
remain to be packed fron the second list after the splice point. In the
above example we would generate the list (125 4 3).

Hollands theorens (3) regarding the effectivness of crossover no
longer hold. He have not yet investigated the theoretical aspect of the
modified crossover. However we have experimented with its use; we have
run random search versus our genetic operators, and have found the genetic
operators to produce consistently better results.

 

One of the mutations we have experimented with is SCRAMBLE, that is
randoaly reordering sone portion of the list. At the beginning of the
adaptive search process we can concentrate on SCRANBLing the beginning
portions of the list to evolve a good basis for the packing. As the
evolution proceeds we can move our area of interest father up the list.

 

A FLIP nutation to try different orientations of the boxes is
necessary if the decoding algorithm does not try the box it is packing in
all its orientations. FLIP is applied discretely to boxes in the list.

2.3 The Evaluation

Because we require our evaluation procedure to score dense packings
highly, a straightforward evaluation criteria is the ratio of the area of
the boxes packed to the area of the bin. This works well as an evaluation
of a packing.

 

It is less clear how to evaluate partial packings which are required
in such decoding algorithms as the SKYLINE PACKer where we need an
evaluation of the packing for each position of the box along the skyline,
to choose where to settle it. He have tried numerous ways to measure
partial bin packings. One of the most intriguing is to take the inverse
square of the separation of the box being packed to all the other boxes.
This favors boxes filling in caves, especially if they fit snugly into the
cave. There is sone analogy here to gravitational effects, and indeed
such an evaluation allows us to pack space (as opposed to in a
containing bin) as the boxes are attracted to each other.

 

 

 

 

Graph 1 shows how the density of a partial bin packing falls as the
number of boxes packed increases. This is due to the forming of more and
larger caves by the later boxes. As the evolution continues we form less
caves, and we can see from the graph that by generation 20 we have kept to
about 85% density.

3.0 RESULTS
He have benchmarked our results against a recentiy developed
deterministic bin packing program within our group. This program uses

some heuristics and dynamic programming techniques. Qur program can

204

 
 

produce the same packing density 300 tines faster. Also if a greater
density is required then we can simply allow our program to run for
longer, or run it again. Similarly if a less dense packing is required we
run for only a short time. Graph 2 shous how the density increases as the
evolution proceeds. This is a tremendous practical advantage of this
approach. A practical disadvantage is that each time He run the process
we Will end up with a different packing.

 

 

4.0 FUTURE RESEARCH

 

There is work to be done in the mating of the decoding algorithm and
the genetic operators. In particular, finding ways to operate a portion
of a bin packing without having repercussions on the whole packing.

   

Work is also in progress in making the genetic operators robust to
quantity of data, variation in dimensions of boxes, and variations in the
aspect ratio of the bin.

He are also considering a process which monitors the adaptive search
Whilst it runs. Such a process could vary the importance of the nutations
as the search proceeds. It could bring in mutations to produce diversity
of the search if it were trapped at a local naxina. It could also alter
the size of the population at various stages in the evolution. Currently
such variations are set up at the start of arun, it would be nore
effective to have the process continually monitoring and adapting itself.

 

In order to learn how to implement the monitor process we need to
study how the search space is being explored. Seeing our bin packing
algorithms run by the use of graphics has been very useful in this work to
date. Graph 3 shows the sort of display which He would like in order to
watch the evolution, learn about the process, and write the if
monitoring system we have mentioned. Numbers 1 through 4 are four of the
menbers of the initial population. The trees sprouting from then
represent the performance of their offspring. 1 was a poor initial
packing and soon died away. 4 was a good packing and we can see it
spanned many children in exploring its portion of the search space. Note
also that 2 and 3 are allowed to evolve to maintain diversity in the
search.

 

 

 
 

 

Graph 4 is the sane concept as graph 3 in a search space that we have
completley mapped out and in which we can draw the local naxina,
represented by Is in the graph. We could then test new levels of

  

operators, and different population sizes in a controlled and visible
search space. Graph four shows only two dimensions of such a space, which
for n boxes is n-dimensional.

 
5.0 ACKNOHLEDGEMENTS

This

 

work is only possible because of the enthusiasm, research work,

and utilities for adaptive search al! provided by co-worker Lawrence

Davis.

He thank the referees for their valuable comments.

6.0 REFERENCES

Garey and Johnson, Computers ond Intractability, 1973, H. H.
Freeman.

Lawrence Davis, Applying Adaptive Algorithas to Epistatic
Domains, To appear proc. IJCAI-8S.

John H. Holland, Adaptation in Natural and Artificial Systens,
University of Hichegan Press, 1975.

206

 
 

Graph 1 Density a bin packing proceeds

 

 

 

 

 

cerceament,

Graph 2 Density as search proceeds

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

WP
co

°

Graph 8 Tracing the evolution ~

 
 

Directed Trees Method for Fitting a Potential Function

Craig G Shaefer

Rowland Institute for Science, Cambridge MA 02142

Abstract

The Directed Trees Method 1s employed to find interpolating functions for potential energy sur-
faces The matheinatical algorithm underlying this fitting procedure 1s described along with example
calculations performed using a genetic adaptive algorithm for fitting the Ax unfolding families to 1-
and 2-dimensional surfaces ‘The properties and advantages of the usage of genetic adaptive algorithms
in conjunction with the Directed Trees method are illustrated in these examples.

Seetion: 1. Introduction

How does one choose a mathematical model to describe a particular physical phenomenon?

‘To help im answering this question, we have developed a method called the Directed Trees (DT) method?
for describing Uhe possible structures available to a particular special type of model. the gradient dynamical
systems The gradient dynamical systems are, however, quite general and flexible and hold a ubiquitous
presence in the physical sciences. In the next section we illustrate where this special type of model ‘fits’ into a
very broad class of mathematical models. The DT method employs a relatively young branch of mathematics
called differential topology: “topological” in order to form categories of solutions for gradient dynamical
systems to reduce the problem to the study of a finite number of different categories, and “differenti
1m order to allow for quantitative calculations within these models. For the purposes of this paper, it is
sufficient to say that m the numerical applications of the Directed Trees method, systems of nonlinear
‘equations arise for which we require solutions Although classical numerical methods could be employed for
the solution of these nonlinear systems, we find that genetic adaptive algorithms (GAs) are especially suited
for this purpose and have certain advantages In order to introduce our application of GAs to the solution of
nonlinear systems of equations and be able to discuss the advantages which GAs offer over the more classical
numerical methods, the third section of this paper provides a brief exposition on the topological concepts
inherent to the Directed Trees method and describes the equations that arise in its quantitative applications.
Section 4 contains examples of the usage of genetic adaptive algorithms for solution of these systems.

 

   

  

 

   

 

 

Section: 2. General Mathematical Models

In this paper, we are secking not so much # procedure for calculating the specific solution to the
mathematical model of a physical system, but rather the dev lopment of a model for which we may classify its
solutions into behavioral categories so that one particular solution from each category serves as & paradigm
for all solutions belonging to its category. Obviously. this will greatly simplify the study of the general
solution of a model_In order to do this, however, we first need to restrict the type of mathematical model
to which our classification scheme is applicable To understand where our restricted class of models fits
into the general class of mathematical models, below we will describe the simplifications inherent to our
restricted class. The following table contains a hst of possible variables whose interrelationships we seek
‘These variables include items such as the spatial and time coordinates, and parameters such as the masses
of particles, the refractive indices of mediums, densities, temperatures, etc In addition, our model might

 

 

  

    

   

 

207
gl:2

 

General Mathematical Models

 

 

 

Variable General Term ‘Comments
Zen” i (tion) 1m spatial coordinates
TER t time coordinate
gem pe (uaa) # parameters (mass, refractive index,...)
6 CUBA O10) SOLUTIONS (trajectories... }
DIS G8 tay 1c time derivatives
Dio aaa Sagm ert om 0s spatial derivatives
{ T(6)dt ST(O., dt (7 functionals time integrals of functionals
JS K(6)dz PRO: Ydz, (x funetionalyy spatial integrals of functionals
f LAE 66,F -)P (ern) mtegrodifferential functionals

 

 

 

t= functionals may depend on any of the variables located above them in this table

 

 

 

Table 1 Table containing possible variables, parameters, and functional dependencies for a general mathe-

 

also depend on the derivatives with respect to the time and spatial coordinates as well as integrals whose
integrands are funictions of the other variables or solutions

Suppose we have a physical system for which we have a set of m arbitrary rules that specify the in-
teractions of the variables from Table 1 This leads to the following system of m equations, called an
Integrodifferential System, whose solutions describe the behaviors of the physical system

 

 

a)

Since we have n equations, let us suppose that there are n solutions, thus we take © = (@;,...,Q,) in
what follows. Let us remark that this system forms a very general and flexible mathematical model for
studying physical phenomena. It encompasses almost all mathematical models that are currently employed
in the sciences This system of integrodifferential equations ts however, much too difficult to solve in all
of its generahty — only in very specific cases are solutions even known And virtually nothing 1s known
about how these solutions vary as the parameters are changed. We must make a few simplifications in these
equations before anything can be said about their general solutions These simphfications are very typical
though, for many models in the “hard” sciences have as their fundamental premises the assumptions that
we describe below.

 

 

 

 

To begin, we assume that f does not explicitly depend upon Z, D/6 for j > 1, D§6, fT(S)at,
nor { K(6)dz. Then the system has the form: f = /(6,/,t;D,6) = 6, for which more can be said
concerning its solutions. Instead of studying this system, though, we continue with a further simplification
concerning the dependency of f on the time derivatives, and, in particular, we consider those f of the form,

 

   
Durected Trees Method Jor Fitting a Potential Function — §1°3

 

F = D6 - F'(6; At) = G. Note that the function j’ appears to be similar to a force vector. In effect,
the above system of equations describes the situation in which the rates of change of the solutions are
proportional to a vector that depend upon the solutions themselves This type of system arises in classical
mechanics and is usually called a Dynamical System. If the further restriction that the forces do not explicitly
depend on the time, then we have the following system of equations, which form an Autonomous Dynamical
Systere f = D,O ~ f'(6; 7) = 0. A few useful statements can be made about the solutions of this type of
system of equations and their behaviors as the parameters Fare varied We, however, will agam continue to
make one further simphifying assumption on the form of the f*. We noted above that the vector function,
Fis of a form similar to the forces in kinematics and electrodynamics _If, in fact, f' is a true force, then it
can be taken to be the negative gradient of some scalar potential 6: f' = ~Dg¢(6;A). Thon we have the
system

 

   
    

   

 

F= D6 + Deol

 

iA) , (2)
which is termed a Gradient Dynamical System. Many very powerful statements can be made about the ©
and their behaviors as functions of f for this system Oftentimes, we are concerned with the *stationary”
solutions of (2), 1¢., solutions which are time-independent These stationary solutions require the forces to
we require Dg¢(6; A) = 0. This equation determines what are called the equilibria,

of the grachent system ‘The most powerful and general statements can be made about equilibria and
how they depend upon their parameters.

   

vanish, in other words

 

‘The solutions, 6, of the above systems are merely generalized coordinates for the physical systems, and
thus, following the standard nomenclature, we replace © by ¥. For example, these solutions, z, might be
the positions of equilibria as functions of time, the Fourier coefficients of a time series, or even laboratory
measurements.

  

We have thus shaved the general mathematical model, (1), of a physical process to the specific case of
examining the behaviors of scalar potential functions, 4(Z, A) It 1s for these special cases that differential
topology yields the most useful results.

In the next section we examine the primary results of singularity theory which allows any arbitrary
potential to be classified into a finite number of different category types. It 1s this class greatly
simplifies the study of gradient dynamical systems? Since we are interested im the particular potential
functions stemming from the solution of the Schroedinger equation, under the Born-Oppenheimer approxi-
mation for a chemical reaction, we apply the classification scheme specifically to potential energy surfaces
(PESs) Keep in mind that the same classifications and calculations are applicable, however, to any gradient
dynamical system. The classification scheme that we have developed for PESs, as we have mentioned, is
called the Directed Trees method and contains both a qualitative diagramatic procedure for implementing
the classification as well as a quantitative computational procedure for calculation of specific behaviors and
characteristics of the model.

 

 

 

 

Section: 3. The Topology of Potentials

Why should we concern ourselves with an alternate classification scheme based upon differential topology
for potential energy surfaces? The reason for domg so is that this new Directed Trees classification has two
special properties: structural stability and genericity.? The concept of structural stability plays an important
role in the mathematical theory of singularities. There are sev eral reasons for this importance. First of all,
usually the problem of classifying objects 1s extremely difficult, 1t becomes much simpler if the objects one
is classifying are stable Secondly. in many cases, the class of all stable objects forms what is loosely called
fa generic set. This means that the set of all stable objects is both open and dense, in the mathematical
sense, in the set of all objects In other words, almost every object 1s & stable object and every object
is “near” a stable object. Thus every object can be represented arbitrarily closely by a combination of
stable objects For instance, the Implicit, Function Theorem of calculus and Sard’s Theorem of differential

 

209

 
 

 

§13

topology imply that almost all points are regular points (points whose gradients are nonzero) for stable
functions and thus are not critical points Stated differently, regular points are generic, ic, they form an
open and dense subset of the set of all points for stable functions. (Stable functions are functions which
can be perturbed and still maintain their same topological properties.) Even though almost all points of
a function are regular points, nondegenerate isolated critical points do occur and have a generic property:
they are not removed by perturbations The importance of nondegenerate critical points extends beyond
their mere existence, for they “organize” the overall shape of the function. This can be seen in the following
one-dimensional example Consider a smooth function of a single dimension, f(z), which has three critical
points between z= O and x= 1 If the curvature at the eritieal point with the smallest x coordinate in
this interval is negative, then the curvatures for the middle and highest eritical points must be positive and
negative, respectively, for no other combination can lead to a smooth function connecting these three critical
points. In addition, the functional values at the smallest critical point and the largest critical point must
bbe greater than the value of the function at the middle critical point. A stmple graph of f satisfying the
above conditions will show that if these statements were not in fact true, then additional critical points
would be required between these three eritical points dust _as nondegenerate critical points orgamze the
shape of a one-dimensional function degenerate eritw al joints “organize” families of functions having specific
arrangements of nondegenerate critical points These deve: al points are nongeneric in the sense
that small perturbations either split the degenerate critical points into nondegenerate points or annihilate
the degenerate point completely leaving behind only regular points, It aught therefore seem that we should
not concern ourselves with degencrate critical points since they are mathematically “rare” occurrences on
a surface and can be removed Ly small perturbations ‘The manner m whieh degenerate points Sorganize”
functions into classes, however, leads to a generic classification of famihes of functions that is stable 10
perturbations and hence will be very useful m our study of PESs A third reason for the importance of
stability stems from the applications of singularity theory to the experimental sciences It 1s customary to
msist on the repeatability of experiments Thus similar results are expected under similar conditions, but
since the conditions under which an experiment takes place can never be reproduced exactly, the results
must be invariant under small perturbations and hence must be stable to those perturbations. Thus we see
iL is reasonable to require that the mathematical model of a physical process have the property of structural
stability In order to define this concept of stability, we first need a notion of the equivalence between objects.
This is usually given by defining two objects to be equivalent if one can be transformed into the other by
diffeomorphisms of the underlying space in which the objects are defined. For the specific case when the
objects are PESs, these diffeomorphisms are coordinate transformations and will be required to be smooth,
that is, differentiable to all orders, and invertible. This invertability 1s a requireinent of the Directed Trees
method and forms an important reason for employing GAs in the numerical applications of the DT method

   

    

 

 

 

   

 

     

 

 

 

 

 

 

‘The mathematical branch of differential topology called catastrophe theory forms the foundation for
the DT method. In its usual form, catastrophe theory is merely a classification of degenerate singularities
of mappings, the techniques of which use singularity theory and unfolding theory extensively along with a
very umportant simphfying observation made by Thom which has come to be called the Splitting Theorem
In this paper, we wish only to emphasize the fundamental concepts behind the Classification Theorem thus
providing & heuristic justification for its use in the study of PESs. In the process we describe the functional
relationships between the PES and its canonical form, which we call the Directed Trees Surface (DTS). We
do not provide rigorous statements nor proofs of any of the theorems of differential topology, but, more
importantly, we hope to provide an intuitive description of the fundamental concepts behind these theorems
In order to describe these results. we employ the terminology of differential (opology, and thus below we
provide the basic definitions necessary for comprehension and discussion of the DT method A glossary of
topological terms and notation used. sometimes without comment, in this paper is provided in the Appendix.
Since our main interest in this paper 1s the local properties of potential energy functions, we begin by recalling
some preliminary definitions of local properties If two functions agree on some neighborhood of a point,
then all of their derivatives at that point are the same. Thus, if we are interested in trying to deduce the
local behavior of a function from information about its derivatives at a point, we do not need to be concerned
with the nature of the function away from that point but may only be concerned with the function on some
neighborhood at this point. This leads to the concept of a germ of a function. Let L be the set of all

 

 

 

 

 
 

Durected Trees Method for Fitting a Potential Function §1:3

continuous functions from the Euchdean space * (o 8 defined in a neighborhood of the origin. We say
that two such functions, f,9 € L determine the same germ if they agree in some neighborhood of the origin,
so that a germ of a function 1s an equivalence class of functions Since this Uheory 1s entirely local, we may
speak of the values of a germ j and we write {(z) for z € R", although t would be more correct to choose
a representative f from the equivalence class f. A germ f at x 1s smooth if 1 has a representative which
13 sinooth in the neighborhood of z Because germs and functions behave similarly, we often use f and
J interchangeably to represent a germ Only where confusion may result wll we distinguish a germ from
one of its representatives We may also talk of germs at points of 8" different from the origin. A germ is
thus defined by @ local mapping from some point of origin If two smooth functions have the same germ
a a point, then their Taylor expansions at that point are identical We may, without loss of generality,
take the origin of a germ to be the origin of 8". The set of all germs from R” to R forms an algebra #
‘This convenient fact allows us to study the germs of maps with powerful algebraic techniques that ultimately
lead to algebraic algorithms for the topological study of arbitrary PESs §

 

 

 

 

Fundamental to many apphentions of applied mathen
a finite number of terms nits Taylor expansion Tor 4)
estimate for the size of the remainder term after tra
so much an the stze of the re
remainder term can be removed completely In this ease
My truncated Taylor series 1m the new evord: » of transforming away the higher-order terms
of a Taylor series expansion is formalized a the notic jerminacy Before defining determinacy, we
first. introduce some additional nomenclature The ‘Taylor series of f at x which 1s truncated after Lerms
of degree p is referred to as the p-jet of f at z, denoted by j*f(z) We now define what we mean by
the local equivalence of germs Two germs, f,g with f(z) = gz). are equivalent if there exists local C™-
diffeomorphisms y R" — R* and g.R — RK such that g = o(f(y(z}}). Thus, by suitable C® changes of
local coordinates, the germ J can be transformed into the germ g. We now note why the coordinate changes
must be invertible Neglecting a constant, the two functions are equal on some neighborhood of a point,
and we have expressed fas a function of z, that 1s, f(¥(z)) In addition, we would hike to be able to
express g as a function of the coordinates for f, that is, 9(7(¥)) ‘This requires us to invert the y coordinate
transformation z= z(y) As we stated earlier, this invertibihty criterion becomes an important reason for
choosing GAs to solve the systems of nonhear equations that arise from the DT method, With this, we
may now formulate the definition of determinacy. The p-jet ¢ at x 1s p-determined if any two germs at 2
having ¢ as their p-jet are equivalent

Lice 18 the technique of representing a function by

   

itative calculations, 1 1s necessary to make some
ation of the series Sometumes we are not interested
+ by a suitable change m coordmates near z, the
fe function 1s. in a very precise sense, equal to

    
   

amder term as in whet

 

      
   

    

 

 

If we are studying a C*-function f, we may understand its local behavior by expanding J in a truncated
Taylor series, ignoring all of the higher-order terms of degree greater than p. We can be sure that nothing
essential has been thrown away ifwe know that f 1s pedetermined Stated more precisely, we may study the
topological behavior of a p-determined germ f by studying its p-jet j°f One might think at first that no
germs are p-determined for finite p As an example of this, consider the germ of f at the origin of ? given
by f(z,y) = 2? This is not p-determined for any p, since the following function, which has the same p-jet as
J, 9(2,y) = 2? + y*, is 0 at the origin and positive elsewhere, whereas f 1s also 0 along the y-axis. However,
Mf f were a function of z alone, f(z) = 2?, then f would be 2-determined. We thus see that the determinacy
of f depends not only on its form but also on the domain over which it acts Since we have noted that if
1 function is p-determined, its topological behavior may be understood by studying its p-jet, then we may
now ask the following question Are there methods for deciding whether or not a given p-yet 1s determined?
We answér this question in the affirmative, and in a later paper we describe an algorithm based on work by
Mather for calculating the determinacy of p-jets In Section 4 of this paper, which describes the fitting of
DTSs to PESs, we provide examples for (i) the DTS behavior for cases in which the proper p-jet 1s chosen
for f, (it) the behavior for cases in which the chosen p-jet has p less than the deterininacy of f, and (iit) the
behavior of yf in which p is greater than the determinacy of J.

 

 

 

 

 

Below we summarize the four basic and interrelated concepts of singularity theory (i) stability, (ii)
genericity, (:ii) reduction, and (1v) unfolding of singularities To describe what is meant by stability, consider
the map J. —+ R given by f(z) = 2? This map is stable, since we may perturb the graph of this map

 

21
Le

 

Differential Topology of the Directed Tree Method

shghtly and the topological picture of its graph remains the same. That is, consider the perturbed map
g RR, olz) = x? + ex with c #0. This perturbed function, g, still has a single critical point just as f
docs, and can be shown to be just a reparametrization of f. Thus we hope to characterize and classify stable
maps since if we perturb these, we can still predict their topological behavior.

 

Since our goal is to provide a mathematical model for classifying and calculating PESs, one might ask
whether there are enough stable maps to be worthwhile in this endeavor That 1s, can any arbitrary PES
be approximated by a stable map? This is the question of the genericity of stable maps, i.e., whether the
set of all stable maps 1s open and dense in the set of all maps. If it is, then any map 1s arbitrarily “close”
toa stable map and may be represented by combinations of stable maps. It thus makes sense to study the
properties of stable maps since these properties will then be pertinent to any arbitrary PES.

 

Reduction refers to the often employed technique of splitting a problem into two components: one
component whose behavior is simple and known, and a second component whose behavior is unknown
and hence more interesting and whose behavior we would like to study. This is typical in most physical
models in which there are many variables whose functional behavior 1s assumed to be simple, for example,
harmome These variables are usually “factored out” of the overall model for the physical phenomenon since
the behavior of the system over these variables is known The Splitting Theorem provides a justification for
this reduetiomsm.

 

 

René Thom introduced the baste notion of the unfolding of an unstable map m order to provide stability
for a family of maps To see what this means, let us consider the following example to which we will often
return for illustrating new topological concepts. Let f:® —+ R be given by f(z) = 2°. This map is unstable
at zero. since if we perturb f by ¢z, where ¢ is small, the perturbed map g(z) = z° + ez assumes different
critical behaviors for « <0 and «> 0 There are two critical points, a minimum and a maximum, in a small
neighborhood of zero when ¢ < 0, but for ¢ > O there are no critical points. The family of maps F(z,¢) = 9(z)
is, however, stable. Thus F includes not only f, but also all possible ways of perturbing f. The map F
is said to be an universal unfolding of f. It is very important that the unfolding F includes all possible
ways of perturbing f. To be more specific, consider perturbing f by the term 62°, where 6 is arbitrarily
small but not zero The map A(z) = 2° + 62? assumes the same critical behavior for all 6 # 0, that is, h(z)
has one maximum and one minimum, Thus for ¢ <0, g(z) has the same critical behavior as h(z), and st
can be shown that g and h are “equivalent” for ¢ <0 and 6 #0. (The precise meaning of “equivalent” is
described in the Glossary.) On the other hand, there ts no 6 for which f(z) lacks critical points, thus A(z)
1s not equivalent to g(z) when ¢ > 0. Therefore h 1s not capable of describing all possible perturbations of
J, since it 1s unable to provide g with ¢ > 0, The unfolding g is, however, capable of describing all possible
perturbations of J. Our discussion so far does not indicate how we know this fact, it is a rather deep result
of singularity theory stemming from results based on the early insights of Thom The crux of singularity
theory is how to unfold the “interesting” component of a given model into a stable mapping with the least
number of parameters, such as the ¢ from above.

 

 

 

 

 

3.1, Theorems from Topology

Several principal theorems of differential topology concern the effects that critical points have on the
geometrical shape of manifolds. Since each has been carefully proven and thoroughly investigated in the
Merature, we only include here an informal statement of these theorems and a few of the results derivable
from them. We emphasize that these theorems are closely related to each other: their differences entail the
stepwise removal of some of the assumptions upon which the first theorcin 1s based

     

The first of these theorems 1s borrowed from elementary calculus: the Implicit Function Theorem ©
This theorem controls the behavior of a surface at regular points, that 1s, at points which are not critical
points. Excluding the overall translational and rotational coordinates of a molecule, the critical points of
potential energy surfaces are isolated.” Thus almost all points of a PES are regular points and hence the
implicit function theorem describes the local behavior of almost all of a PES Qualitatively speaking, the
Implicit Function Theorem states that at a noncritical point of a potential function, the coordinate axes may

 

22

 
Directed Trees Method for Fitting a Potential Function §1:3.2

be rotated so that one of the axes aligns with the gradient of the potential at that point. Then the function
is represented as f(z’) = z} where 2 are the new coordinates. This is intuitively obvious by considering the
gradient to be “force vector”, then the coordinate axes may be rotated so that one axis 1s colinear with
the force, which may then be described as a linear function of this one coordinate In analogy to our one-
dimensional example of the control which critical points have on the possible shape of a function, we find that
the overall shape of a PES depends upon the positioning and ty pe of its critical points. The Morse Theorem,
which 1s sometimes called the Morse Lemma in the literature,® and its corollaries describe how nondegenerate
critical points both control the shape of a surface and determine the relationship between an approximately
measured function and the stable mathematical model which 1s used to descibe that physical process In
particular, through the elimination of the assumption that the gradient 1s nonzero at a point, we find around
nondegenerate critical points, a new coordinate system so that a potential may be represented as the sum of
squared terms of the coordinates with no higher-order terms no linear terms, and no quadratic cross terms.
Thus the function has the form f = Y7 z'? and 1s termed a Morse function Corollaries of the Morse Theorem
say that Morse functions are stable and this stability is a generic property Lastly, we discuss degenerate
critical points and their influence on the possible configurations of nondegenerate points. By eliminating the
assumption of the nonsingular Hessian matrix at a critical pomt of the surface, the GromelleMeyer Sphtting
‘Theorem says that the function may be split into two components one is a Morse function, Fay, and the
other 15 non-Morse function. Fp yy. The non-Morse component cannot be represented as quadratic terms and
does not involve any of the coordinates of the Morse component! The Arnol’d-Thom Classification Theorem?
categorizes all of these non-Morse functions into families provides canonteal forms for thein and describes
the interrelations among the various families

 

 

   

 

 

The ramifications of the Arnol’d-Thom theorem cannot be overestimated. If function, F (¥,f), havinga
non-Morse critical point at (,; J.) 1s perturbed. The perturbed function, F(z, ), through diffeomorphisins
¥ and f, is obtained from F by perturbing the Morse part and the non-Morse part separately. Perturbation
of the former does not change its qualitative eritical behavior, while perturbation of the later does. Thus
one can “forget” about the coordinates involved in the Morse function, while concentrating on the subspace
spanned by the variables of Fya¢. The theorem classifies all possible types of perturbed functions in this
subspace. Corollaries also establish the stability and genericity of the universal unfoldings of the Classification
‘Theorem

 

    

3.2. Potential Functions and their Canonteal Forms

In this section we want not only to diseuss the connection between arbitrary potential functions and their
canonical forms provided in a separate paper,! but also to demonstrate the quantitative relationships that
exist between the critical points, gradients, and curvatures of the potential function with the corresponding
expressions that exist for the canonical forms. In order to define the extent of the applications of these
canomical forms we begin with a brief exposition of Thom’s method! for modeling a physical system.

 

 

   

     

First, suppose the physical system we wish to model has n distinet properties to which n definite real
values may be assigned We define an n-dimensional Euchdean space, &", which parametrizes these various
physical variables, Each point in " represents a particular state for the physical system. If Z, #€ 8", is such
‘8 point, then the coordinates of Z (z1,...,zn), are called the state variables Let X C&R" be the set of all
possible states of the physical system The particular state, # € X, which describes the system, 1s determined
by a rule which usually depends on a multidimensional parameter represented by #, 7 = (01, . px) € R*
For most physical systems this rule 1s often specified as a flow associated with a smooth vector field
This flow, or trajectory. on Y usually determines the attractor set of Y. Sometimes the rule is specified so
the flow “chooses® a particular attractor on Y with the “largest” basin At other times the rule may only
specify that the attractor be a stable one. Since very little 1s known mathematically about the attractors of
arbitrary vector fields, catastrophe theory has little to say about this general model. If, however, the vector
field is further restricted to be one generated by the gradient of a given smooth function, say V, then Thom’s
theory becomes very useful in the study of the physical model In other words, if Y = -DV(Z, A) where V
3s considered a family of potential functions on R" @ W*, the attractors of Y are just the local minima of

 

 

 

 

 

213

 
 

§1°3.2 Correspondence between Potential Functions and their Unfoldings

V(Z,A). In terms of a potential function, the rule $ again may have several forms. For instence, $ may
choose the global minimum of V, or it may require only that the state of the system corresond to one of
the local minima of V. The specific details of the method which $ uses to move Z to the attractors of Y
determines the dynamics of the trajectory of Zin I. Various choices for $ may correspond to tunneling
through barriers on V, to steepest descent paths on V, or to “bouncing” over small barriers by means of
thermodynamic fluctuations.

 

3.3. Relationships between Potential Functions and thesr Unfoldings

In order to examine a specific example, let us suppose that / is a gradient vector field: Y = —DV(Z, A),
where V(Z, 7): 8" @ R* — R is a smooth potential function of the state variables, Z, and depends upon
& parameter f. The attractor set of Y is then specified as a set of stable minima of V. The cnitical
points of V, defined by DV = 6, form a manifold, Xv, where Xv ¢ H"+*, which includes the stable
minima. Choosing @ point, (0: Fo) « 5"**, of Xv, Thom’s classification theorem tells us that in some
neighborhood of (0; 0), V 18 equal to the sum of a universal unfolding, U;, of one of the germ functions,
Gy, and a quadratic form Q, Q@ = Ov. ,4,7% for k < 6 and j = 1 or 2.9 More formally, if Me CR”
is a neighborhood of Z and Np © %* is a neighborhood of fo, then V: Nz © Np —+ Rts equivalent to
PEA) = Gilza,y) + PE: A) + Qlzs41.n) = Usl E55) + Qzie1.n) for some finite ¢ with Z,5 denoting
the first j coordinates of F while Z,~1, denotes the last n — j coordinates. This means that there exist
diffeomorphisms ¥* Mz @ Ny — Nz, and a: Np — ® such that, for any (2) € Nz © Np, we have

 

 

VEA= AKA AM) +l) - (3)

This equation allows us to quantitatively relate the critical points, gradients, and curvatures of V and Fi.
Application of the chain rule for derivatives of vector fields to equation (8) provides an expression for the
gradient of V:

DV (A) = D(X; ADT FA) (4)

where D denotes the partial derivative operator with respect to the coordinates of the function or operator
which follows it In order to determine the Hessian of V, HV, we carefully reapply the chain rule to (4) to
yield.

HV (2) = D'X(z) ¢ HA,(X) « Dx(z) + Ss Di Fi(X)HXx(Z) (5)

where D* is the transpose of D. We now have expressions equating not only V and Fi, (3), but also their
gradients, (4), and Hessians, (5). Through these systems of nonlinear equations the unfolding parameters
and diffeomorphisms may be calculated.

As Connor!! has pointed out in a different context, the diffeomorphism and parameters of an unfolding
may be calculated via the solution of the nonlinear system of equations which arises from the correspondence
between the critical points of the unfolding and those of the experimental function. For PESs, however, the
critical points are usually not known a priori, and thus this 1s not a viable procedure. Extensions of this
method are reasonable though. For instance, the DTS and PES must correspond within a neighborhood of
any point. Thus, # similar system of nonlinear equations may be derived, for points within some neighborhood
of a particular point, whose solution yields the parameters and diffeomorphism. Alternatively, at a single
point the function and all of its derivatives must comcide with those of the DTS. Therefore, since ab ini
quantum calculations now provide analytic first and second derivatives, it 1s reasonable to employ this
information to help calculate the DTS parameters and diffeomorphism. Thus, the calculation of a single
point on the PES with its first and second derivatives may be employed to determine a first approximation to

 

 

   

 

 
  

 
Directed Trees Method for Fitting a Potenteal Function §1:4

the parameters and diffeomorphism Thus, from a single point, we may be able to specify to which unfolding
within a given family the particular PES belongs. Since there are canonical forms for the DTS, we also have
canonical forms for its critical points, in particular, its saddle points !? Therefore, one might next move
over to the DTS saddle point and perform another quantum mechanical calculation there Of course, this
point will not correspond to the PES saddle point, but since locally the diffeomorphism is approximately
the identity function. it will be close to the PES saddle point ‘The additional information obtained at this
new point may then be used to calculate a second approximation for the parameters and diffeomorphism
‘Thus, with each new point, better parameters are calculated so that the DTS better fits the PES In the
next section, we perform sainple calculations on the one-dimensional unfolding families, the A, families

Section: 4. DT Method for fitting a PES via the Genetic Algorithm

As we discussed in the last section, the problem of fitting a DTS to a PES is one of finding a solution to
a nonlinear system of equations The DT method allows for a flexsble choice for the form of the optimization

 

 

function We have considered both weighed least squares as well as absolute value evaluation functions. In
particular, 1n the follwoing examples we have employed the experimental and evaluation functions provided
below
Expernnental function’ f(z) = 02° + 2° — 322, @ = 005
Az Unfolding: F(X) = 29+ 1X + po
Diffeomorphism: —_X(z) = eo +e12 + ez?
Evaluation functions. R= Sy wil F(X) ~ Hz)

 

 

(6)

Ry = Sw OED ~ 24a

FX) _ 2 fty)
oe

333

 

 

R=, wy

Ry = rR ry Ry + ro

 

 

where {r,ra,r2,w;) are weighting factors,

‘The standard numerical methods for solving nonlinear s) stems often involve algorithms of the Newton-
Raphson type '3 As we mentioned earher, the coordinate transformation must be a diffeomorphism, and
hence invertible Empirically, we found that when employ ing a Newton-Raphson algorithm for solving these
nonlinear systems, the calculated coordinate transformations often did not satisfy the inverubihty criterion
‘Therefore we resorted to constrained optimization techmiques Several methods, including the Box complex
algorithm, and standard least squares procedures,!® have been successfully used to solve these nonlinear
equations Typically, the constained methods were very slow to converge to a minimum and thus required a
significant increase in computational time, Since the evaluation functions involved the differences between
values for the experimental PES and its DTS they were froth with shallow local minima. Thus, for some
problems, these methods did not converge to the global minimum of the evaluation functions. In addition,
the constrained optimizations often tended to remain close to their constraint boundaries, resulting in the
optimizations becoming stuck in local minima. These considerations led us to try other function optimizers
Besides these classical techniques, genetic adaptive algorithms (GAs) also may be employed to solve these
systems, GAs are based on an observation originally made by Holland!® that living organisms are very
efficient at adapting to their environs. Implicit in a genetic adaptive search is an immense amount of parallel
calculation, and empirical studies indicate that GAs will often outperform the usual numerical techniques.!7
We do not discuss the working of GAs here, but rather refer the reader to literature references.!8

 

 

 

 

Several features illustrated in the following fitting examples are of importance and we mention them
here, (i) We show that the coordinate transformation employed by the DT Method is required to be a

215

 
gla Correspondence between Potential Functions and their Unfoldings

diffeomorphism. If the coordinate transformation calculated via the DT method is not a diffeomorphism,
then the chosen determinacy of the PES 1s too low and a higher-order unfolding family 1s needed in order to
accurately fit the PES (11) Also illustrated 1s the fact that the diffeomorphism may include terms which have
asymptotic behavior, for example, exponential terms. In this case, the asymptotic behavior of the surface
may be reproduced by including comparable behavior in the diffeomorphism. (11) “Bumps” or “shoulders”
on surfaces that do not form critical points still reflect the fact that they stem from the annihilation of critical
points of a germ function. Thus any bump or shoulder on a surface means that a higher order unfolding
family will be required in order to accurately reproduce them (1) Also depicted in these examples 1s the DT
Method for fitting a 2-dimensional potential energy surface Our example 2-D surfaces have one “interesting”
coordinate, that 15, one coor: is not harmonic, and one coordinate which is harmonic,

 

 

 

Ag DTS Fit for R and Rg

Experiment
——R eR
* Data points

Experiment
——R eR
Date points

 

¢ 1 Az DS fits employing R and Rs to an As experimental function at 1, 2, and 3 data points.

 

In Figure 1, we illustrate the Directed Trees fitting procedure by employing the genetic algorithm for
fitting the’ Az unfolding family to an experimental function belonging to the As family. We choose this
experimental function to exemplify several features of the DT method In particular, the value of the
coefficient of the 2° term was chosen in order to generate a third eritical point on the experimental surface
within the coordinate interval 3 < x < 3. We choose this interval so that the local nature of the fitting
procedure for the Az unfolding may be demonstrated In conyunction with this local aspect of the Az DTS

216

 

 
Directed Trees Method for Fitting a Potential Function §1:4

 

on the |~3,3| interval, however, we would like to point out that all three critical points, and hence the
experimental function itself, may be accurately represented with the 4 unfolding family Even though the
highest-order term of the As germ function 1s fourth-order, its unfoldings may have three critical points
and thus the three critical points of this As PES may be accurately reproduced on the interval {3,3} We
have successfully fit an Ag DTS (o all three singularities of this Az PES (The Ag unfolding family does
not have the proper local topology and consequently it cannot accurately reproduce this PES When an AZ
DTS fit 1s attempted, either the fitting 1s very poor or the calculated coordinate transformation is not a
diffeomorphism ) ‘This example also demonstrates the usage of the DTS to help choose new positions for
further calculations and the employment of the first and second derivatives in addtion to the functional
values at the data points

   

Ag DTS Fit to Noisy Data

   

      
    
   

 

    

  

20.
10.
ics
&
So 0.
re
++ + “Experiment” + + "Experiment
—— 10.

© Noisy Data

   

Figure 2 Az DTS fit to noisy date points

In this figure, the experimental function 1s drawn as narrow solid lines. For clarity, the data points,
which are represented as “sold” squares, are drawn at a constant ‘y’ coordinate and not at their proper
functional values Their proper functional values are located on the narrow solid curve The dotted lines are
the Az DTS fits employing R as the fitting criterion Thus, these R curves attempt only to fit the functional
value of the experimental function at each of the data points The thick dashed lines are the Az DTS fits
employing the Rs evaluation fiunction, thus these dashed curves fit not only the functional value but also the
values of the first and second derivatives at cach point In Part A of Figure 2 we have attempted the DTS
fit employing only a single experimental point Note that in this case, the R fit does not have the proper
local topology There is not enough information available to determine the local shape of the experimental
function and at 1s only fortuitious that the R unfolding has about the same value of its first derivative as
the experimental function. On the other hand, the Rs DTS fit does have the proper local topology but its
ertuical pos

 

 

are far removed from the corresponding experimental minimum and maximum. In Parts B
and C of this figure we employ two experimental data pomnts for fitting the DTSs In Part B_the chosen
date points include the single point from Part A plus an additional point at the minimum of the DT'S surface
calculated in Part A. We thus have used the approximate DTS surface of Part A to choose where the next
calculation should be performed. The new information from the second datum point 1s then used to refine
the DTS. In Part C, we use the same datum point as in Part A as well as the maximum point of the DTS in
A. These refined DTS curves in Parts B and C now provide a more accurate estimates of the minimuin and
maximum of the experimental function. We use three data points for Part D, the original point from A as

 

 

217

 
 

   

gl:d Correspondence between Potential Functions and their Unfoldings

well as Part A’s DTS's minimum and maximum points. Note that che DTS fit to the three points does not
Nave uhe proper topology of the experimental fonction. The Rs DTS, however is a very accurate fit within
the neighborhoods surrounding the maximun: and minimum of the experimental function, Note, however,
Boe eee DDTS vs unable to ft the second, rightmost, maxunum of the experimental funetion. This is
ane ine ns thurd eritical point generated by the sixth-order cerm in the experimental Function cannot be
represented scithin the Az unfolding family, which has, at most, two critical points ‘A higher-order family
reece to be chosen in order Co Gt this maximum value In particular, the As family ‘would be capable
sPhtting both of the maxima and the minimum on this experimental function. One does not have to use an
“Fe unfolding for thes experimental function even though it contains 1 sixth-order term

 

Ag DTS Fits to an Ag function

    

  
   

40.
~ 30. ix.
XS 20. , we
10. a
° 4
cece +Bxperiment
— — RDS Ss 7
= Data pointes 1.48
4
ue)

-6. -4. -2. 0. 2 -6,.-4.-2. 0. 2
x ‘x,

Figure # Az DTS fit to the minimum and shoulder of an As experimental function

the experimental function was assumed to be known exactly. ‘This 1s usually

In the previous figure,
energy at each datum point on & PES.

not the case, ‘Typically, there are random errors in the pote
‘These random fluctuations stem from round-off errors 1n calculations,
integration inaccuracies, or experimental random fluctuations ‘Also, as previously noted, the evaluation
fonctions have many local minima which often appear to be similar to random fluctuations To-show that the

data, we return to the experimental

figure and add rather severe random fluctuations to the functional of values

approximate wave functions, numerical

  

DT method in conjunction with the GA optimizer does not require ex
function of Figure 2 in the next

28
Directed Trees Method for Fitting a Potential Function §1:4

‘as well as its first and second derivatives The GA optimizer 1s very efficient at avoiding local minima and
consequently works well for noisy PESs. Part A of this figure has six data pomts to which noise has been
added (The “open” squares representing the data points in this figure now reside at their proper functional
values ) Note that the best Az DTS fit employing Rs to these data docs not accurately repeat the “exact”
experimental function. that 1s, the function without the random fluctuations which is drawn as a dashed
Ine. In fact, it might appear as if the DTS does not even accurately fit the two data points at z = ~1
It must be realized that the functional value of the data points is all that is being plotted in this diagram.
The Re evaluation function however, includes the first and second derivatives as criteria for a fit. Thus for
1 small number of data points, the random fluctuations in the first and second derivatives need not cancel
and thus the DTS need not accurately fit the two functional values at z= —1. In Part B, we have added
additional data points. Here, the DTS fairly accurately fits the “exact” experimental function. This figure
illustrates that the Directed Trees method coupled with the genetic algorithm are easily applied to fitting
DTSs to noisy PESs

 

Ag DTS Model Fit

—-— A, DTS
30. “R

3
40. —— Experiment 2
1
By * Data points, 0.
vo

=

2

 

 

-4.-2. 0. 2.

x

   

Figure 4 As DTS fit to the As experimental function

 

One particular advantage of employing the genetic algorithm for fitting DTSs to PESs is that it is
easy to require that the calculated coordinate change remain a diffeomorphism. In the next figure, we see
not only a new experimental function as well as its DTS fits, but an addition, plots of the corresponding
diffeomorphisms, X(z), for the DTSs Note that in Part A, we have chosen data points surrounding the
minimum of the experimental function at x = 0. This experimental function has only a single minimum, but
it does have a shoulder at around z= -% Even though this shoulder is not a new critical point, it stems
from the annihilation of a saddle and a minimum of the Az family of functions. Hence our Ay DTS cannot
fit this experimental function exactly It is capable of fitting either the minimum as illustrated in Part A
or the shoulder as illustrated in Part B. In addition Part A also illustrates the possibility of asymptotic
behavior being included in the diffeomorphism, then the DTS 1s capable of fitting the asymptotic behavior
on a PES. In fact, instead of expanding the diffeomorphism as a Taylor series, as we have done here, it
could easily be expanded as a suin of exponential terms whose asymptotic behaviors are then imparted to
the DTS. Note that, as the diffeomorphism levels off for z < —5, the DTS also becomes asymptotically
level Part B of this diagram contains a warning, however The function ¥(z) is not a diffeomorphism over
the entire interval, -7 < z < 3, and hence, the assumptions necessary for application of the Arnol’d-Thom

 

    

   

219

 
 

gd Correspondence between Potential Functions and their Unfoldings

Classification Theorem are not satified over this mterval. In fact, the critical point of X(z) leads to an
additional critical of the DTS at about 0. This critical point of X(z) was induced by attempting to fit the
Ap unfolding family to “three” critical points the one actual minimum of the surface and the annihilated
saddle and minimum which generates the shoulder region. If the datum point at z = } is removed, then
X(z) remains a diffeomorphism and the DTS accurately fits the shoulder of the experimental function ‘This
example reveals an advantage of the genetic algorithm over many of the nonlinear Newton-lke optimization
schemes. Unlike the Newton methods which require an initial guess and can become “stuck” in local minima,
the genetic algorithm only requires starting intervals for its parameter values. This, by the way, allows one
to assure that the coordinate transformation X remains a diffeomorphisin by ineans of controlling the ranges
over which the parameter values may vary. In addition to the fact that parameter intervals are a much less
restrictive mitial condition than having to guess a starting parameter solution, one may also easily specify the
resolution at which each individual parameter is caleulated. Thus individual parameters may all be optimized
at differing resolutions If X 1s not a diffeomorphism after fitting a DTS to a PES, then this 1s a tipoff that,
the chosen fitting family 1s too small and does not contain enough critical points necessary for fitting the

 

 

 

     

 

surface. Thus one should choose a higher-order family fur fitting shis surface. fn partiewlar, the next figure,
Figure 4 illustrates that if we choose the 43 unfolding fanuhy to fit thie expernmental function, then both
the shoulder and the minimum may be accurately ht Siace this expernental function is 3-determined and
we are employing the dy unfolding family the diffeomorphism 1 a linvar funcuon with no erstical points.

 

 
   

We next consider the Directed Trees met smtal function which has more than
onedimension We choose an experimental func! monic coordinate and one harmonic
coordinate. This PES 1s representative of isomerization reactions It 1s an unportant trial case because
of the recent interest in quast-periodie versus chaotic trajectories on simular two-dimensional surfaces 19
Also a similar surface was also chosen by Fukui? to illustrate the intrinsic reaction coordinate method.
Contour levels of this function are drawn in Parts A and C of the following figure. There are several things
‘to note about the experimental function First of all, there are two minima and one saddle point. Neither
of the minima are located at special points, such as the origin. Also, a line drawn between the (wo minima
is not parallel to either of the coordinate axes The DT method, though, is capable of “rotating” the DTS
coordinate axes so that it can accurately represent the experimental surface In Part B of tus figure, we
have chosen che As family for fitting this function. Note that the corresponding contour levels in all Parts
of this diagram are drawn employing the same ty pe of line, whether that be solid, dashed dash-dotted, or
dotted The “stars” (*} in Parts A and C locate the data points used in the ealeulations for Parts B and D.
respectively. The As DTS of Part B very accurately fits the experimental

 

 

 

 

 

You might ask what would happen if one were to choose a family which can display more critical points
than the experimental function contains. This 1s allustrated in Part D of this Figure In this case, the
Ag unfolding family was chosen to fit the same experimental function as provided in Part A Note than in
Part D, the DTS accurately fit both minima and the saddle point of the experimental function. In addition,
however, there 1s a new saddle point appearing around the point (2.1,0.2). This new saddle point stems from
the fact that the Ay family can display four critical points, It is worth noting, however, that in the region
surrounding the data points, the Ay DTS accurately fits the experimental function The new, extraneous,
saddle point of the DTS lies outside the local neighborhood of the data points employed to fit this PES. This
example of employing the Ay unfolding mightlead one to consider always employing a high-order unfolding
family to fit all PESs. One finds, however, from the practical viewpoint of calculating the fitting parameters,
that a properly chosen unfolding family (one whose determinacy and local topology is the same as the
experimental PES) will greatly reduce the amount of calculation and hence provide an casily calculated fit
to the PES This 1» because the DTS has the proper number of critical points to reproduce the topology
of the surface data 1s not required to suppress extraneous critical points of the unfolding. Thus there is
an optimum unfolding family, from a calculation standpoint, for each PES It is true that the higher-order
family, assuming it contains the lower-order family as a subfamily, will provide an unfolding which repeats
the topology of its lower-order subfamily. It is this subfamily, however, that should be chosen as the unfolding
family for the original fitting procedure.

 

 

   

As our last example of a 2-dimensional fitting to a 2-dimensional PES, we choose the same “exact”

 

 

 
Directed Trees Method for Fitting a Potential Function §1:4

Experimental PES’ Fitted DTS

 

Figure 5 As and Ay 2-dimensional DTSs fit to an experimental function Contour lines are drawn at energies
of 15, 10, 8, 7, 4, and 1 mn all parts of this figure

 

experimental function, but add in random fluctuations to the experimental values and its first and second
derivatives. For this example, we also employ the Rs evaluation function in determining the unfolding
and diffeomorphism parameters Note that in Figure 6 the A: DTS has the same critical behavior as the
exerimental PES, however, it 1s not as accurate of a fit as that shown in Figure 5. This is because the noise
included in our functional values is rather extensive Since it is not possible to see these random fluctuations
on a contour plot of the PES, we have drawn a 3-D stereo projection of the experimental PES along with
the noisy data points chosen In this view, the bold circles are the experimental points chosen on the surface
while the light crosses are the “exact” experimental values corresponding to the noisy data points.

  

221

 
ged Correspondence between Potential Functions and their Unfoldings

Noisy PES

 

 

 

 

 

 

 

 

 

 

 

Figure 7 Stereo view of the noisy data points of the Az experimental 2-D function The bold circles are the
noisy data points while the thin ‘-+" signs are the corresponding “exact” values

Section: 5. References

[1] See the associated paper “The Directed Trees Method | Classification of Potential Energy Surfaces”
(submitted for publication)

 

 
 

le

Is

{18a}
18b]

{18¢)
[18d]

(18e|

|19a)
119b,
[20a
[20b)

 

Directed Trees Method for Fulting a Potentsal Function — §1°5

For typical examples of the simphifications that may arise solely from a classification scheme, see “The
Differential Topology of the Directed Trees Method V. Symmetry Invariant Potential Energy Surfaces”

‘These concepts have already played important parts in the story of classical mechanics and dynamics,
for example, see Arnol'd, V 1 “Mathematical Methods of Classical Mechanics”, Springer-Verlag: New
York, 1978, and Arnol'd, V 1, Avez, A. “Ergotic Problems of Classical Mechanics”; Benjamin: New
York. 1968; Ch t, 3-4

See the Glossary for the definition of an algebra,

  

 

 

See the accompanying paper “The Differential Topology of the Directed Trees Method III: Determinacy
and Unfolding Algorithms.”

Rudin, W_ “Principles of Mathematical Analysis; 3rd Ed McGraw-Hill: New York, 1976; p 224.
Mezey, P..G. Theoret. Chim. Acta (Berl) 1981, 58, 309.
Morse M. Trans Amer. Math Soc 1931 33 72

 

 

 

Milnor. J) “Morse ‘Theory” Princeton Unis Press. Prin
Amol'd, VoL Russian Math Surveys 1974 2. 10.
Thom RB. “Structural Stability and Morphogenesis” Benjamin: Reading. MA, 1975
Connor. JN 1 Mol Phys 1976 31 3%

 

New Jersey, 1963 No 1

 

See the accompanying paper “I'he Dalferential Topology of the Directed Trees Method II: Potential
Energy Surfaces and Canonical Forms”

Fletcher, R. “Practical Methods of Optimization, Vol 1 Unconstained Optimization”, Wiley New
York, 1980; Ch 6

Richardson, J A. Commun, ACM 1973, 16, 487,

International Mathematical & Statistical Libraries, Inc. Subroutines ZXSSQ, ZSPOW, and ZXCNT
from “IMSL Library of Fortran Subroutines”; 9th ed., IMSL, Inc: Houston, TX, 1981

Holland, JH. “Adaptation in Natural and Artificial Systems”. Univ, of Michigan Press. Ann Arbor,
1975

  

De Jong, K.A. “Analysis of the Behavior of a Class of Genetic Adaptive Systems”. PhD dissertation,
Univ of Michigan, August, 1975

Bethke, A.D. “Genetic Algorithms as Function Optimizers”; PhD dissertation, Un. of Michigan,
January, 1981.

Brindle, A_ “Genetic Algorithms for Function Optimization”, C.S. Department Report TR81-2 (PhD
dissertation), Univ. of Alberta, 1981

De Jong, K A. IEEE Trans Systems, Man, and Cybernetics 1980, 10, 9.

Holland, J.H. In Prog. in Theor. Biol., 1976, 4, 263. Rosen, R., Snell, F M., eds ; Academic Press: New
York, 1976.

Holland, J.H “Adaptation in Natural and Artificial Systems”; Univ. of Michigan Press Ann Arbor,
1975.

DeLeon, N: Berne, Bi. J. Chem Phys 1981, 75. 2495.
Kariotis, Ri; Suhl, Haj Bckmann, J-P Phys Rev Lett 1985,
‘Tachibana, Aj Fuku, K Theor Chin Acta 1978, 49, 321
‘Tachibana, A., Fukui, K Theor. Chim Acta 1979, 51, 189.

 

223
(t)

(2)
(3)

i}

(3)

(6)
(7)
(8)
(9)
(10)

(uy

(2

Section: 6. Glossary
‘The following furnishes brief definitions of a few of the terms from differential topology that we employ
in the text of this paper
C™-Diffeomorphism: If y is a C™-diffeomorphismn, then it satisfies the following three criteria’
(i) yas m times differentiable,
(1) y has an inverse, yo"! R" — R", such that yoy"? = toy =I, and
(iti) y-? is m tames differentiable,
where m is either finite, 00, of »
Equivalence class: If A is a set and if ~ is an equivalence relation on A, then the equivalence class of
@€ A isthe set {r€ Ala ~ x}
Equivalent: Two functions, {-H* — Rand g K" —- WR, are equivalent at 0 if there exists a diffeomorphism
XR" — R" and a constant such that

 

A(X) 0 (7)

 

of

na neighborhood of 0 Equivalence of two functions imphes that they have the same geometric “shape”

eritical behavior ‘They have corresponding critical points which are of the same type
Genericity: A generte property 1s a property possessed by an open dense subset of the system. This
means that a generic property 1s “typical” for the system, and a complementary subset for which the
property does not hold has measure 0 ‘Thus it 1s “mathematically rare” for a generic property not to
hold. Since a generic property holds on a dense subset of the system, then any member of the system,
including those not having the generic property, may be approximated arbitrarily closely by elements
having the generic property. An example of this 1s that a function having a degenerate critical point
may be approximated by functions having only Morse critical points
Germ, Germ-equivalent: Let Tbe a topological space and $ be any set. Let f:U + $ and g:V — S
be maps with domains U, V open sets in T, and suppose = hes in UNV. Then f and g are said to be
germ-equivalent at x if there exists some open neighborhood IW’ of z lying inside UV such that f = 9
‘on IY This 1s an equivalence relation on the set of all maps defined on neighborhoods of z in T and
with values in Sand the equivalence classes are called germs of maps at z If S is a topological space
also, then we can consider germs of continuous maps If § and 7 are normed linear spaces, we can
consider germs at r of C" maps. Iftwo C® maps are germ-cquivalent at z, then all their derivatives at
Z are the same.
K-determined: Let {= and let k be a non- negative integer. Then f is right-determined (right-left
determined) if, for every g € R" such that j*(f) = j*(9), then f~e 9 (f~n1 9)
Jet, kejet: The k-jet of a function f denoted by 7*(f), 18 the Taylor series expansion of f at z and
truncated after the order k terms
Neighborhood N: Given a topological space, (7,1), a subset NC Tas an
is there is a member $ of r with te Sc
Regular point: A point, z, is a regular point if z is in the domain of a function, f R” —- ®, and the
gradient of the function at z is not zero.

) Smooth or C%: A function f, sR"
and are continuous at z

 

   

   

 

 

 

ighborhood of a point t € T

 

 

   

 

R™, 1s called smooth at a point, z, if all of its derivitives exist

) Stability: Properties of a mapping which are invariant to perturbations of the map are called stable
properties, and the collection of maps which possess a particular stable property may he referred to as
' atable class of maps. In particular. a property 1s stable provided that whenever fo: X — Y possesses
a property and f X — ¥ 1s a homotopy of fo, then, for some <> 0, each f; with £ < ¢ also possesses
the property

) Structural Stability: For the single function case, let fH" --+ R be a function and P.R" @ RE — ®
be an arbitrary small perturbation Then f is stable at a point in Zo if there exists & diffeomorphism
¥ = X(Z) such that the perturbed function, 9 = f +p, in the new coordinate system is equivalent to the
unperturbed function, (2) = 9(X) +0

    

 

 

224

 
(13)

(4)

Directed Trees Method for Fulting a Potential Function §1:6

Topology, Topological Space, Open Sets: Let Tbe a set; a topology + on T is a collection + of subsets
on T which satisfy the following criteria: a family r of subsets on T 1s a topology on T if

(i) fT cr, then UT Er,

(ii) if T C+ and T is finite, then OT € 2,
(m) Oe rand Te

then (T,=) 18 a topological space, T 1s its underlying set, and the members + are called the open or r-open
sets of (7,1) of T

Unfolding, Versal and Universal: An unfolding of a function, f(z), 1s a parametrized smooth family
of functions, F(z, f), where 7 = (py. ..,pj), whose members are possible perturbations of f(z) The
dimension of j, j, 18 called the codimension of the unfolding. Usually unfolding also refers to a particular
member of the family, F(Z,). An unfolding, G. 1s  vereal unfolding if any other unfolding of { may
be obtained from G via a diffeomorphism An unfolding, H is a univereal unfolding if it
and is of minimum codimension

   

 

is both versal

 

225