GENETIC ALGORITHMS AND THEIR APPLICATIONS: Proceedings of the Second International Conference on Genetic Algorithms, July 28-31, 1987, at the Massachusetts Institute of Technology, Cambridge, MA. Sponsored by the American Association for Artificial Intelligence, the Naval Research Laboratory, and Bolt Beranek and Newman, Inc. John J. Grefenstette, Naval Research Laboratory, Editor. LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS. 1987. Hillsdale, New Jersey; Hove and London.

Copyright © 1987 by Lawrence Erlbaum Associates, Inc. All rights reserved. No part of this book may be reproduced in any form, by photostat, microform, retrieval system, or any other means, without the prior written permission of the publisher. Lawrence Erlbaum Associates, Inc., Publishers, 365 Broadway, Hillsdale, New Jersey 07642. ISBN 0-8058-0158-8 cloth edition. ISBN 0-8058-0159-6 paperback edition. Printed in the United States of America.

ACKNOWLEDGEMENTS

On behalf of the Conference Committee, it is my pleasure to acknowledge the support of our sponsors: the American Association for Artificial Intelligence, the Navy Center for Applied Research in Artificial Intelligence at the Naval Research Laboratory, and Bolt Beranek and Newman, Inc. The Committee also appreciates the cooperation of Dr. Edwin H. Land. I would personally like to thank the other members of the Conference Committee for their conscientious efforts as referees. Stewart Wilson deserves special thanks for handling the local arrangements. John J.
Grefenstette, Program Chair.

Conference Committee: John H. Holland, University of Michigan (Conference Chair); Lashon B. Booker, Navy Center for Applied Research in AI; Dave Davis, Bolt Beranek and Newman; Kenneth A. De Jong, George Mason University; David E. Goldberg, University of Alabama; John J. Grefenstette, Navy Center for Applied Research in AI (Program Chair); Stephen F. Smith, Carnegie-Mellon Robotics Institute; Stewart W. Wilson, Rowland Institute for Science (Local Arrangements).

CONFERENCE PROGRAM

TUESDAY, JULY 28, 1987
5:00 - 9:00 REGISTRATION: Lobby, McCormick Hall
7:00 - 9:00 WELCOMING RECEPTION: Courtyard, McCormick Hall

WEDNESDAY, JULY 29, 1987
8:00 REGISTRATION: Room 10-280
9:00 OPENING REMARKS: Room 10-250
9:20 - 10:40 GENETIC SEARCH THEORY
Finite Markov chain analysis of genetic algorithms, David E. Goldberg and Philip Segrest ... 1
An analysis of reproduction and crossover in a binary-coded genetic algorithm, Clayton L. Bridges and David E. Goldberg ... 9
Reducing bias and inefficiency in the selection algorithm, James E. Baker ... 14
Altruism in the bucket brigade, Thomas H. Westerdale ... 22
10:40 - 11:00 COFFEE BREAK
11:00 - 12:00 ADAPTIVE SEARCH OPERATORS I
Schema recombination in pattern recognition problems, Irene Stadnyk ... 27
An adaptive crossover distribution mechanism for genetic algorithms, J. David Schaffer and Amy Morishima ... 36
Genetic algorithms with sharing for multimodal function optimization, David E. Goldberg and Jon Richardson ... 41
12:00 - 2:00 LUNCH
2:00 - 3:20 REPRESENTATION ISSUES
The ARGOT strategy: adaptive representation genetic optimizer technique, Craig G.
Shaefer ... 50
Nonstationary function optimization using genetic algorithms with dominance and diploidy, David E. Goldberg and Robert E. Smith ... 59
Genetic operators for high-level knowledge representations, H. J. Antonisse and K. S. Keller ... 69
Tree structured rules in genetic algorithms, Arthur S. Bickel and Riva Wenig Bickel ... 77
3:20 - 3:40 COFFEE BREAK
3:40 - 5:00 KEYNOTE ADDRESS
Genetic algorithms and classifier systems: foundations and future directions, John H. Holland ...
7:00 - 9:00 BUSINESS MEETING

THURSDAY, JULY 30, 1987
9:00 - 10:20 ADAPTIVE SEARCH OPERATORS II
Greedy genetics, G. E. Liepins, M. R. Hilliard, Mark Palmer and Michael Morrow ... 90
Incorporating heuristic information into genetic search, Jung Y. Suh and Dirk Van Gucht ... 100
Using reproductive evaluation to improve genetic search and heuristic discovery, Darrell Whitley ... 108
Toward a unified thermodynamic genetic operator, David J. Sirag and Paul T. Weisser ... 116
10:20 - 10:40 COFFEE BREAK
10:40 - 12:00 CONNECTIONISM AND PARALLELISM I
Toward the evolution of symbols, Charles P. Dolan and Michael G. Dyer ...
SUPERGRAN: a connectionist approach to learning, integrating genetic algorithms and graph induction, Deon G. Oosthuizen
... 132
Parallel implementation of genetic algorithms in a classifier system, George G. Robertson ... 140
Punctuated equilibria: a parallel genetic algorithm, J. P. Cohoon, S. U. Hegde, W. N. Martin and D. Richards ... 148
12:00 - 2:00 LUNCH
2:00 - 3:20 PARALLELISM II
A parallel genetic algorithm, Chrisila B. Pettey, Michael R. Leuze and John J. Grefenstette ... 155
Genetic learning procedures in distributed environments, Adrian V. Sannier II and Erik D. Goodman ... 162
Parallelisation of probabilistic sequential search algorithms, Prasanna Jog and Dirk Van Gucht ... 170
Parallel genetic algorithms for a hypercube, Reiko Tanese ... 177
3:20 - 3:40 COFFEE BREAK
3:40 - 5:00 CREDIT ASSIGNMENT AND LEARNING
Bucket brigade performance: I. Long sequences of classifiers, Rick L. Riolo ... 184
Bucket brigade performance: II. Default hierarchies, Rick L. Riolo ... 196
Multilevel credit assignment in a genetic learning system, John J. Grefenstette ... 202
On using genetic algorithms to search program spaces, Kenneth A. De Jong ... 210
6:30 - 10:00 CONFERENCE BANQUET: New England Clambake

FRIDAY, JULY 31, 1987
9:00 - 10:20 APPLICATIONS I
A genetic system for learning models of consumer choice, David Perry Greene and Stephen F.
Smith ... 217
A study of permutation crossover operators on the traveling salesman problem, I. M. Oliver, D. J. Smith and J. R. C. Holland ... 224
A classifier based system for discovering scheduling heuristics, M. R. Hilliard, G. E. Liepins, Mark Palmer, Michael Morrow and Jon Richardson ... 231
Using the genetic algorithm to generate LISP source code to solve the prisoner's dilemma, Cory Fujiki and John Dickinson ... 236
10:20 - 10:40 COFFEE BREAK
10:40 - 12:00 APPLICATIONS II
Optimal determination of user-oriented clusters: an application for the reproductive plan, Vijay V. Raghavan and Brijesh Agarwal ...
The genetic algorithm and biological development, Stewart W. Wilson ...
Susan Coombs and Lawrence Davis ...
12:00 - 2:00 LUNCH
2:00 - 3:20 PANEL DISCUSSION: GA's and AI
3:20 - 3:40 COFFEE BREAK
3:40 - 5:00 INFORMAL DISCUSSION AND FAREWELL

FINITE MARKOV CHAIN ANALYSIS OF GENETIC ALGORITHMS
David E. Goldberg and Philip Segrest
The University of Alabama, Tuscaloosa, AL 35487

ABSTRACT

A finite Markov chain analysis of a single-locus, binary allele, finite population genetic algorithm (GA) is presented in this paper. The Markov analysis is briefly derived, and computations are presented for two kinds of problems: genetic drift (no preference for either allele) and preferential selection (one allele is selected over the other). Approximate analyses are presented to explain the detailed Markov analysis. These computations are useful in choosing parameters for artificial genetic search.

INTRODUCTION

Genetic algorithms (GAs) are receiving increasing application in a variety of search and machine learning problems.
These efforts have been greatly aided by the existence of theory that explains what GAs are processing and how they are processing it. The theory largely rests on Holland's exposition of schemata (1968, 1975), his bridge to the two-armed bandit problem (1973, 1975), his fundamental theorem of genetic algorithms (1973, 1975), and later works by several of his students (Martin, 1973; De Jong, 1975; Bethke, 1981). All of these works have necessarily made bounding assumptions: population sizes have been assumed to be infinitely large, probability limits have been estimated by relatively crude limits, and in some cases genetic operators have even been modified to facilitate the analysis. As a result, there is still a need for more exact analysis of genetic algorithm behavior using finite populations and realistic analytical models of genetic operators.

In this paper, we perform a Markov chain analysis of a one-locus, two-operator (reproduction and mutation) genetic algorithm acting upon a finite population of haploid binary structures. Specifically, we calculate the expected time of first passage to various levels of convergence under different selection ratios and mutation rates. To understand the Markov chain analysis and its application, we first develop an analysis of genetic drift: convergence in finite populations under no selective pressure. Understanding this phenomenon is useful in explaining why finite GAs make convergence errors at relatively unimportant bit positions. We then examine expected GA performance at different levels of selective pressure assuming a deterministic fitness function. This analysis permits calculation of the selective pressure required to reduce the probability of selecting the wrong bit to some known level. We also discuss how these results and their extension may be used for designing and sizing finite GAs.
STOCHASTIC ERRORS IN FINITE GENETIC ALGORITHMS

As simple three-operator genetic algorithms have been used in a wider array of search and machine learning applications (Goldberg & Thomas, 1986), a number of objections have been voiced concerning their performance. Chief among these objections is the occurrence of premature convergence (De Jong, 1975). Premature convergence is that event where a population of structures attains a high level of uniformity at all loci without containing sufficiently near-optimal structures. Two reasons have been given for this undesirable form of convergence: a problem (with its chosen coding) can itself be GA-hard, or the finite GA can suffer from stochastic errors.

GAs may diverge because a problem is inherently difficult for a three-operator GA. Such problems have been called GA-hard (Goldberg, 1983), and Bethke (1981) has proved that GA-hard problems exist using Walsh function computation of schema averages. Reordering operators based on natural precedent have been suggested (Holland, 1975; Goldberg & Lingle, 1985) as one remedy for such problems, especially those where building blocks are not linked tightly enough to permit the three-operator GA to find near-optimal points. Despite the possibility of GA-hardness, Bethke's study and an extended schema analysis of the minimal deceptive problem (Goldberg, 1986) suggest that it is more difficult to construct intentionally misleading (GA-hard) problems than was previously thought. Furthermore, this notion--that GA-hard problems are relatively hard to construct--is empirically supported by the widespread success of simple GAs across a spectrum of problems. Nonetheless, the study of GA-hard problems and the design of operators to circumvent the difficulty remain important open areas of research activity.

The other major source of convergence difficulty in genetic algorithms results from the stochastic errors of small populations.
These errors may themselves be divided into two types: errors in sampling and errors in selection. A pollster makes a sampling error when he selects a sample size which is too small to achieve the accuracy he desires. He also commits a sampling error when the sample he selects is not representative of the population as a whole. Genetic algorithms may make similar sampling errors when either the strings representing important schemata are not present in sufficient numbers or the individuals present are not representative of the whole similarity subset. Sampling errors in small populations in this way prevent the proper propagation of the correct (above average) schemata, thereby circumventing the expected and desired action predicted by the schema theorem (Holland, 1975).

Errors of selection are harder to understand because their disruptive effects are counterintuitive. A simple example will help clarify the problem. Suppose we have a population of 50 single-locus structures containing 25 ones and 25 zeros. Suppose further that we select a new population from the old population by choosing 50 new members one at a time using selection with replacement (where once picked, an individual is placed back in the sampled population). Since we are picking each new member with probability 1/50, and since there are currently 25 ones and 25 zeros, the expected numbers of ones and zeros are still 25 and 25 respectively; however, because the selection process is fairly noisy, we shouldn't expect to retain exactly the same 25/25 split. In fact, the probability of exactly that division is only

    P(25/25) = C(50, 25) (0.5)^50 ≈ 0.1123

Therefore it is reasonably likely the population will fall away from this initial starting point. In the next generation, the new population becomes the new initial condition, with the expected number of ones and zeros determined by the new state.
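The 25/25 probability above is easy to verify directly. The following Python sketch (an illustration, not part of the original paper) evaluates the binomial term:

```python
from math import comb

# Probability that selection with replacement from a 25/25 population
# of size N = 50 again yields exactly 25 ones: C(50, 25) * 0.5**50.
p_same_split = comb(50, 25) * 0.5 ** 50
print(round(p_same_split, 4))  # 0.1123
```

So roughly nine times out of ten, a single generation of noisy selection moves the population away from the even split.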
This process continues, and in finite time for a finite population, the population converges to all ones or all zeros. Once converged, there is no way to get back any of the missing material (unless we permit some mutation, a possibility we will consider later). Geneticists have long recognized that finite populations do converge in this manner even when there is no selective advantage for one allele over another. This error of selection is so important it has been given a special name, genetic drift. Selection errors can accumulate, causing a drift to one allele or the other.

By itself, genetic drift is bad enough. After all, if the environment has no preference for one allele or another, we might like a GA to preserve both of them (perhaps we would even like the GA to preserve them in relatively equal numbers). Genetic drift insures that this won't happen unless we take special action to encourage this behavior (see Goldberg & Richardson, this volume). To make matters worse, these errors of selection can cause a genetic algorithm to converge to the wrong allele when the environment does prefer one allele over the other, especially when that selective advantage is relatively small. The analysis of these difficulties is our main concern in the remainder of this paper. We consider first the case of no selection pressure--genetic drift--followed by analysis of cases with varying degrees of selective pressure.
MARKOV CHAIN ANALYSIS OF GENETIC DRIFT

Discrete and continuous models of genetic drift have received attention in the literature of mathematical biology (Crow & Kimura, 1970); however, many of these analyses contain additional details of biological reality that are not always of interest to genetic algorithmists. De Jong (1975) presents computer simulations of genetic drift in the context of simple GAs, showing graphs of the relationship between expected first passage time to varying convergence levels as a function of population size and mutation rate. In this section, we calculate these quantities more exactly using the mathematics of Markov chains (Kemeny & Snell, 1960).

Markov Chains

Suppose we have a sequence of random variables S_0, S_1, ..., and suppose the possible values for these random variables are drawn from the set {0, 1, ..., N}. We think of the random variable S_t as the state of some system at time t; more precisely, the system is in state i at time t if S_t = i. If at each time t there is a fixed probability p_ij that the system will be in state j at time t+1 when the system was in state i at time t, we say the sequence of random variables forms a Markov chain. The fixed quantities p_ij are said to be transition probabilities:

    p_ij = P{ S_{t+1} = j | S_t = i }

Markov chains may be classified by the types of states they contain. States that may not be reached as a process goes to infinity are said to be transient states. Those that may be reached as time marches on are said to be ergodic states (non-transient). In this paper, we will only be interested in those Markov chains that contain a particular type of ergodic state: an absorbing state.
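As an aside (not from the paper), the drift process analyzed here is straightforward to simulate. The following Python sketch resamples a one-locus population with replacement each generation until it absorbs at all ones or all zeros; the seed parameter is an arbitrary choice for reproducibility:

```python
import random

def drift_until_absorbed(N, seed=1):
    """Simulate one-locus genetic drift: resample N members with
    replacement each generation until all ones or all zeros remain."""
    random.seed(seed)
    i, gens = N // 2, 0  # start at half ones, half zeros
    while 0 < i < N:
        # Each of the N new members is a one with probability i/N.
        i = sum(random.random() < i / N for _ in range(N))
        gens += 1
    return i, gens

state, gens = drift_until_absorbed(20)
print(state in (0, 20), gens > 0)  # True True
```

Repeated runs with different seeds show exactly the convergence-without-preference behavior the Markov analysis below quantifies.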
A state is said to be absorbing if once entered it may never be left. For any state i, this is true if and only if the following condition holds:

    p_ii = 1

We have already seen examples of absorbing states in our earlier description of genetic drift: we recognize the all-zeros and all-ones states as absorbing states, because once we get to either one of them we can never leave. A chain with at least one absorbing state and no ergodic states other than absorbing states is said to be an absorbing Markov chain. The states of an absorbing Markov chain may be renumbered to obtain a transition probability matrix P in the following canonical form (Kemeny & Snell, 1960):

    P = [ I  0 ]
        [ R  Q ]

where I is an identity matrix and 0 is a matrix of all zeros. The submatrices Q and R are used to calculate important properties of an absorbing chain.

Genetic Drift without Mutation

To illustrate the construction of a transition probability matrix, we turn to the genetic drift problem at hand. Suppose we have a population of N ones and zeros. We let state 0 be that situation where we have all zeros, we let state N be that case where we have all ones (no zeros), and in general we let state i be that state with exactly i ones. Thus we have a total of N+1 states, i = 0, 1, ..., N, that together represent all possible conditions within the population. Then if we assume random selection of exactly N new population members with replacement, we may calculate the elements of our drift transition probability matrix P_d as follows:

    (P_d)_ij = C(N, j) (i/N)^j (1 - i/N)^(N-j)

These calculations are provably correct, because at each state i, we have a probability of selecting a one p_one = i/N; further, the probability of getting to state j (j ones) is binomially distributed with probability p_one and size N. In this matrix, the states 0 and N are both absorbing states because p_00 = p_NN = 1. Thus, it is a simple matter to obtain the canonical form of the matrix shown above.
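For concreteness, the drift matrix P_d defined above can be tabulated in a few lines of Python (an illustrative sketch, not part of the paper):

```python
from math import comb

def drift_matrix(N):
    """(P_d)_ij = C(N, j) (i/N)^j (1 - i/N)^(N-j): each of N new members
    drawn with replacement is a one with probability i/N."""
    return [[comb(N, j) * (i / N) ** j * (1 - i / N) ** (N - j)
             for j in range(N + 1)] for i in range(N + 1)]

P = drift_matrix(10)
print(P[0][0], P[10][10])            # states 0 and N are absorbing: 1.0 1.0
print(abs(sum(P[3]) - 1.0) < 1e-9)   # each row is a probability distribution: True
```

The two printed checks confirm the properties used in the text: the endpoint states are absorbing, and every row sums to one.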
In particular, we are interested in the Q matrix, which may be obtained simply by stripping off the first and last rows and columns of the original P_d matrix. The Q matrix may then be used to calculate the expected number of visits n_ij to any transient state j starting in the transient state i (the N matrix):

    N = (I - Q)^(-1)

We won't belabor the details leading to this calculation here; however, the desired N matrix may first be written as an infinite matrix geometric series in Q. Thereafter, the closed form calculation may be obtained in a manner similar to that used in summing an infinite scalar geometric series. The total expected time in transit to any one of the absorbing states starting in state i--we call this the transit time t_i--may easily be calculated as a row sum over the ith row of the N matrix:

    t_i = sum_j n_ij

The t_i values may then be used to calculate the expected time to absorption given any initial starting conditions. If we assume an initial population chosen uniformly at random, the probability of being in a state i initially--we call this pi_i--is binomially distributed as follows:

    pi_i = C(N, i) (0.5)^N

The expected time to an absorbing state, the quantity t_absorbed, may then be calculated as the dot product of the initial state probability vector pi and the transit time vector t:

    t_absorbed = sum_i pi_i t_i

These calculations suggest a straightforward procedure for calculating the expected time until either the all-ones state (state N) or all-zeros state (state 0) is reached, but what can we do if we are interested in calculating the time of first passage to some level of convergence other than 100 percent? We may calculate such quantities quite simply by replacing all rows in the P_d matrix within a desired percentage of convergence by identity rows: p_ii = 1 and p_ij = 0 for j ≠ i.
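These formulas translate directly into code. The sketch below (an illustration under the definitions above, not the authors' APL program) forms Q, solves (I - Q)t = 1 by Gauss-Jordan elimination rather than inverting explicitly, and weights the transit times by the binomial initial distribution:

```python
from math import comb

def expected_absorption_time(N):
    """t_absorbed = sum_i pi_i t_i, where t = (I - Q)^(-1) 1 over the
    transient drift states i = 1, ..., N-1 and pi_i = C(N, i) 0.5^N."""
    n = N - 1
    Q = [[comb(N, j) * (i / N) ** j * (1 - i / N) ** (N - j)
          for j in range(1, N)] for i in range(1, N)]
    # Augmented system (I - Q | 1), reduced by Gauss-Jordan elimination.
    A = [[(1.0 if r == c else 0.0) - Q[r][c] for c in range(n)] + [1.0]
         for r in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    t = [A[r][n] / A[r][r] for r in range(n)]
    # Weight by the binomial initial distribution over transient states.
    return sum(comb(N, i + 1) * 0.5 ** N * t[i] for i in range(n))

print(round(expected_absorption_time(2), 6))  # 1.0
```

For N = 2 the single transient state has Q = [[0.5]], so t_1 = 2 and t_absorbed = C(2,1)(0.5)^2 * 2 = 1 generation, which the function reproduces; larger N values exhibit the near-linear growth reported in the paper.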
This bit of chicanery makes the once-transient states absorbing states, and the procedure described above may be used on a now-reduced Q matrix to determine expected first passage times to the particular level of convergence. These computations are carried out for convergence levels of 60%, 70%, 80%, 90%, and 100%. The expected time of first passage is calculated for populations ranging from N=10 to N=100. Computations have been performed in APL to exploit that language's facility with vector and matrix operations.

Figure 1 displays the expected first passage time versus population size at different convergence levels. The relationship between passage time and population size at a given level of convergence is strikingly linear. Although we may expect the passage time to grow with increased population size, it is not at all obvious why that growth is linear.

Figure 1. Genetic drift Markov chain computation of expected first passage time versus population size at different convergence levels with no mutation.

To understand this better, we appeal to a simpler but related stochastic model: the gambler's ruin problem. In the gambler's ruin problem, a gambler with a fixed stake D places successive one dollar wagers on the outcome of a coin toss. In our version of the game we assume that the coin is fair, p_winning = p_losing = 0.5. The game ends when the gambler either loses all his money or attains a goal value A. All this is interesting, but how is it at all related to the genetic drift problem? We may view the genetic drift problem as a gambler's ruin problem where we drift toward either ruin (all zeros) or reward (all ones). In the real genetic drift problem, our range of outcomes on a given "gamble" is not binary; however, by using an average step size per generation we may calculate an approximate form for the genetic drift solution using the simpler gambler's ruin model.
Fortunately, the expected duration of the gambler's ruin problem is a well-known computation (Feller, 1968). Under the assumptions above, the expected number of coin tosses to ruin or reward may be calculated as

    E{number of coin tosses} = D(A - D)

In our problem, the stake amount is given by D = N(p_con - 0.5) (the distance to ruin or reward at a particular convergence level p_con); the desired accumulation to quit (assuming equal distances) is twice the stake amount, so A - D = D. To place these quantities in terms of numbers of generations, we divide both terms (D and A-D) by the average number of steps taken per generation. This quantity is simply the standard deviation of the expected number of ones (or zeros) per generation:

    sigma = sqrt(N p (1 - p))

This operation yields the following approximate expression for the number of generations to a particular level of convergence:

    t_convergence ≈ (2 p_con - 1)^2 N / [4 p (1 - p)]

Thus we have reasoned that the expected transit time increases linearly with population size. As with many approximate models, the derived proportionality constant is mainly useful as an indicator of order of magnitude. The simple model does help explain why the slopes of lines of successively higher convergence proportion increase faster than the proportion itself.

Genetic Drift with Mutation

To combat the undesired convergence of genetic drift in simple GAs, mutation rate increases are often suggested. To investigate the effect of mutation on expected times of convergence, we continue our Markov chain analysis by including mutation in our overall probability transition matrix. With mutation included, we view the overall probability transition matrix P as the product of the drift matrix P_d developed earlier and a mutation probability transition matrix P_m:

    (P) = (P_d)(P_m)

To develop the mutation transition probability matrix, we need to select from a number of possible mutation models.
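The gambler's-ruin estimate is simple enough to state as a one-line function. This Python sketch (illustrative; parameter names are mine) evaluates t_convergence = (2 p_con - 1)^2 N / [4 p (1 - p)]:

```python
def drift_time_estimate(N, p_con, p=0.5):
    """Gambler's-ruin estimate of generations to reach convergence level
    p_con under pure drift: D^2 / sigma^2 with D = N(p_con - 0.5) and
    sigma^2 = N p (1 - p)."""
    return (2 * p_con - 1) ** 2 * N / (4 * p * (1 - p))

# Linear growth in N at a fixed convergence level:
print(drift_time_estimate(50, 1.0), drift_time_estimate(100, 1.0))  # 50.0 100.0
```

Doubling N doubles the estimate, which is exactly the linear growth visible in Figure 1, even though the constant itself is only an order-of-magnitude guide.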
We may perform a single mutation with probability p_m, replace the individual in the original population, and then perform a sequence of N such operations (we call this mutation with sequential replacement). We may also mutate (again with probability p_m) population member by population member (mutation without replacement). We develop the transition probability matrices for both types of mutation.

The transition probability matrix for mutation with sequential replacement may be developed quite simply. A single mutation permits a shift in state by one step (either one more one or one less one). Thus, the single mutation transition probability matrix P_ms is tridiagonal, and its elements may be specified as follows:

    (P_ms)_ij = p_m (N - i)/N   if j = i + 1
              = p_m i/N         if j = i - 1
              = 1 - p_m         if j = i
              = 0               otherwise

The overall mutation transition probability matrix under sequential replacement may then be taken as the product of N single mutation matrices:

    (P_m)_sequential = (P_ms)^N

The transition probability matrix for mutation without replacement may also be calculated. Although the computation is more cumbersome, the extra effort is worthwhile, as mutation without replacement is a more faithful model of mutation as it is implemented in most GAs. To calculate the transition probability matrix for mutation without replacement, we recognize a simple fact. To go up in state value by, for example, two ones, we must have exactly two more zeros change to ones than we have ones change to zeros. If we sum over all the possible occurrences where this is true, we obtain the elements of the transition probability matrix. For j > i (where we shift to a state with more ones) we obtain the elements of the transition probability matrix as follows:

    (P_m)_ij = sum_{k=0}^{c} C(N-i, j-i+k) C(i, k) p_m^(j-i+2k) (1 - p_m)^(N-j+i-2k),   where c = min(i, N-j)

A similar expression may be derived for the case where j < i. Under preferential selection with fitness ratio r, the expected proportion of ones P_1 in a large population obeys

    P_1^(t+1) = r P_1^t / [(r - 1) P_1^t + 1]

where the superscript t is a generation index.
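The sequential-replacement model can be checked numerically. The sketch below (illustrative; the tridiagonal elements follow the single-mutation model described above) builds P_ms and raises it to the Nth power by repeated multiplication:

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def sequential_mutation_matrix(N, pm):
    """(P_m)_sequential = (P_ms)^N: N applications of a tridiagonal
    single-mutation step that moves the state up or down by one."""
    Pms = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        if i > 0:
            Pms[i][i - 1] = pm * i / N        # a one mutates to a zero
        if i < N:
            Pms[i][i + 1] = pm * (N - i) / N  # a zero mutates to a one
        Pms[i][i] = 1.0 - pm                  # no mutation this step
    M = Pms
    for _ in range(N - 1):
        M = matmul(M, Pms)
    return M

Pm = sequential_mutation_matrix(5, 0.01)
print(abs(sum(Pm[2]) - 1.0) < 1e-9)  # rows remain stochastic: True
```

Note that unlike the drift matrix, this matrix has no absorbing states for p_m > 0: mutation can always reintroduce a lost allele, which is why it combats drift.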
For r values near 1 this equation reduces to a geometric progression in r:

    P_1^t = r^t P_1^0

Taking the natural logarithm of both sides, we obtain the following relationship for the time to a given proportion of ones:

    t = ln[P_1^t / P_1^0] / ln r

Figure 4. Markov chain computation of expected first passage time versus fitness ratio r for reproduction cases to 100 percent convergence level with different population sizes.

Figure 5. Markov chain computation of probability of correct convergence versus fitness ratio r for reproduction cases to 100 percent convergence level with different population sizes.

On a plot of log(t) versus log(log r), this curve plots as a straight line with negative slope. Referring to Figure 4, at high enough r values we notice this straight line behavior. At small r values the curve levels out and approaches the constant value predicted by the genetic drift computations. We may derive an approximate r value where this divergence should occur with a little physical reasoning. During a given generation we expect an increase in ones approximately equal to N(r-1)P_1^t. On average, the increase or decrease of ones due to selection noise is simply the standard deviation sigma. When the noise is of the same order of magnitude or greater than the expected increase, we should expect divergence between the simplified model and the finite model.
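The large-population timing estimate above is trivial to evaluate. This Python fragment (an illustration with arbitrary example numbers) applies t = ln(P_1^t / P_1^0) / ln r:

```python
from math import log

def gens_to_proportion(p0, pt, r):
    """t = ln(P_1^t / P_1^0) / ln r: generations to move the proportion
    of ones from p0 to pt, valid in the large-population, r-near-1 regime."""
    return log(pt / p0) / log(r)

# e.g. from half ones to 99 percent ones at fitness ratio r = 1.1:
print(round(gens_to_proportion(0.5, 0.99, 1.1), 2))  # 7.17
```

Since t is proportional to 1/ln r, the log(t) versus log(log r) plot mentioned in the text is a straight line of slope -1.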
This divergence should occur at values of r predicted by the following relationship:

    N(r - 1)P_1 ≈ sqrt(N P_1 (1 - P_1))

When the population is half ones and half zeros, this equation says that the large population result becomes suspect when the excess fitness (r-1) is less than the inverse of the square root of the population size. This relationship explains the decreasing break point value (between drift-like and convergent behavior) with increasing population size, as can be seen in both Figures 4 and 5. At values of r beneath this critical value, the expected time results flatten, and the probability of converging to the correct allele starts to fall dramatically.

Similar calculations may be performed for reproduction with mutation. As in the genetic drift case, we take the overall transition probability matrix as the product of two matrices; here, we multiply the reproduction transition probability matrix by a mutation matrix (without replacement) as follows:

    (P) = (P_r)(P_m)

We perform expected first passage time calculations as before. In Figure 6, we graph first passage time versus r for different mutation probabilities p_m at a population size N=50. Using our previous physical analysis, we expect the finite analysis to approach the bounding analysis for r-1 values greater than sqrt(1/50) ≈ 0.1414. This holds true for small p_m values; however, as the mutation probability grows, the expected time grows beyond that predicted by the bounding analysis. That this should occur may be reasoned intuitively. If the expected net loss of good alleles due to mutation starts to exceed the expected increase in good alleles from reproduction, the process loses its ability to converge as quickly. This point occurs when the following condition holds true:

    r - 1 ≈ p_m

When the excess fitness (r-1) is of the same order or less than the mutation rate, the GA will take a longer time to converge, because mutation is adding errors faster than reproduction can erase them. This shows up in Figure 6 as successively higher mutation rates cause the expected time curves to break away from the bounding analysis results more quickly.

Figure 6. Markov chain computation of expected first passage time versus fitness ratio r for reproduction and mutation cases to 90 percent convergence with N=50 and different mutation levels.

CONCLUSIONS

In this paper, we have analyzed the performance of one-locus, two-operator (reproduction and mutation) genetic algorithms with finite populations of haploid structures using finite Markov chains. The results in particular and the Markov chain analysis in general are useful in understanding the performance of finite GAs commonly in use. These results and their extension should be useful in sizing populations appropriately, selecting proper mutation rates, and choosing rates of selection in scaling procedures.

ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grant MSM-8451610.

REFERENCES

Bethke, A. D. (1981). Genetic algorithms as function optimizers. (Doctoral dissertation, University of Michigan). Dissertation Abstracts International, 41(9), 3503B. (University Microfilms No. 8106101)

Crow, J. F., & Kimura, M. (1970). An introduction to population genetics theory. New York: Harper and Row.

De Jong, K. A. (1975). An analysis of the behavior of a class of genetic adaptive systems. (Doctoral dissertation, University of Michigan). Dissertation Abstracts International, 36(10), 5140B. (University Microfilms No. 76-9381)

Feller, W. (1968). An introduction to probability theory and its applications (Vol. 1, 3rd ed.). New York: Wiley.

Goldberg, D. E. (1983). Computer-aided gas pipeline operation using genetic algorithms and rule learning (Doctoral dissertation, University of Michigan). Dissertation Abstracts International, 44(10), 3174B. (University Microfilms No. 8402282)

Goldberg, D. E. (1986). Simple genetic algorithms and the minimal deceptive problem (TCGA Report No. 86003).
Tuscaloosa: University of Alabama, The Clearinghouse for Genetic Algorithms.

Goldberg, D. E., & Lingle, R. (1985). Alleles, loci, and the traveling salesman problem. In J. J. Grefenstette (Ed.), Proceedings of an International Conference on Genetic Algorithms and Their Applications (pp. 154-159). Pittsburgh: Carnegie-Mellon University.

Goldberg, D. E., & Thomas, A. L. (1986). Genetic algorithms: bibliography 1962-1986 (TCGA Report No. 86001). Tuscaloosa: University of Alabama, The Clearinghouse for Genetic Algorithms.

Holland, J. H. (1968). Hierarchical descriptions of universal spaces and adaptive systems (Technical Report ORA Projects 01252 and 08226). Ann Arbor: University of Michigan, Department of Computer and Communication Sciences.

Holland, J. H. (1973). Genetic algorithms and the optimal allocations of trials. SIAM Journal on Computing, 2(2), 88-105.

Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor: The University of Michigan Press.

Kemeny, J. G., & Snell, J. L. (1966). Finite Markov chains. Princeton: Van Nostrand.

Martin, N. (1973). Convergence properties of a class of probabilistic adaptive schemes called sequential reproductive plans. (Doctoral dissertation, University of Michigan). Dissertation Abstracts International, 3747B. (University Microfilms No. 74-3685)

AN ANALYSIS OF REPRODUCTION AND CROSSOVER IN A BINARY-CODED GENETIC ALGORITHM

Clayton L. Bridges and David E. Goldberg
The University of Alabama
Tuscaloosa, AL 35487

ABSTRACT

The foundation of genetic algorithm (GA) theory--the so-called schema theorem or fundamental theorem of genetic algorithms--provides a lower bound on the expected number of representatives of a particular schema (similarity subset) in the next generation under various genetic operators.
In this paper, assuming a large population of binary, haploid structures of known distribution, processed by fitness proportionate reproduction, random mating, and random, single-point crossover, an exact expression for the expected proportion of a particular string (or representatives of a particular schema) in the next generation is calculated. This derivation is useful in analyzing the expected performance of simple GAs.

INTRODUCTION

Over the past two decades, the application of genetic algorithms (GAs) to search and machine learning problems in science, commerce, and engineering has been made possible by a number of theoretical developments (Holland, 1973, 1975; De Jong, 1975; Bethke, 1981). Without these theories, it is doubtful that many of us would have made much sense of our computer simulations or experiments; this speculation is supported by the experience of genetic algorithm prehistory. We need only recall some of the evolutionary schemes that resorted to mutation-plus-save-the-best strategies (Box, 1957; Bledsoe, 1959; Friedman, 1959; Fogel, Owens, & Walsh, 1966) to remember that shots in the dark without schemata or the fundamental theorem (Holland, 1975) can be frustrating experiences indeed. Because of the usefulness of theory to progress in genetic algorithm research, there is still a pressing need to improve our understanding of the foundations--the theoretical underpinnings--of genetic algorithms and their derivatives. In this paper, we extend the fundamental theorem of genetic algorithms to exactitude. Specifically, we derive a complete set of equations describing the combined effects of reproduction and crossover on a large population of binary, haploid structures.
These equations may be used to determine the correct expected performance of a genetic algorithm on a given problem with specified coding; they may be used for calculating the correct expected propagation of a set of competing schemata; they may also be used for estimating disruption or source probabilities for particular strings or particular schemata in a specified population of structures. In the remainder of the paper, we develop our extended analysis in three steps. We reexamine the fundamental theorem of GAs (the schema theorem), calculate an exact expression for the probability of disruption due to crossover, and calculate the expected gain of individuals due to mating and crossover by others.

THE FUNDAMENTAL THEOREM OF GENETIC ALGORITHMS

The fundamental theorem of genetic algorithms (Holland, 1975) calculates a bound on the expected number, m, of schemata (similarity templates), H, in successive generations, t, under the action of reproduction and crossover (other operators are often included in the calculation; we choose to focus on reproduction and crossover alone):

m(H, t+1) ≥ m(H, t) · [f(H)/f̄] · [1 − p_c·δ(H)/(ℓ−1)]

In this equation, ℓ is the string length, p_c is the probability of crossover, and δ(H) is the schema defining length (the distance between its outermost defining positions). In words, the schema theorem tells us that a particular schema H receives trials according to the ratio of schema fitness to population average fitness as long as the schema is not unduly disrupted by crossover. The average fitness of a schema f(H) may be calculated by summing over all strings s_j, representatives of H at time t:

f(H) = [Σ_{ {j | s_j ∈ H} } f(s_j)] / m(H, t)

We note that the theorem is an inequality--a lower bound. Our main goal in this paper is to transform the bound to an equality. To do this, it is helpful to look at a full schema conservation equation in broad outline:

m(H, t+1) = m(H, t) · [f(H)/f̄] · [1 − p_c·p_d] + [gains from crossing]

The increase due to reproduction is multiplied by the probability of the schema surviving a cross (one minus the product of the crossover probability p_c and the probability of disruption p_d). For completeness we must also include possible gains from mating and crossover of other strings. Hidden within this broad outline are the two items that prevent the schema theorem from providing an exact analysis. First, we usually calculate a crude upper bound on the probability of disruption p_d due to crossover; the usual term δ(H)/(ℓ−1) assumes that the schema is destroyed every time a cross falls within its defining length. This is a conservative assumption because it ignores the possibility of getting back the same material from a particular mate. Second, the schema theorem ignores all sources of schemata from crosses of strings containing different competing schemata. In crossover, one schema's loss is another's gain. Although these terms may seem small and somewhat beside the point, a recent study (Goldberg, 1986) has shown that inclusion of these terms in an extended schema analysis permits the prediction of convergence of a simple genetic algorithm on a problem specifically designed to cause the GA to diverge (the minimal, deceptive problem). As such, these terms deserve more of our attention. We first look at the probability of disruption due to crossover.

DISRUPTION DUE TO CROSSOVER

In this section we calculate the probability of disruption of a string due to the set of all possible crosses. To do this, we define some useful symbols and terms. We take S as the set of all possible binary strings of length ℓ; there are 2^ℓ such strings. Furthermore, we index the set S so s_j is the jth member of S, where j goes from 0 to 2^ℓ−1. As a convenience, we choose our indexing scheme so the decoded value of our string is equal to the index value j.
We note that we are working in terms of strings, and the schema theorem is written in terms of schemata. We continue here to work with strings as the equations are easier to derive and understand; however, keep in mind that we ultimately want to come back to a particular set of competing schemata. We do this in a later section. We define arbitrary string variables B, Q, and R; these may take on any value in S. We also permit ourselves to refer to their individual positions by subscripted lower case letters:

B = b_0 b_1 ... b_{ℓ−1}

Thus b_j is the Boolean variable representing the jth position of a string of length ℓ. Before we proceed with the derivation of the probability of disruption, we convert the schema theorem to proportion form. Suppose reproduction is operating by itself. In this case, we only have the first part of the schema theorem: m(B, t+1) = m(B, t)·f(B)/f̄. If we divide both sides of the equation by the population size N, and if we define P_B^t to be the proportion of the string B in the population at time t, we obtain the following equation for the expected proportion in the next generation (or the mating pool) under the action of reproduction alone:

P_B^{t+1} = P_B^t · f_B / f̄

where f_B is the fitness of string B and f̄ is the average fitness of the population as given by the following expression:

f̄ = Σ_{j=0}^{2^ℓ−1} f_{s_j} P_{s_j}^t

Of course, we are currently interested in calculating the crossover disruption term that reduces the above expression by a factor 1 − p_c·p_d, where p_c is the probability of crossover and p_d is the probability of disruption; however, we envision the GA acting in two phases: a selection phase followed by a mating and crossover phase. Therefore it is useful to define the expected proportion of string B under reproduction alone. We shall use this quantity in our calculation of the probability of disruption p_d.
We call this important intermediate quantity R_B^t, the reproductive proportion:

R_B^t = P_B^t · f_B / f̄

To calculate the probability of disruption to a given string B under crossover, we need to know which strings will disrupt B and which won't. Clearly a string that is different from another string by a single bit cannot fail to return a copy of the original string among the two strings produced by the cross. For example, suppose we have two strings B and B′ which differ at position 3 (the bar is used to indicate a complementary bit position):

B  = b_0 b_1 b_2 b_3 b_4
B′ = b_0 b_1 b_2 b̄_3 b_4

Every possible cross produces a copy of B. On the other hand, strings with two or more bits of difference must (for at least one cross site) disrupt one another. Consider the two strings with three different bits:

B  = b_0 b_1 b_2 b_3 b_4
B′ = b_0 b̄_1 b̄_2 b̄_3 b_4

In this case, disruption occurs if a cross falls between the two outermost bits of difference (between positions 1 and 3). These two outermost bits--we call them sentry bits--are important in the analysis of disruption, because we need only consider their location when deciding how two strings will disrupt one another. We can use this information to create a general scheme for analyzing mutual disruption:

Region:           begin      middle           end
Length:           x          δ+1              ℓ−δ−x−1
Characteristics:  b ... b    b̄ * ... * b̄     b ... b

Here we have divided the string into three regions: the beginning, the middle, and the end. In the beginning and ending regions, the strings under consideration must have the same bit values as string B. The middle region is bounded by sentry bits (the b̄'s) where the string B is different from its prospective mate. Bit positions other than the sentry positions are marked with *'s, the usual don't care or wild card symbol used in schema analysis; we may properly use the don't care symbol here, because it really does not matter where we cross between sentry bits. Every such cross is disruptive. To quantify the disruption we define a number of useful variables.
We take δ (δ ∈ [1, ℓ−1]) as the defining length of the middle region; this is also the number of possible cross sites. Separately we recognize that the length of the middle section (including both sentries) is δ+1. We define x (x ∈ [0, ℓ−δ−1]) as the length of the beginning region; it also is the position of the first sentry bit. Finally, we recognize that the length of the final section is the remaining portion that causes the total to sum to ℓ: ℓ−δ−x−1. To facilitate our disruption computation, we define a middle function M = M[B,δ,x]. The middle function is a subset generator, generating a similarity subset according to the following schema:

M[B,δ,x] = b_0 ... b_{x−1} b̄_x * ... * b̄_{x+δ} b_{x+δ+1} ... b_{ℓ−1}

We may now use the middle function M to calculate the probability of disruption of a given string B. This probability is the sum over all strings of the product of proportions and defining lengths:

p_d = Σ_{δ=1}^{ℓ−1} Σ_{x=0}^{ℓ−δ−1} [δ/(ℓ−1)] Σ_{ {j | s_j ∈ M[B,δ,x]} } R_{s_j}^t

The use of the middle function M insures that we count each string only once. This may be confirmed by removing the δ and R factors and simply using the summation to count the number of such crosses. After clearing the expression, we find that there are 2^ℓ−ℓ−1 strings that match the M function. This quantity may be reasoned independently as the total number of strings less the number of one-bit mismatches (these don't disrupt) and the string B itself (a zero-bit mismatch).

STRING GAINS FROM CROSSOVER

Just as we are interested in detailing the losses from crossover, so are we interested in knowing the potential gain from other string crosses. We may construct a template to identify precisely when this occurs. Specifically, we consider the construction of a string B from two strings Q and R as follows:

Region:             begin        middle     end
Length:             a            o          w
Q characteristics:  * ... * b̄   b ... b    b ... b
R characteristics:  b ... b     b ... b    b̄ * ... *
We now demand that the Q string be identical to B in the middle and end regions, and we require the R string to be identical to B in the beginning and middle sections. This time around we post sentries just outside the middle region, and when we cross between the sentries we obtain a copy of the desired string B as one of the products of the cross. To make our summation a bit easier we define several important quantities. The quantity a (a ∈ [1, ℓ−1]) is the length of the beginning region in string Q. The quantity w (w ∈ [1, ℓ−a]) is the length of the ending region in R, and o (o = ℓ−a−w) is the length of the middle region. We define formal string functions to specify the strings they may cross to yield B. We call the two functions the beginning function A[B,a] and the ending function Ω[B,w]. The beginning function is a subset generator as follows:

A[B,a] = * ... * b̄_{a−1} b_a ... b_{ℓ−1}

The ending function generates subsets as follows:

Ω[B,w] = b_0 ... b_{ℓ−w−1} b̄_{ℓ−w} * ... *

The probability that these two strings will cross to give B is dependent on the value of o, the length of the region where they are the same. If o = 0, then there is only one site where the strings can cross to yield B. In general there are a total of o+1 cross sites; therefore, the probability that a cross will give B is given by the expression:

(o+1)/(ℓ−1)

We may now calculate the expected proportion gain P_g of strings B from crosses by all other strings as follows:

P_g = (2)(1/2) Σ_{a=1}^{ℓ−1} Σ_{w=1}^{ℓ−a} [(o+1)/(ℓ−1)] [Σ_{ {j | s_j ∈ A[B,a]} } R_{s_j}^t] [Σ_{ {j | s_j ∈ Ω[B,w]} } R_{s_j}^t]

We multiply by 2 because there are two ways to pick Q and R. We divide by two because only half the products of the cross are B strings.

THE COMPLETE EQUATION

With both disruptions and gains now calculated, it is a straightforward matter to write the expected proportion of strings B in generation t+1 under both reproduction and crossover:

P_B^{t+1} = R_B^t [1 − p_c·p_d] + p_c Σ_{a=1}^{ℓ−1} Σ_{w=1}^{ℓ−a} [(o+1)/(ℓ−1)] [Σ_{ {j | s_j ∈ A[B,a]} } R_{s_j}^t] [Σ_{ {j | s_j ∈ Ω[B,w]} } R_{s_j}^t]

EXTENSION TO SCHEMATA

So far we have confined our analysis to individual strings.
We often want to examine the expected propagation of a set of competing schemata over some specified bit positions. To do this in our current scheme requires only minor modifications to our equations. To interpret the extended equations in schemata, we first index the sums over the o fixed positions (where o is the schema order or the number of fixed positions) instead of the ℓ positions in the string (we do use ℓ in the denominator of the disruption probability, however). Next, we introduce a function Δ(x,δ) to account for the possibly unequal spacing of the fixed positions. For example, if we are interested in the order 3 competing schemata defined by the template f***f*f (where the f indicates a fixed position, a 0 or a 1) at x=0 and δ=1, the defining length function may be calculated as Δ(0,1)=4. Since the x values only run through the consecutive fixed positions, at x=1 (the middle fixed position) we find Δ(1,1)=2. With such a function defined, we may rewrite our disruption probability for a schema H as follows:

p_d = Σ_{δ=1}^{o−1} Σ_{x=0}^{o−δ−1} [Δ(x,δ)/(ℓ−1)] Σ_{ {j | S_j ∈ M[H,δ,x]} } R_{S_j}^t

Here we interpret S_j as a member of the set of schemata sharing the same o fixed positions as H. Also, R_{S_j}^t is interpreted as the reproductive proportion of the schema S_j. Thus we are able to extend the computation of the probability of disruption to schemata using the same middle function M through careful indexing and the introduction of a function that tracks unequal defining bit spacing. Similar modification may be introduced in the expected gain proportion computation, P_g.

CONCLUSIONS

In this paper, we have derived an equation for the expected propagation of strings in a genetic algorithm under reproduction and crossover, and we have shown how such a derivation can be extended to schemata.
The computation is exact within the assumptions of the analysis (large populations, uniformly random mating, and randomly chosen cross sites). The equations may be used for the analysis of specific problems and codings, or they may be used to quantify disruption in populations of strings or schemata of known distribution. The equations should further our understanding of the detailed operation of genetic algorithms.

ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grant MSM-8451610.

REFERENCES

Bethke, A. D. (1981). Genetic algorithms as function optimizers. (Doctoral dissertation, University of Michigan). Dissertation Abstracts International, 41(9), 3503B. (University Microfilms No. 8106101)

Bledsoe, W. W. (1961, November). The use of biological concepts in the analytical study of systems. Paper presented at the ORSA-TIMS National Meeting, San Francisco, CA.

Box, G. E. P. (1957). Evolutionary operation: A method for increasing industrial productivity. Applied Statistics, 6(2), 81-101.

De Jong, K. A. (1975). An analysis of the behavior of a class of genetic adaptive systems. (Doctoral dissertation, University of Michigan). Dissertation Abstracts International, 36(10), 5140B. (University Microfilms No. 76-9381)

Fogel, L. J., Owens, A. J., & Walsh, M. J. (1966). Artificial intelligence through simulated evolution. New York: John Wiley.

Friedman, G. J. (1959). Digital simulation of an evolutionary process. General Systems Yearbook, 4, 171-184.

Goldberg, D. E. (1986). Simple genetic algorithms and the minimal deceptive problem (TCGA Report No. 86003). Tuscaloosa: University of Alabama, The Clearinghouse for Genetic Algorithms.

Holland, J. H. (1973). Genetic algorithms and the optimal allocations of trials. SIAM Journal on Computing, 2(2), 88-105.

Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor: The University of Michigan Press.
REDUCING BIAS AND INEFFICIENCY IN THE SELECTION ALGORITHM

James Edward Baker
Vanderbilt University

Abstract

Most implementations of Genetic Algorithms experience sampling bias and are unnecessarily inefficient. This paper reviews various sampling algorithms proposed in the literature and offers two new algorithms of reduced bias and increased efficiency. An empirical analysis of bias is then presented.

1. Introduction

Genetic Algorithms (GAs) cycle through four phases[4]: evaluation, selection, recombination and mutation. The selection phase determines the actual number of offspring each individual will receive based on its relative performance. The selection phase is composed of two parts: 1) determination of the individuals' expected values; and 2) conversion of the expected values to discrete numbers of offspring. An individual's expected value is a real number indicating the average number of offspring that individual should receive. Hence, an individual with an expected value of 1.5 should average 1 1/2 offspring. The algorithm used to convert the real expected values to integer numbers of offspring is called the sampling algorithm. The sampling algorithm must maintain a constant population size while attempting to provide accurate, consistent and efficient sampling. These three goals lead to the bias, spread and efficiency measures described below:

1) bias -- Bias is defined to be the absolute difference between an individual's actual sampling probability and his expected value[3]. To ensure proper credit assignment to the represented hyperplanes, the expected values should be sampled as accurately as possible. The optimal, zero bias is achieved whenever each individual's sampling probability equals his expected value.

2) spread -- Let f(i) be the actual number of offspring individual i receives in a given generation. We define the "spread" as the range of possible values for f(i).
Furthermore, we define the "Minimum Spread" as the smallest possible spread which theoretically permits zero bias. Hence, the Minimum Spread is one in which

f(i) ∈ {⌊ev(i)⌋, ⌈ev(i)⌉}

where ev(i) = expected value of individual i. Whereas the bias indicates accuracy, the spread indicates precision. Hence the spread reveals the sampling algorithm's consistency.

3) efficiency -- It is desirable for the sampling algorithm not to increase the GAs' overall time complexity. GAs' other phases are O(LN) or better, where L = length of an individual and N = population size.

All currently available sampling algorithms fail to provide both zero bias and Minimum Spread[3]. Yet an accurate sampling algorithm is crucial. All of the GAs' theoretical support presupposes the ability to implement the intended expected values. How much the existing inaccuracies have affected the GAs' performance is unknown. However, inaccuracies of sufficient magnitude to alter performance should be either understood or eliminated. This paper offers an empirical analysis of one of the most common sampling algorithms[5] -- Remainder Stochastic Sampling without Replacement -- and introduces two new sampling algorithms designed to reduce or eliminate these inaccuracies.

Section 2 reviews and compares previously proposed sampling algorithms. Section 3 introduces an improved sampling algorithm which can be partially executed in parallel. Section 4 introduces a sampling algorithm of zero bias, Minimum Spread and optimal O(N) time complexity. Section 5 presents an empirical analysis of these algorithms and section 6 provides a summary and conclusions.

2. Previous Work

The basic characteristics of various sampling algorithms[3] are presented in Table 1.
The first four algorithms are stochastic and involve basically the same technique:

-- Determine R, the sum of the competing expected values.
-- 1-1 map the individuals to contiguous segments of the real number line, [0..R), such that each individual's segment is equal in size to its competing expected value.
-- Generate a random number within [0..R).
-- Select the individual whose segment spans the random number.
-- Repeat the process until the desired number of samples is obtained.

This technique is commonly called the "spinning wheel" method. It is analogous to a gambler's spinning wheel with each wheel slice proportional in size to some individual's expected value. This technique is typically implemented in O(N²) time, but can be implemented in O(NlogN) by using a B-tree.

In "Stochastic Sampling with Replacement", the "spinning wheel" is composed of the original expected values and remains unchanged between "spins". This provides zero bias yet virtually unlimited spread -- any individual with expected value > 0 could be chosen to fill the entire next population. "Stochastic Sampling with partial Replacement" prevents this. In this algorithm, an individual's expected value is decreased by 1.0 each time he is selected. (If the expected value becomes negative, it is set to 0.) This modification provides an upper bound of ⌈expected value⌉ on the spread. However, this upper bound is achieved at the expense of bias. Furthermore, a reasonable lower bound is not provided.

Remainder sampling methods reduce the spread further. A remainder sampling method involves two distinct phases. In the integral phase, samples are awarded deterministically based on the integer portions of the expected values. This phase guarantees a lower bound of ⌊expected value⌋ on the spread. The fractional phase then samples according to the expected values' fractional portions.
In "Remainder Stochastic Sampling with Replacement", the fractional expected values are sampled by the "spinning wheel" method. The individuals' fractions remain unaltered between "spins", and hence continue to compete for selection. This sampling algorithm provides zero bias and the greatest lower bound on the spread, yet it provides virtually no upper bound. Any individual with an expected value fraction > 0 can obtain all samples selected during the fractional phase.

"Remainder Stochastic Sampling without Replacement" (RSSwoR) provides Minimum Spread. This remainder algorithm also uses the "spinning wheel" for the fractional phase. However, after each "spin", the selected individual's expected value is set to zero. Hence, individuals are prevented from having multiple selections during the fractional phase. Unfortunately, although this is a commonly used sampling algorithm, it is biased by favoring smaller fractions[2].

A "Deterministic Sampling" algorithm is suggested and used by Brindle[3]. In this remainder algorithm's fractional phase, the individuals with the largest fractions are selected. The result is a minimum sampling error for each generation, yet a high overall bias. Since overall accuracy in approximating the expected values is crucial to the applicability of the GAs' current theory, high bias is considered unacceptable. Hence, "Deterministic Sampling" is not widely used.

3. Remainder Stochastic Independent Sampling

In a spinning wheel algorithm, selection is based on the relative expected values of those competing. This relativity is unnecessary since the expected values themselves are population normalized. Furthermore, this relativity is causing the bias-spread tradeoff and the O(NlogN) time complexity. (If the selected individuals are not inhibited in future competitions, then there is little spread limitation. However, if they are inhibited, subsequent selection is biased.)
In "Remainder Stochastic Independent Sampling" (RSIS), the fractional phase is performed without use of the error-prone "spinning wheel". RSIS independently uses each fractional expected value as a probability of selection. This is accomplished by traversing the population and stochastically determining whether each individual should be selected:

"C" Code Fragment for RSIS

    /* Integral Phase */
    NumSelected = 0;
    for (i = 0; i < N; i++)
        for (; ExpVal[i] >= 1.0; ExpVal[i]--) {
            SelectInd(i);
            NumSelected++;
        }

    /* Fractional Phase */
    for (i = 0; NumSelected < N; i = (i + 1) % N)   /* wraps for another traversal */
        if (ExpVal[i] > Rand()) {
            SelectInd(i);
            ExpVal[i] = 0;
            NumSelected++;
        }

Where Rand() returns a random real number uniformly distributed within the range [0..1). The code for RSIS is noticeably less complex than typical implementations of sampling and much less than O(NlogN) algorithms. Although the fractional phase is potentially an infinite loop, probabilistically it completes after one traversal of the population and is only O(N). Empirical studies indicate that it rarely proceeds beyond the second traversal -- averaging only 0.017% over a variety of expected value distributions. Hence a modified algorithm which randomly selected the remaining sample(s) after two traversals would be heuristically appropriate.

An individual can not be selected more than once during the fractional phase, since the selected individual's expected value is set to zero. Hence this remainder sampling algorithm has Minimum Spread. However, some bias is present in RSIS. This bias occurs if a second population traversal is necessary and is similar to that found in latter stages of a spinning wheel algorithm without replacement. However, the overall bias of RSIS will be much less since, typically, most of the samples are obtained in the zero biased, first traversal. Note that "positional bias" may occur if the individuals are ordered. However, in a remainder sampling algorithm, the population is shuffled to prevent excessive cloning.
Hence, this potential bias does not exist.

The selection phase of GAs requires some sequential processing, both in obtaining and sampling the expected values. However, the GAs' evaluation, recombination and mutation phases can be executed in full parallel. Hence, increases in the parallel nature of the selection phase may eventually prove useful. RSIS is the only acceptable sampling algorithm which can be partially executed in parallel. The fractional phase can be performed in full parallel but must precede the O(N) sequential, integral phase.

"C" Code Fragment for Partially Parallel RSIS

    /* sample fractions in full parallel */
    /* copy population and mark selected individuals */
    /* code of jth processor */
    parbegin
        NextPop.Ind[j] = CurrPop.Ind[j];
        NextPop.flag[j] = (ExpValFract[j] > Rand());
    parend;

    /* sample integers sequentially */
    /* overwrite unmarked individuals first */
    k = 0;       /* pointer to NextPop */
    set = 1;     /* availability flag */
    for (i = 0; i < N; i++)
        for (; ExpVal[i] >= 1.0; ExpVal[i]--) {
            while (NextPop.flag[k] == set)
                if (++k == N) k = set = 0;
            NextPop.Ind[k++] = CurrPop.Ind[i];
        }

In the parallel fractional phase, the current population is copied into the next population and the selected individuals are marked with a 1. This mark indicates the positions' unavailability. During the integral phase, the unmarked positions will be filled first. Two potential problem conditions exist: the fractional phase may select too few or too many individuals. If too few are selected, then some unmarked individuals will not be overwritten during the integral phase and hence will be considered selected. This is equivalent to a standard RSIS implementation employing random selection on the second traversal. (If this proves undesirable, a sequential, stochastic, second traversal could easily be performed.) If fractional selection chooses too many individuals, then integral selection will require additional positions. However, ⌊expected value⌋ must be guaranteed to maintain Minimum Spread.
Thus, some individuals selected during the fractional phase must be replaced. This replacement does not alter the performance characteristics of RSIS, since sequential RSIS would not have chosen these extra individuals in the first place. In either case, Minimum Spread is maintained and the low bias of RSIS remains unaffected.

4. Stochastic Universal Sampling

"Stochastic Universal Sampling" (SUS) is a simple, single phase, O(N) sampling algorithm. It is zero biased, has Minimum Spread and will achieve all N samples in a single traversal. However, the algorithm is strictly sequential.

"C" Code Fragment for SUS

    ptr = Rand();    /* returns random number in [0..1) */
    for (sum = i = 0; i < N; i++)
        for (sum += ExpVal[i]; sum > ptr; ptr++)
            SelectInd(i);

On a standard spinning wheel, there is a single pointer which indicates the "winner". SUS is analogous to a spinning wheel with N equally spaced pointers. Hence, a single spin results in N "winners". Since the sum of the population's expected values = N, the pointers are exactly 1.0 apart. Thus an individual is guaranteed ⌊expected value⌋ in samples and no more than ⌈expected value⌉. Hence, SUS has Minimum Spread. Furthermore, in a randomly ordered population, an individual's selection probability is based solely on the initial spin and the magnitude of his expected value. Hence, SUS has zero bias. Like remainder sampling algorithms, the population must be shuffled before crossover. Therefore, positional bias can not exist in SUS, either. (Note that, in general, any number of samples, n, can be obtained by letting ptr range within [0..N/n) and incrementing it by N/n after each selection.) In a sequential environment, SUS seems to be the optimal sampling algorithm. In the next section, an empirical analysis of RSIS and SUS is presented.

5. Empirical Analysis

This analysis investigates the severity, direction and progression of bias in RSSwoR, sequential RSIS and SUS. The severity is important from an implementation standpoint.
Although bias can be proven theoretically [2], it is of less interest if it does not alter performance. The direction of the bias indicates which individuals are favored, and the progression indicates when the bias occurs. Since we are concerned only with the sampling algorithm, a full execution of GAs is not necessary. Rather, we simply maintain a population of expected values. This allows a single expected value distribution to be sampled repeatedly. We use linear distributions specified by the parameter MAX, with 1.0 < MAX < 2.0.

Population A_r recognizes a pattern string x ∈ A_p if and only if there exists a non-empty S ⊆ A_r such that

    ∀ x' ∈ S:  l − Σ_{i=0}^{l−1} (((x ⊕ x') >> i) & 1) ≥ T,

where ⊕ is the XOR operator, >> is the shift-right operator, & is the bitwise AND operator, and T is an integer threshold. Thus, a population A_r of recognizer strings recognizes a pattern string when there is at least one recognizer string that has the same bit values in at least T bit positions. T in the immune system has been experimentally found to be between six and eight amino acids, but in this conceptual model T has been set to five bits, though this can easily be changed. Also, in the real immune system the antibodies form three-dimensional configurations, so that any T or more positions can try to match the antigen. In the current model, this translates to letting any bit position value of the recognizer string match any other bit position value of the pattern string within the limits of the configuration possibilities. Since the configurations of real antibodies are not known yet, this idea has been abstracted to simply allowing any T bit position values to match with the values in the same bit positions in the pattern string. Also, two bit values match if they are not only in the same position in the binary string, but also if they are the same values. In the real immune system, matches occur between complementary values rather than identical values. In this simple model, this does not make a difference.
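The bitwise matching rule can be sketched in C directly with the operators listed in the text; the function names (matches, recognizes) are ours, and strings are packed into unsigned longs, so l must be at most the word size:

```c
/* Count the bit positions (0..l-1) where pattern x and recognizer xp
 * carry the same value: l minus the number of 1s in x ^ xp. */
int matches(unsigned long x, unsigned long xp, int l) {
    int diff = 0;
    for (int i = 0; i < l; i++)
        diff += (int)(((x ^ xp) >> i) & 1UL);   /* 1 where bits differ */
    return l - diff;
}

/* Recognition: at least T of the l bit positions agree. */
int recognizes(unsigned long x, unsigned long xp, int l, int T) {
    return matches(x, xp, l) >= T;
}
```

With l = 5, the strings 01101 and 11100 agree in three positions, so recognition holds for T = 3 but not for T = 4.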
However, if the model were to incorporate the fact that antibodies can match other antibodies as well (where self-recognition and loop creation are important), then complementary matching would have to be the rule. However, in the current model, antibodies as recognizers evolve as they try to match the antigens or patterns, regardless of what other recognizers the recognizer string may match. Another way of looking at the pattern recognition problem is in terms of schemata. Schemata represent subsets of binary strings by picking out certain bit positions that must have a certain bit value, where the rest of the bit positions can have either value [8]. For instance, the string 011 is contained in schemata 0##, #1#, ##1, 01#, 0#1, #11, 011, and ###, where the #-sign indicates that for that position the schema will accept any value. In this model, the defining bits of a schema would represent those bit values that match between a recognizer string and a pattern string. The theory of how schemata are manipulated under the genetic algorithm can be used to understand how the recognizer strings are evolving. Since each binary string is a member of 2^l schemata, between 2^l and 3^l schemata are implicitly tested and recombined by the genetic algorithm each generation or time step as the strings get tested and recombined with respect to their fitnesses. The strings reproduce in proportion to their fitness. Schemata can also be thought of as having fitnesses based on the average of the fitnesses of the strings that are members of that schema. Then, as Holland has shown [8], the schemata also reproduce in the population in proportion to their fitnesses. Each schema is implicitly manipulated in the strings that are members of that schema, where the strings are explicitly manipulated by the genetic operators.
What this means for the current pattern recognition problem is that if the genetic algorithm finds one schema in the recognizer population with at least T defining bits that match with a pattern, then the recognizer population recognizes that pattern string. Also, as the genetic algorithm manipulates schemata at an exponential rate to find useful ones that match patterns in the pattern population, it also finds recognizer strings at an exponential rate to match and recognize the patterns. The covering problem can simply be stated as follows: Population A_r covers population A_p if and only if ∀ x ∈ A_p, ∃ a non-empty S ⊆ A_r such that ∀ x' ∈ S, x' recognizes x. That is, every pattern string in the pattern population should be recognized by at least one recognizer string. The runs of this model should at least be able to show that the recognizer population can evolve under the genetic algorithm to recognize a pattern string as well as cover the pattern population. Once parameter settings and reproduction strategies have been found that give this result, the schemata in the evolving recognizer population can be studied to determine how the recognizer population as a whole organizes information about the patterns as it discovers recognizers to recognize the patterns.

3 The Model

The parameter values for the genetic algorithm are set as follows. The length of the strings is l = 36 bits. To determine the size of the recognizer population, the length of the schemata one expects to find must be taken into account.
Based on recognition in the immune system, the model should be able to let schemata with five or six defining bits develop in the recognizer population. To see what schemata we can expect to find in the recognizer population, we can use a corollary of John Holland's Theorem 6.2.3 [8] that

    ε ≤ 2n / l,

where ε is the error due to crossover (which may break an instance of a schema), n is the number of defining bits in the schemata one is interested in, and l is the string length. Since l = 36, the error for a schema of length 1 is ≤ 0.056, of length 2 is ≤ 0.12, 3 is ≤ 0.17, 4 is ≤ 0.23, 5 is ≤ 0.28, 6 is ≤ 0.34, and so on. In order for the recognizer population to be able to search for schemata with n = 6 defining bits, of which there are 2^n = 2^6 = 64, the recognizer population size should be about 64 as an upper bound. Schemata with more defining bits might still be found. The recognizer population size has been set to recognizerpopulationsize = 50, which is still much greater than 2^5 = 32 and close to 2^6 = 64, where the error rate is about 0.3, about as high as the genetic algorithm can tolerate. The initial strings in the recognizer population are created at random. The pattern population size has been chosen to be patternpopulationsize = 1 for one set of runs, and patternpopulationsize = 10 for another set of runs, to see if the recognizer population can recognize just one string as well as cover several pattern strings. The population is initialized with random strings that do not change over time. The point mutation rate has been set as low background noise, pointmutationrate = 2 point mutations per time step, where the mutated bits are chosen at random. The crossover rate has been set to crossoverrate = (1/3) · recognizerpopulationsize, so that in each generation 1/3 of the recognizer strings are chosen as parents, thus producing that many new strings, and replacing that many old strings.
The parents are chosen probabilistically with respect to fitness, so that the algorithm results in reproduction with emphasis (on fitness) [8]. Schemata with above average fitness have new copies or samples of them generated in the recognizer population in accordance with the theorem that guarantees minimal performance loss. Several fitness functions have been considered. With fitness equal to the average number of matching bits of the recognizer string to every pattern string, fitness generally does not increase monotonically nor exponentially, as claimed should happen with crossover [8]. To encourage the recognizer strings to evolve with as many consecutive bits matching a pattern string as possible, so that it is less likely that the schema of matching bits will be broken by crossover, fitness has been chosen to be

    f(x) = (1/msample) · Σ_{j=1}^{msample} Σ_{i=1}^{#matches_j} m_j(i) / ((1/2)^{m_j(i)} · (l − m_j(i) + 1)),

where x is a recognizer string, msample is matchsample, m_j(i) is the number of consecutive bit matches between the ith set of consecutive bit matches in string x and the jth pattern string in sample s_x, #matches_j is the number of consecutive bit match sequences between string x and the jth pattern string, and l is the length of the strings. For example, if x = 01101 and s_x = {11100, 00101}, then m_1(1) = 3, m_2(1) = 1, m_2(2) = 3, and

    f(x) = (1/2) · (3/((1/8)·3) + 1/((1/2)·5) + 3/((1/8)·3)) = (1/2) · (8 + 2/5 + 8) = 8.2.

The matchsample is the size of the sample s_x of strings chosen from the pattern population. The sample s_x, rather than all of the strings in the pattern population, is used to compute fitness. Using this sample helps subpopulations emerge in the recognizer population, as in Booker's simulation [1]. Subpopulations are necessary if all of the patterns are to be recognized by the recognizer population.
The combinatorial bias of the probability of finding m matching bits between two strings, (1/2)^m · (l − m + 1), has been removed by dividing the number of matches by the bias, to get the unbiased fitness that would make all lengths of matches equally likely to occur when based on fitness. The probability of finding a match between two bits is 1/2. For m bits, it is (1/2)^m. Since these bits are consecutive, there are (l − m + 1) ways that they can be chosen in a string of length l. Therefore, a sequence of m consecutive matching bits occurs with a probability of (1/2)^m · (l − m + 1) in a string of length l. Dividing by this probability is equivalent to putting more emphasis on longer schemata, since the dominating term of 2^{m_j(i)} increases exponentially with the length of the substring, m_j(i). Several replacement strategies have been tried for determining which recognizer strings get replaced by the new offspring strings created by crossover. Choosing a string probabilistically with respect to 1/fitness eventually gives rise to a homogeneous population consisting of one highly fit string. Choosing the string that most closely matches the new string, a "crowding" strategy, never allows the fitness to go up, since the child too often replaces its parent and thus schemata do not increase in proportion to their fitness as they should. A similar strategy, first tried by DeJong [2] and later by Booker [1], where the closest matching string is chosen from a small random sample of the recognizer population, does not do much better than the first strategy. A strategy which works well is one which chooses a small sample of recognizer strings, of size crowdsample, probabilistically with respect to 1/fitness. Then from this sample, the string which most closely matches the new string is chosen to be replaced. With this strategy, the recognizer population fitness increases and does not yield a homogeneous recognizer population. The crowdsample is determined by how many copies of a schema one wants.
A sample of size n would on average include recognizer strings in all schemata that are present in 1/n of the recognizer population. Thus, n subpopulations are allowed to coexist in the recognizer population, since no subpopulation can contain more than 1/n of the recognizer strings. Since the replacement strategy uses fitness to determine which strings to replace, it is consistent with the reproduction with emphasis plan. That is, schemata with below average fitness will have copies or samples removed, so that their sample size decreases and makes room for an increase in the number of above average schemata. Thus, this strategy generally replaces the string with the lowest fitness which is in the same subpopulation as the new string. The algorithm to find schemata actually finds a small sample of all schemata present in a string population. The sampling is designed to find the more useful schemata (building blocks) that evolve under the genetic algorithm, rather than all of the schemata present in a population. The algorithm to find these schemata is:

1. Choose a sample, s1, of the population in which to find schemata (e.g. the n most fit strings).

2. Find the schemata in each string in sample s1. For each string in sample s1:

   (a) Get a random sample, s2, from the population.
   (b) Check the string against each string from s2 to find which bits match.
   (c) Record schemata which appear between the string and each string in s2 with matching bits no more than two bits apart (otherwise you have a separate schema).
   (d) Do not store duplicates.
   (e) Store the schema frequency, sf, by running through all the strings in s2 again.

3. Check to see what schemata are present randomly. For each string in sample s1:

   (a) Use the same random sample s2 as in 2(a).
   (b) Count the number of matches at each bit between the string and every string in s2.
   (c) If sf < (1/#definedbits) · Σ_{i ∈ defined bits} #matches_i / |s2|, then remove the schema from the list, where sf is the schema frequency and #matches_i is the number of matches found in 2(b) for bit i defined in the schema.

For example, let s1 = {0110100011} and s2 = {1110011101, 0101010101}. For step (2), s1_1 against s2_1 gives #110#####1, with schemata #110###### of frequency 1/2 and #########1 of frequency 2/2, and s1_1 against s2_2 gives 01####0##1, with schemata 01######## of frequency 1/2 and ######0##1 of frequency 1/2. For step (3), the number of matches at each bit is (1)(2)(1)(1)(0)(0)(1)(0)(0)(2). Then for schema #110######, 1/2 < 2/3, so remove this schema; for #########1, 1 is not less than 1, so keep this schema; for 01########, 1/2 < 3/4, so remove this schema; and for ######0##1, 1/2 < 3/4, so remove this schema.

For the schemata in the pattern population, both samples are the entire pattern population. That is, all useful pattern schemata that can be located by the algorithm are found and counted. For the recognizer population, the (1/2 · recognizerpopulationsize) = n most fit strings are chosen for the first sample, s1, and the same size sample is chosen for the second random sample, s2. So recognizer schemata occurring in 1/2 of the recognizer population are located this way in every time step. Rather than being explicitly represented anywhere in the model, the schemata so located are printed out at every time step so that they can be analyzed. This model for pattern recognition is very similar to, though implemented differently from, Booker's genetic algorithm simulation [1]. In that simulation, the classifier system learns useful categories or taxa (the conditions of the classifiers) for one or more categories in the environment, where the categories are generating input messages to the system. The taxa are equivalent to schemata, though there are fewer of them in a classifier string than in the recognizer binary bit string.
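Step 2 of the schema-finding algorithm (recording schemata whose matching bits are no more than two bits apart) can be sketched in C on the example strings; the function name and the fixed-width output buffer are our own choices:

```c
#include <string.h>

/* For two '0'/'1' strings x and y of length l (< 63), write each schema
 * of shared bit values into out[], one '#'-padded string per schema.
 * Matching positions more than two non-matching bits apart start a new
 * schema.  Returns the number of schemata found. */
int find_schemata(const char *x, const char *y, int l,
                  char out[][64], int max_out) {
    int count = 0, i = 0;
    while (i < l && count < max_out) {
        if (x[i] != y[i]) { i++; continue; }
        memset(out[count], '#', l);     /* start a new all-# template */
        out[count][l] = '\0';
        int last = i;                   /* last matching position kept */
        for (int j = i; j < l; j++) {
            if (x[j] == y[j] && j - last <= 3) {
                out[count][j] = x[j];   /* keep the shared bit value */
                last = j;
            } else if (j - last > 3) {
                break;                  /* gap of more than two bits */
            }
        }
        count++;
        i = last + 1;                   /* resume after this schema */
    }
    return count;
}
```

On the example strings, 0110100011 against 1110011101 yields the two schemata #110###### and #########1.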
The binary bit string messages are equivalent to the binary bit pattern strings that the classifiers and recognizers, respectively, are trying to match. The matchscore that he uses, called M_s, is defined for a taxon x and message (binary bit string) x' as

    M_s(x, x') = 1, if x and x' match identically;
    M_s(x, x') = (l − n)/(l + 1), otherwise,

where n is the number of mismatched bit values and l is the length of the strings. This function is linear with slope −1/(l + 1), except for the case of all defining bits matching, where M_s returns the value 1 rather than the value given by the linear form. This jump when the optimal solution is reached ensures that a good taxon is never lost from the population. On the other hand, the fitness function in the model just presented is an exponential curve which increases the fitness exponentially with each additional matching bit, rather than just a big jump when the perfect match is found. The fitness function encourages bit by bit additions to matching schemata, whereas the matchscore encourages this but really emphasizes the final perfect match. The fitness function is also more apt to keep schemata of all lengths in the population in case the pattern strings in the environment change, so that the recognizer population can quickly build a matching recognizer from a few schemata. Matchscore just tries to match the categories in the current environment. In the pattern recognition model, there are no mating restrictions, unlike Booker's simulation where two strings can mate only if they match the same message. Mating in the new model is probabilistic with respect to fitness, which is computed with respect to a sample of pattern strings. Thus, higher fitness implies that the recognizer string matches more pattern strings out of its sample, which implies that there is a good chance that two mating strings both match at least one same pattern string, like a probabilistic mating restriction. This chance is equal to 1 − C(a − m, m)/C(a, m), where a is the pattern population size and m is matchsample.
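Reading the chance just described as 1 − C(a − m, m)/C(a, m), i.e. the probability that two independently drawn matchsample-sized subsets of the a patterns share at least one pattern, a quick numerical check can be sketched in C (helper names are ours):

```c
/* C(n, k) computed in a double by multiplying then dividing, which
 * keeps every intermediate value an exact integer. */
static double choose(int n, int k) {
    double r = 1.0;
    for (int i = 1; i <= k; i++) {
        r *= (double)(n - k + i);
        r /= (double)i;
    }
    return r;
}

/* Chance that two random m-subsets of a patterns share an element. */
double overlap_chance(int a, int m) {
    return 1.0 - choose(a - m, m) / choose(a, m);
}
```

For a = 10 and m = 3 this evaluates to 1 − 35/120 = 17/24 ≈ 0.71.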
For instance, in this model the chance would be 17/24 ≈ 0.71, where a = 10 and m = 3. Thus, the recognizer population can find the near-optimal points representing the pattern strings and still incorporate new strings with new schemata for each time step. Crowding is also implemented similarly to Booker's simulation. Here 1/fitness, the equivalent of Booker's 1/strength, is used to pick out a sample of strings to delete, based on which string best matches the new string. This is similar to Booker's crowding scheme, where taxation and a tax rebate lower the strength of strings in a subpopulation that has reached its "carrying capacity". Then the strings with lower strength are more likely to get removed, just as strings in the new model with lower fitness and higher 1/fitness are more likely to get removed. Since fitness is noisy, there are recognizer strings in each subpopulation with lower fitness that will probably get removed; if the subpopulation is at its carrying capacity, then these strings with lower fitness will appear in the sample and get chosen to be replaced. Also, in the current model, using a sample size equal to the number of desired subpopulations is simpler to implement than Booker's scheme, where the system keeps track of which taxa are relevant to what message, unless the relevance criterion is such that it can be calculated for each subpopulation rather than for every individual taxon. Thus, the recognizer population in the pattern recognition model, just like the classifiers in Booker's model, can also find the near-optimal points representing the pattern strings and still incorporate new recognizer strings with new schemata for each time step.

4 Results

Preliminary runs have found parameter settings with which the recognizer population increases its fitness (equal to the sum of the fitnesses of each recognizer string), as shown in the following log10(fitness) vs. time plots. Increasing fitness implies that the recognizer population is matching one or more pattern strings at more bits.
To test the model's ability to solve the recognition problem, fifty recognizer strings have been run under the genetic algorithm to get them to match one pattern string. The population fitness does increase exponentially (Figure 1) until it reaches the maximum population fitness possible, when about ninety percent of the recognizer strings match the pattern string at every bit position. The other ten percent match at almost every bit. To test the model's ability to solve the covering problem, a run where fifty recognizer strings try to recognize ten pattern strings has been tried. The fitness of the recognizer population does increase fairly monotonically (Figure 2). The listing of pattern schemata found in the final recognizer population shows that all pattern schemata with five or fewer defining bits have also been found in the final recognizer population. Also, upon closer inspection of the recognizer population, a recognizer string can be found for each pattern string such that the recognizer string matches the pattern string at no fewer than five bit positions. Thus, the recognizer population recognizes and covers all of the pattern strings. With what sample sizes does this genetic algorithm solve the recognition and covering problems? How does the recognizer population evolve with these parameter settings? It turns out to be critical that the value for matchsample be less than the size of the pattern population. That is, fitness for each string must be computed with respect to only a few patterns, and with respect to different patterns each time step for a given recognizer string. If all patterns are used to determine the fitness of a recognizer string, then the entire recognizer population ends up matching only one pattern as it reaches the maximum possible value for fitness. There are several reasons why matchsample should be just a portion of the pattern population size.
First of all, the computations to determine fitness are done faster if only a few patterns are used for the calculation. Secondly, with matchsample equal to the pattern population size, each recognizer uses the same sample of patterns to compute its fitness, namely all of the patterns. Thus, the first pattern with a longer schema found in the recognizer population will be the one matched by all of the recognizers over time. That schema will increase the fitness of the recognizer containing it exponentially with respect to the fitnesses of the other recognizers. That recognizer's descendants will take over the entire recognizer population to match that one pattern. Thus, in order to cover all of the patterns, a different sample of patterns must be used to calculate fitness for each recognizer string.

Figure 1. log10(fitness) vs. time; patternpopulationsize = matchsample = crowdsample = 1.

Figure 2. log10(fitness) vs. time; patternpopulationsize = crowdsample = 10, matchsample = 3.

Thirdly, and most importantly, sampling is the counterpart in this model of collision dynamics. One recognizer can try to recognize only a few patterns at any one time in real physical space. Modelling this accurately models the fact that this system is far from equilibrium most of the time, where changes in the system are most likely to increase rather than decrease order in the system. That is, new schemata introduced into the recognizer population will get their numbers increased with respect to their fitness and give rise to new recognizer strings as they get distributed throughout the recognizer population. Otherwise, with non-noisy fitness, these new schemata would be ignored and eventually be removed from the recognizer population, as one recognizer string containing one longer schema was the only one with surviving descendant strings. The other sample size for which the value turns out to be critical is crowdsample.
With crowdsample = 10 (equal to patternpopulationsize in this case), a schema will be replaced if it is in 1/10 of the recognizer population. This allows more room for other schemata to occur and distribute in the rest of the recognizer population. If crowdsample is less than that, for example if it is set to 3, then the fitness of the recognizer population increases but only about three of the patterns are recognized. With crowdsample = 10, as in the second run described, it was found that the recognizer population maintains a high number of schemata at all times, keeping the number of recognizer schemata around 100 ± 25 each time step. By maintaining such diversity in the recognizer population, all pattern schemata of 5 or fewer defining bits are found by the genetic algorithm, along with a few schemata with even more defining bits. This is also due to the recognizerpopulationsize parameter. Theory predicts that when there are 2^n schemata of length n, a population of size 2^5 = 32 or greater should be able to search these. A population of size 50, where 2^5 < 50 < 2^6, does this, plus finds a few other longer schemata. One of these longer schemata, not in the initial recognizer population, has been created by putting together two shorter schemata that increase in frequency enough to form the longer schema, which then increases in frequency. The shorter schemata decrease in frequency since they have been incorporated into the longer schema. Thus, we see the beginning of a default hierarchy forming [7][9]. First, shorter, more general schemata are tested against the environment, and then the useful ones are used to build longer, more specific schemata that predict the environment (or in this case cover the strings in the pattern population even more accurately). With a larger population, more of the longer schemata should emerge.

5 Conclusion

The recognition and covering constraints are the minimal ones that the current model can satisfy.
It does this by taking two shorter schemata and applying crossover to the strings of those schemata to form a new longer schema. This longer schema is a combination of the two shorter ones and is in a string that matches a pattern at more bits. Thus, a default hierarchy is emerging in the recognizer population. Mutation also can create such longer schemata and fitter strings, but only one more bit will match a pattern. The one additional matching bit does not increase the fitness of the string, nor the number of recognizer strings that match pattern strings, as much as recombination does. However, these results are based on preliminary runs and need to be quantified more precisely before more specific conclusions can be made. There are still many questions to study about this pattern recognition model.

1. How is the default hierarchy of schemata represented in the recognizer string population? Are there recognizer strings that contain several shorter pattern schemata and some recognizer strings with more specific schemata? In terms of knowledge representation [6], is the representation fine-grained with longer, more specific schemata or coarse-grained with more general schemata? Is the entire emerging default hierarchy maintained in the recognizer population?

2. Do distinct subpopulations evolve in the recognizer population with respect to the patterns or other criteria such as regularities in the pattern strings? Does each recognizer string contain schemata for one pattern or for more than one pattern? If the patternpopulationsize is increased, will the system no longer be able to pick out all of the regularities, which may no longer come from one pattern but may include several patterns? How will this depend on the recognizerpopulationsize?

3. How do the schemata in the recognizer population change when the initial pattern population is changed? And if one of the parameter values is changed? What if the fitness function is changed to give maximum fitness value when T bits match?

4. Are matchsample and crowdsample, which maintain diversity in the recognizer population, also resulting in suboptimal fitnesses for the recognizers that cover the patterns? That is, does it become impossible for perfect matches to evolve for each pattern? Is this somehow an adaptive advantage even if it is not a performance advantage? How can their values be set by some feedback mechanism, based on regularities found and the resulting fitness, so that the recognizer population can continue to recognize and cover a pattern population that is changing its size and strings over time?

5. Is the final recognizer population robust to changes in the pattern population? Will the recognizer population fitness decrease quickly and by a large amount if an original pattern is replaced with a new and different pattern string? Will the entire recognizer population have to re-evolve to redistribute the pattern schemata it has, lose the ones no longer in the pattern population, and acquire new ones?

Various tools still need to be built to analyze the evolution of schemata and the default hierarchy in the recognizer population. For instance, since the schema finding algorithm only finds a sample of the schemata present in the pattern population, a reasonably-sized phylogenetic tree can be plotted over time, based on what pattern schemata in the recognizer strings give rise to which new and possibly longer pattern schemata in new recognizer strings. The frequency of each pattern schema in the recognizer population (equal to the number of recognizer strings with that pattern schema) could be plotted over time and used with the phylogenetic tree to trace the development of the default hierarchy. The frequencies of the schemata can also be studied with analytical tools developed for understanding dynamical systems.
Every schema changes its frequency (equal to the number of strings in the population belonging to that schema) in the population with respect to the probability

    P_s(t + 1) = M_s(t + 1)/M = f_s(t) · ε_s · M_s(t)/M,

where schema s has an average expected fitness f_s(t) at time t, M_s(t) is the number of strings in the recognizer population that are elements of schema s, M is the total number of strings in the fixed sized recognizer population, and ε_s is an error term created by having a schema instance get broken up by a crossover point or mutation. The size of the recognizer population and the sample sizes also affect the number of schemata and the number of their instances that are possible to have in the recognizer population. Thus, the schemata dynamics can be studied under different parameter settings for the population and sample sizes, and the effects of fluctuations occurring when new schemata are introduced into the recognizer population by crossover, mutation, or changes in the pattern population can also be studied. Once the structure and evolution of the default hierarchy is better understood, this pattern recognition model can be used to study the structural evolution of memory in other physical and biological systems that do pattern recognition. The features of the particular system can be assigned to the bits of the pattern and recognizer strings. Then one can study how the recognizer strings of that system's memory are structured with respect to the regularities in the patterns the system is trying to recognize. For instance, in modeling the immune system, experimental evidence can be used to set up the fitness function and parameter values to be more realistic with respect to real immune systems, and to see if the model results in antibodies with characteristics similar to those found in the experiments. Some experimental studies have shown that one antibody can recognize more than one antigen (called "multi-specificity" in immunology) [13].
Such an antibody is a generalist. Other evidence shows that when new antigens are introduced, new antibodies are created by an increased rate of point mutations. The affinity of these new antibodies for new antigens increases ten-fold over a short period of time, indicating that the antibodies become very specific at recognizing the antigens [14]. Thus, some sort of default hierarchy is developing in the immune system. What is the nature of this default hierarchy if the fitness function and parameters are set to correspond to actual numbers used and seen in these experiments? What antibodies result if the initial antibodies are ones that have evolved in this pattern recognition model and can only undergo mutation in the future when presented with a new antigen? How might a network of such antibodies compensate for any shortcomings of the antibody population which can no longer undergo crossover? Also, in a letter recognition system, letter patterns can be represented with line features of different lengths, angles, and positions. Could this model then generate a default hierarchy to recognize the letter or letters? How could it recognize the same letters if they are written in a different style? In cognitive science, a concept is represented by a prototype developed from a set of instances. The nature of the prototype with respect to the properties of the instances is not yet clearly understood, but the instances do not necessarily all share some set of property values which might then be the prototype. Rather, each instance shares one or more property values with one or more other instances in the set [11]. Translated into the pattern recognition model, the recognizer population can try to match the instances in the environment, where the prototype would appear as the more general schema that persists over time and is not just used to build one specific schema.
On the other hand, the prototype might turn out to be some piece of the default hierarchy distributed over several recognizer strings. How would this correspond to any experimental data that has been collected by psychologists on the nature of prototypes? What kinds of prototypes are actually communicated in these experiments (i.e. transmitted verbally to the experimenter)?

By understanding the organization of the default hierarchy in the recognizer strings, the memories of other pattern recognition systems evolving under a genetic algorithm can be analyzed as well. A system that tries to recognize patterns in the environment and then builds an internal model or memory of those patterns must use some definition of similarity to cluster the patterns into classes. If this system also transmits its internal model by a code representing how to reconstruct that model, then this system is a good candidate to be cast in and analyzed by this pattern recognition model.

6 Acknowledgements

This work has been performed under the auspices of the U.S. Department of Energy. I would like to thank John H. Holland for his guidance and ideas about genetic algorithms and modelling, Chris Langton for his comments on modelling artificial life, and Alan Perelson, Doyne Farmer, and Norman Packard for their valuable insights into the structure and dynamics of biological and physical systems.

References

[1] Booker, L.B. 1985, "Improving the Performance of Genetic Algorithms in Classifier Systems". Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, Pennsylvania : Carnegie-Mellon University.

[2] DeJong, K.A. 1975, "Analysis of the Behavior of a Class of Genetic Adaptive Systems". PhD Dissertation, Ann Arbor : The University of Michigan.

[3] Farmer, J.D., Kauffman, S.A., and Packard, N.H. 1987, "Autocatalytic Replication of Polymers". Physica 22D, pp. 50-67.
[4] Farmer, J.D., Kauffman, S.A., Packard, N.H., and Perelson, A.S. 1987, "Adaptive Dynamic Networks as Models for the Immune System and Autocatalytic Sets". Annals New York Academy of Sciences.

[5] Farmer, J.D., Packard, N.H., and Perelson, A.S. 1986, "The Immune System, Adaptation, and Machine Learning". Physica 22D, pp. 187-204.

[6] Feldman, J.A. 1986, "Neural Representation of Conceptual Knowledge". TR189, Dept. of Computer Science, Rochester, NY : The University of Rochester.

[7] Goldberg, D.E. 1983, "Computer-aided Gas Pipeline Operation Using Genetic Algorithms and Rule Learning". PhD Dissertation, Ann Arbor : The University of Michigan.

[8] Holland, J.H. 1975, Adaptation in Natural and Artificial Systems. Ann Arbor, Michigan : The University of Michigan Press.

[9] Holland, J.H. 1985, "Escaping Brittleness: The Possibilities of General Purpose Learning Algorithms Applied to Parallel Rule-Based Systems". Machine Learning II, Ch. 20, Los Altos, CA : Morgan Kaufmann.

[10] Jerne, N.K. 1974, "Toward a Network Theory of the Immune System". Annals of Immunology (Inst. Pasteur) 125C : 373.

[11] Kaplan, S. 1986, Neural Models, Course Notes, Ann Arbor : The University of Michigan.

[12] Langton, C.G. 1984, "Self-Reproduction in Cellular Automata". Physica 10D, pp. 135-144.

[13] Rosenstein, R.W., Musson, R.A., Armstrong, M.Y.K., Konigsberg, W.H., and Richards, F.F. 1972, "Contact Regions for Dinitrophenyl and Menadione Haptens in an Immunoglobulin Binding More Than One Antigen". Proceedings of the National Academy of Science, USA, Vol. 69, No. 4, pp. 877-881.

[14] Wysocki, L.J., Manser, T., and Gefter, M.L. 1986, "Somatic Evolution of Variable Region Structures During an Immune Response". Proceedings of the National Academy of Science, USA, Vol. 83, pp. 1847-1851.
An Adaptive Crossover Distribution Mechanism for Genetic Algorithms

J. David Schaffer and Amy Morishima
Philips Laboratories, North American Philips Corporation
Briarcliff Manor, New York

ABSTRACT

This paper presents a new version of a class of search procedures usually called genetic algorithms. Our new version implements a modified string representation that includes special punctuation used by the crossover recombination operator. The idea behind this scheme was abstracted from the mechanics of natural genetics and seems to yield a search procedure wherein the action of the recombination operator can be made to adapt to the search space in parallel with the adaptation of the string contents. In addition, this adaptation happens "for free" in that no additional operations beyond those of the traditional genetic algorithm are employed. We present some empirical evidence that suggests this procedure may be as good as or better than the traditional genetic algorithm across a range of search problems and that its action does successfully adapt the search mechanics to the problem space.

1. Background

A genetic algorithm is an exploratory procedure that is able to locate high performance structures in complex task domains. To do this, it maintains a set (called a population) of trial structures, represented as strings. New test structures are produced by repeating a two-step cycle (called a generation) which includes a survival-of-the-fittest selection step and a recombination step. Recombination involves producing new strings (the offspring) by operations upon one or more previous strings (the parents). The principal recombination operator was abstracted from knowledge of natural genetics and is called crossover. Holland has provided a theoretical explanation for the high performance of such algorithms [5] and this performance has been demonstrated on a number of complex problem domains such as function optimization [3, 4] and machine learning [6, 8, 9].
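The two-step generational cycle described above can be sketched in a few lines. This is a minimal illustration of the traditional scheme (fitness-proportionate selection followed by single-point crossover), not the authors' implementation; all names and parameter values are ours.

```python
import random

# Minimal generational GA: selection proportional to fitness, then
# single-point crossover on consecutive pairs of selected parents.

def evolve(fitness, length=10, popsize=20, generations=50, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(popsize)]
    for _ in range(generations):
        weights = [fitness(s) for s in pop]
        parents = rng.choices(pop, weights=weights, k=popsize)  # selection
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):           # recombination
            cut = rng.randint(1, length - 1)
            nxt.append(a[:cut] + b[cut:])
            nxt.append(b[:cut] + a[cut:])
        pop = nxt
    return max(pop, key=fitness)

best = evolve(lambda s: sum(s))   # toy "ones-counting" fitness
print(sum(best))
```

Even this bare-bones loop (no mutation, no scaling) climbs quickly on an easy unimodal problem, which is the behavior the traditional crossover analysis below takes as its starting point.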
The action of the traditional crossover is illustrated in figure 1. Starting with two strings from the population†, a point is selected between 1 and L-1, where L is the string length. Both strings are severed at this point and the segments to the right of this point are switched. The two starting strings are usually called the parents and the two resulting strings, the offspring. Taking the metaphor one step further, we will call this operation a mating with a single crossover event. In all previous work with which we are familiar, the crossover point is chosen with a uniform probability distribution.

The motivation for our new crossover operator sprang from some properties of this traditional operator. Specifically, it has a known bias against properly sampling structures which contain coadaptive substrings that are far apart. In metaphorical terms, we might call them genes which are far apart on the chromosome. The reason is not difficult to grasp intuitively. The farther apart the genes are, the higher the probability that a uniformly selected random crossover point will fall between them, causing them to be passed to different offspring. We observe that this crossover operator requires only knowledge of the string length; it pays no attention to its contents. Furthermore, its action is nonadaptive. It performs the same way in every generation.

In contrast to this, what we know of Nature's genetic crossover activity suggests that the location of crossover events may be quite sensitive to the contents of the chromosome [1]. There are many activities in this microworld which involve the initiation of an action by the binding of an enzyme to a specific base sequence. We were motivated to design a crossover mechanism which would adapt the distribution of its crossover points by the same survival-of-the-fittest and recombination processes already in place. We reasoned that it should do so by the use of special punctuation marks introduced into the string representation for this purpose.
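The traditional single-point crossover just described can be sketched as follows; a minimal illustration in which the function and variable names are ours, not the paper's.

```python
import random

# Single-point crossover: choose a cut point uniformly between 1 and L-1,
# sever both parents there, and swap the right-hand segments.

def single_point_crossover(mom, dad, rng=random):
    assert len(mom) == len(dad)
    cut = rng.randint(1, len(mom) - 1)   # L-1 possible crossover points
    return mom[:cut] + dad[cut:], dad[:cut] + mom[cut:]

kid1, kid2 = single_point_crossover("0000000", "1111111")
print(kid1, kid2)   # e.g. 0001111 and 1110000 for a cut after locus 3
```

Note that the operator consults only the string length, never the string contents, which is exactly the nonadaptivity the text criticizes.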
The operation envisioned for this new crossover was to proceed by "marking" the site of each crossover event in the string in which it occurred. Thus if the search space had the characteristic that crossovers at particular loci were consistently associated with inferior offspring, then they would die out, taking their markings with them. The converse was also hoped for. If no consistent relation existed to be exploited, then a random allocation of markings was expected. In this, we were reminded of speculation by Lenat that modern gene pools may contain accumulated heuristics to guide genetic search as well as an accumulation of well-adapted expressible genes [7].

† We will show all strings as bit strings, but this is not a requirement imposed by the algorithm.

PARENTS:
  0000000    1111111
  crossover point -- selected randomly with a uniform distribution
OFFSPRING:
  0001111    1110000

Figure 1. The action of the traditional crossover operator. The two parent strings are shown as all zeros and ones for clarity.

The idea of punctuation in the strings used by a genetic algorithm was first proposed by Holland [5], but this proposal was for a different purpose than that proposed here. The rest of this paper will present the mechanics of our new crossover operator, some empirical results indicating that superior performance can indeed be achieved with it, and some data exhibiting properties of its behavior.

2. Mechanics

In this section we will explain the mechanics of operation of our new crossover operator. We address how the crossover punctuation is coded, how an initial distribution is generated, how this distribution affects the crossing over of the functional parts of the strings, how the punctuation marks themselves are passed on to the offspring, and how a linkage is established between individual punctuation marks and the functional substrings with which they are associated. The representation is straightforward.
To the end of each chromosome (string of bits interpretable as a point in the search space) we attach another bit string of the same length. Thus any string representation previously used with a traditional genetic algorithm can be employed in our scheme simply by doubling its length. The bits in the new section are interpreted as crossover punctuation (i.e. 1 = crossover, 0 = no crossover). The loci in the punctuation (second) part of the string correspond one-to-one with the loci in the functional (first) part of the chromosome. Thus it is natural to think of these two parts of the chromosome as interleaved. Punctuation mark i tells whether crossover is or is not to occur at locus i. In figure 2, the punctuation marks are shown as + in the functional string.

Figure 2. The string representation of chromosomes with crossover punctuation marks: an expressible string of length L = 10 followed by a crossover punctuation string of length L = 10, interpreted as an expressible string with a crossover mark at each locus whose punctuation bit is 1.

It is common practice when beginning a genetic search to initiate a population of strings by randomly generating bits with equal probability for zero and one. We follow this practice for the functional string, but the probability of generating a one in the punctuation string is designated P_p and is set externally. The influence of this variable was studied empirically.

Figure 3. The action of punctuated crossover: before crossover, parent strings mom and pop; after crossover, offspring kid1 and kid2 with segments exchanged at the punctuated loci.

The mechanics of crossover governed by these punctuation marks is illustrated in figure 3. The bits from each parent string are copied one-by-one to one of the offspring from left to right. When a punctuation mark is encountered in either parent, the bits begin going to the other offspring (a crossover).
When this happens, the punctuation marks themselves are also passed on to the offspring, just before the crossover takes effect. Thus we may think of the marks as being linked to the functional bit string to the left of its locus. A little experimentation with pencil and paper should serve to give the reader a sense of the redistribution possibilities of this process. Some parental distributions will result in all the punctuation marks being passed to one of the offspring and none to its sibling, while others will redistribute clumped distributions.

When an offspring fails to survive the fitness-based selection step in its generation, its punctuation marks die with it. Thus, the dynamics of the distribution of these marks in the gene pool should reflect an accumulating experience about where is or is not a good place to crossover the genetic material in the pool.

The action of the mutation operator has traditionally been employed as a low level (i.e. small probability) defense against premature convergence of the gene pool. It seems consistent with our metaphor to allow the mutation operator equal access to the entire chromosome, both functional and punctuation parts. A few experiments supported this belief.

3. Empirical Evidence

We selected the task domain of function optimization (minimization) to test the capabilities of this new genetic algorithm, and a set of five scalar-valued functions which has been used in the past to test genetic algorithms. These functions provide a range of characteristics of search problems and are summarized in table 1. Recently Grefenstette has found a configuration of the traditional genetic algorithm which performs consistently better on this function set than any previously known [4]. We will use his results (which we will call BTGA for best traditional genetic algorithm) as a benchmark.
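The punctuated-crossover mechanics of section 2 can be sketched as follows. Because the figure reproductions in our copy are unclear, this encodes one plausible reading of the rules: a mark at locus i is passed to the offspring its parent is currently feeding (it travels with the segment to its left), and then the two parents swap target offspring before copying bit i. Names and data layout are ours.

```python
# Punctuated crossover sketch: each chromosome is (bits, marks) with
# len(bits) == len(marks); marks[i] == 1 requests a crossover at locus i.

def punctuated_crossover(p1, p2):
    (b1, m1), (b2, m2) = p1, p2
    kids = [{"bits": [], "marks": []}, {"bits": [], "marks": []}]
    t1, t2 = 0, 1                # which kid each parent currently feeds
    for i in range(len(b1)):
        if m1[i]:                # mark travels with the segment to its left
            kids[t1]["marks"].append(i)
        if m2[i]:
            kids[t2]["marks"].append(i)
        if m1[i] or m2[i]:       # a crossover event: swap targets
            t1, t2 = t2, t1
        kids[t1]["bits"].append(b1[i])
        kids[t2]["bits"].append(b2[i])
    return kids

kids = punctuated_crossover(("aaaabbbb", [0, 0, 0, 0, 1, 0, 0, 0]),
                            ("ccccdddd", [0] * 8))
print("".join(kids[0]["bits"]), "".join(kids[1]["bits"]))  # aaaadddd ccccbbbb
```

With a single mark at locus 4 in the first parent, the offspring exchange tails there and the mark itself lands in one offspring only, which is what lets selection prune crossover sites that consistently produce inferior offspring.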
TABLE 1
Functions Comprising the Test Environment

Fcn  Dimensions  Space Size    Description
f1   3           1.0 x 10^9    parabola
f2   2           1.7 x 10^7    Rosenbrock's saddle
f3   5           1.0 x 10^15   step function
f4   30          1.0 x 10^72   quadratic with noise
f5   2           1.6 x 10^10   Shekel's foxholes

We adopted Grefenstette's strategy of setting a genetic algorithm to optimize a genetic algorithm (GA) in order to locate a good set of parameter values for our new procedure (a "meta-search"). The meta-level GA was a traditional GA that allowed vector-valued fitness (one dimension for each of the five functions) called VEGA [8]. The parameter set searched at this level included: population size, crossover rate, mutation rate, P_p, and scaling window. The performance measure was online average function value (i.e. an average of all trials in a run of 5000 function evaluations). For more details on these matters the reader is referred to Grefenstette's paper and its predecessors.

Since the performance of BTGA on each of the individual functions was not previously published, we first estimated this by running BTGA on each one five times (n) with different random seeds. The results are given in table 2.

TABLE 2
Performance of Best Traditional Genetic Algorithm on Test Functions

Function  mean    s.d.   n  global optimum
f1        1.664   .1960  5    0
f2        25.16   4.497  5    0
f3        -27.78  -      5  -30
f4        24.28   1.383  5    0
f5        30.78   2.148  5   ~1

The results of the meta-search revealed that the genetic algorithm with punctuated crossover (GAPC) performed well for the whole set of functions at the following parameter settings: population size = 40, crossover rate = 0.45, mutation rate = 0.002, P_p = 0.04, and scaling window = 3. Estimates of the performance of GAPC comparable to those for BTGA are given in table 3, along with results of two-tailed t-tests of the mean differences between them.

TABLE 3
Performance of Genetic Algorithm with Punctuated Crossover on Test Functions and a comparison with the Best Traditional Genetic Algorithm
Function  mean    s.d.   n  t-test  significance
f1        1.111   .1706  5   4.84   .01
f2        17.22   3.279  5   3.19   .05
f3        -27.90  .5907  5   0.12   ns
f4        20.32   1.320  5   4.63   .01
f5        14.81   1.48   5  13.71   .001

These results clearly show the superiority of GAPC. It statistically outperforms BTGA on four of the five functions and is no worse on the other (f3). We believe the reason for this latter result lies in a floor effect. Both GAs find good solutions very quickly on f3, so that the online average rapidly approaches the global optimum. There is simply insufficient room for improvement to allow for a statistical difference. We believe the same explanation applies to the finding that no significant differences were noted between BTGA and GAPC when offline average was used as a criterion.

4. Characterization of GAPC Search

There are a number of questions about the realization of the performance characteristics we envisioned when this scheme was designed. Specifically, does the distribution of crossover marks adapt to the task environment in a meaningful way, is the process stable (i.e. does the number of punctuation marks in the population tend to vanish or saturate), and how many crossovers per mating does it settle upon (if it does settle)? In this section we present some evidence related to these questions that was collected while monitoring the searches reported above.

We define the population distribution of punctuation marks at time t as the sum of the punctuation bits at each locus l across all the individuals in the population:

    p_xo(l, t) = sum_{i=1}^{popsize} punct(i, l, t)    (2)

The total number of punctuation marks in the population is then

    T_xo(t) = sum_{l=1}^{L} p_xo(l, t)    (3)

Figure 4 shows a time history of p_xo(l, t) for 200 generations of one search of function f1. The x-axis is chromosome locus and the y-axis is both p_xo(l, t) and time (generations). The vertical separation between successive generations is equal to popsize (i.e.
the number of individuals in the population), so that a locus at which every member of the population has a punctuation mark will appear as a peak which just reaches the baseline of the line above. This figure shows that the initial distribution is flat, since the random initialization of the punctuation marks does not favor one locus over any other. As time progresses, however, some loci tend to accumulate more punctuation marks than others. The location of these concentrations changes with time; a peak may appear and remain prominent for some generations only to die out as others emerge. The distribution of punctuation marks does indeed seem to adapt as the gene pool adapts.

Figure 4. A time history of the distribution of punctuation marks for one run on f1 (two panels plotting the frequency of crossover marks against chromosome position over successive generations).

Figure 5 shows a plot of T_xo(t) for the same run. Also shown are max_l p_xo(l, t) and min_l p_xo(l, t). Although the total seems to be growing with no sign of stabilizing, the population is far from saturated†: the minimum remains zero until the last few generations, and the maximum concentration at any one locus never exceeds 30 out of the possible 40 (i.e. popsize).

Figure 5. How the numbers of punctuation marks change with time (population total, maximum, and minimum crossovers per locus). These data were from one search on f1, and were typical.

Figure 6 shows the average number of crossover events occurring per mating. This simple counting can be misleading, however, since the gene pool is converging as the generations progress. Note that when a crossover event swaps gene segments which are identical, it is an unproductive crossover event.
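The statistics of equations (2) and (3), the per-locus and total punctuation-mark counts, are simple sums and can be sketched directly; the function names follow the reconstructed notation p_xo and T_xo.

```python
# Per-locus and total punctuation-mark counts over a population of
# punctuation bit strings (one list of 0/1 bits per individual).

def p_xo(punct_pop):
    """Equation (2): sum the punctuation bits at each locus l."""
    length = len(punct_pop[0])
    return [sum(ind[l] for ind in punct_pop) for l in range(length)]

def T_xo(punct_pop):
    """Equation (3): total number of punctuation marks in the population."""
    return sum(p_xo(punct_pop))

pop = [[1, 0, 0, 1],
       [0, 0, 1, 1],
       [1, 0, 0, 0]]
print(p_xo(pop))   # [2, 0, 1, 2]
print(T_xo(pop))   # 5
```

Saturation in this representation would mean every entry of p_xo equals the population size.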
When these events are discounted, we were surprised to see that the number of "productive" crossover events per mating remained nearly constant. What is more, the level at which this statistic holds appears to correlate strongly with L. See table 4. These results are not dissimilar to results reported by DeJong when experimenting with multiple crossover points [2].

TABLE 4
Time-averaged "Productive" Crossovers per Mating

Function  chromosome length  crossovers
f1        30                 1.49
f2        24                 0.87
f3        50                 2.02
f4        240                8.64
f5        32                 1.65

† Saturation would mean a punctuation bit at every one of the popsize x L (40 x 30 = 1200) possible locations.

Figure 6. Total and "productive" crossover events per mating for one run on f1.

5. Conclusions

We have described a modified knowledge representation and crossover operator for use with genetic search. Its design was driven by intuition abstracted from Nature's mechanisms of crossover during meiosis. Experiments indicate that it performs as well as or better than a traditional GA for a set of test problems that exhibits a range of search space properties. Experiments on other test problems are continuing. The distribution of crossover events evolves as the search progresses, and the statistics of "productive" crossover events per mating indicate steady search effort even in the face of a converging gene pool. These statistics seem to correlate with chromosome length and are consistent with previous results. We remain cautiously optimistic that continued experimentation will strengthen these conclusions and will lead to a robust approach to adaptive knowledge representation.

Acknowledgement

We wish to acknowledge the valuable contributions of D. Paul Benjamin to the conception of this project and to discussions of its implications.

References

1. B. Alberts, D. Bray, J.
Lewis, M. Raff, K. Roberts and J. D. Watson, Molecular Biology of the Cell, Garland Publishing, Inc., New York, 1983.

2. K. A. De Jong, Analysis of the Behavior of a Class of Genetic Adaptive Systems, Ph.D. Thesis, Department of Computer and Communication Sciences, University of Michigan, 1975.

3. K. A. De Jong, Adaptive System Design: A Genetic Approach, IEEE Transactions on Systems, Man & Cybernetics SMC-10, 9 (September 1980), 566-574.

4. J. J. Grefenstette, Optimization of Control Parameters for Genetic Algorithms, IEEE Transactions on Systems, Man & Cybernetics SMC-16, 1 (January-February 1986), 122-128.

5. J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975.

6. J. H. Holland and J. S. Reitman, Cognitive Systems Based on Adaptive Algorithms, in Pattern-Directed Inference Systems, D. A. Waterman and F. Hayes-Roth (editors), Academic Press, New York, NY, 1978.

7. D. B. Lenat, The Role of Heuristics in Learning by Discovery: Three Case Studies, in Machine Learning, R. S. Michalski, J. G. Carbonell and T. M. Mitchell (editors), Tioga, Palo Alto, CA, 1983.

8. J. D. Schaffer, Some Experiments in Machine Learning Using Vector Evaluated Genetic Algorithms, Ph.D. Thesis, Department of Electrical Engineering, Vanderbilt University, December 1984.

9. S. F. Smith, Flexible Learning of Problem Solving Heuristics Through Adaptive Search, 8th International Joint Conference on Artificial Intelligence, Karlsruhe, Germany, August 1983.
GENETIC ALGORITHMS WITH SHARING FOR MULTIMODAL FUNCTION OPTIMIZATION

David E. Goldberg, The University of Alabama, Tuscaloosa, AL 35487
and
Jon Richardson, The University of Tennessee (formerly at The University of Alabama), Knoxville, TN 37996

ABSTRACT

Many practical search and optimization problems require the investigation of multiple local optima. In this paper, the method of sharing functions is developed and investigated to permit the formation of stable subpopulations of different strings within a genetic algorithm (GA), thereby permitting the parallel investigation of many peaks. The theory and implementation of the method are investigated and two one-dimensional test functions are considered. On a test function with five peaks of equal height, a GA without sharing loses strings at all but one peak; a GA with sharing maintains roughly equally sized subpopulations clustered about all five peaks. On a test function with five peaks of different sizes, a GA without sharing loses strings at all but the highest peak; a GA with sharing allocates decreasing numbers of strings to peaks of decreasing value, as predicted by theory.

INTRODUCTION

Genetic algorithms (GAs) are finding increasing application in a variety of problems across a spectrum of disciplines (Goldberg & Thomas, 1986). This is so because GAs place a minimum of requirements and restrictions on the user prior to engaging the search procedure. The user simply codes the problem as a finite length string, characterizes the objective (or objectives) as a black box, and turns the GA crank. The genetic algorithm then takes over, seeking near-optima primarily through the combined action of reproduction and crossover. These so-called simple GAs have proved useful in many problems despite their lack of sophisticated machinery and despite their total lack of knowledge of the problem they are solving.
As their usage has grown, several objections to their performance have arisen. Simple GAs have been criticized for sub-par performance on multimodal (multiply-peaked) functions. They have also been criticized for so-called premature convergence, where substantial fixation occurs at most bit positions before obtaining sufficiently near-optimal points (Cavicchio, 1970; De Jong, 1975; Mauldin, 1984; Baker, 1985).

In this paper, we examine the first of these maladies and propose a cure borrowed from nature. In particular, our herbal remedy causes the formation of niche-like and species-like subdivision of the environment and population through the imposition of sharing functions. These sharing functions help mitigate unbridled head-to-head competition between widely disparate points in a search space. This reduction in competition between distant points thereby permits better performance on multimodal functions. As a side benefit we find that sharing helps maintain a more diverse population and more considered (and less premature) convergence. In the remainder of this paper, we review the problem and past efforts to solve it; we consider the theory of niche and speciation through Holland's modified two-armed bandit problem; and we compare the performance of a genetic algorithm both with and without the sharing function feature. Finally we examine extensions of the sharing function idea to permit its implementation in a wide array of problems.

MULTIMODAL OPTIMIZATION, GENETIC DRIFT, AND A SIMPLE GA

The difficulty posed by a multimodal problem for a simple genetic algorithm may be illustrated by a straightforward example. Figure 1 shows a bimodal function of a single variable, f(x) = 4(x - 0.5)^2, coded by a normalized, five-bit binary integer. In this problem, the two optima are located at extreme ends of the one-dimensional space.
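The five-bit coding can be enumerated directly to see the two equal peaks. The printed equation is garbled in our copy; f(x) = 4(x - 0.5)^2 is one reconstruction consistent with equal optima at the extreme codings 00000 and 11111, and the code below assumes it.

```python
# Enumerate the normalized five-bit coding of [0, 1] and evaluate the
# (reconstructed) equal-peaks function; the two best strings should be
# the extreme codings 00000 and 11111, as the text describes.

def decode(bits):
    """Map a 5-bit string to x in [0, 1] (normalized binary integer)."""
    return int(bits, 2) / 31.0

def f(x):
    return 4.0 * (x - 0.5) ** 2   # assumed form of the garbled equation

values = {format(i, "05b"): f(decode(format(i, "05b"))) for i in range(32)}
top2 = sorted(values, key=values.get, reverse=True)[:2]
print(top2)   # ['00000', '11111'] (the two equal peaks, f = 1.0 at each)
```

Since the peaks are exactly equal, neither schema 00*** nor 11*** has any selective advantage, which is the setting in which genetic drift decides the outcome.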
If we start a genetic algorithm with a population chosen initially at random and let it run for a large number of generations, our fondest hope is that stable subpopulations cluster about the two optima (about 00000 and 11111). In fact, if we perform this experiment, we find that the simple GA eventually clusters all of its points about one peak or the other.

Why does this happen? After all, doesn't the fundamental theorem of genetic algorithms (Holland, 1975; De Jong, 1975; Goldberg, 1986) tell us that exponentially increasing numbers of trials will be given to the observed best schemata? Yes it does, but the theorem assumes an infinitely large population size. In a finite size population, even when there is no selective advantage for either of two competing alternatives (as is the case for schemata 11*** and 00*** in the example problem), the population will converge to one alternative or the other in finite time (De Jong, 1975; Goldberg & Segrest, this volume).

Figure 1. Bimodal function with equal peaks.

This problem of finite populations is so important that geneticists have given it a special name, genetic drift. Stochastic errors tend to accumulate, ultimately causing the population to converge to one alternative or another.

The convergence toward one optimum or another is clearly undesirable in the case of peaks of equal value. In multimodal problems where peaks of different altitudes exist, the desirability of convergence to the globally best peak is not so clear cut. In Figure 2 we see a bimodal function with unequal peaks, f(x) = 2.8(x - 0.6)^2, with a five-bit normalized coding. If we are interested in obtaining only the global optimum, we should not mind the eventual convergence of the population to the leftmost point; however, this convergence is not always guaranteed. Small initial populations may allow sampling errors which overestimate the schemata of the rightmost points, thereby permitting convergence to the wrong peak.
Furthermore, in real world optimization we are often interested in having information about good, better, and best solutions. When this is so, it might be nice to see a form of convergence that permits stable subpopulations of points to cluster about both peaks according to peak fitness.

In either of these cases, we can argue for more controlled competition and less reckless convergence than is possible when we work with a simple, tripartite (reproduction, crossover, and mutation) genetic algorithm. For these reasons we turn to the theory of niche and speciation to find an appropriate model for naturally regulated competition.

Figure 2. Bimodal function with unequal peaks.

THEORY OF SPECIES AND NICHE

The results of our initial Gedankenexperimente (thought experiments) with simple GAs and multimodal functions are somewhat perplexing when juxtaposed with natural example. In our problem with equal peaks, the simple GA converges on one peak or the other even though both peaks are equally useful. By contrast, why doesn't nature converge to a single species? In our second problem with unequal peaks, we notice that the simple GA again converges to one peak, usually--but not always--the "correct" peak. How, when faced with a somewhat less fit species, does nature choose to limit population size before resorting to extinction? In both cases, nature has found a way to combat unbridled competition and permit the formation of stable subpopulations. In nature, different species don't go head to head. Instead they exploit separate niches (sets of environmental features) in which other organisms have little or no interest. In this section, we need to bridge the gap between natural example and genetic algorithm practice through the application of some useful theory.

Although there is a well-developed biological literature in both niche and speciation, its transfer to the arena of GA search has been limited.
Like many other concepts and operators, the first theories directly applicable to artificial genetic search are due to Holland (1975). To illustrate niche and species, Holland introduces a modification of the two-armed bandit problem with distributed payoff and sharing. Let's examine his argument with a concrete formulation of the same problem.

Imagine a two-armed bandit as depicted in Figure 3. In the ordinary two-armed bandit problem (Holland, 1975; De Jong, 1975), we have two arms, a left arm and a right arm, and we have different payoffs associated with each arm. Suppose we have an expected payoff associated with the left arm of $25 and an expected payoff associated with the right arm of $75; in the standard two-armed bandit, we are unaware initially which arm pays the higher amount, and our dilemma is to minimize our expected losses over some number of trials.

Figure 3. Sketch of the two-armed bandit with queues and sharing.

In this form, the two-armed bandit problem puts the tradeoffs between exploration and exploitation in sharp perspective. We can take extra time to experiment, but in so doing risk the possible gain from choosing the right arm, or we can experiment briefly and risk making an error once we choose the arm we think is best. In this way, the two-armed bandit has been used to justify the allocation strategy adopted by the reproductive plans of simple genetic algorithms.

This is not our purpose here. Instead we examine the modified two-armed bandit problem to put the concepts of niche and species in sharper focus. In the modified problem, we further suppose that we have a population of some number of players, say 100 players, and that each player may decide to play one arm or the other. If at this point we do nothing else, we simply create a parallel version of the original two-armed bandit problem, where we expect that all players eventually line up behind the observed best (and actual best) arm.
To produce the subdivision of species and niche, we introduce an important rule change. Instead of allowing a full measure of payoff for each individual, individuals who choose a particular arm are now forced to share the wealth derived from that arm with other players queued up at the arm. At first glance, this change appears to be quite minor. In fact, this single modification causes a strikingly and surprisingly different outcome in the modified two-armed bandit. To see why and how the results change, we first recall that despite the different rules of the game, we still allocate population members according to payoff. In the modified game, an individual will receive a payoff which depends on the arm payoff value and the number of individuals queued up at that arm. In our concrete example, an individual lined up behind the right arm when all individuals are lined up behind that same arm receives an amount $75/100 = $0.75. On the other hand, an individual lined up behind the left arm when all individuals are queued there receives $25/100 = $0.25. In both cases, there is motivation for some individuals to shift lines. In the first case, a single individual changing lines stands to gain an amount $25.00 - $0.75 = $24.25. The motivation to shift lines is even stronger in the second case. At some point in between we should expect there to be no further motivation to shift lines. This will occur when the individual payoffs are identical for both lines.
If N is the population size, m_right and m_left are the number of individuals behind the right and left queues, and f_right and f_left are the expected payoff values from the right and left arms respectively, the equilibrium point may be calculated as follows:

f_right / m_right = f_left / m_left

In our example, this complete equalization of individual payoff occurs when 75 players select the right arm and 25 players select the left arm, because $75/75 = $25/25 = $1. This problem may be extended to the k-armed case directly, and the extension does not change the fundamental conclusions at all: the system attains equilibrium when the ratios of arm payoff to queue length are equal (Holland, 1975). The incorporation of forced sharing causes the formation of stable subpopulations (species) behind different arms (niches) in the problem. Furthermore, the number of individuals devoted to each niche is proportional to the expected niche payoff. This is exactly the type of solution we had hoped for when we considered the bimodal problems of Figures 1 and 2. Of course the extension of the sharing concept to real genetic algorithm search is more difficult than the idealized case implies. In a real genetic algorithm there are many, many arms, and deciding who should share and how much should be shared becomes a non-trivial question. In the next section we will examine a number of current efforts to induce niche and species through indirect or direct sharing.

A BRIEF REVIEW OF CURRENT SCHEMES

A number of methods have been implemented to induce niche and species in genetic algorithms. In some of these techniques the sharing comes about indirectly. Although the two-armed bandit problem is a nice, simple abstract model of niche and species formation and maintenance, nature is not so direct in divvying up her bounty. In natural settings, sharing comes about through crowding and conflict.
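The equal-ratio equilibrium can be illustrated with a small simulation. The following sketch is not from the paper: players greedily switch to whichever queue would pay them best, and the dynamics settle where the payoff-to-queue-length ratios are equal.

```python
# Sketch (not from the paper): greedy best-response dynamics for the
# shared-payoff bandit. Each player's payoff is arm payoff divided by
# queue length; players switch queues while a switch improves payoff.
def equilibrium(payoffs, n_players):
    counts = [n_players] + [0] * (len(payoffs) - 1)  # start all on arm 0

    while True:
        def gain(i):
            # payoff an individual would receive after joining arm i
            return payoffs[i] / (counts[i] + 1)

        moved = False
        for src in range(len(payoffs)):
            if counts[src] == 0:
                continue
            current = payoffs[src] / counts[src]
            best = max(range(len(payoffs)), key=gain)
            if src != best and gain(best) > current:
                counts[src] -= 1          # one player switches queues
                counts[best] += 1
                moved = True
                break
        if not moved:                     # no profitable switch remains
            return counts

print(equilibrium([75.0, 25.0], 100))     # → [75, 25]
```

As the text predicts, the 100 players split 75/25 across the $75 and $25 arms, equalizing individual payoffs at $1.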
When a habitat becomes fairly full of a particular organism, individuals are forced to share available resources. Cavicchio's (1971) dissertation study was one of the first to attempt to induce niche-like and species-like behavior in genetic algorithm search. Specifically, he introduced a mechanism he called preselection. In this scheme, an offspring replaces the inferior parent if the offspring's fitness exceeds that of the inferior parent. In this way diversity is maintained in the population because strings tend to replace strings similar to themselves (one of their parents). Cavicchio claimed to maintain more diverse populations in a number of simulations with relatively small population sizes (n = 20). De Jong (1975) has generalized preselection in his crowding scheme. In De Jong crowding, individuals replace existing strings according to their similarity with other strings in an overlapping population. Specifically, an individual is compared to each string in a randomly drawn subpopulation of CF (crowding factor) members. The individual with the highest similarity (on the basis of bit-by-bit similarity count) is replaced by the new string. Early in the simulation, this amounts to random selection of replacements because all individuals are likely to be equally dissimilar. As the simulation progresses and more and more individuals in the population are similar to one another (one or more species have gotten a substantial foothold in the population), the replacement of individuals by similar individuals tends to maintain diversity within the population and reserve room for one or more species. De Jong has had success with the crowding scheme on multimodal functions when he used crowding factors CF=2 and CF=3. De Jong's crowding has subsequently been used in a machine learning application (Goldberg, 1983).
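The crowding replacement step described above can be sketched as follows. This is a minimal illustration, assuming individuals are equal-length bit strings; it is not De Jong's original implementation.

```python
import random

def crowding_replace(population, offspring, crowding_factor=2):
    """De Jong-style crowding (a sketch): the offspring replaces the most
    similar member of a randomly drawn subpopulation of CF individuals,
    where similarity is a bit-by-bit match count."""
    def similarity(a, b):
        return sum(x == y for x, y in zip(a, b))

    # draw CF candidate indices at random from the population
    candidates = random.sample(range(len(population)), crowding_factor)
    # replace the candidate most similar to the new string
    victim = max(candidates, key=lambda i: similarity(population[i], offspring))
    population[victim] = offspring
```

With CF equal to the population size, an offspring identical to an existing member simply replaces that member, which is how crowding preserves room for distinct species.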
Booker (1982) discusses a direct application of the sharing idea in a machine learning application with genetics-based classifier systems. In classifier systems, a sub-goal reward mechanism called a bucket brigade passes reward through a network of rules like money passing through an economy. Booker suggests that appropriately sized subpopulations of rules can form in such systems if related rules are forced to share payments. This idea is sound and has been forcefully demonstrated in Wilson's recent work with boolean function learning (Wilson, 1986); however, it does not transfer well to function optimization, because unlike classifier systems, there is no general way in function optimization to determine which strings are related. Schaffer (1984) has used separate, fixed size subpopulations in his study of vector evaluated genetic algorithms (VEGA). In this study, each component of the vector (each criterion or objective measure) is mapped to its own subpopulation where separate reproduction processes are carried out. The method has worked well in a number of trial functions; however, Schaffer has expressed some concern over the procedure's ability to handle middling nondominated individuals--individuals that may be Pareto optimal but are not extremal (or even near extremal) along any single dimension. Furthermore, although the study does use separate subpopulations, it is unclear how the same method might be applied to the more usual single-criterion optimization problem. A direct exploration of biological niche theory in the context of genetic algorithms is contained in Perry's (1985) dissertation. In this work, Perry defines a genotype-to-phenotype mapping, a multiple-resource environment, and a special entity called an external schema. External schemata are special similarity templates defined by the simulation designer to characterize species membership.
Unfortunately, the required intervention of an outside agent limits the practical use of this technique in artificial genetic search. Nonetheless, the reader interested in the connections between biological niche theory and GAs may be interested in this work. Grosso (1985) also maintains a biological orientation in his study of explicit subpopulation formation and migration operators. Multiplicative, heterotic (problems with diploid structures where a heterozygote is more highly fit than the homozygote) objective functions are used in this study, and as such, the results are not directly applicable to most artificial genetic search; however, Grosso was able to show the advantage of intermediate migration rate values over either isolated subpopulations (no migration) or panmictic (completely mixed) subpopulations. This study suggests that the imposition of a geography within artificial genetic search may be another useful way of assisting the formation of diverse subpopulations. Further studies are needed to determine how to do this in more general artificial genetic search applications. Although he has not directly addressed niche and species, Mauldin (1984) has attempted to better maintain diversity in genetic algorithms through his uniqueness operator. The uniqueness operator arbitrarily returns diversity to a population whenever it is judged to be lacking. To implement uniqueness, Mauldin defines a uniqueness parameter k that may decrease with time (similar to the cooling of simulated annealing). He then requires that for insertion in a population, an offspring must be different from every population member at a minimum of k loci. If the offspring is not sufficiently different, it is mutated until it is. By itself, uniqueness is little more than a somewhat knowledgeable (albeit expensive) mutation operator. That it is useful in improving offline (convergence) performance is not unexpected.
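The uniqueness test-and-mutate loop just described can be sketched as follows. This is an illustrative reading of Mauldin's operator over bit strings, not his original code.

```python
import random

def enforce_uniqueness(offspring, population, k):
    """Mauldin-style uniqueness (a sketch): mutate the offspring until it
    differs from every population member in at least k bit positions."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    bits = list(offspring)
    # keep flipping random bits until the offspring is sufficiently unique
    while any(hamming(bits, member) < k for member in population):
        i = random.randrange(len(bits))
        bits[i] = '1' if bits[i] == '0' else '0'
    return ''.join(bits)
```

Each loop iteration is one mutation, so the cost grows with how crowded the population already is, which is why the text calls uniqueness an expensive mutation operator.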
Grefenstette (1986) has recently supported the notion of fairly high mutation probabilities (p_m = 0.01 to 0.1) when convergence to the best is the main goal. It is interesting to note that uniqueness combined with De Jong's crowding scheme worked better than either operator by itself (Mauldin, 1984). This result suggests that maintaining diversity for its own sake is not the issue. Instead, we need to maintain appropriate diversity--diversity that in some way helps cause (or has helped cause) good strings. In the next section, we show how we can maintain appropriate diversity through the use of sharing functions.

SHARING FUNCTIONS

In attempting to induce species, we must either directly or indirectly cause intraspecies sharing, but we are faced with two important questions: who should share, and how much should be shared? In natural systems, these two questions are answered implicitly through conflict for finite resources. Different species find different combinations of environmental factors--different niches--which are relatively uninteresting to other species. Individuals of the same species use those resources until there is conflict. At that point, they vie for the same turf, food, and other environmental resources, and the increased competition and conflict cause individuals of the same species to share with one another, not out of altruism, but because the resources they give up are not worth the cost of the fight. It might be possible to induce similar conflict for resources in genetic optimization. Unfortunately, in many optimization problems, there is no natural definition of a resource. As a result, we must invent some way of imposing niche and speciation on strings based on some measure of their distance from each other. We do just this with what we have called a sharing function.
A sharing function is nothing more than a way of determining the degradation of an individual's payoff due to a neighbor at some distance as measured in some similarity space. Mathematically, we introduce a convenient metric d over our decoded parameters x (the decoded parameters are themselves functions of the strings, x_i = x_i(s_i)):

d_ij = d(x_i, x_j)

Alternatively, we may introduce a metric over the strings directly:

d_ij = d(s_i, s_j)

In this paper, we use a metric defined over the decoded parameters x (phenotypic sharing); later on, we briefly consider the use of metrics defined over the strings (genotypic sharing). However we choose a metric, we define a sharing function sh as a function of the metric value, sh = sh(d), with the following three properties:

1. 0 <= sh(d) <= 1 for all d;
2. sh(0) = 1;
3. sh(d) approaches 0 as d becomes large.

Table 2. Trial solutions from the preceding paper: x = theoretical 'nearest' trial solution; † = SGA; ‡ = MC; †† = 10^7 points; ††† = best MC; A = ARGOT.

Figure 4. Roving boundaries and best estimates for the two bimodal parameters. Note that the best estimates of the parameters, the bold lines, undergo early 'switching' between the two possible global solutions.
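The sharing-function idea described above can be sketched concretely. In the sketch below, the linear sharing function and the niche radius `sigma_share` are illustrative choices consistent with the three properties, and dividing each payoff by its niche count is one way to implement the degradation; none of these particular choices should be read as the paper's exact formulation.

```python
# A sketch of phenotypic sharing over one decoded parameter x.
# sh(d) = 1 - d/sigma_share for d < sigma_share, else 0: it satisfies
# sh(0) = 1, 0 <= sh <= 1, and sh -> 0 at large distance.
def sh(d, sigma_share=0.1):
    return max(0.0, 1.0 - d / sigma_share)

def shared_fitness(xs, fitnesses, sigma_share=0.1):
    """Degrade each individual's payoff by its niche count, the summed
    sharing-function values against every population member (itself
    included, contributing sh(0) = 1)."""
    out = []
    for x, f in zip(xs, fitnesses):
        niche_count = sum(sh(abs(x - y), sigma_share) for y in xs)
        out.append(f / niche_count)
    return out
```

Two individuals sitting on the same peak split that peak's payoff between them, while a lone individual on a distant peak keeps its full payoff, reproducing the queue-sharing behavior of the modified bandit.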
Figure 2. Roving boundaries and best estimates for the 'broad' and 'sharp' parameters. Note the differing scales of the two graphs and how the broad parameter has very large roving boundaries while the sharp parameter has extremely narrow roving boundaries.

NONSTATIONARY FUNCTION OPTIMIZATION USING GENETIC ALGORITHMS WITH DOMINANCE AND DIPLOIDY

David E. Goldberg and Robert E. Smith
Department of Engineering Mechanics
The University of Alabama
Tuscaloosa, AL 35487

ABSTRACT

This paper investigates the use of diploid representations and dominance operators in genetic algorithms (GAs) to improve performance in environments that vary with time. The mechanics of diploidy and dominance in natural genetics are briefly discussed, and the usage of these structures and operators in other GA investigations is reviewed. An extension of the schema theorem is developed which illustrates the ability of diploid GAs with dominance to hold alternative alleles in abeyance. Both haploid and diploid GAs are applied to a simple time-varying problem, an oscillating, blind knapsack problem. Simulation results show that a diploid GA with an evolving dominance map adapts more quickly to the sudden changes in this problem environment than either a haploid GA or a diploid GA with a fixed dominance map. These proof-of-principle results indicate that diploidy and dominance can be used to induce a form of long term distributed memory within a population of structures.

INTRODUCTION

Real world problems are seldom independent of time. If you don't like the weather, wait five minutes and it will change.
If this week gasoline costs $1.30 a gallon, next week it may cost $0.89 a gallon or perhaps $2.53 a gallon. In these and many more complex ways, real world environments are both nonstationary and noisy. Searching for good solutions or good behavior under such conditions is a difficult task; yet, despite the perpetual change and uncertainty, all is not lost. History does repeat itself, and what goes around does come around. The horrors of Malthusian extrapolation rarely come to pass, and solutions that worked well yesterday are at least somewhat likely to be useful when circumstances are somewhat similar tomorrow or the day after. The temporal regularity implied in these observations places a premium on search augmented by selective memory. In other words, a system which does not learn the lessons of its history is doomed to repeat its mistakes. In this paper, we investigate the behavior of a genetic algorithm augmented by structures and operators capable of exploiting the regularity and repeatability of many nonstationary environments. Specifically, we apply genetic algorithms that include diploid genotypes and dominance operators to a simple nonstationary problem in function optimization: an oscillating, blind knapsack problem. In doing this, we find that diploidy and dominance induce a form of long term distributed memory that stores and occasionally remembers good partial solutions that were once desirable. This memory permits faster adaptation to drastic environmental shifts than is possible without the added structures and operators. In the remainder of this paper, we explore the mechanism, theory, and implementation of dominance and diploidy in artificial genetic search. We start by examining the role of diploidy and dominance in natural genetics, and we briefly review examples of their usage in genetic algorithm circles. We extend the schema theorem to analyze the effect of these structures and mechanisms.
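The oscillating, blind knapsack environment mentioned above can be sketched in miniature. The five-object data, the oscillation period, and the penalty for overweight solutions below are all hypothetical stand-ins, not the paper's 17-object problem.

```python
# Sketch of an oscillating, blind 0-1 knapsack objective. The weight
# limit switches periodically between two values, so the best packing
# changes with time. Data and penalty form are illustrative assumptions.
VALUES  = [10, 7, 5, 3, 1]
WEIGHTS = [ 9, 6, 4, 3, 2]
LIMITS  = (10, 18)            # the environment oscillates between these

def fitness(bits, generation, period=15):
    limit = LIMITS[(generation // period) % 2]
    value  = sum(v for v, b in zip(VALUES, bits) if b)
    weight = sum(w for w, b in zip(WEIGHTS, bits) if b)
    if weight > limit:
        # "blind": the GA only sees a degraded payoff, not the limit
        return max(0, value - 3 * (weight - limit))
    return value
```

A solution that is excellent under one limit becomes heavily penalized after the switch, which is exactly the setting where a diploid GA's held-in-abeyance alternative can pay off.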
We present results from computational experiments on a 17-object, oscillating, blind 0-1 knapsack problem. Simulations with adaptive dominance maps and diploidy are able to adapt more quickly to sudden environmental shifts than either a haploid genetic algorithm or a diploid genetic algorithm with a fixed dominance map. These results are encouraging and suggest the investigation of dominance and diploidy in other GA applications in search and machine learning.

THE MECHANICS OF NATURAL DOMINANCE AND DIPLOIDY

It is surprising to some genetic algorithm newcomers that the most commonly used GA is modeled after the mechanics of haploid genetics. After all, don't most elementary genetics textbooks start off with a discussion of Mendel's pea plants and some mention of diploidy and dominance? The reason for this disparity between genetic algorithm practice and genetics textbook coverage is due to the success achieved by early GA investigators (Hollstien, 1971; De Jong, 1975) using haploid chromosome models on stationary problems. It was found that surprising efficacy and efficiency could be obtained using single-stranded (haploid) chromosomes under the action of reproduction and crossover. As a result, later investigators of artificial genetic search have tended to ignore diploidy and dominance. In this section we examine the mechanics of diploidy and dominance to understand their roles in shielding alternate solutions from excessive selection. Most studies of genetic algorithms to date have considered only the simplest genotype found in nature, the haploid or single-stranded chromosome. In this simple model, a single-stranded string contains all the information relevant to the problem we are considering. While nature contains many haploid organisms, most of these tend to be relatively uncomplicated life forms.
It seems that when nature wanted to build more complex plant and animal life it had to rely on a more complex underlying chromosomal structure, the diploid or double-stranded chromosome. In the diploid form, a genotype carries a pair of chromosomes (called homologous chromosomes), each containing information for the same functions. At first, this redundancy seems unnecessary and confusing. After all, why keep around pairs of genes which decode to the same function? Furthermore, when the pair of genes decode to different function values, how does nature decide which allele to pay attention to? To answer these questions, let's consider a diploid chromosomal structure where we use different letters to represent different alleles (different gene function values):

AbCDe
aBCde

At each position (locus) we have used the capital form or the lower case form of a particular letter to represent alternative alleles at that position. In nature, each allele might represent a different phenotypic characteristic (or have some nonlinear or epistatic effect on one or more phenotypic characteristics). For example, the B allele might be the brown-eyed gene and the b allele might be the blue-eyed gene. Although this scheme of thinking is not much different from the haploid (single-stranded) case, one difference is clear. Because we now have a pair of genes describing each function, something must decide which of the two values to choose because, for example, the phenotype cannot have both brown and blue eyes at the same time (unless we consider, as nature sometimes does, the possibility of intermediate forms, but we shall not concern ourselves with that possibility here). The primary mechanism for eliminating this conflict of redundancy is through an operator which geneticists have called dominance. At a particular locus, it has been observed that one allele (the dominant allele) takes precedence (dominates) over the other alternative alleles (the recessives) at that locus.
More specifically, an allele is dominant if it is expressed (it shows up in the phenotype) when paired with some other allele. In our example above, if we assume that all capital letters are dominant alleles and all lower case letters are recessive, the phenotype expressed by the example chromosome pair may be written:

AbCDe
aBCde  -->  ABCDe

At each locus we see that the dominant gene is always expressed and that the recessive gene is only expressed when it shows up in the company of another recessive. In the geneticist's parlance, we say that the dominant gene is expressed whether heterozygous (mixed, Aa --> A) or homozygous (pure, CC --> C) and the recessive allele is expressed only when homozygous (ee --> e). The mechanics of diploidy and dominance seem relatively clear. On a more abstract level, we may think of dominance as a genotype-to-phenotype mapping. Yet, if we continue to ponder the nature and action of diploidy and dominance, they are really quite bizarre. Why does nature double the amount of information carried within the genotype and then turn around and cut by half the quantity of information it uses? On the surface this seems wasteful and unnecessarily tedious. Yet, nature is no spendthrift, nor is she given to whimsy or caprice. There must be good reason for the added redundancy of the diploid genotype and for the reduction mapping of the dominance operator. Actually, diploidy and dominance have long been the object of genetic study, and numerous theories and explanations of their role have been put forth.
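The capital-letter dominance convention above can be written as a tiny genotype-to-phenotype mapping. This is a sketch of the convention as stated in the text, with function and chromosome names of our own choosing.

```python
# Sketch of the capital-dominant convention: at each locus, a capital
# (dominant) allele is expressed whenever present; a lower-case
# (recessive) allele is expressed only when both alleles are recessive.
def express(chrom_a, chrom_b):
    phenotype = []
    for a, b in zip(chrom_a, chrom_b):
        if a.isupper():
            phenotype.append(a)          # dominant allele on strand A
        elif b.isupper():
            phenotype.append(b)          # dominant allele on strand B
        else:
            phenotype.append(a)          # homozygous recessive: expressed
    return ''.join(phenotype)

print(express('AbCDe', 'aBCde'))         # → ABCDe
```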
The theories which make the most sense in the context of artificial genetic search hypothesize that diploidy provides a mechanism for remembering alleles and allele combinations which were previously useful and that dominance provides an operator to shield those remembered alleles from harmful selection in a currently hostile environment. In a natural context, we can understand the need for both a distributed long term memory and a means of protecting that memory against rapid destruction. Over the course of the evolution of life on Earth, the planet has undergone many changes in environmental conditions. From hot to cold and back to moderate temperatures, from dark to light to somewhere in between, there have been dramatic and rapid shifts in environmental conditions. The most effective organisms have been those able to adapt most rapidly to the changing conditions. Animals and plants with diploid or polyploid structure have been those most capable of surviving, because their genetic constitution did not easily forget the lessons learned prior to previous environmental shifts. The redundant memory of diploidy permits multiple solutions (to the same "problem") to be carried along with only one particular solution expressed. In this way, old lessons are not lost forever, and dominance and dominance change permit the old lessons to be remembered and tested occasionally. An often cited example of the long term memory induced by diploidy and dominance can be found in the shifts in population balance of the peppered moth in Great Britain during the industrial revolution. The wild form (and originally the dominant form) of this lepidopteran had white wings with small black specks. Prior to the Industrial Revolution, this coloration was effective camouflage against birds and other beasts of prey in the moth's natural habitat, lichen-covered trees. In the middle of the nineteenth century, black forms were caught in the neighborhood of industrial towns.
Careful experiments by Kettlewell (Berry, 1972) showed that the speckled version was advantageous in the pristine setting, while the melanic (dark) form was advantageous in the industrial environment where pollution had killed off the lichen covering the tree trunks. It turned out that the melanic forms were controlled by a single dominant gene, implying that a shift in dominance occurred. When the industrial revolution shifted the balance of power toward the darkened form, the darkened form became dominant and the speckled form was held in abeyance. Note that the melanic form was not a new invention; this was no case of fortuitous mutation magically concocting the needed form. Instead, the black form had been invented earlier, perhaps in response to forests where lichen was naturally suppressed. When the by-products of industry caused the lichen to disappear, the melanic form was sampled more frequently and then evolved to the dominant form. With this alternate solution held in the background, the peppered moth was easily able to adapt rapidly to the selective pressures of its changing environment. In this example we see how diploidy and dominance permit alternate solutions to be held in abeyance--shielded against over selection. We also see how dominance is no absolute state of affairs. Biologists have hypothesized and proven that dominance itself evolves. In other words, the dominance or lack of dominance of a particular allele is itself under genic control. Fisher (1958) theorized that dominance at a particular position (locus) along a chromosome is actually determined by another modifier gene at another locus. This implies that dominance is an evolving feature of the organism, subject to the same search procedures as any other feature. If a particular allele is favored by selection, it will spread more rapidly if it is dominant. A modifier gene therefore enhances the spread of the gene being modified. This in turn enhances the spread of the modifier.
If the two genes are closely linked, this positive feedback quickly propagates both the favored allele and the modifier in the population. But what sets the dominance of the modifier gene? In order to avoid infinite regress, we recognize that a gene can have more than one effect on the phenotype. In fact, an allele can have several major effects on the phenotype while it affects the dominance at one or more other loci. The presence of such multiple effects is known as pleiotropy. We will use a simple form of pleiotropic modifiers (one with the modifier always attached to the gene it modifies) in experiments with a diploid GA representation suggested by Hollstien (1971) and Holland (1975). In the next section, we examine the diploidy-dominance schemes used in artificial genetic search to see how they incorporate diploid structure, dominance, and the evolution of dominance.

DIPLOIDY AND DOMINANCE IN GENETIC ALGORITHMS: AN HISTORICAL PERSPECTIVE

Some of the earliest examples of practical genetic algorithm application contained diploid genotypes and dominance mechanisms. In Bagley's early (1967) dissertation, a diploid chromosome pair mapped to a particular phenotype using a variable dominance map coded as part of the chromosome itself (Bagley, 1967, p. 136):

Each active locus contains, besides the information which identifies the parameter to which it is associated and the particular parameter value, a dominance value. At each locus the algorithm simply selects the allele having the highest dominance value. Unlike the biological case where partial dominance may be permissible (resulting, for example, in speckled eyes), our interpretation demands that only one of the alleles of the homologous loci be chosen. The decision process in the case of ties (equal dominance values) involves position effects and is somewhat complicated so that it will be necessary to outline the process in some detail.
The introduction of a dominance value for each gene allowed this scheme to adapt with succeeding generations. Unfortunately, Bagley found that the dominance values tended to fixate quite early in simulations, thereby leaving dominance determination in the hands of his somewhat complicated and arbitrary tie-breaking scheme. To make matters worse, Bagley prohibited his mutation operator from operating on his dominance values, thereby further aggravating this premature convergence of dominance values. Additionally, Bagley did not compare haploid and diploid schemes, and in all of his cases the environment was held stationary. In the end, the convergence of dominance values at all positions led to an arbitrary random choice dominance mechanism and inconclusive results. Rosenberg's (1967) biologically-oriented study contained a diploid chromosome model; however, since biochemical interactions were modeled in detail, dominance was not considered as a separate effect. Instead, any dominance effect in this study was the result of the presence or absence of a particular enzyme. The presence or absence of an enzyme could inhibit or facilitate a biochemical reaction, thus controlling some phenotypic outcome. Hollstien's study (1971) included diploidy and an evolving dominance mechanism. In fact, Hollstien described two simple, evolving dominance mechanisms and then put the simplest to use in his study of function optimization. In the first scheme, each binary gene was described by two genes, a modifier gene and a functional gene. The functional gene took on the normal 0 or 1 values and was decoded to some parameter in the normal manner. The modifier gene took on values of M or m. In this scheme, 1 alleles were dominant when there was at least one M allele present at one of the homologous modifier loci. This resulted in a dominance expression map as displayed in Figure 1.
          0M   0m   1M   1m
    0M     0    0    1    0
    0m     0    0    1    0
    1M     1    1    1    1
    1m     0    0    1    1

Figure 1. Two-locus evolving dominance map from Hollstien (1971).

Hollstien recognized that this two-locus evolving dominance scheme could be replaced by a simpler one-locus scheme by introducing a third allele at each locus. In this triallelic scheme, Hollstien drew alleles from the 3-alphabet {0, 1, 2}. Here the 2 played the role of a dominant "1" and the 1 played the role of a recessive "1." The dominance expression map he used is displayed in Figure 2.

          0   1   2
    0     0   0   1
    1     0   1   1
    2     1   1   1

Figure 2. Single-locus, triallelic dominance map from Hollstien (1971) and Holland (1975).

The action of this mapping may be summarized by saying that both 2 and 1 map to "1", but 2 dominates 0 and 0 dominates 1. Holland (1975) later discussed and analyzed the steady state performance of the same triallelic scheme, although he introduced the clearer symbology {0, 1_0, 1} for Hollstien's {0, 1, 2}. The Hollstien-Holland triallelic scheme is the simplest practical scheme suggested for evolving dominance and diploidy in artificial genetic search. With this scheme, the more effective allele becomes dominant, thereby shielding the recessive. Minimum excess storage is required (half a bit extra per locus), and furthermore, dominance shift can easily be handled as a mutation-like operator, mapping a 2 to a 1 (a 1 to a 1_0 in Holland's notation) and vice versa. Despite the clarity of the scheme, Hollstien's results with this mechanism were mixed. Although his Breed Type III simulations maintained better population diversity (as measured by population variance) than did his haploid simulations, there was no significant overall improvement of either average or ultimate performance. This seems surprising until we recognize that his test bed only contained stationary functions. If the role of dominance-diploidy is shielding or abeyance, we should only expect significant performance differences between haploid and diploid genetic algorithms when the environment changes with time.
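The triallelic expression rule just summarized ("both 2 and 1 map to '1', but 2 dominates 0 and 0 dominates 1") is small enough to code directly. This sketch uses our own function names for illustration.

```python
# Sketch of the Hollstien-Holland triallelic dominance scheme over
# alleles {0, 1, 2}: 2 is a dominant "1", 1 is a recessive "1",
# and 0 dominates the recessive 1.
def express_locus(a, b):
    if 2 in (a, b):          # dominant 1 is expressed whenever present
        return 1
    if 0 in (a, b):          # 0 dominates the recessive 1
        return 0
    return 1                 # homozygous recessive 1s: expressed as "1"

def express(chrom_a, chrom_b):
    """Map a homologous chromosome pair to its expressed phenotype."""
    return [express_locus(a, b) for a, b in zip(chrom_a, chrom_b)]

print(express([2, 1, 0], [0, 0, 0]))     # → [1, 0, 0]
```

A dominance-shift mutation in this scheme simply swaps a 2 for a 1 (or vice versa) at a locus, changing which value is expressed without losing the stored allele.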
Brindle (1981) performed experiments with a number of dominance schemes in a function optimization setting. Unfortunately, there has been some question as to the validity of this study's test functions and codings (K. A. De Jong & L. B. Booker, personal communication, 1986). Furthermore, the study ignored previous work in artificial dominance and diploidy, and a number of the schemes developed were without basis in theory and without biological precedent. Specifically, Brindle considered a total of six schemes:

1. random, fixed, global dominance
2. variable, global dominance
3. deterministic, variable, global dominance
4. choose a random chromosome
5. dominance of the better chromosome
6. haploid-controls-diploid individual dominance

The second and third schemes make local dominance decisions based on global population knowledge. The use of global information is questionable since the primary beauty of both natural and artificial genetic search is their global performance through local action. Once global operators are inserted, this attractive feature is destroyed. This is no small matter if we are ultimately concerned with efficient implementation of these methods on parallel computer architectures. Of the remaining schemes, only the sixth scheme suggested by Brindle uses an adaptive dominance map like those in Hollstien's (1971) and Bagley's (1967) earlier work; however, this scheme completely separates the dominance map (the modifying genes) from the normal chromosome (the functional genes) as an added haploid chromosome. Of course, this separation effectively destroys linkage between the dominance map and the functional genes. In addition to the problems with the dominance schemes and test functions, Brindle's work, like studies before it, considered only stationary functions. This is a common thread running through all previous genetic algorithm studies of dominance and diploidy.
If dominance and diploidy do act to shield currently out-of-favor solutions from excessive selection, use of these operators in static environments is unlikely to show any performance gain when compared to a haploid GA. In the next section, we further buttress the case for dominance and diploidy as abeyance--long-term memory--mechanisms through an analysis of schema propagation under these operators.

THEORY OF DOMINANCE AND DIPLOIDY

Before analyzing the specific effects of dominance and diploidy, we briefly review the notion of schemata and the fundamental theorem of genetic algorithms. The cornerstone of all genetic algorithm theory is the realization that GAs process schemata (schema--singular, schemata--plural) or similarity templates. Suppose we have a finite length binary string, and suppose we wish to describe a particular similarity. For example, consider two five-bit strings A and B that both have 1's in the first and third positions. A natural shorthand to describe such similarities introduces a wild card or don't care symbol, the star *, in all positions where we are disinterested in the particular bit value. For example, the similarity in the first position can be described as follows: 1****. Likewise, the similarity in the third position may be described with the shorthand notation **1**, and the combined similarity may be described with *'s in all positions but the first and third: 1*1**. These schemata or similarity templates name not only the strings A and B. The schema 1**** describes a subset containing 2^4 = 16 strings, each with a one in the first position. The more specific schema 1*1** describes a subset of 2^3 = 8 strings, each with ones in both the first and third positions. We notice that not all schemata are created equal. Some are more specific than others. We call the specificity of a schema H (the number of fixed positions) its order, o(H). For example, o(1****) = 1 and o(1*1**) = 2.
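These counts are easy to verify mechanically. A small Python sketch (the helper names are ours) enumerates all 32 five-bit strings and counts the instances of each schema:

```python
from itertools import product

def matches(schema, string):
    """True when the bit string is an instance of the schema (* = don't care)."""
    return all(s == '*' or s == c for s, c in zip(schema, string))

def order(schema):
    """o(H): the number of fixed (non-*) positions of schema H."""
    return sum(c != '*' for c in schema)

strings = [''.join(bits) for bits in product('01', repeat=5)]
n_broad = sum(matches('1****', s) for s in strings)    # 2^4 = 16 instances
n_narrow = sum(matches('1*1**', s) for s in strings)   # 2^3 = 8 instances
```

Running this confirms the subset sizes quoted in the text and the orders o(1****) = 1 and o(1*1**) = 2.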
Some schemata have defining positions spaced farther apart than others. We call the distance between a schema's outermost defining positions its defining length, delta(H). For example, the defining length of any one-bit schema is zero: delta(1****) = delta(**1**) = 0. On the other hand, the defining length of our order-two schema example may be calculated by subtracting the position indices of the outermost defining positions: delta(1*1**) = 3 - 1 = 2. These properties are useful in the fundamental theorem of genetic algorithms, otherwise known as the schema theorem. Under fitness proportionate reproduction, simple crossover, and mutation, the expected number of copies m of a schema H is bounded by the following expression:

m(H,t+1) >= m(H,t) * [f(H)/f_bar] * [1 - p_c * delta(H)/(l-1) - o(H) * p_m]

In this expression, the factors p_m and p_c are the mutation and crossover probabilities respectively, l is the string length, f_bar is the population average fitness, and the factor f(H) is the schema average fitness, which may be calculated by the following expression:

f(H) = [sum of f(s) over all strings s currently representing H] / m(H,t)

The schema average fitness f(H) is simply the average of the fitness values of all strings s which currently represent the schema H. Overall the schema theorem says that above-average, short, low-order schemata are given exponentially increasing numbers of trials in successive generations. Holland (1975) has shown that this is a near-optimal strategy when the allocation process is viewed as a set of parallel, overlapping, multi-armed bandit problems. We will not review this matter in detail here. Instead, we need to look at how dominance and diploidy modify expected schema propagation. To see the effect of dominance and diploidy on schema propagation, we recognize that the schema theorem still applies; however, it is useful to separate the physical schema H from its expression H_e(H). Of course the expression of a schema is a function of the schema, its range of mates, and the dominance map in use.
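As an illustration, a small Python sketch (helper names are ours) computes delta(H) and evaluates the schema-theorem lower bound from the quantities just defined:

```python
def defining_length(schema):
    """delta(H): distance between the outermost fixed (non-*) positions."""
    fixed = [i for i, c in enumerate(schema) if c != '*']
    return fixed[-1] - fixed[0] if fixed else 0

def schema_growth_bound(m, f_H, f_bar, delta, o, l, p_c, p_m):
    """Schema-theorem lower bound on m(H, t+1); l is the string length,
    p_c and p_m the crossover and mutation probabilities."""
    return m * (f_H / f_bar) * (1.0 - p_c * delta / (l - 1) - o * p_m)

# delta of any one-bit schema is zero; delta('1*1**') = 3 - 1 = 2.
# Example bound: 100 copies of an above-average (f_H/f_bar = 1.2) schema
# with delta = 2, o = 2 on length-5 strings, p_c = 0.6, p_m = 0.01.
bound = schema_growth_bound(100, 1.2, 1.0, 2, 2, 5, 0.6, 0.01)
```

Short, low-order schemata keep the bracketed disruption term close to 1, which is why they receive the exponentially increasing trials the theorem promises.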
Recognizing all this, we may rewrite the schema theorem in somewhat clearer form:

m(H,t+1) >= m(H,t) * [f(H_e(H))/f_bar] * [1 - p_c * delta(H)/(l-1) - o(H) * p_m]

Everything remains the same, except the average fitness of the schema H, f(H), is replaced by the average fitness of the expressed schema H_e(H), f(H_e(H)). In the case of a fully dominant schema H, the average fitness of the physical schema always equals the expected average fitness of the expressed schema H_e(H):

f(H) = f(H_e(H))

In the case of a dominated schema H, the hope is that the average fitness of the expressed schema is greater than or equal to the average fitness of the physical schema:

f(H_e(H)) >= f(H)

This situation is most likely to occur when the dominance map itself is permitted to evolve. If the average fitness of the allele as expressed is greater than its fitness when homozygous, then the currently deleterious, dominated schema will not be selected out of the population as rapidly as in the corresponding haploid situation. This is how dominance and diploidy shield currently out-of-favor schemata. To make this argument more quantitative, let's consider a simple case where only two alternative, competing schemata may be expressed, one dominant and the other recessive. Physically, we can think of this as representing either two alternate alleles at a particular locus, or two multi-locus schemata that have come to dominate a particular set of loci. In either case, we assume that the dominant alternative is expressed whether heterozygous or homozygous, and we assume that the recessive alternative is expressed only when homozygous. Rearrangement of the schema growth equation permits us to calculate the proportion of recessive alleles, p^t, in successive generations, t.
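For the two-alternative case just described, the shielding inequality can be made concrete with a one-line sketch (our naming; random mating is assumed, as in the text). A recessive alternative with expressed fitness f_r is only expressed when paired with another recessive, which happens with probability equal to its proportion p; otherwise the dominant alternative (fitness f_d) is expressed:

```python
def expressed_fitness_recessive(p, f_r, f_d):
    """Average fitness credited to a recessive alternative under random mating:
    expressed (fitness f_r) only when homozygous, which occurs with
    probability p; otherwise the dominant alternative's fitness f_d applies."""
    return p * f_r + (1.0 - p) * f_d
```

With f_d > f_r, this expressed average exceeds f_r for any p < 1, so the currently poor recessive is culled more slowly than its homozygous fitness alone would dictate; this is f(H_e(H)) >= f(H) in miniature.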
If we assume that there are only two alternatives, that the dominant form has a constant expected fitness value of f_d, and that the recessive form when expressed has a constant expected fitness value of f_r, the proportion of recessives expected in the next generation may be calculated as follows:

p^(t+1) = K * p^t * [p^t + r(1 - p^t)] / [(p^t)^2 + r(1 - (p^t)^2)]

where r = f_d/f_r and K is a crossover-mutation loss constant. A similar equation may be derived for the haploid case, where the deleterious alternative (we will still call this recessive even though it no longer recesses) is always expressed when present in a haploid structure:

p^(t+1) = K * p^t / [p^t + r(1 - p^t)]

Proportion ratio (p^(t+1)/p^t) and proportion-versus-time graphs are plotted for the haploid and diploid cases in Figures 3 and 4. From Figure 3 we see that for a comparable proportion of alleles, the haploid case always destroys more recessives (always has a smaller proportion ratio) than the corresponding diploid case. Of course, this does not imply that the diploid case has a low on-line performance measure. In fact, the sampling rate remains low (proportional to (p^t)^2) for the poor (recessive) alleles in the diploid case.

Figure 3. Expected ratio of recessive proportions p^(t+1)/p^t versus p^t for haploid (r = 2), diploid (r = 2), and limiting diploid (r -> infinity).

Figure 4. Proportion of recessive alleles p^t versus generation t for haploid (r = 2), diploid (r = 2), and limiting diploid (r -> infinity).

The previous analysis clearly demonstrates the long-term memory induced by diploidy and dominance. Because of this effect, we also expect that mutation should play even less of a role in the operation of a diploid genetic algorithm.
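These recurrences are easy to iterate numerically. A minimal sketch (assuming K = 1, i.e., ignoring crossover-mutation losses; the function names are ours) reproduces the qualitative behavior plotted in Figures 3 and 4:

```python
def haploid_next(p, r, K=1.0):
    """One generation of the haploid recurrence: the deleterious allele
    (proportion p) is always expressed. r = f_d / f_r, K = loss constant."""
    return K * p / (p + r * (1.0 - p))

def diploid_next(p, r, K=1.0):
    """One generation of the diploid recurrence: the recessive is expressed
    (and selected against) only when homozygous, with probability p**2."""
    return K * p * (p + r * (1.0 - p)) / (p * p + r * (1.0 - p * p))

p_h = p_d = 0.5
for _ in range(10):
    p_h = haploid_next(p_h, 2.0)
    p_d = diploid_next(p_d, 2.0)
# The haploid GA drives the recessive allele out much faster (cf. Figures 3-4).
```

After ten generations at r = 2, the haploid proportion has collapsed toward zero while the diploid proportion decays far more slowly, which is precisely the abeyance effect the analysis describes.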
Holland (1975) has presented an analysis of the steady state mutation requirements of diploid structures as compared to haploid structures. We reproduce his arguments to understand the ergodic performance of these mechanisms. For a haploid structure under selection and mutation, it may be shown that the proportion of recessive alleles in the next generation, p^(t+1), is related to the proportion in the current generation, p^t, by the following equation:

p^(t+1) = (1 - epsilon) * p^t + p_m * (1 - p^t) - p_m * p^t

Here we have the sum of three terms: the first due to selection, the second the source of alleles from mutation, and the third the loss of alleles from mutation. The epsilon factor is the proportion lost due to selection and other operator losses. At steady state, the proportion p^(t+1) = p^t = p_ss. Solving for p_m, we obtain the following equation:

p_m = epsilon * p_ss / (1 - 2 * p_ss)

This equation suggests that the final steady state proportion of alleles is directly proportional to the mutation rate (with large epsilon and small p_ss). For a diploid structure under selection and mutation, it may be shown that the proportion of recessive alleles in the next generation is related to the proportion in the current generation by the following equation:

p^(t+1) = (1 - epsilon * p^t) * p^t + p_m * (1 - 2 * p^t)

At steady state we obtain a relationship between the required mutation rate and the steady state proportion of recessive alleles:

p_m = epsilon * p_ss^2 / (1 - 2 * p_ss)

For small steady state proportions of recessive alleles, p_ss, the required mutation rate is thus proportional to p_ss^2 rather than p_ss, so a diploid structure maintains a given level of diversity with a much smaller mutation rate.

4.2.1 Specialization at relation level

The mechanics of specialization in relations is very similar to generalization. Rules can be specialized by filling variable positions with literal values that have the type of the corresponding relation arguments (a constructor being used to generate a legal literal value based upon its type in the class hierarchy), by constraining the variable to be of a subtype of the type at which it is presently constrained, or by replacing the relation with a subspecies of the relation.
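As a toy illustration of these three specialization moves (the relation names, class hierarchy, and helper functions below are invented for the sketch; the paper's system draws them from a full concept graph):

```python
# Invented toy hierarchy and relation taxonomy for illustration only.
SUBTYPES = {"vehicle": ["truck", "car"], "building": ["depot"]}
SUBRELATIONS = {"location-of": ["garage-of"]}

def specialize_by_literal(clause, instances):
    """Fill a variable position with a legal literal of the argument's type."""
    rel, arg = clause
    if arg.startswith("?") and arg[1:] in instances:
        return (rel, instances[arg[1:]][0])
    return clause

def specialize_by_subtype(clause):
    """Constrain the variable to a subtype of its current type."""
    rel, arg = clause
    if arg.startswith("?") and arg[1:] in SUBTYPES:
        return (rel, "?" + SUBTYPES[arg[1:]][0])
    return clause

def specialize_by_subrelation(clause):
    """Replace the relation with one of its subspecies."""
    rel, arg = clause
    return (SUBRELATIONS.get(rel, [rel])[0], arg)
```

For example, (location-of ?vehicle) can be specialized to (location-of ?truck), to (garage-of ?vehicle), or to (location-of truck-7) given a legal instance.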
4.2.2 Specialization at conditional level

Another method of specialization is the addition of clauses in the left-hand side of the rule. In the simple case the new clause does not introduce any new elements; it only further constrains the conditions of the rule's applicability. However, when new variables are introduced by an STM or associate clause, the clause will do little useful work unless at least one of the newly introduced variables is mentioned elsewhere in the rule.* Therefore, specialization by clause introduction needs to be tied to the existing clauses through an existing variable, or must be complemented by modifications to other clauses to effect such a tie.

4.3 Mutation

Mutation is very similar to specialization and generalization in that it is conceptually a local operator that transforms individual elements of a rule construct. In the bit-string representation this transformation consists of replacing a "1" with a "0" and vice versa; see Figure 8.

Figure 8. Mutation of a bit-string rule (old rule, new rule).

4.3.1 Mutation at relation level

In the production rule representation, mutation is constrained by the same datatyping considerations as specialization and generalization. It produces value changes based upon the type of the argument corresponding to the selected position, but, as opposed to those two operators, it seeks sibling concepts for the new variable or constant. Recall the instantiated relation (location-of bldg1) that was generalized by replacing the reference to a specific building by a reference to a superclass of the building. A mutation on that clause would be, for instance, (location-of ?vehicle), which, as was pointed out in the discussion of generalization above, is not a generalization of the instantiated relation.

4.3.2 Mutation at conditional level

Mutation of the relation expressed by a clause can be viewed as the removal of the clause from the conditional followed by the addition of a new clause.
Notice that in a fully articulated concept graph, the relation expressed by the new clause must be, at some level, a sibling or cousin relation to the original. One might use this to provide a metric and control regimen for the degree of mutation in a rule.

*This is by no means always the case. Sometimes variables are introduced to test for some general condition to decide if an action should be taken, resulting in a rule like IF (STM ?var1) THEN DO (action constant1).

4.4 Inversion

The inversion operator of the GA produces an end-for-end reversal of positions in the condition of a rule. It can be applied as a mutation operator during the running of the system, or as a crossover/mutation operator applied during reproduction. The example in figure 9 shows the basic notion behind inversion.

Figure 9. Inversion of the rule in Figure 5 (old rule, new rule).

4.4.1 Inversion at relation level

Because of the strict ordering of arguments within relations, inversion cannot in general be achieved unless the types of the relation's arguments are symmetrical. Under any other circumstances, inversion would result in illegally formed clauses.

4.4.2 Inversion at conditional level

The production system as described restricts the ordering of clauses according to the introduction of variables. Inversion of the clauses in the conditional would presumably destroy some of the dependencies of that variable introduction scheme, requiring renaming and rethreading of variables through the inverted clause set. Without an explicit relation between, say, the clauses in a rule and the actions to be taken in a plan, it's difficult to see the point of inversion. Given the variable set for a rule, it shouldn't make any semantic difference in what order the clauses are expressed.

4.5 Crossover

The crossover operator is the workhorse modification operator of the GA. It is also an operator which doesn't have a direct analog in the "traditional" machine learning literature.
The example in figure 10 shows two rules being crossed over after they were chosen as mates. A locus (denoted by the cut) determines how much of Rule1 and how much of Rule2 will be used. In this example, Rule1 provides the first 3 positions of the new rule, with Rule2 providing the remaining positions.

Figure 10. Crossover of two rules (Rule1, Rule2, new rule).

4.5.1 Crossover at relation level

It is not immediately obvious that crossover makes sense for relations. In fact, the argument typing requirements are often cited as the cause of crossover's inapplicability in this domain. However, there is a way in which crossover might be applied (and the locus of crossover chosen) even within strongly datatyped relations when two rules are being crossed. This requires two modifications of the system we have so far described: (1) the system should be described by some context-sensitive rule grammar that formally delineates what is legal at all the levels described in Section 3.5; (2) all rules should be described by a datastructure that indicates the derivation tree through the grammar followed in the construction of the rule. A rule can be cut at any point, but each part of the rule must carry a copy of its (partial) derivation. Crossover can then be performed by any two rule fragments whose derivations complete each other. That is, the cut results not only in the splitting of a rule into two fragments, but in the tagging of each cut end of the fragment: the left-hand side with the partial derivation tree down to the cut, the right-hand side with the subtree starting at the cut. The unification of the derivations produces the new rule. This process is obviously more complex than the other operators considered thus far, and we are presently trying to specify a grammar for our system that will form the basis of the derivation path datastructure.

4.5.2 Crossover at conditional level

Figure 11 illustrates a naive view of crossover when the cut occurs between clauses in the conditional.
Rule1 and Rule2 are two hypothetical rule mates. Their combination via crossover produces a rule that mentions a variable in its action that is not well-founded: ?varC is supposed to be bound to (function ?varB), but ?varC is unbound in New Rule. Moreover, ?varA is also unfounded, and neither ?varA nor ?varB is doing any work that is obviously related to the action to be taken.

Figure 11. Crossover operator problems (Rule1, Rule2, New Rule).

Our first implementation of crossover attempted to achieve crossover for conditions and to gain initial experience to test some intuitions. It promoted modification only among rules with the same goal or class of goals. A rule was naively produced. It then went through one round of generalization, in which variables of the same type with different names were unified, and one round of specialization, in which variables of predicates and associative relations were given literal values if they were not bound by the STM relations. Predicates without any variables (only literals), or associative relations whose variable is not used by predicates or assertions, were expunged from the rule condition because of their useless computational expense. Our present approach (until we work out the grammar-based one) is to view crossover as a repeated generalization by clause removal on the rule that is to provide the action side of the new rule (e.g., Rule2 in Figure 11), followed by a repeated specialization by clause addition on the resulting generalized rule. The clauses added in the specialization process are those before the cut in the rule that is to provide the first part of the new rule (e.g., Rule1 in the figure).
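Setting aside the combination operator for a moment, this generalize-then-specialize view of conditional crossover can be sketched as follows (the clause tuples and all names are invented for illustration; clause lists stand in for the paper's rule conditionals):

```python
def crossover_conditional(rule1, rule2, cut1, cut2):
    """Crossover at the conditional level, viewed as generalization by
    clause removal on rule2 (which supplies the action side), followed by
    specialization that adds the clauses before rule1's cut."""
    clauses1, _action1 = rule1
    clauses2, action2 = rule2
    generalized = clauses2[cut2:]                # drop clauses before rule2's cut
    specialized = clauses1[:cut1] + generalized  # add rule1's leading clauses
    return (specialized, action2)

# Toy rules loosely patterned after Figure 11.
rule1 = ([("Rel1", "?varA", "const1"), ("Rel2", "?varA", "const2")],
         ("Action1", "?varA"))
rule2 = ([("Pred", "?varB"), ("Assoc", "?varC", ("function", "?varB"))],
         ("Action2", "?varC"))
new_rule = crossover_conditional(rule1, rule2, 1, 1)
```

The new rule keeps Rule2's action and mixes clauses from both conditionals; the well-foundedness repairs described in the text would then run over the result.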
The characterization is rough because the conditional is bound together by the combination operator. We mutate the combination operator of the generalized intermediate form of the rule (derived from Rule2) to the combination operator of the rule providing the specialization clauses (Rule1).

5. Issues and Future Work

At least two issues are raised by the representational adaptations we are attempting. Firstly, are we still discussing computation within a genetic metaphor if we accept this representation? Secondly, how are these operators to play in AI systems, what kind of control regimen should they operate under, and is that genetic? We discuss these issues very briefly here.

5.1 The Genetics Metaphor

An important issue raised by this work is whether the metaphor of genetic recombination is not completely lost when one goes to such high-level representations. The integrity requirements on the variable bindings make it difficult to view rules as suitable material for local genetic recombination operators. When the legality requirement is enforced the way we have described, the whole operation of adapting GA techniques seems driven by a priori representation characteristics. The result has been an adaptation that may seem ad hoc from the GA perspective. A much more satisfactory solution seems to be at hand if the entire production system can be described in a formal grammar. In that context genetic recombination re-emerges much more clearly as a metaphor for the process, although the operators are much more complex than usually described in the GA literature. Perhaps an analogy of such operators to the role of RNA is appropriate. A glimpse of this approach appears in the discussion of crossover within relations, above.
5.2 Control

Having constructed a set of operators to generate changes to a rule base, we are now confronted with the heart of the machine learning problem: how to control system modification operators. A (perhaps crucial) element of the GA is the probabilistic control of recombination of elements in a gene pool of some sort, leading to new generations of the overall system. In real world applications of knowledge-based systems, the effort expended on knowledge engineering, the expectation that system changes will be incremental, and the hope that they will be well-understood seem to argue against directly incorporating the GA approach to control in such systems. Yet the GA approach may represent an aspect of control that must be present in a system (in some form) for it to exhibit the capacity for discovery.

5.3 Future Work

We have described issues in the application of the GA to pattern-based production systems. We are currently implementing the operators described so we may explore those issues. The system in which this approach is being explored is implemented on a Symbolics 3670. It runs the bucket brigade as a knowledge evaluation mechanism, and has demonstrated auto-modification of its behavior using the bucket brigade alone. We are presently exploring variations on the competitive learning scheme exemplified by the bucket brigade, as well as rule generation strategies as exemplified by the genetic operators. Generalization, specialization, and a version of crossover are implemented and, at present, invoked on a per-rule basis depending on a rule's performance. Next year's work is aimed at quantitatively assessing these approaches through controlled machine learning experiments. We expect to pay particular attention to the GA control strategy as a process for generating effective new rules in competition with other such processes.
We will attempt to implement operators in a meta-level rule representation to begin to provide explanation of rule derivations and justification of system modifications.

References

Simon, H. A., "Why Should Machines Learn?", in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Eds., Tioga Publishing Co., Palo Alto, CA, 1983.

Holland, J. H., Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975.

De Jong, K. A., "Genetic algorithms: a 10 year perspective", Proc. Int. GA Conf., pp. 169-177, 1985.

Goldberg, D. E., "Computer-aided gas pipeline operation using genetic algorithms and rule learning", Ph.D. Thesis, University of Michigan, Ann Arbor, MI, 1983.

De Jong, K. A., and Smith, T., "Genetic Algorithms Applied to Information Driven Models of US Migration Patterns", IEEE Trans. on Systems, Man, and Cybernetics, 10(9), Sept. 1980.

Grefenstette, J. J., Gopal, R., Rosmaita, B. J., and Van Gucht, D., "Genetic algorithms for the traveling salesman problem", Proc. Int. GA Conf., pp. 160-168, 1985.

Antonisse, H. J., and Keller, K. S., "Evaluation of Imprecisely Specified Knowledge", Proc. Digital Avionics Systems Conf., Fort Worth, TX, 1986.

Tallis, H. C., and Antonisse, H. J., "Dynamic Evaluation of Intelligence Fusion Rules Using the Bucket Brigade", submitted for publication.

Cramer, N. L., "A representation for the adaptive generation of simple sequential programs", Proc. Int. GA Conf., pp. 183-185, 1985.

Smith, D., "Bin packing with adaptive search", Proc. Int. GA Conf., pp. 202-206, 1985.

Forrest, S., "Implementing semantic network structures using the classifier system", Proc. Int. GA Conf., pp. 24-44, 1985.

TREE STRUCTURED RULES IN GENETIC ALGORITHMS

Arthur B. Bickel
Riva Wenig Bickel
Department of Computer and Information Systems
Florida Atlantic University
Boca Raton, FL 33432

ABSTRACT

We apply genetic algorithm techniques to the creation and use of lists of tree-structured production rules varying in length and complexity. Actions, conditions, and operators are randomly chosen from tables of possibilities. Use of the techniques is facilitated by the notions of time delay and dependency in performing tests or observations and by accounting for indeterminate test results. GENES, a general program, can adapt to a variety of systems by changing the contents of the various tables to accommodate the domain of a particular problem and by substituting a library of appropriate cases.

INTRODUCTION

The genetic algorithm is an iterative adaptive search technique which has been applied successfully to learning systems having large search spaces and to problems which cannot easily be reduced to closed form. Holland, who originated the notion of computer constructs which mimic natural adaptive mechanisms, noted that in order to apply this technique, the subject to be treated must be capable of being represented by some data structure and the solutions must be capable of being evaluated and ranked (1, 2).

APPLICATION TO EXPERT SYSTEMS

Conventionally, in genetic algorithm implementations, the rules containing the expertise have been expressed in the form of bit strings of fixed length. This approach works quite well in terms of the ease of applying the genetic operators: mutations may turn a 0 to a 1 or to some other symbol taken from a similar simple alphabet; crossovers are implemented by randomly choosing two crossover points along the string and exchanging the information between those two points. However, there are many problems which cannot be easily expressed in terms of a simple alphabet and still be amenable to the genetic operators (see, e.g., work on the traveling salesman problem (3, 4)).
The commonly used dual-valued alphabet is especially troublesome for two reasons. First, since the number of possible actual interpretations is unlikely to be a power of two, some interpretations will be redundant, weighting the possible choices unevenly. Secondly, some mutations would be extremely unlikely, e.g., 000 would probably not mutate to 111. These problems are both eliminated by having each decisional element be an index into an appropriate table. Choices for each position, and changes due to point mutations, are then made merely by randomly choosing one of the possible indices into that table. Schaffer and Grefenstette (5), building upon the LS-1 learning system originated by Smith (6), used a data structure where each expert was not an individual rule but, rather, an entire set of rules, in order to deal with multi-objective learning. These rule sets were of fixed length, although all the rules might not necessarily be used. But there may be insufficient flexibility in solving difficult classes of problems where the expert is composed of linear data structures of fixed length, since practicable computer expert systems are often presented as a system of non-linear rules (7). These rules tend to be of the form:

IF condition [AND/OR condition] THEN action

where the action may be either an attempt to obtain more information on the system (an action rule), or a terminal decision in the nature of deciding the answer to the presented problem (a decision rule).

CREATING AN EXPERT GENERATOR

We created GENES, a program to develop expert systems. Each expert in our system consisted of a linked list of rules, each rule parsed syntactically as a tree structure. The actual number of rules in each linked list was determined randomly via an average length parameter passed to the program upon its initiation. The number of rules allowed was optionally bounded. A different program parameter determined the maximum number of nodes for each rule.
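A rough Python sketch of this table-driven generation follows (the table contents, parameter names, and default values are invented for illustration; GENES's actual tables are domain-specific):

```python
import random

# Toy tables of possibilities; every decisional element is drawn as an
# index into tables like these, so the binary-alphabet problems never arise.
TESTS = ["C1", "C2", "C4", "C5"]
RELOPS = ["=", ">", ">=", "<>"]
BOOLS = ["AND", "OR"]
ACTIONS = ["A1", "A5"]

def random_condition(rng):
    """One condition node: (test, relational operator, scalar threshold)."""
    return (rng.choice(TESTS), rng.choice(RELOPS), rng.randint(0, 20))

def random_rule(rng, max_nodes=4):
    """One action node plus zero or more condition nodes joined by booleans."""
    n_conds = rng.randrange(max_nodes)           # 0 conditions = bare action
    tree = random_condition(rng) if n_conds else None
    for _ in range(n_conds - 1):
        tree = (rng.choice(BOOLS), tree, random_condition(rng))
    return {"cond": tree, "action": rng.choice(ACTIONS)}

def random_expert(rng, avg_len=5):
    """An expert: a linked list (here, a Python list) of tree-structured rules."""
    return [random_rule(rng) for _ in range(max(1, rng.randint(1, 2 * avg_len)))]

expert = random_expert(random.Random(0))
```

Because every choice is an index into a table, point mutation reduces to re-rolling one index, exactly as the text prescribes.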
The minimal possible value was one, this single node being an action node, which would mean that some prescribed action (randomly chosen from a table of possible actions) would always be carried out. One type of action was the undertaking of a specific additional test or observation; the other possible action was a final determination as to the state of the system, e.g., if the expert was a prospector, then this might be that a given mineral exists at a certain location in recoverable concentrations. There was only one action node per rule. Additional nodes each represented a possible condition precedent to carrying out the action, linked by boolean operators (again, randomly chosen). Condition precedent nodes are of the form: IF (test-result-yields RELATIONAL-OPERATOR scalar-quantity). As an example, if the expert created were a chemist, a possible node might be "if melting point > 160.51 degrees". In creating (but not in evaluating) the expert, booleans, relational operators, possible tests, and possible ranges of results were all randomly chosen from a permissible domain of actions, conditions, and values appropriate to the area of expertise desired for the system. Figure 1 shows a possible rule set for an unspecified expert.

{IF (NOT (((C1 = 2) AND (C4 > 3)) OR (C2 >= 12))) THEN A1} -> {A5} -> {IF ((C5 = TRUE) OR NOT (C1 <> 17)) THEN A1}

Figure 1.
Statements within {} indicate a single rule.
Statements beginning with C are conditions.
Statements beginning with A are actions.
-> indicates the linking of rules, which are evaluated in order.

APPLYING THE GENETIC OPERATORS TO THE EXPERT POPULATION

Three types of genetic operators were applied to the population of experts: mutation, crossover, and inversion. The tree structure called for some modifications to the way these operators normally deal with fixed-length strings of bits. Mutation was performed as an operation upon a single rule.
A number of varieties of mutation were possible, and the type of mutation occurring in each given case was chosen randomly from the list of possibilities, as was the rule to be mutated and the node of the rule at which the mutation would occur. The simplest type of mutation was a point mutation. This involved the change of one operator to another (e.g., > becoming >=). It was also possible for an operand of either a condition or an action type to be exchanged with another (e.g., in a chemical setting, boiling point becoming melting point, or 450 degrees becoming 200). A third possibility is exemplified by the exchange of an OR for an AND, or the addition or deletion of a NOT in the boolean expression. The last mutation possible was the addition or deletion of a rule or a portion of a rule, the latter being equivalent to a randomly selected branch of the parse tree.

The next simplest genetic operation was an inversion. Each inversion was carried out on a single set of rules. It involved randomly choosing, via the MOD operator, two points along a rule set of given length, and then exchanging the order of the rules that existed between the two points. This operation did not affect the rules themselves but, rather, the order in which they were evaluated. The possible effects of inversion were twofold. First, the firing of an earlier rule (that is, its conditions evaluating positively so that an action is taken) could cause later rules to evaluate differently. With a new ordering, given preconditions now may or may not exist. Second, due to rule evaluation halting once a decision rule fired, new rules might come under consideration or old ones might not now be reached.

The most complex genetic operation used in this system was crossover. Crossover occurred at points between rules, and involved exchanges of genetic material (in terms of numbers of rules) between experts.
Because the rule sets of each of the pair of experts chosen to undergo crossover were generally unequal in length, one point for crossover was chosen MOD the length of the shorter expert and one MOD the length of the longer expert. If both values turned out to be less than the length of the shorter rule set, then a double crossover occurred: a central list of rules was exchanged between the two, and both experts retained their original size. If only one point was less than the length of the shorter rule set, then only a single crossover occurred, with the tail end of the shorter expert now grafted onto what was once the longer one, and vice versa. Both offspring of each crossover survived. Figures 2a and 2b demonstrate the two types of crossover.

R1A--R1B--R1C-|-R1D--R1E--R1F-|-R1G--R1H--R1I--R1J--R1K
R2A--R2B--R2C-|-R2D--R2E--R2F-|-R2G

crossover of rule sets 1 & 2 yields

R1A--R1B--R1C-|-R2D--R2E--R2F-|-R1G--R1H--R1I--R1J--R1K
R2A--R2B--R2C-|-R1D--R1E--R1F-|-R2G

DOUBLE CROSSOVER - CROSSOVER POINTS INDICATED BY |
Figure 2a

R1A--R1B--R1C--R1D-|-R1E--R1F--R1G--R1H--R1I--R1J--R1K
R2A--R2B--R2C--R2D-|-R2E--R2F--R2G

crossover of rule sets 1 & 2 yields

R1A--R1B--R1C--R1D-|-R2E--R2F--R2G
R2A--R2B--R2C--R2D-|-R1E--R1F--R1G--R1H--R1I--R1J--R1K

SINGLE CROSSOVER - CROSSOVER POINT INDICATED BY |
Figure 2b

THE ROLE OF THE CRITIC

Experts, naturally, relate to some field of expertise. The general set of problems which these experts were to solve involved a set of related problems, where a number of different tests could be undertaken in order to determine some information about the system. The goal of each expert was to generate the proper set of tests, in the proper order, so that the correct solution for each presented problem would be obtained at a minimal cost in terms of testing expenses. The role of the critic was to apply each expert's rule set, in order (and repeatedly if necessary), to each member of a library of actual cases, and thereby to obtain some figure of merit for each expert.
If the conditions precedent in the rule were met, then the rule fired and the action called for was taken. If the rule action called for an observation about the case, information might be unmasked, and the expert's state of knowledge increased (but only if the information was available in the actual library case). Of course, costs may have been incurred in making such observations. These costs could involve time, money, or injury to the subject being observed. Results might be inconclusive or even unavailable. Rule evaluation ceased when a decision rule fired. If the entire rule set was evaluated without a decision rule firing, then the set was reevaluated from its beginning (and at an additional cost), because the results of earlier fired rules were likely to have provided additional information that could result in different rules firing on subsequent passes. A maximum number of passes was allowed in order to ensure termination. A valuation of the expert's merit vis-a-vis that particular problem was made, and then the next problem in the set was presented to it, and so on. Obviously, the larger and more varied the set of problems, the more sophisticated and discriminating the expert.

Creation of a library of problems under direct program control can be a tedious and error-prone process. In order to facilitate both entry of and changes to the problem set, GENES allowed creation of the problem set via the normal Unix(TM) vi editor, and then fed the error-free result to the program.

In our trial system we were interested in evaluating our experts on a dual basis. One consideration was how often they reached the correct result. The other was how little cost was expended in reaching this result. An apportionment between these two factors is clearly a value judgment that must be made for each particular expert system.
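The critic's repeated-pass evaluation loop can be sketched in a few lines (all names, the rule encoding, and the cost figures here are illustrative assumptions, not GENES code):

```python
def evaluate(rules, case, max_passes=5):
    """Critic loop sketch.  Each rule is (condition, action), where
    condition is a predicate over the known observations and action
    is ("test", name) or ("decide", answer).  Rules are applied in
    order, repeatedly, until a decision rule fires or the pass
    limit is reached; tests unmask case data and accrue cost."""
    known, cost = {}, 0.0
    for _ in range(max_passes):
        cost += 1.0                         # per-pass overhead
        for condition, action in rules:
            if not condition(known):
                continue
            kind, arg = action
            if kind == "test":
                cost += case["costs"].get(arg, 0.0)
                if arg in case["results"]:  # result may be unavailable
                    known[arg] = case["results"][arg]
            else:                           # decision rule: halt
                return arg, cost
    return None, cost                       # pass limit reached

case = {"results": {"melting_point": 170.0},
        "costs": {"melting_point": 3.0}}
rules = [(lambda k: "melting_point" not in k, ("test", "melting_point")),
         (lambda k: k.get("melting_point", 0) > 160.51,
          ("decide", "compound_X"))]
print(evaluate(rules, case))   # -> ('compound_X', 4.0)
```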
Thus each expert accrued a certain cost in its decision making, that being the sum of dollar costs incurred in running tests, degradation costs in the case of environmental misfortunes (e.g., a damaged drill in the instance of oil exploration), and penalties for bad evaluations. The merit of each expert was inversely related to the costs it incurred.

TIME DEPENDENT EVALUATIONS

Because the system we were studying was one where time was a factor, we introduced the notion of time dependency to our evaluations of the goodness of each expert. Many of the tests which could be performed by the experts would not yield immediate results. While such answers were pending, other tests might or might not be undertaken, depending upon the rules (where actions might or might not be dependent upon previous test results). Time dependency required that the entire rule set for each expert be reevaluated repeatedly, because new tests might be ordered once earlier test results came in. In many existing systems such delay involves additional daily overhead costs, so these costs were factored into the testing costs for each expert where appropriate.

A related notion was persistence, persistence being a quality associated with tests whose results do not change meaningfully over short intervals. In order to prevent the repeated ordering of such tests, each was assigned a persistence value which was decremented with time. If a rule required that a test be done when that test showed a non-zero persistence value, the rule was ignored.

It should also be noted that there was a possibility that test results might be equivocal. To simulate this possibility, the individual problems in the sets presented were issued certain flags representing both the current and possible results of tests and observations that the expert might make.
Thus, for example, in a medical situation, the patient's sex might generally be immediately knowable, the results of a blood culture would be available in 24 hours, but meaningful results of a CAT scan might actually turn out to be unobtainable. Because there was no prohibition on the repetition of rules within an expert's repertory, however, the CAT scan could be repeated and might show results the second time.

HANDLING CONVERGENCE

Researchers in the genetic algorithm field have repeatedly noted the difficulty of fine tuning the parameters in order to obtain results along the desired lines. The algorithm shares this rather nasty fact with other types of computer programming: "what you ask for is what you get" (see, e.g., [8]). Due to the emphasis upon getting the correct result as compared to the cost of arriving at that result, the population of experts tended to converge rapidly to a rather homogeneous solution set, wherein a high percentage of the problem sets tested were solved properly, but at less than optimum cost. Rapid convergence is a common problem in these types of systems and has been solved in various ways [3]. The solution used in GENES was to consider the system to have converged if the sum of all the expert scores remained within a narrowly bounded interval over several consecutive iterations. If the system was judged to have converged, then a percentage of the population was replaced by newly generated experts, thereby providing a major influx of genetic material into the system.

PARAMETERS FOR GENETIC OPERATIONS

Rates of crossover, inversion, and mutation, as compared with just plain duplication, were tunable parameters, as was the population size. Typically, rates in the range recommended by Grefenstette [10] - a 30 percent crossover rate, a 5 percent inversion rate, a 40 percent generation gap, and a low mutation rate - were found to give good results and were used.
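The convergence test described above - population replaced with fresh experts once the summed scores flat-line - is simple enough to sketch directly (the window and tolerance values are illustrative; the paper does not give numbers):

```python
def converged(score_history, window=5, tolerance=1.0):
    """Judge the population converged when the sum of all expert
    scores stays within a narrowly bounded interval over several
    consecutive iterations.  score_history holds one total
    population score per iteration."""
    if len(score_history) < window:
        return False
    recent = score_history[-window:]
    return max(recent) - min(recent) <= tolerance

assert not converged([10, 50, 90, 95, 99])        # still improving
assert converged([99.0, 99.2, 99.5, 99.1, 99.4])  # flat-lined
```

On a positive test, GENES would replace some percentage of the population with newly generated experts rather than halting.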
Variations in these rates affected the time it took to reach a good solution, but did not affect the goodness of that solution. Other typical variable values were a population of 50 experts with a maximum of 50 rules each; this was with 13 possible actions that could be taken and 12 possible tests-to-be-done, where the possible variations for the bound of each test-to-be-done were extremely large, since the constants were chosen from a continuous rather than a discrete population (e.g., temperature > 192.48). This, along with the variable number of branches in each rule and the variable number of rules in each set, makes the size of the search space difficult to estimate. It is certainly not small.

Using these parameters, the choice of which experts were to undergo each procedure was made via the weighted roulette wheel. In all cases at least one copy of the expert with the best value was retained.

RESULTS

The GENES model, at least on the small scale we tested, converged in approximately 2000 iterations (population sizes of 36 and 50 were tried) to results which appear reasonably optimal, in that the correct decisions were made for all nine problems presented, and at a low cost. This required no excessive or duplicative testing. Due to the very large possible variation of rules, there did, however, seem to be some strange rules that remained in the best experts' rule sets.
These rules tended not to make a great deal of sense, but they did not actually affect the system, as their very weirdness assured that they either always or never fired. Thus the rule interpreted as "if the patient's temperature is less than 120.3 degrees, discover if abdominal pain is present" is practically equivalent to stating "always check for abdominal pain". Additionally, it should be noted that all rules following one that resulted in a decision being made were never evaluated, so they did not actually contribute to the rule set, although they did provide genetic material to succeeding generations and might be activated during inversion.

FURTHER DIRECTIONS

This system is currently being examined, with promising results, in two areas of expertise. The first area that we are investigating is the minimization of hospital costs during patient diagnostic evaluations in a prospective reimbursement environment. The simple environment we are studying is a rule set appropriate to a patient admitted with suspected gall bladder disease. The expert in this system should be able to arrive at a correct diagnosis using minimal hospital resources in terms of tests and time. It should be noted that this system does not involve interpretation of tests, but only suggests the order of testing as a function of previous results. The second system under study involves server activation and deactivation in queueing problems. Costs here include both activation and use of the server, with the goal being minimization of costs while providing adequate service to the queue under varying server loads. A back-end interpreter to translate the rule sets into terms understandable by semi-expert users is also under consideration.

ACKNOWLEDGMENT

The authors wish to acknowledge the assistance of Neal Coulter for his many helpful suggestions.

BIBLIOGRAPHY

1. John H. Holland, Outline for a Logical Theory of Adaptive Systems, in Essays on Cellular Automata, Arthur W. Burks, ed., University of Illinois Press (1970).

2. John H. Holland, Adaptation in Natural and Artificial Systems, Univ. of Michigan Press, Ann Arbor (1975).

3. David E. Goldberg and Robert Lingle, Jr., Alleles, Loci, and the Traveling Salesman Problem, Proceedings of an International Conference on Genetic Algorithms and Their Applications, John J. Grefenstette, ed. (1985), pp. 154-159.

4. John Grefenstette et al., Genetic Algorithms for the Traveling Salesman Problem, Proceedings of an International Conference on Genetic Algorithms and Their Applications, John J. Grefenstette, ed. (1985), pp. 160-165.

5. J. D. Schaffer and John Grefenstette, Multi-Objective Learning via Genetic Algorithms, International Joint Conference on Artificial Intelligence (9th, 1985), pp. 592-595.

6. S. F. Smith, Flexible Learning of Problem Solving Heuristics Through Adaptive Search, International Joint Conference on Artificial Intelligence (8th, 1983), pp. 422-425.

7. A. J. Tomas and A. Th. Schreiber, HERMAN: Computer Aided Medical Decision Making, in Artificial Intelligence in Medicine, I. De Lotto and M. Stefanelli, eds., North-Holland (1985), pp. 1-9.

8. K. A. De Jong, Genetic Algorithms: A 10 Year Perspective, Proceedings of an International Conference on Genetic Algorithms and Their Applications, John J. Grefenstette, ed. (1985), pp. 169-177.

9. J. E. Baker, Adaptive Selection Methods for Genetic Algorithms, Proceedings of an International Conference on Genetic Algorithms and Their Applications, John J. Grefenstette, ed. (1985), pp. 101-111.

10. John Grefenstette, Optimization of Control Parameters for Genetic Algorithms, IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-16, No. 1 (Jan/Feb 1986), pp. 122-128.
GENETIC ALGORITHMS AND CLASSIFIER SYSTEMS: FOUNDATIONS AND FUTURE DIRECTIONS

John H. Holland
The University of Michigan

Abstract

Theoretical questions about classifier systems, with rare exceptions, apply equally to other adaptive nonlinear networks (ANNs), such as the connectionist models of cognitive psychology, the immune system, economic systems, ecologies, and genetic systems. This paper discusses pervasive properties of ANNs and the kinds of mathematics relevant to questions about these properties. It discusses relevant functional extensions of the basic classifier system and extensions of the extant mathematical theory. An appendix briefly reviews some of the key theorems about classifier systems.

Classifier systems are examples of a broad class of systems sometimes called adaptive nonlinear networks (ANNs, hereafter). Broadly characterized, classifier systems, and ANNs in general, consist of a large number of units that (1) interact in a nonlinear, competitive fashion, and (2) are modified by various operators so that the system as a whole progressively adapts to its environment. Typically an ANN confronts an environment that exhibits perpetual novelty, and it can function (or continue to exist) only by making continued adaptations to that environment. Because ANN/environment interactions are complex, except in artificially constrained cases, an ANN usually operates far from equilibrium. ANNs form the core of areas of study as diverse as cognitive psychology, artificial intelligence, economics, immunogenesis, genetics, and ecology.

Foundations

Classifier systems are quite typical ANNs, so that questions about classifiers, suitably translated, are typically questions about ANNs, and vice versa. To carry out the translation it is necessary to identify, in other ANNs, the counterparts of the message-processing rules, called classifiers, that are the units of classifier systems.
For example: in genetics, the counterparts are chromosomes; in game theory and economics, they are (rule-defined) strategies; in immunogenesis, antigens; in connectionist versions of cognition, (formally defined) neurons; and so on. Under such translation it is relatively easy to identify a range of important theoretical questions that apply to classifier systems in particular and ANNs in general. [These questions, and some of the ensuing discussions, are presented assuming that the reader has some familiarity with Holland et al. [1986] or Holland [1975]. There is simply not enough room here to define the terms; a reader familiar with the literature concerning some other ANN should be able to make the relevant translation in most cases.]

(1) What parameters and operators favor the emergence of stable hierarchical covers such as default hierarchies, internal models, and the like (via an increased diversity of units and progressively more complicated interactions between them)?

(2) Are the familiar "ecological" interactions - parasitism, symbiosis, competitive exclusion, etc. - a common feature of all parallel, nonlinear competitive systems?

(3) Are multi-functional units (units that can serve in several contexts) the major stepping-stone employed by all ANNs in making adaptive advances?

(4) What environmental conditions favor recombination, imprinting, triggering, and other constrained or biased procedures for generating new trials (rules, chromosomes, organizational structures, etc.)?

(5) What environmental conditions favor tracking vs. averaging, exploration vs. exploitation, etc.?

(6) What combinations of operators yield implicit parallelism?

Traditional mathematics, with its reliance upon linearity, convergence, fixed points, and the like, seems to offer few tools for studying such questions.
Yet, without a relevant mathematical framework, there is less chance of understanding ANNs than there would be of understanding physical phenomena in the absence of guidance from theoretical physics. A mathematics that puts emphasis on combinatorics and competition between parallel processes is the key to understanding ANNs. What seems startling when one uses differential equations, where the emphasis is on continuity, is commonplace in a programming or recursive format, where the emphasis is upon combinatorics. (Consider, for example, the chaotic regimes that are so unexpected in the context of differential equations, but are an everyday occurrence, in the guise of biased random number generators, in the programming context.) Because classifier systems are formally defined and computer-oriented, with an emphasis on combination and competition, they offer a useful test-bed for both mathematical and simulation studies of ANNs. We already have some theorems that provide a deeper understanding of the behavior of classifier systems (see the Appendix), and simulations suggest a broader class of theorems that delineate the conditions under which internal models (q-morphisms) emerge in response to complex environments (Holland [1986b]).

By putting classifier systems in a broader context, we can bring to bear relevant pieces of mathematics from other studies. For instance, in mathematical economics there are pieces of mathematics that deal with (1) hierarchical organization, (2) retained earnings (fitness) as a measure of past performance, (3) competition based on retained earnings, (4) distribution of earnings on the basis of local interactions of consumers and suppliers, (5) taxation as a control on efficiency, and (6) division of effort between production and research (exploitation versus exploration).
Many of these fragments, mutatis mutandis, can be used to study the counterparts of these processes in other ANNs. As another example, in mathematical ecology there are pieces of mathematics dealing with (1) niche exploitation (models exploiting environmental opportunities), (2) phylogenetic hierarchies, polymorphism, and enforced diversity (competing subsystems), (3) functional convergence (similarities of subsystem organization enforced by environmental requirements on payoff attainment), (4) symbiosis, parasitism, and mimicry (couplings and interactions in a default hierarchy, such as an increased efficiency for extant generalists simply because related specialists exclude them from some regions in which they are inefficient), (5) food chains, predator-prey relations, and other energy transfers (apportionment of energy or payoff amongst component subsystems), (6) recombination of multifunctional co-adapted sets of genes (recombination of building blocks), (7) assortative mating (biased or triggered recombination), (8) phenotypic markers affecting interspecies and intraspecies interactions (coupling), (9) "founder" effects (generalists giving rise to specialists), and (10) other detailed commonalities such as tracking versus averaging over environmental changes (compensation for environmental variability), allelochemicals (cross-inhibition), linkage (association and encoding of features), and still others. Once again, though mathematical ecology is a young science, there is much in the mathematics that has been developed that is relevant to the study of other nonlinear systems far from equilibrium. The task of theory is to explain
the pervasiveness of these features by elucidating the general mechanisms that assure their emergence and evolution. Properly applied to classifier systems, or to ANNs in general, such a theory militates against ad hoc solutions, assuring robustness and adaptability for the resulting organization. One of the best ways to insure that the mechanisms investigated are general is "to look over your shoulder" frequently, to see if the mechanisms apply to all ANNs. This view is sharpened if we pay close attention to features shared by all ANNs:

(1) Hierarchical organization. All ANNs exhibit a hierarchical organization. In living systems, proteins combine to form organelles, which combine to form cell types, and so on, through organs, organisms, species, and ultimately ecologies. Economies involve individuals, departments, divisions, companies, economic sectors, and so on, until one reaches national, regional, and world economies. A similar story can be told for each of the areas cited. These structural similarities are more than superficial. A closer look shows that the hierarchies are constructed on a "building block" principle: subsystems at each level of the hierarchy are constructed by combination of small numbers of subsystems from the next lower level. Because even a small number of building blocks can be combined in a great variety of ways, there is a great space of subsystems to be tried, but the search is biased by the building blocks selected. At each level, there is a continued search for subsystems that will serve as suitable building blocks at the next level.

(2) Competition. A still closer look shows that in all cases the search for building blocks is carried out by competition in a population of candidates.
Moreover, there is a strong relation between the level in the hierarchy and the amount of time it takes for competitions to be resolved - ecologies work on a much longer time-scale than proteins, and world economies change much more slowly than the departments in a company. More carefully, if we associate random variables with subsystem ratings (say fitnesses), then the sampling rate decreases as the level of the subsystem increases. As we will see, this has profound effects upon the way in which the system moves through the space of possibilities.

(3) Game-like system/environment interaction. An ANN interacts with its environment in a game-like way: sequences of action ("moves") occasionally produce payoff - special inputs that provide the system with the wherewithal for continued existence and adaptation. Usually payoff can be treated as a simple quantity (energy in physics, fitness in genetics, money in economics, winnings in game theory, reward in psychology, error in control theory, etc.). It is typical that payoff is sparsely distributed in the environment and that the adaptive system must compete for it with other systems in the environment.

(4) Exploitation of regularities. The environment typically exhibits a range of regularities or niches that can be exploited by different action sequences or strategies. As a result the environment supports a variety of processes that interact in complex ways, much as in a multi-person game. Usually there is no super-process that can outcompete all others, so an ecology results (domains in physics, interacting species in ecological genetics, companies in economics, cell assemblies in neurophysiological psychology, etc.). The very complexity of these interactions assures that even large systems over long time spans can have explored only a minuscule range of possibilities.
Even for much-studied board games such as chess and go this is true; the not-so-simply-defined "games" of ecological genetics, economic competition, immunogenesis, CNS activity, etc., are orders of magnitude more complex. As a consequence, the systems are always far from any optimum or equilibrium situation.

(5) Exploration vs. exploitation. There is a tradeoff between exploration and exploitation. In order to explore a new niche a system must use new and untried action sequences that take it into new parts (state sets) of the environment. This can only occur at the cost of departing from action sequences that have well-established payoff rates. The ratio of exploration to exploitation, in relation to the opportunities (niches) offered by the environment, has much to do with the life history of a system.

(6) Tracking vs. averaging. There is also a tradeoff between "tracking" and "averaging". Some parts of the environment change so rapidly relative to a given subsystem's response rate that the subsystem can only react to the average effect; in other situations the subsystem can actually change fast enough to respond "move by move". Again, the relative proportion of these two possibilities in the niches the subsystem inhabits has much to do with the subsystem's life history.

(7) Nonlinearity. The value ("fitness") of a given combination of building blocks often cannot be predicted by summing up the values assigned to the component blocks. This nonlinearity (commonly called epistasis in genetics) leads to co-adapted sets of blocks (alleles) that serve to bias sampling and add additional layers to the hierarchy.

(8) Coupling. At all levels, the competitive interactions give rise to counterparts of the familiar interactions of population biology - symbiosis, parasitism, competitive exclusion, and the like.

(9) Generalists and specialists.
Subsystems can often be usefully divided into generalists (averaging over a wide variety of situations, with a consequent high sampling rate and high statistical confidence, at the cost of a relatively high error rate in individual situations) and specialists (reacting to a restricted class of situations with a lowered error rate, bought at the cost of a low sampling rate).

(10) Multifunctionality. Subsystems often exhibit multifunctionality in the sense that a given combination of building blocks can usefully exploit quite distinct niches (environmental regularities), typically with different efficiencies. Subsequent recombinations can produce specializations that emphasize one function, usually at the cost of the other. Extensive changes in behavior and efficiency, together with extensive adaptive radiation, can result from recombinations involving these multifunctional founders.

(11) Internal models. ANNs usually generate implicit internal models of their environments, models progressively revised and improved as the system accumulates experience. The systems learn.
Consider the progressive improvements of the immune system when faced with antigens, and the fact that one can infer much about the system's environment and history by looking at the antibody population. This ability to infer something of a system's environment and history from its changing internal organization is the diagnostic feature of an implicit internal model. The models encountered are usually prescriptive - they specify preferred responses to given environmental states - but, for more complex systems (the CNS, for example), they may also be more broadly predictive, specifying the results of alternative courses of action. The relevant mathematical concept of a model of process-like transformations is that of a homomorphism. Real systems almost never admit of models meeting the requirements for a homomorphism ("commutativity of the diagram"), but there are weakenings, the so-called q-morphisms (quasi-homomorphisms). The origin of a hierarchy can be looked upon as a sequence of progressively refined q-morphisms (specifically, q-morphisms of Markov processes) based upon observation.

Functional Extensions

The foregoing questions and commonalities, together with some of the problems already encountered in simulations, have already suggested extensions of the standard definitions (as in Holland [1980]) of classifier systems. One important change involves the way bids are used in determining the winners of competitions for activation. The standard way of doing this is to calculate a bid = [bid ratio]*[strength]. Under this arrangement, the local fixed points of classifiers are such that a generalist and a specialist active in the same situations will come to bid the same amount (because the strength of the generalist increases to the point of compensating for its smaller bid ratio; see the Appendix). This goes against the dictum that specialists should be favored in a competition with generalists.
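The fixed-point phenomenon just described can be made concrete with a toy calculation (the bid formula is the one given in the text; the numbers are illustrative, not simulation output):

```python
def bid(bid_ratio, strength):
    # standard competition bid: bid = [bid ratio] * [strength]
    return bid_ratio * strength

# At their local fixed points, a generalist's larger strength
# compensates for its smaller bid ratio, so it comes to bid the
# same amount as a specialist active in the same situations.
generalist_bid = bid(bid_ratio=0.25, strength=4.0)
specialist_bid = bid(bid_ratio=1.0, strength=1.0)
assert generalist_bid == specialist_bid == 1.0
```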
To compensate for this, an effective bid is calculated by reducing the bid in proportion to the generality of the classifier producing the bid. The effective bid is then used in determining the probability that the classifier generating it is one of the winners of the competition. If the classifier wins, it must pay the bid, not the effective bid, to its suppliers under the bucket brigade. Thus, the local fixed points are not changed, but specialists are favored in competition with generalists. This change goes a long way toward reducing instabilities in emergent default hierarchies. (We are still exploring the effects in simulations and, at the level of theory, the resulting modifications in global fixed points.)

A related change concerns the method of determining a classifier's probability of producing offspring, its fitness, under the genetic algorithm. The higher strength of a generalist at its local fixed point greatly favors it in the production of offspring, and simulations indicate that this overbiases the evolution of the system toward the offspring of generalists. The simplest way of compensating for this is to make the fitness proportional to [bid ratio]*[strength] rather than strength alone. In intuitive terms, this makes the fitness proportional to the classifier's potential for affecting the system (its bid can be thought of as a "phenotypic" effect), rather than its reserves (strength is a quantity determined by its "genotypic" fixed point). We have yet to carry out an organized set of simulations based on fitness so determined.

Simulations have also revealed two other effects worth systematic investigation. The first of these is the "focussing" effect of the size of the message list (see R. Riolo's paper in this Proceedings). In effect, a small message list forces the system to concentrate on a few factors in the current situation. Clearly there is the possibility of making the size of the message list depend upon the "urgency" of the situation. For example, during
"lookahead" the message list's size can be quite large to encourage an exploration of possibilities, while at "execution" time the size can be reduced to enforce a decision. Clearly, the system can use classifiers to control the size of the message list. This makes the size dependent upon the system's "reading" of the current situation, and the "reading" is subject to long-term adaptive change under the genetic algorithm.

A second simple effect is to revise the definition of the environment, or equivalently the definition of the system's speed, so that typical stimuli persist for several time-steps. (This corresponds to the fact that the CNS operates rapidly relative to typical changes in its environment - usually, milliseconds vs. tenths of a second.) The resulting "persistence" and "overlap" of input messages makes it much easier for the classifier system to develop causal models and associative links (see below). As yet, to my knowledge, no simulations have been built along these lines.

At a much more general (and speculative) level, the use of triggered genetic operators provides a major extension of genetic algorithms. Triggering amounts to invoking genetic operators, with selected arguments, when certain predefined conditions are satisfied.
As an example of a triggering condition consider the following: "Only general classifiers that produce weak bids are activated by the current input message." When this condition occurs it is a sign that the system has little specific information for dealing with the current environmental situation. Let this condition trigger a cross between the input message and the condition parts of some of the active general rules. The result will be plausible new rules with more specific conditions. This amounts to a bottom-up procedure for producing candidate rules that will automatically be tested for usefulness when similar situations recur.

As another example of a triggering condition consider: "Rule C has just made a large profit under the bucket brigade." Satisfaction of this condition signals a propitious time to couple the profitable classifier to its stage-setting precursor. An appropriate cross between the message part of a rule C1, active on the immediately preceding time-step - the precursor - and the condition part of the profit-making successor can produce a new pair of coupled rules. (The trigger is not activated if C1 is already coupled to C.) The coupled offspring pair models the state transition mediated by the original pair of (uncoupled) rules. Such coupled rules can serve as the building blocks for models of the environment. Because the couplings serve as "bridges" for the bucket brigade, these building blocks will be assigned credit in accord with the efficacy of the models constructed from them.

Interestingly enough, there seems to be a rather small number of robust triggering conditions (see Holland et al. [1986]), but each of them would appear to add substantially to the responsiveness of the classifier system.

Tags are particularly affected by triggering conditions that provide new couplings.
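The first triggered operator above - crossing the input message into the condition part of a weakly-bidding general rule - can be sketched over the usual {0, 1, #} condition alphabet (a sketch only: the 0.5 specialization probability and the per-position crossing scheme are our assumptions, not Holland's specification):

```python
import random

def specialize(condition, message, rng=random):
    """Triggered-operator sketch: cross an overly general condition
    (a string over {0,1,#}) with the current input message,
    replacing some #'s with the message's bits.  Since the general
    rule was activated by the message, the offspring condition
    still matches that message, but matches fewer others."""
    out = []
    for c, m in zip(condition, message):
        if c == "#" and rng.random() < 0.5:
            out.append(m)          # copy the bit from the message
        else:
            out.append(c)          # keep the original position
    return "".join(out)

new_cond = specialize("1###0#", "101100", random.Random(2))
assert len(new_cond) == 6
```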
Tags serve as the glue of larger systems, providing both associative and temporal (model-building) pointers. Under certain kinds of triggered coupling the message sent by the precursor in the coupled pair can have a “hash-coded” section (say a prefix or suffix). The purpose of this hash-coded tag is to prevent accidental eavesdropping by other classifiers -- a sufficient number of randomly generated bits in the tag will prevent accidental matches with other conditions (unless the tag region in the condition part of the potential eavesdropper consists mostly of #'s). If the coupled pair proves useful to the system then it will have further offspring under the genetic algorithm, and these offspring often will be coupled to other rules in the system. Typically, the tag will be passed on to the offspring, serving as a common element in all the couplings. The tag will only persist if the resulting cluster of rules proves to be a useful “subroutine”. In this case, the “subroutine” can be “called” by messages that incorporate the tag, because the conditions of the rules in the cluster are satisfied by such messages. In short, the tag that was initially determined at random now “names” the developing subroutine. It even has a meaning in terms of the actions it calls forth. Moreover, the tag is subject to the same kinds of recombination as other parts of the rules (it is, after all, a schema). As such it can serve as a building block for other tags.
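The anti-eavesdropping argument is easy to check numerically: a random condition bit agrees with a random tag bit only half the time, so a tag of k defined bits slips past an unrelated condition with probability about (1/2) raised to the number of defined positions. A small Monte Carlo sketch (names and parameters are illustrative, not from the paper):

```python
import random

def matches(condition, message):
    """Classifier-system matching: '#' is a wildcard, '0'/'1' must agree."""
    return all(c == '#' or c == m for c, m in zip(condition, message))

def accidental_match_rate(tag_bits, hash_density, trials=10_000, seed=1):
    """Estimate how often a random tag slips past an unrelated condition whose
    tag region contains a given fraction of #'s (its 'hash_density')."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        tag = ''.join(rng.choice('01') for _ in range(tag_bits))
        cond = ''.join('#' if rng.random() < hash_density else rng.choice('01')
                       for _ in range(tag_bits))
        hits += matches(cond, tag)
    return hits / trials
```

With an 8-bit tag and few #'s the accidental-match rate is already well below one percent, while a condition whose tag region is all #'s eavesdrops on everything, as the text warns.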
It is as if the system were inventing symbols for its internal use. Clearly, any simulation that provides for a test of these ideas will be an order of magnitude more sophisticated than anything we have tried to date. Runs involving hundreds of thousands of time-steps and thousands of classifiers will probably be required to test these ideas. Support is another technique that adds considerably to the system's flexibility. Basically, support is a technique that enables the classifier system to integrate many pieces of partial information (such as several views of a partially obscured object) to arrive at strong conclusions. Support is a quantity that travels with messages, rather than being a counterflow as in the case of bids. When a classifier is satisfied by several messages from the message list, each such message adds its support into that classifier's support counter. Unlike a classifier's strength, the support accrued by a classifier lasts for only the time-step in which it is accumulated. That is, the support counter is reset at the end of each time-step (other techniques are possible, such as a long or short half-life). Support is used to modify the size of the classifier's bid on that time-step; large support increases the bid, small support decreases it. If the classifier wins the bidding competition, the message it posts carries a support proportional to the size of its bid. The propagation of support over sets of coupled classifiers acts somewhat like spreading activation (see Anderson [1983]), but it is much more directed. It can bring associations (coupled rules) into play while serving its primary mission of integrating partial information (messages from several weakly-bidding, general rules that satisfy the same classifier). In addition to these broadly conceived extensions, there are more special extensions that may have global consequences, particularly in respect to increased responsiveness and robustness.
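The support bookkeeping described above can be sketched in a few lines. This is a minimal illustration under assumed names (`one_time_step`, the `support_weight` scaling, the dictionary layout of a classifier); the per-step reset falls out of recomputing support from scratch each call:

```python
def matches(condition, message):
    """'#' is a wildcard; '0'/'1' must agree positionwise."""
    return all(c == '#' or c == m for c, m in zip(condition, message))

def one_time_step(classifiers, messages, support_weight=0.1):
    """One bidding step with support: each message carries a support value;
    every satisfying message adds its support into the classifier's (per-step)
    support counter, which then scales the bid. Returns the winning classifier
    and its bid, or None if no classifier is satisfied."""
    best = None
    for cls in classifiers:
        carried = [s for msg, s in messages if matches(cls['condition'], msg)]
        if not carried:                    # condition not satisfied this step
            continue
        bid = cls['strength'] * (1.0 + support_weight * sum(carried))
        if best is None or bid > best[1]:
            best = (cls, bid)
    return best
```

Note the directedness the text emphasizes: support only reaches classifiers whose conditions the supporting messages actually satisfy, unlike undirected spreading activation.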
One of these concerns a simple redefinition of classifiers. The standard definition of a 2-condition classifier requires that each condition be satisfied by some message on the message list; in effect an AND, requiring a message of type X and a message of type Y. It is a simple thing to replace the implicit AND with other string operators, e.g. a bit-by-bit AND or a binary sum of the satisfying messages, which is then passed through as the outgoing message. This extension has been implemented, but has not been systematically tested. Other simple extensions impact the functioning of the genetic algorithm. It is easy to introduce, in the string defining a classifier, punctuation marks that bias the probability of crossover (say crossover is twice as likely to take place adjacent to a punctuation mark). These punctuation marks are not interpreted in executing the classifier, but they bias the form of its offspring under the genetic algorithm. Punctuation marks can be treated as alleles under the genetic algorithm, subject to mutation, crossover, etc., just as the other (function-defining) alleles. This ensures that the placement of punctuation marks is adaptively determined.
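Punctuation-biased crossover amounts to a weighted choice of cut point. A minimal sketch, assuming the stated factor of two and a boolean punctuation vector (the function names are hypothetical):

```python
import random

def biased_crossover_point(punctuation, bias=2.0, rng=None):
    """Choose a crossover point (a gap between adjacent loci) where gaps
    adjacent to a punctuation mark are `bias` times as likely to be chosen.
    `punctuation[i]` is True if locus i carries a punctuation allele."""
    rng = rng or random.Random()
    n = len(punctuation)
    # gap i lies between locus i and locus i + 1
    weights = [bias if (punctuation[i] or punctuation[i + 1]) else 1.0
               for i in range(n - 1)]
    return rng.choices(range(n - 1), weights=weights, k=1)[0]

def crossover(a, b, punctuation, rng=None):
    """One-point crossover of two equal-length strings with a biased cut point."""
    cut = biased_crossover_point(punctuation, rng=rng) + 1
    return a[:cut] + b[cut:], b[:cut] + a[cut:]
```

To make placement adaptive, as the text suggests, the punctuation vector would simply be carried on the string itself and inherited (and mutated) like any other allele.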
Similarly one can introduce mating tags that restrict crossover to classifiers with similar tags; again the tags, as part of the classifier, can be made subject to modification and selection by the genetic algorithm. Finally, there are two broad ranges of investigation, far beyond anything we yet understand either theoretically or empirically, that offer intriguing possibilities for the future. One of these stems from the fact that classifier systems are general-purpose. They can be programmed initially to implement whatever expert knowledge is available to the designer; learning then allows the system to expand, correct errors, and transfer information from one domain to another. It is important to provide ways of instructing such systems so that they can generate rules -- tentative hypotheses -- on the basis of advice. It is also important that we understand how lookahead and virtual explorations can be incorporated without disturbing other activities of the system. Little has been done in either direction. The other realm of investigation concerns fully-directed rule generation. In a precursor of classifier systems, the broadcast language (Holland [1975]), provision was made for the generation of rules by other rules. With minor changes to the definition of classifier systems, this possibility can be reintroduced. (Both messages and rules are strings. By enlarging the message alphabet, lengthening the message string, and introducing a special symbol that indicates whether a string is to be interpreted as a rule or a message, the task can be accomplished.)
With this provision the system can invent its own candidate operators and rules of inference. Survival of these meta- (operator-like) rules should then be made to depend on the net usefulness of the rules they generate (much as a schema takes its value from the average value of its carriers). It is probably a matter of a decade or two before we can do anything useful in this area.

Mathematical Extensions

There are at least two broader mathematical tasks that should be undertaken. One is an attempt to produce a general characterization of systems that exhibit implicit parallelism. Up to now all such attempts have led to sets of algorithms that are easily recast as genetic algorithms -- in effect, we still only know of one example of an algorithm that exhibits implicit parallelism. The second task involves developing a mathematical formulation of the process whereby a system develops a useful internal model of an environment exhibiting perpetual novelty. In our (preliminary) experiments to date, these models typically exhibit a (tangled) hierarchical structure with associative couplings. As mentioned earlier, such structures can be characterized mathematically as quasi-homomorphisms (see Holland et al. [1986]). The perpetual novelty of the environment can be characterized by a Markov process in which each state has a recurrence time that is large relative to any feasible observation time. Considerable progress can be made along these lines (see Holland [1986b]), but much remains to be done. In particular, we need to construct an interlocking set of theorems based on (1) a more global set of fixed point theorems that relates the strengths of classifiers under the bucket brigade to observed payoff statistics, (2) a set of theorems that relates building blocks exploited by the “slow” dynamics of the genetic algorithm to the sampling rates for rules at different levels of the emerging default hierarchy (more general rules are tested more often), and (3) a set of
theorems (based on the previous two sets) that detail the way in which various kinds of environmental regularities are exploited by the genetic algorithm acting in terms of the strengths assigned by the bucket brigade.

Appendix

A simplified version of the fundamental theorem for genetic algorithms can be stated as follows (for an explanation of terms, see Holland [1975] or Holland [1986a]).

Theorem (Implicit parallelism). Given a fitness function u: {0,1}^k -> Reals, a population B(t) of M strings drawn from the set {0,1}^k, and any schema s in {0,1,*}^k defining a hyperplane in {0,1}^k,

    M_s(t+1) >= u_s(t) (1 - e_s) M_s(t),

where M_s(t+1) is the expected number of instances of s in B(t+1),

    u_s(t) = [sum over x in s intersect B(t) of u(x)] / M_s(t)

is the average observed fitness of the instances of schema s in B(t), and

    e_s = (k_s - 1) P_cross / (k - 1)

is a “copying error” induced by crossover, where P_cross is a constant of the genetic algorithm (often P_cross = 1) giving the proportion of strings undergoing crossover in a given generation, and k_s - 1 is the number of crossover points between the outermost defining symbols of s. Under interpretation, the implicit parallelism theorem says that the sampling rate for every schema with instances in the population is expected to increase or decrease at a rate specified by its observed average fitness, with an error proportional to its defining length.

Theorem (Speedup). The number of schemas processed with an error < e under a genetic algorithm considerably exceeds M^3 for a population of size M = 2^k' where e = k'/k.

Theorem (Bucket brigade local fixed-point). If, under the bucket brigade algorithm, I_c is the long-term average income (after taxes) of a classifier C and r_c
is its bid-ratio, then its strength S_c will approach I_c / r_c.

Theorem (Q-morphism parsimony; for definitions, see Holland et al. [1986]). A q-morphism of n levels, in which each successive level uses k or fewer additional variables to define exceptions to the previous level, and in which the rules at each level are correct over at least a proportion p of the instances satisfying them, requires no more than

    sum from j = 1 to n of 2^(jk) (1 - p)^(j-1)

rules. (A homomorphism defined on nk variables requires 2^(nk) rules.) For n = 10, k = 2, p = 0.5, the q-morphism requires fewer than 2^12 rules, while a corresponding homomorphism would require 2^20 rules; that is, the homomorphism would require at least 256 times as many rules as the q-morphism.

References

Anderson, J. R. [1983]. The Architecture of Cognition. Cambridge, Massachusetts: Harvard University Press.
Holland, J. H. [1975]. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.
Holland, J. H. [1980]. “Adaptive algorithms for discovering and using general patterns in growing knowledge-bases.” International J. Policy Analysis and Information Systems, 4(2): 217-240.
Holland, J. H. [1986a]. “Escaping brittleness: The possibilities of general purpose algorithms applied to parallel rule-based systems.” Ch. 20 in Machine Learning II (eds. Michalski, R. S., et al.). Los Altos: Morgan Kaufmann.
Holland, J. H. [1986b]. “A mathematical framework for studying learning in classifier systems.” Physica D, 22: 307-317.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. [1986]. Induction: Processes of Inference, Learning, and Discovery. Cambridge, Massachusetts: MIT Press.

Acknowledgments

Much of the research reported here has been supported by the National Science Foundation, currently under grant IRI 8610225. The author also wants to thank the Los Alamos National Laboratory for a year (1986-87) as Ulam Scholar at the Center for Nonlinear Studies, a year to pursue his chosen research objectives without hindrances.

GREEDY GENETICS

G. E.
Liepins & M. R. Hilliard
Oak Ridge National Laboratory, Oak Ridge, TN 37831

Mark Palmer & Michael Morrow
University of Tennessee, Knoxville, TN 37996

ABSTRACT

The performance of conventional and PMX genetic algorithms as problem optimizers is compared with greedy algorithms coupled with genetic algorithms. Set covering, traveling salesman, and job shop scheduling problems form the basis for the comparison. Conventional and PMX genetic algorithms are suggested to be robust, although not particularly distinguished, optimizers for the class of problems studied in this paper. Greedy genetics are recommended only when the underlying greedy algorithm is powerful, but not optimal; it remains difficult to characterize when greedy genetics will outperform pure greedy algorithms.

INTRODUCTION

The goal of this research is to investigate whether the population diversity of genetic algorithms can be favorably used to ameliorate the myopic tendencies of the greedy algorithm; or in short, is a synergistic combination possible? (Greedy algorithms are best-first search algorithms with no backtracking -- see Lawler, 1976 and Nilsson, 1980.) This paper contrasts the performance of greedy genetics to conventional and PMX genetic algorithms. A comparison with established operations research algorithms and techniques such as simulated annealing awaits further experiments. A criticism of conventional genetic algorithms as optimizers is that they fail to incorporate problem structure in their formulation. Their only tie to the specific function being optimized is through the reward structure. The incorporation of the greedy algorithm allows problem specific information to be used in the crossover operation. Three classes of problems are studied: set covering, traveling salesman, and job shop scheduling.
Not unexpectedly, results suggest that when the associated greedy algorithm is powerful (that is, a single application of the algorithm produces a generally "reasonable" solution), greedy genetics outperform conventional and PMX genetics. Conversely, when the greedy algorithm is weak, greedy genetics perform worse than conventional genetics. No definitive comparison can be drawn regarding the comparative performance of the pure greedy algorithm and greedy genetics. The lesson seems to be that the conventional and PMX genetic algorithms are robust; use greedy genetics only when there is good reason to believe that the underlying greedy is powerful but not optimal.

Conventional Genetics

Genetic algorithms were originally developed by Holland (1975). Follow-on research of their suitability as function optimizers has been completed by DeJong (1975) and Bethke (1980). The basic regimen contains five major features: (1) representation of the solution space as a binary string, (2) a critic (function evaluation), (3) reproduction, (4) crossover, and (5) mutation. Hidden behind the conceptual simplicity of the genetic algorithm are a variety of parameters and policies such as crossover rate, mutation rate, and replacement policy -- all of whose settings affect performance (DeJong 1975; Liepins and Hilliard, 1985). With the parameters properly set, conventional genetic algorithms are robust and powerful when applied to multimodal real valued optimization problems (DeJong 1975; Bethke 1980). However, they are sorely inadequate for combinatorial optimization problems whose solutions require manipulation of permutations (problems such as the traveling salesman and job shop scheduling). Goldberg and Lingle (1985) modified the basic genetic algorithm to handle permutations by the introduction of the partially-mapped crossover (PMX).
The example Goldberg and Lingle use to illustrate PMX considers two permutations of ten objects:

    A = 9 8 4 | 5 6 7 | 1 3 2 10
    B = 8 7 1 | 2 3 10 | 9 5 4 6

If two random numbers, say 4 and 6, are chosen as the crossover points for these two genes, then each is to be modified by the permutations (5 2), (6 3), and (7 10), and the respective results become:

    A' = 9 8 4 | 2 3 10 | 1 6 5 7
    B' = 8 10 1 | 5 6 7 | 9 2 4 3

Goldberg and Lingle investigated survivability of o-schemata under PMX and then applied PMX to Karg and Thompson's ten city traveling salesman problem.

Greedy Genetics

Although they did not call it such, Grefenstette et al. (1985) developed a greedy crossover for the traveling salesman problem:
1. For each pair of parents, pick a random city for the start.
2. Compare the two edges leaving the city (as represented in the two parents) and choose the shorter edge.
3. If the shorter parental edge would introduce a cycle into the partial tour, then extend the tour by a random edge.
4. Continue to extend the partial tour using steps two and three until the circuit is completed.

Grefenstette et al. applied their heuristic to a 50 city and a 100 city problem. Unfortunately, the results of Grefenstette et al. were not directly comparable to those of Goldberg and Lingle. It might be expected that an analysis of schemata survivability (Goldberg, 1986) or a hill-climbing variant (Ackley, 1987) might be investigated for the greedy crossover. A little reflection suggests that the former is significantly more difficult than for PMX or traditional crossover operators. Alternating hill-climbing with genetic algorithms has produced successful results when interim populations suggested by the genetic algorithm provide good initial points for hill-climbing. However, the greedy algorithm (as implemented here) is not a "hill-climber", nor does it benefit from different starting points.
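The PMX mechanics in the worked example above can be sketched compactly. This is an illustrative implementation (the function and argument names are not from the paper): the middle segments are exchanged, and conflicting values outside the segment are traced through the exchange mapping until the conflict resolves.

```python
def pmx(a, b, cut1, cut2):
    """Partially-mapped crossover on two permutations. The segment between the
    cut points is exchanged; positions outside the segment keep their old value
    unless it now duplicates a segment value, in which case the value is chased
    through the mapping induced by the exchange."""
    def child(p, q):
        mid = q[cut1:cut2]                      # segment copied from the mate
        mapping = dict(zip(mid, p[cut1:cut2]))  # mate's value -> displaced value
        out = list(p)
        out[cut1:cut2] = mid
        for i in list(range(cut1)) + list(range(cut2, len(p))):
            v = p[i]
            while v in mapping:                 # follow the mapping chain
                v = mapping[v]
            out[i] = v
        return out
    return child(a, b), child(b, a)
```

Applied to the example with the segment spanning positions 4 through 6, this reproduces A' and B' exactly, and always returns valid permutations.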
On the other hand, Markov chain analysis (Goldberg and Segrest 1987) might provide insight into greedy genetics; such studies are expected to be pursued in later papers.

The comparative analysis of conventional and greedy genetics was performed on three classes of problems: set covering (SCP), job shop scheduling (JSS), and traveling salesman (TSP). The genetic algorithm used was the Genesis system (Grefenstette 1986) modified to generate a single offspring for each crossover operation. This offspring replaced the most similar of its parents. (One offspring per crossover was mandated by the greedy genetics and was used in all crossover methods to insure comparability. Replacement of the most similar parent provides a simple yet efficient means of avoiding premature convergence -- a pervasive problem with greedy genetics.) Additional modifications were made to accommodate the integer coding underlying JSS and TSP. (No such modification was required for SCP.) PMX and greedy crossover operations were added to Genesis. A fixed population size of 75 was used for SCP and candidate solutions were represented as binary strings of length 25 (the SCP matrices were of dimension 50 x 25). A fixed population size of 50 was used for TSP and JSS for the majority of the experiments and the candidate solutions were represented as permutations of 15 objects (JSS and TSP problems involved 15 jobs and 15 cities, respectively). (Selected experiments were run with population sizes 100 and 250. The results differed little from those presented here.) Genesis parameters not explicitly discussed in the previous paragraphs were left at their default settings. For each of the problem classes, a sample of problems was generated and the problems were attacked with both conventional and greedy versions of the genetic algorithm. Comparative performance of the best solution and trial at which it was achieved, as well as on-line, off-line, and average performance were all investigated.
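The replace-most-similar-parent policy described above can be sketched in a few lines for the binary (SCP) encoding. A minimal illustration, assuming Hamming distance as the similarity measure (the paper does not specify one, and the names are hypothetical):

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def replace_most_similar_parent(population, p1_idx, p2_idx, child):
    """Steady-state replacement: the single crossover offspring overwrites
    whichever of its two parents it most resembles, which preserves the less
    similar parent and so helps delay premature convergence."""
    d1 = hamming(population[p1_idx], child)
    d2 = hamming(population[p2_idx], child)
    population[p1_idx if d1 <= d2 else p2_idx] = child
```

Because the surviving parent is the one that differs most from the child, each replacement removes less diversity from the population than replacing the worst member would.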
Golden and Stewart (1985) suggest a number of statistical tests to evaluate the relative performance of competing algorithms. They suggest the Wilcoxon signed rank test, the Friedman test, and an expected utility approach. Such tests are useful as confirmatory statistical analysis. However, the results of this paper are still considered preliminary and it is felt that little would be gained by additional statistical formalism.

Set-Covering

The set-covering problem (SCP) is NP-complete (Garey and Johnson 1979) and is often encountered in applications such as resource allocation (Revelle et al. 1970) and scheduling (Marsten and Shepardson, 1981). Known solution techniques for SCP include such methods as integer programming; heuristic branch and bound; and most recently, Lagrangian relaxation with subgradient variations -- see Balas and Ho (1980) for a review of set covering solution techniques. The usual representation of a set covering problem is as a zero/one matrix with a cost associated with each column. A set of columns is defined to cover the matrix if for each row of the matrix at least one of the columns of the set contains a '1' entry. The SCP objective is to find a minimal cost cover of the matrix as follows: Let A be an m x n binary matrix and w_i, i = 1, ..., n non-negative costs.

    Minimize (over d)  sum from i = 1 to n of w_i d_i
    subject to  A d >= 1,

where d = (d_1, ..., d_n)^T, 1 is an m-dimensional column vector of ones, and each d_i is binary.

The genetic algorithm representation of SCP is straightforward: a candidate SCP solution is expressed as a binary string where a '1' in position i indicates that the i-th column of the matrix is included in the solution. The reward is a large number M minus the sum of the cost of the columns used and a penalty P n for failure to cover:

    R = M - (sum from i = 1 to n of w_i d_i) - P n,

where the d_i and n are binary, with n = 0 if the solution is a cover and n = 1 otherwise, and P is an appropriate scaling for the penalty.
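The reward function above is straightforward to sketch. The constants M and P below are illustrative placeholders (the paper explicitly leaves the tuning of P open), and the function name is hypothetical:

```python
def scp_reward(A, w, delta, M=1000.0, P=100.0):
    """Reward R = M - sum_i w_i * delta_i - P * eta for the set-covering GA
    representation in the text. A is the m x n zero/one matrix (list of rows),
    w the column costs, delta the binary solution string; eta = 0 iff every
    row has a '1' in some selected column, and eta = 1 otherwise."""
    cost = sum(wi * di for wi, di in zip(w, delta))
    covered = all(any(row[i] for i in range(len(delta)) if delta[i])
                  for row in A)
    eta = 0 if covered else 1
    return M - cost - P * eta
```

Since Genesis-style systems maximize, the constant M simply keeps the reward positive; only the cost and penalty terms affect the ranking of solutions.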
(The choice of the penalty scaling factor P is known to affect the genetic algorithm's SCP performance, but will not be investigated in this paper.) Each of single crossover, double crossover, and greedy crossover was investigated. The latter was defined as follows. For every pair of parent genes:
1. Initialize the set S to be empty and the matrix A to be the original set covering matrix with columns {c_i} and associated costs {w_i}.
2. For the unused columns and uncovered rows calculate the cost-ratio (cost/number-of-rows-covered = w_i/number-of-rows-covered).
3. Append to S the column (say column c) with the least cost-ratio that is included in one of the parents.
4. Strike column c and the rows covered by c from A and let this new matrix be A'.
5. If S is a cover or no other columns are represented in the parents, stop. Otherwise, set A to A' and go to step 2.

The greedy SCP crossover is illustrated in Figure 1.

[Figure 1. Greedy crossover for SCP: two five-column parents (parent 1: 10000, parent 2: 01100); column 2, with cost-ratio 1, is selected first for the child, then column 1. Note: column 4 was not selected because it was not represented in either parent.]

The experimental design for the set-covering problem involved matrix density as an additional parameter. (Matrix density for a zero-one matrix is the proportion of ones in the matrix.) Five test problems were generated for each of the selected densities ranging from 10% to 90%. The results are displayed below: Table 1 presents the relative success of each of the methods on each of the classes of the problems and Table 2 presents a problem by problem breakdown of the performance.
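Steps 1 through 5 of the greedy SCP crossover can be sketched as follows. This is an illustrative reading under one assumption flagged in the comments: step 3's "least cost" is taken to mean the least cost-ratio computed in step 2, as the figure's cost-ratio annotations suggest. All names are hypothetical.

```python
def greedy_scp_crossover(A, w, parent1, parent2):
    """Greedy SCP crossover: repeatedly append the parent-represented column
    with the least cost-ratio (cost / number of still-uncovered rows it covers)
    until S is a cover or the parents offer no further useful columns.
    A is a list of 0/1 rows, w the column costs; returns the child as a 0/1
    list, which need not be a cover (the penalty term handles that case)."""
    n = len(w)
    allowed = {i for i in range(n) if parent1[i] or parent2[i]}
    uncovered = set(range(len(A)))
    S = set()
    while uncovered:
        best, best_ratio = None, None
        for c in sorted(allowed - S):
            gain = sum(1 for r in uncovered if A[r][c])
            if gain == 0:
                continue
            # assumption: step 3 selects by least cost-ratio, not raw cost
            ratio = w[c] / gain
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = c, ratio
        if best is None:          # no parent column covers anything new: stop
            break
        S.add(best)
        uncovered -= {r for r in uncovered if A[r][best]}
    return [1 if i in S else 0 for i in range(n)]
```

Restricting the candidate columns to those represented in either parent is what makes this a crossover rather than a plain greedy heuristic: the child inherits only parental material, reordered by problem-specific information.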
Generation average performance and best of generation performance are displayed for representative matrices of 30% and 70% density in Figures 2-5.

[Table 1. Relative SCP performance by problem density and crossover method: wins and ties for the one-point, two-point, and greedy crossover methods.]

The major conclusion that can be drawn from the SCP results is that the greedy genetics for the set-covering problem not only yield a better solution than either the pure greedy or conventional genetics, but that convergence to that solution is much more rapid than with conventional genetics.

        1 Pt. Cross    2 Pt. Cross    Greedy Cross   Pure      Optimal
        Best   Trial   Best   Trial   Best   Trial   Greedy*
10% 1:  61.29  412     62.03  560     60.11  80      60.11     ?
    2:  45.72  593     47.79  593     48.2   56      48.76     ?
    3:  48.31  613     43.16  630     43.28  258     44.42     ?
    4:  49.72  702     51.58  663     49.11  90      49.82     ?
    5:  55.79  547     50.85  389     48.38  7       48.38     ?
20% 1:  19.36  682     18.78  831     18.78  61      22.33     ?
    2:  34.54  550     33.6   496     29.34  669     33.27     ?
    3:  33.66  719     28.37  524     27.49  114     30.25     ?
    4:  26.42  534     27.02  783     22.22  646     24.17     ?
    5:  24.09  614     16.23  641     13.41  78      16.03     ?
30% 1:  16.30  786     11.52  787     11.52  77      11.52     11.52
    2:  20.13  918     16.70  733     16.06  93      17.19     14.80
    3:  8.46   611     11.85  913     8.20   65      8.20      8.26
    4:  17.77  163     18.40  431     18.10  108     20.06     ?
    5:  9.99   688     8.83   915     7.58   96      7.57      7.55
40% 1:  10.73  937     11.58  788     11.63  81      11.53     10.70
    2:  12.21  793     12.97  745     9.78   108     11.62     9.56
    3:  11.38  949     11.70  1006    7.50   95      7.50      7.50
    4:  7.99   523     6.45   633     6.45   96      6.82      6.42
    5:  17.96  631     15.26  612     19.31  104     16.00     12.64
50% 1:  3.62   703     5.45   623     2.86   18      3.62      2.86
    2:  4.38   378     5.40   434     4.38   79      4.38      4.38
    3:  8.07   855     11.16  684     7.57   81      7.57      7.57
    4:  6.10   992     8.91   718     6.10   220     7.90      6.10
    5:  6.39   929     6.99   644     6.27   90      6.39      6.19
60% 1:  4.75   893     4.24   913     3.12   90      3.28      3.12
    2:  6.49   687     4.66   581     4.66   1014    5.23      4.66
    3:  6.24   123     6.14   962     5.86   77      5.86      5.31
    4:  5.15   799     3.64   795     3.63   882     3.63      3.63
    5:  6.93   493     4.08   779     3.77   16      5.12      3.77
70% 1:  4.78   645     4.49   607     4.49   107     5.76      3.96
    2:  3.42   663     2.15   735     1.76   94      2.04      1.76
    3:  2.31   547     2.31   729     2.31   76      2.31      2.31
    4:  5.21   727     3.00   610     2.64   37      2.78      2.64
    5:  0.96   473     0.96   540     0.61   77      0.76      0.61
80% 1:  0.92   728     2.09   802     0.92   78      0.92      0.92
    2:  2.03   426     2.03   922     2.03   92      2.69      2.03
    3:  3.89   442     3.41   548     2.25   85      2.92      2.25
    4:  4.85   546     5.46   922     4.13   80      4.18      3.70
    5:  3.58   819     2.59   686     2.59   80      3.58      2.59
90% 1:  3.08   887     0.62   551     0.62   76      0.78      0.62
    2:  0.66   979     1.19   787     0.66   86      0.68      0.66
    3:  0.73   851     0.54   815     0.51   79      0.72      ?
    4:  3.82   667     3.81   532     3.65   80      5.20      3.65
    5:  1.33   738     2.34   825     1.08   76      1.04      1.04

* Greedy heuristic applied recursively until a cover is generated.

Table 2. Problem by problem SCP performance.

[Figure 2. Set covering problem #2 of 30% density: generation-average performance (cover cost vs. trials) for one-point, two-point, and greedy crossover.]
[Figure 3. Set covering problem #2 of 30% density: best of generation.]

Job Shop Scheduling

The JSS problem (just like the TSP problem) is a pure ordering problem. Since it is not guaranteed to result in an order, conventional crossover is not applicable to ordering problems unless the problems are formulated as a penalty function problem. Otherwise, mechanisms such as PMX must be used and the underlying genetic building blocks are the o-schemata. Bethke (1980) analyzed GA-hard functions using Walsh transforms, group characters of the group Z_2^n, the n-fold Cartesian product of the group Z_2 with componentwise addition (Dym and McKean 1972). A natural extension of Bethke's work to the analysis of "GA-order-hard" functions might be based on the group representations of the symmetric group (Boerner 1963). Three types of crossover operators were investigated for job shop scheduling: PMX, a weak greedy crossover, and a powerful greedy crossover. The job shop scheduling problem investigated was the simplest scheduling problem: a static queue of jobs with specified due dates and run times with no precedence constraints, a single server, and minimal
[Figure 4. Set covering problem #4 of 70% density: generation-average performance.]
[Figure 5. Set covering problem #4 of 70% density: best of generation.]

(signed) lateness as the criterion. (The more complicated job shop scheduling problem considered by Davis (1985) was not investigated.) A well known heuristic provides the optimal job schedule for this problem: "Order the queue according to increasing run times." The two versions of the greedy crossovers were based on weak and powerful heuristics, respectively. The powerful heuristic is the optimal heuristic previously discussed; the weak heuristic states "order the queue according to increasing difference between due date and run time"; that is, a job with an early due date and a long run time would be scheduled before a job with a late due date and a short run time. The greedy crossovers are implemented as follows:
1. For each parent pair, start at the first job (i = 1).
2. Compare the two jobs at the i-th position of the two parents and place the better (according to the heuristic being used) of the two in the child's i-th position. If one of the jobs has already been placed in the child, then automatically pick the other. If both of the jobs have already been placed, then pick a job randomly from the yet unplaced jobs.
3. Repeat step 2, incrementing i, until all positions in the child string are defined.

Results from the job shop scheduling experiments are presented in Table 3, and generation average and best of generation performance for each of the crossover methods for a representative problem are graphically presented in Figures 6 and 7. It is obvious that the strong greedy crossover dominates PMX, which in turn dominates the weak greedy crossover. Repeated application of the powerful heuristic (pure strong greedy) would yield the optimal schedule with an evaluation of 0.00.
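The position-wise JSS crossover above can be sketched as follows. This is an illustrative implementation, not the authors' code: the names, the job-record layout, and the convention that a smaller key value means "better" are assumptions; the strong heuristic would pass run time as the key, the weak one the due-date-minus-run-time difference.

```python
import random

def jss_greedy_crossover(parent1, parent2, jobs, key, rng=None):
    """Steps 1-3 in the text: at each position take whichever parent's job
    looks better under `key` (smaller is better); fall back to the other
    parent's job, or to a random unplaced job, when a job is already placed.
    `jobs` maps job id -> job data; parents are permutations of the ids."""
    rng = rng or random.Random()
    child, placed = [], set()
    for j1, j2 in zip(parent1, parent2):
        if j1 in placed and j2 in placed:
            pick = rng.choice(sorted(set(parent1) - placed))
        elif j1 in placed:
            pick = j2
        elif j2 in placed:
            pick = j1
        else:
            pick = j1 if key(jobs[j1]) <= key(jobs[j2]) else j2
        child.append(pick)
        placed.add(pick)
    return child
```

Unlike PMX, this operator consults the problem itself (through the heuristic key), which is exactly the problem-specific information the introduction argues conventional crossover lacks.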
        PMX            Weak Greedy     Strong Greedy
        Best   Trial   Best    Trial   Best   Trial
 1:     40.84  825     124.41  314     0.00   514
 2:     33.78  913     51.42   951     8.82   321
 3:     29.74  938     57.71   948     2.27   397
 4:     59.60  873     86.93   984     6.99   370
 5:     27.22  861     54.94   972     2.75   576
 6:     34.94  967     68.76   849     3.34   462
 7:     20.90  689     71.72   926     2.20   938
 8:     14.68  890     112.84  887     0.00   545
 9:     53.60  695     101.93  923     2.20   861
10:     13.52  994     203.40  169     0.00   432
11:     28.99  946     74.98   711     0.00   416
12:     48.14  835     115.14  422     0.00   1005
13:     18.02  941     63.18   955     1.60   481
14:     17.57  701     80.20   967     1.24   720
15:     15.07  976     150.09  397     10.85  424
16:     7.93   946     136.03  492     2.23   599
17:     50.18  786     208.27  43      3.24   959
18:     31.35  1007    122.13  29      4.71   828
19:     25.97  904     184.46  281     2.49   344
20:     46.00  793     183.39  681     3.78   937

Table 3. Problem by problem JSS performance.

[Figure 6. Job shop scheduling problem #2: generation-average performance for PMX, weak greedy, and strong greedy crossover.]

Traveling Salesman Problem (TSP)

The traveling salesman problem (TSP) is another of the very difficult NP-complete combinatorial ordering problems with a long history of interest. Few (1955) discovered a heuristic solution for the Euclidean problem with a guaranteed worst case performance of sqrt(2n) + 1.75. Christofides (1976) discovered an elegant heuristic with a worst case error of 50% of optimal tour length. Karp (1977) showed that partitioning into subsets and concatenating optimal tours of each subset yields tours with expected percentage error approaching zero. Crowder and Padberg (1980) solved a 318 city tour to optimality, and Golden and Stewart (1985) reported excellent results for their CCAO heuristic. Of the many theorems and results relating to the TSP, two are of particular relevance to this paper.

Theorem 1 (Rosenkrantz, Stearns, and Lewis 1977). For every r > 1, there exist n-city instances of the TSP for arbitrarily large n obeying the triangle inequality such that NN(I) > r
OPT(I), where NN(I) is the nearest neighbor tour and OPT(I) is the overall optimal tour.

[Figure 7. Job shop scheduling problem #2: best of generation for PMX, weak greedy, and strong greedy crossover.]

problem  nearest    PMX unanchored  PMX anchored  PMX       greedy genetic  greedy genetic
         neighbor   and seeded      and seeded    unseeded  seeded          unseeded
 1       19484      19374 b         18243 ab      23269     19211 b         18243 ab
 2       16480 a    16480 a         16480 a       18096     16480 a         16480 a
 3       17673 a    17673 a         17673 a       22588     17673 a         17815
 4       18701      15626 b         14837 b       18495 b   14331 b         14256 ab
 5       16219      16210 ab        16219         18569     16219           16219
 6       13613 a    13613 a         13613 a       13613 a   13613 a         23916
 7       17400 a    17400 a         17400 a       24578     17400 a         17745
 8       15067 a    15067 a         15067 a       46714     15067 a         15067 a
 9       19755      19785           19877         25129     19555 ab        19661 b
10       21299      21299           20739 b       23392     20778 b         19421 ab
11       18034      17711 b         17740 b       22150     17364 b         17191 ab
12       18587      18587           18587         21998     16587 ab        19708
13       17078      17078           17078         21063     17078           ?
14       19535      19472 b         19205 b       23845     18581 ab        18581 ab
15       16052      16052           16052         20856     16042 ab        16052
16       21934      20226 b         19897 b       21442 b   19253 ab        19712 b
17       16384      16354 b         16229 ab      20352     16229 ab        16354 b
18       21977      21977           21977         26089     21807 ab        22668
19       17049      17006 ab        17015 b       22651     17049           18519
20       17869 a    17869 a         17869 a       25119     17869 a         17869 a

Table 4. Problem by problem TSP performance.
a: best performance.  b: performance surpasses nearest neighbor.

Theorem 2 (Papadimitriou and Steiglitz 1977). If A is a local search algorithm whose neighborhood search time is bounded by a polynomial of the problem representation, then if P != NP, A cannot be guaranteed to find a tour whose length is bounded by a constant multiple of the optimal tour even with an exponential number of iterations.

Theorem 1 is relevant to this paper because the standard against which the genetic algorithms are compared is the pure greedy algorithm, called the nearest neighbor solution in the operations research literature.
The implication is that this standard itself could be poor. Theorem 2 suggests that if the greedy genetic algorithm is a local search algorithm, as is believed, then examples of tours can be found for which the algorithm performs arbitrarily poorly. Presumably, the same conclusions hold for PMX.

The work presented here builds on that of Brockus (1983), Goldberg and Lingle (1985), and Grefenstette et al. (1985). The results of those three studies were not directly comparable, and it is interesting to ask how they compare. Moreover, none of the previous studies investigated the effect of seeding the initial populations, of anchoring the tours, or of stochastic variation due to different randomizations.

Two types of crossover operators were investigated for the TSP: PMX and a modification of the greedy crossover of Grefenstette et al. (1985). For the anchored case, all members of the population were normalized to begin with the same starting city (city '1'). For the unanchored case this was not done and the starting cities were randomly chosen. For any two parents, the modified Grefenstette crossover (greedy crossover) produces a child according to the following specification:

1. For a pair of parents, start at the first position (always the same city).
2. Choose the shortest edge leading from the current city (that is represented in the parents). If this edge leads to a cycle, choose the other edge. If this also leads to a cycle, choose a random city that continues the tour.
3. If the tour is complete, stop; else go to 2.

The modified Grefenstette crossover will be called the greedy crossover and is always anchored in the results presented here. Both greedy crossover and PMX crossover were tested with seeded and unseeded initial populations. The seeded initial population included the fifteen greedy-algorithm-generated tours (possibly including duplicates) beginning at each of the fifteen different cities.
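As a concrete illustration, the three-step greedy crossover above might be sketched as follows. The helper names and the dictionary-of-dictionaries distance table are assumptions of this sketch, not the paper's implementation; tours are assumed anchored, i.e., both parents begin with the same city.

```python
import random

def greedy_crossover(p1, p2, dist):
    """Sketch of the greedy (modified Grefenstette) crossover.

    p1, p2: parent tours as city lists sharing the same start city (anchored).
    dist:   dist[a][b] gives the edge length between cities a and b.
    """
    n = len(p1)
    # Successor of each city within a parent tour (wrapping at the end).
    succ1 = {p1[i]: p1[(i + 1) % n] for i in range(n)}
    succ2 = {p2[i]: p2[(i + 1) % n] for i in range(n)}

    child = [p1[0]]                 # step 1: start at the anchored first city
    visited = {p1[0]}
    while len(child) < n:
        cur = child[-1]
        # step 2: shorter of the two parental edges leaving the current city
        cand = sorted([succ1[cur], succ2[cur]], key=lambda c: dist[cur][c])
        nxt = next((c for c in cand if c not in visited), None)
        if nxt is None:             # both edges close a cycle: random continuation
            nxt = random.choice([c for c in p1 if c not in visited])
        child.append(nxt)
        visited.add(nxt)
    return child                    # step 3: tour complete
```

The child inherits only edges present in one of the parents unless both would close a premature cycle, which is what lets the operator exploit the parents' short edges.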
This subpopulation of fifteen was randomly extended to an initial population of fifty. The nearest neighbor tour was the best of the fifteen greedy-algorithm-generated tours.

Results of the performance of the greedy and PMX crossovers are given in Table 4. Discouraging is the poor performance of the PMX with the unseeded population; only for one problem did it perform better than the nearest neighbor algorithm. Virtually no performance differential among the remaining variations of the genetic algorithm was observed; each performed somewhat better than the nearest neighbor. Perhaps surprisingly, the unseeded greedy genetic performed nearly as well as its seeded counterpart.

Relative performance between anchored and unanchored PMX and seeded and unseeded greedy crossover is presented in Table 5, where the 2 in the upper left hand corner indicates that the unanchored PMX performed better than the anchored PMX on two problems. The remaining entries are interpreted similarly. Apparently, no important differences are indicated.

Figures 8 and 9 present generation-average performance and best-of-generation results for the fourth problem at 70% density. Bethke (1980) has shown that for certain problems, genetic algorithms are unstable; that is, the solution determined seems to depend on genetic drift and on the randomization of the initial population and genetic mechanism. Tables 6 and 7 display the results of different randomizations of initial populations and genetic mechanisms for the first problem at 10% density. Discouragingly, approximately 8% variation in performance can be noted for the greedy genetic as either the initial population or the genetic mechanism is randomized differently. The counterpart variations for the PMX are 13% and 17%, respectively. The implication seems to be that the TSP is a difficult problem for either the PMX or greedy genetics.

                   unanchored and seeded   anchored and seeded
  PMX                        2                      6

                   seeded                  unseeded
  greedy genetic             8                      6

Table 5.
Seeding and anchoring as factors in performance differential

  population    PMX      greedy
    seed #              genetic
      1        22671     19696
      2        22445     18243
      3        21247     18733
      4        22456     18771
      5        24075     18888

Table 6. Random initial population seeds as a factor in performance differential

  crossover     PMX      greedy
    seed #              genetic
      1        24029     19537
      2        23244     20721
      3        22167     20713
      4        24074     20275
      5        22427     19072
      6        21460     20078
      7        22898     19600
      8        21628     19348
      9        22671     19636
     10        26125     20415

Table 7. Random seeds for GA mechanism as a factor in performance differential

[Figure 8. Generation average performance, Traveling Salesman Problem #4]

[Figure 9. Best of generation for unseeded initial populations, Traveling Salesman Problem #4]

Conclusions and Future Research

Although little effort was made to optimize the algorithms investigated in this paper, the evidence suggests that greedy genetics can successfully make use of problem-specific information whenever the underlying greedy algorithm is powerful. (The underlying greedy could be defined to be "powerful" whenever a single application of it to the problem results in a "reasonable" solution.) On the other hand, if the underlying greedy is misdirected, greedy genetics are less successful than traditional genetics. However, regardless of the power of the underlying greedy algorithm, greedy genetics often converge more rapidly than their conventional counterparts.

It appears that greedy genetics have their place in optimization, and it would be interesting to extend this work to more realistic problems (such as job shop scheduling problems with precedence constraints and multiple servers) and to compare performance with more traditional optimization techniques. The variance in performance due to randomization, Markov chain results, extensions to Bethke's work, and proper penalty function formulations all deserve additional analysis.

REFERENCES

Ackley, D. H. (1987).
A Connectionist Machine for Genetic Hillclimbing, Kluwer Academic Publishers, Boston, MA.

Balas, E. and A. Ho (1980). "Set Covering Algorithms Using Cutting Planes, Heuristics, and Subgradient Optimization: A Computational Study," Mathematical Programming, 12, 37-60.

Bethke, A. D. (1980). Genetic Algorithms as Function Optimizers, Ph.D. Thesis, University of Michigan, Ann Arbor.

Boerner, H. (1963). Representations of Groups with Special Consideration for the Needs of Modern Physics, North-Holland, Amsterdam.

Brockus, C. G. (1983). "Shortest Path Optimization Using a Genetics Search Technique," Proceedings of the ... Conference, Pittsburgh, PA, 241-245.

Christofides, N. (1976). Worst-Case Analysis of a New Heuristic for the Traveling Salesman Problem, Report 388, Graduate School of Industrial Administration, Carnegie-Mellon University, Pittsburgh, PA.

Crowder, H. and M. W. Padberg (1980). "Solving Large-Scale Symmetric Traveling Salesman Problems to Optimality," Management Science, 26, 495-509.

Davis, L. (1985). "Job Shop Scheduling with Genetic Algorithms," Proceedings of an International Conference on Genetic Algorithms and Their Applications, Carnegie-Mellon University, Pittsburgh.

De Jong, K. A. (1975). An Analysis of the Behavior of a Class of Genetic Adaptive Systems, Ph.D. Thesis, University of Michigan, Ann Arbor.

Dym, H. and H. P. McKean (1972). Fourier Series and Integrals, Academic Press, New York.

Few, L. (1955). "The Shortest Path and the Shortest Road through n Points," Mathematika, 2, 141-144.

Garey, M. R. and D. S. Johnson (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, San Francisco.

Grefenstette, J. J., R. Gopal, B. J. Rosmaita, and D. Van Gucht (July 1985). "Genetic Algorithms for the Traveling Salesman Problem," Proceedings of an International Conference on Genetic Algorithms and Their Applications, Carnegie-Mellon University, Pittsburgh.

Grefenstette, J. J. (April 1986). A User's Guide to GENESIS, Technical Report CS-84-11, Computer Science Department, Vanderbilt University, Nashville.

Goldberg, D. E. (1986). Simple Genetic Algorithms and the Minimal Deceptive Problem, TCGA Report No. 86003, The Clearinghouse for Genetic Algorithms, University of Alabama, Tuscaloosa.

Goldberg, D. E., and R. Lingle (July 1985). "Alleles, Loci, and the Traveling Salesman Problem," Proceedings of an International Conference on Genetic Algorithms and Their Applications, Carnegie-Mellon University, Pittsburgh.

Goldberg, D. E., and P. Segrest (to appear). "Finite Markov Chain Analysis of Genetic Algorithms," University of Alabama, Tuscaloosa.

Golden, B. L., and W. R. Stewart (1985). "Empirical Analysis of Heuristics," in The Traveling Salesman Problem, Lawler et al. (eds.), John Wiley and Sons.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor.

Karp, R. M. (1977). "Probabilistic Analysis of Partitioning Algorithms for the Traveling Salesman Problem in the Plane," Mathematics of Operations Research, 2, 209-224.

Lawler, E. L. (1976). Combinatorial Optimization: Networks and Matroids, Holt, Rinehart and Winston, New York.

Lawler, E. L., J. K. Lenstra, A. H. G. Rinnooy Kan, and D. B. Shmoys (eds.) (1985). The Traveling Salesman Problem, John Wiley and Sons.

Liepins, G. E., and M. R. Hilliard (1986). "Representational Issues in Machine Learning," Proceedings of the International Symposium on Methodologies for Intelligent Systems Colloquia Program, Knoxville, TN, ORNL-6362.

Marsten, R. E., and F. Shepardson (1981). "Exact Solution of Crew Scheduling Problems Using the Set Partitioning Model: Recent Successful Applications," Networks, 11(2), 165-177.

Nilsson, N. J. (1980). Principles of Artificial Intelligence, Tioga, Palo Alto, CA.

Papadimitriou, C. H., and K. Steiglitz (1977). "On the Complexity of Local Search for the Traveling Salesman Problem," SIAM J. Comput., 6, 78-83.

ReVelle, C., D. Marks, and J. C. Liebman (1970). "An Analysis of Private and Public Sector Location Models," Management Science, 16(12), 692-707.

Rosenkrantz, D. J., R. E. Stearns, and P. M. Lewis II (1977).
"An Analysis of Several Heuristics for the Traveling Salesman Problem," SIAM J. Comput., 6, 563-581.

INCORPORATING HEURISTIC INFORMATION INTO GENETIC SEARCH

Jung Y. Suh
Dirk Van Gucht
Computer Science Department
Indiana University
Bloomington, Indiana 47405
(812) 335-6429
CSNET: jysuh@indiana, vgucht@indiana

Keywords: Genetic Algorithms, Heuristics, Optimization Problems, Simulated Annealing, Sliding Block Puzzle, Traveling Salesman Problem

Abstract

Genetic algorithms have been shown to be robust optimization algorithms for (positive) real-valued functions defined over domains of the form R^n (R denotes the real numbers). Only recently have there been attempts to apply genetic algorithms to other optimization problems, such as combinatorial optimization problems. In this paper, we identify several obstacles which need to be overcome to successfully apply genetic algorithms to such problems and indicate how integrating heuristic information related to the problem under consideration helps in overcoming these obstacles. We illustrate the validity of our approach by providing genetic algorithms for the Traveling Salesman Problem and the Sliding Block Puzzle.

1. Introduction

Suppose we have an object space X and a function f: X → R+ (R+ denotes the positive real numbers), and our task is to find a global minimum (or maximum) of the function f. In this paper, we will concentrate on genetic algorithms, a class of adaptive algorithms invented by John Holland [8], to solve (or partially solve) this problem.
Genetic algorithms differ from more standard search algorithms (e.g., gradient descent, controlled random search, hill-climbing, simulated annealing [9], etc.) in that the search is conducted using the information of a population of structures instead of that of a single structure. The motivation for this approach is that by considering many structures as potential candidate solutions, the risk of getting trapped in a local optimum is greatly reduced.

Genetic algorithms have been applied with great success by De Jong [4] to a wide variety of functions defined over object spaces of the form R^n, i.e., each structure x consists of n real numbers x[1] ... x[n]. Only recently have there been attempts to apply genetic algorithms to other optimization problems, such as the Traveling Salesman Problem (TSP) [6, 7], Bin Packing [13], and Job Scheduling [2, 3]. An important observation made by Grefenstette et al. [7] was that to successfully apply genetic algorithms to such problems, heuristic information has to be incorporated into the genetic algorithm; in particular, they proposed a heuristic crossover operator and showed the dramatic improvement as compared to crossover operators which did not use such heuristic information.

In this paper, we continue the efforts of [7]. In Section 2 we identify several problems which need to be overcome to successfully apply genetic algorithms to optimization problems other than the standard function optimization problems. In Sections 3 and 4 we illustrate the validity of our approach by providing genetic algorithms for the Traveling Salesman Problem and the Sliding Block Puzzle [12].

2. Design Issues of Genetic Algorithms

In this section, we outline the major obstacles in the design of genetic algorithms for optimization problems other than standard function optimization problems and suggest approaches to overcome them.

2.1.
The Representation Problem

As mentioned before, genetic algorithms have almost exclusively been applied to functions defined over object spaces of the form R^n. When we want to solve other optimization problems, such as combinatorial optimization problems, simple parametric representations of the structures can no longer be used. This suggests that the first step towards successfully applying genetic algorithms to these problems is to use a natural representation for the structures of the problem at hand. In particular, we suggest that the choice of such a representation should allow for the definition of recombination operators which incorporate heuristic information about the problem. We thus imply that the selection of "good" representations and recombination operators is highly correlated (for a more detailed discussion of these issues we refer to [5]).

2.2. The Selection of Appropriate Recombination Operators and the Importance of Local Improvement

The power of applying genetic algorithms to functions defined over R^n is that the standard recombination operators, crossover and mutation, make intuitive sense in this problem. In other problem domains, however, this is not usually the case. Since the recombination step is critical for the success of a genetic algorithm, it is important to carefully select an appropriate set of recombination operators for such problem domains.

Early research on genetic algorithms [1, 8] was primarily concerned with operators which guarantee that (some) structural information of the structures to which they are applied is preserved. Examples of such recombination operators are the standard crossover, mutation and inversion operators used in function optimization. Grefenstette et al. argued that such an approach does not carry over with similar success to the traveling salesman problem (TSP). They showed that considering recombination operators (in their case, only crossover operators) that merely preserve structural information results
in poorly performing genetic algorithms (i.e., not much better than random search). They discovered, however, that it is possible and natural to incorporate heuristic information about the TSP into the crossover operator and still maintain its fundamental property, namely preservation of structural information of the structures to which the operator is applied. This resulted in a fairly successful genetic algorithm for the TSP, but certainly not an algorithm that is competitive with other approximation algorithms for the TSP (see [10]). (It should be noted that De Jong [5] and Grefenstette et al. [7] already identified some of these problems.)

We claim that it is often the case that additional improvements can be gained if one incorporates even more heuristics about the problem into the recombination step of a genetic algorithm. Often, heuristics about problems are incorporated into algorithms in the form of operators which iteratively perform local improvements to candidate solutions. Examples of such operators can be found in gradient descent algorithms, hill climbing algorithms, simulated annealing, etc. We will argue that it is usually straightforward, and in fact, we think, essential if a competitive genetic algorithm is desired, to incorporate such local improvement operators into the recombination step of a genetic algorithm. An additional advantage of this approach is that it suggests a natural technique of blending genetic algorithms with more standard optimization algorithms.
2.3. Avoiding Premature Convergence

One of the major difficulties with genetic algorithms (and in fact with most search algorithms) is that sometimes premature convergence, i.e., convergence to a suboptimal solution, occurs. It has been observed that this problem is closely tied to the problem of losing diversity in the population. One source of loss of diversity is the occasional appearance of a "super-individual" which in a few generations takes over the population. One way of avoiding this problem is to change the selection procedure, as was demonstrated by Baker [1]. Another source of loss of diversity results from poor performance of recombination operators in terms of sampling new structures. To overcome such problems, we claim that the recombination operators should be selected carefully so that they can offset each other's vulnerabilities (for a more detailed discussion of these issues, we refer to [1, 4, 7, 8]).

3. The Traveling Salesman Problem

In this section we show that solutions to the problems raised in Section 2 enable us to develop a genetic algorithm for the TSP. The TSP is easily stated: Given a complete graph with N nodes, find the shortest Hamiltonian tour through the graph (in this paper, we will assume Euclidean distances between nodes). For an excellent discussion of the TSP, we refer to [10].

The object space X obviously consists of all Hamiltonian tours (tours, for short) associated with the graph, and f, the function to be optimized, returns the length of a tour. As in [7], we represent a tour by its adjacency representation. It turns out that this representation allows us to easily formulate and implement heuristic recombination operators. In the adjacency representation, a tour is described by a list of cities: there is an edge in the tour from city i to city j if and only if the value in the ith position of the adjacency representation is j. For example, the tour shown in Figure 1 is represented as (3 1 5 2 4).
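The adjacency representation can be sketched as follows; this is an illustrative helper with hypothetical function names, not the authors' code. Position i of the adjacency list holds the city that the tour visits immediately after city i (cities labelled from 1, as in the paper's example).

```python
def tour_to_adjacency(order):
    """Convert a tour given as a visiting order (1-based city labels)
    into its adjacency representation: position i holds the city
    that the tour visits immediately after city i."""
    n = len(order)
    adj = [0] * n
    for k in range(n):
        city, nxt = order[k], order[(k + 1) % n]
        adj[city - 1] = nxt
    return adj

def adjacency_to_tour(adj, start=1):
    """Recover a visiting order from an adjacency representation."""
    order, city = [], start
    for _ in range(len(adj)):
        order.append(city)
        city = adj[city - 1]
    return order
```

For the paper's example, the five-city tour 1 → 3 → 5 → 4 → 2 → 1 yields the adjacency representation (3 1 5 2 4): after city 1 comes 3, after city 2 comes 1, after city 3 comes 5, and so on.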
Figure 1. The tour (3 1 5 2 4).

We now turn to the most critical step of the design: the selection of appropriate recombination operators. We elected to have two such operators. The first operator is a slight modification of the heuristic crossover operator introduced by Grefenstette et al. [7]. This operator constructs an offspring from two parent tours as follows. Pick a random city as the starting point for the offspring's tour. Compare the two edges leaving the starting city in the parents and choose the shorter edge. Continue to extend the partial tour by choosing the shorter of the two edges in the parents which extend the tour. If the shorter parental edge would introduce a cycle into the partial tour, check if the other parental edge introduces a cycle. In case the second
In other words, the heuristic crossover operates performs poorly when it comes down to fine-tuning candidate solutions This motivated us to introduce a sec ond recombination operator Whereas the heuristic crossover operator can be thought of as a global operator, the second recombination operator has a more local behavior and thns qualifies as a local improvement operator It was introduced by Lin and Kernighan {43} and 09 called the 2-opt operator The 2-opt operator randomily selects two edges (41, 21) and (t2, a) from a tour (see Figure 2) and checks if 2D (11, 91} + ED(1a, 2) > ED(45,33) + ED(Qa, 1) (ED stands for Enchidean distance). Vf this is the case, the tour is replaced by removing the edges (ty, 1) and (tg, 9) and replacing them with the edges (in, ja) and (ia, 11) (sce Figure 3) Actually, we use a more subtle variation of the 2-opt operator, mspired by recent work on stmulated annealing for the TSP by Kirkpatrick etal (9 In this varation there is a (small) probability (depending on a slowly decreasing temperature) that when ED(uyn)+ D(a, 3) < ED(4, 72) + ED(1a, 71), the ong- mal tour is replaced using the previously described transfor- mation f To make the description of our genetic algorithin complete, we need to desciibe three parameters a crossover rate this parameter indicates the amount of structures tn the population which will undergo cross- over b local improvement rate this parameter indicates the amount of stiuctares in the population whieh will undergo 2-opt operations © Q-opt rate if the graph under consideration has N’ nodes, each structure which 15 selected to undergo local unprovement will undergo (NV x 2~opt rate) 2-opt operations pet generation The algorithm was stopped when the majority of the tours in the population were identical, We tried our algorithin on a wide variety of (euctidean} t lt should be noted, however, that the performance of the algouithm with the shnple 2-opt as usually only shightly worse than an algorithm that uses 
the simulated annealing version.

Figure 2. Tour with edges (i1, j1) and (i2, j2).

Figure 3. Tour with edges (i1, i2) and (j1, j2).

traveling salesman problems. In Figure 4, we show a selection of such problems.

Figure 4. Five Traveling Salesman Problems: krolak, lattice, 4-circles, lat-4-circ, 200-cities.

In Table 1, we show the results obtained by the algorithm of Grefenstette et al. [7] for the following parameter settings: initial population = 100 randomly chosen tours, crossover rate = 50%, local improvement rate = 0%, 2-opt rate = not applicable. In Table 2, we show the results of a genetic algorithm which uses the local improvement operator with the following parameter settings: population size = 100 structures, crossover rate = 50%, local improvement rate = 50%, 2-opt rate = 0.1.

Table 1. Genetic Algorithm Without Local Improvement

  TSP           Nodes    Optimum    Our Solution    Generations
  krolak [10]    100      21282        25651             ?
  lattice        100        100        101.9           209
  4-circles      200      21.67         39.0           300
  lat-4-circ     200     112.56        139.2           986
  200-cities     200          ?        192.8           376

Table 2. Genetic Algorithm With Local Improvement

  TSP           Nodes    Optimum    Our Solution    Generations
  krolak         100      21282        21651           679
  lattice        100        100          100           188
  4-circles      200      21.67         24.5           218
  lat-4-circ     200     112.56        113.3           669
  200-cities     200          ?        153.6           946

In Figure 5, we show the (best) tours obtained and the generation in which they were first found by the algorithm which uses local improvement.
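The 2-opt move described in Section 3, together with its annealed acceptance variant, can be sketched as follows. The function name, the coordinate dictionary, and the temperature handling are assumptions of this sketch, not the authors' implementation; ED is Euclidean distance, as in the text.

```python
import math
import random

def two_opt_step(tour, coords, temperature=0.0):
    """One (annealed) 2-opt step on a tour given as a list of city indices.

    coords[c] is the (x, y) position of city c.  With probability
    exp[(old - new)/T] a worsening exchange is also accepted, mimicking
    the simulated-annealing variation described in the text.
    """
    def ED(a, b):
        return math.dist(coords[a], coords[b])

    n = len(tour)
    # Pick two non-adjacent edges (i1, j1) and (i2, j2).
    p = random.randrange(n)
    q = (p + 2 + random.randrange(n - 3)) % n
    p, q = min(p, q), max(p, q)
    i1, j1 = tour[p], tour[(p + 1) % n]
    i2, j2 = tour[q], tour[(q + 1) % n]

    old = ED(i1, j1) + ED(i2, j2)
    new = ED(i1, i2) + ED(j1, j2)
    worse = new >= old
    accept = not worse or (
        temperature > 0 and random.random() < math.exp((old - new) / temperature))
    if accept:
        # Reconnect as (i1, i2) and (j1, j2) by reversing the segment between them.
        tour[p + 1:q + 1] = reversed(tour[p + 1:q + 1])
    return tour
```

With temperature 0 this reduces to plain 2-opt: only strict improvements are accepted, so repeated steps uncross any pair of crossing edges.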
Clearly, the addition of a local improvement technique improves the performance (measured in terms of the tour length of the best tour obtained) of the algorithm dramatically. (In terms of extra resources, on average, the algorithm using local improvement required about 2.2 times more generations to obtain its best structure.) In fact, the results obtained by our algorithm are very competitive, again in terms of the tour length of the best tour obtained, compared to results reported in the literature for other approximation algorithms for the TSP [9, 10]. (In Appendix 1, we give additional results.)

Figure 5. Best Tours Obtained by a Genetic Algorithm with Local Improvement: krolak (21651), lattice (100.0), 4-circles (24.53), lat-4-circ (113.34), 200-cities (153.67).

4. The Sliding Block Puzzle (SBP)

We now describe how the approach described in Section 2 can be used in the design of a genetic algorithm for a problem which is not usually thought of as a function optimization problem: the Sliding Block Puzzle [12].* Consider the initial board of the puzzle shown in Figure 6 and let the board shown in Figure 7 be a goal board (the empty tile is represented by the symbol 0).

   1  2  3  4
   5  6  7  8
   9  0 10 11
  12 13 14 15

Figure 6. The Initial Board of a Sliding Block Puzzle.

Figure 7. A Goal Board of a Sliding Block Puzzle.

* In our implementation, we used 3x3 and 4x4 puzzles.

The objective of the SBP is to reach the goal board starting from the initial board using a sequence of valid moves. There are four basic moves:

L: move the empty tile to the left.
U: move the empty tile upwards.
R: move the empty tile to the right.
D: move the empty tile downwards.

The only precondition required for applying a move is that it should not move the empty tile off the board. For example, a sequence which transforms the board shown in Figure 6 into the board shown in Figure 7 is (L, U, R, D, R, U).
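The move mechanics above can be sketched as follows; this is an illustrative fragment (helper names and the list-of-rows board layout are assumptions of this sketch), with 0 marking the empty tile as in Figure 6.

```python
# A board is a list of rows; 0 marks the empty tile.  Each move slides the
# empty tile one cell and is legal only if it stays on the board.

MOVES = {"L": (0, -1), "U": (-1, 0), "R": (0, 1), "D": (1, 0)}

def find_empty(board):
    """Return the (row, column) position of the empty tile."""
    for r, row in enumerate(board):
        if 0 in row:
            return r, row.index(0)

def apply_move(board, move):
    """Return a new board with the move applied, or None if it is illegal."""
    r, c = find_empty(board)
    dr, dc = MOVES[move]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < len(board) and 0 <= nc < len(board[0])):
        return None                      # would push the empty tile off the board
    new = [row[:] for row in board]
    new[r][c], new[nr][nc] = new[nr][nc], new[r][c]
    return new

def apply_sequence(board, seq):
    """Apply a sequence of moves, failing on the first illegal one."""
    for m in seq:
        board = apply_move(board, m)
        if board is None:
            raise ValueError("illegal move in sequence")
    return board
```

The precondition check mirrors the one stated in the text: a move is rejected exactly when it would carry the empty tile off the edge of the board.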
In order to apply genetic algorithms to the SBP, we need to formulate the problem as a function optimization problem. The object space X consists of all valid sequences of moves applicable to the initial board. Notice that the structures in X do not have a fixed length representation. Other research with genetic algorithms on object spaces with structures having variable length representations was done by Smith [14], who implemented a machine learning system (LS-1) using structures corresponding to production system programs.

In order to define f, the function to be optimized, we need to introduce some extra notation. We will denote the initial board by IB and the goal board by GB. Let (x1, ..., xn) be a sequence of valid moves (i.e., an element of X); we denote by IB(x1, ..., xn) the board which is obtained by applying the sequence of moves (x1, ..., xn) to IB. Consider the boards IB(x1, ..., xn) and GB. For each tile (except the empty tile) in IB(x1, ..., xn), compute the Manhattan distance between the tile's position in IB(x1, ..., xn) and its position in GB. We define performance(x1, ..., xn) as the sum of all these Manhattan distances. In our first attempt, we defined f(x1, ..., xn) = performance(x1, ..., xn), but we quickly discovered that a much better measure is

    f(x1, ..., xn) = min { performance(x1, ..., xi) : 1 <= i <= n },

i.e., the value of a structure (x1, ..., xn) is defined as the performance of the sub-sequence (x1, ..., xi) whose corresponding intermediate board IB(x1, ..., xi) comes closest to GB. (It should be noted that computing f(x1, ..., xn) can be done in time O(n).) It should be clear that whenever f(x1, ..., xn) = 0, the sequence (x1, ..., xn) contains a subsequence (x1, ..., xi) which is a solution to the SBP.

We now turn to the selection of the recombination operators. The crossover process here is similar to that in the TSP. Suppose that two sequences of operators are given.
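Before describing the crossover in detail, the performance measure just defined can be sketched as follows (an illustrative fragment, not the authors' code; the list-of-rows board convention follows Figure 6, with 0 the empty tile):

```python
def performance(board, goal):
    """Sum of Manhattan distances of each non-empty tile in `board`
    from that tile's position in the goal board."""
    pos = {tile: (r, c)
           for r, row in enumerate(goal) for c, tile in enumerate(row)}
    total = 0
    for r, row in enumerate(board):
        for c, tile in enumerate(row):
            if tile != 0:
                gr, gc = pos[tile]
                total += abs(r - gr) + abs(c - gc)
    return total
```

f is then the minimum of this quantity over all intermediate boards IB(x1, ..., xi), which can be tracked in a single pass while the moves are applied.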
We pick the first operator from each sequence. Apply each operator to the initial board to see which operator yields a new board closer to the goal board. Choose with high probability the operator which yields the closer board, i.e., the one with the better performance. Notice that it is here that we employ heuristic information about the Sliding Block Puzzle; indeed, it seems likely that we should try to obtain intermediate configurations that get closer to the goal board. We do not always choose the better operator, however, because this may eventually lead to a bad sequence whose performance we cannot improve as long as it starts with that particular operator. In short, we could get stuck in a local optimum. However, it is our assumption that in general it is more likely the case that selecting the better operator will contribute to constructing a good sequence. In case the two operators have the same performance, pick either of them randomly. Once the operator is chosen, it becomes the first operator of the new sequence and the board is updated accordingly. Now we pick the second operators of each sequence. Again, we will take the one with the better performance. It may, however, be the case that one or both of them is no longer legal, i.e., it pushes the empty tile off the edge of the board. This is possible because the operator chosen for the new sequence is not necessarily the one which preceded the current two operators. In case only one of the operators is illegal, choose the one which is legal. Otherwise, randomly generate a legal one. It becomes the second operator of the new sequence. Again update the board. This process is repeated until we reach the end of the two sequences.

The local improvement process is performed on a single structure. First randomly pick m positions (0 < m ...

[...]

... (E_n > E_b):

    p = exp[ (E_b - E_n) / kT ]

where E_n is the energy of the new solution and E_b is the energy of the current best solution.
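The Boltzmann acceptance rule above can be sketched as follows; this is an illustrative fragment with hypothetical names, taking k = 1 (as the paper itself assumes later).

```python
import math
import random

def accept(e_new, e_best, temperature, k=1.0):
    """Boltzmann acceptance: always accept a solution that is at least as
    good; accept a worse one with probability exp[(E_b - E_n)/kT]."""
    if e_new <= e_best:
        return True
    if temperature <= 0:
        return False                     # no thermal energy: never go uphill
    return random.random() < math.exp((e_best - e_new) / (k * temperature))
```

At high temperature the exponent is near zero and uphill moves are accepted almost freely; as the temperature falls, the acceptance probability for any fixed energy gap decays toward zero.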
Conceptually, the system must pay a price in terms of energy to transition to a state that is higher in energy than the current state. The energy required is supplied by the heat present in the system. Boltzmann's equation relates the temperature to the available energy. The equation is probabilistic rather than deterministic, since the ambient temperature represents the average over a large number of small volumes; even at a low ambient temperature the local temperature in a small volume might be very high. Boltzmann's equation gives the probability that a given amount of energy is available at a given temperature. In simulated annealing the temperature is set to a high value initially and then lowered according to an annealing schedule provided by the analyst.

There are reports that simulated annealing has enjoyed great success on the travelling salesman problem [Ref. 6]. These reports are difficult to assess, since performance numbers are not given. In any event, it would be desirable to include some aspects of simulated annealing within the general framework of the genetic algorithm. In particular, we desire to tie the schemata retention behavior of the standard operators (crossover, inversion and mutation) to a temperature parameter. This suggests a model for genetic operators as described in the following section. The model described is artificial in the sense that it is motivated by a need to control schemata retention in a computer program rather than by biological or physical processes.

Thermodynamic Operator

We consider a two-parent crossover operation between individuals that have been selected from the population with probability proportional to their fitness. An offspring individual is constructed by transcribing the genes from the parents in such a way that o-schemata and a-schemata are partially preserved. Transcription begins with the first gene of one of the parents.
We imagine that it is energetically favorable to continue transcription from the same parent; it costs energy to switch transcription to the other parent. The energy available in the environment follows a Boltzmann distribution. If we call the energy at which transcription will switch to the other parent θ_c (the crossover threshold energy), then a switch will occur just when the energy available locally (which depends on the temperature) exceeds θ_c. This occurs with probability

    exp[ -θ_c / kT ]        (eq 2)

where k is an arbitrarily chosen scaling constant. We assume k = 1. Transcription continues from the second parent until the crossover threshold energy is again exceeded. At high temperatures, transcription might switch back and forth between the parents many times. At low temperatures transcription might not switch at all, resulting in a direct copy of the parent individual.

Inversion is modelled similarly. We imagine that the string selected from each parent stands a chance of being rotated as it is transported from the parent to the offspring (see Figure 2). Again, it costs energy to perform this rotation, and this energy is supplied by the heat available locally. Thus, the probability of inversion is

    exp[ -θ_i / kT ]

where θ_i is the inversion threshold energy. In some domains it may be desirable to model the fact that it requires more energy to rotate a longer string than to rotate a shorter one. Due to the rotational symmetry in the travelling salesman problem we ignored this issue. That is, the distance from city-i to city-j is the same as the distance from city-j to city-i, so that in this case o-schemata wholly contained within the rotated string are not affected by the operation. In other cases, for instance bin packing, packing item-i before item-j will almost always result in a different fitness than packing item-j before item-i.

Figure 2.
A string between the crossover points may be rotated as it is moved from the parent to the offspring.

Thus inverting a longer string in the bin packing case should require more energy than inverting a shorter string. Finally, mutation (of the allele) is treated similarly. We imagine that as each allele is transcribed there is a probability that its value will change. The probability of mutation is just

exp[-θ_m / kT]

Thus a thermodynamic operator is specified by a triple: (θ_c, θ_i, θ_m). The overall system temperature is varied according to an "annealing" schedule chosen by the analyst. As the temperature approaches zero, the probability that any of the thresholds will be exceeded also approaches zero, genetic activity ceases, and the schemata retention ratio approaches 1. As the temperature approaches infinity, schemata are retained only by chance and the algorithm becomes a random search. The operator therefore provides a retention ratio for o-schemata that is continuously variable from 0 to 1.

Implemented Operators

We have implemented the unified thermodynamic operator and use it in place of the basic genetic operators. Since the exponentiation operation in (eq 2) is computationally expensive, we calculate the expected number of genes to be transcribed at the current temperature before a crossover event occurs. This quantity is given by a geometric distribution [Ref. 7]:

N = ⌈ln U / ln(1 - e^(-θ/kT))⌉

where U is a uniform deviate in the interval (0,1]. The calculation is the same for crossover, inversion and mutation. These quantities should be recalculated if the temperature is raised. In the thermodynamic framework refresh is implemented by allowing the operator to be defined by a 4-tuple (θ_c, θ_i, θ_m, θ_r), where θ_r is the refresh threshold.

(cond ((> *temp* 115) (decf *temp* 1.0))
      ((> *temp* 100) (decf *temp* 0.2))
      ((> *temp* 50)  (decf *temp* 0.1))
      ((> *temp* 36)  (decf *temp* 0.05))
      (t (prog1 (setq *temp* *reset-temp*)
           (if (< *reset-temp* 50)
               (setq *reset-temp* 150)
               (decf *reset-temp* 10)))))

Figure 5.
A LISP routine to implement an annealing schedule.

Test Cases

We have run many experiments with travelling salesman problems of various sizes. Some sample cases are exhibited in figures 6, 7, and 8. We have found that the use of thermodynamic crossover in conjunction with heuristic crossover provides a marked improvement over the use of the extended crossover operator alone or heuristic crossover alone. We have also found that thermodynamic crossover allows the reduction of the population to sizes much smaller than those indicated by Goldberg and Lingle [Ref. 3]. For instance, we use a population size of 12 - 15 individuals for a 200 city problem. Other parameter settings used in the 100 city problem are as follows:

θ_c: 375.0
θ_i: 140.0
θ_m: not applicable
refresh: .07
Initial temperature: 250.0

The annealing schedule shown in figure 5 was used. The heuristic crossover operator was invoked 25% of the time and thermodynamic crossover 75%. A population of size 12 was used. In the following figures "Trials" is the number of individuals that have been evaluated. The fitness measure is in terms of the theoretical optimum value derived in [Ref. 8] and referenced in [Ref. 9]. The fitness quantity used is the optimum tour length divided by the actual tour length. Thus a fitness of .95 is about 5% above the theoretically predicted minimum tour. These results compare favorably with results presented by other researchers using the GA on the TSP, [Ref. 2] for example, both in terms of the required number of fitness evaluations and the resulting route length.

Figure 6. A 50 city problem (population size 10): random tour; 2500 trials, fitness = 0.8976; 5000 trials, fitness = 0.9367; 10,310 trials, fitness = 0.9443.

Figure 7. A 100 city problem (population size 12): random tour; 2500 trials, fitness = 0.7567; 20,000 trials, fitness = 0.9155.

Figure 8. A 200 city problem (population size 12): 2500 trials, fitness = 0.5607; 35,000 trials, fitness = 0.9503; 70,676 trials, fitness = 0.9656.
Conclusion

Although the unified Thermodynamic Operator is not directly motivated by a theory of genetics or thermodynamics, it provides an explicit control over population convergence. Empirically, it has been shown to increase performance through the use of an annealing schedule, a concept borrowed from Simulated Annealing. Analysis of the thermodynamic operator and a better understanding of the relationship between schemata retention and optimal convergence rates is expected to lead to even better performance by customizing the annealing schedule to the requirements of specific applications.

Appendix: O-Schemata Retention Ratio For Inversion

The inversion operator on an individual of length L selects two unique points A and B. The order of the genes between these points is reversed. To calculate the o-schemata retention ratio (the fraction of o-schemata retained by a typical application of the inversion operator) we first calculate the expected number of o-schemata retained and then divide by 2^L, the total number of o-schemata originally represented in the individual. To calculate the expected number retained, E, we sum the product of the expected number retained for each choice of A times the probability of choosing that value of A. Since all the values of A are equally likely (for all x, P(A = x) = 1/L), this probability can be moved outside the summation. For a given choice of A we must consider the case where (A>B) and the case where (B>A):

E_A = P(A>B) E_{A>B} + P(B>A) E_{B>A}

Since all choices for B are equally likely:

P(A>B) = (A-1) / (L-1)
P(B>A) = (L-A) / (L-1)

E_{A>B} can be calculated by summing the product of the actual number to be retained, when A and B are both known (and A>B), times the probability of that selection of A and B given A>B. Once again this probability is the same for all values of A and B given A>B (P = 1/(A-1)) so it can be moved outside the summation. Finally, the number retained when A and B are known (and A>B) can be rewritten as the total number, 2^L, minus the number lost.
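The expected-retention computation can be cross-checked by brute-force enumeration for small L. This is our own verification sketch; following the appendix's counting, it assumes 2^L o-schemata in total and that an inversion between points B < A loses the 2^(1+A-B) o-schemata using genes in the inverted region.

```python
def expected_retained(L):
    """Average the retained count, 2^L - 2^(1 + |A-B|), over all ordered
    choices of two distinct inversion points A, B on a string of length L."""
    total, pairs = 0, 0
    for a in range(1, L + 1):
        for b in range(1, L + 1):
            if a != b:
                total += 2 ** L - 2 ** (1 + abs(a - b))
                pairs += 1
    return total / pairs

def closed_form(L):
    """E = 2^L + 8/(L(L-1)) + 8/(L-1) - 8*2^L/(L(L-1))."""
    return (2 ** L + 8 / (L * (L - 1)) + 8 / (L - 1)
            - 8 * 2 ** L / (L * (L - 1)))

# The enumeration and the closed form agree for every small L tried:
for L in (4, 6, 10, 16):
    assert abs(expected_retained(L) - closed_form(L)) < 1e-6
```

Dividing `closed_form(L)` by 2^L gives the retention ratio, whose dominant correction term is -8/(L(L-1)), so the ratio indeed approaches 1 for large L.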
Generally, any o-schemata using genes from the inversion region will be lost, so the number lost in an inversion between A and B can be shown to be 2^(1+A-B) (the number of o-schemata using genes between A and B). A similar analysis will also work for the other half, where B>A, simply by reversing the roles of A and B and using P = 1/(L-A). Assembling the expression described above leads to a complex expression which reduces to:

E = 2^L + 8/(L(L-1)) + 8/(L-1) - (8 · 2^L)/(L(L-1))

To get the retention ratio we must divide by 2^L:

R = 1 + 8/(2^L · L(L-1)) + 8/(2^L · (L-1)) - 8/(L(L-1))

For large values of L it can be approximated by:

R ≈ 1 - 8/(L(L-1))

which approaches 1 as L becomes very large. From this analysis we can see that inversion will have a tendency to retain a large portion of the o-schemata, not allowing it to explore new schemata effectively. This could help to explain the generally poor performance of inversion in solving order problems cited by Goldberg and Lingle [Ref. 3]. Similar analysis of the other operators has shown that each one has different retention characteristics for o-schemata than for a-schemata.

References

1. Holland, J. H.: Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, 1975.
2. Grefenstette, J., R. Gopal, B. Rosmaita, and D. Van Gucht: Genetic Algorithms for the Travelling Salesman Problem. Proceedings of an International Conference on Genetic Algorithms and their Applications, Pittsburgh, July 24-26, 1985, pp 160-168.
3. Goldberg, D. E. and R. Lingle, Jr.: Alleles, Loci, and the Travelling Salesman Problem. Ibid, pp 154-159.
4. De Jong, K.: Genetic Algorithms: A 10 Year Perspective. Ibid, pp 169-177.
5. Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi: Optimization by Simulated Annealing. Science, Vol. 220, May 1983, pp 671-680.
6. Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling: Numerical Recipes, the Art of Scientific Computing. The Press Syndicate of the University of Cambridge, Cambridge, 1986, pp 326-334.
7. Knuth, D. E.: The Art of Computer Programming, Vol. 2, Addison-Wesley Publishing Co., Reading, MA, 1969, pp 116-117.
8. Beardwood, J., J. H. Halton, and J. M. Hammersley: The Shortest Path Through Many Points. Proc. Cambridge Philosophical Society, Vol. 58, 1959.
9. Bonomi, E. and J. Lutton: The N-city Travelling Salesman Problem: Statistical Mechanics and the Metropolis Algorithm. SIAM Review, Vol. 26, No. 4, October 1984.

TOWARDS THE EVOLUTION OF SYMBOLS¹

Charles P. Dolan, Hughes AI Center and UCLA AI Laboratory
Michael G. Dyer, UCLA AI Laboratory

Abstract

This paper addresses the dual problem of implementing symbolic schemata in a connectionist memory and of using simulated evolution to produce connectionist networks that operate at the symbolic level. Connectionist models may change the way we model symbols, and we take the view that a connectionist architecture that implements symbol processing must have a plausible evolutionary path through which it could have passed. A simulation methodology is introduced which allows symbolic models to be studied at the neural level without expensive computational numerical models. A model for schema processing is proposed and that model is shown to have a plausible evolutionary path using only mutation. The proposed model allows us to implement the memory model of CRAM, a model of learning and planning currently implemented entirely at the symbolic level, using distributed representations. We also propose a more general solution for using genetic algorithms to construct connectionist networks that manipulate explicit symbolic structures.

1. What do genetic algorithms have to offer connectionism?

The question of "what genetic algorithms have to offer connectionism" is an important one because we believe that genetic algorithms are not just another search method for finding a good set of weights in a high-dimensional non-linear weight space.
Genetic algorithms may be part of a solution to the problem of representational opacity. One problem in connectionist systems is that the representations they learn are often incomprehensible to humans. In the case of the NETtalk system for pronouncing English text (Sejnowski 1987) it was necessary to use multi-dimensional clustering techniques to find out what distinctions the hidden layer was making among the inputs. In Rumelhart and McClelland's system (1986) for learning the past tense of verbs, even though the system looked as though it was following rules, no rules could be found directly in the weights of the network. In both these systems, the back-propagation learning algorithm (Rumelhart et al. 1986) learned non-intuitive mappings of symbols (English words) to actions. It is our contention that a network that learns a specific task, but does so by constructing an opaque representation, may yield little to our understanding of human cognitive information processing. General intelligence will probably not be achieved by only picking statistically significant features of the environment². To combat this tendency of connectionist systems we follow the principle of intermodular transparency. This principle states that modules always communicate with a fixed representation that can be given some symbolic interpretation. In the case of connectionist networks, the modules communicate with patterns of activation. Inside the modules any representation at all may be used. The problem we encountered is that there are no connectionist learning algorithms for performing search in the space of configurations of functional units.

¹The second author is supported in part by the JTF program of the DoD, monitored by JPL, and by grants from the ITA Foundation and the Hughes Artificial Intelligence Center.
To attempt a solution to this problem we used a degenerate form of genetic search (Holland 1975), hill climbing, to search a small space of configurations, and obtained the encouraging results described in Section 9. This has led us to formulate the entire problem as a general genetic search problem using the representation described in Section 10. We have yet to show that genetic search yields reasonable structures, but the representation we have constructed shows some promise when measured against the representational issues set forth in (DeJong 1985).

²In (Holland et al. 1987) the point is made that the inductive inferences a system makes are highly dependent on the goals of the system, not just the regularities of the data.

2. Symbol processing in PDP networks

We adopt a functional approach to modelling cognitive processes. For a given process we want to model, we design a complex architecture that models it. Then, piece by piece, we try to replace functional parts of the architecture with networks of simple units, as in parallel distributed processing (PDP) (Rumelhart and McClelland 1986) models. This approach simultaneously answers the questions of how to adapt symbolic models for PDP implementation and how to implement symbols in PDP models. The question which this approach does not answer satisfactorily is "How could these architectures have gotten there?" To answer this question we must provide a plausible evolutionary path through which an architecture could have developed. We hope that genetic algorithms will not only help us establish this path but will give us a tool for exploring the space of possible architectures in a principled manner. To more fully understand our approach it is helpful to see that distributed connectionist models fall on a continuum of functional structure. The continuum ranges from homogeneous to highly functional models.
The homogeneous models, such as (Rumelhart and McClelland 1986), have a large number of identical units and a uniform connection pattern among units. These networks use the same learning rule for all units and rely on emergent properties of these collections of units for modeling all cognitive behavior. The approach we are taking here is highly functional. In these models, the general spirit of connectionism is kept, but large portions of the network may lose the ability to develop emergent properties. In addition, these models also admit larger amounts of external control than other models and more than one type of unit. These are all simplifying assumptions that allow us to model symbolic processing. A more complete discussion of the various degrees of functionality can be found in (Dolan and Dyer 1987).

3. Symbolic schemata and hierarchical structure

To understand why we want to embed symbolic structures in connectionist networks one has to see the importance that such structures have played in previous cognitive models. For example, symbolic schemata have been used extensively in story understanding under various names, such as scripts (Cullingford 1980) and MOPs (Schank 1982). Schema recognition is a fundamental operation in such systems. Figure 1 shows an example schema with some unfilled roles: waiter, owner, food, and payment. This schema is simplified and adapted from the restaurant script in (Cullingford 1980). The notation used in Figure 1 stresses that a schema-based representation can be implemented as a set of relations, one relation for each slot in the schema. Likewise, the constraints among the slots of the schema, which constitute the structural description of the schema, can be represented as sets of relations.
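A schema-as-relations representation of this kind can be sketched as a set of tuples whose unfilled roles are variables. This is an illustrative toy of ours, with invented role names and only a fragment of the restaurant schema:

```python
# "The restaurant" schema as a set of relations, one per slot.
# Unfilled roles are marked with a leading "?".
RESTAURANT_SCHEMA = {
    ("$RESTAURANT", "actor", "?waiter"),
    ("$RESTAURANT", "object", "?food"),
    ("?waiter", "isa", "HUMAN(service-person)"),
    ("?food", "isa", "OBJECT(edible)"),
}

def unfilled_roles(schema):
    """Collect the role variables that still need fillers."""
    return {term for relation in schema for term in relation
            if term.startswith("?")}

def bind_role(schema, role, filler):
    """Role binding: substitute a filler for a role variable everywhere."""
    return {tuple(filler if term == role else term for term in relation)
            for relation in schema}

bound = bind_role(RESTAURANT_SCHEMA, "?food", "BigMac")
```

The constraint relations among slots, which the text mentions as the schema's structural description, would be represented as further tuples in the same set.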
Figure 1 - Example schema, "the restaurant" ($RESTAURANT, with roles such as waiter: HUMAN (service-person) and food: OBJECT (edible), built from PTRANS, MTRANS, and ATRANS relations.)

Figure 2 - Example schema hierarchy ($RESTAURANT at the root, with sub-classes $FAST-FOOD and $FANCY, instances such as MacDonalds, BurgerKing, and MaMaison, and remembered episodes at the leaves.)

In addition to having a rich structural description, symbolic schemata are also often organized into a hierarchy such as the one shown in Figure 2. In these hierarchies, the most general schemata are located at the top with more specific schemata on the levels below. The leaves of the tree are instances of schemata that the program has encountered. This organization supports two operations that give schema processing programs a great deal of their power. The first is inheritance. Once a situation is recognized as being an instance of a schema somewhere in the hierarchy, all the knowledge associated with that schema's super-classes is also available. The second is discrimination. Once a situation is recognized, the program can walk down the branches of the tree to see if the descriptions of any more specific situations apply. Our approach to connectionist symbol processing is to directly implement features such as inheritance and discrimination at the neural level. This allows us to determine the value of such mechanisms when our models are implemented on physiologically more realistic hardware.

4. Symbol processing in CRAM

CRAM is a symbolic model of comprehension and learning from fables, with a schema-based memory (Dolan and Dyer 1986). CRAM has various components that use the schema-based memory: (1) a story comprehension system, (2) a planning system, (3) a symbolic learning system, and (4) an advice and adage-generating system. Such a model is ideal for testing a connectionist-style memory because the memory module must support these four diverse cognitive tasks.
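The inheritance and discrimination operations described in Section 3 can be sketched over a toy hierarchy. The dictionaries and the knowledge attached to each schema below are our illustration, with names adapted from Figure 2:

```python
# Each schema points to its super-class (None at the root).
SUPER = {
    "$RESTAURANT": None,
    "$FAST-FOOD": "$RESTAURANT",
    "$FANCY": "$RESTAURANT",
    "MacDonalds": "$FAST-FOOD",
    "MaMaison": "$FANCY",
}
KNOWLEDGE = {
    "$RESTAURANT": {"roles": ("waiter", "owner", "food", "payment")},
    "$FAST-FOOD": {"payment": "pay before eating"},
}

def inherit(schema):
    """Inheritance: once a schema is recognized, knowledge attached to its
    super-classes becomes available (nearer schemata take precedence)."""
    merged = {}
    while schema is not None:
        for key, value in KNOWLEDGE.get(schema, {}).items():
            merged.setdefault(key, value)
        schema = SUPER[schema]
    return merged

def discriminate(schema, evidence):
    """Discrimination: walk down the branches of the tree to the most
    specific sub-class supported by the evidence."""
    for child in [s for s, sup in SUPER.items() if sup == schema]:
        if child in evidence:
            return discriminate(child, evidence)
    return schema
```

For example, `inherit("MacDonalds")` makes both the inherited role list and the fast-food payment convention available, and `discriminate("$RESTAURANT", {"$FANCY", "MaMaison"})` walks down to the `MaMaison` leaf.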
The memory model for CRAM is based primarily on three operations on schemata: (1) instantiation, (2) role binding, and (3) recognition. For example, when CRAM notices that a character in a story has a particular goal, such as needing a secretary, and notices that the same character flatters a secretary, then CRAM needs to instantiate the appropriate schema and bind all the correct roles. Role binding is different from binding in pattern matchers because it involves modifying the roles of schemata already instantiated in short term memory. For this reason we cannot use Touretzky and Hinton's solution (Touretzky and Hinton 1985) for production systems. For a discussion of role binding in connectionist memories see (Dolan and Dyer 1987).

5. Distributed symbols and relations

One way to think of symbols in distributed connectionist models is as "bit strings". These "bit strings" are not memory addresses, as are symbols in traditional implementations, but are feature vectors. An example of this is found in (McClelland and Kawamoto 1986), where each symbol is classified along a number of dimensions. A fixed number of bits is allocated to each dimension, one for each possible classification, and the bit string is formed from that set of features. As an example from (McClelland and Kawamoto 1986), the dimensions and features are:

DIMENSIONS   FEATURES
GENDER       male, female, neuter
SIZE         small, medium, large

and the symbol mappings are:

SYMBOLS      VECTORS (GENDER SIZE)
John   -->   (100 001)
Mary   -->   (010 010)
Book   -->   (001 100)

By constructing symbols this way, we ensure that symbols with similar meanings will have similar representations. Relations can then be formed from these symbols in a very straightforward manner, as in (Hinton 1981). Assume that we have defined the symbol "hit" for a relation as (100 100); then we can represent "John hit Mary" with

(100001 100100 010010)

for (ROLE1 REL ROLE2).
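The feature-vector construction can be sketched directly. This is a toy version of ours; the dimension table follows the example above, and each symbol's feature values are read off the vectors shown there:

```python
# One bit per feature value along each dimension, concatenated.
DIMENSIONS = (
    ("GENDER", ("male", "female", "neuter")),
    ("SIZE", ("small", "medium", "large")),
)

def make_symbol(**features):
    """Build a feature-vector 'bit string' from one value per dimension."""
    bits = []
    for dim, values in DIMENSIONS:
        bits.extend(1 if v == features[dim] else 0 for v in values)
    return tuple(bits)

john = make_symbol(GENDER="male", SIZE="large")     # (1,0,0, 0,0,1)
mary = make_symbol(GENDER="female", SIZE="medium")  # (0,1,0, 0,1,0)
hit = (1, 0, 0, 1, 0, 0)  # the relation symbol defined in the text

# "John hit Mary" as a (ROLE1 REL ROLE2) vector:
john_hit_mary = john + hit + mary
```

Because symbols sharing feature values share bits, symbols with similar meanings automatically get similar vectors, which is the point of the construction.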
This method of representation is compatible with the neutral/distributed and biased methods for intermodule communication. Sets of relations can be represented as a pattern of activation on a set of units using conjunctive coding (Hinton et al. 1986). An example of such a representation is given in Figure 3. Here we allocate a cube of units where each dimension of the cube is the length of one symbol. Each unit in the cube represents a three-way conjunction of the features of the ROLE1, REL, and ROLE2 positions of a relation. In this way, multiple relations can be stored on the same set of units. This approach is similar to the design of the working memory for the production system in (Touretzky and Hinton 1985), except in that design, each element of the working memory was assigned a random subset of each of the possible symbols for the three positions in a relation. In our design, each unit in the working memory plays a very specific role in the semantic meaning of the relations stored in working memory.

Figure 3 - Conjunctive coding of relations in working memory

6. Methodology

In order to understand the meanings of their symbols, symbolic researchers need a model of a network that allows them to study the interactions of micro-features and micro-inferences (Hinton et al. 1986) while still being able to specify symbolic interactions at a relatively high level. The approach we propose here is to use a model of groups of idealized neurons. One useful side effect of this approach is that simulations are less complex compared to those of individual neurons. By making the objects of study functional groups of neurons, modelers can study architectural issues, but only after it has been shown that groups of units can be made to function in the desired way using detailed numerical simulations.
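The conjunctive coding of Figure 3 can be sketched as a cube of binary units. This illustration of ours uses 2-bit symbols for brevity:

```python
def conjunctive_code(role1, rel, role2):
    """A cube of units: unit (i, j, k) fires iff bit i of ROLE1, bit j of
    REL, and bit k of ROLE2 are all on -- a three-way conjunction."""
    return [[[a & b & c for c in role2] for b in rel] for a in role1]

def superimpose(mem_a, mem_b):
    """Multiple relations share the same set of units: a unit is on if it
    is on in either stored relation."""
    return [[[x | y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(plane_a, plane_b)]
            for plane_a, plane_b in zip(mem_a, mem_b)]

# Two relations stored on the same working-memory cube:
memory = superimpose(conjunctive_code([1, 0], [1, 1], [0, 1]),
                     conjunctive_code([0, 1], [1, 1], [1, 0]))
```

With symbols of length n the cube has n³ units, which is the cost paid for letting every unit carry a specific semantic role rather than a random assignment.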
Using this method, we simulate a group of units, called a "unit-set", as an object that responds to messages in the sense of object-oriented programming. This is similar to the approach taken in (Eilbert and Salter 1986). Eilbert and Salter modeled each layer of the network as an object, and object parameters determined the probability of a particular unit exciting either the unit directly below it or some random unit in the next layer. In this way they were able to easily simulate complex hierarchical networks. In Section 10 we describe a way of encoding a configuration of unit-sets that is suitable for use by a genetic algorithm. Underlying this methodology is the idea that a network architecture can be judged by how well it minimizes three quantities: (1) the number of unit-sets, (2) the complexity of the control layer, and (3) the number of different messages used. We believe that symbolic processing can be accomplished with a small number of messages, including:

(1) SEND-OUTPUTS, to trigger a unit-set to excite the unit-sets to which it is connected
(2) WINNER-TAKE-ALL, to cause a mutually inhibitory unit-set to settle on a small number of active units
(3) REINFORCE, to invoke reinforcement learning
(4) DECAY, to cause a unit-set to decay the activation of each unit towards zero

Between unit-sets are weighted links (connections) of the type normally used in neurally inspired models. Static links are used to pass data between sub-networks. Modifiable links are used to make one unit-set learn to respond to a pattern of activation from another unit-set. In cases where a learning algorithm is used, a message of the appropriate type is sent to the unit-set. For units that learn by reinforcement, for example, the unit-set is sent a REINFORCE message. Active units within the unit-set have their links modified according to reinforcement (Barto et al. 1981).
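A minimal sketch of the unit-set object and three of its messages follows. This is our illustration of the methodology, not the authors' simulator; a REINFORCE handler would similarly adjust the weights on modifiable links.

```python
class UnitSet:
    """A functional group of idealized units, simulated as a single object
    that responds to messages in the object-oriented sense."""

    def __init__(self, size):
        self.activation = [0.0] * size
        self.links = []  # list of (target UnitSet, weight matrix) pairs

    def send_outputs(self):
        """SEND-OUTPUTS: excite the unit-sets this one is connected to."""
        for target, weights in self.links:
            for j in range(len(target.activation)):
                target.activation[j] += sum(
                    a * weights[i][j] for i, a in enumerate(self.activation))

    def winner_take_all(self):
        """WINNER-TAKE-ALL: mutual inhibition settles on one active unit."""
        winner = max(range(len(self.activation)),
                     key=self.activation.__getitem__)
        self.activation = [1.0 if i == winner else 0.0
                           for i in range(len(self.activation))]

    def decay(self, rate=0.5):
        """DECAY: move every unit's activation towards zero."""
        self.activation = [a * rate for a in self.activation]

clique = UnitSet(3)
clique.activation = [0.2, 0.9, 0.4]
clique.winner_take_all()   # settles on the single most active unit
```

Simulating at this granularity keeps the control layer explicit, which is what makes the configurations of unit-sets amenable to the genetic encoding of Section 10.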
The static links are used to implement the architecture and should be thought of as data paths. The modifiable links are used to store knowledge.

7. Evolving functional structure

Our methodology in studying distributed connectionism lends itself to answering questions pertaining to adaptation at the architectural level. If we assume a set of architectural building blocks, such as winner-take-all networks, hidden layers, conjunctive coding networks, and unit-sets for individual relations and symbols, then we can ask whether configurations of these building blocks convey some adaptive advantage over other configurations, and what configurations of building blocks some principled model of evolution leads us to postulate. In order to discuss evolution of functional structure, we need an architecture from which to start developing that structure. An example of such an architecture is given in Figure 4. The task for the network is to learn to reproduce the patterns from the training set on the output units when excited by noisy versions of those patterns on the input units. If the training set is composed of non-linearly separable vectors then this is clearly a problem that requires hidden units. These hidden units are where we shall try to learn functional structure. If we start with the network as described above, using a learning procedure such as back-propagation (Rumelhart et al. 1986), the hidden units will learn to form features from the input patterns, and the output layer will learn to reproduce those patterns based on the features formed by the hidden layer. The features that the network learns will depend on the width (i.e. number of units) of the hidden layer and to a large degree on chance, which is based on the order of presentation of the training set and the initial settings of the weights.
Figure 4 - Simple auto-associative network (noisy training data feeds the input layer, which connects through modifiable weights to a hidden layer and then through modifiable weights to an output layer; clean training data is used for teaching)

As was shown in (Sutton 1986), search in the weight space is slowed by strict gradient descent. Because gradient procedures "creep" along a "bumpy" error surface they are not likely to come to rest with the same features that humans have. They are more likely to stop in some "pot hole", and if that pot hole allows the network to learn the training set without errors, the learning algorithm will not get out. Also there is no reason to expect that the features that people see in the training data are better than the "pot holes" in terms of the error surface. We believe that general intelligence does not emerge from a single network of interconnected units. In order to learn large symbolic structures of the type that people use, specific architectures will be required. Such networks would have fewer stable states but would, we hope, generalize faster and would reach generalizations that are closer to those that people make. Unfortunately, there currently are no algorithms along the lines of back-propagation that learn functional structure. There is, however, a process in nature, evolution, which does yield functional structure. There is also a class of algorithms, genetic algorithms (Holland 1975), which are idealizations of evolution and have been shown to be effective for high-dimensional, non-linear adaptive problems. These algorithms have been applied to such problems as adaptive control (see (DeJong 1985) for a summary) and the traveling salesman problem (Goldberg and Lingle 1985; Grefenstette et al. 1985).
As a first attempt at search in the space of configurations of functional units we used an algorithm that is a degenerate form of a genetic search algorithm, which uses a population of 2, one genetic operator, mutation, and for each generation chooses the most fit structure and a mutation of that structure for the next generation. In terms of Holland's characterization of reproductive plans (1975), this one is³:

R1(Pcrossover = 0, Pinversion = 0, Pmutation = 1, <1>)

Even with this simple degenerate case of genetic algorithms we were able to get improved performance from the network. This seems to indicate that for some problems, such as the one described in Section 9, search at the level of configurations of functional units is very important. Search for network structure at the link level, such as that demonstrated in (Ackley 1985), will yield learning that is too slow to operate on an evolutionary time scale. Also, this type of learning will yield systems which do not have intermodular transparency. It is better to assume a basic level of architectural organization, and then search for a solution at that level. Thus, we can evolve systems that are highly fit. This approach also agrees with the basic assumption of genetic search, that search occurs in the space of the genotype (genetic code), not the phenotype (physical manifestation of the organism). To require that individual links are coded for in the genotype would mean that the genetic code is the size of the network, and this is clearly not the case with people. The proper measure of fitness for an architecture is how well it learns the training set. A fit architecture must be able to deal with noisy inputs, should be stable, and should learn fast. In addition, an architecture should be flexible enough to be put in a different environment from that which formed it and still perform well.
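The degenerate reproductive plan, a population of 2 with mutation as the only operator, is ordinary hill climbing and can be sketched generically. The fitness function used here is a toy one-max problem of ours, not the network-learning fitness of the paper:

```python
import random

def degenerate_genetic_search(initial, mutate, fitness, generations):
    """Population-of-two genetic search: each generation pits the current
    structure against one mutation of it; the fitter structure survives."""
    best, best_fit = initial, fitness(initial)
    for _ in range(generations):
        challenger = mutate(best)
        challenger_fit = fitness(challenger)
        if challenger_fit > best_fit:
            best, best_fit = challenger, challenger_fit
    return best

# Toy usage: maximize the number of 1-bits in a string of length 8.
random.seed(0)

def flip_one_bit(bits):
    i = random.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

result = degenerate_genetic_search([0] * 8, flip_one_bit, sum, generations=500)
```

Swapping in a larger population with crossover and differential reproduction recovers the general reproductive plan; the degenerate form trades that search power for a directly interpretable evolutionary path.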
In our experiment we used a measure of fitness for a structure A in an environment E as below:

f(E, A) = 1 / (a·e + b·e_T + 1)

where T is the time the network is allowed to learn, e is the total number of errors the network makes, and e_T is the number of errors the network produces at the end of a run. The constants a and b are used to weight these two measures of network efficiency, and in the experiment described in Section 9 they are both set to 1.

8. An architecture for schema processing

In this section we present an architecture for schema processing that performs the functions of schema recognition, instantiation and role binding. The central component of the architecture is the working memory. The working memory is a set of units that conjunctively code sets of relations, such as the one in Figure 3. The full architecture is shown in Figure 5. The schema memory performs the functions of recognition and instantiation. The procedural memory could be implemented in a number of ways: as a connectionist production system (Touretzky and Hinton 1985), as a classifier system (Holland and Reitman 1978), or as a symbolic production system. The full architecture is described more fully in (Dolan and Dyer 1987).

Figure 5 - Schema processing architecture (procedural memory, schema memory, and working memory)

Figure 6b shows the details of the schema memory. The memory is composed of many winner-take-all cliques. Each clique discriminates a mutually exclusive set of possible schemata based on the contents of working memory. In addition to recognizing schemata, each unit in the cliques also excites the units in working memory to instantiate the remainder of the schema. In addition to the connections to the working memory, the cliques are connected to each other. These links implement inheritance and discrimination. Part of the tree in Figure 6a is shown in the schema memory of Figure 6b.

³This can also be characterized as simple hill climbing.
Strong positive weights, shown in the figure as thick black lines from the sub-class units to super-class units, implement inheritance. Weak links, shown in the figure as thin black lines from the super-class units to the sub-class units, implement discrimination. Once a super-class is active, it slightly activates all of its sub-class units in the same clique. If there is enough additional evidence in working memory to activate any of the sub-classes, the one with the most evidence wins. This structure can also implement exception handling. To override a default, a strong negative weight, shown in the figure as a thick gray line, is formed between the sub-class and the super-class to be overridden.

Figure 6 - Example schemata in memory ((a) a schema hierarchy; (b) the corresponding schema memory, with excitation to and from working memory)

It is clear that this organization can be "wired up" to recognize and instantiate schemata according to their symbolic definitions. What is not clear is how plausible this architecture is. The question we must answer is how this architecture got there. In the next section we describe an experiment in simulated evolution that shows that this architecture is very plausible.

9. An experiment in the evolution of structure

To test the plausibility of this architecture, we designed an experiment to see if simulated evolution would arrive at anything similar to the architecture in Figure 6b. In genetic search algorithms, a genotype is defined from which phenotypes can be constructed. A population of the genotypes is then put in a pool and altered using genetic operations such as crossover, inversion, and mutation (Holland 1975). Differential reproduction based on estimated fitness, combined with recombination using crossover, allows the system to search through the space of coadapted alleles with provable efficiency. In our experiment we used a population of two, a single organization of the network and one mutation of it, and determined which was most fit.
The winner of the competition moved on to the next generation. The reason for not using a large gene pool is that we were only interested in finding a plausible evolutionary path.

9.1 Experimental design

The unit of mutation in our experiments is the winner-take-all clique. Within each clique, only the unit with the strongest activation is allowed to fire. In addition, each unit in a clique has an adaptive threshold. If the unit does not fire often enough, it lowers its threshold; if it fires too often, it raises its threshold. A unit decides whether or not it is firing "enough" based on the number of reinforcement cycles the network has gone through since it last fired. The effect of the adaptive threshold is to make each clique try to maximize its information-carrying capacity. If a unit is either not firing at all or firing all the time, it is not carrying any information. It would be easy to define a genotype for this architecture by letting a certain number of genes code for the number of cliques and other genes for the relative distribution of clique sizes. However, for this experiment we define four types of mutation directly on the architectural phenotype[4]: (1) splitting a clique, (2) merging two cliques, (3) deleting a clique, (4) adding a new clique with two units. Since cliques also have modifiable connections among each other, these four mutations allow the system to form a wide variety of architectures.

[4] Because the mutations are performed at the level of functional blocks, this can be considered as search in the space of the phenotype. Section 10 details a representation that allows search directly in the space of the genotype.

To ensure that the system does not simply learn an architecture that is tuned to a specific input set, for each generation we use a new training set. The only thing in common among all training sets is that they are all tree structured. For example, we might want a network that
was able to handle trees of depth 3 and width 4, so each generation we generated a random tree of depth 3 where the branching factor varied from 0 to 4 at each node. For example, some of the training set for the tree shown in Figure 6a is shown in Figure 7.

Feature vector (A B C D E F G H I J)    Example
1. (1 0 1 0 0 0 0 0 0 0)                C
2. (0 1 0 1 0 0 1 0 0 0)                D
3. (0 0 0 0 0 1 1 0 1)                  J

Figure 7 - An example of hierarchical features

For example, to represent a member of the class C as in test vector 1, turn on bit C and turn on bit A, because A is a super-class of C, but never turn on bit B, because B and C are in mutually exclusive sub-classes of A. Likewise, to represent a member of class F, we turn on the bits on the path from F to the root: F, B, and A. To represent a member of class D, as in test vector 2, we turn on the bits for D, B, and G but not A, because there is a cancellation link from D to A. A set of features such as these were presented to a network such as the one in Figure 4, where the hidden layer was structured as a set of cliques. In addition, noise was added to the training vectors by flipping bits. The average number of bits flipped was proportional to the depth of the trees being learned. For 2-level trees it was 1 bit; for 4 levels, 2. The experiments started with a hidden layer with just one clique. The clique was large enough to discriminate among a set of feature vectors from an "average bushy" tree of the desired depth. In this case we were trying to evolve an architecture for learning trees of depth three. The rationale for starting with a single large clique is that this structure works extremely well in an environment with no noise and with a training set whose size perfectly fits the width of the hidden layer. We used learning to evaluate the fitness of the various architectures. Competitive learning (Rumelhart and Zipser 1986) was used to modify the weights from the input to the hidden layer and the weights among the cliques.
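The path-to-root encoding above can be sketched as follows. The class tree and the cancellation link here are a small hypothetical fragment in the spirit of Figure 6a, not the paper's exact data.

```python
# Hypothetical class tree: child -> parent links, plus a cancellation
# link that suppresses inheritance of an ancestor (here D cancels A).
PARENT = {"B": "A", "C": "A", "D": "B", "E": "B", "F": "B"}
CANCEL = {("D", "A")}
FEATURES = ["A", "B", "C", "D", "E", "F"]

def feature_vector(cls):
    """Turn on the bit for cls and for every ancestor on the path to
    the root, except ancestors blocked by a cancellation link."""
    bits = {f: 0 for f in FEATURES}
    node = cls
    while node is not None:
        if (cls, node) not in CANCEL:
            bits[node] = 1
        node = PARENT.get(node)
    return [bits[f] for f in FEATURES]
```

With this fragment, feature_vector("C") sets A and C but never B, and feature_vector("D") sets D and B but not A, because of the cancellation link.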
Hebbian learning (Hebb 1949) was used to modify the weights from the hidden layer to the output layer. We wanted to select for three features of the learning performance: speed, stability, and accuracy. To select for speed, we only allowed the networks ten iterations through the training data. Considering the fact that noise was introduced, this was not a large number of trials. To select for stability and accuracy we kept track of two quantities during a given generation: the total number of errors a network made and the current performance. At the end of ten iterations, the networks were compared; if a network had lower scores on both counts it moved on to the next generation. If the decision was split, then it was settled at random.

9.2 Results

By watching the performance of various organizations we were able to derive some rules of thumb for predicting the fitness of an architecture. These observations are not quantitative, but serve to illustrate why one particular organization had a selective advantage over another. Flat organizations (i.e. a single, wide clique) are extremely efficient. However, they are unstable and very susceptible to noise. Very often, noise will force the competitive learning algorithm to change the encoding from the input layer to the hidden layer, and this will undo all the learning performed between the hidden and output layers. It is easy to construct an organization of binary cliques that exactly mirrors the hierarchy. This structure is efficient and extremely stable, but if the tree changes substantially in the next generation, a finely-tuned structure will not perform well. Organizations with lots of structure do extremely well. A good mix of various clique sizes allows the organization to perform well in a wide range of environments (i.e. all trees of depth three and branching factor 0-4).

9.3 Analysis

In general we found that networks tended to mutate towards increasing complexity.
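The two-way tournament of Section 9.1 can be sketched as below; the score pairs and the function name are our own framing of the comparison rule.

```python
import random

def winner(score_a, score_b, rng=random.Random()):
    """Each score is a (total_errors, final_errors) pair. The network
    that is no worse on both counts moves on to the next generation;
    a split decision is settled at random, as in the experiment."""
    a_wins = all(x <= y for x, y in zip(score_a, score_b))
    b_wins = all(y <= x for x, y in zip(score_a, score_b))
    if a_wins and not b_wins:
        return "a"
    if b_wins and not a_wins:
        return "b"
    return rng.choice(["a", "b"])
```

Only dominance on both error counts is decisive; a network that trades total errors against final errors faces a coin flip.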
Since flat organizations are not particularly robust, a mutation in which a new clique is added will increase stability and will therefore be selected for. The mutation that adds another small clique gives only a small advantage, if any, but allows the organization to mutate to one in which two small cliques combine. This yields a large advantage over multiple small cliques or one single, wide clique. One unfortunate feature of this method is that the environment can kill off a promising line of evolution by accident. As a network is growing into a very complex organization, with many cliques of various sizes, if the training set for a particular generation is very bushy it may overload the capacity of the network to learn the structure of the tree. In these cases, a flat architecture with the same number of units will be more fit, since it can encode one training example on each unit of the hidden layer. This is a case of a representational boundary as described in (DeJong 1985). Because of that fact, the first architectures to evolve are often large, flat, single-clique architectures. Once these cliques split, however, they form very robust hierarchical architectures. It seems quite difficult for the system to develop hierarchical architectures from the start by incrementally adding more and more cliques. The reason is that the intermediate stages of such a development are subject to extinction by difficult environments.

10. A proposal for genetic search in the space of functional structure

The primary problem in any genetic algorithm is choosing the representation. DeJong (1985) has identified several issues that need to be addressed in choosing a representation. One is the problem of schemata[5] that confer above-average performance: (1) do they exist?, and (2) are they at all preserved by the genetic operators?
A second one is the problem of representational boundaries: are there representations which confer above-average performance, but which cannot be improved upon without a radical change in genotype? Although it is possible to devise such an encoding for the simple structures described in Section 9, it is difficult to devise a fixed encoding which would be capable of describing structures such as the one shown in Figure 5. A solution to these problems may be found in adaptive codings as suggested in (Holland 1975). In that formulation, complex productions are expressed as strings of a ten-member alphabet, which has a rather complex interpretation scheme, and a reproductive plan is run on that representation. A similar, but more elegant, formulation is found in (Holland and Reitman 1978), in which each production is a bit string from the alphabet {1,0,#}, where # stands for "don't care". Productions are matched against, and post bit-patterns to, a blackboard whose contents also determine the actions of the system (in our case, which unit-sets to create and how to connect them). A further refinement of this formulation, called classifier systems, along with the bucket-brigade algorithm (Holland et al. 1987), allows credit (fitness) to be apportioned among a chain of productions responsible for producing a structure of above-average fitness. The genetic algorithm is then run either by treating an entire set of productions as a genotype, as in (Smith 1983), or by using each production as a genotype, as in (Holland and Reitman 1978). In the second option the production system is treated as a population and the strength of a rule is its measure of fitness. This style of representation holds great promise for our task because these classifier systems can be used to compute almost anything (anything, if we give them an infinite blackboard). However, we must be cautious of DeJong's pitfalls for genetic algorithms (1985).
To begin with, we structure the condition and action parts of a production's bit vector with the following fields:

(block-type, x1, y1, z1, unit-type, projection-type, x2, y2, z2, link-type, tag)

[5] The word schema is used in two different ways in this paper, to indicate a complex symbolic structure and to indicate a set of alleles on a genotype as in (Holland 1975). All previous references have been to the first meaning. All subsequent ones shall be to the second.

This indicates a functional block of a certain type (e.g. relation vector, symbol vector, relation cube, or hidden layer of a particular size) at location (x1, y1, z1), with a certain type of units (e.g. linear threshold, with or without refractory period, etc.), projecting with a certain pattern (e.g. random, one-to-one, or layered) to a block at location (x2, y2, z2), using a link with a particular learning rule (e.g. reinforcement or back-propagation). This same format is used both to test for blocks and to assert the existence of blocks onto the blackboard. The tag field can be used to link arbitrary productions. Now we can attempt an informal evaluation of this representation in terms of DeJong's metrics. Taking the position that either individual productions or the entire system is a genotype, we can see that the position codes constitute fairly robust schemata. However, they convey no particular advantage until combined with a particular block type, and they make even stronger schemata when combined with a projection to another location. For example, a production with #'s in the positions for both sets of (x,y) coordinates would create a layer of identical unit sets, each one connected to the unit set either above or below it depending on the values of the z coordinates. Now, however, we are talking about schemata almost as long as the genotype if we use individual productions as the population. Clearly, we need to use the first option, i.e. the entire production system as a genotype, to make such a linkage viable.
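A condition in this format, with # as "don't care", is matched against blackboard messages as in classifier systems. The field layout below follows the paper's list, but the concrete field values in the usage are hypothetical.

```python
# Fields of a production's condition/action vector, per the paper:
# (block-type, x1, y1, z1, unit-type, projection-type,
#  x2, y2, z2, link-type, tag)

def matches(condition, message):
    """A condition matches a blackboard message when every field is
    either equal or the '#' wildcard."""
    return len(condition) == len(message) and all(
        c == "#" or c == m for c, m in zip(condition, message))
```

A production with #'s in both coordinate pairs matches blocks at any position, which is what makes the position codes such strong schemata when combined with a block type.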
With this method there is also a possibility that the message field can be used to form linkages between productions. These linkages, however, will be extremely weak unless we can use inversion to strengthen them. This representation looks like it will have interesting representational boundaries. First, it is not clear whether most random genotypes will even produce a configuration with two connected functional blocks. Second, once there is a set of productions represented in the population which produce a structure with a certain fitness, it is not likely that an operation which modifies those productions will result in improved fitness. It is much more likely that it will completely inhibit the creation of the structure and result in much lower fitness. This conforms very well to our notion of how epistasis[6] should work in functional evolution. It is much more likely that productions will be found which add other structures on top of the ones already found to be useful. This is similar to the situation found in the evolutionary development of the brain. The structures found in higher organisms are built on top of those found in lower organisms (Kolb and Whishaw 1980).

[6] Epistasis is the phenomenon, in genetics, of a reaction that depends on several enzymes and will not proceed until all the enzymes are present.

In order to test this representation, we propose three steps. First, we will continue working with hand-coded networks of functional units, but restrict our attention to networks that can be compactly represented by a production system. Second, we will take those same representations and see what other systems are constructible from them, since classifier systems can be non-deterministic (when a classifier matches more than one element of working memory, or when more than one classifier matches).
Last, using a corpus of symbolic manipulation tasks, such as story comprehension and question answering, we will test the ability of a genetic algorithm to construct neural-like networks which can process symbolic structures.

11. Current status

The architecture for the schema memory is currently implemented in Scheme on a workstation. The model has been integrated with a long-term schema-based memory and a procedural memory implemented serially at the bit-level. The system can currently understand fragments of one-paragraph stories. The largest network we have simulated so far stores six schemata with approximately seven relations each. The relations are represented with eight bits for each symbol. A more complete description of this system can be found in (Dolan and Dyer 1987).

12. Conclusions

We have found that search in the space of configurations of sub-networks can yield a network that learns human-like generalizations. The resulting architecture learns fast, and forms the same type of hierarchical structures that are formulated as symbolic models in humans. By using this approach we can now evaluate various symbolic models at the micro-feature level, to see if their symbolic representations hold up under a detailed analysis. We now have a hope of using true genetic search in the space of configurations to find architectures for general intelligence. This would probably not be the case had we not shown that simple hill climbing in a simple case was able to yield a network that out-performed the simplest flat network.

13. References

Ackley, D. H. (1985). A Connectionist Algorithm for Genetic Search. Proceedings of the First International Conference on Genetic Algorithms and their Applications, 121-135.

Barto, A. G., Sutton, R. S., and Brouwer, P. S. (1981). Associative search networks: A reinforcement learning associative memory. Biological Cybernetics, 40, 201-211.

Cullingford, R. E. (1981). SAM, in R. C.
Schank and C. K. Riesbeck (Eds) Inside Computer Understanding: Five Programs Plus Miniatures. Lawrence Erlbaum Associates.

DeJong, K. (1985). Genetic Algorithms: A 10 Year Perspective. Proceedings of the First International Conference on Genetic Algorithms and their Applications, 169-177.

Dolan, C. P. and Dyer, M. G. (1986). Encoding Knowledge for Planning, Learning, and Recognition, in Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 488-499.

Dolan, C. P. and Dyer, M. G. (1987). Symbolic Schemata in Connectionist Memories: Role Binding and the Evolution of Structure. UCLA AI Laboratory Technical Report, UCLA-AI-87-11.

Eilbert, J. L. and Salter, R. M. (1986). Modeling Neural Networks in Scheme. Simulation, 46(5), 193-199.

Goldberg, D. E. and Lingle, R. (1985). Alleles, Loci, and the Traveling Salesman Problem. Proceedings of the First International Conference on Genetic Algorithms and their Applications, 154-159.

Grefenstette, J. J., Gopal, R., Rosmaita, B. J., and Van Gucht, D. (1985). Genetic Algorithms and the Traveling Salesman Problem. Proceedings of the First International Conference on Genetic Algorithms and their Applications, 160-168.

Hebb, D. O. (1949). The organization of behavior. Wiley.

Hinton, G. E. (1981). Implementing Semantic Networks in Parallel Hardware, in G. E. Hinton and J. A. Anderson (Eds) Parallel Models of Associative Memory. Lawrence Erlbaum Associates.

Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. (1986). Distributed Representations, in D. E. Rumelhart and J. L. McClelland (Eds) Parallel Distributed Processing, Volume 1. MIT Press.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press.

Holland, J. H. and Reitman, J. S. (1978). Cognitive Systems Based on Adaptive Algorithms, in D. A. Waterman and F. Hayes-Roth (Eds) Pattern-Directed Inference Systems. Academic Press.

Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. (1987).
Induction: Processes of Inference, Learning, and Discovery. MIT Press.

Kolb, B. and Whishaw, I. Q. (1980). Fundamentals of Human Neuropsychology. W. H. Freeman and Company.

McClelland, J. L. and Kawamoto, A. H. (1986). Mechanisms for Sentence Processing: Assigning Roles to Constituents, in J. L. McClelland and D. E. Rumelhart (Eds) Parallel Distributed Processing, Volume 2. MIT Press.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning Internal Representations by Error Propagation, in D. E. Rumelhart and J. L. McClelland (Eds) Parallel Distributed Processing, Volume 1. MIT Press.

Rumelhart, D. E. and McClelland, J. L. (1986). Parallel Distributed Processing, Volume 1. MIT Press.

Rumelhart, D. E. and Zipser, D. (1986). Feature Discovery by Competitive Learning, in D. E. Rumelhart and J. L. McClelland (Eds) Parallel Distributed Processing, Volume 1. MIT Press.

Rumelhart, D. E. and McClelland, J. L. (1986). On Learning the Past Tenses of English Verbs, in J. L. McClelland and D. E. Rumelhart (Eds) Parallel Distributed Processing, Volume 2. MIT Press.

Schank, R. C. (1982). Dynamic memory: A theory of reminding and learning in computers and people. Cambridge University Press.

Sejnowski, T. J. (1987). From Signals to Symbols in Connectionist Networks. Lecture at UCLA, May 1987.

Smith, S. F. (1983). Flexible learning of problem solving heuristics through adaptive search. Proceedings of the Eighth International Joint Conference on Artificial Intelligence, 422-425.

Sutton, R. S. (1986). Two Problems with Back Propagation and Other Steepest-descent Learning Procedures for Networks. Proceedings of the Eighth Annual Conference of the Cognitive Science Society.

Touretzky, D. S. and Hinton, G. E. (1985). Symbols Among the Neurons: Details of a Connectionist Inference Architecture.
Proceedings of the Ninth International Joint Conference on Artificial Intelligence, 239-243.

SUPERGRAN: A connectionist approach to learning, integrating genetic algorithms and graph induction

G. Deon Oosthuizen
Dept. of Computer Science, University of Strathclyde
26 Richmond Street, Glasgow G1 1XH, Scotland

Abstract

The operation of genetic algorithms is based on the recurrent selection of strings of values according to their usefulness to the system and the subsequent creation of new strings, by application of genetic operators, in an effort to obtain more appropriate strings. A learning algorithm, based on a connectionist approach to knowledge representation, can be used to supervise and enhance the application of genetic operators on the basis of analysing the underlying features of the strings. The genetic algorithm generates new strings. The learning algorithm utilises the strings to induce new concepts (schemata) and these are used to advise the genetic algorithm, which in turn produces new strings. Combining the learning capabilities of the two components results in a versatile system that is able to digest eagerly and adapt gracefully.

Introduction

The application of genetic algorithms has already delivered very interesting results in various areas, including optimization [Grefenstette 1986], cognitive modelling [Holland & Reitman 1978], and classification and control [Holland 1986]. Genetic algorithms have two inherent characteristics that make them highly suitable for learning by discovery: on the one hand, sufficient variability is maintained to prevent convergence to local optima. On the other hand, categorization and recombination yield powerful implicit parallelism. Yet, the overall process is based on the appropriateness of strings. The underlying relations and dependencies between features are not dealt with directly in the generation of new strings. We employ another learning method, the GRAND algorithm for GRAph iNDuction[1]
developed by the author [Oosthuizen et al. 1987b], to supplement the genetic operators by providing knowledge about the underlying relations and dependencies between features. The aim is to use "deep" knowledge to direct the application of the genetic operators and thereby to enhance the performance of the overall process. Since we are concerned with learning systems, we shall devote our attention to the application of genetic algorithms to the class of message-passing rule-based systems called classifier systems [Holland 1986], also referred to as Holland classifiers [Schrodt 1986]. We shall use the term classifier system to refer to a total learning system consisting of a rule base, an apportionment of credit algorithm and, in particular, a genetic algorithm. Although we shall use the terminology employed in that context, the ideas described here are applicable to the broader class of optimization procedures called genetic algorithms. Thus, our use of the term string refers to the members of the population, applying to the condition/action part of a classifier (or their concatenation), but also to a structure or vector in other optimization procedures. In this paper we describe how, by combining the inductive capabilities of genetic algorithms and graph induction, we obtain a competent learning system. We describe the system by first looking at the role of schemata in classifier systems. We then describe the method of graph induction in the context of classifier systems. Subsequently, we introduce the idea of a supervised classifier system. Finally, we highlight the characteristics of the integrated system, called SUPERGRAN, as well as some crucial implementation considerations.

[1] Graph induction = induction by graphic representation and graph manipulation.

Schemata

Schemata play a central role in the application of genetic algorithms.
They are used as building blocks from which new strings are constructed, and are also used (as tags) to facilitate sophisticated transfer of knowledge from one situation to another [Holland 1986]. Schemata are basically generalized descriptions of categories of strings. A schema is expressed as a string containing particular symbols in particular positions, indicating that all strings in that category have similar symbols in the equivalent positions. Unspecified positions, i.e. positions that "don't matter", are filled with *'s. In classifier systems, schemata define subsets of the space of possible conditions or actions, i.e. subsets of the space of possible strings. A schema thus constitutes a characteristic description for a set of strings. We shall merely refer to a schema as a description of a set. The schema also serves as an identifier for a set of strings. We shall therefore sometimes just speak of a schema when referring to the set it represents. In the method described here, schemata are employed in an additional role. Information (already captured in schemata) regarding the composition of the "best" strings in the current population is used directly, and forms the basis for concept generation and clustering. The kind of string categorization described here is based on the combination of features present in schemata and strings, and is therefore different from the use of tags, which also results in schema classification/string clustering. Tags relate schemata on the basis of the temporal sequences of categories [Holland 1986]. We consider the non-temporal association of schemata, based on their inherent properties.

Graph Induction

Graph induction evolved from research regarding the connectionist approach to knowledge representation.
The method thus originated against the background of a study of the learning capabilities of connectionist architectures, i.e. systems in which connections - rather than memory cells - are the principal means of storing information. Such systems employ vast networks of very simple processors, functioning in a massively parallel way. This accounts for the graphic basis of our method. We will expand on the significance of the connectionist approach in a later section. We now describe the method in the context of its application to classifier system schemata. Each individual position in a string (genotype) can assume more than one value (allele - taken from the standard terminology of genetics). Each allele can be considered to define an entire category of its own, i.e. the class of all strings in the population containing allele x in position y, say. Similarly, each allele can be used to identify a set of strings. If we compute the intersection of two such sets, we obtain a third set which is identified by the co-occurrence of two alleles. This third set can be conveniently described by a schema containing only the two alleles involved and *'s in all other positions. We can now represent this situation graphically as illustrated in fig. 1. An upward arrow represents a subset relationship. If two arrow-ends meet, an intersection is implied. Fig. 1 represents the fact that the set identified by the schema 01* contains the intersection of the sets identified by the schemata *1* and 0**. Fig. 2 shows that the sets identified by the schemata *0* and *1* (we shall refer to them as 1-allele schemata) are contained in the set ***. Fig. 3 shows some other intersections that may be formed. This notation can be extended to represent any combination of any number of alleles, i.e. the intersection of any

[Fig. 1 - 01* contains the intersection of *1* and 0**; * = string value may be # (unspecified), 0 or 1; # = string value must be unspecified]
[Fig. 2 - the sets *0* and *1* are contained in ***]
[Fig. 3 - some other intersections, e.g. among 00*, 01* and 10*]
group of sets. [Fig. 4 - a distributed intersection (a) transformed so that the intersection is contained in a single, newly created schema (b)] These newly obtained sets can in turn be intersected at random to form further new sets, each described by its own schema. Thus we obtain a graph consisting of tangled hierarchies of schemata, categorizing groups of strings. The sets thus created contain strings from the population. Some schemata may therefore represent empty sets. Our convention is to create and retain schemata (nodes in the graph) only when they are needed, i.e. for non-empty sets. At the lowest level, every member of the population is assigned to its own set, i.e. a special kind of schema containing no *'s. This schema can then be accurately described in terms of the intersection of k other sets, where k is the length of the string. These k sets will belong to the abovementioned level of 1-allele schemata. The result is a two-level graph: the top level containing the 1-allele schemata and the bottom level containing the k-allele schemata. In essence the GRAND algorithm now does the following. The graph structure is restricted in such a way that it adheres to the rule that the intersection of any two sets in a graph must be contained in a single set. This means that a configuration like the one in fig. 4a (referred to as a distributed intersection) is changed to the one in fig. 4b. This results in a policy of consistent data factoring. If the transformation is applied to fig. 5, the configuration of fig. 6 is obtained. Taking into account the fact illustrated in fig. 2 and applying the basic principles of set theory, we find that fig. 6 can once again be transformed, resulting in fig. 7:

000 U 100 = (0** n #00) U (1** n #00)   (fig. 6)
          = (0** U 1**) n #00
          = *** n #00
          = #00   (as in fig. 7)

The significance of the GRAND algorithm is that it maintains a regime of maximal schema integration: maximally common schemata (sets) are identified and strings are clustered together accordingly.
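The set operations that drive GRAND can be sketched over schema strings. Here intersect captures the meeting arrows of fig. 1, and generalize produces the common super-schema seen in the 000 U 100 = #00 factoring; both function names are ours.

```python
def intersect(s1, s2):
    """Intersection of two schemata over {0, 1, *}: positions combine
    where compatible; None signals disjoint sets (conflicting fixed
    positions)."""
    out = []
    for a, b in zip(s1, s2):
        if a == "*":
            out.append(b)
        elif b == "*" or a == b:
            out.append(a)
        else:
            return None
    return "".join(out)

def generalize(s1, s2):
    """Most specific schema covering both strings: keep agreeing
    positions, mark differing ones '#' (unspecified)."""
    return "".join(a if a == b else "#" for a, b in zip(s1, s2))
```

For example, intersecting *1* with 0** yields 01*, exactly the containment shown in fig. 1, and generalizing 000 with 100 yields #00, the node under which fig. 7 clusters them.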
By close inspection one finds that the application of the above transformation rule corresponds to the generalization rules introduced by Michalski in his theory of inductive learning [Michalski 1983]. The different generalization rules described by Michalski manifest as different variants of semantic network configurations. This is an interesting result, because it implies that the application of the above transformations to the network structure represents learning. As changes occur in the world of interest, nodes and relations are constantly added and removed, transformations are continuously triggered, and new schemata (concepts) are formed. Consequently, we obtain what could be described as a "self-learning" network.

[Fig. 5 - three feature positions with their possible values, containing a distributed intersection]
[Fig. 6 - the configuration of fig. 5 after factoring]
[Fig. 7 - the strings 000 and 100 clustered under the schema #00]

Graph induction has been applied to published examples as processed by packages based on Quinlan's ID3 algorithm for learning from examples, and produced similar results. (In some cases the results were better.) It has also been shown [Oosthuizen 1986] that in principle they imply an extension of Lebowitz' "similarity-based generalization" method of learning from observation [Lebowitz 1986]. Thus, GRAND embodies various inductive learning strategies, in particular learning from examples, learning from observation and conceptual clustering.

Supervised Classifier System

The application of the GRAND algorithm makes a graph structure highly sensitive to similarities between strings. The consequent effect is the rapid identification of any such similarities and the creation of new categories characterizing these similarities. Thus, we obtain an automated string classification and clustering mechanism. Employing this mechanism, string "fitness" can now be judged on more general "feature" levels rather than on the string level itself.
The objective is to discover more easily, and to make it more transparent, what aspect of a string it is that makes it a "good" string - and then to be able to concentrate on that aspect. A similar argument holds for the "bad" aspects of a string. The above effects can be obtained as follows: for each schema set Q, we store its current number of members, as well as the current average value of v(Ci), the "fitness" of the individual strings Ci that are members of Q. In each case, the stored value pertains to the terminal nodes (i.e. strings) of the tree below a particular node (schema), of which that node is the root. However, not all the strings in the population are taken into account. To improve the efficiency of the method, only the best performing 50%, for example, of the population is used. As strings in this best performing group are replaced by new ones, some schemata might eventually become redundant on the basis of not having any members, and are then removed from the graph.

Enhancement of crossover operation

Let us say two schemata exhibit above average "strength", neither is a subset of the other (like schema B and schema E in fig. 8), and both represent an adequate number of strings. Such a situation would indicate that the two schemata represent two clusters of strings. It would now be sensible to apply the crossover operation to strings within the same schema (cluster) in the hope of discovering stronger variations within this schema. This corresponds to the situation depicted in fig. 8.

[Fig. 8 - part of an induction graph containing schemata A, B and E, with A a subset of B]

If av(A) > av(B), where av(Q) is the average strength of the strings in Q, it means that the extra features in A actually improve the "fitness" of its strings. Since these strings are also incorporated in B, it means that some of the strings in B' (= B - A), i.e. those in B which are not in A, did not fare so well. We now use a supervised mutation operation to combine the extra features present in A with those of the strings in B'.
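The per-node bookkeeping described above can be sketched as below. Storing the member count and average fitness per schema is the paper's idea; the function names and the {0, 1, *} matching are our framing.

```python
def members(schema, population):
    """Strings from the population matched by a schema over
    {0, 1, *}, where '*' matches any value in that position."""
    return [s for s in population
            if all(a == "*" or a == b for a, b in zip(schema, s))]

def strength(schema, population, fitness):
    """Average fitness av(Q) of the schema's member strings; None
    for an empty set, marking the node as removable from the graph."""
    group = members(schema, population)
    if not group:
        return None
    return sum(fitness[s] for s in group) / len(group)
```

Comparing strength(A, ...) against strength(B, ...) for nested schemata A and B is exactly the av(A) versus av(B) test that triggers the supervised operations.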
Thus, we are spreading proven qualities in a controlled way to a larger population. The same effect can be obtained by applying the pattern-crossover operation to strings in A and B, respectively. Since the characteristic features of B are contained in A, this operation will have no effect on strings in A. If the opposite situation holds, i.e. av(B) > av(A), then similarly, supervised mutation can be used to create new strings not having the distinguishing features of A. This might give rise to other schemata under B, i.e. specializations of B, if the newly introduced features correspond to others already present in another string in B. Clearly, the application of the above operations should be dependent on a defined minimum value for D, where D = abs(av(A) - av(B)), the difference of the average fitness of strings in A and B. In other words, if the superiority of one of the schemata is not evident, the operation might be superfluous.

Application

It is envisaged that the standard genetic operators (mutation, inversion, etc.) would be applied as usual to maintain the variability of the population and thereby to guard against overemphasis of a given kind of schema. The supervised operations described above should be applied in a constrained manner, to prevent distortion of the well balanced search process. The role of the graph induction is seen primarily in speeding up convergence once a possible optimal solution area has been spotted, by applying knowledge obtained from significant discoveries to the immediately subsequent string generations. An appropriate mix of standard and supervised operations should, however, prevent premature convergence to suboptimal solutions.

SUPERGRAN

The account given above amounts to the system represented in fig. 9.
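The per-schema bookkeeping and the threshold test described above can be sketched in a few lines of Python. This is an illustrative sketch only; the names `Schema`, `should_apply_supervised_op` and the threshold value `d_min` are invented here and are not taken from the GRAND/SUPERGRAN implementation.

```python
# Illustrative sketch of the per-schema statistics described above:
# each schema Q stores the fitness values v(Ci) of its member strings,
# and a supervised operation fires only when the difference D between
# two schemata's average strengths exceeds a defined minimum.
# All names and the threshold are assumptions for illustration.

class Schema:
    def __init__(self, name):
        self.name = name
        self.members = []          # fitness values v(Ci) of member strings

    def av(self):
        """Average strength av(Q) of the strings in this schema."""
        return sum(self.members) / len(self.members) if self.members else 0.0

def should_apply_supervised_op(a, b, d_min=0.1):
    """Apply a supervised operation only if the superiority of one
    schema is evident: D = abs(av(A) - av(B)) >= d_min."""
    return abs(a.av() - b.av()) >= d_min

A = Schema("A"); A.members = [0.9, 0.8]
B = Schema("B"); B.members = [0.5, 0.4, 0.6]
print(should_apply_supervised_op(A, B))   # D = 0.35 here, so True
```

Schemata whose `members` list empties out would simply be dropped from the graph, as the text describes.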
However, the concept acquisition and generalization capabilities of GRAND enable SUPERGRAN to conduct two learning strategies in parallel. Whilst the genetic induction mechanism ensures flexibility (internally), SUPERGRAN can also digest information from an external source (see fig. 10). Strings volunteered to the system are added to the population. Such strings are given a high initial strength and are therefore immediately absorbed into the induction graph. Strings volunteered are either specific - corresponding to examples - or general - corresponding to rules. If the added strings are of significance, GRAND, in its advisory role, will trigger immediate adjustments to the mixture of features in the new strings created by the genetic algorithm.

[Fig. 9: genetic algorithm generates strings; strings tested; "best" strings captured in graph; graph induction; classifier system advised. Fig. 10: as fig. 9, but with strings (rules) from an external source added to the population.]

SUPERGRAN is currently being implemented.

Graph Induction and Connectionism

As mentioned earlier, graph induction originated from a study of the knowledge representation capabilities of massively parallel connectionist networks. Although the basic idea behind graph induction - the restructuring of a graph into a "canonical" form - is quite straightforward, its application poses a computational problem, illustrated in fig. 11. Let us say a certain schema A is described by 6 features (e-j) and another schema B is added to the graph. According to the graph induction rule, fig. 11 has to be transformed to the configuration in fig. 12. But recognizing distributed intersections like those at A and B requires the inspection of the conjunction of all possible combinations of pairs of sets intersected by B. If we want to arrive at the configuration in fig.
12, it requires the inspection of all combinations of all higher order conjunctions as well, i.e. of groups of 3 sets, 4 sets, etc. Thus, there is an exponential increase in the number of conjunctions to be inspected as the number of features of B increases. To make things worse, sets may have 'semi-explicit' intersections. Although the sets themselves do not intersect, their indirect descendants may well (see fig. 13). Consequently, the inspection has to be conducted to depths of more than one level. McGregor & Malone [1982a, 1982b] developed a set processing machine that is able to operate on vast quantities of data items (nodes). Subsequently, software as well as hardware implementations (prototypes) have been created as part of the low level functions of the FACT Expert Database System [McGregor & Malone 1983]. The hardware version of the system, known as Generic Associative Memory (GAM), dynamically constructs networks to process set data in a massively parallel fashion, which makes processing speed not highly dependent on the depth of the network. These systems are described in detail elsewhere. The core operation of this set processing system involves the closure of a set (node). The upward closure of a node includes the node itself as well as all nodes "above" it in the network, i.e. following the arrows. The downward closure includes all nodes below. The determination of the intersection of closures is implemented as a hardware function [Oosthuizen et al. 1987a]. As mentioned above, transformations are triggered by the detection of distributed intersections. The GAM closure operation is ideal for the detection of distributed intersections, and especially semi-explicit distributed intersections, in complex networks (i.e. where the distributed intersections exist at levels lower than the immediate descendants of the two nodes involved).
The use of the GAM closure operation in principle eliminates the distinction between explicit and semi-explicit intersections. Because of its inherent parallelism, the closure operation is also virtually independent of the width of the "tree" involved - i.e. whether a particular node has 2 or n descendants (where n >> 2). Furthermore, the intersection of closures of two nodes like A and B in fig. 11 is (1) a primitive operation in GRAND and (2) executed in a minimal amount of time. Although the above-mentioned computation problem does not affect current small scale prototype learning systems, GAM ensures the feasibility of the application of graph induction as a general purpose learning method.

Concluding Remarks

Although dependencies between features (attributes) are dealt with implicitly in the induction graph, we have not made explicit use of them. Functional dependencies can be used to make further inductive inferences. (Functional dependencies can be inferred from mappings between sets - information already captured in the graph.) SUPERGRAN exploits the interaction between two learning methods: the genetic algorithm generates strings and thereby creates examples which are used by the graph induction method to generate schemata and highlight underlying properties. "Champions" are identified among features or combinations of features and used to advise the genetic discovery mechanism on where to look for "better" strings. Thus, a higher degree of coherence is established between the event of a significant discovery, or underlying trend, and the present exploration of the search space. By incorporating the whole population in the induction graph, the induction graph becomes a knowledge base to the system.
In that way, the learning process becomes fully integrated with the knowledge representation function of the cognitive system. (The notation used in the induction graph in fact constitutes a knowledge representation model based on Generic Associative Sets [Oosthuizen et al. 1987b].)

Acknowledgements

I am indebted to the members of the IKBS working group (in particular to Michael Odetayo) for their stimulating discussions and to Jesus Christ, the Origin of all knowledge.

References

Grefenstette, J.J. [1986] Optimization of Control Parameters for Genetic Algorithms. IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-16, No. 1, January/February.

Holland, J.H. [1986] Escaping Brittleness: The Possibilities of General Purpose Algorithms Applied to Parallel Rule-Based Systems. In Machine Learning: An Artificial Intelligence Approach, Vol. 2, Michalski, R.S., Carbonell, J.G., Mitchell, T.M. (Eds.), Morgan Kaufman.

Holland, J.H., Reitman, J.S. [1978] Cognitive Systems Based on Adaptive Algorithms. In Pattern-Directed Inference Systems, Waterman, D.A., Hayes-Roth, F. (Eds.), Academic Press.

Lebowitz, M. [1986] UNIMEM, a General Learning System: An Overview. Proceedings of ECAI-86, Brighton, England.

McGregor, D.R., Malone, J.R. [1982a] Generic Associative Hardware: Its Impact on Database Systems. Proceedings IEE Colloquium on Associative Methods & Database Engines, May 1982.

McGregor, D.R., Malone, J.R. [1982b] Generic Associative Memory. G.B. Patent No. 8236084.

McGregor, D.R., Malone, J.R. [1983] The FACT System - a Hardware-Oriented Approach. In DBMS: A Technical Comparison - State of the Art Report. Maidenhead: Pergamon Infotech, pp. 99-112.

Michalski, R.S. [1983] A Theory and Methodology of Inductive Learning. Artificial Intelligence, Vol. 20, pp. 111-161.

Oosthuizen, G.D. [1986] A Paradigm for Automatic Learning. Internal Report FACT 24/86, University of Strathclyde, Scotland.

Oosthuizen, G.D., McGregor, D.R., Henning, M., Renfrew, C.
[1987a] Parallel Network Architectures for Large Scale Knowledge Based Systems. Proceedings of First Workshop of the British Special Interest Group on Knowledge Manipulating Engines (SIGKME), Reading, England, January 1987.

Oosthuizen, G.D., McGregor, D.R., Malone, J.R. [1987b] The Use of a Simple Connectionist Architecture for Matching and Learning. Internal Report FACT 2/87, University of Strathclyde, Scotland.

Schrodt, P.A. [1986] Predicting International Events. BYTE, November.

PARALLEL IMPLEMENTATION OF GENETIC ALGORITHMS IN A CLASSIFIER SYSTEM

George G. Robertson
Thinking Machines Corporation

Abstract

Genetic Algorithms are the primary mechanism in Classifier Systems [Ju] for driving the selective evolution of rules (learning) to perform some specified task. The Classifier System approach to machine learning, in particular, and Genetic Algorithms, in general, are inherently parallel. Implementations of Classifier Systems and Genetic Algorithms to date have mostly been done on conventional serial computers. Because of this serial bottleneck, researchers have been able to study only relatively small problems. The advent of commercially available massively parallel computers, such as the Connection Machine system [10], now makes parallel implementations of these inherently parallel algorithms possible, and makes it possible to begin studying these algorithms on larger task domains. This paper describes an implementation of Genetic Algorithms in a Classifier System on the Connection Machine.

Classifier Systems and Machine Learning

Classifier Systems represent one basic approach to learning by example. Two other approaches are symbolic rule-based systems and Connectionist Network systems. These approaches, and the inherent parallelism in them, are compared and contrasted with Classifier Systems below.
This provides a framework for machine learning in which to place Classifier Systems, and suggests the ease of implementation for each approach on massively parallel computers.

Symbolic rule-based systems attempt to model high (symbolic) level cognitive processes. A good example is the SOAR problem-solving architecture being studied at Carnegie-Mellon University (Newell), Stanford University (Rosenbloom), and Xerox PARC (Laird) (see [18, 19, 20]). SOAR is an architecture for problem solving and learning, based on heuristic search in problem spaces and chunking. It is based on a modified version of the OPS5 Production System architecture [5]. Although SOAR shows promise as an approach to learning, it does not appear to be a good candidate for massively parallel computers. Gupta [8] has done studies of parallelism possible in symbolic production systems and found that systems with more than thirty or so processors would not be effectively utilized. Most of the traditional AI work in machine learning (see Michalski et al. [22]) also operates at the symbolic level and is perhaps even less easily parallelized than SOAR.

Connectionist Network approaches to learning began with the work on Perceptrons by Rosenblatt [28] (also see Minsky and Papert [24]). New ways of dealing with the problems that they encountered have recently been studied, including work on Boltzmann Machines [1], Back Propagation networks [29, 30, 31], and work by Klopf [17], Barto [2], Grossberg [7], Feldman [4], Hinton [12], and Minsky [23]. These systems model low level neural processes, with long-term knowledge represented as strengths (or weights) of the connections between simple neuron-like processing elements. The Boltzmann Machines introduce a stochastic process to avoid the pitfalls of local minima in learning behavior. Back Propagation networks have successfully been used to learn to recognize symmetries in visual patterns [31], as well as to learn a speech synthesis task [30].
In this latter task a corpus of text and its phonetic translation is used as the example for learning. A fixed initial structure (a three level Connectionist Network) begins with random link and node threshold weights and over time learns the proper weights to translate the text into understandable phoneme streams. These systems are well-adapted to massively parallel systems; in fact, Back Propagation and Boltzmann Machines have been implemented on the Connection Machine system.

Classifier Systems attempt to model macroscopic evolutionary processes with Genetic Algorithms and natural selection. Holland's early work on Genetic Algorithms [15] has recently led to numerous research efforts and application of Classifier Systems to a number of task domains (see [6, 13, 14, 25, 20, ...]). This approach shares some properties with both the symbolic rule-based and Connectionist Network approaches. It is rule-based, but the rules are low level message passing rules (below the symbol level). A symbolic level can be added on top of Classifier Systems (see Forrest's work [6]). Like Boltzmann Machines, Classifier Systems use a stochastic process to avoid local minima in learning behavior. They use Genetic Algorithms to replace weak rules (those not contributing or contributing incorrectly to the problem solution) with offspring of strong rules. Unlike the Back Propagation approach, which requires predefined layers of networks, the Classifier System approach does not require any predefined structure (although one can be provided if desired, in the form of an initial set of rules). Because of the nature of the rules,
messages, and the match cycle, this approach is very well-adapted to massively parallel computers. To date, work on all of these approaches has taken place primarily on conventional serial computers with simulators. Because the speed of these serial simulators depends directly on the size of the problems being solved, only small examples have been used to test these ideas. The massive parallelism and dynamic reconfigurability of the Connection Machine offers a chance to implement and evaluate the Connectionist Network and Classifier System approaches to learning on much larger (and more realistic) task domains. The speed of the parallel versions of these systems is nearly independent of the size of the problems being solved.

Parallelism on the Connection Machine

Most computer programs consist of a control sequence (the instructions) and a collection of data elements. Large programs have tens of thousands of instructions operating on tens of thousands, or even millions, of data elements. There are opportunities for parallelism in both the control sequence and in the collection of data elements. In the control sequence, it is possible to identify threads of control that could operate independently, and thus on different processors. This approach is known as "control parallelism", and is the method used for programming most multiprocessor computers. The primary problems with this approach are the difficulty of identifying and synchronizing these independent threads of control. Alternatively, it is possible to take advantage of the large number of data elements that are independent, and assign processors to data elements. This approach is known as "data parallelism" [11, 33]. This approach works best for large amounts of data and for many applications is a more natural programming approach. The Connection Machine system is a general purpose implementation of data parallelism.
The Connection Machine system is a dynamically reconfigurable computer with 65,536 processors and 32 Mbytes of memory which can process large volumes of data at speeds exceeding one billion instructions per second (1000 MIPS). This computer is a departure from the conventional von Neumann model of computing, which has one processor with a large memory. In the Connection Machine, the processing power is distributed over the memory, so that there are many processors, each with a small amount of memory (4096 bits). In addition to a large number of processors, a data parallel computer must have an effective means of communicating between processors. The Connection Machine system has a communications system that allows any processor to send a message to any other processor, with possibly all processors sending messages at the same time. This mechanism allows applications to dynamically reconfigure the communications topology to adapt it to the communications needs of the moment. For example, in the low level part of an image understanding system, a two dimensional grid is the most appropriate topology. At the intermediate level, such a system might want to communicate in a tree structured topology. And at the highest level, a semantic network for understanding might require an arbitrary network topology. Each of these communications topologies can be dynamically configured on the Connection Machine.

From a programming point of view, the Connection Machine system can be thought of as a parallel processing accelerator for a conventional serial computer. In fact, the Connection Machine system is made up of two parts: a front-end computer (like a Vax or a Symbolics 3600), and an array of Connection Machine processors and memory. The front-end computer can operate on the Connection Machine memory as though it were part of its own memory, or it can invoke parallel
arithmetic, logical, or communications (data movement) operations on that memory. The program runs on the front-end computer, invoking parallel operations when necessary. Thus, all the program development tools of the front-end computer are available for developing Connection Machine programs.

Connection Machine programming languages are simple parallel extensions to familiar sequential languages. C* [27] is a parallel version of the C language [9], and *Lisp [21] is a parallel version of Common Lisp [32]. In each of these languages, the extensions were made in an unobtrusive manner. For example, in *Lisp, two parallel variables can be added together with +! (the ! at the end of an operator name indicates the parallel version of that operator). The process of writing a data parallel program with a language like *Lisp is quite simple: write an algorithm as though it were operating on one data element, then use the parallel versions of the operators. In control parallelism, understanding how to do a parallel task decomposition is often an intellectually difficult task. In such a system, it is also often difficult to take an existing program and understand it. However, in data parallelism, the programming style is much closer to what programmers are used to, and it is thus much simpler to produce and understand data parallel programs.

*CFS: A Parallel Classifier System

The state of the art for Classifier Systems is represented by the CFS-C system [25], designed and implemented at the University of Michigan.
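The data parallel style described above — write the algorithm for one data element, then apply the parallel version of each operator to every element at once — can be imitated in a serial language. The sketch below is an illustrative Python analogue, not *Lisp: `plus_bang` is an invented stand-in for the parallel `+!` operator, implemented as a plain elementwise map rather than a true parallel operation.

```python
# Serial imitation of the *Lisp data parallel style: a "parallel
# variable" holds one value per (virtual) processor, and the parallel
# version of an operator applies to every element at once.
# plus_bang is an invented stand-in for *Lisp's +!, not its real API.

def plus_bang(xs, ys):
    """Elementwise addition over two 'parallel variables'."""
    return [x + y for x, y in zip(xs, ys)]

strengths = [10, 20, 30]    # one value per processor
rewards   = [1, 2, 3]
print(plus_bang(strengths, rewards))   # [11, 22, 33]
```

The point of the style is that the per-element logic (`x + y`) is written exactly as it would be for a single datum; the mapping over all processors is implicit in the operator.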
This system is designed with parallelism in mind, but has been simulated on conventional serial machines, and thus has been restricted to small task domains. Serial implementations of CFS-C are relatively fast for small sets of classifiers: up to 20 cycles per second for 200 Classifiers on a one MIP (one million instructions per second) computer. However, because the match procedure (described below) requires every message to be matched against every classifier, serial implementations for large sets of classifiers are too slow to be useful.

*CFS is an implementation of CFS-C on the Connection Machine, written in *Lisp. Because of the parallel nature of *CFS on the Connection Machine, it will work with 65,000 classifiers about as fast as it works with 200 classifiers. The speed of the system does not depend on the number of classifiers and is relatively fast, up to 10 cycles per second for 65,000 Classifiers. Thus, *CFS on the Connection Machine provides a way to explore and evaluate Classifier Systems and Genetic Algorithms on large task domains.

*CFS: Overview of Operation

From the user's point of view, both CFS-C and *CFS provide the same basic structure, a message-passing rule-based system that uses Genetic Algorithms to evolve rules to solve some specified problem. Both systems are task domain independent. To define a task domain, you simply define three functions: a function to provide input messages that describe the state of the external environment on each cycle (these are called "Detector" messages), a function to analyze output messages to alter the external environment (called "Effectors"),
and a function to evaluate the changes made to the environment, and supply a reward or punishment for those changes.

A "Message" is a fixed length bit string, with each bit position containing a zero or a one. A "Classifier" is a rule with two conditions and an action. The conditions are message patterns which are matched with incoming messages; these patterns may also contain a "don't care" value. The action is a message pattern for production of an outgoing message.

On each cycle, the messages output from the previous cycle are combined with Detector messages describing the environment to provide the incoming message list. All Classifiers are matched with all the messages on the message list. The matching Classifiers can each post one or more new messages to the outgoing message list. The message list is limited to a small size to force the Classifiers to compete for the right to post new messages. There is a strength associated with each Classifier that is used to control this bidding process. The strength reflects how good a rule is; that is, a high strength Classifier is one that contributes to a correct solution, while a low strength Classifier is one that either does not contribute or contributes to an incorrect solution. Classifier strength is adjusted with several different mechanisms: (1) reward or punishment from the evaluator changes the strengths of all Classifiers that won the bidding competition and posted messages during that cycle;
(2) Classifiers that win the bidding process pay their bid to the Classifiers in the previous cycle that produced the messages which the winning Classifiers matched (this is called the Bucket Brigade algorithm [13], and is necessary for chains of rules to form and survive); and (3) taxes are used to eliminate non-contributors and to help prevent over-general Classifiers from dominating the bidding process.

Genetic Algorithms are used periodically to replace some percentage of the population of Classifiers (generally low strength Classifiers are replaced) with offspring from matings of other (generally high strength) Classifiers. The two primary Genetic Algorithms used are Crossover Mating and Mutation. The Crossover algorithm considers the entire Classifier (both conditions and the action) as a chromosome, picks a random point for the crossover, and crosses two parents to produce two offspring (i.e., the high order bits up to the randomly picked point are swapped in the parents). The Mutation algorithm uses a Poisson distribution to decide whether to make zero, one, two, or three random modifications to a new offspring. The mutation rate is kept quite low, so that only a few offspring are mutated at all, and very few have more than one mutation.

*CFS: Parallel Data Structures and Algorithms

The inherent computational load of *CFS can be characterized as proportional to the product of the length of the message list and the number of Classifiers. Experience suggests that the message list should be restricted to a small size to promote strong competition in the Classifier population. It also appears that solving problems in large real-world task domains will require a large Classifier population. These two observations, along with the fact that the Classifiers are operated on almost entirely independently, suggest a natural data parallel approach for *CFS, which is the assignment of one processor to each Classifier.
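The Crossover Mating operator described above — treat the whole Classifier (two conditions plus the action) as one chromosome, pick a random point, and swap the high order part between two parents — can be sketched serially as follows. The ternary encoding (`0`, `1`, `#` for "don't care") follows the Classifier description; the function name and the example chromosomes are illustrative, not from CFS-C/*CFS.

```python
import random

# Sketch of single-point Crossover Mating as described above: the
# entire Classifier is one chromosome; the genes up to a randomly
# picked point are swapped between the two parents, yielding two
# offspring. Names and example strings are illustrative only.

def crossover(parent_a, parent_b, rng=random):
    assert len(parent_a) == len(parent_b)
    point = rng.randrange(len(parent_a))     # random crossover point
    child_a = parent_b[:point] + parent_a[point:]
    child_b = parent_a[:point] + parent_b[point:]
    return child_a, child_b

# Two 9-gene chromosomes, e.g. two 3-gene conditions plus a 3-gene action.
a, b = "000###111", "111000###"
ca, cb = crossover(a, b)
print(ca, cb)
```

Whatever point is drawn, each gene position of the two offspring holds exactly the pair of genes the parents held at that position; only their assignment to offspring changes.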
The primary parallel data structures in *CFS are associated with Classifiers. In addition to the two conditions, action, and strength that have already been mentioned, there are about 30 other variables associated with each Classifier. Some of these variables maintain performance statistics, others are used to control various algorithms in the system. Values stored in parallel variables on the Connection Machine can be of any size needed. The parallel variables that represent a Classifier include several one-bit booleans, signed integers ranging from 5 to 32 bits, and unsigned integers that are as long as the message size in the particular task domain (9 bits for the letter sequence prediction task). In the current implementation, about 850 bits are used for each Classifier, with performance statistics accounting for about half of that. That means that several Classifiers could be implemented on each physical processor on the Connection Machine (since each has 4096 bits of memory). In fact, the Connection Machine has a Virtual Processor mechanism that supports that. By selecting the right virtual processor configuration, a 65,536 processor Connection Machine can simulate a million processors, hence a million Classifiers.

The other primary data structure in *CFS is the message list, which is maintained on the front-end computer.
Although all operations on Classifiers are done in parallel, the operations on messages in the message list must be done serially. However, the size of the message list is relatively small (less than 30 messages for most applications), and the number of operations that sequence through the message list is small (the match procedure, the creation of the new message list, and activation of Effectors).

To see how the parallel Classifiers and the serial message list interact, let us consider the match algorithm. Assume that each message is N bits long and that there are at most M messages in the message list. The conditions in the Classifiers are represented as two N-bit unsigned fields, one for the bits (zeros and ones) and the other for the wildcards (the "don't care" bits). Let us allocate an M-bit "message winner" field for each condition, to represent which messages matched that condition (i.e., the m'th bit of that field will be one if the m'th message in the message list matched that condition). To match all the messages with all the Classifiers, we first sequence through the message list, broadcasting each message to all Classifiers in parallel. This is the only serial part of the match; the rest of the operations are done in parallel. We compare the logical OR of the wildcard bits and the condition bits with the logical OR of the wildcard bits and the message bits. If they are the same, we set the m'th message winner bit. After sequencing through the message list, we find all Classifiers that matched by selecting all Classifiers that have non-zero message winner fields for both conditions.
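The per-condition match test above can be checked with ordinary integers standing in for the parallel N-bit fields. OR-ing the wildcard mask onto both the condition bits and the message forces the "don't care" positions to agree, so the two results are equal exactly when every cared-about bit of the message matches the condition. The function name below is illustrative.

```python
# The *CFS match test, with ordinary Python integers standing in for
# the parallel N-bit fields. A condition is a pair (bits, wildcards):
# `bits` holds the required zeros and ones, `wildcards` has a 1 at
# each "don't care" position. The condition matches message m exactly
# when (wildcards | bits) == (wildcards | m): wildcard positions are
# forced to 1 on both sides, so only the cared-about bits are compared.

def matches(bits, wildcards, message):
    return (wildcards | bits) == (wildcards | message)

# Condition 1#0 over N = 3 bits: bits = 0b100, wildcard in the middle.
bits, wild = 0b100, 0b010
print(matches(bits, wild, 0b100))  # True:  1x0 with x = 0
print(matches(bits, wild, 0b110))  # True:  1x0 with x = 1
print(matches(bits, wild, 0b101))  # False: last bit must be 0
```

In *CFS this comparison runs in every Classifier processor at once for each broadcast message, setting the corresponding message winner bit on success.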
The whole match process takes about 3.8 milliseconds plus about 0.5 milliseconds for each message on the message list, regardless of the number of Classifiers (up to 65,536).

Parallel Genetic Algorithms In *CFS

As mentioned earlier, the critical component of Classifier Systems that drives their learning behavior is the Genetic Algorithms. In *CFS, there are currently two parallel Genetic Algorithms employed, Crossover Mating and Mutation. These algorithms are invoked periodically and replace some percentage of the population of Classifiers with offspring of matings of other Classifiers. In a typical experiment, the Genetic Algorithms might be invoked once every eleven cycles, with five percent of the population being replaced each time. The frequency of invocation of Genetic Algorithms is a prime number to avoid the effects of cycle patterns in the test environment, which might cause strength patterns that alter the effectiveness of the Genetic Algorithms. The percent of the population being replaced is relatively low so that the Bucket Brigade and other strength adjustment algorithms have time to establish some stability.

The first step in these Genetic Algorithms is to pick a set of Classifiers to be replaced and a set of Classifiers to be parents. These two sets are the same size and a one-to-one mapping is established between them, in the form of two-way pointers between elements of each set. The parents are then copied onto the Classifiers being replaced. From that point on, the set of Classifiers being replaced is referred to as the set of offspring. Some percentage of the offspring are left as pure replications, while the rest are paired and mated with the crossover algorithm.
Finally, there is a small probability that each new offspring will be changed slightly by the mutation algorithm. All of the algorithms described in this section operate at speeds that are independent of the number of Classifiers involved (up to 65,536).

Picking Parents and Classifiers to Replace

Relatively low strength Classifiers are picked to be replaced. The choice is made probabilistically, making random draws without replacement from the population with the probability of being picked proportional to the square of the inverse of the strength of the Classifier. This is done so that weak Classifiers have some small chance of not being replaced and strong Classifiers have some small chance of being replaced. Likewise, relatively high strength Classifiers are picked as parents. This is also done probabilistically, with the probability proportional to the square of the strength.

A "Parallel Random Weighted Selection" algorithm supports the picking of parents and Classifiers to replace. Of the several approaches to implementation of this algorithm, the following has been the most successful. First, generate a random number between zero and the weight (e.g., the square of the strength) for each Classifier in parallel. Then, Rank those random numbers in parallel. The Rank operation is the first step in a parallel sort, and is done in log time on the Connection Machine using either Batcher's bitonic sort [3] or a radix sort [1]. After ranking the random numbers, the Classifier with the smallest random number will have rank 0, the Classifier with the largest random number will have rank N (the size of the population), and each rank will be unique. At this point, select the lowest five percent (or whatever percent is being replaced) of the ranked Classifiers as the Classifiers to replace, and the highest five percent of the ranked Classifiers as parents. This random weighted selection algorithm takes about 23 milliseconds on the Connection Machine.
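The random weighted selection just described can be sketched serially: one draw in [0, weight) per Classifier, a rank over the draws, then the bottom slice becomes the replacement set and the top slice the parent set. On the Connection Machine the Rank is a log time parallel sort primitive; here it is an ordinary `sorted`. The function name and example strengths are illustrative.

```python
import random

# Serial sketch of "Parallel Random Weighted Selection": each
# Classifier draws a random number in [0, strength^2); after ranking
# the draws, the lowest fraction of ranks is replaced and the highest
# fraction becomes parents. Weak Classifiers thus have only a small
# chance of surviving, and strong ones a small chance of being replaced.

def weighted_selection(strengths, fraction, rng=random):
    draws = [rng.uniform(0, s * s) for s in strengths]
    ranked = sorted(range(len(strengths)), key=lambda i: draws[i])
    k = max(1, int(len(strengths) * fraction))
    return ranked[:k], ranked[-k:]       # (to_replace, parents)

strengths = [1, 1, 1, 50, 50, 50, 50, 50, 1, 1]
to_replace, parents = weighted_selection(strengths, 0.2)
print(to_replace, parents)
```

Because the draw for a strength-50 Classifier lies in [0, 2500) while a strength-1 Classifier's lies in [0, 1), strong Classifiers usually (but not always) rank high, which is exactly the intended bias.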
Crossover Mating

At this point, we have identified two sets of Classifiers: a set of parents and a replacement set. Before proceeding, we need to replace the Classifiers picked for replacement with copies of their parents. The first step in this process is to make each offspring (one of the replacement set) point to a parent, and vice-versa. This is done using a "Parallel Rendezvous" algorithm, which takes three steps:

1. Enumerate the parent set and Send the processor address of each parent to a rendezvous variable in the enumerated processors. Enumeration is another log time operation on the Connection Machine. The result of enumerating the parent set will be a unique numbering of the parents from zero to the number of parents. The Send from the parents to the enumerated processors will result in the rendezvous variable in processor zero containing the processor address of the first parent; in processor one the rendezvous variable will contain the processor address of the second parent, and so on. If there are N parents, then the rendezvous variable in the first N processors will contain the addresses of the parents.

2. Enumerate the replacement set, and have each offspring Get the address in the rendezvous variable from the enumerated processors. Since this enumeration will be the same as the replacement rank developed while picking the Classifiers to replace, we can use the replacement rank and skip this enumeration to save time. The Get will result in the first offspring obtaining the address of the first parent from the rendezvous variable in processor zero, the second offspring obtaining the address of the second parent, and so on. Now, each offspring has a pointer to its parent.

3. Offspring Send their processor addresses to their parents. Since offspring now have their parents' addresses, they can do this directly.

Given the pointers between parents and offspring, we now replace the offspring with copies of their parents.
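The three rendezvous steps can be imitated serially, with list writes and dictionary reads standing in for the Connection Machine Send and Get. The function and variable names below are illustrative, not from *CFS.

```python
# Serial imitation of the "Parallel Rendezvous" algorithm above.
# rendezvous[i] plays the role of the rendezvous variable in the i'th
# enumerated processor; Send and Get become list writes and reads.

def rendezvous_pairing(parent_addrs, offspring_addrs):
    assert len(parent_addrs) == len(offspring_addrs)
    # Step 1: enumerate the parents and Send each parent's address to
    # the rendezvous variable of the correspondingly numbered processor.
    rendezvous = list(parent_addrs)
    # Step 2: each enumerated offspring Gets a parent address, giving
    # every offspring a pointer to its parent.
    parent_of = {off: rendezvous[i]
                 for i, off in enumerate(offspring_addrs)}
    # Step 3: offspring Send their own addresses back to their parents,
    # completing the two-way pointers.
    child_of = {parent_of[off]: off for off in offspring_addrs}
    return parent_of, child_of

parent_of, child_of = rendezvous_pairing([7, 3, 9], [12, 5, 8])
print(parent_of)   # {12: 7, 5: 3, 8: 9}
```

In the parallel version each of these steps touches all pairs at once, which is why the whole pairing runs in time independent of the population size.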
This is done by having parents Send relevant parts of themselves to their offspring, using the offspring address pointers derived from the rendezvous step. Some parallel variables associated with the offspring are simply initialized, while others are copied from their parents. The strength of a new offspring is set to a value halfway between its parent's strength and the average strength of the population. This allows the new offspring to participate in the bidding process quickly, without dominating the process.

The Crossover algorithm used in *CFS allows some percentage (a system parameter) of null crossovers, or pure replications. To decide which offspring will be replications, we generate a random number between 0 and 100 for each offspring. If that number is below the specified cutoff, we are done with the Crossover algorithm, since the parent has already been copied. For the remaining offspring we proceed with the Crossover algorithm by picking pairs to mate and crossing those mates at random points. The next step is to pair the offspring (since these are now copies of parents, this is equivalent to pairing the parents). This is done by dividing the set of offspring in half (currently done by even replacement rank versus odd replacement rank) and using the rendezvous algorithm to get the even ranked offspring to point to the odd ranked offspring, and vice-versa. With this set of pointers, we can now complete the Crossover.
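Two of the per-offspring decisions above reduce to one-line rules; a minimal sketch (function names are ours):

```python
import random

def init_offspring_strength(parent_strength, population_mean):
    # New offspring strength: halfway between the parent's strength and the
    # population average, so the offspring can bid early without dominating.
    return (parent_strength + population_mean) / 2.0

def is_replication(null_crossover_pct):
    # Null-crossover test: draw in [0, 100) and compare against the
    # system-parameter cutoff; below the cutoff the offspring stays a
    # pure copy of its parent.
    return random.uniform(0.0, 100.0) < null_crossover_pct
```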
Generate a random number between 0 and the length of the chromosome (i.e., three times the length of a message, since the chromosome contains both of the conditions and the action) for each even ranked offspring, to be used as the crossover point. Now the even ranked offspring Get the odd ranked offspring's chromosome (conditions and action). Next, using parallel load-byte and deposit-byte, swap the high order bits of the chromosomes up to the crossover point. Finally, store the crossed chromosomes back in the even ranked offspring and Send the crossed chromosomes to the odd ranked offspring.

Mutation

The Mutation algorithm uses a Poisson distribution to mutate zero, one, two, or three genes in the chromosome (bits in the conditions and action) of the new offspring. The Poisson distribution is implemented in parallel using a table (see [34]). A random number between 0 and 1000 is generated for each offspring. A match to the Poisson table indicates how many mutations to make (normally, very few mutations are done). For each mutation, a random bit position is picked, and a random new value (one, zero, or don't care) is put in that position.

Summary: Parallel Genetic Algorithms

To summarize, the parallel Genetic Algorithms used in *CFS are all independent of Classifier population size (up to 65,536). The total time for the Genetic Algorithms is around 400 milliseconds (or an average of 36 milliseconds per cycle in the experiment described above, since they were run only once every several cycles). Very little optimization has been done on this part of the system, so the figure can probably be reduced by a factor of two or three.
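The crossover swap and the table-driven mutation described above can be sketched sequentially (a sketch only: chromosomes are lists of symbols here rather than packed bit fields, and the Poisson lookup table values are hypothetical, not taken from *CFS):

```python
import random

def crossover(even_chrom, odd_chrom, point):
    # Swap the genes up to the crossover point; the *CFS version does
    # this with parallel load-byte/deposit-byte on packed bit fields.
    new_even = odd_chrom[:point] + even_chrom[point:]
    new_odd = even_chrom[:point] + odd_chrom[point:]
    return new_even, new_odd

# Hypothetical Poisson lookup table: a draw in [0, 1000) maps to a
# mutation count of 0, 1, 2, or 3, with most draws yielding zero.
POISSON_TABLE = [(905, 0), (995, 1), (999, 2), (1000, 3)]

def mutate(chrom):
    draw = random.randrange(1000)
    n_mutations = next(n for limit, n in POISSON_TABLE if draw < limit)
    for _ in range(n_mutations):
        pos = random.randrange(len(chrom))
        chrom[pos] = random.choice(['0', '1', '#'])  # one, zero, or don't care
    return chrom
```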
All these algorithms make heavy use of the communications mechanisms in the Connection Machine, both explicitly (with uses of Send and Get in parent copying and crossover) and implicitly (with uses of Enumerate and Rank). Two crucial algorithms that support the Genetic Algorithms are parallel random weighted selection (for picking parents and offspring) and parallel rendezvous (to establish pointers between parents and offspring and between pairs of mates in crossover).

Summary

The implementation of a parallel version of Genetic Algorithms in a Classifier System on a fine-grained massively parallel computer, the Connection Machine, provides verification that these inherently parallel algorithms can, in fact, be fully parallelized on a computer. This is best demonstrated by the speed of the system, which is independent of the number of Classifiers, and which for parts of *CFS other than its Genetic Algorithms depends only on the size of the message list, and is linear.

This work also provides a demonstration of the power and importance of data parallelism as a programming style. It took less than one man-month to develop the first working version of the basic *CFS system, starting only with a description of CFS-C (the documentation in [25, 26]). This also provides confirmation that the Connection Machine system is well-adapted to data parallelism and is easy to program.

Finally, parallel implementations of Classifier Systems and Genetic Algorithms, like *CFS, provide a vehicle for exploring large task domains. Ultimately, a Classifier System with a large population should be able to evolve a set of rules for any repetitive structured task for which an evaluation function can be defined. An example of a difficult task that is unlikely to ever work with a small population of Classifiers, but might work with a large population, is the game of Go. The best current Go playing program, by Wilcox [35], is only slightly better than a novice.
Can we express the knowledge embedded in Wilcox's program in a set of rules and an evaluation function, and use that as the starting point for evolving rules for a really good Go player? Parallel implementations of Classifier Systems and Genetic Algorithms allow us to begin investigating such questions.

References

1. Ackley, D.H., Hinton, G.E., and Sejnowski, T.J., A Learning Algorithm for Boltzmann Machines, Cognitive Science, vol. 9, no. 1, 147-169, 1985.

2. Barto, A., Learning by Statistical Cooperation of Self-Interested Neuron-like Computing Elements, COINS Technical Report 85-11, University of Massachusetts, April 1985.

3. Batcher, K.E., Sorting Networks and their Applications, in Proc. 1968 Spring Joint Computer Conference, AFIPS, pp. 307-314, April 1968.

4. Feldman, J.A., Dynamic Connections in Neural Networks, Biological Cybernetics, 46, 27-39, 1982.

5. Forgy, C.L., OPS5 Manual, Carnegie-Mellon University Computer Science Department Technical Report, 1981.

6. Forrest, S., Implementing Semantic Network Structures Using Classifier Systems, in J.J. Grefenstette (Ed.), Proceedings of an International Conference on Genetic Algorithms and their Applications, Carnegie-Mellon Univ., Pittsburgh, PA, July 1985.

7. Grossberg, S., Competition, Decision, and Consensus, Journal of Mathematical Analysis and Applications, 66, 470-493, 1978.

8. Gupta, A., Parallelism in Production Systems: The Sources and the Expected Speed-up, Carnegie-Mellon University Computer Science Department Technical Report, December 1984.

9. Harbison, S.P., and Steele, G.L., C: A Reference Manual, Prentice-Hall, New Jersey, 1984.

10. Hillis, W.D., The Connection Machine, MIT Press, 1985.

11. Hillis, W.D., and Steele, G.L., Data Parallel Algorithms, CACM, vol. 29, no. 12, pp. 1170-1183, December 1986.

12. Hinton, G.E., Implementing Semantic Networks in Parallel Hardware, in G. Hinton and J. Anderson (Eds.),
Parallel Models of Associative Memory, Hillsdale, NJ: Erlbaum, 161-187, 1981.

13. Holland, J.H., Properties of the Bucket Brigade Algorithm, in J.J. Grefenstette (Ed.), Proceedings of an International Conference on Genetic Algorithms and their Applications, Carnegie-Mellon Univ., Pittsburgh, PA, July 1985.

14. Holland, J.H., Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems, in R.S. Michalski, J.G. Carbonell, and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. II, Los Altos, California: Morgan Kaufmann Publishers, 1986.

15. Holland, J.H., Adaptation in Natural and Artificial Systems, Univ. of Michigan Press, Ann Arbor, 1975.

16. Holland, J.H., Holyoak, K.J., Nisbett, R.E., and Thagard, P.R., Induction: Processes of Inference, Learning, and Discovery, MIT Press, 1986.

17. Klopf, A.H., The Hedonistic Neuron, Washington, DC: Hemisphere, 1982.

18. Laird, J., SOAR User's Manual, Xerox PARC Internal Document, June 1985.

19. Laird, J., Rosenbloom, P., and Newell, A., Towards Chunking as a General Learning Mechanism, in Two SOAR Studies, Carnegie-Mellon University Computer Science Department Technical Report CMU-CS-85-110, January 1985.

20. Laird, J.E., Rosenbloom, P.S., and Newell, A., Chunking in Soar: The Anatomy of a General Learning Mechanism, Machine Learning, vol. 1, no. 1, 1986.

21. Lasser, C., The Essential *Lisp Manual, Thinking Machines Technical Report, July 1986.

22. Michalski, R., Carbonell, J., and Mitchell, T., Machine Learning, vol. II, Morgan Kaufmann, 1986.

23. Minsky, M., The Society of Mind, Simon and Schuster, New York, 1986.

24. Minsky, M., and Papert, S., Perceptrons: An Introduction to Computational Geometry, MIT Press, 1969.

25. Riolo, R.L., CFS-C: A Package of Domain Independent Subroutines for Implementing Classifier Systems in Arbitrary, User-Defined Environments, Univ. of Michigan, Div. of Computer Science and Engineering, Logic of Computers Group Technical Report, January 1986.

26. Riolo, R.L.,
LETSEQ: An Implementation of the CFS-C Classifier System in a Task-Domain that Involves Learning to Predict Letter Sequences, Univ. of Michigan, Div. of Computer Science and Engineering, Logic of Computers Group Technical Report, January 1986.

27. Rose, J., and Steele, G., C* Language Quick Reference, Thinking Machines Technical Report, December 1986.

28. Rosenblatt, F., Principles of Neurodynamics, Spartan Books, New York, 1962.

29. Rumelhart, D.E., Hinton, G.E., and Williams, R.J., Learning Internal Representations by Error Propagation, ICS Report 8506, UC San Diego, September 1985.

30. Sejnowski, T.J., NETtalk: A Parallel Network that Learns to Read Aloud, Johns Hopkins Univ. Electrical Engineering and Computer Science Technical Report JHU/EECS-86/01, January 1986.

31. Sejnowski, T.J., Kienker, P.K., and Hinton, G.E., Symmetry Groups with Hidden Units: Beyond the Perceptron, Physica D (in press).

32. Steele, G.L., Common Lisp: The Language, Digital Press, Massachusetts, 1984.

33. Thinking Machines Corporation, Introduction to Data Level Parallelism, Thinking Machines Technical Report, April 1986.

34. Wagner, H.M., Principles of Operations Research, Prentice-Hall, New Jersey, 1969.

35. Wilcox, B., Reflections on Building Two Go Programs, SIGART Newsletter, October 1985.

36. Wilson, S.W., Classifier System Learning of a Boolean Function, Rowland Institute for Science Research Memo RIS No. 27r, February 1986.

37. Wilson, S.W., Knowledge Growth in an Artificial Animal, in J.J. Grefenstette (Ed.), Proceedings of an International Conference on Genetic Algorithms and their Applications, Carnegie-Mellon Univ., Pittsburgh, PA, July 1985.

PUNCTUATED EQUILIBRIA: A PARALLEL GENETIC ALGORITHM

J. P. Cohoon, S. U. Hegde, W. N. Martin, D. Richards
Department of Computer Science
University of Virginia
Charlottesville, Virginia 22903

ABSTRACT

A distributed formulation of the genetic algorithm paradigm is proposed and experimentally analyzed.
Our formulation is based in part on two principles of the paleontological theory of punctuated equilibria: allopatric speciation and stasis. Allopatric speciation involves the rapid evolution of new species after being geographically separated. Stasis implies that after equilibrium is reached in an environment there is little drift in genetic composition. We applied the formulation to the Optimal Linear Arrangement problem. In our experiments, the result was more than just a hardware acceleration; rather, better solutions were obtained with less total work.

INTRODUCTION

The genetic algorithm paradigm has been previously proposed to generate solutions to a wide range of problems [HOLL75]. In particular, several optimization problems have been investigated. These include control systems [GOLD83], function optimization [BETH81], and combinatorial problems [COHO86, DAVI85, FOUR85, GOLD85, GREF85, SMIT85]. In all cases, serial implementations have been proposed. We will argue that there is an effective parallel realization of the genetic algorithms approach based on what evolution theorists call "punctuated equilibria." We propose a parallel implementation and present empirical evidence of its effectiveness on a combinatorial optimization problem.

While the genetic algorithm (GA) approach is easily understood, it would be difficult to glean a canonical "pseudo-code" version from published accounts. Various implementations differ and many "obvious" design decisions are omitted. In all cases, a population of solutions to the problem at hand is maintained and successive "generations" are produced by manipulating the previous generation. The population is typically kept at a fixed size. Most new solutions are formed by merging two previous ones; this is done with a "crossover" operator and suitable encodings of the solutions. Some new solutions are simply modifications of previous ones, using a "mutation" operator.
Successive generations are produced with new solutions replacing some of the older ones. An ad hoc termination condition is used and the best remaining solution (or the best ever seen) is reported.

A solution is evaluated with respect to its "fitness," and of course we prefer that the most fit survive. There are two mechanisms for differential success. First, the better fit solutions are more likely to crossover, and hence propagate. Second, the less fit solutions are more likely to be replaced. It is important to realize that the GA approach is fundamentally different from, say, simulated annealing [KIRK83], which follows the "trajectory" of a single solution to a local maximum of the fitness function. With GA there are many solutions to consider and the crossover operation is so chaotic that there is no simple notion of trajectory.

How can parallelism be used with the GA approach? Initially it seems clear that the process is inherently sequential. Each generation must be produced before it can be used as the basis for the following generation; it is antithetical to the evolutionary scheme to jump forward. A simple use of parallelism is the simultaneous production of candidates for the next generation. For example, pairs of solutions could be crossed over in parallel, along with the selection and mutation of other solutions. But algorithmic issues remain to be resolved. How are the "parents" probabilistically selected? How are the solutions that are to be replaced chosen? The simple answers to these questions, which require global information, suggest the use of shared-memory architectures. Note that this sort of parallelism does not make any fundamental contribution to the GA approach; it can be viewed simply as a "hardware accelerator."

We restrict our interest to the study of parallel algorithms for a distributed processor system without shared memory. Our reasons are threefold.
First, we have access to such a system. Second, the extension of our results to massively parallel machines will be quite natural. In such a machine, the cost of connecting and distributing data is an important component in the analysis of the algorithms. We assume that the interconnection network is sparse, and hence communication between distant processors is expensive. Third, as implied above, we are interested in developing more than just a hardware accelerator. Rather, we desire a distributed formulation that gives better solutions with less total work.

We feel the most natural way to distribute a genetic algorithm over the processors is to partition the population of solutions and assign one subset of the population to the local memory of each processor. Consider a straightforward implementation of the GA approach. In order to probabilistically select parents for the crossover operation, global information about the (relative) fitnesses must be used. This implies an often performed phase of data collection, processing, and broadcasting. Further, extensive data movement is required to crossover two randomly selected solutions that are on distant processors.

Considerations such as the above led us to question the wisdom of using the GA approach as it is typically presented. There are many ad hoc methods for bypassing these difficulties. For example, continuing the above examples, knowledge of the global fitness distribution can be approximated. Further, steps can be taken to artificially reduce the diameter of related computations. Instead we found a simple model that naturally maps GA onto a distributed computer system. It is drawn from the theory of punctuated equilibria, discussed in the next section.

We chose to do our initial study using the Optimal Linear Arrangement problem (OLA). It is an NP-complete combinatorial optimization problem [GARE79].
We selected it due to the practical interest in such placement problems, as well as its simple presentation. There are m objects and m positions, where the positions are arranged linearly and are separated by unit distances. For each pair of objects i and j there is a cost c_ij. We need to find a mapping p, where object i is assigned to position p(i), that minimizes the objective function

    sum over i < j of c_ij * |p(i) - p(j)|        (1)

We note that OLA is related to the ubiquitous traveling salesman problem; they are both instances of the quadratic assignment problem.

PUNCTUATED EQUILIBRIA

N. Eldredge and S. J. Gould [ELDR72] presented the theory of punctuated equilibria (PE) to resolve certain paleontological dilemmas in the geological record. While the extent to which PE is needed to explain the data is hotly debated [ELDR85], we have found it to be an important model for understanding distributed evolutionary processes. PE is based on two principles: allopatric speciation and stasis. Allopatric speciation involves the rapid evolution of new species after being geographically separated. The scenario involves a small subpopulation of a species, "peripheral isolates," becoming segregated into a new environment. By using latent genetic material or new mutations this subpopulation may survive and flourish in its environment. A single species may give rise to many peripheral isolates.

Stasis, or stability, of a species is simply the notion of lack of change. (This directly challenges phyletic gradualism.) It implies that after equilibrium is reached in an environment there is very little drift away from the genetic composition of a species. The motivation is that "sympatric" speciation (differentiation in the same environment) is difficult since small changes can not compete with the "gene flow" of the current species.
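As a concrete reading of Eq. 1, a minimal Python sketch (names are ours, and positions are 0-indexed here rather than 1-indexed; only the pairwise differences matter):

```python
def ola_score(costs, p):
    # Objective of Eq. 1: sum over pairs i < j of c_ij * |p(i) - p(j)|,
    # where costs is a symmetric m x m matrix and p[i] is the position
    # assigned to object i.
    m = len(p)
    return sum(costs[i][j] * abs(p[i] - p[j])
               for i in range(m) for j in range(i + 1, m))
```

Placing heavily connected objects close together lowers the score, which is what the GA is asked to minimize.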
Ideally a species would persist until its environment changed (or it would drift very little).

It is instructive to define "species" in a way that relates to the concept of a solution in the GA approach. We adapt an old idea of S. Wright [WRIG32] that introduces the concept of the "adaptive landscape," which is analogous to the fitness "surface" over the space of solutions. Consider a peak of the landscape that has been discovered and populated by a subset of the gene pool. That subset (perhaps with nearby subsets) corresponds to a species. There can be many "species" in a given environment, some so distant that their mutual offspring are not adapted to the environment. The difficult question is how can a species, as a whole, leave its "niche" to migrate to an even higher peak. The concept of stasis emphasizes the problem. PE stresses that a powerful method for generating new species is to thrust an old species into a new environment, that is, a new adaptive landscape, where change is beneficial and rewarded. For this reason we should expect a GA approach based on PE to perform better than the typical single environment scheme.

What are the implications for the GA approach? If the "environment" is unchanging then equilibrium should be rapidly attained. The resulting equivalence classes of similar solutions would correspond to species. It is possible that the highest peaks remain unexplored. Typically, when GA is used, the mutation and crossover operations are relied on to eventually find the other peaks. PE indicates that a more diverse exploration of the adaptive landscape could be achieved by allopatric speciation of peripheral isolates. Therefore, subpopulations must be segregated into environments that are somehow different. Two different schemes for changing the environment are suggested. Suppose fitness is a multi-objective function. Various low-order approximations to the true fitness could be tried at different times and places.
We will not explore this further here. The second scheme simply changes the environment by throwing together previously geographically separated species. We feel that the combination of new competitors and a new gene pool would cause the desired allopatric speciation. Further, we will define fitness so that it is relative to the current local population. So a new combination of competitors will alter the fitness measure. This scheme is used in the work presented here.

GENETIC ALGORITHMS WITH PUNCTUATED EQUILIBRIA

Our basic model of parallel genetic algorithms assigns a set of n solutions to each of N processors, for a total population of size nxN. The set assigned to each processor is its subpopulation. (It is a simple extension to the model to allow different and time-varying population sizes. Other extensions are discussed in Section 6.) The processors are connected by a sparse interconnection network. In practice we might expect a conventional topology to be used, such as a mesh or a hypercube, but at present the choice of topology is not considered to be important. The network should have high connectivity and small diameter to ensure adequate "mixing" as time progresses.

The overall structure of our approach is seen in Figure 1. There are E major iterations called epochs. During an epoch each processor, disjointly and in parallel, executes the genetic algorithm on its subpopulation. Theoretically each processor continues until it reaches equilibrium. Since we know of no adequate stopping criteria we have used a fixed number, G, of generations per epoch. This considerably simplifies the problem of "synchronizing" the processors, since each processor should be completed at nearly the same time. After each processor has stopped there is a phase during which each processor copies randomly selected subsets of its population to neighboring processors.
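The epoch structure just described can be sketched as a sequential simulation (a sketch under our own naming: `run_ga` and `select` stand for the per-processor GA phase and the probabilistic survival selection, both described later in the paper; migration here is uniform-random, as the authors specify):

```python
import random

def pe_genetic_algorithm(init_subpops, neighbors, run_ga, select, E, S):
    # init_subpops: list of N subpopulations (one per simulated processor)
    # neighbors:    adjacency map over processors (sparse network)
    # run_ga:       runs G generations on one subpopulation, returns it
    # select:       reduces a surplus population back down to n elements
    # E, S:         number of epochs, size of each migrated set
    subpops = [list(p) for p in init_subpops]
    n = len(subpops[0])
    for _ in range(E):
        subpops = [run_ga(p) for p in subpops]           # parfor: GA phase
        inboxes = [list(p) for p in subpops]
        for i, p in enumerate(subpops):                  # parfor: migration
            for j in neighbors[i]:
                inboxes[j].extend(random.sample(p, S))   # uniform, not fitness-based
        subpops = [select(box, n) for box in inboxes]    # parfor: survival
    return subpops
```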
Each processor has now acquired a surplus of solutions and must probabilistically select a set of n solutions to survive to be its initial subpopulation at the beginning of the next epoch. (The selection process is the same as the adjustment procedure used by GA proper.)

    initialize
    for E iterations do
        parfor each processor i do
            run GA for G generations
        endfor
        parfor each processor i do
            for each neighbor j of i do
                send a set of solutions S(i,j) from i to j
            endfor
        endfor
        parfor each processor i do
            select an n element subpopulation
        endfor
    endfor

    Figure 1 — The parallel genetic algorithm with punctuated equilibria.

The relationship to PE should be clear. Each processor corresponds to a disjoint "environment" (as characterized by the mix of solutions residing in it). After G generations we expect to see the emergence of some very fit species. (It is not necessary or even desirable to choose G so large that only one "species" survives. Diversity must be maintained.) Then a "catastrophe" occurs and the environments change. This is simulated by having representatives of geographically adjacent environments regroup to form the new environments. By varying the amount of redistribution, that is, S = |S(i,j)|, we can control the amount of disruption.

There can be two types of probabilistic selection used here. The fitness of each element of a population is used for selection, where the probability of selecting an element is proportional to its fitness. When there is repeated selection from the same population it can be done either with replacement or without replacement. The decision should be based on both analogies with the natural genetic model and goals for efficiently driving the optimization process. The selection of each final (end of epoch) subpopulation is done without replacement; the good solutions will only "propagate" at the beginning of the next epoch. (The selection of each S(i,j) is not done probabilistically. A "random", i.e.
using a uniform distribution, selection is used to simulate the randomness of environment shifts.)

We present our interpretation/implementation of the GA code each processor uses in Figure 2. The crossover rate, 0 <= C <= 1, determines how many new offspring are produced during each generation. "Parents" are chosen probabilistically with replacement. The crossover itself, and other details, are discussed below. Our crossover produces one offspring from two parents. The fitnesses are recalculated, relative to the new larger population. Then, probabilistically without replacement, the next population is selected. Finally, (uniform) random elements are mutated. The mutation rate, 0 <= M <= 1, determines how many mutations altogether are performed.

IMPLEMENTATION DETAILS

The problem we studied, OLA, is a placement problem. Hence a solution must encode a mapping p from objects to positions. Since we may assume both objects and positions are numbered 1, 2, ..., m, the mapping p is just a permutation. In fact, throughout we use the inverse mapping, from positions to objects, as the basis for our encodings. There are many ways to encode permutations but it is not at all clear which, if any, are suitable for the GA approach. Note that it is desirable to preserve adjacencies within groups of objects during crossover. Several representations and crossovers have been proposed for related problems, e.g. the traveling salesman problem. Inversion vectors ("ordinal representations") are the most obvious choice since they allow "typical" crossovers (as in [HOLL75]). However the use of such crossovers with inversion vectors is quite undesirable [GREF85] since it breaks up groups of adjacent objects. Goldberg and Lingle [GOLD85] gave a "partially mapped crossover" that uses a straightforward array representation of the (inverse) mapping, where the ith entry is j if the jth object is in the ith position.
Briefly, their crossover copied a contiguous portion of one parent into the offspring, while having the other parent copy over as many other positions as possible. Smith [SMIT85] proposed a "modified crossover" for the array representation. A random division point is selected and the first "half" of one parent is copied to the offspring. The remainder of the offspring's array is filled with unused objects, while preserving their relative order within the other parent. For example, parents [7 1 3 5 6 2 4] and [3 4 2 7 1 5 6] with the division point after the third position produce the offspring [7 1 3 4 2 5 6]. We essentially used this representation and crossover but allowed the first or the second "half" of the first parent to be used. By arguments analogous to those of Goldberg and Lingle [GOLD85], we can argue that our approach has the desirable "schema-preserving" property that the GA approach exploits. However it is an open problem to give a theoretically compelling proof of that property. In any event, it is clear that our scheme tends to preserve blocks of adjacent objects.

    for G iterations do
        for nxC iterations do
            select two solutions
            crossover those solutions
            add offspring to subpopulation
        endfor
        calculate fitnesses
        select a population of n elements
        generate nxM random mutations
    endfor

    Figure 2 — The genetic algorithm used within an epoch at each processor.

The mutate operator was selected next. We felt that the mutations should not be too disruptive; if most adjacencies were broken then with near certainty the mutation would be immediately lost. We chose to use "inversion," the reversal of a contiguous block within the array representation. The beginning of the block was randomly selected. The length of the block was randomly chosen from an exponential distribution with mean μ. We typically kept μ small to inhibit disruption. The nature of the OLA problem encourages inversion as opposed to pairwise interchanges, which do not involve block moves.
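The modified crossover and the inversion mutation above can be sketched in Python (function names are ours; the crossover sketch takes the first "half" of parent 1, one of the two variants the authors allow):

```python
import random

def modified_crossover(parent1, parent2, cut):
    # Smith's modified crossover: copy the first `cut` positions of
    # parent1, then fill the rest with the unused objects in the relative
    # order they appear in parent2.
    head = parent1[:cut]
    used = set(head)
    return head + [obj for obj in parent2 if obj not in used]

def inversion_mutation(perm, mean_len):
    # Inversion: reverse a contiguous block; the start is uniform and the
    # length is drawn from an exponential distribution with the given
    # mean (kept small to limit disruption), clipped to the array end.
    start = random.randrange(len(perm))
    length = min(len(perm) - start,
                 max(1, int(random.expovariate(1.0 / mean_len))))
    perm[start:start + length] = reversed(perm[start:start + length])
    return perm
```

On the paper's example, `modified_crossover([7, 1, 3, 5, 6, 2, 4], [3, 4, 2, 7, 1, 5, 6], 3)` reproduces the offspring [7, 1, 3, 4, 2, 5, 6].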
How should fitness be calculated? With any minimization problem, such as OLA, the scores of the solutions should decrease over time. The score is the value of the objective function. Two simple fitness functions suggest themselves. First, the fitness could be inversely related to the score; this could cause excessive compression of the range of fitnesses. Second, the fitness could be a constant minus the score. The constant must be large enough to ensure all fitnesses are positive (since they are used in the selection process) and not too large (effectively causing compression). If such a constant was optimal initially, it would become a poor choice near equilibrium. For these reasons we used a time-varying "normalized" fitness. We chose our fitness to be a function of all the scores in the current population. We have empirically found that randomly generated solutions to the OLA problem have scores that are "normally distributed" (i.e., have a bell-shaped curve), with virtually every solution within 1 standard deviation (s.d.) of the mean, and no solutions were found more than 3 s.d.'s away. For related evidence see [COHO87, WHIT84]. Therefore we used

    fitness(x) = (μ_s - score(x) + aσ) / (2aσ)        (2)

where μ_s is the mean of the scores, σ is the s.d., and a is a small constant parameter. Note that in practice we expect 0 < fitness(x) < 1; we use clipping to ensure it is positive. Near equilibrium the scores will not be normally distributed because the contributions from most mutations and many crossovers will almost certainly be below the mean, biasing the distribution.

Our fitness measure has several advantages. It is somewhat problem independent, so that we can reasonably compare very different instances. It also tends to control the effect of a few "outliers" on the population. A disadvantage is that it is expensive to calculate and it needs to be recomputed at regular intervals.
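Eq. 2 can be computed directly; a minimal sketch (function name, default scale factor, and clipping floor are ours; the default a = 3 matches the observation that scores stay within 3 s.d.'s of the mean):

```python
import statistics

def normalized_fitness(scores, a=3.0, eps=1e-6):
    # Time-varying normalized fitness of Eq. 2:
    #   fitness(x) = (mu_s - score(x) + a*sigma) / (2*a*sigma)
    # Scores near mu_s - a*sigma map toward 1, near mu_s + a*sigma toward 0;
    # clipping keeps every fitness positive for the selection step.
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores) or eps   # guard against zero spread
    return [max(eps, (mu - s + a * sigma) / (2 * a * sigma)) for s in scores]
```

Because OLA is a minimization problem, lower scores map to higher fitnesses, and a score equal to the mean maps to exactly 0.5.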
An approximation scheme can be used where new fitnesses are calculated according to the current mean and variance. The other elements would not have their fitnesses recalculated, unless they were otherwise encountered, but the pseudo-normalization renders them comparable.

EMPIRICAL RESULTS

We performed several experiments to determine if our parallel genetic algorithm is an effective approach. The efficacy of any GA approach is determined by the design variables. Our initial experiments, reported here, have been made with the Optimal Linear Arrangement (OLA) problem as the base case. This permutation problem has a raw score (Eq. 1) that the system is to minimize for a given cost matrix. The system uses a fitness measure (Eq. 2) in selecting the elements for crossover and in determining the survivors for each generation. Remember that for both of those selection processes within a subpopulation the fitness is judged relative to that subpopulation, and that the fitness is used in a probabilistic manner.

Our current implementation is a sequential simulation of the parallel genetic algorithm with punctuated equilibria. It operates with an arbitrary configuration of the N subpopulations. Presently we are investigating "mesh" and "hyper-cube" connection topologies. Although other configurations will be analyzed, the hyper-cube topology is of particular importance to us. We plan to obtain "real" performance measures on the hyper-cube multiprocessor at the University of Virginia. In most of the initial studies described below, a mesh configuration is used with N = 4, i.e., each subpopulation being able to "communicate" during the inter-epoch transition with two other subpopulations. For each experiment the number of epochs, E, is given along with the number of generations per epoch, G, and the end-of-generation subpopulation size, n. Thus, over the course of a single example the parallel system will create NxExG generations.
If we set N and E to one, then we have a "standard" sequential GA creating G generations. While the sequential GA has a single evolutionary time line, the parallel algorithm has multiple, interrelated evolutionary time lines. The "interrelated" qualification is quite important because the parallel system does more than just create N divergent time lines. As with the sequential GA, one cannot say, a priori, how many distinct individuals, i.e., possible problem solutions, a particular example run of our system will examine. For our purpose here, we will use N×E×G×n as an indicator of the total number of solutions created during the experiment. The remaining design variables of importance are C, the crossover rate; M, the mutation rate; S, the size of the redistribution set; a, the fitness scale factor; and λ, the mean length of the mutation block.

[Table 1. Results for three problem instances. The table gives, for each instance, the settings (N, E, G, n) and the resulting scores; the individual entries are too garbled in this copy to reproduce.]

[Table 2. The effects of changing the design variables. The first row gives the base-case settings and subsequent rows alter one variable at a time; the individual entries are too garbled in this copy to reproduce.]

Tables 1 and 2 present the results and the settings used to derive those results. In those tables the quantity s* is the theoretical optimal OLA score (as opposed to the fitness measure); s̄ is the average of the best OLA score from each example with the specified settings; and ŝ is the score of the single best solution created during the example.
In the current simulations a single random number process is used, allowing multiple examples to be generated for the same design variable setting by changing a single "seed." In the discussion below the term "average" will indicate that several examples with the same settings and inputs, but with different seeds, have been run and the resulting measures averaged. In all cases reported here, the average is taken over a minimum of four runs. Table 1 is broken into three instances, with three rows for each instance. The first row shows the results from running the parallel genetic algorithm with punctuated equilibria, i.e., independent subpopulations with communication. For these results a four-node mesh configuration was used. The second row shows the results from a sequential genetic algorithm. That algorithm was derived by setting N and E to one, i.e., one population creating G generations. On a comparable uniprocessor, this algorithm would require about four (N) times the amount of "wall clock" time as the parallel genetic algorithm with punctuated equilibria. The third row shows the results from a simple parallel genetic algorithm that just used four (N) independent populations without communication. Here each example run was derived by setting N to four and E to one, and then, at the end of the parallel operation, selecting the best overall from the best of the four populations. For each instance, we kept N×E×G×n constant over all three rows. This product is indicative of the total number of OLA solutions examined by the system. By keeping the product constant we assume that approximately the same amount of total computation is required. For these initial studies we have considered "artificial" OLA examples, which allow easy determination of the optimal score. While the examples are contrived, they exhibit natural clustering patterns.
In all cases, the costs were chosen so that the identity permutation produced the optimal score; the only other optimal permutations were simple perturbations of the identity mapping. (Of course this does not make the problem any easier.) We used three types of problems, i.e., cost matrices, in our experiments.

The first type of problem instance, with unique optima, has a cost matrix of the following form:

            | 0 A B C D 0 0 0 0 |
            | A 0 A B C D 0 0 0 |
            | B A 0 A B C D 0 0 |
            | C B A 0 A B C D 0 |
    C1(9) = | D C B A 0 A B C D |
            | 0 D C B A 0 A B C |
            | 0 0 D C B A 0 A B |
            | 0 0 0 D C B A 0 A |
            | 0 0 0 0 D C B A 0 |

where m = 9 and A ≫ B ≫ C ≫ D ≥ 0. We believe the solution spaces for such problem instances to be "convex" in some sense, and therefore "easy." Instance one (I = 1) of Table 1 used C1(9), with A = 1000, B = 100, C = 10, and D = 1. Note that, for this instance, the parallel genetic algorithm with punctuated equilibria found the optimum solution in each example, as indicated by s̄ = s*.

The second problem type was slightly more complex. We increased m to 18 and created a cost matrix by embedding two independent 9-element orderings as given by the following cost matrix:

    C2(18) = | C1(9)    0    |
             |   0    C1(9)  |

Note that the two groups of 9 are uncoupled and that this is tantamount to solving two disjoint problems. A, B, C, and D in C2(18) were as above. The settings and results for this cost matrix are shown as instance two (I = 2) in Table 1.

The third type of problem incorporated further complexity by embedding interrelated blocks of three elements each. For nine elements the resulting cost matrix would be

            | 0 A A B B B C C C |
            | A 0 A B B B C C C |
            | A A 0 B B B C C C |
            | B B B 0 A A B B B |
    C3(9) = | B B B A 0 A B B B |
            | B B B A A 0 B B B |
            | C C C B B B 0 A A |
            | C C C B B B A 0 A |
            | C C C B B B A A 0 |

We assume A > B > C. The intra-block cost, A, causes primary clustering, and the inter-block cost for adjacent blocks, B, forces an ordering of the blocks. The other costs, the C's, tend to "flatten" the search space by making all permutations have similar scores.
This cost matrix pattern was extended to C4(18), i.e., eighteen objects comprising six blocks, and we let A = 50, B = 30, and C = 15. The results are shown as instance three (I = 3) in Table 1. The results shown in Table 2 are presented to indicate the effects of changing individual design variables. The problem instance used C4(18) with A = 40, B = 30, and C = 15. Note the slight reduction in A was intended to make the optimum solution more elusive. The settings for the base case are shown in the first row. In the remaining rows we show the altered value of a particular variable and the obtained results. Again, the averages were taken over four example runs. In terms of s̄, the most dramatic changes were due to reducing a and to increasing E. The effect of decreasing a is to make the fitness measure more sensitive to smaller differences between solutions that are near the current best for the subpopulation. In this way incremental improvements are given more of an opportunity to survive and create further improvements. The increase in E effectively provides more communication opportunities between the subpopulations. The last row of Table 2 and instance three from Table 1 provide the strongest experimental results to date for the effectiveness of the parallel genetic algorithm with punctuated equilibria. These promising effects prompted an experiment combining the modifications in the last two rows of Table 2. We obtained the optimal solution in 5 out of 6 runs.

CONCLUSIONS AND EXTENSIONS

In attempting to develop parallel algorithms one always wants to obtain the simple speed-up of having more processors do more instructions in the same "wall clock" time. However, there is evidence that in attempting to develop parallel versions of previously known algorithms one often derives a modified formulation that embodies more fundamental efficiencies. We believe this to be the case for our parallel genetic algorithm with punctuated equilibria.
The partition into subpopulations specifies a simple and balanced mapping of the workload to a non-shared-memory multiprocessor system, while the intra-epoch isolation and inter-epoch communication provide a fundamental modification to the basic genetic algorithm that will generate better solutions while considering a smaller total number of individuals. Several extensions to the model have been considered and are being attempted. For example, the use of fixed-size subpopulations is not suggested by the natural evolutionary setting. When a processor receives a surplus of very fit solutions it makes sense to retain most of them. However, it is clear there should be some mechanism for limiting the size of each subpopulation as well as the total size. Varying-sized groups will create data management and coordination problems, and it is not clear they are worth the additional computational load. Our model is essentially synchronous, though it is easily realized asynchronously with handshaking. A truly asynchronous model would allow each processor to decide for itself whether it has reached equilibrium and should begin another epoch. At that point it could poll its neighbors, asking for subsets of their current subpopulations to be sent to it. Global termination, while somewhat arbitrary before, becomes even more difficult.

ACKNOWLEDGEMENTS

The authors' work has been supported in part by the Jet Propulsion Laboratory of the California Institute of Technology under Contract 957721 to the University of Virginia. The work of James Cohoon has been supported additionally in part by the National Science Foundation through grant DMC 8505354. Their support is greatly appreciated.

REFERENCES

[BETH81] A. Bethke, Genetic Algorithms as Function Optimizers, Ph.D. Thesis, Department of Computer and Communication Sciences, University of Michigan, 1981.

[COHO86] J. P. Cohoon and W. D.
Paris, Genetic Placement, IEEE International Conference on Computer-Aided Design, Santa Clara, CA, 1986, 422-425.

[COHO87] J. P. Cohoon and M. T. Roberson, Jump Starting Simulated Annealing, Department of Computer Science, University of Virginia, 1987.

[DAVI85] L. Davis, Job Shop Scheduling with Genetic Algorithms, Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, PA, 1985, 136-140.

[ELDR72] N. Eldredge and S. J. Gould, Punctuated Equilibria: An Alternative to Phyletic Gradualism, in Models of Paleobiology, T. J. M. Schopf (ed.), Freeman, Cooper and Co., 1972, 82-115.

[ELDR85] N. Eldredge, Time Frames, Simon and Schuster, 1985.

[FOUR85] M. P. Fourman, Compaction of Symbolic Layout Using Genetic Algorithms, Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, PA, 1985, 141-150.

[GARE79] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Co., San Francisco, CA, 1979.

[GOLD83] D. E. Goldberg, Computer-Aided Gas Pipeline Operation Using Genetic Algorithms and Learning Rules, Ph.D. Thesis, Department of Civil Engineering, University of Michigan, 1983.

[GOLD85] D. E. Goldberg and R. Lingle, Jr., Alleles, Loci, and the Traveling Salesperson Problem, Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, PA, 1985, 154-159.

[GREF85] J. J. Grefenstette, R. Gopal, B. J. Rosmaita and D. Van Gucht, Genetic Algorithms for the Traveling Salesperson Problem, Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, PA, 1985, 160-168.

[HOLL75] J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975.

[KIRK83] S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi, Optimization by Simulated Annealing, Science 220, 4598 (May 13, 1983), 671-680.

[SMIT85] D.
Smith, Bin Packing with Adaptive Search, Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, PA, 1985, 202-206.

[WHIT84] S. R. White, Concepts of Scale in Simulated Annealing, International Conference on Computer Design: VLSI in Computers Proceedings, Port Chester, NY, 1984, 646-651.

[WRIG32] S. Wright, The Roles of Mutation, Inbreeding, Crossbreeding, and Selection in Evolution, Proceedings of the Sixth International Congress of Genetics 1, (1932), 356-366.

A PARALLEL GENETIC ALGORITHM

Chrisila B. Pettey, Michael R. Leuze, and John J. Grefenstette*
Vanderbilt University, Nashville, TN 37235

ABSTRACT

A parallel genetic algorithm (PGA) is presented as a solution to the problem of real time versus genetic search encountered in genetic algorithms with large populations. A discussion of the algorithm is followed by descriptions of experiments which were performed with the PGA and of performance measures which were collected during each experiment. The paper concludes with a discussion of experimental results.

1. Introduction

An important open question in the study of genetic algorithms (GA's) is the optimal size of a population. The problem centers around the tradeoff between the amount of genetic search that can be done and the amount of real time available. If the population size is too small, then the GA will have a, possibly improperly, constrained search space because of an insufficient number of schemata in the population. If the population size is too large, however, an inordinate amount of time will be required to perform all the evaluations. In the worst case a GA can be reduced to random search if the amount of available time is exhausted before any genetic search is performed. Goldberg [3] has developed a theory of the optimal population size for binary-coded genetic algorithms based on the length of an individual. But, here again, the tradeoff between the amount of genetic search and the amount of real time is still evident.
For instance, according to Goldberg's theory, the optimal population size for individuals of length 60 is approximately 10200. If evaluation of an individual requires 1 second, it will take approximately 2.8 hours to evaluate one generation. Since an evaluation can be as complex as solving a large queueing network model or running a simulation of processes on a multiprocessor, it is not unreasonable to believe that an evaluation could require 1 second or more. Execution times become even greater when populations with longer individuals are considered. In order to overcome the genetic search vs. real time problem, a parallel implementation of GA's has been investigated. This class of parallel genetic algorithms (PGA's) is presented here, along with some experimental results. The implementation techniques used in this work are not restricted to any particular multiprocessor architecture, but the results reported in this paper are from an Intel iPSC, a message-based multiprocessor system with a binary n-cube interconnection network.

1. Research supported in part by the National Science Foundation under Grant DCR-8305693.
* Current address: Navy Center for Applied Research in AI, Code 5510, Naval Research Laboratory, Washington, DC 20375-5000.

2. Outline of the algorithm

A PGA operates on a very large population of several distributed groups of individuals. There is precedent for this type of population found in the idea from population genetics of a polytypic species [2]. A polytypic species is a species composed of different groups which are isolated from each other yet which are capable of producing offspring with each other. The human species, for example, is polytypic, in that it consists of groups isolated from each other either physically or culturally. Precedent for isolated groups in a population can also be found in Wilson's Animat [5], a classifier-based learning system, and in Schaffer's VEGA [4].
In the Animat system, each individual specifies one action, a direction for an artificial animal to move; when a parent is selected for crossover, its mate is selected from the subpopulation of individuals which specify the same action as the first parent. VEGA, on the other hand, in order to find an individual which performs well in several dimensions, performs selection by choosing subpopulations from the total population based on fitness in an individual dimension. A PGA consists of a group of identical "nodal GA's" (NGA's), one per node of the multiprocessor system. Each NGA maintains a small population which is a portion of the large population, and functions in much the same way as a sequential GA. The one difference between an NGA and a sequential GA is that once during each generation, an NGA communicates with its neighboring NGA's. This communication phase consists of sending the best individual in the local population to each neighbor and receiving the best individual from each neighbor's population. After the best individuals are received from the neighboring NGA's, it is necessary for an NGA to insert the new individuals into its local population. This insertion can be done by replacing random individuals, the worst individuals, or the individuals most like the incoming individuals (i.e., those at the smallest Hamming distance). The modified algorithm for an NGA is listed in Figure 1.

    NGA:
    begin
        Initialization;
        Evaluation;
        while (not done)
        begin
            Communication;
            Selection;
            Recombination;
            Evaluation;
        end
    end

    Figure 1. A Nodal Genetic Algorithm

While it is true that a PGA can be thought of as a sequential GA with a very large population, there is one major problem with this analogy. In a sequential GA an individual is selected based on its performance in the whole population, but in a PGA an individual is selected based on its performance in its local subpopulation. This difference in selection could result in premature convergence.
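The communication phase of Figure 1, paired with the worst-individual replacement policy named above, might be sketched as follows. The function name, the list-of-individuals representation, and the higher-is-better fitness convention are illustrative assumptions.

```python
def communicate(pop, neighbour_pops, fitness):
    """Sketch of an NGA's communication phase: take the best individual
    from each neighbouring NGA's population and insert it into the
    local population by replacing the current worst individual (one of
    the three insertion policies described in the text).

    `pop` is the local population (a list of individuals), and
    `fitness` maps an individual to its value, higher being better.
    """
    for other in neighbour_pops:
        best_incoming = max(other, key=fitness)
        worst_index = min(range(len(pop)), key=lambda k: fitness(pop[k]))
        pop[worst_index] = best_incoming   # worst-replacement insertion
    return pop
```

Replacing by smallest Hamming distance instead would only change how `worst_index` is chosen, picking the resident individual closest to the incoming one.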
On the other hand, this difference might instead slow convergence and ultimately produce better results. At this time there is no theoretical basis for this type of selection.

3. Description of Experiments

Four of DeJong's testbed of functions [1] (f1, f2, f3, and f5) were used to test a PGA. For each of the four functions, an NGA maintained a population of 50 individuals and used a worst-individual replacement policy. Five experiments involving different numbers of NGA's were performed. For each of the four functions, 1, 2, 4, 8, and 16 NGA's were used. Corresponding population sizes were, therefore, 50, 100, 200, 400, and 800. For each experiment the best individual performance, the online performance, and the offline performance were collected after each generation from each NGA. Each NGA was treated as a standard sequential GA, and performances were measured accordingly. From this collection of results, performances for the PGA were calculated.

4. Experimental Results

The results of the 20 experiments are graphically displayed in Figures 2 through 5. Figure 2 shows the best individual per generation for each of the four functions. Since a PGA with more NGA's has a larger initial population, it is possible for it to contain a better best individual in the initial population than a PGA with fewer NGA's. Since the addition of an NGA does not change the initial population of other NGA's, it is not possible for a PGA with more NGA's to have a worse best individual in the initial population than a PGA with fewer NGA's. According to Goldberg's theory, the optimal population size for f1 is 116, for f2 is 51, for f3 is 2240, and for f5 is 262. The results from f1 (Figure 2a) and f5 (Figure 2d) appear to agree fairly well with the theory.
Although the experiments were not run on populations of size 116 and 262, the populations of size 200 and 400 for f1 and f5, respectively, quickly found the optimum, and increasing the population size yielded no significant improvement. The results from f3 (Figure 2c) also tend to corroborate the theory. While a PGA was not run on a population larger than 800, the results did tend to improve with the increase of population size. Although populations of size 200 and 800 performed reasonably well on f2 (Figure 2b), the results from this function would seem to substantiate the belief that the selection performed in a PGA increases the likelihood of premature convergence. It is interesting to note that the best answer was found with a population of size 50, which is approximately the theoretical optimal population size. These data would tend to indicate that the population size for a PGA should be set with the optimal population size in mind. Figure 3 shows the online performance of a PGA on the four functions. If it is desirable to optimize online performance, a large population should be used, since a PGA's online performance improves with population size. However, it is not clear that online performance has any real meaning in a PGA, since a PGA runs on a multiprocessor system. Calculating the best performance and the online performance of a PGA is fairly straightforward. The offline performance, however, is a different matter. A possible offline performance measure is the average of the offline performances of each of the NGA's. Figure 4 shows the results of this measure. Perhaps more in keeping with the idea of offline performance is the measure shown in Figure 5. The offline best performance is obtained in much the same manner that offline performance is obtained in a sequential GA, with the exception that only the best individuals from each NGA are used in the measure.
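One reading of this offline best performance measure can be sketched as follows. Maximization is assumed, and the interpretation of "best individuals seen so far" as a running best over all NGA's, averaged across generations, is our assumption rather than a definition taken from the paper.

```python
def offline_best(nga_best_histories):
    """Hypothetical sketch of the offline best performance measure.

    `nga_best_histories` is one list per NGA of that NGA's best value
    at each generation.  At each generation we take the best value any
    NGA has reported so far, then average these best-so-far values over
    all generations, mirroring how offline performance is computed for
    a sequential GA but using only the per-NGA bests.
    """
    gens = len(nga_best_histories[0])
    best_so_far = float("-inf")
    running = []
    for t in range(gens):
        best_so_far = max([best_so_far] + [h[t] for h in nga_best_histories])
        running.append(best_so_far)
    return sum(running) / gens
```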
Therefore, the offline best performance measure is the average of the bests of the local NGA best individuals seen so far. As is to be expected, the offline best performance measure improves with an increase in population size. It should be noted that a PGA has the desirable property that an increase in population size by the addition of another NGA should increase the execution time of a PGA only slightly. This increase in time is due to increased communication overhead among a larger set of NGA's. However, while there is no firm data as yet on the increase in real time due to the addition of an NGA, experience indicates that it is negligible in comparison to the increase in real time due to a comparable increase in population size of a sequential GA.

5. Conclusions

This paper is a presentation of the initial work with a PGA. There is still much work that needs to be done with PGA's. Timing studies of the five functions on the iPSC are continuing. In the near future a PGA will be applied to the Traveling Salesman Problem and to a mapping problem (i.e., finding the appropriate placement of processes in a multiprocessor system in order to minimize the response time), and PGA's will be implemented with alternate means of communication (e.g., sending different individuals to each neighbor, sending individuals probabilistically based on their performance, etc.). Also, the selection mechanism in PGA's needs to be compared theoretically with the selection mechanism in sequential GA's. The initial work, however, indicates that a PGA is a viable means of increasing the population size of a GA, since a PGA will allow the use of genetic search in problem areas where a candidate solution has a long representation and is time consuming to evaluate.

Acknowledgements

The authors would like to thank Mike Hilliard and Gunar Liepins of Oak Ridge National Laboratory for their contributions to this work.

References

[1] Kenneth A.
DeJong, An Analysis of the Behavior of a Class of Genetic Adaptive Systems, Ph.D. Thesis, Department of Computer and Communication Sciences, University of Michigan, 1975.

[2] Theodosius Dobzhansky, Genetics of the Evolutionary Process, Columbia University Press, New York, 1970.

[3] David E. Goldberg, Optimal Initial Population Size for Binary-Coded Genetic Algorithms (TCGA Report No. 85001), Tuscaloosa: University of Alabama, The Clearinghouse for Genetic Algorithms, 1985.

[4] David J. Schaffer, Some Experiments in Machine Learning Using Vector Evaluated Genetic Algorithms, Ph.D. Thesis, Department of Electrical Engineering, Vanderbilt University, 1984.

[5] Stewart W. Wilson, Knowledge Growth in an Artificial Animal, in Proceedings of an International Conference on Genetic Algorithms and Their Applications, J. J. Grefenstette, Ed., July 1985.

[Figure 2. Best Individual Performances: panels 2a (f1), 2b (f2), 2c (f3), 2d (f5), plotting best individual vs. generations (0-100) for 1, 2, 4, 8, and 16 nodes.]

[Figure 3.
Online Performances: panels 3a (f1), 3b (f2), 3c (f3), 3d (f5), plotting online performance vs. generations (0-100) for 1, 2, 4, 8, and 16 nodes.]

[Figure 4. Offline Average Performances: panels 4a (f1), 4b (f2), 4c (f3), 4d (f5), plotting offline average performance vs. generations (0-100) for 1, 2, 4, 8, and 16 nodes.]

[Figure 5. Offline Best Performances: panels 5a (f1), 5b (f2), 5c (f3), 5d (f5), plotting offline best performance vs. generations (0-100) for 1, 2, 4, 8, and 16 nodes.]

GENETIC LEARNING PROCEDURES IN DISTRIBUTED ENVIRONMENTS

Adrian V. Sannier II, Research Assistant
Erik D. Goodman, Professor and Director
A. H. Case Center for Computer-Aided Engineering and Manufacturing
Michigan State University, East Lansing, MI 48823

ABSTRACT

This paper introduces a strategy for fostering the development of hierarchically organized distributed systems which uses a genetic algorithm [1] to manipulate independent, computationally limited units that work toward a common goal. The central thesis of this work is that important characteristics of the hierarchical structure of living systems can be duplicated by applying idealized genetic operators to a functionally interacting population of encoded programs. The models and results we present in this paper suggest that the introduction of functional interaction between distributed units can promote the development of a genome capable of producing a set of independent, differentiated units and implicitly organizing them into a coherent and coordinated distributed system.
The results presented here are available in more complete form in [2].

AN IDEALIZED DISTRIBUTED GENETIC SYSTEM

The class of systems we have attempted to model emerges from consideration of macro-organisms composed of vast numbers of interacting, distributed units. During ontogeny, a single genetic program, or genome, represented in several strands of DNA, replicates itself, more or less exactly, millions of times. The cells containing these replicas of the original program do not all perform alike, however. Instead, they differentiate into many types, each of which performs a specialized function. None of the individual units operates on the scale of the macro-organism, yet together the operations which these differentiated cells perform are implicitly coordinated to produce the behavior of the larger organism. This process of differentiation is controlled, at least in large part, by the action and organization of the original genome. We propose here an idealized mechanism for developing what we call composite genomes, i.e., genomes capable of producing differentiated offspring. Their formation is desirable since they provide a means for coordinating the actions of independent, functionally interacting units operating in a common environment. By replicating itself many times and placing the offspring in appropriate environments, a single composite genome produces a system of implicitly coordinated independent units capable of pursuing a common objective. Furthermore, all of the information about the system is located in a single unit (of which there are many copies), making communication of the group strategy simple. Composite genomes are in large part responsible for the structure of the recursive, holonic hierarchies [3] which are characteristic of living systems.
Our basic hypothesis is that composite genomes arise from the consolidation of specialized genetic programs which control independent units that produce behaviors which are symbiotically related. Via a reproductive operation we call hybridization, genetic programs which produce distinct specialized behaviors are combined into a single genome capable of producing either behavior. Which of the encoded behaviors the replicated offspring of a composite genome exhibit is determined by the internal and external environmental conditions in which they are placed.

The Basic Model

In our idealized model of the natural genetic system, an environment is defined in terms of a space, which we visualize as a two-dimensional integer grid of infinite extent, the standard setting for cellular automata. Within this grid, environmental conditions are defined in a local way, i.e., conditions exist in regions of the space and vary with time and/or the action of the living systems in the region. Living systems are modeled as abstract genomes which reside at some location in the grid and interact with the conditions present in some symmetric region surrounding them. Each genome is capable of: 1) detecting the conditions present in its immediate external environment; 2) detecting the conditions present in its internal environment; 3) reacting to sets of external and internal states, either by establishing some internal condition, or by initiating a process capable of interacting with, and possibly altering, the conditions present in its immediate environment. Also associated with each genome in our model is a number, denoting its strength. A genome's strength is initially assigned by the environment and is adjusted at periodic intervals to reflect the fitness of the genome with respect to it. It is intended as an analog to the natural system, in that it provides the basis for selection.
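The genome abstraction enumerated above (sense external conditions, sense internal conditions, react to combinations of the two) could be sketched as a small class. Every name here, and the rule-table representation of reactions, is hypothetical; the model only requires the three capabilities listed in the text plus a strength value.

```python
from dataclasses import dataclass, field

@dataclass
class Genome:
    """Minimal sketch of the abstract genome described in the text."""
    location: tuple            # position on the two-dimensional integer grid
    strength: float            # adjusted by the environment; basis for selection
    internal: set = field(default_factory=set)   # current internal conditions
    # Reaction rules: (external condition, internal condition) -> action.
    # An internal condition of None means "fires regardless of internal state".
    rules: dict = field(default_factory=dict)

    def react(self, external_conditions):
        """Return the actions of every rule whose (external, internal)
        pair is satisfied by the given external conditions and the
        genome's current internal state."""
        actions = []
        for (ext, internal), action in self.rules.items():
            if ext in external_conditions and (
                    internal is None or internal in self.internal):
                actions.append(action)
        return actions
```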
When a genome's strength falls below a certain threshold, it "dies" and is removed from the grid, but whenever its strength climbs above a somewhat higher threshold, it is able to produce an offspring. In this way, the number of offspring allocated to individuals is biased by performance. The initial strength of an offspring comes from its parents, depleting their strengths and thus restricting the frequency of individual reproduction. The principal genetic operators in the model are crossover and replication. (Augmenting these are idealized mutation and inversion operators. To simplify the presentation, their action is not discussed here; we utilize them in the standard fashion [1,4,5].) The environmental grid mediates functional interaction between genomes. The potential for functional interaction exists wherever the processes initiated by one genome can, through the medium of a common environment, establish conditions which affect, either positively or negatively, the strength of another genome. In our model, genomes which are proximate to one another on the grid experience similar environmental conditions and can simultaneously affect their mutual environment. Since a genome's fitness is a function of its performance within its immediate environment, the fitnesses of genomes which share a common region of the grid are linked. The grid also performs an important role in mediating reproductive interactions between genomes. In order to more accurately model the natural system, mate selection and offspring placement are spatially biased. The probability of an offspring emerging from a crossover between two parent genomes is inversely related to the distance between them. Similarly, when an offspring is placed in the grid, the probability that the offspring is found a distance x from its parents decreases inversely with the magnitude of x.
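The spatially biased mate selection could be sketched as follows. The exact 1/(1 + d) weighting is an assumption, since the text states only that the crossover probability is inversely related to the distance between the parents.

```python
import math
import random

def pick_mate(genome_pos, candidates, rng):
    """Sketch of spatially biased mate selection on the integer grid.

    `candidates` is a list of (position, genome) pairs.  Each candidate
    is weighted by 1 / (1 + d), where d is its Euclidean distance from
    `genome_pos`, so nearer genomes are chosen more often; the specific
    weighting function is an illustrative assumption.
    """
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    weights = [1.0 / (1.0 + dist(genome_pos, pos)) for pos, _ in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Offspring placement can be biased the same way: sample a displacement whose probability decreases inversely with its magnitude and add it to the parents' position.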
These spatial dependencies appear to us to be important factors in promoting the formation of composite genomes, as we discuss below.

Structural and Spatial Coherence

In distributed populations, two kinds of groups emerge due to the action of the genetic operators and regional variations in environmental conditions. These groups derive their coherence from different properties and each serves a different function. The first of these are structurally coherent groups, or genotypes. The members of a genotype can be said to be structurally coherent in the sense that the structures of their programs, and hence the behaviors these programs produce, are similar. In sufficiently large, unevenly distributed populations, we expect a number of diverse genotypes to emerge due to functional specialization. If the individual genomes in an unevenly distributed population are computationally limited, in the sense that they are incapable of responding, simultaneously, to all the demands and regularities present in their environment, then functional specialization will occur [6]. Spatially distributed genomes, under selection pressure and the action of genetic operators, will begin to pursue different survival strategies and, after a time, will be separable into distinct genotypes according to their patterns of behavior and the structure of the programs which encode for this behavior. As time goes on, the behaviors encoded in these specialized genotypes will become mutually exclusive, in the sense that a given genome, due to its limited computational capabilities, will be unable to adequately perform more than one of the behaviors at a time.

The second kind of grouping arises as a consequence of the spatial dependencies built into our model. Under our formulation, a collection of genomes which occupies a particular region forms a spatially coherent group, or sub-population. These spatial groups hold a special place in distributed systems.
Not only do the individuals within them share a common context, but since mate selection and offspring placement are spatially biased, reproductive activity tends to be concentrated within them. If the population density is sufficiently low, individuals will tend to cluster in these spatially coherent groups, particularly if certain regions of the environment are more "hospitable" than others. The individuals within such "spatial niches" [7] will form isolated sub-populations whose members interact both functionally and reproductively, and tend to place their offspring back in the niche with high probability. Since the members of these groups interact functionally, if the niche contains genomes from different genotypes, the potential for symbiosis between individuals from these different types exists and is selected for.

A Reproductive Continuum

Crossovers of parents within and between the groups described above produce offspring of different characters, each of which performs a different function in the distributed system. The reproductive operations induced by the crossover operator take on the character of a continuum, producing different kinds of offspring depending on the degree of structural similarity between the crossing parents. This continuum is depicted below, linking two naturally occurring operations, replication and recombination, with a third operator which we refer to as hybridization.

[Figure: the reproductive continuum, from replication through recombination to hybridization]

Admittedly, it may seem odd at first to assert that any kind of continuum exists between replication and recombination, since so many differences exist between them. Replication, the process by which a single composite genome produces a macro-organism composed of millions of differentiated cells, seems to have little to do with recombination, the process which produces new individuals by "mixing" two parent genomes.
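The distance bias in mate selection can be sketched as follows (the 1/(1+d) decay law is our own choice; the paper only states that the probability is inversely related to distance):

```python
import math
import random

def mate_weights(loc, candidates):
    """Weight each candidate mate inversely by its distance from loc."""
    return [1.0 / (1.0 + math.dist(loc, c)) for c in candidates]

def choose_mate(loc, candidates, rng):
    """Pick a mate with probability proportional to the inverse-distance
    weights, so nearby genomes are preferred and niches stay cohesive."""
    return rng.choices(candidates, weights=mate_weights(loc, candidates), k=1)[0]

rng = random.Random(0)
candidates = [(1, 0), (10, 0)]
weights = mate_weights((0, 0), candidates)
print(weights[0] > weights[1])  # True: the nearer candidate is preferred
```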
They occur at different levels in the biological hierarchy, and the physical processes which implement them are quite distinct. Nevertheless, we argue that, from an information-processing perspective, replication and recombination can be considered as points on a logical continuum of reproductive operations by which new genomes can be produced from existing ones. While only certain regions of this continuum are realized in the natural genetic system, we have incorporated all of them in our model.

Replication lies at one extreme of this hypothetical continuum, occurring in our model either from the action of the replication operator or as a crossover between identical parents. The resulting offspring is an exact replica of the parental genome, but, as with all genomes in our model, its actions and responses are dictated by the state of its internal and external environment. Replication is most interesting when the parental genome is composite. When this is the case, the sections of code which become active in the offspring may differ from those which are active in the parent, due to differences in their internal and external environments. For example, a composite genome containing code for two distinct processes, each one sensitive to a distinct set of internal and external cues, can produce two differentiated types of offspring. While each is an exact genetic replica of the parent, different components of the composite genome are active in each of the offspring types, due to differences in the internal and external environments in which they have been placed. As a result, the two types exhibit different behaviors.

When crossovers occur between genotypic variants, i.e., non-identical members of the same genotype, the offspring genome begins to look less like a replica of its parents and more like a mix of two similar, but distinct, parents. In place of replication, we get recombination.
Those familiar with genetic algorithms will recognize recombination as the standard image of crossover. Two parents, variants of the same basic genotype, cross to produce an offspring which shares sections of both parents' genetic code and exhibits behaviors characteristic of both parents, as well as new behaviors arising from epistatic interactions between the spliced sections. Recombination has been studied extensively by a number of genetic algorithm researchers [1,4,5], and has been shown, under certain conditions, to generate individuals of increasing fitness through an intrinsically parallel search of potential genospace.

As the parents become less and less similar, and the behaviors they encode for increasingly specialized, crossovers in our model tend toward hybridizations. We define hybridization as a crossover between genomes from radically different genotypes, each of which codes for a separate process activated by a distinct set of internal and external cues, which produces a new, composite genome that contains the code for both processes. It is the primary source of composite genomes in our model. In hybridization, two parent genomes, each of which codes for a different, specialized process, cross at a non-interfering point to form a composite offspring capable of exhibiting either behavior, depending on the external and internal conditions it encounters.

Composite Genomes

We see evolution in the distributed genetic system as an interaction of the operations described above. As new individual strategies emerge, recombinations between genomes which follow similar strategies produce increasingly fit individuals. When the individuals are computationally limited and spatially distributed, we expect a number of mutually exclusive strategies to emerge. If migrations occur in the space, we expect the formation of spatially coherent groups that contain genomes from more than one of these specialized genotypes.
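In programmatic terms, the hybridization operation defined above can be pictured as carrying two specialized code sections intact in one genome, with the environment selecting which one is expressed (a toy illustration; the cue/process representation is ours, not the paper's):

```python
def make_genome(processes):
    """A genome as a mapping from internal/external cues to the
    specialized process each cue activates."""
    def behave(cue):
        return processes.get(cue, "no-op")
    return behave

# two specialized parent genotypes, each coding for one process
parent_a = {"food-near": "forage"}
parent_b = {"cold": "burrow"}

# hybridization at a non-interfering point: the composite offspring
# carries both code sections and can exhibit either behavior,
# depending on the conditions it encounters
hybrid = make_genome({**parent_a, **parent_b})

print(hybrid("food-near"))  # forage
print(hybrid("cold"))       # burrow
```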
The key to the development of successful composite genomes lies in hybridizations that occur within these structurally diverse sub-populations, between symbiotically related parents. When a spatially coherent group, composed of genomes from more than one genotype, each of which exhibits a different specialized behavior, persists over time, spatial biases in mate selection and offspring placement increase the frequency of hybrid offspring generated by parents from different genotypes within the spatial niche. Those offspring which successfully hybridize mutually beneficial behaviors will gain a selective advantage. We suggest that hybridizations which occur between the genomes in spatially coherent groups whose members are symbiotically related give rise to composite genotypes which make structurally explicit the implicit, spatial coherence that exists between the members of such groups. Replications of these composite types produce a set of independent computational units which act as an implicitly coordinated, distributed system.

Preliminary verification of our hypothesis that successful composite genomes can form from the spatially-biased interaction of computationally limited adaptive units was provided by an experiment with the authors' software system, Asgard. Asgard is designed to simulate the behavior and adaptation of artificial animals operating under a genetic algorithm; similar systems have also been developed by Holland and Reitman [8], Booker [9], and Wilson [10]. Unlike these systems, however, the main objective of Asgard was the study of the evolution of a distributed population of interacting genomes, rather than the evolution of functionally isolated individuals. In the next section we describe the organization of the system and review the results of the most interesting simulation.

ASGARD

Asgard's environment is a finite, toroidal, two-dimensional grid, with a resolution of 160 x 60, which contains "food" at various locations.
It is divided into 4 equal quadrants, each of which can be displayed on a Tektronix 4105 graphics terminal. The display style is similar to Wilson's [10]. The grid is "home" to a population of genomes which move about the grid in search of food. Each time step, the genomes in the grid expend one unit of strength in order to stay "alive", and each unit of food they consume increases their strength by one unit. In order for a genome to survive, then, its average food consumption must be at least one unit per time step; in order to reproduce, its average consumption must be somewhat greater than 1.

The behavior of each genome is controlled by its individual program. These programs are lists of labeled instructions of two types, Move and Food?, which specify either a movement direction or a test and transfer of control. Order of execution is controlled by a program counter that specifies which instruction in the list is to be performed next. Move instructions take one argument, which specifies one of eight directions of movement (N, E, SW, etc.). When executed, these instructions move the genome one unit in the specified direction. Food? instructions transfer control by testing the eight locations surrounding the genome for the presence of food. They take two arguments, both labels, which specify the statement to which the program counter should point given that food is or is not present in any of the surrounding locations. A short example is given below (the accompanying movement diagram is not reproduced here):

    Start:  1  Move S
               Food? 2,1
            2  Move E
               Food? 2,3
            3  Move W
               Food? 2,3

In terms of our model, each genome's immediate external environment is defined solely by the concentrations of food in its neighborhood. A genome can modify its external environment in two ways, either by consuming food or by moving to a different location in the grid.
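A minimal interpreter for this two-instruction language might look like the following (the program encoding, label scheme, and the simplified example program are our own assumptions, not the paper's exact listing):

```python
# The eight movement directions map to unit steps on the grid.
DIRS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0),
        "NE": (1, 1), "NW": (-1, 1), "SE": (1, -1), "SW": (-1, -1)}

def neighbors(pos):
    x, y = pos
    return [(x + dx, y + dy) for dx, dy in DIRS.values()]

def step(program, pc, pos, food):
    """Execute one instruction. Move advances the genome one unit and
    the program counter by one; Food? tests the eight surrounding
    cells and branches to its 'yes' or 'no' label."""
    op = program[pc]
    if op[0] == "Move":
        dx, dy = DIRS[op[1]]
        return pc + 1, (pos[0] + dx, pos[1] + dy)
    yes_label, no_label = op[1], op[2]
    found = any(n in food for n in neighbors(pos))
    return (yes_label if found else no_label), pos

# a toy program: move south, then shuttle east while testing for food
program = {1: ("Move", "S"), 2: ("Food?", 2, 3),
           3: ("Move", "E"), 4: ("Food?", 2, 1)}

pc, pos = step(program, 1, (0, 0), food=set())
print(pc, pos)  # 2 (0, -1)
```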
A genome's internal state is defined by the position of its program counter. By executing Food? instructions, the genome can detect the state of its external environment and, based on this information, can change its internal state, potentially altering its future external response pattern.

The evolution of the population is driven by an algorithm from the class of reproductive plans [1]. The algorithm starts with a population of genomes whose programs are formed at random from the set of legal instructions. Food is placed into the environment according to a pattern which may depend on time as well as the behavior of genomes in regions of the torus. The objective is to produce a population which responds to regularities in this pattern. The algorithm produces a sequence of populations whose members interact with each other through common local environments and are allocated offspring on the basis of their individual consumption. Over time, the successive populations will contain individuals adapted not only to the conditions of the artificial environment, but also to the behavior patterns of the other members of the population. Let A(t) denote the population at time t and let a be any member of A(t).
One iteration of the algorithm consists of the steps below:

    cobegin { for all a in A(t) }:
        determine Loc(a) based on a's program and update program counter [allows a one Move];
        consume food at Loc(a);
    coend
    cobegin { for all a in A(t) such that Str(a) > T_r }:
        replicate a to create a';
        begin { with probability P_c }:
            choose m at random from the set A(t) - {a}, biasing choice by distance |Loc(a) - Loc(m)|;
            crossover a' and m;
            replace a' with one of the results of the crossover;
        end
        choose Loc(a') at random, biasing the choice by the distance |Loc(a') - Loc(a)|;
    coend
    cobegin { for all a in A(t) such that Str(a) < 1 }:
        delete a from A(t);
    coend
    increment t;

where:
    P_c = probability of crossover;
    T_r = reproduction threshold;
    Str(a) = strength of genome a at time t;
    Food(Loc(a)) = number of food units present at a's current grid location.

The reader will note that Asgard's algorithm differs in some respects from more standard implementations (see [1,4,5]). For example, the population size in Asgard is variable, governed only by the carrying capacity of the environment. Also, mate selection and offspring placement are biased by distance, as our model specifies. These modifications were undertaken in order to make Asgard more closely resemble a community and to allow for the formation of semi-isolated sub-populations. Within these sub-populations, trials are allocated to individuals and schemata in proportion to their observed fitnesses, insuring that Holland's intrinsic parallelism theorems hold [1].

The Task

The task set before Asgard's population is to identify and exploit the regularities present in the pattern of food placement. Although we performed experiments with a number of patterns of various complexities, we discuss only one here, the so-called "seasonal" pattern. Under this pattern, there are three basic regularities to which the population can adapt.

1) Food can appear in only 1/8 of the space, concentrated in eight evenly-spaced fertile regions.
Each genome must periodically visit one of these fertile regions in order to survive.

2) The productive capacity of each fertile region oscillates periodically between a maximum and a minimum value (hence the term "seasonal"). Each fertile area can support many more genomes during its "summer" than its "winter", so competition for food within a particular area becomes increasingly fierce as winter approaches.

3) The amount of food actually produced during a given time step by a fertile region is linked to the amount of food consumed in the region during the previous time step. At any given time, an optimal consumption level exists for a region, and the genomes in that region can over- or under-consume with respect to this level, as a group, during a given time step, resulting in decreased food production in the next time step.

In each of the four quadrants of the grid, two 10x15 areas are established as fertile (see Figure 1a). For convenience, these areas are called farms. They are the only areas in the grid where food can be found; the rest of the grid is desert-like. Each time step, Ki(t) units of food are produced by the i'th farm and distributed randomly within it (i = 1,...,8). Ki(t) is itself a function of two other quantities: Mi(t), the seasonal productive potential of farm i, and Ei(t), the consumption efficiency within farm i, a value derived from the total consumption within farm i during the previous period, Ci(t-1). The exact form of these relationships is:

D(t1, t3) + D(t2, t4) < D(t1, t2) + D(t3, t4)   (D stands for Euclidean distance)

If this is the case, the tour is replaced by removing the edges (t1, t2) and (t3, t4) and replacing them with the edges (t1, t3) and (t2, t4) (see Figure 2). One way of parallelising a probabilistic sequential search algorithm is to split the problem into n subproblems and let each processor work on one subproblem. Such a division is,
in general, not possible. For instance, in a TSP, it is unlikely that we could make different processors attempt edge interchanges simultaneously on the same tour and hope to obtain a legal tour.

Figure 1: Tour with edges (t1, t2) and (t3, t4).
Figure 2: Tour with edges (t1, t3) and (t2, t4).

In other words, to achieve such parallelisation one has to add conflict resolution techniques, usually resulting in a degradation of the performance of the algorithm. Another way of parallelising is to let all n processors run independently and take the best available solution at the end. We will call this the independent strategy. A potential problem with this strategy is that, as the processors run independently, some of them may get caught in a local minimum or may search sub-optimal regions of the search space, wasting valuable resource power. Intuitively, it seems likely that we might do better if we let the processors work independently for some time, then exchange information about "good" candidate solutions, again work for a while, exchange new information, and so on. We call such a method an interdependent strategy, and call the time of processing in between two information exchanges a generation.
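The edge-interchange test just described can be sketched as a single probabilistic 2-opt trial (the tour representation and the random choice of edges are our assumptions):

```python
import math
import random

def tour_length(tour, pts):
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt_trial(tour, pts, rng):
    """One trial: pick two edges (t1,t2) and (t3,t4) at random and
    reverse the segment between them if replacing those edges by
    (t1,t3) and (t2,t4) shortens the tour. Returns True on success."""
    n = len(tour)
    i, j = sorted(rng.sample(range(n), 2))
    if j - i < 2 or (i == 0 and j == n - 1):
        return False                      # the two edges share a city: no-op
    a, b = pts[tour[i]], pts[tour[i + 1]]
    c, d = pts[tour[j]], pts[tour[(j + 1) % n]]
    if math.dist(a, c) + math.dist(b, d) < math.dist(a, b) + math.dist(c, d):
        tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
        return True
    return False

rng = random.Random(0)
pts = [(x, y) for x in range(5) for y in range(5)]   # small 5x5 lattice
tour = list(range(25))
rng.shuffle(tour)
before = tour_length(tour, pts)
for _ in range(5000):
    two_opt_trial(tour, pts, rng)
after = tour_length(tour, pts)
print(after <= before)  # True: only improving interchanges are accepted
```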
Clearly, there are many ways of exchanging information about good candidate solutions. A straightforward strategy is to overwrite after each generation a certain number of "bad" candidate solutions by good candidate solutions. More sophisticated strategies could involve exchanging structural parts of good candidate solutions. Examples of such strategies are the crossover operators found in Genetic Algorithms (GA) [3].

In the rest of this paper, we give evidence that interdependent strategies are usually better than the independent strategy when parallelising probabilistic sequential search algorithms which use local improvement operators. In fact, we suggest that a good technique of parallelisation is to use an interdependent strategy where information exchange is done on a fairly regular basis. In Section 2, we illustrate this approach by parallelising a simple problem, the Classical Occupancy Problem [4]. Section 3 covers the parallelisation of a more complicated search algorithm: the 2-opt strategy of Lin and Kernighan for the TSP mentioned above. In Section 4, we describe experiments with a genetic algorithm for the TSP. We show that genetic algorithms can be viewed as parallel search algorithms that implement an interesting kind of interdependent strategy to achieve good, robust performance. The standard selection procedure of the GA can be viewed as a mechanism for achieving information exchange, and the local improvement operator can be viewed as a recombination operator of the GA. Finally, Section 5 offers a discussion of the ideas and results presented in this paper.

2.
A Toy Problem: The Classical Occupancy Problem

Consider the classical occupancy problem: given a structure of N empty cells, shoot points randomly at the cells (with a probability of 1/N of hitting any given cell) until all cells are filled. The time for solving this problem is the number of shots required to fill all N cells. This problem has been studied extensively by Kolchin et al. [5]. We present here some experimental results which illustrate the advantages of an interdependent strategy over the independent strategy.

The sequential algorithm involves starting with the initial structure of N empty cells and repeatedly generating a random number between 1 and N; this is called a trial. If the cell was empty before, it is now assumed to be full, and if not, the trial has been unsuccessful. In the independent case, this sequential algorithm will run separately on all n processors; the time required to solve the problem is then the number of shots required by the processor that finished first. In the interdependent case the algorithm would look as follows. (Note that if a generation involves infinitely many trials, i.e., attempts at local improvements, the interdependent strategy reduces to the independent strategy.)

    assign to each processor the empty structure;
    full ← 0; generation ← 0;
    while full < N do begin
        generation ← generation + 1;
        each processor generates a number between 1 and N;
        if (at least one processor is successful) then begin
            randomly choose one among the successful processors;
            distribute its structure to all other processors;
            full ← full + 1;
        end;
    end;

We now consider results of simulation on the classical occupancy problem with N = 100. First we fixed the number of processors and varied the number of trials per generation (tpg).
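The interdependent variant is easy to simulate. The sketch below follows the pseudocode, generalized to tpg trials per processor per generation; it redistributes the fullest local structure each generation (the paper redistributes one randomly chosen successful structure), and the helper names are ours:

```python
import random

def occupancy_interdependent(n_cells, n_procs, tpg=1, seed=0):
    """Count generations until all n_cells are full when n_procs
    processors shoot in parallel and the best structure found in a
    generation is copied to every processor before the next one."""
    rng = random.Random(seed)
    shared = set()                       # the structure all processors hold
    generations = 0
    while len(shared) < n_cells:
        generations += 1
        local = [set(shared) for _ in range(n_procs)]
        for s in local:                  # each processor works independently
            for _ in range(tpg):
                s.add(rng.randrange(n_cells))
        shared = max(local, key=len)     # redistribute the best structure
    return generations

g = occupancy_interdependent(100, 10, tpg=1, seed=0)
print(g >= 100)  # True: with tpg = 1, at most one new cell is locked in per generation
```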
As indicated by the above algorithm, after each generation the best structure was redistributed (copied) to all the processors. We found that tpg = 1 resulted in the minimum number of generations. It should be noted, however, that a higher number of trials per generation also did well compared to the independent case. For N = 100 and n = 10 processors, the independent case required on average (taken over 500 experiments) 370 generations until completion. (Note that 100 is optimal.) In the interdependent case, for tpg = 1 the processors required on average 128 generations, for tpg = 4 they required 165 generations, and for tpg = 7 they required 190 generations.

Next we fixed the trials per generation and varied the number of processors. As the number of processors increased, the generations required in the independent case decreased, but at a slower rate than the generations required in the interdependent case. For example, with 1 trial per generation and N = 100, it took 3 processors on average 221 generations to finish in the interdependent case and on average 427 generations in the independent case. With 5 processors the interdependent strategy required 166 generations while the independent case required 397 generations. With 10 processors the generations required were 128 and 370 respectively. (As the number of processors tends to infinity, both strategies will require N generations.) Our experiments suggest that doing fewer trials per generation and increasing the number of processors is better.

But this was a simple problem. In this problem, we know the (optimal) solution, and among processors that have a successful trial, there is no one best structure, since they all have the same number of empty cells. That is, they are all equally good. For this reason, we decided not to try a strategy of taking the k best candidate solutions with k > 1, although a slightly different performance may be expected. We now
consider the parallelisation of a more complex problem and compare the performance of various interdependent strategies with each other and with the independent strategy.

3. Experiments with a Search Algorithm for the Traveling Salesman Problem

The domain of this experiment is the Traveling Salesman Problem (TSP). We use the operator of Lin and Kernighan (the r-opt strategy), described earlier, as an example of a probabilistic operator that makes small local changes to produce new structures from old ones. The larger the value of r, the more likely it is that the final solution (when no more exchanges are possible) is optimal.† However, r is usually chosen to be 2 because the number of possible edge interchanges is of the order of (n choose r) x r!. In each generation, each processor performs a certain number of applications of the 2-opt strategy on the structure it currently holds in its local memory, and after each generation the best structure was redistributed (copied) to all the processors. The domain for the experiments described here is a lattice of 100 points spread over (0,0), (0,9), (9,0), (9,9), as shown in Figure 3.

Figure 3: Lattice of 100 cities.

Clearly, the optimal tour length is 100.‡ We will consider the total effort, given by the number of generations times the number of trials per generation, required to get to within 10% of optimal performance, as the number of trials per generation (tpg) is varied. Notice that we do not take into account the overhead involved in redistributing the structures after each generation. The algorithm is as follows:

    for various values of tpg do begin
        generation ← 0;
        generate a structure randomly;
        copy it to all n processors;
        while (bestperformance > 1.10 * optimumvalue) do begin
            generation ← generation + 1;
            each processor attempts tpg local improvements;
            find the structure with the best performance;
            distribute this structure to all other processors;
        end;
    end;

† It should be noted that Lin and Kernighan propose a more powerful strategy where r is varied
dynamically, but the simple 2-opt strategy also gives good results and is quite efficient [2].

‡ It should be noted that similar results can be obtained for other TSPs.

The actual algorithm also keeps track of the number of successful local improvements (i.e., applications of the 2-opt operator that make the tour length smaller) and the number of experiments that got aborted (i.e., those experiments wherein the performance did not get below 10% of optimal performance even after many generations). Figure 4 shows the graph for 10 processors. Similar graphs were obtained for different numbers of processors. This shows that if we have fewer trials per generation, then the total effort required to get a relatively good performance is less. (As mentioned before, when tpg tends to infinity we have the independent strategy, and that clearly requires a lot more effort.) But not all trials involve successful local improvements. Those that do not perform any local improvement will take very little processor time because the tour does not need to be rearranged. Therefore, to get an idea of the total time required, we plotted in Figure 5 the number of successful local improvements done on average. Again, we see that doing fewer trials per generation reduces the total time required.

There is a danger in doing too few trials per generation, however. As Table I shows, out of 50 experiments for each value of tpg, some get aborted for low values of tpg. This happens when the algorithm gets caught in a local optimum, which occurs because the algorithm described above has no means of maintaining diversity of structures for low values of tpg. Note that for higher values of tpg, that is, strategies closer to the independent strategy, no experiments get aborted, showing the robustness of the operator and the algorithm being used.

To avoid getting caught in a local optimum, we decided to change the interdependent strategy slightly to increase the diversity of examined structures. Instead of taking the one best structure, we decided to experiment with the redistribution of the k best structures after each generation. (Again, as k tends to n we have the independent case.) In the experiments with 10 processors we found that with k = 4 only 1 or 2 experiments get aborted (out of 50) for lower values of tpg. We may conclude that increasing k (i.e., maintaining diversity) makes the algorithm more robust.

There is another price we pay for exchanging information too quickly. Every time we exchange information, a copy time is involved, and this copy time increases as the number of trials per generation decreases. But it will decrease as k increases (for a fixed value of tpg).† Thus, our experiments show that one obtains a good interdependent strategy by keeping tpg as low as possible, to decrease the number of total trials, and by using a large enough k to keep the algorithm from getting trapped in a local minimum as well as to reduce copy time.

4. Experiments using a Genetic Algorithm

Genetic Algorithms (GA) [3], introduced by Holland, have been applied with good success to function optimisation problems involving complex functions [6], as well as to some combinatorial optimisation problems [7,8,9,10,11,12,13].
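The GA selection mechanism discussed in this section (expected copies proportional to performance relative to the population average) can be sketched as follows (a sketch; using 1/length as a tour's fitness is our assumption, since shorter tours are better):

```python
import random

def proportional_selection(lengths, rng):
    """Return indices of the structures that make up the next
    generation: each slot is filled with probability proportional to
    fitness, here taken as 1/length, so in expectation a structure
    receives performance / average-performance copies."""
    fitnesses = [1.0 / L for L in lengths]
    return rng.choices(range(len(lengths)), weights=fitnesses, k=len(lengths))

rng = random.Random(0)
lengths = [100.0, 110.0, 200.0]   # tour lengths of a 3-structure population
counts = [0, 0, 0]
for _ in range(1000):
    for i in proportional_selection(lengths, rng):
        counts[i] += 1
print(counts[0] > counts[2])  # True: the shortest tour is copied most often
```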
† It should be noted that on a parallel machine the copy time can be reduced.

A typical GA maintains a population of structures. After each generation, which consists of applying the local improvement operator a certain number of times (trials) to each structure, it uses a selection mechanism to produce a new population of structures for the next generation. (Thus, a trial involves a probabilistic operation that makes a small local change to the structure.) The selection mechanism assigns a strength-of-performance measure to each structure. This measure is typically the ratio of the performance of the structure to the average performance of the population. The number of occurrences of this structure in the next generation is proportional to this performance measure.†

A genetic algorithm can be viewed as a parallel search algorithm of the type we have been discussing in the previous sections. Whereas a sequential algorithm (using the local improvement operator) operates on one structure, a GA operates on a population of n structures. If one imagines that each structure is worked on by a separate processor, the selection mechanism can now be viewed as an interdependent strategy: the n processors run separately for a while (a generation) and then exchange information (about performance), resulting in processors getting a (possibly) different structure for the next generation. This strategy is a more sophisticated version of our earlier idea which consisted of taking the k best structures from the current population.

We used a version of the genetic algorithm (GA) that uses the 2-opt local improvement operator to solve the TSP. We ran the GA for 66,000 trials, varying tpg. The population size, or alternatively the number of processors, is 50. On the lattice of 100 points (optimum length is 100), the GA did not do well for small values of tpg (5 and 10) but did almost equally well on higher values of tpg. Table II shows the effort required for the GA to get to within 10% of the
optimal performance. It should be noted that these values are close to the values obtained in our earlier interdependent strategy of taking the one best (Table I). Next we chose the domain to be 100 points uniformly distributed over (0,0), (0,1), (1,0), (1,1). Figure 6 shows the performance versus tpg curve. The best performance is seen at about tpg = 500, and it worsens a little after that. Therefore, using the selection mechanism of a GA as a strategy of interdependency, and keeping tpg relatively small, yields a fast and robust algorithm with good performance.

5. Conclusion

We have given evidence that probabilistic sequential search algorithms which operate by performing a local change on a structure to generate a new structure can be parallelised by doing a reasonably large number of local improvements for each structure per generation and then exchanging information about "good" structures. Doing too few trials per generation, however, may not yield good performance values, as premature convergence may occur. On the other hand, we also showed that taking the one best structure and redistributing it over the other processors is too simplistic a strategy, since it also causes premature convergence; hence a few best should be selected for redistribution. In fact, it turned out that the more sophisticated interdependent strategy, the selection procedure of a genetic algorithm, resulted in the most robust strategy.

† Most GAs also use a crossover operator that takes two structures and interchanges parts of them to produce two new structures. In this paper we ignore such operators.
It should be noted that we have ignored copy time in the presentation of the results. We believe, however, that even if we take this overhead into account, similar results concerning the independent versus interdependent strategy may be obtained.

Acknowledgements

We wish to thank the referees for helpful comments which helped in clarifying some of the ideas and results of the paper.

References

1. S. Lin and B. W. Kernighan, "An Effective Heuristic Algorithm for the Traveling Salesman Problem", Operations Research, pp. 498-516 (1973).
2. E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan and D. B. Shmoys (Eds.), The Traveling Salesman Problem, John Wiley and Sons Ltd. (1985).
3. J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
4. N. L. Johnson, S. Kotz, Urn Models and their Applications, Wiley and Sons, 1977.
5. V. F. Kolchin, B. A. Sevastyanov, V. P. Chistyakov, Random Allocations, V. H. Winston and Sons, 1978.
6. K. A. DeJong, "Adaptive system design: a genetic approach", IEEE Trans. on Systems, Man and Cybernetics, Vol. SMC-10(9), pp. 556-574 (Sept. 1980).
7. J. J. Grefenstette, R. Gopal, B. J. Rosmaita and D. Van Gucht, "Genetic Algorithms for the Traveling Salesman Problem", Proc. of an Intl. Conf. on Genetic Algorithms and Their Applications, pp. 160-168 (July 1985).
8. J. J. Grefenstette, "Incorporating Problem Specific Knowledge into Genetic Algorithms", to appear.
9. J. Y. Suh, D. Van Gucht, "Incorporating Heuristic Information into Genetic Search", Technical Report, Indiana University, February 1987.
10. L. Davis, "Job shop scheduling with genetic algorithms", Proc. of an Intl. Conf. on Genetic Algorithms and Their Applications, pp. 136-140 (July 1985).
11. M. P. Fourman, "Compaction of symbolic layout using genetic algorithms", Proc. of an Intl. Conf. on Genetic Algorithms and Their Applications, pp. 141-153 (July 1985).
12. D. E. Goldberg and R. Lingle, "Alleles, loci, and the traveling salesman problem", Proc. of an Intl. Conf. on Genetic Algorithms and Their Applications, pp. 154-159 (July 1985).
13. D. Smith, "Bin packing with adaptive search", Proc. of an Intl. Conf. on Genetic Algorithms and Their Applications, pp. 202-206 (July 1985).

  tpg   total effort   success rate   total success   experiments aborted   variance
    5       1605          0.030           45.009               5              0.092
   10       1820          0.034           61.320               3              0.123
   20       2160          0.036           77.385               3              0.093
   40       2640          0.038           99.261               1              0.107
   80       3360          0.035          117.152               3              0.110
  160       4160          0.034          140.610               0              0.157
  320       5440          0.030          160.867               0              0.245
  640       6190          0.027          172.721               2              0.590
 1280       7680          0.022          170.622               2              0.922
 2560      10240          0.019          192.836               2              1.857
 5120      15360          0.015          222.767               0              2.114

tpg = trials per generation
total effort = tpg * (generations to get to within 10%)
success rate = fraction of trials successful
total success = total number of successful improvements done
experiments aborted are out of a total of 50 for each tpg

TABLE II

  tpg   total effort
    5       1500
   10       2058
   20       2812
   40       3760
   80       4480
  160       5600
  320       6080
  640       7040
 1280       7680
 2560       7680
 5120      10240
Figure 4: Effort for 10 processors (total effort versus trials per generation).

Figure 5: Success rate for 10 processors (success rate versus trials per generation).

Figure 6: Performance of 100 random cities (performance versus trials per generation).

PARALLEL GENETIC ALGORITHM FOR A HYPERCUBE

Reiko Tanese*

Department of Electrical Engineering and Computer Science
University of Michigan
Ann Arbor, MI 48109-2122 USA

Abstract

This paper discusses a parallel genetic algorithm for a medium-grained hypercube computer. Each processor runs the genetic algorithm on its own sub-population, periodically selecting the best individuals from the sub-population and sending copies of them to one of its neighboring processors. The performance of the parallel algorithm on a function maximization problem is compared to the performance of the serial version. The parallel algorithm achieves comparable results with near-linear speed-up. In addition, some experiments were performed to study the effects of varying the parameters for the parallel model.

1 Introduction

The genetic algorithm, which was designed by Holland [1] in the 1960s, is applicable to a wide range of problems, and its popularity is steadily increasing. In some applications, such as a simulation of natural phenomena, the algorithm requires many generations and a large number of individuals in the population. However, time limitations make it infeasible to run the algorithm with such a large population. To alleviate this, this research investigates a parallel version of the genetic algorithm on a hypercube computer. The results show that the parallel version is significantly faster than the serial one, and that nearly linear speed-up can be achieved with as many as 64 processors.

The parallelization of the algorithm involves dividing the population and placing a sub-population on each processor.
A processor runs the genetic algorithm on its own sub-population, periodically selecting good individuals from the sub-population and sending copies of them to one of its neighboring processors. Earlier work was done by Grosso [2] studying sub-populations in the genetic algorithm on a serial computer. Grefenstette [7] has proposed several versions of parallel adaptive algorithms for different architectures.

* Partially supported by Digital Equipment Corporation.

The next section in this paper describes the hypercube architecture and the specific hypercube computer, the NCUBE, on which this research was performed. Section 3 describes the serial and parallel algorithms studied. The algorithm's performance is measured on a function maximization problem; the function to be maximized is defined in section 4. The experiments which were conducted and their analyses are contained in section 5.

2 The Hypercube Computer

The parallel genetic algorithm presented in this paper is designed to take advantage of a parallel architecture, a medium-grained hypercube. An n-dimensional hypercube computer consists of N = 2^n processors interconnected as an n-dimensional binary cube (Figure 1). Each processor is a node of the cube and has its own CPU and local memory. The communication among the processors is done via message passing. Each processor is directly connected to n other processors (its neighbors), which makes the distance between any two processors in the cube at most n communication links long. This means that information can be broadcast through the cube in n (or lg N) time steps, compared to O(N^{1/2}) in a 2-dimensional grid, another popular parallel architecture.

The specific machine used to run all the experiments in this paper is a general purpose 64-processor NCUBE/six hypercube made by NCUBE Corporation [3]. Each processor in the NCUBE/six is a powerful 32-bit custom VAX-like CPU chip with 128K bytes (soon to be 512K) of local memory.
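The hypercube addressing just described can be captured in a few lines (a sketch of the connectivity, not part of the NCUBE software):

```python
def neighbors(pid: int, n: int) -> list[int]:
    """The n processors adjacent to pid in an n-dimensional hypercube:
    flip each bit of the n-bit processor address in turn."""
    return [pid ^ (1 << d) for d in range(n)]

def distance(a: int, b: int) -> int:
    """Number of communication links between processors a and b: the
    count of bit positions in which the two addresses differ (so the
    maximum distance in the cube is n)."""
    return bin(a ^ b).count("1")
```

For example, in a 3-dimensional cube processor 0 is adjacent to processors 1, 2 and 4, and no two processors are more than 3 links apart.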
Although a quarter of this memory is needed to store the node operating system and message buffers, it is still large enough to categorize the NCUBE/six as a medium-grained hypercube computer. An 80286 host processor is used to provide the connections between the node processors and the external world.

Figure 1: Hypercubes for n = 0, 1, 2, and 3.

3 The Genetic Algorithm

3.1 The Serial Algorithm

The serial genetic algorithm (Figure 2) is a straightforward implementation of a standard algorithm with two genetic operators: crossover and mutation. Two parameters, crossover_rate and mutation_rate, determine the frequency of application of these operators. Crossover_rate is the average number of crossovers per mating. The actual number of crossovers for any particular pair of individuals is a Poisson-distributed random variable with a mean of crossover_rate. That is, if crossover_rate is 1, any pair will undergo the crossover operation on the average once. Similarly, mutation_rate is the average number (Poisson-distributed) of mutations per individual. Notice that this differs from the traditional interpretation, in which the mutation rate represents the probability of mutation per allele. The method used here is functionally equivalent to the traditional method using mutation_rate / the length of a chromosome. The new method is simply more efficient, especially when there are few mutations per chromosome.

The algorithm keeps track of the best individual in the population and carries it over to the next generation. This feature, which was first used by DeJong [5], is especially helpful when solving function optimization problems, in which the environment remains unchanged over time.

Randomly generate initial population of pop_size.
For gen = 1 to num_gens do
    Compute the fitness of each individual in the population.
    Determine the number of offspring which each individual is going to have, based on the individual's fitness.
    For i = 1 to (pop_size / 2) do
        Pick 2 parents randomly without replacement.
        Crossover the parents based on crossover_rate to produce 2 new offspring.
        Mutate each offspring based on mutation_rate.
    endfor
    If the best fit individual from the old population is not in the new population, replace one of the individuals in the new population with the best individual from the old population.
endfor

Figure 2: The serial genetic algorithm.

3.2 The Parallel Algorithm

In the parallel implementation (Figure 3), each processor runs the genetic algorithm on its own sub-population, periodically selecting good individuals from its sub-population and sending copies of them to one of its neighbors. It will also receive copies of this neighbor's good individuals, with which it will replace bad individuals in its own sub-population. The neighbor with which this exchange takes place will vary over time: each exchange will take place along a different dimension of the hypercube. The frequency of exchange and the number of individuals exchanged are two new adjustable parameters.

For example, suppose there are four processors, each containing a sub-population of size 50. Suppose they are to send 10% of their good individuals to their neighbors at every 10th generation. Suppose that the processors have their identification numbers assigned as shown in Figure 1. At generation 10, processors 0 and 1 will send 5 of their good individuals to each other, and so do processors 2 and 3. At generation 20, processors 0 and 2 will exchange individuals, and so do processors 1 and 3.

This exchange of individuals is designed to remedy the parallel algorithm's main difference from the serial one.
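The dimension-cycling pairing in the example above can be sketched as follows (an illustrative helper, not code from the paper):

```python
def exchange_partner(pid: int, exchange_number: int, n: int) -> int:
    """Partner of processor pid for the exchange_number-th exchange in
    an n-dimensional hypercube: successive exchanges flip successive
    address bits, so each exchange takes place along a different
    dimension of the cube. The relation is symmetric, so the two
    partners send copies to each other."""
    return pid ^ (1 << (exchange_number % n))
```

With four processors (n = 2) this reproduces the example: the first exchange pairs 0 with 1 and 2 with 3, the second pairs 0 with 2 and 1 with 3.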
In the serial algorithm, an individual can select its mate from any other individual in the population, while in the parallel version potential mates are all in the same sub-population, a much smaller pool.

Randomly generate initial sub-population of sub_pop_size.
For gen = 1 to num_gens do
    Calculate fitness as in the serial algorithm.
    If (gen mod frequency_of_exchange = 0) then
        Choose num_exchange good individuals from the pool of individuals whose fitnesses are at least equal to the average fitness of the sub-population.
        Send copies of the good individuals to the neighboring processor along one dimension of the hypercube.
        Receive copies of the neighbor's good individuals.
        Choose num_exchange bad individuals from the pool of individuals whose fitnesses are no greater than the average fitness of the sub-population.
        Replace these individuals with the newly received ones.
    endif
    Allocate offspring as in the serial algorithm.
    Reproduce as in the serial algorithm.
    Keep the best fit individual from this generation as in the serial algorithm.
endfor

Figure 3: The parallel genetic algorithm.

This restriction is overcome by allowing the better fit individuals to circulate among the sub-populations, in the hope that they might prove to be "useful".

Notice that the "good" individuals to send to a neighboring processor are chosen probabilistically from the individuals whose fitnesses are at least equal to the average fitness of the sub-population. The current implementation ensures that the individual with the highest fitness has the highest chance of being duplicated and sent to a neighboring processor. Similarly, the "bad" individuals are chosen probabilistically from the individuals whose fitnesses are no greater than the average fitness of the sub-population, giving the worst fit individual the highest chance of being replaced.

There is a limit, though, to how frequently exchanges should occur.
Notice that the exchange scheme produces duplicates of the fit individuals, which is equivalent to producing more offspring. This increased fertility can, in some cases, lead to premature convergence.

Another consideration in determining the exchange frequency is the problem domain. The function optimization domain is one in which the environment is unchanging; the function values are fixed. In other domains, in which the environment may be changing, it might be better not to exchange as frequently. This is due to the fact that as the environment changes, the sub-populations may undergo substantial re-adjustment. It would be best to perform exchanges only after the sub-populations have settled into quiescence, which could take a number of generations to achieve.

The parallel genetic algorithm in Figure 3 can easily be implemented on the NCUBE. The two main modifications required are: addition of an inter-processor communication routine for exchanging individuals, and addition of a host program which loads the node processors with the genetic algorithm program and establishes the I/O interface between the user and the node processors, since the node processors cannot perform direct I/O. The core of the genetic algorithm running on each processor is as in Figure 2.

4 Function Maximization

The problem of function maximization is used for a preliminary demonstration of the performance of the parallel algorithm. This is because in the function optimization problem there is a clear measurement of how well the system is performing at all times, whereas in other problems performance can be harder to analyze.

The first attempt used five functions studied by DeJong in his thesis [5]. However, the serial algorithm found the global maximum for all five functions within 200 generations, starting with a randomly generated population of 50.
There is little need to parallelize problems which are quickly solvable by serial computers, so a search was made for reasonable genetically-hard functions. Bethke [4] has shown in his thesis that Walsh functions can be used to construct genetically-hard functions. This was used to define a Walsh-like function W on binary strings of length 64. The objective of the algorithm is to maximize W. W is defined as a composition function:

    W(s) = F(Order(s))

where Order(s) = the number of ones in the string s (0 <= Order(s) <= 64), and F is defined so that W has the values shown in Figure 4.

Figure 4: Function W (W plotted against Order(s), 0 to 64).

The function W has the following characteristics:

- It has many local optima and a single global optimum.
- The domain of W is huge, namely 2^64.
- The function values for neighboring points are relatively close together. This helps keep a single fit individual from taking over the population easily, thus preventing the genetic algorithm from developing a fixation.
- The global optimum is not an isolated spike which can only be found by a random search.

5 Experiments and Results

5.1 Number of Processors

Table 1 contains the results of running the parallel algorithm using various numbers of processors. For each dimension, the parameters are set as follows: total population size = 400, crossover_rate = 0.6, mutation_rate = 0.5, and exchange_frequency = every 5 generations. Table 1a contains the results of runs in which 10% of the sub-population is exchanged. Table 1b contains the results of runs exchanging 20% of the sub-population. The runs for dimension 0, those with one processor, are equivalent to running the serial algorithm, in which no exchanges occur.

Sixty-four runs were made for each dimension. Each run keeps track of the earliest generation in which the global maximum is found. The column "average gens" in Table 1 is the average of such generations over 64 runs.
When a run does not find the global maximum within 1000 generations, as seen in dimension 0, the average generation is calculated as if that run reached the global maximum in generation 1000. This is indicated in the table by ">", meaning that it took at least that many generations on average to reach the global maximum. "#Runs reaching max" is the number of runs that reached the global maximum out of the 64 runs. The range column shows the earliest and the latest generations in which the global maximum is found. Since the genetic algorithm is a stochastic algorithm, this range can be quite large.

The total population size for dimensions 4, 5 and 6 is kept approximately 400. This is because the parallel program handles only even sub-population sizes in the current implementation. Notice that there is no data for dimension 6 in Table 1a. This is because 10% of six individuals is less than one, so no exchanges would have taken place.

The results in Table 1 show that the parallel algorithm finds the global maximum in roughly the same number of generations as the serial algorithm.

Table 2 shows the execution times for the runs in Table 1a. The times are for complete runs of 1000 generations, regardless of when the global maximum is found. The parallel algorithm introduces two extra steps over the serial one: choosing good and bad individuals, and exchanging these individuals. From empirical observations, these steps are negligible when the sub-populations are larger than 10; typically under 1% for selection and under 2% for exchanging.

Figure 5 demonstrates the speed-up achieved by the parallel algorithm. The graph plots the dimension of the hypercube versus the log of the execution time.

Figure 5: Dimension versus log of execution time (linear speed-up versus measured results).
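Using the execution times of Table 2, the speed-up over the serial run can be computed directly (a quick check, not code from the paper):

```python
# CPU seconds for complete runs of 1000 generations, from Table 2
# (hypercube dimension -> time).
TIMES = {0: 1548.33, 1: 787.98, 2: 393.29, 3: 199.18,
         4: 98.88, 5: 54.86, 6: 37.85}

def speedup(dim: int) -> float:
    """Speed-up of the 2**dim-processor run over the 1-processor run."""
    return TIMES[0] / TIMES[dim]
```

The speed-up stays close to the processor count through dimension 4 (for example, about 7.8 on 8 processors and 15.7 on 16), but reaches only about 41 of the ideal 64 at dimension 6.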
Notice that for runs made on the 6-dimensional hypercube, the sub-population size is only 6; consequently the communications overhead becomes a much larger factor in the execution time. It also takes longer to load all 64 processors with the program. These two factors combine to push the 6-dimensional time away from true linear speed-up.

The two results, the comparable performance and the near-linear speed-up, together demonstrate the power of the parallel algorithm.

5.2 Different Mutation and Crossover Rates

The effects of varying the mutation rate in the serial algorithm were studied by DeJong in [5]. The purpose of a mutation is to introduce a lost allele in a given bit position. When a bit position is fixed to one allele in all of the population, no amount of crossover will introduce a different allele there. Therefore, a mutation is an insurance policy which re-introduces lost alleles.

A single bit mutation of an individual can also be thought of as a local search in an area surrounding that individual in a multi-dimensional space. The problem is when the population converges prematurely to a local optimum, in which case it may require more than a single mutation to get over to a new region. A high mutation rate is helpful in such a situation, but it may cause too much disruption during the exploration phase.

The difficulty is in keeping the balance between exploration and exploitation (or switching the emphasis from one to the other). Ackley [6] used a variable mutation rate over time, in which he started out with a high mutation rate to explore the search space, and gradually lowered the rate to concentrate on a promising region. This use of the mutation rate is similar to the use of temperature in simulated annealing.
Ackley used an "iterated" genetic algorithm in addition to this variable mutation rate to ensure that the algorithm would not prematurely converge to a local optimum.

The parallel algorithm shown in this paper introduces a different way to handle the exploration versus exploitation problem. Since the algorithm uses multiple processors, why not have some processors run with a high mutation rate, emphasizing exploration, and the other processors run with a low mutation rate, emphasizing exploitation? When a processor running with a low mutation rate exchanges individuals with a processor with a high mutation rate:

1. If any of the new individuals are less fit than the rest of the processor's sub-population, then they will eventually die off.

2. If any of the new individuals are more fit than the processor's sub-population, then they may help to focus the search in a new direction, possibly avoiding premature convergence.

Two experiments were conducted on a 3-dimensional hypercube to demonstrate this point. All parameters other than mutation_rate were set as follows: total population size = 400, crossover_rate = 0.6, exchange 10% of the sub-population every 5 generations. The first experiment used a mutation_rate equal to 0.5 per individual on half of the processors, and 1.0 on the other half. The second experiment used a mutation_rate of 0.5 for half of the processors, and 2.0 on the others. The results of these experiments are in lines 2 and 3 of Table 3.

Another experiment was run with different values of the crossover and mutation rates on all eight processors. Crossover_rate is set either to 0.6 or 1.0. Mutation_rate can be 0.5, 1.0, 1.5, or 2.0. The combination of these two parameter values makes every processor run in a different mode than the others. The results of these runs are summarized in line 4 of Table 3.

Compared to the results of the "standard" run in line 1, all three experiments found the maximum within a reasonable number of generations.
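The parameter assignments used in these experiments can be sketched as follows. This is illustrative only: the mapping from processor id to rates is an assumption about one natural way to realize the experiments described, not the paper's code.

```python
def mixed_mutation(pid: int, num_procs: int, high: float) -> tuple:
    """Experiments 1 and 2: crossover_rate 0.6 everywhere; half the
    processors use mutation_rate 0.5, the other half use `high`
    (1.0 in the first experiment, 2.0 in the second)."""
    return 0.6, (0.5 if pid < num_procs // 2 else high)

def all_different(pid: int) -> tuple:
    """Third experiment: crossover_rate in {0.6, 1.0} and mutation_rate
    in {0.5, 1.0, 1.5, 2.0}, so each of the 8 processors of the
    3-dimensional cube runs with a distinct parameter pair."""
    return (0.6, 1.0)[pid % 2], (0.5, 1.0, 1.5, 2.0)[pid // 2]
```

The all_different assignment covers all 2 x 4 = 8 combinations, one per processor.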
These results are very encouraging in dealing with one of the main difficulties of the genetic algorithm, its sensitivity to the parameter settings. Numerous studies have dealt with this problem; the algorithm is not practical unless appropriate parameter settings can be found easily. The experiments of Table 3 suggest that the parallel algorithm can perform well even without knowing the best parameter settings.

5.3 Exchange Frequency

Table 4 shows the results of changing the frequency of exchange for runs using a 3-dimensional hypercube. The other parameters are set as follows: total population size = 400, crossover_rate = 0.6, mutation_rate = 0.5, exchange 10% of the sub-population. As the results indicate, exchanging too frequently or too infrequently may degrade the performance of the algorithm. Exchanging every 5 generations seems to be most effective for optimizing the function W.

6 Conclusion

The problems that can be solved by the genetic algorithm can be categorized in three groups:

1. Problems that work better using a large population and the serial algorithm.

2. Problems that work better using the sub-population model and the parallel algorithm.

3. Problems that work well either using a large population and the serial algorithm, or using the sub-population model and the parallel algorithm.

The function maximization problem presented in this paper belongs to the third category. Using the sub-population model for problems in this category has practical benefits, as it achieves near-linear speed-up.

The benefits of a large single population versus those of the sub-population model have been argued in the field of population genetics by Fisher and Wright [8]. According to Wright's "Shifting Balance Theory", there exist situations in population genetics in which the sub-population model succeeds but the single population model will not.
It is not immediately apparent that this theory is transferable to computer science, but if it is, there may be problems which can be solved only by the parallel genetic algorithm. Further research is required to identify the problems in the first category (if any), and to determine if they can benefit from other parallelization techniques.

Finally, the experiments in Table 3 suggest that the parallel algorithm may be more robust than the serial one, in that reasonable performance can be achieved by having different processors use different parameter values, without determining the best possible parameter values. More investigation is required in this area, possibly studying the effects of varying other parameters in addition to the mutation and crossover rates.

Acknowledgements

I would like to thank Professor Quentin Stout for his valuable advice and guidance, and Colin Underwood for his help in presenting these ideas. This work is only possible because of their enthusiasm. I would also like to thank Professor John Holland for introducing me to this subject.

References

[1] J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

[2] P. Grosso, Computer Simulations of Genetic Adaptation: Parallel Subcomponent Interaction in a Multilocus Model, PhD Thesis, Computer and Communication Sciences, University of Michigan, 1985.

[3] J. P. Hayes, T. Mudge, Q. F. Stout, S. Colley and J. Palmer, A Microprocessor-based Hypercube Supercomputer, IEEE Micro, vol. 6, no. 5, Oct. 1986, pp. 6-17.

[4] A. Bethke, Genetic Algorithms as Function Optimizers, PhD Thesis, Computer and Communication Sciences, University of Michigan, 1981.

[5] K. A. DeJong, Analysis of the Behavior of a Class of Genetic Adaptive Systems, PhD Thesis, University of Michigan, 1975.

[6] D. H. Ackley, Stochastic Iterated Genetic Hillclimbing, Carnegie-Mellon University, CMU-CS-87-107, 1987.

[7] J. J.
Grefenstette, Parallel Adaptive Algorithms for Function Optimization, Vanderbilt University, Technical Report CS-81-19 (Preliminary Report), 1981.

[8] W. B. Provine, Sewall Wright and Evolutionary Biology, The University of Chicago Press, 1986.

                             (a) Exchange 10%                 (b) Exchange 20%
Dim  #PEs  Sub Pop   Average   #Runs          Range    Average   #Runs          Range
           Size      Gens      Reaching Max           Gens      Reaching Max
 0      1     400     >198         63       60-1000+    >198         63       60-1000+
 1      2     200      212         64        65-615      198         64        60-675
 2      4     100      201         64        55-530     >210         63       70-1000+
 3      8      50      171         64        80-385      209         64        60-665
 4     16      24      248         64        95-595      200         64        70-510
 5     32      12      307         64       105-785      212         64        65-465
 6     64       6       -           -           -        232         64       115-460

Table 1: Data for 64 runs on a hypercube with 0 to 6 dimensions

Dimension   #Processors   CPU Seconds
    0             1          1548.33
    1             2           787.98
    2             4           393.29
    3             8           199.18
    4            16            98.88
    5            32            54.86
    6            64            37.85

Table 2: Execution times for runs of 1000 generations

Crossover   Mutation             Average   #Runs           Range
Rate        Rate                 Gens      Reaching Max
0.6         0.5                   171          64          80-385
0.6         0.5, 1.0              150          64          45-380
0.6         0.5, 2.0              199          64          65-570
0.6, 1.0    0.5, 1.0, 1.5, 2.0    153          64          65-435

Table 3: Data for 64 runs with different crossover and mutation rates (dimension = 3)

Exchange          Average   #Runs           Range
Frequency (gen)   Gens      Reaching Max
  1                >222        62          70-1000+
  2                >197        63          60-1000+
  3                >180        63          40-1000+
  5                 171        64          80-385
 10                >248        63          95-1000+
 20                 302        64         120-695

Table 4: Data for 64 runs with different frequency of exchanges (dimension = 3)

BUCKET BRIGADE PERFORMANCE: I. LONG SEQUENCES OF CLASSIFIERS

Rick L. Riolo
The University of Michigan

Abstract

In Holland-type classifier systems the bucket brigade algorithm allocates strength ("credit") to classifiers that lead to rewards from the environment.
This paper presents results that show the bucket brigade algorithm basically works as designed: strength is passed down sequences of coupled classifiers, from those classifiers that receive rewards directly from the environment to those that are stage setters. Results indicate it can take a fairly large number of trials for a classifier system to respond to changes in its environment by reallocating strength down competing sequences of classifiers that implement simple reflex and non-reflex behaviors. However, "bridging classifiers" are shown to dramatically decrease the number of times a long sequence must be executed in order to reallocate strength to all the classifiers in the sequence. Bridging classifiers also were shown to be one way to avoid problems caused by sharing classifiers across competing sequences.

1 INTRODUCTION

Like all highly parallel, fine-grained, rule-based learning systems, classifier systems ([Holland, 1986a], [Burks, 1986], [Holland and Burks, 1987], [Holland, 1986b]) must solve the apportionment of credit problem. In short, the apportionment of credit problem is the problem of deciding, when many rules are active at every time step, which of those rules active at step t are necessary and sufficient for achieving some desired outcome at step t+n. In terms of Samuel, who first recognized the problem in the context of his checker playing program [Samuel, 1959], the problem is how to know which of the many moves (or sequences of moves) made in the early parts of a game "set the stage" for a triple jump later in the game. The problem of apportioning credit is especially difficult in complex domains in which (a) information about what is a good result is provided only occasionally, perhaps after long sequences of actions, and (b) there are millions of possible states or state sequences, so that the system never sees the same exact sequence twice.
In classifier systems using the bucket brigade algorithm [Holland, 1985], credit is allocated in the form of a value, strength, associated with each classifier. The strength assigned to a classifier is important for two reasons:

1. Strength determines in part which classifiers will be active at a given time step, and so controls the short term behavior of the system.

2. Strength is used by rule discovery algorithms to guide the creation and deletion of classifiers, thereby influencing the longer term learning behavior of the system.

Thus for classifier systems both to perform well and to learn, strength must be allocated properly and expeditiously by the bucket brigade algorithm.

Basically the bucket brigade algorithm acts in two ways:

1. It adjusts the strength of those classifiers that are active when a payoff is received from the environment. Each classifier's strength is changed a little at a time until it is proportional to the average of the payoffs the system receives when that classifier is active.

2. It redistributes strength from each active classifier to the classifiers that posted messages that activated it. Each classifier's strength is modified a little at a time until it is proportional to the strength of the classifier(s) it activates.

Over time the bucket brigade algorithm reallocates strength from classifiers that directly lead to payoffs from the environment to those classifiers that indirectly lead to payoff, i.e., to classifiers that post messages that "set the stage" for those classifiers directly responsible for receiving payoffs.

The bucket brigade has several characteristics that make it ideal for use with highly parallel systems like classifier systems.
First, the bucket brigade algorithm uses only local information: when adjusting the strength of a classifier, it only needs to know which classifiers directly activated it and which classifiers it directly activates. There is no need for complicated book-keeping or for high-level critics to analyze sequences of actions and assign credit accordingly. Second, the bucket brigade works in a highly parallel way, changing the strength of many (or all) rules at the same time. Third, the bucket brigade acts incrementally, changing the strength of classifiers gradually. By changing the strength of classifiers only a small amount at a time, the classifier system tends to learn gracefully, without the precipitous changes in performance that may result from making a large change in response to a single, possibly anomalous, case.

* This work was supported by National Science Foundation Grant DCR 83-05830.

One key issue for systems using the bucket brigade algorithm is how fast strength flows down long sequences of classifiers. If a whole sequence of classifiers must be activated many times in order to adjust the strength of a classifier at the beginning of the chain in response to a change in payoff associated with the last step in the sequence, the system's response to simple changes in its environment will be too slow. Wilson [Wilson, 1986] used a simple simulation to show that allocation down a sequence of classifiers can take a fairly large number of steps. (He suggested an alternative "hierarchical" bucket brigade algorithm that is designed to speed up the flow of credit down long sequences of classifiers.) Holland [Holland, 1985] mentions this problem and suggests a way to implement "bridging" classifiers that speed up the flow of strength down a long sequence of classifiers.

This paper describes some simple experiments designed to show how well the bucket brigade is able to allocate strength down long sequences of classifiers.
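The two-part strength update described in the introduction can be illustrated with a tiny simulation of a single chain of coupled classifiers. This is a simplified sketch, not the CFS-C bucket brigade: the bid is a fixed fraction of strength, the first classifier's bid is simply lost, and only the last classifier receives the environmental payoff.

```python
def run_chain(strengths, payoff, bid_ratio=0.1, episodes=200):
    """Fire a single sequence of coupled classifiers repeatedly.
    On each episode every classifier pays bid_ratio of its strength
    to the classifier that activated it (its predecessor in the
    chain), and the last classifier also collects the environmental
    payoff. Strength gradually flows back from the end of the chain
    to the stage-setting classifiers at the front."""
    n = len(strengths)
    for _ in range(episodes):
        bids = [bid_ratio * s for s in strengths]
        for i in range(n):
            strengths[i] -= bids[i]          # pay the bid ...
            if i > 0:
                strengths[i - 1] += bids[i]  # ... to the activator
        strengths[-1] += payoff              # environmental reward
    return strengths
```

Starting a chain of five classifiers at strength 0 with payoff 100 and bid_ratio 0.1, the strengths climb toward payoff / bid_ratio = 1000, with classifiers nearer the payoff converging first, which is why long chains are slow to adjust and why bridging classifiers help.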
Section 2 describes the CFS-C/FSW1 classifier system, which is used to carry out all experiments described in this paper. In Section 3, the allocation of strength down a single chain of classifiers will be examined. In Sections 4 and 5, the ability of the bucket brigade to allocate strength so that the system learns to make the proper choice at the beginning of a long sequence of steps is examined. The effects of "bridging" classifiers are also examined. In Section 6 the effect of sharing classifiers in different sequences is examined, without and with bridging classifiers.

2 THE CFS-C/FSW1 SYSTEM

All experiments described in this paper were done using the CFS-C classifier system [Riolo, 1986], set in the FSW1 ("Finite State World 1") task environment [Riolo, 1987]. This section describes the parts of the CFS-C/FSW1 system that are relevant to the experiments described in this paper. For a complete description of those systems, see the documentation cited. Basically, the FSW1 domain is a world that is modeled as a finite Markov process, in which a payoff is associated with some states. The classifier system's input interface provides a message that indicates the current state of the Markov process. The classifier system's output interface provides the system with a way to alter the transition probabilities of the process, so that the system can control (in part) the path taken through the finite state world. When the classifier system visits states with non-zero payoff, that payoff is given to the system as a "reward". Thus the task for the CFS-C classifier system in the FSW1 domain is to learn to emit the appropriate signals at each step so that the Markov process will visit states with higher payoff values as often as possible.
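As a concrete illustration, a tiny FSW1-style world of this kind can be sketched in a few lines of Python. This is an illustrative reconstruction, not the actual CFS-C/FSW1 code; it uses the three-state layout of the simple example world discussed in the next section, and the behavior for unlisted effector values (stay put) is an assumption, since those transitions are left unspecified.

```python
# Sketch of a tiny FSW1-style world: three states W0, W1, W2, with a
# payoff of 100 at W1 and 0 elsewhere.  The effector value r chosen by
# the classifier system selects the transition out of the start state.
PAYOFF = {0: 0, 1: 100, 2: 0}

def step(state, r):
    """Return the next state given the current state and effector value r.
    From W0, r = 1 leads to W1 and r = 2 leads to W2 (deterministically);
    W1 and W2 always return to W0, regardless of r."""
    if state == 0:
        if r == 1:
            return 1
        if r == 2:
            return 2
        return 0          # other r values: stay put (an assumption)
    return 0

# A system that always emits r = 1 collects 100 every two steps:
state, reward = 0, 0
for _ in range(10):
    state = step(state, 1)
    reward += PAYOFF[state]
```

Always emitting r = 2 instead would collect nothing, which is why the system must learn to set r = 1 whenever it is in the start state.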
More formally, the FSW1 task domain is fully defined by specifying:

- A set of n states W_i, i = 0, ..., n-1, each with an associated payoff u(W_i); one state also is designated the start state.

- A set of probability transition matrices P(r), where each entry p_ij(r) in P(r) gives the probability of going to state W_j, given that the system is in state W_i and that the classifier system has emitted r as its output value (r = 0...15).

Figure 1: A simple FSW1 finite state world.

For example, consider the simple three-state world shown in Figure 1. (In this and other diagrams, states are shown as circles, and arrows designate non-zero probability transitions.) W_0 is the start state. The payoff for state W_1 is 100; the payoff for the other states is 0. When the system is in state W_0, if r = 1 the probability of going from state W_0 to W_1 is 1.0; if r = 2 the probability of going from state W_0 to W_2 is 1.0. For other values of r, the probability of going from state W_0 to state W_1 or W_2 is 0. The probability of going from either W_1 or W_2 to W_0 is 1.0, no matter what the value of r. Thus if the classifier system is to maximize its payoff in this world, it must learn to set r = 1 whenever it is in state W_0.

The CFS-C classifier system is a standard, "Holland" type learning classifier system that consists of four basic parts:

- A message list, which acts as a "blackboard" for communications and short term memory. In the CFS-C classifier system, the message list has a small, maximum size.

- A classifier list, which consists of condition-action rules called classifiers. Each classifier in the CFS-C system is a two-condition classifier of the form:

    C_1, C_2 / Action

A classifier's condition part is satisfied when each of the conditions C_1 and C_2 is matched by one or more messages on the message list. The second condition may be prefixed by a "~", in which case that condition is satisfied only when no message matches the condition C_2.
A satisfied classifier produces one message for each message that matches its first condition, C_1, using the usual "pass through" procedure. Each classifier also has an associated strength, which is related to its usefulness in attaining rewards for the system, and a specificity (sometimes called its bid ratio), which is a measure of the generality of the classifier's conditions.

- An input interface, which provides the classifier system with information about its environment. In the FSW1 domain, the input interface provides one detector message which indicates the current state of the Markov process.

- An output interface, which provides a way for the classifier system to communicate with or change its environment. In the FSW1 domain, the output interface maps messages that start with a "10" (sometimes called effector messages) into an effector setting, r, r = 0...15, which determines the transition probability matrix P(r) used to select the next state of the Markov process.

As in other "Holland" classifier systems, messages are all strings of fixed length L, built from the alphabet {0,1}. Each condition C_i and the action part of a classifier is also a string of length L, built from the alphabet {0,1,#}. The # acts as a "wildcard" symbol in the condition strings, and it acts as the "pass-through" symbol in the action part of a classifier.

The CFS-C/FSW1 system is run by repeatedly executing the following steps of the classifier system's "major cycle":

1. Add messages generated by the input interface to the message list. In the FSW1 domain one message, which indicates the current state W(t) of the world, is added to the message list.

2. Compare all messages to all conditions of all classifiers and record all matches for classifiers that have their condition parts satisfied.

3. Generate new messages by activating satisfied classifiers.
If activating all the satisfied classifiers would produce more messages than will fit on the message list, a competition is run to determine which classifiers are to be activated. Classifiers are chosen probabilistically, without replacement, until the message list is full. The probability that a given classifier is activated is proportional to its bid.

4. Process the new messages through the output interface, resolving conflicts and selecting one effector setting, r, to be used for the current time step. Once r is set, the associated transition matrix P(r) and the current state W(t) are used to select the world state W(t+1) to which the system moves.

5. Apply the bucket brigade algorithm, to redistribute strength from the environment to the system and from classifiers to other classifiers.

6. Apply discovery algorithms, to create new classifiers and remove classifiers that have not been useful.

7. Replace the contents of the message list with the new messages, and return to step 1.

In the CFS-C/FSW1 system, the bid of classifier i at step t, B_i(t), is calculated as follows*:

    B_i(t) = k * S_i(t) * BidRatio_i

k is a small constant (usually about 0.1), which acts as a "risk factor", i.e., it determines what proportion of a classifier's strength it will bid and so perhaps lose on a single step. S_i(t) is the classifier's strength at step t. BidRatio_i is a number between 0 and 1 that is a measure of the classifier's specificity, i.e., how many different messages it can match. A BidRatio of 1 means the classifier matches exactly one message, while a BidRatio of 0 means the classifier matches all messages.
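The bid calculation above is simple enough to state directly in code. This is a sketch of the formula only; the additional CFS-C parameters mentioned in the footnote are left out, as they are disabled in the experiments anyway.

```python
def bid(strength, bid_ratio, k=0.1):
    """B_i(t) = k * S_i(t) * BidRatio_i: the fraction of its strength a
    classifier risks on one step, scaled by its specificity."""
    return k * strength * bid_ratio

# A fully specific classifier (BidRatio = 1) with strength 1000 bids 100;
# a maximally general one (BidRatio = 0) bids nothing.
high = bid(1000.0, 1.0)
low = bid(1000.0, 0.0)
```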
When a competition is run to determine which classifiers are to be activated, the probability that a given (satisfied) classifier i will win is:

    Prob(i wins) = EB_i(t) / SUM_j EB_j(t)

The effective bid, EB_i(t), of a classifier i at step t, is:

    EB_i(t) = B_i(t)^BidPow

BidPow is a parameter that can be set to alter the shape of the probability distribution used to choose classifiers to produce messages. For example, if BidPow = 1 then EB_i(t) = B_i(t), i.e., a classifier's probability of producing messages is just its bid divided by the sum of bids made by all satisfied classifiers. Setting BidPow to 2, 3, and so on, makes it more likely that classifiers with the highest bids will win the competition. The effects of varying BidPow are considered further in Section 4 of this paper.

Note that the output interface of the CFS-C/FSW1 system may have to resolve conflicts, e.g., when one classifier produces a message that says "set the effector value r to 1" and another produces a message that says "set the effector value r to 2". Since the effector can only be set to one value at a time (just as we can either lift our arm or lower it, and not both), an effector conflict resolution mechanism must be used. Basically, when there are conflicts the value of r is chosen probabilistically, with the probability that r = r' equal to:

    SUM_m' EB_m'(t) / SUM_m EB_m(t)

where m' ranges over the effector messages that say "set r to r'", EB_m'(t) is the effective bid of the classifier that posted message m', m ranges over all effector messages, and EB_m(t) is the effective bid of the classifier that posted message m. The winning r value is used to select a transition matrix P(r), which in turn is used to determine to which state the system will move.

* Actually the CFS-C/FSW1 bid calculation involves other parameters not shown here, but for the experiments described in this paper, those parameters have been disabled.
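The competition for message-list slots can be sketched as follows. This is an illustrative reimplementation under the assumptions stated above (effective bid equals the bid raised to BidPow, and winners are drawn without replacement with probability proportional to effective bid), not the CFS-C source.

```python
import random

def run_competition(bids, capacity, bid_pow=1):
    """Choose up to `capacity` winners from the satisfied classifiers,
    each draw proportional to bid**BidPow, without replacement."""
    pool = list(range(len(bids)))
    winners = []
    while pool and len(winners) < capacity:
        eff = [bids[i] ** bid_pow for i in pool]  # effective bids
        total = sum(eff)
        x = random.uniform(0.0, total)
        for idx, i in enumerate(pool):            # roulette-wheel draw
            x -= eff[idx]
            if x <= 0.0:
                winners.append(pool.pop(idx))
                break
    return winners
```

Raising BidPow sharpens the distribution: with bids of 40 and 10 competing for a single slot, the larger bidder wins 80% of the time at BidPow = 1, but about 98.5% of the time at BidPow = 3 (since 40^3 / (40^3 + 10^3) = 64/65).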
Once an effector value is chosen, all messages that are inconsistent with that setting are deleted from the new message list.

The basis for the reallocation of strength done by the bucket brigade algorithm is the payoffs received by the classifier system from the FSW1 environment. In the CFS-C/FSW1 system, when the Markov process enters a state W_j, all classifiers that posted messages which are on the new message list (after any effector conflicts are resolved) have the full payoff u(W_j) added to their strength. Thus when the activation of a classifier tends to be directly associated with a high reward from the environment, that classifier's strength is on average increased.

The bucket brigade algorithm also redistributes strength from classifiers to other classifiers. In particular, when a classifier posts messages, it pays the amount it bid to the classifiers that made it possible for that classifier to become active. Let BidShare equal the classifier's bid, B, divided by m, the number of messages that matched its conditions. Then the strength of the active classifier is decreased by BidShare * m and BidShare is added to the strength of each classifier that produced a message that matched the activated classifier's conditions. (If a classifier is matched by one or more "detector" messages, i.e., messages from the system's input interface, the classifier's strength is still decremented by BidShare for each detector message used, but that amount is not added to the strength of any other classifier. Thus just as the environment is the ultimate source of strength, in the form of payoffs, it is also the ultimate sink for strength, when detector messages are used.)
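The payment rule just described might be sketched like this. This is a simplified illustration of the BidShare bookkeeping, not the CFS-C code; `sources` marks which classifier produced each matching message, with None standing for a detector message, whose share drains back to the environment.

```python
def pay_bid(bid_amount, sources, strength):
    """Split an activated classifier's bid equally among the producers of
    the m messages that matched its conditions (BidShare = B / m).
    Shares owed to detector messages (source None) are not credited to
    any classifier: the environment acts as a strength sink."""
    share = bid_amount / len(sources)          # BidShare
    for src in sources:
        if src is not None:
            strength[src] += share
    return share * len(sources)                # total deducted: the full bid

strength = {1: 1000.0, 2: 1000.0}
# Suppose a classifier bid 100 and was matched by one message from
# classifier 1 and one detector message: classifier 1 gets 50, and the
# other 50 returns to the environment.
deducted = pay_bid(100.0, [1, None], strength)
```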
In summary, the strength S_i(t+1) of a classifier i at step t+1 is:

    S_i(t+1) = S_i(t) + R_i(t) + P_i(t) - B_i(t)

where S_i(t) is the strength of classifier i at step t, R_i(t) is the reward from the environment during step t, P_i(t) is the sum of all payments to classifier i from classifiers that matched messages produced by i during the previous step, and B_i(t) is the classifier's bid during step t. Clearly a classifier's strength reaches a fixed point when the amount of strength it receives is equal to the amount it pays. Thus in the long run a classifier's fixed-point strength, S_fp, approaches:

    S_fp = (R + P) / (k * BidRatio)

where R and P are the average amounts the classifier receives per activation as rewards from the environment and payments from other classifiers, respectively.

Since the focus of this paper is on the allocation of strength by the bucket brigade algorithm among existing classifiers, rather than the creation of new classifiers, the CFS-C/FSW1 system's rule discovery algorithms are not used in the experiments described in this paper. Instead, all classifiers are added to the initial classifier list. Those classifiers remain unchanged, except for their strengths, during the course of each experiment.

3 SIMPLE SEQUENCES OF CLASSIFIERS

To get a feel for how the bucket brigade algorithm works in the FSW1 domain, consider the finite state world shown in Figure 2. There are 13 states in this world, W_i, i = 0, ..., 12. State W_12 has an associated payoff of 100, while the payoff for all other states is 0. The start state is W_0. The classifier system must set r = 1 to move from state W_i to W_i+1; i.e., p_ij(1) = 1.0 for j = i+1, i = 0...11.

Figure 2: A single path of states leading to a reward at state W_12.

When the system reaches state W_12, it will go to state W_0 by default; i.e., p_ij(0) = 1.0 for i = 12, j = 0. (In this and subsequent descriptions of finite state worlds, all transitions not mentioned have probability 0.)
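The fixed-point formula can be checked numerically by iterating the strength update with constant average income. This is an illustrative check only, assuming R and P per activation are constant.

```python
def fixed_point_strength(R, P, k=0.1, bid_ratio=1.0):
    """S_fp = (R + P) / (k * BidRatio): income per activation balances
    the bid paid per activation."""
    return (R + P) / (k * bid_ratio)

# Iterate S <- S + R + P - k*BidRatio*S from an arbitrary start; it
# converges to the predicted fixed point (1000 for R=100, P=0, k=0.1).
S = 50.0
for _ in range(500):
    S += 100.0 + 0.0 - 0.1 * 1.0 * S
```

Each iteration closes a constant fraction (here 10%) of the gap to the fixed point, which is also why strength changes are gradual rather than precipitous.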
Thus with a perfect set of classifiers, the system could achieve a reward of 100 every 13 cycles. The CFS-C/FSW1 system was run in this world with twelve classifiers, each with a starting strength of 50. Classifier 1 was of the form:

    d_0, d_0 / e_1, r = 1

where each condition "d_0" matches the detector message for state W_0, and the action "e_1, r = 1" posts an effector message e_1 that sets the effector value r to 1. (In order to make the classifiers more readily understandable, they will be shown in an "interpreted" form rather than in terms of strings built from the {0,1,#} alphabet.) Classifiers 2 through 12 are of the form:

    e_i-1, e_i-1 / e_i, r = 1

where the condition "e_i-1" matches the effector message produced by classifier i-1, i = 2, ..., 12, and the action part of each classifier i produces an effector message e_i which sets r to 1.

In short, when the Markov process enters state W_0, classifier 1 is activated, which posts an effector message that moves the system to state W_1. Since classifier 1 is activated by detector messages, it pays its bid to the system rather than to some other classifier. At the next time step, classifier 2 is activated by the message produced by classifier 1, so classifier 2 pays its entire bid to classifier 1. Classifier 2 also posts an effector message that moves the system to state W_2. Thus the process continues, each classifier being activated by and paying its bid to the classifier that was active on the prior step. Finally the Markov process reaches state W_12, in which case the classifier active during that step, classifier 12, receives the payoff associated with that state (100). The system then returns to state W_0, and the cycle starts again.

Figure 3 shows the results of running the 12 classifiers described above for a period of 3000 cycles (about 230 passes through the sequence), using k = 0.1 and BidPow = 1. The strengths of classifiers 1, 4, 7, 10, 11, and 12 are shown plotted against the number of time steps executed.
Figure 3 clearly shows the wave of strength flowing from classifier 12, which receives the reward directly from the environment and reaches its fixed-point strength first, to classifier 1, which is farthest in the chain from the environmental reward and so reaches its fixed-point strength last. (The blips are artifacts of when strength is recorded: sometimes a classifier's strength is displayed at the end of a step in which it posted a message, so that it has just had its strength reduced by its bid but it has not yet been paid by the next classifier in the chain.)

Figure 3: The flow of strength down a simple sequence of coupled classifiers. Classifier 12 leads directly to a reward, and classifier 1 is at the start of the sequence.

Figure 4: The number of steps required for a classifier in a chain of coupled classifiers to reach 90% of its fixed-point strength, plotted against the number of steps to the end of the chain, for k = 0.05, 0.1, 0.2, and 0.3.

As expected, the fixed-point strength, S_fp, of all the classifiers in this experiment is the same (1000), since each classifier pays its full bid to its one predecessor (or to the system for the detector message, in the case of classifier 1). The number of cycles it takes for a classifier n steps from the environmental reward to reach 90% of its S_fp fits the following equation:

    t = 286 + 155n

where n = 0 for classifier 12, n = 1 for classifier 11, and so on.
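The experiment in Figure 3 is easy to approximate with a back-of-the-envelope simulation. This is a simplified rerun, not the CFS-C system itself: activation is deterministic, one classifier is active per step, each pays its full bid to its predecessor, and classifier 1's bid goes to the system.

```python
k, reward = 0.1, 100.0
S = [50.0] * 12                    # strengths of classifiers 1..12

def run_pass(S):
    """One pass W0 -> W12: each classifier pays its bid back up the
    chain, and classifier 12 collects the payoff on entering W12."""
    for i in range(12):            # classifier i+1 becomes active
        b = k * S[i]
        S[i] -= b
        if i > 0:
            S[i - 1] += b          # paid to its predecessor
        # (i == 0: classifier 1 pays the system for its detector message)
    S[-1] += reward                # classifier 12 is active at the payoff

passes_to_90 = 0
while S[-1] < 900.0:               # 90% of the fixed point of 1000
    run_pass(S)
    passes_to_90 += 1

for _ in range(3000):
    run_pass(S)
```

In this simplified version classifier 12 crosses 90% of its fixed point after 22 passes, consistent with the intercept of R = 22 + 11.9n below, and after many passes all twelve strengths settle near the common fixed point of 1000.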
In terms of the number of passes through the sequence of states (i.e., the number of rewards received), this works out to be:

    R = 22 + 11.9n

This number is in good agreement with the value arrived at in [Wilson, 1986], using a simple simulation of the bucket brigade.

One way to speed the flow of strength back through a sequence of classifiers is to increase k, the bid constant that specifies what proportion of a classifier's strength is to be risked on any one bid. Figure 4 shows a comparison of the results obtained when the 12 classifiers described earlier were run using values of k = 0.05, 0.1, 0.2, and 0.3. The horizontal axis shows the number of steps a classifier is from the environmental reward. The vertical axis shows the number of cycles it takes for a classifier to reach 90% of its fixed-point strength. Higher values of k result in faster flow of strength down the sequence of classifiers. For example, for a classifier 10 steps from the reward, the number of passes through the sequence is 59 for k = 0.3, compared to 141 for k = 0.1.

4 CHOOSING BETWEEN SIMPLE SEQUENCES OF CLASSIFIERS

Clearly one important characteristic of the bucket brigade algorithm is the number of cycles it takes for strength to flow down a chain of coupled classifiers. Another important measure of the bucket brigade algorithm's performance is the ability of the classifier system to respond to changes in the payoffs associated with states. For example, consider the CFS-C/FSW1 world shown in Figure 5. There are 19 states in this world. The start state is W_0. When the system is in W_0, if the classifiers set the effector value r to 1, then the system goes to state W_1 with probability 1; if the classifiers set r to 2, then the system goes to W_10. Once the top or bottom path is chosen, the Markov process can be moved through the intervening states to W_9 or W_18 by continuing to set r to 1 or 2, respectively. When the process reaches state W_9 or W_18, it always returns to state W_0.
The CFS-C/FSW1 system was run in the world described above using the following 18 classifiers:

    d_0, d_0 / e_1, r = 1                 (1)
    d_0, d_0 / e_11, r = 2                (11)
    e_i-1, e_i-1 / e_i, r = 1    (i = 2...9)
    e_i-1, e_i-1 / e_i, r = 2    (i = 12...19)

(The numbers in parentheses on the right serve to identify the classifiers.) Each classifier has BidRatio = 1. Basically, the classifiers 1 and 11 compete to become active when the system is in state W_0. Those classifiers try to have the system take the top or bottom path, respectively, by (a) setting the effector value to 1 or 2, and (b) producing a message that sets the stage for the rest of the classifiers in its associated sequence to fire, one after another. (Note that 1 and 11 can't post messages at the same time, since they try to set r to different values.) For example, if classifier 1 wins the competition, its message sets r to 1, so that the system moves to state W_1. In the next step, classifier 2 is matched by the message produced by classifier 1, so classifier 2 pays its bid to classifier 1, and the system is moved to state W_2. This process continues until the system reaches state W_9 and classifier 9 receives the payoff associated with that state.

Figure 5: Two competing paths of states leading to rewards at states W_9 and W_18.

Suppose states W_9 and W_18 both have a payoff of 100, and all other states have a payoff of 0. In this case it does not matter what path the system takes: the maximum payoff rate it can achieve is 100 per 10 cycles. Note that the fixed-point strength of all classifiers in both sequences is 1000 (assuming k = 0.1 and each classifier has a BidRatio = 1). In particular, classifiers 1 and 11 will have the same fixed-point strengths, so each will have a 0.50 probability of winning the competition, and so the system will go down each chain 50% of the time.

Suppose the payoff associated with state W_9 is changed to 400.
In this case the optimal payoff, 400 per 10 cycles, can be achieved by always taking the top path. Thus to achieve optimal performance, the bucket brigade must reallocate strength so that classifier 1, the classifier that causes the system to go down the top path, has a higher fixed-point strength than classifier 11, the one that causes the system to go down the bottom path. The faster the system can reallocate strength, the faster the system can respond to the change in its environment.

Figure 6 shows the results of running the above classifiers in the world described above, with u(W_9) = 400 and u(W_18) = 100. (The results in this and the rest of the experiments described in this paper are the average of 10 runs, each started with a different seed for the system's pseudo-random number generator.) All classifiers had an initial strength of 1000, i.e., the system was started as if it had been run with u(W_9) = u(W_18) = 100 until the classifiers all reached their fixed-point strengths. BidPow was set to 1, i.e., a classifier's effective bid equals its bid. k was set to 0.1 in this and all the rest of the experiments described in this paper. Given that BidRatio = 1 for all 18 classifiers, the expected fixed-point strengths for classifiers in the top and bottom chains are 4000 and 1000, respectively. The maximum payoff rate (per 200 cycles) is 8000, and the payoff expected if the choice of path is made at random is 5000.

Figure 6 shows the marginal payoff the system received plotted against the cycle steps executed. The strength of classifiers 1 and 11, the classifiers that compete to choose the path the system takes, is also shown.

Figure 6: Marginal performance and the strength of classifiers 1 and 11 when the system is run with two competing sequences in the world shown in Figure 5. The rewards at the end of the sequences chosen by classifiers 1 and 11 were 400 and 100, respectively.

Note that as the strength of classifier 1 increases toward its fixed-point value, marginal performance also increases. Let the fixed-point payoff, P_fp, be defined as the average marginal payoff in the last one quarter of a run. Then the average P_fp for the runs shown in Figure 6 was 6870, which is 85.9% of the optimal payoff rate. The system's performance is less than optimal because the competition is stochastic: the higher strength classifier has a higher probability of winning but it doesn't always win. Also note that it takes about 1700 steps (170 trials) for the system to reach 90% of the P_fp. This is a little more than might be expected given the results described in the previous section; the reason for the longer observed time is that the system often traverses the lower path, in which case increased strength is not flowing down the top chain.

One way to increase the fixed-point payoff rate is to bias the effective bid in favor of high strength classifiers by setting BidPow > 1. The following table compares the results obtained for BidPow = 1, 2, and 3:

    BidPow    P_fp (% Max)
    1         85.9
    2         96.7
    3         98.3

Figure 7: Marginal performance when the system was run with competing sequences of classifiers as in Figure 6, using three different values of BidPow.

As expected, increasing BidPow increases the payoff rate. Figure 7 shows the marginal performance obtained in these experiments plotted against the number of major-cycle steps executed by the system. Not only are the fixed-point performance levels increased by increasing BidPow, but the number of steps it takes the system to respond to the change in payoff is decreased somewhat.
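The measured fixed-point performance levels are roughly what one would predict from the fixed-point strengths alone. The following is an illustrative calculation, not part of the reported experiments: it assumes the path choice is made with probability proportional to strength raised to BidPow, and it ignores all transient dynamics.

```python
def expected_payoff_fraction(s_top, s_bot, r_top, r_bot, bid_pow):
    """Expected payoff as a fraction of the optimum when two classifiers
    with fixed-point strengths s_top and s_bot compete for the first
    move, winning with probability proportional to strength**BidPow."""
    p_top = s_top ** bid_pow / (s_top ** bid_pow + s_bot ** bid_pow)
    return (p_top * r_top + (1.0 - p_top) * r_bot) / max(r_top, r_bot)

# With strengths 4000 vs. 1000 and path rewards 400 vs. 100 this
# predicts about 85.0%, 95.6%, and 98.8% for BidPow = 1, 2, 3, in the
# same ballpark as the measured 85.9%, 96.7%, and 98.3%.
fractions = [expected_payoff_fraction(4000, 1000, 400, 100, bp) for bp in (1, 2, 3)]
```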
In the rest of the experiments described in this paper, BidPow is set to 3.

Even with BidPow = 3, it takes the classifier system a large number of passes down the full sequence to learn to take the path to the higher reward: as Figure 7 shows, it takes about 100-120 passes (1000-1200 cycles) to begin to respond to the change in the environment, and about 160-170 passes (1600-1700 cycles) to reach 90% of the fixed-point performance rate.

One way suggested by Holland [Holland, 1985] to speed up reallocation of strength down a long sequence of classifiers is to introduce a "bridging" classifier. Basically, a bridging classifier (sometimes called an "epoch marking" or a "support" classifier) is one that is activated by a message produced by the first classifier in a sequence, and which remains active until the payoff state at the end of the sequence is reached. Since all classifiers that are active when a payoff is achieved have that payoff added to their strengths, the bridging classifier has its strength increased the first time the sequence is executed. The next time the sequence is executed, when the bridging classifier again is activated by the message produced by the first classifier in the chain, its payment to that first classifier reflects the payoff it received on the first pass down the sequence. In this way, the change in payoff at the end of a long sequence of classifiers is passed almost immediately to the classifier at the beginning of the sequence.

To test the effectiveness of bridging classifiers, the CFS-C/FSW1 system was run using the same finite state world shown in Figure 5, using the same 18 classifiers described above plus two more bridging classifiers (one for the top sequence and one for the bottom one). In particular, the bridging classifiers were:

    (e_1 | m_10), ~d_0 / m_10     (10)
    (e_11 | m_20), ~d_0 / m_20    (20)

Both of the classifiers have BidRatio = 0.5.
Classifier 10, the bridge for the top sequence, says "If the message from classifier 1 or from classifier 10 is in the message list, and the detector message for state W_0 is not in the list, then post message m_10". Thus if classifier 1 posts a message on step t, classifier 10 will be activated on the next step, and it will use the message it produces to activate itself until the system returns to state W_0 at the end of the top path. Classifier 20 acts similarly for the bottom chain.

The starting strength for all classifiers was set to the fixed-point strength expected when the payoffs for states W_9 and W_18 are both 100. The payoff for state W_9 was then set to 400, and the system was run for 4000 major-cycle steps. Figure 8 compares the results obtained when the system was run with and without the bridging classifiers. Both marginal performance (per 200 steps) and the strength of classifier 1 are plotted versus cycle step. (The strength of classifier 11 doesn't change in these experiments; what changes is the ratio of the strength of classifier 1 to 11.) Note that with the bridging classifier, the strength of classifier 1 begins to rise almost immediately, within the first 200 cycles, as does the marginal performance. On the other hand, without the bridging classifier, the strength of classifier 1 and the marginal performance don't begin to rise until about cycle step 1100. Similarly, with the bridging classifier marginal performance reaches 90% of its fixed-point level (98.4% of the maximum, StdDev = 1.1%) in 600-700 steps, whereas without the bridging classifier it takes 1600-1700 steps.

There are many other ways to implement "bridging" classifiers. For example, the following classifiers also serve as bridges for the sequences 1-9 and 11-19:

    e_i, e_i / m_21    (i = 1, ..., 9)      (21)
    e_i, e_i / m_22    (i = 11, ..., 19)    (22)

Classifier 21 says "If the message list contains a message posted by any of the classifiers in the top sequence, then post a message".
This classifier will be activated by classifier 1 and remain active until the system receives a payoff in state W_9. The difference between this type of epoch marker and the one described earlier is that this one is activated by every classifier in a sequence, rather than just the first one in the sequence. Thus the bridging classifier passes strength to all the classifiers in the chain rather than to just the first one. (Classifier 22 acts similarly for the bottom sequence.)

Figure 8: Marginal performance and the strength of classifier 1 observed when the system is run in the world shown in Figure 5, with and without "bridging" classifiers.

To test the effectiveness of the second kind of bridging classifier, the system was run in the world described in Figure 5 with the 18 classifiers described earlier and the bridging classifiers 21 and 22. The results were similar to the results obtained using the bridging classifiers 10 and 20: the strength of classifier 1 began to rise immediately, in the first 200 steps, as did the marginal performance. The system again reached 90% of its fixed-point performance rate in about 700 cycle steps.

The main difference between using the second type of bridging classifier and the first is the fixed-point strengths of the classifiers in the top sequence. With the first type of bridging classifier, the fixed-point strength of classifier 1 was changed from 2000 (for u(W_9) = 100) to 8000 (for u(W_9) = 400); the fixed-point strengths of the other classifiers in the top sequence changed from 1000 to 4000.
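The fixed-point values quoted for the first type of bridge can be reproduced with a small simulation of the top chain plus its bridge. This is a simplified model under several assumptions: the competition with the bottom path is ignored, the bridge pays its bid to classifier 1 once per pass and self-sustains for free on the other steps, and both classifier 9 and the bridge receive the full payoff.

```python
k, payoff = 0.1, 400.0
S = [2000.0] + [1000.0] * 8        # classifiers 1..9 at their old fixed points
SB = 2000.0                        # bridging classifier 10 (BidRatio = 0.5)

for _ in range(500):               # passes down the top chain
    S[0] -= k * S[0]               # classifier 1 pays the system
    b = k * 0.5 * SB               # the bridge's bid (BidRatio = 0.5) ...
    SB -= b
    S[0] += b                      # ... goes to classifier 1
    for i in range(1, 9):          # classifiers 2..9 pay their predecessors
        bi = k * S[i]
        S[i] -= bi
        S[i - 1] += bi
    S[8] += payoff                 # payoff at W9 to classifier 9 ...
    SB += payoff                   # ... and to the still-active bridge
```

In this model the strengths settle at 8000 for classifier 1 and the bridge and 4000 for classifiers 2 through 9, matching the fixed points quoted above, and classifier 1's strength starts climbing within the first few passes, which is the point of the bridge.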
With the second type of bridging classifier, the fixed-point strength of classifier 1 was changed from 2000 (for u(W_9) = 100) to 8000 (for u(W_9) = 400); the fixed-point strengths of the other classifiers in the top sequence changed from 1000 to values ranging from 7406 (for classifier 2) to 4000 (for classifier 9). The reason classifiers 2 through 9 have different fixed-point strengths is that the second type of bridging classifier pays some of its strength to each classifier in the sequence, but it pays more to the classifiers at the beginning of the sequence, since that is when the bridging classifier's strength is the highest (just after receiving a payoff).

5 SEQUENCES WITH MULTIPLE MESSAGE SOURCES

In the experiments described in the previous section, each classifier (except the first one) in a sequence was activated solely by the message produced by the classifier that preceded it in the sequence. Those sequences implemented something akin to a reflex: once the first classifier in the sequence is activated by some signal from the environment, the rest of the classifiers in the chain are activated, one after another, until the last one fires, no matter what effect their actions are having. Since each classifier in the sequence used only messages from its predecessor, each paid its full bid to that predecessor, so that all classifiers in a sequence had the same fixed-point strength (ignoring the effects of bridging classifiers).

Another type of classifier sequence is one in which one or more classifiers after the first one in a sequence have conditions that match messages produced by sources other than their predecessors in the sequence. For example, some of the classifiers could have one condition that matches a message produced by its predecessor and a second condition that matches a detector message produced by the system's input interface.
Sequences of this type are non-reflex sequences: the system can monitor the effects of executing each step, so that if the sequence isn't producing the expected results (as indicated by messages on the message list), the sequence can be stopped or alternative steps can be executed. To test the effectiveness of the bucket brigade algorithm in allocating strength down non-reflex chains, the CFS-C/FSW1 system again was run in the finite state world shown in Figure 5. In these experiments the following nine classifiers were used for the top path:

    d0, d0 / c1, r = 1    (1)
    c3, d3 / c4, r = 1    (4)
    c6, d6 / c7, r = 1    (7)
    c(i-1), c(i-1) / ci, r = 1    (i = 2, 3, 5, 6, 8, 9)

Classifier 1 is matched by state W0. Once it is activated, classifier 2 is activated by the message produced by classifier 1, and then classifier 3 is activated by classifier 2's message. Classifier 4 is activated only if classifier 3 posted a message on the previous step and if the system is in state W3. Classifiers 5 and 6 then fire reflexively, and classifier 7 fires only if the system is in state W6. Classifiers 8 and 9 then fire reflexively. A similar set of classifiers (11 to 19) was included for traversing the bottom path. Classifiers 1 and 11 compete when the Markov process is in state W0 to guide the system down the top or bottom path, respectively.

Figure 9: Marginal performance and the strength of classifiers 1, 4, and 7 when the system is run with two competing non-reflex sequences of classifiers, in which classifiers 4 and 7 each use one detector message and one message from their predecessors in the sequence.

Figure 9 shows the results of running this set of classifiers in the world shown in Figure 5, with the payoff for state W19 = 100 and the payoff for state W9 changed from 100 to 400 at step 0.
The marginal performance (per 200 steps) is plotted against the major cycle step executed. The strengths of classifiers 7, 4, and 1 are also plotted. As expected, the strength of classifier 7 begins to rise first, since it is closest in the sequence to the reward state, followed by the strength of classifier 4 and then 1. When the strength of classifier 1 begins to rise, the marginal performance begins to rise, since classifier 1 begins to win the competition with classifier 11 more often, thus leading the system down the top path to the higher payoff. Note, however, that the fixed-point strength of classifier 4 is 1/2 that of 7, and the strength of classifier 1 is in turn 1/2 that of 4. The reason for this drop is that classifiers 4 and 7 pay only 1/2 of their bid to their predecessor classifiers; the other half of their bids is paid to the system for the detector messages that match those classifiers. In general, then, any sequence that has n classifiers that use messages not from their predecessors in the chain will have an exponential (in n) fall-off in strength down the sequence. In this experiment this exponential fall-off didn't hinder the ability of the system to respond to the change in payoffs at the end of the sequences, since both competing sequences contain the same number of classifiers using messages from multiple sources. In other classifier structures, where the competing sequences have different numbers of classifiers using messages from multiple sources, the results would be different.

Figure 10: Marginal performance and the strength of classifier 1 when the system is run with two competing non-reflex sequences of classifiers, with and without "bridging" classifiers.
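The factor-of-two fall-off can be read directly off the fixed-point balance: a classifier whose successor matches one predecessor message and one detector message receives only half that successor's bid, so each such classifier halves the equilibrium strength of everything upstream of it. A sketch of that backward calculation follows (a simplified model; the layout matches the nine-classifier top sequence above, in which classifiers 4 and 7 use detector messages):

```python
k = 0.1      # bid constant
R = 400.0    # payoff at the end of the sequence (after the change at step 0)

def fixed_point_strengths(uses_detector):
    """Walk backward from the payoff.  At the fixed point a classifier's
    bid (k * S) equals its income: the successor's full bid if the
    successor uses only its message, or half the successor's bid if the
    successor also matches a detector message (the other half is paid to
    the system for the detector message).
    uses_detector[i] is True if classifier i matches a detector message
    in addition to its predecessor's message (illustrative encoding)."""
    n = len(uses_detector)
    S = [0.0] * n
    S[-1] = R / k                        # last classifier: income is the payoff
    for i in range(n - 2, -1, -1):
        share = 0.5 if uses_detector[i + 1] else 1.0
        S[i] = share * S[i + 1]          # income = share of successor's bid
    return S

# Classifiers 4 and 7 (indices 3 and 6) use a detector message.
uses = [False] * 9
uses[3] = uses[6] = True
S = fixed_point_strengths(uses)
# S[6] (classifier 7) is twice S[3] (classifier 4), which is twice S[0].
```

The computed strengths reproduce the halving pattern reported above: 4000 for classifiers 7 through 9, 2000 for classifiers 4 through 6, and 1000 for classifiers 1 through 3, i.e., a (1/2)^n fall-off for n detector-using classifiers.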
Figure 9 also shows that the response time of the non-reflex sequence was about the same as was observed with the reflex-like sequence (described in the previous section). To see if bridging classifiers could speed up the learning rate in non-reflex sequences, the classifiers described above were run with the two bridging classifiers, 10 and 20, described in the previous section. Figure 10 shows the results obtained for the non-reflex sequence when run with and without bridging classifiers. As with the reflex-like sequences, the rate of learning with the non-reflex sequence is greatly increased by the use of a bridging classifier like classifier 10. With the bridging classifier, the strength of classifier 1 and the system's performance began to increase immediately, in the first 200 cycles, and the system reached 90% of its fixed-point performance rate in about 700 cycles. Also note that the fixed-point strength of classifier 1 is now 5000: the strength passed through the bridging classifier is able to overcome the exponential fall-off of strength observed in non-reflex chains without bridging classifiers.

6 SEQUENCES WITH SHARED PARTS

Because classifier systems exchange information through a fully open "blackboard", the message list, any classifier can be used in any context, as long as the message list contains messages that satisfy the classifier's conditions. One advantage of this architecture is that classifiers created for use in one domain can be used in other domains that are similar. For example, a group of classifiers (e.g., to control a robot's "hand") that were discovered while the system was learning to do one task could also be used to solve some other task, greatly reducing the time required to learn the second task. This "knowledge sharing" not only makes it possible to learn faster, it also leads to a more economical use of classifiers, since each situation will not require its own unique classifiers.
Figure 11: Paths leading to rewards at states W4 and W8.

In order to explore the effect of shared classifiers on the allocation of strength by the bucket brigade algorithm, the CFS-C/FSW1 system was run in the simple finite state world shown in Figure 11. There are 9 states in this world. The start state is W0. When the system is in W0, if the classifiers set the effector value r to 1, then the system goes to state W1 with probability 1; if the classifiers set r to 2, then the system goes to W5. Once the top path is chosen, the Markov process can be moved through the intervening states to W4 by continuing to set r to 1. If the bottom path is chosen, in order to move to W8 the system must first set r to 1 to get to W6 and then set r to 2 for the rest of the bottom path. When the process reaches state W4 or W8, it always returns to state W0. State W4 has a payoff of 100; all other states have 0 payoff. The CFS-C/FSW1 system was run in the world shown in Figure 11 with the following 8 classifiers:

    d0, d0 / c1, r = 1    (1)
    c(i-1), c(i-1) / ci, r = 1    (i = 2, 3, 4)
    d0, d0 / c5, r = 2    (5)
    c5, c5 / c6, r = 1    (6)
    c(i-1), c(i-1) / ci, r = 2    (i = 7, 8)

Classifiers 1 and 5 compete to select the path to be followed, i.e., 1 selects the top path and 5 selects the bottom. Once a path is chosen, the rest of the classifiers in each sequence (2-4 or 6-8) are executed in order. Note that there are no classifiers shared between the two sequences in this case. The system was also run with classifiers similar to those shown above, but with classifiers 2 and 6 replaced by the following single classifier:

    (c1|c5), (c1|c5) / c9, r = 1    (9)

Classifier 9 is shared by the two sequences: it matches messages produced by either classifier 1 or classifier 5, and sets the effector value r to 1. (Classifiers 3 and 7 were modified so they both respond to the message produced by classifier 9; pass-through symbols were used to ensure that classifiers 3 and 7 fire only when classifiers 1 and 5, respectively, fired two steps before.)
The set of classifiers with the shared classifier, 9, was also run with the following bridging classifiers:

    (c1 | m10), ~d4 / m10    (10)
    (c5 | m11), ~d8 / m11    (11)

Classifier 10 serves to bridge the top sequence, and classifier 11 serves to bridge the bottom sequence. All classifiers were started with a strength of 1000. The following table shows the results obtained for the three sets of classifiers:

                           P_fp    S_1    S_5
    No Sharing             3880   2048    651
    Shared Classifier      1990   1075   1065
    Shared, with Bridge    3913   6035   1677

When no classifiers are shared, the system basically achieves the optimal fixed-point performance (4000 per 200 steps). However, when a classifier is shared, the system's performance is about what it could achieve by choosing the path at random (2000). The performance with shared classifiers is consistent with the fixed-point strengths of classifiers 1 and 5, which are about the same. When those classifiers compete to select a path, they each win 50% of the time. The reason classifiers 1 and 5 have the same strength when classifier 9 is shared between the chains is simple: both 1 and 5 are paid only by classifier 9, and the fixed-point strength of classifier 9 is just the average of the payments made to it whenever it is used. Adding a bridging classifier rectifies this situation: classifier 1 now gets income from both classifier 9 and its bridge classifier, 10, and classifier 5 gets income from classifier 9 and its bridge, classifier 11. Since the bridging classifier 10 gets the 100 payoff, while classifier 11 gets 0, the strength of classifier 10 is greater than that of 11, and in turn the strength of classifier 1 is greater than that of 5. Thus adding a bridging classifier restores the fixed-point performance rate to near the maximum.
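The averaging effect of a shared classifier can be seen in a deliberately simplified fixed-point model. This sketch assumes full-bid payments only and ignores BidRatio, BidPow competition, and initial strengths, so the absolute numbers differ from the table above; the names S1, S5, S9 for the strengths are hypothetical labels mirroring the shared-classifier configuration.

```python
k = 0.1   # bid constant

def shared_chain_fixed_points(payoff_top, payoff_bottom, p_top=0.5):
    """Fixed-point strengths in a simplified model of the shared-classifier
    structure: two two-classifier tails (3-4 and 7-8) feed strength into
    the shared classifier 9, which in turn pays its bid to classifier 1 or
    5, whichever activated it."""
    S4 = payoff_top / k          # tail of the top sequence earns the payoff
    S3 = S4                      # reflex link: full bid passed back
    S8 = payoff_bottom / k       # tail of the bottom sequence
    S7 = S8
    # Classifier 9's income is the successor's bid on whichever path ran,
    # so its fixed point reflects the *average* payment it receives.
    income9 = p_top * (k * S3) + (1 - p_top) * (k * S7)
    S9 = income9 / k
    # Classifiers 1 and 5 are each paid classifier 9's bid when they fire,
    # so both reach the same fixed point no matter which path pays more.
    S1 = S5 = (k * S9) / k
    return S1, S5, S9

S1, S5, S9 = shared_chain_fixed_points(payoff_top=100.0, payoff_bottom=0.0)
```

Whatever the payoff asymmetry between the two paths, S1 and S5 come out identical: the shared classifier erases the information about which path leads to the payoff, which is exactly why the system falls back to choosing a path at random.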
7 CONCLUSIONS

The apportionment of credit problem is the problem of deciding, when many rules are active at every time step, which of those rules active at step t are necessary and sufficient for achieving some desired outcome at step t+n. In classifier systems using the bucket brigade algorithm, credit is allocated in the form of a value, strength, associated with each classifier. Because the bucket brigade algorithm uses only local information to incrementally change the strength of classifiers, one key problem for systems using the bucket brigade algorithm is how rapidly strength can be passed down long sequences of classifiers. If a whole sequence of classifiers must be activated many times in order to adjust the strength of a classifier at the beginning of the chain in response to a change in payoff associated with the last step in the sequence, the system's response to simple changes in its environment may be too slow to be useful. This paper has presented results that show the bucket brigade basically works as designed: strength is passed down a chain of coupled classifiers from those that receive the reward directly from the environment to those that are "stage setters". The system can thereby learn to respond to a change in the environment by choosing to activate one classifier rather than another, even when a reward is received only after a long sequence of classifiers is activated. However, it does seem to take a large number of trials before the classifiers at the beginning of a chain reach their fixed-point strengths, and so alter the choice of paths to take. While increasing the bid constant k can speed the flow of strength, a large k means classifiers are risking a large proportion of their strength on each bid they make, so that there is not much room for classifiers to make mistakes.
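The trade-off around the bid constant k can be illustrated with the simplified reflex-chain model (an illustration only, not the CFS-C system; the chain length, payoff, and 90% threshold are arbitrary choices): a larger k moves strength to the front of the chain in fewer passes, at the price of risking a larger fraction of strength on every bid.

```python
def passes_to_learn(k, n=9, R=100.0, threshold=0.9):
    """Count the passes down a reflex chain until the first classifier's
    strength reaches `threshold` of its fixed point R / k.  Each classifier
    pays its full bid (k * S) to its predecessor; the last one is paid R."""
    S = [0.0] * n
    target = threshold * R / k
    passes = 0
    while S[0] < target:
        bids = [k * s for s in S]
        for i in range(n):
            S[i] -= bids[i]
            S[i] += bids[i + 1] if i + 1 < n else R
        passes += 1
    return passes

slow = passes_to_learn(k=0.1)
fast = passes_to_learn(k=0.5)   # larger k: strength reaches the front sooner
```

In this model the larger bid constant needs far fewer passes, but on every activation each classifier stakes five times the fraction of its strength, so a run of losing bids is correspondingly more damaging.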
Increasing the BidPow parameter was shown to increase the fixed-point payoff rate the system will achieve, and to slightly decrease the system's response time to a change in payoff at the end of the sequence. The bucket brigade algorithm was shown to allocate strength down sequences that implement both reflex-like subroutines and non-reflex sequences. The system can respond to changes in payoff at the end of two competing non-reflex sequences just as fast as it can to changes that occur at the end of two competing reflex sequences. However, the fixed-point strengths of classifiers in non-reflex sequences fall off exponentially with each classifier that uses a message not from its predecessor in the sequence. This drop in strength could present problems for non-reflex sequences that try to compete with reflex-like sequences, and it could have effects on the creation and deletion of rules if strength is used to bias the rule discovery algorithms used by the system. One way to ameliorate the fall-off of strength in non-reflex chains that use only detector messages is to change the bucket brigade algorithm so that classifiers pay less than the full BidShare for detector messages. This would mean that classifiers that match detector messages will in general have higher fixed-point strengths than other classifiers. The higher fixed-point strengths may bias the system's rule discovery algorithms to create more classifiers that use detector messages, a bias which makes some sense, since the system should put a high priority on using messages from its environment. On the other hand, such a bias in favor of classifiers that use detector messages won't solve the exponential strength fall-off problem for non-reflex sequences that use messages from other classifiers. "Bridging" classifiers were shown to have many effects on the allocation of strength and the performance resulting from competing sequences of classifiers.
Bridging classifiers lead to a dramatic decrease in the number of passes the system must make down a sequence before it can respond to a change in the environment, for both reflex and non-reflex sequences. Bridging classifiers also allocate additional strength to the earlier classifiers in non-reflex sequences, overcoming the exponential fall-off of strength seen without such bridging classifiers. Note that classifier sequences with bridging classifiers require the same number of passes down the chain to respond to a change in payoffs at the end of the sequence, no matter how long the sequence is. For example, when the experiments described in Section 4 were repeated with sequences of 19 classifiers, the system's performance with bridging classifiers again began to increase almost immediately, within the first 15-20 trials, just as it did with sequences of 9 classifiers. Without bridging classifiers, the length 9 sequence required about 120 trials to respond, whereas the length 19 sequence required 250 trials. One way to decrease the response time further might be to use multiple bridging classifiers for each sequence. Each bridge would pass additional strength to the first classifier in the sequence, enabling it to dominate the competition sooner. There are many ways to implement bridging classifiers. Two ways were tried in experiments described in this paper. While each type of bridging classifier acted to decrease the system's response time, each also resulted in somewhat different distributions of strength over the classifiers in a sequence. These differences in the fixed-point strengths may have important effects on long-term learning in the system if strengths are used to guide the creation and deletion of classifiers. Other types of bridging classifiers should be tried to see what effects they have.
Finally, sharing classifiers between sequences could be a very good way to promote transfer of knowledge from one domain to another, and to economically use the same rules in more than one context. However, experiments described in this paper show that sharing classifiers can lead to problems for the allocation of strength down sequences of classifiers by the bucket brigade algorithm. Basically, sharing classifiers means the information being passed from classifier to classifier (in the form of strength) is lost, or at least greatly attenuated, when shared classifiers are involved. Simple bridging classifiers were shown to be one way to avoid the problem caused by shared classifiers.

REFERENCES

[Burks, 1986] Burks, Arthur W. "A Radically Non-Von Neumann Architecture for Learning and Discovery." In CONPAR 86: Conference on Algorithms and Hardware for Parallel Processing, September 17-19, Proceedings, 1-17. Wolfgang Handler, et al. (Eds.), Springer-Verlag, Berlin, 1986.

[Holland, 1985] Holland, J. H. "Properties of the Bucket Brigade." Proceedings of an International Conference on Genetic Algorithms and their Applications, 1-7. John J. Grefenstette (Ed.). Carnegie-Mellon University, Pittsburgh, 1985.

[Holland, 1986a] Holland, John H. "Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems." In Machine Learning: An Artificial Intelligence Approach, Volume II, Michalski, Ryszard S., Carbonell, Jaime G., and Mitchell, Tom M. (Eds.). Morgan Kaufmann Publishers, Inc., Los Altos, CA (1986).

[Holland and Burks, 1987] Holland, John H. and Burks, Arthur W. "Adaptive Computing System Capable of Learning and Discovery." United States patent applied for (1987).

[Holland, 1986b] Holland, John H., Holyoak, Keith J., Nisbett, Richard E., and Thagard, Paul A. Induction: Processes of Inference, Learning, and Discovery. The MIT Press, Cambridge, MA, 1986.

[Riolo, 1986] Riolo, Rick L.
"CFS-C: A Package of Domain Independent Subroutines for Implementing Classifier Systems in Arbitrary, User-Defined Environments." Logic of Computers Group, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, 1986.

[Riolo, 1987a] Riolo, Rick L. "CFS-C/FSW1: An Implementation of the CFS-C Classifier System in a Domain that Involves Learning to Control a Markov Process." Logic of Computers Group, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, 1987 [in prep.].

[Riolo, 1987b] Riolo, Rick L. "Bucket Brigade Performance: II. Simple Default Hierarchies." [In this proceedings.]

[Samuel, 1959] Samuel, A. L. "Some Studies in Machine Learning Using the Game of Checkers." IBM Journal of Research and Development, 3, 210-229 (1959).

[Wilson, 1986] Wilson, Stewart W. "Hierarchical Credit Allocation in a Classifier System." Research Memo RIS No. 37r. The Rowland Institute for Science, Cambridge, MA, 1986.

BUCKET BRIGADE PERFORMANCE: II. DEFAULT HIERARCHIES

Rick L. Riolo
The University of Michigan

ABSTRACT

Learning systems that operate in environments with huge numbers of states must be able to categorize the states into equivalence classes that can be treated alike. Holland-type classifier systems can learn to categorize states by building default hierarchies of classifiers (rules). However, for default hierarchies to work properly, classifiers that implement exception rules must be able to control the system when they are applicable, thus preventing the default rules from making mistakes. This paper presents results that show the standard bucket brigade algorithm does not lead to correct exception rules always winning the competition with the default rules they protect. A simple modification to the bucket brigade algorithm is suggested, and results are presented that show this modification works as desired: default hierarchies can be made to achieve payoff rates as near to optimal as desired.
1 INTRODUCTION

Any learning system that is to operate in environments with huge numbers of states must be able to categorize the states into equivalence classes that can be treated alike. For rule-based systems like classifier systems ([Holland, 1986a], [Burks, 1986], [Holland and Burks, 1987]) the problem involves finding a set of classifiers (condition/action rules) that induce the appropriate equivalence classes. One approach to this problem is to try to find a set of rules that never make mistakes and that partition the whole environment. Such a set of rules in effect establishes a homomorphic model of the world. The problem with this approach is that realistic environments typically involve millions of possible states, with very complicated underlying equivalence classes, of which the system may have sampled only a small fraction. In such situations it would take a vast number of rules to establish a homomorphic model of the world. Another approach is to implement a default hierarchy ([Holland, 1985], [Holland, 1986b]) of rules. A default hierarchy is a multi-level structure in which classifiers (rules) at the top levels are very general. Each general rule responds to a broad set of states, so that just a few rules can cover all possible states of the world. Of course, since a general rule responds in the same way to many states that don't really belong in the same category, it will often make mistakes. To correct the mistakes made by the general classifiers, lower level, exception rules are added to the default hierarchy. The lower level classifiers are more specific than the higher level rules: each exception rule responds to a subset of the situations covered by some more general rule.
Default hierarchies have several features that make them well suited for learning systems that must build models of very complex domains:

- Default hierarchies can be made as error-free as necessary, by adding classifiers to cover exceptions to the top level rules, to cover exceptions to the exceptions, and so on, until the required degree of accuracy is achieved.

- Default hierarchies are the basis for building quasi-homomorphic models of the world, which generally require far fewer rules to implement a given degree of accuracy than do equivalent homomorphic models [Holland, 1986b].

- Default hierarchies make it possible for the system to learn gracefully, since adding rules to cover exceptions won't cause the system's performance to change drastically, even when the new rules are incorrect.

This paper describes some simple experiments with default hierarchies implemented in the CFS-C/FSW1 classifier system [Riolo, 1987b]. These experiments show that when default hierarchies are built in a top down manner, by adding rules to cover exceptions, overall performance does improve as predicted. However, using the standard bucket brigade algorithm, the system does not achieve the performance expected. The reasons for this lower than expected performance are explained, and a modification to the bidding mechanism used in the standard bucket brigade algorithm is proposed. This modification is shown to lead to performance as close to the expected performance as desired.

* This work was supported by National Science Foundation Grant DCR 83-05830.

2 THE CFS-C/FSW1 SYSTEM

All experiments described in this paper were done using the CFS-C classifier system [Riolo, 1986], set in the FSW1 ("Finite State World 1") task environment [Riolo, 1987a]. This section briefly describes the parts of the CFS-C/FSW1 system that are relevant to the experiments described in this paper. For more details, see [Riolo, 1987b] or the cited documentation.
Basically, the FSW1 domain is a world that is modeled as a finite Markov process, in which a non-zero payoff is associated with some states. The classifier system's input interface provides a message that indicates the current state of the Markov process. The classifier system's effector interface provides the system with a way to alter the transition probabilities of the process, so that the system can control (in part) the path taken through the finite state world. Basically, a message that begins with "10" is interpreted by the effector interface as a command to set the effector to some value r, r = 0, 1, ..., 15, depending on the rightmost 4 bits of the message. The effector setting r is used to select the transition matrix P(r), which specifies the probability the system will go from the current state W_i to some state W_j. When the system moves to a new state with a non-zero payoff, that payoff is given to the active classifiers as a "reward". The CFS-C classifier system is a standard, "Holland" type learning classifier system. While the CFS-C classifiers all have two conditions, in this paper they will be treated as if they have only one condition. (This is done by making both conditions identical.) For the experiments described in this paper, no classifiers are coupled, i.e., no classifier is satisfied by a message produced by another classifier. All classifiers match only detector messages, i.e., the classifiers are matched when the current state of the finite state process is in the set of states matched by the classifier's condition. The most important parts of the classifier system are (a) the mechanism implementing bidding and competition to post messages and to set the effectors, and (b) the allocation of payoff to classifiers from the environment.
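The effector interface's decoding of messages can be sketched as follows. Only the leading "10" tag and the rightmost 4 bits are taken from the description above; the 16-bit message length in the example is an assumption.

```python
def effector_value(message):
    """Interpret a binary message string at the FSW1 effector interface:
    a message whose first two bits are "10" is an effector command, and
    the rightmost 4 bits give the effector setting r (0..15).  Returns
    None for non-effector messages.  (A sketch of the interface as
    described in the text, not the CFS-C implementation.)"""
    if not message.startswith("10"):
        return None
    return int(message[-4:], 2)

# A message ending in "0011" commands the effector setting r = 3.
r = effector_value("1000000000000011")
```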
In the CFS-C/FSW1 system the bid of classifier i at step t, B_i(t), is calculated as follows:

    B_i(t) = k * S_i(t) * BidRatio_i

k is a small constant (set to 0.1), which acts as a "risk factor", i.e., it determines what proportion of a classifier's strength it will bid, and so perhaps lose, on a single step. S_i(t) is the strength of classifier i at step t. BidRatio_i is a number between 0 and 1 that is a measure of the classifier's specificity, i.e., how many different messages it can match. A BidRatio of 1 means the classifier matches exactly one message, while a BidRatio of 0 means the classifier matches any message. Thus classifiers that implement high level, general rules in a default hierarchy will have low BidRatios, while classifiers that implement lower level, exception rules will have higher BidRatio values. When a competition is run to determine which classifiers are to be activated and post messages, the probability that a given (satisfied) classifier i will win is:

    Prob(i wins) = b_i(t) / SUM_j b_j(t)

where j ranges over all bidding (satisfied) classifiers at t, and b_i(t), the effective bid of classifier i at t, is:

    b_i(t) = B_i(t)^BidPow

BidPow is a parameter that can be set to alter the shape of the probability distribution used to choose classifiers to produce messages. In all experiments described in this paper, BidPow = 3. Note that the output interface of the CFS-C/FSW1 system may have to resolve conflicts, e.g., when one classifier produces a message that says "set the effector value r to 1" and another produces a message that says "set the effector value r to 2". Since the effector can only be set to one value at a time (just as we can either lift our arm or lower it, and not both), an effector conflict resolution mechanism must be used. Basically, the value of r is chosen probabilistically, with the probability that r = r' equal to:

    Prob(r = r') = SUM_{m'} b_{m'}(t) / SUM_m b_m(t)

where m'
ranges over the effector messages that say "set r to r'", b_{m'}(t) is the effective bid of the classifier that posted message m', and m ranges over all effector messages. Once an effector value is chosen, all messages that are inconsistent with that setting are deleted from the new message list. When a classifier wins a competition and posts a message, its strength is decremented by the amount it bid to become active. Since there are no coupled classifiers, classifiers only receive payments from the environment, when the system moves to a state with an associated non-zero payoff. In particular, the full payoff is added to the strength of every classifier that has posted a message during that time step. Because the bids made by classifiers are not paid to other classifiers in the experiments described in this paper, the strength at step t+1 of a classifier i that produced one or more messages at step t is:

    S_i(t+1) = S_i(t) - B_i(t) + R(t)

where B_i(t) is the classifier's bid at step t and R(t) is the reward from the environment at step t. Note that the fixed-point strength of classifier i, S_{i,fp}, is inversely proportional to its BidRatio_i. In particular,

    S_{i,fp} = I_i / (k * BidRatio_i)

where I_i is the average amount paid to classifier i whenever it posts messages. Other things being equal, a general classifier (with a low BidRatio) will have a higher fixed-point strength than a more specialized classifier (with a higher BidRatio).

Figure 1: A simple FSW1 finite state world.

3 A SIMPLE TEST WORLD

In order to examine default hierarchies in the CFS-C/FSW1 system, consider the simple FSW1 world shown in Figure 1. There are seven states in this world, W_i, i = 0...6. State W0 is the start state. States W4 and W5 have payoffs of 200 and 400, respectively; all other states have zero payoff. For i = 4, 5, or 6, and any effector value r, p_ij(r) = 0.25, j = 0...3.
That is, when in state W4, W5, or W6, the system has an equal chance of going to any one of the four states on the left in Figure 1, and no chance of going to a state on the right, no matter what the value of r. When in state W0 or W1, if r = 1 the system goes to W4; if r = 2, the system goes to W6. When in state W2 or W3, if r = 2 the system goes to W4; if r = 1, the system goes to W6. And when in state W3, the system goes to W5 only if r = 3. (Other values of r are not allowed in the experiments described below.) The system can do best if it sets r to 1 when in states W0 or W1, sets r to 2 when in state W2, and sets r to 3 when in state W3. A good set of classifiers for this world would classify the states on the left into three categories and respond accordingly. (Since the system can't alter the transition probabilities when in any of the states on the right, no rules categorizing those states can be more useful than any others.) While this world is too small to show how a default hierarchy (i.e., a quasi-morphism) can require fewer classifiers than a homomorphic model, it is complex enough to show the relationship between default and exception rules. As a simple test of the CFS-C/FSW1 system, it was run in the world shown in Figure 1 using the following three classifiers:

    d01 / r = 1    (1)
    d2 / r = 2    (2)
    d3 / r = 3    (3)

(The numbers in parentheses on the right serve to identify the classifiers.) The condition "d01" matches two detector messages, for states W0 or W1. The conditions d2 and d3 each match one detector message, i.e., the detector messages for states W2 and W3, respectively. The action "r = x" posts an effector message that sets the effector value r to x, x = 1, 2, or 3. In this and other experiments described in this paper, the system was run for 2000 major-cycle steps. In all runs classifiers reached their fixed-point strengths within the first 1000 steps.
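Under this description of the transitions, the expected per-step payoff of any rule set can be computed directly: the right-hand states return uniformly to W0-W3, so payoff arrives on every other step, and a policy's rate is half the average payoff it reaches from a random left-hand state. A sketch follows (the transition table is reconstructed from the prose above, so treat its details as an assumption):

```python
# Sketch of the Figure 1 world: left states W0-W3, right states W4-W6,
# payoff 200 at W4 and 400 at W5.  From W0/W1, r = 1 leads to W4; from
# W2/W3, r = 2 leads to W4; from W3, r = 3 leads to W5; other allowed
# settings lead to the zero-payoff state W6.
PAYOFF = {4: 200.0, 5: 400.0, 6: 0.0}

def next_right_state(state, r):
    if state in (0, 1):
        return 4 if r == 1 else 6
    if state == 2:
        return 4 if r == 2 else 6
    # state == 3
    if r == 3:
        return 5
    return 4 if r == 2 else 6

def per_step_payoff(policy):
    """policy maps each left state to an effector setting; every other
    step the process is in a uniformly random left state, so the per-step
    rate is half the average payoff reached from the left states."""
    avg_left = sum(PAYOFF[next_right_state(s, policy[s])] for s in range(4)) / 4
    return avg_left / 2

default_only   = {0: 1, 1: 1, 2: 1, 3: 1}   # the general rule alone
with_exception = {0: 1, 1: 1, 2: 2, 3: 2}   # default plus one exception
full_hierarchy = {0: 1, 1: 1, 2: 2, 3: 3}   # exception to the exception added
```

These three policies yield expected rates of 50, 100, and 125 per step, matching the expected values quoted for the rule sets examined in the text.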
The average performance of the system over steps 1500 to 2000 is used to establish the "fixed-point" performance, P_fp, for one run. All results presented are the average of 10 runs (each started with a different pseudo-random number generator seed). For the three classifiers shown above, P_fp was 124.4 per step (StdDev = 1.3). As can be easily calculated, the expected value is 125 (the system gets a non-zero payoff at most once every two steps). Thus these three rules perform just as expected. Of course these rules do not implement a default hierarchy: there is a general rule, 1, but it never makes mistakes, and the other rules do not cover subsets of the cases covered by the general rule. Instead, these rules implement a homomorphic model, since the rules partition the space of possible states (with respect to the 4 states on the left in Figure 1, the states the system is modeling). The next section shows how a default hierarchy can be built to cover the same states.

4 RESULTS

To test the performance of a default hierarchy in the world shown in Figure 1, the CFS-C/FSW1 system first was run with just one classifier:

    d0123 / r = 1    (BidRatio = 0.56)    (4)

This general classifier clearly implements a high level default rule for this world, namely "When in any state W_i, i = 0...3, set r to 1". The expected fixed-point performance for a system using just this rule is 50, since the system will get a reward only when it moves from state W0 or W1 to W4, which it will do 50% of the time. The observed fixed-point performance was 49.9, or 99.8% of the expected value (StdDev = 1.9%). The classifier's fixed-point strength is 1765, which is also exactly the expected strength. To improve the system's performance, it was run again with two rules, the above default rule, 4, and the following "exception" rule:

    d23 / r = 2    (BidRatio = 0.75)    (5)

This classifier covers a subset of the states covered by classifier 4.
For that subset, it corrects a mistake made by classifier 4, since it sets r to 2, and so causes the system to go from state W2 or W3 to W4 (instead of to W5). The expected payoff for a system using the default hierarchy implemented by these two rules is 100 per step, since every other step the system should receive the 200 payoff from state W4. The observed fixed-point payoff was 83.9, or just 83.9% of the expected rate (StdDev = 1.8%). The fixed-point strengths of classifiers 4 and 5 were 2661 and 2667, respectively. Note that the strength of the default rule 4 went up as predicted [Holland, 1985] when the exception rule 5 is added to the system, but it did not go up as high as expected (to 3571). The system also was run with an additional classifier:

    d3/r = 3   (BidRatio = 1)   (6)

This rule covers another exception, so that the system can get to the high-payoff state W6 whenever it is in state W3. Together rules 4, 5, and 6 implement a complete default hierarchy for this simple world. The expected payoff for these classifiers is the same as for the original 3 perfect rules (1, 2, and 3), i.e., 125 per step. However, the observed performance was just 108.7, or 86.9% of the expected value (StdDev = 1.3%). The fixed-point strengths of classifiers 4, 5, and 6 were 2860, 2667, and 4000, respectively. Why is the performance lower than expected? First, note that the fixed-point strengths of classifiers 5 and 6 are just as expected, given the amount of payoff they receive. On the other hand, classifier 4, the top level default rule, has a fixed-point strength that is much lower than its expected value, 3571. Thus classifier 4 must be making mistakes. To get a better idea of what is happening consider Figure 2, which shows the results of a run using just two classifiers, 4 and 5. Figure 2 shows the marginal performance of the system plotted against major-cycle steps executed. It also shows the strengths for classifiers 4 and 5.
First, note that the strength of classifier 5 stabilizes at 2667, which is the maximum strength it can reach given its maximum average income (200 per bid) and its BidRatio, 0.75. On the other hand, the strength of classifier 4 oscillates, and it never reaches its expected value. Instead, as soon as the strength of 4 gets much above the strength of classifier 5, performance begins to drop (as does the strength of classifier 4). This co-oscillation of the strength of classifier 4, the general default rule, and marginal performance is the key to why the performance of this default hierarchy is lower than expected. Recall that in classifier systems there is a competition to post messages and control effectors: the higher a classifier's bid, the more likely it is to win the competition and control the system's behavior. Also recall that when a classifier loses the competition to set an effector, its messages are deleted from the message list and it does not pay its bid to its suppliers.

[Figure 2: Performance of a default hierarchy using the standard bucket brigade algorithm. Plotted against cycle step: marginal payoff to system, strength of default classifier (#1), strength of exception classifier (#2).]

Bearing these facts in mind, the reasons for the oscillations and sub-expected performance are clear:

1. When the strength of an exception rule (rule 5) is greater than the strength of the default rule for which it is an exception (rule 4), the exception rule tends to control behavior in the states it covers. The exception tends to protect the default from making mistakes, which allows the average income for the default rule to reach its maximum.

2. Since the default rule has a smaller BidRatio than the exception rule that protects it, if the average incomes of the rules are about the same, the maximum fixed-point strength of the default rule will be greater than the maximum for the exception rule.
In this example, as rule 5 protects 4, the strength of rule 4 eventually exceeds that of rule 5 (since both have a maximum income of 200 per bid).

3. As the strength of the default classifier rises, its bid and effective bid will rise. Once the effective bid of the default rule is near to or greater than the effective bid of the exception rule, the default rule will begin to win the competition. That is, the exception rule will not always win the competition in the situations it covers, which means it will not be able to stop the default rule from making mistakes.

4. Once the default rule begins making mistakes, both the system performance and the strength of the default rule will begin to fall. Eventually the system returns to the beginning of the performance oscillation, when the strength of the default rule is enough lower than that of the exception rule so that the exception rule can again protect the default from making mistakes.

In short, the protection an exception-covering rule provides for a default rule allows the strength of the default rule to rise until the exception rule can no longer protect it. This will almost always happen in default hierarchies, since the maximal fixed-point strength of a general default classifier is in general higher than that of a more specific, exception classifier. One way to correct this problem would be to alter the bucket brigade so that the maximum fixed-point strength of general classifiers is not higher than that for more specialized classifiers. For example, payments to classifiers could be biased so that a lower BidRatio leads to a smaller share of the payment from other classifiers or from environmental rewards. Lowering the fixed-point strength of general classifiers may create some other problems, however.
For instance, since default rules tend to make mistakes more often than more specialized rules, the fixed-point strength of a default rule should be relatively high, so that it can afford to make mistakes (and lose strength) without having its strength go so low that it is eliminated from the system. Another approach is to leave the relationship between the fixed-point strengths of general versus specific classifiers the same, but to bias the effective bid against the general classifiers. Since the effective bid only changes the probability distribution used to determine which classifiers post messages and control the system's effectors, it does not alter a classifier's per bid income or payment. Thus general classifiers will still have relatively high fixed-point strengths, even though specialized classifiers will tend to win the competition more often. To test this idea, the way effective bids are calculated in the CFS-C/FSW1 system was changed by adding a factor involving a classifier's BidRatio. In particular, the effective bid, eB_i(t), of a classifier i at step t is now calculated as follows:

    eB_i(t) = B_i(t)^BidPow * BidRatio_i^EBRPow

The value of EBRPow can be changed to alter the shape of the probability distribution used to choose the classifiers that are to produce messages. For example, if BidPow = 1 and EBRPow = 0 then eB_i(t) = B_i(t), i.e., a classifier's probability of producing messages is just its bid divided by the sum of bids made by all satisfied classifiers. Setting EBRPow equal to 1, 2, and so on, makes it less likely that general classifiers (those with lower BidRatios) will win the competition with more specific, co-active classifiers.
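The effective-bid modification can be illustrated directly. A minimal sketch follows; the function names are ours, and giving the two rules equal raw bids is an illustrative assumption:

```python
def effective_bid(bid, bid_ratio, bid_pow=1.0, ebr_pow=0.0):
    """Effective bid as in the text: eB = bid^BidPow * BidRatio^EBRPow."""
    return (bid ** bid_pow) * (bid_ratio ** ebr_pow)

def win_probabilities(classifiers, ebr_pow):
    """Probability that each (bid, bid_ratio) classifier wins the
    message competition, assuming wins are drawn in proportion to
    effective bid."""
    ebs = [effective_bid(b, r, ebr_pow=ebr_pow) for (b, r) in classifiers]
    total = sum(ebs)
    return [eb / total for eb in ebs]

# Default rule 4 (BidRatio 0.56) vs. exception rule 5 (BidRatio 0.75),
# with equal raw bids (an illustrative assumption).
rules = [(100.0, 0.56), (100.0, 0.75)]
p0 = win_probabilities(rules, ebr_pow=0)  # standard bucket brigade
p3 = win_probabilities(rules, ebr_pow=3)  # biased against the general rule
```

With EBRPow = 0 the two rules win equally often; with EBRPow = 3 the more specific rule's share of wins rises sharply, which is exactly the bias the modification is meant to introduce.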
To test the effects of this modification, the system was run in the same finite state world described earlier, using classifiers 4, 5, and 6, which implement a simple 3 level default hierarchy. The following table shows the results obtained using various values for EBRPow:

    EBRPow   % Expected Pfp (1 StdDev)   S4
    0        86.9 (1.3)                  2860
    1        92.7 (1.3)                  3104
    2        95.8 (0.9)                  3280
    4        98.6 (1.4)                  3479
    6        99.5 (1.8)                  3553

As can be seen, raising the EBRPow parameter increases the average fixed-point performance of the system to virtually a mistake-free level. Also note that the fixed-point strength of classifier 4, the default rule, increases to almost its maximum value, 3571.

[Figure 3: Performance of a default hierarchy using the modified bucket brigade algorithm with EBRPow = 3. Plotted against cycle step: marginal payoff to system, strength of default classifier (#1), strength of exception classifier (#2).]

Figure 3 shows the results of repeating the experiment shown in Figure 2 with EBRPow = 3. The oscillations of both the marginal performance and the strength of classifier 4 are all but eliminated. Thus with a high EBRPow, the exception classifier 5 is able to almost always win the competition with the default rule it protects, in which case the system never makes mistakes. The system was also run using the same three classifiers in two slightly different versions of the world shown in Figure 1. First, the system was run in a world in which setting the effector value r to 2 in states W2 and W3 causes the system to go to another state, which has a payoff of 100. That is, an exception rule like classifier 5 leads to a lower payoff than the default rule does when it doesn't make a mistake. In this world the expected performance (per step) is 112.5, while the observed performance with EBRPow = 0 was 97.7 (86.9% of expected performance, StdDev = 1.8).
However, with EBRPow = 3 the observed performance was 109.0 (97.5% of the expected value, StdDev = 1.9). Thus the default hierarchy performs just as well when the exception rule leads to a lower payoff than the default rule does when it is correct. Second, the system was run in a finite state world like that shown in Figure 1, except that various amounts of uncertainty are introduced into the transitions. For example, instead of going from state W0 to state W4 with probability 1.0 when r is set to 1, the system will go to that state with probability 0.92 and go to one of the other states on the right with probability 0.04 each; i.e., 8% of the time the system will go somewhere unexpected. The following table shows the results of running the three classifiers 4, 5, and 6 that implement the simple default hierarchy described earlier, in worlds with increasing amounts of uncertainty (using EBRPow = 3):

    % Uncertainty   Expected Payoff   Observed Payoff (1 StdDev)   % Optimal Payoff
    0               125               122 (3.44)                   97.6
    8               116               115 (3.64)                   99.1
    16              107               105 (3.64)                   98.1
    32              89.0              87.5 (5.52)                  98.3

As can be seen, uncertainty in the environment has little effect on the system's ability to obtain the best possible payoff rate. This is an important property for any learning system that must contend with very complex environments in which it can never completely reduce the uncertainty.

5 CONCLUSIONS

Default hierarchies are an excellent way for classifier systems to cope with very complex environments. However, for default hierarchies to work properly, classifiers that implement "exception" rules must be able to control the system when they are applicable, thus protecting the default rules from making mistakes. This paper has presented results that show the bucket brigade algorithm as described in [Holland, 1985] does not lead to correct exception rules always winning the competition with the default rules they protect.
A simple modification to the bucket brigade algorithm is suggested, which involves biasing the calculation of the effective bid so that general classifiers (those with low BidRatios) have much lower effective bids than do more specific classifiers (those with high BidRatios). Results are presented that show this modification works as desired: default hierarchies can be made to achieve payoff rates as near to optimal as desired. Since the modification to the bucket brigade algorithm only changes the effective bid made by classifiers, the allocation of strength under the bucket brigade is not altered (except insofar as the fixed-point strengths of default rules are increased to their maximum expected levels). Also, if the system does not have more specialized exception classifiers to compete with a particular default rule, that default rule will continue to control the behavior of the system. The modified bucket brigade algorithm only reduces the probability that default rules will post messages when they are competing with co-active exception rules. The classifier system also is shown to work as expected in environments with varying amounts of uncertainty, and when the payoff received for activating a correct exception rule is less than the payoff received by the default rule it protects when the default is used in a correct situation.

REFERENCES

[Burks, 1986] Burks, Arthur W. "A Radically Non-Von Neumann Architecture for Learning and Discovery." In CONPAR 86: Conference on Algorithms and Hardware for Parallel Processing, September 17-19, Proceedings, 1-17. Wolfgang Handler, et al. (Eds.). Springer-Verlag, Berlin, 1986.

[Holland, 1985] Holland, J. H. "Properties of the Bucket Brigade." Proceedings of an International Conference on Genetic Algorithms and their Applications, 1-7. John J. Grefenstette (Ed.). Carnegie-Mellon University, Pittsburgh, 1985.

[Holland, 1986a] Holland, John H.
"Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems." In Machine Learning: An Artificial Intelligence Approach, Volume II, Michalski, Ryszard S., Carbonell, Jaime G., and Mitchell, Tom M. (Eds.). Morgan Kaufmann Publishers, Inc., Los Altos, CA (1986).

[Holland, 1986b] Holland, John H., Holyoak, Keith J., Nisbett, Richard E. and Thagard, Paul A. Induction: Processes of Inference, Learning, and Discovery. The MIT Press, Cambridge, MA, 1986.

[Holland and Burks, 1987] Holland, John H. and Burks, Arthur W. "Adaptive Computing System Capable of Learning and Discovery." United States patent applied for (1987).

[Riolo, 1986] Riolo, Rick L. "CFS-C: A Package of Domain Independent Subroutines for Implementing Classifier Systems in Arbitrary, User-Defined Environments." Logic of Computers Group, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, 1986.

[Riolo, 1987a] Riolo, Rick L. "CFS-C/FSW1: An Implementation of the CFS-C Classifier System in a Domain that Involves Learning to Control a Markov Process." Logic of Computers Group, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, 1987 [in prep.].

[Riolo, 1987b] Riolo, Rick L. "Bucket Brigade Performance: I. Long Sequences of Classifiers." [In these proceedings.]

MULTILEVEL CREDIT ASSIGNMENT IN A GENETIC LEARNING SYSTEM

John J. Grefenstette
Navy Center for Applied Research in AI
Naval Research Laboratory
Washington, DC 20375-5000

Abstract

Genetic algorithms assign credit to building blocks based on the performance of the knowledge structures in which they occur. If the knowledge structures are rule sets, then the bucket brigade algorithm provides a means of performing additional credit assignment at the level of individual rules.
This paper explores one possibility for using the fine-grained feedback provided by the bucket brigade in genetic learning systems that manipulate sets of rules.

1. Introduction

There are two distinct approaches to machine learning using Genetic Algorithms (GA's). These approaches are popularly denoted by the names of the universities where they were first elaborated. In the Michigan approach, first described by Holland and Reitman [Holland 78], the population comprises a single set of production rules, or classifiers. Each individual rule is assigned a measure, called strength, that indicates the utility of the rule to the system's goal of obtaining external payoff. The bucket brigade algorithm [Holland 86] shifts strength among the rules during the course of problem solving, so that rules achieve high strength either by obtaining direct payoff from the task environment or by setting the stage for later rules. New rules are discovered by genetic operators applied to existing rules on the basis of strength. In the Pitt approach, developed by De Jong and Smith at the University of Pittsburgh [Smith 80], each structure in the population maintained by the GA represents a production system program, i.e., a set of rules. Each structure is evaluated by running it through a production system interpreter in the environment of the learning task and measuring various aspects of the system's performance. As the result of each evaluation, a fitness measure is assigned to the entire program and is used to control the selection of structures for reproduction. The usual genetic search operators -- crossover, mutation and inversion -- are applied to the structures without reference to the performance of the individual rules comprising the structures.
These two approaches have provided a basis for several successful learning systems [Booker 82, Goldberg 83, Schaffer 84, Smith 80, Wilson 85], each of which incorporates significant additions to the basic outline given above. It should be emphasized that both approaches are undergoing constant evolution and that the relative merits of the two approaches are a topic of continuing interest. This paper describes the first attempt to combine the strongest features of each approach in a single learning system. A primary strength of the Michigan approach derives from its use of the bucket brigade algorithm [Holland 85]. This elegant scheme achieves a distribution of credit among a large rule set without the need for costly bookkeeping overhead, e.g., the traces of system behavior required of some learning systems. The bucket brigade algorithm supports an important theme in massive parallelism: that effective global patterns of behavior (e.g., default hierarchies) can emerge from vast numbers of local competitions. From a GA perspective, however, the application of genetic operators in the Michigan approach is problematic. The framework described in [Holland 75] provides a theory for the evolution of co-adapted sets of alleles within the structures, arising as a result of competition among the independent structures in the population. There is no theory provided for cooperation among the members of the population, although this might be an interesting extension. As a result, instead of emerging as co-adapted alleles, co-adaptation among rules in the Michigan approach is typically achieved by additional mechanisms, e.g., triggered genetic operators and bridge classifiers [Holland 86, Riolo 87], that specifically encourage rule linkages. In contrast, the Pitt approach corresponds more closely to the biological metaphor of evolution through natural selection.
The knowledge structures in the Pitt approach correspond to chromosomes in a population of competing organisms. The performance measure corresponds to an organism's fitness to its environment. In the Pitt approach, co-adapted rules correspond to co-adapted alleles in [Holland 75]. Holland makes a strong case for viewing adaptation as an optimization problem over the response surface defined by the payoff provided by the task environment and argues that GA's are suitable for performing the optimization. The Pitt approach to learning rule sets seems to be a plausible, albeit ambitious, extension of the successful use of GA's for parameter optimization problems. However, there are two primary bottlenecks in the Pitt approach to learning: computational resources and the feedback bandwidth. Extensive computational resources are required to evaluate thousands of complete rule sets. Two lines of current research address this issue. First, it can be shown that GA's are especially well-suited for optimization using Monte Carlo techniques in the evaluation phase [Grefenstette 87b]. This follows from the observation that even if individual structures are evaluated by probabilistic procedures, the expected error in the observed performance of hyperplanes is much less than the expected error in the performance of the individual structures. The implication for the Pitt approach is that the amount of problem solving activity required to test individual rule sets can be fairly limited without adverse effects on the overall performance of the learning system. A second means of dealing with the computational burden of the Pitt approach is through the use of increasingly available coarse-grained parallel computers with 50 to 100 processors. It is a straightforward task to install a production system interpreter on each node of a MIMD machine and thereby perform all of the structure evaluation for a given generation in parallel.
Several papers in this volume explore this approach. The second bottleneck in the Pitt approach is the limited feedback bandwidth. In LS-1 [Smith 80] a critic provides feedback concerning the performance of each program. The critic includes not only a measure of the task level performance, but also measurements of various dynamic characteristics of the behavior during the problem solving task, such as the number of rules that fire, the number of rules suggesting external actions, and so on. All of this information is combined into a single scalar value to represent the fitness of the rule set under evaluation. Schaffer's LS-2 [Schaffer 84] expands the feedback bandwidth by allowing the critic to provide a vector of performance measures. Subpopulations are selected on the basis of each element in the performance vector, providing the equivalent of environmental niches in which specialists can evolve. Recombination in LS-2 takes place without regard to the subpopulation boundaries. The result is an effective search for Pareto-optimal regions of the search space. Yet another approach to increasing the detail of credit assignment in GA's is represented by recent attempts at the traveling-salesperson problem (TSP). In the TSP, the usual fitness measure is the overall tour length. However, this fitness measure does not provide sufficient selective pressure toward high performance tours. In one approach to this problem [Grefenstette 87a], the crossover operator takes the length of competing parental edges into account when constructing the offspring tour. This represents a low level credit assignment strategy for directly promoting high-performance first-order hyperplanes. In this paper, we explore a similar multilevel credit assignment strategy in the context of the Pitt approach to learning production system programs. Figure 1 illustrates the overall system design.
Our approach is to use the bucket brigade in the course of evaluating production sets to assign credit at the level of individual rules. The GA uses the overall performance of the rule set to guide selection, as in LS-1. But in addition, the strengths of the individual rules influence the physical representation of the structure, making it more likely that high strength rule combinations survive the crossover operation. That is, we are proposing a heuristic form of the inversion operator, called clustering, based on the feedback from the bucket brigade algorithm. The description of the system will proceed bottom-up: We first discuss the object level task and the representation of rules. This is followed by the descriptions of the problem solver and the learner. Experimental results are presented, followed by a discussion of future directions for this research.

2. The State Space Search Task

In the interests of obtaining general results, the object level task for our learning system is an abstract state space search characterized by the triple <S, O, P>, where S is a set of states, O is a set of state transition operators, and P is a mapping of payoff to the states. Certain states in S are initial states, others are final states. At the start of a task, a random initial state is chosen. During each problem solving step, the problem solver selects an operator that is then applied to the current state to produce the new state. A task is complete when the new state is one of the final states. One particular problem is shown in figure 2. The set of initial states is {0, 1, 2, 3} and the set of final states is {12, 13, 14, 15}. The arcs are labeled by the operator(s) that perform the transition. In this problem, a payoff of 1000 is associated with state 13. All other states have a payoff of 0. The object for the learning system is to learn a set of heuristic rules that select operators for the current state.
The rules have the form:

    IF the current state is in the set S_i,
    THEN choose an operator from the set O_i.

The action associated with such a rule is to choose one of the operators in the set O_i at random. The knowledge representation employed is an important bias for any learning system. We now describe our representation of the heuristic rules. In most state space search problems, the number of states is very large. In these cases, it is infeasible to specify arbitrary sets of states in the conditions of a heuristic rule. However, states can often be usefully characterized by a set of features. Furthermore, it is reasonable to assume that heuristic rules for selecting operators may usefully attend to some features and ignore others. In fact, Lenat cites the following as one of the central assumptions underlying the use of heuristics: If an action A is appropriate in situation S, then A is appropriate in most situations that are very similar to S [Lenat 83]. So the left hand sides of our rules will contain patterns that match the feature vectors of states, using the symbol # to indicate that the value of the corresponding feature is irrelevant. For ease of presentation, in this paper we identify the name of each state with its binary feature vector. On the other hand, state transition operators do not usually have features that capture their similarity. Furthermore, the number of operators is usually smaller than the number of states by several orders of magnitude. Given these considerations, it seems reasonable to permit the right hand sides to specify arbitrary sets of operators. We do so by interpreting the right hand side as a membership vector for the set of operators. In the example problem, the operator set is {a, b, c, d}. So, for example, the rule

    0##1 -> 1010

represents the heuristic

    IF the current state is in the set {1, 3, 5, 7}
    THEN choose an operator from the set {a, c}.

A problem solver that uses this representation is described next.

3.
The Problem Solving Level

The problem solver for this system is based in part on Riolo's CFS-C system [Riolo 86]. As that system is described in detail elsewhere in this volume, we restrict our discussion to the modifications implemented for this work. At the start of each problem solving cycle, a detector message indicating the current state is posted on the message list. Each rule whose left hand side matches the current state produces a bid that is proportional to the rule's strength. (The system currently ignores the specificity of the rules in computing bids.) A competition, based on bid size, is held among the bidders to see which rules get to produce a message. (The maximum number of messages produced is fixed.) Effector conflicts among the resulting messages are resolved by a competition based on the total bids associated with each distinct message. Once a winning message is selected, one of the operators indicated by the message is selected at random and applied to the current state to produce the new state. Rules that produce messages consistent with the selected operator are said to be active. Our implementation of the bucket brigade is similar to the one in Wilson's animat system [Wilson 85]:

1) each rule producing a message has the amount of its bid deducted from its strength;

2) the total of the bids thus collected is distributed evenly among the rules active on the previous cycle;

3) any external reward associated with the new state is distributed evenly among the currently active rules.

Note that it is possible that no rule matches the current state. In this case, a randomly chosen operator is used to produce the next state. Since there are no active rules, no external payoff enters the system. It is acknowledged that many of the decisions regarding our implementation of the bucket brigade are ad hoc and that current research may well suggest significant improvements.
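The three credit-assignment steps above can be sketched as a single update function. This is a sketch, not the authors' code: the proportionality constant for bids (bid_frac) is an assumption, since the text only says bids are proportional to strength:

```python
def bucket_brigade_step(strengths, bidders, prev_active, active, reward,
                        bid_frac=0.1):
    """One credit-assignment cycle of the Wilson-style bucket brigade
    described in the text.  strengths maps rule name -> strength;
    bidders are the rules that produced messages this cycle;
    prev_active/active are the rules active on the previous/current cycle.
    bid_frac is an assumed constant of proportionality for bids."""
    # 1) each message-producing rule pays its bid.
    pot = 0.0
    for r in bidders:
        bid = bid_frac * strengths[r]
        strengths[r] -= bid
        pot += bid
    # 2) the collected bids are split evenly among the previously active rules.
    if prev_active:
        share = pot / len(prev_active)
        for r in prev_active:
            strengths[r] += share
    # 3) external reward is split evenly among the currently active rules.
    if active:
        share = reward / len(active)
        for r in active:
            strengths[r] += share
    return strengths
```

For example, with two rules at strength 100, if r1 bids and is active when a reward of 50 arrives while r2 was active on the previous cycle, r1 pays its bid of 10 and collects the reward, while r2 receives the bid.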
Nonetheless, the basic feature of even our somewhat crude version is that the strength of each rule serves as an estimator of its typical payoff. In the example problem, it takes exactly three cycles to move from an initial state to a final state. To evaluate a given rule set, 100 traversals of the state space (300 CFS-C cycles) are performed, each traversal starting at a randomly chosen initial state. The average external payoff achieved by the rule set is returned to the learning level, along with the updated strengths of the individual rules.

4. The Learning Level

At the top level of the learning system, the GENESIS genetic algorithm system [Grefenstette 84] maintains a population of fixed-length structures, each of which is interpreted as a set of fixed length rules. For the experiments described here, each structure consists of 240 bits, interpreted as rule sets of 20 rules each. Each rule is represented by a string of 12 bits. For the left hand sides, 00 represents a "0" in the representation used by the problem solver, 11 represents a "1", and 01 and 10 both represent "#". Since the right hand sides contain no "#" symbols, only 4 bits are required to represent the right hand sides. In addition to the usual genetic operations -- selection, crossover and mutation -- the learner also performs an additional operator, called clustering, described below. The crossover operator treats the knowledge structure as a ring, and selects two crossover points without regard for rule boundaries. Each rule in the offspring inherits the strength associated with the corresponding rule in the parent structure. (In the case of a new rule created by crossing in the middle of two rules, the new rule is assigned the average strength of the two parental rules.) The only notice that crossover pays to the rule boundaries is that duplicate rules are not included when choosing the segment from the second parent.
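The bit-level encoding described above can be made concrete with a small decoder. A sketch follows; the particular 12-bit string is just one of several possible encodings of the earlier example rule 0##1 -> 1010, since "#" has two codings:

```python
def decode_rule(bits12):
    """Decode one 12-bit rule: 8 bits (2 per feature) give the left hand
    side pattern, the last 4 bits give the operator membership vector."""
    lhs_map = {"00": "0", "11": "1", "01": "#", "10": "#"}
    lhs = "".join(lhs_map[bits12[i:i + 2]] for i in range(0, 8, 2))
    return lhs, bits12[8:]

def decode_structure(bits240):
    """A 240-bit structure is interpreted as a set of 20 twelve-bit rules."""
    return [decode_rule(bits240[i:i + 12]) for i in range(0, 240, 12)]

def matches(lhs, state):
    """'#' in the pattern matches either value of the 4-bit feature vector."""
    return all(p in ("#", b) for p, b in zip(lhs, format(state, "04b")))

# One possible encoding of the example rule 0##1 -> 1010:
lhs, rhs = decode_rule("000110111010")               # ("0##1", "1010")
covered = [s for s in range(16) if matches(lhs, s)]  # states 1, 3, 5, 7
```

Note that because "#" is doubly coded, distinct genotypes can decode to the same rule; this redundancy is part of the representation as described.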
Both offspring from crossover are kept, so that all rules in the parent structures survive in one of the offspring (with the exception of those rules in which one of the crossover points falls). Because the performance of a knowledge structure is independent of the position of the rules within the structure, it is natural to consider the use of the inversion operator. Inversion typically reverses a randomly chosen portion of a structure, thereby altering the defining length of many hyperplanes. This in turn alters the probability that those hyperplanes will be disrupted by crossover. A moderate rate of inversion produces slight performance improvements in LS-1 [Smith 80]. In other studies, the effects of inversion have generally been hard to measure. One explanation is that the space being searched by inversion -- the space of all possible permutations of genes -- is in general much larger than the original task space. Faced with the huge space of representation permutations, the random mutations of the representation provided by inversion cannot be expected to show much progress in the amount of time before selection pressure guides the population into convergence. We now describe a heuristic version of inversion called clustering. Like inversion, clustering does not introduce any new phenotypes into the population. Rather, the intent of this new operator is to modify the physical representation of a rule set so that co-adapted sets of rules are less likely to be disrupted by crossover. Clustering is defined as follows:

1) select a rule position at random;

2) move the highest strength rule to the selected position;

3) arrange the remaining rules so that the distance from each rule to the highest strength rule is a decreasing function of the rule's strength (treating the knowledge structure as a ring for the purposes of computing distances).

An example showing the effects of clustering appears in figure 3.
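The three steps of clustering can be sketched as follows. This is only one way to realize step 3; laying the remaining rules out alternately to the right and left of the anchor, strongest first, is our assumption about how "distance as a decreasing function of strength" is implemented:

```python
import random

def ring_distance(i, j, n):
    """Shortest distance between positions i and j on a ring of n positions."""
    d = abs(i - j) % n
    return min(d, n - d)

def cluster(rules, strengths, rng=random):
    """Clustering sketch: place the strongest rule at a random position,
    then lay out the remaining rules, strongest first, alternately to the
    right and left, so ring distance to the anchor grows as strength falls."""
    n = len(rules)
    order = sorted(range(n), key=lambda k: strengths[k], reverse=True)
    anchor = rng.randrange(n)
    layout = [None] * n
    layout[anchor] = rules[order[0]]
    offsets = []  # 1, -1, 2, -2, ... around the ring
    d = 1
    while len(offsets) < n - 1:
        offsets.append(d)
        if len(offsets) < n - 1:
            offsets.append(-d)
        d += 1
    for rule_idx, off in zip(order[1:], offsets):
        layout[(anchor + off) % n] = rules[rule_idx]
    return layout
```

Since crossover points fall on the ring, rules near the anchor (the strong, presumably co-adapted ones) are less likely to be separated by a crossover segment boundary, which is the stated intent of the operator.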
Clustering is performed on all structures after the selection phase and before crossover; if a given rule set is assigned multiple offspring, each offspring has a random initial position chosen for clustering. Note that the combination of clustering and crossover does not generally produce an offspring that contains all the high strength rules in each parent. In fact, experience has shown that it is usually counterproductive to impose such deterministic heuristics on genetic search. Since the strength obtained by a given rule is a function of its context in the rule set, it is possible for a group of rules to achieve low strength in one rule set and higher strength in another rule set. This is just the kind of testing of building blocks at which the GA excels. Of course, the efficacy of the clustering heuristic depends on the assumption that co-adapted rules tend to have similar levels of strength. This is true in at least some interesting cases. For example, consider a chain of rules, e.g., R1, R2, ..., Rn, in which Ri always leads to the firing of Ri+1; that is, assume that Ri is Si -> Oi such that oi(si) is in Si+1 for each si in Si, oi in Oi, and 1 <= i < n.

In the {0, 1, #} paradigm, a condition such as "x > 200" may require multiple rule firings and internal memory in order to be correctly evaluated. A favorite cognitive motivation for preferring pattern matching rather than boolean expressions is the feeling that "partial matching" is one of the powerful mechanisms that humans use to deal with the enormous variety of everyday life. The observation is that we are seldom in precisely the same situation twice, but manage to function reasonably well by noting its similarity to previous experience. This has led to some interesting discussions as to how "similarity" can be captured computationally in a natural and efficient way. Holland and other advocates of the {0, 1, #} paradigm argue that this is precisely the role that the "#" plays as patterns evolve to their appropriate level of generality.
Booker and others have felt that requiring perfect matches even with the {0, 1, #} pattern language is still too strong and rigid a requirement, particularly as the length of the left hand side pattern increases. Rather than returning simply success or failure, they feel that the pattern matcher should return a "match score" indicating how close the pattern came to matching. An important issue here which needs to be better understood is how one computes match scores in a reasonably general, but computationally efficient manner. The interested reader can see [Booker85] for more details. In any case we need to understand better what is involved in choosing a syntax and semantics for the left hand side of rules which balances the need for simplicity of representation and the need for expressive power. In my view this is one of the most difficult and open issues in using GAs to search program spaces, and the key to successful applications.

6. Payoff Functions

In addition to choosing an appropriate representation language, careful thought must be given to the characteristics of the payoff function used to provide feedback regarding the performance of task programs. Strategies for designing effective payoff functions tend to differ somewhat depending on whether one is working with classifier systems or using the Pitt approach to evolving useful task programs. In classifier systems the bucket brigade mechanism stands ready to distribute payoff to those rules which are deemed responsible for achieving that payoff. Because payoff is the currency of the bucket brigade economy, it is important to design a feedback function which provides a relatively steady flow of payoff rather than one in which there are long "dry spells". Wilson's "animat" environment is an excellent example of this style of payoff [Wilson85].
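The bucket brigade's payoff flow mentioned above can be sketched in miniature. This is a generic illustration, not the text's implementation: the bid ratio is a placeholder parameter, and only a single linear firing chain is modeled.

```python
def bucket_brigade_step(chain_strengths, external_payoff, bid_ratio=0.1):
    """Minimal sketch of bucket-brigade credit flow along one firing
    chain: each rule pays a bid to its predecessor, and the last rule
    in the chain collects the external payoff from the environment."""
    s = list(chain_strengths)
    for i in range(len(s)):
        bid = bid_ratio * s[i]
        s[i] -= bid            # the rule pays its bid
        if i > 0:
            s[i - 1] += bid    # its predecessor collects the bid
    s[-1] += external_payoff   # the environment pays the final rule
    return s
```

Repeated application propagates external payoff backwards through the chain, which is why long "dry spells" in the feedback function starve early-chain rules of strength.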
The situation is somewhat different in the Pitt approach in that the usual view of evaluation consists of injecting the program defined by a particular individual into a task processor and evaluating how well that program as a whole performed. This view can lead to some interesting considerations such as whether to reward programs which perform tasks as well as others, but use less space (rules) or time (rule firings). Smith [Smith80] found it necessary to break up the payoff function into two components: a task-specific evaluation and a task-independent measure of the program itself. Although these two components are usually combined into a single payoff, recent work by Schaffer [Schaffer85] suggests that it might be more effective to use a vector-valued payoff function in situations such as this. To my knowledge no one has explored this possibility. However, one of the more provocative papers in this area at this conference suggests that there is an opportunity to have "the best of both worlds" with a multilevel credit assignment strategy which assigns payoff to both rule sets as well as individual rules [Grefenstette87]. This is an interesting idea which, I suspect, will generate a good deal of discussion and merits further attention.

7. Selecting Genetic Operators

It has been pointed out many times that there is nothing sacred about the traditional operators defined and analyzed by Holland. What is important is that we have criteria set by the schema theorems that such operators should meet. It is particularly tempting when using GAs to search program spaces to introduce new operators to deal with the complexity of the representations. One need not feel reluctant or apologetic about doing so. However, if changes are made to existing operators or new ones are introduced, it is important to verify that they aren't overly disruptive of the process of distribution of trials according to payoff and that they encourage the formation of building blocks.
Without these basic properties, one is bound to be disappointed in the performance of GAs in searching program spaces.

8. Conclusion

So where does all this leave us? I think that it is quite clear that, as designers, if we can keep down the complexity of program spaces by using parameterized procedures or data structures which are easy to represent and manipulate with GAs, we have a much better chance for a successful GA application. If, on the other hand, the situation calls for evolving task programs at a more fundamental level, it seems equally clear that production system languages are currently the best choice for GA applications. Whether one chooses the Pitt or the Michigan approach is still a matter of individual preference. Perhaps by the time we get together again no such choice will be necessary.

References

[Booker82] Booker, L. B., "Intelligent Behavior as an Adaptation to the Task Environment", Doctoral Thesis, CCS Department, University of Michigan, 1982.
[Booker85] Booker, L. B., "Improving the Performance of Genetic Algorithms in Classifier Systems", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
[Buchanan78] Buchanan, B., Mitchell, T. M., "Model-Directed Learning of Production Rules", in Pattern-Directed Inference Systems, eds. Waterman and Hayes-Roth, Academic Press, 1978.
[Davis85] Davis, L. D., "Job Shop Scheduling Using Genetic Algorithms", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
[DeJong85] De Jong, K., "Genetic Algorithms: a 10 Year Perspective", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
[Fujiko87] Fujiki, C. and Dickinson, J., "Using the Genetic Algorithm to Generate Lisp Code to Solve the Prisoner's Dilemma", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1987.
[Goldberg85] Goldberg, D. and Lingle, R., "Alleles, Loci, and the Traveling Salesman Problem", Proc.
Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
[Grefenstette85] Grefenstette, J. et al., "Genetic Algorithms for the Traveling Salesman Problem", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
[Grefenstette87] Grefenstette, J., "Multilevel Credit Assignment in a Genetic Learning System", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1987.
[Hedrick76] Hedrick, C. L., "Learning Production Systems from Examples", Artificial Intelligence, Vol. 7, 1976.
[Holland75] Holland, J. H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
[Holland78] Holland, J. H., Reitman, J., "Cognitive Systems Based on Adaptive Algorithms", in Pattern-Directed Inference Systems, eds. Waterman and Hayes-Roth, Academic Press, 1978.
[Michalski83] Michalski, R., "A Theory and Methodology of Inductive Learning", in Machine Learning: An Artificial Intelligence Approach, eds. Michalski, Carbonell, and Mitchell, Tioga Publishing, 1983.
[Newell77] Newell, A., "Knowledge Representation Aspects of Production Systems", Proc. 5th IJCAI, 1977.
[Schaffer85] Schaffer, D., "Multiple Objective Optimization with Vector Evaluated Genetic Algorithms", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
[Smith80] Smith, S. F., "A Learning System Based on Genetic Adaptive Algorithms", Doctoral Thesis, Department of Computer Science, University of Pittsburgh, 1980.
[Smith83] Smith, S. F., "Flexible Learning of Problem Solving Heuristics Through Adaptive Search", Proc. 8th IJCAI, August 1983.
[Wilson85] Wilson, S., "Knowledge Growth in an Artificial Animal", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
A Genetic System for Learning Models of Consumer Choice

David Perry Greene
Stephen F. Smith
Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

Consumer choice modeling can be viewed as a classification problem where the 'model' of interest is the rule that an individual uses to distinguish acceptable from unacceptable products in a given purchase situation. The search space of possible rules can be characterized as large, multimodal and noisy. Traditional statistical methods for generating consumer choice models are limited by the assumptions and representation they impose. A recent symbolic induction approach attempted to address these limitations but fell short in managing the complexity of the search space. It is hypothesized that probabilistic genetic search can overcome this complexity while retaining the advantages of a symbolic rule representation. This paper proposes a genetic algorithm (GA) based model generator capable of addressing these problems, describes its implementation, and compares it to alternative modeling techniques on a simulated consumer decision problem.

0. Introduction

Gaining insight into how and why a consumer chooses whether or not to buy in a specific purchase situation is of obvious interest from a marketing standpoint. An ability to identify the relevant features of a purchase situation, the relative order of importance of these features to the consumer, and the (potentially) complex relationships the consumer posits between them can provide significant leverage in both predicting product performance and focusing promotional activities. Within the marketing literature, this problem of understanding consumer behavior is referred to as consumer choice modeling.
The problem can be generally stated as follows: given a collection of features that describe a specific purchase situation, and a set of examples of specific consumer choice decisions relative to this purchase situation, infer a decision rule that accounts for the consumer's behavior. Consider the situation of renting an apartment. In this case, the modeler is presented with a set of previously made rent decisions, describing each considered apartment in terms of a predefined collection of apartment features (e.g., price, distance from work, reputation of the landlord, etc.). Assuming that the consumer chooses to rent or not rent a given apartment according to some decision rule defined over these features, the goal is to construct a model of the consumer's decision rule from these examples of past choice behavior. Ideally, a model of consumer choice should demonstrate adequacy over three broad performance dimensions: 1) its ability to correctly forecast future choice decisions (or predictive validity), 2) its ability to correctly order the consumer's value for different product features (or diagnostic validity), and 3) its ability to provide insight into the consumer's decision strategy (or structural/intuitive validity). These performance dimensions provide a basis for contrasting alternative techniques for consumer choice modeling. Traditionally, marketing researchers have relied on multiattribute statistical methods as a means of modeling consumer choice [Wilk73]. Such methods, however, are not without shortcomings. They provide limited intuition into the structure of the individual's choice rule, and the assumptions that must be made concerning the structure of the solution space can sometimes result in serious predictive errors [Curr81][John85]. To overcome these limitations, a recent paper [Curr86] adopted an AI perspective of the problem, using a production rule representation and symbolic induction. While the use of production rules provided structural intuition, the
representation created search difficulties which impaired the other validity measures. Genetic algorithms (GAs), because of their unique search mechanism, offer the potential of retaining the benefits of a production rule representation while overcoming the complexities of the problem space. This paper investigates the hypothesis that GAs will be able to predict decisions as well as statistical methods yet offer the representational superiority of the symbolic approach. The remainder of this paper is divided into five sections. The first section examines the strengths and weaknesses of the conventional statistical methods. The second section considers consumer choice modeling as a symbolic learning task and describes the approach taken in [Curr86]. In the third section, an alternative approach using a genetic algorithm is developed to address the limitations in the existing methods. The next section describes a simulated problem to compare the three techniques using the three measures of validity discussed above. The fifth section presents the results followed by a brief discussion.

1. Traditional Statistical Methods

In marketing, multiattribute preference models, such as discriminant analysis and regression models, have been the traditional classification technique [Curr84]. The majority of these models represent a consumer's choices as a combination of feature weights, such as the linear regression model [Wilk73]. A major assumption in such approaches is that consumers trade off relevant attributes of a product when forming overall evaluations. These are termed compensatory decision rules, because good features can compensate for the bad. Research, however, indicates the existence of non-compensatory choice strategies which do not accommodate tradeoffs [Payn76][Pinh70]. Therefore one concern with these statistical models is that the presumption of compensating behavior is incorrect.
Despite acknowledged robustness [Dawe74], under conditions likely to occur in a competitive market the use of compensatory models may be inappropriate, and misapplication can have severe consequences [John85][Curr81]. A second concern with linear models is their limited ability to provide behavioral insights. Knowing how a person's choice strategy is constructed could provide a practical benefit by identifying targets for special emphasis or in modifying behavior [Wrig73]. Unfortunately, despite strong predictive performance, statistical methods are limited in two ways by the representational assumptions they impose. The first problem is that non-compensatory behavior is not easily identified from the feature weights. Second, the numerical coefficients offer no insight into the configural relations among the attributes.

2. Consumer Choice Modeling as Symbolic Learning

2.1 Symbolic Representation of Choice Strategies

Production systems offer a representation which can capture the configural nature of a choice strategy in an intuitive manner. Following the terminology of [Mich80], we can represent a consumer's choice rule as a collection of elementary 'terms', where each term consists of a conjunction of attribute-value (A-V) pairs called 'selectors'. Preference for a specific apartment, for example, could be represented in the following form:

[rent = $450][commute = 2 miles][heat_incl = true]    (1)

The logical union of several terms forms a disjunctive expression, leading to a disjunctive normal form (DNF) representation of more complex choice strategies. Such strategies can be classified as "compensatory" in nature since the selectors in one term can compensate for selectors in another. An example of a compensatory rule in DNF for selecting an apartment would be:

IF [rent < $400][commute < 2 miles][heat_incl = true]    (2)
or [rent < $350][commute < 3 miles][laundry_incl = true]
THEN purchase

In contrast, a choice rule consisting of a single term (i.e., no or's) would represent
a "conjunctive" choice strategy. In the consumer behavior literature the distinction between conjunctive and compensatory strategies has important implications; therefore the ability of the learning system to distinguish these structures will be a critical part of the evaluation (for this paper, structure and strategy will be used interchangeably). While behavioral researchers acknowledge the superiority of production systems as a representation [Klah87], generating high validity rules is a difficult problem. As [Vali85] indicates, learning disjunctions of conjunctions is computationally complex for a reasonable number of features and becomes NP-hard in certain circumstances. If we assume only 40 binary selectors, for example, the size of the description space for a single term is 2^40. However, an individual is quite likely to have rules which use multiple terms ("term 1" or "term 2" or ... "term n"). For simplicity, if we restrict this compensatory decision rule to only, say, 5 terms, the number of possible rules would be 2^40 x 2^40 x ... x 2^40 = 2^200, or approximately 10^60 possible combinations [1].

2.2 Concept Learning System

In a recent paper, Currim, Meyer, and Le [Curr86] identified the limitations of linear statistical models discussed in Section 1, and proposed a solution to the symbolic rule generation problem. Their approach, called the Concept Learning System (CLS), is an application of Quinlan's ID3 system [Quin84] (named after the earlier CLS system upon which ID3 is based [Hunt66]).
[1] The possible combinations = (2^k)^t, where k = the number of selectors (k in {1, 2, ..., limits of relevance}) and t = the number of terms in a rule (t in {1, 2, ..., limits of cognition}), although the number of "legal" rules is a function of the independence among selectors.

The ID3 inductive algorithm constructs a consumer choice rule by incrementally building a classification tree that covers the training set, repeatedly adding additional selectors to underspecified branches on the basis of their estimated discriminatory power. Thus, given a training set of specific consumer judgements, with the attributes over which the judgements were rendered encoded as binary (true or false) selectors (e.g., rent < $350, distance-from-work < 2 miles), CLS searches by evaluating each selector for its ability to discriminate among the consumer's decisions. The selector which provides the best separation at each point in the search, as determined by information-theoretic measures aimed at minimizing the expected number of tests to classify an instance, becomes a new branch of the decision tree. This process continues iteratively until all training examples are correctly partitioned. The choice rule generated, then, consists of all the selectors used to build the tree (as in expression (2)). The use of a production system formalism is an important advantage. Because of their apparent consistency with human thinking [Newe72][Klah87], production system representations are easily understood by users. However, the use of production rules also exposes the complexity of DNF induction problems discussed earlier. The major concern in the case of CLS is that the decision tree building procedure is stepwise optimal but not necessarily globally optimal [Bres84], which means constructing a rule piece by piece will be ineffective if it is critical combinations of pieces which provide superior performance. A second concern is the sensitivity of the CLS inductive strategy to errors (or "noise") in the
consumer choice data (which is not an uncommon occurrence in practice). As attempts are made to classify noisy training examples, the trees generated by CLS become complex and inaccurate [2]. Therefore, when conditions are less than ideal, the predictive quality of the rule may fare poorly.

[2] In [Quin86] some extensions to the CLS inductive strategy to cope with noise are proposed. These extensions were not implemented and will have to be evaluated in a later study.

3. ADAM: A Genetic Algorithm Based Model Generator

Genetic algorithms (GAs) are a class of search techniques which have demonstrated the capabilities to address such problems in a number of rule induction tasks (e.g., [Smit83][Gold85][Scha84]). Accordingly, it was hypothesized that a GA would exhibit similar performance in the consumer choice modeling domain. To validate this claim, a GA-based consumer choice rule generator called ADAM (A Decision Adaptation Model) was implemented and evaluated. In the following subsections, we describe the principal components of this system. We assume familiarity with the basics of GAs and refer uninformed readers to [Holl75].

3.1 Representation

To simplify the comparative analysis, the binary selector coding scheme employed by CLS in formulating rules was also adopted within ADAM. Thus, a given term in a candidate rule is expressed as a string of length n over the alphabet {0, 1, #} (# representing a don't care position), where n equals the number of dichotomous selectors defined and each bit position corresponds to a particular selector. A complete choice rule is then expressed as the concatenation of one or more terms. The number of terms in a given rule can vary, and there is no significance attached to the order in which terms appear. Using implicit disjunctive normal form, the rule string 1###1# 1##1#1 011### would be interpreted as IF ((1) and (5)) or ((1) and (4) and (6)) or ((not 1) and (2) and (3)) THEN choose.
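The implicit-DNF interpretation spelled out above can be sketched directly. The function names are ours; the three term strings simply transcribe the interpretation given in the text ("((1) and (5)) or ((1) and (4) and (6)) or ((not 1) and (2) and (3))").

```python
def term_matches(term, features):
    """True if a {0,1,#} term matches a binary feature string
    ('#' is a don't-care position)."""
    return all(t == '#' or t == f for t, f in zip(term, features))

def rule_fires(terms, features):
    """An implicit-DNF rule fires if any one of its terms matches."""
    return any(term_matches(t, features) for t in terms)

# The three terms of the example rule from the text:
rule = ['1###1#', '1##1#1', '011###']
```

For instance, the feature vector 100010 satisfies the first term (positions 1 and 5 are 1), so the rule fires, while 000000 satisfies no term.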
If the conditions of any term are matched then the rule is active (fired), indicating an acceptable product. For purposes of the comparative analysis, we will assume all features describing the purchase situation to be binary attributes. Thus, there is always a one to one correspondence between purchase features and the selectors over which the term representation is defined. It should be noted, nonetheless, that the term representation is not a good one for the GA in the general case. Specifically, the representation implies that attributes which range over a continuum of values (e.g., the cost of renting an apartment) must be perceived through a set of binary selectors (e.g., [rent <= $300], [rent <= $350], [rent <= $400], etc.). This yields a term representation space that includes a large number of illegal structures. However, this problem is easily resolved by simply using several string positions to recognize such attributes' values (as is routinely done in classifier systems) [3]. The need to explicitly define a set of possible selectors is specific to the CLS inductive strategy. Assuming the above described representation of a consumer choice rule as a disjunction of one or more conjunctive terms, there are still alternatives as to the nature of the population of structures to be manipulated by the GA. On one hand, we might consider a single term to be an atomic rule of sorts, present the GA with a population of such atomic rules, and interpret the entire population as the consumer's choice rule. Adopting this approach, however, requires a means of assigning credit to the individual terms that contribute to the overall performance of the choice rule. We can bypass this problem of term credit assignment (as was done in [Smit80]) by encoding complete choice rules as individual structures in the GA population, and including discovery of term interactions as part of the GA search. In this case, the GA will manipulate variable length rule strings of the form IF <term> or ... or <term>
THEN Choose, and the population of structures represents a collection of competing choice rules. This latter approach is the approach adopted in ADAM.

3.2 Evaluation Function

The evaluation function or "critic" in ADAM contains three components. The dominant component, "prediction", measures how frequently a rule correctly predicts choice. The other two components, "specificity" and "term-count", relate to the structure of the rule. The three measures are weighted and summed to yield an overall fitness measure. Prediction is given a dominant weighting reflecting its importance in a model; however, the weights for the structural components have a critical role in determining bias, or direction of the search, toward the "true" choice strategy (e.g., conjunctive or compensatory). On the whole, if two rules perform equally, the more general is preferred. Preliminary investigation showed that the GA would direct the initial search toward the appropriate structure; however, greater discrimination in the later stages was necessary. Some means for determining the most likely strategy was necessary to adjust the bias during the search.

[3] Actually, we can take advantage of the fact that with respect to many non-binary attributes we are interested in ranges of values, and define a better representation. In such situations, we have experimented with a representation that includes a <, >, or = prefix to patterns defined over such attributes, using a binary number interpretation of the attribute pattern in the case of < or > prefixes.

3.2.1 Strategy Structures

Using Bettman's [Bett79] description of conjunctive and compensatory strategies for a 3 attribute example, the following structures would be expected. For conjunctive rules there would be only one term with a specific value defined at each relevant attribute:

conjunctive: IF <A=x and B=y and C=z> THEN Choose

For a linear compensatory rule the structure is a disjunction of conjunctive terms and should have many terms with fewer defined positions (low
specificity, high term-count):

compensatory: IF <A=x and B=y> or <A=j and C=k> THEN Choose

Based on this, "specificity" measures the number of defined (non-#) positions in a rule string and "term-count" measures the number of terms. Conjunctive rules would have high specificity and low term-count while compensatory rules would have the reverse.

3.2.2 Shifting Bias

Since one purpose of the rule is to infer strategy from rule structure, an appropriate structural bias cannot be known a priori. Unlike traditional methods which may erroneously presume a specific structure (i.e., linear compensatory), ADAM should adapt its bias as the search progresses. Because of the GA's efficiency at initially locating good structures, ADAM can use this information to adjust the evaluation function (bias) towards conjunctive or compensatory structures. That is, "fit" rules tend to acquire a specific form, and the weights associated with specificity and term-count can be adjusted to favor search among rules with that form. In the later stages of the search, once a specific form becomes predominant, this bias becomes useful in discriminating among otherwise equivalent rules. Eventually this bias adjustment could be implemented as a smooth function [Berl79] based on the population average size weighted by performance; for now, the shift occurs when the size of the best structure in the population falls below the average size of the initial population (~3 terms).

3.2.3 Measuring Prediction

Predictive validity, or the ability to forecast choice, can be measured with a simple contingency table (Figure 1) recording when a rule agreed or disagreed with the training set choices.

Basic Contingency Table ("fired" implies the rule conditions were met and so the item was acceptable)

                 fired    not fired
  chosen           a          b
  not chosen       c          d

                  Fig. 1

This figure is representative of a hypothesis test, with cells b and c considered errors of omission and commission respectively.
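The three-component critic can be sketched as below. This is an illustrative reading of the description, not ADAM's actual code: the weight values are placeholders, and the sign convention (subtracting the structural components so that more general rules score higher on ties) is our assumption.

```python
def fitness(rule_terms, examples, w_predict=1.0, w_spec=0.01, w_terms=0.01):
    """Sketch of ADAM's critic: dominant prediction component plus
    small structural components. `examples` is a list of
    (feature_string, chosen) pairs; terms are {0,1,#} strings."""
    a = b = c = d = 0
    for features, chosen in examples:
        fired = any(all(t == '#' or t == f for t, f in zip(term, features))
                    for term in rule_terms)
        if chosen and fired:
            a += 1          # hit
        elif chosen and not fired:
            b += 1          # error of omission
        elif fired:
            c += 1          # error of commission
        else:
            d += 1          # correct rejection
    prediction = (a + d) / len(examples)
    specificity = sum(ch != '#' for term in rule_terms for ch in term)
    term_count = len(rule_terms)
    # generality preferred on ties: structural terms enter negatively here
    return w_predict * prediction - w_spec * specificity - w_terms * term_count
```

Shifting bias, as described in 3.2.2, would then amount to adjusting `w_spec` and `w_terms` during the run as the predominant form emerges.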
While differential penalties might be applied based on the severity (cost) of each type of error, for this paper they are considered equivalent.

3.2.4 Folding the Sample

One final characteristic influencing the evaluation is the segmentation of the training sample into a smaller training set. At the end of a generation a new training set was drawn, starting with the element about 2/3 of the way from the previous first element. This caused a stagger, or overlap of about 1/3, for each training set. The primary reason was to maintain some of the adaptive flavor of the search by providing a changing environment (the overlap of the sample allowed the change to be more gradual). One effect was to maintain diversity among the genes to prevent lost alleles. A second reason was that it requires fewer computations per generation even if the training sample size were increased. Interestingly enough, this scheme of staggering the training examples considered during evaluation appeared to lead to equal performance compared with repeated consideration of the entire training set in preliminary experiments.

3.3 Rule Generating Parameters

Based on the results of earlier studies [DeJo75][Gref86], a population of 50 candidate choice rules is maintained by the GA in conducting its search. The initial set of rules is generated randomly with the number of terms ranging between 1 and 6, and a 33% probability of a 1, 0 or # at each individual string position.
Although typically a larger percentage of #'s is used, the potential of highly specific "conjunctive" rules suggested that too general a seeding would slow the search progress. After the evaluation function assigns a fitness to each rule, the scores are normalized, and genetic operators are applied to produce a new population of candidate rules. An "elitist" strategy, as suggested by De Jong (1975), then inserts the best rule from the previous generation. The probability of crossover is set at 0.6, as originally suggested by [DeJo75]. Crossover operates at two levels (between terms and between selectors) as outlined by [Smit80], with a modification to alignment point selection to include a zero level crossover (recall that the structures being manipulated are variable length). This has the effect of permitting single terms to occur and be included in the crossover process [4]. Finally, the mutation rate was set at 0.001. ADAM continues to search until either it finds a rule which perfectly classifies the set of choices or it reaches a prespecified number of generations. After it stops, it then selects the rule with the best score. The output of this search process is a single rule used to characterize an individual's choice strategy.

4. Methodology

A simulation was constructed to perform a comparative analysis of ADAM, CLS, and a linear Logit model. The following four factors were investigated for their potential effect on performance:

1- type of "true" choice strategy (conjunctive, compensatory or mixed)
2- number of attributes on which the rule is based (3, 6, 9)
3- level of noise in representing the choice (0%, 10%, 20%)
4- sample size used for estimation (20, 100, 200), with half of each sample used as a holdout.

To provide the simulation data, a table of randomly generated A-V terms was created. Each alternative consists of three, six, or nine attributes using a random number between 0 and 99.
[4] As suggested by Smith [Smit80], the level of crossover seems to yield better early performance when set high but better later performance when set low. In preliminary experiments, staggering the crossover level operator based on the number of generations elapsed appeared to yield steadier increases in overall performance.

A coding function representing a decision maker's strategy (conjunctive, compensatory or mixed) was applied, resulting in a set of decisions marked as positive or negative examples. This could be thought of as a sample of products characterized by 3, 6 or 9 attributes plus an indicator of whether or not the consumer thought it should be chosen. The choice indicator was generated using the following three choice functions:

conjunctive:  choice = 1 if (x1 >= t1) and (x2 >= t2) and ... and (xn >= tn); 0 otherwise

compensatory: choice = 1 if (x1 + x2 + ... + xn)/n > t1; 0 otherwise

mixed:        choice = 1 if (x1 > t1) and ((x2 + x3 + ... + xn)/(n-1) > t2); 0 otherwise

where:
* n = number of attributes in a given experimental condition (n in {3, 6, 9})
* x_i = a given attribute (i in {1, 2, ...
}) * 41, = thresholds, each t will be selected a pnori for each of 2 conditions to generate an approximately cqual split between the oumber of “chosen” and "nonchosen” alternatives Noise was introduced into cach coded sct of examples by changing the decision indicator of any alternative in the set with 4 probability of 0%, [0% or 20% ‘This represents a severe form of misclassification since an alternative which contained acceplable atirtbute values 1s now indicated unacceptable and vice versa Using combinations of choice strategy, number of attributes, and noise level, nine selection models were created representing a 4x 3.x 3 parhal faciosial Fach of the nine models is apphicd for three different sample sizes yielding 27 (model/sample sizes) Tach of these 27 conditions is repeated § times yielding 145 data sets Half of each data set was used as a holdout sample meaning it represented a set of decisions not previously seen and therefore usable for evaluation ADAM, CES and the simulation were all Programmed in PASCAL and run on an IBM-PCS! The Logit resulls were generated using Hottrans on a VAX computer and with RATS on an FAM-PC 5. 
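The three choice functions and the noise step can be sketched directly. This is our own Python reconstruction (the simulation itself was written in PASCAL); thresholds are passed in as parameters, and the (attributes, choice) pair representation is an assumption.

```python
import random

def conjunctive(x, t):
    """1 only if every attribute clears its own threshold."""
    return 1 if all(xi >= ti for xi, ti in zip(x, t)) else 0

def compensatory(x, t_bar):
    """1 if the attribute average clears one overall threshold,
    so a strong attribute can compensate for a weak one."""
    return 1 if sum(x) / len(x) > t_bar else 0

def mixed(x, t1, t_bar):
    """Conjunctive screen on the first attribute, then a
    compensatory average over the remaining ones."""
    if x[0] < t1:
        return 0
    rest = x[1:]
    return 1 if sum(rest) / len(rest) > t_bar else 0

def add_noise(examples, p, rng=random):
    """Flip each 0/1 choice indicator with probability p (0.0, 0.1, or 0.2):
    the severe misclassification noise described above."""
    return [(x, (1 - c) if rng.random() < p else c) for x, c in examples]
```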
5. Results

The objective of ADAM was to simulate performance under a number of conditions and to determine whether it offered any improvements with respect to the issues described in earlier sections. Performance was evaluated on the basis of the previously discussed measures of a good model: predictive, diagnostic, and structural validity. These results are summarized below.

5.1 Prediction

For the simulation, the ability to accurately predict is measured by how well the model's rule predicts the hold-out sample. The comparative predictive levels of the models, averaged over the 5 repetitions, are presented in Table 1.

[5] The CLS code was supplied by Rob Meyer at the University of California, Los Angeles.

Table 1. Effectiveness of ADAM, CLS and Logit
(percentage of holdout cases correctly predicted)

                          sample = 10        sample = 50        sample = 100
  Strat   Attribs Noise   GA   CLS  Logit    GA   CLS  Logit    GA   CLS  Logit
  1 Conj     3      0%   100    90    92    100   100   100    100   100   100
  2 Conj     6     10%    72    62    67     77    67    75     76    58    72
  3 Conj     9     20%    86    73    66     80    76    80     78    66    76
  4 Comp     3     10%    76    72    73     79    65    76     83    69    80
  5 Comp     6      0%    75    59    84     82    68    84     77    66    82
  6 Comp     9     20%    62    47    71     66    64    70     66    56    71
  7 Mixd     3     20%    78    68    73     76    69    73     75    66    78
  8 Mixd     6     10%    88    47    84     88    80    86     88    80    85
  9 Mixd     9      0%    72    74    97     92    90    95     87    88    94

It is evident that ADAM, using a genetic algorithm, generated rules with equal or superior predictive ability to those of CLS across almost all the experimental conditions (the exception being 1 and 2 points difference for the mixed model with 9 attributes). In comparing ADAM to the Logit model, the major impression is how comparable and consistent their performance was. Overall, ADAM predicted at 80.7% accuracy vs. Logit at 79.9%. As was expected, the performance varied across conditions. The results are shown in Table 2.

Table 2. Performance Across Conditions

                      STRATEGY        NOISE         ATTRIBUTES      SAMPLE
  MODEL   Overall   cnj  cmp  mxd   0%  10%  20%    3   6   9     10  50  100
  ADAM     80.7      85   74   83   88   82   72   85  79  77     79  82   81
  Logit    79.9      80   77   83   89   79   72   83  79  78     75  82   82
As seen from Table 2, ADAM appears to offer a slight edge with respect to conjunctive rules and small sample sizes, areas where traditional models are expected to be weak. However, none of the performance differences were found statistically significant. A comparison of regression models using dummy variables primarily examined main effects. The results indicate significant differences (p < .0001) for all main effects (model type, number of attributes, % noise, sample size), consistent with expectations. That is, increasing noise and larger attribute sets had detrimental effects on predictive accuracy, while larger samples had a positive effect. The performance of ADAM showed significant improvement over CLS. However, an F-test between regression models did not indicate a significant difference over Logit (p ≈ .053), even with first-order interactions included.

One explanation for the surprising strength of the Logit model on conjunctive rules was the nature of the simulation environment. As several researchers have noted [Dawe74][Cury81][John85], the use of a uniform distribution in generating simulation attributes provides a best-case environment for the averaging of a statistical model [6]. However, such environments are likely to be inconsistent with a realistic market situation, where the true conditions would prove detrimental to the linear model.

[6] To establish that the simulation replicated Currim et al.'s earlier study, a test for differences in the pairwise results was found non-significant (126 df, 0.241); the null hypothesis that the means were equal could not be rejected. This is encouraging since it addresses the objective of extending their efforts.
Two encouraging findings were the low variance of ADAM's results across repetitions of the trials and the stability of ADAM across differences in both strategy and noise, supporting the expectations for genetic search. Note that the falloff in prediction is consistent with the increase in the noise level. Several additional runs using noise as the only experimental variable support this finding [7].

5.2 Diagnosis

When consumers make choices they frequently do not place the same weight on all the features, but instead indicate that certain attributes are more important than others. Diagnostic validity measures the model's ability to recover the importance of an attribute. In a regression model like Logit, attribute importance is represented by the beta coefficients or parameter weights. A comparable parameter is ADAM's relative frequency measure for each attribute. A critical question is whether production rules can provide diagnostic validity.

Table 3. Diagnostic Correlation between ADAM and Logit
Correlation between relative frequency of attributes in ADAM and beta coefficients from Logit.

                       Sample Set Size
  Attrib (cases)     10       50       100
  X1 (135)         0.45a    0.80a    0.78a
  X2 (135)         0.50a    0.74a    0.79a
  X3 (135)         0.41b    0.75a    0.75a
  X4 (90)          0.64a    0.65a    0.76a
  X5 (90)          0.68a    0.62a    0.65a
  X6 (90)          0.65a    0.58a    0.65a
  X7 (45)          0.43     0.85a    0.61c
  X8 (45)          0.47     0.48     0.42
  X9 (45)          0.53     0.56     0.72b

  Significance: a p < .0001, b p < .001, c p < .01, d p < .05

As evidenced in Table 3, not only do the two algorithms generate equivalent predictions, but they also appear to agree as to the relative importance of attributes, even across sample sizes.
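The diagnostic comparison in Table 3 is a plain correlation between two vectors of per-attribute numbers. The paper does not show its computation; a minimal sketch of the standard Pearson correlation it implies:

```python
def pearson(xs, ys):
    """Pearson correlation between, e.g., ADAM's per-attribute relative
    frequencies (xs) and Logit's beta coefficients (ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```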
The correlations appear to show increasing convergence as sample size increases, although this trend was not significant (p ≈ .01). This is a very interesting result, since it lends support to the use of production rule models in providing useful quantitative measures as well as intuition.

5.3 Structure

Structural validity indicates the model's ability to recover the form of a given choice process, ultimately to provide insight into the individual's behavior. Doing so is a two-step procedure. First, the model must find a rule that predicts well. Given that rule, the second step is to infer the underlying choice strategy based on the rule's structure. The structural elements are features such as the number of terms, the number of "don't-care" symbols (#), the distribution of usage frequency across terms, and so on. Classification is determined by specifying a mapping from characteristics consistent with the known choice strategy to the features (structural elements) of a rule. The ability to provide structural intuition is a key advantage of ADAM over Logit. A critical issue is whether the true structure can be recovered when the data is noisy (a debilitating condition for CLS).

[7] The slightly lower performance on compensatory rules may be attributable to the loss of information caused by encoding a 100-value random number as a dichotomous variable.

Table 4. Structural Classification for Model by Noise

  Noise:           0%     10%    20%
  Model:
  conjunctive    100%     87%    73%
  compensatory    87%     93%    67%
  mixed           60%     60%    20%
From the table it appears that the performance is encouragingly robust, generating misclassification within an acceptable range based on the given noise level.

5.4 Conclusion

Based on a simple simulation, the performance of ADAM was evaluated using the three validity measures. With respect to the research objectives and performance hypotheses, the following results were recorded. The GA provided superior performance to CLS across all measures, especially in those areas addressing the weaknesses of traditional models. This suggests genetic search may offer a suitable alternative. Further, the GA performed very well with respect to the traditional strengths of the Logit model, while providing the important feature of production system representation. One additional finding is that production rules can provide quantitative measures comparable to statistical coefficients. Accepting that the simulation represents a simplified situation, the results appear to provide strong support for the potential of genetic search as a method for modeling consumer choice.

6. Discussion

6.1 Review

Preference models provide intuition about how a consumer might behave and why he does so. Although the traditional tool has been the linear compensatory model, behavioral research has shown evidence of a number of situations where the use of such models is inappropriate and where misapplication can have severe consequences.
A proposed alternative, based on Quinlan's classification tree building procedure, provided the necessary production system formalism; however, the predictive and diagnostic performance was weak due to the complexity of the problem domain. To address the limits of earlier approaches, a classification system called ADAM was developed, utilizing a GA for search. A simplified choice simulation was used to evaluate GA performance and to provide a comparison to the earlier methods. The results of the simulation support the potential of genetic search and the use of ADAM for choice modeling.

6.2 Future Directions

A model which can provide performance equal to statistical models, with the intuitive advantages of production systems, offers a more versatile tool for marketing professionals. However, several issues need to be examined. With respect to the problem domain, it is important to look at increasing the number and type of attributes, as well as different distributions of attribute sets. With respect to the representation and its influence on the solvability of a problem, two issues can be identified. The first concerns experimentation with non-binary attribute pattern representations that better facilitate the characterization of numerical ranges of values. As was mentioned in passing in Section 3.1, we have begun investigating the use of pattern prefixes in this regard. The second issue concerns the information loss that results from any discretization of the possible values that a continuous variable may assume. We are exploring techniques for dynamically rescaling (i.e., expanding or contracting) attribute ranges when performance stagnates within a specific range of values. With respect to the evaluation function, the initial efforts at an adaptive bias using a shift in the evaluation function weights also appear promising. An important next step is applying an upgraded ADAM to actual consumer choice data.
Overall, the positive results suggest that a much more detailed investigation of a genetic system for modeling consumer choice strategies is warranted.

References

[Berl79] Berliner, H., "On the Construction of Evaluation Functions for Large Domains", Proceedings IJCAI-79, 1979.

[Bett79] Bettman, J.R., An Information Processing Theory of Consumer Choice, Addison-Wesley, 1979.

[Brei84] Breiman, L., Friedman, J., Olshen, R. and Stone, C., Classification and Regression Trees, Wadsworth, Inc., 1984.

[Curr84] Currim, I. and Sarin, R.K., "A Comparative Evaluation of Multiattribute Consumer Preference Models", Management Science, vol. 30, no. 5 (May), 1984, p. 543-561.

[Curr86] Currim, I., Meyer, R.J., and Le, N., "A Concept-Learning System for the Inference of Production Models of Consumer Choice", working paper no. 149, Center for Marketing Studies, UCLA, 1986.

[Cury81] Curry, D.J., Louviere, J.J. and Augustine, M.J., "On the Sensitivity of Brand-Choice Simulations to Attribute Importance Weights", Decision Sciences, vol. 12, 1981, p. 502-516.

[Dawe74] Dawes, R.M. and Corrigan, B., "Linear Models in Decision Making", Psychological Bulletin, vol. 81, no. 2, 1974, p. 95-106.

[DeJo75] De Jong, K.A., "Analysis of the Behavior of a Class of Genetic Adaptive Systems", PhD Thesis, Dept. of Computer and Communication Sciences, University of Michigan, 1975.

[Einh70] Einhorn, H.J., "The Use of Nonlinear, Noncompensatory Models in Decision Making", Psychological Bulletin, vol. 73, no. 3, 1970, p. 221-238.

[Gold83] Goldberg, D., "Computer Aided Gas Pipeline Operation Using Genetic Adaptive Systems", PhD Thesis, Dept. of Civil Engineering, University of Michigan, 1983.

[Gold85] Goldberg, D., "Dynamic Systems Control Using Rule Learning and Genetic Algorithms", Proceedings of the Ninth International Joint Conference on Artificial Intelligence (Los Angeles, California), Morgan Kaufmann, 1985, p. 588-592.

[Gref84] Grefenstette, J.J., "Optimization of Genetic Search Algorithms", Tech. Rep. CS-83-14, Vanderbilt University, 1984.

[Holl75] Holland, J.H.,
Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

[Holl86] Holland, J.H., "Escaping Brittleness: The Possibilities of General Purpose Learning Algorithms Applied to Parallel Rule-Based Systems", in Machine Learning: An Artificial Intelligence Approach, vol. II, R. Michalski, J. Carbonell, and T. Mitchell (Eds.), Morgan Kaufmann, 1986.

[Hunt66] Hunt, E.B., Marin, J. and Stone, P.J., Experiments in Induction, Academic Press, 1966.

[John85] Johnson, E.J., Meyer, R.J., and Ghose, S., "When Choice Models Fail: Compensatory Models in Efficient Sets", working paper, Graduate School of Industrial Administration, Carnegie Mellon Univ., 1985.

[Klah87] Klahr, D., Langley, P. and Neches, R., Production System Models of Learning and Development, MIT Press, 1987.

[Mich80] Michalski, R. and Chilausky, R.L., "Knowledge Acquisition by Encoding Expert Rules versus Computer Induction from Examples: A Case Study Involving Soybean Pathology", International Journal of Man-Machine Studies, vol. 12, 1980, p. 63-87.

[Newe72] Newell, A. and Simon, H., Human Problem Solving, Prentice Hall, 1972.

[Payn76] Payne, J.W., "Task Complexity and Contingent Processing in Decision Making: An Information Search and Protocol Analysis", Organizational Behavior and Human Performance, vol. 16, 1976, p. 366-387.

[Quin84] Quinlan, J.R., "Inductive Inference as a Tool for the Construction of High-Performance Programs", in Machine Learning: An Artificial Intelligence Approach, vol. I, R. Michalski, J. Carbonell, and T. Mitchell (Eds.), Tioga Press, 1983.

[Quin86] Quinlan, J.R., "The Effect of Noise on Concept Learning", in Machine Learning: An Artificial Intelligence Approach, vol. II, R. Michalski, J. Carbonell, and T. Mitchell (Eds.), Morgan Kaufmann, 1986.

[Scha85] Schaffer, J.D., "Learning Multiclass Pattern Discrimination", Proceedings of an International Conference on Genetic Algorithms and their Applications, Pittsburgh, PA, John Grefenstette (Ed.), 1985.

[Smit80] Smith, S.F.,
"A Learning System Based on Genetic Adaptive Algorithms", PhD Thesis, Dept. of Computer Science, University of Pittsburgh, December, 1980.

[Smit83] Smith, S.F., "Flexible Learning of Problem Solving Heuristics via Adaptive Search", Proceedings 8th International Joint Conference on AI, Karlsruhe, West Germany, August, 1983.

[Vali85] Valiant, L.G., "Learning Disjunctions of Conjunctions", Proceedings of the Ninth International Joint Conference on Artificial Intelligence (Los Angeles, California), Morgan Kaufmann, 1985, p. 560-566.

[Wilk73] Wilkie, W.L. and Pessemier, E., "Issues in Marketing's Use of Multi-Attribute Attitude Models", Journal of Marketing Research, vol. 10, November, 1973, p. 428-441.

[Wrig73] Wright, P., "Use of Consumer Judgement Models in Promotion Planning", Journal of Marketing, vol. 37, October, 1973, p. 27-33.

A STUDY OF PERMUTATION CROSSOVER OPERATORS ON THE TRAVELING SALESMAN PROBLEM

by I.M. Oliver*, D.J. Smith†, and J.R.C. Holland‡

* Texas Instruments Ltd (Bedford, UK)
† Texas Instruments Inc (Dallas, TX)
‡ Mullard Ltd (Malmesbury, UK)

Abstract

The application of Genetic Algorithms to problems which are not amenable to bit string representation and traditional crossover has been a growing area of interest. One approach has been to represent solutions by permutations of a list, and "permutation crossover" operators have been introduced to preserve legality of offspring. Three permutation crossovers are analyzed to characterize how they sample the o-schema space, and hence what type of problems they may be applicable to. Experiments performed on the Traveling Salesman Problem go some way to support the theoretical analysis.

Introduction

This paper is a study of three crossover operators designed to sample o-schemata [GoLi85]:

* the "Modified" crossover [Davi85] — a version of the Order crossover
* Goldberg and Lingle's Partially Mapped Crossover — PMX [GoLi85]
* a previously unreported crossover — the Cycle crossover.
We introduce a modified version of o-schemata and present an analysis for each of the crossovers. This is followed by experiments with the crossovers on the Euclidean Traveling Salesman Problem (TSP) [LLRS85]. The experimental results are tied back to the analytic predictions and are compared with other Genetic Algorithm (GA) approaches, a heuristic solution, a "Neural" Computation solution, and to the optimal solution.

The Crossovers

The Order Crossover

The order crossover proceeds as follows. First, two cut points are chosen at random. The section of the first parent between the first and second cut points is copied whole to the offspring. The remaining places are filled using elements not occurring in this crossover section. To do this we use the order in which the elements are found in the second parent after the second cut point:

    Parent 1:   h k c e f d | b l a | i g j
    Parent 2:   a b c d e f | g h i | j k l
    Offspring:  d e f g h i | b l a | j k c

The offspring sequence, j k c (return to the first position) d e f g h i, is the second parent starting at the second cut point with the elements of the crossover section of the first parent removed.

The absolute positions of some elements of the first parent are retained. Further, the relative positions of some elements of the second parent are also kept.

Davis's [Davi85] "Modified" crossover is exactly as above except that the first cut point is always at the beginning of the string.

The PMX

The PMX also proceeds by choosing two cut points at random:
    Parent 1:   h k c e f d | b l a | i g j
    Parent 2:   a b c d e f | g h i | j k l

The cut-out section defines a series of swapping operations to be performed on the second parent. In the case above we must swap b with g, l with h, and a with i, resulting in the offspring:

    Offspring:  i g c d e f | b l a | j k h

The absolute positions of some elements of both the first and second parents are preserved.

The Cycle Crossover

The cycle crossover is an answer to the question: can we create an offspring different from the parents where every position is occupied by a corresponding element from one of the parents? The answer, in general, is yes. Every element of the offspring can come from one of the parents, and usually the offspring will be different from both parents. We wish to satisfy the following conditions:

a) Every position of the offspring must retain a value found in the corresponding position of a parent.
b) The offspring must be a permutation.

Consider again these parents:

    Parent 1:   h k c e f d b l a i g j
    Parent 2:   a b c d e f g h i j k l

Consider position one: condition a) says the offspring must contain either h or a. Suppose we choose h. Using condition b), a cannot be chosen from parent 2 since that position is now occupied by h. So a must also be chosen from parent 1. Since a in parent 1 is above i in parent 2, we must also choose i from parent 1. Continuing the argument, having chosen h from parent 1 we must also choose a, i, j, and l. The positions of these elements are said to form a cycle; choosing any of the positions from parent 1 or 2 forces the choice of the rest if the above conditions are to hold. The cycles are usually labelled by a number, or are called unary and designated by the letter U:

    Parent 1:     h k c e f d b l a i g j
    Parent 2:     a b c d e f g h i j k l
    Cycle Label:  1 2 U 3 3 3 2 1 1 1 2 1

In a standard crossover operator a random crossover point or cut section is chosen. In the cycle crossover a random parent is chosen for each cycle.
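Before completing the cycle-crossover example, the two cut-point operators already described can be sketched and checked against their worked examples. This is our own Python sketch (the authors' system was written in C); cut points are passed explicitly as 0-based indices.

```python
def order_crossover(p1, p2, cut1, cut2):
    """Copy p1's cut section in place; fill the other positions in the
    order the remaining elements appear in p2 after the second cut."""
    L = len(p1)
    section = p1[cut1:cut2]
    child = [None] * L
    child[cut1:cut2] = section
    fill = [p2[(cut2 + i) % L] for i in range(L)]   # p2 from the second cut, wrapping
    fill = [e for e in fill if e not in section]    # drop elements already copied
    for i in range(L - len(section)):
        child[(cut2 + i) % L] = fill[i]
    return child

def pmx(p1, p2, cut1, cut2):
    """p1's cut section defines swaps applied to a copy of p2."""
    child = list(p2)
    pos = {e: i for i, e in enumerate(child)}       # element -> position
    for i in range(cut1, cut2):
        a, b = p1[i], child[i]
        if a != b:
            ia, ib = pos[a], pos[b]
            child[ia], child[ib] = b, a             # swap a and b in the child
            pos[a], pos[b] = ib, ia
    return child
```

With the example parents and cut points 6 and 9, order_crossover returns d e f g h i b l a j k c and pmx returns i g c d e f b l a j k h, matching the offspring shown above.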
Choosing cycle 1 from parent 1 and the rest from parent 2 in the above example produces the following offspring:

    Offspring:      h b c d e f g l a i k j
    Parent number:  1 2 2 2 2 2 2 1 1 1 2 1

The absolute positions of, on average, half the elements of both the first and second parents are preserved.

Cycle crossover properties

Cycles have some interesting properties. Consider the cycles formed when two random parents of length L combine (the order O of a cycle is the number of elements in the cycle):

1. The probability of an element being in a cycle of order O (where 0 < O <= L) is 1/L. Note that this is independent of O.

2. The expected number of cycles of order O is therefore 1/O.

3. The expected total number of cycles is the sum of 1/O for O = 1 to L, which is approximately ln(L).

4. The expected number of cycles of length 2 or greater is the sum of 1/O for O = 2 to L, which is approximately ln(L) - 1.

5. If two parents produce ln(L) - 1 cycles of length 2 or greater, then the number of possible ways of performing the cycle crossover to produce an offspring different from both parents is

    2^(ln(L) - 1) - 2

In fact the expected number of different offspring for two random parents will be greater than this. This shows there are in general a good number of choices to be made, comparable with the choices available when crossing over two bit strings.

O-schema Analysis

What is an o-schema?

The analysis of the Order crossover has motivated a relaxation of the definition of an o-schema. Usually the following schemata are different (where ! means don't care):

    a b ! ! c ! ! ! ! !        ! ! a b ! ! c ! ! !

In our scheme these are all considered equivalent to a b ! ! c ! ! ! ! !: the schema has been normalized by taking the first element to be the successor of the longest string of don't-care symbols. Lexical ordering of the fixed-value symbols can be used to resolve don't-care strings of equal length. This definition of a schema is valid for the TSP, as all the equivalent o-schemata have the same value. It also has two clear advantages: there are fewer schemata, so a given population will hold a greater percentage, and the defining length of many schemata is reduced.
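The cycle construction worked through above can likewise be sketched in Python (ours, not the authors' C code; cycle numbering is our own, with unary cycles getting their own labels rather than "U"):

```python
def find_cycles(p1, p2):
    """Label each position with its cycle: starting anywhere, repeatedly
    jump to the position in p1 holding the element p2 shows at the
    current position, until the walk returns to the start."""
    pos1 = {e: i for i, e in enumerate(p1)}
    labels = [None] * len(p1)
    label = 0
    for start in range(len(p1)):
        if labels[start] is None:
            label += 1
            i = start
            while labels[i] is None:
                labels[i] = label
                i = pos1[p2[i]]   # p2's element here must come from the same parent
    return labels

def cycle_crossover(p1, p2, take_from_p1):
    """Take each whole cycle from one parent; take_from_p1 maps a
    cycle label to True (use parent 1) or False (use parent 2)."""
    labels = find_cycles(p1, p2)
    return [a if take_from_p1(c) else b for a, b, c in zip(p1, p2, labels)]
```

On the example parents, taking cycle 1 from parent 1 and the rest from parent 2 yields h b c d e f g l a i k j, the offspring shown above.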
For instance, the following schemata are equivalent:

    b ! ! ! ! ! ! ! ! a        a b ! ! ! ! ! ! ! !

The first schema had a defining length of 10 but now has length 2.

Definitions

    O is the order of an o-schema
    D is the length of an o-schema
    K is the length of a cut
    L is the length of a string

Analysis method

The analysis in general follows the Goldberg and Lingle [GoLi85] analysis. We calculate the probability of survival of o-schemata:

    P(S) = P(S|W)P(W) + P(S|O)P(O) + P(S|C)P(C)

W (Within) occurs when the o-schema is entirely within the cut section. O (Outside) occurs when the o-schema is entirely outside the cut section. C (Cut) occurs when the o-schema contains a cut point. P(S|C), the probability of survival of an o-schema on a cut point, is considered to be negligible.

Relaxing the definition of an o-schema allows for higher survival rates. However, this effect is only apparent in the analysis for the Order crossover. The Cycle and PMX operators attempt to retain absolute positions in the ordering, so the probability of relative movement of schemata along the string is negligible.

The Order Crossover

Let parent 1 be the parent from whom all the elements within the cut are taken. Let parent 2 be the parent whose ordering is used to place the elements outside the cut. The probability of the o-schema surviving from parent 1 is the probability that the o-schema is completely within the cut:

    P(S|W)P(W) = P(W) = (K - D + 1)/L

The probability of the o-schema surviving from parent 2 is the probability that none of the elements of the schema have been taken from parent 1. For large L and small D the following approximation holds:

    P(S|O)P(O) ≈ (1 - K/L)^D

Thus the probability of o-schema survival for the Order crossover is:

    P(S) = (K - D + 1)/L + (1 - K/L)^D

The PMX

Let parent 1 be the parent from whom all the elements within the cut are taken. Let parent 2 be the parent from whom all the elements outside the cut are taken, after the mapping operation has been performed.
The probability of the o-schema surviving from parent 1 is the probability that the o-schema is completely within the cut:

    P(S|W)P(W) = P(W) = (K - D + 1)/L

The probability of the o-schema being completely outside the cut is

    (L - K - D + 1)/L

The expected number of elements needing to be moved in the mapping operation is K, so the probability of any one element needing to be swapped is K × (1/L) = K/L. For large L and small O the following approximation holds [Note 1]:

    P(S|O) ≈ (1 - K/L)^O

Thus the probability of o-schema survival for the PMX is:

    P(S) = P(S|W)P(W) + P(S|O)P(O) = (K - D + 1)/L + ((L - K - D + 1)/L)(1 - K/L)^O

The Cycle Crossover

Let N elements be taken from parent 1 and L - N elements from parent 2. (N is similar to the K factor in the Order and PMX analysis above.) The probability of the o-schema surviving from parent 1 is the probability that all of the elements of the o-schema are taken from parent 1. For large L and small O the following approximation holds:

    (N/L)^O

The probability of the o-schema surviving from parent 2 is the probability that all of the elements of the o-schema are taken from parent 2. For large L and small O the following approximation holds:

    (1 - N/L)^O

Thus the probability of o-schema survival for the cycle crossover is:

    P(S) = (N/L)^O + (1 - N/L)^O

O-schema summary

O-schemata from parent 1, P(S|W)P(W), have the same probability of survival with the Order crossover and PMX. When D = O, the probability of survival from parent 2 is a factor of

    L/(L - K - D + 1)

more likely with the Order crossover than with the PMX. However, as D becomes greater than O, the probability of the o-schema surviving from the second parent decreases exponentially compared to the probability of survival under the PMX. Thus we would expect the Order crossover to perform better than the PMX in problems where compact schemata are important. We expect the PMX would perform better than the Order crossover when compact schemata are less important.
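The three survival approximations derived above can be collected and compared numerically. This is a sketch of ours, valid only in the large-L, small-D, small-O regime the derivations assume:

```python
def p_survive_order(L, K, D):
    """Order crossover: within-cut term plus (1 - K/L)^D."""
    return (K - D + 1) / L + (1 - K / L) ** D

def p_survive_pmx(L, K, D, O):
    """PMX: within-cut term plus outside-cut term times (1 - K/L)^O."""
    return (K - D + 1) / L + ((L - K - D + 1) / L) * (1 - K / L) ** O

def p_survive_cycle(L, N, O):
    """Cycle crossover: independent of the defining length D."""
    return (N / L) ** O + (1 - N / L) ** O
```

For a compact schema (D = O = 5) with L = 100 and K = 30, the Order crossover survives more often than the PMX; for a sparse one (O = 3 spread over D = 20) the ordering reverses, matching the summary above.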
The cycle crossover is noteworthy, as the probability of an o-schema surviving is independent of its length. We would expect it to perform well in problems where the compactness of schemata is of no importance.

Experimental Method

A single 30-city problem was used for all the GA experiments, the "Neural" computation comparison, and the Lin/Kernighan comparison [Note 2]. Each crossover was studied separately; no mixing of crossovers was attempted. Crossover was always applied to 80% of the population. A single mutation operator, SWAP, was used. The SWAP operator simply interchanges two positions on the string. For each crossover, the SWAP operator was applied to a percentage of the population varying from 30% through 100% at intervals of 10%. Population sizes of 50, 200, and 500 were tried for each crossover with each mutation level. In all cases 50,000 tours were calculated. This gives 1,000 generations for population size 50, through 100 generations for population size 500. Roulette wheel reproduction was used with some modifications to reduce selection errors. No elitist strategy was used.

The system was written in the programming language C. One run (50,000 evaluations) takes approximately 15 minutes on an Apollo DN3000 (operating system Aegis/Domain/IX, kernel version 9.2.5, C compiler version 4.1.6).

Experimental Results

The Order Crossover

Here we see 8 runs of the Order crossover with mutation levels at 30-100%. Population size is 200. The x-axis is the number of evaluations; this can be thought of as the "work done", and is constant in this part of the study over all population sizes.

[Figure: tour length vs. evaluations/1000; population 200, Order crossover 80%.]

The Order crossover is very sensitive to mutation levels. The difference in performance is 30% from the best (449 at 30%, pop. 500) to the worst (646 at 100%, pop. 500).
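Given how much the mutation level matters in these results, it is worth noting how simple the SWAP operator is. A Python sketch of ours (the experiments themselves were coded in C):

```python
import random

def swap_mutation(tour, rng=random):
    """Interchange two randomly chosen positions of the tour."""
    t = list(tour)
    i, j = rng.randrange(len(t)), rng.randrange(len(t))
    t[i], t[j] = t[j], t[i]
    return t
```

Since tours are permutations, SWAP always yields another legal permutation, which is why it is a safe mutation for all three crossovers.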
There is a steady improvement in performance as mutation is decreased from 100% to 30%. This trait is common in large (500) and small (50) populations.

Performance improves slightly as population size increases. The best tour with a population of 500 (449) is 4% better than the best tour with a population of 200 (466), which is 3% better than the best tour for a population of 50 (492).

The PMX

For the PMX with population size 50 we have:

[Figure: tour length vs. evaluations/1000; population 50, PMX 80%.]

The PMX is sensitive (yet not so much as Order) to the mutation level. The difference in performance is 40% from the best (498 at 60%, pop. 50) to the worst (687 at 100%, pop. 50). The best mutation level is not as obvious as with the Order crossover; it is between 40% and 60% depending on the population size.

There is little performance variation with population size; in total a 5% difference, with populations 50, 200, and 500 giving best tours of 498, 518, and 521 respectively.

The Cycle Crossover

For the Cycle crossover with population size 50 we have:

[Figure: tour length vs. evaluations/1000; population 50, Cycle crossover 80%.]

The Cycle crossover is least sensitive to mutation levels. The variation in performance is 16% from the best (517 at 50%, pop. 50) to the worst (601 at 100%, pop. 50). The best mutation level is not clear; it varies from 50% for population 50 to 100% for population 200. Performance was 8% better for population 50 (517) than for population 200 (559) and 500 (560).

Experiment Summary

Here we see the best runs, with a picture of the best tours, for the Order crossover, PMX, and Cycle crossover respectively.

[Figure: best run, population 500, Order crossover 80%, SWAP 30%; best tour 449.]
[Figure: best run, population 50, PMX 80%, SWAP 60%; best tour 498.]
[Figure: best run, population 50, Cycle crossover 80%, SWAP 50%; best tour 517.]
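All of the "best tour" numbers quoted are Euclidean lengths of the closed tour. A minimal scoring sketch of ours:

```python
import math

def tour_length(cities, tour):
    """Length of the closed tour visiting cities (a list of (x, y)
    points) in the order given by tour (a permutation of indices)."""
    total = 0.0
    for i in range(len(tour)):
        x1, y1 = cities[tour[i]]
        x2, y2 = cities[tour[(i + 1) % len(tour)]]  # wrap back to the start
        total += math.hypot(x2 - x1, y2 - y1)
    return total
```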
Order does 14% better than the PMX, and 15% better than the Cycle crossover.

The o-schema analysis above ties in with these experimental results. The TSP is a problem where adjacency of particular elements (cities) is important. Thus we would expect that the Order crossover would perform best. The less sensitive behavior of the PMX and Cycle crossovers, especially the Cycle crossover, can be thought of as a consequence of a built-in higher level of mutation present in these operators.

It should be remembered that the experimental results are for the TSP only, and the o-schema properties of the PMX and the Cycle crossover should be considered when applying permutation crossovers to other problems.

"Going for the One"

Using the results of this study, further experiments were carried out to find a very good tour for the 30 cities. The results suggested use of the Order crossover, with mutation at 30% or lower, population size 500 or larger, and more evaluations.

At population size 500, for 400 generations, with mutation at 15%, and with population size 1000, for 200 generations, with mutation at 25%, the Order crossover produced the following tour of length 425:

[Figure: best GA tour found, length 425.]

This is 5% shorter than the best tour in the comparison study.

Comparison with other approaches

We can directly compare our results with the optimal tour, the Lin/Kernighan algorithm [LiKe73], and Hopfield and Tank's "Neural" computation technique [HoTa85]. Some comparison is possible with other GA results.

This is the optimal tour; its length is 424 [Note 3]:

[Figure: optimal tour, length 424.]

The Lin/Kernighan heuristic finds this tour. The best GA result, of length 425, is within 1% of this.

Next we see the best "Neural" computation tour. Its length is 509 [Note 4]. This is about 19% from optimal on the same 30 cities:

[Figure: best "Neural" computation tour, length 509.]
Goldberg and Lingle [GoLi85] report optimal results for 10 cities. We have not run the same cities as Grefenstette et al. [GGRG85] used, but some comparison is possible: Grefenstette et al. report results within 16-27% of optimal for 50, 100, and 200 cities. To put this work in perspective, current TSP research is on tours of tens of thousands of cities.

Summary

O-schema analysis showed that the Order crossover is superior in problems where compact schemata are important (such as the TSP). We expect the PMX to be superior when compactness is less important. The Cycle crossover preserves o-schemata independent of length, and is potentially superior in problems where compactness is not relevant. The experimental analysis supported the theoretical prediction of good performance of the Order crossover for the TSP. Results from these experiments indicated how to fine-tune the population size and mutation level to produce results that were near optimal for the 30 cities studied.

Notes

1. Goldberg and Lingle report k+1 here, where we calculate k.

2. The coordinates of the 30 cities are: (82 7) (91 38) (83 46) (71 44) (64 60) (68 58) (83 69) (87 7 (7 71) (58 69) (54 62) (34 67) (37 84) (41 04 (22 60) (25 62) (18 54) (4 50) (13 40) (18 (25 38) (41 26) (45 21) (44 35) (58 38) (62 39)

3. The optimal tour and Lin/Kernighan tour were computed by David Johnson.

4. These were not the actual coordinates used by Hopfield and Tank but were extracted from the diagrams in [HoTa85]. The extraction was quite accurate: they report 507 as the length of the "neural" tour, and we calculate 509 on our points; they report 426 for the Lin/Kernighan tour, and this is 426 on our points also. The diagrams in [HoTa85] should be rotated through 90 degrees, laterally inverted, and scaled up by 100 to correspond with our diagrams.
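For reference, the tour lengths quoted throughout these experiments are closed Euclidean path lengths over (x, y) city coordinates such as those in Note 2. A minimal sketch; the function name and the sample square are illustrative, not from the paper:

```python
import math

# Illustrative sketch: a tour's length is the closed Euclidean path length
# over (x, y) city coordinates. Names and the sample square are ours.

def tour_length(cities, tour):
    """cities: list of (x, y) points; tour: a permutation of city indices."""
    total = 0.0
    for a, b in zip(tour, tour[1:] + tour[:1]):  # wrap around to close the tour
        (x1, y1), (x2, y2) = cities[a], cities[b]
        total += math.hypot(x2 - x1, y2 - y1)
    return total

square = [(0, 0), (0, 10), (10, 10), (10, 0)]
print(tour_length(square, [0, 1, 2, 3]))  # 40.0
```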
References

[Davi85] Davis, Lawrence, Applying Adaptive Algorithms to Epistatic Domains, Proc. International Joint Conference on Artificial Intelligence, 1985.

[GGRG85] Grefenstette, J., Gopal, R., Rosmaita, B., Van Gucht, D., Genetic Algorithms for the Traveling Salesman Problem, Proc. International Conference on Genetic Algorithms and their Applications, 1985.

[GoLi85] Goldberg, D. E., Lingle, R., Alleles, Loci, and the Traveling Salesman Problem, Proc. International Conference on Genetic Algorithms and their Applications, 1985.

[HoTa85] Hopfield, J. J., Tank, D. W., "Neural" Computation of Decisions in Optimization Problems, Biological Cybernetics 52, 141-152, 1985.

[LiKe73] Lin, S., Kernighan, B. W., An Effective Heuristic Algorithm for the Traveling Salesman Problem, Operations Research 21, 498-516, 1973.

[LLRS85] Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., Shmoys, D. B., The Traveling Salesman Problem, John Wiley & Sons, UK, 1985.

A CLASSIFIER-BASED SYSTEM FOR DISCOVERING SCHEDULING HEURISTICS

by M. R. Hilliard & G. E. Liepins
Oak Ridge National Laboratory, Oak Ridge, TN 37831*
Mark Palmer, Michael Morrow & Jon Richardson
University of Tennessee, Knoxville, TN 37996

ABSTRACT

A classifier system has been designed to discover heuristics for simple job shop scheduling problems. Experiments have shown that the classifier system is capable of discovering heuristics in a limited domain. We describe a new generalized version of the system and a set of experiments designed to produce more general rules based on sets of problems. Initial experiments with the generalized system indicate a potential for success.

THE NEED TO DISCOVER HEURISTICS

Constrained resources are common to many important activities. Manufacturing and distribution of goods, the mobilization and allocation of military forces, and government agencies' management of resources such as water and energy are all resource constrained.
Since most optimization methods are limited either in their ability to model the constraints of the system or in their ability to solve problems of "real-world" scale, planning and scheduling in these environments requires heuristics. These heuristics may range from simple "rules of thumb" to complex algorithms based on considerable mathematical analysis; however, they all trade guarantees of optimality for speed and efficiency. The development of expert systems and other artificial intelligence techniques has led to methodologies for using heuristics to solve such problems, but the discovery of the heuristics for such a system remains a difficult task. A system which could learn heuristics based on its own experience and experimentation would provide a powerful tool for developing dynamic, responsive, and useful automated aids for planning and scheduling.

THE SCHEDULING DOMAIN

Scheduling provides a variety of increasingly complex problems to test a learning system's capabilities. We have chosen the formal structure of job shop scheduling as a model within which to work. In this model, processes must be completed in time to meet a delivery schedule as closely as possible. The processes each take a certain amount of time (run time) and may have precedence constraints (i.e., processes A and B must be complete before process C can begin). Sets of processes comprise jobs, and jobs are expected to be completed by a certain time (due date). Resources (machines) are available to complete the processes but can only be used on one process at a time. Numerous optimization based techniques have been investigated to solve this problem (including genetic algorithms [Davis, 1985]), but the most successful techniques rely on heuristics [Rinnooy Kan, 1976].
We have performed several experiments using a modification of the classifier system, on the simplest form of the problems (one process per job, no precedence constraints, one machine) using the objective of minimizing lateness, L:

(1)  L = sum over all i of (delivery date_i - due date_i),  i = job 1, job 2, ..., job n

This problem has a well-known optimal heuristic: "Sort the jobs by increasing run time." Our experiments showed that the classifier system could find the optimal ordering of the jobs for a particular example, and, given the relative rankings of the run time, the system could develop a set of rules which said, "Do the shortest job first. Do the second shortest job second ... Do the longest job last." A recently developed extended system appears to have potential for developing more general rules.

THE PHASE I CLASSIFIER SYSTEM

The first phase of experiments with the classifier system was intended to evaluate design decisions (detectors, effectors, and feedback mechanisms). The resulting system demonstrated learning only in a limited sense; however, it provided insight into the design decisions.

The Environment

To be able to evaluate design decisions, the initial experiments with the classifier system have all considered the simplest version of the job-shop scheduling problem (as stated above). The system is presented a set of jobs with randomly generated run times and due dates, and is required to organize the jobs into a queue such that the total lateness (1) is minimized. If the jobs are renumbered by their position in the queue, then

(2)  delivery date_i = sum over j <= i of run time_j

It is well known [Conway, et al., 1967] that the minimum lateness schedule is achieved by organizing the jobs by increasing run time.

* Operated by Martin Marietta Energy Systems, Inc., under contract No. DE-AC05-84OR21400 with the U.S. Department of Energy.
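The objective in equations (1) and (2) and the known optimal heuristic can be sketched as follows. This is an illustrative rendering, not code from the paper; the job data follows the run times and due dates of Table 1:

```python
# Sketch of the scheduling objective: jobs are (run_time, due_date) pairs,
# and delivery date_i is the cumulative run time of the queue up to and
# including job i (equation 2). Names are illustrative, not from the paper.

def total_lateness(queue):
    """Total lateness L = sum over i of (delivery date_i - due date_i)."""
    t = 0
    lateness = 0
    for run_time, due_date in queue:
        t += run_time            # delivery date of this job
        lateness += t - due_date
    return lateness

def spt_schedule(jobs):
    """The known optimal heuristic: sort the jobs by increasing run time."""
    return sorted(jobs, key=lambda job: job[0])

jobs = [(5, 16), (12, 22), (23, 13), (50, 40)]  # (run time, due date), as in Table 1
print(total_lateness(spt_schedule(jobs)))  # 61
```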
Classifier Representation

A system's ability to react correctly to its environment is determined in part by the adequacy of its internal representation of that environment. Within a traditional learning classifier system [Holland, 1986], there are certain representational limitations due to the use of the genetic algorithm as the primary learning mechanism [Liepins and Hilliard, 1986]. The foremost of these limitations is the necessity of encoding the environmental detector messages as strings from a small (usually binary) alphabet. The length of these detector messages determines the size of the search space for the classifier system, and a balance must be achieved to provide an internal representation of the environment which is adequate for the system to reason correctly and to develop useful classifier rules but simple enough to be searched in an acceptable period of time. During the design of the action and reward structure, a minimal representation was necessary to provide a testbed for evaluating various decisions about the other features of the classifier system. Thus, in the Phase I experiments, the messages consisted of a single field, the rank of a job's run time in relation to other jobs in the problem. The Phase II experiments were performed using the binary representation of the run time as the message (see Table 1).

Job   Run Time   Due Date   Ranked Run Time Message   Raw Run Time Message
 1        5         16               00                     000101
 2       12         22               01                     001100
 3       23         13               10                     010111
 4       50         40               11                     110010

Table 1. Ranked run time and raw run time message lists.

Actions

For a classifier system to learn how to schedule jobs, it must have some means of interacting with the job queue. Interaction is accomplished through the use of effector classifiers which perturb the queue.
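For concreteness, the two message encodings compared in Table 1 (the 2-bit run-time rank used in Phase I and the raw 6-bit run time used in Phase II) can be sketched as follows; the function names are illustrative, not from the paper:

```python
# Hypothetical sketch of the two detector-message encodings in Table 1:
# a 2-bit rank of the job's run time, or the raw run time as a 6-bit string.

def ranked_messages(run_times, bits=2):
    """Rank jobs by run time and emit each rank as a fixed-width bit string."""
    order = sorted(range(len(run_times)), key=lambda i: run_times[i])
    rank = {job: r for r, job in enumerate(order)}
    return [format(rank[i], "0{}b".format(bits)) for i in range(len(run_times))]

def raw_message(run_time, bits=6):
    """Emit the raw run time as a fixed-width binary string."""
    return format(run_time, "0{}b".format(bits))

run_times = [5, 12, 23, 50]
print(ranked_messages(run_times))           # ['00', '01', '10', '11']
print([raw_message(t) for t in run_times])  # ['000101', '001100', '010111', '110010']
```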
In Phase I of the Job Shop Scheduling Classifier System, all classifiers are effector classifiers. Each of these is compared with the current detector message list (representing the current job queue with one message per job) and acts upon the matched job. A successful classifier implementation requires that effector actions satisfy three criteria: 1) all possible orderings must be reachable by the actions; 2) the actions must be simple enough that a user can trace and analyze the steps leading to a particular queue ordering; 3) the evaluated merit of an action needs to reflect its true worth. Many possible actions were discussed, including swapping jobs, moving jobs to the top of the queue, moving jobs to the top or the bottom of the queue, and moving jobs to a queue location specified by the classifier itself. Of these, we investigated moving a job to the top of the queue and moving a job to an indicated position. The action of moving a job to the top of the queue was powerful enough to create all possible job queues and simple enough to trace and analyze (the first and second criteria). A trace of the classifiers acting in the last few cycles will indicate how the queue was formed, and any specific ordering can be achieved by placing the jobs at the top in reverse order of their run time. However, this design fails to meet the criterion of providing an appropriate reward for action, because the system would have to learn to create a bad queue ordering in order to make a good one. This contradiction confused the classifier system, and the action of moving jobs to a specific queue location was considered instead. This second alternative, moving jobs to a specific queue location, meets all criteria. This action meets the third criterion particularly well, since the actions can be rewarded with the value of the new queue as described in the next section.
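The chosen effector action, moving a matched job to a specific queue position, can be sketched as follows (an illustrative rendering, not the authors' implementation):

```python
# Sketch of the chosen effector action: remove the matched job from the
# queue and reinsert it at the position indicated by the classifier.

def move_to_position(queue, job, position):
    """Return a new queue with `job` moved to index `position`."""
    rest = [j for j in queue if j != job]
    return rest[:position] + [job] + rest[position:]

print(move_to_position(["a", "b", "c", "d"], "d", 0))  # ['d', 'a', 'b', 'c']
```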
Therefore, the right hand side of a rule in the Phase I experiments was the binary representation of the queue position in which to place the job matched by the left hand (detector) side.

Reward

Providing the system with feedback indicative of its performance is another major requirement of a learning classifier system implementation. Each classifier which effects a change in the system's environment must be given some reward in order to estimate the classifier's relative value within the environment and to make an appropriate decision as to which classifiers should be used in reproduction, which should be replaced during reproduction, and which should be allowed to act. The classifier system maintains a strength value for each classifier as an assessment of its relative worth within the environment. This strength is generally increased by rewards for classifiers which produce good actions and decreased by taxes and payments. The reward within the job shop scheduling environment is based on the total lateness of the current job queue. This can be considered the learning approach to reward, in contrast to the training approach, where each action taken by a classifier would be evaluated and the reward would be based upon the action itself rather than the combination of its direct effects and its side effects upon the job queue. In the learning approach it is the state of the newly constructed job queue that forms the basis for the feedback to all the active classifiers, and not the manner in which the new queue was formed. Thus the learning strategy permits the classifier system to develop any strategy for queue organization as long as it leads to an acceptable job queue configuration. This reward structure minimizes the amount of problem specific information in the system, and avoids encouraging myopic or "greedy" heuristics.
The reward in the learning approach implicitly includes both an evaluation of the actions taken and a measure of the side effects of the actions. This approach requires that a classifier be given the opportunity to perform its action in a variety of situations (in order to separate the effect of its action from the side effects of other classifiers' actions). A successful and appropriate reward value is the actual lateness of the current queue scaled to the range [0, 200]. Combining this reward strategy with the simultaneous action methodology described below created a system capable of distributing strength to a set of good classifiers.

Conflict Resolution

The conflict resolution strategy allows the system to develop sets of rules which cooperate to solve a problem. This cooperation can be based on finding niches in the environment as in Holland's default hierarchy, or it can be explicit. Our conflict resolution strategy allows for both types of operation by allowing multiple rules to compete for niches in the environment and allowing multiple rules to be active in each cycle. Each cycle proceeds through the following steps:

1. Match the detectors to the messages from the job list. Let S be the set of matching classifiers.
2. Generate a bid for each matching classifier based on strength, and add a random perturbation to the bid to produce a "noisy" bid (Bid = c*Strength + ε).
3. Select the classifier with the largest "noisy" bid. Perform the action indicated by the classifier.
4. Select the classifier with the next highest bid and (if possible) perform the action indicated by the classifier.
5. Repeat step 4 until all matching classifiers have been examined or until all slots are filled.
6. Fill any remaining slots with available jobs (maintaining their previous relative ordering).
7. Evaluate the resulting queue.
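A rough sketch of one such cycle for the fully specific Phase I rules follows. The bid constant, noise scale, and data layout are assumptions for illustration; the paper does not give them:

```python
import random

# Illustrative sketch of the conflict-resolution cycle (steps 1-6) for fully
# specific classifiers. A classifier is a dict with a 'condition' (a job
# message), a 'position', and a 'strength'. C and NOISE are assumed values;
# messages are assumed distinct.

C, NOISE = 0.1, 0.01

def run_cycle(classifiers, messages, n_slots):
    """Build one job queue from the classifiers' noisy bids."""
    # Step 1: match detectors against the job messages.
    matching = [c for c in classifiers if c["condition"] in messages]
    # Step 2: noisy bid = C * strength plus a small random perturbation.
    for c in matching:
        c["bid"] = C * c["strength"] + random.uniform(0, NOISE)
    # Steps 3-5: act in decreasing bid order until the slots fill.
    queue = [None] * n_slots
    placed = set()
    for c in sorted(matching, key=lambda c: c["bid"], reverse=True):
        job, pos = c["condition"], c["position"]
        if job not in placed and queue[pos] is None:
            queue[pos] = job
            placed.add(job)
    # Step 6: fill remaining slots with unplaced jobs in their original order.
    rest = iter(m for m in messages if m not in placed)
    return [q if q is not None else next(rest) for q in queue]
```

With classifiers encoding the identity mapping at well-separated strengths, the cycle reproduces the input ordering; with no classifiers at all, step 6 simply copies the job list.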
To provide more control in the experiments, there were eight jobs in the queue; therefore, the classifiers could be completely specific and did not employ #'s. The entire population of 64 possible rules (8 jobs x 8 positions) was maintained throughout the run. The classifier system established appropriate weights so that the eight optimal rules (i.e., 000/000, 001/001, ..., 111/111) dominated the population and generated an optimal queue. This was a limited demonstration of learning transferable heuristics. The system was limited by the requirement to use ranked run time (rather than raw run time) and to use specific rules.

THE GENERALIZED CLASSIFIER SYSTEM - PHASE II

Two extensions were necessary for the system to produce more general heuristics. The first is to generalize the left-hand side of the classifiers, allowing a classifier to match a set of jobs. The second extension is to reward the system for its performance on a set of problems, since appropriate generalizations require multiple examples. An analysis of the steady state of the system under different designs provided impetus for some design decisions.

Extensions

The Phase II system achieves generality in the classifier representation by using the raw run time information as in Table 1, and by incorporating a ternary representation with the third character, #, being regarded as a "don't care" symbol and matching either 1 or 0. Thus, 100#01# matches all of 1000010, 1000011, 1001010, 1001011. Since the classifiers match sets of jobs, the conflict resolution strategy becomes more complex. Two strategies are being compared in our initial experiments, the champion strategy and the consensus strategy. In both strategies, the classifier's bid is calculated the same way bids are calculated in the all-specific classifier system, except that the random perturbation is added at a different point in the process.
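The ternary "don't care" matching described above can be sketched directly:

```python
# Minimal sketch of ternary "don't care" matching for Phase II conditions.

def matches(condition, message):
    """condition is over {0, 1, #}; '#' matches either bit of the message."""
    return len(condition) == len(message) and all(
        c == "#" or c == m for c, m in zip(condition, message))

print(matches("100#01#", "1000010"))  # True
print(matches("100#01#", "1101010"))  # False
```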
The conflict resolution strategies view the bids through a job-position matrix. This matrix consists of cells for each job/position combination. Thus, in the case of four jobs, the matrix consists of sixteen cells and can be represented as in Figure 1.

[Figure 1: a 4 x 4 matrix of cells, with jobs as rows and positions 1-4 as columns. Cell C23 contains the bids of classifiers which match job 2 and recommend placement in position 3.]

The champion strategy adds random perturbations to each bid in a cell and chooses the largest noisy bid to create the bid for the cell. The consensus strategy sums the bids in a cell and adds a random perturbation to the sum to create the cell's bid. In both cases the conflict resolution proceeds as in the totally specific case, with cells replacing classifiers. The reward under the champion strategy is awarded to the chosen classifier in each active cell, while the consensus strategy divides the reward equally among all classifiers bidding into an active cell. Initial experiments have shown the champion system to be superior.

The second extension, evaluating the rules on multiple problems to encourage generality, can be accomplished in two ways. In the cyclic reward implementation, the system cycles through the problems in a training suite several times in different random orders, bidding and receiving a reward for each problem. Reproduction, selection, and crossover occur intermittently throughout this cycle. In contrast, the delayed reward scheme allows the classifiers to bid, accumulates the rewards for the entire suite, and adjusts the classifier strengths based on the suite as a whole. Reproduction, selection, and crossover occur after several bidding cycles. This delayed reward is intended to prevent the most recent problem from having a stronger influence on the strengths at the time of reproduction. The two evaluation methods are being compared in Phase II.
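The two cell-bid rules can be sketched as follows; the noise scale is an assumed parameter, not a value from the paper:

```python
import random

# Hedged sketch of the two cell-bid rules. `bids` holds the raw bids of the
# classifiers falling into one job/position cell; NOISE is an assumed scale.

NOISE = 0.01

def champion_bid(bids):
    """Perturb each bid; the largest noisy bid speaks for the cell."""
    return max(b + random.uniform(0, NOISE) for b in bids)

def consensus_bid(bids):
    """Sum the cell's bids, then perturb the sum once."""
    return sum(bids) + random.uniform(0, NOISE)
```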
Steady State Analysis

An analysis of the steady state equations indicates that both conflict resolution strategies tend to select more specific classifiers to act. Depending on the reward structure, selection for reproduction can tend toward general rules, specific rules, or rules with mid-level specificity. Let S = steady state strength, B = steady state bid, R = average reward, C = bid constant, k = specificity (number of specific positions), T = tax constant (S*T paid each cycle), n = number of jobs matched by the classifier, and m = number of classifiers bidding into the cell. We can consider a single classifier and a single cell. Three possible reward structures produce three different results (see Table 2).

Payments per cycle        Steady state value                      Classifier selection tendency
Payments = S(Ck + T)      S = R/(Ck + T),  B = R/(1 + T/C)        Specific
Payments = S(Ck + Tn)     S = R/(Ck + Tn), B = R/(1 + T/C)        Specific
Payments = S(C + Tm)      S = R/(C + Tm),  B = R/(1 + T/C)        Specific

Table 2. Steady state analysis for the champion strategy.

The consensus strategy produces steady state values equal to the champion strategy's values divided by m. Since the bids in a cell are summed to determine the actions, the selection for actions continues to tend toward specific classifiers. Selection for reproduction, however, tends to favor classifiers occupying cells with small populations. Since the offspring tend to populate the same cell as their parents, this provides a control on overpopulation of any single cell. Initial experiments with the systems tend to confirm these generalities, but the system is very sensitive to the value of T. If the value is too low, extremely general classifiers (######/010) tend to gain strength, whereas a high value tends to produce an extremely sparse matrix. A complete comparison of techniques awaits the completion of a full set of experiments. Initial analysis of the populations does reveal that once created, "good" general rules (00####/000) tend to survive.
CONCLUSIONS

Although we have not completed all the experiments in Phase II, the classifier system, when implemented with a conflict resolution scheme allowing for cooperation between classifiers, provides a mechanism which appears to be capable of generalization at a rudimentary level. The conflict resolution methods and bidding schemes developed for the simple job shop scheduling task seem to be extendible to more complex scheduling problems (precedence constraints, multiple machines, etc.) and possibly to other ordering problems. Future research will include a variety of experiments designed to help us estimate the system's ability to ignore extraneous data and the usefulness of a penalty function formulation for precedence constraints.

Acknowledgments

We wish to thank David Goldberg of the University of Alabama and Chrisila Pettey of Vanderbilt University for numerous helpful discussions and comments.

References

Conway, Richard W., William L. Maxwell, and Louis W. Miller, Theory of Scheduling, Addison-Wesley, Reading, Mass., 1967.

Davis, Lawrence, "Job Shop Scheduling with Genetic Algorithms," Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, PA, 1985.

Holland, John H., "Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems," in Machine Learning: An Artificial Intelligence Approach, Vol. II, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell (Eds.), Morgan Kaufmann, Los Altos, California, 1986.

Liepins, Gunar E., and Michael R. Hilliard, "Representational Issues in Machine Learning," International Symposium on ..., Oct. 25, 1986.

Rinnooy Kan, A. H. G., Machine Scheduling Problems: Classification, Complexity and Computations, Nijhoff, The Hague, 1976.

USING THE GENETIC ALGORITHM TO GENERATE LISP SOURCE CODE TO SOLVE THE PRISONER'S DILEMMA

Cory Fujiki
John Dickinson
Department of Computer Science
University of Idaho
Moscow, Idaho 83843

ABSTRACT

The genetic algorithm is adapted to manipulate Lisp S-expressions.
The traditional genetic operators of crossover, inversion, and mutation are modified for the Lisp domain. The process is tested using the Prisoner's Dilemma. The genetic algorithm produces solutions to the Prisoner's Dilemma as Lisp S-expressions and these results are compared to other published solutions.

INTRODUCTION

The genetic algorithm is an adaptive learning system inspired by evolution and heredity. In this particular application, the genetic algorithm will be used with a program generator to create source programs for different applications of the Prisoner's Dilemma problem. The genetic algorithm is used to produce a set of programs (called a generation) which attempt to solve a particular problem. These programs are evaluated by a critic on how well they solve the problem, and the best programs are manipulated by a set of genetic operators to create a new set of programs (the next generation). This process is repeated until a program produces an acceptable solution to the problem. There are three genetic operators [6] traditionally used to create the new programs from existing programs: Crossover, Invert, and Mutate. These genetic operators have been modified to be used with the program generator.

BACKGROUND

The initial work with the genetic algorithm manipulated a bit string representation of the solution. Smith [7] has shown that the genetic algorithm can manipulate variable length solutions that are more complex in form. Cramer [2] showed that instruction sets could be manipulated directly. Hicklin [5] showed in his work that a genetic algorithm can be applied to a program generator which creates grammatically correct LISP programs from a given set of productions [5]. The initial generation of programs was generated at random and then evaluated. Programs which rated below the average score of the entire population are replaced by new programs created from the higher rated programs.
Random changes are made to programs to ensure a complete search of all possible programs. Although Hicklin's work was inspired by the genetic algorithm and Holland's work in the area, he did not use the genetic operators to develop new programs. Fujiki [4] took this work and developed genetic operators which closely approximated the traditional operators used by Holland [6].

KNOWLEDGE REPRESENTATION

A problem facing all adaptive learning systems is how to represent the knowledge being learned. In theory, the system should be flexible enough to allow unique and previously unknown solutions to be discovered. In practice, it is often necessary and even desirable to supply some built-in knowledge about a possible solution in order to speed up the search process. In Hicklin's system, the knowledge about a particular solution to a problem is represented as a program. A program is generated using a set of productions. The degree of flexibility in the system and the degree of built-in knowledge is dependent on how these productions are defined. The types of programs generated in this study are based on the set of productions shown in Figure 1. These productions represent a subset of the productions for LISP. While not all possible LISP constructs can be generated using these productions, all of the programs generated are acceptable LISP S-expressions. In productions 1 and 2, the symbol "Start" represents the starting symbol for the grammar. The programs created are LISP functions starting with a COND command. Productions 3 and 4 are used to create any number of condition and action pairs.

1) Start --> (COND (T Action))
2) Start --> (COND Cond-Term (T Action))
3) Cond-Term --> (Logical Action)
4) Cond-Term --> (Logical Action) Cond-Term
5) Action --> The different actions that may be performed.
6) Logical --> The different possible conditions that can be checked.
Productions Used To Create Programs
Figure 1

Once a true condition is found, the action associated with it is performed. Only the first true condition found in a COND is used. The last condition generated is always a TRUE function with some action. This insures that the program generated will always produce some action found in the grammar no matter what conditions have been formed earlier in the program. Production 5 represents the different result productions, represented by the symbol "Action". These actions may be Lisp S-expressions or calls to specific routines. This production produces the different actions that may be desired for the problem domain being studied. Production 6 represents all of the different productions used to generate the logical condition used in the program. There can be any number of these "Logical" productions, including productions which combine the logical conditions by the use of logical "AND" and "OR" operators. An example of one complete grammar used for the Prisoner's Dilemma programs is shown in Appendix A.

0. Initialization. Using the grammar productions, randomly create the initial population of m Lisp expressions. All members of this initial population are considered to be mutations of the start symbol of the grammar. Whenever the grammar is used, the productions are separated into two categories: those that have only terminal symbols on the right-hand side of a production and those that have one or more variable symbols on the right-hand side. The length of the Lisp expressions is controlled by selecting productions from these two categories. A terminal percentage, t, is used to determine the probability that a production will be selected from the category that contains only terminal symbols on the right-hand side. The terminal percentage is increased as the length of the Lisp expressions increases, so that eventually the Lisp expression contains only terminal symbols.

1. Evaluation. All Lisp expressions are evaluated using EVAL. The value returned is used to rank each Lisp expression. Some applications may require several evaluations of a Lisp expression to obtain the actual figure of merit. For example, the Prisoner's Dilemma requires several "turns"; each turn uses EVAL and the set of responses is used to calculate the figure of merit.

2. Modify population. A percentage, r, of the best performers in the current population is retained and the lower portion of the population is eliminated. The members eliminated are replaced by new members that are created from the remaining good performers by mutation, inversion, and crossover, which are invoked according to the mutation percentage pm, the inversion percentage pi, and the crossover percentage pc.

3. Iterate. Go to Step 1.

Genetic Algorithm for Lisp S-expressions
Figure 2

GENETIC OPERATORS FOR LISP SOURCE CODE

The genetic operators must be able to change a program but at the same time keep some of the major characteristics of the program. If the operator does not change the program enough, then the search for better programs proceeds too slowly. If the operator changes the program too much, then the characteristics of the program that made it successful will be lost and little information is passed to the next generation of programs. The programs generated all have the same basic form in that they are all COND statements followed by one or more conditions and actions. Each condition action pair is considered to be one individual piece of information to be used by the genetic algorithm, and pairs are never broken apart by any operator. The operators used in this study were inspired by those used by Holland [6].

The crossover operator combines condition action pairs from one program with those of another program. The crossover operator divides two programs at two points picked at random in between the condition action pairs.
New programs are created by combining the first part of one program with the second part of the other program. A second program is created by combining the two remaining parts of the programs. The crossover point can occur at any point in the list of conditions except after the TRUE condition found at the end of the program. A crossover after this point would be pointless, since all conditions after a true statement would never be used and the program would remain the same as before. For example, a program can be represented as (Cond a b | c d) and (Cond A B C | D), where each letter is a condition action pair. Using crossover between b, c and C, D would produce the following programs: (Cond a b D) and (Cond A B C c d).

The invert operator does not change the information in a program but rather rearranges the information. When the invert operator is applied to a program, it reverses the order in which the condition action groups are evaluated, but it does not change the condition action groups themselves. The TRUE condition found at the end of the program is not included as one of the points which can be inverted. If the true condition were inverted, none of the conditions following it would be used, since the program always ends after the first true condition found. If, for example, a program represented as (Cond a b c d) is inverted between a and c, then the resulting program would be (Cond c b a d).

The mutate operator is used to add new information to the population of programs. Once the original generation of programs has been created, the crossover operator and the invert operator cannot add any new information to the programs, since they only rearrange the information that already exists. A random mutation operator is used to add new information to the programs generated. When the mutation operator is used on a program, it removes one condition action pair picked at random in the program and replaces it with a new condition action pair.
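The three operators on the condition-action pair representation can be sketched as follows, reproducing the paper's crossover and invert examples. `new_pair` stands in for regenerating a pair from the grammar, and this sketch simply excludes the trailing T-pair from mutation rather than mutating only its action:

```python
import random

# Illustrative sketch of the three operators on programs represented as lists
# of condition-action pairs, with the final element playing the (T Action) role.

def crossover(p1, p2, cut1, cut2):
    """Swap tails at the chosen cut points (never after the trailing T-pair)."""
    return p1[:cut1] + p2[cut2:], p2[:cut2] + p1[cut1:]

def invert(prog, i, j):
    """Reverse the order of pairs i..j, leaving the trailing T-pair alone."""
    return prog[:i] + prog[i:j + 1][::-1] + prog[j + 1:]

def mutate(prog, new_pair):
    """Replace one randomly chosen non-final pair with a freshly generated one."""
    k = random.randrange(len(prog) - 1)  # simplification: never touch the T-pair
    return prog[:k] + [new_pair] + prog[k + 1:]

# The paper's crossover example: (Cond a b | c d) x (Cond A B C | D).
print(crossover(["a", "b", "c", "d"], ["A", "B", "C", "D"], 2, 3))
# (['a', 'b', 'D'], ['A', 'B', 'C', 'c', 'd'])
print(invert(["a", "b", "c", "d"], 0, 2))  # ['c', 'b', 'a', 'd']
```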
This new condition-action pair is created using a subset of the original grammar. Since it is desired to always have a true condition at the end of the program, the last true condition could not be changed by the mutation operator. If this condition is picked to be mutated, then only the action associated with it would be changed. For example, if a program which is represented as (COND a b c d) is mutated at c, then it would appear as (COND a b e d), where e is a completely new condition-action pair. The extent of the mutation performed on a program is determined at random.

A detailed outline description of the genetic algorithm applied to Lisp S-expressions is given in Figure 2. The values for the parameters used for this work were:

m = 80 (population size)
r = 25% (survival rate)
t = initially 20% but increases to 100% (terminal percentage)
pm = 25% (mutation percentage)
pi = 25% (inversion percentage)
pc = 25% (crossover percentage)

The percentages for the genetic operators were all set to the same value because one of the goals in Fujiki [4] was to determine the effects of the various operators.

PRISONER'S DILEMMA PROBLEM

Each program is evaluated and the highest scoring programs are used to create the next population. If two programs score the same, then the smaller program is rated higher. This is done to favor the more efficient programs and remove programs with useless clauses.

The Prisoner's Dilemma problem involves two players who both must decide whether to cooperate or defect on each other. Each player is awarded points (called the payoff) depending on the action taken compared to the action of the opponent. The different combinations of possible actions and the payoffs are shown in Figure 3, where R represents the payoff given to each player if both cooperate. S represents what is called the sucker payoff (i.e.
the payoff for cooperating when the other player defects), T represents what is called the temptation payoff (i.e. the payoff for a player when he defects and the other player cooperates), and P is called the punishment payoff when both players defect. The payoffs for each move have the following relationships in the Prisoner's Dilemma problem: T > R > P > S. The gain for the players when both cooperate with each other is greater than when both defect. The gain for cooperating when the opponent defects is the smallest of any action. The gain for a player is greatest when he defects and the other player cooperates.

                          Player 2 (Second Symbol)
                          Cooperate    Defect
Player 1      Cooperate      R/R        S/T
(First Symbol)   Defect      T/S        P/P

Points Scored in Prisoner's Dilemma
Figure 3

A second requirement for the Prisoner's Dilemma is that players cannot benefit from taking turns exploiting each other. The gain from mutual cooperation must be greater than the average score for defecting and cooperating: R > (S + T)/2.

From the basic situation described above, several variations have been created. In this paper, the following additional requirements have been added: 1) Instead of just one round of decisions, a series of decisions is to be made. Here a player could start by cooperating but decide to defect later on depending on what the other player did. 2) The strategy is used against a variety of other strategies in a round robin tournament. Here the best strategy is the one which scores the most total points at the end of the tournament. In this situation, a player may not be able to outscore any other player but still win by scoring well against all of the other players.

The Prisoner's Dilemma has been used to represent a wide variety of real life situations.
For example, when two businesses interact, it is usually a situation where one company would gain the most if it could get the other company to give up something for nothing, but at the same time, they have more to gain if they both cooperate than if they both try to deceive each other. Problems ranging from personnel relationships to superpower negotiations have been studied using some form of the Prisoner's Dilemma problem [1].

For this study, the genetic algorithm tried to develop a strategy when having to make a series of 10 decisions against a variety of other strategies. In each situation, the program generated by the genetic algorithm was used each time a decision whether to cooperate or defect was required. Scoring for each decision was based on the following (see Figure 4).

                         Player 2 (Second Value)
                         Cooperate    Defect
Player 1     Cooperate      3/3        0/4
(First Value)   Defect      4/0        1/1

Points Scored in Prisoner's Dilemma
Figure 4

The number of games played and the points awarded for each move were based on information from a tournament held for the Prisoner's Dilemma problem. This information was received in a file called The First Announcement of Computer Programs Tournament (of the Prisoner's Dilemma Game) on February 7, 1986. These values can be easily changed in the genetic algorithm to meet other conditions. For instance, if the grammar has no knowledge about the total number of games to be played, then a strategy for an infinite number of games can be generated.

The grammar used to generate programs had three different options it could check based on what the opponent did in the past: 1) what the opponent did on the last move, 2) what was done two moves before, and 3) whether a particular action was ever done by the opponent in the past (either defect or cooperate). Other strategies could be generated by using a combination of the above checks together. The program generated could also check on the number of rounds played.
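The Figure 4 values satisfy both Prisoner's Dilemma constraints; a quick check and a scoring helper are sketched below (taking the mutual-defection payoff to be 1, our reading of the figure, and with 1 = cooperate, 2 = defect as elsewhere in the paper):

```python
# Payoffs read from Figure 4: R = 3 (mutual cooperation), S = 0 (sucker),
# T = 4 (temptation), P = 1 (mutual defection; assumed value).
T, R, P, S = 4, 3, 1, 0

assert T > R > P > S        # defection tempts; punishment still beats sucker
assert R > (S + T) / 2      # alternating exploitation cannot beat cooperation

# (my move, opponent's move) -> my points; 1 = cooperate, 2 = defect.
PAYOFF = {(1, 1): R, (1, 2): S, (2, 1): T, (2, 2): P}

def score(my_moves, opp_moves):
    """Total payoff for one player over a series of rounds."""
    return sum(PAYOFF[m, o] for m, o in zip(my_moves, opp_moves))
```

For example, cooperating twice and then defecting against an opponent who cooperates once and then defects twice yields 3 + 0 + 1 = 4 points.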
While these checks were available in the grammar, there was no information about which checks to use or how to use them when creating a program. If a user was interested in other strategies, then they could be added to the grammar. The actions available to a program would be to either cooperate (represented by a 1) or defect (represented by a 2).

RESULTS AND CONCLUSIONS

The generation of what appear to be optimal strategies for the Prisoner's Dilemma problem came after relatively few generations in most cases. According to Axelrod's pioneering work on the Prisoner's Dilemma, the optimal strategy for the Prisoner's Dilemma under most circumstances is one he calls Tit-For-Tat [1] (there is never one optimal strategy for all cases, since the best strategy depends on the other strategies being played). Tit-For-Tat cooperates on the first move and then does whatever the opponent did on the previous move (i.e. if the opponent cooperated then it cooperates, and if the opponent defected, then it defects). Axelrod states that a strategy normally has three different characteristics to be successful in a Prisoner's Dilemma tournament: nice (i.e. does not defect first), provocative (i.e. will defect if defected on), and forgiving (i.e. can be made to cooperate after it has started to defect). Tit-For-Tat has all of these characteristics. This strategy is simple for the grammar to generate and therefore one would expect it to appear soon in the random generation of programs. The fact that simple strategies could produce some of the best results no doubt leads to the rapid generation of solutions.

The environment that the genetic algorithm was tested under was based on work done by Dacey and Pendegraft in their study of the Prisoner's Dilemma problem [3]. The grammar used to generate programs was designed to produce all of the different strategies studied by Dacey and Pendegraft. Other strategies were also possible with the grammar (i.e.
checking two moves back), but there was no attempt at making a grammar that was all inclusive for every possible strategy. The grammar also allowed the checking of which round the program was in.

The first run was played against strategies where Tit-For-Tat was thought to be optimal. The solution generated was a slight variation of Tit-For-Tat where, instead of defecting if the opponent's last move is a defect, it defects if the opponent's last two moves are both defects. This was not one of the strategies studied by Dacey and Pendegraft and was produced by combining two strategies found in the grammar. This strategy was studied by Axelrod in his work [1] (called Tit-For-Two-Tats) and matches his contention that the best overall strategy should be nice (i.e. never defect first) and forgiving (i.e. can be made to cooperate even after being defected upon). It is also provocative, but not as much as Tit-For-Tat. When Tit-For-Tat was tried against the same environment, it was found to score lower than this new strategy. The genetic algorithm was tried against another set of strategies where Dacey and Pendegraft's work predicted Tit-For-Tat was optimal, and once again Tit-For-Two-Tats won.

A third run was made against a different combination of strategies. In this environment, according to Dacey and Pendegraft, the strategy predicted to be optimal was one of cooperating with an opponent until defected upon and then defecting from that point until the end. This time, the strategy produced by the genetic algorithm matched the predicted strategy.

In the examples used above, the genetic algorithm developed a strategy against a fixed set of opponents. A fourth run was made where the programs produced by the genetic algorithm were also used as the environment that the programs were played in. Each program played all of the other programs produced each generation and was then evaluated on its overall score against all of the programs.
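The strategies discussed in these runs are each a one-line decision rule over the opponent's move history. A sketch (function names are ours; 1 = cooperate, 2 = defect):

```python
def tit_for_tat(opp):
    """Cooperate on the first move, then echo the opponent's last move."""
    return 1 if not opp else opp[-1]

def tit_for_two_tats(opp):
    """Defect only when the opponent's last two moves were both defections."""
    return 2 if opp[-2:] == [2, 2] else 1

def cooperate_then_always_defect(opp):
    """Cooperate until the opponent defects once, then defect forever
    (Dacey and Pendegraft's cooperation-permanent defection)."""
    return 2 if 2 in opp else 1
```

A single defection provokes Tit-For-Tat but not Tit-For-Two-Tats, which is what makes the latter the more forgiving of the two.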
In this type of environment, a strategy may do extremely well only because it can take advantage of the other strategies currently in the population. If these strategies are replaced by different strategies, then the most successful strategies may change. This is exactly what appears to happen in this run. At first, a strategy of defecting all the time scored extremely well and began to dominate the population. Then a small number of programs were created which had the characteristic of being both nice and provocative. These strategies, of which Tit-For-Tat is one, protect themselves from being taken advantage of by an all-defecting strategy but also cooperate with other nice programs. When enough of this type of program are created, either by random mutation or by manipulating existing programs, then the all-defecting strategy is no longer optimal. Eventually the strategy of cooperating until defected on, and then defecting from then on, came out as the optimal strategy. This strategy appeared after approximately seventy generations and took over as the top player. Once this strategy was created, no other strategy was found that scored better in over 300 generations.

In Axelrod's work on the Prisoner's Dilemma problem, he states that in a population of all-defecting players, a strategy of being nice and provocative would be successful if a group of nice and provocative players appeared in the population. But he also states that if a population of mostly nice and provocative strategies exists, then no other type of strategy would ever be better. This is exactly what occurs in this run of the genetic algorithm. The all-defecting strategy appears early in the run and appears to be the best strategy. The genetic algorithm eventually created enough programs which are nice and provocative, and they are able to outscore the all-defecting programs. Once this occurs, this strategy is never outscored by any other strategy found by the genetic algorithm.
This strategy was not the exact one predicted by Axelrod to be the best for the general Prisoner's Dilemma tournament (i.e. Tit-For-Tat), but it does match many of the results found by Dacey and Pendegraft. It can be shown that when using only those strategies used in Dacey and Pendegraft's study, the strategy of cooperating until defected on, then permanently defecting, is the optimal one. Since it is their work that inspired the grammar producing the programs, this could be the reason why Tit-For-Tat did not win. The strategy of cooperating until defected on, then permanently defecting, is considered nice and provocative, but it is not forgiving. This would mean that it would do just as well or better than Tit-For-Tat in a world of nice strategies (since neither program would defect first) and both would respond when defected on. Tit-For-Tat would do better if it played against programs that tried to encourage cooperation after having defected. This type of strategy can be made by the grammar used in the genetic algorithm, but it is not very common. This may be one reason why, in this particular environment, Tit-For-Tat does not appear to be optimal.

In all of the strategies developed, each one eventually learned to defect on the last move. It can be shown that in a limited number of Prisoner's Dilemma games, the optimal strategy always includes a defect on the last move. Examples of the programs generated by the genetic algorithm for the different Prisoner's Dilemma problems are shown in Appendix B.

The significance of this work is applying the genetic algorithm in the generation of real source code and showing that the genetic algorithm can be used to find optimal solutions to less than trivial problems when a proper set of productions is used.

APPENDIX A: GRAMMAR USED IN PRISONER'S DILEMMA

The following is a grammar that is used to produce the programs for the Prisoner's Dilemma problem.
This grammar produces programs which can check a variety of items before returning either a 1 (which represents a cooperation move) or a 2 (which represents a defect move).

1) index --> (COND (T action))
2) index --> (COND index-cond-term (T action))
3) index-cond-term --> (logical action)
4) index-cond-term --> (logical action) index-cond-term
5) action --> 1
6) action --> 2
7) logical --> (NOT logical)
8) logical --> (l-op logical logical)
9) logical --> (EQUAL NROUND 1)
10) logical --> (EQUAL NROUND 10)
11) logical --> (EQUAL OP-PLAY action)
12) logical --> (EQUAL OP2-PLAY action)
13) logical --> if-any
14) l-op --> AND
15) l-op --> OR
16) if-any --> PAST-DEF-OP
17) if-any --> PAST-COOP-OP

In the previous grammar, any item written in all capital letters is used in the final program. Index is the starting point for all programs. If production 1 is picked, then index becomes a COND statement with only a TRUE test (which by definition is always true). Action is a non-terminal and can be converted to either a 1 or a 2; this represents either a cooperation move or a defect move. If, in the beginning, production 2 was chosen, then a program with a COND and one or more index-cond-term productions is used. These become some kind of logical test which must evaluate to a true statement before a corresponding action will be done. Some of the tests that can be generated in a program include: check to see if this is round 1 of the game (production 9) or round 10 (production 10), check the last play the opponent made (production 11), check the play the opponent made two moves ago (production 12), check if the opponent ever defected in the past (production 16) or if the opponent has ever cooperated in the past (production 17). The symbols created by productions 16 and 17 are prewritten functions created for this particular problem which become true if a defect or cooperation move has ever been made.
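One way to realize such a grammar as a random program generator is sketched below. The encoding is ours: each nonterminal maps to its list of alternatives, with a non-recursive alternative listed first so that generation can be forced to terminate beyond a depth cap (in the paper, the gradually increasing terminal percentage t plays this role and is not modeled here).

```python
import random

# Appendix A grammar, re-encoded: keys are nonterminals, values are lists
# of alternative right-hand sides; any symbol without a rule is a terminal.
GRAMMAR = {
    "index": [["(COND", "(T", "action", "))"],
              ["(COND", "index-cond-term", "(T", "action", "))"]],
    "index-cond-term": [["(", "logical", "action", ")"],
                        ["(", "logical", "action", ")", "index-cond-term"]],
    "action": [["1"], ["2"]],
    "logical": [["(EQUAL NROUND 1)"], ["(EQUAL NROUND 10)"],
                ["(EQUAL OP-PLAY", "action", ")"],
                ["(EQUAL OP2-PLAY", "action", ")"],
                ["if-any"],
                ["(NOT", "logical", ")"],
                ["(", "l-op", "logical", "logical", ")"]],
    "l-op": [["AND"], ["OR"]],
    "if-any": [["PAST-DEF-OP"], ["PAST-COOP-OP"]],
}

def generate(symbol="index", depth=0):
    """Expand a nonterminal into program text. Past the depth cap, always
    take the first (non-recursive) alternative, so expansion terminates."""
    if symbol not in GRAMMAR:
        return symbol
    alts = GRAMMAR[symbol]
    alt = alts[0] if depth > 10 else random.choice(alts)
    return " ".join(generate(s, depth + 1) for s in alt)
```

Every generated string starts with (COND and ends with a (T action) clause, as productions 1 and 2 require.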
These checks can be combined using logical "AND" and "OR" statements (production 8 with productions 14 and 15).

APPENDIX B: PROGRAMS GENERATED BY THE GENETIC ALGORITHM

The following are examples of the programs produced by the genetic algorithm for the Prisoner's Dilemma. Each program is designed to work in a Lisp environment. The genetic algorithm was used to produce several different strategies for different environments for the Prisoner's Dilemma problem. The following represents two different strategies, one called Tit-For-Two-Tats by Axelrod [1] and one called Cooperation-Permanent Defection by Dacey and Pendegraft [3].

Tit-For-Two-Tats

(COND ((EQUAL NROUND 10) 2)
      ((NOT (EQUAL OP2-PLAY 2)) 1)
      ((NOT PAST-DEF-OP) 2)
      ((EQUAL OP-PLAY 2) 2)
      (T 1))

This strategy defects on the last move, which is when the round (stored in the variable "NROUND") is ten. If the opponent's action two moves ago (stored in OP2-PLAY) was not a defect (represented as a "2"), then it cooperates (represented as a "1"). The third condition checks to see if the opponent has ever defected (stored in PAST-DEF-OP); this condition would cause a defection if the opponent never defected. In this program, this condition is never reached, since if the opponent never defected, then the opponent would not have defected two moves ago and the second condition would have been true. Therefore, the only time this program defects is if condition four is true and the conditions before it are false: specifically, if the opponent defected two moves ago and the last move (stored in OP-PLAY) was also a defect. Any other conditions would produce a cooperation move from the true condition in the last line of the program.

Cooperate-Permanent Defection

(COND (PAST-DEF-OP 2)
      (T 1))

This strategy is very simple to produce, since one of the built-in procedures is true if the opponent has ever defected (stored in PAST-DEF-OP). This program returns a defecting move if PAST-DEF-OP is true (meaning the opponent has defected at some point).
Otherwise, it returns a cooperating move.

BIBLIOGRAPHY

[1] Axelrod, Robert, The Evolution of Cooperation, Basic Books Inc., New York, 1984.
[2] Cramer, Nichael Lynn, A Representation for the Adaptive Generation of Simple Sequential Programs, Proceedings, International Conference on Genetic Algorithms and their Applications, July 1985.
[3] Dacey, Raymond; Pendegraft, Norman, The Optimality of Tit-For-Tat, prepared for presentation, International Studies Association, March 1986.
[4] Fujiki, Cory, An Evaluation of Holland's Genetic Operators Applied to a Program Generator, Master of Science Thesis, Department of Computer Science, University of Idaho, 1986.
[5] Hicklin, Joseph F., Application of the Genetic Algorithm to Automate Program Generation, Master of Science Thesis, Department of Computer Science, University of Idaho, 1986.
[6] Holland, John H., Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.
[7] Smith, Stephen F., Flexible Learning of Problem Solving Heuristics Through Adaptive Search, Proc. 8th IJCAI, August 1983.

Optimal Determination of User-oriented Clusters: An Application for the Reproductive Plan

by Vijay V. Raghavan* and Brijesh Agarwal
The Center for Advanced Computer Studies
University of Southwestern Louisiana
Lafayette, LA 70501-4330

Abstract

A typical Information Retrieval system is required to be able to satisfy user queries in an efficient and effective way. Furthermore, the system should be able to adapt itself to changing user patterns. Recent work in user-oriented clustering attempts to meet the above objectives by identifying document clusters that are consistent with the similarities perceived by the users, rather than those hypothesized by the designer. This paper proposes a representation framework for the application of an adaptive scheme in order to find optimal user-oriented clusters.
It is shown, in particular, that the reproductive plan proposed by Holland [7] is a promising solution strategy for the optimal determination of the boundaries between clusters.

1. Introduction

Cluster analysis is an important tool employed in information retrieval to enhance both efficiency and effectiveness of the retrieval process. The retrieval strategies based on classification of documents can lead to greater efficiency by the search being restricted to just a few selected clusters; they can achieve a greater effectiveness if clusters are formed in such a way that, when the appropriate clusters are selected for retrieval (or detailed examination), the number of relevant documents obtained relative to the number of documents retrieved (or examined) is high. This implies that the clustering algorithm must ensure that documents that are likely to be relevant to the same queries are placed together in the same clusters.

Although many existing strategies for document classification consider the summarization of the data and the identification of the "natural" or "homogeneous" clusters as the primary objective, it is also attractive to specify some extrinsic criterion, based on the requirements of the application environment, according to which a placement of documents into classes is to be assessed. In the latter case, the problem of cluster analysis becomes simply an instance of combinatorial optimization problems.

* Currently on leave from the University of Regina, Regina, Canada, S4S 0A2.

When performing cluster analysis by formulating the problem as one of function optimization, one can choose from two general directions for developing a solution. On the one hand, prior knowledge of specific properties of the function to be optimized (e.g. unimodality, continuity, etc.) can be exploited in tailoring a particular strategy to the problem at hand.
On the other hand, either because prior knowledge is not available or because the optimization criterion is known to be complex (and, therefore, not amenable to optimization by "standard" techniques), one could seek a solution strategy that is effective for a broad class of complex criterion functions. The classification of documents for information retrieval, in our judgement, requires the adoption of criterion functions that are quite ill-behaved. Consequently, in what follows, we develop an approach involving a robust strategy for function optimization.

By now the connection between adaptive processes and function optimization problems is well established. Several adaptive plans have been proposed and, in particular, the reproductive plan by Holland [7] has been fairly well investigated as an approach to function optimization [1-4,16]. The conclusions are that the algorithm is robust and, for the cases where the criterion functions are ill-behaved, more efficient relative to standard techniques.

Thus, there is motivation for considering the reproductive plan as an approach for performing cluster analysis. In other words, we suggest that each possible classification (a placement of objects into clusters) be considered a device, and it is desired to identify progressively better performing devices vis-a-vis a clustering criterion. This view of clustering has already been investigated in Raghavan and Birchard [13]. However, in that work, certain problems of representation were not adequately resolved. The result was that the basic genetic operators such as crossover, double-crossover and mutation did not prove effective. In the current work, a framework in which such representational problems do not exist is identified. As a result, we are able to assert that the genetic algorithm is again a promising direction to investigate.

In the next section, the general framework that characterizes the user-oriented clustering approach is presented.
Then, in section 3, the definition of the problem, for which the use of the genetic algorithm is being advocated, is given. The details of the structure representation and operator specification are developed in section 4. Finally, section 5 summarizes the contributions of this work.

2. The Problem Framework

In the context of information retrieval systems, a new approach to document clustering called user-oriented document clustering (also known as adaptive clustering) is emerging. This approach has been investigated by Yu et al. [16], and Deogun and Raghavan [14]. User-oriented clustering can be seen as a mechanism whereby the classification is based on the user perception of clusters, rather than on some similarity function perceived by the designer to represent the user criteria.

2.1 Overview of Adaptive Clustering

The adaptive clustering process is basically a two-stage clustering scheme. Each document is initially assigned a position on the line (-∞, ∞). In the first stage, as a query is processed, documents relevant to this query are moved closer; then some randomly selected documents are moved away from their centroid. This step is repeated for several queries. In the second stage, the clusters are identified. For the purpose of this paper, the first stage of the algorithm is adapted from Yu et al. The second stage is based on the formulation of the problem developed by Raghavan and Deogun [14]. The important details in the algorithm are as follows:

Stage 1

1. Given documents d_1, d_2, ..., d_N, each document d_i is assigned an arbitrary position p_i on a real line (-∞, ∞), for 1 ≤ i ≤ N.

2. Given a (next) query:
   a. use the (actually) stored clusters to provide a response;
   b. obtain feedback from the user as to which documents are relevant to the query;
   c.
modify the position of documents on the line, so that the documents accessed for this query occupy positions closer to each other on the line, while ensuring that points corresponding to all documents do not eventually bunch up in a small interval.

3. a. If the clustering based on the positions of the documents on the line (see Stage 2 for details) is significantly different from the stored clusters, then reorganize the stored clusters according to the new clusters indicated.
   b. Go to step 2 (next query).

Stage 2

In this stage, optimal clusters are to be obtained by defining boundaries at suitable points on the document line. The basic steps are:

1. Define possible boundary points (points on the document line where a boundary between two clusters can be defined).

2. From the set of possible boundary points, select some points as actual boundaries to obtain a particular set of clusters (a classification).

3. Evaluate the performance of this classification relative to the given set of queries, according to some performance criteria. Terminate if a sufficiently well performing classification has been obtained.

4. Otherwise, generate a potentially better performing classification by modifying the current one. Repeat step 3.

2.2 Earlier Work on Adaptive Clustering

Several research investigations have been carried out, in the last few years, on adaptive document clustering [6,14-17]. The main emphasis of these activities has been on ways of accumulating the information gathered through user feedback (analogous to Stage 1 above). In Raghavan and Sharma [15], the idea of arranging the documents on a line, proposed in [16], was retained, but the conditions under which two documents would be moved closer to or apart from each other are modified. The results indicate that the modification is computationally more efficient.
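Step 2c of Stage 1 can be illustrated with a toy update rule. The constants and the exact moves below are our invention for illustration only; the actual update scheme of Yu et al. differs in detail.

```python
import random

def update_positions(pos, relevant, pull=0.5, push=1.0, jitter=2):
    """Illustrative Stage 1, step 2c: pull the documents relevant to the
    current query toward their centroid, then push a few randomly chosen
    other documents away from it, so the whole collection cannot bunch
    up in a small interval. `pos` maps document id -> line position."""
    centroid = sum(pos[d] for d in relevant) / len(relevant)
    for d in relevant:
        pos[d] += pull * (centroid - pos[d])        # move closer together
    others = [d for d in pos if d not in relevant]
    for d in random.sample(others, k=min(jitter, len(others))):
        pos[d] += push if pos[d] >= centroid else -push   # move away
    return pos
```

Repeated over many queries, documents that are relevant to the same queries drift into tight groups on the line, which Stage 2 then cuts into clusters.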
The work by Deogun and Raghavan [6], in contrast, hypothesized that summarizing the usage patterns on a line is too restrictive in the sense that much useful information is lost. Consequently, an entirely new procedure that employs a weighted graph as the structure for summarizing feedback on usage patterns is proposed. The preliminary results show that this approach is effective. However, that approach has not yet been adequately compared with others existing in the literature to determine its relative performance. More recently, Oommen and Ma [11] have characterized the adaptive clustering problem within the context of certain learning automata.

Another aspect of clustering documents in this manner concerns the use of the clusters so obtained during retrieval. Yu et al. do not investigate any concrete retrieval scheme. Instead they show that performance will be good if a scheme that is able to select the "best" clusters is devised. Thus, this problem is still essentially open. The difficulty is that the descriptors or properties associated with documents and queries play no part during clustering, and the system, while knowing what the clusters are, has no knowledge of how documents in one cluster can be distinguished from those in another. A promising approach based on Bayesian decision theory is suggested in Deogun and Raghavan [6]. This and a few other alternatives are currently under investigation.

3. The Boundary Selection Problem

Concepts such as positive, negative and boundary regions, introduced by Pawlak [12] in the context of the development of rough set theory, are used to develop a formulation of a suitable clustering criterion. A positive region [POS(q)] for a query q is defined to be the set of documents in the clusters that are completely included in the relevant set of the query. Similarly, a negative region [NEG(q)] is defined to be the set of documents in the clusters that are completely excluded from the relevant set of the query.
Documents in the remaining clusters define the boundary region [BND(q)]. Clearly, an "ideal" clustering for a given set of queries will be the one in which, for each query, we have only positive and negative regions but no boundary regions. Therefore, in finding a clustering our motivation is to maximize the positive and negative regions collectively for all queries in the given set.

This leads us to the following function as a measure of effectiveness for a given clustering C = {C_1, C_2, ..., C_r}:

    φ_C = Σ_{i=1}^{s} w_i (|POS_C(q_i)| + |NEG_C(q_i)|) / |D|

where q_1, q_2, ..., q_s are the queries, w_i is the weight associated with query q_i, and D is the set of documents over which the partition C is defined. The objective is to find a clustering C of document set D that maximizes φ_C with respect to the given set of queries. Since Σ_{i=1}^{s} w_i can be considered to be a constant, the objective of maximizing φ_C is equivalent to minimizing

    r_C = Σ_{i=1}^{s} w_i |BND_C(q_i)|.

r_C can be interpreted as the cost associated with the choice of clustering C. It is not difficult to see that this cost function is biased in favor of forming many small clusters. So the following constraint is adopted:

    Σ_{i=1}^{s} w_i R_i ≤ B,

where the w_i are the query weights as earlier, R_i is the number of sets among C_1, C_2, ..., C_r such that C_j ∩ POS_C(q_i) ≠ ∅, and B is a bound. Thus we seek a clustering C for which

    r_C = Σ_{i=1}^{s} w_i |BND_C(q_i)|

is minimized subject to Σ_{i=1}^{s} w_i R_i ≤ B. This problem is called the Boundary Selection Problem (BSP) [14].

4. The Reproductive Plan

4.1 Representation

Let b_0, b_1, ..., b_m be the possible boundary points. We denote a classification, C, by an (m+1)-dimensional binary vector X = (x_0, x_1, ..., x_m), where x_0 = x_m = 1 and, for each j, 1 ≤ j < m,

    x_j = 1 if b_j is an actual boundary in C, and x_j = 0 otherwise.

As an example, let m = 8 and consider eleven documents arranged on the line, with the possible boundary points falling after documents 2, 4, 5, 6, 7, 9 and 10, together with the two end points. Consider the structures

    X_1 = 101100011    X_2 = 110101001    X_3 = 110100011
    X_4 = 101100001    X_5 = 110101011

The clusterings corresponding to X_1, X_2, X_3, X_4, X_5 are:

    X_1 : {{1,2,3,4}, {5}, {6,7,8,9,10}, {11}}
    X_2 : {{1,2}, {3,4,5}, {6,7}, {8,9,10,11}}
    X_3 : {{1,2}, {3,4,5}, {6,7,8,9,10}, {11}}
    X_4 : {{1,2,3,4}, {5}, {6,7,8,9,10,11}}
    X_5 : {{1,2}, {3,4,5}, {6,7}, {8,9,10}, {11}}

Mutation of X_3 and X_1 at positions 7 and 4, respectively, would result in new structures, X_11 and X_12, as follows:

    X_11 : 110100011 => 110100111

The resulting clusters are: {{1,2}, {3,4,5}, {6,7,8,9}, {10}, {11}}.

    X_12 : 101100011 => 101000011

Similarly, the corresponding clusters are: {{1,2,3,4}, {5,6,7,8,9,10}, {11}}.

Suppose q_1 = {1, 2, 6, 7, 8, 11} is the relevant set for query q_1. Then the associated POS, NEG and BND regions would be:

    X_1  : POS = {11};          BND = {1,2,3,4,6,7,8,9,10};     NEG = {5}
    X_2  : POS = {1,2,6,7};     BND = {8,9,10,11};              NEG = {3,4,5}
    X_3  : POS = {1,2,11};      BND = {6,7,8,9,10};             NEG = {3,4,5}
    X_4  : POS = ∅;             BND = {1,2,3,4,6,7,8,9,10,11};  NEG = {5}
    X_5  : POS = {1,2,6,7,11};  BND = {8,9,10};                 NEG = {3,4,5}
    X_11 : POS = {1,2,11};      BND = {6,7,8,9};                NEG = {3,4,5,10}
    X_12 : POS = {11};          BND = {1,2,3,4,5,6,7,8,9,10};   NEG = ∅

In the original population, X_5 had the best utility with respect to q_1, since its BND set was smallest. This performance can be attributed to the subunits: pattern 11 in positions 0, 1 and pattern 101 in positions 3, 4, 5. Notice how the offspring X_11, which shares the first of these subunits, continues to perform well. On the other hand, X_4 had a low utility with respect to q_1, and the subunits contributing to poor performance are pattern 101 in positions 0, 1, 2 and pattern 10000 in positions 3, 4, 5, 6, 7. The offspring X_12, which still has the first subunit, continues to perform poorly.

Mutations generate a new classification by breaking a cluster or by merging two clusters. Crossovers help in producing new combinations of clusters.
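Decoding a boundary vector into clusters, and computing the POS, NEG and BND regions, can be sketched directly. The placement of the possible boundary points after documents 0, 2, 4, 5, 6, 7, 9, 10 and 11 is inferred from the worked example (the position of b_4 never arises there, so it is an assumption):

```python
# b_j lies after document BOUNDARY_AFTER[j]; b_0 and b_8 are the end points.
BOUNDARY_AFTER = [0, 2, 4, 5, 6, 7, 9, 10, 11]   # b_4 = after doc 6 is assumed

def clusters(x):
    """Decode a binary boundary string into a list of document clusters."""
    cuts = [BOUNDARY_AFTER[j] for j, bit in enumerate(x) if bit == "1"]
    return [set(range(a + 1, b + 1)) for a, b in zip(cuts, cuts[1:])]

def regions(x, relevant):
    """POS, NEG and BND regions of a classification for one query."""
    pos, neg, bnd = set(), set(), set()
    for c in clusters(x):
        if c <= relevant:
            pos |= c                  # cluster wholly inside the relevant set
        elif c.isdisjoint(relevant):
            neg |= c                  # cluster wholly outside it
        else:
            bnd |= c                  # cluster straddling the boundary
    return pos, neg, bnd
```

For q_1 = {1, 2, 6, 7, 8, 11}, regions("110101011", q1) reproduces the POS, NEG and BND sets listed above for X_5.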
Now we have seen how a diverse set of solutions can be generated by a few elementary operations and how the operations affect the payoff (or utility) of each structure. Also, this representation does not have the "labelling anomaly" of the previous approach [13]. Consequently, it is expected that the basic genetic operations would suffice for our application; i.e., no application-specific operators are necessary.

4.2 Discussion of the Approach

The classification scheme considered represents a novel perspective on the process of cluster analysis. When compared to approaches, for example, in the pattern recognition literature, it seems to resemble paradigm-oriented classification. That is, the clusters identified depend on the feedback from users as to which documents are relevant to their queries. The fact that it resembles paradigm-oriented pattern classification gives some assurance that the use of the reproductive plan can be effective, since Cavicchio [4] has been successful in applying the algorithm to the character recognition problem. However, the similarity of the problem addressed to paradigm-oriented classification is not that great. This can be realized by observing that the classification is not done separately for individual queries, whereby a distinction is drawn between relevant and non-relevant documents, but rather in a global sense. In other words, the clusters are required to transcend the specific queries on the basis of which they were generated and be useful in retrieving the relevant documents even for queries not encountered at any earlier time. In this context it is important that new classifications (structures) developed during the clustering process be made up of well-performing schemas (subunits). This aspect points to the fact that some sort of "homogeneous" clusters, of interest to many past and potential users, are being sought.
Hence, there is also some resemblance between our objectives and those of the classification strategies that fit in the class of "unsupervised" learning methods. We believe that this new perspective on classification is very useful for the retrieval environment and should be of value in other contexts as well. It is also interesting that stages 1 and 2 of the proposed clustering strategy would be, in the terminology of Yu et al., called adaptive document clustering. Within this broader adaptive system, we propose to apply the reproductive plan to solve the BSP. Thus, what we are developing is a layered system where on the one hand the positioning of documents on the line is adapted in response to certain environmental influences and on the other hand the organization of these documents into clusters is adapted according to the clustering criterion. For the latter stage, the use of the genetic algorithm is advocated.

In the earlier investigation of the genetic algorithm as a way of obtaining an optimal classification [13], the concept of arranging documents on a line did not exist. Consequently, a classification was represented as a string of cluster labels: for a classification of N documents into r clusters, the string is of length N and each position on the string would be assigned a value between 0 and r-1 indicating the label of the cluster to which the document belonged. Thus, there was not only the problem that at each locus there would be a large number of alleles, but also that there was what we called the "labelling anomaly". That is, a cluster having label l in one structure may be totally disjoint from and unrelated to a cluster also labeled l in another structure. Such problems lead to the basic genetic operators being ineffective in generating better performing, and often even valid, structures. It is easily seen that the representation adopted here circumvents such difficulties. The implementation of these ideas is underway.

5.
Conclusion

In information retrieval, a novel approach to cluster analysis in which clusters are modified as relevance feedback from users is obtained, known as user-oriented document clustering, is actively being investigated. A combinatorial optimization problem that arises in that context requires the identification of cluster boundaries in such a way that a prescribed retrieval criterion is optimized. It is shown, by specifying a structure representation and providing examples of operator application, that the reproductive plan is a promising approach for solving this problem.

Acknowledgements

This research is supported in part by a grant from NSERC of Canada.

References

1. Bethke, A., "Genetic Algorithms as Function Optimizers", Doctoral Thesis, CS Department, University of Michigan, 1981.
2. Brindle, A., "Genetic Algorithms for Function Optimization", Doctoral Thesis, Department of Computing Science, University of Alberta, 1980.
3. Cavicchio, D.J., "Adaptive Search Using Simulated Evolution", TR 03296-T, Computer and Communications Dept., University of Michigan, 1970.
4. De Jong, K., "Adaptive System Design: A Genetic Approach", IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-10, no. 9, pp. 566-574, Sept. 1980.
5. De Jong, K., "A Genetic Based Global Function Optimization Technique", TR 80-2, Department of Computer Science, University of Pittsburgh, 1980.
6. Deogun, J.S. and Raghavan, V.V., "User-oriented Document Clustering: A Framework for Learning in Information Retrieval", Proc. of the Ninth International ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 157-163, Pisa, Italy, Sept. 1986.
7. Holland, J.H., "Adaptation in Natural and Artificial Systems", The University of Michigan Press, 1975.
8. Holland, J.H. et al., "Computational Implementation of Inductive Systems", in Induction: Processes of Inference, Learning, and Discovery, pp. 103-150, The MIT Press, 1986.
9.
Mauldin, M.L., "Maintaining Diversity in Genetic Search", Proc. AAAI-84, pp. 247-250, August 1984.
10. Mercer, R.E. and Sampson, J.R., "Adaptive Search Using a Reproductive Meta-Plan", Kybernetes, vol. 7, pp. 215-228, 1978.
11. Oommen, B.J. and Ma, D., "Fast Object Partitioning Using Stochastic Learning Automata", Proc. of the Tenth International ACM SIGIR Conf. on Research and Development in Information Retrieval, to appear, New Orleans, June 1987.
12. Pawlak, Z., "On Learning -- A Rough Set Approach", in Lecture Notes in Computer Science, No. 208, Skowron, A., Ed., Springer-Verlag, Berlin, 1986.
13. Raghavan, V.V. and Birchard, K., "A Clustering Strategy Based on a Formalism of the Reproductive Process in Natural Systems", Proc. of the Second International ACM SIGIR Conference on Information Retrieval, pp. 10-22, Dallas, Sept. 1979.
14. Raghavan, V.V. and Deogun, J.S., "Optimal Determination of User-oriented Clusters", Proc. of the Tenth International ACM SIGIR Conf. on Research and Development in Information Retrieval, to appear, New Orleans, June 1987.
15. Raghavan, V.V. and Sharma, R.S., "A Framework and Prototype for Intelligent Organization of Information", Canadian Journal of Information Science, 1987, to appear.
16. Yu, C.T., Wang, Y.T. and Chen, C.H., "Adaptive Document Clustering", Proc. of the Eighth Annual International ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 197-203, Montreal, Canada, 1985.
17. Wetzel, A., "Evaluation of the Effectiveness of Genetic Algorithms to Combinatorial Optimization", Doctoral Thesis, Department of Library and Information Science, University of Pittsburgh, 1983.

THE GENETIC ALGORITHM AND BIOLOGICAL DEVELOPMENT

by Stewart W.
Wilson
The Rowland Institute for Science, Cambridge MA 02142

Abstract

A representation for biological development is described for simulating the evolution of simple multicellular systems using the genetic algorithm. The representation consists of a set of production-like growth rules constituting the genotype, together with the method of executing the rules to produce the phenotype. Examples of development in 1-dimensional creatures are given.

1. Introduction

The genetic algorithm [1] incorporates mechanisms which resemble the mechanisms of reproduction, variation, and selection found in natural evolution, but, despite successes in several fields of application, there has been little attempt to use the algorithm as a tool to investigate, through simulation, natural evolution itself. Considerable work exists on the ontogenetic evolution of behavior, e.g., learning [2-4], but relatively little on the evolution of organisms per se [5]. The main reason has been the absence of representations for organisms which would permit the genetic algorithm to be brought to bear. The genetic algorithm observes the genotype-phenotype distinction of biology: the algorithm's variation operators act on the genotype and its selection mechanisms apply to the phenotype. In biology, the genotype-phenotype difference is vast: the genotype is embodied in the chromosomes whereas the phenotype is the whole organism that expresses the chromosomal information. The complex decoding process that leads from one to the other is called biological development and is essential if the genotype is to be evaluated by the environment.
Thus to apply the genetic algorithm to natural evolution calls for a representational scheme that both permits application of the algorithm's operators to the genotype and also defines how, based on the genotype, organisms are to be "grown", i.e., their development. The present paper outlines a few steps in the direction of such a representation [6]. The problem is addressed at the level of cells, which are treated as "black boxes" having well-defined properties. Beginning with the fertilized egg, the cells are to divide, move, and differentiate under the control of rules so as eventually to form a mature organism. An attempt is made to respect major facts known about cells and these processes, but large compromises must occur at this point in the effort to approach algorithmic workability. The principal objective is to describe a representational framework--a sort of "developmental automaton"--sufficiently completely that randomly generated instances will grow and can be evolved under the genetic algorithm in computer experiments.

2. Evolution of Development

The problem of applying the genetic algorithm to the development of multi-cellular organisms can be divided into four parts: plan, expression, selection, and variation.

2.1 Plan

In nature, the genotype contains (1) information that is descriptive, through the action of development and the environment, of a range of possible phenotypes, and (2) information encoding the developmental process itself, i.e., how to go about making a phenotype from a genotype. Both kinds of information are of course inherited and subject to variation and natural selection. Here, for simplicity, it will be assumed that only the first kind of information, termed the organism's plan, is heritable and subject to the genetic algorithm. The other kind, the rules for expressing the plan to form the phenotype, will be regarded as fixed. What should the plan look like?
Several observations on natural systems are suggestive [7]. In the first place, though individual cells can have different sizes and can change in size, growth occurs primarily through cell division: one cell becomes two "daughter" cells. Second, depending on the situation, the daughters can be phenotypically the same as the parent, they can differ from the parent but not from each other, or they can differ from the parent and from each other. Third, the phenotypical outcome of cell division can depend not only on the nature of the parent cell, but also on factors related to the cellular, chemical, or physical context in which the parent cell is embedded. Finally--and pivotal for this discussion--all cells in an organism are considered to contain the same genetic information, though some of it may become in some sense "switched off" or inoperative during differentiation.

These observations have suggested the following working proposal. The plan will take the form of a so-called production system program (PSP) consisting of a finite number of production (condition-action) rules which will be termed growth rules. The growth rules have the general form

    X + K0 => K1 K2

The K's stand for cell phenotypes and X represents the local context; the symbol "+" means conjunction. In addition, each growth rule has associated with it a weight w. Every cell in the organism contains the same set of rules, or PSP.

Focussing attention on a particular rule in the PSP of a particular cell, the condition side of the rule is satisfied if that cell is (phenotypically) of type K0 and the context matches X. The action recommended by the rule is to replace the cell by two new cells, one phenotypically of type K1, the other of type K2. Whether or not this rule controls the parent cell's fate depends on whether the rule is selected for expression, as discussed in the next section. The general growth rule form is open to many special cases.
As in nature, the daughter cells may or may not be the same as the parent or each other. Furthermore, some rules may contain just one daughter cell, identical to the parent; such a rule, if expressed, means that cell division does not take place. Also, some rules may have no cell in their action parts, corresponding to dissolution of the parent cell. Some rules may have no term corresponding to X; their condition is satisfied independent of context. In the other rules, X can take on several forms. Most simply, X can stand for the presence of a cell of a particular kind adjacent to K0. In this case ("adjacency" type context), the spatial relation of the X cell and K0 may affect the spatial relation of the daughter cells (if there are two). Another kind of X ("signal" type) would stand for a detector for signals emitted by other cells, not necessarily in the immediate neighborhood. For present purposes, the "signal" emitted by a cell is simply a list of its phenotypical properties. The predominant direction from which matched signals are received could affect the daughter cells' spatial relation. Still another kind of X would detect aspects of the physical environment such as intercell pressure.

2.2 Expression

Since all cells contain the same "program", differential development of the system depends on the selection for expression of different rules in different cells. This is not difficult in principle, since once some differentiation occurs, the sensitivity of the rules to cell type and context will lead to further differentiation. The proposed expression mechanism consists of a match step and a decision step. Again focussing attention on a particular cell, in the match step the cell first identifies those program rules which have satisfied conditions. Then, from this match set, the cell chooses a single rule for expression.
The chosen rule "carries out its right-hand side", i.e., daughter cells are produced as prescribed and their signals are emitted.

The system's growth process is envisioned as a series of discrete time-steps. On each step, the expression mechanism operates in every cell of the current system. The operation is regarded as "parallel" in the sense that offspring of all the cells are produced simultaneously. The offspring cells then undergo, in accordance with their phenotypical properties, a process of interaction and spatial accommodation so as to form the "new" system to be input to the expression mechanism in the next time-step.

2.2.1 The decision step

The decision step of the expression mechanism makes use of the growth rule weights w and the effect of signals from nearby cells. Each growth rule in the match set has an associated weight w. If a rule's context (X) part is either absent or is of adjacency type, its excitation is defined to be just w. However, if a rule's context part is of signal type, the rule's excitation is defined to be the product of the weight w and the intensity of the received context signal. For example, suppose that a certain rule has an X which matches signals S_A emitted by nearby cells A. Suppose further that the total intensity of the signals is simply their number n_A times a constant f. Then the excitation of the rule in question would equal f n_A w.

The cell decides which match set rule to express using a probability distribution over the rules' excitations. That is, the probability that a particular rule will be picked is equal to its excitation divided by the sum of the excitations of the rules in the match set. The following three rules offer an interesting example.

    A => A A           (w_1)
    (S_A) + A => A     (w_2)
    (S_A) + A => 0     (w_3)

The first rule, termed "reproductive", takes one cell A and leaves two in its place.
The second rule, termed "inhibitory", matches cell A, senses the presence of at least one A-type signal in the vicinity, and seeks, if chosen, to maintain the status quo exactly. The third rule, a deletion rule, has the same condition as the inhibitory rule, but seeks to delete the matched A cell. Each rule has a weight, as shown. Suppose now the system consists of an aggregate of n_A cells of type A. In any cell, the excitations of the three rules will be:

    e_1 = w_1,  e_2 = f n_A w_2,  e_3 = f n_A w_3.

If w_1 is large and there are relatively few cells, the reproductive rule will be chosen most of the time and the aggregate will grow. As it does, however, the excitations of the inhibitory and deletion rules will increase relative to that of the reproductive rule, due to n_A. The growth rate will slow down. Eventually, an equilibrium will be reached where net growth is zero. At that point, the probability of reproduction equals the probability of deletion, or w_1 = f n_A w_3. Solving for n_A yields the system's equilibrium size:

    n_A* = (1/f)(w_1/w_3).

The system's net growth rate dn/dt prior to equilibrium can be calculated by taking the product of n and the difference between the probabilities of reproduction and deletion. Dropping the "A" subscripts, the result is

    dn/dt = n (1 - n/n*) / (1 + ((w_2/w_3) + 1)(n/n*)),

showing that the system's growth rate can be "chosen" independently of its equilibrium size.

Though simple, the example is important because it illustrates one way in which the cellular program can manage the fundamental problem of bounded growth. Later examples of differentiation into finite regions of homogeneous cell type will assume the presence of growth rule sets of this or similar sort for the regions.

2.2.2 Phenotype properties

Once the decision step has picked a rule for expression, the daughter cells in the action part must be simulated, which means simulating their properties. In a real organism, each cell "type" has a myriad of physical and biochemical properties.
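Returning briefly to the bounded-growth example of Section 2.2.1, its equilibrium prediction n_A* = (1/f)(w_1/w_3) can be checked numerically. The Python sketch below iterates the expected cell count under the three rules; the weight values and the constant f are arbitrary illustrative choices, not taken from the paper.

```python
def expected_step(n, w1, w2, w3, f):
    """Advance the expected size n of an all-A aggregate by one time-step.

    Excitations as in Section 2.2.1: e1 = w1 (reproduce),
    e2 = f*n*w2 (inhibit), e3 = f*n*w3 (delete).
    """
    e1, e2, e3 = w1, f * n * w2, f * n * w3
    total = e1 + e2 + e3
    p_reproduce, p_delete = e1 / total, e3 / total
    # Reproduction adds one cell, deletion removes one, inhibition neither.
    return n * (1.0 + p_reproduce - p_delete)

w1, w2, w3, f = 100.0, 1.0, 1.0, 0.1   # illustrative values
n_star = (1.0 / f) * (w1 / w3)          # predicted equilibrium: ~1000 cells
n = 1.0                                  # start from a single cell
for _ in range(500):
    n = expected_step(n, w1, w2, w3, f)
# n has converged to n_star: fast early growth, then a slowdown
```

The trajectory shows exactly the qualitative behavior described above: near-doubling while n is small, then convergence to the equilibrium size, which depends only on w_1, w_3 and f.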
Some of these may be more properly regarded as behavioral, e.g., during development, cells can creep, amoeba-like, to new positions. Most of the properties affect in one way or another a cell's interactions with other cells. Even if all the properties were understood, a realistic simulation would still have an enormous problem adequately representing and computing the interactions within the cell aggregate. Such a computation is necessary in order to determine the fitness, with respect to an environment, of the organism as a whole. The practical course for the present would seem to be to choose extremely simple environments, simple measures of fitness, and a very restricted range of cell properties.

2.3 Selection and variation

Because the foregoing representational framework for development takes the form of a production system program, it is straightforward to apply the genetic algorithm as the "engine" of phenotype selection and genotype variation. The application of the algorithm would be along the lines of previous work with production system programs [3,8]. One would start with a population of "egg" cells, each containing a random genotype. Each egg would undergo development and, after a standard number of time-steps, each resulting cell aggregate would be rated for fitness. The original eggs would then be copied in numbers proportional to these fitnesses to form a new population of the same size. Genetic operators would be applied to the genotypes of the new population. The cycle would be iterated through some number of generations, corresponding to evolution.

Many aspects of this scheme are quite well understood due to the research just cited and on genetic algorithms in general. However, the form of the growth rules in the genotype is somewhat unusual so some comments about coding are in order. The basic encoding would resemble that of classifiers [2].
The condition part of a rule would consist of a context taxon (for X) and a cell taxon (for K0), each being a string of length L from {1, 0, #}. The action part would consist of two cell descriptions (for K1 and K2), both strings of length L from {1, 0}. An interpreter is required to relate cell description encodings to phenotypical properties. This simply means establishing a pre-defined mapping between substrings in the cell description and properties; e.g., "110" in the 14th through 16th positions could mean the cell surface has "high stickiness", etc. To take care of rules in which one or both of the daughter cells is absent, the interpreter would simply check the setting of a certain bit in each cell description: "0", say, would mean that cell description was absent and the rest of the description should be ignored. A similar system would be used to indicate the presence or absence of a context taxon and its type (adjacency, signal, or other). A growth rule's condition would be satisfied if both (1) the cell description of the cell in which the rule finds itself matches the rule's cell taxon, and (2) at least one signal reaching the cell matches the rule's context taxon. The meaning of "match" is the same as for classifiers: the two strings must be the same at every non-# position of the taxon. The use of the "don't care" symbol # permits rule conditions to restrict their sensitivity to particular subsets of cell description and signal bits.

Calculation of the intensity of the signal matching the context taxon can be quite complex, depending on the simulation. Involved are the dependence of individual signal intensities on the distance from their sources, and also perhaps propagation delays with respect to the time-step of creation of the source cell. These factors must be predefined. In any case the total received intensity would be a sum over the individual intensities of all matched signals.
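The classifier-style match just defined -- the two strings must agree at every non-# position of the taxon -- is straightforward to state in code (a minimal sketch; the function name is ours):

```python
def matches(taxon, string):
    """True iff `string` agrees with `taxon` at every non-# position."""
    return len(taxon) == len(string) and all(
        t == '#' or t == s for t, s in zip(taxon, string))

# A taxon that cares only about three middle bits:
assert matches('##110###', '10110010')      # positions 2-4 agree
assert not matches('##110###', '10100010')  # position 3 differs
```

The same predicate serves both condition tests above: the cell taxon against the cell's own description, and the context taxon against each incoming signal.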
As noted earlier, the net direction of the received signal may in some rules determine the spatial orientation of the daughter cells. The dependence would be encoded in a special bit string associated with the daughter cell descriptions. The weight associated with a growth rule must also be encoded in order to make it, and consequently the rule's influence in the decision step, subject to the genetic algorithm. The weight would simply be concatenated, as a fixed-length binary number, with the rest of the rule string.

3. 1-D Development

As has been the case with research on cellular automata [9], the complexity of realistic three-dimensional simulations recommends initial study of one-dimensional examples. In two and three dimensions, forces between cells must lead to complicated cell movements and contortions of the "tissue". A 1-D "creature", however, could be viewed as growing inside a frictionless tube, with no forces except between adjacent cells. Cell division would lengthen the creature; deletion would shorten it. Though simple, the 1-D case can exhibit cell type configuration patterns such as symmetry, periodicity, and polarity that are analogous to patterns emerging in the development of real organisms. Some elementary examples follow.

3.1 Symmetry and periodicity

Changes in a 1-D system through time can be represented by a pyramid like the following:

         A
        B B
      D C C D

This shows three time-steps. At first, the system consists just of cell A; then A divides to form cells B and B; then the left-hand B divides to form the (oriented) pair D C, and the right-hand B yields the pair C D. Only two growth rules are required:

    A => B B
    (B) + B => C D

In the second rule the context taxon is of adjacency type (indicated by the absence of "S").
This type of rule reads: "order the output cells so that the direction from the first one to the second one is the same as the direction from the context cell to the replaced cell." Note that the pyramid diagram shows bilateral symmetry about its center line. Using additional rule sets of the self-limiting form discussed in Section 2.2.1, the C's and D's could be multiplied to yield eventually a stable symmetrical creature of finite size, D...D C...C D...D, with approximately equal groups of D cells.

The following pyramid and its rules illustrate rudimentary periodicity:

         A
        E F
      C D C D

    A => E F
    (F) + E => D C
    (E) + F => C D

Again, the addition of self-limiting rule sets would result in the creature, C...C D...D C...C D...D, in which like cell groups were approximately equal in size. It is clear that quite complex structures can be built by first establishing the pattern with non-cyclic rules (in which the cell taxon will not match the output cells), and then using self-limiting rule sets which apply to the final cell types.

3.2 Polarity

An elementary polarity results from any rule in which the output cell types differ. A polarity with respect to some phenotypical property can be set up with non-cyclic rules as follows:

         A
        B C
      D E F G

    A => B C
    (C) + B => E D
    (B) + C => F G

If in the cell descriptions of D, E, F, and G the property is, say, monotonically increasing, the amount of the property will be graded across the system. A more sophisticated gradient system occurs under the rules:

    A => B C
    (S_B) + C => D C
    C => C C
    (S_C) + C => C
    (S_C) + C => 0

If B's signal loses intensity with distance, the probability that a C will change to D C will fall with distance from the left end of the structure. The result will be a decreasing distribution of D's from left to right. The last three rules are intended to control the system's overall size. When rule sets become even slightly complicated, as in the last example, it is evident that development will be difficult to predict.
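The two-rule symmetry example of Section 3.1 can be executed directly. The following Python sketch is our own minimal interpreter for adjacency-context rules, using the orientation convention quoted above; weights and the stochastic decision step are omitted, since each cell here matches at most one rule:

```python
# Each rule is (context, cell, daughters); context None means context-free.
RULES = [(None, 'A', ('B', 'B')),
         ('B',  'B', ('C', 'D'))]

def develop(cells):
    """Apply one parallel time-step of the growth rules to a 1-D creature."""
    new = []
    for i, cell in enumerate(cells):
        left = cells[i - 1] if i > 0 else None
        right = cells[i + 1] if i < len(cells) - 1 else None
        for ctx, k0, (k1, k2) in RULES:
            if cell != k0:
                continue
            if ctx is None or left == ctx:
                # Context absent or on the left: direction from context to
                # replaced cell is rightward, so keep the rule's order.
                new.extend([k1, k2])
                break
            if right == ctx:
                # Context on the right: the direction is leftward,
                # so the daughters appear in reversed spatial order.
                new.extend([k2, k1])
                break
        else:
            new.append(cell)  # no rule applies: the cell persists
    return new

creature = ['A']
for _ in range(2):
    creature = develop(creature)
# creature == ['D', 'C', 'C', 'D'], the symmetric third row of the pyramid
```

The left-hand B sees its context on the right and so emits the reversed pair D C, while the right-hand B emits C D, reproducing the bilateral symmetry of the pyramid diagram.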
It can be hoped, however, that with the help of the genetic algorithm, the ability to design and analyse organisms in advance will not be necessary in order to build successful and interesting ones, just as in natural evolution it is not. What does seem essential is an adequate space of possible growth rules. The rule forms discussed include self-excitation, self-inhibition, and cross-excitation and -inhibition between different cell types. The repertoire seems fairly complete for a start, but modifications in it and in many other aspects will surely occur as the proposal is studied experimentally and analytically.

4. Conclusion

An extremely schematic representational framework for biological development has been described which may permit simulations of evolution using the genetic algorithm. Major questions that need to be addressed include the accuracy and adequacy of the representation and the problem of computing the phenotype. It is hoped that coupling "developmental automata" with genetic adaptive techniques will yield insights into biological, social, and other systems which are capable of growth.

Acknowledgement

The author thanks D.E. Goldberg for valuable comments on an earlier draft of the paper.

References

[1] Holland, J.H. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press, 1975.
[2] Holland, J.H. "Escaping brittleness: the possibilities of general-purpose learning algorithms applied to parallel rule-based systems." In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Volume II. Los Altos, California: Morgan Kaufmann, 1986.
[3] Smith, S. A Learning System Based on Genetic Algorithms. Ph.D. Dissertation (Computer Science), University of Pittsburgh, 1980.
[4] Grefenstette, J.J. (ed.) Proceedings of an International Conference on Genetic Algorithms and Their Applications. Pittsburgh: Carnegie-Mellon University, 1985.
[5] An important paper used mechanisms closely related to those of the genetic algorithm as defined in [1] to study the emergence of self-replication. See Holland, J.H. "Studies of the spontaneous emergence of self-replicating systems using cellular automata and formal grammars." In A. Lindenmayer & G. Rozenberg (eds.), Automata, Languages, Development. Amsterdam: North-Holland, 1976.
[6] The representation appears to share several aspects with the developmental models called L-systems. However, there do not seem to be any studies in which populations of L-systems undergo evolution. See Lindenmayer, A. "Developmental algorithms for multicellular organisms: a survey of L-systems." J. Theor. Biol., 54, 1975.
[7] For background, see Balinsky, B.I. "Development, Animal" and Waddington, C.H. "Development, Biological." Encyclopaedia Britannica, 15th Ed., Vol. 5, 625-650.
[8] Schaffer, J.D. "Learning multiclass pattern discrimination." In [4].
[9] Wolfram, S. "Cellular automata as models of complexity." Nature, 311, 4 October 1984, 419-424.

Genetic Algorithms and Communication Link Speed Design: Theoretical Considerations

Lawrence Davis
BBN Labs
10 Moulton Street
Cambridge, MA 02238

Abstract

The problem of finding low-cost sets of packet switching communication network links when the network topology has been fixed is a difficult and time consuming one. In this paper we describe the features of the problem that make it difficult, describe a genetic algorithm that solves the problem, and discuss some of the theoretical issues raised in applying genetic algorithms to this domain.

Introduction

The process of designing a packet switching communication network, as carried out at Bolt Beranek and Newman Communications Corporation (hereafter,
"BBNCC"), has four major phases:

* First, information about the traffic to be carried, the customer's existing devices and communications protocols, and other customer requirements is gathered and entered into a database.

* Second, access devices are placed such that equipment and line costs are minimized and traffic from the terminals is homed to them.

* Third, packet-switching nodes are similarly placed and traffic from terminal access devices and customer hosts is homed to them.

* In the final phase, the "backbone" phase, the designer links the nodes so that they can carry the desired traffic and meet other customer requirements on speed, reliability, connectivity, and cost. (See Figure 1 for an illustration of a backbone design.)

Susan Coombs
BBN Communications
70 Fawcett Street
Cambridge, MA 02238

This paper and its companion paper(1) concern a problem that occurs in the fourth phase of network design. When the designer first links nodes to create a backbone network topology, the links used are high-speed ones. The intent is to produce a topology that satisfies the customer's requirements. One such constraint imposed on the topology in Figure 1 is that the net be "biconnected": given the failure of any single network link or node, the network will still be connected. High-speed links are very expensive. When a working topology has been found, the designer concentrates on optimizing the network cost by reducing the speeds of some or all of the network links, while satisfying the customer's design criteria. These two problems -- that of choosing a topology and that of selecting low-cost, acceptable link speeds given a topology -- together constitute the fourth phase of communication network design. When performed by human designers, the two processes tend to be carried out sequentially, although there is a certain amount of overlap between them.
Figure 1: A Sample Backbone Topology (figure not reproduced)

¹"Genetic Algorithms and Communication Link Speed Design: Constraints and Performance," by Susan Coombs and Lawrence Davis, in this volume.

The process of choosing link speeds given a topology is complicated by a number of factors:

- The set of allowable link speeds is determined by the available services offered by the chosen carrier(s), and often varies from network to network.
- The set of customer performance requirements varies from network to network.
- The simulator used by DESIGNet (BBNCC's interactive network design tool, implemented in Zetalisp on Symbolics Lisp Machines) to assess the performance of a trial network is a stochastic one; understanding the effects of changing a link's speed is much more difficult when such effects change between trials.
- The network performance constraints may be stated in stochastic terms (e.g., the delay of each link in the network must be below 500 milliseconds in 80% of the simulations), and it is difficult to know rapidly whether they have been satisfied.
- The differences in the costs of links of different speeds are nonlinear, and often non-intuitive. They depend on the carrier and/or on the modems chosen by the customer, if analog lines are being used.
- Equivalent bandwidth of one higher-speed link between two nodes can be provided by multiple lower-speed links, generally at non-equivalent cost. The nodes treat the multiple links as one logical link when routing traffic.
- The effect of changing a link's size is difficult to predict and is highly interactive: traffic may be rerouted everywhere in the network as a result of changing the speed of a single link.
- The effect of changing a link's size may be counterintuitive: it occasionally happens that reducing the size of one link in the network improves the entire network's performance.
- For larger backbones, the DESIGNet simulation of network performance may take a minute or more to run, leading to lengthy intervals between a designer's changes in the network and the simulation's feedback.

Given these characteristics, and the fact that a substantial amount of the cost of a packet switching communication network can be bound up in its links, the problem of finding a low-cost set of sizes for the links in a network when its topological constraints have been met is worth solving with a computer. The stochastic nature of the data and the non-linear interaction between changes in a network design make the link speed problem a difficult one to approach with heuristic and hill-climbing systems. We hypothesized that the problem would be amenable to the genetic algorithm approach, and in the fall of 1986 we applied a genetic algorithm to a simplified version of this problem in order to measure its performance against some deterministic algorithms. The results were encouraging (they, and a more detailed description of the communication network design problem in general, are contained in an earlier paper²). On the basis of that work, we extended and modified the system for application to some real-world network designs.

The following sections of this paper discuss theoretical issues related to genetic algorithms that were relevant to creating our prototypical system and applying it to networks being designed at BBNCC. The companion paper to this one discusses issues concerned with integrating a wide variety of constraints on network performance into the genetic system, so that the genetic algorithms employed the same constraints that the human designers were using.
The companion paper also describes the results of several applications of our genetic algorithm to networks that were being designed for prospective clients of BBNCC.

Representation Issues

Genetic algorithm researchers most frequently use bit strings as chromosomal encodings of solutions to the problem they are trying to solve. Other representation techniques have been employed, but the early work in the genetic algorithm field and most of the formal results that have been proved have been based on the bit string representation. Bit strings and their crossover, inversion, and mutation operators have been extensively studied and are fairly well understood.

We did not use bit strings in the system to be described here. Instead, our chromosomes were lists of link speeds. Each speed was keyed to a link in the backbone, and each speed was taken from the list of allowable link speeds for the design in question. This chromosome, for example, encodes link speeds for a three-link network:

(2400 50000 9600)

Evaluation of such a chromosome is carried out by assigning each link the speed encoded for it on the chromosome, simulating the resultant network's performance, and evaluating that performance against the customer's requirements.

²Davis, Lawrence and Susan Coombs, "Optimizing Network Link Sizes with Genetic Algorithms." To appear in Modelling and Simulation Methodology: Knowledge Systems Paradigms, Maurice S. Elzas, Tuncer I. Ören, and Bernard P. Ziegler, editors.

Why did we choose this representation for the link speed problem? The principal motivation was our expectation that an important genetic operator would be "creep": altering the speed of a link upward or downward one or more steps in the hierarchy of allowable speeds. Encoding speeds with bit strings would not support this sort of operation.
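Concretely, the representation and its evaluation can be sketched as follows (our illustration, not code from the paper: the speed list, tariffs, traffic figures, and the 70% utilization test are hypothetical stand-ins, and `evaluate` replaces the DESIGNet simulation with a simple penalty rule):

```python
import random

# Hypothetical ordered hierarchy of allowable link speeds (bits/sec);
# the real list varies with the carrier services chosen for a design.
ALLOWABLE_SPEEDS = [2400, 4800, 9600, 19200, 50000]

def random_chromosome(num_links, rng=random):
    """A chromosome is just a list of link speeds, one per backbone link."""
    return [rng.choice(ALLOWABLE_SPEEDS) for _ in range(num_links)]

def evaluate(chromosome, traffic_per_link, cost_per_speed):
    """Stand-in for simulating the network: total link cost plus a flat
    dollar penalty for every link loaded beyond 70% of its capacity."""
    penalty = sum(1000.0 for speed, traffic in zip(chromosome, traffic_per_link)
                  if traffic / speed > 0.70)
    return sum(cost_per_speed[s] for s in chromosome) + penalty
```

For the three-link chromosome (2400 50000 9600), such an evaluation would charge the three line tariffs plus one penalty if, say, the third link carried 9000 bits/sec of traffic.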
Interpreting a bit string as a Gray coded string would allow creeping one step up or down, but it would not easily support creeping multiple steps during the application of a single operator.

There is a potentially telling argument against our choice of representation. Other things being equal, representations using an alphabet with fewer letters and longer words will yield better performance in a genetic algorithm than representations with a larger alphabet and shorter words, since they will provide more hyperplanes for the algorithm to explore. We, on the other hand, have chosen to represent each separate link speed "word" with a single "letter". Have we discarded the basis for some important crossover effects?

We do not believe so. The performance gains acquired by using a smaller alphabet are derived by finding periodicities in the search space that are encoded in the longer words. Our search space is not a periodic one. When the best speeds of a link for a given topology are plotted over a number of runs, they tend to cluster in one or two regions of the link size list, rather than in periodic regions of it. While it is true that small alphabets and large words are useful for finding periodicities in a gene's values, in our domain such periodicities appear not to exist. Accordingly, we appear to have lost nothing by collapsing the representation, and it is simple for us to use the representation in implementing creep.

Our problem space lends itself to crossover based on word values (to continue the analogy in the preceding paragraph) rather than on letter values. In the link speed domain one must find link speeds that lie in an optimal region, given the speeds of the other links. One then wishes to sample the surrounding allowable speeds, hoping to find better combinations.
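A creep operator over this representation might look like the following sketch (ours, not the paper's code; the speed list and the two-step limit are illustrative):

```python
import random

ALLOWABLE_SPEEDS = [2400, 4800, 9600, 19200, 50000]  # ordered low to high

def creep(chromosome, max_steps=2, rng=random):
    """Move one randomly chosen link's speed up or down a small number of
    steps in the ordered speed hierarchy, clamping at either end."""
    child = list(chromosome)
    gene = rng.randrange(len(child))
    step = rng.choice([-1, 1]) * rng.randint(1, max_steps)
    index = ALLOWABLE_SPEEDS.index(child[gene]) + step
    child[gene] = ALLOWABLE_SPEEDS[max(0, min(index, len(ALLOWABLE_SPEEDS) - 1))]
    return child
```

Note that a Gray-coded bit string would support only the single-step version of this move.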
Complicating the search process are: the fact that evaluations in the region of the optimum may not produce monotonically decreasing evaluations as they move away from the optimal speed; the fact that the evaluation process is stochastic; and the fact that useful hyperplanes in this search space are combinations of link speeds that work well together in the context of the simulation, but these hyperplanes may not be useful if other hyperplanes are interfering with them. The crossover operator in our system operated by cutting chromosomes at two points. Despite these complicating factors, it seems to have satisfactorily detected and combined useful hyperplanes to create coadapted sets of link speeds.

The link speed problem is like the problem of optimizing parameters for a genetic or other stochastic algorithm. The task is to find a set of coadapted values for a set of parameters, under the guidance of a stochastic evaluation function (running the algorithm while using the parameter values encoded on the chromosome). Given a fixed set of values for the other parameters, the values for a given parameter tend to lie together on the axis of possibilities, often in a standard distribution. Much of the work to be done in such domains involves finding regions of optimal values for each parameter that fit with good regions for other parameters, exploring the interactions of such regions in parallel, and settling on promising combinations of regions for further exploration. The result will be a hyperplane of coadapted values cutting across the axes of parameter values. Here, as in the link sizing problem, the task is to find reasonable combinations of contiguous values in the face of a good deal of noise and parameter interaction.
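Two-point crossover over the list-of-speeds representation can be sketched as follows (our illustration; the operator exchanges whole link-speed "words" rather than bits):

```python
import random

def two_point_crossover(parent_a, parent_b, rng=random):
    """Cut both parents at the same two points and exchange the middle
    segment, producing two children built from whole link-speed 'words'."""
    assert len(parent_a) == len(parent_b)
    i, j = sorted(rng.sample(range(len(parent_a) + 1), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b
```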
Genetic approaches to such problems have done well³. One question of interest for genetic algorithm researchers desiring to carry out other applications is the comparison of the performance of representations that support creep operators with representations that do not, in domains with contiguous, rather than periodic, optimal parameter values.

Epistasis

At the International Conference on Genetic Algorithms and their Applications held two years ago, an interesting discussion arose concerning the status of genetic algorithms that operate in domains with high degrees of "epistasis": suppression by one gene of other genes' effects. The problem was that John Holland's proof of the convergence of a genetic algorithm on an optimum was based on the assumption that there was a low degree of epistasis in the evaluation of the algorithm's chromosomes. In order for a genetic algorithm to carry out parallel sampling of hyperplanes when it evaluates a chromosome, it is important that each hyperplane contribute its effect to the chromosome's evaluation. This did not appear to be the case for some of the genetic systems being presented at that conference. In one example, the bin-packing system described by Derek Smith⁴, altering the position of a single rectangle in the chromosomal encoding of a packing could well alter the final position of every other rectangle in the packing. Clearly, Smith's representation led to a good deal of interaction when positions of rectangles were changed. But did it violate the low-epistasis assumption? And if it did, were there convergence results, similar to Holland's original proof, that could be proved for such domains?

³See John Grefenstette, "Optimization of Control Parameters for Genetic Algorithms," IEEE Transactions on Systems, Man, and Cybernetics, SMC-16(1), 122-128, for a discussion of optimizing parameter values with bit string representations. Also see Lawrence Davis and Frank Ritter, "Schedule Optimization with Probabilistic Search," Proceedings of the 3rd IEEE Conference on Artificial Intelligence Applications, for an account of optimization of the parameters of a simulated annealer with a representation that supports a "creep" operator.
⁴Derek Smith, "Bin Packing with Adaptive Search," in Grefenstette, John J., editor, Proceedings of an International Conference on Genetic Algorithms and their Applications.
These questions were left open at that time, as was the question of rigorously defining the notion of epistasis. There do not appear to have been any published answers to these questions in the interim. There is, however, a growing body of empirical evidence suggesting that genetic algorithms do indeed perform successfully in domains in which alteration of a single gene may cause tremendous divergence in a chromosome's behavior. Some of the papers on the travelling salesrep problem found in this Proceedings are examples of such successful performance. The system described in this paper is another.

We believe that applications of genetic algorithms to such problems will continue to be fruitful. In our companion paper, we describe results that support this belief. Our domain, however, appears to be an epistatic one. Consider what happens when the speed of a single link is increased in a communication network. One possibility is that the traffic in the network follows the same paths and no significant changes in delays accrue. More likely, however, is a substantial alteration in traffic patterns, as traffic that had been routed over different paths is now routed over the freer link and other traffic is rerouted in reaction to this change. It is possible to change every routing in a network by changing a single link's speed, and the resultant change in the evaluation of the network's performance can be tremendous as well. Given that the genetic algorithm performs well in this domain, we would like to raise again some of the questions raised two years ago: Is the domain we are working in intrinsically different from the domains for which convergence proofs exist with respect to epistatic effects? What is epistasis? And are there formal results that can be proved for problems like clustering, bin-packing, and capacity assignment that will help us to implement genetic algorithms in other, similar domains?
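As a toy illustration of the kind of interaction at issue (our construction, not an example from either paper): an evaluation function in which one "gating" gene reverses the contribution of every other gene, so that no hyperplane has a context-free effect on fitness.

```python
def epistatic_eval(chromosome):
    """Gene 0 gates the rest of a bit-string chromosome: when it is 0 the
    evaluation rewards ones elsewhere, and when it is 1 it rewards zeros.
    Flipping that single gene changes the effect of all the others."""
    body = chromosome[1:]
    return sum(body) if chromosome[0] == 0 else sum(1 - g for g in body)
```

Changing a single link speed in the network domain can rearrange every routing in a similar, if far less tidy, fashion.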
Multiple Constraints

An issue that concerned us for a good deal of time arose from the fact that the "environment" surrounding our genetic algorithm differed radically from problem to problem. It was not at all clear that we would be able, in a single genetic algorithm framework, to model the behavior of communication network designers with respect to the myriad constraints imposed by customers. It was also not clear that we would be able to represent those constraints in a form the system could use. After a period of development and experimentation, the results of which are described in the companion paper, we produced a single genetic algorithm and evaluation function environment that we believe will support the application of genetic algorithms to the sort of link speed designs that are currently being carried out at BBNCC. We believe that this issue will arise frequently as genetic algorithms are developed from prototypical applications into working industrial systems. Accordingly, we have devoted our companion paper to a discussion of the techniques we used to create a genetic algorithm system capable of robust application to the link speed design problem, and to a discussion of that system's performance on actual network designs.

Conclusion

In applying genetic algorithms to the communication network link speed design problem, we made several interesting discoveries:

- Genetic algorithms perform well when applied to this domain.
- The highly stochastic and epistatic nature of the domain did not prevent the algorithm from finding and exploiting hyperplanes consisting of useful combinations of link speeds.
- Our system, based on "creep," crossover, and a non-bit string representation, performed well in this domain.
- The domain was similar in some respects to parameter optimization problems already studied by genetic algorithm researchers.
- Further research into the properties of genetic algorithms applied to such domains would be of benefit to those who will produce other, similar applications⁵.

⁵The authors thank Susan Bernstein and Richard Vacea for their many helpful comments and suggestions during the preparation of this paper.

Genetic Algorithms and Communication Link Speed Design: Constraints and Operators

Susan Coombs
BBN Communications
70 Fawcett Street
Cambridge, MA 02238

Lawrence Davis
BBN Labs
10 Moulton Street
Cambridge, MA 02238

July, 1987

Abstract

In this paper we describe some novel methods for incorporating constraints into a genetic algorithm for choosing link speeds in a packet switching communications network design. We also describe the object-oriented approach we used in representing the constraints, and present the results of applying these methods to four network designs.

Introduction

Constraints play an important role in network design. A packet-switching network design typically must satisfy constraints on link utilization, connectivity, and on the maximum travel time between source and destination. In expanding a genetic algorithms approach for adjusting network link speeds from the trial network design described in Davis and Coombs (1987)¹ to actual network designs, we discovered that we needed new ways of expressing constraints to generate good designs.² There are many possible varieties of constraints. Designs for two networks rarely satisfy exactly the same set of constraints.
To help our genetic algorithm adapt to this fact, we implemented constraints as objects. We were particularly interested in making the constraints easy to combine and modify for each new design. By modeling constraints as objects, we also could quickly demonstrate the effects of constraints separately or in combination. Each constraint was able to annotate its behavior, making it easier to analyze the effects of the constraints on the population.

¹Davis, Lawrence and Susan Coombs, "Optimizing Network Link Sizes with Genetic Algorithms," to appear in Modelling and Simulation Methodology: Knowledge Systems Paradigms, Maurice S. Elzas, Tuncer I. Ören, and Bernard P. Ziegler, editors.
²For a more complete description of the problem of adjusting network link sizes than provided in this paper, see both our earlier paper and the companion paper being presented at this conference.

Our results were quite encouraging. We ran our genetic constraint system, LINKR, on an automatically-generated network design, as well as on three hand-optimized variations of a different network design. In both the automatically-generated and hand-optimized cases, the genetic approach improved the original design.

Constraints and Operators

In applying genetic algorithms to the link speed design problem, we implemented both constraints, which impose penalties on solutions, and operators, which alter the genetic representation of solutions. Examples of constraints include constraints on link utilization, node utilization, link cost, and delay. Specifically, a customer might require that the utilization of all links in a network be less than 70%. Examples of operators include crossover and "creep". Some constraints that we had assumed to be fixed, in particular those modeling network routing³, had to be formulated stochastically. We discovered other constraints, including some we had not previously considered,
that were essential to designing an acceptable network. Some of these constraints took more time to evaluate than was practical, since part of our goal was to complete a network design in roughly the same amount of time as a person designing a network. Other constraints were problematic in that their evaluations fluctuated wildly under slight changes to the overall network. We developed new techniques to deal with the new kinds of constraints.

³The network routing determines how the traffic is distributed throughout the network.

Stochastic Constraints

Our attempt to model some stochastic constraints deterministically, in particular constraints depending on network routing, proved too limiting. With deterministic routing, the genetic algorithm optimized link sizes for the particular fixed routing chosen, leading to network designs with little ability to support alternate routings, and hence low resiliency in the face of line outages. Since many of the key constraints, including link utilization, node utilization, and delay, depend on network routing, we had to overcome this limitation. To do this, we developed a strategy for modeling stochastic routing.

In modeling stochastic routing, we adopted a k-out-of-n strategy, where k "successful" routings out of n trial routings are necessary for a penalty-free solution. The success of a routing depends on the specific constraint. For example, a routing could be successful for a link-utilization constraint if all link utilizations were less than 70%. Although a strategy requiring less frequent routings would be desirable, our strategy worked acceptably in the cases we tried. By varying k we could adjust the robustness of the routing, and by varying n we could adjust our confidence that the routings sampled were representative.
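A minimal sketch of the k-out-of-n strategy (ours, not the paper's code; `constraint_ok` stands in for one stochastic trial routing plus the constraint check, and the flat $1000 penalty is an arbitrary placeholder):

```python
import random

def k_out_of_n_penalty(network, constraint_ok, k, n, penalty=1000.0, rng=random):
    """Perform n independent trial routings; the solution is penalty-free
    only if at least k of them satisfy the constraint (for example, every
    link under 70% utilization).  constraint_ok(network, rng) -> bool."""
    successes = sum(1 for _ in range(n) if constraint_ok(network, rng))
    return 0.0 if successes >= k else penalty
```

Raising k demands designs that survive more of the sampled routings; raising n buys confidence in the sample at the cost of more routing runs.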
"Ice Age" Constraints

Some constraints were problematic in that the large amount of time necessary to evaluate them made it impossible to complete a genetic design in roughly the same amount of time as a person designing a network. One constraint that proceeded at a particularly glacial pace involved checking network performance after dropping each node in the network and re-routing traffic. For this constraint we adopted a new strategy, the "Ice Age" constraint strategy. An Ice Age constraint evaluates itself once every Ice Age, that is, once every n generations. An Ice Age affects an entire generation. Each member of the affected generation incorporates the Ice Age constraint's evaluation into its overall evaluation.

Many details of this approach remain to be investigated. For example, the optimal time between Ice Ages is uncertain. It cannot be too long, for then the Ice Ages have little effect on natural selection. Nor can it be too short, for then the time-saving advantage of the Ice Age approach is lost. Experimentation could also be done on whether members of non-Ice-Age generations should carry a lingering, but perhaps exponentially decreasing, penalty from the Ice Age evaluations of their ancestors.

"LaMarck" Operators

We observed that some constraints imposed penalties on designs which, with only slight changes, would have incurred no penalties at all. One such constraint was the constraint on mismatched line speeds. Although a network design may appear to work well with two adjacent links of widely differing speeds, in some situations one or more high-speed links funnel traffic onto the low-speed link, causing traffic to back up on the high-speed link(s). To avoid this situation, one adjusts the speeds of one or both of the links, resolving the mismatch and eliminating the penalty the design otherwise would incur. With the famous LaMarck, who posited the inheritance of acquired traits, in mind, we coined the "LaMarck" operator.
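For the speed-mismatch case, such a repair might be sketched as follows (our illustration, not the paper's implementation; the speed hierarchy, the adjacency list, and the two-step mismatch threshold are all assumptions):

```python
# Hypothetical ordered hierarchy of allowable link speeds (bits/sec).
ALLOWABLE_SPEEDS = [2400, 4800, 9600, 19200, 50000]

def repair_speed_mismatches(chromosome, adjacent_pairs, max_gap=2):
    """LaMarck-style repair: wherever two adjacent links differ by more
    than max_gap steps in the speed hierarchy, raise the slower link just
    enough to close the gap, writing the fix back into the chromosome."""
    child = list(chromosome)
    for i, j in adjacent_pairs:
        lo, hi = sorted((i, j), key=lambda k: ALLOWABLE_SPEEDS.index(child[k]))
        gap = ALLOWABLE_SPEEDS.index(child[hi]) - ALLOWABLE_SPEEDS.index(child[lo])
        if gap > max_gap:
            child[lo] = ALLOWABLE_SPEEDS[ALLOWABLE_SPEEDS.index(child[hi]) - max_gap]
    return child
```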
When a LaMarck operator is evaluated, it adjusts the genetic structure of a solution as necessary to bring it into the space of legal solutions. This strategy works best when a slight, known change to the overall network dramatically affects the evaluation of a given constraint, but does not have an appreciable effect on other constraints. Implementation of both the Ice Age constraints and the LaMarck operators was provisional and experimental. To begin to reap the benefits of these techniques would require further experimentation.

Bringing Object-Oriented Programming to Genetic Algorithms

One of the key elements of our implementation, which permits the application of different sets of constraints to different network designs, is the object-oriented representation we have chosen for constraints and individual members of the population. Because the constraints involved in each network design are unique, and over many network designs potentially infinite, the structure of the genetic approach must permit reuse, adaptation, and new combinations of constraints. The object-oriented approach that we have adopted accomplishes this, and we have already built up a sizable library of constraints for use in various types of network designs.

In implementing our constraints, we first developed a small set of basic operations. To generate new members of the population after crossover, creep, or other operators are applied, we implemented a basic operation for changing link sizes. To stochastically model the routing of data traffic throughout the network, we developed a network routing operation. We then used these basic operations as building blocks in implementing specific customer constraints, such as constraints on link utilization. Some of the constraints that we developed appear in Table 1 below. The link cost constraint sums the costs of the adjustable links in the network, and adds this cost to the overall penalty of an individual solution.
The remaining constraints assign a "dollar" cost corresponding to the degree to which they are violated, and the importance of the violation. This cost contributes to the overall penalties of individual solutions.

Many of the constraints have genetic operators associated with them. Some of these operators appear in Table 2 below. The "Creep" operator, for example, uses the link-utilization constraint in determining probabilities for creeping to a higher or lower link speed. Links with high utilizations have a higher chance of creeping up, whereas links with low utilizations are more likely to creep down. The "Adjust Ports" operator is an example of a requirement that we formulated, at different times, both as a constraint imposing penalties (see Table 1), and as a LaMarck post-processor altering solutions.

Results

Applying Genetic Algorithms to an Automatically-Generated Network Design

In the case of the automatically-generated network design we used, ten choices of link sizes were available, as compared to four choices in the test case described in Davis and Coombs (1987). There were 66 subnet links with adjustable sizes, as compared to 27 in the test case. Since we had little time in which to find an improved design, and since this design was significantly larger than the test design we had tried previously, we began by trying the greedy cyclic algorithm described in our earlier paper. The results from that paper show that the greedy cyclic algorithm runs significantly faster than the genetic algorithm, but does not perform as well. In this case, the greedy cyclic algorithm did not produce viable results.

We then turned to the genetic algorithm. To initialize the genetic population, we slightly increased the link speeds of the automatically-generated design in a random fashion, so the genetic algorithm would begin its search in a promising area. It found significant improvements to the design in a relatively small
run: a population of 30, evolved over 20 generations. We then used a trivial post-optimization step, which consisted of attempting to lower link sizes when they were higher than in the original design. This was possible for two links. The cost savings reported in Table 3 below include the effects of this post-optimization step.

Applying Genetic Algorithms to a Hand-Optimized Network Design

The three variations of the hand-optimized network design had many features in common. All three versions had roughly 26 nodes and 37 links, with about 30 links that could be adjusted in size. Twenty choices of link speeds were available. We typically ran populations of around 30, over 20 to 50 generations.

In the first version, the network had to be reliable enough to tolerate the loss of either of two central nodes. We wrote a special constraint that worked by applying penalties when link utilizations were too high with either of the central nodes down. We experimented with making this constraint an Ice Age constraint which evaluated only once every n generations. Although people designing networks seem to use a similar approach, that is, they evaluate this time-consuming constraint sporadically rather than continually, we did not have much success with implementing this constraint as an Ice Age constraint. Perhaps a larger population would help to preserve traits that only come into play every n generations.

In the second and third versions, the primary constraint was link utilization. In the first of these, we modeled the network with twice the expected amount of traffic, and link utilizations had to be less than 100% (delay was not a factor). In the final version, the customer requested a slightly more conservative design, with link utilizations under 90%. Other constraints on these designs included link cost, modem cost, and port usage.
In analyzing the results of the genetic algorithms with the person responsible for doing the hand-optimizing of the designs, an interesting fact emerged. Two of the more important constraints to consider when designing these networks were link cost and link utilization, and these were the factors the designer concentrated on. A less crucial consideration was modem cost, since link costs usually far outweigh modem costs, and the designer did not have easy access to modem costs. In applying the genetic algorithm to these cases, however, we included a constraint for modem cost, and discovered that much of the improvement that the genetic algorithm achieved was in reducing modem costs. Apparently, genetic algorithms are able to keep track of subsidiary constraints, and to use them to advantage in finding improved solutions.

Conclusion

In the process of applying genetic algorithms to the problem of determining network link sizes, we developed novel and useful types of constraints. We also found a representation for constraints (as objects) that facilitated the rapid selection and modification of sets of constraints for individual network designs. Finally, with the successful application of genetic algorithms to the network link sizing problem, we demonstrated the value of genetic algorithms in solving design problems characterized by many differing constraints.
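The constraints-as-objects organization described above might be sketched like this (our reconstruction in outline only; the class names, the 70% limit, and the flat dollar penalty are hypothetical, and the original system was written in Zetalisp rather than Python):

```python
class Constraint:
    """Base class: a constraint turns violations into a dollar penalty and
    annotates what it did, so constraint sets can be mixed per design."""

    def __init__(self, name):
        self.name = name
        self.annotations = []

    def penalty(self, network):
        raise NotImplementedError

class LinkUtilizationConstraint(Constraint):
    """Penalize every link whose utilization exceeds the customer's limit."""

    def __init__(self, limit=0.70, dollars_per_violation=1000.0):
        super().__init__("link utilization")
        self.limit = limit
        self.dollars = dollars_per_violation

    def penalty(self, network):
        # network: a mapping of link name -> (traffic, capacity) in bits/sec.
        violations = [link for link, (traffic, speed) in network.items()
                      if traffic / speed > self.limit]
        self.annotations.append("%d link(s) over %.0f%%"
                                % (len(violations), self.limit * 100))
        return self.dollars * len(violations)

def total_penalty(network, constraints):
    """A solution's overall penalty is the sum over all active constraints."""
    return sum(c.penalty(network) for c in constraints)
```

New customer requirements would then be added as further `Constraint` subclasses, combined per design by passing a different list to `total_penalty`.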
Constraint          Behavior
Link Cost           Sums costs for penalties
Link Utilization    Routes, sums penalties
Link Drop           Drops links each Ice Age, routes, sums penalties
Port Usage          Sums penalties

Table 1: Constraints

Operator                   Behavior                                    Associated Constraint
Crossover                  Combines two parents, with multiple         --
                           tries for different parents
Creep                      Changes a link size one step up or down     Link Utilization
Adjust Ports (LaMarck)     Adds extra ports or nodes where needed      Port Usage
Speed Mismatch (LaMarck)   Changes link sizes to correct               --
                           speed mismatches

Table 2: Operators

Design                             Annual Cost Savings
Automatically-Generated Design     $36900
Hand-Optimized Design, Version 1   $8232
Hand-Optimized Design, Version 2   $45236
Hand-Optimized Design, Version 3   $55716

Table 3: Cost Savings by Design