GENETIC ALGORITHMS AND THEIR APPLICATIONS: Proceedings of the Second International Conference on Genetic Algorithms, July 28-31, 1987, at the Massachusetts Institute of Technology, Cambridge, MA. Sponsored by the American Association for Artificial Intelligence, the Naval Research Laboratory, and Bolt Beranek and Newman, Inc. John J. Grefenstette, Naval Research Laboratory, Editor. LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS. 1987. Hillsdale, New Jersey; Hove and London.

Copyright © 1987 by Lawrence Erlbaum Associates, Inc. All rights reserved. No part of this book may be reproduced in any form, by photostat, microform, retrieval system, or any other means, without the prior written permission of the publisher. Lawrence Erlbaum Associates, Inc., Publishers, 365 Broadway, Hillsdale, New Jersey 07642. ISBN 0-8058-0158-8 cloth edition. ISBN 0-8058-0159-6 paperback edition. Printed in the United States of America.

ACKNOWLEDGEMENTS

On behalf of the Conference Committee, it is my pleasure to acknowledge the support of our sponsors: the American Association for Artificial Intelligence, the Navy Center for Applied Research in Artificial Intelligence at the Naval Research Laboratory, and Bolt Beranek and Newman, Inc. The Committee also appreciates the cooperation of Dr. Edwin H. Land. I would personally like to thank the other members of the Conference Committee for their conscientious efforts as referees. Stewart Wilson deserves special thanks for handling the local arrangements. John J.
Grefenstette, Program Chair.

Conference Committee: John H. Holland, University of Michigan (Conference Chair); Lashon B. Booker, Navy Center for Applied Research in AI; Dave Davis, Bolt Beranek and Newman; Kenneth A. De Jong, George Mason University; David E. Goldberg, University of Alabama; John J. Grefenstette, Navy Center for Applied Research in AI (Program Chair); Stephen F. Smith, Carnegie-Mellon Robotics Institute; Stewart W. Wilson, Rowland Institute for Science (Local Arrangements).

CONFERENCE PROGRAM

TUESDAY, JULY 28, 1987
5:00 - 9:00 REGISTRATION: Lobby, McCormick Hall
7:00 - 9:00 WELCOMING RECEPTION: Courtyard, McCormick Hall

WEDNESDAY, JULY 29, 1987
8:00 REGISTRATION: Room 10-280
9:00 OPENING REMARKS: Room 10-250
9:20 - 10:40 GENETIC SEARCH THEORY
Finite Markov chain analysis of genetic algorithms, David E. Goldberg and Philip Segrest ... 1
An analysis of reproduction and crossover in a binary-coded genetic algorithm, Clayton L. Bridges and David E. Goldberg ... 9
Reducing bias and inefficiency in the selection algorithm, James E. Baker ... 14
Altruism in the bucket brigade, Thomas H. Westerdale ... 22
10:40 - 11:00 COFFEE BREAK
11:00 - 12:00 ADAPTIVE SEARCH OPERATORS I
Schema recombination in pattern recognition problems, Irene Stadnyk ... 27
An adaptive crossover distribution mechanism for genetic algorithms, J. David Schaffer and Amy Morishima ... 36
Genetic algorithms with sharing for multimodal function optimization, David E. Goldberg and Jon Richardson ... 41
12:00 - 2:00 LUNCH
2:00 - 3:20 REPRESENTATION ISSUES
The ARGOT strategy: adaptive representation genetic optimizer technique, Craig G.
Shaefer ... 50
Nonstationary function optimization using genetic algorithms with dominance and diploidy, David E. Goldberg and Robert E. Smith ... 59
Genetic operators for high-level knowledge representations, H. J. Antonisse and K. S. Keller ... 69
Tree structured rules in genetic algorithms, Arthur S. Bickel and Riva Wenig Bickel ... 77
3:20 - 3:40 COFFEE BREAK
3:40 - 5:00 KEYNOTE ADDRESS
Genetic algorithms and classifier systems: foundations and future directions, John H. Holland ...
7:00 - 9:00 BUSINESS MEETING

THURSDAY, JULY 30, 1987
9:00 - 10:20 ADAPTIVE SEARCH OPERATORS II
Greedy genetics, G. E. Liepins, M. R. Hilliard, Mark Palmer and Michael Morrow ... 90
Incorporating heuristic information into genetic search, Jung Y. Suh and Dirk Van Gucht ... 100
Using reproductive evaluation to improve genetic search and heuristic discovery, Darrell Whitley ... 108
Toward a unified thermodynamic genetic operator, David J. Sirag and Paul T. Weisser ... 116
10:20 - 10:40 COFFEE BREAK
10:40 - 12:00 CONNECTIONISM AND PARALLELISM I
Toward the evolution of symbols, Charles P. Dolan and Michael G. Dyer ...
SUPERGRAN: a connectionist approach to learning, integrating genetic algorithms and graph induction, Deon G. Oosthuizen
... 132
Parallel implementation of genetic algorithms in a classifier system, George G. Robertson ... 140
Punctuated equilibria: a parallel genetic algorithm, J. P. Cohoon, S. U. Hegde, W. N. Martin and D. Richards ... 148
12:00 - 2:00 LUNCH
2:00 - 3:20 PARALLELISM II
A parallel genetic algorithm, Chrisila B. Pettey, Michael R. Leuze and John J. Grefenstette ... 155
Genetic learning procedures in distributed environments, Adrian V. Sannier II and Erik D. Goodman ... 162
Parallelisation of probabilistic sequential search algorithms, Prasanna Jog and Dirk Van Gucht ... 170
Parallel genetic algorithms for a hypercube, Reiko Tanese ... 177
3:20 - 3:40 COFFEE BREAK
3:40 - 5:00 CREDIT ASSIGNMENT AND LEARNING
Bucket brigade performance: I. Long sequences of classifiers, Rick L. Riolo ... 184
Bucket brigade performance: II. Default hierarchies, Rick L. Riolo ... 196
Multilevel credit assignment in a genetic learning system, John J. Grefenstette ... 202
On using genetic algorithms to search program spaces, Kenneth A. De Jong ... 210
6:30 - 10:00 CONFERENCE BANQUET: New England Clambake

FRIDAY, JULY 31, 1987
9:00 - 10:20 APPLICATIONS I
A genetic system for learning models of consumer choice, David Perry Greene and Stephen F.
Smith ... 217
A study of permutation crossover operators on the traveling salesman problem, I. M. Oliver, D. J. Smith and J. R. C. Holland ... 224
A classifier based system for discovering scheduling heuristics, M. R. Hilliard, G. E. Liepins, Mark Palmer, Michael Morrow and Jon Richardson ... 231
Using the genetic algorithm to generate LISP source code to solve the prisoner's dilemma, Cory Fujiki and John Dickinson ... 236
10:20 - 10:40 COFFEE BREAK
10:40 - 12:00 APPLICATIONS II
Optimal determination of user-oriented clusters: an application for the reproductive plan, Vijay V. Raghavan and Brijesh Agarwal ...
The genetic algorithm and biological development, Stewart W. Wilson ...
Susan Coombs and Lawrence Davis ...
12:00 - 2:00 LUNCH
2:00 - 3:20 PANEL DISCUSSION: GA's and AI
3:20 - 3:40 COFFEE BREAK
3:40 - 5:00 INFORMAL DISCUSSION AND FAREWELL

FINITE MARKOV CHAIN ANALYSIS OF GENETIC ALGORITHMS
David E. Goldberg and Philip Segrest
The University of Alabama, Tuscaloosa, AL 35487

ABSTRACT

A finite Markov chain analysis of a single-locus, binary allele, finite population genetic algorithm (GA) is presented in this paper. The Markov analysis is briefly derived, and computations are presented for two kinds of problems: genetic drift (no preference for either allele) and preferential selection (one allele is selected over the other). Approximate analyses are presented to explain the detailed Markov analysis. These computations are useful in choosing parameters for artificial genetic search.

INTRODUCTION

Genetic algorithms (GAs) are receiving increasing application in a variety of search and machine learning problems.
These efforts have been greatly aided by the existence of theory that explains what GAs are processing and how they are processing it. The theory largely rests on Holland's exposition of schemata (1968, 1975), his bridge to the two-armed bandit problem (1973, 1975), his fundamental theorem of genetic algorithms (1973, 1975), and later works by several of his students (Martin, 1973; De Jong, 1975; Bethke, 1981). All of these works have necessarily made bounding assumptions: population sizes have been assumed to be infinitely large, probability limits have been estimated by relatively crude limits, and in some cases genetic operators have even been modified to facilitate the analysis. As a result, there is still a need for more exact analysis of genetic algorithm behavior using finite populations and realistic analytical models of genetic operators.

In this paper, we perform a Markov chain analysis of a one-locus, two-operator (reproduction and mutation) genetic algorithm acting upon a finite population of haploid binary structures. Specifically, we calculate the expected time of first passage to various levels of convergence under different selection ratios and mutation rates. To understand the Markov chain analysis and its application, we first develop an analysis of genetic drift: convergence in finite populations under no selective pressure. Understanding this phenomenon is useful in explaining why finite GAs make convergence errors at relatively unimportant bit positions. We then examine expected GA performance at different levels of selective pressure assuming a deterministic fitness function. This analysis permits calculation of the selective pressure required to reduce the probability of selecting the wrong bit to some known level. We also discuss how these results and their extension may be used for designing and sizing finite GAs.
STOCHASTIC ERRORS IN FINITE GENETIC ALGORITHMS

As simple three-operator genetic algorithms have been used in a wider array of search and machine learning applications (Goldberg & Thomas, 1986), a number of objections have been voiced concerning their performance. Chief among these objections is the occurrence of premature convergence (De Jong, 1975). Premature convergence is that event where a population of structures attains a high level of uniformity at all loci without containing sufficiently near-optimal structures. Two reasons have been given for this undesirable form of convergence: a problem (with its chosen coding) can itself be GA-hard, or the finite GA can suffer from stochastic errors.

GAs may diverge because a problem is inherently difficult for a three-operator GA. Such problems have been called GA-hard (Goldberg, 1983), and Bethke (1981) has proved that GA-hard problems exist using Walsh function computation of schema averages. Reordering operators based on natural precedent have been suggested (Holland, 1975; Goldberg & Lingle, 1985) as one remedy for such problems, especially those where building blocks are not linked tightly enough to permit the three-operator GA to find near-optimal points. Despite the possibility of GA-hardness, Bethke's study and an extended schema analysis of the minimal deceptive problem (Goldberg, 1986) suggest that it is more difficult to construct intentionally misleading (GA-hard) problems than was previously thought. Furthermore, this notion--that GA-hard problems are relatively hard to construct--is empirically supported by the widespread success of simple GAs across a spectrum of problems. Nonetheless, the study of GA-hard problems and the design of operators to circumvent the difficulty remain important open areas of research activity.

The other major source of convergence difficulty in genetic algorithms results from the stochastic errors of small populations.
These errors may themselves be divided into two types: errors in sampling and errors in selection. A pollster makes a sampling error when he selects a sample size which is too small to achieve the accuracy he desires. He also commits a sampling error when the sample he selects is not representative of the population as a whole. Genetic algorithms may make similar sampling errors when either the strings representing important schemata are not present in sufficient numbers or the individuals present are not representative of the whole similarity subset. Sampling errors in small populations in this way prevent the proper propagation of the correct (above average) schemata, thereby circumventing the expected and desired action predicted by the schema theorem (Holland, 1975).

Errors of selection are harder to understand because their disruptive effects are counterintuitive. A simple example will help clarify the problem. Suppose we have a population of 50 single-locus structures containing 25 ones and 25 zeros. Suppose further that we select a new population from the old population by choosing 50 new members one at a time using selection with replacement (where once picked, an individual is placed back in the sampled population). Since we are picking each new member with probability 1/50, and since there are currently 25 ones and 25 zeros, the expected numbers of ones and zeros are still 25 and 25 respectively; however, because the selection process is fairly noisy, we shouldn't expect to retain exactly the same 25/25 split. In fact, the probability of exactly that division is only

    P(25/25) = C(50, 25) (0.5)^50 ≈ 0.1123

Therefore it is reasonably likely the population will fall away from this initial starting point. In the next generation, the new population becomes the new initial condition, with the expected number of ones and zeros determined by the new state.
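The 25/25 probability above is easy to verify directly. The following Python sketch (an illustration, not part of the original paper) evaluates the binomial term:

```python
from math import comb

# Probability that selection with replacement from a 25/25 population
# of size N = 50 again yields exactly 25 ones: C(50, 25) * 0.5**50.
p_same_split = comb(50, 25) * 0.5 ** 50
print(round(p_same_split, 4))  # 0.1123
```

So roughly nine times out of ten, a single generation of noisy selection moves the population away from the even split.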
This process continues, and in finite time for a finite population, the population converges to all ones or all zeros. Once converged, there is no way to get back any of the missing material (unless we permit some mutation, a possibility we will consider later). Geneticists have long recognized that finite populations do converge in this manner even when there is no selective advantage for one allele over another. This error of selection is so important it has been given a special name, genetic drift. Selection errors can accumulate, causing a drift to one allele or the other.

By itself, genetic drift is bad enough. After all, if the environment has no preference for one allele or another, we might like a GA to preserve both of them (perhaps we would even like the GA to preserve them in relatively equal numbers). Genetic drift insures that this won't happen unless we take special action to encourage this behavior (see Goldberg & Richardson, this volume). To make matters worse, these errors of selection can cause a genetic algorithm to converge to the wrong allele when the environment does prefer one allele over the other, especially when that selective advantage is relatively small. The analysis of these difficulties is our main concern in the remainder of this paper. We consider first the case of no selection pressure--genetic drift--followed by analysis of cases with varying degrees of selective pressure.
MARKOV CHAIN ANALYSIS OF GENETIC DRIFT

Discrete and continuous models of genetic drift have received attention in the literature of mathematical biology (Crow & Kimura, 1970); however, many of these analyses contain additional details of biological reality that are not always of interest to genetic algorithmists. De Jong (1975) presents computer simulations of genetic drift in the context of simple GAs, showing graphs of the relationship between expected first passage time to varying convergence levels as a function of population size and mutation rate. In this section, we calculate these quantities more exactly using the mathematics of Markov chains (Kemeny & Snell, 1960).

Markov Chains

Suppose we have a sequence of random variables S_0, S_1, ..., and suppose the possible values for these random variables are drawn from the set {0, 1, ..., N}. We think of the random variable S_t as the state of some system at time t; more precisely, the system is in state i at time t if S_t = i. If at each time t there is a fixed probability p_ij that the system will be in state j at time t+1 when the system was in state i at time t, we say the sequence of random variables forms a Markov chain. The fixed quantities p_ij are said to be transition probabilities:

    p_ij = P{ S_{t+1} = j | S_t = i }

Markov chains may be classified by the types of states they contain. States that may not be reached as a process goes to infinity are said to be transient states. Those that may be reached as time marches on are said to be ergodic states (non-transient). In this paper, we will only be interested in those Markov chains that contain a particular type of ergodic state: an absorbing state.
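As an aside (not from the paper), the drift process analyzed here is straightforward to simulate. The following Python sketch resamples a one-locus population with replacement each generation until it absorbs at all ones or all zeros; the seed parameter is an arbitrary choice for reproducibility:

```python
import random

def drift_until_absorbed(N, seed=1):
    """Simulate one-locus genetic drift: resample N members with
    replacement each generation until all ones or all zeros remain."""
    random.seed(seed)
    i, gens = N // 2, 0  # start at half ones, half zeros
    while 0 < i < N:
        # Each of the N new members is a one with probability i/N.
        i = sum(random.random() < i / N for _ in range(N))
        gens += 1
    return i, gens

state, gens = drift_until_absorbed(20)
print(state in (0, 20), gens > 0)  # True True
```

Repeated runs with different seeds show exactly the convergence-without-preference behavior the Markov analysis below quantifies.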
A state is said to be absorbing if once entered it may never be left. For any state i, this is true if and only if the following condition holds:

    p_ii = 1

We have already seen examples of absorbing states in our earlier description of genetic drift: we recognize the all-zeros and all-ones states as absorbing states, because once we get to either one of them we can never leave. A chain with at least one absorbing state and no ergodic states other than absorbing states is said to be an absorbing Markov chain. The states of an absorbing Markov chain may be renumbered to obtain a transition probability matrix P in the following canonical form (Kemeny & Snell, 1960):

    P = [ I  0 ]
        [ R  Q ]

where I is an identity matrix and 0 is a matrix of all zeros. The submatrices Q and R are used to calculate important properties of an absorbing chain.

Genetic Drift without Mutation

To illustrate the construction of a transition probability matrix, we turn to the genetic drift problem at hand. Suppose we have a population of N ones and zeros. We let state 0 be that situation where we have all zeros, we let state N be that case where we have all ones (no zeros), and in general we let state i be that state with exactly i ones. Thus we have a total of N+1 states, i = 0, 1, ..., N, that together represent all possible conditions within the population. Then if we assume random selection of exactly N new population members with replacement, we may calculate the elements of our drift transition probability matrix P_d as follows:

    (P_d)_ij = C(N, j) (i/N)^j (1 - i/N)^(N-j)

These calculations are provably correct, because at each state i, we have a probability of selecting a one p_one = i/N; further, the probability of getting to state j (j ones) is binomially distributed with probability p_one and size N. In this matrix, the states 0 and N are both absorbing states because p_00 = p_NN = 1. Thus, it is a simple matter to obtain the canonical form of the matrix shown above.
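For concreteness, the drift matrix P_d defined above can be tabulated in a few lines of Python (an illustrative sketch, not part of the paper):

```python
from math import comb

def drift_matrix(N):
    """(P_d)_ij = C(N, j) (i/N)^j (1 - i/N)^(N-j): each of N new members
    drawn with replacement is a one with probability i/N."""
    return [[comb(N, j) * (i / N) ** j * (1 - i / N) ** (N - j)
             for j in range(N + 1)] for i in range(N + 1)]

P = drift_matrix(10)
print(P[0][0], P[10][10])            # states 0 and N are absorbing: 1.0 1.0
print(abs(sum(P[3]) - 1.0) < 1e-9)   # each row is a probability distribution: True
```

The two printed checks confirm the properties used in the text: the endpoint states are absorbing, and every row sums to one.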
In particular, we are interested in the Q matrix, which may be obtained simply by stripping off the first and last rows and columns of the original P_d matrix. The Q matrix may then be used to calculate the expected number of visits n_ij to any transient state j starting in the transient state i (the N matrix):

    N = (I - Q)^(-1)

We won't belabor the details leading to this calculation here; however, the desired N matrix may first be written as an infinite matrix geometric series in Q. Thereafter, the closed form calculation may be obtained in a manner similar to that used in summing an infinite scalar geometric series. The total expected time in transit to any one of the absorbing states starting in state i--we call this the transit time t_i--may easily be calculated as a row sum over the ith row of the N matrix:

    t_i = sum_j n_ij

The t_i values may then be used to calculate the expected time to absorption given any initial starting conditions. If we assume an initial population chosen uniformly at random, the probability of being in a state i initially--we call this pi_i--is binomially distributed as follows:

    pi_i = C(N, i) (0.5)^N

The expected time to an absorbing state, the quantity t_absorbed, may then be calculated as the dot product of the initial state probability vector pi and the transit time vector t:

    t_absorbed = sum_i pi_i t_i

These calculations suggest a straightforward procedure for calculating the expected time until either the all-ones state (state N) or all-zeros state (state 0) is reached, but what can we do if we are interested in calculating the time of first passage to some level of convergence other than 100 percent? We may calculate such quantities quite simply by replacing all rows in the P_d matrix within a desired percentage of convergence by identity rows: p_ii = 1 and p_ij = 0 for j ≠ i.
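These formulas translate directly into code. The sketch below (an illustration under the definitions above, not the authors' APL program) forms Q, solves (I - Q)t = 1 by Gauss-Jordan elimination rather than inverting explicitly, and weights the transit times by the binomial initial distribution:

```python
from math import comb

def expected_absorption_time(N):
    """t_absorbed = sum_i pi_i t_i, where t = (I - Q)^(-1) 1 over the
    transient drift states i = 1, ..., N-1 and pi_i = C(N, i) 0.5^N."""
    n = N - 1
    Q = [[comb(N, j) * (i / N) ** j * (1 - i / N) ** (N - j)
          for j in range(1, N)] for i in range(1, N)]
    # Augmented system (I - Q | 1), reduced by Gauss-Jordan elimination.
    A = [[(1.0 if r == c else 0.0) - Q[r][c] for c in range(n)] + [1.0]
         for r in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    t = [A[r][n] / A[r][r] for r in range(n)]
    # Weight by the binomial initial distribution over transient states.
    return sum(comb(N, i + 1) * 0.5 ** N * t[i] for i in range(n))

print(round(expected_absorption_time(2), 6))  # 1.0
```

For N = 2 the single transient state has Q = [[0.5]], so t_1 = 2 and t_absorbed = C(2,1)(0.5)^2 * 2 = 1 generation, which the function reproduces; larger N values exhibit the near-linear growth reported in the paper.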
This bit of chicanery makes the once-transient states absorbing states, and the procedure described above may be used on a now-reduced Q matrix to determine expected first passage times to the particular level of convergence. These computations are carried out for convergence levels of 60%, 70%, 80%, 90%, and 100%. The expected time of first passage is calculated for populations ranging from N=10 to N=100. Computations have been performed in APL to exploit that language's facility with vector and matrix operations.

Figure 1 displays the expected first passage time versus population size at different convergence levels. The relationship between passage time and population size at a given level of convergence is strikingly linear. Although we may expect the passage time to grow with increased population size, it is not at all obvious why that growth is linear.

Figure 1. Genetic drift Markov chain computation of expected first passage time versus population size at different convergence levels with no mutation.

To understand this better, we appeal to a simpler but related stochastic model: the gambler's ruin problem. In the gambler's ruin problem, a gambler with a fixed stake D places successive one dollar wagers on the outcome of a coin toss. In our version of the game we assume that the coin is fair, p_winning = p_losing = 0.5. The game ends when the gambler either loses all his money or attains a goal value A. All this is interesting, but how is it at all related to the genetic drift problem? We may view the genetic drift problem as a gambler's ruin problem where we drift toward either ruin (all zeros) or reward (all ones). In the real genetic drift problem, our range of outcomes on a given "gamble" is not binary; however, by using an average step size per generation we may calculate an approximate form for the genetic drift solution using the simpler gambler's ruin model.
Fortunately, the expected duration of the gambler's ruin problem is a well-known computation (Feller, 1968). Under the assumptions above, the expected number of coin tosses to ruin or reward may be calculated as

    E{number of coin tosses} = D(A - D)

In our problem, the stake amount is given by D = N(p_con - 0.5) (the distance to ruin or reward at a particular convergence level p_con); the desired accumulation to quit (assuming equal distances) is twice the stake amount, so A - D = D. To place these quantities in terms of numbers of generations, we divide both terms (D and A-D) by the average number of steps taken per generation. This quantity is simply the standard deviation of the expected number of ones (or zeros) per generation:

    sigma = sqrt(N p (1 - p))

This operation yields the following approximate expression for the number of generations to a particular level of convergence:

    t_convergence ≈ (2 p_con - 1)^2 N / [4 p (1 - p)]

Thus we have reasoned that the expected transit time increases linearly with population size. As with many approximate models, the derived proportionality constant is mainly useful as an indicator of order of magnitude. The simple model does help explain why the slopes of lines of successively higher convergence proportion increase faster than the proportion itself.

Genetic Drift with Mutation

To combat the undesired convergence of genetic drift in simple GAs, mutation rate increases are often suggested. To investigate the effect of mutation on expected times of convergence, we continue our Markov chain analysis by including mutation in our overall probability transition matrix. With mutation included, we view the overall probability transition matrix P as the product of the drift matrix P_d developed earlier and a mutation probability transition matrix P_m:

    (P) = (P_d)(P_m)

To develop the mutation transition probability matrix, we need to select from a number of possible mutation models.
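The gambler's-ruin estimate is simple enough to state as a one-line function. This Python sketch (illustrative; parameter names are mine) evaluates t_convergence = (2 p_con - 1)^2 N / [4 p (1 - p)]:

```python
def drift_time_estimate(N, p_con, p=0.5):
    """Gambler's-ruin estimate of generations to reach convergence level
    p_con under pure drift: D^2 / sigma^2 with D = N(p_con - 0.5) and
    sigma^2 = N p (1 - p)."""
    return (2 * p_con - 1) ** 2 * N / (4 * p * (1 - p))

# Linear growth in N at a fixed convergence level:
print(drift_time_estimate(50, 1.0), drift_time_estimate(100, 1.0))  # 50.0 100.0
```

Doubling N doubles the estimate, which is exactly the linear growth visible in Figure 1, even though the constant itself is only an order-of-magnitude guide.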
We may perform a single mutation with probability p_m, replace the individual in the original population, and then perform a sequence of N such operations (we call this mutation with sequential replacement). We may also mutate (again with probability p_m) population member by population member (mutation without replacement). We develop the transition probability matrices for both types of mutation.

The transition probability matrix for mutation with sequential replacement may be developed quite simply. A single mutation permits a shift in state by one step (either one more one or one less one). Thus, the single mutation transition probability matrix P_ms is tridiagonal, and its elements may be specified as follows:

    (P_ms)_ij = p_m (N - i)/N   if j = i + 1
              = p_m i/N         if j = i - 1
              = 1 - p_m         if j = i
              = 0               otherwise

The overall mutation transition probability matrix under sequential replacement may then be taken as the product of N single mutation matrices:

    (P_m)_sequential = (P_ms)^N

The transition probability matrix for mutation without replacement may also be calculated. Although the computation is more cumbersome, the extra effort is worthwhile, as mutation without replacement is a more faithful model of mutation as it is implemented in most GAs. To calculate the transition probability matrix for mutation without replacement, we recognize a simple fact. To go up in state value by, for example, two ones, we must have exactly two more zeros change to ones than we have ones change to zeros. If we sum over all the possible occurrences where this is true, we obtain the elements of the transition probability matrix. For j > i (where we shift to a state with more ones) we obtain the elements of the transition probability matrix as follows:

    (P_m)_ij = sum_{k=0}^{c} C(N-i, j-i+k) C(i, k) p_m^(j-i+2k) (1 - p_m)^(N-j+i-2k),   where c = min(i, N-j)

A similar expression may be derived for the case where j < i. Under preferential selection with fitness ratio r, the expected proportion of ones P_1 in a large population obeys

    P_1^(t+1) = r P_1^t / [(r - 1) P_1^t + 1]

where the superscript t is a generation index.
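The sequential-replacement model can be checked numerically. The sketch below (illustrative; the tridiagonal elements follow the single-mutation model described above) builds P_ms and raises it to the Nth power by repeated multiplication:

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def sequential_mutation_matrix(N, pm):
    """(P_m)_sequential = (P_ms)^N: N applications of a tridiagonal
    single-mutation step that moves the state up or down by one."""
    Pms = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        if i > 0:
            Pms[i][i - 1] = pm * i / N        # a one mutates to a zero
        if i < N:
            Pms[i][i + 1] = pm * (N - i) / N  # a zero mutates to a one
        Pms[i][i] = 1.0 - pm                  # no mutation this step
    M = Pms
    for _ in range(N - 1):
        M = matmul(M, Pms)
    return M

Pm = sequential_mutation_matrix(5, 0.01)
print(abs(sum(Pm[2]) - 1.0) < 1e-9)  # rows remain stochastic: True
```

Note that unlike the drift matrix, this matrix has no absorbing states for p_m > 0: mutation can always reintroduce a lost allele, which is why it combats drift.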
For r values near 1 this equation reduces to a geometric progression in r:

    P_1^t = r^t P_1^0

Taking the natural logarithm of both sides, we obtain the following relationship for the time to a given proportion of ones:

    t = ln[P_1^t / P_1^0] / ln r

Figure 4. Markov chain computation of expected first passage time versus fitness ratio r for reproduction cases to 100 percent convergence level with different population sizes.

Figure 5. Markov chain computation of probability of correct convergence versus fitness ratio r for reproduction cases to 100 percent convergence level with different population sizes.

On a plot of log(t) versus log(log r), this curve plots as a straight line with negative slope. Referring to Figure 4, at high enough r values we notice this straight line behavior. At small r values the curve levels out and approaches the constant value predicted by the genetic drift computations. We may derive an approximate r value where this divergence should occur with a little physical reasoning. During a given generation we expect an increase in ones approximately equal to N(r-1)P_1^t. On average, the increase or decrease of ones due to selection noise is simply the standard deviation sigma. When the noise is of the same order of magnitude or greater than the expected increase, we should expect divergence between the simplified model and the finite model.
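The large-population timing estimate above is trivial to evaluate. This Python fragment (an illustration with arbitrary example numbers) applies t = ln(P_1^t / P_1^0) / ln r:

```python
from math import log

def gens_to_proportion(p0, pt, r):
    """t = ln(P_1^t / P_1^0) / ln r: generations to move the proportion
    of ones from p0 to pt, valid in the large-population, r-near-1 regime."""
    return log(pt / p0) / log(r)

# e.g. from half ones to 99 percent ones at fitness ratio r = 1.1:
print(round(gens_to_proportion(0.5, 0.99, 1.1), 2))  # 7.17
```

Since t is proportional to 1/ln r, the log(t) versus log(log r) plot mentioned in the text is a straight line of slope -1.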
This divergence should occur at values of r predicted by the following relationship:

    N(r - 1)P_1 ≈ sqrt(N P_1 (1 - P_1))

When the population is half ones and half zeros, this equation says that the large population result becomes suspect when the excess fitness (r-1) is less than the inverse of the square root of the population size. This relationship explains the decreasing break point value (between drift-like and convergent behavior) with increasing population size, as can be seen in both Figures 4 and 5. At values of r beneath this critical value, the expected time results flatten, and the probability of converging to the correct allele starts to fall dramatically.

Similar calculations may be performed for reproduction with mutation. As in the genetic drift case, we take the overall transition probability matrix as the product of two matrices; here, we multiply the reproduction transition probability matrix by a mutation matrix (without replacement) as follows:

    (P) = (P_r)(P_m)

We perform expected first passage time calculations as before. In Figure 6, we graph first passage time versus r for different mutation probabilities p_m at a population size N=50. Using our previous physical analysis, we expect the finite analysis to approach the bounding analysis for r-1 values greater than sqrt(1/50) ≈ 0.1414. This holds true for small p_m values; however, as the mutation probability grows, the expected time grows beyond that predicted by the bounding analysis. That this should occur may be reasoned intuitively. If the expected net loss of good alleles due to mutation starts to exceed the expected increase in good alleles from reproduction, the process loses its ability to converge as quickly. This point occurs when the following condition holds true:

    r - 1 ≈ p_m

When the excess fitness (r-1) is of the same order or less than the mutation rate, the GA will take a longer time to converge, because mutation is adding errors faster than reproduction can erase them. This shows up in Figure 6 as successively higher mutation rates cause the expected time curves to break away from the bounding analysis results more quickly.

Figure 6. Markov chain computation of expected first passage time versus fitness ratio r for reproduction and mutation cases to 90 percent convergence with N=50 and different mutation levels.

CONCLUSIONS

In this paper, we have analyzed the performance of one-locus, two-operator (reproduction and mutation) genetic algorithms with finite populations of haploid structures using finite Markov chains. The results in particular and the Markov chain analysis in general are useful in understanding the performance of finite GAs commonly in use. These results and their extension should be useful in sizing populations appropriately, selecting proper mutation rates, and choosing rates of selection in scaling procedures.

ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grant MSM-8451610.

REFERENCES

Bethke, A. D. (1981). Genetic algorithms as function optimizers. (Doctoral dissertation, University of Michigan). Dissertation Abstracts International, 41(9), 3503B. (University Microfilms No. 8106101)

Crow, J. F., & Kimura, M. (1970). An introduction to population genetics theory. New York: Harper and Row.

De Jong, K. A. (1975). An analysis of the behavior of a class of genetic adaptive systems. (Doctoral dissertation, University of Michigan). Dissertation Abstracts International, 36(10), 5140B. (University Microfilms No. 76-9381)

Feller, W. (1968). An introduction to probability theory and its applications (Vol. 1, 3rd ed.). New York: Wiley.

Goldberg, D. E. (1983). Computer-aided gas pipeline operation using genetic algorithms and rule learning (Doctoral dissertation, University of Michigan). Dissertation Abstracts International, 44(10), 3174B. (University Microfilms No. 8402282)

Goldberg, D. E. (1986). Simple genetic algorithms and the minimal deceptive problem (TCGA Report No. 86003).
Tuscaloosa: University of Alabama, The Clearinghouse for Genetic Algorithms.

Goldberg, D. E., & Lingle, R. (1985). Alleles, loci, and the traveling salesman problem. In J. J. Grefenstette (Ed.), Proceedings of an International Conference on Genetic Algorithms and Their Applications (pp. 154-159). Pittsburgh: Carnegie-Mellon University.

Goldberg, D. E., & Thomas, A. L. (1986). Genetic algorithms: bibliography 1962-1986 (TCGA Report No. 86001). Tuscaloosa: University of Alabama, The Clearinghouse for Genetic Algorithms.

Holland, J. H. (1968). Hierarchical descriptions of universal spaces and adaptive systems (Technical Report ORA Projects 01252 and 08226). Ann Arbor: University of Michigan, Department of Computer and Communication Sciences.

Holland, J. H. (1973). Genetic algorithms and the optimal allocations of trials. SIAM Journal on Computing, 2(2), 88-105.

Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor: The University of Michigan Press.

Kemeny, J. G., & Snell, J. L. (1966). Finite Markov chains. Princeton: Van Nostrand.

Martin, N. (1973). Convergence properties of a class of probabilistic adaptive schemes called sequential reproductive plans. (Doctoral dissertation, University of Michigan). Dissertation Abstracts International, 3747B. (University Microfilms No. 74-3685)

AN ANALYSIS OF REPRODUCTION AND CROSSOVER IN A BINARY-CODED GENETIC ALGORITHM

Clayton L. Bridges and David E. Goldberg
The University of Alabama
Tuscaloosa, AL 35487

ABSTRACT

The foundation of genetic algorithm (GA) theory--the so-called schema theorem or fundamental theorem of genetic algorithms--provides a lower bound on the expected number of representatives of a particular schema (similarity subset) in the next generation under various genetic operators.
In this paper, assuming a large population of binary, haploid structures of known distribution, processed by fitness proportionate reproduction, random mating, and random, single-point crossover, an exact expression for the expected proportion of a particular string (or representatives of a particular schema) in the next generation is calculated. This derivation is useful in analyzing the expected performance of simple GAs.

INTRODUCTION

Over the past two decades, the application of genetic algorithms (GAs) to search and machine learning problems in science, commerce, and engineering has been made possible by a number of theoretical developments (Holland, 1973, 1975; De Jong, 1975; Bethke, 1981). Without these theories, it is doubtful that many of us would have made much sense of our computer simulations or experiments; this speculation is supported by the experience of genetic algorithm prehistory. We need only recall some of the evolutionary schemes that resorted to mutation-plus-save-the-best strategies (Box, 1957; Bledsoe, 1959; Friedman, 1959; Fogel, Owens, & Walsh, 1966) to remember that shots in the dark without schemata or the fundamental theorem (Holland, 1975) can be frustrating experiences indeed. Because of the usefulness of theory to progress in genetic algorithm research, there is still a pressing need to improve our understanding of the foundations--the theoretical underpinnings--of genetic algorithms and their derivatives. In this paper, we extend the fundamental theorem of genetic algorithms to exactitude. Specifically, we derive a complete set of equations describing the combined effects of reproduction and crossover on a large population of binary, haploid structures.
These equations may be used to determine the correct expected performance of a genetic algorithm on a given problem with specified coding; they may be used for calculating the correct expected propagation of a set of competing schemata; they may also be used for estimating disruption or source probabilities for particular strings or particular schemata in a specified population of structures. In the remainder of the paper, we develop our extended analysis in three steps. We reexamine the fundamental theorem of GAs (the schema theorem), calculate an exact expression for the probability of disruption due to crossover, and calculate the expected gain of individuals due to mating and crossover by others.

THE FUNDAMENTAL THEOREM OF GENETIC ALGORITHMS

The fundamental theorem of genetic algorithms (Holland, 1975) calculates a bound on the expected number, m, of schemata (similarity templates), H, in successive generations, t, under the action of reproduction and crossover (other operators are often included in the calculation; we choose to focus on reproduction and crossover alone):

m(H, t+1) ≥ m(H, t) · [f(H)/f̄] · [1 − p_c·δ(H)/(ℓ−1)]

In this equation, ℓ is the string length, p_c is the probability of crossover, and δ(H) is the schema defining length (the distance between its outermost defining positions). In words, the schema theorem tells us that a particular schema H receives trials according to the ratio of schema fitness to population average fitness as long as the schema is not unduly disrupted by crossover. The average fitness of a schema f(H) may be calculated by summing over all strings s_j, representatives of H at time t:

f(H) = [Σ_{ {j | s_j ∈ H} } f(s_j)] / m(H, t)

We note that the theorem is an inequality--a lower bound. Our main goal in this paper is to transform the bound to an equality. To do this, it is helpful to look at a full schema conservation equation in broad outline:

m(H, t+1) = m(H, t) · [f(H)/f̄] · [1 − p_c·p_d] + [gains from crossing]

The increase due to reproduction is multiplied by the probability of the schema surviving a cross (one minus the product of the crossover probability p_c and the probability of disruption p_d). For completeness we must also include possible gains from mating and crossover of other strings. Hidden within this broad outline are the two items that prevent the schema theorem from providing an exact analysis. First, we usually calculate a crude upper bound on the probability of disruption p_d due to crossover; the usual term δ(H)/(ℓ−1) assumes that the schema is destroyed every time a cross falls within its defining length. This is a conservative assumption because it ignores the possibility of getting back the same material from a particular mate. Second, the schema theorem ignores all sources of schemata from crosses of strings containing different competing schemata. In crossover, one schema's loss is another's gain. Although these terms may seem small and somewhat beside the point, a recent study (Goldberg, 1986) has shown that inclusion of these terms in an extended schema analysis permits the prediction of convergence of a simple genetic algorithm on a problem specifically designed to cause the GA to diverge (the minimal, deceptive problem). As such, these terms deserve more of our attention. We first look at the probability of disruption due to crossover.

DISRUPTION DUE TO CROSSOVER

In this section we calculate the probability of disruption of a string due to the set of all possible crosses. To do this, we define some useful symbols and terms. We take S as the set of all possible binary strings of length ℓ; there are 2^ℓ such strings. Furthermore, we index the set S so s_j is the jth member of S, where j goes from 0 to 2^ℓ−1. As a convenience, we choose our indexing scheme so the decoded value of our string is equal to the index value j.
We note that we are working in terms of strings, and the schema theorem is written in terms of schemata. We continue here to work with strings as the equations are easier to derive and understand; however, keep in mind that we ultimately want to come back to a particular set of competing schemata. We do this in a later section. We define arbitrary string variables B, Q, and R; these may take on any value in S. We also permit ourselves to refer to their individual positions by subscripted lower case letters:

B = b_0 b_1 ... b_{ℓ−1}

Thus b_j is the Boolean variable representing the jth position of a string of length ℓ. Before we proceed with the derivation of the probability of disruption, we convert the schema theorem to proportion form. Suppose reproduction is operating by itself. In this case, we only have the first part of the schema theorem: m(B, t+1) = m(B, t)·f(B)/f̄. If we divide both sides of the equation by the population size N, and if we define P_B^t to be the proportion of the string B in the population at time t, we obtain the following equation for the expected proportion in the next generation (or the mating pool) under the action of reproduction alone:

P_B^{t+1} = P_B^t · f_B / f̄

where f_B is the fitness of string B and f̄ is the average fitness of the population as given by the following expression:

f̄ = Σ_{j=0}^{2^ℓ−1} f_{s_j} P_{s_j}^t

Of course, we are currently interested in calculating the crossover disruption term that reduces the above expression by a factor 1 − p_c·p_d, where p_c is the probability of crossover and p_d is the probability of disruption; however, we envision the GA acting in two phases: a selection phase followed by a mating and crossover phase. Therefore it is useful to define the expected proportion of string B under reproduction alone. We shall use this quantity in our calculation of the probability of disruption p_d.
We call this important intermediate quantity R_B^t, the reproductive proportion:

R_B^t = P_B^t · f_B / f̄

To calculate the probability of disruption to a given string B under crossover, we need to know which strings will disrupt B and which won't. Clearly a string that is different from another string by a single bit cannot fail to return a copy of the original string among the two strings produced by the cross. For example, suppose we have two strings B and B′ which differ at position 3 (the bar is used to indicate a complementary bit position):

B  = b_0 b_1 b_2 b_3 b_4
B′ = b_0 b_1 b_2 b̄_3 b_4

Every possible cross produces a copy of B. On the other hand, strings with two or more bits of difference must (for at least one cross site) disrupt one another. Consider the two strings with three different bits:

B  = b_0 b_1 b_2 b_3 b_4
B′ = b_0 b̄_1 b̄_2 b̄_3 b_4

In this case, disruption occurs if a cross falls between the two outermost bits of difference (between positions 1 and 3). These two outermost bits--we call them sentry bits--are important in the analysis of disruption, because we need only consider their location when deciding how two strings will disrupt one another. We can use this information to create a general scheme for analyzing mutual disruption:

Region:           begin      middle           end
Length:           x          δ+1              ℓ−δ−x−1
Characteristics:  b ... b    b̄ * ... * b̄     b ... b

Here we have divided the string into three regions: the beginning, the middle, and the end. In the beginning and ending regions, the strings under consideration must have the same bit values as string B. The middle region is bounded by sentry bits (the b̄'s) where the string B is different from its prospective mate. Bit positions other than the sentry positions are marked with *'s, the usual don't care or wild card symbol used in schema analysis; we may properly use the don't care symbol here, because it really does not matter where we cross between sentry bits. Every such cross is disruptive. To quantify the disruption we define a number of useful variables.
We take δ (δ ∈ [1, ℓ−1]) as the defining length of the middle region; this is also the number of possible cross sites. Separately we recognize that the length of the middle section (including both sentries) is δ+1. We define x (x ∈ [0, ℓ−δ−1]) as the length of the beginning region; it also is the position of the first sentry bit. Finally, we recognize that the length of the final section is the remaining portion that causes the total to sum to ℓ: ℓ−δ−x−1. To facilitate our disruption computation, we define a middle function M = M[B,δ,x]. The middle function is a subset generator, generating a similarity subset according to the following schema:

M[B,δ,x] = b_0 ... b_{x−1} b̄_x * ... * b̄_{x+δ} b_{x+δ+1} ... b_{ℓ−1}

We may now use the middle function M to calculate the probability of disruption of a given string B. This probability is the sum over all strings of the product of proportions and defining lengths:

p_d = Σ_{δ=1}^{ℓ−1} Σ_{x=0}^{ℓ−δ−1} [δ/(ℓ−1)] Σ_{ {j | s_j ∈ M[B,δ,x]} } R_{s_j}^t

The use of the middle function M insures that we count each string only once. This may be confirmed by removing the δ and R factors and simply using the summation to count the number of such crosses. After clearing the expression, we find that there are 2^ℓ−ℓ−1 strings that match the M function. This quantity may be reasoned independently as the total number of strings less the number of one-bit mismatches (these don't disrupt) and the string B itself (a zero-bit mismatch).

STRING GAINS FROM CROSSOVER

Just as we are interested in detailing the losses from crossover, so are we interested in knowing the potential gain from other string crosses. We may construct a template to identify precisely when this occurs. Specifically, we consider the construction of a string B from two strings Q and R as follows:

Region:             begin        middle     end
Length:             a            o          w
Q characteristics:  * ... * b̄   b ... b    b ... b
R characteristics:  b ... b     b ... b    b̄ * ... *
We now demand that the Q string be identical to B in the middle and end regions, and we require the R string to be identical to B in the beginning and middle sections. This time around we post sentries just outside the middle region, and when we cross between the sentries we obtain a copy of the desired string B as one of the products of the cross. To make our summation a bit easier we define several important quantities. The quantity a (a ∈ [1, ℓ−1]) is the length of the beginning region in string Q. The quantity w (w ∈ [1, ℓ−a]) is the length of the ending region in R, and o (o = ℓ−a−w) is the length of the middle region. We define formal string functions to specify the strings they may cross to yield B. We call the two functions the beginning function A[B,a] and the ending function Ω[B,w]. The beginning function is a subset generator as follows:

A[B,a] = * ... * b̄_{a−1} b_a ... b_{ℓ−1}

The ending function generates subsets as follows:

Ω[B,w] = b_0 ... b_{ℓ−w−1} b̄_{ℓ−w} * ... *

The probability that these two strings will cross to give B is dependent on the value of o, the length of the region where they are the same. If o = 0, then there is only one site where the strings can cross to yield B. In general there are a total of o+1 cross sites; therefore, the probability that a cross will give B is given by the expression:

(o+1)/(ℓ−1)

We may now calculate the expected proportion gain P_g of strings B from crosses by all other strings as follows:

P_g = (2)(1/2) Σ_{a=1}^{ℓ−1} Σ_{w=1}^{ℓ−a} [(o+1)/(ℓ−1)] [Σ_{ {j | s_j ∈ A[B,a]} } R_{s_j}^t] [Σ_{ {j | s_j ∈ Ω[B,w]} } R_{s_j}^t]

We multiply by 2 because there are two ways to pick Q and R. We divide by two because only half the products of the cross are B strings.

THE COMPLETE EQUATION

With both disruptions and gains now calculated, it is a straightforward matter to write the expected proportion of strings B in generation t+1 under both reproduction and crossover:

P_B^{t+1} = R_B^t [1 − p_c·p_d] + p_c Σ_{a=1}^{ℓ−1} Σ_{w=1}^{ℓ−a} [(o+1)/(ℓ−1)] [Σ_{ {j | s_j ∈ A[B,a]} } R_{s_j}^t] [Σ_{ {j | s_j ∈ Ω[B,w]} } R_{s_j}^t]

EXTENSION TO SCHEMATA

So far we have confined our analysis to individual strings.
We often want to examine the expected propagation of a set of competing schemata over some specified bit positions. To do this in our current scheme requires only minor modifications to our equations. To interpret the extended equations in schemata, we first index the sums over the o fixed positions (where o is the schema order or the number of fixed positions) instead of the ℓ positions in the string (we do use ℓ in the denominator of the disruption probability, however). Next, we introduce a function Δ(x,δ) to account for the possibly unequal spacing of the fixed positions. For example, if we are interested in the order 3 competing schemata defined by the template f***f*f (where the f indicates a fixed position, a 0 or a 1) at x=0 and δ=1, the defining length function may be calculated as Δ(0,1)=4. Since the x values only run through the consecutive fixed positions, at x=1 (the middle fixed position) we find Δ(1,1)=2. With such a function defined, we may rewrite our disruption probability for a schema H as follows:

p_d = Σ_{δ=1}^{o−1} Σ_{x=0}^{o−δ−1} [Δ(x,δ)/(ℓ−1)] Σ_{ {j | S_j ∈ M[H,δ,x]} } R_{S_j}^t

Here we interpret S_j as a member of the set of schemata sharing the same o fixed positions as H. Also, R_{S_j}^t is interpreted as the reproductive proportion of the schema S_j. Thus we are able to extend the computation of the probability of disruption to schemata using the same middle function M through careful indexing and the introduction of a function that tracks unequal defining bit spacing. Similar modification may be introduced in the expected gain proportion computation, P_g.

CONCLUSIONS

In this paper, we have derived an equation for the expected propagation of strings in a genetic algorithm under reproduction and crossover, and we have shown how such a derivation can be extended to schemata.
The computation is exact within the assumptions of the analysis (large populations, uniformly random mating, and randomly chosen cross sites). The equations may be used for the analysis of specific problems and codings, or they may be used to quantify disruption in populations of strings or schemata of known distribution. The equations should further our understanding of the detailed operation of genetic algorithms.

ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grant MSM-8451610.

REFERENCES

Bethke, A. D. (1981). Genetic algorithms as function optimizers. (Doctoral dissertation, University of Michigan). Dissertation Abstracts International, 41(9), 3503B. (University Microfilms No. 8106101)

Bledsoe, W. W. (1961, November). The use of biological concepts in the analytical study of systems. Paper presented at the ORSA-TIMS National Meeting, San Francisco, CA.

Box, G. E. P. (1957). Evolutionary operation: A method for increasing industrial productivity. Applied Statistics, 6(2), 81-101.

De Jong, K. A. (1975). An analysis of the behavior of a class of genetic adaptive systems. (Doctoral dissertation, University of Michigan). Dissertation Abstracts International, 36(10), 5140B. (University Microfilms No. 76-9381)

Fogel, L. J., Owens, A. J., & Walsh, M. J. (1966). Artificial intelligence through simulated evolution. New York: John Wiley.

Friedman, G. J. (1959). Digital simulation of an evolutionary process. General Systems Yearbook, 4, 171-184.

Goldberg, D. E. (1986). Simple genetic algorithms and the minimal deceptive problem (TCGA Report No. 86003). Tuscaloosa: University of Alabama, The Clearinghouse for Genetic Algorithms.

Holland, J. H. (1973). Genetic algorithms and the optimal allocations of trials. SIAM Journal on Computing, 2(2), 88-105.

Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor: The University of Michigan Press.
REDUCING BIAS AND INEFFICIENCY IN THE SELECTION ALGORITHM

James Edward Baker
Vanderbilt University

Abstract

Most implementations of Genetic Algorithms experience sampling bias and are unnecessarily inefficient. This paper reviews various sampling algorithms proposed in the literature and offers two new algorithms of reduced bias and increased efficiency. An empirical analysis of bias is then presented.

1. Introduction

Genetic Algorithms (GAs) cycle through four phases[4]: evaluation, selection, recombination and mutation. The selection phase determines the actual number of offspring each individual will receive based on its relative performance. The selection phase is composed of two parts: 1) determination of the individuals' expected values; and 2) conversion of the expected values to discrete numbers of offspring. An individual's expected value is a real number indicating the average number of offspring that individual should receive. Hence, an individual with an expected value of 1.5 should average 1 1/2 offspring. The algorithm used to convert the real expected values to integer numbers of offspring is called the sampling algorithm. The sampling algorithm must maintain a constant population size while attempting to provide accurate, consistent and efficient sampling. These three goals lead to the bias, spread and efficiency measures described below:

1) bias -- Bias is defined to be the absolute difference between an individual's actual sampling probability and his expected value[3]. To ensure proper credit assignment to the represented hyperplanes, the expected values should be sampled as accurately as possible. The optimal, zero bias is achieved whenever each individual's sampling probability equals his expected value.

2) spread -- Let f(i) be the actual number of offspring individual i receives in a given generation. We define the "spread" as the range of possible values for f(i).
Furthermore, we define the "Minimum Spread" as the smallest possible spread which theoretically permits zero bias. Hence, the Minimum Spread is one in which

f(i) ∈ {⌊ev(i)⌋, ⌈ev(i)⌉}

where ev(i) = expected value of individual i. Whereas the bias indicates accuracy, the spread indicates precision. Hence the spread reveals the sampling algorithm's consistency.

3) efficiency -- It is desirable for the sampling algorithm not to increase the GAs' overall time complexity. GAs' other phases are O(LN) or better, where L = length of an individual and N = population size.

All currently available sampling algorithms fail to provide both zero bias and Minimum Spread[3]. Yet an accurate sampling algorithm is crucial. All of the GAs' theoretical support presupposes the ability to implement the intended expected values. How much the existing inaccuracies have affected the GAs' performance is unknown. However, inaccuracies of sufficient magnitude to alter performance should be either understood or eliminated. This paper offers an empirical analysis of one of the most common sampling algorithms[5] -- Remainder Stochastic Sampling without Replacement -- and introduces two new sampling algorithms designed to reduce or eliminate these inaccuracies.

Section 2 reviews and compares previously proposed sampling algorithms. Section 3 introduces an improved sampling algorithm which can be partially executed in parallel. Section 4 introduces a sampling algorithm of zero bias, Minimum Spread and optimal O(N) time complexity. Section 5 presents an empirical analysis of these algorithms and section 6 provides a summary and conclusions.

2. Previous Work

The basic characteristics of various sampling algorithms[3] are presented in Table 1.
The first four algorithms are stochastic and involve basically the same technique:

-- Determine R, the sum of the competing expected values.
-- 1-1 map the individuals to contiguous segments of the real number line, [0..R), such that each individual's segment is equal in size to its competing expected value.
-- Generate a random number within [0..R).
-- Select the individual whose segment spans the random number.
-- Repeat the process until the desired number of samples is obtained.

This technique is commonly called the "spinning wheel" method. It is analogous to a gambler's spinning wheel with each wheel slice proportional in size to some individual's expected value. This technique is typically implemented in O(N²) time, but can be implemented in O(NlogN) by using a B-tree.

In "Stochastic Sampling with Replacement", the "spinning wheel" is composed of the original expected values and remains unchanged between "spins". This provides zero bias yet virtually unlimited spread -- any individual with expected value > 0 could be chosen to fill the entire next population. "Stochastic Sampling with partial Replacement" prevents this. In this algorithm, an individual's expected value is decreased by 1.0 each time he is selected. (If the expected value becomes negative, it is set to 0.) This modification provides an upper bound of ⌈expected value⌉ on the spread. However, this upper bound is achieved at the expense of bias. Furthermore, a reasonable lower bound is not provided.

Remainder sampling methods reduce the spread further. A remainder sampling method involves two distinct phases. In the integral phase, samples are awarded deterministically based on the integer portions of the expected values. This phase guarantees a lower bound of ⌊expected value⌋ on the spread. The fractional phase then samples according to the expected values' fractional portions.
In "Remainder Stochastic Sampling with Replacement", the fractional expected values are sampled by the "spinning wheel" method. The individuals' fractions remain unaltered between "spins", and hence continue to compete for selection. This sampling algorithm provides zero bias and the greatest lower bound on the spread, yet it provides virtually no upper bound. Any individual with an expected value fraction > 0 can obtain all samples selected during the fractional phase.

"Remainder Stochastic Sampling without Replacement" (RSSwoR) provides Minimum Spread. This remainder algorithm also uses the "spinning wheel" for the fractional phase. However, after each "spin", the selected individual's expected value is set to zero. Hence, individuals are prevented from having multiple selections during the fractional phase. Unfortunately, although this is a commonly used sampling algorithm, it is biased by favoring smaller fractions[2].

A "Deterministic Sampling" algorithm is suggested and used by Brindle[3]. In this remainder algorithm's fractional phase, the individuals with the largest fractions are selected. The result is a minimum sampling error for each generation, yet a high overall bias. Since overall accuracy in approximating the expected values is crucial to the applicability of the GAs' current theory, high bias is considered unacceptable. Hence, "Deterministic Sampling" is not widely used.

3. Remainder Stochastic Independent Sampling

In a spinning wheel algorithm, selection is based on the relative expected values of those competing. This relativity is unnecessary since the expected values themselves are population normalized. Furthermore, this relativity is causing the bias-spread tradeoff and the O(NlogN) time complexity. (If the selected individuals are not inhibited in future competitions, then there is little spread limitation. However, if they are inhibited, subsequent selection is biased.)
In "Remainder Stochastic Independent Sampling" (RSIS), the fractional phase is performed without use of the error-prone "spinning wheel". RSIS independently uses each fractional expected value as a probability of selection. This is accomplished by traversing the population and stochastically determining whether each individual should be selected:

"C" Code Fragment for RSIS

    /* Integral Phase */
    NumSelected = 0;
    for (i = 0; i < N; i++)
        for (; ExpVal[i] >= 1.0; ExpVal[i]--) {
            SelectInd(i);
            NumSelected++;
        }

    /* Fractional Phase */
    for (i = 0; NumSelected < N; i = (i + 1) % N)   /* wraps for another traversal */
        if (ExpVal[i] > Rand()) {
            SelectInd(i);
            ExpVal[i] = 0;
            NumSelected++;
        }

Where Rand() returns a random real number uniformly distributed within the range [0..1). The code for RSIS is noticeably less complex than typical implementations of sampling and much less than O(NlogN) algorithms. Although the fractional phase is potentially an infinite loop, probabilistically it completes after one traversal of the population and is only O(N). Empirical studies indicate that it rarely proceeds beyond the second traversal -- averaging only 0.017% over a variety of expected value distributions. Hence a modified algorithm which randomly selected the remaining sample(s) after two traversals would be heuristically appropriate.

An individual can not be selected more than once during the fractional phase, since the selected individual's expected value is set to zero. Hence this remainder sampling algorithm has Minimum Spread. However, some bias is present in RSIS. This bias occurs if a second population traversal is necessary and is similar to that found in latter stages of a spinning wheel algorithm without replacement. However, the overall bias of RSIS will be much less since, typically, most of the samples are obtained in the zero biased, first traversal. Note that "positional bias" may occur if the individuals are ordered. However, in a remainder sampling algorithm, the population is shuffled to prevent excessive cloning.
Hence, this potential bias does not exist.

The selection phase of GAs requires some sequential processing, both in obtaining and sampling the expected values. However, the GAs' evaluation, recombination and mutation phases can be executed in full parallel. Hence, increases in the parallel nature of the selection phase may eventually prove useful. RSIS is the only acceptable sampling algorithm which can be partially executed in parallel. The fractional phase can be performed in full parallel but must precede the O(N) sequential, integral phase.

"C" Code Fragment for Partially Parallel RSIS

    /* sample fractions in full parallel */
    /* copy population and mark selected individuals */
    /* code of jth processor */
    parbegin
        NextPop.Ind[j] = CurrPop.Ind[j];
        NextPop.flag[j] = (ExpValFract[j] > Rand());
    parend;

    /* sample integers sequentially */
    /* overwrite unmarked individuals first */
    k = 0;       /* pointer to NextPop */
    set = 1;     /* availability flag */
    for (i = 0; i < N; i++)
        for (; ExpVal[i] >= 1.0; ExpVal[i]--) {
            while (NextPop.flag[k] == set)
                if (++k == N) k = set = 0;
            NextPop.Ind[k++] = CurrPop.Ind[i];
        }

In the parallel fractional phase, the current population is copied into the next population and the selected individuals are marked with a 1. This mark indicates the positions' unavailability. During the integral phase, the unmarked positions will be filled first. Two potential problem conditions exist: the fractional phase may select too few or too many individuals. If too few are selected, then some unmarked individuals will not be overwritten during the integral phase and hence will be considered selected. This is equivalent to a standard RSIS implementation employing random selection on the second traversal. (If this proves undesirable, a sequential, stochastic, second traversal could easily be performed.) If fractional selection chooses too many individuals, then integral selection will require additional positions. However, ⌊expected value⌋ must be guaranteed to maintain Minimum Spread.
Thus, some individuals selected during the fractional phase must be replaced. This replacement does not alter the performance characteristics of RSIS, since sequential RSIS would not have chosen these extra individuals in the first place. In either case, Minimum Spread is maintained and the low bias of RSIS remains unaffected.

4. Stochastic Universal Sampling

"Stochastic Universal Sampling" (SUS) is a simple, single phase, O(N) sampling algorithm. It is zero biased, has Minimum Spread and will achieve all N samples in a single traversal. However, the algorithm is strictly sequential.

"C" Code Fragment for SUS

    ptr = Rand();    /* returns random number in [0..1) */
    for (sum = i = 0; i < N; i++)
        for (sum += ExpVal[i]; sum > ptr; ptr++)
            SelectInd(i);

On a standard spinning wheel, there is a single pointer which indicates the "winner". SUS is analogous to a spinning wheel with N equally spaced pointers. Hence, a single spin results in N "winners". Since the sum of the population's expected values = N, the pointers are exactly 1.0 apart. Thus an individual is guaranteed ⌊expected value⌋ in samples and no more than ⌈expected value⌉. Hence, SUS has Minimum Spread. Furthermore, in a randomly ordered population, an individual's selection probability is based solely on the initial spin and the magnitude of his expected value. Hence, SUS has zero bias. Like remainder sampling algorithms, the population must be shuffled before crossover. Therefore, positional bias can not exist in SUS, either. (Note that, in general, any number of samples, n, can be obtained by letting ptr range within [0..N/n) and incrementing it by N/n after each selection.) In a sequential environment, SUS seems to be the optimal sampling algorithm. In the next section, an empirical analysis of RSIS and SUS is presented.

5. Empirical Analysis

This analysis investigates the severity, direction and progression of bias in RSSwoR, sequential RSIS and SUS. The severity is important from an implementation standpoint.
Although bias can be proven theoretically [2], it is of less interest if it does not alter performance. The direction of the bias indicates which individuals are favored, and the progression indicates when the bias occurs. Since we are concerned only with the sampling algorithm, a full execution of GAs is not necessary. Rather, we simply maintain a population of expected values. This allows a single expected value distribution to be sampled repeatedly. We use linear distributions specified by the parameter MAX, with 1.0 < MAX < 2.0.

Population A_r recognizes a pattern string x ∈ A_p if and only if there exists a non-empty S ⊆ A_r such that

    ∀ x' ∈ S:  l − Σ_{i=0}^{l−1} (((x ⊕ x') >> i) & 1) ≥ T,

where ⊕ is the XOR operator, >> is the shift-right operator, & is the bitwise AND operator, and T is an integer threshold. Thus, a population A_r of recognizer strings recognizes a pattern string when there is at least one recognizer string that has the same bit values in at least T bit positions. T in the immune system has been experimentally found to be between six and eight amino acids, but in this conceptual model T has been set to five bits, though this can easily be changed. Also, in the real immune system the antibodies form three-dimensional configurations, so that any T or more positions can try to match the antigen. In the current model, this translates to letting any bit position value of the recognizer string match any other bit position value of the pattern string within the limits of the configuration possibilities. Since the configurations of real antibodies are not known yet, this idea has been abstracted to simply allowing any T bit position values to match with the values in the same bit positions in the pattern string. Also, two bit values match if they are not only in the same position in the binary string, but also if they are the same values. In the real immune system, matches occur between complementary values rather than identical values. In this simple model, this does not make a difference.
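The bitwise matching rule can be sketched in C directly with the operators listed in the text; the function names (matches, recognizes) are ours, and strings are packed into unsigned longs, so l must be at most the word size:

```c
/* Count the bit positions (0..l-1) where pattern x and recognizer xp
 * carry the same value: l minus the number of 1s in x ^ xp. */
int matches(unsigned long x, unsigned long xp, int l) {
    int diff = 0;
    for (int i = 0; i < l; i++)
        diff += (int)(((x ^ xp) >> i) & 1UL);   /* 1 where bits differ */
    return l - diff;
}

/* Recognition: at least T of the l bit positions agree. */
int recognizes(unsigned long x, unsigned long xp, int l, int T) {
    return matches(x, xp, l) >= T;
}
```

With l = 5, the strings 01101 and 11100 agree in three positions, so recognition holds for T = 3 but not for T = 4.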
However, if the model were to incorporate the fact that antibodies can match other antibodies as well (where self-recognition and loop creation are important), then complementary matching would have to be the rule. However, in the current model, antibodies as recognizers evolve as they try to match the antigens or patterns, regardless of what other recognizers the recognizer string may match. Another way of looking at the pattern recognition problem is in terms of schemata. Schemata represent subsets of binary strings by picking out certain bit positions that must have a certain bit value, where the rest of the bit positions can have either value [8]. For instance, the string 011 is contained in schemata 0##, #1#, ##1, 01#, 0#1, #11, 011, and ###, where the #-sign indicates that for that position the schema will accept any value. In this model, the defining bits of a schema would represent those bit values that match between a recognizer string and a pattern string. The theory of how schemata are manipulated under the genetic algorithm can be used to understand how the recognizer strings are evolving. Since each binary string is a member of 2^l schemata, between 2^l and 3^l schemata are implicitly tested and recombined by the genetic algorithm each generation or time step as the strings get tested and recombined with respect to their fitnesses. The strings reproduce in proportion to their fitness. Schemata can also be thought of as having fitnesses based on the average of the fitnesses of the strings that are members of that schema. Then, as Holland has shown [8], the schemata also reproduce in the population in proportion to their fitnesses. Each schema is implicitly manipulated in the strings that are members of that schema, where the strings are explicitly manipulated by the genetic operators.
What this means for the current pattern recognition problem is that if the genetic algorithm finds one schema in the recognizer population with at least T defining bits that match with a pattern, then the recognizer population recognizes that pattern string. Also, as the genetic algorithm manipulates schemata at an exponential rate to find useful ones that match patterns in the pattern population, it also finds recognizer strings at an exponential rate to match and recognize the patterns. The covering problem can simply be stated as follows: Population A_r covers population A_p if and only if ∀ x ∈ A_p, ∃ a non-empty S ⊆ A_r such that ∀ x' ∈ S, x' recognizes x. That is, every pattern string in the pattern population should be recognized by at least one recognizer string. The runs of this model should at least be able to show that the recognizer population can evolve under the genetic algorithm to recognize a pattern string as well as cover the pattern population. Once parameter settings and reproduction strategies have been found that give this result, the schemata in the evolving recognizer population can be studied to determine how the recognizer population as a whole organizes information about the patterns as it discovers recognizers to recognize the patterns.

3 The Model

The parameter values for the genetic algorithm are set as follows. The length of the strings is l = 36 bits. To determine the size of the recognizer population, the length of the schemata one expects to find must be taken into account.
Based on recognition in the immune system, the model should be able to let schemata with five or six defining bits develop in the recognizer population. To see what schemata we can expect to find in the recognizer population, we can use a corollary of John Holland's Theorem 6.2.3 [8] that

    ε ≤ 2n / l,

where ε is the error due to crossover (which may break an instance of a schema), n is the number of defining bits in the schemata one is interested in, and l is the string length. Since l = 36, the error for a schema of length 1 is ≤ 0.056, of length 2 is ≤ 0.12, 3 is ≤ 0.17, 4 is ≤ 0.23, 5 is ≤ 0.28, 6 is ≤ 0.34, and so on. In order for the recognizer population to be able to search for schemata with n = 6 defining bits, of which there are 2^n = 2^6 = 64, the recognizer population size should be about 64 as an upper bound. Schemata with more defining bits might still be found. The recognizer population size has been set to recognizerpopulationsize = 50, which is still much greater than 2^5 = 32 and close to 2^6 = 64, where the error rate is about 0.3, about as high as the genetic algorithm can tolerate. The initial strings in the recognizer population are created at random. The pattern population size has been chosen to be patternpopulationsize = 1 for one set of runs, and patternpopulationsize = 10 for another set of runs, to see if the recognizer population can recognize just one string as well as cover several pattern strings. The population is initialized with random strings that do not change over time. The point mutation rate has been set as low background noise, pointmutationrate = 2 point mutations per time step, where the mutated bits are chosen at random. The crossover rate has been set to crossoverrate = (1/3) · recognizerpopulationsize, so that in each generation 1/3 of the recognizer strings are chosen as parents, thus producing that many new strings, and replacing that many old strings.
The parents are chosen probabilistically with respect to fitness, so that the algorithm results in reproduction with emphasis (on fitness) [8]. Schemata with above average fitness have new copies or samples of them generated in the recognizer population in accordance with the theorem that guarantees minimal performance loss. Several fitness functions have been considered. With fitness equal to the average number of matching bits of the recognizer string to every pattern string, fitness generally does not increase monotonically nor exponentially, as claimed should happen with crossover [8]. To encourage the recognizer strings to evolve with as many consecutive bits matching a pattern string as possible, so that it is less likely that the schema of matching bits will be broken by crossover, fitness has been chosen to be

    f(x) = (1/msample) · Σ_{j=1}^{msample} Σ_{i=1}^{#matches_j} m_j(i) / ((1/2)^{m_j(i)} · (l − m_j(i) + 1)),

where x is a recognizer string, msample is matchsample, m_j(i) is the number of consecutive bit matches between the ith set of consecutive bit matches in string x and the jth pattern string in sample s_x, #matches_j is the number of consecutive bit match sequences between string x and the jth pattern string, and l is the length of the strings. For example, if x = 01101 and s_x = {11100, 00101}, then m_1(1) = 3, m_2(1) = 1, m_2(2) = 3, and

    f(x) = (1/2) · (3/((1/8)·3) + 1/((1/2)·5) + 3/((1/8)·3)) = (1/2) · (8 + 2/5 + 8) = 8.2.

The matchsample is the size of the sample s_x of strings chosen from the pattern population. The sample s_x, rather than all of the strings in the pattern population, is used to compute fitness. Using this sample helps subpopulations emerge in the recognizer population, as in Booker's simulation [1]. Subpopulations are necessary if all of the patterns are to be recognized by the recognizer population.
The combinatorial bias of the probability of finding m matching bits between two strings, (1/2)^m · (l − m + 1), has been removed by dividing the number of matches by the bias, to get the unbiased fitness that would make all lengths of matches equally likely to occur when based on fitness. The probability of finding a match between two bits is 1/2. For m bits, it is (1/2)^m. Since these bits are consecutive, there are (l − m + 1) ways that they can be chosen in a string of length l. Therefore, a sequence of m consecutive matching bits occurs with a probability of (1/2)^m · (l − m + 1) in a string of length l. Dividing by this probability is equivalent to putting more emphasis on longer schemata, since the dominating term of 2^{m_j(i)} increases exponentially with the length of the substring, m_j(i). Several replacement strategies have been tried for determining which recognizer strings get replaced by the new offspring strings created by crossover. Choosing a string probabilistically with respect to 1/fitness eventually gives rise to a homogeneous population consisting of one highly fit string. Choosing the string that most closely matches the new string, a "crowding" strategy, never allows the fitness to go up, since the child too often replaces its parent and thus schemata do not increase in proportion to their fitness as they should. A similar strategy, first tried by DeJong [2] and later by Booker [1], where the closest matching string is chosen from a small random sample of the recognizer population, does not do much better than the first strategy. A strategy which works well is one which chooses a small sample of recognizer strings, of size crowdsample, probabilistically with respect to 1/fitness. Then from this sample, the string which most closely matches the new string is chosen to be replaced. With this strategy, the recognizer population fitness increases and does not yield a homogeneous recognizer population. The crowdsample is determined by how many copies of a schema one wants.
A sample of size n would on average include recognizer strings in all schemata that are present in 1/n of the recognizer population. Thus, n subpopulations are allowed to coexist in the recognizer population, since no subpopulation can contain more than 1/n of the recognizer strings. Since the replacement strategy uses fitness to determine which strings to replace, it is consistent with the reproduction with emphasis plan. That is, schemata with below average fitness will have copies or samples removed, so that their sample size decreases and makes room for an increase in the number of above average schemata. Thus, this strategy generally replaces the string with the lowest fitness which is in the same subpopulation as the new string. The algorithm to find schemata actually finds a small sample of all schemata present in a string population. The sampling is designed to find the more useful schemata (building blocks) that evolve under the genetic algorithm, rather than all of the schemata present in a population. The algorithm to find these schemata is:

1. Choose a sample, s1, of the population in which to find schemata (e.g. the n most fit strings).

2. Find the schemata in each string in sample s1. For each string in sample s1:

   (a) Get a random sample, s2, from the population.
   (b) Check the string against each string from s2 to find which bits match.
   (c) Record schemata which appear between the string and each string in s2 with matching bits no more than two bits apart (otherwise you have a separate schema).
   (d) Do not store duplicates.
   (e) Store the schema frequency, sf, by running through all the strings in s2 again.

3. Check to see what schemata are present randomly. For each string in sample s1:

   (a) Use the same random sample s2 as in 2(a).
   (b) Count the number of matches at each bit between the string and every string in s2.
   (c) If sf < (1/#definedbits) · Σ_{i ∈ defined bits} #matches_i / |s2|, then remove the schema from the list, where sf is the schema frequency and #matches_i is the number of matches found in 2(b) for bit i defined in the schema.

For example, let s1 = {0110100011} and s2 = {1110011101, 0101010101}. For step (2), s1_1 against s2_1 gives #110#####1, with schemata #110###### of frequency 1/2 and #########1 of frequency 2/2, and s1_1 against s2_2 gives 01####0##1, with schemata 01######## of frequency 1/2 and ######0##1 of frequency 1/2. For step (3), the number of matches at each bit is (1)(2)(1)(1)(0)(0)(1)(0)(0)(2). Then for schema #110######, 1/2 < 2/3, so remove this schema; for #########1, 1 is not less than 1, so keep this schema; for 01########, 1/2 < 3/4, so remove this schema; and for ######0##1, 1/2 < 3/4, so remove this schema.

For the schemata in the pattern population, both samples are the entire pattern population. That is, all useful pattern schemata that can be located by the algorithm are found and counted. For the recognizer population, the (1/2 · recognizerpopulationsize) = n most fit strings are chosen for the first sample, s1, and the same size sample is chosen for the second random sample, s2. So recognizer schemata occurring in 1/2 of the recognizer population are located this way in every time step. Rather than being explicitly represented anywhere in the model, the schemata so located are printed out at every time step so that they can be analyzed. This model for pattern recognition is very similar to, though implemented differently from, Booker's genetic algorithm simulation [1]. In that simulation, the classifier system learns useful categories or taxa (the conditions of the classifiers) for one or more categories in the environment, where the categories are generating input messages to the system. The taxa are equivalent to schemata, though there are fewer of them in a classifier string than in the recognizer binary bit string.
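Step 2 of the schema-finding algorithm (recording schemata whose matching bits are no more than two bits apart) can be sketched in C on the example strings; the function name and the fixed-width output buffer are our own choices:

```c
#include <string.h>

/* For two '0'/'1' strings x and y of length l (< 63), write each schema
 * of shared bit values into out[], one '#'-padded string per schema.
 * Matching positions more than two non-matching bits apart start a new
 * schema.  Returns the number of schemata found. */
int find_schemata(const char *x, const char *y, int l,
                  char out[][64], int max_out) {
    int count = 0, i = 0;
    while (i < l && count < max_out) {
        if (x[i] != y[i]) { i++; continue; }
        memset(out[count], '#', l);     /* start a new all-# template */
        out[count][l] = '\0';
        int last = i;                   /* last matching position kept */
        for (int j = i; j < l; j++) {
            if (x[j] == y[j] && j - last <= 3) {
                out[count][j] = x[j];   /* keep the shared bit value */
                last = j;
            } else if (j - last > 3) {
                break;                  /* gap of more than two bits */
            }
        }
        count++;
        i = last + 1;                   /* resume after this schema */
    }
    return count;
}
```

On the example strings, 0110100011 against 1110011101 yields the two schemata #110###### and #########1.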
The binary bit string messages are equivalent to the binary bit pattern strings that the classifiers and recognizers, respectively, are trying to match. The matchscore that he uses, called M_s, is defined for a taxon x and message (binary bit string) x' as

    M_s(x, x') = 1, if x and x' match identically;
    M_s(x, x') = (l − n)/(l + 1), otherwise,

where n is the number of mismatched bit values and l is the length of the strings. This function is linear with slope −1/(l + 1), except for the case of all defining bits matching, where M_s returns the value 1 rather than the value given by the linear form. This jump when the optimal solution is reached ensures that a good taxon is never lost from the population. On the other hand, the fitness function in the model just presented is an exponential curve which increases the fitness exponentially with each additional matching bit, rather than just a big jump when the perfect match is found. The fitness function encourages bit by bit additions to matching schemata, whereas the matchscore encourages this but really emphasizes the final perfect match. The fitness function is also more apt to keep schemata of all lengths in the population in case the pattern strings in the environment change, so that the recognizer population can quickly build a matching recognizer from a few schemata. Matchscore just tries to match the categories in the current environment. In the pattern recognition model, there are no mating restrictions, unlike Booker's simulation where two strings can mate only if they match the same message. Mating in the new model is probabilistic with respect to fitness, which is computed with respect to a sample of pattern strings. Thus, higher fitness implies that the recognizer string matches more pattern strings out of its sample, which implies that there is a good chance that two mating strings both match at least one same pattern string, like a probabilistic mating restriction. This chance is equal to 1 − C(a − m, m)/C(a, m), where a is the pattern population size and m is matchsample.
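Reading the chance just described as 1 − C(a − m, m)/C(a, m), i.e. the probability that two independently drawn matchsample-sized subsets of the a patterns share at least one pattern, a quick numerical check can be sketched in C (helper names are ours):

```c
/* C(n, k) computed in a double by multiplying then dividing, which
 * keeps every intermediate value an exact integer. */
static double choose(int n, int k) {
    double r = 1.0;
    for (int i = 1; i <= k; i++) {
        r *= (double)(n - k + i);
        r /= (double)i;
    }
    return r;
}

/* Chance that two random m-subsets of a patterns share an element. */
double overlap_chance(int a, int m) {
    return 1.0 - choose(a - m, m) / choose(a, m);
}
```

For a = 10 and m = 3 this evaluates to 1 − 35/120 = 17/24 ≈ 0.71.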
For instance, in this model the chance would be 17/24 ≈ 0.71, where a = 10 and m = 3. Thus, the recognizer population can find the near-optimal points representing the pattern strings and still incorporate new strings with new schemata for each time step. Crowding is also implemented similarly to Booker's simulation. Here 1/fitness, the equivalent of Booker's 1/strength, is used to pick out a sample of strings to delete, based on which string best matches the new string. This is similar to Booker's crowding scheme, where taxation and a tax rebate lower the strength of strings in a subpopulation that has reached its "carrying capacity". Then the strings with lower strength are more likely to get removed, just as strings in the new model with lower fitness and higher 1/fitness are more likely to get removed. Since fitness is noisy, there are recognizer strings in each subpopulation with lower fitness that will probably get removed; if the subpopulation is at its carrying capacity, then these strings with lower fitness will appear in the sample and get chosen to be replaced. Also, in the current model, using a sample size equal to the number of desired subpopulations is simpler to implement than Booker's scheme, where the system keeps track of which taxa are relevant to what message, unless the relevance criterion is such that it can be calculated for each subpopulation rather than for every individual taxon. Thus, the recognizer population in the pattern recognition model, just like the classifiers in Booker's model, can also find the near-optimal points representing the pattern strings and still incorporate new recognizer strings with new schemata for each time step.

4 Results

Preliminary runs have found parameter settings with which the recognizer population increases its fitness (equal to the sum of the fitnesses of each recognizer string), as shown in the following log10(fitness) vs. time plots. Increasing fitness implies that the recognizer population is matching one or more pattern strings at more bits.
To test the model's ability to solve the recognition problem, fifty recognizer strings have been run under the genetic algorithm to get them to match one pattern string. The population fitness does increase exponentially (Figure 1) until it reaches the maximum population fitness possible, when about ninety percent of the recognizer strings match the pattern string at every bit position. The other ten percent match at almost every bit. To test the model's ability to solve the covering problem, a run where fifty recognizer strings try to recognize ten pattern strings has been tried. The fitness of the recognizer population does increase fairly monotonically (Figure 2). The listing of pattern schemata found in the final recognizer population shows that all pattern schemata with five or fewer defining bits have also been found in the final recognizer population. Also, upon closer inspection of the recognizer population, a recognizer string can be found for each pattern string such that the recognizer string matches the pattern string at no fewer than five bit positions. Thus, the recognizer population recognizes and covers all of the pattern strings. With what sample sizes does this genetic algorithm solve the recognition and covering problems? How does the recognizer population evolve with these parameter settings? It turns out to be critical that the value for matchsample be less than the size of the pattern population. That is, fitness for each string must be computed with respect to only a few patterns, and with respect to different patterns each time step for a given recognizer string. If all patterns are used to determine the fitness of a recognizer string, then the entire recognizer population ends up matching only one pattern as it reaches the maximum possible value for fitness. There are several reasons why matchsample should be just a portion of the pattern population size.
First of all, the computations to determine fitness are done faster if only a few patterns are used for the calculation. Secondly, with matchsample equal to the pattern population size, each recognizer uses the same sample of patterns to compute its fitness, namely all of the patterns. Thus, the first pattern with a longer schema found in the recognizer population will be the one matched by all of the recognizers over time. That schema will increase the fitness of the recognizer containing it exponentially with respect to the fitnesses of the other recognizers. That recognizer's descendants will take over the entire recognizer population to match that one pattern. Thus, in order to cover all of the patterns, a different sample of patterns must be used to calculate fitness for each recognizer string.

Figure 1. log10(fitness) vs. time; patternpopulationsize = matchsample = crowdsample = 1.

Figure 2. log10(fitness) vs. time; patternpopulationsize = crowdsample = 10, matchsample = 3.

Thirdly, and most importantly, sampling is the counterpart in this model of collision dynamics. One recognizer can try to recognize only a few patterns at any one time in real physical space. Modelling this accurately models the fact that this system is far from equilibrium most of the time, where changes in the system are most likely to increase rather than decrease order in the system. That is, new schemata introduced into the recognizer population will get their numbers increased with respect to their fitness and give rise to new recognizer strings as they get distributed throughout the recognizer population. Otherwise, with non-noisy fitness, these new schemata would be ignored and eventually be removed from the recognizer population, as one recognizer string containing one longer schema was the only one with surviving descendant strings. The other sample size for which the value turns out to be critical is crowdsample.
With crowdsample = 10 (equal to patternpopulationsize in this case), a schema will be replaced if it is in 1/10 of the recognizer population. This allows more room for other schemata to occur and distribute in the rest of the recognizer population. If crowdsample is less than that, for example if it is set to 3, then the fitness of the recognizer population increases but only about three of the patterns are recognized. With crowdsample = 10, as in the second run described, it was found that the recognizer population maintains a high number of schemata at all times, keeping the number of recognizer schemata around 100 ± 25 each time step. By maintaining such diversity in the recognizer population, all pattern schemata of 5 or fewer defining bits are found by the genetic algorithm, along with a few schemata with even more defining bits. This is also due to the recognizerpopulationsize parameter. Theory predicts that when there are 2^n schemata of length n, a population of size 2^5 = 32 or greater should be able to search these. A population of size 50, where 2^5 < 50 < 2^6, does this, plus finds a few other longer schemata. One of these longer schemata, not in the initial recognizer population, has been created by putting together two shorter schemata that increase in frequency enough to form the longer schema, which then increases in frequency. The shorter schemata decrease in frequency since they have been incorporated into the longer schema. Thus, we see the beginning of a default hierarchy forming [7][9]. First, shorter, more general schemata are tested against the environment, and then the useful ones are used to build longer, more specific schemata that predict the environment (or in this case cover the strings in the pattern population even more accurately). With a larger population, more of the longer schemata should emerge.

5 Conclusion

The recognition and covering constraints are the minimal ones that the current model can satisfy.
It does this by taking two shorter schemata and applying crossover to the strings of those schemata to form a new longer schema. This longer schema is a combination of the two shorter ones and is in a string that matches a pattern at more bits. Thus, a default hierarchy is emerging in the recognizer population. Mutation also can create such longer schemata and fitter strings, but only one more bit will match a pattern. The one additional matching bit does not increase the fitness of the string, nor the number of recognizer strings that match pattern strings, as much as recombination does. However, these results are based on preliminary runs and need to be quantified more precisely before more specific conclusions can be made. There are still many questions to study about this pattern recognition model.

1. How is the default hierarchy of schemata represented in the recognizer string population? Are there recognizer strings that contain several shorter pattern schemata and some recognizer strings with more specific schemata? In terms of knowledge representation [6], is the representation fine-grained with longer, more specific schemata or coarse-grained with more general schemata? Is the entire emerging default hierarchy maintained in the recognizer population?

2. Do distinct subpopulations evolve in the recognizer population with respect to the patterns or other criteria such as regularities in the pattern strings? Does each recognizer string contain schemata for one pattern or for more than one pattern? If the patternpopulationsize is increased, will the system no longer be able to pick out all of the regularities, which may no longer come from one pattern but may include several patterns? How will this depend on the recognizerpopulationsize?

3. How do the schemata in the recognizer population change when the initial pattern population is changed? And if one of the parameter values is changed? What if the fitness function is changed to give maximum fitness value when T bits match?

4. Are matchsample and crowdsample, which maintain diversity in the recognizer population, also resulting in suboptimal fitnesses for the recognizers that cover the patterns? That is, does it become impossible for perfect matches to evolve for each pattern? Is this somehow an adaptive advantage even if it is not a performance advantage? How can their values be set by some feedback mechanism, based on regularities found and the resulting fitness, so that the recognizer population can continue to recognize and cover a pattern population that is changing its size and strings over time?

5. Is the final recognizer population robust to changes in the pattern population? Will the recognizer population fitness decrease quickly and by a large amount if an original pattern is replaced with a new and different pattern string? Will the entire recognizer population have to re-evolve to redistribute the pattern schemata it has, lose the ones no longer in the pattern population, and acquire new ones?

Various tools still need to be built to analyze the evolution of schemata and the default hierarchy in the recognizer population. For instance, since the schema finding algorithm only finds a sample of the schemata present in the pattern population, a reasonably-sized phylogenetic tree can be plotted over time, based on what pattern schemata in the recognizer strings give rise to which new and possibly longer pattern schemata in new recognizer strings. The frequency of each pattern schema in the recognizer population (equal to the number of recognizer strings with that pattern schema) could be plotted over time and used with the phylogenetic tree to trace the development of the default hierarchy. The frequencies of the schemata can also be studied with analytical tools developed for understanding dynamical systems.
Every schema changes its frequency (equal to the number of strings in the population belonging to that schema) in the population with respect to the probability

    P_s(t + 1) = M_s(t + 1)/M = f_s(t) · ε_s · M_s(t)/M,

where schema s has an average expected fitness f_s(t) at time t, M_s(t) is the number of strings in the recognizer population that are elements of schema s, M is the total number of strings in the fixed sized recognizer population, and ε_s is an error term created by having a schema instance get broken up by a crossover point or mutation. The size of the recognizer population and the sample sizes also affect the number of schemata and the number of their instances that are possible to have in the recognizer population. Thus, the schemata dynamics can be studied under different parameter settings for the population and sample sizes, and the effects of fluctuations occurring when new schemata are introduced into the recognizer population by crossover, mutation, or changes in the pattern population can also be studied. Once the structure and evolution of the default hierarchy is better understood, this pattern recognition model can be used to study the structural evolution of memory in other physical and biological systems that do pattern recognition. The features of the particular system can be assigned to the bits of the pattern and recognizer strings. Then one can study how the recognizer strings of that system's memory are structured with respect to the regularities in the patterns the system is trying to recognize. For instance, in modeling the immune system, experimental evidence can be used to set up the fitness function and parameter values to be more realistic with respect to real immune systems, and to see if the model results in antibodies with characteristics similar to those found in the experiments. Some experimental studies have shown that one antibody can recognize more than one antigen (called "multi-specificity" in immunology) [13].
Such an antibody is a generalist. Other evidence shows that when new antigens are introduced, new antibodies are created by an increased rate of point mutations. The affinity of these new antibodies for new antigens increases ten-fold over a short period of time, indicating that the antibodies become very specific at recognizing the antigens [14]. Thus, some sort of default hierarchy is developing in the immune system. What is the nature of this default hierarchy if the fitness function and parameters are set to correspond to actual numbers used and seen in these experiments? What antibodies result if the initial antibodies are ones that have evolved in this pattern recognition model and can only undergo mutation in the future when presented with a new antigen? How might a network of such antibodies compensate for any shortcomings of the antibody population which can no longer undergo crossover? Also, in a letter recognition system, letter patterns can be represented with line features of different lengths, angles, and positions. Could this model then generate a default hierarchy to recognize the letter or letters? How could it recognize the same letters if they are written in a different style? In cognitive science, a concept is represented by a prototype developed from a set of instances. The nature of the prototype with respect to the properties of the instances is not yet clearly understood, but the instances do not necessarily all share some set of property values which might then be the prototype. Rather, each instance shares one or more property values with one or more other instances in the set [11]. Translated into the pattern recognition model, the recognizer population can try to match the instances in the environment, where the prototype would appear as the more general schema that persists over time and is not just used to build one specific schema.
On the other hand, the prototype might turn out to be some piece of the default hierarchy distributed over several recognizer strings. How would this correspond to any experimental data that has been collected by psychologists on the nature of prototypes? What kinds of prototypes are actually communicated in these experiments (i.e. transmitted verbally to the experimenter)?

By understanding the organization of the default hierarchy in the recognizer strings, the memories of other pattern recognition systems evolving under a genetic algorithm can be analyzed as well. A system that tries to recognize patterns in the environment and then builds an internal model or memory of those patterns must use some definition of similarity to cluster the patterns into classes. If this system also transmits its internal model by a code representing how to reconstruct that model, then this system is a good candidate to be cast in and analyzed by this pattern recognition model.

6 Acknowledgements

This work has been performed under the auspices of the U.S. Department of Energy. I would like to thank John H. Holland for his guidance and ideas about genetic algorithms and modelling, Chris Langton for his comments on modelling artificial life, and Alan Perelson, Doyne Farmer, and Norman Packard for their valuable insights into the structure and dynamics of biological and physical systems.

References

[1] Booker, L.B. 1985, "Improving the Performance of Genetic Algorithms in Classifier Systems". Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, Pennsylvania : Carnegie-Mellon University.

[2] DeJong, K.A. 1975, "Analysis of the Behavior of a Class of Genetic Adaptive Systems". PhD Dissertation, Ann Arbor : The University of Michigan.

[3] Farmer, J.D., Kauffman, S.A., and Packard, N.H. 1987, "Autocatalytic Replication of Polymers". Physica 22D, pp. 50-67.
[4] Farmer, J.D., Kauffman, S.A., Packard, N.H., and Perelson, A.S. 1987, "Adaptive Dynamic Networks as Models for the Immune System and Autocatalytic Sets". Annals New York Academy of Sciences.

[5] Farmer, J.D., Packard, N.H., and Perelson, A.S. 1986, "The Immune System, Adaptation, and Machine Learning". Physica 22D, pp. 187-204.

[6] Feldman, J.A. 1986, "Neural Representation of Conceptual Knowledge". TR189, Dept. of Computer Science, Rochester, NY : The University of Rochester.

[7] Goldberg, D.E. 1983, "Computer-aided Gas Pipeline Operation Using Genetic Algorithms and Rule Learning". PhD Dissertation, Ann Arbor : The University of Michigan.

[8] Holland, J.H. 1975, Adaptation in Natural and Artificial Systems. Ann Arbor, Michigan : The University of Michigan Press.

[9] Holland, J.H. 1985, "Escaping Brittleness: The Possibilities of General Purpose Learning Algorithms Applied to Parallel Rule-Based Systems". Machine Learning II, Ch. 20, Los Altos, CA : Morgan Kaufmann.

[10] Jerne, N.K. 1974, "Toward a Network Theory of the Immune System". Annals of Immunology (Inst. Pasteur) 125C : 373.

[11] Kaplan, S. 1986, Neural Models, Course Notes, Ann Arbor : The University of Michigan.

[12] Langton, C.G. 1984, "Self-Reproduction in Cellular Automata". Physica 10D, pp. 135-144.

[13] Rosenstein, R.W., Musson, R.A., Armstrong, M.Y.K., Konigsberg, W.H., and Richards, F.F. 1972, "Contact Regions for Dinitrophenyl and Menadione Haptens in an Immunoglobulin Binding More Than One Antigen". Proceedings of the National Academy of Science, USA, Vol. 69, No. 4, pp. 877-881.

[14] Wysocki, L.J., Manser, T., and Gefter, M.L. 1986, "Somatic Evolution of Variable Region Structures During an Immune Response". Proceedings of the National Academy of Science, USA, Vol. 83, pp. 1847-1851.
An Adaptive Crossover Distribution Mechanism for Genetic Algorithms

J. David Schaffer and Amy Morishima
Philips Laboratories, North American Philips Corporation
Briarcliff Manor, New York

ABSTRACT

This paper presents a new version of a class of search procedures usually called genetic algorithms. Our new version implements a modified string representation that includes special punctuation used by the crossover recombination operator. The idea behind this scheme was abstracted from the mechanics of natural genetics and seems to yield a search procedure wherein the action of the recombination operator can be made to adapt to the search space in parallel with the adaptation of the string contents. In addition, this adaptation happens "for free" in that no additional operations beyond those of the traditional genetic algorithm are employed. We present some empirical evidence that suggests this procedure may be as good as or better than the traditional genetic algorithm across a range of search problems and that its action does successfully adapt the search mechanics to the problem space.

1. Background

A genetic algorithm is an exploratory procedure that is able to locate high performance structures in complex task domains. To do this, it maintains a set (called a population) of trial structures, represented as strings. New test structures are produced by repeating a two-step cycle (called a generation) which includes a survival-of-the-fittest selection step and a recombination step. Recombination involves producing new strings (the offspring) by operations upon one or more previous strings (the parents). The principal recombination operator was abstracted from knowledge of natural genetics and is called crossover. Holland has provided a theoretical explanation for the high performance of such algorithms [5] and this performance has been demonstrated on a number of complex problem domains such as function optimization [3, 4] and machine learning [6, 8, 9].
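The two-step generational cycle described above can be sketched in a few lines. This is a minimal illustration of the traditional scheme (fitness-proportionate selection followed by single-point crossover), not the authors' implementation; all names and parameter values are ours.

```python
import random

# Minimal generational GA: selection proportional to fitness, then
# single-point crossover on consecutive pairs of selected parents.

def evolve(fitness, length=10, popsize=20, generations=50, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(popsize)]
    for _ in range(generations):
        weights = [fitness(s) for s in pop]
        parents = rng.choices(pop, weights=weights, k=popsize)  # selection
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):           # recombination
            cut = rng.randint(1, length - 1)
            nxt.append(a[:cut] + b[cut:])
            nxt.append(b[:cut] + a[cut:])
        pop = nxt
    return max(pop, key=fitness)

best = evolve(lambda s: sum(s))   # toy "ones-counting" fitness
print(sum(best))
```

Even this bare-bones loop (no mutation, no scaling) climbs quickly on an easy unimodal problem, which is the behavior the traditional crossover analysis below takes as its starting point.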
The action of the traditional crossover is illustrated in figure 1. Starting with two strings from the population†, a point is selected between 1 and L-1, where L is the string length. Both strings are severed at this point and the segments to the right of this point are switched. The two starting strings are usually called the parents and the two resulting strings, the offspring. Taking the metaphor one step further, we will call this operation a mating with a single crossover event. In all previous work with which we are familiar, the crossover point is chosen with a uniform probability distribution.

The motivation for our new crossover operator sprang from some properties of this traditional operator. Specifically, it has a known bias against properly sampling structures which contain coadaptive substrings that are far apart. In metaphorical terms, we might call them genes which are far apart on the chromosome. The reason is not difficult to grasp intuitively. The farther apart the genes are, the higher the probability that a uniformly selected random crossover point will fall between them, causing them to be passed to different offspring. We observe that this crossover operator requires only knowledge of the string length; it pays no attention to its contents. Furthermore, its action is nonadaptive. It performs the same way in every generation.

In contrast to this, what we know of Nature's genetic crossover activity suggests that the location of crossover events may be quite sensitive to the contents of the chromosome [1]. There are many activities in this microworld which involve the initiation of an action by the binding of an enzyme to a specific base sequence. We were motivated to design a crossover mechanism which would adapt the distribution of its crossover points by the same survival-of-the-fittest and recombination processes already in place. We reasoned that it should do so by the use of special punctuation marks introduced into the string representation for this purpose.
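The traditional single-point crossover just described can be sketched as follows; a minimal illustration in which the function and variable names are ours, not the paper's.

```python
import random

# Single-point crossover: choose a cut point uniformly between 1 and L-1,
# sever both parents there, and swap the right-hand segments.

def single_point_crossover(mom, dad, rng=random):
    assert len(mom) == len(dad)
    cut = rng.randint(1, len(mom) - 1)   # L-1 possible crossover points
    return mom[:cut] + dad[cut:], dad[:cut] + mom[cut:]

kid1, kid2 = single_point_crossover("0000000", "1111111")
print(kid1, kid2)   # e.g. 0001111 and 1110000 for a cut after locus 3
```

Note that the operator consults only the string length, never the string contents, which is exactly the nonadaptivity the text criticizes.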
The operation envisioned for this new crossover was to proceed by "marking" the site of each crossover event in the string in which it occurred. Thus if the search space had the characteristic that crossovers at particular loci were consistently associated with inferior offspring, then they would die out, taking their markings with them. The converse was also hoped for. If no consistent relation existed to be exploited, then a random allocation of markings was expected. In this, we were reminded of speculation by Lenat that modern gene pools may contain accumulated heuristics to guide genetic search as well as an accumulation of well-adapted expressible genes [7].

† We will show all strings as bit strings, but this is not a requirement imposed by the algorithm.

PARENTS:
  0000000    1111111
  crossover point -- selected randomly with a uniform distribution
OFFSPRING:
  0001111    1110000

Figure 1. The action of the traditional crossover operator. The two parent strings are shown as all zeros and ones for clarity.

The idea of punctuation in the strings used by a genetic algorithm was first proposed by Holland [5], but this proposal was for a different purpose than that proposed here. The rest of this paper will present the mechanics of our new crossover operator, some empirical results indicating that superior performance can indeed be achieved with it, and some data exhibiting properties of its behavior.

2. Mechanics

In this section we will explain the mechanics of operation of our new crossover operator. We address how the crossover punctuation is coded, how an initial distribution is generated, how this distribution affects the crossing over of the functional parts of the strings, how the punctuation marks themselves are passed on to the offspring, and how a linkage is established between individual punctuation marks and the functional substrings with which they are associated. The representation is straightforward.
To the end of each chromosome (string of bits interpretable as a point in the search space) we attach another bit string of the same length. Thus any string representation previously used with a traditional genetic algorithm can be employed in our scheme simply by doubling its length. The bits in the new section are interpreted as crossover punctuation (i.e. 1 = crossover, 0 = no crossover). The loci in the punctuation (second) part of the string correspond one-to-one with the loci in the functional (first) part of the chromosome. Thus it is natural to think of these two parts of the chromosome as interleaved. Punctuation mark i tells whether crossover is or is not to occur at locus i. In figure 2, the punctuation marks are shown as + in the functional string.

Figure 2. The string representation of chromosomes with crossover punctuation marks: an expressible string of length L = 10 followed by a crossover punctuation string of length L = 10, interpreted as an expressible string with a crossover mark at each locus whose punctuation bit is 1.

It is common practice when beginning a genetic search to initiate a population of strings by randomly generating bits with equal probability for zero and one. We follow this practice for the functional string, but the probability of generating a one in the punctuation string is designated P_p and is set externally. The influence of this variable was studied empirically.

Figure 3. The action of punctuated crossover: before crossover, parent strings mom and pop; after crossover, offspring kid1 and kid2 with segments exchanged at the punctuated loci.

The mechanics of crossover governed by these punctuation marks is illustrated in figure 3. The bits from each parent string are copied one-by-one to one of the offspring from left to right. When a punctuation mark is encountered in either parent, the bits begin going to the other offspring (a crossover).
When this happens, the punctuation marks themselves are also passed on to the offspring, just before the crossover takes effect. Thus we may think of the marks as being linked to the functional bit string to the left of its locus. A little experimentation with pencil and paper should serve to give the reader a sense of the redistribution possibilities of this process. Some parental distributions will result in all the punctuation marks being passed to one of the offspring and none to its sibling, while others will redistribute clumped distributions.

When an offspring fails to survive the fitness-based selection step in its generation, its punctuation marks die with it. Thus, the dynamics of the distribution of these marks in the gene pool should reflect an accumulating experience about where is or is not a good place to crossover the genetic material in the pool.

The action of the mutation operator has traditionally been employed as a low level (i.e. small probability) defense against premature convergence of the gene pool. It seems consistent with our metaphor to allow the mutation operator equal access to the entire chromosome, both functional and punctuation parts. A few experiments supported this belief.

3. Empirical Evidence

We selected the task domain of function optimization (minimization) to test the capabilities of this new genetic algorithm, and a set of five scalar-valued functions which has been used in the past to test genetic algorithms. These functions provide a range of characteristics of search problems and are summarized in table 1. Recently Grefenstette has found a configuration of the traditional genetic algorithm which performs consistently better on this function set than any previously known [4]. We will use his results (which we will call BTGA for best traditional genetic algorithm) as a benchmark.
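The punctuated-crossover mechanics of section 2 can be sketched as follows. Because the figure reproductions in our copy are unclear, this encodes one plausible reading of the rules: a mark at locus i is passed to the offspring its parent is currently feeding (it travels with the segment to its left), and then the two parents swap target offspring before copying bit i. Names and data layout are ours.

```python
# Punctuated crossover sketch: each chromosome is (bits, marks) with
# len(bits) == len(marks); marks[i] == 1 requests a crossover at locus i.

def punctuated_crossover(p1, p2):
    (b1, m1), (b2, m2) = p1, p2
    kids = [{"bits": [], "marks": []}, {"bits": [], "marks": []}]
    t1, t2 = 0, 1                # which kid each parent currently feeds
    for i in range(len(b1)):
        if m1[i]:                # mark travels with the segment to its left
            kids[t1]["marks"].append(i)
        if m2[i]:
            kids[t2]["marks"].append(i)
        if m1[i] or m2[i]:       # a crossover event: swap targets
            t1, t2 = t2, t1
        kids[t1]["bits"].append(b1[i])
        kids[t2]["bits"].append(b2[i])
    return kids

kids = punctuated_crossover(("aaaabbbb", [0, 0, 0, 0, 1, 0, 0, 0]),
                            ("ccccdddd", [0] * 8))
print("".join(kids[0]["bits"]), "".join(kids[1]["bits"]))  # aaaadddd ccccbbbb
```

With a single mark at locus 4 in the first parent, the offspring exchange tails there and the mark itself lands in one offspring only, which is what lets selection prune crossover sites that consistently produce inferior offspring.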
TABLE 1
Functions Comprising the Test Environment

Fcn  Dimensions  Space Size    Description
f1   3           1.0 x 10^9    parabola
f2   2           1.7 x 10^7    Rosenbrock's saddle
f3   5           1.0 x 10^15   step function
f4   30          1.0 x 10^72   quadratic with noise
f5   2           1.6 x 10^10   Shekel's foxholes

We adopted Grefenstette's strategy of setting a genetic algorithm to optimize a genetic algorithm (GA) in order to locate a good set of parameter values for our new procedure (a "meta-search"). The meta-level GA was a traditional GA that allowed vector-valued fitness (one dimension for each of the five functions) called VEGA [8]. The parameter set searched at this level included: population size, crossover rate, mutation rate, P_p, and scaling window. The performance measure was online average function value (i.e. an average of all trials in a run of 5000 function evaluations). For more details on these matters the reader is referred to Grefenstette's paper and its predecessors.

Since the performance of BTGA on each of the individual functions was not previously published, we first estimated this by running BTGA on each one five times (n) with different random seeds. The results are given in table 2.

TABLE 2
Performance of Best Traditional Genetic Algorithm on Test Functions

Function  mean    s.d.   n  global optimum
f1        1.664   .1960  5    0
f2        25.16   4.497  5    0
f3        -27.78  -      5  -30
f4        24.28   1.383  5    0
f5        30.78   2.148  5   ~1

The results of the meta-search revealed that the genetic algorithm with punctuated crossover (GAPC) performed well for the whole set of functions at the following parameter settings: population size = 40, crossover rate = 0.45, mutation rate = 0.002, P_p = 0.04, and scaling window = 3. Estimates of the performance of GAPC comparable to those for BTGA are given in table 3, along with results of two-tailed t-tests of the mean differences between them.

TABLE 3
Performance of Genetic Algorithm with Punctuated Crossover on Test Functions and a comparison with the Best Traditional Genetic Algorithm
Function  mean    s.d.   n  t-test  significance
f1        1.111   .1706  5   4.84   .01
f2        17.22   3.279  5   3.19   .05
f3        -27.90  .5907  5   0.12   ns
f4        20.32   1.320  5   4.63   .01
f5        14.81   1.48   5  13.71   .001

These results clearly show the superiority of GAPC. It statistically outperforms BTGA on four of the five functions and is no worse on the other (f3). We believe the reason for this latter result lies in a floor effect. Both GAs find good solutions very quickly on f3, so that the online average rapidly approaches the global optimum. There is simply insufficient room for improvement to allow for a statistical difference. We believe the same explanation applies to the finding that no significant differences were noted between BTGA and GAPC when offline average was used as a criterion.

4. Characterization of GAPC Search

There are a number of questions about the realization of the performance characteristics we envisioned when this scheme was designed. Specifically, does the distribution of crossover marks adapt to the task environment in a meaningful way, is the process stable (i.e. does the number of punctuation marks in the population tend to vanish or saturate), and how many crossovers per mating does it settle upon (if it does settle)? In this section we present some evidence related to these questions that was collected while monitoring the searches reported above.

We define the population distribution of punctuation marks at time t as the sum of the punctuation bits at each locus l across all the individuals in the population:

    p_xo(l, t) = sum_{i=1}^{popsize} punct(i, l, t)    (2)

The total number of punctuation marks in the population is then

    T_xo(t) = sum_{l=1}^{L} p_xo(l, t)    (3)

Figure 4 shows a time history of p_xo(l, t) for 200 generations of one search of function f1. The x-axis is chromosome locus and the y-axis is both p_xo(l, t) and time (generations). The vertical separation between successive generations is equal to popsize (i.e.
the number of individuals in the population), so that a locus at which every member of the population has a punctuation mark will appear as a peak which just reaches the baseline of the line above. This figure shows that the initial distribution is flat, since the random initialization of the punctuation marks does not favor one locus over any other. As time progresses, however, some loci tend to accumulate more punctuation marks than others. The location of these concentrations changes with time; a peak may appear and remain prominent for some generations only to die out as others emerge. The distribution of punctuation marks does indeed seem to adapt as the gene pool adapts.

Figure 4. A time history of the distribution of punctuation marks for one run on f1 (two panels plotting the frequency of crossover marks against chromosome position over successive generations).

Figure 5 shows a plot of T_xo(t) for the same run. Also shown are max_l p_xo(l, t) and min_l p_xo(l, t). Although the total seems to be growing with no sign of stabilizing, the population is far from saturated†: the minimum remains zero until the last few generations, and the maximum concentration at any one locus never exceeds 30 out of the possible 40 (i.e. popsize).

Figure 5. How the numbers of punctuation marks change with time (population total, maximum, and minimum crossovers per locus). These data were from one search on f1, and were typical.

Figure 6 shows the average number of crossover events occurring per mating. This simple counting can be misleading, however, since the gene pool is converging as the generations progress. Note that when a crossover event swaps gene segments which are identical, it is an unproductive crossover event.
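The statistics of equations (2) and (3), the per-locus and total punctuation-mark counts, are simple sums and can be sketched directly; the function names follow the reconstructed notation p_xo and T_xo.

```python
# Per-locus and total punctuation-mark counts over a population of
# punctuation bit strings (one list of 0/1 bits per individual).

def p_xo(punct_pop):
    """Equation (2): sum the punctuation bits at each locus l."""
    length = len(punct_pop[0])
    return [sum(ind[l] for ind in punct_pop) for l in range(length)]

def T_xo(punct_pop):
    """Equation (3): total number of punctuation marks in the population."""
    return sum(p_xo(punct_pop))

pop = [[1, 0, 0, 1],
       [0, 0, 1, 1],
       [1, 0, 0, 0]]
print(p_xo(pop))   # [2, 0, 1, 2]
print(T_xo(pop))   # 5
```

Saturation in this representation would mean every entry of p_xo equals the population size.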
When these events are discounted, we were surprised to see that the number of "productive" crossover events per mating remained nearly constant. What is more, the level at which this statistic holds appears to correlate strongly with L. See table 4. These results are not dissimilar to results reported by DeJong when experimenting with multiple crossover points [2].

TABLE 4
Time-averaged "Productive" Crossovers per Mating

Function  chromosome length  crossovers
f1        30                 1.49
f2        24                 0.87
f3        50                 2.02
f4        240                8.64
f5        32                 1.65

† Saturation would mean a punctuation bit at every one of the popsize x L (40 x 30 = 1200) possible locations.

Figure 6. Total and "productive" crossover events per mating for one run on f1.

5. Conclusions

We have described a modified knowledge representation and crossover operator for use with genetic search. Its design was driven by intuition abstracted from Nature's mechanisms of crossover during meiosis. Experiments indicate that it performs as well as or better than a traditional GA for a set of test problems that exhibits a range of search space properties. Experiments on other test problems are continuing. The distribution of crossover events evolves as the search progresses, and the statistics of "productive" crossover events per mating indicate steady search effort even in the face of a converging gene pool. These statistics seem to correlate with chromosome length and are consistent with previous results. We remain cautiously optimistic that continued experimentation will strengthen these conclusions and will lead to a robust approach to adaptive knowledge representation.

Acknowledgement

We wish to acknowledge the valuable contributions of D. Paul Benjamin to the conception of this project and to discussions of its implications.

References

1. B. Alberts, D. Bray, J.
Lewis, M. Raff, K. Roberts and J. D. Watson, Molecular Biology of the Cell, Garland Publishing, Inc., New York, 1983.

2. K. A. De Jong, Analysis of the Behavior of a Class of Genetic Adaptive Systems, Ph.D. Thesis, Department of Computer and Communication Sciences, University of Michigan, 1975.

3. K. A. De Jong, Adaptive System Design: A Genetic Approach, IEEE Transactions on Systems, Man & Cybernetics SMC-10, 9 (September 1980), 566-574.

4. J. J. Grefenstette, Optimization of Control Parameters for Genetic Algorithms, IEEE Transactions on Systems, Man & Cybernetics SMC-16, 1 (January-February 1986), 122-128.

5. J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975.

6. J. H. Holland and J. S. Reitman, Cognitive Systems Based on Adaptive Algorithms, in Pattern-Directed Inference Systems, D. A. Waterman and F. Hayes-Roth (editors), Academic Press, New York, NY, 1978.

7. D. B. Lenat, The Role of Heuristics in Learning by Discovery: Three Case Studies, in Machine Learning, R. S. Michalski, J. G. Carbonell and T. M. Mitchell (editors), Tioga, Palo Alto, CA, 1983.

8. J. D. Schaffer, Some Experiments in Machine Learning Using Vector Evaluated Genetic Algorithms, Ph.D. Thesis, Department of Electrical Engineering, Vanderbilt University, December 1984.

9. S. F. Smith, Flexible Learning of Problem Solving Heuristics Through Adaptive Search, 8th International Joint Conference on Artificial Intelligence, Karlsruhe, Germany, August 1983.
GENETIC ALGORITHMS WITH SHARING FOR MULTIMODAL FUNCTION OPTIMIZATION

David E. Goldberg, The University of Alabama, Tuscaloosa, AL 35487
and
Jon Richardson, The University of Tennessee (formerly at The University of Alabama), Knoxville, TN 37996

ABSTRACT

Many practical search and optimization problems require the investigation of multiple local optima. In this paper, the method of sharing functions is developed and investigated to permit the formation of stable subpopulations of different strings within a genetic algorithm (GA), thereby permitting the parallel investigation of many peaks. The theory and implementation of the method are investigated and two one-dimensional test functions are considered. On a test function with five peaks of equal height, a GA without sharing loses strings at all but one peak; a GA with sharing maintains roughly equally sized subpopulations clustered about all five peaks. On a test function with five peaks of different sizes, a GA without sharing loses strings at all but the highest peak; a GA with sharing allocates decreasing numbers of strings to peaks of decreasing value, as predicted by theory.

INTRODUCTION

Genetic algorithms (GAs) are finding increasing application in a variety of problems across a spectrum of disciplines (Goldberg & Thomas, 1986). This is so because GAs place a minimum of requirements and restrictions on the user prior to engaging the search procedure. The user simply codes the problem as a finite length string, characterizes the objective (or objectives) as a black box, and turns the GA crank. The genetic algorithm then takes over, seeking near-optima primarily through the combined action of reproduction and crossover. These so-called simple GAs have proved useful in many problems despite their lack of sophisticated machinery and despite their total lack of knowledge of the problem they are solving.
As their usage has grown, several objections to their performance have arisen. Simple GAs have been criticized for sub-par performance on multimodal (multiply-peaked) functions. They have also been criticized for so-called premature convergence, where substantial fixation occurs at most bit positions before obtaining sufficiently near-optimal points (Cavicchio, 1970; De Jong, 1975; Mauldin, 1984; Baker, 1985).

In this paper, we examine the first of these maladies and propose a cure borrowed from nature. In particular, our herbal remedy causes the formation of niche-like and species-like subdivision of the environment and population through the imposition of sharing functions. These sharing functions help mitigate unbridled head-to-head competition between widely disparate points in a search space. This reduction in competition between distant points thereby permits better performance on multimodal functions. As a side benefit we find that sharing helps maintain a more diverse population and more considered (and less premature) convergence. In the remainder of this paper, we review the problem and past efforts to solve it; we consider the theory of niche and speciation through Holland's modified two-armed bandit problem; and we compare the performance of a genetic algorithm both with and without the sharing function feature. Finally we examine extensions of the sharing function idea to permit its implementation in a wide array of problems.

MULTIMODAL OPTIMIZATION, GENETIC DRIFT, AND A SIMPLE GA

The difficulty posed by a multimodal problem for a simple genetic algorithm may be illustrated by a straightforward example. Figure 1 shows a bimodal function of a single variable, f(x) = 4(x - 0.5)^2, coded by a normalized, five-bit binary integer. In this problem, the two optima are located at extreme ends of the one-dimensional space.
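The five-bit coding can be enumerated directly to see the two equal peaks. The printed equation is garbled in our copy; f(x) = 4(x - 0.5)^2 is one reconstruction consistent with equal optima at the extreme codings 00000 and 11111, and the code below assumes it.

```python
# Enumerate the normalized five-bit coding of [0, 1] and evaluate the
# (reconstructed) equal-peaks function; the two best strings should be
# the extreme codings 00000 and 11111, as the text describes.

def decode(bits):
    """Map a 5-bit string to x in [0, 1] (normalized binary integer)."""
    return int(bits, 2) / 31.0

def f(x):
    return 4.0 * (x - 0.5) ** 2   # assumed form of the garbled equation

values = {format(i, "05b"): f(decode(format(i, "05b"))) for i in range(32)}
top2 = sorted(values, key=values.get, reverse=True)[:2]
print(top2)   # ['00000', '11111'] (the two equal peaks, f = 1.0 at each)
```

Since the peaks are exactly equal, neither schema 00*** nor 11*** has any selective advantage, which is the setting in which genetic drift decides the outcome.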
If we start a genetic algorithm with a population chosen initially at random and let it run for a large number of generations, our fondest hope is that stable subpopulations cluster about the two optima (about 00000 and 11111). In fact, if we perform this experiment, we find that the simple GA eventually clusters all of its points about one peak or the other.

Why does this happen? After all, doesn't the fundamental theorem of genetic algorithms (Holland, 1975; De Jong, 1975; Goldberg, 1986) tell us that exponentially increasing numbers of trials will be given to the observed best schemata? Yes it does, but the theorem assumes an infinitely large population size. In a finite size population, even when there is no selective advantage for either of two competing alternatives (as is the case for schemata 11*** and 00*** in the example problem), the population will converge to one alternative or the other in finite time (De Jong, 1975; Goldberg & Segrest, this volume).

Figure 1. Bimodal function with equal peaks.

This problem of finite populations is so important that geneticists have given it a special name, genetic drift. Stochastic errors tend to accumulate, ultimately causing the population to converge to one alternative or another.

The convergence toward one optimum or another is clearly undesirable in the case of peaks of equal value. In multimodal problems where peaks of different altitudes exist, the desirability of convergence to the globally best peak is not so clear cut. In Figure 2 we see a bimodal function with unequal peaks, f(x) = 2.8(x - 0.6)^2, with a five-bit normalized coding. If we are interested in obtaining only the global optimum, we should not mind the eventual convergence of the population to the leftmost point; however, this convergence is not always guaranteed. Small initial populations may allow sampling errors which overestimate the schemata of the rightmost points, thereby permitting convergence to the wrong peak.
Furthermore, in real world optimization we are often interested in having information about good, better, and best solutions. When this is so, it might be nice to see a form of convergence that permits stable subpopulations of points to cluster about both peaks according to peak fitness.

In either of these cases, we can argue for more controlled competition and less reckless convergence than is possible when we work with a simple, tripartite (reproduction, crossover, and mutation) genetic algorithm. For these reasons we turn to the theory of niche and speciation to find an appropriate model for naturally regulated competition.

Figure 2. Bimodal function with unequal peaks.

THEORY OF SPECIES AND NICHE

The results of our initial Gedankenexperimente (thought experiments) with simple GAs and multimodal functions are somewhat perplexing when juxtaposed with natural example. In our problem with equal peaks, the simple GA converges on one peak or the other even though both peaks are equally useful. By contrast, why doesn't nature converge to a single species? In our second problem with unequal peaks, we notice that the simple GA again converges to one peak, usually--but not always--the "correct" peak. How, when faced with a somewhat less fit species, does nature choose to limit population size before resorting to extinction? In both cases, nature has found a way to combat unbridled competition and permit the formation of stable subpopulations. In nature, different species don't go head to head. Instead they exploit separate niches (sets of environmental features) in which other organisms have little or no interest. In this section, we need to bridge the gap between natural example and genetic algorithm practice through the application of some useful theory.

Although there is a well-developed biological literature in both niche and speciation, its transfer to the arena of GA search has been limited.
Like many other concepts and operators, the first theories directly applicable to artificial genetic search are due to Holland (1975). To illustrate niche and species, Holland introduces a modification of the two-armed bandit problem with distributed payoff and sharing. Let's examine his argument with a concrete formulation of the same problem.

Imagine a two-armed bandit as depicted in Figure 3. In the ordinary two-armed bandit problem (Holland, 1975; De Jong, 1975), we have two arms, a left arm and a right arm, and we have different payoffs associated with each arm. Suppose we have an expected payoff associated with the left arm of $25 and an expected payoff associated with the right arm of $75; in the standard two-armed bandit, we are unaware initially which arm pays the higher amount, and our dilemma is to minimize our expected losses over some number of trials.

Figure 3. Sketch of the two-armed bandit with queues and sharing.

In this form, the two-armed bandit problem puts the tradeoffs between exploration and exploitation in sharp perspective. We can take extra time to experiment, but in so doing risk the possible gain from choosing the right arm, or we can experiment briefly and risk making an error once we choose the arm we think is best. In this way, the two-armed bandit has been used to justify the allocation strategy adopted by the reproductive plans of simple genetic algorithms.

This is not our purpose here. Instead we examine the modified two-armed bandit problem to put the concepts of niche and species in sharper focus. In the modified problem, we further suppose that we have a population of some number of players, say 100 players, and that each player may decide to play one arm or the other. If at this point we do nothing else, we simply create a parallel version of the original two-armed bandit problem, where we expect that all players eventually line up behind the observed best (and actual best) arm.
To produce the subdivision of species and niche, we introduce an important rule change. Instead of allowing a full measure of payoff for each individual, individuals who choose a particular arm are now forced to share the wealth derived from that arm with other players queued up at the arm. At first glance, this change appears to be quite minor. In fact, this single modification causes a strikingly and surprisingly different outcome in the modified two-armed bandit. To see why and how the results change, we first recall that despite the different rules of the game, we still allocate population members according to payoff. In the modified game, an individual will receive a payoff which depends on the arm payoff value and the number of individuals queued up at that arm. In our concrete example, an individual lined up behind the right arm when all individuals are lined up behind that same arm receives an amount $75/100 = $0.75. On the other hand, an individual lined up behind the left arm when all individuals are queued there receives $25/100 = $0.25. In both cases, there is motivation for some individuals to shift lines. In the first case, a single individual changing lines stands to gain an amount $25.00 - $0.75 = $24.25. The motivation to shift lines is even stronger in the second case. At some point in between we should expect there to be no further motivation to shift lines. This will occur when the individual payoffs are identical for both lines.
If N is the population size, m_right and m_left are the number of individuals behind the right and left queues, and f_right and f_left are the expected payoff values from the right and left arms respectively, the equilibrium point may be calculated as follows:

f_right / m_right = f_left / m_left

In our example, this complete equalization of individual payoff occurs when 75 players select the right arm and 25 players select the left arm, because $75/75 = $25/25 = $1. This problem may be extended to the k-armed case directly, and the extension does not change the fundamental conclusions at all: the system attains equilibrium when the ratios of arm payoff to queue length are equal (Holland, 1975). The incorporation of forced sharing causes the formation of stable subpopulations (species) behind different arms (niches) in the problem. Furthermore, the number of individuals devoted to each niche is proportional to the expected niche payoff. This is exactly the type of solution we had hoped for when we considered the bimodal problems of Figures 1 and 2. Of course the extension of the sharing concept to real genetic algorithm search is more difficult than the idealized case implies. In a real genetic algorithm there are many, many arms, and deciding who should share and how much should be shared becomes a non-trivial question. In the next section we will examine a number of current efforts to induce niche and species through indirect or direct sharing.

A BRIEF REVIEW OF CURRENT SCHEMES

A number of methods have been implemented to induce niche and species in genetic algorithms. In some of these techniques the sharing comes about indirectly. Although the two-armed bandit problem is a nice, simple abstract model of niche and species formation and maintenance, nature is not so direct in divvying up her bounty. In natural settings, sharing comes about through crowding and conflict.
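The equal-ratio equilibrium can be illustrated with a small simulation. The following sketch is not from the paper: players greedily switch to whichever queue would pay them best, and the dynamics settle where the payoff-to-queue-length ratios are equal.

```python
# Sketch (not from the paper): greedy best-response dynamics for the
# shared-payoff bandit. Each player's payoff is arm payoff divided by
# queue length; players switch queues while a switch improves payoff.
def equilibrium(payoffs, n_players):
    counts = [n_players] + [0] * (len(payoffs) - 1)  # start all on arm 0

    while True:
        def gain(i):
            # payoff an individual would receive after joining arm i
            return payoffs[i] / (counts[i] + 1)

        moved = False
        for src in range(len(payoffs)):
            if counts[src] == 0:
                continue
            current = payoffs[src] / counts[src]
            best = max(range(len(payoffs)), key=gain)
            if src != best and gain(best) > current:
                counts[src] -= 1          # one player switches queues
                counts[best] += 1
                moved = True
                break
        if not moved:                     # no profitable switch remains
            return counts

print(equilibrium([75.0, 25.0], 100))     # → [75, 25]
```

As the text predicts, the 100 players split 75/25 across the $75 and $25 arms, equalizing individual payoffs at $1.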
When a habitat becomes fairly full of a particular organism, individuals are forced to share available resources. Cavicchio's (1971) dissertation study was one of the first to attempt to induce niche-like and species-like behavior in genetic algorithm search. Specifically, he introduced a mechanism he called preselection. In this scheme, an offspring replaces the inferior parent if the offspring's fitness exceeds that of the inferior parent. In this way diversity is maintained in the population because strings tend to replace strings similar to themselves (one of their parents). Cavicchio claimed to maintain more diverse populations in a number of simulations with relatively small population sizes (n = 20). De Jong (1975) has generalized preselection in his crowding scheme. In De Jong crowding, individuals replace existing strings according to their similarity with other strings in an overlapping population. Specifically, an individual is compared to each string in a randomly drawn subpopulation of CF (crowding factor) members. The individual with the highest similarity (on the basis of bit-by-bit similarity count) is replaced by the new string. Early in the simulation, this amounts to random selection of replacements because all individuals are likely to be equally dissimilar. As the simulation progresses and more and more individuals in the population are similar to one another (one or more species have gotten a substantial foothold in the population), the replacement of individuals by similar individuals tends to maintain diversity within the population and reserve room for one or more species. De Jong has had success with the crowding scheme on multimodal functions when he used crowding factors CF=2 and CF=3. De Jong's crowding has subsequently been used in a machine learning application (Goldberg, 1983).
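The crowding replacement step described above can be sketched as follows. This is a minimal illustration, assuming individuals are equal-length bit strings; it is not De Jong's original implementation.

```python
import random

def crowding_replace(population, offspring, crowding_factor=2):
    """De Jong-style crowding (a sketch): the offspring replaces the most
    similar member of a randomly drawn subpopulation of CF individuals,
    where similarity is a bit-by-bit match count."""
    def similarity(a, b):
        return sum(x == y for x, y in zip(a, b))

    # draw CF candidate indices at random from the population
    candidates = random.sample(range(len(population)), crowding_factor)
    # replace the candidate most similar to the new string
    victim = max(candidates, key=lambda i: similarity(population[i], offspring))
    population[victim] = offspring
```

With CF equal to the population size, an offspring identical to an existing member simply replaces that member, which is how crowding preserves room for distinct species.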
Booker (1982) discusses a direct application of the sharing idea in a machine learning application with genetics-based classifier systems. In classifier systems, a sub-goal reward mechanism called a bucket brigade passes reward through a network of rules like money passing through an economy. Booker suggests that appropriately sized subpopulations of rules can form in such systems if related rules are forced to share payments. This idea is sound and has been forcefully demonstrated in Wilson's recent work with boolean function learning (Wilson, 1986); however, it does not transfer well to function optimization, because unlike classifier systems, there is no general way in function optimization to determine which strings are related. Schaffer (1984) has used separate, fixed size subpopulations in his study of vector evaluated genetic algorithms (VEGA). In this study, each component of the vector (each criterion or objective measure) is mapped to its own subpopulation where separate reproduction processes are carried out. The method has worked well in a number of trial functions; however, Schaffer has expressed some concern over the procedure's ability to handle middling nondominated individuals--individuals that may be Pareto optimal but are not extremal (or even near extremal) along any single dimension. Furthermore, although the study does use separate subpopulations, it is unclear how the same method might be applied to the more usual single-criterion optimization problem. A direct exploration of biological niche theory in the context of genetic algorithms is contained in Perry's (1985) dissertation. In this work, Perry defines a genotype-to-phenotype mapping, a multiple-resource environment, and a special entity called an external schema. External schemata are special similarity templates defined by the simulation designer to characterize species membership.
Unfortunately, the required intervention of an outside agent limits the practical use of this technique in artificial genetic search. Nonetheless, the reader interested in the connections between biological niche theory and GAs may be interested in this work. Grosso (1985) also maintains a biological orientation in his study of explicit subpopulation formation and migration operators. Multiplicative, heterotic (problems with diploid structures where a heterozygote is more highly fit than the homozygote) objective functions are used in this study, and as such, the results are not directly applicable to most artificial genetic search; however, Grosso was able to show the advantage of intermediate migration rate values over either isolated subpopulations (no migration) or panmictic (completely mixed) subpopulations. This study suggests that the imposition of a geography within artificial genetic search may be another useful way of assisting the formation of diverse subpopulations. Further studies are needed to determine how to do this in more general artificial genetic search applications. Although he has not directly addressed niche and species, Mauldin (1984) has attempted to better maintain diversity in genetic algorithms through his uniqueness operator. The uniqueness operator arbitrarily returns diversity to a population whenever it is judged to be lacking. To implement uniqueness, Mauldin defines a uniqueness parameter k that may decrease with time (similar to the cooling of simulated annealing). He then requires that for insertion in a population, an offspring must be different from every population member at a minimum of k loci. If the offspring is not sufficiently different, it is mutated until it is. By itself, uniqueness is little more than a somewhat knowledgeable (albeit expensive) mutation operator. That it is useful in improving offline (convergence) performance is not unexpected.
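The uniqueness test-and-mutate loop just described can be sketched as follows. This is an illustrative reading of Mauldin's operator over bit strings, not his original code.

```python
import random

def enforce_uniqueness(offspring, population, k):
    """Mauldin-style uniqueness (a sketch): mutate the offspring until it
    differs from every population member in at least k bit positions."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    bits = list(offspring)
    # keep flipping random bits until the offspring is sufficiently unique
    while any(hamming(bits, member) < k for member in population):
        i = random.randrange(len(bits))
        bits[i] = '1' if bits[i] == '0' else '0'
    return ''.join(bits)
```

Each loop iteration is one mutation, so the cost grows with how crowded the population already is, which is why the text calls uniqueness an expensive mutation operator.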
Grefenstette (1986) has recently supported the notion of fairly high mutation probabilities (p_m = 0.01 to 0.1) when convergence to the best is the main goal. It is interesting to note that uniqueness combined with De Jong's crowding scheme worked better than either operator by itself (Mauldin, 1984). This result suggests that maintaining diversity for its own sake is not the issue. Instead, we need to maintain appropriate diversity--diversity that in some way helps cause (or has helped cause) good strings. In the next section, we show how we can maintain appropriate diversity through the use of sharing functions.

SHARING FUNCTIONS

In attempting to induce species, we must either directly or indirectly cause intraspecies sharing, but we are faced with two important questions: who should share, and how much should be shared? In natural systems, these two questions are answered implicitly through conflict for finite resources. Different species find different combinations of environmental factors--different niches--which are relatively uninteresting to other species. Individuals of the same species use those resources until there is conflict. At that point, they vie for the same turf, food, and other environmental resources, and the increased competition and conflict cause individuals of the same species to share with one another, not out of altruism, but because the resources they give up are not worth the cost of the fight. It might be possible to induce similar conflict for resources in genetic optimization. Unfortunately, in many optimization problems, there is no natural definition of a resource. As a result, we must invent some way of imposing niche and speciation on strings based on some measure of their distance from each other. We do just this with what we have called a sharing function.
A sharing function is nothing more than a way of determining the degradation of an individual's payoff due to a neighbor at some distance as measured in some similarity space. Mathematically, we introduce a convenient metric d over our decoded parameters x (the decoded parameters are themselves functions of the strings, x_i = x_i(s_i)):

d_ij = d(x_i, x_j)

Alternatively, we may introduce a metric over the strings directly:

d_ij = d(s_i, s_j)

In this paper, we use a metric defined over the decoded parameters x (phenotypic sharing); later on, we briefly consider the use of metrics defined over the strings (genotypic sharing). However we choose a metric, we define a sharing function sh as a function of the metric value, sh = sh(d), with the following three properties:

1. 0 <= sh(d) <= 1 for all d;
2. sh(0) = 1;
3. sh(d) approaches 0 as d becomes large.

Table 2. Trial solutions from the preceding paper: x = theoretical 'nearest' trial solution; † = SGA; ‡ = MC; †† = 10^7 points; ††† = best MC; A = ARGOT.

Figure 4. Roving boundaries and best estimates for the two bimodal parameters. Note that the best estimates of the parameters, the bold lines, undergo early 'switching' between the two possible global solutions.
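The sharing-function idea described above can be sketched concretely. In the sketch below, the linear sharing function and the niche radius `sigma_share` are illustrative choices consistent with the three properties, and dividing each payoff by its niche count is one way to implement the degradation; none of these particular choices should be read as the paper's exact formulation.

```python
# A sketch of phenotypic sharing over one decoded parameter x.
# sh(d) = 1 - d/sigma_share for d < sigma_share, else 0: it satisfies
# sh(0) = 1, 0 <= sh <= 1, and sh -> 0 at large distance.
def sh(d, sigma_share=0.1):
    return max(0.0, 1.0 - d / sigma_share)

def shared_fitness(xs, fitnesses, sigma_share=0.1):
    """Degrade each individual's payoff by its niche count, the summed
    sharing-function values against every population member (itself
    included, contributing sh(0) = 1)."""
    out = []
    for x, f in zip(xs, fitnesses):
        niche_count = sum(sh(abs(x - y), sigma_share) for y in xs)
        out.append(f / niche_count)
    return out
```

Two individuals sitting on the same peak split that peak's payoff between them, while a lone individual on a distant peak keeps its full payoff, reproducing the queue-sharing behavior of the modified bandit.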
Figure 2. Roving boundaries and best estimates for the 'broad' and 'sharp' parameters. Note the differing scales of the two graphs and how the broad parameter has very large roving boundaries while the sharp parameter has extremely narrow roving boundaries.

NONSTATIONARY FUNCTION OPTIMIZATION USING GENETIC ALGORITHMS WITH DOMINANCE AND DIPLOIDY

David E. Goldberg and Robert E. Smith
Department of Engineering Mechanics
The University of Alabama
Tuscaloosa, AL 35487

ABSTRACT

This paper investigates the use of diploid representations and dominance operators in genetic algorithms (GAs) to improve performance in environments that vary with time. The mechanics of diploidy and dominance in natural genetics are briefly discussed, and the usage of these structures and operators in other GA investigations is reviewed. An extension of the schema theorem is developed which illustrates the ability of diploid GAs with dominance to hold alternative alleles in abeyance. Both haploid and diploid GAs are applied to a simple time-varying problem, an oscillating, blind knapsack problem. Simulation results show that a diploid GA with an evolving dominance map adapts more quickly to the sudden changes in this problem environment than either a haploid GA or a diploid GA with a fixed dominance map. These proof-of-principle results indicate that diploidy and dominance can be used to induce a form of long term distributed memory within a population of structures.

INTRODUCTION

Real world problems are seldom independent of time. If you don't like the weather, wait five minutes and it will change.
If this week gasoline costs $1.30 a gallon, next week it may cost $0.89 a gallon or perhaps $2.53 a gallon. In these and many more complex ways, real world environments are both nonstationary and noisy. Searching for good solutions or good behavior under such conditions is a difficult task; yet, despite the perpetual change and uncertainty, all is not lost. History does repeat itself, and what goes around does come around. The horrors of Malthusian extrapolation rarely come to pass, and solutions that worked well yesterday are at least somewhat likely to be useful when circumstances are somewhat similar tomorrow or the day after. The temporal regularity implied in these observations places a premium on search augmented by selective memory. In other words, a system which does not learn the lessons of its history is doomed to repeat its mistakes. In this paper, we investigate the behavior of a genetic algorithm augmented by structures and operators capable of exploiting the regularity and repeatability of many nonstationary environments. Specifically, we apply genetic algorithms that include diploid genotypes and dominance operators to a simple nonstationary problem in function optimization: an oscillating, blind knapsack problem. In doing this, we find that diploidy and dominance induce a form of long term distributed memory that stores and occasionally remembers good partial solutions that were once desirable. This memory permits faster adaptation to drastic environmental shifts than is possible without the added structures and operators. In the remainder of this paper, we explore the mechanism, theory, and implementation of dominance and diploidy in artificial genetic search. We start by examining the role of diploidy and dominance in natural genetics, and we briefly review examples of their usage in genetic algorithm circles. We extend the schema theorem to analyze the effect of these structures and mechanisms.
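The oscillating, blind knapsack environment mentioned above can be sketched in miniature. The five-object data, the oscillation period, and the penalty for overweight solutions below are all hypothetical stand-ins, not the paper's 17-object problem.

```python
# Sketch of an oscillating, blind 0-1 knapsack objective. The weight
# limit switches periodically between two values, so the best packing
# changes with time. Data and penalty form are illustrative assumptions.
VALUES  = [10, 7, 5, 3, 1]
WEIGHTS = [ 9, 6, 4, 3, 2]
LIMITS  = (10, 18)            # the environment oscillates between these

def fitness(bits, generation, period=15):
    limit = LIMITS[(generation // period) % 2]
    value  = sum(v for v, b in zip(VALUES, bits) if b)
    weight = sum(w for w, b in zip(WEIGHTS, bits) if b)
    if weight > limit:
        # "blind": the GA only sees a degraded payoff, not the limit
        return max(0, value - 3 * (weight - limit))
    return value
```

A solution that is excellent under one limit becomes heavily penalized after the switch, which is exactly the setting where a diploid GA's held-in-abeyance alternative can pay off.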
We present results from computational experiments on a 17-object, oscillating, blind 0-1 knapsack problem. Simulations with adaptive dominance maps and diploidy are able to adapt more quickly to sudden environmental shifts than either a haploid genetic algorithm or a diploid genetic algorithm with a fixed dominance map. These results are encouraging and suggest the investigation of dominance and diploidy in other GA applications in search and machine learning.

THE MECHANICS OF NATURAL DOMINANCE AND DIPLOIDY

It is surprising to some genetic algorithm newcomers that the most commonly used GA is modeled after the mechanics of haploid genetics. After all, don't most elementary genetics textbooks start off with a discussion of Mendel's pea plants and some mention of diploidy and dominance? The reason for this disparity between genetic algorithm practice and genetics textbook coverage is due to the success achieved by early GA investigators (Hollstien, 1971; De Jong, 1975) using haploid chromosome models on stationary problems. It was found that surprising efficacy and efficiency could be obtained using single-stranded (haploid) chromosomes under the action of reproduction and crossover. As a result, later investigators of artificial genetic search have tended to ignore diploidy and dominance. In this section we examine the mechanics of diploidy and dominance to understand their roles in shielding alternate solutions from excessive selection. Most studies of genetic algorithms to date have considered only the simplest genotype found in nature, the haploid or single-stranded chromosome. In this simple model, a single-stranded string contains all the information relevant to the problem we are considering. While nature contains many haploid organisms, most of these tend to be relatively uncomplicated life forms.
It seems that when nature wanted to build more complex plant and animal life it had to rely on a more complex underlying chromosomal structure, the diploid or double-stranded chromosome. In the diploid form, a genotype carries a pair of chromosomes (called homologous chromosomes), each containing information for the same functions. At first, this redundancy seems unnecessary and confusing. After all, why keep around pairs of genes which decode to the same function? Furthermore, when the pair of genes decode to different function values, how does nature decide which allele to pay attention to? To answer these questions, let's consider a diploid chromosomal structure where we use different letters to represent different alleles (different gene function values):

AbCDe
aBCde

At each position (locus) we have used the capital form or the lower case form of a particular letter to represent alternative alleles at that position. In nature, each allele might represent a different phenotypic characteristic (or have some nonlinear or epistatic effect on one or more phenotypic characteristics). For example, the B allele might be the brown-eyed gene and the b allele might be the blue-eyed gene. Although this scheme of thinking is not much different from the haploid (single-stranded) case, one difference is clear. Because we now have a pair of genes describing each function, something must decide which of the two values to choose because, for example, the phenotype cannot have both brown and blue eyes at the same time (unless we consider, as nature sometimes does, the possibility of intermediate forms, but we shall not concern ourselves with that possibility here). The primary mechanism for eliminating this conflict of redundancy is through an operator which geneticists have called dominance. At a particular locus, it has been observed that one allele (the dominant allele) takes precedence (dominates) over the other alternative alleles (the recessives) at that locus.
More specifically, an allele is dominant if it is expressed (it shows up in the phenotype) when paired with some other allele. In our example above, if we assume that all capital letters are dominant alleles and all lower case letters are recessive, the phenotype expressed by the example chromosome pair may be written:

AbCDe
aBCde  -->  ABCDe

At each locus we see that the dominant gene is always expressed and that the recessive gene is only expressed when it shows up in the company of another recessive. In the geneticist's parlance, we say that the dominant gene is expressed whether heterozygous (mixed, Aa --> A) or homozygous (pure, CC --> C) and the recessive allele is expressed only when homozygous (ee --> e). The mechanics of diploidy and dominance seem relatively clear. On a more abstract level, we may think of dominance as a genotype-to-phenotype mapping. Yet, if we continue to ponder the nature and action of diploidy and dominance, they are really quite bizarre. Why does nature double the amount of information carried within the genotype and then turn around and cut by half the quantity of information it uses? On the surface this seems wasteful and unnecessarily tedious. Yet, nature is no spendthrift, nor is she given to whimsy or caprice. There must be good reason for the added redundancy of the diploid genotype and for the reduction mapping of the dominance operator. Actually, diploidy and dominance have long been the object of genetic study, and numerous theories and explanations of their role have been put forth.
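The capital-letter dominance convention above can be written as a tiny genotype-to-phenotype mapping. This is a sketch of the convention as stated in the text, with function and chromosome names of our own choosing.

```python
# Sketch of the capital-dominant convention: at each locus, a capital
# (dominant) allele is expressed whenever present; a lower-case
# (recessive) allele is expressed only when both alleles are recessive.
def express(chrom_a, chrom_b):
    phenotype = []
    for a, b in zip(chrom_a, chrom_b):
        if a.isupper():
            phenotype.append(a)          # dominant allele on strand A
        elif b.isupper():
            phenotype.append(b)          # dominant allele on strand B
        else:
            phenotype.append(a)          # homozygous recessive: expressed
    return ''.join(phenotype)

print(express('AbCDe', 'aBCde'))         # → ABCDe
```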
The theories which make the most sense in the context of artificial genetic search hypothesize that diploidy provides a mechanism for remembering alleles and allele combinations which were previously useful and that dominance provides an operator to shield those remembered alleles from harmful selection in a currently hostile environment. In a natural context, we can understand the need for both a distributed long term memory and a means of protecting that memory against rapid destruction. Over the course of the evolution of life on Earth, the planet has undergone many changes in environmental conditions. From hot to cold and back to moderate temperatures, from dark to light to somewhere in between, there have been dramatic and rapid shifts in environmental conditions. The most effective organisms have been those able to adapt most rapidly to the changing conditions. Animals and plants with diploid or polyploid structure have been those most capable of surviving, because their genetic constitution did not easily forget the lessons learned prior to previous environmental shifts. The redundant memory of diploidy permits multiple solutions (to the same "problem") to be carried along with only one particular solution expressed. In this way, old lessons are not lost forever, and dominance and dominance change permit the old lessons to be remembered and tested occasionally. An often cited example of the long term memory induced by diploidy and dominance can be found in the shifts in population balance of the peppered moth in Great Britain during the industrial revolution. The wild form (and originally the dominant form) of this lepidopteran had white wings with small black specks. Prior to the Industrial Revolution, this coloration was effective camouflage against birds and other beasts of prey in the moth's natural habitat, lichen-covered trees. In the middle of the nineteenth century, black forms were caught in the neighborhood of industrial towns.
Careful experiments by Kettlewell (Berry, 1972) showed that the speckled version was advantageous in the pristine setting, while the melanic (dark) form was advantageous in the industrial environment where pollution had killed off the lichen covering the tree trunks. It turned out that the melanic forms were controlled by a single dominant gene, implying that a shift in dominance occurred. When the industrial revolution shifted the balance of power toward the darkened form, the darkened form became dominant and the speckled form was held in abeyance. Note that the melanic form was not a new invention; this was no case of fortuitous mutation magically concocting the needed form. Instead, the black form had been invented earlier, perhaps in response to forests where lichen was naturally suppressed. When the by-products of industry caused the lichen to disappear, the melanic form was sampled more frequently and then evolved to the dominant form. With this alternate solution held in the background, the peppered moth was easily able to adapt rapidly to the selective pressures of its changing environment. In this example we see how diploidy and dominance permit alternate solutions to be held in abeyance--shielded against over selection. We also see how dominance is no absolute state of affairs. Biologists have hypothesized and proven that dominance itself evolves. In other words, the dominance or lack of dominance of a particular allele is itself under genic control. Fisher (1958) theorized that dominance at a particular position (locus) along a chromosome is actually determined by another modifier gene at another locus. This implies that dominance is an evolving feature of the organism, subject to the same search procedures as any other feature. If a particular allele is favored by selection, it will spread more rapidly if it is dominant. A modifier gene therefore enhances the spread of the gene being modified. This in turn enhances the spread of the modifier.
If the two genes are closely linked, this positive feedback quickly propagates both the favored allele and the modifier in the population. But what sets the dominance of the modifier gene? In order to avoid infinite regress, we recognize that a gene can have more than one effect on the phenotype. In fact, an allele can have several major effects on the phenotype while it affects the dominance at one or more other loci. The presence of such multiple effects is known as pleiotropy. We will use a simple form of pleiotropic modifiers (one with the modifier always attached to the gene it modifies) in experiments with a diploid GA representation suggested by Hollstien (1971) and Holland (1975). In the next section, we examine the diploidy-dominance schemes used in artificial genetic search to see how they incorporate diploid structure, dominance, and the evolution of dominance.

DIPLOIDY AND DOMINANCE IN GENETIC ALGORITHMS: AN HISTORICAL PERSPECTIVE

Some of the earliest examples of practical genetic algorithm application contained diploid genotypes and dominance mechanisms. In Bagley's early (1967) dissertation, a diploid chromosome pair mapped to a particular phenotype using a variable dominance map coded as part of the chromosome itself (Bagley, 1967, p. 136):

Each active locus contains, besides the information which identifies the parameter to which it is associated and the particular parameter value, a dominance value. At each locus the algorithm simply selects the allele having the highest dominance value. Unlike the biological case where partial dominance may be permissible (resulting, for example, in speckled eyes), our interpretation demands that only one of the alleles of the homologous loci be chosen. The decision process in the case of ties (equal dominance values) involves position effects and is somewhat complicated so that it will be necessary to outline the process in some detail.
The introduction of a dominance value for each gene allowed this scheme to adapt with succeeding generations. Unfortunately, Bagley found that the dominance values tended to fixate quite early in simulations, thereby leaving dominance determination in the hands of his somewhat complicated and arbitrary tie-breaking scheme. To make matters worse, Bagley prohibited his mutation operator from operating on his dominance values, thereby further aggravating this premature convergence of dominance values. Additionally, Bagley did not compare haploid and diploid schemes, and in all of his cases the environment was held stationary. In the end, the convergence of dominance values at all positions led to an arbitrary random choice dominance mechanism and inconclusive results. Rosenberg's (1967) biologically-oriented study contained a diploid chromosome model; however, since biochemical interactions were modeled in detail, dominance was not considered as a separate effect. Instead, any dominance effect in this study was the result of the presence or absence of a particular enzyme. The presence or absence of an enzyme could inhibit or facilitate a biochemical reaction, thus controlling some phenotypic outcome. Hollstien's study (1971) included diploidy and an evolving dominance mechanism. In fact, Hollstien described two simple, evolving dominance mechanisms and then put the simplest to use in his study of function optimization. In the first scheme, each binary gene was described by two genes, a modifier gene and a functional gene. The functional gene took on the normal 0 or 1 values and was decoded to some parameter in the normal manner. The modifier gene took on values of M or m. In this scheme, 1 alleles were dominant when there was at least one M allele present at one of the homologous modifier loci. This resulted in a dominance expression map as displayed in Figure 1.
          0M   0m   1M   1m
    0M     0    0    1    0
    0m     0    0    1    0
    1M     1    1    1    1
    1m     0    0    1    1

Figure 1. Two-locus evolving dominance map from Hollstien (1971).

Hollstien recognized that this two-locus evolving dominance scheme could be replaced by a simpler one-locus scheme by introducing a third allele at each locus. In this triallelic scheme, Hollstien drew alleles from the 3-alphabet {0, 1, 2}. Here the 2 played the role of a dominant "1" and the 1 played the role of a recessive "1." The dominance expression map he used is displayed in Figure 2.

          0   1   2
    0     0   0   1
    1     0   1   1
    2     1   1   1

Figure 2. Single-locus, triallelic dominance map from Hollstien (1971) and Holland (1975).

The action of this mapping may be summarized by saying that both 2 and 1 map to "1", but 2 dominates 0 and 0 dominates 1. Holland (1975) later discussed and analyzed the steady state performance of the same triallelic scheme, although he introduced the clearer symbology {0, 1_0, 1} for Hollstien's {0, 1, 2}. The Hollstien-Holland triallelic scheme is the simplest practical scheme suggested for evolving dominance and diploidy in artificial genetic search. With this scheme, the more effective allele becomes dominant, thereby shielding the recessive. Minimum excess storage is required (half a bit extra per locus), and furthermore, dominance shift can easily be handled as a mutation-like operator, mapping a 2 to a 1 (a 1 to a 1_0 in Holland's notation) and vice versa. Despite the clarity of the scheme, Hollstien's results with this mechanism were mixed. Although his Breed Type III simulations maintained better population diversity (as measured by population variance) than did his haploid simulations, there was no significant overall improvement of either average or ultimate performance. This seems surprising until we recognize that his test bed only contained stationary functions. If the role of dominance-diploidy is shielding or abeyance, we should only expect significant performance differences between haploid and diploid genetic algorithms when the environment changes with time.
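The triallelic expression rule just summarized ("both 2 and 1 map to '1', but 2 dominates 0 and 0 dominates 1") is small enough to code directly. This sketch uses our own function names for illustration.

```python
# Sketch of the Hollstien-Holland triallelic dominance scheme over
# alleles {0, 1, 2}: 2 is a dominant "1", 1 is a recessive "1",
# and 0 dominates the recessive 1.
def express_locus(a, b):
    if 2 in (a, b):          # dominant 1 is expressed whenever present
        return 1
    if 0 in (a, b):          # 0 dominates the recessive 1
        return 0
    return 1                 # homozygous recessive 1s: expressed as "1"

def express(chrom_a, chrom_b):
    """Map a homologous chromosome pair to its expressed phenotype."""
    return [express_locus(a, b) for a, b in zip(chrom_a, chrom_b)]

print(express([2, 1, 0], [0, 0, 0]))     # → [1, 0, 0]
```

A dominance-shift mutation in this scheme simply swaps a 2 for a 1 (or vice versa) at a locus, changing which value is expressed without losing the stored allele.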
Brindle (1981) performed experiments with a number of dominance schemes in a function optimization setting. Unfortunately, there has been some question as to the validity of this study's test functions and codings (K. A. De Jong & L. B. Booker, personal communication, 1986). Furthermore, the study ignored previous work in artificial dominance and diploidy, and a number of the schemes developed were without basis in theory and without biological precedent. Specifically, Brindle considered a total of six schemes:

1. random, fixed, global dominance
2. variable, global dominance
3. deterministic, variable, global dominance
4. choose a random chromosome
5. dominance of the better chromosome
6. haploid-controls-diploid individual dominance

The second and third schemes make local dominance decisions based on global population knowledge. The use of global information is questionable since the primary beauty of both natural and artificial genetic search is their global performance through local action. Once global operators are inserted, this attractive feature is destroyed. This is no small matter if we are ultimately concerned with efficient implementation of these methods on parallel computer architectures. Of the remaining schemes, only the sixth scheme suggested by Brindle uses an adaptive dominance map like those in Hollstien's (1971) and Bagley's (1967) earlier work; however, this scheme completely separates the dominance map (the modifying genes) from the normal chromosome (the functional genes) as an added haploid chromosome. Of course, this separation effectively destroys linkage between the dominance map and the functional genes. In addition to the problems with the dominance schemes and test functions, Brindle's work, like studies before it, considered only stationary functions. This is a common thread running through all previous genetic algorithm studies of dominance and diploidy.
If dominance and diploidy do act to shield currently out-of-favor solutions from excessive selection, use of these operators in static environments is unlikely to show any performance gain when compared to a haploid GA. In the next section, we further buttress the case for dominance and diploidy as abeyance--long-term memory--mechanisms through an analysis of schema propagation under these operators.

THEORY OF DOMINANCE AND DIPLOIDY

Before analyzing the specific effects of dominance and diploidy, we briefly review the notion of schemata and the fundamental theorem of genetic algorithms. The cornerstone of all genetic algorithm theory is the realization that GAs process schemata (schema--singular, schemata--plural) or similarity templates. Suppose we have a finite length binary string, and suppose we wish to describe a particular similarity. For example, consider two five-bit strings A and B that both have 1's in the first and third positions. A natural shorthand to describe such similarities introduces a wild card or don't care symbol, the star *, in all positions where we are disinterested in the particular bit value. For example, the similarity in the first position can be described as follows: 1****. Likewise, the similarity in the third position may be described with the shorthand notation **1**, and the combined similarity may be described with *'s in all positions but the first and third: 1*1**. These schemata or similarity templates name not only the strings A and B. The schema 1**** describes a subset containing 2^4 = 16 strings, each with a one in the first position. The more specific schema 1*1** describes a subset of 2^3 = 8 strings, each with ones in both the first and third positions. We notice that not all schemata are created equal. Some are more specific than others. We call the specificity of a schema H (the number of fixed positions) its order, o(H). For example, o(1****) = 1 and o(1*1**) = 2.
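These counts are easy to verify mechanically. A small Python sketch (the helper names are ours) enumerates all 32 five-bit strings and counts the instances of each schema:

```python
from itertools import product

def matches(schema, string):
    """True when the bit string is an instance of the schema (* = don't care)."""
    return all(s == '*' or s == c for s, c in zip(schema, string))

def order(schema):
    """o(H): the number of fixed (non-*) positions of schema H."""
    return sum(c != '*' for c in schema)

strings = [''.join(bits) for bits in product('01', repeat=5)]
n_broad = sum(matches('1****', s) for s in strings)    # 2^4 = 16 instances
n_narrow = sum(matches('1*1**', s) for s in strings)   # 2^3 = 8 instances
```

Running this confirms the subset sizes quoted in the text and the orders o(1****) = 1 and o(1*1**) = 2.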
Some schemata have defining positions spaced farther apart than others. We call the distance between a schema's outermost defining positions its defining length, delta(H). For example, the defining length of any one-bit schema is zero: delta(1****) = delta(**1**) = 0. On the other hand, the defining length of our order-two schema example may be calculated by subtracting the position indices of the outermost defining positions: delta(1*1**) = 3 - 1 = 2. These properties are useful in the fundamental theorem of genetic algorithms, otherwise known as the schema theorem. Under fitness proportionate reproduction, simple crossover, and mutation, the expected number of copies m of a schema H is bounded by the following expression:

m(H,t+1) >= m(H,t) * [f(H)/f_bar] * [1 - p_c * delta(H)/(l-1) - o(H) * p_m]

In this expression, the factors p_m and p_c are the mutation and crossover probabilities respectively, l is the string length, f_bar is the population average fitness, and the factor f(H) is the schema average fitness, which may be calculated by the following expression:

f(H) = [sum of f(s) over all strings s currently representing H] / m(H,t)

The schema average fitness f(H) is simply the average of the fitness values of all strings s which currently represent the schema H. Overall the schema theorem says that above-average, short, low-order schemata are given exponentially increasing numbers of trials in successive generations. Holland (1975) has shown that this is a near-optimal strategy when the allocation process is viewed as a set of parallel, overlapping, multi-armed bandit problems. We will not review this matter in detail here. Instead, we need to look at how dominance and diploidy modify expected schema propagation. To see the effect of dominance and diploidy on schema propagation, we recognize that the schema theorem still applies; however, it is useful to separate the physical schema H from its expression H_e(H). Of course the expression of a schema is a function of the schema, its range of mates, and the dominance map in use.
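As an illustration, a small Python sketch (helper names are ours) computes delta(H) and evaluates the schema-theorem lower bound from the quantities just defined:

```python
def defining_length(schema):
    """delta(H): distance between the outermost fixed (non-*) positions."""
    fixed = [i for i, c in enumerate(schema) if c != '*']
    return fixed[-1] - fixed[0] if fixed else 0

def schema_growth_bound(m, f_H, f_bar, delta, o, l, p_c, p_m):
    """Schema-theorem lower bound on m(H, t+1); l is the string length,
    p_c and p_m the crossover and mutation probabilities."""
    return m * (f_H / f_bar) * (1.0 - p_c * delta / (l - 1) - o * p_m)

# delta of any one-bit schema is zero; delta('1*1**') = 3 - 1 = 2.
# Example bound: 100 copies of an above-average (f_H/f_bar = 1.2) schema
# with delta = 2, o = 2 on length-5 strings, p_c = 0.6, p_m = 0.01.
bound = schema_growth_bound(100, 1.2, 1.0, 2, 2, 5, 0.6, 0.01)
```

Short, low-order schemata keep the bracketed disruption term close to 1, which is why they receive the exponentially increasing trials the theorem promises.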
Recognizing all this, we may rewrite the schema theorem in somewhat clearer form:

m(H,t+1) >= m(H,t) * [f(H_e(H))/f_bar] * [1 - p_c * delta(H)/(l-1) - o(H) * p_m]

Everything remains the same, except the average fitness of the schema H, f(H), is replaced by the average fitness of the expressed schema H_e(H), f(H_e(H)). In the case of a fully dominant schema H, the average fitness of the physical schema always equals the expected average fitness of the expressed schema H_e(H):

f(H) = f(H_e(H))

In the case of a dominated schema H, the hope is that the average fitness of the expressed schema is greater than or equal to the average fitness of the physical schema:

f(H_e(H)) >= f(H)

This situation is most likely to occur when the dominance map itself is permitted to evolve. If the average fitness of the allele as expressed is greater than its fitness when homozygous, then the currently deleterious, dominated schema will not be selected out of the population as rapidly as in the corresponding haploid situation. This is how dominance and diploidy shield currently out-of-favor schemata. To make this argument more quantitative, let's consider a simple case where only two alternative, competing schemata may be expressed, one dominant and the other recessive. Physically, we can think of this as representing either two alternate alleles at a particular locus, or two multi-locus schemata that have come to dominate a particular set of loci. In either case, we assume that the dominant alternative is expressed whether heterozygous or homozygous, and we assume that the recessive alternative is expressed only when homozygous. Rearrangement of the schema growth equation permits us to calculate the proportion of recessive alleles, p^t, in successive generations, t.
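For the two-alternative case just described, the shielding inequality can be made concrete with a one-line sketch (our naming; random mating is assumed, as in the text). A recessive alternative with expressed fitness f_r is only expressed when paired with another recessive, which happens with probability equal to its proportion p; otherwise the dominant alternative (fitness f_d) is expressed:

```python
def expressed_fitness_recessive(p, f_r, f_d):
    """Average fitness credited to a recessive alternative under random mating:
    expressed (fitness f_r) only when homozygous, which occurs with
    probability p; otherwise the dominant alternative's fitness f_d applies."""
    return p * f_r + (1.0 - p) * f_d
```

With f_d > f_r, this expressed average exceeds f_r for any p < 1, so the currently poor recessive is culled more slowly than its homozygous fitness alone would dictate; this is f(H_e(H)) >= f(H) in miniature.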
If we assume that there are only two alternatives, that the dominant form has a constant expected fitness value of f_d, and that the recessive form when expressed has a constant expected fitness value of f_r, the proportion of recessives expected in the next generation may be calculated as follows:

p^(t+1) = K * p^t * [p^t + r(1 - p^t)] / [(p^t)^2 + r(1 - (p^t)^2)]

where r = f_d/f_r and K is a crossover-mutation loss constant. A similar equation may be derived for the haploid case, where the deleterious alternative (we will still call this recessive even though it no longer recesses) is always expressed when present in a haploid structure:

p^(t+1) = K * p^t / [p^t + r(1 - p^t)]

Proportion ratio (p^(t+1)/p^t) and proportion-versus-time graphs are plotted for the haploid and diploid cases in Figures 3 and 4. From Figure 3 we see that for a comparable proportion of alleles, the haploid case always destroys more recessives (always has a smaller proportion ratio) than the corresponding diploid case. Of course, this does not imply that the diploid case has a low on-line performance measure. In fact, the sampling rate remains low (proportional to (p^t)^2) for the poor (recessive) alleles in the diploid case.

Figure 3. Expected ratio of recessive proportions p^(t+1)/p^t versus p^t for haploid (r = 2), diploid (r = 2), and limiting diploid (r -> infinity).

Figure 4. Proportion of recessive alleles p^t versus generation t for haploid (r = 2), diploid (r = 2), and limiting diploid (r -> infinity).

The previous analysis clearly demonstrates the long-term memory induced by diploidy and dominance. Because of this effect, we also expect that mutation should play even less of a role in the operation of a diploid genetic algorithm.
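These recurrences are easy to iterate numerically. A minimal sketch (assuming K = 1, i.e., ignoring crossover-mutation losses; the function names are ours) reproduces the qualitative behavior plotted in Figures 3 and 4:

```python
def haploid_next(p, r, K=1.0):
    """One generation of the haploid recurrence: the deleterious allele
    (proportion p) is always expressed. r = f_d / f_r, K = loss constant."""
    return K * p / (p + r * (1.0 - p))

def diploid_next(p, r, K=1.0):
    """One generation of the diploid recurrence: the recessive is expressed
    (and selected against) only when homozygous, with probability p**2."""
    return K * p * (p + r * (1.0 - p)) / (p * p + r * (1.0 - p * p))

p_h = p_d = 0.5
for _ in range(10):
    p_h = haploid_next(p_h, 2.0)
    p_d = diploid_next(p_d, 2.0)
# The haploid GA drives the recessive allele out much faster (cf. Figures 3-4).
```

After ten generations at r = 2, the haploid proportion has collapsed toward zero while the diploid proportion decays far more slowly, which is precisely the abeyance effect the analysis describes.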
Holland (1975) has presented an analysis of the steady state mutation requirements of diploid structures as compared to haploid structures. We reproduce his arguments to understand the ergodic performance of these mechanisms. For a haploid structure under selection and mutation, it may be shown that the proportion of recessive alleles in the next generation, p^(t+1), is related to the proportion in the current generation, p^t, by the following equation:

p^(t+1) = (1 - epsilon) * p^t + p_m * (1 - p^t) - p_m * p^t

Here we have the sum of three terms: the first due to selection, the second the source of alleles from mutation, and the third the loss of alleles from mutation. The epsilon factor is the proportion lost due to selection and other operator losses. At steady state, the proportion p^(t+1) = p^t = p_ss. Solving for p_m, we obtain the following equation:

p_m = epsilon * p_ss / (1 - 2 * p_ss)

This equation suggests that the final steady state proportion of alleles is directly proportional to the mutation rate (with large epsilon and small p_ss). For a diploid structure under selection and mutation, it may be shown that the proportion of recessive alleles in the next generation is related to the proportion in the current generation by the following equation:

p^(t+1) = (1 - epsilon * p^t) * p^t + p_m * (1 - 2 * p^t)

At steady state we obtain a relationship between the required mutation rate and the steady state proportion of recessive alleles:

p_m = epsilon * p_ss^2 / (1 - 2 * p_ss)

For small steady state proportions of recessive alleles, p_ss, the required mutation rate is thus proportional to p_ss^2 rather than p_ss, so a diploid structure maintains a given level of diversity with a much smaller mutation rate.

4.2.1 Specialization at relation level

The mechanics of specialization in relations is very similar to generalization. Rules can be specialized by filling variable positions with literal values that have the type of the corresponding relation arguments (a constructor being used to generate a legal literal value based upon its type in the class hierarchy), by constraining the variable to be of a subtype of the type at which it is presently constrained, or by replacing the relation with a subspecies of the relation.
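As a toy illustration of these three specialization moves (the relation names, class hierarchy, and helper functions below are invented for the sketch; the paper's system draws them from a full concept graph):

```python
# Invented toy hierarchy and relation taxonomy for illustration only.
SUBTYPES = {"vehicle": ["truck", "car"], "building": ["depot"]}
SUBRELATIONS = {"location-of": ["garage-of"]}

def specialize_by_literal(clause, instances):
    """Fill a variable position with a legal literal of the argument's type."""
    rel, arg = clause
    if arg.startswith("?") and arg[1:] in instances:
        return (rel, instances[arg[1:]][0])
    return clause

def specialize_by_subtype(clause):
    """Constrain the variable to a subtype of its current type."""
    rel, arg = clause
    if arg.startswith("?") and arg[1:] in SUBTYPES:
        return (rel, "?" + SUBTYPES[arg[1:]][0])
    return clause

def specialize_by_subrelation(clause):
    """Replace the relation with one of its subspecies."""
    rel, arg = clause
    return (SUBRELATIONS.get(rel, [rel])[0], arg)
```

For example, (location-of ?vehicle) can be specialized to (location-of ?truck), to (garage-of ?vehicle), or to (location-of truck-7) given a legal instance.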
4.2.2 Specialization at conditional level

Another method of specialization is the addition of clauses in the left-hand side of the rule. In the simple case the new clause does not introduce any new elements; it only further constrains the conditions of the rule's applicability. However, when new variables are introduced by an STM or associate clause, the clause will do little useful work unless at least one of the newly introduced variables is mentioned elsewhere in the rule.* Therefore, specialization by clause introduction needs to be tied to the existing clauses through an existing variable, or must be complemented by modifications to other clauses to effect such a tie.

4.3 Mutation

Mutation is very similar to specialization and generalization in that it is conceptually a local operator that transforms individual elements of a rule construct. In the bit-string representation this transformation consists of replacing a "1" with a "0" and vice versa; see Figure 8.

Figure 8. Mutation of a bit-string rule (old rule, new rule).

4.3.1 Mutation at relation level

In the production rule representation, mutation is constrained by the same datatyping considerations as specialization and generalization. It produces value changes based upon the type of the argument corresponding to the selected position, but, as opposed to those two operators, it seeks sibling concepts for the new variable or constant. Recall the instantiated relation (location-of bldg1) that was generalized by replacing the reference to a specific building by a reference to a superclass of the building. A mutation on that clause would be, for instance, (location-of ?vehicle), which, as was pointed out in the discussion of generalization above, is not a generalization of the instantiated relation.

4.3.2 Mutation at conditional level

Mutation of the relation expressed by a clause can be viewed as the removal of the clause from the conditional followed by the addition of a new clause.
Notice that in a fully articulated concept graph, the relation expressed by the new clause must be, at some level, a sibling or cousin relation to the original. One might use this to provide a metric and control regimen for the degree of mutation in a rule.

*This is by no means always the case. Sometimes variables are introduced to test for some general condition to decide if an action should be taken, resulting in a rule like IF (STM ?var1) THEN DO (action constant1).

4.4 Inversion

The inversion operator of the GA produces an end-for-end reversal of positions in the condition of a rule. It can be applied as a mutation operator during the running of the system, or as a crossover/mutation operator applied during reproduction. The example in figure 9 shows the basic notion behind inversion.

Figure 9. Inversion of the rule in Figure 5 (old rule, new rule).

4.4.1 Inversion at relation level

Because of the strict ordering of arguments within relations, inversion cannot in general be achieved unless the types of the relation's arguments are symmetrical. Under any other circumstances, inversion would result in illegally formed clauses.

4.4.2 Inversion at conditional level

The production system as described restricts the ordering of clauses according to the introduction of variables. Inversion of the clauses in the conditional would presumably destroy some of the dependencies of that variable introduction scheme, requiring renaming and rethreading of variables through the inverted clause set. Without an explicit relation between, say, the clauses in a rule and the actions to be taken in a plan, it's difficult to see the point of inversion. Given the variable set for a rule, it shouldn't make any semantic difference in what order the clauses are expressed.

4.5 Crossover

The crossover operator is the workhorse modification operator of the GA. It is also an operator which doesn't have a direct analog in the "traditional" machine learning literature.
The example in figure 10 shows two rules being crossed over after they were chosen as mates. A locus (denoted by the cut) determines how much of Rule1 and how much of Rule2 will be used. In this example, Rule1 provides the first 3 positions of the new rule, with Rule2 providing the remaining positions.

Figure 10. Crossover of two rules (Rule1, Rule2, new rule).

4.5.1 Crossover at relation level

It is not immediately obvious that crossover makes sense for relations. In fact, the argument typing requirements are often cited as the cause of crossover's inapplicability in this domain. However, there is a way in which crossover might be applied (and the locus of crossover chosen) even within strongly datatyped relations when two rules are being crossed. This requires two modifications of the system we have so far described: (1) the system should be described by some context-sensitive rule grammar that formally delineates what is legal at all the levels described in Section 3.5; (2) all rules should be described by a datastructure that indicates the derivation tree through the grammar followed in the construction of the rule. A rule can be cut at any point, but each part of the rule must carry a copy of its (partial) derivation. Crossover can then be performed by any two rule fragments whose derivations complete each other. That is, the cut results not only in the splitting of a rule into two fragments, but in the tagging of each cut end of the fragment: the left-hand side with the partial derivation tree down to the cut, the right-hand side with the subtree starting at the cut. The unification of the derivations produces the new rule. This process is obviously more complex than the other operators considered thus far, and we are presently trying to specify a grammar for our system that will form the basis of the derivation path datastructure.

4.5.2 Crossover at conditional level

Figure 11 illustrates a naive view of crossover when the cut occurs between clauses in the conditional.
Rule1 and Rule2 are two hypothetical rule mates. Their combination via crossover produces a rule that mentions a variable in its action that is not well-founded: ?varC is supposed to be bound to (function ?varB), but ?varC is unbound in New Rule. Moreover, ?varA is also unfounded, and neither ?varA nor ?varB is doing any work that is obviously related to the action to be taken.

Figure 11. Crossover operator problems (Rule1, Rule2, New Rule).

Our first implementation of crossover attempted to achieve crossover for conditions and to gain initial experience to test some intuitions. It promoted modification only among rules with the same goal or class of goals. A rule was naively produced. It then went through one round of generalization, in which variables of the same type with different names were unified, and one round of specialization, in which variables of predicates and associative relations were given literal values if they were not bound by the STM relations. Predicates without any variables (only literals), or associative relations whose variable is not used by predicates or assertions, were expunged from the rule condition because of their useless computational expense. Our present approach (until we work out the grammar-based one) is to view crossover as a repeated generalization by clause removal on the rule that is to provide the action side of the new rule (e.g., Rule2 in Figure 11), followed by a repeated specialization by clause addition on the resulting generalized rule. The clauses added in the specialization process are those before the cut in the rule that is to provide the first part of the new rule (e.g., Rule1 in the figure).
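Setting aside the combination operator for a moment, this generalize-then-specialize view of conditional crossover can be sketched as follows (the clause tuples and all names are invented for illustration; clause lists stand in for the paper's rule conditionals):

```python
def crossover_conditional(rule1, rule2, cut1, cut2):
    """Crossover at the conditional level, viewed as generalization by
    clause removal on rule2 (which supplies the action side), followed by
    specialization that adds the clauses before rule1's cut."""
    clauses1, _action1 = rule1
    clauses2, action2 = rule2
    generalized = clauses2[cut2:]                # drop clauses before rule2's cut
    specialized = clauses1[:cut1] + generalized  # add rule1's leading clauses
    return (specialized, action2)

# Toy rules loosely patterned after Figure 11.
rule1 = ([("Rel1", "?varA", "const1"), ("Rel2", "?varA", "const2")],
         ("Action1", "?varA"))
rule2 = ([("Pred", "?varB"), ("Assoc", "?varC", ("function", "?varB"))],
         ("Action2", "?varC"))
new_rule = crossover_conditional(rule1, rule2, 1, 1)
```

The new rule keeps Rule2's action and mixes clauses from both conditionals; the well-foundedness repairs described in the text would then run over the result.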
The characterization is rough because the conditional is bound together by the combination operator. We mutate the combination operator of the generalized intermediate form of the rule (derived from Rule2) to the combination operator of the rule providing the specialization clauses (Rule1).

5. Issues and Future Work

At least two issues are raised by the representational adaptations we are attempting. Firstly, are we still discussing computation within a genetic metaphor if we accept this representation? Secondly, how are these operators to play in AI systems, what kind of control regimen should they operate under, and is that genetic? We discuss these issues very briefly here.

5.1 The Genetics Metaphor

An important issue raised by this work is whether the metaphor of genetic recombination is not completely lost when one goes to such high-level representations. The integrity requirements on the variable bindings make it difficult to view rules as suitable material for local genetic recombination operators. When the legality requirement is enforced the way we have described, the whole operation of adapting GA techniques seems driven by a priori representation characteristics. The result has been an adaptation that may seem ad hoc from the GA perspective. A much more satisfactory solution seems to be at hand if the entire production system can be described in a formal grammar. In that context genetic recombination re-emerges much more clearly as a metaphor for the process, although the operators are much more complex than usually described in the GA literature. Perhaps an analogy of such operators to the role of RNA is appropriate. A glimpse of this approach appears in the discussion of crossover within relations, above.
5.2 Control

Having constructed a set of operators to generate changes to a rule base, we are now confronted with the heart of the machine learning problem: how to control system modification operators. A (perhaps crucial) element of the GA is the probabilistic control of recombination of elements in a gene pool of some sort, leading to new generations of the overall system. In real world applications of knowledge-based systems, the effort expended on knowledge engineering, the expectation that system changes will be incremental, and the hope that they will be well-understood seem to argue against directly incorporating the GA approach to control in such systems. Yet the GA approach may represent an aspect of control that must be present in a system (in some form) for it to exhibit the capacity for discovery.

5.3 Future Work

We have described issues in the application of the GA to pattern-based production systems. We are currently implementing the operators described so we may explore those issues. The system in which this approach is being explored is implemented on a Symbolics 3670. It runs the bucket brigade as a knowledge evaluation mechanism, and has demonstrated auto-modification of its behavior using the bucket brigade alone. We are presently exploring variations on the competitive learning scheme exemplified by the bucket brigade, as well as rule generation strategies as exemplified by the genetic operators. Generalization, specialization, and a version of crossover are implemented and, at present, invoked on a per-rule basis depending on a rule's performance. Next year's work is aimed at quantitatively assessing these approaches through controlled machine learning experiments. We expect to pay particular attention to the GA control strategy as a process for generating effective new rules in competition with other such processes.
We will attempt to implement operators in a meta-level rule representation to begin to provide explanation of rule derivations and justification of system modifications.

References

Simon, H. A., "Why Should Machines Learn?", in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Eds., Tioga Publishing Co., Palo Alto, CA, 1983.

Holland, J. H., Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975.

De Jong, K. A., "Genetic algorithms: a 10 year perspective", Proc. Int. GA Conf., pp. 169-177, 1985.

Goldberg, D. E., "Computer-aided gas pipeline operation using genetic algorithms and rule learning", Ph.D. Thesis, University of Michigan, Ann Arbor, MI, 1983.

De Jong, K. A., and Smith, T., "Genetic Algorithms Applied to Information Driven Models of US Migration Patterns", IEEE Trans. on Systems, Man, and Cybernetics, 10(9), Sept. 1980.

Grefenstette, J. J., Gopal, R., Rosmaita, B. J., and Van Gucht, D., "Genetic algorithms for the traveling salesman problem", Proc. Int. GA Conf., pp. 160-168, 1985.

Antonisse, H. J., and Keller, K. S., "Evaluation of Imprecisely Specified Knowledge", Proc. Digital Avionics Systems Conf., Fort Worth, TX, 1986.

Tallis, H. C., and Antonisse, H. J., "Dynamic Evaluation of Intelligence Fusion Rules Using the Bucket Brigade", submitted for publication.

Cramer, N. L., "A representation for the adaptive generation of simple sequential programs", Proc. Int. GA Conf., pp. 183-185, 1985.

Smith, D., "Bin packing with adaptive search", Proc. Int. GA Conf., pp. 202-206, 1985.

Forrest, S., "Implementing semantic network structures using the classifier system", Proc. Int. GA Conf., pp. 24-44, 1985.

TREE STRUCTURED RULES IN GENETIC ALGORITHMS

Arthur B. Bickel
Riva Wenig Bickel
Department of Computer and Information Systems
Florida Atlantic University
Boca Raton, FL 33432

ABSTRACT

We apply genetic algorithm techniques to the creation and use of lists of tree-structured production rules varying in length and complexity. Actions, conditions, and operators are randomly chosen from tables of possibilities. Use of the techniques is facilitated by the notions of time delay and dependency in performing tests or observations and by accounting for indeterminate test results. GENES, a general program, can adapt to a variety of systems by changing the contents of the various tables to accommodate the domain of a particular problem and by substituting a library of appropriate cases.

INTRODUCTION

The genetic algorithm is an iterative adaptive search technique which has been applied successfully to learning systems having large search spaces and to problems which cannot easily be reduced to closed form. Holland, who originated the notion of computer constructs which mimic natural adaptive mechanisms, noted that in order to apply this technique, the subject to be treated must be capable of being represented by some data structure and the solutions must be capable of being evaluated and ranked (1, 2).

APPLICATION TO EXPERT SYSTEMS

Conventionally, in genetic algorithm implementations, the rules containing the expertise have been expressed in the form of bit strings of fixed length. This approach works quite well in terms of the ease of applying the genetic operators: mutations may turn a 0 to a 1 or to some other symbol taken from a similar simple alphabet; crossovers are implemented by randomly choosing two crossover points along the string and exchanging the information between those two points. However, there are many problems which cannot be easily expressed in terms of a simple alphabet and still be amenable to the genetic operators (see, e.g., work on the traveling salesman problem (3, 4)).
The commonly used dual-valued alphabet is especially troublesome for two reasons. First, since the number of possible actual interpretations is unlikely to be a power of two, some interpretations will be redundant, weighting the possible choices unevenly. Secondly, some mutations would be extremely unlikely, e.g., 000 would probably not mutate to 111. These problems are both eliminated by having each decisional element be an index into an appropriate table. Choices for each position, and changes due to point mutations, are then made merely by randomly choosing one of the possible indices into that table. Schaffer and Grefenstette (5), building upon the LS-1 learning system originated by Smith (6), used a data structure where each expert was not an individual rule but, rather, an entire set of rules, in order to deal with multi-objective learning. These rule sets were of fixed length, although all the rules might not necessarily be used. But there may be insufficient flexibility in solving difficult classes of problems where the expert is composed of linear data structures of fixed length, since practicable computer expert systems are often presented as a system of non-linear rules (7). These rules tend to be of the form:

IF condition [AND/OR condition] THEN action

where the action may be either an attempt to obtain more information on the system (an action rule), or a terminal decision in the nature of deciding the answer to the presented problem (a decision rule).

CREATING AN EXPERT GENERATOR

We created GENES, a program to develop expert systems. Each expert in our system consisted of a linked list of rules, each rule parsed syntactically as a tree structure. The actual number of rules in each linked list was determined randomly via an average length parameter passed to the program upon its initiation. The number of rules allowed was optionally bounded. A different program parameter determined the maximum number of nodes for each rule.
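A rough Python sketch of this table-driven generation follows (the table contents, parameter names, and default values are invented for illustration; GENES's actual tables are domain-specific):

```python
import random

# Toy tables of possibilities; every decisional element is drawn as an
# index into tables like these, so the binary-alphabet problems never arise.
TESTS = ["C1", "C2", "C4", "C5"]
RELOPS = ["=", ">", ">=", "<>"]
BOOLS = ["AND", "OR"]
ACTIONS = ["A1", "A5"]

def random_condition(rng):
    """One condition node: (test, relational operator, scalar threshold)."""
    return (rng.choice(TESTS), rng.choice(RELOPS), rng.randint(0, 20))

def random_rule(rng, max_nodes=4):
    """One action node plus zero or more condition nodes joined by booleans."""
    n_conds = rng.randrange(max_nodes)           # 0 conditions = bare action
    tree = random_condition(rng) if n_conds else None
    for _ in range(n_conds - 1):
        tree = (rng.choice(BOOLS), tree, random_condition(rng))
    return {"cond": tree, "action": rng.choice(ACTIONS)}

def random_expert(rng, avg_len=5):
    """An expert: a linked list (here, a Python list) of tree-structured rules."""
    return [random_rule(rng) for _ in range(max(1, rng.randint(1, 2 * avg_len)))]

expert = random_expert(random.Random(0))
```

Because every choice is an index into a table, point mutation reduces to re-rolling one index, exactly as the text prescribes.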
The minimal possible value was one, this single node being an action node, which would mean that some prescribed action (randomly chosen from a table of possible actions) would always be carried out. One type of action was the undertaking of a specific additional test or observation; the other possible action was a final determination as to the state of the system, e.g., if the expert was a prospector, then this might be that a given mineral exists at a certain location in recoverable concentrations. There was only one action node per rule. Additional nodes each represented a possible condition precedent to carrying out the action, linked by boolean operators (again, randomly chosen). Condition precedent nodes are of the form: IF (test-result-yields RELATIONAL-OPERATOR scalar-quantity). As an example, if the expert created were a chemist, a possible node might be "if melting point > 160.51 degrees". In creating (but not in evaluating) the expert, booleans, relational operators, possible tests, and possible ranges of results were all randomly chosen from a permissible domain of actions, conditions, and values appropriate to the area of expertise desired for the system. Figure 1 shows a possible rule set for an unspecified expert.

{IF (NOT (((C1 = 2) AND (C4 > 3)) OR (C2 >= 12))) THEN A1} -> {A5} -> {IF ((C5 = TRUE) OR NOT (C1 <> 17)) THEN A1}

Figure 1.
Statements within {} indicate a single rule.
Statements beginning with C are conditions.
Statements beginning with A are actions.
-> indicates the linking of rules, which are evaluated in order.

APPLYING THE GENETIC OPERATORS TO THE EXPERT POPULATION

Three types of genetic operators were applied to the population of experts: mutation, crossover, and inversion. The tree structure called for some modifications to the way these operators normally deal with fixed-length strings of bits. Mutation was performed as an operation upon a single rule.
A number of varieties of mutation were possible, and the type of mutation occurring in each given case was chosen randomly from the list of possibilities, as was the rule to be mutated and the node of the rule at which the mutation would occur. The simplest type of mutation was a point mutation. This involved the change of one operator to another (e.g., > becoming >=). It was also possible for an operand of either a condition or an action type to be exchanged with another (e.g., in a chemical setting, boiling point becoming melting point, or 450 degrees becoming 200). A third possibility is exemplified by the exchange of an OR for an AND, or the addition or deletion of a NOT in the boolean expression. The last mutation possible was the addition or deletion of a rule or a portion of a rule, the latter being equivalent to a randomly selected branch of the parse tree.

The next simplest genetic operation was an inversion. Each inversion was carried out on a single set of rules. It involved randomly choosing, via the MOD operator, two points along a rule set of given length, and then exchanging the order of the rules that existed between the two points. This operation did not affect the rules themselves but, rather, the order in which they were evaluated. The possible effects of inversion were twofold. First, the firing of an earlier rule (that is, its conditions evaluating positively so that an action is taken) could cause later rules to evaluate differently. With a new ordering, given preconditions now may or may not exist. Second, due to rule evaluation halting once a decision rule fired, new rules might come under consideration or old ones might not now be reached.

The most complex genetic operation used in this system was crossover. Crossover occurred at points between rules, and involved exchanges of genetic material (in terms of numbers of rules) between experts.
Because the rule sets of each of the pair of experts chosen to undergo crossover were generally unequal in length, one point for crossover was chosen MOD the length of the shorter expert and one MOD the length of the longer expert. If both values turned out to be less than the length of the shorter rule set, then a double crossover occurred: a central list of rules was exchanged between the two, and both experts retained their original size. If only one point was less than the length of the shorter rule set, then only a single crossover occurred, with the tail end of the shorter expert now grafted onto what was once the longer one, and vice versa. Both offspring of each crossover survived. Figures 2a and 2b demonstrate the two types of crossover.

R1A--R1B--R1C-|-R1D--R1E--R1F-|-R1G--R1H--R1I--R1J--R1K
R2A--R2B--R2C-|-R2D--R2E--R2F-|-R2G

crossover of rule sets 1 & 2 yields

R1A--R1B--R1C-|-R2D--R2E--R2F-|-R1G--R1H--R1I--R1J--R1K
R2A--R2B--R2C-|-R1D--R1E--R1F-|-R2G

DOUBLE CROSSOVER - CROSSOVER POINTS INDICATED BY |
Figure 2a

R1A--R1B--R1C--R1D-|-R1E--R1F--R1G--R1H--R1I--R1J--R1K
R2A--R2B--R2C--R2D-|-R2E--R2F--R2G

crossover of rule sets 1 & 2 yields

R1A--R1B--R1C--R1D-|-R2E--R2F--R2G
R2A--R2B--R2C--R2D-|-R1E--R1F--R1G--R1H--R1I--R1J--R1K

SINGLE CROSSOVER - CROSSOVER POINT INDICATED BY |
Figure 2b

THE ROLE OF THE CRITIC

Experts, naturally, relate to some field of expertise. The general set of problems which these experts were to solve involved a set of related problems, where a number of different tests could be undertaken in order to determine some information about the system. The goal of each expert was to generate the proper set of tests, in the proper order, so that the correct solution for each presented problem would be obtained at a minimal cost in terms of testing expenses. The role of the critic was to apply each expert's rule set, in order (and repeatedly if necessary), to each member of a library of actual cases, and thereby to obtain some figure of merit for each expert.
If the conditions precedent in the rule were met, then the rule fired and the action called for was taken. If the rule action called for an observation about the case, information might be unmasked, and the expert's state of knowledge increased (but only if the information was available in the actual library case). Of course, costs may have been incurred in making such observations. These costs could involve time, money, or injury to the subject being observed. Results might be inconclusive or even unavailable. Rule evaluation ceased when a decision rule fired. If the entire rule set was evaluated without a decision rule firing, then the set was reevaluated from its beginning (and at an additional cost), because the results of earlier fired rules were likely to have provided additional information that could result in different rules firing on subsequent passes. A maximum number of passes was allowed in order to ensure termination. A valuation of the expert's merit vis-a-vis that particular problem was made, and then the next problem in the set was presented to it, and so on. Obviously, the larger and more varied the set of problems, the more sophisticated and discriminating the expert.

Creation of a library of problems under direct program control can be a tedious and error-prone process. In order to facilitate both entry of and changes to the problem set, GENES allowed creation of the problem set via the normal Unix(TM) vi editor, and then fed the error-free result to the program.

In our trial system we were interested in evaluating our experts on a dual basis. One consideration was how often they reached the correct result. The other was how little cost was expended in reaching this result. An apportionment between these two factors is clearly a value judgment that must be made for each particular expert system.
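The critic's repeated-pass evaluation loop can be sketched in a few lines (all names, the rule encoding, and the cost figures here are illustrative assumptions, not GENES code):

```python
def evaluate(rules, case, max_passes=5):
    """Critic loop sketch.  Each rule is (condition, action), where
    condition is a predicate over the known observations and action
    is ("test", name) or ("decide", answer).  Rules are applied in
    order, repeatedly, until a decision rule fires or the pass
    limit is reached; tests unmask case data and accrue cost."""
    known, cost = {}, 0.0
    for _ in range(max_passes):
        cost += 1.0                         # per-pass overhead
        for condition, action in rules:
            if not condition(known):
                continue
            kind, arg = action
            if kind == "test":
                cost += case["costs"].get(arg, 0.0)
                if arg in case["results"]:  # result may be unavailable
                    known[arg] = case["results"][arg]
            else:                           # decision rule: halt
                return arg, cost
    return None, cost                       # pass limit reached

case = {"results": {"melting_point": 170.0},
        "costs": {"melting_point": 3.0}}
rules = [(lambda k: "melting_point" not in k, ("test", "melting_point")),
         (lambda k: k.get("melting_point", 0) > 160.51,
          ("decide", "compound_X"))]
print(evaluate(rules, case))   # -> ('compound_X', 4.0)
```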
Thus each expert accrued a certain cost in its decision making, that being the sum of dollar costs incurred in running tests, degradation costs in the case of environmental misfortunes (e.g., a damaged drill in the instance of oil exploration), and penalties for bad evaluations. The merit of each expert was inversely related to the costs it incurred.

TIME DEPENDENT EVALUATIONS

Because the system we were studying was one where time was a factor, we introduced the notion of time dependency to our evaluations of the goodness of each expert. Many of the tests which could be performed by the experts would not yield immediate results. While such answers were pending, other tests might or might not be undertaken, depending upon the rules (where actions might or might not be dependent upon previous test results). Time dependency required that the entire rule set for each expert be reevaluated repeatedly, because new tests might be ordered once earlier test results came in. In many existing systems such delay involves additional daily overhead costs, so these costs were factored into the testing costs for each expert where appropriate.

A related notion was persistence, persistence being a quality associated with tests whose results do not change meaningfully over short intervals. In order to prevent the repeated ordering of such tests, each was assigned a persistence value which was decremented with time. If a rule required that a test be done when that test showed a non-zero persistence value, the rule was ignored.

It should also be noted that there was a possibility that test results might be equivocal. To simulate this possibility, the individual problems in the sets presented were issued certain flags representing both the current and possible results of tests and observations that the expert might make.
Thus, for example, in a medical situation, the patient's sex might generally be immediately knowable, the results of a blood culture would be available in 24 hours, but meaningful results of a CAT scan might actually turn out to be unobtainable. Because there was no prohibition on the repetition of rules within an expert's repertory, however, the CAT scan could be repeated and might show results the second time.

HANDLING CONVERGENCE

Researchers in the genetic algorithm field have repeatedly noted the difficulty of fine tuning the parameters in order to obtain results along the desired lines. The algorithm shares this rather nasty fact with other types of computer programming: "what you ask for is what you get" (see, e.g., [8]). Due to the emphasis upon getting the correct result as compared to the cost of arriving at that result, the population of experts tended to converge rapidly to a rather homogeneous solution set, wherein a high percentage of the problem sets tested were solved properly, but at less than optimum cost. Rapid convergence is a common problem in these types of systems and has been solved in various ways [3]. The solution used in GENES was to consider the system to have converged if the sum of all the expert scores remained within a narrowly bounded interval over several consecutive iterations. If the system was judged to have converged, then a percentage of the population was replaced by newly generated experts, thereby providing a major influx of genetic material into the system.

PARAMETERS FOR GENETIC OPERATIONS

Rates of crossover, inversion, and mutation, as compared with just plain duplication, were tunable parameters, as was the population size. Typically, rates in the range recommended by Grefenstette [10] - a 30 percent crossover rate, a 5 percent inversion rate, a 40 percent generation gap, and a low mutation rate - were found to give good results and were used.
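The convergence test described above - population replaced with fresh experts once the summed scores flat-line - is simple enough to sketch directly (the window and tolerance values are illustrative; the paper does not give numbers):

```python
def converged(score_history, window=5, tolerance=1.0):
    """Judge the population converged when the sum of all expert
    scores stays within a narrowly bounded interval over several
    consecutive iterations.  score_history holds one total
    population score per iteration."""
    if len(score_history) < window:
        return False
    recent = score_history[-window:]
    return max(recent) - min(recent) <= tolerance

assert not converged([10, 50, 90, 95, 99])        # still improving
assert converged([99.0, 99.2, 99.5, 99.1, 99.4])  # flat-lined
```

On a positive test, GENES would replace some percentage of the population with newly generated experts rather than halting.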
Variations in these rates affected the time it took to reach a good solution, but did not affect the goodness of that solution. Other typical variable values were a population of 50 experts with a maximum of 50 rules each; this was with 13 possible actions that could be taken and 12 possible tests-to-be-done, where the possible variations for the bound of each test-to-be-done were extremely large, since the constants were chosen from a continuous rather than a discrete population (e.g., temperature > 192.48). This, along with the variable number of branches in each rule and the variable number of rules in each set, makes the size of the search space difficult to estimate. It is certainly not small.

Using these parameters, the choice of which experts were to undergo each procedure was made via the weighted roulette wheel. In all cases at least one copy of the expert with the best value was retained.

RESULTS

The GENES model, at least on the small scale we tested, converged in approximately 2000 iterations (population sizes of 36 and 50 were tried) to results which appear reasonably optimal, in that the correct decisions were made for all nine problems presented, and at a low cost. This required no excessive or duplicative testing. Due to the very large possible variation of rules, there did, however, seem to be some strange rules that remained in the best experts' rule sets.
These rules tended not to make a great deal of sense, but they did not actually affect the system, as their very weirdness assured that they either always or never fired. Thus the rule interpreted as "if the patient's temperature is less than 120.3 degrees, discover if abdominal pain is present" is practically equivalent to stating "always check for abdominal pain". Additionally, it should be noted that all rules following one that resulted in a decision being made were never evaluated, so they did not actually contribute to the rule set, although they did provide genetic material to succeeding generations and might be activated during inversion.

FURTHER DIRECTIONS

This system is currently being examined, with promising results, in two areas of expertise. The first area that we are investigating is the minimization of hospital costs during patient diagnostic evaluations in a prospective reimbursement environment. The simple environment we are studying is a rule set appropriate to a patient admitted with suspected gall bladder disease. The expert in this system should be able to arrive at a correct diagnosis using minimal hospital resources in terms of tests and time. It should be noted that this system does not involve interpretation of tests, but only suggests the order of testing as a function of previous results. The second system under study involves server activation and deactivation in queueing problems. Costs here include both activation and use of the server, with the goal being minimization of costs while providing adequate service to the queue under varying server loads. A back-end interpreter to translate the rule sets into terms understandable by semi-expert users is also under consideration.

ACKNOWLEDGMENT

The authors wish to acknowledge the assistance of Neal Coulter for his many helpful suggestions.

BIBLIOGRAPHY

1. John H. Holland, Outline for a Logical Theory of Adaptive Systems, in Essays on Cellular Automata, Arthur W. Burks, ed., University of Illinois Press (1970).

2. John H. Holland, Adaptation in Natural and Artificial Systems, Univ. of Michigan Press, Ann Arbor (1975).

3. David E. Goldberg and Robert Lingle, Jr., Alleles, Loci, and the Traveling Salesman Problem, Proceedings of an International Conference on Genetic Algorithms and Their Applications, John J. Grefenstette, ed. (1985), pp. 154-159.

4. John Grefenstette et al., Genetic Algorithms for the Traveling Salesman Problem, Proceedings of an International Conference on Genetic Algorithms and Their Applications, John J. Grefenstette, ed. (1985), pp. 160-165.

5. J. D. Schaffer and John Grefenstette, Multi-Objective Learning via Genetic Algorithms, International Joint Conference on Artificial Intelligence (9th, 1985), pp. 592-595.

6. S. F. Smith, Flexible Learning of Problem Solving Heuristics Through Adaptive Search, International Joint Conference on Artificial Intelligence (8th, 1983), pp. 422-425.

7. A. J. Tomas and A. Th. Schreiber, HERMAN: Computer Aided Medical Decision Making, in Artificial Intelligence in Medicine, I. De Lotto and M. Stefanelli, eds., North-Holland (1985), pp. 1-9.

8. K. A. De Jong, Genetic Algorithms: A 10 Year Perspective, Proceedings of an International Conference on Genetic Algorithms and Their Applications, John J. Grefenstette, ed. (1985), pp. 169-177.

9. J. E. Baker, Adaptive Selection Methods for Genetic Algorithms, Proceedings of an International Conference on Genetic Algorithms and Their Applications, John J. Grefenstette, ed. (1985), pp. 101-111.

10. John Grefenstette, Optimization of Control Parameters for Genetic Algorithms, IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-16, No. 1 (Jan/Feb 1986), pp. 122-128.
GENETIC ALGORITHMS AND CLASSIFIER SYSTEMS: FOUNDATIONS AND FUTURE DIRECTIONS

John H. Holland
The University of Michigan

Abstract

Theoretical questions about classifier systems, with rare exceptions, apply equally to other adaptive nonlinear networks (ANNs), such as the connectionist models of cognitive psychology, the immune system, economic systems, ecologies, and genetic systems. This paper discusses pervasive properties of ANNs and the kinds of mathematics relevant to questions about these properties. It discusses relevant functional extensions of the basic classifier system and extensions of the extant mathematical theory. An appendix briefly reviews some of the key theorems about classifier systems.

Classifier systems are examples of a broad class of systems sometimes called adaptive nonlinear networks (ANNs, hereafter). Broadly characterized, classifier systems, and ANNs in general, consist of a large number of units that (1) interact in a nonlinear, competitive fashion, and (2) are modified by various operators so that the system as a whole progressively adapts to its environment. Typically an ANN confronts an environment that exhibits perpetual novelty, and it can function (or continue to exist) only by making continued adaptations to that environment. Because ANN/environment interactions are complex, except in artificially constrained cases, an ANN usually operates far from equilibrium. ANNs form the core of areas of study as diverse as cognitive psychology, artificial intelligence, economics, immunogenesis, genetics, and ecology.

Foundations

Classifier systems are quite typical ANNs, so that questions about classifiers, suitably translated, are typically questions about ANNs, and vice versa. To carry out the translation it is necessary to identify, in other ANNs, the counterparts of the message-processing rules, called classifiers, that are the units of classifier systems.
For example: in genetics, the counterparts are chromosomes; in game theory and economics, they are (rule-defined) strategies; in immunogenesis, antigens; in connectionist versions of cognition, (formally defined) neurons; and so on. Under such translation it is relatively easy to identify a range of important theoretical questions that apply to classifier systems in particular and ANNs in general. [These questions, and some of the ensuing discussions, are presented assuming that the reader has some familiarity with Holland et al. [1986] or Holland [1975]. There is simply not enough room here to define the terms; a reader familiar with the literature concerning some other ANN should be able to make the relevant translation in most cases.]

(1) What parameters and operators favor the emergence of stable hierarchical covers such as default hierarchies, internal models, and the like (via an increased diversity of units and progressively more complicated interactions between them)?

(2) Are the familiar "ecological" interactions - parasitism, symbiosis, competitive exclusion, etc. - a common feature of all parallel, nonlinear competitive systems?

(3) Are multi-functional units (units that can serve in several contexts) the major stepping-stone employed by all ANNs in making adaptive advances?

(4) What environmental conditions favor recombination, imprinting, triggering, and other constrained or biased procedures for generating new trials (rules, chromosomes, organizational structures, etc.)?

(5) What environmental conditions favor tracking vs. averaging, exploration vs. exploitation, etc.?

(6) What combinations of operators yield implicit parallelism?

Traditional mathematics, with its reliance upon linearity, convergence, fixed points, and the like, seems to offer few tools for studying such questions.
Yet, without a relevant mathematical framework, there is less chance of understanding ANNs than there would be of understanding physical phenomena in the absence of guidance from theoretical physics. A mathematics that puts emphasis on combinatorics and competition between parallel processes is the key to understanding ANNs. What seems startling when one uses differential equations, where the emphasis is on continuity, is commonplace in a programming or recursive format, where the emphasis is upon combinatorics. (Consider, for example, the chaotic regimes that are so unexpected in the context of differential equations, but are an everyday occurrence, in the guise of biased random number generators, in the programming context.) Because classifier systems are formally defined and computer-oriented, with an emphasis on combination and competition, they offer a useful test-bed for both mathematical and simulation studies of ANNs. We already have some theorems that provide a deeper understanding of the behavior of classifier systems (see the Appendix), and simulations suggest a broader class of theorems that delineate the conditions under which internal models (q-morphisms) emerge in response to complex environments (Holland [1986b]).

By putting classifier systems in a broader context, we can bring to bear relevant pieces of mathematics from other studies. For instance, in mathematical economics there are pieces of mathematics that deal with (1) hierarchical organization, (2) retained earnings (fitness) as a measure of past performance, (3) competition based on retained earnings, (4) distribution of earnings on the basis of local interactions of consumers and suppliers, (5) taxation as a control on efficiency, and (6) division of effort between production and research (exploitation versus exploration).
Many of these fragments, mutatis mutandis, can be used to study the counterparts of these processes in other ANNs. As another example, in mathematical ecology there are pieces of mathematics dealing with (1) niche exploitation (models exploiting environmental opportunities), (2) phylogenetic hierarchies, polymorphism, and enforced diversity (competing subsystems), (3) functional convergence (similarities of subsystem organization enforced by environmental requirements on payoff attainment), (4) symbiosis, parasitism, and mimicry (couplings and interactions in a default hierarchy, such as an increased efficiency for extant generalists simply because related specialists exclude them from some regions in which they are inefficient), (5) food chains, predator-prey relations, and other energy transfers (apportionment of energy or payoff amongst component subsystems), (6) recombination of multifunctional co-adapted sets of genes (recombination of building blocks), (7) assortative mating (biased or triggered recombination), (8) phenotypic markers affecting interspecies and intraspecies interactions (coupling), (9) "founder" effects (generalists giving rise to specialists), and (10) other detailed commonalities such as tracking versus averaging over environmental changes (compensation for environmental variability), allelochemicals (cross-inhibition), linkage (association and encoding of features), and still others. Once again, though mathematical ecology is a young science, there is much in the mathematics that has been developed that is relevant to the study of other nonlinear systems far from equilibrium. The task of theory is to explain
the pervasiveness of these features by elucidating the general mechanisms that assure their emergence and evolution. Properly applied to classifier systems, or to ANNs in general, such a theory militates against ad hoc solutions, assuring robustness and adaptability for the resulting organization. One of the best ways to insure that the mechanisms investigated are general is "to look over your shoulder" frequently, to see if the mechanisms apply to all ANNs. This view is sharpened if we pay close attention to features shared by all ANNs:

(1) Hierarchical organization. All ANNs exhibit a hierarchical organization. In living systems, proteins combine to form organelles, which combine to form cell types, and so on, through organs, organisms, species, and ultimately ecologies. Economies involve individuals, departments, divisions, companies, economic sectors, and so on, until one reaches national, regional, and world economies. A similar story can be told for each of the areas cited. These structural similarities are more than superficial. A closer look shows that the hierarchies are constructed on a "building block" principle: subsystems at each level of the hierarchy are constructed by combination of small numbers of subsystems from the next lower level. Because even a small number of building blocks can be combined in a great variety of ways, there is a great space of subsystems to be tried, but the search is biased by the building blocks selected. At each level, there is a continued search for subsystems that will serve as suitable building blocks at the next level.

(2) Competition. A still closer look shows that in all cases the search for building blocks is carried out by competition in a population of candidates.
Moreover, there is a strong relation between the level in the hierarchy and the amount of time it takes for competitions to be resolved - ecologies work on a much longer time-scale than proteins, and world economies change much more slowly than the departments in a company. More carefully, if we associate random variables with subsystem ratings (say fitnesses), then the sampling rate decreases as the level of the subsystem increases. As we will see, this has profound effects upon the way in which the system moves through the space of possibilities.

(3) Game-like system/environment interaction. An ANN interacts with its environment in a game-like way: sequences of action ("moves") occasionally produce payoff - special inputs that provide the system with the wherewithal for continued existence and adaptation. Usually payoff can be treated as a simple quantity (energy in physics, fitness in genetics, money in economics, winnings in game theory, reward in psychology, error in control theory, etc.). It is typical that payoff is sparsely distributed in the environment and that the adaptive system must compete for it with other systems in the environment.

(4) Exploitation of regularities. The environment typically exhibits a range of regularities or niches that can be exploited by different action sequences or strategies. As a result the environment supports a variety of processes that interact in complex ways, much as in a multi-person game. Usually there is no super-process that can outcompete all others, so an ecology results (domains in physics, interacting species in ecological genetics, companies in economics, cell assemblies in neurophysiological psychology, etc.). The very complexity of these interactions assures that even large systems over long time spans can have explored only a minuscule range of possibilities.
Even for much-studied board games such as chess and go this is true; the not-so-simply-defined "games" of ecological genetics, economic competition, immunogenesis, CNS activity, etc., are orders of magnitude more complex. As a consequence, the systems are always far from any optimum or equilibrium situation.

(5) Exploration vs. exploitation. There is a tradeoff between exploration and exploitation. In order to explore a new niche a system must use new and untried action sequences that take it into new parts (state sets) of the environment. This can only occur at the cost of departing from action sequences that have well-established payoff rates. The ratio of exploration to exploitation, in relation to the opportunities (niches) offered by the environment, has much to do with the life history of a system.

(6) Tracking vs. averaging. There is also a tradeoff between "tracking" and "averaging". Some parts of the environment change so rapidly relative to a given subsystem's response rate that the subsystem can only react to the average effect; in other situations the subsystem can actually change fast enough to respond "move by move". Again, the relative proportion of these two possibilities in the niches the subsystem inhabits has much to do with the subsystem's life history.

(7) Nonlinearity. The value ("fitness") of a given combination of building blocks often cannot be predicted by summing up the values assigned to the component blocks. This nonlinearity (commonly called epistasis in genetics) leads to co-adapted sets of blocks (alleles) that serve to bias sampling and add additional layers to the hierarchy.

(8) Coupling. At all levels, the competitive interactions give rise to counterparts of the familiar interactions of population biology - symbiosis, parasitism, competitive exclusion, and the like.

(9) Generalists and specialists.
Subsystems can often be usefully divided into generalists (averaging over a wide variety of situations, with a consequent high sampling rate and high statistical confidence, at the cost of a relatively high error rate in individual situations) and specialists (reacting to a restricted class of situations with a lowered error rate, bought at the cost of a low sampling rate).

(10) Multifunctionality. Subsystems often exhibit multifunctionality in the sense that a given combination of building blocks can usefully exploit quite distinct niches (environmental regularities), typically with different efficiencies. Subsequent recombinations can produce specializations that emphasize one function, usually at the cost of the other. Extensive changes in behavior and efficiency, together with extensive adaptive radiation, can result from recombinations involving these multifunctional founders.

(11) Internal models. ANNs usually generate implicit internal models of their environments, models progressively revised and improved as the system accumulates experience. The systems learn.
Consider the progressive improvements of the immune system when faced with antigens, and the fact that one can infer much about the system's environment and history by looking at the antibody population. This ability to infer something of a system's environment and history from its changing internal organization is the diagnostic feature of an implicit internal model. The models encountered are usually prescriptive - they specify preferred responses to given environmental states - but, for more complex systems (the CNS, for example), they may also be more broadly predictive, specifying the results of alternative courses of action. The relevant mathematical concept of a model of process-like transformations is that of a homomorphism. Real systems almost never admit of models meeting the requirements for a homomorphism ("commutativity of the diagram"), but there are weakenings, the so-called q-morphisms (quasi-homomorphisms). The origin of a hierarchy can be looked upon as a sequence of progressively refined q-morphisms (specifically, q-morphisms of Markov processes) based upon observation.

Functional Extensions

The foregoing questions and commonalities, together with some of the problems already encountered in simulations, have already suggested extensions of the standard definitions (as in Holland [1980]) of classifier systems. One important change involves the way bids are used in determining the winners of competitions for activation. The standard way of doing this is to calculate a bid = [bid ratio]*[strength]. Under this arrangement, the local fixed points of classifiers are such that a generalist and a specialist active in the same situations will come to bid the same amount (because the strength of the generalist increases to the point of compensating for its smaller bid ratio; see the Appendix). This goes against the dictum that specialists should be favored in a competition with generalists.
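The fixed-point phenomenon just described can be made concrete with a toy calculation (the bid formula is the one given in the text; the numbers are illustrative, not simulation output):

```python
def bid(bid_ratio, strength):
    # standard competition bid: bid = [bid ratio] * [strength]
    return bid_ratio * strength

# At their local fixed points, a generalist's larger strength
# compensates for its smaller bid ratio, so it comes to bid the
# same amount as a specialist active in the same situations.
generalist_bid = bid(bid_ratio=0.25, strength=4.0)
specialist_bid = bid(bid_ratio=1.0, strength=1.0)
assert generalist_bid == specialist_bid == 1.0
```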
To compensate for this, an effective bid is calculated by reducing the bid in proportion to the generality of the classifier producing the bid. The effective bid is then used in determining the probability that the classifier generating it is one of the winners of the competition. If the classifier wins, it must pay the bid, not the effective bid, to its suppliers under the bucket brigade. Thus, the local fixed points are not changed, but specialists are favored in competition with generalists. This change goes a long way toward reducing instabilities in emergent default hierarchies. (We are still exploring the effects in simulations and, at the level of theory, the resulting modifications in global fixed points.)

A related change concerns the method of determining a classifier's probability of producing offspring, its fitness, under the genetic algorithm. The higher strength of a generalist at its local fixed point greatly favors it in the production of offspring, and simulations indicate that this overbiases the evolution of the system toward the offspring of generalists. The simplest way of compensating for this is to make the fitness proportional to [bid ratio]*[strength] rather than strength alone. In intuitive terms, this makes the fitness proportional to the classifier's potential for affecting the system (its bid can be thought of as a "phenotypic" effect), rather than its reserves (strength is a quantity determined by its "genotypic" fixed point). We have yet to carry out an organized set of simulations based on fitness so determined.

Simulations have also revealed two other effects worth systematic investigation. The first of these is the "focussing" effect of the size of the message list (see R. Riolo's paper in this Proceedings). In effect, a small message list forces the system to concentrate on a few factors in the current situation. Clearly there is the possibility of making the size of the message list depend upon the "urgency" of the situation. For example, during
"lookahead" the message list's size can be quite large to encourage an exploration of possibilities, while at "execution" time the size can be reduced to enforce a decision. Clearly, the system can use classifiers to control the size of the message list. This makes the size dependent upon the system's "reading" of the current situation, and the "reading" is subject to long-term adaptive change under the genetic algorithm.

A second simple effect is to revise the definition of the environment, or equivalently the definition of the system's speed, so that typical stimuli persist for several time-steps. (This corresponds to the fact that the CNS operates rapidly relative to typical changes in its environment - usually, milliseconds vs. tenths of a second.) The resulting "persistence" and "overlap" of input messages makes it much easier for the classifier system to develop causal models and associative links (see below). As yet, to my knowledge, no simulations have been built along these lines.

At a much more general (and speculative) level, the use of triggered genetic operators provides a major extension of genetic algorithms. Triggering amounts to invoking genetic operators, with selected arguments, when certain predefined conditions are satisfied.
As an example of a triggering condition consider the following: "Only general classifiers that produce weak bids are activated by the current input message." When this condition occurs it is a sign that the system has little specific information for dealing with the current environmental situation. Let this condition trigger a cross between the input message and the condition parts of some of the active general rules. The result will be plausible new rules with more specific conditions. This amounts to a bottom-up procedure for producing candidate rules that will automatically be tested for usefulness when similar situations recur.

As another example of a triggering condition consider: "Rule C has just made a large profit under the bucket brigade." Satisfaction of this condition signals a propitious time to couple the profitable classifier to its stage-setting precursor. An appropriate cross between the message part of a rule C1, active on the immediately preceding time-step - the precursor - and the condition part of the profit-making successor can produce a new pair of coupled rules. (The trigger is not activated if C1 is already coupled to C.) The coupled offspring pair models the state transition mediated by the original pair of (uncoupled) rules. Such coupled rules can serve as the building blocks for models of the environment. Because the couplings serve as "bridges" for the bucket brigade, these building blocks will be assigned credit in accord with the efficacy of the models constructed from them.

Interestingly enough, there seems to be a rather small number of robust triggering conditions (see Holland et al. [1986]), but each of them would appear to add substantially to the responsiveness of the classifier system.

Tags are particularly affected by triggering conditions that provide new couplings.
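The first triggered operator above - crossing the input message into the condition part of a weakly-bidding general rule - can be sketched over the usual {0, 1, #} condition alphabet (a sketch only: the 0.5 specialization probability and the per-position crossing scheme are our assumptions, not Holland's specification):

```python
import random

def specialize(condition, message, rng=random):
    """Triggered-operator sketch: cross an overly general condition
    (a string over {0,1,#}) with the current input message,
    replacing some #'s with the message's bits.  Since the general
    rule was activated by the message, the offspring condition
    still matches that message, but matches fewer others."""
    out = []
    for c, m in zip(condition, message):
        if c == "#" and rng.random() < 0.5:
            out.append(m)          # copy the bit from the message
        else:
            out.append(c)          # keep the original position
    return "".join(out)

new_cond = specialize("1###0#", "101100", random.Random(2))
assert len(new_cond) == 6
```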
Tags serve as the glue of larger systems, providing both associative and temporal (model-building) pointers. Under certain kinds of triggered coupling the message sent by the precursor in the coupled pair can have a “hash-coded” section (say a prefix or suffix). The purpose of this hash-coded tag is to prevent accidental eavesdropping by other classifiers -- a sufficient number of randomly generated bits in the tag will prevent accidental matches with other conditions (unless the tag region in the condition part of the potential eavesdropper consists mostly of #'s). If the coupled pair proves useful to the system then it will have further offspring under the genetic algorithm, and these offspring often will be coupled to other rules in the system. Typically, the tag will be passed on to the offspring, serving as a common element in all the couplings. The tag will only persist if the resulting cluster of rules proves to be a useful “subroutine”. In this case, the “subroutine” can be “called” by messages that incorporate the tag, because the conditions of the rules in the cluster are satisfied by such messages. In short, the tag that was initially determined at random now “names” the developing subroutine. It even has a meaning in terms of the actions it calls forth. Moreover, the tag is subject to the same kinds of recombination as other parts of the rules (it is, after all, a schema). As such it can serve as a building block for other tags.
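The anti-eavesdropping argument is easy to check numerically: a random condition bit agrees with a random tag bit only half the time, so a tag of k defined bits slips past an unrelated condition with probability about (1/2) raised to the number of defined positions. A small Monte Carlo sketch (names and parameters are illustrative, not from the paper):

```python
import random

def matches(condition, message):
    """Classifier-system matching: '#' is a wildcard, '0'/'1' must agree."""
    return all(c == '#' or c == m for c, m in zip(condition, message))

def accidental_match_rate(tag_bits, hash_density, trials=10_000, seed=1):
    """Estimate how often a random tag slips past an unrelated condition whose
    tag region contains a given fraction of #'s (its 'hash_density')."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        tag = ''.join(rng.choice('01') for _ in range(tag_bits))
        cond = ''.join('#' if rng.random() < hash_density else rng.choice('01')
                       for _ in range(tag_bits))
        hits += matches(cond, tag)
    return hits / trials
```

With an 8-bit tag and few #'s the accidental-match rate is already well below one percent, while a condition whose tag region is all #'s eavesdrops on everything, as the text warns.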
It is as if the system were inventing symbols for its internal use. Clearly, any simulation that provides for a test of these ideas will be an order of magnitude more sophisticated than anything we have tried to date. Runs involving hundreds of thousands of time-steps and thousands of classifiers will probably be required to test these ideas. Support is another technique that adds considerably to the system's flexibility. Basically, support is a technique that enables the classifier system to integrate many pieces of partial information (such as several views of a partially obscured object) to arrive at strong conclusions. Support is a quantity that travels with messages, rather than being a counterflow as in the case of bids. When a classifier is satisfied by several messages from the message list, each such message adds its support into that classifier's support counter. Unlike a classifier's strength, the support accrued by a classifier lasts for only the time-step in which it is accumulated. That is, the support counter is reset at the end of each time-step (other techniques are possible, such as a long or short half-life). Support is used to modify the size of the classifier's bid on that time-step; large support increases the bid, small support decreases it. If the classifier wins the bidding competition, the message it posts carries a support proportional to the size of its bid. The propagation of support over sets of coupled classifiers acts somewhat like spreading activation (see Anderson [1983]), but it is much more directed. It can bring associations (coupled rules) into play while serving its primary mission of integrating partial information (messages from several weakly-bidding, general rules that satisfy the same classifier). In addition to these broadly conceived extensions, there are more special extensions that may have global consequences, particularly in respect to increased responsiveness and robustness.
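The support bookkeeping described above can be sketched in a few lines. This is a minimal illustration under assumed names (`one_time_step`, the `support_weight` scaling, the dictionary layout of a classifier); the per-step reset falls out of recomputing support from scratch each call:

```python
def matches(condition, message):
    """'#' is a wildcard; '0'/'1' must agree positionwise."""
    return all(c == '#' or c == m for c, m in zip(condition, message))

def one_time_step(classifiers, messages, support_weight=0.1):
    """One bidding step with support: each message carries a support value;
    every satisfying message adds its support into the classifier's (per-step)
    support counter, which then scales the bid. Returns the winning classifier
    and its bid, or None if no classifier is satisfied."""
    best = None
    for cls in classifiers:
        carried = [s for msg, s in messages if matches(cls['condition'], msg)]
        if not carried:                    # condition not satisfied this step
            continue
        bid = cls['strength'] * (1.0 + support_weight * sum(carried))
        if best is None or bid > best[1]:
            best = (cls, bid)
    return best
```

Note the directedness the text emphasizes: support only reaches classifiers whose conditions the supporting messages actually satisfy, unlike undirected spreading activation.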
One of these concerns a simple redefinition of classifiers. The standard definition of a 2-condition classifier requires that each condition be satisfied by some message on the message list; in effect an AND, requiring a message of type X and a message of type Y. It is a simple thing to replace the implicit AND with other string operators, e.g. a bit-by-bit AND or a binary sum of the satisfying messages, which is then passed through as the outgoing message. This extension has been implemented, but has not been systematically tested. Other simple extensions impact the functioning of the genetic algorithm. It is easy to introduce, in the string defining a classifier, punctuation marks that bias the probability of crossover (say crossover is twice as likely to take place adjacent to a punctuation mark). These punctuation marks are not interpreted in executing the classifier, but they bias the form of its offspring under the genetic algorithm. Punctuation marks can be treated as alleles under the genetic algorithm, subject to mutation, crossover, etc., just as the other (function-defining) alleles. This ensures that the placement of punctuation marks is adaptively determined.
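Punctuation-biased crossover amounts to a weighted choice of cut point. A minimal sketch, assuming the stated factor of two and a boolean punctuation vector (the function names are hypothetical):

```python
import random

def biased_crossover_point(punctuation, bias=2.0, rng=None):
    """Choose a crossover point (a gap between adjacent loci) where gaps
    adjacent to a punctuation mark are `bias` times as likely to be chosen.
    `punctuation[i]` is True if locus i carries a punctuation allele."""
    rng = rng or random.Random()
    n = len(punctuation)
    # gap i lies between locus i and locus i + 1
    weights = [bias if (punctuation[i] or punctuation[i + 1]) else 1.0
               for i in range(n - 1)]
    return rng.choices(range(n - 1), weights=weights, k=1)[0]

def crossover(a, b, punctuation, rng=None):
    """One-point crossover of two equal-length strings with a biased cut point."""
    cut = biased_crossover_point(punctuation, rng=rng) + 1
    return a[:cut] + b[cut:], b[:cut] + a[cut:]
```

To make placement adaptive, as the text suggests, the punctuation vector would simply be carried on the string itself and inherited (and mutated) like any other allele.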
Similarly one can introduce mating tags that restrict crossover to classifiers with similar tags; again the tags, as part of the classifier, can be made subject to modification and selection by the genetic algorithm. Finally, there are two broad ranges of investigation, far beyond anything we yet understand either theoretically or empirically, that offer intriguing possibilities for the future. One of these stems from the fact that classifier systems are general-purpose. They can be programmed initially to implement whatever expert knowledge is available to the designer; learning then allows the system to expand, correct errors, and transfer information from one domain to another. It is important to provide ways of instructing such systems so that they can generate rules -- tentative hypotheses -- on the basis of advice. It is also important that we understand how lookahead and virtual explorations can be incorporated without disturbing other activities of the system. Little has been done in either direction. The other realm of investigation concerns fully-directed rule generation. In a precursor of classifier systems, the broadcast language (Holland [1975]), provision was made for the generation of rules by other rules. With minor changes to the definition of classifier systems, this possibility can be reintroduced. (Both messages and rules are strings. By enlarging the message alphabet, lengthening the message string, and introducing a special symbol that indicates whether a string is to be interpreted as a rule or a message, the task can be accomplished.)
With this provision the system can invent its own candidate operators and rules of inference. Survival of these meta- (operator-like) rules should then be made to depend on the net usefulness of the rules they generate (much as a schema takes its value from the average value of its carriers). It is probably a matter of a decade or two before we can do anything useful in this area.

Mathematical Extensions

There are at least two broader mathematical tasks that should be undertaken. One is an attempt to produce a general characterization of systems that exhibit implicit parallelism. Up to now all such attempts have led to sets of algorithms that are easily recast as genetic algorithms -- in effect, we still only know of one example of an algorithm that exhibits implicit parallelism. The second task involves developing a mathematical formulation of the process whereby a system develops a useful internal model of an environment exhibiting perpetual novelty. In our (preliminary) experiments to date, these models typically exhibit a (tangled) hierarchical structure with associative couplings. As mentioned earlier, such structures can be characterized mathematically as quasi-homomorphisms (see Holland et al. [1986]). The perpetual novelty of the environment can be characterized by a Markov process in which each state has a recurrence time that is large relative to any feasible observation time. Considerable progress can be made along these lines (see Holland [1986b]), but much remains to be done. In particular, we need to construct an interlocking set of theorems based on (1) a more global set of fixed point theorems that relates the strengths of classifiers under the bucket brigade to observed payoff statistics, (2) a set of theorems that relates building blocks exploited by the “slow” dynamics of the genetic algorithm to the sampling rates for rules at different levels of the emerging default hierarchy (more general rules are tested more often), and (3) a set of
theorems (based on the previous two sets) that detail the way in which various kinds of environmental regularities are exploited by the genetic algorithm acting in terms of the strengths assigned by the bucket brigade.

Appendix

A simplified version of the fundamental theorem for genetic algorithms can be stated as follows (for an explanation of terms, see Holland [1975] or Holland [1986a]).

Theorem (Implicit parallelism). Given a fitness function u: {0,1}^k -> Reals, a population B(t) of M strings drawn from the set {0,1}^k, and any schema s in {0,1,*}^k defining a hyperplane in {0,1}^k,

    M_s(t+1) >= u_s(t) (1 - e_s) M_s(t),

where M_s(t+1) is the expected number of instances of s in B(t+1),

    u_s(t) = [sum over x in s intersect B(t) of u(x)] / M_s(t)

is the average observed fitness of the instances of schema s in B(t), and

    e_s = (k_s - 1) P_cross / (k - 1)

is a “copying error” induced by crossover, where P_cross is a constant of the genetic algorithm (often P_cross = 1) giving the proportion of strings undergoing crossover in a given generation, and k_s - 1 is the number of crossover points between the outermost defining symbols of s. Under interpretation, the implicit parallelism theorem says that the sampling rate for every schema with instances in the population is expected to increase or decrease at a rate specified by its observed average fitness, with an error proportional to its defining length.

Theorem (Speedup). The number of schemas processed with an error < e under a genetic algorithm considerably exceeds M^3 for a population of size M = 2^k' where e = k'/k.

Theorem (Bucket brigade local fixed-point). If, under the bucket brigade algorithm, I_c is the long-term average income (after taxes) of a classifier C and r_c
is its bid-ratio, then its strength S_c will approach I_c / r_c.

Theorem (Q-morphism parsimony; for definitions, see Holland et al. [1986]). A q-morphism of n levels, in which each successive level uses k or fewer additional variables to define exceptions to the previous level, and in which the rules at each level are correct over at least a proportion p of the instances satisfying them, requires no more than

    sum from j = 1 to n of 2^(jk) (1 - p)^(j-1)

rules. (A homomorphism defined on nk variables requires 2^(nk) rules.) For n = 10, k = 2, p = 0.5, the q-morphism requires fewer than 2^12 rules, while a corresponding homomorphism would require 2^20 rules; that is, the homomorphism would require at least 256 times as many rules as the q-morphism.

References

Anderson, J. R. [1983]. The Architecture of Cognition. Cambridge, Massachusetts: Harvard University Press.
Holland, J. H. [1975]. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.
Holland, J. H. [1980]. “Adaptive algorithms for discovering and using general patterns in growing knowledge-bases.” International J. Policy Analysis and Information Systems, 4(2): 217-240.
Holland, J. H. [1986a]. “Escaping brittleness: The possibilities of general purpose algorithms applied to parallel rule-based systems.” Ch. 20 in Machine Learning II (eds. Michalski, R. S., et al.). Los Altos: Morgan Kaufmann.
Holland, J. H. [1986b]. “A mathematical framework for studying learning in classifier systems.” Physica D, 22: 307-317.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. [1986]. Induction: Processes of Inference, Learning, and Discovery. Cambridge, Massachusetts: MIT Press.

Acknowledgments

Much of the research reported here has been supported by the National Science Foundation, currently under grant IRI 8610225. The author also wants to thank the Los Alamos National Laboratory for a year (1986-87) as Ulam Scholar at the Center for Nonlinear Studies, a year to pursue his chosen research objectives without hindrances.

GREEDY GENETICS

G. E.
Liepins & M. R. Hilliard
Oak Ridge National Laboratory, Oak Ridge, TN 37831

Mark Palmer & Michael Morrow
University of Tennessee, Knoxville, TN 37996

ABSTRACT

The performance of conventional and PMX genetic algorithms as problem optimizers is compared with greedy algorithms coupled with genetic algorithms. Set covering, traveling salesman, and job shop scheduling problems form the basis for the comparison. Conventional and PMX genetic algorithms are suggested to be robust, although not particularly distinguished, optimizers for the class of problems studied in this paper. Greedy genetics are recommended only when the underlying greedy algorithm is powerful, but not optimal; it remains difficult to characterize when greedy genetics will outperform pure greedy algorithms.

INTRODUCTION

The goal of this research is to investigate whether the population diversity of genetic algorithms can be favorably used to ameliorate the myopic tendencies of the greedy algorithm; or in short, is a synergistic combination possible? (Greedy algorithms are best-first search algorithms with no backtracking -- see Lawler, 1976 and Nilsson, 1980.) This paper contrasts the performance of greedy genetics to conventional and PMX genetic algorithms. A comparison with established operations research algorithms and techniques such as simulated annealing awaits further experiments. A criticism of conventional genetic algorithms as optimizers is that they fail to incorporate problem structure in their formulation. Their only tie to the specific function being optimized is through the reward structure. The incorporation of the greedy algorithm allows problem specific information to be used in the crossover operation. Three classes of problems are studied: set covering, traveling salesman, and job shop scheduling.
Not unexpectedly, results suggest that when the associated greedy algorithm is powerful (that is, a single application of the algorithm produces a generally "reasonable" solution), greedy genetics outperform conventional and PMX genetics. Conversely, when the greedy algorithm is weak, greedy genetics perform worse than conventional genetics. No definitive comparison can be drawn regarding the comparative performance of the pure greedy algorithm and greedy genetics. The lesson seems to be that the conventional and PMX genetic algorithms are robust; use greedy genetics only when there is good reason to believe that the underlying greedy is powerful but not optimal.

Conventional Genetics

Genetic algorithms were originally developed by Holland (1975). Follow-on research of their suitability as function optimizers has been completed by DeJong (1975) and Bethke (1980). The basic regimen contains five major features: (1) representation of the solution space as a binary string, (2) a critic (function evaluation), (3) reproduction, (4) crossover, and (5) mutation. Hidden behind the conceptual simplicity of the genetic algorithm are a variety of parameters and policies such as crossover rate, mutation rate, and replacement policy -- all of whose settings affect performance (DeJong 1975; Liepins and Hilliard, 1985). With the parameters properly set, conventional genetic algorithms are robust and powerful when applied to multimodal real valued optimization problems (DeJong 1975; Bethke 1980). However, they are sorely inadequate for combinatorial optimization problems whose solutions require manipulation of permutations (problems such as the traveling salesman and job shop scheduling). Goldberg and Lingle (1985) modified the basic genetic algorithm to handle permutations by the introduction of the partially-mapped crossover (PMX).
The example Goldberg and Lingle use to illustrate PMX considers two permutations of ten objects:

    A = 9 8 4 | 5 6 7 | 1 3 2 10
    B = 8 7 1 | 2 3 10 | 9 5 4 6

If two random numbers, say 4 and 6, are chosen as the crossover points for these two genes, then each is to be modified by the permutations (5 2), (6 3), and (7 10), and the respective results become:

    A' = 9 8 4 | 2 3 10 | 1 6 5 7
    B' = 8 10 1 | 5 6 7 | 9 2 4 3

Goldberg and Lingle investigated survivability of o-schemata under PMX and then applied PMX to Karg and Thompson's ten city traveling salesman problem.

Greedy Genetics

Although they did not call it such, Grefenstette et al. (1985) developed a greedy crossover for the traveling salesman problem:
1. For each pair of parents, pick a random city for the start.
2. Compare the two edges leaving the city (as represented in the two parents) and choose the shorter edge.
3. If the shorter parental edge would introduce a cycle into the partial tour, then extend the tour by a random edge.
4. Continue to extend the partial tour using steps two and three until the circuit is completed.

Grefenstette et al. applied their heuristic to a 50 city and a 100 city problem. Unfortunately, the results of Grefenstette et al. were not directly comparable to those of Goldberg and Lingle. It might be expected that an analysis of schemata survivability (Goldberg, 1986) or a hill-climbing variant (Ackley, 1987) might be investigated for the greedy crossover. A little reflection suggests that the former is significantly more difficult than for PMX or traditional crossover operators. Alternating hill-climbing with genetic algorithms has produced successful results when interim populations suggested by the genetic algorithm provide good initial points for hill-climbing. However, the greedy algorithm (as implemented here) is not a "hill-climber", nor does it benefit from different starting points.
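The PMX mechanics in the worked example above can be sketched compactly. This is an illustrative implementation (the function and argument names are not from the paper): the middle segments are exchanged, and conflicting values outside the segment are traced through the exchange mapping until the conflict resolves.

```python
def pmx(a, b, cut1, cut2):
    """Partially-mapped crossover on two permutations. The segment between the
    cut points is exchanged; positions outside the segment keep their old value
    unless it now duplicates a segment value, in which case the value is chased
    through the mapping induced by the exchange."""
    def child(p, q):
        mid = q[cut1:cut2]                      # segment copied from the mate
        mapping = dict(zip(mid, p[cut1:cut2]))  # mate's value -> displaced value
        out = list(p)
        out[cut1:cut2] = mid
        for i in list(range(cut1)) + list(range(cut2, len(p))):
            v = p[i]
            while v in mapping:                 # follow the mapping chain
                v = mapping[v]
            out[i] = v
        return out
    return child(a, b), child(b, a)
```

Applied to the example with the segment spanning positions 4 through 6, this reproduces A' and B' exactly, and always returns valid permutations.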
On the other hand, Markov chain analysis (Goldberg and Segrest 1987) might provide insight into greedy genetics; such studies are expected to be pursued in later papers.

The comparative analysis of conventional and greedy genetics was performed on three classes of problems: set covering (SCP), job shop scheduling (JSS), and traveling salesman (TSP). The genetic algorithm used was the Genesis system (Grefenstette 1986) modified to generate a single offspring for each crossover operation. This offspring replaced the most similar of its parents. (One offspring per crossover was mandated by the greedy genetics and was used in all crossover methods to insure comparability. Replacement of the most similar parent provides a simple yet efficient means of avoiding premature convergence -- a pervasive problem with greedy genetics.) Additional modifications were made to accommodate the integer coding underlying JSS and TSP. (No such modification was required for SCP.) PMX and greedy crossover operations were added to Genesis. A fixed population size of 75 was used for SCP and candidate solutions were represented as binary strings of length 25 (the SCP matrices were of dimension 50 x 25). A fixed population size of 50 was used for TSP and JSS for the majority of the experiments and the candidate solutions were represented as permutations of 15 objects (JSS and TSP problems involved 15 jobs and 15 cities, respectively). (Selected experiments were run with population sizes 100 and 250. The results differed little from those presented here.) Genesis parameters not explicitly discussed in the previous paragraphs were left at their default settings. For each of the problem classes, a sample of problems was generated and the problems were attacked with both conventional and greedy versions of the genetic algorithm. Comparative performance of the best solution and trial at which it was achieved, as well as on-line, off-line, and average performance were all investigated.
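The replace-most-similar-parent policy described above can be sketched in a few lines for the binary (SCP) encoding. A minimal illustration, assuming Hamming distance as the similarity measure (the paper does not specify one, and the names are hypothetical):

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def replace_most_similar_parent(population, p1_idx, p2_idx, child):
    """Steady-state replacement: the single crossover offspring overwrites
    whichever of its two parents it most resembles, which preserves the less
    similar parent and so helps delay premature convergence."""
    d1 = hamming(population[p1_idx], child)
    d2 = hamming(population[p2_idx], child)
    population[p1_idx if d1 <= d2 else p2_idx] = child
```

Because the surviving parent is the one that differs most from the child, each replacement removes less diversity from the population than replacing the worst member would.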
Golden and Stewart (1985) suggest a number of statistical tests to evaluate the relative performance of competing algorithms. They suggest the Wilcoxon signed rank test, the Friedman test, and an expected utility approach. Such tests are useful as confirmatory statistical analysis. However, the results of this paper are still considered preliminary and it is felt that little would be gained by additional statistical formalism.

Set-Covering

The set-covering problem (SCP) is NP-complete (Garey and Johnson 1979) and is often encountered in applications such as resource allocation (Revelle et al. 1970) and scheduling (Marsten and Shepardson, 1981). Known solution techniques for SCP include such methods as integer programming; heuristic branch and bound; and most recently, Lagrangian relaxation with subgradient variations -- see Balas and Ho (1980) for a review of set covering solution techniques. The usual representation of a set covering problem is as a zero/one matrix with a cost associated with each column. A set of columns is defined to cover the matrix if for each row of the matrix at least one of the columns of the set contains a '1' entry. The SCP objective is to find a minimal cost cover of the matrix as follows: Let A be an m x n binary matrix and w_i, i = 1, ..., n non-negative costs.

    Minimize (over d)  sum from i = 1 to n of w_i d_i
    subject to  A d >= 1,

where d = (d_1, ..., d_n)^T, 1 is an m-dimensional column vector of ones, and each d_i is binary.

The genetic algorithm representation of SCP is straightforward: a candidate SCP solution is expressed as a binary string where a '1' in position i indicates that the i-th column of the matrix is included in the solution. The reward is a large number M minus the sum of the cost of the columns used and a penalty P n for failure to cover:

    R = M - (sum from i = 1 to n of w_i d_i) - P n,

where the d_i and n are binary, with n = 0 if the solution is a cover and n = 1 otherwise, and P is an appropriate scaling for the penalty.
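The reward function above is straightforward to sketch. The constants M and P below are illustrative placeholders (the paper explicitly leaves the tuning of P open), and the function name is hypothetical:

```python
def scp_reward(A, w, delta, M=1000.0, P=100.0):
    """Reward R = M - sum_i w_i * delta_i - P * eta for the set-covering GA
    representation in the text. A is the m x n zero/one matrix (list of rows),
    w the column costs, delta the binary solution string; eta = 0 iff every
    row has a '1' in some selected column, and eta = 1 otherwise."""
    cost = sum(wi * di for wi, di in zip(w, delta))
    covered = all(any(row[i] for i in range(len(delta)) if delta[i])
                  for row in A)
    eta = 0 if covered else 1
    return M - cost - P * eta
```

Since Genesis-style systems maximize, the constant M simply keeps the reward positive; only the cost and penalty terms affect the ranking of solutions.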
(The choice of the penalty scaling factor P is known to affect the genetic algorithm's SCP performance, but will not be investigated in this paper.) Each of single crossover, double crossover, and greedy crossover was investigated. The latter was defined as follows. For every pair of parent genes:
1. Initialize the set S to be empty and the matrix A to be the original set covering matrix with columns {c_i} and associated costs {w_i}.
2. For the unused columns and uncovered rows calculate the cost-ratio (cost/number-of-rows-covered = w_i/number-of-rows-covered).
3. Append to S the column (say column c) with the least cost-ratio that is included in one of the parents.
4. Strike column c and the rows covered by c from A and let this new matrix be A'.
5. If S is a cover or no other columns are represented in the parents, stop. Otherwise, set A to A' and go to step 2.

The greedy SCP crossover is illustrated in Figure 1.

[Figure 1. Greedy crossover for SCP: two five-column parents (parent 1: 10000, parent 2: 01100); column 2, with cost-ratio 1, is selected first for the child, then column 1. Note: column 4 was not selected because it was not represented in either parent.]

The experimental design for the set-covering problem involved matrix density as an additional parameter. (Matrix density for a zero-one matrix is the proportion of ones in the matrix.) Five test problems were generated for each of the selected densities ranging from 10% to 90%. The results are displayed below: Table 1 presents the relative success of each of the methods on each of the classes of the problems and Table 2 presents a problem by problem breakdown of the performance.
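Steps 1 through 5 of the greedy SCP crossover can be sketched as follows. This is an illustrative reading under one assumption flagged in the comments: step 3's "least cost" is taken to mean the least cost-ratio computed in step 2, as the figure's cost-ratio annotations suggest. All names are hypothetical.

```python
def greedy_scp_crossover(A, w, parent1, parent2):
    """Greedy SCP crossover: repeatedly append the parent-represented column
    with the least cost-ratio (cost / number of still-uncovered rows it covers)
    until S is a cover or the parents offer no further useful columns.
    A is a list of 0/1 rows, w the column costs; returns the child as a 0/1
    list, which need not be a cover (the penalty term handles that case)."""
    n = len(w)
    allowed = {i for i in range(n) if parent1[i] or parent2[i]}
    uncovered = set(range(len(A)))
    S = set()
    while uncovered:
        best, best_ratio = None, None
        for c in sorted(allowed - S):
            gain = sum(1 for r in uncovered if A[r][c])
            if gain == 0:
                continue
            # assumption: step 3 selects by least cost-ratio, not raw cost
            ratio = w[c] / gain
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = c, ratio
        if best is None:          # no parent column covers anything new: stop
            break
        S.add(best)
        uncovered -= {r for r in uncovered if A[r][best]}
    return [1 if i in S else 0 for i in range(n)]
```

Restricting the candidate columns to those represented in either parent is what makes this a crossover rather than a plain greedy heuristic: the child inherits only parental material, reordered by problem-specific information.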
Generation average performance and best of generation performance are displayed for representative matrices of 30% and 70% density in Figures 2-5.

[Table 1. Relative SCP performance by problem density and crossover method: wins and ties for the one-point, two-point, and greedy crossover methods.]

The major conclusion that can be drawn from the SCP results is that the greedy genetics for the set-covering problem not only yield a better solution than either the pure greedy or conventional genetics, but that convergence to that solution is much more rapid than with conventional genetics.

        1 Pt. Cross    2 Pt. Cross    Greedy Cross   Pure      Optimal
        Best   Trial   Best   Trial   Best   Trial   Greedy*
10% 1:  61.29  412     62.03  560     60.11  80      60.11     ?
    2:  45.72  593     47.79  593     48.2   56      48.76     ?
    3:  48.31  613     43.16  630     43.28  258     44.42     ?
    4:  49.72  702     51.58  663     49.11  90      49.82     ?
    5:  55.79  547     50.85  389     48.38  7       48.38     ?
20% 1:  19.36  682     18.78  831     18.78  61      22.33     ?
    2:  34.54  550     33.6   496     29.34  669     33.27     ?
    3:  33.66  719     28.37  524     27.49  114     30.25     ?
    4:  26.42  534     27.02  783     22.22  646     24.17     ?
    5:  24.09  614     16.23  641     13.41  78      16.03     ?
30% 1:  16.30  786     11.52  787     11.52  77      11.52     11.52
    2:  20.13  918     16.70  733     16.06  93      17.19     14.80
    3:  8.46   611     11.85  913     8.20   65      8.20      8.26
    4:  17.77  163     18.40  431     18.10  108     20.06     ?
    5:  9.99   688     8.83   915     7.58   96      7.57      7.55
40% 1:  10.73  937     11.58  788     11.63  81      11.53     10.70
    2:  12.21  793     12.97  745     9.78   108     11.62     9.56
    3:  11.38  949     11.70  1006    7.50   95      7.50      7.50
    4:  7.99   523     6.45   633     6.45   96      6.82      6.42
    5:  17.96  631     15.26  612     19.31  104     16.00     12.64
50% 1:  3.62   703     5.45   623     2.86   18      3.62      2.86
    2:  4.38   378     5.40   434     4.38   79      4.38      4.38
    3:  8.07   855     11.16  684     7.57   81      7.57      7.57
    4:  6.10   992     8.91   718     6.10   220     7.90      6.10
    5:  6.39   929     6.99   644     6.27   90      6.39      6.19
60% 1:  4.75   893     4.24   913     3.12   90      3.28      3.12
    2:  6.49   687     4.66   581     4.66   1014    5.23      4.66
    3:  6.24   123     6.14   962     5.86   77      5.86      5.31
    4:  5.15   799     3.64   795     3.63   882     3.63      3.63
    5:  6.93   493     4.08   779     3.77   16      5.12      3.77
70% 1:  4.78   645     4.49   607     4.49   107     5.76      3.96
    2:  3.42   663     2.15   735     1.76   94      2.04      1.76
    3:  2.31   547     2.31   729     2.31   76      2.31      2.31
    4:  5.21   727     3.00   610     2.64   37      2.78      2.64
    5:  0.96   473     0.96   540     0.61   77      0.76      0.61
80% 1:  0.92   728     2.09   802     0.92   78      0.92      0.92
    2:  2.03   426     2.03   922     2.03   92      2.69      2.03
    3:  3.89   442     3.41   548     2.25   85      2.92      2.25
    4:  4.85   546     5.46   922     4.13   80      4.18      3.70
    5:  3.58   819     2.59   686     2.59   80      3.58      2.59
90% 1:  3.08   887     0.62   551     0.62   76      0.78      0.62
    2:  0.66   979     1.19   787     0.66   86      0.68      0.66
    3:  0.73   851     0.54   815     0.51   79      0.72      ?
    4:  3.82   667     3.81   532     3.65   80      5.20      3.65
    5:  1.33   738     2.34   825     1.08   76      1.04      1.04

* Greedy heuristic applied recursively until a cover is generated.

Table 2. Problem by problem SCP performance.

[Figure 2. Set covering problem #2 of 30% density: generation-average performance (cover cost vs. trials) for one-point, two-point, and greedy crossover.]
[Figure 3. Set covering problem #2 of 30% density: best of generation.]

Job Shop Scheduling

The JSS problem (just like the TSP problem) is a pure ordering problem. Since it is not guaranteed to result in an order, conventional crossover is not applicable to ordering problems unless the problems are formulated as a penalty function problem. Otherwise, mechanisms such as PMX must be used and the underlying genetic building blocks are the o-schemata. Bethke (1980) analyzed GA-hard functions using Walsh transforms, group characters of the group Z_2^n, the n-fold Cartesian product of the group Z_2 with componentwise addition (Dym and McKean 1972). A natural extension of Bethke's work to the analysis of "GA-order-hard" functions might be based on the group representations of the symmetric group (Boerner 1963). Three types of crossover operators were investigated for job shop scheduling: PMX, a weak greedy crossover, and a powerful greedy crossover. The job shop scheduling problem investigated was the simplest scheduling problem: a static queue of jobs with specified due dates and run times with no precedence constraints, a single server, and minimal
[Figure 4. Set covering problem #4 of 70% density: generation-average performance.]
[Figure 5. Set covering problem #4 of 70% density: best of generation.]

(signed) lateness as the criterion. (The more complicated job shop scheduling problem considered by Davis (1985) was not investigated.) A well known heuristic provides the optimal job schedule for this problem: "Order the queue according to increasing run times." The two versions of the greedy crossovers were based on weak and powerful heuristics, respectively. The powerful heuristic is the optimal heuristic previously discussed; the weak heuristic states "order the queue according to increasing difference between due date and run time"; that is, a job with an early due date and a long run time would be scheduled before a job with a late due date and a short run time. The greedy crossovers are implemented as follows:
1. For each parent pair, start at the first job (i = 1).
2. Compare the two jobs at the i-th position of the two parents and place the better (according to the heuristic being used) of the two in the child's i-th position. If one of the jobs has already been placed in the child, then automatically pick the other. If both of the jobs have already been placed, then pick a job randomly from the yet unplaced jobs.
3. Repeat step 2, incrementing i, until all positions in the child string are defined.

Results from the job shop scheduling experiments are presented in Table 3, and generation average and best of generation performance for each of the crossover methods for a representative problem are graphically presented in Figures 6 and 7. It is obvious that the strong greedy crossover dominates PMX, which in turn dominates the weak greedy crossover. Repeated application of the powerful heuristic (pure strong greedy) would yield the optimal schedule with an evaluation of 0.00.
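The position-wise JSS crossover above can be sketched as follows. This is an illustrative implementation, not the authors' code: the names, the job-record layout, and the convention that a smaller key value means "better" are assumptions; the strong heuristic would pass run time as the key, the weak one the due-date-minus-run-time difference.

```python
import random

def jss_greedy_crossover(parent1, parent2, jobs, key, rng=None):
    """Steps 1-3 in the text: at each position take whichever parent's job
    looks better under `key` (smaller is better); fall back to the other
    parent's job, or to a random unplaced job, when a job is already placed.
    `jobs` maps job id -> job data; parents are permutations of the ids."""
    rng = rng or random.Random()
    child, placed = [], set()
    for j1, j2 in zip(parent1, parent2):
        if j1 in placed and j2 in placed:
            pick = rng.choice(sorted(set(parent1) - placed))
        elif j1 in placed:
            pick = j2
        elif j2 in placed:
            pick = j1
        else:
            pick = j1 if key(jobs[j1]) <= key(jobs[j2]) else j2
        child.append(pick)
        placed.add(pick)
    return child
```

Unlike PMX, this operator consults the problem itself (through the heuristic key), which is exactly the problem-specific information the introduction argues conventional crossover lacks.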
        PMX            Weak Greedy     Strong Greedy
        Best   Trial   Best    Trial   Best   Trial
 1:     40.84  825     124.41  314     0.00   514
 2:     33.78  913     51.42   951     8.82   321
 3:     29.74  938     57.71   948     2.27   397
 4:     59.60  873     86.93   984     6.99   370
 5:     27.22  861     54.94   972     2.75   576
 6:     34.94  967     68.76   849     3.34   462
 7:     20.90  689     71.72   926     2.20   938
 8:     14.68  890     112.84  887     0.00   545
 9:     53.60  695     101.93  923     2.20   861
10:     13.52  994     203.40  169     0.00   432
11:     28.99  946     74.98   711     0.00   416
12:     48.14  835     115.14  422     0.00   1005
13:     18.02  941     63.18   955     1.60   481
14:     17.57  701     80.20   967     1.24   720
15:     15.07  976     150.09  397     10.85  424
16:     7.93   946     136.03  492     2.23   599
17:     50.18  786     208.27  43      3.24   959
18:     31.35  1007    122.13  29      4.71   828
19:     25.97  904     184.46  281     2.49   344
20:     46.00  793     183.39  681     3.78   937

Table 3. Problem by problem JSS performance.

[Figure 6. Job shop scheduling problem #2: generation-average performance for PMX, weak greedy, and strong greedy crossover.]

Traveling Salesman Problem (TSP)

The traveling salesman problem (TSP) is another of the very difficult NP-complete combinatorial ordering problems with a long history of interest. Few (1955) discovered a heuristic solution for the Euclidean problem with a guaranteed worst case performance of sqrt(2n) + 1.75. Christofides (1976) discovered an elegant heuristic with a worst case error of 50% of optimal tour length. Karp (1977) showed that partitioning into subsets and concatenating optimal tours of each subset yields tours with expected percentage error approaching zero. Crowder and Padberg (1980) solved a 318 city tour to optimality, and Golden and Stewart (1985) reported excellent results for their CCAO heuristic. Of the many theorems and results relating to the TSP, two are of particular relevance to this paper.

Theorem 1 (Rosenkrantz, Stearns, and Lewis 1977). For every r > 1, there exist n-city instances of the TSP for arbitrarily large n obeying the triangle inequality such that NN(I) > r
OPT(I), where NN(I) is the nearest neighbor tour and OPT(I) is the overall optimal tour.

[Figure 7. Job shop scheduling problem #2: best of generation for PMX, weak greedy, and strong greedy crossover.]

problem  nearest    PMX unanchored  PMX anchored  PMX       greedy genetic  greedy genetic
         neighbor   and seeded      and seeded    unseeded  seeded          unseeded
 1       19484      19374 b         18243 ab      23269     19211 b         18243 ab
 2       16480 a    16480 a         16480 a       18096     16480 a         16480 a
 3       17673 a    17673 a         17673 a       22588     17673 a         17815
 4       18701      15626 b         14837 b       18495 b   14331 b         14256 ab
 5       16219      16210 ab        16219         18569     16219           16219
 6       13613 a    13613 a         13613 a       13613 a   13613 a         23916
 7       17400 a    17400 a         17400 a       24578     17400 a         17745
 8       15067 a    15067 a         15067 a       46714     15067 a         15067 a
 9       19755      19785           19877         25129     19555 ab        19661 b
10       21299      21299           20739 b       23392     20778 b         19421 ab
11       18034      17711 b         17740 b       22150     17364 b         17191 ab
12       18587      18587           18587         21998     16587 ab        19708
13       17078      17078           17078         21063     17078           ?
14       19535      19472 b         19205 b       23845     18581 ab        18581 ab
15       16052      16052           16052         20856     16042 ab        16052
16       21934      20226 b         19897 b       21442 b   19253 ab        19712 b
17       16384      16354 b         16229 ab      20352     16229 ab        16354 b
18       21977      21977           21977         26089     21807 ab        22668
19       17049      17006 ab        17015 b       22651     17049           18519
20       17869 a    17869 a         17869 a       25119     17869 a         17869 a

Table 4. Problem by problem TSP performance.
a: best performance.  b: performance surpasses nearest neighbor.

Theorem 2 (Papadimitriou and Steiglitz 1977). If A is a local search algorithm whose neighborhood search time is bounded by a polynomial of the problem representation, then if P != NP, A cannot be guaranteed to find a tour whose length is bounded by a constant multiple of the optimal tour even with an exponential number of iterations.

Theorem 1 is relevant to this paper because the standard against which the genetic algorithms are compared is the pure greedy algorithm, called the nearest neighbor solution in the operations research literature.
The implication is that this standard itself could be poor. Theorem 2 suggests that if the greedy genetic algorithm is a local search algorithm, as is believed, then examples of tours can be found for which the algorithm performs arbitrarily poorly. Presumably, the same conclusions hold for PMX.

The work presented here builds on that of Brockus (1983), Goldberg and Lingle (1985), and Grefenstette et al. (1985). The results of those three studies were not directly comparable, and it is interesting to ask how they compare. Moreover, none of the previous studies investigated the effect of seeding the initial populations, of anchoring the tours, or of stochastic variation due to different randomizations.

Two types of crossover operators were investigated for the TSP: PMX and a modification of the greedy crossover of Grefenstette et al. (1985). For the anchored case, all members of the population were normalized to begin with the same starting city (city '1'). For the unanchored case this was not done and the starting cities were randomly chosen. For any two parents, the modified Grefenstette crossover (greedy crossover) produces a child according to the following specification:

1. For a pair of parents, start at the first position (always the same city).
2. Choose the shortest edge leading from the current city (that is represented in the parents). If this edge leads to a cycle, choose the other edge. If this also leads to a cycle, choose a random city that continues the tour.
3. If the tour is complete, stop; else go to 2.

The modified Grefenstette crossover will be called the greedy crossover and is always anchored in the results presented here. Both greedy crossover and PMX crossover were tested with seeded and unseeded initial populations. The seeded initial population included the fifteen greedy-algorithm-generated tours (possibly including duplicates) beginning at each of the fifteen different cities.
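As a concrete illustration, the three-step greedy crossover above might be sketched as follows. The helper names and the dictionary-of-dictionaries distance table are assumptions of this sketch, not the paper's implementation; tours are assumed anchored, i.e., both parents begin with the same city.

```python
import random

def greedy_crossover(p1, p2, dist):
    """Sketch of the greedy (modified Grefenstette) crossover.

    p1, p2: parent tours as city lists sharing the same start city (anchored).
    dist:   dist[a][b] gives the edge length between cities a and b.
    """
    n = len(p1)
    # Successor of each city within a parent tour (wrapping at the end).
    succ1 = {p1[i]: p1[(i + 1) % n] for i in range(n)}
    succ2 = {p2[i]: p2[(i + 1) % n] for i in range(n)}

    child = [p1[0]]                 # step 1: start at the anchored first city
    visited = {p1[0]}
    while len(child) < n:
        cur = child[-1]
        # step 2: shorter of the two parental edges leaving the current city
        cand = sorted([succ1[cur], succ2[cur]], key=lambda c: dist[cur][c])
        nxt = next((c for c in cand if c not in visited), None)
        if nxt is None:             # both edges close a cycle: random continuation
            nxt = random.choice([c for c in p1 if c not in visited])
        child.append(nxt)
        visited.add(nxt)
    return child                    # step 3: tour complete
```

The child inherits only edges present in one of the parents unless both would close a premature cycle, which is what lets the operator exploit the parents' short edges.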
This subpopulation of fifteen was randomly extended to an initial population of fifty. The nearest neighbor tour was the best of the fifteen greedy-algorithm-generated tours.

Results of the performance of the greedy and PMX crossovers are given in Table 4. Discouraging is the poor performance of the PMX with the unseeded population; only for one problem did it perform better than the nearest neighbor algorithm. Virtually no performance differential among the remaining variations of the genetic algorithm was observed; each performed somewhat better than the nearest neighbor. Perhaps surprisingly, the unseeded greedy genetic performed nearly as well as its seeded counterpart.

Relative performance between anchored and unanchored PMX and seeded and unseeded greedy crossover is presented in Table 5, where the 2 in the upper left hand corner indicates that the unanchored PMX performed better than the anchored PMX on two problems. The remaining entries are interpreted similarly. Apparently, no important differences are indicated.

Figures 8 and 9 present generation-average performance and best-of-generation results for the fourth problem at 70% density. Bethke (1980) has shown that for certain problems, genetic algorithms are unstable; that is, the solution determined seems to depend on genetic drift and on the randomization of the initial population and genetic mechanism. Tables 6 and 7 display the results of different randomizations of initial populations and genetic mechanisms for the first problem at 10% density. Discouragingly, approximately 8% variation in performance can be noted for the greedy genetic as either the initial population or the genetic mechanism is randomized differently. The counterpart variations for the PMX are 13% and 17%, respectively. The implication seems to be that the TSP is a difficult problem for either the PMX or greedy genetics.

                   unanchored and seeded   anchored and seeded
  PMX                        2                      6

                   seeded                  unseeded
  greedy genetic             8                      6

Table 5.
Seeding and anchoring as factors in performance differential

  population    PMX      greedy
    seed #              genetic
      1        22671     19696
      2        22445     18243
      3        21247     18733
      4        22456     18771
      5        24075     18888

Table 6. Random initial population seeds as a factor in performance differential

  crossover     PMX      greedy
    seed #              genetic
      1        24029     19537
      2        23244     20721
      3        22167     20713
      4        24074     20275
      5        22427     19072
      6        21460     20078
      7        22898     19600
      8        21628     19348
      9        22671     19636
     10        26125     20415

Table 7. Random seeds for GA mechanism as a factor in performance differential

[Figure 8. Generation average performance, Traveling Salesman Problem #4]

[Figure 9. Best of generation for unseeded initial populations, Traveling Salesman Problem #4]

Conclusions and Future Research

Although little effort was made to optimize the algorithms investigated in this paper, the evidence suggests that greedy genetics can successfully make use of problem-specific information whenever the underlying greedy algorithm is powerful. (The underlying greedy could be defined to be "powerful" whenever a single application of it to the problem results in a "reasonable" solution.) On the other hand, if the underlying greedy is misdirected, greedy genetics are less successful than traditional genetics. However, regardless of the power of the underlying greedy algorithm, greedy genetics often converge more rapidly than their conventional counterparts.

It appears that greedy genetics have their place in optimization, and it would be interesting to extend this work to more realistic problems (such as job shop scheduling problems with precedence constraints and multiple servers) and to compare performance with more traditional optimization techniques. The variance in performance due to randomization, Markov chain results, extensions to Bethke's work, and proper penalty function formulations all deserve additional analysis.

REFERENCES

Ackley, D. H. (1987).
A Connectionist Machine for Genetic Hillclimbing, Kluwer Academic Publishers, Boston, MA.

Balas, E. and A. Ho (1980). "Set Covering Algorithms Using Cutting Planes, Heuristics, and Subgradient Optimization: A Computational Study," Mathematical Programming, 12, 37-60.

Bethke, A. D. (1980). Genetic Algorithms as Function Optimizers, Ph.D. Thesis, University of Michigan, Ann Arbor.

Boerner, H. (1963). Representations of Groups with Special Consideration for the Needs of Modern Physics, North-Holland, Amsterdam.

Brockus, C. G. (1983). "Shortest Path Optimization Using a Genetics Search Technique," Proceedings of the ... Conference, Pittsburgh, PA, 241-245.

Christofides, N. (1976). Worst-Case Analysis of a New Heuristic for the Traveling Salesman Problem, Report 388, Graduate School of Industrial Administration, Carnegie-Mellon University, Pittsburgh, PA.

Crowder, H. and M. W. Padberg (1980). "Solving Large-Scale Symmetric Traveling Salesman Problems to Optimality," Management Science, 26, 495-509.

Davis, L. (1985). "Job Shop Scheduling with Genetic Algorithms," Proceedings of an International Conference on Genetic Algorithms and Their Applications, Carnegie-Mellon University, Pittsburgh.

De Jong, K. A. (1975). An Analysis of the Behavior of a Class of Genetic Adaptive Systems, Ph.D. Thesis, University of Michigan, Ann Arbor.

Dym, H. and H. P. McKean (1972). Fourier Series and Integrals, Academic Press, New York.

Few, L. (1955). "The Shortest Path and the Shortest Road through n Points," Mathematika, 2, 141-144.

Garey, M. R. and D. S. Johnson (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, San Francisco.

Grefenstette, J. J., R. Gopal, B. J. Rosmaita, and D. Van Gucht (July 1985). "Genetic Algorithms for the Traveling Salesman Problem," Proceedings of an International Conference on Genetic Algorithms and Their Applications, Carnegie-Mellon University, Pittsburgh.

Grefenstette, J. J. (April 1986). A User's Guide to GENESIS, Technical Report CS-84-11, Computer Science Department, Vanderbilt University, Nashville.

Goldberg, D. E. (1986). Simple Genetic Algorithms and the Minimal Deceptive Problem, TCGA Report No. 86003, The Clearinghouse for Genetic Algorithms, University of Alabama, Tuscaloosa.

Goldberg, D. E., and R. Lingle (July 1985). "Alleles, Loci, and the Traveling Salesman Problem," Proceedings of an International Conference on Genetic Algorithms and Their Applications, Carnegie-Mellon University, Pittsburgh.

Goldberg, D. E., and P. Segrest (to appear). "Finite Markov Chain Analysis of Genetic Algorithms," University of Alabama, Tuscaloosa.

Golden, B. L., and W. R. Stewart (1985). "Empirical Analysis of Heuristics," in The Traveling Salesman Problem, Lawler et al. (eds.), John Wiley and Sons.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor.

Karp, R. M. (1977). "Probabilistic Analysis of Partitioning Algorithms for the Traveling Salesman Problem in the Plane," Mathematics of Operations Research, 2, 209-224.

Lawler, E. L. (1976). Combinatorial Optimization: Networks and Matroids, Holt, Rinehart and Winston, New York.

Lawler, E. L., J. K. Lenstra, A. H. G. Rinnooy Kan, and D. B. Shmoys (eds.) (1985). The Traveling Salesman Problem, John Wiley and Sons.

Liepins, G. E., and M. R. Hilliard (1986). "Representational Issues in Machine Learning," Proceedings of the International Symposium on Methodologies for Intelligent Systems Colloquia Program, Knoxville, TN, ORNL-6362.

Marsten, R. E., and F. Shepardson (1981). "Exact Solution of Crew Scheduling Problems Using the Set Partitioning Model: Recent Successful Applications," Networks, 11(2), 165-177.

Nilsson, N. J. (1980). Principles of Artificial Intelligence, Tioga, Palo Alto, CA.

Papadimitriou, C. H., and K. Steiglitz (1977). "On the Complexity of Local Search for the Traveling Salesman Problem," SIAM J. Comput., 6, 78-83.

ReVelle, C., D. Marks, and J. C. Liebman (1970). "An Analysis of Private and Public Sector Location Models," Management Science, 16(12), 692-707.

Rosenkrantz, D. J., R. E. Stearns, and P. M. Lewis II (1977).
"An Analysis of Several Heuristics for the Traveling Salesman Problem," SIAM J. Comput., 6, 563-581.

INCORPORATING HEURISTIC INFORMATION INTO GENETIC SEARCH

Jung Y. Suh
Dirk Van Gucht
Computer Science Department
Indiana University
Bloomington, Indiana 47405
(812) 335-6429
CSNET: jysuh@indiana, vgucht@indiana

Keywords: Genetic Algorithms, Heuristics, Optimization Problems, Simulated Annealing, Sliding Block Puzzle, Traveling Salesman Problem

Abstract

Genetic algorithms have been shown to be robust optimization algorithms for (positive) real-valued functions defined over domains of the form R^n (R denotes the real numbers). Only recently have there been attempts to apply genetic algorithms to other optimization problems, such as combinatorial optimization problems. In this paper, we identify several obstacles which need to be overcome to successfully apply genetic algorithms to such problems and indicate how integrating heuristic information related to the problem under consideration helps in overcoming these obstacles. We illustrate the validity of our approach by providing genetic algorithms for the Traveling Salesman Problem and the Sliding Block Puzzle.

1. Introduction

Suppose we have an object space X and a function f: X → R+ (R+ denotes the positive real numbers), and our task is to find a global minimum (or maximum) of the function f. In this paper, we will concentrate on genetic algorithms, a class of adaptive algorithms invented by John Holland [8], to solve (or partially solve) this problem.
Genetic algorithms differ from more standard search algorithms (e.g., gradient descent, controlled random search, hill-climbing, simulated annealing [9], etc.) in that the search is conducted using the information of a population of structures instead of that of a single structure. The motivation for this approach is that by considering many structures as potential candidate solutions, the risk of getting trapped in a local optimum is greatly reduced.

Genetic algorithms have been applied with great success by De Jong [4] to a wide variety of functions defined over object spaces of the form R^n, i.e., each structure x consists of n real numbers x[1] ... x[n]. Only recently have there been attempts to apply genetic algorithms to other optimization problems, such as the Traveling Salesman Problem (TSP) [6, 7], Bin Packing [13], and Job Scheduling [2, 3]. An important observation made by Grefenstette et al. [7] was that to successfully apply genetic algorithms to such problems, heuristic information has to be incorporated into the genetic algorithm; in particular, they proposed a heuristic crossover operator and showed the dramatic improvement as compared to crossover operators which did not use such heuristic information.

In this paper, we continue the efforts of [7]. In Section 2 we identify several problems which need to be overcome to successfully apply genetic algorithms to optimization problems other than the standard function optimization problems. In Sections 3 and 4 we illustrate the validity of our approach by providing genetic algorithms for the Traveling Salesman Problem and the Sliding Block Puzzle [12].

2. Design Issues of Genetic Algorithms

In this section, we outline the major obstacles in the design of genetic algorithms for optimization problems other than standard function optimization problems and suggest approaches to overcome them.

2.1.
The Representation Problem

As mentioned before, genetic algorithms have almost exclusively been applied to functions defined over object spaces of the form R^n. When we want to solve other optimization problems, such as combinatorial optimization problems, simple parametric representations of the structures can no longer be used. This suggests that the first step towards successfully applying genetic algorithms to these problems is to use a natural representation for the structures of the problem at hand. In particular, we suggest that the choice of such a representation should allow for the definition of recombination operators which incorporate heuristic information about the problem. We thus imply that the selection of "good" representations and recombination operators is highly correlated (for a more detailed discussion of these issues we refer to [5]).

2.2. The Selection of Appropriate Recombination Operators and the Importance of Local Improvement

The power of applying genetic algorithms to functions defined over R^n is that the standard recombination operators, crossover and mutation, make intuitive sense in this problem. In other problem domains, however, this is not usually the case. Since the recombination step is critical for the success of a genetic algorithm, it is important to carefully select an appropriate set of recombination operators for such problem domains.

Early research on genetic algorithms [1, 8] was primarily concerned with operators which guarantee that (some) structural information of the structures to which they are applied is preserved. Examples of such recombination operators are the standard crossover, mutation and inversion operators used in function optimization. Grefenstette et al. argued that such an approach does not carry over with similar success to the traveling salesman problem (TSP). They showed that considering recombination operators (in their case, only crossover operators) that merely preserve structural information results
in poorly performing genetic algorithms (i.e., not much better than random search). They discovered, however, that it is possible and natural to incorporate heuristic information about the TSP into the crossover operator and still maintain its fundamental property, namely preservation of structural information of the structures to which the operator is applied. This resulted in a fairly successful genetic algorithm for the TSP, but certainly not an algorithm that is competitive with other approximation algorithms for the TSP (see [10]). (It should be noted that De Jong [5] and Grefenstette et al. [7] already identified some of these problems.)

We claim that it is often the case that additional improvements can be gained if one incorporates even more heuristics about the problem into the recombination step of a genetic algorithm. Often, heuristics about problems are incorporated into algorithms in the form of operators which iteratively perform local improvements to candidate solutions. Examples of such operators can be found in gradient descent algorithms, hill climbing algorithms, simulated annealing, etc. We will argue that it is usually straightforward, and in fact, we think, essential if a competitive genetic algorithm is desired, to incorporate such local improvement operators into the recombination step of a genetic algorithm. An additional advantage of this approach is that it suggests a natural technique of blending genetic algorithms with more standard optimization algorithms.
2.3. Avoiding Premature Convergence

One of the major difficulties with genetic algorithms (and in fact with most search algorithms) is that sometimes premature convergence, i.e., convergence to a suboptimal solution, occurs. It has been observed that this problem is closely tied to the problem of losing diversity in the population. One source of loss of diversity is the occasional appearance of a "super-individual" which in a few generations takes over the population. One way of avoiding this problem is to change the selection procedure, as was demonstrated by Baker [1]. Another source of loss of diversity results from poor performance of recombination operators in terms of sampling new structures. To overcome such problems, we claim that the recombination operators should be selected carefully so that they can offset each other's vulnerabilities (for a more detailed discussion of these issues, we refer to [1, 4, 7, 8]).

3. The Traveling Salesman Problem

In this section we show that solutions to the problems raised in Section 2 enable us to develop a genetic algorithm for the TSP. The TSP is easily stated: Given a complete graph with N nodes, find the shortest Hamiltonian tour through the graph (in this paper, we will assume Euclidean distances between nodes). For an excellent discussion of the TSP, we refer to [10].

The object space X obviously consists of all Hamiltonian tours (tours, for short) associated with the graph, and f, the function to be optimized, returns the length of a tour. As in [7], we represent a tour by its adjacency representation. It turns out that this representation allows us to easily formulate and implement heuristic recombination operators. In the adjacency representation, a tour is described by a list of cities: there is an edge in the tour from city i to city j if and only if the value in the ith position of the adjacency representation is j. For example, the tour shown in Figure 1 is represented as (3 1 5 2 4).
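The adjacency representation can be sketched as follows; this is an illustrative helper with hypothetical function names, not the authors' code. Position i of the adjacency list holds the city that the tour visits immediately after city i (cities labelled from 1, as in the paper's example).

```python
def tour_to_adjacency(order):
    """Convert a tour given as a visiting order (1-based city labels)
    into its adjacency representation: position i holds the city
    that the tour visits immediately after city i."""
    n = len(order)
    adj = [0] * n
    for k in range(n):
        city, nxt = order[k], order[(k + 1) % n]
        adj[city - 1] = nxt
    return adj

def adjacency_to_tour(adj, start=1):
    """Recover a visiting order from an adjacency representation."""
    order, city = [], start
    for _ in range(len(adj)):
        order.append(city)
        city = adj[city - 1]
    return order
```

For the paper's example, the five-city tour 1 → 3 → 5 → 4 → 2 → 1 yields the adjacency representation (3 1 5 2 4): after city 1 comes 3, after city 2 comes 1, after city 3 comes 5, and so on.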
Figure 1. The tour (3 1 5 2 4).

We now turn to the most critical step of the design: the selection of appropriate recombination operators. We elected to have two such operators. The first operator is a slight modification of the heuristic crossover operator introduced by Grefenstette et al. [7]. This operator constructs an offspring from two parent tours as follows. Pick a random city as the starting point for the offspring's tour. Compare the two edges leaving the starting city in the parents and choose the shorter edge. Continue to extend the partial tour by choosing the shorter of the two edges in the parents which extend the tour. If the shorter parental edge would introduce a cycle into the partial tour, check if the other parental edge introduces a cycle. In case the second
In other words, the heuristic crossover operates performs poorly when it comes down to fine-tuning candidate solutions This motivated us to introduce a sec ond recombination operator Whereas the heuristic crossover operator can be thought of as a global operator, the second recombination operator has a more local behavior and thns qualifies as a local improvement operator It was introduced by Lin and Kernighan {43} and 09 called the 2-opt operator The 2-opt operator randomily selects two edges (41, 21) and (t2, a) from a tour (see Figure 2) and checks if 2D (11, 91} + ED(1a, 2) > ED(45,33) + ED(Qa, 1) (ED stands for Enchidean distance). Vf this is the case, the tour is replaced by removing the edges (ty, 1) and (tg, 9) and replacing them with the edges (in, ja) and (ia, 11) (sce Figure 3) Actually, we use a more subtle variation of the 2-opt operator, mspired by recent work on stmulated annealing for the TSP by Kirkpatrick etal (9 In this varation there is a (small) probability (depending on a slowly decreasing temperature) that when ED(uyn)+ D(a, 3) < ED(4, 72) + ED(1a, 71), the ong- mal tour is replaced using the previously described transfor- mation f To make the description of our genetic algorithin complete, we need to desciibe three parameters a crossover rate this parameter indicates the amount of structures tn the population which will undergo cross- over b local improvement rate this parameter indicates the amount of stiuctares in the population whieh will undergo 2-opt operations © Q-opt rate if the graph under consideration has N’ nodes, each structure which 15 selected to undergo local unprovement will undergo (NV x 2~opt rate) 2-opt operations pet generation The algorithm was stopped when the majority of the tours in the population were identical, We tried our algorithin on a wide variety of (euctidean} t lt should be noted, however, that the performance of the algouithm with the shnple 2-opt as usually only shightly worse than an algorithm that uses 
the simulated annealing version.

Figure 2. Tour with edges (i1, j1) and (i2, j2).

Figure 3. Tour with edges (i1, i2) and (j1, j2).

traveling salesman problems. In Figure 4, we show a selection of such problems.

Figure 4. Five Traveling Salesman Problems: krolak, lattice, 4-circles, lat-4-circ, 200-cities.

In Table 1, we show the results obtained by the algorithm of Grefenstette et al. [7] for the following parameter settings: initial population = 100 randomly chosen tours, crossover rate = 50%, local improvement rate = 0%, 2-opt rate = not applicable. In Table 2, we show the results of a genetic algorithm which uses the local improvement operator with the following parameter settings: population size = 100 structures, crossover rate = 50%, local improvement rate = 50%, 2-opt rate = 0.1.

Table 1. Genetic Algorithm Without Local Improvement

  TSP           Nodes    Optimum    Our Solution    Generations
  krolak [10]    100      21282        25651             ?
  lattice        100        100        101.9           209
  4-circles      200      21.67         39.0           300
  lat-4-circ     200     112.56        139.2           986
  200-cities     200          ?        192.8           376

Table 2. Genetic Algorithm With Local Improvement

  TSP           Nodes    Optimum    Our Solution    Generations
  krolak         100      21282        21651           679
  lattice        100        100          100           188
  4-circles      200      21.67         24.5           218
  lat-4-circ     200     112.56        113.3           669
  200-cities     200          ?        153.6           946

In Figure 5, we show the (best) tours obtained and the generation in which they were first found by the algorithm which uses local improvement.
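The 2-opt move described in Section 3, together with its annealed acceptance variant, can be sketched as follows. The function name, the coordinate dictionary, and the temperature handling are assumptions of this sketch, not the authors' implementation; ED is Euclidean distance, as in the text.

```python
import math
import random

def two_opt_step(tour, coords, temperature=0.0):
    """One (annealed) 2-opt step on a tour given as a list of city indices.

    coords[c] is the (x, y) position of city c.  With probability
    exp[(old - new)/T] a worsening exchange is also accepted, mimicking
    the simulated-annealing variation described in the text.
    """
    def ED(a, b):
        return math.dist(coords[a], coords[b])

    n = len(tour)
    # Pick two non-adjacent edges (i1, j1) and (i2, j2).
    p = random.randrange(n)
    q = (p + 2 + random.randrange(n - 3)) % n
    p, q = min(p, q), max(p, q)
    i1, j1 = tour[p], tour[(p + 1) % n]
    i2, j2 = tour[q], tour[(q + 1) % n]

    old = ED(i1, j1) + ED(i2, j2)
    new = ED(i1, i2) + ED(j1, j2)
    worse = new >= old
    accept = not worse or (
        temperature > 0 and random.random() < math.exp((old - new) / temperature))
    if accept:
        # Reconnect as (i1, i2) and (j1, j2) by reversing the segment between them.
        tour[p + 1:q + 1] = reversed(tour[p + 1:q + 1])
    return tour
```

With temperature 0 this reduces to plain 2-opt: only strict improvements are accepted, so repeated steps uncross any pair of crossing edges.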
Clearly, the addition of a local improvement technique improves the performance (measured in terms of the tour length of the best tour obtained) of the algorithm dramatically. (In terms of extra resources, on average, the algorithm using local improvement required about 2.2 times more generations to obtain its best structure.) In fact, the results obtained by our algorithm are very competitive, again in terms of the tour length of the best tour obtained, compared to results reported in the literature for other approximation algorithms for the TSP [9, 10]. (In Appendix 1, we give additional results.)

Figure 5. Best Tours Obtained by a Genetic Algorithm with Local Improvement: krolak (21651), lattice (100.0), 4-circles (24.53), lat-4-circ (113.34), 200-cities (153.67).

4. The Sliding Block Puzzle (SBP)

We now describe how the approach described in Section 2 can be used in the design of a genetic algorithm for a problem which is not usually thought of as a function optimization problem: the Sliding Block Puzzle [12].* Consider the initial board of the puzzle shown in Figure 6 and let the board shown in Figure 7 be a goal board (the empty tile is represented by the symbol 0).

   1  2  3  4
   5  6  7  8
   9  0 10 11
  12 13 14 15

Figure 6. The Initial Board of a Sliding Block Puzzle.

Figure 7. A Goal Board of a Sliding Block Puzzle.

* In our implementation, we used 3x3 and 4x4 puzzles.

The objective of the SBP is to reach the goal board starting from the initial board using a sequence of valid moves. There are four basic moves:

L: move the empty tile to the left.
U: move the empty tile upwards.
R: move the empty tile to the right.
D: move the empty tile downwards.

The only precondition required for applying a move is that it should not move the empty tile off the board. For example, a sequence which transforms the board shown in Figure 6 into the board shown in Figure 7 is (L, U, R, D, R, U).
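The move mechanics above can be sketched as follows; this is an illustrative fragment (helper names and the list-of-rows board layout are assumptions of this sketch), with 0 marking the empty tile as in Figure 6.

```python
# A board is a list of rows; 0 marks the empty tile.  Each move slides the
# empty tile one cell and is legal only if it stays on the board.

MOVES = {"L": (0, -1), "U": (-1, 0), "R": (0, 1), "D": (1, 0)}

def find_empty(board):
    """Return the (row, column) position of the empty tile."""
    for r, row in enumerate(board):
        if 0 in row:
            return r, row.index(0)

def apply_move(board, move):
    """Return a new board with the move applied, or None if it is illegal."""
    r, c = find_empty(board)
    dr, dc = MOVES[move]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < len(board) and 0 <= nc < len(board[0])):
        return None                      # would push the empty tile off the board
    new = [row[:] for row in board]
    new[r][c], new[nr][nc] = new[nr][nc], new[r][c]
    return new

def apply_sequence(board, seq):
    """Apply a sequence of moves, failing on the first illegal one."""
    for m in seq:
        board = apply_move(board, m)
        if board is None:
            raise ValueError("illegal move in sequence")
    return board
```

The precondition check mirrors the one stated in the text: a move is rejected exactly when it would carry the empty tile off the edge of the board.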
In order to apply genetic algorithms to the SBP, we need to formulate the problem as a function optimization problem. The object space X consists of all valid sequences of moves applicable to the initial board. Notice that the structures in X do not have a fixed length representation. Other research with genetic algorithms on object spaces with structures having variable length representations was done by Smith [14], who implemented a machine learning system (LS-1) using structures corresponding to production system programs.

In order to define f, the function to be optimized, we need to introduce some extra notation. We will denote the initial board by IB and the goal board by GB. Let (x1, ..., xn) be a sequence of valid moves (i.e., an element of X); we denote by IB(x1, ..., xn) the board which is obtained by applying the sequence of moves (x1, ..., xn) to IB. Consider the boards IB(x1, ..., xn) and GB. For each tile (except the empty tile) in IB(x1, ..., xn), compute the Manhattan distance between the tile's position in IB(x1, ..., xn) and its position in GB. We define performance(x1, ..., xn) as the sum of all these Manhattan distances. In our first attempt, we defined f(x1, ..., xn) = performance(x1, ..., xn), but we quickly discovered that a much better measure is

    f(x1, ..., xn) = min { performance(x1, ..., xi) : 1 <= i <= n },

i.e., the value of a structure (x1, ..., xn) is defined as the performance of the sub-sequence (x1, ..., xi) whose corresponding intermediate board IB(x1, ..., xi) comes closest to GB. (It should be noted that computing f(x1, ..., xn) can be done in time O(n).) It should be clear that whenever f(x1, ..., xn) = 0, the sequence (x1, ..., xn) contains a subsequence (x1, ..., xi) which is a solution to the SBP.

We now turn to the selection of the recombination operators. The crossover process here is similar to that in the TSP. Suppose that two sequences of operators are given.
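Before describing the crossover in detail, the performance measure just defined can be sketched as follows (an illustrative fragment, not the authors' code; the list-of-rows board convention follows Figure 6, with 0 the empty tile):

```python
def performance(board, goal):
    """Sum of Manhattan distances of each non-empty tile in `board`
    from that tile's position in the goal board."""
    pos = {tile: (r, c)
           for r, row in enumerate(goal) for c, tile in enumerate(row)}
    total = 0
    for r, row in enumerate(board):
        for c, tile in enumerate(row):
            if tile != 0:
                gr, gc = pos[tile]
                total += abs(r - gr) + abs(c - gc)
    return total
```

f is then the minimum of this quantity over all intermediate boards IB(x1, ..., xi), which can be tracked in a single pass while the moves are applied.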
We pick the first operator from each sequence. Apply each operator to the initial board to see which operator yields a new board closer to the goal board. Choose with high probability the operator which yields the closer board, i.e., the one with the better performance. Notice that it is here that we employ heuristic information about the Sliding Block Puzzle; indeed, it seems likely that we should try to obtain intermediate configurations that get closer to the goal board. We do not always choose the better operator, however, because this may eventually lead to a bad sequence whose performance we cannot improve as long as it starts with that particular operator. In short, we could get stuck in a local optimum. However, it is our assumption that in general it is more likely the case that selecting the better operator will contribute to constructing a good sequence. In case the two operators have the same performance, pick either of them randomly. Once the operator is chosen, it becomes the first operator of the new sequence and the board is updated accordingly. Now we pick the second operators of each sequence. Again, we will take the one with the better performance. It may, however, be the case that one or both of them is no longer legal, i.e., it pushes the empty tile off the edge of the board. This is possible because the operator chosen for the new sequence is not necessarily the one which preceded the current two operators. In case only one of the operators is illegal, choose the one which is legal. Otherwise, randomly generate a legal one. It becomes the second operator of the new sequence. Again update the board. This process is repeated until we reach the end of the two sequences.

The local improvement process is performed on a single structure. First randomly pick m positions (0 < m ...

[...]

... (E_n > E_b):

    p = exp[ (E_b - E_n) / kT ]

where E_n is the energy of the new solution and E_b is the energy of the current best solution.
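The Boltzmann acceptance rule above can be sketched as follows; this is an illustrative fragment with hypothetical names, taking k = 1 (as the paper itself assumes later).

```python
import math
import random

def accept(e_new, e_best, temperature, k=1.0):
    """Boltzmann acceptance: always accept a solution that is at least as
    good; accept a worse one with probability exp[(E_b - E_n)/kT]."""
    if e_new <= e_best:
        return True
    if temperature <= 0:
        return False                     # no thermal energy: never go uphill
    return random.random() < math.exp((e_best - e_new) / (k * temperature))
```

At high temperature the exponent is near zero and uphill moves are accepted almost freely; as the temperature falls, the acceptance probability for any fixed energy gap decays toward zero.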
Conceptually, the system must pay a price in terms of energy to transition to a state that is higher in energy than the current state. The energy required is supplied by the heat present in the system. Boltzmann's equation relates the temperature to the available energy. The equation is probabilistic rather than deterministic, since the ambient temperature represents the average over a large number of small volumes; even at a low ambient temperature the local temperature in a small volume might be very high. Boltzmann's equation gives the probability that a given amount of energy is available at a given temperature. In simulated annealing the temperature is set to a high value initially and then lowered according to an annealing schedule provided by the analyst.

There are reports that simulated annealing has enjoyed great success on the travelling salesman problem [Ref. 6]. These reports are difficult to assess, since performance numbers are not given. In any event, it would be desirable to include some aspects of simulated annealing within the general framework of the genetic algorithm. In particular, we desire to tie the schemata retention behavior of the standard operators (crossover, inversion and mutation) to a temperature parameter. This suggests a model for genetic operators as described in the following section. The model described is artificial in the sense that it is motivated by a need to control schemata retention in a computer program rather than by biological or physical processes.

Thermodynamic Operator

We consider a two-parent crossover operation between individuals that have been selected from the population with probability proportional to their fitness. An offspring individual is constructed by transcribing the genes from the parents in such a way that o-schemata and a-schemata are partially preserved. Transcription begins with the first gene of one of the parents.
We imagine that it is energetically favorable to continue transcription from the same parent; it costs energy to switch transcription to the other parent. The energy available in the environment follows a Boltzmann distribution. If we call the energy at which transcription will switch to the other parent θ_c (the crossover threshold energy), then a switch will occur just when the energy available locally (which depends on the temperature) exceeds θ_c. This occurs with probability

    exp[ -θ_c / kT ]        (eq 2)

where k is an arbitrarily chosen scaling constant. We assume k = 1. Transcription continues from the second parent until the crossover threshold energy is again exceeded. At high temperatures, transcription might switch back and forth between the parents many times. At low temperatures transcription might not switch at all, resulting in a direct copy of the parent individual.

Inversion is modelled similarly. We imagine that the string selected from each parent stands a chance of being rotated as it is transported from the parent to the offspring (see Figure 2). Again, it costs energy to perform this rotation, and this energy is supplied by the heat available locally. Thus, the probability of inversion is

    exp[ -θ_i / kT ]

where θ_i is the inversion threshold energy. In some domains it may be desirable to model the fact that it requires more energy to rotate a longer string than to rotate a shorter one. Due to the rotational symmetry in the travelling salesman problem we ignored this issue. That is, the distance from city-i to city-j is the same as the distance from city-j to city-i, so that in this case o-schemata wholly contained within the rotated string are not affected by the operation. In other cases, for instance bin packing, packing item-i before item-j will almost always result in a different fitness than packing item-j before item-i.

Figure 2.
A string between the crossover points may be rotated as it is moved from the parent to the offspring.

Thus inverting a longer string in the bin packing case should require more energy than inverting a shorter string. Finally, mutation (of the allele) is treated similarly. We imagine that as each allele is transcribed there is a probability that its value will change. The probability of mutation is just

exp[-θ_m / kT]

Thus a thermodynamic operator is specified by a triple: (θ_c, θ_i, θ_m). The overall system temperature is varied according to an "annealing" schedule chosen by the analyst. As the temperature approaches zero, the probability that any of the thresholds will be exceeded also approaches zero, genetic activity ceases, and the schemata retention ratio approaches 1. As the temperature approaches infinity, schemata are retained only by chance and the algorithm becomes a random search. The operator therefore provides a retention ratio for o-schemata that is continuously variable from 0 to 1.

Implemented Operators

We have implemented the unified thermodynamic operator and use it in place of the basic genetic operators. Since the exponentiation operation in (eq 2) is computationally expensive, we calculate the expected number of genes to be transcribed at the current temperature before a crossover event occurs. This quantity is given by a geometric distribution [Ref. 7]:

N = ⌈ln U / ln(1 - e^(-θ/kT))⌉

where U is a uniform deviate in the interval (0,1]. The calculation is the same for crossover, inversion and mutation. These quantities should be recalculated if the temperature is raised. In the thermodynamic framework refresh is implemented by allowing the operator to be defined by a 4-tuple (θ_c, θ_i, θ_m, θ_r), where θ_r is the refresh threshold.

(cond ((> *temp* 115) (decf *temp* 1.0))
      ((> *temp* 100) (decf *temp* 0.2))
      ((> *temp* 50)  (decf *temp* 0.1))
      ((> *temp* 36)  (decf *temp* 0.05))
      (t (prog1 (setq *temp* *reset-temp*)
           (if (< *reset-temp* 50)
               (setq *reset-temp* 150)
               (decf *reset-temp* 10)))))

Figure 5.
A LISP routine to implement an annealing schedule.

Test Cases

We have run many experiments with travelling salesman problems of various sizes. Some sample cases are exhibited in figures 6, 7, and 8. We have found that the use of thermodynamic crossover in conjunction with heuristic crossover provides a marked improvement over the use of the extended crossover operator alone or heuristic crossover alone. We have also found that thermodynamic crossover allows the reduction of the population to sizes much smaller than those indicated by Goldberg and Lingle [Ref. 3]. For instance, we use a population size of 12 - 15 individuals for a 200 city problem. Other parameter settings used in the 100 city problem are as follows:

θ_c: 375.0
θ_i: 140.0
θ_m: not applicable
refresh: .07
Initial temperature: 250.0

The annealing schedule shown in figure 5 was used. The heuristic crossover operator was invoked 25% of the time and thermodynamic crossover 75%. A population of size 12 was used. In the following figures "Trials" is the number of individuals that have been evaluated. The fitness measure is in terms of the theoretical optimum value derived in [Ref. 8] and referenced in [Ref. 9]. The fitness quantity used is the optimum tour length divided by the actual tour length. Thus a fitness of .95 is about 5% above the theoretically predicted minimum tour. These results compare favorably with results presented by other researchers using the GA on the TSP, [Ref. 2] for example, both in terms of the required number of fitness evaluations and the resulting route length.

Figure 6. A 50 city problem (population size 10): random tour; 2500 trials, fitness = 0.8976; 5000 trials, fitness = 0.9367; 10,310 trials, fitness = 0.9443.

Figure 7. A 100 city problem (population size 12): random tour; 2500 trials, fitness = 0.7567; 20,000 trials, fitness = 0.9155.

Figure 8. A 200 city problem (population size 12): 2500 trials, fitness = 0.5607; 35,000 trials, fitness = 0.9503; 70,676 trials, fitness = 0.9656.
Conclusion

Although the unified Thermodynamic Operator is not directly motivated by a theory of genetics or thermodynamics, it provides an explicit control over population convergence. Empirically, it has been shown to increase performance through the use of an annealing schedule, a concept borrowed from Simulated Annealing. Analysis of the thermodynamic operator and a better understanding of the relationship between schemata retention and optimal convergence rates is expected to lead to even better performance by customizing the annealing schedule to the requirements of specific applications.

Appendix: O-Schemata Retention Ratio For Inversion

The inversion operator on an individual of length L selects two unique points A and B. The order of the genes between these points is reversed. To calculate the o-schemata retention ratio (the fraction of o-schemata retained by a typical application of the inversion operator) we first calculate the expected number of o-schemata retained and then divide by 2^L, the total number of o-schemata originally represented in the individual. To calculate the expected number retained, E, we sum the product of the expected number retained for each choice of A times the probability of choosing that value of A. Since all the values of A are equally likely (for all x, P(A = x) = 1/L), this probability can be moved outside the summation. For a given choice of A we must consider the case where (A>B) and the case where (B>A):

E_A = P(A>B) E_{A>B} + P(B>A) E_{B>A}

Since all choices for B are equally likely:

P(A>B) = (A-1) / (L-1)
P(B>A) = (L-A) / (L-1)

E_{A>B} can be calculated by summing the product of the actual number to be retained, when A and B are both known (and A>B), times the probability of that selection of A and B given A>B. Once again this probability is the same for all values of A and B given A>B (P = 1/(A-1)) so it can be moved outside the summation. Finally, the number retained when A and B are known (and A>B) can be rewritten as the total number, 2^L, minus the number lost.
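The expected-retention computation can be cross-checked by brute-force enumeration for small L. This is our own verification sketch; following the appendix's counting, it assumes 2^L o-schemata in total and that an inversion between points B < A loses the 2^(1+A-B) o-schemata using genes in the inverted region.

```python
def expected_retained(L):
    """Average the retained count, 2^L - 2^(1 + |A-B|), over all ordered
    choices of two distinct inversion points A, B on a string of length L."""
    total, pairs = 0, 0
    for a in range(1, L + 1):
        for b in range(1, L + 1):
            if a != b:
                total += 2 ** L - 2 ** (1 + abs(a - b))
                pairs += 1
    return total / pairs

def closed_form(L):
    """E = 2^L + 8/(L(L-1)) + 8/(L-1) - 8*2^L/(L(L-1))."""
    return (2 ** L + 8 / (L * (L - 1)) + 8 / (L - 1)
            - 8 * 2 ** L / (L * (L - 1)))

# The enumeration and the closed form agree for every small L tried:
for L in (4, 6, 10, 16):
    assert abs(expected_retained(L) - closed_form(L)) < 1e-6
```

Dividing `closed_form(L)` by 2^L gives the retention ratio, whose dominant correction term is -8/(L(L-1)), so the ratio indeed approaches 1 for large L.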
Generally, any o-schemata using genes from the inversion region will be lost, so the number lost in an inversion between A and B can be shown to be 2^(1+A-B) (the number of o-schemata using genes between A and B). A similar analysis will also work for the other half, where B>A, simply by reversing the roles of A and B and using P = 1/(L-A). Assembling the expression described above leads to a complex expression which reduces to:

E = 2^L + 8/(L(L-1)) + 8/(L-1) - (8 · 2^L)/(L(L-1))

To get the retention ratio we must divide by 2^L:

R = 1 + 8/(2^L · L(L-1)) + 8/(2^L · (L-1)) - 8/(L(L-1))

For large values of L it can be approximated by:

R ≈ 1 - 8/(L(L-1))

which approaches 1 as L becomes very large. From this analysis we can see that inversion will have a tendency to retain a large portion of the o-schemata, not allowing it to explore new schemata effectively. This could help to explain the generally poor performance of inversion in solving order problems cited by Goldberg and Lingle [Ref. 3]. Similar analysis of the other operators has shown that each one has different retention characteristics for o-schemata than for a-schemata.

References

1. Holland, J. H.: Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, 1975.
2. Grefenstette, J., R. Gopal, B. Rosmaita, and D. Van Gucht: Genetic Algorithms for the Travelling Salesman Problem. Proceedings of an International Conference on Genetic Algorithms and their Applications, Pittsburgh, July 24-26, 1985, pp 160-168.
3. Goldberg, D. E. and R. Lingle, Jr.: Alleles, Loci, and the Travelling Salesman Problem. Ibid, pp 154-159.
4. De Jong, K.: Genetic Algorithms: A 10 Year Perspective. Ibid, pp 169-177.
5. Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi: Optimization by Simulated Annealing. Science, Vol. 220, May 1983, pp 671-680.
6. Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling: Numerical Recipes, the Art of Scientific Computing. The Press Syndicate of the University of Cambridge, Cambridge, 1986, pp 326-334.
7. Knuth, D. E.: The Art of Computer Programming, Vol. 2, Addison-Wesley Publishing Co., Reading, MA, 1969, pp 116-117.
8. Beardwood, J., J. H. Halton, and J. M. Hammersley: The Shortest Path Through Many Points. Proc. Cambridge Philosophical Society, Vol. 58, 1959.
9. Bonomi, E. and J. Lutton: The N-city Travelling Salesman Problem: Statistical Mechanics and the Metropolis Algorithm. SIAM Review, Vol. 26, No. 4, October 1984.

TOWARDS THE EVOLUTION OF SYMBOLS¹

Charles P. Dolan, Hughes AI Center and UCLA AI Laboratory
Michael G. Dyer, UCLA AI Laboratory

Abstract

This paper addresses the dual problem of implementing symbolic schemata in a connectionist memory and of using simulated evolution to produce connectionist networks that operate at the symbolic level. Connectionist models may change the way we model symbols, and we take the view that a connectionist architecture that implements symbol processing must have a plausible evolutionary path through which it could have passed. A simulation methodology is introduced which allows symbolic models to be studied at the neural level without expensive computational numerical models. A model for schema processing is proposed and that model is shown to have a plausible evolutionary path using only mutation. The proposed model allows us to implement the memory model of CRAM, a model of learning and planning currently implemented entirely at the symbolic level, using distributed representations. We also propose a more general solution for using genetic algorithms to construct connectionist networks that manipulate explicit symbolic structures.

1. What do genetic algorithms have to offer connectionism?

The question of "what genetic algorithms have to offer connectionism" is an important one because we believe that genetic algorithms are not just another search method for finding a good set of weights in a high-dimensional non-linear weight space.
Genetic algorithms may be part of a solution to the problem of representational opacity. One problem in connectionist systems is that the representations they learn are often incomprehensible to humans. In the case of the NETtalk system for pronouncing English text (Sejnowski 1987) it was necessary to use multi-dimensional clustering techniques to find out what distinctions the hidden layer was making among the inputs. In Rumelhart and McClelland's system (1986) for learning the past tense of verbs, even though the system looked as though it was following rules, no rules could be found directly in the weights of the network. In both these systems, the back-propagation learning algorithm (Rumelhart et al. 1986) learned non-intuitive mappings of symbols (English words) to actions. It is our contention that a network that learns a specific task, but does so by constructing an opaque representation, may yield little to our understanding of human cognitive information processing. General intelligence will probably not be achieved by only picking statistically significant features of the environment². To combat this tendency of connectionist systems we follow the principle of intermodular transparency. This principle states that modules always communicate with a fixed representation that can be given some symbolic interpretation. In the case of connectionist networks, the modules communicate with patterns of activation. Inside the modules any representation at all may be used. The problem we encountered is that there are no connectionist learning algorithms for performing search in the space of configurations of functional units.

¹The second author is supported in part by the JTF program of the DoD, monitored by JPL, and by grants from the ITA Foundation and the Hughes Artificial Intelligence Center.
To attempt a solution to this problem we used a degenerate form of genetic search (Holland 1975), hill climbing, to search a small space of configurations, and obtained the encouraging results described in Section 9. This has led us to formulate the entire problem as a general genetic search problem using the representation described in Section 10. We have yet to show that genetic search yields reasonable structures, but the representation we have constructed shows some promise when measured against the representational issues set forth in (DeJong 1985).

²In (Holland et al. 1987) the point is made that the inductive inferences a system makes are highly dependent on the goals of the system, not just the regularities of the data.

2. Symbol processing in PDP networks

We adopt a functional approach to modelling cognitive processes. For a given process we want to model, we design a complex architecture that models it. Then, piece by piece, we try to replace functional parts of the architecture with networks of simple units, as in parallel distributed processing (PDP) (Rumelhart and McClelland 1986) models. This approach simultaneously answers the questions of how to adapt symbolic models for PDP implementation and how to implement symbols in PDP models. The question which this approach does not answer satisfactorily is "How could these architectures have gotten there?" To answer this question we must provide a plausible evolutionary path through which an architecture could have developed. We hope that genetic algorithms will not only help us establish this path but will give us a tool for exploring the space of possible architectures in a principled manner. To more fully understand our approach it is helpful to see that distributed connectionist models fall on a continuum of functional structure. The continuum ranges from homogeneous to highly functional models.
The homogeneous models, such as (Rumelhart and McClelland 1986), have a large number of identical units and a uniform connection pattern among units. These networks use the same learning rule for all units and rely on emergent properties of these collections of units for modeling all cognitive behavior. The approach we are taking here is highly functional. In these models, the general spirit of connectionism is kept, but large portions of the network may lose the ability to develop emergent properties. In addition, these models also admit larger amounts of external control than other models and more than one type of unit. These are all simplifying assumptions that allow us to model symbolic processing. A more complete discussion of the various degrees of functionality can be found in (Dolan and Dyer 1987).

3. Symbolic schemata and hierarchical structure

To understand why we want to embed symbolic structures in connectionist networks one has to see the importance that such structures have played in previous cognitive models. For example, symbolic schemata have been used extensively in story understanding under various names, such as scripts (Cullingford 1980) and MOPs (Schank 1982). Schema recognition is a fundamental operation in such systems. Figure 1 shows an example schema with some unfilled roles: waiter, owner, food, and payment. This schema is simplified and adapted from the restaurant script in (Cullingford 1980). The notation used in Figure 1 stresses that a schema-based representation can be implemented as a set of relations, one relation for each slot in the schema. Likewise, the constraints among the slots of the schema, which constitute the structural description of the schema, can be represented as sets of relations.
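A schema-as-relations representation of this kind can be sketched as a set of tuples whose unfilled roles are variables. This is an illustrative toy of ours, with invented role names and only a fragment of the restaurant schema:

```python
# "The restaurant" schema as a set of relations, one per slot.
# Unfilled roles are marked with a leading "?".
RESTAURANT_SCHEMA = {
    ("$RESTAURANT", "actor", "?waiter"),
    ("$RESTAURANT", "object", "?food"),
    ("?waiter", "isa", "HUMAN(service-person)"),
    ("?food", "isa", "OBJECT(edible)"),
}

def unfilled_roles(schema):
    """Collect the role variables that still need fillers."""
    return {term for relation in schema for term in relation
            if term.startswith("?")}

def bind_role(schema, role, filler):
    """Role binding: substitute a filler for a role variable everywhere."""
    return {tuple(filler if term == role else term for term in relation)
            for relation in schema}

bound = bind_role(RESTAURANT_SCHEMA, "?food", "BigMac")
```

The constraint relations among slots, which the text mentions as the schema's structural description, would be represented as further tuples in the same set.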
Figure 1 - Example schema, "the restaurant" ($RESTAURANT, with roles such as waiter: HUMAN (service-person) and food: OBJECT (edible), built from PTRANS, MTRANS, and ATRANS relations.)

Figure 2 - Example schema hierarchy ($RESTAURANT at the root, with sub-classes $FAST-FOOD and $FANCY, instances such as MacDonalds, BurgerKing, and MaMaison, and remembered episodes at the leaves.)

In addition to having a rich structural description, symbolic schemata are also often organized into a hierarchy such as the one shown in Figure 2. In these hierarchies, the most general schemata are located at the top with more specific schemata on the levels below. The leaves of the tree are instances of schemata that the program has encountered. This organization supports two operations that give schema processing programs a great deal of their power. The first is inheritance. Once a situation is recognized as being an instance of a schema somewhere in the hierarchy, all the knowledge associated with that schema's super-classes is also available. The second is discrimination. Once a situation is recognized, the program can walk down the branches of the tree to see if the descriptions of any more specific situations apply. Our approach to connectionist symbol processing is to directly implement features such as inheritance and discrimination at the neural level. This allows us to determine the value of such mechanisms when our models are implemented on physiologically more realistic hardware.

4. Symbol processing in CRAM

CRAM is a symbolic model of comprehension and learning from fables, with a schema-based memory (Dolan and Dyer 1986). CRAM has various components that use the schema-based memory: (1) a story comprehension system, (2) a planning system, (3) a symbolic learning system, and (4) an advice and adage-generating system. Such a model is ideal for testing a connectionist-style memory because the memory module must support these four diverse cognitive tasks.
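The inheritance and discrimination operations described in Section 3 can be sketched over a toy hierarchy. The dictionaries and the knowledge attached to each schema below are our illustration, with names adapted from Figure 2:

```python
# Each schema points to its super-class (None at the root).
SUPER = {
    "$RESTAURANT": None,
    "$FAST-FOOD": "$RESTAURANT",
    "$FANCY": "$RESTAURANT",
    "MacDonalds": "$FAST-FOOD",
    "MaMaison": "$FANCY",
}
KNOWLEDGE = {
    "$RESTAURANT": {"roles": ("waiter", "owner", "food", "payment")},
    "$FAST-FOOD": {"payment": "pay before eating"},
}

def inherit(schema):
    """Inheritance: once a schema is recognized, knowledge attached to its
    super-classes becomes available (nearer schemata take precedence)."""
    merged = {}
    while schema is not None:
        for key, value in KNOWLEDGE.get(schema, {}).items():
            merged.setdefault(key, value)
        schema = SUPER[schema]
    return merged

def discriminate(schema, evidence):
    """Discrimination: walk down the branches of the tree to the most
    specific sub-class supported by the evidence."""
    for child in [s for s, sup in SUPER.items() if sup == schema]:
        if child in evidence:
            return discriminate(child, evidence)
    return schema
```

For example, `inherit("MacDonalds")` makes both the inherited role list and the fast-food payment convention available, and `discriminate("$RESTAURANT", {"$FANCY", "MaMaison"})` walks down to the `MaMaison` leaf.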
The memory model for CRAM is based primarily on three operations on schemata: (1) instantiation, (2) role binding, and (3) recognition. For example, when CRAM notices that a character in a story has a particular goal, such as needing a secretary, and notices that the same character flatters a secretary, then CRAM needs to instantiate the appropriate schema and bind all the correct roles. Role binding is different from binding in pattern matchers because it involves modifying the roles of schemata already instantiated in short term memory. For this reason we cannot use Touretzky and Hinton's solution (Touretzky and Hinton 1985) for production systems. For a discussion of role binding in connectionist memories see (Dolan and Dyer 1987).

5. Distributed symbols and relations

One way to think of symbols in distributed connectionist models is as "bit strings". These "bit strings" are not memory addresses, as are symbols in traditional implementations, but are feature vectors. An example of this is found in (McClelland and Kawamoto 1986), where each symbol is classified along a number of dimensions. A fixed number of bits is allocated to each dimension, one for each possible classification, and the bit string is formed from that set of features. As an example from (McClelland and Kawamoto 1986), the dimensions and features are:

DIMENSIONS   FEATURES
GENDER       male, female, neuter
SIZE         small, medium, large

and the symbol mappings are:

SYMBOLS      VECTORS (GENDER SIZE)
John   -->   (100 001)
Mary   -->   (010 010)
Book   -->   (001 100)

By constructing symbols this way, we ensure that symbols with similar meanings will have similar representations. Relations can then be formed from these symbols in a very straightforward manner, as in (Hinton 1981). Assume that we have defined the symbol "hit" for a relation as (100 100); then we can represent "John hit Mary" with

(100001 100100 010010)

for (ROLE1 REL ROLE2).
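The feature-vector construction can be sketched directly. This is a toy version of ours; the dimension table follows the example above, and each symbol's feature values are read off the vectors shown there:

```python
# One bit per feature value along each dimension, concatenated.
DIMENSIONS = (
    ("GENDER", ("male", "female", "neuter")),
    ("SIZE", ("small", "medium", "large")),
)

def make_symbol(**features):
    """Build a feature-vector 'bit string' from one value per dimension."""
    bits = []
    for dim, values in DIMENSIONS:
        bits.extend(1 if v == features[dim] else 0 for v in values)
    return tuple(bits)

john = make_symbol(GENDER="male", SIZE="large")     # (1,0,0, 0,0,1)
mary = make_symbol(GENDER="female", SIZE="medium")  # (0,1,0, 0,1,0)
hit = (1, 0, 0, 1, 0, 0)  # the relation symbol defined in the text

# "John hit Mary" as a (ROLE1 REL ROLE2) vector:
john_hit_mary = john + hit + mary
```

Because symbols sharing feature values share bits, symbols with similar meanings automatically get similar vectors, which is the point of the construction.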
This method of representation is compatible with the neutral/distributed and biased methods for intermodule communication. Sets of relations can be represented as a pattern of activation on a set of units using conjunctive coding (Hinton et al. 1986). An example of such a representation is given in Figure 3. Here we allocate a cube of units where each dimension of the cube is the length of one symbol. Each unit in the cube represents a three-way conjunction of the features of the ROLE1, REL, and ROLE2 positions of a relation. In this way, multiple relations can be stored on the same set of units. This approach is similar to the design of the working memory for the production system in (Touretzky and Hinton 1985), except in that design, each element of the working memory was assigned a random subset of each of the possible symbols for the three positions in a relation. In our design, each unit in the working memory plays a very specific role in the semantic meaning of the relations stored in working memory.

Figure 3 - Conjunctive coding of relations in working memory

6. Methodology

In order to understand the meanings of their symbols, symbolic researchers need a model of a network that allows them to study the interactions of micro-features and micro-inferences (Hinton et al. 1986) while still being able to specify symbolic interactions at a relatively high level. The approach we propose here is to use a model of groups of idealized neurons. One useful side effect of this approach is that simulations are less complex compared to those of individual neurons. By making the objects of study functional groups of neurons, modelers can study architectural issues, but only after it has been shown that groups of units can be made to function in the desired way using detailed numerical simulations.
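The conjunctive coding of Figure 3 can be sketched as a cube of binary units. This illustration of ours uses 2-bit symbols for brevity:

```python
def conjunctive_code(role1, rel, role2):
    """A cube of units: unit (i, j, k) fires iff bit i of ROLE1, bit j of
    REL, and bit k of ROLE2 are all on -- a three-way conjunction."""
    return [[[a & b & c for c in role2] for b in rel] for a in role1]

def superimpose(mem_a, mem_b):
    """Multiple relations share the same set of units: a unit is on if it
    is on in either stored relation."""
    return [[[x | y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(plane_a, plane_b)]
            for plane_a, plane_b in zip(mem_a, mem_b)]

# Two relations stored on the same working-memory cube:
memory = superimpose(conjunctive_code([1, 0], [1, 1], [0, 1]),
                     conjunctive_code([0, 1], [1, 1], [1, 0]))
```

With symbols of length n the cube has n³ units, which is the cost paid for letting every unit carry a specific semantic role rather than a random assignment.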
Using this method, we simulate a group of units, called a "unit-set", as an object that responds to messages in the sense of object-oriented programming. This is similar to the approach taken in (Eilbert and Salter 1986). Eilbert and Salter modeled each layer of the network as an object, and object parameters determined the probability of a particular unit exciting either the unit directly below it or some random unit in the next layer. In this way they were able to easily simulate complex hierarchical networks. In Section 10 we describe a way of encoding a configuration of unit-sets that is suitable for use by a genetic algorithm. Underlying this methodology is the idea that a network architecture can be judged by how well it minimizes three quantities: (1) the number of unit-sets, (2) the complexity of the control layer, and (3) the number of different messages used. We believe that symbolic processing can be accomplished with a small number of messages, including:

(1) SEND-OUTPUTS, to trigger a unit-set to excite the unit-sets to which it is connected
(2) WINNER-TAKE-ALL, to cause a mutually inhibitory unit-set to settle on a small number of active units
(3) REINFORCE, to invoke reinforcement learning
(4) DECAY, to cause a unit-set to decay the activation of each unit towards zero

Between unit-sets are weighted links (connections) of the type normally used in neurally inspired models. Static links are used to pass data between sub-networks. Modifiable links are used to make one unit-set learn to respond to a pattern of activation from another unit-set. In cases where a learning algorithm is used, a message of the appropriate type is sent to the unit-set. For units that learn by reinforcement, for example, the unit-set is sent a REINFORCE message. Active units within the unit-set have their links modified according to reinforcement (Barto et al. 1981).
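A minimal sketch of the unit-set object and three of its messages follows. This is our illustration of the methodology, not the authors' simulator; a REINFORCE handler would similarly adjust the weights on modifiable links.

```python
class UnitSet:
    """A functional group of idealized units, simulated as a single object
    that responds to messages in the object-oriented sense."""

    def __init__(self, size):
        self.activation = [0.0] * size
        self.links = []  # list of (target UnitSet, weight matrix) pairs

    def send_outputs(self):
        """SEND-OUTPUTS: excite the unit-sets this one is connected to."""
        for target, weights in self.links:
            for j in range(len(target.activation)):
                target.activation[j] += sum(
                    a * weights[i][j] for i, a in enumerate(self.activation))

    def winner_take_all(self):
        """WINNER-TAKE-ALL: mutual inhibition settles on one active unit."""
        winner = max(range(len(self.activation)),
                     key=self.activation.__getitem__)
        self.activation = [1.0 if i == winner else 0.0
                           for i in range(len(self.activation))]

    def decay(self, rate=0.5):
        """DECAY: move every unit's activation towards zero."""
        self.activation = [a * rate for a in self.activation]

clique = UnitSet(3)
clique.activation = [0.2, 0.9, 0.4]
clique.winner_take_all()   # settles on the single most active unit
```

Simulating at this granularity keeps the control layer explicit, which is what makes the configurations of unit-sets amenable to the genetic encoding of Section 10.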
The static links are used to implement the architecture and should be thought of as data paths. The modifiable links are used to store knowledge.

7. Evolving functional structure

Our methodology in studying distributed connectionism lends itself to answering questions pertaining to adaptation at the architectural level. If we assume a set of architectural building blocks, such as winner-take-all networks, hidden layers, conjunctive coding networks, and unit-sets for individual relations and symbols, then we can ask whether configurations of these building blocks convey some adaptive advantage over other configurations, and what configurations of building blocks some principled model of evolution leads us to postulate. In order to discuss evolution of functional structure, we need an architecture from which to start developing that structure. An example of such an architecture is given in Figure 4. The task for the network is to learn to reproduce the patterns from the training set on the output units when excited by noisy versions of those patterns on the input units. If the training set is composed of non-linearly separable vectors then this is clearly a problem that requires hidden units. These hidden units are where we shall try to learn functional structure. If we start with the network as described above, using a learning procedure such as back-propagation (Rumelhart et al. 1986), the hidden units will learn to form features from the input patterns, and the output layer will learn to reproduce those patterns based on the features formed by the hidden layer. The features that the network learns will depend on the width (i.e. number of units) of the hidden layer and to a large degree on chance, which is based on the order of presentation of the training set and the initial settings of the weights.
Figure 4 - Simple auto-associative network (noisy training data feeds the input layer, which connects through modifiable weights to a hidden layer and then through modifiable weights to an output layer; clean training data is used for teaching)

As was shown in (Sutton 1986), search in the weight space is slowed by strict gradient descent. Because gradient procedures "creep" along a "bumpy" error surface they are not likely to come to rest with the same features that humans have. They are more likely to stop in some "pot hole", and if that pot hole allows the network to learn the training set without errors, the learning algorithm will not get out. Also there is no reason to expect that the features that people see in the training data are better than the "pot holes" in terms of the error surface. We believe that general intelligence does not emerge from a single network of interconnected units. In order to learn large symbolic structures of the type that people use, specific architectures will be required. Such networks would have fewer stable states but would, we hope, generalize faster and would reach generalizations that are closer to those that people make. Unfortunately, there currently are no algorithms along the lines of back-propagation that learn functional structure. There is, however, a process in nature, evolution, which does yield functional structure. There is also a class of algorithms, genetic algorithms (Holland 1975), which are idealizations of evolution and have been shown to be effective for high-dimensional, non-linear adaptive problems. These algorithms have been applied to such problems as adaptive control (see (DeJong 1985) for a summary) and the traveling salesman problem (Goldberg and Lingle 1985; Grefenstette et al. 1985).
As a first attempt at search in the space of configurations of functional units we used an algorithm that is a degenerate form of a genetic search algorithm, which uses a population of 2, one genetic operator, mutation, and for each generation chooses the most fit structure and a mutation of that structure for the next generation. In terms of Holland's characterization of reproductive plans (1975), this one is³:

R1(Pcrossover = 0, Pinversion = 0, Pmutation = 1, <1>)

Even with this simple degenerate case of genetic algorithms we were able to get improved performance from the network. This seems to indicate that for some problems, such as the one described in Section 9, search at the level of configurations of functional units is very important. Search for network structure at the link level, such as that demonstrated in (Ackley 1985), will yield learning that is too slow to operate on an evolutionary time scale. Also, this type of learning will yield systems which do not have intermodular transparency. It is better to assume a basic level of architectural organization, and then search for a solution at that level. Thus, we can evolve systems that are highly fit. This approach also agrees with the basic assumption of genetic search, that search occurs in the space of the genotype (genetic code), not the phenotype (physical manifestation of the organism). To require that individual links are coded for in the genotype would mean that the genetic code is the size of the network, and this is clearly not the case with people. The proper measure of fitness for an architecture is how well it learns the training set. A fit architecture must be able to deal with noisy inputs, should be stable, and should learn fast. In addition, an architecture should be flexible enough to be put in a different environment from that which formed it and still perform well.
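The degenerate reproductive plan, a population of 2 with mutation as the only operator, is ordinary hill climbing and can be sketched generically. The fitness function used here is a toy one-max problem of ours, not the network-learning fitness of the paper:

```python
import random

def degenerate_genetic_search(initial, mutate, fitness, generations):
    """Population-of-two genetic search: each generation pits the current
    structure against one mutation of it; the fitter structure survives."""
    best, best_fit = initial, fitness(initial)
    for _ in range(generations):
        challenger = mutate(best)
        challenger_fit = fitness(challenger)
        if challenger_fit > best_fit:
            best, best_fit = challenger, challenger_fit
    return best

# Toy usage: maximize the number of 1-bits in a string of length 8.
random.seed(0)

def flip_one_bit(bits):
    i = random.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

result = degenerate_genetic_search([0] * 8, flip_one_bit, sum, generations=500)
```

Swapping in a larger population with crossover and differential reproduction recovers the general reproductive plan; the degenerate form trades that search power for a directly interpretable evolutionary path.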
In our experiment we used a measure of fitness for a structure A in an environment E as below:

f(E, A) = 1 / (a·e + b·e_T + 1)

where T is the time the network is allowed to learn, e is the total number of errors the network makes, and e_T is the number of errors the network produces at the end of a run. The constants a and b are used to weight these two measures of network efficiency, and in the experiment described in Section 9 they are both set to 1.

8. An architecture for schema processing

In this section we present an architecture for schema processing that performs the functions of schema recognition, instantiation and role binding. The central component of the architecture is the working memory. The working memory is a set of units that conjunctively code sets of relations, such as the one in Figure 3. The full architecture is shown in Figure 5. The schema memory performs the functions of recognition and instantiation. The procedural memory could be implemented in a number of ways: as a connectionist production system (Touretzky and Hinton 1985), as a classifier system (Holland and Reitman 1978), or as a symbolic production system. The full architecture is described more fully in (Dolan and Dyer 1987).

Figure 5 - Schema processing architecture (procedural memory, schema memory, and working memory)

Figure 6b shows the details of the schema memory. The memory is composed of many winner-take-all cliques. Each clique discriminates a mutually exclusive set of possible schemata based on the contents of working memory. In addition to recognizing schemata, each unit in the cliques also excites the units in working memory to instantiate the remainder of the schema. In addition to the connections to the working memory, the cliques are connected to each other. These links implement inheritance and discrimination. Part of the tree in Figure 6a is shown in the schema memory of Figure 6b.

³This can also be characterized as simple hill climbing.
Strong positive weights, shown in the figure as thick black lines from the sub-class units to super-class units, implement inheritance. Weak links, shown in the figure as thin black lines from the super-class units to the sub-class units, implement discrimination. Once a super-class is active, it slightly activates all of its sub-class units in the same clique. If there is enough additional evidence in working memory to activate any of the sub-classes, the one with the most evidence wins. This structure can also implement exception handling. To override a default, a strong negative weight, shown in the figure as a thick gray line, is formed between the sub-class and the super-class to be overridden.

Figure 6 - Example schemata in memory ((a) a schema hierarchy; (b) the corresponding schema memory, with excitation to and from working memory)

It is clear that this organization can be "wired up" to recognize and instantiate schemata according to their symbolic definitions. What is not clear is how plausible this architecture is. The question we must answer is how this architecture got there. In the next section we describe an experiment in simulated evolution that shows that this architecture is very plausible.

9. An experiment in the evolution of structure

To test the plausibility of this architecture, we designed an experiment to see if simulated evolution would arrive at anything similar to the architecture in Figure 6b. In genetic search algorithms, a genotype is defined from which phenotypes can be constructed. A population of the genotypes is then put in a pool and altered using genetic operations such as crossover, inversion, and mutation (Holland 1975). Differential reproduction based on estimated fitness, combined with recombination using crossover, allows the system to search through the space of coadapted alleles with provable efficiency. In our experiment we used a population of two, a single organization of the network and one mutation of it, and determined which was most fit.
The winner of the competition moved on to the next generation. The reason for not using a large gene pool is that we were only interested in finding a plausible evolutionary path.

9.1 Experimental design

The unit of mutation in our experiments is the winner-take-all clique. Within each clique, only the unit with the strongest activation is allowed to fire. In addition, each unit in a clique has an adaptive threshold. If the unit does not fire often enough, it lowers its threshold; if it fires too often, it raises its threshold. A unit decides whether or not it is firing "enough" based on the number of reinforcement cycles the network has gone through since it last fired. The effect of the adaptive threshold is to make each clique try to maximize its information-carrying capacity. If a unit is either not firing at all or firing all the time, it is not carrying any information. It would be easy to define a genotype for this architecture by letting a certain number of genes code for the number of cliques and other genes for the relative distribution of clique sizes. However, for this experiment we define four types of mutation directly on the architectural phenotype[4]: (1) splitting a clique, (2) merging two cliques, (3) deleting a clique, (4) adding a new clique with two units. Since cliques also have modifiable connections among each other, these four mutations allow the system to form a wide variety of architectures.

[4] Because the mutations are performed at the level of functional blocks, this can be considered as search in the space of the phenotype. Section 10 details a representation that allows search directly in the space of the genotype.

To ensure that the system does not simply learn an architecture that is tuned to a specific input set, for each generation we use a new training set. The only thing in common among all training sets is that they are all tree structured. For example, we might want a network that
was able to handle trees of depth 3 and width 4, so each generation we generated a random tree of depth 3 where the branching factor varied from 0 to 4 at each node. For example, some of the training set for the tree shown in Figure 6a is shown in Figure 7.

Feature vector (A B C D E F G H I J)    Example
1. (1 0 1 0 0 0 0 0 0 0)                C
2. (0 1 0 1 0 0 1 0 0 0)                D
3. (0 0 0 0 0 1 1 0 1)                  J

Figure 7 - An example of hierarchical features

For example, to represent a member of the class C as in test vector 1, turn on bit C and turn on bit A, because A is a super-class of C, but never turn on bit B, because B and C are in mutually exclusive sub-classes of A. Likewise, to represent a member of class F, we turn on the bits on the path from F to the root: F, B, and A. To represent a member of class D, as in test vector 2, we turn on the bits for D, B, and G but not A, because there is a cancellation link from D to A. A set of features such as these were presented to a network such as the one in Figure 4, where the hidden layer was structured as a set of cliques. In addition, noise was added to the training vectors by flipping bits. The average number of bits flipped was proportional to the depth of the trees being learned. For 2-level trees it was 1 bit; for 4 levels, 2. The experiments started with a hidden layer with just one clique. The clique was large enough to discriminate among a set of feature vectors from an "average bushy" tree of the desired depth. In this case we were trying to evolve an architecture for learning trees of depth three. The rationale for starting with a single large clique is that this structure works extremely well in an environment with no noise and with a training set whose size perfectly fits the width of the hidden layer. We used learning to evaluate the fitness of the various architectures. Competitive learning (Rumelhart and Zipser 1986) was used to modify the weights from the input to the hidden layer and the weights among the cliques.
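The path-to-root encoding above can be sketched as follows. The class tree and the cancellation link here are a small hypothetical fragment in the spirit of Figure 6a, not the paper's exact data.

```python
# Hypothetical class tree: child -> parent links, plus a cancellation
# link that suppresses inheritance of an ancestor (here D cancels A).
PARENT = {"B": "A", "C": "A", "D": "B", "E": "B", "F": "B"}
CANCEL = {("D", "A")}
FEATURES = ["A", "B", "C", "D", "E", "F"]

def feature_vector(cls):
    """Turn on the bit for cls and for every ancestor on the path to
    the root, except ancestors blocked by a cancellation link."""
    bits = {f: 0 for f in FEATURES}
    node = cls
    while node is not None:
        if (cls, node) not in CANCEL:
            bits[node] = 1
        node = PARENT.get(node)
    return [bits[f] for f in FEATURES]
```

With this fragment, feature_vector("C") sets A and C but never B, and feature_vector("D") sets D and B but not A, because of the cancellation link.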
Hebbian learning (Hebb 1949) was used to modify the weights from the hidden layer to the output layer. We wanted to select for three features of the learning performance: speed, stability, and accuracy. To select for speed, we only allowed the networks ten iterations through the training data. Considering the fact that noise was introduced, this was not a large number of trials. To select for stability and accuracy we kept track of two quantities during a given generation: the total number of errors a network made and the current performance. At the end of ten iterations, the networks were compared; if a network had lower scores on both counts it moved on to the next generation. If the decision was split, then it was settled at random.

9.2 Results

By watching the performance of various organizations we were able to derive some rules of thumb for predicting the fitness of an architecture. These observations are not quantitative, but serve to illustrate why one particular organization had a selective advantage over another. Flat organizations (i.e. a single, wide clique) are extremely efficient. However, they are unstable and very susceptible to noise. Very often, noise will force the competitive learning algorithm to change the encoding from the input layer to the hidden layer, and this will undo all the learning performed between the hidden and output layers. It is easy to construct an organization of binary cliques that exactly mirrors the hierarchy. This structure is efficient and extremely stable, but if the tree changes substantially in the next generation, a finely-tuned structure will not perform well. Organizations with lots of structure do extremely well. A good mix of various clique sizes allows the organization to perform well in a wide range of environments (i.e. all trees of depth three and branching factor 0-4).

9.3 Analysis

In general we found that networks tended to mutate towards increasing complexity.
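The two-way tournament of Section 9.1 can be sketched as below; the score pairs and the function name are our own framing of the comparison rule.

```python
import random

def winner(score_a, score_b, rng=random.Random()):
    """Each score is a (total_errors, final_errors) pair. The network
    that is no worse on both counts moves on to the next generation;
    a split decision is settled at random, as in the experiment."""
    a_wins = all(x <= y for x, y in zip(score_a, score_b))
    b_wins = all(y <= x for x, y in zip(score_a, score_b))
    if a_wins and not b_wins:
        return "a"
    if b_wins and not a_wins:
        return "b"
    return rng.choice(["a", "b"])
```

Only dominance on both error counts is decisive; a network that trades total errors against final errors faces a coin flip.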
Since flat organizations are not particularly robust, a mutation in which a new clique is added will increase stability and will therefore be selected for. The mutation that adds another small clique gives only a small advantage, if any, but allows the organization to mutate to one in which two small cliques combine. This yields a large advantage over multiple small cliques or one single, wide clique. One unfortunate feature of this method is that the environment can kill off a promising line of evolution by accident. As a network is growing into a very complex organization, with many cliques of various sizes, if the training set for a particular generation is very bushy it may overload the capacity of the network to learn the structure of the tree. In these cases, a flat architecture with the same number of units will be more fit, since it can encode one training example on each unit of the hidden layer. This is a case of a representational boundary as described in (DeJong 1985). Because of that fact, the first architectures to evolve are often large, flat, single-clique architectures. Once these cliques split, however, they form very robust hierarchical architectures. It seems quite difficult for the system to develop hierarchical architectures from the start by incrementally adding more and more cliques. The reason is that the intermediate stages of such a development are subject to extinction by difficult environments.

10. A proposal for genetic search in the space of functional structure

The primary problem in any genetic algorithm is choosing the representation. DeJong (1985) has identified several issues that need to be addressed in choosing a representation. One is the problem of schemata[5] that confer above-average performance: (1) do they exist?, and (2) are they at all preserved by the genetic operators?
A second one is the problem of representational boundaries: are there representations which confer above-average performance, but which cannot be improved upon without a radical change in genotype? Although it is possible to devise such an encoding for the simple structures described in Section 9, it is difficult to devise a fixed encoding which would be capable of describing structures such as the one shown in Figure 5. A solution to these problems may be found in adaptive codings as suggested in (Holland 1975). In that formulation, complex productions are expressed as strings of a ten-member alphabet, which has a rather complex interpretation scheme, and a reproductive plan is run on that representation. A similar, but more elegant, formulation is found in (Holland and Reitman 1978), in which each production is a bit string from the alphabet {1,0,#}, where # stands for "don't care". Productions are matched against, and post bit-patterns to, a blackboard whose contents also determine the actions of the system (in our case, which unit-sets to create and how to connect them). A further refinement of this formulation, called classifier systems, along with the bucket-brigade algorithm (Holland et al. 1987), allows credit (fitness) to be apportioned among a chain of productions responsible for producing a structure of above-average fitness. The genetic algorithm is then run either by treating an entire set of productions as a genotype, as in (Smith 1983), or by using each production as a genotype, as in (Holland and Reitman 1978). In the second option the production system is treated as a population and the strength of a rule is its measure of fitness. This style of representation holds great promise for our task because these classifier systems can be used to compute almost anything (anything, if we give them an infinite blackboard). However, we must be cautious of DeJong's pitfalls for genetic algorithms (1985).
To begin with, we structure the condition and action parts of a production's bit vector with the following fields:

(block-type, x1, y1, z1, unit-type, projection-type, x2, y2, z2, link-type, tag)

[5] The word schema is used in two different ways in this paper, to indicate a complex symbolic structure and to indicate a set of alleles on a genotype as in (Holland 1975). All previous references have been to the first meaning. All subsequent ones shall be to the second.

This indicates a functional block of a certain type (e.g. relation vector, symbol vector, relation cube, or hidden layer of a particular size) at location (x1, y1, z1), with a certain type of units (e.g. linear threshold, with or without refractory period, etc.), projecting with a certain pattern (e.g. random, one-to-one, or layered) to a block at location (x2, y2, z2), using a link with a particular learning rule (e.g. reinforcement or back-propagation). This same format is used both to test for blocks and to assert the existence of blocks onto the blackboard. The tag field can be used to link arbitrary productions. Now we can attempt an informal evaluation of this representation in terms of DeJong's metrics. Taking the position that either individual productions or the entire system is a genotype, we can see that the position codes constitute fairly robust schemata. However, they convey no particular advantage until combined with a particular block type, and they make even stronger schemata when combined with a projection to another location. For example, a production with #'s in the positions for both sets of (x,y) coordinates would create a layer of identical unit sets, each one connected to the unit set either above or below it depending on the values of the z coordinates. Now, however, we are talking about schemata almost as long as the genotype if we use individual productions as the population. Clearly, we need to use the first option, i.e. the entire production system as a genotype, to make such a linkage viable.
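A condition in this format, with # as "don't care", is matched against blackboard messages as in classifier systems. The field layout below follows the paper's list, but the concrete field values in the usage are hypothetical.

```python
# Fields of a production's condition/action vector, per the paper:
# (block-type, x1, y1, z1, unit-type, projection-type,
#  x2, y2, z2, link-type, tag)

def matches(condition, message):
    """A condition matches a blackboard message when every field is
    either equal or the '#' wildcard."""
    return len(condition) == len(message) and all(
        c == "#" or c == m for c, m in zip(condition, message))
```

A production with #'s in both coordinate pairs matches blocks at any position, which is what makes the position codes such strong schemata when combined with a block type.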
With this method there is also a possibility that the message field can be used to form linkages between productions. These linkages, however, will be extremely weak unless we can use inversion to strengthen them. This representation looks like it will have interesting representational boundaries. First, it is not clear whether most random genotypes will even produce a configuration with two connected functional blocks. Second, once there is a set of productions represented in the population which produce a structure with a certain fitness, it is not likely that an operation which modifies those productions will result in improved fitness. It is much more likely that it will completely inhibit the creation of the structure and result in much lower fitness. This conforms very well to our notion of how epistasis[6] should work in functional evolution. It is much more likely that productions will be found which add other structures on top of the ones already found to be useful. This is similar to the situation found in the evolutionary development of the brain. The structures found in higher organisms are built on top of those found in lower organisms (Kolb and Whishaw 1980).

[6] Epistasis is the phenomenon, in genetics, of a reaction that depends on several enzymes and will not proceed until all the enzymes are present.

In order to test this representation, we propose three steps. First, we will continue working with hand-coded networks of functional units, but restrict our attention to networks that can be compactly represented by a production system. Second, we will take those same representations and see what other systems are constructible from them, since classifier systems can be non-deterministic (when a classifier matches more than one element of working memory, or when more than one classifier matches).
Last, using a corpus of symbolic manipulation tasks, such as story comprehension and question answering, we will test the ability of a genetic algorithm to construct neural-like networks which can process symbolic structures.

11. Current status

The architecture for the schema memory is currently implemented in Scheme on a workstation. The model has been integrated with a long-term schema-based memory and a procedural memory implemented serially at the bit-level. The system can currently understand fragments of one-paragraph stories. The largest network we have simulated so far stores six schemata with approximately seven relations each. The relations are represented with eight bits for each symbol. A more complete description of this system can be found in (Dolan and Dyer 1987).

12. Conclusions

We have found that search in the space of configurations of sub-networks can yield a network that learns human-like generalizations. The resulting architecture learns fast, and forms the same type of hierarchical structures that are formulated as symbolic models in humans. By using this approach we can now evaluate various symbolic models at the micro-feature level, to see if their symbolic representations hold up under a detailed analysis. We now have a hope of using true genetic search in the space of configurations to find architectures for general intelligence. This would probably not be the case had we not shown that simple hill climbing in a simple case was able to yield a network that out-performed the simplest flat network.

13. References

Ackley, D. H. (1985). A Connectionist Algorithm for Genetic Search. Proceedings of the First International Conference on Genetic Algorithms and their Applications, 121-135.

Barto, A. G., Sutton, R. S., and Brouwer, P. S. (1981). Associative search networks: A reinforcement learning associative memory. Biological Cybernetics, 40, 201-211.

Cullingford, R. E. (1981). SAM, in R. C.
Schank and C. K. Riesbeck (Eds) Inside Computer Understanding: Five Programs Plus Miniatures. Lawrence Erlbaum Associates.

DeJong, K. (1985). Genetic Algorithms: A 10 Year Perspective. Proceedings of the First International Conference on Genetic Algorithms and their Applications, 169-177.

Dolan, C. P. and Dyer, M. G. (1986). Encoding Knowledge for Planning, Learning, and Recognition, in Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 488-499.

Dolan, C. P. and Dyer, M. G. (1987). Symbolic Schemata in Connectionist Memories: Role Binding and the Evolution of Structure. UCLA AI Laboratory Technical Report, UCLA-AI-87-11.

Eilbert, J. L. and Salter, R. M. (1986). Modeling Neural Networks in Scheme. Simulation, 46(5), 193-199.

Goldberg, D. E. and Lingle, R. (1985). Alleles, Loci, and the Traveling Salesman Problem. Proceedings of the First International Conference on Genetic Algorithms and their Applications, 154-159.

Grefenstette, J. J., Gopal, R., Rosmaita, B. J., and Van Gucht, D. (1985). Genetic Algorithms and the Traveling Salesman Problem. Proceedings of the First International Conference on Genetic Algorithms and their Applications, 160-168.

Hebb, D. O. (1949). The organization of behavior. Wiley.

Hinton, G. E. (1981). Implementing Semantic Networks in Parallel Hardware, in G. E. Hinton and J. A. Anderson (Eds) Parallel Models of Associative Memory. Lawrence Erlbaum Associates.

Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. (1986). Distributed Representations, in D. E. Rumelhart and J. L. McClelland (Eds) Parallel Distributed Processing, Volume 1. MIT Press.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press.

Holland, J. H. and Reitman, J. S. (1978). Cognitive Systems Based on Adaptive Algorithms, in D. A. Waterman and F. Hayes-Roth (Eds) Pattern-Directed Inference Systems. Academic Press.

Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. (1987).
Induction: Processes of Inference, Learning, and Discovery. MIT Press.

Kolb, B. and Whishaw, I. Q. (1980). Fundamentals of Human Neuropsychology. W. H. Freeman and Company.

McClelland, J. L. and Kawamoto, A. H. (1986). Mechanisms for Sentence Processing: Assigning Roles to Constituents, in J. L. McClelland and D. E. Rumelhart (Eds) Parallel Distributed Processing, Volume 2. MIT Press.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning Internal Representations by Error Propagation, in D. E. Rumelhart and J. L. McClelland (Eds) Parallel Distributed Processing, Volume 1. MIT Press.

Rumelhart, D. E. and McClelland, J. L. (1986). Parallel Distributed Processing, Volume 1. MIT Press.

Rumelhart, D. E. and Zipser, D. (1986). Feature Discovery by Competitive Learning, in D. E. Rumelhart and J. L. McClelland (Eds) Parallel Distributed Processing, Volume 1. MIT Press.

Rumelhart, D. E. and McClelland, J. L. (1986). On Learning the Past Tenses of English Verbs, in J. L. McClelland and D. E. Rumelhart (Eds) Parallel Distributed Processing, Volume 2. MIT Press.

Schank, R. C. (1982). Dynamic memory: A theory of reminding and learning in computers and people. Cambridge University Press.

Sejnowski, T. J. (1987). From Signals to Symbols in Connectionist Networks. Lecture at UCLA, May 1987.

Smith, S. F. (1983). Flexible learning of problem solving heuristics through adaptive search. Proceedings of the Eighth International Joint Conference on Artificial Intelligence, 422-425.

Sutton, R. S. (1986). Two Problems with Back Propagation and Other Steepest-descent Learning Procedures for Networks. Proceedings of the Eighth Annual Conference of the Cognitive Science Society.

Touretzky, D. S. and Hinton, G. E. (1985). Symbols Among the Neurons: Details of a Connectionist Inference Architecture.
Proceedings of the Ninth International Joint Conference on Artificial Intelligence, 239-243.

SUPERGRAN: A connectionist approach to learning, integrating genetic algorithms and graph induction

G. Deon Oosthuizen
Dept. of Computer Science, University of Strathclyde
26 Richmond Street, Glasgow G1 1XH, Scotland

Abstract

The operation of genetic algorithms is based on the recurrent selection of strings of values according to their usefulness to the system and the subsequent creation of new strings, by application of genetic operators, in an effort to obtain more appropriate strings. A learning algorithm, based on a connectionist approach to knowledge representation, can be used to supervise and enhance the application of genetic operators on the basis of analysing the underlying features of the strings. The genetic algorithm generates new strings. The learning algorithm utilises the strings to induce new concepts (schemata) and these are used to advise the genetic algorithm, which in turn produces new strings. Combining the learning capabilities of the two components results in a versatile system that is able to digest eagerly and adapt gracefully.

Introduction

The application of genetic algorithms has already delivered very interesting results in various areas, including optimization [Grefenstette 1986], cognitive modelling [Holland & Reitman 1978], and classification and control [Holland 1986]. Genetic algorithms have two inherent characteristics that make them highly suitable for learning by discovery: on the one hand, sufficient variability is maintained to prevent convergence to local optima. On the other hand, categorization and recombination yield powerful implicit parallelism. Yet, the overall process is based on the appropriateness of strings. The underlying relations and dependencies between features are not dealt with directly in the generation of new strings. We employ another learning method, the GRAND algorithm for GRAph iNDuction[1]
developed by the author [Oosthuizen et al. 1987b], to supplement the genetic operators by providing knowledge about the underlying relations and dependencies between features. The aim is to use "deep" knowledge to direct the application of the genetic operators and thereby to enhance the performance of the overall process. Since we are concerned with learning systems, we shall devote our attention to the application of genetic algorithms to the class of message-passing rule-based systems called classifier systems [Holland 1986], also referred to as Holland classifiers [Schrodt 1986]. We shall use the term classifier system to refer to a total learning system consisting of a rule base, an apportionment of credit algorithm and, in particular, a genetic algorithm. Although we shall use the terminology employed in that context, the ideas described here are applicable to the broader class of optimization procedures called genetic algorithms. Thus, our use of the term string refers to the members of the population, applying to the condition/action part of a classifier (or their concatenation), but also to a structure or vector in other optimization procedures. In this paper we describe how, by combining the inductive capabilities of genetic algorithms and graph induction, we obtain a competent learning system. We describe the system by first looking at the role of schemata in classifier systems. We then describe the method of graph induction in the context of classifier systems. Subsequently, we introduce the idea of a supervised classifier system. Finally, we highlight the characteristics of the integrated system, called SUPERGRAN, as well as some crucial implementation considerations.

[1] Graph induction = induction by graphic representation and graph manipulation.

Schemata

Schemata play a central role in the application of genetic algorithms.
They are used as building blocks from which new strings are constructed, and are also used (as tags) to facilitate sophisticated transfer of knowledge from one situation to another [Holland 1986]. Schemata are basically generalized descriptions of categories of strings. A schema is expressed as a string containing particular symbols in particular positions, indicating that all strings in that category have similar symbols in the equivalent positions. Unspecified positions, i.e. positions that "don't matter", are filled with *'s. In classifier systems, schemata define subsets of the space of possible conditions or actions, i.e. subsets of the space of possible strings. A schema thus constitutes a characteristic description for a set of strings. We shall merely refer to a schema as a description of a set. The schema also serves as an identifier for a set of strings. We shall therefore sometimes just speak of a schema when referring to the set it represents. In the method described here, schemata are employed in an additional role. Information (already captured in schemata) regarding the composition of the "best" strings in the current population is used directly, and forms the basis for concept generation and clustering. The kind of string categorization described here is based on the combination of features present in schemata and strings, and is therefore different from the use of tags, which also results in schema classification/string clustering. Tags relate schemata on the basis of the temporal sequences of categories [Holland 1986]. We consider the non-temporal association of schemata, based on their inherent properties.

Graph Induction

Graph induction evolved from research regarding the connectionist approach to knowledge representation.
The method thus originated against the background of a study of the learning capabilities of connectionist architectures, i.e. systems in which connections - rather than memory cells - are the principal means of storing information. Such systems employ vast networks of very simple processors, functioning in a massively parallel way. This accounts for the graphic basis of our method. We will expand on the significance of the connectionist approach in a later section. We now describe the method in the context of its application to classifier system schemata. Each individual position in a string (genotype) can assume more than one value (allele - taken from the standard terminology of genetics). Each allele can be considered to define an entire category of its own, i.e. the class of all strings in the population containing allele x in position y, say. Similarly, each allele can be used to identify a set of strings. If we compute the intersection of two such sets, we obtain a third set which is identified by the co-occurrence of two alleles. This third set can be conveniently described by a schema containing only the two alleles involved and *'s in all other positions. We can now represent this situation graphically as illustrated in fig. 1. An upward arrow represents a subset relationship. If two arrow-ends meet, an intersection is implied. Fig. 1 represents the fact that the set identified by the schema 01* contains the intersection of the sets identified by the schemata *1* and 0**. Fig. 2 shows that the sets identified by the schemata *0* and *1* (we shall refer to them as 1-allele schemata) are contained in the set ***. Fig. 3 shows some other intersections that may be formed. This notation can be extended to represent any combination of any number of alleles, i.e. the intersection of any

[Fig. 1 - 01* contains the intersection of *1* and 0**; * = string value may be # (unspecified), 0 or 1; # = string value must be unspecified]
[Fig. 2 - the sets *0* and *1* are contained in ***]
[Fig. 3 - some other intersections, e.g. among 00*, 01* and 10*]
group of sets. [Fig. 4 - a distributed intersection (a) transformed so that the intersection is contained in a single, newly created schema (b)] These newly obtained sets can in turn be intersected at random to form further new sets, each described by its own schema. Thus we obtain a graph consisting of tangled hierarchies of schemata, categorizing groups of strings. The sets thus created contain strings from the population. Some schemata may therefore represent empty sets. Our convention is to create and retain schemata (nodes in the graph) only when they are needed, i.e. for non-empty sets. At the lowest level, every member of the population is assigned to its own set, i.e. a special kind of schema containing no *'s. This schema can then be accurately described in terms of the intersection of k other sets, where k is the length of the string. These k sets will belong to the abovementioned level of 1-allele schemata. The result is a two-level graph: the top level containing the 1-allele schemata and the bottom level containing the k-allele schemata. In essence the GRAND algorithm now does the following. The graph structure is restricted in such a way that it adheres to the rule that the intersection of any two sets in a graph must be contained in a single set. This means that a configuration like the one in fig. 4a (referred to as a distributed intersection) is changed to the one in fig. 4b. This results in a policy of consistent data factoring. If the transformation is applied to fig. 5, the configuration of fig. 6 is obtained. Taking into account the fact illustrated in fig. 2 and applying the basic principles of set theory, we find that fig. 6 can once again be transformed, resulting in fig. 7:

000 U 100 = (0** n #00) U (1** n #00)   (fig. 6)
          = (0** U 1**) n #00
          = *** n #00
          = #00   (as in fig. 7)

The significance of the GRAND algorithm is that it maintains a regime of maximal schema integration: maximally common schemata (sets) are identified and strings are clustered together accordingly.
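The set operations that drive GRAND can be sketched over schema strings. Here intersect captures the meeting arrows of fig. 1, and generalize produces the common super-schema seen in the 000 U 100 = #00 factoring; both function names are ours.

```python
def intersect(s1, s2):
    """Intersection of two schemata over {0, 1, *}: positions combine
    where compatible; None signals disjoint sets (conflicting fixed
    positions)."""
    out = []
    for a, b in zip(s1, s2):
        if a == "*":
            out.append(b)
        elif b == "*" or a == b:
            out.append(a)
        else:
            return None
    return "".join(out)

def generalize(s1, s2):
    """Most specific schema covering both strings: keep agreeing
    positions, mark differing ones '#' (unspecified)."""
    return "".join(a if a == b else "#" for a, b in zip(s1, s2))
```

For example, intersecting *1* with 0** yields 01*, exactly the containment shown in fig. 1, and generalizing 000 with 100 yields #00, the node under which fig. 7 clusters them.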
By close inspection one finds that the application of the above transformation rule corresponds to the generalization rules introduced by Michalski in his theory of inductive learning [Michalski 1983]. The different generalization rules described by Michalski manifest as different variants of semantic network configurations. This is an interesting result, because it implies that the application of the above transformations to the network structure represents learning. As changes occur in the world of interest, nodes and relations are constantly added and removed, transformations are continuously triggered, and new schemata (concepts) are formed. Consequently, we obtain what could be described as a "self-learning" network.

[Fig. 5 - three feature positions with their possible values, containing a distributed intersection]
[Fig. 6 - the configuration of fig. 5 after factoring]
[Fig. 7 - the strings 000 and 100 clustered under the schema #00]

Graph induction has been applied to published examples as processed by packages based on Quinlan's ID3 algorithm for learning from examples, and produced similar results. (In some cases the results were better.) It has also been shown [Oosthuizen 1986] that in principle they imply an extension of Lebowitz' "similarity-based generalization" method of learning from observation [Lebowitz 1986]. Thus, GRAND embodies various inductive learning strategies, in particular learning from examples, learning from observation and conceptual clustering.

Supervised Classifier System

The application of the GRAND algorithm makes a graph structure highly sensitive to similarities between strings. The consequent effect is the rapid identification of any such similarities and the creation of new categories characterizing these similarities. Thus, we obtain an automated string classification and clustering mechanism. Employing this mechanism, string "fitness" can now be judged on more general "feature" levels rather than on the string level itself.
The objective is to discover more easily, and to make it more transparent, what aspect of a string it is that makes it a "good" string - and then to be able to concentrate on that aspect. A similar argument holds for the "bad" aspects of a string. The above effects can be obtained as follows: for each schema set Q, we store its current number of members, as well as the current average value of v(Ci), the "fitness" of the individual strings Ci that are members of Q. In each case, the stored value pertains to the terminal nodes (i.e. strings) of the tree below a particular node (schema), of which that node is the root. However, not all the strings in the population are taken into account. To improve the efficiency of the method, only the best performing 50%, for example, of the population is used. As strings in this best performing group are replaced by new ones, some schemata might eventually become redundant on the basis of not having any members, and are then removed from the graph.

Enhancement of crossover operation

Let us say two schemata exhibit above average "strength", neither is a subset of the other (like schema B and schema E in fig. 8), and both represent an adequate number of strings. Such a situation would indicate that the two schemata represent two clusters of strings. It would now be sensible to apply the crossover operation to strings within the same schema (cluster) in the hope of discovering stronger variations within this schema. This corresponds to the situation depicted in fig. 8.

[Fig. 8 - part of an induction graph containing schemata A, B and E, with A a subset of B]

If av(A) > av(B), where av(Q) is the average strength of the strings in Q, it means that the extra features in A actually improve the "fitness" of its strings. Since these strings are also incorporated in B, it means that some of the strings in B' (= B - A), i.e. those in B which are not in A, did not fare so well. We now use a supervised mutation operation to combine the extra features present in A with those of the strings in B'.
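The per-node bookkeeping described above can be sketched as below. Storing the member count and average fitness per schema is the paper's idea; the function names and the {0, 1, *} matching are our framing.

```python
def members(schema, population):
    """Strings from the population matched by a schema over
    {0, 1, *}, where '*' matches any value in that position."""
    return [s for s in population
            if all(a == "*" or a == b for a, b in zip(schema, s))]

def strength(schema, population, fitness):
    """Average fitness av(Q) of the schema's member strings; None
    for an empty set, marking the node as removable from the graph."""
    group = members(schema, population)
    if not group:
        return None
    return sum(fitness[s] for s in group) / len(group)
```

Comparing strength(A, ...) against strength(B, ...) for nested schemata A and B is exactly the av(A) versus av(B) test that triggers the supervised operations.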
Thus, we are spreading proven qualities in a controlled way to a larger population. The same effect can be obtained by applying the pattern-crossover operation to strings in A and B, respectively. Since the characteristic features of B are contained in A, this operation will have no effect on strings in A. If the opposite situation holds, i.e. av(B) > av(A), then similarly, supervised mutation can be used to create new strings not having the distinguishing features of A. This might give rise to other schemata under B, i.e. specializations of B, if the newly introduced features correspond to others already present in another string in B. Clearly, the application of the above operations should be dependent on a defined minimum value for D, where D = abs(av(A) - av(B)), the difference of the average fitness of strings in A and B. In other words, if the superiority of one of the schemata is not evident, the operation might be superfluous.

Application

It is envisaged that the standard genetic operators (mutation, inversion, etc.) would be applied as usual to maintain the variability of the population and thereby to guard against overemphasis of a given kind of schema. The supervised operations described above should be applied in a constrained manner, to prevent distortion of the well balanced search process. The role of the graph induction is seen primarily in speeding up convergence once a possible optimal solution area has been spotted, by applying knowledge obtained from significant discoveries to the immediately subsequent string generations. An appropriate mix of standard and supervised operations should, however, prevent premature convergence to suboptimal solutions.

SUPERGRAN

The account given above amounts to the system represented in fig. 9.
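The per-schema bookkeeping and the threshold test described above can be sketched in a few lines of Python. This is an illustrative sketch only; the names `Schema`, `should_apply_supervised_op` and the threshold value `d_min` are invented here and are not taken from the GRAND/SUPERGRAN implementation.

```python
# Illustrative sketch of the per-schema statistics described above:
# each schema Q stores the fitness values v(Ci) of its member strings,
# and a supervised operation fires only when the difference D between
# two schemata's average strengths exceeds a defined minimum.
# All names and the threshold are assumptions for illustration.

class Schema:
    def __init__(self, name):
        self.name = name
        self.members = []          # fitness values v(Ci) of member strings

    def av(self):
        """Average strength av(Q) of the strings in this schema."""
        return sum(self.members) / len(self.members) if self.members else 0.0

def should_apply_supervised_op(a, b, d_min=0.1):
    """Apply a supervised operation only if the superiority of one
    schema is evident: D = abs(av(A) - av(B)) >= d_min."""
    return abs(a.av() - b.av()) >= d_min

A = Schema("A"); A.members = [0.9, 0.8]
B = Schema("B"); B.members = [0.5, 0.4, 0.6]
print(should_apply_supervised_op(A, B))   # D = 0.35 here, so True
```

Schemata whose `members` list empties out would simply be dropped from the graph, as the text describes.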
However, the concept acquisition and generalization capabilities of GRAND enable SUPERGRAN to conduct two learning strategies in parallel. Whilst the genetic induction mechanism ensures flexibility (internally), SUPERGRAN can also digest information from an external source (see fig. 10). Strings volunteered to the system are added to the population. Such strings are given a high initial strength and are therefore immediately absorbed into the induction graph. Strings volunteered are either specific - corresponding to examples - or general - corresponding to rules. If the added strings are of significance, GRAND, in its advisory role, will trigger immediate adjustments to the mixture of features in the new strings created by the genetic algorithm.

[Fig. 9: genetic algorithm generates strings; strings tested; "best" strings captured in graph; graph induction; classifier system advised. Fig. 10: as fig. 9, but with strings (rules) from an external source added to the population.]

SUPERGRAN is currently being implemented.

Graph Induction and Connectionism

As mentioned earlier, graph induction originated from a study of the knowledge representation capabilities of massively parallel connectionist networks. Although the basic idea behind graph induction - the restructuring of a graph into a "canonical" form - is quite straightforward, its application poses a computational problem, illustrated in fig. 11. Let us say a certain schema A is described by 6 features (e-j) and another schema B is added to the graph. According to the graph induction rule, fig. 11 has to be transformed to the configuration in fig. 12. But recognizing distributed intersections like those at A and B requires the inspection of the conjunction of all possible combinations of pairs of sets intersected by B. If we want to arrive at the configuration in fig.
12, it requires the inspection of all combinations of all higher order conjunctions as well, i.e. of groups of 3 sets, 4 sets, etc. Thus, there is an exponential increase in the number of conjunctions to be inspected as the number of features of B increases. To make things worse, sets may have 'semi-explicit' intersections. Although the sets themselves do not intersect, their indirect descendants may well (see fig. 13). Consequently, the inspection has to be conducted to depths of more than one level. McGregor & Malone [1982a, 1982b] developed a set processing machine that is able to operate on vast quantities of data items (nodes). Subsequently, software as well as hardware implementations (prototypes) have been created as part of the low level functions of the FACT Expert Database System [McGregor & Malone 1983]. The hardware version of the system, known as Generic Associative Memory (GAM), dynamically constructs networks to process set data in a massively parallel fashion, which makes processing speed not highly dependent on the depth of the network. These systems are described in detail elsewhere. The core operation of this set processing system involves the closure of a set (node). The upward closure of a node includes the node itself as well as all nodes "above" it in the network, i.e. following the arrows. The downward closure includes all nodes below. The determination of the intersection of closures is implemented as a hardware function [Oosthuizen et al. 1987a]. As mentioned above, transformations are triggered by the detection of distributed intersections. The GAM closure operation is ideal for the detection of distributed intersections, and especially semi-explicit distributed intersections, in complex networks (i.e. where the distributed intersections exist at levels lower than the immediate descendants of the two nodes involved).
The use of the GAM closure operation in principle eliminates the distinction between explicit and semi-explicit intersections. Because of its inherent parallelism, the closure operation is also virtually independent of the width of the "tree" involved - i.e. whether a particular node has 2 or n descendants (where n >> 2). Furthermore, the intersection of closures of two nodes like A and B in fig. 11 is (1) a primitive operation in GRAND and (2) executed in a minimal amount of time. Although the above-mentioned computation problem does not affect current small scale prototype learning systems, GAM ensures the feasibility of the application of graph induction as a general purpose learning method.

Concluding Remarks

Although dependencies between features (attributes) are dealt with implicitly in the induction graph, we have not made explicit use of them. Functional dependencies can be used to make further inductive inferences. (Functional dependencies can be inferred from mappings between sets - information already captured in the graph.) SUPERGRAN exploits the interaction between two learning methods: the genetic algorithm generates strings and thereby creates examples which are used by the graph induction method to generate schemata and highlight underlying properties. "Champions" are identified among features or combinations of features and used to advise the genetic discovery mechanism on where to look for "better" strings. Thus, a higher degree of coherence is established between the event of a significant discovery, or underlying trend, and the present exploration of the search space. By incorporating the whole population in the induction graph, the induction graph becomes a knowledge base to the system.
In that way, the learning process becomes fully integrated with the knowledge representation function of the cognitive system. (The notation used in the induction graph in fact constitutes a knowledge representation model based on Generic Associative Sets [Oosthuizen et al. 1987b].)

Acknowledgements

I am indebted to the members of the IKBS working group (in particular to Michael Odetayo) for their stimulating discussions and to Jesus Christ, the Origin of all knowledge.

References

Grefenstette, J.J. [1986] Optimization of Control Parameters for Genetic Algorithms. IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-16, No. 1, January/February.

Holland, J.H. [1986] Escaping Brittleness: The Possibilities of General Purpose Algorithms Applied to Parallel Rule-Based Systems. In Machine Learning: An Artificial Intelligence Approach, Vol. 2, Michalski, R.S., Carbonell, J.G., Mitchell, T.M. (Eds.), Morgan Kaufman.

Holland, J.H., Reitman, J.S. [1978] Cognitive Systems Based on Adaptive Algorithms. In Pattern-Directed Inference Systems, Waterman, D.A., Hayes-Roth, F. (Eds.), Academic Press.

Lebowitz, M. [1986] UNIMEM, a General Learning System: An Overview. Proceedings of ECAI-86, Brighton, England.

McGregor, D.R., Malone, J.R. [1982a] Generic Associative Hardware: Its Impact on Database Systems. Proceedings IEE Colloquium on Associative Methods & Database Engines, May 1982.

McGregor, D.R., Malone, J.R. [1982b] Generic Associative Memory. G.B. Patent No. 8236084.

McGregor, D.R., Malone, J.R. [1983] The FACT System - a Hardware-Oriented Approach. In DBMS: A Technical Comparison - State of the Art Report. Maidenhead: Pergamon Infotech, pp. 99-112.

Michalski, R.S. [1983] A Theory and Methodology of Inductive Learning. Artificial Intelligence, Vol. 20, pp. 111-161.

Oosthuizen, G.D. [1986] A Paradigm for Automatic Learning. Internal Report FACT 24/86, University of Strathclyde, Scotland.

Oosthuizen, G.D., McGregor, D.R., Henning, M., Renfrew, C.
[1987a] Parallel Network Architectures for Large Scale Knowledge Based Systems. Proceedings of First Workshop of the British Special Interest Group on Knowledge Manipulating Engines (SIGKME), Reading, England, January 1987.

Oosthuizen, G.D., McGregor, D.R., Malone, J.R. [1987b] The Use of a Simple Connectionist Architecture for Matching and Learning. Internal Report FACT 2/87, University of Strathclyde, Scotland.

Schrodt, P.A. [1986] Predicting International Events. BYTE, November.

PARALLEL IMPLEMENTATION OF GENETIC ALGORITHMS IN A CLASSIFIER SYSTEM

George G. Robertson
Thinking Machines Corporation

Abstract

Genetic Algorithms are the primary mechanism in Classifier Systems [Ju] for driving the selective evolution of rules (learning) to perform some specified task. The Classifier System approach to machine learning, in particular, and Genetic Algorithms, in general, are inherently parallel. Implementations of Classifier Systems and Genetic Algorithms to date have mostly been done on conventional serial computers. Because of this serial bottleneck, researchers have been able to study only relatively small problems. The advent of commercially available massively parallel computers, such as the Connection Machine system [10], now makes parallel implementations of these inherently parallel algorithms possible, and makes it possible to begin studying these algorithms on larger task domains. This paper describes an implementation of Genetic Algorithms in a Classifier System on the Connection Machine.

Classifier Systems and Machine Learning

Classifier Systems represent one basic approach to learning by example. Two other approaches are symbolic rule-based systems and Connectionist Network systems. These approaches, and the inherent parallelism in them, are compared and contrasted with Classifier Systems below.
This provides a framework for machine learning in which to place Classifier Systems, and suggests the ease of implementation for each approach on massively parallel computers.

Symbolic rule-based systems attempt to model high (symbolic) level cognitive processes. A good example is the SOAR problem-solving architecture being studied at Carnegie-Mellon University (Newell), Stanford University (Rosenbloom), and Xerox PARC (Laird) (see [18, 19, 20]). SOAR is an architecture for problem solving and learning, based on heuristic search in problem spaces and chunking. It is based on a modified version of the OPS5 Production System architecture [5]. Although SOAR shows promise as an approach to learning, it does not appear to be a good candidate for massively parallel computers. Gupta [8] has done studies of parallelism possible in symbolic production systems and found that systems with more than thirty or so processors would not be effectively utilized. Most of the traditional AI work in machine learning (see Michalski et al. [22]) also operates at the symbolic level and is perhaps even less easily parallelized than SOAR.

Connectionist Network approaches to learning began with the work on Perceptrons by Rosenblatt [28] (also see Minsky and Papert [24]). New ways of dealing with the problems that they encountered have recently been studied, including work on Boltzmann Machines [1], Back Propagation networks [29, 30, 31], and work by Klopf [17], Barto [2], Grossberg [7], Feldman [4], Hinton [12], and Minsky [23]. These systems model low level neural processes, with long-term knowledge represented as strengths (or weights) of the connections between simple neuron-like processing elements. The Boltzmann Machines introduce a stochastic process to avoid the pitfalls of local minima in learning behavior. Back Propagation networks have successfully been used to learn to recognize symmetries in visual patterns [31], as well as to learn a speech synthesis task [30].
In this latter task a corpus of text and its phonetic translation is used as the example for learning. A fixed initial structure (a three level Connectionist Network) begins with random link and node threshold weights and over time learns the proper weights to translate the text into understandable phoneme streams. These systems are well-adapted to massively parallel systems; in fact, Back Propagation and Boltzmann Machines have been implemented on the Connection Machine system.

Classifier Systems attempt to model macroscopic evolutionary processes with Genetic Algorithms and natural selection. Holland's early work on Genetic Algorithms [15] has recently led to numerous research efforts and application of Classifier Systems to a number of task domains (see [6, 13, 14, 25, 20, ...]). This approach shares some properties with both the symbolic rule-based and Connectionist Network approaches. It is rule-based, but the rules are low level message passing rules (below the symbol level). A symbolic level can be added on top of Classifier Systems (see Forrest's work [6]). Like Boltzmann Machines, Classifier Systems use a stochastic process to avoid local minima in learning behavior. They use Genetic Algorithms to replace weak rules (those not contributing or contributing incorrectly to the problem solution) with offspring of strong rules. Unlike the Back Propagation approach, which requires predefined layers of networks, the Classifier System approach does not require any predefined structure (although one can be provided if desired, in the form of an initial set of rules). Because of the nature of the rules,
messages, and the match cycle, this approach is very well-adapted to massively parallel computers. To date, work on all of these approaches has taken place primarily on conventional serial computers with simulators. Because the speed of these serial simulators depends directly on the size of the problems being solved, only small examples have been used to test these ideas. The massive parallelism and dynamic reconfigurability of the Connection Machine offers a chance to implement and evaluate the Connectionist Network and Classifier System approaches to learning on much larger (and more realistic) task domains. The speed of the parallel versions of these systems is nearly independent of the size of the problems being solved.

Parallelism on the Connection Machine

Most computer programs consist of a control sequence (the instructions) and a collection of data elements. Large programs have tens of thousands of instructions operating on tens of thousands, or even millions, of data elements. There are opportunities for parallelism in both the control sequence and in the collection of data elements. In the control sequence, it is possible to identify threads of control that could operate independently, and thus on different processors. This approach is known as "control parallelism", and is the method used for programming most multiprocessor computers. The primary problems with this approach are the difficulty of identifying and synchronizing these independent threads of control. Alternatively, it is possible to take advantage of the large number of data elements that are independent, and assign processors to data elements. This approach is known as "data parallelism" [11, 33]. This approach works best for large amounts of data and for many applications is a more natural programming approach. The Connection Machine system is a general purpose implementation of data parallelism.
The Connection Machine system is a dynamically reconfigurable computer with 65,536 processors and 32 Mbytes of memory which can process large volumes of data at speeds exceeding one billion instructions per second (1000 MIPS). This computer is a departure from the conventional von Neumann model of computing, which has one processor with a large memory. In the Connection Machine, the processing power is distributed over the memory, so that there are many processors, each with a small amount of memory (4096 bits). In addition to a large number of processors, a data parallel computer must have an effective means of communicating between processors. The Connection Machine system has a communications system that allows any processor to send a message to any other processor, with possibly all processors sending messages at the same time. This mechanism allows applications to dynamically reconfigure the communications topology to adapt it to the communications needs of the moment. For example, in the low level part of an image understanding system, a two dimensional grid is the most appropriate topology. At the intermediate level, such a system might want to communicate in a tree structured topology. And at the highest level, a semantic network for understanding might require an arbitrary network topology. Each of these communications topologies can be dynamically configured on the Connection Machine.

From a programming point of view, the Connection Machine system can be thought of as a parallel processing accelerator for a conventional serial computer. In fact, the Connection Machine system is made up of two parts: a front-end computer (like a Vax or a Symbolics 3600), and an array of Connection Machine processors and memory. The front-end computer can operate on the Connection Machine memory as though it were part of its own memory, or it can invoke parallel
arithmetic, logical, or communications (data movement) operations on that memory. The program runs on the front-end computer, invoking parallel operations when necessary. Thus, all the program development tools of the front-end computer are available for developing Connection Machine programs.

Connection Machine programming languages are simple parallel extensions to familiar sequential languages. C* [27] is a parallel version of the C language [9], and *Lisp [21] is a parallel version of Common Lisp [32]. In each of these languages, the extensions were made in an unobtrusive manner. For example, in *Lisp, two parallel variables can be added together with +! (the ! at the end of an operator name indicates the parallel version of that operator). The process of writing a data parallel program with a language like *Lisp is quite simple: write an algorithm as though it were operating on one data element, then use the parallel versions of the operators. In control parallelism, understanding how to do a parallel task decomposition is often an intellectually difficult task. In such a system, it is also often difficult to take an existing program and understand it. However, in data parallelism, the programming style is much closer to what programmers are used to, and it is thus much simpler to produce and understand data parallel programs.

*CFS: A Parallel Classifier System

The state of the art for Classifier Systems is represented by the CFS-C system [25], designed and implemented at the University of Michigan.
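The data parallel style described above — write the algorithm for one data element, then apply the parallel version of each operator to every element at once — can be imitated in a serial language. The sketch below is an illustrative Python analogue, not *Lisp: `plus_bang` is an invented stand-in for the parallel `+!` operator, implemented as a plain elementwise map rather than a true parallel operation.

```python
# Serial imitation of the *Lisp data parallel style: a "parallel
# variable" holds one value per (virtual) processor, and the parallel
# version of an operator applies to every element at once.
# plus_bang is an invented stand-in for *Lisp's +!, not its real API.

def plus_bang(xs, ys):
    """Elementwise addition over two 'parallel variables'."""
    return [x + y for x, y in zip(xs, ys)]

strengths = [10, 20, 30]    # one value per processor
rewards   = [1, 2, 3]
print(plus_bang(strengths, rewards))   # [11, 22, 33]
```

The point of the style is that the per-element logic (`x + y`) is written exactly as it would be for a single datum; the mapping over all processors is implicit in the operator.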
This system is designed with parallelism in mind, but has been simulated on conventional serial machines, and thus has been restricted to small task domains. Serial implementations of CFS-C are relatively fast for small sets of classifiers: up to 20 cycles per second for 200 Classifiers on a one MIP (one million instructions per second) computer. However, because the match procedure (described below) requires every message to be matched against every classifier, serial implementations for large sets of classifiers are too slow to be useful.

*CFS is an implementation of CFS-C on the Connection Machine, written in *Lisp. Because of the parallel nature of *CFS on the Connection Machine, it will work with 65,000 classifiers about as fast as it works with 200 classifiers. The speed of the system does not depend on the number of classifiers and is relatively fast, up to 10 cycles per second for 65,000 Classifiers. Thus, *CFS on the Connection Machine provides a way to explore and evaluate Classifier Systems and Genetic Algorithms on large task domains.

*CFS: Overview of Operation

From the user's point of view, both CFS-C and *CFS provide the same basic structure, a message-passing rule-based system that uses Genetic Algorithms to evolve rules to solve some specified problem. Both systems are task domain independent. To define a task domain, you simply define three functions: a function to provide input messages that describe the state of the external environment on each cycle (these are called "Detector" messages), a function to analyze output messages to alter the external environment (called "Effectors"),
and a function to evaluate the changes made to the environment, and supply a reward or punishment for those changes.

A "Message" is a fixed length bit string, with each bit position containing a zero or a one. A "Classifier" is a rule with two conditions and an action. The conditions are message patterns which are matched with incoming messages; these patterns may also contain a "don't care" value. The action is a message pattern for production of an outgoing message.

On each cycle, the messages output from the previous cycle are combined with Detector messages describing the environment to provide the incoming message list. All Classifiers are matched with all the messages on the message list. The matching Classifiers can each post one or more new messages to the outgoing message list. The message list is limited to a small size to force the Classifiers to compete for the right to post new messages. There is a strength associated with each Classifier that is used to control this bidding process. The strength reflects how good a rule is; that is, a high strength Classifier is one that contributes to a correct solution, while a low strength Classifier is one that either does not contribute or contributes to an incorrect solution. Classifier strength is adjusted with several different mechanisms: (1) reward or punishment from the evaluator changes the strengths of all Classifiers that won the bidding competition and posted messages during that cycle;
(2) Classifiers that win the bidding process pay their bid to the Classifiers in the previous cycle that produced the messages which the winning Classifiers matched (this is called the Bucket Brigade algorithm [13], and is necessary for chains of rules to form and survive); and (3) taxes are used to eliminate non-contributors and to help prevent over-general Classifiers from dominating the bidding process.

Genetic Algorithms are used periodically to replace some percentage of the population of Classifiers (generally low strength Classifiers are replaced) with offspring from matings of other (generally high strength) Classifiers. The two primary Genetic Algorithms used are Crossover Mating and Mutation. The Crossover algorithm considers the entire Classifier (both conditions and the action) as a chromosome, picks a random point for the crossover, and crosses two parents to produce two offspring (i.e., the high order bits up to the randomly picked point are swapped in the parents). The Mutation algorithm uses a Poisson distribution to decide whether to make zero, one, two, or three random modifications to a new offspring. The mutation rate is kept quite low, so that only a few offspring are mutated at all, and very few have more than one mutation.

*CFS: Parallel Data Structures and Algorithms

The inherent computational load of *CFS can be characterized as proportional to the product of the length of the message list and the number of Classifiers. Experience suggests that the message list should be restricted to a small size to promote strong competition in the Classifier population. It also appears that solving problems in large real-world task domains will require a large Classifier population. These two observations, along with the fact that the Classifiers are operated on almost entirely independently, suggest a natural data parallel approach for *CFS, which is the assignment of one processor to each Classifier.
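The Crossover Mating operator described above — treat the whole Classifier (two conditions plus the action) as one chromosome, pick a random point, and swap the high order part between two parents — can be sketched serially as follows. The ternary encoding (`0`, `1`, `#` for "don't care") follows the Classifier description; the function name and the example chromosomes are illustrative, not from CFS-C/*CFS.

```python
import random

# Sketch of single-point Crossover Mating as described above: the
# entire Classifier is one chromosome; the genes up to a randomly
# picked point are swapped between the two parents, yielding two
# offspring. Names and example strings are illustrative only.

def crossover(parent_a, parent_b, rng=random):
    assert len(parent_a) == len(parent_b)
    point = rng.randrange(len(parent_a))     # random crossover point
    child_a = parent_b[:point] + parent_a[point:]
    child_b = parent_a[:point] + parent_b[point:]
    return child_a, child_b

# Two 9-gene chromosomes, e.g. two 3-gene conditions plus a 3-gene action.
a, b = "000###111", "111000###"
ca, cb = crossover(a, b)
print(ca, cb)
```

Whatever point is drawn, each gene position of the two offspring holds exactly the pair of genes the parents held at that position; only their assignment to offspring changes.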
The primary parallel data structures in *CFS are associated with Classifiers. In addition to the two conditions, action, and strength that have already been mentioned, there are about 30 other variables associated with each Classifier. Some of these variables maintain performance statistics, others are used to control various algorithms in the system. Values stored in parallel variables on the Connection Machine can be of any size needed. The parallel variables that represent a Classifier include several one-bit booleans, signed integers ranging from 5 to 32 bits, and unsigned integers that are as long as the message size in the particular task domain (9 bits for the letter sequence prediction task). In the current implementation, about 850 bits are used for each Classifier, with performance statistics accounting for about half of that. That means that several Classifiers could be implemented on each physical processor on the Connection Machine (since each has 4096 bits of memory). In fact, the Connection Machine has a Virtual Processor mechanism that supports that. By selecting the right virtual processor configuration, a 65,536 processor Connection Machine can simulate a million processors, hence a million Classifiers.

The other primary data structure in *CFS is the message list, which is maintained on the front-end computer.
Although all operations on Classifiers are done in parallel, the operations on messages in the message list must be done serially. However, the size of the message list is relatively small (less than 30 messages for most applications), and the number of operations that sequence through the message list is small (the match procedure, the creation of the new message list, and activation of Effectors).

To see how the parallel Classifiers and the serial message list interact, let us consider the match algorithm. Assume that each message is N bits long and that there are at most M messages in the message list. The conditions in the Classifiers are represented as two N-bit unsigned fields, one for the bits (zeros and ones) and the other for the wildcards (the "don't care" bits). Let us allocate an M-bit "message winner" field for each condition, to represent which messages matched that condition (i.e., the m'th bit of that field will be one if the m'th message in the message list matched that condition). To match all the messages with all the Classifiers, we first sequence through the message list, broadcasting each message to all Classifiers in parallel. This is the only serial part of the match; the rest of the operations are done in parallel. We compare the logical OR of the wildcard bits and the condition bits with the logical OR of the wildcard bits and the message bits. If they are the same, we set the m'th message winner bit. After sequencing through the message list, we find all Classifiers that matched by selecting all Classifiers that have non-zero message winner fields for both conditions.
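The per-condition match test above can be checked with ordinary integers standing in for the parallel N-bit fields. OR-ing the wildcard mask onto both the condition bits and the message forces the "don't care" positions to agree, so the two results are equal exactly when every cared-about bit of the message matches the condition. The function name below is illustrative.

```python
# The *CFS match test, with ordinary Python integers standing in for
# the parallel N-bit fields. A condition is a pair (bits, wildcards):
# `bits` holds the required zeros and ones, `wildcards` has a 1 at
# each "don't care" position. The condition matches message m exactly
# when (wildcards | bits) == (wildcards | m): wildcard positions are
# forced to 1 on both sides, so only the cared-about bits are compared.

def matches(bits, wildcards, message):
    return (wildcards | bits) == (wildcards | message)

# Condition 1#0 over N = 3 bits: bits = 0b100, wildcard in the middle.
bits, wild = 0b100, 0b010
print(matches(bits, wild, 0b100))  # True:  1x0 with x = 0
print(matches(bits, wild, 0b110))  # True:  1x0 with x = 1
print(matches(bits, wild, 0b101))  # False: last bit must be 0
```

In *CFS this comparison runs in every Classifier processor at once for each broadcast message, setting the corresponding message winner bit on success.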
The whole match process takes about 3.8 milliseconds plus about 0.5 milliseconds for each message on the message list, regardless of the number of Classifiers (up to 65,536).

Parallel Genetic Algorithms In *CFS

As mentioned earlier, the critical component of Classifier Systems that drives their learning behavior is the Genetic Algorithms. In *CFS, there are currently two parallel Genetic Algorithms employed, Crossover Mating and Mutation. These algorithms are invoked periodically and replace some percentage of the population of Classifiers with offspring of matings of other Classifiers. In a typical experiment, the Genetic Algorithms might be invoked once every eleven cycles, with five percent of the population being replaced each time. The frequency of invocation of Genetic Algorithms is a prime number to avoid the effects of cycle patterns in the test environment, which might cause strength patterns that alter the effectiveness of the Genetic Algorithms. The percent of the population being replaced is relatively low so that the Bucket Brigade and other strength adjustment algorithms have time to establish some stability.

The first step in these Genetic Algorithms is to pick a set of Classifiers to be replaced and a set of Classifiers to be parents. These two sets are the same size and a one-to-one mapping is established between them, in the form of two-way pointers between elements of each set. The parents are then copied onto the Classifiers being replaced. From that point on, the set of Classifiers being replaced is referred to as the set of offspring. Some percentage of the offspring are left as pure replications, while the rest are paired and mated with the crossover algorithm.
Finally, there is a small probability that each new offspring will be changed slightly by the mutation algorithm. All of the algorithms described in this section operate at speeds that are independent of the number of Classifiers involved (up to 65,536).

Picking Parents and Classifiers to Replace

Relatively low strength Classifiers are picked to be replaced. The choice is made probabilistically, making random draws without replacement from the population with the probability of being picked proportional to the square of the inverse of the strength of the Classifier. This is done so that weak Classifiers have some small chance of not being replaced and strong Classifiers have some small chance of being replaced. Likewise, relatively high strength Classifiers are picked as parents. This is also done probabilistically, with the probability proportional to the square of the strength.

A "Parallel Random Weighted Selection" algorithm supports the picking of parents and Classifiers to replace. Of the several approaches to implementation of this algorithm, the following has been the most successful. First, generate a random number between zero and the weight (e.g., the square of the strength) for each Classifier in parallel. Then, Rank those random numbers in parallel. The Rank operation is the first step in a parallel sort, and is done in log time on the Connection Machine using either Batcher's bitonic sort [3] or a radix sort [1]. After ranking the random numbers, the Classifier with the smallest random number will have rank 0, the Classifier with the largest random number will have rank N (the size of the population), and each rank will be unique. At this point, select the lowest five percent (or whatever percent is being replaced) of the ranked Classifiers as the Classifiers to replace, and the highest five percent of the ranked Classifiers as parents. This random weighted selection algorithm takes about 23 milliseconds on the Connection Machine.
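The random weighted selection just described can be sketched serially: one draw in [0, weight) per Classifier, a rank over the draws, then the bottom slice becomes the replacement set and the top slice the parent set. On the Connection Machine the Rank is a log time parallel sort primitive; here it is an ordinary `sorted`. The function name and example strengths are illustrative.

```python
import random

# Serial sketch of "Parallel Random Weighted Selection": each
# Classifier draws a random number in [0, strength^2); after ranking
# the draws, the lowest fraction of ranks is replaced and the highest
# fraction becomes parents. Weak Classifiers thus have only a small
# chance of surviving, and strong ones a small chance of being replaced.

def weighted_selection(strengths, fraction, rng=random):
    draws = [rng.uniform(0, s * s) for s in strengths]
    ranked = sorted(range(len(strengths)), key=lambda i: draws[i])
    k = max(1, int(len(strengths) * fraction))
    return ranked[:k], ranked[-k:]       # (to_replace, parents)

strengths = [1, 1, 1, 50, 50, 50, 50, 50, 1, 1]
to_replace, parents = weighted_selection(strengths, 0.2)
print(to_replace, parents)
```

Because the draw for a strength-50 Classifier lies in [0, 2500) while a strength-1 Classifier's lies in [0, 1), strong Classifiers usually (but not always) rank high, which is exactly the intended bias.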
Crossover Mating

At this point, we have identified two sets of Classifiers: a set of parents and a replacement set. Before proceeding, we need to replace the Classifiers picked for replacement with copies of their parents. The first step in this process is to make each offspring (one of the replacement set) point to a parent, and vice-versa. This is done using a "Parallel Rendezvous" algorithm, which takes three steps:

1. Enumerate the parent set and Send the processor address of each parent to a rendezvous variable in the enumerated processors. Enumeration is another log time operation on the Connection Machine. The result of enumerating the parent set will be a unique numbering of the parents from zero to the number of parents. The Send from the parents to the enumerated processors will result in the rendezvous variable in processor zero containing the processor address of the first parent; in processor one the rendezvous variable will contain the processor address of the second parent, and so on. If there are N parents, then the rendezvous variable in the first N processors will contain the addresses of the parents.

2. Enumerate the replacement set, and have each offspring Get the address in the rendezvous variable from the enumerated processors. Since this enumeration will be the same as the replacement rank developed while picking the Classifiers to replace, we can use the replacement rank and skip this enumeration to save time. The Get will result in the first offspring obtaining the address of the first parent from the rendezvous variable in processor zero, the second offspring obtaining the address of the second parent, and so on. Now, each offspring has a pointer to its parent.

3. Offspring Send their processor addresses to their parents. Since offspring now have their parents' addresses, they can do this directly.

Given the pointers between parents and offspring, we now replace the offspring with copies of their parents.
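The three rendezvous steps can be imitated serially, with list writes and dictionary reads standing in for the Connection Machine Send and Get. The function and variable names below are illustrative, not from *CFS.

```python
# Serial imitation of the "Parallel Rendezvous" algorithm above.
# rendezvous[i] plays the role of the rendezvous variable in the i'th
# enumerated processor; Send and Get become list writes and reads.

def rendezvous_pairing(parent_addrs, offspring_addrs):
    assert len(parent_addrs) == len(offspring_addrs)
    # Step 1: enumerate the parents and Send each parent's address to
    # the rendezvous variable of the correspondingly numbered processor.
    rendezvous = list(parent_addrs)
    # Step 2: each enumerated offspring Gets a parent address, giving
    # every offspring a pointer to its parent.
    parent_of = {off: rendezvous[i]
                 for i, off in enumerate(offspring_addrs)}
    # Step 3: offspring Send their own addresses back to their parents,
    # completing the two-way pointers.
    child_of = {parent_of[off]: off for off in offspring_addrs}
    return parent_of, child_of

parent_of, child_of = rendezvous_pairing([7, 3, 9], [12, 5, 8])
print(parent_of)   # {12: 7, 5: 3, 8: 9}
```

In the parallel version each of these steps touches all pairs at once, which is why the whole pairing runs in time independent of the population size.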
This is done by having parents Send relevant parts of themselves to their offspring, using the offspring address pointers derived from the rendezvous step. Some parallel variables associated with the offspring are simply initialized, while others are copied from their parents. The strength of a new offspring is set to a value halfway between its parent's strength and the average strength of the population. This allows the new offspring to participate in the bidding process quickly, without dominating the process.

The Crossover algorithm used in *CFS allows some percentage (a system parameter) of null crossovers, or pure replications. To decide which offspring will be replications, we generate a random number between 0 and 100 for each offspring. If that number is below the specified cutoff, we are done with the Crossover algorithm, since the parent has already been copied. For the remaining offspring we proceed with the Crossover algorithm by picking pairs to mate and crossing those mates at random points. The next step is to pair the offspring (since these are now copies of parents, this is equivalent to pairing the parents). This is done by dividing the set of offspring in half (currently done by even replacement rank versus odd replacement rank) and using the rendezvous algorithm to get the even ranked offspring to point to the odd ranked offspring, and vice-versa. With this set of pointers, we can now complete the Crossover.
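Two of the per-offspring decisions above reduce to one-line rules; a minimal sketch (function names are ours):

```python
import random

def init_offspring_strength(parent_strength, population_mean):
    # New offspring strength: halfway between the parent's strength and the
    # population average, so the offspring can bid early without dominating.
    return (parent_strength + population_mean) / 2.0

def is_replication(null_crossover_pct):
    # Null-crossover test: draw in [0, 100) and compare against the
    # system-parameter cutoff; below the cutoff the offspring stays a
    # pure copy of its parent.
    return random.uniform(0.0, 100.0) < null_crossover_pct
```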
Generate a random number between 0 and the length of the chromosome (i.e., three times the length of a message, since the chromosome contains both of the conditions and the action) for each even ranked offspring, to be used as the crossover point. Now the even ranked offspring Get the odd ranked offspring's chromosome (conditions and action). Next, using parallel load-byte and deposit-byte, swap the high order bits of the chromosomes up to the crossover point. Finally, store the crossed chromosomes back in the even ranked offspring and Send the crossed chromosomes to the odd ranked offspring.

Mutation

The Mutation algorithm uses a Poisson distribution to mutate zero, one, two, or three genes in the chromosome (bits in the conditions and action) of the new offspring. The Poisson distribution is implemented in parallel using a table (see [34]). A random number between 0 and 1000 is generated for each offspring. A match to the Poisson table indicates how many mutations to make (normally, very few mutations are done). For each mutation, a random bit position is picked, and a random new value (one, zero, or don't care) is put in that position.

Summary: Parallel Genetic Algorithms

To summarize, the parallel Genetic Algorithms used in *CFS are all independent of Classifier population size (up to 65,536). The total time for the Genetic Algorithms is around 400 milliseconds (or an average of 36 milliseconds per cycle in the experiment described above, since they were run only once every several cycles). Very little optimization has been done on this part of the system, so the figure can probably be reduced by a factor of two or three.
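The crossover swap and the table-driven mutation described above can be sketched sequentially (a sketch only: chromosomes are lists of symbols here rather than packed bit fields, and the Poisson lookup table values are hypothetical, not taken from *CFS):

```python
import random

def crossover(even_chrom, odd_chrom, point):
    # Swap the genes up to the crossover point; the *CFS version does
    # this with parallel load-byte/deposit-byte on packed bit fields.
    new_even = odd_chrom[:point] + even_chrom[point:]
    new_odd = even_chrom[:point] + odd_chrom[point:]
    return new_even, new_odd

# Hypothetical Poisson lookup table: a draw in [0, 1000) maps to a
# mutation count of 0, 1, 2, or 3, with most draws yielding zero.
POISSON_TABLE = [(905, 0), (995, 1), (999, 2), (1000, 3)]

def mutate(chrom):
    draw = random.randrange(1000)
    n_mutations = next(n for limit, n in POISSON_TABLE if draw < limit)
    for _ in range(n_mutations):
        pos = random.randrange(len(chrom))
        chrom[pos] = random.choice(['0', '1', '#'])  # one, zero, or don't care
    return chrom
```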
All these algorithms make heavy use of the communications mechanisms in the Connection Machine, both explicitly (with uses of Send and Get in parent copying and crossover) and implicitly (with uses of Enumerate and Rank). Two crucial algorithms that support the Genetic Algorithms are parallel random weighted selection (for picking parents and offspring) and parallel rendezvous (to establish pointers between parents and offspring and between pairs of mates in crossover).

Summary

The implementation of a parallel version of Genetic Algorithms in a Classifier System on a fine-grained massively parallel computer, the Connection Machine, provides verification that these inherently parallel algorithms can, in fact, be fully parallelized on a computer. This is best demonstrated by the speed of the system, which is independent of the number of Classifiers, and which for parts of *CFS other than its Genetic Algorithms depends only on the size of the message list, and is linear.

This work also provides a demonstration of the power and importance of data parallelism as a programming style. It took less than one man-month to develop the first working version of the basic *CFS system, starting only with a description of CFS-C (the documentation in [25, 26]). This also provides confirmation that the Connection Machine system is well-adapted to data parallelism and is easy to program.

Finally, parallel implementations of Classifier Systems and Genetic Algorithms, like *CFS, provide a vehicle for exploring large task domains. Ultimately, a Classifier System with a large population should be able to evolve a set of rules for any repetitive structured task for which an evaluation function can be defined. An example of a difficult task that is unlikely to ever work with a small population of Classifiers, but might work with a large population, is the game of Go. The best current Go playing program, by Wilcox [35], is only slightly better than a novice.
Can we express the knowledge embedded in Wilcox's program in a set of rules and an evaluation function, and use that as the starting point for evolving rules for a really good Go player? Parallel implementations of Classifier Systems and Genetic Algorithms allow us to begin investigating such questions.

References

1. Ackley, D.H., Hinton, G.E., and Sejnowski, T.J., A Learning Algorithm for Boltzmann Machines, Cognitive Science, vol. 9, no. 1, 147-169, 1985.

2. Barto, A., Learning by Statistical Cooperation of Self-Interested Neuron-like Computing Elements, COINS Technical Report 85-11, University of Massachusetts, April 1985.

3. Batcher, K.E., Sorting Networks and their Applications, in Proc. 1968 Spring Joint Computer Conference, AFIPS, pp. 307-314, April 1968.

4. Feldman, J.A., Dynamic Connections in Neural Networks, Biological Cybernetics, 46, 27-39, 1982.

5. Forgy, C.L., OPS5 Manual, Carnegie-Mellon University Computer Science Department Technical Report, 1981.

6. Forrest, S., Implementing Semantic Network Structures Using Classifier Systems, in J.J. Grefenstette (Ed.), Proceedings of an International Conference on Genetic Algorithms and their Applications, Carnegie-Mellon Univ., Pittsburgh, PA, July 1985.

7. Grossberg, S., Competition, Decision, and Consensus, Journal of Mathematical Analysis and Applications, 66, 470-493, 1978.

8. Gupta, A., Parallelism in Production Systems: The Sources and the Expected Speed-up, Carnegie-Mellon University Computer Science Department Technical Report, December 1984.

9. Harbison, S.P., and Steele, G.L., C: A Reference Manual, Prentice-Hall, New Jersey, 1984.

10. Hillis, W.D., The Connection Machine, MIT Press, 1985.

11. Hillis, W.D., and Steele, G.L., Data Parallel Algorithms, CACM, vol. 29, no. 12, pp. 1170-1183, December 1986.

12. Hinton, G.E., Implementing Semantic Networks in Parallel Hardware, in G. Hinton and J. Anderson (Eds.),
Parallel Models of Associative Memory, Hillsdale, NJ: Erlbaum, 161-187, 1981.

13. Holland, J.H., Properties of the Bucket Brigade Algorithm, in J.J. Grefenstette (Ed.), Proceedings of an International Conference on Genetic Algorithms and their Applications, Carnegie-Mellon Univ., Pittsburgh, PA, July 1985.

14. Holland, J.H., Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems, in R.S. Michalski, J.G. Carbonell, and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. II, Los Altos, California: Morgan Kaufmann Publishers, 1986.

15. Holland, J.H., Adaptation in Natural and Artificial Systems, Univ. of Michigan Press, Ann Arbor, 1975.

16. Holland, J.H., Holyoak, K.J., Nisbett, R.E., and Thagard, P.R., Induction: Processes of Inference, Learning, and Discovery, MIT Press, 1986.

17. Klopf, A.H., The Hedonistic Neuron, Washington, DC: Hemisphere, 1982.

18. Laird, J., SOAR User's Manual, Xerox PARC Internal Document, June 1985.

19. Laird, J., Rosenbloom, P., and Newell, A., Towards Chunking as a General Learning Mechanism, in Two SOAR Studies, Carnegie-Mellon University Computer Science Department Technical Report CMU-CS-85-110, January 1985.

20. Laird, J.E., Rosenbloom, P.S., and Newell, A., Chunking in Soar: The Anatomy of a General Learning Mechanism, Machine Learning, vol. 1, no. 1, 1986.

21. Lasser, C., The Essential *Lisp Manual, Thinking Machines Technical Report, July 1986.

22. Michalski, R., Carbonell, J., and Mitchell, T., Machine Learning, vol. II, Morgan Kaufmann, 1986.

23. Minsky, M., The Society of Mind, Simon and Schuster, New York, 1986.

24. Minsky, M., and Papert, S., Perceptrons: An Introduction to Computational Geometry, MIT Press, 1969.

25. Riolo, R.L., CFS-C: A Package of Domain Independent Subroutines for Implementing Classifier Systems in Arbitrary, User-Defined Environments, Univ. of Michigan, Div. of Computer Science and Engineering, Logic of Computers Group Technical Report, January 1986.

26. Riolo, R.L.,
LETSEQ: An Implementation of the CFS-C Classifier System in a Task-Domain that Involves Learning to Predict Letter Sequences, Univ. of Michigan, Div. of Computer Science and Engineering, Logic of Computers Group Technical Report, January 1986.

27. Rose, J., and Steele, G., C* Language Quick Reference, Thinking Machines Technical Report, December 1986.

28. Rosenblatt, F., Principles of Neurodynamics, Spartan Books, New York, 1962.

29. Rumelhart, D.E., Hinton, G.E., and Williams, R.J., Learning Internal Representations by Error Propagation, ICS Report 8506, UC San Diego, September 1985.

30. Sejnowski, T.J., NETtalk: A Parallel Network that Learns to Read Aloud, Johns Hopkins Univ. Electrical Engineering and Computer Science Technical Report JHU/EECS-86/01, January 1986.

31. Sejnowski, T.J., Kienker, P.K., and Hinton, G.E., Symmetry Groups with Hidden Units: Beyond the Perceptron, Physica D (in press).

32. Steele, G.L., Common Lisp: The Language, Digital Press, Massachusetts, 1984.

33. Thinking Machines Corporation, Introduction to Data Level Parallelism, Thinking Machines Technical Report, April 1986.

34. Wagner, H.M., Principles of Operations Research, Prentice-Hall, New Jersey, 1969.

35. Wilcox, B., Reflections on Building Two Go Programs, SIGART Newsletter, October 1985.

36. Wilson, S.W., Classifier System Learning of a Boolean Function, Rowland Institute for Science Research Memo RIS No. 27r, February 1986.

37. Wilson, S.W., Knowledge Growth in an Artificial Animal, in J.J. Grefenstette (Ed.), Proceedings of an International Conference on Genetic Algorithms and their Applications, Carnegie-Mellon Univ., Pittsburgh, PA, July 1985.

PUNCTUATED EQUILIBRIA: A PARALLEL GENETIC ALGORITHM

J. P. Cohoon, S. U. Hegde, W. N. Martin, D. Richards
Department of Computer Science
University of Virginia
Charlottesville, Virginia 22903

ABSTRACT

A distributed formulation of the genetic algorithm paradigm is proposed and experimentally analyzed.
Our formulation is based in part on two principles of the paleontological theory of punctuated equilibria: allopatric speciation and stasis. Allopatric speciation involves the rapid evolution of new species after being geographically separated. Stasis implies that after equilibrium is reached in an environment there is little drift in genetic composition. We applied the formulation to the Optimal Linear Arrangement problem. In our experiments, the result was more than just a hardware acceleration; rather, better solutions were obtained with less total work.

INTRODUCTION

The genetic algorithm paradigm has been previously proposed to generate solutions to a wide range of problems [HOLL75]. In particular, several optimization problems have been investigated. These include control systems [GOLD83], function optimization [BETH81], and combinatorial problems [COHO86, DAVI85, FOUR85, GOLD85, GREF85, SMIT85]. In all cases, serial implementations have been proposed. We will argue that there is an effective parallel realization of the genetic algorithms approach based on what evolution theorists call "punctuated equilibria." We propose a parallel implementation and present empirical evidence of its effectiveness on a combinatorial optimization problem.

While the genetic algorithm (GA) approach is easily understood, it would be difficult to glean a canonical "pseudo-code" version from published accounts. Various implementations differ and many "obvious" design decisions are omitted. In all cases, a population of solutions to the problem at hand is maintained and successive "generations" are produced by manipulating the previous generation. The population is typically kept at a fixed size. Most new solutions are formed by merging two previous ones; this is done with a "crossover" operator and suitable encodings of the solutions. Some new solutions are simply modifications of previous ones, using a "mutation" operator.
Successive generations are produced with new solutions replacing some of the older ones. An ad hoc termination condition is used and the best remaining solution (or the best ever seen) is reported.

A solution is evaluated with respect to its "fitness," and of course we prefer that the most fit survive. There are two mechanisms for differential success. First, the better fit solutions are more likely to crossover, and hence propagate. Second, the less fit solutions are more likely to be replaced. It is important to realize that the GA approach is fundamentally different from, say, simulated annealing [KIRK83], which follows the "trajectory" of a single solution to a local maximum of the fitness function. With GA there are many solutions to consider and the crossover operation is so chaotic that there is no simple notion of trajectory.

How can parallelism be used with the GA approach? Initially it seems clear that the process is inherently sequential. Each generation must be produced before it can be used as the basis for the following generation; it is antithetical to the evolutionary scheme to jump forward. A simple use of parallelism is the simultaneous production of candidates for the next generation. For example, pairs of solutions could be crossed over in parallel, along with the selection and mutation of other solutions. But algorithmic issues remain to be resolved. How are the "parents" probabilistically selected? How are the solutions that are to be replaced chosen? The simple answers to these questions, which require global information, suggest the use of shared-memory architectures. Note that this sort of parallelism does not make any fundamental contribution to the GA approach; it can be viewed simply as a "hardware accelerator."

We restrict our interest to the study of parallel algorithms for a distributed processor system without shared memory. Our reasons are threefold.
First, we have access to such a system. Second, the extension of our results to massively parallel machines will be quite natural. In such a machine, the cost of connecting and distributing data is an important component in the analysis of the algorithms. We assume that the interconnection network is sparse, and hence communication between distant processors is expensive. Third, as implied above, we are interested in developing more than just a hardware accelerator. Rather, we desire a distributed formulation that gives better solutions with less total work.

We feel the most natural way to distribute a genetic algorithm over the processors is to partition the population of solutions and assign one subset of the population to the local memory of each processor. Consider a straightforward implementation of the GA approach. In order to probabilistically select parents for the crossover operation, global information about the (relative) fitnesses must be used. This implies an often performed phase of data collection, processing, and broadcasting. Further, extensive data movement is required to crossover two randomly selected solutions that are on distant processors.

Considerations such as the above led us to question the wisdom of using the GA approach as it is typically presented. There are many ad hoc methods for bypassing these difficulties. For example, continuing the above examples, knowledge of the global fitness distribution can be approximated. Further, steps can be taken to artificially reduce the diameter of related computations. Instead we found a simple model that naturally maps GA onto a distributed computer system. It is drawn from the theory of punctuated equilibria, discussed in the next section.

We chose to do our initial study using the Optimal Linear Arrangement problem (OLA). It is an NP-complete combinatorial optimization problem [GARE79].
We selected it due to the practical interest in such placement problems, as well as its simple presentation. There are m objects and m positions, where the positions are arranged linearly and are separated by unit distances. For each pair of objects i and j there is a cost c_ij. We need to find a mapping p, where object i is assigned to position p(i), that minimizes the objective function

    sum over i < j of c_ij * |p(i) - p(j)|        (1)

We note that OLA is related to the ubiquitous traveling salesman problem; they are both instances of the quadratic assignment problem.

PUNCTUATED EQUILIBRIA

N. Eldredge and S. J. Gould [ELDR72] presented the theory of punctuated equilibria (PE) to resolve certain paleontological dilemmas in the geological record. While the extent to which PE is needed to explain the data is hotly debated [ELDR85], we have found it to be an important model for understanding distributed evolutionary processes. PE is based on two principles: allopatric speciation and stasis. Allopatric speciation involves the rapid evolution of new species after being geographically separated. The scenario involves a small subpopulation of a species, "peripheral isolates," becoming segregated into a new environment. By using latent genetic material or new mutations this subpopulation may survive and flourish in its environment. A single species may give rise to many peripheral isolates.

Stasis, or stability, of a species is simply the notion of lack of change. (This directly challenges phyletic gradualism.) It implies that after equilibrium is reached in an environment there is very little drift away from the genetic composition of a species. The motivation is that "sympatric" speciation (differentiation in the same environment) is difficult since small changes can not compete with the "gene flow" of the current species.
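As a concrete reading of Eq. 1, a minimal Python sketch (names are ours, and positions are 0-indexed here rather than 1-indexed; only the pairwise differences matter):

```python
def ola_score(costs, p):
    # Objective of Eq. 1: sum over pairs i < j of c_ij * |p(i) - p(j)|,
    # where costs is a symmetric m x m matrix and p[i] is the position
    # assigned to object i.
    m = len(p)
    return sum(costs[i][j] * abs(p[i] - p[j])
               for i in range(m) for j in range(i + 1, m))
```

Placing heavily connected objects close together lowers the score, which is what the GA is asked to minimize.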
Ideally a species would persist until its environment changed (or it would drift very little).

It is instructive to define "species" in a way that relates to the concept of a solution in the GA approach. We adapt an old idea of S. Wright [WRIG32] that introduces the concept of the "adaptive landscape," which is analogous to the fitness "surface" over the space of solutions. Consider a peak of the landscape that has been discovered and populated by a subset of the gene pool. That subset (perhaps with nearby subsets) corresponds to a species. There can be many "species" in a given environment, some so distant that their mutual offspring are not adapted to the environment. The difficult question is how can a species, as a whole, leave its "niche" to migrate to an even higher peak. The concept of stasis emphasizes the problem. PE stresses that a powerful method for generating new species is to thrust an old species into a new environment, that is, a new adaptive landscape, where change is beneficial and rewarded. For this reason we should expect a GA approach based on PE to perform better than the typical single environment scheme.

What are the implications for the GA approach? If the "environment" is unchanging then equilibrium should be rapidly attained. The resulting equivalence classes of similar solutions would correspond to species. It is possible that the highest peaks remain unexplored. Typically, when GA is used, the mutation and crossover operations are relied on to eventually find the other peaks. PE indicates that a more diverse exploration of the adaptive landscape could be achieved by allopatric speciation of peripheral isolates. Therefore, subpopulations must be segregated into environments that are somehow different. Two different schemes for changing the environment are suggested. Suppose fitness is a multi-objective function. Various low-order approximations to the true fitness could be tried at different times and places.
We will not explore this further here. The second scheme simply changes the environment by throwing together previously geographically separated species. We feel that the combination of new competitors and a new gene pool would cause the desired allopatric speciation. Further, we will define fitness so that it is relative to the current local population. So a new combination of competitors will alter the fitness measure. This scheme is used in the work presented here.

GENETIC ALGORITHMS WITH PUNCTUATED EQUILIBRIA

Our basic model of parallel genetic algorithms assigns a set of n solutions to each of N processors, for a total population of size nxN. The set assigned to each processor is its subpopulation. (It is a simple extension to the model to allow different and time-varying population sizes. Other extensions are discussed in Section 6.) The processors are connected by a sparse interconnection network. In practice we might expect a conventional topology to be used, such as a mesh or a hypercube, but at present the choice of topology is not considered to be important. The network should have high connectivity and small diameter to ensure adequate "mixing" as time progresses.

The overall structure of our approach is seen in Figure 1. There are E major iterations called epochs. During an epoch each processor, disjointly and in parallel, executes the genetic algorithm on its subpopulation. Theoretically each processor continues until it reaches equilibrium. Since we know of no adequate stopping criteria we have used a fixed number, G, of generations per epoch. This considerably simplifies the problem of "synchronizing" the processors, since each processor should be completed at nearly the same time. After each processor has stopped there is a phase during which each processor copies randomly selected subsets of its population to neighboring processors.
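The epoch structure just described can be sketched as a sequential simulation (a sketch under our own naming: `run_ga` and `select` stand for the per-processor GA phase and the probabilistic survival selection, both described later in the paper; migration here is uniform-random, as the authors specify):

```python
import random

def pe_genetic_algorithm(init_subpops, neighbors, run_ga, select, E, S):
    # init_subpops: list of N subpopulations (one per simulated processor)
    # neighbors:    adjacency map over processors (sparse network)
    # run_ga:       runs G generations on one subpopulation, returns it
    # select:       reduces a surplus population back down to n elements
    # E, S:         number of epochs, size of each migrated set
    subpops = [list(p) for p in init_subpops]
    n = len(subpops[0])
    for _ in range(E):
        subpops = [run_ga(p) for p in subpops]           # parfor: GA phase
        inboxes = [list(p) for p in subpops]
        for i, p in enumerate(subpops):                  # parfor: migration
            for j in neighbors[i]:
                inboxes[j].extend(random.sample(p, S))   # uniform, not fitness-based
        subpops = [select(box, n) for box in inboxes]    # parfor: survival
    return subpops
```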
Each processor has now acquired a surplus of solutions and must probabilistically select a set of n solutions to survive to be its initial subpopulation at the beginning of the next epoch. (The selection process is the same as the adjustment procedure used by GA proper.)

    initialize
    for E iterations do
        parfor each processor i do
            run GA for G generations
        endfor
        parfor each processor i do
            for each neighbor j of i do
                send a set of solutions S(i,j) from i to j
            endfor
        endfor
        parfor each processor i do
            select an n element subpopulation
        endfor
    endfor

    Figure 1 — The parallel genetic algorithm with punctuated equilibria.

The relationship to PE should be clear. Each processor corresponds to a disjoint "environment" (as characterized by the mix of solutions residing in it). After G generations we expect to see the emergence of some very fit species. (It is not necessary or even desirable to choose G so large that only one "species" survives. Diversity must be maintained.) Then a "catastrophe" occurs and the environments change. This is simulated by having representatives of geographically adjacent environments regroup to form the new environments. By varying the amount of redistribution, that is, S = |S(i,j)|, we can control the amount of disruption.

There can be two types of probabilistic selection used here. The fitness of each element of a population is used for selection, where the probability of selecting an element is proportional to its fitness. When there is repeated selection from the same population it can be done either with replacement or without replacement. The decision should be based on both analogies with the natural genetic model and goals for efficiently driving the optimization process. The selection of each final (end of epoch) subpopulation is done without replacement; the good solutions will only "propagate" at the beginning of the next epoch. (The selection of each S(i,j) is not done probabilistically. A "random", i.e.
using a uniform distribution, selection is used to simulate the randomness of environment shifts.)

We present our interpretation/implementation of the GA code each processor uses in Figure 2. The crossover rate, 0 <= C <= 1, determines how many new offspring are produced during each generation. "Parents" are chosen probabilistically with replacement. The crossover itself, and other details, are discussed below. Our crossover produces one offspring from two parents. The fitnesses are recalculated, relative to the new larger population. Then, probabilistically without replacement, the next population is selected. Finally, (uniform) random elements are mutated. The mutation rate, 0 <= M <= 1, determines how many mutations altogether are performed.

IMPLEMENTATION DETAILS

The problem we studied, OLA, is a placement problem. Hence a solution must encode a mapping p from objects to positions. Since we may assume both objects and positions are numbered 1, 2, ..., m, the mapping p is just a permutation. In fact, throughout we use the inverse mapping, from positions to objects, as the basis for our encodings. There are many ways to encode permutations but it is not at all clear which, if any, are suitable for the GA approach. Note that it is desirable to preserve adjacencies within groups of objects during crossover. Several representations and crossovers have been proposed for related problems, e.g. the traveling salesman problem. Inversion vectors ("ordinal representations") are the most obvious choice since they allow "typical" crossovers (as in [HOLL75]). However the use of such crossovers with inversion vectors is quite undesirable [GREF85] since it breaks up groups of adjacent objects. Goldberg and Lingle [GOLD85] gave a "partially mapped crossover" that uses a straightforward array representation of the (inverse) mapping, where the ith entry is j if the jth object is in the ith position.
Briefly, their crossover copied a contiguous portion of one parent into the offspring, while having the other parent copy over as many other positions as possible. Smith [SMIT85] proposed a "modified crossover" for the array representation. A random division point is selected and the first "half" of one parent is copied to the offspring. The remainder of the offspring's array is filled with unused objects, while preserving their relative order within the other parent. For example, parents [7 1 3 5 6 2 4] and [3 4 2 7 1 5 6] with the division point after the third position produce the offspring [7 1 3 4 2 5 6]. We essentially used this representation and crossover but allowed the first or the second "half" of the first parent to be used. By arguments analogous to those of Goldberg and Lingle [GOLD85], we can argue that our approach has the desirable "schema-preserving" property that the GA approach exploits. However it is an open problem to give a theoretically compelling proof of that property. In any event, it is clear that our scheme tends to preserve blocks of adjacent objects.

    for G iterations do
        for nxC iterations do
            select two solutions
            crossover those solutions
            add offspring to subpopulation
        endfor
        calculate fitnesses
        select a population of n elements
        generate nxM random mutations
    endfor

    Figure 2 — The genetic algorithm used within an epoch at each processor.

The mutate operator was selected next. We felt that the mutations should not be too disruptive; if most adjacencies were broken then with near certainty the mutation would be immediately lost. We chose to use "inversion," the reversal of a contiguous block within the array representation. The beginning of the block was randomly selected. The length of the block was randomly chosen from an exponential distribution with mean μ. We typically kept μ small to inhibit disruption. The nature of the OLA problem encourages inversion as opposed to pairwise interchanges, which do not involve block moves.
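The modified crossover and the inversion mutation above can be sketched in Python (function names are ours; the crossover sketch takes the first "half" of parent 1, one of the two variants the authors allow):

```python
import random

def modified_crossover(parent1, parent2, cut):
    # Smith's modified crossover: copy the first `cut` positions of
    # parent1, then fill the rest with the unused objects in the relative
    # order they appear in parent2.
    head = parent1[:cut]
    used = set(head)
    return head + [obj for obj in parent2 if obj not in used]

def inversion_mutation(perm, mean_len):
    # Inversion: reverse a contiguous block; the start is uniform and the
    # length is drawn from an exponential distribution with the given
    # mean (kept small to limit disruption), clipped to the array end.
    start = random.randrange(len(perm))
    length = min(len(perm) - start,
                 max(1, int(random.expovariate(1.0 / mean_len))))
    perm[start:start + length] = reversed(perm[start:start + length])
    return perm
```

On the paper's example, `modified_crossover([7, 1, 3, 5, 6, 2, 4], [3, 4, 2, 7, 1, 5, 6], 3)` reproduces the offspring [7, 1, 3, 4, 2, 5, 6].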
How should fitness be calculated? With any minimization problem, such as OLA, the scores of the solutions should decrease over time. The score is the value of the objective function. Two simple fitness functions suggest themselves. First, the fitness could be inversely related to the score; this could cause excessive compression of the range of fitnesses. Second, the fitness could be a constant minus the score. The constant must be large enough to ensure all fitnesses are positive (since they are used in the selection process) and not too large (effectively causing compression). If such a constant was optimal initially, it would become a poor choice near equilibrium. For these reasons we used a time-varying "normalized" fitness. We chose our fitness to be a function of all the scores in the current population. We have empirically found that randomly generated solutions to the OLA problem have scores that are "normally distributed" (i.e., have a bell-shaped curve), with virtually every solution within 1 standard deviation (s.d.) of the mean, and no solutions were found more than 3 s.d.'s away. For related evidence see [COHO87, WHIT84]. Therefore we used

    fitness(x) = (μ_s - score(x) + aσ) / (2aσ)        (2)

where μ_s is the mean of the scores, σ is the s.d., and a is a small constant parameter. Note that in practice we expect 0 < fitness(x) < 1; we use clipping to ensure it is positive. Near equilibrium the scores will not be normally distributed because the contributions from most mutations and many crossovers will almost certainly be below the mean, biasing the distribution.

Our fitness measure has several advantages. It is somewhat problem independent, so that we can reasonably compare very different instances. It also tends to control the effect of a few "outliers" on the population. A disadvantage is that it is expensive to calculate and it needs to be recomputed at regular intervals.
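Eq. 2 can be computed directly; a minimal sketch (function name, default scale factor, and clipping floor are ours; the default a = 3 matches the observation that scores stay within 3 s.d.'s of the mean):

```python
import statistics

def normalized_fitness(scores, a=3.0, eps=1e-6):
    # Time-varying normalized fitness of Eq. 2:
    #   fitness(x) = (mu_s - score(x) + a*sigma) / (2*a*sigma)
    # Scores near mu_s - a*sigma map toward 1, near mu_s + a*sigma toward 0;
    # clipping keeps every fitness positive for the selection step.
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores) or eps   # guard against zero spread
    return [max(eps, (mu - s + a * sigma) / (2 * a * sigma)) for s in scores]
```

Because OLA is a minimization problem, lower scores map to higher fitnesses, and a score equal to the mean maps to exactly 0.5.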
An approximation scheme can be used where new fitnesses are calculated according to the current mean and variance. The other elements would not have their fitnesses recalculated, unless they were otherwise encountered, but the pseudo-normalization renders them comparable.

EMPIRICAL RESULTS

We performed several experiments to determine if our parallel genetic algorithm is an effective approach. The efficacy of any GA approach is determined by the design variables. Our initial experiments, reported here, have been made with the Optimal Linear Arrangement (OLA) problem as the base case. This permutation problem has a raw score (Eq. 1) that the system is to minimize for a given cost matrix. The system uses a fitness measure (Eq. 2) in selecting the elements for crossover and in determining the survivors for each generation. Remember that for both of those selection processes within a subpopulation the fitness is judged relative to that subpopulation, and that the fitness is used in a probabilistic manner.

Our current implementation is a sequential simulation of the parallel genetic algorithm with punctuated equilibria. It operates with an arbitrary configuration of the N subpopulations. Presently we are investigating "mesh" and "hyper-cube" connection topologies. Although other configurations will be analyzed, the hyper-cube topology is of particular importance to us. We plan to obtain "real" performance measures on the hyper-cube multiprocessor at the University of Virginia. In most of the initial studies described below, a mesh configuration is used with N = 4, i.e., each subpopulation being able to "communicate" during the inter-epoch transition with two other subpopulations. For each experiment the number of epochs, E, is given along with the number of generations per epoch, G, and the end-of-generation subpopulation size, n. Thus, over the course of a single example the parallel system will create NxExG generations.
If we set N and E to one, then we have a "standard" sequential GA creating G generations. While the sequential GA has a single evolutionary time line, the parallel algorithm has multiple, interrelated evolutionary time lines. The "interrelated" qualification is quite important because the parallel system does more than just create N divergent time lines. As with the sequential GA, one cannot say, a priori, how many distinct individuals, i.e., possible problem solutions, a particular example run of our system will examine. For our purpose here, we will use N×E×G×n as an indicator of the total number of solutions created during the experiment. The remaining design variables of importance are C, the crossover rate; M, the mutation rate; S, the size of the redistribution set; a, the fitness scale factor; and λ, the mean length of the mutation block.

[Table 1. Results for three problem instances. The table gives, for each instance, the settings (N, E, G, n) and the resulting scores; the individual entries are too garbled in this copy to reproduce.]

[Table 2. The effects of changing the design variables. The first row gives the base-case settings and subsequent rows alter one variable at a time; the individual entries are too garbled in this copy to reproduce.]

Tables 1 and 2 present the results and the settings used to derive those results. In those tables the quantity s* is the theoretical optimal OLA score (as opposed to the fitness measure); s̄ is the average of the best OLA score from each example with the specified settings; and ŝ is the score of the single best solution created during the example.
In the current simulations a single random number process is used, allowing multiple examples to be generated for the same design variable setting by changing a single "seed." In the discussion below the term "average" will indicate that several examples with the same settings and inputs, but with different seeds, have been run and the resulting measures averaged. In all cases reported here, the average is taken over a minimum of four runs. Table 1 is broken into three instances, with three rows for each instance. The first row shows the results from running the parallel genetic algorithm with punctuated equilibria, i.e., independent subpopulations with communication. For these results a four-node mesh configuration was used. The second row shows the results from a sequential genetic algorithm. That algorithm was derived by setting N and E to one, i.e., one population creating G generations. On a comparable uniprocessor, this algorithm would require about four (N) times the amount of "wall clock" time as the parallel genetic algorithm with punctuated equilibria. The third row shows the results from a simple parallel genetic algorithm that just used four (N) independent populations without communication. Here each example run was derived by setting N to four and E to one, and then, at the end of the parallel operation, selecting the best overall from the best of the four populations. For each instance, we kept N×E×G×n constant over all three rows. This product is indicative of the total number of OLA solutions examined by the system. By keeping the product constant we assume that approximately the same amount of total computation is required. For these initial studies we have considered "artificial" OLA examples, which allow easy determination of the optimal score. While the examples are contrived, they exhibit natural clustering patterns.
In all cases, the costs were chosen so that the identity permutation produced the optimal score; the only other optimal permutations were simple perturbations of the identity mapping. (Of course this does not make the problem any easier.) We used three types of problems, i.e., cost matrices, in our experiments.

The first type of problem instance, with unique optima, has a cost matrix of the following form:

            | 0 A B C D 0 0 0 0 |
            | A 0 A B C D 0 0 0 |
            | B A 0 A B C D 0 0 |
            | C B A 0 A B C D 0 |
    C1(9) = | D C B A 0 A B C D |
            | 0 D C B A 0 A B C |
            | 0 0 D C B A 0 A B |
            | 0 0 0 D C B A 0 A |
            | 0 0 0 0 D C B A 0 |

where m = 9 and A ≫ B ≫ C ≫ D ≥ 0. We believe the solution spaces for such problem instances to be "convex" in some sense, and therefore "easy." Instance one (I = 1) of Table 1 used C1(9), with A = 1000, B = 100, C = 10, and D = 1. Note that, for this instance, the parallel genetic algorithm with punctuated equilibria found the optimum solution in each example, as indicated by s̄ = s*.

The second problem type was slightly more complex. We increased m to 18 and created a cost matrix by embedding two independent 9-element orderings as given by the following cost matrix:

    C2(18) = | C1(9)    0    |
             |   0    C1(9)  |

Note that the two groups of 9 are uncoupled and that this is tantamount to solving two disjoint problems. A, B, C, and D in C2(18) were as above. The settings and results for this cost matrix are shown as instance two (I = 2) in Table 1.

The third type of problem incorporated further complexity by embedding interrelated blocks of three elements each. For nine elements the resulting cost matrix would be

            | 0 A A B B B C C C |
            | A 0 A B B B C C C |
            | A A 0 B B B C C C |
            | B B B 0 A A B B B |
    C3(9) = | B B B A 0 A B B B |
            | B B B A A 0 B B B |
            | C C C B B B 0 A A |
            | C C C B B B A 0 A |
            | C C C B B B A A 0 |

We assume A > B > C. The intra-block cost, A, causes primary clustering, and the inter-block cost for adjacent blocks, B, forces an ordering of the blocks. The other costs, the C's, tend to "flatten" the search space by making all permutations have similar scores.
This cost matrix pattern was extended to C4(18), i.e., eighteen objects comprising six blocks, and we let A = 50, B = 30, and C = 15. The results are shown as instance three (I = 3) in Table 1. The results shown in Table 2 are presented to indicate the effects of changing individual design variables. The problem instance used C4(18) with A = 40, B = 30, and C = 15. Note the slight reduction in A was intended to make the optimum solution more elusive. The settings for the base case are shown in the first row. In the remaining rows we show the altered value of a particular variable and the obtained results. Again, the averages were taken over four example runs. In terms of s̄, the most dramatic changes were due to reducing a and to increasing E. The effect of decreasing a is to make the fitness measure more sensitive to smaller differences between solutions that are near the current best for the subpopulation. In this way incremental improvements are given more of an opportunity to survive and create further improvements. The increase in E effectively provides more communication opportunities between the subpopulations. The last row of Table 2 and instance three from Table 1 provide the strongest experimental results to date for the effectiveness of the parallel genetic algorithm with punctuated equilibria. These promising effects prompted an experiment combining the modifications in the last two rows of Table 2. We obtained the optimal solution in 5 out of 6 runs.

CONCLUSIONS AND EXTENSIONS

In attempting to develop parallel algorithms one always wants to obtain the simple speed-up of having more processors do more instructions in the same "wall clock" time. However, there is evidence that in attempting to develop parallel versions of previously known algorithms one often derives a modified formulation that embodies more fundamental efficiencies. We believe this to be the case for our parallel genetic algorithm with punctuated equilibria.
The partition into subpopulations specifies a simple and balanced mapping of the workload to a non-shared-memory multiprocessor system, while the intra-epoch isolation and inter-epoch communication provide a fundamental modification to the basic genetic algorithm that will generate better solutions while considering a smaller total number of individuals. Several extensions to the model have been considered and are being attempted. For example, the use of fixed-size subpopulations is not suggested by the natural evolutionary setting. When a processor receives a surplus of very fit solutions it makes sense to retain most of them. However, it is clear there should be some mechanism for limiting the size of each subpopulation as well as the total size. Varying-sized groups will create data management and coordination problems, and it is not clear they are worth the additional computational load. Our model is essentially synchronous, though it is easily realized asynchronously with handshaking. A truly asynchronous model would allow each processor to decide for itself whether it has reached equilibrium and should begin another epoch. At that point it could poll its neighbors, asking for subsets of their current subpopulations to be sent to it. Global termination, while somewhat arbitrary before, becomes even more difficult.

ACKNOWLEDGEMENTS

The authors' work has been supported in part by the Jet Propulsion Laboratory of the California Institute of Technology under Contract 957721 to the University of Virginia. The work of James Cohoon has been supported additionally in part by the National Science Foundation through grant DMC 8505354. Their support is greatly appreciated.

REFERENCES

[BETH81] A. Bethke, Genetic Algorithms as Function Optimizers, Ph.D. Thesis, Department of Computer and Communication Sciences, University of Michigan, 1981.

[COHO86] J. P. Cohoon and W. D.
Paris, Genetic Placement, IEEE International Conference on Computer-Aided Design, Santa Clara, CA, 1986, 422-425.

[COHO87] J. P. Cohoon and M. T. Roberson, Jump Starting Simulated Annealing, Department of Computer Science, University of Virginia, 1987.

[DAVI85] L. Davis, Job Shop Scheduling with Genetic Algorithms, Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, PA, 1985, 136-140.

[ELDR72] N. Eldredge and S. J. Gould, Punctuated Equilibria: An Alternative to Phyletic Gradualism, in Models of Paleobiology, T. J. M. Schopf (ed.), Freeman, Cooper and Co., 1972, 82-115.

[ELDR85] N. Eldredge, Time Frames, Simon and Schuster, 1985.

[FOUR85] M. P. Fourman, Compaction of Symbolic Layout Using Genetic Algorithms, Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, PA, 1985, 141-150.

[GARE79] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Co., San Francisco, CA, 1979.

[GOLD83] D. E. Goldberg, Computer-Aided Gas Pipeline Operation Using Genetic Algorithms and Learning Rules, Ph.D. Thesis, Department of Civil Engineering, University of Michigan, 1983.

[GOLD85] D. E. Goldberg and R. Lingle, Jr., Alleles, Loci, and the Traveling Salesperson Problem, Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, PA, 1985, 154-159.

[GREF85] J. J. Grefenstette, R. Gopal, B. J. Rosmaita and D. Van Gucht, Genetic Algorithms for the Traveling Salesperson Problem, Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, PA, 1985, 160-168.

[HOLL75] J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975.

[KIRK83] S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi, Optimization by Simulated Annealing, Science 220, 4598 (May 13, 1983), 671-680.

[SMIT85] D.
Smith, Bin Packing with Adaptive Search, Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, PA, 1985, 202-206.

[WHIT84] S. R. White, Concepts of Scale in Simulated Annealing, International Conference on Computer Design: VLSI in Computers Proceedings, Port Chester, NY, 1984, 646-651.

[WRIG32] S. Wright, The Roles of Mutation, Inbreeding, Crossbreeding, and Selection in Evolution, Proceedings of the Sixth International Congress of Genetics 1, (1932), 356-366.

A PARALLEL GENETIC ALGORITHM

Chrisila B. Pettey, Michael R. Leuze, and John J. Grefenstette*
Vanderbilt University, Nashville, TN 37235

ABSTRACT

A parallel genetic algorithm (PGA) is presented as a solution to the problem of real time versus genetic search encountered in genetic algorithms with large populations. A discussion of the algorithm is followed by descriptions of experiments which were performed with the PGA and of performance measures which were collected during each experiment. The paper concludes with a discussion of experimental results.

1. Introduction

An important open question in the study of genetic algorithms (GA's) is the optimal size of a population. The problem centers around the tradeoff between the amount of genetic search that can be done and the amount of real time available. If the population size is too small, then the GA will have a, possibly improperly, constrained search space because of an insufficient number of schemata in the population. If the population size is too large, however, an inordinate amount of time will be required to perform all the evaluations. In the worst case a GA can be reduced to random search if the amount of available time is exhausted before any genetic search is performed. Goldberg [3] has developed a theory of the optimal population size for binary-coded genetic algorithms based on the length of an individual. But, here again, the tradeoff between the amount of genetic search and the amount of real time is still evident.
For instance, according to Goldberg's theory, the optimal population size for individuals of length 60 is approximately 10200. If evaluation of an individual requires 1 second, it will take approximately 2.8 hours to evaluate one generation. Since an evaluation can be as complex as solving a large queueing network model or running a simulation of processes on a multiprocessor, it is not unreasonable to believe that an evaluation could require 1 second or more. Execution times become even greater when populations with longer individuals are considered. In order to overcome the genetic search vs. real time problem, a parallel implementation of GA's has been investigated. This class of parallel genetic algorithms (PGA's) is presented here, along with some experimental results. The implementation techniques used in this work are not restricted to any particular multiprocessor architecture, but the results reported in this paper are from an Intel iPSC, a message-based multiprocessor system with a binary n-cube interconnection network.

1. Research supported in part by the National Science Foundation under Grant DCR-8305693.
* Current address: Navy Center for Applied Research in AI, Code 5510, Naval Research Laboratory, Washington, DC 20375-5000.

2. Outline of the algorithm

A PGA operates on a very large population of several distributed groups of individuals. There is precedent for this type of population found in the idea from population genetics of a polytypic species [2]. A polytypic species is a species composed of different groups which are isolated from each other yet which are capable of producing offspring with each other. The human species, for example, is polytypic, in that it consists of groups isolated from each other either physically or culturally. Precedent for isolated groups in a population can also be found in Wilson's Animat [5], a classifier-based learning system, and in Schaffer's VEGA [4].
In the Animat system, each individual specifies one action, a direction for an artificial animal to move; when a parent is selected for crossover, its mate is selected from the subpopulation of individuals which specify the same action as the first parent. VEGA, on the other hand, in order to find an individual which performs well in several dimensions, performs selection by choosing subpopulations from the total population based on fitness in an individual dimension. A PGA consists of a group of identical "nodal GA's" (NGA's), one per node of the multiprocessor system. Each NGA maintains a small population which is a portion of the large population, and functions in much the same way as a sequential GA. The one difference between an NGA and a sequential GA is that once during each generation, an NGA communicates with its neighboring NGA's. This communication phase consists of sending the best individual in the local population to each neighbor and receiving the best individual from each neighbor's population. After the best individuals are received from the neighboring NGA's, it is necessary for an NGA to insert the new individuals into its local population. This insertion can be done by replacing random individuals, the worst individuals, or the individuals most like the incoming individuals (i.e., those at the smallest Hamming distance). The modified algorithm for an NGA is listed in Figure 1.

    NGA:
    begin
        Initialization;
        Evaluation;
        while (not done)
        begin
            Communication;
            Selection;
            Recombination;
            Evaluation;
        end
    end

    Figure 1. A Nodal Genetic Algorithm

While it is true that a PGA can be thought of as a sequential GA with a very large population, there is one major problem with this analogy. In a sequential GA an individual is selected based on its performance in the whole population, but in a PGA an individual is selected based on its performance in its local subpopulation. This difference in selection could result in premature convergence.
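The communication phase of Figure 1, paired with the worst-individual replacement policy named above, might be sketched as follows. The function name, the list-of-individuals representation, and the higher-is-better fitness convention are illustrative assumptions.

```python
def communicate(pop, neighbour_pops, fitness):
    """Sketch of an NGA's communication phase: take the best individual
    from each neighbouring NGA's population and insert it into the
    local population by replacing the current worst individual (one of
    the three insertion policies described in the text).

    `pop` is the local population (a list of individuals), and
    `fitness` maps an individual to its value, higher being better.
    """
    for other in neighbour_pops:
        best_incoming = max(other, key=fitness)
        worst_index = min(range(len(pop)), key=lambda k: fitness(pop[k]))
        pop[worst_index] = best_incoming   # worst-replacement insertion
    return pop
```

Replacing by smallest Hamming distance instead would only change how `worst_index` is chosen, picking the resident individual closest to the incoming one.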
On the other hand, this difference might instead slow convergence and ultimately produce better results. At this time there is no theoretical basis for this type of selection.

3. Description of Experiments

Four of DeJong's testbed of functions [1] (f1, f2, f3, and f5) were used to test a PGA. For each of the four functions, an NGA maintained a population of 50 individuals and used a worst-individual replacement policy. Five experiments involving different numbers of NGA's were performed. For each of the four functions, 1, 2, 4, 8, and 16 NGA's were used. Corresponding population sizes were, therefore, 50, 100, 200, 400, and 800. For each experiment the best individual performance, the online performance, and the offline performance were collected after each generation from each NGA. Each NGA was treated as a standard sequential GA, and performances were measured accordingly. From this collection of results, performances for the PGA were calculated.

4. Experimental Results

The results of the 20 experiments are graphically displayed in Figures 2 through 5. Figure 2 shows the best individual per generation for each of the four functions. Since a PGA with more NGA's has a larger initial population, it is possible for it to contain a better best individual in the initial population than a PGA with fewer NGA's. Since the addition of an NGA does not change the initial population of other NGA's, it is not possible for a PGA with more NGA's to have a worse best individual in the initial population than a PGA with fewer NGA's. According to Goldberg's theory, the optimal population size for f1 is 116, for f2 is 51, for f3 is 2240, and for f5 is 262. The results from f1 (Figure 2a) and f5 (Figure 2d) appear to agree fairly well with the theory.
Although the experiments were not run on populations of size 116 and 262, the populations of size 200 and 400 for f1 and f5, respectively, quickly found the optimum, and increasing the population size yielded no significant improvement. The results from f3 (Figure 2c) also tend to corroborate the theory. While a PGA was not run on a population larger than 800, the results did tend to improve with the increase of population size. Although populations of size 200 and 800 performed reasonably well on f2 (Figure 2b), the results from this function would seem to substantiate the belief that the selection performed in a PGA increases the likelihood of premature convergence. It is interesting to note that the best answer was found with a population of size 50, which is approximately the theoretical optimal population size. These data would tend to indicate that the population size for a PGA should be set with the optimal population size in mind. Figure 3 shows the online performance of a PGA on the four functions. If it is desirable to optimize online performance, a large population should be used, since a PGA's online performance improves with population size. However, it is not clear that online performance has any real meaning in a PGA, since a PGA runs on a multiprocessor system. Calculating the best performance and the online performance of a PGA is fairly straightforward. The offline performance, however, is a different matter. A possible offline performance measure is the average of the offline performances of each of the NGA's. Figure 4 shows the results of this measure. Perhaps more in keeping with the idea of offline performance is the measure shown in Figure 5. The offline best performance is obtained in much the same manner that offline performance is obtained in a sequential GA, with the exception that only the best individuals from each NGA are used in the measure.
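One reading of this offline best performance measure can be sketched as follows. Maximization is assumed, and the interpretation of "best individuals seen so far" as a running best over all NGA's, averaged across generations, is our assumption rather than a definition taken from the paper.

```python
def offline_best(nga_best_histories):
    """Hypothetical sketch of the offline best performance measure.

    `nga_best_histories` is one list per NGA of that NGA's best value
    at each generation.  At each generation we take the best value any
    NGA has reported so far, then average these best-so-far values over
    all generations, mirroring how offline performance is computed for
    a sequential GA but using only the per-NGA bests.
    """
    gens = len(nga_best_histories[0])
    best_so_far = float("-inf")
    running = []
    for t in range(gens):
        best_so_far = max([best_so_far] + [h[t] for h in nga_best_histories])
        running.append(best_so_far)
    return sum(running) / gens
```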
Therefore, the offline best performance measure is the average of the bests of the local NGA best individuals seen so far. As is to be expected, the offline best performance measure improves with an increase in population size. It should be noted that a PGA has the desirable property that an increase in population size by the addition of another NGA should increase the execution time of a PGA only slightly. This increase in time is due to increased communication overhead among a larger set of NGA's. However, while there is no firm data as yet on the increase in real time due to the addition of an NGA, experience indicates that it is negligible in comparison to the increase in real time due to a comparable increase in population size of a sequential GA.

5. Conclusions

This paper is a presentation of the initial work with a PGA. There is still much work that needs to be done with PGA's. Timing studies of the five functions on the iPSC are continuing. In the near future a PGA will be applied to the Traveling Salesman Problem and to a mapping problem (i.e., finding the appropriate placement of processes in a multiprocessor system in order to minimize the response time), and PGA's will be implemented with alternate means of communication (e.g., sending different individuals to each neighbor, sending individuals probabilistically based on their performance, etc.). Also, the selection mechanism in PGA's needs to be compared theoretically with the selection mechanism in sequential GA's. The initial work, however, indicates that a PGA is a viable means of increasing the population size of a GA, since a PGA will allow the use of genetic search in problem areas where a candidate solution has a long representation and is time consuming to evaluate.

Acknowledgements

The authors would like to thank Mike Hilliard and Gunar Liepins of Oak Ridge National Laboratory for their contributions to this work.

References

[1] Kenneth A.
DeJong, An Analysis of the Behavior of a Class of Genetic Adaptive Systems, Ph.D. Thesis, Department of Computer and Communication Sciences, University of Michigan, 1975.

[2] Theodosius Dobzhansky, Genetics of the Evolutionary Process, Columbia University Press, New York, 1970.

[3] David E. Goldberg, Optimal Initial Population Size for Binary-Coded Genetic Algorithms (TCGA Report No. 85001), Tuscaloosa: University of Alabama, The Clearinghouse for Genetic Algorithms, 1985.

[4] David J. Schaffer, Some Experiments in Machine Learning Using Vector Evaluated Genetic Algorithms, Ph.D. Thesis, Department of Electrical Engineering, Vanderbilt University, 1984.

[5] Stewart W. Wilson, Knowledge Growth in an Artificial Animal, in Proceedings of an International Conference on Genetic Algorithms and Their Applications, J. J. Grefenstette, Ed., July 1985.

[Figure 2. Best Individual Performances: panels 2a (f1), 2b (f2), 2c (f3), 2d (f5), plotting best individual vs. generations (0-100) for 1, 2, 4, 8, and 16 nodes.]

[Figure 3.
Online Performances: panels 3a (f1), 3b (f2), 3c (f3), 3d (f5), plotting online performance vs. generations (0-100) for 1, 2, 4, 8, and 16 nodes.]

[Figure 4. Offline Average Performances: panels 4a (f1), 4b (f2), 4c (f3), 4d (f5), plotting offline average performance vs. generations (0-100) for 1, 2, 4, 8, and 16 nodes.]

[Figure 5. Offline Best Performances: panels 5a (f1), 5b (f2), 5c (f3), 5d (f5), plotting offline best performance vs. generations (0-100) for 1, 2, 4, 8, and 16 nodes.]

GENETIC LEARNING PROCEDURES IN DISTRIBUTED ENVIRONMENTS

Adrian V. Sannier II, Research Assistant
Erik D. Goodman, Professor and Director
A. H. Case Center for Computer-Aided Engineering and Manufacturing
Michigan State University, East Lansing, MI 48823

ABSTRACT

This paper introduces a strategy for fostering the development of hierarchically organized distributed systems which uses a genetic algorithm [1] to manipulate independent, computationally limited units that work toward a common goal. The central thesis of this work is that important characteristics of the hierarchical structure of living systems can be duplicated by applying idealized genetic operators to a functionally interacting population of encoded programs. The models and results we present in this paper suggest that the introduction of functional interaction between distributed units can promote the development of a genome capable of producing a set of independent, differentiated units and implicitly organizing them into a coherent and coordinated distributed system.
The results presented here are available in more complete form in [2].

AN IDEALIZED DISTRIBUTED GENETIC SYSTEM

The class of systems we have attempted to model emerges from consideration of macro-organisms composed of vast numbers of interacting, distributed units. During ontogeny, a single genetic program, or genome, represented in several strands of DNA, replicates itself, more or less exactly, millions of times. The cells containing these replicas of the original program do not all perform alike, however. Instead, they differentiate into many types, each of which performs a specialized function. None of the individual units operates on the scale of the macro-organism, yet together the operations which these differentiated cells perform are implicitly coordinated to produce the behavior of the larger organism. This process of differentiation is controlled, at least in large part, by the action and organization of the original genome. We propose here an idealized mechanism for developing what we call composite genomes, i.e., genomes capable of producing differentiated offspring. Their formation is desirable since they provide a means for coordinating the actions of independent, functionally interacting units operating in a common environment. By replicating itself many times and placing the offspring in appropriate environments, a single composite genome produces a system of implicitly coordinated independent units capable of pursuing a common objective. Furthermore, all of the information about the system is located in a single unit (of which there are many copies), making communication of the group strategy simple. Composite genomes are in large part responsible for the structure of the recursive, holonic hierarchies [3] which are characteristic of living systems.
Our basic hypothesis is that composite genomes arise from the consolidation of specialized genetic programs which control independent units that produce behaviors which are symbiotically related. Via a reproductive operation we call hybridization, genetic programs which produce distinct specialized behaviors are combined into a single genome capable of producing either behavior. Which of the encoded behaviors the replicated offspring of a composite genome exhibit is determined by the internal and external environmental conditions in which they are placed.

The Basic Model

In our idealized model of the natural genetic system, an environment is defined in terms of a space, which we visualize as a two-dimensional integer grid of infinite extent, the standard setting for cellular automata. Within this grid, environmental conditions are defined in a local way, i.e., conditions exist in regions of the space and vary with time and/or the action of the living systems in the region. Living systems are modeled as abstract genomes which reside at some location in the grid and interact with the conditions present in some symmetric region surrounding them. Each genome is capable of: 1) detecting the conditions present in its immediate external environment; 2) detecting the conditions present in its internal environment; 3) reacting to sets of external and internal states, either by establishing some internal condition, or by initiating a process capable of interacting with, and possibly altering, the conditions present in its immediate environment. Also associated with each genome in our model is a number, denoting its strength. A genome's strength is initially assigned by the environment and is adjusted at periodic intervals to reflect the fitness of the genome with respect to it. It is intended as an analog to the natural system, in that it provides the basis for selection.
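The genome abstraction enumerated above (sense external conditions, sense internal conditions, react to combinations of the two) could be sketched as a small class. Every name here, and the rule-table representation of reactions, is hypothetical; the model only requires the three capabilities listed in the text plus a strength value.

```python
from dataclasses import dataclass, field

@dataclass
class Genome:
    """Minimal sketch of the abstract genome described in the text."""
    location: tuple            # position on the two-dimensional integer grid
    strength: float            # adjusted by the environment; basis for selection
    internal: set = field(default_factory=set)   # current internal conditions
    # Reaction rules: (external condition, internal condition) -> action.
    # An internal condition of None means "fires regardless of internal state".
    rules: dict = field(default_factory=dict)

    def react(self, external_conditions):
        """Return the actions of every rule whose (external, internal)
        pair is satisfied by the given external conditions and the
        genome's current internal state."""
        actions = []
        for (ext, internal), action in self.rules.items():
            if ext in external_conditions and (
                    internal is None or internal in self.internal):
                actions.append(action)
        return actions
```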
When a genome's strength falls below a certain threshold, it "dies" and is removed from the grid, but whenever its strength climbs above a somewhat higher threshold, it is able to produce an offspring. In this way, the number of offspring allocated to individuals is biased by performance. The initial strength of an offspring comes from its parents, depleting their strengths and thus restricting the frequency of individual reproduction. The principal genetic operators in the model are crossover and replication. (Augmenting these are idealized mutation and inversion operators. To simplify the presentation, their action is not discussed here; we utilize them in the standard fashion [1,4,5].) The environmental grid mediates functional interaction between genomes. The potential for functional interaction exists wherever the processes initiated by one genome can, through the medium of a common environment, establish conditions which affect, either positively or negatively, the strength of another genome. In our model, genomes which are proximate to one another on the grid experience similar environmental conditions and can simultaneously affect their mutual environment. Since a genome's fitness is a function of its performance within its immediate environment, the fitnesses of genomes which share a common region of the grid are linked. The grid also performs an important role in mediating reproductive interactions between genomes. In order to more accurately model the natural system, mate selection and offspring placement are spatially biased. The probability of an offspring emerging from a crossover between two parent genomes is inversely related to the distance between them. Similarly, when an offspring is placed in the grid, the probability that the offspring is found a distance x from its parents decreases inversely with the magnitude of x.
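The spatially biased mate selection could be sketched as follows. The exact 1/(1 + d) weighting is an assumption, since the text states only that the crossover probability is inversely related to the distance between the parents.

```python
import math
import random

def pick_mate(genome_pos, candidates, rng):
    """Sketch of spatially biased mate selection on the integer grid.

    `candidates` is a list of (position, genome) pairs.  Each candidate
    is weighted by 1 / (1 + d), where d is its Euclidean distance from
    `genome_pos`, so nearer genomes are chosen more often; the specific
    weighting function is an illustrative assumption.
    """
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    weights = [1.0 / (1.0 + dist(genome_pos, pos)) for pos, _ in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Offspring placement can be biased the same way: sample a displacement whose probability decreases inversely with its magnitude and add it to the parents' position.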
These spatial dependencies appear to us to be important factors in promoting the formation of composite genomes, as we discuss below.

Structural and Spatial Coherence

In distributed populations, two kinds of groups emerge due to the action of the genetic operators and regional variations in environmental conditions. These groups derive their coherence from different properties and each serves a different function. The first of these are structurally coherent groups, or genotypes. The members of a genotype can be said to be structurally coherent in the sense that the structures of their programs, and hence the behaviors these programs produce, are similar. In sufficiently large, unevenly distributed populations, we expect a number of diverse genotypes to emerge due to functional specialization. If the individual genomes in an unevenly distributed population are computationally limited, in the sense that they are incapable of responding, simultaneously, to all the demands and regularities present in their environment, then functional specialization will occur [6]. Spatially distributed genomes, under selection pressure and the action of genetic operators, will begin to pursue different survival strategies and, after a time, will be separable into distinct genotypes according to their patterns of behavior and the structure of the programs which encode for this behavior. As time goes on, the behaviors encoded in these specialized genotypes will become mutually exclusive, in the sense that a given genome, due to its limited computational capabilities, will be unable to adequately perform more than one of the behaviors at a time.

The second kind of grouping arises as a consequence of the spatial dependencies built into our model. Under our formulation, a collection of genomes which occupies a particular region forms a spatially coherent group, or sub-population. These spatial groups hold a special place in distributed systems.
Not only do the individuals within them share a common context, but since mate selection and offspring placement are spatially biased, reproductive activity tends to be concentrated within them. If the population density is sufficiently low, individuals will tend to cluster in these spatially coherent groups, particularly if certain regions of the environment are more "hospitable" than others. The individuals within such "spatial niches" [7] will form isolated sub-populations whose members interact both functionally and reproductively, and tend to place their offspring back in the niche with high probability. Since the members of these groups interact functionally, if the niche contains genomes from different genotypes, the potential for symbiosis between individuals from these different types exists and is selected for.

A Reproductive Continuum

Crossovers of parents within and between the groups described above produce offspring of different characters, each of which performs a different function in the distributed system. The reproductive operations induced by the crossover operator take on the character of a continuum, producing different kinds of offspring depending on the degree of structural similarity between the crossing parents. This continuum is depicted below, linking two naturally occurring operations, replication and recombination, with a third operator which we refer to as hybridization.

[Figure: the reproductive continuum, from replication through recombination to hybridization]

Admittedly, it may seem odd at first to assert that any kind of continuum exists between replication and recombination, since so many differences exist between them. Replication, the process by which a single composite genome produces a macro-organism composed of millions of differentiated cells, seems to have little to do with recombination, the process which produces new individuals by "mixing" two parent genomes.
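The distance bias in mate selection can be sketched as follows (the 1/(1+d) decay law is our own choice; the paper only states that the probability is inversely related to distance):

```python
import math
import random

def mate_weights(loc, candidates):
    """Weight each candidate mate inversely by its distance from loc."""
    return [1.0 / (1.0 + math.dist(loc, c)) for c in candidates]

def choose_mate(loc, candidates, rng):
    """Pick a mate with probability proportional to the inverse-distance
    weights, so nearby genomes are preferred and niches stay cohesive."""
    return rng.choices(candidates, weights=mate_weights(loc, candidates), k=1)[0]

rng = random.Random(0)
candidates = [(1, 0), (10, 0)]
weights = mate_weights((0, 0), candidates)
print(weights[0] > weights[1])  # True: the nearer candidate is preferred
```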
They occur at different levels in the biological hierarchy, and the physical processes which implement them are quite distinct. Nevertheless, we argue that, from an information-processing perspective, replication and recombination can be considered as points on a logical continuum of reproductive operations by which new genomes can be produced from existing ones. While only certain regions of this continuum are realized in the natural genetic system, we have incorporated all of them in our model.

Replication lies at one extreme of this hypothetical continuum, occurring in our model either from the action of the replication operator or as a crossover between identical parents. The resulting offspring is an exact replica of the parental genome, but, as with all genomes in our model, its actions and responses are dictated by the state of its internal and external environment. Replication is most interesting when the parental genome is composite. When this is the case, the sections of code which become active in the offspring may differ from those which are active in the parent, due to differences in their internal and external environments. For example, a composite genome containing code for two distinct processes, each one sensitive to a distinct set of internal and external cues, can produce two differentiated types of offspring. While each is an exact genetic replica of the parent, different components of the composite genome are active in each of the offspring types, due to differences in the internal and external environments in which they have been placed. As a result, the two types exhibit different behaviors.

When crossovers occur between genotypic variants, i.e., non-identical members of the same genotype, the offspring genome begins to look less like a replica of its parents and more like a mix of two similar, but distinct, parents. In place of replication, we get recombination.
Those familiar with genetic algorithms will recognize recombination as the standard image of crossover. Two parents, variants of the same basic genotype, cross to produce an offspring which shares sections of both parents' genetic code and exhibits behaviors characteristic of both parents, as well as new behaviors arising from epistatic interactions between the spliced sections. Recombination has been studied extensively by a number of genetic algorithm researchers [1,4,5], and has been shown, under certain conditions, to generate individuals of increasing fitness through an intrinsically parallel search of potential genospace.

As the parents become less and less similar, and the behaviors they encode for increasingly specialized, crossovers in our model tend toward hybridizations. We define hybridization as a crossover between genomes from radically different genotypes, each of which codes for a separate process activated by a distinct set of internal and external cues, which produces a new, composite genome that contains the code for both processes. It is the primary source of composite genomes in our model. In hybridization, two parent genomes, each of which codes for a different, specialized process, cross at a non-interfering point to form a composite offspring capable of exhibiting either behavior, depending on the external and internal conditions it encounters.

Composite Genomes

We see evolution in the distributed genetic system as an interaction of the operations described above. As new individual strategies emerge, recombinations between genomes which follow similar strategies produce increasingly fit individuals. When the individuals are computationally limited and spatially distributed, we expect a number of mutually exclusive strategies to emerge. If migrations occur in the space, we expect the formation of spatially coherent groups that contain genomes from more than one of these specialized genotypes.
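In programmatic terms, the hybridization operation defined above can be pictured as carrying two specialized code sections intact in one genome, with the environment selecting which one is expressed (a toy illustration; the cue/process representation is ours, not the paper's):

```python
def make_genome(processes):
    """A genome as a mapping from internal/external cues to the
    specialized process each cue activates."""
    def behave(cue):
        return processes.get(cue, "no-op")
    return behave

# two specialized parent genotypes, each coding for one process
parent_a = {"food-near": "forage"}
parent_b = {"cold": "burrow"}

# hybridization at a non-interfering point: the composite offspring
# carries both code sections and can exhibit either behavior,
# depending on the conditions it encounters
hybrid = make_genome({**parent_a, **parent_b})

print(hybrid("food-near"))  # forage
print(hybrid("cold"))       # burrow
```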
The key to the development of successful composite genomes lies in hybridizations that occur within these structurally diverse sub-populations, between symbiotically related parents. When a spatially coherent group, composed of genomes from more than one genotype, each of which exhibits a different specialized behavior, persists over time, spatial biases in mate selection and offspring placement increase the frequency of hybrid offspring generated by parents from different genotypes within the spatial niche. Those offspring which successfully hybridize mutually beneficial behaviors will gain a selective advantage. We suggest that hybridizations which occur between the genomes in spatially coherent groups whose members are symbiotically related give rise to composite genotypes which make structurally explicit the implicit, spatial coherence that exists between the members of such groups. Replications of these composite types produce a set of independent computational units which act as an implicitly coordinated, distributed system.

Preliminary verification of our hypothesis that successful composite genomes can form from the spatially-biased interaction of computationally limited adaptive units was provided by an experiment with the authors' software system, Asgard. Asgard is designed to simulate the behavior and adaptation of artificial animals operating under a genetic algorithm; similar systems have also been developed by Holland and Reitman [8], Booker [9], and Wilson [10]. Unlike these systems, however, the main objective of Asgard was the study of the evolution of a distributed population of interacting genomes, rather than the evolution of functionally isolated individuals. In the next section we describe the organization of the system and review the results of the most interesting simulation.

ASGARD

Asgard's environment is a finite, toroidal, two-dimensional grid, with a resolution of 160 x 60, which contains "food" at various locations.
It is divided into 4 equal quadrants, each of which can be displayed on a Tektronix 4105 graphics terminal. The display style is similar to Wilson's [10]. The grid is "home" to a population of genomes which move about the grid in search of food. Each time step, the genomes in the grid expend one unit of strength in order to stay "alive", and each unit of food they consume increases their strength by one unit. In order for a genome to survive, then, its average food consumption must be at least one unit per time step; in order to reproduce, its average consumption must be somewhat greater than 1.

The behavior of each genome is controlled by its individual program. These programs are lists of labeled instructions of two types, Move and Food?, which specify either a movement direction or a test and transfer of control. Order of execution is controlled by a program counter that specifies which instruction in the list is to be performed next. Move instructions take one argument, which specifies one of eight directions of movement (N, E, SW, etc.). When executed, these instructions move the genome one unit in the specified direction. Food? instructions transfer control by testing the eight locations surrounding the genome for the presence of food. They take two arguments, both labels, which specify the statement to which the program counter should point given that food is or is not present in any of the surrounding locations. A short example is given below (the accompanying movement diagram is not reproduced here):

    Start:  1  Move S
               Food? 2,1
            2  Move E
               Food? 2,3
            3  Move W
               Food? 2,3

In terms of our model, each genome's immediate external environment is defined solely by the concentrations of food in its neighborhood. A genome can modify its external environment in two ways, either by consuming food or by moving to a different location in the grid.
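A minimal interpreter for this two-instruction language might look like the following (the program encoding, label scheme, and the simplified example program are our own assumptions, not the paper's exact listing):

```python
# The eight movement directions map to unit steps on the grid.
DIRS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0),
        "NE": (1, 1), "NW": (-1, 1), "SE": (1, -1), "SW": (-1, -1)}

def neighbors(pos):
    x, y = pos
    return [(x + dx, y + dy) for dx, dy in DIRS.values()]

def step(program, pc, pos, food):
    """Execute one instruction. Move advances the genome one unit and
    the program counter by one; Food? tests the eight surrounding
    cells and branches to its 'yes' or 'no' label."""
    op = program[pc]
    if op[0] == "Move":
        dx, dy = DIRS[op[1]]
        return pc + 1, (pos[0] + dx, pos[1] + dy)
    yes_label, no_label = op[1], op[2]
    found = any(n in food for n in neighbors(pos))
    return (yes_label if found else no_label), pos

# a toy program: move south, then shuttle east while testing for food
program = {1: ("Move", "S"), 2: ("Food?", 2, 3),
           3: ("Move", "E"), 4: ("Food?", 2, 1)}

pc, pos = step(program, 1, (0, 0), food=set())
print(pc, pos)  # 2 (0, -1)
```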
A genome's internal state is defined by the position of its program counter. By executing Food? instructions, the genome can detect the state of its external environment and, based on this information, can change its internal state, potentially altering its future external response pattern.

The evolution of the population is driven by an algorithm from the class of reproductive plans [1]. The algorithm starts with a population of genomes whose programs are formed at random from the set of legal instructions. Food is placed into the environment according to a pattern which may depend on time as well as the behavior of genomes in regions of the torus. The objective is to produce a population which responds to regularities in this pattern. The algorithm produces a sequence of populations whose members interact with each other through common local environments and are allocated offspring on the basis of their individual consumption. Over time, the successive populations will contain individuals adapted not only to the conditions of the artificial environment, but also to the behavior patterns of the other members of the population. Let A(t) denote the population at time t and let a be any member of A(t).
One iteration of the algorithm consists of the steps below:

    cobegin { for all a in A(t) }:
        determine Loc(a) based on a's program and update program counter [allows a one Move];
        consume food at Loc(a);
    coend
    cobegin { for all a in A(t) such that Str(a) > T_r }:
        replicate a to create a';
        begin { with probability P_c }:
            choose m at random from the set A(t) - {a}, biasing choice by distance |Loc(a) - Loc(m)|;
            crossover a' and m;
            replace a' with one of the results of the crossover;
        end
        choose Loc(a') at random, biasing the choice by the distance |Loc(a') - Loc(a)|;
    coend
    cobegin { for all a in A(t) such that Str(a) < 1 }:
        delete a from A(t);
    coend
    increment t;

where:
    P_c = probability of crossover;
    T_r = reproduction threshold;
    Str(a) = strength of genome a at time t;
    Food(Loc(a)) = number of food units present at a's current grid location.

The reader will note that Asgard's algorithm differs in some respects from more standard implementations (see [1,4,5]). For example, the population size in Asgard is variable, governed only by the carrying capacity of the environment. Also, mate selection and offspring placement are biased by distance, as our model specifies. These modifications were undertaken in order to make Asgard more closely resemble a community and to allow for the formation of semi-isolated sub-populations. Within these sub-populations, trials are allocated to individuals and schemata in proportion to their observed fitnesses, insuring that Holland's intrinsic parallelism theorems hold [1].

The Task

The task set before Asgard's population is to identify and exploit the regularities present in the pattern of food placement. Although we performed experiments with a number of patterns of various complexities, we discuss only one here, the so-called "seasonal" pattern. Under this pattern, there are three basic regularities to which the population can adapt.

1) Food can appear in only 1/8 of the space, concentrated in eight evenly-spaced fertile regions.
Each genome must periodically visit one of these fertile regions in order to survive.

2) The productive capacity of each fertile region oscillates periodically between a maximum and a minimum value (hence the term "seasonal"). Each fertile area can support many more genomes during its "summer" than its "winter", so competition for food within a particular area becomes increasingly fierce as winter approaches.

3) The amount of food actually produced during a given time step by a fertile region is linked to the amount of food consumed in the region during the previous time step. At any given time, an optimal consumption level exists for a region, and the genomes in that region can over- or under-consume with respect to this level, as a group, during a given time step, resulting in decreased food production in the next time step.

In each of the four quadrants of the grid, two 10x15 areas are established as fertile (see Figure 1a). For convenience, these areas are called farms. They are the only areas in the grid where food can be found; the rest of the grid is desert-like. Each time step, Ki(t) units of food are produced by the i'th farm and distributed randomly within it (i = 1,...,8). Ki(t) is itself a function of two other quantities: Mi(t), the seasonal productive potential of farm i, and Ei(t), the consumption efficiency within farm i, a value derived from the total consumption within farm i during the previous period, Ci(t-1). The exact form of these relationships is:

D(t1, t3) + D(t2, t4) < D(t1, t2) + D(t3, t4)   (D stands for Euclidean distance)

If this is the case, the tour is replaced by removing the edges (t1, t2) and (t3, t4) and replacing them with the edges (t1, t3) and (t2, t4) (see Figure 2). One way of parallelising a probabilistic sequential search algorithm is to split the problem into n subproblems and let each processor work on one subproblem. Such a division is,
in general, not possible. For instance, in a TSP, it is unlikely that we could make different processors attempt edge interchanges simultaneously on the same tour and hope to obtain a legal tour.

Figure 1: Tour with edges (t1, t2) and (t3, t4).
Figure 2: Tour with edges (t1, t3) and (t2, t4).

In other words, to achieve such parallelisation one has to add conflict resolution techniques, usually resulting in a degradation of the performance of the algorithm. Another way of parallelising is to let all n processors run independently and take the best available solution at the end. We will call this the independent strategy. A potential problem with this strategy is that, as the processors run independently, some of them may get caught in a local minimum or may search sub-optimal regions of the search space, wasting valuable resource power. Intuitively, it seems likely that we might do better if we let the processors work independently for some time, then exchange information about "good" candidate solutions, again work for a while, exchange new information, and so on. We call such a method an interdependent strategy, and call the time of processing in between two information exchanges a generation.
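The edge-interchange test just described can be sketched as a single probabilistic 2-opt trial (the tour representation and the random choice of edges are our assumptions):

```python
import math
import random

def tour_length(tour, pts):
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt_trial(tour, pts, rng):
    """One trial: pick two edges (t1,t2) and (t3,t4) at random and
    reverse the segment between them if replacing those edges by
    (t1,t3) and (t2,t4) shortens the tour. Returns True on success."""
    n = len(tour)
    i, j = sorted(rng.sample(range(n), 2))
    if j - i < 2 or (i == 0 and j == n - 1):
        return False                      # the two edges share a city: no-op
    a, b = pts[tour[i]], pts[tour[i + 1]]
    c, d = pts[tour[j]], pts[tour[(j + 1) % n]]
    if math.dist(a, c) + math.dist(b, d) < math.dist(a, b) + math.dist(c, d):
        tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
        return True
    return False

rng = random.Random(0)
pts = [(x, y) for x in range(5) for y in range(5)]   # small 5x5 lattice
tour = list(range(25))
rng.shuffle(tour)
before = tour_length(tour, pts)
for _ in range(5000):
    two_opt_trial(tour, pts, rng)
after = tour_length(tour, pts)
print(after <= before)  # True: only improving interchanges are accepted
```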
Clearly, there are many ways of exchanging information about good candidate solutions. A straightforward strategy is to overwrite after each generation a certain number of "bad" candidate solutions by good candidate solutions. More sophisticated strategies could involve exchanging structural parts of good candidate solutions. Examples of such strategies are the crossover operators found in Genetic Algorithms (GA) [3].

In the rest of this paper, we give evidence that interdependent strategies are usually better than the independent strategy when parallelising probabilistic sequential search algorithms which use local improvement operators. In fact, we suggest that a good technique of parallelisation is to use an interdependent strategy where information exchange is done on a fairly regular basis. In Section 2, we illustrate this approach by parallelising a simple problem, the Classical Occupancy Problem [4]. Section 3 covers the parallelisation of a more complicated search algorithm: the 2-opt strategy of Lin and Kernighan for the TSP mentioned above. In Section 4, we describe experiments with a genetic algorithm for the TSP. We show that genetic algorithms can be viewed as parallel search algorithms that implement an interesting kind of interdependent strategy to achieve good, robust performance. The standard selection procedure of the GA can be viewed as a mechanism for achieving information exchange, and the local improvement operator can be viewed as a recombination operator of the GA. Finally, Section 5 offers a discussion of the ideas and results presented in this paper.

2.
A Toy Problem: The Classical Occupancy Problem

Consider the classical occupancy problem: given a structure of N empty cells, shoot points randomly at the cells (with a probability of 1/N of hitting any given cell) until all cells are filled. The time for solving this problem is the number of shots required to fill all N cells. This problem has been studied extensively by Kolchin et al. [5]. We present here some experimental results which illustrate the advantages of an interdependent strategy over the independent strategy.

The sequential algorithm involves starting with the initial structure of N empty cells and repeatedly generating a random number between 1 and N; this is called a trial. If the cell was empty before, it is now assumed to be full, and if not, the trial has been unsuccessful. In the independent case, this sequential algorithm will run separately on all n processors; the time required to solve the problem is then the number of shots required by the processor that finished first. In the interdependent case the algorithm would look as follows. (Note that if a generation involves infinitely many trials, i.e., attempts at local improvements, the interdependent strategy reduces to the independent strategy.)

    assign to each processor the empty structure;
    full ← 0; generation ← 0;
    while full < N do begin
        generation ← generation + 1;
        each processor generates a number between 1 and N;
        if (at least one processor is successful) then begin
            randomly choose one among the successful processors;
            distribute its structure to all other processors;
            full ← full + 1;
        end;
    end;

We now consider results of simulation on the classical occupancy problem with N = 100. First we fixed the number of processors and varied the number of trials per generation (tpg).
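The interdependent variant is easy to simulate. The sketch below follows the pseudocode, generalized to tpg trials per processor per generation; it redistributes the fullest local structure each generation (the paper redistributes one randomly chosen successful structure), and the helper names are ours:

```python
import random

def occupancy_interdependent(n_cells, n_procs, tpg=1, seed=0):
    """Count generations until all n_cells are full when n_procs
    processors shoot in parallel and the best structure found in a
    generation is copied to every processor before the next one."""
    rng = random.Random(seed)
    shared = set()                       # the structure all processors hold
    generations = 0
    while len(shared) < n_cells:
        generations += 1
        local = [set(shared) for _ in range(n_procs)]
        for s in local:                  # each processor works independently
            for _ in range(tpg):
                s.add(rng.randrange(n_cells))
        shared = max(local, key=len)     # redistribute the best structure
    return generations

g = occupancy_interdependent(100, 10, tpg=1, seed=0)
print(g >= 100)  # True: with tpg = 1, at most one new cell is locked in per generation
```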
As indicated by the above algorithm, after each generation the best structure was redistributed (copied) to all the processors. We found that tpg = 1 resulted in the minimum number of generations. It should be noted, however, that a higher number of trials per generation also did well compared to the independent case. For N = 100 and n = 10 processors, the independent case required on average (taken over 500 experiments) 370 generations until completion. (Note that 100 is optimal.) In the interdependent case, for tpg = 1 the processors required on average 128 generations, for tpg = 4 they required 165 generations, and for tpg = 7 they required 190 generations.

Next we fixed the trials per generation and varied the number of processors. As the number of processors increased, the generations required in the independent case decreased, but at a slower rate than the generations required in the interdependent case. For example, with 1 trial per generation and N = 100, it took 3 processors on average 221 generations to finish in the interdependent case and on average 427 generations in the independent case. With 5 processors the interdependent strategy required 166 generations while the independent case required 397 generations. With 10 processors the generations required were 128 and 370 respectively. (As the number of processors tends to infinity, both strategies will require N generations.) Our experiments suggest that doing fewer trials per generation and increasing the number of processors is better.

But this was a simple problem. In this problem, we know the (optimal) solution, and among processors that have a successful trial, there is no one best structure, since they all have the same number of empty cells. That is, they are all equally good. For this reason, we decided not to try a strategy of taking the k best candidate solutions with k > 1, although a slightly different performance may be expected. We now
consider the parallelisation of a more complex problem and compare the performance of various interdependent strategies with each other and with the independent strategy.

3. Experiments with a Search Algorithm for the Traveling Salesman Problem

The domain of this experiment is the Traveling Salesman Problem (TSP). We use the operator of Lin and Kernighan (the r-opt strategy), described earlier, as an example of a probabilistic operator that makes small local changes to produce new structures from old ones. The larger the value of r, the more likely it is that the final solution (when no more exchanges are possible) is optimal.† However, r is usually chosen to be 2 because the number of possible edge interchanges is of the order of (n choose r) x r!. In each generation, each processor performs a certain number of applications of the 2-opt strategy on the structure it currently holds in its local memory, and after each generation the best structure was redistributed (copied) to all the processors. The domain for the experiments described here is a lattice of 100 points spread over (0,0), (0,9), (9,0), (9,9), as shown in Figure 3.

Figure 3: Lattice of 100 cities.

Clearly, the optimal tour length is 100.‡ We will consider the total effort, given by the number of generations times the number of trials per generation, required to get to within 10% of optimal performance, as the number of trials per generation (tpg) is varied. Notice that we do not take into account the overhead involved in redistributing the structures after each generation. The algorithm is as follows:

    for various values of tpg do begin
        generation ← 0;
        generate a structure randomly;
        copy it to all n processors;
        while (bestperformance > 1.10 * optimumvalue) do begin
            generation ← generation + 1;
            each processor attempts tpg local improvements;
            find the structure with the best performance;
            distribute this structure to all other processors;
        end;
    end;

† It should be noted that Lin and Kernighan propose a more powerful strategy where r is varied
dynamically, but the simple 2-opt strategy also gives good results and is quite efficient [2].

‡ It should be noted that similar results can be obtained for other TSPs.

The actual algorithm also keeps track of the number of successful local improvements (i.e., applications of the 2-opt operator that make the tour length smaller) and the number of experiments that got aborted (i.e., those experiments wherein the performance did not get below 10% of optimal performance even after many generations). Figure 4 shows the graph for 10 processors. Similar graphs were obtained for different numbers of processors. This shows that if we have fewer trials per generation, then the total effort required to get a relatively good performance is less. (As mentioned before, when tpg tends to infinity we have the independent strategy, and that clearly requires a lot more effort.) But not all trials involve successful local improvements. Those that do not perform any local improvement will take very little processor time because the tour does not need to be rearranged. Therefore, to get an idea of the total time required, we plotted in Figure 5 the number of successful local improvements done on average. Again, we see that doing fewer trials per generation reduces the total time required.

There is a danger in doing too few trials per generation, however. As Table I shows, out of 50 experiments for each value of tpg, some get aborted for low values of tpg. This happens when the algorithm gets caught in a local optimum, which occurs because the algorithm described above has no means of maintaining diversity of structures for low values of tpg. Note that for higher values of tpg, that is, strategies closer to the independent strategy, no experiments get aborted, showing the robustness of the operator and the algorithm being used.

To avoid getting caught in a local optimum, we decided to change the interdependent strategy slightly to increase the diversity of examined structures. Instead of taking the one best structure, we decided to experiment with the redistribution of the k best structures after each generation. (Again, as k tends to n we have the independent case.) In the experiments with 10 processors we found that with k = 4 only 1 or 2 experiments get aborted (out of 50) for lower values of tpg. We may conclude that increasing k (i.e., maintaining diversity) makes the algorithm more robust.

There is another price we pay for exchanging information too quickly. Every time we exchange information, a copy time is involved, and this copy time increases as the number of trials per generation decreases. But it will decrease as k increases (for a fixed value of tpg).† Thus, our experiments show that one obtains a good interdependent strategy by keeping tpg as low as possible, to decrease the number of total trials, and by using a large enough k to keep the algorithm from getting trapped in a local minimum as well as to reduce copy time.

4. Experiments using a Genetic Algorithm

Genetic Algorithms (GA) [3], introduced by Holland, have been applied with good success to function optimisation problems involving complex functions [6], as well as to some combinatorial optimisation problems [7,8,9,10,11,12,13].
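The GA selection mechanism discussed in this section (expected copies proportional to performance relative to the population average) can be sketched as follows (a sketch; using 1/length as a tour's fitness is our assumption, since shorter tours are better):

```python
import random

def proportional_selection(lengths, rng):
    """Return indices of the structures that make up the next
    generation: each slot is filled with probability proportional to
    fitness, here taken as 1/length, so in expectation a structure
    receives performance / average-performance copies."""
    fitnesses = [1.0 / L for L in lengths]
    return rng.choices(range(len(lengths)), weights=fitnesses, k=len(lengths))

rng = random.Random(0)
lengths = [100.0, 110.0, 200.0]   # tour lengths of a 3-structure population
counts = [0, 0, 0]
for _ in range(1000):
    for i in proportional_selection(lengths, rng):
        counts[i] += 1
print(counts[0] > counts[2])  # True: the shortest tour is copied most often
```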
† It should be noted that on a parallel machine the copy time can be reduced.

A typical GA maintains a population of structures. After each generation, which consists of applying the local improvement operator a certain number of times (trials) to each structure, it uses a selection mechanism to produce a new population of structures for the next generation. (Thus, a trial involves a probabilistic operation that makes a small local change to the structure.) The selection mechanism assigns a strength-of-performance measure to each structure. This measure is typically the ratio of the performance of the structure to the average performance of the population. The number of occurrences of this structure in the next generation is proportional to this performance measure.†

A genetic algorithm can be viewed as a parallel search algorithm of the type we have been discussing in the previous sections. Whereas a sequential algorithm (using the local improvement operator) operates on one structure, a GA operates on a population of n structures. If one imagines that each structure is worked on by a separate processor, the selection mechanism can now be viewed as an interdependent strategy: the n processors run separately for a while (a generation) and then exchange information (about performance), resulting in processors getting a (possibly) different structure for the next generation. This strategy is a more sophisticated version of our earlier idea which consisted of taking the k best structures from the current population.

We used a version of the genetic algorithm (GA) that uses the 2-opt local improvement operator to solve the TSP. We ran the GA for 66,000 trials, varying tpg. The population size, or alternatively the number of processors, is 50. On the lattice of 100 points (optimum length is 100), the GA did not do well for small values of tpg (5 and 10) but did almost equally well on higher values of tpg. Table II shows the effort required for the GA to get to within 10% of the
optimal performance. It should be noted that these values are close to the values obtained in our earlier interdependent strategy of taking the one best (Table I). Next we chose the domain to be 100 points uniformly distributed over (0,0), (0,1), (1,0), (1,1). Figure 6 shows the performance versus tpg curve. The best performance is seen at about tpg = 500, and it worsens a little after that. Therefore, using the selection mechanism of a GA as a strategy of interdependency, and keeping tpg relatively small, yields a fast and robust algorithm with good performance.

5. Conclusion

We have given evidence that probabilistic sequential search algorithms which operate by performing a local change on a structure to generate a new structure can be parallelised by doing a reasonably large number of local improvements for each structure per generation and then exchanging information about "good" structures. Doing too few trials per generation, however, may not yield good performance values, as premature convergence may occur. On the other hand, we also showed that taking the one best structure and redistributing it over the other processors is too simplistic a strategy, since it also causes premature convergence; hence a few best should be selected for redistribution. In fact, it turned out that the more sophisticated interdependent strategy, the selection procedure of a genetic algorithm, resulted in the most robust strategy.

† Most GAs also use a crossover operator that takes two structures and interchanges parts of them to produce two new structures. In this paper we ignore such operators.
It should be noted that we have ignored copy time in the presentation of the results. We believe, however, that even if we take this overhead into account, similar results concerning the independent versus interdependent strategy may be obtained.

Acknowledgements

We wish to thank the referees for helpful comments which helped in clarifying some of the ideas and results of the paper.

References

1. S. Lin and B. W. Kernighan, "An Effective Heuristic Algorithm for the Traveling Salesman Problem", Operations Research, pp. 498-516 (1973).
2. E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan and D. B. Shmoys (Eds.), The Traveling Salesman Problem, John Wiley and Sons Ltd. (1985).
3. J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
4. N. L. Johnson, S. Kotz, Urn Models and their Applications, Wiley and Sons, 1977.
5. V. F. Kolchin, B. A. Sevastyanov, V. P. Chistyakov, Random Allocations, V. H. Winston and Sons, 1978.
6. K. A. DeJong, "Adaptive system design: a genetic approach", IEEE Trans. on Systems, Man and Cybernetics, Vol. SMC-10(9), pp. 556-574 (Sept. 1980).
7. J. J. Grefenstette, R. Gopal, B. J. Rosmaita and D. Van Gucht, "Genetic Algorithms for the Traveling Salesman Problem", Proc. of an Intl. Conf. on Genetic Algorithms and Their Applications, pp. 160-168 (July 1985).
8. J. J. Grefenstette, "Incorporating Problem Specific Knowledge into Genetic Algorithms", to appear.
9. J. Y. Suh, D. Van Gucht, "Incorporating Heuristic Information into Genetic Search", Technical Report, Indiana University, February 1987.
10. L. Davis, "Job shop scheduling with genetic algorithms", Proc. of an Intl. Conf. on Genetic Algorithms and Their Applications, pp. 136-140 (July 1985).
11. M. P. Fourman, "Compaction of symbolic layout using genetic algorithms", Proc. of an Intl. Conf. on Genetic Algorithms and Their Applications, pp. 141-153 (July 1985).
12. D. E. Goldberg and R. Lingle, "Alleles, loci, and the traveling salesman problem", Proc. of an Intl. Conf. on Genetic Algorithms and Their Applications, pp. 154-159 (July 1985).
13. D. Smith, "Bin packing with adaptive search", Proc. of an Intl. Conf. on Genetic Algorithms and Their Applications, pp. 202-206 (July 1985).

  tpg   total effort   success rate   total success   experiments aborted   variance
    5       1605          0.030           45.009               5              0.092
   10       1820          0.034           61.320               3              0.123
   20       2160          0.036           77.385               3              0.093
   40       2640          0.038           99.261               1              0.107
   80       3360          0.035          117.152               3              0.110
  160       4160          0.034          140.610               0              0.157
  320       5440          0.030          160.867               0              0.245
  640       6190          0.027          172.721               2              0.590
 1280       7680          0.022          170.622               2              0.922
 2560      10240          0.019          192.836               2              1.857
 5120      15360          0.015          222.767               0              2.114

tpg = trials per generation
total effort = tpg * (generations to get to within 10%)
success rate = fraction of trials successful
total success = total number of successful improvements done
experiments aborted are out of a total of 50 for each tpg

TABLE II

  tpg   total effort
    5       1500
   10       2058
   20       2812
   40       3760
   80       4480
  160       5600
  320       6080
  640       7040
 1280       7680
 2560       7680
 5120      10240
Figure 4: Effort for 10 processors (total effort versus trials per generation).

Figure 5: Success rate for 10 processors (success rate versus trials per generation).

Figure 6: Performance of 100 random cities (performance versus trials per generation).

PARALLEL GENETIC ALGORITHM FOR A HYPERCUBE

Reiko Tanese*

Department of Electrical Engineering and Computer Science
University of Michigan
Ann Arbor, MI 48109-2122 USA

Abstract

This paper discusses a parallel genetic algorithm for a medium-grained hypercube computer. Each processor runs the genetic algorithm on its own sub-population, periodically selecting the best individuals from the sub-population and sending copies of them to one of its neighboring processors. The performance of the parallel algorithm on a function maximization problem is compared to the performance of the serial version. The parallel algorithm achieves comparable results with near-linear speed-up. In addition, some experiments were performed to study the effects of varying the parameters for the parallel model.

1 Introduction

The genetic algorithm, which was designed by Holland [1] in the 1960s, is applicable to a wide range of problems, and its popularity is steadily increasing. In some applications, such as a simulation of natural phenomena, the algorithm requires many generations and a large number of individuals in the population. However, time limitations make it infeasible to run the algorithm with such a large population. To alleviate this, this research investigates a parallel version of the genetic algorithm on a hypercube computer. The results show that the parallel version is significantly faster than the serial one, and that nearly linear speed-up can be achieved with as many as 64 processors.

The parallelization of the algorithm involves dividing the population and placing a sub-population on each processor.
A processor runs the genetic algorithm on its own sub-population, periodically selecting good individuals from the sub-population and sending copies of them to one of its neighboring processors. Earlier work was done by Grosso [2] studying sub-populations in the genetic algorithm on a serial computer. Grefenstette [7] has proposed several versions of parallel adaptive algorithms for different architectures.

* Partially supported by Digital Equipment Corporation.

The next section in this paper describes the hypercube architecture and the specific hypercube computer, the NCUBE, on which this research was performed. Section 3 describes the serial and parallel algorithms studied. The algorithm's performance is measured on a function maximization problem; the function to be maximized is defined in section 4. The experiments which were conducted and their analyses are contained in section 5.

2 The Hypercube Computer

The parallel genetic algorithm presented in this paper is designed to take advantage of a parallel architecture, a medium-grained hypercube. An n-dimensional hypercube computer consists of N = 2^n processors interconnected as an n-dimensional binary cube (Figure 1). Each processor is a node of the cube and has its own CPU and local memory. The communication among the processors is done via message passing. Each processor is directly connected to n other processors (its neighbors), which makes the distance between any two processors in the cube at most n communication links long. This means that information can be broadcast through the cube in n (or lg N) time steps, compared to O(N^{1/2}) in a 2-dimensional grid, another popular parallel architecture.

The specific machine used to run all the experiments in this paper is a general purpose 64-processor NCUBE/six hypercube made by NCUBE Corporation [3]. Each processor in the NCUBE/six is a powerful 32-bit custom VAX-like CPU chip with 128K bytes (soon to be 512K) of local memory.
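The hypercube addressing just described can be captured in a few lines (a sketch of the connectivity, not part of the NCUBE software):

```python
def neighbors(pid: int, n: int) -> list[int]:
    """The n processors adjacent to pid in an n-dimensional hypercube:
    flip each bit of the n-bit processor address in turn."""
    return [pid ^ (1 << d) for d in range(n)]

def distance(a: int, b: int) -> int:
    """Number of communication links between processors a and b: the
    count of bit positions in which the two addresses differ (so the
    maximum distance in the cube is n)."""
    return bin(a ^ b).count("1")
```

For example, in a 3-dimensional cube processor 0 is adjacent to processors 1, 2 and 4, and no two processors are more than 3 links apart.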
Although a quarter of this memory is needed to store the node operating system and message buffers, it is still large enough to categorize the NCUBE/six as a medium-grained hypercube computer. An 80286 host processor is used to provide the connections between the node processors and the external world.

Figure 1: Hypercubes for n = 0, 1, 2, and 3.

3 The Genetic Algorithm

3.1 The Serial Algorithm

The serial genetic algorithm (Figure 2) is a straightforward implementation of a standard algorithm with two genetic operators: crossover and mutation. Two parameters, crossover_rate and mutation_rate, determine the frequency of application of these operators. Crossover_rate is the average number of crossovers per mating. The actual number of crossovers for any particular pair of individuals is a Poisson-distributed random variable with a mean of crossover_rate. That is, if crossover_rate is 1, any pair will undergo the crossover operation on the average once. Similarly, mutation_rate is the average number (Poisson-distributed) of mutations per individual. Notice that this differs from the traditional interpretation, in which the mutation rate represents the probability of mutation per allele. The method used here is functionally equivalent to the traditional method using mutation_rate / the length of a chromosome. The new method is simply more efficient, especially when there are few mutations per chromosome.

The algorithm keeps track of the best individual in the population and carries it over to the next generation. This feature, which was first used by DeJong [5], is especially helpful when solving function optimization problems, in which the environment remains unchanged over time.

Randomly generate initial population of pop_size.
For gen = 1 to num_gens do
    Compute the fitness of each individual in the population.
    Determine the number of offspring which each individual is going to have, based on the individual's fitness.
    For i = 1 to (pop_size / 2) do
        Pick 2 parents randomly without replacement.
        Crossover the parents based on crossover_rate to produce 2 new offspring.
        Mutate each offspring based on mutation_rate.
    endfor
    If the best fit individual from the old population is not in the new population, replace one of the individuals in the new population with the best individual from the old population.
endfor

Figure 2: The serial genetic algorithm.

3.2 The Parallel Algorithm

In the parallel implementation (Figure 3), each processor runs the genetic algorithm on its own sub-population, periodically selecting good individuals from its sub-population and sending copies of them to one of its neighbors. It will also receive copies of this neighbor's good individuals, with which it will replace bad individuals in its own sub-population. The neighbor with which this exchange takes place will vary over time: each exchange will take place along a different dimension of the hypercube. The frequency of exchange and the number of individuals exchanged are two new adjustable parameters.

For example, suppose there are four processors, each containing a sub-population of size 50. Suppose they are to send 10% of their good individuals to their neighbors at every 10th generation. Suppose that the processors have their identification numbers assigned as shown in Figure 1. At generation 10, processors 0 and 1 will send 5 of their good individuals to each other, and so do processors 2 and 3. At generation 20, processors 0 and 2 will exchange individuals, and so do processors 1 and 3.

This exchange of individuals is designed to remedy the parallel algorithm's main difference from the serial one.
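The dimension-cycling pairing in the example above can be sketched as follows (an illustrative helper, not code from the paper):

```python
def exchange_partner(pid: int, exchange_number: int, n: int) -> int:
    """Partner of processor pid for the exchange_number-th exchange in
    an n-dimensional hypercube: successive exchanges flip successive
    address bits, so each exchange takes place along a different
    dimension of the cube. The relation is symmetric, so the two
    partners send copies to each other."""
    return pid ^ (1 << (exchange_number % n))
```

With four processors (n = 2) this reproduces the example: the first exchange pairs 0 with 1 and 2 with 3, the second pairs 0 with 2 and 1 with 3.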
In the serial algorithm, an individual can select its mate from any other individual in the population, while in the parallel version potential mates are all in the same sub-population, a much smaller pool.

Randomly generate initial sub-population of sub_pop_size.
For gen = 1 to num_gens do
    Calculate fitness as in the serial algorithm.
    If (gen mod frequency_of_exchange = 0) then
        Choose num_exchange good individuals from the pool of individuals whose fitnesses are at least equal to the average fitness of the sub-population.
        Send copies of the good individuals to the neighboring processor along one dimension of the hypercube.
        Receive copies of the neighbor's good individuals.
        Choose num_exchange bad individuals from the pool of individuals whose fitnesses are no greater than the average fitness of the sub-population.
        Replace these individuals with the newly received ones.
    endif
    Allocate offspring as in the serial algorithm.
    Reproduce as in the serial algorithm.
    Keep the best fit individual from this generation as in the serial algorithm.
endfor

Figure 3: The parallel genetic algorithm.

This restriction is overcome by allowing the better fit individuals to circulate among the sub-populations, in the hope that they might prove to be "useful".

Notice that the "good" individuals to send to a neighboring processor are chosen probabilistically from the individuals whose fitnesses are at least equal to the average fitness of the sub-population. The current implementation ensures that the individual with the highest fitness has the highest chance of being duplicated and sent to a neighboring processor. Similarly, the "bad" individuals are chosen probabilistically from the individuals whose fitnesses are no greater than the average fitness of the sub-population, giving the worst fit individual the highest chance of being replaced.

There is a limit, though, to how frequently exchanges should occur.
Notice that the exchange scheme produces duplicates of the fit individuals, which is equivalent to producing more offspring. This increased fertility can, in some cases, lead to premature convergence.

Another consideration in determining the exchange frequency is the problem domain. The function optimization domain is one in which the environment is unchanging; the function values are fixed. In other domains, in which the environment may be changing, it might be better not to exchange as frequently. This is due to the fact that as the environment changes, the sub-populations may undergo substantial re-adjustment. It would be best to perform exchanges only after the sub-populations have settled into quiescence, which could take a number of generations to achieve.

The parallel genetic algorithm in Figure 3 can easily be implemented on the NCUBE. The two main modifications required are: addition of an inter-processor communication routine for exchanging individuals, and addition of a host program which loads the node processors with the genetic algorithm program and establishes the I/O interface between the user and the node processors, since the node processors cannot perform direct I/O. The core of the genetic algorithm running on each processor is as in Figure 2.

4 Function Maximization

The problem of function maximization is used for a preliminary demonstration of the performance of the parallel algorithm. This is because in the function optimization problem there is a clear measurement of how well the system is performing at all times, whereas in other problems performance can be harder to analyze.

The first attempt used five functions studied by DeJong in his thesis [5]. However, the serial algorithm found the global maximum for all five functions within 200 generations, starting with a randomly generated population of 50.
There is little need to parallelize problems which are quickly solvable by serial computers, so a search was made for reasonable genetically-hard functions. Bethke [4] has shown in his thesis that Walsh functions can be used to construct genetically-hard functions. This was used to define a Walsh-like function W on binary strings of length 64. The objective of the algorithm is to maximize W. W is defined as a composition function:

    W(s) = F(Order(s))

where Order(s) = the number of ones in the string s (0 <= Order(s) <= 64), and F is defined so that W has the values shown in Figure 4.

Figure 4: Function W (W plotted against Order(s), 0 to 64).

The function W has the following characteristics:

- It has many local optima and a single global optimum.
- The domain of W is huge, namely 2^64.
- The function values for neighboring points are relatively close together. This helps keep a single fit individual from taking over the population easily, thus preventing the genetic algorithm from developing a fixation.
- The global optimum is not an isolated spike which can only be found by a random search.

5 Experiments and Results

5.1 Number of Processors

Table 1 contains the results of running the parallel algorithm using various numbers of processors. For each dimension, the parameters are set as follows: total population size = 400, crossover_rate = 0.6, mutation_rate = 0.5, and exchange_frequency = every 5 generations. Table 1a contains the results of runs in which 10% of the sub-population is exchanged. Table 1b contains the results of runs exchanging 20% of the sub-population. The runs for dimension 0, those with one processor, are equivalent to running the serial algorithm, in which no exchanges occur.

Sixty-four runs were made for each dimension. Each run keeps track of the earliest generation in which the global maximum is found. The column "average gens" in Table 1 is the average of such generations over 64 runs.
When a run does not find the global maximum within 1000 generations, as seen in dimension 0, the average generation is calculated as if that run reached the global maximum in generation 1000. This is indicated in the table by ">", meaning that it took at least that many generations on average to reach the global maximum. "#Runs reaching max" is the number of runs that reached the global maximum out of the 64 runs. The range column shows the earliest and the latest generations in which the global maximum is found. Since the genetic algorithm is a stochastic algorithm, this range can be quite large.

The total population size for dimensions 4, 5 and 6 is kept approximately 400. This is because the parallel program handles only even sub-population sizes in the current implementation. Notice that there is no data for dimension 6 in Table 1a. This is because 10% of six individuals is less than one, so no exchanges would have taken place.

The results in Table 1 show that the parallel algorithm finds the global maximum in roughly the same number of generations as the serial algorithm.

Table 2 shows the execution times for the runs in Table 1a. The times are for complete runs of 1000 generations, regardless of when the global maximum is found. The parallel algorithm introduces two extra steps over the serial one: choosing good and bad individuals, and exchanging these individuals. From empirical observations, these steps are negligible when the sub-populations are larger than 10; typically under 1% for selection and under 2% for exchanging.

Figure 5 demonstrates the speed-up achieved by the parallel algorithm. The graph plots the dimension of the hypercube versus the log of the execution time.

Figure 5: Dimension versus log of execution time (linear speed-up versus measured results).
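Using the execution times of Table 2, the speed-up over the serial run can be computed directly (a quick check, not code from the paper):

```python
# CPU seconds for complete runs of 1000 generations, from Table 2
# (hypercube dimension -> time).
TIMES = {0: 1548.33, 1: 787.98, 2: 393.29, 3: 199.18,
         4: 98.88, 5: 54.86, 6: 37.85}

def speedup(dim: int) -> float:
    """Speed-up of the 2**dim-processor run over the 1-processor run."""
    return TIMES[0] / TIMES[dim]
```

The speed-up stays close to the processor count through dimension 4 (for example, about 7.8 on 8 processors and 15.7 on 16), but reaches only about 41 of the ideal 64 at dimension 6.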
Notice that for runs made on the 6-dimensional hypercube, the sub-population size is only 6; consequently the communications overhead becomes a much larger factor in the execution time. It also takes longer to load all 64 processors with the program. These two factors combine to push the 6-dimensional time away from true linear speed-up.

The two results, the comparable performance and the near-linear speed-up, together demonstrate the power of the parallel algorithm.

5.2 Different Mutation and Crossover Rates

The effects of varying the mutation rate in the serial algorithm were studied by DeJong in [5]. The purpose of a mutation is to introduce a lost allele in a given bit position. When a bit position is fixed to one allele in all of the population, no amount of crossover will introduce a different allele there. Therefore, a mutation is an insurance policy which re-introduces lost alleles.

A single bit mutation of an individual can also be thought of as a local search in an area surrounding that individual in a multi-dimensional space. The problem is when the population converges prematurely to a local optimum, in which case it may require more than a single mutation to get over to a new region. A high mutation rate is helpful in such a situation, but it may cause too much disruption during the exploration phase.

The difficulty is in keeping the balance between exploration and exploitation (or switching the emphasis from one to the other). Ackley [6] used a variable mutation rate over time, in which he started out with a high mutation rate to explore the search space, and gradually lowered the rate to concentrate on a promising region. This use of the mutation rate is similar to the use of temperature in simulated annealing.
Ackley used an "iterated" genetic algorithm in addition to this variable mutation rate to ensure that the algorithm would not prematurely converge to a local optimum.

The parallel algorithm shown in this paper introduces a different way to handle the exploration versus exploitation problem. Since the algorithm uses multiple processors, why not have some processors run with a high mutation rate, emphasizing exploration, and the other processors run with a low mutation rate, emphasizing exploitation? When a processor running with a low mutation rate exchanges individuals with a processor with a high mutation rate:

1. If any of the new individuals are less fit than the rest of the processor's sub-population, then they will eventually die off.

2. If any of the new individuals are more fit than the processor's sub-population, then they may help to focus the search in a new direction, possibly avoiding premature convergence.

Two experiments were conducted on a 3-dimensional hypercube to demonstrate this point. All parameters other than mutation_rate were set as follows: total population size = 400, crossover_rate = 0.6, exchange 10% of the sub-population every 5 generations. The first experiment used a mutation_rate equal to 0.5 per individual on half of the processors, and 1.0 on the other half. The second experiment used a mutation_rate of 0.5 for half of the processors, and 2.0 on the others. The results of these experiments are in lines 2 and 3 of Table 3.

Another experiment was run with different values of the crossover and mutation rates on all eight processors. Crossover_rate is set either to 0.6 or 1.0. Mutation_rate can be 0.5, 1.0, 1.5, or 2.0. The combination of these two parameter values makes every processor run in a different mode than the others. The results of these runs are summarized in line 4 of Table 3.

Compared to the results of the "standard" run in line 1, all three experiments found the maximum within a reasonable number of generations.
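The parameter assignments used in these experiments can be sketched as follows. This is illustrative only: the mapping from processor id to rates is an assumption about one natural way to realize the experiments described, not the paper's code.

```python
def mixed_mutation(pid: int, num_procs: int, high: float) -> tuple:
    """Experiments 1 and 2: crossover_rate 0.6 everywhere; half the
    processors use mutation_rate 0.5, the other half use `high`
    (1.0 in the first experiment, 2.0 in the second)."""
    return 0.6, (0.5 if pid < num_procs // 2 else high)

def all_different(pid: int) -> tuple:
    """Third experiment: crossover_rate in {0.6, 1.0} and mutation_rate
    in {0.5, 1.0, 1.5, 2.0}, so each of the 8 processors of the
    3-dimensional cube runs with a distinct parameter pair."""
    return (0.6, 1.0)[pid % 2], (0.5, 1.0, 1.5, 2.0)[pid // 2]
```

The all_different assignment covers all 2 x 4 = 8 combinations, one per processor.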
These results are very encouraging in dealing with one of the main difficulties of the genetic algorithm, its sensitivity to the parameter settings. Numerous studies have dealt with this problem; the algorithm is not practical unless appropriate parameter settings can be found easily. The experiments of Table 3 suggest that the parallel algorithm can perform well even without knowing the best parameter settings.

5.3 Exchange Frequency

Table 4 shows the results of changing the frequency of exchange for runs using a 3-dimensional hypercube. The other parameters are set as follows: total population size = 400, crossover_rate = 0.6, mutation_rate = 0.5, exchange 10% of the sub-population. As the results indicate, exchanging too frequently or too infrequently may degrade the performance of the algorithm. Exchanging every 5 generations seems to be most effective for optimizing the function W.

6 Conclusion

The problems that can be solved by the genetic algorithm can be categorized in three groups:

1. Problems that work better using a large population and the serial algorithm.

2. Problems that work better using the sub-population model and the parallel algorithm.

3. Problems that work well either using a large population and the serial algorithm, or using the sub-population model and the parallel algorithm.

The function maximization problem presented in this paper belongs to the third category. Using the sub-population model for problems in this category has practical benefits, as it achieves near-linear speed-up.

The benefits of a large single population versus those of the sub-population model have been argued in the field of population genetics by Fisher and Wright [8]. According to Wright's "Shifting Balance Theory", there exist situations in population genetics in which the sub-population model succeeds but the single population model will not.
It is not immediately apparent that this theory is transferable to computer science, but if it is, there may be problems which can be solved only by the parallel genetic algorithm. Further research is required to identify the problems in the first category (if any), and to determine if they can benefit from other parallelization techniques.

Finally, the experiments in Table 3 suggest that the parallel algorithm may be more robust than the serial one, in that reasonable performance can be achieved by having different processors use different parameter values, without determining the best possible parameter values. More investigation is required in this area, possibly studying the effects of varying other parameters in addition to the mutation and crossover rates.

Acknowledgements

I would like to thank Professor Quentin Stout for his valuable advice and guidance, and Colin Underwood for his help in presenting these ideas. This work is only possible because of their enthusiasm. I would also like to thank Professor John Holland for introducing me to this subject.

References

[1] J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

[2] P. Grosso, Computer Simulations of Genetic Adaptation: Parallel Subcomponent Interaction in a Multilocus Model, PhD Thesis, Computer and Communication Sciences, University of Michigan, 1985.

[3] J. P. Hayes, T. Mudge, Q. F. Stout, S. Colley and J. Palmer, A Microprocessor-based Hypercube Supercomputer, IEEE Micro, vol. 6, no. 5, Oct. 1986, pp. 6-17.

[4] A. Bethke, Genetic Algorithms as Function Optimizers, PhD Thesis, Computer and Communication Sciences, University of Michigan, 1981.

[5] K. A. DeJong, Analysis of the Behavior of a Class of Genetic Adaptive Systems, PhD Thesis, University of Michigan, 1975.

[6] D. H. Ackley, Stochastic Iterated Genetic Hillclimbing, Carnegie-Mellon University, CMU-CS-87-107, 1987.

[7] J. J.
Grefenstette, Parallel Adaptive Algorithms for Function Optimization, Vanderbilt University, Technical Report CS-81-19 (Preliminary Report), 1981.

[8] W. B. Provine, Sewall Wright and Evolutionary Biology, The University of Chicago Press, 1986.

                             (a) Exchange 10%                 (b) Exchange 20%
Dim  #PEs  Sub Pop   Average   #Runs          Range    Average   #Runs          Range
           Size      Gens      Reaching Max           Gens      Reaching Max
 0      1     400     >198         63       60-1000+    >198         63       60-1000+
 1      2     200      212         64        65-615      198         64        60-675
 2      4     100      201         64        55-530     >210         63       70-1000+
 3      8      50      171         64        80-385      209         64        60-665
 4     16      24      248         64        95-595      200         64        70-510
 5     32      12      307         64       105-785      212         64        65-465
 6     64       6       -           -           -        232         64       115-460

Table 1: Data for 64 runs on a hypercube with 0 to 6 dimensions

Dimension   #Processors   CPU Seconds
    0             1          1548.33
    1             2           787.98
    2             4           393.29
    3             8           199.18
    4            16            98.88
    5            32            54.86
    6            64            37.85

Table 2: Execution times for runs of 1000 generations

Crossover   Mutation             Average   #Runs           Range
Rate        Rate                 Gens      Reaching Max
0.6         0.5                   171          64          80-385
0.6         0.5, 1.0              150          64          45-380
0.6         0.5, 2.0              199          64          65-570
0.6, 1.0    0.5, 1.0, 1.5, 2.0    153          64          65-435

Table 3: Data for 64 runs with different crossover and mutation rates (dimension = 3)

Exchange          Average   #Runs           Range
Frequency (gen)   Gens      Reaching Max
  1                >222        62          70-1000+
  2                >197        63          60-1000+
  3                >180        63          40-1000+
  5                 171        64          80-385
 10                >248        63          95-1000+
 20                 302        64         120-695

Table 4: Data for 64 runs with different frequency of exchanges (dimension = 3)

BUCKET BRIGADE PERFORMANCE: I. LONG SEQUENCES OF CLASSIFIERS

Rick L. Riolo
The University of Michigan

Abstract

In Holland-type classifier systems the bucket brigade algorithm allocates strength ("credit") to classifiers that lead to rewards from the environment.
This paper presents results that show the bucket brigade algorithm basically works as designed: strength is passed down sequences of coupled classifiers, from those classifiers that receive rewards directly from the environment to those that are stage setters. Results indicate it can take a fairly large number of trials for a classifier system to respond to changes in its environment by reallocating strength down competing sequences of classifiers that implement simple reflex and non-reflex behaviors. However, "bridging classifiers" are shown to dramatically decrease the number of times a long sequence must be executed in order to reallocate strength to all the classifiers in the sequence. Bridging classifiers also were shown to be one way to avoid problems caused by sharing classifiers across competing sequences.

1 INTRODUCTION

Like all highly parallel, fine-grained, rule-based learning systems, classifier systems ([Holland, 1986a], [Burks, 1986], [Holland and Burks, 1987], [Holland, 1986b]) must solve the apportionment of credit problem. In short, the apportionment of credit problem is the problem of deciding, when many rules are active at every time step, which of those rules active at step t are necessary and sufficient for achieving some desired outcome at step t+n. In terms of Samuel, who first recognized the problem in the context of his checker playing program [Samuel, 1959], the problem is how to know which of the many moves (or sequences of moves) made in the early parts of a game "set the stage" for a triple jump later in the game. The problem of apportioning credit is especially difficult in complex domains in which (a) information about what is a good result is provided only occasionally, perhaps after long sequences of actions, and (b) there are millions of possible states or state sequences, so that the system never sees the same exact sequence twice.
In classifier systems using the bucket brigade algorithm [Holland, 1985], credit is allocated in the form of a value, strength, associated with each classifier. The strength assigned to a classifier is important for two reasons:

1. Strength determines in part which classifiers will be active at a given time step, and so controls the short term behavior of the system.

2. Strength is used by rule discovery algorithms to guide the creation and deletion of classifiers, thereby influencing the longer term learning behavior of the system.

Thus for classifier systems both to perform well and to learn, strength must be allocated properly and expeditiously by the bucket brigade algorithm.

Basically the bucket brigade algorithm acts in two ways:

1. It adjusts the strength of those classifiers that are active when a payoff is received from the environment. Each classifier's strength is changed a little at a time until it is proportional to the average of the payoffs the system receives when that classifier is active.

2. It redistributes strength from each active classifier to the classifiers that posted messages that activated it. Each classifier's strength is modified a little at a time until it is proportional to the strength of the classifier(s) it activates.

Over time the bucket brigade algorithm reallocates strength from classifiers that directly lead to payoffs from the environment to those classifiers that indirectly lead to payoff, i.e., to classifiers that post messages that "set the stage" for those classifiers directly responsible for receiving payoffs.

The bucket brigade has several characteristics that make it ideal for use with highly parallel systems like classifier systems.
First, the bucket brigade algorithm uses only local information: when adjusting the strength of a classifier, it only needs to know which classifiers directly activated it and which classifiers it directly activates. There is no need for complicated book-keeping or for high-level critics to analyze sequences of actions and assign credit accordingly. Second, the bucket brigade works in a highly parallel way, changing the strength of many (or all) rules at the same time. Third, the bucket brigade acts incrementally, changing the strength of classifiers gradually. By changing the strength of classifiers only a small amount at a time, the classifier system tends to learn gracefully, without the precipitous changes in performance that may result from making a large change in response to a single, possibly anomalous, case.

* This work was supported by National Science Foundation Grant DCR 83-05830.

One key issue for systems using the bucket brigade algorithm is how fast strength flows down long sequences of classifiers. If a whole sequence of classifiers must be activated many times in order to adjust the strength of a classifier at the beginning of the chain in response to a change in payoff associated with the last step in the sequence, the system's response to simple changes in its environment will be too slow. Wilson [Wilson, 1986] used a simple simulation to show that allocation down a sequence of classifiers can take a fairly large number of steps. (He suggested an alternative "hierarchical" bucket brigade algorithm that is designed to speed up the flow of credit down long sequences of classifiers.) Holland [Holland, 1985] mentions this problem and suggests a way to implement "bridging" classifiers that speed up the flow of strength down a long sequence of classifiers.

This paper describes some simple experiments designed to show how well the bucket brigade is able to allocate strength down long sequences of classifiers.
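The two-part strength update described in the introduction can be illustrated with a tiny simulation of a single chain of coupled classifiers. This is a simplified sketch, not the CFS-C bucket brigade: the bid is a fixed fraction of strength, the first classifier's bid is simply lost, and only the last classifier receives the environmental payoff.

```python
def run_chain(strengths, payoff, bid_ratio=0.1, episodes=200):
    """Fire a single sequence of coupled classifiers repeatedly.
    On each episode every classifier pays bid_ratio of its strength
    to the classifier that activated it (its predecessor in the
    chain), and the last classifier also collects the environmental
    payoff. Strength gradually flows back from the end of the chain
    to the stage-setting classifiers at the front."""
    n = len(strengths)
    for _ in range(episodes):
        bids = [bid_ratio * s for s in strengths]
        for i in range(n):
            strengths[i] -= bids[i]          # pay the bid ...
            if i > 0:
                strengths[i - 1] += bids[i]  # ... to the activator
        strengths[-1] += payoff              # environmental reward
    return strengths
```

Starting a chain of five classifiers at strength 0 with payoff 100 and bid_ratio 0.1, the strengths climb toward payoff / bid_ratio = 1000, with classifiers nearer the payoff converging first, which is why long chains are slow to adjust and why bridging classifiers help.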
Section 2 describes the CFS-C/FSW1 classifier system, which is used to carry out all experiments described in this paper. In Section 3, the allocation of strength down a single chain of classifiers will be examined. In Sections 4 and 5, the ability of the bucket brigade to allocate strength so that the system learns to make the proper choice at the beginning of a long sequence of steps is examined. The effects of "bridging" classifiers are also examined. In Section 6 the effect of sharing classifiers in different sequences is examined, without and with bridging classifiers.

2 THE CFS-C/FSW1 SYSTEM

All experiments described in this paper were done using the CFS-C classifier system [Riolo, 1986], set in the FSW1 ("Finite State World 1") task environment [Riolo, 1987]. This section describes the parts of the CFS-C/FSW1 system that are relevant to the experiments described in this paper. For a complete description of those systems, see the documentation cited. Basically, the FSW1 domain is a world that is modeled as a finite Markov process, in which a payoff is associated with some states. The classifier system's input interface provides a message that indicates the current state of the Markov process. The classifier system's output interface provides the system with a way to alter the transition probabilities of the process, so that the system can control (in part) the path taken through the finite state world. When the classifier system visits states with non-zero payoff, that payoff is given to the system as a "reward". Thus the task for the CFS-C classifier system in the FSW1 domain is to learn to emit the appropriate signals at each step so that the Markov process will visit states with higher payoff values as often as possible.
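As a concrete illustration, a tiny FSW1-style world of this kind can be sketched in a few lines of Python. This is an illustrative reconstruction, not the actual CFS-C/FSW1 code; it uses the three-state layout of the simple example world discussed in the next section, and the behavior for unlisted effector values (stay put) is an assumption, since those transitions are left unspecified.

```python
# Sketch of a tiny FSW1-style world: three states W0, W1, W2, with a
# payoff of 100 at W1 and 0 elsewhere.  The effector value r chosen by
# the classifier system selects the transition out of the start state.
PAYOFF = {0: 0, 1: 100, 2: 0}

def step(state, r):
    """Return the next state given the current state and effector value r.
    From W0, r = 1 leads to W1 and r = 2 leads to W2 (deterministically);
    W1 and W2 always return to W0, regardless of r."""
    if state == 0:
        if r == 1:
            return 1
        if r == 2:
            return 2
        return 0          # other r values: stay put (an assumption)
    return 0

# A system that always emits r = 1 collects 100 every two steps:
state, reward = 0, 0
for _ in range(10):
    state = step(state, 1)
    reward += PAYOFF[state]
```

Always emitting r = 2 instead would collect nothing, which is why the system must learn to set r = 1 whenever it is in the start state.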
More formally, the FSW1 task domain is fully defined by specifying:

- A set of n states W_i, i = 0, ..., n-1, each with an associated payoff u(W_i); one state also is designated the start state.

- A set of probability transition matrices P(r), where each entry p_ij(r) in P(r) gives the probability of going to state W_j, given that the system is in state W_i and that the classifier system has emitted r as its output value (r = 0...15).

Figure 1: A simple FSW1 finite state world.

For example, consider the simple three-state world shown in Figure 1. (In this and other diagrams, states are shown as circles, and arrows designate non-zero probability transitions.) W_0 is the start state. The payoff for state W_1 is 100; the payoff for the other states is 0. When the system is in state W_0, if r = 1 the probability of going from state W_0 to W_1 is 1.0; if r = 2 the probability of going from state W_0 to W_2 is 1.0. For other values of r, the probability of going from state W_0 to state W_1 or W_2 is 0. The probability of going from either W_1 or W_2 to W_0 is 1.0, no matter what the value of r. Thus if the classifier system is to maximize its payoff in this world, it must learn to set r = 1 whenever it is in state W_0.

The CFS-C classifier system is a standard, "Holland" type learning classifier system that consists of four basic parts:

- A message list, which acts as a "blackboard" for communications and short term memory. In the CFS-C classifier system, the message list has a small, maximum size.

- A classifier list, which consists of condition-action rules called classifiers. Each classifier in the CFS-C system is a two-condition classifier of the form:

    C_1, C_2 / Action

A classifier's condition part is satisfied when each of the conditions C_1 and C_2 is matched by one or more messages on the message list. The second condition may be prefixed by a "~", in which case that condition is satisfied only when no message matches the condition C_2.
A satisfied classifier produces one message for each message that matches its first condition, C_1, using the usual "pass through" procedure. Each classifier also has an associated strength, which is related to its usefulness in attaining rewards for the system, and a specificity (sometimes called its bid ratio), which is a measure of the generality of the classifier's conditions.

- An input interface, which provides the classifier system with information about its environment. In the FSW1 domain, the input interface provides one detector message which indicates the current state of the Markov process.

- An output interface, which provides a way for the classifier system to communicate with or change its environment. In the FSW1 domain, the output interface maps messages that start with a "10" (sometimes called effector messages) into an effector setting, r, r = 0...15, which determines the transition probability matrix P(r) used to select the next state of the Markov process.

As in other "Holland" classifier systems, messages are all strings of fixed length L, built from the alphabet {0,1}. Each condition C_i and the action part of a classifier is also a string of length L, built from the alphabet {0,1,#}. The # acts as a "wildcard" symbol in the condition strings, and it acts as the "pass-through" symbol in the action part of a classifier.

The CFS-C/FSW1 system is run by repeatedly executing the following steps of the classifier system's "major cycle":

1. Add messages generated by the input interface to the message list. In the FSW1 domain one message, which indicates the current state W(t) of the world, is added to the message list.

2. Compare all messages to all conditions of all classifiers and record all matches for classifiers that have their condition parts satisfied.

3. Generate new messages by activating satisfied classifiers.
If activating all the satisfied classifiers would produce more messages than will fit on the message list, a competition is run to determine which classifiers are to be activated. Classifiers are chosen probabilistically, without replacement, until the message list is full. The probability that a given classifier is activated is proportional to its bid.

4. Process the new messages through the output interface, resolving conflicts and selecting one effector setting, r, to be used for the current time step. Once r is set, the associated transition matrix P(r) and the current state W(t) are used to select the world state W(t+1) to which the system moves.

5. Apply the bucket brigade algorithm, to redistribute strength from the environment to the system and from classifiers to other classifiers.

6. Apply discovery algorithms, to create new classifiers and remove classifiers that have not been useful.

7. Replace the contents of the message list with the new messages, and return to step 1.

In the CFS-C/FSW1 system, the bid of classifier i at step t, B_i(t), is calculated as follows*:

    B_i(t) = k * S_i(t) * BidRatio_i

k is a small constant (usually about 0.1), which acts as a "risk factor", i.e., it determines what proportion of a classifier's strength it will bid and so perhaps lose on a single step. S_i(t) is the classifier's strength at step t. BidRatio_i is a number between 0 and 1 that is a measure of the classifier's specificity, i.e., how many different messages it can match. A BidRatio of 1 means the classifier matches exactly one message, while a BidRatio of 0 means the classifier matches all messages.
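The bid calculation above is simple enough to state directly in code. This is a sketch of the formula only; the additional CFS-C parameters mentioned in the footnote are left out, as they are disabled in the experiments anyway.

```python
def bid(strength, bid_ratio, k=0.1):
    """B_i(t) = k * S_i(t) * BidRatio_i: the fraction of its strength a
    classifier risks on one step, scaled by its specificity."""
    return k * strength * bid_ratio

# A fully specific classifier (BidRatio = 1) with strength 1000 bids 100;
# a maximally general one (BidRatio = 0) bids nothing.
high = bid(1000.0, 1.0)
low = bid(1000.0, 0.0)
```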
When a competition is run to determine which classifiers are to be activated, the probability that a given (satisfied) classifier i will win is:

    Prob(i wins) = EB_i(t) / SUM_j EB_j(t)

The effective bid, EB_i(t), of a classifier i at step t, is:

    EB_i(t) = B_i(t)^BidPow

BidPow is a parameter that can be set to alter the shape of the probability distribution used to choose classifiers to produce messages. For example, if BidPow = 1 then EB_i(t) = B_i(t), i.e., a classifier's probability of producing messages is just its bid divided by the sum of bids made by all satisfied classifiers. Setting BidPow to 2, 3, and so on, makes it more likely that classifiers with the highest bids will win the competition. The effects of varying BidPow are considered further in Section 4 of this paper.

Note that the output interface of the CFS-C/FSW1 system may have to resolve conflicts, e.g., when one classifier produces a message that says "set the effector value r to 1" and another produces a message that says "set the effector value r to 2". Since the effector can only be set to one value at a time (just as we can either lift our arm or lower it, and not both), an effector conflict resolution mechanism must be used. Basically, when there are conflicts the value of r is chosen probabilistically, with the probability that r = r' equal to:

    SUM_m' EB_m'(t) / SUM_m EB_m(t)

where m' ranges over the effector messages that say "set r to r'", EB_m'(t) is the effective bid of the classifier that posted message m', m ranges over all effector messages, and EB_m(t) is the effective bid of the classifier that posted message m. The winning r value is used to select a transition matrix P(r), which in turn is used to determine to which state the system will move.

* Actually the CFS-C/FSW1 bid calculation involves other parameters not shown here, but for the experiments described in this paper, those parameters have been disabled.
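The competition for message-list slots can be sketched as follows. This is an illustrative reimplementation under the assumptions stated above (effective bid equals the bid raised to BidPow, and winners are drawn without replacement with probability proportional to effective bid), not the CFS-C source.

```python
import random

def run_competition(bids, capacity, bid_pow=1):
    """Choose up to `capacity` winners from the satisfied classifiers,
    each draw proportional to bid**BidPow, without replacement."""
    pool = list(range(len(bids)))
    winners = []
    while pool and len(winners) < capacity:
        eff = [bids[i] ** bid_pow for i in pool]  # effective bids
        total = sum(eff)
        x = random.uniform(0.0, total)
        for idx, i in enumerate(pool):            # roulette-wheel draw
            x -= eff[idx]
            if x <= 0.0:
                winners.append(pool.pop(idx))
                break
    return winners
```

Raising BidPow sharpens the distribution: with bids of 40 and 10 competing for a single slot, the larger bidder wins 80% of the time at BidPow = 1, but about 98.5% of the time at BidPow = 3 (since 40^3 / (40^3 + 10^3) = 64/65).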
Once an effector value is chosen, all messages that are inconsistent with that setting are deleted from the new message list.

The basis for the reallocation of strength done by the bucket brigade algorithm is the payoffs received by the classifier system from the FSW1 environment. In the CFS-C/FSW1 system, when the Markov process enters a state W_j, all classifiers that posted messages which are on the new message list (after any effector conflicts are resolved) have the full payoff u(W_j) added to their strength. Thus when the activation of a classifier tends to be directly associated with a high reward from the environment, that classifier's strength is on average increased.

The bucket brigade algorithm also redistributes strength from classifiers to other classifiers. In particular, when a classifier posts messages, it pays the amount it bid to the classifiers that made it possible for that classifier to become active. Let BidShare equal the classifier's bid, B, divided by m, the number of messages that matched its conditions. Then the strength of the active classifier is decreased by BidShare * m and BidShare is added to the strength of each classifier that produced a message that matched the activated classifier's conditions. (If a classifier is matched by one or more "detector" messages, i.e., messages from the system's input interface, the classifier's strength is still decremented by BidShare for each detector message used, but that amount is not added to the strength of any other classifier. Thus just as the environment is the ultimate source of strength, in the form of payoffs, it is also the ultimate sink for strength, when detector messages are used.)
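The payment rule just described might be sketched like this. This is a simplified illustration of the BidShare bookkeeping, not the CFS-C code; `sources` marks which classifier produced each matching message, with None standing for a detector message, whose share drains back to the environment.

```python
def pay_bid(bid_amount, sources, strength):
    """Split an activated classifier's bid equally among the producers of
    the m messages that matched its conditions (BidShare = B / m).
    Shares owed to detector messages (source None) are not credited to
    any classifier: the environment acts as a strength sink."""
    share = bid_amount / len(sources)          # BidShare
    for src in sources:
        if src is not None:
            strength[src] += share
    return share * len(sources)                # total deducted: the full bid

strength = {1: 1000.0, 2: 1000.0}
# Suppose a classifier bid 100 and was matched by one message from
# classifier 1 and one detector message: classifier 1 gets 50, and the
# other 50 returns to the environment.
deducted = pay_bid(100.0, [1, None], strength)
```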
In summary, the strength S_i(t+1) of a classifier i at step t+1 is:

    S_i(t+1) = S_i(t) + R_i(t) + P_i(t) - B_i(t)

where S_i(t) is the strength of classifier i at step t, R_i(t) is the reward from the environment during step t, P_i(t) is the sum of all payments to classifier i from classifiers that matched messages produced by i during the previous step, and B_i(t) is the classifier's bid during step t. Clearly a classifier's strength reaches a fixed point when the amount of strength it receives is equal to the amount it pays. Thus in the long run a classifier's fixed-point strength, S_fp, approaches:

    S_fp = (R + P) / (k * BidRatio)

where R and P are the average amounts the classifier receives per activation as rewards from the environment and payments from other classifiers, respectively.

Since the focus of this paper is on the allocation of strength by the bucket brigade algorithm among existing classifiers, rather than the creation of new classifiers, the CFS-C/FSW1 system's rule discovery algorithms are not used in the experiments described in this paper. Instead, all classifiers are added to the initial classifier list. Those classifiers remain unchanged, except for their strengths, during the course of each experiment.

3 SIMPLE SEQUENCES OF CLASSIFIERS

To get a feel for how the bucket brigade algorithm works in the FSW1 domain, consider the finite state world shown in Figure 2. There are 13 states in this world, W_i, i = 0, ..., 12. State W_12 has an associated payoff of 100, while the payoff for all other states is 0. The start state is W_0. The classifier system must set r = 1 to move from state W_i to W_i+1; i.e., p_ij(1) = 1.0 for j = i+1, i = 0...11.

Figure 2: A single path of states leading to a reward at state W_12.

When the system reaches state W_12, it will go to state W_0 by default; i.e., p_ij(0) = 1.0 for i = 12, j = 0. (In this and subsequent descriptions of finite state worlds, all transitions not mentioned have probability 0.)
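The fixed-point formula can be checked numerically by iterating the strength update with constant average income. This is an illustrative check only, assuming R and P per activation are constant.

```python
def fixed_point_strength(R, P, k=0.1, bid_ratio=1.0):
    """S_fp = (R + P) / (k * BidRatio): income per activation balances
    the bid paid per activation."""
    return (R + P) / (k * bid_ratio)

# Iterate S <- S + R + P - k*BidRatio*S from an arbitrary start; it
# converges to the predicted fixed point (1000 for R=100, P=0, k=0.1).
S = 50.0
for _ in range(500):
    S += 100.0 + 0.0 - 0.1 * 1.0 * S
```

Each iteration closes a constant fraction (here 10%) of the gap to the fixed point, which is also why strength changes are gradual rather than precipitous.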
Thus with a perfect set of classifiers, the system could achieve a reward of 100 every 13 cycles. The CFS-C/FSW1 system was run in this world with twelve classifiers, each with a starting strength of 50. Classifier 1 was of the form:

    d_0, d_0 / e_1, r = 1

where each condition "d_0" matches the detector message for state W_0, and the action "e_1, r = 1" posts an effector message e_1 that sets the effector value r to 1. (In order to make the classifiers more readily understandable, they will be shown in an "interpreted" form rather than in terms of strings built from the {0,1,#} alphabet.) Classifiers 2 through 12 are of the form:

    e_i-1, e_i-1 / e_i, r = 1

where the condition "e_i-1" matches the effector message produced by classifier i-1, i = 2, ..., 12, and the action part of each classifier i produces an effector message e_i which sets r to 1.

In short, when the Markov process enters state W_0, classifier 1 is activated, which posts an effector message that moves the system to state W_1. Since classifier 1 is activated by detector messages, it pays its bid to the system rather than to some other classifier. At the next time step, classifier 2 is activated by the message produced by classifier 1, so classifier 2 pays its entire bid to classifier 1. Classifier 2 also posts an effector message that moves the system to state W_2. Thus the process continues, each classifier being activated by and paying its bid to the classifier that was active on the prior step. Finally the Markov process reaches state W_12, in which case the classifier active during that step, classifier 12, receives the payoff associated with that state (100). The system then returns to state W_0, and the cycle starts again.

Figure 3 shows the results of running the 12 classifiers described above for a period of 3000 cycles (about 230 passes through the sequence), using k = 0.1 and BidPow = 1. The strengths of classifiers 1, 4, 7, 10, 11, and 12 are shown plotted against the number of time steps executed.
Figure 3 clearly shows the wave of strength flowing from classifier 12, which receives the reward directly from the environment and reaches its fixed-point strength first, to classifier 1, which is farthest in the chain from the environmental reward and so reaches its fixed-point strength last. (The blips are artifacts of when strength is recorded: sometimes a classifier's strength is displayed at the end of a step in which it posted a message, so that it has just had its strength reduced by its bid but it has not yet been paid by the next classifier in the chain.)

Figure 3: The flow of strength down a simple sequence of coupled classifiers. Classifier 12 leads directly to a reward, and classifier 1 is at the start of the sequence.

Figure 4: The number of steps required for a classifier in a chain of coupled classifiers to reach 90% of its fixed-point strength, plotted against the number of steps to the end of the chain, for k = 0.05, 0.1, 0.2, and 0.3.

As expected, the fixed-point strength, S_fp, of all the classifiers in this experiment is the same (1000), since each classifier pays its full bid to its one predecessor (or to the system for the detector message, in the case of classifier 1). The number of cycles it takes for a classifier n steps from the environmental reward to reach 90% of its S_fp fits the following equation:

    t = 286 + 155n

where n = 0 for classifier 12, n = 1 for classifier 11, and so on.
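The experiment in Figure 3 is easy to approximate with a back-of-the-envelope simulation. This is a simplified rerun, not the CFS-C system itself: activation is deterministic, one classifier is active per step, each pays its full bid to its predecessor, and classifier 1's bid goes to the system.

```python
k, reward = 0.1, 100.0
S = [50.0] * 12                    # strengths of classifiers 1..12

def run_pass(S):
    """One pass W0 -> W12: each classifier pays its bid back up the
    chain, and classifier 12 collects the payoff on entering W12."""
    for i in range(12):            # classifier i+1 becomes active
        b = k * S[i]
        S[i] -= b
        if i > 0:
            S[i - 1] += b          # paid to its predecessor
        # (i == 0: classifier 1 pays the system for its detector message)
    S[-1] += reward                # classifier 12 is active at the payoff

passes_to_90 = 0
while S[-1] < 900.0:               # 90% of the fixed point of 1000
    run_pass(S)
    passes_to_90 += 1

for _ in range(3000):
    run_pass(S)
```

In this simplified version classifier 12 crosses 90% of its fixed point after 22 passes, consistent with the intercept of R = 22 + 11.9n below, and after many passes all twelve strengths settle near the common fixed point of 1000.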
In terms of the number of passes through the sequence of states (i.e., the number of rewards received), this works out to be:

    R = 22 + 11.9n

This number is in good agreement with the value arrived at in [Wilson, 1986], using a simple simulation of the bucket brigade.

One way to speed the flow of strength back through a sequence of classifiers is to increase k, the bid constant that specifies what proportion of a classifier's strength is to be risked on any one bid. Figure 4 shows a comparison of the results obtained when the 12 classifiers described earlier were run using values of k = 0.05, 0.1, 0.2, and 0.3. The horizontal axis shows the number of steps a classifier is from the environmental reward. The vertical axis shows the number of cycles it takes for a classifier to reach 90% of its fixed-point strength. Higher values of k result in faster flow of strength down the sequence of classifiers. For example, for a classifier 10 steps from the reward, the number of passes through the sequence is 59 for k = 0.3, compared to 141 for k = 0.1.

4 CHOOSING BETWEEN SIMPLE SEQUENCES OF CLASSIFIERS

Clearly one important characteristic of the bucket brigade algorithm is the number of cycles it takes for strength to flow down a chain of coupled classifiers. Another important measure of the bucket brigade algorithm's performance is the ability of the classifier system to respond to changes in the payoffs associated with states. For example, consider the CFS-C/FSW1 world shown in Figure 5. There are 19 states in this world. The start state is W_0. When the system is in W_0, if the classifiers set the effector value r to 1, then the system goes to state W_1 with probability 1; if the classifiers set r to 2, then the system goes to W_10. Once the top or bottom path is chosen, the Markov process can be moved through the intervening states to W_9 or W_18 by continuing to set r to 1 or 2, respectively. When the process reaches state W_9 or W_18, it always returns to state W_0.
The CFS-C/FSW1 system was run in the world described above using the following 18 classifiers:

    d_0, d_0 / e_1, r = 1                 (1)
    d_0, d_0 / e_11, r = 2                (11)
    e_i-1, e_i-1 / e_i, r = 1    (i = 2...9)
    e_i-1, e_i-1 / e_i, r = 2    (i = 12...19)

(The numbers in parentheses on the right serve to identify the classifiers.) Each classifier has BidRatio = 1. Basically, the classifiers 1 and 11 compete to become active when the system is in state W_0. Those classifiers try to have the system take the top or bottom path, respectively, by (a) setting the effector value to 1 or 2, and (b) producing a message that sets the stage for the rest of the classifiers in its associated sequence to fire, one after another. (Note that 1 and 11 can't post messages at the same time, since they try to set r to different values.) For example, if classifier 1 wins the competition, its message sets r to 1, so that the system moves to state W_1. In the next step, classifier 2 is matched by the message produced by classifier 1, so classifier 2 pays its bid to classifier 1, and the system is moved to state W_2. This process continues until the system reaches state W_9 and classifier 9 receives the payoff associated with that state.

Figure 5: Two competing paths of states leading to rewards at states W_9 and W_18.

Suppose states W_9 and W_18 both have a payoff of 100, and all other states have a payoff of 0. In this case it does not matter what path the system takes: the maximum payoff rate it can achieve is 100 per 10 cycles. Note that the fixed-point strength of all classifiers in both sequences is 1000 (assuming k = 0.1 and each classifier has a BidRatio = 1). In particular, classifiers 1 and 11 will have the same fixed-point strengths, so each will have a 0.50 probability of winning the competition, and so the system will go down each chain 50% of the time.

Suppose the payoff associated with state W_9 is changed to 400.
In this case the optimal payoff, 400 per 10 cycles, can be achieved by always taking the top path. Thus to achieve optimal performance, the bucket brigade must reallocate strength so that classifier 1, the classifier that causes the system to go down the top path, has a higher fixed-point strength than classifier 11, the one that causes the system to go down the bottom path. The faster the system can reallocate strength, the faster the system can respond to the change in its environment.

Figure 6 shows the results of running the above classifiers in the world described above, with u(W_9) = 400 and u(W_18) = 100. (The results in this and the rest of the experiments described in this paper are the average of 10 runs, each started with a different seed for the system's pseudo-random number generator.) All classifiers had an initial strength of 1000, i.e., the system was started as if it had been run with u(W_9) = u(W_18) = 100 until the classifiers all reached their fixed-point strengths. BidPow was set to 1, i.e., a classifier's effective bid equals its bid. k was set to 0.1 in this and all the rest of the experiments described in this paper. Given that BidRatio = 1 for all 18 classifiers, the expected fixed-point strengths for classifiers in the top and bottom chains are 4000 and 1000, respectively. The maximum payoff rate (per 200 cycles) is 8000, and the payoff expected if the choice of path is made at random is 5000.

Figure 6 shows the marginal payoff the system received plotted against the cycle steps executed. The strength of classifiers 1 and 11, the classifiers that compete to choose the path the system takes, is also shown.

Figure 6: Marginal performance and the strength of classifiers 1 and 11 when the system is run with two competing sequences in the world shown in Figure 5. The rewards at the end of the sequences chosen by classifiers 1 and 11 were 400 and 100, respectively.

Note that as the strength of classifier 1 increases toward its fixed-point value, marginal performance also increases. Let the fixed-point payoff, P_fp, be defined as the average marginal payoff in the last one quarter of a run. Then the average P_fp for the runs shown in Figure 6 was 6870, which is 85.9% of the optimal payoff rate. The system's performance is less than optimal because the competition is stochastic: the higher strength classifier has a higher probability of winning but it doesn't always win. Also note that it takes about 1700 steps (170 trials) for the system to reach 90% of the P_fp. This is a little more than might be expected given the results described in the previous section; the reason for the longer observed time is that the system often traverses the lower path, in which case increased strength is not flowing down the top chain.

One way to increase the fixed-point payoff rate is to bias the effective bid in favor of high strength classifiers by setting BidPow > 1. The following table compares the results obtained for BidPow = 1, 2, and 3:

    BidPow    P_fp (% Max)
    1         85.9
    2         96.7
    3         98.3

Figure 7: Marginal performance when the system was run with competing sequences of classifiers as in Figure 6, using three different values of BidPow.

As expected, increasing BidPow increases the payoff rate. Figure 7 shows the marginal performance obtained in these experiments plotted against the number of major-cycle steps executed by the system. Not only are the fixed-point performance levels increased by increasing BidPow, but the number of steps it takes the system to respond to the change in payoff is decreased somewhat.
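The measured fixed-point performance levels are roughly what one would predict from the fixed-point strengths alone. The following is an illustrative calculation, not part of the reported experiments: it assumes the path choice is made with probability proportional to strength raised to BidPow, and it ignores all transient dynamics.

```python
def expected_payoff_fraction(s_top, s_bot, r_top, r_bot, bid_pow):
    """Expected payoff as a fraction of the optimum when two classifiers
    with fixed-point strengths s_top and s_bot compete for the first
    move, winning with probability proportional to strength**BidPow."""
    p_top = s_top ** bid_pow / (s_top ** bid_pow + s_bot ** bid_pow)
    return (p_top * r_top + (1.0 - p_top) * r_bot) / max(r_top, r_bot)

# With strengths 4000 vs. 1000 and path rewards 400 vs. 100 this
# predicts about 85.0%, 95.6%, and 98.8% for BidPow = 1, 2, 3, in the
# same ballpark as the measured 85.9%, 96.7%, and 98.3%.
fractions = [expected_payoff_fraction(4000, 1000, 400, 100, bp) for bp in (1, 2, 3)]
```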
In the rest of the experiments described in this paper, BidPow is set to 3.

Even with BidPow = 3, it takes the classifier system a large number of passes down the full sequence to learn to take the path to the higher reward: as Figure 7 shows, it takes about 100-120 passes (1000-1200 cycles) to begin to respond to the change in the environment, and about 160-170 passes (1600-1700 cycles) to reach 90% of the fixed-point performance rate.

One way suggested by Holland [Holland, 1985] to speed up reallocation of strength down a long sequence of classifiers is to introduce a "bridging" classifier. Basically, a bridging classifier (sometimes called an "epoch marking" or a "support" classifier) is one that is activated by a message produced by the first classifier in a sequence, and which remains active until the payoff state at the end of the sequence is reached. Since all classifiers that are active when a payoff is achieved have that payoff added to their strengths, the bridging classifier has its strength increased the first time the sequence is executed. The next time the sequence is executed, when the bridging classifier again is activated by the message produced by the first classifier in the chain, its payment to that first classifier reflects the payoff it received on the first pass down the sequence. In this way, the change in payoff at the end of a long sequence of classifiers is passed almost immediately to the classifier at the beginning of the sequence.

To test the effectiveness of bridging classifiers, the CFS-C/FSW1 system was run using the same finite state world shown in Figure 5, using the same 18 classifiers described above plus two more bridging classifiers (one for the top sequence and one for the bottom one). In particular, the bridging classifiers were:

    (e_1 | m_10), ~d_0 / m_10     (10)
    (e_11 | m_20), ~d_0 / m_20    (20)

Both of the classifiers have BidRatio = 0.5.
Classifier 10, the bridge for the top sequence, says "If the message from classifier 1 or from classifier 10 is in the message list, and the detector message for state W_0 is not in the list, then post message m_10". Thus if classifier 1 posts a message on step t, classifier 10 will be activated on the next step, and it will use the message it produces to activate itself until the system returns to state W_0 at the end of the top path. Classifier 20 acts similarly for the bottom chain.

The starting strength for all classifiers was set to the fixed-point strength expected when the payoffs for states W_9 and W_18 are both 100. The payoff for state W_9 was then set to 400, and the system was run for 4000 major-cycle steps. Figure 8 compares the results obtained when the system was run with and without the bridging classifiers. Both marginal performance (per 200 steps) and the strength of classifier 1 are plotted versus cycle step. (The strength of classifier 11 doesn't change in these experiments; what changes is the ratio of the strength of classifier 1 to 11.) Note that with the bridging classifier, the strength of classifier 1 begins to rise almost immediately, within the first 200 cycles, as does the marginal performance. On the other hand, without the bridging classifier, the strength of classifier 1 and the marginal performance don't begin to rise until about cycle step 1100. Similarly, with the bridging classifier marginal performance reaches 90% of its fixed-point level (98.4% of the maximum, StdDev = 1.1%) in 600-700 steps, whereas without the bridging classifier it takes 1600-1700 steps.

There are many other ways to implement "bridging" classifiers. For example, the following classifiers also serve as bridges for the sequences 1-9 and 11-19:

    e_i, e_i / m_21    (i = 1, ..., 9)      (21)
    e_i, e_i / m_22    (i = 11, ..., 19)    (22)

Classifier 21 says "If the message list contains a message posted by any of the classifiers in the top sequence, then post a message".
This classifier will be activated by classifier 1 and remain active until the system receives a payoff in state W_9. The difference between this type of epoch marker and the one described earlier is that this one is activated by every classifier in a sequence, rather than just the first one in the sequence. Thus the bridging classifier passes strength to all the classifiers in the chain rather than to just the first one. (Classifier 22 acts similarly for the bottom sequence.)

Figure 8: Marginal performance and the strength of classifier 1 observed when the system is run in the world shown in Figure 5, with and without "bridging" classifiers.

To test the effectiveness of the second kind of bridging classifier, the system was run in the world described in Figure 5 with the 18 classifiers described earlier and the bridging classifiers 21 and 22. The results were similar to the results obtained using the bridging classifiers 10 and 20: the strength of classifier 1 began to rise immediately, in the first 200 steps, as did the marginal performance. The system again reached 90% of its fixed-point performance rate in about 700 cycle steps.

The main difference between using the second type of bridging classifier and the first is the fixed-point strengths of the classifiers in the top sequence. With the first type of bridging classifier, the fixed-point strength of classifier 1 was changed from 2000 (for u(W_9) = 100) to 8000 (for u(W_9) = 400); the fixed-point strengths of the other classifiers in the top sequence changed from 1000 to 4000.
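The fixed-point values quoted for the first type of bridge can be reproduced with a small simulation of the top chain plus its bridge. This is a simplified model under several assumptions: the competition with the bottom path is ignored, the bridge pays its bid to classifier 1 once per pass and self-sustains for free on the other steps, and both classifier 9 and the bridge receive the full payoff.

```python
k, payoff = 0.1, 400.0
S = [2000.0] + [1000.0] * 8        # classifiers 1..9 at their old fixed points
SB = 2000.0                        # bridging classifier 10 (BidRatio = 0.5)

for _ in range(500):               # passes down the top chain
    S[0] -= k * S[0]               # classifier 1 pays the system
    b = k * 0.5 * SB               # the bridge's bid (BidRatio = 0.5) ...
    SB -= b
    S[0] += b                      # ... goes to classifier 1
    for i in range(1, 9):          # classifiers 2..9 pay their predecessors
        bi = k * S[i]
        S[i] -= bi
        S[i - 1] += bi
    S[8] += payoff                 # payoff at W9 to classifier 9 ...
    SB += payoff                   # ... and to the still-active bridge
```

In this model the strengths settle at 8000 for classifier 1 and the bridge and 4000 for classifiers 2 through 9, matching the fixed points quoted above, and classifier 1's strength starts climbing within the first few passes, which is the point of the bridge.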
With the second type of bridging classifier, the fixed-point strength of classifier 1 was changed from 2000 (for u(W_9) = 100) to 8000 (for u(W_9) = 400); the fixed-point strengths of the other classifiers in the top sequence changed from 1000 to values ranging from 7406 (for classifier 2) to 4000 (for classifier 9). The reason classifiers 2 through 9 have different fixed-point strengths is that the second type of bridging classifier pays some of its strength to each classifier in the sequence, but it pays more to the classifiers at the beginning of the sequence, since that is when the bridging classifier's strength is the highest (just after receiving a payoff).

5 SEQUENCES WITH MULTIPLE MESSAGE SOURCES

In the experiments described in the previous section, each classifier (except the first one) in a sequence was activated solely by the message produced by the classifier that preceded it in the sequence. Those sequences implemented something akin to a reflex: once the first classifier in the sequence is activated by some signal from the environment, the rest of the classifiers in the chain are activated, one after another, until the last one fires, no matter what effect their actions are having. Since each classifier in the sequence used only messages from its predecessor, each paid its full bid to that predecessor, so that all classifiers in a sequence had the same fixed-point strength (ignoring the effects of bridging classifiers).

Another type of classifier sequence is one in which one or more classifiers after the first one in a sequence have conditions that match messages produced by sources other than their predecessors in the sequence. For example, some of the classifiers could have one condition that matches a message produced by its predecessor and a second condition that matches a detector message produced by the system's input interface.
Sequences of this type are non-reflex sequences: the system can monitor the effects of executing each step, so that if the sequence isn't producing the expected results (as indicated by messages on the message list), the sequence can be stopped or alternative steps can be executed. To test the effectiveness of the bucket brigade algorithm in allocating strength down non-reflex chains, the CFS-C/FSW1 system again was run in the finite state world shown in Figure 5. In these experiments the following nine classifiers were used for the top path:

    d0, d0 / c1, r = 1    (1)
    c3, d3 / c4, r = 1    (4)
    c6, d6 / c7, r = 1    (7)
    c(i-1), c(i-1) / ci, r = 1    (i = 2, 3, 5, 6, 8, 9)

Classifier 1 is matched by state W0. Once it is activated, classifier 2 is activated by the message produced by classifier 1, and then classifier 3 is activated by classifier 2's message. Classifier 4 is activated only if classifier 3 posted a message on the previous step and if the system is in state W3. Classifiers 5 and 6 then fire reflexively, and classifier 7 fires only if the system is in state W6. Classifiers 8 and 9 then fire reflexively. A similar set of classifiers (11 to 19) was included for traversing the bottom path. Classifiers 1 and 11 compete when the Markov process is in state W0 to guide the system down the top or bottom path, respectively.

Figure 9: Marginal performance and the strength of classifiers 1, 4, and 7 when the system is run with two competing non-reflex sequences of classifiers, in which classifiers 4 and 7 each use one detector message and one message from their predecessors in the sequence.

Figure 9 shows the results of running this set of classifiers in the world shown in Figure 5, with the payoff for state W19 = 100 and the payoff for state W9 changed from 100 to 400 at step 0.
The marginal performance (per 200 steps) is plotted against the major cycle step executed. The strengths of classifiers 7, 4, and 1 are also plotted. As expected, the strength of classifier 7 begins to rise first, since it is closest in the sequence to the reward state, followed by the strength of classifier 4 and then 1. When the strength of classifier 1 begins to rise, the marginal performance begins to rise, since classifier 1 begins to win the competition with classifier 11 more often, thus leading the system down the top path to the higher payoff. Note, however, that the fixed-point strength of classifier 4 is 1/2 that of 7, and the strength of classifier 1 is in turn 1/2 that of 4. The reason for this drop is that classifiers 4 and 7 pay only 1/2 of their bid to their predecessor classifiers; the other half of their bids is paid to the system for the detector messages that match those classifiers. In general, then, any sequence that has n classifiers that use messages not from their predecessors in the chain will have an exponential (in n) fall-off in strength down the sequence. In this experiment this exponential fall-off didn't hinder the ability of the system to respond to the change in payoffs at the end of the sequences, since both competing sequences contain the same number of classifiers using messages from multiple sources. In other classifier structures, where the competing sequences have different numbers of classifiers using messages from multiple sources, the results would be different.

Figure 10: Marginal performance and the strength of classifier 1 when the system is run with two competing non-reflex sequences of classifiers, with and without "bridging" classifiers.
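The factor-of-two fall-off can be read directly off the fixed-point balance: a classifier whose successor matches one predecessor message and one detector message receives only half that successor's bid, so each such classifier halves the equilibrium strength of everything upstream of it. A sketch of that backward calculation follows (a simplified model; the layout matches the nine-classifier top sequence above, in which classifiers 4 and 7 use detector messages):

```python
k = 0.1      # bid constant
R = 400.0    # payoff at the end of the sequence (after the change at step 0)

def fixed_point_strengths(uses_detector):
    """Walk backward from the payoff.  At the fixed point a classifier's
    bid (k * S) equals its income: the successor's full bid if the
    successor uses only its message, or half the successor's bid if the
    successor also matches a detector message (the other half is paid to
    the system for the detector message).
    uses_detector[i] is True if classifier i matches a detector message
    in addition to its predecessor's message (illustrative encoding)."""
    n = len(uses_detector)
    S = [0.0] * n
    S[-1] = R / k                        # last classifier: income is the payoff
    for i in range(n - 2, -1, -1):
        share = 0.5 if uses_detector[i + 1] else 1.0
        S[i] = share * S[i + 1]          # income = share of successor's bid
    return S

# Classifiers 4 and 7 (indices 3 and 6) use a detector message.
uses = [False] * 9
uses[3] = uses[6] = True
S = fixed_point_strengths(uses)
# S[6] (classifier 7) is twice S[3] (classifier 4), which is twice S[0].
```

The computed strengths reproduce the halving pattern reported above: 4000 for classifiers 7 through 9, 2000 for classifiers 4 through 6, and 1000 for classifiers 1 through 3, i.e., a (1/2)^n fall-off for n detector-using classifiers.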
Figure 9 also shows that the response time of the non-reflex sequence was about the same as was observed with the reflex-like sequence (described in the previous section). To see if bridging classifiers could speed up the learning rate in non-reflex sequences, the classifiers described above were run with the two bridging classifiers, 10 and 20, described in the previous section. Figure 10 shows the results obtained for the non-reflex sequence when run with and without bridging classifiers. As with the reflex-like sequences, the rate of learning with the non-reflex sequence is greatly increased by the use of a bridging classifier like classifier 10. With the bridging classifier, the strength of classifier 1 and the system's performance began to increase immediately, in the first 200 cycles, and the system reached 90% of its fixed-point performance rate in about 700 cycles. Also note that the fixed-point strength of classifier 1 is now 5000: the strength passed through the bridging classifier is able to overcome the exponential fall-off of strength observed in non-reflex chains without bridging classifiers.

6 SEQUENCES WITH SHARED PARTS

Because classifier systems exchange information through a fully open "blackboard", the message list, any classifier can be used in any context, as long as the message list contains messages that satisfy the classifier's conditions. One advantage of this architecture is that classifiers created for use in one domain can be used in other domains that are similar. For example, a group of classifiers (e.g., to control a robot's "hand") that were discovered while the system was learning to do one task could also be used to solve some other task, greatly reducing the time required to learn the second task. This "knowledge sharing" not only makes it possible to learn faster, it also leads to a more economical use of classifiers, since each situation will not require its own unique classifiers.
Figure 11: Paths leading to rewards at states W4 and W8.

In order to explore the effect of shared classifiers on the allocation of strength by the bucket brigade algorithm, the CFS-C/FSW1 system was run in the simple finite state world shown in Figure 11. There are 9 states in this world. The start state is W0. When the system is in W0, if the classifiers set the effector value r to 1, then the system goes to state W1 with probability 1; if the classifiers set r to 2, then the system goes to W5. Once the top path is chosen, the Markov process can be moved through the intervening states to W4 by continuing to set r to 1. If the bottom path is chosen, in order to move to W8 the system must first set r to 1 to get to W6 and then set r to 2 for the rest of the bottom path. When the process reaches state W4 or W8, it always returns to state W0. State W4 has a payoff of 100; all other states have 0 payoff. The CFS-C/FSW1 system was run in the world shown in Figure 11 with the following 8 classifiers:

    d0, d0 / c1, r = 1    (1)
    c(i-1), c(i-1) / ci, r = 1    (i = 2, 3, 4)
    d0, d0 / c5, r = 2    (5)
    c5, c5 / c6, r = 1    (6)
    c(i-1), c(i-1) / ci, r = 2    (i = 7, 8)

Classifiers 1 and 5 compete to select the path to be followed, i.e., 1 selects the top path and 5 selects the bottom. Once a path is chosen, the rest of the classifiers in each sequence (2-4 or 6-8) are executed in order. Note that there are no classifiers shared between the two sequences in this case. The system was also run with classifiers similar to those shown above, but with classifiers 2 and 6 replaced by the following single classifier:

    (c1|c5), (c1|c5) / c9, r = 1    (9)

Classifier 9 is shared by the two sequences: it matches messages produced by either classifier 1 or classifier 5, and sets the effector value r to 1. (Classifiers 3 and 7 were modified so they both respond to the message produced by classifier 9; pass-through symbols were used to ensure that classifiers 3 and 7 fire only when classifiers 1 and 5, respectively, fired two steps before.)
The set of classifiers with the shared classifier, 9, was also run with the following bridging classifiers:

    (c1 | m10), ~d4 / m10    (10)
    (c5 | m11), ~d8 / m11    (11)

Classifier 10 serves to bridge the top sequence, and classifier 11 serves to bridge the bottom sequence. All classifiers were started with a strength of 1000. The following table shows the results obtained for the three sets of classifiers:

                           P_fp    S_1    S_5
    No Sharing             3880   2048    651
    Shared Classifier      1990   1075   1065
    Shared, with Bridge    3913   6035   1677

When no classifiers are shared, the system basically achieves the optimal fixed-point performance (4000 per 200 steps). However, when a classifier is shared, the system's performance is about what it could achieve by choosing the path at random (2000). The performance with shared classifiers is consistent with the fixed-point strengths of classifiers 1 and 5, which are about the same. When those classifiers compete to select a path, they each win 50% of the time. The reason classifiers 1 and 5 have the same strength when classifier 9 is shared between the chains is simple: both 1 and 5 are paid only by classifier 9, and the fixed-point strength of classifier 9 is just the average of the payments made to it whenever it is used. Adding a bridging classifier rectifies this situation: classifier 1 now gets income from both classifier 9 and its bridge classifier, 10, and classifier 5 gets income from classifier 9 and its bridge, classifier 11. Since the bridging classifier 10 gets the 100 payoff, while classifier 11 gets 0, the strength of classifier 10 is greater than that of 11, and in turn the strength of classifier 1 is greater than that of 5. Thus adding a bridging classifier restores the fixed-point performance rate to near the maximum.
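The averaging effect of a shared classifier can be seen in a deliberately simplified fixed-point model. This sketch assumes full-bid payments only and ignores BidRatio, BidPow competition, and initial strengths, so the absolute numbers differ from the table above; the names S1, S5, S9 for the strengths are hypothetical labels mirroring the shared-classifier configuration.

```python
k = 0.1   # bid constant

def shared_chain_fixed_points(payoff_top, payoff_bottom, p_top=0.5):
    """Fixed-point strengths in a simplified model of the shared-classifier
    structure: two two-classifier tails (3-4 and 7-8) feed strength into
    the shared classifier 9, which in turn pays its bid to classifier 1 or
    5, whichever activated it."""
    S4 = payoff_top / k          # tail of the top sequence earns the payoff
    S3 = S4                      # reflex link: full bid passed back
    S8 = payoff_bottom / k       # tail of the bottom sequence
    S7 = S8
    # Classifier 9's income is the successor's bid on whichever path ran,
    # so its fixed point reflects the *average* payment it receives.
    income9 = p_top * (k * S3) + (1 - p_top) * (k * S7)
    S9 = income9 / k
    # Classifiers 1 and 5 are each paid classifier 9's bid when they fire,
    # so both reach the same fixed point no matter which path pays more.
    S1 = S5 = (k * S9) / k
    return S1, S5, S9

S1, S5, S9 = shared_chain_fixed_points(payoff_top=100.0, payoff_bottom=0.0)
```

Whatever the payoff asymmetry between the two paths, S1 and S5 come out identical: the shared classifier erases the information about which path leads to the payoff, which is exactly why the system falls back to choosing a path at random.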
7 CONCLUSIONS

The apportionment of credit problem is the problem of deciding, when many rules are active at every time step, which of those rules active at step t are necessary and sufficient for achieving some desired outcome at step t+n. In classifier systems using the bucket brigade algorithm, credit is allocated in the form of a value, strength, associated with each classifier. Because the bucket brigade algorithm uses only local information to incrementally change the strength of classifiers, one key problem for systems using the bucket brigade algorithm is how rapidly strength can be passed down long sequences of classifiers. If a whole sequence of classifiers must be activated many times in order to adjust the strength of a classifier at the beginning of the chain in response to a change in payoff associated with the last step in the sequence, the system's response to simple changes in its environment may be too slow to be useful. This paper has presented results that show the bucket brigade basically works as designed: strength is passed down a chain of coupled classifiers from those that receive the reward directly from the environment to those that are "stage setters". The system can thereby learn to respond to a change in the environment by choosing to activate one classifier rather than another, even when a reward is received only after a long sequence of classifiers is activated. However, it does seem to take a large number of trials before the classifiers at the beginning of a chain reach their fixed-point strengths, and so alter the choice of paths to take. While increasing the bid constant k can speed the flow of strength, a large k means classifiers are risking a large proportion of their strength on each bid they make, so that there is not much room for classifiers to make mistakes.
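The trade-off around the bid constant k can be illustrated with the simplified reflex-chain model (an illustration only, not the CFS-C system; the chain length, payoff, and 90% threshold are arbitrary choices): a larger k moves strength to the front of the chain in fewer passes, at the price of risking a larger fraction of strength on every bid.

```python
def passes_to_learn(k, n=9, R=100.0, threshold=0.9):
    """Count the passes down a reflex chain until the first classifier's
    strength reaches `threshold` of its fixed point R / k.  Each classifier
    pays its full bid (k * S) to its predecessor; the last one is paid R."""
    S = [0.0] * n
    target = threshold * R / k
    passes = 0
    while S[0] < target:
        bids = [k * s for s in S]
        for i in range(n):
            S[i] -= bids[i]
            S[i] += bids[i + 1] if i + 1 < n else R
        passes += 1
    return passes

slow = passes_to_learn(k=0.1)
fast = passes_to_learn(k=0.5)   # larger k: strength reaches the front sooner
```

In this model the larger bid constant needs far fewer passes, but on every activation each classifier stakes five times the fraction of its strength, so a run of losing bids is correspondingly more damaging.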
Increasing the BidPow parameter was shown to increase the fixed-point payoff rate the system will achieve, and to slightly decrease the system's response time to a change in payoff at the end of the sequence. The bucket brigade algorithm was shown to allocate strength down sequences that implement both reflex-like subroutines and non-reflex sequences. The system can respond to changes in payoff at the end of two competing non-reflex sequences just as fast as it can to changes that occur at the end of two competing reflex sequences. However, the fixed-point strengths of classifiers in non-reflex sequences fall off exponentially with each classifier that uses a message not from its predecessor in the sequence. This drop in strength could present problems for non-reflex sequences that try to compete with reflex-like sequences, and it could have effects on the creation and deletion of rules if strength is used to bias the rule discovery algorithms used by the system. One way to ameliorate the fall-off of strength in non-reflex chains that use only detector messages is to change the bucket brigade algorithm so that classifiers pay less than the full BidShare for detector messages. This would mean that classifiers that match detector messages will in general have higher fixed-point strengths than other classifiers. The higher fixed-point strengths may bias the system's rule discovery algorithms to create more classifiers that use detector messages, a bias which makes some sense, since the system should put a high priority on using messages from its environment. On the other hand, such a bias in favor of classifiers that use detector messages won't solve the exponential strength fall-off problem for non-reflex sequences that use messages from other classifiers. "Bridging" classifiers were shown to have many effects on the allocation of strength and the performance resulting from competing sequences of classifiers.
Bridging classifiers lead to a dramatic decrease in the number of passes the system must make down a sequence before it can respond to a change in the environment, for both reflex and non-reflex sequences. Bridging classifiers also allocate additional strength to the earlier classifiers in non-reflex sequences, overcoming the exponential fall-off of strength seen without such bridging classifiers. Note that classifier sequences with bridging classifiers require the same number of passes down the chain to respond to a change in payoffs at the end of the sequence, no matter how long the sequence is. For example, when the experiments described in Section 4 were repeated with sequences of 19 classifiers, the system's performance with bridging classifiers again began to increase almost immediately, within the first 15-20 trials, just as it did with sequences of 9 classifiers. Without bridging classifiers, the length 9 sequence required about 120 trials to respond, whereas the length 19 sequence required 250 trials. One way to decrease the response time further might be to use multiple bridging classifiers for each sequence. Each bridge would pass additional strength to the first classifier in the sequence, enabling it to dominate the competition sooner. There are many ways to implement bridging classifiers. Two ways were tried in experiments described in this paper. While each type of bridging classifier acted to decrease the system's response time, each also resulted in somewhat different distributions of strength over the classifiers in a sequence. These differences in the fixed-point strengths may have important effects on long-term learning in the system if strengths are used to guide the creation and deletion of classifiers. Other types of bridging classifiers should be tried to see what effects they have.
Finally, sharing classifiers between sequences could be a very good way to promote transfer of knowledge from one domain to another, and to economically use the same rules in more than one context. However, experiments described in this paper show that sharing classifiers can lead to problems for the allocation of strength down sequences of classifiers by the bucket brigade algorithm. Basically, sharing classifiers means the information being passed from classifier to classifier (in the form of strength) is lost, or at least greatly attenuated, when shared classifiers are involved. Simple bridging classifiers were shown to be one way to avoid the problem caused by shared classifiers.

REFERENCES

[Burks, 1986] Burks, Arthur W. "A Radically Non-Von Neumann Architecture for Learning and Discovery." In CONPAR 86: Conference on Algorithms and Hardware for Parallel Processing, September 17-19, Proceedings, 1-17. Wolfgang Handler, et al. (Eds.), Springer-Verlag, Berlin, 1986.

[Holland, 1985] Holland, J. H. "Properties of the Bucket Brigade." Proceedings of an International Conference on Genetic Algorithms and their Applications, 1-7. John J. Grefenstette (Ed.). Carnegie-Mellon University, Pittsburgh, 1985.

[Holland, 1986a] Holland, John H. "Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems." In Machine Learning: An Artificial Intelligence Approach, Volume II, Michalski, Ryszard S., Carbonell, Jaime G., and Mitchell, Tom M. (Eds.). Morgan Kaufmann Publishers, Inc., Los Altos, CA (1986).

[Holland and Burks, 1987] Holland, John H. and Burks, Arthur W. "Adaptive Computing System Capable of Learning and Discovery." United States patent applied for (1987).

[Holland, 1986b] Holland, John H., Holyoak, Keith J., Nisbett, Richard E., and Thagard, Paul A. Induction: Processes of Inference, Learning, and Discovery. The MIT Press, Cambridge, MA, 1986.

[Riolo, 1986] Riolo, Rick L.
"CFS-C: A Package of Domain Independent Subroutines for Implementing Classifier Systems in Arbitrary, User-Defined Environments." Logic of Computers Group, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, 1986.

[Riolo, 1987a] Riolo, Rick L. "CFS-C/FSW1: An Implementation of the CFS-C Classifier System in a Domain that Involves Learning to Control a Markov Process." Logic of Computers Group, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, 1987 [in prep.].

[Riolo, 1987b] Riolo, Rick L. "Bucket Brigade Performance: II. Simple Default Hierarchies." [In this proceedings.]

[Samuel, 1959] Samuel, A. L. "Some Studies in Machine Learning Using the Game of Checkers." IBM Journal of Research and Development, 3, 210-229 (1959).

[Wilson, 1986] Wilson, Stewart W. "Hierarchical Credit Allocation in a Classifier System." Research Memo RIS No. 37r. The Rowland Institute for Science, Cambridge, MA, 1986.

BUCKET BRIGADE PERFORMANCE: II. DEFAULT HIERARCHIES

Rick L. Riolo
The University of Michigan

ABSTRACT

Learning systems that operate in environments with huge numbers of states must be able to categorize the states into equivalence classes that can be treated alike. Holland-type classifier systems can learn to categorize states by building default hierarchies of classifiers (rules). However, for default hierarchies to work properly, classifiers that implement exception rules must be able to control the system when they are applicable, thus preventing the default rules from making mistakes. This paper presents results that show the standard bucket brigade algorithm does not lead to correct exception rules always winning the competition with the default rules they protect. A simple modification to the bucket brigade algorithm is suggested, and results are presented that show this modification works as desired: default hierarchies can be made to achieve payoff rates as near to optimal as desired.
1 INTRODUCTION

Any learning system that is to operate in environments with huge numbers of states must be able to categorize the states into equivalence classes that can be treated alike. For rule-based systems like classifier systems ([Holland, 1986a], [Burks, 1986], [Holland and Burks, 1987]) the problem involves finding a set of classifiers (condition/action rules) that induce the appropriate equivalence classes. One approach to this problem is to try to find a set of rules that never make mistakes and that partition the whole environment. Such a set of rules in effect establishes a homomorphic model of the world. The problem with this approach is that realistic environments typically involve millions of possible states, with very complicated underlying equivalence classes, of which the system may have sampled only a small fraction. In such situations it would take a vast number of rules to establish a homomorphic model of the world. Another approach is to implement a default hierarchy ([Holland, 1985], [Holland, 1986b]) of rules. A default hierarchy is a multi-level structure in which classifiers (rules) at the top levels are very general. Each general rule responds to a broad set of states, so that just a few rules can cover all possible states of the world. Of course, since a general rule responds in the same way to many states that don't really belong in the same category, it will often make mistakes. To correct the mistakes made by the general classifiers, lower level, exception rules are added to the default hierarchy. The lower level classifiers are more specific than the higher level rules: each exception rule responds to a subset of the situations covered by some more general rule.
Default hierarchies have several features that make them well suited for learning systems that must build models of very complex domains:

- Default hierarchies can be made as error-free as necessary, by adding classifiers to cover exceptions to the top level rules, to cover exceptions to the exceptions, and so on, until the required degree of accuracy is achieved.

- Default hierarchies are the basis for building quasi-homomorphic models of the world, which generally require far fewer rules to implement a given degree of accuracy than do equivalent homomorphic models [Holland, 1986b].

- Default hierarchies make it possible for the system to learn gracefully, since adding rules to cover exceptions won't cause the system's performance to change drastically, even when the new rules are incorrect.

This paper describes some simple experiments with default hierarchies implemented in the CFS-C/FSW1 classifier system [Riolo, 1987b]. These experiments show that when default hierarchies are built in a top down manner, by adding rules to cover exceptions, overall performance does improve as predicted. However, using the standard bucket brigade algorithm, the system does not achieve the performance expected. The reasons for this lower than expected performance are explained, and a modification to the bidding mechanism used in the standard bucket brigade algorithm is proposed. This modification is shown to lead to performance as close to the expected performance as desired.

* This work was supported by National Science Foundation Grant DCR 83-05830.

2 THE CFS-C/FSW1 SYSTEM

All experiments described in this paper were done using the CFS-C classifier system [Riolo, 1986], set in the FSW1 ("Finite State World 1") task environment [Riolo, 1987a]. This section briefly describes the parts of the CFS-C/FSW1 system that are relevant to the experiments described in this paper. For more details, see [Riolo, 1987b] or the cited documentation.
Basically, the FSW1 domain is a world that is modeled as a finite Markov process, in which a non-zero payoff is associated with some states. The classifier system's input interface provides a message that indicates the current state of the Markov process. The classifier system's effector interface provides the system with a way to alter the transition probabilities of the process, so that the system can control (in part) the path taken through the finite state world. Basically, a message that begins with "10" is interpreted by the effector interface as a command to set the effector to some value r, r = 0, 1, ..., 15, depending on the rightmost 4 bits of the message. The effector setting r is used to select the transition matrix P(r), which specifies the probability the system will go from the current state W_i to some state W_j. When the system moves to a new state with a non-zero payoff, that payoff is given to the active classifiers as a "reward". The CFS-C classifier system is a standard, "Holland" type learning classifier system. While the CFS-C classifiers all have two conditions, in this paper they will be treated as if they have only one condition. (This is done by making both conditions identical.) For the experiments described in this paper, no classifiers are coupled, i.e., no classifier is satisfied by a message produced by another classifier. All classifiers match only detector messages, i.e., the classifiers are matched when the current state of the finite state process is in the set of states matched by the classifier's condition. The most important parts of the classifier system are (a) the mechanism implementing bidding and competition to post messages and to set the effectors, and (b) the allocation of payoff to classifiers from the environment.
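The effector interface's decoding of messages can be sketched as follows. Only the leading "10" tag and the rightmost 4 bits are taken from the description above; the 16-bit message length in the example is an assumption.

```python
def effector_value(message):
    """Interpret a binary message string at the FSW1 effector interface:
    a message whose first two bits are "10" is an effector command, and
    the rightmost 4 bits give the effector setting r (0..15).  Returns
    None for non-effector messages.  (A sketch of the interface as
    described in the text, not the CFS-C implementation.)"""
    if not message.startswith("10"):
        return None
    return int(message[-4:], 2)

# A message ending in "0011" commands the effector setting r = 3.
r = effector_value("1000000000000011")
```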
In the CFS-C/FSW1 system the bid of classifier i at step t, B_i(t), is calculated as follows:

    B_i(t) = k * S_i(t) * BidRatio_i

k is a small constant (set to 0.1), which acts as a "risk factor", i.e., it determines what proportion of a classifier's strength it will bid, and so perhaps lose, on a single step. S_i(t) is the strength of classifier i at step t. BidRatio_i is a number between 0 and 1 that is a measure of the classifier's specificity, i.e., how many different messages it can match. A BidRatio of 1 means the classifier matches exactly one message, while a BidRatio of 0 means the classifier matches any message. Thus classifiers that implement high level, general rules in a default hierarchy will have low BidRatios, while classifiers that implement lower level, exception rules will have higher BidRatio values. When a competition is run to determine which classifiers are to be activated and post messages, the probability that a given (satisfied) classifier i will win is:

    Prob(i wins) = b_i(t) / SUM_j b_j(t)

where j ranges over all bidding (satisfied) classifiers at t, and b_i(t), the effective bid of classifier i at t, is:

    b_i(t) = B_i(t)^BidPow

BidPow is a parameter that can be set to alter the shape of the probability distribution used to choose classifiers to produce messages. In all experiments described in this paper, BidPow = 3. Note that the output interface of the CFS-C/FSW1 system may have to resolve conflicts, e.g., when one classifier produces a message that says "set the effector value r to 1" and another produces a message that says "set the effector value r to 2". Since the effector can only be set to one value at a time (just as we can either lift our arm or lower it, and not both), an effector conflict resolution mechanism must be used. Basically, the value of r is chosen probabilistically, with the probability that r = r' equal to:

    Prob(r = r') = SUM_{m'} b_{m'}(t) / SUM_m b_m(t)

where m'
ranges over the effector messages that say "set r to r'", b_{m'}(t) is the effective bid of the classifier that posted message m', and m ranges over all effector messages. Once an effector value is chosen, all messages that are inconsistent with that setting are deleted from the new message list. When a classifier wins a competition and posts a message, its strength is decremented by the amount it bid to become active. Since there are no coupled classifiers, classifiers only receive payments from the environment, when the system moves to a state with an associated non-zero payoff. In particular, the full payoff is added to the strength of every classifier that has posted a message during that time step. Because the bids made by classifiers are not paid to other classifiers in the experiments described in this paper, the strength at step t+1 of a classifier i that produced one or more messages at step t is:

    S_i(t+1) = S_i(t) - B_i(t) + R(t)

where B_i(t) is the classifier's bid at step t and R(t) is the reward from the environment at step t. Note that the fixed-point strength of classifier i, S_{i,fp}, is inversely proportional to its BidRatio_i. In particular,

    S_{i,fp} = I_i / (k * BidRatio_i)

where I_i is the average amount paid to classifier i whenever it posts messages. Other things being equal, a general classifier (with a low BidRatio) will have a higher fixed-point strength than a more specialized classifier (with a higher BidRatio).

Figure 1: A simple FSW1 finite state world.

3 A SIMPLE TEST WORLD

In order to examine default hierarchies in the CFS-C/FSW1 system, consider the simple FSW1 world shown in Figure 1. There are seven states in this world, W_i, i = 0...6. State W0 is the start state. States W4 and W5 have payoffs of 200 and 400, respectively; all other states have zero payoff. For i = 4, 5, or 6, and any effector value r, p_ij(r) = 0.25, j = 0...3.
That is, when in state W4, W5, or W6, the system has an equal chance of going to any one of the four states on the left in Figure 1, and no chance of going to a state on the right, no matter what the value of r. When in state W0 or W1, if r = 1 the system goes to W4; if r = 2, the system goes to W6. When in state W2 or W3, if r = 2 the system goes to W4; if r = 1, the system goes to W6. And when in state W3, the system goes to W5 only if r = 3. (Other values of r are not allowed in the experiments described below.) The system can do best if it sets r to 1 when in states W0 or W1, sets r to 2 when in state W2, and sets r to 3 when in state W3. A good set of classifiers for this world would classify the states on the left into three categories and respond accordingly. (Since the system can't alter the transition probabilities when in any of the states on the right, no rules categorizing those states can be more useful than any others.) While this world is too small to show how a default hierarchy (i.e., a quasi-morphism) can require fewer classifiers than a homomorphic model, it is complex enough to show the relationship between default and exception rules. As a simple test of the CFS-C/FSW1 system, it was run in the world shown in Figure 1 using the following three classifiers:

    d01 / r = 1    (1)
    d2 / r = 2    (2)
    d3 / r = 3    (3)

(The numbers in parentheses on the right serve to identify the classifiers.) The condition "d01" matches two detector messages, for states W0 or W1. The conditions d2 and d3 each match one detector message, i.e., the detector messages for states W2 and W3, respectively. The action "r = x" posts an effector message that sets the effector value r to x, x = 1, 2, or 3. In this and other experiments described in this paper, the system was run for 2000 major-cycle steps. In all runs classifiers reached their fixed-point strengths within the first 1000 steps.
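Under this description of the transitions, the expected per-step payoff of any rule set can be computed directly: the right-hand states return uniformly to W0-W3, so payoff arrives on every other step, and a policy's rate is half the average payoff it reaches from a random left-hand state. A sketch follows (the transition table is reconstructed from the prose above, so treat its details as an assumption):

```python
# Sketch of the Figure 1 world: left states W0-W3, right states W4-W6,
# payoff 200 at W4 and 400 at W5.  From W0/W1, r = 1 leads to W4; from
# W2/W3, r = 2 leads to W4; from W3, r = 3 leads to W5; other allowed
# settings lead to the zero-payoff state W6.
PAYOFF = {4: 200.0, 5: 400.0, 6: 0.0}

def next_right_state(state, r):
    if state in (0, 1):
        return 4 if r == 1 else 6
    if state == 2:
        return 4 if r == 2 else 6
    # state == 3
    if r == 3:
        return 5
    return 4 if r == 2 else 6

def per_step_payoff(policy):
    """policy maps each left state to an effector setting; every other
    step the process is in a uniformly random left state, so the per-step
    rate is half the average payoff reached from the left states."""
    avg_left = sum(PAYOFF[next_right_state(s, policy[s])] for s in range(4)) / 4
    return avg_left / 2

default_only   = {0: 1, 1: 1, 2: 1, 3: 1}   # the general rule alone
with_exception = {0: 1, 1: 1, 2: 2, 3: 2}   # default plus one exception
full_hierarchy = {0: 1, 1: 1, 2: 2, 3: 3}   # exception to the exception added
```

These three policies yield expected rates of 50, 100, and 125 per step, matching the expected values quoted for the rule sets examined in the text.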
The average performance of the system over steps 1500 to 2000 is used to establish the "fixed-point" performance, P_fp, for one run. All results presented are the average of 10 runs (each started with a different pseudo-random number generator seed). For the three classifiers shown above, P_fp was 124.4 per step (StdDev = 1.3). As can be easily calculated, the expected value is 125 (the system gets a non-zero payoff at most once every two steps). Thus these three rules perform just as expected. Of course these rules do not implement a default hierarchy: there is a general rule, 1, but it never makes mistakes, and the other rules do not cover subsets of the cases covered by the general rule. Instead, these rules implement a homomorphic model, since the rules partition the space of possible states (with respect to the 4 states on the left in Figure 1, the states the system is modeling). The next section shows how a default hierarchy can be built to cover the same states.

4 RESULTS

To test the performance of a default hierarchy in the world shown in Figure 1, the CFS-C/FSW1 system first was run with just one classifier:

    d0123 / r = 1    (BidRatio = 0.56)    (4)

This general classifier clearly implements a high level default rule for this world, namely "When in any state W_i, i = 0...3, set r to 1". The expected fixed-point performance for a system using just this rule is 50, since the system will get a reward only when it moves from state W0 or W1 to W4, which it will do 50% of the time. The observed fixed-point performance was 49.9, or 99.8% of the expected value (StdDev = 1.9%). The classifier's fixed-point strength is 1765, which is also exactly the expected strength. To improve the system's performance, it was run again with two rules, the above default rule, 4, and the following "exception" rule:

    d23 / r = 2    (BidRatio = 0.75)    (5)

This classifier covers a subset of the states covered by classifier 4.
For that subset, it corrects a mistake made by classifier 4, since it sets r to 2, and so causes the system to go from state W2 or W3 to W4 (instead of to W5). The expected payoff for a system using the default hierarchy implemented by these two rules is 100 per step, since every other step the system should receive the 200 payoff from state W4. The observed fixed-point payoff was 83.9, or just 83.9% of the expected rate (StdDev = 1.8%). The fixed-point strengths of classifiers 4 and 5 were 2661 and 2667, respectively. Note that the strength of the default rule 4 went up as predicted [Holland, 1985] when the exception rule 5 is added to the system, but it did not go up as high as expected (to 3571). The system also was run with an additional classifier:

    d3/r = 3   (BidRatio = 1)   (6)

This rule covers another exception, so that the system can get to the high-payoff state W6 whenever it is in state W3. Together rules 4, 5, and 6 implement a complete default hierarchy for this simple world. The expected payoff for these classifiers is the same as for the original 3 perfect rules (1, 2, and 3), i.e., 125 per step. However, the observed performance was just 108.7, or 86.9% of the expected value (StdDev = 1.3%). The fixed-point strengths of classifiers 4, 5, and 6 were 2860, 2667, and 4000, respectively. Why is the performance lower than expected? First, note that the fixed-point strengths of classifiers 5 and 6 are just as expected, given the amount of payoff they receive. On the other hand, classifier 4, the top level default rule, has a fixed-point strength that is much lower than its expected value, 3571. Thus classifier 4 must be making mistakes. To get a better idea of what is happening consider Figure 2, which shows the results of a run using just two classifiers, 4 and 5. Figure 2 shows the marginal performance of the system plotted against major-cycle steps executed. It also shows the strengths for classifiers 4 and 5.
First, note that the strength of classifier 5 stabilizes at 2667, which is the maximum strength it can reach given its maximum average income (200 per bid) and its BidRatio, 0.75. On the other hand, the strength of classifier 4 oscillates, and it never reaches its expected value. Instead, as soon as the strength of 4 gets much above the strength of classifier 5, performance begins to drop (as does the strength of classifier 4). This co-oscillation of the strength of classifier 4, the general default rule, and marginal performance is the key to why the performance of this default hierarchy is lower than expected. Recall that in classifier systems there is a competition to post messages and control effectors: the higher a classifier's bid, the more likely it is to win the competition and control the system's behavior. Also recall that when a classifier loses the competition to set an effector, its messages are deleted from the message list and it does not pay its bid to its suppliers.

[Figure 2: Performance of a default hierarchy using the standard bucket brigade algorithm. Plotted against cycle step: marginal payoff to system, strength of default classifier (#1), strength of exception classifier (#2).]

Bearing these facts in mind, the reasons for the oscillations and sub-expected performance are clear:

1. When the strength of an exception rule (rule 5) is greater than the strength of the default rule for which it is an exception (rule 4), the exception rule tends to control behavior in the states it covers. The exception tends to protect the default from making mistakes, which allows the average income for the default rule to reach its maximum.

2. Since the default rule has a smaller BidRatio than the exception rule that protects it, if the average incomes of the rules are about the same, the maximum fixed-point strength of the default rule will be greater than the maximum for the exception rule.
In this example, as rule 5 protects 4, the strength of rule 4 eventually exceeds that of rule 5 (since both have a maximum income of 200 per bid).

3. As the strength of the default classifier rises, its bid and effective bid will rise. Once the effective bid of the default rule is near to or greater than the effective bid of the exception rule, the default rule will begin to win the competition. That is, the exception rule will not always win the competition in the situations it covers, which means it will not be able to stop the default rule from making mistakes.

4. Once the default rule begins making mistakes, both the system performance and the strength of the default rule will begin to fall. Eventually the system returns to the beginning of the performance oscillation, when the strength of the default rule is enough lower than that of the exception rule so that the exception rule can again protect the default from making mistakes.

In short, the protection an exception-covering rule provides for a default rule allows the strength of the default rule to rise until the exception rule can no longer protect it. This will almost always happen in default hierarchies, since the maximal fixed-point strength of a general default classifier is in general higher than that of a more specific, exception classifier. One way to correct this problem would be to alter the bucket brigade so that the maximum fixed-point strength of general classifiers is not higher than that for more specialized classifiers. For example, payments to classifiers could be biased so that a lower BidRatio leads to a smaller share of the payment from other classifiers or from environmental rewards. Lowering the fixed-point strength of general classifiers may create some other problems, however.
For instance, since default rules tend to make mistakes more often than more specialized rules, the fixed-point strength of a default rule should be relatively high, so that it can afford to make mistakes (and lose strength) without having its strength go so low that it is eliminated from the system. Another approach is to leave the relationship between the fixed-point strengths of general versus specific classifiers the same, but to bias the effective bid against the general classifiers. Since the effective bid only changes the probability distribution used to determine which classifiers post messages and control the system's effectors, it does not alter a classifier's per bid income or payment. Thus general classifiers will still have relatively high fixed-point strengths, even though specialized classifiers will tend to win the competition more often. To test this idea, the way effective bids are calculated in the CFS-C/FSW1 system was changed by adding a factor involving a classifier's BidRatio. In particular, the effective bid, eB_i(t), of a classifier i at step t is now calculated as follows:

    eB_i(t) = B_i(t)^BidPow * BidRatio_i^EBRPow

The value of EBRPow can be changed to alter the shape of the probability distribution used to choose the classifiers that are to produce messages. For example, if BidPow = 1 and EBRPow = 0 then eB_i(t) = B_i(t), i.e., a classifier's probability of producing messages is just its bid divided by the sum of bids made by all satisfied classifiers. Setting EBRPow equal to 1, 2, and so on, makes it less likely that general classifiers (those with lower BidRatios) will win the competition with more specific, co-active classifiers.
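The effective-bid modification can be illustrated directly. A minimal sketch follows; the function names are ours, and giving the two rules equal raw bids is an illustrative assumption:

```python
def effective_bid(bid, bid_ratio, bid_pow=1.0, ebr_pow=0.0):
    """Effective bid as in the text: eB = bid^BidPow * BidRatio^EBRPow."""
    return (bid ** bid_pow) * (bid_ratio ** ebr_pow)

def win_probabilities(classifiers, ebr_pow):
    """Probability that each (bid, bid_ratio) classifier wins the
    message competition, assuming wins are drawn in proportion to
    effective bid."""
    ebs = [effective_bid(b, r, ebr_pow=ebr_pow) for (b, r) in classifiers]
    total = sum(ebs)
    return [eb / total for eb in ebs]

# Default rule 4 (BidRatio 0.56) vs. exception rule 5 (BidRatio 0.75),
# with equal raw bids (an illustrative assumption).
rules = [(100.0, 0.56), (100.0, 0.75)]
p0 = win_probabilities(rules, ebr_pow=0)  # standard bucket brigade
p3 = win_probabilities(rules, ebr_pow=3)  # biased against the general rule
```

With EBRPow = 0 the two rules win equally often; with EBRPow = 3 the more specific rule's share of wins rises sharply, which is exactly the bias the modification is meant to introduce.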
To test the effects of this modification, the system was run in the same finite state world described earlier, using classifiers 4, 5, and 6, which implement a simple 3 level default hierarchy. The following table shows the results obtained using various values for EBRPow:

    EBRPow   % Expected Pfp (1 StdDev)   S4
    0        86.9 (1.3)                  2860
    1        92.7 (1.3)                  3104
    2        95.8 (0.9)                  3280
    4        98.6 (1.4)                  3479
    6        99.5 (1.8)                  3553

As can be seen, raising the EBRPow parameter increases the average fixed-point performance of the system to virtually a mistake-free level. Also note that the fixed-point strength of classifier 4, the default rule, increases to almost its maximum value, 3571.

[Figure 3: Performance of a default hierarchy using the modified bucket brigade algorithm with EBRPow = 3. Plotted against cycle step: marginal payoff to system, strength of default classifier (#1), strength of exception classifier (#2).]

Figure 3 shows the results of repeating the experiment shown in Figure 2 with EBRPow = 3. The oscillations of both the marginal performance and the strength of classifier 4 are all but eliminated. Thus with a high EBRPow, the exception classifier 5 is able to almost always win the competition with the default rule it protects, in which case the system never makes mistakes. The system was also run using the same three classifiers in two slightly different versions of the world shown in Figure 1. First, the system was run in a world in which setting the effector value r to 2 in states W2 and W3 causes the system to go to another state, which has a payoff of 100. That is, an exception rule like classifier 5 leads to a lower payoff than the default rule does when it doesn't make a mistake. In this world the expected performance (per step) is 112.5, while the observed performance with EBRPow = 0 was 97.7 (86.9% of expected performance, StdDev = 1.8).
However, with EBRPow = 3 the observed performance was 109.0 (97.5% of the expected value, StdDev = 1.9). Thus the default hierarchy performs just as well when the exception rule leads to a lower payoff than the default rule does when it is correct. Second, the system was run in a finite state world like that shown in Figure 1, except that various amounts of uncertainty are introduced into the transitions. For example, instead of going from state W0 to state W4 with probability 1.0 when r is set to 1, the system will go to that state with probability 0.92 and go to one of the other states on the right with probability 0.04 each; i.e., 8% of the time the system will go somewhere unexpected. The following table shows the results of running the three classifiers 4, 5, and 6 that implement the simple default hierarchy described earlier, in worlds with increasing amounts of uncertainty (using EBRPow = 3):

    % Uncertainty   Expected Payoff   Observed Payoff (1 StdDev)   % Optimal Payoff
    0               125               122 (3.44)                   97.6
    8               116               115 (3.64)                   99.1
    16              107               105 (3.64)                   98.1
    32              89.0              87.5 (5.52)                  98.3

As can be seen, uncertainty in the environment has little effect on the system's ability to obtain the best possible payoff rate. This is an important property for any learning system that must contend with very complex environments in which it can never completely reduce the uncertainty.

5 CONCLUSIONS

Default hierarchies are an excellent way for classifier systems to cope with very complex environments. However, for default hierarchies to work properly, classifiers that implement "exception" rules must be able to control the system when they are applicable, thus protecting the default rules from making mistakes. This paper has presented results that show the bucket brigade algorithm as described in [Holland, 1985] does not lead to correct exception rules always winning the competition with the default rules they protect.
A simple modification to the bucket brigade algorithm is suggested, which involves biasing the calculation of the effective bid so that general classifiers (those with low BidRatios) have much lower effective bids than do more specific classifiers (those with high BidRatios). Results are presented that show this modification works as desired: default hierarchies can be made to achieve payoff rates as near to optimal as desired. Since the modification to the bucket brigade algorithm only changes the effective bid made by classifiers, the allocation of strength under the bucket brigade is not altered (except insofar as the fixed-point strengths of default rules are increased to their maximum expected levels). Also, if the system does not have more specialized exception classifiers to compete with a particular default rule, that default rule will continue to control the behavior of the system. The modified bucket brigade algorithm only reduces the probability that default rules will post messages when they are competing with co-active exception rules. The classifier system also is shown to work as expected in environments with varying amounts of uncertainty, and when the payoff received for activating a correct exception rule is less than the payoff received by the default rule it protects when the default is used in a correct situation.

REFERENCES

[Burks, 1986] Burks, Arthur W. "A Radically Non-Von Neumann Architecture for Learning and Discovery." In CONPAR 86: Conference on Algorithms and Hardware for Parallel Processing, September 17-19, Proceedings, 1-17. Wolfgang Handler, et al. (Eds.). Springer-Verlag, Berlin, 1986.

[Holland, 1985] Holland, J. H. "Properties of the Bucket Brigade." Proceedings of an International Conference on Genetic Algorithms and their Applications, 1-7. John J. Grefenstette (Ed.). Carnegie-Mellon University, Pittsburgh, 1985.

[Holland, 1986a] Holland, John H.
"Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems." In Machine Learning: An Artificial Intelligence Approach, Volume II, Michalski, Ryszard S., Carbonell, Jaime G., and Mitchell, Tom M. (Eds.). Morgan Kaufmann Publishers, Inc., Los Altos, CA (1986).

[Holland, 1986b] Holland, John H., Holyoak, Keith J., Nisbett, Richard E. and Thagard, Paul A. Induction: Processes of Inference, Learning, and Discovery. The MIT Press, Cambridge, MA, 1986.

[Holland and Burks, 1987] Holland, John H. and Burks, Arthur W. "Adaptive Computing System Capable of Learning and Discovery." United States patent applied for (1987).

[Riolo, 1986] Riolo, Rick L. "CFS-C: A Package of Domain Independent Subroutines for Implementing Classifier Systems in Arbitrary, User-Defined Environments." Logic of Computers Group, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, 1986.

[Riolo, 1987a] Riolo, Rick L. "CFS-C/FSW1: An Implementation of the CFS-C Classifier System in a Domain that Involves Learning to Control a Markov Process." Logic of Computers Group, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, 1987 [in prep.].

[Riolo, 1987b] Riolo, Rick L. "Bucket Brigade Performance: I. Long Sequences of Classifiers." [In these proceedings.]

MULTILEVEL CREDIT ASSIGNMENT IN A GENETIC LEARNING SYSTEM

John J. Grefenstette
Navy Center for Applied Research in AI
Naval Research Laboratory
Washington, DC 20375-5000

Abstract

Genetic algorithms assign credit to building blocks based on the performance of the knowledge structures in which they occur. If the knowledge structures are rule sets, then the bucket brigade algorithm provides a means of performing additional credit assignment at the level of individual rules.
This paper explores one possibility for using the fine-grained feedback provided by the bucket brigade in genetic learning systems that manipulate sets of rules.

1. Introduction

There are two distinct approaches to machine learning using Genetic Algorithms (GA's). These approaches are popularly denoted by the names of the universities where they were first elaborated. In the Michigan approach, first described by Holland and Reitman [Holland 78], the population comprises a single set of production rules, or classifiers. Each individual rule is assigned a measure, called strength, that indicates the utility of the rule to the system's goal of obtaining external payoff. The bucket brigade algorithm [Holland 86] shifts strength among the rules during the course of problem solving, so that rules achieve high strength either by obtaining direct payoff from the task environment or by setting the stage for later rules. New rules are discovered by genetic operators applied to existing rules on the basis of strength. In the Pitt approach, developed by De Jong and Smith at the University of Pittsburgh [Smith 80], each structure in the population maintained by the GA represents a production system program, i.e., a set of rules. Each structure is evaluated by running it through a production system interpreter in the environment of the learning task and measuring various aspects of the system's performance. As the result of each evaluation, a fitness measure is assigned to the entire program and is used to control the selection of structures for reproduction. The usual genetic search operators -- crossover, mutation and inversion -- are applied to the structures without reference to the performance of the individual rules comprising the structures.
These two approaches have provided a basis for several successful learning systems [Booker 82, Goldberg 83, Schaffer 84, Smith 80, Wilson 85], each of which incorporates significant additions to the basic outline given above. It should be emphasized that both approaches are undergoing constant evolution and that the relative merits of the two approaches are a topic of continuing interest. This paper describes the first attempt to combine the strongest features of each approach in a single learning system. A primary strength of the Michigan approach derives from its use of the bucket brigade algorithm [Holland 85]. This elegant scheme achieves a distribution of credit among a large rule set without the need for costly bookkeeping overhead, e.g., the traces of system behavior required of some learning systems. The bucket brigade algorithm supports an important theme in massive parallelism: that effective global patterns of behavior (e.g., default hierarchies) can emerge from vast numbers of local competitions. From a GA perspective, however, the application of genetic operators in the Michigan approach is problematic. The framework described in [Holland 75] provides a theory for the evolution of co-adapted sets of alleles within the structures, arising as a result of competition among the independent structures in the population. There is no theory provided for cooperation among the members of the population, although this might be an interesting extension. As a result, instead of emerging as co-adapted alleles, co-adaptation among rules in the Michigan approach is typically achieved by additional mechanisms, e.g., triggered genetic operators and bridge classifiers [Holland 86, Riolo 87], that specifically encourage rule linkages. In contrast, the Pitt approach corresponds more closely to the biological metaphor of evolution through natural selection.
The knowledge structures in the Pitt approach correspond to chromosomes in a population of competing organisms. The performance measure corresponds to an organism's fitness to its environment. In the Pitt approach, co-adapted rules correspond to co-adapted alleles in [Holland 75]. Holland makes a strong case for viewing adaptation as an optimization problem over the response surface defined by the payoff provided by the task environment and argues that GA's are suitable for performing the optimization. The Pitt approach to learning rule sets seems to be a plausible, albeit ambitious, extension of the successful use of GA's for parameter optimization problems. However, there are two primary bottlenecks in the Pitt approach to learning: computational resources and the feedback bandwidth. Extensive computational resources are required to evaluate thousands of complete rule sets. Two lines of current research address this issue. First, it can be shown that GA's are especially well-suited for optimization using Monte Carlo techniques in the evaluation phase [Grefenstette 87b]. This follows from the observation that even if individual structures are evaluated by probabilistic procedures, the expected error in the observed performance of hyperplanes is much less than the expected error in the performance of the individual structures. The implication for the Pitt approach is that the amount of problem solving activity required to test individual rule sets can be fairly limited without adverse effects on the overall performance of the learning system. A second means of dealing with the computational burden of the Pitt approach is through the use of increasingly available coarse-grained parallel computers with 50 to 100 processors. It is a straightforward task to install a production system interpreter on each node of a MIMD machine and thereby perform all of the structure evaluation for a given generation in parallel.
Several papers in this volume explore this approach. The second bottleneck in the Pitt approach is the limited feedback bandwidth. In LS-1 [Smith 80] a critic provides feedback concerning the performance of each program. The critic includes not only a measure of the task level performance, but also measurements of various dynamic characteristics of the behavior during the problem solving task, such as the number of rules that fire, the number of rules suggesting external actions, and so on. All of this information is combined into a single scalar value to represent the fitness of the rule set under evaluation. Schaffer's LS-2 [Schaffer 84] expands the feedback bandwidth by allowing the critic to provide a vector of performance measures. Subpopulations are selected on the basis of each element in the performance vector, providing the equivalent of environmental niches in which specialists can evolve. Recombination in LS-2 takes place without regard to the subpopulation boundaries. The result is an effective search for Pareto-optimal regions of the search space. Yet another approach to increasing the detail of credit assignment in GA's is represented by recent attempts at the traveling-salesperson problem (TSP). In the TSP, the usual fitness measure is the overall tour length. However, this fitness measure does not provide sufficient selective pressure toward high performance tours. In one approach to this problem [Grefenstette 87a], the crossover operator takes the length of competing parental edges into account when constructing the offspring tour. This represents a low level credit assignment strategy for directly promoting high-performance first-order hyperplanes. In this paper, we explore a similar multilevel credit assignment strategy in the context of the Pitt approach to learning production system programs. Figure 1 illustrates the overall system design.
Our approach is to use the bucket brigade in the course of evaluating production sets to assign credit at the level of individual rules. The GA uses the overall performance of the rule set to guide selection, as in LS-1. But in addition, the strengths of the individual rules influence the physical representation of the structure, making it more likely that high strength rule combinations survive the crossover operation. That is, we are proposing a heuristic form of the inversion operator, called clustering, based on the feedback from the bucket brigade algorithm. The description of the system will proceed bottom-up: We first discuss the object level task and the representation of rules. This is followed by the descriptions of the problem solver and the learner. Experimental results are presented, followed by a discussion of future directions for this research.

2. The State Space Search Task

In the interests of obtaining general results, the object level task for our learning system is an abstract state space search characterized by the triple <S, O, P>, where S is a set of states, O is a set of state transition operators, and P is a mapping of payoff to the states. Certain states in S are initial states, others are final states. At the start of a task, a random initial state is chosen. During each problem solving step, the problem solver selects an operator that is then applied to the current state to produce the new state. A task is complete when the new state is one of the final states. One particular problem is shown in figure 2. The set of initial states is {0, 1, 2, 3} and the set of final states is {12, 13, 14, 15}. The arcs are labeled by the operator(s) that perform the transition. In this problem, a payoff of 1000 is associated with state 13. All other states have a payoff of 0. The object for the learning system is to learn a set of heuristic rules that select operators for the current state.
The rules have the form:

    IF the current state is in the set S_i,
    THEN choose an operator from the set O_i.

The action associated with such a rule is to choose one of the operators in the set O_i at random. The knowledge representation employed is an important bias for any learning system. We now describe our representation of the heuristic rules. In most state space search problems, the number of states is very large. In these cases, it is infeasible to specify arbitrary sets of states in the conditions of a heuristic rule. However, states can often be usefully characterized by a set of features. Furthermore, it is reasonable to assume that heuristic rules for selecting operators may usefully attend to some features and ignore others. In fact, Lenat cites the following as one of the central assumptions underlying the use of heuristics: If an action A is appropriate in situation S, then A is appropriate in most situations that are very similar to S [Lenat 83]. So the left hand sides of our rules will contain patterns that match the feature vectors of states, using the symbol # to indicate that the value of the corresponding feature is irrelevant. For ease of presentation, in this paper we identify the name of each state with its binary feature vector. On the other hand, state transition operators do not usually have features that capture their similarity. Furthermore, the number of operators is usually smaller than the number of states by several orders of magnitude. Given these considerations, it seems reasonable to permit the right hand sides to specify arbitrary sets of operators. We do so by interpreting the right hand side as a membership vector for the set of operators. In the example problem, the operator set is {a, b, c, d}. So, for example, the rule

    0##1 -> 1010

represents the heuristic

    IF the current state is in the set {1, 3, 5, 7}
    THEN choose an operator from the set {a, c}.

A problem solver that uses this representation is described next.

3.
The Problem Solving Level

The problem solver for this system is based in part on Riolo's CFS-C system [Riolo 86]. As that system is described in detail elsewhere in this volume, we restrict our discussion to the modifications implemented for this work. At the start of each problem solving cycle, a detector message indicating the current state is posted on the message list. Each rule whose left hand side matches the current state produces a bid that is proportional to the rule's strength. (The system currently ignores the specificity of the rules in computing bids.) A competition, based on bid size, is held among the bidders to see which rules get to produce a message. (The maximum number of messages produced is fixed.) Effector conflicts among the resulting messages are resolved by a competition based on the total bids associated with each distinct message. Once a winning message is selected, one of the operators indicated by the message is selected at random and applied to the current state to produce the new state. Rules that produce messages consistent with the selected operator are said to be active. Our implementation of the bucket brigade is similar to the one in Wilson's animat system [Wilson 85]:

1) each rule producing a message has the amount of its bid deducted from its strength;

2) the total of the bids thus collected is distributed evenly among the rules active on the previous cycle;

3) any external reward associated with the new state is distributed evenly among the currently active rules.

Note that it is possible that no rule matches the current state. In this case, a randomly chosen operator is used to produce the next state. Since there are no active rules, no external payoff enters the system. It is acknowledged that many of the decisions regarding our implementation of the bucket brigade are ad hoc and that current research may well suggest significant improvements.
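The three credit-assignment steps above can be sketched as a single update function. This is a sketch, not the authors' code: the proportionality constant for bids (bid_frac) is an assumption, since the text only says bids are proportional to strength:

```python
def bucket_brigade_step(strengths, bidders, prev_active, active, reward,
                        bid_frac=0.1):
    """One credit-assignment cycle of the Wilson-style bucket brigade
    described in the text.  strengths maps rule name -> strength;
    bidders are the rules that produced messages this cycle;
    prev_active/active are the rules active on the previous/current cycle.
    bid_frac is an assumed constant of proportionality for bids."""
    # 1) each message-producing rule pays its bid.
    pot = 0.0
    for r in bidders:
        bid = bid_frac * strengths[r]
        strengths[r] -= bid
        pot += bid
    # 2) the collected bids are split evenly among the previously active rules.
    if prev_active:
        share = pot / len(prev_active)
        for r in prev_active:
            strengths[r] += share
    # 3) external reward is split evenly among the currently active rules.
    if active:
        share = reward / len(active)
        for r in active:
            strengths[r] += share
    return strengths
```

For example, with two rules at strength 100, if r1 bids and is active when a reward of 50 arrives while r2 was active on the previous cycle, r1 pays its bid of 10 and collects the reward, while r2 receives the bid.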
Nonetheless, the basic feature of even our somewhat crude version is that the strength of each rule serves as an estimator of its typical payoff. In the example problem, it takes exactly three cycles to move from an initial state to a final state. To evaluate a given rule set, 100 traversals of the state space (300 CFS-C cycles) are performed, each traversal starting at a randomly chosen initial state. The average external payoff achieved by the rule set is returned to the learning level, along with the updated strengths of the individual rules.

4. The Learning Level

At the top level of the learning system, the GENESIS genetic algorithm system [Grefenstette 84] maintains a population of fixed-length structures, each of which is interpreted as a set of fixed length rules. For the experiments described here, each structure consists of 240 bits, interpreted as rule sets of 20 rules each. Each rule is represented by a string of 12 bits. For the left hand sides, 00 represents a "0" in the representation used by the problem solver, 11 represents a "1", and 01 and 10 both represent "#". Since the right hand sides contain no "#" symbols, only 4 bits are required to represent the right hand sides. In addition to the usual genetic operations -- selection, crossover and mutation -- the learner also performs an additional operator, called clustering, described below. The crossover operator treats the knowledge structure as a ring, and selects two crossover points without regard for rule boundaries. Each rule in the offspring inherits the strength associated with the corresponding rule in the parent structure. (In the case of a new rule created by crossing in the middle of two rules, the new rule is assigned the average strength of the two parental rules.) The only notice that crossover pays to the rule boundaries is that duplicate rules are not included when choosing the segment from the second parent.
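The bit-level encoding described above can be made concrete with a small decoder. A sketch follows; the particular 12-bit string is just one of several possible encodings of the earlier example rule 0##1 -> 1010, since "#" has two codings:

```python
def decode_rule(bits12):
    """Decode one 12-bit rule: 8 bits (2 per feature) give the left hand
    side pattern, the last 4 bits give the operator membership vector."""
    lhs_map = {"00": "0", "11": "1", "01": "#", "10": "#"}
    lhs = "".join(lhs_map[bits12[i:i + 2]] for i in range(0, 8, 2))
    return lhs, bits12[8:]

def decode_structure(bits240):
    """A 240-bit structure is interpreted as a set of 20 twelve-bit rules."""
    return [decode_rule(bits240[i:i + 12]) for i in range(0, 240, 12)]

def matches(lhs, state):
    """'#' in the pattern matches either value of the 4-bit feature vector."""
    return all(p in ("#", b) for p, b in zip(lhs, format(state, "04b")))

# One possible encoding of the example rule 0##1 -> 1010:
lhs, rhs = decode_rule("000110111010")               # ("0##1", "1010")
covered = [s for s in range(16) if matches(lhs, s)]  # states 1, 3, 5, 7
```

Note that because "#" is doubly coded, distinct genotypes can decode to the same rule; this redundancy is part of the representation as described.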
Both offspring from crossover are kept, so that all rules in the parent structures survive in one of the offspring (with the exception of those rules in which one of the crossover points falls). Because the performance of a knowledge structure is independent of the position of the rules within the structure, it is natural to consider the use of the inversion operator. Inversion typically reverses a randomly chosen portion of a structure, thereby altering the defining length of many hyperplanes. This in turn alters the probability that those hyperplanes will be disrupted by crossover. A moderate rate of inversion produces slight performance improvements in LS-1 [Smith 80]. In other studies, the effects of inversion have generally been hard to measure. One explanation is that the space being searched by inversion -- the space of all possible permutations of genes -- is in general much larger than the original task space. Faced with the huge space of representation permutations, the random mutations of the representation provided by inversion cannot be expected to show much progress in the amount of time before selection pressure guides the population into convergence. We now describe a heuristic version of inversion called clustering. Like inversion, clustering does not introduce any new phenotypes into the population. Rather, the intent of this new operator is to modify the physical representation of a rule set so that co-adapted sets of rules are less likely to be disrupted by crossover. Clustering is defined as follows:

1) select a rule position at random;

2) move the highest strength rule to the selected position;

3) arrange the remaining rules so that the distance from each rule to the highest strength rule is a decreasing function of the rule's strength (treating the knowledge structure as a ring for the purposes of computing distances).

An example showing the effects of clustering appears in figure 3.
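The three steps of clustering can be sketched as follows. This is only one way to realize step 3; laying the remaining rules out alternately to the right and left of the anchor, strongest first, is our assumption about how "distance as a decreasing function of strength" is implemented:

```python
import random

def ring_distance(i, j, n):
    """Shortest distance between positions i and j on a ring of n positions."""
    d = abs(i - j) % n
    return min(d, n - d)

def cluster(rules, strengths, rng=random):
    """Clustering sketch: place the strongest rule at a random position,
    then lay out the remaining rules, strongest first, alternately to the
    right and left, so ring distance to the anchor grows as strength falls."""
    n = len(rules)
    order = sorted(range(n), key=lambda k: strengths[k], reverse=True)
    anchor = rng.randrange(n)
    layout = [None] * n
    layout[anchor] = rules[order[0]]
    offsets = []  # 1, -1, 2, -2, ... around the ring
    d = 1
    while len(offsets) < n - 1:
        offsets.append(d)
        if len(offsets) < n - 1:
            offsets.append(-d)
        d += 1
    for rule_idx, off in zip(order[1:], offsets):
        layout[(anchor + off) % n] = rules[rule_idx]
    return layout
```

Since crossover points fall on the ring, rules near the anchor (the strong, presumably co-adapted ones) are less likely to be separated by a crossover segment boundary, which is the stated intent of the operator.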
Clustering is performed on all structures after the selection phase and before crossover; if a given rule set is assigned multiple offspring, each offspring has a random initial position chosen for clustering. Note that the combination of clustering and crossover does not generally produce an offspring that contains all the high strength rules in each parent. In fact, experience has shown that it is usually counterproductive to impose such deterministic heuristics on genetic search. Since the strength obtained by a given rule is a function of its context in the rule set, it is possible for a group of rules to achieve low strength in one rule set and higher strength in another rule set. This is just the kind of testing of building blocks at which the GA excels. Of course, the efficacy of the clustering heuristic depends on the assumption that co-adapted rules tend to have similar levels of strength. This is true in at least some interesting cases. For example, consider a chain of rules, e.g., R1, R2, ..., Rn, in which Ri always leads to the firing of Ri+1; that is, assume that Ri is Si -> Oi such that oi(si) is in Si+1 for each si in Si, oi in Oi, and 1 <= i < n.

In the {0, 1, #} paradigm, a condition such as "x > 200" may require multiple rule firings and internal memory in order to be correctly evaluated. A favorite cognitive motivation for preferring pattern matching rather than boolean expressions is the feeling that "partial matching" is one of the powerful mechanisms that humans use to deal with the enormous variety of everyday life. The observation is that we are seldom in precisely the same situation twice, but manage to function reasonably well by noting its similarity to previous experience. This has led to some interesting discussions as to how "similarity" can be captured computationally in a natural and efficient way. Holland and other advocates of the {0, 1, #} paradigm argue that this is precisely the role that the "#" plays as patterns evolve to their appropriate level of generality.
Booker and others have felt that requiring perfect matches even with the {0, 1, #} pattern language is still too strong and rigid a requirement, particularly as the length of the left hand side pattern increases. Rather than returning simply success or failure, they feel that the pattern matcher should return a "match score" indicating how close the pattern came to matching. An important issue here which needs to be better understood is how one computes match scores in a reasonably general, but computationally efficient manner. The interested reader can see [Booker85] for more details. In any case we need to understand better what is involved in choosing a syntax and semantics for the left hand side of rules which balances the need for simplicity of representation and the need for expressive power. In my view this is one of the most difficult and open issues in using GAs to search program spaces, and the key to successful applications.

6. Payoff Functions

In addition to choosing an appropriate representation language, careful thought must be given to the characteristics of the payoff function used to provide feedback regarding the performance of task programs. Strategies for designing effective payoff functions tend to differ somewhat depending on whether one is working with classifier systems or using the Pitt approach to evolving useful task programs. In classifier systems the bucket brigade mechanism stands ready to distribute payoff to those rules which are deemed responsible for achieving that payoff. Because payoff is the currency of the bucket brigade economy, it is important to design a feedback function which provides a relatively steady flow of payoff rather than one in which there are long "dry spells". Wilson's "animat" environment is an excellent example of this style of payoff [Wilson85].
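The bucket brigade's payoff flow mentioned above can be sketched in miniature. This is a generic illustration, not the text's implementation: the bid ratio is a placeholder parameter, and only a single linear firing chain is modeled.

```python
def bucket_brigade_step(chain_strengths, external_payoff, bid_ratio=0.1):
    """Minimal sketch of bucket-brigade credit flow along one firing
    chain: each rule pays a bid to its predecessor, and the last rule
    in the chain collects the external payoff from the environment."""
    s = list(chain_strengths)
    for i in range(len(s)):
        bid = bid_ratio * s[i]
        s[i] -= bid            # the rule pays its bid
        if i > 0:
            s[i - 1] += bid    # its predecessor collects the bid
    s[-1] += external_payoff   # the environment pays the final rule
    return s
```

Repeated application propagates external payoff backwards through the chain, which is why long "dry spells" in the feedback function starve early-chain rules of strength.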
The situation is somewhat different in the Pitt approach in that the usual view of evaluation consists of injecting the program defined by a particular individual into a task processor and evaluating how well that program as a whole performed. This view can lead to some interesting considerations such as whether to reward programs which perform tasks as well as others, but use less space (rules) or time (rule firings). Smith [Smith80] found it necessary to break up the payoff function into two components: a task-specific evaluation and a task-independent measure of the program itself. Although these two components are usually combined into a single payoff, recent work by Schaffer [Schaffer85] suggests that it might be more effective to use a vector-valued payoff function in situations such as this. To my knowledge no one has explored this possibility. However, one of the more provocative papers in this area at this conference suggests that there is an opportunity to have "the best of both worlds" with a multilevel credit assignment strategy which assigns payoff to both rule sets as well as individual rules [Grefenstette87]. This is an interesting idea which, I suspect, will generate a good deal of discussion and merits further attention.

7. Selecting Genetic Operators

It has been pointed out many times that there is nothing sacred about the traditional operators defined and analyzed by Holland. What is important is that we have criteria set by the schema theorems that such operators should meet. It is particularly tempting when using GAs to search program spaces to introduce new operators to deal with the complexity of the representations. One need not feel reluctant or apologetic about doing so. However, if changes are made to existing operators or new ones are introduced, it is important to verify that they aren't overly disruptive of the process of distribution of trials according to payoff and that they encourage the formation of building blocks.
Without these basic properties, one is bound to be disappointed in the performance of GAs in searching program spaces.

8. Conclusion

So where does all this leave us? I think that it is quite clear that, as designers, if we can keep down the complexity of program spaces by using parameterized procedures or data structures which are easy to represent and manipulate with GAs, we have a much better chance for a successful GA application. If, on the other hand, the situation calls for evolving task programs at a more fundamental level, it seems equally clear that production system languages are currently the best choice for GA applications. Whether one chooses the Pitt or the Michigan approach is still a matter of individual preference. Perhaps by the time we get together again no such choice will be necessary.

References

[Booker82] Booker, L. B., "Intelligent Behavior as an Adaptation to the Task Environment", Doctoral Thesis, CCS Department, University of Michigan, 1982.
[Booker85] Booker, L. B., "Improving the Performance of Genetic Algorithms in Classifier Systems", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
[Buchanan78] Buchanan, B., Mitchell, T. M., "Model-Directed Learning of Production Rules", in Pattern-Directed Inference Systems, eds. Waterman and Hayes-Roth, Academic Press, 1978.
[Davis85] Davis, L. D., "Job Shop Scheduling Using Genetic Algorithms", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
[DeJong85] De Jong, K., "Genetic Algorithms: a 10 Year Perspective", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
[Fujiko87] Fujiki, C. and Dickinson, J., "Using the Genetic Algorithm to Generate Lisp Code to Solve the Prisoner's Dilemma", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1987.
[Goldberg85] Goldberg, D. and Lingle, R., "Alleles, Loci, and the Traveling Salesman Problem", Proc.
Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
[Grefenstette85] Grefenstette, J. et al., "Genetic Algorithms for the Traveling Salesman Problem", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
[Grefenstette87] Grefenstette, J., "Multilevel Credit Assignment in a Genetic Learning System", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1987.
[Hedrick76] Hedrick, C. L., "Learning Production Systems from Examples", Artificial Intelligence, Vol. 7, 1976.
[Holland75] Holland, J. H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
[Holland78] Holland, J. H., Reitman, J., "Cognitive Systems Based on Adaptive Algorithms", in Pattern-Directed Inference Systems, eds. Waterman and Hayes-Roth, Academic Press, 1978.
[Michalski83] Michalski, R., "A Theory and Methodology of Inductive Learning", in Machine Learning: An Artificial Intelligence Approach, eds. Michalski, Carbonell, and Mitchell, Tioga Publishing, 1983.
[Newell77] Newell, A., "Knowledge Representation Aspects of Production Systems", Proc. 5th IJCAI, 1977.
[Schaffer85] Schaffer, D., "Multiple Objective Optimization with Vector Evaluated Genetic Algorithms", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
[Smith80] Smith, S. F., "A Learning System Based on Genetic Adaptive Algorithms", Doctoral Thesis, Department of Computer Science, University of Pittsburgh, 1980.
[Smith83] Smith, S. F., "Flexible Learning of Problem Solving Heuristics Through Adaptive Search", Proc. 8th IJCAI, August 1983.
[Wilson85] Wilson, S., "Knowledge Growth in an Artificial Animal", Proc. Int'l Conference on Genetic Algorithms and their Applications, July, 1985.
A Genetic System for Learning Models of Consumer Choice

David Perry Greene
Stephen F. Smith
Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

Consumer choice modeling can be viewed as a classification problem where the 'model' of interest is the rule that an individual uses to distinguish acceptable from unacceptable products in a given purchase situation. The search space of possible rules can be characterized as large, multimodal and noisy. Traditional statistical methods for generating consumer choice models are limited by the assumptions and representation they impose. A recent symbolic induction approach attempted to address these limitations but fell short in managing the complexity of the search space. It is hypothesized that probabilistic genetic search can overcome this complexity while retaining the advantages of a symbolic rule representation. This paper proposes a genetic algorithm (GA) based model generator capable of addressing these problems, describes its implementation, and compares it to alternative modeling techniques on a simulated consumer decision problem.

0. Introduction

Gaining insight into how and why a consumer chooses whether or not to buy in a specific purchase situation is of obvious interest from a marketing standpoint. An ability to identify the relevant features of a purchase situation, the relative order of importance of these features to the consumer, and the (potentially) complex relationships the consumer posits between them can provide significant leverage in both predicting product performance and focusing promotional activities. Within the marketing literature, this problem of understanding consumer behavior is referred to as consumer choice modeling.
The problem can be generally stated as follows: given a collection of features that describe a specific purchase situation, and a set of examples of specific consumer choice decisions relative to this purchase situation, infer a decision rule that accounts for the consumer's behavior. Consider the situation of renting an apartment. In this case, the modeler is presented with a set of previously made rent decisions, describing each considered apartment in terms of a predefined collection of apartment features (e.g., price, distance from work, reputation of the landlord, etc.). Assuming that the consumer chooses to rent or not rent a given apartment according to some decision rule defined over these features, the goal is to construct a model of the consumer's decision rule from these examples of past choice behavior. Ideally, a model of consumer choice should demonstrate adequacy over three broad performance dimensions: 1) its ability to correctly forecast future choice decisions (or predictive validity), 2) its ability to correctly order the consumer's value for different product features (or diagnostic validity), and 3) its ability to provide insight into the consumer's decision strategy (or structural/intuitive validity). These performance dimensions provide a basis for contrasting alternative techniques for consumer choice modeling. Traditionally, marketing researchers have relied on multiattribute statistical methods as a means of modeling consumer choice [Wilk73]. Such methods, however, are not without shortcomings. They provide limited intuition into the structure of the individual's choice rule, and the assumptions that must be made concerning the structure of the solution space can sometimes result in serious predictive errors [Curr81][John85]. To overcome these limitations, a recent paper [Curr86] adopted an AI perspective of the problem, using a production rule representation and symbolic induction. While the use of production rules provided structural intuition, the
representation created search difficulties which impaired the other validity measures. Genetic algorithms (GAs), because of their unique search mechanism, offer the potential of retaining the benefits of a production rule representation while overcoming the complexities of the problem space. This paper investigates the hypothesis that GAs will be able to predict decisions as well as statistical methods yet offer the representational superiority of the symbolic approach. The remainder of this paper is divided into five sections. The first section examines the strengths and weaknesses of the conventional statistical methods. The second section considers consumer choice modeling as a symbolic learning task and describes the approach taken in [Curr86]. In the third section, an alternative approach using a genetic algorithm is developed to address the limitations in the existing methods. The next section describes a simulated problem to compare the three techniques using the three measures of validity discussed above. The fifth section presents the results followed by a brief discussion.

1. Traditional Statistical Methods

In marketing, multiattribute preference models, such as discriminant analysis and regression models, have been the traditional classification technique [Curr84]. The majority of these models represent a consumer's choices as a combination of feature weights, such as the linear regression model [Wilk73]. A major assumption in such approaches is that consumers trade off relevant attributes of a product when forming overall evaluations. These are termed compensatory decision rules, because good features can compensate for the bad. Research, however, indicates the existence of non-compensatory choice strategies which do not accommodate tradeoffs [Payn76][Pinh70]. Therefore one concern with these statistical models is that the presumption of compensating behavior is incorrect.
Despite acknowledged robustness [Dawe74], under conditions likely to occur in a competitive market the use of compensatory models may be inappropriate, and misapplication can have severe consequences [John85][Curr81]. A second concern with linear models is their limited ability to provide behavioral insights. Knowing how a person's choice strategy is constructed could provide a practical benefit by identifying targets for special emphasis or in modifying behavior [Wrig73]. Unfortunately, despite strong predictive performance, statistical methods are limited in two ways by the representational assumptions they impose. The first problem is that non-compensatory behavior is not easily identified from the feature weights. Second, the numerical coefficients offer no insight into the configural relations among the attributes.

2. Consumer Choice Modeling as Symbolic Learning

2.1 Symbolic Representation of Choice Strategies

Production systems offer a representation which can capture the configural nature of a choice strategy in an intuitive manner. Following the terminology of [Mich80], we can represent a consumer's choice rule as a collection of elementary 'terms', where each term consists of a conjunction of attribute-value (A-V) pairs called 'selectors'. Preference for a specific apartment, for example, could be represented in the following form:

[rent = $450][commute = 2 miles][heat_incl = true]    (1)

The logical union of several terms forms a disjunctive expression, leading to a disjunctive normal form (DNF) representation of more complex choice strategies. Such strategies can be classified as "compensatory" in nature since the selectors in one term can compensate for selectors in another. An example of a compensatory rule in DNF for selecting an apartment would be:

IF [rent < $400][commute < 2 miles][heat_incl = true]    (2)
or [rent < $350][commute < 3 miles][laundry_incl = true]
THEN purchase

In contrast, a choice rule consisting of a single term (i.e., no or's) would represent
a "conjunctive" choice strategy. In the consumer behavior literature the distinction between conjunctive and compensatory strategies has important implications; therefore the ability of the learning system to distinguish these structures will be a critical part of the evaluation (for this paper, structure and strategy will be used interchangeably). While behavioral researchers acknowledge the superiority of production systems as a representation [Klah87], generating high validity rules is a difficult problem. As [Vali85] indicates, learning disjunctions of conjunctions is computationally complex for a reasonable number of features and becomes NP-hard in certain circumstances. If we assume only 40 binary selectors, for example, the size of the description space for a single term is 2^40. However, an individual is quite likely to have rules which use multiple terms ("term 1" or "term 2" or ... "term n"). For simplicity, if we restrict this compensatory decision rule to only, say, 5 terms, the number of possible rules would be 2^40 x 2^40 x ... x 2^40 = 2^200, or approximately 10^60 possible combinations [1].

2.2 Concept Learning System

In a recent paper, Currim, Meyer, and Le [Curr86] identified the limitations of linear statistical models discussed in Section 1, and proposed a solution to the symbolic rule generation problem. Their approach, called the Concept Learning System (CLS), is an application of Quinlan's ID3 system [Quin84] (named after the earlier CLS system upon which ID3 is based [Hunt66]).
[1] The possible combinations = (2^k)^t, where k = the number of selectors (k in {1, 2, ..., limits of relevance}) and t = the number of terms in a rule (t in {1, 2, ..., limits of cognition}), although the number of "legal" rules is a function of the independence among selectors.

The ID3 inductive algorithm constructs a consumer choice rule by incrementally building a classification tree that covers the training set, repeatedly adding additional selectors to underspecified branches on the basis of their estimated discriminatory power. Thus, given a training set of specific consumer judgements, with the attributes over which the judgements were rendered encoded as binary (true or false) selectors (e.g., rent < $350, distance-from-work < 2 miles), CLS searches by evaluating each selector for its ability to discriminate among the consumer's decisions. The selector which provides the best separation at each point in the search, as determined by information-theoretic measures aimed at minimizing the expected number of tests to classify an instance, becomes a new branch of the decision tree. This process continues iteratively until all training examples are correctly partitioned. The choice rule generated, then, consists of all the selectors used to build the tree (as in expression (2)). The use of a production system formalism is an important advantage. Because of their apparent consistency with human thinking [Newe72][Klah87], production system representations are easily understood by users. However, the use of production rules also exposes the complexity of DNF induction problems discussed earlier. The major concern in the case of CLS is that the decision tree building procedure is stepwise optimal but not necessarily globally optimal [Bres84], which means constructing a rule piece by piece will be ineffective if it is critical combinations of pieces which provide superior performance. A second concern is the sensitivity of the CLS inductive strategy to errors (or "noise") in the
consumer choice data (which is not an uncommon occurrence in practice). As attempts are made to classify noisy training examples, the trees generated by CLS become complex and inaccurate [2]. Therefore, when conditions are less than ideal, the predictive quality of the rule may fare poorly.

[2] In [Quin86] some extensions to the CLS inductive strategy to cope with noise are proposed. These extensions were not implemented and will have to be evaluated in a later study.

3. ADAM: A Genetic Algorithm Based Model Generator

Genetic algorithms (GAs) are a class of search techniques which have demonstrated the capabilities to address such problems in a number of rule induction tasks (e.g., [Smit83][Gold85][Scha84]). Accordingly, it was hypothesized that a GA would exhibit similar performance in the consumer choice modeling domain. To validate this claim, a GA-based consumer choice rule generator called ADAM (A Decision Adaptation Model) was implemented and evaluated. In the following subsections, we describe the principal components of this system. We assume familiarity with the basics of GAs and refer uninformed readers to [Holl75].

3.1 Representation

To simplify the comparative analysis, the binary selector coding scheme employed by CLS in formulating rules was also adopted within ADAM. Thus, a given term in a candidate rule is expressed as a string of length n over the alphabet {0, 1, #} (# representing a don't care position), where n equals the number of dichotomous selectors defined and each bit position corresponds to a particular selector. A complete choice rule is then expressed as the concatenation of one or more terms. The number of terms in a given rule can vary, and there is no significance attached to the order in which terms appear. Using implicit disjunctive normal form, the rule string 1###1# 1##1#1 011### would be interpreted as IF ((1) and (5)) or ((1) and (4) and (6)) or ((not 1) and (2) and (3)) THEN choose.
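The implicit-DNF interpretation spelled out above can be sketched directly. The function names are ours; the three term strings simply transcribe the interpretation given in the text ("((1) and (5)) or ((1) and (4) and (6)) or ((not 1) and (2) and (3))").

```python
def term_matches(term, features):
    """True if a {0,1,#} term matches a binary feature string
    ('#' is a don't-care position)."""
    return all(t == '#' or t == f for t, f in zip(term, features))

def rule_fires(terms, features):
    """An implicit-DNF rule fires if any one of its terms matches."""
    return any(term_matches(t, features) for t in terms)

# The three terms of the example rule from the text:
rule = ['1###1#', '1##1#1', '011###']
```

For instance, the feature vector 100010 satisfies the first term (positions 1 and 5 are 1), so the rule fires, while 000000 satisfies no term.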
If the conditions of any term are matched then the rule is active (fired), indicating an acceptable product. For purposes of the comparative analysis, we will assume all features describing the purchase situation to be binary attributes. Thus, there is always a one to one correspondence between purchase features and the selectors over which the term representation is defined. It should be noted, nonetheless, that the term representation is not a good one for the GA in the general case. Specifically, the representation implies that attributes which range over a continuum of values (e.g., the cost of renting an apartment) must be perceived through a set of binary selectors (e.g., [rent <= $300], [rent <= $350], [rent <= $400], etc.). This yields a term representation space that includes a large number of illegal structures. However, this problem is easily resolved by simply using several string positions to recognize such attributes' values (as is routinely done in classifier systems) [3]. The need to explicitly define a set of possible selectors is specific to the CLS inductive strategy. Assuming the above described representation of a consumer choice rule as a disjunction of one or more conjunctive terms, there are still alternatives as to the nature of the population of structures to be manipulated by the GA. On one hand, we might consider a single term to be an atomic rule of sorts, present the GA with a population of such atomic rules, and interpret the entire population as the consumer's choice rule. Adopting this approach, however, requires a means of assigning credit to the individual terms that contribute to the overall performance of the choice rule. We can bypass this problem of term credit assignment (as was done in [Smit80]) by encoding complete choice rules as individual structures in the GA population, and including discovery of term interactions as part of the GA search. In this case, the GA will manipulate variable length rule strings of the form IF <term> or ... or <term>
THEN Choose, and the population of structures represents a collection of competing choice rules. This latter approach is the approach adopted in ADAM.

3.2 Evaluation Function

The evaluation function or "critic" in ADAM contains three components. The dominant component, "prediction", measures how frequently a rule correctly predicts choice. The other two components, "specificity" and "term-count", relate to the structure of the rule. The three measures are weighted and summed to yield an overall fitness measure. Prediction is given a dominant weighting reflecting its importance in a model; however, the weights for the structural components have a critical role in determining bias, or direction of the search, toward the "true" choice strategy (e.g., conjunctive or compensatory). On the whole, if two rules perform equally, the more general is preferred. Preliminary investigation showed that the GA would direct the initial search toward the appropriate structure; however, greater discrimination in the later stages was necessary. Some means for determining the most likely strategy was necessary to adjust the bias during the search.

[3] Actually, we can take advantage of the fact that with respect to many non-binary attributes we are interested in ranges of values, and define a better representation. In such situations, we have experimented with a representation that includes a <, >, or = prefix to patterns defined over such attributes, using a binary number interpretation of the attribute pattern in the case of < or > prefixes.

3.2.1 Strategy Structures

Using Bettman's [Bett79] description of conjunctive and compensatory strategies for a 3 attribute example, the following structures would be expected. For conjunctive rules there would be only one term with a specific value defined at each relevant attribute:

conjunctive: IF <A=x and B=y and C=z> THEN Choose

For a linear compensatory rule the structure is a disjunction of conjunctive terms and should have many terms with fewer defined positions (low
specificity, high term-count):

compensatory: IF <A=x and B=y> or <A=j and C=k> THEN Choose

Based on this, "specificity" measures the number of defined (non-#) positions in a rule string and "term-count" measures the number of terms. Conjunctive rules would have high specificity and low term-count while compensatory rules would have the reverse.

3.2.2 Shifting Bias

Since one purpose of the rule is to infer strategy from rule structure, an appropriate structural bias cannot be known a priori. Unlike traditional methods which may erroneously presume a specific structure (i.e., linear compensatory), ADAM should adapt its bias as the search progresses. Because of the GA's efficiency at initially locating good structures, ADAM can use this information to adjust the evaluation function (bias) towards conjunctive or compensatory structures. That is, "fit" rules tend to acquire a specific form, and the weights associated with specificity and term-count can be adjusted to favor search among rules with that form. In the later stages of the search, once a specific form becomes predominant, this bias becomes useful in discriminating among otherwise equivalent rules. Eventually this bias adjustment could be implemented as a smooth function [Berl79] based on the population average size weighted by performance; for now, the shift occurs when the size of the best structure in the population falls below the average size of the initial population (~3 terms).

3.2.3 Measuring Prediction

Predictive validity, or the ability to forecast choice, can be measured with a simple contingency table (Figure 1) recording when a rule agreed or disagreed with the training set choices.

Basic Contingency Table ("fired" implies the rule conditions were met and so the item was acceptable)

                 fired    not fired
  chosen           a          b
  not chosen       c          d

                  Fig. 1

This figure is representative of a hypothesis test, with cells b and c considered errors of omission and commission respectively.
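The three-component critic can be sketched as below. This is an illustrative reading of the description, not ADAM's actual code: the weight values are placeholders, and the sign convention (subtracting the structural components so that more general rules score higher on ties) is our assumption.

```python
def fitness(rule_terms, examples, w_predict=1.0, w_spec=0.01, w_terms=0.01):
    """Sketch of ADAM's critic: dominant prediction component plus
    small structural components. `examples` is a list of
    (feature_string, chosen) pairs; terms are {0,1,#} strings."""
    a = b = c = d = 0
    for features, chosen in examples:
        fired = any(all(t == '#' or t == f for t, f in zip(term, features))
                    for term in rule_terms)
        if chosen and fired:
            a += 1          # hit
        elif chosen and not fired:
            b += 1          # error of omission
        elif fired:
            c += 1          # error of commission
        else:
            d += 1          # correct rejection
    prediction = (a + d) / len(examples)
    specificity = sum(ch != '#' for term in rule_terms for ch in term)
    term_count = len(rule_terms)
    # generality preferred on ties: structural terms enter negatively here
    return w_predict * prediction - w_spec * specificity - w_terms * term_count
```

Shifting bias, as described in 3.2.2, would then amount to adjusting `w_spec` and `w_terms` during the run as the predominant form emerges.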
While differential penalties might be applied based on the severity (cost) of each type of error, for this paper they are considered equivalent.

3.2.4 Folding the Sample

One final characteristic influencing the evaluation is the segmentation of the training sample into a smaller training set. At the end of a generation a new training set was drawn, starting with the element about 2/3 of the way from the previous first element. This caused a stagger, or overlap of about 1/3, for each training set. The primary reason was to maintain some of the adaptive flavor of the search by providing a changing environment (the overlap of the sample allowed the change to be more gradual). One effect was to maintain diversity among the genes to prevent lost alleles. A second reason was that it requires fewer computations per generation even if the training sample size were increased. Interestingly enough, this scheme of staggering the training examples considered during evaluation appeared to lead to equal performance compared with repeated consideration of the entire training set in preliminary experiments.

3.3 Rule Generating Parameters

Based on the results of earlier studies [DeJo75][Gref86], a population of 50 candidate choice rules is maintained by the GA in conducting its search. The initial set of rules is generated randomly with the number of terms ranging between 1 and 6, and a 33% probability of a 1, 0 or # at each individual string position.
Although typically a larger percentage of #'s is used, the potential of highly specific "conjunctive" rules suggested that too general a seeding would slow the search progress. After the evaluation function assigns a fitness to each rule, the scores are normalized, and genetic operators are applied to produce a new population of candidate rules. An "elitist" strategy, as suggested by De Jong (1975), then inserts the best rule from the previous generation. The probability of crossover is set at 0.6, as originally suggested by [DeJo75]. Crossover operates at two levels (between terms and between selectors) as outlined by [Smit80], with a modification to alignment point selection to include a zero level crossover (recall that the structures being manipulated are variable length). This has the effect of permitting single terms to occur and be included in the crossover process [4]. Finally, the mutation rate was set at 0.001. ADAM continues to search until either it finds a rule which perfectly classifies the set of choices or it reaches a prespecified number of generations. After it stops, it then selects the rule with the best score. The output of this search process is a single rule used to characterize an individual's choice strategy.

4. Methodology

A simulation was constructed to perform a comparative analysis of ADAM, CLS, and a linear Logit model. The following four factors were investigated for their potential effect on performance:

1- type of "true" choice strategy (conjunctive, compensatory or mixed)
2- number of attributes on which the rule is based (3, 6, 9)
3- level of noise in representing the choice (0%, 10%, 20%)
4- sample size used for estimation (20, 100, 200), with half of each sample used as a holdout.

To provide the simulation data, a table of randomly generated A-V terms was created. Each alternative consists of three, six, or nine attributes using a random number between 0 and 99.
[4] As suggested by Smith [Smit80], the level of crossover seems to yield better early performance when set high but better later performance when set low. In preliminary experiments, staggering the crossover level operator based on the number of generations elapsed appeared to yield steadier increases in overall performance.

A coding function representing a decision maker's strategy (conjunctive, compensatory or mixed) was applied, resulting in a set of decisions marked as positive or negative examples. This could be thought of as a sample of products characterized by 3, 6 or 9 attributes plus an indicator of whether or not the consumer thought it should be chosen. The choice indicator was generated using the following three choice functions:

conjunctive:  choice = 1 if (x1 >= t1) and (x2 >= t2) and ... and (xn >= tn); 0 otherwise

compensatory: choice = 1 if (x1 + x2 + ... + xn)/n > t1; 0 otherwise

mixed:        choice = 1 if (x1 > t1) and ((x2 + x3 + ... + xn)/(n-1) > t2); 0 otherwise

where:
* n = number of attributes in a given experimental condition (n in {3, 6, 9})
* x_i = a given attribute (i in {1, 2, ...
}) * 41, = thresholds, each t will be selected a pnori for each of 2 conditions to generate an approximately cqual split between the oumber of “chosen” and "nonchosen” alternatives Noise was introduced into cach coded sct of examples by changing the decision indicator of any alternative in the set with 4 probability of 0%, [0% or 20% ‘This represents a severe form of misclassification since an alternative which contained acceplable atirtbute values 1s now indicated unacceptable and vice versa Using combinations of choice strategy, number of attributes, and noise level, nine selection models were created representing a 4x 3.x 3 parhal faciosial Fach of the nine models is apphicd for three different sample sizes yielding 27 (model/sample sizes) Tach of these 27 conditions is repeated § times yielding 145 data sets Half of each data set was used as a holdout sample meaning it represented a set of decisions not previously seen and therefore usable for evaluation ADAM, CES and the simulation were all Programmed in PASCAL and run on an IBM-PCS! The Logit resulls were generated using Hottrans on a VAX computer and with RATS on an FAM-PC 5. 
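The three choice functions and the noise step can be sketched directly. This is our own Python reconstruction (the simulation itself was written in PASCAL); thresholds are passed in as parameters, and the (attributes, choice) pair representation is an assumption.

```python
import random

def conjunctive(x, t):
    """1 only if every attribute clears its own threshold."""
    return 1 if all(xi >= ti for xi, ti in zip(x, t)) else 0

def compensatory(x, t_bar):
    """1 if the attribute average clears one overall threshold,
    so a strong attribute can compensate for a weak one."""
    return 1 if sum(x) / len(x) > t_bar else 0

def mixed(x, t1, t_bar):
    """Conjunctive screen on the first attribute, then a
    compensatory average over the remaining ones."""
    if x[0] < t1:
        return 0
    rest = x[1:]
    return 1 if sum(rest) / len(rest) > t_bar else 0

def add_noise(examples, p, rng=random):
    """Flip each 0/1 choice indicator with probability p (0.0, 0.1, or 0.2):
    the severe misclassification noise described above."""
    return [(x, (1 - c) if rng.random() < p else c) for x, c in examples]
```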
5. Results

The objective of ADAM was to simulate performance under a number of conditions and to determine whether it offered any improvements with respect to the issues described in earlier sections. Performance was evaluated on the basis of the previously discussed measures of a good model: predictive, diagnostic, and structural validity. These results are summarized below.

5.1 Prediction

For the simulation, the ability to accurately predict is measured by how well the model's rule predicts the hold-out sample. The comparative predictive levels of the models, averaged over the 5 repetitions, are presented in Table 1.

[5] The CLS code was supplied by Rob Meyer at the University of California, Los Angeles.

Table 1. Effectiveness of ADAM, CLS and Logit
(percentage of holdout cases correctly predicted)

                          sample = 10        sample = 50        sample = 100
  Strat   Attribs Noise   GA   CLS  Logit    GA   CLS  Logit    GA   CLS  Logit
  1 Conj     3      0%   100    90    92    100   100   100    100   100   100
  2 Conj     6     10%    72    62    67     77    67    75     76    58    72
  3 Conj     9     20%    86    73    66     80    76    80     78    66    76
  4 Comp     3     10%    76    72    73     79    65    76     83    69    80
  5 Comp     6      0%    75    59    84     82    68    84     77    66    82
  6 Comp     9     20%    62    47    71     66    64    70     66    56    71
  7 Mixd     3     20%    78    68    73     76    69    73     75    66    78
  8 Mixd     6     10%    88    47    84     88    80    86     88    80    85
  9 Mixd     9      0%    72    74    97     92    90    95     87    88    94

It is evident that ADAM, using a genetic algorithm, generated rules with equal or superior predictive ability to those of CLS across almost all the experimental conditions (the exception being 1 and 2 points difference for the mixed model with 9 attributes). In comparing ADAM to the Logit model, the major impression is how comparable and consistent their performance was. Overall, ADAM predicted at 80.7% accuracy vs. Logit at 79.9%. As was expected, the performance varied across conditions. The results are shown in Table 2.

Table 2. Performance Across Conditions

                      STRATEGY        NOISE         ATTRIBUTES      SAMPLE
  MODEL   Overall   cnj  cmp  mxd   0%  10%  20%    3   6   9     10  50  100
  ADAM     80.7      85   74   83   88   82   72   85  79  77     79  82   81
  Logit    79.9      80   77   83   89   79   72   83  79  78     75  82   82
As seen from Table 2, ADAM appears to offer a slight edge with respect to conjunctive rules and small sample sizes, areas where traditional models are expected to be weak. However, none of the performance differences were found statistically significant. A comparison of regression models using dummy variables primarily examined main effects. The results indicate significant differences (p < .0001) for all main effects (model type, number of attributes, % noise, sample size), consistent with expectations. That is, increasing noise and larger attribute sets had detrimental effects on predictive accuracy, while larger samples had a positive effect. The performance of ADAM showed significant improvement over CLS. However, an F-test between regression models did not indicate a significant difference over Logit (p ≈ .053), even with first-order interactions included.

One explanation for the surprising strength of the Logit model on conjunctive rules was the nature of the simulation environment. As several researchers have noted [Dawe74][Cury81][John85], the use of a uniform distribution in generating simulation attributes provides a best-case environment for the averaging of a statistical model [6]. However, such environments are likely to be inconsistent with a realistic market situation, where the true conditions would prove detrimental to the linear model.

[6] To establish that the simulation replicated Currim et al.'s earlier study, a test for differences in the pairwise results was found non-significant (126 df, 0.241); the null hypothesis that the means were equal could not be rejected. This is encouraging since it addresses the objective of extending their efforts.
Two encouraging findings were the low variance of ADAM's results across repetitions of the trials and the stability of ADAM across differences in both strategy and noise, supporting the expectations for genetic search. Note that the falloff in prediction is consistent with the increase in the noise level. Several additional runs using noise as the only experimental variable support this finding [7].

5.2 Diagnosis

When consumers make choices they frequently do not place the same weight on all the features, but instead indicate that certain attributes are more important than others. Diagnostic validity measures the model's ability to recover the importance of an attribute. In a regression model like Logit, attribute importance is represented by the beta coefficients or parameter weights. A comparable parameter is ADAM's relative frequency measure for each attribute. A critical question is whether production rules can provide diagnostic validity.

Table 3. Diagnostic Correlation between ADAM and Logit
Correlation between relative frequency of attributes in ADAM and beta coefficients from Logit.

                       Sample Set Size
  Attrib (cases)     10       50       100
  X1 (135)         0.45a    0.80a    0.78a
  X2 (135)         0.50a    0.74a    0.79a
  X3 (135)         0.41b    0.75a    0.75a
  X4 (90)          0.64a    0.65a    0.76a
  X5 (90)          0.68a    0.62a    0.65a
  X6 (90)          0.65a    0.58a    0.65a
  X7 (45)          0.43     0.85a    0.61c
  X8 (45)          0.47     0.48     0.42
  X9 (45)          0.53     0.56     0.72b

  Significance: a p < .0001, b p < .001, c p < .01, d p < .05

As evidenced in Table 3, not only do the two algorithms generate equivalent predictions, but they also appear to agree as to the relative importance of attributes, even across sample sizes.
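The diagnostic comparison in Table 3 is a plain correlation between two vectors of per-attribute numbers. The paper does not show its computation; a minimal sketch of the standard Pearson correlation it implies:

```python
def pearson(xs, ys):
    """Pearson correlation between, e.g., ADAM's per-attribute relative
    frequencies (xs) and Logit's beta coefficients (ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```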
The correlations appear to show increasing convergence as sample size increases, although this trend was not significant (p ≈ .01). This is a very interesting result, since it lends support to the use of production rule models in providing useful quantitative measures as well as intuition.

5.3 Structure

Structural validity indicates the model's ability to recover the form of a given choice process, ultimately to provide insight into the individual's behavior. Doing so is a two-step procedure. First, the model must find a rule that predicts well. Given that rule, the second step is to infer the underlying choice strategy based on the rule's structure. The structural elements are features such as the number of terms, the number of "don't-care" symbols (#), the distribution of usage frequency across terms, and so on. Classification is determined by specifying a mapping from characteristics consistent with the known choice strategy to the features (structural elements) of a rule. The ability to provide structural intuition is a key advantage of ADAM over Logit. A critical issue is whether the true structure can be recovered when the data is noisy (a debilitating condition for CLS).

[7] The slightly lower performance on compensatory rules may be attributable to the loss of information caused by encoding a 100-value random number as a dichotomous variable.

Table 4. Structural Classification for Model by Noise

  Noise:           0%     10%    20%
  Model:
  conjunctive    100%     87%    73%
  compensatory    87%     93%    67%
  mixed           60%     60%    20%
From the table it appears that the performance is encouragingly robust, generating misclassification within an acceptable range based on the given noise level.

5.4 Conclusion

Based on a simple simulation, the performance of ADAM was evaluated using the three validity measures. With respect to the research objectives and performance hypotheses, the following results were recorded. The GA provided superior performance to CLS across all measures, especially in those areas addressing the weaknesses of traditional models. This suggests genetic search may offer a suitable alternative. Further, the GA performed very well with respect to the traditional strengths of the Logit model, while providing the important feature of production system representation. One additional finding is that production rules can provide quantitative measures comparable to statistical coefficients. Accepting that the simulation represents a simplified situation, the results appear to provide strong support for the potential of genetic search as a method for modeling consumer choice.

6. Discussion

6.1 Review

Preference models provide intuition about how a consumer might behave and why he does so. Although the traditional tool has been the linear compensatory model, behavioral research has shown evidence of a number of situations where the use of such models is inappropriate and where misapplication can have severe consequences.
A proposed alternative, based on Quinlan's classification tree building procedure, provided the necessary production system formalism; however, the predictive and diagnostic performance was weak due to the complexity of the problem domain. To address the limits of earlier approaches, a classification system called ADAM was developed, utilizing a GA for search. A simplified choice simulation was used to evaluate GA performance and to provide a comparison to the earlier methods. The results of the simulation support the potential of genetic search and the use of ADAM for choice modeling.

6.2 Future Directions

A model which can provide performance equal to statistical models, with the intuitive advantages of production systems, offers a more versatile tool for marketing professionals. However, several issues need to be examined. With respect to the problem domain, it is important to look at increasing the number and type of attributes, as well as different distributions of attribute sets. With respect to the representation and its influence on the solvability of a problem, two issues can be identified. The first concerns experimentation with non-binary attribute pattern representations that better facilitate the characterization of numerical ranges of values. As was mentioned in passing in Section 3.1, we have begun investigating the use of pattern prefixes in this regard. The second issue concerns the information loss that results from any discretization of the possible values that a continuous variable may assume. We are exploring techniques for dynamically rescaling (i.e., expanding or contracting) attribute ranges when performance stagnates within a specific range of values. With respect to the evaluation function, the initial efforts at an adaptive bias using a shift in the evaluation function weights also appear promising. An important next step is applying an upgraded ADAM to actual consumer choice data.
Overall, the positive results suggest that a much more detailed investigation of a genetic system for modeling consumer choice strategies is warranted.

References

[Berl79] Berliner, H., "On the Construction of Evaluation Functions for Large Domains", Proceedings IJCAI-79, 1979.

[Bett79] Bettman, J.R., An Information Processing Theory of Consumer Choice, Addison-Wesley, 1979.

[Brei84] Breiman, L., Friedman, J., Olshen, R. and Stone, C., Classification and Regression Trees, Wadsworth, Inc., 1984.

[Curr84] Currim, I. and Sarin, R.K., "A Comparative Evaluation of Multiattribute Consumer Preference Models", Management Science, vol. 30, no. 5 (May), 1984, p. 543-561.

[Curr86] Currim, I., Meyer, R.J., and Le, N., "A Concept-Learning System for the Inference of Production Models of Consumer Choice", working paper no. 149, Center for Marketing Studies, UCLA, 1986.

[Cury81] Curry, D.J., Louviere, J.J. and Augustine, M.J., "On the Sensitivity of Brand-Choice Simulations to Attribute Importance Weights", Decision Sciences, vol. 12, 1981, p. 502-516.

[Dawe74] Dawes, R.M. and Corrigan, B., "Linear Models in Decision Making", Psychological Bulletin, vol. 81, no. 2, 1974, p. 95-106.

[DeJo75] De Jong, K.A., "Analysis of the Behavior of a Class of Genetic Adaptive Systems", PhD Thesis, Dept. of Computer and Communication Sciences, University of Michigan, 1975.

[Einh70] Einhorn, H.J., "The Use of Nonlinear, Noncompensatory Models in Decision Making", Psychological Bulletin, vol. 73, no. 3, 1970, p. 221-238.

[Gold83] Goldberg, D., "Computer Aided Gas Pipeline Operation Using Genetic Adaptive Systems", PhD Thesis, Dept. of Civil Engineering, University of Michigan, 1983.

[Gold85] Goldberg, D., "Dynamic Systems Control Using Rule Learning and Genetic Algorithms", Proceedings of the Ninth International Joint Conference on Artificial Intelligence (Los Angeles, California), Morgan Kaufmann, 1985, p. 588-592.

[Gref84] Grefenstette, J.J., "Optimization of Genetic Search Algorithms", Tech. Rep. CS-83-14, Vanderbilt University, 1984.

[Holl75] Holland, J.H.,
Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

[Holl86] Holland, J.H., "Escaping Brittleness: The Possibilities of General Purpose Learning Algorithms Applied to Parallel Rule-Based Systems", in Machine Learning: An Artificial Intelligence Approach, vol. II, R. Michalski, J. Carbonell, and T. Mitchell (Eds.), Morgan Kaufmann, 1986.

[Hunt66] Hunt, E.B., Marin, J. and Stone, P.J., Experiments in Induction, Academic Press, 1966.

[John85] Johnson, E.J., Meyer, R.J., and Ghose, S., "When Choice Models Fail: Compensatory Models in Efficient Sets", working paper, Graduate School of Industrial Administration, Carnegie Mellon Univ., 1985.

[Klah87] Klahr, D., Langley, P. and Neches, R., Production System Models of Learning and Development, MIT Press, 1987.

[Mich80] Michalski, R. and Chilausky, R.L., "Knowledge Acquisition by Encoding Expert Rules versus Computer Induction from Examples: A Case Study Involving Soybean Pathology", International Journal of Man-Machine Studies, vol. 12, 1980, p. 63-87.

[Newe72] Newell, A. and Simon, H., Human Problem Solving, Prentice Hall, 1972.

[Payn76] Payne, J.W., "Task Complexity and Contingent Processing in Decision Making: An Information Search and Protocol Analysis", Organizational Behavior and Human Performance, vol. 16, 1976, p. 366-387.

[Quin84] Quinlan, J.R., "Inductive Inference as a Tool for the Construction of High-Performance Programs", in Machine Learning: An Artificial Intelligence Approach, vol. I, R. Michalski, J. Carbonell, and T. Mitchell (Eds.), Tioga Press, 1983.

[Quin86] Quinlan, J.R., "The Effect of Noise on Concept Learning", in Machine Learning: An Artificial Intelligence Approach, vol. II, R. Michalski, J. Carbonell, and T. Mitchell (Eds.), Morgan Kaufmann, 1986.

[Scha85] Schaffer, J.D., "Learning Multiclass Pattern Discrimination", Proceedings of an International Conference on Genetic Algorithms and their Applications, Pittsburgh, PA, John Grefenstette (Ed.), 1985.

[Smit80] Smith, S.F.,
"A Learning System Based on Genetic Adaptive Algorithms", PhD Thesis, Dept. of Computer Science, University of Pittsburgh, December, 1980.

[Smit83] Smith, S.F., "Flexible Learning of Problem Solving Heuristics via Adaptive Search", Proceedings 8th International Joint Conference on AI, Karlsruhe, West Germany, August, 1983.

[Vali85] Valiant, L.G., "Learning Disjunctions of Conjunctions", Proceedings of the Ninth International Joint Conference on Artificial Intelligence (Los Angeles, California), Morgan Kaufmann, 1985, p. 560-566.

[Wilk73] Wilkie, W.L. and Pessemier, E., "Issues in Marketing's Use of Multi-Attribute Attitude Models", Journal of Marketing Research, vol. 10, November, 1973, p. 428-441.

[Wrig73] Wright, P., "Use of Consumer Judgement Models in Promotion Planning", Journal of Marketing, vol. 37, October, 1973, p. 27-33.

A STUDY OF PERMUTATION CROSSOVER OPERATORS ON THE TRAVELING SALESMAN PROBLEM

by I.M. Oliver*, D.J. Smith†, and J.R.C. Holland‡

* Texas Instruments Ltd (Bedford, UK)
† Texas Instruments Inc (Dallas, TX)
‡ Mullard Ltd (Malmesbury, UK)

Abstract

The application of Genetic Algorithms to problems which are not amenable to bit string representation and traditional crossover has been a growing area of interest. One approach has been to represent solutions by permutations of a list, and "permutation crossover" operators have been introduced to preserve legality of offspring. Three permutation crossovers are analyzed to characterize how they sample the o-schema space, and hence what type of problems they may be applicable to. Experiments performed on the Traveling Salesman Problem go some way to support the theoretical analysis.

Introduction

This paper is a study of three crossover operators designed to sample o-schemata [GoLi85]:

* the "Modified" crossover [Davi85] — a version of the Order crossover
* Goldberg and Lingle's Partially Mapped Crossover — PMX [GoLi85]
* a previously unreported crossover — the Cycle crossover.
We introduce a modified version of o-schemata and present an analysis for each of the crossovers. This is followed by experiments with the crossovers on the Euclidean Traveling Salesman Problem (TSP) [LLRS85]. The experimental results are tied back to the analytic predictions and are compared with other Genetic Algorithm (GA) approaches, a heuristic solution, a "Neural" Computation solution, and to the optimal solution.

The Crossovers

The Order Crossover

The order crossover proceeds as follows. First, two cut points are chosen at random. The section of the first parent between the first and second cut points is copied whole to the offspring. The remaining places are filled using elements not occurring in this crossover section. To do this we use the order in which the elements are found in the second parent after the second cut point:

    Parent 1:   h k c e f d | b l a | i g j
    Parent 2:   a b c d e f | g h i | j k l
    Offspring:  d e f g h i | b l a | j k c

The offspring sequence, j k c (return to the first position) d e f g h i, is the second parent starting at the second cut point with the elements of the crossover section of the first parent removed.

The absolute positions of some elements of the first parent are retained. Further, the relative positions of some elements of the second parent are also kept.

Davis's [Davi85] "Modified" crossover is exactly as above except that the first cut point is always at the beginning of the string.

The PMX

The PMX also proceeds by choosing two cut points at random:
    Parent 1:   h k c e f d | b l a | i g j
    Parent 2:   a b c d e f | g h i | j k l

The cut-out section defines a series of swapping operations to be performed on the second parent. In the case above we must swap b with g, l with h, and a with i, resulting in the offspring:

    Offspring:  i g c d e f | b l a | j k h

The absolute positions of some elements of both the first and second parents are preserved.

The Cycle Crossover

The cycle crossover is an answer to the question: can we create an offspring different from the parents where every position is occupied by a corresponding element from one of the parents? The answer, in general, is yes. Every element of the offspring can come from one of the parents, and usually the offspring will be different from both parents. We wish to satisfy the following conditions:

a) Every position of the offspring must retain a value found in the corresponding position of a parent.
b) The offspring must be a permutation.

Consider again these parents:

    Parent 1:   h k c e f d b l a i g j
    Parent 2:   a b c d e f g h i j k l

Consider position one: condition a) says the offspring must contain either h or a. Suppose we choose h. Using condition b), a cannot be chosen from parent 2 since that position is now occupied by h. So a must also be chosen from parent 1. Since a in parent 1 is above i in parent 2, we must also choose i from parent 1. Continuing the argument, having chosen h from parent 1 we must also choose a, i, j, and l. The positions of these elements are said to form a cycle; choosing any of the positions from parent 1 or 2 forces the choice of the rest if the above conditions are to hold. The cycles are usually labelled by a number, or are called unary and designated by the letter U:

    Parent 1:     h k c e f d b l a i g j
    Parent 2:     a b c d e f g h i j k l
    Cycle Label:  1 2 U 3 3 3 2 1 1 1 2 1

In a standard crossover operator a random crossover point or cut section is chosen. In the cycle crossover a random parent is chosen for each cycle.
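Before completing the cycle-crossover example, the two cut-point operators already described can be sketched and checked against their worked examples. This is our own Python sketch (the authors' system was written in C); cut points are passed explicitly as 0-based indices.

```python
def order_crossover(p1, p2, cut1, cut2):
    """Copy p1's cut section in place; fill the other positions in the
    order the remaining elements appear in p2 after the second cut."""
    L = len(p1)
    section = p1[cut1:cut2]
    child = [None] * L
    child[cut1:cut2] = section
    fill = [p2[(cut2 + i) % L] for i in range(L)]   # p2 from the second cut, wrapping
    fill = [e for e in fill if e not in section]    # drop elements already copied
    for i in range(L - len(section)):
        child[(cut2 + i) % L] = fill[i]
    return child

def pmx(p1, p2, cut1, cut2):
    """p1's cut section defines swaps applied to a copy of p2."""
    child = list(p2)
    pos = {e: i for i, e in enumerate(child)}       # element -> position
    for i in range(cut1, cut2):
        a, b = p1[i], child[i]
        if a != b:
            ia, ib = pos[a], pos[b]
            child[ia], child[ib] = b, a             # swap a and b in the child
            pos[a], pos[b] = ib, ia
    return child
```

With the example parents and cut points 6 and 9, order_crossover returns d e f g h i b l a j k c and pmx returns i g c d e f b l a j k h, matching the offspring shown above.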
Choosing cycle 1 from parent 1 and the rest from parent 2 in the above example produces the following offspring:

    Offspring:      h b c d e f g l a i k j
    Parent number:  1 2 2 2 2 2 2 1 1 1 2 1

The absolute positions of, on average, half the elements of both the first and second parents are preserved.

Cycle crossover properties

Cycles have some interesting properties. Consider the cycles formed when two random parents of length L combine (the order O of a cycle is the number of elements in the cycle):

1. The probability of an element being in a cycle of order O (where 0 < O <= L) is 1/L. Note that this is independent of O.

2. The expected number of cycles of order O is therefore 1/O.

3. The expected total number of cycles is the sum of 1/O for O = 1 to L, which is approximately ln(L).

4. The expected number of cycles of length 2 or greater is the sum of 1/O for O = 2 to L, which is approximately ln(L) - 1.

5. If two parents produce ln(L) - 1 cycles of length 2 or greater, then the number of possible ways of performing the cycle crossover to produce an offspring different from both parents is

    2^(ln(L) - 1) - 2

In fact the expected number of different offspring for two random parents will be greater than this. This shows there are in general a good number of choices to be made, comparable with the choices available when crossing over two bit strings.

O-schema Analysis

What is an o-schema?

The analysis of the Order crossover has motivated a relaxation of the definition of an o-schema. Usually the following schemata are different (where ! means don't care):

    a b ! ! c ! ! ! ! !        ! ! a b ! ! c ! ! !

In our scheme these are all considered equivalent to a b ! ! c ! ! ! ! !: the schema has been normalized by taking the first element to be the successor of the longest string of don't-care symbols. Lexical ordering of the fixed-value symbols can be used to resolve don't-care strings of equal length. This definition of a schema is valid for the TSP, as all the equivalent o-schemata have the same value. It also has two clear advantages: there are fewer schemata, so a given population will hold a greater percentage, and the defining length of many schemata is reduced.
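The cycle construction worked through above can likewise be sketched in Python (ours, not the authors' C code; cycle numbering is our own, with unary cycles getting their own labels rather than "U"):

```python
def find_cycles(p1, p2):
    """Label each position with its cycle: starting anywhere, repeatedly
    jump to the position in p1 holding the element p2 shows at the
    current position, until the walk returns to the start."""
    pos1 = {e: i for i, e in enumerate(p1)}
    labels = [None] * len(p1)
    label = 0
    for start in range(len(p1)):
        if labels[start] is None:
            label += 1
            i = start
            while labels[i] is None:
                labels[i] = label
                i = pos1[p2[i]]   # p2's element here must come from the same parent
    return labels

def cycle_crossover(p1, p2, take_from_p1):
    """Take each whole cycle from one parent; take_from_p1 maps a
    cycle label to True (use parent 1) or False (use parent 2)."""
    labels = find_cycles(p1, p2)
    return [a if take_from_p1(c) else b for a, b, c in zip(p1, p2, labels)]
```

On the example parents, taking cycle 1 from parent 1 and the rest from parent 2 yields h b c d e f g l a i k j, the offspring shown above.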
For instance, the following schemata are equivalent:

    b ! ! ! ! ! ! ! ! a        a b ! ! ! ! ! ! ! !

The first schema had a defining length of 10 but now has length 2.

Definitions

    O is the order of an o-schema
    D is the length of an o-schema
    K is the length of a cut
    L is the length of a string

Analysis method

The analysis in general follows the Goldberg and Lingle [GoLi85] analysis. We calculate the probability of survival of o-schemata:

    P(S) = P(S|W)P(W) + P(S|O)P(O) + P(S|C)P(C)

W (Within) occurs when the o-schema is entirely within the cut section. O (Outside) occurs when the o-schema is entirely outside the cut section. C (Cut) occurs when the o-schema contains a cut point. P(S|C), the probability of survival of an o-schema on a cut point, is considered to be negligible.

Relaxing the definition of an o-schema allows for higher survival rates. However, this effect is only apparent in the analysis for the Order crossover. The Cycle and PMX operators attempt to retain absolute positions in the ordering, so the probability of relative movement of schemata along the string is negligible.

The Order Crossover

Let parent 1 be the parent from whom all the elements within the cut are taken. Let parent 2 be the parent whose ordering is used to place the elements outside the cut. The probability of the o-schema surviving from parent 1 is the probability that the o-schema is completely within the cut:

    P(S|W)P(W) = P(W) = (K - D + 1)/L

The probability of the o-schema surviving from parent 2 is the probability that none of the elements of the schema have been taken from parent 1. For large L and small D the following approximation holds:

    P(S|O)P(O) ≈ (1 - K/L)^D

Thus the probability of o-schema survival for the Order crossover is:

    P(S) = (K - D + 1)/L + (1 - K/L)^D

The PMX

Let parent 1 be the parent from whom all the elements within the cut are taken. Let parent 2 be the parent from whom all the elements outside the cut are taken, after the mapping operation has been performed.
The probability of the o-schema surviving from parent 1 is the probability that the o-schema is completely within the cut:

    P(S|W)P(W) = P(W) = (K - D + 1)/L

The probability of the o-schema being completely outside the cut is

    (L - K - D + 1)/L

The expected number of elements needing to be moved in the mapping operation is K, so the probability of any one element needing to be swapped is K × (1/L) = K/L. For large L and small O the following approximation holds [Note 1]:

    P(S|O) ≈ (1 - K/L)^O

Thus the probability of o-schema survival for the PMX is:

    P(S) = P(S|W)P(W) + P(S|O)P(O) = (K - D + 1)/L + ((L - K - D + 1)/L)(1 - K/L)^O

The Cycle Crossover

Let N elements be taken from parent 1 and L - N elements from parent 2. (N is similar to the K factor in the Order and PMX analysis above.) The probability of the o-schema surviving from parent 1 is the probability that all of the elements of the o-schema are taken from parent 1. For large L and small O the following approximation holds:

    (N/L)^O

The probability of the o-schema surviving from parent 2 is the probability that all of the elements of the o-schema are taken from parent 2. For large L and small O the following approximation holds:

    (1 - N/L)^O

Thus the probability of o-schema survival for the cycle crossover is:

    P(S) = (N/L)^O + (1 - N/L)^O

O-schema summary

O-schemata from parent 1, P(S|W)P(W), have the same probability of survival with the Order crossover and PMX. When D = O, the probability of survival from parent 2 is a factor of

    L/(L - K - D + 1)

more likely with the Order crossover than with the PMX. However, as D becomes greater than O, the probability of the o-schema surviving from the second parent decreases exponentially compared to the probability of survival under the PMX. Thus we would expect the Order crossover to perform better than the PMX in problems where compact schemata are important. We expect the PMX would perform better than the Order crossover when compact schemata are less important.
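The three survival approximations derived above can be collected and compared numerically. This is a sketch of ours, valid only in the large-L, small-D, small-O regime the derivations assume:

```python
def p_survive_order(L, K, D):
    """Order crossover: within-cut term plus (1 - K/L)^D."""
    return (K - D + 1) / L + (1 - K / L) ** D

def p_survive_pmx(L, K, D, O):
    """PMX: within-cut term plus outside-cut term times (1 - K/L)^O."""
    return (K - D + 1) / L + ((L - K - D + 1) / L) * (1 - K / L) ** O

def p_survive_cycle(L, N, O):
    """Cycle crossover: independent of the defining length D."""
    return (N / L) ** O + (1 - N / L) ** O
```

For a compact schema (D = O = 5) with L = 100 and K = 30, the Order crossover survives more often than the PMX; for a sparse one (O = 3 spread over D = 20) the ordering reverses, matching the summary above.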
The cycle crossover is noteworthy, as the probability of an o-schema surviving is independent of its length. We would expect it to perform well in problems where the compactness of schemata is of no importance.

Experimental Method

A single 30-city problem was used for all the GA experiments, the "Neural" computation comparison, and the Lin/Kernighan comparison [Note 2]. Each crossover was studied separately; no mixing of crossovers was attempted. Crossover was always applied to 80% of the population. A single mutation operator, SWAP, was used. The SWAP operator simply interchanges two positions on the string. For each crossover, the SWAP operator was applied to a percentage of the population varying from 30% through 100% at intervals of 10%. Population sizes of 50, 200, and 500 were tried for each crossover with each mutation level. In all cases 50,000 tours were calculated. This gives 1,000 generations for population size 50, through 100 generations for population size 500. Roulette wheel reproduction was used with some modifications to reduce selection errors. No elitist strategy was used.

The system was written in the programming language C. One run (50,000 evaluations) takes approximately 15 minutes on an Apollo DN3000 (operating system Aegis/Domain/IX, kernel version 9.2.5, C compiler version 4.1.6).

Experimental Results

The Order Crossover

Here we see 8 runs of the Order crossover with mutation levels at 30-100%. Population size is 200. The x-axis is the number of evaluations; this can be thought of as the "work done", and is constant in this part of the study over all population sizes.

[Figure: tour length vs. evaluations/1000; population 200, Order crossover 80%.]

The Order crossover is very sensitive to mutation levels. The difference in performance is 30% from the best (449 at 30%, pop. 500) to the worst (646 at 100%, pop. 500).
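Given how much the mutation level matters in these results, it is worth noting how simple the SWAP operator is. A Python sketch of ours (the experiments themselves were coded in C):

```python
import random

def swap_mutation(tour, rng=random):
    """Interchange two randomly chosen positions of the tour."""
    t = list(tour)
    i, j = rng.randrange(len(t)), rng.randrange(len(t))
    t[i], t[j] = t[j], t[i]
    return t
```

Since tours are permutations, SWAP always yields another legal permutation, which is why it is a safe mutation for all three crossovers.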
There is a steady improvement in performance as mutation is decreased from 100% to 30%. This trait is common in large (500) and small (50) populations.

Performance improves slightly as population size increases. The best tour with a population of 500 (449) is 4% better than the best tour with a population of 200 (466), which is 3% better than the best tour for a population of 50 (492).

The PMX

For the PMX with population size 50 we have:

[Figure: tour length vs. evaluations/1000; population 50, PMX 80%.]

The PMX is sensitive (yet not so much as Order) to the mutation level. The difference in performance is 40% from the best (498 at 60%, pop. 50) to the worst (687 at 100%, pop. 50). The best mutation level is not as obvious as with the Order crossover; it is between 40% and 60% depending on the population size.

There is little performance variation with population size; in total a 5% difference, with populations 50, 200, and 500 giving best tours of 498, 518, and 521 respectively.

The Cycle Crossover

For the Cycle crossover with population size 50 we have:

[Figure: tour length vs. evaluations/1000; population 50, Cycle crossover 80%.]

The Cycle crossover is least sensitive to mutation levels. The variation in performance is 16% from the best (517 at 50%, pop. 50) to the worst (601 at 100%, pop. 50). The best mutation level is not clear; it varies from 50% for population 50 to 100% for population 200. Performance was 8% better for population 50 (517) than for population 200 (559) and 500 (560).

Experiment Summary

Here we see the best runs, with a picture of the best tours, for the Order crossover, PMX, and Cycle crossover respectively.

[Figure: best run, population 500, Order crossover 80%, SWAP 30%; best tour 449.]
[Figure: best run, population 50, PMX 80%, SWAP 60%; best tour 498.]
[Figure: best run, population 50, Cycle crossover 80%, SWAP 50%; best tour 517.]
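All of the "best tour" numbers quoted are Euclidean lengths of the closed tour. A minimal scoring sketch of ours:

```python
import math

def tour_length(cities, tour):
    """Length of the closed tour visiting cities (a list of (x, y)
    points) in the order given by tour (a permutation of indices)."""
    total = 0.0
    for i in range(len(tour)):
        x1, y1 = cities[tour[i]]
        x2, y2 = cities[tour[(i + 1) % len(tour)]]  # wrap back to the start
        total += math.hypot(x2 - x1, y2 - y1)
    return total
```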
Order does 14% better than the PMX, and 15% better than the Cycle crossover.

The o-schema analysis above ties in with these experimental results. The TSP is a problem where adjacency of particular elements (cities) is important. Thus we would expect that the Order crossover would perform best. The less sensitive behavior of the PMX and Cycle crossovers, especially the Cycle crossover, can be thought of as a consequence of a built-in higher level of mutation present in these operators.

It should be remembered that the experimental results are for the TSP only, and the o-schema properties of the PMX and the Cycle crossover should be considered when applying permutation crossovers to other problems.

"Going for the One"

Using the results of this study, further experiments were carried out to find a very good tour for the 30 cities. The results suggested use of the Order crossover, with mutation at 30% or lower, population size 500 or larger, and more evaluations.

At population size 500, for 400 generations, with mutation at 15%, and with population size 1000, for 200 generations, with mutation at 25%, the Order crossover produced the following tour of length 425:

[Figure: best GA tour found, length 425.]

This is 5% shorter than the best tour in the comparison study.

Comparison with other approaches

We can directly compare our results with the optimal tour, the Lin/Kernighan algorithm [LiKe73], and Hopfield and Tank's "Neural" computation technique [HoTa85]. Some comparison is possible with other GA results.

This is the optimal tour; its length is 424 [Note 3]:

[Figure: optimal tour, length 424.]

The Lin/Kernighan heuristic finds this tour. The best GA result, of length 425, is within 1% of this.

Next we see the best "Neural" computation tour. Its length is 509 [Note 4]. This is about 19% from optimal on the same 30 cities:

[Figure: best "Neural" computation tour, length 509.]
Goldberg and Lingle [GoLi85] report optimal results for 10 cities. We have not run the same cities as Grefenstette et al. [GGRG85] used, but some comparison is possible: Grefenstette et al. report results within 16-27% of optimal for 50, 100, and 200 cities. To put this work in perspective, current TSP research is on tours of tens of thousands of cities.

Summary

O-schema analysis showed that the Order crossover is superior in problems where compact schemata are important (such as the TSP). We expect the PMX to be superior when compactness is less important. The Cycle crossover preserves o-schemata independent of length, and is potentially superior in problems where compactness is not relevant. The experimental analysis supported the theoretical prediction of good performance of the Order crossover for the TSP. Results from these experiments indicated how to fine-tune the population size and mutation level to produce results that were near optimal for the 30 cities studied.

Notes

1. Goldberg and Lingle report k+1 here, where we calculate k.

2. The coordinates of the 30 cities are: (82 7) (91 38) (83 46) (71 44) (64 60) (68 58) (83 69) (87 7 (7 71) (58 69) (54 62) (34 67) (37 84) (41 04 (22 60) (25 62) (18 54) (4 50) (13 40) (18 (25 38) (41 26) (45 21) (44 35) (58 38) (62 39)

3. The optimal tour and Lin/Kernighan tour were computed by David Johnson.

4. These were not the actual coordinates used by Hopfield and Tank but were extracted from the diagrams in [HoTa85]. The extraction was quite accurate: they report 507 as the length of the "neural" tour, and we calculate 509 on our points; they report 426 for the Lin/Kernighan tour, and this is 426 on our points also. The diagrams in [HoTa85] should be rotated through 90 degrees, laterally inverted, and scaled up by 100 to correspond with our diagrams.
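For reference, the tour lengths quoted throughout these experiments are closed Euclidean path lengths over (x, y) city coordinates such as those in Note 2. A minimal sketch; the function name and the sample square are illustrative, not from the paper:

```python
import math

# Illustrative sketch: a tour's length is the closed Euclidean path length
# over (x, y) city coordinates. Names and the sample square are ours.

def tour_length(cities, tour):
    """cities: list of (x, y) points; tour: a permutation of city indices."""
    total = 0.0
    for a, b in zip(tour, tour[1:] + tour[:1]):  # wrap around to close the tour
        (x1, y1), (x2, y2) = cities[a], cities[b]
        total += math.hypot(x2 - x1, y2 - y1)
    return total

square = [(0, 0), (0, 10), (10, 10), (10, 0)]
print(tour_length(square, [0, 1, 2, 3]))  # 40.0
```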
References

[Davi85] Davis, Lawrence, Applying Adaptive Algorithms to Epistatic Domains, Proc. International Joint Conference on Artificial Intelligence, 1985.

[GGRG85] Grefenstette, J., Gopal, R., Rosmaita, B., Van Gucht, D., Genetic Algorithms for the Traveling Salesman Problem, Proc. International Conference on Genetic Algorithms and their Applications, 1985.

[GoLi85] Goldberg, D. E., Lingle, R., Alleles, Loci, and the Traveling Salesman Problem, Proc. International Conference on Genetic Algorithms and their Applications, 1985.

[HoTa85] Hopfield, J. J., Tank, D. W., "Neural" Computation of Decisions in Optimization Problems, Biological Cybernetics 52, 141-152, 1985.

[LiKe73] Lin, S., Kernighan, B. W., An Effective Heuristic Algorithm for the Traveling Salesman Problem, Operations Research 21, 498-516, 1973.

[LLRS85] Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., Shmoys, D. B., The Traveling Salesman Problem, John Wiley & Sons, UK, 1985.

A CLASSIFIER-BASED SYSTEM FOR DISCOVERING SCHEDULING HEURISTICS

by M. R. Hilliard & G. E. Liepins
Oak Ridge National Laboratory, Oak Ridge, TN 37831*
Mark Palmer, Michael Morrow & Jon Richardson
University of Tennessee, Knoxville, TN 37996

ABSTRACT

A classifier system has been designed to discover heuristics for simple job shop scheduling problems. Experiments have shown that the classifier system is capable of discovering heuristics in a limited domain. We describe a new generalized version of the system and a set of experiments designed to produce more general rules based on sets of problems. Initial experiments with the generalized system indicate a potential for success.

THE NEED TO DISCOVER HEURISTICS

Constrained resources are common to many important activities. Manufacturing and distribution of goods, the mobilization and allocation of military forces, and government agencies' management of resources such as water and energy are all resource constrained.
Since most optimization methods are limited either in their ability to model the constraints of the system or in their ability to solve problems of "real-world" scale, planning and scheduling in these environments requires heuristics. These heuristics may range from simple "rules of thumb" to complex algorithms based on considerable mathematical analysis; however, they all trade guarantees of optimality for speed and efficiency. The development of expert systems and other artificial intelligence techniques has led to methodologies for using heuristics to solve such problems, but the discovery of the heuristics for such a system remains a difficult task. A system which could learn heuristics based on its own experience and experimentation would provide a powerful tool for developing dynamic, responsive, and useful automated aids for planning and scheduling.

THE SCHEDULING DOMAIN

Scheduling provides a variety of increasingly complex problems to test a learning system's capabilities. We have chosen the formal structure of job shop scheduling as a model within which to work. In this model, processes must be completed in time to meet a delivery schedule as closely as possible. The processes each take a certain amount of time (run time) and may have precedence constraints (i.e., processes A and B must be complete before process C can begin). Sets of processes comprise jobs, and jobs are expected to be completed by a certain time (due date). Resources (machines) are available to complete the processes but can only be used on one process at a time. Numerous optimization based techniques have been investigated to solve this problem (including genetic algorithms [Davis, 1985]), but the most successful techniques rely on heuristics [Rinnooy Kan, 1976].
We have performed several experiments using a modification of the classifier system, on the simplest form of the problems (one process per job, no precedence constraints, one machine) using the objective of minimizing lateness, L:

(1)  L = sum over all i of (delivery date_i - due date_i),  i = job 1, job 2, ..., job n

This problem has a well-known optimal heuristic: "Sort the jobs by increasing run time." Our experiments showed that the classifier system could find the optimal ordering of the jobs for a particular example, and, given the relative rankings of the run time, the system could develop a set of rules which said, "Do the shortest job first. Do the second shortest job second ... Do the longest job last." A recently developed extended system appears to have potential for developing more general rules.

THE PHASE I CLASSIFIER SYSTEM

The first phase of experiments with the classifier system was intended to evaluate design decisions (detectors, effectors, and feedback mechanisms). The resulting system demonstrated learning only in a limited sense; however, it provided insight into the design decisions.

The Environment

To be able to evaluate design decisions, the initial experiments with the classifier system have all considered the simplest version of the job-shop scheduling problem (as stated above). The system is presented a set of jobs with randomly generated run times and due dates, and is required to organize the jobs into a queue such that the total lateness (1) is minimized. If the jobs are renumbered by their position in the queue, then

(2)  delivery date_i = sum over j <= i of run time_j

It is well known [Conway, et al., 1967] that the minimum lateness schedule is achieved by organizing the jobs by increasing run time.

* Operated by Martin Marietta Energy Systems, Inc., under contract No. DE-AC05-84OR21400 with the U.S. Department of Energy.
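The objective in equations (1) and (2) and the known optimal heuristic can be sketched as follows. This is an illustrative rendering, not code from the paper; the job data follows the run times and due dates of Table 1:

```python
# Sketch of the scheduling objective: jobs are (run_time, due_date) pairs,
# and delivery date_i is the cumulative run time of the queue up to and
# including job i (equation 2). Names are illustrative, not from the paper.

def total_lateness(queue):
    """Total lateness L = sum over i of (delivery date_i - due date_i)."""
    t = 0
    lateness = 0
    for run_time, due_date in queue:
        t += run_time            # delivery date of this job
        lateness += t - due_date
    return lateness

def spt_schedule(jobs):
    """The known optimal heuristic: sort the jobs by increasing run time."""
    return sorted(jobs, key=lambda job: job[0])

jobs = [(5, 16), (12, 22), (23, 13), (50, 40)]  # (run time, due date), as in Table 1
print(total_lateness(spt_schedule(jobs)))  # 61
```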
Classifier Representation

A system's ability to react correctly to its environment is determined in part by the adequacy of its internal representation of that environment. Within a traditional learning classifier system [Holland, 1986], there are certain representational limitations due to the use of the genetic algorithm as the primary learning mechanism [Liepins and Hilliard, 1986]. The foremost of these limitations is the necessity of encoding the environmental detector messages as strings from a small (usually binary) alphabet. The length of these detector messages determines the size of the search space for the classifier system, and a balance must be achieved to provide an internal representation of the environment which is adequate for the system to reason correctly and to develop useful classifier rules but simple enough to be searched in an acceptable period of time. During the design of the action and reward structure, a minimal representation was necessary to provide a testbed for evaluating various decisions about the other features of the classifier system. Thus, in the Phase I experiments, the messages consisted of a single field, the rank of a job's run time in relation to other jobs in the problem. The Phase II experiments were performed using the binary representation of the run time as the message (see Table 1).

Job   Run Time   Due Date   Ranked Run Time Message   Raw Run Time Message
 1        5         16               00                     000101
 2       12         22               01                     001100
 3       23         13               10                     010111
 4       50         40               11                     110010

Table 1. Ranked run time and raw run time message lists.

Actions

For a classifier system to learn how to schedule jobs, it must have some means of interacting with the job queue. Interaction is accomplished through the use of effector classifiers which perturb the queue.
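For concreteness, the two message encodings compared in Table 1 (the 2-bit run-time rank used in Phase I and the raw 6-bit run time used in Phase II) can be sketched as follows; the function names are illustrative, not from the paper:

```python
# Hypothetical sketch of the two detector-message encodings in Table 1:
# a 2-bit rank of the job's run time, or the raw run time as a 6-bit string.

def ranked_messages(run_times, bits=2):
    """Rank jobs by run time and emit each rank as a fixed-width bit string."""
    order = sorted(range(len(run_times)), key=lambda i: run_times[i])
    rank = {job: r for r, job in enumerate(order)}
    return [format(rank[i], "0{}b".format(bits)) for i in range(len(run_times))]

def raw_message(run_time, bits=6):
    """Emit the raw run time as a fixed-width binary string."""
    return format(run_time, "0{}b".format(bits))

run_times = [5, 12, 23, 50]
print(ranked_messages(run_times))           # ['00', '01', '10', '11']
print([raw_message(t) for t in run_times])  # ['000101', '001100', '010111', '110010']
```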
In Phase I of the Job Shop Scheduling Classifier System, all classifiers are effector classifiers. Each of these is compared with the current detector message list (representing the current job queue with one message per job) and acts upon the matched job. A successful classifier implementation requires that effector actions satisfy three criteria: 1) all possible orderings must be reachable by the actions; 2) the actions must be simple enough that a user can trace and analyze the steps leading to a particular queue ordering; 3) the evaluated merit of an action needs to reflect its true worth. Many possible actions were discussed, including swapping jobs, moving jobs to the top of the queue, moving jobs to the top or the bottom of the queue, and moving jobs to a queue location specified by the classifier itself. Of these, we investigated moving a job to the top of the queue and moving a job to an indicated position. The action of moving a job to the top of the queue was powerful enough to create all possible job queues and simple enough to trace and analyze (the first and second criteria). A trace of the classifiers acting in the last few cycles will indicate how the queue was formed, and any specific ordering can be achieved by placing the jobs at the top in reverse order of their run time. However, this design fails to meet the criterion of providing an appropriate reward for action, because the system would have to learn to create a bad queue ordering in order to make a good one. This contradiction confused the classifier system, and the action of moving jobs to a specific queue location was considered instead. This second alternative, moving jobs to a specific queue location, meets all criteria. This action meets the third criterion particularly well, since the actions can be rewarded with the value of the new queue as described in the next section.
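The chosen effector action, moving a matched job to a specific queue position, can be sketched as follows (an illustrative rendering, not the authors' implementation):

```python
# Sketch of the chosen effector action: remove the matched job from the
# queue and reinsert it at the position indicated by the classifier.

def move_to_position(queue, job, position):
    """Return a new queue with `job` moved to index `position`."""
    rest = [j for j in queue if j != job]
    return rest[:position] + [job] + rest[position:]

print(move_to_position(["a", "b", "c", "d"], "d", 0))  # ['d', 'a', 'b', 'c']
```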
Therefore, the right hand side of a rule in the Phase I experiments was the binary representation of the queue position in which to place the job matched by the left hand (detector) side.

Reward

Providing the system with feedback indicative of its performance is another major requirement of a learning classifier system implementation. Each classifier which effects a change in the system's environment must be given some reward in order to estimate the classifier's relative value within the environment and to make an appropriate decision as to which classifiers should be used in reproduction, which should be replaced during reproduction, and which should be allowed to act. The classifier system maintains a strength value for each classifier as an assessment of its relative worth within the environment. This strength is generally increased by rewards for classifiers which produce good actions and decreased by taxes and payments. The reward within the job shop scheduling environment is based on the total lateness of the current job queue. This can be considered the learning approach to reward, in contrast to the training approach, where each action taken by a classifier would be evaluated and the reward would be based upon the action itself rather than the combination of its direct effects and its side effects upon the job queue. In the learning approach it is the state of the newly constructed job queue that forms the basis for the feedback to all the active classifiers, and not the manner in which the new queue was formed. Thus the learning strategy permits the classifier system to develop any strategy for queue organization as long as it leads to an acceptable job queue configuration. This reward structure minimizes the amount of problem specific information in the system, and avoids encouraging myopic or "greedy" heuristics.
The reward in the learning approach implicitly includes both an evaluation of the actions taken and a measure of the side effects of the actions. This approach requires that a classifier be given the opportunity to perform its action in a variety of situations (in order to separate the effect of its action from the side effects of other classifiers' actions). A successful and appropriate reward value is the actual lateness of the current queue scaled to the range [0, 200]. Combining this reward strategy with the simultaneous action methodology described below created a system capable of distributing strength to a set of good classifiers.

Conflict Resolution

The conflict resolution strategy allows the system to develop sets of rules which cooperate to solve a problem. This cooperation can be based on finding niches in the environment as in Holland's default hierarchy, or it can be explicit. Our conflict resolution strategy allows for both types of operation by allowing multiple rules to compete for niches in the environment and allowing multiple rules to be active in each cycle. Each cycle proceeds through the following steps:

1. Match the detectors to the messages from the job list. Let S be the set of matching classifiers.
2. Generate a bid for each matching classifier based on strength, and add a random perturbation to the bid to produce a "noisy" bid (Bid = c*Strength + ε).
3. Select the classifier with the largest "noisy" bid. Perform the action indicated by the classifier.
4. Select the classifier with the next highest bid and (if possible) perform the action indicated by the classifier.
5. Repeat step 4 until all matching classifiers have been examined or until all slots are filled.
6. Fill any remaining slots with available jobs (maintaining their previous relative ordering).
7. Evaluate the resulting queue.
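A rough sketch of one such cycle for the fully specific Phase I rules follows. The bid constant, noise scale, and data layout are assumptions for illustration; the paper does not give them:

```python
import random

# Illustrative sketch of the conflict-resolution cycle (steps 1-6) for fully
# specific classifiers. A classifier is a dict with a 'condition' (a job
# message), a 'position', and a 'strength'. C and NOISE are assumed values;
# messages are assumed distinct.

C, NOISE = 0.1, 0.01

def run_cycle(classifiers, messages, n_slots):
    """Build one job queue from the classifiers' noisy bids."""
    # Step 1: match detectors against the job messages.
    matching = [c for c in classifiers if c["condition"] in messages]
    # Step 2: noisy bid = C * strength plus a small random perturbation.
    for c in matching:
        c["bid"] = C * c["strength"] + random.uniform(0, NOISE)
    # Steps 3-5: act in decreasing bid order until the slots fill.
    queue = [None] * n_slots
    placed = set()
    for c in sorted(matching, key=lambda c: c["bid"], reverse=True):
        job, pos = c["condition"], c["position"]
        if job not in placed and queue[pos] is None:
            queue[pos] = job
            placed.add(job)
    # Step 6: fill remaining slots with unplaced jobs in their original order.
    rest = iter(m for m in messages if m not in placed)
    return [q if q is not None else next(rest) for q in queue]
```

With classifiers encoding the identity mapping at well-separated strengths, the cycle reproduces the input ordering; with no classifiers at all, step 6 simply copies the job list.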
To provide more control in the experiments, there were eight jobs in the queue; therefore, the classifiers could be completely specific and did not employ #'s. The entire population of 64 possible rules (8 jobs x 8 positions) was maintained throughout the run. The classifier system established appropriate weights so that the eight optimal rules (i.e., 000/000, 001/001, ..., 111/111) dominated the population and generated an optimal queue. This was a limited demonstration of learning transferable heuristics. The system was limited by the requirement to use ranked run time (rather than raw run time) and to use specific rules.

THE GENERALIZED CLASSIFIER SYSTEM - PHASE II

Two extensions were necessary for the system to produce more general heuristics. The first is to generalize the left-hand side of the classifiers, allowing a classifier to match a set of jobs. The second extension is to reward the system for its performance on a set of problems, since appropriate generalizations require multiple examples. An analysis of the steady state of the system under different designs provided impetus for some design decisions.

Extensions

The Phase II system achieves generality in the classifier representation by using the raw run time information as in Table 1, and by incorporating a ternary representation with the third character, #, being regarded as a "don't care" symbol and matching either 1 or 0. Thus, 100#01# matches all of 1000010, 1000011, 1001010, 1001011. Since the classifiers match sets of jobs, the conflict resolution strategy becomes more complex. Two strategies are being compared in our initial experiments, the champion strategy and the consensus strategy. In both strategies, the classifier's bid is calculated the same way bids are calculated in the all-specific classifier system, except that the random perturbation is added at a different point in the process.
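The ternary "don't care" matching described above can be sketched directly:

```python
# Minimal sketch of ternary "don't care" matching for Phase II conditions.

def matches(condition, message):
    """condition is over {0, 1, #}; '#' matches either bit of the message."""
    return len(condition) == len(message) and all(
        c == "#" or c == m for c, m in zip(condition, message))

print(matches("100#01#", "1000010"))  # True
print(matches("100#01#", "1101010"))  # False
```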
The conflict resolution strategies view the bids through a job-position matrix. This matrix consists of cells for each job/position combination. Thus, in the case of four jobs, the matrix consists of sixteen cells and can be represented as in Figure 1.

[Figure 1: a 4 x 4 matrix of cells, with jobs as rows and positions 1-4 as columns. Cell C23 contains the bids of classifiers which match job 2 and recommend placement in position 3.]

The champion strategy adds random perturbations to each bid in a cell and chooses the largest noisy bid to create the bid for the cell. The consensus strategy sums the bids in a cell and adds a random perturbation to the sum to create the cell's bid. In both cases the conflict resolution proceeds as in the totally specific case, with cells replacing classifiers. The reward under the champion strategy is awarded to the chosen classifier in each active cell, while the consensus strategy divides the reward equally among all classifiers bidding into an active cell. Initial experiments have shown the champion system to be superior.

The second extension, evaluating the rules on multiple problems to encourage generality, can be accomplished in two ways. In the cyclic reward implementation, the system cycles through the problems in a training suite several times in different random orders, bidding and receiving a reward for each problem. Reproduction, selection, and crossover occur intermittently throughout this cycle. In contrast, the delayed reward scheme allows the classifiers to bid, accumulates the rewards for the entire suite, and adjusts the classifier strengths based on the suite as a whole. Reproduction, selection, and crossover occur after several bidding cycles. This delayed reward is intended to prevent the most recent problem from having a stronger influence on the strengths at the time of reproduction. The two evaluation methods are being compared in Phase II.
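The two cell-bid rules can be sketched as follows; the noise scale is an assumed parameter, not a value from the paper:

```python
import random

# Hedged sketch of the two cell-bid rules. `bids` holds the raw bids of the
# classifiers falling into one job/position cell; NOISE is an assumed scale.

NOISE = 0.01

def champion_bid(bids):
    """Perturb each bid; the largest noisy bid speaks for the cell."""
    return max(b + random.uniform(0, NOISE) for b in bids)

def consensus_bid(bids):
    """Sum the cell's bids, then perturb the sum once."""
    return sum(bids) + random.uniform(0, NOISE)
```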
Steady State Analysis

An analysis of the steady state equations indicates that both conflict resolution strategies tend to select more specific classifiers to act. Depending on the reward structure, selection for reproduction can tend toward general rules, specific rules, or rules with mid-level specificity. Let S = steady state strength, B = steady state bid, R = average reward, C = bid constant, k = specificity (number of specific positions), T = tax constant (S*T paid each cycle), n = number of jobs matched by the classifier, and m = number of classifiers bidding into the cell. We can consider a single classifier and a single cell. Three possible reward structures produce three different results (see Table 2).

Payments per cycle        Steady state value                      Classifier selection tendency
Payments = S(Ck + T)      S = R/(Ck + T),  B = R/(1 + T/C)        Specific
Payments = S(Ck + Tn)     S = R/(Ck + Tn), B = R/(1 + T/C)        Specific
Payments = S(C + Tm)      S = R/(C + Tm),  B = R/(1 + T/C)        Specific

Table 2. Steady state analysis for the champion strategy.

The consensus strategy produces steady state values equal to the champion strategy's values divided by m. Since the bids in a cell are summed to determine the actions, the selection for actions continues to tend toward specific classifiers. Selection for reproduction, however, tends to favor classifiers occupying cells with small populations. Since the offspring tend to populate the same cell as their parents, this provides a control on overpopulation of any single cell. Initial experiments with the systems tend to confirm these generalities, but the system is very sensitive to the value of T. If the value is too low, extremely general classifiers (######/010) tend to gain strength, whereas a high value tends to produce an extremely sparse matrix. A complete comparison of techniques awaits the completion of a full set of experiments. Initial analysis of the populations does reveal that once created, "good" general rules (00####/000) tend to survive.
CONCLUSIONS

Although we have not completed all the experiments in Phase II, the classifier system, when implemented with a conflict resolution scheme allowing for cooperation between classifiers, provides a mechanism which appears to be capable of generalization at a rudimentary level. The conflict resolution methods and bidding schemes developed for the simple job shop scheduling task seem to be extendible to more complex scheduling problems (precedence constraints, multiple machines, etc.) and possibly to other ordering problems. Future research will include a variety of experiments designed to help us estimate the system's ability to ignore extraneous data and the usefulness of a penalty function formulation for precedence constraints.

Acknowledgments

We wish to thank David Goldberg of the University of Alabama and Chrisila Pettey of Vanderbilt University for numerous helpful discussions and comments.

References

Conway, Richard W., William L. Maxwell, and Louis W. Miller, Theory of Scheduling, Addison-Wesley, Reading, Mass., 1967.

Davis, Lawrence, "Job Shop Scheduling with Genetic Algorithms," Proceedings of an International Conference on Genetic Algorithms and Their Applications, Pittsburgh, PA, 1985.

Holland, John H., "Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems," in Machine Learning: An Artificial Intelligence Approach, Vol. II, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell (Eds.), Morgan Kaufmann, Los Altos, California, 1986.

Liepins, Gunar E., and Michael R. Hilliard, "Representational Issues in Machine Learning," International Symposium on ..., Oct. 25, 1986.

Rinnooy Kan, A. H. G., Machine Scheduling Problems: Classification, Complexity and Computations, Nijhoff, The Hague, 1976.

USING THE GENETIC ALGORITHM TO GENERATE LISP SOURCE CODE TO SOLVE THE PRISONER'S DILEMMA

Cory Fujiki
John Dickinson
Department of Computer Science
University of Idaho
Moscow, Idaho 83843

ABSTRACT

The genetic algorithm is adapted to manipulate Lisp S-expressions.
The traditional genetic operators of crossover, inversion, and mutation are modified for the Lisp domain. The process is tested using the Prisoner's Dilemma. The genetic algorithm produces solutions to the Prisoner's Dilemma as Lisp S-expressions and these results are compared to other published solutions.

INTRODUCTION

The genetic algorithm is an adaptive learning system inspired by evolution and heredity. In this particular application, the genetic algorithm will be used with a program generator to create source programs for different applications of the Prisoner's Dilemma problem. The genetic algorithm is used to produce a set of programs (called a generation) which attempt to solve a particular problem. These programs are evaluated by a critic on how well they solve the problem, and the best programs are manipulated by a set of genetic operators to create a new set of programs (the next generation). This process is repeated until a program produces an acceptable solution to the problem. There are three genetic operators [6] traditionally used to create the new programs from existing programs: Crossover, Invert, and Mutate. These genetic operators have been modified to be used with the program generator.

BACKGROUND

The initial work with the genetic algorithm manipulated a bit string representation of the solution. Smith [7] has shown that the genetic algorithm can manipulate variable length solutions that are more complex in form. Cramer [2] showed that instruction sets could be manipulated directly. Hicklin [5] showed in his work that a genetic algorithm can be applied to a program generator which creates grammatically correct LISP programs from a given set of productions [5]. The initial generation of programs was generated at random and then evaluated. Programs which rated below the average score of the entire population are replaced by new programs created from the higher rated programs.
Random changes are made to programs to ensure a complete search of all possible programs. Although Hicklin's work was inspired by the genetic algorithm and Holland's work in the area, he did not use the genetic operators to develop new programs. Fujiki [4] took this work and developed genetic operators which closely approximated the traditional operators used by Holland [6].

KNOWLEDGE REPRESENTATION

A problem facing all adaptive learning systems is how to represent the knowledge being learned. In theory, the system should be flexible enough to allow unique and previously unknown solutions to be discovered. In practice, it is often necessary and even desirable to supply some built-in knowledge about a possible solution in order to speed up the search process. In Hicklin's system, the knowledge about a particular solution to a problem is represented as a program. A program is generated using a set of productions. The degree of flexibility in the system and the degree of built-in knowledge is dependent on how these productions are defined. The types of programs generated in this study are based on the set of productions shown in Figure 1. These productions represent a subset of the productions for LISP. While not all possible LISP constructs can be generated using these productions, all of the programs generated are acceptable LISP S-expressions. In productions 1 and 2, the symbol "Start" represents the starting symbol for the grammar. The programs created are LISP functions starting with a COND command. Productions 3 and 4 are used to create any number of condition and action pairs.

1) Start --> (COND (T Action))
2) Start --> (COND Cond-Term (T Action))
3) Cond-Term --> (Logical Action)
4) Cond-Term --> (Logical Action) Cond-Term
5) Action --> The different actions that may be performed.
6) Logical --> The different possible conditions that can be checked.
Productions Used To Create Programs
Figure 1

Once a true condition is found, the action associated with it is performed. Only the first true condition found in a COND is used. The last condition generated is always a TRUE function with some action. This insures that the program generated will always produce some action found in the grammar no matter what conditions have been formed earlier in the program. Production 5 represents the different result productions, represented by the symbol "Action". These actions may be Lisp S-expressions or calls to specific routines. This production produces the different actions that may be desired for the problem domain being studied. Production 6 represents all of the different productions used to generate the logical condition used in the program. There can be any number of these "Logical" productions, including productions which combine the logical conditions by the use of logical "AND" and "OR" operators. An example of one complete grammar used for the Prisoner's Dilemma programs is shown in Appendix A.

0. Initialization. Using the grammar productions, randomly create the initial population of m Lisp expressions. All members of this initial population are considered to be mutations of the start symbol of the grammar. Whenever the grammar is used, the productions are separated into two categories: those that have only terminal symbols on the right-hand side of a production and those that have one or more variable symbols on the right-hand side. The length of the Lisp expressions is controlled by selecting productions from these two categories. A terminal percentage, t, is used to determine the probability that a production will be selected from the category that contains only terminal symbols on the right-hand side. The terminal percentage is increased as the length of the Lisp expressions increases, so that eventually the Lisp expression contains only terminal symbols.

1. Evaluation. All Lisp expressions are evaluated using EVAL. The value returned is used to rank each Lisp expression. Some applications may require several evaluations of a Lisp expression to obtain the actual figure of merit. For example, the Prisoner's Dilemma requires several "turns"; each turn uses EVAL and the set of responses is used to calculate the figure of merit.

2. Modify population. A percentage, r, of the best performers in the current population is retained and the lower portion of the population is eliminated. The members eliminated are replaced by new members that are created from the remaining good performers by mutation, inversion, and crossover, which are invoked according to the mutation percentage pm, the inversion percentage pi, and the crossover percentage pc.

3. Iterate. Go to Step 1.

Genetic Algorithm for Lisp S-expressions
Figure 2

GENETIC OPERATORS FOR LISP SOURCE CODE

The genetic operators must be able to change a program but at the same time keep some of the major characteristics of the program. If the operator does not change the program enough, then the search for better programs proceeds too slowly. If the operator changes the program too much, then the characteristics of the program that made it successful will be lost and little information is passed to the next generation of programs. The programs generated all have the same basic form in that they are all COND statements followed by one or more conditions and actions. Each condition action pair is considered to be one individual piece of information to be used by the genetic algorithm, and pairs are never broken apart by any operator. The operators used in this study were inspired by those used by Holland [6].

The crossover operator combines condition action pairs from one program with those of another program. The crossover operator divides two programs at two points picked at random in between the condition action pairs.
New programs are created by combining the first part of one program with the second part of the other program. A second program is created by combining the two remaining parts of the programs. The crossover point can occur at any point in the list of conditions except after the TRUE condition found at the end of the program. A crossover after this point would be pointless, since all conditions after a true statement would never be used and the program would remain the same as before. For example, a program can be represented as (Cond a b | c d) and (Cond A B C | D), where each letter is a condition action pair. Using crossover between b, c and C, D would produce the following programs: (Cond a b D) and (Cond A B C c d).

The invert operator does not change the information in a program but rather rearranges the information. When the invert operator is applied to a program, it reverses the order in which the condition action groups are evaluated, but it does not change the condition action groups themselves. The TRUE condition found at the end of the program is not included as one of the points which can be inverted. If the true condition were inverted, none of the conditions following it would be used, since the program always ends after the first true condition found. If, for example, a program represented as (Cond a b c d) is inverted between a and c, then the resulting program would be (Cond c b a d).

The mutate operator is used to add new information to the population of programs. Once the original generation of programs has been created, the crossover operator and the invert operator cannot add any new information to the programs, since they only rearrange the information that already exists. A random mutation operator is used to add new information to the programs generated. When the mutation operator is used on a program, it removes one condition action pair picked at random in the program and replaces it with a new condition action pair.
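The three operators on the condition-action pair representation can be sketched as follows, reproducing the paper's crossover and invert examples. `new_pair` stands in for regenerating a pair from the grammar, and this sketch simply excludes the trailing T-pair from mutation rather than mutating only its action:

```python
import random

# Illustrative sketch of the three operators on programs represented as lists
# of condition-action pairs, with the final element playing the (T Action) role.

def crossover(p1, p2, cut1, cut2):
    """Swap tails at the chosen cut points (never after the trailing T-pair)."""
    return p1[:cut1] + p2[cut2:], p2[:cut2] + p1[cut1:]

def invert(prog, i, j):
    """Reverse the order of pairs i..j, leaving the trailing T-pair alone."""
    return prog[:i] + prog[i:j + 1][::-1] + prog[j + 1:]

def mutate(prog, new_pair):
    """Replace one randomly chosen non-final pair with a freshly generated one."""
    k = random.randrange(len(prog) - 1)  # simplification: never touch the T-pair
    return prog[:k] + [new_pair] + prog[k + 1:]

# The paper's crossover example: (Cond a b | c d) x (Cond A B C | D).
print(crossover(["a", "b", "c", "d"], ["A", "B", "C", "D"], 2, 3))
# (['a', 'b', 'D'], ['A', 'B', 'C', 'c', 'd'])
print(invert(["a", "b", "c", "d"], 0, 2))  # ['c', 'b', 'a', 'd']
```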
This new condition-action pair is created using a subset of the original grammar. Since it is desired to always have a true condition at the end of the program, the last true condition could not be changed by the mutation operator. If this condition is picked to be mutated, then only the action associated with it would be changed. For example, if a program which is represented as (COND a b c d) is mutated at c, then it would appear as (COND a b e d), where e is a completely new condition-action pair. The extent of the mutation performed on a program is determined at random.

A detailed outline description of the genetic algorithm applied to Lisp S-expressions is given in Figure 2. The values for the parameters used for this work were:

m = 80 (population size)
r = 25% (survival rate)
t = initially 20% but increases to 100% (terminal percentage)
pm = 25% (mutation percentage)
pi = 25% (inversion percentage)
pc = 25% (crossover percentage)

The percentages for the genetic operators were all set to the same value because one of the goals in Fujiki [4] was to determine the effects of the various operators.

PRISONER'S DILEMMA PROBLEM

Each program is evaluated and the highest scoring programs are used to create the next population. If two programs score the same, then the smaller program is rated higher. This is done to favor the more efficient programs and remove programs with useless clauses.

The Prisoner's Dilemma problem involves two players who both must decide whether to cooperate or defect on each other. Each player is awarded points (called the payoff) depending on the action taken compared to the action of the opponent. The different combinations of possible actions and the payoffs are shown in Figure 3, where R represents the payoff given to each player if both cooperate. S represents what is called the sucker payoff (i.e.
the payoff for cooperating when the other player defects), T represents what is called the temptation payoff (i.e. the payoff for a player when he defects and the other player cooperates), and P is called the punishment payoff when both players defect. The payoffs for each move have the following relationships in the Prisoner's Dilemma problem: T > R > P > S. The gain for the players when both cooperate with each other is greater than when both defect. The gain for cooperating when the opponent defects is the smallest of any action. The gain for a player is greatest when he defects and the other player cooperates.

                          Player 2 (Second Symbol)
                          Cooperate    Defect
Player 1      Cooperate      R/R        S/T
(First Symbol)   Defect      T/S        P/P

Points Scored in Prisoner's Dilemma
Figure 3

A second requirement for the Prisoner's Dilemma is that players cannot benefit from taking turns exploiting each other. The gain from mutual cooperation must be greater than the average score for defecting and cooperating: R > (S + T)/2.

From the basic situation described above, several variations have been created. In this paper, the following additional requirements have been added: 1) Instead of just one round of decisions, a series of decisions is to be made. Here a player could start by cooperating but decide to defect later on depending on what the other player did. 2) The strategy is used against a variety of other strategies in a round robin tournament. Here the best strategy is the one which scores the most total points at the end of the tournament. In this situation, a player may not be able to outscore any other player but still win by scoring well against all of the other players.

The Prisoner's Dilemma has been used to represent a wide variety of real life situations.
For example, when two businesses interact, it is usually a situation where one company would gain the most if it could get the other company to give up something for nothing, but at the same time, they have more to gain if they both cooperate than if they both try to deceive each other. Problems ranging from personnel relationships to superpower negotiations have been studied using some form of the Prisoner's Dilemma problem [1].

For this study, the genetic algorithm tried to develop a strategy when having to make a series of 10 decisions against a variety of other strategies. In each situation, the program generated by the genetic algorithm was used each time a decision whether to cooperate or defect was required. Scoring for each decision was based on the following (see Figure 4).

                         Player 2 (Second Value)
                         Cooperate    Defect
Player 1     Cooperate      3/3        0/4
(First Value)   Defect      4/0        1/1

Points Scored in Prisoner's Dilemma
Figure 4

The number of games played and the points awarded for each move were based on information from a tournament held for the Prisoner's Dilemma problem. This information was received in a file called The First Announcement of Computer Programs Tournament (of the Prisoner's Dilemma Game) on February 7, 1986. These values can be easily changed in the genetic algorithm to meet other conditions. For instance, if the grammar has no knowledge about the total number of games to be played, then a strategy for an infinite number of games can be generated.

The grammar used to generate programs had three different options it could check based on what the opponent did in the past: 1) what the opponent did on the last move, 2) what was done two moves before, and 3) whether a particular action was ever done by the opponent in the past (either defect or cooperate). Other strategies could be generated by using a combination of the above checks together. The program generated could also check on the number of rounds played.
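The Figure 4 values satisfy both Prisoner's Dilemma constraints; a quick check and a scoring helper are sketched below (taking the mutual-defection payoff to be 1, our reading of the figure, and with 1 = cooperate, 2 = defect as elsewhere in the paper):

```python
# Payoffs read from Figure 4: R = 3 (mutual cooperation), S = 0 (sucker),
# T = 4 (temptation), P = 1 (mutual defection; assumed value).
T, R, P, S = 4, 3, 1, 0

assert T > R > P > S        # defection tempts; punishment still beats sucker
assert R > (S + T) / 2      # alternating exploitation cannot beat cooperation

# (my move, opponent's move) -> my points; 1 = cooperate, 2 = defect.
PAYOFF = {(1, 1): R, (1, 2): S, (2, 1): T, (2, 2): P}

def score(my_moves, opp_moves):
    """Total payoff for one player over a series of rounds."""
    return sum(PAYOFF[m, o] for m, o in zip(my_moves, opp_moves))
```

For example, cooperating twice and then defecting against an opponent who cooperates once and then defects twice yields 3 + 0 + 1 = 4 points.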
While these checks were available in the grammar, there was no information about which checks to use or how to use them when creating a program. If a user was interested in other strategies, then they could be added to the grammar. The actions available to a program would be to either cooperate (represented by a 1) or defect (represented by a 2).

RESULTS AND CONCLUSIONS

The generation of what appear to be optimal strategies for the Prisoner's Dilemma problem came after relatively few generations in most cases. According to Axelrod's pioneering work on the Prisoner's Dilemma, the optimal strategy for the Prisoner's Dilemma under most circumstances is one he calls Tit-For-Tat [1] (there is never one optimal strategy for all cases, since the best strategy depends on the other strategies being played). Tit-For-Tat cooperates on the first move and then does whatever the opponent did on the previous move (i.e. if the opponent cooperated then it cooperates, and if the opponent defected, then it defects). Axelrod states that a strategy normally has three different characteristics to be successful in a Prisoner's Dilemma tournament: nice (i.e. does not defect first), provocative (i.e. will defect if defected on), and forgiving (i.e. can be made to cooperate after it has started to defect). Tit-For-Tat has all of these characteristics. This strategy is simple for the grammar to generate and therefore one would expect it to appear soon in the random generation of programs. The fact that simple strategies could produce some of the best results no doubt leads to the rapid generation of solutions.

The environment that the genetic algorithm was tested under was based on work done by Dacey and Pendegraft in their study of the Prisoner's Dilemma problem [3]. The grammar used to generate programs was designed to produce all of the different strategies studied by Dacey and Pendegraft. Other strategies were also possible with the grammar (i.e.
checking two moves back), but there was no attempt at making a grammar that was all inclusive for every possible strategy. The grammar also allowed the checking of which round the program was in.

The first run was played against strategies where Tit-For-Tat was thought to be optimal. The solution generated was a slight variation of Tit-For-Tat where, instead of defecting if the opponent's last move is a defect, it defects if the opponent's last two moves are both defects. This was not one of the strategies studied by Dacey and Pendegraft and was produced by combining two strategies found in the grammar. This strategy was studied by Axelrod in his work [1] (called Tit-For-Two-Tats) and matches his contention that the best overall strategy should be nice (i.e. never defect first) and forgiving (i.e. can be made to cooperate even after being defected upon). It is also provocative, but not as much as Tit-For-Tat. When Tit-For-Tat was tried against the same environment, it was found to score lower than this new strategy. The genetic algorithm was tried against another set of strategies where Dacey and Pendegraft's work predicted Tit-For-Tat was optimal, and once again Tit-For-Two-Tats won.

A third run was made against a different combination of strategies. In this environment, according to Dacey and Pendegraft, the strategy predicted to be optimal was one of cooperating with an opponent until defected upon and then defecting from that point until the end. This time, the strategy produced by the genetic algorithm matched the predicted strategy.

In the examples used above, the genetic algorithm developed a strategy against a fixed set of opponents. A fourth run was made where the programs produced by the genetic algorithm were also used as the environment that the programs were played in. Each program played all of the other programs produced each generation and was then evaluated on its overall score against all of the programs.
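The strategies discussed in these runs are each a one-line decision rule over the opponent's move history. A sketch (function names are ours; 1 = cooperate, 2 = defect):

```python
def tit_for_tat(opp):
    """Cooperate on the first move, then echo the opponent's last move."""
    return 1 if not opp else opp[-1]

def tit_for_two_tats(opp):
    """Defect only when the opponent's last two moves were both defections."""
    return 2 if opp[-2:] == [2, 2] else 1

def cooperate_then_always_defect(opp):
    """Cooperate until the opponent defects once, then defect forever
    (Dacey and Pendegraft's cooperation-permanent defection)."""
    return 2 if 2 in opp else 1
```

A single defection provokes Tit-For-Tat but not Tit-For-Two-Tats, which is what makes the latter the more forgiving of the two.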
In this type of environment, a strategy may do extremely well only because it can take advantage of the other strategies currently in the population. If these strategies are replaced by different strategies, then the most successful strategies may change. This is exactly what appears to happen in this run. At first, a strategy of defecting all the time scored extremely well and began to dominate the population. Then a small number of programs were created which had the characteristic of being both nice and provocative. These strategies, of which Tit-For-Tat is one, protect themselves from being taken advantage of by an all-defecting strategy but also cooperate with other nice programs. When enough of this type of program are created, either by random mutation or by manipulating existing programs, then the all-defecting strategy is no longer optimal. Eventually the strategy of cooperating until defected on, and then defecting from then on, came out as the optimal strategy. This strategy appeared after approximately seventy generations and took over as the top player. Once this strategy was created, no other strategy was found that scored better in over 300 generations.

In Axelrod's work on the Prisoner's Dilemma problem, he states that in a population of all-defecting players, a strategy of being nice and provocative would be successful if a group of nice and provocative players appeared in the population. But he also states that if a population of mostly nice and provocative strategies exists, then no other type of strategy would ever be better. This is exactly what occurs in this run of the genetic algorithm. The all-defecting strategy appears early in the run and appears to be the best strategy. The genetic algorithm eventually created enough programs which are nice and provocative, and they are able to outscore the all-defecting programs. Once this occurs, this strategy is never outscored by any other strategy found by the genetic algorithm.
This strategy was not the exact one predicted by Axelrod to be the best for the general Prisoner's Dilemma tournament (i.e. Tit-For-Tat), but it does match many of the results found by Dacey and Pendegraft. It can be shown that when using only those strategies used in Dacey and Pendegraft's study, the strategy of cooperating until defected on, then permanently defecting, is the optimal one. Since it is their work that inspired the grammar producing the programs, this could be the reason why Tit-For-Tat did not win. The strategy of cooperating until defected on, then permanently defecting, is considered nice and provocative, but it is not forgiving. This would mean that it would do just as well or better than Tit-For-Tat in a world of nice strategies (since neither program would defect first) and both would respond when defected on. Tit-For-Tat would do better if it played against programs that tried to encourage cooperation after having defected. This type of strategy can be made by the grammar used in the genetic algorithm, but it is not very common. This may be one reason why, in this particular environment, Tit-For-Tat does not appear to be optimal.

In all of the strategies developed, each one eventually learned to defect on the last move. It can be shown that in a limited number of Prisoner's Dilemma games, the optimal strategy always includes a defect on the last move. Examples of the programs generated by the genetic algorithm for the different Prisoner's Dilemma problems are shown in Appendix B.

The significance of this work is applying the genetic algorithm in the generation of real source code and showing that the genetic algorithm can be used to find optimal solutions to less than trivial problems when a proper set of productions is used.

APPENDIX A: GRAMMAR USED IN PRISONER'S DILEMMA

The following is a grammar that is used to produce the programs for the Prisoner's Dilemma problem.
This grammar produces programs which can check a variety of items before returning either a 1 (which represents a cooperation move) or a 2 (which represents a defect move).

1) index --> (COND (T action))
2) index --> (COND index-cond-term (T action))
3) index-cond-term --> (logical action)
4) index-cond-term --> (logical action) index-cond-term
5) action --> 1
6) action --> 2
7) logical --> (NOT logical)
8) logical --> (l-op logical logical)
9) logical --> (EQUAL NROUND 1)
10) logical --> (EQUAL NROUND 10)
11) logical --> (EQUAL OP-PLAY action)
12) logical --> (EQUAL OP2-PLAY action)
13) logical --> if-any
14) l-op --> AND
15) l-op --> OR
16) if-any --> PAST-DEF-OP
17) if-any --> PAST-COOP-OP

In the previous grammar, any item written in all capital letters is used in the final program. Index is the starting point for all programs. If production 1 is picked, then index becomes a COND statement with only a TRUE test (which by definition is always true). Action is a non-terminal and can be converted to either a 1 or a 2; this represents either a cooperation move or a defect move. If, in the beginning, production 2 was chosen, then a program with a COND and one or more index-cond-term productions is used. These become some kind of logical test which must evaluate to a true statement before a corresponding action will be done. Some of the tests that can be generated in a program include: check to see if this is round 1 of the game (production 9) or round 10 (production 10), check the last play the opponent made (production 11), check the play the opponent made two moves ago (production 12), check if the opponent ever defected in the past (production 16) or if the opponent has ever cooperated in the past (production 17). The symbols created by productions 16 and 17 are prewritten functions created for this particular problem which become true if a defect or cooperation move has ever been made.
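One way to realize such a grammar as a random program generator is sketched below. The encoding is ours: each nonterminal maps to its list of alternatives, with a non-recursive alternative listed first so that generation can be forced to terminate beyond a depth cap (in the paper, the gradually increasing terminal percentage t plays this role and is not modeled here).

```python
import random

# Appendix A grammar, re-encoded: keys are nonterminals, values are lists
# of alternative right-hand sides; any symbol without a rule is a terminal.
GRAMMAR = {
    "index": [["(COND", "(T", "action", "))"],
              ["(COND", "index-cond-term", "(T", "action", "))"]],
    "index-cond-term": [["(", "logical", "action", ")"],
                        ["(", "logical", "action", ")", "index-cond-term"]],
    "action": [["1"], ["2"]],
    "logical": [["(EQUAL NROUND 1)"], ["(EQUAL NROUND 10)"],
                ["(EQUAL OP-PLAY", "action", ")"],
                ["(EQUAL OP2-PLAY", "action", ")"],
                ["if-any"],
                ["(NOT", "logical", ")"],
                ["(", "l-op", "logical", "logical", ")"]],
    "l-op": [["AND"], ["OR"]],
    "if-any": [["PAST-DEF-OP"], ["PAST-COOP-OP"]],
}

def generate(symbol="index", depth=0):
    """Expand a nonterminal into program text. Past the depth cap, always
    take the first (non-recursive) alternative, so expansion terminates."""
    if symbol not in GRAMMAR:
        return symbol
    alts = GRAMMAR[symbol]
    alt = alts[0] if depth > 10 else random.choice(alts)
    return " ".join(generate(s, depth + 1) for s in alt)
```

Every generated string starts with (COND and ends with a (T action) clause, as productions 1 and 2 require.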
These checks can be combined using logical "AND" and "OR" statements (production 8 with productions 14 and 15).

APPENDIX B: PROGRAMS GENERATED BY THE GENETIC ALGORITHM

The following are examples of the programs produced by the genetic algorithm for the Prisoner's Dilemma. Each program is designed to work in a Lisp environment. The genetic algorithm was used to produce several different strategies for different environments for the Prisoner's Dilemma problem. The following represents two different strategies, one called Tit-For-Two-Tats by Axelrod [1] and one called Cooperation-Permanent Defection by Dacey and Pendegraft [3].

Tit-For-Two-Tats

(COND ((EQUAL NROUND 10) 2)
      ((NOT (EQUAL OP2-PLAY 2)) 1)
      ((NOT PAST-DEF-OP) 2)
      ((EQUAL OP-PLAY 2) 2)
      (T 1))

This strategy defects on the last move, which is when the round (stored in the variable "NROUND") is ten. If the opponent's action two moves ago (stored in OP2-PLAY) was not a defect (represented as a "2"), then it cooperates (represented as a "1"). The third condition checks to see if the opponent has ever defected (stored in PAST-DEF-OP); this condition would cause a defection if the opponent never defected. In this program, this condition is never reached, since if the opponent never defected, then the opponent would not have defected two moves ago and the second condition would have been true. Therefore, the only time this program defects is if condition four is true and the conditions before it are false: specifically, if the opponent defected two moves ago and the last move (stored in OP-PLAY) was also a defect. Any other conditions would produce a cooperation move from the true condition in the last line of the program.

Cooperate-Permanent Defection

(COND (PAST-DEF-OP 2)
      (T 1))

This strategy is very simple to produce, since one of the built-in procedures is true if the opponent has ever defected (stored in PAST-DEF-OP). This program returns a defecting move if PAST-DEF-OP is true (meaning the opponent has defected at some point).
Otherwise, it returns a cooperating move.

BIBLIOGRAPHY

[1] Axelrod, Robert, The Evolution of Cooperation, Basic Books Inc., New York, 1984.
[2] Cramer, Nichael Lynn, A Representation for the Adaptive Generation of Simple Sequential Programs, Proceedings, International Conference on Genetic Algorithms and their Applications, July 1985.
[3] Dacey, Raymond; Pendegraft, Norman, The Optimality of Tit-For-Tat, prepared for presentation, International Studies Association, March 1986.
[4] Fujiki, Cory, An Evaluation of Holland's Genetic Operators Applied to a Program Generator, Master of Science Thesis, Department of Computer Science, University of Idaho, 1986.
[5] Hicklin, Joseph F., Application of the Genetic Algorithm to Automate Program Generation, Master of Science Thesis, Department of Computer Science, University of Idaho, 1986.
[6] Holland, John H., Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.
[7] Smith, Stephen F., Flexible Learning of Problem Solving Heuristics Through Adaptive Search, Proc. 8th IJCAI, August 1983.

Optimal Determination of User-oriented Clusters: An Application for the Reproductive Plan

by Vijay V. Raghavan* and Brijesh Agarwal
The Center for Advanced Computer Studies
University of Southwestern Louisiana
Lafayette, LA 70501-4330

Abstract

A typical Information Retrieval system is required to be able to satisfy user queries in an efficient and effective way. Furthermore, the system should be able to adapt itself to changing user patterns. Recent work in user-oriented clustering attempts to meet the above objectives by identifying document clusters that are consistent with the similarities perceived by the users, rather than those hypothesized by the designer. This paper proposes a representation framework for the application of an adaptive scheme in order to find optimal user-oriented clusters.
It is shown, in particular, that the reproductive plan proposed by Holland [7] is a promising solution strategy for the optimal determination of the boundaries between clusters.

1. Introduction

Cluster analysis is an important tool employed in information retrieval to enhance both efficiency and effectiveness of the retrieval process. The retrieval strategies based on classification of documents can lead to greater efficiency by the search being restricted to just a few selected clusters; they can achieve a greater effectiveness if clusters are formed in such a way that, when the appropriate clusters are selected for retrieval (or detailed examination), the number of relevant documents obtained relative to the number of documents retrieved (or examined) is high. This implies that the clustering algorithm must ensure that documents that are likely to be relevant to the same queries are placed together in the same clusters.

Although many existing strategies for document classification consider the summarization of the data and the identification of the "natural" or "homogeneous" clusters as the primary objective, it is also attractive to specify some extrinsic criterion, based on the requirements of the application environment, according to which a placement of documents into classes is to be assessed. In the latter case, the problem of cluster analysis becomes simply an instance of combinatorial optimization problems.

* Currently on leave from the University of Regina, Regina, Canada, S4S 0A2.

When performing cluster analysis by formulating the problem as one of function optimization, one can choose from two general directions for developing a solution. On the one hand, prior knowledge of specific properties of the function to be optimized (e.g. unimodality, continuity, etc.) can be exploited in tailoring a particular strategy to the problem at hand.
On the other hand, either because prior knowledge is not available or because the optimization criterion is known to be complex (and, therefore, not amenable to optimization by "standard" techniques), one could seek a solution strategy that is effective for a broad class of complex criterion functions. The classification of documents for information retrieval, in our judgement, requires the adoption of criterion functions that are quite ill-behaved. Consequently, in what follows, we develop an approach involving a robust strategy for function optimization.

By now the connection between adaptive processes and function optimization problems is well established. Several adaptive plans have been proposed and, in particular, the reproductive plan by Holland [7] has been fairly well investigated as an approach to function optimization [1-4,16]. The conclusions are that the algorithm is robust and, for the cases where the criterion functions are ill-behaved, more efficient relative to standard techniques.

Thus, there is motivation for considering the reproductive plan as an approach for performing cluster analysis. In other words, we suggest that each possible classification (a placement of objects into clusters) be considered a device, and it is desired to identify progressively better performing devices vis-a-vis a clustering criterion. This view of clustering has already been investigated in Raghavan and Birchard [13]. However, in that work, certain problems of representation were not adequately resolved. The result was that the basic genetic operators such as crossover, double-crossover and mutation did not prove effective. In the current work, a framework in which such representational problems do not exist is identified. As a result, we are able to assert that the genetic algorithm is again a promising direction to investigate.

In the next section, the general framework that characterizes the user-oriented clustering approach is presented.
Then, in section 3, the definition of the problem, for which the use of the genetic algorithm is being advocated, is given. The details of the structure representation and operator specification are developed in section 4. Finally, section 5 summarizes the contributions of this work.

2. The Problem Framework

In the context of information retrieval systems, a new approach to document clustering called user-oriented document clustering (also known as adaptive clustering) is emerging. This approach has been investigated by Yu et al. [16], and Deogun and Raghavan [14]. User-oriented clustering can be seen as a mechanism whereby the classification is based on the user perception of clusters, rather than on some similarity function perceived by the designer to represent the user criteria.

2.1 Overview of Adaptive Clustering

The adaptive clustering process is basically a two-stage clustering scheme. Each document is initially assigned a position on the line (-∞, ∞). In the first stage, as a query is processed, documents relevant to this query are moved closer; then some randomly selected documents are moved away from their centroid. This step is repeated for several queries. In the second stage, the clusters are identified. For the purpose of this paper, the first stage of the algorithm is adapted from Yu et al. The second stage is based on the formulation of the problem developed by Raghavan and Deogun [14]. The important details in the algorithm are as follows:

Stage 1

1. Given documents d_1, d_2, ..., d_N, each document d_i is assigned an arbitrary position p_i on a real line (-∞, ∞), for 1 ≤ i ≤ N.

2. Given a (next) query:
   a. use the (actually) stored clusters to provide a response;
   b. obtain feedback from the user as to which documents are relevant to the query;
   c.
modify the position of documents on the line, so that the documents accessed for this query occupy positions closer to each other on the line, while ensuring that points corresponding to all documents do not eventually bunch up in a small interval.

3. a. If the clustering based on the positions of the documents on the line (see Stage 2 for details) is significantly different from the stored clusters, then reorganize the stored clusters according to the new clusters indicated.
   b. Go to step 2 (next query).

Stage 2

In this stage, optimal clusters are to be obtained by defining boundaries at suitable points on the document line. The basic steps are:

1. Define possible boundary points (points on the document line where a boundary between two clusters can be defined).

2. From the set of possible boundary points, select some points as actual boundaries to obtain a particular set of clusters (a classification).

3. Evaluate the performance of this classification relative to the given set of queries, according to some performance criteria. Terminate if a sufficiently well performing classification has been obtained.

4. Otherwise, generate a potentially better performing classification by modifying the current one. Repeat step 3.

2.2 Earlier Work on Adaptive Clustering

Several research investigations have been carried out, in the last few years, on adaptive document clustering [6,14-17]. The main emphasis of these activities has been on ways of accumulating the information gathered through user feedback (analogous to Stage 1 above). In Raghavan and Sharma [15], the idea of arranging the documents on a line, proposed in [16], was retained, but the conditions under which two documents would be moved closer to or apart from each other are modified. The results indicate that the modification is computationally more efficient.
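Step 2c of Stage 1 can be illustrated with a toy update rule. The constants and the exact moves below are our invention for illustration only; the actual update scheme of Yu et al. differs in detail.

```python
import random

def update_positions(pos, relevant, pull=0.5, push=1.0, jitter=2):
    """Illustrative Stage 1, step 2c: pull the documents relevant to the
    current query toward their centroid, then push a few randomly chosen
    other documents away from it, so the whole collection cannot bunch
    up in a small interval. `pos` maps document id -> line position."""
    centroid = sum(pos[d] for d in relevant) / len(relevant)
    for d in relevant:
        pos[d] += pull * (centroid - pos[d])        # move closer together
    others = [d for d in pos if d not in relevant]
    for d in random.sample(others, k=min(jitter, len(others))):
        pos[d] += push if pos[d] >= centroid else -push   # move away
    return pos
```

Repeated over many queries, documents that are relevant to the same queries drift into tight groups on the line, which Stage 2 then cuts into clusters.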
The work by Deogun and Raghavan [6], in contrast, hypothesized that summarizing the usage patterns on a line is too restrictive in the sense that much useful information is lost. Consequently, an entirely new procedure that employs a weighted graph as the structure for summarizing feedback on usage patterns is proposed. The preliminary results show that this approach is effective. However, that approach has not yet been adequately compared with others existing in the literature to determine its relative performance. More recently, Oommen and Ma [11] have characterized the adaptive clustering problem within the context of certain learning automata.

Another aspect of clustering documents in this manner concerns the use of the clusters so obtained during retrieval. Yu et al. do not investigate any concrete retrieval scheme. Instead they show that performance will be good if a scheme that is able to select the "best" clusters is devised. Thus, this problem is still essentially open. The difficulty is that the descriptors or properties associated with documents and queries play no part during clustering, and the system, while knowing what the clusters are, has no knowledge of how documents in one cluster can be distinguished from those in another. A promising approach based on Bayesian decision theory is suggested in Deogun and Raghavan [6]. This and a few other alternatives are currently under investigation.

3. The Boundary Selection Problem

Concepts such as positive, negative and boundary regions, introduced by Pawlak [12] in the context of the development of rough set theory, are used to develop a formulation of a suitable clustering criterion. A positive region [POS(q)] for a query q is defined to be the set of documents in the clusters that are completely included in the relevant set of the query. Similarly, a negative region [NEG(q)] is defined to be the set of documents in the clusters that are completely excluded from the relevant set of the query.
Documents in the remaining clusters define the boundary region [BND(q)]. Clearly, an "ideal" clustering for a given set of queries will be the one in which, for each query, we have only positive and negative regions but no boundary regions. Therefore, in finding a clustering our motivation is to maximize the positive and negative regions collectively for all queries in the given set.

This leads us to the following function as a measure of effectiveness for a given clustering C = {C_1, C_2, ..., C_r}:

    φ_C = Σ_{i=1}^{s} w_i (|POS_C(q_i)| + |NEG_C(q_i)|) / |D|

where q_1, q_2, ..., q_s are the queries, w_i is the weight associated with query q_i, and D is the set of documents over which the partition C is defined. The objective is to find a clustering C of document set D that maximizes φ_C with respect to the given set of queries. Since Σ_{i=1}^{s} w_i can be considered to be a constant, the objective of maximizing φ_C is equivalent to minimizing

    r_C = Σ_{i=1}^{s} w_i |BND_C(q_i)|.

r_C can be interpreted as the cost associated with the choice of clustering C. It is not difficult to see that this cost function is biased in favor of forming many small clusters. So the following constraint is adopted:

    Σ_{i=1}^{s} w_i R_i ≤ B,

where the w_i are the query weights as earlier, R_i is the number of sets among C_1, C_2, ..., C_r such that C_j ∩ POS_C(q_i) ≠ ∅, and B is a bound. Thus we seek a clustering C for which

    r_C = Σ_{i=1}^{s} w_i |BND_C(q_i)|

is minimized subject to Σ_{i=1}^{s} w_i R_i ≤ B. This problem is called the Boundary Selection Problem (BSP) [14].

4. The Reproductive Plan

4.1 Representation

Let b_0, b_1, ..., b_m be the possible boundary points. We denote a classification, C, by an (m+1)-dimensional binary vector X = (x_0, x_1, ..., x_m), where x_0 = x_m = 1 and, for each j, 1 ≤ j < m,

    x_j = 1 if b_j is an actual boundary in C, and x_j = 0 otherwise.

As an example, let m = 8 and consider eleven documents arranged on the line, with the possible boundary points falling after documents 2, 4, 5, 6, 7, 9 and 10, together with the two end points. Consider the structures

    X_1 = 101100011    X_2 = 110101001    X_3 = 110100011
    X_4 = 101100001    X_5 = 110101011

The clusterings corresponding to X_1, X_2, X_3, X_4, X_5 are:

    X_1 : {{1,2,3,4}, {5}, {6,7,8,9,10}, {11}}
    X_2 : {{1,2}, {3,4,5}, {6,7}, {8,9,10,11}}
    X_3 : {{1,2}, {3,4,5}, {6,7,8,9,10}, {11}}
    X_4 : {{1,2,3,4}, {5}, {6,7,8,9,10,11}}
    X_5 : {{1,2}, {3,4,5}, {6,7}, {8,9,10}, {11}}

Mutation of X_3 and X_1 at positions 7 and 4, respectively, would result in new structures, X_11 and X_12, as follows:

    X_11 : 110100011 => 110100111

The resulting clusters are: {{1,2}, {3,4,5}, {6,7,8,9}, {10}, {11}}.

    X_12 : 101100011 => 101000011

Similarly, the corresponding clusters are: {{1,2,3,4}, {5,6,7,8,9,10}, {11}}.

Suppose q_1 = {1, 2, 6, 7, 8, 11} is the relevant set for query q_1. Then the associated POS, NEG and BND regions would be:

    X_1  : POS = {11};          BND = {1,2,3,4,6,7,8,9,10};     NEG = {5}
    X_2  : POS = {1,2,6,7};     BND = {8,9,10,11};              NEG = {3,4,5}
    X_3  : POS = {1,2,11};      BND = {6,7,8,9,10};             NEG = {3,4,5}
    X_4  : POS = ∅;             BND = {1,2,3,4,6,7,8,9,10,11};  NEG = {5}
    X_5  : POS = {1,2,6,7,11};  BND = {8,9,10};                 NEG = {3,4,5}
    X_11 : POS = {1,2,11};      BND = {6,7,8,9};                NEG = {3,4,5,10}
    X_12 : POS = {11};          BND = {1,2,3,4,5,6,7,8,9,10};   NEG = ∅

In the original population, X_5 had the best utility with respect to q_1, since its BND set was smallest. This performance can be attributed to the subunits: pattern 11 in positions 0, 1 and pattern 101 in positions 3, 4, 5. Notice how the offspring X_11, which shares the first of these subunits, continues to perform well. On the other hand, X_4 had a low utility with respect to q_1, and the subunits contributing to poor performance are pattern 101 in positions 0, 1, 2 and pattern 10000 in positions 3, 4, 5, 6, 7. The offspring X_12, which still has the first subunit, continues to perform poorly.

Mutations generate a new classification by breaking a cluster or by merging two clusters. Crossovers help in producing new combinations of clusters.
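Decoding a boundary vector into clusters, and computing the POS, NEG and BND regions, can be sketched directly. The placement of the possible boundary points after documents 0, 2, 4, 5, 6, 7, 9, 10 and 11 is inferred from the worked example (the position of b_4 never arises there, so it is an assumption):

```python
# b_j lies after document BOUNDARY_AFTER[j]; b_0 and b_8 are the end points.
BOUNDARY_AFTER = [0, 2, 4, 5, 6, 7, 9, 10, 11]   # b_4 = after doc 6 is assumed

def clusters(x):
    """Decode a binary boundary string into a list of document clusters."""
    cuts = [BOUNDARY_AFTER[j] for j, bit in enumerate(x) if bit == "1"]
    return [set(range(a + 1, b + 1)) for a, b in zip(cuts, cuts[1:])]

def regions(x, relevant):
    """POS, NEG and BND regions of a classification for one query."""
    pos, neg, bnd = set(), set(), set()
    for c in clusters(x):
        if c <= relevant:
            pos |= c                  # cluster wholly inside the relevant set
        elif c.isdisjoint(relevant):
            neg |= c                  # cluster wholly outside it
        else:
            bnd |= c                  # cluster straddling the boundary
    return pos, neg, bnd
```

For q_1 = {1, 2, 6, 7, 8, 11}, regions("110101011", q1) reproduces the POS, NEG and BND sets listed above for X_5.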
Now we have seen how a diverse set of solutions can be generated by a few elementary operations and how the operations affect the payoff (or utility) of each structure. Also, this representation does not have the "labelling anomaly" of the previous approach [13]. Consequently, it is expected that the basic genetic operations would suffice for our application; i.e., no application-specific operators are necessary.

4.2 Discussion of the Approach

The classification scheme considered represents a novel perspective on the process of cluster analysis. When compared to approaches, for example, in the pattern recognition literature, it seems to resemble paradigm-oriented classification. That is, the clusters identified depend on the feedback from users as to which documents are relevant to their queries. The fact that it resembles paradigm-oriented pattern classification gives some assurance that the use of the reproductive plan can be effective, since Cavicchio [4] has been successful in applying the algorithm to the character recognition problem. However, the similarity of the problem addressed to paradigm-oriented classification is not that great. This can be realized by observing that the classification is not done separately for individual queries, whereby a distinction is drawn between relevant and non-relevant documents, but rather in a global sense. In other words, the clusters are required to transcend the specific queries on the basis of which they were generated and be useful in retrieving the relevant documents even for queries not encountered at any earlier time. In this context it is important that new classifications (structures) developed during the clustering process be made up of well-performing schemas (subunits). This aspect points to the fact that some sort of "homogeneous" clusters, of interest to many past and potential users, are being sought.
Hence, there is also some resemblance between our objectives and those of the classification strategies that fit in the class of "unsupervised" learning methods. We believe that this new perspective on classification is very useful for the retrieval environment and should be of value in other contexts as well. It is also interesting that stages 1 and 2 of the proposed clustering strategy would be, in the terminology of Yu et al., called adaptive document clustering. Within this broader adaptive system, we propose to apply the reproductive plan to solve the BSP. Thus, what we are developing is a layered system where on the one hand the positioning of documents on the line is adapted in response to certain environmental influences and on the other hand the organization of these documents into clusters is adapted according to the clustering criterion. For the latter stage, the use of the genetic algorithm is advocated.

In the earlier investigation of the genetic algorithm as a way of obtaining an optimal classification [13], the concept of arranging documents on a line did not exist. Consequently, a classification was represented as a string of cluster labels: for a classification of N documents into r clusters, the string is of length N and each position on the string would be assigned a value between 0 and r-1 indicating the label of the cluster to which the document belonged. Thus, there was not only the problem that at each locus there would be a large number of alleles, but also that there was what we called the "labelling anomaly". That is, a cluster having label l in one structure may be totally disjoint from and unrelated to a cluster also labeled l in another structure. Such problems lead to the basic genetic operators being ineffective in generating better performing, and often even valid, structures. It is easily seen that the representation adopted here circumvents such difficulties. The implementation of these ideas is underway.

5.
Conclusion

In information retrieval, a novel approach to cluster analysis in which clusters are modified as relevance feedback from users is obtained, known as user-oriented document clustering, is actively being investigated. A combinatorial optimization problem that arises in that context requires the identification of cluster boundaries in such a way that a prescribed retrieval criterion is optimized. It is shown, by specifying a structure representation and providing examples of operator application, that the reproductive plan is a promising approach for solving this problem.

Acknowledgements

This research is supported in part by a grant from NSERC of Canada.

References

1. Bethke, A., "Genetic Algorithms as Function Optimizers", Doctoral Thesis, CS Department, University of Michigan, 1981.
2. Brindle, A., "Genetic Algorithms for Function Optimization", Doctoral Thesis, Department of Computing Science, University of Alberta, 1980.
3. Cavicchio, D.J., "Adaptive Search Using Simulated Evolution", TR 03296-T, Computer and Communications Dept., University of Michigan, 1970.
4. De Jong, K., "Adaptive System Design: A Genetic Approach", IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-10, no. 9, pp. 566-574, Sept. 1980.
5. De Jong, K., "A Genetic Based Global Function Optimization Technique", TR 80-2, Department of Computer Science, University of Pittsburgh, 1980.
6. Deogun, J.S. and Raghavan, V.V., "User-oriented Document Clustering: A Framework for Learning in Information Retrieval", Proc. of the Ninth International ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 157-163, Pisa, Italy, Sept. 1986.
7. Holland, J.H., "Adaptation in Natural and Artificial Systems", The University of Michigan Press, 1975.
8. Holland, J.H. et al., "Computational Implementation of Inductive Systems", in Induction: Processes of Inference, Learning, and Discovery, pp. 103-150, The MIT Press, 1986.
9.
Mauldin, M.L., "Maintaining Diversity in Genetic Search", Proc. AAAI-84, pp. 247-250, August 1984.
10. Mercer, R.E. and Sampson, J.R., "Adaptive Search Using a Reproductive Meta-Plan", Kybernetes, vol. 7, pp. 215-228, 1978.
11. Oommen, B.J. and Ma, D., "Fast Object Partitioning Using Stochastic Learning Automata", Proc. of the Tenth International ACM SIGIR Conf. on Research and Development in Information Retrieval, to appear, New Orleans, June 1987.
12. Pawlak, Z., "On Learning -- A Rough Set Approach", in Lecture Notes in Computer Science, No. 208, Skowron, A., Ed., Springer-Verlag, Berlin, 1986.
13. Raghavan, V.V. and Birchard, K., "A Clustering Strategy Based on a Formalism of the Reproductive Process in Natural Systems", Proc. of the Second International ACM SIGIR Conference on Information Retrieval, pp. 10-22, Dallas, Sept. 1979.
14. Raghavan, V.V. and Deogun, J.S., "Optimal Determination of User-oriented Clusters", Proc. of the Tenth International ACM SIGIR Conf. on Research and Development in Information Retrieval, to appear, New Orleans, June 1987.
15. Raghavan, V.V. and Sharma, R.S., "A Framework and Prototype for Intelligent Organization of Information", Canadian Journal of Information Science, 1987, to appear.
16. Yu, C.T., Wang, Y.T. and Chen, C.H., "Adaptive Document Clustering", Proc. of the Eighth Annual International ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 197-203, Montreal, Canada, 1985.
17. Wetzel, A., "Evaluation of the Effectiveness of Genetic Algorithms to Combinatorial Optimization", Doctoral Thesis, Department of Library and Information Science, University of Pittsburgh, 1983.

THE GENETIC ALGORITHM AND BIOLOGICAL DEVELOPMENT

by Stewart W.
Wilson
The Rowland Institute for Science, Cambridge MA 02142

Abstract

A representation for biological development is described for simulating the evolution of simple multicellular systems using the genetic algorithm. The representation consists of a set of production-like growth rules constituting the genotype, together with the method of executing the rules to produce the phenotype. Examples of development in 1-dimensional creatures are given.

1. Introduction

The genetic algorithm [1] incorporates mechanisms which resemble the mechanisms of reproduction, variation, and selection found in natural evolution, but, despite successes in several fields of application, there has been little attempt to use the algorithm as a tool to investigate, through simulation, natural evolution itself. Considerable work exists on the ontogenetic evolution of behavior, e.g., learning [2-4], but relatively little on the evolution of organisms per se [5]. The main reason has been the absence of representations for organisms which would permit the genetic algorithm to be brought to bear. The genetic algorithm observes the genotype-phenotype distinction of biology: the algorithm's variation operators act on the genotype and its selection mechanisms apply to the phenotype. In biology, the genotype-phenotype difference is vast: the genotype is embodied in the chromosomes whereas the phenotype is the whole organism that expresses the chromosomal information. The complex decoding process that leads from one to the other is called biological development and is essential if the genotype is to be evaluated by the environment.
Thus to apply the genetic algorithm to natural evolution calls for a representational scheme that both permits application of the algorithm's operators to the genotype and also defines how, based on the genotype, organisms are to be "grown", i.e., their development. The present paper outlines a few steps in the direction of such a representation [6]. The problem is addressed at the level of cells, which are treated as "black boxes" having well-defined properties. Beginning with the fertilized egg, the cells are to divide, move, and differentiate under the control of rules so as eventually to form a mature organism. An attempt is made to respect major facts known about cells and these processes, but large compromises must occur at this point in the effort to approach algorithmic workability. The principal objective is to describe a representational framework--a sort of "developmental automaton"--sufficiently completely that randomly generated instances will grow and can be evolved under the genetic algorithm in computer experiments.

2. Evolution of Development

The problem of applying the genetic algorithm to the development of multi-cellular organisms can be divided into four parts: plan, expression, selection, and variation.

2.1 Plan

In nature, the genotype contains (1) information that is descriptive, through the action of development and the environment, of a range of possible phenotypes, and (2) information encoding the developmental process itself, i.e., how to go about making a phenotype from a genotype. Both kinds of information are of course inherited and subject to variation and natural selection. Here, for simplicity, it will be assumed that only the first kind of information, termed the organism's plan, is heritable and subject to the genetic algorithm. The other kind, the rules for expressing the plan to form the phenotype, will be regarded as fixed. What should the plan look like?
Several observations on natural systems are suggestive [7]. In the first place, though individual cells can have different sizes and can change in size, growth occurs primarily through cell division: one cell becomes two "daughter" cells. Second, depending on the situation, the daughters can be phenotypically the same as the parent, they can differ from the parent but not from each other, or they can differ from the parent and from each other. Third, the phenotypical outcome of cell division can depend not only on the nature of the parent cell, but also on factors related to the cellular, chemical, or physical context in which the parent cell is embedded. Finally--and pivotal for this discussion--all cells in an organism are considered to contain the same genetic information, though some of it may become in some sense "switched off" or inoperative during differentiation.

These observations have suggested the following working proposal. The plan will take the form of a so-called production system program (PSP) consisting of a finite number of production (condition-action) rules which will be termed growth rules. The growth rules have the general form

    X + K0 => K1 K2

The K's stand for cell phenotypes and X represents the local context; the symbol "+" means conjunction. In addition, each growth rule has associated with it a weight w. Every cell in the organism contains the same set of rules, or PSP.

Focussing attention on a particular rule in the PSP of a particular cell, the condition side of the rule is satisfied if that cell is (phenotypically) of type K0 and the context matches X. The action recommended by the rule is to replace the cell by two new cells, one phenotypically of type K1, the other of type K2. Whether or not this rule controls the parent cell's fate depends on whether the rule is selected for expression, as discussed in the next section. The general growth rule form is open to many special cases.
As in nature, the daughter cells may or may not be the same as the parent or each other. Furthermore, some rules may contain just one daughter cell, identical to the parent; such a rule, if expressed, means that cell division does not take place. Also, some rules may have no cell in their action parts, corresponding to dissolution of the parent cell. Some rules may have no term corresponding to X; their condition is satisfied independent of context. In the other rules, X can take on several forms. Most simply, X can stand for the presence of a cell of a particular kind adjacent to K0. In this case ("adjacency" type context), the spatial relation of the X cell and K0 may affect the spatial relation of the daughter cells (if there are two). Another kind of X ("signal" type) would stand for a detector for signals emitted by other cells, not necessarily in the immediate neighborhood. For present purposes, the "signal" emitted by a cell is simply a list of its phenotypical properties. The predominant direction from which matched signals are received could affect the daughter cells' spatial relation. Still another kind of X would detect aspects of the physical environment such as intercell pressure.

2.2 Expression

Since all cells contain the same "program", differential development of the system depends on the selection for expression of different rules in different cells. This is not difficult in principle, since once some differentiation occurs, the sensitivity of the rules to cell type and context will lead to further differentiation. The proposed expression mechanism consists of a match step and a decision step. Again focussing attention on a particular cell, in the match step the cell first identifies those program rules which have satisfied conditions. Then, from this match set, the cell chooses a single rule for expression.
The chosen rule "carries out its right-hand side", i.e., daughter cells are produced as prescribed and their signals are emitted.

The system's growth process is envisioned as a series of discrete time-steps. On each step, the expression mechanism operates in every cell of the current system. The operation is regarded as "parallel" in the sense that offspring of all the cells are produced simultaneously. The offspring cells then undergo, in accordance with their phenotypical properties, a process of interaction and spatial accommodation so as to form the "new" system to be input to the expression mechanism in the next time-step.

2.2.1 The decision step

The decision step of the expression mechanism makes use of the growth rule weights w and the effect of signals from nearby cells. Each growth rule in the match set has an associated weight w. If a rule's context (X) part is either absent or is of adjacency type, its excitation is defined to be just w. However, if a rule's context part is of signal type, the rule's excitation is defined to be the product of the weight w and the intensity of the received context signal. For example, suppose that a certain rule has an X which matches signals S_A emitted by nearby cells A. Suppose further that the total intensity of the signals is simply their number n_A times a constant f. Then the excitation of the rule in question would equal f n_A w.

The cell decides which match set rule to express using a probability distribution over the rules' excitations. That is, the probability that a particular rule will be picked is equal to its excitation divided by the sum of the excitations of the rules in the match set. The following three rules offer an interesting example.

    A => A A           (w_1)
    (S_A) + A => A     (w_2)
    (S_A) + A => 0     (w_3)

The first rule, termed "reproductive", takes one cell A and leaves two in its place.
The second rule, termed "inhibitory", matches cell A, senses the presence of at least one A-type signal in the vicinity, and seeks, if chosen, to maintain the status quo exactly. The third rule, a deletion rule, has the same condition as the inhibitory rule, but seeks to delete the matched A cell. Each rule has a weight, as shown. Suppose now the system consists of an aggregate of n_A cells of type A. In any cell, the excitations of the three rules will be:

    e_1 = w_1,  e_2 = f n_A w_2,  e_3 = f n_A w_3.

If w_1 is large and there are relatively few cells, the reproductive rule will be chosen most of the time and the aggregate will grow. As it does, however, the excitations of the inhibitory and deletion rules will increase relative to that of the reproductive rule, due to n_A. The growth rate will slow down. Eventually, an equilibrium will be reached where net growth is zero. At that point, the probability of reproduction equals the probability of deletion, or w_1 = f n_A w_3. Solving for n_A yields the system's equilibrium size:

    n_A* = (1/f)(w_1/w_3).

The system's net growth rate dn/dt prior to equilibrium can be calculated by taking the product of n and the difference between the probabilities of reproduction and deletion. Dropping the "A" subscripts, the result is

    dn/dt = n (1 - n/n*) / (1 + ((w_2/w_3) + 1)(n/n*)),

showing that the system's growth rate can be "chosen" independently of its equilibrium size.

Though simple, the example is important because it illustrates one way in which the cellular program can manage the fundamental problem of bounded growth. Later examples of differentiation into finite regions of homogeneous cell type will assume the presence of growth rule sets of this or similar sort for the regions.

2.2.2 Phenotype properties

Once the decision step has picked a rule for expression, the daughter cells in the action part must be simulated, which means simulating their properties. In a real organism, each cell "type" has a myriad of physical and biochemical properties.
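Returning briefly to the bounded-growth example of Section 2.2.1, its equilibrium prediction n_A* = (1/f)(w_1/w_3) can be checked numerically. The Python sketch below iterates the expected cell count under the three rules; the weight values and the constant f are arbitrary illustrative choices, not taken from the paper.

```python
def expected_step(n, w1, w2, w3, f):
    """Advance the expected size n of an all-A aggregate by one time-step.

    Excitations as in Section 2.2.1: e1 = w1 (reproduce),
    e2 = f*n*w2 (inhibit), e3 = f*n*w3 (delete).
    """
    e1, e2, e3 = w1, f * n * w2, f * n * w3
    total = e1 + e2 + e3
    p_reproduce, p_delete = e1 / total, e3 / total
    # Reproduction adds one cell, deletion removes one, inhibition neither.
    return n * (1.0 + p_reproduce - p_delete)

w1, w2, w3, f = 100.0, 1.0, 1.0, 0.1   # illustrative values
n_star = (1.0 / f) * (w1 / w3)          # predicted equilibrium: ~1000 cells
n = 1.0                                  # start from a single cell
for _ in range(500):
    n = expected_step(n, w1, w2, w3, f)
# n has converged to n_star: fast early growth, then a slowdown
```

The trajectory shows exactly the qualitative behavior described above: near-doubling while n is small, then convergence to the equilibrium size, which depends only on w_1, w_3 and f.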
Some of these may be more properly regarded as behavioral, e.g., during development, cells can creep, amoeba-like, to new positions. Most of the properties affect in one way or another a cell's interactions with other cells. Even if all the properties were understood, a realistic simulation would still have an enormous problem adequately representing and computing the interactions within the cell aggregate. Such a computation is necessary in order to determine the fitness, with respect to an environment, of the organism as a whole. The practical course for the present would seem to be to choose extremely simple environments, simple measures of fitness, and a very restricted range of cell properties.

2.3 Selection and variation

Because the foregoing representational framework for development takes the form of a production system program, it is straightforward to apply the genetic algorithm as the "engine" of phenotype selection and genotype variation. The application of the algorithm would be along the lines of previous work with production system programs [3,8]. One would start with a population of "egg" cells, each containing a random genotype. Each egg would undergo development and, after a standard number of time-steps, each resulting cell aggregate would be rated for fitness. The original eggs would then be copied in numbers proportional to these fitnesses to form a new population of the same size. Genetic operators would be applied to the genotypes of the new population. The cycle would be iterated through some number of generations, corresponding to evolution.

Many aspects of this scheme are quite well understood due to the research just cited and on genetic algorithms in general. However, the form of the growth rules in the genotype is somewhat unusual so some comments about coding are in order. The basic encoding would resemble that of classifiers [2].
The condition part of a rule would consist of a context taxon (for X) and a cell taxon (for K0), each being a string of length L from {1, 0, #}. The action part would consist of two cell descriptions (for K1 and K2), both strings of length L from {1, 0}. An interpreter is required to relate cell description encodings to phenotypical properties. This simply means establishing a pre-defined mapping between substrings in the cell description and properties; e.g., "110" in the 14th through 16th positions could mean the cell surface has "high stickiness", etc. To take care of rules in which one or both of the daughter cells is absent, the interpreter would simply check the setting of a certain bit in each cell description: "0", say, would mean that cell description was absent and the rest of the description should be ignored. A similar system would be used to indicate the presence or absence of a context taxon and its type (adjacency, signal, or other). A growth rule's condition would be satisfied if both (1) the cell description of the cell in which the rule finds itself matches the rule's cell taxon, and (2) at least one signal reaching the cell matches the rule's context taxon. The meaning of "match" is the same as for classifiers: the two strings must be the same at every non-# position of the taxon. The use of the "don't care" symbol # permits rule conditions to restrict their sensitivity to particular subsets of cell description and signal bits.

Calculation of the intensity of the signal matching the context taxon can be quite complex, depending on the simulation. Involved are the dependence of individual signal intensities on the distance from their sources, and also perhaps propagation delays with respect to the time-step of creation of the source cell. These factors must be predefined. In any case the total received intensity would be a sum over the individual intensities of all matched signals.
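The classifier-style match just defined -- the two strings must agree at every non-# position of the taxon -- is straightforward to state in code (a minimal sketch; the function name is ours):

```python
def matches(taxon, string):
    """True iff `string` agrees with `taxon` at every non-# position."""
    return len(taxon) == len(string) and all(
        t == '#' or t == s for t, s in zip(taxon, string))

# A taxon that cares only about three middle bits:
assert matches('##110###', '10110010')      # positions 2-4 agree
assert not matches('##110###', '10100010')  # position 3 differs
```

The same predicate serves both condition tests above: the cell taxon against the cell's own description, and the context taxon against each incoming signal.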
As noted earlier, the net direction of the received signal may in some rules determine the spatial orientation of the daughter cells. The dependence would be encoded in a special bit string associated with the daughter cell descriptions. The weight associated with a growth rule must also be encoded in order to make it, and consequently the rule's influence in the decision step, subject to the genetic algorithm. The weight would simply be concatenated, as a fixed-length binary number, with the rest of the rule string.

3. 1-D Development

As has been the case with research on cellular automata [9], the complexity of realistic three-dimensional simulations recommends initial study of one-dimensional examples. In two and three dimensions, forces between cells must lead to complicated cell movements and contortions of the "tissue". A 1-D "creature", however, could be viewed as growing inside a frictionless tube, with no forces except between adjacent cells. Cell division would lengthen the creature; deletion would shorten it. Though simple, the 1-D case can exhibit cell type configuration patterns such as symmetry, periodicity, and polarity that are analogous to patterns emerging in the development of real organisms. Some elementary examples follow.

3.1 Symmetry and periodicity

Changes in a 1-D system through time can be represented by a pyramid like the following:

         A
        B B
      D C C D

This shows three time-steps. At first, the system consists just of cell A; then A divides to form cells B and B; then the left-hand B divides to form the (oriented) pair D C, and the right-hand B yields the pair C D. Only two growth rules are required:

    A => B B
    (B) + B => C D

In the second rule the context taxon is of adjacency type (indicated by the absence of "S").
This type of rule reads: "order the output cells so that the direction from the first one to the second one is the same as the direction from the context cell to the replaced cell." Note that the pyramid diagram shows bilateral symmetry about its center line. Using additional rule sets of the self-limiting form discussed in Section 2.2.1, the C's and D's could be multiplied to yield eventually a stable symmetrical creature of finite size, D...D C...C D...D, with approximately equal groups of D cells.

The following pyramid and its rules illustrate rudimentary periodicity:

         A
        E F
      C D C D

    A => E F
    (F) + E => D C
    (E) + F => C D

Again, the addition of self-limiting rule sets would result in the creature, C...C D...D C...C D...D, in which like cell groups were approximately equal in size. It is clear that quite complex structures can be built by first establishing the pattern with non-cyclic rules (in which the cell taxon will not match the output cells), and then using self-limiting rule sets which apply to the final cell types.

3.2 Polarity

An elementary polarity results from any rule in which the output cell types differ. A polarity with respect to some phenotypical property can be set up with non-cyclic rules as follows:

         A
        B C
      D E F G

    A => B C
    (C) + B => E D
    (B) + C => F G

If in the cell descriptions of D, E, F, and G the property is, say, monotonically increasing, the amount of the property will be graded across the system. A more sophisticated gradient system occurs under the rules:

    A => B C
    (S_B) + C => D C
    C => C C
    (S_C) + C => C
    (S_C) + C => 0

If B's signal loses intensity with distance, the probability that a C will change to D C will fall with distance from the left end of the structure. The result will be a decreasing distribution of D's from left to right. The last three rules are intended to control the system's overall size. When rule sets become even slightly complicated, as in the last example, it is evident that development will be difficult to predict.
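The two-rule symmetry example of Section 3.1 can be executed directly. The following Python sketch is our own minimal interpreter for adjacency-context rules, using the orientation convention quoted above; weights and the stochastic decision step are omitted, since each cell here matches at most one rule:

```python
# Each rule is (context, cell, daughters); context None means context-free.
RULES = [(None, 'A', ('B', 'B')),
         ('B',  'B', ('C', 'D'))]

def develop(cells):
    """Apply one parallel time-step of the growth rules to a 1-D creature."""
    new = []
    for i, cell in enumerate(cells):
        left = cells[i - 1] if i > 0 else None
        right = cells[i + 1] if i < len(cells) - 1 else None
        for ctx, k0, (k1, k2) in RULES:
            if cell != k0:
                continue
            if ctx is None or left == ctx:
                # Context absent or on the left: direction from context to
                # replaced cell is rightward, so keep the rule's order.
                new.extend([k1, k2])
                break
            if right == ctx:
                # Context on the right: the direction is leftward,
                # so the daughters appear in reversed spatial order.
                new.extend([k2, k1])
                break
        else:
            new.append(cell)  # no rule applies: the cell persists
    return new

creature = ['A']
for _ in range(2):
    creature = develop(creature)
# creature == ['D', 'C', 'C', 'D'], the symmetric third row of the pyramid
```

The left-hand B sees its context on the right and so emits the reversed pair D C, while the right-hand B emits C D, reproducing the bilateral symmetry of the pyramid diagram.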
It can be hoped, however, that with the help of the genetic algorithm, the ability to design and analyse organisms in advance will not be necessary in order to build successful and interesting ones, just as in natural evolution it is not. What does seem essential is an adequate space of possible growth rules. The rule forms discussed include self-excitation, self-inhibition, and cross-excitation and -inhibition between different cell types. The repertoire seems fairly complete for a start, but modifications in it and in many other aspects will surely occur as the proposal is studied experimentally and analytically.

4. Conclusion

An extremely schematic representational framework for biological development has been described which may permit simulations of evolution using the genetic algorithm. Major questions that need to be addressed include the accuracy and adequacy of the representation and the problem of computing the phenotype. It is hoped that coupling "developmental automata" with genetic adaptive techniques will yield insights into biological, social, and other systems which are capable of growth.

Acknowledgement

The author thanks D.E. Goldberg for valuable comments on an earlier draft of the paper.

References

[1] Holland, J.H. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press, 1975.
[2] Holland, J.H. "Escaping brittleness: the possibilities of general-purpose learning algorithms applied to parallel rule-based systems." In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Volume II. Los Altos, California: Morgan Kaufmann, 1986.
[3] Smith, S. A Learning System Based on Genetic Algorithms. Ph.D. Dissertation (Computer Science), University of Pittsburgh, 1980.
[4] Grefenstette, J.J. (ed.) Proceedings of an International Conference on Genetic Algorithms and Their Applications. Pittsburgh: Carnegie-Mellon University, 1985.
[5] An important paper used mechanisms closely related to those of the genetic algorithm as defined in [1] to study the emergence of self-replication. See Holland, J.H. "Studies of the spontaneous emergence of self-replicating systems using cellular automata and formal grammars." In A. Lindenmayer & G. Rozenberg (eds.), Automata, Languages, Development. Amsterdam: North-Holland, 1976.
[6] The representation appears to share several aspects with the developmental models called L-systems. However, there do not seem to be any studies in which populations of L-systems undergo evolution. See Lindenmayer, A. "Developmental algorithms for multicellular organisms: a survey of L-systems." J. Theor. Biol., 54, 1975.
[7] For background, see Balinsky, B.I. "Development, Animal" and Waddington, C.H. "Development, Biological." Encyclopaedia Britannica, 15th Ed., Vol. 5, 625-650.
[8] Schaffer, J.D. "Learning multiclass pattern discrimination." In [4].
[9] Wolfram, S. "Cellular automata as models of complexity." Nature, 311, 4 October 1984, 419-424.

Genetic Algorithms and Communication Link Speed Design: Theoretical Considerations

Lawrence Davis
BBN Labs
10 Moulton Street
Cambridge, MA 02238

Abstract

The problem of finding low-cost sets of packet switching communication network links when the network topology has been fixed is a difficult and time consuming one. In this paper we describe the features of the problem that make it difficult, describe a genetic algorithm that solves the problem, and discuss some of the theoretical issues raised in applying genetic algorithms to this domain.

Introduction

The process of designing a packet switching communication network, as carried out at Bolt Beranek and Newman Communications Corporation (hereafter,
"BBNCC"), has four major phases:

* First, information about the traffic to be carried, the customer's existing devices and communications protocols, and other customer requirements is gathered and entered into a database.

* Second, access devices are placed such that equipment and line costs are minimized and traffic from the terminals is homed to them.

* Third, packet-switching nodes are similarly placed and traffic from terminal access devices and customer hosts is homed to them.

* In the final phase, the "backbone" phase, the designer links the nodes so that they can carry the desired traffic and meet other customer requirements on speed, reliability, connectivity, and cost. (See Figure 1 for an illustration of a backbone design.)

Susan Coombs
BBN Communications
70 Fawcett Street
Cambridge, MA 02238

This paper and its companion paper(1) concern a problem that occurs in the fourth phase of network design. When the designer first links nodes to create a backbone network topology, the links used are high-speed ones. The intent is to produce a topology that satisfies the customer's requirements. One such constraint imposed on the topology in Figure 1 is that the net be "biconnected": given the failure of any single network link or node, the network will still be connected. High-speed links are very expensive. When a working topology has been found, the designer concentrates on optimizing the network cost by reducing the speeds of some or all of the network links, while satisfying the customer's design criteria. These two problems -- that of choosing a topology and that of selecting low-cost, acceptable link speeds given a topology -- together constitute the fourth phase of communication network design. When performed by human designers, the two processes tend to be carried out sequentially, although there is a certain amount of overlap between them.
Figure 1: A Sample Backbone Topology (figure not reproduced)

¹"Genetic Algorithms and Communication Link Speed Design: Constraints and Performance," by Susan Coombs and Lawrence Davis, in this volume.

The process of choosing link speeds given a topology is complicated by a number of factors:

- The set of allowable link speeds is determined by the available services offered by the chosen carrier(s), and often varies from network to network.
- The set of customer performance requirements varies from network to network.
- The simulator used by DESIGNet (BBNCC's interactive network design tool, implemented in Zetalisp on Symbolics Lisp Machines) to assess the performance of a trial network is a stochastic one; understanding the effects of changing a link's speed is much more difficult when such effects change between trials.
- The network performance constraints may be stated in stochastic terms (e.g., the delay of each link in the network must be below 500 milliseconds in 80% of the simulations), and it is difficult to know rapidly whether they have been satisfied.
- The differences in the costs of links of different speeds are nonlinear, and often non-intuitive. They depend on the carrier and/or on the modems chosen by the customer, if analog lines are being used.
- Equivalent bandwidth of one higher-speed link between two nodes can be provided by multiple lower-speed links, generally at non-equivalent cost. The nodes treat the multiple links as one logical link when routing traffic.
- The effect of changing a link's size is difficult to predict and is highly interactive: traffic may be rerouted everywhere in the network as a result of changing the speed of a single link.
- The effect of changing a link's size may be counterintuitive: it occasionally happens that reducing the size of one link in the network improves the entire network's performance.
- For larger backbones, the DESIGNet simulation of network performance may take a minute or more to run, leading to lengthy intervals between a designer's changes in the network and the simulation's feedback.

Given these characteristics, and the fact that a substantial amount of the cost of a packet switching communication network can be bound up in its links, the problem of finding a low-cost set of sizes for the links in a network when its topological constraints have been met is worth solving with a computer. The stochastic nature of the data and the non-linear interaction between changes in a network design make the link speed problem a difficult one to approach with heuristic and hill-climbing systems. We hypothesized that the problem would be amenable to the genetic algorithm approach, and in the fall of 1986 we applied a genetic algorithm to a simplified version of this problem in order to measure its performance against some deterministic algorithms. The results were encouraging (they, and a more detailed description of the communication network design problem in general, are contained in an earlier paper²). On the basis of that work, we extended and modified the system for application to some real-world network designs.

The following sections of this paper discuss theoretical issues related to genetic algorithms that were relevant to creating our prototypical system and applying it to networks being designed at BBNCC. The companion paper to this one discusses issues concerned with integrating a wide variety of constraints on network performance into the genetic system, so that the genetic algorithms employed the same constraints that the human designers were using.
The companion paper also describes the results of several applications of our genetic algorithm to networks that were being designed for prospective clients of BBNCC.

Representation Issues

Genetic algorithm researchers most frequently use bit strings as chromosomal encodings of solutions to the problem they are trying to solve. Other representation techniques have been employed, but the early work in the genetic algorithm field and most of the formal results that have been proved have been based on the bit string representation. Bit strings and their crossover, inversion, and mutation operators have been extensively studied and are fairly well understood.

We did not use bit strings in the system to be described here. Instead, our chromosomes were lists of link speeds. Each speed was keyed to a link in the backbone, and each speed was taken from the list of allowable link speeds for the design in question. This chromosome, for example, encodes link speeds for a three-link network:

(2400 50000 9600)

Evaluation of such a chromosome is carried out by assigning each link the speed encoded for it on the chromosome, simulating the resultant network's performance, and evaluating that performance against the customer's requirements.

²Davis, Lawrence and Susan Coombs, "Optimizing Network Link Sizes with Genetic Algorithms." To appear in Modelling and Simulation Methodology: Knowledge Systems Paradigms, Maurice S. Elzas, Tuncer I. Ören, and Bernard P. Ziegler, editors.

Why did we choose this representation for the link speed problem? The principal motivation was our expectation that an important genetic operator would be "creep": altering the speed of a link upward or downward one or more steps in the hierarchy of allowable speeds. Encoding speeds with bit strings would not support this sort of operation.
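Concretely, the representation and its evaluation can be sketched as follows (our illustration, not code from the paper: the speed list, tariffs, traffic figures, and the 70% utilization test are hypothetical stand-ins, and `evaluate` replaces the DESIGNet simulation with a simple penalty rule):

```python
import random

# Hypothetical ordered hierarchy of allowable link speeds (bits/sec);
# the real list varies with the carrier services chosen for a design.
ALLOWABLE_SPEEDS = [2400, 4800, 9600, 19200, 50000]

def random_chromosome(num_links, rng=random):
    """A chromosome is just a list of link speeds, one per backbone link."""
    return [rng.choice(ALLOWABLE_SPEEDS) for _ in range(num_links)]

def evaluate(chromosome, traffic_per_link, cost_per_speed):
    """Stand-in for simulating the network: total link cost plus a flat
    dollar penalty for every link loaded beyond 70% of its capacity."""
    penalty = sum(1000.0 for speed, traffic in zip(chromosome, traffic_per_link)
                  if traffic / speed > 0.70)
    return sum(cost_per_speed[s] for s in chromosome) + penalty
```

For the three-link chromosome (2400 50000 9600), such an evaluation would charge the three line tariffs plus one penalty if, say, the third link carried 9000 bits/sec of traffic.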
Interpreting a bit string as a Gray coded string would allow creeping one step up or down, but it would not easily support creeping multiple steps during the application of a single operator.

There is a potentially telling argument against our choice of representation. Other things being equal, representations using an alphabet with fewer letters and longer words will yield better performance in a genetic algorithm than representations with a larger alphabet and shorter words, since they will provide more hyperplanes for the algorithm to explore. We, on the other hand, have chosen to represent each separate link speed "word" with a single "letter". Have we discarded the basis for some important crossover effects?

We do not believe so. The performance gains acquired by using a smaller alphabet are derived by finding periodicities in the search space that are encoded in the longer words. Our search space is not a periodic one. When the best speeds of a link for a given topology are plotted over a number of runs, they tend to cluster in one or two regions of the link size list, rather than in periodic regions of it. While it is true that small alphabets and large words are useful for finding periodicities in a gene's values, in our domain such periodicities appear not to exist. Accordingly, we appear to have lost nothing by collapsing the representation, and it is simple for us to use the representation in implementing creep.

Our problem space lends itself to crossover based on word values (to continue the analogy in the preceding paragraph) rather than on letter values. In the link speed domain one must find link speeds that lie in an optimal region, given the speeds of the other links. One then wishes to sample the surrounding allowable speeds, hoping to find better combinations.
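A creep operator over this representation might look like the following sketch (ours, not the paper's code; the speed list and the two-step limit are illustrative):

```python
import random

ALLOWABLE_SPEEDS = [2400, 4800, 9600, 19200, 50000]  # ordered low to high

def creep(chromosome, max_steps=2, rng=random):
    """Move one randomly chosen link's speed up or down a small number of
    steps in the ordered speed hierarchy, clamping at either end."""
    child = list(chromosome)
    gene = rng.randrange(len(child))
    step = rng.choice([-1, 1]) * rng.randint(1, max_steps)
    index = ALLOWABLE_SPEEDS.index(child[gene]) + step
    child[gene] = ALLOWABLE_SPEEDS[max(0, min(index, len(ALLOWABLE_SPEEDS) - 1))]
    return child
```

Note that a Gray-coded bit string would support only the single-step version of this move.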
Complicating the search process are: the fact that evaluations in the region of the optimum may not produce monotonically decreasing evaluations as they move away from the optimal speed; the fact that the evaluation process is stochastic; and the fact that useful hyperplanes in this search space are combinations of link speeds that work well together in the context of the simulation, but these hyperplanes may not be useful if other hyperplanes are interfering with them. The crossover operator in our system operated by cutting chromosomes at two points. Despite these complicating factors, it seems to have satisfactorily detected and combined useful hyperplanes to create coadapted sets of link speeds.

The link speed problem is like the problem of optimizing parameters for a genetic or other stochastic algorithm. The task is to find a set of coadapted values for a set of parameters, under the guidance of a stochastic evaluation function (running the algorithm while using the parameter values encoded on the chromosome). Given a fixed set of values for the other parameters, the values for a given parameter tend to lie together on the axis of possibilities, often in a standard distribution. Much of the work to be done in such domains involves finding regions of optimal values for each parameter that fit with good regions for other parameters, exploring the interactions of such regions in parallel, and settling on promising combinations of regions for further exploration. The result will be a hyperplane of coadapted values cutting across the axes of parameter values. Here, as in the link sizing problem, the task is to find reasonable combinations of contiguous values in the face of a good deal of noise and parameter interaction.
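Two-point crossover over the list-of-speeds representation can be sketched as follows (our illustration; the operator exchanges whole link-speed "words" rather than bits):

```python
import random

def two_point_crossover(parent_a, parent_b, rng=random):
    """Cut both parents at the same two points and exchange the middle
    segment, producing two children built from whole link-speed 'words'."""
    assert len(parent_a) == len(parent_b)
    i, j = sorted(rng.sample(range(len(parent_a) + 1), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b
```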
Genetic approaches to such problems have done well³. One question of interest for genetic algorithm researchers desiring to carry out other applications is the comparison of the performance of representations that support creep operators with representations that do not, in domains with contiguous, rather than periodic, optimal parameter values.

Epistasis

At the International Conference on Genetic Algorithms and their Applications held two years ago, an interesting discussion arose concerning the status of genetic algorithms that operate in domains with high degrees of "epistasis": suppression by one gene of other genes' effects. The problem was that John Holland's proof of the convergence of a genetic algorithm on an optimum was based on the assumption that there was a low degree of epistasis in the evaluation of the algorithm's chromosomes. In order for a genetic algorithm to carry out parallel sampling of hyperplanes when it evaluates a chromosome, it is important that each hyperplane contribute its effect to the chromosome's evaluation. This did not appear to be the case for some of the genetic systems being presented at that conference. In one example, the bin-packing system described by Derek Smith⁴, altering the position of a single rectangle in the chromosomal encoding of a packing could well alter the final position of every other rectangle in the packing. Clearly, Smith's representation led to a good deal of interaction when positions of rectangles were changed. But did it violate the low-epistasis assumption? And if it did, were there convergence results, similar to Holland's original proof, that could be proved for such domains?

³See John Grefenstette, "Optimization of Control Parameters for Genetic Algorithms," IEEE Transactions on Systems, Man, and Cybernetics, SMC-16(1), 122-128, for a discussion of optimizing parameter values with bit string representations. Also see Lawrence Davis and Frank Ritter, "Schedule Optimization with Probabilistic Search," Proceedings of the 3rd IEEE Conference on Artificial Intelligence Applications, for an account of optimization of the parameters of a simulated annealer with a representation that supports a "creep" operator.
⁴Derek Smith, "Bin Packing with Adaptive Search," in Grefenstette, John J., editor, Proceedings of an International Conference on Genetic Algorithms and their Applications.
These questions were left open at that time, as was the question of rigorously defining the notion of epistasis. There do not appear to have been any published answers to these questions in the interim. There is, however, a growing body of empirical evidence suggesting that genetic algorithms do indeed perform successfully in domains in which alteration of a single gene may cause tremendous divergence in a chromosome's behavior. Some of the papers on the travelling salesrep problem found in this Proceedings are examples of such successful performance. The system described in this paper is another.

We believe that applications of genetic algorithms to such problems will continue to be fruitful. In our companion paper, we describe results that support this belief. Our domain, however, appears to be an epistatic one. Consider what happens when the speed of a single link is increased in a communication network. One possibility is that the traffic in the network follows the same paths and no significant changes in delays accrue. More likely, however, is a substantial alteration in traffic patterns, as traffic that had been routed over different paths is now routed over the freer link and other traffic is rerouted in reaction to this change. It is possible to change every routing in a network by changing a single link's speed, and the resultant change in the evaluation of the network's performance can be tremendous as well. Given that the genetic algorithm performs well in this domain, we would like to raise again some of the questions raised two years ago: Is the domain we are working in intrinsically different from the domains for which convergence proofs exist with respect to epistatic effects? What is epistasis? And are there formal results that can be proved for problems like clustering, bin-packing, and capacity assignment that will help us to implement genetic algorithms in other, similar domains?
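As a toy illustration of the kind of interaction at issue (our construction, not an example from either paper): an evaluation function in which one "gating" gene reverses the contribution of every other gene, so that no hyperplane has a context-free effect on fitness.

```python
def epistatic_eval(chromosome):
    """Gene 0 gates the rest of a bit-string chromosome: when it is 0 the
    evaluation rewards ones elsewhere, and when it is 1 it rewards zeros.
    Flipping that single gene changes the effect of all the others."""
    body = chromosome[1:]
    return sum(body) if chromosome[0] == 0 else sum(1 - g for g in body)
```

Changing a single link speed in the network domain can rearrange every routing in a similar, if far less tidy, fashion.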
Multiple Constraints

An issue that concerned us for a good deal of time arose from the fact that the "environment" surrounding our genetic algorithm differed radically from problem to problem. It was not at all clear that we would be able, in a single genetic algorithm framework, to model the behavior of communication network designers with respect to the myriad constraints imposed by customers. It was also not clear that we would be able to represent those constraints in a form the system could use. After a period of development and experimentation, the results of which are described in the companion paper, we produced a single genetic algorithm and evaluation function environment that we believe will support the application of genetic algorithms to the sort of link speed designs that are currently being carried out at BBNCC. We believe that this issue will arise frequently as genetic algorithms are developed from prototypical applications into working industrial systems. Accordingly, we have devoted our companion paper to a discussion of the techniques we used to create a genetic algorithm system capable of robust application to the link speed design problem, and to a discussion of that system's performance on actual network designs.

Conclusion

In applying genetic algorithms to the communication network link speed design problem, we made several interesting discoveries:

- Genetic algorithms perform well when applied to this domain.
- The highly stochastic and epistatic nature of the domain did not prevent the algorithm from finding and exploiting hyperplanes consisting of useful combinations of link speeds.
- Our system, based on "creep," crossover, and a non-bit string representation, performed well in this domain.
- The domain was similar in some respects to parameter optimization problems already studied by genetic algorithm researchers.
- Further research into the properties of genetic algorithms applied to such domains would be of benefit to those who will produce other, similar applications⁵.

⁵The authors thank Susan Bernstein and Richard Vacea for their many helpful comments and suggestions during the preparation of this paper.

Genetic Algorithms and Communication Link Speed Design: Constraints and Operators

Susan Coombs
BBN Communications
70 Fawcett Street
Cambridge, MA 02238

Lawrence Davis
BBN Labs
10 Moulton Street
Cambridge, MA 02238

July, 1987

Abstract

In this paper we describe some novel methods for incorporating constraints into a genetic algorithm for choosing link speeds in a packet switching communications network design. We also describe the object-oriented approach we used in representing the constraints, and present the results of applying these methods to four network designs.

Introduction

Constraints play an important role in network design. A packet-switching network design typically must satisfy constraints on link utilization, connectivity, and on the maximum travel time between source and destination. In expanding a genetic algorithms approach for adjusting network link speeds from the trial network design described in Davis and Coombs (1987)¹ to actual network designs, we discovered that we needed new ways of expressing constraints to generate good designs.² There are many possible varieties of constraints. Designs for two networks rarely satisfy exactly the same set of constraints.
To help our genetic algorithm adapt to this fact, we implemented constraints as objects. We were particularly interested in making the constraints easy to combine and modify for each new design. By modeling constraints as objects, we also could quickly demonstrate the effects of constraints separately or in combination. Each constraint was able to annotate its behavior, making it easier to analyze the effects of the constraints on the population.

¹Davis, Lawrence and Susan Coombs, "Optimizing Network Link Sizes with Genetic Algorithms," to appear in Modelling and Simulation Methodology: Knowledge Systems Paradigms, Maurice S. Elzas, Tuncer I. Ören, and Bernard P. Ziegler, editors.
²For a more complete description of the problem of adjusting network link sizes than provided in this paper, see both our earlier paper and the companion paper being presented at this conference.

Our results were quite encouraging. We ran our genetic constraint system, LINKR, on an automatically-generated network design, as well as on three hand-optimized variations of a different network design. In both the automatically-generated and hand-optimized cases, the genetic approach improved the original design.

Constraints and Operators

In applying genetic algorithms to the link speed design problem, we implemented both constraints, which impose penalties on solutions, and operators, which alter the genetic representation of solutions. Examples of constraints include constraints on link utilization, node utilization, link cost, and delay. Specifically, a customer might require that the utilization of all links in a network be less than 70%. Examples of operators include crossover and "creep". Some constraints that we had assumed to be fixed, in particular those modeling network routing³, had to be formulated stochastically. We discovered other constraints, including some we had not previously considered,
that were essential to designing an acceptable network. Some of these constraints took more time to evaluate than was practical, since part of our goal was to complete a network design in roughly the same amount of time as a person designing a network. Other constraints were problematic in that their evaluations fluctuated wildly under slight changes to the overall network. We developed new techniques to deal with the new kinds of constraints.

³The network routing determines how the traffic is distributed throughout the network.

Stochastic Constraints

Our attempt to model some stochastic constraints deterministically, in particular constraints depending on network routing, proved too limiting. With deterministic routing, the genetic algorithm optimized link sizes for the particular fixed routing chosen, leading to network designs with little ability to support alternate routings, and hence low resiliency in the face of line outages. Since many of the key constraints, including link utilization, node utilization, and delay, depend on network routing, we had to overcome this limitation. To do this, we developed a strategy for modeling stochastic routing.

In modeling stochastic routing, we adopted a k-out-of-n strategy, where k "successful" routings out of n trial routings are necessary for a penalty-free solution. The success of a routing depends on the specific constraint. For example, a routing could be successful for a link-utilization constraint if all link utilizations were less than 70%. Although a strategy requiring less frequent routings would be desirable, our strategy worked acceptably in the cases we tried. By varying k we could adjust the robustness of the routing, and by varying n we could adjust our confidence that the routings sampled were representative.
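A minimal sketch of the k-out-of-n strategy (ours, not the paper's code; `constraint_ok` stands in for one stochastic trial routing plus the constraint check, and the flat $1000 penalty is an arbitrary placeholder):

```python
import random

def k_out_of_n_penalty(network, constraint_ok, k, n, penalty=1000.0, rng=random):
    """Perform n independent trial routings; the solution is penalty-free
    only if at least k of them satisfy the constraint (for example, every
    link under 70% utilization).  constraint_ok(network, rng) -> bool."""
    successes = sum(1 for _ in range(n) if constraint_ok(network, rng))
    return 0.0 if successes >= k else penalty
```

Raising k demands designs that survive more of the sampled routings; raising n buys confidence in the sample at the cost of more routing runs.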
"Ice Age" Constraints

Some constraints were problematic in that the large amount of time necessary to evaluate them made it impossible to complete a genetic design in roughly the same amount of time as a person designing a network. One constraint that proceeded at a particularly glacial pace involved checking network performance after dropping each node in the network and re-routing traffic. For this constraint we adopted a new strategy, the "Ice Age" constraint strategy. An Ice Age constraint evaluates itself once every Ice Age, that is, once every n generations. An Ice Age affects an entire generation. Each member of the affected generation incorporates the Ice Age constraint's evaluation into its overall evaluation.

Many details of this approach remain to be investigated. For example, the optimal time between Ice Ages is uncertain. It cannot be too long, for then the Ice Ages have little effect on natural selection. Nor can it be too short, for then the time-saving advantage of the Ice Age approach is lost. Experimentation could also be done on whether members of non-Ice-Age generations should carry a lingering, but perhaps exponentially decreasing, penalty from the Ice Age evaluations of their ancestors.

"LaMarck" Operators

We observed that some constraints imposed penalties on designs which, with only slight changes, would have incurred no penalties at all. One such constraint was the constraint on mismatched line speeds. Although a network design may appear to work well with two adjacent links of widely differing speeds, in some situations one or more high-speed links funnel traffic onto the low-speed link, causing traffic to back up on the high-speed link(s). To avoid this situation, one adjusts the speeds of one or both of the links, resolving the mismatch and eliminating the penalty the design otherwise would incur. With the famous LaMarck, who posited the inheritance of acquired traits, in mind, we coined the "LaMarck" operator.
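For the speed-mismatch case, such a repair might be sketched as follows (our illustration, not the paper's implementation; the speed hierarchy, the adjacency list, and the two-step mismatch threshold are all assumptions):

```python
# Hypothetical ordered hierarchy of allowable link speeds (bits/sec).
ALLOWABLE_SPEEDS = [2400, 4800, 9600, 19200, 50000]

def repair_speed_mismatches(chromosome, adjacent_pairs, max_gap=2):
    """LaMarck-style repair: wherever two adjacent links differ by more
    than max_gap steps in the speed hierarchy, raise the slower link just
    enough to close the gap, writing the fix back into the chromosome."""
    child = list(chromosome)
    for i, j in adjacent_pairs:
        lo, hi = sorted((i, j), key=lambda k: ALLOWABLE_SPEEDS.index(child[k]))
        gap = ALLOWABLE_SPEEDS.index(child[hi]) - ALLOWABLE_SPEEDS.index(child[lo])
        if gap > max_gap:
            child[lo] = ALLOWABLE_SPEEDS[ALLOWABLE_SPEEDS.index(child[hi]) - max_gap]
    return child
```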
When a LaMarck operator is evaluated, it adjusts the genetic structure of a solution as necessary to bring it into the space of legal solutions. This strategy works best when a slight, known change to the overall network dramatically affects the evaluation of a given constraint, but does not have an appreciable effect on other constraints. Implementation of both the Ice Age constraints and the LaMarck operators was provisional and experimental. To begin to reap the benefits of these techniques would require further experimentation.

Bringing Object-Oriented Programming to Genetic Algorithms

One of the key elements of our implementation, which permits the application of different sets of constraints to different network designs, is the object-oriented representation we have chosen for constraints and individual members of the population. Because the constraints involved in each network design are unique, and over many network designs potentially infinite, the structure of the genetic approach must permit reuse, adaptation, and new combinations of constraints. The object-oriented approach that we have adopted accomplishes this, and we have already built up a sizable library of constraints for use in various types of network designs.

In implementing our constraints, we first developed a small set of basic operations. To generate new members of the population after crossover, creep, or other operators are applied, we implemented a basic operation for changing link sizes. To stochastically model the routing of data traffic throughout the network, we developed a network routing operation. We then used these basic operations as building blocks in implementing specific customer constraints, such as constraints on link utilization. Some of the constraints that we developed appear in Table 1 below. The link cost constraint sums the costs of the adjustable links in the network, and adds this cost to the overall penalty of an individual solution.
The remaining constraints assign a "dollar" cost corresponding to the degree to which they are violated, and the importance of the violation. This cost contributes to the overall penalties of individual solutions.

Many of the constraints have genetic operators associated with them. Some of these operators appear in Table 2 below. The "Creep" operator, for example, uses the link-utilization constraint in determining probabilities for creeping to a higher or lower link speed. Links with high utilizations have a higher chance of creeping up, whereas links with low utilizations are more likely to creep down. The "Adjust Ports" operator is an example of a requirement that we formulated, at different times, both as a constraint imposing penalties (see Table 1), and as a LaMarck post-processor altering solutions.

Results

Applying Genetic Algorithms to an Automatically-Generated Network Design

In the case of the automatically-generated network design we used, ten choices of link sizes were available, as compared to four choices in the test case described in Davis and Coombs (1987). There were 66 subnet links with adjustable sizes, as compared to 27 in the test case. Since we had little time in which to find an improved design, and since this design was significantly larger than the test design we had tried previously, we began by trying the greedy cyclic algorithm described in our earlier paper. The results from that paper show that the greedy cyclic algorithm runs significantly faster than the genetic algorithm, but does not perform as well. In this case, the greedy cyclic algorithm did not produce viable results.

We then turned to the genetic algorithm. To initialize the genetic population, we slightly increased the link speeds of the automatically-generated design in a random fashion, so the genetic algorithm would begin its search in a promising area. It found significant improvements to the design in a relatively small
run: a population of 30, evolved over 20 generations. We then used a trivial post-optimization step, which consisted of attempting to lower link sizes when they were higher than in the original design. This was possible for two links. The cost savings reported in Table 3 below include the effects of this post-optimization step.

Applying Genetic Algorithms to a Hand-Optimized Network Design

The three variations of the hand-optimized network design had many features in common. All three versions had roughly 26 nodes and 37 links, with about 30 links that could be adjusted in size. Twenty choices of link speeds were available. We typically ran populations of around 30, over 20 to 50 generations.

In the first version, the network had to be reliable enough to tolerate the loss of either of two central nodes. We wrote a special constraint that worked by applying penalties when link utilizations were too high with either of the central nodes down. We experimented with making this constraint an Ice Age constraint which evaluated only once every n generations. Although people designing networks seem to use a similar approach, that is, they evaluate this time-consuming constraint sporadically rather than continually, we did not have much success with implementing this constraint as an Ice Age constraint. Perhaps a larger population would help to preserve traits that only come into play every n generations.

In the second and third versions, the primary constraint was link utilization. In the first of these, we modeled the network with twice the expected amount of traffic, and link utilizations had to be less than 100% (delay was not a factor). In the final version, the customer requested a slightly more conservative design, with link utilizations under 90%. Other constraints on these designs included link cost, modem cost, and port usage.
In analyzing the results of the genetic algorithms with the person responsible for doing the hand-optimizing of the designs, an interesting fact emerged. Two of the more important constraints to consider when designing these networks were link cost and link utilization, and these were the factors the designer concentrated on. A less crucial consideration was modem cost, since link costs usually far outweigh modem costs, and the designer did not have easy access to modem costs. In applying the genetic algorithm to these cases, however, we included a constraint for modem cost, and discovered that much of the improvement that the genetic algorithm achieved was in reducing modem costs. Apparently, genetic algorithms are able to keep track of subsidiary constraints, and to use them to advantage in finding improved solutions.

Conclusion

In the process of applying genetic algorithms to the problem of determining network link sizes, we developed novel and useful types of constraints. We also found a representation for constraints (as objects) that facilitated the rapid selection and modification of sets of constraints for individual network designs. Finally, with the successful application of genetic algorithms to the network link sizing problem, we demonstrated the value of genetic algorithms in solving design problems characterized by many differing constraints.
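The constraints-as-objects organization described above might be sketched like this (our reconstruction in outline only; the class names, the 70% limit, and the flat dollar penalty are hypothetical, and the original system was written in Zetalisp rather than Python):

```python
class Constraint:
    """Base class: a constraint turns violations into a dollar penalty and
    annotates what it did, so constraint sets can be mixed per design."""

    def __init__(self, name):
        self.name = name
        self.annotations = []

    def penalty(self, network):
        raise NotImplementedError

class LinkUtilizationConstraint(Constraint):
    """Penalize every link whose utilization exceeds the customer's limit."""

    def __init__(self, limit=0.70, dollars_per_violation=1000.0):
        super().__init__("link utilization")
        self.limit = limit
        self.dollars = dollars_per_violation

    def penalty(self, network):
        # network: a mapping of link name -> (traffic, capacity) in bits/sec.
        violations = [link for link, (traffic, speed) in network.items()
                      if traffic / speed > self.limit]
        self.annotations.append("%d link(s) over %.0f%%"
                                % (len(violations), self.limit * 100))
        return self.dollars * len(violations)

def total_penalty(network, constraints):
    """A solution's overall penalty is the sum over all active constraints."""
    return sum(c.penalty(network) for c in constraints)
```

New customer requirements would then be added as further `Constraint` subclasses, combined per design by passing a different list to `total_penalty`.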
Constraint          Behavior
Link Cost           Sums costs for penalties
Link Utilization    Routes, sums penalties
Link Drop           Drops links each Ice Age, routes, sums penalties
Port Usage          Sums penalties

Table 1: Constraints

Operator                   Behavior                                    Associated Constraint
Crossover                  Combines two parents, with multiple         --
                           tries for different parents
Creep                      Changes a link size one step up or down     Link Utilization
Adjust Ports (LaMarck)     Adds extra ports or nodes where needed      Port Usage
Speed Mismatch (LaMarck)   Changes link sizes to correct               --
                           speed mismatches

Table 2: Operators

Design                             Annual Cost Savings
Automatically-Generated Design     $36900
Hand-Optimized Design, Version 1   $8232
Hand-Optimized Design, Version 2   $45236
Hand-Optimized Design, Version 3   $55716

Table 3: Cost Savings by Design