PROCEEDINGS OF AN INTERNATIONAL CONFERENCE ON GENETIC ALGORITHMS AND THEIR APPLICATIONS July 24-26, 1985 at Carnegie-Mellon University, Pittsburgh, PA. Sponsored by Texas Instruments, Inc. and the U.S. Navy Center for Applied Research in Artificial Intelligence (NCARAI). John J. Grefenstette, Editor. Copyright © 1985 John J. Grefenstette

PREFACE

It has been ten years since the publication of John Holland's seminal book, Adaptation in Natural and Artificial Systems. One of the major contributions of the book was the formulation of a class of algorithms, now known as Genetic Algorithms (GA's), which incorporate metaphors from natural population genetics into artificial adaptive systems. Since the publication of Holland's book, interest in GA's has spread from the University of Michigan to research centers throughout the U.S., Canada, and Great Britain. GA's have been applied to a striking variety of areas, from machine learning to image processing to combinatorial optimization. The great range of application attests to the power and generality of the underlying approach. However, much of the GA research has been reported only in Ph.D. theses and informal workshops. This Conference was organized to provide a forum in which the diverse groups involved in GA research can share results and ideas concerning this exciting area. On behalf of the organizing committee, it is my pleasure to acknowledge the support of Texas Instruments, Inc. and the U.S. Navy Center for Applied Research in Artificial Intelligence. Special thanks go to Dave Davis for his efforts in obtaining the TI grant.

John J. Grefenstette, Program Chair

Conference Committee: John H. Holland, University of Michigan (Conference Chair); Lashon B.
Booker, NCARAI; Kenneth A. De Jong, NCARAI and George Mason University; John J. Grefenstette, Vanderbilt University (Program Chair); Stephen F. Smith, CMU Robotics Institute (Local Arrangements)

TABLE OF CONTENTS

Wednesday, July 24, 1985

Session 1: 8:45 a.m. - 10:15 a.m. Chair: John Holland
Properties of the bucket brigade, John H. Holland, University of Michigan (page 1)
Genetic algorithms and rule learning in dynamic system control, David E. Goldberg, University of Alabama (page 8)
Knowledge growth in an artificial animal, Stewart W. Wilson, Rowland Institute for Science (page 16)

Coffee Break: 10:15 a.m. - 10:45 a.m.

Session 2: 10:45 a.m. - 12:00 noon Chair: Lashon Booker
Implementing semantic network structures using the classifier system, Stephanie Forrest, University of Michigan (page 24)
The bucket brigade is not genetic, Thomas H. Westerdale, University of London (page 45)
Genetic plans and the probabilistic learning system: synthesis and results, Larry Rendell, University of Illinois at Urbana-Champaign (page 60)

Lunch: 12:00 noon - 2:00 p.m.

Session 3: 2:00 p.m. - 2:50 p.m. Chair: Stephen Smith
Learning multiclass pattern discrimination, J. David Schaffer, Vanderbilt University (page 74)
Improving the performance of genetic algorithms in classifier systems, Lashon B. Booker, Navy Center for Applied Research in AI (page 80)

Coffee Break: 2:50 p.m. - 3:15 p.m.

Discussion: 3:15 p.m. - 4:30 p.m. Topic: GA's and Machine Learning. Chair: John Holland

Thursday, July 25, 1985

Session 4: 9:00 a.m. - 10:15 a.m. Chair: John Grefenstette
Multiple objective optimization with vector evaluated genetic algorithms, J. David Schaffer, Vanderbilt University (page 99)
Adaptive selection methods for genetic algorithms, James E. Baker, Vanderbilt University (page 101)
Genetic search with approximate function evaluations, John J. Grefenstette and J. Michael Fitzpatrick, Vanderbilt University (page 112)

Coffee Break: 10:15 a.m. - 10:45 a.m.

Session 5: 10:45 a.m. - 12:00 noon Chair: John Grefenstette
A connectionist algorithm for genetic search, David H. Ackley, Carnegie-Mellon University (page 121)
Job shop scheduling with genetic algorithms, Lawrence Davis, Bolt, Beranek and Newman, Inc. (page 136)
Compaction of symbolic layout using genetic algorithms, Michael P. Fourman, Brunel University (page 141)

Lunch: 12:00 noon - 2:00 p.m.

Session 6: 2:00 p.m. - 3:15 p.m. Chair: Ken De Jong
Alleles, loci, and the traveling salesman problem, David E. Goldberg and Robert Lingle, Jr., University of Alabama (page 154)
Genetic algorithms for the traveling salesman problem, John J. Grefenstette, Rajeev Gopal, Brian J. Rosmaita and Dirk Van Gucht, Vanderbilt University (page 160)
Genetic algorithms: a 10 year perspective, Kenneth De Jong, George Mason University (page 169)

Coffee Break: 3:15 p.m. - 3:45 p.m.

Discussion: 3:45 p.m. - 4:30 p.m. Topic: GA's as Search Algorithms. Chair: Kenneth De Jong

Friday, July 26, 1985

Session 7: 9:00 a.m. - 10:30 a.m. Chair: John Grefenstette
Classifier systems with long term memory, Hayong Zhou, Vanderbilt University (page 178)
A representation for the adaptive generation of simple sequential programs, Nichael Lynn Cramer, Texas Instruments, Inc. (page 183)
Adaptive "cortical" pattern recognition, Stewart W. Wilson, Rowland Institute for Science (page 188)
Machine learning of visual recognition using genetic algorithms, Arnold C. Englander, Itran Corp. (page 197)
Bin packing with adaptive search, Derek Smith, Texas Instruments, Inc. (page 202)
Directed trees method for fitting a potential function, Craig Schaefer, Rowland Institute for Science (page 207)

Coffee Break: 10:30 a.m. - 11:00 a.m.

Discussion: 11:00 a.m. - 12:00 noon Topic: Summary and Future Directions. Chair: John Holland

PROPERTIES OF THE BUCKET BRIGADE ALGORITHM

John H. Holland, The University of Michigan

The bucket brigade algorithm is designed to solve the apportionment of credit problem for massively parallel, message-passing, rule-based systems.
The apportionment of credit problem was recognized and explored in one of the earliest significant works in machine learning (Samuel [1959]). In the context of rule-based systems it is the problem of deciding which of a set of early-acting rules should receive credit for "setting the stage" for later, overtly successful actions. In the systems of interest here, in which rules conform to the standard condition/action paradigm, a rule's overall usefulness to the system is indicated by a parameter called its strength. Each time a rule is active, the bucket brigade algorithm modifies the strength so that it provides a better estimate of the rule's usefulness in the contexts in which it is activated. The bucket brigade algorithm functions by introducing an element of competition into the process of deciding which rules are activated. Normally, for a parallel message-passing system, all rules having condition parts satisfied by some of the messages posted at a given time are automatically activated at that time. However, under the bucket brigade algorithm only some of the satisfied rules are activated. Each satisfied rule makes a bid, based in part on its strength, and only the highest bidders become active (thereby posting the messages specified by their action parts). The size of the bid depends upon both the rule's strength and the specificity of the rule's conditions. (The rule's specificity is used on the broad assumption that, other things being equal, the more information required by a rule's conditions, the more likely it is to be "relevant" to the particular situation confronting it.)
In a specific version of the algorithm used for classifier systems, the bid of classifier C at time t is given by

b(C,t) = c r(C) s(C,t),

where r(C) is the specificity of rule C (equal, for classifier systems, to the difference between the total number of defining positions in the condition and the number of "don't cares" in the condition), s(C,t) is the strength of the rule at time t, and c is a constant considerably less than 1 (e.g., 1/4 or 1/8). The essence of the bucket brigade algorithm is its treatment of each rule as a kind of mid-level entrepreneur (a "middleman") in a complex economy. When a rule C wins the competition at time t, it must decrease its strength by the amount of the bid. Thus its strength on time-step t+1, after winning the competition, is given by

s(C,t+1) = s(C,t) - b(C,t) = (1 - c r(C)) s(C,t).

In effect C has paid for the privilege of posting its message. Moreover this amount is actually paid to the classifiers that sent messages satisfying C's conditions -- in the simplest formulation the bid is split equally amongst them. These message senders are C's suppliers, and each receives its share of the payment from the consumer C. Thus, if C1 has posted a message that satisfies one of C's conditions, C1 has its strength increased so that

s(C1,t+1) = s(C1,t) + b(C,t)/n(C,t) = s(C1,t) + (c r(C)/n(C,t)) s(C,t),

where n(C,t) is the number of classifiers sending messages that satisfy C at time t. In terms of the economic metaphor, the suppliers {C1} are paid for setting up a situation usable by consumer C. C, on the next time step, changes from consumer to supplier because it has posted its message. If other classifiers then bid because they are satisfied by C's message, and if they win the bidding competition, then C in turn will receive some fraction of those bids.
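The transaction just described can be sketched in a few lines of code. This is an illustrative sketch only: the dictionary representation of a rule, the constant chosen, and the toy numbers are assumptions, not part of the algorithm's specification.

```python
# Minimal sketch of one bucket-brigade transaction, following the formulas
# above: b(C,t) = c * r(C) * s(C,t); the winner pays its bid, which is split
# equally among the suppliers whose messages satisfied its conditions.
# (Illustrative only -- rule representation and numbers are assumptions.)

C_BID = 0.125  # the constant c, "considerably less than 1" (here 1/8)

def bid(specificity, strength):
    """b(C,t) = c * r(C) * s(C,t)."""
    return C_BID * specificity * strength

def transaction(winner, suppliers):
    """The winner pays its bid; each supplier receives an equal share."""
    b = bid(winner["r"], winner["s"])
    winner["s"] -= b               # s(C,t+1) = (1 - c*r(C)) * s(C,t)
    share = b / len(suppliers)     # b(C,t) / n(C,t)
    for sup in suppliers:
        sup["s"] += share
    return b

consumer = {"r": 4, "s": 100.0}    # specificity r(C) = 4, strength 100
supplier = {"r": 2, "s": 50.0}
paid = transaction(consumer, [supplier])
print(paid, consumer["s"], supplier["s"])   # 50.0 50.0 100.0
```

With these toy numbers the winner's bid is 0.125 × 4 × 100 = 50, so the consumer's strength drops to 50 and the single supplier's strength rises to 100, illustrating how strength flows backward one link per transaction.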
C's survival in the system depends upon its turning a profit as an intermediary in these local transactions. In other words, when C is activated, the bid it pays to its suppliers must be less than (or, at least, no more than) the average of the sum of the payments it receives from its consumers. It is important that this process involves no complicated "bookkeeping" or memory over long sequences of action. When activated, C simply pays out its bid on one time-step, and is immediately paid by its consumers (if any) on the next time-step. The only variation on this transaction occurs on time-steps when there is payoff from the environment. Then, all classifiers active on that time-step receive equal fractions of the payoff in addition to any payments from classifiers active on the next time-step. In effect, the environment is the system's ultimate consumer. From a global point of view, a given classifier C is likely to be profitable only if its usual consumers are profitable. The profitability of any chain of consumers thus depends upon their relevance to the ultimate consumer. Stated more directly, the profitability of a classifier depends upon its being coupled into sequences leading to payoff. As a way of illustrating the bucket brigade algorithm, consider a set of 2-condition classifiers where, for each classifier, condition 1 attends to messages from the environment and condition 2 attends to messages from other classifiers in the set. As above, let a given classifier C have a bid fraction b(C) and strength s(C,t) at time t.
Note that condition 1 of C defines an equivalence class E in the environment consisting of those environmental states producing messages satisfying the condition. Consider now the special case where the activation of C produces a response r that transforms states in E to states in another equivalence class E' having an (expected) payoff u. Under the bucket brigade algorithm, when C wins the competition under these circumstances its strength will change from s(C,t) to

s(C,t+1) = s(C,t) - b(C)s(C,t) + u + (any bids C receives from classifiers active on the next time-step).

Assuming the strength of C is small enough that its bid b(C)s(C,t) is considerably less than u, the usual case for a new rule or for a rule that has only been activated a few times, the effect of the payoff is a considerable strengthening of rule C. This strengthening of C has two effects. First, C becomes more likely to win future competitions when its conditions are satisfied. Second, rules that send messages satisfying one (or more) of C's conditions will receive higher bids under the bucket brigade, because b(C)s(C,t+1) > b(C)s(C,t). Both of these effects strongly influence the development of the system. The increased strength of C means that response r will be made more often to states in E when C competes with other classifiers that produce different responses. If states in E' are the only payoff states accessible from E, and r is the only response that will produce the required transformation from states in E to states in E', then the higher probability of a win for C translates into a higher payoff rate to the classifier system. Of equal importance, C's higher bids mean that rules sending messages satisfying C's second condition will be additionally strengthened because of C's higher bids.
Consider, for example, a classifier C1 that transforms environmental states in some class E0 to states in class E by evoking response r1. That is, C1 acts upon a causal relation in the environment to "set the stage" for C. If C1 also sends a message that satisfies C's second condition, then C1 will benefit from the "stage setting" because C's higher bid is passed to it via the bucket brigade. It is instructive to contrast the "stage setting" case with the case where some classifier, say C2, sends a message that satisfies C but does not transform states in E0 (the environmental equivalence class defined by its first condition) to states in E. That is, C2 attempts to "parasitize" C, extracting bids from C via the bucket brigade without modifying the environment in ways suitable for C's action. Because C2 is not instrumental in transforming states in E0 to states in E, it will often happen that activation of C2 is not followed by activation of C on the subsequent time-step, because C's first (environmental) condition is not satisfied. Every time C2 is activated without a subsequent activation of C it suffers a loss, because it has paid out its bid b(C2)s(C2,t) without receiving any income from C. Eventually C2's strength will decrease to the point that it is no longer a competitor. (There is a more interesting case where C1 and C2 manage to become active simultaneously, but that goes beyond the confines of the present illustration.) One of the most important consequences of the bidding process is the automatic emergence of default hierarchies in response to complex environments. For rule-based systems a "default" rule has two basic properties: 1) it is a general rule with relatively few specified properties and many "don't cares" in its condition part, and 2) when it wins a competition it is often in error, but it still manages to profit often enough to survive.
It is clear that a default rule is preferable to no rule at all, but, because it is often in error, it can be improved. One of the simplest improvements is the addition of an "exception" rule that responds to situations that cause the default rule to be in error. Note that, in attempting to identify the error-causing situations, the condition of the exception rule specifies a subset of the set of messages that satisfy the default rule. That is, the condition part of the exception rule refines the condition part of the default rule by using additional identifying bits (properties). Because rule discovery algorithms readily generate and test refinements of existing strong rules, useful exception rules are soon added to the system. As a direct result of the bidding competition, an exception rule, once in place, actually aids the survival of its parent default rule. Consider the case where the default rule and the exception rule attempt to set a given effector to different values. In the typical classifier system this conflict is resolved by letting the highest bidding rule set the effector. Because the exception rule is more specific than the default rule, and hence makes a higher bid, it usually wins this competition. In winning, the exception rule actually prevents the default rule from paying its bid. This outcome saves the default rule from a loss, because the usual effect of an error, under the bucket brigade, is activation of consumers that do not bid enough to return a profit to the default rule. In effect the exception protects the default from some errors. Similar arguments apply, under the bucket brigade algorithm, when the default and the exception only influence the setting of effectors indirectly through intervening, coupled classifiers. Of course the exception rules may be imperfect themselves, correctly selecting some error-causing cases, but making errors in other cases.
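Because bids scale with specificity, an exception rule outbids its parent default whenever both match. A small sketch makes the mechanism concrete; the condition syntax ('#' as don't care), rule names, and numbers are illustrative assumptions, not from the paper.

```python
# Sketch of default-vs-exception conflict resolution: both rules match an
# error-causing message, the more specific exception bids higher and wins,
# so the default never pays its bid on that case. (Illustrative only.)

C_BID = 0.125

def specificity(condition):
    """Number of defining (non-'#') positions in the condition."""
    return sum(ch != "#" for ch in condition)

def matches(condition, message):
    return all(c in ("#", m) for c, m in zip(condition, message))

def winner(rules, message):
    """Highest bidder among the matching rules sets the effector."""
    live = [r for r in rules if matches(r["cond"], message)]
    return max(live, key=lambda r: C_BID * specificity(r["cond"]) * r["s"])

default_rule   = {"name": "default",   "cond": "1#######", "s": 100.0}
exception_rule = {"name": "exception", "cond": "1011####", "s": 100.0}

# On an error-causing message both match, but the exception wins:
print(winner([default_rule, exception_rule], "10110000")["name"])  # exception
# Elsewhere only the default matches, and it wins unopposed:
print(winner([default_rule, exception_rule], "11110000")["name"])  # default
```

At equal strengths the exception's bid (0.125 × 4 × 100 = 50) beats the default's (0.125 × 1 × 100 = 12.5) whenever both are satisfied, which is exactly how the exception shields the default from paying for its errors.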
Under such circumstances, the exception rules become default rules relative to more detailed exceptions. Iteration of the above process yields an ever more refined, and efficient, default hierarchy. The process improves both overall performance and the profitability of each of the rules in the hierarchy. It also uses fewer rules than would be required if all the rules were developed at the most detailed level of the hierarchy (see Holland, Holyoak, Nisbett, and Thagard [1986]). The bucket brigade algorithm strongly encourages the top-down discovery and development of such hierarchies (cf. Goldberg [1983] for a concrete example). At first sight, consideration of long sequences of coupled rules would seem to uncover an important limitation of the bucket brigade algorithm. Because of its local nature, the bucket brigade algorithm can only propagate strength back along a chain of suppliers through repeated activations of the whole sequence. That is, on the first repetition of a sequence leading to payoff, the increment in strength is propagated to the immediate precursors of the payoff rule(s). On the second repetition it is propagated to the precursors of the precursors, etc. Accordingly, it takes on the order of n repetitions of the sequence to propagate the increments back to rules that "set the stage" n steps before the final payoff. However, this observation is misleading because certain kinds of rule can serve to "bridge" long sequences. The simplest "bridging action" occurs when a given rule remains active over, say, T successive time-steps. Such a rule passes increments back over an interval of T time-steps on the next repetition of the sequence. This qualification takes on importance when we think of a rule that shows persistent activity over an epoch -- an interval of time characterized by a broad plan or activity that the system is attempting to execute.
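The repetition-by-repetition propagation just described can be simulated in a few lines. The sketch below is illustrative only: it assumes equal bid fractions for all rules, gives payoff only to the last rule in the chain, and compresses the pay-on-activation / receive-next-step timing into a single update per repetition.

```python
# Sketch: under the bucket brigade, the increment from a payoff at the end
# of a chain reaches the chain's first rule only after about n repetitions.
# (Illustrative assumptions: equal bid fractions, payoff to the last rule.)

BID_FRACTION = 0.5
PAYOFF = 100.0

def run_chain(strengths):
    """One repetition of the chain: each rule pays its bid to the rule
    that set its stage; the final rule also receives the payoff."""
    s = strengths[:]
    bids = [BID_FRACTION * x for x in s]
    for i in range(len(s)):
        s[i] -= bids[i]               # each rule pays its bid...
        if i > 0:
            s[i - 1] += bids[i]       # ...to its immediate supplier
    s[-1] += PAYOFF                   # the environment pays the last rule
    return s

chain = [10.0, 10.0, 10.0, 10.0]      # rule 0 acts first, rule 3 gets payoff
for rep in range(4):
    chain = run_chain(chain)
    print(rep + 1, [round(x, 3) for x in chain])
```

Running this shows rule 3 strengthened after the first repetition, rule 2 after the second, and so on; the first rule in the chain only rises above its starting strength on the fourth repetition, as the text predicts.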
For the activity to be persistent, the condition of the epoch-marking rule must be general enough to be satisfied by just those properties or cues that characterize the epoch. Such a rule, if strong, marks the epoch by remaining active for its duration. To extract the consequences of this persistent activation, consider a concrete plan involving a sequence of activities, such as a "going home" plan. The sequence of coupled rules used to execute this plan on a given day will depend upon variable requirements such as "where the car is parked", "what errands have to be run", etc. These detailed variations will call upon various combinations of rules in the system's repertoire, but the epoch-marking "going home" rule D will be active throughout the execution of each variant. In particular, it will be active both at the beginning of the epoch and at the time of payoff at the end of the plan ("arrival home"). As such it "bridges" the whole epoch. Consider now a rule I that initiates the plan and is coupled to (sends a message satisfying) the general epoch-marking rule D. The first repetition of the sequence initiated by I will result in the strength of I being incremented. This comes about because D is strengthened by being active at the time of payoff and, because it is a consumer of I's message, it passes this increment on to I the very next time I is activated. D "supports" I as an element of the "going home" plan. The result is a kind of one-shot learning in which the earliest elements in a plan are rewarded on the very next use. This occurs despite the local nature of the bucket brigade algorithm. It requires only the presence of a general rule -- a kind of default -- that is activated when some general kind of activity or goal is to be attained. An appropriate rule discovery algorithm, such as a genetic algorithm, will soon couple more detailed rules to the epoch-marking rule.
And, much as in the generation of a default hierarchy, these detailed rules can give rise to further refined offspring. The result is an emergent plan hierarchy going from a high-level sketch through progressive refinements, yielding ways of combining progressively more detailed components (rule clusters) to meet the particular constraints posed by the current state of the environment. In this way a limited repertoire of rules can be combined in a variety of ways, and in parallel, to meet the perpetual novelty of the environment.

References

Goldberg, D. E. Computer-aided Gas Pipeline Operation Using Genetic Algorithms and Machine Learning. Ph.D. Dissertation (Civil Engineering), The University of Michigan, 1983.

Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. Induction: Learning, Discovery, and the Growth of Knowledge. (Forthcoming, MIT Press.)

Samuel, A. L. "Some studies in machine learning using the game of checkers." IBM Journal of Research and Development, 3, 210-229, 1959.

GENETIC ALGORITHMS AND RULE LEARNING IN DYNAMIC SYSTEM CONTROL

David E. Goldberg, Department of Engineering Mechanics, The University of Alabama

ABSTRACT

In this paper, recent research results [2] are presented which demonstrate the effectiveness of genetic algorithms in the control of dynamic systems. Genetic algorithms are search algorithms based upon the mechanics of natural genetics. They combine a survival-of-the-fittest among string structures with a structured, yet randomized, information exchange to form a search algorithm with some of the innovative flair of human search. While randomized, genetic algorithms are no simple random walk. They efficiently exploit historical information to speculate on new search points with improved performance. Two applications of genetic algorithms are considered. In the first, a tripartite genetic algorithm is applied to a parameter optimization problem, the optimization of a serial natural gas pipeline with 10 compressor stations.
While solvable by other methods (dynamic programming, gradient search, etc.), the problem is interesting as a straightforward engineering application of genetic algorithms. Furthermore, a surprisingly small number of function evaluations are required (relative to the size of the discretized search space) to achieve near-optimal performance. In the second application, a genetic algorithm is used as the fundamental learning algorithm in a more complete rule learning system called a learning classifier system. The learning system combines a complete string rule and message system, an apportionment of credit algorithm modeled after a competitive service economy, and a genetic algorithm to form a system which continually evaluates its present rules while forming new, possibly better, rules from the bits and pieces of the old. In an application to the control of a natural gas pipeline, the learning system is trained to control the pipeline under normal winter and summer conditions. It is also trained to detect the presence or absence of a leak with increasing accuracy.

INTRODUCTION

Many industrial tasks and machines that once required human intervention have been all but completely automated. Where once a person tooled a part, a machine tools, senses, and tools again. Where once a person controlled a machine, a computer controls, senses, and continues its task. Repetitive tasks requiring a high degree of precision have been most susceptible to these extreme forms of automated control. Yet despite these successes, there are still many tasks and mechanisms that require the attention of a human operator. Piloting an airplane, controlling a pipeline, driving a car, and fixing a machine are just a few examples of ordinary tasks which have resisted a high degree of automation. What is it about these tasks that has prevented more autonomous, automated control?
Primarily, each of the example tasks requires not just a single capability, but a broad range of skills for successful performance. Furthermore, each task requires performance under circumstances which have never been encountered before. For example, a pilot must take off, navigate, control speed and direction, operate auxiliary equipment, communicate with tower control, and land the aircraft. He may be called upon to do any or all of these tasks under extreme weather conditions or with equipment malfunctions he has never faced before. Clearly, the breadth and perpetual novelty of the piloting task (and similarly complex task environments) prevents the ordinary algorithmic solution used in more repetitive chores. In other words, difficult environments are difficult because not every possible outcome can be anticipated in advance, nor can every possible response be pre-defined. This truth places a premium on adaptation. In this paper, we attack some of these issues by examining research results in two distinct, but related, problems. In the first, the steady state control of a serial gas pipeline is optimized using a genetic algorithm. While the optimization problem itself is unremarkable (a straightforward parameter optimization problem which has been solved by other methods), the genetic algorithm approach we adopt is noteworthy because it draws from the most successful and longest lived search algorithm known to man (natural genetics + survival-of-the-fittest). Furthermore, the GA approach is provably efficient in its exploitation of important similarities, and thus connects to our own notions of innovative or creative search. In the second problem, we use a genetic algorithm as a primary discovery mechanism in a larger rule learning system called a learning classifier system (LCS). In this particular application the LCS learns to control a simulated natural gas pipeline.
Starting from a random rule set, the LCS learns appropriate rules for high performance control under normal summer and winter conditions; additionally, it learns to detect simulated leaks with increasing accuracy.

A TRIPARTITE GENETIC ALGORITHM

Genetic algorithms are different from the normal search methods encountered in engineering optimization in the following ways:

1. GA's work with a coding of the parameter set, not the parameters themselves.
2. GA's search from a population of points.
3. GA's use probabilistic, not deterministic, transition rules.

Genetic algorithms require the natural parameter set of the optimization problem to be coded as a finite length string. A variety of coding schemes can be, and have been, used successfully. Because GA's work directly with the underlying code, they are difficult to fool: they do not depend upon continuity of the parameter space or the existence of derivatives. In many optimization methods, we move gingerly from a single point in the decision space to the next, using some decision rule to tell us how to get to the next point. This point-by-point method is dangerous because it often locates false peaks in multimodal search spaces. GA's work from a database of points simultaneously (a population of strings), climbing many peaks in parallel, thus reducing the probability of finding a false peak. Unlike many methods, GA's use probabilistic decision rules to guide their search. The use of probability does not suggest that the method is simply a random search, however. Genetic algorithms are quite rapid in locating improved performance. For our work, we may consider the strings in our population to be expressed in a binary alphabet containing the characters {0,1}. Each string is of length l, and the population contains a total of n such strings. Of course, each string may be decoded to a set of physical
parameters according to our design. Additionally, we assume that with each string (parameter set) we may evaluate a fitness value. Fitness is defined as the non-negative figure of merit we are maximizing; thus, the fitness in genetic algorithm work corresponds to the objective function in normal optimization.

A simple genetic algorithm which gives good results is composed of three operators:

1. Reproduction
2. Crossover
3. Mutation

With our simple genetic algorithm we view reproduction as a process by which individual strings are copied according to their fitness. Highly fit strings receive higher numbers of copies in the mating pool. There are many ways to do this; we simply give a proportionately higher probability of reproduction to those strings with higher fitness (objective function value). Reproduction is thus the survival-of-the-fittest or emphasis step of the genetic algorithm. The best strings make more copies for mating than the worst. After reproduction, simple crossover may proceed in two steps. First, members of the newly reproduced strings in the mating pool are mated at random. Second, each pair of strings undergoes crossing over as follows: an integer position k along the string is selected uniformly at random on the interval (1, l-1). Two new strings are created by swapping all characters between positions 1 and k inclusively. For example, consider two strings A and B of length 7 mated at random from the mating pool created by previous reproduction:

A = a1 a2 a3 a4 a5 a6 a7
B = b1 b2 b3 b4 b5 b6 b7

Suppose the roll of a die turns up a four.
The resulting crossover yields two new strings A' and B' following the partial exchange:

A' = b1 b2 b3 b4 a5 a6 a7
B' = a1 a2 a3 a4 b5 b6 b7

The mechanics of the reproduction and crossover operators are surprisingly simple, involving nothing more complex than string copies and partial string exchanges; however, together the emphasis step of reproduction and the structured, though randomized, information exchange of crossover give genetic algorithms much of their power. At first this seems surprising. How can such simple (computationally trivial) operators result in anything useful, let alone a rapid and relatively robust search mechanism? Furthermore, doesn't it seem a little strange that chance should play such a fundamental role in a directed search process? The answer to the second question was well recognized by the mathematician J. Hadamard:

We shall see a little later that the possibility of imputing discovery to pure chance is already excluded.... On the contrary, there is an intervention of chance but also a necessary work of unconsciousness, the latter implying and not contradicting the former.... Indeed, it is obvious that invention or discovery, be it in mathematics or anywhere else, takes place by combining ideas.

The suggestion here is that while discovery is not a result of pure chance, it is almost certainly guided by directed serendipity. Furthermore, Hadamard hints that a proper role for chance is to cause the juxtaposition of different notions. It is interesting that genetic algorithms adopt Hadamard's mix of direction and chance in a manner which efficiently builds new solutions from the best partial solutions of previous trials. To see this, consider a population of n strings over some appropriate alphabet, coded so that each is a complete IDEA or prescription for performing a particular task (in our coming example, each string is a description of how to operate all 10 compressors on a natural gas pipeline).
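The crossover mechanics of the A/B example can be written out directly. This is an illustrative sketch; the list-of-tokens representation is an assumption made so the a1/b1 notation of the example carries over.

```python
import random

# One-point crossover as described above: pick a cross site k uniformly on
# (1, l-1), then swap all characters between positions 1 and k inclusively.

def crossover(a, b, k=None):
    if k is None:
        k = random.randint(1, len(a) - 1)  # cross site on (1, l-1)
    return b[:k] + a[k:], a[:k] + b[k:]

A = "a1 a2 a3 a4 a5 a6 a7".split()
B = "b1 b2 b3 b4 b5 b6 b7".split()

# "Suppose the roll of a die turns up a four" -> k = 4:
A2, B2 = crossover(A, B, k=4)
print(A2)  # ['b1', 'b2', 'b3', 'b4', 'a5', 'a6', 'a7']
print(B2)  # ['a1', 'a2', 'a3', 'a4', 'b5', 'b6', 'b7']
```

With k = 4 the function reproduces exactly the A' and B' of the worked example.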
Substrings within each string (IDEA) contain various NOTIONS of what's important or relevant to the task. Viewed in this way, the population contains not just a sample of n IDEAS; rather, it contains a multitude of NOTIONS and rankings of those NOTIONS for task performance. Genetic algorithms carefully exploit this wealth of information about important NOTIONS by 1) reproducing quality NOTIONS according to their performance and 2) crossing these NOTIONS with many other high performance NOTIONS from other strings. Thus, the act of crossover with previous reproduction speculates on new IDEAS constructed from the high performance building blocks (NOTIONS) of past trials. If reproduction according to fitness combined with crossover gives genetic algorithms the bulk of their processing power, what then is the purpose of the mutation operator? Not surprisingly, there is much confusion about the role of mutation in genetics (both natural and artificial). Perhaps it is the result of too many B movies detailing the exploits of mutant eggplants that devour portions of Chicago, but whatever the cause for the confusion, we find that mutation plays a decidedly secondary role in the operation of genetic algorithms. Mutation is needed because, even though reproduction and crossover effectively search and recombine extant NOTIONS, occasionally they may become overzealous and lose some potentially useful genetic material (1's or 0's at particular locations). The mutation operator protects against such an unrecoverable loss. In the simple tripartite GA, mutation is the occasional random alteration of a string position. In a binary code, this simply means changing a 1 to a 0 and vice versa. By itself, mutation is a random walk through the string space. When used sparingly with reproduction and crossover, it is an insurance policy against premature loss of important NOTIONS.
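Mutation as just described is the occasional, independent flipping of a bit position. A short sketch (the rate and the string below are chosen purely for illustration):

```python
import random

# Sketch of bitwise mutation in a binary-coded GA: each position is flipped
# with a small, fixed probability. (Rate and string are illustrative.)

def mutate(bits, rate=0.001):
    return [b ^ 1 if random.random() < rate else b for b in bits]

random.seed(0)  # for repeatability of the sketch
parent = [0, 1, 1, 0, 1, 0, 1] * 100   # a 700-bit string
child = mutate(parent, rate=0.001)
# At this rate we expect well under one flip per string on average,
# consistent with mutation's secondary, insurance-policy role.
```

Used sparingly like this, mutation can restore a 1 or 0 that reproduction and crossover have driven out of the population at some position, without turning the search into a random walk.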
To see that the mutation operator plays a secondary role, we simply note that the frequency of mutation needed to obtain good results in empirical genetic algorithm studies is on the order of 1 mutation per thousand bit (position) transfers. Mutation rates are similarly small in natural populations, which leads us to conclude that mutation is appropriately considered as a secondary mechanism.

The underlying processing power of genetic algorithms is understood in more rigorous terms by considering the notion of a NOTION more carefully. If two or more strings (IDEAS) contain the same NOTION, there are similarities between the strings at one or more positions. To consider the number and form of the possible relevant similarities, we consider a schema [3] or similarity template; a similarity template is simply a string over our original alphabet {1,0} with the addition of a wild card or don't care character *. For example, with string length l = 7 the schema 1*0**** represents all strings with a 1 in the first position and a 0 in the third position. A simple counting argument shows that while there are only 2^l strings, there are 3^l well-defined schemata or possible templates of similarity. Furthermore, it is easy to show that a particular string is itself a representative of 2^l different schemata. Why is this interesting? The interesting part comes from considering the effect of reproduction and crossover on the multitude of schemata contained in a population of n strings (at most n*2^l schemata). Reproduction on average gives exponentially more samples to the observed best similarity patterns (a near-optimal sampling strategy if we consider a multi-armed bandit problem). Second, crossover combines schemata from different strings so that only very long defining length schemata (relative to the string length) are interrupted.
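The counting argument can be made concrete in a short sketch (the helper name is ours; the * don't-care character is from the text):

```python
from itertools import product

def instance_of(string, schema):
    """A string is an instance of a schema if they agree at every
    position where the schema is not the don't-care character *."""
    return all(s == "*" or s == c for s, c in zip(schema, string))

# 1*0****: a 1 in the first position and a 0 in the third.
assert instance_of("1001101", "1*0****")
assert not instance_of("0101101", "1*0****")

l = 7
strings  = ["".join(bits)  for bits in product("01",  repeat=l)]
schemata = ["".join(chars) for chars in product("01*", repeat=l)]
assert len(strings)  == 2 ** l    # 128 binary strings
assert len(schemata) == 3 ** l    # 2187 similarity templates

# Any one string represents 2**l schemata: each position of the schema
# may either fix that string's value or hold a *.
assert sum(instance_of("1001101", h) for h in schemata) == 2 ** l
```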
Thus, short defining length schemata are propagated generation to generation by giving exponentially increasing samples to the observed best, and all this goes on in parallel with little explicit book-keeping or special memory other than the population of n strings. How many of the n*2^l schemata are usefully processed per generation? Using a conservative estimate, Holland has shown that O(n^3) schemata are usefully sampled per generation. This compares favorably with the number of function evaluations (n), and because this processing leverage is so important (and apparently unique to genetic algorithms), Holland gives it a special name, implicit parallelism. In the next section we exploit this leverage in the optimization of a natural gas pipeline.

THE TRIPARTITE GENETIC ALGORITHM OPTIMIZES A NATURAL GAS PIPELINE

We apply the genetic algorithm to the steady state serial natural gas pipeline problem of Wong and Larson [4]. As mentioned previously, the problem is not remarkable. Wong and Larson successfully used a dynamic programming approach, and gradient procedures have also been used. Our goal here is to connect with extant optimization and control literature. We also look at some of the issues we face in applying genetic algorithms to more difficult problems where standard techniques may be inappropriate.

We envision a serial system with an alternating sequence of 10 compressors and 10 pipelines. A fixed pressure source exists at the inlet; gas is delivered at line pressure to the delivery point. Along the way, compressors boost pressure using fuel taken from the line. Modeling relationships for the steady flow of an ideal gas are well studied. We adopt Wong and Larson's formulation for consistency. The reader interested in more modeling detail should refer to their original work.

Along with the usual modeling relationships, we must pose a reasonable objective function and constraints.
For this problem, we use Wong and Larson's objective function and constraint specification. Specifically, we minimize the summed horsepower over the 10 compressor stations in the serial line subject to maximum and minimum pressure constraints as well as maximum and minimum pressure ratio constraints. Constraints on these state variables are adjoined to the problem using an exterior penalty method: whenever a constraint is violated, a penalty cost is added to the objective function in proportion to the square of the violation. As we shall see in a moment, constraints on control variables may be handled with the choice of some appropriate finite coding.

As discussed in the previous section, one of the necessary conditions for using a genetic algorithm is the ability to code the underlying parameter set as a finite length string. This is no real limitation, as every user of a digital computer or calculator knows; however, there is motivation for constructing special, relatively crude codings. In this study, the full string is formed from the concatenation of 10 four-bit substrings, where each substring is a mapped fixed point binary integer (precision = 1 part in 16) representing the difference in squared pressure across each of the ten compressor stations. This rather crude discretization gives an average precision in pressure of 34 psi over the operating range 500-1000 psia.

The model, objective function, constraints, and genetic algorithm have been programmed in Pascal. We examine results from a number of independent trials and compare to published results. To initiate simulation, a starting population of 50 strings is selected at random. For each trial of the genetic algorithm we run to generation 60. This represents a total of 50 x 61 = 3050 function evaluations per independent trial. The results from three trials are shown in Figure 1.
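Before turning to the results, the coding just described can be sketched as follows; the mapped range below is a placeholder (the paper maps each four-bit integer onto the feasible range of squared-pressure rise for its station, which we do not reproduce here):

```python
def decode_substring(bits, lo=0.0, hi=15.0):
    """Map a 4-bit substring (a fixed point binary integer, 1 part in 16)
    linearly onto [lo, hi]. The range here is illustrative only."""
    assert len(bits) == 4
    k = int("".join(str(b) for b in bits), 2)          # 0 .. 15
    return lo + (hi - lo) * k / 15.0

def decode_chromosome(chromosome):
    """Split a 40-bit string into ten 4-bit substrings, one squared-pressure
    difference per compressor station."""
    assert len(chromosome) == 40
    return [decode_substring(chromosome[i:i + 4]) for i in range(0, 40, 4)]

# A population member is just a 40-bit string; decoding yields the ten
# station controls that the pipeline model evaluates.
x = [0, 0, 0, 0] + [1, 1, 1, 1] + [0, 1, 1, 0] * 8
controls = decode_chromosome(x)
assert controls[0] == 0.0 and controls[1] == 15.0
```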
Figure 1 shows the cost of the best string of each generation as the solution proceeds. At first, performance is poor. After sufficient genetic action, near-optimal results are obtained. In all three cases, near-optimal results are obtained by generation 20 (1050 function evaluations).

Figure 1. Best-of-Generation Results - Steady Serial Problem

To better understand these results, we compare the best solution obtained in the first trial (run SS.1) to the optimal results obtained by dynamic programming. A pressure profile is presented in Figure 2. The GA results are very close to the dynamic programming solution, with most of the difference explained by the large discretization errors associated with the GA solution.

Figure 2. Pressure Profile - Run SS.1 - Steady Serial Problem

To gain a feel for the search rapidity of the genetic algorithm, we must compare the number of points searched to the size of the search space. Recall that in this problem, near-optimal performance is obtained after only 1050 function evaluations. To put this in perspective, with a string of length 40, there are 2^40 different possible solutions in the search space (2^40 = 1.1e12). Therefore, we obtain near-optimal results after searching only about 1e-7 percent of the possible alternatives. If we were, for example, to search for the best person among the world's 4.5 billion people as rapidly as the genetic algorithm, we would only need to talk to 4 or 5 people before making our near-optimal selection.

A LEARNING CLASSIFIER SYSTEM FOR DYNAMIC SYSTEM CONTROL

In the remainder of this paper, we show how the genetic algorithm's penchant for discovery in string spaces may be usefully applied to search for string rules in a learning classifier system (LCS). Learning classifier systems are the latest outgrowth of Holland's continuing work on adaptive systems [5].
Others have continued and extended this work in a variety of areas ranging from visual pattern recognition to draw poker [6-8]. A learning classifier system (LCS) is an artificial system that learns rules, called classifiers, to guide its interaction in an arbitrary environment. It consists of three main elements:

1. Rule and Message System
2. Apportionment of Credit System
3. Genetic Algorithm

A schematic of an LCS is shown in Figure 3. In this schematic, we see that the rule and message system receives environmental information through its sensors, called detectors, which decode to some standard message format. This environmental message is placed on a message list along with a finite number of other internal messages generated from the previous cycle. Messages on the message list may activate classifiers, rules in the classifier store. If activated, a classifier may then be chosen to send a message to the message list for the next cycle. Additionally, certain messages may call for external action through a number of action triggers called effectors. In this way, the rule and message system combines both external and internal data to guide behavior and the state of mind in the next state cycle.

In an LCS, it is important to maintain simple syntax in the primary units of information, messages and classifiers. In the current study, messages are l-bit (binary) strings and classifiers are 3l-position strings over the alphabet {0,1,#}. In this alphabet the # is a wild card, matching a 0 or a 1 in a given message. Thus, we maintain powerful pattern recognition capability with simple structures.

Figure 3. Schematic - Learning Classifier System

In traditional rule-based expert systems, the value or rating of a rule relative to other rules is fixed by the programmer in conjunction with the expert or group of experts being emulated.
In a rule learning system, we don't have this luxury. The relative value of different rules is one of the key pieces of information which must be learned. To facilitate this type of learning, Holland has suggested that rules coexist in a competitive service economy. A competition is held among classifiers, where the right to answer relevant messages goes to the highest bidders, with this payment serving as a source of income to previously successful message senders. In this way, a chain of middlemen is formed from manufacturer (source message) to message consumer (environmental action and payoff). The competitive nature of the economy insures that the good rules survive and that bad rules die off.

In addition to rating existing rules, we must also have a way of discovering new, possibly better, rules. This, of course, is the appropriate role for our genetic algorithm. In the learning classifier system application, we must be less cavalier about replacing entire string populations each generation, and we should pay more attention to the replacement of low performers by new strings; however, the genetic algorithm adopted in the LCS is very similar to the simple tripartite algorithm described earlier.

Taken together, the learning classifier system, with a computationally complete and convenient rule and message system, an apportionment of credit system modeled after a competitive service economy, and the innovative search of a genetic algorithm, provides a unified framework for investigating the learning control of dynamic systems. In the next section we examine the application of an LCS to natural gas pipeline operation and leak detection.

A LEARNING CLASSIFIER SYSTEM CONTROLS A PIPELINE

A pipeline model, load schedule, and upset conditions are programmed and interfaced to the LCS. We briefly discuss this environmental model and present results of normal operations and upset tests.
A model of a pipeline has been developed which accounts for linepack accumulation and frictional resistance. User demand varies on a daily basis and depends upon the weather. Different patterns may be used for winter and summer operation. In addition to normal summer and winter conditions, the pipeline may be subjected to a leak upset. During any given time step, a leak may occur with a specified leak probability. If a leak occurs, the leak flow, a specified value, is extracted from the upstream junction and persists for a specified number of time steps.

The LCS receives a message about the pipeline condition every time step. A template for that message is shown in Figure 4. The system has complete, albeit imperfect and discrete, knowledge of its state, including inflow, outflow, inlet pressure, outlet pressure, pressure rate change, season, time of day, time of year, and current temperature reading.

Figure 4. Pipeline LCS Environmental Message Template

In the pipeline task, the LCS has a number of alternatives for actions it may take. It may send out a flow rate chosen from one of four values, and it may send a message indicating whether a leak is suspected or not.

The LCS receives reward from its trainer depending upon the quality of its action in relation to the current state of the pipeline. To make the trainer ever-vigilant, a computer subroutine has been written which administers the reward consistently. This is not a necessary step, and reward can come from a human trainer.

Under normal operating conditions we examine the performance of the learning classifier system with and without the genetic algorithm enabled. Without the genetic algorithm, the system is forced to make do with its original set of rules. The results of a normal operating test are presented in Figure 5. Both runs with the LCS outperform a random walk (through the operating alternatives).
Furthermore, the run with the genetic algorithm enabled is superior to the run without the GA. In this figure, we show time-averaged total evaluation versus time of simulation (maximum reward per timestep = 6).

Figure 5. Time-averaged TOTALEVAL vs. Time - Normal Operations - Runs POLCS.1 & POLCS.2

More dramatic performance differences are noted when we have the possibility of leaks on the system. Figure 6 shows the time-averaged total evaluation versus time for several runs with leak upsets. Once again the LCS is initialized with random rules and permitted to learn from external reward. Both LCS runs outperform the random walk, and the run with the GA clearly beats the run with no new rule learning. To understand this, we take a look at some auxiliary performance measures. In Figure 7 we see the percentage of leaks alarmed correctly versus time. Strangely, the run without the GA alarms a higher percentage of leaks than the run with the GA. This may seem counterintuitive until we examine the false alarm statistics in Figure 8. The run without the GA is only able to alarm a high percentage of leaks correctly because it has so many false alarms. The run with the GA decreases its false alarm percentage while increasing its leaks correct percentage.

Figure 6. Time-averaged TOTALEVAL vs. Time - Leak Runs - POLCS.5 & POLCS.6

Figure 7. Percentage of Leaks Correct vs. Time - Runs POLCS.5 & POLCS.6

Figure 8. Percentage of False Alarms vs. Time - Runs POLCS.5 & POLCS.6

CONCLUSIONS

In this paper, we examined the performance of a genetic algorithm in two applications. In the first, a tripartite genetic algorithm consisting of reproduction, crossover, and mutation was applied to the optimization of a natural gas pipeline operation. The control space was coded as 40 bit binary strings.
Three initial populations of 50 strings were chosen at random. The genetic algorithm was started, and in all three cases very near-optimal performance was obtained after only 20 generations (1050 function evaluations).

In the second application, a genetic algorithm was the primary discovery mechanism in a larger rule-learning system called a learning classifier system. The LCS, consisting of a syntactically simple rule and message system, an apportionment of credit mechanism based on a competitive service economy, and a genetic algorithm, was taught to operate a gas pipeline under winter and summer conditions. It also was trained to alarm correctly for leaks while minimizing the number of false alarms.

REFERENCES

1. Goldberg, D. E., "Computer-Aided Pipeline Operation Using Genetic Algorithms and Rule Learning," Ph.D. dissertation, University of Michigan, Ann Arbor, 1983.

2. Hadamard, J., The Psychology of Invention in the Mathematical Field, Princeton University Press, Princeton, 1945.

3. Holland, J. H., Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.

4. Wong, P. J. and R. E. Larson, "Optimization of Natural Gas Pipeline Systems via Dynamic Programming," IEEE Trans. Auto. Control, vol. AC-13, no. 5, pp. 475-481, October, 1968.

5. Holland, J. H. and J. S. Reitman, "Cognitive Systems Based on Adaptive Algorithms," in Pattern-Directed Inference Systems, Waterman, D. A. and F. Hayes-Roth (eds.), pp. 313-329, Academic Press, New York, 1978.

6. Smith, S. F., "A Learning System Based on Genetic Adaptive Algorithms," Ph.D. dissertation, University of Pittsburgh, Pittsburgh, 1980.

7. Booker, L. B., "Intelligent Behavior as an Adaptation to the Task Environment," Ph.D. dissertation, University of Michigan, Ann Arbor, 1982.

8. Wilson, S., "Adaptive 'Cortical' Pattern Recognition," unpublished manuscript, Rowland Institute for Science, Cambridge, MA, 1983.
KNOWLEDGE GROWTH IN AN ARTIFICIAL ANIMAL

by Stewart W. Wilson
Rowland Institute for Science, Cambridge MA 02142

ABSTRACT

Results are presented of experiments with a simple artificial animal model acting in a simulated environment containing food and other objects. Procedures within the model that lead to improved performance and perceptual generalization are discussed. The model is designed in the light of an explicit definition of intelligence which appears to apply to all animal life. It is suggested that study of artificial animal models of increasing complexity would contribute to understanding of natural and artificial intelligence.

INTRODUCTION

The science of understanding and realizing intelligence in artificial systems needs a definition of intelligence. Every science needs good definitions of the problems it addresses. But in the artificial intelligence field there has been a hesitancy about defining intelligence. For example, on the first page of a recent, widely used AI textbook we find: "A definition in the usual sense seems impossible because intelligence appears to be an amalgam of so many information-representation and information-processing talents."[1] For many AI goals, this omission is not important. But the lack of a good working definition can lead to uncertainty in evaluating progress toward understanding intelligence per se, even though results are in other respects substantial.

This paper reports work using an artificial, behaving, animal model to study intelligence at a primitive level. An explicit definition of intelligence is adopted, and guides construction of the model. The definition has intuitive appeal and apparent applicability to the range of life from human beings to very primitive animals.
Because of this range, some results with the primitive animal model should provide insight into intelligence in general.

A DEFINITION OF INTELLIGENCE

A good definition should be relatively simple and yet cover most of the things we regard as belonging to the concept and few we regard as not belonging. The psychological literature offers a number of useful similar efforts, but the best definition of intelligence we have found is the following, from the physicist van Heerden:

Intelligent behavior is to be repeatedly successful in satisfying one's psychological needs in diverse, observably different, situations on the basis of past experience.[2]

This definition (vH) is suitable for the computer study of intelligence because it is comprehensive and its terms are not difficult to define concretely for experimental purposes. A high rate of receipt of certain reward quantities can correspond to "repeatedly successful in satisfying one's psychological needs" (on the simplest level, somatic needs). To "diverse, observably different, situations" can correspond sets of distinct sensory input "vectors", with each set having a particular implication for optimal action. To "past experience" can correspond a suitable internal record of earlier interactions with the environment, and their results.

THE ANIMAT MODEL

Computer modeling of human levels of intelligence is complex. VH's apparent applicability to both simple animals and human beings (assuming appropriate translations of its terms) suggests the usefulness of the easier course of considering basic problems that simple animals must solve, and constructing behaving models aimed at solving them. Observation of the models should aid understanding of all intelligence, and the construction of more complex models.

To define our model, we abstract four basic characteristics of simple animals:

1) The animal exists in a sea of sensory signals. At any moment only some signals are significant; the rest are irrelevant.
2) The animal is capable of actions (e.g. movement) which tend to change these signals.

3) Certain signals (e.g. those attendant on consumption of food), or certain signals' absence (e.g. absence of pain), have special status for him.

4) He acts, both externally and through internal operations, so as approximately to optimize the rate of occurrence of the special signals.

An animal's sensory-motor situation is described in very general terms by (1) and (2). Characteristics (3) and (4) are assumptions which provide a way of making definite the notion of "needs" and their satisfaction. Together, the four characteristics form the basis of our artificial animal model. For brevity, we call such a model an "animat".

We take as the animat's basic problem the generation of rules which associate sensory signals with appropriate actions so as to achieve the optimization of (4), above. For this, the major questions are adaptive, namely:

1) How to discover and emphasize rules that work,

2) Get rid of those that don't (since memory space is limited and noise is undesirable), and

3) Optimally generalize the rules that are kept (since space is limited).

There is some previous work along these lines. Notable were Grey Walter's machina speculatrix, which was a sort of sub-animat which chose actions based on needs and the sensory situation, but did not adapt its rules; and m. docilis, which could be taught a conditioned response.[3] More recently, Holland and Reitman[4] exhibited successful performance by a rule-adaptive animat-like system which optimized its rate of satisfaction of two distinct needs. Booker[5] experimented with an animat-like "hypothetical organism" which adapted its rules in a simple environment that contained both attractive and aversive stimuli; he also provides a review of earlier systems.
The present investigation is indebted to the last two works.

IMPLEMENTATION

Within the above framework we make the model definite by defining the animat's environment, sensory channels, repertoire of actions, its association rules, and then its performance and adaptation algorithms.

Environment:

A rectangle on the computer terminal screen, 18 rows by 58 columns and continued toroidally at its edges, defines the environmental space. Alphanumeric characters at various positions represent objects; the animat itself is denoted by *. Some, possibly many, positions are just blank.

In studies so far, * has been given the ability to pick up sensory signals from objects which happen to be one step (row and/or column) away, in any of the eight (including diagonal) directions; nothing is detected from more distant objects. Thus the "sense vector" has eight positions. With * located, for example, next to two trees and a food object, the sense vector would be TTFbbbbb, where b stands for blank. To form the sense vector, the circle of positions surrounding * is mapped clockwise, starting at 12 o'clock, into a left-to-right string.

But this vector is not the final sensory input. We imagine that an object is ultimately sensed as the outcome of measurements upon it by one or more feature or attribute detectors. Without loss of generality we assume each detector produces either a 0 or 1 output. If there are d detector types, an object translates into a binary string d bits in length. The sense vector as a whole thus translates into a "detector vector" of 8d bits. Detector translations or encodings of objects are fixed in *'s "low-level" sensory hardware. They are assigned at the beginning of an experiment. For example, in experiments discussed here, "F" (food) is encoded as "11"; "T" (tree or obstacle) as "01"; and "b" (open space) as "00". (The first bit might be thought of as the output of a "food smell"
detector; the second, of an "opacity" detector.) Thus the above sense vector translates into the detector vector:

01 01 11 00 00 00 00 00

The associative apparatus takes the detector vector as input.

*'s actions are restricted to single-step moves in each of the eight directions. The directions are numbered 0-7, starting at 12 o'clock and proceeding clockwise; for example, a move in direction 3 would be south-easterly. The animat may move, or attempt to move, to a position occupied by an object. The environment's response for each kind of object is predefined. In present experiments, if the move is into a position whose encoding is 00 (the blank object), there is no response (though the new sense vector will in general be different). If * steps into a space occupied by an object whose encoding has the first bit equal to 1, * is regarded as having eaten the object and receives a reward signal. If * tries to step toward an adjacent object whose encoding is 01, the step is not permitted to occur (a collision-like banging may be displayed).

The foregoing establish a semi-realistic situation in which sensory signals carry partial, but uncertain, information about the location of food and its availability. Environmental predictability can be varied through the choice and arrangement of the objects. The number of object types which may be experimented with is limited only by the number of bits in the detector encoding scheme.

Association Rules:

For its association rules, the animat uses a rudimentary form of Holland's[6] "classifier" rule. The animat's rules each consist of a "taxon" and an "action". The taxon is a sort of template capable of matching a certain set of detector vectors. The action is some one of the available actions. The animat's classifier says, in effect, "if my taxon matches the current detector vector, then consider taking this action". It is a kind of hypothesis about what to do given a certain sensory situation (class of detector vectors).
An example of a classifier would be:

0# 01 11 0# 00 00 0# 0# / 2

The matching rule requires that for any taxon position having a 0 or 1, the same value must occur in the detector vector; taxon positions with # (don't care) match unconditionally. Because of the #'s, which confer a kind of generality on the classifier, the above taxon, for example, will match 16 possible detector vectors, including the one discussed earlier.

It is worth making a few further observations about this classifier. First, it is a pretty good one, because if food is present in direction 2 and the classifier matches the detector vector, the action recommended is to move in direction 2 and not some other direction! Second, in directions 0, 3, 6, and 7, the taxon only requires that the object be, in effect, non-food, it being irrelevant whether these directions have obstacles or are blank. Directions 1, 4, and 5 have not been so generalized. Broadly speaking, a classifier is more useful to the animat to the extent it is general (matches many detector vectors) without being so general that it makes too many errors (i.e., that in certain matching situations its recommended action is inappropriate).

Besides taxon and action, each classifier possesses a "strength", a quantity serving as the principal measure of a classifier's value to the animat. There may be other associated quantities as well. The animat keeps a classifier population [P] of fixed size. Usually, [P] is initialized by filling all the taxa with 0, 1, and # according to some random rule; actions are similarly filled in. As the animat's CRT "life" evolves, the classifier population changes, as will be described.
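The encoding and matching machinery just described can be sketched as follows (the helper names are ours, not the paper's):

```python
ENCODING = {"F": "11", "T": "01", "b": "00"}   # food, tree/obstacle, blank

def detector_vector(sense_vector):
    """Translate the 8-position sense vector (clockwise from 12 o'clock)
    into an 8d-bit detector vector; here d = 2 bits per object."""
    return "".join(ENCODING[obj] for obj in sense_vector)

def taxon_matches(taxon, vector):
    """A taxon position holding 0 or 1 must equal the vector bit;
    a # (don't care) position matches unconditionally."""
    return all(t == "#" or t == v for t, v in zip(taxon, vector))

vec = detector_vector("TTFbbbbb")            # the example situation above
assert vec == "0101110000000000"             # i.e. 01 01 11 00 00 00 00 00

taxon = "0#01110#00000#0#"                   # the example classifier's taxon
assert taxon_matches(taxon, vec)             # it matches this situation
assert 2 ** taxon.count("#") == 16           # 4 #'s -> 16 matching bit-patterns
```

The classifier's action, 2, then moves * due east, straight toward the food just sensed in direction 2.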
PERFORMANCE ALGORITHM

*'s basic cycle is one "step", within which events having purely to do with immediate behavior are very simple. First, the current detector vector is calculated. Second, [P] is searched for classifiers which match it; these form the "match set" [M]. Third, a classifier is selected from [M] using a probability distribution over the strengths of [M]'s classifiers; that is, the probability of selection of a particular classifier is equal to its strength divided by the sum of strengths of classifiers in [M]. Fourth, * moves according to the action of the selected classifier, or tries to. The environment's response to the move will be as described earlier.

It can be seen that *'s move choice tends to be the one having the greatest total strength among the [M] classifiers advocating it. Thus, overall, * first asks which classifiers of [P] "recognize" the current sensory situation, then from these tends to pick the move with the greatest associated strength. The subset of [M] consisting of classifiers whose action is the same as the chosen action is called the "action set" [A].

ADAPTATION ALGORITHM

The adaptation algorithm has three distinct aspects: 1) reinforcement of classifier strengths; 2) "genetic" operations on classifiers yielding new classifiers; and 3) direct creation of classifiers.

Reinforcement:

As discussed in the last section, a classifier's strength is a major determinant of its ability to influence *'s action and therefore performance. We consequently want strength to reflect the performance which tends to result when this classifier is in [A]. That would be straightforward if every step were rewarded: we could, for example, adjust the classifier's strength by an amount proportional to the reward. Classifiers which got bigger rewards would be stronger, thus more likely to be in [A], etc. Realistically, however, it is usually the case that only some of an organism's actions receive a definite reward from the environment.
Actions leading up to, or setting the stage for, a rewarded action are themselves not directly rewarded, but they must somehow be encouraged or the final payoff will not occur. Holland[7] addressed this problem in proposing a "bucket-brigade" algorithm in which, very briefly, 1) classifiers make payments out of their strengths to classifiers which were active on the preceding cycle, and 2) the same classifiers later correspondingly receive payments from the strengths of the next set of active classifiers. External reward goes to the final active set in the chain. In effect, a given amount of external reward will eventually flow all the way back through a reliable chain, reinforcing every precursor classifier.

Our basic implementation of this idea is as follows. On each step:

1) all classifiers in [A] have a fraction e of their strengths removed;

2) the total strength thus removed from [A] is distributed to the strengths of any classifiers in [A-1], defined as the action set in the previous step;

3) * then moves, and if external reward is received, it is distributed to the strengths of [A]; if external reward is not received, the classifiers of [A] replace those of [A-1].

Thus every [A] participates in general in two transactions, one paying out, the other receiving. We can write

S'_A = S_A - e S_A + p

where S_A is [A]'s total strength on one step, S'_A its total on the next, and p is the total payoff received (either external reward or from the next [A]). If p is the same over time, S_A approaches a constant value given by p/e, so that under reasonably steady payoff conditions, S_A is an estimator of typical payoff. Similarly, the strength of any individual classifier is an estimator of its typical payoff.

The total payoffs to [A] and [A-1] are in the simplest case shared equally by the recipient classifiers.
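The selection step of the performance algorithm and the reinforcement rule above can be combined into one minimal sketch (the data layout and names are ours; classifiers are records with a "strength" field):

```python
import random

def select_classifier(match_set):
    """Pick one classifier from [M] with probability proportional to
    strength (the third step of the performance algorithm)."""
    total = sum(c["strength"] for c in match_set)
    r = random.uniform(0.0, total)
    acc = 0.0
    for c in match_set:
        acc += c["strength"]
        if r <= acc:
            return c
    return match_set[-1]

def bucket_brigade_step(A, A_prev, e, reward):
    """[A] pays a fraction e of its strengths to [A-1]; any external
    reward received after the move is shared equally within [A]."""
    pot = 0.0
    for c in A:
        tax = e * c["strength"]
        c["strength"] -= tax
        pot += tax
    if A_prev:
        for c in A_prev:
            c["strength"] += pot / len(A_prev)
    for c in A:
        c["strength"] += reward / len(A)

# Under steady payoff p, the recurrence S' = S - e*S + p settles at p/e.
S, e, p = 0.0, 0.25, 10.0
for _ in range(100):
    S = S - e * S + p
assert abs(S - p / e) < 1e-6
```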
This has the consequence that the more classifiers are in, say, [A], the less payoff each gets.

Genetic Operations.

Consider two classifiers which match similar situations:

0# 01 11 0# 00 00 0# 0# / 2
and
0# 0# 11 01 00 0# 0# 0# / 2

Each is good, but each still lacks something in generality since, for example, the matching requirements for 01 in bits 2-3 and 6-7, respectively, of each are perhaps unnecessarily restrictive. Suppose we make a new classifier by combining bits 5-9 of the first with bits 0-4 and 10-15 of the second. The result would be the slightly more general classifier

0# 0# 11 0# 00 0# 0# 0# / 2

The above operation on two classifiers resembles a kind of crossing-over or recombination of chromosome parts in genetics. It is an operation in which two "parent" classifiers produce an offspring that is possibly an improvement over both of them. Another "genetic" operation, this time using just one parent, would first clone the parent, then mutate one or more of the clone's taxon positions. Other types of operations on classifier structure can be imagined (one will be discussed later). In each case the attempt is to use existing classifiers as the starting points for improved classifiers.

But the crossover points above were chosen quite carefully; otherwise the offspring might have been no improvement, or even a retrogression (to a classifier more specific than either parent). We do not expect the animat to know where best to cut and mutate. How can we expect genetic operations to be of any use?

Holland [8] presents a mathematical theory showing that a population of individual symbol strings, in which each string can be assigned a numerical worth, will progressively increase in average worth as its members undergo reproduction, genetic operations on or among the offspring, and deletion of individuals to maintain constant population size. The key requirement is that an individual's probability of reproduction be proportional to its worth.
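The recombination illustrated above amounts to a two-point crossover on taxon strings. The sketch below is ours, and the 16-bit parent strings repeat the example as reconstructed here from a partly garbled reproduction.

```python
def crossover_taxa(taxon1, taxon2, cut1, cut2):
    """Two-point crossover: positions cut1..cut2-1 come from taxon1,
    everything else from taxon2."""
    return taxon2[:cut1] + taxon1[cut1:cut2] + taxon2[cut2:]

# Bits 5-9 of the first parent combined with bits 0-4 and 10-15 of the second:
parent1 = "0#01110#00000#0#"
parent2 = "0#0#1101000#0#0#"
child = crossover_taxa(parent1, parent2, 5, 10)
```

With these cut points the offspring taxon is `0#0#110#000#0#0#`, the slightly more general classifier of the example.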
Holland extended the theory to include classifier systems. In employing genetic operations, our animat constitutes an exploration and test of the theory. The specific algorithm employed is as follows: 1) A first classifier c1 of [P] is selected with probability proportional to its strength; 2) If c1 is merely to be reproduced, a copy of it is made and added to [P]. To make room, some classifier is deleted; 3) If c1 is to be crossed with another classifier, a second, c2, is selected, also with probability proportional to strength, but from the subset of [P] of classifiers having the same action as c1. Two cut points are chosen as above, but at random, and an offspring c3 constructed out of the parts. c3 is added to [P] and some classifier is deleted.

Note that the parents are kept (unless one happens to suffer the deletion, but this is unlikely). The offspring, in effect, go into competition for payoff with the parents. Better (higher strength) offspring should proliferate more rapidly than their parents, driving them out; for worse offspring, the reverse should be the case.

Create Operations.

Occasionally, as * executes the performance algorithm, a detector vector may occur that no classifier of [P] matches, i.e., the situation is unrecognized. The animat's response is to create a new, matching, classifier. A taxon is made by adding some #'s at random to the detector vector; an action is chosen randomly. The created classifier is added to [P] and one is deleted. The new classifier immediately matches the previously unrecognized situation and action occurs by the normal mechanism.

EXPERIMENTAL PROCEDURE

The animat model was designed with the vH-intelligence definition as a guide. In experiments with the model we are interested in finding procedures and parameter values that seem to give * greater rather than less vH-intelligence. For this, two measures have been adopted. One is a performance measure:
given an environment, how many steps does * take, on average, to find food objects? The other is a generality measure: does * evolve classifiers each tending to be useful in a number of distinct situations? Generality is important because it suggests that a high level of performance developed in one environment will carry over to a somewhat different environment.

The experimental procedure is to fix *'s methods and parameters, then have him do a large number of "problems" in a particular environment E. The measures of performance and generality are tracked. A "problem" always consists of starting * at a randomly selected blank position in E; then * moves until he eats some food, at which point the problem ends. The number of steps between start and food is recorded; a moving average of this quantity over the previous 60 problems is the performance measure, STPSAV.

To track generality, we calculate a histogram over the "periods" of all classifiers in [P]. The period of a classifier is a moving average of the number of steps by * between occurrences in [A] of this classifier. Thus a frequently used classifier will have a low period. [P] will then be general to the extent the histogram of periods is largest at low period. As [P] evolves we expect the histogram peak to move toward lower period, if [P]'s generality is increasing.

Figure 1. The Environment "WOODS7".

An environment used for many of the experiments is "WOODS7", shown in Fig. 1. Although WOODS7 may look simple, it contains a total of 92 distinct sense vectors, so *'s need to discover and generalize is substantial. To obtain performance baselines, we can start * randomly, then let him also move completely randomly until food (F) is bumped into. For WOODS7, the long-term average of the number of steps this takes is about 41 steps. We may also ask: what is the best possible performance (if, say, the animat had human capabilities)?
For every starting position, the number of steps to the nearest F can be found and averaged over all starting positions. The result for WOODS7 is 2.2 steps.

RESULTS AND DISCUSSION

Fig. 2 shows a performance curve for a combination of procedures and parameter settings that is among the best so far found. There is an initial rapid improvement within the first 1000 problems (untypically good during the first 100 problems, where STPSAV usually stays above 15), followed by very gradual improvement thereafter. The performance at 8000 problems, between 4 and 5 steps, is quite respectable compared with "perfect" (2.2 steps), especially since * has no information whatsoever until he is next to a nonblank object.

Figure 2. STPSAV (ragged line) and Period Average (broken line) for * to 8000 problems. Period values as marked. (Axes: average steps to food vs. number of problems.)

For the same animat, Fig. 3 shows the histogram of periods of [P] at 8000 problems. There is a definite bulge for low periods; the average period is 116. For comparison, the broken line in Fig. 2 shows the trend of the period averages at earlier epochs, indicating gradual generalization in the sense we have defined.

Qualitatively, a * such as this one gives the impression of "knowing" the Woods quite well. When next to F, * nearly always takes it directly; occasionally he will move one step sideways and take it from that direction. When next to one or more T's, but with no F immediately in sight, * quite reliably steps around the obstacle(s) and finds the F. When * is "out in the open", i.e., the sense vector consists of blanks, he has no information about the best way to go, as in a thick fog. One might expect *'s behavior to resemble a random walk, but this is not the case.
Instead, the movements look more like a general "drift" in some direction, with some superposed randomness. After several problems the drift may shift to another direction.

Figure 3. Histogram of classifier periods for the * of Figure 2 at 8000 problems. (Axes: number of classifiers vs. period, 0 to 450.)

Parameter Values.

Parameter values for the animat of Fig. 2 were arrived at by experiment. Three basic parameters are discussed in this section, with observations about setting them reasonably.

For Fig. 2, [P] contained 400 classifiers. A suitable value for this number appears related to the number of distinct sense vectors or "scenes" (here, 92) in the environment. Too small a ratio of classifiers to scenes results in "forgetful" behavior in which * keeps losing good moves that appeared well learned. A small ratio means that for some scenes deletion has a high probability of eliminating all matching classifiers. For ratios above about four, the forgetting is much less noticeable. To the extent [P] generalizes, more and more classifiers match each sense vector, further reducing the problem.

The "estimator fraction", e, was set at 0.2; i.e., a classifier lost 20 percent of its strength each time it entered [A]. In general, smaller values of e mean that a classifier's strength reflects a weighted average of payoffs that reaches farther into the past. Conversely, a larger value makes the strength more sensitive to recent payoffs. It was found that e = 0.4 produced a noticeably more erratic STPSAV curve, whereas changing from e = 0.2 to 0.1 did not affect the curve significantly. Strength should accurately estimate a classifier's typical payoff.
In this problem, payoff fluctuations are apparently large enough so that e = 0.4 results in too short an averaging interval for good estimation. If e is too small, though, newly formed classifiers may get evaluated too slowly; we therefore kept e at 0.2.

The rate at which genetic operations occurred was set proportional to the problem rate. Specifically, at the end of each problem, a single genetic event (as described earlier) took place with probability RGPROB. Given the event, crossover occurred with probability XPROB. Settings were typically 0.25 and 0.50, respectively. These seemed to ensure that, on average, classifiers would be fully evaluated by the reinforcement process by the time they were selected for a genetic operation (or deleted). Typically, a problem took five steps in which each set [A] had about 10 members, giving about 50 evaluations. The above value for RGPROB then implies 200 evaluations per genetic event. This seems excessive, except that some classifiers are much more frequently used than others and we wanted to allow for the well-rewarded but infrequently called-upon classifier. It is possible our results would have been speeded up, without adverse side effects, by a higher genetic rate.

Distance Estimation.

Performance in the earliest animat experiments was far below the level of Fig. 2. One defect was a kind of "dithering" in which, while * would tend toward F's, the path would have unnecessary sidesteps and wanderings. It was then realized that the basic reinforcement algorithm does not care whether a path from point A to food is long or short; there is nothing which preferentially reinforces the most expeditious classifiers. Any path, even a looping one, will come to equilibrium at a high strength level in its constituent classifiers. The solution had to be more subtle than simply penalizing long paths.
What is required is a technique that, at every position, tends to prefer the most direct of several possible moves but does not prevent the setting up of a long path if that is actually the shortest path available. Our solution was twofold. First, each classifier was made to keep an estimate of its distance (in steps) to food. This did not require elaborate look-ahead. Instead, each classifier in [A-1] adjusted its distance estimate according to an average of the distance estimates of [A]; when reward was received, the members of [A] were similarly adjusted using the quantity 1. This technique, with each estimate an average over the last few updates, is quite satisfactory.

The distances are employed as follows. In the performance cycle, selection from [M] is based on probability proportional to strength/distance instead of just strength. Consequently, a move tends to be selected that is not only strong, but also "short". Now comes the second part of the solution. At the same time as [A] is formed, the set NOT[A] of the remaining classifiers in [M] is taxed by a small amount (typically five percent); the "longer" classifiers thus tend to incur a loss by not being selected. This "lateral inhibition" induces a sort of catastrophe in which the shorter classifiers become ever more likely to be picked and the longer become ever weaker, and can disappear entirely. Note that the competition is purely local and does not work against the setting up of minimal long paths.

This technique is very effective against "dithering"; the progressive takeover of a match set by a discovered shorter move has been repeatedly observed. Our solution is not perfect, however, because to suppress the special case of occasional looping situations we had to impose a small tax (five percent) on [A].
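The two-part solution can be sketched as follows. This is our illustration only: classifiers are dicts, and the averaging constant `beta` and the exact form of the distance targets are assumptions where the text says only "an average over the last few updates".

```python
import random

def select_with_distance(M, tax=0.05, rng=random):
    """Select from the match set M with probability proportional to
    strength/distance; form the action set [A]; tax the remaining
    classifiers NOT[A] (the 'lateral inhibition')."""
    weights = [c["strength"] / max(c["distance"], 1.0) for c in M]
    r, acc, chosen = rng.uniform(0, sum(weights)), 0.0, M[-1]
    for c, w in zip(M, weights):
        acc += w
        if acc >= r:
            chosen = c
            break
    A = [c for c in M if c["action"] == chosen["action"]]
    for c in M:
        if c["action"] != chosen["action"]:
            c["strength"] *= 1.0 - tax      # NOT[A] pays for not being selected
    return A

def update_distances(A, A_prev, rewarded, beta=0.5):
    """[A-1] moves its estimates toward 1 + the mean distance of [A]; on
    reward, [A] moves toward the quantity 1."""
    if A and A_prev:
        target = 1.0 + sum(c["distance"] for c in A) / len(A)
        for c in A_prev:
            c["distance"] += beta * (target - c["distance"])
    if rewarded:
        for c in A:
            c["distance"] += beta * (1.0 - c["distance"])
```

Because only the non-selected members of [M] are taxed, the competition stays local to each match set, as the text requires.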
Since [A] is the set which receives payoff, the tax has little effect except if a loop is taking place, and then the tax is soon very effective. Still, in principle, even a small tax on [A] reduces the strength flow in very long chains, putting them at a reproductive disadvantage. This residual problem may be an indication that as paths grow, they should be "condensed" into units of behavior longer than one step.

Extension.

A second area of changes which improved performance had to do with the "Create" operations. As discussed, Create at first only occurred when [M] was empty. It was found that * sometimes also got stuck looping among situations with nonempty [M]'s. The tax on [A] enabled recognition of these loops because the total strengths in each [A] would tend to zero. We put in a threshold that triggered Create if the strength of any [M] got too low. This suppressed looping dramatically and improved performance.

It was also found important to trigger Create randomly, at a very low rate (typically, with probability 0.02 per step). * is engaged in path construction, using the best available current evidence. This can lead to good but nevertheless suboptimal paths which might be improved if * would only try something different. Random Creates are one way to introduce a new move direction. Usually the new classifier is no improvement. But when it is, and it gets tried (gets in [A]), it will be (often heavily) reinforced and therefore given a good chance at eventual reproductive success.

A different type of Create was also found useful. Instead of randomly picking the action in a Created classifier, * may make an educated guess, as follows. From its current position, * steps tentatively into a randomly selected adjacent position. There, [M] is determined and the strength-weighted average of the distances of its classifiers, MNDIST[M], is formed.
The same is done for several adjacent positions. These values are then compared with MNDIST[M] for the starting position. Several decision schemes are possible, with the general idea of picking an action direction corresponding to the shortest apparent path. If, however, none of the adjacent MNDIST[M]'s is better by more than 1 than the current position's value, it is preferable not to create a new classifier. This technique is important early in *'s existence, when very little is yet known; but, interestingly, it appears that * should not rely entirely upon it. Some suboptimal paths get set up which tend not to be improved. The problem goes away if random Creates are also available.

Effect of Genetic Operations.

Finally, we shall discuss what the experiments suggest about the role of the genetic operations. To begin, it is helpful to define a "concept" as a set of classifiers from [P] having exactly the same taxon and action, and for which there is no other classifier in [P] with that taxon and action. The basic effect of *'s genetic operations then appears to be to exert a pressure tending to increase the generality of [P]'s concepts. That is, with time, the periods of the concepts in [P] tend to decrease. The pressure is restrained by the requirement that the concepts be more or less correct (* must get the food expeditiously). The precise point of balance appears to depend on the parameter regime.

An important experiment is to evolve an animat with reinforcement and Create going as usual, but with genetic operations turned off. The result is a performance almost as good as Fig. 2. But significant generalization does not occur; the curve of histogram averages remains essentially flat at a value of about 270.
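One decision scheme of the kind described for the educated-guess Create might look like the sketch below. The names `mndist` and `guess_action` are ours; the "better by more than 1" rule follows the text, but the rest is one of the "several decision schemes" the paper leaves open.

```python
def mndist(M):
    """Strength-weighted average of the distance estimates of a match set."""
    total = sum(c["strength"] for c in M)
    return sum(c["strength"] * c["distance"] for c in M) / total

def guess_action(current_M, adjacent):
    """adjacent maps each candidate action to the match set [M] found at the
    adjacent position that action reaches.  Pick the action whose position
    appears nearest to food; return None (create nothing) unless some
    adjacent position beats the current one by more than 1 step."""
    best_action, best = None, mndist(current_M) - 1.0
    for action, M in adjacent.items():
        d = mndist(M)
        if d < best:
            best_action, best = action, d
    return best_action
```

A `None` result corresponds to the case where no new classifier should be created.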
There thus appears to be a division of effort: Create introduces the raw material, the specific examples to be evaluated; and the genetic operations produce more general concepts from the examples.

It is clear that crossover is capable of making a more general classifier out of two less general parents; this was illustrated earlier. We are not sure, however, just why for * the more general concept has a selective advantage. Somehow, greater generality must lead to greater concept strength; there is no other way to win out. Yet being active more frequently does not in itself result in greater strength: strength is an estimator of typical payoff, not payoff rate. Our tentative hypothesis stems from noting that a more specific concept will always have to share payoff with any more general offspring that comes into existence. This initially weakens the specific concept, so that the number of classifiers making it up tends to fall (at equilibrium, numbers are proportional to total strength). Consequently, the specific gets even less of the payoff, since payoff is shared. The result is a cascading situation in which the more general concept wins out. The odds favor the general because it has more than this one source of payoff.

While general classifiers appear to have a selective advantage, this is of no use unless such classifiers can be formed and introduced in the first place. Crossover is adequate for some types of generalization. But a natural operation for the purpose is obviously intersection. We have implemented this operation as follows. Two parents are chosen, and a new taxon is formed by intersecting copies of the parents' taxa over a randomly selected interval. In that interval, if the parents differ at a position, the new taxon gets a #; if not, the new taxon gets the common value. Outside the interval, the new taxon is filled in from parent 1. Intersection is a "hot" operation which should be used cautiously because it can introduce #'s at a high rate.
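The intersection operation can be sketched directly from the description above. This is our code; the interval endpoints would be chosen at random by the caller.

```python
def intersect_taxa(parent1, parent2, i, j):
    """Intersect over positions i..j-1: where the parents differ the child
    gets '#', where they agree it keeps the common value; outside the
    interval the child is filled in from parent 1."""
    child = list(parent1)
    for k in range(i, j):
        child[k] = parent1[k] if parent1[k] == parent2[k] else '#'
    return ''.join(child)
```

Note how quickly #'s accumulate when the parents disagree often, which is why the text calls intersection a "hot" operation.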
Nevertheless, our results show increased generalization with little performance loss when crossover and intersection are both available to *.

Space remains only to discuss the deletion technique. The simplest method, conceptually, is to delete at random. Then, to a first approximation, the equilibrium number of classifiers in a concept (or in any subset of [P] whatsoever) is proportional to its total strength. A drawback of random deletion is that a valuable concept that happens to consist of one classifier is at considerable risk until it reproduces. This is not a problem on average if [P] is large enough. Yet one wonders whether "deleting the weak" might not be better.

Several methods have been tried, all but one clearly worse than random deletion. The possibly better method is to delete with probability proportional to the reciprocal of strength. This has the obvious effect of tending to protect the precious classifier just mentioned. It can also be shown that the probability that a concept [C] will lose a member under this type of deletion is proportional to the square of its number, which places a strong restraint on over-expansion. The * of Fig. 2 employed both intersection (along with crossover) and inverse-strength deletion.

CONCLUSION

In its simple way, * meets the definition of intelligence stated at the beginning. * becomes good at satisfying its need for food in a Woods of diverse object configurations on the basis of experience. Though not yet tested, *'s rule generalization over time suggests that performance would be maintained in a somewhat different Woods,
or if the Woods slowly changed.

While the present animat has numerous limitations (sensory, motor, memory, etc.), there does not seem to be any essential barrier to removal of the limitations and to carryover of the present algorithms to a more sophisticated model in more complicated environments.

ACKNOWLEDGEMENT

The author wishes to acknowledge valuable conversations with C.G. Shaefer of the Rowland Institute.

REFERENCES

1. Winston, P.H. Artificial Intelligence, 2nd ed. Reading, Massachusetts: Addison-Wesley, 1984.
2. van Heerden, P.J. The Foundation of Empirical Knowledge. Wassenaar, The Netherlands: Wistik, 1968.
3. Walter, W.G. The Living Brain. New York: Norton, 1953.
4. Holland, J.H., & Reitman, J.S. Cognitive systems based on adaptive algorithms. In Pattern-Directed Inference Systems, Waterman, D.A., & Hayes-Roth, F. (eds.). New York: Academic Press, 1978.
5. Booker, L. Intelligent Behavior as an Adaptation to the Task Environment. Ph.D. Dissertation (Computer and Communication Sciences), The University of Michigan, 1982.
6. Holland, J.H. Adaptation. In Progress in Theoretical Biology, 4, Rosen, R., & Snell, F.M. (eds.). New York: Plenum, 1976.
7. Holland, J.H. Genetic algorithms and adaptation. In Adaptive Control of Ill-Defined Systems, Selfridge, O.G., Rissland, E.L., & Arbib, M.A. (eds.). New York: Plenum, 1984.
8. Holland, J.H. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press, 1975.
9. Gordon, Martha. Personal communication.

IMPLEMENTING SEMANTIC NETWORK STRUCTURES USING THE CLASSIFIER SYSTEM

Stephanie Forrest
The University of Michigan
Ann Arbor, Michigan

Introduction

One common criticism of Classifier Systems is the low-level nature of their representations. In Classifier Systems information is stored as rules (classifiers) that have a very constrained format (binary bit strings). Low-level binary bit string representations support adaptive learning algorithms well (Holland, 75) (Holland, 80).
However, it is difficult to interpret the behavior of these systems without a high-level interpreter that can code and decode the ones and zeroes into more meaningful terms. In particular, although gross behaviors can be measured at various intervals using some fitness function, it is difficult to chart how learning takes place or to determine what role is played by each component of the system. This feature of low-level representations makes it difficult to establish direct connections between the behavior of Classifier Systems and the more common high-level symbolic representations used in artificial intelligence programs.

The research described in this paper addresses this criticism by demonstrating that Classifier Systems are capable of representing sophisticated high-level structures. This has been accomplished by selecting one class of knowledge representation paradigms (semantic networks) and showing how they can be implemented as a collection of Classifier System rules. The described system takes high-level semantic network descriptions as input and automatically translates them into a Classifier System representation. It also provides a "query processor" that takes high-level queries about the semantic network, translates them into a sequence of Classifier System operations, and translates the results of the queries back into higher-level answers.

In large scale parallel systems such as the Classifier System, the issue of control is central. Control issues arise in two ways for the Classifier System: in deciding which external classifiers are to be generated, and in deciding which external messages are to be placed on the message list and when. As the number of rules in the system increases, it quickly becomes impossible to control the system manually. There are at least two possible ways to automate the process: "learning" and "compiling." Compilation can be viewed as mapping high-level structures onto lower-level operations ("top down").
Likewise, some kinds of learning (for example, genetic algorithms) can be viewed as the gradual emergence of higher-level structures from a random assortment of low-level processes; systems using these kinds of learning organize themselves from the "bottom up." The bottom-up approach is the one that has been studied previously for Classifier Systems (Holland, 80) (Booker, 82) (Goldberg, 83). The top-down approach is the one explored in this paper. The implementation takes the form of a compiler mapping "high-level" semantic network definitions onto the Classifier System. In this context, the Classifier System is properly viewed either as a lower-level target language or as a specification for an abstract parallel machine.

One particular semantic network formalism was selected for this research: KL-ONE (Brachman, 78) (Schmolze and Brachman, 82) (Brachman and Schmolze, 85). The KL-ONE family of languages is widely used; it contains most of the common semantic network constructs (the most notable exception being cancel links), has been precisely described, and includes sophisticated accessing functions as part of the design of the language. These characteristics make KL-ONE an excellent exemplar of the semantic network representation paradigm.

The remainder of this paper is divided into five sections: (1) a brief description of my version of the Classifier System, (2) a short introduction to KL-ONE, (3) a description of the Classifier System implementation of KL-ONE, (4) discussion, and (5) conclusions.

The Classifier System

Since there are several variants of Classifier Systems, I will describe below the one used for this project. This particular system does not include those features that are specific to the use of adaptive algorithms, such as bidding, support, etc. This is because I am interested in showing what sorts of representations are possible, not how they can evolve.
The following view of the Classifier System emphasizes how it can be used to represent higher-level structures and does not rely on any particular hardware implementation. Thus, it is appropriate to describe the language of possible programs for the Classifier System as a formal grammar. The input to a Classifier program is the set of external messages (often called detector messages) that are added to the message list during the program's execution. The output is the set of messages (called effector messages) read from the message list by an external agent. Just as many traditional programs can be run interactively, a classifier program can be thought of as receiving intermittent input from the external environment and occasionally emitting output messages.

The syntax for the Classifier System is as follows:

  <classifier>     ::= <condition-part> => <action>
  <condition-part> ::= <condition> | <condition> <condition-part>
  <condition>      ::= <string> | ~<string>
  <action>         ::= <string>
  <string>         ::= a fixed-length string over {1, 0, #}

Each classifier, or production rule, consists of a condition part and an action part. The action part specifies exactly one action, while the condition part may contain many conditions (pre-conditions of activation). Rules with more than one condition are referred to as "multiple-condition classifiers." A multiple-condition classifier must have each of its pre-conditions fulfilled in a single time step for it to be activated.

The conditions and actions are fixed-length strings over the alphabet {1, 0, #}, where # denotes "don't care" and 1 and 0 are literals. The determination of whether or not a specific message matches a condition is a logical bit comparison on the defined (1 or 0) bits. If a "not" condition is used, the condition is fulfilled just in the case that no message on the message list matches it. The #'s in the condition part designate "don't care" positions in the sense that they match either 1 or 0. The action part of the classifier determines the message to be posted. All defined bits appear directly in the output message.
Any # symbols in the action part indicate that the corresponding bit value in the activating message should be substituted for the # symbol in the output message. Actual messages are always completely defined in that they do not contain "don't care" symbols. Separate conditions are placed on separate lines, and the first condition (the distinguished condition) of a classifier is used to pass through messages to the action part.

(For multiple-condition classifiers the pass-through operation is ambiguous, since it is not clear what it means to simultaneously perform "pass through" on more than one condition. The ambiguity is resolved by selecting one condition to be used for pass through; by convention, this will always be the first condition. Another ambiguity arises if more than one message matches the distinguished condition in one time step. Again by convention, in my system I process all the messages that match this condition. The example below illustrates this procedure.)

As a simple example, consider the following four-bit (n = 4) classifier system:

  #00# => 1101

  ##0#
  ###1 => 1111

  ~11## => 1111

This classifier system has three classifiers. The second classifier illustrates multiple conditions, and the third contains a negative condition. If an initial message, "0000", is placed on the message list at time T0, the pattern of activity shown below will be observed on the message list:

  Time Step   Message List   Activating Classifier
  T0          0000           external
  T1          1101           first
              1111           third
  T2          1111           second
  T3          (empty)
  T4          1111           third

The final two message lists (empty and "1111") would continue alternating until the system was turned off. In T1, one message (1101) matches the first (distinguished) condition of the second classifier, and both messages match its second condition. Pass through is performed on the first condition, producing one output message for time T2.
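The matching, negation, and pass-through rules just described can be sketched as a small interpreter. The code is ours, and the three-rule program mirrors the example above as reconstructed here from a partly garbled reproduction.

```python
def matches(cond, msg):
    # '#' in a condition matches either message bit
    return all(c in ('#', b) for c, b in zip(cond, msg))

def step(classifiers, messages):
    """One time step.  Each classifier is (conditions, action); a condition
    starting with '~' is negative.  Pass-through uses the first
    (distinguished) condition, and every message matching it is processed."""
    out = []
    for conds, action in classifiers:
        ok = True
        for cond in conds[1:]:           # non-distinguished conditions
            if cond.startswith('~'):
                ok = ok and not any(matches(cond[1:], m) for m in messages)
            else:
                ok = ok and any(matches(cond, m) for m in messages)
        if not ok:
            continue
        first = conds[0]
        if first.startswith('~'):
            if not any(matches(first[1:], m) for m in messages):
                out.append(action)       # no activating message to pass through
        else:
            for m in messages:           # process all matching messages
                if matches(first, m):
                    out.append(''.join(b if a == '#' else a
                                       for a, b in zip(action, m)))
    return out

prog = [(["#00#"], "1101"),              # first classifier
        (["##0#", "###1"], "1111"),      # second: multiple conditions
        (["~11##"], "1111")]             # third: negative condition
```

Starting from the external message "0000", successive calls of `step` reproduce the trace in the text: first ["1101", "1111"], then ["1111"], then an empty list, then ["1111"], alternating thereafter.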
If the conditions had been reversed (###1 distinguished), the message list at time T2 would have contained two identical messages (1111).

KL-ONE

KL-ONE organizes descriptive terms into a multi-level structure that allows properties of a general concept, such as "mammal," to be inherited by more specific concepts, such as "zebra." This allows the system to store properties that pertain to all mammals (such as "warm-blooded") in one place, but to have the capability of associating those properties with all concepts that are more specific than mammal (such as zebra). A multi-level structure such as KL-ONE is easily represented as a graph where the nodes of the graph correspond to concepts and the links correspond to relations between concepts. Such graphs, with or without property inheritance, are often referred to as semantic networks.

KL-ONE resembles NETL (Fahlman, 79) and other systems with default hierarchies in its exploitation of the idea of structured inheritance of properties. It differs by taking the definitional component of the network much more seriously than these other systems. In KL-ONE, the properties associated with a concept in the network are what constitute its definition. This is a stronger notion than the one that views properties as predicates of a "typical" element, any one of which may be cancelled for an "atypical" case. KL-ONE does not allow cancellation of properties. Rather, the space of definitions is seen as an infinite lattice of all possible definitions: there are concepts to cover each "atypical" case.

All concepts in a KL-ONE network are partially ordered by the "SUBSUMES" relation. This relation, often referred to as "IS-A" in other systems, defines how properties are inherited through the network. That is, if a concept A is subsumed by another concept B, A inherits all of B's properties.
Included in the lattice of all possible definitions are contradictory concepts that can never have an extension (instance) in any useful domain, such as "a person with two legs and four legs." Out of this potentially infinite lattice, any particular KL-ONE network will choose to name a finite number of points (because they are of interest in that application), always including the top element, often referred to as "THING."

KL-ONE also provides a mechanism for using concepts whose definitions either cannot be completely articulated or for which it is inconvenient to elaborate a complete definition: the PRIMITIVE construct. For example, if one were representing abstract data types and the operations that can be performed on them, it might be necessary to mention the concept of "Addition." However, it would be extremely tedious and not very helpful in this context to be required to give the complete set-theoretic definition of addition. In a case such as this, it would be useful to define addition as a primitive concept. The PRIMITIVE construct allows a concept to be defined as having something special about it beyond its explicit properties. Concepts defined using the PRIMITIVE construct are often indicated with "*" when a KL-ONE network is represented as a graph.

While NETL stores assertional information (e.g., "Clyde is a particular elephant") in the same knowledge structure as that containing definitional information (for example, "typical elephant"), KL-ONE separates these two kinds of knowledge. A sharp distinction is drawn between the definitional component, where terms are represented, and the assertional component, where extensions (instances) described by these terms are represented.
It is possible to make more than one assertion about the same object in any world. For example, it may be possible to assert that a certain object is both a "Building" and a "Fire Hazard." In KL-ONE, the definitional component (and its attendant reasoning processes) of the system is called the "terminological" space, and a collection of instances (and the reasoning processes that operate on it) is referred to as the "assertional" space. The features of KL-ONE that are discussed here (structured inheritance, no cancellation of properties, primitive concepts, etc.) reside in the terminological component, while statements in the assertional component are represented as sentences in some defined logic. Reasoning in the assertional part of the system is generally viewed as theorem proving.

At the heart of knowledge acquisition and retrieval is the problem of classification. Given a new piece of information, classification is the process of deciding where to locate that information in an existing network and knowing how to retrieve it once it has been entered. This information may be a single node (concept) or, more likely, it may be a complex description built out of other concepts. Because KL-ONE maintains a strict notion of definition, it is possible to formulate precise rules about where any new description (terminological) should be located in an existing knowledge base. As an example of this classification process in KL-ONE, if one wants to elaborate a new concept XXXX that has the following characteristics: 1) XXXX is a kind of vacation, 2) XXXX takes place in Africa, and 3) XXXX involves hunting zebras, there exists a precise way to determine which point in the lattice of possible definitions should be elaborated as XXXX. More precisely, XXXX has a location role which is value restricted to the concept Africa, an activity role which is value restricted to the concept HuntingZebras, and a SUPERC link connecting it to the concept Vacation. Finding the proper location for XXXX would involve finding all subsumption relationships between XXXX and terms that share characteristics with it. If the terminological space is implemented as a multi-level network, this process can be described as that of finding those nodes that should be immediately above and immediately below XXXX in the network. The notions of "above" and "below" are expressed more precisely by the relation "SUBSUMES." Deciding whether one concept SUBSUMES another is the central issue of classification in KL-ONE. The subsumption rules for a particular language are a property of the language definition (Schmolze and Israel, 83).

In summary, there are two aspects to the KL-ONE system: (1) data structures that store information and (2) a sophisticated set of operations that control interactions with those data structures. In the following sections, the first of these aspects is emphasized. A more detailed treatment of KL-ONE operations is contained in (Lipkis, 81).

Classifier System Implementation of KL-ONE

In this section, a small subset of the KL-ONE language is introduced and the corresponding representation in classifiers is presented. Then it is shown how simple queries can be made to the Classifier System representation to retrieve information about the semantic network representation.
The simple queries that are discussed can be combined to form more complex interactions with the network structure (Forrest, 83). A KL-ONE semantic network can be viewed as a directed graph that contains a finite number of link and node types. Under this view, a Classifier System representation of the graph can be built up using one classifier to represent every directed link in the graph. The condition part of the classifier contains the encoded name of the node that the link comes from and the action part contains the encoded name of the node that the link goes to. Tagging controls which type of link is traversed. In the following, two node types (concepts and roles) and six link types (SUPERC, ROLE, VR, DIFF, MAX, and MIN) are discussed. These node and link types comprise the central core of most KL-ONE systems and are sufficiently rich for the purposes of this paper.

For the purposes of encoding, the individual bits of the classifiers have been conceptually grouped into fields. The complete description of these fields appears below. The description of the encoding of KL-ONE is then presented in terms of fields and field values, rather than using bit values. It should be remembered that each field value has a corresponding bit pattern and that ultimately each condition and action is represented as a string of length thirty-two over the alphabet {1, 0, #}. The word nil denotes "don't care" for an entire field. There are several distinct ways in which the classifiers' bits have been interpreted. The use of tagging ensures that there is no ambiguity in the interpretations used. The type definition facilities of Pascal-like languages provide a natural way to express the conceptual interpretations I have used, as shown below:

    type tag       = (NET, NUM, PRE);
         link      = (SUPERC, ROLE, DIFF, VR, MAX, MIN);
         direction = (UP, DOWN);
         compare   = (AFIELD, BFIELD, CFIELD);
         name      = string;
         message   = string;
         numeric   = 0 .. 63;

    classifierpattern = record
      case tag : tagfield of
        NET : /* Structural Variant  */
              (tagfield, name, link, direction);
        NUM : /* Numeric Variant     */
              (tagfield, name, nil, direction, compare, numeric);
        PRE : /* PreDefined Variant  */
              (tagfield, message);
      end;

This definition defines three patterns for constructing classifiers: structural, numeric, and predefined. The structural pattern is by far the most important. It is used to represent concepts and roles. The numeric pattern is used for processing number restrictions. The predefined pattern is used for control purposes; it has no don't cares in it, providing reserved words, or constants, to the system.

The structural pattern has been broken into four fields: tag, name, link, and direction. The tag field is set to NET, the name field contains the coded name of a concept or role, the link field specifies which link type is being traversed (SUPERC, DIFF, etc.), and the direction determines whether the traversal is up (specific to general) or down (general to specific).

The numeric pattern has six fields: tag, name, link, direction, compare, and number. In most cases the name, link, and direction fields are not relevant to the numeric processing and are filled with don't cares. The tag field is always set to NUM, and the compare field is one of AFIELD, BFIELD, or CFIELD. The compare field is used to distinguish operands in arithmetic operations. The number field contains the binary representation of the number being processed.

The predefined pattern has the value PRE in the tag field. The rest of the pattern is assigned to one field. These bits are always completely defined (even in conditions and actions) as they refer to unique constant messages. These messages provide internal control information and they are used to initiate queries from the command processor.

Concept Specialization

All concepts in KL-ONE are partially ordered by the "SUBSUMES" relation.
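The field encoding can be made concrete with a short sketch that packs field values into a 32-character pattern over {1, 0, #}. The field widths, bit codes, and the `encode` helper below are all assumptions invented here for illustration; the text fixes only the total length of thirty-two and the treatment of nil as an all-don't-care field.

```python
# Packing named field values into a fixed-width ternary pattern.
# Field widths are assumed: they sum to 32, matching the pattern length.
FIELDS = [("tag", 2), ("name", 20), ("link", 3), ("direction", 1),
          ("compare", 2), ("number", 4)]
CODES = {"NET": "00", "NUM": "01", "PRE": "10",
         "SUPERC": "000", "DIFF": "001", "UP": "0", "DOWN": "1"}

def encode(**values):
    """Build one pattern; any field left out (nil) becomes all don't-cares."""
    out = []
    for field, width in FIELDS:
        v = values.get(field)
        if v is None:
            out.append("#" * width)          # nil => don't care whole field
        else:
            code = CODES.get(v, v)           # symbolic value or raw bits
            out.append(code.rjust(width, "0"))
    return "".join(out)

p = encode(tag="NET", name="0" * 19 + "1", link="SUPERC", direction="UP")
assert len(p) == 32
```

Note how a nil field expands to a run of # symbols, so a condition with nil fields matches every message that agrees on its defined fields.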
One concept, for example Surfing, is said to specialize another concept, say WaterSport, if Surfing is SUBSUMED by WaterSport. This means that Surfing inherits all of WaterSport's properties. The "SUBSUMES" relation can be inferred by inspecting the respective properties of the two concepts, or Surfing can be explicitly defined as a specialization of WaterSport. Graphically, the specialization is represented by a double arrow (called a SUPERC link) from the subsumed concept to the subsuming concept (see Figure 1). KL-ONE's SUPERC link is often called an ISA link in other semantic network formalisms. Since the SUBSUMES relation is transitive, SUPERC links could be drawn to all of WaterSport's subsumers as well. Traditionally, only the local links are represented explicitly.

Figure 1. Concept Specialization

Two classifiers are needed to represent every explicit specialization in the network. This allows traversals through the network in either the UP (specific to general) or DOWN (general to specific) direction. The classifiers form the link between the concept that is being specialized and the specializing concept. The following two classifiers represent the network shown in Figure 1:

NORM-WaterSport-SUPERC-DOWN => NORM-Surfing-SUPERC-DOWN
NORM-Surfing-SUPERC-UP => NORM-WaterSport-SUPERC-UP

A role defines an ordered relation between two concepts. Roles in KL-ONE are similar to slots in frame-based representations. The domain of a role is analogous to the frame that contains the slot; the range of a role is analogous to the class of allowable slot-fillers. In KL-ONE, the domain and range of a role are always concepts. Just as there is a partial ordering of concepts in KL-ONE, so is there a partial ordering of roles. The relation that determines this ordering is "differentiation." Pictorially, the DIFFERENTIATES relation between two roles is drawn as a single arrow (called a DIFF link).
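The translation of one explicit specialization link into its pair of classifiers is entirely mechanical, and can be sketched as follows (the helper function name is mine, not the paper's; the classifiers use the symbolic notation of the text rather than 32-bit patterns):

```python
# Emit the two classifiers that represent a single SUPERC link, so the
# network can be traversed in both the DOWN and UP directions.

def superc_classifiers(general, specific):
    down = f"NORM-{general}-SUPERC-DOWN => NORM-{specific}-SUPERC-DOWN"
    up   = f"NORM-{specific}-SUPERC-UP => NORM-{general}-SUPERC-UP"
    return [down, up]

for c in superc_classifiers("WaterSport", "Surfing"):
    print(c)
```

Running the generator over every explicit SUPERC link in a network yields the complete set of specialization classifiers.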
Roles are indicated by a circle surrounding a square (see Figure 2). This allows roles to be defined in terms of other roles similarly to the way that concepts are defined from other concepts. The domain of a role is taken to be the most general concept at which it is defined, and, likewise, the range is taken to be the most general concept to which the role is restricted (called a value restriction). If there is no explicit value restriction in the network for some role, its range is assumed to be the top element, THING.

Roles are associated with a concept, and one classifier is needed to represent each association (link) between a concept and its role. For example, the role Arm might be associated with the concept Person (see Figure 2) and the following classifier would be generated:

nil-Person-nil-nil-nil, PRE-RoleMessage => nil-Arm-DIFF-nil-nil

Figure 2. Concept and Role

Roles can be defined in terms of other roles using DIFF links. For example, the role Sibling can be defined as a differentiator of "Relatives" (see Figure 3). Building on this definition, the conjunction WealthySibling is defined by constructing DIFF links from WealthySibling both to Sibling and to Wealthy, as shown in Figure 3.

Figure 3. Role Differentiation

There are two links specified by this definition. Two classifiers are needed to represent each link so that queries can be supported in both directions (UP or DOWN). They are shown below:

NORM-Wealthy-DIFF-DOWN => NORM-WealthySibling-DIFF-DOWN
NORM-WealthySibling-DIFF-UP => NORM-Wealthy-DIFF-UP
NORM-Sibling-DIFF-DOWN => NORM-WealthySibling-DIFF-DOWN
NORM-WealthySibling-DIFF-UP => NORM-Sibling-DIFF-UP

These classifiers control propagations along DIFF links. They could be used to query the system about relations between roles.
Value Restrictions

Value restrictions limit the range of a role in the context of a particular concept. In frame/slot notation, this would correspond to constraining the class of allowable slot fillers for a particular slot. To return to the sibling example, we might wish to define the concept of a person all of whose siblings are sisters (PersonWithOnlySisters). In this case the role, Sibling, is a defining property of PersonWithOnlySisters. The association between a concept and a role is indicated in the graph by a line segment connecting the concept with the role. Value restrictions are indicated with a single arrow from the role to the value restriction (a concept). Figure 4 illustrates these conventions.

Figure 4. Value Restrictions

One classifier is needed for each explicitly mentioned value restriction. This classifier associates the local concept and the relevant role with their value restriction. The control message, VR, ensures that the classifier is only activated when the system is looking for value restrictions. The following classifier is produced for the value restriction:

nil-PersonWithOnlySisters-nil-nil-nil, nil-Sibling-nil-nil-nil, PRE-VRMessage => nil-Female-SUPERC-nil-nil

It should be noted that the above definition does not require a PersonWithOnlySisters to actually have any siblings. It just says that if there are any, they must be female. The definition can be completed to require this person to have at least one sister by placing a number restriction on the role.

Pictorially, number restrictions are indicated at the role with (x,y), where x is the lower bound and y is the upper bound. Not surprisingly, these constructs place limitations on the minimum and maximum number of role fillers that an instance of the defined concept can have. In KL-ONE, number restrictions are limited to the natural numbers. The default MIN restriction for a concept is zero, and the default MAX restriction is infinity.
Thus, in the above example, the concept PersonWithOnlySisters has no upper bound on the number of siblings.

Figure 5. Number Restrictions

Consider the definition of an only child shown in Figure 5. This expresses the definition of OnlyChild as any child with no siblings. The following two classifiers would be generated for the number restriction:

nil-Sibling-nil-nil-nil, nil-OnlyChild-nil-nil-nil, PRE-MaxMessage => NUM-nil-MAX-nil-nil-0
nil-Sibling-nil-nil-nil, nil-OnlyChild-nil-nil-nil, PRE-MinMessage => NUM-nil-MIN-nil-nil-0

Querying The System

Four important KL-ONE constructs and their corresponding representations in classifiers have been described. These are: concept specialization, role attachment and differentiation, value restriction, and number restriction. Once a Classifier System representation for such a system has been proposed, it is necessary to show how such a representation could perform useful computations. In particular, it will be shown how the collection of classifiers that represent some network (as described above) can be queried to retrieve information about the network. An example of such a retrieval would be discovering all the inherited roles for some concept. In the context of the Classifier System, the only IO capability is through the global message list. The form of a query will therefore be message(s) added to the message list from some external source (a query processor) and the reply will likewise be some collection of messages that can be read from the message list after the Classifier System has iterated for some number of time steps.

As an example, consider the network shown in Figure 6 and suppose that one wanted to find all the inherited roles for the concept HighRiskDriver. First, one new classifier must be added to the rule set:

NET-nil, ~PRE-ClearMessage => NET-nil

This classifier allows network messages to stay on the message list until it is explicitly deactivated by a ClearMessage appearing on the message list.
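The behavior of this overhead classifier can be sketched directly. The matching function below is my own stand-in: it works over the simplified symbolic message notation rather than the real 32-character patterns, treating # (and a bare trailing field) as don't-care, and it applies the negated condition by checking that no message on the list matches it.

```python
# The keep-alive classifier: NET-nil, ~PRE-ClearMessage => NET-nil.
# It recopies every NET message each time step unless a ClearMessage
# is present, in which case the negative condition is violated and the
# classifier is turned off.

def matches(pattern, message):
    """Field-by-field match; '#' in the pattern is a don't-care field."""
    return all(p == "#" or p == m
               for p, m in zip(pattern.split("-"), message.split("-")))

def keep_alive(messages):
    """Messages the overhead classifier writes for the next time step."""
    if any(matches("PRE-ClearMessage", m) for m in messages):
        return []                         # negative condition violated
    return [m for m in messages if matches("NET-#", m)]
```

For example, `keep_alive(["NET-Person-SUPERC-UP"])` recopies the message, while adding "PRE-ClearMessage" to the list suppresses all recopying at the next step.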
The query would be performed in two stages. First, a message would be added to the message list that would find all the concepts that HighRiskDriver specializes (to locate all the concepts from which HighRiskDriver can inherit roles). This query takes two time steps. After the second time step (when the three concepts that HighRiskDriver specializes are on the message list), the second stage is initiated by adding the "Role" message to the message list. It is necessary at this point to ensure that the three current messages will not be rewritten at the next time step so that the role messages will not be confused with the concept messages. This is accomplished by adding the ClearMessage, which "turns off" the one overhead classifier. Both stages of the query are shown below.*

*The -> symbol indicates messages that are written to the message list from an external source.

Figure 6. Example KL-ONE Network

Time Step   Message List
T0          -> NET-HighRiskDriver-SUPERC-UP
T1          NET-HighRiskDriver-SUPERC-UP
            NET-Person-SUPERC-UP
T2          NET-HighRiskDriver-SUPERC-UP
            NET-Person-SUPERC-UP
            NET-Thing-SUPERC-UP
            -> PRE-RoleMessage
            -> PRE-ClearMessage
T3          NET-Sex-DIFF-UP
            NET-Age-DIFF-UP
            NET-Sex-DIFF-UP
            NET-Limb-DIFF-UP
T4          NET-Sex-DIFF-UP
            NET-Age-DIFF-UP
            NET-Limb-DIFF-UP

The query could be continued by adding more messages after time T4. For example, the VRMessage could be added (with the ClearMessage) to generate the value restrictions for all the roles on the list. This style of parallel graph search is one example of the kinds of retrievals that can be performed on a set of classifiers that represent an inheritance network. Other parallel operations include: boolean combinations of simple queries, limited numerical processing, and synchronization. An example of a query using boolean combinations would be to discover all the roles that two concepts have in common.
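The two-stage query can be simulated on a toy fragment of the Figure 6 network. The link tables below are invented stand-ins for the compiled classifiers, and the step function merges the keep-alive behavior with the firing of the SUPERC-UP classifiers; it is a sketch of the mechanism, not the implemented system.

```python
# Stage 1 climbs the SUPERC links from HighRiskDriver; stage 2 fires the
# role classifiers for every concept found, dropping the concept messages.

SUPERC_UP = {"HighRiskDriver": "Person", "Person": "Thing"}   # assumed links
ROLES = {"Person": ["Sex", "Age"], "Thing": ["Limb"]}         # assumed roles

def step_superc(messages):
    """One time step: keep-alive copying plus firing of SUPERC-UP links."""
    out = set(messages)                   # overhead classifier keeps all
    for m in messages:
        concept = m.split("-")[1]
        if concept in SUPERC_UP:
            out.add(f"NET-{SUPERC_UP[concept]}-SUPERC-UP")
    return out

# Stage 1: two time steps of upward traversal (T1 and T2 in the trace).
msgs = step_superc(step_superc({"NET-HighRiskDriver-SUPERC-UP"}))
# Stage 2: RoleMessage fires role classifiers; ClearMessage drops the
# concept messages, leaving only the inherited roles (as at T4).
roles = {f"NET-{r}-DIFF-UP"
         for m in msgs for r in ROLES.get(m.split("-")[1], [])}
print(sorted(roles))
```

The final set contains the Sex and Age roles inherited through Person and the Limb role inherited through Thing, matching the deduplicated list at T4.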
This is accomplished by determining the inherited roles for each of the two concepts and then taking their intersection. Queries about number restrictions involve some numerical processing. Finally, it is also possible to synchronize the progression of independent queries. For these three types of queries, additional overhead classifiers are required.

Discussion

The techniques discussed in the previous section have been implemented and fully described (Forrest, 85). These techniques are presented in the context of more complex KL-ONE operations such as classification and determination of subsumption. The implemented system (excluding the Classifier System simulation) is divided into four major parts: parser, classifier generator, symbol table manager, and external command processor. The parser takes KL-ONE definitions as input, checks their syntax, and enters all new terms (concepts or roles) into a symbol table. The classifier generator takes syntactically correct KL-ONE definitions as input and (using the symbol table) constructs the corresponding classifier representation of the KL-ONE expression. The parser and classifier generator together may be thought of as a two-pass compiler that takes as input KL-ONE network definitions and produces "code" (a set of classifiers) for the Classifier System. Additional classifiers that are independent of any given KL-ONE network (for example, the overhead classifier described in the previous section) are loaded into the list of network classifiers automatically. These include classifiers to perform boolean set operations, sorting, arithmetic operations, etc. The symbol table contains the specific bit patterns used to represent each term in a KL-ONE definition. One symbol table is needed for each KL-ONE network. Thus,
if new concepts are to be added to a network without recompilation, the symbol table must be preserved after "compilation." The external command processor runs the Classifier System, providing input (and reading output) from the "classifier program."

Several techniques for controlling the behavior of a Classifier System have been incorporated into the implementation. Tagging, in which one field of the classifier is used as a selector, is used to maintain groups of messages on the message list that are in distinct states. This allows the use of specific operators that are defined for particular states. This specificity also allows additional layers of parallelism to be added by processing more than one operation simultaneously. In these situations the messages for each operation are kept distinct on the global message list by the unique values of their tags.

Negative conditions activate and deactivate various subsystems of the Classifier System. Negative conditions are used to terminate computations and to explicitly change the state of a group of messages when a "trigger" message is added to the list. The trigger condition violates the negative condition and that classifier is effectively turned off.

Computations that proceed one bit at a time illustrate two techniques: (1) using control messages to sequence the processing of a computation, and (2) how to collect and combine information from independent messages into one message. Sequencing will always be useful when a computation is spread out over multiple time steps instead of being performed in one step. Collection is important because in the Classifier System it is easy to "parallelize" information from one message into many messages that can be operated on independently. This is most easily accomplished by having many classifiers that match the same message and operate on various fields within the message. The division of one message into its components takes one time step. However,
the recombination of the new components back into one message (for example, an answer) is more difficult. The collection process must either be conducted in a pairwise fashion or a huge number of classifiers must be employed. The computational tradeoff for n bits is 2^n classifiers (one for each combination of possible messages) in one time step versus n classifiers (one for each bit) that are sequenced for n time steps. Intermediate solutions are also possible.

Synchronization techniques allow one operation to be delayed until another operation has reached some specific stage. Then both operations can proceed independently until the next synchronization point. Synchronization can be achieved by combining tagging with negative conditions.

Conclusions

Classifier Systems are capable of representing complex high-level knowledge structures. This has been shown by choosing one example of a common knowledge representation paradigm (KL-ONE) and showing how it can be translated into a Classifier System rule set. In the translation process the Classifier System is viewed as a low-level target language into which KL-ONE constructs are mapped. The translation is described as compilation from high-level KL-ONE constructs into low-level classifiers. Since this study has not incorporated the bucket brigade learning algorithm, one obvious direction for future study is exploration of how many of the structures described here are learnable by the bucket brigade. This would test the efficacy of the learning algorithm and it would allow an investigation of whether the translations that I have developed are good ones or whether there are more natural ways to represent similar structures.
While the particular algorithms that I have developed might not emerge with learning, the general techniques could be expected to manifest themselves. It is possible that some of these structures are not required to build real-world models, but this seems unlikely based on the evidence of KL-ONE and some initial investigations with the bucket brigade. These structures are for computations that are useful in many domains and could be expected to play a role in most sophisticated models that are as powerful as KL-ONE. Since they are useful in KL-ONE, this suggests that they might be useful in other real-world models. A start has already been made in this direction. Goldberg [Goldberg, 83] and Holland [Holland, 85] have shown that the bucket brigade is capable of building up default hierarchies, using tags, using negative conditions as triggers, and limited sequencing (chaining). In addition, I would look for synchronization, more sophisticated uses of tags, more extensive sequencing, and, in the context of knowledge representation, the formation of roles. Roles are more complex than "properties" for two reasons. First, they are two-place relations rather than one-place predicates, and second, relations between roles (DIFF links) are well defined. Of the other structures, it is possible that some are so central to every representation system that they should be "bootstrapped" into a learning system. That is, they should be provided from the beginning as a "macro" package and not required to be learned from the beginning every time.

References

Booker, Lashon (1982), "Intelligent Behavior as an Adaptation to the Task Environment," Ph.D. Dissertation (Computer and Communication Sciences), The University of Michigan, Ann Arbor, Michigan.

Brachman, Ronald J. (1978), "A Structural Paradigm for Representing Knowledge," Technical Report No. 3605, Bolt Beranek and Newman Inc., Cambridge, Ma.
Brachman, Ronald J. and Schmolze, James G. (1985), "An Overview of the KL-ONE Knowledge Representation System," Cognitive Science, Vol. 9, No. 2.

Fahlman, Scott E. (1979), NETL: A System for Representing and Using Real-World Knowledge, The MIT Press, Cambridge, Ma.

Forrest, Stephanie (1985), "A Study of Parallelism in The Classifier System and Its Application to Classification in KL-ONE Semantic Networks," Ph.D. Dissertation (Computer and Communication Sciences), The University of Michigan, Ann Arbor, Mi.

Goldberg, David (1983), Ph.D. Dissertation, The University of Michigan, Ann Arbor, Mi.

Holland, John H. (1975), Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, Mi.

Holland, John H. (1980), "Adaptive Algorithms for Discovering and Using General Patterns in Growing Knowledge Bases," International Journal of Policy Analysis and Information Systems, Vol. 4, No. 3.

Holland, John H. (1985), Personal Communication.

Lipkis, Thomas (1981), "A KL-ONE Classifier," Consul Note #5, USC/Information Sciences Institute, Marina del Rey, Ca.

Schmolze, James G. and Brachman, Ronald J. (1982) (editors), "Proceedings of the 1981 KL-ONE Workshop," Technical Report No. 4842, Bolt Beranek and Newman Inc., Cambridge, Ma.

Schmolze, James G. and Israel, David (1983), "KL-ONE: Semantics and Classification," in Sidner, C., et al. (editors), Technical Report No. 5421, Bolt Beranek and Newman Inc., Cambridge, Ma., pp. 27-39.

The Bucket Brigade is not Genetic

T. H. WESTERDALE

Abstract -- Unlike genetic reward schemes, bucket brigade schemes are subgoal reward schemes. Genetic schemes operating in parallel are here compared with a sequentially operating bucket brigade scheme. Sequential genetic schemes and parallel bucket brigade schemes are also examined in order to highlight the non-genetic nature of the bucket brigade.

I. INTRODUCTION

The Bucket Brigade can be viewed as a class of apportionment of credit schemes for production systems.
There is an essentially different class of schemes which we call genetic. Bucket Brigade schemes are subgoal reward schemes. Genetic schemes are not.

For concreteness, let us suppose the environment of each production system is a finite automaton, whose outputs are non-negative real numbers called payoffs. (To simplify our discussion, we are excluding negative payoff, but most of our conclusions will hold for negative payoff as well.) Each production's left hand side is a subset of the environment state set and each production's right hand side is a member of the environment's input alphabet. Associated with each production is a positive real number called that production's availability.

Probabilistic sequential selection systems are systems in which the following four steps take place each time unit: (1) The state of the environment is examined and those productions whose left hand sides contain this state form the eligibility set. (2) A member of the eligibility set is selected, probabilistically, each production in the set being selected with probability proportional to its availability. (3) This production then fires, which means merely that its right hand side is input into the environment, causing an environment state transition and an output of payoff. (4) A reward scheme (or apportionment of credit scheme) examines the payoff and on its basis adjusts the availabilities of the various productions. Thus the availabilities are real numbers which are being continually changed by the reward scheme. Probabilistic sequential selection systems differ from one another in their differing reward schemes.

We assume that for any ordered pair of environment states there is a sequence of productions which will take us from the first state to the second. The average payoff per unit time is a reasonable measure of how well the system is doing.
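The four steps above can be sketched as one time unit of such a system. The environment step function and the additive reward rule below are toy stand-ins invented for illustration; the paper deliberately leaves the environment and the reward scheme abstract.

```python
import random
from collections import namedtuple

# lhs: set of environment states; rhs: an input symbol for the environment
Production = namedtuple("Production", "lhs rhs")

def one_time_unit(productions, availability, state, env_step, reward):
    # (1) eligibility set: productions whose LHS contains the current state
    eligible = [p for p in productions if state in p.lhs]
    # (2) select one, with probability proportional to availability
    weights = [availability[p] for p in eligible]
    p = random.choices(eligible, weights=weights)[0]
    # (3) fire: input the RHS into the environment; observe payoff
    state, payoff = env_step(state, p.rhs)
    # (4) reward scheme adjusts availabilities on the basis of payoff
    reward(availability, p, payoff)
    return state
```

For example, with a two-state environment whose transition always flips the state and pays 1.0, and a reward rule that simply adds the payoff to the firing production's availability, one call advances the state and raises that availability by 1.0.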
If the availabilities are held fixed, the system-environment complex becomes a finite state Markov chain, and the average payoff per unit time (at equilibrium) is formally defined in the obvious way. As the availabilities change, the average payoff per unit time changes. Thus the average payoff per unit time can be thought of as a function of the availabilities. The object of the reward scheme is to change the availabilities so as to increase the average payoff per unit time.

The systems above have been simplified so as to more easily illustrate the points we wish to make. In any useful system the environment would output other symbols in addition to payoff, symbols which we could call ordinary output symbols. The left hand sides of the productions would then be sets of ordinary output symbols. A useful system would also contain some working memory (a "blackboard" or "message list") which could be examined and altered by the productions. In the above systems the working memory is regarded as part of the environment and instead of sets of output symbols we have sets of (Moore type) automaton states which produce those symbols. For illustrative purposes we have simplified the system by removing various parts and leaving only those parts on which the reward scheme operates.

In our systems the set of productions is fixed. We want to study the reward scheme, and allowing generation of new productions from old ones (e.g. [4]) will merely distract us.

II. GENETIC SYSTEMS WITH COMPLETE RECOMBINATION

At any given time, the production system can be thought of as a population of productions, the availability of a production giving the number of copies of that production in the population, or some fixed multiple of the number of copies. Thus the process of probabilistic selection of the production to fire can be thought of as randomly drawing productions from the population, until one is drawn that is in the eligibility set.
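The equivalence claimed in the last sentence, that rejection sampling from the population selects an eligible production with probability proportional to its availability, can be checked empirically. This is an illustrative sketch with invented production names and availability counts:

```python
import random
from collections import Counter

def draw_until_eligible(population, eligible):
    """Draw productions at random until one lies in the eligibility set."""
    while True:
        p = random.choice(population)
        if p in eligible:
            return p

random.seed(0)
# A population holding availabilities 3 : 6 : 1 as explicit copy counts.
population = ["a"] * 30 + ["b"] * 60 + ["c"] * 10
counts = Counter(draw_until_eligible(population, {"a", "b"})
                 for _ in range(9000))
# Within the eligibility set {a, b}, selections should approach the 3:6
# availability ratio; the ineligible production c is never selected.
print(counts["c"], round(counts["b"] / counts["a"], 2))
```

The ineligible production never fires, and among the eligible ones the draw frequencies track the availability ratio, which is exactly step (2) of the sequential selection system.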
In some systems the population is held explicitly and the availabilities are implicit, whereas in others the availabilities are held explicitly and the population is implicit. If the system is to be viewed as a population of productions, then of course after each production is successfully selected from the population it is tested on the environment with the environment in the state in which the previously selected production left it.

It is easier to analyse systems in which the result of a test of a production is independent of which productions were tested previously. Such systems are usually unrealistic, but if the system is viewed as a population of production strings, rather than of individual productions, then it is often realistic to view the test of a string as being independent of which strings were tested previously. Let us look at a population system of this kind. The system will consist of a population of production strings. The population will change over time. Time is viewed as divided into large units called generations. During a generation, every string in the population is tested against the environment and, as a result of the tests, the reward scheme determines the composition of the population in the next generation. A system of this kind we call a string population system.

Let's examine such a system and give its reward scheme in detail. We shall call the system, System A. System A is a genetic system with complete recombination. Begin with a set of productions, each with an availability. Let n and N be large integers with n much larger than N. The set of availabilities defines a population of length-n strings of productions (possibly with repeats) as follows. Let v be the sum of all the availabilities. For any length-n string, the number of copies of that string in the population is proportional to v^-n times the product of the availabilities of its constituent productions.
In each generation the number of progeny of each string is given by testing the string and summing the payoff obtained during the test. To test a string one selects the first production in the string that is in the eligibility set, fires it, then moves on down the string until one finds the next production in the string that is now in the eligibility set, fires it, etc., until N productions have fired. We will not worry here about the few cases where one gets to the end of the string before N productions have fired. We are assuming that during a generation, every string is tested against the environment. We are also assuming that there is an "initial state" of the environment and that when each string is tested the test always begins with the environment in the initial state, so that the results of a string test are independent of which strings were previously tested.

The formation of progeny is followed by complete recombination. In other words, each production's availability is incremented by the number of times that production occurs in the new progeny, and the next generation's population is formed from the availabilities just as the previous generation's population was. (In effect, the strings are broken into individual productions and these productions then re-combine at random to form a new population of length-n strings.)

We could have demanded that each string test begin with the environment in the state in which the last string left it, but if N and n are large then this demand will make hardly any difference to the test results. This is because the environment "forgets" what state it started in during a long test. For example, suppose there is one production whose left hand side is the set of all environment states and whose right hand side is a symbol which resets the environment to one particular state. Let's call this production the resetting production.
Then during any string test, once the resetting production is encountered, the payoff for the rest of the test and the successive eligibility sets are independent of the state the environment was in when the test started. Thus each string has a value independent of which strings were tested previously, except for a usually small amount of payoff at the start of the test before the first occurrence of the resetting production. One can generalize these comments usefully to the case where there is no resetting production [6], but we will not do so formally here. The important thing to note is that except for a usually small initial segment, the sequence of successive eligibility sets would be independent of which strings were tested previously (provided n and N are large enough). Thus we do not lose anything important if we assume that each test begins with the environment in some initial state. So we can think of the tests in a generation as taking place sequentially or in parallel; it makes no difference.

Let the value of a string be the sum of the payoffs when the string is tested with the environment begun in the initial state. If there are x copies of a string in the population, and if the value of the string is y, then the number of progeny of the string will be xy. If r is a production which occurs z times in the string, then zxy will be the contribution of the progeny of the string to the increase in the availability of r. This is obvious, and we have only re-stated matters in this way to make it clear that we need not insist that x, y, and the availabilities are integers. The formalism makes perfect sense provided they are non-negative real numbers. If the value of a string is 0.038 then every copy of it will have 0.038 progeny. (But remember, we insist that availabilities, and hence x, are actually positive.)

Note that the behavior of System A can be thought of as a sequence of availability tuples. In any given generation the population composition is given by the availabilities.
Just as in the probabilistic sequential selection systems, the availabilities determine the average payoff per unit time (averaged over the tests of all the strings in the generation).

System A is deterministic. Given a tuple of availabilities it is completely determined what the next tuple of availabilities (in the next generation) will be. We will call two string population systems equivalent if they produce the same change in the availabilities, that is, if given any tuple of availabilities, the next tuple of availabilities will be the same whichever system we are examining. Actually we need a weaker notion of equivalence. We will also call two systems equivalent in several other circumstances. We will describe these circumstances informally, but will not give here a rigorous definition of equivalence.

Let the set of all possible tuples of availabilities be regarded as a subset of Euclidean space in the usual way. To each point in the subset corresponds an average payoff per unit time. System A defines for each point in the subset a vector giving the change in availabilities which its scheme would produce. Two systems are equivalent if at every point the change vector is the same for the two systems and the average payoff is also the same. We also call two systems equivalent if there is a positive scalar k such that at each point (1) the average payoff for the second system is k times that of the first, and (2) the change vector of the two systems aims in the same direction. So a system which was like System A but whose reward scheme always gave just half as many progeny would be equivalent to System A.
If we define normalizing a vector as dividing it by the sum of its components, then condition (2) becomes "the normalized change vector of the two systems is the same."

For completeness I must mention a complication which will not be important in our discussion. We need to loosen condition (2) by normalizing the points in the space themselves. Normalizing a point in the space projects it onto the normalized hyperplane. (Its components can then be thought of as probabilities, and it is of course these probabilities that we are really interested in.) If we take a change vector at a point, and think of the change vector as an arrow with its tail at that point, then we can normalize the point where its tail is and also normalize the point where its head is. The arrow between the two normalized points is a projection of the change vector onto the normalized hyperplane. We want condition (2) to say "the projected change vector of the two systems aims in the same direction", or "the normalized projected change vector of the two systems is the same". Sorry about this complication. It does make sense, but the details will not be important in our discussion.

Of course many schemes are probabilistic. Consider a system (System B) just like System A except that in each generation, instead of its reward scheme giving progeny to all strings in the population, the reward scheme randomly selects just one string and gives only that string progeny (the same number of progeny System A would give it). Now the change in availabilities is probabilistic. At each point there are many possible change vectors, depending on which string is selected. When a system produces many possible change vectors at a point, we simply average them, weighting each possible change vector with the probability that it would represent the change. It is the average change vector that we then use in deciding system equivalence (or rather, the normalized projected average change vector).
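The normalization and projection just described can be sketched numerically (the function names are mine): normalizing divides a tuple by the sum of its components, and the projected change vector at a point is the arrow between the normalized tail and the normalized head of the change arrow.

```python
def normalize(v):
    """Divide a tuple by the sum of its components (project onto the
    normalized hyperplane)."""
    s = sum(v)
    return [x / s for x in v]

def projected_change(point, change):
    """Projection of the change vector onto the normalized hyperplane:
    normalized head minus normalized tail."""
    head = [p + c for p, c in zip(point, change)]
    return [h - t for h, t in zip(normalize(head), normalize(point))]
```

Note that scaling every availability by the same positive constant leaves the normalized point unchanged, which is why equivalence is judged on normalized projected vectors.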
We call a scheme noisier the more the possible change vectors at a point differ from each other. So System B is equivalent to System A, though System B is much noisier.

Fisher's fundamental theorem of natural selection [1] [2] applies to Systems A and B, so we know that for these systems the expectation of the change in the average payoff per unit time is non-negative. We call a system with this property safe. A safe system, then, is one in which at every point, the average change vector aims in a direction of non-decreasing average payoff. Clearly then, a system that is equivalent to a safe system is also safe.

Consider a system like System A except that the initial state (the state in which all string tests begin) is different from the initial state in System A. Technically this new system would not be equivalent to System A, but if n and N are large enough it is nearly equivalent. In deciding system equivalence we will assume n and N are large enough. More precisely, we note that as n and N increase, a system's normalized projected average change vectors gradually change. At any point, the normalized projected average change vector approaches a limit vector as n and N approach infinity. It is this limit vector that we use as our normalized projected average change vector in deciding system equivalence. Thus the change in initial state produces a new system that is equivalent to System A. In fact, a system like A or B which begins each string test with the environment in the state the last string test left it is a system equivalent to A and B.

In all the systems discussed in this paper, a tuple of availabilities defines an average payoff per unit time, and the reward scheme defines, for each such tuple, an average change vector. This is true also in the probabilistic sequential selection systems. Thus we can compare any two of our systems and ask whether they are equivalent.
We ask if there is a reward scheme for a probabilistic sequential selection system that makes the system equivalent to Systems A and B. The natural candidate is System C, defined by the following reward scheme: reward every N productions which fire by incrementing the availabilities of these N productions by the sum of the payoffs over these N firings. But System C is not equivalent to System A. In the System A string tests, productions are skipped when they are not in the eligibility set. System A rewards these (increments their availabilities) whereas System C does not. To make C equivalent to A we must do something about rewarding the productions that are not in the eligibility set.

Equivalently, we can instead penalize the various productions that are in the eligibility set. (See [5] for the formal details of the argument in the remainder of this section, including the effect of increasing string length.) The idea is that whenever production r is rewarded (has its availability incremented), the eligibility set R at the time r fired is penalized as follows. Let S be the sum of all availabilities and R' the sum of the availabilities of the productions in R. The absolute probability of r is the availability of r divided by S. The probability of r relative to R is the availability of r divided by R'. If the reward is x, the availability of r is first increased by x. Then the availabilities of all members of R are adjusted to bring R' back down to what it was before the reward. The adjustment is done proportionally: i.e. the adjustments do not change the probabilities, relative to R, of the members of R. We call these adjustments penalties since they penalize a production for being eligible. Let System C' be System C with this penalty scheme added. Then System C' is equivalent to Systems A and B.

In fact we can easily make this penalty scheme more sensible if we reward every time unit.
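The reward-plus-penalty step of System C' can be sketched as follows, under assumed data structures (availabilities as a dict, the eligibility set R as a set of production names): production r receives reward x, then every member of R is scaled proportionally so that the sum of R's availabilities returns to its pre-reward value.

```python
def reward_and_penalize(avail, r, R, x):
    """Reward production r with x, then proportionally penalize the
    eligibility set R so its total availability is unchanged; relative
    probabilities within R are preserved."""
    before = sum(avail[p] for p in R)
    avail[r] += x                      # reward the fired production
    after = sum(avail[p] for p in R)
    for p in R:
        avail[p] *= before / after     # proportional penalty
```

Productions outside R are untouched, so the eligibility set as a whole is held level while r gains within it, mirroring the effect of rewarding the skipped productions in System A.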
The payoff in a time unit becomes the reward of the last N productions that fired (with corresponding penalties for the eligibility sets). This gives an equivalent, but more sensible scheme. More sensibly, we can use an exponential weighting function, so that the reward of the production that fired z time units ago is c^z times the payoff (c is a constant and 0 < c < 1) … since System A is. Unfortunately a system using a bucket brigade scheme will not in general be safe, and it will not be equivalent to System A. Since System D is equivalent to the genetic Systems A and B, we can call D also a genetic system. (Fisher's theorem says that a genetic system must be safe.) We can call the reward scheme of System D a genetic scheme for a probabilistic sequential selection system.

III. THE BUCKET BRIGADE

Genetic schemes like the scheme of System D form one class of reward schemes for probabilistic sequential selection systems. Another class is the class of bucket brigade schemes. We shall examine the following bucket brigade scheme. Let c and k be constants, 0 < …

[Figure 5. The solution to the parity problem: a list of the evolved rules.]

…this occurs, LS-2 quickly begins evolving individuals with no rules that fire at all. Doing nothing at least scores zero, which is better than being punished. A balance of reward and punishment which will be maintained as tasks increase in complexity is needed so as to avoid the GA's ability to quickly exploit this weakness in the critic function.

The next critic employed a computational scheme based on that used on the Scholastic Aptitude Test and so was called SAT scoring. The main idea in the scoring of multiple choice tests is that indiscriminate guessing should have an expectation of zero, but that if a student can eliminate some of the choices on a question, then he should be encouraged to guess by having the expected score increase as the range of guessing decreases.
For the SAT, this is achieved by subtracting from the number of correct answers the number of wrong answers weighted by the inverse of the number of choices minus one. This gives an expectation which varies from zero for wild guessing to the maximum score for no guessing. For LS-2 a slightly different expectation was thought appropriate. Wild guessing was deemed better than doing nothing because this at least would give the GA some active rules to deal with. So the designed expectation was that wild guessing (e.g. calling every case the same class) should score half of the maximum.

At this point in the experimentation, an effort was also initiated to learn about the sensitivity of LS-2 to changes in four of its main parameters: population size and the crossover, mutation and inversion rates. All experiments reported so far used a population size of 30 per dimension of the performance vector, a crossover rate of .95, a mutation rate of .01 and an inversion rate of .25. The first three values were suggested by Grefenstette [8] and the inversion rate by Smith [13]. Limited resources prevented the best approach, which would have been the meta-GA approach of Grefenstette, so different settings were produced by increasing the population size in steps of 10 per dimension and simultaneously reducing the rates more or less in unison. This process was continued until the mean evaluations-to-solution stopped improving. Means were computed for three runs at each setting with different random seeds.

The SAT critic has the same expectation as the previous critic for 2-class problems with balanced training, so this task was not repeated. A 3-class subproblem was solved in 6921 evaluations, a 77% improvement over the original critic. A 4-class subproblem was solved in 26591 evaluations. Both of these results represented a best parameter setting of 40, .90, .005 and .20 for population size per dimension, crossover, mutation and inversion rates respectively.
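The SAT-style scoring rule just described can be sketched directly (the function name is mine, not from the paper): subtracting wrong answers weighted by 1/(choices − 1) gives blind guessing an expectation of zero.

```python
def sat_score(correct, wrong, num_choices):
    """SAT-style score: correct answers minus wrong answers weighted by
    1/(num_choices - 1).  Pure guessing on k questions with c choices
    expects k/c right and k(c-1)/c wrong, which cancel to zero."""
    return correct - wrong / (num_choices - 1)
```

For example, guessing blindly on 12 four-choice questions expects 3 right and 9 wrong, so the expected score is 3 − 9/3 = 0, while 10 right with no wrong scores the full 10.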
One final improvement was made in LS-2, this time to the conflict resolution. In LS-1, Smith had not permitted conflict resolution to consider the noop action so long as a "real" action were suggested. In all the LS-2 experiments so far, noop competed equally with the "real" actions. The argument for this was that for some task environments, doing nothing, or continuing to think (cycle), was a decision, and that if the environment were dynamic, then this might well affect performance. However, some counter-arguments can also be made. The pattern discrimination tasks considered so far are not dynamic; the patterns don't change while LS-2 is trying to decide. Also, this strategy allows for some stochastic effect to remain in the critic-reported values. By deciding to cycle again when a "real" action had been suggested, LS-2 postponed the computation of the credit in a non-deterministic way. The critic was only permitted to evaluate the suggested action array on the final cycle. I would now argue that if the task environment is dynamic, and a do-nothing action should be considered, then it should be explicitly included as one of the "real" actions. Noop should not be considered a do-nothing action. With this final improvement, LS-2 solved the 3-, 4- and full 5-class problems in 5647, 15938 and 44509 evaluations respectively. The effect of these improvements in LS-2 is illustrated in Figure 6.

[Figure 6. Improvements in LS-2 with changing critic: mean evaluations-to-solution versus number of classes.]

The major finding of this research was that vector feedback is essential to multiclass discriminant learning. Vector selection provides the necessary protection against unfair competition while simultaneously providing the proper pressure for the evolution of the utopian individual capable of high performance on all facets of the task.
Secondary to this major finding are a number of observations which may contribute to a better understanding of GA's and how to effectively utilize them. The solution of the parity problem clearly demonstrates LS-2's ability to learn non-linear discrimination. Ternary coding of KS-1 was inferior to binary coding, even with the redundancy inherent in the binary coding scheme. A search for coding schemes which are binary and yet avoid this redundancy might pay handsome dividends.

Grefenstette's finding [8] that genetic search may be very efficient with smaller populations and higher mixing rates than previous wisdom suggested seems generally to have been confirmed. Populations of 40 per dimension of performance with crossover rates of .7 to .9, mutation rates of .001 to .01 and inversion rates of .1 to .2 provided the best performance on the problems studied here. It should be noted, however, that the search was limited and began with Grefenstette's solution.

As Smith observed, the critic is critical. The GA is capable of exploiting the properties of its critic, and so good performance was only achieved when reward and punishment were carefully balanced. The application of punishment to a performance vector has raised a question which did not occur with scalar performance systems. There are two places where this punishment may be applied. Suppose that a PS program incorrectly classifies a class 1 case as class 2. By applying the punishment to the class 1 slot of the performance vector, one is punishing the failure to do the right thing. By applying it to the class 2 slot, one is punishing the program for doing the wrong thing. It is unknown which strategy, or both, leads to faster learning. The experiments reported here applied the punishment to the slot corresponding to the case to be classified, thus always punishing the failure to do the right thing.
Other approaches might be profitably studied.

The task-independent measures proposed by Smith did not seem to be sufficiently closely associated with good performance to warrant their use. However, his strategy of disallowing noop actions to compete in conflict resolution was superior to allowing it.

A final observation is in order on the original question of using a GA for intelligent signal classification. The strategy used in LS-2 seems to be promising, but requires that a prior decision be made on the length and sampling rate for the signal. The patterns must be "frozen" so that the system can examine them. This feature seems to impose undesirable limitations. A more dynamic method of examining the signal, bit by bit, and only reporting a decision when enough information has been acquired to do so with confidence, seems to offer a more robust approach.

REFERENCES

1. A.B. Bekey, C. Chang, J. Perry, and H.M. Hoffer, "Pattern recognition of multiple EMG signals applied to the description of human gait," Proceedings of the IEEE, Vol. 65, No. 5, May 1977.

2. J.R. Bourne, V. Jagannathan, B. Hamel, B.H. Jansen, J.W. Ward, J.R. Hughes and C.W. Ervin, "Evaluation of a syntactic pattern recognition approach to quantitative electroencephalographic analysis," Electroencephalography & Clinical Neurophysiology, 52:57-64, 1981.

3. A. Brindle, Genetic algorithms for function optimization, Ph.D. Dissertation, University of Alberta, Edmonton, Alberta, Canada, 1975.

4. B.A. Giese, J.R. Bourne and J. Ward, "Syntactic analysis of the electroencephalogram," IEEE Trans. Systems, Man and Cybernetics, Vol. SMC-9, No. 8, Aug 1979.

5. V. Jagannathan, An artificial intelligence approach to computerized electroencephalogram analysis, Ph.D. Dissertation, Vanderbilt University, Nashville, Tennessee, 1981.

6. Kenneth DeJong, Analysis of the behavior of a class of genetic adaptive systems, Ph.D. Dissertation, University of Michigan, Ann Arbor, 1975.

7.
Kenneth DeJong, "Adaptive system design: a genetic approach," IEEE Trans. Systems, Man and Cybernetics, Vol. SMC-10, No. 9, Sept 1980.

8. John J. Grefenstette, "Genetic algorithms for multilevel adaptive systems," IEEE Trans. Systems, Man and Cybernetics, in press.

9. John H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, Michigan, 1975.

10. J.H. Holland and J.S. Reitman, "Cognitive systems based on adaptive algorithms," in Pattern-Directed Inference Systems, Waterman and Hayes-Roth (Eds.), Academic Press, 1978.

11. R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning, Tioga Publishing Co., Palo Alto, California, 1983.

12. R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Proceedings of the International Machine Learning Workshop, University of Illinois, Urbana-Champaign, Illinois, 1983.

13. S.F. Smith, A learning system based on genetic adaptive algorithms, Ph.D. Dissertation, University of Pittsburgh, 1980.

IMPROVING THE PERFORMANCE OF GENETIC ALGORITHMS IN CLASSIFIER SYSTEMS

Lashon B. Booker
Navy Center for Applied Research in AI
Naval Research Laboratory, Code 7510
Washington, D.C. 20375

ABSTRACT

Classifier systems must continuously infer useful categories and other generalizations — in the form of classifier taxa — from the steady stream of messages received and transmitted. This paper describes ways to use the genetic algorithm more effectively in discovering such patterns. Two issues are addressed. First, a flexible criterion is advocated for deciding when a message matches a classifier taxon. This is shown to improve performance over a wide range of categorization problems. Second, a restricted mating policy and crowding algorithm are introduced. These modifications lead to the growth and dynamic management of subpopulations correlated with the various pattern categories in the environment.
INTRODUCTION

A classifier system is a special kind of production system designed to permit non-trivial modifications and reorganizations of its rules as it performs a task [Holland, 1976]. Classifier systems process binary messages. Each rule or classifier is a fixed length string whose activating condition, called a taxon, is a string in the alphabet {0,1,#}. The differences between classifier systems and more conventional production systems are discussed by Booker [1982] and Holland [1983].

One of the most important qualities of classifier systems as a computational paradigm is their flexibility under changing environmental conditions [Holland, 1983]. This is the major reason why these systems are being applied to dynamic, real-world problems like the control of combat systems [Kuchinski, 1985] and gas pipelines [Goldberg, 1983]. Conventional rule-based systems are brittle in the sense that they function so poorly, if at all, when the domain or underlying model changes slightly. Several factors work together to enable classifier systems to avoid this kind of brittleness: parallelism, categorization, active competition of alternative hypotheses, system elements constructed from "building blocks", etc. Perhaps the most important factor is the direct and computationally efficient implementation of categorization. Holland [1983, p.92] points out that categorization is the system's sine qua non for combating the environment's perceptual novelty.

Classifier systems must continuously infer useful categories and other generalizations — in the form of taxa — from the steady stream of messages received and transmitted. This approach to pattern-directed inference poses several difficulties. For example, the number of categories needed to function in a task environment is usually not known in advance. The system must therefore dynamically manage its limited classifier memory so that, as a whole, it accounts for all the important pattern classes.
Moreover, since the categories created depend on which messages are compared, the system must also determine which messages should be clustered into a category.

The fundamental inference procedure for addressing these issues is the genetic algorithm [Holland, 1975]. While genetic algorithms have been analyzed and empirically tested for years [DeJong, 1975; Bethke, 1981], most of the knowledge about how to implement them has come from applications in function optimization. There has been little work done to determine the best implementation for the problems faced by a classifier system. This paper begins to formulate such an understanding with respect to categorization. In particular, two questions related to genetic algorithms and classifier systems are examined: (1) What kinds of performance measures provide the most informative ranking of classifier taxa, allowing the genetic algorithm to efficiently discover useful patterns? (2) How can a population of classifier taxa be dynamically partitioned into distinguishable, specialized subpopulations correlated with the set of categories in the message environment? Finding answers to these and related questions is an important step toward improving the categorization abilities of classifier systems and expanding the repertoire of problems these systems can be used to solve.

THE CATEGORIZATION PROBLEM

In order to formulate these issues more precisely, we begin by specifying a class of categorization problems. Subsequently, a criterion is given for evaluating various solutions to one of these problems.

Defining Message Categories

Hayes-Roth [1973] defines a "schematic" approach to characterizing pattern categories that has proven useful in building test-bed environments for classifier systems [Booker, 1982].
This approach assumes, in the simplest case, that each pattern category can be defined by a single structural prototype or characteristic. Each such characteristic is a schema designating a set of feature values required for category membership. Unspecified values are assumed to be irrelevant for determining membership. The obvious generalization of using just one characteristic to define a category is to permit several characteristics to define a category disjunctively. Pattern generators based on the schematic approach generate exemplars by assigning the mandatory combinations given by one or more of the pattern characteristics and producing irrelevant feature values probabilistically. In this way, each exemplar of a category manifests at least one of the defining characteristics. The categorization problem can be very difficult under the schematic approach since any given item can instantiate the characteristics of several alternative categories.

Classifiers receive, process, and transmit binary message strings. We define a category of binary strings by specifying a set of pattern characteristics. Each characteristic is a string in the alphabet {1,0,*} where the * is a place holder for irrelevant features. A characteristic is a template for generating binary strings in the sense that the 1 and 0 indicate mandatory values and the * indicates values to be generated at random. Thus the characteristic 1*0* generates the four strings 1000, 1001, 1100, and 1101. When more than one characteristic is associated with a category, one is selected at random to generate an exemplar. The correspondence between the syntax of a taxon and the designation of pattern characteristics is obvious. The class of pattern categories defined in this manner therefore spans the full range of categorization problems solvable with a set of taxa.
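A schematic pattern generator of this kind is straightforward to sketch (the function name is mine, not from the paper): mandatory 0/1 values are copied from the characteristic and each * position is filled with a random bit, so every exemplar manifests the characteristic.

```python
import random

def generate_exemplar(characteristic, rng=random):
    """Generate one binary exemplar from a {1,0,*} characteristic:
    1 and 0 are mandatory, * is replaced by a random bit."""
    return "".join(c if c in "01" else rng.choice("01")
                   for c in characteristic)
```

For example, generate_exemplar("1*0*") always returns one of the four strings 1000, 1001, 1100, and 1101 listed above; a category with several characteristics would first pick one of them at random.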
An Evaluation Criterion

A set of taxa is a solution to a categorization problem if it corresponds directly with the set of characteristics defining the category. In this sense, the set of taxa models the structure of the category. One way to evaluate how closely a set of taxa models a set of characteristics is to define what an "ideal" model would look like, then measure the discrepancy between the model given by the set of taxa and that ideal.

More specifically, the structure of a pattern category is given by its set of characteristics. We first consider the case involving only one characteristic. As the genetic algorithm searches the space of taxa, the collection of alleles and schemata in the population becomes increasingly less diverse. Eventually, the best schema and its associated alleles will dominate the population in the sense that alternatives will be present only in proportions roughly determined by the mutation rate. A population with this property will be called a perfect model of the category. The taxon which corresponds exactly with the characteristic will be called the perfect taxon.

One way to describe the perfect model quantitatively is in terms of the probability of occurrence for the perfect taxon. An exact value for this probability is difficult to compute, but for our purposes it can be approximated by the "steady state" probability¹

    P'(ξ) = ∏_j P'(ξ_j) ,

where P'(ξ_j) is the proportion of the allele occurring at the jth position of the perfect taxon ξ. In the ideal case, if μ is the mutation rate, what we want is P'(ξ_j) = 1 − μ for the alleles of ξ. In order to measure the discrepancy between an arbitrary population and the perfect model, we can use the following metric:

    G = P'(ξ) log( P'(ξ) / P(ξ) ) + (1 − P'(ξ)) log( (1 − P'(ξ)) / (1 − P(ξ)) ) ,

where P'(ξ) is the ideal probability of occurrence for ξ and P(ξ) is ξ's probability of occurrence in the current population.

¹ The probability of occurrence under repeated crossover with random pairing, in the absence of other operators.
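The G metric just defined can be computed directly; a minimal sketch (the function and argument names are mine), taking the ideal probability of the perfect taxon and its observed probability in the current population:

```python
import math

def g_metric(p_ideal, p_obs):
    """Directed divergence between the ideal probability of the perfect
    taxon (roughly (1 - mu)^L for mutation rate mu over L positions) and
    its observed probability in the current population."""
    return (p_ideal * math.log(p_ideal / p_obs)
            + (1 - p_ideal) * math.log((1 - p_ideal) / (1 - p_obs)))
```

G is non-negative and is zero exactly when the observed probability equals the ideal one, so smaller values mean a closer model of the characteristic.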
This information-theoretic measure is called the directed divergence between the two probability distributions [Kullback, 1959]. G is a non-negative quantity that approaches zero as the "resemblance" between P and P' increases. The G metric has proven useful in evaluating other systems that generate stochastic models of their environment (e.g., Hinton et al. [1984]).

When a pattern category is defined by more than one characteristic, we can use the G metric to evaluate the population's model of each characteristic separately. This involves identifying the subset of the population involved in modeling each characteristic and treating each subset as a separate entity for the purpose of making measurements. A method for identifying these subsets will be discussed shortly.

MEASURES FOR RANKING TAXA

Given a class of categorization problems to be solved, and a criterion for evaluating solutions, we are now ready to examine the performance of the genetic algorithm. The starting point will be the measures used to rank taxa. Only if the taxa are usefully ranked can the genetic algorithm, or any learning heuristic, have hope of inferring the best taxon. In this section we first point out some deficiencies in the most often used measure; then, alternative measures are considered and shown to provide significantly better performance.

Brittleness and Match Scores

The first step in the execution cycle of every classifier system is a determination of which classifiers are relevant to the current set of messages. Most implementations make this determination using the straightforward matching criterion first proposed by Holland and Reitman [1978]. More specifically, if M = m_1 m_2 ... m_k, m_j ∈ {0,1}, is a message and C = c_1 c_2 ... c_k, c_j ∈ {0,1,#}, is a classifier taxon, then the message M satisfies or matches C if and only if m_j = c_j wherever c_j is 0 or 1. When c_j = #, the value of m_j does not matter. Every classifier matched by a message is deemed relevant.
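The matching criterion just stated, together with the simple specificity count used as a match score (the M1 score the next paragraphs describe), can be sketched as follows (function names are mine):

```python
def matches(message, taxon):
    """A message matches a taxon iff they agree at every non-# position."""
    return all(t == "#" or m == t for m, t in zip(message, taxon))

def m1(message, taxon):
    """Simple match score: number of non-# positions when the message
    matches the taxon, zero otherwise."""
    return sum(t != "#" for t in taxon) if matches(message, taxon) else 0
```

For example, the taxon 1#0# is matched by 1001 (score 2, its two specified positions) but not by 1111 (score 0).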
Relevant classifiers are ranked according to the specificity of their taxa, where specificity is proportional to the number of non-#'s in the taxon. Holland and Reitman used a simple match score to measure relevance. The score is zero if the message does not match the taxon; otherwise it is equal to the number of non-# positions in the taxon. This simple match score — hereafter called M1 — effectively guides the genetic algorithm in its search of relevant taxa.

Because all non-relevant taxa are assigned a score of zero, however, M1 is the source of a subtle kind of brittleness. Whenever a message matches no taxon in the population, the choice of which taxa are relevant must be made at random. This can clearly have undesirable consequences for the performance of the classifier system, and also for the prospects of quickly categorizing that message using the genetic algorithm.

In order to circumvent this difficulty, Holland and Reitman use an initial population of classifiers having a 90% proportion of #'s at each taxon position. This makes it very likely that relevant taxa will be available for the genetic algorithm to work with. Unless the pattern categories in the environment are very broad, though, the brittleness of this approach is still a concern. Suppose, for example, a classifier system must categorize exemplars of the pattern characteristic 11010##. A fairly well-adapted population of classifiers will contain taxa such as 11010##, 1#010##, 11#10#1, 11#10#0, etc. As the categorization process under the genetic algorithm continues, the variability in the population decreases. It therefore becomes unlikely that the population will contain many taxa having four or more #'s.
Such taxa would have a match score too low to compete over the long run and survive. Now suppose the environment changes slightly so that the characteristic is **010##; that is, the category has been expanded to allow either a 0 or 1 in the first two positions. In order to consistently match the exemplars of the new category, the population needs a taxon with four #'s at exactly the right loci. There is no reason to expect such good fortune since the combinations of attribute values are no longer random. The population will most likely have no taxon to match new exemplars, and the genetic algorithm will blindly search for a solution.

Another proposed resolution of this dilemma is to simply insert the troublesome message into the population as a taxon [Holland, 1976], perhaps with a few #'s added to it. The problem with this is that the rest of the classifier must be chosen more or less at random. By abandoning the "building block" approach to generating classifiers, this method introduces the brittleness inherent in ad hoc constructions that cannot make use of previous experience. What is needed is a way of determining partial relevance, so the genetic algorithm can discover useful building blocks even in taxa that are not matched. In the example cited above, such a capability would allow the genetic algorithm to recognize #1010## and 1#010## as "near miss" categorizations and work from there rapidly toward the solution ##010##.

Alternatives to M1

The brittleness associated with the match score M1 has a noticeable impact on categorization in classifier systems. To demonstrate this effect, a basic genetic algorithm [Booker, 1982] was implemented to manipulate populations of classifier taxa. Taxa in this system are 16 positions long. The effectiveness of a match score in identifying useful building blocks is tested by presenting the genetic algorithm with a categorization problem.
Each generation, a binary string belonging to the category is constructed and match scores are computed for every taxon. The genetic algorithm then generates a new population, using the match score to rate individual taxa. To test M1, three pattern categories were selected:

    C1 = 1111111111111111
    C2 = 11111111########
    C3 = 1###############

These characteristics are representative of the kinds of structural properties that are used to define categories, from the very specific to the very broad. Three sets of tests were run, each set starting with an initial population containing a different proportion of #'s. Each test involved a population of size 50 observed for 120 generations, giving a total of 6000 match score computations². At the end of each run, a G value was computed for the final population to evaluate how well the characteristic had been modeled. The results of these experiments — averaged over 15 runs — are given in Table 1. For each pattern category, there are statistically significant³ decreases in performance as the proportion of #'s is changed from 80% to 33%. (Recall that the best G value is zero.) Given this quantitative evidence of M1's brittleness, it is reasonable to ask if there are better performing alternatives. The primary criterion for an alternative to M1 is that it identify useful building

² 6000 function evaluations is the observation interval that has become a standard in studies of genetic algorithms.
³ For all results presented in this paper, a t-test was performed comparing the means of the two groups. The alpha level for each test was .05.
blocks in non-matching taxa, and that it retain the strong selective pressure induced by M1 among matching taxa.

    Table 1. Final Average G Value Using M1

               Initial Percentage of #'s
    Category |   80%     50%     33%
       C1    |   7.83   10.28   12.25
       C2    |   4.95   16.72   25.13
       C3    |   5.98   13.67   36.57

One way to achieve this is to design a score that is equal to M1 for matching taxa, but assigns non-matching taxa values between 0 and 1. The question is, how should the non-matching taxa be ranked?

If we are concerned with directly identifying useful alleles, the following simple point system will suffice: award 1 point for each matched 0 or 1, ¾ point for each #, and nothing for each position not matched. The value for # is chosen to make sure it is more valuable for matching a random bit in a message than a 0 or 1, whose expected value in that case would be ½. To convert this point total into a value between 0 and 1, we divide by the square of the taxon length. This insures that there is an order of magnitude difference between the lowest score for a matching taxon and all scores for non-matching taxa. More formally, if ℓ is the length of a taxon, n₁ is the number of exactly matched 0's and 1's, and n₂ is the number of #'s, we define a new match score

    M2 = M1                 if the message matches the taxon
         (n₁ + ¾n₂) / ℓ²    otherwise

Another way to rank non-matching taxa is by counting the number of mismatched 0's and 1's. This approach measures the Hamming distance between a message and a taxon for the non-# positions. A simple match score M3 can be defined to implement this idea. If n is the number of mismatched 0's and 1's, then

    M3 = M1     if the message matches the taxon
         1/n    otherwise

Now it must be determined if M2 and M3 usefully rank non-matching taxa and, if so, whether that gives them an advantage over M1. Accordingly, M2 and M3 were tested on the same three patterns and types of populations described above for M1. These experiments are summarized in Tables 2 and 3.
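The three match scores can be sketched as follows; this is an illustrative reading of the definitions above, with messages and taxa as strings over '0', '1', and '#':

```python
def is_match(message, taxon):
    """True iff the message agrees with the taxon at every non-# position."""
    return all(c == '#' or m == c for m, c in zip(message, taxon))

def m1(message, taxon):
    """Number of non-# positions for a matching taxon, zero otherwise."""
    return sum(c != '#' for c in taxon) if is_match(message, taxon) else 0

def m2(message, taxon):
    """M1 for matching taxa; otherwise (n1 + 3/4 * n2) / len^2, where n1
    counts exactly matched 0's and 1's and n2 counts #'s."""
    if is_match(message, taxon):
        return m1(message, taxon)
    n1 = sum(c != '#' and m == c for m, c in zip(message, taxon))
    n2 = taxon.count('#')
    return (n1 + 0.75 * n2) / len(taxon) ** 2

def m3(message, taxon):
    """M1 for matching taxa; otherwise 1/n, where n counts mismatched
    0's and 1's."""
    if is_match(message, taxon):
        return m1(message, taxon)
    n = sum(c != '#' and m != c for m, c in zip(message, taxon))
    return 1.0 / n
```

Note that every non-matching score under M2 or M3 stays strictly below 1, so any matching taxon with at least one defined position still outranks all non-matching taxa, preserving M1's selective pressure among matches.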
As before, all values are averages from 15 runs. First consider the final G values shown in Table 2. When the population is initialized to 80% #'s there is little difference among the three match scores. The only statistically significant differences are with pattern C3, where both M2 and M3 do better than M1. This is interesting because C3 is a category that has no generalizations other than the set of all messages. M1 operates by seizing upon matching taxa quickly, then refining them to fit the situation. This strategy is frustrated when general taxa that consistently match are hard to find. Since M2 and M3 can both take advantage of other information, they do not have this problem with C3. When the population is initialized to 33% #'s the liabilities of M1 become very obvious. For each pattern category, the performance of M2 and M3 are both statistically significant improvements over M1.

    Table 2. Comparison of Final G Values

                  Match Score
    Category |  M1      M2      M3
    80% #'s
       C1    |  7.83   10.30    7.76
       C2    |  4.95    2.25    4.32
       C3    |  5.98    1.42    0.97
    50% #'s
       C1    | 10.28    8.17    6.96
       C2    | 16.72    7.03    4.39
       C3    | 13.67    8.67    9.13
    33% #'s
       C1    | 12.25    8.05    5.19
       C2    | 25.13   13.99   10.37
       C3    | 36.57   11.41    7.28

    Table 3. Comparison of On-line Performance

                  Match Score
    Category |  M1      M2      M3
    80% #'s
       C1    | 25.75    —      22.93
       C2    | 14.06    —      13.45
       C3    |   —      —        —
    50% #'s
       C1    | 34.41   26.3    21.98
       C2    | 27.09   20.22   17.81
       C3    | 21.26   14.78   13.54
    33% #'s
       C1    | 26.35    —      21.46
       C2    | 35.3     —      26.75
       C3    | 40.16    —      19.34

In order to further understand the behavior of the match scores, we also compare them using DeJong's [1975] on-line performance criterion. On-line performance takes into account every new structure generated by the genetic algorithm, emphasizing steady and consistent progress toward the optimum value. The structures of interest here are populations as models of the pattern characteristic. The appropriate on-line measure is therefore given by

    Ḡ(T) = (1/T) Σ_{t=1..T} G(t) ,

where T is the number of generations observed and G(t) is the G value for the t-th generation.
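Computed over a run, the measure above is just the average of the per-generation G values; a one-line sketch:

```python
def online_performance(g_values):
    """DeJong-style on-line measure: the mean of the population G value
    over all T observed generations."""
    return sum(g_values) / len(g_values)
```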
The on-line performance of the match scores is given in Table 3. When there are 80% #'s, the only statistically significant difference is the one between M3 and M1 on category C3. In the case of 50% #'s, the statistically significant differences occur on C1, where both M2 and M3 outperform M1, and on C2, where only M3 does better than M1. Finally, in the difficult case of 33% #'s, the differences between M3 and M1 are all statistically significant. M2 is significantly better than M1 only on category C3.

Taken together, these results suggest that M3 is the best of the three match scores. It consistently gives the best performance over a broad range of circumstances. Figure 1 shows that, even in the case of 33% #'s, M3 reliably leads the genetic algorithm to the perfect model for all three categories. Using M3 should therefore enhance the ability of classifier systems to categorize messages.

How should a classifier system use M3 to identify relevant classifiers? The criterion for relevance using a score like M3 is centered around the idea of a variable threshold. The threshold is simply the number of mismatched taxon positions to be tolerated. Initially the threshold is set to zero and relevance is determined as with M1. If there are no matching classifiers, or not enough to fill the system's channel capacity, the threshold can be slowly relaxed until enough classifiers have been found.

Note that this procedure is like the conventional one in that it clearly partitions the classifiers according to whether or not they are relevant to a message. This means that negated conditions in classifiers can be treated as usual; namely, a negated condition is satisfied only when it is not relevant to any message.

DISCOVERING MULTIPLE CATEGORIES

In developing the match score M3, we have enhanced the ability of the genetic algorithm to discover the defining characteristic for a given pattern category.
What if there is more than one category to learn, or a single category with more than one defining characteristic? In this section we show how to modify the genetic algorithm to handle this more general case. First, two modifications are proposed for the way individuals are selected to reproduce and to be deleted. Then, the modified algorithm is shown to perform as desired.

An Ecological Analogy

The basic genetic algorithm is a reliable way to discover the defining characteristic of a category. When there is more than one characteristic in the environment, however, straightforward optimization of match scores will not lead to the best set of taxa. Suppose, for example, there are two categories given by the characteristics 11**...** and 00**...**. The ideal population for distinguishing these categories would contain the classifier taxa 11##...## and 00##...##; that is, two specialized sub-populations, one for each category. The genetic algorithm as described so far will treat the two patterns as one category and produce a population of taxa having good performance in that larger category. In this case, that means the taxon ####...## will be selected as the best way to categorize the messages. The problem is obvious. Requiring each taxon to match each message results in an averaging of performance that is not always desirable.

Various strategies have been proposed for avoiding this problem. When the number of categories is known in advance, the classifier system can be designed to have several populations of classifiers [Holland and Reitman, 1978], or a single population with pre-determined partitions and operator restrictions [Goldberg, 1983]. Both of these approaches involve building domain dependencies into the system that lead to brittleness.

[Figure 1. M3 converges to the perfect model. G value versus generations (0 to 400) for categories C1, C2, and C3.]
If the category structure of the domain changes in any way, the system must be re-designed. It is preferable to have a non-brittle method that automatically manages several characteristics in one population. What is needed is a simple analog of the speciation and niche competition found in biological populations. The genetic algorithm should be implemented so that, for each characteristic or "niche", a "species" of taxa is generated that has high performance in that niche. Moreover, the spread of each species should be limited to a proportion determined by the "carrying capacity" of its niche. What follows is a description of technical modifications to the genetic algorithm that implement this idea.

A Restricted Mating Strategy

If the genetic algorithm is to be used to generate a population containing many specialized sub-populations, it is no longer reasonable for the entire population to be modified at the same time. Only those individuals directly relevant to the current category need to be involved in the reproductive process. Given that the overall population size is fixed and the various sub-populations are not physically separated, two questions are immediately raised: Does modifying only a fraction of the population at a time make a difference in overall performance? How is a sub-population identified?

DeJong [1975] experimented with genetic algorithms in which only a fraction of the population is replaced by new individuals each generation. His results indicate that such a change has adverse effects on overall plan performance. The problem is that the algorithm generates fewer samples of the search space at a time. This causes the sampling error due to finite stochastic effects to become more severe. An increase in cumulative sampling error, in turn, makes it more likely that the algorithm will converge on some sub-optimal solution.
The strategy adopted here to reduce the sampling error is to make sure that the "productive" regions of the search space consistently get most of the samples. In the standard implementations of the genetic algorithm, the search trajectory is unconstrained in the sense that any two individuals have some non-zero probability of mating and generating new offspring (sample points) via crossover. This means, in particular, that taxa representing distinct characteristics can be mated to produce taxa not likely to be useful for categorization. As a simple example, consider the two categories given by 1111**** and 0000****. Combining taxa specific to each of these classes under crossover will lead to taxa like 1100**** which categorize none of the messages in either category. There is no reason why such functional constraints should not be used to help improve the allocation of samples. It therefore seems reasonable to restrict the ability of functionally distinct individuals to become parents and mate with each other. This will force the genetic algorithm to progressively cluster new sample points in the more productive regions of the search space. The clusters that emerge will be the desired specialized sub-populations.

As for identifying these functionally distinct individuals, any restrictive designation of parent taxa must obviously be based on match scores. This is because taxa relevant to the same message have a similar categorization function. Taken together, these considerations provide the basis for a restricted mating policy. Only those taxa that are relevant to the same message will be allowed to mate with each other. This restriction is enforced by using the set of relevant classifiers as the parents for each invocation of the genetic algorithm.

Crowding

Under the restricted mating policy, each set of relevant taxa designates a species. Each category characteristic designates a niche.
Following this analogy, individuals that perform well in a given niche will proliferate while those that do not do well in any niche will become extinct. This ecological perspective leads to an obvious mechanism for automatically controlling the size of each sub-population. Briefly, and very simply, any ecological niche has limited resources to support the individuals of a species. The number of individuals that can be supported in a niche is called the carrying capacity of the niche. If there are too many individuals there will not be enough resources to go around. The niche becomes "crowded," there is an overall decrease in fitness, and individuals die at a higher rate until the balance between niche resources and the demands on those resources is restored. Similarly, if there are too few individuals the excess of resources results in a proliferation of individuals to fill the niche to capacity.

The idea of introducing a crowding mechanism into the genetic algorithm is not new. DeJong [1975] experimented with such a mechanism in his function optimization studies. Instead of deleting individuals at random to make room for new samples, he advocates selecting a small subset of the population at random. The individual in that subset most similar to the new one is the one that gets replaced. Clearly, the more individuals there are of a given type, the more likely it is that one of them will turn up in the randomly chosen subset. After a certain point, new individuals begin to replace their own kind and the proliferation of a species is inhibited.

A similar algorithm can be implemented much more naturally here. Because a message selects via match scores those taxa that are similar, there is no need to choose a random subset. Crowding pressure can be exerted directly on the set of relevant taxa. This can be done using the strength parameter normally associated with every classifier [Holland, 1983].
The strength of a classifier summarizes its value to the system in generating behavior. Strength is continuously adjusted using the bucket brigade algorithm [Holland, 1983], which treats the system like a complex economy. Each classifier's strength reflects its ability to turn a "profit" from its interactions with other classifiers and the environment. One factor bearing on profitability is the prevailing "tax rate". Taxation is the easiest way to introduce crowding pressure. Assume that a classifier is taxed some fraction of its strength whenever it is deemed to be relevant to a message. Assume, further, that all relevant classifiers share in a fixed-size tax rebate. The size of the tax rebate represents the limited resource available to support a species in a niche. When there are too many classifiers in a niche their average strength decreases in a tax transaction because they lose more strength than they gain. Conversely, when there are too few classifiers in a niche their average strength will increase. The crowding pressure is exerted by deleting classifiers in inverse proportion to their strength. The more individuals there are in a niche, the less their average strength. Members of this species are therefore more likely to be deleted. In a species with fewer members, on the other hand, the average strength will be relatively higher, which means members are more likely to survive and reproduce. In this way, the total available space in the population is automatically and dynamically managed for every species. The number of individuals in a niche increases or decreases in relative proportion to the average strength in alternative niches.

Testing the New Algorithm

Having described the restricted mating policy and crowding algorithm, we now examine how well they perform in an actual implementation. The genetic algorithm used in previous experiments was modified as indicated above.
The number of taxa in the population was increased to 200, and each taxon was given an initial strength of 320. A taxation rate of 0.1 was arbitrarily selected, and the tax rebate was fixed at 50 × 32; in other words, whenever there are 50 relevant taxa, the net tax transaction based on initial strengths is zero. Each generation the tax transaction is repeated 10 times to help make sure the strengths used for crowding are near their equilibrium values.

Four categorization tasks involving multiple characteristics were chosen to test the performance of the algorithm:

    1) 11111111********  and  00000000********
    2) 11111111********  and  ********00000000
    3) 1111111111******  and  ******1111111111
    4) 11111111********, 00000000********,  and  ********11111111

The first task involves two categories that are defined on the same feature dimensions. The second task contains categories defined on different dimensions. In the third task the categories share some relevant features in common. Finally, the fourth task involves three categories to be discriminated.

Experiments were performed on each of these tasks, running the genetic algorithm enough generations to produce 6000 new individuals per characteristic. Each generation, one of the characteristics was selected and a message belonging to that category was used to compute match scores. In the first three tasks, at least 50 relevant taxa were chosen per generation. Only 30 were chosen on task 4 to avoid exceeding the limited capacity of the population. All populations were initialized with 80% #'s.

    Table 4. Performance With Multiple Categories

    Task | On-line | Avg. G value for all categories
      1  |  12.12  |   8.3
      2  |  10.91  |   8.41
      3  |  12.77  |   7.89
      4  |  15.75  |  11.64

The results are summarized in Table 4 and show that the algorithm behaves as expected. The performance values are comparable to those obtained with M1 working on a simpler problem with a dedicated population. More importantly, an inspection of the populations revealed that they were partitioned into specialized sub-populations as desired.
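The tax-and-rebate transaction described above can be sketched with the quoted parameters (initial strength 320, tax rate 0.1, rebate 50 × 32). With exactly 50 relevant taxa at initial strength the transaction is neutral, while a more crowded niche loses average strength:

```python
def tax_transaction(strengths, tax_rate=0.1, rebate=50 * 32):
    """One tax transaction on the set of relevant taxa: each pays
    tax_rate of its strength, then the fixed rebate is shared equally."""
    share = rebate / len(strengths)
    return [s - tax_rate * s + share for s in strengths]

# 50 relevant taxa at strength 320: tax 32 each, rebate share 32 -> unchanged.
# 100 relevant taxa: tax 32 each, rebate share only 16 -> strength falls to 304.
# 25 relevant taxa: tax 32 each, rebate share 64 -> strength rises to 352.
```

Since classifiers are deleted in inverse proportion to strength, repeating this transaction drives over-populated niches to shed members and under-populated niches to gain them.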
CONCLUSIONS

This research has shown how to improve the performance of genetic algorithms in classifier systems. A new match score was devised that makes use of all of the information available in a population of taxa. This improves the ability of the genetic algorithm to discover pattern characteristics under changing conditions in the environment. Modifications to the algorithm have been presented that transform it from a function optimizer into a sophisticated heuristic for categorization. The first modification, a restricted mating policy, results in the isolation and development of clusters of taxa, or sub-populations, correlated with the inferred structural characteristics of the pattern environment. The second modification, a crowding algorithm, is responsible for the dynamic and automatic allocation of space in the population among the various clusters. Together, these modifications produce a learning algorithm powerful enough for challenging applications. As evidence of this claim, a full-scale classifier system has been built along these lines that solves difficult cognitive tasks [Booker, 1982].

Acknowledgements

The ideas in this paper were derived from work done on the author's Ph.D. dissertation. That work was supported by the Ford Foundation, the IBM Corporation, the Rackham School of Graduate Studies, and National Science Foundation Grant MCS78-26016.

REFERENCES

Bethke, A.D. (1981), "Genetic Algorithms as Function Optimizers", Ph.D. dissertation, University of Michigan.

Booker, L.B. (1982), "Intelligent Behavior as an Adaptation to the Task Environment", Ph.D. dissertation, University of Michigan.

DeJong, K.A. (1975), "An Analysis of the Behavior of a Class of Genetic Adaptive Systems", Ph.D. dissertation, University of Michigan.

Goldberg, D.E. (1983), "Computer-Aided Gas Pipeline Operation Using Genetic Algorithms and Rule Learning", Ph.D. dissertation, University of Michigan.

Hayes-Roth, F.
(1973), "A Structural Approach to Pattern Learning and the Acquisition of Classificatory Power", Proceedings of the First International Joint Conference on Pattern Recognition, pp. 343-355.

Hinton, G., Sejnowski, T., and Ackley, D. (1984), "Boltzmann Machines: Constraint Satisfaction Networks that Learn", Technical Report CMU-CS-84-119, Carnegie-Mellon University.

Holland, J.H. (1975), Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor.

Holland, J.H. (1976), "Adaptation", in Progress in Theoretical Biology 4 (Rosen, R. and Snell, F., eds.), Academic Press, New York.

Holland, J.H. (1983), "Escaping Brittleness", Proceedings of the International Machine Learning Workshop, June 1983, Monticello, Illinois, pp. 92-95.

Holland, J.H. and Reitman, J.S. (1978), "Cognitive Systems Based on Adaptive Algorithms", in Pattern-Directed Inference Systems (Waterman, D. and Hayes-Roth, F., eds.), pp. 313-329, Academic Press, New York.

Kuchinski, M.J. (1985), "Battle Management Systems Control Rule Optimization Using Artificial Intelligence", Technical Note, Naval Surface Weapons Center, Dahlgren, VA.

Kullback, S. (1959), Information Theory and Statistics, John Wiley and Sons, New York.

Multiple Objective Optimization with Vector Evaluated Genetic Algorithms

J. David Schaffer
Department of Electrical Engineering
Vanderbilt University
Nashville, TN 37235

ABSTRACT

Genetic algorithms (GA's) have been shown to be capable of searching for optima in function spaces which cause difficulties for gradient techniques. This paper presents a method by which the power of GA's can be applied to the optimization of multiobjective functions.

1. Introduction

There is currently considerable interest in optimization techniques capable of handling multiple non-commensurable objectives. Many problems are of this type where, for example, such factors as cost, safety and performance must be taken into account.
A class of adaptive search procedures known as genetic algorithms (GA's) has already been shown to possess desirable properties [3,10] and to outperform gradient techniques on some problems, particularly those of high order, with multiple peaks or with noise disturbance [4,5,6]. This paper describes an extension of the traditional GA which allows the searching of parameter spaces where multiple objectives are to be optimized. The software system implementing this procedure was called VEGA, for Vector Evaluated Genetic Algorithm.

The next section of this paper will describe the basic GA and the vector extension. Then some properties are described which might logically be expected of this method. Some preliminary experiments on some simple problems are then presented to illuminate these properties and finally, VEGA is compared to an established multiobjective search technique on a set of more formidable problems.

2. The Vector Evaluated Genetic Algorithm

Unlike many other search techniques which maintain a single "current best" solution and try to improve it, a GA maintains a set of possible solutions called a population. This population is improved by a cyclic two-step process consisting of a selection step (survival of the fittest) and a recombination step (mating). Each cycle is usually called a generation. More detailed descriptions of these operations may be found in the literature [3,4,5,6,10].

The question addressed here is, how can this process be applied to problems where fitness is a vector and not a scalar? How might survival of the fittest be implemented when there is more than one way to be fit? We exclude scalarization processes such as weighted sums or root mean square by the assumption that the different dimensions of the vector are non-commensurable. When comparing vector quantities, the usual concepts employed are those proposed by Pareto [11,13].
For two vectors of the same size, the equality, less-than and greater-than relations require that these relations hold element by element. Another relation, partially-less-than, is defined as follows: vector X = (x1, x2, ..., xn) is said to be partially-less-than vector Y = (y1, y2, ..., yn) iff xi <= yi for all i and, for at least one value of i, xi < yi. Assuming that minima are sought, if X is partially-less-than Y, then Y is said to be inferior to, or dominated by, X. The objective of a search for minima in a vector-valued space is, then, a search for the set of non-inferior members, or the members not dominated by any others. At least one member of this Pareto minimal set will dominate each vector outside the set, but among themselves, none is dominated.

With this in mind, a simple vector survival-of-the-fittest process was implemented. The selection step in each generation became a loop; each time through the loop the appropriate fraction of the next generation was selected on the basis of another element of the fitness vector. This process, illustrated in Figure 1, protects the survival of the best individuals on each dimension of performance and, simultaneously, provides the appropriate probabilities for multiple selection of individuals who are better than average on more than one dimension.

3. Some Anticipated Properties of VEGA

3.1 Multiple Solutions

One potential advantage of VEGA over other optimization searches should now be clear. Since the object of the search is a set of solutions, a GA has a built-in advantage by working with a population of test solutions. By comparing each individual in a population to every other, those who are dominated by any others can be flagged as inferior. The set of non-inferior individuals in each generation is the current best guess at the
[Figure 1. Schematic of VEGA selection: in each generation, n subgroups of the next generation are selected, one using each dimension of performance; the subgroups are then shuffled together before the genetic operators are applied.]

Pareto-optimal (PO) set. By presenting a number of non-inferior solutions, VEGA provides the user with an idea of the tradeoffs required by his problem if a single solution must be selected. It should be noted that VEGA's view of non-inferiority is strictly local; it is limited to the current population. While a locally dominated individual is also globally dominated, the converse is not necessarily true. An individual who is non-dominated in one generation may become dominated by an individual who emerges in a later generation.

3.2 Possible Speciation

There is a potential problem with this vector selection process. Survival pressure is applied favoring extreme performance on at least one dimension of performance. If a utopian individual (i.e. one who excels on all dimensions of performance) exists, then he may be found by genetic combinations of extreme parents, but for many problems this utopian solution does not exist. For these problems, the location of the Pareto-optimal set or front is sought. This front will contain some members with extreme performance on each dimension and some with "middling" performance on all dimensions. Frequently, these compromise solutions are of most interest, but there may be danger of their not surviving VEGA's selection. This might give rise to the evolution of "species" within the population which excel on different aspects of performance. This danger is expected to be more severe for problems with a concave PO front than for those with a convex one. See Figure 2.

Two methods for combating this potential property of VEGA were conceived. One trick would be to provide a heuristic selection preference for non-dominated individuals in each generation. This would provide extra protection for the "middling" individuals.
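Flagging the non-dominated individuals in a population reduces to Pareto's partially-less-than test defined earlier; a minimal sketch, assuming minimization and vectors as tuples:

```python
def partially_less_than(x, y):
    """X is partially-less-than Y iff x_i <= y_i for all i and
    x_i < y_i for at least one i (minimization assumed)."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))

def non_inferior(population):
    """Members not dominated by any other member: the local guess at
    the Pareto-optimal set. Self-comparison is harmless because no
    vector is partially-less-than itself."""
    return [x for x in population
            if not any(partially_less_than(y, x) for y in population)]
```

For example, among the objective vectors (1, 9), (1, 1), and (9, 1), only (1, 1) is non-inferior, since it is partially-less-than each of the others.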
Another, not necessarily exclusive, approach would be to try to encourage crossbreeding among the "species" by adding some mate selection heuristics. In a traditional GA, mates are selected at random. On the assumption that utopian individuals are more likely to result from crossbreeding than inbreeding, such heuristics might speed the search.

4. Preliminary Experiments

4.1 The Test Functions

In order to test the properties of VEGA, a set of three simple functions (f1, f2 and f3) was selected. F1 was a single-valued quadratic function of three variables (i.e., f1(x1,x2,x3) = x1**2 + x2**2 + x3**2). This function was run to test whether VEGA reverts to a traditional GA when the performance vector has only one dimension. F2 was a two-valued function of one variable (i.e., f21(x) = x**2; f22(x) = (x-2)**2). The initial random population for the search on this function is illustrated in figure 3. In addition to the locations of x, f21 and f22, this figure also shows the dominated flag for each x (1 if dominated, 0 if not). The PO region is 0 <= x <= 2.

[Figures 10 and 12 (average and best performance, by generation and by trial, for selection methods A--Standard, B--Hybrid/max. exp. val., C--Hybrid/perct. invol., D--Pop. variance, E--Ranking) omitted.]

Genetic Search with Approximate Function Evaluations

John J. Grefenstette
J. Michael Fitzpatrick

Computer Science Department
Vanderbilt University

Abstract

Genetic search requires the evaluation of many candidate solutions to the given problem.
The evaluation of candidate solutions to complex problems often depends on statistical sampling techniques. This work explores the relationship between the amount of effort spent on individual evaluations and the number of evaluations performed by genetic algorithms. It is shown that in some cases more efficient search results from less accurate individual evaluations.

1. Introduction

Genetic algorithms (GA's) are direct search algorithms which require the evaluation of many points in the search space. In some cases the computational effort required for each evaluation is large. In a subset of these cases it is possible to make an approximate evaluation quickly. In this paper we investigate how well GA's perform with approximate evaluations. This topic is motivated in part by the work of De Jong [5], who included a noisy function as part of his test environment for GA's, but did not specifically study the implications of approximate evaluations for the efficiency of GA's. Our main question is: given a fixed amount of computation time, is it better to devote substantial effort to getting highly accurate evaluations, or to obtain quick, rough evaluations and run the GA for many more generations?

We assume that the evaluation of each structure by the GA involves a Monte Carlo sampling, and that the effort required for each evaluation is equal to the number of samples performed. Since the GA's we consider do not obtain accurate evaluations during the search, the traditional metrics, online performance and offline performance, are not appropriate (or at least not easily obtained). Instead, we assume that the GA runs for a fixed amount of time, after which it yields a single answer. The performance measurement we use is the absolute performance, that is, the exact evaluation of the suggested answer after a fixed amount of time.

In section 2 we describe the statistical evaluation technique. In section 3 we describe the results of testing on a simple example evaluation function.
In section 4 we describe the results of testing on image comparison functions. In section 5 we present future directions of research on approximate evaluations.

2. The Statistical Evaluation Technique

In this work we investigate the optimization of a function f(z) whose value can be estimated by sampling. The variable z ranges over the space of structures representable to the GA. We are interested in functions for which an exact evaluation requires a large investment in time but for which an approximate evaluation can be carried out quickly. Examples of such functions appear in the evaluation of integrals of complicated integrands over large spaces. Such integrals appear in many applications of physics and engineering and are commonly evaluated by Monte Carlo techniques [13,15]. An example from the field of image processing, examined in detail below, is the comparison of two digital images. Here the integrand is the absolute difference between image intensities in the two images at a given point, and the space is the area of the image.
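The sampling-based evaluation just described can be sketched as follows. This is an illustrative fragment, not the authors' code; `estimate_mean` is our own name, and the example integrand is chosen only to show the mechanics (the sample mean estimates f(z), and its standard error is estimated from the same samples):

```python
import math
import random

def estimate_mean(sample_fn, n):
    """Approximate f(z) = mean of r(z) by averaging n random samples.
    Returns the sample mean and its estimated standard error,
    sigma_s / sqrt(n), with sigma_s the unbiased sample deviation."""
    samples = [sample_fn() for _ in range(n)]
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return mean, math.sqrt(var / n)

# Example: Monte Carlo evaluation of the integral of x^2 over [0, 1]
# (true value 1/3; the volume of the region is 1, so mean = integral).
rng = random.Random(0)
mean, se = estimate_mean(lambda: rng.random() ** 2, 10_000)
```

Multiplying the estimated mean by the volume of the sampled region gives the integral estimate, as the text notes.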
¹Research supported in part by the National Science Foundation under Grant MCS-8305603.

Throughout our discussion it is convenient to treat the function, f(z), to be optimized as the mean of some random variable r(z). In terms of the evaluation of an integral by the Monte Carlo technique, f(z) would be the mean of the integrand's value over the space, and r(z) is simply the set of values of the integrand over the space. The approximation of f(z) by the Monte Carlo technique proceeds by selecting n random samples from r(z). The mean of the sample serves as the approximation and, to the extent that the samples are random, the sample mean is guaranteed by the law of large numbers to converge to f(z) with increasing n. Once f(z) is approximated, the desired value of the integral can be approximated by multiplying the approximation of f(z) by the volume of the space.

There are many approaches to improving the convergence of the sample mean and the confidence in the mean for a fixed n [15]. We will not investigate these approaches. Here we will be concerned only with the sample mean and an estimate of our confidence in that mean.

The idea which we are exploring is to use as the evaluation function in the GA optimization of f(z), not f(z) itself, but an estimate, e(z), of f(z) obtained by taking n randomly chosen samples from r(z). It is intuitive that e(z) approaches f(z) for large n. From statistical sampling theory it is known that if r(z) has standard deviation o(z), then the standard deviation of the sample mean, o_e(z), is given by

(1)  o_e(z) = o(z) / sqrt(n).

In general o(z) will be unknown. It is simple, however, to estimate o(z) from the samples s_i using the unbiased estimate

(2)  o_s(z)^2 = sum_i (s_i - e(z))^2 / (n - 1).

It is clear from equation (1) that reducing the size of o_e(z) can be expensive. Reducing o_e(z) by a factor of two, for example, requires four times as many samples. It is intuitive that the GA will require more evaluations to reach a fixed level of optimization for f(z) when o_e(z) is
larger. Concomitantly, it is intuitive that the GA will achieve a less satisfactory level of optimization for f(z) for a fixed number of evaluations when o_e(z) is larger. What is not obvious is which effect is more important: the increase in the number of evaluations required or the increase in the time required per evaluation. The following experiments explore the relative importance of these two effects.

3. A Simple Experiment

As a simple example function we have chosen to minimize

f(x,y,z) = x^2 + y^2 + z^2.

We imagine that f(x,y,z) is the mean of some distribution which is parameterized by x, y and z, but instead of actually sampling such a function to achieve the estimate e(x,y,z), we use

e(x,y,z) = f(x,y,z) + noise,

where noise represents a pseudo-random function chosen to be normally distributed and to have zero mean. The standard deviation, o_e(x,y,z), of e(x,y,z) is in this case equal to that of the noise function, and it is chosen artificially. No actual sampling is done. The advantage of this experimental scheme is that we can investigate the effects of many different distributions and sample sizes for each o_e(x,y,z) we choose without performing all the experiments.

In order to get some idea of the effect of the dependence of o_e(x,y,z) on x, y and z, we perform two different sets of experiments on f(x,y,z): (a) o_e(x,y,z) independent of x, y and z, and (b) o_e(x,y,z) = lambda_e * f(x,y,z). The search space is limited to x, y and z between -5.12 and +5.12, digitized to increments of 0.01. The GA parameters are the standard ones suggested by De Jong [5]: population size 50, crossover rate 0.6, mutation rate 0.001. For the experiments of type (a) we determine, for several values of o_e, the number of evaluations necessary to find x, y and z such that f(x,y,z) falls below a threshold of 0.05. For the experiments of type (b) we determine, for several values of lambda_e, the number of evaluations necessary to achieve the 0.05 threshold. The results of 50 runs at each setting are shown in Figure 2.
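The two noise schemes of experiments (a) and (b) can be sketched as follows; the function names are ours and the particular noise parameters are arbitrary examples, not values from the paper:

```python
import random

def f(x, y, z):
    return x * x + y * y + z * z

def e_absolute(x, y, z, sigma, rng):
    """Type (a): noise standard deviation independent of the point."""
    return f(x, y, z) + rng.gauss(0.0, sigma)

def e_relative(x, y, z, lam, rng):
    """Type (b): noise standard deviation proportional to f(x, y, z)."""
    return f(x, y, z) + rng.gauss(0.0, lam * f(x, y, z))

rng = random.Random(1)
noisy = e_absolute(1.0, 2.0, 3.0, 0.4, rng)  # f = 14 plus N(0, 0.4) noise
```

Because the noise is injected directly, each chosen o_e corresponds to the accuracy that n samples would have delivered, without actually drawing them.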
It is immediately obvious that these graphs are approximately linear. In fact, linear regression analysis produces a correlation coefficient of 0.99 in each case. The linearity of these graphs simplifies their analysis considerably. To see the relative importance of number of evaluations versus time per evaluation, we can start with the equations for the straight lines:

(3)  E_a = 1244 + 2268 o_e,
(4)  E_b = 18,020 + 9285 lambda_e,

where E_a and E_b are the number of evaluations required for cases (a) and (b), respectively. We imagine that the evaluations were obtained by sampling from a normal distribution whose standard deviation is o in case (a) and lambda * f(x,y,z) in case (b). In that case we can use Equation (1) for o_e(x,y,z) in both Equations (3) and (4) to get

(5)  E_a = 1244 + 2268 o / sqrt(n),
(6)  E_b = 18,020 + 9285 lambda / sqrt(n).

These equations give the number of evaluations required to achieve the threshold as a function of the number of samples taken per evaluation, but they do not indicate the total effort required to achieve the threshold. The total time required for the optimization procedure includes the time for the n samples taken at each evaluation and the overhead incurred by the GA for each evaluation.
Taking these factors into consideration, we arrive at two equations for the time necessary to achieve the threshold:

(7)  t_a = (alpha + beta * n)(1244 + 2268 o / sqrt(n)),
(8)  t_b = (alpha + beta * n)(18,020 + 9285 lambda / sqrt(n)),

where alpha is the GA overhead per evaluation and beta is the time per sample. These equations allow us to determine the optimal value for n, i.e., the value which will minimize the time necessary to reach the desired threshold in this sample problem. It can be seen that for large n each expression for the time increases linearly with n. Thus, regardless of the relative size of the overhead, the optimal value of n is, not surprisingly, finite. As n approaches zero each expression approaches infinity, but the smallest possible value for n is one. The optimal value of n for either case can be found by finding the minimum of the appropriate expression subject to the restriction that n be an integer greater than zero.

Further analysis requires some idea of the size of alpha/beta. Since the results apply only to the particular example evaluation function f(x,y,z), a detailed analysis is not worthwhile. We simply note that in the case in which alpha is negligible, the optimal value of n is 1, and as alpha increases the optimal value will increase. Thus, at least for small overhead, the answer to the question concerning the relative importance of the number of evaluations versus the time required for a given evaluation is clear: the time required for a given evaluation is more important. The accuracy of the evaluation should be sacrificed in order to obtain more evaluations. Optimization proceeds more quickly with many rough evaluations than with few precise evaluations.

4.
An Experiment on Image Registration

The preceding simple example has the following special characteristics: (1) the function to be optimized is simple, (2) r(z) has a normal distribution, (3) the standard deviation of r(z) is a known function. These characteristics make it possible to do simple experiments which are easy to analyze. In more general problems these characteristics are not guaranteed, but they are not necessary to insure the efficacy of the statistical approach. To demonstrate the method for practical problems, we describe here our approach to a problem which has none of these characteristics.

The problem is found in the registration of digital images. The functions which are optimized in image registration are measures of the difference between two images of a scene, in our case X-ray images of an area of a human neck, which have been acquired at different times. The images differ because of motion which has taken place between the two acquisition times, because of the injection of dye into the arteries, and because of noise in the image acquisition process. The registration of such images is necessary for the success of the process known as digital subtraction angiography, in which an image of the interior of an artery is produced by subtracting a pre-injection image from a post-injection image. The details of the process and the registration technique can be found in [7]. By performing a geometrical transformation which warps one image relative to the other, it is possible to improve the registration of the images so that the difference which is due to motion is reduced. The function parameters specify the transformation, and it is the goal of the genetic algorithm to find the parameter values which minimize the image difference.
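The warping transformation used in these experiments is parameterized by four corner displacement vectors, with intermediate points interpolated bilinearly. A minimal illustrative sketch follows; the assignment of d1..d4 to particular corners and the normalization by (size - 1) are our assumptions, not details given in the paper:

```python
def corner_displacement(px, py, size, d):
    """Bilinearly interpolate four corner displacement vectors
    d = [d1, d2, d3, d4] (one (dx, dy) pair per corner, assumed in the
    order top-left, top-right, bottom-left, bottom-right) at point
    (px, py) of a size-by-size subimage."""
    u, v = px / (size - 1), py / (size - 1)
    (d1x, d1y), (d2x, d2y), (d3x, d3y), (d4x, d4y) = d
    dx = (1-u)*(1-v)*d1x + u*(1-v)*d2x + (1-u)*v*d3x + u*v*d4x
    dy = (1-u)*(1-v)*d1y + u*(1-v)*d2y + (1-u)*v*d3y + u*v*d4y
    return dx, dy

# At a corner, the interpolated displacement is that corner's d vector.
dx, dy = corner_displacement(0, 0, 100, [(1.0, 2.0), (0, 0), (0, 0), (0, 0)])
```

The eight components of the four d vectors are exactly the parameters the GA searches over.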
The general problem of image registration is important in such diverse fields as aerial photography [8,16,17] and medical imaging [1,7,12,14,18]. General introductions to the field of image registration and extensive bibliographies may be found in [8,9,11]. An image comparison technique based on random sampling, different from the method used here, is described in [2]. The class of transformations which we consider includes elastic motion as well as rotation and translation.

The transformations which are employed here are illustrated in Figure 1. Two images are selected and a square subimage, the region of interest, is specified as image one -- im1. A geometrically transformed version of that image is to be compared to a second image -- im2. The transformation is specified by means of four vectors -- d1, d2, d3, and d4 -- which specify the motion of the four corners of im1. The transformed image is called im3. The motion of intermediate points is determined by means of bilinear interpolation from the corner points. The magnitudes of the horizontal and vertical components of the d vectors are limited to be less than one-fourth of the width of the subimage to avoid the possibility of folding [8]. (More complicated warpings will require additional vectors.)

The images are represented digitally as square arrays of numbers representing an approximate map of image intensity. Each such intensity is called a pixel. The image difference is defined to be the mean absolute difference between the pixels at corresponding positions in im2
and im3. The exact mean can be determined by measuring the absolute difference at each pixel position; an estimate of the mean may be obtained by sampling randomly from the population of absolute pixel differences. The effort required to estimate the mean is approximately proportional to the number of samples taken; so, once again, the question arises as to the relative importance of the number of evaluations used in the GA versus the time required per evaluation.

In general, the distribution of pixel differences for a given image transformation is not normal. Its shape will, in fact, depend in an unknown way on the geometrical transformation parameters, and consequently the standard deviation will change in an unknown way. Thus, while the experiments on f(x,y,z) suggest that better results will be realized if less exact evaluations are made, it is not clear how the level of accuracy should be set. We note that in the analysis of the experiments on f(x,y,z), fixing the number of samples, n, has the effect of fixing either o_e or lambda_e = o_e / f(x,y,z), given the assumed forms of o. In the image registration case, and in the general case, fixing n fixes neither of these quantities, since o's behavior cannot in general be expected to be so simple. We could, however, fix either of these quantities approximately by estimating o using Equation (2) as samples are taken during an evaluation, and continuing the sampling until n is large enough that the estimate of o_e obtained from Equation (1) is reduced to the desired value. Thus, the results from the previous experiments suggest three experiments on image registration: (1) try to determine an optimal fixed n, (2) try to determine an optimal fixed o_e, (3) try to determine an optimal fixed lambda_e.
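Estimating the mean absolute pixel difference by random sampling can be sketched as follows. This is illustrative only; representing images as nested lists and sampling positions with replacement are our simplifying assumptions:

```python
import random

def sampled_difference(im2, im3, n, rng):
    """Estimate the mean absolute pixel difference between two
    equal-size images by sampling n random pixel positions."""
    h, w = len(im2), len(im2[0])
    total = 0
    for _ in range(n):
        i, j = rng.randrange(h), rng.randrange(w)
        total += abs(im2[i][j] - im3[i][j])
    return total / n

im_a = [[10, 10], [10, 10]]
im_b = [[12, 8], [9, 13]]
est = sampled_difference(im_a, im_b, 50, random.Random(0))
```

The cost grows with n while the full-image evaluation would touch every one of the h*w pixels, which is the tradeoff the paper investigates.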
We have implemented the first idea and a variation of the third idea. The variation is motivated by noting from statistical sampling theory that by fixing lambda_e, we are equivalently fixing our confidence in the accuracy of the sample mean as representative of the actual mean. If, for example, we require that the sample mean be within (100p)% of the actual mean with 95% confidence, we should sample until we determine that lambda_e is less than or equal to p/1.96 [19]. If we can fix only an estimate of lambda_e, as in the general case, then (100p)% accuracy at the 95% confidence level requires that the estimate of lambda_e be less than or equal to p/t_q(n). Here t_q(n) is Student's t at a confidence level of 100(1-q)% and a sample size of n [4]. This t-test is exact only if the distribution of the sample mean is normal. In order to assure that the sample mean is approximately normal, the sample size, n, should be at least 10 [4]. Our variation on fixing lambda_e is to pick a confidence level of 95% (an arbitrary choice) and then fix p, subject to n >= 10, to determine an optimal p.

The experiments to determine optimal values of n and p for image registration, and in the general case, differ from those described for f(x,y,z) above in two ways. First, because so little is known about the distributions in the general case, actual sampling is necessary. Second, because so little is known about the mean which is to be optimized (minimized), it is difficult to determine in the general case whether a threshold has been reached, and therefore the criterion for halting must be different.
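The accuracy-interval stopping rule can be sketched as below. For simplicity this illustration uses the normal critical value 1.96 in place of Student's t (a reasonable approximation once n >= 10, though the rule above uses t_q(n)); the names and parameter choices are ours:

```python
import math
import random

def sample_until_accurate(sample_fn, p, min_n=10, max_n=100_000):
    """Draw samples until the estimated relative standard error of the
    sample mean is at most p / 1.96 (roughly: 95% confidence that the
    mean is within 100p% of the true mean), with at least min_n samples."""
    samples = [sample_fn() for _ in range(min_n)]
    while True:
        n = len(samples)
        mean = sum(samples) / n
        var = sum((s - mean) ** 2 for s in samples) / (n - 1)
        rel = math.sqrt(var / n) / abs(mean) if mean else float("inf")
        if rel <= p / 1.96 or n >= max_n:
            return mean, n
        samples.append(sample_fn())

rng = random.Random(3)
mean, n = sample_until_accurate(lambda: 100.0 + rng.gauss(0.0, 5.0), p=0.01)
```

Fixing p this way makes the sample size per evaluation adaptive: easy (low-variance) evaluations stop early, hard ones sample longer.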
We have considered two alternative halting criteria: (1) determining an exact mean, or a highly accurate estimate of the mean, of the structure whose estimate is the best at each generation, halting when that value reaches a threshold, and using as a measure of performance the total number of samples taken; (2) halting after a fixed number of samples have been taken and using as the measure of performance the exact evaluation of the structure whose estimate is the best at the last generation. The first alternative suffers from the disadvantage that the additional evaluation at each generation is expensive and tends to offset the savings gained through approximate evaluation. The severity of the disadvantage is, on the other hand, diminished as the size of the generation is increased. Therefore this method suggests a new consideration in setting the number of structures per generation. We choose in this work to avoid the question of the optimal number of structures by choosing the simpler alternative, (2). The results of our experiments on image registration are shown in Figure 3.
The Figure shows data resulting from 10 runs at each setting. The subimage im1 is 100 by 100 pixels, giving a sample space of size 10,000. The motion of the corners is limited to 8 pixels in the x and y directions. In each case the GA is halted after the generation during which the total number of samples taken exceeds 200,000. The parameters for the transformation comprise the x and y components of the four d vectors. The range for each of these eight components is [-8.0, +8.0], digitized to eight-bit accuracy. The GA parameters are set to optimize offline performance, as suggested by [10]: population size 80, crossover rate 0.45, mutation rate 0.01.

In Figure 3a each GA takes a fixed number of samples per evaluation. It can be seen from the Figure that the optimal sample size is approximately 10 samples per evaluation. Apparently, taking one sample per evaluation does not give the GA sufficient information to carry out an efficient search. The fact that performance deteriorates when we take fewer than 10 samples may indicate that the underlying distribution of pixel differences is not in general normal, and so this application does not correspond to the ideal experiments described in section 3. In Figure 3b the estimated accuracy interval, based on the t-test, is fixed, subject to the restriction that the sample size be at least 10. (Note that in Figure 3b, a 10% accuracy interval means that we are 95% confident that the sample mean is within 10% of the true mean.) These experiments suggest that the optimal accuracy interval at 95% confidence is nearly 100%, which corresponds to taking on the average 10 samples per evaluation. Given that the performance level is nearly identical whether we take exactly 10 samples per evaluation or we take on the average 10 samples, the first approach is preferable, since it does not require the calculation of the t-test for each sample. It should be pointed out that, as in the experiment on f(x,y,z), the GA overhead is ignored here. If the overhead were
included, the optimal sample size would be somewhat larger. In any case, it is clear that a substantial advantage is obtained in statistical evaluation by reducing sampling sizes and accuracies, at least for this case of image registration.

5. Conclusions

GA's search by allocating trials to hyperplanes based on an estimate of the relative performance of the hyperplanes. One result of this approach is that the individual structures representing the hyperplanes need not be evaluated exactly. This observation makes GA's applicable to problems in which evaluation of candidate solutions can only be performed through Monte Carlo techniques. The present work suggests that in some cases the overall efficiency of GA's may be improved by reducing the time spent on individual evaluations and increasing the number of generations performed.

This work suggests some topics which deserve deeper study. First, the GA incurs some overhead in performing operations such as selection, crossover, and mutation. If the GA runs for many more generations as a result of performing quicker evaluations, this overhead may offset the time savings. Future studies should account for this overhead in identifying the optimal time to be spent on each evaluation. Second, it would be interesting to see how using approximate evaluations affects the usual kinds of performance metrics, such as online and offline performance. Finally, additional theoretical work in this area would be helpful, since experimental results concerning, say, the optimal sample size can be expected to be highly application dependent.

References

1. D. G. Barber, "Automatic Alignment of Radionuclide Images," Phys. Med. Biol., Vol. 27(3), pp. 387-96 (1982).

2. Daniel I. Barnea and Harvey F. Silverman, "A Class of Algorithms for Fast Digital Image Registration," IEEE Trans. Comp., Vol. 21(2), pp. 179-86 (Feb. 1972).

3. Chaim Broit, Optimal Registration of Deformed Images, Ph.D. thesis, Computer and Info.
Sci., Univ. of Pennsylvania (1981).

4. Chapman and Schaufele, Elementary Probability Models and Statistical Inference, Xerox College Publ. Co., Waltham, MA (1970).

5. K. A. De Jong, Analysis of the behavior of a class of genetic adaptive systems, Ph.D. thesis, Dept. Computer and Communication Sciences, Univ. of Michigan (1975).

6. J. Michael Fitzpatrick and Michael R. Leuze, "A class of injective two dimensional transformations," to be published.

7. J. M. Fitzpatrick, J. J. Grefenstette, and D. Van Gucht, "Image registration by genetic search," Proceedings of IEEE Southeastcon '84, pp. 460-464 (April 1984).

8. Werner Frei, T. Shibata, and C. C. Chen, "Fast Matching of Non-stationary Images with False Fix Protection," Proc. 5th Intl. Conf. Patt. Recog., Vol. 1, pp. 208-12, IEEE Computer Society (Dec. 1-4, 1980).

9. Ardeshir Goshtasby, A Symbolically-assisted Approach to Digital Image Registration with Application in Computer Vision, Ph.D. thesis, Computer Science, Michigan State Univ. (1983).

10. J. J. Grefenstette, "Optimization of control parameters for genetic algorithms," to appear in IEEE Trans. Systems, Man, and Cybernetics (1985).

11. Ernest L. Hall, Computer Image Processing and Recognition, Academic Press, Inc., New York (1979).

12. K. H. Hohne and M. Bohm, "The Processing and Analysis of Radiographic Image Sequences," Proc. 6th Intl. Conf. Patt. Recog., Vol. 2, pp. 884-897, Computer Society Press (Oct. 19-22, 1982).

13. F. James, "Monte Carlo theory and practice," Rep. Prog. Phys., Vol. 43, p. 1145 (1980).

14. J. H. Kinsey and B. D. Vannelli, "Application of Digital Image Change Detection to Diagnosis and Follow-up of Cancer Involving the Lungs," Proc. Soc. Photo-optical Instrum. Eng., Vol. 70, pp. 99-112 (1975).

15. B. Lautrup, "Monte Carlo methods in theoretical high-energy physics," Comm. ACM, Vol. 28, p. 358 (April 1985).

16. James J. Little, "Automatic Registration of Landsat MSS Images to Digital Elevation Models," Proc.
Workshop on Computer Vision: Representation and Control, pp. 178-84, IEEE Computer Society Press (Aug. 23-25, 1982).

17. Gerard G. Medioni, "Matching Regions in Aerial Images," Proc. Comp. Vision and Patt. Recog., pp. 364-65, IEEE Computer Society Press (June 19-23, 1983).

18. Michael J. Potel and David E. Gustafson, "Motion Correction for Digital Subtraction Angiography," IEEE Proc. 5th Ann. Conf. Eng. in Med. Biol. Soc., pp. 166-9 (Sept. 1983).

19. Murray R. Spiegel, Theory and Problems of Probability and Statistics, McGraw-Hill, New York (1975).

Figure 1a. Subimage im1 is represented by the smaller inner square. The arrows represent the four d-vectors.

Figure 1b. im2 is the larger image. im3 is the inner image formed by transforming im1 according to the d-vectors shown in Fig. 1a.

Figure 2a. Evaluations Until Threshold vs. Absolute Error.

Figure 2b. Evaluations Until Threshold vs. Relative Error.

Figure 3a. Performance vs. Fixed Sample Size.

Figure 3b. Performance vs. Accuracy Interval.

A connectionist algorithm for genetic search

David I.
Ackley
Department of Computer Science
Carnegie-Mellon University
Pittsburgh, PA 15213

Abstract

An architecture for function maximization is proposed. The design is motivated by genetic principles, but connectionist considerations dominate the implementation. The standard genetic operators do not appear explicitly in the model, and the description of the model in genetic terms is somewhat intricate, but the implementation in a connectionist framework is quite compact. The learning algorithm manipulates the gene pool via a symmetric converge/diverge reinforcement operator. Preliminary simulation studies on illustrative functions suggest the model is at least comparable in performance to a conventional genetic algorithm.

1 Overview

A new implementation of a genetic algorithm is presented. The possibility for it was noted during work on learning evaluation functions for simple games [1] using a variation on a recently developed connectionist architecture called a Boltzmann Machine [2]. The present work abstracts away from game-playing and focuses on relationships between genetic algorithms and massively parallel, neuron-like architectures.

This work takes function maximization as the task. The system obtains information by supplying inputs to the function and receiving corresponding function values. By assumption, no additional information about the function is available. Finding the maximum of a complex function possessing an exponential number of possible inputs is a formidable problem under these conditions. No strategy short of enumerating all possible inputs can always find the maximum value. Any unchecked point might be higher than those already examined. Any practical algorithm can only make plausible guesses, based on small samples of the parameter space and assumptions about how to extrapolate them.
However, the function maximization problem avoids two further complexities faced by more general formulations. First, performing "associative learning" or "categorization" can be viewed as finding maxima in specified subspaces of the possible input space. Second, in the most general case, the function may change over time, spontaneously or in response to the system's behavior. There the entire history of the search may affect the current location of the maximum value.

Section 2 presents the model. For those familiar with genetic algorithms, highlights of Section 2 are:

• Real-valued vectors are used as genotypes instead of bit vectors. Reproduction and crossover are continuous arithmetic processes, rather than discrete boolean processes.

• The entire population is potentially involved in each crossover operation, and crossover is not limited to contiguous portions of genes.

• The reproductive potential of genotypes is not determined by comparison to the average fitness of the population, but by comparison to a threshold. Adjusting the threshold can induce rapid convergence or diverge an already converged population.

Section 3 describes simulation studies that have been performed. The model is tested on functions that are constructed to explore its behavior when faced with various hazards. First a simple convex function space is considered, then larger spaces with local maxima are tried. Section 4 discusses the model with respect to the framework of reproductive plans and genetic operators developed in [10]. Possible implications for connectionist research are not extensively developed in this paper. Section 5 concludes the paper.

¹This research is supported by the System Development Foundation.

2 Development

The goal of this research was to satisfy both genetic and connectionist constraints as harmoniously as possible. As it turned out, the standard genetic operators appear only implicitly, as parts of a good description of how the model behaves.
On the other hand, the implementation of the model in connectionist terms is not particularly intuitive. After sketching a genetic algorithm, this section presents the model via a loose analogy to the political process of a democratic society. The section concludes by detailing the implementation of this "election" model and drawing links between the genetic, the political, and the connectionist descriptions.

2.1 Genetic algorithms. Genetic evolution as a computational technique was proposed and analyzed by Holland [10]. It has been elaborated and refined by a number of researchers, e.g. [3, 4], and applied in various domains, e.g. [13, 6]. In its broadest formulations it is a very general theory; the following description in terms of function maximization is only one of many possible incarnations.

Genetic search can be used to optimize a function over a discrete parameter space, typically the corners of an n-dimensional hypercube, so that any point in the parameter space can be represented as an n-bit vector. The technique manipulates a set of such vectors to record information gained about the function. The pool of bit vectors is called the population, an individual bit vector in the population is called a genotype, and the bit values at each position of a genotype are called alleles. The function value of a genotype is called the genotype's fitness or figure of merit.

There are two primary operations applied to the population by a genetic algorithm. Reproduction changes the contents of the population by adding copies of genotypes with above-average figures of merit. The population is held at a fixed size, so below-average genotypes are displaced in the process. No new genotypes are introduced, but changing the distribution this way causes the average fitness of the population to rise toward that of the most-fit existing genotype.
In addition to this "reproduction according to fitness," it is necessary to generate new, untested genotypes and add them to the population, else the population will simply converge on the best one it started with. Crossover is the primary means of generating plausible new genotypes for addition to the population. In a simple implementation of crossover, two genotypes are selected at random from the population. Since the population is weighted towards higher-valued genotypes, a random selection will be biased in the same way. The crossover operator takes some of the alleles from one of the "parents" and some from the other, and combines them to produce a complete genotype. This "offspring" is added to the population, displacing some other genotype according to various criteria, where it has the opportunity to flourish or perish depending on its fitness. To perform a search for the maximum of a given function, the population is first initialized to random genotypes, then reproduction and crossover operations are iterated. Eventually some (hopefully maximal valued) genotype will spread throughout the population, and the population is said to have "converged." Once the population has converged to a single genotype, the reproduction and crossover operators no longer change the makeup of the population. One technical issue is central to the development of the proposed model. In addition to reproduction and the crossover operator, most genetic algorithms include a "background" mutation operator as well. In a typical implementation, the mutation operator provides a chance for any allele to be changed to another randomly chosen value. Since reproduction and crossover only redistribute existing alleles, the mutation operator guarantees that every value in every position of a genotype always has a chance of occurring.
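The reproduction, crossover, and mutation steps just described can be sketched in a few lines of Python. This is a minimal illustration, not any particular implementation from the literature; the parameter names and values are ours.

```python
import random

def genetic_search(fitness, n, pop_size=50, p_mutation=0.01, iterations=1000):
    """A minimal sketch of the genetic search loop described in the text."""
    # Initialize the population to random n-bit genotypes.
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(iterations):
        # Reproduction: draw two parents with probability proportional to
        # fitness, biasing selection toward above-average genotypes.
        weights = [fitness(g) for g in pop]
        p1, p2 = random.choices(pop, weights=weights, k=2)
        # Simple one-point crossover: a prefix of one parent plus the
        # suffix of the other forms a complete offspring genotype.
        cut = random.randrange(1, n)
        child = p1[:cut] + p2[cut:]
        # Background mutation: every allele has a small chance to flip,
        # so no allele value can be permanently lost from the population.
        child = [b ^ 1 if random.random() < p_mutation else b for b in child]
        # Displacement: the offspring replaces a random existing genotype,
        # holding the population at a fixed size.
        pop[random.randrange(pop_size)] = child
    return max(pop, key=fitness)
```

On a simple objective such as counting 1 bits, a loop like this reliably concentrates the population near the maximum.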
If the mutation rate is too low, possibly critical alleles missing from the initial random distribution (or lost through displacement) will have only a small chance of getting even one copy (back) into the population. However, if the probability of a mutation is not low enough, information that the population has stored about the parameter space will be steadily lost to random noise. In either of these situations, the performance of the algorithm will suffer. 2.2 A democratic society metaphor. Envision the democratic political process as a gargantuan function maximization engine. The political leanings of the voting popula- tion constitute the system’s store of information about maximizing the nebulous function of “good government.” An election summarizes the contents of the store by computing simple sums across the entire population and using the totals to fill each position in the government. When the winners are known, voters informally express opinions about how well they think the elected government will fare. The bulk of the time between elections is spent estimating how well the government actually performs. By the next election, this evaluation process has altered the contents of the store: better times favor incumbents; worse times, challengers. In society, the function being optimized is neither well-defined nor arbitrary, and the final evaluation of a government must be left to history, but in the abstract realm of function maximization the true value of a point supplied to any function can be determined in a single operation. The immediacy and accuracy of this feedback creates an opportunity for an explicit learning algorithm that would be difficult to formalize in a real democracy. Credit and blame can be assigned to the voters based on how well their opinions about the successive governments predict the results produced by the objective function. 
Voters that approved of a high-scoring government can be rewarded by giving them more votes, so their preferences become a bit more influential in the subsequent election. Voters in such circumstances tend to favor the status quo. Voters whose preferences cause them to approve of a low-scoring government lose voting power, and become a bit more willing to take a chance on something new. The proposed model is built around such an approach to learning.

An iteration of the algorithm consists of three phases, which will be called "election," "reaction," and "outcome." The function maximization society is run by an n member "government" corresponding to the n dimensions of the function being maximized. In each election all n "government positions" are contested. There are two political parties, "Plus" and "Minus." A genotype represents a voter's current party preferences, recording a signed, real-valued number of votes for each of the positions. Which party wins a position depends on the net vote total for that position. A government represents a point in the parameter space, with Plus signifying a 1 and Minus signifying a 0. After an election is concluded, each voter chooses a reaction to the new government: "satisfied," "dissatisfied," or "apathetic." The complete state of a voter includes the weights of its genotype plus its reaction. In general, voters whose genotypes match well with the government (i.e., most, or the most strongly weighted, of the positions have the same signs as the genotype weights) will be satisfied and therefore share in the credit or blame for the government's performance. Voters that got about half of their choices are likely to be apathetic, and therefore are unaffected by any consequent reward or punishment. Voters that got few of their choices are likely to be dissatisfied with the election results. Dissatisfied voters share in the fate of the government, but with credit and blame reversed in a particular way discussed below.
Satisfied and dissatisfied voters are also referred to as active, and apathetic voters are also referred to as inactive. In the outcome phase, the performance of the government is tested by supplying the corresponding point to the objective function and obtaining a function value. This value is compared to the recent history of function values produced by previously elected governments to obtain a reinforcement signal. A positive result indicates a point scoring better than usual, and vice-versa. The reinforcement signal is used to adjust the preferences of the active voters. Positive reinforcement makes the reactions of the population more stable, and negative reinforcement makes them more likely to change. Finally, the newly obtained function value is incorporated into the history of function values, and an iteration is complete.

Two points are worth making before considering the actual implementation. The first point is that there is noise incorporated into both the election and the reaction processes. If the sum of the vote for a given position is a landslide, the result will essentially always be as expected, but as the vote total gets closer to zero the probability rises that the winner of the position will not actually be the party that got the most votes. There are no ties or runoff elections; if the sum of the vote for a position totals to exactly zero, the winner is chosen completely at random. Voter reactions are also stochastic, based on the net degree of match over mismatch between each genotype and the elected point. Although real election systems try to ensure that the winner got the most votes, in the proposed model this nondeterminism serves the crucial function of introducing mutation. Moreover, unlike the constant-probability mutation operator mentioned in the previous section, it is data dependent. Mutation is very likely in those positions where no consensus arises from the population, but it will almost never upset a clear favorite.
The second point is that only the currently active voters participate in the election.

[Figure 1: a sketch of an instance of the model, showing the government position units, the voter units and their link weights, the most recent function value v, and the expectation level θ.]

Satisfied voters vote in the manner described above. Dissatisfied voters vote in a sign-reversed manner: positive weights vote for Minus and negative weights vote for Plus. Apathetic voters do not vote at all, but they react to each election and may become active. Section 4 discusses a genetic interpretation of this strategy.

2.3 A connectionist implementation. The ever-increasing demand for computational power and the continuing desire to understand the human brain have encouraged research into massively parallel computational architectures that resemble the physiological picture of the brain more closely than does the standard Von Neumann model. The basic assumption of the connectionist approach (see, e.g., [5] or [7]) is that computation can be accomplished collectively by large numbers of very simple processing units that contain very little storage. The bulk of the memory of the system is located in communication links between the units, usually in the form of one or a few scalar values per link that control the link's properties. In terms of individual units and links, the Perceptron [12] typifies the kinds of hardware considered: a unit is a simple linear threshold device, adopting one of two numeric output states based on a comparison between the sum of its input links and its threshold; a link connects two units and contains a scalar variable that is multiplied by the link input to produce the link output.
In terms of problem formulations, network organizations, and learning algorithms, connectionist research has moved in many directions from the Perceptron; the proposed model uses assumptions most closely related to those employed in [1, 2, 9, 11]. There is not space to explicitly motivate all of the design decisions of the implementation, but analogies to the political and genetic descriptions are discussed as they arise.

Figure 1 sketches an instance of the model and defines terminology. The basic processing element of the model is called a unit. Each unit i has a ternary state variable s_i ∈ {+1, 0, −1}. Units communicate their current states to other units via links. A link between two units i and j has a real-valued weight w_ij. All links between units are bidirectional and have the same weight in both directions, i.e. w_ij = w_ji.

In the political analogy, groups of units represent both the government positions and the voters. In the former case, s_i represents the winner of position i, with s_i = 1 → Plus and s_i = −1 → Minus. Parameters are set so that s_i = 0 cannot occur for the position units. In the latter case, s_i represents the reaction of voter i, with s_i = 1 → "satisfied," s_i = 0 → "apathetic," and s_i = −1 → "dissatisfied." A unit simply retains its current state until it is probed, at which time it checks the states of the units it is connected to and the weights on those links and applies a probabilistic decision rule to select a state. The quantity that sums up the current context of a unit i is called ΔE_i, and is defined as

    ΔE_i = Σ_j w_ij s_j    (1)

where j ranges over all the units in the network and w_ij = 0 if units i and j are not connected. Given ΔE_i and a uniform random variable 0 ≤ ξ < 1, the decision rule is

    s_i = +1 if ξ > 1/(1 + e^((ΔE_i − a)/T)),
    s_i = −1 if ξ < 1/(1 + e^((ΔE_i + a)/T)),    (2)
    s_i = 0 otherwise.

The boundaries between the unit states are plotted in Figure 2. The size of the model parameter T > 0 (the "temperature") determines how sharply the boundaries slope as ΔE_i moves away from zero; it controls how "noisy" the system is. The model parameter a > 0 controls the width of the "apathy window" when the voter units are probed.

[Figure 2: the phase diagram of the decision rule specified by Eq. (2), showing the boundaries between the three unit states as a function of ΔE_i.]

In the political analogy, the election and reaction processes are both implemented by the probe operation. An election is performed by probing each of the position units once. Since position units connect only to voter units, the ordering of the probes is irrelevant, and the contests for each position can happen in parallel. When applied to a position unit i, the summation in Eq. (1) totals up the effective vote count for the position. If a voter unit j is apathetic, then s_j = 0 and w_ij does not affect the total for the position; otherwise either w_ij or −w_ij is included in the total, depending on whether the voter is satisfied or dissatisfied. The winner of the position is then determined by Eq. (2), applied with a = 0. As ΔE_i becomes more positive, the likelihood of Plus winning the contest increases, and vice-versa. If one takes the limit as T → 0, Eq. (2) approaches a step function corresponding to a deterministic election based only on the sign of ΔE_i.

The voter reaction is assessed symmetrically, by probing each of the voter units once. When applied to a voter unit i, the summation in Eq. (1) produces a net match score between an elected government and the voter's preferences. The match score for the voter increases when the state of position j has the same sign as w_ij and decreases when the signs differ. The voter's reaction is then determined by Eq. (2), with a set as a model parameter.
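The probe operation of Eqs. (1) and (2) can be sketched directly, assuming logistic thresholds offset by the apathy window a. The function and variable names are ours; the data layout (a dict of neighbor weights per unit) is only one convenient choice.

```python
import math
import random

def probe(i, states, weights, T, a):
    """Probe unit i: apply Eq. (1) then the stochastic rule of Eq. (2).

    states  -- list of current ternary unit states (+1, 0, or -1)
    weights -- weights[i] maps neighbor index j to w_ij (absent => 0)
    T       -- temperature; a -- apathy window (a = 0 for position units)
    """
    # Eq. (1): sum the weighted states of all connected units.
    delta_e = sum(w * states[j] for j, w in weights[i].items())
    # Eq. (2): a single uniform draw decides among +1, 0, and -1.
    # The apathy window a > 0 opens a band of 0 ("apathetic") outcomes
    # around delta_e = 0; with a = 0 the two thresholds coincide and
    # the state 0 can never be selected.
    xi = random.random()
    if xi > 1.0 / (1.0 + math.exp((delta_e - a) / T)):
        return +1
    if xi < 1.0 / (1.0 + math.exp((delta_e + a) / T)):
        return -1
    return 0
```

As T → 0 the thresholds sharpen into a step at ΔE_i = 0, recovering the deterministic election described in the text.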
A large positive ΔE_i indicates a particularly good match between a government and a voter, and generates a high probability that the voter will be satisfied and adopt s_i = 1; a large negative value indicates a particularly bad match and strongly suggests s_i = −1; and a near-zero value indicates an ambiguous situation and generates the largest probability of adopting s_i = 0. The assumption of bidirectional links with symmetric weights guarantees that a voter's behavior during elections and reactions will be consistent. If all of a voter's preferred candidates are elected, for example, then in the zero temperature limit the voter cannot be dissatisfied with the government.

In genetic terms, an election can be viewed as part of a generalized crossover operation. If we imagine one satisfied voter in an otherwise apathetic population, the outcome of a (sufficiently low temperature) election will be a direct expression of that voter's genotype: wherever the weight from the voter to a position is positive Plus will win, and vice-versa. If two voters are satisfied, some mixture of their genotypes will be expressed by the position units, depending on the relative magnitudes of the weights to the positions where the voters disagree. This situation bears a close resemblance to the standard crossover operator. The difference is that standard crossover determines the winners of disputed positions by a random choice of crossover point, whereas the proposed model exploits accumulated performance data to bias each decision.² In the general case the crossover operation is hard to see explicitly, considering the effects of many satisfied voters, the dissatisfied vote, temperature, and the fact that the crossed-over genotype is not guaranteed admission to the population. The next steps in the algorithm are straightforward. The states of the position units are translated into a binary vector I; the vector is passed to the objective function; a scalar value v is returned.
The function value has no meaning in itself since the possible range of function values is unknown. A judgment must be made whether the value is "good" or "bad," assuming that whatever is deemed good will be made more probable in the future. The expectation level θ is used to produce the reinforcement signal

    r = 2/(1 + e^((θ − v)/T_r)) − 1    (3)

²This statement is too strong if the model using standard crossover also uses inversion, since in that case the grouping induced by the crossover point does depend on the past performance of the model, as recorded by the inversion operator. Section 4 discusses inversion and crossover further.

0. Initialization: Given unknown function {v = f(I) | I ∈ 2^n, v ∈ R}. Select model parameters. Create n position units and m voter units. Link each position unit to each voter unit. Set all nm link weights w_ij = 0. Set all n + m unit states s_i = 0. Set θ = 0.
1. Election: Probe each position unit (Eqs. 1 and 2).
2. Reaction: Probe each voter unit (Eqs. 1 and 2).
3. Outcome: 3.1. Fitness test: Compute v = f(I). 3.2. Discount expectations: Compute r (Eq. 3). 3.3. Apportion credit: Update w_ij (Eq. 4). 3.4. Adjust expectations: Update θ (Eq. 5).
4. Iterate: Go to step 1.

Model parameters: m, size of population (number of voters); T, temperature of unit decisions; a, apathy window for voter reactions; k, payoff rate; T_r, "temperature" of reinforcement scaling; ρ, time constant for function averaging; δ, excess expectation.

Figure 3. Algorithm summary and list of model parameters.

This employs the same basic sigmoid function used in the unit decision rule, but r is bounded by ±1 and is used as an analog value rather than a probability. The model parameter T_r scales the sensitivity around θ = v.³ r is used to update the weights

    w_ij ← w_ij + k r s_i s_j    (4)

where k > 0 is the payoff rate. The change to each link weight depends on the product s_i s_j.
If the voter unit is apathetic the weight does not change; otherwise either kr or −kr is added to the weight, depending on whether the voter and position units are in the same or different states. If r is positive, the net effect of this is that the ΔE of satisfied units becomes more positive and the ΔE of dissatisfied units becomes more negative, i.e., each active unit becomes somewhat less likely to change state when probed. Consistency is encouraged; the incumbents are more likely to be reelected, and the voters are less likely to change their reactions. When r is negative the reverse happens. Inconsistency is encouraged; victory margins erode, and voter reaction becomes more capricious. An updating of weights with positive r is called "converging on a genotype"; with negative r, "diverging from a genotype."

In genetic terms, the weight modification procedure both implements reproduction and completes the implementation of the crossover operator. Only the crossed-over genotype as expressed in the position units is eligible for reproduction, and then only if r > 0. Otherwise the network diverges, and that genotype decreases its "degree of existence" in the population. It is displaced, by some amount, but it is not replaced with other members of the population; the total "voting power" of the population declines a bit instead. Intuitively speaking, the space vacated by a diverged genotype is filled with noise.

³The precise form of Eq. (3) does not appear essential to the model. Several variations all searched effectively, though they displayed different detailed behaviors.

The final implementation issue is the computation of the expectation level. A number of workable ways to manipulate θ have been tried, but the simulations in the next section all use a simple backward-averaging procedure

    θ_{t+1} = ρ θ_t + (1 − ρ)(v + δ)    (5)

where 0 < ρ < 1 is the "retention rate" governing how quickly θ responds to changes in v.
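Taken together, Eqs. (3), (4), and (5) make up the outcome phase, which can be sketched as follows. The default parameter values here are illustrative only, not the "standard" settings used in the simulations, and the names are ours.

```python
import math

def outcome(v, theta, states, weights, k=2.0, T_r=1.0, rho=0.75, delta=4.0):
    """One outcome phase: reinforce, apportion credit, adjust expectations."""
    # Eq. (3): squash v - theta through a sigmoid into r in (-1, +1).
    # r > 0 when the elected point scores better than expected.
    r = 2.0 / (1.0 + math.exp((theta - v) / T_r)) - 1.0
    # Eq. (4): w_ij <- w_ij + k*r*s_i*s_j. Apathetic voters (s_j = 0)
    # are untouched; dissatisfied voters (s_j = -1) receive the
    # reinforcement with reversed sign.
    for i in weights:
        for j in weights[i]:
            weights[i][j] += k * r * states[i] * states[j]
    # Eq. (5): backward-average the expectation level. The excess
    # expectation delta keeps theta above the running average, so a
    # fully converged network eventually destabilizes.
    theta = rho * theta + (1.0 - rho) * (v + delta)
    return r, theta
```

A single call with a surprisingly good v drives r toward +1, strengthening every satisfied voter's agreement with the incumbent government.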
Just allowing θ to track v is inadequate, however, for if the network completely converged there would be no pressure to continue searching for a better value. A positive value for the model parameter δ avoids this complacency and ensures that a converged network will receive more divergence than convergence, and eventually destabilize. Figure 3 summarizes the algorithm and lists the seven model parameters.

3 Behavior

This section describes preliminary simulations of the election model. Most of the objective functions considered here were explored during the design of the model, rather than being chosen as independent tests after the design stabilized. The functions were created to embody interesting characteristics of search spaces in general.

All of the simulations described in this paper use the following settings for the model parameters: m = 50, T = 10n, a = 5n, k = 20, T_r = 10, ρ = 0.75, δ = 40. Note that the temperature and the apathy window are proportional to the dimensionality n of the given parameter space. For convenience, these are called the "standard" settings, but significantly faster searching on a function of interest can be produced by fine-tuning the parameters. The standard settings were chosen because they produce moderately fast performances across the four selected functions, each tested at four dimensionalities.

The simulations count the average number of function evaluations before the model evaluates the global maximum. Two other algorithms were implemented for comparison. The first was the following hillclimbing algorithm:

1. Select a point at random and evaluate it.
2. Evaluate all adjacent points. If no points are higher than the selected point, go to step 1. Otherwise select the highest adjacent point, and repeat this step.

Iterated hillclimbing is a simple-minded algorithm that requires very little memory. Its performance provides only a weak bound on the complexity of a parameter space.
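The two-step hillclimber above can be sketched directly; here the "one max" function of Section 3.1 serves as an example objective, and the function names, `target` parameter, and evaluation budget are our additions for illustration.

```python
import random

def one_max(bits):
    """Score 10 points for each 1 bit (the 'one max' function of Sec. 3.1)."""
    return 10 * sum(bits)

def iterated_hillclimb(fitness, n, target, max_evals=100000):
    """Restart from a random point whenever no adjacent point is higher."""
    evals = 0
    while evals < max_evals:
        # Step 1: select a point at random and evaluate it.
        point = [random.randint(0, 1) for _ in range(n)]
        score = fitness(point)
        evals += 1
        while True:
            # Step 2: evaluate all n adjacent (one-bit-flip) points.
            neighbors = []
            for i in range(n):
                q = point[:]
                q[i] ^= 1
                neighbors.append((fitness(q), q))
                evals += 1
            best_score, best = max(neighbors)
            if best_score <= score:
                break  # local maximum reached: restart at step 1
            point, score = best, best_score
        if score >= target:
            return point, evals  # global maximum found
    return None, evals
```

On one max every climb is strictly uphill to the all-ones point, so a single restart always suffices; counting `evals` reproduces the performance measure used in the figures.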
The second algorithm was a basic version of Holland's R1 reproductive plan [10], using only simple crossover and mutation. Considering the lack of sophisticated operators in the implementation, and the author's inexperience at tuning its parameters, the performance of the R1 implementation should be taken only as an upper bound on the achievable performance of a simple genetic algorithm.⁴

⁴The R1 model parameter values were selected after a short period of trial and error on the test functions. Using the notation defined in [10], the values were M = 50, P_C = 1, P_I = 0, P_M = 0.5, and a mutation rate expressed as a function of the dimensionality n of the objective function. Constant offsets were added to the functions where necessary to ensure non-negative function values.

3.1 A convex space. Consider the following trivial function: Score 10 points for each 1 bit. Return the sum. The global maximum equals 10n and occurs when all bits are turned on. This "one max" function was tested because it can be searched optimally by hillclimbing, and the generality of a genetic search is unnecessary. Figure 4 tabulates the simulation results for n = 8, 12, 16, 20. As expected, the hillclimbing algorithm found the maximum more quickly than did the model, but it is encouraging that on all but the smallest case the election model comes within a factor of two of hillclimbing's efficiency on this convex space. Observations made during the simulations suggest that the relatively poorer performance of R1 arose primarily from the occasional loss of one or more critical alleles, producing the occasional very long run. Although increasing the mutation rate reduced the probability of such anomalies, it produced a costly rise in the length of typical runs.

One max (n):      8     12     16     20
Method            Evaluations performed*
Hillclimb        31     82    128    198
Election         73    117    187    302
Holland R1      195    674   1807   4161
* Rounded averages over 25 runs.

Figure 4. Comparative simulation results on the "one max" function. In all simulations, the performance measure is the number of objective function evaluations performed before the global maximum is evaluated.

3.2 A local maximum. Convex function spaces are very easy to search, but spaces of interest most often have local maxima, or "false peaks." Consider this "two max" function: Score 10 points for each 1 bit, score −8 points for each 0 bit, and return the absolute value of the sum. This function has its global maximum when the input is all 1's, but it also has a local maximum when the input is all 0's. Figure 5 summarizes the simulation results. With this function, a simple hillclimber may get stuck on the local maximum, so multiple starting points may be required.

Two max (n):      8     12     16     20
Method            Evaluations performed*
Hillclimb        37     97    186    230
Election         83    152    194    269
Holland R1      113    340    794   1622
* Rounded averages over 25 runs.

Figure 5. Comparative simulation results on the "two max" function.

Nonetheless, on this function also the hillclimber outperforms the model, although only by a narrow margin on the larger cases. The mere existence of a local maximum does not imply that a space will be hard to search by iterated hillclimbing. The regions surrounding the two maxima of the function have a constant slope of 18 points per step toward the nearer maximum. The slopes have the same magnitude, so the higher peak must be wider at its base. With every random starting point, the hillclimber is odds-on to start in the "collecting area" of the higher peak, so it continues to perform well.

3.3 Fine-grained local maxima. Consider the following "porcupine" function: Score 10 points for each 1 bit and compute the total. If the number of 1 bits is odd, subtract 15 points from the total. Return the total.
Every point that has an even number of 1 bits is a porcupine "quill," surrounded on all sides by the porcupine's "back": lower-valued points with odd numbers of 1 bits. As the total number of 1 bits grows, the back slopes upward; the task is to single out the quill extending above the highest point on the back.

Porcupine (n):    8     12     16     20
Method            Evaluations performed*
Hillclimb       145   2474  41973    n/a
Election        160    211    241    495
Holland R1      163    739   1296   3771
* Rounded averages over 25 runs.

Figure 6. Comparative simulation results on the "porcupine" function.

Unlike the first two functions, the porcupine function presents a tremendously rugged landscape when one is forced to navigate it by changing one bit at a time. Not surprisingly, hillclimbing fails spectacularly here. Figure 6 displays the results. The landscape acts like flypaper, trapping the hillclimber after at most one move, and the resulting long simulation times reflect the exponential time needed to randomly guess a starting point within a bit of the global maximum. (The hillclimber was not run with n = 20 for that reason.) On the other hand, the election model gains less than a factor of two over its performance on the one max function. The strong global property of the space (the more 1's the better, other things being equal) is detected and exploited by both genetic algorithms.⁵ Although the porcupine function reduced hillclimbing to random combinatoric search, in a sense it cheated to do so, by exploiting the hillclimber's extremely myopic view of possible places to move. A hillclimber that considered changing two bits at a time could proceed directly to the highest quill.
But increasing the working range of a hillclimber exacts its price in added function evaluations per move, and can be foiled anyway by using fewer, wider quills (e.g., subtract 25 points unless the number of ones is a multiple of three). Higher peaks may always be just "over the horizon" of an algorithm that searches fixed distances outward from a single point.

⁵The concept of parity, which determines whether one lands on quill or back, is not detected or exploited. All three algorithms continue to try many odd-parity points during the search. The general notion of parity, independent of any particular pattern of bits, cannot be represented in such simple models; the import of this demonstration is that the genetic models can make good progress even when there are aspects of the objective function that, from their point of view, are fundamentally unaccountable.

3.4 Broad plateaus. The porcupine function was full of local maxima, but they were all very small and narrow. A rather different sort of problem occurs when there are large regions of the space in which all points have the same value, offering no uphill direction. Consider the following "plateaus" function: Divide the bits into four equal-sized groups. For each group, if all the bits are 1 score 50 points, if all the bits are 0 score −50 points, and otherwise score 0 points. Return the sum of the scores for the four groups. In a group, any pattern that includes both zeros and ones is on a plateau. Between the groups the bits are completely independent of each other; within a group only the combined state of all the units has any predictive power. When n = 8 there are only two bits in a group and the function space is convex, because the sequence 00 → {01, 10} → 11 is strictly uphill. However, since each group grows larger as n increases, this function rapidly becomes very non-linear and difficult to maximize.
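The remaining three test functions can be transcribed directly from their prose definitions; the Python names are ours.

```python
def two_max(bits):
    """|10*(# of 1s) - 8*(# of 0s)|: global max at all 1s, local max at all 0s."""
    ones = sum(bits)
    return abs(10 * ones - 8 * (len(bits) - ones))

def porcupine(bits):
    """10 points per 1 bit, minus a 15-point penalty when the parity is odd,
    so every even-parity point is a 'quill' above the odd-parity 'back'."""
    total = 10 * sum(bits)
    return total - 15 if sum(bits) % 2 == 1 else total

def plateaus(bits):
    """Four equal groups: +50 if a group is all 1s, -50 if all 0s,
    and 0 (a plateau) for any mixed pattern within the group."""
    size = len(bits) // 4
    score = 0
    for g in range(4):
        group = bits[g * size:(g + 1) * size]
        if all(b == 1 for b in group):
            score += 50
        elif all(b == 0 for b in group):
            score -= 50
    return score
```

Spot-checking these definitions confirms the properties claimed in the text, e.g. two_max scores 80 at all ones but only 64 at all zeros, and plateaus scores 0 on any input with every group mixed.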
Plateaus (n):     8     12     16     20
Method            Evaluations performed*
Hillclimb        34    414   2224  13404
Election        146    392    758   2304
Holland R1      228    607   2223   8197
* Rounded averages over 25 runs.

Figure 7. Comparative simulation results on the "plateaus" function.

4 Discussion

The proposed model was developed only recently, and it has not yet been analyzed or tested extensively. Although it would be premature to interpret the model and simulations in a very broad scope, a few interesting consequences have been uncovered already. This section touches on a number of relationships between the election model and the analytic structure of schemata and generalized genetic operators developed by Holland in Adaptation in Natural and Artificial Systems (ANAS) [10].

Given a population, computational effects related to simple crossover can be achieved in many ways. For example, disputed positions could be resolved by random choices between the parents, or by appealing to a third genotype as a tie-breaker. Like simple crossover, both of these implementations perform the basic task of generating new points that instantiate many of the same schemata as the parents. An appropriate crossover mechanism interacts well with the other constraints of the model and the task domain. For example, the information represented by a DNA molecule is expressed linearly, so the sequential ordering of the alleles is critical. In these circumstances, the simple cut-and-swap crossover mechanism is an elegant solution, since it is cheap to implement and it preferentially promotes contiguous groups of co-adapted alleles.

In an unconstrained function optimization task, as little as possible should be presumed a priori about how the external function will interpret the alleles. In these circumstances, the sequential bias of the standard crossover mechanism is unwarranted. ANAS proposes an inversion operator to compensate for it.
The inversion operator tags each allele with its position number in terms of the external function, so the ordering of the genotype can be permuted to bring co-adapted alleles closer together and therefore shelter them from simple crossover. However, if two chosen parents do not have their genotypes permuted in the same way, a simple crossover between them may not produce a complete set of alleles. ANAS offers two suggestions. If inversion is a rare event, sub-populations with matching permutations can develop, and crossover can be applied only within such groups. But then information about the linkages between alleles accumulates only slowly. Alternatively, one of the parents can be temporarily permuted to match the other parent in order to allow simple crossover to work, but then half of the accumulated linkage information is ignored at each crossover. The proposed model does not use the ordering of the alleles to carry information. Linkage information is carried in the magnitudes of the genotype weights, in non-obvious ways involving all three phases and the assumption of symmetric weights. For example, the defining loci of a discovered critical schema are likely to be represented by relatively large weights on a genotype, since those weights will receive systematically higher net reinforcement than the non-critical links. Conversely, relatively large weights to a few positions cause the designated alleles to behave in a relatively tightly coupled fashion. In the election phase, large weights increase the chance that the alleles will be expressed simultaneously and receive reproduction opportunities. In the reaction phase, the same large weights increase the chance that the voter will be apathetic when the implied schema is not expressed, since the genotype’s large weights will tend to cancel. Strongly coupled alleles will be disrupted more slowly over successive outcome phases. 
Although it is not discussed in ANAS, subsequent research found it useful to include a "crowding factor" that affects how genotypes get selected for deletion to make room for a new offspring [4]. The idea is to prefer displacing genotypes that are similar to the new one, thus minimizing the loss of schemata. In the proposed model, note the interaction between the reaction phase and the outcome phase. Only active voters are affected by weight modification. Since voters tend to be satisfied or dissatisfied when they strongly match or mismatch the government, and dissatisfied voters invert the sign of the weight modifications, converging on a genotype preferentially displaces similar existing genotypes. The representation of genotypes by real-valued vectors instead of bit vectors has widespread consequences. One major difference concerns the displacement of genotypes as a result of reproduction or crossover. When a bit vector is displaced from a conventional population, the information it contained is permanently lost. In contrast, the proposed reinforcement operator is an invertible function. Between a constant government and a voter, any sequence of positive and negative reinforcements has the same effect as their sum. Observations revealed that the election model exploits this property in an unanticipated and useful way. The happenstance election of a surprisingly good government often leads to a run of reelections and positive reinforcements, occasionally freezing the network solid for a few iterations, until the expectation level catches up. If one examines the signs of the genotype weights at such a point and interprets them as boolean variables, the population often looks nearly converged.
But the expectation level soon exceeds any fixed value, and weaker negative reinforcements begin to cancel out the changes and to regenerate the pre-convergent diversity. During such times, the government positions with the smallest victory margins are the first to begin changing, which causes a period of stochastic local search in an expanding neighborhood around the convergence point. If further improvement is discovered, the network will frequently converge on it, but often the destabilization spreads until the government collapses entirely and a period of wide-ranging global search ensues. It may be that much of the election model's edge over the R1 algorithm on the strict maximization-time performance metric used in this paper arises from this tendency to hillclimb for a while in promising regions of the parameter space, without irrevocably converging the population.

5 Conclusion

The architectural assumptions of the model (the unit and link definitions, the decision rule, and the weight update rule) were first explored for reasons unrelated to genetic algorithms. The assumption of symmetric links between binary (±1) threshold units was made by Hopfield [11] because he could prove such networks would spontaneously minimize a particular "energy" function that was easily modifiable by changing link weights. Hopfield used the modifiable "energy landscape" to implement an associative memory. Hopfield's deterministic decision rule was recast into a stochastic form by Hinton & Sejnowski [8] because they could then employ mathematics from statistical mechanics to prove such a system would satisfy an asymptotic log-linear relationship between the probability of a state and the energy of the state. 0/1 binary units were used. They found a distributed learning algorithm that would provably hillclimb in a global statistical error measure. They used the system to learn probability distributions.
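The Hopfield energy function and the log-linear (Boltzmann) relationship referred to here are standard; a minimal sketch with my own variable names, for ±1 units with symmetric, zero-diagonal weights, shows both. The exhaustive state enumeration is only feasible for tiny networks and is used here just to make the log-linear relationship explicit.

```python
import itertools
import math

def energy(weights, state):
    """Hopfield energy E(s) = -1/2 * sum_ij w_ij * s_i * s_j, s_i in {-1,+1}."""
    n = len(state)
    return -0.5 * sum(weights[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))

def boltzmann_probabilities(weights, temperature=1.0):
    """P(s) proportional to exp(-E(s)/T): log P is linear in the energy."""
    states = list(itertools.product([-1, 1], repeat=len(weights)))
    scores = [math.exp(-energy(weights, s) / temperature) for s in states]
    z = sum(scores)
    return {s: score / z for s, score in zip(states, scores)}
```

For a two-unit network with a single positive link, the aligned states (+1,+1) and (-1,-1) have lower energy and therefore higher probability than the anti-aligned ones, which is the kind of weight-controlled landscape Hopfield exploited.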
The weight update rule was investigated by the author because it provided a simple method of adjusting energies of states based on a reinforcement signal for a back-propagation credit assignment algorithm [1]. ±1 binary units were used. The connectionist network was used as a modifiable evaluation function for a game-playing program. The system learned to beat simple but non-trivial opponents at tic-tac-toe. Observations made during simulations raised the possibility that genetic learning was occurring as the system evolved. In that work, the government corresponds to the game board, and a voter, in effect, specifies a sequence of moves and countermoves for an entire game. The model frequently played out variations that looked like crossed-over "hybrid strategies." The rapid spread through the units of a discovered winning strategy was suggestive of a reproduction process. The research reported here focused on that possibility. The task was simplified to avoid problems caused by legal move constraints, opposing play, and delayed reinforcement. Given an appropriate problem statement, the basic election/reaction scheme seemed to be the simplest approach. Extending the unit state and decision rule to three values occurred to the author while developing the political analogy. In theory, apathy could be eliminated, because a unit with a near-zero ΔE would pick +1 or -1 randomly, so rewards and punishments irrelevant to that unit's genotype would cancel out in the long run. In practice, explicitly representing apathy improves the signal-to-noise ratio of the reinforcement signal with respect to the genotype. The unit is not forced to take a position and suffer the consequences when it looks like a "judgment call." The algorithm generally runs faster and more consistently, but a percentage of the population is ignored at each election.
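The general shape of such a three-valued stochastic unit can be sketched as follows. This is my own illustration: the apathy band width, the logistic form, and the temperature are assumptions for the sketch, not the paper's actual parameters.

```python
import math
import random

def unit_vote(delta_e, apathy_band=0.25, temperature=1.0, rng=random):
    """Return +1, -1, or 0 (apathy) for a stochastic threshold unit.

    A near-zero energy gap yields apathy: reinforcement that is
    irrelevant to this unit's genotype is simply not collected,
    instead of being left to average out over many elections.
    """
    if abs(delta_e) < apathy_band:
        return 0  # abstain on a "judgment call"
    p = 1.0 / (1.0 + math.exp(-delta_e / temperature))
    return 1 if rng.random() < p else -1
```

With the apathy band removed (width zero), a unit with delta_e near zero votes +1 or -1 with probability near one half, which is the "in theory" behavior the text describes.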
For the large populations implied by massively parallel models, it appears to be an attractive space/time trade-off. The connectionist model presented here has a much more sophisticated genetic description than was anticipated at the outset. Only reproduction, crossover and mutation were intentionally "designed into" the model. It was a surprise to discover that the model performed functions reminiscent of other genetic operators such as inversion and crowding factors. As an emergent property, the model displays both local hillclimbing and global genetic search, shifting between strategies at sensible times. More experience with the proposed model is needed, but a crossing-over of genetic and connectionist concepts appears to have produced a viable offspring.

References

[1] Ackley, D.H. Learning evaluation functions in stochastic parallel networks. Carnegie-Mellon University Department of Computer Science thesis proposal. Pittsburgh, PA: December 4, 1984.
[2] Ackley, D.H., Hinton, G.E., & Sejnowski, T.J. A learning algorithm for Boltzmann Machines. Cognitive Science, 1985, 9(1), 147-169.
[3] Bethke, A.D. Genetic algorithms as function optimizers. University of Michigan Ph.D. Thesis, Ann Arbor, MI: 1981.
[4] DeJong, K.A. Analysis of the behavior of a class of genetic algorithms. University of Michigan Ph.D. Thesis, Ann Arbor, MI: 1975.
[5] Feldman, J. (Ed.) Special issue: Connectionist models and their applications. Cognitive Science, 1985, 9(1).
[6] Goldberg, D. Computer aided gas pipeline operation using genetic algorithms and rule learning. University of Michigan Ph.D. Thesis (Civil engineering), Ann Arbor, MI: 1983.
[7] Hinton, G.E. & Anderson, J.A. Parallel Models of Associative Memory. Hillsdale, NJ: Erlbaum, 1981.
[8] Hinton, G.E., & Sejnowski, T.J. Optimal perceptual inference. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. June 1983, Washington, DC, 448-453.
[9] Hinton, G.E., Sejnowski, T.J., & Ackley, D.H. Boltzmann Machines: Constraint satisfaction networks that learn. Technical report CMU-CS-84-119, Carnegie-Mellon University, May 1984.
[10] Holland, J.H. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.
[11] Hopfield, J.J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 1982, 79, 2554-2558.
[12] Rosenblatt, F. Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Washington, DC: Spartan, 1961.
[13] Smith, S. A learning system based on genetic algorithms. University of Pittsburgh Ph.D. Thesis (Computer science). Pittsburgh, PA: 1980.

Job Shop Scheduling with Genetic Algorithms

Dr. Lawrence Davis
Bolt Beranek and Newman Inc.

1. INTRODUCTION

The job shop scheduling problem is hard to solve well, for reasons outlined by Mark Fox et al.¹ Their chief point is that realistic examples involve constraints that cannot be represented in a mathematical theory like linear programming. In ISIS, the system that Fox et al have built, the problem is attacked with the use of multiple levels of abstraction and progressive constraint relaxation within a frame-based representation system. ISIS is a deterministic program, however, and faced with a single scheduling problem it will produce a single result. Given the vast search space where such unruly problems reside, the chances of being trapped on an inferior local minimum are good for a deterministic program. In this paper, techniques are proposed for treating the problem non-deterministically, with genetic algorithms.

2. JOB SHOP SCHEDULING: THE PROBLEM

A job shop is an organization composed of a number of work stations capable of performing operations on objects. Job shops accept contracts to produce objects by putting them through a series of operations, for a fee.
They prosper when the sequence of operations required to fill their contracts can be performed at their work centers for less cost than the contracted amount, and they languish when this is not done. Scheduling the day-to-day workings of a job shop (specifying which work station is to perform which operations on which objects from which contracts) is critical in order to maximize profit, for poor scheduling may cause such problems as work stations standing idle, contract due dates not being met, or work of unacceptable quality being produced. The scheduling problem is made more difficult by the fact that factors taken into account in one's schedule may change: machines break down, the work force may be unexpectedly diminished, supplies may be delayed, and so on. A job shop scheduling system must be able to generate schedules that fill the job shop's contracts, while keeping profit levels as high as practicable. The scheduler must also be able to react quickly to changes in the assumptions its schedules are based on. In what follows, we shall consider a simple job shop scheduling problem, intended to be instructive rather than realistic, and show how genetic algorithms can be used to solve it.

3. SJS - A SIMPLIFIED JOB SHOP

SJS Enterprises makes widgets and blodgets by contract. There are six work stations in SJS. Centers 1 and 2 perform the grilling operation on the raw materials that are delivered to the shop. Centers 3 and 4 perform the filling operation on grilled objects, and centers 5 and 6 perform the final milling operation on filled objects. Widgets and blodgets go through these three stages when they are manufactured. Thus, the sequence of processes to turn raw materials into finished objects is this: RAW MATERIALS - GRILLING - FILLING - MILLING - CUSTOMER. SJS has collected a number of statistics about its operations.
Due to differences in its machinery and personnel, the expected time for a work station to complete its operation on an object is as follows, in minutes:

    WORK STATION    WIDGETS    BLODGETS
    1               5          15
    2               8          20
    3               10         12
    4               8          15
    5               3          6
    6               4          8

The cost of running each of the work stations at SJS is as follows, per hour:

    WORK STATION    IDLE    ACTIVE
    1               10      70
    2               20      60
    3               10      70
    4               10      70
    5               20      80
    6               20      100

In addition, SJS has overhead costs of 100 units per hour. Finally, it requires some time for a work station to change from making widgets to making blodgets (or vice versa). The change time for each station is:

    WORK STATION    CHANGE TIME
    1               30
    2               10
    3               20
    4               20
    5               9
    6               18

4. A SCHEDULING PROBLEM

Suppose SJS is beginning production with two orders, one for 10 widgets and one for 10 blodgets. How should it be scheduled so as to maximize profits from the point at which operations begin, to the point at which both orders are filled? Let us consider three schedules that address this problem. In schedule 1, individual work stations are assigned their own contracts. We notice that the production of blodgets takes longer than the production of widgets, and so we make widgets with centers 2, 4, and 6, and make blodgets with centers 1, 3, and 5. If the shop follows this schedule, the various work stations are occupied as follows:

    STATION    CONTRACT    WORKING    WAITING    HRS-WORKED    COST
    1          blodgets    0-150      30         3             210
    2          widgets     0-80       40         2             140
    3          blodgets    15-162     60         3             210
    4          widgets     8-88       40         2             150
    5          blodgets    27-168     120        3             240
    6          widgets     16-92      80         2             220

In simulating the operation of the job shop under this plan, we note that some work stations spend a good deal of time waiting for objects to work on. Work stations 5 and 6, for example, spend from one to two hours waiting because they are faster than the centers that feed objects to them.
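For concreteness, the statistics above translate directly into data structures. This encoding (names and dict layout) is my own, not the paper's; the numbers are those of the SJS tables.

```python
# Expected processing time in minutes, per work station and product.
PROCESS_MINUTES = {
    1: {"widget": 5,  "blodget": 15},
    2: {"widget": 8,  "blodget": 20},
    3: {"widget": 10, "blodget": 12},
    4: {"widget": 8,  "blodget": 15},
    5: {"widget": 3,  "blodget": 6},
    6: {"widget": 4,  "blodget": 8},
}

# Hourly running cost per work station, idle vs. active.
HOURLY_COST = {
    1: {"idle": 10, "active": 70},
    2: {"idle": 20, "active": 60},
    3: {"idle": 10, "active": 70},
    4: {"idle": 10, "active": 70},
    5: {"idle": 20, "active": 80},
    6: {"idle": 20, "active": 100},
}

OVERHEAD_PER_HOUR = 100

# Minutes needed to switch a station between widgets and blodgets.
CHANGE_MINUTES = {1: 30, 2: 10, 3: 20, 4: 20, 5: 9, 6: 18}
```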
It is possible to let them stand idle for the first hour of the day without delaying the filling of the orders, yielding a second schedule with cost 970, a 17 per cent reduction over the first schedule, achieved by giving these work stations an initial idle hour. A different way to cut down on the waiting time would be to leave work station 6 idle throughout the day, performing all operations with work station 5 during the second and third hours of the day. Work station 5 must start work on blodgets when it begins, switch to widgets later on and finish them, then switch back to making blodgets at the end. The cost of this schedule is 950, an 18.8 per cent reduction over the direct cost of the first schedule. It is interesting to note that a deterministic system would be likely to try one or the other of the two optimizations on the first schedule, but not both. Each of these optimizations brings the situation to a local minimum in cost, and advance predictions of which such optimization will be best appear difficult to make.

5. AN AMENABLE REPRESENTATION OF THE PROBLEM

If we consider a schedule to be a literal specification of the activity of each work station, perhaps of the form "Work station w performs operation o on object x from time t1 to time t2," then one will be caught in a dilemma if one applies genetic techniques to this problem. Either one will attempt to use CROSSOVER operations or not. If so, their use will frequently change a legal schedule into an illegal one, since exchanging such statements between chromosomes will cause operations to be ordered for which the necessary previous operations have not been performed. As a result, one would acquire the benefits of CROSSOVER operations at the cost of spending a good deal of one's time in a space of illegal solutions to the problem.
If one forgoes CROSSOVER operations, however, one loses the ability to accelerate the search process, the very feature of the genetic method that gives it its great power. There is a solution to this dilemma.² It is to use an intermediary, encoded representation of schedules that is amenable to crossover operations, while employing a decoder that always yields legal solutions to the problem. Let us consider the scheme of representations and decoders that generated the second and third schedules above. A complete schedule for the job shop was derived from a list of preferences for each work station, linked to times. A preference list had an initial member, a time at which the list went into effect. The rest of the list was made up of some permutation of the contracts available, plus the elements "wait" and "idle". The decoding routine for these representations was a simulation of the job shop's operations, assuming that at any choice point in the simulation, a work station would perform the first allowable operation from its preference list. Thus, if work station 5 had a preference list of the form (60 contract1 contract2 wait idle), and it was minute 60 in the simulation, the simulator looked to see whether there was an object from contract 1 for the work station to work on. If so, that was the task the work station was given to perform. If not, the simulator looked to see whether there was an object from contract 2 to work on. If so, it set the work station to change status to work on contract 2, noting the elapsed time if contract 1 had been worked on last, and then set it to work on the new object. If not, the station waited until an object became available. By moving the "wait" element before contract2, one could cause the work station to process objects from contract 1 only, never changing over to contract 2.
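The decoder's choice rule can be sketched as a small function. The function and argument names are mine; `available` is assumed to map each contract name to the number of objects currently waiting at the station.

```python
def first_allowable(preference_list, available):
    """Pick the first allowable action from a work station's preference list.

    preference_list has the form (start_time, action, action, ...), where
    an action is a contract name, "wait", or "idle".  A contract is
    allowable only when an object from it is waiting at the station;
    "wait" and "idle" are always allowable.
    """
    for action in preference_list[1:]:
        if action in ("wait", "idle"):
            return action
        if available.get(action, 0) > 0:
            return action
    return "wait"
```

With the list (60, "contract1", "contract2", "wait", "idle"), the station takes contract 1 work when it is available, falls back to contract 2 otherwise, and moving "wait" ahead of "contract2" pins the station to contract 1, just as described above.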
Representing the problem in this way guarantees that legal schedules will be produced, for at each decision point the simulator performs the first legal action contained on a work station's list of all available actions. The decoding routine is a projected simulation, and the evaluation of a schedule is the cost of the work stations performing the tasks derived in the simulation. As we shall see, the simulation decoder also provides some information that will guide operations to perform useful alterations of a schedule.

6. DETAILS OF OPERATION

The program used a population of size 30, and ran for 20 generations. The problem was tried 20 times. It converged on variations of Schedule 2 fourteen times and on a variation of Schedule 3 six times.³ The operations used were derived from those optimizations made by us as we tried to solve the problem deterministically:

RUN-IDLE: If a work station has been waiting for more than an hour, insert a preference list with IDLE as the second member at the beginning of the day, and move the previous initial list to time 60. The probability of applying this operation was the percentage of time the work station spent waiting, divided by the total time of the simulation.

SCRAMBLE: Scramble the members of a preference list. Probability was 5 per cent for each list at the beginning of the run, tapered to 1 per cent at the last generation.

CROSSOVER: Exchange preference lists for selected work stations. Probability was 40 per cent at the beginning of the run, tapered to 5 per cent at the last generation.

Each member of the initial population associated a list of five preference lists with each work station. The preference lists were spaced at one-hour intervals, and each was a random permutation of the legal actions. The evaluation function summed the costs of simulating the run of the system for five hours with the schedule encoded by an individual.
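The tapered probabilities and the two structural operators might be sketched as follows. This is a rough reconstruction under my own naming: the paper gives only the endpoints of each taper, and I have assumed a linear taper and a fifty-fifty exchange within CROSSOVER.

```python
import random

def tapered(start, end, generation, last_generation):
    """Linearly interpolate an operator probability across the run."""
    return start + (end - start) * generation / last_generation

def scramble(pref_list, rng=random):
    """SCRAMBLE: shuffle the actions, keeping the start time in place."""
    actions = list(pref_list[1:])
    rng.shuffle(actions)
    return (pref_list[0], *actions)

def crossover(parent_a, parent_b, stations, rng=random):
    """CROSSOVER: exchange whole preference lists for selected stations.

    parent_a and parent_b map station numbers to preference lists; the
    child keeps parent_a's lists except where it takes parent_b's.
    """
    child = dict(parent_a)
    for station in stations:
        if rng.random() < 0.5:
            child[station] = parent_b[station]
    return child
```

Because crossover swaps only whole preference lists, every child decodes to a legal schedule, which is the point of the encoded representation.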
(Although SJS overhead costs are not included in the discussion of the three schedules earlier, the evaluation function included them.) If, at the end of five hours, the contracts were not filled, 1000 was added to the run costs.

7. CONCLUDING OBSERVATIONS

The example discussed above is much simpler than those one would encounter in real life, and the range of operations employed here would have to be widely expanded if a realistic example were approached. In addition, the system here would have to be extended to handle the sorts of phenomena that the ISIS team has handled: establishing connections between levels of abstraction, finding useful operations, and building special constraints into the system, for example. My belief is that these things could be done if they are successfully done by a deterministic program, for it has been our experience that a quick, powerful way to produce a genetic system for a large search problem is to examine the workings of a good deterministic program in that domain. Wherever the deterministic program produces an optimization of its solution, we include a corresponding operation. Wherever it makes a choice based on some measurement, we make a random choice, using each option's measurement to weight its chances of being selected. The result is a process of mimicry that, if adroitly carried out, produces a system that will out-perform the deterministic predecessor in the same environmental niche. In the case of the schedules produced above, the genetic operators were just those optimizations of schedules that seemed most beneficial when we attempted to produce good schedules by hand. The crudeness of the approach stems from our lack of any fully specified deterministic solution to more realistic scheduling problems. When fuller descriptions of knowledge-based scheduling routines are available, it will be interesting to investigate their potential for conversion into genetic scheduling systems.
FOOTNOTES

1 "ISIS: A Constraint-Directed Reasoning Approach to Job Shop Scheduling," Mark S. Fox, Bradley P. Allen, Stephen F. Smith, Gary A. Strohm. Carnegie-Mellon University Research Report, 1983.

2 The strategy of encoding solutions in an epistatic domain for operation purposes, while decoding them for evaluation purposes, was worked out and applied to a number of test cases by a group of researchers at Texas Instruments, Inc. The group included me, Nichael Cramer, Garr Lystad, Derek Smith, and Vibhu Kalyan.

3 A number of variations in the scheduling that made no difference in the final evaluation have been omitted in this summary.

Compaction of Symbolic Layout using Genetic Algorithms

Michael P. Fourman
Dept of Electrical and Electronic Engineering
Brunel University, Uxbridge
Middx., UK.
michael Rbruser @ucl-cs.AC.UK

Introduction.

Design may be viewed abstractly as a problem of optimisation in the presence of constraints. Such problems become interesting once the space of putative solutions is too large to permit exhaustive search for an optimum, and the payoff function too complex to permit algorithmic solutions. Evolutionary algorithms [Holland 1975] provide a means of guiding the search for good solutions. These algorithms may be viewed as embodying an informal heuristic for problem solving along the lines of "To find a better strategy, try variations on what has worked well in the past." Here, a "strategy" is an attempt at a solution. A strategy will generally not address all the constraints imposed by the problem. The algorithms we are considering guide the search by comparing strategies. We represent this comparison by the relation a beats b (which will usually be a partial order, but need not be total). We call strategies which satisfy all the constraints of the problem "solutions". In general, solutions should beat other strategies and, of course, some solutions will beat others.
Abstractly, the algorithms merely search for strategies which are well-placed in this ordering. Many problems in silicon design involve intractable optimisation problems, for example, partitioning, placement, PLA folding and layout compaction. We say a problem is intractable when the combinatorial complexity of the solution space for the problem makes exhaustive search impossible, and the varied nature of the constraints which must be satisfied makes it unlikely that there is a constructive algorithmic solution to the problem. Automatic solution of such problems requires efficient search of the solution space. Simulated annealing has been applied to the first three problems [Kirkpatrick et al. 1983]; branch and bound techniques have been applied to layout compaction [Schlag et al. 1983]. In this paper we report on the application of a genetic algorithm to layout compaction. The first prototype solved a highly simplified version of the problem. It produced layouts of a given family of rectangles under the constraint that no two shall overlap, with cost given by the area of a bounding box. A more realistic prototype deals with the layout of a family of rectangular modules with a single level of interconnect. These prototypes allow the designer to add his ideas to the evolving population of layouts and thus supplement rather than replace his expertise.

Symbolic Layout.

A circuit diagram conveys connectivity information:

[Figure: circuit diagram]

To manufacture the circuit this must be transformed to a representation in terms of layout elements; each layout element must be assigned an absolute mask position. A layout diagram conveys this mask-making information.
The passage from a circuit diagram to a layout may be divided into three stages: firstly the topology (relative positioning of layout elements) of the layout is designed and represented by a symbolic layout; then a mask level is assigned to each wire in the circuit, so that the design is represented by a stick diagram; finally the mask geometry (absolute sizes and positions) is created. Engineers commonly use these intermediate notations to represent the intermediate stages in the design process. Here is a mask layout for our circuit:

[Figure: mask layout]

Here is a symbolic version of this layout:

[Figure: symbolic layout]

The corresponding stick diagram is:

[Figure: stick diagram]

A symbolic layout is a representation of a circuit design which includes some layout information. The symbolic layout represents a number of design decisions on the relative placement of circuit elements. A stick diagram may be regarded as a symbolic layout with a greater variety of symbols. The procedure leading from a symbolic layout to a mask layout is a form of compaction. In general, there are many realisations of a given symbolic layout. The aim of compaction is to produce a layout respecting the constraints implicit in the symbolic layout while optimising performance and yield. Current compaction algorithms require the designer to provide a layout as input. Compaction usually consists of the modification of this layout by sliding elements closer together while retaining the topology. Clearly, the order in which elements are moved affects the result. Most algorithms simply compact in each coordinate direction in turn. Modern designs are modularised hierarchically. The process of symbolic layout and compaction may occur at any level of this hierarchy. The example we have used for illustration above is a leaf cell (a dynamic NMOS shift register cell) from the bottom level of the hierarchy.
Leaf cell layout provides great opportunities for area reduction and yield enhancement, as these cells are replicated many times and any small improvements at this level have a magnified effect on the chip. Optimising leaf cell layout requires awareness of many interacting constraints and complex cost functions (for example, connectivity constraints given by the circuit design, geometric constraints given by the process design rules, and the cost functions arising from performance requirements and knowledge of yield hazards). Because of this, constructive algorithmic solutions to this problem have not proved efficient. Traditionally, this area of design has been left to human experts. We hope to apply genetic algorithms to leaf-cell compaction, and have implemented two prototypes to explore the applicability of these methods in this domain.

Genetic Algorithms.

Genetic algorithms are applicable to problems whose solution may be arrived at by a process of successive approximations. This means that we need to be able to modify strategies in such a way that modifications of good strategies are likely to be better than randomly chosen strategies. A simple heuristic in this setting would be to take a strategy, a, and randomly generate a modification, M(a), of it which may, or may not, be accepted on a probabilistic basis. An algorithm embodying this idea is simulated annealing [Kirkpatrick et al. 1983]. The algorithm proceeds by starting with a strategy and repeatedly modifying it in this way, varying the acceptance procedure according to the value of a variable called temperature. If M(a) beats a, the modification is accepted. If a beats M(a), the modification may be accepted (the probability of this increases with temperature and decreases if M(a) is badly beaten). The algorithm is run, starting at a high temperature which is gently lowered. This simulates the mechanism whereby a physical system, gently cooled, tends to reach a low-energy equilibrium position.
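The acceptance procedure described here is the standard Metropolis criterion; a minimal sketch, assuming a numeric score where lower is better (the paper itself compares strategies only through the beats relation):

```python
import math
import random

def accept(delta, temperature, rng=random):
    """Metropolis acceptance rule for simulated annealing.

    delta = score(M(a)) - score(a), lower scores being better.  An
    improvement is always accepted; a worsening is accepted with a
    probability that rises with temperature and falls as the
    modification is more badly beaten.
    """
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)
```

Lowering the temperature over the run makes bad moves ever less likely, which is the gentle cooling the text describes.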
Genetic algorithms apply where the strategies have more structure. (In fact, in most applications of simulated annealing, this extra structure is available.) Strategies are represented as conjunctions of elementary restrictions on the search space, or decisions. The evolutionary algorithm produces a population of strategies, rather than a single strategy. The idea is that by combining some parts of one good strategy with some parts of another, we are likely to produce a good strategy. Thus in generating the progeny of a population, we allow not only modifications or mutation, but also reproduction, which combines part of one strategy with part of another. The basic step is to take a population and produce a number of progeny using a combination of mutation and reproduction. The progeny compete with the older generation, and each other, for the right to reproduce. If reproduction is to maintain good performance, we need to be able to divide strategies in such a way that decisions which cooperate are likely to stay together. This is accomplished in an indirect and subtle manner. Strategies are represented as strings of decisions. The child, R(a,b), of a and b is generated by randomly splitting a and b and joining part of one to part of the other. Thus, decisions which are close together in the string are likely to stay together. To allow cooperating decisions to become close together, we include inversions (which merely choose some substring and reverse it) among the possible mutations. These act together with reproduction and selection, to move decisions which cooperate closer to each other. Nothing analogous to the temperature used in simulated annealing appears explicitly in the genetic algorithm. The likelihood that a nascent individual will survive to reproduce depends on the degree of competition it experiences from the rest of the population.
As the population adapts, the competition heats up, which has the same effect as the cooling in the simulation of annealing. Although genetic algorithms may be seen as a generalisation of simulated annealing, mutation plays a subsidiary role to reproduction. The population at any generation should be viewed as a repository of information summarizing the results of previous evaluations. Individuals which perform well survive to reproduce. Reproduction acts to propagate combinations of decisions occurring in these individuals. The better an individual performs, the longer it will survive and the more chances it has to reproduce. The relative frequencies with which various groups of decisions occur in the population record the degree to which they have been found to work well together. Holland has shown that (under appropriate statistical assumptions) the effect of the genetic algorithm is to use this information to effect an optimal allocation of trials to the various combinations of genes.

The genetic algorithm.

The genetic algorithm evolves populations of individuals. In our implementation, each individual is characterised by a chromosome which is a string of genes. The length of chromosomes is not fixed. New individuals are produced by a stochastic mix of the classic genetic operators: crossover, mutation and inversion. Crossover picks two individuals at random from the population, randomly cuts their chromosomes and splices part of one with part of the other to form a new chromosome. Mutation picks an individual from the population and, at a randomly chosen number of points in its chromosome, may delete, create or replace a gene. Inversion reverses some substring of a randomly selected chromosome.

A Simple Layout Problem.

The layout problem addressed by our first prototype may be thought of as a form of 2-dimensional bin packing: a collection of rectangles is to be placed in the plane to satisfy certain design rules and minimise some cost function.
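The three classic operators on variable-length gene strings, as just described, might look like this in outline. The gene pool, mutation rate, and naming are illustrative assumptions, not the paper's implementation:

```python
import random

def crossover(a, b, rng=random):
    """Cut two chromosomes at random points and splice the pieces."""
    i = rng.randrange(len(a) + 1)
    j = rng.randrange(len(b) + 1)
    return a[:i] + b[j:]

def mutate(chromosome, gene_pool, rate=0.1, rng=random):
    """At random points, delete, create or replace a gene."""
    out = []
    for gene in chromosome:
        r = rng.random()
        if r < rate / 3:
            continue                           # delete this gene
        if r < 2 * rate / 3:
            out.append(rng.choice(gene_pool))  # replace it
        else:
            out.append(gene)                   # keep it
        if rng.random() < rate / 3:
            out.append(rng.choice(gene_pool))  # create a new gene
    return out

def inversion(chromosome, rng=random):
    """Reverse a randomly chosen substring of the chromosome."""
    if len(chromosome) < 2:
        return list(chromosome)
    i, j = sorted(rng.sample(range(len(chromosome) + 1), 2))
    return chromosome[:i] + list(reversed(chromosome[i:j])) + chromosome[j:]
```

Note that crossover of two chromosomes of different lengths naturally yields a child of yet another length, which is why the implementation must tolerate variable-length chromosomes.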
The simplest version of this problem (the one we address) has rectangles of fixed sizes, the design rule that distinct rectangles should not overlap, and cost given by the area of a bounding box. This version of the problem is already intractable: suppose we satisfy the constraint that the distinct rectangles, p, q, should not overlap, by stipulating that one of the four elementary constraints

    p above q
    p below q
    p left_of q
    p right_of q

is satisfied. Then, for a problem with n rectangles, we have N = n^2 - n pairs and, a priori, 4^N elements in our search space. In fact, this estimate of the size of the problem is unreasonably large; there are ways of reducing the search space significantly; for example, "branch and bound" procedures have been used [Schlag et al. 1983].

Layout Strategies.

We consider layout strategies which consist of consistent lists of elementary constraints (as above). Given such a list, the rectangles are placed in the first quadrant of the plane as close to the origin as is consistent with the list of elementary constraints. (The procedure which interprets the constraints is very unintelligent. For example, it interprets 'p above q' by ensuring that the y-coordinate of the bottom of p is greater than that of the top of q, even if p is actually placed far to the right of q because of other constraints.) Any inconsistent lists of constraints produced by the genetic operators are discarded.

Populations of consistent lists of constraints are evolved using various orderings for selection. When defining a selection criterion, various conflicting factors must be addressed. For example, our simplest criterion attempts firstly to remove design-rule violations and then to reduce the area of the layout. Strategies with fewer violations beat those with more and, for those with the same number of violations, strategies with smaller bounding boxes win.
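This simplest criterion (violations first, area as tie-breaker) is a lexicographic comparison; roughly, with record fields of my own naming:

```python
def beats(a, b):
    """True if strategy a beats strategy b under the simple criterion.

    a and b are dicts carrying a count of design-rule 'violations' and
    the 'area' of the layout's bounding box: fewer violations always
    wins, with smaller bounding-box area breaking ties.
    """
    return (a["violations"], a["area"]) < (b["violations"], b["area"])
```

Under this ordering a large but legal layout beats a compact layout with a single violation, which is exactly the behaviour that produced the "unpromising strategies" discussed next.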
This simple prioritising of concerns led to the generation of some rather unpromising strategies; while the selection criterion was busy removing design rule violations, for example, any strategy with few such violations (compared to the current population norm) was accepted. Typically, these would have large areas and redundant constraints. The algorithm would later have to spend time refining these crude attempts. In an attempt to mitigate this effect, we added a further selection, favouring shorter chromosomes, all other things being equal. Smith has pointed out that implementations of the genetic algorithm allowing variable length chromosomes tend to produce ever longer chromosomes (as chromosomes below a certain length are selected against). We did not find this an overwhelming problem, as longer chromosomes were more likely to be rejected as inconsistent by the evaluation function. Nevertheless, we did find that the performance of the algorithm was improved by introducing a selection favouring shorter chromosomes. We also experimented with trade-offs between the various criteria, established by computing a composite score for each strategy and letting the strategy with the better score win. We found that the genetic algorithm was remarkably robust in optimising the various scoring functions we tried. However, the results were often unexpected; the algorithm would find ways of exploiting the trade-offs provided in unanticipated ways. We have not yet found a selection criterion of this type which works uniformly well over a range of examples. However, by tuning the selection criterion to the example, good solutions have been obtained. A better way of combining our various concerns was found. Rather than address the concerns serially, or try to address all the concerns at once, we select a concern randomly each time we have a selection to make. A number of predicates for comparing two individuals were programmed.
(For example, comparing areas of bounding boxes, comparing areas of design rule violations, comparing the areas of rectangles placed.) Each time we are asked to compare two individuals, we non-deterministically choose one of these criteria and apply it, ignoring the others. This works surprisingly well. It is easy to code in new criteria and to adjust the algorithm by changing the relative frequencies with which the criteria are chosen. The resulting populations show a greater variability than with deterministic selection, and alleles which perform well in some respects, but would have been selected out with our earlier deterministic approach, are retained. Results. Most of our experiments with this prototype have been based on problems with a large amount of symmetry, for which it is easy (for us) to enumerate the optimal solutions. If we actually wanted to solve these problems, other approaches exploiting the symmetries available would certainly be more efficient. However, for the purpose of evaluating the performance of the genetic algorithm, we claim these examples are not too misleading. The algorithm is not provided with any knowledge of the symmetries of the problem nor of the arithmetical relationships between the sizes of the rectangles. For the purposes of evaluating the applicability of the genetic algorithm to layout compaction, the prototype is probably pessimistic. Real layout problems are far more constrained (by, for example, connectivity constraints). This not only reduces the size of the search space per se, but also appears to localise the interdependence of various genes, making the problem more suitable for the genetic algorithm. An analysis of a very simple example is instructive. The example consists of six rectangles, three 3 × 1 (horizontal) and three 1 × 3 (vertical). A minimal solution of this problem was found (consistently) in under 50 generations with 20 progeny per generation (1000 points of the search space evaluated).
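The randomised selection just described — draw one comparison criterion per decision, with adjustable relative frequencies, and ignore the rest — can be sketched in a few lines. This is our own illustration; the criteria and weights below are hypothetical stand-ins for the predicates mentioned in the text.

```python
import random

# Sketch of non-deterministic criterion selection: each comparison of two
# layout strategies applies one randomly chosen criterion, ignoring the
# others.  Strategies are represented here as plain dicts of scores
# (a hypothetical stand-in for the real evaluation results).

def area_score(s):        # bounding-box area: smaller is better
    return s["area"]

def violation_score(s):   # area of design-rule violations: smaller is better
    return s["violations"]

CRITERIA = [area_score, violation_score]
WEIGHTS = [1, 3]          # relative frequencies with which criteria are chosen

def better(x, y, rng=random):
    crit = rng.choices(CRITERIA, weights=WEIGHTS)[0]
    return x if crit(x) <= crit(y) else y
```

Adjusting the algorithm amounts to editing CRITERIA and WEIGHTS; when one strategy dominates the other on every criterion, the winner is the same whichever criterion is drawn.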
A solution to this problem must say how each of these rectangles is constrained, both horizontally and vertically. Thus the search space has 6¹² (about 2 × 10⁹) points. The problem has 8 basic solutions and a symmetry group of order 36. There are about 7.5 × 10⁶ points/solution. Of these, we only examine some 10³. Representing Layout. Our first prototype deals with a problem which has little direct practical significance for VLSI layout. (However, Rob Holte has pointed out that scheduling problems from operations research might be represented by minor variations on our prototype problem.) As a next step towards a practical layout tool, we have implemented a system which compacts a simple form of symbolic layout. The problem is to formalise the constraints implicit in the symbolic layout, and to find a representation, suitable for the genetic algorithm, for layout strategies. We consider a symbolic layout of blocks connected by wires. The rectangles (blocks) are of fixed size and may be translated but not rotated. The interconnecting lines (wires) are of fixed width but variable length. The interconnections shown must be maintained, and no others are allowed. In addition, there are design rules which prohibit unconnected pairs of tiles (wires or blocks) from being placed too close together. This form of the symbolic layout problem was introduced by [Schlag et al. 1983]. Here is their example of a simple symbolic layout: We represent the problem at two levels. A surface level deals with tiles of three kinds - blocks, horizontal wires and vertical wires. In addition to evolving layout constraints dealing with the relative positions of tiles (above, right_of etc. as before), we use a fixed list of structural constraints, to represent the information in the symbolic layout, and fundamental constraints which represent the size limitations on tiles.
Structural constraints have the following forms:

v crosses h, N b v, S b v, E b h, W b h

where v, h are vertical and horizontal wires and b is a block. These constraints allow us to stipulate which wires cross (and hence are connected) and which wires connect to which edges (North, South, East or West) of which blocks. At a deeper level, unseen by the user, the problem is represented in terms of the primitive layout elements

north b, south b, east b, west b, left h, right h, y_posn h, top v, btm v, x_posn v,

whose names are self-explanatory. For each tile, we generate a list of fundamental constraints expressing the relationship between the primitive layout elements arising from it. This representation allows both blocks and wires to stretch. The example above is represented by declaring the widths of the wires and sizes of the blocks and then specifying the following list of constraints. (We use a LISP list syntax as it is more widely familiar; actually, our implementation is written in ML.):

((E B1 H2) (crosses V3 H2) (crosses V3 H3) (crosses V4 H3) (N B4 V4) (W B5 H3) (S B1 V1) (crosses V1 H1) (crosses V2 H1) (N B2 V2) (E B2 H5) (W B3 H5) (S B4 V6) (crosses V6 H4) (crosses V5 H4) (N B3 V5) (S B5 V7) (N B6 V7))

Again, we evolve lists of layout constraints. These are compiled, together with the fixed structural and fundamental constraints representing the symbolic layout, to give graphs of constraints on the primitive layout elements, whose positions are thus determined. The number of design-rule violations and the area of the resulting layout are again used to select between rival strategies. Solutions to this problem were found in around 200 generations of 20 progeny, and this was reduced to around 150 generations when the algorithm was given a few "hints" in the form of extra constraints. Watching the evolving populations showed that progress was rapid for around 50 generations.
Thereafter, the algorithm appeared to get stuck for long periods on local minima (in the sense that one configuration would dominate the population). This lack of variation in the population reduced the usefulness of crossover. When mutation led to a promising new configuration, there would be a period of experimentation leading rapidly to a new local minimum. This might suggest that either the population size (100) or the probability of mutation being used as an operator (0.1) is too small. We have not yet experimented with variations on these parameters. We think that better solutions would be either to introduce a further element of competition into the genetic algorithm by penalising configurations which become too numerous (implementing this is problematical), or to evolve a number of populations allowing a limited degree of "intermarriage". (We are currently implementing the latter approach. If it is successful it will be a good candidate for parallel implementation.) Conclusions. The genetic algorithm may be viewed as a (non-deterministic) machine which is programmed by supplying it with a selection criterion - an algorithm for comparing two lists of constraints. We have experimented with various selection criteria based on combinations of the total intersection area, I, of overlap involved in design-rule violations, and the area, A, of a bounding rectangle. Experiments were made to compare various performance criteria based on combinations of the number of design-rule violations and the area of a bounding rectangle. From our experience with the prototype, it appears that the choice of a selection criterion is an essential difficulty in applying the genetic algorithm to layout. The problem is that we must evolve populations of partial solutions (strategies), while the optimisation task is defined in terms of a cost function defined on layouts (solutions).
To extend a (technology imposed) cost-function, c, defined on solutions, to the space of strategies, in such a way that the genetic algorithm will produce a solution (rather than just a high-scoring strategy), is a non-trivial task. We intend to experiment with our second prototype in various ways before going on to implement a "real" system dealing with design-rules for a practical multi-layer technology. We will continue to experiment with selection criteria and we are implementing the idea of having several weakly interacting populations running in parallel, described above. We also intend to integrate other, rule-based, methods with the genetic algorithm, automating the provision of "hints". Thus, a number of suggestions for strategies would be generated and passed to the genetic algorithm, which would then explore combinations and variations of these. Acknowledgements. I would like to thank Steve Smith for introducing me to Genetic Algorithms, and Robert Holte for many stimulating discussions; his criticism and encouragement have been invaluable. References. Holland, John H. 1975. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press. Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi 1983. Optimisation by simulated annealing. Science 220 (1983), 671-680. Schlag, M., Y.-Z. Liao, and C. K. Wong 1983. An algorithm for optimal two-dimensional compaction of VLSI layouts. INTEGRATION, the VLSI journal 1 (1983), 179-209. Smith, S. F. 1982. Implementing an adaptive learning system using a genetic algorithm. Ph.D. thesis, University of Pittsburgh, 1982. ALLELES, LOCI, AND THE TRAVELING SALESMAN PROBLEM by David E. Goldberg and Robert Lingle, Jr.
Department of Engineering Mechanics, The University of Alabama, University, AL 35486 INTRODUCTION We start this paper by making several seemingly not-too-related observations: 1) Simple genetic algorithms work well in problems which can be coded so the underlying building blocks (highly fit, short defining length schemata) lead to improved performance. 2) There are problems (more properly, codings for problems) that are GA-Hard, i.e., difficult for the normal reproduction+crossover+mutation processes of the simple genetic algorithm. 3) Inversion is the conventional answer when genetic algorithmists are asked how they intend to find good string orderings, but inversion has never done much in empirical studies to date. 4) Despite numerous rumored attempts, the traveling salesman problem has not succumbed to genetic algorithm-like solution. Our goal in this paper is to show that, in fact, these observations are closely related. Specifically, we show how our attempts to solve the traveling salesman problem (TSP) with genetic algorithms have led to a new type of crossover operator, partially-mapped crossover (PMX), which permits genetic algorithms to search for better string orderings while still searching for better allele combinations. The partially-mapped crossover operator combines a mapping operation usually associated with inversion and subsequent crossover between non-homologous strings with a swapping operation that preserves a full gene complement. The result is an operator which enables both allele and ordering combinations to be searched with the implicit parallelism usually reserved for allele combinations in more conventional genetic algorithms. In the remainder, we first examine and question the conventional notions of gene and locus. This leads us to consider the mechanics of the partially-mapped crossover operator (PMX). This discussion is augmented by the presentation of a sample implementation (for ordering-only problems) in Pascal.
Next, we consider the effect of PMX by extending the normal notion of a schema by introducing the o-schemata (ordering schemata) or locus templates. This leads to simple counting arguments and survival probability calculations for o-schemata under PMX. These results show that with high probability, low order o-schemata survive PMX, thus giving us a desirable result: an operator which searches among both orderings and allele combinations that lead to good fitness. Finally, we demonstrate the effectiveness of this extended genetic algorithm, consisting of reproduction+PMX, by applying it to an ordering-only problem, the traveling salesman problem (TSP). Coding the problem as an n-permutation with no allele values, we obtain optimal or very near-optimal results in a well-known 10 city problem. Our discussion concludes by discussing extensions to problems with both ordering and value considered. THE CONVENTIONAL VIEW OF POSITION AND VALUE In genetic algorithm work we usually take a decidedly Mendelian view of our artificial chromosomes and consider genes which may take on different values (alleles) and positions (loci). Normally we assume that alleles decode to our problem parameter set (phenotype) in a manner independent of locus. Furthermore, we assume that our parameter set may then be evaluated by a fitness function (a non-negative objective function to be maximized). Symbolically, the fitness f depends upon the parameter set x which in turn depends upon the allele values v, or more compactly f = f(x(v)). While this is certainly conventional, we need to ask whether this is the most general (or even most biological) way to consider this mapping. More to the point, shouldn't we also consider the possible effect of a string's ordering o on phenotype outcome and fitness? Mathematically there seems to be no good reason to exclude this possibility, which we may write f = f(x(o,v)).
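A toy example (ours, not from the paper) makes the notation f = f(x(o,v)) concrete: the same allele values can yield different fitnesses when arranged in different orders.

```python
# Toy illustration of f = f(x(o, v)): the phenotype x depends on both
# the ordering o (a permutation of loci) and the allele values v, so
# fitness is a function of ordering as well as value.

def phenotype(o, v):
    # The ordering decides which allele lands at which position.
    return [v[i] for i in o]

def fitness(o, v):
    # A hypothetical position-weighted sum: identical alleles score
    # differently under different orderings.
    x = phenotype(o, v)
    return sum((pos + 1) * xi for pos, xi in enumerate(x))

v = [3, 1, 2]
print(fitness([0, 1, 2], v))  # 1*3 + 2*1 + 3*2 = 11
print(fitness([2, 1, 0], v))  # 1*2 + 2*1 + 3*3 = 13
```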
While this generalization of our coding techniques is attractive because it would permit us to code ordering problems more naturally, we must make sure we maintain the implicit parallelism of the reproductive plans and genetic operators we apply to the generalized structures. Furthermore, because GA's are drawn from biological example, we should be careful to seek natural precedent before committing ourselves to this extension. To find biological precedent for the importance of ordering as well as value we need only consider the sublayer of structure beneath the chromosome and consider the amino acid sequences that lead to particular proteins. At this level, the values (amino acids) are in no way tagged with meaning. There are only amino acids, and they must appear in just the right order to obtain a useful outcome (a particular protein). Thus, there is biological example of outcomes that depend upon both ordering and value, and we do not risk the loss of the right flavor by considering them both. Then, wherein lies our problem? If it is ok to admit both ordering and value information into our fitness evaluation, what is missing in our current thinking about genetic algorithms which prevents us from exploiting both ordering and value information concurrently? In previous work where ordering was considered at all (primarily for its effect on the creation of good, tightly linked, building blocks), the only ordering operator considered was inversion, a unary operator which picks two points along a single string at random and inverts the included substring (1). Subsequent crossover between non-homologous (differently ordered) strings occurred by mapping one string's order to the other, crossing via simple crossover, and unmapping the offspring. This procedure is well and good for searching among different allele combinations, but it does little to search for better orderings.
Clearly the only operator affecting string order here is inversion, but the beauty of genetic algorithms is contained in the structured, yet randomized, information exchange of crossover--the combination of highly fit notions from different strings. With only a unary operator to search for better string orderings, we have little hope of finding the best ordering, or even very good orderings, in strings of any substantial length. Just as mutation cannot be expected to find very good allele schemata in reasonable time, inversion cannot be expected to find good orderings in substantial problems. What is needed is a binary, crossover-like operator which exchanges both ordering and value information among different strings. In the next section, we present a new operator which does precisely this. Specifically, we outline an operator we call partially-mapped crossover (PMX) that exploits important similarities in value and ordering simultaneously when used with an appropriate reproductive plan. PARTIALLY-MAPPED CROSSOVER (PMX) - MECHANICS To exchange ordering and value information among different strings we present a new genetic operator with the proper flavor. We call this operator partially-mapped crossover because a portion of one string ordering is mapped to a portion of another and the remaining information is exchanged after appropriate swapping operations. To tie down these ideas we also present a piece of code used in the computational experiments to be presented later. To motivate the partially-mapped crossover operator (PMX) we will consider different orderings only and neglect any value information carried with the ordering (this is not a limitation of the method because allele information can easily be tacked on to city name information). For example, consider two permutations of 10 objects:

A = 9 8 4 5 6 7 1 3 2 10
B = 8 7 1 2 3 10 9 5 4 6

PMX proceeds as follows.
First, two positions are chosen along the string uniformly at random. The substrings defined from the first number chosen to the second number chosen are called the MAPPING SECTIONS. Next, we consider each mapping section separately by mapping the other string to the mapping section through a sequence of swapping operations. For example, if we pick two random numbers, say 4 and 6, this defines the two mapping sections, 5-6-7 in string A, and 2-3-10 in string B. The mapping operation, say B to A, is performed by swapping first the 5 and the 2, the 6 and the 3, and the 7 and the 10, resulting in a well defined offspring. Similarly the mapping and swapping operation of A to B results in the swap of the 2 and the 5, the 3 and the 6, and the 10 and the 7. The resulting two new strings are as follows:

A' = 9 8 4 2 3 10 1 6 5 7
B' = 8 10 1 5 6 7 9 2 4 3

The mechanics of PMX is a bit more complex than simple crossover, so to tie down the ideas completely we present a code excerpt which implements the operator for ordering-only structures in Figure 1. In this code, the string is treated as a ring and attention is paid to the order of selection of the two mapping section endpoints. The power of effect of this operator, as with simple crossover, is much more subtle than is suggested by the simplicity of the string matching and swapping. Clearly, however, portions of the string ordering are being propagated untouched, as we should expect. In the next section, we identify the type of information being exchanged by introducing the o-schemata (ordering schemata). We also consider the probability of survival of particular o-schemata under PMX. PARTIALLY-MAPPED CROSSOVER - POWER OF EFFECT In the analysis of a simple genetic algorithm with reproduction+crossover+mutation, we consider allele schemata as the underlying building blocks of future solutions.
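The mapping-and-swapping procedure just described can be sketched in a few lines of Python. This is our own minimal sketch, not the authors' Pascal: it takes 0-based, inclusive section bounds and omits the ring treatment used in their Figure 1 code.

```python
# Minimal sketch of PMX mechanics (ours, not the paper's Pascal): for
# each position i in the mapping section, swap the values a[i] and b[i]
# within a, and within b, so each offspring keeps a full complement of
# cities.  Section bounds lo..hi are 0-based and inclusive.

def pmx(a, b, lo, hi):
    a, b = list(a), list(b)
    for i in range(lo, hi + 1):
        x, y = a[i], b[i]
        ja, jb = a.index(y), b.index(x)
        a[i], a[ja] = a[ja], a[i]       # swap x and y inside a
        b[i], b[jb] = b[jb], b[i]       # swap x and y inside b
    return a, b

A = [9, 8, 4, 5, 6, 7, 1, 3, 2, 10]
B = [8, 7, 1, 2, 3, 10, 9, 5, 4, 6]
# The text's cut positions 4 and 6 are indices 3..5 here.
A2, B2 = pmx(A, B, 3, 5)
print(A2)  # [9, 8, 4, 2, 3, 10, 1, 6, 5, 7]
print(B2)  # [8, 10, 1, 5, 6, 7, 9, 2, 4, 3]
```

Note that both offspring are again permutations: each swap exchanges two existing values, so no city is ever duplicated or lost.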
We also consider the effect of the genetic operators on the survivability of important schemata.

[Figure 1. Pascal implementation of PMX - partially-mapped crossover - procedure cross_tour.]

In a similar way, in our current work we consider the o-schemata or ordering schemata, and calculate the survival probabilities of important o-schemata under the PMX operator just discussed. As in the previous section we will neglect any allele information which may be carried along to focus solely on the ordering information; however, we recognize that we can always tack on the allele information for problems where it is needed in the coding. To motivate an o-schema consider two of the 10-permutations:

C = 1 2 3 4 5 9 8 10 6 7
D = 1 2 3 5 4 6 7 8 9 10

As with allele schemata (a-schemata), where we appended a * (a meta-don't-care symbol) to our k-nary alphabet to motivate a notation for the schemata or similarity templates, so do we here append a don't care symbol (the !) to mean that any of the remaining permutations will do in the don't care slots. Thus in our example we have, among others, the following o-schemata common among structures C and D:

1 2 3 ! ! ! ! ! ! !
1 2 ! ! ! ! ! ! ! !
! ! 3 ! ! ! ! ! ! !

To consider the number of o-schemata, we count those with
no positions fixed, 1 position fixed, 2 positions fixed, etc., and recognize that the number of o-schemata with exactly j positions fixed is simply the product of the number of combinations of groups of j among ℓ objects, C(ℓ,j), times the number of permutations of groups of j among ℓ objects, ℓ!/(ℓ−j)!. Summing from 0 to ℓ (the string length) we obtain the number of o-schemata:

n_os = Σ (j = 0 to ℓ) C(ℓ,j) · ℓ!/(ℓ−j)!

While this expression has not been reduced to closed form, it may be shown for large ℓ that the number of o-schemata is certainly greater than 2^ℓ. Furthermore, it is easily shown that each particular string (permutation) is a representative of 2^ℓ o-schemata and that a population of size n contains at most n·2^ℓ o-schemata. Next we consider the survival probability of a particular o-schema under the partially-mapped crossover operator. The easiest way to calculate this is to use conditional probabilities over three mutually exclusive events: the o-schema is entirely contained within the match section (Event W - within), the schema is entirely outside the match section (Event O - outside), or the schema is cut by a cross point (Event C - cut). Thus, the probability of survival (Event S - survival) may be given:

P(S) = P(S|W)P(W) + P(S|O)P(O) + P(S|C)P(C)

Since the probability of surviving a cut is very low (P(S|C) ≈ 0) we ignore this possibility and focus on the other two events. Assuming a cut length k, a defining length of the schema δ(s), and an o-schema of order (number of fixed positions) o(s), the overall probability of survival (for large string length ℓ) may be estimated:

P(S) ≈ (k − δ(s))/ℓ + ((ℓ − k − δ(s))/ℓ) · (1 − k/ℓ)^o(s)

Closer examination of this equation reveals two modes of survival. When the cut length is large with respect to the defining length, relatively short defining length schemata survive with high probability.
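The counting argument above is easy to check numerically (a small sketch of ours; `math.perm` requires Python 3.8+):

```python
from math import comb, perm

# The o-schemata count derived above, computed directly:
#   n_os = sum_{j=0}^{l} C(l, j) * l!/(l-j)!
# comb(l, j) picks which j positions are fixed; perm(l, j) = l!/(l-j)!
# picks which objects, in which arrangement, occupy them.

def num_o_schemata(l):
    return sum(comb(l, j) * perm(l, j) for j in range(l + 1))

print(num_o_schemata(2))   # 7
print(num_o_schemata(10))  # the 10-city case, far more than 2**10
```

For ℓ = 2 the seven templates are ! !, 1 !, 2 !, ! 1, ! 2, 1 2 and 2 1, matching the count.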
The second and more subtle mode of survival occurs when short, low order schemata survive, because a small cut length dictates a small probability of interruption due to swapping. Together the two modes combine to pass through short, low order o-schemata, and normal reproductive selection allocates trials to these building blocks at optimal rates. Hence, PMX permits the same type of implicit parallelism to occur in both orderings and alleles as we have already witnessed using simple crossover on allele information alone. A PURE ORDERING PROBLEM - THE TRAVELING SALESMAN PROBLEM (TSP) In some sense we've presented this paper in the reverse order of discovery. We did not 1) admit ordering information, 2) discover PMX and o-schemata, and 3) apply reproduction+PMX to the traveling salesman problem. In fact, by trying to solve the TSP with genetic algorithms, we were led to PMX-like operators, then o-schemata, and finally PMX. The traveling salesman problem is a pure ordering problem (2,3,4) where one attempts to find the optimal tour (minimum cost path which visits each of n cities exactly once). The TSP is possibly the most celebrated combinatorial optimization problem of the past three decades, and despite numerous exact (impractical) and heuristic (inexact) methods already discovered, the problem remains an active research area in its own right, partially because the problem is part of a class of problems considered to be NP-complete for which no polynomial time solution is believed to exist. Our interest in the TSP sprung mainly from a concern over claims of genetic algorithm robustness. If GA's are robust, why have the rumored attempts at "solving" the TSP with GA's failed? This concern led us to consider many schemes for coding the ordering information, with strange codes, penalty functions, and the like, but none of these had the appropriate flavor--the building blocks didn't seem right.
This led us to consider the current scheme, which does have appropriate building blocks, and as we shall soon see, does (in one problem) lead to optimal or near-optimal results. The specific problem we consider is Karg and Thompson's well-studied 10 city problem (4). While a 10 city problem is no final touchstone of success, it does contain 9! alternatives (the GA knows nothing of the problem's symmetry, which reduces this number to (9!)/2). We code the problem as a normalized (city 1 in the first position) 10-permutation and apply reproduction and PMX to successive populations. We use roulette wheel reproduction with selection probabilities set in the normal way, and fitnesses are created from costs and scaled by subtracting string cost from population maximum cost, f_i = C_max − c_i. We choose the initial population, popsize = 200, at random. This number was selected to obtain a rich spread of order 2 o-schemata in the population. This requires a population size proportional to n(n−1) or roughly n². It might be useful to have order 3 schemata as well, but this may require larger populations than we are used to working with. We present the results of two runs on the 10 city problem in Figures 2 and 3. Figure 2 shows the population average cost with each successive generation. The crossover probability was set at 0.6, so each generation represents roughly 120 new function evaluations (0.6 × 200). Figure 3 shows the population best results with successive generations. As we can see, run 1 reaches the optimal (!) result rather quickly, while run 2 converges on a very near-optimal tour (we only ran twenty generations--there was still enough diversity left so improvement was possible in run 2). The best of run 1 was indeed the Karg and Thompson optimum, tour 1-2-3-4-5-10-9-8-6-7 with cost = 378. The best of run 2 was a near-optimum, the tour 1-2-3-10-9-5-4-6-8-7 with cost = 381.
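The reproduction scheme described above — cost-to-fitness scaling by f_i = C_max − c_i, followed by roulette wheel selection — can be sketched as follows (our own reconstruction, not the authors' code):

```python
import random

# Sketch of the reproduction scheme from the text: tour costs become
# fitnesses by subtracting from the population maximum cost, and parents
# are drawn by roulette-wheel (fitness-proportionate) selection.

def fitnesses(costs):
    c_max = max(costs)
    return [c_max - c for c in costs]       # f_i = C_max - c_i

def roulette_select(population, fits, rng=random):
    total = sum(fits)
    if total == 0:                          # degenerate case: pick uniformly
        return rng.choice(population)
    spin = rng.uniform(0, total)
    running = 0.0
    for individual, f in zip(population, fits):
        running += f
        if spin <= running:
            return individual
    return population[-1]                   # guard against rounding

print(fitnesses([378, 381, 420, 500]))  # [122, 119, 80, 0]
```

Note that the worst tour in the population receives fitness zero, so it is never selected; all other selection pressure comes from the cost differences.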
[Figure 2. Generation Average Cost vs. Generation for 10 City TSP, runs 1 and 2.]

We are currently working on a 20 city problem and a 33 city problem, although we need to do some reprogramming to fit the large population sizes into our IBM PC's. We also have built in an inversion operator, but have not had a chance to test its effect on average and best results. CONCLUSIONS In this paper we have examined a new type of crossover operator, partially-mapped crossover (PMX), for the exploration of codings where ordering and allele information may directly or indirectly affect fitness values. The mechanics of the operator have been described, and an ordering-only implementation has been presented in Pascal. The power of effect of the new operator has been analyzed using an extension to the concept of a schema called the o-schemata (ordering schemata). Simple counting arguments have been put forward which show the vast amount of information contained in the o-schemata, and survival probabilities have been estimated for o-schemata under the PMX operator. The result is an operation which preserves ordering building blocks (and allele building blocks if they are attached) so orderings and allele combinations may be explored with implicit parallelism. The new operator is tested in an ordering-only problem, the traveling salesman problem. Using reproduction+PMX in two runs, optimal or very near-optimal results are found in a well-known 10 city problem after exploring a small portion of the tour search space. We are continuing our work by testing the method in larger problems, but we are encouraged by the GA-like performance obtained on our first test. This work has important implications for improving more general GA search in problems where both allele combinations and ordering information are important.
The binary operation of PMX does permit the randomized, yet structured, information exchange among both allele and ordering building blocks which simple crossover promotes among allele schemata alone. This should assist us in our efforts to successfully apply genetic algorithms to ever more complex problems. REFERENCES 1. Holland, J. H., Adaptation in Natural and Artificial Systems, Ann Arbor: University of Michigan Press, 1975. 2. Bellmore, M. and G. L. Nemhauser, "The Traveling Salesman Problem: A Survey," Operations Research, vol. 16, May-June 1968, pp. 538-558. 3. Parker, R. G. and R. L. Rardin, "The Traveling Salesman Problem: An Update of Research," Naval Research Logistics Quarterly, vol. 30, 1983, pp. 69-96. 4. Karg, R. L. and G. L. Thompson, "A Heuristic Approach to Solving Travelling Salesman Problems," Management Science, vol. 10, no. 2, January 1964, pp. 225-248.

[Figure 3. Best-of-Generation Cost for 10 City TSP, runs 1 and 2.]

Genetic Algorithms for the Traveling Salesman Problem John Grefenstette, Rajeev Gopal, Brian Rosmaita, Dirk Van Gucht, Computer Science Department, Vanderbilt University Abstract This paper presents some approaches to the application of Genetic Algorithms to the Traveling Salesman Problem. A number of representation issues are discussed along with several recombination operators. Some preliminary analysis of the Adjacency List representation is presented, as well as some promising experimental results. 1. Introduction Genetic Algorithms (GA's) have been applied to a variety of function optimization problems, and have been shown to be highly effective in searching large, complex response surfaces even in the presence of difficulties such as high-dimensionality, multimodality, discontinuity and noise [4]. However, GA's have not been applied extensively to combinatorial problems. The major obstacle is in finding an appropriate representation.
This paper presents some approaches to the design of GA's for a well known combinatorial optimization problem -- the Traveling Salesman Problem (TSP). The TSP is easily stated: given a complete graph with N nodes, find the shortest Hamiltonian path through the graph. (In this paper, we will assume Euclidean distances between nodes.) The TSP is NP-hard, which probably means that any algorithm which computes an exact solution of the TSP requires an amount of computation time which is exponential in N, the size of the problem [5]. In addition to many important applications, the TSP is often used to illustrate heuristic search methods [2,7,8], so it is natural to investigate the use of GA's for this problem.

Choosing an appropriate representation is the first step in applying GA's to any optimization problem. If the problem involves searching an N-dimensional space, the representation problem is often solved by allocating a sufficient number of bits to each dimension to achieve the desired accuracy. For the TSP, the search space is a space of permutations and the representation problem is more complex.

Consider a path representation in which a tour is represented by a list of cities, e.g. (a b c d e f). The first problem is that the representation is not unique: each tour has N representations. This can be solved by fixing the initial city. Another problem is that the crossover operator does not generally yield offspring which are legal tours. For example, suppose we cross tours (a b c d e) and (a d e c b) between the third and fourth cities. We get as offspring (a b c c b) and (a d e d e), neither of which is a legal tour. Finally, there is a problem in applying the hyperplane analysis of GA's to this representation: the definition of a hyperplane is unclear. For example, (a # # # #) appears to be a first order hyperplane, but it contains the entire space.
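The illegal-offspring problem is easy to reproduce. The following sketch (function names are ours, not the paper's) applies classical one-point crossover to the paper's example tours:

```python
def one_point_crossover(p1, p2, cut):
    """Classical one-point crossover, applied naively to path tours."""
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def is_legal_tour(tour):
    """A legal tour visits each city exactly once."""
    return sorted(tour) == sorted(set(tour))

mom = list("abcde")
dad = list("adecb")
# cross between the third and fourth cities, as in the text
kid1, kid2 = one_point_crossover(mom, dad, 3)
# kid1 is (a b c c b), kid2 is (a d e d e) -- neither is a legal tour
```

Both offspring duplicate some cities and omit others, which is exactly why a permutation-aware representation or operator is needed.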
The problem is that in this representation, the semantics of an allele in a given position depend on the surrounding alleles. Intuitively, we hope that GA's will tend to construct good solutions by identifying good building blocks and eventually combining these to get larger building blocks. For the TSP, the basic building blocks are edges; larger building blocks correspond to larger subtours. The path representation does not lend itself to the description of edges and longer subtours in ways which are useful to the GA.

In section 2, we present two representations which offer some improvements over the path representation. Section 3 discusses the design of a heuristic recombination operator for what we consider to be the most promising representation. In section 4, some preliminary experimental results are described for the TSP. Section 5 discusses some future directions.

(Research supported in part by the National Science Foundation under Grant MCS-8305603.)

2. Representations for TSP

2.1. Ordinal Representation

In the ordinal representation, a tour is described by a list of N integers in which the ith element can range from 1 to (N-i+1). Given a path representation of a tour, we can construct the ordinal representation TourList as follows. Let FreeList be an ordered list of the cities. For each city in the tour, append the position of that city in the FreeList to the TourList and delete that city from the FreeList. For example, the path tour (a c e d b) corresponds to the ordinal tour (1 2 3 2 1) as shown:

    TourList        FreeList
    ()              (a b c d e)
    (1)             (b c d e)
    (1 2)           (b d e)
    (1 2 3)         (b d)
    (1 2 3 2)       (b)
    (1 2 3 2 1)     ()

Note that it is necessary to fix the starting city to avoid multiple representations of tours. A similar procedure provides a mapping from the ordinal representation back to the path representation. In fact, the mapping between the two representations is one-to-one.
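The FreeList construction above, and its inverse, can be sketched directly (a minimal illustration; the helper names are ours):

```python
def path_to_ordinal(path, cities):
    """Map a path tour to its ordinal representation.

    For each city in the tour, record its 1-based position in a
    shrinking FreeList of unvisited cities, then delete it.
    """
    free = list(cities)
    ordinal = []
    for city in path:
        ordinal.append(free.index(city) + 1)
        free.remove(city)
    return ordinal

def ordinal_to_path(ordinal, cities):
    """Inverse mapping: ordinal representation back to a path tour."""
    free = list(cities)
    return [free.pop(i - 1) for i in ordinal]
```

With the paper's example, `path_to_ordinal(list("acedb"), list("abcde"))` yields `[1, 2, 3, 2, 1]`, and the inverse recovers the path, confirming the mapping is one-to-one.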
The primary advantage of the ordinal representation is that the classical crossover operator may be freely applied to it and will always produce the ordinal representation of a legal tour. However, the results of crossover may not bear much relation to the parents when translated back to the path representation. For example, consider the following two tours:

    ordinal tours      path tours
    (1 2 3 2 1)        (a c e d b)
    (2 4 1 1 1)        (b e a c d)

Suppose that we cross the ordinal tours between the second and third positions. We get the following tours as offspring:

    ordinal tours      path tours
    (1 2 1 1 1)        (a c b d e)
    (2 4 3 2 1)        (b e d c a)

The subtours corresponding to the genes in the ordinal tours to the left of the crossover point do not change. However, the subtours corresponding to genes to the right of the crossover point are disrupted in a fairly random way. Furthermore, the closer the crossover point is to the front of the tour, the greater the disruption of subtours in the offspring. As predicted by the above consideration of subtour disruptions, experimental results using the ordinal representation have been generally poor. In most cases, a GA using the ordinal representation does no better than random search on the TSP.

2.2. Adjacency Representation

In the adjacency representation, a tour is described by a list of cities. There is an edge in the tour from city i to city j iff the allele in position i is j. For example, the path tour (1 3 5 4 2) corresponds to the adjacency tour (3 1 5 2 4). Note that any tour has exactly one adjacency list representation.

2.2.1. Crossover Operators

Unlike the ordinal representation, the adjacency representation does not allow the classical crossover operator. Several modified crossover operators can be defined.

Alternating Edges. Using the alternating edges operator, an offspring is constructed from two parent tours as follows.
First choose an edge at random from one parent. Then extend the partial tour by choosing the appropriate edge from the other parent. Continue extending the tour by choosing edges from alternating parents. If the parent's edge would introduce a cycle into the partial tour, then extend the partial tour by a random edge which does not introduce a cycle. Continue until a complete tour is constructed. For example, suppose we have

    mom = (2 3 4 5 6 1)
    dad = (2 5 1 6 4 3)

Then we might get the following offspring:

    kid = (2 5 4 1 6 3)

where the only random edge introduced into the offspring is the edge (4 1). All other edges were inherited by alternately choosing edges from the parents, starting with the edge (1 2) from mom.

Experimental results with the alternating edges operator have been uniformly discouraging. The obvious explanation seems to be that good subtours are often disrupted by the crossover operator. Ideally, an operator ought to promote the development of coadapted alleles, or in the TSP, longer and longer high performance subtours. The next operator was motivated by the desire to preserve longer parental subtours.

Subtour Chunks. Using the subtour chunking operator, an offspring is constructed from two parent tours as follows. First choose a subtour of random length from one parent. Then extend the partial tour by choosing a subtour of random length from the other parent. Continue extending the tour by choosing subtours from alternating parents. During the selection of a subtour from a parent, if the parent's edge would introduce a cycle into the partial tour, then extend the partial tour by a random edge which does not introduce a cycle. Continue until a complete tour is constructed.

Subtour chunking performed better than alternating edges, as expected, but the absolute performance was still unimpressive. An analysis of the allocation of trials to hyperplanes provides a partial explanation for the poor performance of this operator.

2.2.2. Hyperplane Analysis
The primary advantage of the adjacency representation is that it permits the kind of hyperplane analysis which has been applied to the N-dimensional function optimization GA paradigm [1,3,6]. Hyperplanes defined in terms of a single defining position correspond to the natural building blocks, i.e., edges, for the TSP. For example, the hyperplane (# # # 2 #) is the set of all permutations in which the edge (4 2) occurs.

We briefly summarize the main points of the classical hyperplane analysis of GA's. In the absence of recombination operators, selection of structures for reproduction in proportion to the structure's observed relative performance allocates trials to all represented hyperplanes in the population (roughly) according to the following formula:

    M(H,t+1) = M(H,t) * u(H,t) / u(P,t)

where

    M(H,t) = number of representatives of H at time t
    u(H,t) = observed performance of H at time t
    u(P,t) = mean performance of the population at time t.

The elements of any hyperplane partition compete against the other elements of that partition, with the better performing elements eventually propagating through the population. This in turn leads to a reduction in the dimensionality of the search space, and the construction of larger high performance building blocks. In the adjacency representation, a first order hyperplane partition consists of all of the hyperplanes which are defined on the same position. For example,

    {(# # # 1 #), (# # # 2 #), (# # # 3 #), (# # # 5 #)}

is a first order hyperplane partition. Each element of the partition contains an equal number of tours. Selection is supposed to distinguish among the elements of this partition and to favor the high performance hyperplanes. However, the following theorem shows that selection has very little information on which to allocate trials to competing first order hyperplanes.

Theorem 1. Suppose that H_ab and H_ac are two first order hyperplanes defined by the edges (a b) and (a c), respectively, in a Euclidean TSP.
Then | u(H_ab) - u(H_ac) | <= 4(ab + ac), where ab and ac represent the lengths of the edges (a b) and (a c), respectively.

Proof. We show that there is a one-to-one mapping f between the tours in H_ab and the tours in H_ac such that if x is a tour in H_ab and y = f(x) is the corresponding tour in H_ac, then | Length(y) - Length(x) | <= 4(ab + ac). The theorem follows directly. The mapping f is defined so that y is obtained by exchanging the nodes b and c in the tour x. Using the triangle inequality, it is easy to show that

    -(4ab + 2ac) <= Length(y) - Length(x) <= (4ac + 2ab)

so | Length(y) - Length(x) | <= 4(ab + ac). QED.

In practice, the observed difference between competing first order hyperplanes is usually an order of magnitude less than the bound in the theorem. And since the overall tour length is generally very large compared to the bound in the theorem, there is generally no significant difference between the mean relative performance of any two competing first order hyperplanes. Our experimental studies have shown that the difference in the observed performance of competing first order hyperplanes in a TSP of size 20 is generally less than 5% of the mean population tour length. In larger problems, this difference can be expected to rapidly approach zero.

One might suspect that the TSP is not a suitable problem for GA's, that the TSP is in some sense GA-hard. Bethke [1] characterizes some problems for which GA's are unsuitable. Informally, Bethke shows that there are functions and representations for which the low order hyperplanes can mislead the GA into allocating trials to suboptimal areas of the search space. However, Bethke's techniques, which involve the Walsh transform of the objective function, apply to one-dimensional functions of a real variable using a fixed-point representation. A similar set of results may be derivable for combinatorial problems using the adjacency representation. But Theorem 1 does not indicate that the information in the first
order hyperplanes of the adjacency representation is misleading, just that it is buried. In other words, measuring the fitness of a tour by the tour length may be too crude a measure for apportioning credit. We now describe a crossover operator which performs a secondary apportionment of credit at the level of individual alleles.

3. Heuristic Crossover

Theorem 1 shows that selection alone may not be able to properly allocate trials to first order hyperplanes, given our adjacency representation for the TSP. The heuristic crossover operator attempts to perform a secondary apportionment of credit at the allele level. This operator constructs an offspring from two parent tours as follows: Pick a random city as the starting point for the child's tour. Compare the two edges leaving the starting city in the parents and choose the shorter edge. Continue to extend the partial tour by choosing the shorter of the two edges in the parents which extend the tour. If the shorter parental edge would introduce a cycle into the partial tour, then extend the tour by a random edge. Continue until a complete tour is generated.

In order to compare this operator with the previous two recombination operators, 1000 random pairs of parents were chosen for a TSP of size 20. For each pair of parents, an offspring was constructed according to each of the crossover operators. For all three operators, the offspring generally inherited about 30% of the edges from each parent; the remaining 40% were random edges introduced by the recombination operator to create a legal tour. For the first two operators, the offspring generally show no improvement in overall tour length when compared to the better parent. Not surprisingly, the heuristic crossover produces offspring which are, on average, about 10% better than the better parent. It seems reasonable that such an improvement should give selection a way to promote the propagation of good edges through the population.
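A sketch of the operator as described above, assuming 0-based cities and taking the literal reading that a random legal edge is used whenever the shorter parental edge would close a premature cycle (the function name and the tie-breaking details are ours, not the paper's):

```python
import random

def heuristic_crossover(mom, dad, dist, rng=None):
    """Heuristic crossover on adjacency-representation tours.

    mom[i] = j means the parent tour contains the edge (i, j).
    dist is a full distance matrix. At each step the shorter of the
    two parental edges leaving the current city is chosen; if it would
    close a premature cycle, a random legal edge is used instead.
    """
    rng = rng or random.Random()
    n = len(mom)
    child = [None] * n
    start = rng.randrange(n)            # random starting city
    city, visited = start, {start}
    for _ in range(n - 1):
        a, b = mom[city], dad[city]
        nxt = a if dist[city][a] <= dist[city][b] else b
        if nxt in visited:              # shorter parental edge closes a cycle
            nxt = rng.choice([j for j in range(n) if j not in visited])
        child[city] = nxt
        visited.add(nxt)
        city = nxt
    child[city] = start                 # close the tour
    return child
```

By construction the result is a single Hamiltonian cycle in adjacency form, and every non-random edge is the locally shorter of the two parental edges.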
The next section shows some experimental results which confirm this expectation. It is important to note that, with the proper choice of data structures, the heuristic crossover operator can be implemented to run in time linear in the length of the structures [9]. This implies that, if E is the number of trials and N is the number of cities, our GA's for the TSP run with asymptotic complexity O(EN), the same as pure random search.

4. Experimental Results

This section describes some experiments with the adjacency representation and the heuristic crossover operator. For each experiment, N cities were randomly placed in a square Euclidean space. The initial population consisted of randomly generated tours. The selection method was based on the expected value model. The crossover rate was set at 50%, and there was no explicit mutation operator.

Figure 1 shows the results of a 50 city problem, Figure 2 shows a 100 city problem, and Figure 3 shows a 200 city problem. Each figure shows a representative tour from the initial population, the best tour obtained part way through the search, and the best tour obtained after the entire search, along with a randomly selected tour in the final population. It can be seen, especially in Figures 2 and 3, that good subtours tend to survive and to propagate. The figures also show that there is still a good deal of diversity in the final population.

Statistical techniques [2] allow us to estimate that the expected length of an optimal tour for experiment 1 is approximately 37.45. The best tour obtained by the GA differs from this expected optimum by about 25%. After an equal number of trials, random search produces a best tour of length 148.6, nearly 300% longer than the optimal tour.
The best tour obtained in experiment 2 differs from the expected optimum by 16%; the best tour obtained in experiment 3 differs from the expected optimum by about 27%. These results are encouraging and suggest that further investigation of this approach is warranted.

Experiments show that GA's which use heuristic crossover but not selection perform better than random search but significantly worse than GA's which use both selection and heuristic crossover. That is, there appears to be a symbiotic relationship between the two levels of credit assignment performed by selection and heuristic crossover. We are currently working on clarifying the relationship between selection and the heuristic crossover operator.

5. Future Directions

This paper presents some preliminary observations and experiments. Many more questions about the TSP need to be investigated. Some interesting future projects include:

Combining GA's with other heuristics. It may be useful to heuristically choose the initial population of tours. For example, the nearest neighbor algorithm can generate a set of relatively good tours when started from various initial cities. For very large problems, nearest neighbor can be approximated by choosing a random set of cities and taking the one closest to the current city. Heuristics could also be invoked at the end of the GA to do some local modifications to the tours in the final population. For example, the figures show many opportunities for improving the final tour by some local edge reversals.

Comparison with simulated annealing. Simulated annealing is another randomized heuristic algorithm which has been applied to very large (N > 1000) TSP's. From the published literature on simulated annealing [2,7], it appears that our results are at least competitive. A careful comparison of these two techniques would be very interesting.

Effects of GA parameters.
There are several control parameters involved in any GA implementation, such as population size, crossover rate, etc., which may have an effect on the performance of the system. The proposed GA's are sufficiently different from previous GA's that it might be useful to investigate the effects of these parameters for the TSP.

Other combinatorial applications. How do the ideas developed thus far apply to combinatorial problems other than the TSP?

References

1. A. D. Bethke, Genetic algorithms as function optimizers, Ph.D. Thesis, Dept. of Computer and Communication Sciences, Univ. of Michigan (1981).
2. E. Bonomi and J.-L. Lutton, "The N-city traveling salesman problem: statistical mechanics and the Metropolis algorithm," SIAM Review, Vol. 26(4), pp. 551-569 (Oct. 1984).
3. K. A. De Jong, Analysis of the behavior of a class of genetic adaptive systems, Ph.D. Thesis, Dept. of Computer and Communication Sciences, Univ. of Michigan (1975).
4. K. A. De Jong, "Adaptive system design: a genetic approach," IEEE Trans. Syst., Man, and Cyber., Vol. SMC-10(9), pp. 556-574 (Sept. 1980).
5. M. R. Garey and D. S. Johnson, Computers and Intractability, W. H. Freeman Co., San Francisco (1979).
6. J. H. Holland, Adaptation in Natural and Artificial Systems, Univ. of Michigan Press, Ann Arbor (1975).
7. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, Vol. 220(4598), pp. 671-680 (May 1983).
8. J. Pearl, Heuristics, Addison-Wesley, Menlo Park (1984).
9. B. J. Rosmaita, Exodus: An extension of the genetic algorithm to problems dealing with permutations, M.S. Thesis, Computer Science Department, Vanderbilt University (Aug. 1985).

[Figure 1. 50-city TSP: (a) a tour from the initial population, distance 197.82; (b) best tour at generation 38 (1969 trials), distance 64.76; (c) a tour from the final population, distance 68.32; (d) best tour at generation 234 (14686 trials), distance 46.84.]
[Figure 2. 100-city TSP: (a) a tour from the initial population, distance 547.12; (b) best tour at generation 125 (6296 trials), distance 118.47; (c) a tour from the final population, distance 99.84; (d) best tour at generation 487 (28338 trials), distance 87.21.]

[Figure 3. 200-city TSP: (a) a tour from the initial population, distance 1475.68; (b) best tour at generation 227 (11373 trials), distance 223.81; (c) a tour from the final population, distance 351.22; (d) best tour at generation 483 (24596 trials), distance 203.46.]

Genetic Algorithms: A 10 Year Perspective

Kenneth De Jong
George Mason University
Fairfax, VA 22030

1. Introduction

In 1975 Holland's book, Adaptation in Natural and Artificial Systems, was published, providing a summary of the work which Holland and his students had been pursuing for some time. An important theme in this wide ranging study of the properties of adaptive systems was that adaptation can be usefully modeled as a form of search through a space of structural changes which one might make to a complex system in an attempt to "improve" its behavioral characteristics. This gave rise to a methodology for studying existing (natural) adaptive systems and designing (artificial) adaptive systems which focused on answering key questions such as: What are the legal structural changes one is allowed to make? How is that space searched in an attempt to identify structural changes which improve behavior? How does one ascertain that resulting behavioral changes are, in fact, an improvement?

As an example of the merit of this approach, Holland specified the architecture for and provided a theoretical analysis of a class of adaptive systems in which the structural modification space is represented by strings of symbols chosen from some alphabet and the searching of this representation space is accomplished by an unusual procedure called a genetic algorithm.
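As a rough illustration only (the parameter values and the ones-counting fitness function are our choices for the sketch, not Holland's or De Jong's), such a procedure can be written as a generational loop of fitness-proportionate selection, one-point crossover, and bitwise mutation over fixed-length bit strings:

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=20, generations=50,
                      p_cross=0.6, p_mut=0.01, rng=None):
    """Minimal generational GA on fixed-length bit strings."""
    rng = rng or random.Random()
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        total = float(sum(scores))

        def select():
            # fitness-proportionate (roulette-wheel) selection
            r = rng.uniform(0, total)
            for ind, s in zip(pop, scores):
                r -= s
                if r <= 0:
                    return ind
            return pop[-1]

        new_pop = []
        while len(new_pop) < pop_size:
            a, b = list(select()), list(select())
            if rng.random() < p_cross:          # one-point crossover
                cut = rng.randrange(1, n_bits)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                for i in range(n_bits):         # bitwise mutation
                    if rng.random() < p_mut:
                        child[i] = 1 - child[i]
                new_pop.append(child)
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)
```

Running it on a simple fitness such as counting ones (`genetic_algorithm(sum, 16)`) shows the characteristic behavior discussed below: good building blocks spread through the population even though no individual is ever guaranteed to survive.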
I think it is fair to say, at this point in time, that the careful definition and theoretical analysis of these genetic algorithms (GAs) was and continues to be one of the major contributions of this effort. In the intervening ten years, a good deal of interest and activity has resulted in important new insights into GAs and their potential applications, culminating in this conference. Unfortunately, as is the case in many novel areas of research, it has been difficult to find a forum within the existing journal/conference structure for reporting the wide ranging activities which have resulted from Holland's provocative ideas. With only a few exceptions, much of this work has been disseminated via unpublished Master's and Ph.D. theses, personal communications, and presentations at a series of informal summer workshops.

I am pleased to report that this situation is changing for the better. In addition to growing institutional support for research in this area, the renewed interest in machine learning in the AI community as well as the continued interest in robust, flexible problem solving strategies in many different contexts has led to a dramatic increase in interest in GAs during the last few years. There remains, however, a fairly serious gap in the coverage of GA research activities since 1975. Those who are new to the area find it difficult to ascertain who has been doing what and frequently get involved unnecessarily in rediscovering various aspects of undocumented "wisdom" regarding the implementation and application of GAs. This conference in general and this paper in particular represent attempts to remedy such perceived gaps, to suggest open research issues, and to identify potential application areas. The following sections summarize my own personal perspective on the current state of the art in this field.

2. Conceptual and Perceptual Issues

Most algorithms are developed with a purpose in mind such as sorting, memory management, tree traversal, etc.
Genetic algorithms, however, represent a highly idealized model of a natural process and as such can be legitimately viewed as a simulation at a very high level of abstraction. This tends to raise some conceptual and perceptual difficulties when trying to understand exactly what GAs do and how they might be used. Much of the early GA research, in an attempt to simplify an already complicated situation, focused on understanding how GAs behaved when the structure space to be searched was an N-dimensional space of numerical parameters (corresponding to independently settable dials on a control panel) and the behavior of the system under the new control settings (the fitness measure) was ascertained by simply computing a memoryless function whose arguments were the new control settings. By carefully choosing functions which presented a variety of well understood payoff surfaces, a great deal of insight was obtained regarding how GAs distribute trials in such spaces in response to the feedback obtained from earlier trials. This gave rise to a very natural question: Do GAs provide a new and important technique for solving global function optimization problems? A good deal of research [DeJong75, Brindle80, Bethke81] has been and continues to be done in this area, with impressive results.

However, because of this historical focus and emphasis on function optimization applications, it is easy to fall into the trap of perceiving GAs themselves as optimization algorithms and then being surprised and/or disappointed when they fail to find an "obvious" optimum in a particular search space. My suggestion for avoiding this perceptual trap is to think of GAs as a (highly idealized) simulation of a natural process, and as such they embody the goals and purpose (if any) of that natural process.
I'm not sure if anyone is up to the task of defining the goals and purpose of evolutionary systems; however, I think it's fair to say that such systems are not generally perceived as function optimizers. The question that remains, then, is how one can characterize what GAs do in a way which is useful for understanding how they might be best applied to difficult areas such as global function optimization, machine learning, NP-hard problems, machine vision, etc. I believe we still have a long way to go in this area. I have attempted to summarize recent advances as well as identify some open issues in the next section.

To my mind the best perspective currently available as to what GAs do is Holland's characterization of them as simultaneously solving a large number of K-armed bandit problems. (If you haven't read it or didn't understand it, you should make an effort to do so.) Although this characterization leaves many unanswered questions, armed with this viewpoint one shouldn't be surprised that (1) the best individual encountered so far may not even survive into the next generation, (2) the population itself seldom converges to a global (or even local) optimum, or (3) the ability of GAs to produce a steady stream of offspring that are better than any seen so far can vary from quite impressive to dismal. At the risk of summarizing the obvious, it is important to realize that GAs have properties of their own independent of the application area, and the key to a successful application (including global function optimization) is to understand and exploit these properties.

3. Representation Issues

The strongest hyperplane analysis results assume that GAs use a very specific form of selection, crossover, and mutation to search a space of fixed length binary strings.
In order to take advantage of the power of GAs as analyzed, the space to be searched in a particular application must be mapped onto a representation space of this form. Depending on the application, selecting an appropriate mapping can range from a trivial activity to a highly creative one. There is now sufficient experience to begin to characterize search spaces with respect to choosing a representation mapping. The following is an attempt to do so.

3.1. Searching Parameter Spaces

Typically, the simplest way to make a complex process more flexible (adaptive) is to identify a fixed set of parameters which can be altered to improve behavior. The obvious mapping is to think of each of the N parameters as a gene and assign each a gene (string) position. If we then choose for each parameter a set of unique symbols representing the legal values of that parameter, we have a very intuitive internal representation as strings of length N. Crossover occurs between symbol boundaries and produces "legal" offspring, and mutation applied to position i selects a new symbol from the legal symbol set for that position. There is both theoretical and experimental evidence to suggest that such direct intuitive mappings are appropriate when the number of legal values a parameter may take on is quite small (ideally, 2) and inappropriate when they deviate much from the ideal [Holland75].

Although there are many interesting problems which permit such direct mappings (e.g., feature spaces, certain NP-hard problems), most parameter modification problems do not. An obvious solution is to map each of the N symbol sets onto a set of fixed-length binary strings, concatenate the results, and apply GAs to this representation space.
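The concatenated binary mapping just described can be sketched as follows (a minimal illustration; the helper names and bit widths are ours):

```python
def encode(values, bits_per_param):
    """Concatenate fixed-length binary encodings of N integer parameters
    into a single bit string."""
    return "".join(format(v, "0%db" % nb)
                   for v, nb in zip(values, bits_per_param))

def decode(bitstring, bits_per_param):
    """Inverse: split the concatenated string back into parameter values."""
    values, i = [], 0
    for nb in bits_per_param:
        values.append(int(bitstring[i:i + nb], 2))
        i += nb
    return values
```

For example, three parameters with legal ranges needing 4, 2, and 3 bits respectively map the tuple (5, 3, 0) to the 9-bit string "010111000", which the GA then treats as an ordinary fixed-length binary individual.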
While it is easy to demonstrate a dramatic improvement in the behavior of GAs in switching from a short, high cardinality representation of a problem to a longer, but lower cardinality representation, there are several issues which arise for which we do not have good answers. Frequently the cardinality of a symbol set is not a power of 2, requiring rounding up to the next power of 2 and implying that the symbol map is into but not onto the set of binary strings. In so doing, the size of the representation space can be increased (in the worst case) by a factor of 2^N over the original search space. Since crossover and mutation invariably produce some of these unassigned strings, there are any number of ways to handle them, including discarding such strings as illegal, assigning such strings low payoff, or mapping such strings redundantly into the symbol set. Each of these approaches has been tried at various times with no clear indication (either experimentally or theoretically) of the overhead incurred by such rounding or whether one approach is consistently better than another. Frequently the application permits enough flexibility in defining the original search space that the set of legal values each parameter can take on can easily and naturally be a power of 2 (e.g., most function optimization problems), so that rounding up issues are not perceived as critical.

There remains, however, the problem of selecting which of the M! ways M objects can be mapped onto another set of M objects in order to generate binary representations. This issue came up early in the function optimization studies in that, when presented with certain relatively simple continuous surfaces, GAs appeared to "lack the killer instinct" in the sense that they would quickly find near-optimal points, but fail to press on to better points nearby.
Further analysis indicated that such behavior was generally caused by artificial "representation boundaries" introduced by mapping the original space onto a binary representation space in such a way that "near-by-ness" had not been preserved. Hence, at a representation boundary, a small change in the value of a parameter is achieved only by a radical change in the binary representation of that parameter value. Since crossover and mutation are operating at the bit level, only very low probability sequences of events could "bump" the search over such boundaries. Experiments with alternative encodings such as Gray codes yielded clearly identifiable improvements in cases where representation boundaries appeared to be a problem, but gave mixed results in others [Brindle80, Bethke81]. Another suggestion for which there are no definite results is to redefine mutation so that it works at the parameter level, guaranteeing that at any point in time each parameter value is equally likely to be generated. The argument against such an approach is the disruptive effect such an operator would have on the proper allocation of trials to hyperplanes at the bit level. As a consequence, an important open question is a better understanding of exactly what has to be preserved when choosing a mapping and how to find mappings with the desired properties. The only hints and suggestions along these lines that I am aware of are Bethke's use of Walsh transforms to characterize when representation spaces are "GA-hard" [Bethke81]. Any new results in this area would greatly improve our understanding and use of GAs.

3.2. Adaptive Representations

Since there may not be sufficient a priori insight to select an appropriate representation, an alternative approach, which has been discussed but for which there is little theoretical or experimental insight, is to allow GAs themselves to select the mapping as part of the adaptive process.
One strategy involves including extra "tag bits" with each individual which identify the particular mapping to be used. An interesting issue here is whether GAs should be modified to be aware of such tag bits (for example, by only applying crossover to parents with identical mappings) or whether GAs should manipulate the tag bits in the usual way as undistinguished members of a longer binary string. In the former case, this introduces the idea of subpopulations (species) for which there is considerable support in natural systems but for which there are no analytic results. In the latter case, the presumed usefulness of binary strings inherited from one (and possibly both) parents can be lost because they are interpreted in a totally different way in an offspring unless the parents had identical tag bits and mutation left them unchanged. Holland raised similar issues while analyzing the disruptive effects of crossover on co-adapted sets of alleles which, because of the particular representation chosen, happened to be far apart [Holland75]. His suggestion was to introduce the inversion operator as a mechanism for changing the physical location of genes without changing their functional interpretation. As above, left unresolved were issues such as whether there should only be a few inversion patterns (species) present in a population with mating (crossover) occurring only within species, or whether crossover should be modified to allow offspring to inherit an inversion pattern from one parent but gene values from both. Early experimental work [Frantz72, DeJong75] generated little evidence of any significant improvement due to introducing inversion in a function optimization context; however, inversion proved to be effective in later work using GAs to search spaces of production system programs [Smith80].

3.3.
Context Sensitive Values

A related but more fundamental problem arises when the application area has the property that the legal values for one parameter are context sensitive in that they depend on which values have been chosen at other positions. While it is frequently convenient and natural to view such problems as defining parameter spaces to be searched, violating the assumption that values can be selected independently can have dramatic effects on the performance of GAs. A simple example of this occurs if we try to represent the unit circle with Cartesian coordinates mapped onto fixed-length strings. GAs, by independently choosing symbols at each position, will distribute trials over the unit square. The usual "fix" is to define the payoff outside the unit circle to be exceptionally low (a penalty function) and let the GAs "learn" to keep new trials inside the desired region. Suppose, however, we generalize the problem to that of representing an N-dimensional hypersphere using Cartesian coordinates. If GAs distribute their trials over the enclosing hypercube, then as N gets large, the volume of the hypersphere becomes vanishingly small relative to the hypercube and the search process becomes hopelessly bogged down on a surface which appears to be uniformly bad almost everywhere. In this case, of course, it doesn't take much insight to suggest a switch to polar coordinates. However, there are other cases in which alternate representations are not so easy to find. My favorite example of this is the Traveling Salesman Problem (TSP), and I am delighted to see that it is well represented at this conference. I continue to believe that it captures in a simple, elegant way many of the open GA issues. A good deal of thought and discussion has gone into the problem of representing TSPs in a form amenable to GAs with very little success to this point.
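The hypersphere argument above is easy to verify numerically. The following Monte Carlo sketch (an illustration, not from the paper) estimates the fraction of uniform samples from the enclosing hypercube that land inside the unit hypersphere as the dimension grows; the fraction collapses rapidly, which is exactly the regime in which a penalty function gives the GA almost no usable gradient.

```python
import random

def inside_fraction(n_dims, trials=100_000, seed=1):
    """Fraction of uniform samples from the hypercube [-1, 1]^N
    that land inside the unit hypersphere."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if sum(rng.uniform(-1, 1) ** 2 for _ in range(n_dims)) <= 1.0:
            hits += 1
    return hits / trials

for n in (2, 5, 10):
    # the fraction shrinks dramatically with dimension
    print(n, inside_fraction(n))
```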
Since the problem involves visiting each of N cities exactly once while minimizing the total distance of a tour, the most natural way to represent candidate solutions is to list in order the cities visited. Obviously, even though this representation can be viewed as N parameters specifying the Ith city to be visited, it is strongly context sensitive in that once a city symbol is used, it cannot be re-used in another position. Of course, one can always permit the GAs to construct illegal tours via crossover and mutation and assign them a very low payoff. Unfortunately, just as with hyperspheres, the space of interest here (the set of all permutations of N symbols) becomes a vanishingly small fraction of the set of all combinations as N increases. There have been many alternative representations invented and explored, but to my knowledge none represent the set of permutations in an efficient, context free way. The alternative to finding a representation which fits with the standard versions of crossover and mutation is to change the definition of crossover and mutation to fit the representation. Inventing new mutation operators is not too difficult in this case, the most natural being low order permutation operators. Crossover requires a bit more creativity and usually involves taking a partial tour from one parent and splicing in whatever is legally possible from the second parent. The results to date from this approach have not been any more encouraging than the previous ones using the standard versions of crossover and mutation on inadequate representations. The problem in this case is that, by altering the genetic operators, we have altered the way in which GAs distribute trials, and the fundamental theorems regarding efficient parallel search need to be re-proved.
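One way such a splicing crossover might look (a hedged sketch of the general idea; the operators actually explored in the TSP literature differ in their details) is to copy a slice of the tour from one parent and then fill in the remaining cities in the order they appear in the other parent, so the offspring is always a legal permutation.

```python
import random

def tour_crossover(p1, p2, rng):
    """Copy a random slice from parent 1, then splice in the remaining
    cities in the order they appear in parent 2, so that the child is
    always a legal permutation (every city visited exactly once)."""
    n = len(p1)
    i, j = sorted(rng.sample(range(n), 2))
    child = p1[i:j]
    child += [city for city in p2 if city not in child]
    return child

rng = random.Random(0)
p1 = [0, 1, 2, 3, 4, 5]
p2 = [5, 3, 1, 0, 4, 2]
child = tour_crossover(p1, p2, rng)
assert sorted(child) == sorted(p1)   # still a permutation of all cities
```

As the text notes, legality of the offspring is not the real issue: an operator like this changes how trials are distributed, so the standard hyperplane-sampling arguments no longer apply as stated.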
So we find ourselves "caught between a rock and a hard place" with few places to turn. I don't claim to have the answer either, but there are several observations which would seem to provide some hints. TSP problems fall into an equivalence class of problems called NP-complete because there are no known polynomial-time solutions for any member of the class, and if one were found, there are polynomial-time transformations permitting all other members to be solved in polynomial time. The Boolean Satisfiability Problem (BSP) is a member of this class and involves finding truth value assignments to N boolean variables in such a way as to make an arbitrary given boolean expression of these N variables true. The most natural representation for BSPs is precisely what is needed for use with GAs, namely a binary string of length N. Crossover and mutation work precisely as intended, and problems of surprising size can be solved. (Unfortunately, there isn't much interest here in nearly correct assignments!) What we have then are two problems which are known to be equivalent in the NP-hard sense, but are quite different in a GA-hard sense. The difference seems to hinge on a sort of duality relationship between the two problems. Fitness for BSPs is defined purely in terms of the values of the symbols and not their relative positions in the string. This maps well onto our notion of hyperplanes, and in these situations crossover and mutation are effective mechanisms for homing in on good value combinations. On the other hand, TSP fitness is defined purely in terms of the order of valueless genes, where gene n represents being in city n. Here inversion seems most natural, with crossover and mutation inappropriate in their usual form. What seems to be needed is a definition of a hyperplane in this dual space. Unfortunately, our notions of hyperplanes are so tightly bound to spaces represented by a fixed number of independent axes that it's hard to conceive of alternate definitions.
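The natural fit between BSPs and binary strings can be made concrete with a small sketch. The clause set and the satisfied-clause fitness below are my illustrative assumptions (the paper only asks for assignments making the expression true; counting satisfied clauses of a CNF formula is one common way to give a GA graded feedback):

```python
# A literal +v means "variable v is true", -v means "variable v is false";
# variables are numbered from 1. Each tuple is a disjunctive clause.
CLAUSES = [(1, -2), (2, 3), (-1, 3)]   # (x1 or not x2) and (x2 or x3) and (not x1 or x3)

def fitness(bits):
    """Number of satisfied clauses for a 0/1 assignment string."""
    def lit_true(lit):
        value = bits[abs(lit) - 1] == 1
        return value if lit > 0 else not value
    return sum(any(lit_true(l) for l in clause) for clause in CLAUSES)

print(fitness([1, 0, 1]))  # 3: all clauses satisfied
print(fitness([0, 1, 0]))  # 2: the first clause is violated
```

Note how this fitness depends only on the *values* at each position, never on their order, which is precisely the duality contrast with the TSP drawn in the text.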
With an appropriate definition, there would be a much clearer view of the duals to crossover and mutation, and hopefully a dual set of analytic results.

3.4. Context Sensitive Interpretations

Another form of context sensitivity can arise and cause difficulty when the same value of a particular parameter has different interpretations depending on the values of other parameters. We have already seen how this can occur when attempting to select representations adaptively. Another nice example arises in attempting to escape from the context sensitive value representations of TSPs. One could imagine an N parameter representation in which the first parameter specifies which of the N cities should be visited first. Having deleted that city from our list, the second parameter always takes on a value in the range 1...N-1, specifying by position on our list which of the remaining cities is to be visited second, and so on. Values for each of the parameters can now be independently selected, and crossover and mutation always produce legal tours. However, the performance of GAs on this representation is not significantly better than the previous ones. The difficulty appears to be that gene values to the right of a crossover point or a mutation are interpreted quite differently (i.e., specify totally different subtours) in an offspring than in the parent, violating the concept of minimal disruption of "building block" formation. What seems to be needed is a representation which allows good subtours (co-adapted sets) to form and be passed on in combination with other subtours, forming better tours, and so on. With the traditional definition of a hyperplane, this seems to rule out context sensitive interpretations as bad representations. I am unaware of any alternatives other than the hope that perhaps a more general perspective on hyperplanes will clarify these issues.
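The "deleted list" representation just described can be made concrete with a small decoder (my sketch, not code from the paper). It makes the context sensitivity of interpretation visible: every gene vector decodes to a legal tour, but changing an early gene silently reinterprets every later gene.

```python
def decode_ordinal(genes):
    """Decode the 'deleted list' representation: gene i picks a position
    in the shrinking list of unvisited cities, so any vector with
    genes[i] in range(N - i) decodes to a legal tour."""
    remaining = list(range(len(genes)))
    tour = []
    for g in genes:
        tour.append(remaining.pop(g))
    return tour

# Each gene i may take any value in 0 .. N-i-1, chosen independently.
print(decode_ordinal([0, 0, 0, 0]))  # [0, 1, 2, 3]
print(decode_ordinal([2, 2, 0, 0]))  # [2, 3, 0, 1]
```

Mutating the first gene of [2, 2, 0, 0] changes which city is deleted first, so the identical trailing genes [2, 0, 0] then denote a completely different subtour, which is the disruption of "building blocks" the text describes.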
3.5. Varying Length Representations

So far we have been discussing issues which appear in the context of searching parameter spaces. There are, of course, many other (generally more complex) kinds of spaces which represent the set of permissible structural changes to an adaptive process. In some cases strings are still a natural representation, but there may be no notion of a fixed length. A good example is strings which specify structural changes via "genes" which represent actions to be taken. One string may consist of only a few actions while others require many. If we wish to use standard GAs, the simplest (but somewhat inefficient) approach is to assume some reasonable upper bound on the length, throw in a "no-op" action, and require all strings to be maximum length. Alternatively, crossover can be easily generalized to produce offspring whose length is different (in general) from either parent by choosing independent crossover points in each parent. However, it is important to note that neither approach is sufficient to guarantee good GA performance on varying string length spaces. To understand why requires asking what the hyperplanes are in this context. Both Holland [Holland75] and Smith [Smith80] discuss the issues. I will not repeat the discussions here, but just note that there is considerable evidence that a sufficient condition for good GA performance is that the genes express their actions in a position independent way.

3.6. Non-String Representations

What should one do when elements in the space to be searched are most naturally represented by more complex data structures such as arrays, trees, digraphs, etc.? Should one attempt to "linearize" them into a string representation, or are there ways to creatively redefine crossover and mutation to work directly on such structures? I am unaware of any progress in this area. However, the issues appear to be reasonably clear.
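Returning briefly to varying-length strings: the generalized crossover mentioned above (independent crossover points in each parent) might be sketched as follows. This is my illustration of the mechanism only; it says nothing, of course, about the hyperplane conditions required for good performance.

```python
import random

def varying_length_crossover(p1, p2, rng):
    """Generalized one-point crossover: an independent cut point in each
    parent, so offspring lengths generally differ from both parents."""
    i = rng.randint(0, len(p1))
    j = rng.randint(0, len(p2))
    return p1[:i] + p2[j:], p2[:j] + p1[i:]

rng = random.Random(3)
a = list("ABCDE")
b = list("xyz")
c1, c2 = varying_length_crossover(a, b, rng)
# total genetic material is conserved across the pair of offspring
assert sorted(c1 + c2) == sorted(a + b)
```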
Any linear representation will have to satisfy the properties discussed in the preceding sections in order to achieve efficient GA search. Similarly, any attempts to modify crossover and mutation will require analogous hyperplane analysis results to guarantee reasonable performance.

3.7. Production System Spaces

One of the most intellectually pleasing ways to effect changes in the behavior of a complex process is to modify its knowledge base. There has been a good deal of research within the AI community regarding appropriate ways to represent knowledge. Production rules are frequently chosen when learning is involved [Waterman70, Newell77, Buchanan78]. The GA community has also maintained a long standing interest in production system architectures because of their amenability for use with GAs [Holland75, Holland78, Smith80, Booker82]. From my perspective there are currently two main approaches to searching production system rule spaces with GAs. The first is typified by the classifier systems developed initially by Holland [Holland78] and Booker [Booker82]. Here individuals in the population represent single production rules (typically fixed length) and the current population represents the entire set of rules governing the behavior of the adaptive process. GAs play a subservient role within a larger cognitive model and are invoked intermittently to produce new rules which replace existing rules in the population. The alternate approach is represented by the LS-1 system developed by Smith [Smith80]. Individuals represent entire rule sets to be plugged into the knowledge base and evaluated. The next generation of rule sets is produced in the usual way by applying genetic operators to existing rule sets. Both approaches have produced encouraging results in quite different contexts. There is not enough experience, however, to understand precisely the strengths, weaknesses, and tradeoffs involved in either of the approaches.
My guess is that the classifier approach will prove to be most useful in an on-line, real-time environment in which radical changes in behavior cannot be tolerated, whereas the LS-1 approach will be best suited for off-line environments in which more leisurely exploration and more radical behavioral changes are acceptable.

4. Fitness Functions

In addition to choosing an appropriate representation on which to apply GAs, careful thought must be given to the characteristics of the payoff function used to provide feedback regarding an individual's fitness to produce offspring. The wealth of data from GA function optimization studies simultaneously shows a general robustness in performance over widely varying classes of functions and intermittent dismal results. This has led to several informal characterizations of the kinds of surfaces which are GA-hard. Surfaces which are flat almost everywhere except for an occasional spike present difficult search problems for any approach, including GAs. The intuitive explanation is that, since there is (essentially) no differential payoff among the competing hyperplanes, such peaks will be found only by chance. Unfortunately, it is not all that difficult to inadvertently construct one in applications like the hypersphere and BSP examples discussed earlier. This immediately suggests another way to fool GAs: put misleading information in the hyperplanes. Fortunately, this is much more difficult to do because of the simultaneous sampling of many different hyperplane partition elements. Bethke [Bethke81] has a nice discussion of this using Walsh transforms to characterize GA-hard functions. However, much more work needs to be done in this area. It should also be noted that it is quite easy to incorrectly blame GAs for poor performance when the fault in fact lies elsewhere. One classic case of this arises when using GAs to improve the performance of a complex process for which no payoff function is given.
Since one has to be constructed, care must be taken to verify that high payoff values as seen by GAs correspond to good behavior as observed by watching the complex process itself. Another case arises when numeric parameter spaces are being searched. Since there is typically some freedom in how finely to discretize a parameter range, choosing too coarse a discretization factor may inadvertently leave out optimal points from the representation space being searched by GAs; one may then blame the GAs for not finding them! Until recently, most GA research and applications involved payoff functions which return a single (scalar) payoff value. There are situations in which it is more natural to have the payoff function return a vector of values representing, for example, scores on non-commensurate aspects of performance. Rather than insisting that an artificial function be created which combines such scores into a single payoff value, it would be preferable to have GAs work directly with multi-valued payoffs. Schaffer [Schaffer85] has explored this possibility recently and has obtained promising results.

5. Genetic Operators

There certainly is nothing sacred about the traditional operators defined and analyzed by Holland. What is important is that we have criteria from Holland's hyperplane analysis which operators should meet. If changes are made to existing operators or new ones are introduced, it is important to verify that they aren't overly disruptive of the process of distributing trials according to payoff and that they encourage the formation of building blocks. There are still some interesting open questions along these lines with respect to rather modest variations of the standard operators. It is pretty much standard procedure now to view crossover as applying to circular strings and selecting two crossover points defining the beginning and the end of the segment to be taken from the second parent. This modification is well supported both theoretically and experimentally.
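The circular two-point crossover just described might be sketched as follows (my illustration; the crossover points are passed in explicitly for clarity). The segment taken from the second parent may wrap around the end of the string, which is exactly what viewing the parents as circular buys.

```python
def two_point_crossover(p1, p2, i, j):
    """Treat parents as circular strings; the child takes the segment
    from position i up to (but not including) j from parent 2, wrapping
    around the end if j < i, and the rest from parent 1."""
    n = len(p1)
    child = list(p1)
    k = i
    while k != j:               # walk the circle from i toward j
        child[k] = p2[k]
        k = (k + 1) % n
    return child

# non-wrapping segment:
print(two_point_crossover("00000000", "11111111", 2, 5))  # positions 2-4 from parent 2
# wrapping segment (crosses the "end" of the circular string):
print(two_point_crossover("00000000", "11111111", 6, 2))  # positions 6, 7, 0, 1 from parent 2
```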
What happens if we continue along this vein and select two segments from the second parent (via four crossover points)? Is this helpful or too disruptive? The answers are pretty clearly negative by the time we have increased the number of crossover points to the extent that an offspring's gene values are randomly selected from its parents' values. Perhaps the number of crossover points should be a function of the length of the strings involved. Applying the traditional crossover to strings with thousands of genes (which is currently being done) seems intuitively more disruptive than one with four or six crossover points. If so, where does the law of diminishing returns set in? The role of mutation as a background operator which introduces new allele values is fairly well understood and accepted in the abstract. As discussed earlier, problems can arise from our choice of representation in which mutation (and crossover) are operating at the bit level, but our interpretation of the search space is at a higher level. This can lead to a frequently tried but rarely successful strategy of increasing the mutation rate to improve GA performance. A better approach in such situations is to think in terms of both higher and lower level versions of the genetic operators. Both Holland [Holland75] and Smith [Smith80] discuss this, but much more work needs to be done.

6. Selection

The technique of selecting parents for reproduction with a frequency proportional to observed fitness has strong theoretical justification and considerable empirical support. However, there are occasions when this process seems to break down when implementing GAs with finite populations. This has come to be known as "the scaling problem" and can occur in a number of ways. If a highly fit individual is encountered early in the search process among
mediocre peers, selection will give it such strong preference that it can dominate the population in a few generations and cause premature convergence. Similarly, late in the search process the population can be legitimately dominated by members with very high payoffs which differ on an absolute scale, but which, when normalized to produce expected numbers of offspring, are equivalent out to the third or fourth decimal place. The effect is that essentially every parent contributes equally to subsequent populations in spite of fitness differences. There have been a number of proposed solutions, including the introduction of scaling factors and crowding factors [DeJong75] and selection by rank [Wetzel83, Schaffer85]. However, I think it is fair to say that a general solution still eludes us.

7. GA Parameters

One of the observations people are quick to make is that GAs are themselves complex processes which appear to have a set of parameters (crossover rate, mutation rate, population size, etc.) which could be tuned to improve performance. There is considerable empirical support for the statement that within reasonable ranges the values of such parameters are not all that critical [DeJong75, Grefenstette85]. As a consequence most GA applications work with fixed "accepted" parameter values. However, there is also evidence to suggest that additional performance improvements could be obtained if such parameter values could be dynamically modified. The difficulty is in deciding when and how to effect such changes. Should we have a two-level GA complex, with the top level GA actively searching the parameter space of the lower level GA and trying out new parameter combinations? Are there simpler signals, such as allele loss, which should trigger parameter changes? Unfortunately, the existing theory gives little guidance here.

8.
Conclusion

In rereading the previous sections, I became a little concerned that the reader might infer a strong negative tone from this long list of problems and open issues in GA research. Nothing could be further from my intent. I am enthusiastic about the potential which GAs hold and am actively involved in GA research and applications. It is that enthusiasm which generated this paper and this conference. I hope the result is that the next time we get together my list will be considerably shorter (or at least different)!

References

[Bethke81] Bethke, A., "Genetic Algorithms as Function Optimizers", Doctoral Thesis, CCS Department, University of Michigan, 1981.

[Booker82] Booker, L. B., "Intelligent Behavior as an Adaptation to the Task Environment", Doctoral Thesis, CCS Department, University of Michigan, 1982.

[Brindle80] Brindle, A., "Genetic Algorithms for Function Optimization", Doctoral Thesis, Department of Computing Science, University of Alberta, 1980.

[Buchanan78] Buchanan, B., Mitchell, T. M., "Model-Directed Learning of Production Rules", in Pattern-Directed Inference Systems, eds. Waterman and Hayes-Roth, Academic Press, 1978.

[DeJong75] De Jong, K., "The Analysis of the Behavior of a Class of Genetic Adaptive Systems", Doctoral Thesis, CCS Department, University of Michigan, 1975.

[DeJong80a] De Jong, K., "A Genetic-based Global Function Optimization Technique", TR 80-2, Department of Computer Science, University of Pittsburgh, 1980.

[DeJong80b] De Jong, K., "Adaptive System Design: A Genetic Approach", IEEE Trans. on Systems, Man and Cybernetics, 10, 9, Sept. 1980.

[DeJong81] De Jong, K. and Smith, T., "Genetic Algorithms Applied to Information Driven Models of US Migration Patterns", 12th Annual Pittsburgh Conf. on Modelling and Simulation, April 1981.

[Frantz72] Frantz, D.
R., "Non-linearities in Genetic Search", Doctoral Thesis, CCS Department, University of Michigan, 1972.

[Grefenstette85] Grefenstette, J., "Genetic Algorithms for Multilevel Adaptive Systems", to appear in IEEE Trans. on Systems, Man and Cybernetics.

[Hedrick76] Hedrick, C. L., "Learning Production Systems from Examples", Artificial Intelligence, Vol. 7, 1976.

[Holland75] Holland, J. H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

[Holland78] Holland, J. H., Reitman, J., "Cognitive Systems Based on Adaptive Algorithms", in Pattern-Directed Inference Systems, eds. Waterman and Hayes-Roth, Academic Press, 1978.

[Newell77] Newell, A., "Knowledge Representation Aspects of Production Systems", Proc. 5th IJCAI, 1977.

[Schaffer85] Schaffer, J. D., "Multiple Objective Optimization with Vector Evaluated Genetic Algorithms", to appear in Proc. Int'l Conf. on Genetic Algorithms and their Applications, July 1985.

[Smith80] Smith, S. F., "A Learning System Based on Genetic Adaptive Algorithms", Doctoral Thesis, Department of Computer Science, University of Pittsburgh, 1980.

[Smith83] Smith, S. F., "Flexible Learning of Problem Solving Heuristics Through Adaptive Search", Proc. 8th IJCAI, August 1983.

[Wetzel83] Wetzel, A., "Evaluation of the Effectiveness of Genetic Algorithms to Combinatorial Optimization", Doctoral Thesis, Department of Library and Information Science, University of Pittsburgh, 1983.

Classifier System with Long-term Memory in Machine Learning

Hayong Zhou
Vanderbilt University

ABSTRACT

This paper discusses the advantages of classifier systems with long-term memory and includes a description of the basic structure of such a system. The learning strategy used here is a twofold one. First, an analogical learning strategy is employed to inject the appropriate knowledge into the population. Second, a production system with a GA-based learning component is invoked to perform subsequent learning.
The proposed system has one overall objective: it seeks to increase the efficiency and power of the learning system over a long period of use.

1. Introduction

A genetic algorithm (GA) is a problem-solving and non-deterministic search algorithm first introduced by Holland in 1975 [3]. It has been shown, theoretically and empirically, that GAs are robust and effective in various task domains, even in the presence of difficulties such as noise, high-dimensionality, multimodality and discontinuity [7]. The outgrowth of the continuing research in this area evolved into a message-passing, rule-based production system called a classifier system [4]. A classifier system is a learning system in which many classifiers are active simultaneously. A classifier is a pattern sensitive element with condition/action form. Each condition specifies the set of messages satisfying it, and each action specifies the message to be sent when its condition part is satisfied. In short, a classifier system manipulates knowledge structures (KSs) in response to performance via a genetic algorithm. It provides a framework for cognitive simulation [2]. Several published classifier systems which incorporate transfer of learned knowledge from one task to another have been developed. In 1978, Holland and Reitman designed the first classifier system, called CS-1, tested on maze problems. An experiment was conducted to demonstrate transfer of learning from a small maze problem to a large but similar one [4]. The experimental result showed that CS-1 was able to solve the large maze problem much faster when initially supplied with some learned knowledge. In 1982, Booker did an in-depth simulation study of classifier systems as cognitive models [2]. He performed several experiments to demonstrate the effects of prior knowledge structures on learning in new situations.
For "positive transfer" (transfer of knowledge for solving similar tasks), his results were very encouraging. Before proceeding any further, the "reversal learning task" needs to be described. Schrier [6] trained a monkey on a reversal learning task. Reward and punishment were reversed repeatedly while keeping the input information to the monkey unchanged. Performance of this monkey was inefficient at the outset, but, eventually, each new reversal could be learned with a single trial. In order to test the learning ability of classifier systems, Booker ran his system on the reversal learning task. Surprisingly, the resulting performance was inconclusive. The reasons, according to Booker, are that "the emphasis on recency and short-term memory in the system is too great" because "by the time the organism had reached criterion on a given reversal, the classifiers learned during the previous reversal were likely to have been deleted - that is, become 'extinct' due to the drastic change in the environment" [2]. In 1984, Schaffer completed the LS-2 system designed for the pattern discrimination task domain [6]. He also gave the reversal learning task to his system. The results obtained so far are not encouraging either (private communication). In sum, efforts to build powerful classifier systems have met with impressive success over the past decade. The attempts to transfer learned knowledge for solving similar tasks, though done manually, have been shown to be useful and effective. However, the failures in solving the reversal learning task pose a question: is there any way that classifier systems can keep knowledge which is useful but irrelevant to the current situation intact, in order to increase the efficiency and power of their learning ability? To answer this question, this paper proceeds from a general need for having a long-term memory to a proposed prototype in the following sections.

2.
Motivation for the design of a classifier system with long-term memory (CSLM)

We begin this section with several assumptions which have been associated with traditional classifier systems:

* The domain of learning is concerned with a single task.
* The changes in environments are slight, smooth and gradual.
* The efficiency for solving similar tasks in the long run is not important.

If a task domain satisfies these assumptions, it would be unnecessary to augment a classifier system with long-term memory. However, an ideal learning system should be able to switch its attention as needed while still preserving the most useful knowledge gained in the past, no matter how its environment has been changed. By doing so, the system would increase its efficiency and power over time and improve its learning ability as the number of learned tasks grows. In short, the main concern of this paper is to investigate how to accumulate and preserve knowledge not only within a task, but also among tasks. It has been shown empirically that the size of a population should be chosen around 50 (number of knowledge structures) in order to maximize computational efficiency [8]. In practice, most classifier systems never use a population larger than 200. For such small knowledge pools, it is hard to imagine that a set of generalized knowledge structures could be constructed, for example, suitable for many pattern discrimination tasks. A short-term memory, i.e., the population in a classifier system, cannot be expected to meet the challenges imposed by drastic environmental changes. Each knowledge structure in a population is evaluated by the Critic designed for the current task. It is very difficult, if not impossible, to preserve those knowledge structures which were perfect for some previous tasks but not suitable for the current situation. We see this as a serious weakness of the current model and as the major motivation for the design of a classifier system with long-term memory (CSLM).

3.
Overall description of CSLM

In this section, an overall organization of CSLM is outlined. The description is based on the diagram in figure 1 and is intended to be instructive rather than specific; an understanding of the basics of classifier systems has been assumed (it is well described in [2, 4, 5]).

[Figure 1. The overall structure of CSLM: detectors produce descriptors (Di); the Matcher consults long-term memory before initializing the population; after learning, the winners (new KSs) are stored back.]

Matching the descriptor of an incoming task against those of previously solved tasks can have one of the following three outcomes:

1. Exact matching. The task has been solved before. The first step is to bring the learned knowledge structures into the population; heuristic initialization of the population is done.

2. Partial matching. One similar task can be found. The similarity between the incoming task and the stored ones indicates that there might exist some useful building blocks in the stored knowledge structures which, hopefully, can provide a promising direction to start with. Thus the search space would be pruned and the computational effort might be reduced.

3. No matching. This tells us that no previous experience regarding the incoming task is known, or possibly it has been forgotten. In this case the CSLM has to start from scratch, no worse than current classifier systems.

In simplest terms, we can visualize the main components of CSLM as follows:

* Long-term memory: The long-term memory consists of two separated memories called the Episodic Memory (EM) and the Knowledge Base (KB) respectively. The EM stores all descriptors for previous tasks.

* Descriptors: Descriptors serve as indices to learned knowledge structures. The descriptors for various tasks could be very general. In fact, a complete and precise descriptor for a task is neither necessary nor realistic.
In practice, the descriptors might use a low level language(a string of bits) or a high level language(alphabet) to express main characteristics of taske They Pointing to its corresponding KSs in the KB The content of the EM may be considered as the indices for accumulated may be produced automatically from incoming tasks, or supplied by users, * Matcher: The Matcher(a procedure) performs two functions. ~—_ matching knowledge structures ‘The KB preserves learned KSs descriptors and initiating a population. We Whenever a task has been solved, the set of discuss them together here Matching the soluions are stored ‘ia the tong-tarsi descriptor of a incoming task with that of memory along with the associated pointer ks i 7 a solved tasks in a long-term memory might One of the basic learning strategies 180 employed in CSLM is “learning by analogy” which appears to be a centr human cognition and promises to be a powerful mechanism in machine learning . Learning by analogy consists of two phases. The first phase called the "reminding phase" which identifies the similarity between an incoming task and the problems observed or solved before The second phase involves the transfer of appropriate knowledge obtained in the past into the new situation Carbonell pointed out the importance of learning by analogy:* In general, transfer of experience among related problems appears to be theoretically significant phenomenon as well as a practical necessity in acquiring the task - dependent expertise necessary to solve more complex real world problems*{1). The approach used in CSLM 1s to form descriptors derived from the detector array to categorize tasks. 
In the reminding process, similarity could be determined by matching these stored descriptors in a long-term memory with the descriptor derived from an incoming task In the next phase of analogical problem solving, the related knowledge structures, if any, would be brought into the population Notice that to inject these learned KSs into a population 18 not the end of our story. Instead, it should be viewed as providing strong guidance for future search ‘The genetic algorithm will manipulate these ‘useful building blocks and transform them into a form that would be appropriate for the current task. In the next phase, the classifier system is invoked to perform the subsequent learning which will not been detailed here inference method in 4. Solving the reversal learning task in CSLM First of all, we need to emphasize that the interestingness of the reversal learning task is not only because it represents a new class of learning tasks, but also, more importantly, it tests the learning ability of a system on how well it can preserve useful knowledge from radical changes in environments. Let us see what will happen if a reversal learning task is given to CSLM. Suppose that a CSLM has created a set of KSs for a given task and stored it along with its associated descriptor in a long-term memory, as shown in figure 2.2, When the second task with the same appearance: but opposite meaning(reversal task) is given, the CSLM, as expected, is in the worst possible position to learn the new task since the Matcher procedure would have brought the learned KSs into the population. In this case, the learned KS would receive a low score and the classifier system would have to develop a new KS for the reversal task. 
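The Matcher's three-way decision can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the bit-string descriptor format, the similarity measure (fraction of agreeing positions), and the threshold value are all assumptions.

```python
# Hypothetical sketch of the Matcher's three outcomes (exact, partial, none).
# Descriptors are assumed to be fixed-length bit strings; thresholds are invented.

def similarity(d1, d2):
    """Fraction of positions on which two bit-string descriptors agree."""
    return sum(a == b for a, b in zip(d1, d2)) / len(d1)

def match(incoming, episodic_memory, knowledge_base, threshold=0.75):
    """Return KSs to seed the population with, or [] to start from scratch."""
    if not episodic_memory:
        return []                      # 3. No matching: no stored experience
    best = max(episodic_memory, key=lambda d: similarity(incoming, d))
    s = similarity(incoming, best)
    if s == 1.0:
        return knowledge_base[best]    # 1. Exact matching: reuse the stored KSs
    if s >= threshold:
        return knowledge_base[best]    # 2. Partial matching: useful building blocks
    return []                          # 3. No matching

em = ["1010", "1111"]
kb = {"1010": ["KS-a", "KS-b"], "1111": ["KS-c"]}
print(match("1010", em, kb))   # exact match returns the stored KSs
print(match("0000", em, kb))   # nothing similar enough: start from scratch
```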
However, after the CSLM has created two sets of KSs for each reversal, it can solve subsequent reversal learning tasks within a single trial. As noted earlier, the generality of a descriptor for a task would guarantee that the CSLM recognizes tasks with the same or similar characteristics. Thus the Matcher would be able to pull two sets of KSs out of the long-term memory based on the similarity measurement and inject them into the population. The initialized population is shown in figure 2.b. Therefore, the Critic would be able to choose the appropriate KS for each reversal.

Figure 2.

Another significance of this demonstration is to show what happens if a set of bad knowledge structures has been used to initialize the population. The full power of genetic algorithms comes from the parallel nature of the search and the immunity to false peaks. Therefore, these injected KSs are only tentative, and as such are subject to testing. If some of them prove useless or misleading, they will die out in subsequent generations. There is a further point worth noting: the portion of a population to be heuristically initialized should be judiciously decided so that premature convergence can be avoided while still giving an opportunity to guide future search.

5. Summary and Future Research

This paper has discussed the advantages of augmenting classifier systems with long-term memory and described a prototype of CSLM conceptually. The process of solving the reversal learning task was demonstrated as well. The driving force behind this paper is to extend the current model in order to deal with more complex tasks and make consistent progress even if environments have been drastically changed. Several difficulties which can be anticipated in the design of CSLM are mentioned here:

* How to extract descriptors from tasks with reasonable accuracy and effort while maintaining the delicate balance between generality and specificity?
* How to update the content of a long-term memory dynamically?
* How best to initialize a population?

In seeking answers to these questions and to test the feasibility of the proposed ideas, a specific CSLM designed for the pattern discrimination domain is to be implemented. It is hoped that the experimental results will be available soon as evidence of the improved learning ability of the proposed system.

Acknowledgements

The author would like to thank his advisor, Dr. John Grefenstette, for his guidance, and Dr. David Schaffer for his encouragement during the development of this paper.

References

1. Carbonell, J.G. Learning By Analogy: Formulating And Generalizing Plans From Past Experience. In Machine Learning, 137-159, Tioga Publishing Co.
2. Booker, L. Intelligent Behavior As An Adaptation To The Task Environment. Ph.D. dissertation, The University of Michigan, 1982.
3. Holland, J.H. Adaptation in Natural and Artificial Systems. The University of Michigan Press, 1975.
4. Holland, J.H. and Reitman, J.S. "Cognitive Systems Based On Adaptive Algorithms." In Pattern-Directed Inference Systems, 313-329, 1978.
5. Schaffer, J.D. Some Experiments in Machine Learning Using Vector Evaluated Genetic Algorithms. Ph.D. dissertation, Vanderbilt University, 1984.
6. Schrier, A.M. Transfer By Macaque Monkeys Between Learning-set and Repeated-reversal Tasks. Percept. Mot. Skills, 23, 787-792.
7. De Jong, K.A. Analysis of the Behavior of a Class of Genetic Adaptive Systems. Ph.D. dissertation, University of Michigan, 1975.
8. Grefenstette, J.J. Optimization of Control Parameters for Genetic Algorithms. To appear in IEEE Trans. Sys., Man, Cybern., 1985.

A Representation for the Adaptive Generation of Simple Sequential Programs

Nichael Lynn Cramer
Texas Instruments Inc.
PO Box 226015, MS 238
Dallas, TX 75266

ABSTRACT

An adaptive system for generating short sequential computer functions is described. The created functions are written in the simple "number-string" language JB, and in TB, a modified version of JB with a tree-like structure.
These languages have the feature that they can be used to represent well-formed, useful computer programs while still being amenable to suitably defined genetic operators. The system is used to produce two-input, single-output multiplication functions that are concise and well-defined. Future work, dealing with extensions to more complicated functions and generalizations of the techniques, is also discussed.

INTRODUCTION

The techniques of adaptive Genetic Algorithms [GAs] [1] have been shown to be useful in many areas. Initially, these systems involved the adjusting of a fixed set of parameters in order to optimize the performance of a given algorithm [2]. Much work has been done toward the goal of evolving the algorithms themselves, particularly in Production System-like domains [1 (ch. 8), 3, 4]. This paper discusses work toward developing a sequential programming language that is suitable for manipulation by GAs so as to permit the adaptive generation of simple computer functions from low-level computational primitives.

FUNCTIONAL REPRESENTATION

The scheme that we will follow is first to find a suitably powerful programming language, and then encode the programs in this language in such a way as to make them amenable to the standard Genetic Operators [GOs]. The basic language to be used is a variation of the algorithmic language PL having the following operators:

(:INC VAR)       ;;add 1 to the variable VAR
(:ZERO VAR)      ;;set the variable VAR to 0
(:LOOP VAR STAT) ;;perform the statement STAT VAR times
(:GOTO LAB)      ;;jump to the statement with label LAB

Programs in PL consist of an arbitrary number of globally-scoped (positive) integer variables and statements containing operators of the above forms.
Two simple example PL programs are:

;;Set variable V0 to have the value of V1
(:ZERO V0)
(:LOOP V1 (:INC V0))

;;Multiply V3 by V4 and store the result in V5
(:ZERO V5)
(:LOOP V3 (:LOOP V4 (:INC V5)))

While PL can be shown to be Turing Equivalent [5], we will be interested in the language subset PL-{:GOTO}. This language subset has two useful properties: first, while it is not fully Turing Equivalent, it still comprises a powerful set of functions (specifically, the set of primitive recursive functions) [5]; and second, programs written in PL-{:GOTO} are guaranteed to halt.

Finally, we make two small extensions to the language. First, a :SET operator, which accepts two variables and sets the value of the first variable equal to that of the second. (As can be seen in the examples above, this operation is trivially definable in PL-{:GOTO}; if so desired, it can be considered a macro or subroutine operator.) Secondly, we define a :BLOCK operator that accepts two statements as arguments and evaluates the two statements sequentially. (This is essentially just a grouping operation that has no effect on the overall structure of the language.)

Now, the encoded representation for our programs should have two characteristics:

(Goal 1) It should be amenable to the standard GOs.
(Goal 2) The representation should produce only well-formed programs, even when subjected to the GOs.

While some representations, e.g. character-strings, might be well suited for the mechanisms of GOs, the random generation and/or altering of characters is not likely to produce, say, a useful FORTRAN program. Consequently, it is strongly desirable that the chosen representation be such that all such generated programs stay in the space of syntactically correct programs. Not all such generated programs would be useful (adaptation would be expected to correct that); it is only important at this point that such programs be well formed.
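Since PL-{:GOTO} has only a handful of operators, its semantics can be made concrete with a small interpreter. The paper gives no interpreter, so the following Python sketch is an illustration under our own assumptions: statements are nested tuples mirroring the PL syntax, and variables live in a dictionary.

```python
# A minimal interpreter for PL-{:GOTO} plus the :SET and :BLOCK extensions.
# Statements are nested tuples; vars is a dict mapping variable names to values.
def run(stmt, vars):
    op = stmt[0]
    if op == ':INC':
        vars[stmt[1]] += 1
    elif op == ':ZERO':
        vars[stmt[1]] = 0
    elif op == ':SET':
        vars[stmt[1]] = vars[stmt[2]]
    elif op == ':LOOP':
        # The loop count is sampled once on entry, so every program halts.
        for _ in range(vars[stmt[1]]):
            run(stmt[2], vars)
    elif op == ':BLOCK':
        run(stmt[1], vars)
        run(stmt[2], vars)

# The multiplication example above: V5 := V3 * V4
v = {'V3': 6, 'V4': 7, 'V5': 99}
run((':BLOCK', (':ZERO', 'V5'),
     (':LOOP', 'V3', (':LOOP', 'V4', (':INC', 'V5')))), v)
print(v['V5'])  # 42
```

Sampling the loop count on entry to :LOOP is what makes the halting guarantee visible: the body cannot extend its own iteration count.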
This paper will consider lists of integers as a representation for these programs, where the object the integer represents (variable, operator, etc.) is determined by the integer's position in the list. Clearly such a representation satisfies Goal 1 above: the standard GOs (Crossover, Mutation, Inversion) are well defined on such a list. To satisfy Goal 2, we need to define a decoding of an arbitrary list into a well-formed program.

THE JB LANGUAGE

A first attempt at such a decoding is the language JB. The list of integers is first divided into statements of some length large enough for the longest statement size (three in the present case). Any integers left over at the end of this list are ignored. The first of these statements is defined to be the Main Statement [MS] and the remaining N_as statements are the Auxiliary Statements [AS_n]. Syntactically, these statements are interpreted as follows:

(0 4 2)  -> (:BLOCK AS4 AS2)
(1 6 0)  -> (:LOOP V6 AS0)
(2 1 9)  -> (:SET V1 V9)
(3 17 8) -> (:ZERO V17) ;;the 8 is ignored
(4 0 5)  -> (:INC V0)   ;;the 5 is ignored

Here the symbols of the forms Vn and ASn represent, respectively, example Variables and Auxiliary Statements. This body of statements is embedded in an environment containing N_bv body-variables (initialized to 0) and N_iv input-variables. At the end of the execution of the program, any of the N_vtot = (N_iv + N_bv) available variables can be returned as output. The function is entered by executing the MS, which, typically, will call on one or more of the AS's. An example JB program would be:

(0 0 1 3 5 8 1 3 2 1 4 3 4 5 9 9 2)

This would be grouped into the following statements:

(0 0 1) ;;main statement        -> (:BLOCK AS0 AS1)
(3 5 8) ;;auxiliary statement 0 -> (:ZERO V5)
(1 3 2) ;;auxiliary statement 1 -> (:LOOP V3 AS2)
(1 4 3) ;;auxiliary statement 2 -> (:LOOP V4 AS3)
(4 5 9) ;;auxiliary statement 3 -> (:INC V5)

This is the same as the PL multiplication program above.
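The grouping-and-decoding step can be sketched as a short routine. This is a hedged illustration under our own assumptions: statement length 3, the operator codes from the table above, integers already in valid ranges, and a recursion-depth guard standing in for the loop check discussed below.

```python
# Sketch of decoding a JB integer list into a nested PL-like form.
# Statement 0 is the Main Statement; AS_n is grouped statement n + 1.
def decode(stmts, i, depth=0):
    """Decode statement i of the grouped list into a nested PL tuple."""
    assert depth < len(stmts), "loop among the Auxiliary Statements"
    op, a, b = stmts[i]
    if op == 0: return (':BLOCK', decode(stmts, 1 + a, depth + 1),
                                  decode(stmts, 1 + b, depth + 1))
    if op == 1: return (':LOOP', 'V%d' % a, decode(stmts, 1 + b, depth + 1))
    if op == 2: return (':SET', 'V%d' % a, 'V%d' % b)
    if op == 3: return (':ZERO', 'V%d' % a)   # third integer ignored
    if op == 4: return (':INC', 'V%d' % a)    # third integer ignored

prog = [0, 0, 1, 3, 5, 8, 1, 3, 2, 1, 4, 3, 4, 5, 9, 9, 2]
# Group into statements of three; leftover integers (here 9, 2) are ignored.
stmts = [prog[i:i + 3] for i in range(0, len(prog) - len(prog) % 3, 3)]
print(decode(stmts, 0))
```

Running this on the example list reproduces the multiplication program: a :BLOCK of (:ZERO V5) and the nested :LOOPs incrementing V5.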
As can be seen, virtually (see below) any list (of sufficient length) of integers chosen from the range [0, N_rand - 1] can be used to generate a well-formed JB program, where N_rand = N_vtot * N_as * N_op (N_op is the total number of operator types). A particular language object (variable, AS, operator-type) needed for the program can then be extracted from a given integer in the list by taking the modulus of that integer with respect to the respective number above. This ensures random selection over all syntactic types.

Two problems arise from this straightforward use of the JB language. The first, a minor problem, is that a JB integer-list will not define a correct program when a loop is created among the Auxiliary Statements. In practice, with a moderate number of AS's this is a rare occurrence. Moreover, it is easy to remove such programs during the expansion of the body of the program. (In any case, this problem will be removed in the TB language below.)

A second, more serious problem is that while the mechanisms of the applications of the GOs are very simple in the JB language, the semantic implications of their use are quite complicated. Because of the structure of JB, the semantic positioning of an integer-list element is extremely sensitive to change. As a specific example, consider a large complicated program beginning with a :BLOCK statement in the top-level Main Statement. A single, unfortunate, mutation converting this operator to a :SET would destroy any useful features of the program. Secondly, this strongly epistatic nature of JB seems incompatible with Crossover, given Crossover's useful-feature-passing nature. A useful JB substructure shifted one integer to the right will almost certainly retain none of its previously useful properties.

THE TB LANGUAGE

In an effort to alleviate these problems, we consider a modified version of JB. This language, called TB, takes advantage of the implicit tree-like nature of JB programs.
TB is fundamentally the same as JB except that the Auxiliary Statements are no longer used. Instead, when a TB statement is generated, either at its initial creation or as a result of the application of a GO (defined below), any subsidiary statements that the generated statement contains are recursively expanded at that time. The TB programs no longer have the simple list structure of JB, but instead are tree-like. Because we are simply recursively expanding the internal statements without altering the actual structure of the resulting program, the TB programs still satisfy Goal 2. Indeed, it can be seen that, because of its tree-like structure, TB does not suffer from the problem of internal loops described above. Thus, all possible program trees do indeed describe syntactically correct programs. An example of a TB program is:

(0 (3 5) (1 3 (1 4 (4 5))))

This expands to the same PL and JB multiplication programs given above.

The standard GOs are defined in the following way. Random Mutation could be defined to be the random altering of integers in the program tree. This would be valid but would encounter the same "catastrophic minor change" problems as did JB. Instead, Random Mutation is restricted to the statements near the fringe of the program tree. Specifically: 1) to leaf statements, i.e., those that contain operators that do not themselves require statements as arguments (:INC, :SET, :ZERO); and 2) to non-leaf statements (with operators :BLOCK, :LOOP) whose sub-statement arguments are themselves leaf statements. Inside a statement, mutation of a variable simply means randomly changing the integer representing that variable. Mutating an operator involves randomly changing the integer representing the operator and making any necessary changes to its arguments, keeping any of the integers as arguments that are still appropriate, and recursively expanding the subsidiary statements as necessary.
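The fringe restriction can be made concrete as a predicate selecting mutation-eligible statements. The nested-tuple encoding below is an illustrative assumption (the paper does not prescribe a data structure), but the two eligibility rules are the ones just stated.

```python
LEAF_OPS = {':INC', ':SET', ':ZERO'}   # operators taking no statement arguments

def is_leaf(stmt):
    return stmt[0] in LEAF_OPS

def near_fringe(stmt):
    """Rule 1: leaf statements. Rule 2: non-leaf statements all of whose
    sub-statement arguments are themselves leaves."""
    if is_leaf(stmt):
        return True
    subs = [s for s in stmt[1:] if isinstance(s, tuple)]
    return all(is_leaf(s) for s in subs)

def fringe_statements(tree):
    """Collect every statement in the tree that is eligible for Random Mutation."""
    found = []
    def walk(t):
        if near_fringe(t):
            found.append(t)
        for s in t[1:]:
            if isinstance(s, tuple):
                walk(s)
    walk(tree)
    return found

mult = (':BLOCK', (':ZERO', 'V5'),
        (':LOOP', 'V3', (':LOOP', 'V4', (':INC', 'V5'))))
print(fringe_statements(mult))
```

On the multiplication tree this selects (:ZERO V5), the inner (:LOOP V4 (:INC V5)), and (:INC V5), but not the top-level :BLOCK or the outer :LOOP, which is exactly the protection against "catastrophic minor changes" near the root.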
Similarly, following Smith [6], we restrict the points at which Crossover can occur. Specifically, Crossover on TB is defined to be the exchange of subtrees between two parent programs; this is well-defined and clearly embodies the intuitive notion of Crossover as the exchange of (possibly useful) substructures. This method also avoids the problems that Crossover entails in JB. In a similar manner, we could define Inversion to be the exchange of one or more subtrees within a given program.

EXAMPLE

As a concrete example, an attempt was made to "evolve" concise, two-input, one-output multiplication functions from a population of randomly generated functions. As discussed by Smith [6], a major problem here is one of "hand-crafting" the evaluation function to give partial credit to functions that, in some sense, exhibit multiplication-like behavior, without actually doing multiplication. After much experimentation, the following scheme for giving an evaluation score was used. For a given program body to be scored, several instantiations of the function were made, each having a different pair of input variables [IVs]. Each of these test functions was given a number of pairs of input values, and the values of all of the function's variables were collected as output variables [OVs]. The resulting output values were examined and compared against the various combinations of input values and IVs. The following types of behavior were noted, and each successive type given more credit:

1) OVs that had changed from their initial values. (Is there any activity in the function?)
2) Simple functional dependence of an OV on an IV. (Is the function noticing the input?)
3) The value of an IV is a factor of the value of an OV. (Are useful loop-like structures developing?)
4) Multiplication. (Is an OV exactly the product of two IVs?)

Furthermore, rather than accept input and/or output in arbitrary variables,
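Crossover as subtree exchange can be sketched directly on tree-structured programs. The paper does not specify how crossover points are chosen, so uniform random selection over statement nodes, and the nested-tuple encoding, are assumptions of this sketch.

```python
import random

# TB programs as nested tuples, e.g. (':LOOP', 'V3', (':INC', 'V5')).
def nodes(tree, path=()):
    """Enumerate (path, subtree) pairs for every statement node in the tree."""
    yield path, tree
    for i, child in enumerate(tree):
        if isinstance(child, tuple):
            yield from nodes(child, path + (i,))

def graft(tree, path, sub):
    """Return a copy of tree with the node at path replaced by sub."""
    if not path:
        return sub
    i = path[0]
    return tree[:i] + (graft(tree[i], path[1:], sub),) + tree[i + 1:]

def crossover(p1, p2, rng=random):
    """Exchange one randomly chosen subtree between the two parents."""
    path1, sub1 = rng.choice(list(nodes(p1)))
    path2, sub2 = rng.choice(list(nodes(p2)))
    return graft(p1, path1, sub2), graft(p2, path2, sub1)

mult = (':BLOCK', (':ZERO', 'V5'),
        (':LOOP', 'V3', (':LOOP', 'V4', (':INC', 'V5'))))
c1, c2 = crossover(mult, (':BLOCK', (':INC', 'V0'), (':ZERO', 'V1')))
print(c1)
print(c2)
```

Because only whole statement nodes are ever exchanged, every child is again a syntactically correct program tree, which is the point of restricting the crossover sites.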
scores were given extra weight if the input and/or output occurred in the specific target variables. To ensure that the functions remain reasonably short, functions beyond a certain length are penalized harshly. Finally, a limit is placed on the length of time a function is permitted to run; any function that has not halted within this time is aborted.

A number of test runs were made for the system with a population size of fifty. These were compared against a set of control runs. The control runs were the same as the regular runs except that there was no partial credit given; all members of the population were given a low, nominal score until they actually started multiplying correctly. All runs were halted at the thirtieth generation. The system produced the desired multiplication functions 72% more often than the control sample.

FUTURE WORK

Finally, a number of questions remain concerning the present system and its various extensions.

Extensions of the Present System: Generation of other types of simple arithmetic operations seems to be the next step in this direction. Given the looping nature of the underlying PL language, it seems obvious that the system should be well suited for also generating addition functions. However, it is less clear that it would do equally well attempting to generate, e.g., subtraction or division functions, to say nothing of more complicated mathematical functions. Indeed, the results of the control case above show that it is difficult not to produce multiplication in this language; generation of other types of functions would prove an interesting result. On the other hand, are there other, comparably simple, languages that are better suited to other types of functions?

Concerning Extensions of the Language: A useful feature of the original JB language is its suitability for the mechanisms of the GOs.
Can some further modification be made to the current TB language to bring it back into line with a more traditional bit-string representation? Are these modifications, in fact, really desirable? Alternatively, would it be useful to modify the languages to make the GOs less standard? For example, would it be productive to formalize the subroutine-swapping nature of the present method of Crossover and define a program as a structure comprising a number of subroutines, where the application of Crossover and Inversion was restricted to the swapping of entire subroutines, and Random Mutation restricted to occurring inside the body of a subroutine?

ACKNOWLEDGEMENTS

I would like to thank Dr. Dave Davis for innumerable valuable discussions and Dr. Bruce Anderson for preserving the environment that made this work possible.

REFERENCES

1. Holland, John H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
2. Bethke, A., Genetic Algorithms as Function Optimizers, Ph.D. Thesis, University of Michigan, 1980.
3. Smith, S.F., A Learning System Based on Genetic Adaptive Algorithms, Ph.D. Thesis, Univ. of Pittsburgh, December, 1980.
4. Holland, J.H. and J. Reitman, Cognitive Systems Based on Adaptive Algorithms, in Pattern Directed Inference Systems, Waterman and Hayes-Roth, Eds., Academic Press, 1978.
5. Brainerd, W.S. and Landweber, L.H., Theory of Computation, Wiley-Interscience, 1974.
6. Smith, S.F., Flexible Learning of Problem Solving Heuristics through Adaptive Search, Proc. IJCAI-83, 1983.

ADAPTIVE "CORTICAL" PATTERN RECOGNITION

by
Stewart W. Wilson
Rowland Institute for Science, Cambridge MA 02142

ABSTRACT

It is shown that a certain model of the primate retino-cortical mapping "sees" all centered objects with the same "object-resolution", or number of distinct signals, independent of apparent size.
In an artificial system, this property would permit recognition of patterns using templates in a cortex-like space. It is suggested that with an adaptive production system such as Holland's classifier system, the recognition process could be made self-organizing.

INTRODUCTION

Templates are generally felt to have limited usefulness for visual pattern recognition. Though they provide a simple and compact description of shape, templates cannot directly deal with objects that, as is common, vary in real or apparent (i.e., imaged) size. However, the human visual system, in the step from retina to cortex, appears to perform an automatic size-normalizing transformation of the retinal image. This suggests that pattern recognition using templates may occur in the cortex, and that artificial systems having a similar transformation should be investigated. Properties of the retino-cortical mapping which are relevant to pattern recognition are discussed in the first half of this paper. In the second half, we outline how an adaptive production system having template-like conditions might recognize patterns that had been transformed to a "cortical" space.

THE RETINO-CORTICAL MAPPING

Recent papers in image processing and display, and in theoretical neurophysiology, have drawn attention to a nonlinear visual field representation which resembles the primate retino-cortical system. Weiman and Chaikin [1] propose a computer architecture for picture processing based on the complex logarithmic mapping, the formal properties of which they analyze extensively.

Figure 1. "Retina" consisting of data fields, each connected to an "MSU" in the "cortex" of Fig. 2.
Figure 2. Each MSU receives signals from a data field in Fig. 1. Letters indicate the connection pattern.

They and also Schwartz [2] present
physiological and perceptual evidence that the mapping from retina to (striate) cortex embodies the same function. Wilson [3] discusses the mapping in the light of additional evidence and examines its potential for pattern recognition. Early related ideas in the pattern recognition literature can be found in Harmon's [4] recognizer and in certain patents [5].

A hypothetical structure (adapted from [3]) schematizing important aspects of the retino-cortical (R-C) mapping is shown in Figures 1 and 2. The "retina" of Figure 1 consists of "data fields" whose size and spacing increase linearly with distance from the center of vision. The "cortex" of Figure 2 is a matrix of identical "message-sending units" (MSUs), each of which receives signals from its own retinal data field, processes the signals, and generates a relatively simple output message that summarizes the overall pattern of light stimulus falling on the data field. The MSU's output message is drawn from a small vocabulary, i.e., the MSU's input-output transform is highly information-reducing and probably spatially nonlinear.

Further, all MSUs are regarded as computing the same transform, except for scale. That is, if two data fields differ in size by a factor of d, and their luminance inputs have the same spatial pattern except for a scale factor of d, then the output messages from the associated MSUs will be identical. (Physiologically, the cortical hypercolumns [6] are hypothesized in [3] to have the above MSU properties.)

The pattern of connections from retina to cortex is as suggested by the letters in Figures 1 and 2. Data fields along a ray from center to periphery map into a row of MSUs, and simultaneously, each ring of data fields maps into a column of MSUs. The leftmost column corresponds to the innermost ring, the 12 o'clock ray maps into the top row, and so forth.
It is convenient to describe position in retinal space by the complex number z = re^(iφ), where r and φ are polar coordinates. We can denote cortical position by w = u + iv, where u is the column index increasing from left to right and v is the row index increasing downwards. For the mapping to have complex logarithmic form, it must be true that the position w of the MSU whose data field is at z satisfies w = log z or, equivalently, u = log r and v = φ. That the equations do hold can be seen from Figure 1. The distance Δr from one data field center to the next is proportional to r itself, which implies that u is logarithmic in r. Similarly, the fact that all rings have equal numbers of data fields directly implies that v is linear in polar angle. Thus (with appropriate units) we have w = log z. (The singularity at z = 0 can be handled by changing the function within some small radius of the origin. For present purposes we are interested in the mapping's logarithmic property and will ignore this necessary detail.)

Figures 3-5 (at end of article) review three salient properties of the R-C mapping that have been noted by previous authors. The photos on the left in each figure are "retinal" (TV camera) images. On the right are crude "cortical" images obtained by the expedient of sampling the retinal data field centers. The mapping used has 64 MSUs per ring and per ray. Figure 3 shows a clown seen at two distances differing by a factor of three. The cortical images, though "distorted", are of constant size and shape. Also shown is the result of rotating the clown through 45 degrees; again, cortical size and shape remain the same.
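The claim that scaling and rotation become pure translations under w = log z can be checked numerically. The values below are illustrative, not from the paper.

```python
import cmath

# Under w = log z: scaling z by s shifts u by log s; rotating by phi shifts v by phi.
def cortical(z):
    w = cmath.log(z)
    return w.real, w.imag            # (u, v) = (log r, polar angle)

z = cmath.rect(2.0, 0.5)             # a retinal point at r = 2, angle 0.5 rad
u0, v0 = cortical(z)
u1, v1 = cortical(3.0 * z)                     # the same point, viewed 3x larger
u2, v2 = cortical(z * cmath.rect(1.0, 0.3))    # the same point, rotated 0.3 rad

print(round(u1 - u0, 6), round(v1 - v0, 6))    # (log 3, 0): pure horizontal shift
print(round(u2 - u0, 6), round(v2 - v0, 6))    # (0, 0.3): pure vertical shift
```

This is exactly the behavior of the clown images in Figure 3: a change of distance or orientation moves the cortical image without changing its size or shape.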
The pictures show how retinal scale change and rotation only alter the position of the cortical image. Figure 4 illustrates these effects for a texture; the cortical images are again the same except for a shift. The mapping thus brings about a kind of size and rotation invariance which one would expect to be useful for pattern recognition.

Figure 5, in contrast, shows that the mapping lacks translation invariance. The same clown is seen at a constant distance but in three different positions with respect to the center of vision. Translation non-invariance would appear to be a distinct disadvantage for pattern recognition. As the clown recedes from the center in Figure 5, its cortical image gets smaller and less defined. The effect illustrates how in a sense the mapping optimizes processing resources through a resolving power which is highest at the center and decreases toward the periphery. This variation is sometimes cited as a useful property of the eye, and is discussed in connection with an artificial retina-like structure by Sandini and Tagliasco [7].

OBJECT-RESOLUTION

The pattern recognition potential of the mapping's size-normalizing property is best seen by defining a somewhat unusual notion of resolution. Recall first that the resolving power ρ of a sensor is the number of distinct signals per unit visual angle; in the case of a linear sensor (such as a TV camera), ρ is a constant. Suppose we ask of a system: when its sensor images a centered object of half-angle A, how many distinct signals, corresponding to the object, will the sensor produce? Let us name this quantity the system's object-resolution, R_O. Then, in the case of a linear system, it is clear that R_O will be proportional to ρ²A². That is, R_O will depend on the distance or "apparent size" of the object, i.e. on the relationship between perceiver and object.
The resulting amount of information may be insufficient for recognition, it may be just right, or it may overload and therefore confuse the recognition process. This uncertainty leads to the scale or "grain" problem noted by Marr [8] and others, and to Marr and Hildreth's [9] proposed solution of computations at several resolutions which are later to be combined. The grain problem is also a motivation for the application of relaxation techniques [10] in pattern recognition.

Let us now ask what is the object-resolution of an R-C system. For such a system the resolving power is ρ = c/r, with r the distance from the center of vision. The constant c can be defined as the number of MSU outputs per unit visual angle at an eccentricity of r = 1. Object-resolution R_O can be found by taking a centered object of half-angle A and integrating over the object from a small inner radius εA (ε << 1) out to A. We have

R_O = ∫ from εA to A of (c/r)² 2πr dr = 2πc² ln(A/εA) = 2πc² ln(1/ε),

independent of A. Thus the mapping's object-resolution, or spatial quantization of the seen object, is independent of the object's apparent size or distance, and independent of its actual size as well. It depends only on c (and ε). Given a fixed value of c, the system may be said to see every centered object, regardless of size, equally well, independent of the perceiver-object relationship. (Strictly speaking, the above integral includes only a fraction 1 - ε² of the object, the "outer" fraction. But if ε is very small the omitted fraction will contain an insignificant portion of the object's pattern.)

The object-resolution of the R-C mapping can be thought of in terms of the number of data fields per retinal ring. By mentally superimposing and then expanding and contracting a centered object on Figure 1, one can see that it is examined in an equivalent way at any scale. In fact, it is convenient to use the number of fields per ring as a measure of R_O.
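The independence of R_O from A can be verified by approximating the integral numerically for several object sizes. The values of c and ε below are illustrative assumptions, and the midpoint-rule integration is our own check, not part of the paper.

```python
import math

# Numerically integrate dN = (c/r)^2 * 2*pi*r dr from eps*A to A and confirm
# that the result, R_O = 2*pi*c^2*ln(1/eps), does not depend on A.
def object_resolution(A, c=10.0, eps=0.01, steps=20000):
    lo = eps * A
    h = (A - lo) / steps
    total = 0.0
    for k in range(steps):
        r = lo + (k + 0.5) * h           # midpoint of the k-th annulus
        total += (c / r) ** 2 * 2 * math.pi * r * h
    return total

exact = 2 * math.pi * 10.0 ** 2 * math.log(1 / 0.01)
for A in (0.1, 1.0, 10.0):
    print(round(object_resolution(A), 2))   # the same value for every A
print(round(exact, 2))
```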
The R-C mapping's constant object-resolution is the significant difference between it and a linear system. In the remainder of the paper we will develop implications of this difference: first, why in an important sense the "grain" problem disappears; second, why Gestalt-like templates are, cortically, suitable for pattern recognition; third, in outline, how the cortical approach with templates allows a separate adaptive theory due to Holland [11] to be applied to pattern recognition, and in the process may solve the mapping's apparent problem of translation non-invariance.

THE "GRAIN" PROBLEM

Basically, a "grain" problem exists when there is no a priori way to tell whether the size of the elements with which the perceiver is looking is the same as that of the optimally informative element of the object or scene. In the linear case, we found that the information about an object may be insufficient, just right, or overloading depending on (1) the perceiver-object relationship and of course on (2) the amount of detail in the object itself. In the R-C mapping case, the information is constant, dependent only on the perceiver. Thus (1) above, uncertainty due to the perceiver-object relationship, disappears. But the information may still, it seems, be insufficient, just right, or overloading, depending on object detail.

We can develop a criterion for the latter as follows. Let an object's "object frequency spectrum" be the two-dimensional Fourier spectrum of a geometrically similar object of unit size, and let f_O be the highest significant (for discrimination) frequency in such a spectrum. Then, roughly, we may say that a mapping with resolution R_O (in units of fields per ring) provides sufficient information about an object if R_O >= f_O. But this bound is not ultimately limiting. It only says whether information from one fixation is sufficient for recognition.
Peculiarly, by the mapping's constancy of information, any fixated local part of an object is seen in as much detail as is the whole object. Thus, if R_O < f_O, the system can always gather enough information by scanning, i.e., by moving the center of fixation to any part not seen clearly. R_O is therefore always sufficient, though several fixations may be required.

Can there be too much resolution? Only if objects turn out to be simpler than expected. But often this can be known in advance. In contrast, in the linear case, superfluous resolution will always occur whenever object images become large.

TEMPLATES

In any digital computer implementation, a template for pattern matching consists of a finite (usually rectangular) array of cells in each of which the relative brightness to be matched is specified. The array has a fixed resolution since the number of cells is fixed.

One major traditional problem with templates is a variation of the "grain" problem: unless the template's resolution is the same as the system's object-resolution, there is virtually no chance of getting a correct match. The R-C mapping offers a solution since the system's object-resolution is fixed, and the resolution of all stored templates can be made exactly commensurate. For instance, the system can acquire its templates by copying its own cortical MSU output images of identified objects. The same objects, when later presented in other sizes, will be "seen" in the same way.

Templates have other problems, e.g., orientation and brightness variations may lead to mismatch. These will be taken up later. Our analysis suggests, however, that templates may yet have an important role to play in general pattern recognition, provided the matching occurs in a cortex-like space.
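The size-invariance claimed here is easy to verify for the complex-logarithmic form of the R-C mapping discussed in [1-3]: in log-polar coordinates (u, v) = (ln r, θ), uniformly scaling an object translates its whole cortical image by ln s along u and leaves v untouched, so a template acquired at one size still fits at any other (a sketch; the point coordinates are arbitrary):

```python
import math

def to_log_polar(x, y):
    """Complex-log (log-polar) coordinates of an image point."""
    return math.log(math.hypot(x, y)), math.atan2(y, x)

points = [(1.0, 0.5), (0.2, 2.0), (3.0, 1.0)]   # an arbitrary "object"
s = 2.5                                          # an arbitrary scale factor
for x, y in points:
    u1, v1 = to_log_polar(x, y)
    u2, v2 = to_log_polar(s * x, s * y)
    # Scaling shifts every point by the same log(s) along u and leaves
    # v unchanged: the cortical image moves as a rigid unit.
    assert abs((u2 - u1) - math.log(s)) < 1e-9 and abs(v2 - v1) < 1e-9
```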
OUTLINE OF AN ADAPTIVE CORTICAL PATTERN RECOGNITION SYSTEM

This section will outline a system concept combining the R-C mapping, a production system based on cortical templates, and the theory of adaptation due to Holland.

A visual world mapped as in Figures 1 and 2 suggests a natural polarity between center and periphery. The same centered object, as it grows bigger, expands toward the periphery, and its cortical image, as noted, shifts as a unit from the left side of the "cortex" toward the right side. The implication is strong that processing, in the cortex, should proceed from left to right. The pattern of an object, whatever its degree of shift from the left, will be encountered "sooner or later" and thus be available for matching against templates.

Further reflection suggests that rather than working with two-dimensional templates, it might be simpler to use one-dimensional column templates, the identification of a pattern consisting of successive matching of the appropriate column templates. Storage would be saved because a given column template would often be a contributor in more than one two-dimensional match.

An appropriate structure for performing the operation of successively matching column templates is a form of production system in which (1) the condition of each production includes a column template pattern and one or more internal message patterns, and (2) the action is an internal message to be placed on the common message list. (These internal messages are distinct from the MSU output messages. To avoid confusion, the internal messages will be called i-messages.)

In addition, a separate set of "effector" productions, whose conditions consist only of i-message patterns, would monitor the i-message list. When an appropriate i-message appeared on the list, the effector would fire. Its "action" would be (1) an external action such as moving the center of vision, or (2) an "internal" action also modifying the system's frame of reference but
not directly observable from the outside (more on this later), or (3) a signal to the outside world denoting a pattern name.

Many details need to be filled in to make this an operating system. However, enough has been given to suggest a process in which, starting at the left end of the cortex, columns would be scanned and productions would fire in dependent sequence (the dependency based on i-messages as well as the column information being matched), resulting ultimately in an effector firing whose signal named the object in view.

Production systems have not usually been considered in connection with pattern recognition because production conditions typically deal with "normalized" or logical variables and, given the grain problem, patterns in linear vision are anything but normalized. In cortical space, however, patterns are normalized, so that there the power of productions can potentially be exploited.

But we can go farther. One part of the adaptive theory due to Holland is concerned with "cognitive systems" based on sets of productions called "classifiers". The form of a classifier is, most generally, a string whose condition part consists of a fixed-length "environmental detector pattern" together with one or more i-message patterns, and whose action part is an output i-message or effector action. The important point for us is that the "environmental detector pattern" has exactly the form of the column templates we have been considering, so that classifier systems and the adaptive theory may be directly applicable to "cortical" pattern recognition. It has been demonstrated [13-16] that, given an appropriate external reward regime, a classifier system can evolve a set of classifiers that is adapted to, or "fit" in, its environment.
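As an entirely hypothetical miniature of this scanning process (the rule set, message names, and matching scheme below are invented for illustration, not taken from the paper):

```python
def column_matches(template, column):
    """Ternary column template over {'0','1','#'} against a binary column."""
    return all(t == '#' or t == c for t, c in zip(template, column))

def scan(columns, productions):
    """Scan cortical columns left to right; fire any production whose
    column template matches and whose required i-messages are present,
    posting its output i-message to the shared list."""
    i_messages = set()
    for column in columns:
        for template, required, emitted in productions:
            if required <= i_messages and column_matches(template, column):
                i_messages.add(emitted)
    return i_messages

rules = [
    ('1#0', set(), 'edge-seen'),          # fires on its column alone
    ('11#', {'edge-seen'}, 'object-X'),   # also needs a prior i-message
]
assert scan(['100', '110'], rules) == {'edge-seen', 'object-X'}
```

An "effector" production would be one whose required set is nonempty and whose action moves the center of vision or names the pattern; here both kinds are collapsed into the same rule form for brevity.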
Such adaptation means in particular that the conditions of the classifiers recognize what matters, and the i-messages and actions are appropriate. Much further research must be done, but by combining classifiers with R-C vision, a new path would appear to be open to the objective of a self-organizing visual pattern recognition system.

If the adaptive properties of the Holland system be assumed, we can suggest how the production structure given earlier might deal with non-centered objects. They look different from their centered forms: this is the mapping's translation non-invariance. The problem would be solved if classifiers existed which would react to the off-center form and lead to an effector which would move the center of vision so as to center the object (at which point "standard" classifiers could recognize it). At first sight, the evolution of this kind of sequence seems implausible: you would need classifiers for every object in every peripheral position. However, the mapping helps by reducing the detail seen in an object as it recedes toward the periphery; in the limit, every object becomes just a "blob". This suggests that only a relatively small number of distinct classifiers would be needed to "acquire" any object for standard (centered) inspection.

There remains the problem, not of the isolated object, but of the more-or-less centered one, such as a face, which is still not centered quite well enough to fire its standard classifiers. How can an appropriate centering movement come about? For this question, and related ones, we need to consider the "internal effectors" mentioned earlier. Three are important in the present discussion: Object-Resolution (OBRES), Azimuth (AZIM), and Brightness Gain (BGAIN). OBRES is an effector (or set of them) which, given appropriate i-messages, will alter the system's object-resolution (in effect changing the number of data fields per ring in Figure 1). This permits seeing an object (regardless, of course,
of its apparent size) in detail, or more coarsely, depending on the i-message list circumstances. The evolution of OBRES effectors appropriate to different circumstances would occur through the adaptive mechanisms.

If we now recall the problem of the slightly off-center face, it seems plausible that, given some reduced level of object-resolution, most different faces with that degree of decentering could be matched by a relatively small (and thus practical) set of classifiers. These would lead to a movement command bringing the face to the center, where it would be recognized in detail (after, perhaps, restoration by OBRES of a higher R_O).

The AZIM internal effectors set the direction the system regards as "up". In cortical space, this amounts to shifting the input column vector along its length by a definite amount before matching classifier template patterns against it. The purpose of AZIM is, of course, to allow a given set of classifiers to be effective for recognition even if the object is not in standard orientation. But how will the right azimuth be set in such a case? We again have recourse to the evolution of relatively coarse classifiers which, given reduced object-resolution through OBRES, will recognize the presence of a nonspecific ("oblong", say) object at a certain orientation.
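In code, an AZIM setting is just a cyclic shift of the cortical column vector before matching (a hypothetical sketch; the shift corresponding to a given rotation depends on the mapping's angular sampling):

```python
def set_azimuth(column, shift):
    """Reorient the seen object by cyclically shifting the input column
    vector before any classifier templates are matched against it."""
    shift %= len(column)
    return column[shift:] + column[:shift]

column = [0, 1, 1, 0, 0, 0]
assert set_azimuth(column, 2) == [1, 0, 0, 0, 0, 1]
assert set_azimuth(set_azimuth(column, 2), -2) == column   # invertible
```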
These coarse classifiers would lead to the right AZIM action, and specific recognition could then occur.

Finally, BGAIN is a set of internal effectors to deal with the persistent problem of setting the right brightness level for template matching. The intent is that the appropriate gain will be determined (via the i-message list) by what is seen, and that the evolution of an appropriate set of BGAIN effectors will again be under adaptive control in the Holland sense.

The various internal effectors, and the external one resulting in movement, are concerned with the system's "point of view" on its visual input, that is, with systematic transformations which will allow the system's form detector set, the classifiers, to function efficiently.

SUMMARY

We began this paper with the retino-cortical mapping and showed how it "saw" centered objects with a resolution independent of the object's size. Constant object-resolution led to a renewed prospect for template matching in general pattern recognition. Fixed-size templates permitted the power of production systems to be brought to bear. Finally, the applicability of Holland's adaptive theory to production systems allowed us to suggest that a recognition system based on the mapping might be made self-organizing, in the process overcoming the mapping's "problem" of translation non-invariance.

REFERENCES

[1] Weiman, C.F.R. & Chaikin, G. Logarithmic spiral grids for image processing and display. Computer Graphics and Image Processing, 11, 197-226, 1979.

[2] Schwartz, E.L. Spatial mapping in the primate sensory projection. Biological Cybernetics, 25, 181-194, 1977.

[3] Wilson, S.W. On the retino-cortical mapping. Int. J. Man-Machine Studies, 18, 361-389, 1983.

[4] Harmon, L.D. Line-drawing pattern recognizer. Electronics, 39-48, Sept. 2, 1960.

[5] Singer, J.R. Electronic recognition. U.S. Patent 3,255,437, 1966.

    Burckhardt, C.B., et al. Pattern recognition apparatus utilizing complex spatial filtering. U.S. Patent 3,435,244, March 25, 1969.
    McLaughlin, J.A., et al. Pattern recognition apparatus and methods invariant to translation, scale change, and rotation. U.S. Patent 3,614,736, October 19, 1971.

[6] Hubel, D.H. & Wiesel, T.N. Uniformity of monkey striate cortex: a parallel relationship between field size, scatter, and magnification factor. J. Comp. Neurology, 158(3), 295-305, 1974.

[7] Sandini, G. & Tagliasco, V. An anthropomorphic retina-like structure for scene analysis. Computer Graphics and Image Processing, 14, 365-372, 1980.

[8] Marr, D. Early processing of visual information. Philosophical Transactions of the Royal Society of London B, 275, 483-524, 1976.

[9] Marr, D. & Hildreth, E. Theory of edge detection. Proc. Royal Society of London B, 207, 187-217, 1980.

[10] Davis, L.S. & Rosenfeld, A. Cooperating processes for low-level vision: a survey. Artificial Intelligence, 17, 245-263, 1981.

[11] Holland, J.H. Adaptation in Natural and Artificial Systems. Ann Arbor: U. of Michigan Press, 1975.

[12] Evidence and a model for scanning in humans is presented in Wilson, S.W., Strobe imagery: a scanning model. Submitted for publication.

[13] Holland, J.H. & Reitman, J.S. Cognitive systems based on adaptive algorithms. In Pattern-Directed Inference Systems, Waterman, D.A. & Hayes-Roth, F. (eds.). New York: Academic Press, 1978.

[14] Booker, L. Intelligent behavior as an adaptation to the task environment. Ph.D. Dissertation (Computer and Communication Sciences), The University of Michigan, 1982.

[15] Goldberg, D.E. Computer-aided gas pipeline operation using genetic algorithms and rule learning. Ph.D. Dissertation (Civil Engineering), The University of Michigan, 1983.

[16] Wilson, S.W. Knowledge growth in an artificial animal. These Proceedings.

MACHINE LEARNING OF VISUAL RECOGNITION USING GENETIC ALGORITHMS

Arnold C. Englander
Itran Corporation, Manchester, N.H.

ABSTRACT

This paper briefly describes preliminary work with an application of genetic algorithms.
Genetic algorithms are used as the mechanism by which a vision recognition system learns to classify distorted examples of different but similar classes of image patterns. The system develops increasingly effective collections of class-specific feature detectors, producing increasingly unambiguous, hence reliable, recognition performance. Algorithms and early simulation results are described.

Genetic algorithms are applied to a special case of a difficult optimization problem which is emerging in several forms in computational vision research. The general optimization problem has a performance measure that is easily formulated as an algorithm involving the composition of both functionals and logical operations. However, the performance measure is not itself a smooth, much less convex, functional. This precludes the application of most conventional optimization techniques.

I. INTRODUCTION

A variety of techniques for the machine recognition of objects in images exist in the literature and in demonstrated machine vision technology [1,2,3]. There is an image recognition problem which is difficult for all of these techniques but which arises in practical applications. The problem combines two troublesome characteristics. First, pattern classes have prototypes which correlate highly with the prototypes of different pattern classes. Second, the pattern examples (to be classified) are randomly distorted and occluded. Practical cases of this problem arise in reading characters stamped in certain industrial materials such as rubber and cast metal. Other examples are found in robot vision "bin-picking" applications involving certain assortments of parts. This paper describes the use of genetic algorithms as the basis of a machine vision system which improves its own performance on such recognition problems by learning from labeled examples.¹
II. THE OPTIMIZATION PROBLEM

Experience in applying conventional recognition techniques to difficult industrial vision problems has led to this view: robust recognition performance relies on the identification and use of a large set of local image features having two properties. First, important local features are those which, either alone or in small groups, disambiguate the recognition process by being necessary and/or sufficient ("essential") evidence for classification. Second, such features and groups of features must be likely survivors of the distortion and occlusion operations under which image pattern examples are generated from class prototypes.

¹ For a general and thorough introduction to genetic algorithms, including general analytical results, see the pioneering book by Holland [4].

Obviously essential features are application dependent. They depend on the class prototypes and on the distorting and occluding processes. The problem's strong dependence on application particulars leads to the requirement that the recognition system improve its own performance by associative learning from labeled examples.

It is desirable to identify many small features which are essential when detected alone or in a variety of groupings. This way the features which contribute to the recognition process are likely to survive the random distortions and occlusions. The detections of essential features should be not only graded and combined in weighted sums, but combined in ways which allow pieces of evidence to "veto" the significance of other pieces of evidence. Intuitively, the behavior of algorithms based on such ideas will be complicated by implicit nonlinear, "competitive" and "cooperative" interactions between the evidence derived from the detections of essential features.
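One hypothetical way to realize a weighted sum with vetoes (the threshold, weights, and feature names below are invented for illustration, not taken from the paper):

```python
def combine_evidence(detections, weights, vetoes, threshold=0.5):
    """Weighted sum of graded feature detections, except that any veto
    feature detected above `threshold` zeroes the contribution of the
    features it suppresses."""
    suppressed = set()
    for feature, targets in vetoes.items():
        if detections.get(feature, 0.0) > threshold:
            suppressed.update(targets)
    return sum(w * detections.get(f, 0.0)
               for f, w in weights.items() if f not in suppressed)

detections = {'loop': 0.9, 'bar': 0.8, 'smudge': 0.9}
weights = {'loop': 1.0, 'bar': 1.0}
vetoes = {'smudge': {'bar'}}    # a likely smudge vetoes the bar evidence
assert abs(combine_evidence(detections, weights, vetoes) - 0.9) < 1e-12
```

The veto makes the measure discontinuous in the detections, which is exactly the "functionals plus logic" structure that defeats smooth optimizers.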
III. USE OF GENETIC ALGORITHMS

Applying these views to machine learning of visual recognition leads to an optimization problem over a space of populations of 2-D detector arrays, where each array is a composite of templates for the detection of essential image features. The overall population of detector arrays is divided into class-specific sub-populations, each of which is optimized to respond maximally to examples of a particular image pattern class. The recognition algorithm classifies unidentified images by assigning them to the detector array sub-population producing the highest sum of individual recognition responses. The recognition response of an individual detector is the product of a match between the detector and the input image, and a term called "strength". The strength of a detector array is indicative of the detector array's past performance in disambiguating recognition decisions.

Optimization of a sub-population of class-specific detector arrays means finding detectors which strongly match input image examples of the specified class, but which only weakly match input image examples of other classes. This is difficult because the different image pattern classes have prototypes which are alike in the sense of being highly cross-correlated.

This optimization problem reflects the desired strategy and intuitively seems simple. However, it is not easy to solve. The problem's performance measure on individual detector arrays is composed of functionals and logical operations. It is not itself a smooth, much less convex, functional. Such optimization problems are unsolvable by most conventional methods. Because genetic algorithms impose unusually few constraints on the formulation of optimization problems, they are applicable to this problem.² The match between detectors and input images involves a "matchscore" which is common to most genetic algorithms. The strength of detectors develops iteratively.
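These two definitions, the matchscore-times-strength response and classification by the highest-summed sub-population, can be sketched as follows (the ternary {0,1,#} detector strings and the toy values are illustrative, not the paper's):

```python
def matchscore(detector, image):
    """Count positions where a detector string over {0,1,#} agrees with
    a binary image string; '#' is the usual "don't care"."""
    return sum(d == '#' or d == p for d, p in zip(detector, image))

def response(detector, image):
    """Recognition response: match times strength."""
    return matchscore(detector['genes'], image) * detector['strength']

def classify(image, subpops):
    """Assign the image to the class whose detector sub-population
    produces the highest summed recognition response."""
    return max(subpops, key=lambda cls: sum(response(d, image)
                                            for d in subpops[cls]))

subpops = {                     # toy class-specific sub-populations
    'A': [{'genes': '11##', 'strength': 1.0}],
    'B': [{'genes': '00##', 'strength': 1.0}],
}
assert classify('1101', subpops) == 'A'
```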
During the associative learning phase of the system, the strength of each detector is increased each time the detector's response is above the average response of all detectors and the class origin of the input image and the class-specificity assignment of the detector are the same. The strength of a detector is decreased each time it produces an above-average response to an input image originating from a class other than the one its sub-population is being optimized to recognize.

Here, an image pattern is a 2-D array of binary-valued picture elements, or "pixels". (This corresponds to a 2-D map of the zero crossings in a digital image processed by convolution with a difference of Gaussians (DOG) operator for the detection of edges. The resulting zero crossings are useful in portraying the boundaries of objects in the scene.) The image patterns are randomly distorted and occluded examples of prototypes from one of several distinct, but similar, image pattern classes. A detector array is a 2-D array of pixels of the same size as the image patterns. Here each pixel takes one of three symbols, {0,1,#}, where {0,1} indicate values taken by pixels in image patterns and # indicates the "don't care" condition in the usual genetic algorithm matchscore. A standard matchscore is used in matching image patterns to detector arrays by simply "unwinding" the image patterns and detectors as taxa-type character strings (over {0,1} for image patterns and over {0,1,#} for detectors).

² Other cases of such optimization problems are emerging in computational vision research [5]. One case involves the goal of combining the information of various visual processes (stereopsis, motion, and "shape-from-shading", for example) into a single interpretation (of 3-D or "2-1/2-D", for example), which is optimal under a performance measure which combines functionals and logic. Genetic algorithms may be applicable to such problems as well.

Genetic algorithms optimize the class-specific sub-populations of detector arrays, indirectly,
by operating on the individual detector arrays in each separate, class-specific sub-population. Restricting "mating" and "replacement" operations to taxa within the same sub-population, two "parents" are selected in each sub-population at the completion of each recognition trial involving labeled examples (hence changes in strengths). The "parent" taxa are selected as the detectors returning the two highest recognition responses (the product of the match with the current input image example and the detector strength), or with probabilities proportional to the recognition responses. The two "parents" generate two "offspring" under genetic operators, and the "offspring" each replace an "individual" judged to be "weak" for having one of the two lowest strengths of the taxa in the sub-population. The "offspring" enter the sub-population with strengths which are a fraction of the average strength of the two "parents", and the strengths of the "parents" are reduced to match that of their "offspring".

These selection rules reflect heuristic arguments and experimentation. "Parents" are selected according to recognition responses to ensure that they are "strong" for having contributed to disambiguation in the past, and that they are well matched to the current input example. "Weak" individuals are "un-selected" by low "strength" alone, rather than by the current match-"strength" product, to avoid losing detector arrays which tend to be useful but match poorly with the current input example (which is randomly distorted and occluded).

Early simulations involved standard operators of genetic algorithms: "cloning", "crossover", "inversion", and "mutation", chosen according to probabilities which are fixed for each experiment. As is commonly believed, it is most useful to assign "crossover" the highest usage probability.
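A sketch of this reproduction step under stated assumptions (deterministic choice of the two highest-response parents; the offspring-strength fraction is treated as a free parameter here):

```python
import random

def one_point_crossover(a, b):
    """Standard one-point crossover of two equal-length strings."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def reproduce(subpop, image_match, frac=0.5):
    """One reproduction step after a labeled trial. `subpop` is a list
    of dicts with 'genes' and 'strength'; `image_match(genes)` scores
    the match to the current example. Parents are the two highest
    match-times-strength responses; offspring replace the two
    lowest-strength individuals and receive a fraction of the parents'
    average strength, to which the parents are also reduced."""
    by_response = sorted(subpop, key=lambda d: image_match(d['genes'])
                                              * d['strength'])
    p1, p2 = by_response[-1], by_response[-2]
    weakest = sorted(subpop, key=lambda d: d['strength'])[:2]
    offspring = one_point_crossover(p1['genes'], p2['genes'])
    s = frac * (p1['strength'] + p2['strength']) / 2.0
    for weak, genes in zip(weakest, offspring):
        weak['genes'], weak['strength'] = genes, s
    p1['strength'] = p2['strength'] = s

random.seed(1)
pop = [{'genes': '1111', 'strength': 4.0}, {'genes': '1100', 'strength': 3.0},
       {'genes': '0000', 'strength': 0.5}, {'genes': '0011', 'strength': 0.4}]
reproduce(pop, lambda g: g.count('1'))     # parents: '1111' and '1100'
assert all(abs(d['strength'] - 1.75) < 1e-12 for d in pop)
```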
Experiments were also performed using Wilson's "imprinting" and "ternary intersection" operators, with low usage probabilities. Wilson's operators seem relevant and useful to this problem [6].

IV. EARLY SIMULATION RESULTS

Early simulation results are promising in that self-optimization by genetic algorithms is obvious. The recognition system, operating in training mode, clearly improves its cumulative average of correct recognitions from very low initial percentages to moderately high percentages over a few hundred trials. In simulations involving 4 pattern classes of 2 prototypes each, 4 sub-populations of 32 detector arrays each, and image and detector arrays of 32 by 32 pixels, the system averaged correct recognitions 25% of the time for the first 100 or so trials, rising exponentially to 78% correct recognitions after 1000 trials. In such simulations the detectors were initialized with pixels containing 0, 1, and # with equal probability, and Wilson's genetic operators were used randomly with small probabilities. In some simulations the system improved its recognition performance over correlation-based pattern recognition techniques in a few thousand training iterations.

As expected, over time, the system evolves strong detector arrays which partly resemble the prototypes of the pattern classes to which the detectors are assigned. But the resemblance is never complete, because detectors must match features present in examples of their assigned pattern class but ignore features which are also characteristic of other classes. The evolution of such detectors is apparent in the simulations.

V. CONCLUSION

Preliminary work with an application of genetic algorithms has been described. Genetic algorithms are the mechanism by which a vision recognition system learns to classify distorted examples of different but similar classes of image patterns.
This work addresses an unconventional optimization problem which arises naturally from an intuitive model of visual learning. Early simulation results indicate that the proposed model can lead to the design of an effective machine vision system.

REFERENCES

1. R. Duda and P. Hart: Pattern Classification and Scene Analysis. Wiley, New York, 1973.
2. E. Hall: Computer Image Processing and Recognition. Academic, New York, 1979.
3. J. Tou and R. Gonzalez: Pattern Recognition Principles. Addison-Wesley, Reading, MA, 1974.
4. J. Holland: Adaptation in Natural and Artificial Systems. University of Michigan, Ann Arbor, 1975.
5. D. Terzopoulos: "Multilevel Reconstruction of Visual Surfaces: Variational Principles and Finite-Element Representations", in Multiresolution Image Processing and Analysis, ed. A. Rosenfeld, Springer, New York, 1984 (see page 283).
6. S. Wilson: "Knowledge Growth in an Artificial Animal", in Proc. Fourth Yale Workshop on Applications of Adaptive Systems Theory, New Haven, Conn., 1985.

Bin Packing With Adaptive Search

Derek Smith
Texas Instruments

1.0 INTRODUCTION

We have looked at the problem of bin packing arbitrarily dimensioned rectangular boxes into a single orthogonal bin. Figure 1 shows a good bin packing, the sort we are aiming for. Figure 2 shows a poor bin packing. The problem is NP-hard in the strong sense, so there is little hope of finding a polynomial-time optimization algorithm for it (1). Reasonable approximation algorithms exist which can be guaranteed to be within 22% of optimal (1). Our approach has been to use a wrinkle on genetic algorithms (3), developed in the Texas Instruments Computer Science Laboratory (2).

2.0 ADAPTIVE SEARCH

The epistatic domain of bin packing has traditionally not been amenable to adaptive search techniques. This is because it is difficult to represent a bin packing on which we can do crossover and mutation and retain either a reasonable packing or a legal packing. Consider a flip mutation (rotate through 90 degrees) of box 18 in Figure 1.
The flip will either cause an illegal bin packing due to boxes overlapping each other, or, if we fracture the packing by moving the neighbouring boxes away to make the flip legal, will produce a poor bin packing.

Our solution is to represent the bin packing as a list of the boxes plus an algorithm for decoding the list into a bin packing. The list is readily mutatable (flipping boxes), and is amenable to a modified form of crossover. The decoding algorithm takes any list of boxes and forms a legal packing. Hence we attempt to produce good bin packings using genetic algorithms.

2.1 The Representation

As explained above, our representation is a list with an associated algorithm to apply to the list to produce a bin packing. For effective search the algorithm must produce legal packings from any operation on the list. Here we describe two such decoding algorithms.

The first algorithm we call SLIDE PACK. We take each box, in order, from the list, place it in one corner of the bin, and let it fall to the farthest corner away, as if under a gravity that only allowed it to move orthogonally. The effect is that a box will zigzag into a stable position in the opposite corner from which it was placed. Box 2 in Figure 3 shows the SLIDE PACK algorithm. SLIDE PACK is fast as there is no backtracking, and is simple to compute. Its time complexity is O(n²), where n is the number of boxes. There are n! possible orderings of our list of n boxes. If we associate a flipped state with each box, this gives us n!·2ⁿ members in the set of all encoded representations. Although we can contrive packings that SLIDE PACK can never do, we believe that in general we can reach all of the search space by operating on the list of boxes.

The second algorithm we call SKYLINE PACK. For each box in the list, in order, we try the box in all stable positions, and in all its orientations, on the partially packed bin.
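A minimal sketch of the SLIDE PACK decoder under stated assumptions (integer coordinates, boxes entering at the far corner and falling by unit left/down moves toward the origin; flipped states are omitted):

```python
def overlaps(a, b):
    """Axis-aligned rectangle overlap; rectangles are (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def slide_pack(boxes, bin_w, bin_h):
    """Decode a list of (w, h) boxes into a legal packing: each box
    zigzags left and down, one unit at a time, until it can move no
    farther. No backtracking, so any box order yields a legal packing."""
    placed = []
    for w, h in boxes:
        x, y = bin_w - w, bin_h - h          # enter at the far corner
        moved = True
        while moved:
            moved = False
            while x > 0 and not any(overlaps((x - 1, y, w, h), p)
                                    for p in placed):
                x, moved = x - 1, True
            while y > 0 and not any(overlaps((x, y - 1, w, h), p)
                                    for p in placed):
                y, moved = y - 1, True
        placed.append((x, y, w, h))
    return placed

# Two 4x4 boxes in a 10x10 bin: the second comes to rest on the first.
assert slide_pack([(4, 4), (4, 4)], 10, 10) == [(0, 0, 4, 4), (0, 4, 4, 4)]
```

Because the decoder always produces a legal packing, any permutation of the list is a valid genotype, which is the property that makes the representation searchable.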
A stable position is where the box is tucked into a corner, or a cave formed by other previously packed boxes. The algorithm takes its name from the fact that it tours the skyline formed by the previously packed boxes to find the position the box fits best. Figure 4 shows some of the places that box 2 is being considered for by the SKYLINE PACKer. Again we have n! possible orderings of the list. However, each time we pack a box we try that box in many positions; we are covering more of the search space than in the SLIDE PACKing of a box. It is clear that we can no longer generate all possible bin packings, as a poor placement of a box will be ignored in favour of a better placement somewhere else on the skyline. A more practical question is whether we can represent all good bin packings. We believe so (again informally), but with less conviction than with SLIDE PACK. SKYLINE PACK has time complexity O(n³).

With a randomly generated list, SKYLINE PACK will tend to generate a significantly denser packing than SLIDE PACK; however, it takes longer to run. Figure 2 is a typical SLIDE PACKing of a randomly generated list, whilst Figure 5 is a typical SKYLINE PACKing. SLIDE PACK can produce good packings, as shown in Figure 1, when we apply the adaptive search techniques. The trade-off is whether to run the adaptive search with larger populations and for more generations using SLIDE PACK, or in the same amount of time use SKYLINE PACK for fewer generations. Our experiments have shown that SKYLINE PACK is more favorable; however, with a better tuning of the adaptive search, SLIDE PACK may produce better results.

2.2 The Genetic Operators

Our representation of a packing, as described, is the order of the boxes presented to the packing algorithm. Traditional crossover cannot operate on such a list. Consider a crossover of list (1 2 3 4 5) with (5 4 3 2 1), the crossover point being after the second element, to produce (1 2 3 2 1).
The list now has boxes 1 and 2 duplicated and boxes 4 and 5 missing. Hence we use a MODIFIED CROSSOVER which takes the order of the boxes before the splice point from the first list, and then the boxes which remain to be packed after the splice point, taken in the order of the second list. In the above example we would generate the list (1 2 5 4 3). Holland's theorems (3) regarding the effectiveness of crossover no longer hold. We have not yet investigated the theoretical aspect of the modified crossover. However, we have experimented with its use; we have run random search versus our genetic operators, and have found the genetic operators to produce consistently better results.

One of the mutations we have experimented with is SCRAMBLE, that is, randomly reordering some portion of the list. At the beginning of the adaptive search process we can concentrate on SCRAMBLing the beginning portions of the list to evolve a good basis for the packing. As the evolution proceeds we can move our area of interest farther up the list. A FLIP mutation to try different orientations of the boxes is necessary if the decoding algorithm does not try the box it is packing in all its orientations. FLIP is applied discretely to boxes in the list.

2.3 The Evaluation

Because we require our evaluation procedure to score dense packings highly, a straightforward evaluation criterion is the ratio of the area of the boxes packed to the area of the bin. This works well as an evaluation of a packing. It is less clear how to evaluate partial packings, which are required in such decoding algorithms as the SKYLINE PACKer, where we need an evaluation of the packing for each position of the box along the skyline in order to choose where to settle it. We have tried numerous ways to measure partial bin packings. One of the most intriguing is to take the inverse square of the separation of the box being packed to all the other boxes. This favors boxes filling in caves, especially if they fit snugly into the cave.
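The MODIFIED CROSSOVER of Section 2.2 is easy to state in code; the worked example above, (1 2 3 4 5) crossed with (5 4 3 2 1) after the second element, serves as the check:

```python
def modified_crossover(first, second, cut):
    """Keep the first list's order up to the splice point, then append
    the boxes that remain to be packed in the order they occur in the
    second list, so every box appears exactly once."""
    prefix = first[:cut]
    return prefix + [box for box in second if box not in prefix]

assert modified_crossover([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], 2) == [1, 2, 5, 4, 3]
```

By construction the result is always a permutation of the boxes, so the decoding algorithms can turn any offspring into a legal packing.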
There is some analogy here to gravitational effects, and indeed such an evaluation allows us to pack space (as opposed to packing in a containing bin), as the boxes are attracted to each other.

Graph 1 shows how the density of a partial bin packing falls as the number of boxes packed increases. This is due to the forming of more and larger caves by the later boxes. As the evolution continues we form fewer caves, and we can see from the graph that by generation 20 we have kept to about 85% density.

3.0 RESULTS

We have benchmarked our results against a recently developed deterministic bin packing program within our group. This program uses some heuristics and dynamic programming techniques. Our program can produce the same packing density 300 times faster. Also, if a greater density is required, then we can simply allow our program to run for longer, or run it again. Similarly, if a less dense packing is required, we run for only a short time. Graph 2 shows how the density increases as the evolution proceeds. This is a tremendous practical advantage of this approach. A practical disadvantage is that each time we run the process we will end up with a different packing.

4.0 FUTURE RESEARCH

There is work to be done in the mating of the decoding algorithm and the genetic operators. In particular, finding ways to operate on a portion of a bin packing without having repercussions on the whole packing. Work is also in progress in making the genetic operators robust to quantity of data, variation in dimensions of boxes, and variations in the aspect ratio of the bin.

We are also considering a process which monitors the adaptive search whilst it runs. Such a process could vary the importance of the mutations as the search proceeds. It could bring in mutations to produce diversity in the search if it were trapped at a local maximum. It could also alter the size of the population at various stages in the evolution.
Currently such variations are set up at the start of a run; it would be more effective to have the process continually monitoring and adapting itself. In order to learn how to implement the monitor process we need to study how the search space is being explored. Watching our bin packing algorithms run by means of graphics has been very useful in this work to date. Graph 3 shows the sort of display which we would like in order to watch the evolution, learn about the process, and write the monitoring system we have mentioned. Numbers 1 through 4 are four of the members of the initial population. The trees sprouting from them represent the performance of their offspring. 1 was a poor initial packing and soon died away. 4 was a good packing and we can see it spawned many children in exploring its portion of the search space. Note also that 2 and 3 are allowed to evolve to maintain diversity in the search. Graph 4 is the same concept as Graph 3 in a search space that we have completely mapped out and in which we can draw the local maxima, represented by 1s in the graph. We could then test new levels of operators and different population sizes in a controlled and visible search space. Graph 4 shows only two dimensions of such a space, which for n boxes is n-dimensional.

5.0 ACKNOWLEDGEMENTS

This work is only possible because of the enthusiasm, research work, and utilities for adaptive search all provided by co-worker Lawrence Davis. We thank the referees for their valuable comments.

6.0 REFERENCES

1. Garey and Johnson, Computers and Intractability, W. H. Freeman, 1979.
2. Lawrence Davis, Applying Adaptive Algorithms to Epistatic Domains, to appear in Proc. IJCAI-85.
3. John H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
[Graph 1: Density as a bin packing proceeds. Graph 2: Density as the search proceeds. Graph 3: Tracing the evolution (directed trees).]

Directed Trees Method for Fitting a Potential Function

Craig G. Shaefer
Rowland Institute for Science, Cambridge MA 02142

Abstract

The Directed Trees Method is employed to find interpolating functions for potential energy surfaces. The mathematical algorithm underlying this fitting procedure is described, along with example calculations performed using a genetic adaptive algorithm for fitting the A_k unfolding families to 1- and 2-dimensional surfaces. The properties and advantages of the use of genetic adaptive algorithms in conjunction with the Directed Trees method are illustrated in these examples.

Section: 1. Introduction

How does one choose a mathematical model to describe a particular physical phenomenon? To help in answering this question, we have developed a method called the Directed Trees (DT) method for describing the possible structures available to a particular special type of model, the gradient dynamical systems. The gradient dynamical systems are, however, quite general and flexible and hold a ubiquitous presence in the physical sciences. In the next section we illustrate where this special type of model 'fits' into a very broad class of mathematical models. The DT method employs a relatively young branch of mathematics called differential topology: "topological" in order to form categories of solutions for gradient dynamical systems, reducing the problem to the study of a finite number of different categories, and "differential" in order to allow for quantitative calculations within these models.
For the purposes of this paper, it is sufficient to say that in the numerical applications of the Directed Trees method, systems of nonlinear equations arise for which we require solutions. Although classical numerical methods could be employed for the solution of these nonlinear systems, we find that genetic adaptive algorithms (GAs) are especially suited for this purpose and have certain advantages. In order to introduce our application of GAs to the solution of nonlinear systems of equations, and to be able to discuss the advantages which GAs offer over the more classical numerical methods, the third section of this paper provides a brief exposition of the topological concepts inherent to the Directed Trees method and describes the equations that arise in its quantitative applications. Section 4 contains examples of the usage of genetic adaptive algorithms for the solution of these systems.

Section: 2. General Mathematical Models

In this paper we are seeking not so much a procedure for calculating the specific solution to the mathematical model of a physical system, but rather the development of a model for which we may classify its solutions into behavioral categories, so that one particular solution from each category serves as a paradigm for all solutions belonging to its category. Obviously, this will greatly simplify the study of the general solution of a model. In order to do this, however, we first need to restrict the type of mathematical model to which our classification scheme is applicable. To understand where our restricted class of models fits into the general class of mathematical models, below we describe the simplifications inherent to our restricted class.
The following table contains a list of possible variables whose interrelationships we seek. These variables include items such as the spatial and time coordinates, and parameters such as the masses of particles, the refractive indices of media, densities, temperatures, etc. In addition, our model might also depend on the derivatives with respect to the time and spatial coordinates, as well as integrals whose integrands are functions of the other variables or solutions.

  Variable / General Term                         Comments
  x = (x_1, ..., x_n) in R^n                      n spatial coordinates
  t in R                                          time coordinate
  p = (p_1, ..., p_k) in R^k                      k parameters (mass, refractive index, ...)
  Phi = (Phi_1, ..., Phi_n)                       solutions (trajectories, ...)
  D_t^j Phi                                       time derivatives
  D_x^j Phi                                       spatial derivatives
  \int T(Phi) dt                                  time integrals of functionals+
  \int K(Phi) dx                                  spatial integrals of functionals+
  f(x, t, p; Phi, D_t Phi, D_x Phi, ...)+         integrodifferential functionals

  + functionals may depend on any of the variables located above them in this table

Table 1. Table containing possible variables, parameters, and functional dependencies for a general mathematical model.

Suppose we have a physical system for which we have a set of m arbitrary rules that specify the interactions of the variables from Table 1. This leads to the following system of m equations, called an Integrodifferential System, whose solutions describe the behaviors of the physical system:

  f_i(x, t, p; Phi, D_t^j Phi, D_x^j Phi, \int T(Phi) dt, \int K(Phi) dx) = 0,  i = 1, ..., m.   (1)

Since we have this system of equations, let us suppose that there are n solutions; thus we take Phi = (Phi_1, ..., Phi_n) in what follows. Let us remark that this system forms a very general and flexible mathematical model for studying physical phenomena.
It encompasses almost all mathematical models that are currently employed in the sciences. This system of integrodifferential equations is, however, much too difficult to solve in all of its generality; only in very specific cases are solutions even known, and virtually nothing is known about how these solutions vary as the parameters are changed. We must make a few simplifications in these equations before anything can be said about their general solutions. These simplifications are very typical, though, for many models in the "hard" sciences have as their fundamental premises the assumptions that we describe below.

To begin, we assume that f does not explicitly depend upon x, D_t^j Phi for j > 1, D_x^j Phi, \int T(Phi) dt, nor \int K(Phi) dx. Then the system has the form f = f(Phi, p, t; D_t Phi) = 0, for which more can be said concerning its solutions. Instead of studying this system, though, we continue with a further simplification concerning the dependency of f on the time derivatives; in particular, we consider those f of the form

  f = D_t Phi - f'(Phi; p, t) = 0.

Note that the function f' appears to be similar to a force vector. In effect, the above system of equations describes the situation in which the rates of change of the solutions are proportional to a vector that depends upon the solutions themselves. This type of system arises in classical mechanics and is usually called a Dynamical System. If we make the further restriction that the forces do not explicitly depend on the time, then we have the following system of equations, which forms an Autonomous Dynamical System:

  f = D_t Phi - f'(Phi; p) = 0.

A few useful statements can be made about the solutions of this type of system of equations and their behaviors as the parameters p are varied. We, however, will again continue and make one further simplifying assumption on the form of the f'.
We noted above that the vector function f' is of a form similar to the forces in kinematics and electrodynamics. If, in fact, f' is a true force, then it can be taken to be the negative gradient of some scalar potential phi: f' = -D_Phi phi(Phi; p). Then we have the system

  f = D_t Phi + D_Phi phi(Phi; p) = 0,   (2)

which is termed a Gradient Dynamical System. Many very powerful statements can be made about the Phi and their behaviors as functions of p for this system. Oftentimes we are concerned with the "stationary" solutions of (2), i.e., solutions which are time-independent. These stationary solutions require the forces to vanish; in other words, we require D_Phi phi(Phi; p) = 0. This equation determines what are called the equilibria of the gradient system. The most powerful and general statements can be made about equilibria and how they depend upon their parameters.

The solutions Phi of the above systems are merely generalized coordinates for the physical systems, and thus, following the standard nomenclature, we replace Phi by x. For example, these solutions x might be the positions of equilibria as functions of time, the Fourier coefficients of a time series, or even laboratory measurements. We have thus shaved the general mathematical model (1) of a physical process down to the specific case of examining the behaviors of scalar potential functions phi(x, p). It is for these special cases that differential topology yields the most useful results. In the next section we examine the primary results of singularity theory, which allow any arbitrary potential to be classified into a finite number of different category types. It is this classification that greatly simplifies the study of gradient dynamical systems.
Since we are interested in the particular potential functions stemming from the solution of the Schroedinger equation under the Born-Oppenheimer approximation for a chemical reaction, we apply the classification scheme specifically to potential energy surfaces (PESs). Keep in mind that the same classifications and calculations are applicable, however, to any gradient dynamical system. The classification scheme that we have developed for PESs, as we have mentioned, is called the Directed Trees method and contains both a qualitative diagrammatic procedure for implementing the classification as well as a quantitative computational procedure for calculation of specific behaviors and characteristics of the model.

Section: 3. The Topology of Potentials

Why should we concern ourselves with an alternate classification scheme, based upon differential topology, for potential energy surfaces? The reason for doing so is that this new Directed Trees classification has two special properties: structural stability and genericity. The concept of structural stability plays an important role in the mathematical theory of singularities. There are several reasons for this importance. First of all, the problem of classifying objects is usually extremely difficult; it becomes much simpler if the objects one is classifying are stable. Secondly, in many cases the class of all stable objects forms what is loosely called a generic set. This means that the set of all stable objects is both open and dense, in the mathematical sense, in the set of all objects. In other words, almost every object is a stable object and every object is "near" a stable object.
Thus every object can be represented arbitrarily closely by a combination of stable objects. For instance, the Implicit Function Theorem of calculus and Sard's Theorem of differential topology imply that almost all points are regular points (points whose gradients are nonzero) for stable functions, and thus are not critical points. Stated differently, regular points are generic, i.e., they form an open and dense subset of the set of all points for stable functions. (Stable functions are functions which can be perturbed and still maintain their same topological properties.) Even though almost all points of a function are regular points, nondegenerate isolated critical points do occur and have a generic property: they are not removed by perturbations. The importance of nondegenerate critical points extends beyond their mere existence, for they "organize" the overall shape of the function. This can be seen in the following one-dimensional example. Consider a smooth function of a single dimension, f(x), which has three critical points between x = 0 and x = 1. If the curvature at the critical point with the smallest x coordinate in this interval is negative, then the curvatures at the middle and highest critical points must be positive and negative, respectively, for no other combination can lead to a smooth function connecting these three critical points. In addition, the functional values at the smallest critical point and the largest critical point must both be greater than the value of the function at the middle critical point.
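The one-dimensional argument can be checked numerically. Below is a small sketch (our own illustrative choice of f, not from the paper) whose derivative vanishes exactly at three chosen points in (0, 1); the curvatures alternate maximum/minimum/maximum and the middle critical value is the lowest, just as the text requires:

```python
# f'(x) vanishes exactly at A, B, C; f itself is recovered by numerical
# integration, so the sign pattern of the curvatures and the ordering of
# the critical values can be verified directly.
A, B, C = 0.2, 0.5, 0.8

def fprime(x):
    return -(x - A) * (x - B) * (x - C)

def f(x, n=2000):
    # midpoint-rule integral of f' from 0 to x
    h = x / n
    return sum(fprime((i + 0.5) * h) for i in range(n)) * h

def curvature(x, h=1e-5):
    # central-difference approximation to f''(x)
    return (fprime(x + h) - fprime(x - h)) / (2 * h)
```

Here the first critical point A is a maximum (negative curvature), which forces a minimum at B and a maximum at C, with f(A) and f(C) both above f(B).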
A simple graph of f satisfying the above conditions will show that if these statements were not in fact true, then additional critical points would be required between these three critical points. Just as nondegenerate critical points organize the shape of a one-dimensional function, degenerate critical points "organize" families of functions having specific arrangements of nondegenerate critical points. These degenerate critical points are nongeneric in the sense that small perturbations either split the degenerate critical points into nondegenerate points or annihilate the degenerate point completely, leaving behind only regular points. It might therefore seem that we should not concern ourselves with degenerate critical points, since they are mathematically "rare" occurrences on a surface and can be removed by small perturbations. The manner in which degenerate points "organize" functions into classes, however, leads to a generic classification of families of functions that is stable to perturbations and hence will be very useful in our study of PESs. A third reason for the importance of stability stems from the applications of singularity theory to the experimental sciences. It is customary to insist on the repeatability of experiments. Thus similar results are expected under similar conditions; but since the conditions under which an experiment takes place can never be reproduced exactly, the results must be invariant under small perturbations and hence must be stable to those perturbations. Thus we see it is reasonable to require that the mathematical model of a physical process have the property of structural stability. In order to define this concept of stability, we first need a notion of the equivalence between objects. This is usually given by defining two objects to be equivalent if one can be transformed into the other by diffeomorphisms of the underlying space in which the objects are defined.
For the specific case when the objects are PESs, these diffeomorphisms are coordinate transformations and will be required to be smooth, that is, differentiable to all orders, and invertible. This invertibility is a requirement of the Directed Trees method and forms an important reason for employing GAs in the numerical applications of the DT method. The mathematical branch of differential topology called catastrophe theory forms the foundation for the DT method. In its usual form, catastrophe theory is merely a classification of degenerate singularities of mappings, the techniques of which use singularity theory and unfolding theory extensively, along with a very important simplifying observation made by Thom which has come to be called the Splitting Theorem. In this paper we wish only to emphasize the fundamental concepts behind the Classification Theorem, thus providing a heuristic justification for its use in the study of PESs. In the process we describe the functional relationships between the PES and its canonical form, which we call the Directed Trees Surface (DTS). We do not provide rigorous statements nor proofs of any of the theorems of differential topology; more importantly, we hope to provide an intuitive description of the fundamental concepts behind these theorems. In order to describe these results we employ the terminology of differential topology, and thus below we provide the basic definitions necessary for comprehension and discussion of the DT method. A glossary of topological terms and notation used, sometimes without comment, in this paper is provided in the Appendix. Since our main interest in this paper is the local properties of potential energy functions, we begin by recalling some preliminary definitions of local properties. If two functions agree on some neighborhood of a point, then all of their derivatives at that point are the same.
Thus, if we are interested in trying to deduce the local behavior of a function from information about its derivatives at a point, we do not need to be concerned with the nature of the function away from that point, but need only be concerned with the function on some neighborhood of this point. This leads to the concept of a germ of a function. Let L be the set of all continuous functions from the Euclidean space R^n to R defined in a neighborhood of the origin. We say that two such functions f, g in L determine the same germ if they agree in some neighborhood of the origin, so that a germ of a function is an equivalence class of functions. Since this theory is entirely local, we may speak of the values of a germ f, and we write f(x) for x in R^n, although it would be more correct to choose a representative from the equivalence class f. A germ f at x is smooth if it has a representative which is smooth in a neighborhood of x. Because germs and functions behave similarly, we often use the two interchangeably to represent a germ; only where confusion may result will we distinguish a germ from one of its representatives. We may also talk of germs at points of R^n different from the origin. A germ is thus defined by a local mapping from some point of origin. If two smooth functions have the same germ at a point, then their Taylor expansions at that point are identical. We may, without loss of generality, take the origin of a germ to be the origin of R^n.
The set of all germs from R^n to R forms an algebra. This convenient fact allows us to study the germs of maps with powerful algebraic techniques that ultimately lead to algebraic algorithms for the topological study of arbitrary PESs.

Fundamental to many applications of applied mathematics is the technique of representing a function by a finite number of terms in its Taylor expansion. For quantitative calculations, it is necessary to make some estimate for the size of the remainder term after truncation of the series. Sometimes we are not interested so much in the size of the remainder term as in whether, by a suitable change in coordinates near x, the remainder term can be removed completely. In this case the function is, in a very precise sense, equal to its truncated Taylor series in the new coordinates. This notion of transforming away the higher-order terms of a Taylor series expansion is formalized in the notion of determinacy. Before defining determinacy, we first introduce some additional nomenclature. The Taylor series of f at x which is truncated after terms of degree p is referred to as the p-jet of f at x, denoted by j^p f(x). We now define what we mean by the local equivalence of germs. Two germs f, g with f(x) = g(x) are equivalent if there exist local C-infinity diffeomorphisms psi: R^n -> R^n and gamma: R -> R such that g = gamma(f(psi(x))). Thus, by suitable C-infinity changes of local coordinates, the germ f can be transformed into the germ g. We now note why the coordinate changes must be invertible. Neglecting a constant, the two functions are equal on some neighborhood of a point, and we have expressed f as a function of x, that is, f(psi(x)). In addition, we would like to be able to express g as a function of the coordinates for f. This requires us to invert the psi coordinate transformation, x = x(y). As we stated earlier, this invertibility criterion becomes an important reason for choosing GAs to solve the systems of nonlinear equations that arise from the DT method. With this, we may now formulate the definition of determinacy: the p-jet zeta at x is p-determined if any two germs at x having zeta as their p-jet are equivalent.

If we are studying a C-infinity function f, we may understand its local behavior by expanding f in a truncated Taylor series, ignoring all of the higher-order terms of degree greater than p. We can be sure that nothing essential has been thrown away if we know that f is p-determined. Stated more precisely, we may study the topological behavior of a p-determined germ f by studying its p-jet j^p f. One might think at first that no germs are p-determined for finite p. As an example of this, consider the germ of f at the origin of R^2 given by f(x,y) = x^2. This is not p-determined for any p, since the function g(x,y) = x^2 + y^(2p+2), which has the same p-jet as f, is 0 at the origin and positive elsewhere, whereas f is also 0 along the y-axis. However, if f were a function of x alone, f(x) = x^2, then f would be 2-determined. We thus see that the determinacy of f depends not only on its form but also on the domain over which it acts. Since we have noted that if a function is p-determined its topological behavior may be understood by studying its p-jet, we may now ask the following question: are there methods for deciding whether or not a given p-jet is determined?
We answer this question in the affirmative, and in a later paper we describe an algorithm based on work by Mather for calculating the determinacy of p-jets. In Section 4 of this paper, which describes the fitting of DTSs to PESs, we provide examples of (i) the DTS behavior for cases in which the proper p-jet is chosen for f, (ii) the behavior for cases in which the chosen p-jet has p less than the determinacy of f, and (iii) the behavior of j^p f in which p is greater than the determinacy of f. Below we summarize the four basic and interrelated concepts of singularity theory: (i) stability, (ii) genericity, (iii) reduction, and (iv) unfolding of singularities. To describe what is meant by stability, consider the map f: R -> R given by f(x) = x^2. This map is stable, since we may perturb the graph of this map slightly and the topological picture of its graph remains the same. That is, consider the perturbed map g: R -> R, g(x) = x^2 + eps*x with eps != 0. This perturbed function g still has a single critical point just as f does, and can be shown to be just a reparametrization of f. Thus we hope to characterize and classify stable maps, since if we perturb these, we can still predict their topological behavior. Since our goal is to provide a mathematical model for classifying and calculating PESs, one might ask whether there are enough stable maps to be worthwhile in this endeavor. That is, can any arbitrary PES be approximated by a stable map? This is the question of the genericity of stable maps, i.e., whether the set of all stable maps is open and dense in the set of all maps. If it is, then any map is arbitrarily "close" to a stable map and may be represented by combinations of stable maps. It thus makes sense to study the properties of stable maps, since these properties will then be pertinent to any arbitrary PES.
Reduction refers to the often-employed technique of splitting a problem into two components: one component whose behavior is simple and known, and a second component whose behavior is unknown, hence more interesting, and whose behavior we would like to study. This is typical in most physical models in which there are many variables whose functional behavior is assumed to be simple, for example harmonic. These variables are usually "factored out" of the overall model for the physical phenomenon, since the behavior of the system over these variables is known. The Splitting Theorem provides a justification for this reductionism. Rene Thom introduced the basic notion of the unfolding of an unstable map in order to provide stability for a family of maps. To see what this means, let us consider the following example, to which we will often return for illustrating new topological concepts. Let f: R -> R be given by f(x) = x^3. This map is unstable at zero, since if we perturb f by eps*x, where eps is small, the perturbed map g(x) = x^3 + eps*x assumes different critical behaviors for eps < 0 and eps > 0. There are two critical points, a minimum and a maximum, in a small neighborhood of zero when eps < 0, but for eps > 0 there are no critical points. The family of maps F(x, eps) = g(x) is, however, stable. Thus F includes not only f, but also all possible ways of perturbing f. The map F is said to be a universal unfolding of f. It is very important that the unfolding F include all possible ways of perturbing f. To be more specific, consider perturbing f by the term delta*x^2, where delta is arbitrarily small but not zero. The map h(x) = x^3 + delta*x^2 assumes the same critical behavior for all delta != 0; that is, h(x) has one maximum and one minimum. Thus for eps < 0, g(x) has the same critical behavior as h(x), and it can be shown that g and h are "equivalent" for eps < 0 and delta != 0. (The precise meaning of "equivalent" is described in the Glossary.)
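The critical-point counting in this example is elementary to verify: g'(x) = 3x^2 + eps has two real roots when eps < 0 and none when eps > 0, while h'(x) = 3x^2 + 2*delta*x = x(3x + 2*delta) always has two roots for delta != 0. A short sketch (our own illustrative code, not from the paper):

```python
import math

def critical_points_g(eps):
    """Real critical points of g(x) = x**3 + eps*x, i.e. roots of 3x**2 + eps."""
    if eps > 0:
        return []                        # no critical points at all
    if eps == 0:
        return [0.0]                     # the degenerate point of f(x) = x**3
    r = math.sqrt(-eps / 3.0)
    return [-r, r]                       # maximum at -r, minimum at +r

def critical_points_h(delta):
    """Real critical points of h(x) = x**3 + delta*x**2, i.e. roots of x*(3x + 2*delta)."""
    if delta == 0:
        return [0.0]
    return sorted([0.0, -2.0 * delta / 3.0])
```

For every nonzero delta, h keeps exactly two critical points, so h can never reproduce the critical-point-free behavior that g exhibits for eps > 0; this is precisely why h fails to be a universal unfolding of f.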
On the other hand, there is no delta for which h(x) lacks critical points; thus h(x) is not equivalent to g(x) when eps > 0. Therefore h is not capable of describing all possible perturbations of f, since it is unable to provide g with eps > 0. The unfolding g is, however, capable of describing all possible perturbations of f. Our discussion so far does not indicate how we know this fact; it is a rather deep result of singularity theory stemming from results based on the early insights of Thom. The crux of singularity theory is how to unfold the "interesting" component of a given model into a stable mapping with the least number of parameters, such as the eps from above.

3.1. Theorems from Topology

Several principal theorems of differential topology concern the effects that critical points have on the geometrical shape of manifolds. Since each has been carefully proven and thoroughly investigated in the literature, we only include here an informal statement of these theorems and a few of the results derivable from them. We emphasize that these theorems are closely related to each other: their differences entail the stepwise removal of some of the assumptions upon which the first theorem is based. The first of these theorems is borrowed from elementary calculus: the Implicit Function Theorem. This theorem controls the behavior of a surface at regular points, that is, at points which are not critical points. Excluding the overall translational and rotational coordinates of a molecule, the critical points of potential energy surfaces are isolated. Thus almost all points of a PES are regular points, and hence the Implicit Function Theorem describes the local behavior of almost all of a PES. Qualitatively speaking, the Implicit Function Theorem states that at a noncritical point of a potential function, the coordinate axes may be rotated so that one of the axes aligns with the gradient of the potential at that point.
Then the function is represented as f(x') = x'_1, where x' are the new coordinates. This is intuitively obvious by considering the gradient to be a "force vector": the coordinate axes may then be rotated so that one axis is colinear with the force, which may then be described as a linear function of this one coordinate. In analogy to our one-dimensional example of the control which critical points have on the possible shape of a function, we find that the overall shape of a PES depends upon the positioning and type of its critical points. The Morse Theorem, which is sometimes called the Morse Lemma in the literature, and its corollaries describe how nondegenerate critical points both control the shape of a surface and determine the relationship between an approximately measured function and the stable mathematical model which is used to describe that physical process. In particular, through the elimination of the assumption that the gradient is nonzero at a point, we find, around nondegenerate critical points, a new coordinate system in which the potential may be represented as the sum of squared terms of the coordinates, with no higher-order terms, no linear terms, and no quadratic cross terms. Thus the function has the form f = sum_i (+/-)x'_i^2 and is termed a Morse function. Corollaries of the Morse Theorem say that Morse functions are stable and that this stability is a generic property. Lastly, we discuss degenerate critical points and their influence on the possible configurations of nondegenerate points. By eliminating the assumption of a nonsingular Hessian matrix at a critical point of the surface, the Gromoll-Meyer Splitting Theorem says that the function may be split into two components: one is a Morse function, F_M, and the other is a non-Morse function, F_NM. The non-Morse component cannot be represented as quadratic terms and does not involve any of the coordinates of the Morse component. The Arnol'd-Thom Classification Theorem
categorizes all of these non-Morse functions into families, provides canonical forms for them, and describes the interrelations among the various families. The ramifications of the Arnol'd-Thom theorem cannot be overestimated. Suppose a function F(x, p) having a non-Morse critical point at (x_c; p_c) is perturbed. The perturbed function, through diffeomorphisms of x and p, is obtained from F by perturbing the Morse part and the non-Morse part separately. Perturbation of the former does not change its qualitative critical behavior, while perturbation of the latter does. Thus one can "forget" about the coordinates involved in the Morse function, while concentrating on the subspace spanned by the variables of F_NM. The theorem classifies all possible types of perturbed functions in this subspace. Corollaries also establish the stability and genericity of the universal unfoldings of the Classification Theorem.

3.2. Potential Functions and their Canonical Forms

In this section we want not only to discuss the connection between arbitrary potential functions and their canonical forms provided in a separate paper, but also to demonstrate the quantitative relationships that exist between the critical points, gradients, and curvatures of the potential function and the corresponding expressions for the canonical forms. In order to define the extent of the applications of these canonical forms, we begin with a brief exposition of Thom's method for modeling a physical system. First, suppose the physical system we wish to model has n distinct properties to which n definite real values may be assigned. We define an n-dimensional Euclidean space, R^n, which parametrizes these various physical variables. Each point in R^n represents a particular state for the physical system.
If x, x in R^n, is such a point, then the coordinates of x, (x_1, ..., x_n), are called the state variables. Let X, X a subset of R^n, be the set of all possible states of the physical system. The particular state x in X which describes the system is determined by a rule, S, which usually depends on a multidimensional parameter represented by p, p = (p_1, ..., p_k) in R^k. For most physical systems this rule is often specified as a flow associated with a smooth vector field, Y. This flow, or trajectory, on X usually determines the attractor set of Y. Sometimes the rule is specified so the flow "chooses" a particular attractor of Y with the "largest" basin. At other times the rule may only specify that the attractor be a stable one. Since very little is known mathematically about the attractors of arbitrary vector fields, catastrophe theory has little to say about this general model. If, however, the vector field is further restricted to be one generated by the gradient of a given smooth function, say V, then Thom's theory becomes very useful in the study of the physical model. In other words, if Y = -DV(x, p), where V is considered a family of potential functions on R^n x R^k, the attractors of Y are just the local minima of V(x, p). In terms of a potential function, the rule S again may have several forms. For instance, S may choose the global minimum of V, or it may require only that the state of the system correspond to one of the local minima of V. The specific details of the method which S uses to move x to the attractors of Y determine the dynamics of the trajectory of x in X. Various choices for S may correspond to tunneling through barriers on V, to steepest-descent paths on V, or to "bouncing" over small barriers by means of thermodynamic fluctuations.

3.3.
Relationships between Potential Functions and their Unfoldings

In order to examine a specific example, let us suppose that Y is a gradient vector field: Y = -DV(x, p), where V(x, p): R^n ⊕ R^k → R is a smooth potential function of the state variables x which depends upon a parameter p. The attractor set of Y is then specified as a set of stable minima of V. The critical points of V, defined by DV = 0, form a manifold X_V ⊂ R^{n+k} which includes the stable minima. Choosing a point (x0; p0) ∈ R^{n+k} on X_V, Thom's classification theorem tells us that in some neighborhood of (x0; p0), V is equal to the sum of a universal unfolding, U_k, of one of the germ functions, G_k, and a quadratic form Q = Σ_{i=j+1}^{n} (±x_i^2), for k ≤ 6 and j = 1 or 2.[9] More formally, if N_x ⊂ R^n is a neighborhood of x0 and N_p ⊂ R^k is a neighborhood of p0, then V: N_x ⊕ N_p → R is equivalent to

    F_k(x, p) = G_k(x_{1,j}) + P(x_{1,j}, p) + Q(x_{j+1,n}) = U_k(x_{1,j}; p) + Q(x_{j+1,n})

for some finite k, with x_{1,j} denoting the first j coordinates of x while x_{j+1,n} denotes the last n - j coordinates. This means that there exist diffeomorphisms X: N_x ⊕ N_p → N_x and γ: N_p → R such that, for any (x, p) ∈ N_x ⊕ N_p, we have

    V(x, p) = F_k(X(x, p); p) + γ(p).    (3)

This equation allows us to quantitatively relate the critical points, gradients, and curvatures of V and F_k. Application of the chain rule for derivatives of vector fields to equation (3) provides an expression for the gradient of V:

    DV(x, p) = D^T X(x, p) · DF_k(X; p),    (4)

where D denotes the partial derivative operator with respect to the coordinates of the function or operator which follows it. In order to determine the Hessian of V, HV, we carefully reapply the chain rule to (4) to yield

    HV(x, p) = D^T X · HF_k(X) · DX + Σ_i D_i F_k(X) · HX_i(x),    (5)

where D^T is the transpose of D and HX_i is the Hessian of the i-th component of X. We now have expressions equating not only V and F_k, (3), but also their gradients, (4), and Hessians, (5). Through these systems of nonlinear equations the unfolding parameters and diffeomorphisms may be calculated. As Connor[11]
has pointed out in a different context, the diffeomorphism and parameters of an unfolding may be calculated via the solution of the nonlinear system of equations which arises from the correspondence between the critical points of the unfolding and those of the experimental function. For PESs, however, the critical points are usually not known a priori, and thus this is not a viable procedure. Extensions of this method are reasonable, though. For instance, the DTS and PES must correspond within a neighborhood of any point. Thus, a similar system of nonlinear equations, whose solution yields the parameters and diffeomorphism, may be derived for points within some neighborhood of a particular point. Alternatively, at a single point the function and all of its derivatives must coincide with those of the DTS. Therefore, since ab initio quantum calculations now provide analytic first and second derivatives, it is reasonable to employ this information to help calculate the DTS parameters and diffeomorphism. Thus, the calculation of a single point on the PES, with its first and second derivatives, may be employed to determine a first approximation to the parameters and diffeomorphism. From a single point, then, we may be able to specify to which unfolding within a given family the particular PES belongs. Since there are canonical forms for the DTS, we also have canonical forms for its critical points, in particular its saddle points.[12] Therefore, one might next move over to the DTS saddle point and perform another quantum mechanical calculation there. Of course, this point will not correspond to the PES saddle point, but since locally the diffeomorphism is approximately the identity function,
it will be close to the PES saddle point. The additional information obtained at this new point may then be used to calculate a second approximation for the parameters and diffeomorphism. Thus, with each new point, better parameters are calculated so that the DTS better fits the PES. In the next section, we perform sample calculations on the one-dimensional unfolding families, the A_k families.

4. DT Method for Fitting a PES via the Genetic Algorithm

As we discussed in the last section, the problem of fitting a DTS to a PES is one of finding a solution to a nonlinear system of equations. The DT method allows a flexible choice for the form of the optimization function; we have considered both weighted least-squares and absolute-value evaluation functions. In particular, in the following examples we have employed the experimental and evaluation functions provided below.

    Experimental function:  f(x) = a*x^6 + x^4 - 3*x^2,  a = 0.05
    A2 Unfolding:           F(X) = X^3 + p1*X + p0
    Diffeomorphism:         X(x) = c0 + c1*x + c2*x^2

Evaluation functions:
    R  = Σ_i w_i |F(X(x_i)) - f(x_i)|    (6)
    R1 = Σ_i w_i |dF(X(x))/dx - df(x)/dx| at x_i
    R2 = Σ_i w_i |d²F(X(x))/dx² - d²f(x)/dx²| at x_i
    R3 = r0*R + r1*R1 + r2*R2

where {r0, r1, r2, w_i} are weighting factors. The standard numerical methods for solving nonlinear systems often involve algorithms of the Newton-Raphson type.[13] As we mentioned earlier, the coordinate transformation must be a diffeomorphism, and hence invertible. Empirically, we found that when employing a Newton-Raphson algorithm for solving these nonlinear systems, the calculated coordinate transformations often did not satisfy the invertibility criterion. Therefore we resorted to constrained optimization techniques. Several methods, including the Box complex algorithm[14] and standard least-squares procedures,[15] have been successfully used to solve these nonlinear equations. Typically, the constrained methods were very slow to converge to a minimum and thus required a significant increase in computational time. Since the evaluation functions involve differences between values of the experimental PES and its DTS, they were fraught with shallow local minima; thus, for some problems, these methods did not converge to the global minimum of the evaluation functions. In addition, the constrained optimizations often tended to remain close to their constraint boundaries, resulting in the optimizations becoming stuck in local minima. These considerations led us to try other function optimizers. Besides these classical techniques, genetic adaptive algorithms (GAs) may also be employed to solve these systems. GAs are based on an observation originally made by Holland[16] that living organisms are very efficient at adapting to their environments.
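To make the GA approach concrete, the following is a minimal sketch of a genetic search for the fitting parameters (p0, p1, c0, c1, c2) of the one-dimensional example above, using the evaluation function R with unit weights. The real-coded representation, truncation selection, uniform crossover, and Gaussian mutation here are our own illustrative assumptions; the actual GA used in this work is a Holland-style algorithm described in the references.

```python
import random

# Illustrative 1-D example: experimental function, A2 unfolding, and
# quadratic diffeomorphism, with coefficients as we read them from the text.
def f(x):                              # "experimental" PES
    return 0.05 * x**6 + x**4 - 3 * x**2

def dts(x, p0, p1, c0, c1, c2):        # DTS F(X(x)) with F(X) = X^3 + p1*X + p0
    X = c0 + c1 * x + c2 * x * x       # quadratic diffeomorphism X(x)
    return X**3 + p1 * X + p0

XS = [i / 10.0 for i in range(-15, 16)]       # data points on [-1.5, 1.5]

def R(params):                         # evaluation function R, unit weights
    return sum(abs(dts(x, *params) - f(x)) for x in XS)

# The GA needs only a search interval per parameter, not an initial guess.
BOUNDS = [(-5.0, 5.0)] * 5

def genetic_search(generations=60, pop_size=40, seed=1):
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=R)
        history.append(R(pop[0]))      # best evaluation so far
        elite = pop[: pop_size // 2]   # elitist truncation selection
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = [u if rng.random() < 0.5 else v for u, v in zip(a, b)]
            k = rng.randrange(5)       # mutate one gene, clipped to its interval
            lo, hi = BOUNDS[k]
            child[k] = min(hi, max(lo, child[k] + rng.gauss(0.0, 0.1)))
            children.append(child)
        pop = elite + children
    pop.sort(key=R)
    return pop[0], history

best, history = genetic_search()
```

Because the elite members are carried over unchanged, the best recorded evaluation value in `history` never increases from one generation to the next.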
Implicit in a genetic adaptive search is an immense amount of parallel calculation, and empirical studies indicate that GAs will often outperform the usual numerical techniques.[17] We do not discuss the workings of GAs here, but rather refer the reader to the literature.[18] Several features illustrated in the following fitting examples are of importance, and we mention them here. (i) We show that the coordinate transformation employed by the DT method is required to be a diffeomorphism. If the coordinate transformation calculated via the DT method is not a diffeomorphism, then the chosen determinacy of the PES is too low and a higher-order unfolding family is needed in order to accurately fit the PES. (ii) Also illustrated is the fact that the diffeomorphism may include terms which have asymptotic behavior, for example, exponential terms. In this case, the asymptotic behavior of the surface may be reproduced by including comparable behavior in the diffeomorphism. (iii) "Bumps" or "shoulders" on surfaces that do not form critical points still reflect the fact that they stem from the annihilation of critical points of a germ function. Thus any bump or shoulder on a surface means that a higher-order unfolding family will be required in order to accurately reproduce it. (iv) Also depicted in these examples is the DT method for fitting a 2-dimensional potential energy surface. Our example 2-D surfaces have one "interesting" coordinate, that is, one coordinate which is not harmonic, and one coordinate which is harmonic.

Figure 1. A2 DTS fits employing R and R3 to an A3 experimental function at 1, 2, and 3 data points.

In Figure 1, we illustrate the Directed Trees fitting procedure by employing the genetic algorithm to fit the A2 unfolding family to an experimental function belonging to the A3 family.
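Since the examples repeatedly contrast the R and R3 criteria, it may help to spell out how the derivative terms of the DTS are obtained through the chain rule (cf. equations (4) and (5)). The sketch below assumes our reading of the garbled evaluation functions, namely R3 = r0*R + r1*R1 + r2*R2 with absolute-value terms, and uses the one-dimensional example functions of Section 4; all names are illustrative.

```python
# Chain-rule derivatives of the composite DTS F(X(x)) (cf. equations (4)-(5)).
def dts_derivs(x, p0, p1, c0, c1, c2):
    X = c0 + c1 * x + c2 * x * x           # quadratic diffeomorphism
    dX, d2X = c1 + 2 * c2 * x, 2 * c2
    F = X**3 + p1 * X + p0                 # A2 unfolding
    dF = (3 * X * X + p1) * dX             # dF/dx = F'(X) * X'(x)
    d2F = 6 * X * dX * dX + (3 * X * X + p1) * d2X
    return F, dF, d2F

def f_derivs(x):
    """Experimental f(x) = 0.05 x^6 + x^4 - 3 x^2 and its first two derivatives."""
    return (0.05 * x**6 + x**4 - 3 * x**2,
            0.3 * x**5 + 4 * x**3 - 6 * x,
            1.5 * x**4 + 12 * x**2 - 6)

def R3(params, xs, r=(1.0, 1.0, 1.0), w=None):
    """R3 = r0*R + r1*R1 + r2*R2: mismatch in value, slope, and curvature."""
    w = w or [1.0] * len(xs)
    tot = [0.0, 0.0, 0.0]
    for wi, x in zip(w, xs):
        for i, (a, b) in enumerate(zip(dts_derivs(x, *params), f_derivs(x))):
            tot[i] += wi * abs(a - b)
    return sum(ri * ti for ri, ti in zip(r, tot))
```

Setting r1 = r2 = 0 recovers the plain R criterion, which fits functional values only.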
We chose this experimental function to exemplify several features of the DT method. In particular, the value of the coefficient of the x^6 term was chosen in order to generate a third critical point on the experimental surface within the coordinate interval -3 < x < 3. We chose this interval so that the local nature of the fitting procedure for the A2 unfolding may be demonstrated. In conjunction with this local aspect of the A2 DTS on the [-3, 3] interval, however, we would like to point out that all three critical points, and hence the experimental function itself, may be accurately represented with the A3 unfolding family. Even though the highest-order term of the A3 germ function is fourth-order, its unfoldings may have three critical points, and thus the three critical points of this A3 PES may be accurately reproduced on the interval [-3, 3]. We have successfully fit an A3 DTS to all three singularities of this A3 PES. (The A2 unfolding family does not have the proper local topology, and consequently it cannot accurately reproduce this PES; when an A2 DTS fit is attempted, either the fitting is very poor or the calculated coordinate transformation is not a diffeomorphism.) This example also demonstrates the use of the DTS to help choose new positions for further calculations, and the employment of the first and second derivatives in addition to the functional values at the data points.

Figure 2. A2 DTS fit to noisy data points. In this figure, the experimental function is drawn as narrow solid lines.
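The topological claim in this passage can be checked numerically: the example function shows three critical points on [-3, 3], while an A2 unfolding F(X) = X^3 + p1*X + p0 can show at most two. The sketch below counts sign changes of the derivative on a grid; it assumes our reading of the experimental function, f(x) = 0.05x^6 + x^4 - 3x^2, and the grid test is our own simplification.

```python
def count_critical_points(df, lo, hi, n=20000):
    """Count sign changes (zero crossings) of the derivative df on [lo, hi]."""
    step = (hi - lo) / n
    count, prev = 0, df(lo)
    for i in range(1, n + 1):
        cur = df(lo + i * step)
        if prev == 0.0 or prev * cur < 0.0:
            count += 1
        prev = cur
    return count

# Our reading of the experimental function: f'(x) = 0.3 x^5 + 4 x^3 - 6 x.
df_exp = lambda x: 0.3 * x**5 + 4 * x**3 - 6 * x
# A2 unfolding: F'(X) = 3 X^2 + p1, hence at most two critical points.
dF_a2 = lambda X, p1=-3.0: 3 * X * X + p1

n_exp = count_critical_points(df_exp, -3.0, 3.0)   # three critical points
n_a2 = count_critical_points(dF_a2, -3.0, 3.0)     # two, for any fixed p1 < 0
```

The count of three versus two is exactly the mismatch that forces the move to a higher-order family.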
For clarity, the data points, which are represented as "solid" squares, are drawn at a constant y coordinate and not at their proper functional values; their proper functional values are located on the narrow solid curve. The dotted lines are the A2 DTS fits employing R as the fitting criterion; these R curves attempt only to fit the functional value of the experimental function at each of the data points. The thick dashed lines are the A2 DTS fits employing the R3 evaluation function; these dashed curves fit not only the functional value but also the values of the first and second derivatives at each point. In Part A of Figure 2 we have attempted the DTS fit employing only a single experimental point. Note that in this case the R fit does not have the proper local topology. There is not enough information available to determine the local shape of the experimental function, and it is only fortuitous that the R unfolding has about the same value of its first derivative as the experimental function. On the other hand, the R3 DTS fit does have the proper local topology, but its critical points are far removed from the corresponding experimental minimum and maximum. In Parts B and C of this figure we employ two experimental data points for fitting the DTSs. In Part B, the chosen data points include the single point from Part A plus an additional point at the minimum of the DTS surface calculated in Part A. We have thus used the approximate DTS surface of Part A to choose where the next calculation should be performed. The new information from the second datum point is then used to refine the DTS. In Part C, we use the same datum point as in Part A as well as the maximum point of the DTS in A. These refined DTS curves in Parts B and C now provide more accurate estimates of the minimum and maximum of the experimental function.
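The refinement strategy of Parts B and C (locate the critical points of the current approximate DTS and place the next calculation there) can be carried out in closed form for the A2 DTS with a quadratic diffeomorphism, since dF(X(x))/dx = (3X^2 + p1)*X'(x). The following sketch is our own helper, not the authors' code.

```python
import math

def dts_critical_points(p0, p1, c0, c1, c2):
    """Critical points of the A2 DTS F(X(x)) = X^3 + p1*X + p0 with quadratic
    X(x) = c0 + c1*x + c2*x^2: solve (3*X^2 + p1) * X'(x) = 0 for x."""
    roots = []
    if p1 < 0:                                  # F'(X) = 0 at X = +/- sqrt(-p1/3)
        for Xc in (math.sqrt(-p1 / 3), -math.sqrt(-p1 / 3)):
            if c2 == 0:                         # linear diffeomorphism
                if c1 != 0:
                    roots.append((Xc - c0) / c1)
            else:                               # c2*x^2 + c1*x + (c0 - Xc) = 0
                disc = c1 * c1 - 4 * c2 * (c0 - Xc)
                if disc >= 0:
                    roots += [(-c1 + s * math.sqrt(disc)) / (2 * c2) for s in (1, -1)]
    if c2 != 0:
        roots.append(-c1 / (2 * c2))            # X'(x) = 0: X fails to be a diffeomorphism here
    return sorted(roots)

# Identity-like diffeomorphism with p1 = -3: DTS critical points at x = -1 and 1.
pts = dts_critical_points(0.0, -3.0, 0.0, 1.0, 0.0)
```

The candidate points returned by this helper are where the next quantum-mechanical calculation would be performed; the X'(x) = 0 root, when present, is the tell-tale extra critical point discussed later in the text.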
We use three data points for Part D: the original point from Part A as well as Part A's DTS minimum and maximum points. Note that the R DTS fit to the three points does not have the proper topology of the experimental function. The R3 DTS, however, is a very accurate fit within the neighborhoods surrounding the maximum and minimum of the experimental function. Note, however, that the R3 DTS is unable to fit the second, rightmost, maximum of the experimental function. This is because this third critical point, generated by the sixth-order term in the experimental function, cannot be represented within the A2 unfolding family, which has, at most, two critical points. A higher-order family needs to be chosen in order to fit this maximum. In particular, the A3 family would be capable of fitting both of the maxima and the minimum of this experimental function. One does not have to use an A5 unfolding for this experimental function even though it contains a sixth-order term.

Figure 3. A2 DTS fit to the minimum and shoulder of an A3 experimental function.

In the previous figure, the experimental function was assumed to be known exactly. This is usually
not the case; typically, there are random errors in the potential energy at each datum point on a PES. These random fluctuations stem from round-off errors in calculations with approximate wave functions, numerical integration inaccuracies, or experimental random fluctuations. Also, as previously noted, the evaluation functions have many local minima which often appear similar to random fluctuations. To show that the DT method in conjunction with the GA optimizer does not require exact data, we return to the experimental function of Figure 2 in the next figure and add rather severe random fluctuations to its functional values as well as its first and second derivatives. The GA optimizer is very efficient at avoiding local minima and consequently works well for noisy PESs. Part A of this figure has six data points to which noise has been added. (The "open" squares representing the data points in this figure now reside at their proper functional values.) Note that the best A2 DTS fit employing R3 to these data does not accurately repeat the "exact" experimental function, that is, the function without the random fluctuations, which is drawn as a dashed line. In fact, it might appear as if the DTS does not even accurately fit the two data points at x = -1. It must be realized that only the functional values of the data points are plotted in this diagram; the R3 evaluation function, however, includes the first and second derivatives as criteria for a fit. Thus, for a small number of data points, the random fluctuations in the first and second derivatives need not cancel, and the DTS need not accurately fit the two functional values at x = -1. In Part B, we have added additional data points. Here, the DTS fairly accurately fits the "exact" experimental function. This figure illustrates that the Directed Trees method coupled with the genetic algorithm is easily applied to fitting DTSs to noisy PESs.
Figure 4. A3 DTS fit to the A3 experimental function.

One particular advantage of employing the genetic algorithm for fitting DTSs to PESs is that it is easy to require that the calculated coordinate change remain a diffeomorphism. In the next figure, we see not only a new experimental function and its DTS fits, but, in addition, plots of the corresponding diffeomorphisms, X(x), for the DTSs. Note that in Part A we have chosen data points surrounding the minimum of the experimental function at x = 0. This experimental function has only a single minimum, but it does have a shoulder at around x = -3. Even though this shoulder is not a critical point, it stems from the annihilation of a saddle and a minimum of the A2 family of functions. Hence our A2 DTS cannot fit this experimental function exactly; it is capable of fitting either the minimum, as illustrated in Part A, or the shoulder, as illustrated in Part B. Part A also illustrates the possibility of asymptotic behavior being included in the diffeomorphism, in which case the DTS is capable of fitting the asymptotic behavior of a PES. In fact, instead of expanding the diffeomorphism as a Taylor series, as we have done here, it could easily be expanded as a sum of exponential terms whose asymptotic behaviors are then imparted to the DTS. Note that, as the diffeomorphism levels off for x < -5, the DTS also becomes asymptotically level. Part B of this diagram contains a warning, however. The function X(x) is not a diffeomorphism over the entire interval, -7 < x < 3, and hence the assumptions necessary for application of the Arnol'd-Thom Classification Theorem are not satisfied over this interval. In fact, the critical point of X(x) leads to an additional critical point of the DTS at about x = 0.
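The warning in Part B suggests an explicit invertibility test. For the quadratic diffeomorphism X(x) = c0 + c1*x + c2*x^2 used in these one-dimensional examples, X is a diffeomorphism onto its image over an interval exactly when X'(x) never vanishes there. A simple grid check (our own simplification, for illustration) might look like:

```python
def is_diffeomorphism(c0, c1, c2, lo, hi, n=10000):
    """Grid test: X(x) = c0 + c1*x + c2*x^2 is invertible on [lo, hi]
    iff its derivative X'(x) = c1 + 2*c2*x never vanishes or changes sign."""
    sign = None
    for i in range(n + 1):
        x = lo + (hi - lo) * i / n
        d = c1 + 2 * c2 * x
        if d == 0.0:
            return False
        s = d > 0.0
        if sign is None:
            sign = s
        elif s != sign:                 # derivative changed sign: X'(x*) = 0 somewhere
            return False
    return True

print(is_diffeomorphism(0.0, 1.0, 0.05, -7.0, 3.0))  # X' = 1 + 0.1*x > 0 here -> True
print(is_diffeomorphism(0.0, 1.0, 0.5, -7.0, 3.0))   # X'(-1) = 0 inside -> False
```

Because X' is linear for a quadratic X, the sign test is reliable: any interior zero of X' forces a sign change that the grid detects.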
This critical point of X(x) was induced by attempting to fit the A2 unfolding family to "three" critical points: the one actual minimum of the surface and the annihilated saddle and minimum which generate the shoulder region. If the datum point at x = 3 is removed, then X(x) remains a diffeomorphism and the DTS accurately fits the shoulder of the experimental function. This example reveals an advantage of the genetic algorithm over many of the nonlinear Newton-like optimization schemes. Unlike the Newton methods, which require an initial guess and can become "stuck" in local minima, the genetic algorithm requires only starting intervals for its parameter values. This, by the way, allows one to ensure that the coordinate transformation X remains a diffeomorphism by controlling the ranges over which the parameter values may vary. In addition to the fact that parameter intervals are a much less restrictive initial condition than having to guess a starting parameter solution, one may also easily specify the resolution at which each individual parameter is calculated. Thus individual parameters may be optimized at differing resolutions. If X is not a diffeomorphism after fitting a DTS to a PES, then this is a tipoff that the chosen fitting family is too small and does not contain enough critical points for fitting the surface; one should then choose a higher-order family. In particular, Figure 4 illustrates that if we choose the A3 unfolding family to fit this experimental function, then both the shoulder and the minimum may be accurately fit. Since this experimental function is 3-determined and we are employing the A3 unfolding family, the diffeomorphism is a linear function with no critical points. We next consider the Directed Trees method for an experimental function which has more than one dimension. We choose an experimental function with one anharmonic coordinate and one harmonic coordinate.
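The remark that the GA needs only a starting interval and a resolution for each parameter corresponds, in a classical Holland-style bit-string GA, to decoding a fixed number of bits per parameter onto that parameter's interval. The encoding details below are our own illustration, not the authors' implementation.

```python
def decode(bits, spec):
    """Map a binary chromosome to parameter values: each parameter has its own
    search interval and its own resolution (number of bits)."""
    values, pos = [], 0
    for lo, hi, nbits in spec:
        field = bits[pos:pos + nbits]
        pos += nbits
        k = int("".join(str(b) for b in field), 2)
        values.append(lo + (hi - lo) * k / (2**nbits - 1))
    return values

# p0, p1 coarse (8 bits each); diffeomorphism coefficients finer (12 bits each).
# c1's interval excludes 0, which helps keep X(x) invertible for small |c2|.
spec = [(-5.0, 5.0, 8), (-5.0, 5.0, 8),
        (-2.0, 2.0, 12), (0.1, 2.0, 12), (-0.2, 0.2, 12)]
chromosome = [1, 0] * 26                 # 8 + 8 + 12 + 12 + 12 = 52 bits
params = decode(chromosome, spec)
```

Restricting an interval (as done for c1 here) is exactly the mechanism by which the GA can be forced to search only among invertible coordinate transformations.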
This PES is representative of isomerization reactions. It is an important trial case because of the recent interest in quasi-periodic versus chaotic trajectories on similar two-dimensional surfaces.[19] A similar surface was also chosen by Fukui[20] to illustrate the intrinsic reaction coordinate method. Contour levels of this function are drawn in Parts A and C of the following figure. There are several things to note about the experimental function. First of all, there are two minima and one saddle point. Neither of the minima is located at a special point, such as the origin. Also, a line drawn between the two minima is not parallel to either of the coordinate axes. The DT method, though, is capable of "rotating" the DTS coordinate axes so that it can accurately represent the experimental surface. In Part B of this figure, we have chosen the A3 family for fitting this function. Note that the corresponding contour levels in all parts of this diagram are drawn employing the same type of line, whether that be solid, dashed, dash-dotted, or dotted. The "stars" (*) in Parts A and C locate the data points used in the calculations for Parts B and D, respectively. The A3 DTS of Part B very accurately fits the experimental surface. One might ask what would happen if one were to choose a family which can display more critical points than the experimental function contains. This is illustrated in Part D of this figure. In this case, the A4 unfolding family was chosen to fit the same experimental function as provided in Part A. Note that in Part D the DTS accurately fits both minima and the saddle point of the experimental function. In addition, however, there is a new saddle point appearing around the point (2.1, 0.2).
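The "rotation" of the DTS axes mentioned above is carried by the linear part of the two-dimensional diffeomorphism. A minimal sketch of a 2-D A3 DTS of the form U(X) + Y^2, with a rotation-plus-shift diffeomorphism, follows; the parametrization is our own, for illustration only.

```python
import math

def dts_2d(x, y, p1, p2, theta, x0, y0):
    """2-D A3 DTS: F(X, Y) = X^4 + p2*X^2 + p1*X + Y^2, where (X, Y) is a
    rotation-plus-shift (hence linear, invertible) diffeomorphism of (x, y)."""
    c, s = math.cos(theta), math.sin(theta)
    X = c * (x - x0) + s * (y - y0)      # rotated, shifted coordinates
    Y = -s * (x - x0) + c * (y - y0)
    return X**4 + p2 * X**2 + p1 * X + Y**2

# With p2 < 0 and p1 = 0 the X-section is a symmetric double well:
# two minima and one saddle, as on the isomerization-like surface.
v = dts_2d(1.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0)   # value at the well X = 1
```

Fitting theta, x0, and y0 alongside the unfolding parameters is what lets the DTS place its minima off the origin and off the coordinate axes.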
This new saddle point stems from the fact that the A4 family can display four critical points. It is worth noting, however, that in the region surrounding the data points, the A4 DTS accurately fits the experimental function; the new, extraneous saddle point of the DTS lies outside the local neighborhood of the data points employed to fit this PES. This example of employing the A4 unfolding might lead one to consider always employing a high-order unfolding family to fit all PESs. One finds, however, from the practical viewpoint of calculating the fitting parameters, that a properly chosen unfolding family (one whose determinacy and local topology are the same as those of the experimental PES) will greatly reduce the amount of calculation and hence provide an easily calculated fit to the PES. This is because when the DTS has the proper number of critical points to reproduce the topology of the surface, data is not required to suppress extraneous critical points of the unfolding. Thus there is an optimum unfolding family, from a calculational standpoint, for each PES. It is true that a higher-order family, assuming it contains the lower-order family as a subfamily, will provide an unfolding which repeats the topology of its lower-order subfamily. It is this subfamily, however, that should be chosen as the unfolding family for the original fitting procedure.

Figure 5. A3 and A4 2-dimensional DTSs fit to an experimental function. Contour lines are drawn at energies of 15, 10, 8, 7, 4, and 1 in all parts of this figure.

As our last example of a 2-dimensional fitting to a 2-dimensional PES, we choose the same "exact" experimental function, but add random fluctuations to the experimental values and their first and second derivatives.
For this example, we also employ the R3 evaluation function in determining the unfolding and diffeomorphism parameters. Note that in Figure 6 the A3 DTS has the same critical behavior as the experimental PES; however, it is not as accurate a fit as that shown in Figure 5, because the noise included in our functional values is rather extensive. Since it is not possible to see these random fluctuations on a contour plot of the PES, we have drawn a 3-D stereo projection of the experimental PES along with the noisy data points chosen. In this view, the bold circles are the experimental points chosen on the surface, while the light crosses are the "exact" experimental values corresponding to the noisy data points.

Figure 7. Stereo view of the noisy data points of the A3 experimental 2-D function. The bold circles are the noisy data points while the thin "+" signs are the corresponding "exact" values.

5. References

[1] See the associated paper "The Directed Trees Method I: Classification of Potential Energy Surfaces" (submitted for publication).
[2] For typical examples of the simplifications that may arise solely from a classification scheme, see "The Differential Topology of the Directed Trees Method V: Symmetry Invariant Potential Energy Surfaces."
[3] These concepts have already played important parts in the story of classical mechanics and dynamics; for example, see Arnol'd, V.I. "Mathematical Methods of Classical Mechanics"; Springer-Verlag: New York, 1978, and Arnol'd, V.I.; Avez, A. "Ergodic Problems of Classical Mechanics"; Benjamin: New York,
1968; Chs. 1, 3-4.
[4] See the Glossary for the definition of an algebra.
[5] See the accompanying paper "The Differential Topology of the Directed Trees Method III: Determinacy and Unfolding Algorithms."
[6] Rudin, W. "Principles of Mathematical Analysis"; 3rd ed.; McGraw-Hill: New York, 1976; p. 224.
[7] Mezey, P.G. Theoret. Chim. Acta (Berl.) 1981, 58, 309.
[8] Morse, M. Trans. Amer. Math. Soc. 1931, 33, 72. Milnor, J. "Morse Theory"; Princeton Univ. Press: Princeton, New Jersey, 1963.
[9] Arnol'd, V.I. Russian Math. Surveys 1974, 29, 10.
[10] Thom, R. "Structural Stability and Morphogenesis"; Benjamin: Reading, MA, 1975.
[11] Connor, J.N.L. Mol. Phys. 1976, 31, 33.
[12] See the accompanying paper "The Differential Topology of the Directed Trees Method II: Potential Energy Surfaces and Canonical Forms."
[13] Fletcher, R. "Practical Methods of Optimization, Vol. 1: Unconstrained Optimization"; Wiley: New York, 1980; Ch. 6.
[14] Richardson, J.A. Commun. ACM 1973, 16, 487.
[15] International Mathematical & Statistical Libraries, Inc. Subroutines ZXSSQ, ZSPOW, and ZXCNT from "IMSL Library of Fortran Subroutines"; 9th ed.; IMSL, Inc.: Houston, TX, 1981.
[16] Holland, J.H. "Adaptation in Natural and Artificial Systems"; Univ. of Michigan Press: Ann Arbor, 1975.
[17] De Jong, K.A. "Analysis of the Behavior of a Class of Genetic Adaptive Systems"; PhD dissertation, Univ. of Michigan, August, 1975.
[18a] Bethke, A.D. "Genetic Algorithms as Function Optimizers"; PhD dissertation, Univ. of Michigan, January, 1981.
[18b] Brindle, A. "Genetic Algorithms for Function Optimization"; C.S. Department Report TR81-2 (PhD dissertation), Univ. of Alberta, 1981.
[18c] De Jong, K.A. IEEE Trans. Systems, Man, and Cybernetics 1980, 10, 9.
[18d] Holland, J.H. In "Progress in Theoretical Biology"; Rosen, R., Snell, F.M., Eds.; Academic Press: New York, 1976; Vol. 4, p. 263.
[18e] Holland, J.H. "Adaptation in Natural and Artificial Systems"; Univ. of Michigan Press: Ann Arbor, 1975.
[19a] De Leon, N.; Berne, B.J. J. Chem. Phys. 1981, 75, 3495.
[19b] Kariotis, R.; Suhl, H.; Eckmann, J.-P. Phys. Rev. Lett. 1985.
[20a] Tachibana, A.; Fukui, K. Theor. Chim. Acta 1978, 49, 321.
[20b] Tachibana, A.; Fukui, K. Theor. Chim. Acta 1979, 51, 189.

6. Glossary

The following furnishes brief definitions of a few of the terms from differential topology that we employ in the text of this paper.

(1) C^m-Diffeomorphism: If y is a C^m-diffeomorphism, then it satisfies the following three criteria: (i) y is m times differentiable; (ii) y has an inverse, y^{-1}: R^n → R^n, such that y ∘ y^{-1} = y^{-1} ∘ y = I; and (iii) y^{-1} is m times differentiable, where m is either finite, ∞, or ω.

(2) Equivalence class: If A is a set and ~ is an equivalence relation on A, then the equivalence class of a ∈ A is the set {x ∈ A | a ~ x}.

(3) Equivalent: Two functions, f: R^n → R and g: R^n → R, are equivalent at 0 if there exists a diffeomorphism X: R^n → R^n and a constant γ such that f(X(x)) = g(x) + γ in a neighborhood of 0. Equivalence of two functions implies that they have the same geometric "shape" or critical behavior: they have corresponding critical points which are of the same type.

(4) Genericity: A generic property is a property possessed by an open dense subset of the system. This means that a generic property is "typical" for the system, and a complementary subset for which the property does not hold has measure 0; thus it is "mathematically rare" for a generic property not to hold. Since a generic property holds on a dense subset of the system, any member of the system, including those not having the generic property, may be approximated arbitrarily closely by elements having the generic property. An example of this is that a function having a degenerate critical point may be approximated by functions having only Morse critical points.

(5) Germ, Germ-equivalent: Let T be a topological space and S be any set. Let f: U → S and g: V → S be maps with domains U, V open sets in T, and suppose x lies in U ∩ V.
Then f and g are said to be germ-equivalent at x if there exists some open neighborhood W of x lying inside U ∩ V such that f = g on W. This is an equivalence relation on the set of all maps defined on neighborhoods of x in T with values in S, and the equivalence classes are called germs of maps at x. If S is also a topological space, then we can consider germs of continuous maps. If S and T are normed linear spaces, we can consider germs at x of C^r maps. If two C^∞ maps are germ-equivalent at x, then all their derivatives at x are the same.

(6) k-determined: Let f: R^n → R and let k be a non-negative integer. Then f is right-determined (right-left determined) if, for every g: R^n → R such that j^k(f) = j^k(g), f ~ g (f ~_rl g).

(7) Jet, k-jet: The k-jet of a function f, denoted by j^k(f), is the Taylor series expansion of f at x truncated after the order-k terms.

(8) Neighborhood: Given a topological space (T, τ), a subset N ⊆ T is a neighborhood of a point t ∈ T if there is a member S of τ with t ∈ S ⊆ N.

(9) Regular point: A point x is a regular point if x is in the domain of a function f: R^n → R and the gradient of the function at x is not zero.

(10) Smooth or C^∞: A function f: R^n → R^m is called smooth at a point x if all of its derivatives exist and are continuous at x.

(11) Stability: Properties of a mapping which are invariant to perturbations of the map are called stable properties, and the collection of maps which possess a particular stable property may be referred to as a stable class of maps. In particular,
a property is stable provided that whenever f_0: X → Y possesses the property and f_t: X → Y is a homotopy of f_0, then, for some ε > 0, each f_t with t < ε also possesses the property.

(12) Structural Stability: For the single-function case, let f: R^n → R be a function and p: R^n ⊕ R^k → R be an arbitrary small perturbation. Then f is stable at a point x_0 if there exists a diffeomorphism X = X(x) such that the perturbed function, g = f + p, in the new coordinate system is equivalent to the unperturbed function: f(x) = g(X) + γ.

(13) Topology, Topological Space, Open Sets: Let T be a set; a topology τ on T is a collection of subsets of T which satisfies the following criteria: (i) if T' ⊆ τ, then ∪T' ∈ τ; (ii) if T' ⊆ τ and T' is finite, then ∩T' ∈ τ; (iii) ∅ ∈ τ and T ∈ τ. Then (T, τ) is a topological space, T is its underlying set, and the members of τ are called the open (or τ-open) sets of (T, τ).

(14) Unfolding, Versal and Universal: An unfolding of a function f(x) is a parametrized smooth family of functions, F(x, p), where p = (p_1, ..., p_j), whose members are possible perturbations of f(x). The dimension of p, j, is called the codimension of the unfolding. Usually "unfolding" also refers to a particular member of the family, F(x, p). An unfolding G is a versal unfolding if any other unfolding of f may be obtained from G via a diffeomorphism. An unfolding H is a universal unfolding if it is both versal and of minimum codimension.