Saturday, 30 April 2011

An idea to combat bloat in genetic programming

Introduction

Genetic programming (Banzhaf et al., 1998) uses an evolutionary algorithm (Eiben, 2003) to search for programs that solve a well-defined problem that can be evaluated using one number, the fitness of the program. Evolutionary algorithms are inspired by the evolution of species by natural selection. The main difference of this search paradigm, compared to methods that are more traditional, is that is works with a group of candidate solution (a population), instead of just one solution. This makes the method more robust, i.e. less likely to get trapped in a local minimum. These algorithms code their solution as a gene. New generations of solutions are formed using sexual (and a-sexual) reproduction. Sexual reproduction (crossover of genes) is emphasized as it allows for the combination of multiple partial solutions into one, thus using the parallel computations made by entire population, making the use of a population less inefficient as it would be otherwise.

Bloat and evolution

One of the fundamental problems of genetic programming (GP) is bloat. After some generations (typically below one hundred), the search for better programs halts as the programs become too large. The main cause of bloat is generally thought to be the proliferations of introns. Introns are parts of the program that do not contribute to the calculation that is made, e.g. a large calculation that is multiplied by zero in the end, or a section that start with: if false then. Additionally, bloat can be caused by inefficient code, e.g. two times x=x+1, instead of x=x+2.

In genetic programming, the linear genes are typically converted into trees or lists with lines of code. Some other coding methods are also used, see Banzhaf et al. (1998). Performing a simple crossover on the genes would almost certainly result in an illegal program. Therefore, crossover is performed on a higher level, e.g. two sub-trees are exchanged between the sets of genes, or sections of program lines are exchanged. This way the child-program is legal code, but still, it is often bad code, and its results are not very similar to its parents. As a consequence, evolution favors bloat at the moment that it becomes hard to find better solutions. In this case, it is better for the parents to get offspring that is at least just as fit as they are, as getting less fit children. The probability to get equally fit offspring is better in large programs (with a small coding part) than in small ones (most of which is useful code), as in large programs it is more likely that the crossover operator does not destroy anything. The problem of bloat is thus intimately linked to the destructive power of crossover.

Combating bloat

At the moment, there are four ways to combat bloat (Smith, 2000). Koza introduced automatically generated functions, where parts of the program are encapsulated and thus can no longer be destroyed by crossover. Furthermore, many researchers run their GP with an enormous population, hoping to get good results before bloat becomes a problem. A third way, is to give a penalty in the cost function (or fitness function) to long programs or program with many non-coding parts; this is called parsimony pressure. It is computationally intensive to determine which parts of the code are non-coding. Elegant and creative versions of this idea are to use double tournament selection (one for fitness, and one for size; Luke and Panait, 2002) or to use a multi-objective search algorithm with fitness, size and diversity as objectives (De Jong and Pollack, 2003). Due to the problems with bloat, many researchers are opting for a stronger reliance on mutations relative to crossover. These four methods are apparently not sufficient, as bloat is still seen as an important problem (Smith, 2000).

Bloat is hindering the application of GP to interesting, more complex, problems that need more than a few dozen generations to find a solution. One may even say that as long as evolution comes to a halt, one can hardly call GP an evolutionary method. In nature crossover is helpful in finding good solutions, especially for more complex problems.

Here, I would like to introduce a new way to code programs in genes that will hopefully reduce the bloat problem, and might even remove it all together. The basic idea, is to more closely mimic computation in nature, and split up the large sequential program into many smaller pieces of code (called proglets) running in parallel. This will make the program more complex, but as now only proglets are exchanged during crossover, the child-programs that are created will normally be viable.

Random parallel genetic programming

In nature, computations are made in parallel and without global control. Computation in nature is typically very robust, which can be seen in the viability of offspring, but also in the flexibility of a neural network, where it is possible to remove neurons with only little loss. Von Neumann (1966) gives three reasons for the robustness and error-tolerance of natural computing: 1) redundancy, 2) employment of autonomous units without global control and 3) the ability to sense the significance of errors leading to repair or disuse of subunits; see also the inspiring book Von Neumann (1961).

Proglets

We may attain a part of this robustness and flexibility in a GP scheme by splitting up the full calculation in many small proglets. Each of these proglets would be given an identification number, or a few id-numbers. This would provide us with natural boundaries at which non-destructive crossover is possible: In the crossover scheme, the proglets of the parents are gathered, and each of the two children would get one copy of every double proglets, and the proglets with a unique identification number are distributed randomly. This way partial solutions, from different parts of the genome, can be combined without making a mess of the offspring chromosome.

To avoid mating by very dissimilar parents, and thus exchanging proglets at random, a speciation scheme is necessary. This could be a scheme where reproduction is not allowed in case proglets have too different id-codes. By also mutating the identification numbers, it can be arranged that proglets with the same id-number are responsible for about the same tasks. Without mutation of the id-numbers, it would be possible that tasks with the same id-number would evolve to become more and more different in the population. Following the trend for adaptive parameters in evolutionary computing (Eiben, 2003), it would be best to make this maximally allowed difference between id-numbers, part of the genome.

The proglets can be kept small by a hard limit to their length, or by giving larger proglets a fitness disadvantage (parsimony pressure). The latter solution sounds most promising, as it may allow longer codelets in case this is necessary, i.e., brings some advantage. For technical reasons a maximum size to the length of all proglets together will probably also be necessary. This boundary should preferably be much larger. The proglets should have access to local memory, like in any GP algorithm, but also to global memory. Global memory is necessary for communication between the proglets and for enabling them to work cooperatively on the solution.

Synchronization

The simplest way to synchronize the execution of the proglets is to execute them in order of ascending identification number. Such an algorithm could be called proglet-GP.

A scheme that comes closer to Von Neumanns first two properties for robust natural systems, would be to executed the proglets multiple times in a random order, to simulate parallel execution of the proglets on sequential hardware. Depending on the implementation of this random scheme, some proglets may not be executed at all. Such a parallel scheme may force the algorithm to develop a different kind of program, a set of proglets that is be more robust and more redundant and thus even better able to survive crossover. This system can be called Random Parallel Genetic Programming (RPGP).

Redundancy would naturally evolve in such a system, as important proglets would insure themselves against not being executed by increasing their number of copies in the chromosome. Such a system can be expected to be more robust as proglets can no longer be sure that other proglets are executed as well, thus proglets would try to reduce their dependency on other proglets. This may also improve the functioning of the proglets in another environment after reproduction.

Even if the proglets are as autonomous as possible, they will always have remaining interdependencies and will need to synchronize their actions. The proglets could do this themselves by signaling. For example, one proglet could set a variable in the global memory to signal that it has been executed, and another proglet could use this variable to determine whether it will make a calculation or not.

However, Wiener (1955) argued that a clock is very important to organize complex systems, whether it be a machine, a factory, the brain (waves in the EEC) or society (calendar with weeks, and months). A sequential system is less efficient than such a time synchronized system, as it is complicated by waiting, and storage. Thus, it would probably be good to give the system a digital clock and a few sinus curves as analog clocks.

Random other ideas

It may be a good idea to make a collection with chromosomes and start with such a collection instead of with random chromosomes. This could be done even if the problem at hand is different from the ones the collection was made from. This may in the end, make for proglets that are robust and useful over a large range of problems; this may evolve evolvability. Like the skeleton of animals, this has been recycled and adapted for every animal.

Another idea to reduce bloat is a proportional crossover scheme, where the number of crossover operations is made proportional to the length of the programs. Thus, a short program is subjected less often to crossover, as a long program. This favors programs that solve the problem at hand in as little commands as possible. The only problem I see with this method is that it may cause so much crossover in some programs that they are simply destroyed. You can normalize the operator by making it perform one (or some other constant) crossover operation for a program of average length.

Douglas Hofstadter work in the cognitive sciences emphasizes the idea of a parraleles terrassen raster. Translate to GP this would mean to test a larger group of new chromosomes just superficially before allowing some of them to get into the population where a full fitness calculation is made. You could do this in a few steps of computationally more intensive fitness calculations. This mimics nature where a sperm at least has to show that he is a good swimmer and navigator, before he is allowed to help build an entire embryo. It will depend on the nature of the problem whether this idea can be applied.

The last idea is not related to bloat. An interesting observation of Von Neumann (1966) is that the nervous system uses pulses to represent numbers, and not a binary scheme. A monotonic function is probably used to convert the pulses to natural numbers. The binary scheme is very efficient, but less reliable, less robust. A small change in a binary number can lead to a large change in its value. Is this a reason not to use binary numbers in evolutionary programming and other evolutionary algorithms?

Final words

This short paper introduced the idea of breaking up the one large GP program in many small proglets. This scheme has natural boundaries at which crossover can take place and may thus make sexual reproduction more productive. This maintains its advantage of combining partial solutions from various parts of the gene pool, without much damage due to crossover. If the proglets are executed in a random order, the scheme can be called: Random Parallel Genetic Programming (RPGP).

Such a scheme might have the additional benefit that the proglets lend themselves to recycling. A further idea to reduce bloat is proportional crossover, where relatively large programs are punished by more crossover events.

I had hoped to work on applying GP to the assimilation of cloud and rain into numerical weather prediction models. However, the coming time I will work instead on a bit less risky project. Thus, I hope that someone from the GP community is willing to implement the RPGP scheme, as I am curious if, and how good, it will work.

Postscript

In the mean time much more is known about introns and their functions, as well as various tricks used by nature to make evolution more efficient. German speakers are referred to to the popular scientific book of Joachim Bauer (2008), which is great reading for anyone interested in evolution. (Postscript added on April 30, 2011)

References

Banzhaf, W., P. Nordin, R.E. Keller, F.D. Francone. Genetic programming, an introduction, on the automatic evolution of computer programs and its applications. ISBN: 3-920993-58-6, Dpunkt, Heidelberg, Germany, 1998.

Bauer, J. Das kooperative Gen. Evolution als kreativer Prozess. Wilhelm Heyne Verlag München, ISBN 978-3-453-60133-8, 223 p., 2008.

Eiben, Á.E.: Introduction to evolutionary computing, ISBN: 3-540-40184-9, Springer, Berlin, Germany, 2003.

Jong, de, E.D. and Pollack, J.B. Multi-Objective Methods for Tree Size Control. Genetic Programming and Evolvable Machines, vol. 4, no. 3, pp. 211-233, 2003.

Hofstadter, D. and the fluid analogies research group. Die FARGonauten. Ueber Analogie und Kreativitaet. ISBN: 3-608-91758-6, Klett-Cotta, Stuttgard, Germany, 1996, American original: Fluid concepts and creative analogies, Basic books, New York, USA, 1995.

Luke, S., and L. Panait. Fighting Bloat With Nonparametric Parsimony Pressure. In Parallel Problem Solving from Nature - PPSN VII (LNCS 2439). Juan Julian Merelo Guervos et al, eds. Springer Verlag. 411-421, 2002.

Smith, P.W.H. Controling code growth in genetic programming. In Soft computing techniques and applications, R. John and R. Birkenhead (eds.), pp 166-171, ISBN: 3-7908-1257-9, 2000.

Von Neumann, J. Theory of self-reproducing automata. Edited and completed by A.W. Burks. University of Illinois Press, Urbana, 1966.

Von Neumann, J. Collected works, Volume V, Design of computers, theory of automata and numerical analysis, Ed. A.H. Taub, Pergamon press, Oxford, USA, 1961.

Wiener, N. Time and organization. Second Fawley Foundation Lecture, University of Southhampton, pp. 1-16, also in Wiener (1985), 1955.

Wiener, N. Collected Works, Volume IV, Massachusetts Institute of Technology, ISBN: 0-262-23123-9, 1985.

This post was first published on my homepage on 30 December 2004.

No comments: