
Downloads
libevocosm-4.0.2.tar.gz
libevocosm-4.0.2.zip

Required Packages
Brahe

Documents
API Documentation
Open Source License (Simplified BSD)
Closed Source / Commercial License

Evocosm

A C++ Framework for Evolutionary Computing

 

I'm fascinated by machine learning through the evolutionary discovery of algorithms; I've also discovered some interesting ways to use genetic algorithms for testing software. As such, I write a number of different kinds of evolutionary algorithms, all of which follow certain patterns. And when I see patterns, I see the need for a framework — thus was born Evocosm, a collection of tools for creating a wide variety of evolutionary algorithms.

Evocosm is a set of classes for abstracting the fundamental components of an evolutionary algorithm. You'll find a general discussion below, and complete API documentation via the links at left. The distribution includes two complete evolutionary algorithms: a generic function optimizer and an Iterated Prisoner's Dilemma.

My goal is to provide portable code for experimentation and demonstration. I've no interest in language wars. Trying to be everything to everyone would keep me from ever finishing even a single project. For now, Evocosm is a C++ project.

Component Overview

Evolutionary algorithms come in a variety of shapes and flavors, but at their core, they all share certain characteristics: populations that reproduce and mutate through a series of generations, producing future generations based on some measure of fitness. An amazing variety of algorithms can be built on that general framework, which leads me to construct a set of core classes as the basis for future applications. 

The classes include:

  • Floating-Point Chromosomes
    Evocosm supports the crossover and mutation of IEEE-754 floating-point numbers, using an algorithm I invented in the mid-1990s. This topic is covered in detail below.
  • Roulette Wheels
    The roulette_wheel class implements the concept of a "software roulette wheel" for Evocosm. This is a tool for natural selection, wherein the fitness of an organism determines the width of its "slot" on an imaginary roulette wheel.
  • Organisms
    Think of an "organism" as an answer to a problem posed by a fitness landscape; "genes" define its behavior and an associated fitness value is assigned by an evocosm during testing. Evocosm provides the freedom to define organisms as almost anything: bit strings, floating-point numbers, finite state machines, LISP programs, or external robots controlled via radio waves. In A Complexity of Options, I used an Evocosm-derived GA to determine the gcc options that produce the faster code.
  • Fitness Landscapes
    A "fitness landscape" defines the environment where organisms "live" or a problem that they are tested against.  The landscape is intimately tied to the nature of the organism; think of an organism as a potential solution to a problem implemented by the landscape. A floating-point organism, for example, could be tested by a fitness landscape that represents a function to be maximized. Or, an organism describing the shape of wing could be tested by a landscape that simulates a wind tunnel.
  • Evocosms
    The evocosm class binds a population of organisms to a set of objects that define the rules of survival and reproduction. An evocosm will have one or more populations, which will evolve against population-unique and shared (common) fitness landscapes; breeding is controlled by a set of class objects from the following classes.
  • Fitness Scaling
    As a population converges on an "answer", the difference between fitness values often becomes very small; this prevents the best solutions from having a significant advantage in reproduction. Fitness scaling solves this problem by adjusting the fitness values to the advantage of the most-fit chromosomes. Evocosm includes a variety of fitness scaling algorithms.
  • Selecting Survivors
    A selector decides which organisms survive from one generation to the next. Some evolutionary algorithms will not use a selector; others will. In general, it is effective to keep the "best" organisms from one generation to the next, so that good genes are not lost at random. This is, of course, an improvement on nature, where being "the best" doesn't guarantee survival.
  • Reproduction
    In most cases, a reproducer generates new organisms using parents selected (by fitness) from an existing population. In some singular (and probably rare) cases, a reproducer might generate new, random organisms in order to keep diversity high. Reproduction techniques can include crossover, along with asexual, sexual, and (my favorite) try-sexual models.
  • Mutation Operators
    A mutator applies mutations (random, usually small changes) to a set of organisms. Mutation is highly dependent on the type of organism. In traditional genetic algorithms, a mutation flips one or more bits in an integer (i.e., a chromosome). Evolving a path for the Traveling Salesman Problem involves complex mutations that maintain valid permutations of destination points; in the case of floating-point numbers, I've provided utilities for mutating and crossing IEC-60559 (IEEE-754) float and double types.
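The roulette-wheel idea described above can be sketched in a few lines. This is an illustrative, self-contained version of fitness-proportionate selection, not the interface of Evocosm's actual roulette_wheel class: a random "spin" lands in an organism's slot with probability proportional to its fitness.

```cpp
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Fitness-proportionate ("roulette wheel") selection: each organism's
// slot width equals its fitness, and a random spin across the total
// fitness picks a slot. Illustrative sketch only; Evocosm's
// roulette_wheel class has its own interface.
std::size_t spin_wheel(const std::vector<double> & fitness, std::mt19937 & rng)
{
    double total = std::accumulate(fitness.begin(), fitness.end(), 0.0);
    std::uniform_real_distribution<double> dist(0.0, total);
    double spin = dist(rng);

    for (std::size_t i = 0; i < fitness.size(); ++i)
    {
        spin -= fitness[i];

        if (spin <= 0.0)
            return i;
    }

    return fitness.size() - 1; // guard against floating-point rounding
}
```

An organism with twice the fitness of another is chosen roughly twice as often, which is exactly the selection pressure a generational GA needs.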

Building the Library

The .tar.gz archives contain source code, makefiles, and doxygen configurations. The makefile defines several targets:

  • make clean should be rather obvious
  • make will build static and shared libraries
  • make install copies Evocosm libraries and header files to system directories; you'll want to be running as root (sudo) for the installation.

If this all sounds very "Unix" — you're correct. This code was developed under Linux using GCC and a variety of free software tools.

Floating-Point Evolution

The majority of genetic algorithms work on pure bit strings, converting those strings to the desired types for fitness testing. In Lawrence Davis' book Handbook of Genetic Algorithms, he transforms a 44-bit string into two floating point values via a series of operations. I've seen similar techniques elsewhere, and I find them a bit cumbersome.

In theory, a GA should have no knowledge of the format of the data it is modifying; however, natural chromosomes do encode some structure in their sequence. For example, DNA encodes "stop" sequences between definitions of amino acids. Crossover appears to take place at specific positions along natural chromosomes; and while mutation doesn't care about a chromosome's structure, it does change that structure.

In the context of a computer program, the structure of a chromosome isn't as important as the ability to logically modify its bits through crossover and mutation. I decided to build tools for the mutation and crossover of encoded floating-point values of types float and double. Floating-point numbers contain scaled values that may have a fractional part. On most modern computer systems — including the popular Intel and AMD processors — the C/C++/Java float and double types implement the single-precision and double-precision floating-point formats defined by the Institute of Electrical and Electronics Engineers (IEEE) standard 754-1985. A float is a 32-bit value, and a double is a 64-bit value. The bits in a floating-point value are divided into three components: a sign bit, an exponent, and a mantissa. The following table shows the internal format of the float and double types.

 

31

30       23

22                     0

 

float

 

 

 

 

 

sign

exponent

mantissa

 

 

 

 

 

 

 

63

62       52

51                      

                                           0

double

 

 

 

 

 

sign

exponent

mantissa

 

 

The highest-order bit in a floating-point value is the sign bit. If the sign bit is one, the value is negative; if the sign bit is zero, the value is positive. The exponent of a float occupies 8 bits and the mantissa uses the remaining 23 bits; a double has a 52-bit mantissa and an 11-bit exponent. In addition, the mantissa has an implicit high order bit of 1.

The mantissa holds a binary fraction greater than or equal to 1 (because of the implied high bit being one) and less than 2. The number of bits in the mantissa affects the accuracy of the floating-point value. A float has 6 decimal digits of accuracy, and a double (with its longer mantissa) is accurate to 15 decimal digits. Because the mantissa is a binary fraction, it can't always exactly reflect a decimal value you've tried to store in it. For example, there is no binary fraction that can exactly represent the values 0.6 or 1/3. Floating-point numbers represent an approximation of a decimal value; this is where rounding errors come from.
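A tiny demonstration of that rounding makes the point concrete. The helper below (an illustrative name, not part of Evocosm) shows that the float nearest to 0.6 is not the same number as the double nearest to 0.6, because each type rounds the unrepresentable binary fraction differently:

```cpp
// 0.6 has no exact binary-fraction representation. The nearest float
// (0.60000002384185791...) differs from the nearest double, so widening
// the float does not recover the double value of 0.6.
bool float_roundoff_demo()
{
    float f = 0.6f;                       // stored as an approximation
    return static_cast<double>(f) != 0.6; // true: the approximations differ
}
```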

The exponent is a binary number representing the number of binary digits the mantissa is shifted left (for a positive actual exponent) or right (for a negative actual exponent). The exponent is a biased value; you calculate the actual exponent by subtracting a bias value from the exponent stored in the value. The bias for a float is 127; the bias for a double is 1023. Thus, a float value with a stored exponent of 150 would represent a number with an actual exponent of 23. The constants FLT_MIN, FLT_MAX, DBL_MIN, and DBL_MAX define the minimum and maximum values for floating-point numbers, in the Standard C header file float.h. Two other relevant float.h constants are FLT_EPSILON and DBL_EPSILON, which define the difference between 1.0 and the smallest representable float or double value greater than 1.0.
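The layout above is easy to verify in code. This sketch decomposes a float into its three fields, assuming the IEEE-754 single-precision format just described (the struct and function names are illustrative, not Evocosm API):

```cpp
#include <cstdint>
#include <cstring>

// The three fields of an IEEE-754 single-precision value.
struct float_parts
{
    std::uint32_t sign;     // bit 31
    std::uint32_t exponent; // bits 30-23, biased by 127
    std::uint32_t mantissa; // bits 22-0, implicit leading 1
};

float_parts decompose(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits)); // well-defined type pun

    float_parts p;
    p.sign     = (bits >> 31) & 0x1u;
    p.exponent = (bits >> 23) & 0xFFu;
    p.mantissa = bits & 0x7FFFFFu;
    return p;
}
```

For example, 1.0f decomposes to sign 0, stored exponent 127 (actual exponent 0 after removing the bias), and mantissa 0 (the implicit leading 1 supplies the value).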

In Standard C++, the numeric_limits template (defined in the limits header) is specialized to describe the characteristics of each numeric data type.

Floating-point numbers can take on some unusual values. It's possible for a value to represent positive or negative infinity, for example. Or, a floating-point value may be in a special format that doesn't represent a valid number. Any routines that generate or change floating-point numbers must avoid producing these unusual values.

A floating-point value represents infinity when the bits in the exponent are all one and the bits in the mantissa are all zero. When both the mantissa and exponent are zero, the floating-point number is zero. Infinity, as well as zero, can have a sign. Positive and negative zero operate identically in calculations and comparisons.

When is a number not a number? When its exponent is all ones and its mantissa contains any set of bits that is not all zeros (which would indicate an infinity). A value in this format is known as a NaN (Not a Number). The sign bit for a NaN is irrelevant.

Unusual floating-point values often have undesirable consequences in calculations, so I wanted to avoid the creation of unusual numbers through floating-point reproduction. Looking at the above descriptions, we can see an obvious commonality between the troublesome NaNs and infinities: both types have exponents filled with ones. This characteristic provides a way of identifying "weird" floating-point numbers when performing mutations and crossover.

Mutation and crossover randomly change bits in the three components of a floating-point number: the sign bit, exponent, and mantissa. Changing the exponent or sign has the most dramatic effect on a floating-point value, since the change of one bit can dramatically alter the magnitude of a number. Assuming that all bits have an equal chance of mutation, we get the following probabilities that a random bit change will affect a specific component:

          sign bit   exponent   mantissa
  float     3.1%      25.0%      71.9%
  double    1.6%      17.1%      81.3%


Depending on the application, I've found that those intrinsic percentages don't always allow effective mutations. The exponent, in particular, is so likely to be changed that numbers fluctuate wildly under mutation. To solve that problem, I created a system for the roulette-wheel selection of the component to be mutated, allowing me to weight mutation in favor of changing the mantissa. The relative chance of mutating each component is configurable. My experiments suggest limiting the mutability of the exponent to under 15%, keeping the sign bit mutation rate at about two or three percent.
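The weighted-component idea can be sketched as a single-bit mutator. The weights below (3% sign, 12% exponent, 85% mantissa) follow the suggestion above but are my own illustrative choice, and the function is a sketch, not Evocosm's actual mutator. Note that a production version would also reject results with an all-ones exponent, as discussed earlier:

```cpp
#include <cstdint>
#include <cstring>
#include <random>

// Flip one bit of a float, choosing which component to change by
// weighted ("roulette") selection rather than uniformly over all bits.
float mutate(float value, std::mt19937 & rng)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));

    // relative weights: sign 3%, exponent 12%, mantissa 85%
    std::discrete_distribution<int> component({3.0, 12.0, 85.0});
    int bit = 0;

    switch (component(rng))
    {
        case 0: // sign: bit 31
            bit = 31;
            break;
        case 1: // exponent: bits 30-23
            bit = 23 + std::uniform_int_distribution<int>(0, 7)(rng);
            break;
        default: // mantissa: bits 22-0
            bit = std::uniform_int_distribution<int>(0, 22)(rng);
            break;
    }

    bits ^= (UINT32_C(1) << bit);

    float result;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}
```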

Example: Function Optimizer

Computer scientists often encounter problems that are not amenable to numerical methods of solution. A common problem is that of finding the input values that produce a minimum or maximum output from a function. It's quite easy to optimize a function that maps a single high or low value; things become quite a bit more difficult when a function generates several such values. Consider this two-dimensional function:

Mapped onto an x-y plane from (-1,-1) to (1,1), the function in question produces this image:

This fitness landscape is somewhat challenging, with several peaks, including two very close in height. Yes, you might be able to generate a much more detailed picture and see where the highest peak is — but not all equations can be visualized so easily. Rather than rely on the human eye or a "good" guess, we can use a computer program to find the actual answer.

How should a program go about finding the maximum for f(x,y)? The graph shows several local maxima in the upper quadrant, including two tall spikes in close proximity. Traditional approaches to finding function maxima or minima — a process known as optimization — use a variety of techniques that rely on the ability to climb "upward" to a solution. In optimizing f(x,y), most optimization techniques (hill climbing, for example) become "trapped" on the smaller "hills" unless they begin looking for maxima in just the right place.

A genetic algorithm begins with a set of randomly selected points from which it selects the best performers through fitness testing. Crossover combines the best attributes from the most successful members of the population, and random mutation introduces new characteristics that may produce better solutions. Subsequent generations build upon the success of their "ancestors", evolving toward a solution that best "fits" the given niche (in this case, x and y values that produce the highest peak).

If you're looking for six to eight digits of precision, the peak is usually found in a few hundred generations. Being a stochastic process, the genetic algorithm doesn't always produce identical performance, even when applied to the same problem. Play a bit with the configuration, selecting a variety of parameters to gain a feel for how they affect performance. The best way to learn about genetic algorithms is to write them.
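The generational cycle just described — random initialization, fitness testing, selection, crossover, mutation — fits in a short program. This is a toy one-dimensional optimizer (maximizing f(x) = x·sin(x) on [0, 10], a function I chose for illustration), not the Evocosm function optimizer itself:

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// A minimal generational GA maximizing f(x) = x * sin(x) on [0, 10].
// Truncation selection keeps the top half as parents; children come
// from blend crossover plus a small Gaussian mutation.
double optimize(std::mt19937 & rng)
{
    const std::size_t pop_size = 50;
    const int generations = 200;

    auto fitness = [](double x) { return x * std::sin(x); };

    std::uniform_real_distribution<double> init(0.0, 10.0);
    std::vector<double> pop(pop_size);

    for (double & x : pop)
        x = init(rng);

    std::normal_distribution<double> mutation(0.0, 0.1);

    for (int g = 0; g < generations; ++g)
    {
        // sort best-first so the top half becomes the parent pool
        std::sort(pop.begin(), pop.end(),
                  [&](double a, double b) { return fitness(a) > fitness(b); });

        std::vector<double> next(pop.begin(), pop.begin() + pop_size / 2);
        std::uniform_int_distribution<std::size_t> pick(0, pop_size / 2 - 1);

        while (next.size() < pop_size)
        {
            // blend crossover of two parents, then mutate the child
            double child = 0.5 * (pop[pick(rng)] + pop[pick(rng)]);
            child += mutation(rng);
            next.push_back(std::clamp(child, 0.0, 10.0));
        }

        pop = next;
    }

    return *std::max_element(pop.begin(), pop.end(),
        [&](double a, double b) { return fitness(a) < fitness(b); });
}
```

Even this crude version reliably escapes the local peak near x = 2 and converges on the global maximum near x = 7.98, for the same reason described above: crossover and mutation keep exploring while selection concentrates the population on the best region found so far.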

The function optimizer is a basic genetic algorithm, implemented in a few simple classes; it demonstrates how Evocosm works on a fundamental level. Other, more advanced features of Evocosm — particularly multiple populations with migration between them, and state machines — fall into the realm of complex applications that will be my focus in other articles.

Example: Iterated Prisoner's Dilemma

The "Prisoner's Dilemma" is well-documented at Wikipedia.

My implementation of the Iterated Prisoner's Dilemma (IPD) implements strategies with a simple state machine (a template class from Evocosm). It demonstrates almost all the features of Evocosm, and is more complex than the function optimizer example.

The algorithm tends to find interesting strategies after short runs of a few hundred generations. The current program simply reports to the console, giving a CSV list of population statistics for each generation followed by a description of the best IPD machine found. If time permits, and people show some interest, I may release a visual version of the program I've been experimenting with, wherein multiple populations compete, producing some rather novel strategies.

As always, I'm interested in your comments.


Scott Robert Ladd
Software Engineer

© 2013 Scott Robert Ladd
All rights reserved.

The grey-and-purple dragon logo, the blue coyote logo, Coyote Gulch Productions, Itzam, SigScope, Evocosm, and Acovea are all Trademarks of Scott Robert Ladd.