
2.2 GENETIC ALGORITHMS

Genetic algorithms belong to the family of meta-heuristic search algorithms, and are used to seek solutions in very large search spaces. When a deterministic search is not feasible, but the problem can still be formalized as a search problem, genetic algorithms provide a sophisticated way to quickly find a sufficiently good solution [Mitchell, 1996]. Holland [1975] structured the idea of GAs as a way for computer science to take advantage of the phenomena present in natural evolution.

Evolution is described by Darwin’s theory of natural selection: the individuals that are best fit to the environment they are living in will survive, and the ones who are poorly adjusted to the environment will die.

Over time, this kind of natural selection will ensure that species that have been able to produce properties that enhance their probability of survival will be spared. Thus, such species will be able to produce fitter offspring.

Genetic algorithms, by definition, work with the same concepts that are involved in the evolution of species. The core of each living being is in its DNA. DNA, in turn, can be separated into strings of chromosomes. Each chromosome consists of genes, which specify different properties for the individual. For example, one gene is in charge of the color of eyes, one of the shape of the face, and so on. Each gene usually has several options, e.g., the eyes can be blue or brown and the face can be round or oval; these different variations are called alleles. A gene always has a specific place in the chromosome; this is called the locus. Collectively, the chromosome(s) containing these genes specify the entire individual. A set of individuals at a certain time point is called a population, and generation indicates how many iterations the evolution has performed.

Naturally, for evolution to occur, there must be reproduction to bring new individuals into the population and thus enable natural selection. Reproduction is most commonly achieved by mating between two individuals, which is referred to as crossover. Crossover combines parts of two individuals (parents) and aims at producing offspring that contain genes from both parents. The offspring are then added to the existing population. When the number of individuals exceeds the number a population can hold, natural selection occurs to discard the unfit individuals. Crossover is, however, very slow when an individual needs to adapt to a quickly changing environment. In order to accommodate to a new environment, mutation is required. At the chromosome level, mutation usually changes the value of one or more genes (i.e., one allele is chosen over another, and new possible alleles may be produced).

In order to know which individuals are better fit than others, the individual's fitness is calculated. The fitness is related to the probability that an individual has of participating in crossover and surviving through to the next generation.

A formalization of the GA is given in Algorithm 1. In the following subsections I will go over how to work with a genetic algorithm in more detail by using the knapsack problem as an example. In the knapsack problem, we have a set of items that all have different weights and volumes, and a subset of those items should be chosen so that their combined weight is maximized, while the combined volume cannot exceed the volume of the knapsack.

Algorithm 1 geneticAlgorithm

Input: formalization of solution, initialSolution

population ← createPopulation(initialSolution)
while NOT terminationCondition do
    foreach chromosome in population
        p ← randomProbability
        if p < mutationProbability then
            mutate(chromosome)
        end if
    end for
    foreach chromosome in population
        cp ← randomProbability
        if cp < crossoverProbability then
            addToParents(chromosome)
        end if
    end for
    foreach chromosome in parents
        father ← chromosome
        mother ← selectNextChromosome(parents)
        offspring ← crossover(father, mother)
        addToPopulation(offspring)
        removeFromParents(father, mother)
    end for
    foreach chromosome in population
        calculateFitness(chromosome)
    end for
    selectNextPopulation()
end while
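The loop of Algorithm 1 can be sketched in Python roughly as follows. This is an illustrative skeleton, not a definitive implementation: the problem-specific operators are passed in as functions, and selection here simply keeps the fittest individuals (Section 2.2.5 discusses richer alternatives).

```python
import random

def genetic_algorithm(initial_population, mutate, crossover, fitness,
                      mutation_rate=0.1, crossover_rate=0.7, generations=50):
    """Illustrative sketch of Algorithm 1; operator details are problem-specific."""
    population = list(initial_population)
    size = len(population)
    for _ in range(generations):
        # Mutation: each chromosome mutates with probability mutation_rate.
        population = [mutate(c) if random.random() < mutation_rate else c
                      for c in population]
        # Parent selection: chromosomes join the parent pool
        # with probability crossover_rate.
        parents = [c for c in population if random.random() < crossover_rate]
        # Pair up the parents; their offspring are added to the population.
        for father, mother in zip(parents[::2], parents[1::2]):
            population.extend(crossover(father, mother))
        # Natural selection: keep only the `size` fittest individuals.
        population.sort(key=fitness, reverse=True)
        del population[size:]
    return max(population, key=fitness)
```

As a sanity check, running this skeleton with a bit-flip mutation and a one-point crossover on a toy "maximize the number of 1-bits" problem quickly converges towards an all-ones chromosome.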

2.2.1 Encoding

The very first step when applying genetic algorithms is to encode individuals (the possible solutions) so the algorithm can work with them.

As stated, the GA works with concepts from biology, so the encoding mechanism should be such that the solution can be thought of as a chromosome and be divided into genes. Apart from that, there are no restrictions regarding encoding.

The best encoding mechanism depends on the problem in question. What works well for one problem might not be sufficient or even feasible for encoding another type of problem. The most common encoding mechanism is to use a string or vector of bits, so that one bit corresponds to one gene. A bit at a specific index (locus) usually indicates the inclusion (bit value 1) or exclusion (bit value 0) of the data represented by that gene, i.e., the two variations 0 and 1 are considered alleles, specifying a certain property for a particular gene in one locus. However, this form of encoding is insufficient for more complex problems. For example, if the problem is an ordering problem (e.g., the traveling salesman problem, where the goal is to find the optimal order for visiting the cities), a simple inclusion/exclusion encoding is not enough, as all the bits (cities) need to be included. For the most complex problems, even a single value per gene is not enough; the gene itself may also be a vector. The gene may also contain different kinds of fields for different types of data; in this case the gene may be called a supergene [Amoui et al., 2006].

If we consider the knapsack problem, the simple bit vector encoding is quite sufficient. For n items, the solution can thus be encoded as a vector with n bits, where 0 represents not including the corresponding item in the knapsack, and 1 represents including it. The order of the items can be anything, the most natural one being ascending based on either weight or volume (i.e., the item represented by the gene in locus 1 would have the smallest weight or volume). The only requirement is that each item is fixed to be represented by a specific bit. The information about the weights and volumes can be stored separately. Thus, if in our example we have seven items, and items four and seven are included in the knapsack, the individual representing this solution would be encoded as 0001001.
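As a concrete sketch, this encoding can be written out in Python; the weight and volume values below are the ones used later in the fitness example of Section 2.2.4.

```python
# Bit-vector encoding for the seven-item knapsack example.
# The bit at locus i (list index i-1) encodes the inclusion of item i.
weights = [1, 2, 3, 4, 5, 6, 7]   # item i weighs i units
volumes = [7, 6, 5, 4, 3, 2, 1]   # volumes stored separately, in reverse order

# Items four and seven included in the knapsack:
chromosome = [0, 0, 0, 1, 0, 0, 1]

included = [i + 1 for i, bit in enumerate(chromosome) if bit]
print("".join(map(str, chromosome)))  # -> 0001001
print(included)                       # -> [4, 7]
```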

When the GA begins to search for a solution, an initial population is first needed. An initial population can be constructed roughly in three ways: 1) creating random individuals by administering mutations to a "null" individual, 2) creating the individuals intelligently, or 3) inserting "ready" individuals into the population. The best policy for creating the initial population depends on the problem at hand, and basically anything goes as long as the GA is provided with a starting point to begin the evolutionary process.
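Option 1), creating random individuals by mutating a "null" individual, might look as follows in Python (a sketch; the number of mutations applied per individual is an arbitrary choice):

```python
import random

def random_individual(n_genes, n_mutations):
    """Build a random individual by applying mutations to an
    all-zero 'null' chromosome (option 1 above)."""
    chromosome = [0] * n_genes
    for _ in range(n_mutations):
        locus = random.randrange(n_genes)   # random locus (0-based index)
        chromosome[locus] ^= 1              # flip the bit: a mutation
    return chromosome

def create_population(size, n_genes=7, n_mutations=3):
    return [random_individual(n_genes, n_mutations) for _ in range(size)]

population = create_population(20)
```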

2.2.2 Mutations

After an initial population is created, mutations are applied to each individual in order to create diversity. The actual implementation of mutations is completely dependent on the encoding of the solution. However, a mutation most often changes the value of one gene. If the gene is a combination of values, i.e., a vector itself or a supergene, a mutation can change the gene completely or just one value within the gene. The problem can permit several mutations, i.e., an individual can be modified in many different ways. Whatever the mutations are, the result should always still be a feasible solution. It is also possible to have a separate "correction mutation" that checks the chromosome after a mutation to verify that it is still valid. If the mutation has caused the chromosome to become unnatural, i.e., it no longer belongs to the solution space, corrective actions take place. Such actions do not necessarily just revert the mutation that caused the problem, but might make even bigger changes to the chromosome. The chromosome can even be completely discarded from the population.

The locus where the mutation is applied is usually selected randomly, but in some cases intelligent methods can be used to search for particular genes where mutations should be performed. Mutations are also given a probability, the so-called mutation rate. If the mutation rate is 10%, then during one generation, approximately 10% of the individuals will be subjected to that particular mutation. Thus, not all individuals are mutated during every generation, as mutation probabilities are quite often rather low.

In the knapsack example, only one logical mutation exists: changing a bit from 0 to 1 or vice versa. Considering the solution from the previous subsection, we may now apply the mutation to locus three. The bit vector then changes from 0001001 to 0011001, and item three is included in the knapsack. Figure 3 illustrates the mutation.

Figure 3. Mutation
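The bit-flip mutation illustrated in Figure 3 can be sketched as follows (loci are 1-based here to match the text):

```python
def mutate(chromosome, locus):
    """Flip the bit at the given locus (1-based, as in the text)."""
    mutated = list(chromosome)   # copy, so the original is left untouched
    mutated[locus - 1] ^= 1      # flip 0 -> 1 or 1 -> 0
    return mutated

# The example above: 0001001 mutated at locus three becomes 0011001.
print(mutate([0, 0, 0, 1, 0, 0, 1], 3))  # -> [0, 0, 1, 1, 0, 0, 1]
```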

2.2.3 Crossover

After mutations are administered to the chromosomes, the GA proceeds to perform crossover. Crossover combines parts of two individuals (the parents) in order to produce new solutions (their offspring) for the population. Crossover is also given a probability, i.e., a specified crossover rate, and is usually applied more often than mutations. This probability, i.e., the likelihood of an individual being involved in crossover, can at its simplest be randomly assigned (as in Algorithm 1), but in practice this is rare. It is more common to select parents based on their fitness values or ranks: the better an individual's fitness, the higher its likelihood of being involved in crossover, and thus of passing on favorable properties to its offspring.

The crossover operation can simply combine two "halves" of the parents (one-point crossover) or several blocks of genes (two-point or multi-point crossover). Selecting the crossover point or points can be done randomly, or heuristics can be used to find the best crossover point(s) to preserve the most desired qualities of both parents. If the crossover points are selected randomly, it is common to produce at least two new individuals. If, however, the crossover points are selected in a purposeful way to combine the best parts of the parents, it is more natural to produce only one offspring, as the other(s) would only contain "leftover" genes. The offspring can replace the parents or just be added to the generation. If they replace the parents, it is assumed that the offspring are always better than the parents, as there are no "extra" individuals that could be discarded in natural selection.

In the knapsack example, we can perform a simple single-point crossover where the crossover point cp is selected randomly. Crossover takes the genes (bits) in loci 1 to cp from one individual (mother), and the bits in loci cp+1 to n from another individual (father). These genes are combined to produce a new individual (child1), which now contains bits from both parents. Another child (child2) can be straightforwardly created by taking the genes in loci 1 to cp from the father and the genes in loci cp+1 to n from the mother. An example of performing the crossover for the knapsack case is given in Figure 4. Here the crossover point is in locus three, and two children are produced.

Figure 4. Crossover
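A single-point crossover like the one in Figure 4 can be sketched as below. The parent bit vectors are assumptions inferred from the fitness calculation in Section 2.2.4 (the mother holding items 2, 5 and 7, the father items 3, 4 and 7):

```python
def one_point_crossover(mother, father, cp):
    """Single-point crossover: child1 takes loci 1..cp from the mother and
    loci cp+1..n from the father; child2 takes the complementary halves."""
    child1 = mother[:cp] + father[cp:]
    child2 = father[:cp] + mother[cp:]
    return child1, child2

mother = [0, 1, 0, 0, 1, 0, 1]   # items 2, 5, 7 (assumed, see lead-in)
father = [0, 0, 1, 1, 0, 0, 1]   # items 3, 4, 7 (assumed, see lead-in)
child1, child2 = one_point_crossover(mother, father, cp=3)
print(child1)  # -> [0, 1, 0, 1, 0, 0, 1]  (items 2, 4, 7)
print(child2)  # -> [0, 0, 1, 0, 1, 0, 1]  (items 3, 5, 7)
```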

2.2.4 Fitness

The fitness function determines the quality of a solution. For example, in the case of the travelling salesman problem the fitness would be the length of the route, and it should be minimized. However, in the case of creating a test case for a software system, the fitness would be the coverage of the test case, which should be maximized (and is far more complex to calculate than the length of the route for the salesman). Correctly defining the fitness function is crucial, as it is the only thing guiding the genetic algorithm in its search for as good solutions as possible. The fitness function rarely gives the absolute quality of a solution, i.e., if one solution has a fitness value of 1 and another a fitness value of 10, it should not be assumed that the second solution is ten times better than the first one. Rather, the fitness value should be thought of as a relative indicator of the order of solutions, when examined against the quality attributes.

In the case of our knapsack example, the fitness function is simple: it is the sum of the weights of the included items, subject to the constraint that the volume cannot exceed a set limit. For our example with seven items, we can specify that item one has a weight of one unit, item two weighs two units, and so on, while their volumes are reversed, i.e., item one has a volume of seven units, item two a volume of six units, and so on. We assume that the knapsack has a maximum volume of 10. The fitnesses of the individuals in Figure 4 can be determined as follows: the fitness for the mother is 2+5+7 = 14, for the father 3+4+7 = 14, for the first child 2+4+7 = 13, and for the second child 3+5+7 = 15. Thus, the best solution would appear to be the second child, with a fitness of 15. However, we need to consider the volume restriction. The volumes are 10 for the mother, 10 for the father, 11 for the first child and 9 for the second child. As the maximum volume is 10, the first child is discarded, and the second child remains the fittest individual.
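This fitness calculation can be sketched in Python. Returning a fitness of 0 for over-volume solutions is one simple way to mark them for discarding (an assumption of this sketch; the text simply discards the invalid child), and the parent and child bit vectors are likewise inferred from the weights listed above:

```python
def fitness(chromosome, weights, volumes, max_volume):
    """Sum of the included items' weights; solutions exceeding the
    knapsack volume get fitness 0 (marking them as invalid)."""
    volume = sum(v for v, bit in zip(volumes, chromosome) if bit)
    if volume > max_volume:
        return 0
    return sum(w for w, bit in zip(weights, chromosome) if bit)

weights = [1, 2, 3, 4, 5, 6, 7]   # item i weighs i units
volumes = [7, 6, 5, 4, 3, 2, 1]   # item i has volume 8 - i units

mother = [0, 1, 0, 0, 1, 0, 1]   # items 2, 5, 7: weight 14, volume 10
father = [0, 0, 1, 1, 0, 0, 1]   # items 3, 4, 7: weight 14, volume 10
child1 = [0, 1, 0, 1, 0, 0, 1]   # items 2, 4, 7: weight 13, volume 11
child2 = [0, 0, 1, 0, 1, 0, 1]   # items 3, 5, 7: weight 15, volume 9

for c in (mother, father, child1, child2):
    print(fitness(c, weights, volumes, max_volume=10))
# prints 14, 14, 0 (over the volume limit), 15
```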

2.2.5 Selection

In accordance with the biological analogy, natural selection is applied in the GA to discard the weakest individuals (with respect to the used fitness function) from the population. As crossover produces "extra" solutions, the size of the population at the end of one generation is always bigger than the specified population size. Selection can thus simply discard as many individuals as have been added through crossover, so that the population size is always the same at the start of each generation.

The selection operator should be defined so that the fittest survive (as in natural selection), but also so that there still remains variation in the population, so that the evolution does not get stuck in a local optimum.

The simplest way of defining a selection operator is to use a purely elitist selection. This selects only the very best individuals in terms of fitness.

Elitist selection is easy to understand and simple to implement; one can simply discard the weakest individuals in the population. However, elitist selection on its own is not an ideal selection operator, as elitist algorithms tend to converge towards local optima too quickly.
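Purely elitist selection is indeed simple to implement: sort by fitness and truncate. A minimal sketch:

```python
def elitist_selection(population, fitness, population_size):
    """Keep only the `population_size` fittest individuals."""
    return sorted(population, key=fitness, reverse=True)[:population_size]

# With the fitness being the value itself, the three fittest of five survive:
print(elitist_selection([3, 1, 4, 1, 5], fitness=lambda x: x,
                        population_size=3))  # -> [5, 4, 3]
```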

Another way of defining the selection operator is to use fitness-proportionate selection. In this case, the probability of being selected for the next generation depends on fitness values. However, differences in actual quality are rarely directly comparable to absolute fitness values. In order to take this into account, the fitness-proportionate selection can be rank-based. When the selection is based on rank, the individuals are ordered according to their fitness values. The probability of being selected for the next generation thus increases according to the rank – not according to absolute fitness values. The basic fitness-proportionate selection as well as its rank-based variation can be implemented, for example, with "roulette-wheel" sampling [Mitchell, 1996; Michalewicz, 1992; Reeves, 1995]. Here, each individual is given a slice of the "wheel" that is in proportion to the "area" that its fitness has in the overall fitness of the population. In simple fitness-proportionate selection, the size of the slice is calculated directly based on the absolute fitness values, while in rank-based selection the size is proportionate to the rank, and thus the sizes increase in a linear fashion with the order of the individuals.

Either way, the individuals with higher fitnesses have a larger area of the wheel, and thus a higher probability of getting selected. The wheel is then spun as many times as there are individuals needed for the population. An alternative to the roulette-wheel method is to use, e.g., the tournament technique to select the next generation [Miller and Goldberg, 1995].
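A roulette-wheel sampler and its rank-based variant can be sketched as follows; the higher the fitness (or rank), the larger the slice of the wheel and thus the selection probability:

```python
import random

def roulette_wheel_select(population, fitnesses):
    """Spin the wheel once: each individual's slice of the wheel is
    proportional to its share of the population's total fitness."""
    spin = random.uniform(0, sum(fitnesses))
    cumulative = 0.0
    for individual, f in zip(population, fitnesses):
        cumulative += f
        if spin <= cumulative:
            return individual
    return population[-1]   # guard against floating-point rounding

def rank_based_select(population, fitnesses):
    """Rank-based variant: slice sizes grow linearly with rank (1, 2, ...),
    not with the absolute fitness values."""
    ranked = sorted(zip(population, fitnesses), key=lambda pair: pair[1])
    return roulette_wheel_select([ind for ind, _ in ranked],
                                 list(range(1, len(ranked) + 1)))
```

Spinning the wheel once per slot in the next population then implements the selection; Python's random.choices offers the same weighted sampling directly.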

A common selection operator is a combination of the two methods presented above: the survival of the very fittest is guaranteed by choosing a few of the best individuals with elitist methods, while the rest of the population is selected with a probabilistic method in order to ensure variety within the population.

There are different approaches to using the selection operator. Mitchell [1996] and Reeves [1995] consider that the selection operator selects the individuals that are most likely to reproduce, i.e., become parents. Michalewicz [1992] uses the selection operator to find the fittest individuals for the next generation. Both approaches keep the same selection probabilities for all individuals during the entire selection process, i.e., an individual with a high fitness value may be selected to the next population more than once. Thus, there are no specific rules for how the selection operator should be defined, as long as the GA generally follows the guidelines of natural evolution.

2.2.6 Parameters

Correctly defining the different operations (mutations, crossover and the fitness function) is vital in order to achieve satisfactory results. However, as seen in Algorithm 1, there are also many other parameters in the GA that need to be defined and that greatly affect the outcome. These parameters are the population size, the number of generations (often used as the termination condition) and the mutation and crossover probabilities.

Having a large enough population ensures variance within a generation, and enables a wide selection of different solutions at every stage of evolution. However, at a certain point the results start to converge, and a larger population always means more fitness evaluations and thus requires more computation time. Similarly, the more generations the algorithm is allowed to evolve for, the higher the chances are that it will be able to reach better results. However, again, letting an algorithm run for, say, 10 000 generations will most probably not be beneficial: if the operations and parameters have been chosen correctly, a reasonably good solution should have been found much earlier.

Mutation and crossover probabilities both affect how fast the population evolves. If the probabilities are too high, there is the risk that the implementation of genetic operations becomes random instead of guided.

Conversely, if the probabilities are too low, there is the risk that the population will evolve too slowly, and no real diversity will exist.

Thus, all parameters should be carefully considered. There is no single method for finding the optimal or “correct” parameters, but the most common way is to simply perform trial-and-error iterations until results are satisfactory.