Multi-Layer Perceptrons (MLPs) are a well-known and widely used supervised machine learning technique, especially in classification contexts. As with other supervised learning techniques, MLPs first have to "learn" how to classify the data. This is done by adjusting the weights that connect the different neurons in order to obtain as good a result as possible.
Usually training MLPs is done as follows: The weights are initialized to random values and then some sort of iterative gradient descent algorithm is used to minimize the misclassification-error on a given training dataset. Backpropagation (BP) and its derivatives are a very popular choice of algorithm here.
However, there are some drawbacks to using BP algorithms: they might get stuck in local minima, and if they DO converge to an optimum, they might do so very slowly.
So why not try a different approach? Like ANNs, Evolutionary Algorithms are algorithms based on biology. More specifically, evolution. Their general working principle is to generate a whole lot of (random) solutions, combine the better ones to produce new solutions, and to maybe slightly modify existing solutions by some random factor. That's it, in a nutshell.
EAs can be used as heuristics to find solutions for problems where other methods fail. And they're fairly good at solving multidimensional problems. Sounds like a good alternative to Backpropagation? Well let's find out!
The applet on this side implements an MLP with one hidden layer, and several different learning algorithms. They can be tested on various datasets, all taken from the UCI Machine Learning Repository. I think there's a nice selection ranging from very easy classification tasks to very1 hard ones (wine and breast-cancer are very easy, for example).
There are 4 different learning algorithms implemented in the applet:
- the classic, good old B.P
- Evolutionary Strategies (ES)
- ES are mainly based on mutation. They very seldom (if at all) combine solutions to find new ones. The basic idea is that µ parents generate a total of λ offspring. Depending on the type of the ES, the next generation is then computed by selecting µ new "parents" from all the offspring (usually this is done by taking the best µ solutions) or by taking into account the current parents as well (and not only the offspring). The former approach is known as a (µ, λ) ES and is implemented in the "EvolutionaryStrategy", and the latter is known as (µ + λ) ES and is implemented in "EvolutionaryStrategy2". When generating an offspring, the parent is simply changed (with a certain probability) by a certain (normally distributed) factor.
- Differential Evolution (DE)
- DE is a swarm intelligence algorithm that does away with all the biological analogies. Here, a new individual is generated by taking an existing individual (let's call it x) and generating 3 new random solutions (a, b and c). One of the dimensions of x (say xi) is picked at random and is replaced by a combination of the values of a, b and c: xi = i + F(bi - ci). All the other dimensions of x might also be changed, with a given probability R. It can be argued that DE is a fairly simple EA algorithm, but if you check out the further reading/related work section you'll find a paper that makes a point for preferring DE to EAs that are arguably more complex.
- GeneticAlgorithm (GA)
- Of course, no treatment of EAs would be complete without GAs. The one implemented is a very simple GA without elitism, using stochastic wheel selection and one point crossover. During implementation, rank based selection and a uniform crossover were also tried, but they both did not improve the performance of the GA.
What can be learned by playing with the applet?
Well, there are actually two things to notice: Backpropagation beats all the EAs, almost all the time. Both in accuracy and in speed. This is actually quite disappointing. But then again, BP was designed specifically for ANNs, while the EAs treat the ANN as "black box". All they see are multidimensional vectors (representing weights) and the associated fitness (representing the accuracy). This leads to the natural conclusion that there is a reason why BP (or actually variants thereof) are a much more popular choice for training MLPs.
It is worth noting, though, that other people have tried optimizing MLPs using EAs. Check out the further reading/related work section for one such example, where the researchers have determined that EAs can indeed outperform Backpropagation! However, their GA was tuned to the problem of optimizing MLPs, while mine is a run-of-the-mill GA, so that might explain why my implementation fails to show how EAs can beat BP.
There are of course also approaches trying to combine the strengths of EAs and BPs, or to adapt EAs to work better with ANN. Again, check out the further reading section for more details.
Apart from the performance in comparison with BP, it is interesting to see how the different algorithms behave, and I think my applet demonstrates this in quite some detail. Playing around with the parameters gives a pretty nice feeling for how the algorithms behave, how they converge, or how different parameter settings can completely change the behaviour of the algorithms!
The applet is quite self-explanatory (or at least that's what I hope, else just drop me an email). After scaling the data down to an [0, 1] interval, it is split into training- and testset by a 70:30 ratio.
The applet produces nice plots of accuracy vs iteration. The graphs are plotted using the testset, while the learning algorithm only ever sees the training set. This explains why even when using an elitist EA (one that never discards the best solution found so far) the graph is not always monotonically decreasing.
Related work & further reading
- Leonardo Noriega has a terse, but quite useful Multilayer Perceptron Tutorial that I used as a pointer when implementing the MLP and the BP algorithm.
- The PhD thesis of M.E.H. Pedersen, Tuning & Simplifying Heuristical Optimization, points out why Differential Evolution is a pretty useful algorithm even though it's simpler than other Swarm-based optimization heuristics. It also contains useful advice on how to choose the parameters for DE.
- Pedro A. Castillo has tried to combine GAs and BP for training MLPs and has come up with G-Prop: Global optimization of multilayer perceptrons using GAs. He later evaluated a number of these hybrid systems in his paper Comparing Evolutionary Hybrid Systems For Design And Optimization Of Multilayer Perceptron Structure Along Training Parameters and concludes that they yield better results than using only BP or GAs.
- Randall S. Sextona and Jatinder N. D. Gupta have compared the performance of BP and GAs in their paper Comparative evaluation of genetic algorithm and backpropagation for training neural networks. Interestingly enough their results differ greatly from mine, as they find that GAs outperform BPs (at least on the problems they tried).