TY - JOUR

T1 - Mining parasite data using genetic programming

AU - Barrett, John

AU - Raga, J. A.

AU - Kostadinova, A.

N1 - Barrett, J., Kostadinova, A., Raga, J.A. (2005). Mining parasite data using genetic programming. Trends in Parasitology, 21, (5), 207-209

PY - 2005

Y1 - 2005

N2 - Genetic programming is a technique that can be used to tackle the hugely demanding data-processing problems encountered in the natural sciences. Application of genetic programming to a problem using parasites as biological tags demonstrates its potential for developing explanatory models using data that are both complex and noisy. In many areas of biology, the ability to collect data outstrips the ability to analyse it. Techniques are needed to mine large datasets and extract biologically meaningful relationships. Genetic programming (GP) is a stochastic optimization approach that helps to discover comprehensible rules for data mining. It is one of a group of supervised, evolutionary programming techniques that uses darwinian concepts to generate and optimize predictive mathematical models. This is done by mimicking ‘natural selection’ using ‘populations’ of mathematical models. Initially, a population of n models (short computer programmes) is generated, each model representing a different, random combination of variables, constants and mathematical functions. The fitness of each model is determined (in terms of how well it solves the problem). The ‘best’ models are then selected for ‘breeding’ to produce the next generation of ‘fitter’ models, and so on until a model is evolved that solves the problem with the required degree of accuracy or until a specified stopping criterion is reached. During breeding, different parts of the models are recombined, and the mathematical functions and variables can be changed: the equivalent of crossover and mutation. Because GP is a randomized algorithm, it is not deterministic, and each new run with a dataset evolves an independent model. Therefore, several alternative solutions to a problem can be evolved. For complex problems for which there is no single answer, each run can result in a different best model, and a validation process must then be devised to select the most appropriate one.

AB - Genetic programming is a technique that can be used to tackle the hugely demanding data-processing problems encountered in the natural sciences. Application of genetic programming to a problem using parasites as biological tags demonstrates its potential for developing explanatory models using data that are both complex and noisy. In many areas of biology, the ability to collect data outstrips the ability to analyse it. Techniques are needed to mine large datasets and extract biologically meaningful relationships. Genetic programming (GP) is a stochastic optimization approach that helps to discover comprehensible rules for data mining. It is one of a group of supervised, evolutionary programming techniques that uses darwinian concepts to generate and optimize predictive mathematical models. This is done by mimicking ‘natural selection’ using ‘populations’ of mathematical models. Initially, a population of n models (short computer programmes) is generated, each model representing a different, random combination of variables, constants and mathematical functions. The fitness of each model is determined (in terms of how well it solves the problem). The ‘best’ models are then selected for ‘breeding’ to produce the next generation of ‘fitter’ models, and so on until a model is evolved that solves the problem with the required degree of accuracy or until a specified stopping criterion is reached. During breeding, different parts of the models are recombined, and the mathematical functions and variables can be changed: the equivalent of crossover and mutation. Because GP is a randomized algorithm, it is not deterministic, and each new run with a dataset evolves an independent model. Therefore, several alternative solutions to a problem can be evolved. For complex problems for which there is no single answer, each run can result in a different best model, and a validation process must then be devised to select the most appropriate one.

U2 - 10.1016/j.pt.2005.03.007

DO - 10.1016/j.pt.2005.03.007

M3 - Article

C2 - 15837607

SN - 1469-8161

SP - 207

EP - 209

JO - Parasitology

JF - Parasitology

ER -