CNRS international magazine

Information Technology

PhyML, a tool for biologists

To reconstruct the relationships and evolutionary links between different living species, researchers can compare their genomes, stored in enormous databases containing millions of DNA sequences. PhyML can help them do it in an instant.

Imagine that you are a biologist and want to know the degree of separation between the different species of yeast (or worms, insects or birds) that you are studying, as well as their evolutionary history. To avoid errors due to superficial morphological resemblances, the most reliable method is molecular phylogeny—a comparison of the genomes of the species. To do this, you have access to databases containing millions of DNA sequences, covering tens of thousands of living species, but without the appropriate tools for comparing them, they will remain mysterious sequences of A's, T's, C's, and G's.

© E. Douzery, P.-H. Fabre (ISEM/CNRS), and F. Chevenet (IRD)

Using PhyML, researchers have been able to reconstruct the phylogeny of primates, based on sequences containing more than 900,000 base pairs. Contemporary species are at branch extremities, nodes correspond to ancestral species, and branch lengths represent evolutionary times.

This is where Olivier Gascuel and Stéphane Guindon come in. Researchers at the LIRMM in Montpellier, in 2003 they developed a powerful algorithm that estimates the evolutionary relationships linking a set of organisms by comparing their DNA or protein sequences (1). The paper presenting this method, “A Simple, Fast and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood,” was selected by the International Citation Index as a “fast breaking paper” in February 2005 (2). The algorithm dramatically reduces computing time and can be used with much larger and more complex data sets. “There are studies you couldn't imagine doing before, which now take just a few minutes,” says Gascuel. “Prior to this, researchers did not dare tackle problems involving 100, 200, or 500 species, or sequences with several tens of thousands of letters. Our software makes it possible to process this type of data.”

Using this algorithm as a basis, the two researchers have developed a program called PhyML, distributed as open-access software. Researchers from all over the world benefit from its user-friendliness: they simply enter their data and wait for the eight processors in the LIRMM computer to process it (3). They receive the results by e-mail several minutes or hours later. “The server is running at full power: at any one time, there are always at least three or four users. Every time I go to a conference, I hear people talking about our program,” says Gascuel.

The algorithm is based on the maximum likelihood principle, a central concept in statistics. Here it involves defining a hypothetical model that describes both the Darwinian tree representing the genealogy of the species and the mutations that have taken place on this tree over the course of evolution. The algorithm then calculates the probability of the data given the assumed model. The calculation is repeated for candidate models until the probability reaches its maximum, and the model that achieves it is the algorithm's answer. Carried out exhaustively, this trial and error required colossal processing time. PhyML is a very satisfactory approximation of the statistical principle: it simplifies the search by concentrating on the most likely models. It returns a phylogenetic tree in which the leaves (the extremities of the branches) correspond to the contemporary species, each node represents a common ancestor, and the length of each branch represents the period of time over which the species has evolved.
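To make the maximum likelihood principle concrete, here is a minimal sketch in Python. It is not PhyML's algorithm, which searches over whole trees. It estimates a single quantity, the evolutionary distance between two aligned sequences, by trying candidate values and keeping the one that makes the observed data most probable. The mutation model (the classic Jukes-Cantor model, in which every substitution is equally likely), the two toy sequences, and all function names are assumptions chosen for illustration.

```python
import math


def count_differences(seq_a, seq_b):
    """Count the aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-probability of the data under the Jukes-Cantor model.

    d is the candidate distance, in expected substitutions per site (d > 0).
    """
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # a site keeps the same letter
    p_other = 0.25 - 0.25 * decay  # a site shows one specific other letter
    n_same = n_sites - n_diff
    # The factor 1/4 per site is the probability of the starting letter.
    return (n_sites * math.log(0.25)
            + n_same * math.log(p_same)
            + n_diff * math.log(p_other))


def estimate_distance(seq_a, seq_b, step=0.001, max_d=5.0):
    """Trial and error: score every candidate distance, keep the most likely."""
    n_sites = len(seq_a)
    n_diff = count_differences(seq_a, seq_b)
    best_d, best_ll = None, -math.inf
    for i in range(1, int(round(max_d / step)) + 1):
        d = i * step
        ll = jc69_log_likelihood(d, n_sites, n_diff)
        if ll > best_ll:
            best_d, best_ll = d, ll
    return best_d, best_ll


if __name__ == "__main__":
    # Hypothetical toy sequences: 20 sites, 4 of which differ.
    seq_a = "ACGTACGTACGTACGTACGT"
    seq_b = "GCGTATGTACATACGCACGT"
    best_d, best_ll = estimate_distance(seq_a, seq_b)
    print("most likely distance:", best_d)
    print("log-likelihood at that distance:", best_ll)
```

With only one number to estimate, scanning every candidate is cheap. A tree relating hundreds of species has an astronomically large number of possible shapes, which is why an exhaustive search of this kind is out of reach and a method that concentrates on the most promising candidates makes such a difference.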

“Previously, phylogenies were based on a single gene, even though there are thousands in the DNA databases. Today, each species is represented by 50 or 100 genes, so the phylogenies are much more reliable,” explains Gascuel. This is a boon for biologists, because phylogenetics is becoming increasingly important. “About one biology paper in four contains a phylogenetic analysis. These methods are used in fields ranging from functional genomics to biodiversity research, via, for example, the study of viruses such as HIV or SARS.” Molecular phylogenetics has been revolutionary for systematic biologists. They have found that in the plant world, for instance, morphological characteristics such as the shape and color of flowers are very poor indicators for grouping plants into orders, families and genera. On the other hand, evaluating plants' degree of separation at the genomic level gives highly accurate results.

With PhyML, the latest techniques in statistical inference complement the great classification project started by Linnaeus, the famous Swedish naturalist, and continued by Charles Darwin and his theory of evolution. But there is more to come, namely the possibility of understanding evolutionary processes at the genome level. This is a vast research program in which the developers of PhyML are involved: Gascuel is studying repeated sequences of DNA, while Guindon is working on mechanisms for mutation selection in viruses.

 The two researchers are currently working on new prototypes for PhyML, incorporating improvements that should lead to an even more reliable selection of the right model.

Sebastián Escalón

Notes:

1. Computer Science, Robotics and Microelectronics laboratory. Joint lab: CNRS / Université de Montpellier-II.
2. View web site
3. View web site

Contacts:

Olivier Gascuel
LIRMM, Montpellier

