Phylogenetics

In biology, phylogenetics /faɪlɵdʒɪˈnɛtɪks/ is the study of evolutionary relationships among groups of organisms (e.g. species, populations), which are discovered through molecular sequencing data and morphological data matrices. The term phylogenetics derives from the Greek terms phylé (φυλή) and phylon (φῦλον), denoting "tribe", "clan", "race", and the adjectival form, genetikós (γενετικός), of the word genesis (γένεσις), "origin", "source", "birth".

Strictly speaking, phylogenesis is the process, phylogeny is the study of that process, and phylogenetics is phylogeny based on the analysis of sequences of biological macromolecules (primarily DNA, RNA and proteins). The result of phylogenetic studies is a hypothesis about the evolutionary history of taxonomic groups: their phylogeny.

Evolution is a process whereby populations are altered over time and may split into separate branches, hybridize together, or terminate by extinction. The evolutionary branching process may be depicted as a phylogenetic tree, and the place of each of the various organisms on the tree is based on a hypothesis about the sequence in which evolutionary branching events occurred. In historical linguistics, similar concepts are used with respect to relationships between languages; and in textual criticism with stemmatics.

Phylogenetic analyses have become essential to research on the evolutionary tree of life. For example, the RedToL project aims to reconstruct the Red Algal Tree of Life. The National Science Foundation sponsors a project called Assembling the Tree of Life (AToL). The goal of this project is to determine evolutionary relationships across large groups of organisms throughout the history of life. The research often involves large teams working across institutions and disciplines, and the project typically supports investigators working on computational phylogenetics and phyloinformatics tasks, including data acquisition, analysis, and algorithm development and dissemination.

Taxonomy (the classification, identification and naming of organisms) is usually richly informed by phylogenetics, but remains a methodologically and logically distinct discipline. The degree to which taxonomies depend on phylogenies differs depending on the school of taxonomy: phenetics ignores phylogeny altogether, trying to represent the similarity between organisms instead; cladistics (phylogenetic systematics) tries to reproduce phylogeny in its classification without loss of information; evolutionary taxonomy tries to find a compromise between them in order to represent stages of evolution.

Construction of a phylogenetic tree

The scientific methods of phylogenetics are often grouped under the term cladistics. The most common ones are parsimony, maximum likelihood (ML), and MCMC-based Bayesian inference. All methods depend upon an implicit or explicit mathematical model describing the evolution of characters observed in the species included; all can be, and are, used for molecular data, wherein the characters are aligned nucleotide or amino acid sequences, and all but maximum likelihood (see below) can be, and are, used for phenotypic (morphological, chemical, and physiological) data (also called classical or traditional data).
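As an illustration of the parsimony criterion, the following minimal sketch counts the fewest state changes one character requires on a fixed rooted binary tree, using Fitch's counting rule for unordered characters. The tree encoding (nested tuples), the function name and the four-taxon data are assumptions made for this example only; real analyses also search over many candidate trees, which is not shown here.

```python
def fitch_score(tree, states):
    """Return the minimum number of state changes for one character.

    tree   -- nested 2-tuples for internal nodes, strings for leaves
    states -- dict mapping leaf name to its observed character state
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):      # leaf: the set holds its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                     # children share a state: no change needed
            return common
        changes += 1                   # children disagree: one change is required
        return left | right

    visit(tree)
    return changes


# Hypothetical alignment column and two candidate topologies.
column = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(fitch_score(tree_1, column))  # 1
print(fitch_score(tree_2, column))  # 2
```

Under parsimony, tree_1 would be preferred for this character because it explains the observed states with fewer changes.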

Phenetics, popular in the mid-20th century but now largely obsolete, uses distance matrix-based methods to construct trees based on overall similarity in morphology or other observable traits (i.e. in the phenotype, not the DNA), which was often assumed to approximate phylogenetic relationships.
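A distance-matrix approach of this kind can be sketched as follows. The text does not name a specific algorithm; UPGMA (average-linkage clustering) is used here only as one representative distance-based method, and the binary trait strings, taxon names and function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(dist, names):
    """Cluster taxa by average linkage and return a nested-tuple tree.

    dist  -- dict mapping frozenset({x, y}) to the distance between x and y
    names -- list of taxon names
    """
    clusters = {name: 1 for name in names}   # cluster -> number of leaves in it
    dist = dict(dist)                        # work on a copy
    while len(clusters) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for other in clusters:
            # The size-weighted mean equals the average over all leaf pairs.
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical presence/absence scores for eight observable traits.
traits = {
    "A": "11001000",
    "B": "11001100",
    "C": "00110011",
    "D": "00110110",
}
dist = {
    frozenset((x, y)): p_distance(traits[x], traits[y])
    for x, y in combinations(traits, 2)
}
tree = upgma(dist, list(traits))
print(tree)  # (('A', 'B'), ('C', 'D'))
```

The result groups taxa purely by overall similarity, which is why such trees only approximate phylogeny when similarity happens to track common descent.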

A comprehensive step-by-step protocol for constructing a phylogenetic tree, including DNA/amino acid contiguous sequence assembly, multiple sequence alignment, model testing (selecting the best-fitting substitution model) and phylogeny reconstruction using maximum likelihood and Bayesian inference, is available in Nature Protocols.

Prior to 1990, phylogenetic inferences were generally presented as narrative scenarios. Such methods are legitimate, but often ambiguous and hard to test.

Limitations and workarounds

Ultimately, there is no way to measure whether a particular phylogenetic hypothesis is accurate or not, unless the true relationships among the taxa being examined are already known (which may happen with bacteria or viruses under laboratory conditions). The best result an empirical phylogeneticist can hope to attain is a tree with branches that are well supported by the available evidence. Several potential pitfalls have been identified:

Homoplasy

Certain characters are more likely to evolve convergently than others; logically, such characters should be given less weight in the reconstruction of a tree. Weights in the form of a model of evolution can be inferred from sets of molecular data, so that maximum likelihood or Bayesian methods can be used to analyze them. For molecular sequences, this problem is exacerbated when the taxa under study have diverged substantially. As the time since the divergence of two taxa increases, so does the probability of multiple substitutions at the same site, or of back mutations, all of which result in homoplasies. For morphological data, unfortunately, the only objective way to determine convergence is by the construction of a tree, a somewhat circular method. Even so, weighting homoplasious characters does indeed lead to better-supported trees. Further refinement can be brought by weighting changes in one direction higher than changes in another; for instance, the presence of thoracic wings almost guarantees placement among the pterygote insects because, although wings are often lost secondarily, there is no evidence that they have been gained more than once.
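The effect of multiple substitutions at the same site can be illustrated with a distance correction. The sketch below uses the Jukes-Cantor model, which is not named in the text above and is chosen here only because it is the simplest nucleotide substitution model; the function name is hypothetical. It converts the observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3).

```python
import math


def jukes_cantor_distance(p):
    """Estimate substitutions per site from the observed proportion p of
    differing sites, assuming the Jukes-Cantor model (equal base
    frequencies and equal rates for all substitution types)."""
    if not 0 <= p < 0.75:
        # At p >= 0.75 the sequences are saturated and the formula is undefined.
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


print(round(jukes_cantor_distance(0.10), 3))  # 0.107
print(round(jukes_cantor_distance(0.50), 3))  # 0.824
```

For closely related sequences the correction is small, but at 50% observed difference the estimated distance is well above 0.5, reflecting the hidden substitutions that accumulate as taxa diverge.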

Horizontal gene transfer

In general, organisms can inherit genes in two ways: vertical gene transfer and horizontal gene transfer. Vertical gene transfer is the passage of genes from parent to offspring, and horizontal (also called lateral) gene transfer occurs when genes jump between unrelated organisms, a common phenomenon especially in prokaryotes; a good example of this is the acquired antibiotic resistance as a result of gene exchange between various bacteria leading to multi-drug-resistant bacterial species. There have also been well-documented cases of horizontal gene transfer between eukaryotes.

Horizontal gene transfer has complicated the determination of phylogenies of organisms, and inconsistencies in phylogeny have been reported among specific groups of organisms depending on the genes used to construct evolutionary trees. The only way to determine which genes have been acquired vertically and which horizontally is to parsimoniously assume that the largest set of genes that have been inherited together have been inherited vertically; this requires analyzing a large number of genes.

Hybrids, speciation and introgressions

The basic assumption underlying the mathematical model of cladistics is a situation where species split neatly in bifurcating fashion. While such an assumption may hold on a larger scale (bar horizontal gene transfer, see above), speciation is often much less orderly. Research since the cladistic method was introduced has shown that hybrid speciation, once thought rare, is in fact quite common, particularly in plants. Paraphyletic speciation is also common, making the assumption of a bifurcating pattern unsuitable and leading to phylogenetic networks rather than trees. Introgression can also move genes between otherwise distinct species and sometimes even genera, complicating phylogenetic analysis based on genes. This phenomenon can contribute to "incomplete lineage sorting" and is thought to be common across a number of groups. In species-level analysis this can be dealt with by larger sampling or better whole-genome analysis. Often the problem is avoided by restricting the analysis to fewer, not closely related specimens.

Taxon sampling

Owing to the development of advanced sequencing techniques in molecular biology, it has become feasible to gather large amounts of data (DNA or amino acid sequences) to infer phylogenetic hypotheses. For example, it is not rare to find studies with character matrices based on whole mitochondrial genomes (~16,000 nucleotides, in many animals). However, simulations have shown that it is more important to increase the number of taxa in the matrix than to increase the number of characters, because the more taxa there are, the more accurate and more robust is the resulting phylogenetic tree. This may be partly due to the breaking up of long branches.

Phylogenetic signal

Another important factor that affects the accuracy of tree reconstruction is whether the data analyzed actually contain a useful phylogenetic signal, a term that is used generally to denote whether a character evolves slowly enough to have the same state in closely related taxa as opposed to varying randomly. Tests for phylogenetic signal exist.

Continuous characters

Morphological characters that sample a continuum may contain phylogenetic signal, but are hard to code as discrete characters. Several methods have been used, one of which is gap coding, and there are variations on gap coding. In the original form of gap coding:

group means for a character are first ordered by size. The pooled within-group standard deviation is calculated … and differences between adjacent means … are compared relative to this standard deviation. Any pair of adjacent means is considered different and given different integer scores … if the means are separated by a "gap" greater than the within-group standard deviation … times some arbitrary constant.

If more taxa are added to the analysis, the gaps between taxa may become so small that all information is lost. Generalized gap coding works around that problem by comparing individual pairs of taxa rather than considering one set that contains all of the taxa.
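The original form of gap coding quoted above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the pooled standard deviation is assumed to be the square root of the summed within-group squared deviations divided by the total degrees of freedom, every taxon is assumed to have at least two measurements, and the function name, parameter names and measurements are hypothetical.

```python
import math


def gap_code(groups, constant=1.0):
    """Assign integer states to taxa from a continuous character.

    groups   -- dict mapping taxon name to a list of measurements
    constant -- the arbitrary multiplier applied to the pooled SD
    """
    means = {taxon: sum(v) / len(v) for taxon, v in groups.items()}

    # Pooled within-group standard deviation (assumed formula, see lead-in).
    ss = sum((x - means[taxon]) ** 2 for taxon, v in groups.items() for x in v)
    df = sum(len(v) - 1 for v in groups.values())
    pooled_sd = math.sqrt(ss / df)

    # Order group means by size, then start a new state at every large gap.
    ordered = sorted(means, key=means.get)
    state = 0
    codes = {ordered[0]: state}
    for prev, cur in zip(ordered, ordered[1:]):
        if means[cur] - means[prev] > constant * pooled_sd:
            state += 1
        codes[cur] = state
    return codes


# Hypothetical measurements of one continuous trait in four taxa.
measurements = {
    "A": [10.0, 11.0, 12.0],
    "B": [10.5, 11.5, 12.5],
    "C": [20.0, 21.0, 22.0],
    "D": [30.0, 31.0, 32.0],
}
print(gap_code(measurements))  # {'A': 0, 'B': 0, 'C': 1, 'D': 2}
```

Here the pooled standard deviation is 1.0, so A and B (means 0.5 apart) share a state, while C and D each fall beyond a gap and receive new states. Adding intermediate taxa would shrink the gaps between adjacent means, which is the loss of information described above.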

Missing data

In general, the more data that are available when constructing a tree, the more accurate and reliable the resulting tree will be. Missing data are no more detrimental than simply having fewer data, although the impact is greatest when most of the missing data are in a small number of taxa. Concentrating the missing data across a small number of characters produces a more robust tree.

The role of fossils

Because many characters are embryological, soft-tissue or molecular characters that (at best) hardly ever fossilize, and because the interpretation of fossils is more ambiguous than that of living taxa, extinct taxa almost invariably have higher proportions of missing data than living ones. However, despite these limitations, the inclusion of fossils is invaluable, as they can provide information in sparse areas of trees, breaking up long branches and constraining intermediate character states; thus, fossil taxa contribute as much to tree resolution as modern taxa. Fossils can also constrain the age of lineages and thus demonstrate how consistent a tree is with the stratigraphic record; stratocladistics incorporates age information into data matrices for phylogenetic analyses.

History

The term "phylogeny" derives from the German Phylogenie, introduced by Haeckel in 1866, and the Darwinian approach to classification became known as the "phyletic" approach.

Ernst Haeckel's recapitulation theory

During the late 19th century, Ernst Haeckel's recapitulation theory, or "biogenetic fundamental law", was widely accepted. It was often expressed as "ontogeny recapitulates phylogeny", i.e. the development of an organism successively mirrors the adult stages of successive ancestors of the species to which it belongs. This theory has long been rejected. Instead, ontogeny itself evolves: the phylogenetic history of a species cannot be read directly from its ontogeny, as Haeckel thought would be possible, but characters from ontogeny can be (and have been) used as data for phylogenetic analyses; the more closely related two species are, the more apomorphies their embryos share.

Timeline of key events

1300s, lex parsimoniae (parsimony principle), William of Ockham, English philosopher, theologian, and Franciscan monk, but the idea actually goes back to Aristotle, precursor concept
1763, Bayesian probability, Rev. Thomas Bayes, precursor concept
1700s, Pierre Simon (Marquis de Laplace), perhaps 1st to use ML (maximum likelihood), precursor concept
1809, evolutionary theory, Philosophie Zoologique, Jean-Baptiste de Lamarck, precursor concept, foreshadowed in the 1600s and 1700s by Voltaire, Descartes, and Leibniz, with Leibniz even proposing evolutionary changes to account for observed gaps suggesting that many species had become extinct, others transformed, and different species that share common traits may have at one time been a single race, also foreshadowed by some early Greek philosophers such as Anaximander in the 6th century BC and the atomists of the 5th century BC, who proposed rudimentary theories of evolution
1837, Darwin's notebooks show an evolutionary tree
1843, distinction between homology and analogy (the latter now referred to as homoplasy), Richard Owen, precursor concept
1858, paleontologist Heinrich Georg Bronn (1800–1862) published a hypothetical tree illustrating the paleontological "arrival" of new, similar species following the extinction of an older species; Bronn did not propose a mechanism responsible for such phenomena, precursor concept
1858, elaboration of evolutionary theory, Darwin and Wallace, also in Origin of Species by Darwin the following year, precursor concept
1866, Ernst Haeckel, first publishes his phylogeny-based evolutionary tree, precursor concept
1893, Dollo's Law of Character State Irreversibility, precursor concept
1912, ML recommended, analyzed, and popularized by Ronald Fisher, precursor concept
1921, Tillyard uses term "phylogenetic" and distinguishes between archaic and specialized characters in his classification system
1949, jackknife, Maurice Quenouille (foreshadowed in '46 by Mahalanobis and extended in '58 by Tukey), precursor concept
1950, Willi Hennig's classic formalization
1952, William Wagner's groundplan divergence method
1953, "cladogenesis" coined
1960, "cladistic" coined by Cain and Harrison
1963, 1st attempt to use ML (maximum likelihood) for phylogenetics, Edwards and Cavalli-Sforza
1965
- Camin-Sokal parsimony, 1st parsimony (optimization) criterion and 1st computer program/algorithm for cladistic analysis both by Camin and Sokal
- character compatibility method, also called clique analysis, introduced independently by Camin and Sokal (loc. cit.) and E.O. Wilson
1966
- English translation of Hennig
- "cladistics" and "cladogram" coined (Webster's, loc. cit.)
1969
- dynamic and successive weighting, James Farris
- Wagner parsimony, Kluge and Farris
- CI (consistency index), Kluge and Farris
- introduction of pairwise compatibility for clique analysis, Le Quesne
1970, Wagner parsimony generalized by Farris
1971
- Fitch parsimony, Fitch
- NNI (nearest neighbour interchange), 1st branch-swapping search strategy, developed independently by Robinson and Moore et al.
- ME (minimum evolution), Kidd and Sgaramella-Zonta (it is unclear if this is the pairwise distance method or related to ML as Edwards and Cavalli-Sforza call ML "minimum evolution".)
1972, Adams consensus, Adams
1974, 1st successful application of ML to phylogenetics (for nucleotide sequences), Neyman
1976, prefix system for ranks, Farris
1977, Dollo parsimony, Farris
1979
- Nelson consensus, Nelson
- MAST (maximum agreement subtree), also called GAS (greatest agreement subtree), a consensus method, Gordon
- bootstrap, Bradley Efron, precursor concept
1980, PHYLIP, 1st software package for phylogenetic analysis, Felsenstein
1981
- majority consensus, Margush and McMorris
- strict consensus, Sokal and Rohlf
- 1st computationally efficient ML algorithm, Felsenstein
1982
- PHYSIS, Mickevich and Farris
- branch and bound, Hendy and Penny
1985
- 1st cladistic analysis of eukaryotes based on combined phenotypic and genotypic evidence, Diana Lipscomb
- 1st issue of Cladistics
- 1st phylogenetic application of bootstrap, Felsenstein
- 1st phylogenetic application of jackknife, Scott Lanyon
1986, MacClade, Maddison and Maddison
1987, neighbor-joining method, Saitou and Nei
1988, Hennig86 (version 1.5), Farris
1989
- RI (retention index), RCI (rescaled consistency index), Farris
- HER (homoplasy excess ratio), Archie
1990
- combinable components (semi-strict) consensus, Bremer
- SPR (subtree pruning and regrafting), TBR (tree bisection and reconnection), Swofford and Olsen
1991
- DDI (data decisiveness index), Goloboff
- 1st cladistic analysis of eukaryotes based only on phenotypic evidence, Lipscomb
1993, implied weighting, Goloboff
1994, Bremer support (decay index), Bremer
1994, reduced consensus: RCC (reduced cladistic consensus) for rooted trees, Wilkinson
1995, reduced consensus RPC (reduced partition consensus) for unrooted trees, Wilkinson
1996, 1st working methods for BI (Bayesian inference), independently developed by Li, Mau, and Rannala and Yang, all using MCMC (Markov chain Monte Carlo)
1998, TNT (Tree Analysis Using New Technology), Goloboff, Farris, and Nixon
1999, Winclada, Nixon
2003, symmetrical resampling, Goloboff

External links

The Tree of Life
Interactive Tree of Life
PhyloCode
ExploreTree
UCMP Exhibit Halls: Phylogeny Wing
Willi Hennig Society
Filogenetica.org in Spanish
PhyloPat, Phylogenetic Patterns
SplitsTree, program for computing phylogenetic trees and unrooted phylogenetic networks
Dendroscope, program for drawing phylogenetic trees and rooted phylogenetic networks
Phylogenetic inferring on the T-REX server
Mesquite
NCBI: Systematics and Molecular Phylogenetics
Sydney Brenner's seminars: "What Genomes Can Tell Us About the Past"
Mikko's Phylogeny Archive
Who is Who in Phylogenetic Networks: research papers related to phylogenetic networks
Phylogenetic Reconstruction from Gene-Order Data
ETE: A Python Environment for Tree Exploration. This is a programming library to analyze, manipulate and visualize phylogenetic trees. See: Huerta-Cepas, Jaime; Dopazo, Joaquín; Gabaldón, Toni (2010). "ETE: A python Environment for Tree Exploration". BMC Bioinformatics 11: 24. doi:10.1186/1471-2105-11-24. PMC 2820433. PMID 20070885.
PhylomeDB: A public database hosting thousands of gene phylogenies spanning many different species. See: Huerta-Cepas, J.; Capella-Gutierrez, S.; Pryszcz, L. P.; Denisov, I.; Kormes, D.; Marcet-Houben, M.; Gabaldón, T. (2010). "PhylomeDB v3.0: An expanding repository of genome-wide collections of trees, alignments and phylogeny-based orthology and paralogy predictions". Nucleic Acids Research 39 (Database issue): D556–60. doi:10.1093/nar/gkq1109. PMC 3013701. PMID 21075798.
Lents, N. H.; Cifuentes, O. E.; Carpi, A. (2010). "Teaching the Process of Molecular Phylogeny and Systematics: A Multi-Part Inquiry-Based Exercise". Cell Biology Education 9 (4): 513. doi:10.1187/cbe.09-10-0076.

Friday, May 22, 2015
