Some time ago we published a method for exploring among-site rate variation in evolutionary datasets [1]. This problem has been of interest for more than 40 years: some characters in a dataset evolve at different rates from others, and this can mislead phylogeny reconstruction.
There are three principal situations in which evolutionary rate variation can mislead phylogeny reconstruction. The first is when some characters evolve so fast that the phylogenetic signal is completely eroded; whatever small pattern remains in the data, even though it is not phylogenetic signal, can then be the signal that 'wins'. The second is a consequence of mutational bias, where distant relatives independently evolve the same nucleotide or amino acid bias. The third is homoplastic mutation driven by natural selection.
So - what to do?
The most common answer, if the data is molecular (i.e. DNA or protein), is to model the rate variation using some kind of distribution. The usual choice is the gamma distribution, most often in the form of a discrete approximation. You can, for instance, decide to use a gamma distribution with four rate categories. Then, for every character, you calculate its likelihood under each category and average these likelihoods, weighting each category by its probability (the categories are usually equally probable), to get the overall likelihood for the site. The category with the fastest rate will fit the fast-evolving sites best and the category with the slowest rate will fit the slow-evolving sites best. In general, we find that a model with a discrete gamma approximation fits the data better than one without it.
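The per-site calculation described above can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions: two sequences instead of a tree, a Jukes-Cantor substitution model, and four hypothetical, equally probable category rates that are typed in by hand rather than derived from a fitted gamma distribution. The names jc69_prob, site_likelihood and RATES are invented for this example.

```python
import math
from typing import Sequence


def jc69_prob(same: bool, distance: float) -> float:
    """Jukes-Cantor probability that a base ends up as the same base
    (or as one specific different base) after `distance` expected
    substitutions per site."""
    decay = math.exp(-4.0 * distance / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def site_likelihood(x: str, y: str, distance: float,
                    rates: Sequence[float]) -> float:
    """Likelihood of one aligned site in a two-sequence alignment,
    averaged over equally probable rate categories."""
    # 0.25 is the stationary frequency of the base observed in sequence 1;
    # each category rescales the distance by its relative rate.
    per_category = [0.25 * jc69_prob(x == y, r * distance) for r in rates]
    return sum(per_category) / len(rates)


# Hypothetical category rates with mean 1.0 (an assumption for this sketch;
# a real analysis would compute them from the gamma shape parameter).
RATES = [0.1, 0.4, 1.0, 2.5]

if __name__ == "__main__":
    for x, y in [("A", "A"), ("A", "G")]:
        print(x, y, site_likelihood(x, y, 0.5, RATES))
```

An unchanged site gets most of its likelihood from the slow categories and a changed site from the fast ones, which is the sense in which each category "fits" a different kind of site.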
One of the problems is that we need a tree in order to categorise sites.
We analysed this kind of approach and showed that it contains both a systematic bias and a circularity. If you use a tree to identify the fast-evolving sites, you naturally assign a fast rate to the sites that disagree with that tree.
That would be fine if the tree were the correct tree, but it is really problematic if it is not.
So, if you don't know the true tree, you might get the wrong answer; and if you do know the true tree, you don't need the right rates in order to infer it, because you already know it.
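A toy example makes the circularity concrete. The sketch below counts, with the standard Fitch parsimony algorithm, the minimum number of changes a single site needs on two different four-taxon trees. Parsimony steps stand in here for "apparent rate"; this is an assumption made for illustration and is not how any particular likelihood program assigns rates. The function name fitch_steps and the nested-tuple tree format are invented for this example, and only strictly binary trees are handled.

```python
def fitch_steps(tree, states):
    """Return the minimum number of state changes one site needs on `tree`.

    `tree` is a nested tuple of taxon names, e.g. (("A", "B"), ("C", "D"));
    `states` maps each taxon name to its character state at the site.
    """
    def visit(node):
        if isinstance(node, str):           # leaf: its observed state
            return {states[node]}, 0
        left_set, left_steps = visit(node[0])
        right_set, right_steps = visit(node[1])
        shared = left_set & right_set
        if shared:                          # children can agree: no change
            return shared, left_steps + right_steps
        # children cannot agree: one extra change on this part of the tree
        return left_set | right_set, left_steps + right_steps + 1

    return visit(tree)[1]


# One site whose states group A with C and B with D.
site = {"A": "G", "B": "T", "C": "G", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

if __name__ == "__main__":
    print(fitch_steps(tree_1, site))  # 2
    print(fitch_steps(tree_2, site))  # 1
```

The same site looks "fast" (two changes) on tree_1 and "slow" (one change) on tree_2, so the rate you assign to it depends entirely on which tree you assumed.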
Our approach was to use no tree at all.
Instead, we compare sites with one another in a pairwise fashion. We expect that, if there is reasonable phylogenetic signal, different sites should show similar patterns of character-state distribution. Homoplastic sites would tend, on average, to disagree with the non-homoplastic sites.
We have implemented this logic in a software program called TIGER. You can download it from here.
It reads in an alignment and compares pairs of sites with one another. Each site finally gets a score for how similar it is, on average, to all the other sites.
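To give a feel for what a tree-free, pairwise comparison of sites can look like, here is a minimal Python sketch. It treats each site as a partition of the taxa (taxa sharing a state fall in the same group) and scores how well the groups of one site fit inside the groups of another. The agreement measure and the function names site_partition, agreement and site_scores are assumptions made for this illustration; the text above does not spell out TIGER's exact scoring, and gaps or missing data are simply treated as ordinary states here.

```python
def site_partition(column):
    """Group taxon indices by the character state they carry at one site."""
    groups = {}
    for taxon, state in enumerate(column):
        groups.setdefault(state, set()).add(taxon)
    return list(groups.values())


def agreement(site_i, site_j):
    """Fraction of site_j's groups that fit inside a single group of site_i."""
    fits = sum(any(g <= h for h in site_i) for g in site_j)
    return fits / len(site_j)


def site_scores(alignment):
    """Mean agreement of each site with every other site.

    `alignment` is a list of equal-length sequences with at least two sites.
    """
    partitions = [site_partition(col) for col in zip(*alignment)]
    scores = []
    for i, p_i in enumerate(partitions):
        others = [agreement(p_i, p_j)
                  for j, p_j in enumerate(partitions) if j != i]
        scores.append(sum(others) / len(others))
    return scores


# Four taxa, four sites: the first three sites group taxa {1,2} and {3,4};
# the last site groups {1,3} and {2,4} and so conflicts with the rest.
alignment = ["AAGA", "AAGC", "CCTA", "CCTC"]

if __name__ == "__main__":
    print(site_scores(alignment))
```

In this toy alignment the last site receives the lowest score, because none of its groups fit inside the groups defined by any other site.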
In many analyses, homoplastic sites will show up as the sites that tend to disagree with the other sites in the alignment. In principle, therefore, it is possible to identify homoplastic sites as those that disagree the most.
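Once each site has a score, picking out "those that disagree the most" is a simple ranking step. The sketch below assumes the scores are already available as a list of numbers in which lower means more disagreement, and flags a fixed fraction of the lowest-scoring sites. The 25% cut-off and the function names most_conflicting_sites and drop_sites are hypothetical choices for illustration only; what counts as a sensible threshold depends on the dataset.

```python
def most_conflicting_sites(scores, fraction=0.25):
    """Return the indices of the lowest-scoring sites, worst first."""
    n_flagged = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:n_flagged]


def drop_sites(alignment, flagged):
    """Return a copy of the alignment with the flagged columns removed."""
    flagged = set(flagged)
    return ["".join(c for i, c in enumerate(seq) if i not in flagged)
            for seq in alignment]


if __name__ == "__main__":
    example_scores = [0.67, 0.67, 0.67, 0.0]   # made-up per-site scores
    flagged = most_conflicting_sites(example_scores)
    print(flagged)
    print(drop_sites(["AAGA", "AAGC", "CCTA", "CCTC"], flagged))
```

Comparing trees inferred with and without the flagged sites is one way to explore how much those sites influence the result.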
The method will, of course, break down if the alignment doesn't really contain much phylogenetic signal, since most sites will then disagree with or completely agree with one another. Therefore, this approach should be classified along with other methods of data exploration and signal assessment.
Contact us if you have any questions about the method.
1. Cummins, C.A. and McInerney, J.O. (2011) A method for inferring the rate of evolution of homologous characters that can potentially improve phylogenetic inference, resolve deep divergence and correct systematic biases. Systematic Biology doi: 10.1093/sysbio/syr064.