Doug Taylor, Evolutionary Biology Silene Vulgaris Pollen, Seeds, and Flower
Home Research People Teaching Colleagues Protocols Pictures Lab News Links

Using the SOWH test to compare ML tree topologies:

FromGoldman, N., Anderson, J.P., and Rodrigo, A.G.2000Likelihood-based tests of topologies in phylogenetics.Syst. Biol.49(4):652-670.

This test is used to compare the topology of your Maximum likelihood (best) tree to that of an alternate evolutionary hypothesis for tree topology. For example, we have used it to compare the ML tree to an alternate topology in which Houstonia is constrained to be monophyletic. You will need a version of PAUP* (Swofford, 1999) as well as Seq-Gen (available here (Rambaut & Grassly, 1997).

This protocol is a version of the protocol found at: http://www.ebi.ac.uk/goldman/tests/SOWHinstr.html

If you have a maximum likelihood tree topology, we will assume that you have already estimated your correct model of evolution. You can make a constrained tree using that same model of evolution in PAUP*.The best way to constrain a tree is using the PAUP commands:

Constraints name_monophyletic_group=((1-5));

In this case, ‘constraints’ is the PAUP command that constrains taxa 1-5 to be monphyletic. All of the other taxa included will be placed outside of the clade containing taxa 1-5.This will allow PAUP to find the best tree which constrains 1-5 to be monophyletic. Changing the topology in MacClade does not do this, rather it defines a single alternative tree topology, which may not be the best constrained topology. Don’t forget to save both the best tree and the constrained with branch lengths in both PAUP and Phylip format !!(this can be done under ‘options’ when saving trees) Once you have both the constrained tree and the best tree, you want to open them both up into PAUP* and re-estimate the model parameters. This is a sample of how to do that:

#nexus
begin paup;
log file=lscores.log;[start log file]
exec datafile.nex;[read in data file]
cleartrees;
[get first tree which is most likely tree, no constraints]
gettrees file=best.tree;
[get second tree, and keep all trees that are different.This is getting constrained tree]
gettrees file=constrained.tree mode= 5;
[set critertion to maximum likelihood]
set criterion=likelihood;
[set likelihood model]
lset nst=6 rmatrix=estimate rclass=(a b c c b a) basefreq=estimate pinv=est rates=gamma shape=estimate;
[calculate likelihood parameters for all trees]
lscores all /khtest;
[stop writing to log file]
log stop;
end;

The log file that you made will output all of the likelihood parameters.These will be used in the Seq-Gen program (allexcept the pinvar variable).It has also done a Kishino-Hasegawa test that you can compare to your SOWH test later. The KH test output will also calculate the differences in the –lnL of your two tree topologies.Your output will look similar to this:

Tree number 1: (this is the constrained tree)
-Ln likelihood = 4185.44058
Estimated base frequencies = A:0.303967 C:0.220064 G:0.236390 T:0.239579
Estimated R-matrix:
-11.97881280.63130727
1-0.631307271.9788128
1.97881280.63130727-1
0.631307271.97881281-
Estimated value of gamma shape parameter = 0.306055
Tree number 2:(this is the best ML tree)
-Ln likelihood = 4175.03705
Estimated base frequencies = A:0.304209 C:0.219753 G:0.234761 T:0.241277
Estimated R-matrix:
-11.94273570.62630705
1-0.626307051.9427357
1.94273570.62630705-1
0.626307051.94273571-
Estimated value of gamma shape parameter = 0.324588
Time used to compute likelihoods = 53.00 sec
Kishino-Hasegawa test:
KH test using normal approximation, two-tailed test
---------- KH-test ----------
Tree-ln LDiff -ln Ls.d.(diff)TP
---------------------------------------------------------------------
14185.4405810.403525.101292.03940.0417*
24175.03705(best)

These parameters and the phylip version of the constrained tree are used in Seq-Gen to generator data sets. Using the Mac version of Seq-Gen, a window will appear with aconsole to write in and buttons from which you can choose files. For the input, depress the file button and choose your constrained phylip tree. Then type in your model parameters (from your constrained tree) in the dialog box like so:

-m REV –l 1076 –n 100 –a 0.306055 –f 0.303967,0.220064,0.236390,0.239579 –r 1,1.9788128,0.63130727,0.63130727,1.9788128,1

where ‘l’ is the number of characters in your data matrix, ‘-n’ is the number of data sets to simulate, ‘-a’ is the gamma shape parameter, ‘-f’ is the base frequencies (acgt), and ‘-r’ is the values in the r-matrix. Also, depress the output ‘file’ button to save the seq-gen output to a file. Running this will make 100 data sets of 19 taxa with 1076 characters. Once this is completed, open the ouput file which is your 100 simulated data sets.To run a partial optimization of the SOWH test, you now need to change this output file to PAUP executable file. At the beginning of each of the new data sets is a line with saying ‘19 1076’(this is the number of taxa and number of characters in each set). This needs to be replaced by the following PAUP block:

;
end;
begin paup;
set crit=like;
lset nst=6 rmat=est rates=gamma shape=est;
gettrees file=contrained.tre ;
lscore/scorefile=constrained_score append=yes ;
set crit=like;
lset nst=6 rmat=prev rates=gamma shape=prev;
hsearch start=stepwise nbest=1 status=no;
savetrees file=besttree append=yes format=phylip;
lscore/scorefile=best_score append=yes;
begin data;
dimensions ntax=19 nchar=1076;
format datatype=nucleotide gap=- missing=? matchchar=.;
matrix

where ‘constrained.tre’ is your constrained tree file in PAUP format. This PAUP block will also need to be added to the very end of the seq-Gen file.The portion …begin data;…. through …matrixshould not be included in this final paup block. The very first PAUP block at the beginning of the file shold not include the first 12 lines of the PAUP script ( ;end; … through best_score append=yes;).This file can then be executed in PAUP*. This will compute a likelihood score for your constrained tree topology for each of the 100 data sets and it will compute a likelihood score for the best tree of each of the 100 data sets.This will give you two output files (constrained_score and best_score).These files should be opened in excel and the difference between each pair of likelihood scores calculated.There should be 100 pairs.These differences can tehn be ordered and you can see where your difference (calculated above in the KH test output) falls into this distribution of scores (outside of 95 precent of the scores?)


Department of Biology, PO Box 400328 University of Virginia, Charlottesville, VA 22904-4328
Email: drt3b@virginia.edu  Phone:(434)982-5217


Site Design by Eyewall Design.com