Workshop II: Multiple Sequence Alignment


Sample sequence datasets from McClure et al. 1994 (see reference below):
kinases/kin10.fasta globins/glob10.fasta
kinases/kin12.fasta globins/glob12.fasta
kinases/kin6.fasta globins/glob6.fasta
proteases/pro10.fasta rh/rh10.fasta (ribonuclease H)
proteases/pro12.fasta rh/rh12.fasta
proteases/pro6.fasta rh/rh6.fasta

Exercise:

Using the kin10.fasta sequence set above, rank the performance of several of the alignment programs listed on the course "Multiple Sequence Alignment Resources" handout (e.g., CLUSTAL-W, PIMA, MSA, DCA, ITERALIGN, SAGA, T_COFFEE, POA, PCMA, PROALIGN, MAVID, MUSCLE, Align-m, DIALIGN-T, PRALINE-psi, PRRN, MAFFT-5, ProbCons) by how accurately each one aligns the 8 structurally conserved sites in the kinase catalytic domain (see McClure et al. 1994, Fig. 2, p. 584). To simplify the scoring, score each site as "all-or-none": count a site as correctly aligned only if every sequence in the set is correctly aligned within that site.
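The all-or-none rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the original exercise. It assumes each alignment has already been read into a dictionary mapping sequence name to gapped sequence. It also assumes that the start of each conserved site is known for every sequence as a 0-based index into the ungapped sequence. The function names and the toy sequences are hypothetical, and the toy data are not kinase data. The real site positions have to be taken from McClure et al. 1994, Fig. 2.

```python
def column_of_residue(aligned_seq, residue_index, gap_chars="-."):
    """Return the alignment column that holds the residue at the given
    0-based index of the ungapped sequence."""
    seen = -1
    for col, ch in enumerate(aligned_seq):
        if ch not in gap_chars:
            seen += 1
            if seen == residue_index:
                return col
    raise ValueError("residue index is beyond the end of the sequence")


def site_is_aligned(alignment, site_starts, site_length):
    """All-or-none test: True only if, for every residue of the site,
    all sequences place that residue in the same alignment column."""
    for offset in range(site_length):
        columns = {
            column_of_residue(alignment[name], site_starts[name] + offset)
            for name in alignment
        }
        if len(columns) != 1:
            return False
    return True


def score_alignment(alignment, sites):
    """Count correctly aligned sites; sites is a list of
    (site_starts, site_length) pairs."""
    return sum(site_is_aligned(alignment, starts, length)
               for starts, length in sites)


# Hypothetical toy data: three short made-up sequences and two "sites".
toy_sites = [
    ({"seqA": 2, "seqB": 2, "seqC": 3}, 3),  # a G-x-G motif
    ({"seqA": 6, "seqB": 6, "seqC": 6}, 1),  # a single conserved W
]
good_alignment = {
    "seqA": "MK-GAGSW",
    "seqB": "M-RGSGTW",
    "seqC": "MKKGDG-W",
}
bad_alignment = {
    "seqA": "MK-GAGSW",
    "seqB": "M-RGSGTW",
    "seqC": "MKKGDGW-",  # the W of seqC sits in a different column
}

print(score_alignment(good_alignment, toy_sites))
print(score_alignment(bad_alignment, toy_sites))
```

Running the same scoring over the output of each program and sorting by the number of correct sites (out of 8 for the kinase set) gives the requested ranking. Parsing each program's output format is left out of this sketch.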

Note: Pre-run result files for some of these programs are provided below:
kin10.clustalw kin10.pima kin10.map
kin10.dca kin10.iteralign kin10.t_coffee kin10.poa kin10.praline
kin10.pcma kin10.proalign kin10.mavid kin10.muscle kin10.align_m
kin10.dialign-t kin10.praline-psi kin10.prrn kin10.mafft5 kin10.probcons

Reference: McClure MA, Vasi TK, and Fitch WM (1994). Comparative analysis of multiple protein-sequence alignment methods. Mol. Biol. Evol. 11:571-592. (PDF)


CSHL Computational Genomics Course, Nov 2-8, 2005
Randall F. Smith, Bioinformatics, GlaxoSmithKline R&D