Biochem 503, Fall 2008

Protein sequence comparison and Protein evolution

(9/2,4/2008)
Mathews and van Holde, Chapter 7, 233-239

Topics

Protein evolution

Doolittle, R. F., Feng, D. F., Johnson, M. S., and McClure, M. A. (1986) Relationships of human protein sequences to those of other organisms. Cold Spring Harb. Symp. Quant. Biol. 51:447-455.

***Pearson, W. R., "Protein sequence comparison and protein evolution," ISMB2000 - Tutorial, San Diego, CA (2000). PDF version

Statistics and evaluation of matches

***Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C. (1994) Issues in searching molecular sequence databases. Nature Genet. 6:119-129.


Sites for sequence database searching:

BLAST search page at the National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/BLAST/.

FASTA search page at the U. of Virginia: http://fasta.bioch.virginia.edu/fasta

ExPASy server for similarity searches, motifs, pI, molecular weight, etc.: http://expasy.hcuge.ch


WWW sites discussing sequence comparison

An excellent tutorial on using the BLAST programs: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html.

http:prot_talk12-95.html

http://twod.med.harvard.edu/seqanal/

http://www.public.iastate.edu/~pedro/research_tools.html


Exercises:

Use the FASTA WWW search page and/or BLAST to search protein databases or specific genomes:
  1. Search the SwissProt database with the mouse voltage-gated potassium channel cik4_mouse. Identify the phylogenetically most distant homologous sequence. Is this the most distantly related sequence? What is the expectation value (E()-value) of the highest scoring unrelated sequence? (Why do you think the sequence is unrelated?)

  2. Can you identify a p53_human homolog from a non-vertebrate? (Hint: search SwissProt; if no non-vertebrate homolog is found, search BLAST NR. Use advanced BLAST to limit the search to particular organisms, e.g., C. elegans.)

  3. Does M. jannaschii contain a glutamate dehydrogenase (dhe4_human)? Search the M. jannaschii proteome using SSEARCH. No homologues? Try searching again with dhe2_sulso.

Questions on this topic from previous exams

  1. Give two reasons why percent identity is not as useful as the E()-value in establishing protein or DNA homology. What does E()-value mean? What range of values can an E()-value take?

  2. The mouse voltage-gated potassium channel cik4_mouse shares weak, but not statistically significant, similarity with a hypothetical Methanococcus jannaschii open reading frame (ORF, a putative protein sequence), with an expectation value E() ~ 0.1. Suggest some additional sequence-based strategies to confirm that this putative M. jannaschii protein is homologous to mammalian voltage-gated ion channels, indicating which strategies would provide the most reliable inferences.

  3. Present one argument to support the assertion that "random" protein sequences have similarity scores that behave like "real, unrelated" protein sequences.
    How is this assertion used to justify the inference of homology from similarity?

  4. (20 pts) You have just cloned and sequenced a gene encoding a novel protein (<50% identical to any known protein) induced during liver regeneration. Describe the different analyses you could perform on the sequence to infer its secondary (2°) and tertiary (3°) structure. Which are the most reliable? The least?

  5. Why is protein sequence comparison more effective than DNA sequence comparison? Do you expect to be able to detect homology relationships between human and yeast genes with DNA sequences? With protein sequences? Describe a sequence comparison experiment to demonstrate the difference in sensitivity of the two methods.

  6. In genome research, investigators use the phrase "transitive catastrophe" when referring to large sets of exceptionally misleading incorrect annotations based on BLAST or FASTA similarity searches on proteins from newly sequenced genomes. What is the most likely explanation for these errors? How could they be avoided?

  7. Shown below are the results of a FASTA sequence similarity search using a Drosophila glutathione transferase query sequence. (a) Assuming the statistics are accurate, what is likely to be the highest scoring unrelated sequence? (b) Given that there are hundreds of glutathione transferase sequences known from animals, plants, and bacteria, how might you confirm that your candidate unrelated sequence is truly unrelated? (c) How might you demonstrate, using sequence similarity alone, that A37378 glutathione transferase pi is homologous to the Drosophila query sequence?

    The best scores are:                                     bits  E(14548)
    XUFF11  glutathione transferase 1 - fruit fl     ( 210)   326  3.8e-90
    XUZM32  glutathione transferase III - maize      ( 223)    48  3.3e-06
    XUZM31  glutathione transferase III - maize      ( 221)    43  5.6e-05
    XUZM1   glutathione transferase I - maize        ( 214)    41  0.00031
    RGECSS  stringent starvation protein E. coli     ( 213)    40  0.00058
    XURTG   glutathione transferase class alpha Ya1  ( 223)    40  0.0007
    XURT8C  glutathione transferase 8, - rat         ( 223)    30  0.53
    XURTG4  glutathione transferase 4 - rat          ( 219)    29  0.98
    A37378  glutathione transferase pi - human       ( 211)    27  5.4
    OUBP22  antirepressor protein ant - phage P22    ( 301)    27  6.8
    NOBY2   phosphopyruvate hydratase 2 - yeast      ( 438)    27  7.4
    S30223  translation elongation factor eEF-1 b    ( 228)    26  7.8
    PWBYD   H+-transporting ATP synthase delta       ( 213)    26  8.7
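
As a rough check on how the bits and E(14548) columns in the search output above relate, the sketch below assumes the commonly described relation E ~ D * m * n * 2^(-bits), where m is the query length, n is the library sequence length, and D is the number of database entries (14548 here). The function name and the choice of 210 residues for the query (taken from the top-scoring XUFF11 entry) are illustrative assumptions, not part of the course material. Because the listed bit scores are rounded to integers, agreement to within about a factor of two is all that should be expected.

```python
def expect_from_bits(bits, query_len, lib_len, db_entries):
    """Approximate E()-value from a bit score.

    Assumes E = D * m * n * 2**(-bits); this is a simplification of the
    statistics actually reported by a search program.
    """
    pairwise_p = query_len * lib_len * 2.0 ** (-bits)
    return db_entries * pairwise_p


# (identifier, library sequence length, bits, reported E()) copied from the listing
hits = [
    ("XUZM32", 223, 48, 3.3e-06),
    ("XURT8C", 223, 30, 0.53),
    ("A37378", 211, 27, 5.4),
]

QUERY_LEN = 210     # assumption: the query is the 210-residue XUFF11 entry
DB_ENTRIES = 14548  # from the E(14548) column header

for name, lib_len, bits, reported in hits:
    estimate = expect_from_bits(bits, QUERY_LEN, lib_len, DB_ENTRIES)
    print(f"{name}: estimated E() ~ {estimate:.2g}, reported {reported:.2g}")
```

Note that each additional bit halves the estimated E()-value, while searching a database with twice as many entries doubles it.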
