In the links
below, [pgm]
indicates a link with most of the information filled in; e.g. the
program name, query, and
library. [seq] links go to the
NCBI, for more information about the sequence.
Identifying homologs and non-homologs; effects of scoring matrices
and algorithms
1. Use the FASTA search page [pgm] to compare Drosophila
glutathione transferase
GSTT1_DROME [seq] (gi|121694) to the PIR1
Annotated protein sequence database. Be sure to press , not .
Take a look at the output.
How long is the query sequence?
How many sequences are in the PIR1 database?
What scoring matrix was used?
What were the gap penalties?
What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
Looking at an alignment, where are the boundaries of the alignment (the best local region)?
What is the highest scoring non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) How would you confirm
that your candidate non-homolog was truly unrelated?
Note that this drosophila glutathione transferase shares significant
similarity with both sequences from bacteria (SSPA_ECO57, stringent
starvation protein) and mammals. How might you test whether the
stringent starvation protein is homologous to glutathione
transferases? (Hint - compare your candidate non-homolog with SwissProt for a more reliable test.)
Compare the expectation (E()) value for the distant
relationship between GSTT1_DROME and
GSTP1_HUMAN (class-mu). How would you demonstrate that
GSTT1_DROME is homologous to GSTP1_HUMAN?
Examine how the expectation value changes with different scoring
matrices (BLOSUM62, BlastP62, PAM250) and different gap
penalties. (The default scoring matrix for the FASTA programs is
BLOSUM50, with gap penalties of -10 to open a gap and -2 for each
residue in the gap - e.g. -12 for a one residue gap).
What happens to the E()-value for the 100% identical sequence with the different matrices and different gap penalties?
What happens to the E()-value for distant homologs, like GSTA1_RAT with the different matrices and different gap penalties?
What happens to the E()-value for the highest scoring unrelated
sequence with the different matrices?
Try the search with ssearch [pgm] (Smith-Waterman). Again, look
at the E()-values for distant homologs and the highest scoring
unrelated sequence.
2. Do the same search (121694) using the Course
BLAST [pgm] WWW page.
Take a look at the output.
How long is the query sequence?
How many sequences are in the PIR1 database?
What scoring matrix was used?
What were the gap penalties?
What are the numbers after the description of the library sequence? Which one is best for inferring homology?
Looking at an alignment, where are the boundaries of the alignment (the best local region)?
What is the highest scoring non-homolog?
How do the blastp E()-values compare with the
FASTA (blosum62) E()-values for the distantly related
mammalian and plant sequences?
Comparison of Protein:Protein, translated DNA:protein to DNA:DNA searches - more sensitive DNA searches3. In the next three exercises, we will try to find
gstt1_drome homologs in the Arabidopsis genome, using
(a) protein:protein (BLASTP), (b) DNA:protein (BLASTX), (c)
protein:DNA (TBLASTN), and (d,e) DNA:DNA (BLASTN) searches.
In each of the exercises below, the BLASTP, BLASTX etc. links are pre-set to search Arabidopsis sequences.
What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1and ATGSTU4
Finally, try the DNA:DNA comparison.
Use
BLASTN [pgm] to compare dmgst
(gi|8033) to the Reference mRNA sequences (refseq_rna) sequences in Arabidopsis.
Are there detectable Arabidopsis homologues?
Searching from the command line -- BLAST and FASTA
We have installed current versions of FASTA and BLAST on
franklin.achs.virginia.edu. To use the programs (as well as
the BED tools and other bioinformatics tools), you should need to modify the ~/.variables.ksh file in your home directory on franklin.achs. Simply type:
Once you have modified your .variables.ksh file, you should reload it:
. (dot) .variables.ksh
and then you should be able to find the similarity searching programs, e.g. type:
blastp -h
to see the blastp options.
Once you have the /seqprg/bin directory in your path, you can do blast or ssearch (or fasta searches from the command line. Like the FASTA Web pages, we need to know three things:
The name of the program: ls /seqprg/bin lists the programs available. You will be using the following programs: /seqprg/bin/blastp, blastn, tblastn, blastx /seqprg/bin/fasta, fastx, tfastx, ssearch, and lalign
The location of the query sequence file. I have put some protein and DNA sequences in /data/slib2/bioc5080/data. You can use these directly, or copy them to your own directory.
The location of the sequence databases to be searched. For these exercises, you will be using the FASTA protein databases at /data/slib/fa_dbs/ and the BLAST protein databases at /data/slib/bl_dbs/*.psq (When you refer to the database names with blast, leave out the .psq.
For seaches against DNA databases with the FASTA programs, you can simply use 'm' for mammal, 'p' for primate, or 'b' for verteBrate. When using blastn or tblastn, you will refer to /data/slib/gb_asn/gbmam, gbpri, gbrod. To see all the possible names, type ls /data/slib/gb_asn/*.nal.
5. Simple command line searches
Reproduce the first search you did with fasta from the command line. Type:
(In this command line, -q makes the search "quiet", it does not prompt for the query or library file; -H turns off the histogram, and -S causes low complexity regions to be ignored in the initial scan.)
DNA and translated DNA searches. Use the blastn program to compare the mouse GSTM1 sequence /data/slib/bioc5080/data/mgstm1.nt to the mammalian division of genbank (/data/slib/gb_asn/gbmam).
Then do the same search with a translated comparison (use tblastn to compare /data/slib/bioc5080/data/mgstm1.aa to the mammalian division of genbank (/data/slib/gb_asn/gbmam. How many more sequences do you find?
Both blast and fasta have some output options that make it much easier to parse alignment results, looking for alignment boundaries, percent identity, E()-values, etc. Next week, we can use BioPerl to parse blast and fasta output, but you can produce a very simple form of the output using the fasta -q -m 8C option, or the blast -outfmt=7 option.
For fasta, you can use the fasta36 -q -H -m 8C option to get tab-delimited alignment summaries that look like blast -outfmt=7 output. Do the searches with gstt1_drome.aa vs swissprot saving the results in tab delimited format.
To get protein (or DNA) sequences on franklin.achs, you can use the blastdbcmd program. blastdbcmd -entry search_term
For example: blastdbcmd -entry gstp1_human > gstp1_human.aa
Would extract the gstp1_human sequence into the file gstp1_human.aa. Later in the course, we will use BioPerl to do this.
Extract a protein sequence and look at the description line. Why do you think are there multiple ">gi" entries on the line.
To get more information on fasta36 command line options, type:
man /seqprg/man/man1/fasta36.1
For more information on blast, look at /seqprg/doc/blast_user_manual.pdf.
HOMEWORK for Friday, Feb. 24
(1) Take a protein sequence of your choice, and compare it to the
swissprot database using blastp, fasta, and
ssearch. Prepare a table for all of the significant hits (E()
< 0.001) comparing the E()-values and alignment lengths found with the
three different programs. Use the -s BP62 for fasta
and ssearch to make sure the same scoring matrix is used.
(2) (a) Take a protein sequence of your choice, and compare it to the
proteins in swissprot
(/data/slib/fa_gbs/swissprot.lseg for fasta,
/data/slib/bl_dbs/pir1 for blast.
(b) Write a program that automatically
searches for the highest scoring non-homolog by extracting candidate
non-homologs from the results file -- you may want to use -outfmt=7
(blast) or -m 8 (fasta) -- and then re-searching the swissprot
database with the library sequence as query to see whether the
candidate non-homolog shares significant similarity with proteins that
share significant similarity with the original query.
Significant similarities within sequences - domain duplication
6. Exploring domains with local alignments --- Calmodulin
Use lalign [pgm] to examine local
similarities between calmodulin CALM_HUMAN and itself.
Use plalign [pgm] to plot the same alignment. How many repeats are present in this sequence.
What happens to the domain alignment plot when you use a shallower scoring matrix (try BP62, MD20).
7. Exploring domains with local alignments --- Cortactin (SRC8_HUMAN)
Use lalign [pgm] to examine local
similarities between SRC8_HUMAN and itself.
Note the average percent identity for the some of the most significant alignments.
Use plalign [pgm] to plot the same alignment. How many
repeats are shown in this alignment? How long do they appear to be?
Based on the percent identities you saw in part (a), what would the appropriate scoring matrix be to accurately identify the cortactin domains?
What happens to the domain alignment plot when you use a shallower
scoring matrix (try BP62, MD40). How do the BP62, PAM120, and MD40
alignments differ? Why are they different? Which matrix appears to
best identify all the repeats found in the PFAM diagram?
You can look at the PFAM annotation of this protein at PFAM: SRC8_HUMAN [pfam] (Cortactin). How many repeats are shown in this diagram? How long are they?