Bioch5080 - Similarity Searching Exercises

fasta.bioch.virginia.edu/bioch508/bioch508_ex1.html


These exercises use programs on the FASTA WWW Search page and the Bioch508 BLAST WWW Search page [pgm].

In the links below, [pgm] indicates a link with most of the information filled in; e.g. the program name, query, and library. [seq] links go to the NCBI, for more information about the sequence.


Identifying homologs and non-homologs; effects of scoring matrices and algorithms

1. Use the FASTA search page [pgm] to compare Drosophila glutathione transferase GSTT1_DROME [seq] (gi|121694) to the PIR1 Annotated protein sequence database. Be sure to press , not .

  1. Take a look at the output.
    1. How long is the query sequence?
    2. How many sequences are in the PIR1 database?
    3. What scoring matrix was used?
    4. What were the gap penalties?
    5. What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
    6. Looking at an alignment, where are the boundaries of the alignment (the best local region)?

  2. What is the highest scoring non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) How would you confirm that your candidate non-homolog was truly unrelated?

  3. Note that this drosophila glutathione transferase shares significant similarity with both sequences from bacteria (SSPA_ECO57, stringent starvation protein) and mammals. How might you test whether the stringent starvation protein is homologous to glutathione transferases? (Hint - compare your candidate non-homolog with SwissProt for a more reliable test.)

  4. Compare the expectation (E()) value for the distant relationship between GSTT1_DROME and GSTP1_HUMAN (class-mu). How would you demonstrate that GSTT1_DROME is homologous to GSTP1_HUMAN?
  5. Examine how the expectation value changes with different scoring matrices (BLOSUM62, BlastP62, PAM250) and different gap penalties. (The default scoring matrix for the FASTA programs is BLOSUM50, with gap penalties of -10 to open a gap and -2 for each residue in the gap - e.g. -12 for a one residue gap).
    1. What happens to the E()-value for the 100% identical sequence with the different matrices and different gap penalties?
    2. What happens to the E()-value for distant homologs, like GSTA1_RAT with the different matrices and different gap penalties?
    3. What happens to the E()-value for the highest scoring unrelated sequence with the different matrices?

  6. Try the search with ssearch [pgm] (Smith-Waterman). Again, look at the E()-values for distant homologs and the highest scoring unrelated sequence.

2. Do the same search (121694) using the Course BLAST [pgm] WWW page.

  1. Take a look at the output.
    1. How long is the query sequence?
    2. How many sequences are in the PIR1 database?
    3. What scoring matrix was used?
    4. What were the gap penalties?
    5. What are the numbers after the description of the library sequence? Which one is best for inferring homology?
    6. Looking at an alignment, where are the boundaries of the alignment (the best local region)?

  2. What is the highest scoring non-homolog?

  3. How do the blastp E()-values compare with the FASTA (blosum62) E()-values for the distantly related mammalian and plant sequences?

Comparison of Protein:Protein, translated DNA:protein to DNA:DNA searches - more sensitive DNA searches
3. In the next three exercises, we will try to find gstt1_drome homologs in the Arabidopsis genome, using (a) protein:protein (BLASTP), (b) DNA:protein (BLASTX), (c) protein:DNA (TBLASTN), and (d,e) DNA:DNA (BLASTN) searches.

In each of the exercises below, the BLASTP, BLASTX etc. links are pre-set to search Arabidopsis sequences.

  1. BLASTP Compare the GSTT1_DROME [seq] (gi|121694) protein sequence to Arabidopsis proteins using BLASTP [pgm].

    What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1and ATGSTU4

  2. BLASTX Try the same search using the GSTT1_DROME cDNA DMGST [seq] (gi|8033) against Arabidopsis proteins using BLASTX [pgm].

    What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1and ATGSTU4

  3. TBLASTN. Use GSTT1_DROME [seq] (gi|121694) against translated Arabidopsis DNA using TBLASTN [pgm].

    What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1and ATGSTU4

  4. Finally, try the DNA:DNA comparison. Use BLASTN [pgm] to compare dmgst (gi|8033) to the Reference mRNA sequences (refseq_rna) sequences in Arabidopsis.

    Are there detectable Arabidopsis homologues?


Searching from the command line -- BLAST and FASTA

We have installed current versions of FASTA and BLAST on franklin.achs.virginia.edu. To use the programs (as well as the BED tools and other bioinformatics tools), you should need to modify the ~/.variables.ksh file in your home directory on franklin.achs. Simply type:

cat /data/slib/info/init_seqprg.sh >> .variables.ksh
Once you have modified your .variables.ksh file, you should reload it:
. (dot) .variables.ksh
and then you should be able to find the similarity searching programs, e.g. type:
blastp -h
to see the blastp options.

Once you have the /seqprg/bin directory in your path, you can do blast or ssearch (or fasta searches from the command line. Like the FASTA Web pages, we need to know three things:

  1. The name of the program: ls /seqprg/bin lists the programs available. You will be using the following programs:
    /seqprg/bin/blastp, blastn, tblastn, blastx
    /seqprg/bin/fasta, fastx, tfastx, ssearch, and lalign

  2. The location of the query sequence file. I have put some protein and DNA sequences in /data/slib2/bioc5080/data. You can use these directly, or copy them to your own directory.
  3. The location of the sequence databases to be searched. For these exercises, you will be using the FASTA protein databases at /data/slib/fa_dbs/ and the BLAST protein databases at /data/slib/bl_dbs/*.psq (When you refer to the database names with blast, leave out the .psq.

    For seaches against DNA databases with the FASTA programs, you can simply use 'm' for mammal, 'p' for primate, or 'b' for verteBrate. When using blastn or tblastn, you will refer to /data/slib/gb_asn/gbmam, gbpri, gbrod. To see all the possible names, type ls /data/slib/gb_asn/*.nal.


5. Simple command line searches
  1. Reproduce the first search you did with fasta from the command line. Type:
    fasta36 -q -H -S /data/slib/bioc5080/data/gstt1_drome.aa /data/slib/fa_dbs/pir1.lseg > gstt1.fasta_pir1_results
    
    (In this command line, -q makes the search "quiet", it does not prompt for the query or library file; -H turns off the histogram, and -S causes low complexity regions to be ignored in the initial scan.)

  2. You could also search SwissProt by typing:
    fasta36 -q -H -S /data/slib/bioc5080/data/gstt1_drome.aa /data/slib/fa_dbs/swissprot.lseg > gstt1.fasta_sp_results
    
    Take a look at your fasta results.

  3. Now try the same search with blastp. You can quickly find the blastp command options by typing: blastp.
    blastp -query /data/slib/bioc5080/data/gstt1_drome.aa -db pir1 > gstt1.blastp_pir1_results
    
    Take a look at your blast results.

  4. DNA and translated DNA searches. Use the blastn program to compare the mouse GSTM1 sequence /data/slib/bioc5080/data/mgstm1.nt to the mammalian division of genbank (/data/slib/gb_asn/gbmam).

    Then do the same search with a translated comparison (use tblastn to compare /data/slib/bioc5080/data/mgstm1.aa to the mammalian division of genbank (/data/slib/gb_asn/gbmam. How many more sequences do you find?

  5. Both blast and fasta have some output options that make it much easier to parse alignment results, looking for alignment boundaries, percent identity, E()-values, etc. Next week, we can use BioPerl to parse blast and fasta output, but you can produce a very simple form of the output using the fasta -q -m 8C option, or the blast -outfmt=7 option.

    For fasta, you can use the fasta36 -q -H -m 8C option to get tab-delimited alignment summaries that look like blast -outfmt=7 output. Do the searches with gstt1_drome.aa vs swissprot saving the results in tab delimited format.

  6. To get protein (or DNA) sequences on franklin.achs, you can use the blastdbcmd program.
    blastdbcmd -entry search_term
    For example:
    blastdbcmd -entry gstp1_human > gstp1_human.aa
    Would extract the gstp1_human sequence into the file gstp1_human.aa. Later in the course, we will use BioPerl to do this.

    Extract a protein sequence and look at the description line. Why do you think are there multiple ">gi" entries on the line.


To get more information on fasta36 command line options, type:

man /seqprg/man/man1/fasta36.1
For more information on blast, look at /seqprg/doc/blast_user_manual.pdf.
HOMEWORK for Friday, Feb. 24

(1) Take a protein sequence of your choice, and compare it to the swissprot database using blastp, fasta, and ssearch. Prepare a table for all of the significant hits (E() < 0.001) comparing the E()-values and alignment lengths found with the three different programs. Use the -s BP62 for fasta and ssearch to make sure the same scoring matrix is used.

(2) (a) Take a protein sequence of your choice, and compare it to the proteins in swissprot (/data/slib/fa_gbs/swissprot.lseg for fasta, /data/slib/bl_dbs/pir1 for blast.

    (b) Write a program that automatically searches for the highest scoring non-homolog by extracting candidate non-homologs from the results file -- you may want to use -outfmt=7 (blast) or -m 8 (fasta) -- and then re-searching the swissprot database with the library sequence as query to see whether the candidate non-homolog shares significant similarity with proteins that share significant similarity with the original query.


Significant similarities within sequences - domain duplication

6. Exploring domains with local alignments --- Calmodulin

  1. Use lalign [pgm] to examine local similarities between calmodulin CALM_HUMAN and itself.
  2. Use plalign [pgm] to plot the same alignment. How many repeats are present in this sequence.
  3. What happens to the domain alignment plot when you use a shallower scoring matrix (try BP62, MD20).

7. Exploring domains with local alignments --- Cortactin (SRC8_HUMAN)

  1. Use lalign [pgm] to examine local similarities between SRC8_HUMAN and itself. Note the average percent identity for the some of the most significant alignments.

  2. Use plalign [pgm] to plot the same alignment. How many repeats are shown in this alignment? How long do they appear to be?

  3. Based on the percent identities you saw in part (a), what would the appropriate scoring matrix be to accurately identify the cortactin domains?

  4. What happens to the domain alignment plot when you use a shallower scoring matrix (try BP62, MD40). How do the BP62, PAM120, and MD40 alignments differ? Why are they different? Which matrix appears to best identify all the repeats found in the PFAM diagram?

  5. You can look at the PFAM annotation of this protein at PFAM: SRC8_HUMAN [pfam] (Cortactin). How many repeats are shown in this diagram? How long are they?

Where to get the FASTA package: faculty.virginia.edu/wrpearson/fasta/

The "normal" FASTA WWW site:

Contact Bill Pearson: wrp@virginia.edu


Course Home Page