CSHL Computational Genomics - Ensembl/EnsMart


These exercises begin at Ensembl, and also use BioMart.

Basic gene structure - linking to other databases

  1. Exploring features related to a gene
    1. Find the gene report for the human HRAS gene. Hint: use the search/find boxes to do a text search.
    2. How many transcripts are predicted for this gene? What is the size of the longest predicted mRNA? How many exons does it have?
    3. With what diseases is HRAS associated? What is its cellular function? Hint: follow some of the links in GeneView.
    4. How many amino acids does it code for? Which InterPro domains does the protein product contain?
    5. Find the GO section of GeneView and follow the links to explore the Gene Ontology terms (describing gene and protein function) in Ensembl GOView.
    6. In which chromosomal band is HRAS located? On which clone and contig in the genomic sequence assembly?
    7. Which regions in and around the gene correspond to regions shown in the mouse matches track?
    8. Is there a putative mouse orthologue? If so, where is it in the mouse genome?

  2. Exploring a region
    1. Display the region between markers D12S1871 and D12S1604 in ContigView. Hint: start by clicking chromosome 12 on the human homepage.

      Centre the display around the CSAD gene by clicking on this gene in Overview. In Detailed View, why are some genes above and some below the DNA contigs?

    2. Note the large gap in the assembly. Turn on the Gaps track under Decorations and you will also see smaller gaps. Find out what the small gaps represent, and zoom in to see them in the DNA (contigs) track in ContigView. Export a FASTA file of the sequence of the displayed region and look for the gaps (Ns).
    3. What is the nearest marker to the start of the CSAD gene? How many synonyms does this marker have?
    4. Zoom in on the RARG gene. Identify a coding SNP and look at the corresponding SNPView page. Hint: turn on the SNP track if necessary.

  3. Go to the GeneView page for ENSG00000001626. From there, follow the link to EMBL entry M55110. Jump to the relevant LocusLink entry and back to the main Ensembl Gene Report. From here, follow the links to OMIM entry 219700 and SwissProt entry P13569.

  4. To which UniGene cluster does the following EST (accession number BM694439), isolated from a serially normalised cDNA library of human cells, map? How many entries are in this cluster? Which Ensembl transcripts correspond to this EST? To which OMIM entry does UniGene link this EST? To which RefSeq entry have we mapped this Ensembl transcript? What information can you provide about its genomic location and function?

  5. Take exons 10 to 12 from CFTR (ENSG00000001626) and BLAST them against the human, rat and mouse Ensembl cDNA sets.

    Which Ensembl genes are hit? Are any of them exact matches? To which Ensembl family does this correspond? Which InterPro domains are present in these proteins?

  6. Use the same sequence to query the genome using SSAHA. Which locations do you find?

    To which of the genes from the previous exercise do they correspond?

  7. Take the human chromosomal region flanked by genetic markers D22S687 and D22S59.

    What genes are there and what information can you obtain about them?
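
For the step in "Exploring a region" that asks you to export a FASTA file and look for the gaps (Ns), the search can also be done with a short script instead of by eye. The following is a minimal sketch, not part of the Ensembl tools: the function names and the example record are hypothetical, and it assumes the exported file holds a single sequence record.

```python
import re
from typing import Iterable, List, Tuple


def read_fasta_sequence(lines: Iterable[str]) -> str:
    """Join the sequence lines of a single-record FASTA, skipping the header line."""
    return "".join(
        line.strip()
        for line in lines
        if line.strip() and not line.startswith(">")
    )


def find_gaps(sequence: str, min_length: int = 1) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) coordinates of each run of N."""
    gaps = []
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_length:
            # match.start() is 0-based; match.end() is exclusive,
            # so it already equals the 1-based inclusive end.
            gaps.append((match.start() + 1, match.end()))
    return gaps


# Hypothetical toy record standing in for an exported region.
example = """>hypothetical_region
ACGTNNNNACGT
NNACGTACGTAC
""".splitlines()

seq = read_fasta_sequence(example)
for start, end in find_gaps(seq):
    print(f"gap at {start}-{end}, length {end - start + 1}")
```

To use it on a real export, pass an open file handle to read_fasta_sequence instead of the toy record. Coordinates are relative to the start of the exported region, not to the chromosome.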


BioMart

  1. Retrieve all mouse homologues of human disease genes containing transmembrane domains located between 1p22 and 1q22. Hint: Start with a human gene list. Include Disease OMIM ID and Disease description in the OUTPUT section.

  2. Retrieve the gene structure of the following mouse genes: ENSMUSG00000042351 ENSMUSG00000022393

    Hint: This can be done under Filter (Gene Section) Ensembl Gene IDs. Select Ensembl Gene ID, Ensembl Transcript ID, Ensembl Exon ID, Exon Start and Exon End. Export the gene structure in HTML format. Then take the link from the Ensembl genes to GeneView in order to confirm the gene structure.

  3. Retrieve the sequences 5kb upstream of all human known genes between D1S2806 and D1S464.

    Hint: Use the option "Genes transcript information ignored" under the sequence type options. Known genes in Ensembl are Ensembl gene predictions that could be mapped to external database entries (e.g. SwissProt) with a high similarity score.

  4. Retrieve all human SNPs that have a TSC ID, from chromosome 6 between 15 and 15.2 Mb, with 200 bases flanking sequence.

  5. Take this list and, knowing that it was obtained from an experiment with the GeneChip HG-U133A, retrieve 500 bp of conserved upstream sequence.

    Hint: Remember to untick Region to allow genome-wide searches, and tick Entries associated with upstream matches in mouse genome. This is a human probe set.

  6. Retrieve all mouse homologues of human disease genes containing transmembrane domains located between 1p22.3 and 1q22.

    Hint: Start with a human gene list. Include Disease OMIM ID and Disease description in the OUTPUT section.

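
When checking the output of the 5kb upstream exercise, it helps to remember that "upstream" depends on the strand of the gene. The sketch below shows the coordinate arithmetic only; BioMart does this for you when it exports the sequence. The function name and the example coordinates are hypothetical, and the sketch assumes 1-based inclusive coordinates with strand given as 1 or -1.

```python
from typing import Tuple


def upstream_window(gene_start: int, gene_end: int, strand: int,
                    length: int = 5000) -> Tuple[int, int]:
    """Return 1-based inclusive (start, end) of the region upstream of a gene.

    For a forward-strand gene the window lies before gene_start;
    for a reverse-strand gene it lies after gene_end.
    """
    if strand == 1:
        end = gene_start - 1
        # Clamp at the first base of the chromosome. A gene starting at
        # position 1 gives start > end, i.e. an empty window.
        start = max(1, end - length + 1)
    elif strand == -1:
        start = gene_end + 1
        # Not clamped at the chromosome end, since its length is not known here.
        end = gene_end + length
    else:
        raise ValueError("strand must be 1 or -1")
    return start, end


# Hypothetical gene coordinates, for illustration only.
print(upstream_window(10001, 12000, 1))
print(upstream_window(10001, 12000, -1))
```

For a reverse-strand gene the exported sequence would also need to be reverse-complemented so that it reads 5' to 3' towards the gene. That step is left out of this sketch.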
