Worked Example BioMart Ensembl Tutorial
(MYLK_Bovin) is located on chromosome 1. What other known genes are found on cow chromosome 1? What are their Ensembl Gene IDs and Entrez Gene IDs? Do they have any domains predicted by Interpro? Follow the worked example below to answer these questions.
STEP 3: Under Choose Dataset, select the database, Ensembl genes (version 48), and the species of interest (Bos taurus genes).
STEP 4: Narrow the gene set by clicking Filters on the left. Click on the + in front of REGION to expand the choices.
STEP 6: Expand the GENE panel and select Status (gene) as known. The filters have determined our gene set. Click Count (at the top) to see how many genes have passed these filters.
STEP 7: Click on Attributes to select output options (i.e. what we would like to know about our gene set).
STEP 8: Expand the GENE panel. Ensembl Gene and Transcript IDs are selected by default.
Note the summary of selected options. The order of attributes determines the order of columns in the result table.
To save the complete table to a file, click Go. Alternatively, email the results to any address, or select View All rows as HTML.
STEP 12: Go back and change the output by clicking on Attributes and adding InterPro Short Description from the PROTEIN section.
STEP 13: Clicking Results should now show a table like this. Select View ALL rows.
Result Table
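The filters and attributes chosen in the worked example can also be written down as a BioMart XML query, which is one way to keep a record of a query or to rerun it from a script. The sketch below only builds the query text and a request URL; it does not contact any server. The dataset name, the filter and attribute internal names (chromosome_name, status, ensembl_gene_id, ensembl_transcript_id, entrezgene, interpro) and the martservice URL are assumptions that vary between Ensembl releases, so check them against the mart you are using. The helper build_biomart_query is hypothetical and not part of BioMart.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart XML query string for a single dataset.

    dataset    -- internal dataset name (assumed, release dependent)
    filters    -- dict mapping filter internal name to value
    attributes -- list of attribute internal names, in column order
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",      # tab-separated output
        "header": "1",           # include column headers
        "uniqueRows": "1",       # collapse identical rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    prolog = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return prolog + ET.tostring(query, encoding="unicode")


# Known cow genes on chromosome 1, with gene, transcript, Entrez and InterPro IDs.
# All internal names below are assumptions; they differ between releases.
cow_query = build_biomart_query(
    dataset="btaurus_gene_ensembl",
    filters={"chromosome_name": "1", "status": "KNOWN"},
    attributes=["ensembl_gene_id", "ensembl_transcript_id", "entrezgene", "interpro"],
)

# Assumed martservice endpoint; the query is passed as a URL parameter.
request_url = "http://www.ensembl.org/biomart/martservice?" + urlencode({"query": cow_query})

print(cow_query)
```

As in the web interface, the order of the Attribute elements determines the order of the columns in the result table.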
V) BIOMART - Exercises
These exercises have been designed to familiarise you with the different questions you can answer with this tool and the types of data you can retrieve with BioMart.
1. Retrieve all SNPs for novel human G-protein coupled receptor genes (GPCRs; use the InterPro domain ID IPR000276) on chromosome 2.
Note: As this is the first exercise, we walk you through BioMart step by step this time (but of course you can also try to do this exercise without our help!).
Start a new BioMart session by clicking New, or go back to the Ensembl homepage and click on Mine Ensembl with Biomart under Ensembl tools.
Choose the database and the dataset for your query as follows:
- Select Ensembl 48.
- Select Homo sapiens genes (NCBI36).
Click on Filters at the left. Filter this dataset to select your genes of interest as follows:
- Expand the REGION section at the right by clicking on the +. Select Chromosome 2. Click [count] at the top of the panel and note the number of Ensembl genes on Homo sapiens chromosome 2.
- In the GENE section, select Status (gene) NOVEL.
- In the PROTEIN section, select the second Limit to genes with these family or domain IDs option. Select Interpro ID(s) and enter IPR000276 in the box. Click [count] again and note that the number of genes is updated.
Click on Attributes (at the left). Select the output for your gene list as follows:
- Select the SNPs Attribute Page.
- In the GENE section, Ensembl Gene ID and Ensembl Transcript ID are selected by default; also select Ensembl Peptide ID and Ensembl Peptide length.
- In the GENE ASSOCIATED SNPs section, select Reference ID, Allele, Peptide location (aa), Location in Gene (coding etc), Synonymous Status and Peptide Shift.
Note: Clicking on count now will not show an altered number. Attribute selections do not affect the count (i.e. the number of genes that have passed the filters).
Click on Results (at the top) to obtain the first 10 rows of your table.
To obtain the entire table select View all rows as HTML or export a file by clicking Go.
Note that the output for this query gives you one row for each SNP, and if there are alternative transcripts then SNP data is given for each transcript. This means that a particular SNP may appear more than once. Find the coding SNPs, and note that you have information about the effect of each SNP and its location within the protein. Synonymous Status is yes for silent mutations. Two amino acids are shown in the Peptide Shift column if there are two alleles at the protein level. Peptide location (aa), Synonymous Status and Peptide Shift are all blank if the SNP is not in a coding region.
2. Click New to start a new query. Retrieve the gene structure (i.e. start and end coordinates of exons) of the mouse gene ENSMUSG00000042351.
3. Retrieve peptide sequences of all chicken genes on chromosome 1.
4. The file http://www.ebi.ac.uk/~xose/Affy_exercise.txt contains a list of probeset IDs from a microarray experiment using the Affymetrix array HG-U133 Plus 2.0 (human). Retrieve the 500 bp upstream of the transcripts matching these probeset IDs.
5. Retrieve the 5' UTR sequence of cow genes on chromosome 5 that possess a 5' UTR.
6. Retrieve the sequence (including the reference ID in the header) of all human SNPs that have an ID from The SNP Consortium (TSC) and lie on chromosome 6 between 15 Mb and 15.2 Mb, with 200 bases of flanking sequence.
7. Retrieve the mouse homologues of the Homo sapiens genes CASP1, CASP2, CASP3, and CASP4. (These are HGNC symbols for the genes.)
8. Design your own query!
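For exercise 1, the exported table can also be summarised with a short script instead of being inspected by eye. The following is a minimal sketch that counts rows, collapses SNPs repeated across alternative transcripts, and picks out coding and non-synonymous SNPs, using the rule stated above that Peptide location (aa) is blank outside coding regions. The sample rows, transcript names and rs identifiers are invented for illustration, and the column headers are assumed to match the attribute names; a real export may label them differently.

```python
import csv
import io

# Hypothetical rows in the shape of a TSV export; not real BioMart output.
SAMPLE_TSV = (
    "Ensembl Transcript ID\tReference ID\tLocation in Gene\t"
    "Peptide location (aa)\tSynonymous Status\tPeptide Shift\n"
    "TRANSCRIPT_A\trs0001\tintronic\t\t\t\n"
    "TRANSCRIPT_A\trs0002\tcoding\t45\tno\tA/T\n"
    "TRANSCRIPT_B\trs0002\tcoding\t45\tno\tA/T\n"
    "TRANSCRIPT_A\trs0003\tcoding\t60\tyes\tL\n"
)


def summarise_snps(tsv_text):
    """Summarise a SNP table that has one row per SNP per transcript."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

    # The same SNP is listed once per transcript, so collapse on Reference ID.
    unique_snps = {row["Reference ID"] for row in rows}

    # A SNP is treated as coding when its peptide location is not blank.
    coding = {
        row["Reference ID"]: row
        for row in rows
        if row["Peptide location (aa)"].strip()
    }

    # Non-synonymous SNPs change the encoded amino acid.
    non_synonymous = sorted(
        snp for snp, row in coding.items()
        if row["Synonymous Status"].strip().lower() == "no"
    )

    return {
        "rows": len(rows),
        "unique_snps": len(unique_snps),
        "coding_snps": sorted(coding),
        "non_synonymous": non_synonymous,
    }


summary = summarise_snps(SAMPLE_TSV)
print(summary)
```

To use it on a real export, read the saved TSV file into a string and pass it to summarise_snps, adjusting the column names to those in the file header.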
Answers (BioMart)
1. You should find one novel gene on chromosome 2 with this InterPro domain. (Note: more than one gene can have the same InterPro domain.) The result set has one transcript and a total of 261 rows of output (to see this, change the option from TSV to XLS under Export all results and click Go, then open the file in Excel so you don't have to count the rows manually). The transcript has 9 coding SNPs (Location in Gene is coding), most of which are non-synonymous (Synonymous Status is no) and thus affect the amino acid sequence of the encoded peptide. One allele is a stop codon. Can you find it?
2. Click New. Select:
Database and dataset: Ensembl 48 and Mus musculus genes (NCBIM36).
Filters: GENE: ID list limit: Ensembl Gene ID(s): enter the mouse gene ID.
Attributes: Structures: in the EXON panel, select Ensembl Exon ID, Exon Start and Exon End.
Click Results. You should find 7 exons. Follow the link from the Ensembl Gene ID in your output back to the GeneView page to check the BioMart data against the gene structure displayed on that page.
3. Database and dataset: Ensembl 48 and Gallus gallus genes (NCBI36).
Filters: REGION: Chromosome 1.
Attributes: Sequences: Peptide Sequences, and add to the header Description and Ensembl Peptide ID along with the default options (Ensembl Gene ID and Transcript ID).
Count should show 2297 Ensembl genes.
4. Database and dataset: Ensembl 48 and Homo sapiens genes (NCBI36).
Filters: GENE: ID list limit: Affy hg u133 plus 2 ID(s), and enter the list of probeset IDs.
Attributes: Sequences: select Flank (Transcript), Upstream flank 500. In the header, in addition to the options selected by default, select Ensembl Transcript ID.
You should find upstream sequences for the transcripts of 31 genes. (Hint: click count to see the number of genes!)
5. Database and dataset: Ensembl 48 and Bos Taurus genes (NCBI36).
Filters: REGION: Chromosome 5. GENE: Entries with a 5' UTR Only.
Attributes: Sequences: select 5' UTR.
Count should show 547 genes.
FYI: The Flank option in the Sequences Attribute page:
If you choose the option Flank (Gene), you will see only one upstream sequence per gene in the output. Where a gene has multiple transcripts, the upstream sequence of the transcript that extends furthest at the 5' end is shown. If you want to export the upstream sequence for each transcript, choose the option Flank (Transcript).
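The difference between the two Flank options can be illustrated with coordinates. The sketch below follows the description above: Flank (Transcript) gives one upstream region per transcript, while Flank (Gene) gives the upstream region of the transcript that extends furthest at the 5' end. It is an illustration of that rule, not BioMart's own code; the Transcript type, the function names and the example coordinates are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Transcript(NamedTuple):
    transcript_id: str
    start: int   # 1-based genomic start, start <= end
    end: int     # 1-based genomic end, inclusive
    strand: int  # +1 for forward, -1 for reverse


def upstream_flank(t: Transcript, length: int) -> Tuple[int, int]:
    """Genomic (start, end) of the `length` bases upstream of the 5' end."""
    if t.strand == 1:
        # Forward strand: the 5' end is the start; clamp at position 1.
        return (max(1, t.start - length), t.start - 1)
    # Reverse strand: the 5' end is the genomic end (chromosome length not checked).
    return (t.end + 1, t.end + length)


def transcript_flanks(transcripts: List[Transcript], length: int):
    """Flank (Transcript) style: one upstream region per transcript."""
    return {t.transcript_id: upstream_flank(t, length) for t in transcripts}


def gene_flank(transcripts: List[Transcript], length: int) -> Tuple[int, int]:
    """Flank (Gene) style: upstream of the transcript reaching furthest 5'."""
    if transcripts[0].strand == 1:
        furthest = min(transcripts, key=lambda t: t.start)
    else:
        furthest = max(transcripts, key=lambda t: t.end)
    return upstream_flank(furthest, length)


# Hypothetical forward-strand gene with two transcripts.
forward = [Transcript("T1", 1000, 5000, 1), Transcript("T2", 1200, 5000, 1)]
# Hypothetical reverse-strand gene with two transcripts.
reverse = [Transcript("R1", 1000, 5000, -1), Transcript("R2", 1000, 5300, -1)]

per_transcript = transcript_flanks(forward, 500)
per_gene_forward = gene_flank(forward, 500)
per_gene_reverse = gene_flank(reverse, 500)
print(per_transcript, per_gene_forward, per_gene_reverse)
```

For the forward-strand gene, the per-gene region coincides with the region for T1 only, which is why Flank (Transcript) is the option to choose when every transcript's upstream sequence is needed.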
6. Database: SNP and dataset: Homo sapiens SNPs (dbSNP127; HGVbase 15; TSC 1; affy GeneChip Mapping Array).
Filters: REGION: Chromosome 6, Base pair Start 15000000, Base pair End 15200000. GENERAL SNP FILTERS: SNP source: SNPs with TSC ID(s) Only.
Attributes: Sequences: SEQUENCES: SNP sequences, Upstream flank 200, Downstream flank 200. SNP: SNP attributes, select Reference ID.
You should find 69 SNPs.
7. Database: Ensembl 48. Dataset: Homo sapiens genes (NCBI36).
Filters: GENE: ID list limit: HGNC Symbol(s). Enter the human HGNC (HUGO) symbols in the box: CASP1, CASP2, CASP3, and CASP4.
Attributes: Under Homologs, select Mouse Ensembl Gene ID and Mouse External ID in the MOUSE ORTHOLOGS panel. Also select Ensembl Gene ID and Transcript ID (default options) and Description in the GENE panel (these will be for the starting dataset, i.e. human).
Results displays the mouse ortholog