Genetic Disease Poster Project

Bioinformatics II: Introduction to Techniques to Access More Databases

(adapted from Dr. Nick Ewing)

When we last left our genetic diseases, we had identified gene symbols and associated them with the nucleotide sequences of their genes and mRNAs, and also with the amino acid sequences of the polypeptides coded for by the genes' base sequences. All of this was done through the NCBI website and its links to various databases. Today we will explore a few more links to get information about (1) the locations of the disease genes on the chromosomes, (2) the average sizes of human chromosomes in terms of nucleotide pairs (bp), (3) the location of the coding sequences in the mRNA, and (4) the molecular weight of the encoded polypeptide. As we did last time, we will start with an example of finding that information for the G6PD gene, and then you will have time to find similar information for your disease gene. You will probably want to save that information or e-mail it to yourself, because it will be useful for your poster.

A. Finding Genes on Chromosomes

1. Let's start at the NCBI Home Page (http://www.ncbi.nlm.nih.gov/) again. Go to the right-hand column "Hot Spots" and click on "Map Viewer."

2. Here in the left-hand column, you can select a group or organism. Select Homo sapiens.

3. When you click "Go," you should reach a page with ideograms (idealized pictures) of the human chromosome set at the top. In the "Search for" box at the top left, type in the name of your disease gene. For our example I type in G6PD and then click "Go."

4. This brings up a page with the same ideograms, except that some have red marks on them. The red marks represent the loci (plural of locus or location) of G6PD or similar genes. Since I know the G6PD gene is on the X chromosome from my previous search of OMIM, I can focus on the red mark on the X. Notice it is present near the end of the longer arm of the chromosome; this is q28 on the X. This picture of the whole set of human chromosomes and the locus of the disease gene should be saved for your poster.

5. To get a more detailed look at the X chromosome and the G6PD locus, click on the X under the chromosome's ideogram. This should take you to several views of part of the chromosome. When you scan down the right-hand side of most of the maps, you will be able to find the gene name or its identification number. The letters may be small. Other genes on the same chromosome are also noted. In the left-hand column, just above the ideogram, you can select the scale of the map shown. Move the cursor to the maximum "zoom out" and you should see "show full chromosome." A single click there will give you a full and crowded view of the chromosome.

6. Above the maps you can see under "Region Displayed" which nucleotide pairs of the chromosome are represented. In this case, since we are looking at the full chromosome, we see 0-155 M bp (M = 10⁶), indicating that the total amount of DNA in the human X chromosome is 155 × 10⁶ bp. Record the total amount of DNA in the chromosome your disease gene is on for your poster.

7. Scroll down to "Summary of Maps" and look at "Map 4 Ensembl Genes On Sequence." You can see the "Total Genes on Chromosome" (in this case 2005); save this number as well, so that you can compute the average size (#bp) of a gene on the chromosome for your poster.

8. To see some of the genes on the chromosome a little more clearly, "zoom in" to "show 1/100th of the chromosome" and single click. You should get a larger view of the 1/100th of the chromosome containing the disease gene. Now click on the "Table View" of Map 1 Genes on Sequence. Here you should find the nucleotide at which different genes start and end in the DNA of the chromosome, along with the gene symbol, links to information about the gene, the position of the gene on the chromosome band map, and a brief description of the gene's product. This will let you find genes that are said to be linked to your gene.
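The arithmetic set up in step 7 can be sketched in a few lines of Python. This is only an illustration using the X chromosome values from the example above (155 × 10⁶ bp and 2005 genes); the variable and function names are made up for this handout. Keep in mind that dividing total bp by the number of genes gives the average amount of chromosome per gene, including the DNA between genes, so it is best read as an upper estimate of the size of a typical gene.

```python
# Values read from Map Viewer for the human X chromosome in the example above.
# Replace them with the numbers for the chromosome carrying your disease gene.
chromosome_bp = 155 * 10**6   # total DNA in the chromosome, in base pairs
total_genes = 2005            # "Total Genes on Chromosome"


def average_bp_per_gene(total_bp, gene_count):
    """Return the average number of base pairs of chromosome per gene."""
    return total_bp / gene_count


average = average_bp_per_gene(chromosome_bp, total_genes)
print(f"About {average:,.0f} bp of chromosome per gene")
```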

B. Finding the first and last codons for the polypeptide in the mRNA

Since we now know that the gene and the mRNA base sequences are longer than the bases needed to code for the amino acid sequence of the polypeptide, we are faced with complications when a new piece of nucleic acid is sequenced. That piece of nucleic acid might be a part of a chromosome that is under study or it may be a copy of an mRNA that has been isolated from a specialized cell. We might be interested in the structure and function of the polypeptide being encoded. But we would have to find out: Does our base sequence code for a polypeptide and if so where does the coding sequence begin and end?

One approach to answering this question is to start from the 5' end of the sequence, translate it 3 bases at a time without skipping or re-reading bases (since we know that each amino acid is encoded by continuous, non-overlapping triplets in mRNA) and see if we get a continuous chain of 50 or more amino acids. But we must remember that the reading sequence could start with the second base and reading 3 bases at a time from that point would yield an entirely different set of codons. Further, the reading sequence could start with the third base and reading 3 bases at a time from that point would yield a third different set of codons. Moreover, if we were working with a sequence in DNA, we would have 3 other possible ways to read, and I hope you know why. Each of the possible ways to start the decoding of triplets is called a reading frame. Finding the correct start and end base for translating the base sequence into a likely polypeptide is known as finding an open reading frame. It would be a lot of work to do this manually with our codon dictionaries. Fortunately, there are bioinformatics programs to do the work for us; we will use ORF (Open Reading Frame) Finder, accessible through the NCBI Home Page.
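To make the idea of six reading frames concrete, here is a minimal Python sketch of the search described above. It is not the algorithm used by NCBI's ORF Finder, just a simplified illustration under stated assumptions: it looks only for an ATG start codon followed in the same frame by the first stop codon, it does not apply the "50 or more amino acids" cutoff, and the function names and the toy sequence are invented for this example.

```python
# Simplified open-reading-frame search (illustrative only, not NCBI's method).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset):
    """Yield (start, end) positions of ATG...stop stretches in one frame.

    Positions are 0-based and `end` is exclusive (it includes the stop codon).
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None


def longest_orf(sequence):
    """Return (strand, frame, start, end, orf) for the longest ORF found.

    Both strands are read in frames 1, 2, and 3, giving six frames in all.
    Returns None if no ATG...stop stretch exists.
    """
    seq = sequence.upper().replace("U", "T")  # accept RNA or DNA letters
    best = None
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand_seq, offset):
                if best is None or end - start > best[3] - best[2]:
                    best = (strand, offset + 1, start, end, strand_seq[start:end])
    return best


# A made-up 19-base sequence, far shorter than any real mRNA.
toy_sequence = "GGATGGCTTGTAAATAACC"
print(longest_orf(toy_sequence))
```

With a real mRNA you would paste in the bases from the FASTA record (without the header line) in place of the toy sequence. The bases before the reported start would then correspond to the 5' untranslated region and the bases after the reported end to the 3' untranslated region.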

1. Return to the NCBI Home Page and access the "Full Report" page on your gene, in which the diagram of the gene and links to the pre-mRNA, mRNA, and polypeptide sequences were given. Access the mRNA base sequence in FASTA format (so that there are no index numbers along with the base symbols); copy the base sequence.

2. Now go back to the NCBI Home Page and look down the Hot Spots column on the right for ORF Finder. Clicking on ORF Finder gives you a page that will allow you to paste your base sequence in the box under "or sequence in FASTA format," and to click OrfFind.

3. You should now see a page with six narrow boxes on the left, each representing the product of translating the base sequence with a different reading frame. ORF Finder does not keep track of whether you pasted in a DNA or an RNA base sequence, so it will give you 6 possible reading frames. The blue colored portion of each box represents the part of the base sequence that starts with a start codon, codes for amino acids, and ends with a stop codon. The most likely candidate for the ORF is the box with the longest continuous blue color. Click on that.

4. You should now see indexed rows of the base sequences that are being decoded alternating with rows of symbols for the amino acids encoded by the triplets right above them. Now you can see which base is the first to be translated and which is the last. You can also see where the 5' untranslated region (UTR) ends, where the 3' UTR begins, and where there are potential start sites for alternate translations of the message. Make a copy of this information for use on your poster.

C. Finding the Molecular Weight of Your Polypeptide (and possibly other protein features)

1. Return to the NCBI Home Page and access the "Full Report" page on your gene, in which the diagram of the gene and links to the pre-mRNA, mRNA, and polypeptide sequences were given. Access the amino acid sequence of the polypeptide in FASTA format (so that there are no index numbers along with the amino acid symbols); copy the amino acid sequence.

2. Access the following website: http://ca.expasy.org/tools/#proteome. This site provides programs for analyzing amino acid sequences. While much of the available proteomic analysis is beyond our present needs, be aware that predictions and comparisons of protein modifications, secondary and tertiary structures, and sequences localizing proteins to different parts of the cell are available for more advanced projects. Go to the blue band marked "Primary Structure Analysis," find the link marked "Compute pI/Mw," and click on it.

3. You should get to a page allowing you to paste in the amino acid sequence and click "Click here to compute pI/Mw" below it. Presto, you should get the isoelectric point and molecular weight at the bottom of the page. For G6PD, that would be pI = 8.23 and MW = 62468.35. You won't need the pI, but do save the molecular weight of your polypeptide for your poster.

4. Go back to the Expasy Primary Structure Analysis section, but this time click on the Colorseq Tool near the end of the section.

5. On this page you can again paste in your copied amino acid sequence and search for certain strings of specific amino acids. For example, polypeptides that are destined to associate with membranes often contain "signal peptides" comprising 15-20 hydrophobic amino acids near the amino terminus. So we might request a "pre-defined residue set" of Hydrophobic amino acids be colored red. When we click "Submit," we get a visual output showing us where the hydrophobic amino acids are located in the polypeptide. Does it look like G6PD contains a signal peptide?
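The two calculations in this section can also be sketched in Python, which may help show what the web tools are doing. This is a rough illustration, not the ExPASy implementation. The molecular weight is estimated by adding up average residue masses (standard textbook values in daltons) plus one water molecule for the free ends of the chain, so the result may differ slightly from the Compute pI/Mw output. The set of residues treated as hydrophobic here (A, V, L, I, M, F, W) is an assumption and may not match the pre-defined set in the Colorseq tool. The function names and the six-residue toy peptide are invented for this example.

```python
# Average residue masses in daltons (each amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the two chain ends

# Assumption: a common choice of hydrophobic residues, not necessarily
# the same set that Colorseq uses.
HYDROPHOBIC = set("AVLIMFW")


def clean(protein):
    """Remove spaces and line breaks and convert to upper case."""
    return "".join(protein.split()).upper()


def molecular_weight(protein):
    """Estimate the molecular weight of a polypeptide in daltons."""
    return sum(RESIDUE_MASS[aa] for aa in clean(protein)) + WATER_MASS


def mark_hydrophobic(protein):
    """Show hydrophobic residues in upper case and all others in lower case."""
    return "".join(
        aa if aa in HYDROPHOBIC else aa.lower() for aa in clean(protein)
    )


# A made-up peptide; paste your own sequence (without the FASTA header line).
toy_peptide = "MKTLLA"
print(f"Estimated MW: {molecular_weight(toy_peptide):.2f} Da")
print(mark_hydrophobic(toy_peptide))
```

A long unbroken run of upper-case letters near the start of the marked sequence is the kind of pattern you would look for when deciding whether a polypeptide might carry a signal peptide.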

BIO 2 Assignment Due October 31, 2008

1. With what codon does translation of the G6PD mRNA begin? With what codon does it end? [Note that the NCBI sites do not bother to convert T's in DNA to U's in RNA.]

2. In prokaryotes there is usually a characteristic sequence located 6-7 nucleotides to the 5' side of (that is, before) the start codon of an mRNA. This is known as the Shine-Dalgarno sequence; it is usually 5' AGGAGG, and it is complementary to a sequence in the prokaryotic 16S rRNA. The complementarity helps the mRNA bind to the small ribosomal subunit to begin translation.

a. Write the sequence in the 16S rRNA that would be complementary to the Shine-Dalgarno sequence. Mark its 5' end.

b. If you expected the human mRNA to be like a prokaryotic mRNA, over what bases (identify them by number) in the mRNA would you scan for the AGGAGG sequence? Did you find it there?

3. Chromosomes in karyotypes (characteristic sets for species) are numbered from 1 to n, where 1 is the largest chromosome and n is the smallest.

a. Which chromosome in the human set would you expect to contain the fewest DNA bp? Explain your answer.

b. Find the total number of bp in that chromosome using Map Viewer and record it below.

4. The G6PD locus is said to be highly polymorphic; many different G6PD alleles (resulting from random mutations) exist in the human population. Some of the mutant alleles cause no phenotypic changes, while others cause mild symptoms of hemolytic anemia, and still others cause severe symptoms. One mutant allele causing severe symptoms is a base substitution of TCC-to-TTC in the DNA coding strand (not the template strand), resulting in a change at amino acid 188 in the polypeptide.

a. Would this change be considered a frameshift, missense, nonsense, or silent mutation? Explain your answer.

b. A table on p. 79 of your text shows some chemical properties of amino acid side groups. What amino acid substitution would result from the mutation described above and how might that substitution affect the side group interactions?