Genetic Disease Poster Project
Bioinformatics II: Introduction to Techniques to Access More Databases
(adapted from Dr. Nick Ewing)
When we last left our genetic diseases, we had identified gene symbols and associated them with the nucleotide sequences of their genes and mRNAs, and also with the amino acid sequences of the polypeptides encoded by those genes. All of this was done through the NCBI website and its links to various databases. Today we will explore a few more links to get information about (1) the locations of the disease genes on the chromosomes, (2) the average sizes of human chromosomes in terms of nucleotide pairs (bp), (3) the location of the coding sequences in the mRNA, and (4) the molecular weight of the encoded polypeptide. As we did last time, we will start with an example of finding that information for the G6PD gene, and then you will have time to find similar information for your disease gene. You will probably want to save that information or e-mail it to yourself, because it will be useful for your poster.
A. Finding Genes on Chromosomes
1. Let's start at the NCBI Home Page
(http://www.ncbi.nlm.nih.gov/) again. Go
to the right hand column "Hot Spots"
and click on "Map Viewer."
2. Here in the left hand column, you
can Select a group or organism. Select Homo sapiens.
3. When you click "Go," you should reach a page with
ideograms (idealized pictures) of the human chromosome set at the top. In the
"Search for" box at the
top left, type in the name of your disease gene. For our example I type in G6PD and then click
"Go."
4. This brings up a page with the same ideograms, except that some have red marks on them. The red marks represent the loci (plural of locus, meaning location) of G6PD or similar genes. Since I know from my previous search of OMIM that the G6PD gene is on the X chromosome, I can focus on the red mark on the X. Notice that it lies near the end of the longer arm of the chromosome; this is band q28 on the X. Save this picture of the whole set of human chromosomes and the locus of the disease gene for your poster.
5. To get a more detailed look at the X
chromosome and the G6PD locus, click on the X under the chromosome's ideogram.
This should take you to several views of part of the chromosome. When you scan down the right hand side of most
of the maps, you will be able to find the gene name or its identification
number. The letters may be small. Other genes on the same chromosome are also
noted. In the left hand column, just
above the ideogram, you can select the scale of the map shown. Move the cursor to the maximum "zoom out" and you should see
"show full chromosome." A single click there will give you a full and
crowded view of the chromosome.
6. Above the maps you can see under "Region Displayed" which nucleotide pairs of the chromosome are represented. In this case, since we are looking at the full chromosome, we see 0-155 M bp (M = 10^6), indicating that the total amount of DNA in the human X chromosome is 155 x 10^6 bp. Record the total amount of DNA in the chromosome your disease gene is on for your poster.
7. Scroll down to "Summary of Maps" and look at
"Map 4 Ensembl
Genes On Sequence." You can see the "Total Genes on Chromosome" (in this case 2005); save this
number as well, so that you can compute
the average size (#bp) of a gene on the chromosome
for your poster.
8. To see some of the genes on the
chromosome a little more clearly, "zoom
in" to "show 1/100th of the chromosome" and
single click. You should get a larger
view of the 1/100th of the chromosome containing the disease gene. Now click on the "Table View" of Map
1 Genes on Sequence. Here you should
find the nucleotide at which different genes start and end in the DNA of the
chromosome, along with the gene symbol, links to information about the gene,
the position of the gene on the chromosome band map, and a brief description of
the gene's product. This will let you
find genes that are said to be linked to your gene.
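The arithmetic suggested in step 7 can be sketched in a few lines of Python. This is only an illustration using the two numbers read off Map Viewer for the X chromosome in the example above; the function name is made up for this handout. Note that dividing chromosome length by gene count gives the average amount of chromosomal DNA per gene, which includes the DNA lying between genes, so it is at best an upper estimate of average gene size.

```python
# Values from the G6PD / X chromosome example above (steps 6 and 7).
chromosome_length_bp = 155 * 10**6   # "Region Displayed": 0-155 M bp
total_genes = 2005                   # "Total Genes on Chromosome"


def average_bp_per_gene(length_bp, gene_count):
    """Return chromosome length divided by gene count (bp per gene).

    Hypothetical helper for this handout; it includes intergenic DNA,
    so it overestimates the size of a typical gene.
    """
    return length_bp / gene_count


avg = average_bp_per_gene(chromosome_length_bp, total_genes)
print(f"About {avg:,.0f} bp of chromosome per gene")
```

Substitute the length and gene count of your own chromosome to get the number for your poster.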
B. Finding the First and Last Codons for the Polypeptide in the mRNA
Since we
now know that the gene and the mRNA base sequences are longer than the bases
needed to code for the amino acid sequence of the polypeptide, we are faced
with complications when a new piece of nucleic acid is sequenced. That piece of nucleic acid might be a part of
a chromosome that is under study or it may be a copy of an mRNA that has been
isolated from a specialized cell. We
might be interested in the structure and function of the polypeptide being
encoded. But we would have to find
out: Does our base sequence code for a
polypeptide and if so where does the coding sequence begin and end?
One approach to answering this question is to start from the 5' end of the sequence, translate it 3 bases at a time without skipping or re-reading bases (since we know that amino acids are encoded by consecutive, non-overlapping triplets in mRNA), and see if we get a continuous chain of 50 or more amino acids. But we must remember that reading could instead start with the second base, and reading 3 bases at a time from that point would yield an entirely different set of codons. Further, reading could start with the third base, which would yield a third, different set of codons. Moreover, if we were working with a sequence in DNA, we would have 3 other possible ways to read, and I hope you know why. Each of the possible ways to start the decoding of triplets is called a reading frame. Finding the correct start and end bases for translating the base sequence into a likely polypeptide is known as finding an open reading frame. It would be a lot of work to do this manually with our codon dictionaries. Fortunately, there are bioinformatics programs to do the work for us; we will use ORF (Open Reading Frame) Finder, accessible through the NCBI Home Page.
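To make the idea of reading frames concrete, here is a minimal Python sketch that splits a sequence into triplets in all six frames: three offsets on the strand as given, and three on its reverse complement (the other DNA strand). The function names and the toy sequence are invented for illustration, and this is not a description of how ORF Finder is implemented.

```python
def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


def codons(seq, offset):
    """Split seq into complete triplets, starting at the given offset (0, 1, or 2)."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(sequence):
    """Return a dict mapping frame names (+1..+3, -1..-3) to lists of triplets."""
    dna = sequence.upper().replace("U", "T")  # treat RNA input like DNA
    other_strand = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(dna, offset)
        frames[f"-{offset + 1}"] = codons(other_strand, offset)
    return frames


toy = "ATGGCCTAAG"  # made-up sequence, not a real gene
for name, triplets in six_frames(toy).items():
    print(name, " ".join(triplets))
```

Each frame gives a completely different list of triplets from the same bases, which is why a program has to check all of them.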
1. Return to the NCBI Home Page and access the "Full Report" page on your
gene, in which the diagram of the gene and links to the pre-mRNA, mRNA, and
polypeptide sequences were given. Access
the mRNA base sequence in FASTA format (so that there are no index numbers
along with the base symbols); copy the base sequence.
2. Now go back to the NCBI Home Page and look down the Hot Spots column on the right for ORF Finder. Clicking on ORF Finder gives you a page that
will allow you to paste your base sequence in the box under "or sequence in FASTA format," and to click OrfFind.
3. You should now see a page with six
narrow boxes on the left, each representing the product of translating the base
sequence with a different reading frame.
ORF Finder does not keep track of whether you pasted in a DNA or an RNA
base sequence, so it will give you 6 possible reading frames. The blue colored portion of each box
represents the part of the base sequence that starts with a start codon, codes for amino acids, and ends with a stop codon. The most
likely candidate for the ORF is the box with the longest continuous blue color.
Click on that.
4. You should now see indexed rows of
the base sequences that are being decoded alternating with rows of symbols for
the amino acids encoded by the triplets right above them. Now you can see which base is the first to be
translated and which is the last. You can
also see where the 5' untranslated region (UTR) ends,
where the 3' UTR begins, and where there are potential start sites for
alternate translations of the message. Make a copy of this information for use on
your poster.
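The search that steps 3 and 4 describe can be approximated with a short Python sketch. It assumes the input is an mRNA written with T's (as NCBI does), so only the three forward frames are scanned, and it simply takes the longest stretch that begins with ATG and ends at the first in-frame stop codon. The function name and toy sequence are hypothetical, and ORF Finder itself may apply different rules.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(mrna):
    """Return (first_base, last_base) of the longest ATG...stop ORF, 1-based.

    first_base is the A of the start codon; last_base is the final base of
    the stop codon. Returns None if no complete ORF is found.
    Simplified sketch: forward frames only.
    """
    seq = mrna.upper().replace("U", "T")
    best = None  # stored as 0-based (start, end) with end exclusive
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for another ORF further along this frame
    if best is None:
        return None
    return best[0] + 1, best[1]


toy = "CCGAGATGGCTGAAAAGTGATTCC"  # made-up mRNA, not G6PD
first, last = longest_orf(toy)
print(f"5' UTR: bases 1-{first - 1}")
print(f"Coding sequence (including stop codon): bases {first}-{last}")
print(f"3' UTR: bases {last + 1}-{len(toy)}")
```

For the toy sequence the coding region runs from base 6 to base 20; everything before it is 5' UTR and everything after it is 3' UTR.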
C. Finding the Molecular Weight of Your Polypeptide (and Possibly Other Protein Features)
1. Return to the NCBI Home Page and access the "Full Report" page on your
gene, in which the diagram of the gene and links to the pre-mRNA, mRNA, and
polypeptide sequences were given. Access the amino acid sequence of the polypeptide in FASTA format (so that there are no index numbers along with the amino acid symbols); copy the amino acid sequence.
2. Access the following website: http://ca.expasy.org/tools/#proteome. This site provides programs for analyzing
amino acid sequences. While much of the available proteomic
analysis is beyond our present needs, be aware that predictions and comparisons
of protein modifications, secondary and tertiary structures, and sequences
localizing proteins to different parts of the cell are available for more
advanced projects. Go to the blue band marked "Primary Structure Analysis," find the link marked "Compute pI/Mw," and click on it.
3. You should get to a page allowing you to paste in the amino acid sequence and click "Click here to compute pI/Mw" below it. Presto, you should get the isoelectric point (pI) and molecular weight (MW) at the bottom of the page. For G6PD, that would be pI = 8.23 and MW = 62468.35. You won't need the pI, but do save the molecular weight of your polypeptide for your poster.
4. Go back to the Expasy Primary Structure Analysis section, but this time click on the Colorseq Tool near the end of the section.
5. On this page you can again paste in your copied amino acid sequence and search for certain strings of specific amino acids. For example, polypeptides that are destined to associate with membranes often contain "signal peptides" comprising 15-20 hydrophobic amino acids near the amino terminus. So we might request that a "pre-defined residue set" of Hydrophobic amino acids be colored red. When we click "Submit," we get a visual output showing us where the hydrophobic amino acids are located in the polypeptide. Does it look like G6PD contains a signal peptide?
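If you are curious what these two tools are doing, the following Python sketch shows the general idea under stated assumptions. The molecular weight is estimated by adding up commonly tabulated average residue masses plus one water molecule; the Expasy tool may use slightly different values, so expect small differences. The set of "hydrophobic" residues below is an assumption chosen for illustration and may not match Colorseq's pre-defined set. The function names and the toy peptide are made up.

```python
# Approximate average masses (in daltons) of amino acid residues within a chain.
RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886, "C": 103.1388,
    "E": 129.1155, "Q": 128.1307, "G": 57.0519, "H": 137.1411, "I": 113.1594,
    "L": 113.1594, "K": 128.1741, "M": 131.1926, "F": 147.1766, "P": 97.1167,
    "S": 87.0782, "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER = 18.0153  # one water is added for the free amino and carboxyl ends

# Assumed hydrophobic set for illustration; Colorseq's own set may differ.
HYDROPHOBIC = set("AVLIMFW")


def molecular_weight(protein):
    """Estimate the molecular weight of a polypeptide from one-letter codes."""
    return sum(RESIDUE_MASS[aa] for aa in protein.upper()) + WATER


def mark_hydrophobic(protein):
    """Show assumed-hydrophobic residues in upper case and all others in lower case."""
    return "".join(aa if aa in HYDROPHOBIC else aa.lower() for aa in protein.upper())


toy = "MKTLLLA"  # made-up peptide, not G6PD
print(f"Estimated MW: {molecular_weight(toy):.1f}")
print(mark_hydrophobic(toy))
```

Pasting your own copied amino acid sequence (without the FASTA header line) in place of the toy peptide gives a rough check on the Expasy result, and a long run of capital letters near the start of the output is what a signal peptide would look like.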
BIO 2 Assignment Due October 31, 2008
1. With what codon does translation of the G6PD mRNA begin? With what codon does it end? [Note that the NCBI sites do not bother to convert T's in DNA to U's in RNA.]
2. In prokaryotes there is usually a sequence located 6-7 nucleotides to the 5' side of (that is, before) the start codon of an mRNA. This is known as the Shine-Dalgarno sequence; it is usually 5' AGGAGG, and it is complementary to a sequence in the prokaryotic 16S rRNA. The complementarity helps the mRNA bind to the small ribosomal subunit to begin translation.
a. Write the sequence in the 16S rRNA that would be complementary to the Shine-Dalgarno sequence. Mark its 5' end.
b. If you expected the human mRNA to be like a prokaryotic mRNA, over what bases (identify them by number) in the mRNA would you scan for the AGGAGG sequence? Did you find it there?
3. Chromosomes in karyotypes (characteristic sets for species) are numbered from 1 to n, where 1 is the largest chromosome and n is the smallest.
a. Which chromosome in the human set would you expect to contain the fewest DNA bp? Explain your answer.
b. Find the total number of bp in that chromosome using Map Viewer and record it below.
4. The G6PD locus is said to be highly polymorphic; many different G6PD alleles (resulting from random mutations) exist in the human population. Some of the mutant alleles cause no phenotypic changes, while others cause mild symptoms of hemolytic anemia, and still others cause severe symptoms. One mutant allele causing severe symptoms is a base substitution of TCC-to-TTC in the DNA coding strand (not the template strand), resulting in a change at amino acid 188 in the polypeptide.
a. Would this change be considered a frameshift, missense, nonsense, or silent mutation? Explain your answer.
b. A table on p. 79 of your text shows some chemical properties of amino acid side groups. What amino acid substitution would result from the mutation described above, and how might that substitution affect the side group interactions?