Assignment 2.2 - Introduction to Sequence Analysis
Obtaining DNA and protein sequences of interest is only the beginning of sequence analysis. Common analyses include determining how closely related two sequences are to each other, calculating the G-C content of nucleotide sequences, translating nucleotide sequences, and discovering conserved regions among a collection of related protein sequences. Assignment 2.2 directs you to begin analyzing the sequences in the DNA and protein files you created in Assignment 2.1. Read the chapter "What is Comparative Genomics" from the book Comparative Genomics for a more detailed description of how sequence analysis is used to answer significant biological questions.
Your analysis steps (be sure to tabulate and organize your data as you generate it):
-
Estimate the percent identity between your original gene of interest and the 10 other sequences in your DNA FASTA file. Begin with the LAlign program found here: http://embnet.vital-it.ch/software/LALIGN_form.html. LAlign is a basic program that aligns your two sequences, outputs the alignment, and reports the percent identity between them. Set the program to use a "global alignment"; otherwise it will default to giving you several partial alignments. Explore online and check out other programs that you may like better. Another useful program, which reports both percent identity and percent similarity, is EMBOSS 6.3.1: stretcher, found here: http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::stretcher. You can also determine the percent identity between sequences with a local program (one installed on your computer). The BioEdit program, described below, can open your FASTA files; from the menu, select Sequence --> Pairwise Alignment --> Align Two sequences (optimal Global alignment). See the instruction video below for details.
Using the BioEdit Program Video Tutorial
-
Use a DNA/RNA G-C content calculator program (http://www.endmemo.com/bio/gc.php) to determine the percent of G and C nucleotides in each DNA sequence listed in your DNA FASTA file. Search online for an alternative web-based tool to calculate G-C content and report the URL for the site.
-
Explore free web-based tools to translate each of the eleven sequences in your DNA FASTA file. Record the initial nucleic acid length and the resulting amino acid length of the protein. Start your search by exploring the wealth of tools available on the ExPASy website: http://www.expasy.org/.
-
Use a web-based tool to calculate the molecular weight (in Daltons, Da) and the isoelectric point (pI) of each protein in your Protein FASTA file. The pI of a protein is the pH at which the protein carries no net electrical charge. In other words, the protein's negative and positive charges are equal at that pH. This information is useful for designing protein biochemistry and column chromatography experiments.
-
Determine whether your proteins have any recognizable functional domains. Record the name of each domain and its amino acid position within your protein. For example: identified serine kinase domain from amino acids 34 to 123. Try http://www.genome.jp/tools/motif/ or ScanProsite (http://prosite.expasy.org/scanprosite/). This data is useful for predicting the function of a protein.
-
Predict the number of transmembrane domains present in your eleven proteins. Try HMMTOP (http://www.enzim.hu/hmmtop/; use the submit tab), TMHMM (http://www.cbs.dtu.dk/services/TMHMM-2.0/), or TMpred (http://www.ch.embnet.org/software/TMPRED_form.html). This data can indicate the location of a protein (for example, in a membrane).
-
Using your Protein FASTA file, perform a multiple alignment of your protein sequences with the Clustal Omega program found on the ExPASy website. To visualize and analyze your output alignment file, you will need to download and install either BioEdit (PC - http://www.mbio.ncsu.edu/bioedit/page2.html) or SeaView (Mac - http://pbil.univ-lyon1.fr/software/seaview.html). Using BioEdit, calculate the percent identity and similarity between your original protein of interest and the 10 other protein sequences in the multiple alignment. Next, use BioEdit to generate a clear graphic image of your alignment. If you are using a Mac and will visualize your alignment with SeaView, open your alignment file in the program and use the generate .pdf function under the File menu to create an image of your alignment. Using Jing or another screen capture program, take a snapshot of as much of the alignment as you can to include in your data analysis summary. Mac users should use a program like LAlign to determine the percent identity between the protein of interest and the other sequences in the Protein FASTA file. Unfortunately, BioEdit is not available for Macs, but SeaView has some additional functionality that BioEdit lacks. Definitely explore the potential of this program, using your DNA and Protein FASTA files as test cases. Videos illustrating how to navigate the Clustal Omega program and BioEdit are below.
-
Compile all of your generated data, including a color image of your protein multiple alignment, in a spreadsheet program. Pay attention to the presentation and ease of access to your data. Make use of the formatting learned in Unit 1 to aid in the organization and presentation of your data. Finally, be sure to indicate the web-based program used to generate each piece of data you present in your spreadsheet.
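To see what a percent identity value from the first analysis step actually measures, the following Python sketch may help. It is not how LAlign, stretcher, or BioEdit work internally: it assumes a very simple scoring scheme (match +1, mismatch -1, gap -2, all chosen here for illustration) and reports identity as identical positions divided by alignment length. Real tools use scoring matrices and different gap penalties, and some divide by a different length, so expect their numbers to differ somewhat.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a simple linear gap penalty.

    The scoring values are illustrative assumptions, not those of any
    particular web tool.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aln_a, aln_b):
    """Identical, non-gap columns as a percentage of alignment length."""
    if not aln_a:
        return 0.0
    same = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return 100.0 * same / len(aln_a)
```

Calling `global_align("ACGT", "ACT")` and passing the two returned strings to `percent_identity` gives a value you can compare against what a web tool reports for the same pair.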
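The G-C content step can be cross-checked by hand with a few lines of Python. This is a minimal sketch, not the method used by any particular web calculator: the function names `read_fasta` and `gc_content` are made up for this example, and the sketch assumes that only A, C, G, T, and U should be counted (ambiguous bases such as N are ignored). A web tool that treats ambiguous bases differently may report a slightly different percentage.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def gc_content(seq):
    """Return the percentage of G and C among the unambiguous bases."""
    # Assumption: ignore anything that is not A, C, G, T, or U.
    counted = [base for base in seq.upper() if base in "ACGTU"]
    if not counted:
        return 0.0
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)
```

To use it, read your DNA FASTA file into a string, pass it to `read_fasta`, and call `gc_content` on each sequence in the resulting dictionary.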
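For the translation step, a short Python sketch can show how the nucleic acid length relates to the amino acid length you record. It assumes the standard genetic code, reading frame 1 starting at the first nucleotide, and translation that stops at the first stop codon. These are assumptions for illustration; the web tools on ExPASy typically show several reading frames and let you choose among them.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(nucleotides):
    """Translate frame 1 of a DNA or RNA sequence up to the first stop codon.

    Codons containing ambiguous bases are reported as "X"; an incomplete
    codon at the end of the sequence is ignored.
    """
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for start in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[start:start + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For a complete coding sequence that ends in a stop codon, the protein length should be the nucleotide length divided by 3, minus 1 for the stop codon. A large mismatch usually means the sequence is in a different reading frame or contains untranslated regions.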