Scenario - Bill Newman
Working in the lab on a Tuesday morning, Bill receives EST sequence results for an organism collected last fall in the "exclusion zone" at Chernobyl. He is anxious to compare these sequences to those of other organisms, to determine whether any of the EST sequences is likely to code for proteins involved in DNA repair or radiotolerance. He decides to do similarity searches with BLAST against the IMG databases and fires up his computer.
He chooses a sequence from the set of ESTs, gets the BLAST results, and finds that there are three high-similarity hits in extremophile archaeal species to one of the ESTs. Two are annotated as DNA repair proteins, and one is listed as a hypothetical protein kinase. He is fairly certain that the latter was annotated incorrectly, but there's always the chance that it's a very similar protein doing a different job. Not willing to dismiss the oddball entirely, he looks at the annotation history of each of the hits, checking to see whether the sources are reliable. As he suspected, the oddball was annotated automatically by an algorithm that is not rated very highly. The others were annotated by humans, so he thinks he may have indeed found a DNA repair gene that is highly expressed in the organism from Chernobyl. One of the two human annotators is an up-and-coming researcher whom he met at the International Conference on Microbial Genomes last year, Martha Jane Gilbert. The other annotator, Phil Dupont, is new to him, but the system shows a high average rating for this person's annotations, so he is pleased.
He next runs a multiple alignment on the three genes with high similarity and views the amino acid sequences side by side. With this comparison, Bill is now quite certain that the oddball sequence codes for a DNA repair protein, and because the other two are so similar, he is also quite sure that they all code for the same repair protein. He updates the annotation for the oddball, assigning the correct GO term for the function, EC number for the protein, and COG group for the gene. He also gives the gene a new, more accurate name. He registers his agreement with Phil's and Martha Jane's annotations as well, then heads off to lunch, feeling the EST sequencing is already paying off.
Scenario - Martha Jane Gilbert
On a Wednesday afternoon of a relatively productive week, Martha Jane grabs a cup of coffee and settles down in front of her computer to do some data analysis she’s been meaning to get to. She is currently working on identifying communities of microorganisms that play a role in removing carbon dioxide emissions through photosynthesis, the process that uses light energy to convert carbon dioxide and water into sugars and oxygen.
Her lab recently finished sequencing a few samples they received from the Crouch Mining Dahlquandy surface mine, one of the largest open cast mines in the UK. They have already identified many of the genes, and they know that many of these genes play a role in photosynthesis, but they don't know the precise role each plays. A quick search through the KEGG database allows her to see the various individual enzymes involved in photosynthesis in representative organisms. She wants to identify the genes in the Dahlquandy set that code for each of the proteins involved.
She logs into JGI’s IMG portal and does a search for genes involved in carbon fixation. She chooses one specific enzyme, malate dehydrogenase, and views the list of genes assigned to it. Most of the gene matches look like they were automatically annotated; in fact, she notices one gene sequence that has a small untranslated region right in the middle. That gene must have been incorrectly annotated, for such a region clearly indicates the termination of a gene! She registers her disagreement with these annotations. She is glad to see so many hits, but knows that the power of the system will come from numbers rather than precision.
Now she pulls out an Excel spreadsheet listing the genes that were identified in her sample and begins running comparisons with the genes in the IMG portal. Sure enough, for each of the genes in her set, IMG finds related genes with a specific role in photosynthesis. After a series of comparisons like this, she has a much better feel for the role each of these genes plays. In particular, there is a set of genes within IMG that she is now certain play a role in Photosystem I. She enters a comment to this effect for all of them. She hopes that other members of the scientific community will offer similar hypotheses about these genes and begin to annotate them more fully.
Scenario - Phil Dupont
It is late Tuesday morning. Phil got to the lab later than he wanted and is planning his day. He has a lot to do for his presentation at a lab meeting two weeks from now. He checks his email and learns that the sequence for the fungus he is studying is now complete and loaded into the IMG database. He can now proceed to analyze the genes. He goes to the IMG site and finds the genome for his organism. He is interested in a specific gene that has been sequenced before (as expressed sequence tags, or ESTs). He has the complete cDNA sequence for this gene in a text file on his computer. He first wants to identify all paralogs and orthologs across the entire database but then decides to limit his search to the available fungi. He chooses his subset of organisms for comparative analysis and runs a BLAST similarity search. From the resulting list of similar sequences, he picks a few that seem noticeably more similar than the others.
He can now do a multiple alignment of the most similar genes. He notices that he can export the multiple alignment source sequences to a FASTA file. Switching to his electronic notebook, he makes a note of that and then proceeds to make the multiple alignment. He wants to rank levels of conservation between his gene of interest and the other genes with high similarity. He makes a note that the first BLAST hit, labeled membrane protein TerC, is an exact match for his gene of interest, and the second, third, and fourth genes listed are similar but longer. He wonders if the first hit was actually the algorithm returning the source sequence as a hit to itself, but then he sees that they have different ID numbers and that they are, in fact, from different organisms. He concludes that, since the sequences are identical across organisms, the gene is probably conserved.
He feels pretty confident about transferring the annotation from the homolog he identified to his gene of interest. To be absolutely sure that his thinking was right, he still wants to investigate the homolog labeled as membrane protein TerC to make sure that he isn't basing his work on an improper annotation. He wants to see its SWISSPROT number (since he knows SWISSPROT is a curated database) and the GenBank number. Finding both of these, he believes that he has enough evidence to make a homology-based annotation.
He hasn’t entered any annotations into the IMG system yet, so he signs up for a user account. He logs into the system and enters the annotation for his gene of interest, including the reference to the homolog he identified. Before submitting it, he thinks he'd like to see the homolog on a phylogenetic tree. He goes to an online phylogenomics tool and sets the necessary parameters. The resulting structure does not support the functional assignment of the homolog as membrane protein TerC. He makes a quick reference search on PubMed. He finds publications that confirm that the homolog's function assignment was not correct. It is actually membrane protein TerD. Now convinced that he has found an error, he shows the data to his supervisor for one last check. They decide that Phil is indeed correct and the annotation for the homolog is erroneous. He goes back to the system and corrects the annotation for the homolog, adding a PubMed reference to support his functional assignment. He then proceeds to finish the annotation of his gene of interest, referencing the homolog as support for the new functional assignment.