Problem Statement

One of the tightest bottlenecks in genomics today is identifying the functions of genes in genomes that have already been sequenced. Scientists have access to genomic sequence for a steadily increasing number of organisms, but making sense of that sequence requires a cooperative effort by biologists worldwide. The process of assigning functions to stretches of sequence is called functional annotation. Currently, a given genome is annotated by a small group of trusted annotators, who find sequences of interest and use an online annotation tool to enter annotations indicating what they think each sequence does (i.e., what protein it codes for or what that protein does). The annotation tool for a given genome is normally available only for a relatively short period, so the amount of human effort that can be applied to the problem is severely restricted.

Our idea is to create an open annotation tool that gathers knowledge from anyone in the world who wants to contribute, whenever they want, rather than from a closed set of official annotators who can contribute only while annotation is open. To make the tool credible for scientists, and to prevent abuse, it needs to incorporate a trust system that lets users see at a glance how reliable a given annotation is likely to be and keeps unreliable entries out of their way. The trust scheme will involve multiple levels of users, so that those who have not yet proven themselves to be reliable annotators cannot abuse the system. We envision three levels: guest access, which allows anyone to view annotations; member access, which allows those who obtain a login to enter annotations; and fellow access, which allows verification of other people's annotations. Fellow access is granted to members who have been designated as fellows by other fellows or who have made a threshold number of annotations that were verified by a fellow. To allow interoperability with existing bioinformatics tools, our system will also draw on controlled vocabularies based on existing ontology systems (worldwide naming conventions for the biological roles that proteins or groups of proteins can have).
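The three access levels and the two routes to fellow status can be sketched in a few lines of Python. This is only an illustration of the scheme described above, not a design for the tool itself: the names Role, User, Annotation, verify and designate_fellow are hypothetical, the threshold value is a placeholder (no number has been chosen), and the ontology term is just an example Gene Ontology identifier.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Role(IntEnum):
    GUEST = 0   # may view annotations
    MEMBER = 1  # may also enter annotations
    FELLOW = 2  # may also verify other people's annotations


# Placeholder value: the actual threshold has not been decided.
VERIFIED_THRESHOLD = 10


@dataclass
class User:
    name: str
    role: Role = Role.GUEST
    verified_count: int = 0  # annotations by this user verified by a fellow


@dataclass
class Annotation:
    author: User
    sequence_id: str
    term: str  # controlled-vocabulary term, e.g. an ontology identifier
    verified_by: list = field(default_factory=list)


def can_view(user: User) -> bool:
    return True  # guests, members and fellows can all view


def can_annotate(user: User) -> bool:
    return user.role >= Role.MEMBER


def can_verify(user: User, annotation: Annotation) -> bool:
    # Fellows verify other people's annotations, never their own.
    return user.role == Role.FELLOW and annotation.author is not user


def verify(fellow: User, annotation: Annotation,
           threshold: int = VERIFIED_THRESHOLD) -> None:
    if not can_verify(fellow, annotation):
        raise PermissionError(f"{fellow.name} may not verify this annotation")
    if fellow.name in annotation.verified_by:
        return  # already verified by this fellow
    first_verification = not annotation.verified_by
    annotation.verified_by.append(fellow.name)
    author = annotation.author
    if first_verification:
        # Count each annotation once, however many fellows verify it.
        author.verified_count += 1
    if author.role == Role.MEMBER and author.verified_count >= threshold:
        author.role = Role.FELLOW  # promotion by track record


def designate_fellow(fellow: User, member: User) -> None:
    # Promotion by designation from an existing fellow.
    if fellow.role != Role.FELLOW or member.role != Role.MEMBER:
        raise PermissionError("only a fellow can designate a member")
    member.role = Role.FELLOW


alice = User("alice", Role.FELLOW)
bob = User("bob", Role.MEMBER)
entry = Annotation(bob, "seq0", "GO:0003677")  # example GO term (DNA binding)
verify(alice, entry)
print(bob.verified_count, entry.verified_by)
```

In this sketch the reliability of an entry can be read directly from its verified_by list, which is the kind of at-a-glance signal the trust system is meant to give; how that signal would actually be displayed or weighted is left open.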

We hope to build this tool as a new component of JGI's Integrated Microbial Genomes (IMG) system. The IMG system gives users access to the genomes of hundreds of organisms sequenced at centers around the world, making it a logical home for a collaborative annotation system. IMG does not currently have a facility for external users to publish functional annotations, but users have expressed interest in having this capability.