50. Rice genome annotation pipeline system in RGP

1) RGP / Institute of the Society for Techno-innovation of Agriculture, Forestry and Fisheries, Tsukuba, Ibaraki, 305-0854, Japan
2) RGP / National Institute of Agrobiological Sciences, Tsukuba, Ibaraki, 305-8602, Japan

The Rice Genome Research Program (RGP) and the International Rice Genome Sequencing Project (IRGSP) announced the completion of a high-quality draft sequence of the rice genome in December 2002. At present, sequence finishing is in progress and the "finished" rice genome sequence is expected before the end of 2004. Annotation of the sequence data, which facilitates systematic and comprehensive gene discovery, is one of the crucial steps in genome sequencing. We have therefore developed an annotation pipeline system that consists of automated high-throughput annotation through RiceGAAS[1], manual curation using a customized editing tool called AnnotationPlot, efficient management of annotation data in the database RAD (Rice Genome Annotation Database), and integration of the output into a centralized database, INE (INtegrated Rice Genome Explorer). So far, we have released the annotation of more than 85 Mb of phase 3 sequence.

Sequence analysis

All PAC and BAC clones sequenced to phase 3 level are subjected to annotation, which combines an automated set of processes incorporated in RiceGAAS with manual curation using additional analysis tools, in order to construct the most accurate gene model based on existing evidence. Homology searches of the genomic sequence are executed against the NCBI nonredundant protein database with BLASTX, and against the rice full-length cDNA and EST databases at RGP or DDBJ with BLASTN. The genomic sequences are then processed using several prediction programs, namely FGENESH (Monocots)[2], GeneMark.hmm (O. sativa)[3], GlimmerM (Rice)[4], RiceHMM (developed by RGP)[5] and GENSCAN (Arabidopsis, Maize)[6]. Splice site prediction is carried out with SplicePredictor[7]. In addition, sim4[8] and gap2[9] are used to analyze splice site regions, particularly for gene modeling based on the rice full-length cDNA.

Manual curation

An annotator constructs a gene model based on the annotation standards defined by the IRGSP. For a coding sequence with full-length cDNA, EST or protein matches, the gene model is constructed using any of these lines of evidence. For coding sequences without any database match, the gene model that is supported by multiple gene prediction programs is selected. The TIGR Combiner program, which uses a voting scheme to combine the predictions of several gene finders and produce a single best prediction, is used to decide which gene model to select from the output of the different prediction programs. After constructing a gene model, the annotator defines the gene nomenclature based on similarity analysis with known proteins by BLASTP.
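The idea of selecting a model supported by multiple prediction programs can be illustrated with a minimal sketch. This is not the TIGR Combiner algorithm; it is a simple per-exon majority vote, and the function name `vote_exons`, the `min_votes` threshold and all coordinates are hypothetical values chosen for illustration.

```python
from collections import Counter

def vote_exons(predictions, min_votes=2):
    """Keep exons reported by at least `min_votes` programs.

    predictions: dict mapping program name -> list of (start, end) exons.
    This is an illustrative majority vote, not the TIGR Combiner method.
    """
    votes = Counter()
    for exons in predictions.values():
        # set() ensures each program votes at most once per exon
        votes.update(set(exons))
    return sorted(exon for exon, n in votes.items() if n >= min_votes)

# Hypothetical exon coordinates from three gene finders
predictions = {
    "FGENESH": [(100, 250), (400, 520), (700, 900)],
    "GeneMark.hmm": [(100, 250), (400, 520), (710, 900)],
    "GENSCAN": [(120, 250), (400, 520), (700, 900)],
}

consensus = vote_exons(predictions)
print(consensus)
```

Raising `min_votes` makes the consensus stricter; exact coordinate matching is the simplest possible agreement criterion, and a real combiner would also weigh splice site and evidence quality.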

Since the output of annotation consists mainly of numerical information, we developed an editing tool called AnnotationPlot to construct gene models efficiently (Fig. 1). AnnotationPlot displays the positional information produced by the analysis programs in a graphical format so that an annotator can easily comprehend the relationships between the genome sequence and the results of the analysis programs. This is particularly useful for BLAST hits, which often consist of overlapping sets of alignment fragments. In such cases, the annotator cannot easily decide which alignment should be adopted as evidence to construct a gene model. AnnotationPlot clusters the BLAST results, selects the most appropriate alignment from each group of overlapping alignments, and distinguishes alignments of low and high stringency, although an option is also provided to view all alignments for reference. The gene model that the annotator constructs must begin at a start codon and must not include a stop codon except at the end of the gene. AnnotationPlot also provides a visual presentation of coding frames, including the start and stop codons, which helps the annotator determine whether a gene model is an actual gene or a pseudogene. Splice sites, however, cannot be correctly determined using BLAST hits only. When the genome sequence has a hit with a full-length cDNA sequence, we therefore also use sim4 and gap2 to determine the exact position of the splice sites. These programs address the problem of efficiently aligning a transcribed and spliced DNA sequence (mRNA, EST) with a genomic sequence containing that gene, allowing for introns in the genomic sequence. The alignment results are also provided as numerical information. AnnotationPlot can import these results, display the alignment graphically and facilitate selection of the most appropriate splice sites.
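The clustering of overlapping BLAST alignments described above can be sketched as follows. This is only an illustration under stated assumptions, not the AnnotationPlot implementation: hits are grouped when their genomic intervals overlap, and the highest-scoring hit in each group is taken as the representative. The `Hit` record, the function names and the example values are all hypothetical.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    subject: str   # name of the matched database entry
    start: int     # start on the genomic sequence
    end: int       # end on the genomic sequence
    score: float   # alignment score (higher is better)

def cluster_hits(hits):
    """Group hits whose genomic intervals overlap (single sweep over sorted hits)."""
    clusters, current, current_end = [], [], None
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        if current and hit.start <= current_end:
            # Overlaps the running cluster: extend it
            current.append(hit)
            current_end = max(current_end, hit.end)
        else:
            if current:
                clusters.append(current)
            current, current_end = [hit], hit.end
    if current:
        clusters.append(current)
    return clusters

def best_per_cluster(hits):
    """Pick the highest-scoring hit from each cluster of overlapping hits."""
    return [max(cluster, key=lambda h: h.score) for cluster in cluster_hits(hits)]

# Hypothetical hits on one clone
hits = [
    Hit("protA", 100, 300, 180.0),
    Hit("protB", 250, 420, 95.0),
    Hit("protC", 410, 500, 210.0),
    Hit("protD", 900, 1200, 60.0),
]
for hit in best_per_cluster(hits):
    print(hit.subject, hit.start, hit.end, hit.score)
```

Sorting first keeps the sweep at O(n log n). Choosing the representative by score alone is an assumption made here; the text does not state which criterion defines the most appropriate alignment.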

Management of annotation data

After submission to DDBJ, the annotated information is stored in two different databases. One is INE, which integrates the annotated sequence data with the DNA markers in the genetic map of rice, the YAC-based physical map, the EST markers in the transcript map and the PAC/BAC contigs. To manage the information derived from annotation more efficiently, the annotated PAC/BAC clones are also incorporated into RAD, which merges clone sequences and annotation information in accordance with the minimal tiling path of the physical map of rice. This database also provides a contig-level or chromosome-level view of the annotation results.
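Merging clone-level annotation along a tiling path amounts to a coordinate conversion, which the following minimal sketch illustrates. It does not describe the RAD schema: the clone names, offsets, feature names and the function `to_contig_coords` are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
def to_contig_coords(clone_offsets, features):
    """Convert clone-local feature coordinates to contig coordinates.

    clone_offsets: dict mapping clone name -> 1-based contig position
                   of the first base of that clone (assumed known from
                   the tiling path).
    features: list of (clone, feature_name, start, end) in clone-local,
              1-based inclusive coordinates.
    """
    merged = []
    for clone, name, start, end in features:
        offset = clone_offsets[clone]
        # Clone position 1 maps to contig position `offset`
        merged.append((name, offset + start - 1, offset + end - 1))
    return sorted(merged, key=lambda f: f[1])

# Hypothetical tiling path of two clones and one gene model on each
clone_offsets = {"P0001A01": 1, "B0002B02": 120001}
features = [
    ("B0002B02", "gene2", 500, 2500),
    ("P0001A01", "gene1", 1000, 4000),
]
contig_view = to_contig_coords(clone_offsets, features)
print(contig_view)
```

Features lying in the overlap between neighbouring clones would be reported twice by this sketch, so a real merge would also need to remove such duplicates.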

Future plans

Although automated annotation through RiceGAAS provides a comprehensive analysis of the characteristic features of the genome sequence, manual curation is still an integral part of the process to assure high-quality annotation. Automating many of the steps involved in editing the output of RiceGAAS could accelerate the release of annotated sequences in public databases. We are planning to incorporate the voting scheme function of Combiner into AnnotationPlot to facilitate direct selection of the best gene prediction output. The 'BLAST result clustering' function is very efficient in most cases, but clustering many alignments may take a long time. We are currently developing a much faster clustering algorithm, either by tuning the present clustering program or by reducing the number of objects to be clustered.


[1] Sakata, K. et al.: 2002. RiceGAAS: an automated annotation system and database for rice genome sequence. Nucleic Acids Res., 30: 98-102.

[2] http://www.softberry.com/berry.phtml

[3] Lukashin, A. and Borodovsky, M.: 1998. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res., 26: 1107-1115.

[4] Delcher, A.L. et al.: 1999. Improved microbial gene identification with GLIMMER. Nucleic Acids Res., 27(23): 4636-4641.

[5] http://rgp.dna.affrc.go.jp/RiceHMM/

[6] Burge, C. et al.: 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268: 78-94.

[7] Reese, M.G. et al.: 1997. Improved splice site detection in Genie. J. Comput. Biol., 4(3): 311-323.

[8] Florea, L. et al.: 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res., 8(9): 967-974.

[9] Huang, X. et al.: 1997. A tool for analyzing and annotating genomic sequences. Genomics, 46: 37-45.