The Candida Genome Database: Sequence Refinements, November 2008

This page describes refinements to the Candida albicans SC5314 genome sequence and annotation. The refinements were made using comparative genome analysis and sequence alignment curation from April to August 2008, and were added to CGD in November 2008.

Overview of the comparative genome analysis:

A comparative genome analysis was performed by Mike Lin, Christina Cuomo, Manolis Kellis, and colleagues at the Broad Institute, using the genome sequences of Candida albicans SC5314, Candida albicans WO-1, Candida dubliniensis, Candida tropicalis, Candida parapsilosis, Lodderomyces elongisporus, Debaryomyces hansenii, Candida guilliermondii, and Candida lusitaniae (Butler et al., submitted). In brief, the genomes were compared to identify conserved regions whose pattern of sequence conservation indicates selective pressure to maintain protein-coding capability (see the similar comparative analysis of Drosophila genomes in Lin et al., Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes. Genome Res. 2007 Dec;17(12):1823-36. PMID: 17989253). The analysis identified potential new ORFs in the C. albicans SC5314 genome, as well as possibly mis-annotated ORFs and possible sequence errors that cause stop codons or frameshifts within protein-coding regions.

Overview of the sequence curation process:

We conducted a targeted sequence analysis at CGD. As a starting point, we extracted from CGD a list of genomic regions containing suspected sequence errors, namely all of the "adjustments" that had been added by the Candida Annotation Working Group annotators to compensate for ORF interruptions. To this list we added the potential errors from the computational analysis of comparative sequence alignments (conducted by Lin et al.) that did not correspond to any existing "adjustment." Each of the regions containing suspected sequence errors was re-evaluated using sequence trace data. The sequence data came from two sources: (1) the trace archive from the Stanford Genome Technology Center, which was the starting point for Assembly 19, and (2) a set of data that we generated using the 454 Genome Sequencer 20 System, starting with C. albicans SC5314 DNA supplied by Judith Berman (in excess of 100 Mb of new sequence, average read length of 241 bp).

In each case, we BLASTed the Assembly 21 (A21) sequence against all of the sequence traces, both the original set from the Stanford Genome Technology Center (SGTC) shotgun sequencing and the new set of 454 traces, and also against the contig sequences from Assembly 19 (A19). NCBI BLAST was used with an e-value cutoff of 1e-15. The query sequence comprised the "error region" (i.e., the site of the actual "adjustment" itself) plus flanking sequence of 100 nucleotides on each side. We imposed an additional constraint that the match to each sequence trace span the entire "error region" (but not necessarily the entire flanking sequence).
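The query construction and the span constraint described above can be sketched as follows. This is an illustration only, not the code used at CGD: the function names, the 0-based half-open coordinates for the genome, and the use of 1-based inclusive query coordinates for hits (as BLAST reports them) are assumptions made for this example.

```python
FLANK = 100            # flanking nucleotides on each side, as stated in the text
EVALUE_CUTOFF = 1e-15  # e-value cutoff, as stated in the text


def query_window(error_start, error_end, seq_len, flank=FLANK):
    """Return the (start, end) of the query: error region plus flanks.

    Coordinates are 0-based, half-open, and clipped to the sequence ends.
    """
    return max(0, error_start - flank), min(seq_len, error_end + flank)


def spans_error_region(hit_qstart, hit_qend, error_start, error_end, window_start):
    """True if a hit covers the whole error region (flanks may be partial).

    hit_qstart and hit_qend are 1-based inclusive positions on the query;
    they may arrive in either order, so they are sorted first.
    """
    err_q_start = error_start - window_start + 1  # first error base, 1-based
    err_q_end = error_end - window_start          # last error base, 1-based
    lo, hi = sorted((hit_qstart, hit_qend))
    return lo <= err_q_start and hi >= err_q_end


def keep_hit(evalue, hit_qstart, hit_qend, error_start, error_end, window_start):
    """Apply both filters: the e-value cutoff and the span constraint."""
    return evalue <= EVALUE_CUTOFF and spans_error_region(
        hit_qstart, hit_qend, error_start, error_end, window_start
    )
```

For an error region at genome positions 500 to 503, `query_window(500, 503, 10000)` gives the window (400, 603), and a hit covering query positions 90 to 110 passes the span test because it contains query positions 101 to 103.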

Alignment of the BLAST hits was performed with the MUSCLE alignment software. We manually curated each of these alignments. In many cases, the sequence data clearly indicate that there are sequence differences between the two alleles of each ORF in SC5314. Because Assembly 21 is already a mix of haplotypes, we focused only on making the sequence corrections/updates that would restore the open reading frame (insertions and deletions that cause frameshifts, and substitutions that affect in-frame stop codons), and did not attempt to dissect out the alleles or to make other corrections based on these sequence data. (Construction of an updated diploid assembly is beyond the scope of this current work.) Where the traces support a change that restores (or lengthens) the open reading frame, we recorded the "snippet" of sequence from A21, the "snippet" with the change included, and then the supporting evidence: the number of traces that support this change, and the total number of traces aligned across this region. The sequence changes were parsed from the sequence snippets. The sequence updates were then incorporated into the context of the genomic sequence of the affected ORFs (with the "adjustments" removed). The coding sequence of each of the ORFs (taking introns into account) was computationally translated, and tested for proper start and stop codons as well as for stop codons internal to the reading frame. Remaining errors were addressed by iterations of generation of new alignments (centered on a different region, using a higher stringency e-value cutoff, or removing the full-length match constraint, as needed), curation of new sequence snippets, and translation tests.
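The translation test described above (splice out introns, then check the start codon, the terminal stop codon, and internal stop codons) can be sketched as below. This is a minimal illustration under stated assumptions, not the CGD pipeline: the helper names are hypothetical, exons are given as 0-based half-open intervals on a forward-strand sequence, and only ATG is accepted as a start codon. The check looks only at start and stop codons, so it does not depend on how sense codons are translated.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(genomic, exons):
    """Join exon intervals (0-based, half-open, in transcript order)."""
    return "".join(genomic[start:end] for start, end in exons)


def check_orf(cds, start_codons=("ATG",)):
    """Return a list of problems found in a spliced coding sequence.

    An empty list means the sequence starts with a start codon, ends with
    a stop codon, has a length divisible by 3, and has no internal stops.
    """
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] not in start_codons:
        problems.append("missing start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    internal = [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    if internal:
        problems.append(f"internal stop codon(s) at codon index {internal}")
    return problems
```

An ORF that still reports a problem after a sequence update would, in the workflow described above, go back for another round of alignment and curation.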

As part of this work, all of the "adjustments", which were put into the ORF annotation to compensate for presumed errors in the genomic sequence, have now been removed from Assembly 21. In each case the sequence has been re-examined; underlying errors have been corrected or, in cases where no error is confirmed, the ORF annotation has been updated accordingly. The updates to individual ORFs have been grouped broadly into Sequence Changes and Annotation Changes. Annotation changes include:
- addition of a new ORF (or non-ORF feature such as a pseudogene, or a blocked reading frame, which is a conserved region that includes an ORF from start to stop as well as conserved sequence outside of the ORF),
- change in the location of one or both termini of an ORF (ORF boundary change),
- addition of an old ORF that was previously deleted (a reinstated ORF),
- merge between two or more distinct ORFs which results in a single contiguous ORF (the terms "merge-keep" and "merge-delete" are applied to the ORFs within the pair that become the new primary ORF in CGD and the ORF that is assigned "deleted" status after the merge, respectively),
- separation (un-merging) of two ORFs that were previously merged and which are now being restored as separate entities.
Each change is described in detail in the Locus History of the affected ORF, which may be accessed from the Locus History tab near the top of each Locus Summary page or from the Locus History link in the Additional Information section near the bottom of each Locus Summary page.

In cases in which there is a predicted conserved coding region but not a clear ORF, or a CGD ORF with an "adjustment" for which the sequence data do not support a sequence correction that would restore the open reading frame, our curation guidelines are the following: If the N- or C-terminus can be moved, such that a smaller intact ORF remains, we annotate the smaller ORF. If this is not possible, we annotate the region as a "pseudogene." In cases where the smaller ORF region appears to be part of a larger conserved coding region, we annotate the larger region as a "blocked reading frame," to make this information accessible to our users.

When a new ORF is added to CGD, it receives a name that follows the "orf19" systematic name convention that is in use in the community. The guideline for adding a new ORF name is to identify the neighboring ORFs, start with the flanking ORF that contains the lower number in its numeric name, add a ".1" suffix, and assign this new name to the new ORF. (If the flanking ORF already has a ".1" suffix, we add a ".2" suffix to the new name.) All new ORF names were checked for uniqueness in A19, A20, and A21. Names for pseudogenes and blocked reading frames follow the same convention as ORF names.
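The naming guideline can be illustrated with a short sketch. The function names and the name pattern are assumptions made for this example. The text does not say what happens when a candidate name is already in use, so the choice here to keep incrementing the suffix until the name is unique is also an assumption.

```python
import re

# Assumed pattern: "orf19.<number>" with an optional ".<suffix>".
NAME_RE = re.compile(r"^orf19\.(\d+)(?:\.(\d+))?$")


def parse_name(name):
    """Split a systematic name into (number, suffix); suffix is 0 if absent."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not an orf19-style name: {name!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def new_orf_name(left_neighbor, right_neighbor, existing_names):
    """Name a new ORF after the flanking ORF with the lower number.

    existing_names should hold every name already used in A19, A20, and A21.
    """
    number, suffix = min(parse_name(left_neighbor), parse_name(right_neighbor))
    suffix += 1  # no suffix becomes ".1"; ".1" becomes ".2"
    # Assumption: on a collision, try the next suffix until the name is free.
    while f"orf19.{number}.{suffix}" in existing_names:
        suffix += 1
    return f"orf19.{number}.{suffix}"
```

With hypothetical neighbors `orf19.1234` and `orf19.1236` and no conflicting names, the sketch returns `orf19.1234.1`; if the lower-numbered neighbor is `orf19.1234.1`, it returns `orf19.1234.2`.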

ORF classification:

As another result of this analysis, each ORF in CGD was assigned one of three additional qualifiers: Verified, Uncharacterized, or Dubious. ORFs are labeled as Verified if there is experimental characterization indicating that a functional gene product is produced (defined as the ORF having curated Gene Ontology terms with experimental evidence codes, i.e., evidence codes other than IEA, ISS, RCA, ISA, ISM, ISO, NAS). Uncharacterized ORFs do not currently have curated experimental characterization. Once published experimental evidence shows that an Uncharacterized ORF produces a product, the ORF will be reclassified as Verified. Dubious ORFs are unlikely to encode a protein product: they appear indistinguishable from random non-coding sequence based on the comparative analysis conducted by Lin et al. (2008), they show no significant homology to an S. cerevisiae gene (by the same criteria used to determine Best Hits and Orthologs at CGD), and they have no experimental characterization. Only ORFs that are included in the current sequence assembly are assigned one of these ORF qualifiers.
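The three-way classification can be expressed as a small decision function. This is a sketch of the rules as stated above, not CGD's implementation: the function name and its boolean inputs are hypothetical, and "experimental characterization" is approximated here by the presence of any curated GO evidence code outside the non-experimental list.

```python
# Evidence codes that the text treats as non-experimental.
NON_EXPERIMENTAL_CODES = {"IEA", "ISS", "RCA", "ISA", "ISM", "ISO", "NAS"}


def classify_orf(go_evidence_codes, looks_noncoding, has_scer_homolog):
    """Return "Verified", "Uncharacterized", or "Dubious" for one ORF.

    go_evidence_codes: evidence codes of the ORF's curated GO annotations.
    looks_noncoding:   True if the comparative analysis could not tell the
                       ORF apart from random non-coding sequence.
    has_scer_homolog:  True if there is significant homology to an
                       S. cerevisiae gene.
    """
    has_experimental = any(
        code not in NON_EXPERIMENTAL_CODES for code in go_evidence_codes
    )
    if has_experimental:
        return "Verified"
    # Dubious requires all three conditions; the lack of experimental
    # characterization is already established at this point.
    if looks_noncoding and not has_scer_homolog:
        return "Dubious"
    return "Uncharacterized"
```

Checking for experimental evidence first means an ORF with any experimental GO annotation is Verified regardless of the comparative result, which matches the requirement that Dubious ORFs have no experimental characterization.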

Sequence and annotation update statistics:

697 sequence updates
- 161 insertions
- 322 deletions
- 214 substitutions
73 new ORFs
63 ORF boundary changes
14 pairs of newly merged ORFs
5 pairs of un-merged ORFs
20 reinstated ORFs
5 spurious adjustments removed
8 new pseudogenes annotated
9 blocked reading frames added

