The Candida Genome Database: Sequence Documentation

This page provides information about the DNA and protein sequences in CGD, including their sources, how to access them, and further explanation of some sequence-related issues.

Information about Candida-related strains and species in CGD
Sources of sequence-based information in CGD
Version Tracking for Chromosomal Sequence and Genome Annotation
C. albicans SC5314 Genome Sequence Assemblies

Refinements to Assembly 21 in CGD

Sequence analysis
Sequence correction
ORF classification

Assembly 21 in CGD

Intron annotation
Non-intron adjustments to ORF coordinates
Translation and tRNAs

Assembly 20 in CGD
Assembly 19 in CGD
Assembly 6 in CGD

Sources of SNP data
Accessing Sequences in CGD

Information about Candida-related strains and species in CGD

CGD provides sequence for download from several Candida-related strains and species, listed below. Initially, CGD curation focused on the C. albicans literature, because C. albicans serves as a genetic model for the other Candida-related species and is the best represented of these species in the published experimental literature. As of June 2011, we have also added curated information about C. glabrata. We are now expanding the manual curation process to include information about other Candida-related species, and will be adding gene-based information for them, including Locus Summary pages.

The C. albicans SC5314 sequence file names, as well as chromosome identifiers within the files, were updated on 25 August 2010 to include the name of the species and strain. This change was necessary to accommodate multiple Candida and Candida-related species and strains at CGD.

Note: Candida albicans and some related species (often called the "CTG clade") use a non-standard genetic code, "Translation table 12: Alternative Yeast Nuclear Code," to translate nuclear genes. For more information about translation tables used in CGD, please see the Non-standard Genetic Code Usage in Candida help page.
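As an illustration of the note above, the following minimal Python sketch translates a coding sequence with either the standard code or Translation table 12. The only difference between the two tables is that CTG is read as serine rather than leucine. The function names (build_table, translate) are hypothetical and are not part of any CGD tool; reporting codons that contain ambiguous nucleotides as "X" is a choice made for this sketch.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code (NCBI translation table 1),
# listed in the codon order TTT, TTC, TTA, TTG, TCT, ... GGG.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def build_table(ctg_as_serine):
    """Return a codon -> amino acid dict; optionally apply the table 12 change."""
    table = {"".join(codon): aa
             for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)}
    if ctg_as_serine:
        # Translation table 12 (Alternative Yeast Nuclear Code): CTG -> Ser
        table["CTG"] = "S"
    return table


def translate(dna, table):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    dna = dna.upper()
    return "".join(table.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


standard = build_table(ctg_as_serine=False)
table12 = build_table(ctg_as_serine=True)
print(translate("ATGCTGTAA", standard))  # CTG read as leucine
print(translate("ATGCTGTAA", table12))   # CTG read as serine
```

Translating the same nuclear gene with both tables is a quick way to see which residues depend on the choice of genetic code.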

Sources of sequence-based information in CGD

Candida albicans SC5314 was sequenced by the Stanford Genome Technology Center (Jones et al., 2004, PNAS 101:7329-7334) and the Biotechnology Research Institute of the National Research Council of Canada (van het Hoog et al., 2007, Genome Biol 8:R52). Please see the assembly information below for more details.
Candida albicans WO-1 was sequenced by the Broad Institute (Butler et al. 2009, Nature 459:657-662). Sequence and annotation obtained by CGD from The Broad Institute.
Candida auris B8441 was sequenced by the Centers for Disease Control and Prevention (Lockhart et al. 2017, Clinical Infectious Diseases 2017:64:134-140). Sequence and annotation obtained by CGD from GenBank.
Candida dubliniensis CD36 was sequenced by the Wellcome Trust Sanger Institute (Jackson et al., 2009, Genome Res. 19:2231-2244). Sequence and annotation obtained by CGD from EBI.
Candida glabrata CBS138 was originally sequenced by Genolevures (Dujon et al., 2004, Nature 430:35-44; Koszul et al., 2003, FEBS Lett. 534(1-3):39-48). The genome was later re-assembled, leveraging long-read sequencing to correct errors in repetitive regions (Xu et al., 2020, Mol Microbiol 113:1209-1224). Please note that the alternate designations ATCC 2001 and CBS138 refer to the same strain of C. glabrata.
Candida guilliermondii ATCC 6260 was sequenced by the Broad Institute (Butler et al. 2009, Nature 459:657-662). Sequence and annotation obtained by CGD from The Broad Institute.
Candida lusitaniae ATCC 42720 was sequenced by the Broad Institute (Butler et al. 2009, Nature 459:657-662). Sequence and annotation obtained by CGD from The Broad Institute.
Candida orthopsilosis Co 90-125 was sequenced as described by Riccombeni et al., 2012 (PLoS ONE 7(4): e35750). Sequence and annotation obtained by CGD on 5/15/2012 from EBI.
Candida parapsilosis CDC 317 was sequenced by the Wellcome Trust Sanger Institute (Butler et al. 2009, Nature 459:657-662). Sequence and annotation obtained by CGD from GenBank.
Candida tropicalis MYA-3404 was sequenced by the Broad Institute (Butler et al. 2009, Nature 459:657-662). Sequence and annotation obtained by CGD from The Broad Institute.
Debaryomyces hansenii CBS767 was sequenced by Genolevures (Dujon et al., 2004, Nature 430:35-44). Sequence and annotation obtained by CGD from EBI.
Lodderomyces elongisporus NRRL YB-4239 was sequenced by the Broad Institute (Butler et al. 2009, Nature 459:657-662). Sequence and annotation obtained by CGD from The Broad Institute.

Version Tracking for Chromosomal Sequence and Genome Annotation

The version designation appears in the name of each of the relevant sequence files that are available at CGD, so the exact source of the sequence data is always clear.

This version system was implemented for C. albicans SC5314 and C. glabrata CBS138 in CGD as of June 2011, and it is based on the system designed for tracking of the A. nidulans sequence and annotation versions in AspGD. The same system of version designation will be used for version tracking for the chromosomal sequence and genome annotation of other species, as they are added into CGD.

Version designations appear in the following format:
sXX-mYY-rZZ
as described in detail here.
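For scripts that need to read the version designation out of a file name, a minimal Python sketch follows. It only extracts the three numeric fields of the sXX-mYY-rZZ pattern and does not interpret them. The file name in the example is hypothetical and is used only to show the pattern.

```python
import re
from typing import NamedTuple, Optional

# Matches the sXX-mYY-rZZ pattern anywhere in a string.
VERSION_RE = re.compile(r"s(\d+)-m(\d+)-r(\d+)")


class GenomeVersion(NamedTuple):
    s: int  # number following "s"
    m: int  # number following "m"
    r: int  # number following "r"


def find_version(text: str) -> Optional[GenomeVersion]:
    """Return the first version designation found in text, or None."""
    match = VERSION_RE.search(text)
    if match is None:
        return None
    return GenomeVersion(*(int(group) for group in match.groups()))


# Hypothetical file name, for illustration only.
example = "C_albicans_SC5314_version_A21-s02-m09-r10_chromosomes.fasta.gz"
print(find_version(example))
```

Because the result is a tuple of integers, two designations can be compared directly to decide which is newer.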

A list of all versions of the sequence and annotation for each species, with release notes, is available on the Summary of Genome Versions page.

Information about every update to the chromosome sequence and/or chromosomal location of any gene (or other annotated feature) is displayed on the CGD Locus History page for each of the relevant genes, and also on the appropriate CGD Chromosome History page.

Please feel free to contact us with any questions.

C. albicans SC5314 Genome Sequence Assemblies

Refinements to Assembly 21 in CGD (November 2008)

Update

The paper describing the comparative genomic analysis that was the basis for the refinements to Assembly 21 performed in 2008 has now been published (Butler, G., et al. [2009] Nature).

Sequence analysis

In a collaboration between CGD and the Broad Institute, MIT, a targeted re-analysis of Candida albicans genome sequence and annotation has been performed using new comparative genome analysis data and newly generated sequence data. A comparative genome analysis was done by Mike Lin, Christina Cuomo, Manolis Kellis, and colleagues at the Broad Institute, who compared the genome sequences of Candida albicans SC5314, Candida albicans WO-1, Candida dubliniensis, Candida tropicalis, Candida parapsilosis, Lodderomyces elongisporus, Debaryomyces hansenii, Candida guilliermondii, and Candida lusitaniae (Butler et al., submitted). Their analysis identified many conserved genomic regions corresponding to potential new ORFs, as well as regions of no significant conservation that are annotated as ORFs, which were candidates for "Dubious" classification. It also revealed several ORFs with incorrectly annotated boundaries, as well as possible sequencing errors that had led to incorrect ORF assignments. The Annotation Working Group had also previously identified suspected sequencing errors, and many ORFs in CGD contained "adjustments" (artificial sequence changes), which were added to compensate for such presumed errors and to restore ORF integrity. CGD staff inspected each of the areas identified by the Broad Institute group and by the Annotation Working Group.

Sequence correction

C. albicans

As a result of this analysis, hundreds of sequence errors were corrected, which allowed us to update annotations for 530 ORFs, and 73 new ORFs identified in the comparative genome analysis were added to CGD. All artificial "adjustments" (arbitrary sequence changes made to correct presumed errors; see below) were removed from the sequence. The sequence and annotation changes made on each chromosome are listed on individual Chromosome History pages, which are linked from a Summary Table. The detailed description of the methodology used in analysis and curation, as well as the summary of the results, is available in the Sequence Refinements, November 2008 documentation.

Note that sequence and annotation changes were made to Assembly 21 only, not to previous assemblies.

ORF classification

As another result of this analysis, 181 non-conserved ORFs were identified whose sequence is indistinguishable from random non-coding sequence. These ORFs were classified in CGD as Dubious ORFs, unlikely to be biologically significant. The remaining ORFs in CGD were classified either as Verified, meaning that there is experimental evidence for the existence of a gene product (as defined by the ORF having curated Gene Ontology terms with experimental evidence codes, i.e., evidence codes other than IEA, ISS, RCA, ISA, ISM, ISO, NAS), or as Uncharacterized, meaning that no experimental evidence currently exists but that the ORFs are likely to represent biologically significant genes. These classifications are displayed on the Locus Summary page of each ORF, and may be changed in the future as new experimental evidence becomes available.
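The classification rule described above can be sketched in Python as follows. This is an illustration of the stated criteria, not CGD's curation code. The function name and the is_dubious flag are hypothetical; the flag stands in for the outcome of the comparative analysis, which cannot be derived from evidence codes.

```python
# Evidence codes that do NOT count as experimental evidence, as listed above.
NON_EXPERIMENTAL = frozenset({"IEA", "ISS", "RCA", "ISA", "ISM", "ISO", "NAS"})


def classify_orf(go_evidence_codes, is_dubious=False):
    """Classify an ORF from the evidence codes of its curated GO annotations.

    is_dubious is a hypothetical flag representing ORFs judged non-conserved
    and indistinguishable from random non-coding sequence.
    """
    if is_dubious:
        return "Dubious"
    if any(code not in NON_EXPERIMENTAL for code in go_evidence_codes):
        return "Verified"
    return "Uncharacterized"


print(classify_orf(["IEA", "IDA"]))  # one experimental code is enough
print(classify_orf(["IEA", "ISS"]))  # only non-experimental codes
print(classify_orf([]))              # no GO annotations at all
```

Dubious is checked first because, as described above, only the remaining ORFs are divided into Verified and Uncharacterized.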

Assembly 21 in CGD (September 2007)

Assembly 21 (A21) is described in: van het Hoog M, et al. (2007) Assembly of the Candida albicans genome into sixteen supercontigs aligned on the eight chromosomes. Genome Biol 8(4):R52. URL: http://genomebiology.com/2007/8/4/R52 . In addition to making a chromosomal level assembly, by mapping the contigs of Assembly 19 (A19) to chromosomes and filling many of the gaps between them, the authors also made numerous and widespread modifications to the genomic sequence based on alignments of the sequence traces generated by inputting the SGTC's sequence traces into Sequencher software. Many of these modifications introduced insertions, deletions, and substitutions relative to the Assembly 19 sequence. In many cases A, C, T, or G was substituted with an ambiguous nucleotide; within ORFs, such ambiguous nucleotides consequently resulted in ambiguous amino acids in the predicted ORF translation, each of which is represented as an "X" within the A21 protein sequence.

More information is available in Assembly 21 Sequence Documentation.

Intron annotation data

The intron data published in the paper
Mitrovich QM, Tuch BB, Guthrie C, Johnson AD. Computational and experimental approaches double the number of known introns in the pathogenic yeast Candida albicans. Genome Res. 2007 Apr;17(4):492-502
have been incorporated into Assembly 21 (and 20) in CGD. Assembly 19 coordinates have not been updated.

Non-intron adjustments to ORF coordinates

In Assembly 21 (and 20), gaps that were introduced by the Annotation Working Group (AWG) to compensate for presumed sequencing errors that interrupt ORFs are labeled "Adjustments". "Adjustments" refer to gaps between regions of CDS that are NOT expected to be biologically significant introns. Adjustments with a length that is a negative number indicate an overlap between two regions of CDS, resulting in a duplication of the overlapping part of the sequence in the predicted ORF. In Assembly 19 all introns and adjustments are called Gaps.

In November 2008, all the non-intron "adjustments" in Assembly 21 were removed, as explained in the Sequence Refinements, November 2008 documentation.

Summary files listing introns and non-intron adjustments are available from the CGD Downloads site.
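The length convention for adjustments described in this section can be illustrated with a short Python sketch. It assumes, for illustration only, that CDS regions are given as 1-based inclusive (start, end) pairs on the forward strand; the coordinate format of the actual summary files may differ, and the function names are hypothetical.

```python
def gap_lengths(segments):
    """Length of the gap between each pair of consecutive CDS regions.

    A positive value is a gap; a negative value means the regions overlap.
    """
    return [nxt[0] - prev[1] - 1
            for prev, nxt in zip(segments, segments[1:])]


def join_segments(sequence, segments):
    """Concatenate the CDS regions; overlapping bases are included twice."""
    return "".join(sequence[start - 1:end] for start, end in segments)


genomic = "ATGAAACCCGGGTAA"

# Two regions separated by a 3-base gap (positions 7-9 are skipped).
print(gap_lengths([(1, 6), (10, 15)]), join_segments(genomic, [(1, 6), (10, 15)]))

# Two regions that overlap by 3 bases (positions 7-9 appear in both).
print(gap_lengths([(1, 9), (7, 15)]), join_segments(genomic, [(1, 9), (7, 15)]))
```

In the second case the joined sequence is longer than the genomic span, which is the duplication effect of a negative-length adjustment.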

Translation and tRNAs

The tRNA genes were predicted from the C. albicans genome sequence using the tRNAscan-SE algorithm developed by T. M. Lowe and S. R. Eddy. The process used is described in Annotation of tRNAs in the Candida Genome Database.

The C. albicans codon usage table may be accessed using the link in the left-hand menu bar of the CGD home page, under the heading "Download Data," or using the link on the Download Sequence page. This table displays the calculated frequency of use of each codon in the diploid complement of C. albicans protein-coding genes. The table was produced with the GCG program CodonFrequency using the diploid complement of all predicted coding sequences (13,117 open reading frames) from Assembly 19 of the C. albicans SC5314 genomic sequence, as found in the file 'orf_coding.fasta' dated 07-Jun-2005. Where the sequences of two alleles differ, both sequences were used to calculate codon usage. Where the sequences of two alleles were identical, two copies of the coding sequence were added to the pool of sequences used to calculate codon usage. Thus, codon usage was calculated from the entire diploid complement of protein-coding genes.
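The pooling rule described above (both alleles counted when they differ, the single sequence counted twice when they are identical) can be sketched in Python. This is not the GCG CodonFrequency program and does not reproduce its output format. The input layout, a pair of allele sequences per gene with None for the second allele when only one sequence is recorded, is an assumption made for this example.

```python
from collections import Counter


def codons(cds):
    """Split a coding sequence into complete codons."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]


def diploid_codon_counts(genes):
    """Count codons over the diploid complement.

    genes: iterable of (allele_a, allele_b) pairs; allele_b is None when the
    two alleles are identical, in which case allele_a is counted twice.
    """
    counts = Counter()
    for allele_a, allele_b in genes:
        counts.update(codons(allele_a))
        counts.update(codons(allele_a if allele_b is None else allele_b))
    return counts


def codon_frequencies(counts):
    """Convert raw counts into fractions of all counted codons."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


genes = [
    ("ATGCTGTAA", None),         # identical alleles: counted twice
    ("ATGTTGTAA", "ATGCTGTAA"),  # alleles differ: both are counted
]
counts = diploid_codon_counts(genes)
print(counts["CTG"], counts["TTG"])
```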

Note that C. albicans uses an alternative genetic code for nuclear genes, different from that used by most other fungi. Details and links to translation tables for nuclear and mitochondrial genes can be found at NCBI's Taxonomy Browser.

Assembly 20 in CGD (September 2006)

Assembly 20 of the C. albicans sequence, released in May 2006, was a collaborative effort of groups at the Biotechnology Research Institute of the National Research Council, Canada; the University of Minnesota, USA; and Chiba University, Japan. After the release, it was discovered that the sequence traces that had been used to fill some of the gaps and determine overlaps between Assembly 19 contigs were derived from strain WO-1, rather than from the reference strain SC5314. The sequence of these regions is consequently expected to be inaccurate where WO-1 sequence was used, and small contigs may have been misassembled based on the WO-1 sequence data. The Biotechnology Research Institute of the National Research Council of Canada has since released a new assembly, Assembly 21, that supersedes Assembly 20.

More information is available in Assembly 20 Sequence Documentation.

Assembly 19 in CGD (March 2004)

The contig sequences in CGD are from Assembly 19 of the C. albicans genome sequence, from the supplementary material published in the C. albicans sequencing paper, Jones, T., Federspiel, N.A., Chibana, H., Dungan, J., Kalman, S., Magee, B.B., Newport, G., Thorstenson, Y.R., Agabian, N., Magee, P.T., Davis, R.W. and S. Scherer. (2004) The Diploid Genome of Candida albicans. PNAS 101:7329-7334. Supplementary data: http://genome-www.stanford.edu/candida-pnas2004-supplement/. (Older sequence assemblies, including Assemblies 4, 5, and 6, have been archived at CGD. These data may be retrieved from the "archived_assemblies" folder on the CGD Sequence Download Page.)

More information is available in Assembly 19 Sequence Documentation.

Assembly 6 in CGD (January 2002)

This page contains documentation from the Stanford Genome Technology Center (SGTC), which was previously available on the SGTC's Candida information server, and has been archived here (verbatim) for reference.

Note: The original SC5314 sequence trace files and quality scores generated by the Stanford Genome Technology Center are available for download from CGD. The construction of the sequencing library and sequencing methods are described in Tzung et al. (2001).

Sources of SNP data

Please note: This is not intended to be a comprehensive bibliography, but rather a list of a few helpful references:

SNPs between allelic Assembly 19 contigs, C. albicans strain SC5314, are published in the Assembly 19 paper, Jones et al. (2004), and these data are available for download as supplementary material associated with the publication.

SNP data for C. albicans from Forche et al. (2004) are available in the supplementary material associated with the paper and may be viewed using the SNP track in the CGD GBrowse genome browser.

SNP data are included among the data from Butler et al. (2009) for eight Candida genomes, and are available for download as supplementary material associated with the paper, and from the Broad Institute website.

Accessing Sequences in CGD

From the Locus Summary Page:

The "Retrieve Sequences" pull-down menu, which is located on the Resources sidebar on the right-hand side of each Locus Summary Page, retrieves, for each gene in Assembly 21, or each allele in Assembly 19: the Genomic DNA (with introns included); the Coding Sequence (with introns removed); the Genomic DNA with 1 kb of flanking sequence upstream and downstream of the gene (also includes any introns); or the ORF translation (predicted protein sequence).
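For readers working with downloaded chromosome sequence rather than the web interface, the "Genomic DNA with 1 kb of flanking sequence" option can be approximated with the Python sketch below. It is not the code behind the pull-down menu. The 1-based inclusive coordinates, the truncation of flanks at chromosome ends, and the reverse-complementing of reverse-strand genes are all assumptions of this example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def genomic_with_flanks(chromosome, start, end, strand="+", flank=1000):
    """Return the gene region plus up to `flank` bases on each side.

    start and end are 1-based inclusive coordinates with start <= end.
    Flanks are shortened where the gene lies near a chromosome end.
    """
    left = max(1, start - flank)
    right = min(len(chromosome), end + flank)
    region = chromosome[left - 1:right]
    return reverse_complement(region) if strand == "-" else region


chromosome = "AAAAACCGTTTTT"
print(genomic_with_flanks(chromosome, 6, 8, flank=2))              # forward strand
print(genomic_with_flanks(chromosome, 6, 8, strand="-", flank=2))  # reverse strand
```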

From the CGD Sequence Retrieval Tool:

To access the Sequence Retrieval Tool (also called Get Sequence, or Gene/Sequence Resources), use the link under Search Options on the left-hand sidebar of the CGD Home Page, or use the "Gene/Sequence Resources" link under Specialized Gene and Sequence Searches on the Search Options page.

By Bulk Download

You may download gzip-compressed sequence files in bulk from the CGD Sequence Download Page; a variety of file options exist for retrieval of data from Assemblies 19, 20, and 21. There is a link to this page under Download Data on the left-hand sidebar of the CGD Home Page. Archived copies of older sequence assemblies, including Assemblies 4, 5, and 6, may also be retrieved from the CGD Sequence Download Page.
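Once downloaded, a gzip-compressed sequence file can be read without unpacking it first. The Python sketch below assumes the file is in FASTA format and uses only the standard library. The function names are hypothetical, and taking the first word of each header line as the record name is a convention chosen for this example.

```python
import gzip
import io


def read_fasta(handle):
    """Parse FASTA text from an open handle into a {name: sequence} dict."""
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split(maxsplit=1)
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def read_gzipped_fasta(path_or_file):
    """Read a gzip-compressed FASTA file given a path or a binary file object."""
    with gzip.open(path_or_file, "rt") as handle:
        return read_fasta(handle)


# Self-contained demonstration using an in-memory gzip stream.
demo = gzip.compress(b">orf1 example description\nATG\nCTG\n>orf2\nTAA\n")
print(read_gzipped_fasta(io.BytesIO(demo)))
```

For a real download, pass the path of the saved file to read_gzipped_fasta instead of the in-memory stream.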

You may also retrieve sequence information for any set of genes (either specified by a list of gene names, or by selecting a region of a chromosome or contig) using the Batch Download Tool.

From the GBrowse Genome Browser:

You may also view nucleotide or protein sequence using the GBrowse genome browser. GBrowse may be accessed using the "Chromosomal Location" or "Contig Location(s)" links, or the GBrowse map thumbnail views on each Locus page, or by using the "Genome Browser" links displayed on each BLAST result page. Sequence download options are available from the Reports & Analysis pull-down menu in the interface. The GBrowse Help Documentation page has additional instructions for use of the GBrowse interface.

Using BLAST (Basic Local Alignment Search Tool):

You may use the CGD BLAST tool to conduct protein or DNA sequence searches against various sequence datasets in CGD, as described in detail on the BLAST documentation page. Alignments of the query sequence with its sequence matches (also called "hits") are displayed along with hyperlinks to related sequence resources. The "CGD GBROWSE" hyperlink above each set of HSPs on the BLAST results page opens the GBrowse genome browser, with the HSP displayed in the browser window. GBrowse may be used to further explore the region containing the match: to view ORFs and other features in the neighborhood of the hit, to browse and download adjacent sequences, to view the 6-frame translation of the region, and to view restriction sites. (For a description of GBrowse features, please see our GBrowse documentation). If applicable, links are provided to directly download/view the entire ORF or peptide sequence, or to navigate to the corresponding Locus page.

