Assembly 21 (A21) is described in van het Hoog M, et al. (2007) Assembly of the Candida albicans genome into sixteen supercontigs aligned on the eight chromosomes. Genome Biol 8(4):R52 (URL: http://genomebiology.com/2007/8/4/R52). In addition to producing a chromosome-level assembly by mapping the contigs of Assembly 19 (A19) to chromosomes and filling many of the gaps between them, the authors made numerous and widespread modifications to the genomic sequence, based on alignments generated by loading the SGTC's sequence traces into Sequencher software. Many of these modifications introduced insertions, deletions, and substitutions relative to the Assembly 19 sequence. In many cases A, C, T, or G has been replaced with an ambiguous nucleotide; within ORFs, such ambiguous nucleotides result in ambiguous amino acids in the predicted ORF translation, each of which is represented as an "X" within the A21 protein sequence.
The coordinates of each ORF in A21 were determined at CGD using a sequence-based mapping procedure, described in detail under "A21 ORF mapping procedure" below. Where possible, the Assembly 20 (A20) sequence of each ORF was used to determine the A21 coordinates; if the A20 ORF sequence did not match any sequence in A21, then the A19 sequence of the ORF was compared to the A21 sequence (because the sequence of some ORFs changed significantly between A19 and A20). Where relevant, it is noted in the downloadable files whether the A19 or A20 version of the ORF was used for mapping onto A21 (called "assembly used for mapping" in the file descriptions below).
Some ORFs from Assembly 20 map to corresponding regions in A21 chromosomes that, due to sequence changes made during generation of A21, have frameshifts or new in-frame stop codons in A21. In such cases, CGD manually added "adjustments" to the ORF coordinates, splitting the ORF into segments, with gaps or overlap between segments so as to restore the A21 coding sequence (CDS) to match the A20 CDS. In cases in which the A21 sequence lacks the stop codon that terminates the ORF in A20, the A21 ORF has been extended downstream to the next in-frame stop codon. Issues encountered in mapping individual ORFs to A21 are described on the Locus History page for the ORF in CGD.
ORFs in Assembly 19 that were deleted in A20 were not mapped to A21.
The original Assembly 21 released by van het Hoog et al. does not include the sequence of the mitochondrial DNA. Datasets that contain the mitochondrial genome use the sequence from Assembly 19.
The chromosomal coordinates of the tRNA genes were predicted in Assembly 21 by using the tRNAscan-SE algorithm developed by T. M. Lowe and S. R. Eddy.
CGD has conducted a series of analyses of the A21 ORF set, the results of which are available for download and described in detail below. To download any file, select the filename.
Note: Throughout these files, coding sequence is denoted by upper case letters, and noncoding (intron or adjustment/gap) sequences are denoted by lower case letters. An ORF's "genomic sequence" is the chromosomal sequence of the region, including any intronic sequence or gap sequence within the boundaries of the ORF, from the beginning of the start codon through the last nucleotide of the stop codon. An ORF's "coding sequence" or "CDS" is the sequence of the actual open reading frame, after introns have been spliced out and any non-intron adjustments have been made (gaps removed, or overlapping segments aligned into place).
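The case convention above can be used to separate coding from noncoding sequence. The following is a minimal sketch, not CGD's own code; the function name and the example sequence are made up for illustration. It assumes a simple ORF whose gaps all have positive length. For ORFs with overlapping segments (adjustments of negative length), concatenating the upper case letters may not reproduce the CDS, so the CDS given in the files should be treated as authoritative.

```python
def split_by_case(genomic: str) -> tuple[str, str]:
    """Split a case-annotated ORF genomic sequence.

    Returns (upper_case_part, lower_case_part): upper case letters are
    coding sequence, lower case letters are intron or adjustment/gap
    sequence, following the convention used in the A21 files.
    """
    coding = "".join(ch for ch in genomic if ch.isupper())
    noncoding = "".join(ch for ch in genomic if ch.islower())
    return coding, noncoding


# Made-up example: two coding segments separated by a 10-base gap.
example = "ATGGCTgtaagtttagAAATAA"
coding, noncoding = split_by_case(example)
print(coding)     # upper case (coding) letters only
print(noncoding)  # lower case (noncoding) letters only
```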
ORFsDeletedFromA21.txt - This file contains a list of ORF names and, where available, locus names for ORFs that could not be mapped well (from either A20 or A19) or that were deemed not fixable (e.g., multiple stop codons within all three A21 reading frames). Hence, these are the ORFs that are "Deleted from Assembly 21" and labeled as such on their Locus pages. ORFs that have been deleted have been eliminated from the other files listed below (e.g., an ORF with stop codons internal to the reading frame that has been deleted from A21 is present on this list and is not included in the ORFsWinternalStops.txt file).
ORFsWintrons.txt - This file contains ORFs that have introns in A21. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and full genomic and coding sequence of each of these ORFs.
ORFsWintrons_alignment.txt - This file has alignments between the full A21 genomic sequence and the A21 coding sequence of ORFs that have introns. The alignments were generated using MUSCLE, with the sequences in the file ORFsWintrons.txt, and the output is in ClustalW format.
ORFsWambiguousBases.txt - This file contains ORFs with ambiguous bases (other than A, T, G, or C) in their A21 genomic sequence (the sequence of the ORF itself and any gap or intron sequences within). The following ambiguous bases are contained in the Assembly 21 sequence as released to CGD:
B : C or G or T;
D : A or G or T;
K : G or T;
M : A or C;
N : unspecified A or C or G or T;
R : A or G;
S : C or G;
W : A or T;
Y : C or T.
This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and full genomic sequence of each of these ORFs.
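As an illustration of the ambiguity codes listed above, the sketch below reports where ambiguous bases occur in a sequence. The dictionary contains only the codes that this document lists for Assembly 21; the function name and the example sequence are hypothetical.

```python
# Ambiguity codes listed for the Assembly 21 sequence, mapped to the
# bases each one may stand for.
AMBIGUITY_CODES = {
    "B": "CGT",
    "D": "AGT",
    "K": "GT",
    "M": "AC",
    "N": "ACGT",
    "R": "AG",
    "S": "CG",
    "W": "AT",
    "Y": "CT",
}


def find_ambiguous(seq: str) -> list[tuple[int, str]]:
    """Return (1-based position, code) for each ambiguous base in seq.

    Case is ignored, so ambiguous bases inside lower case intron or
    gap sequence are reported as well.
    """
    return [
        (i + 1, ch.upper())
        for i, ch in enumerate(seq)
        if ch.upper() in AMBIGUITY_CODES
    ]


# Made-up example with one ambiguous base in coding sequence (R)
# and one in lower case gap sequence (y).
print(find_ambiguous("ATGRccyTAA"))
```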
ORFsWadjustments.txt - This file contains ORFs that have adjustments in their A21 coordinates. "Adjustments" are gaps or overlap between coding segments that have been introduced into the ORF coordinates to compensate for presumed errors in the sequence that cause stop codons internal to the ORF region (either an in-frame stop codon or a shift in reading frame). Please note that some A20 ORFs also have adjustments, and that the list of A20 ORFs with adjustments differs from those with adjustments in A21, due to changes that have been made to the sequence during generation of A21. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and full genomic and coding sequence of each of these ORFs.
ORFsWadjustments_alignment.txt - This file contains alignments between the full A21 genomic sequence and the A21 coding sequence of ORFs that have adjustments. The alignments were generated using MUSCLE, with the sequences in the file ORFsWadjustments.txt, and the output is in ClustalW format.
ORFsWinternalStops.txt - This file contains protein sequences of ORFs that, when mapped to A21, have stop codons within the reading frame in the A21 sequence. "Adjustments" have been manually introduced to the ORF coordinates to reconstruct a complete reading frame, where possible (see also: ORFsWadjustments.txt). This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the amino acid sequence of the predicted ORF translation. For ORFs that have been given "adjustments," the adjusted sequence is also listed.
ORFsWnoEndStop.txt - This file contains protein sequences of ORFs that, when mapped to A21, do not end with a stop codon. Many of these have other problems in A21, such as insertions or deletions in the sequence that disrupt the reading frame. Where possible, ORF coordinates have been manually updated to create a complete reading frame (e.g., truncated at an in-frame stop, extended downstream to the next in-frame stop, or given an "adjustment" that changes the reading frame). This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the amino acid sequence of the predicted ORF translation. For cases where the ORF has been manually updated, the updated sequence is also included.
ORFsWnonATGstart.txt - This file lists ORFs that do not start with an ATG in A21. In some cases, the coordinates were updated manually during curatorial review, and the updated ORF does start with an ATG; these cases are also listed in this file. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the coding sequence of each of these ORFs. For cases where the ORF has been manually updated, the updated sequence is also included.
ORFsWpartialCodon.txt - This file contains coding sequences of ORFs that have partial codons (i.e., the length of the coding sequence is not a multiple of 3). Where possible, ORF coordinates have been manually updated to create a complete reading frame (e.g., truncated at an in-frame stop, extended downstream to the next in-frame stop, or given an "adjustment" that changes the reading frame). This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the coding sequence (CDS). For cases where the ORF has been manually updated, the updated coding sequence is also included.
ORFsWmanualUpdates.txt - This file lists the A21 "problem ORFs" that were manually updated at CGD to address problems in A21. All ORFs that failed one or more of the checks described below under "Details of the A21 ORF Mapping Procedure" were subject to manual review at CGD. Where possible, the issues were addressed: "adjustment(s)" were added to change reading frame or to bypass in-frame stop codons, coordinates were updated to extend or truncate the ORF to acquire a terminal stop codon, and ORFs that narrowly failed criteria for "good" matches to A21 were reinstated if they could be mapped to A21. Only the ORFs for which the coordinates were changed as a result of this review are listed in this file. This file contains the ORF name, locus name (where available), A21 coordinates determined by the computational mapping procedure, manually updated ORF coordinates, descriptive note, the assembly used for mapping, the predicted ORF translation based on the coordinates before updating, and the predicted ORF translation based on the updated coordinates.
ORFsWnoChanges.txt - This file lists ORFs that have not changed at all in A21 as compared to the sequence in the assembly (A20/A19) that was used for mapping the ORF in each case. (Where possible, the A20 sequence of each ORF was used to determine the A21 coordinates; if the A20 ORF sequence did not match any sequence in A21, then the A19 sequence of the ORF was compared to the A21 sequence; this was necessary because the sequence of some ORFs changed significantly between A19 and A20.) This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the genomic sequence of each of these ORFs.
ORFsWsynChanges.txt - This file lists A21 ORFs that show synonymous changes when compared to the sequence in the (A20/A19) assembly that was used for mapping the ORF in each case. The nucleotide sequence of the coding sequence (CDS), excluding any intronic sequence or gap regions ("adjustments"), is not the same between the two assemblies; however, the translated amino acid sequence is the same. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the A21 coding sequence of each of these ORFs, followed by the ORF name, locus name, A20 or A19 coordinates (the assembly from which the ORF was mapped), and A20 or A19 coding sequence of the ORF.
ORFsWsynChanges_alignment.txt - This file contains alignments between the A21 coding sequence (CDS) and the A20/A19 CDS of each ORF that shows synonymous changes upon mapping to Assembly 21. The alignments were generated using MUSCLE, with the sequences in the file ORFsWsynChanges.txt, and the output is in ClustalW format.
ORFsWnonsynChanges.txt - This file lists A21 ORFs that show non-synonymous changes when compared to the sequence in the (A20/A19) assembly that was used for mapping the ORF in each case. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the A21 coding sequence (CDS) of each of these ORFs, followed by the ORF name, locus name, A20 or A19 coordinates (the assembly from which the ORF was mapped), and A20 or A19 CDS of the ORF.
ORFsWnonsynChanges_alignment.txt - This file contains alignments between the A21 coding sequence (CDS) and the A20/A19 CDS of each ORF that shows non-synonymous changes upon mapping to Assembly 21. The alignments were generated using MUSCLE, with the sequences in the file ORFsWnonsynChanges.txt, and the output is in ClustalW format.
ORFsWsimpleChanges.txt - This file lists A21 ORFs that show simple changes when compared to the sequence in the (A20/A19) assembly that was used for mapping the ORF in each case. A "simple" change is defined as follows: the nucleotide identity across the ORFs (the genomic region from the start to stop codon, including introns or gaps within the ORF) is 98% or greater, and the aligned region encompasses the entire length of the ORF. The "simple change" category includes ORFs that may contain substitutions, small insertions, and/or small deletions. Cases in which only sequence within an intronic or gap ("adjustment") region has changed, and the translated sequence has not been affected, are also included in this category. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the A21 genomic sequence of each of these ORFs, followed by the ORF name, locus name, A20 or A19 coordinates (the assembly from which the ORF was mapped), and A20 or A19 genomic sequence of the ORF.
ORFsWsimpleChanges_alignment.txt - This file contains alignments between the A21 genomic sequence and the A20/A19 sequence of each ORF that shows simple changes upon mapping to Assembly 21. The alignments were generated using MUSCLE, with the sequences in the file ORFsWsimpleChanges.txt, and the output is in ClustalW format.
ORFsWcomplexChanges.txt - This file lists A21 ORFs that show complex changes when compared to the sequence in the (A20/A19) assembly that was used for mapping the ORF in each case. A "complex" change is defined as follows: the nucleotide identity across the ORFs (the genomic region from the start to stop codon, including introns or gaps within the ORF) is less than 98%, and/or the aligned region does not encompass the entire length of the ORF. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the A21 genomic sequence of each of these ORFs, followed by the ORF name, locus name, A20 or A19 coordinates (the assembly from which the ORF was mapped), and A20 or A19 genomic sequence of the ORF.
ORFsWcomplexChanges_alignment.txt - This file contains alignments between the A21 sequence and the A20/A19 sequence of each ORF that shows complex changes upon mapping to Assembly 21. The alignments were generated using MUSCLE, with the sequences in the file ORFsWcomplexChanges.txt, and the output is in ClustalW format.
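The "simple" and "complex" definitions given above for ORFsWsimpleChanges.txt and ORFsWcomplexChanges.txt can be expressed as a small decision function. This is a sketch under stated assumptions, not the procedure used at CGD: the function name and parameters are hypothetical, "the aligned region encompasses the entire length of the ORF" is interpreted as the aligned length being at least the ORF length, and the function is meant only for ORFs already known to differ between assemblies (unchanged ORFs are reported separately in ORFsWnoChanges.txt).

```python
def classify_change(percent_identity: float,
                    aligned_length: int,
                    orf_length: int) -> str:
    """Classify a changed ORF as 'simple' or 'complex'.

    simple:  nucleotide identity >= 98% AND the alignment covers the
             whole ORF (start to stop codon, including introns/gaps).
    complex: identity < 98% and/or the alignment does not cover the
             whole ORF.
    """
    covers_whole_orf = aligned_length >= orf_length  # assumed interpretation
    if percent_identity >= 98.0 and covers_whole_orf:
        return "simple"
    return "complex"


# Made-up numbers for illustration.
print(classify_change(99.2, 900, 900))
print(classify_change(99.2, 850, 900))
```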
We first used the A20 sequence of each ORF to find the corresponding region on A21 chromosomes. The full A20 genomic DNA sequence of each ORF (including introns and any gap/adjustment regions) was mapped, using BLAST, to A21 chromosomes. The threshold parameters used were that both the percent identity and alignment length of the BLAST HSP (high-scoring pair) be > 95%. The ORFs that passed this threshold were called "good" full ORF matches. If there was no good match to the A20 sequence of an ORF, we tried the mapping using the A19 ORF sequence.
Then, for ORFs with multiple exons, the coding sequence segments (individually referred to as a "subfeature") were each separately mapped to the A21 region that matched the full ORF. Thresholds were more strict for the subfeature mapping: alignment length was required to be 100% and percent identity > 99%. The matched subfeature regions were used to define the ORF location and then various checks were performed on the generated A21 ORF sequences.
Manual review was performed in cases where ORFs did not pass the criteria for a "good" match to A21 using either A20 or A19 sequence. There were 16 such cases; some had narrowly failed the criteria used for computational classification, and were manually assigned to A21 coordinates.
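The thresholds described above for full-ORF and subfeature matches, and the A20-then-A19 fallback, can be sketched as follows. This is an illustration only: the function names and the hit representation are hypothetical, and the "alignment length" percentage is assumed to be the HSP length relative to the length of the query (the ORF or subfeature), which the text does not state explicitly.

```python
def is_good_full_orf_match(percent_identity: float,
                           hsp_length: int,
                           query_length: int) -> bool:
    """Full-ORF criterion: identity > 95% and alignment length > 95%."""
    coverage = 100.0 * hsp_length / query_length  # assumed definition
    return percent_identity > 95.0 and coverage > 95.0


def is_good_subfeature_match(percent_identity: float,
                             hsp_length: int,
                             query_length: int) -> bool:
    """Subfeature criterion: alignment length 100% and identity > 99%."""
    return hsp_length >= query_length and percent_identity > 99.0


def choose_assembly(a20_hit, a19_hit):
    """Return 'A20', 'A19', or None (candidate for manual review).

    Each hit is a (percent_identity, hsp_length, query_length) tuple,
    or None if no hit was found. A20 is tried first; A19 is used only
    if the A20 sequence has no good match.
    """
    for name, hit in (("A20", a20_hit), ("A19", a19_hit)):
        if hit is not None and is_good_full_orf_match(*hit):
            return name
    return None


# Made-up hits: the A20 sequence matches poorly, the A19 sequence well.
print(choose_assembly((90.0, 1000, 1000), (99.0, 990, 1000)))
```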
Below is the list of checks that were performed on each ORF.
(1) check that full DNA sequence does not have invalid bases
(2) check that full DNA sequence does not have N characters
(3) check that coding sequence is multiple of three in length (no partial codons)
(4) check that coding sequence start is ATG
(5) check that coding sequence ends with a stop codon
(6) check that coding sequence does not terminate with multiple sequential stop codons
(7) check that coding sequence does not have stop codons internal to the ORF
(8) check that exon-intron boundaries have not changed and, for real introns (as opposed to "adjustments"), compare sequence around the intron and its adjoining exons to make sure it is intact
(9) check if the intron splice sites are canonical, that is 5' : GT and 3' : AG
(10) line up subfeatures and find gaps or overlaps and suggest adjustments (Only exons and real introns were mapped to A21; adjustments were not mapped.)
(11) check coverage of subfeatures: start of first subfeature should coincide with start of ORF and, similarly, end of last subfeature should be the same as end of ORF
(12) check that full genomic DNA sequence does not have non-ATGC characters
(13) check that coding sequence does not have non-ATGC characters
ORFs that failed one or more of the following checks were reviewed manually: 3, 4, 5, 7. Where possible, the coordinates of the ORF were updated to address the issue. Please see ORFsWmanualUpdates.txt for a list of ORFs for which the coordinates were updated during curatorial review at CGD.
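The four checks that triggered manual review (3, 4, 5, and 7) are simple to state on a coding sequence. The sketch below is illustrative and is not the code used at CGD; the function name is hypothetical, and the set of stop codons (TAA, TAG, TGA) is an assumption, since the text above does not list them.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed set of stop codons


def failed_review_checks(cds: str) -> list[int]:
    """Return the numbers of the checks (3, 4, 5, 7) that a CDS fails.

    3: length is a multiple of three (no partial codons)
    4: sequence starts with ATG
    5: sequence ends with a stop codon
    7: no stop codons internal to the ORF (in frame from the start)
    """
    cds = cds.upper()
    failed = []
    if len(cds) % 3 != 0:
        failed.append(3)
    if not cds.startswith("ATG"):
        failed.append(4)
    if cds[-3:] not in STOP_CODONS:
        failed.append(5)
    # In-frame codons that begin before the last three nucleotides.
    internal = (cds[i:i + 3] for i in range(0, len(cds) - 3, 3))
    if any(codon in STOP_CODONS for codon in internal):
        failed.append(7)
    return failed


# Made-up sequences for illustration.
print(failed_review_checks("ATGGCTAAATAA"))  # passes all four checks
print(failed_review_checks("ATGTAAGCTTAA"))  # in-frame internal stop
```

An ORF for which this function returns a non-empty list would, under the procedure described above, have been a candidate for manual review.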
Like Assembly 20, Assembly 21 of the C. albicans sequence was a collaborative effort of groups at the Biotechnology Research Institute of the National Research Council, Canada; the University of Minnesota, USA; and Chiba University, Japan. Assembly 21 (A21) sequence files have been submitted by the A21 collaborators and are described in van het Hoog M, Rast TJ, Martchenko M, Grindle S, Dignard D, Hogues H, Cuomo C, Berriman M, Scherer S, Magee BB, Whiteway M, Chibana H, Nantel A, Magee PT. Assembly of the Candida albicans genome into sixteen supercontigs aligned on the eight chromosomes. Genome Biol. 2007 Apr 9;8(4):R52. The sequence files are available for download from the CGD Sequence Downloads directory. Please note that the sequences in these files are exactly as released to CGD by the A21 collaborators, prior to any analyses at CGD.
The intron data published in Mitrovich QM, Tuch BB, Guthrie C, Johnson AD. Computational and experimental approaches double the number of known introns in the pathogenic yeast Candida albicans. Genome Res. 2007 Apr;17(4):492-502 have been incorporated into Assembly 21 (and 20) in CGD. Assembly 19 coordinates have not been updated; in Assembly 19 all introns and adjustments are called Gaps.
Please note that this dataset now allows CGD to clearly separate the introns in Assembly 21 (and 20) from the gaps that were introduced by the Annotation Working Group (AWG) to compensate for presumed sequencing errors that interrupt ORFs. These gaps are labeled "Adjustments" in Assembly 21 (and 20) in CGD, and they should be considered markers for regions that require resequencing, rather than corrections to the assembly.
Terminology:
"Intron" labels features that are expected to be bona fide introns in vivo (in contrast to non-biological "adjustments").
"Adjustments" refer to gaps between regions of CDS that are NOT expected to be biologically significant introns. Adjustments with a length that is a negative number indicate an overlap between two regions of CDS, resulting in a duplication of the overlapping part of the sequence in the predicted ORF. Note that "adjustments" on the spreadsheet include all of the non-biological gaps, of positive and negative length.
"Gap" refers to either an intron or an adjustment.
Previously, on the CGD web pages and in our documentation, we had been calling the gaps "introns" and/or "gaps." In contrast, we labeled as "adjustments" the places where the annotation shows pieces of a coding sequence (CDS) to overlap (which is manifest as a "gap" with a length that is a negative number), because these were clearly not bona fide introns.
Summary files listing introns and non-intron adjustments are available from the CGD Downloads site:
ORFsWithIntrons and ORFsWithAdjustments spreadsheets:
There are some ORFs that have both introns and adjustments (a total of 18 A20 ORFs, upon initial incorporation of these data); these ORFs are present on both the ORFsWithIntrons and ORFsWithAdjustments spreadsheets. One additional ORF falls into this category, orf19.10708 (MTLalpha2), but this ORF is not contained in the haploid assemblies A20 or A21 and it is therefore omitted from this list. The coordinates of the adjustments in the ORFs that also contain introns were determined at CGD. The coordinates of all of the other adjustments came from the Annotation Working Group.
Coordinates: comma-delimited coordinate ranges specify each segment of the coding sequence (CDS); coordinates are inclusive. The W or C following the coordinates indicates that the ORF is on the Watson or Crick strand, respectively.
Size of gap: If there are multiple gaps, the individual lengths are separated by commas. Numbers over 999 do not have commas within them.
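To make the relationship between the coordinate segments and the gap sizes concrete, here is a minimal sketch that parses a coordinate string and computes the length of each gap, with a negative length for overlapping segments. The exact text layout of the Coordinates field is not specified above, so the "start-end,start-end" layout followed by W or C is an assumption, as are the function names and the example values.

```python
def parse_coordinates(text: str) -> tuple[list[tuple[int, int]], str]:
    """Parse e.g. '100-200,251-400W' into ([(100, 200), (251, 400)], 'W').

    Assumed layout: comma-delimited 'start-end' ranges (inclusive)
    followed by W (Watson) or C (Crick). Segments are normalized to
    (low, high) and sorted by position, whatever the strand.
    """
    text = text.strip()
    strand = text[-1]
    if strand not in ("W", "C"):
        raise ValueError("coordinates must end with W or C")
    segments = []
    for part in text[:-1].split(","):
        start, end = (int(x) for x in part.strip().split("-"))
        segments.append((min(start, end), max(start, end)))
    segments.sort()
    return segments, strand


def gap_lengths(segments: list[tuple[int, int]]) -> list[int]:
    """Length of the gap between each pair of neighbouring segments.

    Coordinates are inclusive, so the gap between (100, 200) and
    (251, 400) is 50 bases; a negative value means the segments overlap.
    """
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(segments, segments[1:])]


segments, strand = parse_coordinates("100-200,251-400W")  # made-up values
# Multiple gap sizes are joined with commas; numbers are written
# without thousands separators, as in the "Size of gap" column.
print(",".join(str(g) for g in gap_lengths(segments)))
```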
UTR introns: ORFs that have 5' UTR introns are not included on the ORFsWithIntrons sheet because the introns fall outside of the ORF boundaries; the 5' UTR introns are displayed as an Assembly 20 track in the GBrowse Genome Browser.
AllGaps spreadsheet:
The AllGaps sheet has a separate row for every gap (intron OR adjustment) within an ORF. ORFs with multiple gaps have multiple rows on the sheet. This list does not include 5' UTR introns, which are displayed as an Assembly 20 track in the GBrowse Genome Browser. Note that the coordinates on the AllGaps sheet are the coordinates of the gap, rather than the coordinates of the CDS. There is a difference in the reporting of coordinates for adjustments of negative length vs. all other gaps: gap coordinates are inclusive (as are exon coordinates), UNLESS the gap is an adjustment of negative length. Looking at the "gap length" field should clarify any cases in which there may be a question.
Assembly 20 of the C. albicans sequence was a collaborative effort of groups at the Biotechnology Research Institute of the National Research Council, Canada; the University of Minnesota, USA; and Chiba University, Japan.
WARNING: Assembly 20 Sequence Advisory
The collaborative group who generated Assembly 20 has discovered that the sequence traces that they had been using to fill some of the gaps and determine overlaps between Assembly 19 contigs were derived from strain WO-1, rather than from the reference strain, SC5314. The sequence of these regions is consequently expected to be inaccurate where WO-1 sequence has been used, and there also exists the chance that small contigs have been misassembled based on the WO-1 sequence data.
The Biotechnology Research Institute of the National Research Council of Canada released a list of the regions affected. By comparing the 1 kb flanking parts for each suspect region against the Contig19 sequences, CGD was able to reduce the size of many of the suspect regions. In CGD, these regions are displayed in the Assembly 20 Genome Browser (GBrowse) as orange-colored regions entitled "Suspect_WO1" followed by a number assigned sequentially to the problematic gap regions, 193 in total. (Please see example link.) These changes are reflected in the downloadable GFF files as well as in the Genome Browser display.
A list of the reduced regions and their chromosomal locations may be downloaded. A list of the ORFs that are affected by the regions may also be downloaded.
A list of the original regions and their chromosomal locations may be downloaded. A list of the ORFs that are affected by these regions may also be downloaded.
The physical mapping data are now available from the University of Minnesota, at http://albicansmap.ahc.umn.edu/index.html. The optical mapping data have been made available by P.T. Magee, and are now archived at CGD. The mapping data, which were used to order and orient contigs, originate exclusively from the reference strain SC5314, and may be downloaded.
To ensure that you are working only with sequence from the reference strain SC5314, you may retrieve data from Assembly 19 or Assembly 21 instead of Assembly 20. Please feel free to contact us with any further questions.
Whereas Assembly 19 is a diploid assembly that includes both alleles of each gene for cases in which they show significant sequence differences, Assembly 20 is a haploid assembly: in the production of Assembly 20, updates to Assembly 19 have been made in only one allele of each pair, though in some cases, genes may have been assembled from data from the two different alleles. The chromosomes may be thought of as 'reftigs', where they are mosaics of haplotypes, rather than representative of a single haploid genome in the sequenced strain. The process used to generate this assembly is described on the project web site at URL: http://candida.bri.nrc.ca/candida/alignments/index.cfm. The files generated by these groups are posted at URL: http://candida.bri.nrc.ca/alignments/editedEMBL/final. All of the Assembly 20 data in CGD come from these EMBL-format files, copies of which have also been archived at CGD.
Assembly 20 does not include the sequence of the mitochondrial DNA. Datasets that contain the mitochondrial genome use the sequence from Assembly 19.
The Assembly 20 files were processed at CGD to identify and classify changes that occurred between Assembly 19 and Assembly 20, and to identify other features in which users may be interested (e.g., introns), as described in detail below. Files containing all of these analyses (ORF lists, sequences, and/or alignments) are available for download from CGD. They may be accessed from the CGD Downloads web page, or downloaded by selecting the hyperlinked file names below.
Assembly 20 ORF Classification: The entire classification is summarized in the file, ClassificationTablePerGene.xls.
Sequence comparisons between Assembly 19 and Assembly 20 were performed. Each ORF from Assembly 20 has been classified according to how it changed between Assembly 19 and Assembly 20. The classifications for each gene appear on its CGD Locus page, next to the "Feature type" heading. In addition, ORFs in each category have explanatory Locus History notes on the CGD web site.
The source of all the Assembly 20 information is the set of EMBL-format chromosome files from the BRI, dated May 11, 2006. These were originally posted at http://candida.bri.nrc.ca/alignments/editedEMBL/final/:
Ca20FinalMay11.zip 05/11/2006 01:47:06 PM
A copy of these files has been archived at CGD (please see the CGD Downloads web page for archived, downloadable files).
The source of all Assembly 19 sequence information used for these analyses is the Candida Genome Database, July 2006.
Protein and nucleotide local sequence alignments were performed using bl2seq from the BLAST suite from NCBI. Global nucleotide sequence alignments were performed with the MUSCLE (multiple sequence comparison by log-expectation) software, available at the URL: http://www.drive5.com/muscle/.
1) New ORFs in Assembly 20
Criteria: The orf19 name is new in Assembly 20; it was not present in Assembly 19. This assignment was made computationally.
File contains: List of all ORFs that are new in Assembly 20
File name: NewInAssembly20.txt.
2) ORFs deleted from Assembly 20
Criteria: The orf19 name is present in Assembly 19 and it is not present in Assembly 20. This assignment was made computationally.
File contains: List of all ORFs that were removed during preparation of Assembly 20; they were present in Assembly 19
File name: DeletedFromAssembly20.txt
Note: A subset of the ORFs on this list were subsumed by, or "merged into" another ORF in Assembly 20. Some merged ORFs were combined with a neighboring ORF on the Contig from Assembly 19 (Contig-19). In other cases, an ORF was merged with an ORF that was not adjacent to it in Assembly 19; that is, the Contig-19s containing the two ORFs were not associated with each other in Assembly 19 but have been assembled next to, or overlapping with, each other in Assembly 20.
3) ORFs with no sequence change in Assembly 20
Criteria: The nucleotide sequence of the ORF in Assembly 19 and 20 is the same (sequence across the whole ORF, including any intronic sequence). This assignment was made computationally.
File contains: List of all ORFs with no changes to the nucleotide sequence between Assembly 19 and Assembly 20
File name: NoSeqChangeInAssembly20.txt
Note: These criteria do not exclude ORFs in which adjustments have been made to the position of an intron without any change in the underlying sequence.
4) ORFs with synonymous changes ONLY, between Assembly 19 and Assembly 20
Criteria: The nucleotide sequence of the coding sequence or CDS, excluding any intronic sequence, is not the same between the two assemblies; however, the translated amino acid sequence is the same. This assignment was made computationally.
File contains: List of all ORFs with only synonymous changes between Assembly 19 and Assembly 20 (the nucleotide sequence has changed, yet the predicted amino acid translation is unchanged), with nucleotide alignments between the sequence of the ORF in Assembly 19 and the sequence of the ORF in Assembly 20
File name: SynonymousOnlyChangeInAssembly20.txt
Note: ORFs classified in the categories "Simple Sequence Changes" and "Complex Sequence Changes" may have synonymous changes in addition to other, nonsynonymous sequence changes.
Note: Problem ORFs that have been extended by one or two basepairs in Assembly 20, in the absence of other sequence changes that affect the translated sequence (and therefore the alignment), will meet the criteria for inclusion in this category.
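The "synonymous changes only" criterion can be illustrated by translating both versions of a CDS and comparing the results. The sketch below is not the CGD pipeline. It assumes the translation table in which C. albicans reads CTG as serine rather than leucine (the standard table with that single change); the function names and example sequences are hypothetical. Codons containing ambiguous bases are translated as "X", and a trailing partial codon is ignored.

```python
from itertools import product

# Standard genetic code in TCAG order, with CTG reassigned to serine
# (assumption: the CTG-clade table used for C. albicans).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
CODON_TABLE["CTG"] = "S"


def translate(cds: str) -> str:
    """Translate a CDS; unknown or ambiguous codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


def is_synonymous_only(old_cds: str, new_cds: str) -> bool:
    """True if the nucleotide sequences differ but the translations match."""
    if old_cds.upper() == new_cds.upper():
        return False  # no change at all
    return translate(old_cds) == translate(new_cds)


# Made-up example: GCT and GCC both encode alanine.
print(translate("ATGGCTTAA"))
print(is_synonymous_only("ATGGCTTAA", "ATGGCCTAA"))
```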
5) ORFs with simple sequence changes in Assembly 20
Criteria: The aligned region encompasses the entire length of the ORF in both Assembly 19 and Assembly 20, and amino acid identity is 98% or greater. This assignment was made computationally.
File contains: List of all ORFs with small changes in protein sequence between Assembly 19 and Assembly 20, with protein sequence alignments.
File name: SimpleSeqChangesInAssembly20.txt
Note: This category includes ORFs that may contain substitutions, small insertions, and/or small deletions, yet overall identity between the two predicted protein sequences is 98% or greater. Cases in which only intronic sequence has changed, and the translated sequence has not been affected, are also included in this category.
6) ORFs with complex sequence changes in Assembly 20
Criteria: ORF has changed in nucleotide sequence, and changes do not fall into the "synonymous changes only" or "simple amino acid changes" categories. This assignment was made computationally.
File contains: List of all ORFs that have changed significantly in sequence between Assembly 19 and Assembly 20, with protein sequence alignments.
File name: complexSeqChangesInAssembly20.txt
Note: This category includes ORFs that may contain substitutions, insertions, deletions, and/or changes to the 5' and/or 3' boundary (annotation changes, in which the ORF boundary is moved without an underlying sequence change, or sequence changes). The protein alignment may show 100% identity if complex changes have taken place outside of the aligned region (e.g., if the N- or C-terminal region has been changed).
7) All Assembly 20 ORFs, classified by the type of change, if any, that affected the ORF between Assembly 19 and Assembly 20 (Excel-format spreadsheet)
File contains: Excel workbook with two worksheets. The first worksheet contains a list of all of the Assembly 20 ORFs and their classification into the six categories, by the criteria outlined above for files 1-6. A "1" in columns B through F indicates that the ORF is classified in the category.
The columns in the first worksheet are as follows:
A) ORF name
B) Complex Sequence changes in Assembly 20
C) New in Assembly 20
D) No change in Assembly 20
E) Simple sequence changes including substitutions and indels in Assembly 20
F) Synonymous changes Only in Assembly 20
G) Chromosome
H) Start
I) Stop
J) Strand
K) Exon segments
L) Contig19 coordinates
The second worksheet contains a list of all of the Assembly 19 ORFs that are not present in Assembly 20, and the Contig19 name and coordinates. The columns in the second worksheet are as follows:
A) Assembly 19 ORF name
B) Contig19 name and contig coordinates from Assembly 19
File name: ClassificationTablePerGene.xls
8) All Assembly 20 ORFs, classified by the type of change, if any, that affected the ORF between Assembly 19 and Assembly 20 (Tab-delimited text format)
File contains: List of all of the Assembly 20 ORFs and their classification into the six categories outlined above.
The columns are as follows:
A) ORF name
B) Classification (into the categories described for files 1-6, above)
C) Chromosome (ORF name appears in this column if ORF is classified as "deleted from Assembly 20")
D) Start coordinate on chromosome (Contig coordinates appear in this column if ORF is classified as "deleted from Assembly 20")
E) Stop coordinate on chromosome
F) Strand
G) Exon Segments
H) Contig19 coordinates
File name: ClassificationPerGene.txt
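The column layout above can be read with a short script. The following is a minimal sketch, assuming the eight columns appear in the order listed and that the file has no header row; both points should be checked against the actual file. The field labels and the example line are illustrative only and are not taken from CGD data.

```python
import csv
import io

# Illustrative labels for columns A-H; these names are assumptions, not CGD terms.
FIELDS = ["orf", "classification", "chromosome", "start", "stop",
          "strand", "exon_segments", "contig19_coords"]

def read_classification(handle):
    """Yield one dict per ORF from a tab-delimited classification table."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and any comment lines
        # Pad short rows so that missing trailing columns become empty strings.
        row = row + [""] * (len(FIELDS) - len(row))
        yield dict(zip(FIELDS, row))

# Made-up example line, for illustration only (not real CGD data).
example = io.StringIO(
    "orfX.0000\tNo change\tChr1\t100\t400\tW\t100-400\tContigX:1-301\n"
)
records = list(read_classification(example))
```

To read the real file, pass an open file handle (opened with `newline=""`) in place of the `io.StringIO` example.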
9) Merged ORFs
Criteria: Merged ORFs were evaluated as follows: The Assembly 19 nucleotide sequence (including any introns) of each ORF that was deleted from Assembly 20 was compared by BLAST against the set of all Assembly 20 ORFs (nucleotide sequence, with introns). A strong match indicates that the deleted ORF may have been subsumed by the Assembly 20 ORF. Such candidates were evaluated *manually*. If the orf19 names of the possible merged pair were numerically close to each other (e.g., orf19.1556 and orf19.1555), the candidate pairs were evaluated in the GBrowse genome browser. If the ORFs overlapped on the same strand, the pair was scored as "merged." If the ORFs did not overlap, or were on opposite strands, the pair was scored as "not merged." Possible merged pairs whose orf19 names were not close to each other were evaluated in the GBrowse genome browser displaying the position of the Assembly 19 contigs overlaid on the Assembly 20 chromosomes. These ORFs were scored as "merged" if they were located on the overlapping segments of adjacent contigs or if they spanned a junction between adjacent contigs.
File contains: The Feature name (orf19 name) of the ORF that remains after the merge, the Locus name (e.g., ABC1) of the ORF that remains after the merge, the Feature name (orf19 name) of the ORF that is deleted (subsumed) during the merge, the Locus name (e.g., ABC1) of the deleted/subsumed ORF.
File name: MergedORFs.txt
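The same-strand overlap rule used for numerically close orf19 names can be expressed as a simple test. This is only a sketch of that one rule: the actual evaluation was manual, and the contig-junction case is not modeled here. The `Orf` type, the function names, and the 1-based inclusive coordinate convention are assumptions made for illustration.

```python
from typing import NamedTuple

class Orf(NamedTuple):
    name: str
    start: int   # assumed 1-based, inclusive, with start <= stop
    stop: int
    strand: str  # "+" or "-"

def overlaps_same_strand(a: Orf, b: Orf) -> bool:
    """True if the two ORFs share at least one base and lie on the same strand."""
    return a.strand == b.strand and a.start <= b.stop and b.start <= a.stop

def score_pair(a: Orf, b: Orf) -> str:
    """Apply the overlap rule described above to a candidate pair."""
    return "merged" if overlaps_same_strand(a, b) else "not merged"
```

For example, two hypothetical ORFs at 100-500 and 400-900 on the same strand would be scored as "merged", whereas the same coordinates on opposite strands would be scored as "not merged".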
10) ORFs truncated by contig ends in Assembly 19, along with the new coordinates in Assembly 20
Criteria: In Assembly 19, one terminus of the ORF was a contig end.
File contains: ORF name, chromosomal coordinates in Assembly 20, contig coordinates in Assembly 19, length of protein in Assembly 20, length of protein in Assembly 19. Tab-delimited file.
File name: OrfsAtEndOfContigInAssembly19.txt
Note: This does not include any ORF whose terminus was near, but not at, the end of a contig in Assembly 19 and which was extended in Assembly 20. However, these ORFs are classified as having "complex sequence changes" as described above.
11) ORFs containing gaps/introns/adjustments in Assembly 19
Criteria: ORFs from Assembly 19 are included in this category if the coding sequence (CDS) comprises more than one segment.
*Please note that significant changes to intron and gap-containing ORFs were made in May 2007. This file should be considered a historical record; please see the Intron data and non-intron adjustments to ORF coordinates section of this page for more information.*
File contains: ORF name; contig and coordinates; size of the intron/gap (nucleotides); orthologous gene from S. cerevisiae, if any; whether or not orthologous gene from S. cerevisiae contains an intron; global nucleotide alignment of the entire sequence (including the introns) to the CDS (with introns removed)
File name: OrfsWithIntrons_Assembly19.txt
Note: ***This category includes gaps that are NOT bona fide introns.***
The Annotation Working Group added small gaps to make adjustments to the reading frame, or to eliminate stop codons in cases in which the annotator judged that the sequence was likely to be in error. Note that the lengths of some introns/gaps are negative numbers (i.e., a region of the exon is counted twice).
All intron predictions should be considered to be preliminary, and these predictions should be subject to further evaluation.
If there are multiple gaps/introns, the sizes of the gaps/introns are separated by commas.
12) ORFs containing gaps/introns/adjustments in Assembly 19 (without alignments)
Criteria: ORFs from Assembly 19 are included in this category if the coding sequence (CDS) comprises more than one segment. This file is identical to the file OrfsWithIntrons_Assembly19.txt, except that it does NOT contain the alignments and is therefore more amenable to viewing as a spreadsheet.
*Please note that significant changes to intron and gap-containing ORFs were made in May 2007. This file should be considered a historical record; please see the Intron data and non-intron adjustments to ORF coordinates section of this page for more information.*
File contains: ORF name; contig and coordinates; size of the intron/gap (nucleotides); orthologous gene from S. cerevisiae, if any; whether or not orthologous gene from S. cerevisiae contains an intron. The file is in tab-delimited text format.
File name: OrfsWithIntrons_Assembly19_List.txt
13) ORFs containing gaps/introns/adjustments in Assembly 20
Criteria: ORFs from Assembly 20 are included in this category if the coding sequence (CDS) comprises more than one segment.
*Please note that significant changes to intron and gap-containing ORFs were made in May 2007. This file should be considered a historical record; please see the Intron data and non-intron adjustments to ORF coordinates section of this page for more information.*
File contains: ORF name; chromosome and coordinates; size of the intron/gap (nucleotides); orthologous gene from S. cerevisiae, if any; whether or not orthologous gene from S. cerevisiae contains an intron; global nucleotide alignment of the entire sequence (including the introns) to the CDS (with introns removed). The ortholog assignments have been updated to reflect the Assembly 20-based mapping generated on November 26, 2006.
File name: OrfsWithIntrons_Assembly20.txt
Note: ***This category includes gaps that are NOT bona fide introns.***
The Annotation Working Group added small gaps to make adjustments to the reading frame, or to eliminate stop codons in cases in which the annotator judged that the sequence was likely to be in error. Some of the gaps introduced by the Annotation Working Group have a length that is a negative number; that is, the coding sequence comprises two overlapping segments, such that some sequence is counted twice. These are called "Adjustments," rather than "Introns" on the Locus page of the affected ORFs. Like the introns/gaps that are small in size, these "adjustments" should also be considered flags that indicate that resequencing of the area is advised.
Please also note: Changes in the position of gaps/introns (a gap/intron that had "slid" or "slipped"), without other changes to the annotation of the region, appear to be due to some problem with file manipulations during generation of Assembly 20. In several such cases, an internal stop codon was generated in Assembly 20 in ORFs that did not have such internal stops in Assembly 19 (and in which the underlying nucleotide sequence was unchanged between the two assemblies). These ORFs are the following: orf19.1261, orf19.130, orf19.1639, orf19.1693, orf19.2440, orf19.3245, orf19.4136, and orf19.5880. After the initial loading of the Annotation Working Group's Assembly 20 data into CGD, CGD adjusted the position of these gaps to restore their position as defined in Assembly 19. The other sequence will remain as-is in CGD until further information is available.
All intron predictions should be considered to be preliminary, and these predictions should be subject to further evaluation. We provide the size of the intron/gap/adjustment in Assembly 20 and information about the S. cerevisiae ortholog in this file to facilitate initial assessment.
If there are multiple gaps/introns, the sizes of the gaps/introns are separated by commas.
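The comma-separated size field can be split and screened for negative lengths as follows. This is a minimal sketch; the function names are hypothetical, and the exact layout of the size column in the file is an assumption. Note that a positive size does not establish that a gap is a bona fide intron.

```python
def parse_gap_sizes(field: str):
    """Split a comma-separated size field into a list of integers."""
    return [int(token) for token in field.split(",") if token.strip()]

def label_gaps(field: str):
    """Label negative sizes as 'adjustment' (overlapping segments, sequence
    counted twice) and all other sizes as 'intron/gap'."""
    return [
        (size, "adjustment" if size < 0 else "intron/gap")
        for size in parse_gap_sizes(field)
    ]
```

For example, `label_gaps("85,-1")` labels the first gap as an intron/gap and the second as an adjustment.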
14) ORFs containing gaps/introns/adjustments in Assembly 20 (without alignments)
Criteria: ORFs from Assembly 20 are included in this category if the coding sequence (CDS) comprises more than one segment. This file is identical to the file OrfsWithIntrons_Assembly20.txt, except that it does NOT contain the alignments and is therefore more amenable to viewing as a spreadsheet.
*Please note that significant changes to intron and gap-containing ORFs were made in May 2007. This file should be considered a historical record; please see the Intron data and non-intron adjustments to ORF coordinates section of this page for more information.*
File contains: ORF name; contig and coordinates; size of the intron/gap (nucleotides); orthologous gene from S. cerevisiae, if any; whether or not orthologous gene from S. cerevisiae contains an intron. The ortholog assignments have been updated to reflect the Assembly 20-based mapping generated on November 26, 2006. The file is in tab-delimited text format.
File name: OrfsWithIntrons_Assembly20_List.txt
15) ORFs with changes to intron/gap/adjustment regions between Assembly 19 and Assembly 20
Criteria: Assembly 20 ORFs are included if the number or nucleotide sequence of introns/gaps/adjustments differs between Assembly 19 and Assembly 20.
File contains: ORF name; coordinates of exons in Assemblies 20 and 19; alignment of the Assembly 19 genomic nucleotide sequence (coding sequence plus intron(s)) vs. the Assembly 20 version; alignment of the Assembly 19 ORF protein sequence vs. the Assembly 20 version.
File name: intronChangesInAssembly20.txt
Note: Small changes in coordinates may not result in changes at either the nucleotide or amino acid sequence levels.
Note: ***Not all gaps are bona fide introns.***
The Annotation Working Group added small gaps to make adjustments to the reading frame, or to eliminate stop codons in cases in which the annotator judged that the sequence was likely to be in error. All intron predictions should be considered to be preliminary, and these predictions should be subject to further evaluation.
Please also note: Changes in the position of gaps/introns (a gap/intron that had "slid" or "slipped"), without other changes to the annotation of the region, appear to be due to some problem with file manipulations during generation of Assembly 20. In eight such cases, an internal stop codon was generated in Assembly 20 in ORFs that did not have such internal stops in Assembly 19 (and in which the underlying nucleotide sequence was unchanged between the two assemblies). After the initial loading of the Annotation Working Group's Assembly 20 data into CGD, CGD adjusted the position of these gaps to restore their position as defined in Assembly 19.
16) ORFs with changes to intron/gap/adjustment regions between Assembly 19 and Assembly 20 (without alignments)
Criteria: Assembly 20 ORFs are included if the number or sequence of introns/gaps/adjustments differs between Assembly 19 and Assembly 20. This file is identical to the file intronChangesInAssembly20.txt, except that it does NOT contain the alignments.
File contains: ORF names.
File name: intronChangesInAssembly20_OrfList.txt
17) Problem ORFs that have internal stop codons (with translation)
Criteria: This set of ORFs has a stop codon within the reading frame, as presented in the Assembly 20 files from the Annotation Working Group.
File contains: List of ORFs in this category, with nucleotide sequence (full, including any intronic sequence), coding sequence (CDS, with introns removed), and amino acid translation
File name: OrfsWithInternalStopCodonsInAssembly20.txt
Note: Most of the stop codons are near the end of the ORF described in the Assembly 20 file. Some are followed by a few residues of predicted protein sequence, some are followed by additional stop codons. After loading the data from the original Assembly 20 file and archiving this starting data, CGD has adjusted the boundary of these ORFs in the database and in the sequence files. The four exceptions are orf19.4384.1, orf19.3813, orf19.359 and orf19.5775.3 (described in more detail in the file problemORFInEMBLfiles.txt); these ORFs will remain as-is in CGD until additional data are available.
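A check of this kind can be sketched as follows, assuming the coding sequence is given 5' to 3' with introns already removed and that TAA, TAG, and TGA are the stop codons. The function name is hypothetical; this is not the query that CGD ran.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def internal_stop_positions(cds: str):
    """Return 0-based codon indices of in-frame stop codons, excluding the
    final complete codon (where a terminal stop is expected)."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    return [
        i for i in range(n_codons - 1)
        if cds[3 * i:3 * i + 3] in STOP_CODONS
    ]
```

An empty list means that no stop codon was found before the final codon.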
18) Problem ORFs that are lacking terminal stop codons
Criteria: This set of ORFs lacks the terminal stop codons, as presented in the Assembly 20 files from the Annotation Working Group/Assembly 20 collaboration.
File contains: List of ORFs in this category, with nucleotide sequence (full, including any intronic sequence), coding sequence (CDS, with introns removed), and amino acid translation
File name: OrfsWithoutEndStopCodonInAssembly20.txt
Note: In most of these cases, adjusting the end coordinates to extend the ORF by a few nucleotides, relative to its coordinates in the initial Assembly 20 release, would append an in-frame stop codon. After loading the data from the original Assembly 20 file and archiving these starting data, CGD has adjusted the boundary of these ORFs. The new coordinates now appear in the CGD sequence files. There are two ORFs that end with undetermined sequence ("NNN"), orf19.2657 and orf19.7398.1, and the termini of these two ORFs will not be modified by CGD in the absence of additional sequence data. In addition, orf19.3073 runs off the end of Assembly 20 Chromosome 4 and it therefore lacks a terminal stop. Also included in this file are ORFs that extend downstream of an in-frame stop codon by a few residues. (These ORFs are also included in the category, "Problem ORFs that have internal stop codons," and are listed in the file OrfsWithInternalStopCodonsInAssembly20.txt, as described above.) The coordinates of ORFs with in-frame stops within a few codons of the terminus have also been adjusted; they have been truncated so that they end at the stop codon. These adjustments were performed after loading the data from the original Assembly 20 EMBL-format files and archiving this starting data at CGD. The adjustments are now present in the CGD sequence files.
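Extending an ORF to the next in-frame stop codon can be sketched as follows. The sketch assumes a forward-strand ORF, 0-based coordinates with an exclusive end, and an end coordinate that is in frame with the start; the function name and these conventions are assumptions, not a description of the procedure CGD used. It returns None when the sequence becomes undetermined ("N") or runs out before a stop codon is found, which corresponds to the cases above that were left unmodified.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def extend_to_stop(seq: str, cds_start: int, cds_end: int):
    """Return the new exclusive end coordinate after extending the ORF
    downstream to include the next in-frame stop codon, or None if no
    stop can be determined."""
    if (cds_end - cds_start) % 3 != 0:
        raise ValueError("cds_end must be in frame with cds_start")
    seq = seq.upper()
    pos = cds_end
    while pos + 3 <= len(seq):
        codon = seq[pos:pos + 3]
        if "N" in codon:
            return None          # undetermined sequence: leave the ORF as-is
        if codon in STOP_CODONS:
            return pos + 3       # include the stop codon in the ORF
        pos += 3
    return None                  # ran off the end of the sequence
```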
19) ORFs with partial codons
Criteria: Length of the coding sequence (CDS, with any intronic sequence removed), in nucleotides, is not a multiple of three
File contains: ORF name, nucleotide sequence of the ORF (any intronic sequence included), translated sequence
File name: OrfsWithPartialTerminalCodonInAssembly20.txt
Note: Coordinates of ORFs have been adjusted so that the ORF ends at the stop codon; the extra nucleotides (partial codon) have been removed from the CGD sequence files. These adjustments were performed after loading the data from the original Assembly 20 EMBL-format files and archiving this starting data at CGD.
This query was run after other coordinate adjustments were made; some of the ORFs with partial codons in Assembly 20 were detected by other queries and corrected before this list was generated (e.g., ORFs without terminal stop codons).
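The criterion and the correction can be sketched as follows, assuming that the extra nucleotides lie at the 3' end of the coding sequence, after the last complete codon. The function name is hypothetical.

```python
def trim_partial_codon(cds: str):
    """Return (trimmed_cds, n_extra), where n_extra is the number of trailing
    nucleotides (0, 1, or 2) that do not form a complete codon."""
    n_extra = len(cds) % 3
    return cds[:len(cds) - n_extra], n_extra
```

A nonzero `n_extra` identifies an ORF that meets the criterion above.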
20) ORFs with non-AUG start
Criteria: ORF nucleotide sequence does not begin with ATG
File contains: List of ORFs, with nucleotide sequence (including any intronic sequence). There are eight of these in Assembly 20.
File name: OrfsWithNonAUGstartInAssembly20.txt
21) Missing Contig19s, and the Assembly 19 ORFs that they contain
Criteria: Contig19s are included if they are not listed in the EMBL-format Assembly 20 files
File contains: Contig19 name, name of ORF contained on the missing contig, Locus name (if any) of the ORF, Feature Type of ORF, notes
File name: Missing_contigs.xls
Note: The EMBL-format Assembly 20 files released by the Annotation Working Group/Assembly 20 collaboration specify mapping of some of the Assembly 19 contigs to the Assembly 20 chromosomes; however, not all of the Contig19s are included in the EMBL-format files. The file "Missing_contigs.xls" contains information about the Contig19s that are missing from the EMBL-format Assembly 20 files.
Each ORF is contained on a single line; missing Contig19s that comprise multiple ORFs are listed on multiple lines. The Feature Type of each ORF indicates whether it is present in Assembly 20 and, if so, whether the sequence has changed between Assembly 19 and 20. The notes were entered based on manual investigation by BLAST. Excel format file.
22) Subdivided Contig19s
Criteria: Contig19s that are listed in the EMBL-format file, and which are split into pieces in Assembly 20
File contains: ID of Contig19 fragment; name of Contig19; Assembly 20 chromosome where contig fragment matches; chromosomal coordinates of match
File name: SplitContig19ToChromosomes.txt
Note: The subdivided Contig19 fragments are designated numerically, for example, "Contig19-10070_1," "Contig19-10070_2," "Contig19-10070_3."
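A fragment ID of this form can be split into the Contig19 name and the fragment number. The following is a minimal sketch based only on the three example IDs above; the regular expression and the function name are assumptions and may not cover every ID in the file.

```python
import re

# Pattern inferred from the examples above, e.g. "Contig19-10070_2".
FRAGMENT_RE = re.compile(r"^(Contig19-\d+)_(\d+)$")

def parse_fragment_id(fragment_id: str):
    """Return (contig_name, fragment_number) for a subdivided Contig19 ID."""
    match = FRAGMENT_RE.match(fragment_id)
    if match is None:
        raise ValueError(f"not a subdivided Contig19 ID: {fragment_id!r}")
    return match.group(1), int(match.group(2))
```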
23) List of other Contig mapping problems
File contains: Notes on some problems with the Contig19 mapping onto Assembly 20 chromosomes from the EMBL-format files.
File name: problemContigMappingToChr.txt
24) Notes on problematic entries in the Assembly 20 files
File contains: List of problematic ORFs from the Assembly 20 EMBL-format files released by the Annotation Working Group/Assembly 20 collaboration. Notes on the way in which these issues will be handled in CGD.
File name: problemORFInEMBLfiles.txt
Note: This file describes the following types of problems in the EMBL-format files released by the Annotation Working Group/Assembly 20 collaboration: two different orf19 names that have been used for the same region in the EMBL-format Assembly 20 files (2 cases); orf19 names that have been used for two different regions in the EMBL-format Assembly 20 files (4 cases); an ORF without a name in the EMBL-format Assembly 20 files (1 case); ORFs with internal stop codons that are not amenable to correction by a simple adjustment in the terminal coordinate (4 cases); ORFs that are extremely changed in sequence between Assembly 19 and Assembly 20 (4 cases); and ORFs that contain a stop codon in Assembly 20 in the absence of any underlying sequence changes (the coordinates of an intronic or gap sequence have changed position ("slipped") between the two assemblies, creating an in-frame stop codon).
The archived Assembly 20 EMBL-format files are *unmodified* copies of the files released by the Annotation Working Group/Assembly 20 collaboration. Please note that there are some issues with the data in these files, as described in detail below. Thus, these archival copies are not recommended for use as-is.
Subsequent to the May 2006 release of the EMBL-format Assembly 20 files by the Annotation Working Group/Assembly 20 collaboration, updated EMBL-format files have been released to their web site, http://candida.bri.nrc.ca/alignments/editedEMBL/final/. Please note that issues with the data in these files remain unresolved (issues described in detail on the Sequence Help Page). Thus, the EMBL-format files at http://candida.bri.nrc.ca/alignments/editedEMBL/final/ are not recommended for use as-is.
Not all gaps are bona fide introns. All intron predictions should be considered to be preliminary, and these predictions should be subject to further evaluation. The Annotation Working Group added small gaps to make adjustments to the reading frame, or to eliminate stop codons in cases in which the annotator judged that the sequence was likely to be in error. Places where small gaps have been introduced into an ORF should be considered flags that indicate that resequencing of the area is advised.
Changes in the position of gaps/introns (a gap/intron that had "slid" or "slipped"), without other changes to the annotation of the region, appear to be due to some problem with file manipulations during generation of Assembly 20. All gap/intron changes between Assembly 19 and 20 are listed in the file "intronChangesInAssembly20.txt." In several such cases, an internal stop codon was generated in Assembly 20 in ORFs that did not have such internal stops in Assembly 19 (and in which the underlying nucleotide sequence was unchanged between the two assemblies). These ORFs are: orf19.1261, orf19.130, orf19.1639, orf19.1693, orf19.2440, orf19.3245, orf19.4136, and orf19.5880. After the initial loading of the Annotation Working Group's Assembly 20 data into CGD, CGD adjusted the position of these gaps to restore their position as defined in Assembly 19. Current CGD files therefore contain the corrected sequence and coordinate data. The other sequence will remain as-is in CGD until further information is available.
Some of the gaps introduced by the Annotation Working Group have a length that is a negative number; that is, the coding sequence comprises two overlapping segments, such that some sequence is counted twice. These are called "Adjustments," rather than "Introns" on the Locus page of the affected ORFs, and they are listed in the file OrfsWithIntrons_Assembly20.txt (with alignments) and OrfsWithIntrons_Assembly20_List.txt (without alignments). Like the introns/gaps that are small in size, these "adjustments" should also be considered flags that indicate that resequencing of the area is advised.
The original Assembly 20 EMBL-format files include ORFs that have internal stop codons. They are listed in the file "OrfsWithInternalStopCodonsInAssembly20.txt." Most of the stop codons are near the end of the ORF described in the Assembly 20 file. (The exceptions are the four ORFs that have internal stop codons which are not amenable to correction by a simple adjustment of the terminal coordinate, orf19.3813, orf19.4384.1, orf19.359, and orf19.5775.3, as described in more detail in the file problemORFInEMBLfiles.txt; these ORFs will remain as-is in CGD until further information is available.) In addition, there are 15 cases in which ORFs have multiple stop codons in the EMBL-format Assembly 20 files (orf19.2309, orf19.1658, orf19.6947, orf19.5870, orf19.2758, orf19.5046, orf19.3140.1, orf19.942, orf19.4305.1, orf19.5592, orf19.7076, orf19.7056, orf19.2423, orf19.6382, and orf19.854). After loading the data from the original Assembly 20 file and archiving this starting data, CGD adjusted the boundaries of the ORFs with multiple terminal stops and the ORFs with near-terminal stops, and the updated coordinates now appear in the database and in the sequence files released by CGD.
The original Assembly 20 EMBL-format files include ORFs that are lacking stop codons. They are listed in the file "OrfsWithoutEndStopCodonInAssembly20.txt." In most of these cases, adjusting the end coordinates to extend the ORF by a few nucleotides, relative to its position in the original Assembly 20 files, would append an in-frame stop codon. After loading the data from the original Assembly 20 file and archiving these starting data, CGD adjusted the boundary of these ORFs in CGD, and in the sequence files released by CGD, so that each ORF terminates at the next in-frame, downstream stop codon. There are two ORFs that end with undetermined sequence ("NNN"), orf19.2657 and orf19.7398.1. In addition, orf19.3073 runs off the end of Assembly 20 Chromosome 4 and it therefore lacks a terminal stop. The termini of these three ORFs will not be modified by CGD in the absence of additional sequence data.
The original Assembly 20 EMBL-format files lack entries for a subset of the contigs from Assembly 19. These are described in the file, /orfsFromMissingContigs_list.txt.
Additional notes on some other issues with the Contig19 mapping onto Assembly 20 chromosomes are contained in the file, problemContigMappingToChr.txt.
The original Assembly 20 EMBL-format files use the same orf19 name for two different ORFs, in four cases. There is also an ORF that has no name in the original Assembly 20 EMBL-format files. The file, problemORFInEMBLfiles.txt contains notes on CGD's investigation of these issues and a detailed description of how CGD has addressed these problems.
It appears that some Assembly 19 ORFs were erroneously deleted during generation of Assembly 20. Two cases were encountered during investigation of the problems described in the file, problemORFInEMBLfiles.txt. These two ORFs, orf19.71 and orf19.544.1, have been reinstated in CGD. In other cases, an ORF was deleted and then replaced with a new, nearly identical ORF (e.g., orf19.2217, which was replaced with orf19.1860.1). In this case, and others like it, the two ORFs have been scored as "merged." orf19.2217 will retain its deleted status in CGD; the name orf19.2217 will be added as an alias of the ORF with which it has been merged, orf19.1860.1; and any curated information associated with the deleted ORF has been copied to the remaining member of the ORF pair.
CGD would appreciate it if users report other problems that they may encounter, so that issues can be documented and resolved wherever possible.
The tRNA genes were predicted from the C. albicans genome sequence using the tRNAscan-SE algorithm developed by T. M. Lowe and S. R. Eddy. The names appear in the following format, which is based on the format of the S. cerevisiae tRNA gene names: a lower-case t (for "tRNA"), followed by the one-letter abbreviation of the amino acid with which it is charged, followed by the anticodon (in parentheses), followed by an integer. The name of the corresponding allele has an additional ".2" suffix. For example, "tA(AGC)1" is an alanyl tRNA with an AGC anticodon, and "tA(AGC)1.2" is the corresponding allele.
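The naming format can be parsed with a short regular expression. This is a minimal sketch based on the format and examples given above; the pattern, the function name, and the returned field names are assumptions, and names that deviate from the described format will be rejected.

```python
import re

# t + one-letter amino acid + (anticodon) + integer + optional ".2" allele suffix
TRNA_NAME_RE = re.compile(r"^t([A-Z])\(([A-Z]{3})\)(\d+)(\.2)?$")

def parse_trna_name(name: str):
    """Split a tRNA gene name such as 'tA(AGC)1.2' into its components."""
    match = TRNA_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized tRNA gene name: {name!r}")
    amino_acid, anticodon, number, allele_suffix = match.groups()
    return {
        "amino_acid": amino_acid,
        "anticodon": anticodon,
        "number": int(number),
        "is_second_allele": allele_suffix is not None,
    }
```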
The C. albicans codon usage table may be accessed using the link in the left-hand menu bar of the CGD home page, under the heading "Download Data," or using the link on the Download Sequence page. This table displays the calculated frequency of use of each codon in the diploid complement of C. albicans protein-coding genes. The table was produced with the GCG program CodonFrequency using the diploid complement of all predicted coding sequences (13,117 open reading frames) from Assembly 19 of the C. albicans SC5314 genomic sequence, as found in the file 'orf_coding.fasta' dated 07-Jun-2005. Where the sequences of two alleles differ, both sequences were used to calculate codon usage. Where the sequences of two alleles were identical, two copies of the coding sequence were added to the pool of sequences used to calculate codon usage. Thus, codon usage was calculated from the entire diploid complement of protein-coding genes.
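The diploid weighting described above can be illustrated with a short sketch. It does not reproduce the GCG CodonFrequency program or its output format; it only shows how identical alleles are counted twice and differing alleles once each. The function names and the input structure (one tuple of coding sequences per gene) are assumptions.

```python
from collections import Counter

def count_codons(cds: str) -> Counter:
    """Count complete codons in a coding sequence (any partial codon is ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def diploid_codon_usage(genes):
    """genes: iterable of tuples, each holding one CDS (identical alleles,
    assembled as a single ORF) or two CDSs (alleles that differ).
    Returns the fraction of all counted codons represented by each codon."""
    total = Counter()
    for alleles in genes:
        if len(alleles) == 1:
            counts = count_codons(alleles[0])
            total.update(counts)
            total.update(counts)   # identical alleles: add a second copy
        else:
            for cds in alleles:
                total.update(count_codons(cds))
    n_codons = sum(total.values())
    return {codon: count / n_codons for codon, count in total.items()}
```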
Note that C. albicans uses an alternative genetic code for nuclear genes, different from that used by most other fungi. Details and links to translation tables for nuclear and mitochondrial genes can be found at NCBI's Taxonomy Browser.
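The effect of the alternative nuclear code on translation can be sketched as follows. The sketch assumes NCBI translation table 12 (the Alternative Yeast Nuclear Code), in which CTG (CUG) is read as serine rather than leucine and all other codons follow the standard code; please confirm the table at NCBI before relying on it. Codons containing ambiguous nucleotides are translated as "X". The names used here are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AA)
}
# Assumed alternative yeast nuclear code: identical except CTG -> Ser.
ALT_YEAST_CODE = dict(STANDARD_CODE, CTG="S")

def translate(cds: str, table=ALT_YEAST_CODE) -> str:
    """Translate complete codons; unknown or ambiguous codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return "".join(table.get(cds[i:i + 3], "X") for i in range(0, usable, 3))
```

Under these assumptions, the sequence ATGCTGTAA translates as "MS*" with the alternative code and as "ML*" with the standard code.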
Note: This section contains documentation from the Stanford Genome Technology Center (SGTC). This documentation was previously available on the SGTC's Candida information server, and has been archived here (verbatim) for reference.
PLEASE NOTE: Archived copies of the PDF diagrams mentioned in the SGTC documentation are available from the CGD Downloads page. You may download the Assembly 19 Contig Diagram files or view the README file.
The contig sequences in CGD are from Assembly 19 of the C. albicans genome sequence, from the supplementary material published in the C. albicans sequencing paper, Jones, T., Federspiel, N.A., Chibana, H., Dungan, J., Kalman, S., Magee, B.B., Newport, G., Thorstenson, Y.R., Agabian, N., Magee, P.T., Davis, R.W. and S. Scherer. (2004) The Diploid Genome of Candida albicans. PNAS 101:7329-7334. Supplementary data: http://genome-www.stanford.edu/candida-pnas2004-supplement/. (Older sequence assemblies, including Assemblies 4, 5, and 6, have been archived at CGD. These data may be retrieved from the "archived_assemblies" folder on the CGD Sequence Download Page.)
The Assembly 19 ORF coordinates (displayed on the Locus pages and displayed in the GBrowse Genome Browser, and available for download) come from the supplementary material published by Jones et al. and also from the Candida Annotation Working Group (AWG, http://candida.bri.nrc.ca/candida/index.cfm). The AWG formed at the ASM Conference on Candida and Candidiasis held January 13-17, 2002, in Tampa, FL. The group consists of researchers who volunteered their own time to annotate the genome (see Braun et al. 2005).
While the Assembly 19 AWG ORF set and the Jones et al. ORF set are similar, they are not identical. Members of the AWG have updated the ORF set of Jones et al. to include known and predicted introns. In addition, the AWG has adjusted the sequence to eliminate presumed sequence errors that create internal stops or frameshifts, and to change the 5' and 3' boundaries of some ORFs.
CGD has incorporated into Assembly 19 the AWG's changes that can be represented by "gaps" in the ORF sequences (or "joins" between regions of sequence along a contig) proposed by Jones et al. Approximately 400 of the ORFs in Assembly 19 are therefore presented as multiple conjoined coding sequences within CGD. Approximately 215 of these are the result of introns, and the remainder are due to short sequence gaps introduced by the AWG. CGD has not introduced any actual changes into the experimentally determined, published Assembly 19 contig sequences.
The Assembly 19 orf19s were loaded into CGD using the following procedure:
Note: CGD has not included the predicted changes in orf boundaries that would require alteration to the contig sequence. Therefore the predicted changes at 5' and 3' ends of orf19s have not been incorporated. The Assembly 19 sequence files available from the AWG and CGD differ in this respect.
Please also note that the AWG's sequence adjustments were typically made in the context of only one of the two alleles of each gene. For example, the orf19.5007 allele of ACT1 is depicted with an intron, whereas the orf19.12474 allele is not (see the ACT1 Locus Page). This difference is reflected in the predicted protein sequence translated from these ORFs. In the case of ACT1, the translation of orf19.12474 begins from an ATG that is downstream of the intron and N-terminal protein coding sequence of orf19.5007.
Assembly 19 is not entirely complete, so some ORFs are truncated by the end of a contig rather than terminated by a stop codon. Therefore, the protein products of 3'-truncated genes (e.g., orf19.1004) do not end with a translational stop (denoted by an * in the protein sequence file). Likewise, some genes are truncated from the 5' end (e.g., orf19.1021). The position of a gene on its contig may be assessed using the GBrowse genome browser. Click on the small map on the right-hand side of any Locus Page to view the gene in the context of the contig that contains it.
In cases where both alleles of a gene are identical, the sequence has been assembled as a single ORF. Therefore, genes with identical alleles have only a single orf19 designation, and these genes have a single entry in the "orf_coding.fasta" file.
Note: This section contains documentation from the Stanford Genome Technology Center (SGTC). This documentation was previously available on the SGTC's Candida information server, and has been archived here (verbatim) for reference.
Note: The original SC5314 sequence trace files and quality scores generated by the Stanford Genome Technology Center are available for download from CGD.
From the Locus Page:
The "Retrieve Sequences" pull-down menu, which is located on the
Resources sidebar on the right-hand side of each Locus Page,
retrieves the following for each gene in Assembly 21 or 20, or for each allele in Assembly 19: the Genomic DNA (with introns included); the Coding Sequence (with introns removed); the Genomic DNA with 1 kb of flanking sequence upstream and downstream of the gene (also includes any introns); or the ORF
translation (predicted protein sequence).
From the CGD Sequence Retrieval Tool:
To access the Sequence Retrieval Tool
(also called Get Sequence, or Gene/Sequence Resources), use the link
under Search Options on the left-hand sidebar of the CGD Home Page or use the "Gene/Sequence Resources" link under Specialized Gene and Sequence Searches on the Search Options page.
By Bulk Download
You may download gzip compressed sequence files in bulk from the CGD Sequence Download Page; a
variety of file options exist for retrieval of data from Assemblies
19, 20, and 21. There is a link to this page under Download Data on
the left-hand sidebar of the CGD Home Page.
Archived copies of older sequence assemblies, including Assemblies 4,
5, and 6, may also be retrieved from the CGD Sequence Download Pages.
You may also retrieve sequence information for any set of genes
(either specified by a list of gene names, or by selecting a region of
a chromosome or contig) using the Batch Download Tool.
From the GBrowse Genome Browser:
To view the nucleotide sequence of a gene using GBrowse, begin by zooming in on the gene in the browser, which is described in detail on the GBrowse Help Documentation page. You may view Assembly 19 or Assembly 21 (as well as A20) in GBrowse; please be aware that the assemblies are stored separately, and that you may browse them separately. GBrowse may be accessed using the "Chromosomal Location" (for Assembly 21 or 20) or "Contig Location(s)" links (for Assembly 19) or the GBrowse map thumbnail views on each Locus page, or by using the "CGD GBrowse" links displayed on each BLAST result page. You may use GBrowse to search by gene name. For example, type "orf19.7247" into the Landmark or Region search box and click on Search. To view the DNA sequence of the region displayed in the browser (which is now your gene of interest), select Download Sequence File or Download Decorated FASTA File from the pull-down menu labeled "Reports and Analysis." The difference between these two formats is that the decorated FASTA file format highlights ORFs contained within the sequence, which is convenient when viewing a large sequence file. The non-decorated sequence file can be displayed in any of several different formats. Each of the file formats is configurable; select the file format from the pull-down menu and then click on the Configure button to select configuration options. Click on the button marked "Go" to view the sequence.
To view any amount of nucleotide sequence of the region upstream or downstream of a gene, you can use the browser to display a specific region relative to the ORF start site and then ask to download this sequence. For example, if you want the sequence of the 1.5 kb region upstream of orf19.7247, enter "orf19.7247:-1500..-1" into the Landmark or Region search box and click on Search. Now use the Download Sequence File or Download Decorated FASTA File option to get the nucleotide sequence of the region.
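As a small illustration, the relative-coordinate string used in the example above can be built programmatically. This Python sketch only formats the text to be typed into the Landmark or Region search box; the function name is hypothetical, and it covers only the upstream form shown in the example.

```python
def upstream_region(orf_name, length_bp):
    """Format a region string covering `length_bp` bases immediately
    upstream of the ORF start, in the form shown above for
    orf19.7247 (e.g. "orf19.7247:-1500..-1")."""
    if length_bp < 1:
        raise ValueError("length_bp must be a positive number of bases")
    return f"{orf_name}:-{length_bp}..-1"


print(upstream_region("orf19.7247", 1500))  # orf19.7247:-1500..-1
```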
To view the predicted protein sequence (ORF translation) of an ORF in GBrowse (for example, orf19.7247), first type "orf19.7247" into the Landmark or Region search box and click on Search to zoom in on this ORF. Now select Download Protein Sequence File. Click on the button marked "Go" to view the sequence. The protein sequence file format is configurable; select Download Protein Sequence File from the pull-down menu and then click on the Configure button to select configuration options.
You may even view the sequence of an entire contig or chromosome in GBrowse. You can search for a contig by name. For example, type "Contig19-2507" into the Landmark or Region search box and click on Search. You may then use either the Download Sequence File or Download Decorated FASTA File from the pull-down menu labeled "Reports and Analysis" to obtain the nucleotide sequence, or you may use Download Protein Sequence File to obtain the predicted protein sequence of all of the ORFs contained on the contig. If you would like the sequence of a contig containing your favorite gene, but you don't know the name of the contig, there are several ways to find this information. The Contig Location is now listed on the CGD Locus Page for each gene. Alternately, you can search for the gene using the Landmark or Region search in GBrowse, and the name of the contig will be displayed on the Overview Panel near the top of the page.
The GBrowse Help Documentation page has additional instructions for use of the GBrowse interface. To begin exploring in GBrowse now, use this link to see a region of Contig19-10014 as an example from Assembly 19.
Using BLAST (Basic Local Alignment Search Tool):
You may use the CGD BLAST tool to conduct protein or DNA sequence searches against various sequence datasets in CGD, as described in detail on the BLAST documentation page. Alignments of the query sequence with its sequence matches (also called "hits") are displayed along with hyperlinks to related sequence resources. The "CGD GBROWSE" hyperlink above each set of HSPs on the BLAST results page opens the GBrowse genome browser, with the HSP displayed
in the browser window. GBrowse may be used to further explore the region containing the match:
to view ORFs and other features in the neighborhood of the hit, to browse and download adjacent sequences,
to view the 6-frame translation of the region, and to view restriction sites.
(For a description of GBrowse features, please see our GBrowse documentation). If applicable, links are provided to directly download/view the entire ORF or peptide sequence, or to navigate to the corresponding Locus page.
Translation and tRNAs
Assembly 19 from the Stanford Genome Technology Center
"Assembly 19 Release Notes
The Candida genome contains regions that are homozygous, and others that are not. In homozygous regions, the assembler
can combine reads from both alleles into the same contig. In heterozygous regions where the level of heterozygosity is
low, it can do the same in spite of a few disagreements between alleles (it treats the polymorphisms as if they
resulted from sequencing errors). From the assembler's point of view these regions are effectively homozygous. In
these release notes, the term "homozygous" should be interpreted as looking homozygous to the assembler, and a low
level of polymorphism between alleles can still be found in the homozygous regions. Assembly 19 does not currently
contain information on polymorphisms in such regions. In the near future we will provide annotation of such residual
polymorphisms.
In regions with more than minimal divergence between alleles, the assembler must put reads from the two alleles into
different contigs. This happened frequently in assembly 6, resulting in considerable fragmentation and difficulty in
interpretation, e.g., in distinguishing allele pairs from family members.
In assembly 19, we have developed techniques to detect separate assembly of alleles and to combine separated contigs
from assembly 6 into diploid contigs in assembly 19. For most contigs in assembly 19, we present distinct sequences
for the two alleles.
Contig numbering. For some contigs from assembly 6, we found no indication of allele sequence assembled separately.
Such contigs passed unchanged, except possibly for minor differences in trimming of low-quality bases at the end,
into assembly 19. Contigs of this kind have the same number in assembly 19; for example, Contig19-1785 is the same as
Contig6-1785, and is presumed to be homozygous.
When we were able to detect separation of alleles in assembly 6, we combined the affected assembly 6 contigs into larger
diploid contigs in assembly 19. All contigs so formed were assigned numbers starting at 10000; for example, Contig19-10014
is made up from contigs 6-1076, 6-2434, 6-1473, 6-1632, 6-2141, and 6-2001. A diagram is provided in PDF format for
Contig19-10014 (and all others) showing how it is formed from assembly 6 contigs. A dotted line separates the assembly 6
contigs assigned to the two alleles. In regions where one allele has a gap, the sequence is presumed to be homozygous and
is filled in from the other allele. Otherwise the top allele derives its sequence from the assembly 6 contig shown above
the dotted line, and the bottom allele from the contig at the same position shown below the line. This process results in
two sequences representing the two alleles for the contig. The top allele is arbitrarily designated as primary, and the
sequence given for Contig19-10014 is that derived from the top set of assembly 6 contigs. The sequence for the other allele
is given the name Contig19-20014 (i.e., add 10000 to the number of the primary allele). In viewing the diagrams, note that
because of insertions and deletions between alleles, corresponding positions on the two alleles are not always connected
by a direct vertical line, but usually in large diploid contigs the size of insertions is visually negligible.
Contig19-10262 is exceptional in that it was constructed by joining two assembly 6 contigs linked based on sequence obtained
from Genbank, with no evidence of separation of alleles. Accordingly it does not have a second allele, and there is no
contig 19-20262.
ORFs.
ORFs were called using the same methods described for assembly 6, with one addition. In a small number of cases, the
construction of diploid contigs involved the insertion of blocks of "N" bases to fill gaps on one allele where evidence
indicated that the sequence should not be filled in from the other allele. Usually the number of N's to be inserted was
known at best approximately. To avoid having ORFs crossing large blocks of N of essentially arbitrary length, ORF calling
stopped at any group of 12 or more N's, and ORFs that run up against such N-blocks are labeled using the same incompleteness
rules applied to ORFs running off the ends of contigs in assembly 6. ORFs of this type are identifiable by inclusion of
12 N's at the affected end, which translate to 4 X's in the protein sequence.
ORFs were called using both alleles of the diploid contigs. There are 14220 ORFs in the complete set so obtained. In many
cases ORFs are exactly duplicated between alleles.
ORF Alleles.
Nonredundant Protein Set. A computational process identified pairs of ORFs that are deemed to be alleles based on position
and protein sequence similarity. Generally the identification of alleles is straightforward. In complicated instances we
recommend examination of the ORFs and blast results to understand the situation. The web pages identify ORFs designated as
alleles and give indications of which cases are complicated. The allele pairs were used to generate a nonredundant protein
set using the following rule: whenever the protein sequences for a pair of alleles were identical, the translation of the
ORF derived from the secondary allele (the 20000-series contig) was excluded from the nonredundant protein set. This set
of proteins was used as the blast database in performing the searches of Candida ORFs against all other Candida ORFs. There
are 9259 proteins in the nonredundant set."
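The contig numbering convention and the N-block rule described in the release notes above can be sketched in Python as follows. This is an illustrative sketch, not the SGTC's code. It assumes that primary diploid contigs are numbered from 10000 up to, but not including, 20000, and that contig names have the form "Contig19-<number>"; the function names are hypothetical. Contig19-10262 is treated as an exception because the release notes state that it has no second allele.

```python
import re

CONTIG_RE = re.compile(r"^Contig19-(\d+)$")
N_BLOCK_RE = re.compile(r"N{12,}")  # ORF calling stopped at 12 or more N's

# Contigs in the 10000 series that have no secondary allele,
# per the release notes.
NO_SECOND_ALLELE = {10262}


def secondary_allele_name(primary_name):
    """Return the name of the secondary-allele contig (primary number
    plus 10000), or None if the contig has no separate second allele."""
    match = CONTIG_RE.match(primary_name)
    if match is None:
        raise ValueError(f"not an Assembly 19 contig name: {primary_name!r}")
    number = int(match.group(1))
    # Assumption: primary diploid contigs lie in the range 10000-19999.
    if not 10000 <= number < 20000 or number in NO_SECOND_ALLELE:
        return None
    return f"Contig19-{number + 10000}"


def n_block_spans(sequence):
    """Return (start, end) 0-based half-open spans of runs of 12 or more N's."""
    return [(m.start(), m.end()) for m in N_BLOCK_RE.finditer(sequence.upper())]


print(secondary_allele_name("Contig19-10014"))  # Contig19-20014
print(secondary_allele_name("Contig19-1785"))   # None
print(n_block_spans("ATG" + "N" * 12 + "GCA"))  # [(3, 15)]
```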
Assembly 19 in CGD
Assembly 6 from the Stanford Genome Technology Center
"Assembly 6 Release Notes
Assembly 6 is expected to be the final assembly of Candida albicans sequence data starting from the individual
reads. After preprocessing steps to remove most highly repeated sequences, a total of 313,165 reads were
assembled into 1213 contigs 2kb or greater. These data represent 10.4X mean coverage assuming a haploid genome
size of 15.5MB (excepting repeats such as the rDNA assemble as a single copy). The contigs add to 17.4MB,
exceeding the genome size because of heterozygous regions assembling separately. The mtDNA is not included
in the assembly as it must be translated with a different genetic code.
Translation of the assembly resulted in 9168 ORFs capable of encoding proteins 100aa or greater in length
including ambiguities. In general, the ORFs contained a start and stop codon. ORFs extending to the end of
a contig but lacking a stop were included as, except in rare cases where the contigs are at chromosome ends,
they will eventually reach one. Reading frames that remain open upstream to the beginning of a contig were
divided into two classes. Those that contained a 100aa or greater ORF with a start codon within them are
represented as the smaller ORF with the start because upstream sequences are more likely to encounter a
stop before a start. For completeness, those lacking an internal 100aa ORF with a start are included up to
the beginning of the contig.
Along with assembled DNA, a reference set of ORFs is being provided with assembly 6. The ORFs are numbered
sequentially with the lowest numbers deriving from the contigs with the lowest numbers. A typical fasta
header line reads as follows:
orf6.2.prot orf6-1097:610-281:e 330 bp, 109 aa, contig 2283 bp
This is interpreted as the protein sequence for ORF 2 from assembly 6. It derives from nucleotides 610-281
in contig 6-1097 (a start coordinate greater than the stop indicates it is read from the complementary strand).
The letter "e" indicates that an entire ORF is present (start and stop). The letter "i" is used to indicate an
incomplete ORF with the letters N and C to indicate the end of the ORF that is incomplete. The letter "n" is
used to indicate that while an entire ORF is given, the reading frame remains open to the beginning of the
contig. A count of codons that could be used to extend such an ORF is given in the header line. Examples are:
orf6.1.prot orf6-1072:1371-1:iC 1371 bp, 457 aa, contig 2182 bp
orf6.4.prot orf6-1097:2281-1868:iN 414 bp, 137 aa, contig 2283 bp
orf6.14.prot orf6-1174:18-545:en 528 bp, 175 aa, contig 2692 bp upcont=5
Fasta headers for the DNA sequences of ORFs are formed in the same way except that the ORF name is lacking
the ".prot" extension. ORF DNA sequences are always given as the sense strand.
The set of 9168 ORFs contains a large number of ORFs that are internal to or are overlapping with larger
ORFs. These smaller ORFs are currently included for completeness and will be removed from the reference
set at a later date. In a smaller number of cases, two ORFs are parts of the same gene. Causes include:
introns, gaps in the current sequence, and remaining frameshifts."
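For readers who want to process the Assembly 6 header lines described in the release notes above, the following Python sketch parses the four example headers. It is an illustration based only on those examples, not an official parser. The regular expression, the class and field names, and the reading of "upcont" as the count of codons available to extend the ORF upstream are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern inferred from the example header lines; real files may contain
# variants that this sketch does not cover.
HEADER_RE = re.compile(
    r"^>?(?P<name>orf6\.\d+)(?P<prot>\.prot)?\s+"
    r"(?P<contig>[^\s:]+):(?P<start>\d+)-(?P<stop>\d+):(?P<status>[A-Za-z]+)\s+"
    r"(?P<dna_bp>\d+) bp,\s*(?P<aa>\d+) aa,\s*contig (?P<contig_bp>\d+) bp"
    r"(?:\s+upcont=(?P<upcont>\d+))?\s*$"
)


@dataclass
class Orf6Header:
    name: str
    is_protein: bool       # True when the name carries the ".prot" extension
    contig: str
    start: int
    stop: int
    status: str            # "e", "en", "iN" or "iC" in the examples
    dna_bp: int
    aa: int
    contig_bp: int
    upcont: Optional[int]  # assumed: codons available for upstream extension

    @property
    def strand(self):
        """'-' when start > stop (complementary strand), otherwise '+'."""
        return "-" if self.start > self.stop else "+"

    @property
    def span_bp(self):
        """Number of bases covered by the start and stop coordinates."""
        return abs(self.start - self.stop) + 1


def parse_orf6_header(line):
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {line!r}")
    g = match.groupdict()
    return Orf6Header(
        name=g["name"],
        is_protein=g["prot"] is not None,
        contig=g["contig"],
        start=int(g["start"]),
        stop=int(g["stop"]),
        status=g["status"],
        dna_bp=int(g["dna_bp"]),
        aa=int(g["aa"]),
        contig_bp=int(g["contig_bp"]),
        upcont=int(g["upcont"]) if g["upcont"] is not None else None,
    )


header = parse_orf6_header(
    "orf6.2.prot orf6-1097:610-281:e 330 bp, 109 aa, contig 2283 bp"
)
print(header.name, header.strand, header.span_bp)  # orf6.2 - 330
```

In each of the four example headers, the span computed from the coordinates equals the stated length in bp, which gives a simple consistency check when reading such files.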
Accessing Sequences in CGD