Assembly 6 Release Notes

Assembly 6 is expected to be the final assembly of Candida albicans
sequence data starting from the individual reads.  After preprocessing
steps to remove most highly repeated sequences, a total of 313,165
reads were assembled into 1213 contigs 2kb or greater.  These data
represent 10.4X mean coverage assuming a haploid genome size of 15.5MB
(excluding repeats such as the rDNA, which assemble as a single copy).
The contigs add to 17.4MB, exceeding the genome size because
heterozygous regions assemble separately.  The mtDNA is not included
in the assembly as it must be translated with a different genetic
code.
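
The figures above can be cross-checked with a little arithmetic.  The
following is a minimal sketch, assuming that "MB" means 10^6 bases and
that coverage is total assembled read bases divided by genome size.
The variable names are illustrative and are not part of the release.
Under those assumptions the implied mean usable read length is roughly
515 bases, and the contigs exceed the haploid estimate by about 1.9MB.

```python
# Cross-check of the assembly statistics quoted in these notes.
# Assumption: "MB" means 10**6 bases; coverage = read bases / genome size.
reads = 313_165            # reads assembled after preprocessing
coverage = 10.4            # mean coverage (X)
genome_bp = 15.5e6         # assumed haploid genome size
contig_total_bp = 17.4e6   # summed length of contigs 2kb or greater

total_read_bp = coverage * genome_bp        # bases implied by the coverage
mean_read_bp = total_read_bp / reads        # implied mean usable read length
excess_bp = contig_total_bp - genome_bp     # attributed to heterozygous regions
excess_fraction = excess_bp / genome_bp

print(f"implied mean read length: {mean_read_bp:.1f} bp")
print(f"contig excess over haploid size: {excess_bp / 1e6:.1f} MB "
      f"({excess_fraction:.1%})")
```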

Translation of the assembly resulted in 9168 ORFs capable of encoding
proteins 100aa or greater in length including ambiguities.  In
general, the ORFs contained a start and stop codon.  ORFs that extend
to the end of a contig without a stop codon were included because,
except in rare cases where the contig lies at a chromosome end, they
will eventually reach one.  Reading frames that remain open upstream to the
beginning of a contig were divided into two classes.  Those that
contain an ORF of 100aa or greater with a start codon are
represented as the smaller ORF beginning at that start, because the
upstream sequence is more likely to encounter a stop before a start.  For
completeness, those lacking an internal 100aa ORF with a start are
included up to the beginning of the contig.
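
The two-class rule for reading frames that are open to the beginning
of a contig can be sketched as follows.  This is only an illustration
of the rule as described above, not the code used to build the
reference set.  It assumes that the start codon is the first
methionine in the frame, and that the two classes correspond to the
"en" and "iN" header flags described later in these notes.  The
function name and the MIN_AA constant are hypothetical.

```python
from typing import Tuple

MIN_AA = 100  # minimum protein length used for the reference set


def report_upstream_open_frame(frame_aa: str) -> Tuple[str, str]:
    """Choose how to report a frame that is open to the contig start.

    frame_aa is the translation from the first codon of the contig to
    the codon before the stop (stop excluded).
    Returns (flag, reported_protein).
    """
    # Assumption: the start codon is the first Met in the frame.
    first_met = frame_aa.find("M")
    if first_met != -1 and len(frame_aa) - first_met >= MIN_AA:
        # Class 1: an internal ORF of 100aa or greater with a start;
        # report the smaller ORF beginning at that start.
        return "en", frame_aa[first_met:]
    # Class 2: no such internal ORF; report up to the contig beginning.
    return "iN", frame_aa


if __name__ == "__main__":
    flag, protein = report_upstream_open_frame("A" * 10 + "M" + "A" * 120)
    print(flag, len(protein))
```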

Along with assembled DNA, a reference set of ORFs is being provided
with assembly 6.  The ORFs are numbered sequentially with the lowest
numbers deriving from the contigs with the lowest numbers.  A typical
fasta header line reads as follows:

orf6.2.prot  orf6-1097:610-281:e  330 bp, 109 aa, contig 2283 bp

This is interpreted as the protein sequence for ORF 2 from assembly 6.
It derives from nucleotides 610-281 in contig 6-1097 (a start
coordinate greater than the stop indicates it is read from the
complementary strand).  The letter "e" indicates that an entire ORF is
present (start and stop).  The letter "i" indicates an incomplete ORF
and is followed by N or C to indicate which end of the ORF is
incomplete.  The letter "n" indicates that while an entire ORF is
given, the reading frame remains open to the beginning of the contig.
A count of codons that could be used to extend such an ORF is given in
the header line (upcont=5 in the last example below).  Examples are:

orf6.1.prot  orf6-1072:1371-1:iC  1371 bp, 457 aa, contig 2182 bp
orf6.4.prot  orf6-1097:2281-1868:iN  414 bp, 137 aa, contig 2283 bp
orf6.14.prot  orf6-1174:18-545:en  528 bp, 175 aa, contig 2692 bp upcont=5

Fasta headers for the DNA sequences of ORFs are formed in the same
way except that the ORF name lacks the ".prot" extension.  ORF
DNA sequences are always given as the sense strand.
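
The header layout described above can be parsed with a regular
expression.  The following is a minimal sketch whose pattern is
inferred only from the example lines in these notes, so real files may
contain variants it does not handle.  The function names are
hypothetical.  The upstream_codons helper reflects one reading of the
"upcont" count (whole codons between the ORF start and the contig
beginning); it agrees with the single example given but is otherwise
an assumption.

```python
import re

# Pattern inferred from the example headers only (an assumption).
HEADER_RE = re.compile(
    r"^>?(?P<name>orf(?P<assembly>\d+)\.(?P<number>\d+))(?P<prot>\.prot)?\s+"
    r"(?P<contig>[^\s:]+):(?P<start>\d+)-(?P<end>\d+):(?P<flag>[A-Za-z]+)\s+"
    r"(?P<bp>\d+) bp, (?P<aa>\d+) aa, contig (?P<contig_bp>\d+) bp"
    r"(?:\s+upcont=(?P<upcont>\d+))?\s*$"
)


def parse_orf_header(line):
    """Split one ORF fasta header line into a dict of fields."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError("unrecognized ORF header: %r" % line)
    start, end = int(m["start"]), int(m["end"])
    return {
        "name": m["name"],
        "assembly": int(m["assembly"]),
        "number": int(m["number"]),
        "is_protein": m["prot"] is not None,
        "contig": m["contig"],
        "start": start,
        "end": end,
        # A start greater than the stop means the complementary strand.
        "strand": "-" if start > end else "+",
        "flag": m["flag"],  # "e", "en", "iN" or "iC" in the examples
        "bp": int(m["bp"]),
        "aa": int(m["aa"]),
        "contig_bp": int(m["contig_bp"]),
        "upcont": int(m["upcont"]) if m["upcont"] else None,
    }


def span_matches(rec):
    """Check that the coordinate span equals the stated length in bp."""
    return abs(rec["start"] - rec["end"]) + 1 == rec["bp"]


def upstream_codons(rec):
    """Whole codons between the ORF start and the contig beginning,
    counted in the reading direction (assumed meaning of upcont)."""
    if rec["strand"] == "+":
        return (rec["start"] - 1) // 3
    return (rec["contig_bp"] - rec["start"]) // 3


if __name__ == "__main__":
    examples = [
        "orf6.2.prot  orf6-1097:610-281:e  330 bp, 109 aa, contig 2283 bp",
        "orf6.1.prot  orf6-1072:1371-1:iC  1371 bp, 457 aa, contig 2182 bp",
        "orf6.4.prot  orf6-1097:2281-1868:iN  414 bp, 137 aa, contig 2283 bp",
        "orf6.14.prot  orf6-1174:18-545:en  528 bp, 175 aa, contig 2692 bp upcont=5",
    ]
    for header in examples:
        rec = parse_orf_header(header)
        print(rec["name"], rec["strand"], rec["flag"], span_matches(rec))
```

The same function accepts DNA headers, since the ".prot" extension is
optional in the pattern.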

Additional experiments and computer analyses will be used to close
remaining gaps.  Any new ORFs that result from these changes will get
new higher numbers.  During this process, ORFs may be removed from the
current set.  A mapping from the reference set of 9168 ORFs to the new
ORFs will be provided.

The set of 9168 ORFs contains a large number of ORFs that are internal
to or overlapping with larger ORFs.  These smaller ORFs are
currently included for completeness and will be removed from the
reference set at a later date.  A significant number of ORFs likely
represent diverged alleles of genes.  As finishing progresses, the
plan is to produce a representative haploid genome and provide the
likely alleles as a separate set.  In a smaller number of cases, two
ORFs are parts of the same gene.  Causes include introns, gaps in the
current sequence, and remaining frameshifts.  As the project
progresses, new designators and mappings to the current set will be
provided.