Assembly 6 Release Notes

Assembly 6 is expected to be the final assembly of Candida albicans sequence data starting from the individual reads. After preprocessing steps to remove most highly repeated sequences, a total of 313,165 reads were assembled into 1213 contigs of 2 kb or greater. These data represent 10.4X mean coverage, assuming a haploid genome size of 15.5 Mb (except that repeats such as the rDNA assemble as a single copy). The contigs add up to 17.4 Mb, exceeding the genome size because heterozygous regions assemble separately. The mtDNA is not included in the assembly because it must be translated with a different genetic code.

Translation of the assembly resulted in 9168 ORFs capable of encoding proteins of 100 aa or greater in length, including ambiguities. In general, the ORFs contain a start and a stop codon. ORFs that extend to the end of a contig without a stop codon were included because, except in the rare cases where a contig lies at a chromosome end, they will eventually reach one. Reading frames that remain open upstream to the beginning of a contig were divided into two classes. Those that contain an ORF of 100 aa or greater with a start codon are represented as that smaller ORF, beginning at the start codon, because the upstream sequence is more likely to encounter a stop before a start. For completeness, those lacking an internal ORF of 100 aa or greater with a start codon are included up to the beginning of the contig.

Along with the assembled DNA, a reference set of ORFs is being provided with assembly 6. The ORFs are numbered sequentially, with the lowest numbers deriving from the contigs with the lowest numbers. A typical fasta header line reads as follows:

    orf6.2.prot orf6-1097:610-281:e 330 bp, 109 aa, contig 2283 bp

This is interpreted as the protein sequence for ORF 2 from assembly 6. It derives from nucleotides 610-281 in contig 6-1097 (a start coordinate greater than the stop indicates that it is read from the complementary strand).
The letter "e" indicates that an entire ORF is present (start and stop). The letter "i" indicates an incomplete ORF and is followed by the letter N or C to indicate which end of the ORF is incomplete. The letter "n" indicates that, while an entire ORF is given, the reading frame remains open to the beginning of the contig. A count of the codons that could be used to extend such an ORF is given in the header line (upcont=5 in the last example). Examples are:

    orf6.1.prot orf6-1072:1371-1:iC 1371 bp, 457 aa, contig 2182 bp
    orf6.4.prot orf6-1097:2281-1868:iN 414 bp, 137 aa, contig 2283 bp
    orf6.14.prot orf6-1174:18-545:en 528 bp, 175 aa, contig 2692 bp upcont=5

Fasta headers for the DNA sequences of ORFs are formed in the same way, except that the ORF name lacks the ".prot" extension. ORF DNA sequences are always given as the sense strand.

Additional experiments and computer analyses will be used to close the remaining gaps. Any new ORFs that result from these changes will receive new, higher numbers. During this process, ORFs may be removed from the current set. A mapping from the reference set of 9168 to the new ORFs will be provided.

The set of 9168 ORFs contains a large number of ORFs that are internal to, or overlap with, larger ORFs. These smaller ORFs are currently included for completeness and will be removed from the reference set at a later date.

A significant number of ORFs likely represent diverged alleles of genes. As finishing progresses, the plan is to produce a representative haploid genome and to provide the likely alleles as a separate set. In a smaller number of cases, two ORFs are parts of the same gene. Causes include introns, gaps in the current sequence, and remaining frameshifts. As the project progresses, new designators and mappings to the current set will be provided.
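
The fasta header layout described in these notes is regular enough to be read mechanically. The following is a minimal Python sketch, not a tool distributed with the assembly. The names HEADER_RE, OrfHeader and parse_header are hypothetical, and the pattern is inferred only from the four example headers shown in these notes, so it recognizes just the status codes "e", "en", "iN" and "iC". It also tolerates a leading ">" on the assumption that the files follow the usual fasta convention.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern inferred from the example headers in the release notes (an assumption,
# not an official grammar). A leading ">" is accepted but not required.
HEADER_RE = re.compile(
    r"^>?\s*(?P<name>orf(?P<assembly>\d+)\.(?P<number>\d+))(?P<prot>\.prot)?\s+"
    r"(?P<contig>[^:\s]+):(?P<start>\d+)-(?P<stop>\d+):(?P<flag>en|e|iN|iC)\s+"
    r"(?P<bp>\d+) bp, (?P<aa>\d+) aa, contig (?P<contig_bp>\d+) bp"
    r"(?:\s+upcont=(?P<upcont>\d+))?\s*$"
)


@dataclass(frozen=True)
class OrfHeader:
    """Fields of one ORF header line (hypothetical container)."""
    name: str              # e.g. "orf6.2", without the ".prot" extension
    assembly: int          # assembly number, e.g. 6
    number: int            # sequential ORF number, e.g. 2
    is_protein: bool       # True when the name carried ".prot"
    contig: str            # contig identifier as written in the header
    start: int             # first coordinate in the header
    stop: int              # second coordinate in the header
    flag: str              # "e", "en", "iN" or "iC"
    length_bp: int         # ORF length in base pairs
    length_aa: int         # protein length in amino acids
    contig_bp: int         # contig length in base pairs
    upcont: Optional[int]  # codons available for upstream extension, if given

    @property
    def strand(self) -> str:
        # A start coordinate greater than the stop means the complementary strand.
        return "-" if self.start > self.stop else "+"

    @property
    def is_complete(self) -> bool:
        # "e" and "en" both describe an entire ORF (start and stop present).
        return self.flag in ("e", "en")

    @property
    def span_bp(self) -> int:
        # Length implied by the coordinates; should agree with length_bp.
        return abs(self.start - self.stop) + 1


def parse_header(line: str) -> OrfHeader:
    """Parse one header line, raising ValueError if it does not fit the pattern."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized ORF header: {line!r}")
    return OrfHeader(
        name=m["name"],
        assembly=int(m["assembly"]),
        number=int(m["number"]),
        is_protein=m["prot"] is not None,
        contig=m["contig"],
        start=int(m["start"]),
        stop=int(m["stop"]),
        flag=m["flag"],
        length_bp=int(m["bp"]),
        length_aa=int(m["aa"]),
        contig_bp=int(m["contig_bp"]),
        upcont=int(m["upcont"]) if m["upcont"] is not None else None,
    )


if __name__ == "__main__":
    examples = [
        "orf6.2.prot orf6-1097:610-281:e 330 bp, 109 aa, contig 2283 bp",
        "orf6.14.prot orf6-1174:18-545:en 528 bp, 175 aa, contig 2692 bp upcont=5",
    ]
    for line in examples:
        h = parse_header(line)
        print(h.name, h.contig, h.strand, h.flag, h.length_bp, h.upcont)
```

Rejecting unknown status codes with a ValueError, rather than guessing at their meaning, is a deliberate choice here: later releases may introduce designators that these notes do not describe.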