Format of gene names
C. albicans gene names should follow the format established
for S. cerevisiae gene names. This format is described in
detail in a guide to S. cerevisiae nomenclature,
published in Trends in
Genetics (TIG) (download
pdf file). The gene name should consist of three letters (the gene
symbol) followed by an integer (e.g. ADE12). Dominant alleles of the
gene (most often wild-type) are denoted by all uppercase letters, while
recessive alleles are denoted by all lowercase letters.
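As an illustration only, the standard format described above (three letters followed by an integer, with case indicating dominance) can be checked with a short script. This is a minimal sketch, not a CGD tool; the function names is_standard_gene_name and allele_class are hypothetical, and the sketch assumes that a name is either all uppercase or all lowercase.

```python
import re

# Standard format per the text: a 3-letter gene symbol followed by an integer.
# Assumption: the three letters are either all uppercase or all lowercase.
STANDARD_NAME = re.compile(r"(?:[A-Z]{3}|[a-z]{3})[0-9]+")


def is_standard_gene_name(name: str) -> bool:
    """Return True if the name is three letters followed by an integer."""
    return STANDARD_NAME.fullmatch(name) is not None


def allele_class(name: str) -> str:
    """Return 'dominant' for uppercase names, 'recessive' for lowercase names,
    and 'unknown' for anything that is not in standard format."""
    if not is_standard_gene_name(name):
        return "unknown"
    return "dominant" if name[:3].isupper() else "recessive"
```

For example, is_standard_gene_name("ADE12") is true, while historical names such as "WH11" or "MTLA1" do not match the standard pattern.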
The 3-letter gene symbol should stand for a description of a phenotype, gene product, or gene function. In addition, it is strongly preferable that a given gene symbol have only one associated description; that is, all genes that use a given 3-letter symbol should have a related phenotype, gene product, or gene function, and each 3-letter symbol should have the same meaning for S. cerevisiae and C. albicans genes. Where C. albicans and S. cerevisiae genes appear to be orthologous, it is preferable that they share the same gene name. Where C. albicans and S. cerevisiae genes are similar but do not have the same function in both species, it is preferable that the genes do NOT share a name; rather, the gene name assigned should reflect the function of the gene.
Some gene names currently in use in CGD have a non-standard format. Many of these gene names are historical and are well recognized within the research community (e.g., WH11; OP4; MTLA1; ADE5,7). Other genes acquired a non-standard name when the name was used in a publication describing a large-scale experiment (e.g., FESUR1, CAM1-1).
Going forward, it is preferable that newly named genes use standard format whenever possible.
The prefix "Ca" used in front of a gene name is not part of the
true gene name. The use of prefixes adds clarity to papers discussing
genes from different species that share a name (e.g., CaURA3
vs. ScURA3), but the gene names themselves do not include the prefix.
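When processing gene labels taken from papers, it can help to separate a species prefix from the gene name itself. The sketch below is illustrative only: the function name split_species_prefix is hypothetical, and it assumes that only the prefixes "Ca" and "Sc" occur and that the gene name following the prefix is an uppercase standard-format name.

```python
import re

# Assumption: only the prefixes "Ca" and "Sc" are handled, and the gene name
# after the prefix is in uppercase standard format (e.g. URA3).
PREFIXED_NAME = re.compile(r"(Ca|Sc)([A-Z]{3}[0-9]+)")


def split_species_prefix(label: str):
    """Return (prefix, gene_name); prefix is None if no prefix is recognized."""
    match = PREFIXED_NAME.fullmatch(label)
    if match:
        return match.group(1), match.group(2)
    return None, label
```

With this sketch, "CaURA3" and "ScURA3" both yield the gene name "URA3", while a name such as "CAM1-1" is returned unchanged because its leading "CA" is not a species prefix.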
Choosing a gene name
Before deciding on a gene name, please search CGD, PubMed, and the
annotation generated by the Annotation Working Group with possible
gene names to see whether your choice is unique. Also search SGD
Gene Names for any gene name beginning with the 3-letter
symbol, by entering the 3-letter symbol followed by an asterisk,
e.g. "ADE*", in the query box.
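The effect of a trailing-asterisk query such as "ADE*" can be illustrated against a local list of names. This sketch does not query SGD or CGD; the function name wildcard_search and the list of names are hypothetical, and case-insensitive matching is an assumption made for the example.

```python
from fnmatch import fnmatchcase


def wildcard_search(pattern: str, names: list[str]) -> list[str]:
    """Return the names that match a shell-style wildcard pattern.

    Matching is case-insensitive here (an assumption for this example).
    """
    pattern_upper = pattern.upper()
    return [name for name in names if fnmatchcase(name.upper(), pattern_upper)]


# Hypothetical local list of gene names, for illustration only.
known_names = ["ADE1", "ADE12", "ADE5,7", "URA3", "ADH1"]
matches = wildcard_search("ADE*", known_names)
```

If the list of matches is empty for a proposed 3-letter symbol, that symbol is not yet used in the list that was searched.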
Changing a standard gene name
The first published name for a gene is typically used as its standard name; however, gene names may be changed if there is consensus among the groups who study the gene. CGD is happy to facilitate this process. To initiate a gene name change please contact the CGD curators.
At CGD, we curate gene names that have appeared in the published literature; we do not assign gene names ourselves. CGD
collects all published names for each gene; any names in addition to the standard gene name are present in the database as searchable gene aliases. Gene names or locus tags that appear only in GenBank may be used as aliases in CGD; they are not used as standard gene names unless they appear in the published literature. CGD also includes the gene identifiers assigned during Assembly 4 and Assembly 6, as well as the IPF and CA identifiers from CandidaDB. The gene names that were assigned by CandidaDB based on homology, and the Suggested Names assigned by the Annotation Working Group, are only adopted by CGD upon publication of these names in the scientific literature.
CGD has implemented a gene name reservation system. Reservation of a gene name prior to publication allows other groups to begin using the name as soon as possible, and reduces the likelihood that a gene will acquire multiple distinct names that are used in the published literature. Please use the CGD Gene Registry to reserve new gene names.
Format of systematic names
Systematic names used should be the "orf19.#" names (where "#" is an integer) assigned to open
reading frames identified in Assembly 19 of the genome sequence. The Annotation Working Group has assigned orf19 identifiers to some open reading frames that were not part of the original assembly (described in Braun et al., 2005). New orfs have been assigned names of the format "orf19.#.n", where "orf19.#" corresponds to the identifier of the upstream orf19, and "n" is an integer. For example, orf19.5006.1 is located on Contig19-10216 between orf19.5006 and orf19.5007.
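The two systematic name formats described above, "orf19.#" and "orf19.#.n", can be parsed with a regular expression. This is a minimal sketch; the function names parse_orf19 and upstream_orf are hypothetical.

```python
import re

# "orf19.#" with an optional ".n" suffix, where "#" and "n" are integers.
ORF19_NAME = re.compile(r"orf19\.([0-9]+)(?:\.([0-9]+))?")


def parse_orf19(name: str):
    """Return (number, suffix) for an orf19 name, or None if it does not match.

    suffix is None for names of the plain "orf19.#" form.
    """
    match = ORF19_NAME.fullmatch(name)
    if match is None:
        return None
    suffix = int(match.group(2)) if match.group(2) is not None else None
    return int(match.group(1)), suffix


def upstream_orf(name: str):
    """For an "orf19.#.n" name, return the upstream "orf19.#" identifier."""
    parsed = parse_orf19(name)
    if parsed is None or parsed[1] is None:
        return None
    return f"orf19.{parsed[0]}"
```

For the example given above, upstream_orf("orf19.5006.1") returns "orf19.5006".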
Assembly 20 and Assembly 21 continue to use the orf19 names. In the future, the Candida research community may wish to adopt a position-based systematic gene nomenclature. (For reference, the system used to assign systematic names to S. cerevisiae genes is described in the TIG guide to S. cerevisiae nomenclature; download
the pdf file here).
IPF identifiers
Identifiers of the form "IPF#.n" have been assigned at CandidaDB, where IPF stands for "Individual Protein File," "#" is an integer, and "n" is a version number or an informational tag (described in d'Enfert et al., 2005). CGD currently includes the IPF names that were archived in the Annotation Working Group's annotation file as of February 22nd, 2005, as well as some IPF names that CGD curators gathered from the published literature. Where IPF identifiers have been assigned both to an orf and also to its allele, CGD includes both IPF identifiers as searchable aliases on the Locus page.
A cautionary note about suffixes appended to gene names
Please note that the numerical suffix has different meanings in orf19 and IPF names: the orf19 suffix denotes that the orf is distinct, whereas the IPF suffix serves either as a version number or as a tag that conveys information about sequence homology. For example, orf19.5006.1 is not the same as orf19.5006. In contrast, the IPF identifiers IPF22272 and IPF22272.1 refer to the same gene, and the ".1" suffix indicates that no change has been made to this record since Assembly 5. A suffix of ".2" or ".3" appended to an IPF identifier indicates that there have been one or two changes, respectively, between Assembly 5 and Assembly 19. In some gene names used at CandidaDB, suffixes serve as informational tags. These suffixes were assigned to indicate that the gene appears to be a 5' or 3' gene fragment, either with or without an adjacent corresponding 3' or 5' fragment in the published Assembly 19, and to note whether the fragment is located at the end of a contig. For example, IPF13383.5eoc has similarity to the 5' end of a related gene, and this ORF is also located at the end of a contig. Please see Braun et al., 2005, and d'Enfert et al., 2005, for additional explanation.
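The difference between the two suffix conventions can be made concrete in code. The sketch below is illustrative and rests on stated assumptions: the function names ipf_base and same_identifier_base are hypothetical, an IPF suffix (version number or tag) is treated as not changing which record is meant, and any non-IPF identifier, including an orf19 name, is compared in full.

```python
import re

# "IPF#" with an optional ".n" suffix (a version number or informational tag).
IPF_NAME = re.compile(r"(IPF[0-9]+)(?:\.([0-9A-Za-z]+))?")


def ipf_base(identifier: str):
    """Return the "IPF#" part of an IPF identifier, or None if it is not one."""
    match = IPF_NAME.fullmatch(identifier)
    return match.group(1) if match else None


def same_identifier_base(a: str, b: str) -> bool:
    """Compare two identifiers, ignoring the suffix only for IPF identifiers.

    Assumption: orf19 names (and anything else) must match exactly, because
    an orf19 suffix denotes a distinct orf.
    """
    base_a, base_b = ipf_base(a), ipf_base(b)
    if base_a is not None and base_b is not None:
        return base_a == base_b
    return a == b
```

Under these assumptions, IPF22272 and IPF22272.1 compare as equal, whereas orf19.5006 and orf19.5006.1 do not.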
Aliases from Assemblies 4 and 6
CGD contains gene name aliases from two earlier assemblies of the C. albicans genome sequence, Assemblies 4 and 6. The aliases from Assembly 6 have the form "orf6.#" (where "#" is an integer). The aliases from Assembly 4 have the form "Contig4-$$$$.####" (where $$$$ is a numerical identifier for the contig, and #### is a numerical identifier for the ORF within the contig). These aliases appear on the CGD Locus pages. In addition, the complete mappings may be downloaded in tab-delimited text file format from the CGD Download site. The mapping between Assembly 4 identifiers and orf19 names is based on a mapping provided by Judy Berman, with some additional manual curation. The mapping between Assembly 6 identifiers and orf19 names was generated at CGD by BLAST-based comparison of orf19s to orf6s, as described in detail in the README file in the Download directory.
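The two alias formats can be recognized with regular expressions. This is a minimal sketch under stated assumptions: the function name parse_legacy_alias is hypothetical, the example identifiers in the usage note are invented for illustration, and the numeric fields are accepted at any length and kept as strings so that any leading zeros are preserved.

```python
import re

# Assembly 6 aliases: "orf6.#"
ORF6_ALIAS = re.compile(r"orf6\.([0-9]+)")
# Assembly 4 aliases: "Contig4-$$$$.####" (contig identifier, then ORF identifier)
CONTIG4_ALIAS = re.compile(r"Contig4-([0-9]+)\.([0-9]+)")


def parse_legacy_alias(alias: str):
    """Return a dict describing an Assembly 4 or 6 alias, or None if neither.

    Numeric fields are kept as strings to preserve any leading zeros.
    """
    match = ORF6_ALIAS.fullmatch(alias)
    if match:
        return {"assembly": 6, "orf": match.group(1)}
    match = CONTIG4_ALIAS.fullmatch(alias)
    if match:
        return {"assembly": 4, "contig": match.group(1), "orf": match.group(2)}
    return None
```

For instance, the invented alias "Contig4-2500.0012" would be reported as an Assembly 4 alias with contig "2500" and ORF "0012", and an orf19 name would return None.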
Detailed format of gene, allele, and protein names
Many thanks to Aaron Mitchell for providing this table.