Rules for the Sequence Tables


Last Update: 2002-06-07 Kara

These tables are used for storing raw sequence data and archived sequences in the database.

The "production" version of the sequence will be stored in the display_seq and chromosome tables. Currently, there are 1 to 4 rows for each feature in the display_seq table, depending on the feature_type. For example, ORFs will have 4 rows: genomic DNA, coding DNA, genomic DNA with 1000 bp 5' and 3' flanking regions, and protein sequence. Note that we might at some point want to store subfeatures' sequences as well.

When there is a change to a feature's coordinates without a change to the underlying sequence, the coordinates are stored in the feature_update table. The coordinates of the feature in the feature table are updated, the feature's sequences in the raw_sequence table are updated, and the previous coordinates are stored in the feature_update table. In addition, the version of the chromosome that was associated with the feature is also stored so that the old sequence of a feature can be recreated.

When there is an actual change to the chromosomal sequence, the entire sequence of the chromosome is archived in the chromosome_archive table. In addition, each change that occurred is recorded in the chromosome_update table. This table can store individual nucleotide changes (eg. T to C), and can also store chunks of sequence changes up to 4000 nucleotides (eg. delete 1 kb). Some sample data can be found here.

One example scenario:

The current version of all sequences (features and chromosomes) is 2002-05-23. On 2002-06-01, there is a change to the feature coordinates for YFL039C. In the rows in the display_seq table for YFL039C, the sequences will be updated and the feature_version in the feature table will now be 2002-06-01. In the feature_update table, the old coordinates for YFL039C will be stored, and the feature_version in feature_update will be 2002-05-23. In addition, changes to subfeatures associated with YFL039C will be stored in the subfeature_update table. The chr_version will be 2002-05-23, which refers to a row in the chromosome table. This allows us to re-construct what the original sequence of YFL039C was using the coordinates and the chromosome.
Later, on 2002-10-01, the chromosome sequence is changed, so the chromosome sequence is archived in the chromosome_archive table (and specific changes recorded in chromosome_update). In chromosome_archive, the chr_version will be 2002-05-23; the chr_seq in the chromosome table is updated, and the version will now be 2002-10-01. We will still be able to re-construct the sequence of the old version of YFL039C using the chr_version in the feature_update table, which will now be referring to a row in the chromosome_archive table

TABLE DISPLAY_SEQ
================== display_seq_no Required. Oracle sequence number. Primary key. Cannot be updated. NUMBER(10) display_seq Required. Uppercase. CLOB display_seq_type Required. Coded. On insert, check value against code table. VARCHAR2(40) seq_length Required. NUMBER(10) feature_no Required. VARCHAR2(10) Foreign key to feature table. date_created Required. On insert, set to SYSDATE. On update, keep original date. created_by Required. On insert, set to USER Other notes: ============ Size issues: the largest ORF is 14733 nucleotides (YLR106C); there are 4 ORFs over 10 Kb, though most are below 4 Kb. The smallest is 75 nucleotides. Some examples: Most rows in the feature table will have 4 rows in the raw_seq table: seq_type = 'Genomic DNA' seq_type = 'Genomic DNA +/- 1 kb' seq_type = 'Coding sequence' seq_type = 'Protein translation' For features with alternative translations, the feature will have the above 4 rows plus: seq_type = 'Alternative translation'

TABLE FEATURE
================== This table already exists, but we are adding the following column: feature_version Required. DATE This column stores the current version of the feature.
TABLE FEATURE_UPDATE
================== coord_archive_no Required. Oracle sequence number. Primary key. Cannot be updated. NUMBER(10) old_start_coord Required. NUMBER(10) new_start_coord Optional. NUMBER(10) old_stop_coord Required. NUMBER(10) new_stop_coord Optional. NUMBER(10) chr_version Required. DATE feature_version Required. DATE feature_no Required. NUMBER(10) Foreign key to FEATURE table. date_created Required. On insert, set to SYSDATE. On update, keep original date. created_by Required. On insert, set to USER Other notes: ============ The chr_version is the version of the chromosome with which the previous version of the feature was associated. It may be the current version of the chromosome in the chromosome table, or an archived version of the chromosome in the chromosome_archive table. Note that feature_version is not unique; there may be two different changes to the same feature on the same day. In this case, there will be two rows in feature_update pointing to the same row in the feature table with the same feature_version.
TABLE SUBFEATURE_UPDATE
================== subfeature_update_no Required. Oracle sequence number. Primary key. Cannot be updated. NUMBER(10) old_start_coord Required. NUMBER(10) new_start_coord Optional. NUMBER(10) old_stop_coord Required. NUMBER(10) new_stop_coord Optional. NUMBER(10) feature_update_no Required. NUMBER(10) Foreign key to FEATURE table. subfeature_no Required. NUMBER(10) Foreign key to FEATURE table. date_created Required. On insert, set to SYSDATE. On update, keep original date.
TABLE CHROMOSOME
================== This table already exists, but we are adding the following columns: chr_version Required. DATE chr_seq Required. CLOB chr_version stores the current version of the chromosome, while chr_seq stores the actual chromosome sequence.
TABLE CHROMOSOME_ARCHIVE
================== chr_archive_no Required. Oracle sequence number. Primary key. Cannot be updated. NUMBER(10) chr_seq Required. CLOB chr_seq_length Required. NUMBER(10) chr_version Required. DATE date_created Required. On insert, set to SYSDATE. On update, keep original date. created_by Required. On insert, set to USER
TABLE CHROMOSOME_UPDATE
================== chr_update_no Required. Oracle sequence number. Primary key. Cannot be updated. NUMBER(10) chr_update_type Required. Coded. On insert, check value against code table. VARCHAR2(40) start_coord Required. NUMBER(10) stop_coord Required. NUMBER(10) date_created Required. On insert, set to SYSDATE. On update, keep original date. created_by Required. On insert, set to USER old_seq Optional. VARCHAR2(4000) new_seq Optional. VARCHAR2(4000) Other notes: ============ Some valid values for chromo_change_type are: Deletion, Insertion, Base change. We can also have a type like "Large change", where 4000 bp chunks are replaced.