Difference between revisions of "Chado Sequence Module"

From GMOD
m (New page: Chapter 4 The Sequence Module: Features 4.1 The role of features in Chado The central module in Chado is the sequence module. The fundamental table within this module is the feature...)
 
(added to Chado Modules category)
 
(99 intermediate revisions by 4 users not shown)
Line 1: Line 1:
Chapter 4
+
=Introduction=
  
 +
A central module in Chado is the sequence module. The fundamental table within this module
 +
is the feature table, for describing biological sequence features. Chado defines a feature to be a
 +
region of a biological polymer (typically a DNA, RNA, or a polypeptide molecule) or an aggregate
 +
of regions on this polymer. As the term is used here, region can be the entire extent of the molecule,
 +
or a junction between two bases. Features can be typed according to an ontology, they
 +
can be localized relative to other features, and they can form part-whole and other relationships
 +
with other features.
  
The Sequence Module: Features
+
You may find these related documents useful:
  
 +
* [[Chado Manual]]
 +
* [[Chado_Best_Practices|Chado Best Practices]] - many issues specific to the Sequence module are discussed
 +
* [[Chado_FAQ|Chado FAQ]]
 +
* [[Introduction_to_Chado|Introduction to Chado]]
 +
* [[Chado_CV_Module|Chado cv module]] - the Sequence module makes extensive use of controlled vocabularies
  
4.1  The role of features in Chado
+
=Features=
  
 +
{{NeedsEditing}}
  
The central module in Chado is the sequence module. The fundamental table within this module
+
Chado does not distinguish between a sequence and a sequence feature, on the theory that a feature is a piece of a sequence, and a piece of a sequence is a sequence. Both are represented as a row in the [[#Table:_feature|feature]] table.
is the feature table, for describing biological sequence features. Chado defines a feature to be a
+
region of a biological polymer (typically a DNA, RNA, or a polypeptide molecule) or an aggregate
+
of regions on this polymer. As the term is used here, region can be the entire extent of the molecule,
+
or a junction between two bases. Features can be typed according to a classification scheme[6], they
+
can be localized relative to other features, and they can form part-whole and other relationships
+
with other features.
+
  
There are many different types of features. Examples include gene, exon, transcript, regulatory
+
There are many different types of features. Examples include gene, exon, transcript, regulatory
 
region, chromosome, sequence variation, polypeptide, protein domain and cross-genome match
 
region, chromosome, sequence variation, polypeptide, protein domain and cross-genome match
regions. Chado does not have a different table for each kind of feature; all features are stored in
+
regions. Chado does not have a different table for each kind of feature; all features are stored in
the feature table. Types of feature are differentiated using a type id column, which is a foreign key
+
the [[#Table:_feature|feature]] table.
to the cvterm table in the cv (ontology) module, described later. This allows us to type features
+
 
 +
Feature types are taken from the [http://www.sequenceontology.org/ Sequence Ontology] controlled vocabulary (see also the [[Chado_CV_Module|Controlled Vocabulary module]], also known as ''cv''). Types of feature are differentiated using a ''type_id'' column, which is a foreign key to the [[Chado_Tables#Table:_cvterm|cvterm]] table in the cv (ontology) module, described [[Chado_CV_Module|here]]. This allows us to type features
 
according to the Sequence Ontology. The use of ontologies to type tables gives Chado a subtyping
 
according to the Sequence Ontology. The use of ontologies to type tables gives Chado a subtyping
 
mechanism, which is absent from the standard relational model. For example, SO tells us that
 
mechanism, which is absent from the standard relational model. For example, SO tells us that
mRNA and snRNA are different kinds of transcript. This is discussed in more detail in the next section.
+
mRNA and snRNA are different kinds of transcript. This is discussed in more detail in the next section.
 
For the purposes of discussion in this document, it can be assumed that any reference to genes,
 
For the purposes of discussion in this document, it can be assumed that any reference to genes,
 
exons, polypeptides, SNPs, chromosomes, transcripts and various kinds of RNAs and so on refers
 
exons, polypeptides, SNPs, chromosomes, transcripts and various kinds of RNAs and so on refers
to features of that sequence ontology type.
+
to features of that Sequence Ontology type.
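
As a minimal sketch of how typing through ''type_id'' works in practice, the query below lists all features typed as gene. It assumes a standard Chado installation on PostgreSQL in which the Sequence Ontology has been loaded into the cv table under the name 'sequence'; that name is an assumption and may differ in your instance.

<syntaxhighlight lang="sql">
-- Find all features whose SO type is "gene".
-- Assumption: the Sequence Ontology is stored in cv with name = 'sequence'.
SELECT f.feature_id, f.uniquename, f.name
FROM feature f
JOIN cvterm t ON t.cvterm_id = f.type_id
JOIN cv ON cv.cv_id = t.cv_id
WHERE cv.name = 'sequence'
  AND t.name = 'gene';
</syntaxhighlight>

Note that this matches only features typed exactly as gene; retrieving subtypes as well would require walking the ontology relationships in the cv module.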
  
The Chado feature table has a text-valued column named residues for storing the sequence
+
A selection of Chado-relevant types from SO are shown below:
of the feature. The value of this column is string of IUPAC[REF] symbols corresponding to the
+
 
 +
{| class="wikitable"
 +
|+ Sequence Ontology Examples
 +
!SO Term
 +
!SO id
 +
|-
 +
|[http://www.sequenceontology.org/miSO/SO_CVS/exon.html Exon]
 +
|SL:0000025
 +
|-
 +
|[http://www.sequenceontology.org/miSO/SO_CVS/intron.html Intron]
 +
|SL:0000027
 +
|-
 +
|[http://www.sequenceontology.org/miSO/SO_CVS/mRNA.html mRNA]
 +
|SL:0000037
 +
|-
 +
|[http://www.sequenceontology.org/miSO/SO_CVS/miRNA miRNA]
 +
|SL:0000044
 +
|-
 +
|[http://www.sequenceontology.org/miSO/SO_CVS/regulatory_element regulatory_element]
 +
|SL:0000052
 +
|-
 +
|[http://www.sequenceontology.org/miSO/SO_CVS/transcription_factor_binding_site.html transcription_factor_binding_site]
 +
|SL:0000054
 +
|-
 +
|}
 +
 
 +
 
 +
The Chado [[#Table:_feature|feature]] table has a text-valued column named ''residues'' for storing the sequence
 +
of the feature. The value of this column is a string of [http://bioinformatics.org/sms/iupac.html IUPAC symbols] corresponding to the
 
sequence of biochemical residues encoded by the feature. This column is optional, because the
 
sequence of biochemical residues encoded by the feature. This column is optional, because the
 
sequence of the feature may not be known. Even if the sequence of a feature is known, it may not
 
sequence of the feature may not be known. Even if the sequence of a feature is known, it may not
be desirable to store it in the feature table, as it may be possible to infer the sequence from the
+
be desirable to store it in the [[#Table:_feature|feature]] table, as it may be possible to infer the sequence from the
 
sequence of other features in the database. For example, exon sequences are generally not stored,
 
sequence of other features in the database. For example, exon sequences are generally not stored,
 
as these can trivially be inferred from the sequence of the genomic feature on which the exon is
 
as these can trivially be inferred from the sequence of the genomic feature on which the exon is
Line 38: Line 74:
 
and more computationally expensive to dynamically splice together the mRNA sequence.
 
and more computationally expensive to dynamically splice together the mRNA sequence.
  
 
+
It is important to realize that the existence of a row in the [[#Table:_feature|feature]] table does not necessarily
It is important to realize that the existence of a row in the feature table does not necessarily
+
 
imply that the feature has been characterized as a result of genome annotation. It is possible to
 
imply that the feature has been characterized as a result of genome annotation. It is possible to
have features of SO type gene for genes that have only been characterized through genetic studies
+
have features of SO type gene for genes that have only been characterized through genetic studies, and for which neither sequence nor sequence location is known. This is in contrast to other
[REF], and for which neither sequence nor sequence location is known. This is in contrast to other
+
feature schemas (such as [[GFF]]) in which it is not possible to represent features without representing
feature schemas (such as GFF) in which it is not possible to represent features without representing
+
 
a location in sequence coordinates. This design decision is crucial for the use of Chado as a database
 
a location in sequence coordinates. This design decision is crucial for the use of Chado as a database
 
for integrating information about the same entity from multiple perspectives.
 
for integrating information about the same entity from multiple perspectives.
  
Because the sequence is stored as a column in the feature table rather than as an independent
+
Because the sequence is stored as a column in the [[#Table:_feature|feature]] table rather than as an independent
 
table, sequences cannot exist in the absence of a row in the feature table; sequences are dependent
 
table, sequences cannot exist in the absence of a row in the feature table; sequences are dependent
 
upon features. This is in contrast with almost all other genomics schemas that allow independent
 
upon features. This is in contrast with almost all other genomics schemas that allow independent
 
treatment of sequences and features. This design decision was made for both philosophical and pragmatic

treatment of sequences and features. This design decision was made for both philosophical and pragmatic
reasons. The feature table also contains columns seqlen and md5checksum, for storing the
+
reasons. The [[#Table:_feature|feature]] table also contains columns ''seqlen'' and ''md5checksum'', for storing the
 
length of the sequence and the 32-character checksum computed using the MD5 [RL Rivest. RFC
 
length of the sequence and the 32-character checksum computed using the MD5 [RL Rivest. RFC
 
1321: The md5 message-digest algorithm. Technical report, Internet Activities Board, April 1992.]
 
1321: The md5 message-digest algorithm. Technical report, Internet Activities Board, April 1992.]
algorithm. The length and checksum can be stored even when the residues column is null valued.
+
algorithm. The length and checksum can be stored even when the ''residues'' column is null valued.
 
The checksum is useful for checking if two or more features share the same sequence, without
 
The checksum is useful for checking if two or more features share the same sequence, without
 
comparing the entire sequence string.
 
comparing the entire sequence string.
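
The use of the checksum for detecting shared sequences can be sketched as follows. This is only an illustration against the standard feature table, and it assumes the ''md5checksum'' values were computed consistently by whatever loader populated the database.

<syntaxhighlight lang="sql">
-- List checksums shared by more than one feature,
-- without comparing the residues strings themselves.
SELECT md5checksum, count(*) AS n_features
FROM feature
WHERE md5checksum IS NOT NULL
GROUP BY md5checksum
HAVING count(*) > 1
ORDER BY n_features DESC;
</syntaxhighlight>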
  
The existence of these columns means that this table is no longer in third normal form (3NF)[REF],
+
The existence of these columns means that this table is no longer in [[wp:Third_normal_form|third normal form (3NF)]],
 
which is usually a desirable formal property of relational databases. On balance, the utility of these

which is usually a desirable formal property of relational databases. On balance, the utility of these
columns outweighs the disadvantages of violating 3NF [updates]. In practical terms, it means that
+
columns outweighs the disadvantages of violating [[wp:Third_normal_form|3NF]]. In practical terms, it means that
the values of the residues, seqlen and md5checksum columns are interdependent and cannot be
+
the values of the ''residues'', ''seqlen'' and ''md5checksum'' columns are interdependent and cannot be
 
updated independently of one another.
 
updated independently of one another.
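
One way to keep the three columns consistent is to set them in a single statement, as in the sketch below. The feature_id and the sequence are hypothetical placeholders, and the example assumes PostgreSQL, whose built-in md5() function returns a 32-character hexadecimal string. Whether a given Chado loader computes its checksum over exactly the same string is an assumption that should be checked locally.

<syntaxhighlight lang="sql">
-- Update residues, seqlen and md5checksum together so they stay in step.
-- 12345 and 'ATGGCCTAA' are hypothetical example values.
UPDATE feature
SET residues    = 'ATGGCCTAA',
    seqlen      = length('ATGGCCTAA'),
    md5checksum = md5('ATGGCCTAA')
WHERE feature_id = 12345;
</syntaxhighlight>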
  
The feature table has a Boolean valued column, is analysis, indicating whether this is an an-
+
The [[#Table:_feature|feature]] table has a Boolean valued column, ''is_analysis'', indicating whether this is an annotation or a computed feature from a computational analysis. Annotations are features that are
notation or a computed feature from a computational analysis. Annotations are features that are
+
generated or blessed by a human curator, or in some cases by an integrated genome pipeline (for example, [[MAKER]] or [[DIYA]]) capable of synthesizing gene models and other annotations from ''in silico'' analyses. They constitute
generated or blessed by a human curator, or in some cases by an integrated genome pipeline[7-9]
+
the definitive version of a particular feature, in contrast to the features generated by gene prediction
capable of synthesising gene models and other annotations from in-silico analyses. They constitute
+
the definitive version of a particular feature, in contrast to the features generated by gene prediction
+
 
programs and sequence similarity searches such as BLAST.
 
programs and sequence similarity searches such as BLAST.
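
As a small illustration, the following query counts annotations and computed features separately using the ''is_analysis'' flag. It relies only on the feature table as described above.

<syntaxhighlight lang="sql">
-- is_analysis = false: annotations; is_analysis = true: computed features.
SELECT f.is_analysis, count(*) AS n_features
FROM feature f
GROUP BY f.is_analysis;
</syntaxhighlight>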
  
The feature table has a dbxref id column that refers to a global, stable public identifier for
+
The [[#Table:_feature|feature]] table has a ''dbxref_id'' column that refers to a global, stable public identifier for
the feature. This column is optional, because not all classes of features have such identifiers for
+
the feature. This column is optional, because not all classes of features have such identifiers; for
example, features resulting from gene predictions and blast HSP features may be less stable and
+
example, features resulting from gene predictions and BLAST HSP features may be less stable and
thus lack public identifiers. It is recommended that most annotated features have dbxref ids. The
+
thus lack public identifiers. It is recommended that most annotated features have ''dbxref_id''s. The
organism id column refers to a row in the organism table (defined in the organism module). This
+
''organism_id'' column refers to a row in the [[Chado_Tables#Table:_organism|organism]] table (defined in the [[Chado_Organism_Module|organism module]]). This
column is mandatory; all features derive from a single organism.
+
column is mandatory if the feature derives from a single organism.
  
The name and uniquename columns allow features to be labelled. The name column is optional,
+
==Names of Features==
 +
 
 +
The ''name'' and ''uniquename'' columns allow features to be labelled. The ''name'' column is optional,
 
but it is recommended that all annotated features (as opposed to those that arise from purely
 
but it is recommended that all annotated features (as opposed to those that arise from purely
 
computational methods) have names. The name should be a simple, concise, human-friendly display
 
computational methods) have names. The name should be a simple, concise, human-friendly display
label (such as a gene or gene product symbol, as defined by the nomenclature rules governing the
+
label (such as a gene or gene product symbol, as defined by the nomenclature rules governing the
organism). User interface software (such as GBrowse[10] and Apollo[11]) can use the name column
+
organism). User interface software (such as [[GBrowse]] and [[Apollo]]) can use the ''name'' column
 
for labelling feature glyphs in user displays. Uniqueness of name within any particular organism
 
for labelling feature glyphs in user displays. Uniqueness of name within any particular organism
 
or genome project is a desirable characteristic, but is not enforced in the schema, since there are
 
or genome project is a desirable characteristic, but is not enforced in the schema, since there are
occasions where name clashes are unavoidable. In contrast, the uniquename column is required,
+
occasions where name clashes are unavoidable. In contrast, the ''uniquename'' column is required,
and guaranteed to be unique when taken in combination with organism id and type id; this is
+
and guaranteed to be unique when taken in combination with ''organism_id'' and ''type_id''; this is
enforced by a constraint in the relational schema. The uniquename may be human-friendly (for
+
enforced by a constraint in the relational schema. The unique name may be human-friendly (for
 
example, it can be the same as the name); however, it is not guaranteed to be so, and in general
 
example, it can be the same as the name); however, it is not guaranteed to be so, and in general
 
should not be displayed to the end user. Its use is mainly as an alternate unique key on the table.

should not be displayed to the end user. Its use is mainly as an alternate unique key on the table.
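
Because the combination of ''uniquename'', ''organism_id'' and ''type_id'' is unique, it can be used to look up a single feature. The sketch below assumes the standard organism table with ''genus'' and ''species'' columns and a Sequence Ontology cv named 'sequence'; the organism and the unique name shown are hypothetical placeholders.

<syntaxhighlight lang="sql">
-- Look up one feature by its alternate key (uniquename, organism, type).
-- 'Drosophila melanogaster' and 'FBgn0000001' are illustrative values only.
SELECT f.feature_id, f.name
FROM feature f
JOIN organism o ON o.organism_id = f.organism_id
JOIN cvterm t ON t.cvterm_id = f.type_id
JOIN cv ON cv.cv_id = t.cv_id
WHERE o.genus = 'Drosophila'
  AND o.species = 'melanogaster'
  AND cv.name = 'sequence'
  AND t.name = 'gene'
  AND f.uniquename = 'FBgn0000001';
</syntaxhighlight>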
  
The uniquename normally conforms to some naming rule these rules may vary across chado
+
The unique name normally conforms to some naming rule; these rules may vary across Chado
instances, but they should all guarantee the uniqueness of the uniquename, organism id, type id
+
instances, but they should all guarantee the uniqueness of the ''uniquename'', ''organism_id'', ''type_id''
 
triple.
 
triple.
  
  
 +
[[Image:Feature-tables.png]]
  
Feature synonyms
+
==Feature Synonyms==
 
+
  
 
In addition to having a name or symbol, it is common for features such as genes to have multiple
 
In addition to having a name or symbol, it is common for features such as genes to have multiple
synonyms or aliases. These synonyms may exist due to different publications referring to the same
+
synonyms or aliases. These synonyms may exist due to different publications referring to the same
gene with different symbols, or because one gene was once believed to be two or more separate
+
gene with different symbols, or because one gene was once believed to be two or more separate
genes. A common curation operation on genes[REF] is splitting and merging, which results in the
+
genes. A common curation operation on genes is splitting and merging, which results in the
 
creation of synonyms.
 
creation of synonyms.
  
This is modelled in Chado with a synonym table and a feature synonym linking table; thus
+
This is modelled in Chado with a [[#Table:_synonym|synonym]] table and a [[#Table:_feature_synonym|feature_synonym]] linking table; thus multiple features can potentially share the same synonym, and a single feature can have multiple synonyms. Use of a synonym in the literature is indicated with a ''pub_id'' foreign key referencing the [[Chado_Tables#Table:_pub|pub]] table (see [[Chado_Publication_Module|the publications module]]), indicating historical provenance for the use of a synonym.
multiple features can potentially share the same, and a single feature can be have multiple synonyms.
+
Use of a synonym in the literature is indicated with a pub id foreign key referencing the pub table
+
(described later in the section on publications module), indicating historical provenance for the use
+
of a synonym.
+
  
 +
Feature synonyms are found by joining to [[#Table:_feature_synonym|feature_synonym]] and [[Chado_Tables#Table:_synonym|synonym]]. For example, here is a query to find a gene by name or synonym:
  
 +
<syntaxhighlight lang="sql">
 +
select feature_id from feature
 +
where name = 'name of interest'
 +
union select feature_id
 +
from feature_synonym fs, synonym s
 +
where fs.synonym_id = s.synonym_id
 +
and s.name = 'name of interest'
 +
and fs.is_current;
 +
</syntaxhighlight>
  
Feature locations
 
  
 +
==Feature Locations==
  
Features can potentially be localized using a sequence coordinate system. A relative localization
+
Features can potentially be localized using a sequence coordinate system. A relative localization model is used, so all feature localizations must be relative to another feature. Some features such
model is used, so all feature localizations must be relative to another feature. Some features such
+
as those of type chromosome are not localized in sequence coordinates. Locations are stored in the [[#Table:_featureloc|featureloc]] table, also part of this sequence module. Other non-sequence oriented kinds of localization (such as physical localization from ''in situ'' experiments, or genetic localizations from linkage studies) are modelled outside the sequence module (for example, in the [[Chado_Expression_Module|expression module]] or [[Chado_Map_Module|map module]]).
as those of type chromosome are not localized in sequence coordinates. Locations are stored in the
+
featureloc table, also part of the sequence module. Other non-sequence oriented kinds of localization
+
(such as physical localization from in situ experiments, or genetic localizations from linkage studies)
+
are modelled outside the sequence module (for example, in the expression or map module).
+
  
A feature can have zero or more featurelocs, although it will typically have either one (for local-
+
A feature can have zero or more featurelocs, although it will typically have either one (for localized features for which the location is known) or zero (for unlocalized features such as chromosomes,
ized features for which the location is known) or zero (for unlocalized features such as chromosomes,
+
or for features for which the location is not yet known, such as a gene discovered using classical genetics techniques). Features with multiple featurelocs will be explained later.
or for features for which the location is not yet known, such as a gene discovered using classical
+
genetics techniques). Features with multiple featurelocs will be explained later.
+
  
A featureloc is an interval in interbase sequence coordinates (see figure), bounded by the fmin
+
A featureloc is an interval in interbase sequence coordinates (see figure), bounded by the ''fmin'' and ''fmax'' columns, each representing the lower and upper linear position of the boundary between
and fmax columns, each representing the lower and upper linear position of the boundary between
+
bases or base pairs, with directionality indicated by the ''strand'' column. Interbase coordinates were
bases or base pairs, with directionality indicated by the strand column. Interbase coordinates were
+
chosen over the more commonly used base-oriented coordinate system because they are more naturally amenable to the standard arithmetic operations that are typically performed upon sequence
chosen over the more commonly used base-oriented coordinate system because they are more nat-
+
coordinates. This leads to cleaner and more efficient database coding logic that is arguably less
urally amenable to the standard arithmetic operations that are typically performed upon sequence
+
coordinates. This leads to cleaner and more efficient database coding logic that is arguably less
+
 
prone to errors. Of course, interbase coordinates are typically transformed into the more common
 
prone to errors. Of course, interbase coordinates are typically transformed into the more common
 
base-oriented system used by BLAST reports and so forth prior to presentation to the end-user.
 
base-oriented system used by BLAST reports and so forth prior to presentation to the end-user.
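
The conversion from interbase to base-oriented coordinates is simple arithmetic: the 1-based start is ''fmin'' + 1, the 1-based end is ''fmax'', and the length is ''fmax'' - ''fmin''. For example, an interbase interval with ''fmin'' = 99 and ''fmax'' = 200 covers bases 100 to 200, a length of 101. A minimal sketch against the featureloc table (the output column aliases are illustrative names, not part of the schema):

<syntaxhighlight lang="sql">
-- Present interbase featurelocs in 1-based, inclusive base coordinates.
SELECT fl.feature_id,
       fl.srcfeature_id,
       fl.fmin + 1       AS start_1based,
       fl.fmax           AS end_1based,
       fl.fmax - fl.fmin AS span_length,
       fl.strand
FROM featureloc fl
WHERE fl.fmin IS NOT NULL
  AND fl.fmax IS NOT NULL;
</syntaxhighlight>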
  
The relational schema includes a constraint which ensures that fmin <= fmax is always true; any
+
The relational schema includes a constraint which ensures that ''fmin'' <= ''fmax'' is always true, and any

attempt to set the database in a state which violates this will flag an error.

attempt to set the database in a state which violates this will flag an error.
  
 
+
As mentioned previously, a featureloc must be localized relative to another feature, indicated
As mentioned previously, a featureloc must be localized relative to another feature, indicated
+
using the ''srcfeature_id'' foreign key column, referencing the [[#Table:_feature|feature]] table. There is nothing in the
using the srcfeature id foreign key column, referencing the feature table. There is nothing in the
+
 
schema prohibiting localization chains; for example, locating an exon relative to a contig that is
 
schema prohibiting localization chains; for example, locating an exon relative to a contig that is
itself localized relative to a chromosome (see figure). The majority of Chado database instances will
+
itself localized relative to a chromosome (see figure). The majority of Chado database instances will
 
not require this flexibility; features are typically located relative to chromosomes or chromosome

not require this flexibility; features are typically located relative to chromosomes or chromosome
 
arms. Nevertheless, the ability to store such localization networks or location graphs can be useful
 
arms. Nevertheless, the ability to store such localization networks or location graphs can be useful
for unfinished genomes or parts of genomes such as heterochromatin [REF], in which it is desirable
+
for unfinished genomes or parts of genomes such as [[wp:Heterochromatin|heterochromatin]], in which it is desirable
to locate features relative to stable contigs or scaffolds, which are themselves localized in an unstable
+
to locate features relative to stable contigs or scaffolds, which are themselves localized in an unstable
 
assembly to chromosomes or chromosome arms. Localization chains do not necessarily only span
 
assembly to chromosomes or chromosome arms. Localization chains do not necessarily only span
 
assemblies; protein domains may be localized relative to polypeptide features, themselves localized

assemblies; protein domains may be localized relative to polypeptide features, themselves localized
 
to a transcript (or to the genome, as is more common). Chains may also span sequence alignments.
 
to a transcript (or to the genome, as is more common). Chains may also span sequence alignments.
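
A localization chain of depth two, such as an exon located on a contig that is itself located on a chromosome, can be followed by joining featureloc to itself. The sketch below also projects the exon coordinates onto the chromosome. It assumes interbase coordinates throughout and treats any contig strand other than -1 as forward; the table and column aliases are illustrative.

<syntaxhighlight lang="sql">
-- Follow feature -> contig -> chromosome and project coordinates upward.
SELECT inner_loc.feature_id    AS feature_id,
       inner_loc.srcfeature_id AS contig_id,
       outer_loc.srcfeature_id AS chromosome_id,
       CASE WHEN outer_loc.strand = -1
            THEN outer_loc.fmax - inner_loc.fmax   -- contig is reversed
            ELSE outer_loc.fmin + inner_loc.fmin
       END AS chrom_fmin,
       CASE WHEN outer_loc.strand = -1
            THEN outer_loc.fmax - inner_loc.fmin
            ELSE outer_loc.fmin + inner_loc.fmax
       END AS chrom_fmax
FROM featureloc inner_loc
JOIN featureloc outer_loc
  ON outer_loc.feature_id = inner_loc.srcfeature_id;
</syntaxhighlight>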
  
We will now present a short formal treatment of the properties of these hierarchies of localization
+
 
 +
[[Image:Featureloc-example.png]]
 +
 
 +
 
 +
===The Feature Location Graph===
 +
 
 +
We will now present a short formal treatment of the properties of these hierarchies of localization
 
using graph theory. This treatment can be ignored for the purposes of understanding the basics
 
using graph theory. This treatment can be ignored for the purposes of understanding the basics
 
of the Chado schema; the end-user of the database will be entirely unaware of such technicalities.
 
of the Chado schema; the end-user of the database will be entirely unaware of such technicalities.
However, for the purposes of software engineering and ensuring interoperability between different
+
However, for the purposes of software engineering and ensuring interoperability between different
Chado database instances and different applications, formal treatments such as these are an essential
+
Chado database instances and different applications, formal treatments such as these are an essential
requirement for software specifications.
+
requirement for software specifications.
  
We can define a featureloc graph (LG) as being a set of vertices and edges, with each feature
+
We can define a featureloc graph (LG) as being a set of vertices and edges, with each feature
constituting a vertex, and each featureloc constituting an edge going from the parent feature id
+
constituting a vertex, and each featureloc constituting an edge going from the parent ''feature_id''
vertex to the srcfeature id vertex. The node is labelled with column values from the feature table,
+
vertex to the ''srcfeature_id'' vertex. The node is labeled with column values from the [[#Table:_feature|feature]] table,
and the edge is labelled with column values from the featureloc table. The LG is not allowed to
+
and the edge is labeled with column values from the [[#Table:_featureloc|featureloc]] table. The LG is not allowed to
contain cycles; it is a directed acyclic graph (DAG). This includes self-cycles: no feature may be
+
contain cycles; it is a {{GlossaryLink|DAG|directed acyclic graph (DAG)}}. This includes self-cycles: no feature may be
 
localized relative to itself.
 
localized relative to itself.
  
The roots of the LG are the features that do not have featureloc row typically chromosomes
+
The roots of the LG are the features that do not have featureloc rows, typically chromosomes
or chromosome arms, although LG roots may also be unassembled contigs, scaffolds or features for
+
or chromosome arms, although LG roots may also be unassembled contigs, scaffolds or features for
which sequence localization is not get known (such as genes discovered through classical genetics
+
which sequence localization is not yet known (such as genes discovered through classical genetics
techniques). The leaves of the LG are any features that are not present as a srcfeature id in any
+
techniques). The leaves of the LG are any features that are not present as a ''srcfeature_id'' in any
 
featureloc row; these are typically the bulk of features, such as genes, exons, matches and so on. The depth

featureloc row; these are typically the bulk of features, such as genes, exons, matches and so on. The depth
of a particular LG g, denoted D(g), is the maximum number of edges between any leaf-root pair.
+
of a particular LG ''g'', denoted ''D(g)'', is the maximum number of edges between any leaf-root pair.
 
As has been previously noted, many Chado instances will have LGs with a uniform depth of 1. Such LGs

As has been previously noted, many Chado instances will have LGs with a uniform depth of 1. Such LGs
 
are said to be simple and the features within them are said to be singletons. The maximum depth
 
are said to be simple and the features within them are said to be singletons. The maximum depth
of all LGs in a particular database instance i is denoted LGDmax(i).
+
of all LGs in a particular database instance i is denoted ''LGDmax(i)''.
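
The definitions of roots and leaves translate directly into queries. The following sketch uses only the feature and featureloc tables as described above. Note that a feature with no location at all, which also serves as no other feature's source, appears in both result sets.

<syntaxhighlight lang="sql">
-- Roots of the LG: features that have no featureloc row of their own.
SELECT f.feature_id, f.uniquename
FROM feature f
WHERE NOT EXISTS (
    SELECT 1 FROM featureloc fl WHERE fl.feature_id = f.feature_id
);

-- Leaves of the LG: features never used as a srcfeature_id.
SELECT f.feature_id, f.uniquename
FROM feature f
WHERE NOT EXISTS (
    SELECT 1 FROM featureloc fl WHERE fl.srcfeature_id = f.feature_id
);
</syntaxhighlight>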
 +
 
 +
 
 +
[[Image:Featureloc-graph-example.png]]
  
  
 
The schema does not constrain the maximum depth of the LG. This flexibility proves useful
 
The schema does not constrain the maximum depth of the LG. This flexibility proves useful
when applying Chado to the highly variable needs of multiple different genome projects; however,
+
when applying Chado to the highly variable needs of multiple different genome projects; however,
it can lead to efficiency problems when querying the database. It can also make it more difficult to
+
it can lead to efficiency problems when querying the database. It can also make it more difficult to
write software to interoperate with the database, as the software must take into account different
+
write software to interoperate with the database, as the software must take into account different
 
contingencies. We can solve this problem by collapsing the LG, in which a graph of arbitrary depth
 
contingencies. We can solve this problem by collapsing the LG, in which a graph of arbitrary depth
 
is flattened to a depth of 1, transforming or projecting featurelocs onto the root features (typically
 
is flattened to a depth of 1, transforming or projecting featurelocs onto the root features (typically
Line 188: Line 228:
 
and additional redundant featurelocs between leaf and root features are added to the database.
 
and additional redundant featurelocs between leaf and root features are added to the database.
 
These new featurelocs are known as inferred featurelocs. In the schema inferred featurelocs are
 
These new featurelocs are known as inferred featurelocs. In the schema inferred featurelocs are
differentiated from direct featurelocs using the locgroup column. Direct (non-inferred) localizations
+
differentiated from direct featurelocs using the locgroup column. Direct (non-inferred) localizations
 
are indicated by the locgroup column taking value 0, and transitive localizations are indicated by
 
are indicated by the locgroup column taking value 0, and transitive localizations are indicated by
this column having value > 0.
+
this column having value > 0.
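
As an illustration of the ''locgroup'' convention, the query below labels each location as direct (''locgroup'' equal to 0) or inferred (''locgroup'' greater than 0). The ''loc_kind'' output column is an illustrative name, not part of the schema.

<syntaxhighlight lang="sql">
-- Distinguish direct featurelocs from inferred (projected) ones.
SELECT fl.feature_id,
       fl.srcfeature_id,
       fl.fmin,
       fl.fmax,
       fl.locgroup,
       CASE WHEN fl.locgroup = 0 THEN 'direct' ELSE 'inferred' END AS loc_kind
FROM featureloc fl
ORDER BY fl.feature_id, fl.locgroup;
</syntaxhighlight>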
  
The terminology used above can be used to define specifications for applications intended to
+
The terminology used above can be used to define specifications for applications intended to
interoperate with the database. Feature location pairs Certain kinds of features have paired loca-
+
interoperate with the database. Certain kinds of features have paired locations. These include hits and high-scoring pairs (HSPs) coming from sequence search programs
tions. These include hits and high-scoring- pairs (HSPs) coming from sequence search programs
+
 
such as BLAST, and syntenic chromosomal regions. These kinds of features have two featurelocs
 
such as BLAST, and syntenic chromosomal regions. These kinds of features have two featurelocs
 
(in contrast to the usual 1) one on the query feature and one on the subject (hit) feature. We
 
(in contrast to the usual 1) one on the query feature and one on the subject (hit) feature. We
differentiate the two featurelocs with the rank column. A rank of 0 indicates a location relative to
+
differentiate the two featurelocs with the ''rank'' column. A rank of 0 indicates a location relative to
 
the query (as is the default for most features), and a rank of 1 indicates a location relative to the
 
the query (as is the default for most features), and a rank of 1 indicates a location relative to the
 
subject (hit) feature.
 
subject (hit) feature.
  
For multiple alignments (e.g. CLUSTALW [REF] results), this scheme is extended to unbounded
+
For multiple alignments (e.g. [[bp:Clustalw|CLUSTALW]] results), this scheme is extended to unbounded
ranks [0..n], with arbitrary ordering. Alignments are stored in the residue info column. CIGAR
+
ranks [0..n], with arbitrary ordering. Alignments are stored in the ''residue_info'' column. [http://www.ensembl.org/info/software/Pdoc/ensembl/modules/Bio/EnsEMBL/Utils/CigarString.html CIGAR]
format[REF] is used for pairwise alignments.
+
format is used for pairwise alignments.
  
Multiple featurelocs may also be required for features of type sequence variant (SO:0000109),
+
Multiple featurelocs may also be required for features of type "sequence variant" (SO:0000109),
indicating points or extents which vary between reference and non- reference sequences. From a
+
indicating points or extents which vary between reference and non-reference sequences. From a
 
modelling standpoint, variants are conceptually similar to alignments; with variants we are noting a
 
modelling standpoint, variants are conceptually similar to alignments; with variants we are noting a
difference as opposed to a similarity. Here a rank of zero indicates the wild-type (or reference) fea-
+
difference as opposed to a similarity. Here a rank of zero indicates the wild-type (or reference) feature and a rank of one or more indicates the variant (or non-reference) feature, with the ''residue_info''
ture and a rank of one or more indicates the variant (or non-reference) feature, with the residue info
+
column representing the sequence on wild-type and variant. A featureloc is uniquely identified by the ''feature_id, rank, locgroup'' triple. This means that no feature can have more than one
column representing the sequence on wild-type and variant. [?figure ] A featureloc is uniquely iden-
+
featureloc with the same rank and locgroup. In other words, rank and locgroup uniquely identify a featureloc for any particular feature.
tified by the [feature id, rank, locgroup] triple. This means that no feature can have more than one
+
featureloc with the same rank and locgroup. In other words, rank and locgroup uniquely identify
+
a featureloc for any particular feature.
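The uniqueness rule can be sketched with a cut-down featureloc table. This is a minimal sketch, assuming an in-memory SQLite database as a stand-in for a real Chado instance; the table below keeps only a few columns, and the ids (100, 1, 2) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE featureloc (
        feature_id    INTEGER NOT NULL,
        srcfeature_id INTEGER,
        fmin          INTEGER,
        fmax          INTEGER,
        strand        INTEGER,
        locgroup      INTEGER NOT NULL DEFAULT 0,
        rank          INTEGER NOT NULL DEFAULT 0,
        UNIQUE (feature_id, locgroup, rank)
    )
""")

INSERT = "INSERT INTO featureloc VALUES (?, ?, ?, ?, ?, ?, ?)"
HSP, QUERY, SUBJECT = 100, 1, 2  # hypothetical feature ids

# One HSP with two locations: rank 0 on the query, rank 1 on the subject.
conn.execute(INSERT, (HSP, QUERY, 10, 60, 1, 0, 0))
conn.execute(INSERT, (HSP, SUBJECT, 200, 250, 1, 0, 1))


def try_insert(row):
    """Return True if the row was stored, False if the constraint rejected it."""
    try:
        conn.execute(INSERT, row)
        return True
    except sqlite3.IntegrityError:
        return False


# A second location with rank 0 and locgroup 0 for the same feature is refused.
duplicate_accepted = try_insert((HSP, SUBJECT, 300, 350, 1, 0, 0))
stored = conn.execute("SELECT COUNT(*) FROM featureloc").fetchone()[0]
print(duplicate_accepted, stored)
```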
+
  
 +
===Feature Coordinates===
  
 +
Features are located relative to other features using the [[#Table:_featureloc|featureloc]] table rows. Features can be located on more than one sequence. For example, a BLAST hit HSP can be a feature of both the query and target sequences. To locate a feature, create a  [[#Table:_featureloc|featureloc]]  record with:
  
Difference between the chado location model and other schemas
+
* ''srcfeature_id'' = the id of the sequence on which the feature is being located
 +
* ''feature_id'' = the id of the feature being located
 +
* ''strand'' is 1 for the positive strand, -1 for the negative, and 0 for both or indifferent.
 +
* ''fmin, fmax'' – the minimum and maximum coordinates of the interval
 +
* ''is_fmin_partial, is_fmax_partial'' = true if needed to indicate that the sequence is incomplete (e.g. for ESTs or EST assemblies which are known to not go all the way to the 3’ or 5’ end.)
 +
* ''phase''  = 0, 1, or 2 – denotes phase of first base pair in a nucleotide feature with respect to a source protein, or the offset of the first nucleotide in its codon.
 +
* ''rank, locgroup'' – these are used to organize groups of feature locations and can be ignored in simple cases (the details are discussed below).
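The fields listed above can be filled in from a GFF-style record roughly as follows. This is a sketch only: it assumes that ''fmin'' and ''fmax'' are interbase coordinates (so a 1-based inclusive GFF start becomes <code>start - 1</code>), and the function name <code>gff_to_featureloc</code> is hypothetical.

```python
def gff_to_featureloc(feature_id, srcfeature_id, start, end, strand=".", phase="."):
    """Build a featureloc-like dict from 1-based, inclusive GFF coordinates.

    Assumes interbase fmin/fmax: fmin = start - 1, fmax = end.
    """
    if start > end:
        raise ValueError("GFF start must not exceed end")
    return {
        "feature_id": feature_id,
        "srcfeature_id": srcfeature_id,
        "fmin": start - 1,
        "fmax": end,
        "is_fmin_partial": False,
        "is_fmax_partial": False,
        # 1 = positive strand, -1 = negative, 0 = both or indifferent
        "strand": {"+": 1, "-": -1}.get(strand, 0),
        "phase": None if phase == "." else int(phase),
        # Simple case: a single direct location
        "rank": 0,
        "locgroup": 0,
    }


loc = gff_to_featureloc(feature_id=7, srcfeature_id=1, start=914, end=1063,
                        strand="+", phase="0")
print(loc["fmin"], loc["fmax"], loc["strand"], loc["phase"])
```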
  
  
There is a crucial difference between the Chado location model and the sequence location model
+
====Multiple Locations for a Feature====
used in other schemas, such as GFF, GenBank, BioSQL, BioPerl, etc.
+
  
First, Chado is the only model to use the concept of rank and locgroup. Second, and perhaps
+
The ability to have multiple locations for a feature has many uses. For example one can locate a SNP, exon, or protein motif on the genome, on a transcript, and on a protein. A region of similarity between two sequences (HSP) can be located on both of them, so if either is viewed the “hit” is visible.
more important, all these other models allow discontiguous locations (also known as split locations).
+
These will be familiar to anyone who has inspected GenBank annotated DNA records for an or-
+
ganism that has introns within the transcripts; the transcript location is modelled as a sequence of
+
non-contiguous intervals on the genome. The interval represents the location of an exon.
+
  
  
 +
===Difference Between the Chado Location Model and Other Schemas===
 +
 +
There is a crucial difference between the Chado location model and the sequence location model
 +
used in other schemas, such as [[GFF]], GenBank, [http://biosql.org BioSQL], or [http://bioperl.org BioPerl].
 +
 +
First, Chado is the only model to use the concept of rank and locgroup. Second, and perhaps
 +
more important, all these other models allow discontiguous locations (also known as "split locations").
 +
These will be familiar to anyone who has inspected GenBank annotated DNA records for an organism that has introns within the transcripts; the transcript location is modelled as a sequence of
 +
non-contiguous intervals on the genome. The interval represents the location of an exon. For example:
 +
 +
            /gene="Acph"
 +
    CDS    join(914..1063, 1143..1241, 1297..1536, 1605..2054,
 +
                  2667..2925, 3063..3172)
  
Although Chado allows a feature to have multiple locations, this is only with variable rank and
+
Although Chado allows a feature to have multiple locations, this is only with variable ''rank'' and
locgroup this is enforced by a uniqueness constraint in the relational schema. We made a conscious
+
''locgroup'' and this is enforced by a uniqueness constraint in the relational schema. We made a conscious
decision to avoid discontiguous locations, because the extra degree of freedom this affords results
+
decision to avoid discontiguous locations, because the extra degree of freedom this affords results
 
in either redundancies or ambiguities. Redundancies arise when exons are stored in addition to a
 
in either redundancies or ambiguities. Redundancies arise when exons are stored in addition to a
 
discontiguous transcript, and ambiguities arise by virtue of the fact that explicit representation of
 
discontiguous transcript, and ambiguities arise by virtue of the fact that explicit representation of
Line 241: Line 293:
 
with contiguous locations. For example, a transcript with a discontiguous location can be modelled
 
with contiguous locations. For example, a transcript with a discontiguous location can be modelled
 
as a collection of exons with contiguous featurelocs, and a transcript with a single contiguous
 
as a collection of exons with contiguous featurelocs, and a transcript with a single contiguous
featureloc representing the outer boundaries defined by the outermost exons.
+
featureloc representing the outer boundaries defined by the outermost exons.
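To make the decomposition concrete, the sketch below takes the GenBank-style <code>join(...)</code> expression from the Acph example and splits it into one contiguous interval per exon plus a single outer span for the transcript. The helper name <code>split_location</code> is hypothetical, and coordinates are kept exactly as written in the GenBank record (1-based, inclusive) rather than converted for storage.

```python
import re


def split_location(join_expr):
    """Return (transcript_span, exon_spans) for a GenBank join() expression.

    Each exon becomes its own contiguous interval; the transcript gets one
    contiguous interval bounded by the outermost exons.
    """
    exons = sorted((int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", join_expr))
    if not exons:
        raise ValueError("no intervals found")
    span = (exons[0][0], max(end for _, end in exons))
    return span, exons


acph_cds = ("join(914..1063, 1143..1241, 1297..1536, 1605..2054, "
            "2667..2925, 3063..3172)")
transcript_span, exon_spans = split_location(acph_cds)
print(transcript_span)
print(exon_spans)
```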
  
 +
==Feature Rank==
  
 +
The ''rank'' field is used when a feature has more than 1 location, otherwise the default rank value of 0 is used. Some features have two locations, for example BLAST hits and HSPs:  one location on the query, rank = 0, and one location on the subject, rank = 1.
  
Extensible feature properties
 
  
 +
==Extensible Feature Properties==
  
The feature table has a fairly limited set of columns for recording feature data. For example, there
+
The [[#Table:_feature|feature]] table has a fairly limited set of columns for recording feature data. For example, there
 
is no anticodon column for recording the RNA triplet for the adapter in a tRNA feature (all feature
 
is no anticodon column for recording the RNA triplet for the adapter in a tRNA feature (all feature
types, including tRNAs, are recorded as rows in the feature table). If we were to add columns such
+
types, including tRNAs, are recorded as rows in the [[#Table:_feature|feature]] table). If we were to add columns such
as anticodon then the number of columns in the table would become very large and difficult to
+
as anticodon then the number of columns in the table would become very large and difficult to
 
manage; most would end up being nullable (for example, anticodon does not apply to non-tRNA
 
manage; most would end up being nullable (for example, anticodon does not apply to non-tRNA
features). This is because different organisms, different types of feature and different projects have
+
features). This is because different organisms, different types of feature and different projects have
differing needs regarding what extra data should be attached to any one feature. How then are
+
differing needs regarding what extra data should be attached to any one feature. How then are
we to attach both biologically relevant and project specific data to features? Chado solves this by
+
we to attach both biologically relevant and project specific data to features?
using an extensible mechanism for attaching attribute- value pairs to features via the featureprop
+
 
table. The featureprop.type id foreign key column references a property in the Sequence Feature
+
Chado solves this by using an extensible mechanism for attaching attribute-value pairs to features via the [[#Table:_featureprop|featureprop]]
Property Ontology (SFPO)[url], distributed as part of Chado. The value text column stores the
+
table. The ''featureprop.type_id'' foreign key column references a property in the [http://www.sequenceontology.org/ Sequence Ontology]. The ''value'' text column stores the
value filler for that property. Sets or lists of values for any property can be stored in the featureprop
+
value filler for that property. Sets or lists of values for any property can be stored in the [[#Table:_featureprop|featureprop]]
table, differentiated by the value of the rank column. Provenance for the featureprop assignment
+
table, differentiated by the value of the ''rank'' column. Provenance for the [[#Table:_featureprop|featureprop]] assignment
is stored using the featureprop pub table in the publications module, described later, allowing
+
is stored using the [[Chado_Tables#Table:_featureprop_pub|featureprop_pub]] table in the [[Chado_Publication_Module|publications module]], allowing
 
multiple publications to be associated with any one assignment.
 
multiple publications to be associated with any one assignment.
  
Because featureprop values can be of an arbitrary size, they are modelled using a SQL TEXT
+
Because [[#Table:_featureprop|featureprop]] values can be of an arbitrary size, they are modelled using a SQL TEXT
type. This has some disadvantages from a query efficiency perspective.
+
type. This has some disadvantages from a query efficiency perspective.
  
 
Numeric values cannot be indexed correctly, and sorting the results of a query can only be done
 
Numeric values cannot be indexed correctly, and sorting the results of a query can only be done
 
via a SQL casting operation, or in software outside of the database management system, either of
 
via a SQL casting operation, or in software outside of the database management system, either of
 
which may result in poorer performance. This is one of several areas in Chado where performance
 
which may result in poorer performance. This is one of several areas in Chado where performance
has been traded in favour of a simpler, more abstract and generic model. Later on we will look at
+
has been traded in favour of a simpler, more abstract and generic model.
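The sorting problem is easy to reproduce. The following sketch uses an in-memory SQLite database purely so that it is self-contained; the cut-down <code>featureprop</code> table and the ''type_id'' value 42 (standing in for a numeric "score" property) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE featureprop (
        feature_id INTEGER NOT NULL,
        type_id    INTEGER NOT NULL,
        value      TEXT,
        rank       INTEGER NOT NULL DEFAULT 0
    )
""")
SCORE = 42  # hypothetical cvterm id for a numeric property
conn.executemany(
    "INSERT INTO featureprop (feature_id, type_id, value) VALUES (?, ?, ?)",
    [(1, SCORE, "9"), (2, SCORE, "10"), (3, SCORE, "100")],
)

# Sorting the TEXT column directly compares strings character by character.
as_text = [row[0] for row in conn.execute(
    "SELECT feature_id FROM featureprop WHERE type_id = ? ORDER BY value", (SCORE,))]

# A cast restores numeric order, at the cost of converting every row.
as_number = [row[0] for row in conn.execute(
    "SELECT feature_id FROM featureprop WHERE type_id = ? "
    "ORDER BY CAST(value AS INTEGER)", (SCORE,))]

print(as_text, as_number)
```

Without the cast, the value "9" sorts after "10" and "100", which is why numeric properties need either a cast or sorting in application code.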
strategies for offsetting some of these performance penalties.
+
  
[example table]
+
==Linking Features to External Databases==
  
 +
Public database identifiers are stored in the [[Chado_Tables#Table:_dbxref|dbxref]] table, which holds the database name, the accession number, and an optional version number. Note that this table holds accession numbers published internally by the Chado instance as well as by other databases. A feature can have a primary dbxref, which is linked directly from the [[#Table:_feature|feature]] table. It can also have additional secondary dbxrefs linked via ''feature_dbxref''. A feature need not have a primary dbxref; e.g. computed features may be considered “lightweight” and not assigned accession numbers. Some groups may wish to set up a trigger to automatically assign primary dbxrefs to features of types that are locally accessioned; a sample trigger is provided with the schema.
  
  
Feature annotations
+
==Feature Annotations==
  
 +
Detailed annotations, such as associations to [http://geneontology.org Gene Ontology (GO)] terms or [http://obofoundry Cell Ontology] terms, can be attached to features using the [[#Table:_feature_cvterm|feature_cvterm]] linking table. This allows multiple ontology terms to be associated with each feature.
  
Detailed annotations, such as associations to Gene Ontology[5] (GO) terms or Cell Ontology[12]
+
Provenance data can be attached with the  [[#Table:_feature_cvtermprop|feature_cvtermprop]] and [[#Table:_feature_cvterm_dbxref|feature_cvterm_dbxref]] higher-order linking tables. It is up to the curation policy of each individual Chado database instance to decide which kinds of features will be linked using [[#Table:_feature_cvterm|feature_cvterm]]. Some may link terms to gene features, others to the distinct gene products (processed RNAs and polypeptides) that are linked to the gene features.
terms, can be attached to features using the feature cvterm linking table. This allows multiple
+
ontology terms to be associated with each feature.
+
  
 +
Annotations for existing features can also go into the  [[#Table:_featureprop|featureprop table]] using the Chado feature_property ontology (defined in <code>chado/load/etc/feature_property.obo</code>) and the comment or description terms as appropriate. The purpose of the feature property ontology (and the related <code>chado/load/etc/genbank_feature_property.obo</code> file) is to capture terms that are likely to appear in [[GFF]] or GenBank sequence files. In theory there is no overlap between these ontologies and the Sequence Ontology.
  
 
+
==Relationships Between Features==
Provenance data can be attached with the feature cvtermprop and feature cvterm dbxref higher-
+
order linking tables. It is up to the curation policy of each individual Chado database instance to
+
decide which kinds of features will be linked using feature cvterm. Some may link terms to gene
+
features, others to the distinct gene products (processed RNAs and polypeptides) linked to the
+
gene features (see next section)
+
 
+
 
+
Relationships between features
+
 
+
  
 
Biological features are inter-related; exons are part of transcripts, transcripts are part of genes,
 
Biological features are inter-related; exons are part of transcripts, transcripts are part of genes,
 
and polypeptides are derived from messenger RNAs. Relationships between individual features
 
and polypeptides are derived from messenger RNAs. Relationships between individual features
are stored in the feature relationship table, which connects two features via the subject id and
+
are stored in the [[#Table:_feature_relationship|feature_relationship]] table, which connects two features via the ''subject_id'' and
object id columns (foreign keys referring to the feature table) and a type id (a foreign key referring
+
''object_id'' columns (foreign keys referring to the [[#Table:_feature|feature]] table) and a ''type_id'' (a foreign key referring
to a relationship type in an ontology, either SO[6], or the OBO relationship ontology, OBO-REL[13])
+
to a relationship type in an ontology, either [http://sequenceontology.org SO], or the [http://obofoundry.org/ro/ OBO relationship ontology, OBO-REL],
indicating the nature of the relationship between subject and object features. The core relationships
+
indicating the nature of the relationship between subject and object features.
between features are part-whole (part of) or temporal (derives from). ”Subject” and ”Object”
+
 
 +
The core relationships between features are part-whole (''part_of'') or temporal (''derives_from''). ''Subject'' and ''Object''
 
describes the linguistic role the two features play in a sentence describing the feature relationship.
 
describes the linguistic role the two features play in a sentence describing the feature relationship.
In English, many sentences follow a subject, predicate, object word order. To say that ”exons
+
In English, many sentences follow a subject, predicate, object syntax, and word order is important. To say that ”exons
 
are part of transcripts” is the correct way to describe a typical biological relationship. To say
 
are part of transcripts” is the correct way to describe a typical biological relationship. To say
 
”transcripts are part of exons” is either grammatically or biologically incorrect.
 
”transcripts are part of exons” is either grammatically or biologically incorrect.
  
We use this same terminology (which comes from RDF[REF]) again in the cv module. The
+
We use this same terminology (which comes from [http://www.w3.org/RDF/ RDF]) again in the [[Chado_CV_Module|cv module]]. The
collection of features and feature relationships can be considered as vertices and edges in a graph,
+
collection of features and feature relationships can be considered as vertices and edges in a graph, known as the Feature Graph (FG). Example feature graphs are shown above and in the [[Introduction_to_Chado|Introduction to Chado]].
known as the Feature Graph (FG). Some example feature graphs are shown [figure FEATURE-
+
GRAPH]. The FG is independent of the LG in general the FG and the LG should have no edges in
+
common if there is a featureloc connecting two features, then the addition of a feature relationship
+
between these same two features is redundant.
+
  
The FG is required in order to query the database for such things as alternately spliced genes,
+
The FG is independent of the LG and in general the FG and the LG should have no edges in
 +
common. If there is a featureloc connecting two features, then the addition of a feature relationship
 +
between these same two features is redundant. The FG is required in order to query the database for such things as alternately spliced genes,
 
exons shared between transcripts, etc.
 
exons shared between transcripts, etc.
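A small in-memory sketch of such queries is given below. The feature names and the tuple layout (subject, relationship type, object) are invented for illustration; in a real database the same questions would be asked of the feature_relationship table.

```python
from collections import defaultdict

# Hypothetical feature graph: one gene with two transcripts sharing an exon.
feature_type = {
    "gene1": "gene", "mRNA-A": "mRNA", "mRNA-B": "mRNA",
    "exon1": "exon", "exon2": "exon", "exon3": "exon",
}
relationships = [  # (subject, type, object)
    ("exon1", "part_of", "mRNA-A"),
    ("exon2", "part_of", "mRNA-A"),
    ("exon2", "part_of", "mRNA-B"),
    ("exon3", "part_of", "mRNA-B"),
    ("mRNA-A", "part_of", "gene1"),
    ("mRNA-B", "part_of", "gene1"),
]

wholes = defaultdict(set)  # subject -> objects it is part_of
parts = defaultdict(set)   # object  -> subjects that are part_of it
for subject, rel_type, obj in relationships:
    if rel_type == "part_of":
        wholes[subject].add(obj)
        parts[obj].add(subject)

# Exons that are part_of more than one transcript.
shared_exons = sorted(f for f, ws in wholes.items()
                      if feature_type[f] == "exon" and len(ws) > 1)

# Genes with more than one mRNA part, i.e. alternately spliced genes.
alt_spliced = sorted(
    g for g, ps in parts.items()
    if feature_type[g] == "gene"
    and sum(feature_type[p] == "mRNA" for p in ps) > 1
)

print(shared_exons, alt_spliced)
```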
  
Although the chado schema admits any FG, certain configurations are biologically meaningless,
+
Although the chado schema admits any FG, certain configurations are biologically meaningless,
and should not be used. The FG can be constrained by the Sequence Ontology. Standardized FG
+
and should not be used. The FG can be constrained by the [http://sequenceontology.org Sequence Ontology]. Standardized FG
structures are required for complex applications to be interoperable - this is discussed later on.
+
structures are required for complex applications to be interoperable.
  
 
Unlike the LG, the FG may be cyclic, although cycles in the FG are not common. The subset
 
Unlike the LG, the FG may be cyclic, although cycles in the FG are not common. The subset
Line 326: Line 370:
 
the FG connecting parts with wholes via part of must be acyclic.
 
the FG connecting parts with wholes via part of must be acyclic.
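One way an application might check this constraint is a depth-first search over the ''part_of'' edges only. The sketch below is an assumption about how such a check could look, not a tool shipped with Chado; relationships are given as (subject, type, object) tuples with invented names.

```python
def part_of_is_acyclic(relationships):
    """Return True if the part_of subgraph contains no cycle."""
    graph = {}
    for subject, rel_type, obj in relationships:
        if rel_type == "part_of":
            graph.setdefault(subject, set()).add(obj)
            graph.setdefault(obj, set())

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = dict.fromkeys(graph, WHITE)

    def visit(node):
        state[node] = GREY
        for nxt in graph[node]:
            if state[nxt] == GREY:
                return False  # back edge: a cycle
            if state[nxt] == WHITE and not visit(nxt):
                return False
        state[node] = BLACK
        return True

    return all(visit(n) for n in graph if state[n] == WHITE)


ok_graph = [
    ("exon1", "part_of", "mRNA1"),
    ("mRNA1", "part_of", "gene1"),
    ("protein1", "derives_from", "mRNA1"),  # ignored: not part_of
]
bad_graph = ok_graph + [("gene1", "part_of", "exon1")]
print(part_of_is_acyclic(ok_graph), part_of_is_acyclic(bad_graph))
```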
  
 +
==Compliance==
  
Canonical gene models
+
{{NeedsEditing}}
  
 
+
''This section is incomplete and still in progress.''
Regulatory regions
+
 
+
 
+
Sequence variants
+
 
+
 
+
Feature example
+
 
+
 
+
[Diagram showing an example that puts this all together]
+
 
+
 
+
 
+
  canonical-gene-model
+
  The "central dogma" gene model - gene makes mRNA makes polypeptide
+
 
+
  For many people this may be the only data they store in Chado. The
+
  typical protein coding gene model consists of a gene, one or more
+
  mRNAs, one or more exons, and at least one polypeptide.
+
 
+
  Alternately spliced genes have a 1 to many relation between gene and
+
  mRNA. Exons can be part_of more than one mRNA. No two distinct exon
+
  rows should have the exact same featureloc coordinates (this indicates
+
  they are the same exon).
+
 
+
  Every [1]feature must have a [2]featureloc with rank=0 and locgroup=0.
+
  The value of the srcfeature_id column should be identical (i.e. all
+
  features are located relative to the same feature), except in rare
+
  circumstances such as when a feature crosses two contigs. Software is
+
  not guaranteed to support this. The srcfeature_id can point to a
+
  [3]contig, a [4]chromosome, a [5]chromosome_arm or other appropriate
+
  assembly unit.
+
 
+
This scenario involves rows in the following tables:
+
 
+
  table    type_id      number  comments
+
  feature  SO:gene      1
+
  feature  SO:mRNA
+
  feature  exon
+
  feature  polypeptide
+
 
+
  Tool: apollo
+
  Status: supported
+
 
+
  Tool: gbrowse
+
  Status: supported
+
 
+
Example
+
 
+
  [.] Download:
+
 
+
  noncoding-gene
+
  Similar to [6]canonical-gene-model, except with noncoding-RNA
+
 
+
  Not all genes are protein-coding. Genes can code for tRNA, miRNA,
+
  snoRNA, etc. A noncoding gene model is identical to a
+
  [7]canonical-gene-model, with the following exceptions:
+
    * There is no polypeptide feature
+
    * Instead of an mRNA feature, there is a feature that is some other
+
      sub-type of [8]RNA
+
 
+
  Tool: apollo
+
  Status: supported
+
  Tool: gbrowse
+
  Status: supported
+
 
+
  pseudogene
+
  A pseudogene is a non-functional relic of a gene
+
  See [9]pseudogene. A pseudogene may look like an ordinary gene, and
+
  may even have discernible parts such as exons. It may sometimes be
+
  desirable to annotate the exon structure of a pseudogene - this can in
+
  principle be done using SO types such as [10]decayed_exon. In practice
+
  no-one is using Chado to do this. There are currently two practices:
+
    * pseudogenes are treated analogously to [11]noncoding-genes. That
+
      is, there are normal "gene" and "exon" features. However, in place
+
      of a subtype of RNA, there is a feature of type pseudogene. This
+
      practice is STRONGLY DISCOURAGED (it is not compliant with the
+
      relations in SO, it gives false counts to the number of real genes
+
      in the database). Note that this is the current default for
+
      FlyBase.
+
    * Pseudogenes are normal [12]singleton-features. There is no
+
      annotation of exon structure. This practice is encouraged. If at a
+
      later date it becomes desirable to annotate the exon structure of
+
      a pseudogene, it will be compatible with this.
+
 
+
  Tool: apollo
+
  Status: unclear
+
  Apollo by default treats pseudogenes using the first method, above. It
+
  may also be possible to configure it to the second, singleton, method.
+
  Annotating the exon structure of pseudogenes the correct way has not
+
  yet been attempted to our knowledge.
+
 
+
  singleton-feature
+
  Many types of features are singletons - that is they are not related
+
  to other features through feature_relationships. Storage of these is
+
  basic and as one may expect
+
  Singleton features present no major problems. Unlike genes, which
+
  typically have parts (with the parts having subparts), singletons do
+
  not form feature graphs (or rather, they form feature graphs
+
  consisting of single nodes). Singleton features are located relative
+
  to other features (usually the genome, but one can have singletons
+
  that are located relative to other features - this may not be
+
  supported by all applications)
+
  Tool: gbrowse
+
  Status: supported
+
  Tool: apollo
+
  Status: supported
+
  Apollo supports singletons provided they are located relative to the
+
  genome (singletons located relative to other features will be
+
  ignored). It may be necessary to configure apollo to make the feature
+
  type "1-level"
+
 
+
  dicistronic-gene
+
  A dicistronic gene is a gene with a mRNA that codes for two distinct
+
  non-overlapping CDSs
+
 
+
  Dicistronic genes (see for example, the dmel Adh and Adhr genes) have
+
  totally distinct gene products deriving from the same transcript. To
+
  confuse matters, the two polypeptides are commonly referred to as being
+
  derived from two distinct genes (e.g. Adh and Adhr). The entire
+
  genomic region comprising the transcript (e.g. Adh+Adhr) that includes
+
  both CDSs is referred to as the [13]gene_cassette. In a database such
+
  as FlyBase, there are 3 gene IDs stored in the database - one for each
+
  of the two non-overlapping genes, and one for the gene cassette
+
 
+
  Dicistronic genes make it difficult to have a formal definition of
+
  gene that corresponds nicely with how biologists use the term.
+
 
+
  There are currently two proposals for handling dicistronic genes. The
+
  first is a hack and introduces redundancy, but works well with
+
  existing software and tools. The second is preferred from a modeling
+
  standpoint, but introduces a lot of complexity to software
+
 
+
  operon
+
  Bacterial genes are often transcribed in groups; e.g. LacZ
+
  There are many similarities with [14]dicistronic-genes here.
+
 
+
  trans-spliced-gene
+
  A trans-spliced gene has one or more transcripts in which that
+
  transcript may be spliced together from different parts of the genome
+
 
+
  A trans spliced transcript is spliced from exons coming from different
+
  parts of the genome. The distance between each trans spliced part may
+
  be large, or it may be in the same location on the opposite strand.
+
 
+
  Most C elegans genes have a trans spliced leader sequence. This is
+
  different from the trans splicing involved in dmel , where we observe
+
  what appears to be two transcripts on separate strands (both
+
  containing coding sequence) joining together in a single functional
+
  transcript
+
 
+
  There are two proposals for dealing with this. One treats the trans
+
  spliced transcript as a single transcript, with exons coming from
+
  different locations. The other treats the trans spliced transcript as
+
  a mature transcript created from two distinct primary transcripts.
+
  Note that these proposals focus on the dmel example. A solution for
+
  the C elegans example is not proposed (not sure if we even need one?)
+
  We treat this as an ordinary gene model, but relax our rules for exon
+
  locations in a transcript
+
  For example, for the canonical Dmel trans spliced gene, we would allow
+
  transcripts to have exons on different strands. Note that in Chado,
+
  exon ordering comes from [15]feature_relationship.rank (between exon
+
  and transcript), NOT from the featureloc of the exon. Chado has no
+
  problem with this. However, some software may make assumptions that
+
  all exons are on the same strand, or may try to order exons by their
+
  location to get a transcript sequence. This software will have
+
  unintended consequences with trans spliced genes modeled using this
+
  proposal
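The difference between the two orderings can be seen in a small sketch. The exon records below are invented, and each dict merges the feature_relationship rank with the exon's featureloc columns for brevity; coordinates are assumed to be interbase.

```python
# Hypothetical trans-spliced transcript: the first exon lies on the plus
# strand, the remaining exons on the minus strand elsewhere on the genome.
exon_parts = [
    {"exon": "e2", "rank": 1, "strand": -1, "fmin": 1000, "fmax": 1300},
    {"exon": "e1", "rank": 0, "strand": 1, "fmin": 5000, "fmax": 5200},
    {"exon": "e3", "rank": 2, "strand": -1, "fmin": 400, "fmax": 700},
]

# Order taken from feature_relationship.rank (exon part_of transcript).
by_rank = [e["exon"] for e in sorted(exon_parts, key=lambda e: e["rank"])]

# Order a tool would get if it sorted by genomic position instead.
by_position = [e["exon"] for e in sorted(exon_parts, key=lambda e: e["fmin"])]

strands = {e["strand"] for e in exon_parts}
print(by_rank, by_position, len(strands) > 1)
```

Software that sorts by position, or that assumes a single strand per transcript, would assemble these exons in the wrong order.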
+
  Tool: apollo
+
  Status: unclear
+
  apollo may accidentally scramble the order of exons. Need to check
+
  Tool: gbrowse
+
  Status: unclear
+
  Not sure.
+
  We would introduce extra transcripts, and have relations between the
+
  transcripts. Only the mature, spliced, transcript would have a
+
  relation to the polypeptide
+
  This may model the biology better. However, it introduces a major
+
  departure from the [16]canonical-gene-model. For this reason this
+
  proposal is unlikely to be adopted
+
 
+
  gene-with-regulatory-elements
+
  regulatory elements may be implicitly or explicitly associated with a
+
  gene
+
 
+
  transposons
+
  transposons can be annotated as [17]singleton-features or as complex
+
  annotations
+
 
+
  A transposon may consist of various parts such as
+
  [18]long_terminal_repeats and gene models coding for genes like gag,
+
  pol, env. These parts may have all decayed over time. Transposon
+
  annotation typically ignores these subtleties as all that is usually
+
  required is a [19]singleton-feature of type
+
  [20]transposable_element_feature. In this case, there is no difficulty
+
 
+
  If one requires detailed transposon annotation then one is entering
+
  uncharted water as far as both Chado and annotation tools are
+
  concerned (which is why this scenario is marked as being under
+
  discussion). One option would be to treat each transposon part as
+
  distinct singletons, but this may be unsatisfactory as one may desire
+
  to have the appropriate part_of relations between the parts.
+
 
+
  P-element-insertions
+
 
+
  SNPs
+
 
+
  gene-with-implicit-features-manifested
+
  Some feature types such as introns are not normally manifested as rows
+
  in chado. They are normally derived on-the-fly from the gaps between
+
  consecutive exons. See for an example. Occasionally it may be
+
  desirable to store the introns as actual rows in the feature table - for
+
  example in a report database
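Deriving introns on the fly might look like the sketch below. It assumes the exons belong to one transcript on a single strand and that coordinates are interbase (fmin, fmax) pairs; the function name <code>infer_introns</code> is hypothetical.

```python
def infer_introns(exons):
    """Return the gaps between consecutive exons as (fmin, fmax) pairs.

    `exons` is an iterable of interbase (fmin, fmax) intervals on the same
    source feature and strand.
    """
    ordered = sorted(exons)
    return [
        (left_fmax, right_fmin)
        for (_, left_fmax), (right_fmin, _) in zip(ordered, ordered[1:])
        if right_fmin > left_fmax  # skip abutting or overlapping exons
    ]


introns = infer_introns([(300, 400), (100, 200), (550, 600)])
print(introns)
```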
+
 
+
  feature-localization
+
  All features with sequence annotation should be localized using
+
  featureloc
+
 
+
  localized features must have a [21]featureloc with rank=0 and
+
  locgroup=0. This is the primary location of the feature. The location
+
  always indicates the boundaries of the feature. If the feature is
+
  composed of distinct subfeatures (e.g. a transcript composed of
+
  exons), then it is NOT permitted to use multiple featurelocs to
+
  indicate this. Instead, there must be rows for the subfeatures, each
+
  with their own featureloc
+
 
+
  In a feature graph (i.e. a group of features connected via
+
  [22]feature_relationship rows), all features will typically be
+
  localized relative to the same source feature (i.e. they will all have
+
  the same value for featureloc.srcfeature_id)
+
 
+
  features are typically localized to some kind of genomic or assembly
+
  feature, but chado does not constrain you to using only this. For
+
  example, localizing features relative to a transcript or polypeptide
+
  or even exon is permitted, but unusual practices will most likely not
+
  be recognized by most software
+
 
+
  feature-localization-to-contigs-in-assembly
+
  In an assembled genome, it is common to locate relative to the
+
  top-level assembly units (e.g. chromosomes). However, it is also
+
  permissible to locate to smaller units such as [23]contigs or
+
  [24]golden_path_units
+
 
+
  If a genome assembly is not stable, it is common to locate relative to
+
  assembly units such as [25]contigs. These contigs may then be
+
  localized relative to the top-level assembly units. This is known in
+
  chado terms as a location graph.
+
 
+
  We discuss here location graphs of depth 2. See also
+
  [26]n-level-assemblies. This scenario is often invisible to software
+
  interoperating with Chado. The software is free to only look at the
+
  main features and the contig-level feature and ignore the top-level
+
  assembly feature. It may sometimes be desirable to have software that
+
  can perform location transformations, mapping features from contigs to
+
  top-level units and back
+
  Tool: apollo
+
  Status: unclear
+
  apollo should be happy to treat contigs just as if they were top-level
+
  units such as chromosome arms. However, the user may have to explicitly
+
  provide contigs if location queries are desired. For example, apollo
+
  may retrieve nothing if the user asks for a certain range on
+
  chromosome 4, and the features are located relative to contigs which
+
  are themselves on chromosome 4.
+
  Tool: gbrowse
+
  Status: unclear
+
  Gbrowse may expect features to be located relative to top-level units
+
  such as chromosomes.
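
  The location transformation mentioned above (contig coordinates to
  top-level coordinates) can be sketched as follows. This is an
  illustrative sketch only, not part of the Chado software: the Loc type
  and project_to_parent function are hypothetical names, and the code
  assumes interbase fmin/fmax coordinates and a contig featureloc that
  spans the whole contig.

```python
from typing import NamedTuple


class Loc(NamedTuple):
    """Interbase location, mirroring featureloc fmin/fmax/strand (hypothetical helper)."""
    fmin: int
    fmax: int
    strand: int  # 1 or -1


def project_to_parent(child: Loc, contig_on_parent: Loc) -> Loc:
    """Map a location on a contig onto the unit the contig is placed on.

    Assumes the contig's own featureloc covers the whole contig, so that
    contig_on_parent.fmax - contig_on_parent.fmin equals the contig length.
    """
    if contig_on_parent.strand >= 0:
        offset = contig_on_parent.fmin
        return Loc(offset + child.fmin, offset + child.fmax, child.strand)
    # Contig placed on the reverse strand: coordinates flip end to end.
    end = contig_on_parent.fmax
    return Loc(end - child.fmax, end - child.fmin, -child.strand)


exon_on_contig = Loc(100, 250, 1)
print(project_to_parent(exon_on_contig, Loc(1000, 2000, 1)))
print(project_to_parent(exon_on_contig, Loc(1000, 2000, -1)))
```

  Applying the same function repeatedly walks a deeper location graph
  one level at a time.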
+
 
+
  redundant-localizations-to-different-assembly-levels
  Features can be located relative to both contigs and top-level
  assembly units.

  Chado allows redundant feature localization using
  [27]featureloc.locgroup>0. This allows a database to have primary
  locations for features relative to contigs, and secondary locations
  relative to top-level units such as chromosomes. The converse is also
  allowed.

  This scenario is discouraged unless the Chado db admin knows what they
  are doing. They must implement solutions to ensure that featurelocs
  with varying locgroup do not get out of sync. These solutions are not
  part of the standard Chado software suite. Nevertheless, this scenario
  may be useful for advanced users in certain circumstances.
  Tool: gbrowse
  Status: unclear
  Not clear if gbrowse uses locgroup in querying. If it constrains by
  locgroup, then this is essentially the same as
  [28]feature-localization-to-contigs-in-assembly.
  Tool: apollo
  Status: partial
  Not clear if apollo uses locgroup in querying. If it constrains by
  locgroup, then this is essentially the same as
  [29]feature-localization-to-contigs-in-assembly. Apollo will not
  preserve redundant featurelocs when writing back to the db. This could
  lead to the db getting out of sync.
+
 
+
  n-level-assemblies
  In theory it is possible (but rare) to have assemblies with variable
  depths, or with depths>2.
  This scenario is rare. If required, then Chado can deal with this -
  there is no theoretical limit to the depth of a location graph. One
  can have annotated features located relative to minicontigs which are
  located relative to supercontigs which are located relative to
  chromosomes. Most software that interoperates with Chado will not be
  able to deal with this, so this scenario is discouraged except by
  advanced users who have no other option.
+
 
+
  unlocalized-gene
  A gene without sequence-based localization.

  Many Chado instances are purely concerned with genome annotation - in
  these cases it would be strange to have genes or other features such
  as transcripts with no localization (i.e. no featurelocs). However,
  this scenario is actually common when Chado is used in a wider
  context. We may know of the existence of genes through non-sequence
  evidence such as genetics. When we have no sequence-based localization
  it is perfectly valid to have gene features with no featurelocs. When
  the time comes to create genome annotations for these, we just 'fill
  out' the gene feature by adding transcript and exon features.
  Tool: gbrowse
  Status: supported
  Gbrowse supports this scenario in that unlocalized features will be
  omitted from the genome viewer, which is appropriate.
  Tool: apollo
  Status: supported
  Apollo supports this scenario in that unlocalized features will be
  ignored, which is appropriate behaviour for a genome annotation tool.
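
  Finding unlocalized genes amounts to an outer join between feature and
  featureloc. The sketch below runs against an in-memory SQLite database
  with heavily reduced, hypothetical tables. In particular, the text
  column feature.type is an assumption made for brevity: real Chado
  types features through feature.type_id, which points at cvterm.

```python
import sqlite3


def unlocalized_genes(conn: sqlite3.Connection) -> list:
    """Return names of gene features that have no featureloc row."""
    sql = """
        SELECT f.name
        FROM feature f
        LEFT JOIN featureloc fl ON fl.feature_id = f.feature_id
        WHERE f.type = 'gene' AND fl.feature_id IS NULL
        ORDER BY f.name
    """
    return [row[0] for row in conn.execute(sql)]


# Hypothetical demo data: geneA is localized on chr4, geneB is known
# only from genetics and has no featureloc.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE featureloc (feature_id INTEGER, srcfeature_id INTEGER,
                             fmin INTEGER, fmax INTEGER);
    INSERT INTO feature VALUES (1, 'chr4', 'chromosome'),
                               (2, 'geneA', 'gene'),
                               (3, 'geneB', 'gene');
    INSERT INTO featureloc VALUES (2, 1, 500, 900);
""")
print(unlocalized_genes(conn))
```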
+
 
+
References
+
 
+
  1. http://gmod.sourceforge.net/schema/doc/default_schema.html#feature
+
  2. http://gmod.sourceforge.net/schema/doc/default_schema.html#featureloc
+
  3. http://song.sourceforge.net/#contig
+
  4. http://song.sourceforge.net/#chromosome
+
  5. http://song.sourceforge.net/#chromosome_arm
+
  6. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#canonical-gene-model
+
  7. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#canonical-gene-model
+
  8. http://song.sourceforge.net/#RNA
+
  9. http://song.sourceforge.net/#pseudogene
+
  10. http://song.sourceforge.net/#decayed_exon
+
  11. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#noncoding-gene
+
  12. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#singleton-feature
+
  13. http://song.sourceforge.net/#gene_cassette
+
  14. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#dicistronic-gene
+
  15. http://gmod.sourceforge.net/schema/doc/default_schema.html#feature_relationship
+
  16. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#canonical-gene-model
+
  17. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#singleton-feature
+
  18. http://song.sourceforge.net/#long_terminal_repeat
+
  19. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#singleton-feature
+
  20. http://song.sourceforge.net/#transposable_element_feature
+
  21. http://gmod.sourceforge.net/schema/doc/default_schema.html#featureloc
+
  22. http://gmod.sourceforge.net/schema/doc/default_schema.html#feature_relationship
+
  23. http://song.sourceforge.net/#contig
+
  24. http://song.sourceforge.net/#golden_path_unit
+
  25. http://song.sourceforge.net/#contig
+
  26. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#n-level-assemblies
+
  27. http://gmod.sourceforge.net/schema/doc/default_schema.html#featureloc
+
  28. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#feature-localization-to-contigs-in-assembly
+
  29. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#feature-localization-to-contigs-in-assembly
+
 
+
 
+
4.2 Best Practices

Chado is a generic schema, which means anyone writing software to query or write to Chado (either
middleware or applications) should be aware of the different ways in which data can be stored.
We want to strike a balance between flexibility and extensibility on the one hand, and strong
typing and rigor on the other. We want to avoid the situation we have with GenBank entries where
there are a dozen ways of representing a gene model, but we need to be able to cope with the
constant surprises biology throws at us in an attempt to confound our nice computable models.
+
  
 
Chado uses a layered model - this is tried and tested in software engineering. Some generic
software can be targeted at the lower layers and be guaranteed to work no matter what. Other
more specific software needs a more tightly defined rigorous model and should be targeted at the
upper layers.

We require validation software and more formal or computable descriptions of these layers and
policies - for now natural language descriptions will have to suffice.
  
 +
===Chado Compliance Layers===

Proposal for levels of compliance.

====Level 0: Relational Schema====

Level 0 conformance means that the schema is adhered to. This is enforced by the DBMS.

====Level 1: Ontologies====

Level 1 conformance is minimal conformance to [http://sequenceontology.org SO] - all feature.types must be SO terms, and all
feature_relationship.types must be SO relationship types.
  
 
+
====Level 2: Graph====

Level 2 conformance is graph conformance to SO - every feature relationship between features of
types X and Y must correspond to a relationship of that type in SO; for example, an mRNA can be
part of a gene, but an mRNA cannot be part of a golden path region. '''[more detailed/formal explanation
to come].''' In practice Level 2 conformance may be undesirable; we may need to make modifications
to SO.
  
 
Orthogonal to these layers are various additional policy decisions. Some of these are more
tolerant of non-conformance than others. (There is also some overlap with levels 1 and 2.)
  
  
 +
===Examples: Current implementations===

This section describes details of how different sites are using Chado. '''This is likely outdated information.'''

[http://tigr.org TIGR]: Currently at level 0 conformance, though most (if not all) of the terms being used have
an obvious counterpart in SO. Therefore these "TIGR Ontology" terms are used in the answers to
the SO-related questions that appear below. We plan on updating our terms with SO terms very
soon.
  
====SO terms used for Standard Central-dogma Gene Model====

[http://flybase.org FlyBase]: gene mRNA exon protein [other types are derivable].

[http://tigr.org TIGR]: gene transcript CDS exon protein [though the strict answer for any of these SO
questions is "none" since we do not yet meet level 1 conformance].

NOTE: we should be using 'polypeptide' instead of 'protein'. For now, software should be
tolerant of both these uses.
  
====SO terms Used for Storing Alignments====

[http://flybase.org FlyBase]: match

[http://tigr.org TIGR]: match

NOTE: we want to use the new more specific SO types match_set and match_part for hits and
HSPs respectively. For now, software should be tolerant of either usage.

[http://tigr.org TIGR]: We've also extended the model for storing pairwise alignments to store multiple alignments. Each member of the alignment is featureloced to the 'match' feature. We've used this
representation to store paralogous/orthologous gene families.
  
 +
====feature_relationship Types====

[http://flybase.org FlyBase]: partof (for mRNA to gene and exon to mRNA) producedby (for protein to mRNA)

[http://tigr.org TIGR]: part of (gene-assembly, exon-transcript, assembly-supercontig) produced by (protein-CDS, CDS-transcript, transcript-gene)

see note above

NOTE: the main difference between FB and TIGR here is that TIGR introduces an intermediate
CDS feature between mRNA and protein.
  
 +
====featureloc Policy====

[http://flybase.org FlyBase]: all constituent parts of a central dogma gene model are located relative to the same srcfeature
(the chromosome arm). No redundant locations (i.e. featureloc.locgroup > 0) are used.

[http://tigr.org TIGR]: Redundant locations are used and indicated with featureloc.locgroup > 0.

NOTE: we want to allow some flexibility with this policy. We believe that the rule that linked constituent parts
are located relative to the same feature should always be followed. This can be stated more formally
as:
  
  
   IF  X is linked to Y via feature_relationship
   AND X is located relative to Z via featureloc.srcfeature_id
   THEN Y must also be located relative to Z via featureloc.srcfeature_id
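
The rule above lends itself to an automated check. The following is a minimal sketch against an in-memory SQLite database, not a tool shipped with Chado: the function names are hypothetical, and the two tables are cut down to the columns the check needs (real Chado tables carry more columns and constraints). It compares only primary locations, taken here to be featureloc rows with locgroup = 0.

```python
import sqlite3


def find_srcfeature_mismatches(conn: sqlite3.Connection) -> list:
    """Return (subject_id, object_id, subject_src, object_src) rows that break the rule."""
    sql = """
        SELECT fr.subject_id, fr.object_id, sl.srcfeature_id, ol.srcfeature_id
        FROM feature_relationship fr
        JOIN featureloc sl ON sl.feature_id = fr.subject_id AND sl.locgroup = 0
        JOIN featureloc ol ON ol.feature_id = fr.object_id AND ol.locgroup = 0
        WHERE sl.srcfeature_id <> ol.srcfeature_id
        ORDER BY fr.subject_id, fr.object_id
    """
    return conn.execute(sql).fetchall()


def build_demo_db() -> sqlite3.Connection:
    """Create hypothetical demo data with one deliberate violation."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE feature_relationship (subject_id INTEGER, object_id INTEGER);
        CREATE TABLE featureloc (feature_id INTEGER, srcfeature_id INTEGER,
                                 locgroup INTEGER DEFAULT 0);
        -- gene 1 and mRNA 2 sit on source feature 10; exon 3 sits on 11.
        INSERT INTO feature_relationship VALUES (2, 1), (3, 2);
        INSERT INTO featureloc VALUES (1, 10, 0), (2, 10, 0), (3, 11, 0);
    """)
    return conn


print(find_srcfeature_mismatches(build_demo_db()))
```

Features without any featureloc (see the unlocalized-gene scenario) drop out of the inner joins and are therefore not reported.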
  
  
[http://tigr.org TIGR]: We've followed this policy in adding a featureloc between the protein and genomic
contig in our databases (such a featureloc does not appear in the Chado usage documents). This
additional featureloc simplifies many queries, especially when looking at the genomic context of
'match' features associated with proteins.
  
We should also expect that the fmin/fmax boundaries of a feature be defined by the outermost
boundaries of the outermost constituent part features (this rule may require refinement when we
have promoters, enhancers and so on - but for now we don't).
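
As a small illustration of the boundary rule, the sketch below computes the expected fmin/fmax of a parent feature from its parts. The function name is hypothetical, and the parts are assumed to share one srcfeature and to be given as (fmin, fmax) pairs.

```python
def outer_bounds(part_locs):
    """Return the (fmin, fmax) pair spanning all constituent parts."""
    fmins, fmaxs = zip(*part_locs)
    return min(fmins), max(fmaxs)


# Three exons of one hypothetical gene, in arbitrary order.
exons = [(300, 400), (100, 200), (550, 700)]
print(outer_bounds(exons))
```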
  
As to what the srcfeature should be, it could be a contig, an assembly or a top-level locatable
feature such as a chromosome or chromosome arm. Software should be tolerant of different
choices here. Whilst it is generally best to locate relative to the topmost feature (i.e. the
arm/chromosome), sometimes this is not possible or desirable (e.g. low coverage, heterochromatin).
  
 +
====Non-central Dogma Gene Models====

[http://flybase.org FlyBase]: we store a lot of non-central dogma gene models; noncoding gene models and pseudogenes
[need to fill in more details here].

[http://tigr.org TIGR]: not many of these stored yet, save for a few pseudogenes and the occasional non-coding
ORF.
  
 +
====Other Features====

[http://flybase.org FlyBase]: the FlyBase implementation includes many other feature types, including polyA site and sequence variant [need to fill in details].

[http://tigr.org TIGR]: using 'SNP' in some databases.
+
  
 +
====Derivable Feature Types====

[http://flybase.org FlyBase]: derivable features (introns, UTRs, intergenic region) are not included. Feature typing is always
done to the most specific, non-derivable level. For example, we never use types "5 prime exon",
"dicistronic gene", "coding exon" as these are always inferrable. We always use type "gene" - the
specific type of gene is inferred from the child type (mRNA, tRNA, snRNA, etc.).

[http://tigr.org TIGR]: derivable features are not included. Currently not storing any tRNAs or snRNAs.
  
 
NOTE: whilst it is perfectly permissible to include redundant derivable features (useful for
warehouse-style querying), you should not write software that expects to find these if you want the
software to work on different Chado db instances.
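
Software that needs a derivable feature type can compute it instead of expecting stored rows. As a minimal sketch, introns can be derived as the gaps between consecutive exons of one transcript. The function name is hypothetical, and the exons are assumed to be (fmin, fmax) interbase pairs on the same srcfeature and strand.

```python
def derive_introns(exons):
    """Return intron (fmin, fmax) pairs as the gaps between consecutive exons."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_fmax), (next_fmin, _) in zip(ordered, ordered[1:]):
        # Abutting or overlapping exons leave no gap, hence no intron.
        if next_fmin > prev_fmax:
            introns.append((prev_fmax, next_fmin))
    return introns


print(derive_introns([(300, 400), (100, 200), (550, 700)]))
```

Note that this location-based derivation does not apply to trans-spliced transcripts, whose exon order comes from feature_relationship.rank.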
  
 +
====Sequence Variants====

[http://flybase.org FlyBase]: these are included in Chado, but they are lacking full detail.

[http://tigr.org TIGR]: only SNPs so far. The SNPs currently being stored are computed from pairwise alignments of sequences already loaded into Chado, so each SNP feature is featureloc'ed to the appropriate place on each of the two sequences (rather than having one of the featurelocs "dangling", as
indicated in some of the Chado usage documents). featureloc.residue_info is used to redundantly
store the base referenced in each of the two sequences.

NOTE: variation features should specify the edit that makes one feature (such as the reference/wild-type) from another (the variant/mutant/non-reference). There were perhaps 2 proposals for this
[more details required...].
+
 
+
 
+
Chado usage scenarios version:
+
 
+
Index
+
 
+
{| class="wikitable"
! Scenario !! Status !! Description
|-
| canonical-gene-model || final || The "central dogma" gene model - gene makes mRNA makes polypeptide
|-
| noncoding-gene || final || Similar to canonical-gene-model, except with noncoding-RNA
|-
| pseudogene || discussion || A pseudogene is a non-functional relic of a gene
|-
| singleton-feature || discussion || Many types of features are singletons - that is, they are not related to other features through feature_relationships. Storage of these is basic and as one may expect
|-
| dicistronic-gene || discussion || A dicistronic gene is a gene with an mRNA that codes for two distinct non-overlapping CDSs
|-
| operon || discussion || Bacterial genes are often transcribed in groups; e.g. LacZ
|-
| trans-spliced-gene || discussion || A trans-spliced gene has one or more transcripts that may be spliced together from different parts of the genome
|-
| gene-with-regulatory-elements || discussion || Regulatory elements may be implicitly or explicitly associated with a gene
|-
| transposons || discussion || Transposons can be annotated as singleton-features or as complex annotations
|-
| P-element-insertions || final ||
|-
| SNPs || final ||
|-
| gene-with-implicit-features-manifested || discussion || Some feature types such as introns are not normally manifested as rows in Chado. They are normally derived on-the-fly from the gaps between consecutive exons. Occasionally it may be desirable to store the introns as actual rows in the feature table, for example in a report database
|-
| feature-localization || final || All features with sequence annotation should be localized using featureloc
|-
| feature-localization-to-contigs-in-assembly || final || In an assembled genome, it is common to locate relative to the top-level assembly units (e.g. chromosomes). However, it is also permissible to locate to smaller units such as contigs or golden_path_units
|-
| redundant-localizations-to-different-assembly-levels || final || Features can be located relative to both contigs and top-level assembly units
|-
| n-level-assemblies || final || In theory it is possible (but rare) to have assemblies with variable depths, or with depths>2
|-
| unlocalized-gene || final || A gene without sequence-based localization
|}
+
Abstract
+
 
+
This page contains a selection of Chado best-practices for different usage scenarios. It is designed to complement the Chado SQL DDL (you should familiarize yourself with this first) and the Sequence Ontology. This document status is ALPHA - in progress
+
Scenarios
+
 
+
canonical-gene-model
+
The "central dogma" gene model - gene makes mRNA makes polypeptide
+
For many people this may be the only data they store in Chado. The typical protein coding gene model consists of a gene, one or more mRNAs, one or more exons, and at least one polypeptide.
+
 
+
Alternately spliced genes have a 1 to many relation between gene and mRNA. Exons can be part_of more than one mRNA. No two distinct exon rows should have exact same featureloc coordinates (this indicates they are the same exon).
+
 
+
Every feature must have a featureloc with rank=0 and locgroup=0. The value of the srcfeature_id column should be identical (i.e. all features are located relative to the same feature), except in rare circumstances such as when a feature crosses two contigs. Software is not guaranteed to support this. The srcfeature_id can point to a contig, a chromosome/chromosome_arm or other appropriate assembly unit.
+
 
+
This scenario involves rows in the following tables:
+
 
+
{| class="wikitable"
! table !! type_id !! number !! comments
|-
| feature || SO:gene || 1 || The gene must always be provided
|-
| feature || SO:mRNA || 1..n || One or more transcripts are required, and these are always of type mRNA for protein-coding genes.
|-
| feature_relationship || OBO_REL:part_of || SO:mRNA[1..n]---->[1]SO:gene || Transcripts are always linked to genes by a part_of relation. (Note that SO uses member_of here.) One gene can have many transcripts (multiple splicing). A transcript must always belong to exactly one gene, with rare exceptions.
|-
| feature || SO:exon || 1..n || Exons are always required, even if the genome under consideration has no introns
|-
| feature_relationship || OBO_REL:part_of || SO:exon[1..n]---->[1..n]SO:mRNA || Exons are always linked to their container transcript (in this case, an mRNA) via the part_of relation. If a transcript is alternately spliced, then an exon can be part_of multiple transcripts
|-
| feature || SO:polypeptide || 1..n || A protein-coding gene always produces a polypeptide, by definition. The polypeptide is located relative to the same genomic feature as the exons, mRNAs and gene. A single featureloc is used, with fmin and fmax indicating the start and stop codon positions (location is inclusive of stop codon). The polypeptide sequence should be specified as an amino acid sequence.
|-
| feature_relationship || OBO_REL:derived_from || SO:polypeptide[1]---->[1..n]SO:mRNA || The polypeptide is always derived_from the mRNA. If two alternate spliceforms produce the same polypeptide (i.e. their sequence is the same) then the same polypeptide feature should be used. An mRNA can only derive one polypeptide. For exceptions, see dicistronic-gene
|-
| featureloc || || 1..n || Every feature above must have a featureloc
|}
+
Tool: apollo
+
Status: supported
+
Tool: gbrowse
+
Status: supported
+
Example
+
 
+
A Drosophila gene with 5 exons and a single spliceform Download: [game] [chado] [chaos]
+
noncoding-gene
+
Similar to canonical-gene-model, except with noncoding-RNA
+
Not all genes are protein-coding. Genes can code for tRNA, miRNA, snoRNA, etc. A noncoding gene model is identical to a canonical-gene-model, with the following exceptions:
+
 
+
There is no polypeptide feature
+
Instead of an mRNA feature, there is a feature that is some other sub-type of RNA
+
This scenario involves rows in the following tables:
+
 
+
{| class="wikitable"
! table !! type_id !! number !! comments
|-
| feature || SO:gene || 1 || The gene must always be provided
|-
| feature || SO:RNA || 1..n || Type can be SO:RNA or any subtype of this type
|-
| feature_relationship || OBO_REL:part_of || SO:RNA[1..n]---->[1]SO:gene || Noncoding transcripts can also be alternately spliced
|-
| feature || SO:exon || 1..n || Exons are always required, even if the genome under consideration has no introns.
|-
| feature_relationship || OBO_REL:part_of || SO:exon[1..n]---->[1..n]SO:RNA || Exons are always linked to their container transcript (in this case, a non-mRNA subtype of SO:RNA) via the part_of relation. If a transcript is alternately spliced, then an exon can be part_of multiple transcripts
|-
| featureloc || || 1..n || Every feature above must have a featureloc
|}
+
Tool: apollo
+
Status: supported
+
Tool: gbrowse
+
Status: supported
+
pseudogene
+
A pseudogene is a non-functional relic of a gene
+
See pseudogene. A pseudogene may look like an ordinary gene, and may even have discernable parts such as exons. It may sometimes be desirable to annotate the exon structure of a pseudogene - this can in principle be done using SO types such as decayed_exon. In practice no-one is using Chado to do this. There are currently two practices:
+
Pseudogenes are treated analogously to noncoding-genes. That is, there are normal "gene" and "exon" features. However, in place of a subtype of RNA, there is a feature of type pseudogene. This practice is STRONGLY DISCOURAGED (it is not compliant with the relations in SO, and it gives false counts of the number of real genes in the database). Note that this is the current default for FlyBase.
Pseudogenes are normal singleton-features. There is no annotation of exon structure. This practice is encouraged. If at a later date it becomes desirable to annotate the exon structure of a pseudogene, it will be compatible with this.
+
Tool: apollo
+
Status: unclear
+
Apollo by default treats pseudogenes using the first method, above. It may also be possible to configure it to the second, singleton, method. Annotating the exon structure of pseudogenes the correct way has not yet been attempted to our knowledge.
+
singleton-feature
+
Many types of features are singletons - that is they are not related to other features through feature_relationships. Storage of these is basic and as one may expect
+
Singleton features present no major problems. Unlike genes, which typically have parts (with the parts having subparts), singletons do not form feature graphs (or rather, they form feature graphs consisting of single nodes). Singleton features are located relative to other features (usually the genome, but one can have singletons that are located relative to other features - this may not be supported by all applications).
Tool: gbrowse
Status: supported
Tool: apollo
Status: supported
+
Apollo supports singletons provided they are located relative to the genome (singletons located relative to other features will be ignored). It may be necessary to configure apollo to make the feature type "1-level"
+
dicistronic-gene
+
A dicistronic gene is a gene with a mRNA that codes for two distinct non-overlapping CDSs
+
Dicistronic genes (see for example, the dmel Adh and Adhr genes) have totally distinct gene products deriving from the same transcript. To confuse matters, the two polypeptides are commonly referred to as being derived from two distinct genes (e.g. Adh and Adhr). The entire genomic region comprising the transcript (e.g. Adh+Adhr) that includes both CDSs is referred to as the gene_cassette. In a database such as FlyBase, there are 3 gene IDs stored in the database - one for each of the two non-overlapping genes, and one for the gene cassette.
+
 
+
Dicistronic genes make it difficult to have a formal definition of gene that corresponds nicely with how biologists use the term.
+
 
+
There are currently two proposals for handling dicistronic genes. The first is a hack and introduces redundancy, but works well with existing software and tools. The second is preferred from a modeling standpoint, but introduces a lot of complexity to software.
+
 
+
operon
+
Bacterial genes are often transcribed in groups; eg LacZ
+
There are many similarities with dicistronic-genes here.
+
trans-spliced-gene
+
A trans-spliced gene has one or more transcripts in which that transcript may be spliced together from different parts of the genome
+
A trans spliced transcript is spliced from exons coming from different parts of the genome. The distance between each trans spliced part may be large, or it may be in the same location on the opposite strand.
+
 
+
Most C elegans genes have a trans spliced leader sequence. This is different from the trans splicing involved in dmel, where we observe what appears to be two transcripts on separate strands (both containing coding sequence) joining together in a single functional transcript.

There are two proposals for dealing with this. One treats the trans spliced transcript as a single transcript, with exons coming from different locations. The other treats the trans spliced transcript as a mature transcript created from two distinct primary transcripts. Note that these proposals focus on the dmel example. A solution for the C elegans example is not proposed (not sure if we even need one?)
+
 
+
We treat this as an ordinary gene model, but relax our rules for exon locations in a transcript
+
For example, for the canonical Dmel trans spliced gene, we would allow transcripts to have exons on different strands. Note that in Chado, exon ordering comes from feature_relationship.rank (between exon and transcript), NOT from the featureloc of the exon. Chado has no problem with this. However, some software may make assumptions that all exons are on the same strand, or may try to order exons by their location to get a transcript sequence. This software will have unintended consequences with trans spliced genes modeled using this proposal
+
Tool: apollo
+
Status: unclear
+
apollo may accidentally scramble the order of exons. Need to check
+
Tool: gbrowse
+
Status: unclear
+
Not sure.
+
In the second proposal, we would introduce extra transcripts, and have relations between the transcripts. Only the mature, spliced transcript would have a relation to the polypeptide
+
This may model the biology better. However, it introduces a major departure from the canonical-gene-model. For this reason this proposal is unlikely to be adopted
+
gene-with-regulatory-elements
+
regulatory elements may be implicitly or explicitly associated with a gene
+
transposons
+
transposons can be annotated as singleton-features or as complex annotations
+
A transposon may consist of various parts such as long_terminal_repeats and gene models for genes such as gag, pol and env. These parts may all have decayed over time. Transposon annotation typically ignores these subtleties, as all that is usually required is a singleton-feature of type transposable_element_feature. In this case, there is no difficulty
+
 
+
If one requires detailed transposon annotation then one is entering uncharted waters as far as both Chado and annotation tools are concerned (which is why this scenario is marked as being under discussion). One option would be to treat each transposon part as a distinct singleton, but this may be unsatisfactory, as one may want appropriate part_of relations between the parts.
+
 
+
P-element-insertions
+
 
+
SNPs
+
 
+
 
+
This outlines one way of modeling SNPs in chado. It also illustrates use of the featureloc table.
+
 
+
Most of this applies to other variation features, but I'll illustrate using SNPs for now to keep it simple.
+
 
+
A SNP is represented as a single feature in chado.
+
 
+
Let's take a basic example - a SNP that flips an A to a G on the
+
genome.
+
 
+
Here we would have one feature and two featurelocs.
+
 
+
(feature
+
  (name "SNP_01")
+
  (featureloc
+
    (srcfeature "Chromosome_arm_2L") ;;; dna feature identifier
+
    (fmin 1000000)
+
    (fmax 1000001)
+
    (strand 1)
+
    (residue_info "A")
+
    (rank 0)
+
    (locgroup 0))
+
  (featureloc
+
    (residue_info "G")
+
    (rank 1)
+
    (locgroup 0)))
+
 
+
The first location is on the chromosome arm (presumably wildtype). The second location has no srcfeature (ie it is set to null). However, it is effectively paired with the first location, because both share the same locgroup. If we later wished to instantiate the mutant chromosome arm feature, we would fill in the second featureloc's srcfeature.
+
 
+
Let's take another example - a SNP that has only been characterised at the protein level. This SNP flips an I to a V.
+
 
+
(feature
+
  (name "SNP_02")
+
  (featureloc
+
    (srcfeature "dpp-P1")    ;;; protein feature identifier
+
    (fmin 23)
+
    (fmax 24)
+
    (strand 1)
+
    (residue_info "I")
+
    (rank 0)
+
    (locgroup 0))
+
  (featureloc
+
    (residue_info "V")
+
    (rank 1)
+
    (locgroup 0)))
+
 
+
Again, the second featureloc has no srcfeature. The mutant protein is implicit. The mutant protein sequence can be inferred by taking the sequence of "dpp-P1" and substituting the 24th residue with a V.
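The substitution just described can be sketched in a few lines of Python. This is an illustrative helper rather than anything shipped with Chado: the function name `apply_variant` and the toy sequence are assumptions. It relies only on the interbase convention, under which the wild-type residues occupy the half-open range from the start coordinate to the end coordinate of the source sequence.

```python
def apply_variant(residues: str, start: int, end: int, alt: str) -> str:
    """Return residues with the interbase range [start, end) replaced by alt."""
    if not 0 <= start <= end <= len(residues):
        raise ValueError("location falls outside the source sequence")
    return residues[:start] + alt + residues[end:]


# Toy stand-in for the "dpp-P1" protein: 23 alanines, then the isoleucine
# that sits in interbase range 23..24 (i.e. the 24th residue).
wild = "A" * 23 + "I" + "A" * 6
mutant = apply_variant(wild, 23, 24, "V")
```

Because the range is half-open, the same helper also covers insertions (start equal to end) and deletions (empty replacement string).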
+
 
+
To do a query for all SNPs that switch I to V or vice versa:
+
 
+
SELECT snp.*
+
FROM
+
  featureloc AS wildloc,
+
  featureloc AS mutloc,
+
  feature AS snp,
+
  cvterm AS ftype
+
WHERE
+
  snp.type_id = ftype.cvterm_id        AND
+
  ftype.name = 'snp'                  AND
+
  wildloc.feature_id = snp.feature_id  AND
+
  mutloc.feature_id = snp.feature_id  AND
+
  wildloc.locgroup = mutloc.locgroup  AND
+
  wildloc.residue_info = 'I'          AND
+
  mutloc.residue_info = 'V';
+
 
+
 
+
Note that this query remains the same even if mutant protein features are instantiated as opposed to left implicit.
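To see why a single query can catch both directions of the change, here is a small sketch that runs an I/V query of this shape against an in-memory SQLite database. The three-table schema is a heavily cut-down assumption (real Chado tables have many more columns and constraints), and the feature names are invented. The point it illustrates is that the query never constrains rank, so either featureloc of a pair may play the role of `wildloc`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT, type_id INTEGER);
CREATE TABLE featureloc (
    feature_id INTEGER, srcfeature_id INTEGER,
    residue_info TEXT, rank INTEGER, locgroup INTEGER);
INSERT INTO cvterm VALUES (1, 'snp');
INSERT INTO feature VALUES
    (10, 'SNP_I_to_V', 1), (11, 'SNP_V_to_I', 1), (12, 'SNP_A_to_G', 1);
-- one wild (rank 0) and one mutant (rank 1) location per SNP
INSERT INTO featureloc VALUES
    (10, 100, 'I', 0, 0), (10, NULL, 'V', 1, 0),
    (11, 100, 'V', 0, 0), (11, NULL, 'I', 1, 0),
    (12, 200, 'A', 0, 0), (12, NULL, 'G', 1, 0);
""")

rows = conn.execute("""
SELECT DISTINCT snp.name
FROM featureloc AS wildloc, featureloc AS mutloc,
     feature AS snp, cvterm AS ftype
WHERE snp.type_id = ftype.cvterm_id
  AND ftype.name = 'snp'
  AND wildloc.feature_id = snp.feature_id
  AND mutloc.feature_id = snp.feature_id
  AND wildloc.locgroup = mutloc.locgroup
  AND wildloc.residue_info = 'I'
  AND mutloc.residue_info = 'V'
ORDER BY snp.name
""").fetchall()
names = [row[0] for row in rows]
```

If only one direction is wanted, adding conditions on `wildloc.rank` and `mutloc.rank` would restrict the match accordingly.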
+
 
+
 
+
Let's look at a more complex example: a SNP that has been localised to the genome and has an effect on a protein (isoleucine to threonine), where we want to redundantly store the SNP's effect on the genome, transcript and translation.
+
 
+
[note that in this example, the transcript is on the reverse strand,
+
so the residue is reverse complemented]
+
 
+
(feature
+
  (name "SNP_03")
+
 
+
  ;; position on genome
+
  (featureloc
+
    (srcfeature "chrom_arm_3R")
+
    (fmin 2000000)
+
    (fmax 2000001)
+
    (strand 1)
+
    (residue_info "A")
+
    (rank 0)                      ;; wild
+
    (locgroup 0))
+
  (featureloc
+
    (residue_info "G")
+
    (rank 1)                      ;; mutant
+
    (locgroup 0))
+
 
+
  ;; position on transcript
+
  (featureloc
+
    (srcfeature "blah-transcript001")    ;; processed transcript ID
+
    (fmin 1000)
+
    (fmax 1001)
+
    (strand 1)
+
    (residue_info "T")
+
    (rank 0)                      ;; wild
+
    (locgroup 1))
+
  (featureloc
+
    (residue_info "C")
+
    (rank 1)                      ;; mutant
+
    (locgroup 1))
+
 
+
  ;; position on protein
+
  (featureloc
+
    (srcfeature "blah-protein001")    ;;; protein feature identifier
+
    (fmin 23)
+
    (fmax 24)
+
    (strand 1)
+
    (residue_info "I")
+
    (rank 0)                      ;; wild
+
    (locgroup 2))
+
  (featureloc
+
    (residue_info "T")
+
    (rank 1)                      ;; mutant
+
    (locgroup 2)))
+
 
+
Here we have 6 locations for one SNP. The 6 locations can be imagined to be in a 2D matrix. The purpose of rank and locgroup is to specify the row and the column in the matrix, respectively.
+
 
+
        | genome    transcript  protein
+
--------+-------------------------------
+
wild    |  A          T        I
+
        |
+
mutant  |  G          C        T
+
 
+
rank is used to group locations by strain, and locgroup is used for the grouping within that strain. rank=0 should be used for the wildtype, although this is not always possible; locgroup=0 should be used for the primary (as opposed to derived) location, although this is not always possible either. The important thing is consistency within a SNP, to preserve the matrix.
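As a rough illustration of the matrix view, the following Python sketch arranges the six locations of SNP_03 by rank (row) and locgroup (column). The tuple layout and the function name `location_matrix` are assumptions made for this example; in a real database the values would come from the featureloc table.

```python
from collections import defaultdict


def location_matrix(featurelocs):
    """Arrange (rank, locgroup, residue_info) tuples as matrix[rank][locgroup]."""
    matrix = defaultdict(dict)
    for rank, locgroup, residue_info in featurelocs:
        if locgroup in matrix[rank]:
            # Two locations in the same cell would break the matrix.
            raise ValueError(f"duplicate cell rank={rank}, locgroup={locgroup}")
        matrix[rank][locgroup] = residue_info
    return {rank: dict(columns) for rank, columns in matrix.items()}


# The six featurelocs of SNP_03: locgroup 0 = genome, 1 = transcript, 2 = protein.
snp_03 = [
    (0, 0, "A"), (1, 0, "G"),
    (0, 1, "T"), (1, 1, "C"),
    (0, 2, "I"), (1, 2, "T"),
]
matrix = location_matrix(snp_03)
```

Row 0 then holds the wild residues and row 1 the mutant residues, one entry per locgroup.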
+
 
+
One can imagine rare (but entirely possible) cases whereby a single SNP causes different protein-level changes in two proteins (for instance, HIV carries a doubly encoded gene - ie the ORFs overlap but have different frames).
+
 
+
Here we would want to add another locgroup, for the second protein
+
 
+
        | genome    transcript  protein1 protein2
+
--------+-----------------------------------------
+
wild    |  A          T        I        Y
+
        |
+
mutant  |  G          C        T        H
+
 
+
Again, we don't need to instantiate the 2 mutant proteins; their sequences can be reconstructed from the wild proteins plus the corresponding mutation.
+
 
+
[remember chado is interbase, and postgresql substring counts from 1]
+
 
+
The following query dynamically constructs mutant feature residues based on the wildtype feature and the mutant residue changes. This should work for a variety of variation features, not just SNPs. Note that we need to use locgroup to properly group wild/mutant pairs of locations, otherwise this query will give bad data.
+
 
+
SELECT
+
snp.name,
+
wildfeat.name,
+
substr(wildfeat.residues,
+
        1,
+
        wildloc.fmin) ||
+
mutloc.residue_info  ||
+
substr(wildfeat.residues,
+
        wildloc.fmax+1)
+
FROM
+
  featureloc AS wildloc,
+
  feature AS wildfeat,
+
  featureloc AS mutloc,
+
  feature AS snp,
+
  cvterm AS ftype
+
WHERE
+
  snp.type_id = ftype.cvterm_id        AND
+
  ftype.name = 'snp'                    AND
+
  wildloc.feature_id = snp.feature_id  AND
+
  mutloc.feature_id = snp.feature_id    AND
+
  wildloc.locgroup = mutloc.locgroup    AND
+
  wildloc.srcfeature_id = wildfeat.feature_id AND
+
  wildloc.rank <> mutloc.rank;
+
 
+
 
+
EXTENSIONS
+
==========
+
 
+
The above will also work if we have a polymorphic site with a number
+
of different possibilities across multiple strains. We just extend the
+
number of rows in the location matrix (ie we have rank > 1).
+
 
+
We could also instantiate multiple SNPs, one per strain, and keep the
+
locations pairwise.
+
 
+
SIMILARITIES TO ALIGNMENTS
+
==========================
+
 
+
You should hopefully notice the parallels between modeling SNPs and
+
modeling pairwise (eg BLAST) and multiple alignments. The difference
+
is, alignments would always have locgroup=0, with the rank
+
distinguishing query from subject. Also, with an HSP feature, the
+
residue_info is used to store the alignment string.
+
 
+
REDUNDANT STORAGE OF COORDINATES ON DIFFERENT ASSEMBLY LEVELS
+
=============================================================
+
 
+
Some groups may find it advantageous to redundantly store features relative to both BACs and chromosomes (or to mini-contigs and scaffolds... choose your favourite assembly units). The approach outlined above works perfectly well with this: we would simply add another column in the location matrix (ie another wild/mutant pair with a distinct locgroup). All queries should work the same.
+
 
+
 
+
 
+
 
+
gene-with-implicit-features-manifested
+
Some feature types such as introns are not normally manifested as rows in chado. They are normally derived on-the-fly from the gaps between consecutive exons. See for an example. Occasionally it may be desirable to store the introns as actual rows in the feature table, for example in a report database
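The on-the-fly derivation of introns can be sketched as follows. This is a simplified illustration, not code from any Chado tool: the function name `implicit_introns` is an assumption, and it presumes all exons of the transcript are located on the same source feature and strand, so that sorting by coordinate gives the right order (for trans-spliced genes one would order by feature_relationship.rank instead).

```python
def implicit_introns(exons):
    """Derive intron ranges from exon (fmin, fmax) interbase pairs.

    Assumes every exon is on the same source feature and strand.
    """
    ordered = sorted(exons)
    introns = []
    for (_, left_fmax), (right_fmin, _) in zip(ordered, ordered[1:]):
        if right_fmin > left_fmax:
            # In interbase coordinates the gap is simply [left_fmax, right_fmin).
            introns.append((left_fmax, right_fmin))
    return introns


exons = [(100, 200), (500, 650), (300, 400)]
introns = implicit_introns(exons)
```

With interbase coordinates no off-by-one adjustment is needed: the intron starts exactly where one exon ends and stops exactly where the next begins.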
+
feature-localization
+
All features with sequence annotation should be localized using featureloc
+
Localized features must have a featureloc with rank=0 and locgroup=0. This is the primary location of the feature. The location always indicates the boundaries of the feature. If the feature is composed of distinct subfeatures (e.g. a transcript composed of exons), then it is NOT permitted to use multiple featurelocs to indicate this. Instead, there must be rows for the subfeatures, each with their own featureloc
+
 
+
In a feature graph (i.e. a group of features connected via feature_relationship rows), all features will typically be localized relative to the same source feature (i.e. they will all have the same value for featureloc.srcfeature_id)
+
 
+
Features are typically localized to some kind of genomic or assembly feature, but chado does not constrain you to using only this. For example, localizing features relative to a transcript or polypeptide or even an exon is permitted, but unusual practices will most likely not be recognized by most software
+
 
+
feature-localization-to-contigs-in-assembly
+
In an assembled genome, it is common to locate relative to the top-level assembly units (e.g. chromosomes). However, it is also permissible to locate to smaller units such as contigs or golden_path_units
+
If a genome assembly is not stable, it is common to locate relative to assembly units such as contigs. These contigs may then be localized relative to the top-level assembly units. This is known in chado terms as a location graph.
+
 
+
We discuss here location graphs of depth 2. See also n-level-assemblies. This scenario is often invisible to software interoperating with Chado. The software is free to only look at the main features and the contig-level feature and ignore the top-level assembly feature. It may sometimes be desirable to have software that can perform location transformations, mapping features from contigs to top-level units and back
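A location transformation over a depth-2 location graph can be sketched as below. This is an assumption-laden illustration rather than an existing Chado utility: the `Loc` tuple and the `project` function are invented for this example, and the coordinates are made up. It maps a feature located on a contig onto the chromosome on which that contig is itself located, using interbase coordinates throughout.

```python
from typing import NamedTuple


class Loc(NamedTuple):
    fmin: int
    fmax: int
    strand: int


def project(child: Loc, parent: Loc) -> Loc:
    """Map a location given relative to a contig onto the contig's own source.

    child  -- location of the feature on the contig
    parent -- location of the contig on the top-level unit
    """
    if child.fmin < 0 or child.fmax > parent.fmax - parent.fmin:
        raise ValueError("feature extends beyond the contig")
    if parent.strand < 0:
        # Contig is reverse-oriented: flip coordinates and strand.
        return Loc(parent.fmax - child.fmax, parent.fmax - child.fmin, -child.strand)
    return Loc(parent.fmin + child.fmin, parent.fmin + child.fmax, child.strand)


exon_on_contig = Loc(10, 60, 1)
forward = project(exon_on_contig, Loc(1000, 2000, 1))
reverse = project(exon_on_contig, Loc(1000, 2000, -1))
```

Deeper location graphs could in principle be handled by applying the same step repeatedly, once per level.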
+
 
+
Tool: apollo
+
Status: unclear
+
apollo should be happy to treat contigs just as if they were top-level units such as chromosome arms. However, the user may have to explicitly provide contigs if location queries are desired. For example, apollo may retrieve nothing if the user asks for a certain range on chromosome 4, and the features are located relative to contigs which are themselves on chromosome 4.
+
Tool: gbrowse
+
Status: unclear
+
Gbrowse may expect features to be located relative to top-level units such as chromosomes.
+
redundant-localizations-to-different-assembly-levels
+
Features can be located relative to both contigs and top-level assembly units
+
Chado allows redundant feature localization using featureloc.locgroup>0. This allows a database to have primary locations for features relative to contigs, and secondary locations relative to top-level units such as chromosomes. The converse is also allowed.
+
 
+
This scenario is discouraged unless the chado db admin knows what they are doing. They must implement solutions to ensure that featurelocs with varying locgroup do not get out of sync. These solutions are not part of the standard Chado software suite. Nevertheless, this scenario may be useful for advanced users in certain circumstances
+
 
+
Tool: gbrowse
+
Status: unclear
+
Not clear if gbrowse uses locgroup in querying. If it constrains by locgroup, then this is essentially the same as feature-localization-to-contigs-in-assembly
+
Tool: apollo
+
Status: partial
+
Not clear if apollo uses locgroup in querying. If it constrains by locgroup, then this is essentially the same as feature-localization-to-contigs-in-assembly. Apollo will not preserve redundant featurelocs when writing back to db. This could lead to db getting out of sync.
+
n-level-assemblies
+
In theory it is possible (but rare) to have assemblies with variable depths, or with depths>2
+
This scenario is rare. If required, then Chado can deal with this - there is no theoretical limit to the depth of a location graph. One can have annotated features located relative to minicontigs which are located relative to supercontigs which are located relative to chromosomes. Most software that interoperates with Chado will not be able to deal with this, so this scenario is discouraged except by advanced users who have no other option
+
unlocalized-gene
+
A gene without sequence based localization
+
Many chado instances are purely concerned with genome annotation - in these cases it would be strange to have genes or other features such as transcripts with no localization (i.e. no featurelocs). However, this scenario is actually common when Chado is used in a wider context. We may know of the existence of genes through non-sequence evidence such as genetics. When we have no sequence-based localization it is perfectly valid to have gene features with no featurelocs. When the time comes to create genome annotations for these, we just 'fill out' the gene feature by adding transcript and exon features.
+
 
+
Tool: gbrowse
+
Status: supported
+
Gbrowse supports this scenario in that unlocalized features will be ignored by the genome viewer, which is appropriate
+
Tool: apollo
+
Status: supported
+
Apollo supports this scenario in that unlocalized features will be ignored, which is appropriate behaviour for a genome annotation tool
+
 
+
 
+
4.3  Table definitions
+
 
+
 
+
feature
+
 
+
 
+
A feature is a biological sequence or a section of a biological sequence, or a collection of such
+
sections. Examples include genes, exons, transcripts, regulatory regions, polypeptides, protein
+
domains, chromosome sequences, sequence variations, cross-genome match regions such as hits and
+
HSPs and so on; see the Sequence Ontology for more
+
 
+
 
+
 
+
Table 4.1: feature

{| class="wikitable"
! Column !! Datatype !! Description
|-
| feature_id || integer ||
|-
| dbxref_id || integer || An optional primary public stable identifier for this feature. Secondary identifiers and external dbxrefs go in table:feature_dbxref
|-
| organism_id || integer || The organism to which this feature belongs. This column is mandatory
|-
| name || varchar || The optional human-readable common name for a feature, for display purposes
|-
| uniquename || text || The unique name for a feature; may not necessarily be particularly human-readable, although this is preferred. This name must be unique for this type of feature within this organism
|-
| residues || text || A sequence of alphabetic characters representing biological residues (nucleic acids, amino acids). This column does not need to be manifested for all features; it is optional for features such as exons where the residues can be derived from the featureloc. It is recommended that the value for this column be manifested for features which may have non-contiguous sublocations (eg transcripts), since derivation at query time is non-trivial. For expressed sequence, the DNA sequence should be used rather than the RNA sequence
|-
| seqlen || integer || The length of the residue feature. See column:residues. This column is partially redundant with the residues column, and also with featureloc. This column is required because the location may be unknown and the residue sequence may not be manifested, yet it may be desirable to store and query the length of the feature. The seqlen should always be manifested where the length of the sequence is known
|-
| md5checksum || char || The 32-character checksum of the sequence, calculated using the MD5 algorithm. This is practically guaranteed to be unique for any distinct sequence. This column thus acts as a unique identifier on the mathematical sequence
|-
| type_id || integer || A required reference to a table:cvterm giving the feature type. This will typically be a Sequence Ontology identifier. This column is thus used to subclass the feature table
|-
| is_analysis || boolean || Boolean indicating whether this feature is annotated or the result of an automated analysis. Analysis results also use the companalysis module. Note that the dividing line between analysis/annotation may be fuzzy; this should be determined on a per-project basis in a consistent manner. One requirement is that there should only be one non-analysis version of each wild-type gene feature in a genome, whereas the same gene feature can be predicted multiple times in different analyses
|-
| is_obsolete || boolean || Boolean indicating whether this feature has been obsoleted. Some chado instances may choose to simply remove the feature altogether, others may choose to keep an obsolete row in the table
|-
| timeaccessioned || timestamp || For handling object accession/modification timestamps (as opposed to db auditing info, handled elsewhere). The expectation is that these fields would be available to software interacting with chado
|-
| timelastmodified || timestamp || For handling object accession/modification timestamps (as opposed to db auditing info, handled elsewhere). The expectation is that these fields would be available to software interacting with chado
|}
+
 
+
 
+
 
+
featureloc
+
 
+
 
+
The location of a feature relative to another feature. IMPORTANT: INTERBASE COORDINATES ARE USED. (This is vital as it allows us to represent zero-length features, eg splice sites and insertion points, without an awkward fuzzy system.) Features typically have exactly ONE location, but this need not be the case. Some features may not be localized (eg a gene that has been characterized genetically but no sequence/molecular info is available).

NOTE ON MULTIPLE LOCATIONS: Each feature can have 0 or more locations. Multiple locations do NOT indicate non-contiguous locations (if a feature such as a transcript has a non-contiguous location, then the subfeatures such as exons should always be manifested). Instead, multiple featurelocs for a feature designate alternate locations or grouped locations; for instance, a feature designating a blast hit or hsp will have two locations, one on the query feature, one on the subject feature. Features representing sequence variation could have alternate locations instantiated on a feature on the mutant strain. The column:rank is used to differentiate these different locations. Reflexive locations should never be stored - this is for -proper- (ie non-self) locations only; i.e. nothing should be located relative to itself
+
 
+
 
+
 
+
Table 4.2: featureloc

{| class="wikitable"
! Column !! Datatype !! Description
|-
| featureloc_id || integer ||
|-
| feature_id || integer || The feature that is being located. Any feature can have zero or more featurelocs
|-
| srcfeature_id || integer || The source feature which this location is relative to. Every location is relative to another feature (however, this column is nullable, because the srcfeature may not be known). All locations are -proper- that is, nothing should be located relative to itself. No cycles are allowed in the featureloc graph
|-
| fmin || integer || The leftmost/minimal boundary in the linear range represented by the featureloc. Sometimes (eg in bioperl) this is called -start- although this is confusing because it does not necessarily represent the 5-prime coordinate. IMPORTANT: This is space-based (INTERBASE) coordinates, counting from zero. To convert this to the leftmost position in a base-oriented system (eg GFF, bioperl), add 1 to fmin
|-
| is_fmin_partial || boolean || This is typically false, but may be true if the value for column:fmin is inaccurate or the leftmost part of the range is unknown/unbounded
|-
| fmax || integer || The rightmost/maximal boundary in the linear range represented by the featureloc. Sometimes (eg in bioperl) this is called -end- although this is confusing because it does not necessarily represent the 3-prime coordinate. IMPORTANT: This is space-based (INTERBASE) coordinates, counting from zero. No conversion is required to go from fmax to the rightmost coordinate in a base-oriented system that counts from 1 (eg GFF, bioperl)
|-
| is_fmax_partial || boolean || This is typically false, but may be true if the value for column:fmax is inaccurate or the rightmost part of the range is unknown/unbounded
|-
| strand || integer || The orientation/directionality of the location. Should be 0, -1 or +1
|-
| phase || integer || Phase of translation with respect to srcfeature_id. Values are 0, 1, 2. It may not be possible to manifest this column for some features such as exons, because the phase is dependent on the spliceform (the same exon can appear in multiple spliceforms). This column is mostly useful for predicted exons and CDSs
|-
| residue_info || text || Alternative residues, when these differ from feature.residues. For instance, a SNP feature located on a wild and mutant protein would have different residue_info values. For alignment/similarity features, residue_info is used to represent the alignment string (CIGAR format). Note on variation features: even if we don't want to instantiate a mutant chromosome/contig feature, we can still represent a SNP etc with 2 locations, one (rank 0) on the genome, the other (rank 1) would have most fields null, except for residue_info
|-
| locgroup || integer || This is used to manifest redundant, derivable extra locations for a feature. The default locgroup=0 is used for the DIRECT location of a feature. MOST CHADO USERS MAY NEVER USE featurelocs WITH locgroup>0. Transitively derived locations are indicated with locgroup>0. For example, the position of an exon on a BAC and in global chromosome coordinates. This column is used to differentiate these groupings of locations. The default locgroup 0 is used for the main/primary location, from which the others can be derived via coordinate transformations. Another example of redundant locations is storing ORF coordinates relative to both transcript and genome. Redundant locations open the possibility of the database getting into inconsistent states; this schema gives us the flexibility of both warehouse instantiations with redundant locations (easier for querying) and management instantiations with no redundant locations. An example of using both locgroup and rank: imagine a feature indicating a conserved region between the chromosomes of two different species. We may want to keep redundant locations on both contigs and chromosomes. We would thus have 4 locations for the single conserved region feature - two distinct locgroups (contig level and chromosome level) and two distinct ranks (for the two species)
|-
| rank || integer || Used when a feature has >1 location, otherwise the default rank 0 is used. Some features (eg blast hits and HSPs) have two locations - one on the query and one on the subject. Rank is used to differentiate these. Rank=0 is always used for the query, Rank=1 for the subject. For multiple alignments, assignment of rank is arbitrary. Rank is also used for sequence variant features, such as SNPs. Rank=0 indicates the wildtype (or baseline) feature, Rank=1 indicates the mutant (or compared) feature
|}
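The conversion rules stated for fmin and fmax can be sketched as a pair of small Python helpers. The function names are assumptions for illustration; only the arithmetic (add 1 to fmin, leave fmax unchanged) comes from the column descriptions.

```python
def interbase_to_base(fmin: int, fmax: int) -> tuple:
    """Convert interbase (fmin, fmax) to 1-based inclusive (start, end)."""
    return fmin + 1, fmax


def base_to_interbase(start: int, end: int) -> tuple:
    """Convert 1-based inclusive (start, end), as in GFF, to interbase."""
    return start - 1, end


def interbase_length(fmin: int, fmax: int) -> int:
    """Length of an interbase range; zero for junction-like features."""
    return fmax - fmin


# The 24th residue of a protein, as in the SNP examples: interbase 23..24.
residue_24 = interbase_to_base(23, 24)
round_trip = base_to_interbase(*residue_24)
```

Note that a zero-length interbase feature (fmin equal to fmax, eg a splice site) has no clean base-oriented equivalent: the conversion yields a start that is one greater than the end.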
+
 
+
 
+
featureloc_pub

COMMENT ON INDEX featureloc c1 IS ’locgroup and rank serve to uniquely

Table 4.3: featureloc_pub

{| class="wikitable"
! Column !! Datatype !! Description
|-
| featureloc_pub_id || integer ||
|-
| featureloc_id || integer ||
|-
| pub_id || integer ||
|}
+
 
+
 
+
 
+
feature_pub

Provenance. Linking table between features and publications that mention them

Table 4.4: feature_pub

{| class="wikitable"
! Column !! Datatype !! Description
|-
| feature_pub_id || integer ||
|-
| feature_id || integer ||
|-
| pub_id || integer ||
|}
+
 
+
 
+
 
+
featureprop
+
 
+
 
+
A feature can have any number of slot-value property tags attached to it. This is an alternative to
+
hardcoding a list of columns in the relational schema, and is completely extensible
+
 
+
 
+
Table 4.5: featureprop

{| class="wikitable"
! Column !! Datatype !! Description
|-
| featureprop_id || integer ||
|-
| feature_id || integer ||
|-
| type_id || integer || The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. Certain property types will only apply to certain feature types (e.g. the anticodon property will only apply to tRNA features); the types here come from the sequence feature property ontology
|-
| value || text || The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
|-
| rank || integer || Property-Value ordering. Any feature can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used
|}
+
 
+
 
+
 
+
featureprop_pub

for any one feature, multivalued property-value pairs must be differentiated by rank

Table 4.6: featureprop_pub

{| class="wikitable"
! Column !! Datatype !! Description
|-
| featureprop_pub_id || integer ||
|-
| featureprop_id || integer ||
|-
| pub_id || integer ||
|}
+
 
+
 
+
 
+
feature_dbxref

Links a feature to dbxrefs. This is for secondary identifiers; primary identifiers should use feature.dbxref_id

Table 4.7: feature_dbxref

{| class="wikitable"
! Column !! Datatype !! Description
|-
| feature_dbxref_id || integer ||
|-
| feature_id || integer ||
|-
| dbxref_id || integer ||
|-
| is_current || boolean || The is_current boolean indicates whether the linked dbxref is the current -official- dbxref for the linked feature
|}
+
 
+
 
+
feature_relationship

Features can be arranged in graphs, eg exon part of transcript part of gene; translation madeby transcript. If type is thought of as a verb, each arc makes a statement [SUBJECT VERB OBJECT]. The object can also be thought of as the parent (containing feature), and the subject as the child (contained feature or subfeature). We include the relationship rank/order because, even though most of the time we can order things implicitly by sequence coordinates, we can't always do this - eg trans-spliced genes. It's also useful for quickly getting implicit introns

Table 4.8: feature_relationship

{| class="wikitable"
! Column !! Datatype !! Description
|-
| feature_relationship_id || integer ||
|-
| subject_id || integer || The subject of the subj-predicate-obj sentence. This is typically the subfeature
|-
| object_id || integer || The object of the subj-predicate-obj sentence. This is typically the container feature
|-
| type_id || integer || Relationship type between subject and object. This is a cvterm, typically from the OBO relationship ontology, although other relationship types are allowed. The most common relationship type is OBO_REL:part_of. Valid relationship types are constrained by the Sequence Ontology
|-
| value || text || Additional notes/comments
|-
| rank || integer || The ordering of subject features with respect to the object feature may be important (for example, exon ordering on a transcript - not always derivable if you take trans-spliced genes into consideration). rank is used to order these; starts from zero
|}
+
 
+
 
+
feature_relationship_pub

Provenance. Attach optional evidence to a feature_relationship in the form of a publication

Table 4.9: feature_relationship_pub

{| class="wikitable"
! Column !! Datatype !! Description
|-
| feature_relationship_pub_id || integer ||
|-
| feature_relationship_id || integer ||
|-
| pub_id || integer ||
|}
+
 
+
 
+
feature_relationshipprop

Extensible properties for feature_relationships. Analogous structure to featureprop. This table is largely optional and not used with a high frequency. A typical scenario is wishing to attach additional data to a feature_relationship - for example to say that the feature_relationship is only true in certain contexts

Table 4.10: feature_relationshipprop

{| class="wikitable"
! Column !! Datatype !! Description
|-
| feature_relationshipprop_id || integer ||
|-
| feature_relationship_id || integer ||
|-
| type_id || integer || The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. Currently there is no standard ontology for feature_relationship property types
|-
| value || text || The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
|-
| rank || integer || Property-Value ordering. Any feature_relationship can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used
|}
+
 
+
 
+
feature_relationshipprop_pub

Provenance for feature_relationshipprop

Table 4.11: feature_relationshipprop_pub

{| class="wikitable"
! Column !! Datatype !! Description
|-
| feature_relationshipprop_pub_id || integer ||
|-
| feature_relationshipprop_id || integer ||
|-
| pub_id || integer ||
|}
+
 
+
 
+
feature_cvterm

Associate a term from a cv with a feature, for example, GO annotation

Table 4.12: feature_cvterm

{| class="wikitable"
! Column !! Datatype !! Description
|-
| feature_cvterm_id || integer ||
|-
| feature_id || integer ||
|-
| cvterm_id || integer ||
|-
| pub_id || integer || Provenance for the annotation. Each annotation should have a single primary publication (which may be of the appropriate type for computational analyses) where more details can be found. Additional provenance dbxrefs can be attached using feature_cvterm_dbxref
|-
| is_not || boolean || If this is set to true, then this annotation is interpreted as a NEGATIVE annotation - ie the feature does NOT have the specified function, process, component, part, etc. See GO docs for more details
|}
+
 
+
 
+
 
+
feature_cvtermprop

Extensible properties for feature to cvterm associations. Examples: GO evidence codes; qualifiers; metadata such as the date on which the entry was curated and the source of the association. See the featureprop table for meanings of type_id, value and rank

Table 4.13: feature_cvtermprop

{| class="wikitable"
! Column !! Datatype !! Description
|-
| feature_cvtermprop_id || integer ||
|-
| feature_cvterm_id || integer ||
|-
| type_id || integer || The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. cvterms may come from the OBO evidence code cv
|-
| value || text || The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
|-
| rank || integer || Property-Value ordering. Any feature_cvterm can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used
|}
+
 
+
 
+
 
+
feature cvterm dbxref
+
 
+
 
+
Additional dbxrefs for an association. Rows in the feature cvterm table may be backed up by
+
dbxrefs. For example, a feature cvterm association that was inferred via a protein-protein interaction may be backed by referring to the dbxref for the alternate protein. Corresponds to the
+
WITH column in a GO gene association file (but can also be used for other analogous associations).
+
See http://www.geneontology.org/doc/GO.annotation.shtml#file for more details
+
 
+
 
+
Table 4.14: feature cvterm dbxref
+
 
+
Column Datatype Description
+
feature cvterm dbxref id integer
+
feature cvterm id  integer
+
dbxref id integer
+
 
+
 
+
 
+
feature cvterm pub
+
 
+
 
+
Secondary pubs for an association. Each feature cvterm association is supported by a single primary
+
publication. Additional secondary pubs can be added using this linking table (in a GO gene
+
association file, these correspond to any IDs after the pipe symbol in the publications column)
+
 
+
 
+
  Table 4.15: feature cvterm pub
+
 
+
  Column Datatype Description
+
  feature cvterm pub id integer
+
  feature cvterm id  integer
+
  pub id integer
+
 
+
 
+
synonym
+
 
+
 
+
A synonym for a feature. One feature can have multiple synonyms, and the same synonym can
+
apply to multiple features
+
 
+
 
+
Table 4.16: synonym
+
 
+
  Column Datatype Description
+
  synonym id  integer
+
  name  varchar  The synonym itself.  Should be human-readable,
+
machine-searchable ASCII text
+
  type id  integer  types would be symbol and fullname for now
+
  synonym sgml varchar  The fully specified synonym, with any non-ascii characters encoded in SGML
+
 
+
 
+
feature synonym
+
 
+
 
+
Linking table between feature and synonym
+
 
+
 
+
  Table 4.17: feature synonym
+
 
+
  Column Datatype Description
+
  feature synonym id integer
+
  synonym id  integer
+
  feature id  integer
+
  pub id integer  the pub id link is for relating the usage of a given
+
synonym to the publication in which it was used
+
  is current  boolean  the is current boolean indicates whether the linked
+
synonym is the current -official- symbol for the linked
+
feature
+
  is internal  boolean  Typically a synonym exists so that somebody querying
+
the db with an obsolete name can find the object
+
they're looking for (under its current name). If
+
the synonym has been used publicly and deliberately
+
(e.g. in a paper), it may also be listed in reports as a
+
synonym. If the synonym was not used deliberately
+
(e.g. there was a typo which went public), then the
+
is internal boolean may be set to -true- so that it is
+
known that the synonym is -internal- and should be
+
queryable but should not be listed in reports as a
+
valid synonym
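The interplay of is current and is internal can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SynonymLink` tuple, the function names and the sample names are hypothetical and are not part of Chado.

```python
from typing import List, NamedTuple, Optional


class SynonymLink(NamedTuple):
    """Hypothetical in-memory view of one feature synonym row joined to synonym."""
    name: str
    is_current: bool
    is_internal: bool


def current_symbol(links: List[SynonymLink]) -> Optional[str]:
    """Return the current official symbol, or None if no link is flagged as current."""
    for link in links:
        if link.is_current:
            return link.name
    return None


def reportable_synonyms(links: List[SynonymLink]) -> List[str]:
    """Synonyms that may be listed in reports: not internal and not the current symbol."""
    return [l.name for l in links if not l.is_internal and not l.is_current]


def matches_query(links: List[SynonymLink], query: str) -> bool:
    """Every synonym, internal or not, stays queryable."""
    return any(l.name == query for l in links)


links = [
    SynonymLink("abc1", is_current=True, is_internal=False),   # official symbol
    SynonymLink("old1", is_current=False, is_internal=False),  # used deliberately in a paper
    SynonymLink("abd1", is_current=False, is_internal=True),   # typo that went public
]
```

With this sample data the typo `abd1` still resolves to the feature through `matches_query`, but only `old1` would be printed as a synonym in a report.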
+
 
+
 
+
 
+
feature
+
 
+
 
+
A feature is a biological sequence or a section of a biological sequence, or a collection of such
+
sections. Examples include genes, exons, transcripts, regulatory regions, polypeptides, protein
+
domains, chromosome sequences, sequence variations, cross-genome match regions such as hits and
+
HSPs and so on; see the Sequence Ontology for more
+
 
+
 
+
 
+
  Table 4.18: feature
+
 
+
Column Datatype  Description
+
feature id  integer
+
dbxref id  integer  An optional primary public stable identifier for this
+
  feature. Secondary identifiers and external dbxrefs
+
  go in table:feature dbxref
+
organism id  integer  The organism to which this feature belongs. This
+
  column is mandatory
+
name  varchar  The optional human-readable common name for a
+
  feature, for display purposes
+
uniquename  text  The unique name for a feature; it need not be particularly human-readable, although this is
+
  preferred. This name must be unique for this type of
+
  feature within this organism
+
residues  text  A sequence of alphabetic characters representing biological residues (nucleic acids, amino acids). This
+
  column does not need to be manifested for all features; it is optional for features such as exons where
+
  the residues can be derived from the featureloc. It is
+
  recommended that the value for this column be manifested for features which may have non-contiguous
+
  sublocations (eg transcripts), since derivation at
+
  query time is non-trivial. For expressed sequence,
+
  the DNA sequence should be used rather than the
+
  RNA sequence
+
seqlen  integer  The length of the residue feature. See column:residues. This column is partially redundant
+
  with the residues column, and also with featureloc.
+
  This column is required because the location may be
+
  unknown and the residue sequence may not be manifested, yet it may be desirable to store and query
+
  the length of the feature. The seqlen should always
+
  be manifested where the length of the sequence is
+
  known
+
md5checksum  char  The 32-character checksum of the sequence, calculated using the MD5 algorithm. This is practically
+
  guaranteed to be unique for any feature. This column thus acts as a unique identifier on the mathematical sequence
+
type id  integer  A required reference to a table:cvterm giving the feature type. This will typically be a Sequence Ontology
+
  identifier. This column is thus used to subclass the
+
  feature table
+
is analysis  boolean  Boolean indicating whether this feature is annotated
+
  or the result of an automated analysis. Analysis results also use the companalysis module. Note that
+
  the dividing line between analysis/annotation may
+
  be fuzzy, this should be determined on a per-project
+
  basis in a consistent manner. One requirement is
+
  that there should only be one non-analysis version of
+
  each wild-type gene feature in a genome, whereas the
+
  same gene feature can be predicted multiple times in
+
  different analyses
+
is obsolete  boolean  Boolean indicating whether this feature has been obsoleted. Some chado instances may choose to simply
+
  remove the feature altogether, others may choose to
+
  keep an obsolete row in the table
+
timeaccessioned timestamp for handling object accession/modification timestamps (as opposed to db auditing info, handled elsewhere). The expectation is that these fields would
+
  be available to software interacting with chado
+
timelastmodified  timestamp  For handling object accession/modification timestamps (as opposed to db auditing info, handled elsewhere). The expectation is that these fields would
+
  be available to software interacting with chado
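The seqlen and md5checksum columns described in the table above are both derivable from residues. A minimal Python sketch, assuming the checksum is taken over the residue string exactly as stored (no case folding or whitespace stripping, which is an assumption of this example rather than something the text specifies):

```python
import hashlib


def md5checksum(residues: str) -> str:
    """Return the 32-character hexadecimal MD5 digest of a residue string."""
    return hashlib.md5(residues.encode("ascii")).hexdigest()


def derived_columns(residues: str) -> dict:
    """Compute the two columns that are redundant with residues."""
    return {"seqlen": len(residues), "md5checksum": md5checksum(residues)}
```

Two features with identical residues get the same checksum, so the value identifies the sequence itself rather than any one feature row.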
+
 
+
 
+
featureloc
+
 
+
 
+
The location of a feature relative to another feature. IMPORTANT: INTERBASE COORDINATES
+
ARE USED. (This is vital as it allows us to represent zero-length features, e.g. splice sites and
+
insertion points, without an awkward fuzzy system.) Features typically have exactly ONE location, but this need not be the case. Some features may not be localized (e.g. a gene that has been
+
characterized genetically but for which no sequence/molecular information is available). NOTE ON MULTIPLE
+
LOCATIONS: Each feature can have 0 or more locations. Multiple locations do NOT indicate
+
non-contiguous locations (if a feature such as a transcript has a non-contiguous location, then the
+
subfeatures such as exons should always be manifested). Instead, multiple featurelocs for a feature
+
designate alternate locations or grouped locations; for instance, a feature designating a BLAST hit or
+
HSP will have two locations, one on the query feature and one on the subject feature. Features representing sequence variation could have alternate locations instantiated on a feature on the mutant
+
strain. The column:rank is used to differentiate these different locations. Reflexive locations should
+
never be stored - this is for -proper- (i.e. non-self) locations only; nothing should be located
+
relative to itself.
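Interbase coordinates number the gaps between residues, starting at zero, rather than the residues themselves. The following Python sketch shows the conversion to and from a 1-based, inclusive system such as GFF; the function names are illustrative only and are not part of Chado.

```python
from typing import Tuple


def interbase_to_base(fmin: int, fmax: int) -> Tuple[int, int]:
    """Convert interbase (fmin, fmax) to 1-based inclusive (start, end)."""
    if fmin < 0 or fmax < fmin:
        raise ValueError("expected 0 <= fmin <= fmax")
    return fmin + 1, fmax  # only the left boundary shifts


def base_to_interbase(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based inclusive (start, end) back to interbase (fmin, fmax)."""
    return start - 1, end


def interbase_length(fmin: int, fmax: int) -> int:
    """Length is a plain difference, with no off-by-one correction."""
    return fmax - fmin
```

A zero-length feature such as an insertion point after the tenth residue is simply `fmin = fmax = 10`. The same location in a base-oriented system comes out as start 11, end 10, which is the awkwardness that interbase coordinates avoid.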
+
 
+
 
+
  Table 4.19: featureloc
+
 
+
Column Datatype Description
+
featureloc id  integer
+
feature id  integer  The feature that is being located. Any feature can
+
  have zero or more featurelocs
+
srcfeature id  integer  The source feature which this location is relative to.
+
  Every location is relative to another feature (however,
+
  this column is nullable, because the srcfeature
+
  may not be known). All locations are '''proper''' - that
+
  is, nothing should be located relative to itself. No
+
  cycles are allowed in the featureloc graph
+
fmin  integer  The leftmost/minimal boundary in the linear range
+
  represented by the featureloc.  Sometimes (e.g. in
+
  [http://bioperl.org Bioperl]) this is called -start- although this is confusing because it does not necessarily represent the
+
  5-prime coordinate. IMPORTANT: This is space-based (INTERBASE) coordinates, counting from
+
  zero. To convert this to the leftmost position in a
+
  base-oriented system (eg GFF, bioperl), add 1 to
+
  fmin
+
is fmin partial boolean  This is typically false, but may be true if the value
+
  for column:fmin is inaccurate or the leftmost part of
+
  the range is unknown/unbounded
+
fmax  integer  The rightmost/maximal boundary in the linear range
+
  represented by the featureloc.  Sometimes (e.g. in
+
  bioperl) this is called -end- although this is confusing
+
  because it does not necessarily represent the
+
  3-prime coordinate. IMPORTANT: These are space-based
+
  (INTERBASE) coordinates, counting from
+
  zero. No conversion is required to go from fmax to
+
  the rightmost coordinate in a base-oriented system
+
  that counts from 1 (eg GFF, bioperl)
+
is fmax partial boolean  This is typically false, but may be true if the value
+
  for column:fmax is inaccurate or the rightmost part
+
  of the range is unknown/unbounded
+
strand integer  The  orientation/directionality of the  location.
+
  Should be 0,-1 or +1
+
phase  integer  Phase of translation with respect to srcfeature id. Values are
+
  0,1,2. It may not be possible to manifest this column for some features such as exons, because the
+
  phase is dependent on the spliceform (the same exon
+
  can appear in multiple spliceforms). This column is
+
  mostly useful for predicted exons and CDSs
+
residue info  text  Alternative residues, when these differ from feature.residues. For instance, a SNP feature located
+
  on a wild and mutant protein would have different
+
  altresidues. For alignment/similarity features, the altresidues is used to represent the alignment string
+
  (CIGAR format). Note on variation features: even
+
  if we don't want to instantiate a mutant chromosome/contig feature, we can still represent a SNP
+
  etc. with 2 locations: one (rank 0) on the genome;
+
  the other (rank 1) would have most fields null, except for altresidues
+
locgroup  integer  This is used to manifest redundant, derivable extra locations for a feature. The default locgroup=0
+
  is used for the DIRECT location of a feature.  !!
+
  MOST CHADO USERS MAY NEVER USE featurelocs WITH locgroup>0 !! Transitively derived locations
+
  are indicated with locgroup>0. For example,
+
  the position of an exon on a BAC and in global chromosome coordinates. This column is used to differentiate
+
  these groupings of locations. The default
+
  locgroup 0 is used for the main/primary location,
+
  from which the others can be derived via coordinate
+
  transformations. Another example of redundant locations is storing ORF coordinates relative to both
+
  transcript and genome. Redundant locations open
+
  the possibility of the database getting into inconsistent states; this schema gives us the flexibility of both
+
  warehouse instantiations with redundant locations
+
  (easier for querying) and management instantiations
+
  with no redundant locations. An example of using
+
  both locgroup and rank: imagine a feature indicating a conserved region between the chromosomes of
+
  two different species. We may want to keep redundant locations on both contigs and chromosomes. We
+
  would thus have 4 locations for the single conserved
+
  region feature - two distinct locgroups (contig level
+
  and chromosome level) and two distinct ranks (for
+
  the two species)
+
rank  integer  Used when a feature has >1 location; otherwise the
+
  default rank 0 is used. Some features (eg blast hits
+
  and HSPs) have two locations - one on the query
+
  and one on the subject. Rank is used to differentiate these. Rank=0 is always used for the query,
+
  Rank=1 for the subject. For multiple alignments, assignment of rank is arbitrary. Rank is also used for
+
  sequence variant features, such as SNPs. Rank=0
+
  indicates the wildtype (or baseline) feature, Rank=1
+
  indicates the mutant (or compared) feature
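The conserved-region example in the locgroup description can be made concrete with a small Python sketch. It assumes that the triple (feature id, locgroup, rank) identifies a location uniquely; the `FeatureLoc` class, the helper names and all ids and coordinates are hypothetical and serve only to illustrate the bookkeeping.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeatureLoc:
    """Hypothetical in-memory stand-in for one featureloc row."""
    feature_id: int
    srcfeature_id: int
    fmin: int
    fmax: int
    locgroup: int = 0
    rank: int = 0


def check_unique(locs: List[FeatureLoc]) -> None:
    """Raise ValueError if two locations share (feature_id, locgroup, rank)."""
    seen = set()
    for loc in locs:
        key = (loc.feature_id, loc.locgroup, loc.rank)
        if key in seen:
            raise ValueError(f"duplicate location key {key}")
        seen.add(key)


def direct_locations(locs: List[FeatureLoc]) -> List[FeatureLoc]:
    """Return only the primary (locgroup 0) locations, ordered by rank."""
    return sorted((l for l in locs if l.locgroup == 0), key=lambda l: l.rank)


# One conserved-region feature (id 1) located on two species (rank 0 and 1),
# at contig level (locgroup 0) and chromosome level (locgroup 1).
conserved = [
    FeatureLoc(1, srcfeature_id=101, fmin=500, fmax=900, locgroup=0, rank=0),
    FeatureLoc(1, srcfeature_id=201, fmin=120, fmax=520, locgroup=0, rank=1),
    FeatureLoc(1, srcfeature_id=100, fmin=10500, fmax=10900, locgroup=1, rank=0),
    FeatureLoc(1, srcfeature_id=200, fmin=70120, fmax=70520, locgroup=1, rank=1),
]
```

A management-style database would store only the two rows returned by `direct_locations` and derive the chromosome-level rows on demand.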
+
 
+
 
+
 
+
featureloc pub
+
 
+
 
+
COMMENT ON INDEX featureloc c1 IS ’locgroup and rank serve to uniquely
+
 
+
 
+
  Table 4.20: featureloc pub
+
 
+
Column Datatype Description
+
featureloc pub id  integer
+
featureloc id  integer
+
pub id  integer
+
 
+
 
+
feature pub
+
 
+
 
+
Provenance. Linking table between features and publications that mention them
+
 
+
 
+
  Table 4.21: feature pub
+
 
+
  ColumnDatatype Description
+
  feature pub id integer
+
  feature id  integer
+
  pub idinteger
+
 
+
 
+
featureprop
+
 
+
 
+
A feature can have any number of slot-value property tags attached to it. This is an alternative to
+
hardcoding a list of columns in the relational schema, and is completely extensible
+
 
+
 
+
  Table 4.22: featureprop
+
 
+
  ColumnDatatype Description
+
  featureprop id integer
+
  feature id  integer
+
  type id  integer  The name of the property/slot is a cvterm. The
+
  meaning of the property is defined in that cvterm.
+
  Certain property types will only apply to certain fea-
+
  ture types (e.g. the anticodon property will only ap-
+
  ply to tRNA features) ; the types here come from
+
  the sequence feature property ontology
+
  value text  The value of the property, represented as text. Nu-
+
  meric values are converted to their text representa-
+
  tion. This is less efficient than using native database
+
  types, but is easier to query.
+
  rank  integer  Property-Value ordering. Any feature can have mul-
+
  tiple values for any particular property type - these
+
  are ordered in a list using rank, counting from zero.
+
  For properties that are single-valued rather than
+
  multi-valued, the default 0 value should be used
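As a sketch of how rank orders multi-valued properties and how text values are read back, consider the Python below. The tuple layout, the function names and the `note` property are assumptions of this example; `anticodon` is taken from the type id description above.

```python
from typing import List, Tuple

# Each row is (property type name, value stored as text, rank).
PropRow = Tuple[str, str, int]


def property_values(props: List[PropRow], type_name: str) -> List[str]:
    """Return all values of one property type, ordered by rank."""
    rows = [p for p in props if p[0] == type_name]
    return [value for _, value, _ in sorted(rows, key=lambda p: p[2])]


def single_value(props: List[PropRow], type_name: str) -> str:
    """Return the rank-0 value of a single-valued property."""
    values = property_values(props, type_name)
    if len(values) != 1:
        raise ValueError(f"{type_name} is not single-valued here")
    return values[0]


props = [
    ("note", "second remark", 1),
    ("anticodon", "CAU", 0),
    ("note", "first remark", 0),
    ("score", "42", 0),  # numeric value stored as its text representation
]
```

A numeric property has to be converted by the caller, for example `int(single_value(props, "score"))`, which is the cost of storing every value as text.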
+
 
+
 
+
 
+
featureprop pub
+
 
+
 
+
for any one feature, multivalued property-value pairs must be differentiated by rank
+
 
+
 
+
Table 4.23: featureprop pub
+
 
+
Column Datatype Description
+
featureprop pub id integer
+
featureprop id  integer
+
pub id integer
+
 
+
 
+
 
+
feature dbxref
+
 
+
 
+
Links a feature to dbxrefs. This is for secondary identifiers; primary identifiers should use
+
feature.dbxref id
+
 
+
 
+
Table 4.24: feature dbxref
+
 
+
Column Datatype Description
+
feature dbxref id  integer
+
feature id  integer
+
dbxref id  integer
+
is current  boolean the is current boolean indicates whether the linked
+
dbxref is the current -official- dbxref for the linked
+
feature
+
 
+
 
+
 
+
feature relationship
+
 
+
 
+
Features can be arranged in graphs, e.g. exon part of transcript part of gene; translation madeby
+
transcript. If type is thought of as a verb, each arc makes a statement [SUBJECT VERB OBJECT].
+
The object can also be thought of as the parent (containing feature), and the subject as the child (contained feature
+
or subfeature). We include the relationship rank/order because, even though most of the time we
+
can order things implicitly by sequence coordinates, we can't always do this, e.g. for trans-spliced genes.
+
It's also useful for quickly getting implicit introns.
+
 
+
 
+
Table 4.25: feature relationship
+
 
+
  ColumnDatatype Description
+
  feature relationship id integer
+
  subject id  integer  the subject of the subj-predicate-obj sentence. This
+
  is typically the subfeature
+
  object id  integer  the object of the subj-predicate-obj sentence. This
+
  is typically the container feature
+
  type id  integer  relationship type between subject and object. This
+
  is a cvterm, typically from the OBO relationship
+
  ontology, although other relationship types are al-
+
  lowed. The most common relationship type is
+
  OBO REL:part of. Valid relationship types are con-
+
  strained by the Sequence Ontology
+
  value text  Additional notes/comments
+
  rank  integer  The ordering of subject features with respect to the
+
  object feature may be important (for example, exon
+
  ordering on a transcript - not always derivable if you
+
  take trans spliced genes into consideration). rank is
+
  used to order these; starts from zero
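The use of rank to order subfeatures, and to read off implicit introns, might look like the following Python sketch. The dictionary keys mirror the column names, but the relationship label `part_of`, the function names, and all ids and coordinates are assumptions of this example. The intron helper assumes forward-strand exons given in rank order with interbase coordinates.

```python
from typing import Dict, List, Tuple


def ordered_subjects(rels: List[Dict], object_id: int, type_name: str = "part_of") -> List[int]:
    """Return subject ids related to object_id by type_name, ordered by rank."""
    rows = [r for r in rels if r["object_id"] == object_id and r["type"] == type_name]
    return [r["subject_id"] for r in sorted(rows, key=lambda r: r["rank"])]


def implicit_introns(exon_locs: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Each intron spans from one exon's fmax to the next exon's fmin."""
    return [(left[1], right[0]) for left, right in zip(exon_locs, exon_locs[1:])]


# Three exons (ids 11, 12, 13) that are part of transcript 1.
rels = [
    {"subject_id": 12, "object_id": 1, "type": "part_of", "rank": 1},
    {"subject_id": 11, "object_id": 1, "type": "part_of", "rank": 0},
    {"subject_id": 13, "object_id": 1, "type": "part_of", "rank": 2},
]
exon_locs = {11: (0, 10), 12: (20, 30), 13: (50, 60)}
exons_in_order = [exon_locs[e] for e in ordered_subjects(rels, object_id=1)]
```

Because the order comes from rank rather than from the coordinates, the same lookup still works where exon order cannot be inferred from position alone.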
+
 
+
 
+
feature relationship pub
+
 
+
 
+
Provenance. Attach optional evidence to a feature relationship in the form of a publication
+
 
+
 
+
Table 4.26: feature relationship pub
+
 
+
  Column  Datatype Description
+
  feature relationship pub id  integer
+
  feature relationship idinteger
+
  pub id  integer
+
 
+
 
+
 
+
feature relationshipprop
+
 
+
 
+
Extensible properties for feature relationships. Analogous structure to featureprop. This table is
+
largely optional and not used with a high frequency. Typical scenarios may be if one wishes to
+
attach additional data to a feature relationship - for example to say that the feature relationship
+
is only true in certain contexts
+
 
+
 
+
Table 4.27: feature relationshipprop
+
 
+
Column  Datatype Description
+
feature relationshipprop id  integer
+
feature relationship idinteger
+
type id integer  The name of the property/slot is a cvterm.The
+
  meaning of the property is defined in that cvterm.
+
  Currently there is no standard ontology for fea-
+
  ture relationship property types
+
valuetext  The value of the property, represented as text. Nu-
+
  meric values are converted to their text representa-
+
  tion. This is less efficient than using native database
+
  types, but is easier to query.
+
rank integer  Property-Value ordering. Any feature relationship
+
  can have multiple values for any particular property
+
  type - these are ordered in a list using rank, count-
+
  ing from zero. For properties that are single-valued
+
  rather than multi-valued, the default 0 value should
+
  be used
+
 
+
 
+
 
+
feature relationshipprop pub
+
 
+
 
+
Provenance for feature relationshipprop
+
 
+
 
+
Table 4.28: feature relationshipprop pub
+
 
+
Column Datatype Description
+
feature relationshipprop pub id integer
+
feature relationshipprop id integer
+
pub id integer
+
 
+
 
+
 
+
feature cvterm
+
 
+
 
+
Associate a term from a cv with a feature, for example, GO annotation
+
 
+
  
  Table 4.29: feature cvterm
+
=Tables=
  
ColumnDatatypeDescription
+
== Table: feature ==
feature cvterm id integer
+
feature id  integer
+
cvterm idinteger
+
pub idinteger Provenance for the annotation.Each annotation
+
  should have a single primary publication (which
+
  may be of the appropriate type for computational
+
  analyses) where more details can be found. Addi-
+
  tional provenance dbxrefs can be attached using fea-
+
  ture cvterm dbxref
+
is notboolean if this is set to true, then this annotation is inter-
+
  preted as a NEGATIVE annotation - ie the feature
+
  does NOT have the specified function, process, com-
+
  ponent, part, etc. See GO docs for more details
+
  
 +
A feature is a biological sequence or a section of a biological sequence, or a collection of such sections. Examples include genes, exons, transcripts, regulatory regions, polypeptides, protein domains, chromosome sequences, sequence variations, cross-genome match regions such as hits and HSPs and so on; see the Sequence Ontology for more.
  
feature cvtermprop
+
{| border="1" cellpadding="3"
 +
|+ feature Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| feature_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_dbxref| dbxref]]
 +
| dbxref_id
 +
| integer
 +
| '' ''<br /><br />An optional primary public stable identifier for this feature. Secondary identifiers and external dbxrefs go in the table feature_dbxref.
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_organism| organism]]
 +
| organism_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The organism to which this feature belongs. This column is mandatory.
 +
|- class="tr1"
 +
|
 +
| name
 +
| character varying(255)
 +
| '' ''<br /><br />The optional human-readable common name for a feature, for display purposes.
 +
|- class="tr0"
 +
|
 +
| uniquename
 +
| text
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The unique name for a feature; it need not be particularly human-readable, although this is preferred. This name must be unique for this type of feature within this organism.
 +
|- class="tr1"
 +
|
 +
| residues
 +
| text
 +
| '' ''<br /><br />A sequence of alphabetic characters representing biological residues (nucleic acids, amino acids). This column does not need to be manifested for all features; it is optional for features such as exons where the residues can be derived from the featureloc. It is recommended that the value for this column be manifested for features which may have non-contiguous sublocations (e.g. transcripts), since derivation at query time is non-trivial. For expressed sequence, the DNA sequence should be used rather than the RNA sequence.
 +
|- class="tr0"
 +
|
 +
| seqlen
 +
| integer
 +
| '' ''<br /><br />The length of the residue feature. See column:residues. This column is partially redundant with the residues column, and also with featureloc. This column is required because the location may be unknown and the residue sequence may not be manifested, yet it may be desirable to store and query the length of the feature. The seqlen should always be manifested where the length of the sequence is known.
 +
|- class="tr1"
 +
|
 +
| md5checksum
 +
| character(32)
 +
| '' ''<br /><br />The 32-character checksum of the sequence, calculated using the MD5 algorithm. This is practically guaranteed to be unique for any feature. This column thus acts as a unique identifier on the mathematical sequence.
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| type_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />A required reference to a table:cvterm giving the feature type. This will typically be a Sequence Ontology identifier. This column is thus used to subclass the feature table.
 +
|- class="tr1"
 +
|
 +
| is_analysis
 +
| boolean
 +
| '' NOT NULL DEFAULT false ''<br /><br />Boolean indicating whether this feature is annotated or the result of an automated analysis. Analysis results also use the companalysis module. Note that the dividing line between analysis and annotation may be fuzzy; this should be determined on a per-project basis in a consistent manner. One requirement is that there should only be one non-analysis version of each wild-type gene feature in a genome, whereas the same gene feature can be predicted multiple times in different analyses.
 +
|- class="tr0"
 +
|
 +
| is_obsolete
 +
| boolean
 +
| '' NOT NULL DEFAULT false ''<br /><br />Boolean indicating whether this feature has been obsoleted. Some chado instances may choose to simply remove the feature altogether, others may choose to keep an obsolete row in the table.
 +
|- class="tr1"
 +
|
 +
| timeaccessioned
 +
| timestamp without time zone
 +
| '' NOT NULL DEFAULT ('now'::text)::timestamp(6) with time zone ''<br /><br />For handling object accession or modification timestamps (as opposed to database auditing data, handled elsewhere). The expectation is that these fields would be available to software interacting with chado.
 +
|- class="tr0"
 +
|
 +
| timelastmodified
 +
| timestamp without time zone
 +
| '' NOT NULL DEFAULT ('now'::text)::timestamp(6) with time zone ''<br /><br />For handling object accession or modification timestamps (as opposed to database auditing data, handled elsewhere). The expectation is that these fields would be available to software interacting with chado.
 +
|}
  
 +
Tables referencing this one via Foreign Key Constraints:
  
Extensible properties for feature to cvterm associations. Examples: GO evidence codes; qualifiers;
+
* [[Chado_Tables#Table:_analysisfeature| analysisfeature]]
metadata such as the date on which the entry was curated and the source of the association. See
+
the featureprop table for meanings of type id, value and rank
+
  
 +
* [[Chado_Tables#Table:_element| element]]
  
Table 4.30: feature cvtermprop
+
* [[Chado_Tables#Table:_feature_cvterm| feature_cvterm]]
  
  Column Datatype Description
+
* [[Chado_Tables#Table:_feature_dbxref| feature_dbxref]]
  feature cvtermprop id integer
+
  feature cvterm id  integer
+
  type idinteger  The name of the property/slot is a cvterm.  The
+
meaning of the property is defined in that cvterm.
+
cvterms may come from the OBO evidence code cv
+
  value  text  The value of the property, represented as text. Nu-
+
meric values are converted to their text representa-
+
tion. This is less efficient than using native database
+
types, but is easier to query.
+
  rankinteger  Property-Value ordering.  Any feature cvterm can
+
have multiple values for any particular property type
+
- these are ordered in a list using rank, counting from
+
zero. For properties that are single-valued rather
+
than multi-valued, the default 0 value should be used
+
  
 +
* [[Chado_Tables#Table:_feature_expression| feature_expression]]
  
feature cvterm dbxref
+
* [[Chado_Tables#Table:_feature_genotype| feature_genotype]]
  
 +
* [[Chado_Tables#Table:_feature_phenotype| feature_phenotype]]
  
Additional dbxrefs for an association. Rows in the feature cvterm table may be backed up by
+
* [[Chado_Tables#Table:_feature_pub| feature_pub]]
dbxrefs. For example, a feature cvterm association that was inferred via a protein-protein interaction
+
may be backed by referring to the dbxref for the alternate protein. Corresponds to the
+
WITH column in a GO gene association file (but can also be used for other analogous associations).
+
See http://www.geneontology.org/doc/GO.annotation.shtml#file for more details
+
  
 +
* [[Chado_Tables#Table:_feature_relationship| feature_relationship]]
  
Table 4.31: feature cvterm dbxref
+
* [[Chado_Tables#Table:_feature_synonym| feature_synonym]]
  
Column Datatype Description
+
* [[Chado_Tables#Table:_featureloc| featureloc]]
feature cvterm dbxref id integer
+
feature cvterm id  integer
+
dbxref id integer
+
  
 +
* [[Chado_Tables#Table:_featurepos| featurepos]]
  
feature cvterm pub
+
* [[Chado_Tables#Table:_featureprop| featureprop]]
  
 +
* [[Chado_Tables#Table:_featurerange| featurerange]]
  
Secondary pubs for an association. Each feature cvterm association is supported by a single primary
+
* [[Chado_Tables#Table:_library_feature| library_feature]]
publication. Additional secondary pubs can be added using this linking table (in a GO gene
+
association file, these correspond to any IDs after the pipe symbol in the publications column)
+
  
 +
* [[Chado_Tables#Table:_phylonode| phylonode]]
  
  Table 4.32: feature cvterm pub
+
* [[Chado_Tables#Table:_wwwuser_feature| wwwuser_feature]]
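The residues column may be left empty for a feature such as an exon, because its sequence can be derived from the source feature and the featureloc. A minimal Python sketch of that derivation, assuming interbase fmin/fmax, a strand of -1 meaning reverse complement, and a plain DNA alphabet (the function names and the sample sequence are illustrative only):

```python
from typing import List, Tuple

# Complement table for an unambiguous DNA alphabet plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def derive_residues(src_residues: str, fmin: int, fmax: int, strand: int) -> str:
    """Slice the source sequence with interbase coordinates; reverse-complement on strand -1."""
    sub = src_residues[fmin:fmax]  # interbase coordinates map directly onto a Python slice
    if strand == -1:
        return sub.translate(_COMPLEMENT)[::-1]
    return sub


def spliced_residues(src_residues: str, exon_locs: List[Tuple[int, int, int]]) -> str:
    """Concatenate exon sequences given as (fmin, fmax, strand) in transcript order."""
    return "".join(derive_residues(src_residues, *loc) for loc in exon_locs)


chromosome = "AACCGGTT"
```

The second helper shows why the residues of a feature with non-contiguous sublocations are better stored than derived: every exon must be fetched, ordered and joined before the sequence is available.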
  
  Column Datatype Description
+
----
  feature cvterm pub id integer
+
  feature cvterm id  integer
+
  pub id integer
+
  
  
synonym
 
  
 +
== Table: feature_cvterm ==
  
A synonym for a feature. One feature can have multiple synonyms, and the same synonym can
+
Associate a term from a cv with a feature, for example, GO annotation.
apply to multiple features
+
  
 +
{| border="1" cellpadding="3"
 +
|+ feature_cvterm Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| feature_cvterm_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature| feature]]
 +
| feature_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| cvterm_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_pub| pub]]
 +
| pub_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />Provenance for the annotation. Each annotation should have a single primary publication (which may be of the appropriate type for computational analyses) where more details can be found. Additional provenance dbxrefs can be attached using feature_cvterm_dbxref.
 +
|- class="tr0"
 +
|
 +
| is_not
 +
| boolean
 +
| '' NOT NULL DEFAULT false ''<br /><br />If this is set to true, then this annotation is interpreted as a NEGATIVE annotation - i.e. the feature does NOT have the specified function, process, component, part, etc. See GO docs for more details.
 +
|}
  
Table 4.33: synonym
+
Tables referencing this one via Foreign Key Constraints:
  
  Column Datatype Description
+
* [[Chado_Tables#Table:_feature_cvterm_dbxref| feature_cvterm_dbxref]]
  synonym idinteger
+
  namevarchar  The synonym itself.  Should be human-readable
+
machine-searchable ascii text
+
  type idinteger  types would be symbol and fullname for now
+
  synonym sgml varchar  The fully specified synonym, with any non-ascii char-
+
acters encoded in SGML
+
  
 +
* [[Chado_Tables#Table:_feature_cvterm_pub| feature_cvterm_pub]]
  
feature synonym
+
* [[Chado_Tables#Table:_feature_cvtermprop| feature_cvtermprop]]
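Because is_not turns an association into a negative statement, code that reads feature_cvterm rows has to keep the two kinds apart. A small Python sketch, in which the tuple layout, the function name, the feature ids and the term labels are all hypothetical:

```python
from typing import List, Set, Tuple

# Each row is (feature_id, cvterm label, is_not).
Annotation = Tuple[int, str, bool]


def split_annotations(annotations: List[Annotation], feature_id: int) -> Tuple[Set[str], Set[str]]:
    """Return (asserted terms, negated terms) for one feature."""
    positive: Set[str] = set()
    negative: Set[str] = set()
    for fid, term, is_not in annotations:
        if fid != feature_id:
            continue
        (negative if is_not else positive).add(term)
    return positive, negative


annotations = [
    (1, "kinase activity", False),
    (1, "nucleus", True),      # feature 1 is asserted NOT to be in the nucleus
    (2, "nucleus", False),
]
```

Selecting rows by feature and term alone, without checking is_not, would wrongly report feature 1 as annotated to "nucleus".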
  
 +
----
  
Linking table between feature and synonym
 
  
  
  Table 4.34: feature synonym
+
== Table: feature_cvterm_dbxref ==
  
  Column Datatype Description
+
Additional dbxrefs for an association. Rows in the feature_cvterm table may be backed up by dbxrefs. For example, a feature_cvterm association that was inferred via a protein-protein interaction may be backed by referring to the dbxref for the alternate protein. Corresponds to the WITH column in a GO gene association file (but can also be used for other analogous associations). See http://www.geneontology.org/doc/GO.annotation.shtml#file for more details.
  feature synonym id integer
+
  synonym idinteger
+
  feature idinteger
+
  pub id integer  the pub id link is for relating the usage of a given
+
synonym to the publication in which it was used
+
  is currentboolean  the is current boolean indicates whether the linked
+
synonym is the current -official- symbol for the linked
+
feature
+
  is internal  boolean  Typically a synonym exists so that somebody querying
+
the db with an obsolete name can find the object
+
they're looking for (under its current name). If
+
the synonym has been used publicly and deliberately
+
(e.g. in a paper), it may also be listed in reports as a
+
synonym. If the synonym was not used deliberately
+
(e.g. there was a typo which went public), then the
+
is internal boolean may be set to -true- so that it is
+
known that the synonym is -internal- and should be
+
queryable but should not be listed in reports as a
+
valid synonym
+
  
 +
{| border="1" cellpadding="3"
 +
|+ feature_cvterm_dbxref Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| feature_cvterm_dbxref_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature_cvterm| feature_cvterm]]
 +
| feature_cvterm_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_dbxref| dbxref]]
 +
| dbxref_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|}
  
 +
----
  
genotype
 
  
  
Table 4.35: genotype
+
== Table: feature_cvterm_pub ==
  
ColumnDatatype Description
+
Secondary pubs for an association. Each feature_cvterm association is supported by a single primary publication. Additional secondary pubs can be added using this linking table (in a GO gene association file, these correspond to any IDs after the pipe symbol in the publications column).
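A loader could separate the primary publication from the secondary ones by splitting the publications column on the pipe symbol. The Python below is a sketch under that assumption; the function name and the identifiers in the sample are made up for illustration.

```python
from typing import List, Tuple


def split_references(reference_column: str) -> Tuple[str, List[str]]:
    """Split a pipe-separated reference column into (primary, secondaries)."""
    ids = [part.strip() for part in reference_column.split("|") if part.strip()]
    if not ids:
        raise ValueError("reference column is empty")
    # The first id becomes feature_cvterm.pub_id; the rest become feature_cvterm_pub rows.
    return ids[0], ids[1:]


primary, secondary = split_references("PUB:0001|PUB:0002|PUB:0003")
```

Each id would still need to be resolved to a pub row before it can be stored, a step this sketch leaves out.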
genotype id integer
+
uniquename  text
+
description varchar
+
  
 +
{| border="1" cellpadding="3"
 +
|+ feature_cvterm_pub Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| feature_cvterm_pub_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature_cvterm| feature_cvterm]]
 +
| feature_cvterm_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_pub| pub]]
 +
| pub_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|}
  
 +
----
  
feature genotype
 
  
  
 +
== Table: feature_cvtermprop ==
  
 +
Extensible properties for feature to cvterm associations. Examples: GO evidence codes; qualifiers; metadata such as the date on which the entry was curated and the source of the association. See the featureprop table for meanings of type_id, value and rank.
  
  Table 4.36: feature genotype
+
{| border="1" cellpadding="3"
 +
|+ feature_cvtermprop Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| feature_cvtermprop_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature_cvterm| feature_cvterm]]
 +
| feature_cvterm_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| type_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. cvterms may come from the OBO evidence code cv.
 +
|- class="tr1"
 +
|
 +
| value
 +
| text
 +
| '' ''<br /><br />The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
 +
|- class="tr0"
 +
|
 +
| rank
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />Property-Value ordering. Any feature_cvterm can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.
 +
|}
  
Column  Datatype Description
+
----
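{| border="1" cellpadding="3"
|+ feature_genotype
|-
! Column
! Datatype
! Description
|-
| feature_genotype_id
| integer
|
|-
| feature_id
| integer
|
|-
| genotype_id
| integer
|
|-
| chromosome_id
| integer
|
|-
| rank
| integer
|
|-
| cgroup
| integer
|
|-
| cvterm_id
| integer
|
|}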
+
  
  
  
environment
+
== Table: feature_dbxref ==
  
 +
Links a feature to dbxrefs. This is for secondary identifiers; primary identifiers should use feature.dbxref_id.
  
 +
{| border="1" cellpadding="3"
 +
|+ feature_dbxref Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| feature_dbxref_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature| feature]]
 +
| feature_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_dbxref| dbxref]]
 +
| dbxref_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr1"
 +
|
 +
| is_current
 +
| boolean
 +
| '' NOT NULL DEFAULT true ''<br /><br />True if this secondary dbxref is the most up to date accession in the corresponding db. Retired accessions should set this field to false.
 +
|}
  
 +
----
  
 +
== Table: feature_pub ==
  
  Table 4.37: environment
+
Provenance. Linking table between features and publications that mention them.
  
  ColumnDatatype  Description
+
{| border="1" cellpadding="3"
  environment id integer
+
|+ feature_pub Structure
  uniquename  text
+
|-
  description text
+
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| feature_pub_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature| feature]]
 +
| feature_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_pub| pub]]
 +
| pub_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|}
  
 +
Tables referencing this one via Foreign Key Constraints:
  
 +
* [[Chado_Tables#Table:_feature_pubprop| feature_pubprop]]
  
environment cvterm
+
----
  
  
  
 +
== Table: feature_pubprop ==
  
 +
Property or attribute of a feature_pub link.
  
  Table 4.38: environment cvterm
+
{| border="1" cellpadding="3"
 +
|+ feature_pubprop Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| feature_pubprop_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature_pub| feature_pub]]
 +
| feature_pub_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| type_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr1"
 +
|
 +
| value
 +
| text
 +
| '' ''
 +
|- class="tr0"
 +
|
 +
| rank
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|}
  
  Column Datatype Description
+
----
  environment cvterm id integer
+
  environment id  integer
+
  cvterm id integer
+
  
phenstatement
 
  
  
Phenotypes are things like ”larval lethal”. Phenstatements are things like ”dpp[1] is recessive
+
== Table: feature_relationship ==
larval lethal”. So essentially phenstatement is a linking table expressing the relationship between
+
genotype, environment, and phenotype.
+
  
 +
Features can be arranged in graphs, e.g. "exon part_of transcript part_of gene". If type is thought of as a verb, then each arc or edge makes a statement [Subject Verb Object]. The object can also be thought of as the parent (containing feature), and the subject as the child (contained feature or subfeature). We include the relationship rank/order because, even though most of the time we can order things implicitly by sequence coordinates, we cannot always do this, e.g. for trans-spliced genes. It is also useful for quickly getting implicit introns.
  
  Table 4.39: phenstatement
+
{| border="1" cellpadding="3"
 +
|+ feature_relationship Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| feature_relationship_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature| feature]]
 +
| subject_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The subject of the subj-predicate-obj sentence. This is typically the subfeature.
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_feature| feature]]
 +
| object_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The object of the subj-predicate-obj sentence. This is typically the container feature.
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| type_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />Relationship type between subject and object. This is a cvterm, typically from the OBO relationship ontology, although other relationship types are allowed. The most common relationship type is OBO_REL:part_of. Valid relationship types are constrained by the Sequence Ontology.
 +
|- class="tr0"
 +
|
 +
| value
 +
| text
 +
| '' ''<br /><br />Additional notes or comments.
 +
|- class="tr1"
 +
|
 +
| rank
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The ordering of subject features with respect to the object feature may be important (for example, exon ordering on a transcript - not always derivable if you take trans spliced genes into consideration). Rank is used to order these; starts from zero.
 +
|}
  
Column  DatatypeDescription
+
Tables referencing this one via Foreign Key Constraints:
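{| border="1" cellpadding="3"
|+ phenstatement
|-
! Column
! Datatype
! Description
|-
| phenstatement_id
| integer
|
|-
| genotype_id
| integer
|
|-
| environment_id
| integer
|
|-
| phenotype_id
| integer
|
|-
| type_id
| integer
|
|-
| pub_id
| integer
|
|}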
+
  
 +
* [[Chado_Tables#Table:_feature_relationship_pub| feature_relationship_pub]]
  
phendesc
+
* [[Chado_Tables#Table:_feature_relationshipprop| feature_relationshipprop]]
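
The rank-based ordering described for feature_relationship, and the remark that it helps with getting implicit introns, can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the <code>Exon</code> tuple and the <code>implicit_introns</code> helper are hypothetical names (not part of Chado), and the sketch assumes all exons lie on the forward strand of a single source feature, with interbase fmin/fmax values.

```python
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    """Hypothetical in-memory stand-in for an exon joined to its
    feature_relationship.rank and featureloc.fmin/fmax values."""
    name: str
    rank: int  # feature_relationship.rank, counting from zero
    fmin: int  # interbase coordinate
    fmax: int  # interbase coordinate


def implicit_introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return the (fmin, fmax) gaps between rank-consecutive exons.

    Assumes forward-strand exons located on one source feature.
    """
    ordered = sorted(exons, key=lambda e: e.rank)
    return [(a.fmax, b.fmin) for a, b in zip(ordered, ordered[1:])]


# Rows may arrive in any order; rank restores the transcript order.
exons = [
    Exon("exon3", 2, 900, 1200),
    Exon("exon1", 0, 100, 300),
    Exon("exon2", 1, 500, 650),
]
introns = implicit_introns(exons)
print(introns)
```

Ordering by rank rather than by coordinate is the point of the sketch: for trans-spliced genes the coordinate order need not match the exon order on the transcript.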
  
 +
----
  
a summary of a set of phenotypic statements for any one gcontext made in any one publication
 
  
  
Table 4.40: phendesc
+
== Table: feature_relationship_pub ==
  
ColumnDatatype Description
+
Provenance. Attach optional evidence to a feature_relationship in the form of a publication.
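{| border="1" cellpadding="3"
|+ phendesc
|-
! Column
! Datatype
! Description
|-
| phendesc_id
| integer
|
|-
| genotype_id
| integer
|
|-
| environment_id
| integer
|
|-
| description
| text
|
|-
| pub_id
| integer
|
|}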
+
  
 +
{| border="1" cellpadding="3"
 +
|+ feature_relationship_pub Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| feature_relationship_pub_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature_relationship| feature_relationship]]
 +
| feature_relationship_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_pub| pub]]
 +
| pub_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|}
  
phenotype comparison
+
----
  
  
comparison of phenotypes eg, genotype1/environment1/phenotype1 ”non-suppressible” wrt geno-
 
type2/environment2/phenotype2
 
  
 +
== Table: feature_relationshipprop ==
  
Table 4.41: phenotype comparison
+
Extensible properties for feature_relationships. Analogous structure to featureprop. This table is largely optional and is not frequently used. A typical scenario is attaching additional data to a feature_relationship, for example to say that the feature_relationship is only true in certain contexts.
  
ColumnDatatype Description
+
{| border="1" cellpadding="3"
phenotype comparison id integer
+
|+ feature_relationshipprop Structure
genotype1 idinteger
+
|-
environment1 idinteger
+
! F-Key
genotype2 idinteger
+
! Name
environment2 idinteger
+
! Type
phenotype1 id  integer
+
! Description
phenotype2 id  integer
+
|- class="tr0"
type id  integer
+
|
pub idinteger
+
| feature_relationshipprop_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature_relationship| feature_relationship]]
 +
| feature_relationship_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| type_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. Currently there is no standard ontology for feature_relationship property types.
 +
|- class="tr1"
 +
|
 +
| value
 +
| text
 +
| '' ''<br /><br />The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
 +
|- class="tr0"
 +
|
 +
| rank
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />Property-Value ordering. Any feature_relationship can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.
 +
|}
  
 +
Tables referencing this one via Foreign Key Constraints:
  
phenotype
+
* [[Chado_Tables#Table:_feature_relationshipprop_pub| feature_relationshipprop_pub]]
  
 +
----
  
a phenotypic statement, or a single atomic phenotypic observation a controlled sentence describing
 
observable effect of non-wt function – e.g. Obs=eye, attribute=color, cvalue=red
 
  
  
Table 4.42: phenotype
+
== Table: feature_relationshipprop_pub ==
  
Column  Datatype Description
+
Provenance for feature_relationshipprop.
{| border="1" cellpadding="3"
|+ phenotype
|-
! Column
! Datatype
! Description
|-
| phenotype_id
| integer
|
|-
| uniquename
| text
|
|-
| observable_id
| integer
| The entity: e.g. anatomy part, biological process
|-
| attr_id
| integer
| Phenotypic attribute (quality, property, attribute, character) - drawn from PATO
|-
| value
| text
| Value of attribute - unconstrained free text. Used only if cvalue_id is not appropriate
|-
| cvalue_id
| integer
| Phenotype attribute value (state)
|-
| assay_id
| integer
| Evidence type
|}
+
  
 +
{| border="1" cellpadding="3"
 +
|+ feature_relationshipprop_pub Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| feature_relationshipprop_pub_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature_relationshipprop| feature_relationshipprop]]
 +
| feature_relationshipprop_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_pub| pub]]
 +
| pub_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|}
  
 +
----
  
phenotype cvterm
 
  
  
NULL
+
== Table: feature_synonym ==
  
 +
Linking table between feature and synonym.
  
  Table 4.43: phenotype cvterm
+
{| border="1" cellpadding="3"
 +
|+ feature_synonym Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| feature_synonym_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_synonym| synonym]]
 +
| synonym_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_feature| feature]]
 +
| feature_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_pub| pub]]
 +
| pub_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The pub_id link is for relating the usage of a given synonym to the publication in which it was used.
 +
|- class="tr0"
 +
|
 +
| is_current
 +
| boolean
 +
| '' NOT NULL DEFAULT true ''<br /><br />The is_current boolean indicates whether the linked synonym is the current -official- symbol for the linked feature.
 +
|- class="tr1"
 +
|
 +
| is_internal
 +
| boolean
 +
| '' NOT NULL DEFAULT false ''<br /><br />Typically a synonym exists so that somebody querying the db with an obsolete name can find the object they're looking for (under its current name). If the synonym has been used publicly and deliberately (e.g. in a paper), it may also be listed in reports as a synonym. If the synonym was not used deliberately (e.g. there was a typo which went public), then the is_internal boolean may be set to -true- so that it is known that the synonym is -internal- and should be queryable but should not be listed in reports as a valid synonym.
 +
|}
  
  Column  Datatype Description
+
----
  phenotype cvterm id integer
+
  phenotype id  integer
+
  cvterm id  integer
+
  
  
feature phenotype
 
  
 +
== Table: featureloc ==
  
NULL
+
The location of a feature relative to another feature. Important: interbase coordinates are used. This is vital as it allows us to represent zero-length features e.g. splice sites, insertion points without an awkward fuzzy system. Features typically have exactly ONE location, but this need not be the case. Some features may not be localized (e.g. a gene that has been characterized genetically but no sequence or molecular information is available). Note on multiple locations: Each feature can have 0 or more locations. Multiple locations do NOT indicate non-contiguous locations (if a feature such as a transcript has a non-contiguous location, then the subfeatures such as exons should always be manifested). Instead, multiple featurelocs for a feature designate alternate locations or grouped locations; for instance, a feature designating a blast hit or hsp will have two locations, one on the query feature, one on the subject feature. Features representing sequence variation could have alternate locations instantiated on a feature on the mutant strain. The column:rank is used to differentiate these different locations. Reflexive locations should never be stored - this is for -proper- (i.e. non-self) locations only; nothing should be located relative to itself.
  
 +
{| border="1" cellpadding="3"
 +
|+ featureloc Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| featureloc_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature| feature]]
 +
| feature_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The feature that is being located. Any feature can have zero or more featurelocs.
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_feature| feature]]
 +
| srcfeature_id
 +
| integer
 +
| '' ''<br /><br />The source feature which this location is relative to. Every location is relative to another feature (however, this column is nullable, because the srcfeature may not be known). All locations are -proper- that is, nothing should be located relative to itself. No cycles are allowed in the featureloc graph.
 +
|- class="tr1"
 +
|
 +
| fmin
 +
| integer
 +
| '' ''<br /><br />The leftmost/minimal boundary in the linear range represented by the featureloc. Sometimes (e.g. in Bioperl) this is called -start- although this is confusing because it does not necessarily represent the 5-prime coordinate. Important: This is space-based (interbase) coordinates, counting from zero. To convert this to the leftmost position in a base-oriented system (eg GFF, Bioperl), add 1 to fmin.
 +
|- class="tr0"
 +
|
 +
| is_fmin_partial
 +
| boolean
 +
| '' NOT NULL DEFAULT false ''<br /><br />This is typically false, but may be true if the value for column:fmin is inaccurate or the leftmost part of the range is unknown/unbounded.
 +
|- class="tr1"
 +
|
 +
| fmax
 +
| integer
 +
| '' ''<br /><br />The rightmost/maximal boundary in the linear range represented by the featureloc. Sometimes (e.g. in bioperl) this is called -end- although this is confusing because it does not necessarily represent the 3-prime coordinate. Important: This is space-based (interbase) coordinates, counting from zero. No conversion is required to go from fmax to the rightmost coordinate in a base-oriented system that counts from 1 (e.g. GFF, Bioperl).
 +
|- class="tr0"
 +
|
 +
| is_fmax_partial
 +
| boolean
 +
| '' NOT NULL DEFAULT false ''<br /><br />This is typically false, but may be true if the value for column:fmax is inaccurate or the rightmost part of the range is unknown/unbounded.
 +
|- class="tr1"
 +
|
 +
| strand
 +
| smallint
 +
| '' ''<br /><br />The orientation/directionality of the location. Should be 0, -1 or +1.
 +
|- class="tr0"
 +
|
 +
| phase
 +
| integer
 +
| '' ''<br /><br />Phase of translation with respect to srcfeature_id. Values are 0, 1, 2. It may not be possible to manifest this column for some features such as exons, because the phase is dependent on the spliceform (the same exon can appear in multiple spliceforms). This column is mostly useful for predicted exons and CDSs.
 +
|- class="tr1"
 +
|
 +
| residue_info
 +
| text
 +
| '' ''<br /><br />Alternative residues, when these differ from feature.residues. For instance, a SNP feature located on a wild and mutant protein would have different alternative residues. For alignment/similarity features, the alternative residues are used to represent the alignment string (CIGAR format). Note on variation features: even if we do not want to instantiate a mutant chromosome/contig feature, we can still represent a SNP etc. with 2 locations, one (rank 0) on the genome, the other (rank 1) would have most fields null, except for alternative residues.
 +
|- class="tr0"
 +
|
 +
| locgroup
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />This is used to manifest redundant, derivable extra locations for a feature. The default locgroup=0 is used for the DIRECT location of a feature. Important: most Chado users may never use featurelocs with locgroup &gt; 0. Transitively derived locations are indicated with locgroup &gt; 0. For example, the position of an exon on a BAC and in global chromosome coordinates. This column is used to differentiate these groupings of locations. The default locgroup 0 is used for the main or primary location, from which the others can be derived via coordinate transformations. Another example of redundant locations is storing ORF coordinates relative to both transcript and genome. Redundant locations open the possibility of the database getting into inconsistent states; this schema gives us the flexibility of both warehouse instantiations with redundant locations (easier for querying) and management instantiations with no redundant locations. An example of using both locgroup and rank: imagine a feature indicating a conserved region between the chromosomes of two different species. We may want to keep redundant locations on both contigs and chromosomes. We would thus have 4 locations for the single conserved region feature - two distinct locgroups (contig level and chromosome level) and two distinct ranks (for the two species).
 +
|- class="tr1"
 +
|
 +
| rank
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />Used when a feature has &gt;1 location, otherwise the default rank 0 is used. Some features (e.g. blast hits and HSPs) have two locations - one on the query and one on the subject. Rank is used to differentiate these. Rank=0 is always used for the query, Rank=1 for the subject. For multiple alignments, assignment of rank is arbitrary. Rank is also used for sequence_variant features, such as SNPs. Rank=0 indicates the wildtype (or baseline) feature, Rank=1 indicates the mutant (or compared) feature.
 +
|}
  
Table 4.44: feature phenotype
+
{| width="100%" cellpadding="3"
 +
|+ featureloc Constraints
 +
|-
 +
! Name
 +
! Constraint
 +
|- class="tr0"
 +
| featureloc_c2
 +
| CHECK ((fmin &lt;= fmax))
 +
|}
  
  ColumnDatatype Description
+
Tables referencing this one via Foreign Key Constraints:
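{| border="1" cellpadding="3"
|+ feature_phenotype
|-
! Column
! Datatype
! Description
|-
| feature_phenotype_id
| integer
|
|-
| feature_id
| integer
|
|-
| phenotype_id
| integer
|
|}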
+
  
 +
* [[Chado_Tables#Table:_featureloc_pub| featureloc_pub]]
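
The fmin and fmax descriptions above state how interbase coordinates map to a base-oriented, 1-based system (add 1 to fmin, leave fmax unchanged). A minimal Python sketch of that conversion follows. The function names are hypothetical and chosen for illustration; only the arithmetic and the fmin &lt;= fmax check come from the table descriptions.

```python
from typing import Tuple


def interbase_to_one_based(fmin: int, fmax: int) -> Tuple[int, int]:
    """Convert interbase (fmin, fmax) to 1-based inclusive (start, end)."""
    if fmin > fmax:
        # Mirrors the featureloc_c2 check constraint.
        raise ValueError("featureloc requires fmin <= fmax")
    return fmin + 1, fmax


def one_based_to_interbase(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based inclusive (start, end) back to interbase."""
    return start - 1, end


def interbase_length(fmin: int, fmax: int) -> int:
    """Length of the located region; zero for points between bases."""
    return fmax - fmin


# The first three bases of a sequence: interbase 0..3, base-oriented 1..3.
print(interbase_to_one_based(0, 3))
# An insertion point between bases 10 and 11 has zero length in interbase.
print(interbase_length(10, 10))
```

A zero-length location such as (10, 10) is exactly the case the interbase system represents cleanly; the same point in a 1-based inclusive system would come out as (11, 10), with the start after the end.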
  
featuremap
+
----
  
  
NOTE: this module is all due for revision...
 
  
 +
== Table: featureloc_pub ==
  
Table 4.45: featuremap
+
Provenance of featureloc. Linking table between featurelocs and publications that mention them.
  
  Column  Datatype Description
+
{| border="1" cellpadding="3"
  featuremap id integer
+
|+ featureloc_pub Structure
  name varchar
+
|-
  descriptiontext
+
! F-Key
  unittype idinteger
+
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| featureloc_pub_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_featureloc| featureloc]]
 +
| featureloc_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_pub| pub]]
 +
| pub_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|}
  
 +
----
  
  
featurerange
 
  
 +
== Table: featureprop ==
  
 +
A feature can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible.
  
 +
{| border="1" cellpadding="3"
 +
|+ featureprop Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| featureprop_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_feature| feature]]
 +
| feature_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| type_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. Certain property types will only apply to certain feature types (e.g. the anticodon property will only apply to tRNA features) ; the types here come from the sequence feature property ontology.
 +
|- class="tr1"
 +
|
 +
| value
 +
| text
 +
| '' ''<br /><br />The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
 +
|- class="tr0"
 +
|
 +
| rank
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />Property-Value ordering. Any feature can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.
 +
|}
  
 +
Tables referencing this one via Foreign Key Constraints:
  
Table 4.46: featurerange
+
* [[Chado_Tables#Table:_featureprop_pub| featureprop_pub]]
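
The slot-value model of featureprop, with rank ordering multi-valued properties, can be sketched in Python as below. This is an illustrative sketch, not Chado code: the row tuples stand in for featureprop rows joined to the cvterm name, and <code>props_for</code> and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (feature_id, property type name, value, rank)
PropRow = Tuple[int, str, str, int]


def props_for(rows: Iterable[PropRow], feature_id: int) -> Dict[str, List[str]]:
    """Group one feature's property values by type, ordered by rank."""
    grouped: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for fid, type_name, value, rank in rows:
        if fid == feature_id:
            grouped[type_name].append((rank, value))
    # Sorting the (rank, value) pairs orders each list by rank.
    return {t: [v for _, v in sorted(pairs)] for t, pairs in grouped.items()}


rows = [
    (1, "note", "second note", 1),
    (1, "anticodon", "CAT", 0),  # single-valued: default rank 0
    (1, "note", "first note", 0),
    (2, "note", "belongs to another feature", 0),
]
props = props_for(rows, 1)
print(props)
```

All values stay as text, as in the value column; any numeric interpretation would be up to the caller.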
  
Column Datatype  Description
+
----
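{| border="1" cellpadding="3"
|+ featurerange
|-
! Column
! Datatype
! Description
|-
| featurerange_id
| integer
|
|-
| featuremap_id
| integer
|
|-
| feature_id
| integer
|
|-
| leftstartf_id
| integer
|
|-
| leftendf_id
| integer
|
|-
| rightstartf_id
| integer
|
|-
| rightendf_id
| integer
|
|-
| rangestr
| varchar
|
|}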
+
  
  
featurepos
 
  
 +
== Table: featureprop_pub ==
  
 +
Provenance. Any featureprop assignment can optionally be supported by a publication.
  
Table 4.47: featurepos
+
{| border="1" cellpadding="3"
 +
|+ featureprop_pub Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| featureprop_pub_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_featureprop| featureprop]]
 +
| featureprop_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_pub| pub]]
 +
| pub_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|}
  
ColumnDatatype Description
+
----
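{| border="1" cellpadding="3"
|+ featurepos
|-
! Column
! Datatype
! Description
|-
| featurepos_id
| integer
|
|-
| featuremap_id
| integer
|
|-
| feature_id
| integer
|
|-
| map_feature_id
| integer
|
|-
| mappos
| float
|
|}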
+
  
  
  
 +
== Table: synonym ==
  
 +
A synonym for a feature. One feature can have multiple synonyms, and the same synonym can apply to multiple features.
  
featuremap pub
+
{| border="1" cellpadding="3"
 +
|+ synonym Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| synonym_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
| name
 +
| character varying(255)
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The synonym itself. Should be human-readable machine-searchable ascii text.
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| type_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />Types would be symbol and fullname for now.
 +
|- class="tr1"
 +
|
 +
| synonym_sgml
 +
| character varying(255)
 +
| '' NOT NULL ''<br /><br />The fully specified synonym, with any non-ascii characters encoded in SGML.
 +
|}
  
 +
Tables referencing this one via Foreign Key Constraints:
  
map feature id links to the feature (map) upon which the feature is
+
* [[Chado_Tables#Table:_feature_synonym| feature_synonym]]
  
 +
* [[Chado_Tables#Table:_library_synonym| library_synonym]]
  
Table 4.48: featuremap pub
+
----
  
ColumnDatatype Description
+
[[Category:BLAST]]
featuremap pub id integer
+
[[Category:Chado]]
featuremap id  integer
+
[[Category:Chado Modules]]
pub id integer
+

Latest revision as of 22:17, 18 December 2013

Introduction

A central module in Chado is the sequence module. The fundamental table within this module is the feature table, for describing biological sequence features. Chado defines a feature to be a region of a biological polymer (typically a DNA, RNA, or a polypeptide molecule) or an aggregate of regions on this polymer. As the term is used here, region can be the entire extent of the molecule, or a junction between two bases. Features can be typed according to an ontology, they can be localized relative to other features, and they can form part-whole and other relationships with other features.

You may find these related documents useful:

Features


Chado does not distinguish between a sequence and a sequence feature, on the theory that a feature is a piece of a sequence, and a piece of a sequence is a sequence. Both are represented as a row in the feature table.

There are many different types of features. Examples include gene, exon, transcript, regulatory region, chromosome, sequence variation, polypeptide, protein domain and cross-genome match regions. Chado does not have a different table for each kind of feature; all features are stored in the feature table.

Feature types are taken from the Sequence Ontology (SO) controlled vocabulary (see also the Controlled Vocabulary module, also known as cv). Types of feature are differentiated using a type_id column, which is a foreign key to the cvterm table in the cv (ontology) module. This allows us to type features according to the Sequence Ontology. The use of ontologies to type tables gives Chado a subtyping mechanism, which is absent from the standard relational model. For example, SO tells us that mRNA and snRNA are different kinds of transcript. This is discussed in more detail in the next section. For the purposes of discussion in this document, it can be assumed that any reference to genes, exons, polypeptides, SNPs, chromosomes, transcripts, various kinds of RNAs and so on refers to features of that Sequence Ontology type.
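
The subtyping idea can be illustrated with a small Python sketch. It is a toy under stated assumptions: the <code>IS_A</code> map is a hand-written, simplified fragment (the real Sequence Ontology places intermediate terms between some of these), and <code>is_subtype</code> is a hypothetical helper, not a Chado function. In a Chado database the equivalent links would be stored in the cv module rather than in code.

```python
from typing import Dict, Optional

# Toy is_a links (child -> parent). Simplified for illustration; the real
# Sequence Ontology has intermediate terms between some of these.
IS_A: Dict[str, Optional[str]] = {
    "mRNA": "transcript",
    "snRNA": "transcript",
    "transcript": "sequence_feature",
    "exon": "sequence_feature",
    "sequence_feature": None,
}


def is_subtype(term: str, ancestor: str) -> bool:
    """True if `term` equals `ancestor` or reaches it by following is_a links."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = IS_A.get(current)
    return False


# A query for "all transcripts" should match mRNA and snRNA features,
# but not exons.
print([t for t in ("mRNA", "snRNA", "exon") if is_subtype(t, "transcript")])
```

This is what a single flat feature table gains from ontology typing: one type_id column, with "kind of" questions answered by walking the ontology.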

A selection of Chado-relevant types from SO are shown below:

Sequence Ontology Examples
SO Term SO id
Exon SL:0000025
Intron SL:0000027
mRNA SL:0000037
miRNA SL:0000044
regulatory_element SL:0000052
transcription_factor_binding_site SL:0000054


The Chado feature table has a text-valued column named residues for storing the sequence of the feature. The value of this column is a string of IUPAC symbols corresponding to the sequence of biochemical residues encoded by the feature. This column is optional, because the sequence of the feature may not be known. Even if the sequence of a feature is known, it may not be desirable to store it in the feature table, as it may be possible to infer the sequence from the sequence of other features in the database. For example, exon sequences are generally not stored, as these can trivially be inferred from the sequence of the genomic feature on which the exon is located. In contrast, mRNA and other processed transcript sequences are stored, as it is less trivial and more computationally expensive to dynamically splice together the mRNA sequence.
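
As a rough illustration of why exon sequences need not be stored, the following Python sketch derives a subsequence from the residues of the feature it is located on. It is a sketch under assumptions: <code>infer_residues</code> is a hypothetical helper, the coordinates are interbase fmin/fmax as used by featureloc, and the complement table covers only the unambiguous DNA symbols A, C, G and T rather than the full IUPAC alphabet.

```python
# Complement table for unambiguous DNA symbols only (an assumption of this
# sketch; full IUPAC ambiguity codes are not handled).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def infer_residues(src_residues: str, fmin: int, fmax: int, strand: int = 1) -> str:
    """Derive a located feature's sequence from its source feature.

    fmin/fmax are interbase coordinates, so they map directly onto a
    Python slice; strand -1 yields the reverse complement.
    """
    sub = src_residues[fmin:fmax]
    return reverse_complement(sub) if strand == -1 else sub


chromosome = "ACGTACGT"
print(infer_residues(chromosome, 2, 5))      # forward-strand exon
print(infer_residues(chromosome, 2, 5, -1))  # same span, reverse strand
```

Splicing an mRNA would mean repeating this for every exon in rank order and concatenating the results, which is the extra work the paragraph above gives as the reason for storing processed transcript sequences.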

It is important to realize that the existence of a row in the feature table does not necessarily imply that the feature has been characterized as a result of genome annotation. It is possible to have features of SO type gene for genes that have only been characterized through genetic studies, and for which neither sequence nor sequence location is known. This is in contrast to other feature schemas (such as GFF) in which it is not possible to represent features without representing a location in sequence coordinates. This design decision is crucial for the use of Chado as a database for integrating information about the same entity from multiple perspectives.

Because the sequence is stored as a column in the feature table rather than as an independent table, sequences cannot exist in the absence of a row in the feature table; sequences are dependent upon features. This is in contrast with almost all other genomics schemas, which allow independent treatment of sequences and features. This design decision was made for both philosophical and pragmatic reasons. The feature table also contains the columns seqlen and md5checksum, for storing the length of the sequence and the 32-character checksum computed using the MD5 algorithm [RL Rivest. RFC 1321: The MD5 message-digest algorithm. Technical report, Internet Activities Board, April 1992]. The length and checksum can be stored even when the residues column is null valued. The checksum is useful for checking if two or more features share the same sequence, without comparing the entire sequence string.
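
A minimal Python sketch of how the seqlen and md5checksum values could be derived from a residues string is shown here. The helper name <code>sequence_summary</code> is hypothetical, and the sketch assumes the checksum is taken over the residues exactly as stored (so it is case-sensitive); a given Chado installation may normalize sequences differently before hashing.

```python
import hashlib
from typing import Tuple


def sequence_summary(residues: str) -> Tuple[int, str]:
    """Return (seqlen, md5checksum) for a residues string.

    The checksum is the 32-character hexadecimal MD5 digest of the
    residues as given; no case normalization is applied in this sketch.
    """
    digest = hashlib.md5(residues.encode("ascii")).hexdigest()
    return len(residues), digest


seqlen_a, md5_a = sequence_summary("ATGGCCTAA")
seqlen_b, md5_b = sequence_summary("ATGGCCTAA")
seqlen_c, md5_c = sequence_summary("ATGGCCTGA")

# Equal checksums indicate the same sequence without comparing the strings.
print(seqlen_a, md5_a == md5_b, md5_a == md5_c)
```

Because both values are functions of residues, code that updates residues would need to recompute them in the same step to keep the three columns consistent.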

The existence of these columns means that this table is no longer in third normal form (3NF), which is usually a desirable formal property of a relational database. On balance, the utility of these columns outweighs the disadvantages of violating 3NF. In practical terms, it means that the values of the residues, seqlen and md5checksum columns are interdependent and cannot be updated independently of one another.

The feature table has a Boolean valued column, is_analysis, indicating whether this is an annotation or a computed feature from a computational analysis. Annotations are features that are generated or blessed by a human curator, or in some cases by an integrated genome pipeline (for example, MAKER or DIYA) capable of synthesizing gene models and other annotations from in silico analyses. They constitute the definitive version of a particular feature, in contrast to the features generated by gene prediction programs and sequence similarity searches such as BLAST.

The feature table has a dbxref_id column that refers to a global, stable public identifier for the feature. This column is optional, because not all classes of features have such identifiers; for example, features resulting from gene predictions and BLAST HSP features may be less stable and thus lack public identifiers. It is recommended that most annotated features have dbxref_ids. The organism_id column refers to a row in the organism table (defined in the organism module). This column is mandatory if the feature derives from a single organism.

Names of Features

The name and uniquename columns allow features to be labelled. The name column is optional, but it is recommended that all annotated features (as opposed to those that arise from purely computational methods) have names. The name should be a simple, concise, human-friendly display label (such as a gene or gene product symbol, as defined by the nomenclature rules governing the organism). User interface software (such as GBrowse and Apollo) can use the name column for labelling feature glyphs in user displays. Uniqueness of name within any particular organism or genome project is a desirable characteristic, but is not enforced in the schema, since there are occasions where name clashes are unavoidable. In contrast, the uniquename column is required, and is guaranteed to be unique when taken in combination with organism_id and type_id; this is enforced by a constraint in the relational schema. The unique name may be human-friendly (for example, it can be the same as the name); however, it is not guaranteed to be so, and in general should not be displayed to the end user. Its use is mainly as an alternate unique key on the table.

The unique name normally conforms to some naming rule; these rules may vary across Chado instances, but they should all guarantee the uniqueness of the (uniquename, organism_id, type_id) triple.


Feature-tables.png

Feature Synonyms

In addition to having a name or symbol, it is common for features such as genes to have multiple synonyms or aliases. These synonyms may exist due to different publications referring to the same gene with different symbols, or because one gene was once believed to be two or more separate genes. A common curation operation on genes is splitting and merging, which results in the creation of synonyms.

This is modelled in Chado with a synonym table and a feature_synonym linking table; thus multiple features can potentially share the same synonym, and a single feature can have multiple synonyms. Use of a synonym in the literature is indicated with a pub_id foreign key referencing the pub table (see the publications module), indicating historical provenance for the use of a synonym.

Feature synonyms are found by joining to feature_synonym and synonym. For example, here is a query to find a gene by name or synonym:

-- features whose display name matches
SELECT feature_id FROM feature
WHERE name = 'name of interest'
UNION
-- features with a current synonym that matches
SELECT fs.feature_id
FROM feature_synonym fs
JOIN synonym s ON fs.synonym_id = s.synonym_id
WHERE s.name = 'name of interest'
AND fs.is_current;


Feature Locations

Features can potentially be localized using a sequence coordinate system. A relative localization model is used, so all feature localizations must be relative to another feature. Some features such as those of type chromosome are not localized in sequence coordinates. Locations are stored in the featureloc table, also part of this sequence module. Other non-sequence oriented kinds of localization (such as physical localization from in situ experiments, or genetic localizations from linkage studies) are modelled outside the sequence module (for example, in the expression module or map module).

A feature can have zero or more featurelocs, although it will typically have either one (for localized features for which the location is known) or zero (for unlocalized features such as chromosomes, or for features for which the location is not yet known, such as a gene discovered using classical genetics techniques). Features with multiple featurelocs will be explained later.

A featureloc is an interval in interbase sequence coordinates (see figure), bounded by the fmin and fmax columns, each representing the lower and upper linear position of the boundary between bases or base pairs, with directionality indicated by the strand column. Interbase coordinates were chosen over the more commonly used base-oriented coordinate system because they are more naturally amenable to the standard arithmetic operations that are typically performed upon sequence coordinates. This leads to cleaner and more efficient database coding logic that is arguably less prone to errors. Of course, interbase coordinates are typically transformed into the more common base-oriented system used by BLAST reports and so forth prior to presentation to the end-user.
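As an illustration of why interbase coordinates simplify arithmetic, the sketch below converts between the two systems and computes lengths and overlaps. The function names are hypothetical and not part of Chado; the base-oriented system is assumed to be 1-based and inclusive, as in BLAST reports and GenBank records.

```python
def interbase_to_base(fmin, fmax):
    """Convert an interbase interval to 1-based, inclusive base coordinates."""
    if fmin > fmax:
        raise ValueError("fmin must not exceed fmax")
    if fmin == fmax:
        # A zero-length interval is a junction between bases and covers no base.
        raise ValueError("zero-length interval has no base-oriented range")
    return fmin + 1, fmax


def base_to_interbase(start, end):
    """Convert 1-based, inclusive base coordinates to an interbase interval."""
    return start - 1, end


def length(fmin, fmax):
    # In interbase coordinates the length is a plain difference, with no "+ 1".
    return fmax - fmin


def overlaps(a_fmin, a_fmax, b_fmin, b_fmax):
    # Two intervals share at least one base when each starts before the other ends.
    return a_fmin < b_fmax and b_fmin < a_fmax
```

For example, the base-oriented range 914..1063 corresponds to fmin = 913 and fmax = 1063, a length of 150.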

The relational schema includes a constraint which ensures that fmin <= fmax is always true, and any attempt to set the database in a state which violates this will raise an error.

As mentioned previously, a featureloc must be localized relative to another feature, indicated using the srcfeature_id foreign key column, referencing the feature table. There is nothing in the schema prohibiting localization chains; for example, locating an exon relative to a contig that is itself localized relative to a chromosome (see figure). The majority of Chado database instances will not require this flexibility; features are typically located relative to chromosomes or chromosome arms. Nevertheless, the ability to store such localization networks or location graphs can be useful for unfinished genomes or parts of genomes such as heterochromatin, in which it is desirable to locate features relative to stable contigs or scaffolds, which are themselves localized in an unstable assembly to chromosomes or chromosome arms. Localization chains do not necessarily only span assemblies; protein domains may be localized relative to polypeptide features, themselves localized to a transcript (or to the genome, as is more common). Chains may also span sequence alignments.


Featureloc-example.png


The Feature Location Graph

We will now present a short formal treatment of the properties of these hierarchies of localization using graph theory. This treatment can be ignored for the purposes of understanding the basics of the Chado schema; the end-user of the database will be entirely unaware of such technicalities. However, for the purposes of software engineering and ensuring interoperability between different Chado database instances and different applications, formal treatments such as these are an essential requirement for software specifications.

We can define a featureloc graph (LG) as being a set of vertices and edges, with each feature constituting a vertex, and each featureloc constituting an edge going from the vertex of the located feature (feature_id) to the vertex of the source feature (srcfeature_id). Each vertex is labeled with column values from the feature table, and each edge is labeled with column values from the featureloc table. The LG is not allowed to contain cycles; it is a directed acyclic graph (DAG). This includes self-cycles: no feature may be localized relative to itself.

The roots of the LG are the features that do not have featureloc rows, typically chromosomes or chromosome arms, although LG roots may also be unassembled contigs, scaffolds or features for which sequence localization is not yet known (such as genes discovered through classical genetics techniques). The leaves of the LG are any features that are not present as a srcfeature_id in any featureloc row; these are typically the bulk of features, such as genes, exons, matches and so on. The depth of a particular LG g, denoted D(g), is the maximum number of edges between any leaf-root pair. As has been previously noted, many Chado instances will have LGs with a uniform depth of 1. Such LGs are said to be simple and the features within them are said to be singletons. The maximum depth of all LGs in a particular database instance i is denoted LGDmax(i).
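The depth defined above can be computed from the (feature_id, srcfeature_id) pairs alone. The following is a sketch under the assumption that the pairs have already been read from featureloc; the function name lg_depth and the feature names in the usage note are hypothetical.

```python
def lg_depth(featurelocs):
    """Return the maximum number of edges between any leaf and any root.

    featurelocs is an iterable of (feature_id, srcfeature_id) pairs, one per edge.
    """
    sources = {}
    for feature_id, srcfeature_id in featurelocs:
        sources.setdefault(feature_id, set()).add(srcfeature_id)
        sources.setdefault(srcfeature_id, set())

    cache = {}

    def depth(node, path):
        # Revisiting a node on the current path means the graph is not a DAG.
        if node in path:
            raise ValueError(f"cycle through feature {node!r}")
        if node not in cache:
            srcs = sources[node]
            # A root has no featureloc of its own, so its depth is 0.
            cache[node] = (1 + max(depth(s, path | {node}) for s in srcs)) if srcs else 0
        return cache[node]

    return max((depth(node, frozenset()) for node in sources), default=0)
```

An exon located on a contig that is itself located on a chromosome gives a depth of 2; a database in which every feature is located directly on a chromosome gives a depth of 1.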


Featureloc-graph-example.png


The schema does not constrain the maximum depth of the LG. This flexibility proves useful when applying Chado to the highly variable needs of multiple different genome projects; however, it can lead to efficiency problems when querying the database. It can also make it more difficult to write software to interoperate with the database, as the software must take into account different contingencies. We can solve this problem by collapsing the LG, in which a graph of arbitrary depth is flattened to a depth of 1, transforming or projecting featurelocs onto the root features (typically chromosomes or chromosome arms). The original featurelocs are left unaltered in the database, and additional redundant featurelocs between leaf and root features are added to the database. These new featurelocs are known as inferred featurelocs. In the schema, inferred featurelocs are differentiated from direct featurelocs using the locgroup column. Direct (non-inferred) localizations are indicated by the locgroup column taking the value 0, and inferred (transitive) localizations are indicated by this column having a value greater than 0.
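One step of this projection can be sketched as follows. This is an illustration rather than the procedure used by any Chado tool: the Loc tuple and project function are hypothetical, and the sketch assumes that the intermediate feature (for example a contig) is located over its full length on its own source, so that its coordinate 0 maps to one end of that location.

```python
from collections import namedtuple

Loc = namedtuple("Loc", "srcfeature_id fmin fmax strand locgroup")


def project(child, parent):
    """Project child (located on some source) through parent (that source's own location)."""
    if parent.strand == -1:
        # The source is reverse-complemented on its own source, so coordinates flip.
        fmin = parent.fmax - child.fmax
        fmax = parent.fmax - child.fmin
        strand = -child.strand
    else:
        fmin = parent.fmin + child.fmin
        fmax = parent.fmin + child.fmax
        strand = child.strand
    # locgroup 1 marks the result as inferred (direct locations use 0).
    return Loc(parent.srcfeature_id, fmin, fmax, strand, 1)


exon_on_contig = Loc("contig1", 100, 200, 1, 0)
contig_forward = Loc("chr2L", 5000, 6000, 1, 0)
contig_reverse = Loc("chr2L", 5000, 6000, -1, 0)

inferred_forward = project(exon_on_contig, contig_forward)
inferred_reverse = project(exon_on_contig, contig_reverse)
```

Applying project repeatedly along a chain flattens a deeper graph; the direct featureloc rows are kept and the inferred rows are stored alongside them.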

The terminology used above can be used to define specifications for applications intended to interoperate with the database. Certain kinds of features have paired locations. These include hits and high-scoring-pairs (HSPs) coming from sequence search programs such as BLAST, and syntenic chromosomal regions. These kinds of features have two featurelocs (in contrast to the usual one): one on the query feature and one on the subject (hit) feature. We differentiate the two featurelocs with the rank column. A rank of 0 indicates a location relative to the query (as is the default for most features), and a rank of 1 indicates a location relative to the subject (hit) feature.

For multiple alignments (e.g. CLUSTALW results), this scheme is extended to unbounded ranks [0..n], with arbitrary ordering. Alignments are stored in the residue_info column. CIGAR format is used for pairwise alignments.

Multiple featurelocs may also be required for features of type "sequence variant" (SO:0000109), indicating points or extents which vary between reference and non-reference sequences. From a modelling standpoint, variants are conceptually similar to alignments; with variants we are noting a difference as opposed to a similarity. Here a rank of zero indicates the wild-type (or reference) feature and a rank of one or more indicates the variant (or non-reference) feature, with the residue_info column representing the sequence on wild-type and variant. A featureloc is uniquely identified by the (feature_id, rank, locgroup) triple. This means that no feature can have more than one featureloc with the same rank and locgroup. In other words, rank and locgroup uniquely identify a featureloc for any particular feature.

Feature Coordinates

Features are located relative to other features using the featureloc table rows. Features can be located on more than one sequence. For example, a BLAST hit HSP can be a feature of both the query and target sequences. To locate a feature, create a featureloc record with:

  • srcfeature_id = the id of the sequence on which the feature is being located
  • feature_id = the id of the feature being located
  • strand = 1 for the positive strand, -1 for the negative strand, and 0 for both or indifferent
  • fmin, fmax – the minimum and maximum coordinates of the interval
  • is_fmin_partial, is_fmax_partial = true if needed to indicate that the sequence is incomplete (e.g. for ESTs or EST assemblies which are known to not go all the way to the 3’ or 5’ end.)
  • phase = 0, 1, or 2 – denotes phase of first base pair in a nucleotide feature with respect to a source protein, or the offset of the first nucleotide in its codon.
  • rank, locgroup – these are used to organize groups of feature locations and can be ignored in simple cases (the details are discussed below).


Multiple Locations for a Feature

The ability to have multiple locations for a feature has many uses. For example, one can locate a SNP, exon, or protein motif on the genome, on a transcript, and on a protein. A region of similarity between two sequences (HSP) can be located on both of them, so if either is viewed the “hit” is visible.


Difference Between the Chado Location Model and Other Schemas

There is a crucial difference between the Chado location model and the sequence location model used in other schemas, such as GFF, GenBank, BioSQL, or BioPerl.

First, Chado is the only model to use the concept of rank and locgroup. Second, and perhaps more important, all these other models allow discontiguous locations (also known as "split locations"). These will be familiar to anyone who has inspected GenBank annotated DNA records for an organism that has introns within the transcripts; the transcript location is modelled as a sequence of non-contiguous intervals on the genome. Each interval represents the location of an exon. For example:

            /gene="Acph"
    CDS     join(914..1063, 1143..1241, 1297..1536, 1605..2054,
                 2667..2925, 3063..3172)

Although Chado allows a feature to have multiple locations, this is only with variable rank and locgroup and this is enforced by a uniqueness constraint in the relational schema. We made a conscious decision to avoid discontiguous locations, because the extra degree of freedom this affords results in either redundancies or ambiguities. Redundancies arise when exons are stored in addition to a discontiguous transcript, and ambiguities arise by virtue of the fact that explicit representation of the exons may be seen as optional. Ambiguities are undesirable as it makes it harder for databases to interoperate. The omission of discontiguous locations does not restrict the expressive capacity of Chado in any way, because any discontiguous location can be modelled as a collection of features with contiguous locations. For example, a transcript with a discontiguous location can be modelled as a collection of exons with contiguous featurelocs, and a transcript with a single contiguous featureloc representing the outer boundaries defined by the outermost exons.
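To make the last point concrete, the sketch below turns the GenBank-style split location shown earlier into contiguous interbase intervals: one per exon, plus one bounding interval for the transcript. It is an illustration only; the function name is hypothetical, each interval is treated as an exon (as in the text), and only forward-strand join(...) strings without complement(...) are handled.

```python
import re


def split_location_to_featurelocs(location):
    """Turn a GenBank-style join(...) string into contiguous interbase intervals."""
    # Each "start..end" pair is 1-based and inclusive in the GenBank flat file.
    pairs = re.findall(r"(\d+)\.\.(\d+)", location)
    exons = [(int(start) - 1, int(end)) for start, end in pairs]
    # The transcript gets one contiguous interval bounded by the outermost exons.
    transcript = (min(fmin for fmin, _ in exons), max(fmax for _, fmax in exons))
    return transcript, exons


transcript_loc, exon_locs = split_location_to_featurelocs(
    "join(914..1063, 1143..1241, 1297..1536, 1605..2054, 2667..2925, 3063..3172)"
)
```

Each (fmin, fmax) pair would become one featureloc row, and the exon features would be linked to the transcript through feature_relationship rather than through a split location.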

Feature Rank

The rank field is used when a feature has more than 1 location, otherwise the default rank value of 0 is used. Some features have two locations, for example BLAST hits and HSPs: one location on the query, rank = 0, and one location on the subject, rank = 1.
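The paired locations of an HSP, and the uniqueness of the (feature_id, locgroup, rank) triple, can be tried out with a cut-down table. This sketch uses an in-memory SQLite database purely for convenience; the table is a simplified stand-in for the real featureloc table, and the feature_id values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A cut-down stand-in for featureloc; the real table has more columns and foreign keys.
conn.execute(
    """
    CREATE TABLE featureloc (
        feature_id    INTEGER NOT NULL,
        srcfeature_id INTEGER,
        fmin          INTEGER,
        fmax          INTEGER,
        strand        INTEGER,
        locgroup      INTEGER NOT NULL DEFAULT 0,
        rank          INTEGER NOT NULL DEFAULT 0,
        UNIQUE (feature_id, locgroup, rank)
    )
    """
)

INSERT = (
    "INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax, strand, rank) "
    "VALUES (?, ?, ?, ?, ?, ?)"
)
HSP, QUERY, SUBJECT = 1, 10, 20  # hypothetical feature_id values

conn.execute(INSERT, (HSP, QUERY, 0, 120, 1, 0))         # rank 0: location on the query
conn.execute(INSERT, (HSP, SUBJECT, 4300, 4420, -1, 1))  # rank 1: location on the subject

try:
    # A second rank-1 location in the same locgroup violates the uniqueness constraint.
    conn.execute(INSERT, (HSP, SUBJECT, 9000, 9120, 1, 1))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

hsp_locations = conn.execute(
    "SELECT srcfeature_id, rank FROM featureloc WHERE feature_id = ? ORDER BY rank",
    (HSP,),
).fetchall()
```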


Extensible Feature Properties

The feature table has a fairly limited set of columns for recording feature data. For example, there is no anticodon column for recording the RNA triplet for the adapter in a tRNA feature (all feature types, including tRNAs, are recorded as rows in the feature table). If we were to add columns such as anticodon then the number of columns in the table would become very large and difficult to manage; most would end up being nullable (for example, anticodon does not apply to non-tRNA features). This is because different organisms, different types of feature and different projects have differing needs regarding what extra data should be attached to any one feature. How then are we to attach both biologically relevant and project specific data to features?

Chado solves this by using an extensible mechanism for attaching attribute-value pairs to features via the featureprop table. The featureprop.type_id foreign key column references a property in the Sequence Ontology. The value text column stores the value filler for that property. Sets or lists of values for any property can be stored in the featureprop table, differentiated by the value of the rank column. Provenance for the featureprop assignment is stored using the featureprop_pub table in the publications module, allowing multiple publications to be associated with any one assignment.

Because featureprop values can be of an arbitrary size, they are modelled using a SQL TEXT type. This has some disadvantages from a query efficiency perspective.

Numeric values cannot be indexed correctly, and sorting the results of a query can only be done via a SQL casting operation, or in software outside of the database management system, either of which may result in poorer performance. This is one of several areas in Chado where performance has been traded in favour of a simpler, more abstract and generic model.
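The sorting problem is easy to reproduce. The sketch below stores three numeric property values as text in a simplified featureprop table (an in-memory SQLite stand-in, with a hypothetical type_id) and compares a plain ORDER BY with one that casts the value first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for featureprop; the real table has a primary key and foreign keys.
conn.execute(
    "CREATE TABLE featureprop (feature_id INTEGER, type_id INTEGER, value TEXT, rank INTEGER)"
)
SCORE = 7  # hypothetical cvterm_id for a numeric property
conn.executemany(
    "INSERT INTO featureprop VALUES (?, ?, ?, 0)",
    [(1, SCORE, "9"), (2, SCORE, "10"), (3, SCORE, "100")],
)

# Sorting the TEXT column directly compares the values character by character.
as_text = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM featureprop WHERE type_id = ? ORDER BY value", (SCORE,)
    )
]
# Casting restores numeric order, at the cost of converting every value at query time.
as_number = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM featureprop WHERE type_id = ? ORDER BY CAST(value AS INTEGER)",
        (SCORE,),
    )
]
```

Here as_text comes back in lexical order ("10", "100", "9"), while as_number is in numeric order ("9", "10", "100").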

Linking Features to External Databases

Public database identifiers are stored in the dbxref table, which holds the database name, the accession number, and an optional version number. Note that this table holds accession numbers published internally by the Chado instance as well as by other databases. A feature can have a primary dbxref, which is linked directly from the feature table. It can also have additional secondary dbxrefs linked via feature_dbxref. A feature need not have a primary dbxref; e.g. computed features may be considered “lightweight” and not assigned accession numbers. Some groups may wish to set up a trigger to automatically assign primary dbxrefs to features of types that are locally accessioned; a sample trigger is provided with the schema.


Feature Annotations

Detailed annotations, such as associations to Gene Ontology (GO) terms or Cell Ontology terms, can be attached to features using the feature_cvterm linking table. This allows multiple ontology terms to be associated with each feature.

Provenance data can be attached with the feature_cvtermprop and feature_cvterm_dbxref higher-order linking tables. It is up to the curation policy of each individual Chado database instance to decide which kinds of features will be linked using feature_cvterm. Some may link terms to gene features, others to the distinct gene products (processed RNAs and polypeptides) that are linked to the gene features.

Annotations for existing features can also go into the featureprop table using the Chado feature_property ontology (defined in chado/load/etc/feature_property.obo) and the comment or description terms as appropriate. The purpose of the feature property ontology (and the related chado/load/etc/genbank_feature_property.obo file) is to capture terms that are likely to appear in GFF or GenBank sequence files. In theory there is no overlap between these ontologies and the Sequence Ontology.

Relationships Between Features

Biological features are inter-related; exons are part of transcripts, transcripts are part of genes, and polypeptides are derived from messenger RNAs. Relationships between individual features are stored in the feature_relationship table, which connects two features via the subject_id and object_id columns (foreign keys referring to the feature table) and a type_id (a foreign key referring to a relationship type in an ontology, either SO or the OBO relationship ontology, OBO-REL) indicating the nature of the relationship between the subject and object features.

The core relationships between features are part-whole (part_of) or temporal (derives_from). Subject and object describe the linguistic roles the two features play in a sentence describing the feature relationship. In English, many sentences follow a subject, predicate, object syntax, and word order is important. To say that "exons are part of transcripts" is the correct way to describe a typical biological relationship. To say "transcripts are part of exons" is incorrect, either grammatically or biologically.

We use this same terminology (which comes from RDF) again in the cv module. The collection of features and feature relationships can be considered as vertices and edges in a graph, known as the Feature Graph (FG). Example feature graphs are shown above and in the Introduction to Chado.

The FG is independent of the LG and in general the FG and the LG should have no edges in common. If there is a featureloc connecting two features, then the addition of a feature relationship between these same two features is redundant. The FG is required in order to query the database for such things as alternately spliced genes, exons shared between transcripts, etc.
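As a sketch of the kind of question the FG answers, the following finds exons shared between transcripts and genes with more than one transcript. The relationship rows and feature names are hypothetical, and relationship types are shown by name rather than by type_id for readability.

```python
from collections import defaultdict

# (subject, type, object) rows as they might appear in feature_relationship.
relationships = [
    ("exon1", "part_of", "mRNA-A"),
    ("exon2", "part_of", "mRNA-A"),
    ("exon2", "part_of", "mRNA-B"),
    ("exon3", "part_of", "mRNA-B"),
    ("mRNA-A", "part_of", "gene1"),
    ("mRNA-B", "part_of", "gene1"),
]
transcripts = {"mRNA-A", "mRNA-B"}

# Map each part to the set of wholes it belongs to.
wholes = defaultdict(set)
for subject, rel_type, obj in relationships:
    if rel_type == "part_of":
        wholes[subject].add(obj)

# An exon is shared when it is part of more than one transcript.
shared_exons = sorted(e for e, ws in wholes.items() if len(ws & transcripts) > 1)

# A gene is alternately spliced here when more than one transcript is part of it.
transcripts_per_gene = defaultdict(set)
for transcript in transcripts:
    for gene in wholes[transcript]:
        transcripts_per_gene[gene].add(transcript)
alternately_spliced = sorted(g for g, ts in transcripts_per_gene.items() if len(ts) > 1)
```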

Although the Chado schema admits any FG, certain configurations are biologically meaningless, and should not be used. The FG can be constrained by the Sequence Ontology. Standardized FG structures are required for complex applications to be interoperable.

Unlike the LG, the FG may be cyclic, although cycles in the FG are not common. The subset of the FG corresponding to certain kinds of relationship may be acyclic; for example, the subset of the FG connecting parts with wholes via part_of must be acyclic.
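A check of this property could look like the sketch below, which runs a depth-first search over the part_of edges only. The function name has_cycle is hypothetical, and the edges are assumed to have been filtered by relationship type beforehand.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (subject, object) pairs has a cycle."""
    graph = {}
    for subject, obj in edges:
        graph.setdefault(subject, []).append(obj)
        graph.setdefault(obj, [])

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = dict.fromkeys(graph, UNSEEN)

    def visit(node):
        state[node] = IN_PROGRESS
        for nxt in graph[node]:
            # Reaching a node that is still being explored closes a cycle.
            if state[nxt] == IN_PROGRESS:
                return True
            if state[nxt] == UNSEEN and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[n] == UNSEEN and visit(n) for n in graph)
```

A well-formed gene model such as exon part_of mRNA part_of gene passes; a pair of features that are each part_of the other does not.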

Compliance

This section is not complete, it is in progress.

Chado uses a layered model - this is tried and tested in software engineering. Some generic software can be targeted at the lower layers and be guaranteed to work no matter what. Other more specific software needs a more tightly defined rigorous model and should be targeted at the upper layers.

We require validation software and more formal or computable descriptions of these layers and policies - for now natural language descriptions will have to suffice.

Chado Compliance Layers

Proposal for levels of compliance.

Level 0: Relational Schema

Level 0 conformance basically means the schema is adhered to. Obviously, this is enforced by the DBMS.

Level 1: Ontologies

Level 1 conformance is minimal conformance to SO - all feature types (feature.type_id) must be SO terms, and all feature_relationship types (feature_relationship.type_id) must be SO relationship types.

Level 2: Graph

Level 2 conformance is graph conformance to SO - all feature relationships between features of type X and Y must correspond to a relationship of that type in SO; for example, mRNA can be part of gene, but mRNA cannot be part of golden path region. [more detailed/formal explanation to come]. In practice Level 2 conformance may be undesirable; we may need to make modifications to SO.

Orthogonal to these layers are various additional policy decisions. Some of these are more tolerant of non-conformance than others. (There is also some overlap with levels 1 and 2.)
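Pending the validation software mentioned above, the levels can be sketched as simple set checks. Everything in this example is hypothetical: the term sets are small hand-written excerpts, whereas a real check would load terms and permitted relationships from SO itself, and level 0 is assumed to be guaranteed already by the DBMS.

```python
# Hypothetical excerpts; a real check would load these from the Sequence Ontology.
SO_TERMS = {"gene", "mRNA", "exon", "polypeptide", "golden_path_region"}
SO_REL_TYPES = {"part_of", "derives_from"}
SO_ALLOWED = {
    ("mRNA", "part_of", "gene"),
    ("exon", "part_of", "mRNA"),
    ("polypeptide", "derives_from", "mRNA"),
}


def conformance_level(feature_types, relationships):
    """Return 0, 1 or 2 given feature types and (subject_type, rel, object_type) tuples."""
    # Level 1: every feature type and every relationship type is an SO term.
    level1 = set(feature_types) <= SO_TERMS and all(
        rel in SO_REL_TYPES for _, rel, _ in relationships
    )
    if not level1:
        return 0
    # Level 2: every typed relationship is one that SO permits between those types.
    level2 = all(triple in SO_ALLOWED for triple in relationships)
    return 2 if level2 else 1
```

With these excerpts, an mRNA that is part_of a gene conforms at level 2, an mRNA that is part_of a golden_path_region only at level 1, and a feature typed "protein" only at level 0.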


Examples: Current implementations

This section describes details of how different sites are using Chado. This is likely outdated information.

TIGR: Currently at level 0 conformance, though most (if not all) of the terms being used have an obvious counterpart in SO. Therefore these "TIGR Ontology" terms are used in the answers to the SO-related questions that appear below. We plan on updating our terms with SO terms very soon.

SO terms used for Standard Central-dogma Gene Model

FlyBase: gene mRNA exon protein [other types are derivable].

TIGR: gene transcript CDS exon protein [though the strict answer to any of these SO questions is "none", since we do not yet meet level 1 conformance].

NOTE: we should be using 'polypeptide' instead of 'protein'. For now, software should be tolerant of both these uses.

SO terms Used for Storing Alignments

FlyBase: match

TIGR: match

NOTE: we want to use the new, more specific SO types match_set and match_part for hits and HSPs respectively. For now, software should be tolerant of either usage.

TIGR: We’ve also extended the model for storing pairwise alignments to store multiple alignments. Each member of the alignment is featureloc’ed to the 'match' feature. We’ve used this representation to store paralogous/orthologous gene families.

feature_relationship Types

FlyBase: partof (for mRNA to gene and exon to mRNA); producedby (for protein to mRNA)

TIGR: part of (gene-assembly, exon-transcript, assembly-supercontig); produced by (protein-CDS, CDS-transcript, transcript-gene)

NOTE: this should be part_of and derives_from to conform to SO. Most read-only software should be able to safely ignore feature_relationship.type_id anyway. Protein should be polypeptide; see the note above.

NOTE: the main difference between FB and TIGR here is that TIGR introduces an intermediate CDS feature between mRNA and protein.

featureloc Policy

FlyBase: all constituent parts of a central dogma gene model are located relative to the same srcfeature (the chromosome arm). No redundant locations (i.e. featureloc.locgroup > 0) are used.

TIGR: Redundant locations are used and indicated with featureloc.locgroup > 0.

NOTE: we want to allow some flexibility with this policy. We believe that one rule should always be followed: features linked as constituent parts are located relative to the same source feature. This can be stated more formally as:


 IF  X is linked to Y via feature_relationship
 AND X is located relative to Z via featureloc.srcfeature_id
 THEN Y must also be located relative to Z via featureloc.srcfeature_id
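The rule above lends itself to an automated check. This is a sketch only; the function name and input shapes are hypothetical, with relationships given as (X, Y) pairs from feature_relationship and source features gathered per feature from featureloc.srcfeature_id.

```python
def policy_violations(relationships, srcfeatures):
    """List (x, y) pairs where x is located on a source feature that y is not.

    relationships: iterable of (subject_id, object_id) pairs from feature_relationship.
    srcfeatures: dict mapping feature_id to the set of its featureloc.srcfeature_id values.
    """
    violations = []
    for x, y in relationships:
        # Every Z that X is located on must also appear among the sources of Y.
        missing = srcfeatures.get(x, set()) - srcfeatures.get(y, set())
        if missing:
            violations.append((x, y))
    return violations


relationships = [("exon1", "mRNA1"), ("mRNA1", "gene1")]
consistent = {"exon1": {"2L"}, "mRNA1": {"2L"}, "gene1": {"2L"}}
inconsistent = {"exon1": {"contig7"}, "mRNA1": {"2L"}, "gene1": {"2L"}}
```

With the consistent locations no pair is reported; with the exon located only on a contig, the pair (exon1, mRNA1) is reported.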


TIGR: We’ve followed this policy in adding a featureloc between the protein and genomic contig in our databases (such a featureloc does not appear in the Chado usage documents). This additional featureloc simplifies many queries, especially when looking at the genomic context of 'match' features associated with proteins.

We should also expect that the fmin/fmax boundaries of a feature be defined by the outermost boundaries of its outermost constituent part features (this rule may require refinement when we have promoters, enhancers and so on - but for now we don’t).

As to what the srcfeature should be, it could be a contig, an assembly or a top-level locatable feature such as a chromosome or chromosome arm. Software should be tolerant of different choices here. Whilst it is generally best to locate relative to the topmost feature (i.e. the arm/chromosome), sometimes this is not possible or desirable (e.g. low coverage, heterochromatin).

Non-central Dogma Gene Models

FlyBase: we store a lot of non-central dogma gene models; noncoding gene models and pseudogenes [need to fill in more details here].

TIGR: not many of these stored yet, save for a few pseudogenes and the occasional non-coding ORF.

Other Features

FlyBase: the FlyBase implementation includes many other feature types, including polyA site and sequence variant [need to fill in details].

TIGR: using 'SNP' in some databases.

Derivable Feature Types

FlyBase: derivable features (introns, UTRs, intergenic regions) are not included. Feature typing is always done to the most specific, non-derivable level. For example, we never use the types "5 prime exon", "dicistronic gene" or "coding exon", as these are always inferrable. We always use type "gene" - the specific type of gene is inferred from the child type (mRNA, tRNA, snRNA, etc.).

TIGR: derivable features are not included. Currently not storing any tRNAs or snRNAs.

NOTE: whilst it is perfectly permissible to include redundant derivable features (useful for warehouse-style querying), you should not write software that expects to find these if you want the software to work on different Chado database instances.

Sequence Variants

FlyBase: these are included in Chado, but they are lacking full detail.

TIGR: only SNPs so far. The SNPs currently being stored are computed from pairwise alignments of sequences already loaded into Chado, so each SNP feature is featureloc’ed to the appropriate place on each of the two sequences (rather than having one of the featurelocs "dangling", as indicated in some of the Chado usage documents). featureloc.residue_info is used to redundantly store the base referenced in each of the two sequences.

NOTE: variation features should specify the edit that makes one feature (such as the reference/wild-type) from another (the variant/mutant/non-reference). There were perhaps 2 proposals for this [more details required...].

Tables

Table: feature

A feature is a biological sequence or a section of a biological sequence, or a collection of such sections. Examples include genes, exons, transcripts, regulatory regions, polypeptides, protein domains, chromosome sequences, sequence variations, cross-genome match regions such as hits and HSPs and so on; see the Sequence Ontology for more.

feature Structure
F-Key Name Type Description
feature_id serial PRIMARY KEY

dbxref

dbxref_id integer

An optional primary public stable identifier for this feature. Secondary identifiers and external dbxrefs go in the table feature_dbxref.

organism

organism_id integer UNIQUE#1 NOT NULL

The organism to which this feature belongs. This column is mandatory.
name character varying(255)

The optional human-readable common name for a feature, for display purposes.
uniquename text UNIQUE#1 NOT NULL

The unique name for a feature; may not necessarily be particularly human-readable, although this is preferred. This name must be unique for this type of feature within this organism.
residues text

A sequence of alphabetic characters representing biological residues (nucleic acids, amino acids). This column does not need to be manifested for all features; it is optional for features such as exons where the residues can be derived from the featureloc. It is recommended that the value for this column be manifested for features which may have non-contiguous sublocations (e.g. transcripts), since derivation at query time is non-trivial. For expressed sequence, the DNA sequence should be used rather than the RNA sequence.
seqlen integer

The length of the residue feature. See column:residues. This column is partially redundant with the residues column, and also with featureloc. This column is required because the location may be unknown and the residue sequence may not be manifested, yet it may be desirable to store and query the length of the feature. The seqlen should always be manifested where the length of the sequence is known.
md5checksum character(32)

The 32-character checksum of the sequence, calculated using the MD5 algorithm. This is practically guaranteed to be unique for any distinct sequence. This column thus acts as a unique identifier on the mathematical sequence.

cvterm

type_id integer UNIQUE#1 NOT NULL

A required reference to a table:cvterm giving the feature type. This will typically be a Sequence Ontology identifier. This column is thus used to subclass the feature table.
is_analysis boolean NOT NULL DEFAULT false

Boolean indicating whether this feature is annotated or the result of an automated analysis. Analysis results also use the companalysis module. Note that the dividing line between analysis and annotation may be fuzzy; this should be determined on a per-project basis in a consistent manner. One requirement is that there should only be one non-analysis version of each wild-type gene feature in a genome, whereas the same gene feature can be predicted multiple times in different analyses.
is_obsolete boolean NOT NULL DEFAULT false

Boolean indicating whether this feature has been obsoleted. Some Chado instances may choose to simply remove the feature altogether, others may choose to keep an obsolete row in the table.
timeaccessioned timestamp without time zone NOT NULL DEFAULT ('now'::text)::timestamp(6) with time zone

For handling object accession or modification timestamps (as opposed to database auditing data, handled elsewhere). The expectation is that these fields would be available to software interacting with chado.
timelastmodified timestamp without time zone NOT NULL DEFAULT ('now'::text)::timestamp(6) with time zone

For handling object accession or modification timestamps (as opposed to database auditing data, handled elsewhere). The expectation is that these fields would be available to software interacting with chado.


Table: feature_cvterm

Associate a term from a cv with a feature, for example, GO annotation.

feature_cvterm Structure
F-Key Name Type Description
feature_cvterm_id serial PRIMARY KEY

feature

feature_id integer UNIQUE#1 NOT NULL

cvterm

cvterm_id integer UNIQUE#1 NOT NULL

pub

pub_id integer UNIQUE#1 NOT NULL

Provenance for the annotation. Each annotation should have a single primary publication (which may be of the appropriate type for computational analyses) where more details can be found. Additional provenance dbxrefs can be attached using feature_cvterm_dbxref.
is_not boolean NOT NULL DEFAULT false

If this is set to true, then this annotation is interpreted as a NEGATIVE annotation - i.e. the feature does NOT have the specified function, process, component, part, etc. See GO docs for more details.


Table: feature_cvterm_dbxref

Additional dbxrefs for an association. Rows in the feature_cvterm table may be backed up by dbxrefs. For example, a feature_cvterm association that was inferred via a protein-protein interaction may be backed up by referring to the dbxref for the alternate protein. Corresponds to the WITH column in a GO gene association file (but can also be used for other analogous associations). See http://www.geneontology.org/doc/GO.annotation.shtml#file for more details.

feature_cvterm_dbxref Structure
F-Key Name Type Description
feature_cvterm_dbxref_id serial PRIMARY KEY

feature_cvterm

feature_cvterm_id integer UNIQUE#1 NOT NULL

dbxref

dbxref_id integer UNIQUE#1 NOT NULL


Table: feature_cvterm_pub

Secondary pubs for an association. Each feature_cvterm association is supported by a single primary publication. Additional secondary pubs can be added using this linking table (in a GO gene association file, these correspond to any IDs after the pipe symbol in the publications column).

feature_cvterm_pub Structure
F-Key Name Type Description
feature_cvterm_pub_id serial PRIMARY KEY

feature_cvterm

feature_cvterm_id integer UNIQUE#1 NOT NULL

pub

pub_id integer UNIQUE#1 NOT NULL


Table: feature_cvtermprop

Extensible properties for feature to cvterm associations. Examples: GO evidence codes; qualifiers; metadata such as the date on which the entry was curated and the source of the association. See the featureprop table for meanings of type_id, value and rank.

feature_cvtermprop Structure
F-Key Name Type Description
feature_cvtermprop_id serial PRIMARY KEY

feature_cvterm

feature_cvterm_id integer UNIQUE#1 NOT NULL

cvterm

type_id integer UNIQUE#1 NOT NULL

The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. cvterms may come from the OBO evidence code cv.
value text

The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
rank integer UNIQUE#1 NOT NULL

Property-Value ordering. Any feature_cvterm can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.
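
The rank convention can be illustrated with a small sketch. It assumes Python's sqlite3 module as a stand-in for PostgreSQL, a table reduced to the columns above, and hypothetical cvterm ids (EVIDENCE_CODE, CURATION_DATE), an invented date, and invented helper names; IDA, IPI and ISS are GO evidence codes used only as sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feature_cvtermprop (
        feature_cvtermprop_id INTEGER PRIMARY KEY,
        feature_cvterm_id INTEGER NOT NULL,
        type_id INTEGER NOT NULL,
        value   TEXT,
        rank    INTEGER NOT NULL DEFAULT 0,
        UNIQUE (feature_cvterm_id, type_id, rank)
    )
""")

EVIDENCE_CODE = 500  # hypothetical cvterm_id for the property type
CURATION_DATE = 501  # hypothetical cvterm_id for the property type

# A multi-valued property: two evidence codes, ordered by rank from zero.
conn.executemany(
    "INSERT INTO feature_cvtermprop (feature_cvterm_id, type_id, value, rank) "
    "VALUES (?, ?, ?, ?)",
    [(1, EVIDENCE_CODE, "IDA", 0), (1, EVIDENCE_CODE, "IPI", 1)],
)
# A single-valued property relies on the default rank of 0.
conn.execute(
    "INSERT INTO feature_cvtermprop (feature_cvterm_id, type_id, value) "
    "VALUES (?, ?, ?)",
    (1, CURATION_DATE, "2008-01-15"),
)


def property_values(conn, feature_cvterm_id, type_id):
    """Return the values of one property type, in rank order."""
    rows = conn.execute(
        "SELECT value FROM feature_cvtermprop "
        "WHERE feature_cvterm_id = ? AND type_id = ? ORDER BY rank",
        (feature_cvterm_id, type_id),
    )
    return [row[0] for row in rows]


def try_insert(conn, feature_cvterm_id, type_id, value, rank):
    """Return False if the (association, type, rank) slot is already taken."""
    try:
        conn.execute(
            "INSERT INTO feature_cvtermprop "
            "(feature_cvterm_id, type_id, value, rank) VALUES (?, ?, ?, ?)",
            (feature_cvterm_id, type_id, value, rank),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

Because rank is part of the unique key, adding a further value of the same type requires the next free rank rather than reusing 0.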


Table: feature_dbxref

Links a feature to dbxrefs. This is for secondary identifiers; primary identifiers should use feature.dbxref_id.

feature_dbxref Structure
F-Key Name Type Description
feature_dbxref_id serial PRIMARY KEY

feature

feature_id integer UNIQUE#1 NOT NULL

dbxref

dbxref_id integer UNIQUE#1 NOT NULL
is_current boolean NOT NULL DEFAULT true

True if this secondary dbxref is the most up to date accession in the corresponding db. Retired accessions should set this field to false.
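
As a rough sketch of how is_current might be used, the following lists a feature's secondary accessions with and without retired ones. It assumes Python's sqlite3 module in place of PostgreSQL, a dbxref table cut down to an id and an accession, and invented accessions, ids and helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- dbxref is reduced to two columns for this sketch.
    CREATE TABLE dbxref (
        dbxref_id INTEGER PRIMARY KEY,
        accession TEXT NOT NULL
    );
    CREATE TABLE feature_dbxref (
        feature_dbxref_id INTEGER PRIMARY KEY,
        feature_id INTEGER NOT NULL,
        dbxref_id  INTEGER NOT NULL REFERENCES dbxref (dbxref_id),
        is_current BOOLEAN NOT NULL DEFAULT 1,
        UNIQUE (feature_id, dbxref_id)
    );
    INSERT INTO dbxref (dbxref_id, accession)
        VALUES (1, 'ACC0001'), (2, 'ACC0002');
    -- ACC0001 has been retired; ACC0002 takes the default is_current = true.
    INSERT INTO feature_dbxref (feature_id, dbxref_id, is_current) VALUES (7, 1, 0);
    INSERT INTO feature_dbxref (feature_id, dbxref_id) VALUES (7, 2);
""")


def secondary_accessions(conn, feature_id, current_only=True):
    """Return secondary accessions for a feature, optionally hiding retired ones."""
    sql = (
        "SELECT d.accession FROM feature_dbxref fd "
        "JOIN dbxref d ON d.dbxref_id = fd.dbxref_id "
        "WHERE fd.feature_id = ?"
    )
    if current_only:
        sql += " AND fd.is_current = 1"
    sql += " ORDER BY d.accession"
    return [row[0] for row in conn.execute(sql, (feature_id,))]
```

Keeping the retired row means a search on the old accession can still resolve to the feature, while reports can restrict themselves to current accessions.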

Table: feature_pub

Provenance. Linking table between features and publications that mention them.

feature_pub Structure
F-Key Name Type Description
feature_pub_id serial PRIMARY KEY

feature

feature_id integer UNIQUE#1 NOT NULL

pub

pub_id integer UNIQUE#1 NOT NULL

Tables referencing this one via Foreign Key Constraints:



Table: feature_pubprop

Property or attribute of a feature_pub link.

feature_pubprop Structure
F-Key Name Type Description
feature_pubprop_id serial PRIMARY KEY

feature_pub

feature_pub_id integer UNIQUE#1 NOT NULL

cvterm

type_id integer UNIQUE#1 NOT NULL
value text
rank integer UNIQUE#1 NOT NULL


Table: feature_relationship

Features can be arranged in graphs, e.g. "exon part_of transcript part_of gene". If type is thought of as a verb, then each arc or edge makes a statement [Subject Verb Object]. The object can also be thought of as the parent (containing feature), and the subject as the child (contained feature or subfeature). We include the relationship rank/order because, even though most of the time we can order things implicitly by sequence coordinates, we cannot always do this, e.g. for trans-spliced genes. It is also useful for quickly getting implicit introns.
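
One way to read "implicit introns" is as the gaps between exons that are adjacent in rank order. The sketch below is an assumption about how such a derivation could look, not code taken from Chado: the function name and the sample coordinates are invented, and it presumes all exons are located on the same source feature and strand, which does not hold for trans-spliced genes.

```python
def implicit_introns(exons):
    """Derive intron ranges from exons of one transcript.

    exons: iterable of (rank, fmin, fmax) tuples in interbase coordinates,
    all on the same source feature and strand (an assumption of this sketch).
    Returns one (fmin, fmax) range per pair of exons adjacent in rank.
    """
    ordered = sorted(exons)  # order by relationship rank
    introns = []
    for (_, fmin_a, fmax_a), (_, fmin_b, fmax_b) in zip(ordered, ordered[1:]):
        # The gap lies between the inner boundaries of the two exons.
        # min/max makes this work whether rank runs left-to-right (plus
        # strand) or right-to-left (minus strand) along the source feature.
        introns.append((min(fmax_a, fmax_b), max(fmin_a, fmin_b)))
    return introns


# Plus-strand transcript with three exons (invented coordinates).
plus = implicit_introns([(0, 100, 200), (1, 300, 400), (2, 550, 700)])

# Minus-strand transcript: rank 0 is the rightmost exon.
minus = implicit_introns([(0, 300, 400), (1, 100, 200)])
```

Because the exons are paired by rank rather than by coordinate, no sort on fmin is needed and the same function covers both strands.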

feature_relationship Structure
F-Key Name Type Description
feature_relationship_id serial PRIMARY KEY

feature

subject_id integer UNIQUE#1 NOT NULL

The subject of the subj-predicate-obj sentence. This is typically the subfeature.

feature

object_id integer UNIQUE#1 NOT NULL

The object of the subj-predicate-obj sentence. This is typically the container feature.

cvterm

type_id integer UNIQUE#1 NOT NULL

Relationship type between subject and object. This is a cvterm, typically from the OBO relationship ontology, although other relationship types are allowed. The most common relationship type is OBO_REL:part_of. Valid relationship types are constrained by the Sequence Ontology.
value text

Additional notes or comments.
rank integer UNIQUE#1 NOT NULL

The ordering of subject features with respect to the object feature may be important (for example, exon ordering on a transcript, which is not always derivable if you take trans-spliced genes into consideration). Rank is used to order these; it starts from zero.
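
The subject/object/rank columns can be exercised with a small sketch. It uses Python's sqlite3 module as a stand-in for PostgreSQL, a feature table reduced to an id and uniquename, and a hypothetical cvterm id (PART_OF) in place of a real part_of term; the feature names and the helper subfeatures are invented.

```python
import sqlite3

PART_OF = 1  # hypothetical cvterm_id standing in for the part_of relationship

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        uniquename TEXT NOT NULL
    );
    CREATE TABLE feature_relationship (
        feature_relationship_id INTEGER PRIMARY KEY,
        subject_id INTEGER NOT NULL REFERENCES feature (feature_id),
        object_id  INTEGER NOT NULL REFERENCES feature (feature_id),
        type_id    INTEGER NOT NULL,
        value      TEXT,
        rank       INTEGER NOT NULL,
        UNIQUE (subject_id, object_id, type_id, rank)
    );
    INSERT INTO feature (feature_id, uniquename) VALUES
        (1, 'gene1'), (2, 'mRNA1'), (3, 'exonA'), (4, 'exonB'), (5, 'exonC');
""")
conn.executemany(
    "INSERT INTO feature_relationship (subject_id, object_id, type_id, rank) "
    "VALUES (?, ?, ?, ?)",
    [
        (2, 1, PART_OF, 0),  # mRNA1 part_of gene1
        (5, 2, PART_OF, 2),  # exonC is the third exon of mRNA1
        (3, 2, PART_OF, 0),  # exonA is the first exon of mRNA1
        (4, 2, PART_OF, 1),  # exonB is the second exon of mRNA1
    ],
)


def subfeatures(conn, object_id, type_id=PART_OF):
    """Return the uniquenames of subject features of an object, by rank."""
    rows = conn.execute(
        "SELECT f.uniquename FROM feature_relationship fr "
        "JOIN feature f ON f.feature_id = fr.subject_id "
        "WHERE fr.object_id = ? AND fr.type_id = ? ORDER BY fr.rank",
        (object_id, type_id),
    )
    return [row[0] for row in rows]
```

Ordering by rank returns the exons in transcript order regardless of insertion order, and without consulting featureloc at all.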

Tables referencing this one via Foreign Key Constraints:



Table: feature_relationship_pub

Provenance. Attach optional evidence to a feature_relationship in the form of a publication.

feature_relationship_pub Structure
F-Key Name Type Description
feature_relationship_pub_id serial PRIMARY KEY

feature_relationship

feature_relationship_id integer UNIQUE#1 NOT NULL

pub

pub_id integer UNIQUE#1 NOT NULL


Table: feature_relationshipprop

Extensible properties for feature_relationships. Analogous in structure to featureprop. This table is largely optional and is not used frequently. A typical scenario is attaching additional data to a feature_relationship, for example to say that the feature_relationship is only true in certain contexts.

feature_relationshipprop Structure
F-Key Name Type Description
feature_relationshipprop_id serial PRIMARY KEY

feature_relationship

feature_relationship_id integer UNIQUE#1 NOT NULL

cvterm

type_id integer UNIQUE#1 NOT NULL

The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. Currently there is no standard ontology for feature_relationship property types.
value text

The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
rank integer UNIQUE#1 NOT NULL

Property-Value ordering. Any feature_relationship can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.

Tables referencing this one via Foreign Key Constraints:



Table: feature_relationshipprop_pub

Provenance for feature_relationshipprop.

feature_relationshipprop_pub Structure
F-Key Name Type Description
feature_relationshipprop_pub_id serial PRIMARY KEY

feature_relationshipprop

feature_relationshipprop_id integer UNIQUE#1 NOT NULL

pub

pub_id integer UNIQUE#1 NOT NULL


Table: feature_synonym

Linking table between feature and synonym.

feature_synonym Structure
F-Key Name Type Description
feature_synonym_id serial PRIMARY KEY

synonym

synonym_id integer UNIQUE#1 NOT NULL

feature

feature_id integer UNIQUE#1 NOT NULL

pub

pub_id integer UNIQUE#1 NOT NULL

The pub_id link is for relating the usage of a given synonym to the publication in which it was used.
is_current boolean NOT NULL DEFAULT true

The is_current boolean indicates whether the linked synonym is the current -official- symbol for the linked feature.
is_internal boolean NOT NULL DEFAULT false

Typically a synonym exists so that somebody querying the db with an obsolete name can find the object they're looking for (under its current name). If the synonym has been used publicly and deliberately (e.g. in a paper), it may also be listed in reports as a synonym. If the synonym was not used deliberately (e.g. there was a typo which went public), then the is_internal boolean may be set to -true- so that it is known that the synonym is -internal- and should be queryable but should not be listed in reports as a valid synonym.
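
The interplay of is_current and is_internal might be sketched like this. The example assumes Python's sqlite3 module in place of PostgreSQL, a synonym table reduced to an id and a name, and invented names, ids and helper functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- synonym is reduced to two columns for this sketch.
    CREATE TABLE synonym (
        synonym_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE feature_synonym (
        feature_synonym_id INTEGER PRIMARY KEY,
        synonym_id  INTEGER NOT NULL REFERENCES synonym (synonym_id),
        feature_id  INTEGER NOT NULL,
        pub_id      INTEGER NOT NULL,
        is_current  BOOLEAN NOT NULL DEFAULT 1,
        is_internal BOOLEAN NOT NULL DEFAULT 0,
        UNIQUE (synonym_id, feature_id, pub_id)
    );
    INSERT INTO synonym (synonym_id, name) VALUES
        (1, 'abc1'), (2, 'oldname'), (3, 'acb1');
    INSERT INTO feature_synonym
        (synonym_id, feature_id, pub_id, is_current, is_internal) VALUES
        (1, 42, 10, 1, 0),  -- current official symbol
        (2, 42, 11, 0, 0),  -- obsolete name used deliberately in a paper
        (3, 42, 12, 0, 1);  -- typo that went public: searchable, not reported
""")


def find_features(conn, name):
    """Look up feature_ids by any synonym, including internal ones."""
    rows = conn.execute(
        "SELECT DISTINCT fs.feature_id FROM feature_synonym fs "
        "JOIN synonym s ON s.synonym_id = fs.synonym_id "
        "WHERE s.name = ? ORDER BY fs.feature_id",
        (name,),
    )
    return [row[0] for row in rows]


def reportable_synonyms(conn, feature_id):
    """Return the synonyms that may be listed in reports (non-internal)."""
    rows = conn.execute(
        "SELECT s.name FROM feature_synonym fs "
        "JOIN synonym s ON s.synonym_id = fs.synonym_id "
        "WHERE fs.feature_id = ? AND fs.is_internal = 0 ORDER BY s.name",
        (feature_id,),
    )
    return [row[0] for row in rows]
```

Searching uses every synonym, while reporting filters on is_internal, which is the split the description above calls for.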


Table: featureloc

The location of a feature relative to another feature. Important: interbase coordinates are used. This is vital, as it allows us to represent zero-length features (e.g. splice sites, insertion points) without an awkward fuzzy system. Features typically have exactly ONE location, but this need not be the case. Some features may not be localized (e.g. a gene that has been characterized genetically but for which no sequence or molecular information is available).

Note on multiple locations: each feature can have 0 or more locations. Multiple locations do NOT indicate non-contiguous locations (if a feature such as a transcript has a non-contiguous location, then the subfeatures such as exons should always be manifested). Instead, multiple featurelocs for a feature designate alternate locations or grouped locations; for instance, a feature designating a BLAST hit or HSP will have two locations, one on the query feature and one on the subject feature. Features representing sequence variation could have alternate locations instantiated on a feature on the mutant strain. The column:rank is used to differentiate these different locations.

Reflexive locations should never be stored; this table is for -proper- (i.e. non-self) locations only, and nothing should be located relative to itself.
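
The interbase convention can be made concrete with a few helper functions. This is an illustrative sketch: the function names and the sample sequence are invented, and the conversion follows the rule given for fmin and fmax in this table (add 1 to fmin, leave fmax unchanged) to reach a base-oriented system that counts from 1.

```python
def feature_length(fmin, fmax):
    """Number of residues spanned by an interbase range."""
    if fmin > fmax:
        raise ValueError("fmin must not exceed fmax")
    return fmax - fmin


def to_base_oriented(fmin, fmax):
    """Interbase (fmin, fmax) to 1-based inclusive (start, end)."""
    return fmin + 1, fmax


def from_base_oriented(start, end):
    """1-based inclusive (start, end) to interbase (fmin, fmax)."""
    return start - 1, end


residues = "ACGTAC"      # invented sequence of a source feature
exon = (1, 4)            # interbase boundaries, counted from zero
# Python slices use the same between-the-bases convention.
exon_seq = residues[exon[0]:exon[1]]

# A zero-length feature such as an insertion point sits between two bases.
insertion_point = (3, 3)
```

Note that a zero-length interbase range has no natural base-oriented form: converting (3, 3) yields a start of 4 and an end of 3, which is the kind of awkwardness the interbase system avoids.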

featureloc Structure
F-Key Name Type Description
featureloc_id serial PRIMARY KEY

feature

feature_id integer UNIQUE#1 NOT NULL

The feature that is being located. Any feature can have zero or more featurelocs.

feature

srcfeature_id integer

The source feature which this location is relative to. Every location is relative to another feature (however, this column is nullable, because the srcfeature may not be known). All locations are -proper-; that is, nothing should be located relative to itself. No cycles are allowed in the featureloc graph.
fmin integer

The leftmost/minimal boundary in the linear range represented by the featureloc. Sometimes (e.g. in Bioperl) this is called -start-, although this is confusing because it does not necessarily represent the 5-prime coordinate. Important: these are space-based (interbase) coordinates, counting from zero. To convert this to the leftmost position in a base-oriented system (e.g. GFF, Bioperl), add 1 to fmin.
is_fmin_partial boolean NOT NULL DEFAULT false

This is typically false, but may be true if the value for column:fmin is inaccurate or the leftmost part of the range is unknown/unbounded.
fmax integer

The rightmost/maximal boundary in the linear range represented by the featureloc. Sometimes (e.g. in Bioperl) this is called -end-, although this is confusing because it does not necessarily represent the 3-prime coordinate. Important: these are space-based (interbase) coordinates, counting from zero. No conversion is required to go from fmax to the rightmost coordinate in a base-oriented system that counts from 1 (e.g. GFF, Bioperl).
is_fmax_partial boolean NOT NULL DEFAULT false

This is typically false, but may be true if the value for column:fmax is inaccurate or the rightmost part of the range is unknown/unbounded.
strand smallint

The orientation/directionality of the location. Should be 0, -1 or +1.
phase integer

Phase of translation with respect to srcfeature_id. Values are 0, 1, 2. It may not be possible to manifest this column for some features such as exons, because the phase is dependent on the spliceform (the same exon can appear in multiple spliceforms). This column is mostly useful for predicted exons and CDSs.
residue_info text

Alternative residues, when these differ from feature.residues. For instance, a SNP feature located on a wild-type and a mutant protein would have different alternative residues. For alignment/similarity features, the alternative residues field is used to represent the alignment string (CIGAR format). Note on variation features: even if we do not want to instantiate a mutant chromosome/contig feature, we can still represent a SNP etc. with 2 locations: one (rank 0) on the genome, and the other (rank 1) with most fields null except for the alternative residues.
locgroup integer UNIQUE#1 NOT NULL

This is used to manifest redundant, derivable extra locations for a feature. The default locgroup=0 is used for the DIRECT location of a feature. Important: most Chado users may never use featurelocs with locgroup > 0. Transitively derived locations are indicated with locgroup > 0. For example, an exon may be positioned both on a BAC and in global chromosome coordinates. This column is used to differentiate these groupings of locations. The default locgroup 0 is used for the main or primary location, from which the others can be derived via coordinate transformations. Another example of redundant locations is storing ORF coordinates relative to both transcript and genome. Redundant locations open the possibility of the database getting into inconsistent states; this schema gives us the flexibility of both warehouse instantiations with redundant locations (easier for querying) and management instantiations with no redundant locations. An example of using both locgroup and rank: imagine a feature indicating a conserved region between the chromosomes of two different species. We may want to keep redundant locations on both contigs and chromosomes. We would thus have 4 locations for the single conserved region feature: two distinct locgroups (contig level and chromosome level) and two distinct ranks (for the two species).
rank integer UNIQUE#1 NOT NULL

Used when a feature has >1 location, otherwise the default rank 0 is used. Some features (e.g. blast hits and HSPs) have two locations - one on the query and one on the subject. Rank is used to differentiate these. Rank=0 is always used for the query, Rank=1 for the subject. For multiple alignments, assignment of rank is arbitrary. Rank is also used for sequence_variant features, such as SNPs. Rank=0 indicates the wildtype (or baseline) feature, Rank=1 indicates the mutant (or compared) feature.
featureloc Constraints
Name Constraint
featureloc_c2 CHECK ((fmin <= fmax))
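
The rank convention and the featureloc_c2 constraint can be tried out together in a small sketch. It assumes Python's sqlite3 module in place of PostgreSQL and a featureloc table reduced to the columns needed here; the feature ids (HSP, QUERY, SUBJECT), the coordinates and the helper names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE featureloc (
        featureloc_id INTEGER PRIMARY KEY,
        feature_id    INTEGER NOT NULL,
        srcfeature_id INTEGER,
        fmin     INTEGER,
        fmax     INTEGER,
        strand   SMALLINT,
        locgroup INTEGER NOT NULL DEFAULT 0,
        rank     INTEGER NOT NULL DEFAULT 0,
        UNIQUE (feature_id, locgroup, rank),
        CONSTRAINT featureloc_c2 CHECK (fmin <= fmax)
    )
""")

HSP, QUERY, SUBJECT = 1, 2, 3  # hypothetical feature_ids

# One HSP feature with two locations: rank 0 on the query, rank 1 on the subject.
conn.executemany(
    "INSERT INTO featureloc "
    "(feature_id, srcfeature_id, fmin, fmax, strand, rank) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [(HSP, QUERY, 10, 60, 1, 0), (HSP, SUBJECT, 1200, 1250, -1, 1)],
)


def locations(conn, feature_id):
    """Return (rank, srcfeature_id, fmin, fmax) for the direct locations."""
    rows = conn.execute(
        "SELECT rank, srcfeature_id, fmin, fmax FROM featureloc "
        "WHERE feature_id = ? AND locgroup = 0 ORDER BY rank",
        (feature_id,),
    )
    return rows.fetchall()


def can_insert(conn, feature_id, srcfeature_id, fmin, fmax, rank=0):
    """Return False if the row violates the CHECK or the unique key."""
    try:
        conn.execute(
            "INSERT INTO featureloc "
            "(feature_id, srcfeature_id, fmin, fmax, rank) "
            "VALUES (?, ?, ?, ?, ?)",
            (feature_id, srcfeature_id, fmin, fmax, rank),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

Since the constraint is fmin <= fmax rather than a strict inequality, zero-length locations such as insertion points are accepted, while a reversed range is rejected; orientation is carried by strand, not by swapping fmin and fmax.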

Tables referencing this one via Foreign Key Constraints:



Table: featureloc_pub

Provenance of featureloc. Linking table between featurelocs and publications that mention them.

featureloc_pub Structure
F-Key Name Type Description
featureloc_pub_id serial PRIMARY KEY

featureloc

featureloc_id integer UNIQUE#1 NOT NULL

pub

pub_id integer UNIQUE#1 NOT NULL


Table: featureprop

A feature can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible.

featureprop Structure
F-Key Name Type Description
featureprop_id serial PRIMARY KEY

feature

feature_id integer UNIQUE#1 NOT NULL

cvterm

type_id integer UNIQUE#1 NOT NULL

The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. Certain property types will only apply to certain feature types (e.g. the anticodon property will only apply to tRNA features); the types here come from the sequence feature property ontology.
value text

The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
rank integer UNIQUE#1 NOT NULL

Property-Value ordering. Any feature can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.
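
Because value is stored as text, numeric comparisons need an explicit cast. The sketch below illustrates this under stated assumptions: Python's sqlite3 module stands in for PostgreSQL, the table is reduced to the columns above, and the cvterm ids (SCORE, NOTE), the values and the helper names are all invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE featureprop (
        featureprop_id INTEGER PRIMARY KEY,
        feature_id INTEGER NOT NULL,
        type_id    INTEGER NOT NULL,
        value      TEXT,
        rank       INTEGER NOT NULL DEFAULT 0,
        UNIQUE (feature_id, type_id, rank)
    )
""")

SCORE = 900  # hypothetical cvterm_id for a numeric property
NOTE = 901   # hypothetical cvterm_id for a free-text property


def set_property(conn, feature_id, type_id, value, rank=0):
    """Store a property value, converting numbers to their text form."""
    conn.execute(
        "INSERT INTO featureprop (feature_id, type_id, value, rank) "
        "VALUES (?, ?, ?, ?)",
        (feature_id, type_id, str(value), rank),
    )


set_property(conn, 1, SCORE, 9.5)
set_property(conn, 2, SCORE, 10.25)
set_property(conn, 3, SCORE, 120)
set_property(conn, 1, NOTE, "first note", rank=0)
set_property(conn, 1, NOTE, "second note", rank=1)


def features_with_score_above(conn, threshold):
    """Compare numerically by casting the stored text back to a number."""
    rows = conn.execute(
        "SELECT feature_id FROM featureprop "
        "WHERE type_id = ? AND CAST(value AS REAL) > ? ORDER BY feature_id",
        (SCORE, threshold),
    )
    return [row[0] for row in rows]


def features_with_score_above_as_text(conn, threshold):
    """Naive text comparison, shown only to illustrate the pitfall."""
    rows = conn.execute(
        "SELECT feature_id FROM featureprop "
        "WHERE type_id = ? AND value > ? ORDER BY feature_id",
        (SCORE, str(threshold)),
    )
    return [row[0] for row in rows]
```

The text comparison wrongly treats '9.5' as greater than '10' because strings are compared character by character, which is one practical cost of keeping every property value as text.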

Tables referencing this one via Foreign Key Constraints:



Table: featureprop_pub

Provenance. Any featureprop assignment can optionally be supported by a publication.

featureprop_pub Structure
F-Key Name Type Description
featureprop_pub_id serial PRIMARY KEY

featureprop

featureprop_id integer UNIQUE#1 NOT NULL

pub

pub_id integer UNIQUE#1 NOT NULL


Table: synonym

A synonym for a feature. One feature can have multiple synonyms, and the same synonym can apply to multiple features.

synonym Structure
F-Key Name Type Description
synonym_id serial PRIMARY KEY
name character varying(255) UNIQUE#1 NOT NULL

The synonym itself. Should be human-readable machine-searchable ascii text.

cvterm

type_id integer UNIQUE#1 NOT NULL

Types would be symbol and fullname for now.
synonym_sgml character varying(255) NOT NULL

The fully specified synonym, with any non-ascii characters encoded in SGML.

Tables referencing this one via Foreign Key Constraints: