Latest revision as of 22:17, 18 December 2013

Introduction

A central module in Chado is the sequence module. The fundamental table within this module is the feature table, for describing biological sequence features. Chado defines a feature to be a region of a biological polymer (typically a DNA, RNA, or a polypeptide molecule) or an aggregate of regions on this polymer. As the term is used here, region can be the entire extent of the molecule, or a junction between two bases. Features can be typed according to an ontology, they can be localized relative to other features, and they can form part-whole and other relationships with other features.

You may find these related documents useful:

Chado Manual
Chado Best Practices - many issues specific to the Sequence module are discussed
Chado FAQ
Introduction to Chado
Chado cv module - the Sequence module makes extensive use of controlled vocabularies

Features

This page or section needs to be edited. Please help by editing this page to add your revisions or additions.

Chado does not distinguish between a sequence and a sequence feature, on the theory that a feature is a piece of a sequence, and a piece of a sequence is a sequence. Both are represented as a row in the feature table.

There are many different types of features. Examples include gene, exon, transcript, regulatory region, chromosome, sequence variation, polypeptide, protein domain and cross-genome match regions. Chado does not have a different table for each kind of feature; all features are stored in the feature table.

Feature types are taken from the Sequence Ontology controlled vocabulary (see also Controlled Vocabulary module, also known as cv). Types of feature are differentiated using a type_id column, which is a foreign key to the cvterm table in the cv (ontology) module, described here. This allows us to type features according to the Sequence Ontology. The use of ontologies to type tables gives Chado a subtyping mechanism, which is absent from the standard relational model. For example, SO tells us that mRNA and snRNA are different kinds of transcript. This is discussed in more in the next section. For the purposes of discussion in this document, it can be assumed that any reference to genes, exons, polypeptides, SNPs, chromosomes, transcripts and various kinds of RNAs and so on refers to features of that Sequence Ontology type.

A selection of Chado-relevant types from SO are shown below:

Sequence Ontology Examples
SO Term	SO id
Exon	SL:0000025
Intron	SL:0000027
mRNA	SL:0000037
miRNA	SL:0000044
regulatory_element	SL:0000052
transcription_factor_binding_site	SL:0000054

The Chado feature table has a text-valued column named residues for storing the sequence of the feature. The value of this column is string of IUPAC symbols corresponding to the sequence of biochemical residues encoded by the feature. This column is optional, because the sequence of the feature may not be known. Even if the sequence of a feature is known, it may not be desirable to store it in the feature table, as it may be possible to infer the sequence from the sequence of other features in the database. For example, exon sequences are generally not stored, as these can trivially be inferred from the sequence of the genomic feature on which the exon is located. In contrast, mRNA and other processed transcript sequences are stored as it is less trivial and more computationally expensive to dynamically splice together the mRNA sequence.

It is important to realize that the existence of a row in the feature table does not necessarily imply that the feature has been characterized as a result of genome annotation. It is possible to have features of SO type gene for genes that have only been characterized through genetic studies, and for which neither sequence nor sequence location is known. This is in contrast to other feature schemas (such as GFF) in which it is not possible to represent features without representing a location in sequence coordinates. This design decision is crucial for the use of Chado as a database for integrating information about the same entity from multiple perspectives.

Because the sequence is stored as a column in the feature table rather than as an independent table, sequences cannot exist in the absence of a row in the feature table; sequences are dependent upon features. This is in contrast with almost all other genomics schemas that allow independent treatment of sequences and features. This design decision follows for both philosophical and prag- matic reasons. The feature table also contains columns seqlen and md5checksum, for storing the length of the sequence and the 32-character checksum computed using the MD5 [RL Rivest. RFC 1321: The md5 message-digest algorithm. Technical report, Internet Activities Board, April 1992.] algorithm. The length and checksum can be stored even when the residues column is null valued. The checksum is useful for checking if two or more features share the same sequence, without comparing the entire sequence string.

The existence of these columns means that this table is no longer in third normal form (3NF), which is usually a desirable formal property of relational database. On balance, the utility of these columns outweighs the disadvantages of violating 3NF. In practical terms, it means that the values of the residues, seqlen and md5checksum columns are interdependent and cannot be updated independently of one another.

The feature table has a Boolean valued column, is_analysis, indicating whether this is an annotation or a computed feature from a computational analysis. Annotations are features that are generated or blessed by a human curator, or in some cases by an integrated genome pipeline (for example, MAKER or DIYA) capable of synthesizing gene models and other annotations from in silico analyses. They constitute the definitive version of a particular feature, in contrast to the features generated by gene prediction programs and sequence similarity searches such as BLAST.

The feature table has a dbxref_id column that refers to a global, stable public identifier for the feature. This column is optional, because not all classes of features have such identifiers for example, features resulting from gene predictions and BLAST HSP features may be less stable and thus lack public identifiers. It is recommended that most annotated features have dbxref_ids. The organism_id column refers to a row in the organism table (defined in the organism module). This column is mandatory if the feature derives from a single organism.

Names of Features

The name and uniquename columns allow features to be labelled. The name column is optional, but it is recommended that all annotated features (as opposed to those that arise from purely computational methods) have names. The name should be a simple, concise, human-friendly display label (such as a gene or gene product symbol, as defined by the nomenclature rules of governing the organism). User interface software (such as GBrowse and Apollo) can use the name column for labelling feature glyphs in user displays. Uniqueness of name within any particular organism or genome project is a desirable characteristic, but is not enforced in the schema, since there are occasions where name clashes are unavoidable. In contrast, the uniquename column is required, and guaranteed to be unique when taken in combination with organism_id and type_id this is enforced by a constraint in the relational schema. The unique name may be human-friendly (for example, it can be the same as the name); however, it is not guaranteed to be so, and in general should not be displayed to the end user. Its use is mainly as an alternate unique key on the table .

The unique name normally conforms to some naming rule these rules may vary across chado instances, but they should all guarantee the uniqueness of the uniquename, organism id, type id triple.

Feature Synonyms

In addition to having a name or symbol, it is common for features such as genes to have multiple synonyms or aliases. These synonyms may exist due to different publications referring to the same gene with different symbols, or because one gene was once believed to be two or more separate genes. A common curation operation on genes is splitting and merging, which results in the creation of synonyms.

This is modelled in Chado with a synonym table and a feature_synonym linking table; thus multiple features can potentially share the same, and a single feature can be have multiple synonyms. Use of a synonym in the literature is indicated with a pub_id foreign key referencing the pub table (see the publications module), indicating historical provenance for the use of a synonym.

Feature synonyms are found by joining to feature_synonym and synonym. For example, here is a query to find gene by name or synonym:

SELECT feature_id FROM feature
WHERE name = 'name of interest'
UNION SELECT feature_id
FROM feature_synonym fs, synonym s
WHERE fs.synonym_id = s.synonym_id
AND s.name = 'name of interest'
AND fs.is_current;

Feature Locations

Features can potentially be localized using a sequence coordinate system. A relative localization model is used, so all feature localizations must be relative to another feature. Some features such as those of type chromosome are not localized in sequence coordinates. Locations are stored in the featureloc table, also part of this sequence module. Other non-sequence oriented kinds of localization (such as physical localization from in situ experiments, or genetic localizations from linkage studies) are modelled outside the sequence module (for example, in the expression module or map module).

A feature can have zero or more featurelocs, although it will typically have either one (for localized features for which the location is known) or zero (for unlocalized features such as chromosomes, or for features for which the location is not yet known, such as a gene discovered using classical genetics techniques). Features with multiple featurelocs will be explained later.

A featureloc is an interval in interbase sequence coordinates (see figure), bounded by the fmin and fmax columns, each representing the lower and upper linear position of the boundary between bases or base pairs, with directionality indicated by the strand column. Interbase coordinates were chosen over the more commonly used base-oriented coordinate system because they are more naturally amenable to the standard arithmetic operations that are typically performed upon sequence coordinates. This leads to cleaner and more efficient database coding logic that is arguably less prone to errors. Of course, interbase coordinates are typically transformed into the more common base-oriented system used by BLAST reports and so forth prior to presentation to the end-user.

The relational schema includes a constraint which ensures that fmin != fmax is always true, and any attempt to set the database in a state which violates this will ﬂag an error .

As mentioned previously, a featureloc must be localized relative to another feature, indicated using the srcfeature_id foreign key column, referencing the feature table. There is nothing in the schema prohibiting localization chains; for example, locating an exon relative to a contig that is itself localized relative to a chromosome (see figure). The majority of Chado database instances will not require this ﬂexibility; features are typically located relative to chromosomes or chromosomes arms. Nevertheless, the ability to store such localization networks or location graphs can be useful for unfinished genomes or parts of genomes such as heterochromatin, in which it is desirable to locate features relative to stable contigs or scaffolds, which are themselves localized in an unstable assembly to chromosomes or chromosome arms. Localization chains do not necessarily only span assemblies protein domains may be localized relative to polypeptide features, themselves localized to a transcript (or to the genome, as is more common). Chains may also span sequence alignments.

The Feature Location Graph

We will now present a short formal treatment of the properties of these hierarchies of localization using graph theory. This treatment can be ignored for the purposes of understanding the basics of the Chado schema; the end-user of the database will be entirely unaware of such technicalities. However, for the purposes of software engineering and ensuring interoperability between different Chado database instances and different applications, formal treatments such as these are an essential requirement for software specifications.

We can define a featureloc graph (LG) as being a set of vertices and edges, with each feature constituting a vertex, and each featureloc constituting an edge going from the parent feature_id vertex to the srcfeature_id vertex. The node is labeled with column values from the feature table, and the edge is labeled with column values from the featureloc table. The LG is not allowed to contain cycles, it is a directed acyclic graph (DAG). This includes self-cycles - no feature may be localized relative to itself.

The roots of the LG are the features that do not have featureloc rows, typically chromosomes or chromosome arms, although LG roots may also be unassembled contigs, scaffolds or features for which sequence localization is not yet known (such as genes discovered through classical genetics techniques). The leaves of the LG are any features that are not present as a srcfeature_id in any featurelocs row typically the bulk of features, such as genes, exons, matches and so on. The depth of a particular LG g, denoted D(g), is the maximum number of edges between any leaf- root pair. As has been previously noted, many Chados will have LGs with a uniform depth of 1. Such LGs are said to be simple and the features within them are said to be singletons. The maximum depth of all LGs in a particular database instance i is denoted LGDmax(i).

The schema does not constrain the maximum depth of the LG. This ﬂexibility proves useful when applying Chado to the highly variable needs of multiple different genome projects; however, it can lead to efficiency problems when querying the database. It can also make it more difficult to write software to interoperate with the database, as the software must take into account different contingencies. We can solve this problem by collapsing the LG, in which a graph of arbitrary depth is ﬂattened to a depth of 1, transforming or projecting featurelocs onto the root features (typically chromosomes or chromosome arms). The original featurelocs are left unaltered in the database, and additional redundant featurelocs between leaf and root features are added to the database. These new featurelocs are known as inferred featurelocs. In the schema inferred featurelocs are differentiated from direct featurelocs using the locgroup column. Direct (non-inferred) localizations are indicated by the locgroup column taking value 0, and transitive localizations are indicated by this column having value !0.

The terminology used above can be used to define specifications for applications intended to interoperate with the database. Certain kinds of features have paired locations. These include hits and high-scoring-pairs (HSPs) coming from sequence search programs such as BLAST, and syntenic chromosomal regions. These kinds of features have two featurelocs (in contrast to the usual 1) one on the query feature and one on the subject (hit) feature. We differentiate the two featurelocs with the rank column. A rank of 0 indicates a location relative to the query (as is the default for most features), and a rank of 1 indicates a location relative to the subject (hit) feature.

For multiple alignments (e.g. CLUSTALW results), this scheme is extended to unbounded ranks [0..n], with arbitrary ordering. Alignments are stored in the residue info column. CIGAR format is used for pairwise alignments.

Multiple featurelocs may also be required for features of type "sequence variant" (SO:0000109), indicating points or extents which vary between reference and non-reference sequences. From a modelling standpoint, variants are conceptually similar to alignments; with variants we are noting a difference as opposed to a similarity. Here a rank of zero indicates the wild-type (or reference) feature and a rank of one or more indicates the variant (or non-reference) feature, with the residue info column representing the sequence on wild-type and variant. A featureloc is uniquely identified by the feature_id, rank, locgroup triple. This means that no feature can have more than one featureloc with the same rank and locgroup. In other words, rank and locgroup uniquely identify a featureloc for any particular feature.

Feature Coordinates

Features are located relative to other features using the featureloc table rows. Features can be located on more than one sequence. For example, a BLAST hit HSP can be a feature of both the query and target sequences. To locate a feature, create a featureloc record with:

srcfeature_id = the id of the sequence on which the feature is being located
feature_id = the id of the feature being located
strand is 1 for the positive strand, -1 for the negative, and 0 for both or indifferent.
fmin, fmax – the minimum and maximum coordinates of the interval
is_fmin_partial, is_fmax_partial = true if needed to indicate that the sequence is incomplete (e.g. for ESTs or EST assemblies which are known to not go all the way to the 3’ or 5’ end.)
phase = 0, 1, or 2 – denotes phase of first base pair in a nucleotide feature with respect to a source protein, or the offset of the first nucleotide in its codon.
rank, locgroup – these are used to organize groups of feature locations and can be ignored in simple cases (the details are discussed below).

Multiple Locations for a Feature

The ability to have multiple locations for a feature has many uses. For example one can locate a SNP, exon, or protein motif on the genome, on a transcript, and on a protein. A region of similarity between two sequences (HSP) can be located on both of them, so if either is viewed the “hit” is visible.

Difference Between the chado Location Model and Other Schemas

There is a crucial difference between the Chado location model and the sequence location model used in other schemas, such as GFF, GenBank, BioSQL, or BioPerl.

First, Chado is the only model to use the concept of rank and locgroup. Second, and perhaps more important, all these other models allow discontiguous locations (also known as "split locations"). These will be familiar to anyone who has inspected GenBank annotated DNA records for an organism that has introns within the transcripts; the transcript location is modelled as a sequence of non-contiguous intervals on the genome. The interval represents the location of an exon. For example:

            /gene="Acph"
    CDS     join(914..1063, 1143..1241, 1297..1536, 1605..2054,
                 2667..2925, 3063..3172)

Although Chado allows a feature to have multiple locations, this is only with variable rank and locgroup and this is enforced by a uniqueness constraint in the relational schema. We made a conscious decision to avoid discontiguous locations, because the extra degree of freedom this affords results in either redundancies or ambiguities. Redundancies arise when exons are stored in addition to a discontiguous transcript, and ambiguities arise by virtue of the fact that explicit representation of the exons may be seen as optional. Ambiguities are undesirable as it makes it harder for databases to interoperate. The omission of discontiguous locations does not restrict the expressive capacity of Chado in any way, because any discontiguous location can be modelled as a collection of features with contiguous locations. For example, a transcript with a discontiguous location can be modelled as a collection of exons with contiguous featurelocs, and a transcript with a single contiguous featureloc representing the outer boundaries defined by the outermost exons.

Feature Rank

The rank field is used when a feature has more than 1 location, otherwise the default rank value of 0 is used. Some features have two locations, for example BLAST hits and HSPs: one location on the query, rank = 0, and one location on the subject, rank = 1.

Extensible Feature Properties

The feature table has a fairly limited set of columns for recording feature data. For example, there is no anticodon column for recording the RNA triplet for the adapter in a tRNA feature (all feature types, including tRNAs, are recorded as rows in the feature table). If we were to add columns such as anticodon then the number of columns in the table would become very large and difficult to manage; most would end up being nullable (for example, anticodon does not apply to non-tRNA features). This is because different organisms, different types of feature and different projects have differing needs regarding what extra data should be attached to any one feature. How then are we to attach both biologically relevant and project specific data to features?

Chado solves this by using an extensible mechanism for attaching attribute-value pairs to features via the featureprop table. The featureprop.type_id foreign key column references a property in the Sequence Ontology. The value text column stores the value filler for that property. Sets or lists of values for any property can be stored in the featureprop table, differentiated by the value of the rank column. Provenance for the featureprop assignment is stored using the featureprop_pub table in the publications module, allowing multiple publications to be associated with any one assignment.

Because featureprop values can be of an arbitrary size, they are modelled using a SQL TEXT type. This has some disadvantages from a query efficiency perspective.

Numeric values cannot be indexed correctly, and sorting the results of a query can only be done via a SQL casting operation, or in software outside of the database management system, either of which may result in poorer performance. This is one of several areas in Chado where performance has been traded in favour of a simpler, more abstract and generic model.

Linking Features to External Databases

Public database identifiers are stored in the dbxref table, which holds the database name, the accession number, and an optional version number. Note that this table holds accession numbers published internally by the Chado instance as well as by other databases. A feature can have a primary dbxref, which is linked directly from the feature table. It can also have additional secondary dbxref's linked via feature_dbxref. A feature need not have a primary dbxref; e.g. computed features may be considered “lightweight” and not assigned accession numbers. Some groups may wish to set up a trigger to automatically assign primary dbxrefs to features of types that are locally accessioned; a sample trigger is provided with the schema.

Feature Annotations

Detailed annotations, such as associations to Gene Ontology (GO) terms or Cell Ontology terms, can be attached to features using the feature_cvterm linking table. This allows multiple ontology terms to be associated with each feature.

Provenance data can be attached with the feature_cvtermprop and feature_cvterm_dbxref higher-order linking tables. It is up to the curation policy of each individual Chado database instance to decide which kinds of features will be linked using feature_cvterm. Some may link terms to gene features, others to the distinct gene products (processed RNAs and polypeptides) that are linked to the gene features.

Annotations for existing features can also go into the featureprop table using the Chado feature_property ontology (defined in chado/load/etc/feature_property.obo) and the comment or description terms as appropriate. The purpose of the feature property ontology (and the related chado/load/etc/genbank_feature_property.obo file) is to capture terms that are likely to appear in GFF or GenBank sequence files. In theory there is no overlap between these ontologies and the Sequence Ontology.

Relationships Between Features

Biological features are inter-related; exons are part of transcripts, transcripts are part of genes, and polypeptides are derived from messenger RNAs. Relationships between individual features are stored in the feature_relationship table, which connects two features via the subject_id and object_id columns (foreign keys referring to the feature table) and a type_id (a foreign key referring to a relationship type in an ontology, either SO, or the OBO relationship ontology, OBO-REL, indicating the nature of the relationship between subject and object features.

The core relationships between features are part-whole (part_of) or temporal (derives_from). Subject and Object describes the linguistic role the two features play in a sentence describing the feature relationship. In English, many sentences follow a subject, predicate, object syntax, and word order is important. To say that ”exons are part of transcripts” is the correct way to describe a typical biological relationship. To say ”transcripts are part of exons” is either grammatically or biologically incorrect.

We use this same terminology (which comes from RDF) again in the cv module. The collection of features and feature relationships can be considered as vertices and edges in a graph, known as the Feature Graph (FG). Example feature graphs are shown above and in the Introduction to Chado.

The FG is independent of the LG and in general the FG and the LG should have no edges in common. If there is a featureloc connecting two features, then the addition of a feature relationship between these same two features is redundant. The FG is required in order to query the database for such things as alternately spliced genes, exons shared between transcripts, etc.

Although the chado schema admits any FG, certain configurations are biologically meaningless, and should not be used. The FG can be constrained by the Sequence Ontology. Standardized FG structures are required for complex applications to be interoperable.

Unlike the LG, the FG may be cyclic, although cycles in the FG are not common. The subset of the FG corresponding to certain kinds of relationship may be acyclic for example, the subset of the FG connecting parts with wholes via part of must be acyclic.

Compliance

This page or section needs to be edited. Please help by editing this page to add your revisions or additions.

This section is not complete, it is in progress.

Chado uses a layered model - this is tried and tested in software engineering. Some generic software can be targeted at the lower layers and be guaranteed to work no matter what. Other more specific software needs a more tightly defined rigorous model and should be targeted at the upper layers.

We require validation software and more formal or computable descriptions of these layers and policies - for now natural language descriptions will have to suffice.

Chado Compliance Layers

Proposal for levels of compliance.

Level 0: Relational Schema

Level 0 conformance basically means the schema is adhered to. Obviously, this is enforced by the DBMS.

Layer 1: Ontologies

Level 1 conformance is minimal conformance to SO - all feature.types must be SO terms, and all feature relationship.types must be SO relationship types.

Level 2: Graph

Level 2 conformance is graph conformance to SO - all feature relationships between a feature of type X and Y must correspond to relationship of that type in SO; for example, mRNA can be part of gene, but mRNA can not be part of golden path region. [more detailed/formal explanation to come]. In practice Level 2 conformance may be undesirable, we may need to make modifications to SO.

Orthogonal to these layers are various additional policy decisions. Some of these are more tolerant of non-conformance than others. (there is also some overlaps with levels 1 and 2).

Examples: Current implementations

This section describes details of how different sites are using Chado. This is likely outdated information.

TIGR: Currently at level 0 conformance, though most (if not all) of the terms being used have an obvious counterpart in SO. Therefore these ”TIGR Ontology” terms are used in the answers to the SO-related questions that appear below. We plan on updating our terms with SO terms very soon.

SO terms used for Standard Central-dogma Gene Model

FlyBase: gene mRNA exon protein [other types are derivable].

TIGR: gene transcript CDS exon protein [though the strict answer is for any of these SO questions is ”none” since we do not yet meet level 1 conformance].

NOTE: we should be using ’polypeptide’ instead of ’protein’. For now, software should be tolerant of both these uses.

SO terms Used for Storing Alignments

FlyBase: match

TIGR: match

NOTE: we want to use the new more specific SO types for match set, match part, for hits and hsps respectively. For now, software should be tolerant of either usage.

TIGR: We’ve also extended the model for storing pairwise alignments to store multiple alignments. Each member of the alignment is featureloced to the ’match’ feature. We’ve used this representation to store paralogous/orthologous gene families.

feature_relationship Types

FlyBase: partof (for mRNA to gene and exon to mRNA) producedby (for protein to mRNA)

TIGR: part of (gene-assembly, exon-transcript, assembly-supercontig) produced by (protein- CDS, CDS-transcript, transcript-gene)

NOTE: this should be ”part of” and ”derived from” to conform to SO. Most read-only software should be able to safely ignore feature relationship.type anyway. Protein should be polypeptide - see note above

NOTE: the main difference between FB and TIGR here is that TIGR introduce an intermediate CDS feature between mRNA and protein.

featureloc Policy

FlyBase: all constituent parts of a central dogma gene model are located relative to the same srcfeature (the chromosome arm). No redundant locations (i.e. featureloc.group ¿ 0) are used.

TIGR: Redundant locations are used and indicated with featureloc.group ¿ 0.

NOTE: we want to allow some ﬂexibility with this policy. We believe that the constituent parts linked located relative to the feature should always be followed. This can be stated more formally as:

 IF  X is linked to Y via feature_relationship
 AND X is located relative to Z via featureloc.srcfeature_id
 THEN Y must also be located relative to Z via featureloc.srcfeature_id

TIGR: We’ve followed this policy in adding a featureloc between the protein and genomic contig in our databases (such a featureloc does not appear in the Chado usage documents). This additional featureloc simplifies many queries, especially when looking at the genomic context of ’match’ features associated with proteins.

We should also expect that the fmin/fmax boundaries of a feature be defined the the outermost boundaries of the outermost constituent part features (this rule may require refinement when we have promoters, enhancers and so on - but for now we don’t).

As to what the srcfeature should be, it could be a contig, and assembly or a top-level locat- able feature such as chromosome or chromosome arm. Software should be tolerant of different choices here. Whilst it is generally always best to locate relative to the topmost feature (ie the arm/chromosome), sometimes this is not possible or desirable (eg low coverage, heterochromatin).

Non-central Dogma Gene Models

FlyBase: we store a lot of non-central dogma gene models; noncoding gene models and pseudogenes [need to fill in more details here].

TIGR: not many of these stored yet, save for a few pseudogenes and the occasional non-coding ORF.

Other Features

FlyBase: the FlyBase implementation includes many other feature types, including polyA site and se- quence variant [need to fill in details].

TIGR: using ’SNP’ in some databases.

Derivable Feature Types

FlyBase: derivable features (introns, UTRs, intergenic region) are not included. Feature typing is always done to the most specific, non-derivale level. For example, we never use types ”5 prime exon”, ”dicistronic gene”, ”coding exon” as these are always inferrable. We always use type ”gene” - the specific type of gene is inferred from the child type (mRNA, tRNA, snRNA, etc)..

TIGR: derivable features are not included. currently not storing any tRNAs or snRNAs.

NOTE: whilst it is perfectly permissable to include redundant derivable features (useful for warehouse-style querying), you should not write software that expects to find these if you want the software to work on different chado db instances.

Sequence Variants

FlyBase: these are included in chado, but they are lacking full detail.

TIGR: only SNPs so far. the SNPs currently being stored are computed from pairwise alignments of sequences already loaded into Chado, so each SNP feature is featureloc’ed to the appropriate place on each of the two sequences (rather than having one of the featurelocs ”dangling”, as indicated in some of the Chado usage documents.) featureloc.residue info is used to redundantly store the base referenced in each of the two sequences.

NOTE: variation features should specify the edit that makes one feature (such as the reference/wild-type) from another (the variant/mutant/non-reference). There were perhaps 2 proposals for this [more details required...].

Tables

Table: feature

A feature is a biological sequence or a section of a biological sequence, or a collection of such sections. Examples include genes, exons, transcripts, regulatory regions, polypeptides, protein domains, chromosome sequences, sequence variations, cross-genome match regions such as hits and HSPs and so on; see the Sequence Ontology for more.

feature Structure
F-Key	Name	Type	Description
	feature_id	serial	PRIMARY KEY
dbxref	dbxref_id	integer	An optional primary public stable identifier for this feature. Secondary identifiers and external dbxrefs go in the table feature_dbxref.
organism	organism_id	integer	UNIQUE#1 NOT NULL The organism to which this feature belongs. This column is mandatory.
	name	character varying(255)	The optional human-readable common name for a feature, for display purposes.
	uniquename	text	UNIQUE#1 NOT NULL The unique name for a feature; may not be necessarily be particularly human-readable, although this is preferred. This name must be unique for this type of feature within this organism.
	residues	text	A sequence of alphabetic characters representing biological residues (nucleic acids, amino acids). This column does not need to be manifested for all features; it is optional for features such as exons where the residues can be derived from the featureloc. It is recommended that the value for this column be manifested for features which may may non-contiguous sublocations (e.g. transcripts), since derivation at query time is non-trivial. For expressed sequence, the DNA sequence should be used rather than the RNA sequence.
	seqlen	integer	The length of the residue feature. See column:residues. This column is partially redundant with the residues column, and also with featureloc. This column is required because the location may be unknown and the residue sequence may not be manifested, yet it may be desirable to store and query the length of the feature. The seqlen should always be manifested where the length of the sequence is known.
	md5checksum	character(32)	The 32-character checksum of the sequence, calculated using the MD5 algorithm. This is practically guaranteed to be unique for any feature. This column thus acts as a unique identifier on the mathematical sequence.
cvterm	type_id	integer	UNIQUE#1 NOT NULL A required reference to a table:cvterm giving the feature type. This will typically be a Sequence Ontology identifier. This column is thus used to subclass the feature table.
	is_analysis	boolean	NOT NULL DEFAULT false Boolean indicating whether this feature is annotated or the result of an automated analysis. Analysis results also use the companalysis module. Note that the dividing line between analysis and annotation may be fuzzy, this should be determined on a per-project basis in a consistent manner. One requirement is that there should only be one non-analysis version of each wild-type gene feature in a genome, whereas the same gene feature can be predicted multiple times in different analyses.
	is_obsolete	boolean	NOT NULL DEFAULT false Boolean indicating whether this feature has been obsoleted. Some chado instances may choose to simply remove the feature altogether, others may choose to keep an obsolete row in the table.
	timeaccessioned	timestamp without time zone	NOT NULL DEFAULT ('now'::text)::timestamp(6) with time zone For handling object accession or modification timestamps (as opposed to database auditing data, handled elsewhere). The expectation is that these fields would be available to software interacting with chado.
	timelastmodified	timestamp without time zone	NOT NULL DEFAULT ('now'::text)::timestamp(6) with time zone For handling object accession or modification timestamps (as opposed to database auditing data, handled elsewhere). The expectation is that these fields would be available to software interacting with chado.

Tables referencing this one via Foreign Key Constraints:

analysisfeature

element

feature_cvterm

feature_dbxref

feature_expression

feature_genotype

feature_phenotype

feature_pub

feature_relationship

feature_synonym

featureloc

featurepos

featureprop

featurerange

library_feature

phylonode

wwwuser_feature

Table: feature_cvterm

Associate a term from a cv with a feature, for example, GO annotation.

feature_cvterm Structure
F-Key	Name	Type	Description
	feature_cvterm_id	serial	PRIMARY KEY
feature	feature_id	integer	UNIQUE#1 NOT NULL
cvterm	cvterm_id	integer	UNIQUE#1 NOT NULL
pub	pub_id	integer	UNIQUE#1 NOT NULL Provenance for the annotation. Each annotation should have a single primary publication (which may be of the appropriate type for computational analyses) where more details can be found. Additional provenance dbxrefs can be attached using feature_cvterm_dbxref.
	is_not	boolean	NOT NULL DEFAULT false If this is set to true, then this annotation is interpreted as a NEGATIVE annotation - i.e. the feature does NOT have the specified function, process, component, part, etc. See GO docs for more details.

Tables referencing this one via Foreign Key Constraints:

feature_cvterm_dbxref

feature_cvterm_pub

feature_cvtermprop

Table: feature_cvterm_dbxref

Additional dbxrefs for an association. Rows in the feature_cvterm table may be backed up by dbxrefs. For example, a feature_cvterm association that was inferred via a protein-protein interaction may be backed by by refering to the dbxref for the alternate protein. Corresponds to the WITH column in a GO gene association file (but can also be used for other analagous associations). See http://www.geneontology.org/doc/GO.annotation.shtml#file for more details.

feature_cvterm_dbxref Structure
F-Key	Name	Type	Description
	feature_cvterm_dbxref_id	serial	PRIMARY KEY
feature_cvterm	feature_cvterm_id	integer	UNIQUE#1 NOT NULL
dbxref	dbxref_id	integer	UNIQUE#1 NOT NULL

Table: feature_cvterm_pub

Secondary pubs for an association. Each feature_cvterm association is supported by a single primary publication. Additional secondary pubs can be added using this linking table (in a GO gene association file, these corresponding to any IDs after the pipe symbol in the publications column.

feature_cvterm_pub Structure
F-Key	Name	Type	Description
	feature_cvterm_pub_id	serial	PRIMARY KEY
feature_cvterm	feature_cvterm_id	integer	UNIQUE#1 NOT NULL
pub	pub_id	integer	UNIQUE#1 NOT NULL

Table: feature_cvtermprop

Extensible properties for feature to cvterm associations. Examples: GO evidence codes; qualifiers; metadata such as the date on which the entry was curated and the source of the association. See the featureprop table for meanings of type_id, value and rank.

feature_cvtermprop Structure
F-Key	Name	Type	Description
	feature_cvtermprop_id	serial	PRIMARY KEY
feature_cvterm	feature_cvterm_id	integer	UNIQUE#1 NOT NULL
cvterm	type_id	integer	UNIQUE#1 NOT NULL The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. cvterms may come from the OBO evidence code cv.
	value	text	The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
	rank	integer	UNIQUE#1 NOT NULL Property-Value ordering. Any feature_cvterm can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.

Table: feature_dbxref

Links a feature to dbxrefs. This is for secondary identifiers; primary identifiers should use feature.dbxref_id.

feature_dbxref Structure
F-Key	Name	Type	Description
	feature_dbxref_id	serial	PRIMARY KEY
feature	feature_id	integer	UNIQUE#1 NOT NULL
dbxref	dbxref_id	integer	UNIQUE#1 NOT NULL
	is_current	boolean	NOT NULL DEFAULT true True if this secondary dbxref is the most up to date accession in the corresponding db. Retired accessions should set this field to false.

Table: feature_pub

Provenance. Linking table between features and publications that mention them.

feature_pub Structure
F-Key	Name	Type	Description
	feature_pub_id	serial	PRIMARY KEY
feature	feature_id	integer	UNIQUE#1 NOT NULL
pub	pub_id	integer	UNIQUE#1 NOT NULL

Tables referencing this one via Foreign Key Constraints:

feature_pubprop

Table: feature_pubprop

Property or attribute of a feature_pub link.

feature_pubprop Structure
F-Key	Name	Type	Description
	feature_pubprop_id	serial	PRIMARY KEY
feature_pub	feature_pub_id	integer	UNIQUE#1 NOT NULL
cvterm	type_id	integer	UNIQUE#1 NOT NULL
	value	text
	rank	integer	UNIQUE#1 NOT NULL

Table: feature_relationship

Features can be arranged in graphs, e.g. "exon part_of transcript part_of gene"; If type is thought of as a verb, the each arc or edge makes a statement [Subject Verb Object]. The object can also be thought of as parent (containing feature), and subject as child (contained feature or subfeature). We include the relationship rank/order, because even though most of the time we can order things implicitly by sequence coordinates, we can not always do this - e.g. transpliced genes. It is also useful for quickly getting implicit introns.

feature_relationship Structure
F-Key	Name	Type	Description
	feature_relationship_id	serial	PRIMARY KEY
feature	subject_id	integer	UNIQUE#1 NOT NULL The subject of the subj-predicate-obj sentence. This is typically the subfeature.
feature	object_id	integer	UNIQUE#1 NOT NULL The object of the subj-predicate-obj sentence. This is typically the container feature.
cvterm	type_id	integer	UNIQUE#1 NOT NULL Relationship type between subject and object. This is a cvterm, typically from the OBO relationship ontology, although other relationship types are allowed. The most common relationship type is OBO_REL:part_of. Valid relationship types are constrained by the Sequence Ontology.
	value	text	Additional notes or comments.
	rank	integer	UNIQUE#1 NOT NULL The ordering of subject features with respect to the object feature may be important (for example, exon ordering on a transcript - not always derivable if you take trans spliced genes into consideration). Rank is used to order these; starts from zero.

Tables referencing this one via Foreign Key Constraints:

feature_relationship_pub

feature_relationshipprop

Table: feature_relationship_pub

Provenance. Attach optional evidence to a feature_relationship in the form of a publication.

feature_relationship_pub Structure
F-Key	Name	Type	Description
	feature_relationship_pub_id	serial	PRIMARY KEY
feature_relationship	feature_relationship_id	integer	UNIQUE#1 NOT NULL
pub	pub_id	integer	UNIQUE#1 NOT NULL

Table: feature_relationshipprop

Extensible properties for feature_relationships. Analagous structure to featureprop. This table is largely optional and not used with a high frequency. Typical scenarios may be if one wishes to attach additional data to a feature_relationship - for example to say that the feature_relationship is only true in certain contexts.

feature_relationshipprop Structure
F-Key	Name	Type	Description
	feature_relationshipprop_id	serial	PRIMARY KEY
feature_relationship	feature_relationship_id	integer	UNIQUE#1 NOT NULL
cvterm	type_id	integer	UNIQUE#1 NOT NULL The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. Currently there is no standard ontology for feature_relationship property types.
	value	text	The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
	rank	integer	UNIQUE#1 NOT NULL Property-Value ordering. Any feature_relationship can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.

Tables referencing this one via Foreign Key Constraints:

feature_relationshipprop_pub

Table: feature_relationshipprop_pub

Provenance for feature_relationshipprop.

feature_relationshipprop_pub Structure
F-Key	Name	Type	Description
	feature_relationshipprop_pub_id	serial	PRIMARY KEY
feature_relationshipprop	feature_relationshipprop_id	integer	UNIQUE#1 NOT NULL
pub	pub_id	integer	UNIQUE#1 NOT NULL

Table: feature_synonym

Linking table between feature and synonym.

feature_synonym Structure
F-Key	Name	Type	Description
	feature_synonym_id	serial	PRIMARY KEY
synonym	synonym_id	integer	UNIQUE#1 NOT NULL
feature	feature_id	integer	UNIQUE#1 NOT NULL
pub	pub_id	integer	UNIQUE#1 NOT NULL The pub_id link is for relating the usage of a given synonym to the publication in which it was used.
	is_current	boolean	NOT NULL DEFAULT true The is_current boolean indicates whether the linked synonym is the current -official- symbol for the linked feature.
	is_internal	boolean	NOT NULL DEFAULT false Typically a synonym exists so that somebody querying the db with an obsolete name can find the object theyre looking for (under its current name. If the synonym has been used publicly and deliberately (e.g. in a paper), it may also be listed in reports as a synonym. If the synonym was not used deliberately (e.g. there was a typo which went public), then the is_internal boolean may be set to -true- so that it is known that the synonym is -internal- and should be queryable but should not be listed in reports as a valid synonym.

Table: featureloc

The location of a feature relative to another feature. Important: interbase coordinates are used. This is vital as it allows us to represent zero-length features e.g. splice sites, insertion points without an awkward fuzzy system. Features typically have exactly ONE location, but this need not be the case. Some features may not be localized (e.g. a gene that has been characterized genetically but no sequence or molecular information is available). Note on multiple locations: Each feature can have 0 or more locations. Multiple locations do NOT indicate non-contiguous locations (if a feature such as a transcript has a non-contiguous location, then the subfeatures such as exons should always be manifested). Instead, multiple featurelocs for a feature designate alternate locations or grouped locations; for instance, a feature designating a blast hit or hsp will have two locations, one on the query feature, one on the subject feature. Features representing sequence variation could have alternate locations instantiated on a feature on the mutant strain. The column:rank is used to differentiate these different locations. Reflexive locations should never be stored - this is for -proper- (i.e. non-self) locations only; nothing should be located relative to itself.

featureloc Structure
F-Key	Name	Type	Description
	featureloc_id	serial	PRIMARY KEY
feature	feature_id	integer	UNIQUE#1 NOT NULL The feature that is being located. Any feature can have zero or more featurelocs.
feature	srcfeature_id	integer	The source feature which this location is relative to. Every location is relative to another feature (however, this column is nullable, because the srcfeature may not be known). All locations are -proper- that is, nothing should be located relative to itself. No cycles are allowed in the featureloc graph.
	fmin	integer	The leftmost/minimal boundary in the linear range represented by the featureloc. Sometimes (e.g. in Bioperl) this is called -start- although this is confusing because it does not necessarily represent the 5-prime coordinate. Important: This is space-based (interbase) coordinates, counting from zero. To convert this to the leftmost position in a base-oriented system (eg GFF, Bioperl), add 1 to fmin.
	is_fmin_partial	boolean	NOT NULL DEFAULT false This is typically false, but may be true if the value for column:fmin is inaccurate or the leftmost part of the range is unknown/unbounded.
	fmax	integer	The rightmost/maximal boundary in the linear range represented by the featureloc. Sometimes (e.g. in bioperl) this is called -end- although this is confusing because it does not necessarily represent the 3-prime coordinate. Important: This is space-based (interbase) coordinates, counting from zero. No conversion is required to go from fmax to the rightmost coordinate in a base-oriented system that counts from 1 (e.g. GFF, Bioperl).
	is_fmax_partial	boolean	NOT NULL DEFAULT false This is typically false, but may be true if the value for column:fmax is inaccurate or the rightmost part of the range is unknown/unbounded.
	strand	smallint	The orientation/directionality of the location. Should be 0, -1 or +1.
	phase	integer	Phase of translation with respect to srcfeature_id. Values are 0, 1, 2. It may not be possible to manifest this column for some features such as exons, because the phase is dependant on the spliceform (the same exon can appear in multiple spliceforms). This column is mostly useful for predicted exons and CDSs.
	residue_info	text	Alternative residues, when these differ from feature.residues. For instance, a SNP feature located on a wild and mutant protein would have different alternative residues. for alignment/similarity features, the alternative residues is used to represent the alignment string (CIGAR format). Note on variation features; even if we do not want to instantiate a mutant chromosome/contig feature, we can still represent a SNP etc with 2 locations, one (rank 0) on the genome, the other (rank 1) would have most fields null, except for alternative residues.
	locgroup	integer	UNIQUE#1 NOT NULL This is used to manifest redundant, derivable extra locations for a feature. The default locgroup=0 is used for the DIRECT location of a feature. Important: most Chado users may never use featurelocs WITH logroup > 0. Transitively derived locations are indicated with locgroup > 0. For example, the position of an exon on a BAC and in global chromosome coordinates. This column is used to differentiate these groupings of locations. The default locgroup 0 is used for the main or primary location, from which the others can be derived via coordinate transformations. Another example of redundant locations is storing ORF coordinates relative to both transcript and genome. Redundant locations open the possibility of the database getting into inconsistent states; this schema gives us the flexibility of both warehouse instantiations with redundant locations (easier for querying) and management instantiations with no redundant locations. An example of using both locgroup and rank: imagine a feature indicating a conserved region between the chromosomes of two different species. We may want to keep redundant locations on both contigs and chromosomes. We would thus have 4 locations for the single conserved region feature - two distinct locgroups (contig level and chromosome level) and two distinct ranks (for the two species).
	rank	integer	UNIQUE#1 NOT NULL Used when a feature has >1 location, otherwise the default rank 0 is used. Some features (e.g. blast hits and HSPs) have two locations - one on the query and one on the subject. Rank is used to differentiate these. Rank=0 is always used for the query, Rank=1 for the subject. For multiple alignments, assignment of rank is arbitrary. Rank is also used for sequence_variant features, such as SNPs. Rank=0 indicates the wildtype (or baseline) feature, Rank=1 indicates the mutant (or compared) feature.

featureloc Constraints
Name	Constraint
featureloc_c2	CHECK ((fmin <= fmax))

Tables referencing this one via Foreign Key Constraints:

featureloc_pub

Table: featureloc_pub

Provenance of featureloc. Linking table between featurelocs and publications that mention them.

featureloc_pub Structure
F-Key	Name	Type	Description
	featureloc_pub_id	serial	PRIMARY KEY
featureloc	featureloc_id	integer	UNIQUE#1 NOT NULL
pub	pub_id	integer	UNIQUE#1 NOT NULL

Table: featureprop

A feature can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible.

featureprop Structure
F-Key	Name	Type	Description
	featureprop_id	serial	PRIMARY KEY
feature	feature_id	integer	UNIQUE#1 NOT NULL
cvterm	type_id	integer	UNIQUE#1 NOT NULL The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. Certain property types will only apply to certain feature types (e.g. the anticodon property will only apply to tRNA features) ; the types here come from the sequence feature property ontology.
	value	text	The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
	rank	integer	UNIQUE#1 NOT NULL Property-Value ordering. Any feature can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used

Tables referencing this one via Foreign Key Constraints:

featureprop_pub

Table: featureprop_pub

Provenance. Any featureprop assignment can optionally be supported by a publication.

featureprop_pub Structure
F-Key	Name	Type	Description
	featureprop_pub_id	serial	PRIMARY KEY
featureprop	featureprop_id	integer	UNIQUE#1 NOT NULL
pub	pub_id	integer	UNIQUE#1 NOT NULL

Table: synonym

A synonym for a feature. One feature can have multiple synonyms, and the same synonym can apply to multiple features.

synonym Structure
F-Key	Name	Type	Description
	synonym_id	serial	PRIMARY KEY
	name	character varying(255)	UNIQUE#1 NOT NULL The synonym itself. Should be human-readable machine-searchable ascii text.
cvterm	type_id	integer	UNIQUE#1 NOT NULL Types would be symbol and fullname for now.
	synonym_sgml	character varying(255)	NOT NULL The fully specified synonym, with any non-ascii characters encoded in SGML.

Tables referencing this one via Foreign Key Constraints:

feature_synonym

library_synonym

@@ Line 1: / Line 1: @@
-Chapter 4
+=Introduction=
+A central module in Chado is the sequence module. The fundamental table within this module
+is the feature table, for describing biological sequence features. Chado defines a feature to be a
+region of a biological polymer (typically a DNA, RNA, or a polypeptide molecule) or an aggregate
+of regions on this polymer. As the term is used here, region can be the entire extent of the molecule,
+or a junction between two bases. Features can be typed according to an ontology, they
+can be localized relative to other features, and they can form part-whole and other relationships
+with other features.
-The Sequence Module: Features
+You may find these related documents useful:
+* [[Chado Manual]]
+* [[Chado_Best_Practices|Chado Best Practices]] - many issues specific to the Sequence module are discussed
+* [[Chado_FAQ|Chado FAQ]]
+* [[Introduction_to_Chado|Introduction to Chado]]
+* [[Chado_CV_Module|Chado cv module]] - the Sequence module makes extensive use of controlled vocabularies
-.1  The role of features in Chado
+=Features=
+{{NeedsEditing}}
-The central module in Chado is the sequence module. The fundamental table within this module
+Chado does not distinguish between a sequence and a sequence feature, on the theory that a feature is a piece of a sequence, and a piece of a sequence is a sequence. Both are represented as a row in the [[#Table:_feature|feature]] table.
-is the feature table, for describing biological sequence features. Chado deﬁnes a feature to be a
-region of a biological polymer (typically a DNA, RNA, or a polypeptide molecule) or an aggregate
-of regions on this polymer. As the term is used here, region can be the entire extent of the molecule,
-or a junction between two bases. Features can be typed according to a classiﬁcation scheme[6], they
-can be localized relative to other features, and they can form part-whole and other relationships
-with other features.
-There are many diﬀerent types of features. Examples include gene, exon, transcript, regulatory
+There are many different types of features. Examples include gene, exon, transcript, regulatory
 region, chromosome, sequence variation, polypeptide, protein domain and cross-genome match
-regions. Chado does not have a diﬀerent table for each kind of feature; all features are stored in
+regions. Chado does not have a different table for each kind of feature; all features are stored in
-the feature table. Types of feature are diﬀerentiated using a type id column, which is a foreign key
+the [[#Table:_feature|feature]] table.
-to the cvterm table in the cv (ontology) module, described later. This allows us to type features
+Feature types are taken from the  [http://www.sequenceontology.org/ Sequence Ontology] controlled vocabulary (see also [[Chado_CV_Module|Controlled Vocabulary module]], also known as ''cv''). Types of feature are differentiated using a ''type_id'' column, which is a foreign key to the [[Chado_Tables#Table:_cvterm|cvterm]] table in the cv (ontology) module, described [[Chado_CV_Module|here]]. This allows us to type features
 according to the Sequence Ontology. The use of ontologies to type tables gives Chado a subtyping
 mechanism, which is absent from the standard relational model. For example, SO tells us that
-mRNA and snRNA are diﬀerent kinds of transcript. This is discussed in more in the next section.
+mRNA and snRNA are different kinds of transcript. This is discussed in more in the next section.
 For the purposes of discussion in this document, it can be assumed that any reference to genes,
 exons, polypeptides, SNPs, chromosomes, transcripts and various kinds of RNAs and so on refers
-to features of that sequence ontology type.
+to features of that Sequence Ontology type.
-The Chado feature table has a text-valued column named residues for storing the sequence
+A selection of Chado-relevant types from SO are shown below:
-of the feature. The value of this column is string of IUPAC[REF] symbols corresponding to the
+{| class="wikitable"
+|+ Sequence Ontology Examples
+!SO Term
+!SO id
+|-
+|[http://www.sequenceontology.org/miSO/SO_CVS/exon.html Exon]
+|SL:0000025
+|-
+|[http://www.sequenceontology.org/miSO/SO_CVS/intron.html Intron]
+|SL:0000027
+|-
+|[http://www.sequenceontology.org/miSO/SO_CVS/mRNA.html mRNA]
+|SL:0000037
+|-
+|[http://www.sequenceontology.org/miSO/SO_CVS/miRNA miRNA]
+|SL:0000044
+|-
+|[http://www.sequenceontology.org/miSO/SO_CVS/regulatory_element regulatory_element]
+|SL:0000052
+|-
+|[http://www.sequenceontology.org/miSO/SO_CVS/transcription_factor_binding_site.html transcription_factor_binding_site]
+|SL:0000054
+|-
+|}
+The Chado [[#Table:_feature|feature]] table has a text-valued column named ''residues'' for storing the sequence
+of the feature. The value of this column is string of [http://bioinformatics.org/sms/iupac.html IUPAC symbols] corresponding to the
 sequence of biochemical residues encoded by the feature. This column is optional, because the
 sequence of the feature may not be known. Even if the sequence of a feature is known, it may not
-be desirable to store it in the feature table, as it may be possible to infer the sequence from the
+be desirable to store it in the [[#Table:_feature|feature]] table, as it may be possible to infer the sequence from the
 sequence of other features in the database. For example, exon sequences are generally not stored,
 as these can trivially be inferred from the sequence of the genomic feature on which the exon is
@@ Line 38: / Line 74: @@
 and more computationally expensive to dynamically splice together the mRNA sequence.
+It is important to realize that the existence of a row in the [[#Table:_feature|feature]] table does not necessarily
-It is important to realize that the existence of a row in the feature table does not necessarily
 imply that the feature has been characterized as a result of genome annotation. It is possible to
-have features of SO type gene for genes that have only been characterized through genetic studies
+have features of SO type gene for genes that have only been characterized through genetic studies, and for which neither sequence nor sequence location is known. This is in contrast to other
-[REF], and for which neither sequence nor sequence location is known. This is in contrast to other
+feature schemas (such as [[GFF]]) in which it is not possible to represent features without representing
-feature schemas (such as GFF) in which it is not possible to represent features without representing
 a location in sequence coordinates. This design decision is crucial for the use of Chado as a database
 for integrating information about the same entity from multiple perspectives.
-Because the sequence is stored as a column in the feature table rather than as an independent
+Because the sequence is stored as a column in the [[#Table:_feature|feature]] table rather than as an independent
 table, sequences cannot exist in the absence of a row in the feature table; sequences are dependent
 upon features. This is in contrast with almost all other genomics schemas that allow independent
 treatment of sequences and features. This design decision follows for both philosophical and prag-
-matic reasons. The feature table also contains columns seqlen and md5checksum, for storing the
+matic reasons. The [[#Table:_feature|feature]] table also contains columns ''seqlen'' and ''md5checksum'', for storing the
 length of the sequence and the 32-character checksum computed using the MD5 [RL Rivest. RFC
 : The md5 message-digest algorithm. Technical report, Internet Activities Board, April 1992.]
-algorithm. The length and checksum can be stored even when the residues column is null valued.
+algorithm. The length and checksum can be stored even when the ''residues'' column is null valued.
 The checksum is useful for checking if two or more features share the same sequence, without
 comparing the entire sequence string.
-The existence of these columns means that this table is no longer in third normal form (3NF)[REF],
+The existence of these columns means that this table is no longer in  [[wp:Third_normal_form|third normal form (3NF)]],
 which is usually a desirable formal property of relational database. On balance, the utility of these
-columns outweighs the disadvantages of violating 3NF [updates]. In practical terms, it means that
+columns outweighs the disadvantages of violating [[wp:Third_normal_form|3NF]]. In practical terms, it means that
-the values of the residues, seqlen and md5checksum columns are interdependent and cannot be
+the values of the ''residues, seqlen'' and ''md5checksum'' columns are interdependent and cannot be
 updated independently of one another.
-The feature table has a Boolean valued column, is analysis, indicating whether this is an an-
+The [[#Table:_feature|feature]] table has a Boolean valued column, ''is_analysis'', indicating whether this is an annotation or a computed feature from a computational analysis. Annotations are features that are
-notation or a computed feature from a computational analysis. Annotations are features that are
+generated or blessed by a human curator, or in some cases by an integrated genome pipeline (for example, [[MAKER]] or [[DIYA]]) capable of synthesizing gene models and other annotations from ''in silico'' analyses. They constitute
-generated or blessed by a human curator, or in some cases by an integrated genome pipeline[7-9]
+the definitive version of a particular feature, in contrast to the features generated by gene prediction
-capable of synthesising gene models and other annotations from in-silico analyses. They constitute
-the deﬁnitive version of a particular feature, in contrast to the features generated by gene prediction
 programs and sequence similarity searches such as BLAST.
-The feature table has a dbxref id column that refers to a global, stable public identiﬁer for
+The [[#Table:_feature|feature]] table has a ''dbxref_id'' column that refers to a global, stable public identifier for
-the feature. This column is optional, because not all classes of features have such identiﬁers for
+the feature. This column is optional, because not all classes of features have such identifiers for
-example, features resulting from gene predictions and blast HSP features may be less stable and
+example, features resulting from gene predictions and BLAST HSP features may be less stable and
-thus lack public identiﬁers. It is recommended that most annotated features have dbxref ids. The
+thus lack public identifiers. It is recommended that most annotated features have ''dbxref_id''s. The
-organism id column refers to a row in the organism table (deﬁned in the organism module). This
+''organism_id'' column refers to a row in the [[Chado_Tables#Table:_organism|organism]] table (defined in the [[Chado_Organism_Module|organism module]]). This
-column is mandatoryall features derive from a single organism.
+column is mandatory if the feature derives from a single organism.
-The name and uniquename columns allow features to be labelled. The name column is optional,
+==Names of Features==
+The ''name'' and ''uniquename'' columns allow features to be labelled. The ''name'' column is optional,
 but it is recommended that all annotated features (as opposed to those that arise from purely
 computational methods) have names. The name should be a simple, concise, human-friendly display
-label (such as a gene or gene product symbol, as deﬁned by the nomenclature rules of governing the
+label (such as a gene or gene product symbol, as defined by the nomenclature rules of governing the
-organism). User interface software (such as GBrowse[10] and Apollo[11]) can use the name column
+organism). User interface software (such as [[GBrowse]] and [[Apollo]]) can use the ''name'' column
 for labelling feature glyphs in user displays. Uniqueness of name within any particular organism
 or genome project is a desirable characteristic, but is not enforced in the schema, since there are
-occasions where name clashes are unavoidable. In contrast, the uniquename column is required,
+occasions where name clashes are unavoidable. In contrast, the ''uniquename'' column is required,
-and guaranteed to be unique when taken in combination with organism id and type id  this is
+and guaranteed to be unique when taken in combination with ''organism_id'' and ''type_id''  this is
-enforced by a constraint in the relational schema. The uniquename may be human-friendly (for
+enforced by a constraint in the relational schema. The unique name may be human-friendly (for
 example, it can be the same as the name); however, it is not guaranteed to be so, and in general
 should not be displayed to the end user. Its use is mainly as an alternate unique key on the table .
- The uniquename normally conforms to some naming rule these rules may vary across chado
+The unique name normally conforms to some naming rule these rules may vary across chado
-instances, but they should all guarantee the uniqueness of the uniquename, organism id, type id
+instances, but they should all guarantee the uniqueness of the ''uniquename, organism id, type id''
 triple.
+[[Image:Feature-tables.png]]
-Feature synonyms
+==Feature Synonyms==
 In addition to having a name or symbol, it is common for features such as genes to have multiple
-synonyms or aliases. These synonyms may exist due to diﬀerent publications referring to the same
+synonyms or aliases. These synonyms may exist due to different publications referring to the same
-gene with diﬀerent symbols, or because one gene was once believed to be two or more separate
+gene with different symbols, or because one gene was once believed to be two or more separate
-genes. A common curation operation on genes[REF] is splitting and merging, which results in the
+genes. A common curation operation on genes is splitting and merging, which results in the
 creation of synonyms.
- This is modelled in Chado with a synonym table and a feature synonym linking table; thus
+This is modelled in Chado with a [[#Table:_synonym|synonym]] table and a [[#Table:_feature_synonym|feature_synonym]] linking table; thus multiple features can potentially share the same, and a single feature can be have multiple synonyms. Use of a synonym in the literature is indicated with a ''pub_id'' foreign key referencing the [[Chado_Tables#Table:_pub|pub]] table (see [[Chado_Publication_Module|the publications module]]), indicating historical provenance for the use of a synonym.
-multiple features can potentially share the same, and a single feature can be have multiple synonyms.
-Use of a synonym in the literature is indicated with a pub id foreign key referencing the pub table
-(described later in the section on publications module), indicating historical provenance for the use
-of a synonym.
+Feature synonyms are found by joining to [[#Table:_feature_synonym|feature_synonym]] and [[Chado_Tables#Table:_synonym|synonym]]. For example, here is a query to find gene by name or synonym:
+<syntaxhighlight lang="sql">
+select feature_id from feature
+where name = 'name of interest'
+union select feature_id
+from feature_synonym fs, synonym s
+where fs.synonym_id = s.synonym_id
+and s.name = 'name of interest'
+and fs.is_current;
+</syntaxhighlight>
-Feature locations
+==Feature Locations==
-Features can potentially be localized using a sequence coordinate system. A relative localization
+Features can potentially be localized using a sequence coordinate system. A relative localization model is used, so all feature localizations must be relative to another feature. Some features such
-model is used, so all feature localizations must be relative to another feature. Some features such
+as those of type chromosome are not localized in sequence coordinates. Locations are stored in the [[#Table:_featureloc|featureloc]] table, also part of this sequence module. Other non-sequence oriented kinds of localization (such as physical localization from ''in situ'' experiments, or genetic localizations from linkage studies) are modelled outside the sequence module (for example, in the [[Chado_Expression_Module|expression module]] or [[Chado_Map_Module|map module]]).
-as those of type chromosome are not localized in sequence coordinates. Locations are stored in the
-featureloc table, also part of the sequence module. Other non-sequence oriented kinds of localization
-(such as physical localization from in situ experiments, or genetic localizations from linkage studies)
-are modelled outside the sequence module (for example, in the expression or map module).
- A feature can have zero or more featurelocs, although it will typically have either one (for local-
+A feature can have zero or more featurelocs, although it will typically have either one (for localized features for which the location is known) or zero (for unlocalized features such as chromosomes,
-ized features for which the location is known) or zero (for unlocalized features such as chromosomes,
+or for features for which the location is not yet known, such as a gene discovered using classical genetics techniques). Features with multiple featurelocs will be explained later.
-or for features for which the location is not yet known, such as a gene discovered using classical
-genetics techniques). Features with multiple featurelocs will be explained later.
- A featureloc is an interval in interbase sequence coordinates (see ﬁgure), bounded by the fmin
+A featureloc is an interval in interbase sequence coordinates (see figure), bounded by the ''fmin'' and ''fmax'' columns, each representing the lower and upper linear position of the boundary between
-and fmax columns, each representing the lower and upper linear position of the boundary between
+bases or base pairs, with directionality indicated by the ''strand'' column. Interbase coordinates were
-bases or base pairs, with directionality indicated by the strand column. Interbase coordinates were
+chosen over the more commonly used base-oriented coordinate system because they are more naturally amenable to the standard arithmetic operations that are typically performed upon sequence
-chosen over the more commonly used base-oriented coordinate system because they are more nat-
+coordinates. This leads to cleaner and more efficient database coding logic that is arguably less
-urally amenable to the standard arithmetic operations that are typically performed upon sequence
-coordinates. This leads to cleaner and more eﬃcient database coding logic that is arguably less
 prone to errors. Of course, interbase coordinates are typically transformed into the more common
 base-oriented system used by BLAST reports and so forth prior to presentation to the end-user.
- The relational schema includes a constraint which ensures that fmin ¡= fmax is always true any
+The relational schema includes a constraint which ensures that fmin != fmax is always true, and any
 attempt to set the database in a state which violates this will ﬂag an error .
+As mentioned previously, a featureloc must be localized relative to another feature, indicated
- As mentioned previously, a featureloc must be localized relative to another feature, indicated
+using the ''srcfeature_id'' foreign key column, referencing the [[#Table:_feature|feature]] table. There is nothing in the
-using the srcfeature id foreign key column, referencing the feature table. There is nothing in the
 schema prohibiting localization chains; for example, locating an exon relative to a contig that is
-itself localized relative to a chromosome (see ﬁgure). The majority of Chado database instances will
+itself localized relative to a chromosome (see figure). The majority of Chado database instances will
 not require this ﬂexibility; features are typically located relative to chromosomes or chromosomes
 arms. Nevertheless, the ability to store such localization networks or location graphs can be useful
-for unﬁnished genomes or parts of genomes such as heterochromatin [REF], in which it is desirable
+for unfinished genomes or parts of genomes such as [[wp:Heterochromatin|heterochromatin]], in which it is desirable
-to locate features relative to stable contigs or scaﬀolds, which are themselves localized in an unstable
+to locate features relative to stable contigs or scaffolds, which are themselves localized in an unstable
 assembly to chromosomes or chromosome arms. Localization chains do not necessarily only span
 assemblies protein domains may be localized relative to polypeptide features, themselves localized
 to a transcript (or to the genome, as is more common). Chains may also span sequence alignments.
- We will now present a short formal treatment of the properties of these hierarchies of localization
+[[Image:Featureloc-example.png]]
+===The Feature Location Graph===
+We will now present a short formal treatment of the properties of these hierarchies of localization
 using graph theory. This treatment can be ignored for the purposes of understanding the basics
 of the Chado schema; the end-user of the database will be entirely unaware of such technicalities.
-However, for the purposes of software engineering and ensuring interoperability between diﬀerent
+However, for the purposes of software engineering and ensuring interoperability between different
-Chado database instances and diﬀerent applications, formal treatments such as these are an essential
+Chado database instances and different applications, formal treatments such as these are an essential
-requirement for software speciﬁcations.
+requirement for software specifications.
- We can deﬁne a featureloc graph (LG) as being a set of vertices and edges, with each feature
+We can define a featureloc graph (LG) as being a set of vertices and edges, with each feature
-constituting a vertex, and each featureloc constituting an edge going from the parent feature id
+constituting a vertex, and each featureloc constituting an edge going from the parent ''feature_id''
-vertex to the srcfeature id vertex. The node is labelled with column values from the feature table,
+vertex to the ''srcfeature_id'' vertex. The node is labeled with column values from the [[#Table:_feature|feature]] table,
-and the edge is labelled with column values from the featureloc table. The LG is not allowed to
+and the edge is labeled with column values from the [[#Table:_featureloc|featureloc]] table. The LG is not allowed to
-contain cycles it is a directed acyclic graph (DAG). This includes self-cycles - no feature may be
+contain cycles, it is a {{GlossaryLink|DAG|directed acyclic graph (DAG)}}. This includes self-cycles - no feature may be
 localized relative to itself.
- The roots of the LG are the features that do not have featureloc row typically chromosomes
+The roots of the LG are the features that do not have featureloc rows, typically chromosomes
-or chromosome arms, although LG roots may also be unassembled contigs, scaﬀolds or features for
+or chromosome arms, although LG roots may also be unassembled contigs, scaffolds or features for
-which sequence localization is not get known (such as genes discovered through classical genetics
+which sequence localization is not yet known (such as genes discovered through classical genetics
-techniques). The leaves of the LG are any features that are not present as a srcfeature id in any
+techniques). The leaves of the LG are any features that are not present as a ''srcfeature_id'' in any
 featurelocs row typically the bulk of features, such as genes, exons, matches and so on. The depth
-of a particular LG g, denoted D(g), is the maximum number of edges between any leaf- root pair.
+of a particular LG ''g'', denoted ''D(g)'', is the maximum number of edges between any leaf- root pair.
 As has been previously noted, many Chados will have LGs with a uniform depth of 1. Such LGs
 are said to be simple and the features within them are said to be singletons. The maximum depth
-of all LGs in a particular database instance i is denoted LGDmax(i).
+of all LGs in a particular database instance i is denoted ''LGDmax(i)''.
+[[Image:Featureloc-graph-example.png]]
 The schema does not constrain the maximum depth of the LG. This ﬂexibility proves useful
-when applying Chado to the highly variable needs of multiple diﬀerent genome projects; however,
+when applying Chado to the highly variable needs of multiple different genome projects; however,
-it can lead to eﬃciency problems when querying the database. It can also make it more diﬃcult to
+it can lead to efficiency problems when querying the database. It can also make it more difficult to
-write software to interoperate with the database, as the software must take into account diﬀerent
+write software to interoperate with the database, as the software must take into account different
 contingencies. We can solve this problem by collapsing the LG, in which a graph of arbitrary depth
 is ﬂattened to a depth of 1, transforming or projecting featurelocs onto the root features (typically
@@ Line 188: / Line 228: @@
 and additional redundant featurelocs between leaf and root features are added to the database.
 These new featurelocs are known as inferred featurelocs. In the schema inferred featurelocs are
-diﬀerentiated from direct featurelocs using the locgroup column. Direct (non-inferred) localizations
+differentiated from direct featurelocs using the locgroup column. Direct (non-inferred) localizations
 are indicated by the locgroup column taking value 0, and transitive localizations are indicated by
-this column having value ¿0.
+this column having value !0.
-The terminology used above can be used to deﬁne speciﬁcations for applications intended to
+The terminology used above can be used to define specifications for applications intended to
-interoperate with the database. Feature location pairs Certain kinds of features have paired loca-
+interoperate with the database. Certain kinds of features have paired locations. These include hits and high-scoring-pairs (HSPs) coming from sequence search programs
-tions. These include hits and high-scoring- pairs (HSPs) coming from sequence search programs
 such as BLAST, and syntenic chromosomal regions. These kinds of features have two featurelocs
 (in contrast to the usual 1) one on the query feature and one on the subject (hit) feature. We
-diﬀerentiate the two featurelocs with the rank column. A rank of 0 indicates a location relative to
+differentiate the two featurelocs with the ''rank'' column. A rank of 0 indicates a location relative to
 the query (as is the default for most features), and a rank of 1 indicates a location relative to the
 subject (hit) feature.
-For multiple alignments (e.g. CLUSTALW [REF] results), this scheme is extended to unbounded
+For multiple alignments (e.g. [[bp:Clustalw|CLUSTALW]] results), this scheme is extended to unbounded
-ranks [0..n], with arbitrary ordering. Alignments are stored in the residue info column. CIGAR
+ranks [0..n], with arbitrary ordering. Alignments are stored in the residue info column. [http://www.ensembl.org/info/software/Pdoc/ensembl/modules/Bio/EnsEMBL/Utils/CigarString.html CIGAR]
-format[REF] is used for pairwise alignments.
+format is used for pairwise alignments.
-Multiple featurelocs may also be required for features of type sequence variant (SO:0000109),
+Multiple featurelocs may also be required for features of type "sequence variant" (SO:0000109),
-indicating points or extents which vary between reference and non- reference sequences. From a
+indicating points or extents which vary between reference and non-reference sequences. From a
 modelling standpoint, variants are conceptually similar to alignments; with variants we are noting a
-diﬀerence as opposed to a similarity. Here a rank of zero indicates the wild-type (or reference) fea-
+difference as opposed to a similarity. Here a rank of zero indicates the wild-type (or reference) feature and a rank of one or more indicates the variant (or non-reference) feature, with the residue info
-ture and a rank of one or more indicates the variant (or non-reference) feature, with the residue info
+column representing the sequence on wild-type and variant. A featureloc is uniquely identified by the ''feature_id, rank, locgroup'' triple. This means that no feature can have more than one
-column representing the sequence on wild-type and variant. [?ﬁgure ] A featureloc is uniquely iden-
+featureloc with the same rank and locgroup. In other words, rank and locgroup uniquely identify a featureloc for any particular feature.
-tiﬁed by the [feature id, rank, locgroup] triple. This means that no feature can have more than one
-featureloc with the same rank and locgroup. In other words, rank and locgroup uniquely identify
-a featureloc for any particular feature.
+===Feature Coordinates===
+Features are located relative to other features using the [[#Table:_featureloc|featureloc]] table rows. Features can be located on more than one sequence. For example, a BLAST hit HSP can be a feature of both the query and target sequences. To locate a feature, create a  [[#Table:_featureloc|featureloc]]  record with:
-Diﬀerence between the chado location model and other schemas
+* ''srcfeature_id'' = the id of the sequence on which the feature is being located
+* ''feature_id'' = the id of the feature being located
+* ''strand'' is 1 for the positive strand, -1 for the negative, and 0 for both or indifferent.
+* ''fmin, fmax'' – the minimum and maximum coordinates of the interval
+* ''is_fmin_partial, is_fmax_partial'' = true if needed to indicate that the sequence is incomplete (e.g. for ESTs or EST assemblies which are known to not go all the way to the 3’ or 5’ end.)
+* ''phase''  = 0, 1, or 2 – denotes phase of first base pair in a nucleotide feature with respect to a source protein, or the offset of the first nucleotide in its codon.
+* ''rank, locgroup'' – these are used to organize groups of feature locations and can be ignored in simple cases (the details are discussed below).
-There is a crucial diﬀerence between the Chado location model and the sequence location model
+====Multiple Locations for a Feature====
-used in other schemas, such as GFF, GenBank, BioSQL, BioPerl, etc.
-First, Chado is the only model to use the concept of rank and locgroup. Second, and perhaps
+The ability to have multiple locations for a feature has many uses. For example one can locate a SNP, exon, or protein motif on the genome, on a transcript, and on a protein. A region of similarity between two sequences (HSP) can be located on both of them, so if either is viewed the “hit” is visible.
-more important, all these other models allow discontiguous locations (also known as split locations).
-These will be familiar to anyone who has inspected GenBank annotated DNA records for an or-
-ganism that has introns within the transcripts; the transcript location is modelled as a sequence of
-non-contiguous intervals on the genome. The interval represents the location of an exon.
+===Difference Between the chado Location Model and Other Schemas===
+There is a crucial difference between the Chado location model and the sequence location model
+used in other schemas, such as [[GFF]], GenBank, [http://biosql.org BioSQL], or [http://bioperl.org BioPerl].
+First, Chado is the only model to use the concept of rank and locgroup. Second, and perhaps
+more important, all these other models allow discontiguous locations (also known as "split locations").
+These will be familiar to anyone who has inspected GenBank annotated DNA records for an organism that has introns within the transcripts; the transcript location is modelled as a sequence of
+non-contiguous intervals on the genome. The interval represents the location of an exon. For example:
+             /gene="Acph"
+     CDS     join(914..1063, 1143..1241, 1297..1536, 1605..2054,
+..2925, 3063..3172)
-Although Chado allows a feature to have multiple locations, this is only with variable rank and
+Although Chado allows a feature to have multiple locations, this is only with variable ''rank'' and
-locgroup this is enforced by a uniqueness constraint in the relational schema. We made a conscious
+''locgroup'' and this is enforced by a uniqueness constraint in the relational schema. We made a conscious
-decision to avoid discontiguous locations, because the extra degree of freedom this aﬀords results
+decision to avoid discontiguous locations, because the extra degree of freedom this affords results
 in either redundancies or ambiguities. Redundancies arise when exons are stored in addition to a
 discontiguous transcript, and ambiguities arise by virtue of the fact that explicit representation of
@@ Line 241: / Line 293: @@
 with contiguous locations. For example, a transcript with a discontiguous location can be modelled
 as a collection of exons with contiguous featurelocs, and a transcript with a single contiguous
-featureloc representing the outer boundaries deﬁned by the outermost exons.
+featureloc representing the outer boundaries defined by the outermost exons.
+==Feature Rank==
+The ''rank'' field is used when a feature has more than 1 location, otherwise the default rank value of 0 is used. Some features have two locations, for example BLAST hits and HSPs:  one location on the query, rank = 0, and one location on the subject, rank = 1.
-Extensible feature properties
+==Extensible Feature Properties==
-The feature table has a fairly limited set of columns for recording feature data. For example, there
+The [[#Table:_feature|feature]] table has a fairly limited set of columns for recording feature data. For example, there
 is no anticodon column for recording the RNA triplet for the adapter in a tRNA feature (all feature
-types, including tRNAs, are recorded as rows in the feature table). If we were to add columns such
+types, including tRNAs, are recorded as rows in the [[#Table:_feature|feature]] table). If we were to add columns such
-as anticodon then the number of columns in the table would become very large and diﬃcult to
+as anticodon then the number of columns in the table would become very large and difficult to
 manage; most would end up being nullable (for example, anticodon does not apply to non-tRNA
-features). This is because diﬀerent organisms, diﬀerent types of feature and diﬀerent projects have
+features). This is because different organisms, different types of feature and different projects have
-diﬀering needs regarding what extra data should be attached to any one feature. How then are
+differing needs regarding what extra data should be attached to any one feature. How then are
-we to attach both biologically relevant and project speciﬁc data to features? Chado solves this by
+we to attach both biologically relevant and project specific data to features?
-using an extensible mechanism for attaching attribute- value pairs to features via the featureprop
-table. The featureprop.type id foreign key column references a property in the Sequence Feature
+Chado solves this by using an extensible mechanism for attaching attribute-value pairs to features via the [[#Table:_featureprop|featureprop]]
-Property Ontology (SFPO)[url], distributed as part of Chado. The value text column stores the
+table. The ''featureprop.type_id'' foreign key column references a property in the [http://www.sequenceontology.org/ Sequence Ontology]. The ''value'' text column stores the
-value ﬁller for that property. Sets or lists of values for any property can be stored in the featureprop
+value filler for that property. Sets or lists of values for any property can be stored in the [[#Table:_featureprop|featureprop]]
-table, diﬀerentiated by the value of the rank column. Provenance for the featureprop assignment
+table, differentiated by the value of the ''rank'' column. Provenance for the [[#Table:_featureprop|featureprop]] assignment
-is stored using the featureprop pub table in the publications module, described later, allowing
+is stored using the [[Chado_Tables#Table:_featureprop_pub|featureprop_pub]] table in the [[Chado_Publication_Module|publications module]], allowing
 multiple publications to be associated with any one assignment.
-Because featureprop values can be of an arbitrary size, they are modelled using a SQL TEXT
+Because  [[#Table:_featureprop|featureprop]] values can be of an arbitrary size, they are modelled using a SQL TEXT
-type. This has some disadvantages from a query eﬃciency perspective.
+type. This has some disadvantages from a query efficiency perspective.
 Numeric values cannot be indexed correctly, and sorting the results of a query can only be done
 via a SQL casting operation, or in software outside of the database management system, either of
 which may result in poorer performance. This is one of several areas in Chado where performance
-has been traded in favour of a simpler, more abstract and generic model. Later on we will look at
+has been traded in favour of a simpler, more abstract and generic model.
-strategies for oﬀsetting some of these performance penalties.
-[example table]
+==Linking Features to External Databases==
+Public database identifiers are stored in the [[Chado_Tables#Table:_dbxref|dbxref]] table, which holds the database name, the accession number, and an optional version number. Note that this table holds accession numbers published internally by the Chado instance as well as by other databases. A feature can have a primary dbxref, which is linked directly from the [[#Table:_feature|feature]] table. It can also have additional secondary dbxref's linked via ''feature_dbxref''. A feature need not have a primary dbxref; e.g. computed features may be considered “lightweight” and not assigned accession numbers. Some groups may wish to set up a trigger to automatically assign primary dbxrefs to features of types that are locally accessioned; a sample trigger is provided with the schema.
-Feature annotations
+==Feature Annotations==
+Detailed annotations, such as associations to [http://geneontology.org Gene Ontology (GO)] terms or [http://obofoundry Cell Ontology] terms, can be attached to features using the [[#Table:_feature_cvterm|feature_cvterm]] linking table. This allows multiple ontology terms to be associated with each feature.
-Detailed annotations, such as associations to Gene Ontology[5] (GO) terms or Cell Ontology[12]
+Provenance data can be attached with the  [[#Table:_feature_cvtermprop|feature_cvtermprop]] and [[#Table:_feature_cvterm_dbxref|feature_cvterm_dbxref]] higher-order linking tables. It is up to the curation policy of each individual Chado database instance to decide which kinds of features will be linked using  [[#Table:_feature_cvterm|feature_cvterm]]. Some may link terms to gene features, others to the distinct gene products (processed RNAs and polypeptides) that are linked to the gene features.
-terms, can be attached to features using the feature cvterm linking table. This allows multiple
-ontology terms to be associated with each feature.
+Annotations for existing features can also go into the  [[#Table:_featureprop|featureprop table]] using the Chado feature_property ontology (defined in <code>chado/load/etc/feature_property.obo</code>) and the comment or description terms as appropriate. The purpose of the feature property ontology (and the related <code>chado/load/etc/genbank_feature_property.obo</code> file) is to capture terms that are likely to appear in [[GFF]] or GenBank sequence files. In theory there is no overlap between these ontologies and the Sequence Ontology.
+==Relationships Between Features==
-Provenance data can be attached with the feature cvtermprop and feature cvterm dbxref higher-
-order linking tables. It is up to the curation policy of each individual Chado database instance to
-decide which kinds of features will be linked using feature cvterm. Some may link terms to gene
-features, others to the distinct gene products (processed RNAs and polypeptides) linked to the
-gene features (see next section)
-Relationships between features
 Biological features are inter-related; exons are part of transcripts, transcripts are part of genes,
 and polypeptides are derived from messenger RNAs. Relationships between individual features
-are stored in the feature relationship table, which connects two features via the subject id and
+are stored in the  [[#Table:_feature_relationship|feature_relationship]] table, which connects two features via the ''subject_id'' and
-object id columns (foreign keys referring to the feature table) and a type id (a foreign key referring
+''object_id'' columns (foreign keys referring to the [[#Table:_feature|feature]] table) and a ''type_id'' (a foreign key referring
-to a relationship type in an ontology, either SO[6], or the OBO relationship ontology, OBO-REL[13])
+to a relationship type in an ontology, either [http://sequenceontology.org SO], or the [http://obofoundry.org/ro/ OBO relationship ontology, OBO-REL],
-indicating the nature of the relationship between subject and object features. The core relationships
+indicating the nature of the relationship between subject and object features.
-between features are part-whole (part of) or temporal (derives from). ”Subject” and ”Object”
+The core relationships between features are part-whole (''part_of'') or temporal (''derives_from''). ''Subject'' and ''Object''
 describes the linguistic role the two features play in a sentence describing the feature relationship.
-In English, many sentences follow a subject, predicate, object word order. To say that ”exons
+In English, many sentences follow a subject, predicate, object syntax, and word order is important. To say that ”exons
 are part of transcripts” is the correct way to describe a typical biological relationship. To say
 ”transcripts are part of exons” is either grammatically or biologically incorrect.
-We use this same terminology (which comes from RDF[REF]) again in the cv module. The
+We use this same terminology (which comes from [http://www.w3.org/RDF/ RDF]) again in the [[Chado_CV_Module|cv module]]. The
-collection of features and feature relationships can be considered as vertices and edges in a graph,
+collection of features and feature relationships can be considered as vertices and edges in a graph, known as the Feature Graph (FG). Example feature graphs are shown above and in the [[Introduction_to_Chado|Introduction to Chado]].
-known as the Feature Graph (FG). Some example feature graphs are shown [ﬁgure FEATURE-
-GRAPH]. The FG is independent of the LG in general the FG and the LG should have no edges in
-common if there is a featureloc connecting two features, then the addition of a feature relationship
-between these same two features is redundant.
-The FG is required in order to query the database for such things as alternately spliced genes,
+The FG is independent of the LG and in general the FG and the LG should have no edges in
+common. If there is a featureloc connecting two features, then the addition of a feature relationship
+between these same two features is redundant. The FG is required in order to query the database for such things as alternately spliced genes,
 exons shared between transcripts, etc.
-Although the chado schema admits any FG, certain conﬁgurations are biologically meaningless,
+Although the chado schema admits any FG, certain configurations are biologically meaningless,
-and should not be used. The FG can be constrained by the Sequence Ontology. Standardized FG
+and should not be used. The FG can be constrained by the [http://sequenceontology.org Sequence Ontology]. Standardized FG
-structures are required for complex applications to be interoperable - this is discussed later on.
+structures are required for complex applications to be interoperable.
 Unlike the LG, the FG may be cyclic, although cycles in the FG are not common. The subset
@@ Line 326: / Line 370: @@
 the FG connecting parts with wholes via part of must be acyclic.
+==Compliance==
-Canonical gene models
+{{NeedsEditing}}
+''This section is not complete, it is in progress.''
-Regulatory regions
-Sequence variants
-Feature example
-[Diagram showing an example that puts this all together]
-   canonical-gene-model
-   The "central dogma" gene model - gene makes mRNA makes polypeptide
-   For many people this may be the only data they store in Chado. The
-   typical protein coding gene model consists of a gene, one or more
-   mRNAs, one or more exons, and at least one polypeptide.
-   Alternately spliced genes have a 1 to many relation between gene and
-   mRNA. Exons can be part_of more than one mRNA. No two distinct exon
-   rows should have exact same featureloc coordinates (this indicates
-   they are the same exon).
-   Every [1]feature must have a [2]featureloc with rank=0 and locgroup=0.
-   The value of the srcfeature_id column should be identical (i.e. all
-   features are located relative to the same feature), except in rare
-   circumstances such as when a feature crosses two contigs. Software is
-   not guaranteed to support this. The srcfeature_id can point to a
-   [3]contig, a [4]chromosome[5]chromosome_arm or other appropriate
-   assembly unit.
-This scenario involves rows in the following tables:
-   table
-   type_id
-   number comments
-   feature SO:gene 1
-   feature SO:mRNA
-   feature exon
-   feature polypeptide
-   Tool: apollo
-   Status: supported
-   Tool: gbrowse
-   Status: supported
-Example
-   [.] Download:
-   noncoding-gene
-   Similar to [6]canonical-gene-model, except with noncoding-RNA
-   Not all genes are protein-coding. Genes can code for tRNA, miRNA,
-   snoRNA, etc. A noncoding gene model is identical to a
-   [7]canonical-gene-model, with the following exceptions:
-     * There is no polypeptide feature
-     * Instead of an mRNA feature, there is a feature that is some other
-       sub-type of [8]RNA
-   Tool: apollo
-   Status: supported
-   Tool: gbrowse
-   Status: supported
-   pseudogene
-   A pseudogene is a non-functional relic of a gene
-   See [9]pseudogene. A pseudogene may look like an ordinary gene, and
-   may even have discernable parts such as exons. It may sometimes be
-   desirable to annotate the exon structure of a pseudogene - this can in
-   principle be done using SO types such as [10]decayed_exon. In practice
-   no-one is using Chado to do this. There are currently two practices:
-     * pseudogenes are treated analagously to [11]noncoding-genes. That
-       is, there are normal "gene" and "exon" features. However, in place
-       of a subtype of RNA, there is a feature of type pseudogene. This
-       practice is STRONGLY DISCOURAGED (it is not compliant with the
-       relations in SO, it gives false counts to the number of real genes
-       in the database). Note that this is the current default for
-       FlyBase.
-     * Pseudogenes are normal [12]singleton-features. There is no
-       annotation of exon structure. This practice is encouraged. If at a
-       later date it becomes desirable to annotated the exon structure of
-       a pseudogene, it will be compatible with this.
-   Tool: apollo
-   Status: unclear
-   Apollo by default treats pseudogenes using the first method, above. It
-   may also be possible to configure it to the second, singleton, method.
-   Annotating the exon structure of pseudogenes the correct way has not
-   yet been attempted to our knowledge.
-   singleton-feature
-   Many types of features are singletons - that is they are not related
-   to other features through feature_relationships. Storage of these is
-   basic and as one may expect
-   Singleton features present no major problems. Unlike genes, which
-   typically have parts (with the parts having subparts), singletons do
-   not form feature graphs (or rather, they form feature graphs
-   consisting of single nodes). Singleton features are located relative
-   to other features (usually the genome, but once can have singletons
-   that are located relative to other features - this may not be
-   supported by all applications)
-   Tool: gbrowse
-   Status: suppported
-   Tool: apollo
-   Status: suppported
-   Apollo supports singletons provided they are located relative to the
-   genome (singletons located relative to other features will be
-   ignored). It may be necessary to configure apollo to make the feature
-   type "1-level"
-   dicistronic-gene
-   A dicistronic gene is a gene with a mRNA that codes for two distinct
-   non-overlapping CDSs
-   Dicistronic genes (see for example, the dmel Adh and Adhr genes) have
-   totally distinct gene products deriving from the same transcript. To
-   confuse matters, the two polypeptides are commonly refered to as being
-   derived from two distinct genes (e.g. Adh and Adhr). The entire
-   genomic region comprising the transcript (e.g. Adh+Adhr) that includes
-   both CDSs is refered to as the [13]gene_cassette. In a database such
-   as FlyBase, there are 3 gene IDs stored in the database - one for each
-   of the two non-overlapping genes, and one for the gene cassette
-   Dicistronic genes make it difficult to have a formal definition of
-   gene that corresponds nicely with how biologists use the term.
-   There are currently two proposals for handling dicistronic genes. The
-   first is a hack and introduces redundancy, but works well with
-   existing software and tools. The second is prefered from a modeling
-   standpoint, but introduces a lot of complexity to software
-   operon
-   Bacterial genes are often transcribed in groups; eg LacZ
-   There are many similarities with [14]dicistronic-genes here.
-   trans-spliced-gene
-   A trans-spliced gene has one or more transcripts in which that
-   transcript may be spliced together from different parts of the genome
-   A trans spliced transcript is spliced from exons coming from different
-   parts of the genome. The distance between each trans spliced part may
-   be large, or it may be in the same location on the opposite strand.
-   Most C elegans genes have a trans spliced leader sequence. This is
-   different from the trans splicing involved in dmel , where we observe
-   what appears to be two transcripts on separate strands (both
-   containing coding sequence) joining together in a single functional
-   transcript
-   There are two proposals for dealing with this. One treats the trans
-   spliced transcript as a single transcripts, with exons coming from
-   different locations. The other treats the trans spliced transcript as
-   a mature transcript created from two distinct primary transcripts.
-   Note that these proposals focus on the dmel example. A solution for
-   the C elegans example is not proposed (not sure if we even need one?)
-   We treat this as an ordinary gene model, but relax our rules for exon
-   locations in a transcript
-   For example, for the canonical Dmel trans spliced gene, we would allow
-   transcripts to have exons on different strands. Note that in Chado,
-   exon ordering comes from [15]feature_relationship.rank (between exon
-   and transcript), NOT from the featureloc of the exon. Chado has no
-   problem with this. However, some software may make assumptions that
-   all exons are on the same strand, or may try to order exons by their
-   location to get a transcript sequence. This software will have
-   unintended consequences with trans spliced genes modeled using this
-   proposal
-   Tool: apollo
-   Status: unclear
-   apollo may accidentally scramble the order of exons. Need to check
-   Tool: gbrowse
-   Status: unclear
-   Not sure.
-   We would introduce extra transcripts, and have relations between the
-   transcripts. Only the mature, spliced, transcript would have a
-   relation to the polypeptide
-   This may model the biology better. However, it introduces a major
-   departure from the [16]canonical-gene-model. For this reason this
-   proposal is unlikely to be adopted
-   gene-with-regulatory-elements
-   regulatory elements may be implicitly or explicitly associated with a
-   gene
-   transposons
-   transposons can be annotated as [17]singleton-features or as complex
-   annotations
-   A transposon may consist of various parts such as
-   [18]long_terminal_repeats and gene models coding for genes like gag,
-   pol, env. These parts may have all decayed over time. Transposon
-   annotation typically ignores these subtleties as all that is usually
-   required is a [19]singleton-feature of type
-   [20]transposable_element_feature. In this case, there is no difficulty
-   If one requires detailed transposon annotation then one is entering
-   uncharted water as far as both Chado and annotation tools are
-   concerned (which is why this scenario is marked as being under
-   discussion). One option would be to treat each transposon part as
-   distinct singletons, but this may be unsatisfactory as one may desire
-   to have the appropriate part_of relations between the parts.
-   P-element-insertions
-   SNPs
-   gene-with-implicit-features-manifested
-   Some feature types such as introns are not normally manifested as rows
-   in chado. They are normally derived on-the-fly from the gaps between
-   consecutive exons. See for an example. Occasionally it may be
-   desirable to store the introns actual rows in the feature table - for
-   scenario in a report database
-   feature-localization
-   All features with sequence annotation should be localized using
-   featureloc
-   localized features must have a [21]featureloc with rank=0 and
-   locgroup=0. This is the primary location of the feature. The location
-   always indicates the boundaries of the feature. If the feature is
-   composed of distinct subfeatures (e.g. a transcript composes of
-   exons), then it is NOT permitted to use multiple featurelocs to
-   indicate this. Instead, there must be rows for the subfeatures, each
-   with their own featureloc
-   In a feature graph (i.e. a group of features connected via
-   [22]feature_relationship rows, all features will typically be
-   localized relative to the same source feature (i.e. they will all have
-   the same value for featureloc.srcfeature_id)
-   features are typically localized to some kind of genomic or assembly
-   feature, but chado does not constrain you to using only this. For
-   example, localizing features relative to a transcript or polypeptide
-   or even exon is permitted, but unusual practices will most likely not
-   be recognized by most software
-   feature-localization-to-contigs-in-assembly
-   In an assembled genome, it is common to locate relative to the
-   top-level assembly units (e.g. chromosomes). However, it is also
-   permissable to locate to smaller units such as [23]contigs or
-   [24]golden_path_units
-   If a genome assembly is not stable, it is common to locate relative to
-   assembly units such as [25]contigs. These contigs may then be
-   localized relative to the top-level assembly units. This is known in
-   chado terms as a location graph.
-   We discuss here location graphs of depth 2. See also
-   [26]n-level-assemblies. This scenario is often invisible to software
-   interoperating with Chado. The software is free to only look at the
-   main features and the contig-level feature and ignore the top-level
-   assembly feature. It may sometimes be desirable to have software that
-   can perform location transformations, mapping features from contigs to
-   top-level units and back
-   Tool: apollo
-   Status: unclear
-   apollo should be happy to treat contigs just as if they were top-level
-   units as chromosome arms. However, the user may have to explicitly
-   provide contigs if location queries are desired. For example, apollo
-   may retrieve nothing if the user asks for a certain range on
-   chromosome 4, and the features are located relative to contigs which
-   are themselves on chromosome 4.
-   Tool: gbrowse
-   Status: unclear
-   Gbrowse may expect features to be located relative to top-level units
-   such as chromosomes.
-   redundant-localizations-to-different-assembly-levels
-   Features can be located relative to both contigs and top-level
-   assembly units
-   Chado allows redundant feature localization using
-   [27]featureloc.locgroup>0. This allows a database to have primary
-   locations for features relative to contigs, and secondary locations
-   relative to top-level units such as chromosomes. The converse is also
-   allowed.
-   This scenario is discouraged unless the chado db admin knows what they
-   are doing. They must implement solutions to ensure that featurelocs
-   with varying locgroup do not get out of sync. These solutions are not
-   part of the standard Chado software suite. Nevertheless, this scenario
-   may be useful for advanced users in certain circumstances
-   Tool: gbrowse
-   Status: unclear
-   Not clear if gbrowse uses locgroup in querying. If it constrains by
-   locgroup, then this is essentially the same as
-   [28]feature-localization-to-contigs-in-assembly
-   Tool: gbrowse
-   Status: partial
-   Not clear if apollo uses locgroup in querying. If it constrains by
-   locgroup, then this is essentially the same as
-   [29]feature-localization-to-contigs-in-assembly. Apollo will not
-   preserve redundant featurelocs when writing back to db. This could
-   lead to db getting out of sync.
-   n-level-assemblies
-   In theory it is possible (but rare) to have assemblies with variable
-   depths, or with depths>2
-   This scenario is rare. If required, then Chado can deal with this -
-   there is no theoretical limit to the depth of a location graph. One
-   can have annotated features located relative to minicontigs which are
-   located relative to supercontigs which are located relative to
-   chromosomes. Most software that interoperates with Chado will not be
-   able to deal with this, so this scenario is discouraged except by
-   advanced users who have no other option
-   unlocalized-gene
-   A gene without sequence based localization
-   Many chado instances are purely concerned with genome annotation - in
-   these cases it would be strange to have genes or other features such
-   as transcripts with no localization (i.e. no featurelocs). However,
-   this scenario is actually common when Chado is used in a wider
-   context. We may of the existence of genes through non-sequence
-   evidence such as genetics. When we have no sequence-based localization
-   it is perfectly valid to have gene features with no featurelocs. When
-   the time comes to create genome annotations for these, we just 'fill
-   out' the gene feature by adding transcript and exon features.
-   Tool: gbrowse
-   Status: supported
-   Gbrowse supports this scenario in that unlocalized features will be
-   ignored from the genome viewer, which is appropriate
-   Tool: apollo
-   Status: supported
-   Apollo supports this scenario in that unlocalized features will be
-   ignored, which is appropriate behaviour for a genome annotation tool
-References
-. http://gmod.sourceforge.net/schema/doc/default_schema.html#feature
-. http://gmod.sourceforge.net/schema/doc/default_schema.html#featureloc
-. http://song.sourceforge.net/#contig
-. http://song.sourceforge.net/#chromosome
-. http://song.sourceforge.net/#chromosome_arm
-. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#canonical-gene-model
-. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#canonical-gene-model
-. http://song.sourceforge.net/#RNA
-. http://song.sourceforge.net/#pseudogene
-. http://song.sourceforge.net/#decayed_exon
-. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#noncoding-gene
-. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#singleton-feature
-. http://song.sourceforge.net/#gene_cassette
-. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#dicistronic-gene
-. http://gmod.sourceforge.net/schema/doc/default_schema.html#feature_relationship
-. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#canonical-gene-model
-. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#singleton-feature
-. http://song.sourceforge.net/#long_terminal_repeat
-. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#singleton-feature
-. http://song.sourceforge.net/#transposable_element_feature
-. http://gmod.sourceforge.net/schema/doc/default_schema.html#featureloc
-. http://gmod.sourceforge.net/schema/doc/default_schema.html#feature_relationship
-. http://song.sourceforge.net/#contig
-. http://song.sourceforge.net/#golden_path_unit
-. http://song.sourceforge.net/#contig
-. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#n-level-assemblies
-. http://gmod.sourceforge.net/schema/doc/default_schema.html#featureloc
-. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#feature-localization-to-contigs-in-assembly
-. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#feature-localization-to-contigs-in-assembly
-.2 Best Practices
-Chado is a generic schema, which means anyone writing software to query or write to chado (either
-middleware or applications) should be aware of the diﬀerent ways in which data can be stored.
-We want to strike a nice balance between ﬂexibility and extensibility on the one hand, and strong
-typing and rigor on the other. We want to avoid the situation we have with GenBank entries where
-there are a dozen ways of representing a gene model, but we need to be able to cope with the
-constant surprises biology throws at us in an attempt to confound our nice computable models.
 Chado uses a layered model - this is tried and tested in software engineering. Some generic
 software can be targeted at the lower layers and be guaranteed to work no matter what. Other
-more speciﬁc software needs a more tightly deﬁned rigorous model and should be targeted at the
+more specific software needs a more tightly defined rigorous model and should be targeted at the
 upper layers.
-We require validation software and more formal/computable descriptions of these layers and
+We require validation software and more formal or computable descriptions of these layers and
-policies - for now natural language descriptions will have to suﬃce.
+policies - for now natural language descriptions will have to suffice.
+===Chado Compliance Layers===
+Proposal for levels of compliance.
-.2.1  Chado Compliance Layers
+====Level 0: Relational Schema====
+Level 0 conformance basically means the schema is adhered to. Obviously, this is enforced by the DBMS.
-Layer 0: Relational Schema
+====Layer 1: Ontologies====
+Level 1 conformance is minimal conformance to [http://sequenceontology.org SO] - all feature.types must be SO terms, and all
-Level 0 conformance basically means the schema is adhered to. Obviously, this is enforced by the
-DBMS.
-Layer 1: Ontologies
-Level 1 conformance is minimal conformance to SO - all feature.types must be SO terms, and all
 feature relationship.types must be SO relationship types.
+====Level 2: Graph====
-Layer 2: Graph
 Level 2 conformance is graph conformance to SO - all feature relationships between a feature of
 type X and Y must correspond to relationship of that type in SO; for example, mRNA can be
-part of gene, but mRNA can not be part of golden path region. [more detailed/formal explanation
+part of gene, but mRNA can not be part of golden path region. '''[more detailed/formal explanation
-to come]. In practice Level 2 conformance may be undesirable, we may need to make modiﬁcations
+to come].''' In practice Level 2 conformance may be undesirable, we may need to make modifications
 to SO.
 Orthogonal to these layers are various additional policy decisions. Some of these are more
-tolerant of non-conformance than others. (there is also some overlaps with levels 1/2).
+tolerant of non-conformance than others. (there is also some overlaps with levels 1 and 2).
+===Examples: Current implementations===
-.2.2  Examples: Current implementations
+This section describes details of how different sites are using Chado. '''This is likely outdated information.'''
+[http://tigr.org TIGR]: Currently at level 0 conformance, though most (if not all) of the terms being used have
-I have listed how FB implements each policy choice - other chado instances feel free to add....
-TIGR: Currently at level 0 conformance, though most (if not all) of the terms being used have
 an obvious counterpart in SO. Therefore these ”TIGR Ontology” terms are used in the answers to
 the SO-related questions that appear below. We plan on updating our terms with SO terms very
 soon.
-SO terms used for standard central-dogma gene model
+====SO terms used for Standard Central-dogma Gene Model====
-FB: gene mRNA exon protein [other types are derivable]
+[http://flybase.org FlyBase]: gene mRNA exon protein [other types are derivable].
-TIGR: gene transcript CDS exon protein [though the strict answer is for any of these SO
+[http://tigr.org TIGR]: gene transcript CDS exon protein [though the strict answer is for any of these SO
-questions is ”none” since we do not yet meet level 1 conformance]
+questions is ”none” since we do not yet meet level 1 conformance].
 NOTE: we should be using ’polypeptide’ instead of ’protein’. For now, software should be
 tolerant of both these uses.
-SO terms used for storing alignments
+====SO terms Used for Storing Alignments====
-FB: match
+[http://flybase.org FlyBase]: match
-TIGR: match
+[http://tigr.org TIGR]: match
-NOTE: we want to use the new more speciﬁc SO types for match set, match part, for hits and
+NOTE: we want to use the new more specific SO types for match set, match part, for hits and
 hsps respectively. For now, software should be tolerant of either usage.
-TIGR: We’ve also extended the model for storing pairwise alignments to store multiple align-
+[http://tigr.org TIGR]: We’ve also extended the model for storing pairwise alignments to store multiple alignments. Each member of the alignment is featureloced to the ’match’ feature. We’ve used this
-ments. Each member of the alignment is featureloced to the ’match’ feature. We’ve used this
 representation to store paralogous/orthologous gene families.
+====feature_relationship Types====
+[http://flybase.org FlyBase]: partof (for mRNA to gene and exon to mRNA) producedby (for protein to mRNA)
+[http://tigr.org TIGR]: part of (gene-assembly, exon-transcript, assembly-supercontig) produced by (protein-
-feature relationship.types
-FB: partof (for mRNA to gene and exon to mRNA) producedby (for protein to mRNA)
-TIGR: part of (gene-assembly, exon-transcript, assembly-supercontig) produced by (protein-
 CDS, CDS-transcript, transcript-gene)
@@ Line 797: / Line 451: @@
 see note above
-NOTE: the main diﬀerence between FB and TIGR here is that TIGR introduce an intermediate
+NOTE: the main difference between FB and TIGR here is that TIGR introduce an intermediate
-CDS feature between mRNA and protein
+CDS feature between mRNA and protein.
+====featureloc Policy====
+[http://flybase.org FlyBase]: all constituent parts of a central dogma gene model are located relative to the same srcfeature
+(the chromosome arm). No redundant locations (i.e. featureloc.group ¿ 0) are used.
-featureloc policy
+[http://tigr.org TIGR]: Redundant locations are used and indicated with featureloc.group ¿ 0.
+NOTE: we want to allow some ﬂexibility with this policy. We believe that the constituent parts
-FB: all constituent parts of a central dogma gene model are located relative to the same srcfeature
-(the chromosome arm). No redundant locations (ie featureloc.group ¿ 0) are used
-TIGR: Redundant locations are used and indicated with featureloc.group ¿ 0.
- NOTE: we want to allow some ﬂexibility with this policy. I believe that the constituent parts
 linked located relative to the feature should always be followed. This can be stated more formally
 as:
-   IF  X is linked to Y via feature\_relationship
+   IF  X is linked to Y via feature_relationship
-   AND X is located relative to Z via featureloc.srcfeature\_id
+   AND X is located relative to Z via featureloc.srcfeature_id
-   THEN Y must also be located relative to Z via featureloc.srcfeature\_id
+   THEN Y must also be located relative to Z via featureloc.srcfeature_id
- TIGR: We’ve followed this policy in adding a featureloc between the protein and genomic
+[http://tigr.org TIGR]: We’ve followed this policy in adding a featureloc between the protein and genomic
 contig in our databases (such a featureloc does not appear in the Chado usage documents). This
-additional featureloc simpliﬁes many queries, especially when looking at the genomic context of
+additional featureloc simplifies many queries, especially when looking at the genomic context of
 ’match’ features associated with proteins.
- We should also expect that the fmin/fmax boundaries of a feature be deﬁned the the outermost
+We should also expect that the fmin/fmax boundaries of a feature be defined the the outermost
-boundaries of the outermost constituent part features (this rule may require reﬁnement when we
+boundaries of the outermost constituent part features (this rule may require refinement when we
 have promoters, enhancers and so on - but for now we don’t).
- As to what the srcfeature should be, it could be a contig, and assembly or a top-level locat-
+As to what the srcfeature should be, it could be a contig, and assembly or a top-level locat-
-able feature such as chromosome or chromosome arm. Software should be tolerant of diﬀerent
+able feature such as chromosome or chromosome arm. Software should be tolerant of different
 choices here. Whilst it is generally always best to locate relative to the topmost feature (ie the
 arm/chromosome), sometimes this is not possible or desirable (eg low coverage, heterochromatin).
+====Non-central Dogma Gene Models====
+[http://flybase.org FlyBase]: we store a lot of non-central dogma gene models; noncoding gene models and pseudogenes
+[need to fill in more details here].
-non-central dogma gene models
+[http://tigr.org TIGR]: not many of these stored yet, save for a few pseudogenes and the occasional non-coding
+ORF.
+====Other Features====
-FB: we store a lot of non-central dogma gene models; noncoding gene models and pseudogenes
+[http://flybase.org FlyBase]: the FlyBase implementation includes many other feature types, including polyA site and se-
-[need to ﬁll in more details here]
+quence variant [need to fill in details].
- TIGR: not many of these stored yet, save for a few pseudogenes and the occasional non-coding
+[http://tigr.org TIGR]: using ’SNP’ in some databases.
-ORF
+====Derivable Feature Types====
+[http://flybase.org FlyBase]: derivable features (introns, UTRs, intergenic region) are not included. Feature typing is always
-other features
+done to the most specific, non-derivale level. For example, we never use types ”5 prime exon”,
-FB: the FlyBase implementation includes many other feature types, including polyA site and se-
-quence variant [need to ﬁll in details]
- TIGR: using ’SNP’ in some databases
-derivable features types
-FB: derivable features (introns, UTRs, intergenic region) are not included. Feature typing is always
-done to the most speciﬁc, non-derivale level. For example, we never use types ”5 prime exon”,
 ”dicistronic gene”, ”coding exon” as these are always inferrable. We always use type ”gene” - the
-speciﬁc type of gene is inferred from the child type (mRNA, tRNA, snRNA, etc).
+specific type of gene is inferred from the child type (mRNA, tRNA, snRNA, etc)..
- TIGR: derivable features are not included. currently not storing any tRNAs or snRNAs.
+[http://tigr.org TIGR]: derivable features are not included. currently not storing any tRNAs or snRNAs.
 NOTE: whilst it is perfectly permissable to include redundant derivable features (useful for
-warehouse-style querying), you should not write software that expects to ﬁnd these if you want the
+warehouse-style querying), you should not write software that expects to find these if you want the
-software to work on diﬀerent chado db instances.
+software to work on different chado db instances.
+====Sequence Variants====
+[http://flybase.org FlyBase]: these are included in chado, but they are lacking full detail.
-sequence variants
+[http://tigr.org TIGR]: only SNPs so far. the SNPs currently being stored are computed from pairwise alignments of sequences already loaded into Chado, so each SNP feature is featureloc’ed to the appropriate place on each of the two sequences (rather than having one of the featurelocs ”dangling”, as
-FB: these are included in chado, but they are lacking full detail
-TIGR: only SNPs so far. the SNPs currently being stored are computed from pairwise align-
-ments of sequences already loaded into Chado, so each SNP feature is featureloc’ed to the appro-
-priate place on each of the two sequences (rather than having one of the featurelocs ”dangling”, as
 indicated in some of the Chado usage documents.) featureloc.residue info is used to redundantly
 store the base referenced in each of the two sequences.
-NOTE: variation features should specify the edit that makes one feature (such as the reference/wild-
+NOTE: variation features should specify the edit that makes one feature (such as the reference/wild-type) from another (the variant/mutant/non-reference). There were perhaps 2 proposals for this
-type) from another (the variant/mutant/non-reference). There were perhaps 2 proposals for this
+[more details required...].
-[more details required...]
-Chado usage scenarios version:
-Index
-canonical-gene-model	final	The "central dogma" gene model - gene makes mRNA makes polypeptide
-noncoding-gene	final	Similar to , except with noncoding-RNA
-pseudogene	discussion	A pseudogene is a non-functional relic of a gene
-singleton-feature	discussion	Many types of features are singletons - that is they are not related to other features through feature_relationships. Storage of these is basic and as one may expect
-dicistronic-gene	discussion	A dicistronic gene is a gene with a mRNA that codes for two distinct non-overlapping CDSs
-operon	discussion	Bacterial genes are often transcribed in groups; eg LacZ
-trans-spliced-gene	discussion	A trans-spliced gene has one or more transcripts in which that transcript may be spliced together from different parts of the genome
-gene-with-regulatory-elements	discussion	regulatory elements may be implicitly or explicitly associated with a gene
-transposons	discussion	transposons can be annotated as s or as complex annotations
-P-element-insertions	final
-SNPs	final
-gene-with-implicit-features-manifested	discussion	Some feature types such as introns are not normally manifested as rows in chado. They are normally derived on-the-fly from the gaps between consecutive exons. See for an example. Occasionally it may be desirable to store the introns actual rows in the feature table - for scenario in a report database
-feature-localization	final	All features with sequence annotation should be localized using featureloc
-feature-localization-to-contigs-in-assembly	final	In an assembled genome, it is common to locate relative to the top-level assembly units (e.g. chromosomes). However, it is also permissable to locate to smaller units such as s or s
-redundant-localizations-to-different-assembly-levels	final	Features can be located relative to both contigs and top-level assembly units
-n-level-assemblies	final	In theory it is possible (but rare) to have assemblies with variable depths, or with depths>2
-unlocalized-gene	final	A gene without sequence based localization
-Abstract
-This page contains a selection of Chado best-practices for different usage scenarios. It is designed to complement the Chado SQL DDL (you should familiarize yourself with this first) and the Sequence Ontology. This document status is ALPHA - in progress
-Scenarios
-canonical-gene-model
-The "central dogma" gene model - gene makes mRNA makes polypeptide
-For many people this may be the only data they store in Chado. The typical protein coding gene model consists of a gene, one or more mRNAs, one or more exons, and at least one polypeptide.
-Alternately spliced genes have a 1 to many relation between gene and mRNA. Exons can be part_of more than one mRNA. No two distinct exon rows should have exact same featureloc coordinates (this indicates they are the same exon).
-Every feature must have a featureloc with rank=0 and locgroup=0. The value of the srcfeature_id column should be identical (i.e. all features are located relative to the same feature), except in rare circumstances such as when a feature crosses two contigs. Software is not guaranteed to support this. The srcfeature_id can point to a contig, a chromosome/chromosome_arm or other appropriate assembly unit.
-This scenario involves rows in the following tables:
-table	type_id	number	comments
-feature	SO:gene	1	The gene must always be provided
-feature	SO:mRNA	1..n	 One or more transcripts are required, and these are always of type mRNA for protein-coding genes.
-feature_relationship	OBO_REL:part_of	SO:mRNA[1..n]---->[1]SO:gene	 transcripts are always linked to genes by a part_of relation. (Note that SO uses member_of here). One gene can have amny transcripts (multiple splicing). A transcript must always belong to exactly one gene (for an exception, see .
-feature	SO:exon	1..n	 Exons are always required, even if the genome under consideration has no introns
-feature_relationship	OBO_REL:part_of	SO:exon[1..n]---->[1..n]SO:mRNA	 Exons are always linked to their container transcript (in this case, an mRNA) via the part_of relation. If a transcript is alternately spliced, then an exon can be part_of multiple transcripts
-feature	SO:polypeptide	1..n	 A protein-coding gene always produces a polypeptide, by definition. The polypeptide is located relative to the same genomic feature as the exons, mRNAs and gene. A single featureloc is used, with fmin and fmax indicating the start and stop codon positions (location is inclusive of stop codon). The polypeptide sequence should be specified as an amino acid sequence.
-feature_relationship	OBO_REL:derived_from	SO:polypeptide[1]---->[1..n]SO:mRNA	 The polypeptide is always derived_from the mRNA. If two alternate spliceforms produce the same polypeptide (i.e. their sequence is the same) then the same polypeptide feature should be used. An mRNA can only derive one polypeptide. For exceptions, see dicistronic-gene
-featureloc		1..n	 Every feature above must have a featureloc
-Tool: apollo
-Status: supported
-Tool: gbrowse
-Status: supported
-Example
-A Drosophila gene with 5 exons and a single spliceform Download: [game] [chado] [chaos]
-noncoding-gene
-Similar to canonical-gene-model, except with noncoding-RNA
-Not all genes are protein-coding. Genes can code for tRNA, miRNA, snoRNA, etc. A noncoding gene model is identical to a canonical-gene-model, with the following exceptions:
-There is no polypeptide feature
-Instead of an mRNA feature, there is a feature that is some other sub-type of RNA
-This scenario involves rows in the following tables:
-table	type_id	number	comments
-feature	SO:gene	1	The gene must always be provided
-feature	SO:RNA	1..n	 Type can be SO:RNA or any subtype of this type
-feature_relationship	OBO_REL:part_of	SO:RNA[1..n]---->[1]SO:gene	 noncoding transcripts can also be alternately spliced
-feature	SO:exon	1..n	 Exons are always required, even if the genome under consideration has no introns.
-feature_relationship	OBO_REL:part_of	SO:exon[1..n]---->[1..n]SO:RNA	 Exons are always linked to their container transcript (in this case, a non-mRNA subtype of SO:RNA) via the part_of relation. If a transcript is alternately spliced, then an exon can be part_of multiple transcripts
-featureloc		1..n	 Every feature above must have a featureloc
-Tool: apollo
-Status: supported
-Tool: gbrowse
-Status: supported
-pseudogene
-A pseudogene is a non-functional relic of a gene
-See pseudogene. A pseudogene may look like an ordinary gene, and may even have discernable parts such as exons. It may sometimes be desirable to annotate the exon structure of a pseudogene - this can in principle be done using SO types such as decayed_exon. In practice no-one is using Chado to do this. There are currently two practices:
-pseudogenes are treated analagously to noncoding-genes. That is, there are normal "gene" and "exon" features. However, in place of a subtype of RNA, there is a feature of type pseudogene. This practice is STRONGLY DISCOURAGED (it is not compliant with the relations in SO, it gives false counts to the number of real genes in the database). Note that this is the current default for FlyBase.
-Pseudogenes are normal singleton-features. There is no annotation of exon structure. This practice is encouraged. If at a later date it becomes desirable to annotated the exon structure of a pseudogene, it will be compatible with this.
-Tool: apollo
-Status: unclear
-Apollo by default treats pseudogenes using the first method, above. It may also be possible to configure it to the second, singleton, method. Annotating the exon structure of pseudogenes the correct way has not yet been attempted to our knowledge.
-singleton-feature
-Many types of features are singletons - that is they are not related to other features through feature_relationships. Storage of these is basic and as one may expect
-Singleton features present no major problems. Unlike genes, which typically have parts (with the parts having subparts), singletons do not form feature graphs (or rather, they form feature graphs consisting of single nodes). Singleton features are located relative to other features (usually the genome, but once can have singletons that are located relative to other features - this may not be supported by all applications)
-Tool: gbrowse
-Status: suppported
-Tool: apollo
-Status: suppported
-Apollo supports singletons provided they are located relative to the genome (singletons located relative to other features will be ignored). It may be necessary to configure apollo to make the feature type "1-level"
-dicistronic-gene
-A dicistronic gene is a gene with a mRNA that codes for two distinct non-overlapping CDSs
-Dicistronic genes (see for example, the dmel Adh and Adhr genes) have totally distinct gene products deriving from the same transcript. To confuse matters, the two polypeptides are commonly refered to as being derived from two distinct genes (e.g. Adh and Adhr). The entire genomic region comprising the transcript (e.g. Adh+Adhr) that includes both CDSs is refered to as the gene_cassette. In a database such as FlyBase, there are 3 gene IDs stored in the database - one for each of the two non-overlapping genes, and one for the gene cassette
-Dicistronic genes make it difficult to have a formal definition of gene that corresponds nicely with how biologists use the term.
-There are currently two proposals for handling dicistronic genes. The first is a hack and introduces redundancy, but works well with existing software and tools. The second is prefered from a modeling standpoint, but introduces a lot of complexity to software
-operon
-Bacterial genes are often transcribed in groups; eg LacZ
-There are many similarities with dicistronic-genes here.
-trans-spliced-gene
-A trans-spliced gene has one or more transcripts in which that transcript may be spliced together from different parts of the genome
-A trans spliced transcript is spliced from exons coming from different parts of the genome. The distance between each trans spliced part may be large, or it may be in the same location on the opposite strand.
-Most C elegans genes have a trans spliced leader sequence. This is different from the trans splicing involved in dmel , where we observe what appears to be two transcripts on separate strands (both containing coding sequence) joining together in a single functional transcript
-There are two proposals for dealing with this. One treats the trans spliced transcript as a single transcripts, with exons coming from different locations. The other treats the trans spliced transcript as a mature transcript created from two distinct primary transcripts. Note that these proposals focus on the dmel example. A solution for the C elegans example is not proposed (not sure if we even need one?)
-We treat this as an ordinary gene model, but relax our rules for exon locations in a transcript
-For example, for the canonical Dmel trans spliced gene, we would allow transcripts to have exons on different strands. Note that in Chado, exon ordering comes from feature_relationship.rank (between exon and transcript), NOT from the featureloc of the exon. Chado has no problem with this. However, some software may make assumptions that all exons are on the same strand, or may try to order exons by their location to get a transcript sequence. This software will have unintended consequences with trans spliced genes modeled using this proposal
-Tool: apollo
-Status: unclear
-apollo may accidentally scramble the order of exons. Need to check
-Tool: gbrowse
-Status: unclear
-Not sure.
-We would introduce extra transcripts, and have relations between the transcripts. Only the mature, spliced, transcript would have a relation to the polypeptide
-This may model the biology better. However, it introduces a major departure from the canonical-gene-model. For this reason this proposal is unlikely to be adopted
-gene-with-regulatory-elements
-regulatory elements may be implicitly or explicitly associated with a gene
-transposons
-transposons can be annotated as singleton-features or as complex annotations
-A transposon may consist of various parts such as long_terminal_repeats and gene models coding for genes like gag, pol, env. These parts may have all decayed over time. Transposon annotation typically ignores these subtleties as all that is usually required is a singleton-feature of type transposable_element_feature. In this case, there is no difficulty
-If one requires detailed transposon annotation then one is entering uncharted water as far as both Chado and annotation tools are concerned (which is why this scenario is marked as being under discussion). One option would be to treat each transposon part as distinct singletons, but this may be unsatisfactory as one may desire to have the appropriate part_of relations between the parts.
-P-element-insertions
-SNPs
-This outlines one way of modeling SNPs in chado. it also illustrates
-use of the featureloc table.
-Most of this applies to other variation features, but I'll illustrute
-using SNPs for now to keep it simple.
-A SNP is represented as a single feature in chado.
-Let's take a basic example - a SNP that flips an A to a G on the
-genome.
-Here we would have one feature and two featurelocs.
-(feature
-  (name "SNP_01")
-  (featureloc
-    (srcfeature "Chromosome_arm_2L") ;;; dna feature identifier
-    (nbeg 1000000)
-    (nend 1000001)
-    (strand 1)
-    (residue_info "A")
-    (rank 0)
-    (locgroup 0))
-  (featureloc
-    (residue_info "G")
-    (rank 1)
-    (locgroup 0)))
-the first location is on the chromosome arm (presumably wildtype). the
-second location has no srcfeature (ie it is set to null). however, it
-is effectively paired with the first location. if we later wished to
-instantiate the mutant chromosome arm feature, we would fill in the
-second locgroup's srcfeature.
-Let's take another example - a SNP that has only been characterised at
-the protein level. This SNP flips an I to a V
-(feature
-  (name "SNP_02")
-  (featureloc
-    (srcfeature "dpp-P1")    ;;; protein feature identifier
-    (nbeg 23)
-    (nend 24)
-    (strand 1)
-    (residue_info "I")
-    (rank 0)
-    (locgroup 0))
-  (featureloc
-    (residue_info "V")
-    (rank 1)
-    (locgroup 0)))
-Again, the second featureloc has no srcfeature. the mutant protein is
-implicit. the mutant protein sequence can be infered by taking the
-sequence of "dpp-P1" and substituting the 24th residue with a V.
-To do a query for all SNPs that switch I to V or vice versa:
-SELECT snp.*
-FROM
-  featureloc AS wildloc,
-  featureloc AS mutloc,
-  feature AS snp,
-  cvterm AS ftype
-WHERE
-  snp.type_id = ftype.cvterm_id        AND
-  ftype.termname = 'snp'               AND
-  wildloc.feature_id = snp.feature_id  AND
-  mutloc.feature_id = snp.feature_id   AND
-  wildloc.locgroup = mutloc.locgroup   AND
-  wildloc.residue_info = 'I'           AND
-  mutloc.residue_info = 'I';
-note that this query remains the same even if mutant protein features
-are instantiated as opposed to left implicit.
-Let's look at a more complex example. If we have a SNP that has been
-localised to the genome, and the SNP has an effect on a protein
-(Isoleucine to Threonine), and we want to redundantly store the SNP
-effect on the genome, transcript and translation.
-[note that in this example, the transcript is on the reverse strand,
-so the residue is reverse complemented]
-(feature
-  (name "SNP_03")
-  ;; position on genome
-  (featureloc
-    (srcfeature "chrom_arm_3R")
-    (nbeg 2000000)
-    (nend 2000001)
-    (strand 1)
-    (residue_info "A")
-    (rank 0)                       ;; wild
-    (locgroup 0))
-  (featureloc
-    (residue_info "G")
-    (rank 1)                       ;; mutant
-    (locgroup 0))
-  ;; position on transcript
-  (featureloc
-    (srcfeature "blah-transcript001")     ;; processed transcript ID
-    (nbeg 1000)
-    (nend 1001)
-    (strand 1)
-    (residue_info "T")
-    (rank 0)                       ;; wild
-    (locgroup 1))
-  (featureloc
-    (residue_info "C")
-    (rank 1)                       ;; mutant
-    (locgroup 1))
-  ;; position on protein
-  (featureloc
-    (srcfeature "blah-protein001")    ;;; protein feature identifier
-    (nbeg 23)
-    (nend 24)
-    (strand 1)
-    (residue_info "I")
-    (rank 0)                       ;; wild
-    (locgroup 2))
-  (featureloc
-    (residue_info "T")
-    (rank 1)                       ;; mutant
-    (locgroup 2)))
-Here we have 6 locations for one SNP. The 6 locations can be imagined
-to be in a 2D matrix. the purpose of rank and locgroup is to specify
-the column and row in the matrix
-        | genome    transcript   protein
---------+-------------------------------
-wild    |   A           T        I
-        |
-mutant  |   G           C        T
-rank is used to group the strain and locgroup is used for the grouping
-within that strain. rank=0 should be used for the wildtype, but this
-is not always possible; locgroup=0 should be used for primary (as
-opposed to derived) location, this is not always possible. the
-important thing is consistency within a SNP to preserve the matrix.
-One can imagine rare (but entirely possible) cases where by a single
-SNP causes different protein level changes in two proteins (for
-instance, HIV carries a doubly encoded gene - ie the ORFs overlap but
-have different frames).
-Here we would want to add another locgroup, for the second protein
-        | genome    transcript   protein1 protein2
---------+-----------------------------------------
-wild    |   A           T        I        Y
-        |
-mutant  |   G           C        T        H
-Again, if we don't need to instantiate the 2 mutant proteins, but
-their sequence can be reconstructed from the wild proteins plus the
-corresponding mutation
-[remember chado is interbase, and postgresql substring counts from 1]
-The following query dynamically constructs mutant feature residues
-based on the wildtype feature and the mutant residue changes. this
-should work for a variety of variation features, not just SNPs. Note
-that we need to use locgroup to properly group wild/mutant pairs of
-locations otherwise this query will give bad data.
-SELECT
- snp.name,
- wildfeat.name,
- substr(wildfeat.residues,
-,
-        wildloc.nbeg) ||
- mutloc.residue_info  ||
- substr(wildfeat.residues,
-        wildloc.nend+1)
-FROM
-  featureloc AS wildloc,
-  feature AS wildfeat,
-  featureloc AS mutloc,
-  feature AS snp,
-  cvterm AS ftype
-WHERE
-  snp.type_id = ftype.cvterm_id         AND
-  ftype.termname = 'snp'                AND
-  wildloc.feature_id = snp.feature_id   AND
-  mutloc.feature_id = snp.feature_id    AND
-  wildloc.locgroup = mutloc.locgroup    AND
-  wildloc.srcfeature = wildfeat
-EXTENSIONS
-==========
-The above will also work if we have a polymorphic site with a number
-of different possibilities across multiple strains. We just extend the
-number of rows in the location matrix (ie we have rank > 1).
-We could also instantiate multiple SNPs, one per strain, and keep the
-locations pairwise.
-SIMILARITIES TO ALIGNMENTS
-==========================
-You should hopefully notice the parallels between modeling SNPs and
-modeling pairwise (eg BLAST) and multiple alignments. The difference
-is, alignments would always have locgroup=0, with the rank
-distinguishing query from subject. Also, with an HSP feature, the
-residue_info is used to store the alignment string.
-REDUNDANT STORAGE OF COORDINATES ON DIFFERENT ASSEMBLY LEVELS
-=============================================================
-Some groups may find it advantageous to redundantly store features
-relative to both BACs and chromosomes (or to mini-contigs and
-scaffolds... choose your favourite assembly units). The approach
-outlined above works perfectly well with this, we would simple add
-another column in the location matrix (ie another wild/mutant pair
-with a distinct locgroup). All queries should work the same.
-gene-with-implicit-features-manifested
-Some feature types such as introns are not normally manifested as rows in chado. They are normally derived on-the-fly from the gaps between consecutive exons. See for an example. Occasionally it may be desirable to store the introns actual rows in the feature table - for scenario in a report database
-feature-localization
-All features with sequence annotation should be localized using featureloc
-localized features must have a featureloc with rank=0 and locgroup=0. This is the primary location of the feature. The location always indicates the boundaries of the feature. If the feature is composed of distinct subfeatures (e.g. a transcript composes of exons), then it is NOT permitted to use multiple featurelocs to indicate this. Instead, there must be rows for the subfeatures, each with their own featureloc
-In a feature graph (i.e. a group of features connected via feature_relationship rows, all features will typically be localized relative to the same source feature (i.e. they will all have the same value for featureloc.srcfeature_id)
-features are typically localized to some kind of genomic or assembly feature, but chado does not constrain you to using only this. For example, localizing features relative to a transcript or polypeptide or even exon is permitted, but unusual practices will most likely not be recognized by most software
-feature-localization-to-contigs-in-assembly
-In an assembled genome, it is common to locate relative to the top-level assembly units (e.g. chromosomes). However, it is also permissable to locate to smaller units such as contigs or golden_path_units
-If a genome assembly is not stable, it is common to locate relative to assembly units such as contigs. These contigs may then be localized relative to the top-level assembly units. This is known in chado terms as a location graph.
-We discuss here location graphs of depth 2. See also n-level-assemblies. This scenario is often invisible to software interoperating with Chado. The software is free to only look at the main features and the contig-level feature and ignore the top-level assembly feature. It may sometimes be desirable to have software that can perform location transformations, mapping features from contigs to top-level units and back
-Tool: apollo
-Status: unclear
-apollo should be happy to treat contigs just as if they were top-level units as chromosome arms. However, the user may have to explicitly provide contigs if location queries are desired. For example, apollo may retrieve nothing if the user asks for a certain range on chromosome 4, and the features are located relative to contigs which are themselves on chromosome 4.
-Tool: gbrowse
-Status: unclear
-Gbrowse may expect features to be located relative to top-level units such as chromosomes.
-redundant-localizations-to-different-assembly-levels
-Features can be located relative to both contigs and top-level assembly units
-Chado allows redundant feature localization using featureloc.locgroup>0. This allows a database to have primary locations for features relative to contigs, and secondary locations relative to top-level units such as chromosomes. The converse is also allowed.
-This scenario is discouraged unless the chado db admin knows what they are doing. They must implement solutions to ensure that featurelocs with varying locgroup do not get out of sync. These solutions are not part of the standard Chado software suite. Nevertheless, this scenario may be useful for advanced users in certain circumstances
-Tool: gbrowse
-Status: unclear
-Not clear if gbrowse uses locgroup in querying. If it constrains by locgroup, then this is essentially the same as feature-localization-to-contigs-in-assembly
-Tool: gbrowse
-Status: partial
-Not clear if apollo uses locgroup in querying. If it constrains by locgroup, then this is essentially the same as feature-localization-to-contigs-in-assembly. Apollo will not preserve redundant featurelocs when writing back to db. This could lead to db getting out of sync.
-n-level-assemblies
-In theory it is possible (but rare) to have assemblies with variable depths, or with depths>2
-This scenario is rare. If required, then Chado can deal with this - there is no theoretical limit to the depth of a location graph. One can have annotated features located relative to minicontigs which are located relative to supercontigs which are located relative to chromosomes. Most software that interoperates with Chado will not be able to deal with this, so this scenario is discouraged except by advanced users who have no other option
-unlocalized-gene
-A gene without sequence based localization
-Many chado instances are purely concerned with genome annotation - in these cases it would be strange to have genes or other features such as transcripts with no localization (i.e. no featurelocs). However, this scenario is actually common when Chado is used in a wider context. We may of the existence of genes through non-sequence evidence such as genetics. When we have no sequence-based localization it is perfectly valid to have gene features with no featurelocs. When the time comes to create genome annotations for these, we just 'fill out' the gene feature by adding transcript and exon features.
-Tool: gbrowse
-Status: supported
-Gbrowse supports this scenario in that unlocalized features will be ignored from the genome viewer, which is appropriate
-Tool: apollo
-Status: supported
-Apollo supports this scenario in that unlocalized features will be ignored, which is appropriate behaviour for a genome annotation tool
-.3  Table deﬁnitions
-feature
-A feature is a biological sequence or a section of a biological sequence, or a collection of such
-sections. Examples include genes, exons, transcripts, regulatory regions, polypeptides, protein
-domains, chromosome sequences, sequence variations, cross-genome match regions such as hits and
-HSPs and so on; see the Sequence Ontology for more
-  Table 4.1: feature
-Column Datatype  Description
-feature idinteger
-dbxref id integerAn optional primary public stable identiﬁer for this
-  feature. Secondary identiﬁers and external dbxrefs
-  go in table:feature dbxref
-organism id  integerThe organism to which this feature belongs. This
-  column is mandatory
-namevarcharThe optional human-readable common name for a
-  feature, for display purposes
-uniquenametextThe unique name for a feature; may not be necessar-
-  ily be particularly human-readable, although this is
-  prefered. This name must be unique for this type of
-  feature within this organism
-residues  textA sequence of alphabetic characters representing bi-
-  ological residues (nucleic acids, amino acids). This
-  column does not need to be manifested for all fea-
-  tures; it is optional for features such as exons where
-  the residues can be derived from the featureloc. It is
-  recommended that the value for this column be man-
-  ifested for features which may may non-contiguous
-  sublocations (eg transcripts), since derivation at
-  query time is non-trivial. For expressed sequence,
-  the DNA sequence should be used rather than the
-  RNA sequence
-seqlen integerThe length of the residue feature. See col-
-  umn:residues. This column is partially redundant
-  with the residues column, and also with featureloc.
-  This column is required because the location may be
-  unknown and the residue sequence may not be man-
-  ifested, yet it may be desirable to store and query
-  the length of the feature. The seqlen should always
-  be manifested where the length of the sequence is
-  known
-md5checksum  charThe 32-character checksum of the sequence, calcu-
-  lated using the MD5 algorithm. This is practically
-  guaranteed to be unique for any feature. This col-
-  umn thus acts as a unique identiﬁer on the mathe-
-  matical sequence
-type idintegerA required reference to a table:cvterm giving the fea-
-  ture type. This will typically be a Sequence Ontology
-  identiﬁer. This column is thus used to subclass the
-  feature table
-is analysis  booleanBoolean indicating whether this feature is annotated
-  or the result of an automated analysis. Analysis re-
-  sults also use the companalysis module. Note that
-  the dividing line between analysis/annotation may
-  be fuzzy, this should be determined on a per-project
-  basis in a consistent manner. One requirement is
-  that there should only be one non-analysis version of
-  each wild-type gene feature in a genome, whereas the
-  same gene feature can be predicted multiple times in
-  diﬀerent analyses
-is obsolete  booleanBoolean indicating whether this feature has been ob-
-  soleted. Some chado instances may choose to simply
-  remove the feature altogether, others may choose to
-  keep an obsolete row in the table
-timeaccessioned timestamp for handling object accession/modiﬁcation times-
-  tamps (as opposed to db auditing info, handled else-
-  where). The expectation is that these ﬁelds would
-  be available to software interacting with chado
-timelastmodiﬁed timestamp for handling object accession/modiﬁcation times-
-  tamps (as opposed to db auditing info, handled else-
-  where). The expectation is that these ﬁelds would
-  be available to software interacting with chado
-featureloc
-The location of a feature relative to another feature. IMPORTANT: INTERBASE COORDI-
-NATES ARE USED.(This is vital as it allows us to represent zero-length features eg splice sites,
-insertion points without an awkward fuzzy system). Features typically have exactly ONE loca-
-tion, but this need not be the case. Some features may not be localized (eg a gene that has been
-characterized genetically but no sequence/molecular info is available). NOTE ON MULTIPLE
-LOCATIONS: Each feature can have 0 or more locations. Multiple locations do NOT indicate
-non-contiguous locations (if a feature such as a transcript has a non-contiguous location, then the
-subfeatures such as exons should always be manifested). Instead, multiple featurelocs for a feature
-designate alternate locations or grouped locations; for instance, a feature designating a blast hit or
-hsp will have two locations, one on the query feature, one on the subject feature. features repre-
-senting sequence variation could have alternate locations instantiated on a feature on the mutant
-strain. the column:rank is used to diﬀerentiate these diﬀerent locations. Reﬂexive locations should
-never be stored - this is for -proper- (ie non-self) locations only; i.e. nothing should be located
-relative to itself
-  Table 4.2: featureloc
- Column Datatype Description
- featureloc idinteger
- feature idinteger  The feature that is being located. Any feature can
-  have zero or more featurelocs
- srcfeature idinteger  The source feature which this location is relative to.
-  Every location is relative to another feature (how-
-  ever, this column is nullable, because the srcfeature
-  may not be known). All locations are -proper- that
-  is, nothing should be located relative to itself. No
-  cycles are allowed in the featureloc graph
- fmininteger  The leftmost/minimal boundary in the linear range
-  represented by the featureloc.  Sometimes (eg in
-  bioperl) this is called -start- although this is con-
-  fusing because it does not necessarily represent the
--prime coordinate. IMPORTANT: This is space-
-  based (INTERBASE) coordinates, counting from
-  zero. To convert this to the leftmost position in a
-  base-oriented system (eg GFF, bioperl), add 1 to
-  fmin
- is fmin partial boolean  This is typically false, but may be true if the value
-  for column:fmin is inaccurate or the leftmost part of
-  the range is unknown/unbounded
- fmaxinteger  The rightmost/maximal boundary in the linear range
-  represented by the featureloc.  Sometimes (eg in
-  bioperl) this is called -end- although this is con-
-  fusing because it does not necessarily represent the
--prime coordinate. IMPORTANT: This is space-
-  based (INTERBASE) coordinates, counting from
-  zero. No conversion is required to go from fmax to
-  the rightmost coordinate in a base-oriented system
-  that counts from 1 (eg GFF, bioperl)
- is fmax partial boolean  This is typically false, but may be true if the value
-  for column:fmax is inaccurate or the rightmost part
-  of the range is unknown/unbounded
- strand integer  The  orientation/directionality of the  location.
-  Should be 0,-1 or +1
- phase  integer  phase of translation wrt srcfeature id.Values are
-,1,2. It may not be possible to manifest this column for some features such as exons, because the
-  phase is dependant on the spliceform (the same exon
-  can appear in multiple spliceforms). This column is
-  mostly useful for predicted exons and CDSs
- residue info text  Alternative residues, when these diﬀer from fea-
-  ture.residues. for instance, a SNP feature located
-  on a wild and mutant protein would have diﬀerent
-  alresidues. for alignment/similarity features, the altresidues is used to represent the alignment string
-  (CIGAR format). Note on variation features; even
-  if we dont want to instantiate a mutant chromo-
-  some/contig feature, we can still represent a SNP
-  etc with 2 locations, one (rank 0) on the genome,
-  the other (rank 1) would have most ﬁelds null, ex-
-  cept for altresidues
- locgroup  integer  This is used to manifest redundant, derivable ex-
-  tra locations for a feature. The default locgroup=0
-  is used for the DIRECT location of a feature.  !!
-  MOST CHADO USERS MAY NEVER USE featurelocs WITH logroup¿0 !! Transitively derived locations are indicated with locgroup¿0. For example,
-  the position of an exon on a BAC and in global chromosome coordinates.This column is used to dif-
-  ferentiate these groupings of locations. the default
-  locgroup 0 is used for the main/primary location,
-  from which the others can be derived via coordinate
-  transformations. another example of redundant locations is storing ORF coordinates relative to both
-  transcript and genome.redundant locations open
-  the possibility of the database getting into inconsistent states; this schema gives us the ﬂexibility of both
-  warehouse instantiations with redundant locations
-  (easier for querying) and management instantiations
-  with no redundant locations. An example of using
-  both locgroup and rank: imagine a feature indicating a conserved region between the chromosomes of
-  two diﬀerent species. we may want to keep redundant locations on both contigs and chromosomes. we
-  would thus have 4 locations for the single conserved
-  region feature - two distinct locgroups (contig level
-  and chromosome level) and two distinct ranks (for
-  the two species)
- rankinteger  Used when a feature has ¿1 location, otherwise the
-  default rank 0 is used. Some features (eg blast hits
-  and HSPs) have two locations - one on the query
-  and one on the subject. Rank is used to diﬀerentiate these. Rank=0 is always used for the query,
-  Rank=1 for the subject. For multiple alignments, assignment of rank is arbitrary. Rank is also used for
-  sequence variant features, such as SNPs. Rank=0
-  indicates the wildtype (or baseline) feature, Rank=1
-  indicates the mutant (or compared) feature
-featureloc pub
-COMMENT ON INDEX featureloc c1 IS ’locgroup and rank serve to uniquely
-  Table 4.3: featureloc pub
-ColumnDatatypeDescription
-featureloc pub id integer
-featureloc id  integer
-pub idinteger
-feature pub
-Provenance. Linking table between features and publications that mention them
-  Table 4.4: feature pub
-  ColumnDatatype Description
-  feature pub id integer
-  feature id  integer
-  pub idinteger
-featureprop
-A feature can have any number of slot-value property tags attached to it. This is an alternative to
-hardcoding a list of columns in the relational schema, and is completely extensible
-  Table 4.5: featureprop
-  ColumnDatatype Description
-  featureprop id integer
-  feature id  integer
-  type id  integer  The name of the property/slot is a cvterm. The
-  meaning of the property is deﬁned in that cvterm.
-  Certain property types will only apply to certain feature types (e.g. the anticodon property will only apply to tRNA features) ; the types here come from
-  the sequence feature property ontology
-  value text  The value of the property, represented as text. Numeric values are converted to their text representation. This is less eﬃcient than using native database
-  types, but is easier to query.
-  rank  integer  Property-Value ordering. Any feature can have multiple values for any particular property type - these
-  are ordered in a list using rank, counting from zero.
-  For properties that are single-valued rather than
-  multi-valued, the default 0 value should be used
-featureprop pub
-for any one feature, multivalued property-value pairs must be diﬀerentiated by rank
-Table 4.6: featureprop pub
-Column Datatype Description
-featureprop pub id integer
-featureprop id  integer
-pub id integer
-feature dbxref
-links a feature to dbxrefs. This is for secondary identiﬁers; primary identiﬁers should use fea-
-ture.dbxref id
-Table 4.7: feature dbxref
- ColumnDatatype  Description
- feature dbxref id integer
- feature id  integer
- dbxref idinteger
- is current  booleanthe is current boolean indicates whether the linked
-  dbxref is the current -oﬃcial- dbxref for the linked
-  feature
-feature relationship
-features can be arranged in graphs, eg exon part of transcript part of gene; translation madeby
-transcript if type is thought of as a verb, each arc makes a statement [SUBJECT VERB OBJECT]
-object can also be thought of as parent (containing feature), and subject as child (contained feature
-or subfeature) – we include the relationship rank/order, because even though most of the time we
-can order things implicitly by sequence coordinates, we cant always do this - eg transpliced genes.
-its also useful for quickly getting implicit introns
-  Table 4.8: feature relationship
-  ColumnDatatype Description
-  feature relationship id integer
-  subject id  integer  the subject of the subj-predicate-obj sentence. This
-  is typically the subfeature
-  object idinteger  the object of the subj-predicate-obj sentence. This
-  is typically the container feature
-  type id  integer  relationship type between subject and object. This
-  is a cvterm, typically from the OBO relationship
-  ontology, although other relationship types are al-
-  lowed. The most common relationship type is
-  OBO REL:part of. Valid relationship types are con-
-  strained by the Sequence Ontology
-  value text  Additional notes/comments
-  rank  integer  The ordering of subject features with respect to the
-  object feature may be important (for example, exon
-  ordering on a transcript - not always derivable if you
-  take trans spliced genes into consideration). rank is
-  used to order these; starts from zero
-feature relationship pub
-Provenance. Attach optional evidence to a feature relationship in the form of a publication
-Table 4.9: feature relationship pub
-  Column  Datatype Description
-  feature relationship pub id  integer
-  feature relationship idinteger
-  pub id  integer
-feature relationshipprop
-Extensible properties for feature relationships. Analagous structure to featureprop. This table is
-largely optional and not used with a high frequency. Typical scenarios may be if one wishes to
-attach additional data to a feature relationship - for example to say that the feature relationship
-is only true in certain contexts
- Table 4.10: feature relationshipprop
-Column  Datatype Description
-feature relationshipprop id  integer
-feature relationship idinteger
-type id integer  The name of the property/slot is a cvterm.The
-  meaning of the property is deﬁned in that cvterm.
-  Currently there is no standard ontology for feature relationship property types
-valuetext  The value of the property, represented as text. Numeric values are converted to their text representation. This is less eﬃcient than using native database
-  types, but is easier to query.
-rank integer  Property-Value ordering. Any feature relationship
-  can have multiple values for any particular property
-  type - these are ordered in a list using rank, counting from zero. For properties that are single-valued
-  rather than multi-valued, the default 0 value should
-  be used
-feature relationshipprop pub
-Provenance for feature relationshipprop
- Table 4.11: feature relationshipprop pub
-Column Datatype Description
-feature relationshipprop pub idinteger
-feature relationshipprop id integer
-pub id integer
-feature cvterm
-Associate a term from a cv with a feature, for example, GO annotation
-  Table 4.12: feature cvterm
-ColumnDatatypeDescription
-feature cvterm id integer
-feature id  integer
-cvterm idinteger
-pub idinteger Provenance for the annotation.Each annotation
-  should have a single primary publication (which
-  may be of the appropriate type for computational
-  analyses) where more details can be found. Additional provenance dbxrefs can be attached using feature cvterm dbxref
-is notboolean if this is set to true, then this annotation is interpreted as a NEGATIVE annotation - ie the feature
-  does NOT have the speciﬁed function, process, component, part, etc. See GO docs for more details
-feature cvtermprop
-Extensible properties for feature to cvterm associations. Examples: GO evidence codes; qualiﬁers;
-metadata such as the date on which the entry was curated and the source of the association. See
-the featureprop table for meanings of type id, value and rank
-Table 4.13: feature cvtermprop
-  Column Datatype Description
-  feature cvtermprop id integer
-  feature cvterm id  integer
-  type idinteger  The name of the property/slot is a cvterm.  The
-meaning of the property is deﬁned in that cvterm.
-cvterms may come from the OBO evidence code cv
-  value  text  The value of the property, represented as text. Numeric values are converted to their text representation. This is less eﬃcient than using native database
-types, but is easier to query.
-  rankinteger  Property-Value ordering.  Any feature cvterm can
-have multiple values for any particular property type
-- these are ordered in a list using rank, counting from
-zero. For properties that are single-valued rather
-than multi-valued, the default 0 value should be used
-feature cvterm dbxref
-Additional dbxrefs for an association. Rows in the feature cvterm table may be backed up by
-dbxrefs. For example, a feature cvterm association that was inferred via a protein-protein interaction may be backed by by refering to the dbxref for the alternate protein. Corresponds to the
-WITH column in a GO gene association ﬁle (but can also be used for other analagous associations).
-See http://www.geneontology.org/doc/GO.annotation.shtml#ﬁle for more details
- Table 4.14: feature cvterm dbxref
-Column Datatype Description
-feature cvterm dbxref id integer
-feature cvterm id  integer
-dbxref id integer
-feature cvterm pub
-Secondary pubs for an association. Each feature cvterm association is supported by a single primary
-publication. Additional secondary pubs can be added using this linking table (in a GO gene
-association ﬁle, these corresponding to any IDs after the pipe symbol in the publications column
-  Table 4.15: feature cvterm pub
-  Column Datatype Description
-  feature cvterm pub id integer
-  feature cvterm id  integer
-  pub id integer
-synonym
-A synonym for a feature. One feature can have multiple synonyms, and the same synonym can
-apply to multiple features
-Table 4.16: synonym
-  Column Datatype Description
-  synonym idinteger
-  namevarchar  The synonym itself.  Should be human-readable
-machine-searchable ascii text
-  type idinteger  types would be symbol and fullname for now
-  synonym sgml varchar  The fully speciﬁed synonym, with any non-ascii characters encoded in SGML
-feature synonym
-Linking table between feature and synonym
-  Table 4.17: feature synonym
-  Column Datatype Description
-  feature synonym id integer
-  synonym idinteger
-  feature idinteger
-  pub id integer  the pub id link is for relating the usage of a given
-synonym to the publication in which it was used
-  is currentboolean  the is current boolean indicates whether the linked
-synonym is the current -oﬃcial- symbol for the linked
-feature
-  is internal  boolean  typically a synonym exists so that somebody query-
-ing the db with an obsolete name can ﬁnd the ob-
-ject theyre looking for (under its current name. If
-the synonym has been used publicly & deliberately
-(eg in a paper), it my also be listed in reports as a
-synonym. If the synonym was not used deliberately
-(eg, there was a typo which went public), then the
-is internal boolean may be set to -true- so that it is
-known that the synonym is -internal- and should be
-queryable but should not be listed in reports as a
-valid synonym
-feature
-A feature is a biological sequence or a section of a biological sequence, or a collection of such
-sections. Examples include genes, exons, transcripts, regulatory regions, polypeptides, protein
-domains, chromosome sequences, sequence variations, cross-genome match regions such as hits and
-HSPs and so on; see the Sequence Ontology for more
-  Table 4.18: feature
-Column Datatype  Description
-feature idinteger
-dbxref id integerAn optional primary public stable identiﬁer for this
-  feature. Secondary identiﬁers and external dbxrefs
-  go in table:feature dbxref
-organism id  integerThe organism to which this feature belongs. This
-  column is mandatory
-namevarchar
-The optional human-readable common name for a
-  feature, for display purposes
-uniquenametextThe unique name for a feature; may not be necessarily be particularly human-readable, although this is
-  prefered. This name must be unique for this type of
-  feature within this organism
-residues  textA sequence of alphabetic characters representing biological residues (nucleic acids, amino acids). This
-  column does not need to be manifested for all features; it is optional for features such as exons where
-  the residues can be derived from the featureloc. It is
-  recommended that the value for this column be manifested for features which may may non-contiguous
-  sublocations (eg transcripts), since derivation at
-  query time is non-trivial. For expressed sequence,
-  the DNA sequence should be used rather than the
-  RNA sequence
-seqlen integerThe length of the residue feature. See column:residues. This column is partially redundant
-  with the residues column, and also with featureloc.
-  This column is required because the location may be
-  unknown and the residue sequence may not be manifested, yet it may be desirable to store and query
-  the length of the feature. The seqlen should always
-  be manifested where the length of the sequence is
-  known
-md5checksum  charThe 32-character checksum of the sequence, calculated using the MD5 algorithm. This is practically
-  guaranteed to be unique for any feature. This column thus acts as a unique identiﬁer on the mathematical sequence
-type idintegerA required reference to a table:cvterm giving the feature type. This will typically be a Sequence Ontology
-  identiﬁer. This column is thus used to subclass the
-  feature table
-is analysis  booleanBoolean indicating whether this feature is annotated
-  or the result of an automated analysis. Analysis results also use the companalysis module. Note that
-  the dividing line between analysis/annotation may
-  be fuzzy, this should be determined on a per-project
-  basis in a consistent manner. One requirement is
-  that there should only be one non-analysis version of
-  each wild-type gene feature in a genome, whereas the
-  same gene feature can be predicted multiple times in
-  diﬀerent analyses
-is obsolete  booleanBoolean indicating whether this feature has been obsoleted. Some chado instances may choose to simply
-  remove the feature altogether, others may choose to
-  keep an obsolete row in the table
-timeaccessioned timestamp for handling object accession/modiﬁcation timestamps (as opposed to db auditing info, handled elsewhere). The expectation is that these ﬁelds would
-  be available to software interacting with chado
-timelastmodiﬁed timestamp for handling object accession/modiﬁcation timestamps (as opposed to db auditing info, handled else where). The expectation is that these ﬁelds would
-  be available to software interacting with chado
-featureloc
-The location of a feature relative to another feature. IMPORTANT: INTERBASE COORDI-
-NATES ARE USED.(This is vital as it allows us to represent zero-length features eg splice sites,
-insertion points without an awkward fuzzy system). Features typically have exactly ONE location, but this need not be the case. Some features may not be localized (eg a gene that has been
-characterized genetically but no sequence/molecular info is available). NOTE ON MULTIPLE
-LOCATIONS: Each feature can have 0 or more locations. Multiple locations do NOT indicate
-non-contiguous locations (if a feature such as a transcript has a non-contiguous location, then the
-subfeatures such as exons should always be manifested). Instead, multiple featurelocs for a feature
-designate alternate locations or grouped locations; for instance, a feature designating a blast hit or
-hsp will have two locations, one on the query feature, one on the subject feature. features representing sequence variation could have alternate locations instantiated on a feature on the mutant
-strain. the column:rank is used to diﬀerentiate these diﬀerent locations. Reﬂexive locations should
-never be stored - this is for -proper- (ie non-self) locations only; i.e. nothing should be located
-relative to itself
-  Table 4.19: featureloc
- Column Datatype Description
- featureloc idinteger
- feature idinteger  The feature that is being located. Any feature can
-  have zero or more featurelocs
- srcfeature idinteger  The source feature which this location is relative to.
-  Every location is relative to another feature (how-
-  ever, this column is nullable, because the srcfeature
-  may not be known). All locations are '''proper''' - that
-  is, nothing should be located relative to itself. No
-  cycles are allowed in the featureloc graph
- fmininteger  The leftmost/minimal boundary in the linear range
-  represented by the featureloc.  Sometimes (e.g. in
-  [http://bioperl.org Bioperl]) this is called -start- although this is confusing because it does not necessarily represent the
--prime coordinate. IMPORTANT: This is space-based (INTERBASE) coordinates, counting from
-  zero. To convert this to the leftmost position in a
-  base-oriented system (eg GFF, bioperl), add 1 to
-  fmin
- is fmin partial boolean  This is typically false, but may be true if the value
-  for column:fmin is inaccurate or the leftmost part of
-  the range is unknown/unbounded
- fmaxinteger  The rightmost/maximal boundary in the linear range
-  represented by the featureloc.  Sometimes (eg in
-  bioperl) this is called -end- although this is con-
-  fusing because it does not necessarily represent the
--prime coordinate. IMPORTANT: This is space-
-  based (INTERBASE) coordinates, counting from
-  zero. No conversion is required to go from fmax to
-  the rightmost coordinate in a base-oriented system
-  that counts from 1 (eg GFF, bioperl)
- is fmax partial boolean  This is typically false, but may be true if the value
-  for column:fmax is inaccurate or the rightmost part
-  of the range is unknown/unbounded
- strand integer  The  orientation/directionality of the  location.
-  Should be 0,-1 or +1
- phase  integer  phase of translation wrt srcfeature id.Values are
-,1,2. It may not be possible to manifest this column for some features such as exons, because the
-  phase is dependant on the spliceform (the same exon
-  can appear in multiple spliceforms). This column is
-  mostly useful for predicted exons and CDSs
- residue info text  Alternative residues, when these diﬀer from feature.residues. for instance, a SNP feature located
-  on a wild and mutant protein would have diﬀerent
-  alresidues. for alignment/similarity features, the altresidues is used to represent the alignment string
-  (CIGAR format). Note on variation features; even
-  if we dont want to instantiate a mutant chromosome/contig feature, we can still represent a SNP
-  etc with 2 locations, one (rank 0) on the genome,
-  the other (rank 1) would have most ﬁelds null, except for altresidues
- locgroup  integer  This is used to manifest redundant, derivable ettra locations for a feature. The default locgroup=0
-  is used for the DIRECT location of a feature.  !!
-  MOST CHADO USERS MAY NEVER USE featurelocs WITH logroup¿0 !! Transitively derived loca-
-  tions are indicated with locgroup¿0. For example,
-  the position of an exon on a BAC and in global chromosome coordinates.This column is used to dif-
-  ferentiate these groupings of locations. the default
-  locgroup 0 is used for the main/primary location,
-  from which the others can be derived via coordinate
-  transformations. another example of redundant locations is storing ORF coordinates relative to both
-  transcript and genome.redundant locations open
-  the possibility of the database getting into inconsistent states; this schema gives us the ﬂexibility of both
-  warehouse instantiations with redundant locations
-  (easier for querying) and management instantiations
-  with no redundant locations. An example of using
-  both locgroup and rank: imagine a feature indicating a conserved region between the chromosomes of
-  two diﬀerent species. we may want to keep redundant locations on both contigs and chromosomes. we
-  would thus have 4 locations for the single conserved
-  region feature - two distinct locgroups (contig level
-  and chromosome level) and two distinct ranks (for
-  the two species)
- rankinteger  Used when a feature has ¿1 location, otherwise the
-  default rank 0 is used. Some features (eg blast hits
-  and HSPs) have two locations - one on the query
-  and one on the subject. Rank is used to diﬀerentiate these. Rank=0 is always used for the query,
-  Rank=1 for the subject. For multiple alignments, assignment of rank is arbitrary. Rank is also used for
-  sequence variant features, such as SNPs. Rank=0
-  indicates the wildtype (or baseline) feature, Rank=1
-  indicates the mutant (or compared) feature
-featureloc pub
-COMMENT ON INDEX featureloc c1 IS ’locgroup and rank serve to uniquely
-  Table 4.20: featureloc pub
-ColumnDatatypeDescription
-featureloc pub id integer
-featureloc id  integer
-pub idinteger
-feature pub
-Provenance. Linking table between features and publications that mention them
-  Table 4.21: feature pub
-  ColumnDatatype Description
-  feature pub id integer
-  feature id  integer
-  pub idinteger
-featureprop
-A feature can have any number of slot-value property tags attached to it. This is an alternative to
-hardcoding a list of columns in the relational schema, and is completely extensible
-  Table 4.22: featureprop
-  ColumnDatatype Description
-  featureprop id integer
-  feature id  integer
-  type id  integer  The name of the property/slot is a cvterm. The
-  meaning of the property is deﬁned in that cvterm.
-  Certain property types will only apply to certain fea-
-  ture types (e.g. the anticodon property will only ap-
-  ply to tRNA features) ; the types here come from
-  the sequence feature property ontology
-  value text  The value of the property, represented as text. Nu-
-  meric values are converted to their text representa-
-  tion. This is less eﬃcient than using native database
-  types, but is easier to query.
-  rank  integer  Property-Value ordering. Any feature can have mul-
-  tiple values for any particular property type - these
-  are ordered in a list using rank, counting from zero.
-  For properties that are single-valued rather than
-  multi-valued, the default 0 value should be used
-featureprop pub
-for any one feature, multivalued property-value pairs must be diﬀerentiated by rank
- Table 4.23: featureprop pub
-Column Datatype Description
-featureprop pub id integer
-featureprop id  integer
-pub id integer
-feature dbxref
-links a feature to dbxrefs. This is for secondary identiﬁers; primary identiﬁers should use fea-
-ture.dbxref id
-Table 4.24: feature dbxref
- ColumnDatatypeDescription
- feature dbxref id integer
- feature id  integer
- dbxref idinteger
- is current  boolean the is current boolean indicates whether the linked
-dbxref is the current -oﬃcial- dbxref for the linked
-feature
-feature relationship
-features can be arranged in graphs, eg exon part of transcript part of gene; translation madeby
-transcript if type is thought of as a verb, each arc makes a statement [SUBJECT VERB OBJECT]
-object can also be thought of as parent (containing feature), and subject as child (contained feature
-or subfeature) – we include the relationship rank/order, because even though most of the time we
-can order things implicitly by sequence coordinates, we cant always do this - eg transpliced genes.
-its also useful for quickly getting implicit introns
- Table 4.25: feature relationship
-  ColumnDatatype Description
-  feature relationship id integer
-  subject id  integer  the subject of the subj-predicate-obj sentence. This
-  is typically the subfeature
-  object idinteger  the object of the subj-predicate-obj sentence. This
-  is typically the container feature
-  type id  integer  relationship type between subject and object. This
-  is a cvterm, typically from the OBO relationship
-  ontology, although other relationship types are al-
-  lowed. The most common relationship type is
-  OBO REL:part of. Valid relationship types are con-
-  strained by the Sequence Ontology
-  value text  Additional notes/comments
-  rank  integer  The ordering of subject features with respect to the
-  object feature may be important (for example, exon
-  ordering on a transcript - not always derivable if you
-  take trans spliced genes into consideration). rank is
-  used to order these; starts from zero
-feature relationship pub
-Provenance. Attach optional evidence to a feature relationship in the form of a publication
-Table 4.26: feature relationship pub
-  Column  Datatype Description
-  feature relationship pub id  integer
-  feature relationship idinteger
-  pub id  integer
-feature relationshipprop
-Extensible properties for feature relationships. Analagous structure to featureprop. This table is
-largely optional and not used with a high frequency. Typical scenarios may be if one wishes to
-attach additional data to a feature relationship - for example to say that the feature relationship
-is only true in certain contexts
- Table 4.27: feature relationshipprop
-Column  Datatype Description
-feature relationshipprop id  integer
-feature relationship idinteger
-type id integer  The name of the property/slot is a cvterm.The
-  meaning of the property is deﬁned in that cvterm.
-  Currently there is no standard ontology for fea-
-  ture relationship property types
-valuetext  The value of the property, represented as text. Nu-
-  meric values are converted to their text representa-
-  tion. This is less eﬃcient than using native database
-  types, but is easier to query.
-rank integer  Property-Value ordering. Any feature relationship
-  can have multiple values for any particular property
-  type - these are ordered in a list using rank, count-
-  ing from zero. For properties that are single-valued
-  rather than multi-valued, the default 0 value should
-  be used
-feature relationshipprop pub
-Provenance for feature relationshipprop
- Table 4.28: feature relationshipprop pub
-Column Datatype Description
-feature relationshipprop pub idinteger
-feature relationshipprop id integer
-pub id integer
-feature cvterm
-Associate a term from a cv with a feature, for example, GO annotation
-  Table 4.29: feature cvterm
+=Tables=
-ColumnDatatypeDescription
+== Table: feature ==
-feature cvterm id integer
-feature id  integer
-cvterm idinteger
-pub idinteger Provenance for the annotation.Each annotation
-  should have a single primary publication (which
-  may be of the appropriate type for computational
-  analyses) where more details can be found. Addi-
-  tional provenance dbxrefs can be attached using fea-
-  ture cvterm dbxref
-is notboolean if this is set to true, then this annotation is inter-
-  preted as a NEGATIVE annotation - ie the feature
-  does NOT have the speciﬁed function, process, com-
-  ponent, part, etc. See GO docs for more details
+A feature is a biological sequence or a section of a biological sequence, or a collection of such sections. Examples include genes, exons, transcripts, regulatory regions, polypeptides, protein domains, chromosome sequences, sequence variations, cross-genome match regions such as hits and HSPs and so on; see the Sequence Ontology for more.
-feature cvtermprop
+{| border="1" cellpadding="3"
+|+ feature Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| feature_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_dbxref| dbxref]]
+| dbxref_id
+| integer
+| '' ''<br /><br />An optional primary public stable identifier for this feature. Secondary identifiers and external dbxrefs go in the table feature_dbxref.
+|- class="tr0"
+|
+[[Chado_Tables#Table:_organism| organism]]
+| organism_id
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />The organism to which this feature belongs. This column is mandatory.
+|- class="tr1"
+|
+| name
+| character varying(255)
+| '' ''<br /><br />The optional human-readable common name for a feature, for display purposes.
+|- class="tr0"
+|
+| uniquename
+| text
+| '' UNIQUE#1 NOT NULL ''<br /><br />The unique name for a feature; may not be necessarily be particularly human-readable, although this is preferred. This name must be unique for this type of feature within this organism.
+|- class="tr1"
+|
+| residues
+| text
+| '' ''<br /><br />A sequence of alphabetic characters representing biological residues (nucleic acids, amino acids). This column does not need to be manifested for all features; it is optional for features such as exons where the residues can be derived from the featureloc. It is recommended that the value for this column be manifested for features which may may non-contiguous sublocations (e.g. transcripts), since derivation at query time is non-trivial. For expressed sequence, the DNA sequence should be used rather than the RNA sequence.
+|- class="tr0"
+|
+| seqlen
+| integer
+| '' ''<br /><br />The length of the residue feature. See column:residues. This column is partially redundant with the residues column, and also with featureloc. This column is required because the location may be unknown and the residue sequence may not be manifested, yet it may be desirable to store and query the length of the feature. The seqlen should always be manifested where the length of the sequence is known.
+|- class="tr1"
+|
+| md5checksum
+| character(32)
+| '' ''<br /><br />The 32-character checksum of the sequence, calculated using the MD5 algorithm. This is practically guaranteed to be unique for any feature. This column thus acts as a unique identifier on the mathematical sequence.
+|- class="tr0"
+|
+[[Chado_Tables#Table:_cvterm| cvterm]]
+| type_id
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />A required reference to a table:cvterm giving the feature type. This will typically be a Sequence Ontology identifier. This column is thus used to subclass the feature table.
+|- class="tr1"
+|
+| is_analysis
+| boolean
+| '' NOT NULL DEFAULT false ''<br /><br />Boolean indicating whether this feature is annotated or the result of an automated analysis. Analysis results also use the companalysis module. Note that the dividing line between analysis and annotation may be fuzzy, this should be determined on a per-project basis in a consistent manner. One requirement is that there should only be one non-analysis version of each wild-type gene feature in a genome, whereas the same gene feature can be predicted multiple times in different analyses.
+|- class="tr0"
+|
+| is_obsolete
+| boolean
+| '' NOT NULL DEFAULT false ''<br /><br />Boolean indicating whether this feature has been obsoleted. Some chado instances may choose to simply remove the feature altogether, others may choose to keep an obsolete row in the table.
+|- class="tr1"
+|
+| timeaccessioned
+| timestamp without time zone
+| '' NOT NULL DEFAULT ('now'::text)::timestamp(6) with time zone ''<br /><br />For handling object accession or modification timestamps (as opposed to database auditing data, handled elsewhere). The expectation is that these fields would be available to software interacting with chado.
+|- class="tr0"
+|
+| timelastmodified
+| timestamp without time zone
+| '' NOT NULL DEFAULT ('now'::text)::timestamp(6) with time zone ''<br /><br />For handling object accession or modification timestamps (as opposed to database auditing data, handled elsewhere). The expectation is that these fields would be available to software interacting with chado.
+|}
+Tables referencing this one via Foreign Key Constraints:
-Extensible properties for feature to cvterm associations. Examples: GO evidence codes; qualiﬁers;
+* [[Chado_Tables#Table:_analysisfeature| analysisfeature]]
-metadata such as the date on which the entry was curated and the source of the association. See
-the featureprop table for meanings of type id, value and rank
+* [[Chado_Tables#Table:_element| element]]
-Table 4.30: feature cvtermprop
+* [[Chado_Tables#Table:_feature_cvterm| feature_cvterm]]
-  Column Datatype Description
+* [[Chado_Tables#Table:_feature_dbxref| feature_dbxref]]
-  feature cvtermprop id integer
-  feature cvterm id  integer
-  type idinteger  The name of the property/slot is a cvterm.  The
-meaning of the property is deﬁned in that cvterm.
-cvterms may come from the OBO evidence code cv
-  value  text  The value of the property, represented as text. Nu-
-meric values are converted to their text representa-
-tion. This is less eﬃcient than using native database
-types, but is easier to query.
-  rankinteger  Property-Value ordering.  Any feature cvterm can
-have multiple values for any particular property type
-- these are ordered in a list using rank, counting from
-zero. For properties that are single-valued rather
-than multi-valued, the default 0 value should be used
+* [[Chado_Tables#Table:_feature_expression| feature_expression]]
-feature cvterm dbxref
+* [[Chado_Tables#Table:_feature_genotype| feature_genotype]]
+* [[Chado_Tables#Table:_feature_phenotype| feature_phenotype]]
-Additional dbxrefs for an association. Rows in the feature cvterm table may be backed up by
+* [[Chado_Tables#Table:_feature_pub| feature_pub]]
-dbxrefs. For example, a feature cvterm association that was inferred via a protein-protein inter-
-action may be backed by by refering to the dbxref for the alternate protein. Corresponds to the
-WITH column in a GO gene association ﬁle (but can also be used for other analagous associations).
-See http://www.geneontology.org/doc/GO.annotation.shtml#ﬁle for more details
+* [[Chado_Tables#Table:_feature_relationship| feature_relationship]]
- Table 4.31: feature cvterm dbxref
+* [[Chado_Tables#Table:_feature_synonym| feature_synonym]]
-Column Datatype Description
+* [[Chado_Tables#Table:_featureloc| featureloc]]
-feature cvterm dbxref id integer
-feature cvterm id  integer
-dbxref id integer
+* [[Chado_Tables#Table:_featurepos| featurepos]]
-feature cvterm pub
+* [[Chado_Tables#Table:_featureprop| featureprop]]
+* [[Chado_Tables#Table:_featurerange| featurerange]]
-Secondary pubs for an association. Each feature cvterm association is supported by a single primary
+* [[Chado_Tables#Table:_library_feature| library_feature]]
-publication. Additional secondary pubs can be added using this linking table (in a GO gene
-association ﬁle, these corresponding to any IDs after the pipe symbol in the publications column
+* [[Chado_Tables#Table:_phylonode| phylonode]]
-  Table 4.32: feature cvterm pub
+* [[Chado_Tables#Table:_wwwuser_feature| wwwuser_feature]]
-  Column Datatype Description
+----
-  feature cvterm pub id integer
-  feature cvterm id  integer
-  pub id integer
-synonym
+== Table: feature_cvterm ==
-A synonym for a feature. One feature can have multiple synonyms, and the same synonym can
+Associate a term from a cv with a feature, for example, GO annotation.
-apply to multiple features
+{| border="1" cellpadding="3"
+|+ feature_cvterm Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| feature_cvterm_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature| feature]]
+| feature_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_cvterm| cvterm]]
+| cvterm_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_pub| pub]]
+| pub_id
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />Provenance for the annotation. Each annotation should have a single primary publication (which may be of the appropriate type for computational analyses) where more details can be found. Additional provenance dbxrefs can be attached using feature_cvterm_dbxref.
+|- class="tr0"
+|
+| is_not
+| boolean
+| '' NOT NULL DEFAULT false ''<br /><br />If this is set to true, then this annotation is interpreted as a NEGATIVE annotation - i.e. the feature does NOT have the specified function, process, component, part, etc. See GO docs for more details.
+|}
-Table 4.33: synonym
+Tables referencing this one via Foreign Key Constraints:
-  Column Datatype Description
+* [[Chado_Tables#Table:_feature_cvterm_dbxref| feature_cvterm_dbxref]]
-  synonym idinteger
-  namevarchar  The synonym itself.  Should be human-readable
-machine-searchable ascii text
-  type idinteger  types would be symbol and fullname for now
-  synonym sgml varchar  The fully speciﬁed synonym, with any non-ascii char-
-acters encoded in SGML
+* [[Chado_Tables#Table:_feature_cvterm_pub| feature_cvterm_pub]]
-feature synonym
+* [[Chado_Tables#Table:_feature_cvtermprop| feature_cvtermprop]]
+----
-Linking table between feature and synonym
-  Table 4.34: feature synonym
+== Table: feature_cvterm_dbxref ==
-  Column Datatype Description
+Additional dbxrefs for an association. Rows in the feature_cvterm table may be backed up by dbxrefs. For example, a feature_cvterm association that was inferred via a protein-protein interaction may be backed by by refering to the dbxref for the alternate protein. Corresponds to the WITH column in a GO gene association file (but can also be used for other analagous associations). See http://www.geneontology.org/doc/GO.annotation.shtml#file for more details.
-  feature synonym id integer
-  synonym idinteger
-  feature idinteger
-  pub id integer  the pub id link is for relating the usage of a given
-synonym to the publication in which it was used
-  is currentboolean  the is current boolean indicates whether the linked
-synonym is the current -oﬃcial- symbol for the linked
-feature
-  is internal  boolean  typically a synonym exists so that somebody query-
-ing the db with an obsolete name can ﬁnd the ob-
-ject theyre looking for (under its current name. If
-the synonym has been used publicly & deliberately
-(eg in a paper), it my also be listed in reports as a
-synonym. If the synonym was not used deliberately
-(eg, there was a typo which went public), then the
-is internal boolean may be set to -true- so that it is
-known that the synonym is -internal- and should be
-queryable but should not be listed in reports as a
-valid synonym
+{| border="1" cellpadding="3"
+|+ feature_cvterm_dbxref Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| feature_cvterm_dbxref_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature_cvterm| feature_cvterm]]
+| feature_cvterm_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_dbxref| dbxref]]
+| dbxref_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|}
+----
-genotype
- Table 4.35: genotype
+== Table: feature_cvterm_pub ==
-ColumnDatatype Description
+Secondary pubs for an association. Each feature_cvterm association is supported by a single primary publication. Additional secondary pubs can be added using this linking table (in a GO gene association file, these corresponding to any IDs after the pipe symbol in the publications column.
-genotype id integer
-uniquename  text
-description varchar
+{| border="1" cellpadding="3"
+|+ feature_cvterm_pub Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| feature_cvterm_pub_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature_cvterm| feature_cvterm]]
+| feature_cvterm_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_pub| pub]]
+| pub_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|}
+----
-feature genotype
+== Table: feature_cvtermprop ==
+Extensible properties for feature to cvterm associations. Examples: GO evidence codes; qualifiers; metadata such as the date on which the entry was curated and the source of the association. See the featureprop table for meanings of type_id, value and rank.
-  Table 4.36: feature genotype
+{| border="1" cellpadding="3"
+|+ feature_cvtermprop Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| feature_cvtermprop_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature_cvterm| feature_cvterm]]
+| feature_cvterm_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_cvterm| cvterm]]
+| type_id
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. cvterms may come from the OBO evidence code cv.
+|- class="tr1"
+|
+| value
+| text
+| '' ''<br /><br />The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
+|- class="tr0"
+|
+| rank
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />Property-Value ordering. Any feature_cvterm can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.
+|}
- Column  Datatype Description
+----
- feature genotype id integer
- feature id integer
- genotype idinteger
- chromosome id integer
- rank integer
- cgroup  integer
- cvterm id  integer
-environment
+== Table: feature_dbxref ==
+Links a feature to dbxrefs. This is for secondary identifiers; primary identifiers should use feature.dbxref_id.
+{| border="1" cellpadding="3"
+|+ feature_dbxref Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| feature_dbxref_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature| feature]]
+| feature_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_dbxref| dbxref]]
+| dbxref_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr1"
+|
+| is_current
+| boolean
+| '' NOT NULL DEFAULT true ''<br /><br />True if this secondary dbxref is the most up to date accession in the corresponding db. Retired accessions should set this field to false.
+|}
+----
+== Table: feature_pub ==
-  Table 4.37: environment
+Provenance. Linking table between features and publications that mention them.
-  ColumnDatatype  Description
+{| border="1" cellpadding="3"
-  environment id integer
+|+ feature_pub Structure
-  uniquename  text
+|-
-  description text
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| feature_pub_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature| feature]]
+| feature_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_pub| pub]]
+| pub_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|}
+Tables referencing this one via Foreign Key Constraints:
+* [[Chado_Tables#Table:_feature_pubprop| feature_pubprop]]
-environment cvterm
+----
+== Table: feature_pubprop ==
+Property or attribute of a feature_pub link.
-  Table 4.38: environment cvterm
+{| border="1" cellpadding="3"
+|+ feature_pubprop Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| feature_pubprop_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature_pub| feature_pub]]
+| feature_pub_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_cvterm| cvterm]]
+| type_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr1"
+|
+| value
+| text
+| '' ''
+|- class="tr0"
+|
+| rank
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|}
-  Column Datatype Description
+----
-  environment cvterm id integer
-  environment id  integer
-  cvterm id integer
-phenstatement
-Phenotypes are things like ”larval lethal”. Phenstatements are things like ”dpp[1] is recessive
+== Table: feature_relationship ==
-larval lethal”. So essentially phenstatement is a linking table expressing the relationship between
-genotype, environment, and phenotype.
+Features can be arranged in graphs, e.g. "exon part_of transcript part_of gene"; If type is thought of as a verb, the each arc or edge makes a statement [Subject Verb Object]. The object can also be thought of as parent (containing feature), and subject as child (contained feature or subfeature). We include the relationship rank/order, because even though most of the time we can order things implicitly by sequence coordinates, we can not always do this - e.g. transpliced genes. It is also useful for quickly getting implicit introns.
-  Table 4.39: phenstatement
+{| border="1" cellpadding="3"
+|+ feature_relationship Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| feature_relationship_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature| feature]]
+| subject_id
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />The subject of the subj-predicate-obj sentence. This is typically the subfeature.
+|- class="tr0"
+|
+[[Chado_Tables#Table:_feature| feature]]
+| object_id
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />The object of the subj-predicate-obj sentence. This is typically the container feature.
+|- class="tr1"
+|
+[[Chado_Tables#Table:_cvterm| cvterm]]
+| type_id
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />Relationship type between subject and object. This is a cvterm, typically from the OBO relationship ontology, although other relationship types are allowed. The most common relationship type is OBO_REL:part_of. Valid relationship types are constrained by the Sequence Ontology.
+|- class="tr0"
+|
+| value
+| text
+| '' ''<br /><br />Additional notes or comments.
+|- class="tr1"
+|
+| rank
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />The ordering of subject features with respect to the object feature may be important (for example, exon ordering on a transcript - not always derivable if you take trans spliced genes into consideration). Rank is used to order these; starts from zero.
+|}
- Column  DatatypeDescription
+Tables referencing this one via Foreign Key Constraints:
- phenstatement id integer
- genotype idinteger
- environment idinteger
- phenotype id  integer
- type id integer
- pub id  integer
+* [[Chado_Tables#Table:_feature_relationship_pub| feature_relationship_pub]]
-phendesc
+* [[Chado_Tables#Table:_feature_relationshipprop| feature_relationshipprop]]
+----
-a summary of a set of phenotypic statements for any one gcontext made in any one publication
-Table 4.40: phendesc
+== Table: feature_relationship_pub ==
-ColumnDatatype Description
+Provenance. Attach optional evidence to a feature_relationship in the form of a publication.
-phendesc id integer
-genotype id integer
-environment id integer
-description text
-pub idinteger
+{| border="1" cellpadding="3"
+|+ feature_relationship_pub Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| feature_relationship_pub_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature_relationship| feature_relationship]]
+| feature_relationship_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_pub| pub]]
+| pub_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|}
-phenotype comparison
+----
-comparison of phenotypes eg, genotype1/environment1/phenotype1 ”non-suppressible” wrt geno-
-type2/environment2/phenotype2
+== Table: feature_relationshipprop ==
-Table 4.41: phenotype comparison
+Extensible properties for feature_relationships. Analagous structure to featureprop. This table is largely optional and not used with a high frequency. Typical scenarios may be if one wishes to attach additional data to a feature_relationship - for example to say that the feature_relationship is only true in certain contexts.
-ColumnDatatype Description
+{| border="1" cellpadding="3"
-phenotype comparison id integer
+|+ feature_relationshipprop Structure
-genotype1 idinteger
+|-
-environment1 idinteger
+! F-Key
-genotype2 idinteger
+! Name
-environment2 idinteger
+! Type
-phenotype1 id  integer
+! Description
-phenotype2 id  integer
+|- class="tr0"
-type id  integer
+|
-pub idinteger
+| feature_relationshipprop_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature_relationship| feature_relationship]]
+| feature_relationship_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_cvterm| cvterm]]
+| type_id
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. Currently there is no standard ontology for feature_relationship property types.
+|- class="tr1"
+|
+| value
+| text
+| '' ''<br /><br />The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
+|- class="tr0"
+|
+| rank
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />Property-Value ordering. Any feature_relationship can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.
+|}
+Tables referencing this one via Foreign Key Constraints:
-phenotype
+* [[Chado_Tables#Table:_feature_relationshipprop_pub| feature_relationshipprop_pub]]
+----
-a phenotypic statement, or a single atomic phenotypic observation a controlled sentence describing
-observable eﬀect of non-wt function – e.g. Obs=eye, attribute=color, cvalue=red
-Table 4.42: phenotype
+== Table: feature_relationshipprop_pub ==
- Column  Datatype Description
+Provenance for feature_relationshipprop.
- phenotype id  integer
- uniquename text
- observable id integer  The entity: e.g. anatomy part, biological process
- attr id integer  Phenotypic attribute (quality, property, attribute,
-character) - drawn from PATO
- valuetext  value of attribute - unconstrained free text. Used
-only if cvalue id is not appropriate
- cvalue id  integer  Phenotype attribute value (state)
- assay idinteger  evidence type
+{| border="1" cellpadding="3"
+|+ feature_relationshipprop_pub Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| feature_relationshipprop_pub_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature_relationshipprop| feature_relationshipprop]]
+| feature_relationshipprop_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_pub| pub]]
+| pub_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|}
+----
-phenotype cvterm
-NULL
+== Table: feature_synonym ==
+Linking table between feature and synonym.
-  Table 4.43: phenotype cvterm
+{| border="1" cellpadding="3"
+|+ feature_synonym Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| feature_synonym_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_synonym| synonym]]
+| synonym_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_feature| feature]]
+| feature_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_pub| pub]]
+| pub_id
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />The pub_id link is for relating the usage of a given synonym to the publication in which it was used.
+|- class="tr0"
+|
+| is_current
+| boolean
+| '' NOT NULL DEFAULT true ''<br /><br />The is_current boolean indicates whether the linked synonym is the current -official- symbol for the linked feature.
+|- class="tr1"
+|
+| is_internal
+| boolean
+| '' NOT NULL DEFAULT false ''<br /><br />Typically a synonym exists so that somebody querying the db with an obsolete name can find the object theyre looking for (under its current name. If the synonym has been used publicly and deliberately (e.g. in a paper), it may also be listed in reports as a synonym. If the synonym was not used deliberately (e.g. there was a typo which went public), then the is_internal boolean may be set to -true- so that it is known that the synonym is -internal- and should be queryable but should not be listed in reports as a valid synonym.
+|}
-  Column  Datatype Description
+----
-  phenotype cvterm id integer
-  phenotype id  integer
-  cvterm id  integer
-feature phenotype
+== Table: featureloc ==
-NULL
+The location of a feature relative to another feature. Important: interbase coordinates are used. This is vital as it allows us to represent zero-length features e.g. splice sites, insertion points without an awkward fuzzy system. Features typically have exactly ONE location, but this need not be the case. Some features may not be localized (e.g. a gene that has been characterized genetically but no sequence or molecular information is available). Note on multiple locations: Each feature can have 0 or more locations. Multiple locations do NOT indicate non-contiguous locations (if a feature such as a transcript has a non-contiguous location, then the subfeatures such as exons should always be manifested). Instead, multiple featurelocs for a feature designate alternate locations or grouped locations; for instance, a feature designating a blast hit or hsp will have two locations, one on the query feature, one on the subject feature. Features representing sequence variation could have alternate locations instantiated on a feature on the mutant strain. The column:rank is used to differentiate these different locations. Reflexive locations should never be stored - this is for -proper- (i.e. non-self) locations only; nothing should be located relative to itself.
+{| border="1" cellpadding="3"
+|+ featureloc Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| featureloc_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature| feature]]
+| feature_id
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />The feature that is being located. Any feature can have zero or more featurelocs.
+|- class="tr0"
+|
+[[Chado_Tables#Table:_feature| feature]]
+| srcfeature_id
+| integer
+| '' ''<br /><br />The source feature which this location is relative to. Every location is relative to another feature (however, this column is nullable, because the srcfeature may not be known). All locations are -proper- that is, nothing should be located relative to itself. No cycles are allowed in the featureloc graph.
+|- class="tr1"
+|
+| fmin
+| integer
+| '' ''<br /><br />The leftmost/minimal boundary in the linear range represented by the featureloc. Sometimes (e.g. in Bioperl) this is called -start- although this is confusing because it does not necessarily represent the 5-prime coordinate. Important: This is space-based (interbase) coordinates, counting from zero. To convert this to the leftmost position in a base-oriented system (eg GFF, Bioperl), add 1 to fmin.
+|- class="tr0"
+|
+| is_fmin_partial
+| boolean
+| '' NOT NULL DEFAULT false ''<br /><br />This is typically false, but may be true if the value for column:fmin is inaccurate or the leftmost part of the range is unknown/unbounded.
+|- class="tr1"
+|
+| fmax
+| integer
+| '' ''<br /><br />The rightmost/maximal boundary in the linear range represented by the featureloc. Sometimes (e.g. in bioperl) this is called -end- although this is confusing because it does not necessarily represent the 3-prime coordinate. Important: This is space-based (interbase) coordinates, counting from zero. No conversion is required to go from fmax to the rightmost coordinate in a base-oriented system that counts from 1 (e.g. GFF, Bioperl).
+|- class="tr0"
+|
+| is_fmax_partial
+| boolean
+| '' NOT NULL DEFAULT false ''<br /><br />This is typically false, but may be true if the value for column:fmax is inaccurate or the rightmost part of the range is unknown/unbounded.
+|- class="tr1"
+|
+| strand
+| smallint
+| '' ''<br /><br />The orientation/directionality of the location. Should be 0, -1 or +1.
+|- class="tr0"
+|
+| phase
+| integer
+| '' ''<br /><br />Phase of translation with respect to srcfeature_id. Values are 0, 1, 2. It may not be possible to manifest this column for some features such as exons, because the phase is dependant on the spliceform (the same exon can appear in multiple spliceforms). This column is mostly useful for predicted exons and CDSs.
+|- class="tr1"
+|
+| residue_info
+| text
+| '' ''<br /><br />Alternative residues, when these differ from feature.residues. For instance, a SNP feature located on a wild and mutant protein would have different alternative residues. for alignment/similarity features, the alternative residues is used to represent the alignment string (CIGAR format). Note on variation features; even if we do not want to instantiate a mutant chromosome/contig feature, we can still represent a SNP etc with 2 locations, one (rank 0) on the genome, the other (rank 1) would have most fields null, except for alternative residues.
+|- class="tr0"
+|
+| locgroup
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />This is used to manifest redundant, derivable extra locations for a feature. The default locgroup=0 is used for the DIRECT location of a feature. Important: most Chado users may never use featurelocs WITH logroup &gt; 0. Transitively derived locations are indicated with locgroup &gt; 0. For example, the position of an exon on a BAC and in global chromosome coordinates. This column is used to differentiate these groupings of locations. The default locgroup 0 is used for the main or primary location, from which the others can be derived via coordinate transformations. Another example of redundant locations is storing ORF coordinates relative to both transcript and genome. Redundant locations open the possibility of the database getting into inconsistent states; this schema gives us the flexibility of both warehouse instantiations with redundant locations (easier for querying) and management instantiations with no redundant locations. An example of using both locgroup and rank: imagine a feature indicating a conserved region between the chromosomes of two different species. We may want to keep redundant locations on both contigs and chromosomes. We would thus have 4 locations for the single conserved region feature - two distinct locgroups (contig level and chromosome level) and two distinct ranks (for the two species).
+|- class="tr1"
+|
+| rank
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />Used when a feature has &gt;1 location, otherwise the default rank 0 is used. Some features (e.g. blast hits and HSPs) have two locations - one on the query and one on the subject. Rank is used to differentiate these. Rank=0 is always used for the query, Rank=1 for the subject. For multiple alignments, assignment of rank is arbitrary. Rank is also used for sequence_variant features, such as SNPs. Rank=0 indicates the wildtype (or baseline) feature, Rank=1 indicates the mutant (or compared) feature.
+|}
-Table 4.44: feature phenotype
+{| width="100%" cellpadding="3"
+|+ featureloc Constraints
+|-
+! Name
+! Constraint
+|- class="tr0"
+| featureloc_c2
+| CHECK ((fmin &lt;= fmax))
+|}
-  ColumnDatatype Description
+Tables referencing this one via Foreign Key Constraints:
-  feature phenotype id integer
-  feature id  integer
-  phenotype idinteger
+* [[Chado_Tables#Table:_featureloc_pub| featureloc_pub]]
-featuremap
+----
-NOTE: this module is all due for revision...
+== Table: featureloc_pub ==
- Table 4.45: featuremap
+Provenance of featureloc. Linking table between featurelocs and publications that mention them.
-  Column  Datatype Description
+{| border="1" cellpadding="3"
-  featuremap id integer
+|+ featureloc_pub Structure
-  name varchar
+|-
-  descriptiontext
+! F-Key
-  unittype idinteger
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| featureloc_pub_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_featureloc| featureloc]]
+| featureloc_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_pub| pub]]
+| pub_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|}
+----
-featurerange
+== Table: featureprop ==
+A feature can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible.
+{| border="1" cellpadding="3"
+|+ featureprop Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| featureprop_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_feature| feature]]
+| feature_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_cvterm| cvterm]]
+| type_id
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. Certain property types will only apply to certain feature types (e.g. the anticodon property will only apply to tRNA features) ; the types here come from the sequence feature property ontology.
+|- class="tr1"
+|
+| value
+| text
+| '' ''<br /><br />The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
+|- class="tr0"
+|
+| rank
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />Property-Value ordering. Any feature can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used
+|}
+Tables referencing this one via Foreign Key Constraints:
- Table 4.46: featurerange
+* [[Chado_Tables#Table:_featureprop_pub| featureprop_pub]]
-Column Datatype  Description
+----
-featurerange id integer
-featuremap idinteger
-feature idinteger
-leftstartf idinteger
-leftendf id  integer
-rightstartf id  integer
-rightendf id integer
-rangestr  varcha
-featurepos
+== Table: featureprop_pub ==
+Provenance. Any featureprop assignment can optionally be supported by a publication.
-Table 4.47: featurepos
+{| border="1" cellpadding="3"
+|+ featureprop_pub Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| featureprop_pub_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+[[Chado_Tables#Table:_featureprop| featureprop]]
+| featureprop_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|- class="tr0"
+|
+[[Chado_Tables#Table:_pub| pub]]
+| pub_id
+| integer
+| '' UNIQUE#1 NOT NULL ''
+|}
-ColumnDatatype Description
+----
-featurepos id  integer
-featuremap id  integer
-feature id  integer
-map feature id integer
-mapposﬂoat
+== Table: synonym ==
+A synonym for a feature. One feature can have multiple synonyms, and the same synonym can apply to multiple features.
-featuremap pub
+{| border="1" cellpadding="3"
+|+ synonym Structure
+|-
+! F-Key
+! Name
+! Type
+! Description
+|- class="tr0"
+|
+| synonym_id
+| serial
+| '' PRIMARY KEY ''
+|- class="tr1"
+|
+| name
+| character varying(255)
+| '' UNIQUE#1 NOT NULL ''<br /><br />The synonym itself. Should be human-readable machine-searchable ascii text.
+|- class="tr0"
+|
+[[Chado_Tables#Table:_cvterm| cvterm]]
+| type_id
+| integer
+| '' UNIQUE#1 NOT NULL ''<br /><br />Types would be symbol and fullname for now.
+|- class="tr1"
+|
+| synonym_sgml
+| character varying(255)
+| '' NOT NULL ''<br /><br />The fully specified synonym, with any non-ascii characters encoded in SGML.
+|}
+Tables referencing this one via Foreign Key Constraints:
-map feature id links to the feature (map) upon which the feature is
+* [[Chado_Tables#Table:_feature_synonym| feature_synonym]]
+* [[Chado_Tables#Table:_library_synonym| library_synonym]]
-Table 4.48: featuremap pub
+----
-ColumnDatatype Description
+[[Category:BLAST]]
-featuremap pub id integer
+[[Category:Chado]]
-featuremap id  integer
+[[Category:Chado Modules]]
-pub id integer

Difference between revisions of "Chado Sequence Module"

Latest revision as of 22:17, 18 December 2013

Contents

Introduction

Features

Names of Features

Feature Synonyms

Feature Locations

The Feature Location Graph

Feature Coordinates

Multiple Locations for a Feature

Difference Between the chado Location Model and Other Schemas

Feature Rank

Extensible Feature Properties

Linking Features to External Databases

Feature Annotations

Relationships Between Features

Compliance

Chado Compliance Layers

Level 0: Relational Schema

Layer 1: Ontologies

Level 2: Graph

Examples: Current implementations

SO terms used for Standard Central-dogma Gene Model

SO terms Used for Storing Alignments

feature_relationship Types

featureloc Policy

Non-central Dogma Gene Models

Other Features

Derivable Feature Types

Sequence Variants

Tables

Table: feature

Table: feature_cvterm

Table: feature_cvterm_dbxref

Table: feature_cvterm_pub

Table: feature_cvtermprop

Table: feature_dbxref

Table: feature_pub

Table: feature_pubprop

Table: feature_relationship

Table: feature_relationship_pub

Table: feature_relationshipprop

Table: feature_relationshipprop_pub

Table: feature_synonym

Table: featureloc

Table: featureloc_pub

Table: featureprop

Table: featureprop_pub

Table: synonym

Navigation menu

Search