Chado CV Module
Contents
- 1 Introduction
- 1.1 Transitive Closure
- 1.2 Rules
- 1.3 Populating cvterm_path
- 1.3.1 Background
- 1.3.2 Logical definitions
- 1.3.3 How logical definitions are stored in Chado
- 1.3.4 Logical Definition Views
- 1.3.5 Example use case: Phenotypes
- 1.3.6 Example use case: feature types
- 1.3.7 Example use case: GO
- 1.3.8 Example use case: Drawing DAGs
- 1.3.9 Loading OWL into Chado
- 1.3.10 Post-coordinating terms
Introduction
We have seen how the sequence module makes extensive use of terms taken from various ontologies such as SO and the OBO Relations Ontology, using the type_id foreign key column. In addition, features can be annotated using ontologies such as GO using the feature_cvterm linking table. These terms are modelled using the cv module, the core of which is the cvterm table.
An ontology, terminology or cv (controlled vocabulary) is a collection of terms (here equivalent to what are more typically called classes, types, categories or kinds in the ontology literature [REF]) in a particular domain of interest. Examples include "gene" (from SO), "transcription factor activity" (from GO molecular function) and "lymphocyte" (from OBO-Cell). The Chado cv module is based on the GO Database schema, described here [14]. Terms are stored in the cvterm table, and relationships between terms are stored in the cvterm_relationship table. This table has a structure analogous to that of the feature_relationship table, in that it has columns subject_id, object_id and type_id. Here, all three of these foreign keys refer to rows in the cvterm table.
A detailed treatment of relationship types in biological ontologies can be found here [13]. Of particular interest to Chado is the is_a relation, which specifies a sub-typing relationship between two terms or classes. Recall that tables in the sequence module (such as the feature table) frequently define a type_id foreign key column to indicate the specific type or class of entity for each row in that table. The combination of the type_id column and the is_a relationship gives Chado a data sub-classing system, beyond what is possible with traditional SQL database semantics.
This is discussed further in a later section. The collection of cvterms and cvterm_relationships can be considered to constitute vertices and edges in a graph. This graph is typically acyclic (a DAG), though this is not guaranteed, as certain relationship types are allowed to form cycles.
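The sub-classing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the is_a edges and feature rows below are simplified, hypothetical data (not a real SO extract), and the function names are invented for this example. It shows how a query for a general type can pick up rows typed with any of its subtypes.

```python
from collections import defaultdict

# Hypothetical, simplified is_a edges as (subject, object) pairs.
IS_A = [
    ("mRNA", "transcript"),
    ("tRNA", "transcript"),
    ("transcript", "region"),
    ("exon", "region"),
]

# Hypothetical feature rows: (feature name, name of the type_id term).
FEATURES = [("f1", "mRNA"), ("f2", "exon"), ("f3", "tRNA")]


def subtypes(term, edges):
    """Return the term plus every term that reaches it through is_a edges."""
    children = defaultdict(set)
    for subj, obj in edges:
        children[obj].add(subj)
    seen, stack = {term}, [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:  # the seen set also guards against cycles
                seen.add(child)
                stack.append(child)
    return seen


def features_of_type(term, features, edges):
    """Select features whose type is the term or one of its is_a subtypes."""
    wanted = subtypes(term, edges)
    return [name for name, ftype in features if ftype in wanted]
```

Asking for "transcript" here selects the features typed as mRNA and tRNA, even though no row is typed "transcript" directly.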
SO Term | SO id |
---|---|
Exon | SL:0000025 |
Intron | SL:0000027 |
mRNA | SL:0000037 |
3_utr | SL:0000029 |
5_utr | SL:0000028 |
noncoding_transcript | SL:0000040 |
miRNA_precursor | SL:0000043 |
miRNA | SL:0000044 |
rRNA | SL:0000042 |
tRNA | SL:0000041 |
regulatory_element | SL:0000052 |
transcription_factor_binding_site | SL:0000054 |
Clone | SL:0000050 |
genbank_entry | SL:0000061 |
Match | SL:0000008 |
nucleotide_match | SL:0000018 |
cross_genome_match | SL:0000074 |
EST_match | SL:0000021 |
mRNA_match | SL:0000022 |
protein_match | SL:0000020 |
translated_nucleotide_match | SL:0000019 |
Pseudogene | SL:0000007 |
repeat_region | SL:0000002 |
direct_repeats | SL:0000005 |
inverted_repeat | SL:0000004 |
repeat_family | SL:0000006 |
retrotransposon | SL:0000014 |
Transposon | SL:0000012 |
tandem_repeat | SL:0000003 |
microsatellite | SL:0000013 |
Remark | SL:0000062 |
Variant | SL:0000065 |
deletion | SL:0000067 |
insertion | SL:0000066 |
SNP | SL:0000072 |
Transitive Closure
Rules
The cvtermpath table is used for calculating the reflexive transitive closure of a relationship, and any derived relationships.
Normal (direct) relationships are stored in the cvterm_relationship table. An entry in this table represents a cvterm_relationship S over some relation R.
S = Subj R Obj
For example:
S = "cardioblast" develops_from "mesodermal cell"
The relation is_a represents a special kind of relation: subsumption, or inheritance.
If X is_a Y, then it follows that all of Y's cvterm_relationship statements are inherited by X.
[Rule 1]
If X is_a Y
and Y R Z
then X R(inh) Z
For example:
"cilium axoneme" is_a "axoneme"
"axoneme" part_of "cell projection"
THEREFORE:
"cilium axoneme" part_of(inh) "cell projection"
Here we use R(inh) to represent an inherited relationship over a relation R.
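As a minimal sketch, Rule 1 can be applied to a list of (subject, relation, object) triples as follows. The function name is hypothetical, and this version applies the rule once (a single is_a step) rather than chaining through several is_a links.

```python
def inherited(relationships):
    """Apply Rule 1 once: if X is_a Y and Y R Z, then X R(inh) Z."""
    is_a_pairs = [(s, o) for s, r, o in relationships if r == "is_a"]
    return {
        (x, r, z)
        for x, y in is_a_pairs
        for s, r, z in relationships
        if s == y
    }


EXAMPLE = [
    ("cilium axoneme", "is_a", "axoneme"),
    ("axoneme", "part_of", "cell projection"),
]
```

With the example triples, the only inherited statement is "cilium axoneme" part_of(inh) "cell projection".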
Populating cvterm_path
The cvtermpath table stores the reflexive transitive closure of a relationship, taking into account subsumption/inheritance. The number of intermediate relationships is represented in the 'distance' column of the table.
Here we use T(path) to represent the 'path' or closure of a relationship. Every T(path) is stored in cvtermpath. We use the same cvterm for T; the fact that it is a path is implicit.
We use these rules:
Reflexive relationships:
for all relations T,
X T(path) X
In this case the distance=0
Direct relationships:
these are also included in the cvtermpath table, distance=1
If X T Y
then X T(path) Y
Transitive relationships:
these have distance > 1; these also make use of the inheritance rule, [Rule 1], which gives us T(inh)
If X T(inh) Y
and Y T(path) Z
then X T(path) Z
Note that this rule is recursive.
These rules should be used for populating cvtermpath. Attempting to calculate a more general closure where all relations are treated equally or ignored will produce combinatorial explosions over certain ontologies (e.g. the FlyBase anatomy ontology).
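The rules above can be sketched as a breadth-first search that, for one relation T at a time, follows only is_a and T edges and records a path whenever the last edge followed was T. This is an illustrative sketch, not the loader Chado actually uses. Two assumptions are made here that the text does not state: every edge on the path (including is_a steps) counts towards the distance, and only the shortest distance per subject/object pair is kept.

```python
from collections import defaultdict, deque


def closure(edges, rel):
    """Return {(subject, object): distance} for rel(path).

    edges is a list of (subject, relation, object) triples. Edges whose
    relation is neither is_a nor rel are ignored, so relations are never mixed.
    """
    adj = defaultdict(list)
    terms = set()
    for subj, r, obj in edges:
        terms.update((subj, obj))
        if r in ("is_a", rel):
            adj[subj].append((r, obj))

    paths = {}
    for start in sorted(terms):
        paths[(start, start)] = 0  # reflexive rule, distance 0
        # A search state is (term, whether the last edge followed was rel).
        seen = {(start, True)}
        queue = deque([(start, 0)])
        while queue:
            node, dist = queue.popleft()
            for r, nxt in adj[node]:
                state = (nxt, r == rel)
                if state in seen:
                    continue
                seen.add(state)
                if r == rel:
                    # Path ends on a rel edge, so it is a valid rel(path).
                    paths.setdefault((start, nxt), dist + 1)
                queue.append((nxt, dist + 1))
    return paths


EDGES = [
    ("a", "is_a", "b"),
    ("b", "part_of", "c"),
    ("c", "part_of", "d"),
    ("d", "is_a", "e"),
    ("e", "part_of", "f"),
]
```

Calling closure(EDGES, "part_of") links a to c, d and f, but not to b or e, because paths that end on an is_a step are not part of part_of(path) under the rules as stated.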
What does this mean in practice?
For a typical database, which may only have the relations is_a, part_of and develops_from, we will end up with 3 sets of paths.
The is_a closure, is_a(path), will include paths over cvterm_relationships that look like this:
a is_a b is_a c is_a d is_a e
The "part_of" closure, part_of(path), will include paths over cvterm_relationships that look like this:
a is_a b part_of c part_of d is_a e part_of f
The "develops_from" closure, develops_from(path), will include paths over cvterm_relationships that look like this:
a develops_from b develops_from c is_a d is_a e develops_from f
It may be tempting to mix different non-is_a relationships in the same path, but this should NEVER be done: there will be an unacceptable combinatorial explosion in many cases. Besides, there is no use for such a cvtermpath; it is meaningless.
Note that for AmiGO-like query behaviour, it is only necessary to query cvtermpath, ignoring cvtermpath.type_id (the relationship types themselves are obtained by querying cvterm_relationship).
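A query of this kind might look like the sketch below, which uses Python's built-in sqlite3 module and a deliberately cut-down, hypothetical version of the cvterm and cvtermpath tables (real Chado tables have more columns and constraints). The point is only that the WHERE clause never mentions type_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the Chado tables; an assumption for illustration.
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE cvtermpath (
    subject_id INTEGER, object_id INTEGER,
    type_id INTEGER, pathdistance INTEGER
);
""")
conn.executemany("INSERT INTO cvterm VALUES (?, ?)", [
    (1, "is_a"), (2, "part_of"), (3, "cell projection"),
    (4, "axoneme"), (5, "cilium axoneme"),
])
conn.executemany("INSERT INTO cvtermpath VALUES (?, ?, ?, ?)", [
    (4, 3, 2, 1),  # axoneme part_of(path) cell projection
    (5, 4, 1, 1),  # cilium axoneme is_a(path) axoneme
    (5, 3, 2, 2),  # cilium axoneme part_of(path) cell projection
])


def descendants(name):
    """All terms with any non-reflexive path to the named term, any type."""
    rows = conn.execute("""
        SELECT DISTINCT s.name
        FROM cvtermpath p
        JOIN cvterm o ON o.cvterm_id = p.object_id
        JOIN cvterm s ON s.cvterm_id = p.subject_id
        WHERE o.name = ? AND p.pathdistance > 0
        ORDER BY s.name
    """, (name,))
    return [row[0] for row in rows]
```

Here descendants("cell projection") returns both "axoneme" and "cilium axoneme" without the caller having to know which relation connects them.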
Table Definitions
cv
A controlled vocabulary or ontology. A cv is composed of cvterms (aka terms, classes, types, universals; relations and properties are also stored in cvterm) and the relationships between them.
Table 3.1: cv
Column | Datatype | Description |
---|---|---|
cv_id | integer | |
name | varchar | The name of the ontology. This corresponds to the obo-format "namespace". cv names uniquely identify the cv. In obo file format, the cv.name is known as the namespace |
definition | text | A text description of the criteria for membership of this ontology |
cvterm
A term, class, universal or type within an ontology or controlled vocabulary. This table is also used for relations and properties. cvterms constitute nodes in the graph defined by the collection of cvterms and cvterm_relationships.
Table 3.2: cvterm
Column | Datatype | Description |
---|---|---|
cvterm_id | integer | |
cv_id | integer | The cv/ontology/namespace to which this cvterm belongs |
name | varchar | A concise human-readable name or label for the cvterm. Uniquely identifies a cvterm within a cv |
definition | text | A human-readable text definition |
dbxref_id | integer | Primary identifier dbxref - the unique global OBO identifier for this cvterm. Note that a cvterm may have multiple secondary dbxrefs - see also table: cvterm_dbxref |
is_obsolete | integer | Boolean 0=false,1=true; see GO documentation for details of obsoletion. Note that two terms with different primary dbxrefs may exist if one is obsolete |
is_relationshiptype | integer | Boolean 0=false,1=true. Relations or relationship types (also known as Typedefs in OBO format, or as properties or slots) form a cv/ontology in themselves. We use this flag to indicate whether this cvterm is an actual term/class/universal or a relation. Relations may be drawn from the OBO Relations ontology, but are not exclusively drawn from there |
cvterm_relationship
A name can mean different things in different contexts; for example "chromosome" in SO and GO. A name should be unique within an ontology/cv. A name may exist twice in a cv, in both obsolete and non-obsolete forms - these will be for different cvterms with different OBO identifiers; see GO documentation for more details on obsoletion. Note that occasionally multiple obsolete terms with the same name will exist in the same cv. If this is a possibility for the ontology under consideration (eg GO) then the ID should be appended to the name to ensure uniqueness.
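One way a loader might apply this naming advice is sketched below. The text only says the ID should be appended; the exact format used here, and the function name, are hypothetical choices for illustration.

```python
def unique_cvterm_name(name, obo_id, is_obsolete):
    """Return a cvterm name that stays unique among same-named obsolete terms."""
    if is_obsolete:
        # Hypothetical format: the source does not specify how to append the ID.
        return f"{name} (obsolete {obo_id})"
    return name
```

For example, two obsolete terms that are both labelled "chromosome" would end up with distinct names because their IDs differ.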
Table 3.3: cvterm_relationship
Column | Datatype | Description |
---|---|---|
cvterm_relationship_id | integer | |
type_id | integer | The nature of the relationship between subject and object. Note that relations are also housed in the cvterm table, typically from the OBO relationship ontology, although other relationship types are allowed |
subject_id | integer | The subject of the subj-predicate-obj sentence. The cvterm_relationship is about the subject. In a graph, this typically corresponds to the child node |
object_id | integer | The object of the subj-predicate-obj sentence. The cvterm_relationship refers to the object. In a graph, this typically corresponds to the parent node |
cvtermpath
The reflexive transitive closure of the cvterm_relationship relation. For a full discussion, see the file populating-cvtermpath.txt in this directory.
Table 3.4: cvtermpath
Column | Datatype | Description |
---|---|---|
cvtermpath_id | integer | |
type_id | integer | The relationship type that this is a closure over. If null, then this is a closure over ALL relationship types. If non-null, then this references a relationship cvterm - note that the closure will apply to both this relationship AND the OBO_REL:is_a (subclass) relationship |
subject_id | integer | |
object_id | integer | |
cv_id | integer | Closures will mostly be within one cv. If the closure of a relationship traverses a cv, then this refers to the cv of the object_id cvterm |
pathdistance | integer | The number of steps required to get from the subject cvterm to the object cvterm, counting from zero (reflexive relationship) |
cvtermsynonym
A cvterm actually represents a distinct class or concept. A concept can be referred to by different phrases or names. In addition to the primary name (cvterm.name) there can be a number of alternative aliases or synonyms. For example, "T cell" as a synonym for "T lymphocyte".
Table 3.5: cvtermsynonym
Column | Datatype | Description |
---|---|---|
cvtermsynonym_id | integer | |
cvterm_id | integer | |
synonym | varchar | |
type_id | integer | A synonym can be exact, narrow or broader than |
cvterm_dbxref
In addition to the primary identifier (cvterm.dbxref_id) a cvterm can have zero or more secondary identifiers/dbxrefs, which may refer to records in external databases. The exact semantics of cvterm_dbxref are not fixed. For example: the dbxref could be a PubMed ID that is pertinent to the cvterm, or it could be an equivalent or similar term in another ontology. For example, GO cvterms are typically linked to InterPro IDs, even though the nature of the relationship between them is largely one of statistical association. The dbxref may have data records attached in the same database instance, or it could be a "hanging" dbxref pointing to some external database. NOTE: If the desired objective is to link two cvterms together, and the nature of the relation is known and holds for all instances of the subject cvterm, then consider instead using cvterm_relationship together with a well-defined relation.
Table 3.6: cvterm_dbxref
Column | Datatype | Description |
---|---|---|
cvterm_dbxref_id | integer | |
cvterm_id | integer | |
dbxref_id | integer | |
is_for_definition | integer | A cvterm.definition should be supported by one or more references. If this column is true, the dbxref is not for a term in an external db - it is a dbxref for provenance information for the definition |
cvtermprop
Additional extensible properties can be attached to a cvterm using this table. Corresponds to "AnnotationProperty" in W3C OWL format.
Table 3.7: cvtermprop
Column | Datatype | Description |
---|---|---|
cvtermprop_id | integer | |
cvterm_id | integer | |
type_id | integer | The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm |
value | text | The value of the property, represented as text. Numeric values are converted to their text representation |
rank | integer | Property-Value ordering. Any cvterm can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used |
dbxrefprop
Metadata about a dbxref. Note that this is not defined in the dbxref module, as it depends on the cvterm table. This table has a structure analogous to cvtermprop.
Table 3.8: dbxrefprop
Column | Datatype | Description |
---|---|---|
dbxrefprop_id | integer | |
dbxref_id | integer | |
type_id | integer | |
value | text | |
rank | integer | |
organism
The organismal taxonomic classification. Note that phylogenies are represented using the phylogeny module, and taxonomies can be represented using the cvterm module or the phylogeny module.
Table 3.9: organism
Column | Datatype | Description |
---|---|---|
organism_id | integer | |
abbreviation | varchar | |
genus | varchar | |
species | varchar | A type of organism is always uniquely identified by genus+species. When mapping from the NCBI taxonomy names.dmp file, the unique-name column must be used where it is present, as the name column is not always unique (eg environmental samples). If a particular strain or subspecies is to be represented, this is appended onto the species name. Follows standard NCBI taxonomy pattern |
common_name | varchar | |
comment | text | |
organism_dbxref
Table 3.10: organism_dbxref
Column | Datatype | Description |
---|---|---|
organism_dbxref_id | integer | |
organism_id | integer | |
dbxref_id | integer | |
organismprop
Tag-value properties - follows standard Chado model.
Table 3.11: organismprop
Column | Datatype | Description |
---|---|---|
organismprop_id | integer | |
organism_id | integer | |
type_id | integer | |
value | text | |
rank | integer | |
This section describes advanced usage of the cv module for use with OWL-DL, advanced Obo format 1.2 [REF] features, or elements from other ontology formalisms.
If you aren't sure what this means, you probably don't need to read this section yet.
Background
See the document on ConvertingOboToOWL.
Logical definitions
In a normal ontology DAG representation in Chado, the cvterm_relationship rows represent relationships between terms, or more formally, necessary conditions. A logical definition must have both necessary and sufficient conditions. A logical definition often consists of a generic term (aka genus) and one or more discriminating characteristics (aka differentiae). The discriminating characteristics are typically relationships.
For example, the logical definition of larval locomotory behavior would be a locomotory behavior (genus) that stands in the during relation to larval stage (where during could be drawn from an ontology of relations, and larval stage may come from an insect developmental stage ontology). These constitute both necessary and sufficient conditions: the conditions are necessary in that all instances of larval locomotory behavior are necessarily locomotory behaviors and are necessarily manifested at the larval stage. We could represent this using a normal DAG. However, because this is a definition it also constitutes sufficient conditions, in that any instance of locomotory behavior which manifests at the larval stage is by definition a larval locomotory behavior.
In an ontology formalism like OWL-DL or Obo-1.2, genus-differentiae are represented using set-intersections.
Here is the Obo 1.2 representation:
[Term]
id: GO:0008345
name: larval locomotory behavior
namespace: biological_process
is_a: GO:0007626 ! locomotory behavior
is_a: GO:0030537 ! larval behavior
intersection_of: GO:0007626 ! GENUS: locomotory behavior
intersection_of: during FBdv:00005336 ! DIFFERENTIUM: during larval stage
Here is the equivalent in OWL (note: RDF-XML syntax is very verbose!):
<owl:Class rdf:ID="GO_0008345">
  <rdfs:label xml:lang="en">larval locomotory behavior</rdfs:label>
  <rdfs:subClassOf rdf:resource="#GO_0007626"/>
  <rdfs:subClassOf rdf:resource="#GO_0030537"/>
  <owl:equivalentClass>
    <owl:Class>
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Class rdf:about="#GO_0007626"/>
        <owl:Restriction>
          <owl:onProperty>
            <owl:ObjectProperty rdf:about="#during"/>
          </owl:onProperty>
          <owl:someValuesFrom rdf:resource="#FBdv_00005336"/>
        </owl:Restriction>
      </owl:intersectionOf>
    </owl:Class>
  </owl:equivalentClass>
</owl:Class>
When converting to chado we employ a more economical representation, in terms of the number of triples we use:
<cvterm_relationship>
  <type_id>is_a</type_id>
  <subject_id>GO:0008345</subject_id>
  <object_id>GO:0007626</object_id>
</cvterm_relationship>
<cvterm_relationship>
  <type_id>is_a</type_id>
  <subject_id>GO:0008345</subject_id>
  <object_id>GO:0030537</object_id>
</cvterm_relationship>
<cvterm_relationship>
  <type_id>intersection_of</type_id>
  <subject_id>GO:0008345</subject_id>
  <object_id>GO:0007626</object_id>
</cvterm_relationship>
<cvterm_relationship>
  <type_id>intersection_of</type_id>
  <subject_id>GO:0008345</subject_id>
  <object_id>
    <cvterm>
      <dbxref_id>
        <dbxref>
          <db_id>internal</db_id>
          <accession>restriction--OBOL:during--GO:0008345</accession>
        </dbxref>
      </dbxref_id>
      <name>restriction--OBOL:during--GO:0008345</name>
      <cv_id>anonymous_cv</cv_id>
      <cvtermprop>
        <type_id>is_anonymous</type_id>
        <value>1</value>
        <rank>0</rank>
      </cvtermprop>
      <cvterm_relationship>
        <type_id>OBOL:during</type_id>
        <object_id>FBdv:00005336</object_id>
      </cvterm_relationship>
    </cvterm>
  </object_id>
</cvterm_relationship>
Note that in the above, we are creating anonymous terms. We give them fake names and fake dbxrefs. In the bbop-experimental cvs branch of chado, names and dbxrefs are nullable, so these can be omitted. With the current schema, you must provide fake dbxrefs and names that are unique, such as the above (if you are not familiar with how ChadoXML maps to the chado schema, see the explanation below).
If you wish to convert Obo-specified logical definitions to chadoxml you will need go-perl v0.05 or higher (if you have a lower version, the intersection_of tags will simply be ignored).
go2chadoxml ont.obo > ont.chado
How logical definitions are stored in Chado
This involves no schema changes to the cv module. Each intersection_of goes in as a DAG arc of type internal:intersection_of. The object_id in the arc is either a term (for the genus) or an anonymous term representing a restriction (the differentium). The restriction has a relationship of some type to another term.
For example, for "larval locomotory behavior" we would normally just have:
LLB is_a LocomotoryBehavior
LLB is_a LarvalBehavior
If we load a logical definition for this term (see go-dev/go-perl/t/data/llm/obo), like this:
[Term]
id: GO:0008345
name: larval locomotory behavior
namespace: biological_process
is_a: GO:0007626 ! locomotory behavior
is_a: GO:0030537 ! larval behavior
intersection_of: GO:0007626 ! locomotory behavior
intersection_of: during FBdv:00005336 ! larval stage
Then the intersection_ofs get stored using the basic DAG tables as:
Table: Logical definition stored in the cvterm_relationship table
Subject | Relation | Object |
---|---|---|
LLB | intersection_of | LocomotoryBehavior |
LLB | intersection_of | anon:xxx |
anon:xxx | during | FBdv:00005336 |
This uses 4 cvterm_relationships and the creation of a new anonymous term that is never shown directly to the user. The anonymous term represents the class of things that happen during the larval stage.
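The mapping from a genus-differentia definition to DAG arcs can be sketched as follows. This is not what go2chadoxml does internally; the function name and the naming scheme for the anonymous term are assumptions made for this example, and only the intersection_of arcs and the restriction arc are produced (the ordinary is_a arcs are left out).

```python
def logical_definition_triples(term, genus, differentia):
    """Return (subject, relation, object) rows for a genus-differentia definition.

    differentia is a list of (relation, filler) pairs.
    """
    triples = [(term, "intersection_of", genus)]
    for relation, filler in differentia:
        # Hypothetical naming scheme for the anonymous restriction term;
        # it only needs to be unique within the database.
        anon = f"restriction--{relation}--{filler}"
        triples.append((term, "intersection_of", anon))
        triples.append((anon, relation, filler))
    return triples


LLB = logical_definition_triples(
    "GO:0008345", "GO:0007626", [("during", "FBdv:00005336")]
)
```

Each differentium contributes two rows: one linking the defined term to the anonymous restriction, and one linking the restriction to its filler term.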
Logical Definition Views
Two views, cvterm_genus and cvterm_differentium, are in chado/modules/cv/views.
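The actual view definitions are not reproduced here. As a rough sketch of the kind of information such views would expose, the Python below extracts the genus and the differentia of a term from a list of triples. The function names, and the idea of passing the set of anonymous terms explicitly, are assumptions for illustration; in Chado the anonymous terms would be identified through the is_anonymous cvtermprop.

```python
def genus(term, triples, anonymous):
    """Named terms that the term is an intersection_of (normally exactly one)."""
    return [
        o for s, r, o in triples
        if s == term and r == "intersection_of" and o not in anonymous
    ]


def differentia(term, triples, anonymous):
    """(relation, filler) pairs reached through anonymous restriction terms."""
    restrictions = {
        o for s, r, o in triples
        if s == term and r == "intersection_of" and o in anonymous
    }
    return [(r, o) for s, r, o in triples if s in restrictions]


TRIPLES = [
    ("LLB", "intersection_of", "LocomotoryBehavior"),
    ("LLB", "intersection_of", "anon:xxx"),
    ("anon:xxx", "during", "FBdv:00005336"),
]
```

For the larval locomotory behavior example, the genus is LocomotoryBehavior and the single differentium is during FBdv:00005336.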
Example use case: Phenotypes
The idea here is that queries for the composed term "syndactyly" should automatically return the same results as a boolean query for "fusion"+inheres_in="finger", regardless of whether the annotation is to the composed term or is a composed annotation (provided we put the logical definition of syndactyly in the database).
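A minimal sketch of that equivalence is shown below. It assumes the logical definitions have been read into a dictionary (the LOGICAL_DEFS structure and function names are hypothetical) and compares classes by exact equality of genus and differentia; it does not attempt subsumption reasoning, which a real application would probably also want.

```python
# Hypothetical in-memory copy of the logical definitions in the database.
LOGICAL_DEFS = {
    "syndactyly": ("fusion", frozenset({("inheres_in", "finger")})),
}


def expand(annotation):
    """Return (genus, differentia) for a term name or a composed annotation."""
    if isinstance(annotation, str):
        # A plain term with no logical definition is its own genus.
        return LOGICAL_DEFS.get(annotation, (annotation, frozenset()))
    genus_term, diffs = annotation
    return (genus_term, frozenset(diffs))


def same_class(query, annotation):
    """True if both sides expand to the same genus and differentia."""
    return expand(query) == expand(annotation)
```

With this, an annotation made directly to "syndactyly" and one composed as fusion with inheres_in finger both match a query for "syndactyly".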
Example use case: feature types
The Sequence Ontology has some logical definitions - you will need to load the file so-xp.obo.
Example use case: GO
See http://www.fruitfly.org/~cjm/obol
Example use case: Drawing DAGs
Currently the DAGs of many OBO ontologies are highly tangled; see: http://www.fruitfly.org/~cjm/obol/doc/go-complexity.html
If all terms have logical definitions, then there is only one 'true' (genus) is_a parent. This enables us to disentangle the DAGs and draw distinct hierarchies. For example, the GO term cysteine biosynthesis could be drawn as two distinct hierarchies: one process and one chemical.
Loading OWL into Chado
Not all OWL-DL features are supported. Only intersectionOfs corresponding to genus-differentiae are loaded.
First you must convert OWL into Obo 1.2 format. There will soon be a way to do this in OboEdit. For now you can use blipkit (http://www.blipkit.org):
blip io-convert my.owl -to obo -o my.obo
Once you have an obo file you can run go2chadoxml, as above.
Post-coordinating terms
Sometimes we want to be able to refer to a term such as "plasma membrane of spermatocyte", but no such term exists in the ontology. Introducing these as pre-coordinated cross-product terms would make the ontology unwieldy.
Chado allows the post-coordination or post-composition of terms using the same formalism as described above. Briefly: we would create an anonymous term. This anonymous term would be defined using the terms plasma membrane and spermatocyte, using a genus-differentia definition as above.
<cvterm_relationship>
  <type_id>intersection_of</type_id>
  <subject_id>anon_1</subject_id>
  <object_id>GO__plasma_membrane</object_id>
</cvterm_relationship>
<cvterm_relationship>
  <type_id>intersection_of</type_id>
  <subject_id>anon_1</subject_id>
  <object_id>
    <cvterm>
      <dbxref_id>
        <dbxref>
          <db_id>internal</db_id>
          <accession>restriction--part_of--spermatocyte</accession>
        </dbxref>
      </dbxref_id>
      <name>restriction--part_of--spermatocyte</name>
      <cv_id>anonymous_cv</cv_id>
      <cvtermprop>
        <type_id>is_anonymous</type_id>
        <value>1</value>
        <rank>0</rank>
      </cvtermprop>
      <cvterm_relationship>
        <type_id>OBO_REL:part_of</type_id>
        <object_id>CL__spermatocyte</object_id>
      </cvterm_relationship>
    </cvterm>
  </object_id>
</cvterm_relationship>
The above assumes XORT macro IDs defined for GO__plasma_membrane and CL__spermatocyte.
Allowing post-coordinated terms places a greater burden on applications that use the cv module. More documentation will be provided here on this.