Difference between revisions of "Chado CV Module"

From GMOD
=Introduction=

This module is for controlled vocabularies (CVs), semantic networks and ontologies, depending on which terminology you prefer.

It is intended to be rich enough to encapsulate anything in the Gene Ontology (GO) or OBO family of ontologies (see the [http://www.geneontology.org GO website] and the [http://obo.sourceforge.net OBO project]). The schema reflects the data model of OBO and of the [http://oboedit.org/ OBO Edit] tool currently used by these projects.

This module is also intended to be extensible to richer formalisms such as [http://en.wikipedia.org/wiki/Web_Ontology_Language OWL (Web Ontology Language)], but this is outside the current requirements.

==Similarities to the GO Database schema==

The schema is similar to the [http://geneontology.org/GO.database.shtml GO database schema], which was also developed by one of the Chado designers.

There is a ''bridge'' layer in the directory <code>modules/cv/bridges/</code>, which can make the Chado cv module look like the GO DB, and vice versa.

==Overview==

An ontology, or controlled vocabulary (cv), is a collection of classes (or concepts or terms, depending on your terminology) with definitions and relationships to other classes. Each class, a word or phrase, can appear only once in a controlled vocabulary and has a defined meaning within that vocabulary. The controlled vocabularies are chosen so that their contents do not overlap; if the same text string is used to describe two different concepts in two different cvs, these are distinct classes. These terms are housed in the [[#Table:_cvterm|cvterm]] table in the Chado schema.

cvterms are related to one another via [[#Table:_cvterm_relationship|cvterm_relationship]]. This can be thought of as a graph, or semantic network. The relationship types (the labels on the arcs of the graph) are also stored in the [[#Table:_cvterm|cvterm]] table. The relationship types are extensible, but the type ''is a'' (subtyping relationship) is assumed to be present; many OBO ontologies use the ''part of'' relationship, and GO also uses the ''regulates'' relation. Relationship types also come from a controlled vocabulary, the [http://obofoundry.org/ro/ OBO Relation Ontology].

The [[#Table:_cvterm_relationship|cvterm_relationship]] table can be thought of as specifying sentences about the cvterms. These sentences have 3 parts: a subject term, an object term, and a verb or type. For example, in the phrase "an exon is part of a transcript" the subject of the sentence is "exon", the object is "transcript" and the type is "part of". If you prefer to think of it as a [[wp:Directed_graph|directed graph]], then you can think of the subject as the child node, and the object as the parent node.

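The sentence view of cvterm_relationship can be sketched in a few lines of Python. This is an illustration only, not part of Chado: the <code>Relationship</code> tuple, the helper functions and the sample rows are assumptions that mirror the ''subject_id'', ''type_id'' and ''object_id'' columns, with term names standing in for integer cvterm ids.

```python
from collections import namedtuple

# Hypothetical in-memory stand-in for a cvterm_relationship row.
# Term names are used in place of integer cvterm ids for readability.
Relationship = namedtuple("Relationship", ["subject", "type", "object"])

# Illustrative rows only; the first one is the example from the text.
relationships = [
    Relationship("exon", "part_of", "transcript"),
    Relationship("mRNA", "is_a", "transcript"),
]


def as_sentence(rel):
    """Render a relationship as a subject-verb-object sentence."""
    return f"{rel.subject} {rel.type} {rel.object}"


def parents(term, rels):
    """Return (type, object) pairs for every arc leaving `term` (child to parent)."""
    return [(r.type, r.object) for r in rels if r.subject == term]


print(as_sentence(relationships[0]))
print(parents("exon", relationships))
```

Reading each row as an arc from child (subject) to parent (object) is what makes the graph interpretation and the sentence interpretation interchangeable.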
==Associating features to cvterms==

This module is used by most of the Chado modules, but it is useful to describe here how it would be used in the context of the [[Chado_Sequence_Module|sequence module]].

Often we want to attach cvterms to features. One example is typing features with SO; this is central to the [[Chado_Sequence_Module|sequence module]]. Each feature has one primary type, stored in [[Chado_Tables#Table:_feature|feature.type_id]].

We can also attach an arbitrary number of non-primary cvterms to a feature, using the [[Chado_Tables#Table:_feature_cvterm|feature_cvterm]] linking table.

For example, we may want to attach GO annotations to gene or protein features. We may also want to attach phenotypic terms to gene features (although the preferred way to do this is ''via'' a genotype using the [[Chado_Genetic_Module|genetics module]]).

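As a rough sketch of the difference between a feature's primary type and its additional cvterms, the following Python snippet builds heavily reduced versions of the tables in an in-memory SQLite database. The column subsets, the sample feature "foo" and the sample terms are assumptions made for illustration; the real Chado tables carry more columns and constraints.

```python
import sqlite3

# Heavily reduced, hypothetical versions of the Chado tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    type_id    INTEGER NOT NULL REFERENCES cvterm (cvterm_id)
);
CREATE TABLE feature_cvterm (
    feature_id INTEGER NOT NULL REFERENCES feature (feature_id),
    cvterm_id  INTEGER NOT NULL REFERENCES cvterm (cvterm_id)
);
""")

# Sample data: one SO term used as a type, one GO term used as an annotation.
conn.executemany(
    "INSERT INTO cvterm VALUES (?, ?)",
    [(1, "gene"), (2, "transcription factor activity")],
)
conn.execute("INSERT INTO feature VALUES (1, 'foo', 1)")   # primary type: gene
conn.execute("INSERT INTO feature_cvterm VALUES (1, 2)")   # extra annotation

# The single primary type comes from feature.type_id.
primary_type = conn.execute(
    "SELECT t.name FROM feature f "
    "JOIN cvterm t ON t.cvterm_id = f.type_id "
    "WHERE f.name = 'foo'"
).fetchone()[0]

# Any number of further cvterms come from the feature_cvterm linking table.
annotations = [
    row[0]
    for row in conn.execute(
        "SELECT t.name FROM feature f "
        "JOIN feature_cvterm fc ON fc.feature_id = f.feature_id "
        "JOIN cvterm t ON t.cvterm_id = fc.cvterm_id "
        "WHERE f.name = 'foo' ORDER BY t.name"
    )
]

print(primary_type, annotations)
```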
==Complex annotations==

Note that this is something that is not handled by the current GO DB, but it is something we may want to allow for in future.

Currently in GO, all annotations are disjunctive; for example, if we have

 gene | GO ID
 -----+------
 foo  | GO:001
 foo  | GO:002
 foo  | GO:003

{{NeedsEditing}}

'''The text above was taken from {{SF SVN|schema/trunk/chado/modules/cv/ <tt>modules/cv/cv-intro.txt</tt>}}, which was incomplete''' (and no longer exists).

The [[Chado_Sequence_Module|sequence module]] makes extensive use of terms taken from various ontologies such as [http://song.sourceforge.net/ SO] and the [http://obofoundry.org OBO] Relations Ontology, using the ''type_id'' foreign key column. In addition, features can be annotated using ontologies such as GO using the [[Chado_Tables#Table:_feature_cvterm|feature_cvterm]] linking table. These terms are modelled using the cv module, the core of which is the [[#Table:_cvterm|cvterm]] table.

An ontology, terminology or cv (controlled vocabulary) is a collection of terms (here equivalent to what are more typically called classes, types, categories or kinds in the ontology literature) in a particular domain of interest. Examples include "gene" (from SO), "transcription factor activity" (from GO molecular function) and "lymphocyte" (from OBO-Cell). The Chado cv module is based on the [http://www.godatabase.org/dev/sql/doc/godb-sql-doc.html GO Database schema]. Terms are stored in the [[#Table:_cvterm|cvterm]] table, and relationships between terms are stored in the [[#Table:_cvterm_relationship|cvterm_relationship]] table. This table follows an analogous structure to the [[Chado_Tables#Table:_feature_relationship|feature_relationship]] table, in that it has columns ''subject_id'', ''object_id'' and ''type_id''. Here, all three of these foreign keys refer to rows in the [[#Table:_cvterm|cvterm]] table.

A brief treatment of relationship types in biological ontologies can be [http://www.geneontology.org/GO.ontology.relations.shtml found here]. Of particular interest to Chado is the ''is_a'' relation, which specifies a sub-typing relationship between two terms or classes. Recall that tables in the sequence module (such as the [[Chado_Tables#Table:_feature|feature table]]) frequently define a ''type_id'' foreign key column to indicate the specific type or class of entity for each row in that table. The combination of the ''type_id'' column and the ''is_a'' relationship gives Chado a data sub-classing system, beyond what is possible with traditional SQL database semantics.

This is discussed further below. The collection of cvterms and cvterm_relationships can be considered to constitute vertices and edges in a graph. This graph is typically acyclic (a {{GlossaryLink|DAG|DAG}}), though this is not guaranteed, as certain relationship types are allowed to form cycles.

{| class="wikitable"
|+ Sequence Ontology Examples
!SO Term
!SO id
|-
| exon
| {{SOIDLink|SO:0000147|SO:0000147}}
|-
| intron
| {{SOIDLink|SO:0000188|SO:0000188}}
|-
| mRNA
| ...
|-
| miRNA
| ...
|-
| regulatory_element
| ...
|-
| transcription_factor_binding_site
| ...
|}

==Transitive Closure==

This section concerns relations between ontology terms and how defined terms and relations can be used to reason, either by humans or computers. A specialized ontology concerning these relations has been developed, the [http://obofoundry.org/ro/ OBO Relation Ontology].

Often it is useful to know the [[wp:Transitive_closure|transitive closure]] over a relationship type, or a collection of relationship types. The closure is the result of recursively applying the relationship. For example, if A ''is_a'' B and B ''is_a'' C, then the closure of ''is_a'' includes A ''is_a'' C.

In particular, we want the reflexive transitive closure. A term is always related to itself in a reflexive closure, meaning:

<code>
X is_a X
</code>

This may seem odd, but it is useful both for doing queries and for deriving future rules. It makes it easier to ask "find me all genes of class X", and to get back genes attached to X and subtypes of X.

The closure goes in the [[#Table:_cvtermpath|cvtermpath]] table; the closure can also be thought of as a path through the graph or semantic network.

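A minimal Python sketch of the reflexive transitive closure over ''is_a'' follows. The function name and the A/B/C terms are illustrative assumptions; this is a naive fixed-point loop intended to show the definition, not how cvtermpath is populated in practice.

```python
def reflexive_transitive_closure(pairs):
    """Return all (subject, object) pairs reachable in zero or more steps."""
    terms = {t for pair in pairs for t in pair}
    closure = {(t, t) for t in terms}   # reflexive part: X is_a X
    closure |= set(pairs)               # direct links
    changed = True
    while changed:                      # repeat until no new pair appears
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


is_a = [("A", "B"), ("B", "C")]
closure = reflexive_transitive_closure(is_a)
print(sorted(closure))
```

With the closure in hand, "everything that is a C" becomes a single lookup of all subjects paired with C, which includes C itself because of the reflexive pairs.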
===Transitivity of other Relations===

Many other relations, such as ''part_of'', are also transitive.

If R is a transitive relation, then we can say

<code>
X R Z <= X R Y, Y R Z
</code>

For example, assume we have the following two ''develops_from'' links, and ''develops_from'' is a transitive relation:
<code>
  glial cell ''develops_from'' glioblast
  glioblast ''develops_from'' neurectodermal cell
</code>
Then it follows that glial cells develop from neurectodermal cells.

===Transitivity over ''is_a''===

It can be proved from the definition of ''is_a'' (proof not shown here) that:
<code>
  X R Z <= X is_a Y, Y R Z
</code>
and
<code>
  X R Z <= X R Y, Y is_a Z
</code>
This can be thought of as "inheritance".

For example, if an astrocyte ''is_a'' glial cell and a glial cell ''develops_from'' a glioblast, then it follows that an astrocyte ''develops_from'' a glioblast.

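The two ''is_a'' rules can be sketched as a single pass over the direct links of one relation R. The function name and the set-of-pairs representation are assumptions made for this example; a full closure would apply the rules repeatedly.

```python
def inherit_over_is_a(is_a, r_links):
    """Apply both is_a rules once to the direct links of one relation R.

    is_a    -- set of direct (child, parent) is_a pairs
    r_links -- set of direct (subject, object) pairs for relation R
    """
    derived = set(r_links)
    for (x, y) in is_a:
        for (s, o) in r_links:
            if s == y:              # X is_a Y, Y R Z  =>  X R Z
                derived.add((x, o))
            if o == x:              # S R X, X is_a Y  =>  S R Y
                derived.add((s, y))
    return derived


is_a = {("astrocyte", "glial cell")}
develops_from = {("glial cell", "glioblast")}
derived = inherit_over_is_a(is_a, develops_from)
print(sorted(derived))
```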
===Difference between Deductive Closure and Transitive Closure===

With a transitive closure we simply follow all links in the {{GlossaryLink|DAG|DAG}}, ignoring the relationship type. This works fine for ontologies such as GO that have only ''is_a'' and ''part_of'', but is not ideal for other ontologies such as anatomical ontologies.

First, the closure may grow explosively in size.

Second, a closure that ignores the relations may be scientifically meaningless. It is also less useful for queries. For example, we may want to query for genes expressed in the larva or part of the larva, but not genes expressed in anatomical entities that develop from the larva.

===Rules===

The [[#Table:_cvtermpath|cvtermpath]] table is for calculating the reflexive transitive closure of a relationship, and any derived relationships.

Normal (direct) relationships are stored in the [[#Table:_cvterm_relationship|cvterm_relationship]] table. An entry in this table represents a ''cvterm_relationship'' S over some relation R.

For example:

<code>
S = "cardioblast" develops_from "mesodermal cell"
</code>

In addition to these ''asserted'' links, we want to be able to ''deduce'' links between terms.

If X ''is_a'' Y, then it follows that all of Y's [[#Table:_cvterm_relationship|cvterm_relationship]] statements are inherited by X.

'''Rule 1'''
<code>
If   X is_a Y
and  Y R Z
then X R(inh) Z
</code>

For example:

<code>
   "cilium axoneme"  is_a "axoneme"
   "axoneme" part_of "cell projection"
</code>

Therefore:

<code>
   "cilium axoneme"  part_of(inh) "cell projection"
</code>

Here we use ''R(inh)'' to represent an inherited relationship; the rules below write the same thing as ''T(inh)'' for a relation T.

===Populating cvtermpath===

The [[#Table:_cvtermpath|cvtermpath]] table stores the reflexive transitive closure of a relationship, taking into account subsumption or inheritance. The number of intermediate relationships is represented in the ''pathdistance'' column of the table (the "distance" used in the rules below).

Here we use ''T(path)'' to represent the "path" or closure of a relationship. Every ''T(path)'' is stored in [[#Table:_cvtermpath|cvtermpath]]. We use the same ''cvterm'' for T; the fact that it is a path is implicit.

We use these rules:

Reflexive relationships:

<code>
for all relations T,
  X T(path) X
</code>

In this case the distance = 0.

Direct relationships:

These are also included in the [[#Table:_cvtermpath|cvtermpath]] table, with distance = 1.
<code>
  If   X T Y
  Then X T(path) Y
</code>

Transitive relationships:

These have distance > 1; they also make use of the inheritance rule, '''Rule 1''', which gives us ''T(inh)''.

<code>
If   X T(inh)  Y
and  Y T(path) Z
Then X T(path) Z
</code>

Note that this rule is recursive.
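
The three rules can be sketched in Python for a single relation T. Several choices here are assumptions rather than anything the text specifies: ''T(inh)'' is taken to include the direct T links as well as links inherited over any number of ''is_a'' steps, the distance counts every asserted link (''is_a'' or T) on the shortest derivation, and the name <code>populate_path</code> is hypothetical.

```python
def populate_path(is_a, t_links):
    """Return {(subject, object): distance} for T(path).

    is_a    -- set of direct (child, parent) is_a pairs
    t_links -- set of direct (subject, object) pairs for relation T
    """
    terms = {t for pair in is_a | t_links for t in pair}

    # is_a ancestors of every term, with the number of is_a steps taken.
    up = {x: {x: 0} for x in terms}
    changed = True
    while changed:
        changed = False
        for (child, parent) in is_a:
            for x in terms:
                if child in up[x]:
                    d = up[x][child] + 1
                    if d < up[x].get(parent, d + 1):
                        up[x][parent] = d
                        changed = True

    # T(inh): direct T links plus Rule 1 (X is_a ... Y, Y T Z => X T(inh) Z).
    inh = {}
    for x in terms:
        for y, d in up[x].items():
            for (s, o) in t_links:
                if s == y:
                    inh[(x, o)] = min(inh.get((x, o), d + 1), d + 1)

    # Reflexive rule, then the recursive rule until nothing changes.
    path = {(x, x): 0 for x in terms}
    changed = True
    while changed:
        changed = False
        for (x, y), step in inh.items():
            for (s, z), rest in list(path.items()):
                if s == y:
                    d = step + rest
                    if d < path.get((x, z), d + 1):
                        path[(x, z)] = d
                        changed = True
    return path


path = populate_path(
    {("cilium axoneme", "axoneme")},
    {("axoneme", "cell projection")},
)
for (subject, obj), distance in sorted(path.items()):
    print(subject, "part_of(path)", obj, distance)
```

Under these assumptions a pure ''is_a'' link such as "cilium axoneme" to "axoneme" does not appear in the ''part_of'' path set; it belongs to the separate ''is_a'' closure.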
  
These rules should be used for populating [[#Table:_cvtermpath|cvtermpath]]. Attempting to calculate a more general closure where all relations are treated equally or ignored will produce combinatorial explosions over certain ontologies (e.g. the [http://www.geneontology.org/gobo/anatomy.ontology/anatomy.fb FlyBase anatomy ontology]).

What does this mean in practice?

For a typical database, which may only have the relations ''is_a'', ''part_of'' and ''develops_from'', we will end up with 3 sets of paths.

The ''is_a'' closure, ''is_a(path)'', will include paths over cvterm_relationships that look like this:

<code>
a is_a b is_a c is_a d is_a e
</code>

The ''part_of'' closure, ''part_of(path)'', will include paths over cvterm_relationships that look like this:

<code>
a is_a b part_of c part_of d is_a e part_of f
</code>

The ''develops_from'' closure, ''develops_from(path)'', will include paths over cvterm_relationships that look like this:

<code>
a develops_from b develops_from c is_a d is_a e develops_from f
</code>

It may be tempting to mix different non-''is_a'' relationships in the same path, but this should '''never''' be done: there will be an unacceptable combinatorial explosion in many cases. Besides, there is no use for such a cvtermpath; it is meaningless.

Note that for [http://amigo.geneontology.org/ AmiGO]-like query behaviour, it is necessary only to query [[#Table:_cvtermpath|cvtermpath]], ignoring cvtermpath.type_id (these are obtained by querying [[#Table:_cvterm_relationship|cvterm_relationship]]).
  
  
==Advanced Usage==

This section describes advanced usage of the cv module for use with OWL-DL, advanced OBO format 1.2 [REF] features, or elements from other ontology formalisms.

If you aren't sure what this means, you probably don't need to read this section yet.

Note that this section is liable to change; in particular the scheme below may be replaced with a simpler one. For details of the simpler scheme, along the lines of the transform used in the GO Database, see:

* http://geneontology.cvs.sourceforge.net/geneontology/go-dev/xml/xsl/oboxml_to_godb_prestore.xsl?view=markup

(search for intersection_of)

===Background===

See the document on [http://www.fruitfly.org/~cjm/obol/doc/mapping-obo-to-owl.html Converting OBO to OWL].

===Logical definitions===

In a normal ontology {{GlossaryLink|DAG|DAG}} representation in [[Chado]], the [[#Table:_cvterm_relationship|cvterm_relationship]] rows represent relationships between terms, or more formally, '''necessary conditions'''. A logical definition must have both '''necessary and sufficient conditions'''. A logical definition often consists of a '''generic term''' (also known as "genus") and one or more '''discriminating characteristics''' (also known as "differentiae"). The discriminating characteristics are typically relationships.

For example, the logical definition of "larval locomotory behavior" would be a "locomotory behavior" (genus) which occurs "during" the "larval stage" (where "during" could be drawn from an ontology of relations, and larval stage may come from an insect developmental stage ontology). These constitute both necessary and sufficient conditions: the conditions are necessary in that all instances of larval locomotory behavior are necessarily locomotory behaviors and are necessarily manifested at the larval stage. We could represent this using a normal DAG. However, because this is a definition it also constitutes sufficient conditions, in that any instance of locomotory behavior which manifests at the larval stage is by definition a larval locomotory behavior.

In an ontology formalism like OWL-DL or OBO-1.2, genus-differentiae are represented using set-intersections.

Here is the OBO 1.2 representation:

<code>
  [Term]
  id: GO:0008345
  name: larval locomotory behavior
  namespace: biological_process
  is_a: GO:0007626 ! locomotory behavior
  is_a: GO:0030537 ! larval behavior
  intersection_of: GO:0007626 ! GENUS: locomotory behavior
  intersection_of: during FBdv:00005336 ! DIFFERENTIUM: during larval stage
</code>

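To make the genus/differentia split concrete, here is a small Python sketch that separates the two kinds of ''intersection_of'' tag in a single stanza. It is not go-perl and not a general OBO parser; the function name and the simple whitespace-based parsing are assumptions that only cover stanzas shaped like the one above.

```python
def parse_intersections(stanza):
    """Split the intersection_of tags of one OBO [Term] stanza.

    Returns (genus_ids, differentia), where differentia is a list of
    (relation, term_id) pairs.
    """
    genus, differentia = [], []
    for line in stanza.splitlines():
        line = line.split("!", 1)[0].strip()     # drop the trailing "! comment"
        if not line.startswith("intersection_of:"):
            continue
        parts = line[len("intersection_of:"):].split()
        if len(parts) == 1:                      # bare id: the genus
            genus.append(parts[0])
        elif len(parts) == 2:                    # relation + id: a differentium
            differentia.append((parts[0], parts[1]))
    return genus, differentia


stanza = """\
[Term]
id: GO:0008345
name: larval locomotory behavior
namespace: biological_process
is_a: GO:0007626 ! locomotory behavior
is_a: GO:0030537 ! larval behavior
intersection_of: GO:0007626 ! GENUS: locomotory behavior
intersection_of: during FBdv:00005336 ! DIFFERENTIUM: during larval stage
"""

genus, differentia = parse_intersections(stanza)
print(genus, differentia)
```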
Here is the equivalent in OWL (note: RDF-XML syntax is very verbose!):

<syntaxhighlight lang="xml">
  <owl:Class rdf:ID="GO_0008345">
    <rdfs:label xml:lang="en">larval locomotory behavior</rdfs:label>
    <rdfs:subClassOf rdf:resource="#GO_0007626"/>
    <rdfs:subClassOf rdf:resource="#GO_0030537"/>
    <owl:equivalentClass>
      <owl:Class>
        <owl:intersectionOf rdf:parseType="Collection">
          <owl:Class rdf:about="#GO_0007626"/>
          <owl:Restriction>
            <owl:onProperty>
              <owl:ObjectProperty rdf:about="#during"/>
            </owl:onProperty>
            <owl:someValuesFrom rdf:resource="#FBdv_00005336"/>
          </owl:Restriction>
        </owl:intersectionOf>
      </owl:Class>
    </owl:equivalentClass>
  </owl:Class>
</syntaxhighlight>

When converting to Chado we employ a more economical representation, in terms of the number of triples we use:

<syntaxhighlight lang="xml">
  <!-- normal DAG relationships (necessary conditions) -->
  <cvterm_relationship>
    <type_id>is_a</type_id>
    <subject_id>GO:0008345</subject_id>
    <object_id>GO:0007626</object_id>
  </cvterm_relationship>
  <cvterm_relationship>
    <type_id>is_a</type_id>
    <subject_id>GO:0008345</subject_id>
    <object_id>GO:0030537</object_id>
  </cvterm_relationship>

  <!-- Genus/generic term -->
  <cvterm_relationship>
    <type_id>intersection_of</type_id>
    <subject_id>GO:0008345</subject_id>
    <object_id>GO:0007626</object_id> <!-- locomotory behavior -->
  </cvterm_relationship>

  <!-- Discriminating characteristics -->
  <cvterm_relationship>
    <type_id>intersection_of</type_id>
    <subject_id>GO:0008345</subject_id>
    <object_id>
      <!-- anonymous term representing during(larval stage) -->
      <cvterm>
        <dbxref_id>
          <dbxref>
            <db_id>internal</db_id>
            <accession>restriction--OBOL:during--GO:0008345</accession>
          </dbxref>
        </dbxref_id>

        <!-- note: as this is an anon term, the name will never
            be shown to a user -->
        <name>restriction--OBOL:during--GO:0008345</name>
        <cv_id>anonymous_cv</cv_id>
        <cvtermprop>
          <type_id>is_anonymous</type_id>
          <value>1</value>
          <rank>0</rank>
        </cvtermprop>
        <cvterm_relationship>
          <type_id>OBOL:during</type_id>
          <object_id>FBdv:00005336</object_id>
        </cvterm_relationship>
      </cvterm>
    </object_id>
  </cvterm_relationship>
</syntaxhighlight>

Note that in the above, we are creating '''anonymous''' terms. We give them fake names and fake dbxrefs. In the {{SF_SVN|schema/branches/bbop-experimental/|bbop-experimental SVN branch}} of Chado, names and dbxrefs are nullable, so these can be omitted. With the current schema, you must provide fake dbxrefs and names that are unique, such as the above (if you are not familiar with how [[Chado_XML|Chado XML]] maps to the Chado schema, see the explanation below).

If you wish to convert OBO-specified logical definitions to [[Chado_XML|Chado XML]] you will need [http://www.godatabase.org/dev/pod/go-perl.html go-perl], v0.05 or higher (if you have a lower version, the ''intersection_of'' tags will simply be ignored).

<code>
go2chadoxml ont.obo > ont.chado
</code>

===How Logical Definitions are Stored in Chado===
  
Column  Datatype Description
+
This involves no schema changes to the cv module. Each ''intersection_of''
cvtermpath id integer
+
goes in as a {{GlossaryLink|DAG|DAG}} arc of type ''internal:intersection_of''. The ''object_id''
type id integer  The relationship type that this is a closure over. If
+
in the arc is either a term (for the genus) or an anonymous term
  null, then this is a closure over ALL relationship
+
representing a restriction (the differentium). The restriction has a
  types. If non-null, then this references a relation-
+
relationship of some type to another term.
  ship cvterm - note that the closure will apply to both
+
  this relationship AND the OBO REL:is a (subclass)
+
  relationship
+
subject id integer
+
object id  integer
+
cv idinteger  Closures will mostly be within one cv. If the closure
+
  of a relationship traverses a cv, then this refers to
+
  the cv of the object id cvterm
+
pathdistance  integer  The number of steps required to get from the sub-
+
  ject cvterm to the object cvterm, counting from zero
+
  (reflexive relationship)
+
  
 +
For example, for "larval locomotory behavior" we would normally just have:
  
cvtermsynonym
+
<code>
 +
LLB is_a LocomotoryBehavior
 +
LLB is_a LarvalBehavior
 +
</code>
  
 +
If we load a logical definition for this term (see /t/data/llm/obo in the [http://www.godatabase.org/dev/pod/go-perl.html go-perl] package), like this:
  
A cvterm actually represents a distinct class or concept. A concept can be referred to by different
+
<code>
phrases or names. In addition to the primary name (cvterm.name) there can be a number of
+
[Term]
alternative aliases or synonyms. For example, -T cell- as a synonym for -T lymphocyte-
+
id: GO:0008345
 +
name: larval locomotory behavior
 +
namespace: biological_process
 +
is_a: GO:0007626 ! locomotory behavior
 +
is_a: GO:0030537 ! larval behavior
 +
intersection_of: GO:0007626  ! locomotory behavior
 +
intersection_of: during FBdv:00005336 ! larval stage
 +
</code>
  
 +
Then the ''intersection_of''s get stored using the basic DAG tables as:
  
Table 3.5: cvtermsynonym
 
  
Column  DatatypeDescription
+
{| class="wikitable"
cvtermsynonym id integer
+
|+ Definition stored in [[#Table:_cvterm_relationship|cvterm_relationship]] table
cvterm id  integer
+
!Subject
synonym varchar
+
!Relation
type id integer A synonym can be exact, narrow or broader than
+
!Object
 +
|-
 +
|LLB
 +
|''intersection_of''
 +
|Locomotory Behaviour
 +
|-
 +
|LLB
 +
|''intersection_of''
 +
|anon:xxx
 +
|-
 +
|anon:xxx
 +
|during
 +
|FBdv:00005336
 +
|-
 +
|}
  
  
cvterm dbxref
+
This uses 4 cvterm_relationships and the creation of a new '''anonymous''' term that is never shown directly to the user. The anonymous term represents the class of things that happen during the
 +
larval stage.
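
As a rough illustration of the table above, the following Python sketch holds the stored relationships as (subject, relation, object) tuples, using the identifiers from the OBO stanza, and recovers the genus and the differentia of a term. The <code>anon:xxx</code> identifier, the <code>ANONYMOUS_TERMS</code> set and the two helper functions are hypothetical names chosen for this example; the actual ''cvterm_genus'' and ''cvterm_differentium'' views are defined in SQL and may be structured differently.

```python
# Relationships as (subject, relation, object), mirroring the table above.
# "anon:xxx" stands in for the anonymous restriction term.
RELATIONSHIPS = [
    ("GO:0008345", "intersection_of", "GO:0007626"),
    ("GO:0008345", "intersection_of", "anon:xxx"),
    ("anon:xxx", "during", "FBdv:00005336"),
]
ANONYMOUS_TERMS = {"anon:xxx"}  # terms carrying the is_anonymous property


def genus(term):
    """Return the named (non-anonymous) terms that `term` is an intersection of."""
    return [
        obj
        for subj, rel, obj in RELATIONSHIPS
        if subj == term and rel == "intersection_of" and obj not in ANONYMOUS_TERMS
    ]


def differentia(term):
    """Return (relation, target) pairs reached through anonymous restrictions."""
    restrictions = {
        obj
        for subj, rel, obj in RELATIONSHIPS
        if subj == term and rel == "intersection_of" and obj in ANONYMOUS_TERMS
    }
    return [(rel, obj) for subj, rel, obj in RELATIONSHIPS if subj in restrictions]


print(genus("GO:0008345"))        # ['GO:0007626']
print(differentia("GO:0008345"))  # [('during', 'FBdv:00005336')]
```

The anonymous term never appears in either result: it only serves as the intermediate node that links the defined term to its restriction.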
  
 +
===Logical Definition Views===
  
In addition to the primary identifier (cvterm.dbxref id) a cvterm can have zero or more secondary
+
Two views, ''cvterm_genus'' and ''cvterm_differentium'', are in chado/modules/cv/views.
identifiers/dbxrefs, which may refer to records in external databases. The exact semantics of
+
cvterm dbxref are not fixed. For example: the dbxref could be a pubmed ID that is pertinent to the
+
cvterm, or it could be an equivalent or similar term in another ontology. For example, GO cvterms
+
are typically linked to InterPro IDs, even though the nature of the relationship between them is
+
largely one of statistical association. The dbxref may have data records attached in the same
+
database instance, or it could be a ”hanging” dbxref pointing to some external database. NOTE:
+
If the desired objective is to link two cvterms together, and the nature of the relation is known
+
and holds for all instances of the subject cvterm then consider instead using cvterm relationship
+
together with a well-defined relation.
+
  
 +
===Example Use Case: [[:Category:Phenotypes|Phenotypes]]===
  
  Table 3.6: cvterm dbxref
+
The idea here is that queries for composed term "syndactyly" should
 +
automatically return the same results as a boolean query for
 +
"fusion" + ''inheres_in'' = "finger" regardless of whether the annotation is
 +
to the composed term or is a composed annotation (provided we put the
 +
logical definition of syndactyly in the database).
  
Column  Datatype  Description
+
===Example Use Case: Feature Types===
cvterm dbxref id integer
+
cvterm id  integer
+
dbxref id  integer
+
is for definition integerA cvterm.definition should be supported by one or
+
more references. If this column is true, the dbxref is
+
not for a term in an external db - it is a dbxref for
+
provenance information for the definition
+
  
 +
The Sequence Ontology has some logical definitions - you will need to load the file [http://song.cvs.sourceforge.net/song/ontology/so-xp.obo?view=log so-xp.obo].
  
cvtermprop
+
====Example use case: GO====
  
 +
See [http://wiki.geneontology.org/index.php/Obol Obol].
  
Additional extensible properties can be attached to a cvterm using this table. Corresponds to
+
====Example use case: Drawing DAGs====
-AnnotationProperty- in W3C OWL format
+
  
 +
Currently the DAGs of many OBO ontologies are highly tangled; see http://www.fruitfly.org/~cjm/obol/doc/go-complexity.html
  
Table 3.7: cvtermprop
+
If all terms have logical definitions, then there is only one '''true''' (genus) or ''isa'' parent. This enables us to disentangle the DAGs and draw distinct hierarchies. For example, the GO term '''cysteine biosynthesis''' could be drawn as two distinct hierarchies - one process and one chemical.
  
  Column  Datatype Description
+
====Loading OWL into Chado====
  cvtermprop id integer
+
  cvterm id  integer
+
  type id integer  The name of the property/slot is a cvterm. The
+
meaning of the property is defined in that cvterm
+
  valuetext  The value of the property, represented as text. Nu-
+
meric values are converted to their text representa-
+
tion
+
  rank integer  Property-Value ordering. Any cvterm can have mul-
+
tiple values for any particular property type - these
+
are ordered in a list using rank, counting from zero.
+
For properties that are single-valued rather than
+
multi-valued, the default 0 value should be used
+
  
 +
Not all OWL-DL features are supported. Only ''intersection_of''s corresponding to genus-differentiae are loaded.
  
dbxrefprop
+
First you must convert OWL into OBO 1.2 format. There will soon be a way to do this in [http://www.oboedit.org/ OboEdit]. For now you can use [http://www.blipkit.org blipkit].
  
 +
<code>
 +
blip io-convert my.owl -to obo -o my.obo
 +
</code>
  
Metadata about a dbxref. Note that this is not defined in the dbxref module, as it depends on the
+
Once you have an OBO file you can run [http://search.cpan.org/src/CMUNGALL/go-perl-0.06/scripts/go2chadoxml go2chadoxml], as above.
cvterm table. This table has a structure analogous to cvtermprop
+
  
 +
====Post-coordinating Terms====
  
Table 3.8: dbxrefprop
+
Sometimes we want to be able to refer to a term such as "plasma membrane of spermatocyte", but no such term exists in the
 +
ontology. Introducing these as '''pre-coordinated''' cross-product
 +
terms would make the ontology unwieldy due to the large number of possible combinations.
  
  Column  Datatype Description
+
Chado allows the '''post-coordination''' or '''post-composition''' of
  dbxrefprop id integer
+
terms using the same formalism as described above. Briefly: we would
  dbxref id  integer
+
create an '''anonymous''' term. This anonymous term would be defined using
  type id integer
+
the terms "plasma membrane" and "spermatocyte", using a genus-differentia definition as above.
  valuetext
+
  rank integer
+
  
 +
<syntaxhighlight lang="xml">
 +
  <!-- Genus/generic term -->
 +
  <cvterm_relationship>
 +
    <type_id>intersection_of</type_id>
 +
    <subject_id>anon_1</subject_id>
 +
    <object_id>GO__plasma_membrane</object_id>
 +
  </cvterm_relationship>
  
organism
+
  <!-- Discriminating characteristics -->
 +
  <cvterm_relationship>
 +
    <type_id>intersection_of</type_id>
 +
    <subject_id>anon_1</subject_id>
 +
    <object_id>
  
 +
      <!-- anonymous term representing  part_of(spermatocyte) -->
 +
      <cvterm>
 +
        <dbxref_id>
 +
          <dbxref>
 +
            <db_id>internal</db_id>
 +
            <accession>restriction--part_of--spermatocyte</accession>
 +
          </dbxref>
 +
        </dbxref_id>
  
The organismal taxonomic classification. Note that phylogenies are represented using the phylogeny
+
        <!-- note: as this is an anon term, the name will never
module, and taxonomies can be represented using the cvterm module or the phylogeny module
+
            be shown to a user -->
 +
        <name>restriction--part_of--spermatocyte</name>
 +
        <cv_id>anonymous_cv</cv_id>
 +
        <cvtermprop>
 +
          <type_id>is_anonymous</type_id>
 +
          <value>1</value>
 +
          <rank>0</rank>
 +
        </cvtermprop>
 +
        <cvterm_relationship>
 +
          <type_id>OBO_REL:part_of</type_id>
 +
          <object_id>CL__spermatocyte</object_id>
 +
        </cvterm_relationship>
 +
      </cvterm>
  
 +
    </object_id>
 +
  </cvterm_relationship>
 +
</syntaxhighlight>
  
Table 3.9: organism
+
The above assumes [[XORT]] macro IDs defined for ''GO__plasma_membrane'' and ''CL__spermatocyte''.
  
Column Datatype Description
+
Allowing post-coordinated terms places a greater burden on applications that use the cv module. More documentation will be provided here on this.
organism id  integer
+
abbreviation varchar
+
genus  varchar
+
speciesvarchar  A type of organism is always uniquely identified
+
by genus+species. When mapping from the NCBI
+
taxonomy names.dmp file, the unique-name column
+
must be used where it is present, as the name column
+
is not always unique (eg environmental samples). If
+
a particular strain or subspecies is to be represented,
+
this is appended onto the species name. Follows stan-
+
dard NCBI taxonomy pattern
+
common name  varchar
+
commenttext
+
  
 +
{{NeedsEditing}}
  
organism dbxref
+
=Tables=
  
 +
== Table: cv ==
  
 +
A controlled vocabulary or ontology. A cv is composed of cvterms (AKA terms, classes, types, universals - relations and properties are also stored in cvterm) and the relationships between them.
  
  Table 3.10: organism dbxref
+
{| border="1" cellpadding="3"
 +
|+ cv Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| cv_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
| name
 +
| character varying(255)
 +
| '' UNIQUE NOT NULL ''<br /><br />The name of the ontology. This corresponds to the obo-format -namespace-. cv names uniquely identify the cv. In OBO file format, the cv.name is known as the namespace.
 +
|- class="tr0"
 +
|
 +
| definition
 +
| text
 +
| '' ''<br /><br />A text description of the criteria for membership of this ontology.
 +
|}
  
Column Datatype Description
+
Tables referencing this one via Foreign Key Constraints:
organism dbxref id integer
+
organism id  integer
+
dbxref id integer
+
  
 +
* [[Chado_Tables#Table:_cvterm| cvterm]]
  
organismprop
+
* [[Chado_Tables#Table:_cvtermpath| cvtermpath]]
  
 +
----
  
tag-value properties - follows standard chado model
 
  
  
Table 3.11: organismprop
+
== Table: cvterm ==
  
Column Datatype  Description
+
A term, class, universal or type within an ontology or controlled vocabulary. This table is also used for relations and properties. cvterms constitute nodes in the graph defined by the collection of cvterms and cvterm_relationships.
organismprop id integer
+
organism id  integer
+
type idinteger
+
value  text
+
rankinteger
+
  
 +
{| border="1" cellpadding="3"
 +
|+ cvterm Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| cvterm_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_cv| cv]]
 +
| cv_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The cv or ontology or namespace to which this cvterm belongs.
 +
|- class="tr0"
 +
|
 +
| name
 +
| character varying(1024)
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />A concise human-readable name or label for the cvterm. Uniquely identifies a cvterm within a cv.
 +
|- class="tr1"
 +
|
 +
| definition
 +
| text
 +
| '' ''<br /><br />A human-readable text definition.
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_dbxref| dbxref]]
 +
| dbxref_id
 +
| integer
 +
| '' UNIQUE NOT NULL ''<br /><br />Primary identifier dbxref - The unique global OBO identifier for this cvterm. Note that a cvterm may have multiple secondary dbxrefs - see also table: cvterm_dbxref.
 +
|- class="tr1"
 +
|
 +
| is_obsolete
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />Boolean 0=false,1=true; see GO documentation for details of obsoletion. Note that two terms with different primary dbxrefs may exist if one is obsolete.
 +
|- class="tr0"
 +
|
 +
| is_relationshiptype
 +
| integer
 +
| '' NOT NULL ''<br /><br />Boolean 0=false,1=true relations or relationship types (also known as Typedefs in OBO format, or as properties or slots) form a cv/ontology in themselves. We use this flag to indicate whether this cvterm is an actual term/class/universal or a relation. Relations may be drawn from the OBO Relations ontology, but are not exclusively drawn from there.
 +
|}
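
The ''UNIQUE#1'' markers above indicate that cv_id, name and is_obsolete together form a single uniqueness constraint. The sketch below reproduces only that constraint in an in-memory SQLite table. This is a simplification assumed for illustration: the real cvterm table has further columns and foreign keys, and the <code>add_term</code> helper and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for cvterm: only the columns involved in the
# (cv_id, name, is_obsolete) constraint plus the unique dbxref_id.
conn.execute(
    """
    CREATE TABLE cvterm (
        cvterm_id   INTEGER PRIMARY KEY,
        cv_id       INTEGER NOT NULL,
        name        TEXT    NOT NULL,
        dbxref_id   INTEGER NOT NULL UNIQUE,
        is_obsolete INTEGER NOT NULL,
        UNIQUE (cv_id, name, is_obsolete)
    )
    """
)


def add_term(cv_id, name, dbxref_id, is_obsolete=0):
    """Insert a term; return True on success, False if a constraint is violated."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO cvterm (cv_id, name, dbxref_id, is_obsolete) "
                "VALUES (?, ?, ?, ?)",
                (cv_id, name, dbxref_id, is_obsolete),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    add_term(1, "larval behavior", 100),                 # first term in cv 1
    add_term(1, "larval behavior", 101),                 # same cv and name, both current
    add_term(1, "larval behavior", 102, is_obsolete=1),  # same name, but obsolete
    add_term(2, "larval behavior", 103),                 # same name in a different cv
]
print(results)  # [True, False, True, True]
```

Only the second insert is rejected: a name may recur within a cv when one of the two terms is obsolete, and it may recur freely across different cvs.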
  
This section describes advanced usage of the cv module for use with
+
Tables referencing this one via Foreign Key Constraints:
OWL-DL advanced Obo format 1.2 [REF] features
+
or elements from other ontology formalisms.
+
  
If you aren't sure what this means, you probably don't need to read
+
* [[Chado_Tables#Table:_acquisition_relationship| acquisition_relationship]]
this section yet.
+
  
=====Background=====
+
* [[Chado_Tables#Table:_acquisitionprop| acquisitionprop]]
  
See the document on ConvertingOboToOWL.
+
* [[Chado_Tables#Table:_analysisprop| analysisprop]]
  
=====Logical definitions=====
+
* [[Chado_Tables#Table:_arraydesign| arraydesign]]
  
In a normal ontology DAG representation in chado, the
+
* [[Chado_Tables#Table:_arraydesignprop| arraydesignprop]]
cvterm_relationship rows represent relationships between terms, or
+
more formally, '''necessary conditions'''. A logical definition must
+
have both '''necessary and sufficient conditions'''. A logical
+
definition often consists of a '''generic term''' (aka genus) and one
+
or more '''discriminating characteristics''' (aka differentiae). The
+
discriminating characteristics are typically relationships
+
  
For example, the logical definition of ''larval locomotory
+
* [[Chado_Tables#Table:_assayprop| assayprop]]
behaviour'' would be a ''locomotory behaviour'' (genus) which occurs '''during''' the ''larval stage'' (where '''during''' could be drawn from an
+
ontology of relations, and larval stage may come from an insect
+
developmental stage ontology). These constitute both necessary and
+
sufficient conditions: the conditions are necessary in that all
+
instances of larval locomotory behavior are necessarily locomotory
+
behaviors and are necessarily manifested at the larval stage. We could
+
represent this using a normal DAG. However, because this is a
+
definition it also constitutes sufficient conditions, in that any
+
instance of locomotory behavior which manifests at the larval stage is
+
by definition a larval locomotory behavior.
+
  
In an ontology formalism like OWL-DL or Obo-1.2, genus-differentiae
+
* [[Chado_Tables#Table:_biomaterial_relationship| biomaterial_relationship]]
are represented using set-intersections.
+
  
Here is the Obo 1.2 representation:
+
* [[Chado_Tables#Table:_biomaterial_treatment| biomaterial_treatment]]
  
<code>
+
* [[Chado_Tables#Table:_biomaterialprop| biomaterialprop]]
[Term]
+
id: GO:0008345
+
name: larval locomotory behavior
+
namespace: biological_process
+
is_a: GO:0007626 ! locomotory behavior
+
is_a: GO:0030537 ! larval behavior
+
intersection_of: GO:0007626  ! GENUS: locomotory behavior
+
intersection_of: during FBdv:00005336 ! DIFFERENTIUM: during larval stage
+
</code>
+
  
Here is the equivalent in OWL (note: RDF-XML syntax is very verbose!):
+
* [[Chado_Tables#Table:_contact| contact]]
  
<code>
+
* [[Chado_Tables#Table:_contact_relationship| contact_relationship]]
  <owl:Class rdf:ID="GO_0008345">
+
<rdfs:label xml:lang="en">larval locomotory behavior</rdfs:label>
+
<rdfs:subClassOf rdf:resource="#GO_0007626"/>
+
<rdfs:subClassOf rdf:resource="#GO_0030537"/>
+
<owl:equivalentClass>
+
<owl:Class>
+
  <owl:intersectionOf rdf:parseType="Collection">
+
<owl:Class rdf:about="#GO_0007626"/>
+
<owl:Restriction>
+
<owl:onProperty>
+
  <owl:ObjectProperty rdf:about="#during"/>
+
</owl:onProperty>
+
<owl:someValuesFrom rdf:resource="#FBdv_00005336"/>
+
</owl:Restriction>
+
  </owl:intersectionOf>
+
</owl:Class>
+
</owl:equivalentClass>
+
  </owl:Class>
+
</code>
+
  
When converting to chado we employ a more economical representation,
+
* [[Chado_Tables#Table:_control| control]]
in terms of the number of triples we use:
+
  
<code>
+
* [[Chado_Tables#Table:_cvterm_dbxref| cvterm_dbxref]]
  <!-- normal DAG relationships (necessary conditions) -->
+
  <cvterm_relationship>
+
<type_id>is_a</type_id>
+
<subject_id>GO:0008345</subject_id>
+
<object_id>GO:0007626</object_id>
+
  </cvterm_relationship>
+
  <cvterm_relationship>
+
<type_id>is_a</type_id>
+
<subject_id>GO:0008345</subject_id>
+
<object_id>GO:0030537</object_id>
+
  </cvterm_relationship>
+
  
  <!-- Genus/generic term -->
+
* [[Chado_Tables#Table:_cvterm_relationship| cvterm_relationship]]
  <cvterm_relationship>
+
<type_id>intersection_of</type_id>
+
<subject_id>GO:0008345</subject_id>
+
<object_id>GO:0007626</object_id> <!-- locomotory behavior -->
+
  </cvterm_relationship>
+
  
  <!-- Discriminating characteristics -->
+
* [[Chado_Tables#Table:_cvtermpath| cvtermpath]]
  <cvterm_relationship>
+
<type_id>intersection_of</type_id>
+
<subject_id>GO:0008345</subject_id>
+
<object_id>
+
  
<!-- anonymous term representing  during(larval stage) -->
+
* [[Chado_Tables#Table:_cvtermprop| cvtermprop]]
<cvterm>
+
  <dbxref_id>
+
<dbxref>
+
<db_id>internal</db_id>
+
<accession>restriction--OBOL:during--GO:0008345</accession>
+
</dbxref>
+
  </dbxref_id>
+
  
  <!-- note: as this is an anon term, the name will never
+
* [[Chado_Tables#Table:_cvtermsynonym| cvtermsynonym]]
be shown to a user -->
+
  <name>restriction--OBOL:during--GO:0008345</name>
+
  <cv_id>anonymous_cv</cv_id>
+
  <cvtermprop>
+
<type_id>is_anonymous</type_id>
+
<value>1</value>
+
<rank>0</rank>
+
  </cvtermprop>
+
  <cvterm_relationship>
+
<type_id>OBOL:during</type_id>
+
<object_id>FBdv:00005336</object_id>
+
  </cvterm_relationship>
+
</cvterm>
+
  
</object_id>
+
* [[Chado_Tables#Table:_dbxrefprop| dbxrefprop]]
  </cvterm_relationship>
+
  
</code>
+
* [[Chado_Tables#Table:_element| element]]
  
Note that in the above, we are creating '''anonymous''' terms. We give
+
* [[Chado_Tables#Table:_element_relationship| element_relationship]]
them fake names and fake dbxrefs. In the bbop-experimental cvs branch
+
of chado, names and dbxrefs are nullable, so these can be
+
omitted. With the current schema, you must provide fake dbxrefs and
+
names that are unique, such as the above (if you are not familiar with how ChadoXML maps to the chado schema,
+
see the explanation below).
+
  
If you wish to convert Obo-specified logical definitions to chadoxml
+
* [[Chado_Tables#Table:_elementresult_relationship| elementresult_relationship]]
you will need go-perl v0.05 or higher (if you have a lower version,
+
the intersection_of tags will simply be ignored).
+
  
<code>
+
* [[Chado_Tables#Table:_environment_cvterm| environment_cvterm]]
go2chadoxml ont.obo > ont.chado
+
</code>
+
  
=====How logical definitions are stored in Chado=====
+
* [[Chado_Tables#Table:_feature| feature]]
  
This involves no schema changes to the cv module. Each intersection_of
+
* [[Chado_Tables#Table:_feature_cvterm| feature_cvterm]]
goes in as a DAG arc of type internal:intersection_of. The object_id
+
in the arc is either a term (for the genus) or an anonymous term
+
representing a restriction (the differentium). the restriction has a
+
relationship of some type to another term.
+
  
For example, for "larval locomotory behavior" we would normally just have:
+
* [[Chado_Tables#Table:_feature_cvtermprop| feature_cvtermprop]]
  
<code>
+
* [[Chado_Tables#Table:_feature_genotype| feature_genotype]]
LLB is_a LocomotoryBehavior
+
LLB is_a LarvalBehavior
+
</code>
+
  
If we load a logical definition for this term (see
+
* [[Chado_Tables#Table:_feature_pubprop| feature_pubprop]]
go-dev/go-perl/t/data/llm/obo), like this:
+
  
<code>
+
* [[Chado_Tables#Table:_feature_relationship| feature_relationship]]
[Term]
+
id: GO:0008345
+
name: larval locomotory behavior
+
namespace: biological_process
+
is_a: GO:0007626 ! locomotory behavior
+
is_a: GO:0030537 ! larval behavior
+
intersection_of: GO:0007626  ! locomotory behavior
+
intersection_of: during FBdv:00005336 ! larval stage
+
</code>
+
  
Then the intersection_ofs get stored using the basic DAG tables as:
+
* [[Chado_Tables#Table:_feature_relationshipprop| feature_relationshipprop]]
  
Subject & Relation & Object \\ \hline
+
* [[Chado_Tables#Table:_featuremap| featuremap]]
LLB & intersection_of & LocomotoryBehavior
+
LLB & intersection_of & anon:xxx
+
anon:xxx & during & FBv:00005336
+
  
 +
* [[Chado_Tables#Table:_featureprop| featureprop]]
  
\label{tab:intersections-in-Chado}
+
* [[Chado_Tables#Table:_library| library]]
\end{tabular}
+
}
+
\caption{Logical definition stored in cvterm_relationship table}
+
\label{tab:tab-esc-str}
+
\end{table}
+
  
This uses 4 cvterm_relationships and the creation of a new
+
* [[Chado_Tables#Table:_library_cvterm| library_cvterm]]
'''anonymous''' term that is never shown directly to the user. The
+
anonymous term represents the class of things that happen during the
+
larval stage
+
  
=====Logical Definition Views=====
+
* [[Chado_Tables#Table:_libraryprop| libraryprop]]
  
Two views: cvterm_genus and cvterm_differentium views are in
+
* [[Chado_Tables#Table:_organism_relationship| organism_relationship]]
chado/modules/cv/views.
+
  
=====Example use case: Phenotypes=====
+
* [[Chado_Tables#Table:_organismpath| organismpath]]
  
The idea here is that queries for composed term "syndactyly" should
+
* [[Chado_Tables#Table:_organismprop| organismprop]]
automatically return the same results as a boolean query for
+
"fusion"+inheres_in="finger" regardless of whether the annotation is
+
to the composed term or is a composed annotation (provided we put the
+
logical definition of syndactyly in the database)
+
  
=====Example use case: feature types=====
+
* [[Chado_Tables#Table:_phendesc| phendesc]]
  
The Sequence Ontology has some logical definitions - you will need to
+
* [[Chado_Tables#Table:_phenotype| phenotype]]
load the file <code>so-xp.obo</code>
+
  
=====Example use case: GO=====
+
* [[Chado_Tables#Table:_phenotype_comparison| phenotype_comparison]]
  
See http://www.fruitfly.org/~cjm/obol
+
* [[Chado_Tables#Table:_phenotype_cvterm| phenotype_cvterm]]
  
=====Example use case: Drawing DAGs=====
+
* [[Chado_Tables#Table:_phenstatement| phenstatement]]
  
Currently the DAGs of many OBO ontologies are highly tangled; see:
+
* [[Chado_Tables#Table:_phylonode| phylonode]]
http://www.fruitfly.org/~cjm/obol/doc/go-complexity.html
+
  
If all terms have logical definitions, then there is only one 'true'
+
* [[Chado_Tables#Table:_phylonode_relationship| phylonode_relationship]]
(genus) ''is_a'' parent. This enables us to disentangle the DAGs and draw
+
distinct hierarchies. For example, the GO term '''cysteine
+
biosynthesis''' could be drawn as two distinct hierarchies - one process
+
and one chemical
+
  
=====Loading OWL into Chado=====
+
* [[Chado_Tables#Table:_phylonodeprop| phylonodeprop]]
  
Not all OWL-DL features are supported. Only intersectionOfs
+
* [[Chado_Tables#Table:_phylotree| phylotree]]
corresponding to genus-differentiae are loaded.
+
  
First you must convert OWL into Obo 1.2 format. There will soon be a
+
* [[Chado_Tables#Table:_protocol| protocol]]
way to do this in OboEdit. For now you can use blipkit
+
(http://www.blipkit.org)
+
  
<code>
+
* [[Chado_Tables#Table:_protocolparam| protocolparam]]
blip io-convert my.owl -to obo -o my.obo
+
</code>
+
  
Once you have an obo file you can run go2chadoxml, as above
+
* [[Chado_Tables#Table:_pub| pub]]
  
=====Post-coordinating terms=====
+
* [[Chado_Tables#Table:_pub_relationship| pub_relationship]]
  
Sometimes we want to be able to refer to a term such as "plasma
+
* [[Chado_Tables#Table:_pubprop| pubprop]]
membrane of spermatocyte", but no such term exists in the
+
ontology. Introducing these as '''pre-coordinated''' cross-product
+
terms would make the ontology unwieldy.
+
  
Chado allows the '''post-coordination''' or '''post-composition''' of
+
* [[Chado_Tables#Table:_quantification_relationship| quantification_relationship]]
terms using the same formalism as described above. Briefly: we would
+
create an '''anonymous''' term. This anonymous term would be defined using
+
the terms '''plasma membrane''' and '''spermatocyte''', using a
+
genus-differentia definition as above.
+
  
<code>
+
* [[Chado_Tables#Table:_quantificationprop| quantificationprop]]
  
  <!-- Genus/generic term -->
+
* [[Chado_Tables#Table:_stock| stock]]
  <cvterm_relationship>
+
<type_id>intersection_of</type_id>
+
<subject_id>anon_1</subject_id>
+
<object_id>GO__plasma_membrane</object_id>
+
  </cvterm_relationship>
+
  
  <!-- Discriminating characteristics -->
+
* [[Chado_Tables#Table:_stock_cvterm| stock_cvterm]]
  <cvterm_relationship>
+
<type_id>intersection_of</type_id>
+
<subject_id>anon_1</subject_id>
+
<object_id>
+
  
<!-- anonymous term representing  part_of(spermatocyte) -->
+
* [[Chado_Tables#Table:_stock_relationship| stock_relationship]]
<cvterm>
+
  <dbxref_id>
+
<dbxref>
+
<db_id>internal</db_id>
+
<accession>restriction--part_of--spermatocyte</accession>
+
</dbxref>
+
  </dbxref_id>
+
  
  <!-- note: as this is an anon term, the name will never
+
* [[Chado_Tables#Table:_stockcollection| stockcollection]]
be shown to a user -->
+
  <name>restriction--part_of--spermatocyte</name>
+
  <cv_id>anonymous_cv</cv_id>
+
  <cvtermprop>
+
<type_id>is_anonymous</type_id>
+
<value>1</value>
+
<rank>0</rank>
+
  </cvtermprop>
+
  <cvterm_relationship>
+
<type_id>OBO_REL:part_of</type_id>
+
<object_id>CL__spermatocyte</object_id>
+
  </cvterm_relationship>
+
</cvterm>
+
  
</object_id>
+
* [[Chado_Tables#Table:_stockcollectionprop| stockcollectionprop]]
  </cvterm_relationship>
+
  
</code>
+
* [[Chado_Tables#Table:_stockprop| stockprop]]
  
The above assumes XORT macro IDs defined for GO__plasma_membrane and
+
* [[Chado_Tables#Table:_studydesignprop| studydesignprop]]
CL__spermatocyte
+
  
 +
* [[Chado_Tables#Table:_studyfactor| studyfactor]]
 +
 +
* [[Chado_Tables#Table:_synonym| synonym]]
 +
 +
* [[Chado_Tables#Table:_treatment| treatment]]
 +
 +
* [[Chado_Tables#Table:_wwwuser_cvterm| wwwuser_cvterm]]
 +
 +
----
 +
 +
 +
 +
== Table: cvterm_dbxref ==
 +
 +
In addition to the primary identifier (cvterm.dbxref_id) a cvterm can have zero or more secondary identifiers/dbxrefs, which may refer to records in external databases. The exact semantics of cvterm_dbxref are not fixed. For example: the dbxref could be a PubMed ID that is pertinent to the cvterm, or it could be an equivalent or similar term in another ontology. For example, GO cvterms are typically linked to InterPro IDs, even though the nature of the relationship between them is largely one of statistical association. The dbxref may have data records attached in the same database instance, or it could be a "hanging" dbxref pointing to some external database. NOTE: If the desired objective is to link two cvterms together, and the nature of the relation is known and holds for all instances of the subject cvterm, then consider instead using cvterm_relationship together with a well-defined relation.
 +
 +
{| border="1" cellpadding="3"
 +
|+ cvterm_dbxref Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| cvterm_dbxref_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| cvterm_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_dbxref| dbxref]]
 +
| dbxref_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr1"
 +
|
 +
| is_for_definition
 +
| integer
 +
| '' NOT NULL ''<br /><br />A cvterm.definition should be supported by one or more references. If this column is true, the dbxref is not for a term in an external database - it is a dbxref for provenance information for the definition.
 +
|}
 +
 +
----
 +
 +
 +
 +
== Table: cvterm_relationship ==
 +
 +
A relationship linking two cvterms. Each cvterm_relationship constitutes an edge in the graph defined by the collection of cvterms and cvterm_relationships. The meaning of the cvterm_relationship depends on the definition of the cvterm R referred to by type_id. However, in general the definitions are such that the statement "all SUBJs REL some OBJ" is true. The cvterm_relationship statement is about the subject, not the object. For example, "insect wing part_of thorax".
 +
 +
{| border="1" cellpadding="3"
 +
|+ cvterm_relationship Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| cvterm_relationship_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| type_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The nature of the relationship between subject and object. Note that relations are also housed in the cvterm table, typically from the OBO relationship ontology, although other relationship types are allowed.
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| subject_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The subject of the subj-predicate-obj sentence. The cvterm_relationship is about the subject. In a graph, this typically corresponds to the child node.
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| object_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The object of the subj-predicate-obj sentence. The cvterm_relationship refers to the object. In a graph, this typically corresponds to the parent node.
 +
|}
 +
 +
----
 +
 +
 +
 +
== Table: cvtermpath ==
 +
 +
The reflexive transitive closure of the cvterm_relationship relation.
 +
 +
{| border="1" cellpadding="3"
 +
|+ cvtermpath Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| cvtermpath_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| type_id
 +
| integer
 +
| '' UNIQUE#1 ''<br /><br />The relationship type that this is a closure over. If null, then this is a closure over ALL relationship types. If non-null, then this references a relationship cvterm - note that the closure will apply to both this relationship AND the OBO_REL:is_a (subclass) relationship.
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| subject_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| object_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_cv| cv]]
 +
| cv_id
 +
| integer
 +
| '' NOT NULL ''<br /><br />Closures will mostly be within one cv. If the closure of a relationship traverses a cv, then this refers to the cv of the object_id cvterm.
 +
|- class="tr1"
 +
|
 +
| pathdistance
 +
| integer
 +
| '' UNIQUE#1 ''<br /><br />The number of steps required to get from the subject cvterm to the object cvterm, counting from zero (reflexive relationship).
 +
|}
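
To make the definition concrete, the sketch below computes a reflexive transitive closure with path distances over a small in-memory graph. It is an illustration only: the term names and the <code>closure</code> function are invented for this example, all relationship types are treated alike (corresponding to a null type_id), only the shortest distance per pair is kept, and it does not reproduce whatever procedure is used to populate the real table.

```python
from collections import deque

# Direct edges as (subject, object), i.e. child -> parent. Hypothetical terms.
EDGES = [
    ("larval locomotory behavior", "locomotory behavior"),
    ("larval locomotory behavior", "larval behavior"),
    ("locomotory behavior", "behavior"),
    ("larval behavior", "behavior"),
]


def closure(edges):
    """Return sorted (subject, object, pathdistance) rows, shortest distance only."""
    parents = {}
    nodes = set()
    for subj, obj in edges:
        parents.setdefault(subj, []).append(obj)
        nodes.update((subj, obj))

    rows = []
    for start in nodes:
        # Breadth-first search upwards; distance 0 is the reflexive row.
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for parent in parents.get(node, []):
                if parent not in seen:
                    seen[parent] = seen[node] + 1
                    queue.append(parent)
        rows.extend((start, obj, dist) for obj, dist in seen.items())
    return sorted(rows)


for row in closure(EDGES):
    print(row)
```

Each term yields one row to itself with pathdistance 0, which is what makes the closure reflexive; the row from "larval locomotory behavior" to "behavior" has pathdistance 2.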
 +
 +
----
 +
 +
 +
 +
== Table: cvtermprop ==
 +
 +
Additional extensible properties can be attached to a cvterm using this table. Corresponds to -AnnotationProperty- in W3C OWL format.
 +
 +
{| border="1" cellpadding="3"
 +
|+ cvtermprop Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| cvtermprop_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| cvterm_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| type_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />The name of the property or slot is a cvterm. The meaning of the property is defined in that cvterm.
 +
|- class="tr1"
 +
|
 +
| value
 +
| text
 +
| ''<nowiki> UNIQUE#1 NOT NULL DEFAULT ''::text </nowiki>''<br /><br />The value of the property, represented as text. Numeric values are converted to their text representation.
 +
|- class="tr0"
 +
|
 +
| rank
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''<br /><br />Property-Value ordering. Any cvterm can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.
 +
|}
 +
 +
----
 +
 +
 +
 +
== Table: cvtermsynonym ==
 +
 +
A cvterm actually represents a distinct class or concept. A concept can be referred to by different phrases or names. In addition to the primary name (cvterm.name) there can be a number of alternative aliases or synonyms. For example, "T cell" as a synonym for "T lymphocyte".
 +
 +
{| border="1" cellpadding="3"
 +
|+ cvtermsynonym Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| cvtermsynonym_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| cvterm_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
| synonym
 +
| character varying(1024)
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| type_id
 +
| integer
 +
| '' ''<br /><br />A synonym can be exact, narrower than, or broader than the term.
 +
|}
 +
 +
----
 +
 +
 +
 +
== Table: dbxrefprop ==
 +
 +
Metadata about a dbxref. Note that this is not defined in the dbxref module, as it depends on the cvterm table. This table has a structure analogous to cvtermprop.
 +
 +
{| border="1" cellpadding="3"
 +
|+ dbxrefprop Structure
 +
|-
 +
! F-Key
 +
! Name
 +
! Type
 +
! Description
 +
|- class="tr0"
 +
|
 +
| dbxrefprop_id
 +
| serial
 +
| '' PRIMARY KEY ''
 +
|- class="tr1"
 +
|
 +
[[Chado_Tables#Table:_dbxref| dbxref]]
 +
| dbxref_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr0"
 +
|
 +
[[Chado_Tables#Table:_cvterm| cvterm]]
 +
| type_id
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|- class="tr1"
 +
|
 +
| value
 +
| text
 +
| ''<nowiki> NOT NULL DEFAULT ''::text </nowiki>''
 +
|- class="tr0"
 +
|
 +
| rank
 +
| integer
 +
| '' UNIQUE#1 NOT NULL ''
 +
|}
  
Allow post-coordinated terms places a greater burden on applications
+
----
that use the cv module. More documentation will be provided here on
+
this.
+
  
[[Category:To Do]]
+
[[Category:Chado Modules]]
[[Category:Chado]]
+
[[Category:Ontologies]]
 +
[[Category:!Lacking ERD]]

Latest revision as of 04:35, 18 February 2015

Introduction

This module is for controlled vocabularies (CVs), semantic networks and ontologies, depending on which terminology you prefer.

It is intended to be rich enough to encapsulate anything in the Gene Ontology (GO) or OBO family of ontologies (see the GO website and the OBO project). The schema reflects the data model of OBO and of the OBO Edit tool currently used by these projects.

This module is also intended to be extensible to richer formalisms such as OWL (Web Ontology Language), but this is outside the current requirements.

Similarities to the GO Database schema

The schema is similar to the GO database schema, which was also developed by one of the Chado designers.

There is a bridge layer in the directory modules/cv/bridges/, which can make the Chado cv module look like the GO DB, and vice versa.

Overview

An ontology, or controlled vocabulary (cv) is a collection of classes (or concepts or terms, depending on your terminology) with definitions and relationships to other classes. Each class--a word or phrase--can only appear once in a controlled vocabulary and has a defined meaning within that vocabulary. The controlled vocabularies are chosen so that the contents do not overlap; if the same text string is used to describe two different concepts in two different cvs, these are distinct classes. These terms are housed in the cvterm table in the Chado schema.

cvterms are related to one another via cvterm_relationship. This can be thought of as a graph, or semantic network. The relationship types (the labels on the arcs of the graph) are also stored in the cvterm table. The relationship types are extensible, but the type is_a (the subtyping relationship) is assumed to be present; many OBO ontologies use the part_of relationship, and GO also uses the regulates relation. Relationship types also come from a controlled vocabulary, the OBO Relation Ontology.

The cvterm_relationship can be thought of as specifying sentences about the cvterms. These sentences have 3 parts: a subject term, an object term, and a verb or type. For example, in the phrase "an exon is part of a transcript" the subject of the sentence is "exon" and the object is "transcript". If you prefer to think of it as a directed graph, then you can think of the subject as the child node, and the object as the parent node.
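
The subject/type/object reading can be sketched with a small in-memory database. This is only an illustration using Python's sqlite3 module, not the real Chado DDL: the two tables are cut down to a handful of the columns described in the Tables section, and the helper names `build_demo` and `sentences` are hypothetical.

```python
import sqlite3

def build_demo():
    """Create simplified cvterm / cvterm_relationship tables (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cvterm (
            cvterm_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            is_relationshiptype INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE cvterm_relationship (
            cvterm_relationship_id INTEGER PRIMARY KEY,
            type_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
            subject_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
            object_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
            UNIQUE (type_id, subject_id, object_id)
        );
    """)
    # The relation itself (part_of) is stored as a cvterm, like the terms it links.
    conn.executemany(
        "INSERT INTO cvterm (cvterm_id, name, is_relationshiptype) VALUES (?, ?, ?)",
        [(1, "part_of", 1), (2, "exon", 0), (3, "transcript", 0)],
    )
    # "exon part_of transcript": subject = exon (child), object = transcript (parent).
    conn.execute(
        "INSERT INTO cvterm_relationship (type_id, subject_id, object_id) VALUES (1, 2, 3)"
    )
    return conn

def sentences(conn):
    """Render every relationship row as a 'subject type object' sentence."""
    rows = conn.execute("""
        SELECT s.name, t.name, o.name
        FROM cvterm_relationship r
        JOIN cvterm s ON s.cvterm_id = r.subject_id
        JOIN cvterm t ON t.cvterm_id = r.type_id
        JOIN cvterm o ON o.cvterm_id = r.object_id
        ORDER BY r.cvterm_relationship_id
    """).fetchall()
    return [" ".join(row) for row in rows]
```

All three foreign keys point back into cvterm, which is why the query joins that table three times.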

Associating features to cvterms

This module is used by most of the Chado modules. But it is useful to describe here how this module would be used in the context of the sequence module.

Often we want to attach cvterms to features. One example is typing features with SO - this is central to the sequence module. Each feature has one primary type, stored in feature.type_id.

We can also attach an arbitrary number of non-primary cvterms to a feature.

For example, we may want to attach GO annotations to gene or protein features. We may also want to attach phenotypic terms to gene features (although the preferred way to do this is via a genotype using the genetics module).

Complex annotations

Note that this is something that is not handled by the current GO DB, but it is something we may want to allow for in future.

Currently in GO, all annotations are disjunctive; for example, if we have

gene | GO ID
-----+------
foo  | GO:001
foo  | GO:002
foo  | GO:003


This page or section needs to be edited. Please help by editing this page to add your revisions or additions.

The text above was taken from modules/cv/cv-intro.txt , which was incomplete (and no longer exists).


The sequence module makes extensive use of terms taken from various ontologies such as SO and the OBO Relations Ontology, using the type_id foreign key column. In addition, features can be annotated using ontologies such as GO using the feature_cvterm linking table. These terms are modelled using the cv module, the core of which is the cvterm table.

An ontology, terminology or cv (controlled vocabulary) is a collection of terms (here equivalent to what are more typically called classes, types, categories or kinds in the ontology literature) in a particular domain of interest. Examples include "gene" (from SO), "transcription factor activity" (from GO molecular function) and "lymphocyte" (from OBO-Cell). The Chado cv module is based on the GO Database schema. Terms are stored in the cvterm table, and relationships between terms are stored in the cvterm_relationship table. This table follows an analogous structure to the feature_relationship table, in that it has columns subject_id, object_id and type_id. Here, all three of these foreign keys refer to rows in the cvterm table.

A brief treatment of relationship types in biological ontologies can be found here. Of particular interest to Chado is the is_a relation, which specifies a sub-typing relationship between two terms or classes. Recall that tables in the sequence module (such as the feature table) frequently define a type_id foreign key column to indicate the specific type or class of entity for each row in that table. The combination of the type_id column and the is_a relationship gives Chado a data sub-classing system, beyond what is possible with traditional SQL database semantics.

This is discussed further below. The collection of cvterms and cvterm_relationships can be considered to constitute vertices and edges in a graph. This graph is typically acyclic (a DAG), though it is not guaranteed to be, as certain relationship types are allowed to form cycles.


Sequence Ontology Examples
SO Term                           | SO id
----------------------------------+-----------
exon                              | SO:0000147
intron                            | SO:0000188
mRNA                              | ...
miRNA                             | ...
regulatory_element                | ...
transcription_factor_binding_site | ...

Transitive Closure

This section concerns relations between ontology terms and how defined terms and relations can be used to reason, either by humans or computers. A specialized ontology concerning these relations has been developed, the OBO Relation Ontology.

Often it is useful to know the transitive closure over a relationship type, or a collection of relationship types. The closure is the result of recursively applying the relationship. For example, if A is_a B, B is_a C, then the closure of is_a includes A is_a C.

In particular, we want the reflexive transitive closure. A term is always related to itself in a reflexive closure. Meaning:

X is_a X

This may seem odd, but it comes in useful both for doing queries and for deriving future rules. This makes it easier to ask "find me all genes of class X", and to get back genes attached to X and subtypes of X.

The closure goes in the cvtermpath table - the closure can also be thought of as a path through the graph or semantic network.
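
As a rough sketch of what "reflexive transitive closure" means for a single relation such as is_a, the following Python function computes it over plain (subject, object) pairs. The function name and the in-memory representation are assumptions made for illustration; this is not the code Chado uses to fill cvtermpath.

```python
def reflexive_transitive_closure(links):
    """Return the reflexive transitive closure of one relation.

    links: iterable of (subject, object) pairs, e.g. the asserted is_a links.
    """
    links = set(links)
    terms = {term for pair in links for term in pair}
    # Reflexive part: every term is related to itself.
    closure = {(term, term) for term in terms} | links
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in links:
                # a ... b and b -> d implies a ... d
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure
```

For the example above, `reflexive_transitive_closure({("A", "B"), ("B", "C")})` contains the two asserted links, the derived pair `("A", "C")`, and one reflexive pair per term.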

Transitivity of other Relations

Many other relations, such as part_of, are also transitive.

If R is a transitive relation, then we can say

X R Z <= X R Y, Y R Z

For example, assume we have the following 2 develops_from links, and develops_from is a transitive relation:

 glial cell develops_from glioblast
 glioblast develops_from neurectodermal cell

Then it follows that glial cells develop from neurectodermal cells.

Transitivity over is_a

It can be proved from the definition of is_a (proof not shown here) that:

 X R Z <= X is_a Y, Y R Z

and

 X R Z <= X R Y, Y is_a Z

This can be thought of as "inheritance".

For example, if an astrocyte is_a glial cell and a glial cell develops_from a glioblast, then it follows that an astrocyte develops_from a glioblast.
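
The two is_a rules, together with the transitivity rule for R itself, can be applied repeatedly until nothing new is derived. The sketch below does this over in-memory sets of (subject, object) pairs; the function name `deduce` and the data layout are hypothetical, and R is assumed to be a transitive relation.

```python
def deduce(is_a_links, r_links):
    """Close a transitive relation R under itself and under is_a inheritance."""
    is_a = set(is_a_links)
    r = set(r_links)
    changed = True
    while changed:
        changed = False
        new = set()
        for (x, y) in is_a:
            for (y2, z) in r:
                if y == y2:
                    new.add((x, z))  # X R Z <= X is_a Y, Y R Z
        for (x, y) in r:
            for (y2, z) in is_a:
                if y == y2:
                    new.add((x, z))  # X R Z <= X R Y, Y is_a Z
            for (y2, z) in r:
                if y == y2:
                    new.add((x, z))  # X R Z <= X R Y, Y R Z
        if not new <= r:
            r |= new
            changed = True
    return r
```

Calling `deduce({("astrocyte", "glial cell")}, {("glial cell", "glioblast")})` yields the asserted develops_from link plus the inherited pair `("astrocyte", "glioblast")`.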

Difference between Deductive Closure and Transitive Closure

With a transitive closure we simply follow all links in the DAG, ignoring the relationship type. This works fine for ontologies such as GO that have only is_a and part_of, but is not ideal for other ontologies such as anatomical ontologies.

First of all, it may be possible for the closure to grow in size explosively.

Second of all, a closure that ignores the relations may be scientifically meaningless. It is also less useful for queries. For example, we may want to query for genes expressed in the larva or part of the larva, but not genes expressed in anatomical entities that develop from the larva.

Rules

The cvtermpath table is for calculating the reflexive transitive closure of a relationship, and any derived relationships.

Normal (direct) relationships are stored in the cvterm_relationship table. An entry in this table represents a cvterm_relationship S over some relation R.

S = Subj R Obj

For example:

S = "cardioblast" develops_from "mesodermal cell"

In addition to these asserted links, we want to be able to deduce links between terms.

If X is_a Y, then it follows that all of Y's cvterm_relationship statements are inherited by X.

Rule 1

If X is_a Y
and  Y R Z
then X R(inh) Z

For example:

 "cilium axoneme"  is_a "axoneme"
 "axoneme"  part_of "cell projection"

Therefore:

 "cilium axoneme"  part_of(inh) "cell projection"

Here we use R(inh) to represent the inherited form of a relation R.

Populating cvtermpath

The cvtermpath table stores the reflexive transitive closure of a relationship, taking into account subsumption or inheritance. The number of intermediate relationships is represented in the pathdistance column of the table.

Here we use T(path) to represent the "path" or closure of a relationship. Every T(path) is stored in cvtermpath. We use the same cvterm for T; the fact that it is a path is implicit.

We use these rules:

Reflexive relationships:

for all relations T,
  X T(path) X

In this case pathdistance = 0.

Direct relationships:

These are also included in the cvtermpath table, with pathdistance = 1.

If   X T Y
Then X T(path) Y

Transitive relationships:

These have pathdistance > 1; these also make use of the inheritance rule, Rule 1, which gives us T(inh).

If X T(inh)  Y
and  Y T(path) Z
Then X T(path) Z

Note that this rule is recursive.

These rules should be used for populating cvtermpath. Attempting to calculate a more general closure where all relations are treated equally or ignored will produce combinatorial explosions over certain ontologies (e.g. the FlyBase anatomy ontology). What does this mean in practice?

For a typical database, which may only have the relations is_a, part_of and develops_from, we will end up with 3 sets of paths.

The is_a closure, is_a(path), will include paths over cvterm_relationships that look like this:

a is_a b is_a c is_a d is_a e

The part_of closure, part_of(path) will include paths over cvterm_relationships that look like this:

a is_a b part_of c part_of d is_a e part_of f

The develops_from closure, develops_from(path) will include paths over cvterm_relationships that look like this:

a develops_from b develops_from c is_a d is_a e develops_from f

It may be tempting to mix different non-is_a relationships in the same path, but this should never be done - there will be an unacceptable combinatorial explosion in many cases. Besides, there is no use for such a cvtermpath; it is meaningless.
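
One way to picture the per-relation closures is the Python sketch below. It rests on two simplifying assumptions that are not taken from the Chado loaders: a path for relation T may follow edges of type T or is_a only, and the stored distance is the number of edges on the shortest such path. The function name and the tuple layout are likewise hypothetical.

```python
from collections import deque

def cvtermpath_rows(relationships, rel_type):
    """Build closure rows for one relation type.

    relationships: iterable of (subject, type, object) triples.
    Returns a set of (subject, object, rel_type, pathdistance) tuples.
    """
    allowed = {rel_type, "is_a"}  # never mix in other non-is_a relations
    parents = {}
    terms = set()
    for subj, typ, obj in relationships:
        terms.update((subj, obj))
        if typ in allowed:
            parents.setdefault(subj, set()).add(obj)
    rows = set()
    for start in terms:
        dist = {start: 0}  # reflexive row, pathdistance 0
        queue = deque([start])
        while queue:  # breadth-first search gives shortest distances
            node = queue.popleft()
            for parent in parents.get(node, ()):
                if parent not in dist:
                    dist[parent] = dist[node] + 1
                    queue.append(parent)
        rows.update((start, obj, rel_type, d) for obj, d in dist.items())
    return rows
```

With the triples `a is_a b`, `b part_of c` and `c develops_from d`, the part_of closure reaches c from a at distance 2 but never reaches d, because develops_from edges are not followed when building part_of paths.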

Note that for AmiGO-like query behaviour, it is necessary only to query cvtermpath, ignoring cvtermpath.type_id (these are obtained by querying cvterm_relationship).
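
A query of this kind might look like the following sketch, which finds every feature whose type is a given term or any of its descendants by joining through cvtermpath and ignoring type_id. It runs against simplified stand-in tables in sqlite3; the table definitions, the sample rows (including mRNA as a subtype of transcript) and the function names are assumptions for illustration, not the real Chado schema or data.

```python
import sqlite3

def build_closure_demo():
    """Create cut-down cvterm, cvtermpath and feature tables with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE cvtermpath (
            subject_id INTEGER NOT NULL,
            object_id INTEGER NOT NULL,
            pathdistance INTEGER
        );
        CREATE TABLE feature (
            feature_id INTEGER PRIMARY KEY,
            name TEXT,
            type_id INTEGER NOT NULL
        );
        INSERT INTO cvterm VALUES (1, 'transcript'), (2, 'mRNA'), (3, 'exon');
        -- reflexive rows (distance 0) plus mRNA -> transcript (distance 1)
        INSERT INTO cvtermpath VALUES (1, 1, 0), (2, 2, 0), (3, 3, 0), (2, 1, 1);
        INSERT INTO feature VALUES
            (10, 'feat_mrna', 2),
            (11, 'feat_transcript', 1),
            (12, 'feat_exon', 3);
    """)
    return conn

def features_of_class(conn, class_name):
    """Names of features typed with class_name or any term that has a path to it."""
    rows = conn.execute("""
        SELECT DISTINCT f.name
        FROM feature f
        JOIN cvtermpath p ON p.subject_id = f.type_id
        JOIN cvterm o ON o.cvterm_id = p.object_id
        WHERE o.name = ?
        ORDER BY f.name
    """, (class_name,)).fetchall()
    return [row[0] for row in rows]
```

Because the closure is reflexive, features typed directly with the queried term are returned by the same join, with no special case needed.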


Advanced Usage

This section describes advanced usage of the cv module for use with OWL-DL, advanced OBO format 1.2 [REF] features, or elements from other ontology formalisms.

If you aren't sure what this means, you probably don't need to read this section yet.


Note that this section is liable to change; in particular the scheme below may be replaced with a simpler one. For details of the simpler scheme, which is along the lines of the transform used in the GO Database, see:

(search for intersection_of)

Background

See the document on Converting OBO to OWL.

Logical definitions

In a normal ontology DAG representation in Chado, the cvterm_relationship rows represent relationships between terms, or more formally, necessary conditions. A logical definition must have both necessary and sufficient conditions. A logical definition often consists of a generic term (also known as "genus") and one or more discriminating characteristics (also known as "differentiae"). The discriminating characteristics are typically relationships.

For example, the logical definition of "larval locomotory behavior" would be a "locomotory behavior" (genus) which occurs "during" the "larval stage" (where "during" could be drawn from an ontology of relations, and "larval stage" may come from an insect developmental stage ontology). These constitute both necessary and sufficient conditions: the conditions are necessary in that all instances of larval locomotory behavior are necessarily locomotory behaviors and are necessarily manifested at the larval stage. We could represent this using a normal DAG. However, because this is a definition it also constitutes sufficient conditions, in that any instance of locomotory behavior which manifests at the larval stage is by definition a larval locomotory behavior.

In an ontology formalism like OWL-DL or OBO-1.2, genus-differentiae are represented using set-intersections.

Here is the OBO 1.2 representation:

[Term]
id: GO:0008345
name: larval locomotory behavior
namespace: biological_process
is_a: GO:0007626 ! locomotory behavior
is_a: GO:0030537 ! larval behavior
intersection_of: GO:0007626  ! GENUS: locomotory behavior
intersection_of: during FBdv:00005336 ! DIFFERENTIUM: during larval stage

Here is the equivalent in OWL (note: RDF-XML syntax is very verbose!):

  <owl:Class rdf:ID="GO_0008345">
    <rdfs:label xml:lang="en">larval locomotory behavior</rdfs:label>
    <rdfs:subClassOf rdf:resource="#GO_0007626"/>
    <rdfs:subClassOf rdf:resource="#GO_0030537"/>
    <owl:equivalentClass>
      <owl:Class>
        <owl:intersectionOf rdf:parseType="Collection">
          <owl:Class rdf:about="#GO_0007626"/>
          <owl:Restriction>
            <owl:onProperty>
              <owl:ObjectProperty rdf:about="#during"/>
            </owl:onProperty>
            <owl:someValuesFrom rdf:resource="#FBdv_00005336"/>
          </owl:Restriction>
        </owl:intersectionOf>
      </owl:Class>
    </owl:equivalentClass>
  </owl:Class>

When converting to Chado we employ a more economical representation, in terms of the number of triples we use:

  <!-- normal DAG relationships (necessary conditions) -->
  <cvterm_relationship>
    <type_id>is_a</type_id>
    <subject_id>GO:0008345</subject_id>
    <object_id>GO:0007626</object_id>
  </cvterm_relationship>
  <cvterm_relationship>
    <type_id>is_a</type_id>
    <subject_id>GO:0008345</subject_id>
    <object_id>GO:0030537</object_id>
  </cvterm_relationship>
 
  <!-- Genus/generic term -->
  <cvterm_relationship>
    <type_id>intersection_of</type_id>
    <subject_id>GO:0008345</subject_id>
    <object_id>GO:0007626</object_id> <!-- locomotory behavior -->
  </cvterm_relationship>
 
  <!-- Discriminating characteristics -->
  <cvterm_relationship>
    <type_id>intersection_of</type_id>
    <subject_id>GO:0008345</subject_id>
    <object_id>
 
      <!-- anonymous term representing  during(larval stage) -->
      <cvterm>
        <dbxref_id>
          <dbxref>
            <db_id>internal</db_id>
            <accession>restriction--OBOL:during--GO:0008345</accession>
          </dbxref>
        </dbxref_id>
 
        <!-- note: as this is an anon term, the name will never
             be shown to a user -->
        <name>restriction--OBOL:during--GO:0008345</name>
        <cv_id>anonymous_cv</cv_id>
        <cvtermprop>
          <type_id>is_anonymous</type_id>
          <value>1</value>
          <rank>0</rank>
        </cvtermprop>
        <cvterm_relationship>
          <type_id>OBOL:during</type_id>
          <object_id>FBdv:00005336</object_id>
        </cvterm_relationship>
      </cvterm>
 
    </object_id>
  </cvterm_relationship>

Note that in the above, we are creating anonymous terms. We give them fake names and fake dbxrefs. In the bbop-experimental SVN branch of chado, names and dbxrefs are nullable, so these can be omitted. With the current schema, you must provide fake dbxrefs and names that are unique, such as the above (if you are not familiar with how Chado XML maps to the Chado schema, see the explanation below).

If you wish to convert OBO-specified logical definitions to Chado XML you will need go-perl, v0.05 or higher (if you have a lower version, the intersection_of tags will simply be ignored).

go2chadoxml ont.obo > ont.chado

How Logical Definitions are Stored in Chado

This involves no schema changes to the cv module. Each intersection_of goes in as a DAG arc of type internal:intersection_of. The object_id in the arc is either a term (for the genus) or an anonymous term representing a restriction (the differentium). The restriction has a relationship of some type to another term.

For example, for "larval locomotory behavior" we would normally just have:

LLB is_a LocomotoryBehavior
LLB is_a LarvalBehavior

If we load a logical definition for this term (see /t/data/llm/obo in the go-perl package), like this:

[Term]
id: GO:0008345
name: larval locomotory behavior
namespace: biological_process
is_a: GO:0007626 ! locomotory behavior
is_a: GO:0030537 ! larval behavior
intersection_of: GO:0007626  ! locomotory behavior
intersection_of: during FBdv:00005336 ! larval stage

Then the intersection_ofs get stored using the basic DAG tables as:


Definition stored in cvterm_relationship table
Subject  | Relation        | Object
---------+-----------------+--------------------
LLB      | intersection_of | Locomotory Behavior
LLB      | intersection_of | anon:xxx
anon:xxx | during          | FBdv:00005336


The definition itself uses the 3 cvterm_relationships shown above (in addition to the two is_a relationships) and the creation of a new anonymous term that is never shown directly to the user. The anonymous term represents the class of things that happen during the larval stage.
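
The mapping from intersection_of tags to DAG arcs can be sketched in a few lines of Python. This is an illustration of the scheme described above, not the go2chadoxml implementation: the function name, the input layout, and the `restriction--relation--filler` naming of the anonymous term are all assumptions.

```python
def logical_definition_triples(term_id, intersections):
    """Turn intersection_of entries into (subject, relation, object) triples.

    intersections: list of tuples, either (genus_id,) for the genus
    or (relation, filler_id) for a differentium.
    """
    triples = []
    for entry in intersections:
        if len(entry) == 1:
            # Genus: the arc points straight at an existing term.
            triples.append((term_id, "intersection_of", entry[0]))
        else:
            relation, filler = entry
            # Differentium: point at an anonymous restriction term, which in
            # turn carries the real relationship to the filler term.
            anon = f"restriction--{relation}--{filler}"
            triples.append((term_id, "intersection_of", anon))
            triples.append((anon, relation, filler))
    return triples
```

For GO:0008345 with genus GO:0007626 and differentium `during FBdv:00005336`, this produces two intersection_of arcs and one `during` arc hanging off the anonymous term.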

Logical Definition Views

Two views, cvterm_genus and cvterm_differentium, are in chado/modules/cv/views.

Example Use Case: Phenotypes

The idea here is that queries for the composed term "syndactyly" should automatically return the same results as a boolean query for "fusion" + inheres_in = "finger", regardless of whether the annotation is to the composed term or is a composed annotation (provided we put the logical definition of syndactyly in the database).

Example Use Case: Feature Types

The Sequence Ontology has some logical definitions - you will need to load the file so-xp.obo.

Example use case: GO

See Obol.

Example use case: Drawing DAGs

Currently the DAGs of many OBO ontologies are highly tangled; see http://www.fruitfly.org/~cjm/obol/doc/go-complexity.html

If all terms have logical definitions, then there is only one true (genus) or is_a parent. This enables us to disentangle the DAGs and draw distinct hierarchies. For example, the GO term cysteine biosynthesis could be drawn as two distinct hierarchies - one process and one chemical.

Loading OWL into Chado

Not all OWL-DL features are supported. Only intersection_ofs corresponding to genus-differentiae are loaded.

First you must convert OWL into OBO 1.2 format. There will soon be a way to do this in OboEdit. For now you can use blipkit.

blip io-convert my.owl -to obo -o my.obo

Once you have an OBO file you can run go2chadoxml, as above.

Post-coordinating Terms

Sometimes we want to be able to refer to a term such as "plasma membrane of spermatocyte", but no such term exists in the ontology. Introducing these as pre-coordinated cross-product terms would make the ontology unwieldy due to the large number of possible combinations.

Chado allows the post-coordination or post-composition of terms using the same formalism as described above. Briefly: we would create an anonymous term. This anonymous term would be defined using the terms "plasma membrane" and "spermatocyte", using a genus-differentia definition as above.

  <!-- Genus/generic term -->
  <cvterm_relationship>
    <type_id>intersection_of</type_id>
    <subject_id>anon_1</subject_id>
    <object_id>GO__plasma_membrane</object_id>
  </cvterm_relationship>
 
  <!-- Discriminating characteristics -->
  <cvterm_relationship>
    <type_id>intersection_of</type_id>
    <subject_id>anon_1</subject_id>
    <object_id>
 
      <!-- anonymous term representing  part_of(spermatocyte) -->
      <cvterm>
        <dbxref_id>
          <dbxref>
            <db_id>internal</db_id>
            <accession>restriction--part_of--spermatocyte</accession>
          </dbxref>
        </dbxref_id>
 
        <!-- note: as this is an anon term, the name will never
             be shown to a user -->
        <name>restriction--part_of--spermatocyte</name>
        <cv_id>anonymous_cv</cv_id>
        <cvtermprop>
          <type_id>is_anonymous</type_id>
          <value>1</value>
          <rank>0</rank>
        </cvtermprop>
        <cvterm_relationship>
          <type_id>OBO_REL:part_of</type_id>
          <object_id>CL__spermatocyte</object_id>
        </cvterm_relationship>
      </cvterm>
 
    </object_id>
  </cvterm_relationship>

The above assumes XORT macro IDs defined for GO__plasma_membrane and CL__spermatocyte.

Allowing post-coordinated terms places a greater burden on applications that use the cv module. More documentation will be provided here on this.


Tables

Table: cv

A controlled vocabulary or ontology. A cv is composed of cvterms (AKA terms, classes, types, universals - relations and properties are also stored in cvterm) and the relationships between them.

cv Structure
F-Key | Name       | Type                   | Description
------+------------+------------------------+------------
      | cv_id      | serial                 | PRIMARY KEY
      | name       | character varying(255) | UNIQUE NOT NULL. The name of the ontology. This corresponds to the obo-format -namespace-. cv names uniquely identify the cv. In OBO file format, the cv.name is known as the namespace.
      | definition | text                   | A text description of the criteria for membership of this ontology.




Table: cvterm

A term, class, universal or type within an ontology or controlled vocabulary. This table is also used for relations and properties. cvterms constitute nodes in the graph defined by the collection of cvterms and cvterm_relationships.

cvterm Structure
F-Key  | Name                | Type                    | Description
-------+---------------------+-------------------------+------------
       | cvterm_id           | serial                  | PRIMARY KEY
cv     | cv_id               | integer                 | UNIQUE#1 NOT NULL. The cv or ontology or namespace to which this cvterm belongs.
       | name                | character varying(1024) | UNIQUE#1 NOT NULL. A concise human-readable name or label for the cvterm. Uniquely identifies a cvterm within a cv.
       | definition          | text                    | A human-readable text definition.
dbxref | dbxref_id           | integer                 | UNIQUE NOT NULL. Primary identifier dbxref - The unique global OBO identifier for this cvterm. Note that a cvterm may have multiple secondary dbxrefs - see also table: cvterm_dbxref.
       | is_obsolete         | integer                 | UNIQUE#1 NOT NULL. Boolean 0=false,1=true; see GO documentation for details of obsoletion. Note that two terms with different primary dbxrefs may exist if one is obsolete.
       | is_relationshiptype | integer                 | NOT NULL. Boolean 0=false,1=true; relations or relationship types (also known as Typedefs in OBO format, or as properties or slots) form a cv/ontology in themselves. We use this flag to indicate whether this cvterm is an actual term/class/universal or a relation. Relations may be drawn from the OBO Relations ontology, but are not exclusively drawn from there.




Table: cvterm_dbxref

In addition to the primary identifier (cvterm.dbxref_id) a cvterm can have zero or more secondary identifiers/dbxrefs, which may refer to records in external databases. The exact semantics of cvterm_dbxref are not fixed. For example: the dbxref could be a pubmed ID that is pertinent to the cvterm, or it could be an equivalent or similar term in another ontology. For example, GO cvterms are typically linked to InterPro IDs, even though the nature of the relationship between them is largely one of statistical association. The dbxref may have data records attached in the same database instance, or it could be a "hanging" dbxref pointing to some external database. NOTE: If the desired objective is to link two cvterms together, and the nature of the relation is known and holds for all instances of the subject cvterm, then consider instead using cvterm_relationship together with a well-defined relation.

cvterm_dbxref Structure
F-Key  | Name              | Type    | Description
-------+-------------------+---------+------------
       | cvterm_dbxref_id  | serial  | PRIMARY KEY
cvterm | cvterm_id         | integer | UNIQUE#1 NOT NULL
dbxref | dbxref_id         | integer | UNIQUE#1 NOT NULL
       | is_for_definition | integer | NOT NULL. A cvterm.definition should be supported by one or more references. If this column is true, the dbxref is not for a term in an external database - it is a dbxref for provenance information for the definition.


Table: cvterm_relationship

A relationship linking two cvterms. Each cvterm_relationship constitutes an edge in the graph defined by the collection of cvterms and cvterm_relationships. The meaning of the cvterm_relationship depends on the definition of the cvterm R referred to by type_id. However, in general the definitions are such that the statement "all SUBJs REL some OBJ" is true. The cvterm_relationship statement is about the subject, not the object. For example "insect wing part_of thorax".

cvterm_relationship Structure
F-Key  | Name                   | Type    | Description
-------+------------------------+---------+------------
       | cvterm_relationship_id | serial  | PRIMARY KEY
cvterm | type_id                | integer | UNIQUE#1 NOT NULL. The nature of the relationship between subject and object. Note that relations are also housed in the cvterm table, typically from the OBO relationship ontology, although other relationship types are allowed.
cvterm | subject_id             | integer | UNIQUE#1 NOT NULL. The subject of the subj-predicate-obj sentence. The cvterm_relationship is about the subject. In a graph, this typically corresponds to the child node.
cvterm | object_id              | integer | UNIQUE#1 NOT NULL. The object of the subj-predicate-obj sentence. The cvterm_relationship refers to the object. In a graph, this typically corresponds to the parent node.


Table: cvtermpath

The reflexive transitive closure of the cvterm_relationship relation.

cvtermpath Structure
F-Key  | Name          | Type    | Description
-------+---------------+---------+------------
       | cvtermpath_id | serial  | PRIMARY KEY
cvterm | type_id       | integer | UNIQUE#1. The relationship type that this is a closure over. If null, then this is a closure over ALL relationship types. If non-null, then this references a relationship cvterm - note that the closure will apply to both this relationship AND the OBO_REL:is_a (subclass) relationship.
cvterm | subject_id    | integer | UNIQUE#1 NOT NULL
cvterm | object_id     | integer | UNIQUE#1 NOT NULL
cv     | cv_id         | integer | NOT NULL. Closures will mostly be within one cv. If the closure of a relationship traverses a cv, then this refers to the cv of the object_id cvterm.
       | pathdistance  | integer | UNIQUE#1. The number of steps required to get from the subject cvterm to the object cvterm, counting from zero (reflexive relationship).


Table: cvtermprop

Additional extensible properties can be attached to a cvterm using this table. Corresponds to -AnnotationProperty- in W3C OWL format.

cvtermprop Structure
F-Key  | Name          | Type    | Description
-------+---------------+---------+------------
       | cvtermprop_id | serial  | PRIMARY KEY
cvterm | cvterm_id     | integer | UNIQUE#1 NOT NULL
cvterm | type_id       | integer | UNIQUE#1 NOT NULL. The name of the property or slot is a cvterm. The meaning of the property is defined in that cvterm.
       | value         | text    | UNIQUE#1 NOT NULL DEFAULT ''::text. The value of the property, represented as text. Numeric values are converted to their text representation.
       | rank          | integer | UNIQUE#1 NOT NULL. Property-Value ordering. Any cvterm can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.


Table: cvtermsynonym

A cvterm actually represents a distinct class or concept. A concept can be referred to by different phrases or names. In addition to the primary name (cvterm.name) there can be a number of alternative aliases or synonyms. For example, "T cell" as a synonym for "T lymphocyte".

cvtermsynonym Structure
F-Key  | Name             | Type                    | Description
-------+------------------+-------------------------+------------
       | cvtermsynonym_id | serial                  | PRIMARY KEY
cvterm | cvterm_id        | integer                 | UNIQUE#1 NOT NULL
       | synonym          | character varying(1024) | UNIQUE#1 NOT NULL
cvterm | type_id          | integer                 | A synonym can be exact, narrower than, or broader than the term.


Table: dbxrefprop

Metadata about a dbxref. Note that this is not defined in the dbxref module, as it depends on the cvterm table. This table has a structure analogous to cvtermprop.

dbxrefprop Structure
F-Key  | Name          | Type    | Description
-------+---------------+---------+------------
       | dbxrefprop_id | serial  | PRIMARY KEY
dbxref | dbxref_id     | integer | UNIQUE#1 NOT NULL
cvterm | type_id       | integer | UNIQUE#1 NOT NULL
       | value         | text    | NOT NULL DEFAULT ''::text
       | rank          | integer | UNIQUE#1 NOT NULL