Chado CV Module

From GMOD
Revision as of 17:05, 14 February 2007 by Bosborne (Talk | contribs)

Introduction

We have seen how the sequence module makes extensive use of terms taken from various ontologies such as SO and the OBO Relations Ontology, using the type_id foreign key column. In addition, features can be annotated using ontologies such as GO using the feature_cvterm linking table. These terms are modelled using the cv module, the core of which is the cvterm table.

An ontology, terminology or cv (controlled vocabulary) is a collection of terms (here equivalent to what are more typically called classes, types, categories or kinds in the ontology literature [REF]) in a particular domain of interest. Examples include "gene" (from SO), "transcription factor activity" (from GO molecular function) and "lymphocyte" (from OBO-Cell). The Chado cv module is based on the GO Database schema, described here [14]. Terms are stored in the cvterm table, and relationships between terms are stored in the cvterm_relationship table. This table has a structure analogous to the feature_relationship table, in that it has columns subject_id, object_id and type_id. Here, all three of these foreign keys refer to rows in the cvterm table.

A detailed treatment of relationship types in biological ontologies can be found here [13]. Of particular interest to Chado is the is_a relation, which specifies a sub-typing relationship between two terms or classes. Recall that tables in the sequence module (such as the feature table) frequently define a type_id foreign key column to indicate the specific type or class of entity for each row in that table. The combination of the type_id column and the is_a relationship gives Chado a data sub-classing system, beyond what is possible with traditional SQL database semantics.

This is discussed further in a later section. The collection of cvterms and cvterm_relationships can be considered to constitute the vertices and edges of a graph. This graph is typically acyclic (a DAG), though this is not guaranteed, as certain relationship types are allowed to form cycles.
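As a minimal sketch of the sub-classing idea, the following Python snippet builds a heavily simplified stand-in for the cvterm, cvterm_relationship and feature tables in an in-memory SQLite database. The reduced column set, the feature names and the single "mRNA is_a transcript" edge are assumptions made for illustration; they are not the real Chado DDL or the real SO hierarchy.

```python
import sqlite3

# Simplified, hypothetical versions of three Chado tables (not the real DDL).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE cvterm_relationship (
    subject_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
    object_id  INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
    type_id    INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    type_id    INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
INSERT INTO cvterm VALUES (1,'is_a'),(2,'transcript'),(3,'mRNA'),(4,'exon');
INSERT INTO cvterm_relationship VALUES (3, 2, 1);  -- mRNA is_a transcript
INSERT INTO feature VALUES (10,'feat-A',3),(11,'feat-B',4),(12,'feat-C',2);
""")

def features_of_type(type_name):
    """Names of features typed with type_name or with a direct is_a child of it."""
    sql = """
    SELECT f.name FROM feature f
    WHERE f.type_id IN (
        SELECT cvterm_id FROM cvterm WHERE name = :t
        UNION
        SELECT r.subject_id
        FROM cvterm_relationship r
        JOIN cvterm rel ON rel.cvterm_id = r.type_id AND rel.name = 'is_a'
        JOIN cvterm obj ON obj.cvterm_id = r.object_id AND obj.name = :t
    )
    ORDER BY f.name
    """
    return [row[0] for row in con.execute(sql, {"t": type_name})]

print(features_of_type("transcript"))
```

The query only follows one is_a step; following chains of arbitrary length is what the cvtermpath table is for.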


SO Term SO id
Exon SL:0000025
Intron SL:0000027
mRNA SL:0000037
3_utr SL:0000029
5_utr SL:0000028
noncoding_transcript SL:0000040
miRNA_precursor SL:0000043
miRNA SL:0000044
rRNA SL:0000042
tRNA SL:0000041
regulatory_element SL:0000052
transcription_factor_binding_site SL:0000054
Clone SL:0000050
genbank_entry SL:0000061
Match SL:0000008
nucleotide_match SL:0000018
cross_genome_match SL:0000074
EST_match SL:0000021
mRNA_match SL:0000022
protein_match SL:0000020
translated_nucleotide_match SL:0000019
Pseudogene SL:0000007
repeat_region SL:0000002
direct_repeats SL:0000005
inverted_repeat SL:0000004
repeat_family SL:0000006
retrotransposon SL:0000014
Transposon SL:0000012
tandem_repeat SL:0000003
microsatellite SL:0000013
Remark SL:0000062
Variant SL:0000065
deletion SL:0000067
insertion SL:0000066
SNP SL:0000072



Transitive Closure
Rules

The cvtermpath table is for storing the reflexive transitive closure of a relationship, and any derived relationships.

Normal (direct) relationships are stored in the cvterm_relationship table. An entry in this table represents a cvterm_relationship S over some relation R.

S = Subj R Obj

For example:

S = "cardioblast" develops_from "mesodermal cell"

The relation is_a represents a special kind of relation: subsumption, or inheritance.

If X is_a Y, then it follows that all of Y's cvterm_relationship statements are inherited by X.

[Rule 1] If X is_a Y and Y R Z then X R(inh) Z

For example:

 "cilium axoneme" is_a "axoneme"
 "axoneme" part_of "cell projection"

THEREFORE:

 "cilium axoneme" part_of(inh) "cell projection"

Here we use R(inh) to represent an inherited relationship over the relation R. The rules below write the same thing as T(inh) for a relation T.
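A single application of Rule 1 can be sketched in a few lines of Python. The triple layout (subject, relation, object) and the function name are assumptions chosen for this illustration only.

```python
def apply_rule_1(edges):
    """One application of Rule 1: X is_a Y and Y R Z gives X R(inh) Z.

    edges is an iterable of (subject, relation, object) triples;
    the result is the set of inherited triples, written (X, R, Z).
    """
    edges = list(edges)
    is_a_pairs = [(s, o) for s, r, o in edges if r == "is_a"]
    return {
        (x, r, z)
        for x, y in is_a_pairs
        for s, r, z in edges
        if s == y
    }

edges = [
    ("cilium axoneme", "is_a", "axoneme"),
    ("axoneme", "part_of", "cell projection"),
]
print(apply_rule_1(edges))
```

For the two statements in the example, the only inherited statement is the part_of relationship between "cilium axoneme" and "cell projection".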


Populating cvtermpath

The cvtermpath table stores the reflexive transitive closure of a relationship, taking into account subsumption/inheritance. The number of intermediate relationships is represented in the pathdistance column of the table.

Here we use T(path) to represent the 'path' or closure of a relationship. Every T(path) is stored in cvtermpath. We use the same cvterm for T; the fact that it is a path is implicit.

We use these rules:

Reflexive relationships: for all relations T,

 X T(path) X

In this case the distance=0.

Direct relationships:

these are also included in the cvtermpath table, with distance=1.

 If X T Y
 Then X T(path) Y

Transitive relationships:

these have distance > 1; they also make use of the inheritance rule, [Rule 1], which gives us T(inh).

 If X T(inh) Y and Y T(path) Z
 Then X T(path) Z

Note that this rule is recursive.

These rules should be used for populating cvtermpath. Attempting to calculate a more general closure where all relations are treated equally or ignored will produce combinatorial explosions over certain ontologies (e.g. the FlyBase anatomy ontology).
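The rules above can be sketched as a breadth-first search, one relation at a time. This is an illustration under stated assumptions, not the loader Chado actually uses: T(inh) is taken to mean a direct T edge optionally preceded by any number of is_a edges, the distance is the number of edges on the shortest qualifying path, and the function name and triple layout are hypothetical.

```python
from collections import deque

IS_A = "is_a"

def closure(edges, rel):
    """Reflexive transitive closure of rel, returned as {(subject, object): distance}.

    A path qualifies when it is empty (reflexive case) or when it is made of
    is_a and rel edges and ends on a rel edge, so different non-is_a
    relations are never mixed in one path.
    """
    by_subject, nodes = {}, set()
    for s, r, o in edges:
        nodes.update((s, o))
        if r == rel or r == IS_A:
            by_subject.setdefault(s, []).append((r, o))

    result = {}
    for start in nodes:
        # A search state is (node, complete); complete is True when the
        # path walked so far is empty or ends on a rel edge.
        seen = {(start, True)}
        queue = deque([(start, True, 0)])
        while queue:
            node, complete, dist = queue.popleft()
            if complete:
                result.setdefault((start, node), dist)
            for r, o in by_subject.get(node, ()):
                state = (o, r == rel)
                if state not in seen:
                    seen.add(state)
                    queue.append((o, r == rel, dist + 1))
    return result

edges = [
    ("cilium axoneme", "is_a", "axoneme"),
    ("axoneme", "part_of", "cell projection"),
]
part_of_paths = closure(edges, "part_of")
is_a_paths = closure(edges, "is_a")
```

Each entry of the returned dictionary would correspond to one cvtermpath row for the chosen type_id. Because visited states are tracked, the search also terminates on graphs that contain cycles.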

What does this mean in practice?

For a typical database, which may only have the relations is_a, part_of and develops_from, we will end up with 3 sets of paths.

The is_a closure, is_a(path), will include paths over cvterm_relationships that look like this:

a is_a b is_a c is_a d is_a e

The part_of closure, part_of(path), will include paths over cvterm_relationships that look like this:

a is_a b part_of c part_of d is_a e part_of f

The develops_from closure, develops_from(path), will include paths over cvterm_relationships that look like this:

a develops_from b develops_from c is_a d is_a e develops_from f

It may be tempting to mix different non-is_a relationships in the same path, but this should NEVER be done: there will be an unacceptable combinatorial explosion in many cases. Besides, there is no use for such a cvtermpath; it is meaningless.

Note that for AmiGO-like query behaviour, it is only necessary to query cvtermpath, ignoring cvtermpath.type_id (these are obtained by querying cvterm_relationship).
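One possible reading of this note is a query that collects everything below a term regardless of which relation the path was computed over. The sketch below uses an in-memory SQLite table with text identifiers for readability; the real cvtermpath columns are integer foreign keys, and the sample rows are made up for the example.

```python
import sqlite3

# Hypothetical, simplified cvtermpath table with readable text ids.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cvtermpath (
    subject_id TEXT, object_id TEXT, type_id TEXT, pathdistance INTEGER
);
INSERT INTO cvtermpath VALUES
    ('axoneme', 'axoneme', 'part_of', 0),
    ('axoneme', 'cell projection', 'part_of', 1),
    ('cilium axoneme', 'cell projection', 'part_of', 2),
    ('cilium axoneme', 'axoneme', 'is_a', 1),
    ('axoneme', 'axoneme', 'is_a', 0);
""")

def terms_under(term):
    """All subjects with any path to term, ignoring type_id, nearest first."""
    rows = con.execute(
        "SELECT subject_id, MIN(pathdistance) FROM cvtermpath "
        "WHERE object_id = ? GROUP BY subject_id ORDER BY 2, 1",
        (term,),
    )
    return rows.fetchall()

print(terms_under("cell projection"))
```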


3.2 Table Definitions


cv


A controlled vocabulary or ontology. A cv is composed of cvterms (aka terms, classes, types, universals; relations and properties are also stored in cvterm) and the relationships between them.


Table 3.1: cv
 Column      Datatype  Description
 cv_id       integer
 name        varchar   The name of the ontology. This corresponds to the obo-format namespace. cv names uniquely identify the cv. In obo file format, the cv.name is known as the namespace
 definition  text      A text description of the criteria for membership of this ontology

cvterm


A term, class, universal or type within an ontology or controlled vocabulary. This table is also used for relations and properties. cvterms constitute nodes in the graph defined by the collection of cvterms and cvterm_relationships.


Table 3.2: cvterm
 Column               Datatype  Description
 cvterm_id            integer
 cv_id                integer   The cv/ontology/namespace to which this cvterm belongs
 name                 varchar   A concise human-readable name or label for the cvterm. Uniquely identifies a cvterm within a cv
 definition           text      A human-readable text definition
 dbxref_id            integer   Primary identifier dbxref: the unique global OBO identifier for this cvterm. Note that a cvterm may have multiple secondary dbxrefs; see also table cvterm_dbxref
 is_obsolete          integer   Boolean 0=false, 1=true; see GO documentation for details of obsoletion. Note that two terms with different primary dbxrefs may exist if one is obsolete
 is_relationshiptype  integer   Boolean 0=false, 1=true. Relations or relationship types (also known as Typedefs in OBO format, or as properties or slots) form a cv/ontology in themselves. We use this flag to indicate whether this cvterm is an actual term/class/universal or a relation. Relations may be drawn from the OBO Relations ontology, but are not exclusively drawn from there


cvterm_relationship


A name can mean different things in different contexts; for example "chromosome" in SO and GO. A name should be unique within an ontology/cv. A name may exist twice in a cv, in both obsolete and non-obsolete forms; these will be for different cvterms with different OBO identifiers (see GO documentation for more details on obsoletion). Note that occasionally multiple obsolete terms with the same name will exist in the same cv. If this is a possibility for the ontology under consideration (e.g. GO) then the ID should be appended to the name to ensure uniqueness.


Table 3.3: cvterm_relationship
 Column                  Datatype  Description
 cvterm_relationship_id  integer
 type_id                 integer   The nature of the relationship between subject and object. Note that relations are also housed in the cvterm table, typically from the OBO relationship ontology, although other relationship types are allowed
 subject_id              integer   The subject of the subj-predicate-obj sentence. The cvterm_relationship is about the subject. In a graph, this typically corresponds to the child node
 object_id               integer   The object of the subj-predicate-obj sentence. The cvterm_relationship refers to the object. In a graph, this typically corresponds to the parent node


cvtermpath


The reflexive transitive closure of the cvterm_relationship relation. For a full discussion, see the file populating-cvtermpath.txt in this directory.


Table 3.4: cvtermpath
 Column         Datatype  Description
 cvtermpath_id  integer
 type_id        integer   The relationship type that this is a closure over. If null, then this is a closure over ALL relationship types. If non-null, then this references a relationship cvterm; note that the closure will apply to both this relationship AND the OBO_REL:is_a (subclass) relationship
 subject_id     integer
 object_id      integer
 cv_id          integer   Closures will mostly be within one cv. If the closure of a relationship traverses a cv, then this refers to the cv of the object_id cvterm
 pathdistance   integer   The number of steps required to get from the subject cvterm to the object cvterm, counting from zero (reflexive relationship)


cvtermsynonym


A cvterm actually represents a distinct class or concept. A concept can be referred to by different phrases or names. In addition to the primary name (cvterm.name) there can be a number of alternative aliases or synonyms. For example, "T cell" as a synonym for "T lymphocyte".


Table 3.5: cvtermsynonym
 Column            Datatype  Description
 cvtermsynonym_id  integer
 cvterm_id         integer
 synonym           varchar
 type_id           integer   A synonym can be exact, narrow or broader than


cvterm_dbxref


In addition to the primary identifier (cvterm.dbxref_id) a cvterm can have zero or more secondary identifiers/dbxrefs, which may refer to records in external databases. The exact semantics of cvterm_dbxref are not fixed. For example: the dbxref could be a PubMed ID that is pertinent to the cvterm, or it could be an equivalent or similar term in another ontology. For example, GO cvterms are typically linked to InterPro IDs, even though the nature of the relationship between them is largely one of statistical association. The dbxref may have data records attached in the same database instance, or it could be a "hanging" dbxref pointing to some external database. NOTE: If the desired objective is to link two cvterms together, and the nature of the relation is known and holds for all instances of the subject cvterm, then consider instead using cvterm_relationship together with a well-defined relation.


Table 3.6: cvterm_dbxref
 Column             Datatype  Description
 cvterm_dbxref_id   integer
 cvterm_id          integer
 dbxref_id          integer
 is_for_definition  integer   A cvterm.definition should be supported by one or more references. If this column is true, the dbxref is not for a term in an external db; it is a dbxref for provenance information for the definition


cvtermprop


Additional extensible properties can be attached to a cvterm using this table. Corresponds to AnnotationProperty in W3C OWL format.


Table 3.7: cvtermprop
 Column         Datatype  Description
 cvtermprop_id  integer
 cvterm_id      integer
 type_id        integer   The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm
 value          text      The value of the property, represented as text. Numeric values are converted to their text representation
 rank           integer   Property-Value ordering. Any cvterm can have multiple values for any particular property type; these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used
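The role of rank can be illustrated with a small Python helper that groups the property values of one cvterm and orders them by rank. The tuple layout and the "comment" property type are assumptions for this sketch; type_id would be an integer reference to a cvterm in a real database.

```python
from collections import defaultdict

def property_values(rows, cvterm_id):
    """Group the property values of one cvterm by type, ordered by rank.

    rows is an iterable of (cvterm_id, type, value, rank) tuples.
    """
    grouped = defaultdict(list)
    for term, prop_type, value, rank in rows:
        if term == cvterm_id:
            grouped[prop_type].append((rank, value))
    # Sorting the (rank, value) pairs puts rank 0 first.
    return {t: [v for _, v in sorted(pairs)] for t, pairs in grouped.items()}

rows = [
    (1, "comment", "second", 1),
    (1, "comment", "first", 0),
    (1, "is_anonymous", "1", 0),   # single-valued: default rank 0
    (2, "comment", "other", 0),
]
print(property_values(rows, 1))
```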


dbxrefprop


Metadata about a dbxref. Note that this is not defined in the dbxref module, as it depends on the cvterm table. This table has a structure analogous to cvtermprop.


Table 3.8: dbxrefprop
 Column         Datatype  Description
 dbxrefprop_id  integer
 dbxref_id      integer
 type_id        integer
 value          text
 rank           integer


organism


The organismal taxonomic classification. Note that phylogenies are represented using the phylogeny module, and taxonomies can be represented using the cvterm module or the phylogeny module.


Table 3.9: organism
 Column        Datatype  Description
 organism_id   integer
 abbreviation  varchar
 genus         varchar
 species       varchar   A type of organism is always uniquely identified by genus+species. When mapping from the NCBI taxonomy names.dmp file, the unique-name column must be used where it is present, as the name column is not always unique (e.g. environmental samples). If a particular strain or subspecies is to be represented, this is appended onto the species name. Follows standard NCBI taxonomy pattern
 common_name   varchar
 comment       text


organism_dbxref


Table 3.10: organism_dbxref
 Column              Datatype  Description
 organism_dbxref_id  integer
 organism_id         integer
 dbxref_id           integer


organismprop


Tag-value properties; follows the standard Chado model.


Table 3.11: organismprop
 Column           Datatype  Description
 organismprop_id  integer
 organism_id      integer
 type_id          integer
 value            text
 rank             integer


This section describes advanced usage of the cv module, for use with OWL-DL, advanced Obo format 1.2 [REF] features, or elements from other ontology formalisms.

If you aren't sure what this means, you probably don't need to read this section yet.

Background

See the document on ConvertingOboToOWL.

Logical definitions

In a normal ontology DAG representation in Chado, the cvterm_relationship rows represent relationships between terms, or more formally, necessary conditions. A logical definition must have both necessary and sufficient conditions. A logical definition often consists of a generic term (aka genus) and one or more discriminating characteristics (aka differentiae). The discriminating characteristics are typically relationships.

For example, the logical definition of larval locomotory behavior would be a locomotory behavior (genus) which is during the larval stage (where during could be drawn from an ontology of relations, and larval stage may come from an insect developmental stage ontology). These constitute both necessary and sufficient conditions: the conditions are necessary in that all instances of larval locomotory behavior are necessarily locomotory behaviors and are necessarily manifested at the larval stage. We could represent this using a normal DAG. However, because this is a definition it also constitutes sufficient conditions, in that any instance of locomotory behavior which manifests at the larval stage is by definition a larval locomotory behavior.

In an ontology formalism like OWL-DL or Obo-1.2, genus-differentiae are represented using set-intersections.

Here is the Obo 1.2 representation:

 [Term]
 id: GO:0008345
 name: larval locomotory behavior
 namespace: biological_process
 is_a: GO:0007626 ! locomotory behavior
 is_a: GO:0030537 ! larval behavior
 intersection_of: GO:0007626 ! GENUS: locomotory behavior
 intersection_of: during FBdv:00005336 ! DIFFERENTIUM: during larval stage
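The intersection_of tags of a stanza like the one above can be pulled apart mechanically. The following Python sketch is not how go-perl parses Obo files; it is a hypothetical helper that assumes one tag per line and treats a single identifier as the genus and a relation plus identifier as a differentium.

```python
OBO_STANZA = """\
[Term]
id: GO:0008345
name: larval locomotory behavior
namespace: biological_process
is_a: GO:0007626 ! locomotory behavior
is_a: GO:0030537 ! larval behavior
intersection_of: GO:0007626 ! GENUS: locomotory behavior
intersection_of: during FBdv:00005336 ! DIFFERENTIUM: during larval stage
"""

def logical_definition(stanza):
    """Return (genus, [(relation, target), ...]) from the intersection_of tags."""
    genus, differentiae = None, []
    for line in stanza.splitlines():
        tag, sep, rest = line.partition(": ")
        if not sep or tag != "intersection_of":
            continue
        # Drop the trailing "! comment" and split what remains on whitespace.
        parts = rest.split("!", 1)[0].split()
        if len(parts) == 1:
            genus = parts[0]
        elif len(parts) == 2:
            differentiae.append((parts[0], parts[1]))
    return genus, differentiae

genus, differentiae = logical_definition(OBO_STANZA)
```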

Here is the equivalent in OWL (note: RDF-XML syntax is very verbose!):

 <owl:Class rdf:ID="GO_0008345">
   <rdfs:label xml:lang="en">larval locomotory behavior</rdfs:label>
   <rdfs:subClassOf rdf:resource="#GO_0007626"/>
   <rdfs:subClassOf rdf:resource="#GO_0030537"/>
   <owl:equivalentClass>
     <owl:Class>
       <owl:intersectionOf rdf:parseType="Collection">
         <owl:Class rdf:about="#GO_0007626"/>
         <owl:Restriction>
           <owl:onProperty>
             <owl:ObjectProperty rdf:about="#during"/>
           </owl:onProperty>
           <owl:someValuesFrom rdf:resource="#FBdv_00005336"/>
         </owl:Restriction>
       </owl:intersectionOf>
     </owl:Class>
   </owl:equivalentClass>
 </owl:Class>

When converting to chado we employ a more economical representation, in terms of the number of triples we use:

 <cvterm_relationship>
   <type_id>is_a</type_id>
   <subject_id>GO:0008345</subject_id>
   <object_id>GO:0007626</object_id>
 </cvterm_relationship>
 <cvterm_relationship>
   <type_id>is_a</type_id>
   <subject_id>GO:0008345</subject_id>
   <object_id>GO:0030537</object_id>
 </cvterm_relationship>
 <cvterm_relationship>
   <type_id>intersection_of</type_id>
   <subject_id>GO:0008345</subject_id>
   <object_id>GO:0007626</object_id>
 </cvterm_relationship>
 <cvterm_relationship>
   <type_id>intersection_of</type_id>
   <subject_id>GO:0008345</subject_id>
   <object_id>
     <cvterm>
       <dbxref_id>
         <dbxref>
           <db_id>internal</db_id>
           <accession>restriction--OBOL:during--GO:0008345</accession>
         </dbxref>
       </dbxref_id>
       <name>restriction--OBOL:during--GO:0008345</name>
       <cv_id>anonymous_cv</cv_id>
       <cvtermprop>
         <type_id>is_anonymous</type_id>
         <value>1</value>
         <rank>0</rank>
       </cvtermprop>
       <cvterm_relationship>
         <type_id>OBOL:during</type_id>
         <object_id>FBdv:00005336</object_id>
       </cvterm_relationship>
     </cvterm>
   </object_id>
 </cvterm_relationship>

Note that in the above, we are creating anonymous terms. We give them fake names and fake dbxrefs. In the bbop-experimental cvs branch of chado, names and dbxrefs are nullable, so these can be omitted. With the current schema, you must provide fake dbxrefs and names that are unique, such as the above (if you are not familiar with how ChadoXML maps to the chado schema, see the explanation below).

If you wish to convert Obo-specified logical definitions to chadoxml you will need go-perl v0.05 or higher (if you have a lower version, the intersection_of tags will simply be ignored).

go2chadoxml ont.obo > ont.chado

How logical definitions are stored in Chado

This involves no schema changes to the cv module. Each intersection_of goes in as a DAG arc of type internal:intersection_of. The object_id in the arc is either a term (for the genus) or an anonymous term representing a restriction (the differentium). The restriction has a relationship of some type to another term.

For example, for "larval locomotory behavior" we would normally just have:

 LLB is_a LocomotoryBehavior
 LLB is_a LarvalBehavior

If we load a logical definition for this term (see go-dev/go-perl/t/data/llm/obo), like this:

 [Term]
 id: GO:0008345
 name: larval locomotory behavior
 namespace: biological_process
 is_a: GO:0007626 ! locomotory behavior
 is_a: GO:0030537 ! larval behavior
 intersection_of: GO:0007626 ! locomotory behavior
 intersection_of: during FBdv:00005336 ! larval stage

Then the intersection_ofs get stored using the basic DAG tables as:

Table: Logical definition stored in the cvterm_relationship table
 Subject   Relation         Object
 LLB       intersection_of  LocomotoryBehavior
 LLB       intersection_of  anon:xxx
 anon:xxx  during           FBdv:00005336

This uses 4 cvterm_relationships and the creation of a new anonymous term that is never shown directly to the user. The anonymous term represents the class of things that happen during the larval stage.
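The mapping from a logical definition to intersection_of arcs can be sketched as a small Python function. The function name and the anon:N naming scheme are assumptions for illustration; with the current schema the anonymous terms would need unique fake names and dbxrefs as described earlier, and the ordinary is_a arcs are stored separately.

```python
from itertools import count

def intersection_rows(term, genus, differentiae, anon_ids=None):
    """Build (subject, relation, object) rows for one logical definition.

    The genus becomes a plain intersection_of arc. Each differentium gets
    an anonymous restriction term, which carries the actual relation.
    """
    if anon_ids is None:
        anon_ids = (f"anon:{n}" for n in count(1))  # hypothetical id scheme
    rows = [(term, "intersection_of", genus)]
    for relation, target in differentiae:
        anon = next(anon_ids)
        rows.append((term, "intersection_of", anon))
        rows.append((anon, relation, target))
    return rows

rows = intersection_rows(
    "GO:0008345", "GO:0007626", [("during", "FBdv:00005336")]
)
for row in rows:
    print(row)
```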

Logical Definition Views

Two views, cvterm_genus and cvterm_differentium, are in chado/modules/cv/views.

Example use case: Phenotypes

The idea here is that queries for composed term "syndactyly" should automatically return the same results as a boolean query for "fusion"+inheres_in="finger" regardless of whether the annotation is to the composed term or is a composed annotation (provided we put the logical definition of syndactyly in the database)

Example use case: feature types

The Sequence Ontology has some logical definitions; you will need to load the file so-xp.obo.

Example use case: GO

See http://www.fruitfly.org/~cjm/obol

Example use case: Drawing DAGs

Currently the DAGs of many OBO ontologies are highly tangled; see: http://www.fruitfly.org/~cjm/obol/doc/go-complexity.html

If all terms have logical definitions, then there is only one 'true' (genus) is_a parent. This enables us to disentangle the DAGs and draw distinct hierarchies. For example, the GO term cysteine biosynthesis could be drawn as two distinct hierarchies: one process and one chemical.

Loading OWL into Chado

Not all OWL-DL features are supported. Only intersectionOfs corresponding to genus-differentiae are loaded.

First you must convert OWL into Obo 1.2 format. There will soon be a way to do this in OboEdit. For now you can use blipkit (http://www.blipkit.org).

blip io-convert my.owl -to obo -o my.obo

Once you have an obo file you can run go2chadoxml, as above.

Post-coordinating terms

Sometimes we want to be able to refer to a term such as plasma membrane of spermatocyte, but no such term exists in the ontology. Introducing these as pre-coordinated cross-product terms would make the ontology unwieldy.

Chado allows the post-coordination or post-composition of terms using the same formalism as described above. Briefly: we would create an anonymous term. This anonymous term would be defined using the terms plasma membrane and spermatocyte, using a genus-differentia definition as above.

 <cvterm_relationship>
   <type_id>intersection_of</type_id>
   <subject_id>anon_1</subject_id>
   <object_id>GO__plasma_membrane</object_id>
 </cvterm_relationship>
 <cvterm_relationship>
   <type_id>intersection_of</type_id>
   <subject_id>anon_1</subject_id>
   <object_id>
     <cvterm>
       <dbxref_id>
         <dbxref>
           <db_id>internal</db_id>
           <accession>restriction--part_of--spermatocyte</accession>
         </dbxref>
       </dbxref_id>
       <name>restriction--part_of--spermatocyte</name>
       <cv_id>anonymous_cv</cv_id>
       <cvtermprop>
         <type_id>is_anonymous</type_id>
         <value>1</value>
         <rank>0</rank>
       </cvtermprop>
       <cvterm_relationship>
         <type_id>OBO_REL:part_of</type_id>
         <object_id>CL__spermatocyte</object_id>
       </cvterm_relationship>
     </cvterm>
   </object_id>
 </cvterm_relationship>

The above assumes that XORT macro IDs are defined for GO__plasma_membrane and CL__spermatocyte.


Allowing post-coordinated terms places a greater burden on applications that use the cv module. More documentation will be provided here on this.