IGS Data Representation

From GMOD
Jump to: navigation, search

Chado is an elegant schema that can hold nearly anything from gene annotations to an MP3 collection. This fabulous flexibility comes with a price - different MODs arrive at different ways of storing the same biological information. This page is not meant to be a tutorial of how YOU should model your biological information in Chado. Rather, it is simply a brain dump of the way we are doing things at IGS, for better or worse. The hope is that, where our encoding is different, we can converge on a single community-defined standard.

The reference document is currently the Chado Best Practices page, into which much of this information may become merged at some point.

What we store (scope)

We currently use the Chado schema primarily to store genome annotation data, including comparative genomics. This includes both read-only databases from 'finished' annotations and ongoing, actively modified data. Prokaryotes and Eukaryotes are both represented in our datasets and we use MySQL and PostgreSQL back-ends. We've started using Oracle a bit and have essentially abandoned Sybase usage.

In terms of tools, the data representations discussed below are generated by custom scripts as well as components in Ergatis. They are read/edited by Manatee for manual curation, Sybil for comparative displays and several web applications such as Gemina and Pathema.

Feature naming convention

The de novo features we store in Chado follow the same naming convention, which is db.feature_type.N.M, where 'db' is an abbreviation for a database or project (usually an organism), feature_type corresponds to a value in cvterm.name, N is an incrementing integer scoped to that feature type, and M is the version number of that feature. So the second version of a gene from our Aspergillus fumigatus annotation project may have a name like: afu.gene.3320.2

Structural annotation

Our structural representation differs quite a bit from the current recommended implementation. The canonical gene model below illustrates this. For now I'll only touch on structural annotation, which is beginning to be covered well on the Chado Best Practices page and instead focus on functional annotation, which isn't.

Gene models

Canonical gene model

The following query shows all the features in our gene graph as well as their relationships. The example query is for a transcript feature 'hsn.transcript.39176.1'

    SELECT f1.name AS subject, c.name AS relationship, f2.name AS object
      FROM feature f1
          JOIN feature_relationship fr ON f1.feature_id = fr.subject_id
          JOIN feature f2 ON fr.object_id = f2.feature_id
          JOIN cvterm c ON fr.type_id = c.cvterm_id
     WHERE f1.uniquename = 'hsn.transcript.39176.1'
        OR f2.uniquename = 'hsn.transcript.39176.1';
 
    +-------------------------+--------------+------------------------+
    | subject                 | relationship | object                 |
    +-------------------------+--------------+------------------------+
    | hsn.transcript.39176.1  | derives_from | hsn.gene.39416.1       |
    | hsn.polypeptide.39176.1 | part_of      | hsn.transcript.39176.1 |
    | hsn.CDS.39416.1         | derives_from | hsn.transcript.39176.1 |
    | hsn.exon.39416.1        | part_of      | hsn.transcript.39176.1 |
    +-------------------------+--------------+------------------------+

Functional annotation

The assertions made in functional annotation differ greatly between MODs and annotation sources in general. Our minimal goal is to provide the following, whenever possible, for any given gene:

  • Gene product name
  • Gene symbol
  • GO terms (process, function, component)
  • Enzyme Commission (EC) number

Ideally, evidence should also be stored for each of these assertions. We currently attach all annotation to the 'transcript' feature in each gene graph.

Gene product name

Both gene product name and their evidence are stored as feature properties of the transcript. To query both:

SELECT f.uniquename, product.VALUE AS product, sym.VALUE AS symbol
  FROM feature f
      JOIN featureprop product ON f.feature_id = product.feature_id
      JOIN featureprop sym ON f.feature_id = sym.feature_id
      JOIN cvterm cvt ON product.type_id = cvt.cvterm_id
      JOIN cvterm cvt2 ON sym.type_id = cvt2.cvterm_id
 WHERE cvt.name = 'gene_product_name'
   AND cvt2.name = 'gene_product_name_source'
   AND f.uniquename = 'hsn.transcript.39176.1';
 
    +------------------------+--------------------+-----------+
    | uniquename             | product            | symbol    |
    +------------------------+--------------------+-----------+
    | hsn.transcript.39176.1 | cytidine deaminase | TIGR01354 |
    +------------------------+--------------------+-----------+

This means that the name of this product was assigned because of a hit to the HMM with ID TIGR01354. Ideally, this would be a dbxref and not simply the value of a featureprop, but we had to do it this way because there is no featureprop_dbxref table.

Gene symbol

Gene symbols and their evidence are stored just like the gene product names described above, except that the terms used are 'gene_symbol' and 'gene_symbol_source'.

GO terms

Here I'll discuss first how the collection of terms that makes up GO is stored in the Chado schema and then discuss how we assign these terms to features.

Representation of the GO ontology

The source of the data is an entry within an OBO file, such as:

   [Term]
   id: GO:0004126
   name: cytidine deaminase activity
   namespace: function
   def: "Catalysis of the reaction\: cytidine + H2O = uridine + NH3." [EC:3.5.4.5]
   xref_analog: EC:3.5.4.5
   xref_analog: MetaCyc:CYTIDEAM2-RXN
   xref_analog: Reactome:83524
   is_a: GO:0019239

GO is stored as a controlled vocabulary, which has an entry in the 'cv' table. Here we already run into a contentious point on how this should be entered. Should there be a single entry for 'GO' as a whole or three different ones, since it contains three separate namespaces (process, function, component)? For this example I'll use the single 'GO' entry:

    SELECT *
      FROM cv
     WHERE name = 'GO';
 
    +-------+------+------------+
    | cv_id | name | definition |
    +-------+------+------------+
    |    10 | GO   | NULL       |
    +-------+------+------------+

Whether to respect the namespaces within GO and create three distinct ontology entries is configurable within the initdb Ergatis component, which instantiates our Chado instances.

Next, there's a entry in 'cvterm' for this term but not by the GO:NNNNNNNN value. Instead, we can look it up by the name:

    SELECT cvterm_id, name, SUBSTRING(definition,1,20), dbxref_id, is_obsolete
      FROM cvterm
     WHERE cv_id = 10
       AND name = 'cytidine deaminase activity';
 
    +-----------+-----------------------------+----------------------------+-----------+-------------+
    | cvterm_id | name                        | SUBSTRING(definition,1,20) | dbxref_id | is_obsolete |
    +-----------+-----------------------------+----------------------------+-----------+-------------+
    |      6657 | cytidine deaminase activity | "Catalysis of the re       |     12389 |           0 |
    +-----------+-----------------------------+----------------------------+-----------+-------------+

The actual GO:NNNNNNN value is a database reference (dbxref_id returned in last query):

    SELECT *
      FROM dbxref
     WHERE dbxref_id = 12389;
 
    +-----------+-------+------------+---------+-------------+
    | dbxref_id | db_id | accession  | version | description |
    +-----------+-------+------------+---------+-------------+
    |     12389 |    12 | GO:0004126 | 1.0     | NULL        |
    +-----------+-------+------------+---------+-------------+

Which, of course, means there's a GO entry in the 'db' table too:

    SELECT *
      FROM db
     WHERE name = 'GO';
 
    +-------+------+-------------+-----------+------+
    | db_id | name | description | urlprefix | url  |
    +-------+------+-------------+-----------+------+
    |    12 | GO   | NULL        | NULL      | NULL |
    +-------+------+-------------+-----------+------+

So, reviewing, to get the basic annotation for a GO term:

    SELECT dbx.accession, cvt.name, cvt.definition
      FROM cvterm cvt
        JOIN dbxref dbx ON cvt.dbxref_id = dbx.dbxref_id
     WHERE dbx.accession = 'GO:0004126';
 
    +------------+-----------------------------+---------------------------------------------------------------------------+
    | accession  | name                        | definition                                                                |
    +------------+-----------------------------+---------------------------------------------------------------------------+
    | GO:0004126 | cytidine deaminase activity | "Catalysis of the reaction: cytidine + H2O = uridine + NH3." [EC:3.5.4.5] |
    +------------+-----------------------------+---------------------------------------------------------------------------+

What about the xref_analog entries we saw in the OBO definition? The additional entries are stored using the cvterm_dxref table.

Assigning GO terms to features

A more narrative description has yet to be written, but here's a query to get all assigned GO terms for a given transcript, with evidence.

    SELECT f.uniquename, d.accession, c2.name "evidence type",
           cs.synonym "evidence code", fcp.VALUE
    FROM feature f
        JOIN feature_cvterm fc ON f.feature_id = fc.feature_id
        JOIN cvterm c ON fc.cvterm_id = c.cvterm_id
        JOIN cv ON c.cv_id = cv.cv_id
        JOIN dbxref d ON c.dbxref_id = d.dbxref_id
        JOIN feature_cvtermprop fcp ON fc.feature_cvterm_id = fcp.feature_cvterm_id
        JOIN cvterm c2 ON fcp.type_id = c2.cvterm_id
        JOIN cvtermsynonym cs ON c2.cvterm_id = cs.cvterm_id
    WHERE f.uniquename = 'hsn.transcript.39176.1'
      AND cv.name = 'GO';
 
    +------------------------+-------------+-------------------------------------+---------------+-----------+
    | uniquename             | ROLE id     | evidence TYPE                       | evidence code | VALUE     |
    +------------------------+-------------+-------------------------------------+---------------+-----------+
    | hsn.transcript.39176.1 | GO:0008655  | inferred FROM electronic annotation | IEA           | TIGR01354 |
    | hsn.transcript.39176.1 | GO:0005737  | inferred FROM electronic annotation | IEA           | TIGR01354 |
    | hsn.transcript.39176.1 | GO:0004126  | inferred FROM electronic annotation | IEA           | TIGR01354 |
    +------------------------+-------------+-------------------------------------+---------------+-----------+

Enzyme Commission (EC) number

We convert the enzyme.dat file distributed by ExPASy to an ontology in OBO format and load them along with GO and SO. Further notes should go here.

Storing Analyses

BLAST data

HMM alignments

Comparative data

Our mechanisms for storing comparative computes are already described in the page discussion Sybil Chado Comparative Genomics Data

Versioning of features

Background

One important requirement for any institution doing active annotation with the schema is the ability to store a versioned history of genes as they're modified. If you consider the example of editing the start site of a gene, it's certainly possible to just directly modify the coordinates of the gene feature. In production, this causes complications.

We've taken the position that direct resizing of a feature's coordinates on a molecule is not allowed. In fact, any modifications of a feature that change its underlying sequence is problematic. Features are defined by the sequence they represent, and since any number of analyses (and annotation) made on a feature are based on the underlying sequence, changing this sequence invalidates the feature. If you've stored BLAST analysis for a polypeptide, as an example, and then change the coordinates of the feature all linked BLAST alignment coordinates would be incorrect and misleading.

Start site edits

To resolve these sorts of problems we create new, versioned features any time the underlying sequence of a feature is edited. The most common form of these edits is start site modification of a gene. Using that as an example, consider an example gene that has a feature.uniquename value of 'hsn.gene.39416.1'. It has functional annotation as well as several analyses linked such as blastp and hmmpfam results.

We don't delete any features from the database. Instead, we toggle the feature.is_obsolete field value to TRUE for those features that are no longer 'current.' The overal steps are as follows:

  1. Mark the features in the hsn.gene.39416.1 gene graph as obsolete
  2. Create new features, hsn.gene.39416.2 and other graph features, each incrementing the version number portion of each obsolete feature.
  3. Copy/reattach functional annotation from obsolete to new feature

It's important to note here that we do not copy the analyses (like blastp and hmmpfam), since those are based on the older sequences. This actually provides a benefit to production annotation centers. We use cron scripts to check for any features of a given type within our databases that are not associated with an analysis so that pipelines can automatically spawn to perform them. Since the new feature versions will not have these links, it's easy to determine which features should be included in the pipeline.

Deleting a gene

Again, we don't actually delete any annotation features from the database. When a user wants a gene removed/deleted from the active set we simply mark the feature.is_obsolete value to TRUE for each element of that gene's graph.

Data de-normalization

Chado is fantastic at flexibly storing millions of rows of varying data types to support your annotation projects. It's also crazy slow to query that data out - at least at the speed expected by users for things like web applications. For these applications, a delay of 4 or 5 seconds is an eternity and you'll lose users. This isn't necessarily a Chado limitation, but rather one on highly normalized schemas in general. To query all the BLAST results for a given polypeptide you might do this:

SELECT q.uniquename AS query, s.uniquename AS subject, af.rawscore, af.pidentity,
       af.significance, flq.fmin AS qstart, flq.fmax AS qstop, flq.strand AS qstrand
  FROM feature m, feature q, feature s, analysisfeature af, analysis a, cvterm c,
       featureloc flq, featureloc fls
 WHERE m.feature_id=af.feature_id
   AND m.type_id=c.cvterm_id
   AND af.analysis_id=a.analysis_id
   AND m.feature_id=flq.feature_id
   AND m.feature_id=fls.feature_id
   AND flq.srcfeature_id=q.feature_id
   AND flq.rank = 0
   AND fls.srcfeature_id=s.feature_id
   AND fls.rank = 1
   AND a.name = 'wu-blastp'
   AND c.name = 'match_part'
   AND q.uniquename = 'nfa1.polypeptide.9292';

That's just nasty isn't it? Not only do developers not want to write these but the sheer number of JOINs involved makes the query intolerably slow.

Some of our solutions are described below. A few of them, such as the chado marts, have been made rather general and are reused for several applications. Others are very specific to a precise need. It should be completely acceptable, for example, to implement a cached data store for a single page in a web application.

Materialized views (chado marts)

We employ a few different methodologies for de-normalization of data for user displays. The first is the creation of materialized views within the database. We call our collection of these 'chado marts'. The table definition for one of these to support BLAST results is:

CREATE TABLE cm_blast (
  cm_blast_id         INT NOT NULL,
  qfeature_id         INT NOT NULL,
  qorganism_id        INT NOT NULL,
  hfeature_id         INT NOT NULL,
  horganism_id        INT NOT NULL,
  per_id              DOUBLE PRECISION NULL,
  per_sim             DOUBLE PRECISION NULL,
  p_value             DOUBLE PRECISION NULL,
  mfeature_id         INT NOT NULL
);

For those database engines that support it, such as Oracle, these materialized views can be defined once and are maintained by the underlying database engine as their base tables are updated. For others they are normal tables populated at regular intervals by scripts.

Disk Caching

Notes on Lucene, memcache, Storable, etc. here.