IGS Data Representation

From GMOD

Revision as of 06:56, 15 January 2009

Chado is an elegant schema that can hold nearly anything from gene annotations to an MP3 collection. This fabulous flexibility comes with a price: different MODs (model organism databases) arrive at different ways of storing the same biological information. This page is not meant to be a tutorial on how you should model your biological information in Chado. Rather, it is simply a brain dump of the way we do things at IGS, for better or worse.

The reference document is currently the Chado Best Practices page, into which much of this information may be merged at some point.

What we store

We currently use the Chado schema primarily to store genome annotation data, including comparative genomics. This includes both read-only databases from 'finished' annotations and ongoing, actively modified data. Prokaryotes and eukaryotes are both represented in our datasets, and we use MySQL and PostgreSQL back-ends. We've started using Oracle a bit and have essentially abandoned Sybase.

In terms of tools, the data representations discussed below are generated by custom scripts as well as components in Ergatis. They are read and edited by Manatee for manual curation, by Sybil for comparative displays, and by several web applications such as Gemina and Pathema.

Feature naming convention

The de novo features we store in Chado all follow the same naming convention, db.feature_type.N.M, where 'db' is an abbreviation for a database or project (usually an organism), feature_type corresponds to a value in cvterm.name, N is an incrementing integer scoped to that feature type, and M is the version number of that feature. So the second version of a gene from our Aspergillus fumigatus annotation project may have a name like afu.gene.3320.2.
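
The convention above can be sketched in a few lines of Python. This is only an illustration, not code we run at IGS: the names FeatureName, parse_feature_name and next_version are hypothetical, and the sketch assumes that none of the four fields contains a dot and that a new version is made by incrementing M by one.

```python
from typing import NamedTuple


class FeatureName(NamedTuple):
    """The four dot-separated fields of a db.feature_type.N.M name."""
    db: str            # database/project abbreviation, e.g. 'afu'
    feature_type: str  # a value from cvterm.name, e.g. 'gene'
    number: int        # N: incrementing integer scoped to the feature type
    version: int       # M: version number of the feature


def parse_feature_name(name: str) -> FeatureName:
    """Split a feature name into its fields (assumes no field contains a dot)."""
    parts = name.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected db.feature_type.N.M, got {name!r}")
    db, feature_type, number, version = parts
    return FeatureName(db, feature_type, int(number), int(version))


def format_feature_name(feature: FeatureName) -> str:
    """Join the fields back into a db.feature_type.N.M string."""
    return f"{feature.db}.{feature.feature_type}.{feature.number}.{feature.version}"


def next_version(name: str) -> str:
    """Return the name with M incremented (an assumption about how versions advance)."""
    feature = parse_feature_name(name)
    return format_feature_name(feature._replace(version=feature.version + 1))
```

For example, parse_feature_name('afu.gene.3320.2') yields the fields ('afu', 'gene', 3320, 2).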

Structural annotation

Our structural representation differs quite a bit from the current recommended implementation. The canonical gene model below illustrates this. For now I'll only touch on structural annotation, which is beginning to be covered well on the Chado Best Practices page, and instead focus on functional annotation, which isn't.

Gene models

Canonical gene model

The following query shows all the features in our gene graph, along with their relationships. The example query is for the transcript feature 'hsn.transcript.39176.1', and its result is shown beneath it. <sql>

   SELECT f1.name as subject, c.name as relationship, f2.name as object
     FROM feature f1
         JOIN feature_relationship fr ON f1.feature_id = fr.subject_id
         JOIN feature f2 ON fr.object_id = f2.feature_id
         JOIN cvterm c ON fr.type_id = c.cvterm_id
    WHERE f1.uniquename = 'hsn.transcript.39176.1'
       OR f2.uniquename = 'hsn.transcript.39176.1';
   +-------------------------+--------------+------------------------+
   | subject                 | relationship | object                 |
   +-------------------------+--------------+------------------------+
   | hsn.transcript.39176.1  | derives_from | hsn.gene.39416.1       |
   | hsn.polypeptide.39176.1 | part_of      | hsn.transcript.39176.1 |
   | hsn.CDS.39416.1         | derives_from | hsn.transcript.39176.1 |
   | hsn.exon.39416.1        | part_of      | hsn.transcript.39176.1 |
   +-------------------------+--------------+------------------------+

</sql>
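
To experiment with this query without a full Chado instance, the sketch below loads the five features and four relationships from the result table into an in-memory SQLite database and runs the same query. It is a toy under stated assumptions: the three tables carry only the columns the query touches (real Chado tables have many more), name and uniquename are assumed to be identical for every feature, and build_toy_db and gene_graph are hypothetical helper names. The ORDER BY clause is added only to make the row order predictable.

```python
import sqlite3

# Same joins as the query above; the feature name is passed as a parameter.
QUERY = """
SELECT f1.name AS subject, c.name AS relationship, f2.name AS object
  FROM feature f1
       JOIN feature_relationship fr ON f1.feature_id = fr.subject_id
       JOIN feature f2 ON fr.object_id = f2.feature_id
       JOIN cvterm c ON fr.type_id = c.cvterm_id
 WHERE f1.uniquename = :name
    OR f2.uniquename = :name
 ORDER BY fr.feature_relationship_id
"""


def build_toy_db() -> sqlite3.Connection:
    """Create a minimal, illustrative subset of the Chado tables in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE feature (
            feature_id INTEGER PRIMARY KEY, name TEXT, uniquename TEXT);
        CREATE TABLE feature_relationship (
            feature_relationship_id INTEGER PRIMARY KEY,
            subject_id INTEGER REFERENCES feature(feature_id),
            object_id  INTEGER REFERENCES feature(feature_id),
            type_id    INTEGER REFERENCES cvterm(cvterm_id));
    """)
    conn.executemany("INSERT INTO cvterm VALUES (?, ?)",
                     [(1, "derives_from"), (2, "part_of")])
    names = ["hsn.gene.39416.1", "hsn.transcript.39176.1",
             "hsn.polypeptide.39176.1", "hsn.CDS.39416.1", "hsn.exon.39416.1"]
    # Assumption: name and uniquename coincide in this toy data.
    conn.executemany("INSERT INTO feature VALUES (?, ?, ?)",
                     [(i, n, n) for i, n in enumerate(names, start=1)])
    # (relationship id, subject feature, object feature, relationship type)
    conn.executemany("INSERT INTO feature_relationship VALUES (?, ?, ?, ?)",
                     [(1, 2, 1, 1),   # transcript derives_from gene
                      (2, 3, 2, 2),   # polypeptide part_of transcript
                      (3, 4, 2, 1),   # CDS derives_from transcript
                      (4, 5, 2, 2)])  # exon part_of transcript
    return conn


def gene_graph(conn: sqlite3.Connection, uniquename: str) -> list:
    """Return (subject, relationship, object) rows that involve the feature."""
    return conn.execute(QUERY, {"name": uniquename}).fetchall()


if __name__ == "__main__":
    for row in gene_graph(build_toy_db(), "hsn.transcript.39176.1"):
        print(row)
```

Because the WHERE clause matches the feature as either subject or object, querying the transcript returns its parent gene as well as the features attached to it.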

Functional annotation

The assertions made in functional annotation differ greatly between MODs and annotation sources in general. Our minimal goal is to provide the following, whenever possible, for any given gene:

  • Gene product name
  • Gene symbol
  • GO terms (process, function, component)
  • Enzyme Commission (EC) number

Ideally, evidence should also be stored for each of these assertions.

Versioning of features