Store an unigene in Chado HOWTO

From GMOD

How to store a unigene in a Chado database

We start with an EST set that has been clustered and assembled, which yields a set of contigs (assemblies composed of more than one EST) and singletons (composed of just one EST). In the Sequence Ontology an EST is represented by the term EST (SO:0000345) ('A tag produced from a single sequencing read from a cDNA clone or PCR product; typically a few hundred base pairs long.').

A unigene is a cluster of several ESTs, defined without taking the assembly into account. A unigene indicates only that, say, EST1 and EST2 are related and therefore belong to the same unigene; it carries no information about how these ESTs are aligned and assembled. In the Sequence Ontology the unigene is named transcribed_cluster (SO:0001457) ('A region defined by a set of transcribed sequences from the same gene or expressed pseudogene.').

If a consensus sequence exists for the assembly, it is stored as the residues of the transcribed_cluster feature.

Chado layout

Let's store a unigene in the database:

transcribed_cluster
EST1       --------->     (EST) 
EST2           <--------- (EST)
consensus  -------------> (stored in the transcribed_cluster residues)

This schema would also work for NCBI unigenes that do not have a consensus.

We have to store several items in the Chado database:

  1. The analysis that has clustered and assembled the ESTs.
  2. The ESTs.
  3. The consensus assembly.
  4. The relationship graph between the ESTs and the consensus (aka the unigene cluster)
  5. The alignments between each EST and the consensus.

Analysis

The analysis is stored in the analysis table:

  • name: A way of grouping analyses. This should be a handy short identifier that helps people find the analysis they want, for instance "unigene clustering". It should not be assumed to be unique; for instance, there may be lots of separate analyses done against a cDNA database.
  • description: A description of this clustering and assembly analysis.
  • program: The program used, like CAP3 or MIRA.
  • programversion: The version of the program used.
  • algorithm: The name of the algorithm used.
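
As a rough illustration of the fields above, here is a minimal sketch that inserts one analysis row. It runs against an in-memory SQLite table that is only a simplified stand-in for the real Chado analysis table (Chado normally lives in PostgreSQL and has more columns). The helper add_analysis, the description text, and the "unknown" version string are assumptions made for this example.

```python
import sqlite3

# Simplified stand-in for the Chado analysis table: only the columns
# discussed above. This is NOT the full Chado schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE analysis (
        analysis_id INTEGER PRIMARY KEY,
        name TEXT,
        description TEXT,
        program TEXT,
        programversion TEXT,
        algorithm TEXT
    )
    """
)


def add_analysis(conn, name, description, program, programversion, algorithm=None):
    """Insert one analysis row and return its generated analysis_id (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO analysis (name, description, program, programversion, algorithm) "
        "VALUES (?, ?, ?, ?, ?)",
        (name, description, program, programversion, algorithm),
    )
    return cur.lastrowid


# Illustrative values only; the version string is a placeholder.
analysis_id = add_analysis(
    conn,
    name="unigene clustering",
    description="Clustering and assembly of the EST set",
    program="CAP3",
    programversion="unknown",
)
```

The returned analysis_id is what the analysisfeature rows for the alignments would later point to.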

ESTs

Each EST is stored in the feature table.

  • dbxref_id: public stable identifier for this EST.
  • name: Human readable.
  • uniquename: Unique for this EST in this organism.
  • residues: The sequence of nucleotides.
  • seqlen: The sequence length.
  • md5checksum: The sequence md5checksum.
  • type_id: The cvterm id of the SO term EST (SO:0000345).
  • is_analysis: False

dbxref_id   organism     name    uniquename  type_id    is_analysis
CMV:EST001  organism id  EST001  CMV:EST001  SO:345 id  False
CMV:EST002  organism id  EST002  CMV:EST002  SO:345 id  False
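
The derived columns (seqlen and md5checksum) can be computed from the residues when the row is prepared. The following is a minimal sketch, assuming the sequence is plain ASCII nucleotides; the function est_feature_row and its parameters are hypothetical, and the organism_id and est_type_id values must come from lookups in your own database (the numbers used below are placeholders). The dbxref_id is left out because it requires a dbxref row to exist first.

```python
import hashlib


def est_feature_row(db, accession, residues, organism_id, est_type_id):
    """Build a dict of column values for one EST feature (hypothetical helper).

    organism_id and est_type_id are database ids that the caller must
    look up; they are not computed here.
    """
    return {
        "organism_id": organism_id,
        "name": accession,
        "uniquename": f"{db}:{accession}",
        "residues": residues,
        "seqlen": len(residues),
        # Hex digest of the MD5 of the sequence, 32 characters long.
        "md5checksum": hashlib.md5(residues.encode("ascii")).hexdigest(),
        "type_id": est_type_id,
        "is_analysis": False,
    }


# Placeholder ids and a made-up short sequence, for illustration only.
row = est_feature_row("CMV", "EST001", "ACGTACGT", organism_id=1, est_type_id=42)
```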

Unigene

The unigene is the set of ESTs and the consensus. It is represented by a feature and by a feature graph. This graph defines which ESTs and assemblies belong to which unigenes. Each EST is part of the unigene cluster.

If there is a consensus sequence, it should be stored in this feature's residues.

The feature entry for the unigene.

dbxref_id   organism  name    uniquename  type_id     is_analysis
CMV:UNI001  ?         UNI001  CMV:UNI001  SO:1457 id  True

The feature relationships (feature graph).

subject_id  object_id  type_id     rank
EST001 id   UNI001 id  part_of id  0
EST002 id   UNI001 id  part_of id  0
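
To make the feature graph concrete, here is a small sketch that stores the unigene, its two ESTs, and the part_of relationships, then reads the members back. It uses in-memory SQLite tables that are heavily simplified stand-ins for the Chado cvterm, feature, and feature_relationship tables; the helpers term_id and add_feature are assumptions for this example, not part of any Chado API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for three Chado tables (most columns omitted).
conn.executescript(
    """
    CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        uniquename TEXT UNIQUE,
        type_id INTEGER REFERENCES cvterm (cvterm_id)
    );
    CREATE TABLE feature_relationship (
        feature_relationship_id INTEGER PRIMARY KEY,
        subject_id INTEGER REFERENCES feature (feature_id),
        object_id INTEGER REFERENCES feature (feature_id),
        type_id INTEGER REFERENCES cvterm (cvterm_id),
        rank INTEGER DEFAULT 0
    );
    """
)


def term_id(name):
    """Return the cvterm_id for a term name, creating the term if needed."""
    conn.execute("INSERT OR IGNORE INTO cvterm (name) VALUES (?)", (name,))
    return conn.execute(
        "SELECT cvterm_id FROM cvterm WHERE name = ?", (name,)
    ).fetchone()[0]


def add_feature(uniquename, type_name):
    """Insert a feature of the given type and return its feature_id."""
    cur = conn.execute(
        "INSERT INTO feature (uniquename, type_id) VALUES (?, ?)",
        (uniquename, term_id(type_name)),
    )
    return cur.lastrowid


unigene_id = add_feature("CMV:UNI001", "transcribed_cluster")
part_of = term_id("part_of")
for est_name in ("CMV:EST001", "CMV:EST002"):
    est_id = add_feature(est_name, "EST")
    # subject (the EST) is part_of object (the unigene)
    conn.execute(
        "INSERT INTO feature_relationship (subject_id, object_id, type_id) "
        "VALUES (?, ?, ?)",
        (est_id, unigene_id, part_of),
    )

# Read the graph back: which ESTs are part_of CMV:UNI001?
members = [
    row[0]
    for row in conn.execute(
        """
        SELECT s.uniquename
        FROM feature_relationship fr
        JOIN feature s ON s.feature_id = fr.subject_id
        JOIN feature o ON o.feature_id = fr.object_id
        WHERE o.uniquename = ? AND fr.type_id = ?
        ORDER BY s.uniquename
        """,
        ("CMV:UNI001", part_of),
    )
]
```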

Note

A unigene_cluster is a subtype of transcribed_cluster and represents the specific clustering methodology used by the NCBI to produce UniGenes (http://www.ncbi.nlm.nih.gov/unigene). If the clustering algorithm used was not UniGene, you would use transcribed_cluster.

ESTs and assembly alignments

The ESTs are all aligned with the assembly consensus. For each EST there is an alignment. For each alignment there is:

  • a feature of type EST_match (SO:0000668).
  • an entry in the analysisfeature table.
  • two featurelocs, one on the consensus assembly and another on the EST.

Additional features for the alignments.

organism  uniquename       type_id    is_analysis
?         CMV:UNI001_EST1  SO:668 id  True
?         CMV:UNI001_EST2  SO:668 id  True

An analysisfeature row for each of the previous features holds the alignment scores. If we don't need to store these scores, this table could be optional.

feature_id          analysis_id                 identity
CMV:UNI001_EST1 id  EST clustering analysis id  The identity %
CMV:UNI001_EST2 id  EST clustering analysis id  The identity %
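
If the assembler does not report an identity value directly, it can be derived from the pairwise alignment. The sketch below uses one common definition (matching columns divided by the total number of alignment columns, with gaps counted as mismatches); this definition is an assumption, and your assembler may compute identity differently. The function percent_identity and the example strings are made up for illustration.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two gapped, equal-length alignment strings.

    Assumed definition: identical non-gap columns / all alignment columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Made-up alignment between a consensus fragment and an EST fragment.
identity = percent_identity("ACGT-A", "ACCTTA")

# Column values for one analysisfeature row; the two ids are placeholders
# for the real feature_id and analysis_id values.
analysisfeature_row = {
    "feature_id": "CMV:UNI001_EST1 id",
    "analysis_id": "EST clustering analysis id",
    "identity": identity,
}
```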

The alignment coordinates are stored in two featurelocs in the featureloc table.

feature_id          srcfeature_id  fmin               fmax                strand  residue_info  rank
CMV:UNI001_EST1 id  UNI001 id      leftmost boundary  rightmost boundary  strand  CIGAR         0
CMV:UNI001_EST1 id  EST001 id      leftmost boundary  rightmost boundary  strand  CIGAR         1
CMV:UNI001_EST2 id  UNI001 id      leftmost boundary  rightmost boundary  strand  CIGAR         0
CMV:UNI001_EST2 id  EST002 id      leftmost boundary  rightmost boundary  strand  CIGAR         1
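
Chado stores fmin and fmax in interbase (zero-based, half-open) coordinates, whereas many assemblers report 1-based inclusive positions. The following sketch builds the two featureloc rows for one alignment under that assumption about the input. The function names, the string ids (real ids are integers), the example positions, and the "500M" CIGAR string are all hypothetical.

```python
def to_interbase(start, end):
    """Convert 1-based inclusive coordinates to interbase (fmin, fmax)."""
    low, high = sorted((start, end))
    return low - 1, high


def featureloc_rows(match_id, consensus_loc, est_loc, cigar=None):
    """Build the two featureloc rows for one EST_match feature.

    Each *_loc is (srcfeature_id, start, end, strand) with 1-based
    inclusive start/end. Rank 0 is the location on the consensus and
    rank 1 the location on the EST, as in the table above.
    """
    rows = []
    for rank, (src_id, start, end, strand) in enumerate((consensus_loc, est_loc)):
        fmin, fmax = to_interbase(start, end)
        rows.append(
            {
                "feature_id": match_id,
                "srcfeature_id": src_id,
                "fmin": fmin,
                "fmax": fmax,
                "strand": strand,
                "residue_info": cigar,
                "rank": rank,
            }
        )
    return rows


# Illustrative alignment: EST002 aligned in reverse orientation to
# positions 21..520 of the consensus.
rows = featureloc_rows(
    "CMV:UNI001_EST2",
    consensus_loc=("UNI001", 21, 520, 1),
    est_loc=("EST002", 1, 500, -1),
    cigar="500M",
)
```

With interbase coordinates the aligned length is simply fmax minus fmin, which is 500 for both rows in this example.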