Store a unigene in Chado HOWTO


How to store a unigene in a Chado database

We have an EST set that has been clustered and assembled, so we now have a set of contigs (assemblies composed of more than one EST) and singletons (composed of just one EST). In the Sequence Ontology an EST is the term EST (SO:0000345) (‘A tag produced from a single sequencing read from a cDNA clone or PCR product; typically a few hundred base pairs long.’).

A unigene is a clustering of several ESTs that does not take the assembly into account. A unigene only indicates that, for example, EST1 and EST2 are related and therefore belong to the same unigene; it carries no information about how these ESTs are aligned and assembled. In the Sequence Ontology the unigene is named transcribed_cluster (SO:0001457) (‘A region defined by a set of transcribed sequences from the same gene or expressed pseudogene.’).

If a sequence assembly consensus exists, it is stored as the residues of the transcribed_cluster feature.

Chado layout

Let’s store an unigene in the database:

EST1       --------->     (EST) 
EST2           <--------- (EST)
consensus  -------------> (stored in the transcribed_cluster residues)

This schema would also work with the NCBI unigenes that do not have a consensus.

We have to store several items in the Chado database:

  1. The analysis that has clustered and assembled the ESTs.
  2. The ESTs.
  3. The consensus assembly.
  4. The relationship graph between the ESTs and the consensus (aka the unigene cluster)
  5. The alignments between each EST and the consensus.


The analysis is stored in the analysis table:


Each EST is stored in the feature table.

dbxref_id  | organism    | name   | uniquename | type_id   | is_analysis
CMV:EST001 | organism id | EST001 | CMV:EST001 | SO:345 id | False
CMV:EST002 | organism id | EST002 | CMV:EST002 | SO:345 id | False
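As a rough illustration of the EST rows above, here is a minimal sketch that inserts them into an in-memory SQLite table. The table is a simplified stand-in for Chado's feature table, not the real DDL: the organism and the type are stored as plain text instead of foreign keys, and the organism name, the `store_ests` helper and the `CMV` prefix default are assumptions made for the example.

```python
import sqlite3

# Simplified stand-in for Chado's feature table (NOT the real DDL):
# organism and type are plain text here instead of foreign keys.
FEATURE_DDL = """
CREATE TABLE feature (
    feature_id  INTEGER PRIMARY KEY,
    organism    TEXT NOT NULL,
    name        TEXT,
    uniquename  TEXT NOT NULL,
    type        TEXT NOT NULL,
    residues    TEXT,
    is_analysis INTEGER NOT NULL DEFAULT 0,
    UNIQUE (organism, uniquename, type)
)
"""


def store_ests(conn, organism, est_names, prefix="CMV"):
    """Insert one EST feature per name and return {name: feature_id}."""
    ids = {}
    for name in est_names:
        cur = conn.execute(
            "INSERT INTO feature (organism, name, uniquename, type, is_analysis) "
            "VALUES (?, ?, ?, 'SO:0000345', 0)",
            (organism, name, f"{prefix}:{name}"),
        )
        ids[name] = cur.lastrowid
    return ids


conn = sqlite3.connect(":memory:")
conn.execute(FEATURE_DDL)
# "example organism" is a placeholder; the source does not name the organism.
est_ids = store_ests(conn, "example organism", ["EST001", "EST002"])
```

In a real Chado instance the type would be a cvterm id resolved from the Sequence Ontology, and the organism an organism_id.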


The unigene is the set of ESTs and the consensus. It is represented by a feature and by a feature graph. This graph defines which ESTs and assemblies belong to which unigenes. Each EST is part of the unigene cluster.

If there is a consensus sequence, it should be stored in this feature's residues.

The feature entry for the unigene.

dbxref_id  | organism | name   | uniquename | type_id    | is_analysis
CMV:UNI001 | ?        | UNI001 | CMV:UNI001 | SO:1457 id | True

The feature relationships (feature graph).

subject_id | object_id | type_id    | rank
EST001 id  | UNI001 id | part_of id | 0
EST002 id  | UNI001 id | part_of id | 0
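The part_of graph can be sketched with two simplified tables and a query that lists the ESTs of a unigene. This is only an illustration under stated assumptions: the tables are cut-down stand-ins for Chado's feature and feature_relationship tables (types are plain text rather than cvterm ids), and `add_feature`, `link_part_of` and `members_of` are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-ins for the Chado tables; types are plain text here.
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE,
    type       TEXT NOT NULL
);
CREATE TABLE feature_relationship (
    subject_id INTEGER NOT NULL REFERENCES feature (feature_id),
    object_id  INTEGER NOT NULL REFERENCES feature (feature_id),
    type       TEXT NOT NULL,
    rank       INTEGER NOT NULL DEFAULT 0,
    UNIQUE (subject_id, object_id, type, rank)
);
""")


def add_feature(uniquename, type_):
    """Insert a feature and return its id."""
    cur = conn.execute(
        "INSERT INTO feature (uniquename, type) VALUES (?, ?)",
        (uniquename, type_),
    )
    return cur.lastrowid


def link_part_of(subject_id, object_id):
    """Record that the subject (an EST) is part_of the object (the unigene)."""
    conn.execute(
        "INSERT INTO feature_relationship (subject_id, object_id, type, rank) "
        "VALUES (?, ?, 'part_of', 0)",
        (subject_id, object_id),
    )


def members_of(unigene_uniquename):
    """Return the uniquenames of all features that are part_of the unigene."""
    rows = conn.execute(
        "SELECT s.uniquename FROM feature_relationship r "
        "JOIN feature s ON s.feature_id = r.subject_id "
        "JOIN feature o ON o.feature_id = r.object_id "
        "WHERE o.uniquename = ? AND r.type = 'part_of' "
        "ORDER BY s.uniquename",
        (unigene_uniquename,),
    )
    return [row[0] for row in rows]


unigene_id = add_feature("CMV:UNI001", "SO:0001457")
for est in ("CMV:EST001", "CMV:EST002"):
    link_part_of(add_feature(est, "SO:0000345"), unigene_id)
```

Calling `members_of("CMV:UNI001")` then walks the graph from the unigene back to its ESTs.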


A unigene_cluster is a subtype of transcribed_cluster and represents the specific clustering methodology used by the NCBI to produce UniGenes. If the clustering algorithm used was not UniGene, you would use transcribed_cluster.

ESTs and assembly alignments

Every EST is aligned to the assembly consensus, so there is one alignment per EST. Each alignment is stored as the following items:

Additional features for the alignments.

organism | uniquename      | type_id   | is_analysis
?        | CMV:UNI001_EST1 | SO:668 id | True
?        | CMV:UNI001_EST2 | SO:668 id | True

An analysisfeature row for each of the previous features, holding the alignment scores. If these scores do not need to be stored, this table could be optional.

feature_id         | analysis_id               | identity
CMV:UNI001_EST1 id | EST clustering analysis id | The identity %
CMV:UNI001_EST2 id | EST clustering analysis id | The identity %
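The identity column holds a percentage. One possible way to compute it from a gapped pairwise alignment is sketched below; the definition used here (matching columns divided by all alignment columns) is an assumption, since clustering programs may define identity differently, and `percent_identity` and `analysisfeature_row` are hypothetical helpers.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns where both sequences carry the same base.

    Both inputs must be the gapped, equal-length rows of one pairwise alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_a)
    if columns == 0:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matches / columns


def analysisfeature_row(feature, analysis, aligned_est, aligned_consensus):
    """Build a dict shaped like the analysisfeature columns shown above."""
    return {
        "feature_id": feature,
        "analysis_id": analysis,
        "identity": percent_identity(aligned_est, aligned_consensus),
    }


# Toy alignment: five matching columns out of six.
row = analysisfeature_row(
    "CMV:UNI001_EST1", "EST clustering analysis", "ACGT-A", "ACGTTA"
)
```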

The alignment coordinates are stored in two featurelocs in the featureloc table.

feature_id         | srcfeature_id | fmin              | fmax               | strand | residue_info | rank
CMV:UNI001_EST1 id | UNI001 id     | leftmost boundary | rightmost boundary | strand | CIGAR        | 0
CMV:UNI001_EST1 id | EST001 id     | leftmost boundary | rightmost boundary | strand | CIGAR        | 1
CMV:UNI001_EST2 id | UNI001 id     | leftmost boundary | rightmost boundary | strand | CIGAR        | 0
CMV:UNI001_EST2 id | EST002 id     | leftmost boundary | rightmost boundary | strand | CIGAR        | 1
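Chado stores fmin and fmax in interbase (zero-based, half-open) coordinates, while many alignment programs report 1-based inclusive positions. The sketch below converts such positions and builds the two featureloc rows for one alignment, rank 0 on the unigene and rank 1 on the EST as in the table above. The `FeatureLoc` tuple, the helper names, the example coordinates, the CIGAR string and the choice to put the orientation on the EST row are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class FeatureLoc(NamedTuple):
    feature: str
    srcfeature: str
    fmin: int
    fmax: int
    strand: int
    residue_info: Optional[str]
    rank: int


def to_interbase(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based inclusive positions to interbase (fmin, fmax)."""
    lo, hi = sorted((start, end))
    return lo - 1, hi


def alignment_locs(
    match: str,
    unigene: str,
    est: str,
    unigene_span: Tuple[int, int],
    est_span: Tuple[int, int],
    est_strand: int,
    cigar: Optional[str] = None,
) -> List[FeatureLoc]:
    """Return the two featureloc rows of one EST/consensus alignment.

    Assumption: the unigene row is always on the forward strand and the
    EST row carries the orientation (1 or -1).
    """
    uni_fmin, uni_fmax = to_interbase(*unigene_span)
    est_fmin, est_fmax = to_interbase(*est_span)
    return [
        FeatureLoc(match, unigene, uni_fmin, uni_fmax, 1, cigar, 0),
        FeatureLoc(match, est, est_fmin, est_fmax, est_strand, cigar, 1),
    ]


# Hypothetical alignment: EST002, reverse-oriented, covers consensus 301..800.
locs = alignment_locs(
    "CMV:UNI001_EST2", "CMV:UNI001", "CMV:EST002", (301, 800), (1, 500), -1, "500M"
)
```

With interbase coordinates the length of each location is simply fmax minus fmin, which is 500 for both rows in this example.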



