We have an EST set. After clustering and assembly, we have a set of contigs (assemblies composed of more than one EST) and singletons (composed of just one EST). In the Sequence Ontology, an EST is the term EST (SO:0000345) (‘A tag produced from a single sequencing read from a cDNA clone or PCR product; typically a few hundred base pairs long.’).
A unigene is a cluster of several ESTs, defined without taking the assembly into account. A unigene indicates only that EST1 and EST2 are related and therefore belong to the same unigene; it carries no information about how these ESTs are aligned and assembled. In the Sequence Ontology the unigene is named transcribed_cluster (SO:0001457) (‘A region defined by a set of transcribed sequences from the same gene or expressed pseudogene.’).
If a sequence assembly consensus exists, it is stored as the residues of the transcribed_cluster feature.
Let’s store a unigene in the database:
transcribed_cluster
EST1 ---------> (EST)
EST2 <--------- (EST)
consensus -------------> (stored in the transcribed_cluster residues)
This schema would also work with the NCBI unigenes that do not have consensus.
We have to store several items in the Chado database:
The analysis is stored in the analysis table:
Each EST is stored in the feature table.
dbxref_id | organism | name | uniquename | type_id | is_analysis |
---|---|---|---|---|---|
CMV:EST001 | organism id | EST001 | CMV:EST001 | SO:345 id | False |
CMV:EST002 | organism id | EST002 | CMV:EST002 | SO:345 id | False |
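The EST rows above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `feature` table below is a cut-down stand-in for Chado's real table (which lives in PostgreSQL and has more columns and foreign keys), and `ORGANISM_ID`, `EST_TYPE_ID`, the helper `add_est` and the toy sequences are hypothetical values invented for the example.

```python
import sqlite3

# Simplified stand-in for Chado's feature table. The real table has more
# columns and foreign keys to organism, cvterm and dbxref.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE feature (
        feature_id  INTEGER PRIMARY KEY,
        organism_id INTEGER NOT NULL,
        name        TEXT,
        uniquename  TEXT NOT NULL,
        residues    TEXT,
        type_id     INTEGER NOT NULL,
        is_analysis INTEGER NOT NULL DEFAULT 0,
        UNIQUE (organism_id, uniquename, type_id)
    )
    """
)

ORGANISM_ID = 1    # hypothetical organism row
EST_TYPE_ID = 345  # hypothetical cvterm_id standing in for SO:0000345 (EST)


def add_est(conn, name, residues):
    """Insert one EST as a non-analysis feature and return its feature_id."""
    cur = conn.execute(
        "INSERT INTO feature "
        "(organism_id, name, uniquename, residues, type_id, is_analysis) "
        "VALUES (?, ?, ?, ?, ?, 0)",
        (ORGANISM_ID, name, f"CMV:{name}", residues, EST_TYPE_ID),
    )
    return cur.lastrowid


# Toy sequences, made up for the example.
est1_id = add_est(conn, "EST001", "ACGTACGT")
est2_id = add_est(conn, "EST002", "TTGACGTA")

rows = conn.execute(
    "SELECT name, uniquename, is_analysis FROM feature ORDER BY feature_id"
).fetchall()
```

The ESTs are primary data rather than computed results, which is why `is_analysis` is false for them.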
The unigene is the set of ESTs and the consensus. It is represented by a feature and by a feature graph. This graph defines which ESTs and assemblies belong to which unigenes. Each EST is part of the unigene cluster.
If there is a consensus sequence, it should be stored in this feature's residues.
The feature entry for the unigene.
dbxref_id | organism | name | uniquename | type_id | is_analysis |
---|---|---|---|---|---|
CMV:UNI001 | ? | UNI001 | CMV:UNI001 | SO:1457 id | True |
The feature relationships (feature graph).
subject_id | object_id | type_id | rank |
---|---|---|---|
EST001 id | UNI001 id | part_of id | 0 |
EST002 id | UNI001 id | part_of id | 0 |
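As a rough sketch of how this feature graph can be used, the following Python snippet stores two ESTs as `part_of` one unigene and then lists the members of that unigene. The schema is deliberately simplified and is an assumption of this example: real Chado identifies types through `type_id` references to `cvterm`, whereas here a plain `type_name` text column is used, and the helper names `add_feature` and `ests_in_unigene` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        uniquename TEXT NOT NULL UNIQUE,
        type_name  TEXT NOT NULL   -- simplification: Chado uses type_id -> cvterm
    );
    CREATE TABLE feature_relationship (
        feature_relationship_id INTEGER PRIMARY KEY,
        subject_id INTEGER NOT NULL REFERENCES feature (feature_id),
        object_id  INTEGER NOT NULL REFERENCES feature (feature_id),
        type_name  TEXT NOT NULL,  -- simplification: Chado uses type_id -> cvterm
        rank       INTEGER NOT NULL DEFAULT 0,
        UNIQUE (subject_id, object_id, type_name, rank)
    );
    """
)


def add_feature(uniquename, type_name):
    """Insert a feature and return its feature_id."""
    return conn.execute(
        "INSERT INTO feature (uniquename, type_name) VALUES (?, ?)",
        (uniquename, type_name),
    ).lastrowid


# The unigene is the object; each EST is a subject that is part_of it.
unigene_id = add_feature("CMV:UNI001", "transcribed_cluster")
for est_name in ("CMV:EST001", "CMV:EST002"):
    est_id = add_feature(est_name, "EST")
    conn.execute(
        "INSERT INTO feature_relationship (subject_id, object_id, type_name) "
        "VALUES (?, ?, 'part_of')",
        (est_id, unigene_id),
    )


def ests_in_unigene(unigene_uniquename):
    """Return the uniquenames of all ESTs that are part_of the given unigene."""
    rows = conn.execute(
        """
        SELECT est.uniquename
        FROM feature_relationship fr
        JOIN feature est ON est.feature_id = fr.subject_id
        JOIN feature uni ON uni.feature_id = fr.object_id
        WHERE uni.uniquename = ? AND fr.type_name = 'part_of'
        ORDER BY est.uniquename
        """,
        (unigene_uniquename,),
    ).fetchall()
    return [row[0] for row in rows]


members = ests_in_unigene("CMV:UNI001")
```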
A unigene_cluster is a subtype of transcribed_cluster and represents the specific clustering methodology used by the NCBI to produce UniGenes (http://www.ncbi.nlm.nih.gov/unigene). If the clustering was not produced by the NCBI UniGene method, use transcribed_cluster instead.
The ESTs are all aligned with the assembly consensus. For each EST there is an alignment. For each alignment there is:
Additional features for the alignments.
organism | uniquename | type_id | is_analysis |
---|---|---|---|
? | CMV:UNI001_EST1 | SO:668 id | True |
? | CMV:UNI001_EST2 | SO:668 id | True |
An analysisfeature row for each of the previous features, holding the alignment scores. If we don’t need to store these scores, this table could be optional.
feature_id | analysis_id | identity |
---|---|---|
CMV:UNI001_EST1 id | EST clustering analysis id | The identity % |
CMV:UNI001_EST2 id | EST clustering analysis id | The identity % |
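A minimal sketch of attaching identity scores to alignment features, again using `sqlite3` with cut-down tables. Everything concrete here is assumed for illustration: the program name `example_assembler`, its version, the numeric feature ids and the identity value are made up, and the real Chado `analysis` and `analysisfeature` tables carry additional columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE analysis (
        analysis_id    INTEGER PRIMARY KEY,
        name           TEXT,
        program        TEXT NOT NULL,
        programversion TEXT NOT NULL
    );
    CREATE TABLE analysisfeature (
        analysisfeature_id INTEGER PRIMARY KEY,
        feature_id  INTEGER NOT NULL,  -- would reference feature.feature_id
        analysis_id INTEGER NOT NULL REFERENCES analysis (analysis_id),
        identity    REAL,              -- percent identity; NULL if not recorded
        UNIQUE (feature_id, analysis_id)
    );
    """
)

# One analysis row describes the clustering/assembly run (hypothetical values).
analysis_id = conn.execute(
    "INSERT INTO analysis (name, program, programversion) VALUES (?, ?, ?)",
    ("EST clustering", "example_assembler", "0.0"),
).lastrowid

# Hypothetical feature_ids of two alignment features; the second has no score.
scores = {101: 98.5, 102: None}
conn.executemany(
    "INSERT INTO analysisfeature (feature_id, analysis_id, identity) "
    "VALUES (?, ?, ?)",
    [(feature_id, analysis_id, identity) for feature_id, identity in scores.items()],
)

# Only alignments with a recorded identity are returned.
scored = conn.execute(
    "SELECT feature_id, identity FROM analysisfeature "
    "WHERE identity IS NOT NULL ORDER BY feature_id"
).fetchall()
```

Leaving `identity` nullable is one way to keep the link between an alignment and its analysis even when no score is stored.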
The alignment coordinates are stored in two featurelocs in the featureloc table.
feature_id | srcfeature_id | fmin | fmax | strand | residue_info | rank |
---|---|---|---|---|---|---|
CMV:UNI001_EST1 id | UNI001 id | leftmost boundary | rightmost boundary | strand | CIGAR | 0 |
CMV:UNI001_EST1 id | EST001 id | leftmost boundary | rightmost boundary | strand | CIGAR | 1 |
CMV:UNI001_EST2 id | UNI001 id | leftmost boundary | rightmost boundary | strand | CIGAR | 0 |
CMV:UNI001_EST2 id | EST002 id | leftmost boundary | rightmost boundary | strand | CIGAR | 1 |
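The pair of featurelocs per alignment can be sketched as follows. Chado stores locations in interbase coordinates (`fmin` is zero-based, `fmax` is exclusive), so the example converts from 1-based inclusive positions, which many alignment programs report. The `FeatureLoc` class, the helper functions, the coordinates and the CIGAR string are all hypothetical and chosen only to illustrate the layout of the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureLoc:
    feature: str       # uniquename of the alignment feature
    srcfeature: str    # uniquename of the sequence the alignment is located on
    fmin: int          # leftmost boundary, interbase
    fmax: int          # rightmost boundary, interbase
    strand: int        # 1 or -1
    residue_info: str  # CIGAR string describing the alignment
    rank: int          # 0 for the unigene location, 1 for the EST location


def to_interbase(start, end):
    """Convert 1-based inclusive coordinates to interbase (fmin, fmax)."""
    if start > end:
        start, end = end, start
    return start - 1, end


def alignment_featurelocs(match, unigene, est, unigene_span, est_span, cigar):
    """Build the two featureloc rows for one EST-to-consensus alignment.

    Each span is a (start, end, strand) tuple in 1-based inclusive coordinates.
    """
    locs = []
    sources = ((unigene, unigene_span), (est, est_span))
    for rank, (src, (start, end, strand)) in enumerate(sources):
        fmin, fmax = to_interbase(start, end)
        locs.append(FeatureLoc(match, src, fmin, fmax, strand, cigar, rank))
    return locs


# Toy alignment: bases 101-450 of the consensus match the whole 350 bp EST,
# which aligns in reverse orientation.
locs = alignment_featurelocs(
    "CMV:UNI001_EST2", "CMV:UNI001", "CMV:EST002",
    unigene_span=(101, 450, 1), est_span=(1, 350, -1), cigar="350M",
)
```

With interbase coordinates the aligned length is simply `fmax - fmin`, here 350 on both the consensus and the EST.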