Store a unigene in Chado HOWTO
How to store a unigene in a Chado database
We have an EST set. We have done a clustering and an assembly, so we now have a set of contigs (assemblies composed of more than one EST) and singletons (composed of just one EST). In the Sequence Ontology an EST is the term EST (SO:0000345) ('A tag produced from a single sequencing read from a cDNA clone or PCR product; typically a few hundred base pairs long.').
A unigene is a cluster of several ESTs, without taking the assembly into account. A unigene therefore indicates only that EST1 and EST2 are related and belong to the same unigene; it carries no information about how these ESTs are aligned and assembled. In the Sequence Ontology the unigene is named transcribed_cluster (SO:0001457) ('A region defined by a set of transcribed sequences from the same gene or expressed pseudogene.').
If a sequence assembly consensus exists, it is represented as the residues of the transcribed_cluster feature.
Let's store a unigene in the database:
transcribed_cluster
    EST1       --------->         (EST)
    EST2           <---------     (EST)
    consensus  ------------->     (stored in the transcribed_cluster residues)
This schema also works for NCBI unigenes that do not have a consensus.
We have to store several items in the Chado database:
- The analysis that has clustered and assembled the ESTs.
- The ESTs.
- The consensus assembly.
- The relationship graph between the ESTs and the consensus (aka the unigene cluster).
- The alignments between each EST and the consensus.
The analysis is stored in the analysis table:
- name: A way of grouping analyses. This should be a short, handy identifier that helps people find the analysis they want, for instance "unigene clustering". It should not be assumed to be unique: there may be many separate analyses done against the same cDNA database.
- description: Some description of this clustering and assembly analysis.
- program: The program used, such as CAP3 or MIRA.
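As a minimal sketch of what such a row might look like, the snippet below inserts one analysis into an in-memory SQLite table. The table is a cut-down stand-in for Chado's analysis table (the real schema lives in PostgreSQL and has more columns). The programversion column, the add_analysis helper and all values shown are assumptions made for illustration.

```python
import sqlite3

# Simplified stand-in for Chado's analysis table (assumption: only a few
# columns are modelled, and SQLite replaces PostgreSQL to stay self-contained).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE analysis (
        analysis_id    INTEGER PRIMARY KEY,
        name           TEXT,
        description    TEXT,
        program        TEXT NOT NULL,
        programversion TEXT NOT NULL
    )
    """
)


def add_analysis(conn, name, description, program, programversion):
    """Insert one analysis row and return its generated analysis_id."""
    cur = conn.execute(
        "INSERT INTO analysis (name, description, program, programversion) "
        "VALUES (?, ?, ?, ?)",
        (name, description, program, programversion),
    )
    return cur.lastrowid


# Hypothetical values describing the clustering and assembly run.
analysis_id = add_analysis(
    conn,
    name="unigene clustering",
    description="Clustering and assembly of the EST set",
    program="CAP3",
    programversion="unknown",
)
```

The returned analysis_id is the value that later rows (for example in analysisfeature) would refer to.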
Each EST is stored in the feature table.
- dbxref_id: public stable identifier for this EST.
- name: Human readable.
- uniquename: Unique for this EST in this organism.
- residues: The sequence of nucleotides.
- seqlen: The sequence length.
- md5checksum: The sequence md5checksum.
- type_id: The cvterm id of the SO term EST (SO:0000345).
- is_analysis: False
|dbxref_id||organism_id||name||uniquename||type_id||is_analysis|
|CMV:EST001||organism id||EST001||CMV:EST001||SO:345 id||False|
|CMV:EST002||organism id||EST002||CMV:EST002||SO:345 id||False|
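The derived columns (seqlen and md5checksum) can be computed from the residues. The following is a small sketch of how one EST row could be assembled before loading. The est_feature_row helper, the "CMV" prefix default and the numeric ids are hypothetical, and dbxref_id is left out because it depends on rows created in the dbxref table.

```python
import hashlib


def est_feature_row(name, residues, organism_id, est_type_id, db_prefix="CMV"):
    """Build a dict holding the feature columns described above for one EST.

    est_type_id is assumed to be the cvterm id of SO:0000345 (EST), looked
    up beforehand in the target database.
    """
    return {
        "organism_id": organism_id,
        "name": name,
        "uniquename": f"{db_prefix}:{name}",
        "residues": residues,
        "seqlen": len(residues),
        # MD5 hex digest of the nucleotide string.
        "md5checksum": hashlib.md5(residues.encode("ascii")).hexdigest(),
        "type_id": est_type_id,
        "is_analysis": False,
    }


# Hypothetical ids: organism 1, and 42 as the cvterm id of the EST term.
row = est_feature_row("EST001", "ACGTACGTAC", organism_id=1, est_type_id=42)
```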
The unigene is the set of ESTs and the consensus. It is represented by a feature and by a feature graph. This graph defines which ESTs and assemblies belong to which unigenes. Each EST is part of the unigene cluster.
If there is a consensus sequence, it should be stored in this feature's residues.
The feature entry for the unigene.
The feature relationships (feature graph).
|subject_id||object_id||type_id||rank|
|EST001 id||UNI001 id||part_of id||0|
|EST002 id||UNI001 id||part_of id||0|
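To illustrate how the feature graph ties the ESTs to their unigene, here is a rough sketch using in-memory SQLite. Both tables are heavily simplified stand-ins for Chado's feature and feature_relationship tables: a plain text type column replaces the type_id reference to cvterm, and the add_feature helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        uniquename TEXT NOT NULL UNIQUE,
        type       TEXT NOT NULL          -- stands in for type_id
    );
    CREATE TABLE feature_relationship (
        subject_id INTEGER NOT NULL REFERENCES feature (feature_id),
        object_id  INTEGER NOT NULL REFERENCES feature (feature_id),
        type       TEXT NOT NULL,         -- stands in for type_id
        rank       INTEGER NOT NULL DEFAULT 0
    );
    """
)


def add_feature(uniquename, so_type):
    """Insert a feature and return its feature_id."""
    cur = conn.execute(
        "INSERT INTO feature (uniquename, type) VALUES (?, ?)",
        (uniquename, so_type),
    )
    return cur.lastrowid


# The unigene is the object; each EST is a subject that is part_of it.
unigene_id = add_feature("CMV:UNI001", "transcribed_cluster")
for est_name in ("CMV:EST001", "CMV:EST002"):
    est_id = add_feature(est_name, "EST")
    conn.execute(
        "INSERT INTO feature_relationship (subject_id, object_id, type) "
        "VALUES (?, ?, 'part_of')",
        (est_id, unigene_id),
    )

# Walk the graph: list the ESTs that belong to one unigene.
members = [
    r[0]
    for r in conn.execute(
        """
        SELECT est.uniquename
        FROM feature_relationship fr
        JOIN feature est ON est.feature_id = fr.subject_id
        JOIN feature uni ON uni.feature_id = fr.object_id
        WHERE uni.uniquename = ? AND fr.type = 'part_of'
        ORDER BY est.uniquename
        """,
        ("CMV:UNI001",),
    )
]
```

Here members ends up holding the uniquenames of the ESTs in the cluster, which is the kind of lookup the part_of relationships are meant to support.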
A unigene_cluster is a subtype of transcribed_cluster and represents the specific clustering methodology used by the NCBI to produce UniGenes (http://www.ncbi.nlm.nih.gov/unigene). If the clustering algorithm used was not UniGene, you would use transcribed_cluster.
ESTs and assembly alignments
The ESTs are all aligned with the assembly consensus. For each EST there is an alignment. For each alignment there is:
- a feature (type EST_match (SO:0000668))
- an entry in the analysisfeature table.
- two featurelocs, one for the consensus assembly and another for the EST.
Additional features for the alignments.
There is an analysisfeature for each of the previous features, holding the alignment scores. If we don't need to store these scores, this table could be optional.
|feature_id||analysis_id||identity|
|CMV:UNI001_EST1 id||EST clustering analysis id||The identity %|
|CMV:UNI001_EST2 id||EST clustering analysis id||The identity %|
The alignment coordinates are stored in two featurelocs in the featureloc table.
|feature_id||srcfeature_id||fmin||fmax||strand||residue_info||rank|
|CMV:UNI001_EST1 id||UNI001 id||leftmost boundary||rightmost boundary||strand||CIGAR||0|
|CMV:UNI001_EST1 id||EST001 id||leftmost boundary||rightmost boundary||strand||CIGAR||1|
|CMV:UNI001_EST2 id||UNI001 id||leftmost boundary||rightmost boundary||strand||CIGAR||0|
|CMV:UNI001_EST2 id||EST002 id||leftmost boundary||rightmost boundary||strand||CIGAR||1|
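The two featurelocs for one alignment can be sketched as below. Chado stores locations in interbase (zero-based, half-open) coordinates, so the sketch assumes the assembler reports 1-based inclusive positions and converts them. The alignment_featurelocs helper, the choice to place the consensus on strand 1, and all ids and coordinates are assumptions for illustration only.

```python
def alignment_featurelocs(match_id, consensus_id, est_id,
                          consensus_span, est_span, est_strand, cigar):
    """Return the two featureloc rows (as dicts) for one EST alignment.

    Spans are (start, end) pairs, assumed 1-based and inclusive.
    Interbase conversion: fmin = start - 1, fmax = end.
    """
    def loc(src_id, span, strand, rank):
        start, end = span
        if start < 1 or start > end:
            raise ValueError("expected 1 <= start <= end")
        return {
            "feature_id": match_id,       # the EST_match feature
            "srcfeature_id": src_id,      # the sequence being located on
            "fmin": start - 1,            # leftmost boundary
            "fmax": end,                  # rightmost boundary
            "strand": strand,
            "residue_info": cigar,
            "rank": rank,
        }

    return [
        loc(consensus_id, consensus_span, 1, 0),   # rank 0: the consensus
        loc(est_id, est_span, est_strand, 1),      # rank 1: the EST
    ]


# Hypothetical ids and coordinates for the alignment of EST001 on UNI001.
locs = alignment_featurelocs(
    match_id=30, consensus_id=10, est_id=20,
    consensus_span=(120, 520), est_span=(1, 401),
    est_strand=1, cigar="401M",
)
```

With these inputs the consensus location spans fmin 119 to fmax 520 and the EST location spans fmin 0 to fmax 401, so both cover 401 bases.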