This is a scratch page for thoughts on development of the GMOD Bio Object Layer (GBOL).
First things first. Can we pronounce GBOL gobble?
Object Layer Design Goals
- Provide a simple object layer for manipulating generic genomic features, their locations, annotation properties, and supporting analyses. This object layer will essentially mimic the structure of GMOD's Chado schema.
- Provide a set of biologically contextual objects that are extensions of the abstract layer and that have type-specific convenience functions and properties.
- The simple / bio object layers should be able to be used independently of any connection to a data store.
- Provide a factory interface for the fetching / storing of GBOL objects, as well as implementations that work with the Chado Schema and ChadoXML files.
- EL: Make sure that the biological layer adheres to Chado best practices in how it represents the data. We need to start standardizing use.
- The bio object layer contains classes that are extensions of the simple object layer classes. For instance, the Gene bio object class extends the simple object layer Feature class and has convenience methods like getTranscripts(). The feature type is set in the constructor by a configuration adaptor.
- The simple object layer is made up of classes that are essentially a transformation of the Chado schema. For instance, there's a CV class, CVTerm class, Feature class, etc.
- The simple object factory interface declares methods for fetching and saving simple objects to and from data stores. Initially, I'd like to implement that against Chado and ChadoXML.
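To make the layering above a bit more concrete, here is a minimal sketch, not a proposed API. All names in it (Feature's fields, FeatureStore, InMemoryFeatureStore) are hypothetical placeholders, and the in-memory store only stands in for the eventual Chado / ChadoXML implementations. It mainly shows that the objects can be built and used without any data store connection.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simple object layer: a plain class mirroring a few columns of Chado's feature table.
class Feature {
    final String organism;
    final String type;        // SO term name, e.g. "gene"
    final String uniquename;

    Feature(String organism, String type, String uniquename) {
        this.organism = organism;
        this.type = type;
        this.uniquename = uniquename;
    }
}

// Bio object layer: type-specific extensions of Feature.
class Transcript extends Feature {
    Transcript(String organism, String uniquename) {
        super(organism, "transcript", uniquename);
    }
}

class Gene extends Feature {
    private final List<Transcript> transcripts = new ArrayList<>();

    Gene(String organism, String uniquename) {
        super(organism, "gene", uniquename);
    }

    boolean addTranscript(Transcript transcript) {
        return transcripts.add(transcript);
    }

    List<Transcript> getTranscripts() {
        return Collections.unmodifiableList(transcripts);
    }
}

// Factory interface (hypothetical name): fetch / store, independent of the backing store.
interface FeatureStore {
    void save(Feature feature);

    Feature fetch(String uniquename); // returns null when nothing is stored under that name
}

// Stand-in implementation; a Chado or ChadoXML version would implement the same interface.
class InMemoryFeatureStore implements FeatureStore {
    private final Map<String, Feature> byUniquename = new HashMap<>();

    public void save(Feature feature) {
        byUniquename.put(feature.uniquename, feature);
    }

    public Feature fetch(String uniquename) {
        return byUniquename.get(uniquename);
    }
}
```

A Gene can be created, given transcripts, and queried with getTranscripts() without ever touching a store. Saving only happens when a FeatureStore implementation is handed the object.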
Simple Object Layer
As stated above, the simple object layer is essentially the Chado schema in Java object form.
- Random thoughts....
- In my first go-around, I calculated equality and hashCode based on the property values that are specified as the unique constraints of a table. For instance, equality of a Feature object is based on organism, type, and uniquename. This might be unexpected behavior if people expect equality / hashCode to be evaluated based on all object fields instead of a subset.
- Definitely want to go for the schema-driven class generation approach if the tools are good enough.
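The equality idea above could look roughly like the following. This is only a sketch: the field names are assumptions, and the extra name field is there just to show a property that does not take part in equality.

```java
import java.util.Objects;

// Sketch of a Feature whose equality follows the table's unique constraint
// (organism, type, uniquename) rather than all fields.
class Feature {
    private final String organism;
    private final String type;
    private final String uniquename;
    private String name; // deliberately NOT part of equals / hashCode

    Feature(String organism, String type, String uniquename, String name) {
        this.organism = organism;
        this.type = type;
        this.uniquename = uniquename;
        this.name = name;
    }

    String getName() {
        return name;
    }

    void setName(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Feature)) return false;
        Feature other = (Feature) o;
        return organism.equals(other.organism)
                && type.equals(other.type)
                && uniquename.equals(other.uniquename);
    }

    @Override
    public int hashCode() {
        // Must use the same subset of fields as equals().
        return Objects.hash(organism, type, uniquename);
    }
}
```

With this definition, two Feature objects that share organism, type, and uniquename but have different names compare as equal and collapse to one entry in a HashSet. That is the potentially surprising behavior mentioned above.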
Bio Object Layer
These BioObjects will be extensions of the simple object class Feature. We've discussed the idea of having some sort of "configuration adaptor" assigned to each bio object instance that would essentially specify the object's type, detail how other bio objects are related to it, and tell what sorts of features to save should the object be written to a data store. See Configuration Adaptor Thoughts for more details.
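As a very rough illustration of the configuration adaptor idea, here is one possible shape. Everything in it is hypothetical (the ConfigurationAdaptor interface, its three methods, and the GeneConfiguration values); nothing has been decided. It just restates the three responsibilities listed above as code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical adaptor: tells a bio object what it is, what relates to it, and what to save.
interface ConfigurationAdaptor {
    String getType();               // the object's type, as an SO term name

    List<String> getRelatedTypes(); // types of bio objects that can be related to it

    List<String> getTypesToSave();  // feature types to write out with the object
}

// Example configuration for a gene; the chosen values are assumptions.
class GeneConfiguration implements ConfigurationAdaptor {
    public String getType() {
        return "gene";
    }

    public List<String> getRelatedTypes() {
        return Collections.singletonList("transcript");
    }

    public List<String> getTypesToSave() {
        return Arrays.asList("gene", "transcript", "exon");
    }
}

// A bio object instance carries its adaptor and defers type questions to it.
class BioObject {
    private final String uniquename;
    private final ConfigurationAdaptor config;

    BioObject(String uniquename, ConfigurationAdaptor config) {
        this.uniquename = uniquename;
        this.config = config;
    }

    String getUniquename() {
        return uniquename;
    }

    String getType() {
        return config.getType();
    }

    boolean canRelateTo(String otherType) {
        return config.getRelatedTypes().contains(otherType);
    }

    boolean shouldSave(String featureType) {
        return config.getTypesToSave().contains(featureType);
    }
}
```

One reason to keep the adaptor separate from the object is that different MODs could supply different configurations (for example, whether CDS_region-style features get saved) without changing the bio object classes.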
Central Dogma Model thoughts:
- I think there's a significant amount of confusion (or ambiguity) in the way the molbio central dogma is represented between MODs (and Apollo). Personally, I like using a FlyBase-style protein feature whose coordinates denote the start and stop of translation of a protein within a transcript, but the name protein is sort of a misnomer, and a protein shouldn't have a featureloc on a chromosome arm. So what about using a CDS feature to denote the boundaries of translation within a transcript? It could work just like the FlyBase protein feature: not multiple CDS regions on a per-exon basis but rather a single feature whose coordinates dictate the start and stop of translation within a transcript. Its residues will be nucleotides and can be translated to get the polypeptide residues. There can be a polypeptide feature also, but it's related to the CDS feature, not the transcript.
- Possible biological model: Gene -> Transcript -> (single) CDS -> Polypeptide. The CDS feature would be a single feature that would denote the start and stop of translation within the transcript and would have coordinates relative to the genome or transcript. Genomic coordinates make it easiest to calculate exon CDS regions and UTRs. Transcript-relative coordinates seem most technically correct. The polypeptide would have no feature coordinates relative to the genome.
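To show why genomic coordinates make the per-exon pieces easy to derive from a single CDS feature, here is a small sketch. It assumes Chado-style interbase coordinates (fmin inclusive, fmax exclusive) and ignores strand, so "left" and "right" below mean 5' UTR and 3' UTR only on the forward strand. Interval and CdsSplitter are hypothetical helper names.

```java
import java.util.ArrayList;
import java.util.List;

// Half-open genomic interval: fmin inclusive, fmax exclusive.
class Interval {
    final int fmin;
    final int fmax;

    Interval(int fmin, int fmax) {
        this.fmin = fmin;
        this.fmax = fmax;
    }

    int length() {
        return fmax - fmin;
    }
}

class CdsSplitter {
    // The part of each exon that overlaps the single CDS span; non-coding exons are skipped.
    static List<Interval> cdsRegions(List<Interval> exons, Interval cds) {
        List<Interval> regions = new ArrayList<>();
        for (Interval exon : exons) {
            int lo = Math.max(exon.fmin, cds.fmin);
            int hi = Math.min(exon.fmax, cds.fmax);
            if (lo < hi) {
                regions.add(new Interval(lo, hi));
            }
        }
        return regions;
    }

    // Exon pieces lying left of the CDS start (5' UTR pieces on the forward strand).
    static List<Interval> leftOfCds(List<Interval> exons, Interval cds) {
        List<Interval> pieces = new ArrayList<>();
        for (Interval exon : exons) {
            if (exon.fmin < cds.fmin) {
                pieces.add(new Interval(exon.fmin, Math.min(exon.fmax, cds.fmin)));
            }
        }
        return pieces;
    }

    // Exon pieces lying right of the CDS end (3' UTR pieces on the forward strand).
    static List<Interval> rightOfCds(List<Interval> exons, Interval cds) {
        List<Interval> pieces = new ArrayList<>();
        for (Interval exon : exons) {
            if (exon.fmax > cds.fmax) {
                pieces.add(new Interval(Math.max(exon.fmin, cds.fmax), exon.fmax));
            }
        }
        return pieces;
    }
}
```

For exons [100,200), [300,400), [500,600) and a CDS spanning [150,550), this gives CDS regions [150,200), [300,400), [500,550), a left piece [100,150), and a right piece [550,600). Nothing per-exon has to be stored for that; it can all be calculated on the fly.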
Extended Objects and methods:
- SO:0000704: Gene
- List<Transcript> getTranscripts(): Returns all transcripts related to a gene.
- boolean addTranscript(Transcript): Associates a transcript with a gene. Updates gene boundaries if necessary.
- List<Exon> getExons(): Returns all gene exons ordered by increasing start location.
- SO:0000673: Transcript
- Most of the objects involved in the CDS_region/ coding / UTR related functions could be calculated on the fly.
- Gene getGene(): returns gene that transcript is part of.
- List<Exon> getExons(): Returns transcript exons ordered by increasing start location.
- List<Exon> getCodingExons(): Returns transcript exons that contain coding sequence.
- List<Exon> getFivePrimeUTRExons(): Returns transcript exons containing 5' UTR sequence.
- List<Exon> getThreePrimeUTRExons(): Returns transcript exons containing 3' UTR sequence.
- boolean addExon(Exon): Associates an exon with a transcript. Updates transcript / gene boundaries if necessary.
- List<FivePrimeUTR> getFivePrimeUTRs(): Returns FivePrimeUTR features for a transcript. Should this actually be FivePrimeUTR_region?... see below.
- List<ThreePrimeUTR> getThreePrimeUTRs(): Returns ThreePrimeUTR features for a transcript. Ditto ThreePrimeUTR_region.
- List<Intron> getIntrons(): Returns intron features for a transcript.
- CDS getCDS(): Returns a single CDS feature for a transcript.
- List<CDSRegion> getCDSRegions(): Returns CDS_region features for a transcript.
- Protein getProtein(): Not sure about this one. It'd be nice to assume that there was a single protein for any protein-coding transcript. How many people have multiple proteins associated with a single transcript?
- SO:0000147: Exon
- Gene getGene(): returns gene that exon is part of.
- List<Transcript> getTranscripts(): returns transcripts that exon is part of.
- boolean containsCDS(): returns true if exon contains coding sequence, false otherwise.
- boolean containsUTR(): returns true if exon has any UTR, false otherwise.
- boolean contains5primeUTR(): returns true if exon contains 5primeUTR, false otherwise.
- boolean contains3primeUTR(): returns true if exon contains 3primeUTR, false otherwise.
- CDSRegion getCDSRegion(): Returns CDSRegion component of an exon if it exists.
- FivePrimeUTR get5primeUTR(): Returns 5primeUTR component of an exon if it exists.
- ThreePrimeUTR get3primeUTR(): Returns 3primeUTR component of an exon if it exists.
- SO:0000188: Intron
- Transcript getTranscript(): returns transcript that intron is a part of.
- SO:0000204: FivePrimeUTR
- The SO definition indicates that it's a contiguous sequence, but we want separate 5' UTR pieces. So maybe a new FivePrimeUTR_region SO term is called for?
- Transcript getTranscript(): returns transcript that UTR is a part of.
- Exon getExon(): returns the exon that the UTR is a part of.
- SO:0000205: ThreePrimeUTR
- Same as FivePrimeUTR. Need a ThreePrimeUTR_region in SO?
- Transcript getTranscript(): returns transcript that UTR is a part of.
- SO:0000316: CDS
- The idea of having a CDS feature define the translational boundaries of a transcript actually makes more sense than using a protein feature. So maybe the model should look like gene->transcript->CDS->protein. The CDS feature would have nucleotides as residues and featurelocs that define the translational boundaries of a related protein feature.
- Transcript getTranscript(): returns transcript that CDS is a subsequence of.
- String translate(): returns the translated peptide sequence of the CDS.
- SO:0000851: CDSRegion
- First, I don't really see the utility of CDS_region features. They seem like a waste of space, and all the information that they contain can simply be derived from a single FlyBase-style protein feature (with translational start / stop coordinates as its featureloc). Actually... that should be a CDS feature instead of a protein. Despite this, it appears that a number of MODs, and maybe Ensembl too, rely on CDS_region-style features (often just called 'CDS' features), so GBOL should support them to some extent.
- Transcript getTranscript(): returns transcript that CDSRegion is a part of.
- Exon getExon(): returns the exon that the CDSRegion is a part of.
- SO:0000104: Polypeptide
- Should this even have a featureloc if the CDS feature is used to define translational boundaries?
- Should this contain the translated stop codon (*) from the CDS sequence?
- Transcript getTranscript(): returns the transcript that the polypeptide derives from.
- CDS getCDS(): returns the CDS feature that the polypeptide is translated from.
- SO:0005836: RegulatoryRegion
- SO:0000105: ChromosomeArm
- SO:0000148: Supercontig
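A few of the methods listed above (addTranscript, addExon, and the ordered getExons) can be sketched as follows. This is just one way they might behave, under some assumptions: coordinates are interbase (fmin inclusive, fmax exclusive), boundaries only ever grow, and the Located base class and setGene method are hypothetical helpers that stand in for the simple object layer Feature and its featureloc.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal located feature; stands in for Feature plus its featureloc.
class Located {
    int fmin;
    int fmax;

    Located(int fmin, int fmax) {
        this.fmin = fmin;
        this.fmax = fmax;
    }

    // Grow this feature so that it covers the child; it never shrinks.
    void extendToCover(Located child) {
        fmin = Math.min(fmin, child.fmin);
        fmax = Math.max(fmax, child.fmax);
    }
}

class Exon extends Located {
    Exon(int fmin, int fmax) { super(fmin, fmax); }
}

class Transcript extends Located {
    private Gene gene;
    private final List<Exon> exons = new ArrayList<>();

    Transcript(int fmin, int fmax) { super(fmin, fmax); }

    Gene getGene() { return gene; }

    void setGene(Gene gene) { this.gene = gene; }

    // Associates an exon, keeps exons ordered by start, and updates transcript / gene boundaries.
    boolean addExon(Exon exon) {
        if (exons.contains(exon)) return false;
        exons.add(exon);
        exons.sort(Comparator.comparingInt((Exon e) -> e.fmin));
        extendToCover(exon);
        if (gene != null) gene.extendToCover(this);
        return true;
    }

    List<Exon> getExons() { return new ArrayList<>(exons); }
}

class Gene extends Located {
    private final List<Transcript> transcripts = new ArrayList<>();

    Gene(int fmin, int fmax) { super(fmin, fmax); }

    // Associates a transcript and updates gene boundaries if necessary.
    boolean addTranscript(Transcript transcript) {
        if (transcripts.contains(transcript)) return false;
        transcripts.add(transcript);
        transcript.setGene(this);
        extendToCover(transcript);
        return true;
    }

    List<Transcript> getTranscripts() { return new ArrayList<>(transcripts); }

    // All exons of all transcripts, without repeats, ordered by increasing start.
    List<Exon> getExons() {
        List<Exon> all = new ArrayList<>();
        for (Transcript t : transcripts) {
            for (Exon e : t.getExons()) {
                if (!all.contains(e)) all.add(e);
            }
        }
        all.sort(Comparator.comparingInt((Exon e) -> e.fmin));
        return all;
    }
}
```

Whether boundaries should also shrink when a child is removed, and whether shared exons should be compared by identity or by the unique-constraint equality discussed earlier, are open questions this sketch does not settle.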
Configuration Adaptor Thoughts
I'd like a BioPerl-style set of objects that allows you to traverse sequence alignments, hits, and HSPs. Certainly, people are always asking how to get analyses in and out of Chado. I think this will involve an extension of the Feature class like the bio object layer. Err... It probably should be part of the bio object layer.
Is there a controlled vocabulary for annotation properties? We need to specify conventions for things like storing feature comments and comment properties. How about annotation evidence structure and confidence codes from GO?