Difference between revisions of "Chado New Users"
Latest revision as of 19:58, 22 April 2008
This page and its associated discussion page follow the learning curve of new Chado users learning the system at CSHL.
Getting an empty Chado PostgreSQL on our machines
- Zheng - on PC with Fedora. See Zheng's installation notes
- Mike - on Intel Mac running Fedora partition
- Jim -
  - on PPC Mac the hard way. See Jim's installation notes
  - on CentOS server in Texas
Installation Notes
If the easy way fails, the old documentation outside the wiki can be pretty confusing.
Loading the Ontologies
This works via make ontologies. How to update the ontologies after the initial load is still an open question for us.
Getting the Sequence Module Working
We think of GFF3 as a denormalized view into Chado, one that draws on the Sequence module and the CV module.
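To make the "denormalized view" idea concrete, here is a minimal sketch. It is not the real Chado schema: it uses an in-memory SQLite database with cut-down versions of the feature, featureloc and cvterm tables (real Chado runs on PostgreSQL and has many more columns), and the sample rows chrI and geneA are invented for illustration. The function gff3_lines is our own hypothetical helper, not part of any GMOD tool.

```python
import sqlite3

# Simplified stand-in for a few Chado tables (assumption: the real tables
# have many more columns and constraints than shown here).
SCHEMA = """
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type_id INTEGER NOT NULL REFERENCES cvterm (cvterm_id)
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL REFERENCES feature (feature_id),
    srcfeature_id INTEGER REFERENCES feature (feature_id),
    fmin INTEGER, fmax INTEGER, strand INTEGER
);
"""


def gff3_lines(conn):
    """Join the normalized tables back into flat, GFF3-like rows."""
    rows = conn.execute("""
        SELECT src.uniquename, t.name, fl.fmin, fl.fmax, fl.strand, f.uniquename
        FROM featureloc fl
        JOIN feature f ON f.feature_id = fl.feature_id
        JOIN feature src ON src.feature_id = fl.srcfeature_id
        JOIN cvterm t ON t.cvterm_id = f.type_id
        ORDER BY fl.fmin, f.feature_id
    """)
    strand_symbol = {1: "+", -1: "-"}
    lines = []
    for seqid, ftype, fmin, fmax, strand, name in rows:
        # Chado featureloc is interbase (0-based start); GFF3 start is 1-based.
        cols = [seqid, ".", ftype, str(fmin + 1), str(fmax), ".",
                strand_symbol.get(strand, "."), ".", f"ID={name}"]
        lines.append("\t".join(cols))
    return lines


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO cvterm VALUES (?, ?)",
                 [(1, "chromosome"), (2, "gene")])
conn.executemany("INSERT INTO feature VALUES (?, ?, ?)",
                 [(1, "chrI", 1), (2, "geneA", 2)])
conn.execute("INSERT INTO featureloc VALUES (1, 2, 1, 99, 500, 1)")

for line in gff3_lines(conn):
    print(line)
```

The point of the sketch is only that one GFF3 line pulls together what Chado keeps in three places: the feature itself, its type (a CV term) and its location on a source feature.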
Migration from other databases
- Zheng's notes on wormbase migration
- Jim's notes on E. coli migration
Sample Data
To understand Chado Best Practices, where the documentation is sometimes incomplete, we've tried to get some samples of Chado data in use. Things we've looked at so far, and comments on them:
- Sample yeast data from the GFF3 bulk loader. The problem with this is that it doesn't reflect a real use case for Chado, since SGD does not use the Chado schema.
- FlyBase SQL dump. Zheng got this and loaded it. One problem:
  - It's huge.
Understanding how things are represented in Chado
Central Dogma
Chado Best Practices describes some of the representations. Unfortunately it's somewhat incomplete at present.
Gene
Chado uses a eukaryote-centric gene definition based on monocistronic mRNAs. In this view, the gene includes information in the genomic DNA outside of the part that codes for the mRNA. To represent a gene, there needs to be:
- A feature record (a row in the feature table)
  - Note that the seqlen field could be problematic; a note has been added.
- If the gene is mapped to the sequence, a featureloc record
Completing the representation of the gene seems to require additional features of types 'mRNA' and 'exon' (and 'polypeptide' if it's protein coding). What happens if software tries to write a feature record as a gene without creating these? Presumably the gene feature has to be entered first in order to have an object_id for feature_relationship.
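The ordering question above can be tried out on a toy model. The sketch below is an assumption-laden stand-in, not real Chado: it uses SQLite instead of PostgreSQL, keeps only a few columns, and stores type names as plain text where Chado uses type_id pointing at cvterm. The helpers add_feature and relate and the names geneA and geneA-RA are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type TEXT NOT NULL          -- real Chado stores type_id -> cvterm
);
CREATE TABLE feature_relationship (
    feature_relationship_id INTEGER PRIMARY KEY,
    subject_id INTEGER NOT NULL REFERENCES feature (feature_id),
    object_id  INTEGER NOT NULL REFERENCES feature (feature_id),
    type TEXT NOT NULL          -- real Chado stores type_id -> cvterm
);
""")


def add_feature(conn, uniquename, ftype):
    """Insert one feature row and return its feature_id."""
    cur = conn.execute(
        "INSERT INTO feature (uniquename, type) VALUES (?, ?)",
        (uniquename, ftype))
    return cur.lastrowid


def relate(conn, subject_id, rel_type, object_id):
    """Record 'subject rel_type object', e.g. mRNA part_of gene."""
    conn.execute(
        "INSERT INTO feature_relationship (subject_id, object_id, type) "
        "VALUES (?, ?, ?)",
        (subject_id, object_id, rel_type))


mrna_id = add_feature(conn, "geneA-RA", "mRNA")

# Pointing at a gene that has not been inserted yet violates the foreign key.
try:
    relate(conn, mrna_id, "part_of", 999)
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False

# With the gene row in place, its feature_id can serve as object_id.
gene_id = add_feature(conn, "geneA", "gene")
relate(conn, mrna_id, "part_of", gene_id)
```

In this sketch the relationship row cannot be written until the gene row exists, which matches the presumption above. Nothing in the toy schema stops a gene feature from being written with no mRNA or exon at all; whether software that reads Chado copes with such a bare gene is a separate question that the sketch does not answer.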
mRNA
mRNA features are entered with part_of relationships to genes. This is straightforward in cases where the mRNA is derived from a high-quality full length cDNA (but what's the feature_relationship type?). Does an mRNA have to have a featureloc? What if the CDS is known but the precise ends of the UTRs are not?
Polycistronic transcription units
As of this writing, the description of handling dicistronic genes is not very clear. Based on the GFF3 spec:
- Parenting a CDS/polypeptide directly on a gene is deprecated because the gene (sensu eukaryota) includes nontranscribed regions
- A solution is to give the mRNA feature multiple parents. Thus lacZ, lacY and lacA would all be parents of lacZYA, which in turn would be parent via a derives_from relationship to the LacZ, LacY and LacA polypeptides.
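As we read it, the multiple-parent solution for the lac operon would come out as the rows below. This is a sketch under stated assumptions: SQLite stands in for PostgreSQL, the two tables are cut down to a few columns, type names are stored as text instead of cvterm ids, the use of part_of for the mRNA-to-gene link is our guess, and parents_of is a hypothetical helper.

```python
import sqlite3

# Simplified stand-in for Chado's feature and feature_relationship tables.
# Assumption: type names are stored as plain text here; real Chado stores
# type_id values that point into cvterm.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type TEXT NOT NULL
);
CREATE TABLE feature_relationship (
    subject_id INTEGER NOT NULL REFERENCES feature (feature_id),  -- child
    object_id  INTEGER NOT NULL REFERENCES feature (feature_id),  -- parent
    type TEXT NOT NULL
);
""")

ids = {}
for name, ftype in [
    ("lacZ", "gene"), ("lacY", "gene"), ("lacA", "gene"),
    ("lacZYA", "mRNA"),
    ("LacZ", "polypeptide"), ("LacY", "polypeptide"), ("LacA", "polypeptide"),
]:
    cur = conn.execute(
        "INSERT INTO feature (uniquename, type) VALUES (?, ?)", (name, ftype))
    ids[name] = cur.lastrowid

# One mRNA, three gene parents.
for gene in ("lacZ", "lacY", "lacA"):
    conn.execute(
        "INSERT INTO feature_relationship VALUES (?, ?, 'part_of')",
        (ids["lacZYA"], ids[gene]))

# Each polypeptide derives_from the shared mRNA.
for pep in ("LacZ", "LacY", "LacA"):
    conn.execute(
        "INSERT INTO feature_relationship VALUES (?, ?, 'derives_from')",
        (ids[pep], ids["lacZYA"]))


def parents_of(conn, uniquename):
    """Return (relationship type, parent uniquename) pairs for one feature."""
    return conn.execute("""
        SELECT r.type, parent.uniquename
        FROM feature_relationship r
        JOIN feature child ON child.feature_id = r.subject_id
        JOIN feature parent ON parent.feature_id = r.object_id
        WHERE child.uniquename = ?
        ORDER BY parent.uniquename
    """, (uniquename,)).fetchall()


print(parents_of(conn, "lacZYA"))
print(parents_of(conn, "LacY"))
```

One thing the sketch makes visible: the rows alone say that LacY derives from lacZYA, but not which of the three genes it corresponds to. That link would have to come from somewhere else, for example coordinates in featureloc.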
Other RNAs
tRNAs, rRNAs, snRNAs, etc. have similar relationships to genes. Note that even in eukaryotes, rRNAs and tRNAs are often polycistronic transcripts!
Polypeptides
Polypeptides are linked to mRNAs by a derives_from relationship.
Proteins
Note that proteins ≠ polypeptides. Hemoglobin is a heterotetramer of two α and two β subunits. Is there a feature type that represents this?
See also
- Chado - Getting_Started - and documentation links from there.
- Category:Chado - the Category page for all pages about Chado in this wiki