NOTE: We are working on migrating this site away from MediaWiki, so editing pages will be disabled for now.
Zheng's notes on wormbase migration
Contents
description
this page is a record of my experience for migrating wormbase onto chado. As far as I know, wormbase is based on the Acedb (an object-oriented schema) mapping onto rmdbs (mysql/postgresql). Chado is a new, more sophisticated, but generic schema.
module-based migration
focus on sequence module first. using gff3 files as input.
gff3 | chado table | |
---|---|---|
seqid | featureloc.srcfeature_id, featureloc.rank=0 | |
source | feature_dbxref.feature_id-> | |
type | feature.type_id, cvterm.id feature_cvterm | |
start | featureloc.nbeg | |
end | featureloc.nend | |
score | ||
strand | featureloc.strand | |
phase | featureloc.phase | |
attribute ID | feature.name, feature.uniquename if ID is unique otherwise 'auto'+feature.feature_id | |
attribute Name | feature. |
bio-chaos and gmod_bulk_load_gff3
both bio-chaos 0.02 and gmod_bulk_load_gff3 can theoretically work. btw, bio-chaos 0.01 is included in the schema cvs download, but no gff3->chaos script in it. so go to bio-chaos 0.02 for prerequisite and installation. read a book XML in a nutshell helps a lot for me to understand chaos DTD.
Now I know XMLXORT will be finally used not only for sequence-related data but also for other data, I have to learn XMLXORT.
first step
get the current release WS171 gff3 file from wormbase. total 1.07G. split it by:
grep -P /^I\t/ [zha@localhost 1]$ ls -l chrI.gff3 -rw-rw-r-- 1 zha zha 165530115 Mar 20 17:33 chrI.gff3
only two directive lines in ws171
##gff-version 3 ##Index-subfeature 0
but adding the size of chr-based files does not (similarly) equal to the original size of ws171, ??? I lost something here already?
pain for loading
- first try load a sample gff3
a sample nGASP gff3 file has been successfully transformed to chadoXML by bio-chaos.
use Bio::Chaos; my $path = '/home/zha/gff3/phase2_confirmed.gff3'; my $infmt = 'gff3'; my $outfmt = 'chadoxml'; my $c = Bio::Chaos->new; $c->parse($path, $infmt); print $c->transform_to($outfmt)->xml;
but I doubt it could load onto chado for the following test on gmod-bulk-load-gff3.
[zha@localhost gff3]$ gmod_bulk_load_gff3.pl --dbname zha --organism worm --gfffile \ phase2_confirmed.gff3 Preparing data for inserting into the zha database (This may take a while ...) Unable to find srcfeature IV in the database.
sort it so that Parent of a feature (column 9 tag Parent) comes before the feature line in file.
sorted it by:
gmod_sort_gff3 --infile chrI.gff3 > chrI.unresolved
two files are generated:
chrI.sorted.gff3 chrI.unresolved
but adding the size of them, much less than the size of chrI.gff3, I definitely lost a lot here, abadon this is not what I expected from the name of the file and perldoc.
my experience with chromosome I
- chromosome definition line
I Link chromosome 1 15072419 . + . Name=I
I manually changed it to
I Link chromosome 1 15072419 . + . ID=I, Name=I
and put it at the top of the gff3 file, it is NOT a problem of gff3 file, i.e., the file is valid wherever this line is or even without this line, but put it on top helps the bulk_load, or maybe gmod_gff3_prepocessor will try to do this change.
- clone_end line
I . clone_end 10038617 10038617 . . . Name=C03C11 no cvterm for clone_end at /usr/lib/perl5/site_perl/5.8.8/Bio/GMOD/DB/Adapter.pm line 3445, <GEN0> line 12402. Issuing rollback() for database handle being DESTROY'd without explicit disconnect().
this is a valid line, i.e, clone_end is a valid SOFA term, accroding to SOFA v2 (05-16-2005). what we loaded in chado installation is the SO latest minor revision version v2.1 (08-16-2006). in this version clone_end change to clone_insert_end.
- this is a known situation...
Your GFF3 file uses a tag called 'confirmed_est', but this term is not already in the cvterm table so that it's value can be inserted into the featureprop table. The easiest way to rectify this is to execute the following SQL commands in the psql shell: INSERT INTO dbxref (db_id,accession) VALUES ((select db_id from db where name='null'),'autocreated:confirmed_est'); INSERT INTO cvterm (cv_id,name,dbxref_id) VALUES ((select cv_id from cv where name='autocreated'), 'confirmed_est', (select dbxref_id from dbxref where accession='autocreated:confirmed_est')); and then rerun this loader. Your other option is to write a special handler for this tag so that it will go where you want it in the database. Died at /usr/lib/perl5/site_perl/5.8.8/Bio/GMOD/DB/Adapter.pm line 2834, <GEN0> line 13204. Issuing rollback() for database handle being DESTROY'd without explicit disconnect().
Noticed the above situation the cvterm is in column 3 (type), here the term is in column 9, a tag, such as ID, NAME, Dbxref, etc. I encoutered a series of them, which are good information.
confirmed_EST |
confirmed_UTR |
confirmed_Homology |
confirmed_inconsistence |
confirmed_false |
used_for_training |
predicted_ncrna_gene |