Zheng's notes on wormbase migration
Revision as of 22:48, 22 March 2007
description
This page records my experience migrating wormbase onto chado. As far as I know, wormbase is based on Acedb (an object-oriented schema) mapped onto an RDBMS (mysql/postgresql). Chado is a newer, more sophisticated, but generic schema.
module-based migration
Focus on the sequence module first, using gff3 files as input.

gff3 file format and chado

gff3              chado table
seqid
source            feature_dbxref
type              feature.type_id, cvterm.id feature_cvterm
start
end
score
strand
phase
attribute ID      feature.name
          Name    feature.
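To make the column names in the table concrete, here is a minimal sketch of how one gff3 feature line splits into the nine columns and the column 9 tags. The function name `parse_gff3_line` and the example line are invented for illustration; this is not how bio-chaos or gmod_bulk_load_gff3 parse the file, and it does not decide which chado table each value belongs in.

```python
from urllib.parse import unquote

# The nine gff3 columns, in file order.
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Split one gff3 feature line into a dict keyed by column name.

    Column 9 becomes a dict mapping each tag (ID, Name, Parent, ...)
    to a list of URL-decoded values.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if not pair:
                continue  # tolerate a trailing ';'
            tag, _, value = pair.partition("=")
            attributes[tag] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Made-up example line, not taken from WS171.
example = "I\tCoding_transcript\tgene\t1000\t2000\t.\t+\t.\tID=Gene:example1;Name=example1"
rec = parse_gff3_line(example)
print(rec["seqid"], rec["type"], rec["attributes"]["ID"], rec["attributes"]["Name"])
```

Keeping the tag values as lists matters because tags such as Parent may carry several comma-separated values.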
bio-chaos and gmod_bulk_load_gff3
Both bio-chaos 0.02 and gmod_bulk_load_gff3 can (theoretically) work. By the way, bio-chaos 0.01 is included in the schema cvs download, but it contains no gff3->chaos script, so go to bio-chaos 0.02 for the prerequisites and installation. Reading the book XML in a Nutshell helped me a lot in understanding the chaos DTD.
XMLXORT
first step
Get the current release (WS171) gff3 file from wormbase; it is 1.07G in total. Split it by chromosome with:
grep -P '^I\t'
[zha@localhost 1]$ ls -l chrI.gff3
-rw-rw-r-- 1 zha zha 165530115 Mar 20 17:33 chrI.gff3
There are only two directive lines in ws171:
##gff-version 3
##Index-subfeature 0
But the sizes of the chromosome-based files do not add up to (or even come close to) the original size of ws171. Have I already lost something here?
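One way to see where the missing bytes went is to total the bytes per seqid over the original file, counting directive and comment lines separately. Any seqid that was never grepped for then shows up with its own total. This is a minimal sketch; the function name `bytes_per_seqid` and the in-memory sample (including the seqid `MtDNA`) are made up for illustration and are not taken from WS171.

```python
import io
from collections import Counter


def bytes_per_seqid(handle):
    """Sum the byte length of every line, keyed by seqid (column 1).

    `handle` must yield lines as bytes (a file opened in 'rb' mode).
    Directive/comment lines and blank lines get their own keys, so the
    grand total always equals the file size.
    """
    totals = Counter()
    for raw in handle:
        if raw.startswith(b"#"):
            key = "(directive/comment)"
        elif not raw.strip():
            key = "(blank)"
        else:
            key = raw.split(b"\t", 1)[0].decode("utf-8", "replace")
        totals[key] += len(raw)
    return totals


# Made-up sample standing in for the real file.
sample = io.BytesIO(
    b"##gff-version 3\n"
    b"I\t.\tgene\t1\t9\t.\t+\t.\tID=g1\n"
    b"MtDNA\t.\tgene\t1\t9\t.\t+\t.\tID=g2\n"
)
totals = bytes_per_seqid(sample)
for key, size in sorted(totals.items()):
    print(key, size)
print("total", sum(totals.values()))
```

For the real file, the same function could be called on `open(path, "rb")`, and each per-seqid total compared with the `ls -l` size of the matching chromosome file.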
pain for loading
- first try: load a sample gff3
A sample nGASP gff3 file was successfully transformed to chadoXML by bio-chaos:
use Bio::Chaos;
my $path = '/home/zha/gff3/phase2_confirmed.gff3';
my $infmt = 'gff3';
my $outfmt = 'chadoxml';
my $c = Bio::Chaos->new;
$c->parse($path, $infmt);
print $c->transform_to($outfmt)->xml;
But I doubt the result could be loaded into chado, given the following test with gmod_bulk_load_gff3:
[zha@localhost gff3]$ gmod_bulk_load_gff3.pl --dbname zha --organism worm --gfffile \
    phase2_confirmed.gff3
Preparing data for inserting into the zha database (This may take a while ...)
Unable to find srcfeature IV in the database.
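The error message suggests that the loader looked for a feature representing the reference sequence IV and did not find one. The sketch below assumes the loader is satisfied when the gff3 file itself contains a line whose ID equals its own seqid (for example a chromosome line). That is an assumption, not something confirmed here. The check only inspects the file and cannot see what is already in the database. The function name and the sample lines are made up.

```python
def seqids_without_reference_line(lines):
    """Return the seqids used in column 1 that have no line defining them.

    A seqid counts as defined when some feature line on that seqid has
    ID=<seqid> in column 9 (assumed convention for a reference sequence).
    """
    used, defined = set(), set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a feature line
        seqid, attrs = fields[0], fields[8]
        used.add(seqid)
        for pair in attrs.split(";"):
            tag, _, value = pair.partition("=")
            if tag == "ID" and value == seqid:
                defined.add(seqid)
    return sorted(used - defined)


# Made-up sample: chromosome I is defined in the file, IV is not.
sample = [
    "##gff-version 3\n",
    "IV\t.\tgene\t1\t9\t.\t+\t.\tID=g1\n",
    "I\t.\tchromosome\t1\t100\t.\t.\t.\tID=I\n",
    "I\t.\tgene\t1\t9\t.\t+\t.\tID=g2\n",
]
missing = seqids_without_reference_line(sample)
print(missing)
```

If a seqid is reported, the reference sequence presumably has to be added to the file or loaded into the database before the features that sit on it.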
Next, sort the file so that the Parent of a feature (the Parent tag in column 9) comes before the feature line in the file.
I sorted it with:
gmod_sort_gff3 --infile chrI.gff3 > chrI.unresolved
Two files are generated:
chrI.sorted.gff3
chrI.unresolved
But their combined size is much less than the size of chrI.gff3, so I definitely lost a lot here. Abandoned: this is not what I expected from the name of the file and the perldoc.
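For comparison, the ordering described above (every Parent before its children) can be sketched so that each feature line ends up either in the sorted output or in an unresolved list, which makes lost lines easy to spot. This is a minimal illustration and not a description of what gmod_sort_gff3 does internally. The function names and sample lines are made up, and the repeated passes are only reasonable for small files.

```python
def get_attr(line, tag):
    """Return the list of values of one column 9 tag (empty if absent)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return []
    for pair in fields[8].split(";"):
        key, _, value = pair.partition("=")
        if key == tag:
            return value.split(",")
    return []


def order_parents_first(lines):
    """Reorder feature lines so each Parent precedes its children.

    Returns (ordered, unresolved). Lines whose Parent never appears
    (or that sit in a cycle) are returned in `unresolved`.
    """
    pending = [l for l in lines if l.strip() and not l.startswith("#")]
    ordered, unresolved, emitted = [], [], set()
    while pending:
        waiting = []
        for line in pending:
            if all(p in emitted for p in get_attr(line, "Parent")):
                ordered.append(line)
                emitted.update(get_attr(line, "ID"))
            else:
                waiting.append(line)
        if len(waiting) == len(pending):
            unresolved = waiting  # no progress in this pass
            break
        pending = waiting
    return ordered, unresolved


# Made-up sample: children listed before parents, one orphan exon.
sample = [
    "I\t.\texon\t1\t5\t.\t+\t.\tID=e1;Parent=t1\n",
    "I\t.\tmRNA\t1\t9\t.\t+\t.\tID=t1;Parent=g1\n",
    "I\t.\tgene\t1\t9\t.\t+\t.\tID=g1\n",
    "I\t.\texon\t1\t5\t.\t+\t.\tID=e2;Parent=missing\n",
]
ordered, unresolved = order_parents_first(sample)
print([get_attr(l, "ID")[0] for l in ordered])
print([get_attr(l, "ID")[0] for l in unresolved])
```

Because `len(ordered) + len(unresolved)` always equals the number of feature lines read, comparing that sum with the input line count is a quick check that nothing was dropped.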