Difference between revisions of "Zheng's notes on wormbase migration"

Revision as of 22:48, 22 March 2007

description

This page records my experience migrating WormBase onto Chado. As far as I know, WormBase is based on AceDB (an object-oriented schema) mapped onto an RDBMS (MySQL/PostgreSQL). Chado is a newer, more sophisticated, but generic schema.

module-based migration

Focus on the sequence module first, using gff3 files as input.

gff3 file format and chado

gff3              chado table
seqid
source            feature_dbxref
type              feature.type_id, cvterm.id feature_cvterm
start
end
score
strand
phase
attribute  ID     feature.name
           Name   feature.

bio-chaos and gmod_bulk_load_gff3

Both bio-chaos 0.02 and gmod_bulk_load_gff3 can (theoretically) work. Note that bio-chaos 0.01 is included in the schema CVS download, but it contains no gff3->chaos script, so use bio-chaos 0.02 for the prerequisites and installation. Reading the book "XML in a Nutshell" helped me a lot in understanding the chaos DTD.

XMLXORT

first step

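Get the current release (WS171) gff3 file from WormBase; it is 1.07G in total. Split it by chromosome, e.g. for chromosome I (the pattern is quoted so that the tab escape reaches grep intact):

grep -P '^I\t'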
[zha@localhost 1]$ ls -l chrI.gff3
-rw-rw-r-- 1 zha zha 165530115 Mar 20 17:33 chrI.gff3
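
Running one grep per chromosome reads the 1.07G file several times and silently drops every line that matches none of the patterns. A minimal sketch of a single-pass split is given here, assuming a modest number of distinct seqids (some awk implementations limit the number of open output files) and seqids that are safe to use in file names. The function name `split_gff3_by_seqid` and the `other_lines.txt` file are hypothetical, not part of GMOD.

```shell
# Hypothetical helper (not part of GMOD): split a GFF3 file into one file per
# seqid (column 1) in a single pass.
split_gff3_by_seqid() {
    infile=$1
    outdir=$2
    mkdir -p "$outdir"
    LC_ALL=C awk -F '\t' -v dir="$outdir" '
        # Once an embedded FASTA section starts, nothing is a feature line.
        /^##FASTA/ { fasta = 1 }
        # Directive, comment, blank and FASTA lines have no seqid; keep them aside.
        fasta || /^#/ || NF == 0 { print > (dir "/other_lines.txt"); next }
        # Feature lines go to chr<seqid>.gff3, e.g. chrI.gff3.
        { print > (dir "/chr" $1 ".gff3") }
    ' "$infile"
}

# Usage (file and directory names are placeholders):
#   split_gff3_by_seqid input.gff3 split_out
```

Because every input line is written to exactly one output file, the output sizes should add up to the input size, which makes it easy to see whether anything was lost.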

There are only two directive lines in WS171:

##gff-version 3
##Index-subfeature 0

But the sizes of the per-chromosome files do not add up (even approximately) to the original size of the WS171 file. Did I already lose something here?
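
One way to find where the missing bytes went is to tally the input by the value in column 1, so that any seqid not covered by the per-chromosome patterns shows up with its own byte count. This is only a sketch: the function name `gff3_bytes_by_seqid` is hypothetical, and it assumes every line ends in a single newline character.

```shell
# Hypothetical helper: report how many bytes of a GFF3 file belong to each
# seqid (column 1), plus one bucket for directive, comment and blank lines.
gff3_bytes_by_seqid() {
    LC_ALL=C awk -F '\t' '
        # Lines without a seqid are counted together; +1 is the newline.
        /^#/ || NF == 0 { bytes["(directive/comment/blank)"] += length($0) + 1; next }
        # Everything else is counted under its column 1 value.
        { bytes[$1] += length($0) + 1 }
        END { for (k in bytes) printf "%s\t%d\n", k, bytes[k] }
    ' "$1" | sort
}

# Usage (file name is a placeholder):
#   gff3_bytes_by_seqid input.gff3
```

The byte counts in the second column sum to the file size, so a seqid that appears in this listing but has no per-chromosome file accounts for part of the difference.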

pain for loading

first try: load a sample gff3

A sample nGASP gff3 file was successfully transformed to chadoXML by bio-chaos:

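use strict;
use warnings;
use Bio::Chaos;

# Input gff3 file and the source/target formats for the conversion.
my $path   = '/home/zha/gff3/phase2_confirmed.gff3';
my $infmt  = 'gff3';
my $outfmt = 'chadoxml';

# Parse the gff3 file into a chaos object, then print it as chadoXML.
my $c = Bio::Chaos->new;
$c->parse($path, $infmt);
print $c->transform_to($outfmt)->xml;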

But I doubt it could be loaded into Chado, given the following test with gmod_bulk_load_gff3:
[zha@localhost gff3]$ gmod_bulk_load_gff3.pl --dbname zha --organism worm --gfffile phase2_confirmed.gff3
Preparing data for inserting into the zha database
(This may take a while ...)
Unable to find srcfeature IV in the database.
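
One plausible reading of this error is that the loader expects the reference sequence named in column 1 (here IV) to exist already, either as a feature in the database or as its own feature line in the file. Under that assumption, the sketch below lists seqids that are used in column 1 but never defined by a line whose ID or Name equals the seqid. The function name `seqids_without_landmark` is hypothetical.

```shell
# Hypothetical helper: print seqids (column 1) that have no feature line in the
# same file whose ID or Name attribute equals the seqid itself.
seqids_without_landmark() {
    LC_ALL=C awk -F '\t' '
        # Stop at an embedded FASTA section; END still runs after exit.
        /^##FASTA/ { exit }
        /^#/ || NF < 9 { next }
        {
            seen[$1] = 1
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] == ("ID=" $1) || attrs[i] == ("Name=" $1))
                    defined[$1] = 1
        }
        END { for (s in seen) if (!(s in defined)) print s }
    ' "$1" | sort
}

# Usage (file name is a placeholder):
#   seqids_without_landmark input.gff3
```

Any seqid printed here would have to be present in the database before loading, if the assumption above holds.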

Sort the file so that the parent of a feature (the Parent tag in column 9) comes before the feature line itself. I sorted it with:

gmod_sort_gff3 --infile chrI.gff3 > chrI.unresolved

Two files are generated:

chrI.sorted.gff3
chrI.unresolved

But their combined size is much less than the size of chrI.gff3, so I definitely lost a lot here. I am abandoning this approach: it is not what I expected from the name of the file and the perldoc.
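
Before re-sorting a file, it can help to check how many lines actually violate the parent-before-child order. The following sketch reports every Parent value that has not been defined by an ID on an earlier line, which also catches parents that are never defined at all. The function name `report_parents_after_children` is hypothetical, and the script keeps every ID in memory, which is assumed to be acceptable for a per-chromosome file.

```shell
# Hypothetical helper: print "<line number><TAB><parent id>" for every Parent
# reference that is not preceded by a line defining that ID.
report_parents_after_children() {
    LC_ALL=C awk -F '\t' '
        /^##FASTA/ { exit }
        /^#/ || NF < 9 { next }
        {
            n = split($9, attrs, ";")
            # Flag each parent that no earlier line has defined.
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^Parent=/) {
                    m = split(substr(attrs[i], 8), parents, ",")
                    for (j = 1; j <= m; j++)
                        if (!(parents[j] in seen))
                            print NR "\t" parents[j]
                }
            # Record the ID of this line only after its parents were checked.
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/) seen[substr(attrs[i], 4)] = 1
        }
    ' "$1"
}

# Usage (file name is a placeholder):
#   report_parents_after_children chrI.gff3
```

An empty result means the file already satisfies the ordering and does not need sorting for that reason.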
