Difference between revisions of "Zheng's notes on wormbase migration"

From GMOD

Revision as of 22:48, 22 March 2007

description

This page is a record of my experience migrating WormBase onto Chado. As far as I know, WormBase is based on ACeDB (an object-oriented schema) mapped onto an RDBMS (MySQL/PostgreSQL). Chado is a newer, more sophisticated, but generic schema.

module-based migration

focus on the sequence module first, using gff3 files as input.

gff3 file format and chado

gff3              chado table
seqid
source            feature_dbxref
type              feature.type_id, cvterm.id feature_cvterm
start
end
score
strand
phase
attribute  ID     feature.name
           Name   feature.
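To make the column list above concrete, here is a minimal sketch that pulls the fields named in the mapping (seqid, source, type, and the ID/Name attributes) out of a gff3 file. The function name gff3_columns is made up for illustration; it only prints the fields and does not touch chado.

```shell
#!/bin/sh
# gff3_columns FILE  (hypothetical helper, not part of GMOD)
# Print seqid, source, type and the ID/Name attributes of every feature line.
gff3_columns() {
    awk -F '\t' '
        /^#/ || NF != 9 { next }          # skip directives, comments, malformed lines
        {
            id = ""; name = ""
            n = split($9, attrs, ";")     # column 9 holds tag=value pairs joined by ";"
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/)   id   = substr(attrs[i], 4)
                if (attrs[i] ~ /^Name=/) name = substr(attrs[i], 6)
            }
            printf "seqid=%s source=%s type=%s ID=%s Name=%s\n", $1, $2, $3, id, name
        }
    ' "$1"
}
```

Calling gff3_columns chrI.gff3 | head gives a quick look at what would end up in feature.name and the other mapped columns.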

bio-chaos and gmod_bulk_load_gff3

Both bio-chaos 0.02 and gmod_bulk_load_gff3 can (theoretically) work. Note that bio-chaos 0.01 is included in the schema CVS download, but it has no gff3->chaos script, so use bio-chaos 0.02 for prerequisites and installation. Reading the book XML in a Nutshell helped me a lot in understanding the chaos DTD. XMLXORT

first step

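Get the current release WS171 gff3 file from WormBase (1.07G in total) and split it by chromosome, e.g. for chromosome I:

grep -P '^I\t' ws171.gff3 > chrI.gff3   # ws171.gff3 is a placeholder for the downloaded file name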
[zha@localhost 1]$ ls -l chrI.gff3
-rw-rw-r-- 1 zha zha 165530115 Mar 20 17:33 chrI.gff3

There are only two directive lines in WS171:

##gff-version 3
##Index-subfeature 0

But the sizes of the per-chromosome files do not add up to (or even come close to) the original size of the WS171 file. Did I lose something here already?
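One way to look into the size mismatch is to count the lines per seqid in the original file and compare that list with the chromosome names used for splitting. The sketch below does only that counting. The function name gff3_seqid_counts is made up, and the idea that the missing data sits under seqids other than the chromosomes is a guess, not something these notes confirm.

```shell
#!/bin/sh
# gff3_seqid_counts FILE  (hypothetical helper)
# Print "seqid<TAB>count" for every seqid in column 1, plus the number of
# directive/comment lines under the pseudo-key "#directives".
gff3_seqid_counts() {
    awk -F '\t' '
        /^#/    { directives++; next }    # ##gff-version and other # lines
        NF == 0 { next }                  # ignore blank lines
        { count[$1]++ }
        END {
            for (s in count) printf "%s\t%d\n", s, count[s]
            printf "#directives\t%d\n", directives
        }
    ' "$1" | LC_ALL=C sort
}
```

Any seqid in this list that was never passed to grep accounts for lines missing from the per-chromosome files.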

pain for loading

  • first, try loading a sample gff3

A sample nGASP gff3 file was successfully transformed to chadoXML by bio-chaos:

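use strict;
use warnings;
use Bio::Chaos;

my $path   = '/home/zha/gff3/phase2_confirmed.gff3';  # sample nGASP gff3 file
my $infmt  = 'gff3';
my $outfmt = 'chadoxml';

# parse the gff3 file into a chaos object, then print it as chadoXML
my $c = Bio::Chaos->new;
$c->parse($path, $infmt);
print $c->transform_to($outfmt)->xml;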


But I doubt the result could be loaded into chado, given the following test with gmod_bulk_load_gff3:
[zha@localhost gff3]$ gmod_bulk_load_gff3.pl --dbname zha --organism worm --gfffile phase2_confirmed.gff3
Preparing data for inserting into the zha database
(This may take a while ...)
Unable to find srcfeature IV in the database.
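One possible reading of the error above is that the loader wants the reference sequence (here IV) to exist as a feature, either already in the database or defined in the gff3 file itself, before anything located on it is loaded. Under that assumption, the sketch below lists the seqids in column 1 that have no feature line whose ID equals the seqid. The function name gff3_undefined_seqids is made up, and the check says nothing about what is already in the database.

```shell
#!/bin/sh
# gff3_undefined_seqids FILE  (hypothetical helper)
# Print every seqid used in column 1 that is never declared as a feature ID
# anywhere in the same file.
gff3_undefined_seqids() {
    awk -F '\t' '
        /^#/ || NF != 9 { next }          # skip directives, comments, malformed lines
        {
            used[$1] = 1
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/) declared[substr(attrs[i], 4)] = 1
        }
        END { for (s in used) if (!(s in declared)) print s }
    ' "$1" | LC_ALL=C sort
}
```

If gff3_undefined_seqids phase2_confirmed.gff3 prints IV, the file would need a line describing chromosome IV itself (or the database would need that feature) before the load can be retried.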


Sort the file so that the Parent of a feature (the Parent tag in column 9) comes before the feature line itself. I sorted it with:

gmod_sort_gff3 --infile chrI.gff3 > chrI.unresolved 

two files are generated:

chrI.sorted.gff3
chrI.unresolved

But their combined size is much less than the size of chrI.gff3, so I definitely lost a lot here. I abandoned this; it is not what I expected from the name of the file and the perldoc.
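Before sorting at all, it may be worth checking whether the file already meets the ordering requirement described above (a Parent line before its children). The sketch below reports every feature whose Parent ID has not appeared earlier in the file. The function name gff3_parent_order is made up, and it is not a replacement for gmod_sort_gff3; it only reports and does not reorder anything.

```shell
#!/bin/sh
# gff3_parent_order FILE  (hypothetical helper)
# Report each feature line whose Parent has not been seen earlier in the file.
gff3_parent_order() {
    awk -F '\t' '
        /^#/ || NF != 9 { next }          # skip directives, comments, malformed lines
        {
            id = ""; parents = ""
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/)     id      = substr(attrs[i], 4)
                if (attrs[i] ~ /^Parent=/) parents = substr(attrs[i], 8)
            }
            m = split(parents, p, ",")    # a feature may list several parents
            for (j = 1; j <= m; j++)
                if (!(p[j] in seen))
                    printf "line %d: Parent %s not seen yet\n", NR, p[j]
            if (id != "") seen[id] = 1
        }
    ' "$1"
}
```

An empty result from gff3_parent_order chrI.gff3 would mean the file is already in parent-first order and the sorting step can be skipped.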