Zheng's notes on wormbase migration

From GMOD
Revision as of 18:45, 21 March 2007 by Zheng (Talk | contribs)

description

This page records my experience migrating WormBase to Chado. As far as I know, WormBase is built on AceDB (an object-oriented schema) mapped onto an RDBMS (MySQL/PostgreSQL). Chado is a newer, more sophisticated, but generic schema.

module-based migration

Focus on the sequence module first, using GFF3 files.

bio-chaos and gmod_bulk_load_gff3

Both bio-chaos 0.02 and gmod_bulk_load_gff3 can (theoretically) work. Note that bio-chaos 0.01 is included in the schema CVS download, but it has no GFF3-to-chaos script, so use bio-chaos 0.02 for the prerequisites and installation. Reading the book XML in a Nutshell helped me a lot in understanding the chaos DTD.

first step

Get the GFF3 file for the current release, WS171, from WormBase. It is 1.07G in total. Split it by chromosome with:

grep -P '^I\t'
[zha@localhost 1]$ ls -l chrI.gff3
-rw-rw-r-- 1 zha zha 165530115 Mar 20 17:33 chrI.gff3
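
The per-chromosome grep above can also be done in a single pass over the file. The following is only a sketch, not the command I actually ran: the function name split_gff3_by_seqid and the output directory layout are made up for illustration, and it assumes that column 1 holds plain chromosome names such as I or II.

```shell
# Hypothetical helper (not part of GMOD): split a GFF3 file into one file
# per sequence ID (column 1), named chr<seqid>.gff3 as in the listing above.
# Usage: split_gff3_by_seqid INPUT.gff3 OUTDIR
split_gff3_by_seqid() {
    infile=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -F '\t' -v dir="$outdir" '
        # Skip directives, comments and anything that is not a 9-column feature line.
        /^#/ || NF < 9 { next }
        # awk keeps each output file open, so ">" truncates only on first use.
        { print > (dir "/chr" $1 ".gff3") }
    ' "$infile"
}
```

Note that directive lines such as ##gff-version 3 are not copied into the per-chromosome files by this sketch (nor by the grep), so they may need to be added back by hand before loading.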

There are only two directive lines in WS171:

##gff-version 3
##Index-subfeature 0

But the sizes of the per-chromosome files do not add up to (even approximately) the original size of the WS171 file. Did I already lose something here?
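
One way to look into this is to count the lines of the original file by sequence ID and see whether anything falls outside the chromosomes that were split out. This is a rough sketch under my own assumptions: gff3_line_summary is a hypothetical name, and it compares line counts rather than byte sizes, so it only shows where lines went, not why the sizes differ.

```shell
# Hypothetical helper: print "<seqid><TAB><line count>" for every sequence ID
# in a GFF3 file, plus separate counts for directive/comment lines and for
# any other lines that do not have 9 tab-separated columns.
# Usage: gff3_line_summary INPUT.gff3
gff3_line_summary() {
    awk -F '\t' '
        /^#/   { n["(directive/comment)"]++; next }
        NF < 9 { n["(other)"]++; next }
               { n[$1]++ }
        END    { for (k in n) printf "%s\t%d\n", k, n[k] }
    ' "$1" | sort
}
```

If a sequence ID shows up here that has no matching per-chromosome file, its lines were dropped by the split.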

pain for loading

First try: load chrI.gff3. A sample nGASP GFF3 file could be loaded perfectly by both bio-chaos and gmod_bulk_load_gff3. For chrI.gff3, sort it so that the Parent of a feature (the Parent tag in column 9) comes before the feature line in the file. I sorted it with:

gmod_sort_gff3 --infile chrI.gff3 > chrI.unresolved 

two files are generated:

chrI.sorted.gff3
chrI.unresolved
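
To see whether a file already satisfies the "Parent before child" ordering, a small check like the following can help. It is a sketch of the ordering rule only, not of how gmod_sort_gff3 works internally. The function name gff3_check_parent_order is hypothetical, and it assumes the ID and Parent tags are written exactly as ID=... and Parent=... in column 9, with multiple parents separated by commas.

```shell
# Hypothetical helper: report every Parent reference whose ID has not been
# seen earlier in the file. Prints "<line number>: <parent id>" per problem;
# prints nothing if all parents come before their children.
# Usage: gff3_check_parent_order INPUT.gff3
gff3_check_parent_order() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }
        {
            id = ""; parent = ""
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/)     id = substr(attrs[i], 4)
                if (attrs[i] ~ /^Parent=/) parent = substr(attrs[i], 8)
            }
            if (parent != "") {
                m = split(parent, ps, ",")
                for (j = 1; j <= m; j++)
                    if (!(ps[j] in seen)) print NR ": " ps[j]
            }
            if (id != "") seen[id] = 1
        }
    ' "$1"
}
```

Running it on chrI.sorted.gff3 should print nothing if the sort did its job. Parents that never appear in the file at all will also be reported.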