Zheng's notes on wormbase migration
description
This page is a record of my experience migrating WormBase onto Chado. As far as I know, WormBase is based on AceDB (an object-oriented schema) mapped onto an RDBMS (MySQL/PostgreSQL). Chado is a newer, more sophisticated, but generic schema.
module-based migration
Focus on the sequence module first, using GFF3 files.
bio-chaos and gmod_bulk_load_gff3
Both bio-chaos 0.02 and gmod_bulk_load_gff3 can (theoretically) work. By the way, bio-chaos 0.01 is included in the schema CVS download, but it has no gff3->chaos script, so go to bio-chaos 0.02 for the prerequisites and installation. Reading the book XML in a Nutshell helped me a lot in understanding the chaos DTD.
first step
Get the current release WS171 GFF3 file from WormBase (1.07G in total). Split it by chromosome with:
grep -P /^I\t/
[zha@localhost 1]$ ls -l chrI.gff3
-rw-rw-r-- 1 zha zha 165530115 Mar 20 17:33 chrI.gff3
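As a cross-check on the grep approach, here is a minimal Python sketch that splits GFF3 lines by column 1 (the seqid). It compares the whole first column, so "I" cannot be confused with "II" or "III" (the tab in the grep pattern serves the same purpose; note that in a shell the pattern would normally need quoting, e.g. '^I\t', for the \t to reach grep intact). The function name and the sample lines are made up for illustration and are not real WS171 data.

```python
import io
from collections import defaultdict

def split_gff3_by_seqid(lines):
    """Group GFF3 feature lines by column 1 (seqid); collect '#' lines separately."""
    by_seqid = defaultdict(list)
    directives = []
    for line in lines:
        if line.startswith("#"):
            directives.append(line)       # '##' directives and '#' comments
            continue
        if not line.strip():
            continue                       # skip blank lines
        seqid = line.split("\t", 1)[0]    # column 1 is the sequence id
        by_seqid[seqid].append(line)
    return directives, dict(by_seqid)

# Tiny made-up sample (hypothetical features, not WS171 content)
sample = (
    "##gff-version 3\n"
    "I\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n"
    "II\tsrc\tgene\t5\t50\t.\t-\t.\tID=g2\n"
    "I\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1\n"
)
directives, groups = split_gff3_by_seqid(io.StringIO(sample))
for seqid, feats in groups.items():
    print(seqid, len(feats))
```

With a real file, the same function could be fed an open file handle instead of the StringIO sample, and each group written to its own chr-based file.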
There are only two directive lines in WS171:
##gff-version 3
##Index-subfeature 0
But the sizes of the chr-based files do not add up to (or even come close to) the size of the original WS171 file. Did I already lose something here?
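One way to see where the missing bytes went would be to tally the file's size per seqid and compare that with the chr-based files. The sketch below is only a diagnostic idea, not something from the original workflow: the function name, the "(non-feature)" key, and the sample lines (including the seqid "other_seq") are all hypothetical. Any seqid that shows up in the tally but has no matching grep would account for part of the difference, as would directive and comment lines.

```python
from collections import Counter

def bytes_by_seqid(lines):
    """Tally bytes per seqid; directive, comment and blank lines share one key."""
    tally = Counter()
    for line in lines:
        if line.startswith(b"#") or not line.strip():
            key = "(non-feature)"
        else:
            # column 1 is the seqid
            key = line.split(b"\t", 1)[0].decode("ascii", "replace")
        tally[key] += len(line)
    return tally

# Made-up sample lines as bytes, so len() matches the on-disk size
sample = [
    b"##gff-version 3\n",
    b"I\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n",
    b"other_seq\tsrc\tgene\t1\t10\t.\t+\t.\tID=g2\n",
]
tally = bytes_by_seqid(sample)
total = sum(len(line) for line in sample)
for key, size in sorted(tally.items()):
    print(key, size)
print("total", total)
```

On a real file, opening it in binary mode ("rb") and passing the handle to bytes_by_seqid would give byte counts that can be compared directly with ls -l output.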
pain for loading
First try: load chrI.gff3. A sample nGASP GFF3 file could be loaded perfectly by both tools. For chrI.gff3, sort it so that the Parent of a feature (the Parent tag in column 9) comes before the feature line in the file. Sorted it by:
gmod_sort_gff3 --infile chrI.gff3 > chrI.unresolved
Two files are generated:
chrI.sorted.gff3
chrI.unresolved
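To check that a sorted file really has every Parent before its children, a small independent check can be sketched in Python. This is not how gmod_sort_gff3 works internally (I have not looked at that); it simply reports features whose Parent ID has not yet been seen. The function name and sample lines are hypothetical.

```python
def find_unresolved(lines):
    """Return (feature ID, parent ID) pairs where the parent was not seen earlier."""
    seen = set()
    unresolved = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue                       # skip directives, comments, blanks
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue                       # not a full 9-column feature line
        # column 9 holds tag=value pairs separated by ';'
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        # a feature may list several parents separated by ','
        for parent in filter(None, attrs.get("Parent", "").split(",")):
            if parent not in seen:
                unresolved.append((attrs.get("ID"), parent))
        if "ID" in attrs:
            seen.add(attrs["ID"])
    return unresolved

# Made-up lines: the mRNA appears before its parent gene
gene = "I\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n"
mrna = "I\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1\n"
unsorted_result = find_unresolved([mrna, gene])
sorted_result = find_unresolved([gene, mrna])
print(unsorted_result)
print(sorted_result)
```

An empty result on chrI.sorted.gff3 would suggest the ordering requirement is met for that file.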