Zheng's notes on wormbase migration

From GMOD
Revision as of 20:00, 23 March 2007 by Zheng (Talk | contribs)

Jump to: navigation, search

description

this page is a record of my experience for migrating wormbase onto chado. As far as I know, wormbase is based on the Acedb (an object-oriented schema) mapping onto rmdbs (mysql/postgresql). Chado is a new, more sophisticated, but generic schema.

module-based migration

focus on sequence module first. using gff3 files as input.

gff3 columns and chado table relationship
gff3 chado table
seqid featureloc.srcfeature_id, featureloc.rank=0
source feature_dbxref.feature_id->
type feature.type_id, cvterm.id feature_cvterm
start featureloc.nbeg
end featureloc.nend
score
strand featureloc.strand
phase featureloc.phase
attribute ID feature.name, feature.uniquename if ID is unique otherwise 'auto'+feature.feature_id
attribute Name feature.

bio-chaos and gmod_bulk_load_gff3

both bio-chaos 0.02 and gmod_bulk_load_gff3 can theoretically work. btw, bio-chaos 0.01 is included in the schema cvs download, but no gff3->chaos script in it. so go to bio-chaos 0.02 for prerequisite and installation. read a book XML in a nutshell helps a lot for me to understand chaos DTD.

Now I know XMLXORT will be finally used not only for sequence-related data but also for other data, I have to learn XMLXORT.

first step

get the current release WS171 gff3 file from wormbase. total 1.07G. split it by:

grep -P /^I\t/
[zha@localhost 1]$ ls -l chrI.gff3
-rw-rw-r-- 1 zha zha 165530115 Mar 20 17:33 chrI.gff3

only two directive lines in ws171

##gff-version 3
##Index-subfeature 0

but adding the size of chr-based files does not (similarly) equal to the original size of ws171, ??? I lost something here already?

pain for loading

  • first try load a sample gff3

a sample nGASP gff3 file has been successfully transformed to chadoXML by bio-chaos.

use Bio::Chaos;
my $path = '/home/zha/gff3/phase2_confirmed.gff3';
my $infmt = 'gff3';
my $outfmt = 'chadoxml';
my $c = Bio::Chaos->new;
$c->parse($path, $infmt);
print $c->transform_to($outfmt)->xml;

but I doubt it could load onto chado for the following test on gmod-bulk-load-gff3.

[zha@localhost gff3]$ gmod_bulk_load_gff3.pl --dbname zha --organism worm --gfffile  \ phase2_confirmed.gff3
Preparing data for inserting into the zha database
(This may take a while ...)
Unable to find srcfeature IV in the database.


sort it so that Parent of a feature (column 9 tag Parent) comes before the feature line in file. sorted it by:

gmod_sort_gff3 --infile chrI.gff3 > chrI.unresolved 

two files are generated:

chrI.sorted.gff3
chrI.unresolved

but adding the size of them, much less than the size of chrI.gff3, I definitely lost a lot here, abadon this is not what I expected from the name of the file and perldoc.

my experience with chromosome I

  • chromosome definition line
I       Link    chromosome      1       15072419        .       +       .       Name=I

I manually changed it to

I       Link    chromosome      1       15072419        .       +       .       ID=I, Name=I

and put it at the top of the gff3 file, it is NOT a problem of gff3 file, i.e., the file is valid wherever this line is or even without this line, but put it on top helps the bulk_load, or maybe gmod_gff3_prepocessor will try to do this change.

  • clone_end line
I       .       clone_end       10038617        10038617        .       .       .       Name=C03C11
no cvterm for clone_end at /usr/lib/perl5/site_perl/5.8.8/Bio/GMOD/DB/Adapter.pm line 3445, <GEN0>  line 12402.
Issuing rollback() for database handle being DESTROY'd without explicit disconnect().

this is a valid line, i.e, clone_end is a valid SOFA term, accroding to SOFA v2 (05-16-2005). what we loaded in chado installation is the SO latest minor revision version v2.1 (08-16-2006). in this version clone_end change to clone_insert_end.

  • this is a known situation...
Your GFF3 file uses a tag called 'confirmed_est', but this term is not
already in the cvterm table so that it's value can be inserted
into the featureprop table.  The easiest way to rectify this is
to execute the following SQL commands in the psql shell:
 INSERT INTO dbxref (db_id,accession)
   VALUES ((select db_id from db where name='null'),'autocreated:confirmed_est');
 INSERT INTO cvterm (cv_id,name,dbxref_id)
   VALUES ((select cv_id from cv where name='autocreated'), 'confirmed_est',
           (select dbxref_id from dbxref where accession='autocreated:confirmed_est'));
and then rerun this loader.  Your other option is to
write a special handler for this tag so that it will
go where you want it in the database.
Died at /usr/lib/perl5/site_perl/5.8.8/Bio/GMOD/DB/Adapter.pm line 2834, <GEN0> line 13204.
Issuing rollback() for database handle being DESTROY'd without explicit disconnect().

Noticed the above situation the cvterm is in column 3 (type), here the term is in column 9, a tag, such as ID, NAME, Dbxref, etc. I encoutered a series of them, which are good information.

non-canonical tag
confirmed_EST
confirmed_UTR
confirmed_Homology
confirmed_inconsistence
confirmed_false
used_for_training
predicted_ncrna_gene