GMOD

Learn XMLXORT

OK, the first question is why XMLXORT: because much of the information is not in GFF3.

The second question is how. AceDB underlies WormBase, and AceDB is organized into Ace classes. The classes are here. It seems to me there are two strategies. First, for each class in Ace, set up a procedure to migrate its info into chado. This is like GFF3 to chado: we have to do a 'chain reaction' on chado tables. The good news is that we don't need to do a 'join' on the AceDB side, and once we are done, we know all data have been migrated. Second, for each module in chado, get the required information from AceDB. We will do a lot of joins on the AceDB side, and the potential problem is that we do not know whether we have migrated all the data. Which strategy to use?

Steps: I will first try the rudimentary tables, such as cv(?), biblio, db(?).

wormbase acedb classes

I started with class Paper. AceDB classes are a mixture of C structures and database descriptions. Pointers are to other class objects. Class Paper points to classes Author, Person, Person_name, Keyword?, Address, Role, and Laboratory. Some are weak references (circular, back_ref; handled in Perl by the module Scalar::Util, interesting, it is here). Obviously, a lot of ad hoc reading scripts are needed.
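To illustrate the weak-reference point, here is a minimal sketch of how a circular back reference can be weakened with Scalar::Util. The two hashes are hypothetical in-memory stand-ins for a Paper and an Author object, not real AcePerl objects.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken isweak);

# Hypothetical stand-ins for a Paper object and an Author object.
my $paper  = { name => 'WBPaper000001', authors => [] };
my $author = { name => 'some author',   papers  => [] };

# Forward reference: the paper holds its author.
push @{ $paper->{authors} }, $author;

# Back reference: the author points back at the paper, closing a cycle.
push @{ $author->{papers} }, $paper;

# Weaken the back reference so the cycle does not keep both objects alive.
weaken($author->{papers}[0]);

print isweak($author->{papers}[0])
    ? "back reference is weak\n"
    : "back reference is strong\n";
```

The weakened slot can still be followed while `$paper` is alive; it becomes undef once the last strong reference to the paper goes away.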

chado database ddl

There is a file named chado.ddl in xmlxort/example/, schema/dat/. The grammar can be found under the PostgreSQL CREATE TABLE command.

chadoxml dtd

chadoxml

<pub>
   <pubprop>
      ....
   </pubprop>
</pub> 
<pub>
   <uniquename>this paper</uniquename>
</pub>
<pub op='lookup'>
<pub id='WBPaper000001'>
  <title>....
  <miniref>....
  <volume>
<stock_pub>
   <pub_id>
      <pub>WBPaper000001</pub>

migration

I first wrote a read part and a write part for migrating Paper class objects. After I finished the write subroutine with XML::Writer, I found it just mimics the tree structure of chadoxml, so apparently a fixed DOM object (a fixed-structure tree) for each table will do most of the writing job. So I need to read AceDB (multiple classes) to extract the info for a table and represent it as a DOM object. This is what I think I will do in the next several months.
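The "fixed tree per table" idea can be sketched roughly as follows. This is only an illustration under assumptions: it builds the XML as a string with core Perl rather than a real DOM object, the names `%template`, `xml_escape` and `table_to_xml` are hypothetical, and the column list is a small subset of the pub table. A real chadoxml record would need more (for example a type_id).

```perl
use strict;
use warnings;

# Hypothetical per-table templates: child elements in a fixed order.
my %template = (
    pub => [qw(uniquename title miniref volume pages)],
);

# Escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Fill the fixed tree for one table from a hash of column values;
# columns that are missing from the hash are skipped.
sub table_to_xml {
    my ($table, %info) = @_;
    my $cols = $template{$table} or die "no template for table $table\n";
    my $xml = "<$table>\n";
    for my $col (@$cols) {
        next unless defined $info{$col};
        $xml .= "   <$col>" . xml_escape($info{$col}) . "</$col>\n";
    }
    return $xml . "</$table>\n";
}

print table_to_xml('pub', uniquename => 'WBPaper000001', pages => '1-10');
```

With a template per table, the reading side only has to produce a hash of values, and the writing side stays the same for every table.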

mapping wormbase info to chado

Here is the Paper class model in WormBase. AcePerl is here. The mapping from WormBase to chado is as follows:

extract info from wormbase using AcePerl (sample code)

sub read_paper_pub {
   my $paper = shift;
   my %info;
   $info{uniquename} = $paper->name;
   # ...........
   if (defined($paper->Page)) {
       # one entry is used as a single page, two entries as a first-last range
       my @pages = $paper->Page->row;
       if (scalar @pages == 1) {
           $info{pages} = $pages[0]->name;
       } elsif (scalar @pages == 2) {
           $info{pages} = join "-", ($pages[0]->name, $pages[1]->name);
       }
   }
   return \%info;   # write_paper_pub dereferences this hash reference
}

write chadoxml (sample xml)

sub write_paper_pub {
   my $paper = shift;
   my $fh = shift;
   my $p_href = read_paper_pub($paper);
   my $doc = XML::DOM::Document->new;
   my $root = $doc->createElement("chado");
   my $pub_el = create_ch_pub(doc => $doc,
                              no_lookup => 1,
                              %$p_href);
   # ........
   if (defined($paper->CGC_name)) {
       my $db = 'CGC';
       # drop the first three characters of the CGC name
       my $accession = substr($paper->CGC_name->name, 3);
       my $is_current = 't';
       my $pd_el = create_ch_pub_dbxref(doc => $doc,
                                        db => $db,
                                        accession => $accession,
                                        no_lookup => 1);
       $pub_el->appendChild($pd_el);
   }
   # ........
   if (defined($paper->Abstract)) {
       my %abstract = ();
       $abstract{type} = 'pubmed_abstract';
       # only write a pubprop when the abstract is not empty
       if ($paper->Abstract->right->name ne '') {
           $abstract{value} = $paper->Abstract->right->name;
           my $pp_el = create_ch_pubprop(doc => $doc,
                                         %abstract);
           $pub_el->appendChild($pp_el);
       }
   }
   # .........
}

sub write_paper_pub_relationship {
   # .........
   if (defined($paper->In_book)) {
       my %pr = ();
       $pr{is_object} = 't';
       $pr{rtype} = 'published_in';
       if (defined($paper->In_book->at('Title'))) {
         $pr{uniquename} = $paper->In_book->at('Title')->right->name;
       } else {
         # ........
       }
       my $pr_el = create_ch_pub_relationship(doc => $doc,
                                              %pr);
       $pub_el->appendChild($pr_el);
   }
}

validate xml file

xort_validator.pl -d wormbase_chado -f xml/paper/1.xml -v 1 -b 1

This connects to the database to validate the XML file.

load xml file

xort_loader.pl -d wormbase_chado -f xml/paper/1.xml

Order is important for loading.
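One likely reason the order matters is that rows in one table reference rows in another through foreign keys, so the referenced rows have to exist first. As a rough sketch (not part of XORT), a load order can be derived with a topological sort. The dependency map below is an illustrative subset of chado foreign keys, and `load_order` and `visit_table` are hypothetical names.

```perl
use strict;
use warnings;

# Illustrative subset: table => [tables it references by foreign key].
my %depends_on = (
    db         => [],
    cv         => [],
    dbxref     => ['db'],
    cvterm     => ['cv', 'dbxref'],
    pub        => ['cvterm'],
    pub_dbxref => ['pub', 'dbxref'],
);

# Depth-first visit: a table is emitted only after every table it references.
sub visit_table {
    my ($deps, $state, $order, $table) = @_;
    my $seen = $state->{$table} // '';
    return if $seen eq 'done';
    die "circular dependency at $table\n" if $seen eq 'visiting';
    $state->{$table} = 'visiting';
    visit_table($deps, $state, $order, $_) for @{ $deps->{$table} // [] };
    $state->{$table} = 'done';
    push @$order, $table;
}

# Return the tables in an order that is safe for loading.
sub load_order {
    my ($deps) = @_;
    my (%state, @order);
    visit_table($deps, \%state, \@order, $_) for sort keys %$deps;
    return @order;
}

print join(' ', load_order(\%depends_on)), "\n";
```

The XML files for each table could then be passed to xort_loader.pl in that order.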

extend chado schema

-- ================================================
-- TABLE: contactprop
-- ================================================

-- contactprop models person/lab properties, such as email, phone, etc.
-- the cvterms come from FOAF project, see the spec at http://xmlns.com/foaf/spec/

create table contactprop (
   contactprop_id serial not null,
   primary key (contactprop_id),
   contact_id int not null,
   foreign key (contact_id) references contact (contact_id) on delete cascade,
   type_id int not null,
   foreign key (type_id) references cvterm (cvterm_id) on delete cascade,
   value text,

   unique (contact_id, type_id, value)
);
create index contactprop_idx1 on contactprop (contactprop_id);
create index contactprop_idx2 on contactprop (type_id);

cvterm is most important for chado

The power of chado relies on common, controlled cvterms. FOAF cvterms will be reused as much as possible for contact properties, plus some terms from WormBase.

strain from wormbase to stock from chado

Straightforward mapping.

transgene from wormbase to feature from chado

A transgene in WormBase maps to a feature in chado, with type_id = synthetic construct.

variation from wormbase
