OK, the first question is why xmlxort: because much of the information is not in GFF3.
The second question is how. AceDB underlies WormBase, and AceDB is organized into Ace classes. The classes are here. It seems to me there are two strategies. First, for each class in AceDB, set up a procedure to migrate its information into Chado. This is like GFF3 to Chado: we have to do a 'chain reaction' across the Chado tables. The good news is that we do not need to do joins on the AceDB side, and once we are done, we know all the data have been migrated. Second, for each module in Chado, get the required information from AceDB. We will do a lot of joins on the AceDB side, and a potential problem is that we do not know whether we have migrated all the data. Which strategy to use?
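The completeness argument for the first strategy can be sketched roughly as follows. This is only an illustration: the class list, the dispatch table, and the target table names (including pubauthor) are assumptions made for the example, not a worked-out mapping.

```perl
use strict;
use warnings;

# Hypothetical list of AceDB classes to migrate (names taken from these notes).
my @ace_classes = qw(Paper Author Person Laboratory);

# Hypothetical dispatch table: one migration routine per AceDB class.
# Each routine here just returns the name of an assumed target Chado table.
my %migrate = (
    Paper  => sub { return 'pub' },
    Author => sub { return 'pubauthor' },
);

# Classes without a routine are exactly the data that has not been migrated yet.
sub unmigrated_classes {
    my ($classes, $handlers) = @_;
    return grep { !exists $handlers->{$_} } @$classes;
}

print "still to do: ",
    join(", ", unmigrated_classes(\@ace_classes, \%migrate)), "\n";
```

With the class-driven strategy the "still to do" list shrinks to nothing when the migration is complete; the module-driven strategy has no equally simple check.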
Steps: I will first try the rudimentary tables, such as cv(?), biblio, db(?).
I started with the class Paper. AceDB classes are a mixture of C structure and database description. Pointers are to objects of other classes: class Paper points to the classes Author, Person, Person_name, Keyword(?), Address, Role, and Laboratory. Some are weak references (circular back-references, handled with the Perl module Scalar::Util; interestingly, it is here). Obviously, a lot of ad hoc reading scripts are needed.
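A minimal sketch of how a circular back-reference can be weakened with Scalar::Util. The two hashes are hypothetical stand-ins for a Paper and an Author object; this is not how AcePerl itself stores objects.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken isweak);

# Hypothetical in-memory stand-ins for a Paper object and an Author object.
my %paper  = (name => 'WBPaper000001', authors => []);
my %author = (name => 'Author A',      papers  => []);

# Forward (strong) reference: paper -> author.
push @{ $paper{authors} }, \%author;

# Back reference: author -> paper. It is weakened so that the cycle
# does not by itself keep both structures alive.
push @{ $author{papers} }, \%paper;
weaken($author{papers}[0]);

print "back reference is weak\n" if isweak($author{papers}[0]);
print "paper of this author: $author{papers}[0]{name}\n";
```

A reading script that walks such pointers still has to avoid following back-references endlessly, for example by remembering which objects it has already visited.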
There is a file named chado.ddl in xmlxort/example/, schema/dat/. Its grammar is that of the PostgreSQL CREATE TABLE command.
<pub>
  <pubprop>
    ....
  </pubprop>
</pub>
<pub>
  <uniquename>this paper</uniquename>
</pub>
<pub op='lookup'>
<pub id='WBPaper000001'>
<title>....
<miniref>....
<volume>
<stock_pub>
<pub_id>
<pub>WBPaper000001</pub>
I first wrote a read part and a write part for migrating Paper class objects. After I finished the write subroutine with XML::Writer, I found that it just mimics the tree structure of ChadoXML, so apparently a fixed DOM object (a fixed-structure tree) for each table will do most of the writing job. So I need to read AceDB (multiple classes) to extract the information for a table and represent it as a DOM object. This is what I think I will do in the next several months.
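The "fixed tree per table" idea can be sketched with plain Perl data structures. This is an assumption-laden illustration: %TEMPLATE, xml_escape, and fill_template are hypothetical names, the column list for pub contains only the columns mentioned in these notes, and plain strings are used instead of XML::DOM to keep the sketch self-contained.

```perl
use strict;
use warnings;

# Hypothetical per-table template: table name => ordered child columns.
my %TEMPLATE = (
    pub => [qw(uniquename title miniref volume pages)],
);

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Fill the fixed tree for one table from a hash of column values.
# Columns missing from %$info are simply skipped.
sub fill_template {
    my ($table, $info) = @_;
    my $columns = $TEMPLATE{$table}
        or die "no template for table '$table'";
    my @lines = ("<$table>");
    for my $col (@$columns) {
        next unless defined $info->{$col};
        push @lines, "  <$col>" . xml_escape($info->{$col}) . "</$col>";
    }
    push @lines, "</$table>";
    return join("\n", @lines) . "\n";
}

print fill_template('pub', { uniquename => 'WBPaper000001', pages => '71-94' });
```

With this split, the reading side only has to produce one hash per row, and the writing side is the same for every table.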
Here is the Paper class model in WormBase. AcePerl is here. The mapping from WormBase to Chado is as follows:
sub read_paper_pub {
    my $paper = shift;
    my %info;
    $info{uniquename} = $paper->name;
    ...........
    # Page holds either a single page or a start/end pair
    if (defined($paper->Page)) {
        my @pages = $paper->Page->row;
        if (scalar @pages == 1) {
            $info{pages} = $pages[0]->name;
        } elsif (scalar @pages == 2) {
            $info{pages} = join "-", ($pages[0]->name, $pages[1]->name);
        }
    }
    # the caller (write_paper_pub) dereferences the result as a hash
    return \%info;
}
sub write_paper_pub {
    my $paper = shift;
    my $fh = shift;
    my $p_href = read_paper_pub($paper);
    my $doc = XML::DOM::Document->new;
    my $root = $doc->createElement("chado");
    my $pub_el = create_ch_pub(doc => $doc,
                               no_lookup => 1,
                               %$p_href);
    ........
    # CGC name becomes a pub_dbxref; the first three characters are dropped
    if (defined($paper->CGC_name)) {
        my $db = 'CGC';
        my $accession = substr($paper->CGC_name->name, 3);
        my $is_current = 't';
        my $pd_el = create_ch_pub_dbxref(doc => $doc,
                                         db => $db,
                                         accession => $accession,
                                         no_lookup => 1);
        $pub_el->appendChild($pd_el);
    }
    ........
    # a non-empty abstract becomes a pubprop
    if (defined($paper->Abstract)) {
        my %abstract = ();
        $abstract{type} = 'pubmed_abstract';
        if ($paper->Abstract->right->name ne '') {
            $abstract{value} = $paper->Abstract->right->name;
            my $pp_el = create_ch_pubprop(doc => $doc,
                                          %abstract);
            $pub_el->appendChild($pp_el);
        }
    }
    .........
sub write_paper_pub_relationship {
    .........
    # a paper published in a book becomes a pub_relationship
    if (defined($paper->In_book)) {
        my %pr = ();
        $pr{is_object} = 't';
        $pr{rtype} = 'published_in';
        if (defined($paper->In_book->at('Title'))) {
            $pr{uniquename} = $paper->In_book->at('Title')->right->name;
        } else {........
        }
        my $pr_el = create_ch_pub_relationship(doc => $doc,
                                               %pr);
        $pub_el->appendChild($pr_el);
    }
xort_validator.pl -d wormbase_chado -f xml/paper/1.xml -v 1 -b 1
This connects to the database to validate the XML file.
xort_loader.pl -d wormbase_chado -f xml/paper/1.xml
-- ================================================
-- TABLE: contactprop
-- ================================================

-- contactprop models person/lab properties, such as email, phone, etc.
-- the cvterms come from FOAF project, see the spec at http://xmlns.com/foaf/spec/

create table contactprop (
    contactprop_id serial not null,
    primary key (contactprop_id),
    contact_id int not null,
    foreign key (contact_id) references contact (contact_id) on delete cascade,
    type_id int not null,
    foreign key (type_id) references cvterm (cvterm_id) on delete cascade,
    value text,

    unique (contact_id, type_id, value)
);
create index contactprop_idx1 on contactprop (contactprop_id);
create index contactprop_idx2 on contactprop (type_id);
The power of Chado relies on common, controlled cvterms. FOAF cvterms will be reused as much as possible for contact properties, plus some terms from WormBase.
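One way to picture the "FOAF first, WormBase otherwise" rule is a small lookup. The property names on the left, the cv names 'FOAF' and 'WormBase', and the function name contactprop_type are assumptions made for this sketch; mbox, phone, and homepage are terms from the FOAF vocabulary.

```perl
use strict;
use warnings;

# Assumed mapping from contact property names to FOAF terms.
my %FOAF_TERM = (
    email    => 'mbox',
    phone    => 'phone',
    web_page => 'homepage',
);

# Return (cv, term): the FOAF term when one exists for the property,
# otherwise the property name itself in a WormBase-specific cv.
sub contactprop_type {
    my ($property) = @_;
    return exists $FOAF_TERM{$property}
        ? ('FOAF', $FOAF_TERM{$property})
        : ('WormBase', $property);
}

for my $property (qw(email lab_code)) {
    my ($cv, $term) = contactprop_type($property);
    print "$property -> $cv:$term\n";
}
```

The (cv, term) pair is what would be looked up to obtain the type_id of a contactprop row.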
Straightforward mapping.
A transgene in WormBase will map to a feature in Chado, with type_id = synthetic construct.