XORT Presentation

Latest revision as of 18:54, 9 October 2012

This Wiki section is an edited version of Josh Goodman and Pinglei Zhou's presentation.

Introduction
  • An XML-to-database mapping system for exchanging data between a database and XML-driven applications
  • XORT can handle typical XML; it is not Chado-specific
  • Developed and supported by Pinglei Zhou at FlyBase Harvard; currently at version 0.007
  • Used at all FlyBase sites
    • Harvard has an extensive library of Perl modules for generating ChadoXML
  • Written in Perl
  • Required Perl modules:
    • XML::Parser::PerlSAX
    • Unicode::String
    • XML::DOM
    • DBI
Chado XML
  • Is Chado XML necessary? No, but it may help you.
  • ChadoXML assists with incremental updates, if you want to avoid flush-and-reload.
  • While updates can be achieved with other middleware (for example, Perl Class::DBI or Java Hibernate), ChadoXML additionally provides a way to archive your transactions.
  • It provides bulk update and download, which other methods either lack or handle inefficiently
Components
  • Database & Schema
  • ChadoXML Specification
  • DumpSpec
    • DumpSpec files are simple XML files that tell XORT what to do
    • DumpSpec files are language independent, being XML
    • It's fairly easy for those who know the schema to read these files and understand what the operation is
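
To illustrate how little a dumpspec needs, here is a minimal Python sketch that writes one with the standard library. The element and attribute names (chado, feature, dump, uniquename, test) are copied from the merge example later on this page. The helper name build_dumpspec is hypothetical, and nothing here calls XORT itself or validates against its specification.

```python
import xml.etree.ElementTree as ET


def build_dumpspec(table: str, column: str, value: str) -> str:
    """Return a dumpspec selecting rows of `table` where `column` equals `value`.

    Hypothetical helper: it only mirrors the XML shape of the example
    dumpspec on this page and does not check it against any XORT schema.
    """
    root = ET.Element("chado")
    # dump="all" is the attribute the page's example uses to dump all data
    # for the matching record.
    target = ET.SubElement(root, table, {"dump": "all"})
    # test="eq" marks the column value as an equality condition, as in the
    # page's example.
    condition = ET.SubElement(target, column, {"test": "eq"})
    condition.text = value
    return ET.tostring(root, encoding="unicode")


spec = build_dumpspec("feature", "uniquename", "x")
print(spec)
```

Because the result is plain XML, the same file could just as well be written by hand or generated from any other language.
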
Highlights of Chado XML Specification
  • Unique representation of a specific database schema
  • Avoids exposing internal primary key values
  • Static vs. Operational
  • Encoding for non-ASCII characters
  • Macro mechanism (object reference)
Putting it together: New FlyBase dataflow Part 1

There are three FlyBase sites, and most curation is done at Harvard and Cambridge. Proforma is the curation format at Cambridge and Harvard, but Harvard also curates with Apollo and ChadoXML.

Once the data is in Chado (the reporting instance), a denormalization step moves it to a read-only database. From the read-only database, XORT dumps the data as ChadoXML for reporting purposes. Once the ChadoXML is created, XSLT version 2 is used to create HTML and GFF. The HTML serves as human-readable reports; the GFF is for GBrowse and for various power users.

1.a. Proforma (FlyBase Cambridge) is converted to ChadoXML

1.b. ChadoXML is created by Apollo (Harvard)

1.c. ChadoXML is created by Java SEAN (Harvard)

2. All ChadoXML is loaded into Chado by XORT

Putting it together: New FlyBase dataflow Part 2

3. Chado (Harvard) is denormalized and loaded into Chado (Indiana)

4. ChadoXML is created from Chado using XORT

5.a. GFF and FASTA are created from ChadoXML

5.b. HTML is created from Chado XML

Data & Report Generation
  • Content of all output files is controlled by XML dumpspecs.
    • Dumpspecs are language independent.
    • Easily readable (with knowledge of Chado structure).
  • All XML transformation steps are done with XSLT v2.
    • Saxon XSLT (http://saxon.sourceforge.net/)
    • ChadoXML is split into individual chunks before XSLT processing to accommodate large file sizes.
    • Extremely fast. We can process all data for ~60,000 Drosophila genes in under 30 minutes.
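
The page does not say how the ChadoXML is split, so the following Python sketch is only one plausible approach under stated assumptions: it streams a large file and emits one small document per top-level feature element. The function name iter_chunks, the choice of feature as the chunk unit, and the chado wrapper element are illustrative. The XSLT step itself (Saxon) is not invoked here.

```python
import io
import xml.etree.ElementTree as ET


def iter_chunks(source, tag="feature", wrapper="chado"):
    """Yield one small XML document (as a string) per top-level `tag` element.

    `source` is a filename or a binary file object. Only direct children of
    the root count as chunks, so nested elements with the same tag stay
    inside their parent chunk.
    """
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                root = elem  # first start event is the document root
        else:
            depth -= 1
            if depth == 1 and elem.tag == tag:
                elem.tail = None  # drop whitespace that followed the element
                doc = ET.Element(wrapper)
                doc.append(elem)
                yield ET.tostring(doc, encoding="unicode")
                root.remove(elem)  # release the chunk so memory stays bounded


sample = (
    b"<chado>"
    b"<feature><uniquename>a</uniquename></feature>"
    b"<feature><uniquename>b</uniquename></feature>"
    b"</chado>"
)
chunks = list(iter_chunks(io.BytesIO(sample)))
print(len(chunks))
```

Each yielded string is a well-formed document in its own right, so it can be handed to an XSLT processor independently of the others.
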
Hibernate & XORT
  • Hibernate didn't scale well when dealing with 5,000+ features in bulk.
    • The test was simply calling print() statements
  • Performance tweaks for Hibernate can be quite complicated to set up for bulk operations.
  • XORT is currently handling ~6 million features in production with only minor performance problems.
  • XORT is much more language independent.
Support for complex transactions using XORT

For example:

  • Find all records linked to a record using dumpspec
  • Merge gene x into y, each with thousands of records attached

Step 1. Dump all data using a simple dumpspec

 <chado>
  <feature dump="all">
   <uniquename test="eq">x</uniquename>
  </feature>
 </chado>

Step 2. Delete feature x from the DB, with triggers to clean up orphan records if necessary

Step 3. Edit the output XML, changing uniquename x to y, then load the edited file back into the DB
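
The edit in Step 3 can be sketched in Python as follows. This is a minimal illustration, not the FlyBase procedure: it assumes the dumped file has a chado root whose feature children each carry a uniquename child, and the function name rename_feature is hypothetical. Real XORT output carries many more nested records, and whether nested references to x also need changing is not covered here. Loading the edited file back is left to XORT.

```python
import xml.etree.ElementTree as ET


def rename_feature(xml_text: str, old: str, new: str) -> str:
    """Replace uniquename `old` with `new` on top-level feature elements.

    Assumes a <chado> root with <feature> children, each holding a
    <uniquename> child; everything else is passed through unchanged.
    """
    root = ET.fromstring(xml_text)
    for feature in root.findall("feature"):
        name = feature.find("uniquename")
        if name is not None and name.text == old:
            name.text = new
    return ET.tostring(root, encoding="unicode")


# Stand-in for the Step 1 dump; the <name> element is illustrative only.
dumped = "<chado><feature><uniquename>x</uniquename><name>gene x</name></feature></chado>"
edited = rename_feature(dumped, "x", "y")
print(edited)
```

Since only the uniquename changes, every record dumped along with x comes back attached to y when the edited file is reloaded, which is the point of the merge.
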

CHIA (Chado Interface Application)

A Java application that organizes SQL and XORT functionality for internal users, e.g.:

  • Dump ChadoXML for gene regions for Apollo curation
  • Organize and execute “canned” SQL queries
  • Serve IDs for curators (in development)
  • Browse Chado dynamically without writing SQL statements

CHIA is being designed to be extensible for adding new functionality as needed.


Documentation
  • "Using Chado to Store Genome Annotation Data"
    • Current Protocols in Bioinformatics (Baxevanis, A.D., and Davison, D.B., eds) 2, 9.6.1-9.6.28.
  • XORT specification docs
  • XORT draft (unpublished)
  • GMOD case demo procedure
Acknowledgements
  • William Gelbart
  • Chris Mungall
  • David Emmert
  • Mark Gibson
  • Stan Letovsky
  • Nomi Harris
  • Frank Smutniak
  • Suzanna Lewis
  • Peili Zhang
  • Haiyan Zhang
  • Aubrey de Grey
  • Andy Schroeder
  • Don Gilbert
  • Susan Russo
  • Mark Zytkovicz
  • Scott Cain
  • Lincoln Stein
  • Victor Strelets
  • Robert Wilson
  • Paul Leyland