Difference between revisions of "XORT Presentation"

Latest revision as of 18:54, 9 October 2012

This Wiki section is an edited version of Josh Goodman and Pinglei Zhou's presentation.

1 Introduction
2 Chado XML
3 Components
4 Highlights of Chado XML Specification
5 Putting it together: New FlyBase dataflow Part 1
6 Putting it together: New FlyBase dataflow Part 2
7 Data & Report Generation
8 Hibernate & XORT
9 Support for complex transactions using XORT
10 CHIA (Chado Interface Application)
11 Documentation
12 Acknowledgements

Introduction

An XML-database mapping system for data exchange between a database and XML-driven applications
XORT can handle typical XML; it is not Chado-specific
Developed and supported by Pinglei Zhou at FlyBase Harvard; currently at version 0.007
Used at all FlyBase sites
- Harvard has an extensive library of Perl modules for generating ChadoXML
Written in Perl
Required Perl modules:
- XML::Parser::PerlSAX
- Unicode::String
- XML::DOM
- DBI

Chado XML

Is Chado XML necessary? No, but it may help you.
ChadoXML assists with incremental updates, if you want to avoid flush-and-reload.
While updates can be achieved with other middleware (for example, Perl Class::DBI or Java Hibernate), ChadoXML additionally provides a way to archive your transactions.
It provides bulk update and download, which other methods either lack or handle inefficiently.

Components

Database & Schema
ChadoXML Specification
DumpSpec
- DumpSpec files are simple XML files that tell XORT what to do
- DumpSpec files are language independent, being XML
- It's fairly easy for those who know the schema to read these files and understand what the operation is

Highlights of Chado XML Specification

Unique representation of a specific database schema
Does away with internal primary key values
Static vs. Operational
Encoding for non-ASCII characters
Macro mechanism (object reference)
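
The macro mechanism can be illustrated with a minimal Python sketch. It assumes that an `id` attribute labels an element and that a later foreign-key element refers to that label by its text; the sample fragment, the label `gene_term`, and the helper names are hypothetical and should be checked against the XORT specification docs. The point is only that records are linked by labels and natural keys rather than by internal primary key values.

```python
import xml.etree.ElementTree as ET

# Hypothetical ChadoXML fragment: the `id` attribute usage and the element
# names are assumptions for illustration, not copied from the XORT spec.
SAMPLE = """<chado>
  <cvterm id="gene_term">
    <name>gene</name>
  </cvterm>
  <feature>
    <uniquename>x</uniquename>
    <type_id>gene_term</type_id>
  </feature>
</chado>"""


def collect_macros(root):
    """Map each macro label (value of an `id` attribute) to its element."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve_reference(root, macros, path):
    """Return the element that the text at `path` refers to, or None."""
    ref = root.find(path)
    if ref is None or ref.text is None:
        return None
    return macros.get(ref.text.strip())


root = ET.fromstring(SAMPLE)
macros = collect_macros(root)
target = resolve_reference(root, macros, "feature/type_id")
print(target.tag, target.findtext("name"))
```

No database key appears anywhere in the fragment: the feature points at its type through the label alone.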

Putting it together: New FlyBase dataflow Part 1

There are three FlyBase sites, and most curation is done at Harvard and Cambridge. Proforma is the curation format at Cambridge and Harvard, but Harvard also curates with Apollo and ChadoXML.

Once data is in Chado, there is a denormalization step in moving it to a read-only database (the reporting instance). From the read-only database, XORT dumps ChadoXML for reporting purposes. Once the ChadoXML is created, XSLT version 2 is used to create HTML and GFF. HTML is for human-readable reports; GFF is for GBrowse and for various power users.

1.a. Proforma (FlyBase Cambridge) is converted to ChadoXML

1.b. ChadoXML is created by Apollo (Harvard)

1.c. ChadoXML is created by Java SEAN (Harvard)

2. All ChadoXML is loaded into Chado by XORT

Putting it together: New FlyBase dataflow Part 2

3. Chado (Harvard) is denormalized and loaded into Chado (Indiana)

4. ChadoXML is created from Chado using XORT

5.a. GFF and FASTA are created from ChadoXML

5.b. HTML is created from ChadoXML

Data & Report Generation

Content of all output files is controlled by XML dumpspecs.
- Dumpspecs are language independent.
- Easily readable (with knowledge of Chado structure).
All XML transformation steps are done with XSLT v2.
- Saxon XSLT (http://saxon.sourceforge.net/)
- ChadoXML is split into individual chunks before XSLT processing to accommodate large file sizes.
- Extremely fast. We can process all data for ~60,000 Drosophila genes in under 30 minutes.
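
The chunking step might look something like the following Python sketch. It is not the FlyBase implementation; it assumes the dump has a single root element (here `<chado>`) whose direct children are the units to be transformed, and the function name `iter_chunks` is hypothetical. Streaming with `iterparse` avoids holding the whole file in memory.

```python
import io
import xml.etree.ElementTree as ET


def iter_chunks(source):
    """Yield each direct child of the root element as a standalone XML string.

    `source` is a filename or a binary file object, as accepted by iterparse.
    """
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 1:  # just closed a direct child of the root
            elem.tail = None  # drop inter-element whitespace
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # release the subtree we no longer need


# Tiny stand-in for a large ChadoXML dump (assumed structure).
SAMPLE = (
    b"<chado>"
    b"<feature><uniquename>x</uniquename></feature>"
    b"<feature><uniquename>y</uniquename></feature>"
    b"</chado>"
)

chunks = list(iter_chunks(io.BytesIO(SAMPLE)))
for chunk in chunks:
    print(chunk)
```

Each yielded string could then be handed to an XSLT processor such as Saxon as its own small document.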

Hibernate & XORT

Hibernate didn't scale well when dealing with 5,000+ features in bulk.
- The test was simply calling print() statements
Performance tweaks for Hibernate can be quite complicated to set up for bulk operations.
XORT is currently handling ~6 million features in production with only minor performance problems.
XORT is much more language independent.

Support for complex transactions using XORT

For example:

Find all records linked to a record using a dumpspec
Merge gene x into y, each with thousands of records attached

Step 1. Dump all data using a simple dumpspec

 <chado>
  <feature dump="all">
   <uniquename test="eq">x</uniquename>
  </feature>
 </chado>

Step 2. Delete feature x from the DB, with triggers to clean up orphan records if necessary

Step 3. Edit the output XML, changing uniquename x to y, then load the edited file back into the DB
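
The edit in Step 3 could be scripted rather than done by hand. The Python sketch below assumes the dump contains `<feature>` elements with a direct `<uniquename>` child, as in the dumpspec in Step 1; the sample dump, its `<name>` element, and the function `rename_feature` are hypothetical. Loading the edited file back is still done with XORT and is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of the Step 1 dump, heavily shortened.
DUMP = """<chado>
  <feature>
    <uniquename>x</uniquename>
    <name>example gene</name>
  </feature>
</chado>"""


def rename_feature(xml_text, old, new):
    """Change the uniquename of every feature named `old` to `new`.

    Returns the edited XML text and the number of features changed.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for feature in root.iter("feature"):
        uniquename = feature.find("uniquename")  # direct child only
        if uniquename is None or uniquename.text is None:
            continue
        if uniquename.text.strip() == old:
            uniquename.text = new
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


edited, count = rename_feature(DUMP, "x", "y")
print(count)
print(edited)
```

Checking the returned count before reloading is a cheap guard against editing the wrong file or matching nothing.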

CHIA (Chado Interface Application)

A Java application that organizes SQL and XORT functionality for internal users, e.g.:

Dump ChadoXML for gene regions for Apollo curation
Organize and execute “canned” SQL queries
Serve IDs for curators (in development)
Dynamically browse Chado without writing SQL statements

CHIA is being designed to be extensible for adding new functionality as needed.

Documentation

"Using Chado to Store Genome Annotation Data"
- Current Protocols in Bioinformatics (Baxevanis, A.D., and Davison, D.B., eds) 2, 9.6.1-9.6.28.
XORT specification docs
XORT draft (unpublished)
GMOD case demo procedure
- All in the doc directory of the XORT package, http://www.gmod.org

Acknowledgements

William Gelbart
Chris Mungall
David Emmert
Mark Gibson
Stan Letovsky
Nomi Harris
Frank Smutniak
Suzanna Lewis
Peili Zhang
Haiyan Zhang
Aubrey de Grey
Andy Schroeder
Don Gilbert
Susan Russo
Mark Zytkovicz
Scott Cain
Lincoln Stein
Victor Strelets
Robert Wilson
Paul Leyland

@@ Line 3: / Line 3: @@
 =====Introduction=====
-* An XML-database mapping system for data exchange between DB and XML-driven application
+* An [[Glossary#XML|XML]]-database mapping system for data exchange between DB and XML-driven application
-* XORT can handle typical XML, it's not Chado-specific
+* [[XORT]] can handle typical XML, it's not [[Chado]]-specific
-* Developed/Supported by Pinglei Zhou at FlyBase    Harvard, 0.007 version now.
+* Developed/Supported by Pinglei Zhou at FlyBase Harvard, 0.007 version now.
 * Used at all FlyBase sites
 ** Harvard has extensive library of Perl modules for generating ChadoXML
 * Written in Perl
-* Required perl modules:
+* Required Perl modules:
-** XML::Parser::PerlSAX
+** {{CPAN|XML::Parser::PerlSAX}}
-** Unicode::String
+** {{CPAN|Unicode::String}}
-** XML::DOM
+** {{CPAN|XML::DOM}}
-** DBI
+** {{CPAN|DBI}}
 =====Chado XML=====
-* Is ChadoXML necessary? No, but it may help you.
+* Is [[Chado XML]] necessary? No, but it may help you.
 * ChadoXML assists with incremental updates, if you want to avoid flush-and-reload.
 * While update can be achived by other middleware (for example, perl Class::DBI, Java Hibernate), ChadoXML provide additional feature as way to archive your transaction.
@@ Line 43: / Line 43: @@
 There are three Flybase sites, and most curation is done at Harvard and
 Cambridge. Proforma is the curation format at Cambridge and Harvard, but
-Harvard also curates with Apollo and ChadoXML.
+Harvard also curates with [[Apollo]] and ChadoXML.
 Once in Chado, the reporting instance, there's a denormalization step
 in moving data to a read-only database. Once in the read-only database there are
 dumps, for reporting purposes, using XORT to create ChadoXML. Once
-ChadoXML is created version 2 of XSLT is used to create HTML and GFF. HTML reports
+ChadoXML is created version 2 of XSLT is used to create HTML and [[GFF]]. HTML reports
-are for human-readable reports, GFF for GBrowse and for various power
+are for human-readable reports, [[GFF]] for [[GBrowse]] and for various power
 users.
@@ Line 66: / Line 66: @@
 . ChadoXML is created from Chado using XORT
-.a. GFF and Fasta is created from ChadoXML
+.a. [[GFF]] and Fasta is created from ChadoXML
 .b. HTML is created from Chado XML
@@ Line 96: / Line 96: @@
 Step 1. Dump all data use simple dumpspec
-<xml>
+<syntaxhighlight lang="xml">
   <chado>
    <feature dump=“all”>
@@ Line 102: / Line 103: @@
    </feature>
   </chado>
-</xml>
+ </syntaxhighlight>
 Step 2 Delete feature x from DB, with triggers to clean orphan records, if necessary
 Step 3. Edit the output xml, change uniquename x to y, then load the edited file back to DB
@@ Line 130: / Line 131: @@
 =====Acknowledgements=====
 * Willian Gelbart
 * Chris Mungall
 * David Emmert
 * Mark Gibson
 * Stan Letovsky
 * Nomi Harris
 * Frank Smutniak
 * Suzanna Lewis
 * Peili Zhang
 * Stan Letovsky
 * Haiyan Zhang
 * Aubrey de Grey
 * Andy Schroeder
 * Don Gilbert
 * Susan Russo
 * Mark Zythovicz
 * Scott Cain
 * Lincoln Stein
@@ Line 151: / Line 152: @@
 * Robert Wilson
 * Paul Leyland
 [[Category:FlyBase]]
 [[Category:XORT]]
