GMODTools

From GMOD
Revision as of 01:38, 26 September 2007 by Dongilbert (Talk | contribs)

Description

Bulkfiles is a GMOD Perl package that generates FASTA, GFF, DNA and other bulk genome annotation files from Chado databases. It works with several FlyBase Chado releases and with SGDLite, and has been tested with other Chado databases.

Bulkfiles is mostly self-contained, but uses a few BioPerl modules plus XML::Simple for its configuration files. All organism- and database-specific logic should live in these configuration files (see GMODTools/conf/bulkfiles/).

Outputs

  • DNA files (full chromosomes) in raw and fasta formats
  • GFF (v3) feature files
  • Fasta sequence for each selected feature set, with headers from feature files
  • BLAST Index files (NCBI)
  • Additional genome data outputs can be configured with Perl adaptor packages following the Bulkfiles base adaptor object.
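To illustrate the per-feature FASTA output, here is a minimal sketch that cuts one feature out of a raw chromosome sequence and writes a FASTA record whose header is built from the feature's fields. It is written in Python rather than Perl, and the function names, the header layout and the sample data are assumptions made for illustration; they are not Bulkfiles' actual code or output format.

```python
# Illustrative sketch only: names and header layout are hypothetical.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def feature_sequence(chrom_dna, start, end, strand):
    """Return the feature's bases; start/end are 1-based and inclusive, as in GFF."""
    seq = chrom_dna[start - 1:end]
    if strand == "-":
        # Minus-strand features are reported as the reverse complement.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq

def fasta_record(feature, chrom_dna, width=60):
    """Build one FASTA record, wrapping the sequence at `width` bases per line."""
    seq = feature_sequence(chrom_dna, feature["start"], feature["end"], feature["strand"])
    header = ">{id} type={type}; loc={chrom}:{start}..{end}({strand})".format(**feature)
    body = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + body) + "\n"

# Made-up chromosome and feature, standing in for a raw DNA file and a feature row.
chrom_dna = "ATGCGTACGTTAGC"
gene = {"id": "gene0001", "type": "gene", "chrom": "chr1",
        "start": 2, "end": 7, "strand": "-"}
print(fasta_record(gene, chrom_dna), end="")
```

Keeping the header fields in the feature record means the same record can drive both the GFF line and the FASTA header.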

Why Use Bulkfiles?

Why use this package rather than another middleware layer over a Chado database, such as chadoxml, chadodbi or BioPerl? The general logic is:

1. Dump all Chado database features, using simple (and quick) SQL, to common intermediate table files, and dump chromosome DNA to raw files. The feature information is simple: type, location, name/ID, and a few attributes (db_xrefs, ...).

2. Postprocess these table files to create the various public-use formats, organized into per-chromosome files. This is the time-consuming and configurable part.
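The two stages can be sketched as follows. This is a toy illustration in Python, not the Bulkfiles implementation: an in-memory SQLite database stands in for the Postgres Chado database, and the table `toy_feature`, its columns, the file names and the `toydb` source label are all hypothetical. The real Chado schema spreads this information over several tables.

```python
import csv
import sqlite3
import tempfile
from collections import defaultdict
from pathlib import Path

def dump_features(conn, table_path):
    """Stage 1: one simple query, written straight to a tab-delimited table file."""
    rows = conn.execute(
        "SELECT chrom, type, start, stop, strand, name FROM toy_feature "
        "ORDER BY chrom, start, name"
    )
    with open(table_path, "w", newline="") as out:
        csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)

def write_gff_per_chromosome(table_path, out_dir, source="toydb"):
    """Stage 2: read the table file back and write one GFF3 file per chromosome."""
    by_chrom = defaultdict(list)
    with open(table_path, newline="") as src:
        for chrom, ftype, start, stop, strand, name in csv.reader(src, delimiter="\t"):
            # GFF3 columns: seqid, source, type, start, end, score, strand, phase, attributes
            by_chrom[chrom].append(
                "\t".join([chrom, source, ftype, start, stop, ".", strand, ".", f"ID={name}"])
            )
    paths = {}
    for chrom, lines in by_chrom.items():
        path = Path(out_dir) / f"{chrom}.gff"
        path.write_text("##gff-version 3\n" + "\n".join(lines) + "\n")
        paths[chrom] = path
    return paths

# Hypothetical toy database standing in for Chado.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_feature "
             "(chrom TEXT, type TEXT, start INTEGER, stop INTEGER, strand TEXT, name TEXT)")
conn.executemany("INSERT INTO toy_feature VALUES (?,?,?,?,?,?)", [
    ("chr2", "gene", 50, 300, "-", "geneB"),
    ("chr1", "gene", 100, 900, "+", "geneA"),
    ("chr1", "mRNA", 100, 900, "+", "geneA-RA"),
])

work_dir = Path(tempfile.mkdtemp())
table_file = work_dir / "features.tsv"
dump_features(conn, table_file)                               # stage 1
gff_files = write_gff_per_chromosome(table_file, work_dir)    # stage 2
print(gff_files["chr1"].read_text(), end="")
```

Because stage 2 reads only the table file, it can be rerun with different settings without touching the database again.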

Here are some reasons we take this approach:

a. Using simple SQL to dump all database features to an intermediate table makes it easy to check that every feature reaches the bulk files.

b. The simple SQL dump is fast (30 to 60 minutes for the full fly genome) and, because the logic stays simple, reliable in capturing all mapped features.

c. Processing the table output in stages makes each step easier to debug and allows the processing to be split among computers.

c1. The stages are loosely coupled: one can go back, tweak the configuration and generate new output without redoing the complete extraction.

d. One common feature table plus DNA can be converted to several output formats in one step, or repeatedly as needed.

e. Features from several Chado databases can be combined (FlyBase now has three Chado databases for D. melanogaster genome features), and other sources such as FlyBase cytology features can be added.

f. The configurations needed are fairly complex and data-specific; moving them to configuration files keeps the code reusable.

g. Each genome Chado database has different policies and choices regarding features, vocabularies and other data. A highly configurable tool, with separate and tunable data extraction and correction methods, is needed to adapt to such variation among genome databases.
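The check mentioned in point (a) could look something like the sketch below, which compares per-type feature counts in the intermediate table with those in the GFF output. It is a Python illustration under assumed layouts: the intermediate table is taken to hold the feature type in its second tab-delimited column, which is a hypothetical layout rather than the documented Bulkfiles one, while the GFF type is in the third column as the GFF3 format defines.

```python
from collections import Counter

def count_types(lines, type_column):
    """Count features per type in tab-delimited lines, skipping comments and blanks."""
    counts = Counter()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            counts[line.rstrip("\n").split("\t")[type_column]] += 1
    return counts

def missing_features(table_lines, gff_lines):
    """Return {type: number_missing} for types under-represented in the GFF output."""
    expected = count_types(table_lines, type_column=1)  # assumed table layout
    written = count_types(gff_lines, type_column=2)     # GFF3 column 3 is the type
    # Counter subtraction keeps only the positive differences.
    return dict(expected - written)

# Made-up sample data: the mRNA row never made it into the GFF output.
table = [
    "chr1\tgene\t100\t900\t+\tgeneA",
    "chr1\tmRNA\t100\t900\t+\tgeneA-RA",
    "chr2\tgene\t50\t300\t-\tgeneB",
]
gff = [
    "##gff-version 3",
    "chr1\ttoydb\tgene\t100\t900\t.\t+\t.\tID=geneA",
    "chr2\ttoydb\tgene\t50\t300\t.\t-\t.\tID=geneB",
]
missing = missing_features(table, gff)
print(missing)
```

An empty result means every feature type in the table is fully represented in the output.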


Downloads

Here is a candidate release package for GMODTools:

>curl -O http://eugenes.org/gmod/GMODTools/GMODTools-1.0.zip
>unzip GMODTools*.zip

To try out GMODTools from CVS, check it out using these commands:

>cvs -d:pserver:anonymous@gmod.cvs.sourceforge.net:/cvsroot/gmod login
>cvs -d:pserver:anonymous@gmod.cvs.sourceforge.net:/cvsroot/gmod co schema/GMODTools

These commands will create a directory named schema, with a directory named GMODTools in it. You don't need to supply a password. The checkout may be very slow, because the SourceForge anonymous CVS server is notoriously overworked.


Sample Use

Load a genome Chado database into Postgres:

>curl -O http://sgdlite.princeton.edu/download/sgdlite/sgdlite.sql.gz
>createdb sgdlite
>(gunzip -c sgdlite.sql.gz | psql -d sgdlite -f - ) >& log.load

Extract bulk files from database:

>cd GMODTools
>perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make

It should take only a few minutes to run.


Requirements

  • Postgres
  • Basic Perl tool set used for other GMOD packages.

See Also

Note that XORT offers an alternative approach to bulk uploads and downloads from a Chado database.

Contact

Dongilbert