GMODTools
Description
Bulkfiles is a GMOD Perl package that generates Fasta, GFF, DNA and other bulk genome annotation files from Chado databases. It works with several FlyBase chado releases, with SGDLite, and has been tested with other chado databases. Once tuned to your project's needs with its organism and site configurations, it can generate public data releases on a regular basis. It produces all the contents needed for a GMOD Standard URL genome data download folder.
Outputs
- DNA files (full chromosomes) in raw and fasta formats
- GFF (v3) feature files
- Fasta sequence for each selected feature set, with headers from feature files
- BLAST Index files (NCBI)
- Additional genome data outputs can be configured with Perl adaptor packages following the Bulkfiles base adaptor object.
Why Use Bulkfiles?
Why use this package rather than another middleware layer for a chado db, such as chadoxml, chadodbi, or bioperl? The general logic is:
1. dump all chado db features using simple (and quick) sql, to common intermediate table files, and chromosome dna to raw files. The feature info is simple: type, location, name/id, and a few attributes (db_xrefs,..)
2. postprocess these table files to create the various public use formats (the time-consuming and configurable part), organized into per-chromosome files.
Here are some reasons we take this approach:
a. using simple sql to dump all db features to intermediate table allows easy checks that all features get to bulk files
b. simple sql dump is fast, and reliable in getting all mapped features by keeping logic simple
c. process table output in stages - better debugging of steps in process, and can split processing among computers. These stages are loosely coupled - one can go back, tweak configurations and get a new output w/o redoing the complete extraction process.
d. convert one common feature table + dna to several output formats in one step, or repeatedly as needed.
e. combine features from several chado dbs, and add other sources that may not be in your chado database.
f. model organism projects need fairly complex and data specific configurations - moving that to config files keeps code reusable.
g. each genome chado database has different policy and choices with respect to feature, vocabulary and other data. A highly configurable tool, with data extraction and correction methods that are separate and tunable is needed to adapt to such variation in genome databases.
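The two-stage logic above can be sketched in a few lines of shell. This is an illustration only, not Bulkfiles code: the six-column table layout, the `bulk_demo` folder, the `example` source name and the feature ids are all made up for the sketch, and the sample rows stand in for what a simple sql dump of chado features might produce.

```shell
#!/bin/sh
# Sketch of the two-stage approach; every name and value here is hypothetical.
set -eu
rm -rf bulk_demo
mkdir -p bulk_demo/gff bulk_demo/tables

# Stage 1 stand-in: an intermediate tab-delimited feature table.
# Assumed columns: chromosome, type, start, end, strand, id
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    chrI  gene 100 900  + gene0001 \
    chrI  mRNA 100 900  + mrna0001 \
    chrII gene 500 1500 - gene0002 > bulk_demo/features.tsv

# Stage 2a: postprocess the table into one GFF v3 file per chromosome.
awk -F '\t' -v OFS='\t' -v outdir=bulk_demo/gff '
{
    out = outdir "/" $1 ".gff"
    # Write the GFF header the first time a chromosome file is opened.
    if (!(out in seen)) { print "##gff-version 3" > out; seen[out] = 1 }
    # GFF columns: seqid, source, type, start, end, score, strand, phase, attributes
    print $1, "example", $2, $3, $4, ".", $5, ".", "ID=" $6 > out
}' bulk_demo/features.tsv

# Stage 2b: a second output from the same table, without touching the database again.
awk -F '\t' -v OFS='\t' '{ n[$1]++ } END { for (c in n) print c, n[c] }' \
    bulk_demo/features.tsv | sort > bulk_demo/tables/feature_counts.tsv

cat bulk_demo/gff/chrI.gff
```

Because the table file is kept, either stage 2 step can be rerun with different settings while stage 1 stays untouched, which is the loose coupling described in point c.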
Downloads
Here is a candidate release package for GMODTools:
curl -O http://eugenes.org/gmod/GMODTools/GMODTools-1.0.zip
unzip GMODTools*.zip
If you want to try out GMODTools from CVS check out using these commands:
cvs -d:pserver:anonymous@gmod.cvs.sourceforge.net:/cvsroot/gmod login
cvs -d:pserver:anonymous@gmod.cvs.sourceforge.net:/cvsroot/gmod co schema/GMODTools
These commands will create a directory named schema, with a directory named GMODTools in it. Note that you don't need to supply a password, and it may be very slow - the SourceForge anonymous CVS server is notoriously overworked.
Configuration
Bulkfiles has extensive configurations, in XML. This is both a strength and weakness. The strength is that most aspects for genome data publication, such as feature names, types, aspects of the output format, are under your project's control. The weakness is these need detailed documentation to make it easier to tune your site's configuration.
Once configured for your organism(s) and project database, it is quick to generate new bulk data release files, and link into a collection of public releases. A new release needs only a minor configuration update (release number and date), and can be generated automatically if desired.
- add details here
Sample Use
Load a genome chado db to Postgres database:
curl -O http://sgdlite.princeton.edu/download/sgdlite/sgdlite.sql.gz
createdb sgdlite
(gunzip -c sgdlite.sql.gz | psql -d sgdlite -f - ) >& log.load
Extract bulk files from database:
cd GMODTools
perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make
Sample Output Data Folder
The output data folder looks like this. It is suited to linking to a public web server for data downloading, e.g. as a GMOD Standard URL folder.
data/genomes/Saccharomyces_cerevisiae/sgdlite_2005_08_23/
Example.txt  README.html  blast/  fasta/  gff/  tmp/
HEADER.html  Release.txt  dna/    fff/    tables/
See the tables/ folder for summary tables. The contents including web HEADER, README and others are configured with the conf/bulkfiles/ configurations.
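Before linking a release folder to a web server, it can help to confirm that the expected subfolders are present. The sketch below is not part of GMODTools: `check_release` is a hypothetical helper, the subfolder list is simply copied from the sample listing above, and the `release_demo` folder is created by the sketch itself so that it runs without a real Bulkfiles output.

```shell
#!/bin/sh
# Hypothetical helper: report which expected subfolders a release folder lacks.
check_release() {
    missing=""
    for d in blast dna fasta fff gff tables; do
        [ -d "$1/$d" ] || missing="$missing $d"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "release layout ok"
}

# Simulate a release folder (assumption: a real one would come from bulkfiles.pl).
release=release_demo/sgdlite_2005_08_23
rm -rf release_demo
mkdir -p "$release/blast" "$release/dna" "$release/fasta" \
         "$release/fff" "$release/gff" "$release/tables"

check_release "$release"
```

Point the function at a real release path instead of the simulated one to use it as a pre-publication check.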
Sample Run Log
microbe% perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make
- Setting GMOD_ROOT=/usr/local/bio/argos/gmod/gmtmp/GMODTools
Config: title = SGD Lite; date = 20051129; from conf/bulkfiles/sgdbulk.xml
Config: title = Site Default settings for GMODTools; from conf/bulkfiles/site_defaults.xml
Config: title = Species abbreviations; date = 20051129; from conf/bulkfiles/organisms.xml
Config: title = Bulkfiles fileset definitions; date = 20040821; from conf/bulkfiles/filesets.xml
Config: title = Chado Feature mapping info; date = 20040821; from conf/bulkfiles/featuresets.xml
missing data dir /usr/local/bio/argos/gmod/gmtmp/GMODTools/data/genomes/Saccharomyces_cerevisiae
Config: title = Chado DB SQL; date = 20051129; from conf/bulkfiles/chadofeatsql.xml
Automaking feature_table files
Config: title = Blast index writer; date = 20040821; from conf/bulkfiles/blastfiles.xml
Config: title = Summary Tables; date = 20051217; from conf/bulkfiles/tablewriter.xml
Config: title = Gbrowse conf generator; date = 20040826; from conf/bulkfiles/gbrowseconf.xml
Config: title = Genome Web docs; date = 20051225; from conf/bulkfiles/genomeweb.xml
Changed 'current' release symlink to /usr/local/bio/argos/gmod/gmtmp/GMODTools/data/genomes/Saccharomyces_cerevisiae/sgdlite_2005_08_23; ok=1
Bulkfiles done. result=overviews:, fff+gff=146661, fasta=19849, blast=14, tables=ok,
Bulkfiles are located at /usr/local/bio/argos/gmod/gmtmp/GMODTools/data/genomes/Saccharomyces_cerevisiae/sgdlite_2005_08_23
Requirements
Bulkfiles is mostly a self-contained Perl package. It uses a few BioPerl parts plus XML::Simple for configuration files.
- Postgres and GMOD Chado database.
- Basic Perl tool set used for other GMOD packages.
See Also
Note that XORT offers an alternative approach to bulk uploads and downloads from a Chado database.