Bulkfiles is a GMOD Perl package that generates Fasta, GFF3, DNA and other bulk genome annotation files from Chado databases. It works with several FlyBase Chado releases, with SGDLite, and has been tested with other Chado databases. Once tuned to your project’s needs with its organism and site configurations, it can generate public data releases on a regular basis. It produces all the contents needed for a GMOD Standard URL genome data download folder.
These are current primary outputs, which are configurable to suit project needs.
Additional genome outputs can be added with Perl adaptor packages from a Bulkfiles base adaptor object.
Why use this package rather than other middleware layers to a Chado database, such as Chado XML, Chado::AutoDBI or BioPerl? The general logic is
Here are some reasons we take this approach:
Here is a candidate release package for GMODTools:
curl -O http://eugenes.org/gmod/GMODTools/GMODTools-1.2.zip
unzip GMODTools*.zip
If you want to try out GMODTools from SVN check out using these commands:
svn co https://gmod.svn.sourceforge.net/svnroot/gmod/schema/trunk/GMODTools
These commands will create a directory named GMODTools, with a directory named GMODTools in it. Note that you don’t need to supply a password, and it may be very slow.
Bulkfiles has extensive configurations, in a simple XML format. This is both a strength and a weakness. The strength is that most aspects of genome data publication, such as feature names, types, and details of the output format, are under your project’s control. The weakness is that these settings need detailed documentation before it is easy to tune your site’s configuration.
All of the organism and project-specific logic is in these configuration files, including output documentation, feature controls and naming, and file choices.
Once configured for your organism(s) and project database, it is quick to generate new bulk data release files, and link into a collection of public releases. A new release needs only a minor configuration update (release number and date), and can be generated automatically if desired.
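As a rough sketch of how an automatic release run might be wrapped, the function below builds the bulkfiles.pl invocation used in this document and, by default, only prints it. The function name make_release, the DRY_RUN switch and the GMODTOOLS_DIR variable are assumptions made for illustration; they are not part of Bulkfiles.

```shell
# Hypothetical wrapper for a scheduled release run (not part of Bulkfiles).
# DRY_RUN and GMODTOOLS_DIR are illustrative names chosen for this sketch.
make_release() {
  conf="$1"   # configuration name, e.g. sgdbulk
  cmd="perl -Ilib bin/bulkfiles.pl -conf $conf -make"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only show the command that would be run.
    echo "$cmd"
  else
    # Run inside the GMODTools directory; the log is written to the
    # directory the wrapper was called from.
    (cd "${GMODTOOLS_DIR:-GMODTools}" && $cmd) > "bulkfiles-$conf.log" 2>&1
  fi
}

make_release sgdbulk
```

Setting DRY_RUN=0 in a cron job would perform the real run, assuming the release configuration has already been updated with the new release number and date.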
These are the main contents of the configuration files
sgdbulk.xml: main release configuration for tested sample. This main release XML controls what other configurations are used. A new release configuration with date, release name should be added as needed.
bulkfiles_template.xml: documented template for creating your project/organism configuration.
site_defaults.xml: common database and output settings for your site.
chadofeatsql.xml: primary Chado SQL used to extract data from database. If you have a complex Chado database, you may well want to add to or update this.
chadofeatconv.xml: logic to convert Chado view to public view of features. What features are to be published, the structure of features, and much of the messy genome detail are included here. It is complex but that is part of the territory with genome databases.
fastawriter.xml, blastfiles.xml, genbanksubmit.xml have some site configurations such as path to NCBI tools that will need attention for proper use.
Load a genome Chado db to PostgreSQL database:
curl -O http://sgdlite.princeton.edu/download/sgdlite/sgdlite.sql.gz
createdb sgdlite
(gunzip -c sgdlite.sql.gz | psql -d sgdlite -f - ) >& log.load
Extract bulk files from database:
cd GMODTools
perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make
If your test fails, please re-run with the -debug option and send the resulting log file to the developer contact below.
perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make -debug >& gmodtools-debug.log
The output data folder looks like this. It is suited to linking to a public web server for data downloading, e.g. the GMOD Standard URL.
data/genomes/Saccharomyces_cerevisiae/sgdlite_2005_08_23/
Example.txt README.html blast/ fasta/ gff/ tmp/
HEADER.html Release.txt dna/ fff/ tables/
See the tables/ folder for summary tables. The contents including web HEADER, README and others are configured with the conf/bulkfiles/ configurations.
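One way to link a release folder into a web server's document tree is a symbolic link. The sketch below works in a temporary directory with a stand-in release folder so it can be run anywhere; the htdocs/genomes location is an assumed document root, not something Bulkfiles defines.

```shell
# Illustrative only: every path lives under a temporary directory.
work=$(mktemp -d)

# Stand-in for the release folder that Bulkfiles would have generated.
release="$work/data/genomes/Saccharomyces_cerevisiae/sgdlite_2005_08_23"
mkdir -p "$release/tables" "$release/fasta" "$release/gff"

# Assumed web server document root (hypothetical location).
docroot="$work/htdocs"
mkdir -p "$docroot/genomes"

# Publish the release by linking it under the document root.
ln -s "$release" "$docroot/genomes/sgdlite_2005_08_23"

# List the published folders through the link.
ls "$docroot/genomes/sgdlite_2005_08_23/"
```

On a real site, the release and docroot variables would point at the generated data folder and the server's actual download area.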
microbe% perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make
- Setting GMOD_ROOT=/bio/argos/gmod/GMODTools
Config: title = SGD Lite; date = 20051129; from conf/bulkfiles/sgdbulk.xml
Config: title = Site Default settings for GMODTools; from conf/bulkfiles/site_defaults.xml
Config: title = Species abbreviations; date = 20051129; from conf/bulkfiles/organisms.xml
Config: title = Bulkfiles fileset definitions; date = 20040821; from conf/bulkfiles/filesets.xml
Config: title = Chado Feature mapping info; date = 20040821; from conf/bulkfiles/featuresets.xml
missing data dir data/genomes/Saccharomyces_cerevisiae
Config: title = Chado DB SQL; date = 20051129; from conf/bulkfiles/chadofeatsql.xml
Automaking feature_table files
Config: title = Blast index writer; date = 20040821; from conf/bulkfiles/blastfiles.xml
Config: title = Summary Tables; date = 20051217; from conf/bulkfiles/tablewriter.xml
Config: title = Gbrowse conf generator; date = 20040826; from conf/bulkfiles/gbrowseconf.xml
Config: title = Genome Web docs; date = 20051225; from conf/bulkfiles/genomeweb.xml
Changed 'current' release symlink to data/genomes/Saccharomyces_cerevisiae/sgdlite_2005_08_23; ok=1
Bulkfiles done. result=overviews:, fff+gff=146661, fasta=19849, blast=14, tables=ok,
Bulkfiles are located at data/genomes/Saccharomyces_cerevisiae/sgdlite_2005_08_23
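For a scripted release it can help to check the summary line of the run. The sketch below pulls individual counts out of the "Bulkfiles done." line copied from the log above. The helper get_count is hypothetical, and it assumes the summary keeps the key=value, layout shown here, which may differ between versions.

```shell
# Summary line copied from the sample run log in this document.
line='Bulkfiles done. result=overviews:, fff+gff=146661, fasta=19849, blast=14, tables=ok,'

# Hypothetical helper: print the value recorded for one key of the summary.
# Prints nothing if the key is not present.
get_count() {
  printf '%s\n' "$1" | sed -n "s/.*[ ,]$2=\([^,]*\),.*/\1/p"
}

fasta=$(get_count "$line" fasta)
if [ "${fasta:-0}" -gt 0 ]; then
  echo "fasta records: $fasta"
else
  echo "no fasta output reported" >&2
fi
```

In practice the line would be read from the saved log (for example with grep '^Bulkfiles done' on the log file) rather than from a literal string.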
See also this brief GMODTools TestCase that describes how to load a GenBank genome to Chado then regurgitate it via Bulkfiles as a GenBank submission file set.
Bulkfiles is mostly a self-contained Perl package. It uses a few BioPerl parts plus XML::Simple for configuration files.
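Before a first run, it may be worth checking that the Perl modules are installed. The sketch below uses perl's -M switch to try loading each module. The helper have_module is hypothetical, and Bio::SeqIO is only an assumed stand-in for "a few BioPerl parts"; the exact BioPerl modules Bulkfiles loads are not listed here.

```shell
# Hypothetical helper: succeeds if perl can load the named module.
have_module() {
  perl -M"$1" -e 1 >/dev/null 2>&1
}

# XML::Simple is named in this document; Bio::SeqIO is an assumed
# representative BioPerl module, not a confirmed requirement.
for mod in XML::Simple Bio::SeqIO; do
  if have_module "$mod"; then
    echo "found:   $mod"
  else
    echo "missing: $mod"
  fi
done
```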
Version 1.2 (2008 May) IN PROGRESS
There is now enough of a GMODTools framework for dumping Chado genome databases to GenBank Submit format that it will likely save effort for those who need to do this job. This is open source, and collaborators are welcome to add code here:
http://gmod.cvs.sourceforge.net/gmod/schema/GMODTools/
especially lib/Bio/GMOD/Bulkfiles/GenbankSubmitWriter.pm and conf/bulkfiles/genbanksubmit.xml
The above code is packaged at http://eugenes.org/gmod/GMODTools/ as GMODTools-1.2.zip
Here are sample Bulkfiles outputs from DrosMel CHR_4 and AnoGam CHR_X
http://insects.eugenes.org/genome/Drosophila_melanogaster/dromel_20080512/
http://insects.eugenes.org/genome/Anopheles_gambiae_str._PEST/anogam_20080511/
Version 1.1 (2007 October) adds these features and corrections: