This HOWTO describes how to load GenBank format files into Chado. For a thorough discussion of this topic, including all the files needed to set up a complete test environment, see:
http://eugenes.org/gmod/genbank2chado/
Two scripts are needed: the genbank2gff3 script from BioPerl (Important: use a version of the script created April 2007 or later) and the gmod_bulk_load_gff3.pl script from GMOD.
In summary, to load Saccharomyces chromosome X into the Chado database ‘mychado’ from a Unix command line, do:
curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
| perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
| perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata
GenBank genome data is available from the NCBI genomes section, ftp://ftp.ncbi.nih.gov/genomes, or from a current mirror at ftp://bio-mirror.net/biomirror/ncbigenomes/
mkdir data; cd data
Fetch from NCBI, or from this Indiana mirror:
curl ftp://bio-mirror.net/biomirror/ncbigenomes/
curl -RO ftp://bio-mirror.net/biomirror/ncbigenomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz
Other sample genomes of interest:
Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz
Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz
Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz
M_musculus/CHR_19/mm_ref_chr19.gbk.gz
H_sapiens/CHR_19/hs_ref_chr19.gbk.gz
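The sample genomes listed above can be fetched in one pass. The following is a minimal sketch, assuming the mirror uses the same directory layout for every genome as it does for the yeast example. The function names `mirror_url` and `fetch_genomes` are illustrative and are not part of GMOD or BioPerl.

```shell
#!/bin/sh
# Base URL of the mirror named earlier in this HOWTO.
MIRROR="ftp://bio-mirror.net/biomirror/ncbigenomes"

# Print the full URL for a path relative to the mirror root.
mirror_url() {
    printf '%s/%s\n' "$MIRROR" "$1"
}

# Download each relative path given as an argument into the current directory.
# -R keeps the remote timestamp, -O saves under the remote file name.
fetch_genomes() {
    for path in "$@"; do
        curl -RO "$(mirror_url "$path")" || echo "failed: $path" >&2
    done
}

# Usage (hypothetical, nothing is downloaded until this is called):
#   fetch_genomes Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz \
#                 Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz
```

Reporting a failed download and continuing, rather than aborting, lets the remaining genomes still be fetched when one path has moved on the mirror.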
The BioPerl script bp_genbank2gff3.pl (scripts/Bio-DB-GFF/genbank2gff3.PLS) converts GenBank files to GFF3 suited to Chado loading. Important: use a version of the script created April 2007 or later. The new -noCDS flag is required for this. Use the -s flag to summarize the features found.
bp_genbank2gff3.pl -noCDS -s -o . data/NC_001142.gbk.gz
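Before loading, it can help to check independently which feature types ended up in the converted file. The sketch below assumes a standard tab-separated, nine-column GFF3 file and counts the values in column 3. The function name `gff_type_counts` is illustrative, and the path of the GFF3 file is passed in by the caller rather than assumed.

```shell
#!/bin/sh
# Count features per type (column 3) in a GFF3 file.
# Comments and directives are skipped, and reading stops at the
# optional ##FASTA section so sequence lines are never counted.
gff_type_counts() {
    awk -F '\t' '
        /^##FASTA/ { exit }          # sequence data follows; jump to END
        /^#/ || NF < 9 { next }      # skip comments and incomplete lines
        { n[$3]++ }
        END { for (t in n) printf "%s\t%d\n", t, n[t] }
    ' "$1" | sort
}

# Usage (hypothetical file name):
#   gff_type_counts path/to/converted.gff
```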
To load the GFF3 into Chado, use the GMOD script gmod_bulk_load_gff3.pl. Note that gmod_bulk_load_gff3.pl will only handle one organism at a time. Choose the best --dbxref per organism (WormBase, SGD, MGI, FLYBASE), depending on the contents of the GenBank annotations. The ‘GeneID’ dbxref is standard for most GenBank genomes.
gmod_bulk_load_gff3.pl --dbname dev_chado_01c --dbxref GeneID --organism fromdata --gff data/NC_004353.gbk.gz.gff
Check data:
psql -d dev_chado_01c -c 'select count(f.*),
(select common_name from organism where organism_id = f.organism_id) as species
from feature f group by f.organism_id;'
psql -d dev_chado_01c -c 'select count(f.*),
(select common_name from organism where organism_id = f.organism_id) as species
from feature f where f.seqlen>0 group by f.organism_id;'
It’s possible that you’ll run into some errors coming from the input data itself. Some of the errors, and their fixes, are described below.
couldn’t open /var/lib/gmod/conf directory for reading:No such file or directory
Make sure the environment variable GMOD_ROOT is set to the directory where GMOD was installed, for example:
setenv GMOD_ROOT /usr/local/gmod/ # tcsh
or
export GMOD_ROOT=/usr/local/gmod/ # bash
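A quick check before running the loader can confirm that the variable is set and points somewhere usable. This is a sketch only: it assumes the configuration directory is `$GMOD_ROOT/conf`, which is inferred from the default path `/var/lib/gmod/conf` in the error message above, and the function name `check_gmod_root` is hypothetical.

```shell
#!/bin/sh
# Return 0 if GMOD_ROOT is set and contains a conf/ directory
# (assumed layout); otherwise print the reason and return 1.
check_gmod_root() {
    if [ -z "${GMOD_ROOT:-}" ]; then
        echo "GMOD_ROOT is not set" >&2
        return 1
    fi
    if [ ! -d "$GMOD_ROOT/conf" ]; then
        echo "no conf directory under $GMOD_ROOT" >&2
        return 1
    fi
    return 0
}

# Usage:
#   check_gmod_root && echo "GMOD_ROOT looks usable"
```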
Your GFF3 file uses a tag called <term>, but this term is not already in the cvterm and dbxref tables so that its value can be inserted into the featureprop table
Solution: This error message is followed by SQL statements that insert the term correctly; execute them. One explanation for this error is that the source sequence was curated, but not with terms from the Sequence Ontology.
DBD::Pg::db pg_endcopy failed: ERROR: duplicate key violates unique constraint “featureprop_c1”
CONTEXT: COPY featureprop, line …
Solution: The CONTEXT line above tells you which data is causing the problem. This error probably means that two features in the GFF3 file share the same name or ID and the same feature type. Correct these errors by hand and reload.
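Candidate duplicates can be located before editing by hand. The sketch below assumes standard nine-column GFF3 with `ID=` attributes in column 9 and reports every (type, ID) pair that appears on more than one line. The function name `gff_duplicate_ids` is illustrative. GFF3 allows one feature to span several lines with a shared ID (split CDS segments, for example), so each hit needs review rather than automatic removal.

```shell
#!/bin/sh
# List ID values that occur on more than one line with the same feature type.
# Output columns (tab-separated): line count, feature type, ID.
gff_duplicate_ids() {
    awk -F '\t' '
        /^##FASTA/ { exit }          # stop before any embedded sequence
        /^#/ || NF < 9 { next }      # skip comments and incomplete lines
        {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/)
                    seen[$3 "\t" substr(attrs[i], 4)]++
        }
        END {
            for (k in seen)
                if (seen[k] > 1) printf "%d\t%s\n", seen[k], k
        }
    ' "$1" | sort
}

# Usage (hypothetical file name):
#   gff_duplicate_ids path/to/converted.gff
```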