Difference between revisions of "Load GenBank into Chado"

Latest revision as of 21:49, 30 December 2008

Abstract

This HOWTO describes how to load GenBank format files into Chado. For a thorough discussion of this topic, including all the files needed to set up a complete test environment, see:

http://eugenes.org/gmod/genbank2chado/

Summary

Install prerequisites: latest versions of Chado and GBrowse
Fetch GenBank genomes or chromosomes
Run the genbank2gff3 script from BioPerl (Important: use a version of the script created April 2007 or later)
Run gmod_bulk_load_gff3.pl script (from GMOD)
View genome(s) with GBrowse (see an example at eugenes.org).

In summary, to load Saccharomyces chromosome X into the Chado database 'mychado' from a Unix command line, run:

 curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
 | perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
 | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata
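
The one-liner is compact, but a plain shell pipeline reports only the exit status of its last command, so a failed download or conversion can go unnoticed. The following is a minimal sketch of the same three steps run one at a time with intermediate files. The function name load_chromosome and the file naming are illustrative, not part of BioPerl or GMOD; the script invocations and flags are copied from the one-liner above.

```shell
# Sketch: the same three steps as the pipeline above, run one at a time so
# that a failure in any step stops the run. The function name and the
# intermediate file names are illustrative assumptions.
load_chromosome() {
    url=$1      # GenBank file to fetch
    dbname=$2   # target Chado database

    gbk=$(basename "$url")

    # Step 1: download the GenBank file.
    curl -o "$gbk" "$url" \
        || { echo "download failed: $url" >&2; return 1; }

    # Step 2: convert GenBank to GFF3 (same flags as the one-liner).
    perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout < "$gbk" > "$gbk.gff" \
        || { echo "conversion failed: $gbk" >&2; return 1; }

    # Step 3: load the GFF3 into Chado (same flags as the one-liner).
    perl gmod_bulk_load_gff3.pl -dbname "$dbname" -organism fromdata < "$gbk.gff" \
        || { echo "load failed: $gbk.gff" >&2; return 1; }
}
```

With the function defined, the equivalent of the one-liner would be:

 load_chromosome ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk mychado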

Fetch GenBank Genome Files

GenBank genome data is available from the NCBI genomes section, ftp://ftp.ncbi.nih.gov/genomes, or from a current mirror at ftp://bio-mirror.net/biomirror/ncbigenomes/

 mkdir data; cd data

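Fetch from NCBI or from the Indiana mirror. The first command lists the mirror directory; the second downloads the Saccharomyces chromosome X file: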

 curl ftp://bio-mirror.net/biomirror/ncbigenomes/
 curl -RO ftp://bio-mirror.net/biomirror/ncbigenomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz

Other sample genomes of interest:

Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz
Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz
Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz
M_musculus/CHR_19/mm_ref_chr19.gbk.gz
H_sapiens/CHR_19/hs_ref_chr19.gbk.gz
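
To fetch several of these files in one go, the paths can be joined to the mirror address in a loop. This is a small sketch that only prints the URLs; the variable MIRROR and the function name sample_genome_urls are illustrative, and the paths are the ones listed in this section.

```shell
# Sketch: print one mirror URL per sample genome named in this section.
# MIRROR is the mirror address given above; the function name is illustrative.
MIRROR=ftp://bio-mirror.net/biomirror/ncbigenomes

sample_genome_urls() {
    for path in \
        Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz \
        Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz \
        Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz \
        Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz \
        M_musculus/CHR_19/mm_ref_chr19.gbk.gz \
        H_sapiens/CHR_19/hs_ref_chr19.gbk.gz
    do
        printf '%s/%s\n' "$MIRROR" "$path"
    done
}
```

The printed URLs can then be passed to curl with the same options as above, for example:

 sample_genome_urls | while read -r url; do curl -RO "$url"; done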

Create GFF3 from the GenBank Files

The BioPerl script bp_genbank2gff3.pl (scripts/Bio-DB-GFF/genbank2gff3.PLS) converts GenBank files to GFF3 suited to loading into Chado. Important: use a version of the script created April 2007 or later.

The new -noCDS flag is required here: it produces GFF3 gene models suited to loading into Chado. Use the -s flag to summarize the features found.

 bp_genbank2gff3.pl -noCDS -s -o . data/NC_001142.gbk.gz
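
As an independent check on the converted file, the feature types in the GFF3 output can be counted directly. The sketch below tallies column 3 (the feature type) with awk. The function name gff_type_counts is illustrative, and the output file name in the usage line assumes the script writes <input>.gff, the naming seen in the load step of this HOWTO.

```shell
# Sketch: count GFF3 features by type (column 3), skipping comment lines
# and stopping at an embedded ##FASTA section if there is one.
# The function name is an illustrative assumption.
gff_type_counts() {
    awk -F '\t' '
        /^##FASTA/ { exit }            # sequence data follows; END still runs
        /^#/ || NF < 9 { next }        # skip directives, comments, short lines
        { count[$3]++ }
        END { for (t in count) printf "%s\t%d\n", t, count[t] }
    ' "$1" | sort
}
```

For example:

 gff_type_counts data/NC_001142.gbk.gz.gff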

Load GFF3 into Chado

Use the GMOD script gmod_bulk_load_gff3.pl for this. Note that gmod_bulk_load_gff3.pl will only handle one organism at a time. Choose the best --dbxref for each organism (WormBase, SGD, MGI, FLYBASE), depending on the contents of the GenBank annotations. The 'GeneID' dbxref is standard for most GenBank genomes.

 gmod_bulk_load_gff3.pl  --dbname dev_chado_01c --dbxref GeneID --organism fromdata --gff data/NC_004353.gbk.gz.gff
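
Because the loader handles one organism at a time, several GFF3 files need one invocation each. The sketch below only prints the commands, so they can be reviewed before anything is run. The function name print_load_commands is illustrative; the loader options are the ones shown in the command above. File names containing spaces would need extra quoting.

```shell
# Sketch: emit one loader command per GFF3 file, since the loader handles
# one organism at a time. Commands are printed, not executed.
# The function name is an illustrative assumption.
print_load_commands() {
    dbname=$1   # target Chado database
    dbxref=$2   # dbxref to use for these files, e.g. GeneID
    shift 2
    for gff in "$@"; do
        printf 'gmod_bulk_load_gff3.pl --dbname %s --dbxref %s --organism fromdata --gff %s\n' \
            "$dbname" "$dbxref" "$gff"
    done
}
```

For example, to preview the commands for two converted files:

 print_load_commands dev_chado_01c GeneID data/NC_004353.gbk.gz.gff data/NC_001142.gbk.gz.gff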

Check data:

 psql -d dev_chado_01c -c 'select count(f.*), \
  (select common_name from organism where organism_id = f.organism_id) as species \
  from feature f group by f.organism_id;'
 psql -d dev_chado_01c -c 'select count(f.*), \
  (select common_name from organism where organism_id = f.organism_id) as species \
  from feature f where f.seqlen>0 group by f.organism_id;'
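
The same query can also be fed to psql on standard input, which avoids line-continuation backslashes inside the quoted SQL. This is a sketch of the first check query only; the function name feature_counts_sql is illustrative, and the SQL is the query shown above.

```shell
# Sketch: print the first check query so it can be piped to psql.
# The function name is an illustrative assumption.
feature_counts_sql() {
    cat <<'SQL'
select count(f.*),
  (select common_name from organism where organism_id = f.organism_id) as species
from feature f
group by f.organism_id;
SQL
}
```

For example:

 feature_counts_sql | psql -d dev_chado_01c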

Possible Errors

You may run into errors that come from the input data itself. Some of these errors, and their fixes, are described below.

couldn't open /var/lib/gmod/conf directory for reading:No such file or directory

Make sure the environment variable GMOD_ROOT is set to the directory where GMOD was installed, for example:

 setenv GMOD_ROOT /usr/local/gmod/ # tcsh

or

 export GMOD_ROOT=/usr/local/gmod/ # bash
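
A quick way to confirm the setting before running the loader is to test for the configuration directory. This is a minimal sketch; the function name check_gmod_root is illustrative, and the assumption that conf sits directly under GMOD_ROOT is drawn only from the path in the error message above.

```shell
# Sketch: verify that GMOD_ROOT is set and contains a conf directory.
# That conf lives directly under GMOD_ROOT is an assumption based on the
# path shown in the error message; the function name is illustrative.
check_gmod_root() {
    if [ -z "${GMOD_ROOT:-}" ]; then
        echo "GMOD_ROOT is not set" >&2
        return 1
    fi
    if [ ! -d "$GMOD_ROOT/conf" ]; then
        echo "no conf directory under $GMOD_ROOT" >&2
        return 1
    fi
    echo "GMOD_ROOT looks usable: $GMOD_ROOT"
}
```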

Your GFF3 file uses a tag called <term>, but this term is not already in the cvterm and dbxref tables so that its value can be inserted into the featureprop table

Solution: This error message is followed by SQL statements that insert the term correctly; execute them. One explanation for this error is that the source sequence was curated, but not with terms from the Sequence Ontology.

DBD::Pg::db pg_endcopy failed: ERROR: duplicate key violates unique constraint "featureprop_c1"
CONTEXT: COPY featureprop, line ...

Solution: The CONTEXT line above tells you what the offending data is. This error probably means that two features in the GFF3 file share the same name or ID and the same feature type. Correct these errors by hand and reload.
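
To locate candidates for this error before reloading, the GFF3 file can be scanned for feature lines that share both a type and an ID. The sketch below is a heuristic, not a reproduction of the loader's own check: it looks only at the ID attribute (not Name), and GFF3 allows a discontinuous feature to repeat its ID across lines, so a hit is something to inspect rather than a confirmed error. The function name gff_duplicate_ids is illustrative.

```shell
# Sketch: list (type, ID) pairs that occur on more than one feature line.
# Output columns: feature type, ID, number of lines. Heuristic only.
# The function name is an illustrative assumption.
gff_duplicate_ids() {
    awk -F '\t' '
        /^##FASTA/ { exit }            # sequence data follows; END still runs
        /^#/ || NF < 9 { next }        # skip directives, comments, short lines
        {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/) {
                    key = $3 "\t" substr(attrs[i], 4)
                    seen[key]++
                }
            }
        }
        END { for (k in seen) if (seen[k] > 1) printf "%s\t%d\n", k, seen[k] }
    ' "$1" | sort
}
```

For example:

 gff_duplicate_ids data/NC_004353.gbk.gz.gff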

Authors

Don Gilbert
Brian Osborne
