Difference between revisions of "Load GFF Into Chado"

m (Corrected S.Cer genome download link)
 
(11 intermediate revisions by 6 users not shown)
Line 1: Line 1:
__TOC__
+
This [[:Category:HOWTO|HOWTO]] describes a method for loading sequence annotation data in [[GFF3]] format into a [[Chado_-_Getting_Started|Chado database]].
  
==Abstract==
+
==Download the GFF3 Files==
 
+
This HOWTO describes a method for loading sequence annotation data in [[bp:GFF|GFF]] format into the [[Chado_-_Getting_Started|Chado database]].
+
 
+
==Authors==
+
+
* [[Scott Cain]]
+
* [[bp:Brian_Osborne|Brian Osborne]]
+
 
+
 
+
==Copyright==
+
 
+
This document is copyright Scott Cain , 2007. For reproduction other than personal use please contact <cain@cshl.edu>
+
 
+
 
+
==Revision History==
+
 
+
{| border="1" cellspacing="0" cellpadding="4"
+
|-
+
| Revision 1.0 2007-03-16 BIO
+
| First version
+
|-
+
|}
+
 
+
 
+
==Download the GFF Files==
+
  
 
An easy way
 
An easy way
to load data into the database is to use a [[bp:GFF|GFF3]] file and the script
+
to load data into the database is to use a [[GFF3]] file and the script
<code>load/bin/gmod_bulk_load_gff3.pl</code>.  A good set of sample data is the GFF3 file prepared
+
<code>load/bin/gmod_bulk_load_gff3.pl</code>.  A good set of sample data is the GFF3 file prepared by the nice folks at the [[:Category:SGD|Saccharomyces Genome Database]]:
by the nice folks at the Saccharomyces Genome Database:
+
 
+
    ftp://ftp.yeastgenome.org/pub/yeast/data_download/chromosomal_feature/saccharomyces_cerevisiae.gff
+
  
This file contains [http://geneontology.org Gene Ontology (GO)] anotations, so if you didn't load
+
: http://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff
GO when you executed <code>make ontologies</code> you will get many warning messages
+
about being unable to find entries in the [[Chado_Tables#Table:_dbxref|dbxref]] table.  If you want to
+
load GO  you should be able to execute <code>make ontologies</code> and select '''Gene Ontology'''
+
for installation.
+
  
 +
This file contains [http://geneontology.org Gene Ontology (GO)] annotations, so if you didn't load GO when you executed <code>make ontologies</code> you will get many warning messages about being unable to find entries in the [[Chado_Tables#Table:_dbxref|dbxref]] table.  If you want to
 +
load GO you should be able to execute <code>make ontologies</code> and select '''Gene Ontology''' for installation.
  
 
==Add an Entry for Your Organism==
 
==Add an Entry for Your Organism==
  
You will need to have an entry for your species in the [[Chado_Tables#Table:_organism|Chado organism table]]. If you are unsure if this entry exists log into your database and execute this SQL command:
+
You will need to have an entry for your species in the [[Chado_Tables#Table:_organism|Chado organism table]]. To add a new organism, run <code>gmod_add_organism.pl</code>, the tool that comes with Chado.
<sql>
+
select common_name from organism;
+
</sql>
+
If you do not see your organism listed, execute a command equivalent to this:
+
<sql>
+
  insert into organism (abbreviation, genus, species, common_name, organism_id)
+
                values ('S.cerevisiae', 'Saccharomyces', 'cerevisiae', 'yeast', 4932);
+
</sql>
+
  
Substitute in the appropriate values for your own organism if it's not ''yeast''.
+
This script will ask you for information about your organism:
  
 +
  Both genus and species are required; please provide them below
  
==Load the GFF==
+
  Organism's genus? 
 +
  Organism's species? 
 +
  Organism's abbreviation? []
 +
  Comment (can be empty)?
  
Then execute gmod_bulk_load_gff3.pl:
+
==Load the GFF3==
  
>gmod_bulk_load_gff3.pl --organism yeast  --gfffile saccharomyces_cerevisiae.gff
+
Unless your [[GFF3]] is already sorted by location with gene models grouped (gene, mRNA, CDS/exon/UTR), you must sort it first with [http://gmod.cvs.sourceforge.net/*checkout*/gmod/schema/chado/bin/gmod_gff3_preprocessor.pl gmod_gff3_preprocessor.pl]:
  
This loads the GFF3 file. The loading script requires [[bp:GFF3|GFF3]] because the format has tighter control of its syntax and requires the use of a controlled vocabulary (from [http://sequenceontology.org Sequence Ontology Feature Annotation (SOFA)]), which allows mapping to the relational schema. In addition to the <code>--gfffile</code> flag, which supplies the location of the file, the <code>--organism</code> flag takes the common name (<code>common_name</code> field) from the [[Chado_Tables#Table:_organism|Chado organism table]]. Run <code>perldoc gmod_bulk_load_gff3.pl</code> for more information on adding other organisms and databases, as well as other available command line flags.
+
  > gmod_gff3_preprocessor.pl --gfffile saccharomyces_cerevisiae.gff --outfile saccharomyces_cerevisiae.sorted.gff
  
Note that <code>gmod_load_gff3.pl</code> is also available, but is limited in how
+
Then execute <code>gmod_bulk_load_gff3.pl</code>:
much it has been supported and in how flexible it currently is.  It is
+
a good example of how to write code using Class::DBI classes that are
+
created at the time of install.  For more information on using these
+
classes, see http://sourceforge.net/projects/gmod-ware for a {{CPAN|Class::DBI}}-based middleware/API.
+
  
 +
>gmod_bulk_load_gff3.pl --organism yeast  --gfffile saccharomyces_cerevisiae.sorted.gff
  
==Creating GFF3 from GenBank Files==
+
This loads the [[GFF3]] file. The loading script requires [[GFF3]] because the format has tighter control of its syntax and requires the use of a controlled vocabulary (from [http://sequenceontology.org Sequence Ontology Feature Annotation (SOFA)]), which allows mapping to the relational schema. In addition to the <code>--gfffile</code> flag, which supplies the location of the file, the <code>--organism</code> flag takes the common name (<code>common_name</code> field) from the [[Chado_Tables#Table:_organism|Chado organism table]]. Run <code>perldoc gmod_bulk_load_gff3.pl</code> for more information on adding other organisms and databases, as well as other available command line flags.
  
GFF3 can also be generated using a script provided by [[bp:Main_Page|Bioperl]],  <code>scripts/Bio-DB-GFF/genbank2gff3.pl</code> (this script is currently preferred over the script of the same name found in the [[GMOD]] package). If your working directory contains a Genbank file you could use it like this:
+
Note that <code>gmod_load_gff3.pl</code> is also available, but is limited in how much it has been supported and in how flexible it currently is.  It is
 
+
a good example of how to write code using {{CPAN|Class::DBI}} classes that are created at the time of install. For more information on using these classes, see [[Modware]] for a {{CPAN|Class::DBI}}-based [[:Category:Middleware|middleware/API]].
>bp_genbank2gff3.pl --dir . --outdir .
+
 
+
A recent update (April 2007) to bp_genbank2gff3.pl and gmod_bulk_load_gff3.pl
+
should solve the first two problems below.  Another addition to bp_genbank2gff3.pl is the option --noCDS
+
that produces GFF gene models suited to loading to Chado.
+
  >bp_genbank2gff3.pl --noCDS --in mygenome.gbk
+
  >gmod_bulk_load_gff3.pl --database mygenome --gff mygenome.gbk.gff
+
 
+
This method  for generating GFF3 files is not completely satisfactory and development is ongoing to provide better translation. However, by proceeding carefully you should be able to get it to produce GFF3 that can be loaded. Possible errors from running this script, and their fixes, are described below.
+
 
+
 
+
'''couldn't open /var/lib/gmod/conf directory for reading:No such file or directory'''
+
 
+
Make sure the environment variable GMOD_ROOT is set to where gmod was installed, for example:
+
 
+
setenv GMOD_ROOT /usr/local/gmod/ # tcsh
+
 
+
export GMOD_ROOT=/usr/local/gmod/ # bash
+
 
+
'''Unable to find srcfeature <some feature> in the database'''
+
 
+
Solution: Edit the '##sequence-region' line (the 2nd line of the GFF3 output). Changing it to '# sequence-region' is enough, or you can remove the line.
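
The '##sequence-region' edit can also be scripted rather than done by hand. The following is a minimal sketch, assuming the directive starts at the beginning of a line; <code>comment_out_sequence_region</code> is a hypothetical helper name and is not part of GMOD or Bioperl.

```shell
# Hypothetical helper (not part of GMOD or Bioperl): print the GFF3 file with
# every '##sequence-region' directive turned into a plain '# sequence-region'
# comment. All other lines are passed through unchanged.
comment_out_sequence_region() {
    sed 's/^##sequence-region/# sequence-region/' "$1"
}

# Usage (the output file name is only an example):
#   comment_out_sequence_region mygenome.gbk.gff > mygenome.fixed.gff
```

Writing to a new file instead of editing in place keeps the original converter output available for comparison.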
+
 
+
 
+
'''Your GFF3 file uses a tag called <term>, but this term is not already in the cvterm and dbxref tables so that its value can be inserted into the featureprop table'''
+
 
+
Solution: This error message will be followed by SQL statements that insert the term in the correct way - execute them. By the way, one explanation for this error is that the source sequence was curated but not with terms from the [http://sequenceontology.org Sequence Ontology].
+
 
+
 
+
'''DBD::Pg::db pg_endcopy failed: ERROR:  duplicate key violates unique constraint "featureprop_c1"'''<br>
+
'''CONTEXT:  COPY featureprop, line ...'''
+
 
+
Solution: The CONTEXT line above is telling you what the offending data is. This error probably means that there are 2 features sharing the same name or ID and feature type in the GFF file. Correct these errors by hand and reload.
+
  
 
==Creating GFF3 from UniProt/SwissProt Files==
 
==Creating GFF3 from UniProt/SwissProt Files==
  
A recent update (April 2007) to bp_genbank2gff3.pl extends it to handle Swiss and EMBL format input,
+
A recent update (April 2007) to <code>bp_genbank2gff3.pl</code> extends it to handle Swiss and EMBL format input, along with GenBank.  You can now create [[GFF3]] entries of UniProt sequences suited to loading into [[Chado]], including most of the protein description, Dbxref, and related fields useful in annotating genome matches. Use the <code>--format Uniprot</code> flag to specify this input format (<code>--format EMBL</code> can also be useful).
along with GenBank.  You can now create GFF entries of UniProt sequences suited to loading into Chado,
+
including most of the protein description, Dbxref, and related fields useful in annotating genome matches.
+
Use the --format Uniprot flag to specify this input format (--format EMBL can also be useful).
+
  
 
   >bp_genbank2gff3.pl --noCDS --in uniprot-subset.dat --format Uniprot
 
   >bp_genbank2gff3.pl --noCDS --in uniprot-subset.dat --format Uniprot
 
   >gmod_bulk_load_gff3.pl --database mygenome --gff  uniprot-subset.dat.gff --organism fromdata
 
   >gmod_bulk_load_gff3.pl --database mygenome --gff  uniprot-subset.dat.gff --organism fromdata
  
Use the --organism fromdata flag to load UniProt with many organisms.  
+
Use the <code>--organism fromdata</code> flag to load UniProt with many organisms.
  
 
{{NeedsTesting}}
 
{{NeedsTesting}}
Line 126: Line 53:
 
==More Information==
 
==More Information==
  
See the related HOWTO [[Load_RefSeq_Into_Chado|Load RefSeq Into Chado]].
+
See the related HOWTO [[Load RefSeq Into Chado]].
  
 
Please send questions to the GMOD developers list:
 
Please send questions to the GMOD developers list:
Line 134: Line 61:
 
Or contact the [[GMOD_Help_Desk|GMOD Help Desk]]
 
Or contact the [[GMOD_Help_Desk|GMOD Help Desk]]
  
 +
==Authors==
 +
 +
* [[User:Scott|Scott Cain]]
 +
* [[bp:Brian_Osborne|Brian Osborne]]
  
 
[[Category:HOWTO]]
 
[[Category:HOWTO]]
 
[[Category:Chado]]
 
[[Category:Chado]]

Latest revision as of 15:43, 20 July 2015

This HOWTO describes a method for loading sequence annotation data in GFF3 format into a Chado database.

Download the GFF3 Files

An easy way to load data into the database is to use a GFF3 file and the script load/bin/gmod_bulk_load_gff3.pl. A good set of sample data is the GFF3 file prepared by the nice folks at the Saccharomyces Genome Database:

http://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff

This file contains Gene Ontology (GO) annotations, so if you didn't load GO when you executed make ontologies you will get many warning messages about being unable to find entries in the dbxref table. If you want to load GO you should be able to execute make ontologies and select Gene Ontology for installation.
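
To gauge how many GO-related warnings to expect, it can help to count the GO identifiers the file references before loading. This is a minimal sketch, assuming GO terms appear in the file as GO: followed by seven digits; count_go_terms is a hypothetical helper name, not part of GMOD.

```shell
# Hypothetical helper (not part of GMOD): print the number of distinct
# GO accessions (GO: followed by seven digits) mentioned in a GFF3 file.
count_go_terms() {
    grep -o 'GO:[0-9]\{7\}' "$1" | sort -u | wc -l | tr -d ' '
}

# Usage:
#   count_go_terms saccharomyces_cerevisiae.gff
```

A non-zero count suggests that loading GO first will avoid the dbxref warnings.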

Add an Entry for Your Organism

You will need to have an entry for your species in the Chado organism table. To add a new organism, run gmod_add_organism.pl, the tool that comes with Chado.

This script will ask you for information about your organism:

 Both genus and species are required; please provide them below
 Organism's genus?  
 Organism's species?  
 Organism's abbreviation? [] 
 Comment (can be empty)?

Load the GFF3

Unless your GFF3 is already sorted by location with gene models grouped (gene, mRNA, CDS/exon/UTR), you must sort it first with gmod_gff3_preprocessor.pl:

> gmod_gff3_preprocessor.pl --gfffile saccharomyces_cerevisiae.gff --outfile saccharomyces_cerevisiae.sorted.gff
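
One rough way to tell whether a file may need preprocessing is to check that every Parent referenced by a feature has already appeared as an ID earlier in the file. This is only a heuristic sketch, not the test the preprocessor itself applies, and it does not check ordering by location; parents_defined_first is a hypothetical helper name.

```shell
# Hypothetical helper (not part of GMOD): report features whose Parent has
# not been seen as an ID earlier in the file. Prints one line per problem
# and returns non-zero if any problem was found.
parents_defined_first() {
    awk -F '\t' '
        /^##FASTA/ { exit }            # stop at the embedded sequence section
        /^#/ || NF < 9 { next }        # skip comments and malformed lines
        {
            id = ""; parents = ""
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/)     id = substr(attrs[i], 4)
                if (attrs[i] ~ /^Parent=/) parents = substr(attrs[i], 8)
            }
            m = split(parents, p, ",")  # a feature may list several parents
            for (j = 1; j <= m; j++)
                if (!(p[j] in seen)) {
                    printf "line %d: Parent %s not defined yet\n", NR, p[j]
                    bad = 1
                }
            if (id != "") seen[id] = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage:
#   parents_defined_first saccharomyces_cerevisiae.gff
```

If the function prints nothing, parents precede their children throughout the file; any reported line is a hint that the gene models are not grouped.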

Then execute gmod_bulk_load_gff3.pl:

>gmod_bulk_load_gff3.pl --organism yeast  --gfffile saccharomyces_cerevisiae.sorted.gff

This loads the GFF3 file. The loading script requires GFF3 because the format has tighter control of its syntax and requires the use of a controlled vocabulary (from Sequence Ontology Feature Annotation (SOFA)), which allows mapping to the relational schema. In addition to the --gfffile flag, which supplies the location of the file, the --organism flag takes the common name (common_name field) from the Chado organism table. Run perldoc gmod_bulk_load_gff3.pl for more information on adding other organisms and databases, as well as other available command line flags.
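
Because feature types must come from the controlled vocabulary, it can be useful to list the types a file actually uses before loading it, so they can be compared against SOFA by eye. This is a minimal sketch; list_feature_types is a hypothetical helper name, not part of GMOD, and it does not itself validate the types.

```shell
# Hypothetical helper (not part of GMOD): print each distinct feature type
# (column 3) used in a GFF3 file, one per line, sorted.
list_feature_types() {
    awk -F '\t' '
        /^##FASTA/ { exit }          # ignore any embedded sequence section
        /^#/ || NF < 9 { next }      # skip comments and malformed lines
        { print $3 }
    ' "$1" | sort -u
}

# Usage:
#   list_feature_types saccharomyces_cerevisiae.sorted.gff
```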

Note that gmod_load_gff3.pl is also available, but is limited in how much it has been supported and in how flexible it currently is. It is a good example of how to write code using Class::DBI classes that are created at the time of install. For more information on using these classes, see Modware for a Class::DBI-based middleware/API.

Creating GFF3 from UniProt/SwissProt Files

A recent update (April 2007) to bp_genbank2gff3.pl extends it to handle Swiss and EMBL format input, along with GenBank. You can now create GFF3 entries of UniProt sequences suited to loading into Chado, including most of the protein description, Dbxref, and related fields useful in annotating genome matches. Use the --format Uniprot flag to specify this input format (--format EMBL can also be useful).

  >bp_genbank2gff3.pl --noCDS --in uniprot-subset.dat --format Uniprot
  >gmod_bulk_load_gff3.pl --database mygenome --gff  uniprot-subset.dat.gff --organism fromdata

Use the --organism fromdata flag to load UniProt with many organisms.

This code needs to be tested. Please help improve this section with your tests.

More Information

See the related HOWTO Load RefSeq Into Chado.

Please send questions to the GMOD developers list:

gmod-devel@lists.sourceforge.net

Or contact the GMOD Help Desk

Authors