Difference between revisions of "Load RefSeq Into Chado"
* [[User:Scott|Scott Cain]]
* [[bp:Brian_Osborne|Brian Osborne]]
[[Category:HOWTO]] | [[Category:HOWTO]] | ||
[[Category:Chado]] | [[Category:Chado]] |
Revision as of 21:53, 30 December 2008
This HOWTO describes a method for loading the sequence data in Genbank RefSeq files into the Chado database.
Download the Sequence Files
These steps have been used to load data from genomic RefSeq files, which you can recognize by their NC_ and NT_ prefixes. First download the Genbank genome records of interest. A good source for RefSeq files is NCBI's FTP site; the files will probably be compressed and have .gbk.gz suffixes.
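If you prefer to work with uncompressed files, the downloaded records can be unpacked with a few lines of Python. This is only a minimal sketch using the standard library; the function name decompress_gbk and the example filename are illustrative assumptions, not part of GMOD or BioPerl.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gbk(gz_path):
    """Decompress one .gbk.gz file next to the original and return the new path.

    Hypothetical helper for illustration only.
    """
    gz_path = Path(gz_path)
    # Dropping the last suffix turns NC_000001.gbk.gz into NC_000001.gbk
    out_path = gz_path.with_suffix("")
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

For example, decompress_gbk("NC_000001.gbk.gz") would write NC_000001.gbk in the same directory, leaving the compressed file in place.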
Convert RefSeq to GFF3
Use the BioPerl script genbank2gff3.pl, found in scripts/Bio-DB-GFF/ within the BioPerl distribution. If you have installed BioPerl, the installed script will have been renamed bp_genbank2gff3.pl. Note that there is also an older genbank2gff.pl script; do not use it.
>bp_genbank2gff3.pl <filename>
This will create a GFF3 file. It may give several warnings about unrecognized feature types. If the feature types are not part of SOFA, you will have to hand edit the resulting GFF3 file to change the feature type. Any skipped features will be printed at the end. If you want those to be part of the GFF3 file, you will have to add those manually as well, fixing any non-SOFA feature types.
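Before editing by hand, it can help to see which feature types the converted file actually contains. The sketch below tallies the third column (the type column) of a GFF3 file and reports types that are missing from a set you supply. The function names are illustrative, and the allowed set is an assumption: you would need to fill it with the SOFA terms relevant to your data, since this code does not consult the Sequence Ontology itself.

```python
from collections import Counter


def feature_type_counts(gff3_lines):
    """Count feature types (column 3) in an iterable of GFF3 lines."""
    counts = Counter()
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:  # a GFF3 feature line has nine tab-separated columns
            counts[cols[2]] += 1
    return counts


def unexpected_types(gff3_lines, allowed_types):
    """Return the sorted feature types that are not in allowed_types."""
    return sorted(t for t in feature_type_counts(gff3_lines) if t not in allowed_types)
```

A typical use would be to open the GFF3 file, pass the file object to unexpected_types together with your own set of accepted terms, and then fix only the types it reports.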
Add an Entry for Your Organism
You will need an entry for your species in the Chado organism table. If you are unsure whether this entry exists, log into your database and execute this SQL command:
<sql>
select common_name from organism;
</sql>
If you do not see your organism listed, execute a command equivalent to this:
<sql>
insert into organism (abbreviation, genus, species, common_name, organism_id) values ('H.sapiens', 'Homo', 'sapiens', 'Human', 9606);
</sql>
Substitute in the appropriate values for your own organism.
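The check-then-insert logic above can also be scripted. The following is a sketch only: it uses Python's built-in sqlite3 module with an in-memory table as a stand-in so that it runs without a database server. A real Chado installation is a PostgreSQL database with a fuller organism table, so the connection and the simplified table definition here are assumptions for illustration, as is the function name ensure_organism.

```python
import sqlite3


def ensure_organism(conn, organism_id, abbreviation, genus, species, common_name):
    """Insert the organism unless its common_name is already present.

    Returns True if a row was inserted, False if it already existed.
    Hypothetical helper; mirrors the two SQL statements shown above.
    """
    row = conn.execute(
        "select common_name from organism where common_name = ?",
        (common_name,),
    ).fetchone()
    if row is not None:
        return False
    conn.execute(
        "insert into organism (abbreviation, genus, species, common_name, organism_id) "
        "values (?, ?, ?, ?, ?)",
        (abbreviation, genus, species, common_name, organism_id),
    )
    conn.commit()
    return True


# Stand-in table containing only the columns used in this HOWTO (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table organism (organism_id integer primary key, abbreviation text, "
    "genus text, species text, common_name text)"
)
ensure_organism(conn, 9606, "H.sapiens", "Homo", "sapiens", "Human")
```

Calling ensure_organism a second time with the same common name leaves the table unchanged.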
Load the GFF3
Run the load/bin/gmod_load_gff3.pl script from the GMOD distribution:
>gmod_load_gff3.pl --organism <your org common name> --srcdb DB:genbank --gfffile <your gfffile>
This will load your data into the Chado database. Note that if any non-SOFA feature types remain in the GFF3 file, the load will fail when they are encountered. If that happens, edit the file to fix the incorrect term and load again. Only the previously unloaded data will load (i.e., you won't have duplicate rows).
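When the same incorrect term appears on many lines, fixing it by hand is tedious. The sketch below rewrites one feature type to another in the type column only, leaving comments, directives, and any trailing FASTA section untouched. The function name is illustrative, and the replacement term is your choice: the code does not verify that the new type is a valid SOFA term.

```python
def retype_features(gff3_lines, old_type, new_type):
    """Yield GFF3 lines with old_type replaced by new_type in column 3 only."""
    in_fasta = False
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            in_fasta = True  # sequence data follows; pass it through unchanged
        if in_fasta or line.startswith("#"):
            yield line
            continue
        cols = line.split("\t")
        if len(cols) == 9 and cols[2] == old_type:
            cols[2] = new_type
            line = "\t".join(cols)  # the newline stays attached to the last column
        yield line
```

One way to use it is to read the original file, write the yielded lines to a new file, and then run the loader on the new file so the original conversion output is preserved.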
More Information
Please send questions to the GMOD developers list:
gmod-devel@lists.sourceforge.net