Load BLAST Into Chado

From GMOD
Revision as of 03:35, 4 April 2007 by Dongilbert (Talk | contribs)

Jump to: navigation, search

Abstract

This HOWTO describes steps to add a BLAST analysis to a Chado database.

Have an existing Chado genome database

See Load_RefSeq_Into_Chado for example to load a GenBank Genome into a database. In the following examples, bp_scriptname is from http://Bioperl.org/, and gmod_scriptname is from http://GMOD.org//. As of this date (2007 April) you will need current modules from the BioPerl and GMOD CVS repositories to have this example work.

Convert BLAST analysis to GFF

For example, match yeast proteins to your genome with tBLASTn, and reformat to GFF v3.

 $ncbi/blastall -p tblastn -i MOD_Scer.fa -d dmel4 -o dmel4-modsc.tblastn
 

First, patch this problematic policy in Bioperl Bio::Tools::GFF.pm first, to produce Target values in GFF output:

  sub _gff3_string:
    for my $tag ( @all_tags ) {
       ### next if $tag eq 'Target';
       # ^^^ add comment to block, or change to
       next if ($origfeat->isa('Bio::SeqFeature::FeaturePair') 
         and $tag eq 'Target');

Then reformat to GFF with the Bioperl bp_search2gff.pl script. The Chado policy here is to put your program and blast query data names into the GFF --source field. The GFF --method field should always be SO term 'match_part'. You also want the --type hit and --target options.

 scripts/bp_search2gff.pl --in dmel4-modsc.tblastn \
     --out dmel4-modsc.tblastn.gff -version 3
     --format blast \ 
     --method match_part --source tBLASTn.MOD_Scer \
     --type hit --target 

Finally clean up the GFF a bit:

  perl -pi -e's/Target=Sequence:/Target=/;' dmel4-modsc.tblastn.gff

BLAST GFF sample for Chado

Result should be formatted like this:

 ##gff-version 3
 # sample tBLASTn yeast protein x fly chromosome 4 (GenBank NC_004353) matches
 # GFF formatted for loading to Chado database 
 
 NC_004353 tBLASTn.MOD_Scer  match_part  141495  141815  48.9  - 0 Target=S000003211,43,156
 
 NC_004353 tBLASTn.MOD_Scer  match_part  161699  162793  217 + 0 Target=S000005817,984,1204
 NC_004353 tBLASTn.MOD_Scer  match_part  160517  161407  185 + 0 Target=S000005817,455,980
    # this is a protein match in 2 parts: note the identical Target= and Dbxref=

Load Query Protein sequence to Chado DB

You want to have your query sequences used for BLAST, such as proteins, for reference in your Chado db. The GMOD Bulk_load_gff3 will handle this after converting sequence Fasta to GFF format.

 gmod_fasta2gff3.pl  --type protein --source SGD --fasta MOD_Scer.fa --gff MOD_Scer.fa.gff 
 gmod_bulk_load_gff3.pl --dbname dev_chado_01  --organism yeast --dbxref --gff MOD_Scer.fa.gff 


Load BLAST result GFF to Chado DB

Use the GMOD Bulk_load_gff3 script for this, indicating the input is --analysis, and the Target names are unique IDs matching above protein features.

 gmod_bulk_load_gff3.pl --dbname my_chado_01  --organism fruitfly \
   --analysis --unique_target --score raw  \
   --gff dmel4-modsc.tblastn.gff 

Note: If you have pre-loaded the database with all the protein sequence features such as the SGD:S000002445 protein, you should use 'gmod_bulk_load_gff3 --analysis --unique_target' to ensure that Target feature is linked with your Blast result.

Chado Tables Updated

Then you should see these database updates:

   Analysis table entry:
 analysis_id|name|description|program|programversion|algorithm|sourcename|.
 10|tBLASTn.MOD_Scer||tBLASTn|null||MOD_Scer|
 
   Analysisfeature table entries (1/hsp)
 analysisfeature_id|feature_id|analysis_id|rawscore|normscore|significance|identity
 21|88148|10|117|||    << 1st HSP
 22|88150|10|91.7|||
 ...
 
   Feature table entry for Hit feature_id 88148 :
 dev_chado_01b=# select * from feature where feature_id = 88148;
 feature_id|dbxref_id|organism_id|name|uniquename|residues|seqlen|md5checksum|type_id|is_analysis|is_obsolete|
 88148|76797|10|protein_match-auto88148|auto88148||||423|t|f|
 ...
 
   Featureloc entries for Hit feature_id 88148:
 featureloc_id|feature_id|srcfeature_id|fmin|is_fmin_partial|fmax|is_fmax_partial|strand|phase|residue_info|locgroup|rank
 88149|88148|88149|69|f|858|f||||0|1        << Target protein 
 88148|88148|31118|24052|f|24448|f|1|||0|0  << Source genome
   
   Featureloc entries for Target feature_id 88149:
 88149|88148|88149|69|f|858|f||||0|1  -- first HSP
 88151|88150|88149|69|f|858|f||||0|1  -- second HSP

More Information

Please send questions to the GMOD developers list:

gmod-devel@lists.sourceforge.net

Authors