Load BLAST Into Chado

From GMOD
Revision as of 03:33, 4 April 2007 by Dongilbert (Talk | contribs)

Jump to: navigation, search

Abstract

This HOWTO describes steps to add a BLAST analysis to a Chado database.

Have an existing Chado genome database

See Load_RefSeq_Into_Chado for example to load a GenBank Genome into a database. In the following examples, bp_scriptname is from http://Bioperl.org/, and gmod_scriptname is from http://GMOD.org//.

Convert BLAST analysis to GFF

For example, match yeast proteins to your genome with tBLASTn, and reformat to GFF v3.

 $ncbi/blastall -p tblastn -i MOD_Scer.fa -d dmel4 -o dmel4-modsc.tblastn
 

First, patch this problematic policy in Bioperl Bio::Tools::GFF.pm first, to produce Target values in GFF output:

  sub _gff3_string:
    for my $tag ( @all_tags ) {
       ### next if $tag eq 'Target';
       # ^^^ add comment to block, or change to
       next if ($origfeat->isa('Bio::SeqFeature::FeaturePair') 
         and $tag eq 'Target');

Then reformat to GFF with the Bioperl bp_search2gff.pl script. The Chado policy here is to put your program and blast query data names into the GFF --source field. The GFF --method field should always be SO term 'match_part'. You also want the --type hit and --target options.

 scripts/bp_search2gff.pl --in dmel4-modsc.tblastn \
     --out dmel4-modsc.tblastn.gff -version 3
     --format blast \ 
     --method match_part --source tBLASTn.MOD_Scer \
     --type hit --target 

Finally clean up the GFF a bit:

  perl -pi -e's/Target=Sequence:/Target=/;' dmel4-modsc.tblastn.gff

BLAST GFF sample for Chado

Result should be formatted like this:

 ##gff-version 3
 # sample tBLASTn yeast protein x fly chromosome 4 (GenBank NC_004353) matches
 # GFF formatted for loading to Chado database 
 
 NC_004353 tBLASTn.MOD_Scer  match_part  141495  141815  48.9  - 0 Target=S000003211,43,156
 
 NC_004353 tBLASTn.MOD_Scer  match_part  161699  162793  217 + 0 Target=S000005817,984,1204
 NC_004353 tBLASTn.MOD_Scer  match_part  160517  161407  185 + 0 Target=S000005817,455,980
    # this is a protein match in 2 parts: note the identical Target= and Dbxref=

Load Query Protein sequence to Chado DB

You want to have your query sequences used for BLAST, such as proteins, for reference in your Chado db. The GMOD Bulk_load_gff3 will handle this after converting sequence Fasta to GFF format.

 gmod_fasta2gff3.pl  --type protein --source SGD --fasta MOD_Scer.fa --gff MOD_Scer.fa.gff 
 gmod_bulk_load_gff3.pl --dbname dev_chado_01  --organism yeast --dbxref --gff MOD_Scer.fa.gff 


Load BLAST result GFF to Chado DB

Use the GMOD Bulk_load_gff3 script for this, indicating the input is --analysis, and the Target names are unique IDs matching above protein features.

 gmod_bulk_load_gff3.pl --dbname my_chado_01  --organism fruitfly \
   --analysis --unique_target --score raw  \
   --gff dmel4-modsc.tblastn.gff 

Note: If you have pre-loaded the database with all the protein sequence features such as the SGD:S000002445 protein, you should use 'gmod_bulk_load_gff3 --analysis --unique_target' to ensure that Target feature is linked with your Blast result.

Chado Tables Updated

Then you should see these database updates:

   Analysis table entry:
 analysis_id|name|description|program|programversion|algorithm|sourcename|.
 10|tBLASTn.MOD_Scer||tBLASTn|null||MOD_Scer|
 
   Analysisfeature table entries (1/hsp)
 analysisfeature_id|feature_id|analysis_id|rawscore|normscore|significance|identity
 21|88148|10|117|||    << 1st HSP
 22|88150|10|91.7|||
 ...
 
   Feature table entry for Hit feature_id 88148 :
 dev_chado_01b=# select * from feature where feature_id = 88148;
 feature_id|dbxref_id|organism_id|name|uniquename|residues|seqlen|md5checksum|type_id|is_analysis|is_obsolete|
 88148|76797|10|protein_match-auto88148|auto88148||||423|t|f|
 ...
 
   Featureloc entries for Hit feature_id 88148:
 featureloc_id|feature_id|srcfeature_id|fmin|is_fmin_partial|fmax|is_fmax_partial|strand|phase|residue_info|locgroup|rank
 88149|88148|88149|69|f|858|f||||0|1        << Target protein 
 88148|88148|31118|24052|f|24448|f|1|||0|0  << Source genome
   
   Featureloc entries for Target feature_id 88149:
 88149|88148|88149|69|f|858|f||||0|1  -- first HSP
 88151|88150|88149|69|f|858|f||||0|1  -- second HSP

More Information

Please send questions to the GMOD developers list:

gmod-devel@lists.sourceforge.net

Authors