Difference between revisions of "Load BLAST Into Chado"

From GMOD
Jump to: navigation, search
(Load Query Protein sequence to Chado DB)
m (Text replace - "</sql>" to "</syntaxhighlight>")
 
(16 intermediate revisions by 5 users not shown)
Line 1: Line 1:
 
=Abstract=
 
=Abstract=
  
This HOWTO describes steps to add a BLAST analysis to a [[Chado_-_Getting_Started|Chado database]].
+
This [[:Category:HOWTO|HOWTO]] describes steps to add a [[:Category:BLAST|BLAST]] analysis to a [[Chado_-_Getting_Started|Chado database]].
  
 
=Have an existing Chado genome database =
 
=Have an existing Chado genome database =
  
See [[Load_RefSeq_Into_Chado]] for example to load a GenBank Genome into a database.
+
See [[Load RefSeq Into Chado]] for advice on how to load a GenBank Genome into a database.
In the following examples, '''bp_scriptname''' is from http://Bioperl.org/, and '''gmod_scriptname''' is from http://GMOD.org//.  As of this date (2007 April) you will need current modules from the BioPerl and GMOD CVS repositories to have this example work.
+
In the following examples, '''bp_''scriptname''''' is from [[BioPerl]], and '''gmod_''scriptname''''' is from GMOD.  As of this date (2007 April) you will need current modules from the [[BioPerl]] Git and [[SVN|GMOD SVN repositories]] to have this example work.
  
=Convert BLAST analysis to GFF=
+
=Convert BLAST analysis to GFF3=
  
For example, match yeast proteins to your genome with tBLASTn, and reformat to GFF v3.
+
For example, match yeast proteins to your genome with tBLASTn, and reformat to [[GFF3]].
  
 
   $ncbi/blastall -p tblastn -i MOD_Scer.fa -d dmel4 -o dmel4-modsc.tblastn
 
   $ncbi/blastall -p tblastn -i MOD_Scer.fa -d dmel4 -o dmel4-modsc.tblastn
 
 
First, patch this problematic policy in Bioperl Bio::Tools::GFF.pm first, to produce Target values in GFF output:
 
  
  sub _gff3_string:
+
First reformat to [[GFF3]] with the [[BioPerl]] <tt>bp_search2gff.pl</tt> script.  The Chado policy here is to put your program and blast query data names into the GFF3 <tt>--source</tt> field.  The GFF3 <tt>--method</tt> field should always be SO term 'match_part'.  You also want the <tt>--type hit</tt> and <tt>--target</tt> options.
    for my $tag ( @all_tags ) {
+
        ### next if $tag eq 'Target';
+
        # ^^^ add comment to block, or change to
+
        next if ($tag eq 'Target'
+
          and $origfeat->isa('Bio::SeqFeature::FeaturePair') ;
+
 
+
Then reformat to GFF with the Bioperl bp_search2gff.pl script.  The
+
Chado policy here is to put your program and blast query data names
+
into the GFF --source field.  The GFF --method field should always be
+
SO term 'match_part'.  You also want the --type hit and --target options.
+
  
 
   scripts/bp_search2gff.pl --in dmel4-modsc.tblastn \
 
   scripts/bp_search2gff.pl --in dmel4-modsc.tblastn \
 
       --out dmel4-modsc.tblastn.gff -version 3 \
 
       --out dmel4-modsc.tblastn.gff -version 3 \
       --format blast \  
+
       --format blast \
 
       --method match_part --source tBLASTn.MOD_Scer \
 
       --method match_part --source tBLASTn.MOD_Scer \
       --type hit --target  
+
       --type hit --target
  
Finally clean up the GFF a bit:
+
Finally clean up the GFF3 a bit:
  
   perl -pi -e's/Target=Sequence:/Target=/;' dmel4-modsc.tblastn.gff
+
   perl -pi -e 's/Target=Sequence:/Target=/' dmel4-modsc.tblastn.gff
  
==BLAST GFF sample for Chado==
+
==BLAST GFF3 sample for Chado==
  
 
Result should be formatted like this:
 
Result should be formatted like this:
 
+
<pre>
 
   ##gff-version 3
 
   ##gff-version 3
 
   # sample tBLASTn yeast protein x fly chromosome 4 (GenBank NC_004353) matches
 
   # sample tBLASTn yeast protein x fly chromosome 4 (GenBank NC_004353) matches
   # GFF formatted for loading to Chado database  
+
   # GFF formatted for loading to Chado database
 
+
 
   NC_004353 tBLASTn.MOD_Scer  match_part  141495  141815  48.9  - 0 Target=S000003211,43,156
+
   NC_004353 tBLASTn.MOD_Scer  match_part  141495  141815  48.9  - 0 Target=S000003211 43 156
 
+
 
   NC_004353 tBLASTn.MOD_Scer  match_part  161699  162793  217 + 0 Target=S000005817,984,1204
+
   NC_004353 tBLASTn.MOD_Scer  match_part  161699  162793  217 + 0 Target=S000005817 984 1204
   NC_004353 tBLASTn.MOD_Scer  match_part  160517  161407  185 + 0 Target=S000005817,455,980
+
   NC_004353 tBLASTn.MOD_Scer  match_part  160517  161407  185 + 0 Target=S000005817 455 980
 
     # this is a protein match with 2 HSP parts, note the identical Target=
 
     # this is a protein match with 2 HSP parts, note the identical Target=
 +
</pre>
  
 
=Load Query Protein sequence to Chado DB=
 
=Load Query Protein sequence to Chado DB=
  
You want to have your query sequences used for BLAST,  
+
You want to have your query sequences used for BLAST, such as proteins, for reference in your [[Chado]] db.  The GMOD script <tt>gmod_bulk_load_gff3.pl</tt> will handle this after converting sequence Fasta to [[GFF3]] format.
such as proteins, for reference in your Chado db.  The
+
GMOD Bulk_load_gff3 will handle this after converting
+
sequence Fasta to GFF format.
+
  
   gmod_fasta2gff3.pl  --type protein --source SGD --fasta MOD_Scer.fa --gff MOD_Scer.fa.gff  
+
   gmod_fasta2gff3.pl  --type protein --source SGD --fasta MOD_Scer.fa --gff MOD_Scer.fa.gff
 
   gmod_bulk_load_gff3.pl --dbname my_chado_01  --organism yeast --dbxref --gff MOD_Scer.fa.gff
 
   gmod_bulk_load_gff3.pl --dbname my_chado_01  --organism yeast --dbxref --gff MOD_Scer.fa.gff
  
If your query sequence somes from say UniProt or GenBank formats,
+
If your query sequence comes from UniProt or GenBank formats, you can instead use the <tt>[[Load GFF Into Chado | bp_genbank2gff.pl]]</tt> script that will retain more useful annotations for your Chado database.  Then BLAST matches can
you can instead use the [[Load_GFF_Into_Chado | Genbank2GFF]] script that will retain more  
+
useful annotations for your Chado database.  Then BLAST matches can
+
 
be linked to many known gene/protein attributes.
 
be linked to many known gene/protein attributes.
  
=Load BLAST result GFF to Chado DB=
+
=Load BLAST result GFF3 to Chado DB=
  
Use the GMOD Bulk_load_gff3 script for this, indicating the
+
Use the <tt>gmod_bulk_load_gff3.pl</tt> script for this, indicating the
input is --analysis, and the Target names are unique IDs matching
+
input is <tt>--analysis</tt>, and the Target names are unique IDs matching
 
above protein features.
 
above protein features.
  
 
   gmod_bulk_load_gff3.pl --dbname my_chado_01  --organism fruitfly \
 
   gmod_bulk_load_gff3.pl --dbname my_chado_01  --organism fruitfly \
 
     --analysis --unique_target --score raw  \
 
     --analysis --unique_target --score raw  \
     --gff dmel4-modsc.tblastn.gff  
+
     --gff dmel4-modsc.tblastn.gff
  
 
Note: If you have pre-loaded the database with all the protein
 
Note: If you have pre-loaded the database with all the protein
 
sequence features such as the SGD:S000002445 protein, you should use
 
sequence features such as the SGD:S000002445 protein, you should use
'gmod_bulk_load_gff3 --analysis --unique_target' to ensure that Target
+
gmod_bulk_load_gff3 --analysis --unique_target
feature is linked with your Blast result.
+
to ensure that Target feature is linked with your Blast result.
  
 
==Chado Tables Updated==
 
==Chado Tables Updated==
Line 86: Line 70:
 
Then you should see these database updates:
 
Then you should see these database updates:
  
    Analysis table entry of the [[Chado_Companalysis_Module]]:
+
Analysis table entry of the [[Chado_Companalysis_Module]]:
  analysis_id|name|description|program|programversion|algorithm|sourcename|.
+
 
  10|tBLASTn.MOD_Scer||tBLASTn|null||MOD_Scer|
+
{| class="wikitable"
 
+
!analysis_id
    Analysisfeature table entries (1/hsp)
+
!name
  analysisfeature_id|feature_id|analysis_id|rawscore|normscore|significance|identity
+
!description
  21|88148|10|117|||   << 1st HSP
+
!program
  22|88150|10|91.7|||
+
!programversion
  ...
+
!algorithm
 
+
!sourcename
    Feature table entry for Hit feature_id 88148 :
+
|-
  dev_chado_01b=# select * from feature where feature_id = 88148;
+
10
  feature_id|dbxref_id|organism_id|name|uniquename|residues|seqlen|md5checksum|type_id|is_analysis|is_obsolete|
+
|tBLASTn.MOD_Scer
  88148|76797|10|protein_match-auto88148|auto88148||||423|t|f|
+
|
  ...
+
|tBLASTn
 
+
|null
    Featureloc entries for Hit feature_id 88148:
+
|
  featureloc_id|feature_id|srcfeature_id|fmin|is_fmin_partial|fmax|is_fmax_partial|strand|phase|residue_info|locgroup|rank
+
|MOD_Scer
  88149|88148|88149|69|f|858|f||||0|1       << Target protein  
+
|}
  88148|88148|31118|24052|f|24448|f|1|||0|0 << Source genome
+
 
   
+
Analysisfeature table entries (1/hsp)
    Featureloc entries for Target feature_id 88149:
+
{| class="wikitable"
  88149|88148|88149|69|f|858|f||||0|1  -- first HSP
+
!analysisfeature_id
  88151|88150|88149|69|f|858|f||||0|1 -- second HSP
+
!feature_id
 +
!analysis_id
 +
!rawscore
 +
!normscore
 +
!significance
 +
!identity
 +
|
 +
|-
 +
21
 +
|88148
 +
|10
 +
|117
 +
|
 +
|
 +
|
 +
|&larr; 1st HSP
 +
|-
 +
22
 +
|88150
 +
|10
 +
|91.7
 +
|
 +
|
 +
|
 +
|-
 +
|colspan="7"|  ...
 +
|}
 +
 
 +
Feature table entry for Hit feature_id 88148 :
 +
<syntaxhighlight lang="sql">
 +
select * from feature where feature_id = 88148;
 +
</syntaxhighlight>
 +
{| class="wikitable"
 +
! feature_id
 +
!dbxref_id
 +
!organism_id
 +
!name
 +
!uniquename
 +
!residues
 +
!seqlen
 +
!md5checksum
 +
!type_id
 +
!is_analysis
 +
!is_obsolete
 +
|-
 +
88148
 +
|76797
 +
|10
 +
|protein_match-auto88148
 +
|auto88148
 +
|
 +
|
 +
|
 +
|423
 +
|t
 +
|f
 +
|-
 +
| colspan="11" | ...
 +
|}
 +
 
 +
Featureloc entries for Hit feature_id 88148:
 +
{| class="wikitable"
 +
!featureloc_id
 +
!feature_id
 +
!srcfeature_id
 +
!fmin
 +
!is_fmin_partial
 +
!fmax
 +
!is_fmax_partial
 +
!strand
 +
!phase
 +
!residue_info
 +
!locgroup
 +
!rank
 +
|-
 +
88149
 +
|88148
 +
|88149
 +
|69
 +
|f
 +
|858
 +
|f
 +
|
 +
|
 +
|
 +
|0
 +
|1
 +
|&larr; Target protein
 +
|-
 +
88148
 +
|88148
 +
|31118
 +
|24052
 +
|f
 +
|24448
 +
|f
 +
|1
 +
|
 +
|
 +
|0
 +
|0
 +
| &larr; Source genome
 +
|}
 +
 
 +
Featureloc entries for Target feature_id 88149:
 +
{| class="wikitable"
 +
!featureloc_id
 +
!feature_id
 +
!srcfeature_id
 +
!fmin
 +
!is_fmin_partial
 +
!fmax
 +
!is_fmax_partial
 +
!strand
 +
!phase
 +
!residue_info
 +
!locgroup
 +
!rank
 +
|-
 +
88149
 +
|88148
 +
|88149
 +
|69
 +
|f
 +
|858
 +
|f
 +
|
 +
|
 +
|
 +
|0
 +
|1
 +
| &larr; first HSP
 +
|-
 +
88151
 +
|88150
 +
|88149
 +
|69
 +
|f
 +
|858
 +
|f
 +
|
 +
|
 +
|
 +
|0
 +
|1
 +
| &larr; second HSP
 +
|}
 +
 
 +
''I believe the ranks shown in the featureloc example above may not be correct or at least are misleading; see the [[Chado_Best_Practices#Results_from_BLAST|Chado Best Practices section for handling BLAST results]] for clarification'' [[User:Scott|Scott]] 19:18, 21 November 2008 (UTC)
  
 
=See also=
 
=See also=
Line 121: Line 253:
  
 
=Authors=
 
=Authors=
+
 
 
* [[User:Dongilbert|Dongilbert]] 23:24, 3 April 2007 (EDT)
 
* [[User:Dongilbert|Dongilbert]] 23:24, 3 April 2007 (EDT)
  
[[Category:HOWTO]]
+
[[Category:BLAST]]
 
[[Category:Chado]]
 
[[Category:Chado]]
 +
[[Category:HOWTO]]

Latest revision as of 23:34, 8 October 2012

Abstract

This HOWTO describes steps to add a BLAST analysis to a Chado database.

Have an existing Chado genome database

See Load RefSeq Into Chado for advice on how to load a GenBank Genome into a database. In the following examples, bp_scriptname is from BioPerl, and gmod_scriptname is from GMOD. As of this date (2007 April) you will need current modules from the BioPerl Git and GMOD SVN repositories to have this example work.

Convert BLAST analysis to GFF3

For example, match yeast proteins to your genome with tBLASTn, and reformat to GFF3.

 $ncbi/blastall -p tblastn -i MOD_Scer.fa -d dmel4 -o dmel4-modsc.tblastn

First reformat to GFF3 with the BioPerl bp_search2gff.pl script. The Chado policy here is to put your program and blast query data names into the GFF3 --source field. The GFF3 --method field should always be SO term 'match_part'. You also want the --type hit and --target options.

 scripts/bp_search2gff.pl --in dmel4-modsc.tblastn \
     --out dmel4-modsc.tblastn.gff -version 3 \
     --format blast \
     --method match_part --source tBLASTn.MOD_Scer \
     --type hit --target

Finally clean up the GFF3 a bit:

  perl -pi -e 's/Target=Sequence:/Target=/' dmel4-modsc.tblastn.gff

BLAST GFF3 sample for Chado

Result should be formatted like this:

  ##gff-version 3
  # sample tBLASTn yeast protein x fly chromosome 4 (GenBank NC_004353) matches
  # GFF formatted for loading to Chado database

  NC_004353 tBLASTn.MOD_Scer  match_part  141495  141815  48.9  - 0 Target=S000003211 43 156

  NC_004353 tBLASTn.MOD_Scer  match_part  161699  162793  217 + 0 Target=S000005817 984 1204
  NC_004353 tBLASTn.MOD_Scer  match_part  160517  161407  185 + 0 Target=S000005817 455 980
     # this is a protein match with 2 HSP parts, note the identical Target=

Load Query Protein sequence to Chado DB

You want to have your query sequences used for BLAST, such as proteins, for reference in your Chado db. The GMOD script gmod_bulk_load_gff3.pl will handle this after converting sequence Fasta to GFF3 format.

 gmod_fasta2gff3.pl  --type protein --source SGD --fasta MOD_Scer.fa --gff MOD_Scer.fa.gff
 gmod_bulk_load_gff3.pl --dbname my_chado_01  --organism yeast --dbxref --gff MOD_Scer.fa.gff

If your query sequence comes from UniProt or GenBank formats, you can instead use the bp_genbank2gff.pl script that will retain more useful annotations for your Chado database. Then BLAST matches can be linked to many known gene/protein attributes.

Load BLAST result GFF3 to Chado DB

Use the gmod_bulk_load_gff3.pl script for this, indicating the input is --analysis, and the Target names are unique IDs matching above protein features.

 gmod_bulk_load_gff3.pl --dbname my_chado_01  --organism fruitfly \
   --analysis --unique_target --score raw  \
   --gff dmel4-modsc.tblastn.gff

Note: If you have pre-loaded the database with all the protein sequence features such as the SGD:S000002445 protein, you should use

gmod_bulk_load_gff3 --analysis --unique_target

to ensure that Target feature is linked with your Blast result.

Chado Tables Updated

Then you should see these database updates:

Analysis table entry of the Chado_Companalysis_Module:

analysis_id name description program programversion algorithm sourcename
10 tBLASTn.MOD_Scer tBLASTn null MOD_Scer

Analysisfeature table entries (1/hsp)

analysisfeature_id feature_id analysis_id rawscore normscore significance identity
21 88148 10 117 ← 1st HSP
22 88150 10 91.7
...

Feature table entry for Hit feature_id 88148 :

 SELECT * FROM feature WHERE feature_id = 88148;
feature_id dbxref_id organism_id name uniquename residues seqlen md5checksum type_id is_analysis is_obsolete
88148 76797 10 protein_match-auto88148 auto88148 423 t f
...

Featureloc entries for Hit feature_id 88148:

featureloc_id feature_id srcfeature_id fmin is_fmin_partial fmax is_fmax_partial strand phase residue_info locgroup rank
88149 88148 88149 69 f 858 f 0 1 ← Target protein
88148 88148 31118 24052 f 24448 f 1 0 0 ← Source genome

Featureloc entries for Target feature_id 88149:

featureloc_id feature_id srcfeature_id fmin is_fmin_partial fmax is_fmax_partial strand phase residue_info locgroup rank
88149 88148 88149 69 f 858 f 0 1 ← first HSP
88151 88150 88149 69 f 858 f 0 1 ← second HSP

I believe the ranks shown in the featureloc example above may not be correct or at least are misleading; see the Chado Best Practices section for handling BLAST results for clarification Scott 19:18, 21 November 2008 (UTC)

See also

More Information

Please send questions to the GMOD developers list:

gmod-devel@lists.sourceforge.net

Authors