Difference between revisions of "MotifFinder.pm"

From GMOD

Latest revision as of 15:00, 11 July 2016

MotifFinder.pm is a GBrowse plugin written by Xiaoqi Shi. It finds sequence-specific motifs using position weight matrices (PWMs) and displays the results graphically as tracks in the genome browser.

Please feel free to contact the author (xshi@oicr.on.ca) for help or more information. For background reading on position weight matrices, see http://en.wikipedia.org/wiki/Position-specific_scoring_matrix

Download the plugin

The plugin is installed on the WormBase and modENCODE GBrowse instances; you may access it directly here: http://www.wormbase.org/db/gb2/gbrowse or http://modencode.oicr.on.ca/

If you want to install it on your own GBrowse, please contact the author for the source code and then follow the instructions below:

  • Save both 'motiffinder' and 'MotifFinder.pm' in the GBrowse plugin directory (set their permissions to executable).
  • Save 'matrices.txt' (an example of the PFM tables) in the GBrowse conf directory.
  • Include "MotifFinder" in your main GBrowse.conf.
  • Specify the matrix file name in your species *.conf:
        [MotifFinder:plugin]
        matrix = matrices.txt

Then you should be able to run the plugin!

How to use the MotifFinder plugin

Access The Plugin

  • From the GBrowse main page, the MotifFinder plugin, as well as other installed plugins, can be accessed via the upper right menu.
  • In GBrowse, navigate to the genomic region you are interested in, then select 'Annotate Sequence Motif' from the menu and click 'Configure'.

Select.png


MotifFinder Parameters

  • Reasonable default options are provided for each parameter.
  • Threshold: a cutoff score between 0.8 and 1 is recommended.
  • Background Probability: should be entered in A C G T order.
  • Indel Size: currently only small indels (length under 6) can be handled.

Parameter.png


Position Frequency Matrices

Existing PFMs are loaded from the file 'matrices.txt' in the GBrowse configuration directory; they are mostly curated PFMs from existing publications.

A list of all the available PFMs from WormBase is at http://www.wormbase.org/db/seq/position_matrix?list=all

You can also add your own PFMs in the "Paste PFMs Here" toggle section, in FASTA-like format with rows arranged in A C G T order, e.g.:

 >name of the matrix (the '>' sign is required!)
 0       1       1       1       1       23      0       0       1       7       0       0       19
 10      18      1       13      14      2       20      0       17      0       7       16      0
 2       4       24      1       0       0       0       26      8       2       0       10      7
 14      3       0       11      11      1       6       0       0       17      19      0       0
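
To make the expected layout concrete, here is a minimal Python sketch of how such pasted text could be read. It is not the plugin's own parser (the plugin is written in Perl); the function names `parse_pfms` and `_store` and the error messages are assumptions made for illustration. It only relies on what is stated above: a required '>' header line followed by four whitespace-separated rows in A C G T order.

```python
def _store(matrices, name, rows):
    """Validate one matrix (4 equal-length rows) and store it under its name."""
    if len(rows) != 4:
        raise ValueError(f"{name}: expected 4 rows (A C G T), got {len(rows)}")
    if len({len(r) for r in rows}) != 1:
        raise ValueError(f"{name}: rows have different lengths")
    matrices[name] = dict(zip("ACGT", rows))


def parse_pfms(text):
    """Parse FASTA-like PFM text into {name: {"A": [...], "C": [...], ...}}."""
    matrices = {}
    name, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                _store(matrices, name, rows)  # finish the previous matrix
            name, rows = line[1:].strip(), []
        elif name is None:
            raise ValueError("matrix rows found before a '>' header line")
        else:
            rows.append([float(x) for x in line.split()])
    if name is not None:
        _store(matrices, name, rows)  # finish the last matrix
    return matrices


example = """
>example matrix
0  1  1  1  1  23 0  0  1  7  0  0  19
10 18 1  13 14 2  20 0  17 0  7  16 0
2  4  24 1  0  0  0  26 8  2  0  10 7
14 3  0  11 11 1  6  0  0  17 19 0  0
"""
pfms = parse_pfms(example)
```

Each parsed matrix maps a base to its row of counts, so `pfms["example matrix"]["G"]` holds the third row of the example.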

Indel Detection

Users can search for sequence motifs that contain indels up to a certain length. This feature has not been fully tested and awaits further improvement.

Graphical Presentation

  • Each matching motif is displayed as a glyph box on the track.
  • The arrow on the box indicates the strand.
  • Hovering the mouse over a glyph shows the computed similarity score and the start/stop positions of the motif.
Display.png

How is the motif predicted?

The problem is to find occurrences of known patterns (represented by position matrices) in new sequences.

Calculate Weight Score

The scoring function is the same as the one used by the TFBS Perl modules (http://tfbs.genereg.net/) developed at the University of Bergen.

 w = log2 ( ( f + sqrt(N) * p ) / ( N + sqrt(N) ) / 0.25 )

For example, given the following PFM from TRANSFAC 7.0:

   A 1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
   C 8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
   G 2 1 12 0 0 0 0 1 2 0 0 0 0 2 3
   T 2 0 0 13 13 13 0 4 3 12 13 10 5 2 1

w - the weight for the current nucleotide at the current position

f - the number of occurrences of the current nucleotide in the current column (e.g., 1 for A in column 1, 8 for C in column 1, etc.)

N - the total number of observations, i.e. the sum of all nucleotide occurrences in a column (13 for column 1 in this example)

p - the prior (background) frequency of the current nucleotide; it usually defaults to 0.25 (i.e., one nucleotide out of four)
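
As a worked illustration of the formula, the following Python sketch converts the example PFM into weights column by column. It is not the plugin's code; the function name `pfm_to_pwm` is made up for this example. One assumption to note: the sketch divides by the background frequency p of each base rather than by the literal 0.25, which gives exactly the formula above when the default background of 0.25 is used.

```python
import math

# Example PFM from above, rows in A C G T order.
PFM = {
    "A": [1, 12, 0, 0, 0, 0, 0, 7, 1, 1, 0, 0, 0, 2, 1],
    "C": [8, 0, 0, 0, 0, 0, 13, 1, 7, 0, 0, 3, 8, 7, 8],
    "G": [2, 1, 12, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 2, 3],
    "T": [2, 0, 0, 13, 13, 13, 0, 4, 3, 12, 13, 10, 5, 2, 1],
}


def pfm_to_pwm(pfm, background=None):
    """Apply w = log2((f + sqrt(N) * p) / (N + sqrt(N)) / p) to every cell.

    Assumption: the final division uses the per-base background p;
    with the default p = 0.25 this matches the formula in the text.
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    length = len(pfm["A"])
    pwm = {base: [] for base in "ACGT"}
    for i in range(length):
        n = sum(pfm[base][i] for base in "ACGT")  # N for this column
        root = math.sqrt(n)
        for base in "ACGT":
            f = pfm[base][i]
            p = background[base]
            pwm[base].append(math.log2((f + root * p) / (n + root) / p))
    return pwm


pwm = pfm_to_pwm(PFM)
# Column 1: A is rare (f = 1), so its weight is negative;
# C is common (f = 8), so its weight is positive.
print(round(pwm["A"][0], 3), round(pwm["C"][0], 3))
```

The sqrt(N) * p term acts as a pseudocount, so a base that never occurs in a column (f = 0) still receives a finite, strongly negative weight instead of log2(0).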

Algorithms

  • Backtrack: use a recursive function to build all possible motifs, terminating the recursion as soon as the required intermediate score is not reached.
  • Brute-Force: calculate the similarity score across the whole region using a sliding window of the motif's size.

The program combines the two strategies, choosing between them according to the motif length and the cutoff score, to run faster.
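
The two strategies can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's implementation: the 3-column weight matrix is hypothetical, all function names are made up, only the forward strand is scanned, and the threshold is assumed to be a relative score, i.e. a fraction of the range between the lowest and highest possible matrix score.

```python
BASES = "ACGT"
EPS = 1e-9  # tolerance for floating-point comparisons

# Hypothetical 3-column weight matrix (rows A C G T), for illustration only.
PWM = {
    "A": [1.2, -1.0, -2.0],
    "C": [-1.5, 1.4, -0.5],
    "G": [-0.8, -1.2, 1.6],
    "T": [-2.0, 0.3, -1.0],
}


def raw_cutoff(pwm, threshold):
    """Turn a relative threshold (0..1) into a raw score cutoff (assumed convention)."""
    length = len(pwm["A"])
    lo = sum(min(pwm[b][i] for b in BASES) for i in range(length))
    hi = sum(max(pwm[b][i] for b in BASES) for i in range(length))
    return lo + threshold * (hi - lo)


def brute_force(seq, pwm, threshold):
    """Score every window of motif length and keep those at or above the cutoff."""
    length = len(pwm["A"])
    cutoff = raw_cutoff(pwm, threshold)
    hits = []
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        if any(ch not in BASES for ch in window):
            continue  # skip windows containing N or other symbols
        score = sum(pwm[ch][i] for i, ch in enumerate(window))
        if score >= cutoff - EPS:
            hits.append((start, window))
    return hits


def enumerate_motifs(pwm, threshold):
    """Recursively build every motif that can reach the cutoff, pruning early."""
    length = len(pwm["A"])
    cutoff = raw_cutoff(pwm, threshold)
    # best_rest[i]: best score still obtainable from column i to the end
    best_rest = [0.0] * (length + 1)
    for i in range(length - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(pwm[b][i] for b in BASES)
    motifs = set()

    def extend(prefix, score):
        i = len(prefix)
        if score + best_rest[i] < cutoff - EPS:
            return  # even a perfect remainder cannot reach the cutoff
        if i == length:
            motifs.add(prefix)
            return
        for b in BASES:
            extend(prefix + b, score + pwm[b][i])

    extend("", 0.0)
    return motifs


def backtrack(seq, pwm, threshold):
    """Report every window whose sequence is one of the enumerated motifs."""
    length = len(pwm["A"])
    motifs = enumerate_motifs(pwm, threshold)
    return [(s, seq[s:s + length])
            for s in range(len(seq) - length + 1)
            if seq[s:s + length] in motifs]


seq = "TTACGGATGCACC"
print(brute_force(seq, PWM, 0.8))
print(backtrack(seq, PWM, 0.8))
```

Both functions report the same hits. Enumeration can pay off when the motif is short or the cutoff is high, because few candidate motifs survive pruning and each window then needs only a set lookup; the number of candidates can grow as 4^length, so the sliding-window scan is the safer choice for long motifs or low cutoffs.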