Difference between revisions of "MotifFinder.pm"

From GMOD

Revision as of 23:37, 5 April 2010

MotifFinder.pm is a GBrowse plugin written by Xiaoqi Shi. It finds sequence-specific motifs using position weight matrices (PWMs) and displays the results graphically as tracks in the genome browser. Please feel free to contact the author for help or more information.


How to use MotifFinder plugin

MotifFinder parameters

  • Reasonable default options are provided for each parameter.
  • Threshold: a cutoff score between 0.8 and 1 is recommended.
  • Background Probability: should be entered in (A C G T) order.
  • Indel Size: currently only small indels (length under 6) can be handled.

Position Frequency Matrices

Existing PFMs are loaded from the file 'matrices.txt' in the GBrowse configuration directory; they are mostly curated PFMs from existing publications.

Click here for a list of all the available PFMs from WormBase

However, you can also add your own PFMs to the toggle section "Paste PFMs Here" in FASTA format (arrange rows in A C G T order).

 e.g.
 >name of the matrix
 0       1       1       1       1       23      0       0       1       7       0       0       19
 10      18      1       13      14      2       20      0       17      0       7       16      0
 2       4       24      1       0       0       0       26      8       2       0       10      7
 14      3       0       11      11      1       6       0       0       17      19      0       0
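
The format above can be read with a few lines of code. The following is a minimal Python sketch, not the plugin's own Perl parser; the function name parse_pfms is hypothetical, and it assumes each matrix has a '>' header line followed by exactly four whitespace-separated rows in A C G T order.

```python
def parse_pfms(text):
    """Parse FASTA-style PFM text into {name: {"A": [...], "C": [...], "G": [...], "T": [...]}}.

    Assumption: every matrix is a '>' header followed by four rows in A C G T order.
    """
    raw = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            raw[name] = []
        elif name is not None:
            raw[name].append([float(x) for x in line.split()])

    matrices = {}
    for name, rows in raw.items():
        # Validate shape: four rows (A, C, G, T), all the same width.
        if len(rows) != 4 or len({len(r) for r in rows}) != 1:
            raise ValueError(f"matrix {name!r} needs 4 rows of equal length")
        matrices[name] = dict(zip("ACGT", rows))
    return matrices
```

Counts are read as floats so that matrices holding frequencies rather than integer counts are accepted as well.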

Indel detection

Users can search for sequence motifs that contain indels up to a certain length. This feature has not been fully tested and depends on future improvement.

How is the motif predicted?

The problem is to find occurrences of known patterns (each represented by a position matrix) in new sequences.
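
As an illustration of this kind of search, here is a minimal Python sketch of log-odds scanning. It is not the plugin's actual code. The function names, the pseudocount, the base-2 logarithm, and the rescaling of scores to the 0 to 1 range (chosen so that a threshold such as 0.8 is meaningful) are all assumptions.

```python
import math

def pfm_to_pwm(pfm, background=(0.25, 0.25, 0.25, 0.25), pseudocount=1.0):
    """Convert a count matrix (rows in A C G T order) into log2-odds weights.

    Assumption: a background-weighted pseudocount avoids log(0).
    """
    pwm = []
    for j in range(len(pfm[0])):
        total = sum(row[j] for row in pfm) + pseudocount
        column = {}
        for base, row, bg in zip("ACGT", pfm, background):
            p = (row[j] + pseudocount * bg) / total
            column[base] = math.log2(p / bg)
        pwm.append(column)
    return pwm

def scan(sequence, pwm, threshold=0.8):
    """Return (start, normalized_score) for every window scoring >= threshold."""
    width = len(pwm)
    max_score = sum(max(col.values()) for col in pwm)
    min_score = sum(min(col.values()) for col in pwm)
    span = (max_score - min_score) or 1.0  # guard against a flat matrix
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width].upper()
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other symbols
        raw = sum(col[base] for col, base in zip(pwm, window))
        normalized = (raw - min_score) / span
        if normalized >= threshold:
            hits.append((start, normalized))
    return hits
```

With a matrix whose consensus is ACG, scan("TTACGTT", pwm, threshold=0.99) reports a single hit starting at offset 2.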

Equations: The PWM score of a substring corresponds to the log-odds of the substring being generated by the motif versus being generated by the background. Scoring usually stops at this step; however, after reading a few papers I found information content (IC) to be an important component here. The self-information of observing a particular symbol at a particular position of the motif is:

   -\log(p_{i,j})

The expected (average) self-information of a particular element in the PWM is then:

   -p_{i,j} \log(p_{i,j})

Finally, the IC of the PWM is the sum of the expected self-information of every element:

   -\sum_{i,j} p_{i,j} \log(p_{i,j})

Multiplying the frequencies by the information vector leads to a higher acceptance of mismatches in less conserved regions, whereas mismatches in highly conserved regions are strongly discouraged. This leads to better performance in the recognition of TF binding sites.

Detection with small indels.
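
The quantities described above can be computed directly from a count matrix. The following Python sketch is illustrative only: the function names are hypothetical, the logarithm is assumed to be base 2, and a term with probability 0 is assumed to contribute 0.

```python
import math

def column_probabilities(pfm):
    """Turn counts (rows in A C G T order) into per-position probabilities p_ij."""
    columns = []
    for j in range(len(pfm[0])):
        total = sum(row[j] for row in pfm)
        columns.append([row[j] / total for row in pfm])
    return columns

def expected_self_information(p):
    """-p * log2(p); a zero probability contributes nothing (assumption)."""
    return -p * math.log2(p) if p > 0 else 0.0

def pwm_information(pfm):
    """Sum of the expected self-information over every element of the matrix."""
    return sum(
        expected_self_information(p)
        for column in column_probabilities(pfm)
        for p in column
    )
```

Under this definition a fully conserved column contributes 0 and a column split evenly between two bases contributes 1.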



A sliding window of variable size and overlap is used to calculate the spectrogram, which is displayed graphically as a track in the genome browser. Each window is a subsegment of DNA and corresponds to a 'column' in the graphical display of the spectrogram. The window slides along the sequence, from left to right, at a set increment, which corresponds to the column width.
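
The windowing step can be sketched as follows. This is a minimal Python illustration, not the plugin's Perl code; the function name and the decision to drop a trailing partial window are assumptions.

```python
def sliding_windows(sequence, window_size, increment):
    """Yield (start, subsequence) for each full window, moving left to right.

    Assumption: a final window shorter than window_size is discarded.
    """
    if window_size <= 0 or increment <= 0:
        raise ValueError("window_size and increment must be positive")
    for start in range(0, len(sequence) - window_size + 1, increment):
        yield start, sequence[start:start + window_size]
```

Each yielded window corresponds to one column of the display; an increment smaller than the window size produces overlapping windows.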

The spectrogram refers collectively to all of the rows and columns seen in the graphical display.

The spectrogram has n rows, where n is the number of bases in the window. Each row corresponds to a discrete 'frequency' from 0 -> n-1.

An arguably more intuitive way to relate this to DNA sequence is to calculate the 'period' (n/frequency*2). If we see a feature in the spectrogram at period x, there is a non-random structure with a periodicity of x nucleotides. The chief example of this would be coding DNA at period 3.

The DNA sequence is converted to a digital signal by creating four binary indicator sequences:

          G A T C C T C T G A T T C C A A
        G 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
        A 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1
        T 0 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0
        C 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0
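
The table above can be reproduced with a short Python sketch; the function name is hypothetical and the code is an illustration rather than the plugin's implementation.

```python
def indicator_sequences(sequence):
    """Return one 0/1 list per base, marking where that base occurs."""
    sequence = sequence.upper()
    return {
        base: [1 if symbol == base else 0 for symbol in sequence]
        for base in "GATC"
    }
```

Calling indicator_sequences("GATCCTCTGATTCCAA") yields the four rows shown in the table, keyed by base.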


The magnitude of the discrete Fourier transform (DFT) is calculated separately for each of the four indicator sequences. The algorithm used is the fast Fourier transform (FFT; via Math::FFT), which is much faster than a direct DFT computation but is limited in that only powers of two (128, 256, 512, etc.) can be used for window sizes. The FFT is necessary to make the spectrogram calculation fast enough for real-time use.
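
To show why the window size must be a power of two, here is a minimal pure-Python radix-2 FFT. It is a textbook recursive sketch standing in for Math::FFT, not the plugin's code, and the function names are hypothetical.

```python
import cmath

def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(values)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])  # transform of even-indexed samples
    odd = fft(values[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def magnitudes(indicator):
    """Magnitude of each DFT coefficient of one binary indicator sequence."""
    return [abs(coefficient) for coefficient in fft(indicator)]
```

The halving step only works when every sub-sequence can itself be split in two, which is where the power-of-two restriction comes from.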

For graphical rendering, each transformed sequence is assigned a color (A=blue; T=red; C=green; G=yellow). The colors for each base are superimposed on the image. In a given spot on the spectrogram, the brightness corresponds to the magnitude (signal intensity) and the color corresponds to the dominant base at that frequency/period. If no single base predominates, an intermediate color is calculated based on the relative magnitudes.
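
One possible way to derive such a color is sketched below in Python. The exact blending rule used by the plugin is not stated here, so the magnitude-weighted average, the brightness scaling, and all names in this sketch are assumptions.

```python
# RGB values for the base colors named above.
BASE_COLORS = {
    "A": (0, 0, 255),    # blue
    "T": (255, 0, 0),    # red
    "C": (0, 255, 0),    # green
    "G": (255, 255, 0),  # yellow
}

def blend_color(magnitudes, max_magnitude):
    """Mix the base colors by relative magnitude, scaled by signal intensity.

    Assumption: brightness is the largest magnitude divided by max_magnitude.
    """
    total = sum(magnitudes.values())
    if total == 0 or max_magnitude <= 0:
        return (0, 0, 0)  # no signal: black
    brightness = min(max(magnitudes.values()) / max_magnitude, 1.0)
    rgb = []
    for channel in range(3):
        mixed = sum(
            magnitudes[base] * BASE_COLORS[base][channel] for base in BASE_COLORS
        ) / total
        rgb.append(round(mixed * brightness))
    return tuple(rgb)
```

A spot where only A has signal comes out pure blue, while equal A and T magnitudes give a purple whose brightness depends on the signal strength.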

The spectrogram is visible as a track in the generic genome browser. Please note that the calculations and graphical rendering are computationally intensive, so the image will take a while to load, especially with larger sequence regions and/or small increments for the sliding window.

After you have launched this plugin, the spectrogram will continue to be calculated in the main gbrowse display until you turn off the 'Spectrogram' track.

The plugin was written by Sheldon McKay (mckays@cshl.edu)