SOBA Tutorial 2012

Jump to: navigation, search

This SOBA tutorial was taught as part of the 2012 GMOD Summer School and the 2013 GMOD Summer School.

About SOBA

SOBA is a command line tool and web application for analyzing GFF3 annotations. GFF3 is a standard file format for genomic annotation data. SOBA gathers statistics from GFF3 files and renders them as tables and graphs.

The web version of SOBA will produce the following:

  • Summary statistics of feature types and attributes used
  • Histograms of feature lengths
  • Graphs of Sequence Ontology terms used
  • Histograms of intron density
  • Suggestions to improve SO compliance for invalid terms

In addition, the command line tool (SOBAcl) flexibly produces a much wider variety of tables, figures and graphs based on the data in a GFF3 file, as well as the ability to produce complex and extensible custom reports via a robust template system.

SOBA is a tool for those dealing with genomic sequence annotation who want to view genome-wide summaries of their annotation files. For example: SOBA would be a useful tool at an annotation jamboree for a newly sequenced organism and when preparing the resulting genome paper; SOBA would help those developing annotation tools to quickly evaluate updates to their tool; SOBA assists comparative genomics analyses by providing a high-level overview of the genome of multiple organisms. SOBA complements genome browsers by providing a summary of all the features annotated in the genome.

SOBA is built with Perl (and JavaScript for the web interface). The web interface uses CGI::Application as a Perl webapp framework and the JQuery JavaScript library for Web 2.0 effects and AJAX. Both versions of SOBA use the Template Tooklit (TT) to generate html/txt reports, graphviz for the ontology graphs, and GD for charts. Template Toolkit makes extensibility very easy, at least for someone who's willing to learn the fairly simple template language of TT as you don't need to know Perl or any other programming to use TT.

SOBA Web Application

Documentation for the web interface to SOBA is available on the Sequence Ontology Wiki as well as via tool-tips on the site itself.

Navigate to the SOBA Web Application with any modern browser that has JavaScript enabled.

SOBA main page

SOBA feature type counts

SOBA feature lengths

SOBA Feature length distribution

SOBA CDS length histogram

SOBA ontology graph


Information and downloads for SOBAcl are available from the Sequence Ontology website.

Let's start by grabbing some data to play with.

cd /home/ubuntu/soba_course/
tar -xzvf SOBA_GMOD_demo.tar.gz
cd soba/
export PATH=$PATH:/usr/local/SOBA/bin


Documentation for the command line version, SOBAcl, is available as a usage statement with the script itself:

SOBAcl --help

SOBAcl [options] db_name
SOBAcl [options] file.gff3

SOBAcl file.gff3 2> soba_count.txt


The Sequence Ontology Bioinformatics Analysis command line tool
(SOBAcl) will generate a variety of reports from the data in GFF3
files.  The data can optionally first be loaded into a database.


  --help      Print this help page and exit.

  --title     A title for the table, graph, or report.

  --columns [--series]
              The columns parameter indicates how data will be
              grouped into columns (or series for axis graphs and
              seperate graphs for pie graphs).  For example, using
              seqid as the value would create one column of data for
              each seqid (chromosome).  For bar and line graphs there
              would be a data series for each value.  For pie graphs,
              a separate graph would be generated for each
              value. (seqid, source, type, strand, phase, [file],

  --rows [--x_data]
              The rows parameter indicates data will be grouped into
              rows (x-axis labels for graphs with axes; wedges for
              pie graphs).  For example, using type as the value
              would create one row of data for each unique feature
              type in the GFF3 file.  For bar and line graphs there
              would be one x-axis label for each type and for pie
              graphs there would be one pie wedge for each
              type. (seqid, source, [type], strand, phase, file)

  --data [--y_data]
              The data parameter indicates which data fields (columns
              or attributes in the GFF3 file) will be reported.  For
              example using score for the data parameter will create
              a report on the scores of features. (seqid, source,
              type, start, end, [length], score, strand, phase)

  --data_type The data_type parameter indicates how the data described
	      in the data parameter should be summarized.  For
	      example if data is length and data_type is 'mean', then
	      the mean length would be reported for each grouping of
	      the data. (count, [mean], harmonic_mean,
	      geometric_mead, median, mode, sum, min, max, variance,
	      standard deviation, range, footprint)

  --layout    The layout parameter describes how the data will be
	      presented.  ([table], lines, bars, hbars, points,
	      linespoints, area, pie)

  --format    The format parameter determines how the output will be
	      formatted.  For all text based outputs the options are
	      (text, [tab], html, html_page, pdf).  For all graphical
	      layouts the options are (jpeg, [png], gif).

  --collapse  Collapse series data onto one graph.


First let's just run SOBAcl with all default parameters and see what we get...

SOBAcl hsap_hg18_demo.gff3

By default SOBA prints count of each feature type.

          type type (count)
|               |hsap_hg18_demo.gff3|
|CDS            |11944              |
|contig         |    3              |
|exon           |12918              |
|five_prime_UTR | 1381              |
|gene           | 1157              |
|mRNA           | 1085              |
|ncRNA          |   72              |
|three_prime_UTR| 1385              |

Next let's customize the report by specifying the data that we want to summarize in the table rows.

SOBAcl --rows seqid hsap_hg18_demo.gff3
    seqid seqid (count)
|chr1 |9986               |
|chr2 |9977               |
|chr3 |9982               |

Now we will further customize the table by indicating what data we want to represent in our rows and columns.

SOBAcl --rows type --columns seqid hsap_hg18_demo.gff3
       type type (count)
|type           |chr1|chr2|chr3|
|CDS            |3815|4144|3985|
|contig         |   1|   1|   1|
|exon           |4182|4416|4320|
|five_prime_UTR | 540| 387| 454|
|gene           | 463| 318| 376|
|mRNA           | 430| 301| 354|
|ncRNA          |  33|  17|  22|
|three_prime_UTR| 522| 393| 470|

By default we've just been getting the counts of the different feature types. Let's look at their lengths by specifying length to the --data parameter and mean to the --data_type parameter.

SOBAcl --rows type --columns seqid --data length --data_type mean hsap_hg18_demo.gff3
                   type length (mean)
|type           |chr1        |chr2        |chr3        |
|CDS            |      168.67|      169.28|      159.59|
|contig         |247249718   |242951148   |199501826   |
|exon           |      308.16|      270.85|      287.11|
|five_prime_UTR |      554.27|      618.66|      629.51|
|gene           |    48070.89|    88243.71|    73324.19|
|mRNA           |    50172.13|    90551.96|    77185.4 |
|ncRNA          |    20691.15|    47374.06|    11193.95|
|three_prime_UTR|      539.2 |      579.47|      596.2 |

Let's look at the minimum feature length for the entire file.

SOBAcl --rows type --data length --data_type min hsap_hg18_demo.gff3
          type length (min)
|type           |hsap_hg18_demo.gff3|
|CDS            |        2          |
|contig         |199501826          |
|exon           |        6          |
|five_prime_UTR |        0          |
|gene           |       74          |
|mRNA           |      287          |
|ncRNA          |       74          |
|three_prime_UTR|        0          |

Specifying --data_type footprint will give us a look at the "tiled" length of the features on the chromosome.

SOBAcl --rows type --data length --data_type footprint hsap_hg18_demo.gff3
       type length (footprint)
|type           |hsap_hg18_demo.gff3|
|CDS            |  1829153          |
|contig         |247249719          |
|exon           |  3423479          |
|five_prime_UTR |   785362          |
|gene           | 61344223          |
|mRNA           | 60303556          |
|ncRNA          |  1627355          |
|three_prime_UTR|   717369          |

Perhaps we want to load that data into Excel so the we can prepare a table or figure for our paper.

SOBAcl --rows type --data length --data_type footprint --format tab hsap_hg18_demo.gff3
CDS	        1829153
contig	        247249719
exon	        3423479
five_prime_UTR	785362
gene	        61344223
mRNA	        60303556
ncRNA	        1627355
three_prime_UTR	717369


Let's start by looking at the mean length of the features.

SOBAcl --x_data type --series seqid --data length --data_type mean \
  --layout bars --format png hsap_hg18_demo.gff3
firefox SOBAcl_graphic_chr*.png
SOBA feature lengths for chromosome 1
SOBA feature lengths for chromosome 2
SOBA feature lengths for chromosome 3

That worked, but it doesn't look quite like what we had in mind. First of all, the X-axis labels overrun each other, the contig feature is so long that none of the other feature lengths even show up on the chart, and we've got three charts - one for each chromosome (seqid). Let's fix all of those problems using the --gd, --collapse and --select parameters.

SOBAcl --x_data type --series seqid --data length --data_type mean \
  --layout bars --format png --gd x_labels_vertical=1 --select 'type => "exon"' \
  --collapse hsap_hg18_demo.gff3
firefox SOBAcl_graphic.png
SOBA feature lengths by chromosome

If we just wanted to exclude the contigs, our --select parameter can do that as well.

SOBAcl --x_data type --series seqid --data length --data_type mean \
  --layout bars --format png --gd x_labels_vertical=1 --select \
  'type => ["ne", "contig"]' --collapse hsap_hg18_demo.gff3
SOBA feature lengths by chromosome

We can constrain the features reported in other ways as well.

SOBAcl --columns seqid --rows type --data length --data_type mean \
  --layout table --format text --select 'start => [">=", "1000"], \
  end => ["<=", "1000000"]' hsap_hg18_demo.gff3


SOBAcl has support for more complex reports.

SOBAcl --report attributes --format html_page hsap_hg18_demo.gff3

These reports can be generated by custom templates.

SOBAcl --columns file   --rows type --data length --data_type mean \
  --layout table --format tab --template \
	         count	length (mean)
CDS	         11944	165.853064300067
contig	         3	229900897.333333
exon	         12918	288.366697631212
five_prime_UTR	 1381	597.052136133237
gene	         1157	67319.117545376
mRNA	         1085	70187.8202764977
ncRNA	         72	24089.3611111111
three_prime_UTR	 1385	569.969675090253