GMOD Online Training 2014//SOBA Tutorial

From GMOD
Revision as of 13:47, 19 May 2014 by Bmoore (Talk | contribs)

Jump to: navigation, search

About SOBA

SOBA is a command line tool and web application for analyzing GFF3 annotations. GFF3 is a standard file format for genomic annotation data. SOBA gathers statistics from GFF3 files and renders them as tables and graphs.

The web version of SOBA will produce the following:

  • Summary statistics of feature types and attributes used
  • Histograms of feature lengths
  • Graphs of Sequence Ontology terms used
  • Histograms of intron density
  • Suggestions to improve SO compliance for invalid terms

In addition, the command line tool (SOBAcl) flexibly produces a much wider variety of tables, figures and graphs based on the data in a GFF3 file, as well as the ability to produce complex and extensible custom reports via a robust template system.

SOBA is a tool for those dealing with genomic sequence annotation who want to view genome-wide summaries of their annotation files. For example: SOBA would be a useful tool at an annotation jamboree for a newly sequenced organism and when preparing the resulting genome paper; SOBA would help those developing annotation tools to quickly evaluate updates to their tool; SOBA assists comparative genomics analyses by providing a high-level overview of the genome of multiple organisms. SOBA complements genome browsers by providing a summary of all the features annotated in the genome.

SOBA is built with Perl (and JavaScript for the web interface). The web interface uses CGI::Application as a Perl webapp framework and the JQuery JavaScript library for Web 2.0 effects and AJAX. Both versions of SOBA use the Template Tooklit (TT) to generate html/txt reports, graphviz for the ontology graphs, and GD for charts. Template Toolkit makes extensibility very easy, at least for someone who's willing to learn the fairly simple template language of TT as you don't need to know Perl or any other programming to use TT.

SOBA Web Application

Documentation for the web interface to SOBA is available on the Sequence Ontology Wiki as well as via tool-tips on the site itself.

Navigate to the SOBA Web Application with any modern browser that has JavaScript enabled.

SOBA main page


SOBA feature type counts


SOBA feature lengths


SOBA Feature length distribution


SOBA CDS length histogram


SOBA ontology graph


SOBAcl

Information and downloads for SOBAcl are available from the Sequence Ontology website.


Tutorial AMI

  • NAME: GMOD Summer 2014 - SOBA/GAL
  • ID: ami-58846d30

Getting Started

Let's start by grabbing some data to play with.

cd /home/ubuntu/soba_course/
wget topaz.genetics.utah.edu/SOBA_GMOD_demo.tar.gz
tar -xzvf SOBA_GMOD_demo.tar.gz
cd soba/
export PATH=$PATH:/usr/local/SOBA/bin

Help

Documentation for the command line version, SOBAcl, is available as a usage statement with the script itself:

SOBAcl --help
Synopsis:

SOBAcl [options] db_name
SOBAcl [options] file.gff3

SOBAcl file.gff3 2> soba_count.txt

Description:

The Sequence Ontology Bioinformatics Analysis command line tool
(SOBAcl) will generate a variety of reports from the data in GFF3
files.  The data can optionally first be loaded into a database.

Options:

  --help      Print this help page and exit.

  --title     A title for the table, graph, or report.

  --columns [--series]
              The columns parameter indicates how data will be
              grouped into columns (or series for axis graphs and
              seperate graphs for pie graphs).  For example, using
              seqid as the value would create one column of data for
              each seqid (chromosome).  For bar and line graphs there
              would be a data series for each value.  For pie graphs,
              a separate graph would be generated for each
              value. (seqid, source, type, strand, phase, [file],
              stats).

  --rows [--x_data]
              The rows parameter indicates data will be grouped into
              rows (x-axis labels for graphs with axes; wedges for
              pie graphs).  For example, using type as the value
              would create one row of data for each unique feature
              type in the GFF3 file.  For bar and line graphs there
              would be one x-axis label for each type and for pie
              graphs there would be one pie wedge for each
              type. (seqid, source, [type], strand, phase, file)

  --data [--y_data]
              The data parameter indicates which data fields (columns
              or attributes in the GFF3 file) will be reported.  For
              example using score for the data parameter will create
              a report on the scores of features. (seqid, source,
              type, start, end, [length], score, strand, phase)

  --data_type The data_type parameter indicates how the data described
	      in the data parameter should be summarized.  For
	      example if data is length and data_type is 'mean', then
	      the mean length would be reported for each grouping of
	      the data. (count, [mean], harmonic_mean,
	      geometric_mead, median, mode, sum, min, max, variance,
	      standard deviation, range, footprint)

  --layout    The layout parameter describes how the data will be
	      presented.  ([table], lines, bars, hbars, points,
	      linespoints, area, pie)

  --format    The format parameter determines how the output will be
	      formatted.  For all text based outputs the options are
	      (text, [tab], html, html_page, pdf).  For all graphical
	      layouts the options are (jpeg, [png], gif).

  --collapse  Collapse series data onto one graph.

Tables

First let's just run SOBAcl with all default parameters and see what we get...

SOBAcl hsap_hg18_demo.gff3

By default SOBA prints count of each feature type.

          type type (count)
=====================================
|               |hsap_hg18_demo.gff3|
=====================================
|CDS            |11944              |
+---------------+-------------------+
|contig         |    3              |
+---------------+-------------------+
|exon           |12918              |
+---------------+-------------------+
|five_prime_UTR | 1381              |
+---------------+-------------------+
|gene           | 1157              |
+---------------+-------------------+
|mRNA           | 1085              |
+---------------+-------------------+
|ncRNA          |   72              |
+---------------+-------------------+
|three_prime_UTR| 1385              |
-------------------------------------

Next let's customize the report by specifying the data that we want to summarize in the table rows.

SOBAcl --rows seqid hsap_hg18_demo.gff3
    seqid seqid (count)
===========================
|seqid|hsap_hg18_demo.gff3|
===========================
|chr1 |9986               |
+-----+-------------------+
|chr2 |9977               |
+-----+-------------------+
|chr3 |9982               |
---------------------------

Now we will further customize the table by indicating what data we want to represent in our rows and columns.

SOBAcl --rows type --columns seqid hsap_hg18_demo.gff3
       type type (count)
================================
|type           |chr1|chr2|chr3|
================================
|CDS            |3815|4144|3985|
+---------------+----+----+----+
|contig         |   1|   1|   1|
+---------------+----+----+----+
|exon           |4182|4416|4320|
+---------------+----+----+----+
|five_prime_UTR | 540| 387| 454|
+---------------+----+----+----+
|gene           | 463| 318| 376|
+---------------+----+----+----+
|mRNA           | 430| 301| 354|
+---------------+----+----+----+
|ncRNA          |  33|  17|  22|
+---------------+----+----+----+
|three_prime_UTR| 522| 393| 470|
--------------------------------

By default we've just been getting the counts of the different feature types. Let's look at their lengths by specifying length to the --data parameter and mean to the --data_type parameter.

SOBAcl --rows type --columns seqid --data length --data_type mean hsap_hg18_demo.gff3
                   type length (mean)
========================================================
|type           |chr1        |chr2        |chr3        |
========================================================
|CDS            |      168.67|      169.28|      159.59|
+---------------+------------+------------+------------+
|contig         |247249718   |242951148   |199501826   |
+---------------+------------+------------+------------+
|exon           |      308.16|      270.85|      287.11|
+---------------+------------+------------+------------+
|five_prime_UTR |      554.27|      618.66|      629.51|
+---------------+------------+------------+------------+
|gene           |    48070.89|    88243.71|    73324.19|
+---------------+------------+------------+------------+
|mRNA           |    50172.13|    90551.96|    77185.4 |
+---------------+------------+------------+------------+
|ncRNA          |    20691.15|    47374.06|    11193.95|
+---------------+------------+------------+------------+
|three_prime_UTR|      539.2 |      579.47|      596.2 |
--------------------------------------------------------

Let's look at the minimum feature length for the entire file.

SOBAcl --rows type --data length --data_type min hsap_hg18_demo.gff3
          type length (min)
=====================================
|type           |hsap_hg18_demo.gff3|
=====================================
|CDS            |        2          |
+---------------+-------------------+
|contig         |199501826          |
+---------------+-------------------+
|exon           |        6          |
+---------------+-------------------+
|five_prime_UTR |        0          |
+---------------+-------------------+
|gene           |       74          |
+---------------+-------------------+
|mRNA           |      287          |
+---------------+-------------------+
|ncRNA          |       74          |
+---------------+-------------------+
|three_prime_UTR|        0          |
-------------------------------------

Specifying --data_type footprint will give us a look at the "tiled" length of the features on the chromosome.

SOBAcl --rows type --data length --data_type footprint hsap_hg18_demo.gff3
       type length (footprint)
=====================================
|type           |hsap_hg18_demo.gff3|
=====================================
|CDS            |  1829153          |
+---------------+-------------------+
|contig         |247249719          |
+---------------+-------------------+
|exon           |  3423479          |
+---------------+-------------------+
|five_prime_UTR |   785362          |
+---------------+-------------------+
|gene           | 61344223          |
+---------------+-------------------+
|mRNA           | 60303556          |
+---------------+-------------------+
|ncRNA          |  1627355          |
+---------------+-------------------+
|three_prime_UTR|   717369          |
-------------------------------------

Perhaps we want to load that data into Excel so the we can prepare a table or figure for our paper.

SOBAcl --rows type --data length --data_type footprint --format tab hsap_hg18_demo.gff3
	hsap_hg18_demo.gff3
CDS	        1829153
contig	        247249719
exon	        3423479
five_prime_UTR	785362
gene	        61344223
mRNA	        60303556
ncRNA	        1627355
three_prime_UTR	717369

Charts

Let's start by looking at the mean length of the features.

SOBAcl --x_data type --series seqid --data length --data_type mean \
  --layout bars --format png hsap_hg18_demo.gff3
firefox SOBAcl_graphic_chr*.png
SOBA feature lengths for chromosome 1
SOBA feature lengths for chromosome 2
SOBA feature lengths for chromosome 3

That worked, but it doesn't look quite like what we had in mind. First of all, the X-axis labels overrun each other, the contig feature is so long that none of the other feature lengths even show up on the chart, and we've got three charts - one for each chromosome (seqid). Let's fix all of those problems using the --gd, --collapse and --select parameters.

SOBAcl --x_data type --series seqid --data length --data_type mean \
  --layout bars --format png --gd x_labels_vertical=1 --select 'type => "exon"' \
  --collapse hsap_hg18_demo.gff3
firefox SOBAcl_graphic.png
SOBA feature lengths by chromosome


If we just wanted to exclude the contigs, our --select parameter can do that as well.

SOBAcl --x_data type --series seqid --data length --data_type mean \
  --layout bars --format png --gd x_labels_vertical=1 --select \
  'type => ["ne", "contig"]' --collapse hsap_hg18_demo.gff3
SOBA feature lengths by chromosome

We can constrain the features reported in other ways as well.

SOBAcl --columns seqid --rows type --data length --data_type mean \
  --layout table --format text --select 'start => [">=", "1000"], \
  end => ["<=", "1000000"]' hsap_hg18_demo.gff3

Reports

SOBAcl has support for more complex reports.

SOBAcl --report attributes --format html_page hsap_hg18_demo.gff3

These reports can be generated by custom templates.

SOBAcl --columns file   --rows type --data length --data_type mean \
  --layout table --format tab --template soba_custom_template_text.tt \
  hsap_hg18_demo.gff3
	         count	length (mean)
CDS	         11944	165.853064300067
contig	         3	229900897.333333
exon	         12918	288.366697631212
five_prime_UTR	 1381	597.052136133237
gene	         1157	67319.117545376
mRNA	         1085	70187.8202764977
ncRNA	         72	24089.3611111111
three_prime_UTR	 1385	569.969675090253

The Genome Annotation Library (GAL) is a library of Perl modules that makes working with sequence features in GFF3 format straight forward and intuitive. It aims, 'to make easy things easy and hard things possible', when working with GFF3 based sequence feature data. GAL also provides numerous parser classes for converting other formats to GFF3 and makes writing new parsers easy. The GAL API takes advantage of the powerful ontology based relationships copatured by GFF3-based sequence feature data to allow users to traverse the relationships between features in a GFF3 file. GAL loads features into a light-weight relational database and thus selecting sets of features based on complex data queries (get all genes on the minus strand of chromosome 2 who have a Dbxref attribute set) is straight-forward, and iterator classes make working with feature sets convenient.

In addition to the GAL programming library many scripts utilizing the library are provided as well as several stand-alone scripts for working with GFF/GVF, fasta, fastq, and SAM/BAM files are included (see below).

GAL is written in Perl and uses DBIx::Class to interface with an SQLite representation of the sequence features. See the Documentation link below to find out more.

Stand-alone Scripts Several popular stand-alone scripts (they don't use the underlying GAL library) are included with GAL and can be used without installing the GAL libary.

fasta_tool A script providing various utility functions to manipulate fasta files. gff_tool A script providing various utility functions to manipulate GFF files. gtf2gff3 A script to convert GTF to GFF3 files. gff3_validator A syntactic validator for GFF3 files (Does not validate relationships). gvf_validator A syntactic validator for GVF files. sam_inspector A script providing various utility functions to manipulate SAM/BAM files. ucsc2gff A script to convert USCS gene tables to GFF3 output. Works with the build_genes script below. build_genes A script to add genes to UCSC table based GFF3 files. Works with ucsc2gff above. If you are only intersted in using these scripts they can be downloaded as GAL_#.#.#_stand_alone_scripts.tar.gz from the same location described in the Download section.

Download GAL releases can be downloaded from:

http://www.sequenceontology.org/software/GAL_Code/

The bleeding-edge code for GAL is also available (via Subversion) from:


svn co svn://topaz.genetics.utah.edu/GAL/trunk GAL

Installation Installation is automated by the included installation script and an INSTALL document describes all the details.

Documentation Full API documentation can be found with the distribution in the GAL/docs/html/GAL.html document or online at:

GAL Manual

In addtion, each script included with GAL comes with a detailed useage statement - just run the script with the --help flag.

GAL Working Example A very simple working GAL script below will iterate over mRNA features, retrieve exons, infer and retrieve introns and call methods on each resulting object.

use GAL::Annotation;
my $annotation = GAL::Annotation->new( $gff3_file, $fasta_file );

my ( $gff3, $fasta_file ) = @ARGV;

my $features = $annotation->features;

# Do the search and get an interator for all matching features.
# Searches can be as complex as desired.
my $mrnas = $features->search( { type => 'mRNA' } );

# Iterate over the features
while ( my $mrna = $mrnas->next ) {

    # Get the feature ID
    my $mrna_id = $mrna->feature_id;

    # Get and iterator for all the exons for this mRNA
    my $exons   = $mrna->exons;
    # The iterators are smart
    my $e_count = $exons->count;
    my ( $e_length, $e_gc );

    # Iterate over each exon
    while ( my $exon = $exons->next ) {
        $e_length += $exon->length;
        $e_gc     += $exon->gc_content;
    }
    print join "\t", int( $e_length / $e_count ), $e_gc / $e_count;
    print "\n";

    # Introns don't exist in the dataset, so GAL
    # will infer them on the fly.
    my $introns = $mrna->introns;
    my $i_count = $introns->count;
    my ( $i_length, $i_gc );
    while ( my $intron = $introns->next ) {
        $i_length += $intron->length;
        $i_gc     += $intron->gc_content;
    }
    print $i_length
      ? join( "\t", int( $i_length / $i_count ), '', $i_gc / $i_count )
      : "\t\t";
    print "\n";
}

Mailing List GAL is supported by the Sequence Ontology Developers Mailing list at:

https://lists.sourceforge.net/lists/listinfo/song-devel