GMOD

MWAS Tutorial

MAKER Web Annotation Service

The MAKER Web Annotation Service (MWAS) is an easily configurable, web-accessible genome annotation pipeline. Its purpose is to allow research groups with small to intermediate amounts of eukaryotic and prokaryotic genome sequence (e.g. BAC clones, small whole genomes, preliminary sequencing data) to independently annotate and analyse their data and produce output that can be loaded into a genome database. MWAS is built on the stand-alone genome annotation pipeline MAKER, and users who wish to annotate datasets that are too large to submit to MWAS are free to download MAKER for use on their own systems.

Understanding MWAS

The first half of this page gives general background on genome annotation and describes validation data for the MAKER Web Annotation Service (MWAS). The stand-alone annotation pipeline MAKER is at the heart of MWAS, and MWAS has been configured to present the user with configuration options that match those of the command-line program MAKER as closely as possible.

Introduction to Genome Annotation

What Are Annotations?

Annotations are descriptions of different features of the genome, and they can be either structural or functional in nature.

Examples:

It is especially important that every genome annotation carry an evidence trail that describes in detail the evidence used to both suggest and support it. This assists in quality control and downstream management of genome annotations.
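One way to picture an evidence trail is as a list of evidence records attached to each annotation. The following is a minimal sketch only: the class names, field names, evidence source labels, and coordinates are all hypothetical and do not reflect how MAKER or MWAS store annotations internally.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """One piece of evidence backing an annotation (hypothetical fields)."""
    source: str  # kind of evidence, e.g. "EST alignment"
    name: str    # identifier of the aligned sequence or prediction
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive


@dataclass
class Annotation:
    """A structural annotation that carries its own evidence trail."""
    feature_type: str
    seqid: str
    start: int
    end: int
    evidence: List[Evidence] = field(default_factory=list)

    def has_evidence_from(self, source: str) -> bool:
        """Return True if at least one evidence record comes from `source`."""
        return any(item.source == source for item in self.evidence)


# Illustrative values only; none of these identifiers come from real data.
gene = Annotation("gene", "contig_1", 1000, 5000)
gene.evidence.append(Evidence("EST alignment", "est_0001", 1000, 2400))
gene.evidence.append(Evidence("protein alignment", "protein_0001", 1200, 4900))

# A simple quality-control pass: list annotations with no supporting evidence.
unsupported = [a for a in [gene] if not a.evidence]
```

Keeping the evidence next to the annotation makes quality-control questions, such as which gene models have no supporting evidence at all, a simple filter rather than a separate lookup.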

Examples of evidence supporting a structural annotation:

Importance of Genome Annotations

Why should the average biologist care about genome annotations? Genome sequence by itself is not very useful. The main question when any genome is sequenced is, “where are the genes?” To identify the genes we need to annotate the genome. And while most researchers probably don’t give annotations a lot of thought, they use them every day.

Examples of Annotation Databases:

Every time we use techniques such as RNAi, PCR, gene expression arrays, targeted gene knockout, or ChIP we are basing our experiments on the information derived from a digitally stored genome annotation. If the annotation is correct, then these experiments should succeed; however, if an annotation is incorrect, the experiments based on it are bound to fail. Which brings up a major point:

Quality control and evidence management are therefore essential components of any annotation process.

Effect of Next Generation Sequencing on the Annotation Process

It’s generally accepted that within the next few years it will be possible to sequence even human-sized genomes for as little as $1,000 and in a short time frame. When these expectations finally become reality, whole genome sequencing will likely become routine for even small laboratories. Unfortunately, advances in annotation technology have not kept pace with genome sequencing, and annotation is rapidly becoming a major bottleneck affecting modern genomics research.

For example:

The MAKER Web Annotation Service is a tool to assist research groups in converting the mountain of genomic data provided by next generation sequencing technologies into a usable resource. For larger datasets, research groups can use a local installation of the annotation pipeline MAKER.

What does MWAS do?

Figure: MAKER-generated annotations, shown in Apollo.

What sets MAKER and MWAS apart from other tools (ab initio gene predictors etc.)?

MAKER is an annotation pipeline, not a gene predictor. MAKER does not predict genes; rather, it leverages existing software tools (some of which are gene predictors) and integrates their output to produce what MAKER believes to be the best possible gene model for a given location based on evidence alignments.

gene prediction ≠ gene annotation

This may seem like just a matter of semantics, since the primary output for both ab initio gene predictors and the MAKER pipeline is the same: a collection of gene models. However, there are a few very significant consequences to the differences between these programs that I will explain shortly.
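To make the distinction concrete, the sketch below picks, from several candidate gene models at one locus, the one whose exons are best covered by evidence alignments. This is not MAKER's actual algorithm, which the text above does not specify; the scoring rule (fraction of exon bases overlapped by evidence), the function names, the predictor labels, and the coordinates are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def covered_positions(intervals: List[Interval]) -> Set[int]:
    """Return the set of base positions covered by a list of intervals."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def evidence_support(exons: List[Interval], evidence: List[Interval]) -> float:
    """Fraction of a model's exon bases that overlap any evidence alignment."""
    model_bases = covered_positions(exons)
    if not model_bases:
        return 0.0
    return len(model_bases & covered_positions(evidence)) / len(model_bases)


def best_supported_model(
    candidates: Dict[str, List[Interval]], evidence: List[Interval]
) -> str:
    """Name of the candidate model with the highest evidence support."""
    return max(candidates, key=lambda name: evidence_support(candidates[name], evidence))


# Hypothetical candidate models from two predictors, plus evidence alignments.
candidates = {
    "predictor_a": [(100, 200), (300, 400)],
    "predictor_b": [(100, 200), (350, 450)],
}
evidence = [(100, 200), (300, 400)]

chosen = best_supported_model(candidates, evidence)
```

Here the predictors only propose models; the choice between them is driven by the evidence, which is the sense in which an annotation pipeline differs from a gene predictor.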

Emerging vs. Model Genomes

Emerging model organism genomes each come with their own set of issues that are not necessarily found in classic model genomes. These include difficulties associated with repeat identification, gene finder training, and other complex analyses. Unfortunately, emerging model organisms are often studied by very small research communities, which often lack the resources and bioinformatics experience necessary to tackle these issues.

Classic Model Organisms                               Emerging Model Organisms
Well developed experimental systems                   New experimental systems
                                                        • Genome will be the central resource for work in these systems
Much prior knowledge about genome                     Little prior knowledge about genome
                                                        • Usually no genetics
Large community                                       Small communities
Big $                                                 Less $
Examples: D. melanogaster, C. elegans, human, etc.    Examples: oomycetes, flatworms, cone snail, etc.

Comparison of Algorithm Performance on Model vs. Emerging Genomes

If you have ever looked at comparisons of gene predictor performance on classic model organisms such as C. elegans, you would conclude that ab initio gene predictors match or even outperform state-of-the-art annotation pipelines, and the truth is that, with enough training data, they do. However, it is important to keep in mind that ab initio gene predictors have been specifically optimized to perform well on model organisms such as Drosophila and C. elegans, organisms for which we have large amounts of pre-existing data to both train and tweak the prediction parameters.

Table: MAKER's Performance on the C. elegans genome

Category                        Performance
                                Ab initio             Evidence Based
                                SNAP      Augustus    MAKER     Gramene
Genomic Overlap (gene)    SP    82.48     88.09       91.69     93.49
                          SN    95.44     96.78       89.81     88.74
Exon Overlap              SP    18.88     22.87       25.58     27.38
                          SN    87.63     93.09       91.17     94.84
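In gene-finding evaluations, SN and SP conventionally stand for sensitivity (the share of the reference annotation that the prediction recovers) and specificity (the share of the prediction that agrees with the reference, which other fields call precision). The page does not say exactly how the values in the table were computed, so the following is only a sketch of a base-level overlap calculation under that conventional definition; the function names and coordinates are hypothetical.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def covered_positions(intervals: List[Interval]) -> Set[int]:
    """Return the set of base positions covered by a list of intervals."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def overlap_accuracy(
    predicted: List[Interval], reference: List[Interval]
) -> Tuple[float, float]:
    """Return (SN, SP) as percentages, computed on overlapping bases.

    SN = TP / (TP + FN): percentage of reference bases that were predicted.
    SP = TP / (TP + FP): percentage of predicted bases found in the reference.
    """
    pred_bases = covered_positions(predicted)
    ref_bases = covered_positions(reference)
    true_positives = len(pred_bases & ref_bases)
    sn = 100.0 * true_positives / len(ref_bases) if ref_bases else 0.0
    sp = 100.0 * true_positives / len(pred_bases) if pred_bases else 0.0
    return sn, sp


# Hypothetical example: the prediction covers only the first half of the
# reference feature, so everything predicted is correct but half is missed.
sn, sp = overlap_accuracy(predicted=[(1, 50)], reference=[(1, 100)])
```

In this example SN is 50.0 and SP is 100.0. A high SN with a lower SP, the pattern the ab initio columns show for genomic overlap, therefore indicates that most reference bases are recovered but that more of the predicted bases fall outside the reference.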

What about emerging model organisms for which little data is available? Gene prediction in classic model organisms is relatively simple because there are already a large number of experimentally determined and verified gene models, but with emerging model organisms, we are lucky to have a handful of gene models to train with. As a result, ab initio gene predictors generally perform very poorly on emerging genomes.

MAKER will:
