Difference between revisions of "MAKER"

From GMOD

Revision as of 21:43, 13 February 2013

Status

Mature release
Development: active
Support: active

Resources

1 About MAKER
- 1.1 Screenshots
2 Downloads
3 Using MAKER
4 Publications, Tutorials, and Presentations
5 Contacts and Mailing Lists
6 MAKER Development
- 6.1 Development team
7 See also
8 More on MAKER

About MAKER

MAKER is a portable and easy to configure genome annotation pipeline. MAKER allows smaller eukaryotic genome projects and prokaryotic genome projects to annotate their genomes and to create genome databases. MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions and automatically synthesizes these data into gene annotations with evidence-based quality values. MAKER is also easily trainable: outputs of preliminary runs can be used to automatically retrain its gene prediction algorithm, producing higher quality gene-models on subsequent runs. MAKER's inputs are minimal. Its outputs are in GFF3 or FASTA format, and can be directly loaded into Chado, GBrowse, JBrowse or Apollo.

Additional MAKER options/capabilities include:

Map old annotation sets on to new assemblies.
Merge multiple legacy annotation sets into a consensus set of annotations.
Update existing annotations to take new evidence into account.
Tag pre-existing gene models with evidence alignments and quality control metrics to assist in downstream manual curation.
Use GFF3 pass-through to include both evidence alignments and predicted gene models from algorithms not natively supported by MAKER.
MAKER is MPI capable for rapid parallelization across computer clusters.
You can also easily integrate raw InterProScan results into MAKER, which will identify protein domains, add GO functional categories, and help assign putative gene functions to genome annotations. This data then becomes accessible as part of the GFF3 output and can be loaded into a Chado database, GBrowse, or Apollo.

MAKER comes with sample data for testing purposes. See the /data directory in the download.

Visit the MAKER website.

Screenshots

View of MAKER annotations in the Apollo genome annotation curation tool. Supporting evidence is shown in the upper dark panel. Gene annotations are shown in the blue panel.

Downloads

MAKER is available as a compressed TAR file from the MAKER website.

The source code for MAKER can be downloaded from http://yandell.topaz.genetics.utah.edu/cgi-bin/maker_license.cgi.

Using MAKER

We recommend that users read the MAKER paper [below] and the README file included with the download before installing and using MAKER.

System Requirements

Perl 5.8.0 or higher
BioPerl 1.6 or higher.
WU-BLAST 2.0 or higher or NCBI-BLAST 2.2.X or higher
RepeatMasker 3.1.6 or higher
- RepeatMasker requires a repeat library, available from Repbase.
Exonerate 1.4 or higher.

Optional extras:

SNAP version 2009-02-03 or higher (for eukaryotic genomes).
Augustus 2.0 or higher (for eukaryotic genomes)
GeneMark-ES (for eukaryotic genomes)
FGENESH 2.6 or higher - requires licence (for eukaryotic genomes)
GeneMarkS (for prokaryotic genomes)

Installation

An online installation guide is coming soon; please see the README file in the download for instructions.

Documentation

By default, MAKER runs on a single computer. A parallel version, mpi_maker, is also available. To run mpi_maker you will need a message passing interface (MPI) package installed on all participating computers; try MPICH2. Remember to install MPICH2 with the --enable-sharedlibs flag set to the appropriate value. See the MPICH2 Installer's Guide for more information.

Annotations

The MAKER genome annotation pipeline generates several different types of annotations, including

Ab initio gene predictions from SNAP, Augustus, FGENESH, and GeneMark
Final gene models from MAKER
EST alignments from both EXONERATE and BLASTN
Protein alignments from EXONERATE and BLASTX
Repeats from RepeatMasker and the MAKER internal RepeatRunner

Publications, Tutorials, and Presentations

Publications on or mentioning MAKER

MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. ^[1]
MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. PMID:18025269 ^[2]

Tutorials

MAKER Tutorial: taught as part of the 2013 GMOD Summer School

Presentations

MAKER—the easy-to-use genome annotation pipeline, presented by Mark Yandell at the SMBE 2009 GMOD Workshop.

Contacts and Mailing Lists

	Mailing List Link	Description	Archive(s)
MAKER	maker-devel	MAKER developers and users list.	Google, Nabble (2010/05+)

Support for MAKER is provided on a mailing list, and through a Trac-based wiki, bug tracker and code browser.

MAKER Development

Development team

Yandell Lab, Utah; Ian Korf (University of California Davis)

More on MAKER

See Category:MAKER

[1] PMID:22192575
[2] PMID:18025269

Raw tool data at MAKER/tool data

Retrieved from "http://gmod.org/mediawiki/index.php?title=MAKER&oldid=23017"


Has download URL	http://yandell.topaz.genetics.utah.edu/cgi-bin/maker_license.cgi +
Has website	http://www.yandell-lab.org/software/maker.html +

@@ Line 1: / Line 1: @@
-{{SessionHead}}
+<!-- to alter this page, please edit the raw data, which is stored at http://gmod.org/wiki/MAKER/tool_data -->
-{| class="tutorialheader"
-| {{TutorialTitleLine|[[gmod:MAKER|MAKER]]}}<br />
-[[2011 GMOD Spring Training]]<br />
--12 March 2011<br />
-[[User:Bmoore@genetics.utah.edu|Barry Moore]]
-| align="right" | {{#icon: MAKERLogo.png|MAKER||[[gmod:MAKER]]}}
-|}
+{{ :MAKER/tool_data | template = Template:ToolDisplay }}
-=Maker Overview, Installation, and Basic Configuration for Annotating Genomic Sequence=
+[[Category:GMOD Components]]
-The first half of this page describes the basics of [[gmod:MAKER|MAKER]] - the easy-to-use genome annotation pipeline.
+[[Category:Annotation]]
+[[Category:MAKER]]
-==About MAKER==
-MAKER is an easy-to-use genome annotation pipeline designed to be usable by small research groups with little bioinformatics experience; however, MAKER is also designed to be scalable and is appropriate for projects of any size including use by large sequence centers.  MAKER can be used for ''de novo'' annotation of newly sequenced genomes, for updating existing annotations to reflect new evidence, or just to combine annotations, evidence, and quality control statistics for use in other GMOD programs like [[GBrowse]], [[JBrowse]], [[Chado]], and [[Apollo]].
-MAKER has been used in many genome annotation projects:
-*The Pine genome project (A. Stambolia-Kovach, UCD). Now published.
-*The Lamprey genome project (W. Li, MSU). Manuscript in preparation.
-*The ''Fusarium circinatum'' (pine pitch canker) genome project (B. Wingfield, U of Pretoria). Manuscript submitted.
-*Leaf-cutter Ant genome (C. Currie, U of Wis., Madison) Now published.
-*Argentine Ant genome (CD. Smith, San Francisco State).
-*Red Harvester Ant. Now published.
-*Several parasitic nematode genomes (M. Mitreva, WASHU). Some now published.
-*The ''Conus bullatus'' genome project (B. Olivera U of Utah). Preliminary results published.
-*The ''Pythium ultimum'' genome project (R. Buell, MSU). Published.
-*''Cardiocondyla obscurior'' (ant) genome project (J. Gadau, Az. State University)
-*Corn rootworm beetle genome project (H. Robertson, Univ. of Illinois)
-*Rice re-annotation (R. Buell, MSU).
-*''Arabidopsis'' re-annotation (E. Huala, TAIR)
-*Maize re-annotation (C. Lawrence, MaizeGDP)
-*Wheat stem sawfly project (H. Robertson, Univ. of Illinois)
-*Apple maggot fly (H. Robertson, Univ. of Illinois)
-*Pigeon genome (M. Shapiro, Univ. of Utah)
-*The fungus ''Cronartium quercuum'' (J. M. Davis, University of Florida)
-==Introduction to Genome Annotation==
-===What Are Annotations?===
-Annotations are descriptions of different features of the genome, and they can be structural or functional in nature.
-Examples:
-*Structural Annotations: exons, introns, UTRs, splice forms etc.[[Image:structural.png]]
-*Functional Annotations: process a gene is involved in (metabolism), molecular function (hydrolase), location of expression (expressed in the mitochondria), etc.
-It is especially important that every genome annotation carries an evidence trail that describes in detail the evidence used to both suggest and support it.  This assists in quality control and downstream management of genome annotations.
-Examples of evidence supporting a structural annotation:
-*''Ab initio'' gene predictions
-*ESTs
-*Protein homology
-===Importance of Genome Annotations===
-Why should the average biologist care about genome annotations?
-{|
-| |
-[[Image:process.png|thumb|560px|'''Figure:''' Genome project from sequencing to experimental application of annotations]]
-|}
-Genome sequence itself is not very useful.  The first question that occurs to most of us when a genome is sequenced is, "where are the genes?"  To identify the genes we need to annotate the genome.  And while most researchers probably don't give annotations a lot of thought, they use them every day.
-Examples of Annotation Databases:
-* [http://uswest.ensembl.org/index.html Ensembl]
-* [http://www.ncbi.nlm.nih.gov/RefSeq/ RefSeq]
-* [[gmod:Category:FlyBase|FlyBase]]
-* [[gmod:Category:WormBase|WormBase]]
-* [[gmod:Category:MGI|Mouse Genome Informatics Group]]
-Every time we use techniques such as RNAi, PCR, gene expression arrays, targeted gene knockout, or ChIP we are basing our experiments on the information derived from a digitally stored genome annotation.  If an annotation is correct, then these experiments should succeed; however, if an annotation is incorrect then the experiments that are based on that annotation are bound to fail.  Which brings up a major point:
-*'''Incorrect and incomplete genome annotations poison every experiment that uses them.'''
-Quality control and evidence management are therefore essential components to any annotation process.
-===Effect of [[gmod:Next Generation Sequencing|NextGen Sequencing]] on the Annotation Process===
-It’s generally accepted that within the next few years it will be possible to sequence even human-sized genomes for as little as $1,000 and in a short time frame.  Pacific Biosciences is claiming they will be able to sequence a human-sized genome in [http://www.pacificbiosciences.com/assets/files/1-23439884-eprint.pdf fifteen minutes by 2013].  If the hype is to be believed, then whole genome sequencing will become "routine" for even small labs in the not-so-distant future.  Unfortunately, however, advances in annotation technology have not kept pace with genome sequencing, and annotation is rapidly becoming a major bottleneck affecting modern genomics research.
-For example:
-*As of March 2011, 2002 eukaryote and 5811 prokaryote genome projects were underway.
-*If we assume 10,000 genes per genome, that’s over 78,000,000 new annotations across those 7,813 projects (with this many new annotations, quality control and maintenance become an issue).
-*While there are organizations dedicated to producing and distributing genome annotations (e.g., Ensembl and VectorBase), the sheer volume of newly sequenced genomes exceeds both their capacity and stated purview.
-*Small research groups are affected disproportionately by the difficulties related to genome annotation, primarily because they often lack bioinformatics experience and must confront the difficulties associated with genome annotation on their own.
-MAKER is an easy-to-use annotation pipeline designed to help smaller research groups convert the coming tsunami of genomic data provided by next generation sequencing technologies into a usable resource.
-==MAKER Overview==
-[[Image:MAKERLogo.png]]
-[http://www.cafepress.com/dd/31169546 Get MAKER Bling!]
-The easy-to-use  annotation pipeline.
-{| class="wikitable"
-! User Requirements:
-| Can be run by a single individual with little bioinformatics experience
-|-
-! System Requirements:
-| Can run on laptop or desktop computers running Linux or Mac OS X (also cluster compatible)
-|-
-! Program Output:
-| Output is compatible with popular GMOD annotation tools like [[Apollo]], [[GBrowse]], and [[JBrowse]]
-|-
-! Availability:
-| Free open source application (for academic use)
-|}
-===What does MAKER do?===
-*Identifies and masks out repeat elements
-*Aligns ESTs to the genome
-*Aligns proteins to the genome
-*Produces ''ab initio'' gene predictions
-*Synthesizes these data into final annotations
-*Produces evidence-based quality values for downstream annotation management
-{| cellpadding="5px"
-|-
-| valign="top" style="border: 1px solid gray" | [[Image:Apollo_view.jpg | border ]]
-|-
-|valign="top" align="center" |MAKER generated annotations, shown in [[Apollo]].
-|}
-===What sets MAKER apart from other tools (''ab initio'' gene predictors etc.)?===
-MAKER is an annotation pipeline, not a gene predictor.  MAKER does not predict genes itself; rather, it leverages existing software tools (some of which are gene predictors) and integrates their output to produce what MAKER believes to be the best possible gene model for a given location based on evidence alignments.
-{|
-| |
-[[Image:comparison.png|thumb|560px|]]
-|}
-gene prediction ≠ gene annotation
-*gene predictions are partial gene models.
-*gene annotations are gene models but should include a documented evidence trail supporting the model in addition to quality control metrics.
-This may seem like just a matter of semantics since the primary output for both ''ab initio'' gene predictors and the MAKER pipeline is the same, a collection of gene models.  However there are a few very significant consequences to the differences between these programs that I will explain shortly.
-===Emerging vs. Classic Model Genomes===
-Emerging model organism genomes each come with their own set of issues that are not necessarily found in classic model genomes.  These include difficulties associated with repeat identification, gene finder training, and other complex analyses.  Unfortunately, emerging model organisms are often studied by very small research communities, which often lack the resources and bioinformatics experience necessary to tackle these issues.
-{| class="wikitable "
-|-
-! Classic Model Organisms
-! Emerging Model Organisms
-|-
-| valign='top' |
-Well developed experimental systems
-| |
-New experimental systems
-*Genome will be the central resource for work in these systems
-|-
-| valign='top'|
-Much prior knowledge about genome
-| |
-Little prior knowledge about genome
-*Usually no genetics
-|-
-| Large community
-| Small communities
-|-
-| Big $
-| Less $
-|-
-| Examples: ''D. melanogaster, C. elegans'', human, etc.
-| Examples: oomycetes, flat worms, cone snail, etc.
-|}
-===Comparison of Algorithm Performance on Model vs. Emerging Genomes===
-If you have ever looked at comparisons of gene predictor performance on classic model organisms such as ''C. elegans'' you would conclude that ''ab initio'' gene predictors match or even outperform state-of-the-art annotation pipelines, and the truth is that, with enough training data, they do.  However, it is important to keep in mind that ''ab initio'' gene predictors have been specifically optimized to perform well on model organisms such as ''Drosophila'' and ''C. elegans'', organisms for which we have a large amount of pre-existing data to both train and tweak the prediction parameters.
-{|
-| |
-[[Image:RIP_figure.png|thumb|560px|'''Figure:''' Comparison of Base-Level Accuracies For Final Gene Models]]
-|}
-What about emerging model organisms for which little data is available?  Gene prediction in classic model organisms is relatively simple because there are already a large number of experimentally determined and verified gene models, but with emerging model organisms, we are lucky to have a handful of gene models to train with.  As a result ''ab initio'' gene predictors generally perform very poorly on emerging genomes.
-{|
-| |
-[[Image:Maker_performance.jpg|thumb|560px|'''Figure:''' MAKER's Performance on the ''S. mediterranea'' Emerging Model Organism Genome. Pfam domain content of gene models determined using rpsblast]]
-|}
-By using ''ab initio'' gene predictors inside of the MAKER pipeline instead of as stand-alone applications you get certain benefits:
-*Provide gene models together with an evidence trail for quality control and manual curation
-*Provide a mechanism to train and retrain ''ab initio'' gene predictors for even better performance.
-*Output can be easily loaded into a GMOD compatible database for annotation distribution (including evidence associations).
-*Annotations can be automatically updated with new evidence by simply passing existing annotation sets back into the pipeline
-==Installation==
-===Prerequisites===
-Perl Modules
-* {{CPAN|DBI}}
-* {{CPAN|DBD::SQLite}}
-* {{CPAN|Proc::ProcessTable}}
-* {{CPAN|threads}} (Optional, for MPI scripts)
-* {{CPAN|IO::All}} (Optional, for accessory scripts)
-* {{CPAN|IO::Prompt}} (Optional, for accessory scripts)
-External Programs
-*[http://www.perl.org/ Perl] 5.8.0 or Higher
-*[http://www.bioperl.org/ BioPerl] 1.6 or higher
-*[http://homepage.mac.com/iankorf/ SNAP] version 2009-02-03  or higher
-*[http://www.repeatmasker.org/ RepeatMasker] 3.1.6  or higher
-*[http://www.ebi.ac.uk/~guy/exonerate/ Exonerate] 1.4  or higher
-You must also install one of the following:
-*[http://blast.wustl.edu/ WU-BLAST] 2.0 or higher (Now [http://www.advbiocomp.com/ AB-BLAST])
-*[http://www.ncbi.nlm.nih.gov/Ftp/ NCBI BLAST] 2.2.X or higher
-Optional Components:
-*[http://augustus.gobics.de/ Augustus] 2.0 or higher
-*[http://exon.biology.gatech.edu/ GeneMark-ES] 2.3a or higher
-*[http://www.softberry.com/ FGENESH] 2.6 or higher
-Required for optional MPI support:
-*[http://www.mcs.anl.gov/research/projects/mpich2/ MPICH2]
-(Working on Amazon EC2 support.  Can also start MAKER multiple times and get parallelization without MPI.  Subsequent MAKER instances will detect already running instances and integrate seamlessly.)
-===The MAKER Package===
-Because of the number of prerequisites, we will not cover the details of installing these other programs; they have already been installed for you.  But even though I did pre-install most programs for you, I'm still going to have you perform basic post-installation configuration, so let's get started.
-<div class="attn">
-MAKER can be downloaded from:
-*http://www.yandell-lab.org/ - but it should already be on the image
-</div>
-To keep everyone from hitting the server at once though, I have already placed MAKER in the <tt>~/Documents/Software/maker/</tt> directory.  This is a [[gmod:SVN|subversion]] repository of MAKER, so let's make sure it is updated to the latest version, and then look at the package's contents.
-<div class="dont">
- #Don't do this today
- cd Documents/Software/maker/
- svn update
- ls -1
-</div>
-Note: That is a ''dash one'', not a ''dash el'', on the <tt>ls</tt> command.
-You should now see the following:
- GMOD
- INSTALL
- LICENSE
- MWAS
- README
- data
- lib
- src
-There are two files in particular that you would want to look at when installing MAKER -  <tt>INSTALL</tt> and <tt>README</tt>.  <tt>INSTALL</tt> gives a brief overview of MAKER and prerequisite installation.  Let's take a look at this.
- <span class="enter">less INSTALL</span>
-You shouldn't need to do this if MAKER is pre-installed or if MAKER installed all the prerequisites.  But for the sake of documentation...
-<pre>
-***Installation Documentation***
-How to Install Standard MAKER
-EASY INSTALL
-.  Go to the .../maker/src/ directory and run 'perl Build.PL' to configure
-    the install and then './Build install' to complete the installation.
-.  If anything fails, either use the ./Build file commands to retry the
-    failed section or follow the detailed install instructions below to
-    manually install missing modules or programs.
-See the README file for details on installing mpi_maker
-</pre>
-<div class="dont">
-According to the documentation we need to add a few entries to your user profile.  So let's open it in a [[Linux Text Editors|text editor]].
- <span class="enter">gedit ~/.profile</span>
-{{TextEditorLink|gedit}}
-Add the following to your user profile (optional with the latest MAKER versions).
-For bash:
-<pre class="enter">
- PATH=/home/gmod/Documents/Software/maker/bin:$PATH
- PATH=/usr/local/ncbi-blast/bin:$PATH
- PATH=/usr/local/exonerate/bin:$PATH
- PATH=/usr/local/augustus/bin:$PATH
- PATH=/usr/local/snap:$PATH
- PATH=/usr/local/gm_es:$PATH
- PATH=/usr/local/RepeatMasker:$PATH
- export PATH
- export ZOE=/usr/local/snap
- export AUGUSTUS_CONFIG_PATH=/usr/local/augustus/config
-</pre>
-Now reload your profile to make the changes take hold.
- <span class="enter">source ~/.profile</span>
-</div>
-MAKER should now be installed.  Let's test the executable.  We should see the usage statement.
- <span class="enter">maker -help</span>
- Usage:
-     maker [options] <maker_opts> <maker_bopts> <maker_exe>
-     Maker is a program that produces gene annotations in GFF3 file format using
-     evidence such as EST alignments and protein homology.  Maker can be used to
-     produce gene annotations for new genomes as well as update annoations from
-     existing genome databases.
-     The three input arguments are user control files that specify how maker
-     should behave. All options for maker should be set in the control files,
-     but a few can also be set on the command line. Command line options provide
-     a convenient machanism to override commonly altered control file values.
-     Input files listed in the control options files must be in fasta format.
-     unless otherwise specified. Please see maker documentation to learn more
-     about control file  configuration.  Maker will automatically try and locate
-     the user control files in the current working directory if these arguments
-     are not supplied when initializing maker.
-     It is important to note that maker does not try and recalculated data that
-     it has already calculated.  For example, if you run an analysis twice on
-     the same dataset file you will notice that maker does not rerun any of the
-     blast analyses, but instead uses the blast analyses stored from the
-     previous run.  To force maker to rerun all analyses, use the -f flag.
- Options:
-     -genome|g <filename> Specify the genome file.
-     -RM_off|R           Turns all repeat masking off.
-     -datastore/         Forcably turn on/off MAKER's use of a two deep datastore
-      nodatastore        directory structure for output.  By default this option
-                         turns on whenever there are more the 1,000 contigs in
-                         the input genome fasta file.
-     -base    <string>   Set the base name MAKER uses to save output files.
-                         MAKER uses the input genome file name by default.
-     -retry|r <integer>  Rerun failed contigs up to the specified count.
-     -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
-     -force|f            Forces maker to delete old files before running again.
-                         This will require all blast analyses to be re-run.
-     -again|a            Caculate all annotations and output files again even if
-                         no settings have changed. Does not delete old analyses.
-     -evaluate|e         Run Evaluator on final annotations (under development).
-     -fast               Causes MAKER to skip most clustering and analysis.
-                         A quick way to align evidence.  You then must re-run
-                         MAKER to produce full GFF3 output and annotations.
-     -quiet|q            Silences most of maker's status messages.
-     -qq                 Really quit. Silences everything but major errors.
-     -CTL                Generate empty control files in the current directory.
-     -help|?             Prints this usage statement.
-==Getting Started with MAKER==
-===Note===
-Before we begin with any examples, I want everyone to note that all finished examples are located in <tt>~/Documents/Data/maker</tt>, so if you fall behind you can always find MAKER control files, datasets, and final results in there.
-Let's just quickly take a look
-<pre class="enter">
- cd ~/Documents/Data/maker
- ls -1
-</pre>
-You should see five example folders
-<pre class="enter">
- example1_dmel
- example2_pyu
- example3_mRNAseq
- example4_legacy
- example5_ecoli
-</pre>
-Let's look inside <tt>example1_dmel</tt>.
- <span class="enter">ls -1 example1_dmel</span>
-You will see a directory called <tt>finished.maker.output</tt> which contains all the final results for the example.  Each of the other examples will contain a similar directory.
- finished.maker.output
-Now let's get started!
-===RUNNING MAKER WITH EXAMPLE DATA===
-MAKER comes with some example input files to test the installation and to familiarize the user with how to run MAKER.  The example files are found in the <tt>maker/data</tt> directory.
- <span class="enter">ls -1 /home/gmod/Documents/Software/maker/data</span>
- dpp_contig.fasta
- dpp_proteins.fasta
- dpp_est.fasta
- te_protein.fasta
-The example files are in [[gmod:Glossayr#FASTA|FASTA]] format. MAKER requires FASTA format for its input files.  Let's take a look at one of these files to see what the format looks like.
- <span class="enter">cat /home/gmod/Documents/Software/maker/data/dpp_proteins.fasta</span>
- >dpp-CDS-5
- MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLASASGSGSGRSGSRSVG
- ASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANRQFNEVHKPRTDQLENSKN
- KSKQLVNKPNHNKMAVKEQRSHHKKSHHHRSHQPKQASASTESHQSSSIESIFVEEPTLV
- LDREVASINVPANAKAIIAEQGPSTYSKEALIKDKLKPDPSTLVEIEKSLLSLFNMKRPP
- KIDRSKIIIPEPMKKLYAEIMGHELDSVNIPKPGLLTKSANTVRSFTHKDSKIDDRFPHH
- HRFRLHFDVKSIPADEKLKAAELQLTRDALSQQVVASRSSANRTRYQVLVYDITRVGVRG
- QREPSYLLLDTKTVRLNSTDTVSLDVQPAVDRWLASPQRNYGLLVEVRTVRSLKPAPHHH
- VRLRRSADEAHERWQHKQPLLFTYTDDGRHKARSIRDVSGGEGGGKGGRNKRQPRRPTRR
- KNHDDTCRRHSLYVDFSDVGWDDWIVAPLGYDAYYCHGKCPFPLADHFNSTNHAVVQTLV
- NNMNPGKVPKACCVPTQLDSVAMLYLNDQSTVVLKNYQEMTVVGCGCR
-FASTA format is fairly simple. It contains a definition line starting with '>' that contains a name for a sequence followed by the actual sequence in nucleotide or amino acid format.  The file we are looking at contains protein sequences, so the sequence uses the single letter code for amino acids.  A minimal input file set for MAKER would generally consist of a FASTA file for the genomic sequence, a FASTA file of ESTs derived from the transcriptome, and a FASTA file of protein sequences from the same or related organisms.  I'll describe in more detail exactly what MAKER does with each data file shortly.
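If you want a quick sanity check of a FASTA file before handing it to MAKER, something like the following may help. It is only a sketch: <tt>fasta_summary</tt> and the tiny sample file are made up for illustration and are not part of MAKER. It prints each sequence name (the first word after '>') with the total length of the sequence lines that follow it.

```shell
#!/usr/bin/env bash
# Sketch: list each FASTA record with its sequence length.
# fasta_summary and the sample file are illustrative, not part of MAKER.
fasta_summary() {  # usage: fasta_summary <fasta_file>
    awk '
        /^>/ {
            if (name != "") print name "\t" len   # flush the previous record
            name = substr($1, 2)                  # first word of the definition line
            len = 0
            next
        }
        { len += length($0) }                     # sequence may span several lines
        END { if (name != "") print name "\t" len }
    ' "$1"
}

sample_fasta=$(mktemp)
cat > "$sample_fasta" <<'EOF'
>seq1 example protein
MRAWLLLLAV
LATFQ
>seq2
ACGT
EOF

fasta_summary "$sample_fasta"
```

On the workshop image you could point the same function at one of the example files in the <tt>maker/data</tt> directory instead of the temporary sample.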
-Now we are going to copy the example files to the <tt>example1_dmel</tt> directory we looked at earlier before running MAKER.
-<pre class="enter">
- cd /home/gmod/Documents/Data/maker/example1_dmel
- cp /home/gmod/Documents/Software/maker/data/dpp* .
-</pre>
-Next we need to tell MAKER all the details about how we want the annotation process to proceed.  Because there can be many variables and options involved in annotation, command line options would be too numerous and cumbersome.  Instead MAKER uses a set of configuration files which guide each run.  You can create a set of generic configuration files in the current working directory by typing the following.
- <span class="enter">maker -CTL</span>
-This creates three files (type <tt>ls -l</tt> to see).
-*<tt>maker_exe.ctl</tt> - contains the path information for needed executables.
-*<tt>maker_bopts.ctl</tt> - contains filtering statistics for BLAST and Exonerate
-*<tt>maker_opts.ctl</tt> - contains all other information for MAKER, including the location of the input genome file.
-Control files are run-specific and a separate set of control files will need to be generated for each genome annotated with MAKER. MAKER will look for control files in the current working directory, so it is recommended that MAKER be run in a separate directory containing unique control files for each genome.
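One way to keep control files separate is to give every genome its own working directory and generate the control files there. The sketch below assumes a base directory of my own choosing and two placeholder genome names; only the <tt>-CTL</tt> option comes from the usage statement shown earlier. If <tt>maker</tt> is not on your PATH, it just creates the directories.

```shell
#!/usr/bin/env bash
# Sketch: one working directory per genome, each with its own control files.
# The base directory and genome names are hypothetical placeholders.
runs_base="${TMPDIR:-/tmp}/maker_runs"

for genome in dmel_example pyu_example; do
    run_dir="$runs_base/$genome"
    mkdir -p "$run_dir"
    if command -v maker >/dev/null 2>&1; then
        # -CTL generates generic control files in the current directory
        (cd "$run_dir" && maker -CTL)
    else
        echo "maker not found on PATH; created $run_dir only" >&2
    fi
done
```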
-Let's take a look at the <tt>maker_exe.ctl</tt> file.
- <span class="enter">gedit maker_exe.ctl</span>
-{{TextEditorLink|gedit}}
-You will see the names of a number of MAKER supported executables as well as the path to their location.  If you followed the installation instructions correctly, including the instructions for installing prerequisite programs, all executable paths should show up automatically for you.  However if the location to any of the executables is not set in your PATH environment variable, as per installation instructions, you will have to add these manually to the <tt>maker_exe.ctl</tt> file every time you run MAKER.
-Lines in the MAKER control files have the format <tt>key=value</tt> with no spaces before or after the equals sign (=).  If the value is a file name, you can use relative paths and environment variables, e.g. <tt>snap=$HOME/snap</tt>.  Note that in all control files the comments written to help users begin with a pound sign (#).  In addition, the option names before the equals sign (=) cannot be changed.
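Stray spaces around the equals sign are easy to introduce when editing by hand. As a rough check, the sketch below flags such lines. <tt>ctl_lint</tt> is a hypothetical helper, not something shipped with MAKER, and the mock file only imitates the <tt>key=value #comment</tt> layout described here.

```shell
#!/usr/bin/env bash
# Sketch: flag control-file lines that break the key=value rule.
# ctl_lint is a hypothetical helper, not part of MAKER.
ctl_lint() {  # usage: ctl_lint <control_file>
    # Matches "key =value" and "key= value". Comment lines are ignored, and so
    # are empty values that are followed directly by a comment ("key= #...").
    grep -nE '^[A-Za-z0-9_]+([[:space:]]+=|=[[:space:]]+[^#[:space:]])' "$1"
}

demo_ctl=$(mktemp)
cat > "$demo_ctl" <<'EOF'
#-----Example options (mock file, not real MAKER output)
genome=dpp_contig.fasta #genome sequence
est = dpp_est.fasta #space around the equals sign
snap=$HOME/snap #environment variables are allowed
protein= #empty value
EOF

ctl_lint "$demo_ctl"   # reports the "est = ..." line with its line number
```

Because <tt>grep</tt> exits non-zero when nothing matches, a clean file produces no output and a failing exit status, which is convenient in scripts.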
-Now let's take a look at the <tt>maker_bopts.ctl</tt> file.
- <span class="enter">gedit maker_bopts.ctl</span>
-{{TextEditorLink|gedit}}
-In this file you will find values you can edit for downstream filtering of BLAST and Exonerate alignments.  At the very top of the file you will see that I have the option to tell MAKER whether I prefer to use WU-BLAST or NCBI-BLAST.  We want to set this to NCBI-BLAST, since that is what is installed.  We can just leave the remaining values as the default.
- <span class="enter">blast_type=ncbi+</span>
-Now let's take a look at the <tt>maker_opts.ctl</tt> file.
- <span class="enter">gedit maker_opts.ctl</span>
-{{TextEditorLink|gedit}}
-This is the primary configuration file for MAKER specific options.  Here we need to set the location of the genome, EST, and protein input files we will be using.  These come from the supplied example files.  We also need to set repeat masking options, as well as a number of other configurations.  We'll discuss these options in more detail later on, but for now just adjust the following values.
-<pre class="enter">
- genome=dpp_contig.fasta
- est=dpp_est.fasta
- protein=dpp_protein.fasta
- est2genome=1
-</pre>
-''Note: Do not put spaces on either side of the <tt>=</tt> on the above control file lines.''
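If you would rather script these edits than make them in a text editor, a small helper along the following lines could work. This is a sketch under assumptions: <tt>ctl_set</tt> is a made-up function, and the temporary file merely stands in for <tt>maker_opts.ctl</tt> using the <tt>key=value #comment</tt> layout. It rewrites the value and keeps any trailing comment.

```shell
#!/usr/bin/env bash
# Sketch: set control-file values without opening an editor.
# ctl_set is a hypothetical helper; values must not contain "|" or "&".
ctl_set() {  # usage: ctl_set <control_file> <key> <value>
    local file=$1 key=$2 value=$3 tmp
    tmp=$(mktemp)
    # First expression: the line has a trailing comment, keep it.
    # Second expression: the line has no comment, replace up to end of line.
    sed -e "s|^${key}=[^#]*#|${key}=${value} #|" \
        -e "s|^${key}=[^#]*\$|${key}=${value}|" "$file" > "$tmp" && mv "$tmp" "$file"
}

opts=$(mktemp)   # stand-in for maker_opts.ctl; the layout is an assumption
cat > "$opts" <<'EOF'
genome= #genome sequence
est= #EST sequences
est2genome=0 #infer gene models directly from ESTs
EOF

ctl_set "$opts" genome dpp_contig.fasta
ctl_set "$opts" est dpp_est.fasta
ctl_set "$opts" est2genome 1
cat "$opts"
```

The helper writes <tt>key=value</tt> with no spaces around the equals sign, so it respects the formatting rule in the note above.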
-Now let's run MAKER.
- <span class="enter">maker</span>
-You should now see a large amount of status information flowing past your screen.  If you don't want to see this you can run MAKER with the <tt>-q</tt> option for "quiet" on future runs.
-==Details of What is Going on Inside of MAKER==
-===Repeat Masking===
-The first step in the MAKER pipeline is repeat masking. Why do we need to do this?  Repetitive elements can make up a significant portion of the genome.  Some of these repeats are simple/low-complexity repeats where you have runs of C's or G's or maybe dinucleotide repeats.  Other repeats are more complex, e.g. transposable elements.  These high-complexity repeats often encode real proteins like reverse transcriptase or the viral Gag, Pol, and Env proteins.  Because they encode real proteins, they can play havoc with ''ab initio'' gene predictors.  For example, a transposable element that occurs next to or even within the intron of a real protein-encoding gene might cause a gene predictor to include extra exons as part of a gene model, sequence which really only belongs to the transposable element and not to the coding sequence of the gene.  You will also get hundreds of instances where identical transposable element proteins get annotated as part of an organism's proteome.  In addition to these issues, low-complexity repeat regions can align with high statistical significance to low-complexity protein regions, creating a false sense of homology throughout the genome.  To avoid these complications it is convenient to identify and mask any repeat elements before doing other analyses.
-{|
-| |
-[[Image:repeatmask.jpg|thumb|560px|'''Figure:''' Identify and mask repetitive elements]]
-|}
-MAKER identifies repeats in two steps.
-*First a program called RepeatMasker is used to identify low-complexity and high-complexity repeats that match entries in the RepBase repeat library, or any species specific repeat library supplied by the user.
-*Next MAKER uses RepeatRunner to identify transposable element and viral proteins from the RepeatRunner protein database.  Because protein sequence diverges at a slower rate than nucleotide sequence, this step helps pick up the most problematic regions of divergent repeats that are missed by RepeatMasker, which searches in nucleotide space.
-Regions identified during repeat analysis are masked out so as not to complicate other downstream annotation analyses.
-*High-complexity repeats are hard-masked, a technique in which nucleotide sequence is replaced with the letter N to prohibit any alignments to that region.
-*Low-complexity regions are soft-masked, a technique in which nucleotides are made lower case so they can be treated as masked under certain situations without losing sequence information.  I will discuss some of the applications and effects of soft-masking later.
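The difference between the two masking styles is easy to see on a toy sequence. The sketch below is an illustration only, not what MAKER runs internally: the sequence and variable names are made up, and it converts a soft-masked stretch to a hard-masked one simply to show what information each representation keeps.

```shell
# Toy sequence: upper case is unmasked, lower case is a soft-masked repeat.
dna='ATGGCCaaaaaaaaTTGACC'

# Hard-masking replaces the masked bases with N, discarding their identity.
hard_masked=$(printf '%s\n' "$dna" | tr 'acgtn' 'NNNNN')

# Count the soft-masked (lower-case) bases in the original sequence.
soft_count=$(printf '%s' "$dna" | tr -cd 'acgtn' | wc -c | tr -d ' ')

printf 'soft-masked:  %s\n' "$dna"
printf 'hard-masked:  %s\n' "$hard_masked"
printf 'masked bases: %s\n' "$soft_count"
```

The soft-masked form can always be turned back into the original sequence by upper-casing it, whereas the hard-masked form cannot.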
-Now the idea of masking out sequence might seem on the surface like we're losing a lot of information, and it is true that there can be proteins that have integrated repeats into their structure, so repeat masking will affect our ability to annotate these proteins.  However, these proteins are rare, and the number of gene models and homology alignments improved by this step far exceeds the few gene models that may be negatively affected.  If repeat masking worries you, you do have the option to run ''ab initio'' gene predictors on both the masked and unmasked sequence.  You do this by setting <tt>unmask=1</tt> in the <tt>maker_opts.ctl</tt> configuration file.
-===''Ab Initio'' Gene Prediction===
-Following repeat masking, MAKER runs ''ab initio'' gene predictors specified by the user to produce preliminary gene models.  ''Ab initio'' gene predictors produce gene predictions based on underlying mathematical models describing patterns of intron/exon structure and consensus start signals.  Because the patterns of gene structure are going to differ from organism to organism, you must train gene predictors before you can use them.  I will discuss how to do this later on.
-{|
-| |
-[[Image:prediction.jpg|thumb|560px|'''Figure:''' Generate ''ab initio'' gene predictions]]
-|}
-MAKER currently supports:
-*SNAP (Works well, easy to train, not as good as others, especially on longer-intron genomes).
-*Augustus (Works great, hard to train, but getting better)
-*GeneMark (Self training, no hints, buggy, not good for fragmented genomes or long introns).
-*FGENESH (Works great, costs money even for training)
-You must specify in the <tt>maker_opts.ctl</tt> file the training parameters file you want to use when running each of these algorithms.
-===EST and Protein Evidence Alignment===
-A simple way to indicate if a sequence region is likely associated with a gene is to identify (A) if the region is actively being transcribed or (B) if the region has homology to a known protein.  This can be done by aligning Expressed Sequence Tags (ESTs) and proteins to the genome using alignment algorithms.
-*ESTs are sequences derived from a cDNA library.  Because of the difficulties associated with working with mRNA and depending on how the cDNA library was prepared, EST databases usually represent bits and pieces of transcribed mRNAs with only a few full length transcripts.  MAKER aligns these sequences to the genome using BLASTN.  If ESTs from the organism being annotated are unavailable or sparse, you can use ESTs from a closely related organism.  However, ESTs from closely related organisms are unlikely to align using BLASTN since nucleotide sequences can diverge quite rapidly.  For these ESTs, MAKER uses TBLASTX to align them in protein space.
-*Protein sequence generally diverges quite slowly over large evolutionary distances; as a result, proteins from even evolutionarily distant organisms can be aligned against raw genomic sequence to try to identify regions of homology.  MAKER does this using BLASTX.
-{|
-| |
-[[Image:evidence.jpg|thumb|560px|'''Figure:''' Align EST and protein evidence]]
-|}
-Remember now that we are aligning against the repeat-masked genomic sequence.  How is this going to affect our alignments?  For one thing we won't be able to align against low-complexity regions.  Some real proteins contain low-complexity regions and it would be nice to identify those, but if I let anything align to a low-complexity region, then I will get spurious alignments all over the genome.  Wouldn't it be nice if there was a way to allow BLAST to extend alignments through low-complexity regions, but only if there is already an alignment somewhere else?  You can do this with soft-masking.  If you remember, soft-masking uses lower case letters to mask sequence without losing the sequence information.  BLAST allows you to use soft-masking to keep alignments from seeding in low-complexity regions, while still allowing alignments to extend through them.  This of course will allow some of the spurious alignments you were trying to avoid, but overall you still end up suppressing the majority of poor alignments while letting through enough real alignments to justify the cost.  If this behavior bothers you, you can turn it off by setting <tt>softmask=0</tt> in the <tt>maker_bopts.ctl</tt> file.
-===Polishing Evidence Alignments===
-Because of oddities associated with how BLAST statistics work, BLAST alignments are not as informative as they could be.  BLAST will align regions anywhere it can, even if the algorithm aligns regions out of order, with multiple overlapping alignments in the exact same region, or with slight overhangs around splice sites.
-To get more informative alignments MAKER uses the program Exonerate to polish BLAST hits.  Exonerate realigns each sequence identified by BLAST around splice sites and forces the alignments to occur in order.  The result is a high quality alignment that can be used to suggest near exact intron/exon positions. Polished alignments are produced using the est2genome and protein2genome options for Exonerate.
-{|
-| |
-[[Image:polish.jpg|thumb|560px|'''Figure:''' Polish BLAST alignments with Exonerate]]
-|}
-One of the benefits of polishing EST alignments is the ability to identify the strand an EST derives from.  Because of amplification steps involved in building an EST library and limitations involved in some high throughput sequencing technologies, you don't necessarily know whether you're really aligning the forward or reverse transcript of an mRNA. However, if you take splice sites into account, you can only align to one strand correctly.
-===Integrating Evidence to Synthesize Annotations===
-Once you have ''ab initio'' predictions, EST alignments, and protein alignments you can integrate this evidence to produce even better gene predictions.  MAKER does this by "talking" to the gene prediction programs.  MAKER takes all the evidence, generates "hints" to where splice sites and protein coding regions are located, and then passes these "hints" to programs that will accept them.
-{|
-| |
-[[Image:hint.jpg|thumb|560px|'''Figure:''' Pass gene finders evidence-based ‘hints’]]
-|}
-MAKER produces hint based predictors for:
-*SNAP
-*Augustus
-*FGENESH
-*GeneMark (under development)
-===Selecting and Revising the Final Gene Model===
-MAKER then takes the entire pool of ''ab initio'' and evidence informed gene predictions, updates features such as 5' and 3' UTRs based on EST evidence, tries to determine alternative splice forms where EST data permits, produces quality control metrics for each gene model (this is included in the output), and then MAKER chooses from among all the gene model possibilities the one that best matches the evidence.  This is done using a modified sensitivity/specificity distance metric.
-{|
-| |
-[[Image:select.jpg|thumb|560px|'''Figure:''' Identify gene model most consistent with evidence*]]
-|}
-MAKER can use evidence from EST alignments to revise gene models to include features such as 5' and 3' UTRs.
-{|
-| |
-[[Image:revise.jpg|thumb|560px|'''Figure:''' Revise model further if necessary; create new annotation]]
-|}
-===Quality Control===
-Finally MAKER calculates quality control statistics to assist in downstream management and curation of gene models outside of MAKER.
-{|
-| |
-[[Image:statistics.jpg|thumb|560px|'''Figure:''' Compute support for each portion of the gene model]]
-|}
-==MAKER's Output==
-If you look in the current working directory, you will see that MAKER has created an output directory called <tt>dpp_contig.maker.output</tt>.  The name of the output directory is based on the input genomic sequence file, which in this case was <tt>dpp_contig.fasta</tt>.
-Now let's see what's inside the output directory.
-<pre class="enter">
- cd dpp_contig.maker.output
- ls -1
-</pre>
-You should now see a list of directories and files created by MAKER.
- dpp_contig_datastore
- dpp_contig_master_datastore_index.log
- maker_bopts.log
- maker_exe.log
- maker_opts.log
- mpi_blastdb
-*The <tt>maker_opts.log</tt>, <tt>maker_exe.log</tt>, and <tt>maker_bopts.log</tt> files are logs of the control files used for this run of MAKER.
-*The <tt>mpi_blastdb</tt> directory contains FASTA indexes and BLAST database files created from the input EST, protein, and repeat databases.
-*The <tt>dpp_contig_master_datastore_index.log</tt> contains information on both the run status of individual contigs and information on where individual contig data is stored.
-*The <tt>dpp_contig_datastore</tt> directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.
-Once a MAKER run is finished the most important file to look at is the <tt>dpp_contig_master_datastore_index.log</tt> to see if there were any failures.
- <span class="enter">cat dpp_contig_master_datastore_index.log</span>
-If everything proceeded correctly you should see the following.
- contig-dpp-500-500      dpp_contig_datastore/contig-dpp-500-500 STARTED
- contig-dpp-500-500      dpp_contig_datastore/contig-dpp-500-500 FINISHED
-There are only entries describing a single contig because there was only one contig in the example file.  These lines indicate that the contig <tt>contig-dpp-500-500</tt> STARTED and then FINISHED without incident.  Other possible entries include:
-*FAILED - indicates a failed run on this contig, MAKER will retry these
-*RETRY - indicates that MAKER is retrying a contig that failed
-*SKIPPED_SMALL - indicates the contig was too short to annotate (minimum contig length is specified in <tt>maker_opts.ctl</tt>)
-*DIED_SKIPPED_PERMANENT - indicates a failed contig that MAKER will not attempt to retry (number of times to retry a contig is specified in <tt>maker_opts.ctl</tt>)
-The entries in the <tt>dpp_contig_master_datastore_index.log</tt> file also indicate that the output files for this contig are stored in the directory <tt>dpp_contig_datastore/contig-dpp-500-500/</tt>.  Knowing where the output is stored may seem rather trivial; however, input genome fasta files can contain thousands or even hundreds of thousands of contigs, and many file-systems have performance problems with large numbers of sub-directories and files within a single directory.  Even when the underlying file-systems handle things gracefully, access via network file-systems can be an issue.  To deal with this situation, MAKER uses a datastore module to create a hierarchy of sub-directory layers, starting from a 'base', and mapping identifiers to corresponding sub-directories.  For situations where the input genome fasta file contains more than 1,000 contigs, the datastore structure is used automatically, and the <tt>master_datastore_index.log</tt> file becomes essential for identifying where the output for a given contig is stored.
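With thousands of contigs, reading the index log by eye is impractical. The sketch below shows one possible way to list contigs whose most recent status is not FINISHED. It is an illustration under assumptions: the sample log is hypothetical (contig names, paths, and the order of entries are made up), and it only assumes the three-column layout shown in the example output above.

```shell
# Hypothetical index log with three contigs; columns follow the example
# above: contig name, output directory, status. awk's default field
# splitting accepts tabs as well as spaces.
log=$(mktemp)
cat > "$log" <<'EOF'
contig-A datastore/contig-A STARTED
contig-A datastore/contig-A FINISHED
contig-B datastore/contig-B STARTED
contig-B datastore/contig-B FAILED
contig-B datastore/contig-B RETRY
contig-B datastore/contig-B FINISHED
contig-C datastore/contig-C STARTED
contig-C datastore/contig-C DIED_SKIPPED_PERMANENT
EOF

# Keep the last status seen for each contig, then list those not FINISHED.
unfinished=$(awk '{ last[$1] = $3; dir[$1] = $2 }
    END { for (c in last) if (last[c] != "FINISHED") print c, dir[c], last[c] }' "$log" | LC_ALL=C sort)
printf '%s\n' "$unfinished"
rm -f "$log"
```

Here only <tt>contig-C</tt> is reported, together with the directory in which to look for its partial output.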
-Now let's take a look at what MAKER produced for the contig 'contig-dpp-500-500'.
-<pre class="enter">
- cd dpp_contig_datastore/contig-dpp-500-500
- ls -1
-</pre>
-The directory should contain a number of files and a directory.
- contig-dpp-500-500.gff
- contig-dpp-500-500.maker.proteins.fasta
- contig-dpp-500-500.maker.transcripts.fasta
- run.log
- theVoid.contig-dpp-500-500
-*The <tt>contig-dpp-500-500.gff</tt> contains all annotations and evidence alignments in [[GFF3]] format.  This is the important file for use with [[Apollo]] or [[GBrowse]].
-*The <tt>contig-dpp-500-500.maker.transcripts.fasta</tt> and <tt>contig-dpp-500-500.maker.proteins.fasta</tt> files contain the transcript and protein sequences for MAKER produced gene annotations.
-*The <tt>run.log</tt> file is a log file.  If you change settings and rerun MAKER on the same dataset, or if you are running a job on an entire genome and the system fails, this file lets MAKER know what analyses need to be deleted, rerun, or can be carried over from a previous run.  One advantage of this is that rerunning MAKER is extremely fast, and your runs are virtually immune to all system failures.
-*The directory <tt>theVoid.contig-dpp-500-500</tt> contains raw output files from all the programs MAKER wraps around (BLAST, SNAP, RepeatMasker, etc.).  You can usually ignore this directory and its contents.
-==Viewing MAKER Annotations==
-Let's take a look at the [[GFF3]] file produced by MAKER.
- less contig-dpp-500-500.gff
-As you can see, manually viewing the raw GFF3 file produced by MAKER really isn't that meaningful.  While you can identify individual features such as genes, mRNAs, and exons, trying to interpret those features in the context of thousands of other genes and thousands of bases of sequence really can't be done by directly looking at the GFF3 file.
-For sanity check purposes it would be nice to have a graphical view of what's in the GFF3 file.  To do that, GFF3 files can be loaded into programs like [[Apollo]] and [[GBrowse]].
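Before opening a viewer, a quick tally of feature types can already serve as a rough sanity check. The sketch below counts features by source and type (columns 2 and 3 of GFF3). The sample records are hypothetical and heavily shortened; on real output you would point <tt>awk</tt> at the <tt>.gff</tt> file instead.

```shell
# Hypothetical, heavily shortened GFF3 content for illustration only.
gff=$(mktemp)
printf '%s\n' '##gff-version 3' > "$gff"
printf 'ctg1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n' >> "$gff"
printf 'ctg1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=m1;Parent=g1\n' >> "$gff"
printf 'ctg1\tmaker\texon\t100\t400\t.\t+\t.\tID=e1;Parent=m1\n' >> "$gff"
printf 'ctg1\tmaker\texon\t600\t900\t.\t+\t.\tID=e2;Parent=m1\n' >> "$gff"
printf 'ctg1\tblastx\tprotein_match\t120\t380\t55\t+\t.\tID=p1\n' >> "$gff"

# Count features by source (column 2) and type (column 3), skipping comment
# lines and any line that does not have the nine tab-separated GFF3 columns.
summary=$(awk -F '\t' '!/^#/ && NF == 9 { n[$2 "\t" $3]++ }
    END { for (k in n) print k "\t" n[k] }' "$gff" | LC_ALL=C sort)
printf '%s\n' "$summary"
rm -f "$gff"
```

The nine-column filter also skips any embedded sequence lines, so the same command can be tried on files that carry FASTA data after the annotations.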
-===Apollo===
-Let's load the <tt>contig-dpp-500-500.gff</tt> into [[Apollo]] and take a look at what MAKER produced. Copy the <tt>contig-dpp-500-500.gff</tt> file to your home directory to make it easy to locate.
- <span class="enter">cp contig-dpp-500-500.gff ~</span>
-Now before starting Apollo, MAKER comes with a configuration file that will allow Apollo to display MAKER annotations and evidence in nice color (otherwise everything will be the same color of white).  Copy the configuration file to the <tt>~/.apollo</tt> directory, to make the configuration file available to Apollo.
- <span class="enter">cp /home/gmod/Documents/Software/maker/GMOD/Apollo/gff3.tiers ~/.apollo/</span>
-Now open Apollo and select our [[GFF3]] file.
- /home/gmod/Documents/Software/Apollo/bin/apollo
-You will notice that there are a number of bars representing the gene annotations and the evidence alignments supporting those annotations.  Annotations are in the middle light colored panel, and evidence alignments are in the dark panels at the top and bottom.  As you have probably realized, this view is much easier to interpret than looking directly at the GFF3 file.
-Now click on each piece of evidence and you will see its source in the table at the bottom of the Apollo screen.
-Possible Sources Include:
-*BLASTN - BLASTN alignment of EST evidence
-*BLASTX - BLASTX alignment of protein evidence
-*TBLASTX - TBLASTX alignment of EST evidence from closely related organisms
-*EST2Genome - Polished EST alignment from Exonerate
-*Protein2Genome - Polished protein alignment from Exonerate
-*SNAP - SNAP ''ab initio'' gene prediction
-*GENEMARK - GeneMark ''ab initio'' gene prediction
-*Augustus - Augustus ''ab initio'' gene prediction
-*FgenesH  - FGENESH ''ab initio'' gene prediction
-*Repeatmasker - RepeatMasker identified repeat
-*RepeatRunner - RepeatRunner identified repeat from the repeat protein database
-=Advanced MAKER Configuration, Re-annotation Options, and Improving Annotation Quality =
-The remainder of this page addresses issues that can be encountered during the annotation process.  I then describe how MAKER can be used to resolve each issue.
-==Configuration Files in Detail==
-Let's take a closer look at the configuration options in the <tt>maker_opts.ctl</tt> file.
-<pre class="enter">
- cd /home/gmod/Documents/Data/maker/example1_dmel
- gedit maker_opts.ctl
-</pre>
-===Basic Input Files===
-All the basic input files for MAKER should be in FASTA format.
-*''genome'' - Genomic sequence file
-*''est'' - ESTs from the same organism or from a very closely related organism (e.g. chimpanzee to human).  These are aligned first via BLASTN with very strict filtering, so any sequence divergence can prohibit the alignment.
-*''altest'' - These are ESTs from other closely related organisms (e.g. mouse to human).  They are aligned via TBLASTX in protein space, so greater sequence divergence is permitted.
-*''protein'' - proteins from the same or other organisms.  These are aligned via BLASTX against the genome.  Proteins that align to a region will not necessarily be orthologous or paralogous.  The alignment may just be based on short regions such as a shared domain.  You may also get alignments to pseudogenes.  Polishing BLASTX hits with Exonerate helps identify what are likely true paralogs and orthologs.
-===Repeat Masking Options===
-Repeat masking is important for improving gene predictor performance and avoiding protein alignments to what are likely just retrotransposons.  You should also expect a certain amount of genomic contamination in the EST database; much of this contamination maps back to repeat regions.  By repeat masking we can avoid issues with all types of input data.
-*''model_org'' - This is a RepeatMasker option that lets you limit the repeat database to specific organisms or groups of organisms (i.e. vertebrates, Nematodes, ''Drosophila'', primates etc).  By default MAKER sets this to 'all'.
-*''repeat_protein'' - This is a fasta file of transposon and virus related proteins.  MAKER has an internal RepeatRunner database it uses by default.
-*''rmlib'' - This is a fasta file of nucleotide repeats provided by the user.  You can create a species specific repeat database using programs like PILER.
-===Gene Prediction Options===
-Gene prediction options affect the final gene annotations more than any other option type.  This brings up the point that electronically produced gene annotations will only be as good as the gene predictions they are based on.
-*''predictor'' - This tells MAKER what programs to run for generating annotations.
-**est2genome - Allows high quality spliced Exonerate EST alignments to become gene annotations.  This only happens when there is no gene prediction overlapping the region.  This is useful for generating gene annotations in the absence of a trained gene predictor.
-**protein2genome - Attempts to build gene models directly from protein alignments (works on prokaryotes only)
-**model_gff - This allows user defined gene models to be used
-**pred_gff - This allows user provided ''ab initio'' predictions
-**snap
-**augustus
-**genemark
-**fgenesh
-*''unmask'' - Produce ''ab initio'' gene predictions for unmasked sequence as well as for masked sequence
-*''snaphmm'' - SNAP training file (SNAP has some species files already available in the snap/HMM/ directory)
-*''gmhmm'' -  GeneMark training file (GeneMark self-trains and produces the resulting training file in the output mod/ directory)
-*''augustus_species'' - Augustus species ID (Augustus uses an internal species index rather than a simple set of training files.  Type 'augustus --species=help' to see the values you can choose)
-*''fgenesh_par_file'' - FGENESH training file
-===Other MAKER Options===
-*''evaluate'' - runs an experimental annotation quality analysis program (Evaluator) on each annotation.  Provides quantitative metrics for ranking annotations and identifying the features most in need of review.  I'd like to emphasize that this is experimental.
-*''max_dna_len'' -  sets the length for dividing up contigs into chunks for processing.  Larger chunks require more memory; smaller chunks require less memory.  Allows the user to control system memory usage.
-*''min_contig'' - sets the minimum length a contig must have or else it will be skipped.
-*''min_protein'' - sets the minimum length a predicted protein must have (in amino acids) to be annotated.
-*''split_hit'' - sets the expected max intron size for evidence alignments
-*''pred_flank'' - sets the length for the sequence surrounding clusters of EST and protein evidence that will be used when building hint based gene predictions.
-*''single_exon'' - tells MAKER to consider single exon EST evidence when generating annotations.  Single exon ESTs are more likely to be genomic contamination.
-*''single_length'' - sets the minimum length required for single exon ESTs if 'single_exon' is enabled
-*''keep_preds'' - adds non-overlapping ''ab initio'' gene predictions to the final annotation set rather than pushing them off into a separate file for the user to analyse.  These predictions by definition do not overlap any form of supporting evidence.
-*''retry'' - sets the number of times to retry a contig if there is a failure
-*''clean_try'' - removes all data from previous MAKER runs before retrying a contig
-*''clean_up'' - removes theVoid directory with individual raw analysis files at the end of the MAKER run
-*''TMP'' - specifies a directory other than the system default temporary directory (<tt>/tmp</tt>) for writing temporary files.  On some Linux systems the primary hard drive that also holds the default temporary directory is small, and most of the system's storage space is located on secondary hard drives mounted in directories elsewhere on the system.  This is often true of computer clusters where each node has its own small hard drive for booting purposes, and most storage space is network mounted.  Temporary files created by MAKER are deleted as the program advances, but individual files related to BLAST jobs can be quite large, so setting TMP to another location can be useful.
-==Training ''ab initio'' Gene Predictors==
-If you are involved in a genome project for an emerging model organism, you should already have an EST database which would have been generated as part of the original sequencing project.  A protein database can be collected from closely related organism genome databases or by using the UniProt/SwissProt protein database or the NCBI NR protein database.  However a trained ''ab initio'' gene predictor is a much more difficult thing to generate.  Gene predictors require existing gene models on which to base prediction parameters.  However, with emerging model organisms you are not likely to have any pre-existing gene models.  So how then are you supposed to train your gene prediction programs?
-MAKER gives the user the option to produce gene annotations directly from the EST evidence.  You can then use these imperfect gene models to train a gene predictor program.  Once you have re-run MAKER with the newly trained gene predictor, you can use the second set of gene annotations to train the gene predictor yet again.  This bootstrap process allows you to iteratively improve the performance of ''ab initio'' gene predictors.
-I've created an example file set so you can learn to train the gene predictor SNAP using this procedure.
-First let's move to the example directory.
-<pre class="enter">
- cd /home/gmod/Documents/Data/maker/example2_pyu
- ls -1
-</pre>
-You should see the following files (plus others) in the directory
- pyu-contig.fasta
- pyu-est.fasta
- pyu-protein.fasta
-We need to build maker configuration files and populate the appropriate values.
-<pre class="enter">
- maker -CTL
- gedit maker_opts.ctl
-</pre>
-{{TextEditorLink|gedit}}
-Edit the following:
-<pre class="enter">
- genome=pyu-contig.fasta
- est=pyu-est.fasta
- protein=pyu-protein.fasta
- est2genome=1
-</pre>
-MAKER is now configured to generate annotations from the EST data, so start the program (this will take a minute to run).
-<pre class="enter">
- maker
-</pre>
-Once finished load the file <tt>pyu-contig.maker.output/pyu-contig_datastore/scf1117875581239.gff</tt> into [[Apollo]].  You will see that there are far more regions with evidence alignments than there are gene annotations.  This is because there are so few spliced ESTs that are capable of generating gene models.
-Now exit [[Apollo]]. We now need to convert the [[GFF3]] gene models to ZFF format.  This is the format SNAP requires for training.  To do this we need to collect all GFF3 files into a single directory.
-<pre class="enter">
- mkdir gff
- cp pyu-contig.maker.output/pyu-contig_datastore/09/14/scf1117875582023/scf1117875582023.gff gff/
- cd gff
- maker2zff scf1117875582023.gff
- ls -1
-</pre>
-There should now be two new files. The first is the ZFF format file and the second is a FASTA file the coordinates can be referenced against. These will be used to train SNAP.
- genome.ann
- genome.dna
-The basic steps for training SNAP are first to filter the input gene models, then to capture genomic sequence immediately surrounding each model locus, and finally to use those captured segments to produce the HMM. You can explore the internal SNAP documentation for more details if you wish.
-<pre class="enter">
- fathom -categorize 1000 genome.ann genome.dna
- fathom -export 1000 -plus uni.ann uni.dna
- forge export.ann export.dna
- hmm-assembler.pl Pult . > Pult.hmm
- cd ..
-</pre>
-The final training parameters file is <tt>Pult.hmm</tt>.  We do not expect SNAP to perform that well with this training file because it is based on incomplete gene models; however, this file is a good starting point for further training.
-We need to run MAKER again with the new HMM file we just built for SNAP.
-<pre class="enter">
- gedit maker_opts.ctl
-</pre>
-{{TextEditorLink|gedit}}
-And set:
-<pre class="enter">
- snaphmm=gff/Pult.hmm
- est2genome=0
-</pre>
-And run
-<pre class="enter">
- maker
-</pre>
-Now let's look at the output once again in [[Apollo]].  When you examine the annotations you should notice that the final MAKER gene models, displayed in light blue, are more abundant now and are in relatively good agreement with the evidence alignments.  However, the SNAP ''ab initio'' gene predictions in the evidence tier do not yet match the evidence that well.  This is because SNAP predictions are based solely on the mathematical descriptions in the HMM, whereas MAKER models also use evidence alignments to help further inform gene models.  This demonstrates why, for emerging model organism genomes, you get better performance by running ''ab initio'' gene predictors like SNAP inside of MAKER rather than using them to produce gene models by themselves.  The fact that the MAKER models are in better agreement with the evidence than the current SNAP models also means I can use the MAKER models to retrain SNAP in a bootstrap fashion, thereby improving SNAP's performance and consequently MAKER's performance.
-Close Apollo, retrain SNAP, and run MAKER again.
-<pre class="enter">
- mkdir gff2
- cp pyu-contig.maker.output/pyu-contig_datastore/scf1117875582023/scf1117875582023.gff gff2/
- cd gff2
- maker2zff.pl scf1117875582023.gff
- fathom -categorize 1000 genome.ann genome.dna
- fathom -export 1000 -plus uni.ann uni.dna
- forge export.ann export.dna
- hmm-assembler.pl Pult . > Pult2.hmm
- cd ..
- gedit maker_opts.ctl
-</pre>
-Change configuration file.
-<pre class="enter">
- snaphmm=gff2/Pult2.hmm
-</pre>
-Run maker.
-<pre class="enter">
- maker
-</pre>
-Let's examine the [[GFF3]] file one last time in [[Apollo]].  As you can see, there is now a marked improvement in both the MAKER and SNAP gene models, and the two sets of models are in closer agreement with each other.
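Each training round above repeats the same conversion and training commands, so they can be wrapped in a small shell function. This is only a sketch: the function name <tt>train_snap</tt> and its arguments are hypothetical, and it uses the command name <tt>maker2zff</tt> as written in the first round (the second round above writes <tt>maker2zff.pl</tt>; use whichever name your installation provides).

```shell
# train_snap GFF_FILE WORK_DIR NAME
# Runs one SNAP training round in WORK_DIR and writes WORK_DIR/NAME.hmm.
# The individual commands and their arguments are the ones used above.
train_snap() {
    gff=$1
    dir=$2
    name=$3
    mkdir -p "$dir" || return 1
    cp "$gff" "$dir/" || return 1
    (
        # The subshell keeps the caller's working directory unchanged.
        cd "$dir" || exit 1
        maker2zff "$(basename "$gff")" &&
        fathom -categorize 1000 genome.ann genome.dna &&
        fathom -export 1000 -plus uni.ann uni.dna &&
        forge export.ann export.dna &&
        hmm-assembler.pl "$name" . > "$name.hmm"
    )
}

# Example call (not executed here), mirroring the second round above:
#   train_snap path/to/scf1117875582023.gff gff2 Pult2
```

After a call like the one in the comment, <tt>snaphmm</tt> in <tt>maker_opts.ctl</tt> would be pointed at the new <tt>.hmm</tt> file before running MAKER again.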
-==MAKER Web Annotation Service==
-As you have all experienced with the previous examples, running programs on the command line can seem difficult.  Many users might feel overwhelmed by trying to install and run a program like MAKER locally, especially if they are not very familiar with Linux.  For those individuals, our lab has produced the MAKER Web Annotation Service (MWAS).  MWAS is a website where you can run MAKER over the web without having to install any software locally, and you are provided with a much more user friendly interface for configuring MAKER and viewing results.
-*Go to http://www.yandell-lab.org and select MWAS from the tabbed menu.  You will see a link at the bottom of the page to access the MAKER Web Annotation Server.  On the MWAS server page log in as a guest, then select 'New Job' from the top of the page.
-Scrolling down the page, you should notice there are options to select the genome file, EST and protein evidence files, and choose ''ab initio'' gene predictors.  At the top of the page select '''Example Jobs''' &rarr; '''D. melanogaster :Dpp''' and click '''Load'''.
-[[Image:select_dpp.jpg]]
-Now if you scroll down, you should notice that the values for your genome, EST and protein files have been filled out for you.  At the bottom of the page click '''Add Job to Queue'''.  You will now be sent to the job status page.
-[[Image:status.jpg|560px|]]
-You will need to click '''Refresh Job Status''' a couple of times until your job finishes.  When your job is finished you will see an '''icon''' in the column marked '''Log'''. Click it. A window will come up displaying any errors that occurred for your job, so ideally this window will be blank. Next click on the '''View Results icon'''.
-[[Image:results.jpg]]
-The results window will provide a brief summary of the status of each contig in your job, and will give you the opportunity to download the data, or view the results for individual contigs.  Click on '''View in Apollo'''. This will open your data in Apollo ([[User:Elee|Ed Lee]] will describe just how launching Apollo over the web works during the [[Apollo]] section).  Then close Apollo and click on '''SOBA statistics'''.  This will open up a tool from the Sequence Ontology Consortium that provides simple summary statistics of features in a [[GFF3]] file.
-==mRNAseq==
-mRNAseq is a high throughput technique for sequencing the entire transcriptome, and it holds the promise of allowing researchers to identify all exons and alternative splice forms for every gene in the genome with a single experiment.  It may soon make gene predictors (mostly) a thing of the past.
-*Still need to de-convolute reads & evidence (for now)
-*Still need to archive, manage, and distribute annotations
-[[Image:MRNAseq.jpg | 560px | ]]
-By mapping mRNAseq reads using programs like [http://tophat.cbcb.umd.edu/ TopHat] and [http://bowtie-bio.sourceforge.net/index.shtml Bowtie], you can create [[GFF3]] files of read islands and junctions.  This data can then be passed in as EST evidence and will be used for generating hint based gene prediction and for choosing final annotations.
-Load example on MWAS site.
-http://derringer.genetics.utah.edu/MWAS/
-==Merge/Resolve Legacy Annotations==
-Legacy annotations
-*Many are no longer maintained by original creators
-*In some cases more than one group has annotated the same genome, using very different procedures, even different assemblies
-*Many investigators have their own genome-scale data and would like a private set of annotations that reflect these data
-*There will be a need to revise, merge, evaluate, and verify legacy annotation sets in light of RNA-seq and other data
-[[Image:Legacy.png|560px|]]
-MAKER will:
-*Identify legacy annotation most consistent with new data
-*Automatically revise it in light of new data
-*If no existing annotation, create new one
-Load example on MWAS class site.
-http://derringer.genetics.utah.edu/MWAS/
-==MPI Support==
-[[MAKER]] optionally supports Message Passing Interface (MPI), a parallel computation communication protocol primarily used on computer clusters.  This allows MAKER jobs to be broken up across multiple nodes/processors for increased performance and scalability.
-[[Image:Mpi_maker.png|560px|]]
-To use this feature, you must have MPICH2 installed with the <tt>--enable-sharedlibs</tt> flag set during installation (see the MPICH2 Installer's Guide).  I have installed this for you.  So let's set up MPI_MAKER and run the example file that comes with MAKER.
-<pre class="enter">
- cd ~/Documents/Software/maker/src
- perl Build.PL
-</pre>
-Accept the default, which builds MAKER with MPI support.
-<pre class=enter>
- ./Build install
-</pre>
-You should now see the executable <tt>mpi_maker</tt> listed among the MAKER scripts (<tt>/maker/bin</tt>).  Let's run some example data to see if MPI_MAKER is working properly.
-<pre class="enter">
- cd ~
- mkdir ~/maker_run2
- cd maker_run2
- cp ~/Documents/Software/maker/data/dpp_* ~/maker_run2
- maker -CTL
- gedit maker_opts.ctl
-</pre>
-Set values in maker configuration files.
-<pre class="enter">
- genome=dpp_contig.fasta
- est=dpp_est.fasta
- protein=dpp_protein.fasta
- snap=/home/gmod/Documents/Software/maker/exe/snap/HMM/fly
-</pre>
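-The same values can be set without opening an editor.  The sketch below assumes that each option in <tt>maker_opts.ctl</tt> sits on its own line as <tt>key=value</tt>, optionally followed by a <tt>#</tt> comment.  The helper name <tt>set_ctl_opt</tt> is hypothetical and is not part of MAKER.

```shell
# set_ctl_opt FILE KEY VALUE
# Hypothetical helper (not shipped with MAKER): rewrite the line that starts
# with "KEY=" so that it reads "KEY=VALUE", keeping any trailing "#" comment.
# Assumes one option per line in the control file.
set_ctl_opt() {
    ctl_file=$1
    ctl_key=$2
    ctl_value=$3
    ctl_tmp=$(mktemp) || return 1
    awk -v k="$ctl_key" -v v="$ctl_value" '
        index($0, k "=") == 1 {
            c = index($0, "#")
            comment = (c > 0) ? " " substr($0, c) : ""
            print k "=" v comment
            next
        }
        { print }
    ' "$ctl_file" > "$ctl_tmp" && mv "$ctl_tmp" "$ctl_file"
}

# Example calls, to be run in the directory that holds maker_opts.ctl:
#   set_ctl_opt maker_opts.ctl genome  dpp_contig.fasta
#   set_ctl_opt maker_opts.ctl est     dpp_est.fasta
#   set_ctl_opt maker_opts.ctl protein dpp_protein.fasta
```

-Matching on the full <tt>key=</tt> prefix means that setting <tt>est</tt> does not touch other options whose names merely begin with the same letters.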
-We need to set up a few more things for MPI to work.  Type <tt>mpd</tt> to see a list of instructions.
-<pre class="enter">
- mpd
-</pre>
-You should see the following.
- configuration file /home/gmod/mpd.conf not found
- A file named .mpd.conf file must be present in the user's home
- directory (/etc/mpd.conf if root) with read and write access
- only for the user, and must contain at least a line with:
- MPD_SECRETWORD=<secretword>
- One way to safely create this file is to do the following:
-   cd $HOME
-   touch .mpd.conf
-   chmod 600 .mpd.conf
- and then use an editor to insert a line like
-   MPD_SECRETWORD=mr45-j9z
- into the file.  (Of course use some other secret word than mr45-j9z.)
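-The steps printed by <tt>mpd</tt> can be scripted.  The sketch below is one way to do it: <tt>make_mpd_conf</tt> is a hypothetical helper name, and the secret word is drawn from <tt>/dev/urandom</tt>, which is an assumption about your system rather than anything <tt>mpd</tt> requires.

```shell
# make_mpd_conf PATH
# Hypothetical helper: create an mpd configuration file readable and writable
# only by its owner, containing a randomly generated MPD_SECRETWORD line.
# Refuses to overwrite an existing file.
make_mpd_conf() {
    mpd_conf=$1
    if [ -e "$mpd_conf" ]; then
        echo "$mpd_conf already exists; leaving it untouched" >&2
        return 1
    fi
    # Create the file and restrict permissions before writing the secret.
    touch "$mpd_conf" && chmod 600 "$mpd_conf" || return 1
    # 16 random bytes rendered as 32 hexadecimal characters.
    mpd_secret=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
    printf 'MPD_SECRETWORD=%s\n' "$mpd_secret" >> "$mpd_conf"
}

# Example call:
#   make_mpd_conf "$HOME/.mpd.conf"
```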
-Follow the instructions to set this file up, and start the mpi environment with <tt>mpdboot</tt>.  Then run <tt>mpi_maker</tt> through the MPI manager <tt>mpiexec</tt>.
-<pre class="enter">
- mpdboot
- mpiexec -n 2 mpi_maker
-</pre>
-<tt>mpiexec</tt> is a wrapper that handles the MPI environment.  The <tt>-n 2</tt> flag tells <tt>mpiexec</tt> to use 2 cpus/nodes when running <tt>mpi_maker</tt>.  For a large cluster, this could be set to something like 100.  You should now know how to start a MAKER job via MPI.
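-On a single workstation, a reasonable value for <tt>-n</tt> is the number of available cores, up to some limit you choose.  The sketch below shows one way to compute it; <tt>mpi_proc_count</tt> is a hypothetical helper name, and it assumes <tt>getconf _NPROCESSORS_ONLN</tt> is supported on your system (it falls back to 1 otherwise).

```shell
# mpi_proc_count MAX
# Hypothetical helper: print the number of online cores, capped at MAX.
mpi_proc_count() {
    mpi_max=$1
    mpi_cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$mpi_cores" -gt "$mpi_max" ]; then
        echo "$mpi_max"
    else
        echo "$mpi_cores"
    fi
}

# Example call, using at most 8 processes:
#   mpiexec -n "$(mpi_proc_count 8)" mpi_maker
```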
-==User Interface for Local MAKER Installation==
-<div class="emphasisbox">
-This example did not work during class because of a conflict with the version of Apache that was installed.  The issue has since been fixed.  Before beginning the example, open a terminal and remove the following files; otherwise the subversion update of MAKER will fail.
-<pre class="enter">
- rm ~/Documents/Software/maker/MWAS/bin/mwas_server
- rm ~/Documents/Software/maker/MWAS/cgi-bin/tt_templates/apollo_webstart.tt
-</pre>
-Then update maker via subversion.
-<pre class="enter">
- svn update ~/Documents/Software/maker/
-</pre>
-</div>
-The MWAS interface provides a very convenient method for running MAKER and viewing results; however, because compute resources are limited, users are only allowed to submit a maximum of 2 megabases of sequence per job.  So while MWAS might be suitable for some analyses (e.g. annotating BACs and short preliminary assemblies), if you plan on annotating an entire genome you will need to install MAKER locally.  But if you like the convenience of the MWAS user interface, you can optionally install the interface on top of a locally installed version of MAKER for use in your own lab.
-Under the <tt>maker</tt> directory there is a subdirectory called <tt>MWAS</tt>, which contains all the files needed to build the MAKER web interface.  The <tt>maker/MWAS/bin/mwas_server</tt> file is used to set up and run this web interface.  Let's configure that now.  There are three steps to setting up the server: first create and edit a server configuration file, then load all other configuration files, and finally install all files to the appropriate web accessible directory.
-<pre class="enter">
- cd /home/gmod/Documents/Software/maker/MWAS/
- bin/mwas_server PREP
-</pre>
-This will create a file in <tt>/maker/MWAS/config/</tt> called <tt>server.ctl</tt>.  We will need to edit this file before continuing.
-<pre class="enter">
- gedit config/server.ctl
-</pre>
-Set:
-<pre class="enter">
- apache_user:www-data
- web_address:http://localhost
- cgi_dir:/usr/lib/cgi-bin/maker
- cgi_web:/cgi-bin/maker
- html_dir:/var/www/maker
- html_web:/maker
- data_dir:/var/www/maker/data
- use_login:0
-</pre>
-Now we need to generate other settings that are dependent on the values in
-<tt>server.ctl</tt>.
-<pre class="enter">
- bin/mwas_server CONFIG
-</pre>
-Several new configuration files should now be loaded in the <tt>config/</tt> directory.  These new files define default MAKER options for the server and the location of files for the server dropdown menus.
- maker_bopts.ctl
- maker_exe.ctl
- maker_opts.ctl
- menus.ctl
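-Before moving on, you may want to confirm that all four files were actually written.  The sketch below checks for them by name; <tt>check_mwas_config</tt> is a hypothetical helper and is not one of the <tt>mwas_server</tt> commands.

```shell
# check_mwas_config DIR
# Hypothetical helper: report any of the four generated control files that
# are missing from DIR.  Returns 0 if all are present, 1 otherwise.
check_mwas_config() {
    cfg_dir=$1
    cfg_missing=0
    for cfg_file in maker_bopts.ctl maker_exe.ctl maker_opts.ctl menus.ctl; do
        if [ ! -f "$cfg_dir/$cfg_file" ]; then
            echo "missing: $cfg_dir/$cfg_file" >&2
            cfg_missing=1
        fi
    done
    return "$cfg_missing"
}

# Example call from the MWAS directory:
#   check_mwas_config config && echo "configuration files are in place"
```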
-We shouldn't need to edit any of these files.  So let's copy files to the appropriate web accessible directories.  This must be done as root or using <tt>sudo</tt>.
-<pre class="enter">
-  sudo bin/mwas_server SETUP
-</pre>
-If you set <tt>APOLLO_ROOT</tt> in the <tt>server.ctl</tt> file, then you can now set up a special Java Web Start version of [[Apollo]] to view results directly from the web interface.  Web Start will be described in more detail in the Apollo session.  This must be done as root or using <tt>sudo</tt>.
-<pre class="enter">
- sudo bin/mwas_server APOLLO
-</pre>
-We can now run MAKER examples using this web interface, but first we need to launch a server to monitor for new job submissions.
-<pre class="enter">
- sudo bin/mwas_server START
-</pre>
-And then go to
-: http://localhost/maker
-==MAKER Accessory Scripts==
-MAKER comes with a number of accessory scripts that are meant to assist in manipulations of the MAKER input and output files.
-Scripts:
-*''add_utr_start_stop_gff'' - Adds explicit 5' and 3' UTR as well as start and stop codon features to the GFF3 output file
-:<pre> add_utr_start_stop_gff <gff3_file></pre>
-*''add_utr_to_gff3.pl'' - Adds explicit 5' and 3' UTR features to the [[GFF3]] output file
-:<pre> add_utr_gff.pl <gff3_directory></pre>
-*''cegma2zff'' -  This script converts a GFF output file from CEGMA into ZFF format for use in SNAP training.  Output files are always genome.ann and genome.dna
-:<pre> cegma2zff <cegma_gff> <genome_fasta></pre>
-*''chado2gff3'' -  This script takes default CHADO database content and produces GFF3 files for each contig/chromosome.
-:<pre> chado2gff3 [OPTION] <database_name></pre>
-*''compare'' -  This script compares the contents of a GFF3 file to a CHADO database to look for merged, split and missing genes.
-:<pre> compare [OPTION] <database_name> <gff3_file></pre>
-*''cufflinks2gff3'' -  This script converts the cufflinks output transcripts.gtf file into GFF3 format for use in MAKER via GFF3 passthrough. By default, strandless features, which correspond to single exon cufflinks models, will be ignored.  This is because these features can correspond to repetitive elements and pseudogenes. Output is to STDOUT, so you will need to redirect it to a file.
-:<pre> cufflinks2gff3 <transcripts1.gtf> <transcripts2.gtf> ...</pre>
-*''evaluator'' - Evaluates the quality of an annotation set.
-:<pre>  mpi_evaluator [options] <eval_opts> <eval_bopts> <eval_exe></pre>
-*''fasta_merge'' - Collects all of MAKER's fasta file output for each contig and merges them to make genome level fastas
-:<pre> fasta_merge -d <datastore_index> -o <outfile></pre>
-*''fasta_tool'' - The script can search, reformat, and manipulate a fasta file in a variety of ways.
-*''fix_fasta'' - Deprecated, use fasta_tool
-*''genemark_gtf2gff3'' - This converts genemark's GTF output into GFF3 format. The script prints to STDOUT. Use the '>' character to redirect output into a file.
-:<pre> genemark_gtf2gff3 <filename></pre>
-*''gff3_2_gtf'' - Converts MAKER GFF3 files to GTF format (run add_utr_start_stop_gff first to get UTR features)
-:<pre> gff3_2_gtf <gff3_file></pre>
-*''gff3_merge'' - Collects all of MAKER's GFF3 file output for each contig and merges them to make a single genome level GFF3
-:<pre> gff3_merge -d <datastore_index> -o <outfile></pre>
-*''gff3_preds2models'' - Converts the gene prediction match/match_part format to annotation gene/mRNA/exon/CDS format
-:<pre> gff3_preds2models <gff3 file> <pred list></pre>
-*''gff3_to_eval_gtf'' - This script converts MAKER GFF3 files into GTF formatted files for the program EVAL (an annotation sensitivity/specificity evaluating program).  The script will only extract features explicitly declared in the GFF3 file, and will skip implicit features (i.e. UTR, start codons, and stop codons).  To extract implicit features to the GTF file, you will first need to explicitly declare them in the GFF3 file. This can be done by calling the script add_utr_to_gff3 to add formal declaration lines to the GFF3 file.
-:<pre> gff3_to_eval_gtf <maker_gff3_file></pre>
-*''iprscan2gff3'' - Takes InterProScan (iprscan) output and generates GFF3 features representing domains. Interesting tier for GBrowse.
-:<pre> iprscan2gff3 <iprscan_file> <gff3_fasta></pre>
-*''iprscan_batch'' - Wrapper for iprscan to take advantage of multiprocessor systems.
-:<pre> iprscan_batch <file_name> <cpus> <log_file></pre>
-*''iprscan_wrap'' - A wrapper that will run iprscan
-*''ipr_update_gff'' - Takes InterProScan (iprscan) output and maps domain IDs and GO terms to the Dbxref and Ontology_term attributes in the GFF3 file.
-:<pre> ipr_update_gff <gff3_file> <iprscan_file></pre>
-*''maker2chado'' - This script takes MAKER produced GFF3 files and dumps them into a [[Chado]] database.  You must set the database up first according to CHADO installation instructions.  CHADO provides its own methods for loading GFF3, but this script makes it easier for MAKER specific data.  You can either provide the datastore index file produced by MAKER to the script or add the GFF3 files as command line arguments.
-:<pre>  maker2chado [OPTION] <database_name> <gff3file1> <gff3file2> ...</pre>
-*''maker2jbrowse'' - This script will produce a JBrowse data set from MAKER gff3 files.
-:<pre> maker2jbrowse [OPTION] <gff3file1> <gff3file2> ...</pre>
-*''maker2zff.pl'' - Pulls out MAKER gene models from the MAKER GFF3 output and converts them into ZFF format for SNAP training.
-:<pre> maker2zff.pl <gff3_file></pre>
-*''maker_functional''
-*''maker_functional_fasta'' - Maps putative functions identified from BLASTP against UniProt/SwissProt to the MAKER produced transcript and protein fasta files.
-:<pre> maker_functional_fasta <uniprot_fasta> <blast_output> <fasta1> <fasta2> <fasta3> ...</pre>
-*''maker_functional_gff'' - Maps putative functions identified from BLASTP against UniProt/SwissProt to the MAKER produced GFF3 files in the Note attribute.
-:<pre> maker_functional_gff <uniprot_fasta> <blast_output> <gff3_1></pre>
-*''maker_map_ids'' - Build shorter IDs/Names for MAKER genes and transcripts following the NCBI suggested naming format.
-:<pre> maker_map_ids --prefix PYU1_ --justify 6 genome.all.gff > genome.all.id.map</pre>
-*''map2assembly'' - Maps old gene models to a new assembly where possible.
-:<pre>  map2assembly <genome.fasta> <transcripts.fasta></pre>
-*''map_data_ids'' - This script takes an ID map file and changes the name of the ID in a data file.  The map file is a tab delimited file with two columns: old_name and new_name.  The data file is assumed to be tab delimited by default, but this can be altered with the delimit option.  The ID in the data file can be in any column and is specified by the col option, which defaults to the first column.
-:<pre> map_data_ids genome.all.id.map data.txt</pre>
-*''map_fasta_ids'' - Maps short IDs/Names to MAKER fasta files.
-:<pre> map_fasta_ids <map_file> <fasta_file></pre>
-*''map_gff_ids'' -  Maps short IDs/Names to MAKER GFF3 files; old IDs/Names are mapped to the Alias attribute.
-:<pre> map_gff_ids <map_file> <gff3_file></pre>
-*''split_fasta'' - Splits multi-fasta files into the number of files specified by the user.  Useful for breaking up MAKER jobs.
-:<pre> split_fasta [count] <input_fasta></pre>
-*''tophat2gff3'' - This script converts the junctions file produced by TopHat into GFF3 format for use with MAKER.
-:<pre> tophat2gff3 <junctions.bed></pre>
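-Several of these scripts are meant to be used together.  As an illustration, the ID renaming scripts listed above can be chained as sketched below.  The wrapper name <tt>rename_maker_ids</tt> is hypothetical, the <tt>--justify 6</tt> value is simply copied from the usage line above, and the sketch assumes the three MAKER scripts are on your <tt>PATH</tt>.  Check each script's own usage message to see whether it edits files in place before running this on data you have not backed up.

```shell
# rename_maker_ids PREFIX GFF3_FILE [FASTA_FILE ...]
# Hypothetical wrapper: build an ID map from a merged MAKER GFF3 file, then
# apply that map to the GFF3 file and to any FASTA files given.
rename_maker_ids() {
    ids_prefix=$1
    ids_gff=$2
    shift 2
    # genome.all.gff -> genome.all.id.map, following the usage example above.
    ids_map="${ids_gff%.gff}.id.map"
    maker_map_ids --prefix "$ids_prefix" --justify 6 "$ids_gff" > "$ids_map" || return 1
    map_gff_ids "$ids_map" "$ids_gff" || return 1
    for ids_fasta in "$@"; do
        map_fasta_ids "$ids_map" "$ids_fasta" || return 1
    done
}

# Example call (file names are placeholders):
#   rename_maker_ids PYU1_ genome.all.gff transcripts.fasta proteins.fasta
```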
-==References==
-= Evaluation =
-{{Feedback}}
-{{NextSession|Chado|Chado}}