GMOD Malaysia 2014/MAKER Tutorial
This MAKER tutorial was presented by Michael Campbell at GMOD Malaysia
2014, February 2014.
- This tutorial requires MAKER version 2.x.
The most recent MAKER tutorial can be found at the
MAKER Tutorial page.
This tutorial uses an AWS AMI.
About MAKER
MAKER is an easy-to-use genome annotation pipeline designed for small
research groups with little bioinformatics experience. However, MAKER is
also designed to be scalable and is thus appropriate for projects of any
size, including use by large sequencing centers. MAKER can be used for
de novo annotation of newly sequenced genomes, for updating existing
annotations to reflect new evidence, or just to combine annotations,
evidence, and quality control statistics for use with other GMOD
programs like GBrowse, JBrowse, Chado, and Apollo.
MAKER has been used in many genome annotation projects:
- Schmidtea mediterranea - planaria (A Alvarado, Stowers Institute)
PubMed
- Pythium ultimum - oomycete (R Buell, Michigan State Univ.)
PubMed
- Pinus taeda - Loblolly pine (A Stambolia-Kovach, Univ. California
Davis) PubMed
- Atta cephalotes - leaf-cutter ant (C Currie, Univ. Wisconsin,
Madison) PubMed
- Linepithema humile - Argentine ant (CD Smith, San Francisco State
Univ.) PubMed
- Pogonomyrmex barbatus - red harvester Ant (J Gadau, Arizona State
Univ.) PubMed
- Conus bullatus - cone snail (B Olivera Univ. Utah)
PubMed
- Petromyzon marinus - Sea lamprey (W Li, Michigan State)
PubMed
- Fusarium circinatum - pine pitch canker (B Wingfield, Univ.
Pretoria) - Manuscript in preparation
- Cardiocondyla obscurior - tramp ant (J Gadau, Arizona State Univ.) -
Manuscript in preparation
- Columba livia - pigeon (M Shapiro, Univ. Utah) - Manuscript in
preparation
- Megachile rotundata - alfalfa leafcutter bee () - Manuscript in
preparation
- Latimeria menadoensis - african coelacanth () -
PubMed
- Nannochloropsis - micro algae (SH Shiu, Michigan State Univ.)
PubMed
- Arabidopsis - thale cress re-annotation (E Huala, TAIR) - Manuscript
in preparation
- Cronartium quercuum - rust fungus (JM Davis, Univ. Florida) -
Annotation in progress
- Ophiophagus hannah - King cobra (T. Castoe, Univ. Colorado) -
Annotation in progress
- Python molurus - Burmese python (T. Castoe, Univ. Colorado) -
Annotation in progress
- Lactuca sativa - Lettuce (RW Michelmore) - Annotation in progress
- parasitic nematode genomes (M Mitreva, Washington Univ)
- Diabrotica virgifera - corn rootworm beetle (H Robertson, Univ.
Illinois)
- Oryza sativa - rice re-annotation (R Buell, MSU)
- Zea mays - maize re-annotation (C Lawrence, MaizeGDB)
- Cephus cinctus - wheat stem sawfly (H Robertson, Univ. Illinois)
- Rhagoletis pomonella - apple maggot fly (H Robertson, Univ.
Illinois)
Introduction to Genome Annotation
What Are Annotations?
Annotations are descriptions of different features of the genome, and
they can be structural or functional in nature.
Examples:
To use this feature, you must have MPICH2 installed with the
--enable-sharedlibs flag set during installation (see the MPICH2
Installer’s Guide). Alternatively, install OpenMPI and allow shared
libraries by adding a line like this to your profile:
export LD_PRELOAD=/home/ubuntu/applications/maker/exe/openmpi/lib/libmpi.so:$LD_PRELOAD
I have installed this for you. So let’s set up MPI_MAKER and run the
example file that comes with MAKER.
cd /usr/local/maker/src
perl Build.PL
Answer Yes when asked whether to build MAKER with MPI support.
./Build install
Set the following values in the MAKER configuration files:
genome=dpp_contig.fasta
est=dpp_est.fasta
protein=dpp_protein.fasta
snap=$PATH_TO_SNAP/HMM/fly
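If you would rather script these settings than edit the file by hand, the following is a minimal sketch. It assumes each option sits on its own line as key=value, optionally followed by a # comment, which is an assumption about the control file layout rather than something stated above. The helper name set_ctl_opt is hypothetical and is not part of MAKER.

```shell
# set_ctl_opt <ctl_file> <key> <value>
# Hypothetical helper: rewrite the line that starts with "<key>=" so that
# it carries <value>, keeping any trailing "# comment" on that line.
# Assumes one option per line in key=value form.
set_ctl_opt() {
    awk -v k="$2" -v v="$3" '
        # Only lines that begin with exactly "key=" are rewritten.
        index($0, k "=") == 1 {
            c = index($0, "#")
            comment = (c > 0) ? " " substr($0, c) : ""
            print k "=" v comment
            next
        }
        # Every other line is copied through unchanged.
        { print }
    ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example calls (commented out; they expect the control file to exist):
# set_ctl_opt maker_opts.ctl genome  dpp_contig.fasta
# set_ctl_opt maker_opts.ctl est     dpp_est.fasta
# set_ctl_opt maker_opts.ctl protein dpp_protein.fasta
```

Because the match requires the line to start with the key followed directly by an equals sign, setting est does not touch options whose names merely begin with est.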
We need to set up a few more things for MPI to work. Type mpd to see a
list of instructions.
mpd
You should see the following.
configuration file /home/ubuntu/.mpd.conf not found
A file named .mpd.conf file must be present in the user's home
directory (/etc/mpd.conf if root) with read and write access
only for the user, and must contain at least a line with:
MPD_SECRETWORD=<secretword>
One way to safely create this file is to do the following:
cd $HOME
touch .mpd.conf
chmod 600 .mpd.conf
and then use an editor to insert a line like
MPD_SECRETWORD=mr45-j9z
into the file. (Of course use some other secret word than mr45-j9z.)
Follow the instructions to set this file up, and start the MPI
environment with mpdboot. Then run maker through the MPI manager
mpiexec.
mpdboot
mpiexec -n 2 maker </dev/null
mpiexec is a wrapper that handles the MPI environment. The -n 2 flag
tells mpiexec to use 2 CPUs/nodes when running maker. For a large
cluster, this could be set to something like 100. You should now know
how to start a MAKER job via MPI.
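The .mpd.conf steps printed by mpd above can also be scripted. The sketch below follows the same order (create the file, restrict it to the owner, then write the secret word) and draws a random secret instead of a hand-typed one. The function name make_mpd_conf and the use of od on /dev/urandom are illustrative choices, not part of MPICH2 or MAKER.

```shell
# make_mpd_conf [path]
# Hypothetical helper: create an mpd configuration file containing a
# random MPD_SECRETWORD. The path defaults to $HOME/.mpd.conf.
make_mpd_conf() {
    conf="${1:-$HOME/.mpd.conf}"
    # Refuse to overwrite an existing configuration.
    if [ -e "$conf" ]; then
        echo "$conf already exists; leaving it untouched" >&2
        return 1
    fi
    # Create the file and make it readable and writable by the owner only
    # before any secret is written to it.
    touch "$conf" && chmod 600 "$conf" || return 1
    # Read 6 random bytes and print them as 12 hexadecimal characters.
    secret=$(od -An -N6 -tx1 /dev/urandom | tr -d ' \n')
    printf 'MPD_SECRETWORD=%s\n' "$secret" > "$conf"
}

# Example (commented out): create the file, then start the MPI environment.
# make_mpd_conf && mpdboot
```

Setting the permissions before writing the secret means the secret word is never present in a file that other users can read.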
MAKER Accessory Scripts
MAKER comes with a number of accessory scripts that assist in
manipulations of the MAKER input and output files.
Scripts:
- cegma2zff - This script converts a GFF file output by CEGMA into ZFF
format for use in SNAP training. Output files are always genome.ann
and genome.dna
cegma2zff <cegma_gff> <genome_fasta>
- chado2gff3 - This script takes default CHADO database content and
produces GFF3 files for each contig/chromosome.
chado2gff3 [OPTION] <database_name>
- compare - This script compares the contents of a GFF3 file to a
CHADO database to look for merged, split and missing genes.
compare [OPTION] <database_name> <gff3_file>
- cufflinks2gff3 - This script converts the cufflinks output
transcripts.gtf file into GFF3 format for use in MAKER via GFF3
passthrough. By default, strandless features that correspond to
single exon cufflinks models will be ignored, because these features
can correspond to repetitive elements and pseudogenes. Output is to
STDOUT, so you will need to redirect it to a file.
cufflinks2gff3 <transcripts1.gtf> <transcripts2.gtf> ...
- evaluator - Evaluates the quality of an annotation set.
mpi_evaluator [options] <eval_opts> <eval_bopts> <eval_exe>
- fasta_merge - Collects all of MAKER’s fasta file output for each
contig and merges them to make genome level fastas
fasta_merge -d <datastore_index> -o <outfile>
- fasta_tool - The script can search, reformat, and manipulate a fasta
file in a variety of ways.
- fix_fasta - Deprecated, use fasta_tool
- genemark_gtf2gff3 - This converts genemark’s GTF output into GFF3
format. The script prints to STDOUT. Use the ‘>’ character to
redirect output into a file.
genemark_gtf2gff3 <filename>
- gff3_merge - Collects all of MAKER’s GFF3 file output for each
contig and merges them to make a single genome level GFF3
gff3_merge -d <datastore_index> -o <outfile>
- gff3_preds2models - Deprecated. Instead, pass the predictions to
MAKER in GFF3 format using the pred_gff= option in maker_opts.ctl and
set keep_preds=1
- maker2eval_gtf - This script converts MAKER GFF3 files into GTF
formatted files for the program EVAL (an annotation
sensitivity/specificity evaluating program). The script will only
extract features explicitly declared in the GFF3 file, and will skip
implicit features (i.e. UTR, start codons, and stop codons). To
extract implicit features to the GTF file, you will first need to
explicitly declare them in the GFF3 file. This can be done by calling
the script add_utr_to_gff3 to add formal declaration lines to the GFF3
file.
maker2eval_gtf <maker_gff3_file>
- iprscan2gff3 - Takes InterProScan (iprscan) output and generates GFF3
features representing domains. These make an interesting tier for
GBrowse.
iprscan2gff3 <iprscan_file> <gff3_fasta>
- iprscan_wrap - A wrapper that will run iprscan
- ipr_update_gff - Takes InterProScan (iprscan) output and maps
domain IDs and GO terms to the Dbxref and Ontology_term attributes in
the GFF3 file.
ipr_update_gff <gff3_file> <iprscan_file>
- maker2chado - This script takes MAKER produced GFF3 files and dumps
them into a Chado database. You must set the database up first
according to CHADO installation instructions. CHADO provides its own
methods for loading GFF3, but this script makes it easier for MAKER
specific data. You can either provide the datastore index file
produced by MAKER to the script or add the GFF3 files as command line
arguments.
maker2chado [OPTION] <database_name> <gff3file1> <gff3file2> ...
- maker2jbrowse - This script will produce a JBrowse data set from
MAKER GFF3 files.
maker2jbrowse [OPTION] <gff3file1> <gff3file2> ...
- maker2zff - Pulls out MAKER gene models from the MAKER GFF3 output
and converts them into ZFF format for SNAP training.
- maker_functional_fasta - Maps putative functions identified from
BLASTP against UniProt/SwissProt to the MAKER produced transcript and
protein fasta files.
maker_functional_fasta <uniprot_fasta> <blast_output> <fasta1> <fasta2> <fasta3> ...
- maker_functional_gff - Maps putative functions identified from
BLASTP against UniProt/SwissProt to the MAKER produced GFF3 files in
the Note attribute.
maker_functional_gff <uniprot_fasta> <blast_output> <gff3_1>
- maker_map_ids - Build shorter IDs/Names for MAKER genes and
transcripts following the NCBI suggested naming format.
maker_map_ids --prefix PYU1_ --justify 6 genome.all.gff > genome.all.id.map
- map2assembly - Maps old gene models to a new assembly where
possible.
map2assembly <genome.fasta> <transcripts.fasta>
- map_data_ids - This script takes an ID map file and changes the name
of the ID in a data file. The map file is a tab delimited file with
two columns: old_name and new_name. The data file is assumed to be tab
delimited by default, but this can be altered with the delimit option.
The ID in the data file can be in any column and is specified by the
col option, which defaults to the first column.
map_data_ids genome.all.id.map data.txt
- map_fasta_ids - Maps short IDs/Names to MAKER fasta files.
map_fasta_ids <map_file> <fasta_file>
- map_gff_ids - Maps short IDs/Names to MAKER GFF3 files, old
IDs/Names are mapped to the Alias attribute.
map_gff_ids <map_file> <gff3_file>
- tophat2gff3 - This script converts the junctions file produced by
TopHat into GFF3 format for use with MAKER.
tophat2gff3 <junctions.bed>
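To make the two-column map file used by map_data_ids, map_fasta_ids, and map_gff_ids more concrete, here is a rough awk sketch of the renaming step described for map_data_ids. It is not the MAKER script and makes no claim about how that script is implemented. The function name apply_id_map is hypothetical, and the sketch prints the result to STDOUT as an illustrative choice.

```shell
# apply_id_map <map_file> <data_file> [column]
# Hypothetical illustration: replace IDs in one column of a tab-delimited
# data file using a two-column map file (old_name<TAB>new_name).
# The column defaults to 1. The result is printed to STDOUT.
apply_id_map() {
    awk -F '\t' -v OFS='\t' -v col="${3:-1}" '
        # First file (the map): remember old_name -> new_name.
        NR == FNR { map[$1] = $2; next }
        # Second file (the data): swap the ID when the map knows it.
        ($col in map) { $col = map[$col] }
        # Print every data row, renamed or not.
        { print }
    ' "$1" "$2"
}

# Example (commented out; file names follow the usage line shown earlier):
# apply_id_map genome.all.id.map data.txt > data.renamed.txt
```

Rows whose ID does not appear in the map are passed through unchanged, so a partial map does not drop data.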