GBrowse syn PAG tutorial
This tutorial walks you through how to install and configure the GBrowse_syn comparative genomics viewer. This tutorial was originally taught by Sheldon McKay at the 2009 GMOD Summer School - Europe & Americas. The notes and VMware image used on this page are from the Europe course.
Contents
- 1 VMware
- 2 Caveats
- 3 The Generic Synteny Browser
- 4 Installing GBrowse_syn
- 5 Loading and Configuration of GBrowse_syn
- 6 Optional Advanced Section
VMware
This tutorial was taught using a VMware system image as a starting point. If you want to start with that same system, download and install the Starting image.
See VMware for what software you need to use a VMware system image, and for directions on how to get the image set up and running on your machine.
Caveats
Important Note
This tutorial describes the world as it existed on the day the tutorial was given. Please be aware that things like CPAN modules, Java libraries, and Linux packages change over time, so the instructions in this tutorial will gradually become out of date. Newer versions of tutorials will be posted as they become available.
The Generic Synteny Browser
GBrowse_syn, or the Generic Synteny Browser, is a GBrowse-based synteny browser designed to display multiple genomes, with a central reference species compared to two or more additional species. It can display multiple sequence alignment data, or synteny and co-linearity data from other sources, against genome annotations provided by GBrowse. GBrowse_syn is included with the standard GBrowse package (version 1.69 and later). Working examples can be seen at TAIR, WormBase, and SGN.
GBrowse_syn Introduction
- View the introductory presentation on GBrowse_syn
- It would be a good idea to download or acquire the sample data during the above presentation.
GBrowse_syn Documentation
There is detailed documentation on the GMOD wiki for how to install, configure and use GBrowse_syn. To get started, browse these pages:
- GBrowse_syn overview
- Installation
- Configuration
- Alignment Data
- The user interface
- Presentations and workshops
Whole Genome Alignments
The focus of this section of the course is on working with alignment or synteny data in GBrowse_syn. However, generating whole genome alignments, identifying orthologous regions, and related topics are of considerable interest, so some background reading is listed below:
- Primer on Hierarchical Genome Alignment Strategies
- article on PECAN and ENREDO
- all about PECAN
- The gene annotations for each species are in GFF files.
- The alignment data are in a constrained CLUSTALW format. (They were not generated by the program CLUSTALW, which is not necessarily suitable for whole genome alignments.)
- Processing the alignment data is very computationally intensive, so we will load pre-processed data to get a head start.
Installing GBrowse_syn
GBrowse_syn is part of the GBrowse package and was pre-installed when you went through the GBrowse installation.
Loading and Configuration of GBrowse_syn
The example we will use is a two-species comparison of rice (Oryza sativa) and one of its wild relatives*
*Data courtesy of Bonnie Hurwitz; sequences and names have been obfuscated to protect unpublished data
To download these data to the Ubuntu virtual disk:
$ mkdir -p ~/data/gbrowse_syn
$ cd ~/data/gbrowse_syn
$ wget http://mckay.cshl.edu/downloads/rice.tar.bz2
$ tar xjvf rice.tar.bz2
Create a MySQL database
- GBrowse_syn uses a "joining" database to store all of the alignment data
- The first thing we need to do is create a MySQL alignment database using the command-line incantation below:
$ mysql -uroot -e 'create database rice_synteny'
- Then make sure the web user "nobody" can read the database. Pay special attention to the quotes!
$ mysql -uroot -e "GRANT SELECT on rice_synteny.* to 'nobody'@'localhost'"
Loading the alignment data
The alignment data file
Have a look at the input data in clustalw format:
$ cd ~/data/gbrowse_syn/rice
$ more data/rice.aln
CLUSTAL W(1.81) multiple sequence alignment W(1.81)

rice-3(+)/16598648-16600199      ggaggccggccgtctgccatgcgtgagccagacggggcgggccggagacaggccacgtgg
wild_rice-3(+)/14467855-14469373 gggggccgg------------------------------------agacaggccacgtgg
                                 ** ******                                    ***************

rice-3(+)/16598648-16600199      ccctgccccgggctgttgacccactggcacccctgtcccgggttgtcgccctcctttccc
wild_rice-3(+)/14467855-14469373 ccctgccccgggctgttgacccactggcacccctgtcccgggttgtcgccctcctttccc
                                 ************************************************************

rice-3(+)/16598648-16600199      cgccatgctctaagtttgctcctcttctcgaacttctctctttgattcttcacgtcctct
wild_rice-3(+)/14467855-14469373 cgccatgctctaagtttgctcctcttctcgaacttctctctttgattcttcacgtcctct
                                 ************************************************************

rice-3(+)/16598648-16600199      tggagcctccccttctagctcgatcacgctctgctcttccgcttggaggctggcaaaact
wild_rice-3(+)/14467855-14469373 tggagcctccccttctagctcgatcgcgctctgctcttccgcttggaggctggcaaaact
                                 ************************* **********************************
Note on CLUSTALW
These data are in clustalw format. The scripts used to process these data will recognize clustalw and other commonly used formats recognized by BioPerl's AlignIO parser. This does not mean that clustalw is the actual program used to generate the alignment data.
- This particular alignment file in clustalw format was generated using part of the Compara pipeline.
- See this generalized hierarchical whole genome alignment workflow for general information on how whole genome alignment data can be generated.
Note on the sequence ID syntax
The sequence ID in this clustal file is overloaded to contain information about the species, strand and coordinates. This information is essential:
rice-3(+)/16598648-16600199
species-refseq(strand)/start-end
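The ID layout above can be unpacked with a short shell sketch. This is only an illustration of the naming convention, not part of GBrowse_syn: parse_seq_id is a hypothetical helper name, and the sketch assumes the reference sequence name itself contains no hyphen or parenthesis.

```shell
#!/bin/sh
# parse_seq_id (hypothetical helper): split an ID of the form
#   species-refseq(strand)/start-end
# into five space-separated fields: species refseq strand start end.
# Prints nothing if the ID does not follow the convention.
parse_seq_id() {
    printf '%s\n' "$1" |
        sed -n -E 's|^(.+)-([^-()]+)\(([+-])\)/([0-9]+)-([0-9]+)$|\1 \2 \3 \4 \5|p'
}

parse_seq_id 'rice-3(+)/16598648-16600199'        # prints: rice 3 + 16598648 16600199
parse_seq_id 'wild_rice-3(+)/14467855-14469373'   # prints: wild_rice 3 + 14467855 14469373
```

Because the species name is taken greedily up to the last hyphen before the parenthesis, names such as wild_rice that contain underscores are handled without special cases.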
The database loading script
Then, we will load the database.
- We will use the loading script load_alignments_msa.pl, which will accept clustalw as an input format.
We are using the options:
 -u root          -- username is root
 -d rice_synteny  -- use database rice_synteny
 -f clustalw      -- use clustalw format (spelled -format in the command below)
 -c               -- initialize a new (or overwrite the old) database
 -v               -- print information about what is happening
Other available options that we do not need here:
 -p password      -- not used because the root user has no password in this implementation
 -n               -- do not calculate map coordinates (faster)
We will be running the script with this command-line incantation (see below):
$ ../bin/load_alignments_msa.pl -u root -d rice_synteny -format clustalw -c -v data/rice.aln
Running in the background with the linux screen command
Using screen: Loading the alignments is time-consuming, so we will run the script in a screen session in the background while we turn our attention to downstream tasks. [more information on 'screen'...]
- When entering screen mode, hit 'space' to clear the first screen if a message appears.
- If your backspace key does not work in screen mode, use ^H (ctrl key + H key).
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ screen -S load1
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ ~/data/gbrowse_syn/bin/load_alignments_msa.pl -u root -d rice_synteny -format clustalw -v rice.aln -c
Processing alignment file rice.aln...
Processing Multiple Sequence Alignment 1 (length 1557)
Processing Multiple Sequence Alignment 2 (length 11275)
Processing Multiple Sequence Alignment 3 (length 3526)
Processing Multiple Sequence Alignment 4 (length 5992)
Processing Multiple Sequence Alignment 5 (length 24267)
Processing Multiple Sequence Alignment 6 (length 697)
Processing Multiple Sequence Alignment 7 (length 6798)
Processing Multiple Sequence Alignment 8 (length 4760)
Processing Multiple Sequence Alignment 9 (length 4595)
Processing Multiple Sequence Alignment 10 (length 95)
Processing Multiple Sequence Alignment 11 (length 479)
Processing Multiple Sequence Alignment 12 (length 9123)
Processing Multiple Sequence Alignment 13 (length 80)
Processing Multiple Sequence Alignment 14 (length 11864)
Processing Multiple Sequence Alignment 15 (length 775)
etc...
- This will go on for some time (there are 1800 alignments), so we will let the screen run in the background and work on our other tasks. We do this like so:
- hit ^A (ctrl key + A key), then release
- hit the D key, which will detach the screen (continues to run in the background)
- We can check back later like so:
$ screen -r load1
- If the job is done, we can exit the session by typing 'exit' at the command prompt.
Setting up the species' databases
GFF3
- Each of the species' databases will be installed in MySQL using the Bio::DB::SeqFeature::Store adapter for GFF3
Let's have a look at the GFF3 data:
$ more rice.gff3
##gff-version 3
##sequence-region 3 1 19401704
3	ensembl	gene	78	1849	.	-	.	ID=3_FG2548;Name=3_FG2548;biotype=protein_coding
3	ensembl	mRNA	78	1849	.	-	.	ID=3_FGT2548;Parent=3_FG2548;Name=3_FGT2548;biotype=protein_coding
3	ensembl	CDS	1645	1849	.	-	0	Parent=3_FGT2548;Name=CDS.12
3	ensembl	CDS	1444	1547	.	-	1	Parent=3_FGT2548;Name=CDS.13
3	ensembl	CDS	999	1144	.	-	0	Parent=3_FGT2548;Name=CDS.14
3	ensembl	CDS	799	913	.	-	2	Parent=3_FGT2548;Name=CDS.15
3	ensembl	CDS	646	786	.	-	0	Parent=3_FGT2548;Name=CDS.16
3	ensembl	CDS	78	215	.	-	0	Parent=3_FGT2548;Name=CDS.17
3	ensembl	gene	4910	5518	.	+	.	ID=3_FG2546;Name=3_FG2546;biotype=protein_coding
3	ensembl	mRNA	4910	5518	.	+	.	ID=3_FGT2546;Parent=3_FG2546;Name=3_FGT2546;biotype=protein_coding
3	ensembl	CDS	4910	5518	.	+	0	Parent=3_FGT2546;Name=CDS.19
3	ensembl	gene	5743	6351	.	-	.	ID=3_FG2565;Name=3_FG2565;biotype=protein_coding
3	ensembl	mRNA	5743	6351	.	-	.	ID=3_FGT2565;Parent=3_FG2565;Name=3_FGT2565;biotype=protein_coding
3	ensembl	CDS	5743	6351	.	-	0	Parent=3_FGT2565;Name=CDS.21
3	ensembl	gene	10979	16914	.	+	.	ID=3_FG2570;Name=3_FG2570;biotype=protein_coding
3	ensembl	mRNA	10979	16914	.	+	.	ID=3_FGT2570;Parent=3_FG2570;Name=3_FGT2570;biotype=protein_coding
3	ensembl	CDS	10979	11592	.	+	0	Parent=3_FGT2570;Name=CDS.29
3	ensembl	CDS	11670	13317	.	+	2	Parent=3_FGT2570;Name=CDS.30
3	ensembl	CDS	13390	14204	.	+	0	Parent=3_FGT2570;Name=CDS.31
3	ensembl	CDS	14433	16914	.	+	2	Parent=3_FGT2570;Name=CDS.32
Some key things to note:
- The ##sequence-region directive is used to create a reference sequence named 3, which is the scaffold on which all of the other features in the file are located.
- The 'gene' features are the top-level parent features. The 'mRNA' and 'CDS' features are descendants of the gene. The containment hierarchy is organized using the 'Parent' tag: the CDSs are children of the mRNA, which is in turn a child of the gene. For display purposes, we only need to worry about the gene.
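The Parent-based hierarchy described above can be made visible with a small awk sketch. It is an illustration only: gff_parents and sample_gff are hypothetical helper names, and the three sample records are copied from the rice.gff3 listing with explicit tab separators. For each feature the sketch prints its type, its ID and its Parent, using "-" where the attribute is absent.

```shell
#!/bin/sh
# sample_gff (hypothetical): emit three records from the listing above, tab-separated.
sample_gff() {
    printf '##gff-version 3\n'
    printf '3\tensembl\tgene\t78\t1849\t.\t-\t.\tID=3_FG2548;Name=3_FG2548;biotype=protein_coding\n'
    printf '3\tensembl\tmRNA\t78\t1849\t.\t-\t.\tID=3_FGT2548;Parent=3_FG2548;Name=3_FGT2548;biotype=protein_coding\n'
    printf '3\tensembl\tCDS\t1645\t1849\t.\t-\t0\tParent=3_FGT2548;Name=CDS.12\n'
}

# gff_parents (hypothetical): print "type ID Parent" for every feature line.
# Reads the files given as arguments, or standard input if there are none.
gff_parents() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }              # skip directives and incomplete lines
        {
            id = "-"; parent = "-"
            n = split($9, attrs, ";")        # column 9 holds tag=value pairs
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/)     id = substr(attrs[i], 4)
                if (attrs[i] ~ /^Parent=/) parent = substr(attrs[i], 8)
            }
            print $3, id, parent
        }
    ' "$@"
}

sample_gff | gff_parents
# gene 3_FG2548 -
# mRNA 3_FGT2548 3_FG2548
# CDS - 3_FGT2548
```

Running gff_parents rice.gff3 in the data directory should show the same gene, mRNA, CDS chain for every gene model in the file.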
Loading
- Loading the GFF3 into the MySQL database is the same procedure that is used for loading GFF3 databases for GBrowse.
- It uses the bioperl script bp_seqfeature_load.pl.
Note: before we load the GFF3 databases, we need to create a database for each species and give the web user 'nobody' read privileges. Let's create a little SQL script to make this easier:
- This is just a list of SQL commands that give instructions to the mysql database manager, which we can pass via STDIN
- Create a file create_species_dbs.sql with the contents below.
CREATE DATABASE rice;
CREATE DATABASE wild_rice;
GRANT SELECT on rice.* TO 'nobody'@'localhost';
GRANT SELECT on wild_rice.* TO 'nobody'@'localhost';
- Then we can run the commands like so:
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ mysql -uroot <create_species_dbs.sql
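When more than two species are involved, the same SQL can be generated rather than typed by hand. The sketch below is one possible way to do that and is not part of the tutorial's tooling: species_db_sql is a hypothetical helper that only prints statements, so nothing touches MySQL until the output is piped to the client.

```shell
#!/bin/sh
# species_db_sql (hypothetical): print CREATE DATABASE statements for every
# database name given, followed by read-only grants for the web user 'nobody'.
species_db_sql() {
    for db in "$@"; do
        printf 'CREATE DATABASE %s;\n' "$db"
    done
    for db in "$@"; do
        printf "GRANT SELECT on %s.* TO 'nobody'@'localhost';\n" "$db"
    done
}

# Preview the statements for the two species used in this tutorial.
species_db_sql rice wild_rice
```

To apply the statements instead of previewing them, the output could be piped to the client in the same way as the SQL file: species_db_sql rice wild_rice | mysql -uroot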
- Make sure we are in the location of the GFF data files
$ cd ~/data/gbrowse_syn/rice/data
- The script we need is bp_seqfeature_load.pl, which comes pre-installed with bioperl-live
- The -f option means "fast load"
- The -c option means complete (or destructive) load. It would overwrite previously loaded 'rice' databases
Load the rice data...
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ bp_seqfeature_load.pl -u root -d rice -c -f rice.gff3
loading rice.gff3...
Building object tree... 0.53s4s
Loading bulk data into database... 0.65s
load time: 11.74s
and repeat for wild rice...
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ bp_seqfeature_load.pl -u root -d wild_rice -c -f wild_rice.gff3
loading wild_rice.gff3...
Building object tree... 0.55s7s
Loading bulk data into database... 0.66s
load time: 11.98s
- The alignment database loading should also be done by now; we can check like so:
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ screen -r load1
Setting up the Configuration Files
Copy the configuration files to the installation directory. Note that you will need root privileges to do this.
Change to the conf directory and make sure we have the files...
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ cd ../conf
gmod@ubuntu:~/data/gbrowse_syn/rice/conf$ ls
header.txt  oryza.synconf  rice_synteny.conf  wild_rice_synteny.conf
- The default configuration location for Ubuntu Linux is /etc/apache2/gbrowse.conf; copy the files there:
gmod@ubuntu:~/data/gbrowse_syn/rice/conf$ sudo cp *conf /etc/apache2/gbrowse.conf
[sudo] password for gmod:
A Species Config File
File: rice_synteny.conf
[GENERAL]
description = Domestic rice chromosome 3
db_adaptor  = Bio::DB::SeqFeature::Store
db_args     = -adaptor DBI::mysql -dsn dbi:mysql:rice;host=localhost -user nobody

# examples to show in the introduction
examples = 3:51418..52015 3:67260..67704

# what image widths to offer
image widths = 450 640 800 1024

# default width of detailed view (pixels)
default width = 1024

initial landmark = 3:200000..300000

# Web site configuration info
stylesheet = /gbrowse/gbrowse.css
buttons    = /gbrowse/images/buttons
tmpimages  = /gbrowse/tmp

# max and default segment sizes for detailed view
max segment     = 5000000
default segment = 5000

# zoom levels
zoom levels = 50 100 200 1000 2000 5000 10000 20000 40000 50000 100000 500000 1000000 5000000

# colors of the overview, detailed map and key
overview bgcolor = lightgrey
detailed bgcolor = lightgoldenrodyellow
key bgcolor      = beige

default features = EG

balloon tips = 1

[TRACK DEFAULTS]
glyph         = generic
height        = 10
bgcolor       = lightgrey
fgcolor       = black
font2color    = blue
label density = 25
link          = AUTO
link_target   = _blank
title         = Hello, my name is $name!

################## TRACK CONFIGURATION ####################
# the remainder of the sections configure individual tracks
###########################################################

[EG]
feature     = gene:ensembl
glyph       = gene
height      = 10
bgcolor     = peachpuff
fgcolor     = hotpink
description = 0
label       = 0
category    = Transcripts
key         = ensembl gene
The GBrowse_syn Config File
File: oryza.synconf
[GENERAL]
description = BLASTZ alignments for Oryza sativa

# The synteny database
join = dbi:mysql:database=rice_synteny;host=localhost;user=nobody

# This option maps the relationship between the species data sources, names and descriptions
# The value for "name" (the first column) is the symbolic name that gbrowse_syn uses to identify each species.
# This value is also used in two other places:
# 1) the species name in the "examples" directive
# 2) the species name in the .aln file
# The value for "conf. file" is the basename of the corresponding gbrowse .conf file.
# This value is also used to identify the species configuration stanzas at the bottom of the configuration file.
#            name       conf. file         Description
source_map = rice       rice_synteny       "Domestic Rice (O. sativa)"
             wild_rice  wild_rice_synteny  "Wild Rice"

tmpimages  = /gbrowse/tmp
imagewidth = 800
stylesheet = /gbrowse/gbrowse.css
cache time = 1

config_extension = conf

# Note the include statement below. It loads a boiler-plate header to be displayed at the top of the gbrowse_syn installation
#include header.txt

# example searches to display
examples = rice 3:16050173..16064974
           wild_rice 3:1..400000

zoom levels = 5000 10000 25000 50000 100000 200000 400000

# species-specific databases
[rice_synteny]
tracks = EG
color  = blue

[wild_rice_synteny]
tracks = EG
color  = red
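The comments in oryza.synconf say that the symbolic names in source_map must match the species names used in the .aln file. One way to check this is to list the species names that actually occur in the alignment file and compare them with the first column of source_map by eye. The sketch below assumes the ID convention used in this tutorial; aln_species and sample_aln are hypothetical helper names, and the embedded sample is a shortened stand-in for data/rice.aln, not the real file.

```shell
#!/bin/sh
# sample_aln (hypothetical): a few truncated lines in the style of rice.aln.
sample_aln() {
    printf 'CLUSTAL W(1.81) multiple sequence alignment\n\n'
    printf 'rice-3(+)/16598648-16600199      ggaggccgg\n'
    printf 'wild_rice-3(+)/14467855-14469373 gggggccgg\n'
    printf '                                 ** ******\n'
}

# aln_species (hypothetical): print each distinct species name found in the
# sequence IDs of a clustalw-style file (arguments) or of standard input.
aln_species() {
    sed -n -E 's|^(.+)-[^-()]+\([+-]\)/[0-9]+-[0-9]+[[:space:]].*$|\1|p' "$@" | sort -u
}

sample_aln | aln_species
# rice
# wild_rice
```

Against the tutorial data this would be run as aln_species data/rice.aln from ~/data/gbrowse_syn/rice; any name printed that is missing from source_map points to a mismatch between the alignment file and the configuration.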
This should complete the installation. Time to test it out...
Testing the rice and wild_rice data sources in GBrowse
- If things have worked out, you should see something like the image below when you point your browser to:
http://localhost/cgi-bin/gbrowse/rice
Note you will use 'localhost' if you are running your browser within the VMware player.
Viewing the data in GBrowse_syn
- Cross your fingers
http://172.16.109.133/cgi-bin/gbrowse_syn/oryza
Note that the IP address will vary
Optional Advanced Section
This optional session will use five pre-built databases.
To download these data to the Ubuntu virtual disk:
$ cd ~/data/gbrowse_syn
$ wget ftp://ftp.gmod.org/pub/gmod/Courses/2009/SummerSchoolEurope/nematodes.tar.bz2
$ tar xjvf nematodes.tar.bz2
Deal with the databases (these are MySQL dumps)
$ cd ~/data/gbrowse_syn/nematodes/mysql_dumps
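The script load.pl is:
#!/usr/bin/perl -w
use strict;

# Load every MySQL dump in the current directory into a database named after the file.
while (<*.sql>) {
    my ($name) = /(\S+)\.sql/;
    # "if exists" avoids an error on the first run, when the database is not there yet
    system "mysql -uroot -e 'drop database if exists $name'";
    system "mysql -uroot -e 'create database $name'";
    system "mysql -uroot $name <$_";
    print "$name loaded\n";
}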
$ ./load.pl &
This will take a while to run. It is building five MySQL databases, two GBrowse_syn data sources (pecan and orthocluster data sets) and one species' database for each of C. elegans, C. remanei and C. briggsae.
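Before running a loader that drops and recreates databases, it can be useful to preview what it will do. The following shell sketch mirrors the per-file steps of the loader as a dry run: plan_dump_load is a hypothetical helper that only prints the mysql commands, and elegans.sql is a placeholder file name rather than a confirmed name from the archive.

```shell
#!/bin/sh
# plan_dump_load (hypothetical): for each dump file, print the commands that
# would recreate a database named after the file and load the dump into it.
# Nothing is executed; the function only writes to standard output.
plan_dump_load() {
    for dump in "$@"; do
        name=$(basename "$dump" .sql)     # e.g. elegans.sql -> elegans
        printf "mysql -uroot -e 'DROP DATABASE IF EXISTS %s'\n" "$name"
        printf "mysql -uroot -e 'CREATE DATABASE %s'\n" "$name"
        printf 'mysql -uroot %s < %s\n' "$name" "$dump"
    done
}

plan_dump_load elegans.sql
# mysql -uroot -e 'DROP DATABASE IF EXISTS elegans'
# mysql -uroot -e 'CREATE DATABASE elegans'
# mysql -uroot elegans < elegans.sql
```

In the mysql_dumps directory, plan_dump_load *.sql would list the commands for all five dumps; piping that output to sh would execute them.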
NOTE: you will need to give the user 'nobody', as specified in the database connection section of the configuration files, SELECT access to the MySQL databases.
- You must be user 'root' to do this.
- You can use the command-line incantation below:
$ mysql -uroot -e "GRANT SELECT on *.* TO 'nobody'@'localhost'"
Loading Sample Configuration Files
- Get the config files
$ cd ~/data/gbrowse_syn/nematodes/conf
$ sudo cp * /etc/apache2/gbrowse.conf
$ cd /etc/apache2
Point your web browser at:
http://localhost/cgi-bin/gbrowse_syn/orthocluster/