Difference between revisions of "GBrowse syn PAG tutorial"

Revision as of 13:46, 13 January 2010



This tutorial walks you through how to install and configure the GBrowse_syn comparative genomics viewer. This tutorial was originally taught by Sheldon McKay at the 2009 GMOD Summer School - Europe & Americas. The notes and VMware image used on this page are from the Europe course.



VMware

This tutorial was taught using a VMware system image as a starting point. If you want to start with that same system, download and install the Starting image.

See VMware for what software you need to use a VMware system image, and for directions on how to get the image setup and running on your machine.

Download
Starting Image

Ending Image


Username: gmod
Password: gmod

Caveats

Important Note

This tutorial describes the world as it existed on the day the tutorial was given. Please be aware that things like CPAN modules, Java libraries, and Linux packages change over time, so the instructions in the tutorial will slowly drift out of date. Newer versions of tutorials will be posted as they become available.

The Generic Synteny Browser

GBrowse_syn, as implemented at WormBase

GBrowse_syn, or the Generic Synteny Browser, is a GBrowse-based synteny browser designed to display multiple genomes, with a central reference species compared to two or more additional species.  It can be used to view multiple sequence alignment data, synteny or co-linearity data from other sources against genome annotations provided by GBrowse. GBrowse_syn is included with the standard GBrowse package (version 1.69 and later).  Working examples can be seen at TAIR, WormBase, and SGN.

Gbrowse_syn Introduction

GBrowse_syn Documentation

There is detailed documentation on the GMOD wiki for how to install, configure and use GBrowse_syn. To get started, browse these pages:

Whole Genome Alignments

The focus of this section of the course is on dealing with alignment or synteny data and using GBrowse_syn. However, how to generate whole genome alignments, identify orthologous regions, and so on is a subject of considerable interest, so some background reading is listed below:

Installing GBrowse_syn

GBrowse_syn is part of the GBrowse package and was pre-installed when you went through the GBrowse installation.

Loading and Configuration of GBrowse_syn

The example we will use is a two-species comparison of rice (Oryza sativa) and one of its wild relatives*

*Data courtesy of Bonnie Hurwitz; sequences and names have been obfuscated to protect unpublished data

The instructions for downloading these data to the Ubuntu virtual disk:

$ mkdir ~/data/gbrowse_syn
$ cd ~/data/gbrowse_syn
$ wget http://mckay.cshl.edu/downloads/rice.tar.bz2
$ tar xjvf rice.tar.bz2

Create a MySQL database

  • GBrowse_syn uses a "joining" database to store all of the alignment data
  • The first thing we need to do is create a MySQL alignment database using the command-line incantation below:
$ mysql -uroot -e 'create database rice_synteny'
  • Then make sure the web user "nobody" can read the database. Pay special attention to the quotes!
$ mysql -uroot -e "GRANT SELECT on rice_synteny.* to 'nobody'@'localhost'"

Loading the alignment data

The alignment data file

Have a look at the input data in clustalw format:

 $ cd ~/data/gbrowse_syn/rice
 $ more data/rice.aln

CLUSTAL W(1.81) multiple sequence alignment W(1.81)


rice-3(+)/16598648-16600199      ggaggccggccgtctgccatgcgtgagccagacggggcgggccggagacaggccacgtgg
wild_rice-3(+)/14467855-14469373 gggggccgg------------------------------------agacaggccacgtgg
                                 ** ******                                    ***************


rice-3(+)/16598648-16600199      ccctgccccgggctgttgacccactggcacccctgtcccgggttgtcgccctcctttccc
wild_rice-3(+)/14467855-14469373 ccctgccccgggctgttgacccactggcacccctgtcccgggttgtcgccctcctttccc
                                 ************************************************************


rice-3(+)/16598648-16600199      cgccatgctctaagtttgctcctcttctcgaacttctctctttgattcttcacgtcctct
wild_rice-3(+)/14467855-14469373 cgccatgctctaagtttgctcctcttctcgaacttctctctttgattcttcacgtcctct
                                 ************************************************************



rice-3(+)/16598648-16600199      tggagcctccccttctagctcgatcacgctctgctcttccgcttggaggctggcaaaact
wild_rice-3(+)/14467855-14469373 tggagcctccccttctagctcgatcgcgctctgctcttccgcttggaggctggcaaaact
                                 ************************* **********************************

Note on CLUSTALW

These data are in clustalw format. The scripts used to process these data will recognize clustalw and other commonly used formats recognized by BioPerl's AlignIO parser. This does not mean that clustalw is the actual program used to generate the alignment data.

Note on the sequence ID syntax

The sequence ID in this clustal file is overloaded to contain information about the species, strand and coordinates. This information is essential:

 rice-3(+)/16598648-16600199
 species-refseq(strand)/start-end
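
As a rough illustration of how such an ID breaks down, the sketch below splits one ID into its five parts with sed. The function name parse_aln_id is hypothetical, and this is not a description of how load_alignments_msa.pl parses IDs internally; it is only a quick way to check that your own IDs follow the pattern.

```shell
# parse_aln_id ID
# Hypothetical helper: prints "species refseq strand start end" for an ID of
# the form species-refseq(strand)/start-end, or returns 1 if it does not match.
parse_aln_id() {
  parsed=$(printf '%s\n' "$1" |
    sed -n 's|^\(.*\)-\([^-()]*\)(\([+-]\))/\([0-9][0-9]*\)-\([0-9][0-9]*\)$|\1 \2 \3 \4 \5|p')
  [ -n "$parsed" ] || return 1
  printf '%s\n' "$parsed"
}

parse_aln_id 'rice-3(+)/16598648-16600199'       # prints: rice 3 + 16598648 16600199
parse_aln_id 'wild_rice-3(+)/14467855-14469373'  # prints: wild_rice 3 + 14467855 14469373
```

The species name is taken to be everything before the last hyphen that precedes the strand, so names containing underscores or hyphens are kept whole.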

The database loading script

Next, we will load the alignment database.

We are using the options:
-u root          -- username is root
-d rice_synteny  -- use database rice_synteny
-f clustalw      -- use clustalw format
-c               -- initialize a new (or overwrite the old) database
-v               -- print information about what is happening
Other available options that we do not need here:
-p password      -- not used because the root user has no password
                    in this implementation
-n               -- do not calculate map coordinates (faster)


We will run the script with the command-line incantation below, from within a screen session (see the next section):

$ ../bin/load_alignments_msa.pl -u root -d rice_synteny -format clustalw -c -v data/rice.aln

Running in the background with the linux screen command

Using screen: Running the script as we are below is time-consuming, so we will use a screen session to run it in the background while we turn our attention to downstream tasks. [more information on 'screen'...]

  • When entering screen mode, hit 'space' to clear the first screen if a message appears.
  • If your backspace key does not work in screen mode, use ^H (ctrl key + H key).
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ screen -S load1
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ ~/data/gbrowse_syn/bin/load_alignments_msa.pl -u root -d rice_synteny -format clustalw -v rice.aln -c 
Processing alignment file rice.aln...
Processing Multiple Sequence Alignment 1 (length 1557)
Processing Multiple Sequence Alignment 2 (length 11275)
Processing Multiple Sequence Alignment 3 (length 3526)
Processing Multiple Sequence Alignment 4 (length 5992)
Processing Multiple Sequence Alignment 5 (length 24267)
Processing Multiple Sequence Alignment 6 (length 697)
Processing Multiple Sequence Alignment 7 (length 6798)
Processing Multiple Sequence Alignment 8 (length 4760)
Processing Multiple Sequence Alignment 9 (length 4595)
Processing Multiple Sequence Alignment 10 (length 95)
Processing Multiple Sequence Alignment 11 (length 479)
Processing Multiple Sequence Alignment 12 (length 9123)
Processing Multiple Sequence Alignment 13 (length 80)
Processing Multiple Sequence Alignment 14 (length 11864)
Processing Multiple Sequence Alignment 15 (length 775)
etc...
  • This will go on for some time (there are 1800 alignments), so we will let the screen run in the background and work on our other tasks. We do this like so:
  1. hit ^A (ctrl key + A key), then release
  2. hit the D key, which will detach the screen (continues to run in the background)
  • We can check back later like so:
$ screen -r load1
  • If the job is done, we can exit the session by typing 'exit' at the command prompt.
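
To get a feel for how long the load will run, one could count the alignments in the file before starting. The sketch below assumes that every alignment in the file begins with its own "CLUSTAL" header line, as in the excerpt shown earlier; count_alignments is a hypothetical helper name, not part of GBrowse_syn.

```shell
# count_alignments FILE
# Hypothetical helper: counts lines that begin with "CLUSTAL", on the
# assumption that each alignment block in FILE has its own header line.
count_alignments() {
  grep -c '^CLUSTAL' "$1"
}

# Example, run from ~/data/gbrowse_syn/rice:
#   count_alignments data/rice.aln
```

Comparing this count with the number in the latest "Processing Multiple Sequence Alignment" message gives a rough idea of progress.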

Setting up the species' databases

GFF3

Let's have a look at the GFF3 data:

$ more rice.gff3
##gff-version 3
##sequence-region 3 1 19401704
3       ensembl gene    78      1849    .       -       .       ID=3_FG2548;Name=3_FG2548;biotype=protein_coding
3       ensembl mRNA    78      1849    .       -       .       ID=3_FGT2548;Parent=3_FG2548;Name=3_FGT2548;biotype=protein_coding
3       ensembl CDS     1645    1849    .       -       0       Parent=3_FGT2548;Name=CDS.12
3       ensembl CDS     1444    1547    .       -       1       Parent=3_FGT2548;Name=CDS.13
3       ensembl CDS     999     1144    .       -       0       Parent=3_FGT2548;Name=CDS.14
3       ensembl CDS     799     913     .       -       2       Parent=3_FGT2548;Name=CDS.15
3       ensembl CDS     646     786     .       -       0       Parent=3_FGT2548;Name=CDS.16
3       ensembl CDS     78      215     .       -       0       Parent=3_FGT2548;Name=CDS.17
3       ensembl gene    4910    5518    .       +       .       ID=3_FG2546;Name=3_FG2546;biotype=protein_coding
3       ensembl mRNA    4910    5518    .       +       .       ID=3_FGT2546;Parent=3_FG2546;Name=3_FGT2546;biotype=protein_coding
3       ensembl CDS     4910    5518    .       +       0       Parent=3_FGT2546;Name=CDS.19
3       ensembl gene    5743    6351    .       -       .       ID=3_FG2565;Name=3_FG2565;biotype=protein_coding
3       ensembl mRNA    5743    6351    .       -       .       ID=3_FGT2565;Parent=3_FG2565;Name=3_FGT2565;biotype=protein_coding
3       ensembl CDS     5743    6351    .       -       0       Parent=3_FGT2565;Name=CDS.21
3       ensembl gene    10979   16914   .       +       .       ID=3_FG2570;Name=3_FG2570;biotype=protein_coding
3       ensembl mRNA    10979   16914   .       +       .       ID=3_FGT2570;Parent=3_FG2570;Name=3_FGT2570;biotype=protein_coding
3       ensembl CDS     10979   11592   .       +       0       Parent=3_FGT2570;Name=CDS.29
3       ensembl CDS     11670   13317   .       +       2       Parent=3_FGT2570;Name=CDS.30
3       ensembl CDS     13390   14204   .       +       0       Parent=3_FGT2570;Name=CDS.31
3       ensembl CDS     14433   16914   .       +       2       Parent=3_FGT2570;Name=CDS.32

Some key things to note:

The ##sequence-region directive 
is used to create a reference sequence named 3, which is the scaffold on which all of the other features in the file are located
The 'gene' features 
are the top-level parent features. The 'mRNA' and 'CDS' features are children of the gene. The containment hierarchy is organized using the 'Parent' tag. The CDSs are children of the mRNA, which is in turn a child of the gene. For display purposes, we only need to worry about the gene.
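
Since only the gene features matter for display, it can be handy to list them before loading. The following is a minimal sketch with awk, assuming the file is tab-separated as GFF3 requires (the listing above shows the columns with spaces); list_genes is a hypothetical helper name.

```shell
# list_genes FILE
# Hypothetical helper: prints "ID seqid start end strand" for every feature
# whose type (column 3) is "gene". Lines starting with "#" are skipped.
list_genes() {
  awk -F'\t' '
    /^#/ { next }
    $3 == "gene" {
      id = ""
      n = split($9, attrs, ";")          # column 9 holds tag=value pairs
      for (i = 1; i <= n; i++)
        if (attrs[i] ~ /^ID=/) id = substr(attrs[i], 4)
      print id, $1, $4, $5, $7
    }
  ' "$1"
}

# Example, run from ~/data/gbrowse_syn/rice/data:
#   list_genes rice.gff3 | head
```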

Loading

Note: before we load the GFF3 databases, we need to create a database for each species and give the web user 'nobody' read privileges. Let's create a little SQL script to make this easier:

  • This is just a list of SQL commands that give instructions to the mysql database manager, which we can pass via STDIN
  • Create a file create_species_dbs.sql with the contents below.

CREATE DATABASE rice;
CREATE DATABASE wild_rice;
GRANT SELECT on rice.* TO 'nobody'@'localhost';
GRANT SELECT on wild_rice.* TO 'nobody'@'localhost';

  • Then we can run the commands like so:
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ mysql -uroot <create_species_dbs.sql
  • Make sure we are in the location of the GFF data files
$ cd ~/data/gbrowse_syn/rice/data
  • The script we need is bp_seqfeature_load.pl, which comes pre-installed with bioperl-live
  • The -f option means "fast load"
  • The -c option means complete (or destructive) load. It would overwrite previously loaded 'rice' databases

Load the rice data...

gmod@ubuntu:~/data/gbrowse_syn/rice/data$ bp_seqfeature_load.pl -u root -d rice -c -f rice.gff3
loading rice.gff3...
Building object tree... 0.53s4s
Loading bulk data into database... 0.65s
load time: 11.74s

and repeat for wild rice...

gmod@ubuntu:~/data/gbrowse_syn/rice/data$ bp_seqfeature_load.pl -u root -d wild_rice -c -f wild_rice.gff3
loading wild_rice.gff3...
Building object tree... 0.55s7s
Loading bulk data into database... 0.66s
load time: 11.98s
  • The alignment database loading should also be done by now. We can check like so:
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ screen -r load1

Setting up the Configuration Files

Copy the configuration files to the GBrowse configuration directory. Note that you will need root privileges to do this.

Change to the conf directory and make sure we have the files...

gmod@ubuntu:~/data/gbrowse_syn/rice/data$ cd ../conf
gmod@ubuntu:~/data/gbrowse_syn/rice/conf$ ls
header.txt  oryza.synconf  rice_synteny.conf  wild_rice_synteny.conf

  • The default configuration location for Ubuntu Linux is /etc/apache2/gbrowse.conf; copy the files there:

gmod@ubuntu:~/data/gbrowse_syn/rice/conf$ sudo cp *conf /etc/apache2/gbrowse.conf
[sudo] password for gmod: 

A Species Config File

File: rice_synteny.conf

[GENERAL]
description   = Domestic rice chromosome 3
db_adaptor    = Bio::DB::SeqFeature::Store
db_args       = -adaptor DBI::mysql
                     -dsn     dbi:mysql:rice;host=localhost
                     -user    nobody

# examples to show in the introduction
examples = 3:51418..52015
           3:67260..67704

# what image widths to offer
image widths  = 450 640 800 1024

# default width of detailed view (pixels)
default width = 1024

initial landmark = 3:200000..300000

# Web site configuration info
stylesheet  = /gbrowse/gbrowse.css
buttons     = /gbrowse/images/buttons
tmpimages   = /gbrowse/tmp

# max and default segment sizes for detailed view
max segment      = 5000000
default segment  = 5000

# zoom levels
zoom levels      = 50 100 200 1000 2000 5000 10000 20000 40000 50000 100000 500000 1000000 5000000

# colors of the overview, detailed map and key
overview bgcolor = lightgrey
detailed bgcolor = lightgoldenrodyellow
key bgcolor      = beige
default features = EG
balloon tips     = 1

[TRACK DEFAULTS]
glyph         = generic
height        = 10
bgcolor       = lightgrey
fgcolor       = black
font2color    = blue
label density = 25
link          = AUTO
link_target   = _blank
title         = Hello, my name is $name!

################## TRACK CONFIGURATION ####################
# the remainder of the sections configure individual tracks
###########################################################

[EG]
feature      = gene:ensembl
glyph        = gene
height       = 10
bgcolor      = peachpuff
fgcolor      = hotpink
description  = 0
label        = 0
category     = Transcripts
key          = ensembl gene

The GBrowse_syn Config File

File: oryza.synconf

[GENERAL]
description =  BLASTZ alignments for Oryza sativa

# The synteny database
join        = dbi:mysql:database=rice_synteny;host=localhost;user=nobody

# This option maps the relationship between the species data sources, names and descriptions
# The value for "name" (the first column) is the symbolic name that gbrowse_syn uses to identify each species.
# This value is also used in two other places in the gbrowse_syn configuration:
# 1) the species name in the "examples" directive
# 2) the species name in the .aln file
# The value for "conf. file" is the basename of the corresponding gbrowse .conf file.
# This value is also used to identify the species configuration stanzas at the bottom of the configuration file.

#                 name          conf. file            Description
source_map =      rice          rice_synteny          "Domestic Rice (O. sativa)"
                  wild_rice     wild_rice_synteny     "Wild Rice"

tmpimages     = /gbrowse/tmp
imagewidth    = 800
stylesheet    = /gbrowse/gbrowse.css
cache time    = 1

config_extension = conf

#Note the include statement below.  It loads a boiler-plate header to be displayed at the top of the gbrowse_syn installation
#include header.txt

# example searches to display
examples = rice 3:16050173..16064974
           wild_rice 3:1..400000

zoom levels = 5000 10000 25000 50000 100000 200000 400000

# species-specific databases
[rice_synteny]
tracks    = EG
color     = blue

[wild_rice_synteny]
tracks    = EG
color     = red
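
Because each species stanza name has to match the basename of a .conf file, a quick check can catch a misspelled name before testing in the browser. The sketch below is one way to do that, assuming config_extension is conf and that every stanza other than [GENERAL] names a species data source; check_syn_sources is a hypothetical helper name.

```shell
# check_syn_sources SYNCONF CONF_DIR
# Hypothetical helper: for every stanza header in SYNCONF other than [GENERAL],
# reports whether CONF_DIR/<stanza>.conf exists. Returns 1 if any is missing.
check_syn_sources() {
  status=0
  for name in $(sed -n 's/^\[\(.*\)][[:space:]]*$/\1/p' "$1" | grep -v '^GENERAL$'); do
    if [ -f "$2/$name.conf" ]; then
      echo "ok      $name.conf"
    else
      echo "missing $name.conf"
      status=1
    fi
  done
  return $status
}

# Example:
#   check_syn_sources /etc/apache2/gbrowse.conf/oryza.synconf /etc/apache2/gbrowse.conf
```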

This should complete the installation. Time to test it out...

Testing the rice and wild_rice data sources in GBrowse

  • If things have worked out, you should see something like the image below when you point your browser to:
http://localhost/cgi-bin/gbrowse/rice

Note you will use 'localhost' if you are running your browser within the VMware player.

Rice in gbrowse.png

Viewing the data in GBrowse_syn

  • Cross your fingers
http://172.16.109.133/cgi-bin/gbrowse_syn/oryza

Note that the IP address will vary

Ihopethisworks.png

Optional Advanced Section

This optional session will use five pre-built databases.

The instructions for downloading these data to the Ubuntu virtual disk:

$ cd ~/data/gbrowse_syn
$ wget ftp://ftp.gmod.org/pub/gmod/Courses/2009/SummerSchoolEurope/nematodes.tar.bz2
$ tar xjvf nematodes.tar.bz2

Deal with the databases (these are MySQL dumps)

$ cd ~/data/gbrowse_syn/nematodes/mysql_dumps

The script load.pl is:

#!/usr/bin/perl -w
use strict;

# Load each MySQL dump (*.sql) in the current directory into a
# database named after the file, replacing any existing copy.
while (<*.sql>) {
  my ($name) = /(\S+)\.sql/;
  system "mysql -uroot -e 'drop database if exists $name'";
  system "mysql -uroot -e 'create database $name'";
  system "mysql -uroot $name <$_";
  print "$name loaded\n";
}

$ ./load.pl &

This will take a while to run. It is building five MySQL databases, two GBrowse_syn data sources (pecan and orthocluster data sets) and one species' database for each of C. elegans, C. remanei and C. briggsae.

NOTE: you will need to give the user 'nobody', as specified in the database connection section of the configuration files, SELECT access to the MySQL databases.

  • You must be user 'root' to do this.
  • You can use the command-line incantation below:
$ mysql -uroot -e "GRANT SELECT on *.* TO 'nobody'@'localhost'"

Loading Sample Configuration Files

  • Get the config files
$ cd ~/data/gbrowse_syn/nematodes/conf
$ sudo cp * /etc/apache2/gbrowse.conf
$ cd /etc/apache2

Point your web browser at:

http://localhost/cgi-bin/gbrowse_syn/orthocluster/

EtVoila.png