GBrowse2 VMs

From GMOD
Revision as of 20:37, 13 February 2013 by Girlwithglasses (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

GBrowse2 Virtual Machines

GBrowse running under VirtualBox] Click to view at full resolution

GBrowse versions 2.50 and higher are available as preconfigured virtual machines. Each VM provides you with a stable, self-contained environment on which to build a GBrowse-based web site. Two versions are available:

  1. A VirtualBox VM that can be run on any Windows, Mac OS, Solaris or Linux desktop system or server using the Open Source VirtualBox application.
  2. A Amazon Web Services VM that runs under the Amazon Elastic Compute Cloud (EC2), and can be used to host GBrowse with no investment in computer hardware whatsoever.

Because of the many Perl and other third-party software packages that are needed to run GBrowse, it is much easier to run GBrowse from these VMs than by installing from source or binary. Performance of either type of VM is excellent, and they provide added benefits in flexibility and future expansion.

In addition to the "official" GBrowse VMs described above, there are other several other excellent VMs that were developed for use in specific workshops and tutorials. Please see GBrowse NGS Tutorial and VMware HOWTO.

The first sections of this page describes specific aspects of the VirtualBox and EC2 VMs. The last section describes common features for them both.

Getting Started with the VirtualBox VM

To start using the GBrowse2 VirtualBox VM, first install the VirtualBox package onto the desktop or server system that will be hosting the VM. You may download VirtualBox directly from its web site, or, if you are using a debian or RedHat-based Linux distribution, use the software package manager to install "VirtualBox". The VirtualBox Extension Pack, which adds support for USB 2.0 devices, is not required to run GBrowse2.

Next, download GBrowse2, VirtualBox Edition.

The download will give you a compressed virtual machine image file named "GBrowse 2.XX VirtualBox Edition.ova".

Now launch the VirtualBox GUI, choose File=>Import Appliance... and select the downloaded .ova file. This will install and initialize the GBrowse2 VM. If you prefer the command-line version of VirtualBox, run VBoxManage import 'GBrowse 2.XX VirtualBox Edition.ova' .

Once GBrowse is installed, you may run it by selecting it in the GUI and pressing the Start button (command-line equivalent VBoxManage startvm 'GBrowse 2.XX VirtualBox Edition' ). The virtual machine will boot, and in about 30s will bring up the Chromium browser displaying the welcome apge shown at the top of this documentation. There are several example genome data sources preinstalled. Please go ahead and browse them by clicking on the links under Example Databases.

When the VM was installed, VirtualBox should have automatically sets up port forwarding from the host (physical) machine's port 8081 to the guest (virtual) machine's port 80. Check this now by opening up a web browser on the host machine, and requesting http://localhost:8081. This should bring you to the GBrowse2 welcome page. If this does not work, then see #Editing Network Settings for help.

User Accounts

The GBrowse2 VM has two user accounts preinstalled. One is the "GBrowse" account, a restricted autologin shell that runs the welcome screen, and the other is the "Administrator" account (username "admin"). The GBrowse account has restricted access to the system resources and is there as a poor man's "kiosk mode" for browsing databases installed on the VM. The Administrator account is allowed to run the sudo command to elevate privileges, and was created for the purpose of adding and configuring new GBrowse databases and tracks.

When you first start up the VirtualBox VM, you will be logged into the GBrowse account automatically. To log in as Administrator, go to the menu at the upper left of the desktop and choose Log Out.... When prompted, select either the Log Out or Switch Users button. You will then be taken to a login screen. Select the Administrator user, and type the password gbrowse, to be taken to the administrator user's desktop. From here you can administer GBrowse from the command line.

You may ssh to the VM from the host machine by using the IP address 192.168.56.10:

ssh admin@192.168.56.10

For ssh access to the virtual machine from elsewhere please see the various alternatives described in this article.

Important: If this is a publicly-accessible machine, you will most definitely want to change the administrator's password. You can do this by selecting Menu=>System Tools=>Users and Groups, where Menu is stylized arrow icon in the upper left of the menu bar. Select the Administrator user and click Password: Change.... You may also change the password by running the passwd command from the command line shell.

Editing Network Settings

The VirtualBox Edition has two network interfaces installed. The first is a Network Address Translation (NAT) interface that allows the guest machine to access the LAN that the host machine is connected to. The second interface is attached directly to the host and allows for incoming connections from the host to the guest machine. This is primarily to allow the host machine to ssh to the guest.

Port Forwarding Settings

By default, the NAT interface will take network requests targeted to the host machine's port 8081 to the guest VM's port 80. This allows the host machine to act as a GBrowse server for other physical machines on the local LAN and/or Internet. You may modify this if you wish in the following way. From the VirtualBox GUI, select the GBrowse VM, and then double-click on the Network setting in the details panel. This will bring up a dialog box that has tabs for each of the two network adapters attached to the VM. Select Adapter 1, which is the NAT adapter, and then open up the Advanced section. Click on Port Forwarding to open up the Port Forwarding definitions. You may add new forwarding rules by clicking on the Plus icon.

In the example shown on the right, the first rule is the default forwarding of host port 8081 to guest port 80. In addition, we have added a second rule that forwards host's port 80 itself to guest port 80. This allows you and others to access GBrowse without specifying the :8081 port. Note, however, that you must start VirtualBox with root privileges in order to successfully forward port 80 and any others below 1024.

Enabling SSH

The VirtualBox VM does not run an SSH server by default, however you can enable it easily by running this command from Administrator's command-line shell.

sudo apt-get install openssh-server

You can then ssh into the guest by using its host-only network interface. You can look up the network interface from within the virtual machine's desktop by right clicking on the Network icon (pie-wedge icon in the upper right of the menu bar), and selecting Connection Information=>Wired connection 2. Alternatively, you can look it up from the command-line of the host machine using VBoxManage:

$ VBoxManage guestproperty get "Growse 2.XX, VirtualBox Edition" "/VirtualBox/GuestInfo/Net/1/V4/IP"
Value: 192.168.56.101

You can now ssh in using the address you just got:

ssh admin@192.168.56.101

Expanding Storage

All the GBrowse databases are stored on a separate logical volume mounted at /opt/gbrowse. This volume grows dynamically up to 50 GB. If you run out of space, you can easily add more storage. First halt the virtual machine. Using the VirtualBox GUI, select the GBrowse VM, click on Settings, and then select Storage. In the Storage Tree panel, select SATA Controller and click on the Add Hard Disk icon. When prompted, choose Create new disk and then use the wizard to create a virtual disk with the size and characteristics you desire.

Now launch the VM. The new disk you added will appear as a new unpartitioned disk drive device named /dev/sdc (or /dev/sdd if you've done this once before). Do not partition or format this disk. Instead, use the Logical Volume Manager to extend the logical /opt/gbrowse volume onto this disk via the following series of commands:

sudo pvcreate /dev/sdc
sudo vgextend volumes /dev/sdc
sudo lvextend -l +100%FREE /dev/volumes/gbrowse
sudo resize2fs /dev/volumes/gbrowse

/opt/gbrowse will now be increased by the size of the disk you just added.

Getting Started with the EC2 VM

To start a GBrowse server within the Amazon Elastic Compute Cloud (EC2), you will need to have an Amazon Web Services account and to be familiar with the process of configuring and launching Amazon Machine Images (AMIs. A good getting started guide can be found here.

GMOD provides GBrowse AMIs in a small number of availability regions; this list will grow with time. Choose the region you prefer, and click on one of the links below:

GBrowse 2.54, us-east-1: ami-93e46cfa

GBrowse 2.54, us-west-2: ami-c04fc4f0

GBrowse 2.50, us-east-1: ami-4914a220

GBrowse 2.50, us-west-2: ami-449a1474

Alternatively you may use the Amazon Console to search for public images matching the name "GBrowse*" in order to find images in your preferred region (there may not be any).

Clicking on one of the AMIs listed above will start an instance launch wizard. Here is guidance on how to configure the instance:

Instance Type
GBrowse will run even on the tiny t1.micro instance type, but this is only recommended for the purposes of kicking the tires. For production use, we recommend m1.medium or larger. High-CPU and high I/O performance instance types generally provide better performance than high-memory instances.
Key Pair
Use your default SSH keypair.
Firewall
Use or create a security group that allows both incoming SSH and HTTP connections. The former will allow you to log into the GBrowse server for administrative purposes, while the latter gives you access to the GBrowse web application.

Once you are satisfied with the settings, launch the instance and watch the Amazon Console until it enters Running state. At this point, record the instance's public DNS name from the information provided on the console, and paste this DNS name into a web browser. You will be taken to the GBrowse welcome page, which offers links to documentation and bare-bones starter genome databases.

Logging into the GBrowse EC2 VM

The administrative user for the GBrowse EC2 VM has a password-less account named "admin", which can only be accessed via an SSH key. To log into the instance, invoke ssh with your AWS keypair identity file and the "admin" username:

ssh -i /path/to/aws_private_key.pem admin@ec2-XX-XX-XX-XX.compute-1.amazonaws.com

This will give you a command-line shell on the GBrowse server. To run any command as root, you may use sudo without providing a password.

Using the Load Balancing Scripts

You may take advantage of Amazon's spot instances to load balance GBrowse at modest cost. The two scripts needed are:

gbrowse_sync_aws_slave.pl
This script will synchronize the databases on the master VM to a data snapshot that can be accessed by the load balancing slave machines.
gbrowse_aws_balancer.pl
This script runs the balancer daemon on the master VM. It monitors load on the master instance and launches slave instances to handle increased load.

Configuration information for these two scripts is maintained in a common file named /opt/gbrowse/etc/aws_balancer.conf. This file controls the number of spot instances for launch for different load levels, the size and memory capacity of the spot instances, and the maximum hourly price you are willing to pay for a spot instance.

There is also an init script located at /opt/gbrowse/etc/init.d/gbrowse-aws-balancer that launches and configures the balancer daemon. It is controlled by the configuration file /opt/gbrowse/etc/default/gbrowse-aws-balancer.

Some setup is required to use the load balancer. For details see Load Balancing GBrowse2 with AWS Spot Instances or run:

perldoc gbrowse_aws_balancer.pl

Expanding Database Storage on the GBrowse EC2 VM

All GBrowse databases are stored in a logical volume mounted at /opt/gbrowse. The default volume is only 10 Gb in size, but you can easily expand it while logged into the GBrowse server by running gbrowse_grow_cloud_vol.pl and providing the number of gigabytes to expand it by. For example, this command increases the volume by an additional 40 GB:

gbrowse_grow_cloud_vol.pl 40

This works by allocating a new EBS volume, attaching it to the server instance, adding it to the logical volume manager group, and then resizing the /opt/gbrowse filesystem. Volumes added in this way have their "delete on termination" flag set, so the data will disappear when you terminate the instance, unless you snapshot the volume or create a new AMI to hold your local modifications.

Administration common to both VirtualBox and EC2 VMs

This section discusses configuration and administration tasks that are common for both the VirtualBox and EC2 virtual machines.

Configuring Tracks

The process of loading data and configuring tracks is described well in detail in the GBrowse 2.0 HOWTO, and the GBrowse NGS Tutorial (for BAM and SAM next generation sequencing data). There are also several step-by-step tutorials located on the GBrowse Tutorial page. All tutorials will work on the VMs, except that the layout of the GBrowse's data and configuration files is slightly different from those in the tutorial. For example, the main GBrowse configuration file is located at /opt/gbrowse/etc/GBrowse.conf, instead of /etc/gbrowse2/GBrowse.conf. The VM filesystem layout is described in the next section.

Filesystem Layout

The filesystem layout of both VMs is similar. The root filesystem at "/" consists of a generic Ubuntu bootable disk. Almost all GBrowse-related software has been installed in /opt/gbrowse:

/opt/gbrowse/etc
GBrowse configuration files, including GBrowse.conf and datasource .conf files.
/opt/gbrowse/databases
GBrowse databases
/opt/gbrowse/htdocs
GBrowse Javascript and static HTML files.
/opt/gbrowse/bin
Various command-line tools from the GBrowse and BioPerl distributions.
/opt/gbrowse/lib
Perl and non-Perl libraries used by GBrowse.

The system PATH includes /opt/gbrowse/bin, and the PERL5LIB environment variable is /opt/gbrowse/lib/perl:/opt/gbrowse/lib/perl5:/opt/gbrowse/share/perl.

Adding New Databases

You can add new databases in the way described in GBrowse 2.0 HOWTO and GBrowse Tutorial. However, there is also a new command-line tool called gbrowse_import_ucsc_db.pl that will initialize starter databases from information held in the UCSC Genome Browser. It is run this way:

gbrowse_import_ucsc_db.pl hg19 'H. sapiens genome (hg19), genes and DNA'

The first argument is the UCSC genome build name, such as hg19. The second is the description to attach to the new database. This description will appear in the GBrowse datasources menu. When this script is run, it will fetch the DNA, genes and non-coding RNA annotations for the indicated data source and create a bare-bones configuration file for the data source. You can then tweak the configuration file and add to it.

The gbrowse_clean.pl Script

On both VMs there is a daily cron job that runs under the www-data user. This cronjob runs gbrowse_clean.pl to remove old cache files as well as uploads that haven't been accessed in a minimum number of days (controlled by the "expire uploads" option in GBrowse.conf). The script will log its run time actions to the file /etc/gbrowse2/gbrowse_clean.log.

You may disable gbrowse_clean.pl by running:

sudo crontab -u www-data -e

This will open up a text editor. Simply comment out the line that runs gbrowse_clean.pl.

Admin mode

Both VMs come with a preinstalled GBrowse administrator user. When the administrator is logged into the browser using the Log in / create account link in the upper right of the GBrowse main window, all tracks uploaded by this user will be publicly visible. The username and password for this user are:

username: admin
password: gbrowse

If you make the VM publicly accessible, you are advised to change the password by selecting My Account and then Change Password after you initially log in. You may also use the command-line tool gbrowse_set_admin_passwd.pl to change the password from a server shell.

To administer the embedded MySQL server, please use the user account "root" and the password "gbrowse". To change the MySQL root password, use the mysqladmin password command.

Important note: The GBrowse "admin" user is separate from the "admin" user used to log into the server command-line shell. On the Amazon edition, "admin" has no password and must use an SSH key to log in. On the VirtualBox edition, "admin" has the default password of "gbrowse". If you change the web interface password, you will not change the shell login password; this must be done separately.