Running a GBrowse2 render farm

From GMOD
Revision as of 22:46, 22 January 2009 by Lstein

Figure 1: Single server with multiple processors or cores
Figure 2: Multiple servers sharing NFS-mounted file-based databases
Figure 3: Multiple servers sharing the same relational databases
Figure 4: Multiple servers with private databases

GBrowse can be configured to use one or more "render slave" daemons. A render slave is a small Perl process that runs in the background, processing requests to render GBrowse tracks. By distributing several render slaves across one or more computers, you can spread out the work of generating GBrowse pages, thereby achieving improved performance. A slave can be attached to a particular track, a particular set of tracks, or to all tracks. In addition, you may assign multiple slaves to a track, in which case the load will be distributed among the slaves in round-robin fashion.

Configuration of a render slave is minimal. All the datasource-specific information is stored in the central GBrowse script (known as the "master" GBrowse) and sent across the wire to the render slave. Configuration of the master GBrowse is also quite simple, and requires just an extra line or two in its configuration files.
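As a sketch of what those extra lines look like (the hostname, port number, and track name below are illustrative placeholders; consult the GBrowse configuration documentation for the exact directives and defaults), the master's configuration gains entries along these lines:

  # In the master's GBrowse.conf: enable renderfarm support
  renderfarm = 1

  # In a data source track stanza: delegate this track to a slave
  [Genes]
  remote renderer = http://slave1.example.com:8101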

The main limitation on render slaves is that a slave needs access to the database underlying any track that it is asked to render. If you run slaves on multiple hosts, then each host must have access to the relevant databases. For relational database backends, such as MySQL, you will need to grant access permissions for each host. For file-based database backends, such as the BerkeleyDB and in-memory databases, the database files must be replicated among the hosts, or else mounted on a common filesystem such as NFS. An additional limitation is that all plugins and uploaded files are rendered on the master, and cannot be offloaded to render slaves.
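For a MySQL backend, granting access typically means creating a read-only account for each slave host. A minimal sketch, in which the database names, account name, password, and hostname are all placeholders to adapt to your installation:

  -- Run on the MySQL server, once per slave host
  GRANT SELECT ON annotations1.* TO 'gbrowse'@'slave1.example.com'
      IDENTIFIED BY 'secret';
  GRANT SELECT ON annotations2.* TO 'gbrowse'@'slave1.example.com'
      IDENTIFIED BY 'secret';

Because slaves only render tracks, SELECT privileges are sufficient; there is no need to grant write access.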

Common Configurations

Figures 1 through 4 illustrate common render slave configurations. In each of these configurations, we assume that your data source has three underlying databases: a "scaffold" database that contains basic chromosome information and the DNA sequence, for use by any plugins running on the master, and two "annotation" databases that contain feature tracks.

Figure 1 illustrates a simple case in which the master and slave processes all run on the same machine and all databases are locally accessible. This configuration makes sense on a multiprocessor or multicore system that can take advantage of parallel processes. The only special configuration needed for this arrangement is to activate GBrowse's renderfarm support (as described in Initial Setup) and to launch one or more slave processes at machine startup time.

Figure 2 illustrates a master server and two slave servers, all located on the same LAN. Each server runs a single GBrowse slave process; however, it is possible to launch multiple processes per server for a further performance boost. The three databases are located on a single NFS-mounted volume, so that the master server and the two slave servers have direct access to them. The host for this volume could be the master server, one of the slave servers, or a dedicated NFS server somewhere on the LAN.

Figure 3 illustrates almost the same configuration as the previous one, except that instead of an NFS-mounted filesystem, the databases reside on a relational database (e.g. MySQL) server which is accessed via the network.

Figure 4 illustrates a case in which the master and slave servers do not share databases. In this case, each machine has access to a private local database file or relational database server.

Which configuration is best for you depends on your needs. For small renderfarms, a shared disk or database server is easy and effective. For large renderfarms, you may encounter resource contention as the slave servers all attempt to access the same files. In this case, it makes more sense to give each slave its own database copy.
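Whichever arrangement you choose, distributing a heavily used track across several slaves requires only listing multiple renderers in its stanza; GBrowse then alternates among them in round-robin fashion, as described above. The hostnames and ports here are, again, illustrative:

  [HeavyTrack]
  remote renderer = http://slave1.example.com:8101 http://slave2.example.com:8101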

Many other configurations are possible, including distributing the work across the Internet. However, there is no particular authentication or authorization mechanism built into the slaves, so it is recommended that they be run on a protected LAN.

Initial Setup

To set up a GBrowse renderfarm, you will need to install and configure slaves on each physical server you plan to use, arrange for the slaves to launch at boot time, and configure the master server to use the slaves.

Installing the GBrowse Slave Software

The GBrowse render slave is a small Perl script named gbrowse_slave. Most of its functionality is contained in various Perl modules that are part of the GBrowse core distribution. On the master server, when you install GBrowse using ./Build install, gbrowse_slave is installed automatically into Perl's default binary directory, usually /usr/local/bin. For the master, no additional software installation is needed.

On machines destined to be slave servers, you will probably not want to install the full GBrowse distribution, because this includes configuration files, sample databases, javascript libraries and other elements that are not needed. For these machines, the build and install procedure is as follows:

% cd Generic-Genome-Browser
% perl Build.PL
% ./Build test  (optional)
% sudo ./Build install_slave

The last step, which should be performed as root, will install the required Perl libraries, manual pages and gbrowse_slave script, as well as the init support scripts needed to launch gbrowse_slave at startup time. Running the tests is optional, but it does ensure that all prerequisites are installed and working properly. The slave requires all the same prerequisites as the master server, including Bio::Perl, GD, CGI::Session, JSON, etc. However, an Apache server is not needed to run a slave.
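Once installed, you can launch a slave by hand to confirm that it works before relying on the init scripts. The port option and init script name shown here are typical but may differ on your system; run gbrowse_slave --help and inspect the installed init scripts for the exact usage:

 % gbrowse_slave --port 8101
 % sudo /etc/init.d/gbrowse-slave start

The port you choose must match the port given in the master's "remote renderer" settings.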