Galaxy Tutorial 2013

From GMOD
Revision as of 15:49, 21 July 2013

Under Construction

This page or section is under construction.

It started out as a copy of the 2012 Tutorial and is slowly being updated.

This walks you through setting up and running a Galaxy server. This tutorial will be taught by Dave Clements at the 2013 GMOD Summer School.

Galaxy is a data integration and analysis framework for biomedical research. Galaxy allows nearly any tool that can be run from the command line to be integrated into it.

On top of these tools, Galaxy provides an accessible environment for interactive analysis that transparently tracks the details of analyses, a workflow system for convenient reuse, data management, sharing, publishing, and more.

Some General Galaxy Resources

Before we get started, let's highlight some Galaxy resources that may be useful to us along the way.

http://galaxyproject.org
The Galaxy Project home page
GalaxyWiki
All things Galaxy.
http://getgalaxy.org/
Hub page for installing and managing your own Galaxy instance.
http://usegalaxy.org/
The Galaxy project's free public server.
Public Galaxy Servers
Current list of known publicly accessible Galaxy servers.
Galaxy Search
Integrated searches of all online Galaxy resources. Available searches:
Pan-Galactic Web Search
Search everything
Galaxy Mailing Lists Search
Search the (Nabble-powered) mailing list archives
Using Galaxy Search
Search online resources related to using Galaxy
Galaxy Admin and Development Search
Search online resources related to deploying and developing Galaxy
Results from searches are often further broken down into categories
  • All: give me everything
  • Tools: show me doc on tools related to my search.
  • Email: show email threads related to my search.
  • Source code: show Galaxy source code related to my search
  • Shared: Show published Galaxy objects related to my search
  • Documentation: Show documentation (e.g. wiki pages, tool doc, ...) related to my search.
  • Abstracts: Show papers related to my search.
  • Requests: Show feature requests related to my search.

This is all implemented using Google Custom Search.

Mailing Lists and Mailing Lists Search
Galaxy has several mailing lists, some of which are very active
Learning hub page
Start here to learn how to use Galaxy.
Galaxy CiteULike group (@ CiteULike)
Seventeen different tags/categories on 1066+ publications

Open Port 8081

For this tutorial, Galaxy will use port 8081. Update your security group in AWS to open up port 8081. Go to Security Groups → Inbound and add a new rule:

Create a new rule: Custom TCP Rule
Port range: 8081
Source: 0.0.0.0/0

Then click + Add rule and then Apply Rule Changes.

Create a Galaxy instance

Prerequisites

The only prerequisite to run your own Galaxy is a Python interpreter, version 2.6 or greater. Python 3 is a different language and is currently not supported. The GMOD Amazon Machine Image (AMI) used for this course includes version 2.7.3 of the interpreter:

$ python --version
Python 2.7.3

Galaxy is distributed (and developed) using a distributed version control system called Mercurial. The AMI already includes Mercurial version 2.0.2:

$ hg --version
Mercurial Distributed SCM (version 2.0.2)
...

Clone the Galaxy repository

The development and release repositories are available through the Bitbucket hosting service.

DO NOT DO THIS NOW as it has already been done on your image:

To create a local clone of the release repository run the following:

 $ cd ~/Galaxy
 $ hg clone http://bitbucket.org/galaxy/galaxy-dist

Take Advantage of the GMOD in the Cloud Directory Structure

All of the Galaxy files are currently in the ~ubuntu home directory under Galaxy. Let's start by moving this to the non-volatile disk, so to speak, on the GMOD in the Cloud-based AWS image we are using.

$ cd
$ mv Galaxy /data/dataHome/
$ ln -s /data/dataHome/Galaxy Galaxy

Update Galaxy Configuration File

Often you can just fire up Galaxy at this point. However, we want a few things to be different from the default installation. Galaxy's main configuration file is universe_wsgi.ini. By default, that file is created at initialization time by copying universe_wsgi.ini.sample. However, if the file already exists it is not copied over. Copy the file and update it:

$ cd ~/Galaxy/galaxy-dist
$ cp universe_wsgi.ini.sample universe_wsgi.ini
$ nano universe_wsgi.ini
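
The copy-only-if-missing behavior described above can be sketched in a few lines of Python. This is an illustration only, not the code in run.sh; the helper name init_config and the scratch directory are assumptions made for the example.

```python
import os
import shutil
import tempfile


def init_config(sample_path, config_path):
    """Copy sample_path to config_path only if config_path does not exist.

    Returns True if a copy was made, False if the existing file was kept.
    """
    if os.path.exists(config_path):
        return False
    shutil.copy(sample_path, config_path)
    return True


# Work in a scratch directory rather than a real Galaxy checkout.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "universe_wsgi.ini.sample")
config = os.path.join(workdir, "universe_wsgi.ini")
with open(sample, "w") as handle:
    handle.write("#port = 8080\n")

first = init_config(sample, config)    # no config yet, so the sample is copied
with open(config, "w") as handle:
    handle.write("port = 8081\n")      # simulate our own edit
second = init_config(sample, config)   # config exists, so it is left alone
with open(config) as handle:
    kept = handle.read()
```

The point of the design is that edits to universe_wsgi.ini survive later runs of the startup script.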


Change the port from

#port = 8080

to this:

port = 8081

Galaxy, like WebApollo and several other components that are also covered in the course, will listen on port 8080 by default; to avoid stomping on that earlier work, we will configure Galaxy to listen on a different port.

Change the host from

#host = 127.0.0.1

to:

host = 0.0.0.0

This makes Galaxy visible to remote hosts, such as your laptop.
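
A leading # turns a line into a comment, so a setting only takes effect once the # is removed. The sketch below checks this with Python's standard configparser module on a small fragment. It assumes that port and host live in a section named [server:main]; it does not read your real file.

```python
import configparser

# A fragment shaped like the edited settings. The section name is an
# assumption for this sketch.
ini_text = """
[server:main]
#port = 8080
port = 8081
#host = 127.0.0.1
host = 0.0.0.0
"""

parser = configparser.ConfigParser()
parser.read_string(ini_text)

# Lines starting with '#' are ignored, so only the uncommented values remain.
active_options = parser.options("server:main")
port = parser.getint("server:main", "port")
host = parser.get("server:main", "host")
```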


Set the brand to make it obvious that you are working on your Galaxy instance.

Change this:

#brand = None

to this:

brand = My Super Cool Brand

Actually use something shorter.

Use a more robust database

Out of the box, Galaxy includes the embedded SQLite database. This allows Galaxy to run with zero configuration and provides an excellent solution for single-user Galaxy installations being used for tool development. However, for any multi-user scenario a more robust database will be needed for Galaxy to be reliable. We highly recommend Postgres, although other databases are known to work. Postgres is already installed on our AMI (it's the default DBMS for Chado).

Update universe_wsgi.ini to use Postgres by editing the Database section of the file to look like this:

# -- Database                                                                                                                             
 
# By default, Galaxy uses a SQLite database at 'database/universe.sqlite'.  You                                                           
# may use a SQLAlchemy connection string to specify an external database                                                                   
# instead.  This string takes many options which are explained in detail in the                                                          
# config file documentation.                                                                                                              
#database_connection = sqlite:///./database/universe.sqlite?isolation_level=IMMEDIATE                                                     
database_connection = postgres://ubuntu:@localhost:5432/galaxydb
 
# If the server logs errors about not having enough database pool connections,                                                            
# you will want to increase these values, or consider running more Galaxy                                                                 
# processes.                                                                                                                              
#database_engine_option_pool_size = 5                                                                                                     
#database_engine_option_max_overflow = 10                                                                                                 
 
# If using MySQL and the server logs the error "MySQL server has gone away",                                                              
# you will want to set this to some positive value (7200 should work).                                                                    
#database_engine_option_pool_recycle = -1                                                                                                 
 
# If large database query results are causing memory or response time issues in                                                           
# the Galaxy process, leave the result on the server instead.  This option is                                                             
# only available for PostgreSQL and is highly recommended.                                                                                
#database_engine_option_server_side_cursors = False                                                                                       
database_engine_option_server_side_cursors = True
 
# Create only one connection to the database per thread, to reduce the                                                                    
# connection overhead.  Recommended when not using SQLite:                                                                                
#database_engine_option_strategy = threadlocal                                                                                            
database_engine_option_strategy = threadlocal
 
# Log all database transactions, can be useful for debugging and performance                                                              
# profiling.  Logging is done via Python's 'logging' module under the qualname                                                            
# 'galaxy.model.orm.logging_connection_proxy'                                                                                             
#database_query_profiling_proxy = False                                      

Save the file.
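
To see what each piece of the connection string means, it can be pulled apart with Python's generic URL parser. This is only a sketch for reading the string; it is not how Galaxy or SQLAlchemy parse it internally, and the dictionary keys are labels chosen for this example.

```python
from urllib.parse import urlsplit

connection = "postgres://ubuntu:@localhost:5432/galaxydb"
parts = urlsplit(connection)

fields = {
    "dialect": parts.scheme,                 # which kind of database
    "user": parts.username,                  # database user
    "password": parts.password,              # empty: nothing after the ':'
    "host": parts.hostname,                  # where the server runs
    "port": parts.port,                      # TCP port of the server
    "database": parts.path.lstrip("/"),      # database name
}
```

The database name at the end is the one that has to exist before Galaxy starts.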

The ubuntu user has permission to create databases, so let's create the database that we told Galaxy to connect to:

$ createdb galaxydb

Run, Galaxy, Run!

Galaxy includes a script to run it. This script also performs the Galaxy initialization the first time it is run. Run it now:

$ sh run.sh
Initializing external_service_types_conf.xml from external_service_types_conf.xml.sample
Initializing migrated_tools_conf.xml from migrated_tools_conf.xml.sample
Initializing reports_wsgi.ini from reports_wsgi.ini.sample
Initializing shed_tool_conf.xml from shed_tool_conf.xml.sample
Initializing tool_conf.xml from tool_conf.xml.sample
... (several minutes pass while install rolls through database changes) ...
galaxy.webapps.galaxy.buildapp DEBUG 2013-07-15 18:52:06,052 Enabling 'x-forwarded-host' middleware
galaxy.webapps.galaxy.buildapp DEBUG 2013-07-15 18:52:06,053 Enabling 'Request ID' middleware
Starting server in PID 7158.
serving on 0.0.0.0:8081 view at http://127.0.0.1:8081

This script performs several significant actions the first time it is run:

  • Creates initial configuration files and empty directories for storing data files
  • Fetches all of the Galaxy framework's dependencies, packaged as Python eggs, for the current platform.
  • Initializes its database. Galaxy uses a database migration system to automatically handle any changes to the database schema. On first load it runs all migrations to ensure the database is in a known state, which may take a little time.

Once the database is initialized, the normal startup process proceeds, loading tool configurations, starting the job runner, and finally initializing the web interface on the requested port. You can now access your Galaxy at http://ec2-##-##-##-##.compute-1.amazonaws.com:8081.

Running analyses with Galaxy

Without any additional configuration, there is already a lot we can do with our first Galaxy instance. As an example, let's work through an analysis that is based on, but distinct from, the Galaxy 101 tutorial.

1. Access your new Galaxy instance

Start a web browser and access http://ec2-##-##-##-##.compute-1.amazonaws.com:8081.

Galaxy FirstAnalysis 1.png

2. Create a user

In the top bar, select User → Register. Enter your

  • Email address
  • Password (use a low-security password, it's going over the net unencrypted)
  • Public name: Public names must be at least four characters in length and contain only lower-case letters, numbers, and the '-' character.

and click Submit.

Registering is not required in order to use Galaxy. However, to use all of it, users need to register.
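
The public name rule above can be written as a regular expression. The sketch below is a standalone check built from the rule as stated here, not Galaxy's own validation code; the function name is hypothetical.

```python
import re

# At least four characters, drawn only from lower-case letters,
# digits, and '-'.
PUBLIC_NAME = re.compile(r"[a-z0-9-]{4,}")


def is_valid_public_name(name):
    """Return True if name satisfies the public name rule stated above."""
    return PUBLIC_NAME.fullmatch(name) is not None


examples = {
    "gmod-2013": is_valid_public_name("gmod-2013"),
    "abc": is_valid_public_name("abc"),        # too short
    "Dave": is_valid_public_name("Dave"),      # upper-case letter
    "dave_c": is_valid_public_name("dave_c"),  # underscore not allowed
}
```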

3. Let's answer a question

Now that Galaxy is up and running, let's use it to answer a question.

We scooped up an unknown beast out of the slime and sent it off to the sequencing core to be sequenced and assembled so we can study it. Turns out it's an archaeon, and it's new. We have subsequently run the assembly through a gene prediction pipeline, and have also identified potential transcription factor binding sites (TFBSs) using another pipeline.

What we now want to know is: which exons have the most overlapping / embedded TFBSs?

4. Get the data into Galaxy

Select Tools → Get Data → Upload Data. This brings up the upload data form from which you can

  • upload data from your computer, or
  • upload data from one or more URLs
  • cut and paste data directly
  • bring files into your workspace that you have previously sent to Galaxy via FTP.

We are going to use the URL option. Cut and paste these URLs into the URL/Text box

http://bx.psu.edu/~clements/Events/GMOD2013/m.vannielli.codingexons.bed
http://bx.psu.edu/~clements/Events/GMOD2013/m.vannielli.TFBs.bed
http://bx.psu.edu/~clements/Events/GMOD2013/m.vannielli.sequence.fasta
http://bx.psu.edu/~clements/Events/GMOD2013/m.vannielli.wholegene.bed 

and click Execute.

This will import those 4 datasets into your history. Let's take a look at the data. For each dataset,

  • Click on the dataset name for a preview.
  • Poke the eye to see the full dataset.
  • Click on the pencil icon and give each dataset a better name (like M vannielli Exons). Click Save.
  • Change the history name from unnamed history (which is true, but not useful) to something more meaningful.

What's a "database" and why is it "?"

In the preview of each dataset it says

database: ?

The database (also referred to as a build) specifies which genome assembly this dataset is associated with. A genome assembly is usually named with an abbreviation for the species and a version number. For example, "mm9" represents the Mus musculus 9 assembly released in 2007, and "mm10" is the more recent assembly released in 2011. If these datasets had been directly imported into Galaxy from, for example, the UCSC Table Browser or BioMart, then the database for these datasets would have been set automatically.

The "?" means that Galaxy does not know which genome assembly the datasets (or more precisely, the coordinates in the datasets) are associated with. Genome assemblies are defined to Galaxy by the instance's administrators. Most Galaxy servers know about widely used assemblies, and if our datasets were for mm9 or mm10 (or any of many other choices), we (as users) could just tell Galaxy that and be done.

Setting up our own custom build

However, these datasets aren't from a common assembly. They are from your sequencing center and are for a novel archaeon that was pulled out of a mudflat (that is a bald-faced lie; it is actually Methanococcus vannielii, but we will pretend not to know that).

One feature of Galaxy is that users can define their own custom builds/genomes in the system. Once defined, they will be there every time that user logs in.

To define a custom build, click on User → Custom builds.


Our first peek at the Plumbing

Galaxy-dist has several important subdirectories

Path                                          Description
tools/                                        Defines tools in Galaxy.
tool-data/                                    Home of .loc files for sets of tools. .loc files tell where reference genomes, indexes, and the like can be found for particular tools.
tool-data/shared/                             Contains subdirectories for ensembl, gbrowse, genetrack, igv, jars, ncbi, rviewer, ucsc
tool-data/shared/ucsc/
tool-data/shared/ucsc/ucsc_build_sites.txt    Defines which genomes can be viewed at the various UCSC sites.

susScr2 is not in the list for the main UCSC site. Edit tool-data/shared/ucsc/ucsc_build_sites.txt and add it.

Restart Galaxy:

<control-c>
$ sh run.sh --reload

Click the Analyze Data tab to reload the screen. The "display at UCSC main" link is now one of the options.

Galaxy LinkToUCSCForPigs.png

3. Get Pig Repeat Regions

Get repeats from UCSC as well. Select Tools → Get Data → UCSC Main.

Set

  • group: Variation and Repeats
  • region: position and enter chr18

Galaxy GetPigRepeatsFromUCSC1.png

In the second UCSC window make sure Whole Gene is selected and then send the dataset to Galaxy.

Galaxy GetPigRepeatsFromUCSC2.png

Click on the new dataset's pencil icon and rename the dataset to something more useful, such as Pig Chr18 Rpts. Also set the score column to column 5.

Galaxy RepeatsDetails.png

Note that the dataset is already viewable in UCSC.

4. Identify genes and repeats that overlap

Select Tools → Operate on Genomic Intervals → Join.

Join dataset 1 (exons) with dataset 2 (repeats), with min overlap of 1 bp. Return Only records that are joined (INNER JOIN).

Galaxy IntervalJoinSettings.png

This takes two 6-column BED files and joins them together into 12-column records where the first 6 columns are from the exons dataset and the last 6 columns are from the repeats dataset. Furthermore, it only creates records when an exon and a repeat overlap.

Galaxy IntervalJoinResults.png

Take a close look at the dataset. Note that

  • Some exons were dropped
  • Some repeats were dropped
  • Some exons occur multiple times

Make sure you understand why.
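
If the reason is not obvious, a toy version of the join may help. The sketch below is not the Galaxy tool's implementation; the function names and the six made-up BED records are assumptions chosen so that each of the three effects shows up. Coordinates are treated as 0-based and half-open, as in BED.

```python
def overlap(a, b):
    """Number of bases shared by two BED records (0-based, half-open)."""
    if a[0] != b[0]:          # different chromosomes never overlap
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def inner_join(first, second, min_overlap=1):
    """Pair every record in first with every overlapping record in second."""
    return [a + b for a in first for b in second
            if overlap(a, b) >= min_overlap]


# Made-up records: chrom, start, end, name, score, strand.
exons = [
    ("chr18", 100, 200, "exonA", 0, "+"),
    ("chr18", 300, 400, "exonB", 0, "-"),
    ("chr18", 500, 600, "exonC", 0, "+"),
]
repeats = [
    ("chr18", 150, 160, "rpt1", 0, "+"),
    ("chr18", 190, 310, "rpt2", 0, "+"),
    ("chr18", 700, 800, "rpt3", 0, "+"),
]

pairs = inner_join(exons, repeats)
joined_names = [(row[3], row[9]) for row in pairs]
```

In this toy data exonC and rpt3 overlap nothing, so an inner join drops them, while exonA overlaps two repeats and therefore appears twice.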

Finally, rename the dataset to something like Exon Rpt Pairings.

5. Group and Count

Now we want to walk through the exon-repeat pairings and count the number of times each exon occurs. This number is the number of repeats that overlap with each exon.

We are going to do another operation that is borrowed from relational databases. Select Tools → Join, Subtract, and Group → Group.

Select the exon-repeat pairings dataset and set Group by column to c4, the column in the dataset that contains the exon name.

Then click Add new operation and then set Type to Count.

Galaxy GroupBySettings.png

This tells Galaxy to walk through the dataset, create a group for each different value of column 4 (the exon name), and then count the number of records that were in that group (i.e. the number of records that had each exon name).

This produces a two column dataset:

Galaxy GroupByResults.png

The first column is the value of the column we grouped by. The second is the number of records in the dataset that have that exon name.
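
The same group-and-count idea can be sketched in plain Python. This is an illustration with made-up pairings, not the Group tool's code; the helper name and the trimmed 4-column rows are assumptions (the real dataset has 12 columns, but only column 4 matters here).

```python
from collections import Counter


def group_and_count(rows, column):
    """Group rows by a 1-based column (as in c4) and count each group."""
    counts = Counter(row[column - 1] for row in rows)
    return sorted(counts.items())


# Made-up exon-repeat pairings, trimmed to the first four columns.
pairings = [
    ("chr18", 100, 200, "exonA"),
    ("chr18", 100, 200, "exonA"),
    ("chr18", 300, 400, "exonB"),
]

exon_counts = group_and_count(pairings, 4)
```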

Rename this dataset to Exons with rpt counts, unsorted.

If we were now to run Tools → Filter and Sort → Sort on this dataset, we would have the answer to our original question:

Which exons have the most repeats?

We have the list of exons, and the counts in them. We could use this dataset in further analysis, email it to someone, etc.
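
One detail worth keeping in mind when sorting is that the counts must be compared as numbers, not as text. A minimal sketch with made-up tab-separated lines (the values are assumptions for illustration):

```python
# Made-up two-column rows, as they would look in a tab-separated file.
lines = ["exonA\t2", "exonB\t10", "exonC\t9"]
rows = [line.split("\t") for line in lines]

# Numeric, descending: the exon with the most repeats comes first.
ranked = sorted(rows, key=lambda row: int(row[1]), reverse=True)
top_exon = ranked[0][0]

# Comparing the counts as text would rank "9" above "10".
text_ranked = sorted(rows, key=lambda row: row[1], reverse=True)
text_top_exon = text_ranked[0][0]
```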

6. Get Exon Info back

However, we can do better. We have lost some information about the exons (like position, strand, and so on) that we had in the original exon dataset. If we can reclaim that information, and add to it, we can produce a more useful dataset that we can visualize right now.

The original exon dataset downloaded from UCSC had a meaningless score column. Let's replace that with the repeat count.


First, bring the original exon information together with the counts.

Select Tools → Join, Subtract and Group → Join two Datasets. Set the first dataset to Exons with repeat counts and the second to be the Pig Chr18 Exons dataset.

Join them using column c1 and column c4, which are the exon names in both datasets.

Galaxy JoinOnExonName.png

This produces an 8-column dataset with the exon repeat counts in the first two columns and the exon information in the last 6 columns.
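
A join on a shared column can be sketched as a dictionary lookup. The code below is an illustration with made-up rows, not the Galaxy tool itself; the function name is hypothetical and the column numbers are 1-based to match c1 and c4.

```python
def join_on_columns(left, right, left_col, right_col):
    """Join two datasets where left_col of left equals right_col of right."""
    index = {}
    for row in right:
        index.setdefault(row[right_col - 1], []).append(row)
    return [l_row + r_row
            for l_row in left
            for r_row in index.get(l_row[left_col - 1], [])]


# Made-up inputs: exon counts (name, count) and 6-column BED exons.
counts = [("exonA", 2), ("exonB", 1)]
exons = [
    ("chr18", 100, 200, "exonA", 0, "+"),
    ("chr18", 300, 400, "exonB", 0, "-"),
    ("chr18", 500, 600, "exonC", 0, "+"),
]

joined = join_on_columns(counts, exons, 1, 4)
```

Each output row has the two count columns followed by the six exon columns; exonC has no count, so it does not appear.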

Galaxy JoinOnExonNameResults.png


Now, use the Cut tool to reshuffle these 8 columns into a valid 6 column BED file with the repeat count in column 5, the score column.

Select Tools → Text Manipulation → Cut. Enter c3,c4,c5,c6,