GBrowse syn Database

From GMOD
Revision as of 15:40, 30 December 2009 by Mckays

GBrowse_syn is a GBrowse-based synteny viewer. This page describes the database that GBrowse_syn uses and how to load syntenic data into that database.

  • Sample data and configuration files can be downloaded from the GMOD FTP site.
  • These data are for two rice species (courtesy of Bonnie Hurwitz)

Example Alignment Data

The sample below is in CLUSTALW format. Other formats are also supported (see below).

NOTE: The sequence naming convention "species-seqid(strand)/start-end" shown in the example below carries metadata about the alignment. It is essential for loading the data correctly with strand and coordinate information.


CLUSTAL W(1.81) multiple sequence alignment


c_briggsae-chrII(+)/43862-46313           ATGAGCTTCCACAAAAGCATGAGCTTTCTCAGCTTCTGCCACATCAGCATTCAAATGATC
c_remanei-Crem_Contig172(-)/123228-124941 ATGAGCCTCTACAACCGCATGATTCTTTTCAGCCTCTGCCACGTCCGCATTCAAATGCTC
c_brenneri-Cbre_Contig60(+)/627772-630087 ATGAGCCTCCACAACAGCATGATTTTTCTCGGCTTCCGCCACATCCGCATTCAAATGATC
c_elegans-II(+)/9706834-9708803           ATGAGCCTCTACTACAGCATGATTCTTCTCAGCTTCTGCAACGTCAGCATTCAGATGATC
                                          ****** ** ** *  ******   ** ** ** ** ** ** ** ******* *** **

c_briggsae-chrII(+)/43862-46313           CGCACAAATATGATGCACAAATCCACAACCTAAAGCATCTCCGATAACGTTGACCGAAGT
c_remanei-Crem_Contig172(-)/123228-124941 AGCACAAATGTAATGAACGAATCCGCATCCCAACGCATCGCCAATCACATTCACAGATGT
c_brenneri-Cbre_Contig60(+)/627772-630087 CGCACAAATGTAGTGGACAAATCCGCATCCCAAAGCGTCTCCGATAACATTTACCGAAGT
c_elegans-II(+)/9706834-9708803           TGCACAAATGTGATGAACGAATCCACATCCCAATGCATCACCGATCACATTGACAGATGT
                                           ******** *  ** ** ***** ** ** ** ** ** ** ** ** ** ** ** **
c_briggsae-chrII(+)/43862-46313           CCGGAGTCGATCCCTGAAT-----------------------------------------
c_remanei-Crem_Contig172(-)/123228-124941 ACGAAGTCGGTCCCTATAAGGTATGATTTTATATGA----TGTACCATAAGGAAATAGTC
c_brenneri-Cbre_Contig60(+)/627772-630087 ACGAAGTCGATCCCTGAAA---------TCAGATGAGCGGTTGACCA---GAGAACAACC
c_elegans-II(+)/9706834-9708803           ACGAAGTCGGTCCCTGAAC--AATTATTT----TGA----TATA---GAAAGAAACGGTA
                                           ** ***** *****  *

Note on formats

These example data are in clustalw format. The scripts used to process these data accept clustalw and the other commonly used formats supported by BioPerl's AlignIO parser. The format name is only a file convention; it does not mean that the clustalw program was used to generate the alignment data.

Note on the sequence ID syntax

The sequence ID in this clustal file is overloaded to contain information about the species, strand and coordinates. This information is essential:

 rice-3(+)/16598648-16600199

The general format is species-refseq(strand)/start-end

species
name of species, genome, strain, etc (string with no '-' characters)
refseq
name of reference sequence (string with no '/' characters)
(strand)
orientation of the alignment (relative to the reference sequence; + or -)
start
start coordinate of the alignment relative to the reference sequence (integer)
end
end coordinate of the alignment relative to the reference sequence (integer)

Examples:

   c_elegans-I(+)/1-2300
   myco_bovis-chr1(-)/15000-25000
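
The naming convention can be checked before loading with a small parser. The following is a minimal Python sketch, not the code used by the GBrowse_syn loading scripts. It assumes the hyphen-separated "species-refseq(strand)/start-end" form given in the general format above; the names `AlignedRegion` and `parse_seq_id` are hypothetical.

```python
import re
from typing import NamedTuple


class AlignedRegion(NamedTuple):
    """One parsed sequence ID (hypothetical helper type)."""
    species: str
    seqid: str
    strand: str
    start: int
    end: int


# species may not contain '-', and the reference sequence name may not contain '/'.
_SEQ_ID = re.compile(
    r"^(?P<species>[^-]+)-(?P<seqid>[^/]+)"
    r"\((?P<strand>[+-])\)/(?P<start>\d+)-(?P<end>\d+)$"
)


def parse_seq_id(name: str) -> AlignedRegion:
    """Split 'species-refseq(strand)/start-end' into its five parts."""
    match = _SEQ_ID.match(name.strip())
    if match is None:
        raise ValueError(f"not in species-refseq(strand)/start-end form: {name!r}")
    return AlignedRegion(
        match["species"],
        match["seqid"],
        match["strand"],
        int(match["start"]),
        int(match["end"]),
    )
```

For example, `parse_seq_id("rice-3(+)/16598648-16600199")` yields species `rice`, reference sequence `3`, strand `+`, and the two integer coordinates. An ID that does not follow the convention raises `ValueError`, which makes malformed names easy to spot before a load is attempted.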

Loading the alignment database

From multiple sequence alignments

Multiple sequence alignments, often saved in clustalw or fasta format, can be loaded directly into the GBrowse_syn alignment database. The easiest way to load alignment data into the database is to use the script [load_alignments_msa.pl].

For more information on whole genome alignment approaches, see Whole genome alignments.

This script will process alignment data in a number of formats recognized by BioPerl. Note that these are file formatting conventions; they do not imply that a particular program is the method of choice for generating your alignments.

Whole genome alignments for multiple species

Alignment data loading format

  • The output of the script is a tab-delimited, ad hoc intermediate format that encodes the alignment coordinates plus a 1:1 mapping of coordinates within the alignment (to facilitate accurate grid-lines in GBrowse_syn).
  • However you obtain your alignment or synteny data, this is the format currently supported for loading the alignment database.
  • Each record is a single line (wrapped here for display only)
#species1       seqid1  start1   end1   strand1  reserved  species2      seqid2         start2   end2  strand2 reserved \
# pos1-1  pos1-2  ...  posn-1  posn-2  |  pos1-2  pos1-1  ...  posn-2  posn-1
c_briggsae      chrI    1583997 1590364 +       .       c_remanei       Crem_Contig24   631879  634679  -       .       \
1584000 634676  1584100 634584  (truncated...)  |     631900  1590333 632000  1590233  (truncated ...)
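
To make the record layout concrete, here is a minimal Python sketch that reads one such line. It is an illustration only, not part of the GBrowse_syn scripts. It assumes that every value, including the '|' separator, is its own tab-delimited field, and that the coordinate maps are flat lists of position pairs; the names `Hit` and `parse_record` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """Coordinates of one side of an alignment record (hypothetical type)."""
    species: str
    seqid: str
    start: int
    end: int
    strand: str


def _pairs(values: List[str]) -> List[Tuple[int, int]]:
    """Group a flat list of positions into (position, matching position) pairs."""
    if len(values) % 2:
        raise ValueError("coordinate map must hold an even number of positions")
    numbers = [int(v) for v in values]
    return list(zip(numbers[0::2], numbers[1::2]))


def parse_record(line: str):
    """Return (hit1, hit2, map1, map2) for one tab-delimited record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-delimited fields")
    # Fields 6 and 12 (1-based) are the 'reserved' columns and are skipped.
    hit1 = Hit(fields[0], fields[1], int(fields[2]), int(fields[3]), fields[4])
    hit2 = Hit(fields[6], fields[7], int(fields[8]), int(fields[9]), fields[10])
    rest = [f.strip() for f in fields[12:] if f.strip()]
    if "|" in rest:
        split_at = rest.index("|")
        left, right = rest[:split_at], rest[split_at + 1:]
    else:
        # The 1:1 coordinate maps are optional.
        left, right = rest, []
    return hit1, hit2, _pairs(left), _pairs(right)
```

Each pair in the first map reads as (position in species 1, matching position in species 2), and the second map holds the reverse direction. Lines starting with '#' are header comments and should be skipped before calling `parse_record`.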

Create a MySQL database

mysql -uroot -ppassword -e 'create database my_database'

Load the database

GBrowse_syn Database Schema

  • The alignment database schema is very simple: it has a table for all reciprocal 'hits' (alignment features) and a table for the (optional) 1:1 coordinate maps.
  • The alignments table contains coordinate information and also supports cigar-line representations of the alignment, to facilitate future reconstruction of the alignment within GBrowse_syn.
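
As an illustration of what a cigar line encodes, the Python sketch below derives one from two gapped rows of an alignment. This page does not specify the cigar dialect that GBrowse_syn stores, so the SAM-style operations used here (M for an aligned column, I for a base present only in the second row, D for a base present only in the first row) are an assumption, and `cigar_from_alignment` is a hypothetical helper.

```python
from itertools import groupby


def cigar_from_alignment(ref_row: str, query_row: str, gap: str = "-") -> str:
    """Run-length encode a pairwise alignment as M/I/D operations."""
    if len(ref_row) != len(query_row):
        raise ValueError("aligned rows must have the same length")
    ops = []
    for ref_char, query_char in zip(ref_row, query_row):
        if ref_char == gap and query_char == gap:
            # Column is empty in both rows (can occur when the two rows
            # are taken from a larger multiple alignment); skip it.
            continue
        if ref_char == gap:
            ops.append("I")
        elif query_char == gap:
            ops.append("D")
        else:
            ops.append("M")
    # Collapse consecutive identical operations into <count><op> runs.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))
```

Because the cigar line records only run lengths, the gapped alignment can be rebuilt later from the two ungapped sequences and the cigar string, which is the kind of reconstruction the schema note above refers to.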

GBS Schema.png