GBrowse syn Database

From GMOD

Latest revision as of 21:27, 14 August 2012

GBrowse_syn is a GBrowse-based synteny viewer. This page describes the database that GBrowse_syn uses and how to load syntenic data into it.

  • Sample data and configuration files can be downloaded from the GMOD FTP site; the sample data are for two rice species (courtesy of Bonnie Hurwitz)


Example Alignment Data

The sample below is in CLUSTALW format. Other formats are also supported (see below).

NOTE: The sequence naming convention "species-seqid(strand)/start-end" shown in the example below carries metadata about the alignment. This metadata is essential for the data to be loaded correctly with strand and coordinate information.


CLUSTAL W(1.81) multiple sequence alignment


c_briggsae-chrII(+)/43862-46313           ATGAGCTTCCACAAAAGCATGAGCTTTCTCAGCTTCTGCCACATCAGCATTCAAATGATC
c_remanei-Crem_Contig172(-)/123228-124941 ATGAGCCTCTACAACCGCATGATTCTTTTCAGCCTCTGCCACGTCCGCATTCAAATGCTC
c_brenneri-Cbre_Contig60(+)/627772-630087 ATGAGCCTCCACAACAGCATGATTTTTCTCGGCTTCCGCCACATCCGCATTCAAATGATC
c_elegans-II(+)/9706834-9708803           ATGAGCCTCTACTACAGCATGATTCTTCTCAGCTTCTGCAACGTCAGCATTCAGATGATC
                                          ****** ** ** *  ******   ** ** ** ** ** ** ** ******* *** **

c_briggsae-chrII(+)/43862-46313           CGCACAAATATGATGCACAAATCCACAACCTAAAGCATCTCCGATAACGTTGACCGAAGT
c_remanei-Crem_Contig172(-)/123228-124941 AGCACAAATGTAATGAACGAATCCGCATCCCAACGCATCGCCAATCACATTCACAGATGT
c_brenneri-Cbre_Contig60(+)/627772-630087 CGCACAAATGTAGTGGACAAATCCGCATCCCAAAGCGTCTCCGATAACATTTACCGAAGT
c_elegans-II(+)/9706834-9708803           TGCACAAATGTGATGAACGAATCCACATCCCAATGCATCACCGATCACATTGACAGATGT
                                           ******** *  ** ** ***** ** ** ** ** ** ** ** ** ** ** ** **
c_briggsae-chrII(+)/43862-46313           CCGGAGTCGATCCCTGAAT-----------------------------------------
c_remanei-Crem_Contig172(-)/123228-124941 ACGAAGTCGGTCCCTATAAGGTATGATTTTATATGA----TGTACCATAAGGAAATAGTC
c_brenneri-Cbre_Contig60(+)/627772-630087 ACGAAGTCGATCCCTGAAA---------TCAGATGAGCGGTTGACCA---GAGAACAACC
c_elegans-II(+)/9706834-9708803           ACGAAGTCGGTCCCTGAAC--AATTATTT----TGA----TATA---GAAAGAAACGGTA
                                           ** ***** *****  *

Note on formats

These example data are in clustalw format. The scripts used to process these data accept clustalw and the other commonly used formats supported by BioPerl's AlignIO parser. The file format does not imply that clustalw was the program used to generate the alignment data.
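
To make the block structure of the clustalw format concrete, the following is a minimal Python sketch that stitches the interleaved blocks back into one aligned string per sequence ID. It is not part of GBrowse_syn (the loading scripts rely on BioPerl's AlignIO); the function name `read_clustal` and the tiny two-sequence sample are made up for illustration, and the sketch assumes that consensus lines always begin with whitespace.

```python
from collections import OrderedDict


def read_clustal(text):
    """Collect the full aligned sequence for each ID in CLUSTAL-format text.

    Assumption: consensus lines start with whitespace and sequence IDs
    contain no whitespace, as in the sample on this page.
    """
    sequences = OrderedDict()
    for line in text.splitlines():
        # Skip blank lines, the CLUSTAL header, and consensus ('*') lines.
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        seq_id, chunk = parts[0], parts[1]
        # Each block contributes the next stretch of columns for this ID.
        sequences[seq_id] = sequences.get(seq_id, "") + chunk
    return sequences


# Hypothetical two-block alignment (invented data, not from the rice sample).
sample = """CLUSTAL W(1.81) multiple sequence alignment

rice-3(+)/1-11        ATGAGCTT
wild_rice-3(-)/5-16   ATGAGCCT
                      ****** *

rice-3(+)/1-11        CC-A
wild_rice-3(-)/5-16   CCGA
"""

aln = read_clustal(sample)
for seq_id, seq in aln.items():
    print(seq_id, seq)
```

Gap characters are kept as-is, so every sequence in the result has the same length as the alignment.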

Note on the sequence ID syntax

The sequence ID in this clustal file is overloaded to contain information about the species, strand and coordinates. This information is essential:

 rice-3(+)/16598648-16600199

The general format is species-refseq(strand)/start-end

species
name of species, genome, strain, etc (string with no '-' characters)
refseq
name of reference sequence (string with no '/' characters)
(strand)
orientation of the alignment (relative to the reference sequence; + or -)
start
start coordinate of the alignment relative to the reference sequence (integer)
end
end coordinate of the alignment relative to the reference sequence (integer)

Examples:

   c_elegans-I(+)/1..2300
   myco_bovis-chr1(-)/15000..25000
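
As a rough illustration of how such an ID decomposes, here is a small Python sketch that splits it into its five parts. It is not the parser used by the GBrowse_syn scripts; the names `AlignedRegion` and `parse_seq_id` are hypothetical. Because this page writes coordinates both as `start-end` and as `start..end`, the pattern tolerates either separator, which is an assumption rather than documented loader behaviour.

```python
import re
from typing import NamedTuple


class AlignedRegion(NamedTuple):
    species: str
    seqid: str
    strand: str
    start: int
    end: int


# Illustrative pattern only; the real loader may be stricter or looser.
SEQ_ID_PATTERN = re.compile(
    r"^(?P<species>[^-]+)-"        # species: string with no '-' characters
    r"(?P<seqid>[^/]+)"            # reference sequence: no '/' characters
    r"\((?P<strand>[+-])\)/"       # strand in parentheses, then '/'
    r"(?P<start>\d+)(?:-|\.\.)"    # start, then '-' or '..' (assumption)
    r"(?P<end>\d+)$"               # end coordinate
)


def parse_seq_id(seq_id: str) -> AlignedRegion:
    """Split 'species-seqid(strand)/start-end' into its components."""
    match = SEQ_ID_PATTERN.match(seq_id)
    if match is None:
        raise ValueError(f"unexpected sequence ID: {seq_id!r}")
    return AlignedRegion(
        match["species"],
        match["seqid"],
        match["strand"],
        int(match["start"]),
        int(match["end"]),
    )


print(parse_seq_id("rice-3(+)/16598648-16600199"))
print(parse_seq_id("c_remanei-Crem_Contig172(-)/123228-124941"))
```

The species part ends at the first '-', which is why species names must not contain that character, while the reference sequence name may.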


Loading the alignment database

Create a MySQL database

Before you load the alignment data, make sure that the target database already exists; if it does not, create it with the following MySQL command (substitute your own user name, password and database name):

mysql -uroot -ppassword -e 'create database my_database'


Loading from multiple sequence alignments

Multiple sequence alignments can be loaded directly into the GBrowse_syn alignment database with the script load_alignments_msa.pl. This script will process alignment data in a number of formats recognized by BioPerl, such as CLUSTAL and FASTA. Note that these are file formatting conventions used by a variety of different applications; using one of these formats does not imply that a particular program is the method of choice for generating your alignments. Whole genome alignments for multiple species are generally more complex than simple multiple sequence alignments made with clustalw.


Loading from other sources

The script load_alignment_database.pl can be used to load the alignment database from tab-delimited alignment data files (format described below). This format can serve as an intermediate for parsed alignment data, or it can be used for data that do not come from multiple sequence alignments, for example gene orthology data or defined regions of co-linearity. The tab-delimited format requires start and end coordinates for each reference sequence. Any features that have start, end and strand information can be used.


Data loading format

The loading format is a tab-delimited intermediate format that encodes the alignment coordinates plus an optional 1:1 mapping of coordinates within the alignment (to facilitate accurate grid-lines in GBrowse_syn). Each record is a single line (wrapped here for display only). Note that a reciprocal alignment is also created during database loading.

#species1       seqid1  start1   end1   strand1  reserved  species2      seqid2         start2   end2  strand2 reserved \
# pos1-1  pos1-2  ...  posn-1  posn-2  |  pos1-2  pos1-1  ...  posn-2  posn-1
c_briggsae      chrI    1583997 1590364 +       .       c_remanei       Crem_Contig24   631879  634679  -       .       \
1584000 634676  1584100 634584  (truncated...)  |     631900  1590333 632000  1590233  (truncated ...)
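
The record layout above can be read back with a few lines of Python. This is only a sketch under stated assumptions, not the logic of load_alignment_database.pl: the function name `parse_record` is hypothetical, fields are split on any whitespace (which assumes no field contains a space), and the `|` separator is assumed to be a standalone field. The example line reuses the record shown above with the truncated portions left out.

```python
FIELDS = ("species", "seqid", "start", "end", "strand", "reserved")


def _parse_side(tokens):
    """Turn six fields into a dict with integer start/end coordinates."""
    side = dict(zip(FIELDS, tokens))
    side["start"] = int(side["start"])
    side["end"] = int(side["end"])
    return side


def _pairs(tokens):
    """Group a flat list of positions into (position, mapped_position) pairs."""
    if len(tokens) % 2:
        raise ValueError("coordinate map must contain an even number of values")
    return [(int(a), int(b)) for a, b in zip(tokens[0::2], tokens[1::2])]


def parse_record(line):
    """Parse one record: two six-field hits, then optional coordinate maps."""
    tokens = line.split()
    if len(tokens) < 12:
        raise ValueError("expected at least 12 fields")
    hit1, hit2 = _parse_side(tokens[0:6]), _parse_side(tokens[6:12])
    rest = tokens[12:]
    if "|" in rest:
        bar = rest.index("|")
        map1, map2 = _pairs(rest[:bar]), _pairs(rest[bar + 1:])
    else:
        # The 1:1 maps are optional.
        map1, map2 = _pairs(rest), []
    return hit1, hit2, map1, map2


record = "\t".join([
    "c_briggsae", "chrI", "1583997", "1590364", "+", ".",
    "c_remanei", "Crem_Contig24", "631879", "634679", "-", ".",
    "1584000", "634676", "1584100", "634584", "|",
    "631900", "1590333", "632000", "1590233",
])

hit1, hit2, map1, map2 = parse_record(record)
print(hit1["species"], hit1["seqid"], hit1["start"], hit1["end"], hit1["strand"])
print(hit2["species"], hit2["seqid"], hit2["start"], hit2["end"], hit2["strand"])
print(map1, map2)
```

Each pair in the first map gives a position in the first sequence and the matching position in the second; the second map lists the same kind of pairs from the other sequence's point of view.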


GBrowse_syn Database Schema

  • The alignment database schema is very simple; it has a table for all reciprocal 'hits', or alignment features, and a table for (optional) 1:1 coordinate maps.
  • The alignments table contains coordinate information and also supports cigar-line representations of the alignment, to facilitate future reconstruction of the alignment within GBrowse_syn.


GBS Schema.png