Difference between revisions of "GBrowse syn Database"

From GMOD
(Note on CLUSTALW)
Line 12: Line 12:
  
 
=Loading=
 
=Loading=
==Input formats for alignment, colinearity or synteny data==
+
* The starting point for loading the alignment database is a CLUSTALW format multiple sequence alignment
* The starting point for loading the alignment database is typically a multiple sequence alignment file
+
* The source of the alignment data is up to you but the supported format for entry is currently CLUSTALW.  This format is not mandatory but the parsing script below will need to be adjusted to support other formats
* The source of the alignment data is up to you, but the supported alignment input formats are currently [http://www.bioperl.org/wiki/ClustalW_multiple_alignment_format CLUSTALW], [http://www.bioperl.org/wiki/FASTA_multiple_alignment_format FASTA], [http://bio.perl.org/wiki/SELEX_multiple_alignment_format SELEX], [http://www.bioperl.org/wiki/MSF_multiple_alignment_format MSF] and [http://www.bioperl.org/wiki/Stockholm_multiple_alignment_format STOCKHOLM]. Which format you use is not particularly important, as long as it is recognized by the Bio::AlignIO library for parsing and allows sequence names long enough to accommodate the rules below.
+
==Clustal alignment format==
 
+
<font color="green">''NOTE: The sequence naming convention "species-seqid(strand)/start-end" shown in the example below is essential for the data to be loaded correctly with strand and coordinate information''</font>
<font color="green">''NOTE: It is necessary to overload the sequence IDs with meta-data about the alignment, using naming convention "species-seqid(strand)/start-end" shown in the example below''</font>
+
===Example Alignment Data===
+
selex format
+
 
<pre>
 
<pre>
rice-3(+)/16598648-16598707        ggaggccggccgtctgccatgcgtgagccagacggggcgggccggagacaggccacgtgg
 
wild_rice-3(+)/14467855-14467878    gggggccgg------------------------------------agacaggccacgtgg
 
</pre>
 
MSF format
 
<pre>
 
NoName  MSF: 2  Type: N  Thu Aug  6 10:56:43 2009  Check: 00 ..
 
  
Name: rice-3(+)/16598648-16598707      Len:    60  Check:  4764  Weight:  1.00
+
CLUSTAL W(1.81) multiple sequence alignment
Name: wild_rice-3(+)/14467855-14467878  Len:    60  Check:  932  Weight:  1.00
+
  
//
 
 
 
                                  1                                                  50
 
rice-3(+)/16598648-16598707      ggaggccggc cgtctgccat gcgtgagcca gacggggcgg gccggagaca
 
wild_rice-3(+)/14467855-14467878  gggggccgg- ---------- ---------- ---------- -----agaca
 
 
 
                                  51                                                100
 
rice-3(+)/16598648-16598707      ggccacgtgg
 
wild_rice-3(+)/14467855-14467878  ggccacgtgg
 
</pre>
 
 
STOCKHOLM format
 
<pre>
 
# STOCKHOLM 1.0
 
 
#=GF ID NoName
 
#=GS rice-3(+)/16598648-16598707AC unknown
 
#=GS wild_rice-3(+)/14467855-14467878AC unknown
 
 
rice-3(+)/16598648-16598707      ggaggccggccgtctgccatgcgtgagccagacggggcgggccggagacaggccacgtgg
 
wild_rice-3(+)/14467855-14467878  gggggccgg------------------------------------agacaggccacgtgg
 
//
 
</pre>
 
FASTA format
 
<pre>
 
>rice-3(+)/16598648-16598707
 
ggaggccggccgtctgccatgcgtgagccagacggggcgggccggagacaggccacgtgg
 
>wild_rice-3(+)/14467855-14467878
 
gggggccgg------------------------------------agacaggccacgtgg
 
</pre>
 
 
CLUSTALW format
 
<pre>
 
CLUSTAL W(1.81) multiple sequence alignment
 
  
 +
c_briggsae-chrII(+)/43862-46313          ATGAGCTTCCACAAAAGCATGAGCTTTCTCAGCTTCTGCCACATCAGCATTCAAATGATC
 +
c_remanei-Crem_Contig172(-)/123228-124941 ATGAGCCTCTACAACCGCATGATTCTTTTCAGCCTCTGCCACGTCCGCATTCAAATGCTC
 +
c_brenneri-Cbre_Contig60(+)/627772-630087 ATGAGCCTCCACAACAGCATGATTTTTCTCGGCTTCCGCCACATCCGCATTCAAATGATC
 +
c_elegans-II(+)/9706834-9708803          ATGAGCCTCTACTACAGCATGATTCTTCTCAGCTTCTGCAACGTCAGCATTCAGATGATC
 +
                                          ****** ** ** *  ******  ** ** ** ** ** ** ** ******* *** **
  
rice-3(+)/16598648-16598707      ggaggccggccgtctgccatgcgtgagccagacggggcgggccggagacaggccacgtgg
+
c_briggsae-chrII(+)/43862-46313          CGCACAAATATGATGCACAAATCCACAACCTAAAGCATCTCCGATAACGTTGACCGAAGT
wild_rice-3(+)/14467855-14467878 gggggccgg------------------------------------agacaggccacgtgg
+
c_remanei-Crem_Contig172(-)/123228-124941 AGCACAAATGTAATGAACGAATCCGCATCCCAACGCATCGCCAATCACATTCACAGATGT
                                ** ******  
+
c_brenneri-Cbre_Contig60(+)/627772-630087 CGCACAAATGTAGTGGACAAATCCGCATCCCAAAGCGTCTCCGATAACATTTACCGAAGT
 +
c_elegans-II(+)/9706834-9708803          TGCACAAATGTGATGAACGAATCCACATCCCAATGCATCACCGATCACATTGACAGATGT
 +
                                          ******** *  ** ** ***** ** ** ** ** ** ** ** ** ** ** ** **
 +
c_briggsae-chrII(+)/43862-46313          CCGGAGTCGATCCCTGAAT-----------------------------------------
 +
c_remanei-Crem_Contig172(-)/123228-124941 ACGAAGTCGGTCCCTATAAGGTATGATTTTATATGA----TGTACCATAAGGAAATAGTC
 +
c_brenneri-Cbre_Contig60(+)/627772-630087 ACGAAGTCGATCCCTGAAA---------TCAGATGAGCGGTTGACCA---GAGAACAACC
 +
c_elegans-II(+)/9706834-9708803          ACGAAGTCGGTCCCTGAAC--AATTATTT----TGA----TATA---GAAAGAAACGGTA
 +
                                          ** ***** ***** *
 
</pre>
 
</pre>
  
==='''<font color=red>Note on formats</font>'''===
+
==='''<font color=red>Note on CLUSTALW</font>'''===
These data are in clustalw and other format.  The scripts used to process these data will recognize these commonly used formats recognized by BioPerl's [http://search.cpan.org/~birney/bioperl-1.2.3/Bio/AlignIO.pm AlignIO parser].  '''''This does not mean that clustalw or some other program are necessarily required to generate the alignment data'''''.
+
These data are in clustalw format.  The scripts used to process these data will recognize clustalw and other commonly used formats recognized by BioPerl's [http://search.cpan.org/~birney/bioperl-1.2.3/Bio/AlignIO.pm AlignIO parser].  '''''This does not mean that clustalw is the actual program used to generate the alignment data'''''.
  
 +
* This particular alignment file in clustalw ''format'' was generated using a part of the <span class="pops">[http://feb2006.archive.ensembl.org/info/software/compara/compara_tutorial.html compara pipeline]</span>.
 
* See <span class="pops">[https://www.nescent.org/wg/courses_gmod_09/images/c/cf/WGA_data.png this generalized hierarchical whole genome alignment workflow]</span> for general information on how whole genome alignment data can be generated.
 
* See <span class="pops">[https://www.nescent.org/wg/courses_gmod_09/images/c/cf/WGA_data.png this generalized hierarchical whole genome alignment workflow]</span> for general information on how whole genome alignment data can be generated.
  

Revision as of 15:07, 6 August 2009

GBrowse_syn is a GBrowse based synteny viewer. This page describes the database that GBrowse_syn uses, and how to get syntenic data into that database.

Schema

  • The alignment database schema is very simple; it has a table for all reciprocal 'hits' (alignment features) and a table for optional 1:1 coordinate maps.
  • The alignments table contains coordinate information and also supports cigar-line representations of the alignment, to facilitate future reconstruction of the alignment within GBrowse_syn.

GBS Schema.png

Sample data

  • Sample data and configuration files can be downloaded from the GMOD FTP site.
  • These data are for two rice species (courtesy of Bonnie Hurwitz)

Loading

  • The starting point for loading the alignment database is a CLUSTALW format multiple sequence alignment
  • The source of the alignment data is up to you, but the supported format for entry is currently CLUSTALW. This format is not mandatory, but the parsing script below will need to be adjusted to support other formats.

Clustal alignment format

NOTE: The sequence naming convention "species-seqid(strand)/start-end" shown in the example below is essential for the data to be loaded correctly with strand and coordinate information


CLUSTAL W(1.81) multiple sequence alignment


c_briggsae-chrII(+)/43862-46313           ATGAGCTTCCACAAAAGCATGAGCTTTCTCAGCTTCTGCCACATCAGCATTCAAATGATC
c_remanei-Crem_Contig172(-)/123228-124941 ATGAGCCTCTACAACCGCATGATTCTTTTCAGCCTCTGCCACGTCCGCATTCAAATGCTC
c_brenneri-Cbre_Contig60(+)/627772-630087 ATGAGCCTCCACAACAGCATGATTTTTCTCGGCTTCCGCCACATCCGCATTCAAATGATC
c_elegans-II(+)/9706834-9708803           ATGAGCCTCTACTACAGCATGATTCTTCTCAGCTTCTGCAACGTCAGCATTCAGATGATC
                                          ****** ** ** *  ******   ** ** ** ** ** ** ** ******* *** **

c_briggsae-chrII(+)/43862-46313           CGCACAAATATGATGCACAAATCCACAACCTAAAGCATCTCCGATAACGTTGACCGAAGT
c_remanei-Crem_Contig172(-)/123228-124941 AGCACAAATGTAATGAACGAATCCGCATCCCAACGCATCGCCAATCACATTCACAGATGT
c_brenneri-Cbre_Contig60(+)/627772-630087 CGCACAAATGTAGTGGACAAATCCGCATCCCAAAGCGTCTCCGATAACATTTACCGAAGT
c_elegans-II(+)/9706834-9708803           TGCACAAATGTGATGAACGAATCCACATCCCAATGCATCACCGATCACATTGACAGATGT
                                           ******** *  ** ** ***** ** ** ** ** ** ** ** ** ** ** ** **

c_briggsae-chrII(+)/43862-46313           CCGGAGTCGATCCCTGAAT-----------------------------------------
c_remanei-Crem_Contig172(-)/123228-124941 ACGAAGTCGGTCCCTATAAGGTATGATTTTATATGA----TGTACCATAAGGAAATAGTC
c_brenneri-Cbre_Contig60(+)/627772-630087 ACGAAGTCGATCCCTGAAA---------TCAGATGAGCGGTTGACCA---GAGAACAACC
c_elegans-II(+)/9706834-9708803           ACGAAGTCGGTCCCTGAAC--AATTATTT----TGA----TATA---GAAAGAAACGGTA
                                           ** ***** *****  *

Note on CLUSTALW

These data are in clustalw format. The scripts used to process these data will recognize clustalw and other commonly used formats recognized by BioPerl's AlignIO parser. This does not mean that clustalw is the actual program used to generate the alignment data.

Note on the sequence ID syntax

The sequence ID in this clustal file is overloaded to contain information about the species, strand and coordinates. This information is essential:

 rice-3(+)/16598648-16600199
 species-refseq(strand)/start-end
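
As an illustration of how the overloaded ID breaks down, here is a minimal Python sketch. It is not the parsing code used by the GBrowse_syn scripts (which are written in Perl); the function name, the AlignedRegion type and the regular expression are assumptions made for this example. In particular, it assumes that the species name contains no hyphen, so that the first hyphen separates the species from the reference sequence name.

```python
import re
from typing import NamedTuple


class AlignedRegion(NamedTuple):
    """Fields encoded in an ID such as rice-3(+)/16598648-16600199."""
    species: str
    seqid: str
    strand: str
    start: int
    end: int


# Assumption: the species name has no hyphen, so the first hyphen ends it.
# The reference sequence name may itself contain hyphens or underscores.
SEQ_ID_PATTERN = re.compile(
    r"^(?P<species>[^-]+)-(?P<seqid>.+)\((?P<strand>[+-])\)"
    r"/(?P<start>\d+)-(?P<end>\d+)$"
)


def parse_sequence_id(seq_id: str) -> AlignedRegion:
    """Split an overloaded sequence ID into its five components."""
    match = SEQ_ID_PATTERN.match(seq_id)
    if match is None:
        raise ValueError(f"ID does not follow the naming convention: {seq_id!r}")
    return AlignedRegion(
        species=match["species"],
        seqid=match["seqid"],
        strand=match["strand"],
        start=int(match["start"]),
        end=int(match["end"]),
    )


if __name__ == "__main__":
    print(parse_sequence_id("rice-3(+)/16598648-16600199"))
```

An ID that lacks the strand or the coordinate range raises a ValueError in this sketch, which is a convenient way to check a file's sequence names before attempting to load it.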

Database Loading Scripts

Split large clustal files into smaller ones

Large clustal files that contain many alignments can overload the BioPerl alignment parser used by the perl scripts. For performance reasons, split such files into one alignment per file with the script split_clustal.pl.

gunzip -c my_huge_clustal_file.aln.gz | perl split_clustal.pl /path/to/smaller_alignment_files
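
The idea behind the splitting step can be sketched in Python as follows. This is not the code of split_clustal.pl; it is a hypothetical illustration that assumes every alignment in the concatenated file begins with a header line starting with "CLUSTAL". The function names and the output file naming (alignment_1.aln, alignment_2.aln, ...) are invented for this example.

```python
import gzip
import os
from typing import Iterable, Iterator, List


def split_alignments(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the lines of one alignment at a time.

    Assumption: each alignment starts with a 'CLUSTAL' header line.
    """
    current: List[str] = []
    for line in lines:
        if line.startswith("CLUSTAL") and current:
            yield current
            current = []
        current.append(line)
    # Emit the last alignment unless only blank lines remain.
    if any(line.strip() for line in current):
        yield current


def write_alignments(gz_path: str, out_dir: str) -> int:
    """Write each alignment of a gzipped file to its own .aln file.

    Returns the number of files written. File names are hypothetical.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    with gzip.open(gz_path, "rt") as handle:
        for count, block in enumerate(split_alignments(handle), start=1):
            out_path = os.path.join(out_dir, f"alignment_{count}.aln")
            with open(out_path, "w") as out:
                out.writelines(block)
    return count
```

Because the input is read line by line, only one alignment is held in memory at a time, which is the point of splitting a very large file in the first place.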

Parse the alignments

A suggested method for formatting the alignment data is to parse the clustalw format alignments using the script clustal2hit.pl

perl clustal2hit.pl /path/to/smaller_alignment_files/*.aln |gzip -c >processed_alignments.txt.gz

There is also a script for processing mercator alignments (mercatoraln_to_synhits.pl) and a generic version (aln2hit.pl) for other MSA formats supported by Bio::AlignIO.

  • The output of this script is described below

Alignment data loading format

  • The output of the script is a tab-delimited ad hoc intermediate format that encodes the alignment coordinates plus 1:1 mapping of coordinates within the alignment (to facilitate accurate grid-lines in GBrowse_syn)
  • However you obtain your alignment or synteny data, this is the format currently supported for loading the alignment database.
  • Each record is a single line (wrapped here for display only)
#species1       seqid1  start1   end1   strand1  reserved  species2      seqid2         start2   end2  strand2 reserved \
# pos1-1  pos1-2  ...  posn-1  posn-2  |  pos1-2  pos1-1  ...  posn-2  posn-1
c_briggsae      chrI    1583997 1590364 +       .       c_remanei       Crem_Contig24   631879  634679  -       .       \
1584000 634676  1584100 634584  (truncated...)  |     631900  1590333 632000  1590233  (truncated ...)
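
To make the record layout concrete, here is a minimal Python sketch that reads one line of this intermediate format. It is an illustration only, not the loader's own code. It assumes that the "|" separator is a tab-delimited field of its own and that the grid coordinates come in pairs; the names Hit and parse_hit_line are invented for this example. The sample record reuses only the values visible in the example above, with the truncated portions left out.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One side of an alignment: species, sequence and coordinates."""
    species: str
    seqid: str
    start: int
    end: int
    strand: str


def to_pairs(values: List[str]) -> List[Tuple[int, int]]:
    """Group a flat list of positions into (position, mapped position) pairs."""
    return [(int(values[i]), int(values[i + 1]))
            for i in range(0, len(values) - 1, 2)]


def parse_hit_line(line: str):
    """Parse one tab-delimited record into two hits and two coordinate maps."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-delimited fields")
    # Fields 6 and 12 (1-based) are the reserved columns and are skipped.
    hit1 = Hit(fields[0], fields[1], int(fields[2]), int(fields[3]), fields[4])
    hit2 = Hit(fields[6], fields[7], int(fields[8]), int(fields[9]), fields[10])
    grid = fields[12:]
    # Assumption: '|' separates the species1 map from the species2 map.
    if "|" in grid:
        sep = grid.index("|")
        forward, reverse = grid[:sep], grid[sep + 1:]
    else:
        forward, reverse = grid, []
    return hit1, hit2, to_pairs(forward), to_pairs(reverse)


if __name__ == "__main__":
    record = "\t".join([
        "c_briggsae", "chrI", "1583997", "1590364", "+", ".",
        "c_remanei", "Crem_Contig24", "631879", "634679", "-", ".",
        "1584000", "634676", "1584100", "634584", "|",
        "631900", "1590333", "632000", "1590233",
    ])
    print(parse_hit_line(record))
```

Each pair in the first map gives a position in species1 and the matching position in species2; the second map gives the same relationship in the other direction.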

Create a mysql database

mysql -uroot -ppassword -e 'create database my_database'

Load the database

Load the database with the script load_alignment_database.pl

gunzip -c processed_alignments.txt.gz | perl load_alignment_database.pl
  • Once loaded, the data look something like this:

The alignment data
Alignments table.png

The grid mapping data
Grid map.png