Difference between revisions of "GBrowse syn Database"
Revision as of 14:54, 6 August 2009
GBrowse_syn is a GBrowse-based synteny viewer. This page describes the database that GBrowse_syn uses, and how to get syntenic data into that database.
Schema
- The alignment database schema is very simple; it has a table for all reciprocal 'hits' (alignment features) and an optional table for 1:1 coordinate maps.
- The alignments table contains coordinate information and also supports cigar-line representations of the alignment, to facilitate future reconstruction of the alignment within GBrowse_syn.
Sample data
- Sample data and configuration files can be downloaded from the GMOD FTP site.
- These data are for two rice species (courtesy of Bonnie Hurwitz)
Loading
Input formats for alignment, colinearity or synteny data
- The starting point for loading the alignment database is typically a multiple sequence alignment file
- The source of the alignment data is up to you, but the supported alignment input formats are currently CLUSTALW, FASTA, SELEX, MSF and STOCKHOLM. Which format you use is not particularly important, as long as it is recognized by BioPerl's Bio::AlignIO library for parsing and allows sequence names long enough to accommodate the rules below.
NOTE: It is necessary to overload the sequence IDs with metadata about the alignment, using the naming convention "species-seqid(strand)/start-end" shown in the example below.
CLUSTAL W(1.81) multiple sequence alignment

c_briggsae-chrII(+)/43862-46313            ATGAGCTTCCACAAAAGCATGAGCTTTCTCAGCTTCTGCCACATCAGCATTCAAATGATC
c_remanei-Crem_Contig172(-)/123228-124941  ATGAGCCTCTACAACCGCATGATTCTTTTCAGCCTCTGCCACGTCCGCATTCAAATGCTC
c_brenneri-Cbre_Contig60(+)/627772-630087  ATGAGCCTCCACAACAGCATGATTTTTCTCGGCTTCCGCCACATCCGCATTCAAATGATC
c_elegans-II(+)/9706834-9708803            ATGAGCCTCTACTACAGCATGATTCTTCTCAGCTTCTGCAACGTCAGCATTCAGATGATC
                                           ****** ** ** *  ******   ** ** ** ** ** ** ** ******* *** **

c_briggsae-chrII(+)/43862-46313            CGCACAAATATGATGCACAAATCCACAACCTAAAGCATCTCCGATAACGTTGACCGAAGT
c_remanei-Crem_Contig172(-)/123228-124941  AGCACAAATGTAATGAACGAATCCGCATCCCAACGCATCGCCAATCACATTCACAGATGT
c_brenneri-Cbre_Contig60(+)/627772-630087  CGCACAAATGTAGTGGACAAATCCGCATCCCAAAGCGTCTCCGATAACATTTACCGAAGT
c_elegans-II(+)/9706834-9708803            TGCACAAATGTGATGAACGAATCCACATCCCAATGCATCACCGATCACATTGACAGATGT
                                            ******** *  ** ** ***** ** ** ** ** ** ** ** ** ** ** ** **

c_briggsae-chrII(+)/43862-46313            CCGGAGTCGATCCCTGAAT-----------------------------------------
c_remanei-Crem_Contig172(-)/123228-124941  ACGAAGTCGGTCCCTATAAGGTATGATTTTATATGA----TGTACCATAAGGAAATAGTC
c_brenneri-Cbre_Contig60(+)/627772-630087  ACGAAGTCGATCCCTGAAA---------TCAGATGAGCGGTTGACCA---GAGAACAACC
c_elegans-II(+)/9706834-9708803            ACGAAGTCGGTCCCTGAAC--AATTATTT----TGA----TATA---GAAAGAAACGGTA
                                            ** ***** *****  *
Note on CLUSTALW
These data are in clustalw format. The scripts used to process these data will recognize clustalw and other commonly used formats recognized by BioPerl's AlignIO parser. This does not mean that clustalw is the actual program used to generate the alignment data.
- This particular alignment file in clustalw format was generated using part of the compara pipeline.
- See this generalized hierarchical whole genome alignment workflow for general information on how whole genome alignment data can be generated.
Note on the sequence ID syntax
The sequence ID in this clustal file is overloaded to contain information about the species, strand and coordinates. This information is essential:
rice-3(+)/16598648-16600199
species-refseq(strand)/start-end
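As an illustration of the naming convention, the following minimal Python sketch splits an overloaded sequence ID into its parts. It is not part of the GBrowse_syn scripts; the names `AlignedRegion`, `SEQ_ID_PATTERN` and `parse_sequence_id` are hypothetical, and the sketch assumes that the species name itself contains no hyphen, so that the first "-" separates species from sequence ID.

```python
import re
from typing import NamedTuple


class AlignedRegion(NamedTuple):
    """Metadata carried by one overloaded sequence ID (hypothetical container)."""
    species: str
    seqid: str
    strand: str
    start: int
    end: int


# Assumption: the species name has no "-" in it, so the first hyphen
# marks the boundary between species and reference sequence ID.
SEQ_ID_PATTERN = re.compile(
    r"^(?P<species>[^-]+)-(?P<seqid>.+)"
    r"\((?P<strand>[+-])\)/(?P<start>\d+)-(?P<end>\d+)$"
)


def parse_sequence_id(seq_id: str) -> AlignedRegion:
    """Split 'species-seqid(strand)/start-end' into its five components."""
    match = SEQ_ID_PATTERN.match(seq_id.strip())
    if match is None:
        raise ValueError(f"not in species-seqid(strand)/start-end form: {seq_id!r}")
    return AlignedRegion(
        species=match.group("species"),
        seqid=match.group("seqid"),
        strand=match.group("strand"),
        start=int(match.group("start")),
        end=int(match.group("end")),
    )


region = parse_sequence_id("c_remanei-Crem_Contig172(-)/123228-124941")
print(region.species, region.seqid, region.strand, region.start, region.end)
```

A check like this can be run over the sequence names in an alignment file before loading, to catch IDs that do not follow the convention.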
Database Loading Scripts
Split large clustal files into smaller ones
For performance reasons, and to avoid overloading the BioPerl alignment parser, split large clustal files that contain many alignments into one alignment per file with the script split_clustal.pl. This is necessary for very large files.
gunzip -c my_huge_clustal_file.aln.gz | perl split_clustal.pl /path/to/smaller_alignment_files
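The idea behind the splitting step can be sketched in a few lines of Python. This is not the implementation of split_clustal.pl; it is a hypothetical illustration that assumes every alignment in the concatenated file begins with its own header line starting with "CLUSTAL". The function name `split_clustal` is chosen here for illustration only.

```python
from typing import List


def split_clustal(text: str) -> List[str]:
    """Split concatenated clustal text into one string per alignment.

    Assumption: each alignment starts with a line beginning with 'CLUSTAL'.
    """
    alignments: List[str] = []
    current: List[str] = []

    def flush() -> None:
        # Keep the collected lines only if they hold something besides blanks.
        if any(line.strip() for line in current):
            alignments.append("".join(current))
        current.clear()

    for line in text.splitlines(keepends=True):
        if line.startswith("CLUSTAL"):
            flush()  # a new header closes the previous alignment
        current.append(line)
    flush()
    return alignments


example = (
    "CLUSTAL W(1.81) multiple sequence alignment\n\n"
    "a-1(+)/1-4 ACGT\nb-1(+)/1-4 ACGT\n\n"
    "CLUSTAL W(1.81) multiple sequence alignment\n\n"
    "a-2(+)/5-8 TTTT\nb-2(-)/5-8 TTTT\n"
)
chunks = split_clustal(example)
print(len(chunks))
```

Each returned chunk could then be written to its own file in the output directory, which keeps the memory footprint of the downstream parser small.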
Parse the alignments
A suggested method for formatting the alignment data is to parse the clustalw-format alignments using the script clustal2hit.pl
perl clustal2hit.pl /path/to/smaller_alignment_files/*.aln |gzip -c >processed_alignments.txt.gz
There is also a script for processing mercator alignments (mercatoraln_to_synhits.pl) and a generic version (aln2hit.pl) for other MSA formats supported by BioPerl's Bio::AlignIO.
- The output of this script is described below
Alignment data loading format
- The output of the script is a tab-delimited ad hoc intermediate format that encodes the alignment coordinates plus 1:1 mapping of coordinates within the alignment (to facilitate accurate grid-lines in GBrowse_syn)
- However you obtain your alignment or synteny data, this is the format currently supported for loading the alignment database.
- Each record is a single line (wrapped here for display only)
#species1 seqid1 start1 end1 strand1 reserved species2 seqid2 start2 end2 strand2 reserved \
# pos1-1 pos1-2 ... posn-1 posn-2 | pos1-2 pos1-1 ... posn-2 posn-1
c_briggsae chrI 1583997 1590364 + . c_remanei Crem_Contig24 631879 634679 - . \
1584000 634676 1584100 634584 (truncated...) | 631900 1590333 632000 1590233 (truncated ...)
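To make the record layout concrete, here is a minimal Python sketch that reads one record of this intermediate format. It is not the loader used by load_alignment_database.pl. The names `FIXED_COLUMNS`, `pair_up` and `parse_record` are hypothetical, and the sketch assumes that the "|" separator appears as its own whitespace-delimited field and that the record is already on a single line (no "\" continuation).

```python
from typing import Dict, List, Tuple

# The twelve fixed columns, named after the header comment in the format.
FIXED_COLUMNS = [
    "species1", "seqid1", "start1", "end1", "strand1", "reserved1",
    "species2", "seqid2", "start2", "end2", "strand2", "reserved2",
]


def pair_up(values: List[str]) -> List[Tuple[int, int]]:
    """Turn a flat list of positions into (position, matching position) pairs."""
    if len(values) % 2 != 0:
        raise ValueError("coordinate map must hold an even number of positions")
    return [(int(values[i]), int(values[i + 1])) for i in range(0, len(values), 2)]


def parse_record(line: str) -> Dict[str, object]:
    """Parse one single-line record into fixed fields plus two coordinate maps."""
    fields = line.split()
    if len(fields) < len(FIXED_COLUMNS):
        raise ValueError("record has fewer than 12 fixed columns")
    record: Dict[str, object] = dict(zip(FIXED_COLUMNS, fields[:12]))
    for key in ("start1", "end1", "start2", "end2"):
        record[key] = int(str(record[key]))
    rest = fields[12:]
    # Assumption: "|" is a separate field dividing the two maps.
    if "|" in rest:
        bar = rest.index("|")
        forward, reverse = rest[:bar], rest[bar + 1:]
    else:
        forward, reverse = rest, []
    record["map_1_to_2"] = pair_up(forward)  # species1 position, species2 position
    record["map_2_to_1"] = pair_up(reverse)  # species2 position, species1 position
    return record


line = "\t".join([
    "c_briggsae", "chrI", "1583997", "1590364", "+", ".",
    "c_remanei", "Crem_Contig24", "631879", "634679", "-", ".",
    "1584000", "634676", "1584100", "634584", "|",
    "631900", "1590333", "632000", "1590233",
])
record = parse_record(line)
print(record["species1"], record["map_1_to_2"])
```

The example line uses only the untruncated coordinate pairs shown above; a real record would carry the full maps for both species.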
Create a mysql database
mysql -uroot -ppassword -e 'create database my_database'
Load the database
Load the database with the script load_alignment_database.pl
gunzip -c processed_alignments.txt.gz | perl load_alignment_database.pl
- Once loaded, the data look something like this: