
Difference between revisions of "GBrowse syn Database"

Revision as of 12:15, 14 January 2009

Schema

  • The alignment database is very simple: it has one table for all reciprocal 'hits' (alignment features) and an optional table for 1:1 coordinate maps.
  • The alignments table contains coordinate information and also supports cigar-line representations of the alignment, to facilitate future reconstruction of the alignment within GBrowse_syn.

GBS Schema.png

Loading

  • The starting point for loading the alignment database is a multiple sequence alignment in CLUSTALW format.
  • The source of the alignment data is up to you, but CLUSTALW is currently the only supported input format.

Clustal alignment format

CLUSTAL W(1.81) multiple sequence alignment


c_briggsae-chrII(+)/43862-46313           ATGAGCTTCCACAAAAGCATGAGCTTTCTCAGCTTCTGCCACATCAGCATTCAAATGATC
c_remanei-Crem_Contig172(-)/123228-124941 ATGAGCCTCTACAACCGCATGATTCTTTTCAGCCTCTGCCACGTCCGCATTCAAATGCTC
c_brenneri-Cbre_Contig60(+)/627772-630087 ATGAGCCTCCACAACAGCATGATTTTTCTCGGCTTCCGCCACATCCGCATTCAAATGATC
c_elegans-II(+)/9706834-9708803           ATGAGCCTCTACTACAGCATGATTCTTCTCAGCTTCTGCAACGTCAGCATTCAGATGATC
                                          ****** ** ** *  ******   ** ** ** ** ** ** ** ******* *** **

c_briggsae-chrII(+)/43862-46313           CGCACAAATATGATGCACAAATCCACAACCTAAAGCATCTCCGATAACGTTGACCGAAGT
c_remanei-Crem_Contig172(-)/123228-124941 AGCACAAATGTAATGAACGAATCCGCATCCCAACGCATCGCCAATCACATTCACAGATGT
c_brenneri-Cbre_Contig60(+)/627772-630087 CGCACAAATGTAGTGGACAAATCCGCATCCCAAAGCGTCTCCGATAACATTTACCGAAGT
c_elegans-II(+)/9706834-9708803           TGCACAAATGTGATGAACGAATCCACATCCCAATGCATCACCGATCACATTGACAGATGT
                                           ******** *  ** ** ***** ** ** ** ** ** ** ** ** ** ** ** **
c_briggsae-chrII(+)/43862-46313           CCGGAGTCGATCCCTGAAT-----------------------------------------
c_remanei-Crem_Contig172(-)/123228-124941 ACGAAGTCGGTCCCTATAAGGTATGATTTTATATGA----TGTACCATAAGGAAATAGTC
c_brenneri-Cbre_Contig60(+)/627772-630087 ACGAAGTCGATCCCTGAAA---------TCAGATGAGCGGTTGACCA---GAGAACAACC
c_elegans-II(+)/9706834-9708803           ACGAAGTCGGTCCCTGAAC--AATTATTT----TGA----TATA---GAAAGAAACGGTA
                                           ** ***** *****  *

NOTE: The sequence naming convention "species-seqid(strand)/start-end" shown in the above example is essential for the data to be loaded correctly
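Because loading depends on this naming convention, it can be worth checking sequence names before running the loading scripts. The following is a minimal Python sketch, not part of the GBrowse_syn distribution. The helper name parse_seq_name is hypothetical, and the pattern assumes that the species label contains no hyphen, so that the first "-" separates species from seqid.

```python
import re

# Hypothetical helper for illustration; not part of the GBrowse_syn scripts.
# Assumption: the species label has no "-" in it, so the first hyphen
# separates the species from the sequence ID.
NAME_RE = re.compile(
    r"^(?P<species>[^-]+)-(?P<seqid>.+)"
    r"\((?P<strand>[+-])\)/(?P<start>\d+)-(?P<end>\d+)$"
)


def parse_seq_name(name):
    """Split 'species-seqid(strand)/start-end' into its five parts."""
    match = NAME_RE.match(name.strip())
    if match is None:
        raise ValueError(f"name does not follow species-seqid(strand)/start-end: {name!r}")
    return (
        match.group("species"),
        match.group("seqid"),
        match.group("strand"),
        int(match.group("start")),
        int(match.group("end")),
    )


if __name__ == "__main__":
    # Names taken from the example alignment above.
    for name in (
        "c_briggsae-chrII(+)/43862-46313",
        "c_remanei-Crem_Contig172(-)/123228-124941",
    ):
        print(parse_seq_name(name))
```

A name that does not match raises ValueError, which makes malformed entries easy to spot before they reach the database.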

Database Loading Scripts

1) split large clustal files into one alignment per file with the script split_clustal.pl

gunzip -c my_huge_clustal_file.aln.gz | perl split_clustal.pl /path/to/smaller_alignment_files
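To illustrate what "one alignment per file" means, here is a rough Python sketch of the splitting idea. It is not the implementation of split_clustal.pl. It assumes that every alignment in the stream begins with a header line starting with "CLUSTAL", and the function names and the output file naming (alignment_N.aln) are hypothetical.

```python
import gzip
import os


def split_clustal(lines):
    """Yield one list of lines per alignment.

    Assumption: each alignment starts with a header line beginning with
    "CLUSTAL"; anything before the first header is ignored.
    """
    current = None
    for line in lines:
        if line.startswith("CLUSTAL"):
            if current is not None:
                yield current
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        yield current


def write_alignments(gz_path, out_dir):
    """Write each alignment from a gzipped clustal file to its own .aln file."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    with gzip.open(gz_path, "rt") as handle:
        for count, chunk in enumerate(split_clustal(handle), start=1):
            # Hypothetical naming scheme; the real script may name files differently.
            out_path = os.path.join(out_dir, f"alignment_{count}.aln")
            with open(out_path, "w") as out:
                out.writelines(chunk)
    return count
```

Keeping the splitting logic in a generator means it works equally on an open file handle or on a list of lines held in memory.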

2) parse the alignments using the script clustal2hit.pl (http://gmod.cvs.sourceforge.net/viewvc/*checkout*/gmod/Generic-Genome-Browser/bin/gbrowse_syn/clustal2hit.pl?pathrev=stable)

perl clustal2hit.pl /path/to/smaller_alignment_files/*.aln | gzip -c > processed_alignments.txt.gz

The output of this script is an ad hoc intermediate format that encodes the alignment coordinates plus a 1:1 mapping of coordinates within the alignment (to facilitate accurate grid-lines in GBrowse_syn):

#species1   seqid1  start1   end1     strand1  reserved  species2   seqid2         start2  end2    strand2  reserved
c_briggsae  chrI    1583997  1590364  +        .         c_remanei  Crem_Contig24  631879  634679  -        .         \
# pos1-1  pos1-2  ...                     posn-1  posn-2    |    # pos1-2  pos1-1  ...                     posn-2  posn-1
1584000 634676  1584100 634584  (truncated for display...)  |     631900  1590333 632000  1590233  (truncated for display...)
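The layout above can be read back with a few lines of Python. This is a sketch under stated assumptions, not a specification of the format: it assumes each record is a single whitespace-delimited line (the trailing "\" in the display is taken to be a line continuation), with 12 hit fields followed by coordinate pairs, a "|" separator, and the reversed pairs. The field names reserved1 and reserved2 and the function name parse_record are hypothetical. The sample line uses only the values shown above, with the truncated part left out.

```python
# Field names follow the header line of the sample; "reserved1"/"reserved2"
# are hypothetical labels for the two "reserved" columns.
HIT_FIELDS = (
    "species1", "seqid1", "start1", "end1", "strand1", "reserved1",
    "species2", "seqid2", "start2", "end2", "strand2", "reserved2",
)


def _pairs(values):
    """Group a flat list of coordinates into (first, second) integer pairs."""
    numbers = [int(v) for v in values]
    return list(zip(numbers[0::2], numbers[1::2]))


def parse_record(line):
    """Return (hit, forward_map, reverse_map) for one record.

    Assumption: a record is one whitespace-delimited line with 12 hit
    fields, then coordinate pairs, then "|", then the reversed pairs.
    """
    tokens = line.split()
    if len(tokens) < len(HIT_FIELDS):
        raise ValueError("expected at least 12 fields in a hit record")
    hit = dict(zip(HIT_FIELDS, tokens[:12]))
    for key in ("start1", "end1", "start2", "end2"):
        hit[key] = int(hit[key])
    rest = tokens[12:]
    if "|" in rest:
        bar = rest.index("|")
        forward, reverse = rest[:bar], rest[bar + 1:]
    else:
        forward, reverse = rest, []
    return hit, _pairs(forward), _pairs(reverse)


if __name__ == "__main__":
    sample = (
        "c_briggsae chrI 1583997 1590364 + . "
        "c_remanei Crem_Contig24 631879 634679 - . "
        "1584000 634676 1584100 634584 | 631900 1590333 632000 1590233"
    )
    hit, forward_map, reverse_map = parse_record(sample)
    print(hit["species1"], hit["seqid1"], "->", hit["species2"], hit["seqid2"])
    print(forward_map)
    print(reverse_map)
```

Each pair in the forward map gives a position in the first sequence and the matching position in the second; the reverse map lists the same kind of correspondence from the second sequence's point of view.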


3) create a MySQL database

mysql -uroot -ppassword -e 'create database my_database'

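4) load the database with the script load_alignment_database.pl (http://gmod.cvs.sourceforge.net/viewvc/*checkout*/gmod/Generic-Genome-Browser/bin/gbrowse_syn/load_alignment_database.pl?pathrev=stable)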

gunzip -c processed_alignments.txt.gz | perl load_alignment_database.pl