August 2009 GMOD Meeting
6-7 August, 2009
Part of GMOD Europe 2009, five days of GMOD including a GMOD Summer School
This GMOD Community Meeting was held 6-7 August, 2009, in Oxford, UK. The meeting was part of GMOD Europe 2009, a week-long event that also included a GMOD Summer School. This was the first time a GMOD meeting was held in Europe.
As with previous GMOD meetings, this meeting had a mixture of project talks, component talks, and user talks. The agenda was driven by attendee suggestions. The two previous meetings were the January 2009 and July 2008 meetings. GMOD meetings are an excellent way to meet GMOD developers and users, and to learn (and affect) what's coming in the project.
Dr Heng Li of the Sanger Institute was the special guest speaker. Heng discussed his recent work on SAMtools, a set of file formats and scripts for efficiently storing and accessing next generation sequence data. Heng is a developer on several projects focused on next generation sequencing, including SAMtools, BWA, and MAQ.
|8:30-12:00||Last half day of 2009 GMOD Summer School - Europe|
|13:30-14:30||Scott Cain - Introductions and the State of GMOD||Prezi, PPT, PDF, Summary|
|14:30-15:00||Dave Clements - GMOD Help Desk Stuff||PPT, PDF, Summary|
|15:00-15:30||Jun Zhao - Linked Data for GMOD Databases||PDF, Summary|
|15:45-16:15||Steve Taylor - GMOD in the Trenches||PDF, Summary|
|16:15-16:30||Scott Cain (for Robert Buels) - A DBIx::Class layer for Chado||S5 Slides, Summary|
|16:30-17:00||Ed Lee - GMOD Biological Object Layer||PDF, Summary|
|17:00-17:30||Josh Goodman - A Restful interface for MODs||Summary|
|17:30||Dinner (on your own)|
|8:45-9:15||Heng Li - Quest for Standard: Sequence alignment/map format (SAM) and SAMtools||PDF, Summary|
|9:15-9:45||Dave Clements - Visualising NGS Data in GBrowse 2||PPT, PDF, Summary|
|9:45-10:15||Erick Antezana & Frederic Potier - GBrowse: Lessons Learned and Statement of Interest||PDF, Summary|
|10:15-11:45||Ian Holmes - JBrowse||PDF, Summary|
|11:00-11:30||Sheldon McKay - GBrowse_syn||PDF, Summary|
|11:30-12:30||Discussion: NextGen data and GMOD: What do we do (and not do)?|
|13:30-14:00||Alessandra Bilardi - GBrowse.org||PDF, Summary|
|14:00-14:30||Jonathan Warren - DAS update||PPT, PDF, Summary|
|14:30-15:00||Julie Sullivan - InterMine update||Summary|
|15:00-18:00||Show and Tell, Discussion||Summary|
GMOD Project Talks
HHMI Science Education Alliance
The Howard Hughes Medical Institute's Science Education Alliance (SEA) is using GMOD tools to teach annotation to college freshmen. They isolate and sequence phage samples. The sequence is then stored in Chado, annotated with Apollo, and visualized with GBrowse. The program is in production at 12 colleges across the US.
- Chado (GMOD) 1.1 is coming
- Minor schema changes
- Minor fixes to GFF scripts
- Addition of Chris Mungall's script to create views based on CV terms.
- Distributed databases and render servers
- AJAX track loading
- Improved configuration management
- Support for SAM/BAM databases - see SAMtools
- Coverage XY-plot, Confidence density plot, Individual alignments, Paired reads
- Currently Alpha in GBrowse 2; may work in GBrowse 1 with some DAS magic.
- Circular chromosome support
- Another complete rearchitecture
- Uses AJAX for client side rendering
- Distributed with GBrowse 1.70
- Makes use of data adaptors/databases that GBrowse uses
- Tripal is a set of modules for Drupal to interact with a Chado database.
- Drupal: Widely used CMS, very extensible
- Can integrate GBrowse/CMap
- Modules for organism, library, sequence
DIYA is a gene prediction pipeline for prokaryotes. It complements MAKER, a pipeline for eukaryotes. DIYA is actually a generic, lightweight pipeline framework which was initially built to produce gene predictions. DIYA is becoming part of GMOD.
- Atlases and Aniseed
Aniseed is converting its schema to Chado. One of Aniseed's particular strengths is atlases for expression, anatomy, and cell fate. They are extending Chado to better support atlases, and will also make their web front end available as a part of GMOD.
2008 GMOD Summer School - first school ever offered
That's an over 350% increase in interest from 2008.
We'll do another summer school at NESCent in 2010. We are also considering one in Asia/Pacific in 2010.
- Since January 2009 GMOD Meeting we've been busy - see Training and Outreach.
- And in the next few months
GMOD Community Surveys
GMOD is now surveying the community every year. The 2008 GMOD Community Survey had 89 responses and is very informative about how GMOD is used. The 2009 survey will be in October.
Upcoming GMOD Hackathon ?
There may be a GMOD hackathon this coming spring (March to May) at the US National Evolutionary Synthesis Center (NESCent) in Durham, NC, USA. If this happens, the focus will be on extending GMOD for evolutionary biology. Contact Dave if you want to be on the organizing committee or participate.
Linked Data for GMOD Databases
Jun discussed her group's efforts to build an RDF triple store from several very different data sources: FlyBase (a Chado database), BDGP, FlyTED, FlyAtlas, and Affymetrix data sources. The integrated triple-store can be accessed at OpenFlyData.
- D2RQ mapping to load FlyBase and BDGP, using conservative mapping with minimum interpretation.
- OAI2SPARQL to harvest N3 RDF metadata via the OAI-PMH protocol, using built-in support by Eprints, and further info from ESWC2008 paper.
- Custom Python program to get FlyAtlas data.
Some performance numbers:
- Loading: Our datasets ~175 million triples
- Good enough for real time user interaction, e.g., <1s for single gene search, 1-4s for multigene search (unions)
- No significant slowdown when scale from 10m to 175m triples
- Text matching and case insensitive search
- Problems with using SPARQL regex filter, the only mechanism for case-insensitive search in SPARQL
- Pre-generated lower-case gene names and loaded into the FlyBase RDF DB
- Tried with OpenLink Virtuoso, still ~10 seconds for a case-insensitive search
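The two approaches to case-insensitive matching can be contrasted in a short Python sketch that only builds query strings. The `ex:` prefix, the `ex:name` and `ex:nameLower` predicates, and the function names are hypothetical and are not taken from OpenFlyData; only the `regex(..., "i")` filter is standard SPARQL.

```python
# Hypothetical vocabulary: ex:name and ex:nameLower are illustrative only.
PREFIX = "PREFIX ex: <http://example.org/flybase#>\n"


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def regex_query(gene_name):
    """Case-insensitive match with a regex FILTER.

    The store generally has to test the filter against every candidate name,
    which is the slow path described above. Names containing regex
    metacharacters would need additional escaping (not handled here).
    """
    literal = escape_literal(gene_name)
    return (
        PREFIX
        + "SELECT ?gene WHERE {\n"
        + "  ?gene ex:name ?name .\n"
        + f'  FILTER regex(?name, "^{literal}$", "i")\n'
        + "}"
    )


def lowercase_query(gene_name):
    """Exact match against a pre-generated lower-case name.

    Assumes a lower-cased copy of each name was loaded alongside the original,
    so the lookup becomes a plain triple pattern with no filter.
    """
    literal = escape_literal(gene_name.lower())
    return (
        PREFIX
        + "SELECT ?gene WHERE {\n"
        + f'  ?gene ex:nameLower "{literal}" .\n'
        + "}"
    )


if __name__ == "__main__":
    print(regex_query("Adh"))
    print(lowercase_query("ADH"))
```

The second query trades extra triples at load time for a lookup the store can answer without filtering.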
Jun used OpenFlyData to:
- Search by gene, gene expression mashup: (go)
- Search gene expression by gene batch (go)
- Search gene expression by tissue expression profile (go)
Jun also described a second effort, Open-BioMed, that uses the same technologies to connect knowledge about alternative medicine and western drugs. Open-BioMed demonstrates the value of Linked Data, and shows a novel technique for creating interlinks between datasets on a large scale. This is a joint effort of the BioRDF and LODD (Linked Open Drug Data) task forces of the World Wide Web Consortium (W3C) Health Care Life Science Interest Group. Jun used Open-BioMed to search for herbs associated with a particular disease.
RDF & SPARQL: Benefits & Risks
Some identified benefits:
- RDF provides a uniform and flexible data model
- RDF dump is cheaper and quicker
- Maintaining a separate SPARQL endpoint for each data source makes it easier than a data warehouse approach for handling data updates
- RDF facilitates data re-use and re-purposing
- SPARQL raises the point of departure for an application
- Expressive, open-ended query protocol
- Support for unanticipated queries
- Mapping data to RDF requires expertise and experience
- Expressive query protocol is a double-edged sword
- Performance is good for some queries, not for others ...
GMOD in the Trenches
The Computational Biology Research Group (CBRG) provides bioinformatics support to researchers at the University of Oxford. They are heavy GMOD users and have used GBrowse, Citrina, BioMart, and Apollo (along with Artemis).
GBrowse at CBRG
Back in 2004, the CBRG wanted to pull data together to make a lab resource, and the genome is a useful data organiser. The CBRG evaluated these platforms: UCSC, Ensembl, AceDB, and GBrowse. Each had advantages and disadvantages, but GBrowse looked like it was built to be distributed and used elsewhere. Ease of installation was not a priority for the others.
The CBRG now supports over 50 different GBrowse databases. Data is mainly human, mouse or bacterial, and data types include time series, arrays, and ChIP-on-Chip. They visualize a lot of Next Generation Sequencing data, including histone modifications, ChIP-Seq, cis/trans interaction data, PCR amplified regions, and RNA-Seq.
The CBRG actively manages data flow to its GBrowse instances. Each production GBrowse instance has a matching development instance where updates and changes are staged and tested before pushing them to production. They also use core and satellite databases. Core databases are built for human and mouse using public source data. To meet individual groups' needs they then clone a core database, load custom data that is specific to that group, and then run a script to merge the core and satellite GBrowse configuration files. They use Apache to restrict access to the satellite instances.
The CBRG strives to encourage power users. Data is available for download, and they have regular meetings to discuss best practices.
In the future they would like to use GBrowse as a workbench. To do this they need flexible ways to import and export features. For example, you can define a temporary track by uploading a GFF3 file, or by connecting via DAS to an outside source, or to another GBrowse. It would be nice to have a method to commit a temporary track and make it permanent. This requires some sort of user authentication.
Steve also walked through an example of how it would be useful to support querying and visualizing data from multiple loci at the same time.
Make Existing GBrowse More Useful to External Developers
Finally Steve listed these 5 ways to make GBrowse more useful to external developers:
- Document general structure of GBrowse perl modules
- Tips on debugging
- Document / define API
- Central Glyph page
- Include a copy of BioPerl inside GBrowse
A DBIx::Class layer for Chado
Chado needs middleware, a layer of software between the application (e.g. a website) and the database. Chado's flexible design makes for complex queries and a steep learning curve. It is also hard to get good performance. This talk introduces a Perl DBIx::Class layer for use with Chado, which can be used as the basis for many applications, including the next generation of Modware.
DBIx::Class is an object-relational mapping framework for Perl, and is the de facto standard. It has powerful features for:
- query building (the magic of chainable ResultSets)
- cross-database deployment (using SQL::Translator in the backend)
- testing with Fixtures
Middleware can help by storing and/or automating complex queries, codifying best practices with both code and unified, high-level documentation. Some performance optimizations can be put in middleware and it can assist in creating indexes and materialized views.
The Bio-Chado-Schema project has been set up by Robert Buels, with source control at GitHub and releases available on CPAN. It contains DBIx::Class modules for every Chado table, and these should work with all database platforms that are supported by Chado. The project uses automated tools to keep the modules in sync with changes in the Chado schema. The project is actively looking for development help; CPAN releases are currently intended for developers. Future goals include API support for common querying and loading patterns, interoperation with BioPerl objects, forming the basis for a future version of Modware, and more.
- other people should start building features onto and into it
- and do some of the other things on the slides
- make a new version of Modware based on it
- do you think somebody could get funding to work on it full time?
Ed has been working with E.O. Stinson and Robert Bruggner at BBOP, and Robin Houston and Adrian Tivey at Sanger to create a Java based biological object layer (GBOL) for genomic features.
GBOL is the top layer of a multilayer architecture:
- Biological Object Layer (GBOL)
This layer defines an object at a biological level of interest, say a gene. It aggregates together all of the information about that high level concept into a single, programmatically accessible entity. It hides all of the information about how and where the underlying data is stored.
This layer is inspired by Chado, but is not necessarily built on top of Chado.
- Biological Object/IO Layer
This layer ...
- Simple Object Layer
This layer knows about basic biological concepts, but does not directly know how or where this information is stored.
- Simple Object I/O Layer
This layer can do simple aggregation such as "return all features in this range", but does not perform aggregation based on biological models. That type of aggregation is performed by higher levels.
Biological Layer Configuration
<?xml version="1.0" encoding="UTF-8"?>
<gbol_mappings>
  <feature_mappings>
    <type cv="SO" term="gene" default="true">
      <read_class>Gene</read_class>
    </type>
    <type cv="SO" term="transcript" default="true">
      <read_class>Transcript</read_class>
    </type>
    <type cv="SO" term="my_transcript">
      <read_class>Transcript</read_class>
    </type>
    …
  </feature_mappings>
  <relationship_mappings>
    <type cv="relationship" term="part_of" default="true">
      <read_class>PartOf</read_class>
    </type>
    …
  </relationship_mappings>
</gbol_mappings>
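As a rough illustration of how a mapping file like the one above might be consumed, the following Python sketch reads the feature mappings into a lookup table keyed by (cv, term). GBOL itself is Java; the function name `load_feature_mappings`, the returned dictionary shape, and the trimmed sample document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the configuration above (elided entries omitted).
SAMPLE = """
<gbol_mappings>
  <feature_mappings>
    <type cv="SO" term="gene" default="true">
      <read_class>Gene</read_class>
    </type>
    <type cv="SO" term="transcript" default="true">
      <read_class>Transcript</read_class>
    </type>
    <type cv="SO" term="my_transcript">
      <read_class>Transcript</read_class>
    </type>
  </feature_mappings>
  <relationship_mappings>
    <type cv="relationship" term="part_of" default="true">
      <read_class>PartOf</read_class>
    </type>
  </relationship_mappings>
</gbol_mappings>
"""


def load_feature_mappings(xml_text):
    """Map (cv, term) pairs to the class name used to read that feature type."""
    root = ET.fromstring(xml_text.strip())
    mappings = {}
    for type_el in root.findall("./feature_mappings/type"):
        key = (type_el.get("cv"), type_el.get("term"))
        mappings[key] = type_el.findtext("read_class")
    return mappings


if __name__ == "__main__":
    for (cv, term), read_class in load_feature_mappings(SAMPLE).items():
        print(f"{cv}:{term} -> {read_class}")
```

Because several terms can point at the same class, a site-specific term such as `my_transcript` can be handled by the existing `Transcript` reader without code changes.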
- Continued development on Biological layer
- Inference of data: infer introns from exon structure
- New format handlers: Chado XML, GAME XML, BioPerl bridge
- Configuration of common relationship variations such as ESTs aligned to the genome directly vs having a "match" feature
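The intron inference item in the list above can be sketched in a few lines of Python. This is not GBOL code; the function name and the coordinate convention (1-based, inclusive start and end on a single transcript) are assumptions for the example.

```python
def infer_introns(exons):
    """Return the gaps between consecutive exons as (start, end) intervals.

    `exons` is an iterable of (start, end) pairs using 1-based, inclusive
    coordinates on one transcript. Adjacent or overlapping exons yield no
    intron between them.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


if __name__ == "__main__":
    print(infer_introns([(100, 200), (300, 400), (500, 600)]))
```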
A Restful interface for MODs
Quest for Standard: Sequence alignment/map format (SAM) and SAMtools
Heng spoke about SAM/BAM and SAMtools, a platform agnostic set of file formats and programs for next generation sequence data.
SAM/BAM is a generic nucleotide alignment format that:
- is simple to understand, easy to generate and easy to parse
- is compact in file size
- is streamable
- supports fast random access
Quest for Standards
There had been no standardized and computationally efficient way to store the volumes of data that next generation sequencing produces. Several formats such as phrap ACE and GFF existed, but these were unable to scale up.
The Sequence Alignment / Map (SAM) format is motivated by short read alignment but also works with long reads and de novo assemblies. SAM uses a GFF3-like tab-delimited format with 11 mandatory fields for key information, and variable optional fields and predefined tags for non-standard information. It is designed to be simple to generate and to parse. It uses an extended CIGAR string for various types of alignments. The extended CIGAR string format adds support for clipped, spliced, multi-part, and padded alignments. See the SAM Format Specification for details.
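To make the record layout concrete, here is a minimal Python sketch that splits one SAM line into its 11 mandatory fields plus optional tags, and uses the extended CIGAR string to compute how many reference bases the alignment covers. The class and function names are choices made for this example, and the sample line is invented. The mate and insert-size fields are named RNEXT, PNEXT and TLEN here, following later versions of the specification.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost position
    mapq: int    # mapping quality
    cigar: str   # extended CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # inferred insert size
    seq: str     # read sequence
    qual: str    # base qualities
    tags: tuple  # optional TAG:TYPE:VALUE fields, left unparsed


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance along the reference sequence.
REF_CONSUMING = frozenset("MDN=X")


def parse_sam_line(line):
    """Split a tab-delimited alignment line into a SamRecord."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))


def reference_span(cigar):
    """Number of reference bases covered by an alignment with this CIGAR."""
    if cigar == "*":
        return 0
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


if __name__ == "__main__":
    line = "read1\t0\tchr1\t100\t30\t5M2I3M1D4M\t*\t0\t0\tACGTACGTACGTAC\t*\tNM:i:3"
    rec = parse_sam_line(line)
    last_base = rec.pos + reference_span(rec.cigar) - 1
    print(rec.rname, rec.pos, last_base)
```

Insertions (`I`) and soft clips (`S`) consume read bases but not reference bases, which is why they are excluded from the span.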
SAM is a text format. The Binary Alignment/Map (BAM) format is an exact binary representation of SAM. It has Zlib/gzip compatible compression (and can be decompressed by zlib/gzip). BAM is space efficient, achieving 1 byte per raw base pair, including sequence, quality, read name, position and meta info. BAM is also streamable: programs can process alignments without loading the entire alignment into memory. BAM is usually sorted by the leftmost chromosomal position. BAM is indexed, supports random access, and can quickly retrieve sequences overlapping a specified region.
BAM uses BGZF, a generic indexable compression format. The standard gzip/zlib format is not block-wise, so indexing into it is intricate and inefficient. A BGZF file is instead split into multiple standalone gzip/zlib blocks (64kB each).
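The idea behind blocked compression can be illustrated with Python's standard `gzip` module. This sketch is not the BGZF format itself (real BGZF blocks carry an extra header field recording the block size); the function names and the in-memory offset list are assumptions made for the example.

```python
import gzip

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block, mirroring the 64 kB figure


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress `data` as a series of independent gzip members.

    Returns the concatenated stream and the byte offset at which each
    compressed block starts.
    """
    stream = bytearray()
    offsets = []
    for i in range(0, len(data), block_size):
        offsets.append(len(stream))
        stream += gzip.compress(data[i:i + block_size])
    return bytes(stream), offsets


def read_block(stream, offsets, index):
    """Decompress a single block without touching the rest of the stream."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(stream)
    return gzip.decompress(stream[start:end])


if __name__ == "__main__":
    data = bytes(range(256)) * 1000
    stream, offsets = compress_blocks(data)
    print(len(offsets), "blocks;", len(stream), "compressed bytes")
    print(read_block(stream, offsets, 1) == data[BLOCK_SIZE:2 * BLOCK_SIZE])
```

The concatenated members still form a valid gzip stream, which is why a blocked file remains readable by ordinary gzip tools while also allowing a reader to jump straight to one block.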
BAM indexing uses binning plus a linear index for alignments sorted by the leftmost coordinates. B-trees and pure linear indexes are inefficient for resolving ‘overlap’ queries. R-tree and pure binning indexes have difficulty in streaming. For short read alignment, typically only one seek function call is needed to retrieve the reads in a region (more efficient than R-trees). The scheme also produces small index files (e.g., ~9MB for deep human resequencing).
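The binning half of the index assigns each alignment to the smallest bin that fully contains it. The function below is a Python transcription of the `reg2bin` calculation given in the SAM Format Specification, which uses a UCSC-style hierarchy with bins spanning 16 kb at the finest level up to 512 Mb at the root. Coordinates are assumed to be 0-based and half-open; the linear index is not shown.

```python
def reg2bin(beg, end):
    """Return the bin number for the 0-based, half-open region [beg, end)."""
    end -= 1  # work with the last base covered
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)   # 16 kb bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)   # 128 kb bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)    # 1 Mb bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)    # 8 Mb bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)    # 64 Mb bins
    return 0                                        # whole 512 Mb range


if __name__ == "__main__":
    print(reg2bin(0, 36))             # a short read near the start
    print(reg2bin(16000, 17000))      # a region that straddles two 16 kb bins
```

A short read almost always falls inside a single 16 kb bin, so a region query only has to look at a handful of bins rather than scan the file.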
APIs, Implementations and Supported Platforms
Several assembly programs can now produce SAM directly, and SAMtools comes with scripts to convert the output of several other assemblers to SAM format.
SAMtools also has native HTTP/FTP support. Programs can retrieve alignments overlapping a specified region from a remote file via HTTP or FTP. Simply replace the input BAM file name with a URL (http/ftp only). This partial load approach greatly reduces data transfer for applications such as genome browsers, which typically only need small regions of an assembly at any time.
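The partial loading described above rests on ordinary byte-range requests. SAMtools implements this in C; the Python sketch below only illustrates the underlying HTTP mechanism with the standard library. The URL is a placeholder, the function names are invented for the example, and it assumes the remote server honours `Range` headers.

```python
import urllib.request


def build_range_request(url, start, length):
    """Build a request for `length` bytes beginning at byte offset `start`."""
    last = start + length - 1  # HTTP byte ranges are inclusive
    request = urllib.request.Request(url)
    request.add_header("Range", f"bytes={start}-{last}")
    return request


def fetch_bytes(url, start, length):
    """Download one slice of a remote file (performs network access)."""
    with urllib.request.urlopen(build_range_request(url, start, length)) as resp:
        return resp.read()


if __name__ == "__main__":
    # Placeholder URL; no request is sent here.
    req = build_range_request("http://example.org/sample.bam", 65536, 1024)
    print(req.get_header("Range"))
```

With a local index that maps a genomic region to file offsets, a client would only need to fetch the compressed blocks covering that region.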
Several implementations using SAMtools are available. The SAMtools package itself includes command line tools and C APIs for:
- Conversion from other formats
- SAM ⇔ BAM, indexing, sorting, merging, pileup, SNP/indel calling, alignment viewer ...
- Native HTTP/FTP support
There are also implementations in Java (Picard and GATK), and Perl (Bio::DB::Sam, which is what GBrowse uses - see the next talk).
An alignment viewer is a great help for method development:
- Visually understand the alignment: the error rate, the depth, etc.
- Validate aligner results: even read depth? right coordinates? right gaps?
- Validate SNP/indel calls: human eyes are always better.
- Validate structural variations: pair-end information
SAMtools comes with a Text Alignment Viewer, tview, which uses the GNU ncurses library. tview retrieves alignments using FTP/HTTP and is fairly simple. It shows alignments, but not annotation, paired-end information, multiple tracks, ...
The Broad Institute's Java-based Integrative Genomics Viewer (IGV) also works with data in BAM format.
And you can view SAM/BAM in GBrowse using the Bio::DB::Sam Perl adaptor (based on SAMtools C APIs). For SAM/BAM, GBrowse is a lightweight and versatile shared alignment viewer supporting multiple tracks and gene annotations.
For GBrowse, SAM/BAM can provide an efficient way to access large-scale new sequencing data, store various types of alignment (EST, mRNA, etc.) as an alternative to SQL databases, and possibly realize distributed alignment resources. GBrowse already pulls in data from remote sources using DAS. It could be extended to pull in remote SAM/BAM data using FTP/HTTP.
Are distributed alignments feasible? There is already native HTTP/FTP support in SAMtools. This could be added to Bio::DB::Sam as well. Alignment files are compressed. For short reads, one seek call (establishing a network connection) is required to get alignments in a region. This would require very little configuration at the server hosting alignments, and would allow compressed data transfer between file servers and the GBrowse server.
There are some major obstacles. The index files have to sit on local disks at the GBrowse server, and matching the reference sequences may be an issue. Bandwidth and caching also have to be addressed.
Visualising NGS Data in GBrowse 2
Lincoln Stein has written a GBrowse adaptor, Bio::DB::Sam, for Next Generation Sequencing data stored in the BAM format that Heng Li described in his talk. This is currently in Alpha release, and works only with GBrowse 2. It is available in the gbrowse-adaptors project of GMOD's CVS repository. Short read, next generation sequence data can be directly represented in GFF3, but the amount of data makes it very slow, and requires a very large database to back it. Using Bio::DB::Sam on top of BAM files makes visualizing individual reads both computationally tractable and manageable.
The talk used an example of 4 E. coli strains: an ancestral strain for which a reference sequence is available, a manipulated strain, and then two strains with phage resistance that evolved from the manipulated strain. Whole genome resequencing was performed on the manipulated and evolved lines. The resequencing was done on an Illumina GA2 and then assembled with the MAQ aligner. The MAQ alignments were then converted to SAM using a SAMtools script, and then to BAM.
Dave then showed how to configure GBrowse to be a short read viewer using Bio::DB::Sam, including an example callback to show alignment quality using color. However, the utility of showing short reads quickly declines as you zoom out past 100-200 bp. You can also use Bio::DB::Sam to show summary statistics such as coverage depth. Dave will work on documenting the Bio::DB::Sam adaptor and its interface to SAMtools in the coming months.
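Coverage depth is the kind of summary that stays readable when individual reads do not. As a rough sketch of the calculation (not how Bio::DB::Sam or SAMtools compute it), the following Python function derives per-base depth for a region from a list of alignment intervals. The function name and the 0-based, half-open coordinate convention are assumptions for the example.

```python
def coverage_depth(alignments, region_start, region_end):
    """Per-base read depth over the half-open region [region_start, region_end).

    `alignments` is an iterable of half-open (start, end) reference intervals.
    A difference array records +1 where a read begins and -1 where it ends;
    a running sum then gives the depth at each base.
    """
    length = region_end - region_start
    delta = [0] * (length + 1)
    for start, end in alignments:
        lo = max(start, region_start)
        hi = min(end, region_end)
        if lo < hi:  # the read overlaps the region
            delta[lo - region_start] += 1
            delta[hi - region_start] -= 1
    depth = []
    running = 0
    for change in delta[:length]:
        running += change
        depth.append(running)
    return depth


if __name__ == "__main__":
    reads = [(0, 5), (3, 8), (4, 6)]
    print(coverage_depth(reads, 0, 10))
```

The cost is proportional to the number of reads plus the region length, so it does not grow with read depth the way drawing every read does.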
The talk then showed several other visualizations that can be done with next generation sequence data that don't display the short reads themselves. This included a number of ways to show allele and genotype frequencies (including showing them on a geolocation map).
Finally, if you are planning on starting to use NGS data, make sure you have a lot of bioinformatics infrastructure in place first.
GBrowse: Lessons Learned and Statement of Interest
Erick Antezana & Frederic Potier, Bayer CropScience, PDF
History and Current GBrowse Infrastructure
Bayer CropScience uses GBrowse 1.70 and GBrowse 2, CMap, Galaxy, and Ergatis. They have been GBrowse users since 2004. They also evaluated Chado and chose not to use it because of performance issues. They currently use GBrowse 2 with mainly Bio::DB::GFF databases, focused mainly on plants. They have both publicly available plant genomes and private genomes, with increasingly frequent annotation updates. Their requirements include minor data reformatting, fast data loading and querying, a customizable application, and a high level of integrity.
Bayer currently has more than 30 databases with public data, at around 30GB. Their in-house data includes next generation sequence data (stored in BAM and accessed in GBrowse 2 via Bio::DB::Sam), genome annotation (stored in a Bio::DB::GFF database), and molecular mapping visualized with CMap. They are also considering supporting user annotation / manual curation with Apollo and/or Artemis. Their automated annotation workflow produces GFF and generates GBrowse configuration files.
Bayer has extended GBrowse in several ways, including user authentication, permissions, and tracking.
- On the fly visualization
- Blast anchoring/Sequence homology search
- blast homologies are uploaded as user annotations
- data export
- links to in house applications
- In house keyword search engine
- fast search utility
- cross databases search
- centralised access point
Statement of Interest: Requirements and Needs
Bayer CropScience would also like to see GMOD extended in a number of areas.
GBrowse Database Adaptors
- NGS adaptor (Bio::DB::Sam) is a key priority
- Memory adaptor: would like to be able to specify a file name or a complete path via a parameter, so that the adaptor doesn't need to load all the GFF files in the directory
- Chado adaptor: portability to Oracle; ability to store user-specific annotation / manual curation; a system to track versions and history of the annotations; and management of user access rights
- SeqFeature::Store: portability to Oracle (c.f. user access rights via VPD) and faster loading time.
- Compatibility with other genome browser databases, for instance Ensembl databases?
GBrowse User Interaction
- To track user sessions
- To enable user access rights management
- User Annotation Management
- To store the user annotations in a database or in a file on the server, so that users can retrieve their annotations when connecting from different machines
- To automatically send a user's annotations to GBrowse via a URL parameter
- Integration with CMap
GBrowse Configuration Files
Current format is error prone, difficult to debug, has a steep learning curve, and is time consuming to maintain. Bayer (and CBRG and modENCODE and ...) partially works around this by having scripts generate their configuration files.
Would also like the ability to configure the global layout to enable or disable components, for example to disable the custom tracks or display settings components.
Would also like to have a standardized way to specify metadata in the configuration files. For example, species and assembly versions:
#################################
# database definitions
#################################
[TAIR_Arabidopsis_V8:database]
db_adaptor = Bio::DB::GFF
db_args = -adaptor DBI::mysql -dsn dbi:mysql:TAIR_Arabidopsis_V8
species = Arabidopsis thaliana
assembly.source = TAIR
assembly.version = 8
annotation.source = TAIR
annotation.version = 8
Metadata Web Services
Web services could be used to query and report on metadata such as the list of reference sequences, annotation version, assembly version, and list of available feature types. For example:
<browser>
  <species>Arabidopsis</species>
  <assembly>bayer</assembly>
  <annotation>1.0</annotation>
  <reference-sequence>chr1</reference-sequence>
  <reference-sequence>chr2</reference-sequence>
  <feature-type>fgenesh:mRNA</feature-type>
  <feature-type>splign:mRNA</feature-type>
</browser>
This information could be defined in the config file:
[TAIR_Arabidopsis_V8:database]
db_adaptor = Bio::DB::GFF
db_args = -adaptor DBI::mysql -dsn dbi:mysql:TAIR_Arabidopsis_V8
species=Arabidopsis thaliana
assembly.source=TAIR
assembly.version=8
annotation.source=TAIR
annotation.version=8
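To show how the proposed metadata keys could feed such a web service, here is a Python sketch that reads the stanza above and emits a small XML report. GBrowse has its own configuration parser; using Python's `configparser` is a stand-in that works for this simple stanza only. The function names, and the choice to join source and version into the `<assembly>` and `<annotation>` elements, are assumptions made for the example.

```python
import configparser
import xml.etree.ElementTree as ET

CONFIG = """\
[TAIR_Arabidopsis_V8:database]
db_adaptor = Bio::DB::GFF
db_args = -adaptor DBI::mysql -dsn dbi:mysql:TAIR_Arabidopsis_V8
species = Arabidopsis thaliana
assembly.source = TAIR
assembly.version = 8
annotation.source = TAIR
annotation.version = 8
"""

METADATA_KEYS = ("species", "assembly.source", "assembly.version",
                 "annotation.source", "annotation.version")


def database_metadata(config_text, section):
    """Extract the proposed metadata keys from one database stanza."""
    # Only "=" is treated as a delimiter, so the colons in values are kept.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(config_text)
    return {key: parser.get(section, key) for key in METADATA_KEYS}


def metadata_to_xml(meta):
    """Render the metadata as a <browser> report (element layout is assumed)."""
    browser = ET.Element("browser")
    ET.SubElement(browser, "species").text = meta["species"]
    ET.SubElement(browser, "assembly").text = (
        meta["assembly.source"] + " " + meta["assembly.version"])
    ET.SubElement(browser, "annotation").text = (
        meta["annotation.source"] + " " + meta["annotation.version"])
    return ET.tostring(browser, encoding="unicode")


if __name__ == "__main__":
    meta = database_metadata(CONFIG, "TAIR_Arabidopsis_V8:database")
    print(metadata_to_xml(meta))
```

Reference sequences and feature types would have to come from the database itself rather than the configuration file, so they are left out of this sketch.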
Conclusion / Discussion
GBrowse 2 is a tool that can be used in a production environment. It is used intensively within the Bayer Bioinformatics platform to facilitate a high level of data integration. It is easy to maintain.
Our priorities for further developments:
- Adaptors performance
- Need to focus on user interaction
- GBrowse.conf representation
- Native integration of other GMOD tools (e.g. CMap)
Ian Holmes, University of California - Berkeley, PDF
Some useful links:
- The JBrowse paper will be published in the September 2009 issue of Genome Research. An advance access version is available online.
- All things JBrowse are available at JBrowse.org
JBrowse was initially going to look and feel very much like GBrowse, but with pre-rendered, tiled images, a la Google Maps. A prototype was built, but this approach did not scale:
- D. melanogaster at pixel resolution is an order of magnitude wider than the continental US.
JBrowse uses nested containment lists (NCList) to store features. This approach is 5-500 times faster than competing methods such as R-trees, and B-trees with binning.
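The general idea of a nested containment list can be sketched in Python. This is an illustration of the data structure, not JBrowse's implementation (which is in JavaScript); the class layout and the half-open interval convention are assumptions. Intervals that are fully contained in another interval are stored in that interval's sublist, so within any one sublist both start and end coordinates are in increasing order and can be binary searched.

```python
from bisect import bisect_right


class NCList:
    """Nested containment list over half-open (start, end) intervals."""

    def __init__(self, intervals):
        # Sort by start ascending, then end descending, so that a containing
        # interval always precedes the intervals nested inside it.
        ordered = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
        self.nodes = []  # top-level sublist
        self.ends = []   # end coordinates of the top-level nodes
        stack = []       # chain of ancestors that may contain the next interval
        for start, end in ordered:
            # Drop ancestors that end before this interval does.
            while stack and stack[-1]["end"] < end:
                stack.pop()
            node = {"start": start, "end": end, "nodes": [], "ends": []}
            if stack:
                stack[-1]["nodes"].append(node)
                stack[-1]["ends"].append(end)
            else:
                self.nodes.append(node)
                self.ends.append(end)
            stack.append(node)

    def overlapping(self, qstart, qend):
        """Return all stored intervals that overlap [qstart, qend)."""
        hits = []
        self._search(self.nodes, self.ends, qstart, qend, hits)
        return hits

    @staticmethod
    def _search(nodes, ends, qstart, qend, hits):
        # Binary search for the first interval that ends after the query start.
        i = bisect_right(ends, qstart)
        while i < len(nodes) and nodes[i]["start"] < qend:
            node = nodes[i]
            hits.append((node["start"], node["end"]))
            # Only intervals nested in an overlapping parent can overlap.
            NCList._search(node["nodes"], node["ends"], qstart, qend, hits)
            i += 1


if __name__ == "__main__":
    features = [(0, 100), (10, 20), (15, 18), (30, 40), (150, 200)]
    index = NCList(features)
    print(index.overlapping(16, 17))
    print(index.overlapping(25, 35))
```

A query touches only the sublists of intervals that actually overlap it, which is what makes the structure attractive for fetching the features in a browser window.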
Ian demonstrated a TWiki plugin for JBrowse that demonstrated an easy way for users to upload their own tracks.
Some "imminent" developments for JBrowse:
- Lazily-loaded NCLists
- Text autocompletion; “proper” search
- Nextgen sequence data
- Start with basic summarization, then custom tracks
- Community annotation
- Persistent upload & sharing of tracks
- Editing/curation over the web (ackles...)
- Documented image-track API
- Synteny browser (c.f. GBrowse_syn)
- Much more at jbrowse.lighthouseapp.com
Ian closed with a very strong acknowledgment of Mitch Skinner's contribution to this work.
A synteny browser has display elements in common with a genome browser. Synteny browsers use sequence alignments, orthology, or co-linearity data to highlight different genomes, strains, etc., and they usually display co-linearity relative to a reference genome.
Other GMOD Synteny Viewers
CMap is a comparative map viewer and can be used to show alignments between markers and regions on any type of map.
Apollo (and Artemis too) provides an embedded synteny viewer.
GBrowse_syn is different from the other browsers in a number of ways:
- Does not rely on perfect co-linearity across the entire displayed region (no orphan alignments)
- Offers on the fly alignment chaining
- No upward limit on the number of species
- Uses grid lines to trace fine-scale sequence gain/loss
- Seamless integration with GBrowse data sources
- Ongoing support and development
- Some people think it looks nice
GBrowse_syn is part of the GBrowse distribution. It uses native (GBrowse-compliant) GFF2/GFF3 or Chado adapters for individual species' data, and stores synteny data in a separate joining database. The databases form a hub and spoke (or star), with the joining database at the hub, and the individual species databases as the spokes.
At run time, GBrowse_syn reads the species databases, the joining/alignment database, and configuration files for each species and an overall config file.
Where do I get data for GBrowse_syn?
You have to make it.
GBrowse_syn helps you visualize multiple sequence alignment data, but it does not generate it for you. This is a non-trivial task and is not for the faint of heart. Sheldon provided a high level overview of one possible process and possible tools you could use in that process.
Raw genomic sequences
- Mask repeats: RepeatMasker, Tandem Repeats Finder, nmerge
- Identify orthologous regions: ENREDO, MERCATOR, orthocluster
- Nucleotide-level alignment
Once you have the data, you need to get it into a format that is supported by the GBrowse_syn load scripts.
GBrowse_syn's user interface looks very much like GBrowse's interface. After selecting a reference assembly, GBrowse_syn displays each aligned sequence as a track, with every other track being the reference assembly. Aligned regions can be shown with and without connecting ribbons. Ribbons are twisted to indicate strand reversal. Strands can also be reversed in the display to untwist the ribbons. Alignment ribbons can be shown with or without embedded grid lines. Grid lines show a finer level of alignment than plain ribbons, allowing the user to easily identify regions with indels, and to visualize gene structure evolution or gene loss. They also require nucleotide level alignment.
GBrowse_syn can show the same breadth of features as GBrowse. However, for a clearer display, users are strongly encouraged to limit what they show. As in GBrowse, arbitrary annotations can be added to any feature and shown with popups or linked pages.
GBrowse_syn also provides direct visual feedback on the likely quality of assemblies and can be used for guidance on refining them. For closely related species, regions in the reference should link to only a few regions in the other sequences. If a region links to many different regions, the assembly likely needs significant additional work.
If all you have is orthology data, GBrowse_syn can show that. However, the utility of GBrowse_syn declines if the aligned sequences are too far apart. It does faithfully show the results of the alignment, but the visualization often highlights that the alignments are of poor quality.
Finally, if your alignment data has regions aligning to multiple regions in other species, say because of recent duplications, GBrowse_syn will visualize this correctly.
- Integration with GBrowse 2.0
- "On the fly" sequence alignment view
- AJAX-based user interface and navigation
- High-level graphical overviews
Alessandra created GBrowse.org to facilitate exchange of data, configuration files, and best practices between GBrowse users. The web site links to GBrowse instances and data download pages. It is based on the MediaWiki wiki package and makes extensive use of category tags to make information accessible in many different ways.
GBrowse.org is updated through a mixture of automated and manual mechanisms. Entrez's EFetch utility is used to initially create pages with their genome sequencing status. Each organism's page includes links to browsers, downloads, and sites and pages about that organism. If information is available on how the sequence and annotation data was produced, then that is included as well.
GBrowse.org is not limited to just GBrowse sites. It also links to Ensembl, UCSC, and several other browser types.
Future plans for GBrowse.org include:
- complete automations
- test and edit links
- edit sequencing and annotation methods
- generate GBrowses and pages about all genomes with sequencing completed
- divide GBrowses and genome pages in different sites (optional)
Jonathan started with an introduction to DAS. DAS:
- Stops us from suffering under too much data to manage.
- Allows us to download annotations for regions of interest rather than for whole genomes or databases.
- Allows data providers to stay in control of how their annotations are displayed to the world, and to keep them up to date for users.
DAS stands for Distributed Annotation System. It allows data providers to provide their data over the web in a common format. It is based on HTTP and XML. Apollo and GBrowse, and many other popular packages, can speak DAS. DAS client programs request a list of DAS sources, and can then request regions of interest from those sources.
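The request-and-response pattern described above can be sketched in Python. This is a minimal illustration, assuming the DAS 1.x URL layout (`/das/<source>/features?segment=<ref>:<start>,<stop>`) and a cut-down DASGFF response. The server `das.example.org`, the source name `demo`, and the sample XML are invented for the example, and real responses carry more elements per feature.

```python
import xml.etree.ElementTree as ET


def das_features_url(server, source, segment, start, stop):
    """Build a DAS 1.x 'features' request URL for one region of interest."""
    return f"{server}/das/{source}/features?segment={segment}:{start},{stop}"


def parse_features(xml_text):
    """Extract (id, type, start, end) tuples from a DASGFF document."""
    root = ET.fromstring(xml_text)
    features = []
    for feat in root.iter("FEATURE"):
        features.append((
            feat.get("id"),
            feat.find("TYPE").get("id"),
            int(feat.findtext("START")),
            int(feat.findtext("END")),
        ))
    return features


# Hand-written sample response (hypothetical server and data).
SAMPLE = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/demo/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="exon1" label="exon1">
        <TYPE id="exon">exon</TYPE>
        <START>1100</START>
        <END>1200</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""

print(das_features_url("http://das.example.org", "demo", "1", 1000, 2000))
print(parse_features(SAMPLE))
```

Because the client asks for one segment at a time, it never has to download a whole genome's annotations to display a single region.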
DAS exists in several versions. It was originally published in 2001, and over the years the standard bifurcated into the DAS 1.x and DAS2 lines. DAS 1.x has proved more popular than DAS2. The current standard is 1.53E, but a DAS 1.6E standard came out of a workshop in March 2009 and is expected to provide the functionality that many DAS2 users desired. The 1.6 spec adds new features and consolidates the way DAS is actually being used; the 1.6E extensions are still being developed.
Some DAS 1.5/1.6 commands: sources, features, sequence, types, stylesheet, structure, alignment, and interaction.
Some extensions in DAS 1.6:
- Represent features with more than two levels.
- Reliably relate feature types to a more structured ontology.
- Identify when two DAS servers are using the same coordinate system.
- A standard way to create and edit DAS features.
- Verification of DAS servers for standards compliance.
Current and Future Work
- More validation (headers and feature by id).
- Capability of bulk uploading/mirroring DAS sources to Registry (sources cmd).
- Adding all of Ensembl Genomes (bacteria and viruses) as DAS sources and to the registry.
- Completing the 1.6 spec - hierarchies, nextFeature.
- Updating client libraries and servers to work with both the 1.53 and 1.6 specs.
- New user interface to the registry for faster searching using Lucene; a limited version is also available from the Sanger and EBI sites.
- Greater support for ontologies, for example "give me all DAS sources that provide genes".
|DAS Libraries||DAS Servers||DAS Clients|
Some bullet points from Julie's talk on InterMine:
- InterMine has RESTful web services
- Web service can return HTML.
- FlyMine started in 2002. 5 developers, release about 10 times a year.
The Mines4Mods project started in May 2009. It is a 2 year grant. RGD, SGD, and ZFIN are all participating, and each has half a developer working on it. The project is aiming for interoperability between InterMine instances. The hope is to port results from one InterMine instance to another and then use them in a query at the new location.
Show and Tell, Discussion
Daniel Sobral and Baptiste Brault of INRA Versailles demonstrated the Aniseed website, particularly the anatomy and gene expression atlas parts of it. Aniseed is currently converting its schema to Chado and plans to make its web interface available to the GMOD community.
If you have items that you would like to discuss (or be discussed) at this meeting, please add them here.
- Project update
- GMOD REST API - Presentation of the current draft spec and a feedback/discussion period. --Jogoodma
- DAS: Current Situation and Developments -- JWarren
- InterMine - update and new project with SGD, RGD and ZFIN -- Julie Sullivan
- Next Generation Sequencing - methods for viewing NGS data in GBrowse -- Steve Taylor
- GBrowse 1.70 and 2.0 releases
- JBrowse 1.0 release
- GBrowse_syn update
- Apollo plans for the next generation of Apollo
- Chado 1.1 release
- GMOD for Metagenomics? -- Clements
- Assembly life cycle management for beginners --DanBolser 14:04, 20 July 2009 (UTC)
- Linked Data for GMOD Databases - making GMOD databases machine-accessible to achieve better data sharing and reuse for bioinformaticians and application developers -- Jun Zhao
- GMOD Biological Object Layer framework
Cost and Registration
The cost was £50, which included a catered lunch on Friday. Space was limited to the first 50 people to register.
The meeting has a mailing list that all meeting related correspondence will be sent to:
Any meeting participant can send an email to the list.
We would like to thank the Computational Biology Research Group (CBRG) of the University of Oxford for hosting and financially supporting the week's events.
I would particularly like to thank Stephen Taylor, Simon McGowan and Zong-Pei Han for their help and support during the entire week of GMOD Europe 2009. We could not have done this without you. -- Dave C.
|First Name||Last Name||Affiliation|
|Erick||Antezana||Bayer BioScience NV|
|T. Grant||Belgard||MRC FGU|
|Alessandra||Bilardi||CRIBI - University of Padova|
|Tim||Burgis||Imperial College- London|
|Scott||Cain||Ontario Institute for Cancer Research|
|Maria||Cartolano||University of Oxford|
|Etienne P||de Villiers||ILRI|
|Phil||East||Cancer Research UK|
|Matt||Eldridge||Cancer Research UK- Cambridge Research Institute|
|Ben||Elsworth||University of Edinburgh|
|Josh||Goodman||FlyBase (Indiana University)|
|Zong-Pei||Han||Computational Biology Research Group, Oxford|
|Ed||Lee||Lawrence Berkeley National Laboratory|
|Jacob||Lemieux||Computational Biology Research Group|
|Siu-wai||Leung||University of Macau|
|Emanuele||Marchi||University of Oxford|
|Simon||McGowan||Computational Biology Research Group, Oxford|
|Sheldon||McKay||Cold Spring Harbor Laboratory|
|Frederic||Potier||Bayer BioScience NV|
|Peter||Rice||European Bioinformatics Institute|
|Kim||Rutherford||University of Cambridge|
|Michelle||Simon||Medical Research Council|
|Aengus||Stewart||London Research Institute CRUK|
|Julie||Sullivan||InterMine- Dept of Genetics- Cambridge|
|Steve||Taylor||Computational Biology Research Group, Oxford|
|Adrian||Tivey||Wellcome Trust Sanger Institute|
|Giles||Velarde||Wellcome Trust Sanger Institute|
|Pieter Emiel||Ver Loren van Themaat||Max Planck Institute for Plant Breeding Research|
|Jonathan||Warren||The Sanger Institute|
|Xikun||Wu||Institute for Animal Health|
|Jun||Zhao||University of Oxford|
Attendees were asked to provide feedback at the end of the meeting.
Q: Would you recommend GMOD meetings to others?
Q: Please rate the meeting(s) using the following scale: 1 (not at all) to 3 (reasonably) to 5 (exceptionally).
|Question||1||2||3||4||5|
|How useful was the meeting?||0%||0%||23%||53%||23%|
|Was the meeting well run and organized?||0%||0%||18%||47%||35%|
Q: Was the meeting what you expected?
- Yes of course! The meeting was really interesting!
- yes and it was good to for me to meet the developers
- Yes, pretty much. It was in part this time just a good way to meet up with particular collaborators.
- Yes, but I was hoping to learn more about Chado
- Very very useful.
Q: Which presentations and sessions at this meeting were the most useful or interesting?
- SAMtools and updated on GMOD tools also user presentations
- GMOD Biological Object Layer, JBrowse
- I was really interested in the NGS integration and display on gbrowse 2, especially the reads representation and population genotypes. Thanks Dave!
- NGS visualization in GBrowse/JBrowse, SAMtools
- Linked Data for GMOD Databases, Class layer for Chado, GMOD Biological Object Layer, SAMtools for NextGen Sequence Data, GBrowse 2, NextGen, PopGen
- JBrowse, GBrowse_syn
- Interesting presentations on SAMtools, JBrowse and GBrowse_syn. Web services. discussions were also interesting
- Chado and GBrowse
- GMOD from the Trenches; SAMtools; Dave's GBrowse2
- restful services, JBrowse, semantic web
- all useful and interesting
- Heng Li, Ian Holmes, Jun Zhao
- GBrowse and JBrowse updates
- The talks about GBrowse and the talks about next gen sequencing.
- GBrowse2, JBrowse, SAMtools, InterMine
- GMOD Biological Object Layer, JBrowse
Q: Do you have suggestions for improving GMOD meetings in the future?
- Another one in Europe please. We could host one in Hinxton but I am prepared to travel
- I was able to come to the meeting because it was in Europe, so more meetings in Europe would be very helpful
- Maybe some people can present posters during Coffee Breaks for the next GMOD meeting.
- more sessions
- Less instruction copying, more problem solving
- I do think a informal or formal drinks or meal in the evening is a good idea, even if it's just - 'we are going to this pub to get a meal' which delegates can go to or not and then pay for themselves?
- Better time keeping
- Somewhere drier ;-) Seriously, it didn't seem to have the energy of some of the other 2 I've been to - maybe me or maybe people tired from the course
- Try encouraging outsiders to bring non-genomic information to GMOD E.g. people from BDGP, ZFIN expression data, 4Dxpress, BGee, etc...
Additional feedback, suggestions, criticism, and praise.
- This is the first time ever to learn to make use of so many useful bioinformatics tools from the developers and experts of them.
- Thanks for the meeting.
- Thanks very much to the organisers for their hard work - I definitely thought it was worth it
Next Meeting: January 2010 in San Diego, California
The next GMOD Community Meeting was held January 14-15, 2010 in San Diego, California, United States, immediately following PAG 2010.