GMOD Evo Hackathon
Tools for Evolutionary Biology Hackathon
November 8-12, 2010
NESCent, Durham, North Carolina, USA
GMOD held a hackathon November 8-12, 2010, at the National Evolutionary Synthesis Center (NESCent) in Durham, North Carolina. This hackathon focused on improving GMOD's support for evolutionary biology.
The Open Call for Participation went out on August 1, 2010, and remained open until August 25. Participants have been selected and notified of their status.
- 1 Synopsis
- 2 Background
- 3 Specific objectives
- 4 Hackathon Structure
- 5 Participant Funding
- 6 March 2011 Satellite
- 7 Timeline
- 8 Sponsorship
- 9 Organizing Committee
This hackathon addressed critical gaps in the capabilities of the Generic Model Organism Database (GMOD) toolbox that limited its utility for evolutionary research. Specifically, we focused on tools for 1) viewing comparative genomics data; 2) visualizing phylogenomic data; and 3) supporting population diversity data and phenotype annotation.
The event brought together a group of 30 software developers, end-user representatives, and documentation experts who would otherwise not have met. The participants included key developers of GMOD components that lacked features critical for emerging evolutionary biology research, developers of informatics tools in evolutionary research that lacked GMOD integration, and informatics-savvy biologists who represented end-user requirements.
This hackathon provided a unique opportunity to infuse the community of GMOD developers with a heightened awareness of unmet needs in evolutionary biology that GMOD components have the potential to fill, and for tool developers in evolutionary biology to better understand how best to extend or integrate with already existing GMOD components.
- Worked on establishing a common database backend and JSON-based API for comparative genomics data, using several visualization tools (including JBrowse and GBrowse_syn) as targets. This will enable sharing of comparative data across multiple tools and from multiple sources.
- GBrowse_syn is built on and takes advantage of the GBrowse genome browser code and config files. However, it did not work well with GBrowse2, due to significant architectural changes. This group refactored GBrowse2 to naturally support GBrowse_syn. This work will also enable several other GBrowse1-only applications (SynView, Primer Designer, ...) to be ported to GBrowse2 as well. Two participants also became core GBrowse_syn developers.
- This group set out to extend JBrowse to be a comparative genomics browser. The group removed the existing "single genome" assumption from the code and successfully displayed several genomes in parallel. Several participants also became familiar with JBrowse code and architecture.
- PhyloBox is a flexible and fast web-based tree visualization program. At the hackathon the PhyloBox team extended PhyloBox in numerous ways to make it a "widget" that can interact with other widgets. The team also created PhyloBox documentation.
- PhyloBox JBrowse Integration
- This group is a perfect example of the interaction that can happen at a hackathon. They worked with the JBrowse_syn, PhyloBox and GMatchbox groups to enable integration of these three technologies, and were very helpful in getting teams to work together.
- Natural Diversity and Phenotypes in Chado
- This group focused on two outcomes, both relating to Chado. The first was a prototype Rails application that provided a web interface to the new Natural Diversity module in Chado. This was built on top of the emerging Chado on Rails project. The second was a better understanding, slight refactoring, and updated documentation for Chado's phenotype module.
- Galaxy + HyPhy
- Galaxy is both a workflow system and a means of persisting computational pipelines and results. This group worked on improving Galaxy's ability to integrate interactive tools, using HyPhy as the prototype application. The Galaxy and HyPhy code bases were modified to support this.
- This subgroup worked on improving tree handling in BioPerl. Specifically, they addressed the handling of very large trees or large numbers of small trees. BioPerl now supports storing such trees in a lightweight database instead of in memory.
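The idea behind the last item, keeping large trees in a lightweight database rather than in memory, can be illustrated with a minimal sketch. This is not the BioPerl implementation: the table layout and the function names (`open_tree_store`, `add_node`, `children`, `leaf_labels`) are assumptions made up for this illustration, using Python's standard `sqlite3` module.

```python
import sqlite3


def open_tree_store(path=":memory:"):
    """Open (or create) a SQLite store holding one row per tree node.

    The schema is hypothetical: each node records its tree, its parent,
    an optional label, and an optional branch length.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        " tree_id INTEGER NOT NULL,"
        " node_id INTEGER NOT NULL,"
        " parent_id INTEGER,"
        " label TEXT,"
        " branch_length REAL,"
        " PRIMARY KEY (tree_id, node_id))"
    )
    # Index so that looking up the children of a node does not scan the table.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS node_parent ON node (tree_id, parent_id)"
    )
    return conn


def add_node(conn, tree_id, node_id, parent_id=None, label=None,
             branch_length=None):
    """Insert a single node; the root is the node whose parent_id is NULL."""
    conn.execute(
        "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
        (tree_id, node_id, parent_id, label, branch_length),
    )


def children(conn, tree_id, node_id):
    """Return the ids of the direct children of a node, in id order."""
    rows = conn.execute(
        "SELECT node_id FROM node WHERE tree_id = ? AND parent_id = ?"
        " ORDER BY node_id",
        (tree_id, node_id),
    )
    return [row[0] for row in rows]


def leaf_labels(conn, tree_id, root_id):
    """Collect leaf labels left to right without loading the whole tree.

    Only the stack of pending node ids is held in memory; node data is
    fetched from the database as each node is visited.
    """
    labels = []
    stack = [root_id]
    while stack:
        node = stack.pop()
        kids = children(conn, tree_id, node)
        if kids:
            stack.extend(reversed(kids))
        else:
            (label,) = conn.execute(
                "SELECT label FROM node WHERE tree_id = ? AND node_id = ?",
                (tree_id, node),
            ).fetchone()
            labels.append(label)
    return labels


if __name__ == "__main__":
    # Store the tree ((A,B),C) as tree 1 and list its leaves.
    store = open_tree_store()
    add_node(store, 1, 1)                  # root
    add_node(store, 1, 2, parent_id=1)     # internal node
    add_node(store, 1, 3, parent_id=2, label="A")
    add_node(store, 1, 4, parent_id=2, label="B")
    add_node(store, 1, 5, parent_id=1, label="C")
    print(leaf_labels(store, 1, 1))
```

Because every tree shares one table keyed by `tree_id`, the same layout would serve both cases mentioned above: a single very large tree or a large number of small ones.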
The GMOD project is a confederation of intercompatible open-source projects developing software tools for storing, managing, curating, and publishing biological data. Although the GMOD project originated from the goal of developing a generic tool set for common needs among model organism databases, GMOD tools are now used by many large and small, collaborative and single-investigator biological database projects for the dissemination of experimental results and curated knowledge.
GMOD's software tools provide a powerful and feature-rich basis for working with biological data, in particular genomic and other molecular data. However, due to GMOD's historical emphasis on single-genome projects, many GMOD tools still lack features that are critical to effectively support the comparative, phylogenetic, and natural diversity-oriented questions frequently asked in evolutionary research.
Recent developments have given rise to a window of opportunity for forging collaborations towards filling this gap. In particular, the cost of collecting comparative molecular data on a large or even genomic scale has recently dropped dramatically, primarily thanks to next-generation high-throughput sequencing technologies. This has enabled evolutionary researchers to bring genome-scale molecular data to bear on key evolutionary questions. It has also allowed single organism-focused molecular biology labs, who represent GMOD's traditional user base, to broaden out to multi-organism comparative approaches. Bringing these two communities with increasingly shared interests and complementary scientific and technical expertise together offers an opportunity to start filling GMOD's gaps in these areas while building on its existing strengths. In addition, such direct interaction will heighten future awareness of needs of evolutionary researchers among GMOD developers who have so far mostly supported its traditional user base, and can in the long term increase the ranks of GMOD contributors from a field it was not originally designed to serve.
The hackathon format is ideally suited to realize this opportunity. Its strengths lie in facilitating face-to-face interaction among people with complementary expertise, and collaborative work on tangible products that can form the basis of continued partnerships long beyond the end of the meeting.
Organizers identified the following broad themes for focusing work at the event. Before and at the hackathon, the participants refined and distilled these and other options into concrete implementation targets. The participants developed criteria for prioritization, such as maturity of a target for implementation, availability of test data, and potential for completing or making significant progress towards the target during the hackathon. Further ideas and discussion topics can be found on the Supplemental Information page.
Viewing tools for comparative genomics data
GBrowse_syn is a popular GMOD component for viewing comparative genomics data, particularly for viewing synteny between genomes. It does not currently support the next-generation sequencing (NGS) data increasingly available for comparative genomics and emerging model systems. Support for NGS data was identified by the EMS working group as a high priority.
In particular, GBrowse_syn lacks support for the Sequence Alignment/Map (SAM) format, its mechanism for storing genome comparisons does not scale beyond a few organisms, and the means for tracking the necessary alignment metadata in Chado are insufficient.
In addition to filling those gaps, GBrowse_syn would also particularly stand to benefit from the event by gaining a more sustainable developer base.
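For readers unfamiliar with the SAM format mentioned above, the sketch below shows roughly what reading a SAM alignment record involves. It is an illustration only, not GBrowse_syn code: the names `SamRecord`, `parse_sam_line`, and `is_unmapped` are hypothetical, optional tag fields are ignored, and the only format facts relied on are the eleven mandatory tab-separated columns and the `0x4` "segment unmapped" flag bit.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory columns of a SAM alignment line."""
    qname: str   # query (read) name
    flag: int    # bitwise flags
    rname: str   # reference sequence name, "*" if none
    pos: int     # 1-based leftmost mapping position, 0 if unavailable
    mapq: int    # mapping quality
    cigar: str   # CIGAR string, "*" if unavailable
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities


def parse_sam_line(line):
    """Parse one alignment line; header lines (starting with '@') give None."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


def is_unmapped(record):
    """True when the 0x4 flag bit marks the read as unmapped."""
    return bool(record.flag & 0x4)


if __name__ == "__main__":
    example = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
    record = parse_sam_line(example)
    print(record.rname, record.pos, is_unmapped(record))
```

A browser that supports SAM would typically go on to group such records by reference sequence and position so that a requested region can be drawn quickly; that indexing step is outside the scope of this sketch.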
Visualization of phylogenetic data and trees
The GMOD toolkit at present does not include web-based alignment viewers, nor can the increasingly popular JBrowse genome browser (the designated successor of GBrowse) display multiple sequence alignments. GMOD also lacks a phylogenetic tree widget.
Population Diversity and Phenotype support
GMOD's capabilities for managing phenotype and natural diversity data are scattered across partially redundant and outdated modules, do not support modern ontology-based entity-quality data, and lack a web interface. The sophisticated phenotype annotation tools that do exist cannot interface with Chado, GMOD's central relational data model. Yet, phenotypic and genetic diversity data are central to many evolutionary research questions.
A Natural Diversity Module initiative to address at least the deficiencies within Chado formed earlier this year. Several key developers (one of the original developers of the module, and the developer of Phenex, a phenotype curation tool) are already local to NESCent, and so the hackathon provides a unique opportunity to review and refine the natural diversity data model face-to-face, and to integrate it with an updated and reconciled phenotype module. A recently reported prototype of a Chado data adapter for Phenote, GMOD's phenotype annotation tool, could be generalized to become the data persistence interface for such data.
Aside from the data model deficiencies, the ANISEED project has started efforts to generalize its sophisticated atlas/image-based web interface for phenotype data, and to make it operate on top of Chado. The hackathon could harness this synergy to help this effort leap forward, which could ultimately provide GMOD with the currently missing web-interface for such data.
Before the Event
Discussion of ideas and sometimes even design actually starts well before the hackathon, on mailing lists, wiki pages, and conference calls set up among accepted attendees. This advance work lays the foundation for participants to be productive from the very first day. This also means that participants should be willing to contribute some time in advance of the hackathon itself to participate in this preparatory discussion.
During the Event
Typically, hackathon participants use the morning of the first day of the event to organize themselves into working groups of between 3 and 6 people, each with a focused implementation objective. Ideas and objectives are discussed, and attendees coalesce around the projects in which they have the most experience or interest.
Deliverables / Event Results
The hackathon will use a wiki hosted at NESCent during the event. Once the hackathon is done, relevant content will be copied from the NESCent wiki to the GMOD wiki. Each working group during the event will typically have its own wiki page, linked from the main hackathon wiki page, where it documents its minutes and design notes, and provides links to the code and documentation it produces. Also, since GMOD and NESCent are both committed to open source principles, all code and documentation produced by participants during the hackathon must be published under an OSI-approved open source license. As contributions to existing GMOD tools, all hackathon products will most likely satisfy this requirement automatically.
NESCent is sponsoring this hackathon, and has made funds available to defray costs for qualified participants.
March 2011 Satellite
|Duke Leto (Organizer)||jonathan at leto net|
|Hilmar Lapp||hlapp at nescent.org||Large trees in BioPerl cleanup|
|June 3, 2010||Proposal submitted to NESCent|
|June 10, 2010||Funding approved|
|August 1, 2010||Open call for participants, applications open|
|August 25, 2010||Open call application deadline|
|September 16, 2010||Applicants notified|
|September 24, 2010||Deadline for participant attendance commitment|
|November 8-12, 2010||Hackathon at NESCent|
|March 7 (+ ?), 2011||Hackathon follow-up gathering at NESCent as part of GMOD Americas 2011.|
This event is sponsored by the US National Evolutionary Synthesis Center (NESCent) through its Informatics Whitepapers program. NESCent promotes the synthesis of information, concepts and knowledge to address significant, emerging, or novel questions in evolutionary science and its applications. NESCent achieves this by supporting research and education across disciplinary, institutional, geographic, and demographic boundaries.