GMOD Evo Hackathon

Tools for Evolutionary Biology Hackathon
November 8-12, 2010
NESCent, Durham, North Carolina, USA



GMOD held a hackathon November 8-12, 2010, at the National Evolutionary Synthesis Center (NESCent) in Durham, North Carolina. This hackathon focused on improving GMOD's support for evolutionary biology.

The Open Call for Participation went out on August 1, 2010, and remained open until August 25. Participants have been selected and notified of their status.

Synopsis

This hackathon addressed critical gaps in the capabilities of the Generic Model Organism Database (GMOD) toolbox that limited its utility for evolutionary research. Specifically, we focused on tools for 1) viewing comparative genomics data; 2) visualizing phylogenomic data; and 3) supporting population diversity data and phenotype annotation.

The event brought together a group of 30 software developers, end-user representatives, and documentation experts who would otherwise not have met. The participants included key developers of GMOD components that lacked features critical for emerging evolutionary biology research, developers of informatics tools in evolutionary research that lacked GMOD integration, and informatics-savvy biologists who represented end-user requirements.

This hackathon provided a unique opportunity to infuse the community of GMOD developers with a heightened awareness of unmet needs in evolutionary biology that GMOD components have the potential to fill, and for tool developers in evolutionary biology to better understand how best to extend or integrate with already existing GMOD components.

Subgroups

Due to the closure of NESCent, the links below to the hackathon wiki originally hosted by NESCent no longer resolve. A permanent archive of the wiki in the form of a MediaWiki export is available at Zenodo. To view the content of that wiki, one first needs to reinstantiate a MediaWiki instance from the export.


The 30 participants self-organized into eight groups with at least one group addressing each of the event's three objectives. The outcomes for each group are summarized below.

GMatchbox
This group worked on establishing a common database backend and JSON-based API for comparative genomics data, using several visualization tools (including JBrowse and GBrowse_syn) as targets. This will enable comparative data from multiple sources to be shared among multiple tools.
GBrowse_syn2
GBrowse_syn is built on and takes advantage of the GBrowse genome browser code and config files. However, it did not work well with GBrowse2, due to significant architectural changes. This group refactored GBrowse2 to naturally support GBrowse_syn. This work will also enable several other GBrowse1-only applications (SynView, Primer Designer, ...) to be ported to GBrowse2 as well. Two participants also became core GBrowse_syn developers.
JBrowse_syn
This group set out to extend JBrowse to be a comparative genomics browser. The group removed the existing "single genome" assumption from the code and successfully displayed several genomes in parallel. Several participants also became familiar with JBrowse code and architecture.
PhyloBox
PhyloBox is a flexible and fast web-based tree visualization program. At the hackathon the PhyloBox team extended PhyloBox in numerous ways to make it a "widget" that can interact with other widgets. The team also created PhyloBox documentation.
PhyloBox JBrowse Integration
This group is a good example of the interaction that can happen at a hackathon. Its members worked with the JBrowse_syn, PhyloBox and GMatchbox groups to enable integration of these three technologies, and were very helpful in getting teams to work together.
Natural Diversity and Phenotypes in Chado
This group focused on two outcomes, both relating to Chado. The first was a prototype Rails application that provided a web interface to the new Natural Diversity module in Chado. This was built on top of the emerging Chado on Rails project. The second was a better understanding, slight refactoring, and updated documentation for Chado's phenotype module.
Galaxy + HyPhy
Galaxy is both a workflow system and a means of persisting computational pipelines and results. This group worked on improving Galaxy's ability to integrate interactive tools, using HyPhy as the prototype application. The Galaxy and HyPhy code bases were modified to support this.
BioPerl
This subgroup worked on improving tree handling in BioPerl. Specifically, they addressed the handling of very large trees or large numbers of small trees. BioPerl now supports storing such trees in a lightweight database instead of in memory.
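
The idea behind the BioPerl subgroup's work, keeping tree nodes in a lightweight database rather than in memory, can be illustrated with a minimal sketch. This is not BioPerl's implementation or schema: it is written in Python with the standard sqlite3 module, and the table layout (a single hypothetical `node` table holding a parent pointer per node) and all function names are assumptions made for illustration only.

```python
import sqlite3


def open_tree_store(path=":memory:"):
    """Open (or create) a tree store; pass a file path to keep the tree on disk."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per node, linked to its parent (adjacency list).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        " id INTEGER PRIMARY KEY,"
        " parent_id INTEGER REFERENCES node(id),"
        " label TEXT,"
        " branch_length REAL)"
    )
    # Index the parent pointer so child lookups do not scan the whole table.
    conn.execute("CREATE INDEX IF NOT EXISTS node_parent ON node(parent_id)")
    return conn


def add_node(conn, label=None, parent_id=None, branch_length=None):
    """Insert one node and return its database id."""
    cur = conn.execute(
        "INSERT INTO node (parent_id, label, branch_length) VALUES (?, ?, ?)",
        (parent_id, label, branch_length),
    )
    return cur.lastrowid


def children(conn, node_id):
    """Return (id, label) for the direct children of a node."""
    rows = conn.execute(
        "SELECT id, label FROM node WHERE parent_id = ? ORDER BY id", (node_id,)
    )
    return rows.fetchall()


def leaf_labels(conn, node_id):
    """Return the sorted labels of all leaves below a node.

    The subtree is walked inside the database with a recursive query,
    so only the leaf labels are loaded into memory.
    """
    rows = conn.execute(
        "WITH RECURSIVE sub(id) AS ("
        "  SELECT ? UNION ALL"
        "  SELECT node.id FROM node JOIN sub ON node.parent_id = sub.id)"
        " SELECT label FROM node"
        " WHERE id IN (SELECT id FROM sub)"
        "   AND id NOT IN (SELECT parent_id FROM node WHERE parent_id IS NOT NULL)"
        " ORDER BY label",
        (node_id,),
    )
    return [row[0] for row in rows]


# Build the small tree ((A,B),C) and query it.
conn = open_tree_store()
root = add_node(conn)
inner = add_node(conn, parent_id=root, branch_length=0.1)
add_node(conn, "A", parent_id=inner, branch_length=0.2)
add_node(conn, "B", parent_id=inner, branch_length=0.3)
add_node(conn, "C", parent_id=root, branch_length=0.5)
print(leaf_labels(conn, root))
```

With a file path instead of the default in-memory database, the same queries work on trees too large to hold in memory, because each call touches only the rows it needs.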

PAG Poster

Robert Buels gave a poster on the hackathon at PAG 2011.

Background

The GMOD project is a confederation of intercompatible open-source projects developing software tools for storing, managing, curating, and publishing biological data. Although the GMOD project originated from the goal of developing a generic tool set for common needs among model organism databases, GMOD tools are now used by many large and small, collaborative and single-investigator biological database projects for the dissemination of experimental results and curated knowledge.

GMOD's software tools provide a powerful and feature-rich basis for working with biological data, in particular genomic and other molecular data. However, due to GMOD's historical emphasis on single-genome projects, many GMOD tools still lack features that are critical to effectively support the comparative, phylogenetic, and natural diversity-oriented questions frequently asked in evolutionary research.

Recent developments have given rise to a window of opportunity for forging collaborations towards filling this gap. In particular, the cost of collecting comparative molecular data on a large or even genomic scale has recently dropped dramatically, primarily thanks to next-generation high-throughput sequencing technologies. This has enabled evolutionary researchers to bring genome-scale molecular data to bear on key evolutionary questions. It has also allowed single organism-focused molecular biology labs, who represent GMOD's traditional user base, to broaden out to multi-organism comparative approaches. Bringing these two communities with increasingly shared interests and complementary scientific and technical expertise together offers an opportunity to start filling GMOD's gaps in these areas while building on its existing strengths. In addition, such direct interaction will heighten future awareness of needs of evolutionary researchers among GMOD developers who have so far mostly supported its traditional user base, and can in the long term increase the ranks of GMOD contributors from a field it was not originally designed to serve.

The hackathon format is ideally suited to realize this opportunity. Its strengths lie in facilitating face-to-face interaction among people with complementary expertise, and collaborative work on tangible products that can form the basis of continued partnerships long beyond the end of the meeting.

Specific objectives

Organizers identified the following broad themes for focusing work at the event. Before and at the hackathon, the participants refined and distilled these and other options into concrete implementation targets. The participants developed criteria for prioritization, such as maturity of a target for implementation, availability of test data, and potential for completing or making significant progress towards the target during the hackathon. Further ideas and discussion topics can be found on the Supplemental Information page.

Viewing tools for comparative genomics data

GBrowse_syn is a popular GMOD component for viewing comparative genomics data, particularly for viewing synteny between genomes. It does not currently support the next-generation sequencing (NGS) data increasingly available for comparative genomics and emerging model systems. Support for NGS data was identified by the EMS working group as a high priority.

In particular, GBrowse_syn lacks support for the Sequence Alignment/Map (SAM) format, its mechanism for storing genome comparisons does not scale beyond a few organisms, and the means for tracking the necessary alignment metadata in Chado are insufficient.

In addition to filling those gaps, GBrowse_syn would also particularly stand to benefit from the event by gaining a more sustainable developer base.

Visualization of phylogenetic data and trees

The GMOD toolkit at present does not include web-based alignment viewers, nor can the increasingly popular JBrowse genome browser (the designated successor of GBrowse) display multiple sequence alignments. GMOD also lacks a phylogenetic tree widget.

Implementing these from scratch would be far beyond a suitable hackathon target. However, SGN has a relatively mature web-based multiple alignment and tree browser that could be extracted from SGN's codebase and transformed into a GMOD component, an add-on for JBrowse. Current Java-based tree viewers (such as Archaeopteryx or PhyloWidget) could be used as the basis for a JavaScript-based tree viewer (or an applet that can be controlled through JavaScript) that integrates with JBrowse.

Population Diversity and Phenotype support

GMOD's capabilities for managing phenotype and natural diversity data are scattered across partially redundant and outdated modules, do not support modern ontology-based entity-quality data, and lack a web interface. The sophisticated phenotype annotation tools that do exist cannot interface with Chado, GMOD's central relational data model. Yet, phenotypic and genetic diversity data are central to many evolutionary research questions.

A Natural Diversity Module initiative to address at least the deficiencies within Chado formed earlier this year. Several key developers (one of the original developers of the module, and the developer of Phenex, a phenotype curation tool) are already local to NESCent, and so the hackathon provides a unique opportunity to review and refine the natural diversity data model face-to-face, and to integrate it with an updated and reconciled phenotype module. A recently reported prototype of a Chado data adapter for Phenote, GMOD's phenotype annotation tool, could be generalized to become the data persistence interface for such data.

Aside from the data model deficiencies, the ANISEED project has started efforts to generalize its sophisticated atlas/image-based web interface for phenotype data, and to make it operate on top of Chado. The hackathon could harness this synergy to help this effort leap forward, which could ultimately provide GMOD with the currently missing web interface for such data.

Hackathon Structure

Before the Event

Discussion of ideas and sometimes even design actually starts well before the hackathon, on mailing lists, wiki pages, and conference calls set up among accepted attendees. This advance work lays the foundation for participants to be productive from the very first day. This also means that participants should be willing to contribute some time in advance of the hackathon itself to participate in this preparatory discussion.

During the Event

Typically, hackathon participants use the morning of the first day of the event to organize themselves into working groups of between 3 and 6 people, each with a focused implementation objective. Ideas and objectives are discussed, and attendees coalesce around the projects in which they have the most experience or interest.

Deliverables / Event Results

The hackathon will use a wiki hosted at NESCent during the event. Once the hackathon is done, relevant content will be copied from the NESCent wiki to the GMOD wiki. Each working group during the event will typically have its own wiki page, linked from the main hackathon wiki page, where it documents its minutes and design notes, and provides links to the code and documentation it produces. Also, since GMOD and NESCent are both committed to open source principles, all code and documentation produced by participants during the hackathon must be published under an OSI-approved open source license. As contributions to existing GMOD tools, all hackathon products will most likely satisfy this requirement automatically.

Participant Funding

NESCent is sponsoring this hackathon and has made funds available to defray costs for qualified participants.

March 2011 Satellite

Satellite Meetings at GMOD Americas 2011

We are planning a followup gathering as a Satellite Meeting at GMOD Americas 2011, in March at NESCent. If you are interested in participating, please add your name below.

You do not need to have attended the original hackathon or plan on attending any other GMOD Americas 2011 events to participate in this satellite (or any other satellite). If you have an interest in extending GMOD and will be in the area or at GMOD Americas 2011, then you are strongly encouraged to participate.


Name Email Particular Interest?
Duke Leto (Organizer) jonathan at leto net
Hilmar Lapp hlapp at nescent.org Large trees in BioPerl cleanup
↑ Add your name and details above

Timeline

June 3, 2010 Proposal submitted to NESCent
June 10, 2010 Funding approved
August 1, 2010 Open call for participants, applications open
August 25, 2010 Open call application deadline
September 16, 2010 Applicants notified
September 24, 2010 Deadline for participant attendance commitment
November 8-12, 2010 Hackathon at NESCent
March 7 (+ ?), 2011 Hackathon followup gathering at NESCent as part of GMOD Americas 2011.

Sponsorship

NESCent

This event is sponsored by the US National Evolutionary Synthesis Center (NESCent) through its Informatics Whitepapers program. NESCent promotes the synthesis of information, concepts and knowledge to address significant, emerging, or novel questions in evolutionary science and its applications. NESCent achieves this by supporting research and education across disciplinary, institutional, geographic, and demographic boundaries.

Organizing Committee