Tools for Evolutionary Biology
Hackathon, November 8-12, 2010, NESCent, Durham, North Carolina, USA
GMOD held a hackathon November 8-12, 2010, at the National Evolutionary Synthesis Center (NESCent) in Durham, North Carolina. This hackathon focused on improving GMOD’s support for evolutionary biology.
The Open Call for Participation went out on August 1, 2010, and remained open until August 25. Participants have been selected and notified of their status.
This hackathon addressed critical gaps in the capabilities of the Generic Model Organism Database (GMOD) toolbox that limited its utility for evolutionary research. Specifically, we focused on tools for 1) viewing comparative genomics data; 2) visualizing phylogenomic data; and 3) supporting population diversity data and phenotype annotation.
The event brought together a group of 30 software developers, end-user representatives, and documentation experts who would otherwise not have met. The participants included key developers of GMOD components that lacked features critical for emerging evolutionary biology research, developers of informatics tools in evolutionary research that lacked GMOD integration, and informatics-savvy biologists who represented end-user requirements.
This hackathon provided a unique opportunity to infuse the community of GMOD developers with a heightened awareness of unmet needs in evolutionary biology that GMOD components have the potential to fill, and for tool developers in evolutionary biology to better understand how best to extend or integrate with already existing GMOD components.
Due to the closure of NESCent, the links below to the hackathon wiki originally hosted by NESCent no longer resolve. A permanent archive of the wiki in the form of a MediaWiki export is available at Zenodo. To view the content of that wiki, one first needs to reinstantiate a MediaWiki instance from the export.
The 30 participants self-organized into eight groups with at least one group addressing each of the event’s three objectives. The outcomes for each group are summarized below.
GMatchbox
This group worked on establishing a common database backend and JSON-based API for comparative genomics data, using several visualization tools (including JBrowse and GBrowse_syn) as targets. This will enable sharing of comparative data across multiple tools and from multiple sources.
GBrowse_syn2
GBrowse_syn is built on and takes advantage of the GBrowse genome browser code and config files. However, it did not work well with GBrowse2, due to significant architectural changes. This group refactored GBrowse2 to naturally support GBrowse_syn. This work will also enable several other GBrowse1-only applications (SynView, Primer Designer, …) to be ported to GBrowse2 as well. Two participants also became core GBrowse_syn developers.
JBrowse_syn
This group set out to extend JBrowse to be a comparative genomics browser. The group removed the existing “single genome” assumption from the code and successfully displayed several genomes in parallel. Several participants also became familiar with JBrowse code and architecture.
PhyloBox
PhyloBox is a flexible and fast web-based tree visualization program. At the hackathon, the PhyloBox team extended PhyloBox in numerous ways to make it a “widget” that can interact with other widgets. The team also wrote PhyloBox documentation.
Integration
This group is a perfect example of the interaction that can happen at a hackathon. They worked with the JBrowse_syn, PhyloBox and GMatchbox groups to enable integration of these three technologies. This group was very helpful in getting teams to work together.
Natural Diversity and Phenotypes in Chado
This group focused on two outcomes, both relating to Chado. The first was a prototype Rails application that provided a web interface to the new Natural Diversity module in Chado. This was built on top of the emerging Chado on Rails project. The second was a better understanding, slight refactoring, and updated documentation for Chado’s phenotype module.
Galaxy + HyPhy
Galaxy is both a workflow system and a means of persisting computational pipelines and results. This group worked on improving Galaxy’s ability to integrate interactive tools, using HyPhy as the prototype application. The Galaxy and HyPhy code bases were modified to support this.
BioPerl
This subgroup worked on improving tree handling in BioPerl. Specifically, they addressed the handling of very large trees or large numbers of small trees. BioPerl now supports storing such trees in a lightweight database instead of in memory.
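The idea of keeping trees in a lightweight database rather than in memory can be sketched as follows. This is a minimal illustration in Python using SQLite, not the BioPerl implementation: the table layout (a `node` table with a parent pointer) and the function names `open_tree_store`, `add_node`, `children`, and `leaf_labels` are assumptions made for this example.

```python
import sqlite3


def open_tree_store(path=":memory:"):
    """Open (or create) a SQLite file holding tree nodes as an adjacency list.

    The schema is hypothetical and chosen only for illustration.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        " id INTEGER PRIMARY KEY,"
        " parent_id INTEGER REFERENCES node(id),"
        " label TEXT,"
        " branch_length REAL)"
    )
    # An index on parent_id keeps child lookups fast for very large trees.
    conn.execute("CREATE INDEX IF NOT EXISTS node_parent ON node(parent_id)")
    return conn


def add_node(conn, label, parent_id=None, branch_length=None):
    """Insert one node and return its database id."""
    cur = conn.execute(
        "INSERT INTO node (parent_id, label, branch_length) VALUES (?, ?, ?)",
        (parent_id, label, branch_length),
    )
    return cur.lastrowid


def children(conn, node_id):
    """Return (id, label) pairs for the direct children of a node."""
    rows = conn.execute(
        "SELECT id, label FROM node WHERE parent_id = ? ORDER BY id",
        (node_id,),
    )
    return rows.fetchall()


def leaf_labels(conn, node_id):
    """Yield leaf labels below node_id without loading the whole tree.

    Only a stack of pending node ids is held in memory at any time.
    """
    stack = [node_id]
    while stack:
        current = stack.pop()
        kids = children(conn, current)
        if kids:
            # Push in reverse so children are visited in insertion order.
            stack.extend(kid_id for kid_id, _ in reversed(kids))
        else:
            row = conn.execute(
                "SELECT label FROM node WHERE id = ?", (current,)
            ).fetchone()
            yield row[0]
```

Passing a file path instead of `":memory:"` to `open_tree_store` keeps the nodes on disk, so traversals touch only the rows they need. The same table can hold many small trees, each identified by its root node id.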
Robert Buels gave a poster on the hackathon at PAG 2011.
The GMOD project is a confederation of intercompatible open-source projects developing software tools for storing, managing, curating, and publishing biological data. Although the GMOD project originated from the goal of developing a generic tool set for common needs among model organism databases, GMOD tools are now used by many large and small, collaborative and single-investigator biological database projects for the dissemination of experimental results and curated knowledge.
GMOD’s software tools provide a powerful and feature-rich basis for working with biological data, in particular genomic and other molecular data. However, due to GMOD’s historical emphasis on single-genome projects, many GMOD tools still lack features that are critical to effectively support the comparative, phylogenetic, and natural diversity-oriented questions frequently asked in evolutionary research.
Recent developments have given rise to a window of opportunity for forging collaborations towards filling this gap. In particular, the cost of collecting comparative molecular data on a large or even genomic scale has recently dropped dramatically, primarily thanks to next-generation high-throughput sequencing technologies. This has enabled evolutionary researchers to bring genome-scale molecular data to bear on key evolutionary questions. It has also allowed single organism-focused molecular biology labs, who represent GMOD’s traditional user base, to broaden out to multi-organism comparative approaches. Bringing these two communities with increasingly shared interests and complementary scientific and technical expertise together offers an opportunity to start filling GMOD’s gaps in these areas while building on its existing strengths. In addition, such direct interaction will heighten future awareness of needs of evolutionary researchers among GMOD developers who have so far mostly supported its traditional user base, and can in the long term increase the ranks of GMOD contributors from a field it was not originally designed to serve.
The hackathon format is ideally suited to realize this opportunity. Its strengths lie in facilitating face-to-face interaction among people with complementary expertise, and collaborative work on tangible products that can form the basis of continued partnerships long beyond the end of the meeting.
Organizers identified the following broad themes for focusing work at the event. Before and at the hackathon, the participants refined and distilled these and other options into concrete implementation targets. The participants developed criteria for prioritization, such as maturity of a target for implementation, availability of test data, and potential for completing or making significant progress towards the target during the hackathon. Further ideas and discussion topics can be found on the Supplemental Information page.
GBrowse_syn is a popular GMOD component for viewing comparative genomics data, particularly for viewing synteny between genomes. It does not currently support the next-generation sequencing (NGS) data increasingly available for comparative genomics and emerging model systems. Support for NGS data was identified by the EMS working group as a high priority.
In particular, GBrowse_syn lacks support for the Sequence Alignment/Map (SAM) format, its mechanism for storing genome comparisons does not scale beyond a few organisms, and the means for tracking the necessary alignment metadata in Chado are insufficient.
In addition to filling those gaps, GBrowse_syn would also particularly stand to benefit from the event by gaining a more sustainable developer base.
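For readers unfamiliar with the SAM format mentioned above, the following Python sketch shows what reading a single alignment record involves. It covers only the eleven mandatory tab-separated fields defined by the SAM specification and ignores optional tags. The names `SamRecord` and `parse_sam_line` are hypothetical and are not part of GBrowse_syn or any GMOD component.

```python
from dataclasses import dataclass


@dataclass
class SamRecord:
    """The eleven mandatory fields of one SAM alignment line."""
    qname: str   # query (read) name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string describing the alignment
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII-encoded)

    @property
    def is_unmapped(self):
        # Flag bit 0x4 marks a read that did not align.
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self):
        # Flag bit 0x10 marks a read aligned to the reverse strand.
        return bool(self.flag & 0x10)


def parse_sam_line(line):
    """Parse one non-header SAM line; optional tag fields are ignored."""
    if line.startswith("@"):
        raise ValueError("header lines are not alignment records")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM record needs at least 11 fields")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
    )
```

A real integration would more likely rely on an existing SAM/BAM library than on hand-written parsing. The sketch is only meant to show the shape of the data a browser track would need to consume.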
The GMOD toolkit at present does not include web-based alignment viewers, nor can the increasingly popular JBrowse genome browser (the designated successor of GBrowse) display multiple sequence alignments. GMOD also lacks a phylogenetic tree widget.
Implementing these from scratch would be far beyond a suitable hackathon target. However, SGN has a relatively mature web-based multiple alignment and tree browser that could be extracted from SGN’s codebase and transformed into a GMOD component, an add-on for JBrowse. Current Java-based tree viewers (such as Archaeopteryx or PhyloWidget) could be used as the basis for a JavaScript-based tree viewer (or an applet that can be controlled through JavaScript) that integrates with JBrowse.
GMOD’s capabilities for managing phenotype and natural diversity data are scattered across partially redundant and outdated modules, do not support modern ontology-based entity-quality data, and lack a web interface. The sophisticated phenotype annotation tools that do exist cannot interface with Chado, GMOD’s central relational data model. Yet, phenotypic and genetic diversity data are central to many evolutionary research questions.
A Natural Diversity Module initiative to address at least the deficiencies within Chado formed earlier this year. Several key developers (one of the original developers of the module, and the developer of Phenex, a phenotype curation tool) are already local to NESCent, and so the hackathon provides a unique opportunity to review and refine the natural diversity data model face-to-face, and to integrate it with an updated and reconciled phenotype module. A recently reported prototype of a Chado data adapter for Phenote, GMOD’s phenotype annotation tool, could be generalized to become the data persistence interface for such data.
Aside from the data model deficiencies, the ANISEED project has started efforts to generalize its sophisticated atlas/image-based web interface for phenotype data, and to make it operate on top of Chado. The hackathon could harness this synergy to help this effort leap forward, which could ultimately provide GMOD with the currently missing web interface for such data.
Discussion of ideas and sometimes even design actually starts well before the hackathon, on mailing lists, wiki pages, and conference calls set up among accepted attendees. This advance work lays the foundation for participants to be productive from the very first day. This also means that participants should be willing to contribute some time in advance of the hackathon itself to participate in this preparatory discussion.
Typically, hackathon participants use the morning of the first day of the event to organize themselves into working groups of between 3 and 6 people, each with a focused implementation objective. Ideas and objectives are discussed, and attendees coalesce around the projects in which they have the most experience or interest.
The hackathon will use a wiki hosted at NESCent during the event. Once the hackathon is done, relevant content will be copied from the NESCent wiki to the GMOD wiki. Each working group during the event will typically have its own wiki page, linked from the main hackathon wiki page, where it documents its minutes and design notes, and provides links to the code and documentation it produces. Also, since GMOD and NESCent are both committed to open source principles, all code and documentation produced by participants during the hackathon must be published under an OSI-approved open source license. As contributions to existing GMOD tools, all hackathon products will most likely satisfy this requirement automatically.
NESCent is sponsoring this hackathon, and has made funds available to defray costs for qualified participants.
We are planning a followup gathering as a Satellite Meeting at GMOD Americas 2011, in March at NESCent. If you are interested in participating, please add your name below.
You do not need to have attended the original hackathon or plan on attending any other GMOD Americas 2011 events to participate in this satellite (or any other satellite). If you have an interest in extending GMOD and will be in the area or at GMOD Americas 2011, then you are strongly encouraged to participate.
Name | Email | Particular Interest? |
---|---|---|
Duke Leto (Organizer) | jonathan at leto net | |
Hilmar Lapp | hlapp at nescent.org | Large trees in BioPerl cleanup |
↑ Add your name and details above |
June 3, 2010 | Proposal submitted to NESCent |
June 10, 2010 | Funding approved |
August 1, 2010 | Open call for participants, applications open |
August 25, 2010 | Open call application deadline |
September 16, 2010 | Applicants notified |
September 24, 2010 | Deadline for participant attendance commitment |
November 8-12, 2010 | Hackathon at NESCent |
March 7 (+ ?), 2011 | Hackathon followup gathering at NESCent as part of GMOD Americas 2011. |
This event is sponsored by the US National Evolutionary Synthesis Center (NESCent) through its Informatics Whitepapers program. NESCent promotes the synthesis of information, concepts and knowledge to address significant, emerging, or novel questions in evolutionary science and its applications. NESCent achieves this by supporting research and education across disciplinary, institutional, geographic, and demographic boundaries.