GMOD Evo Hackathon Proposal
The GMOD Evo Hackathon aims to bring together experts in evolutionary biology, software, and bioinformatics to design and implement enhancements for GMOD tools, improving their support for evolutionary biology.
- 1 Overview
- 2 Background
- 3 Specific objectives
- 3.1 Viewing tools for Comparative genomics data
- 3.2 Phylogenetics Visualization
- 3.3 Population Diversity and Phenotype support
- 4 Hackathon Procedures
- 5 Organizers
- 6 Participant Invitation and Selection
We propose a NESCent-hosted hackathon to improve Generic Model Organism Database (GMOD) tools' support for evolutionary biology. Historically, the GMOD tools have emphasized storage and presentation of single-genome information, with very limited support for the data and formats that are important to the evolutionary biology community. In this meeting, we specifically aim to develop tools for 1) viewing comparative genomics data, 2) visualizing phylogenomic data, and 3) supporting population diversity and phenotype annotation. The proposed event will bring together a selected group of 20 experts in evolutionary biology, software, and bioinformatics in an informal, high-intensity environment, with the goal of achieving the maximum possible progress within the event's five-day duration.
The GMOD project is a confederation of open-source projects developing software tools for storing, managing, curating, and publishing biological data. GMOD tools are used by many large and small biological databases, and increasingly by individual research labs, for the dissemination of the results of experimental research and curated knowledge.
While GMOD's software tools provide a powerful and feature-rich basis for working with biological data, many GMOD tools still lack features needed to effectively support evolutionary biology. GMOD's current strengths are with genomic data. In the past GMOD has not emphasized areas that are traditionally important in evolutionary biology, such as phenotypes, phylogenetics, population genetics and natural diversity. Historically, evolutionary researchers have not had access to genomics data because of cost issues, and thus GMOD has not had a large presence in that community.
However, two trends are now bringing the interests of these communities together in ways that can benefit both. First, high-throughput sequencing technologies have made large-scale, multi-specimen/multi-organism sequencing affordable, even for small labs. This trend has also vastly increased the volume and diversity of public genome and transcriptome data, creating tempting opportunities for evolutionary and comparative analysis. GMOD provides important tools for working with this data: GBrowse and JBrowse for visualization, Chado for storage, indexing, and as a backend for analyses, MAKER for high-throughput standardized annotation, and Galaxy as a comparative genomics workbench.
Second, GMOD's existing user base is increasingly pursuing research in areas such as phenotypes and population biology. Evolutionary biologists have a great deal of experience working with these types of data. Making GMOD more useful to evolutionary researchers and bringing those researchers into the GMOD community will also benefit GMOD's traditional user base, which is just starting to use these data types.
This hackathon is a way to bring these communities together so that GMOD tools can be enhanced to:
- better serve the needs of evolutionary biologists for data types GMOD already handles well, and
- better support data types that evolutionary biologists have a longstanding interest in, but that are new to GMOD.
We are seeking NESCent's support and hosting for this event. NESCent has a commitment to facilitating data sharing in evolutionary research and GMOD is a way to achieve that goal. GMOD is a widely adopted set of tools that facilitate data integration and sharing through common formats and interfaces. Making GMOD more useful to evolutionary biologists would help the evolutionary community better share their data.
NESCent is also an excellent place to hold a hackathon. It has significant experience hosting hackathons, and holding the event there would allow us to take advantage of that experience.
Organizers have identified the following broad themes and specific objectives for guiding work at the event. This is based on our own experience, interactions with others in the GMOD and evolutionary biology communities, and insights gained by the recent Tools for Emerging Model Systems working group (EMS WG) at NESCent. This group consisted of evolutionary biologists working on non-model organisms and struggling with how best to exploit their data and connect their communities.
During the hackathon, participants will refine and distill these and other options into concrete implementation objectives. (Additional ideas and discussion topics can be found on our Supplemental Information page.) Organizers will focus participants on tasks for which we have a very clear idea of the objective and available data to test against, and which can be completed, or on which significant progress can be made, during the hackathon itself.
Viewing tools for Comparative genomics data
GBrowse_syn is the GMOD component for viewing comparative genomics data, particularly for viewing synteny between genomes. GBrowse_syn is a popular component, tying with JBrowse in popularity (33%) in the user survey. While very useful, it does not currently support the Next Generation Sequencing (NGS) data increasingly available for comparative genomics. Because GBrowse_syn is under-engineered for the current community need, and has a very small developer base, this is an ideal component to work on during the Hackathon. Additionally, because the EMS working group identified 'working with NGS data' as their number one concern, this objective is a high priority. Working on improving the GBrowse_syn component at the Hackathon would facilitate the recruitment of new developers, as well as catalyze new feature development. We expect the improved analysis and storage tools resulting from this work to make cross-species comparative analysis of large-scale datasets much more accessible.
Compatibility with SAM data
The Sequence Alignment/Map (SAM) format has become the de facto standard for high-throughput sequencing alignment data. While GBrowse 2.0 can support SAM data, GBrowse_syn does not. Because GBrowse_syn currently runs on the GBrowse 1.x platform, it needs to be upgraded to the GBrowse 2.x platform before it can support SAM data. We may also want to extend basic SAM functionality to show per-base information.
Database backend scalability
Currently, each pairwise comparison between organisms is stored in a separate database to drive GBrowse_syn. This quickly becomes unwieldy for large numbers of genomes. A hackathon objective could be to address this scalability issue.
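To make the scaling concern concrete: with one database per pairwise comparison, N genomes require N(N-1)/2 databases. The following is a minimal sketch of one possible alternative, in which all pairwise alignments share a single table keyed by genome name. The table layout, column names, and helper functions are assumptions made for illustration only; they are not GBrowse_syn's actual schema.

```python
import sqlite3
from itertools import combinations


def pairwise_count(n_genomes):
    """Number of unordered genome pairs, i.e. one database each in the current design."""
    return n_genomes * (n_genomes - 1) // 2


# Hypothetical single-table layout (an assumption, not GBrowse_syn's schema).
SCHEMA = """
CREATE TABLE alignment (
    alignment_id INTEGER PRIMARY KEY,
    src_genome   TEXT NOT NULL,
    src_seq      TEXT NOT NULL,
    src_start    INTEGER NOT NULL,
    src_end      INTEGER NOT NULL,
    tgt_genome   TEXT NOT NULL,
    tgt_seq      TEXT NOT NULL,
    tgt_start    INTEGER NOT NULL,
    tgt_end      INTEGER NOT NULL
);
CREATE INDEX alignment_src_idx
    ON alignment (src_genome, src_seq, src_start, src_end);
"""


def build_demo_db(genomes):
    """Create an in-memory database with one made-up alignment per genome pair."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for src, tgt in combinations(genomes, 2):
        conn.execute(
            "INSERT INTO alignment (src_genome, src_seq, src_start, src_end,"
            " tgt_genome, tgt_seq, tgt_start, tgt_end)"
            " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (src, "chr1", 100, 200, tgt, "chr1", 150, 250),
        )
    conn.commit()
    return conn


def aligned_regions(conn, genome, seq, start, end):
    """Return target regions whose source interval overlaps the query region."""
    rows = conn.execute(
        "SELECT tgt_genome, tgt_seq, tgt_start, tgt_end FROM alignment"
        " WHERE src_genome = ? AND src_seq = ?"
        " AND src_start <= ? AND src_end >= ?"
        " ORDER BY tgt_genome",
        (genome, seq, end, start),
    )
    return rows.fetchall()
```

In this layout, adding a genome adds rows rather than databases, and a single query returns the aligned regions in every other genome. A real design would also have to decide how to handle the reverse direction of each pair and strand information, which this sketch leaves out.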
Tracking of alignment metadata
In order to properly display syntenic views, particularly for NGS data, the source and target genome build information must be readily available for the genomes that have been aligned. There is currently limited ability for this information to be extracted from SAM alignment files and stored in Chado. Therefore, we intend to extend Chado to properly store experimental and alignment metadata, vital for comparative analysis.
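As a starting point for discussion, the reference-sequence (`@SQ`) lines of a SAM header can carry some of this information: the SAM specification defines `SN` (sequence name) and `LN` (length), plus optional `AS` (genome assembly identifier), `SP` (species), and `UR` (URI of the sequence) tags. Below is a minimal sketch that extracts these tags from header text. The function name, the output dictionary keys, and the sample header values are assumptions for illustration, and how the result would map onto Chado tables is left open.

```python
def parse_sam_reference_metadata(sam_text):
    """Collect per-reference metadata from the @SQ lines of a SAM header."""
    records = []
    for line in sam_text.splitlines():
        if not line.startswith("@"):
            break  # header lines come before any alignment records
        if not line.startswith("@SQ"):
            continue
        fields = {}
        for field in line.split("\t")[1:]:
            # Split on the first colon only, so values such as URIs stay intact.
            tag, _, value = field.partition(":")
            fields[tag] = value
        records.append(
            {
                "name": fields.get("SN"),
                "length": int(fields["LN"]) if "LN" in fields else None,
                "assembly": fields.get("AS"),  # optional tag, may be absent
                "species": fields.get("SP"),   # optional tag, may be absent
                "uri": fields.get("UR"),       # optional tag, may be absent
            }
        )
    return records


# Made-up header for illustration; names and values are not real data.
SAMPLE_HEADER = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\tAS:buildA\tSP:Example species\tUR:file:/data/buildA.fa\n"
    "@SQ\tSN:chr2\tLN:500\n"
    "read1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
)

references = parse_sam_reference_metadata(SAMPLE_HEADER)
```

Because `AS` and `SP` are optional, many files will not include them (the second `@SQ` line above has neither), which is part of why build information needs to be tracked explicitly alongside the alignments.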
A multiple sequence alignment viewer for GMOD
GMOD does not presently have a web-based alignment viewer in its toolkit. However, SGN has a relatively mature web-based multiple alignment and tree browser. The tool is currently SGN-specific and supports only FASTA and Clustal alignment formats. An objective for the hackathon would be to extract this viewer from SGN's codebase to make it an independently installable component, and to add support for viewing SAM and BAM alignment formats. Adding a multiple alignment viewer to the GMOD suite of tools would give a significant boost to GMOD's comparative toolset.
Evolutionary visualization in JBrowse
The rising popularity and functionality of JBrowse make this software a good platform on which to develop and/or integrate multiple alignment viewing. Its current genome browser functionality should naturally extend to viewing multiple sequence alignments at both close and distant zoom levels. One possible outcome of the meeting could be to fold some of the SGN codebase into the JBrowse tool. Additionally, visualization of phylogenies while viewing alignments is a necessary improvement.
Specific project ideas:
3. High-grade alignment tracks with operations such as row re-ordering via JBrowse's drag-and-drop track ordering mechanism. The current alignment tracks are nearly complete and online at omgbrowse.org, and we also have a nice new API for rendering image tracks with one example (RNA secondary structure). This is a small but developer-friendly base on which we need to build more sophisticated glyphs and tracks.
4. As preparation for JBrowse-syn, a goal is to be able to have two parallel JBrowse instances running side by side and talking to each other, e.g. A can tell B what region A's looking at, and suggest a homologous region that B should look at. Some messaging infrastructure exists; we need code to register listeners and keep them informed.
My inclination would just be for everyone to pile on the tree widget...
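The listener idea in item 4 can be illustrated with a small publish/subscribe sketch. It is written in Python only to show the pattern; JBrowse itself is JavaScript, and none of the class names, topic names, or the homology lookup below come from JBrowse's actual messaging code. They are hypothetical.

```python
class MessageBus:
    """Minimal topic-based bus: listeners register once and are kept informed."""

    def __init__(self):
        self._listeners = {}

    def subscribe(self, topic, callback):
        self._listeners.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        # Iterate over a copy so callbacks may subscribe while we are publishing.
        for callback in list(self._listeners.get(topic, [])):
            callback(payload)


class BrowserInstance:
    """Stand-in for one browser view; 'homology' maps peer regions to local ones."""

    def __init__(self, name, bus, homology):
        self.name = name
        self.bus = bus
        self.homology = homology  # hypothetical lookup: peer region -> own region
        self.region = None
        bus.subscribe("region-changed", self._on_peer_region)

    def navigate(self, region, announce=True):
        self.region = region
        if announce:
            self.bus.publish(
                "region-changed", {"sender": self.name, "region": region}
            )

    def _on_peer_region(self, message):
        if message["sender"] == self.name:
            return  # ignore our own announcements
        suggestion = self.homology.get(message["region"])
        if suggestion is not None:
            # Follow the peer without re-announcing, to avoid a ping-pong loop.
            self.navigate(suggestion, announce=False)


bus = MessageBus()
browser_a = BrowserInstance("A", bus, homology={})
browser_b = BrowserInstance("B", bus, homology={"chr1:100-200": "chrX:500-600"})
browser_a.navigate("chr1:100-200")
```

After the last call, instance B has moved to its homologous region while A stays where the user put it. Suppressing the announcement on a followed move is one simple way to keep two linked views from triggering each other indefinitely.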
Population Diversity and Phenotype support
Population diversity in Chado
Phenotypic diversity data is very useful for evolutionary studies. In-depth analysis of this data requires proper representation, handling, and storage: specific phenotypes, environmental conditions, population details, and other experimental metadata all must be tracked, and more importantly cross-referenced with known genomic and genetic information. One of the best conceptual tools for representing this type of information in machine-readable form is ontologies, and GMOD's open-source Chado database schema is the most mature, flexible, and feature-rich storage engine for storing ontology-based data. However, it lacks specific support for evolutionary phenotype data or natural diversity data. Earlier this year, a working group was formed to work on the design of a new Natural Diversity module for Chado, based initially on the design of the GDPDM. One of the objectives for this hackathon will be to finalize and integrate the group's work into the larger Chado schema, and to make sure it integrates well with the existing or new Phenotype module (below).
Evolutionary phenotype data in Chado
Support for phenotype data in Chado needs to be rationalized, as it currently supports two distinct models (an older prototype, a more robust followup) that use overlapping sets of tables. Ideally, we will settle on one set of well defined tables to facilitate future work, as well as come up with migration plans for those using the old model. Included in this would be the ability to support both EAV (Entity-Attribute-Value, used in the "old" schema) and EQ (Entity-Quality), both of which can leverage PATO and other ontologies for phenotype term specificity. In particular, we will want to make sure that the Phenotype module is congruent with the Natural Diversity module (above), so that proper links are made between the recorded phenotypes, and the environments in which they are observed.
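To illustrate the difference between the two representations, here is a minimal sketch of EAV and EQ statements as plain data structures, together with a conversion helper of the kind a migration plan might need. It assumes that an attribute/value pair can be mapped onto a single quality term (for example, a term from PATO); the class names, the lookup table, and the term labels are hypothetical and do not reflect Chado's tables or real ontology identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EAVStatement:
    """Entity-Attribute-Value: the attribute and its value are separate terms."""
    entity: str
    attribute: str
    value: str


@dataclass(frozen=True)
class EQStatement:
    """Entity-Quality: a single quality term describes the entity."""
    entity: str
    quality: str


def eav_to_eq(statement, quality_lookup):
    """Convert an EAV statement to EQ using an (attribute, value) -> quality map.

    Raises KeyError when no mapping exists, so unmapped legacy annotations
    are surfaced for curation rather than silently dropped.
    """
    quality = quality_lookup[(statement.attribute, statement.value)]
    return EQStatement(entity=statement.entity, quality=quality)


# Hypothetical mapping; labels stand in for ontology term identifiers.
QUALITY_LOOKUP = {("color", "red"): "red"}

legacy = EAVStatement(entity="eye", attribute="color", value="red")
converted = eav_to_eq(legacy, QUALITY_LOOKUP)
```

A schema that supports both models would need to store either shape, while a migration path could use a curated mapping like `QUALITY_LOOKUP` to move existing EAV annotations to EQ.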
Connecting Annotation tools to Chado
Phenote and Phenex are two related annotation tools for capturing ontology-based phenotype descriptions. However, neither of these tools connects with Chado. At least one current GMOD user has written data adapters that load Phenote-generated annotations into Chado and retrieve them from it. These adapters may be a suitable foundation for a general-purpose program for this kind of data transfer. Together with an improved and standardized phenotype module in Chado, these tools would help unify the ability to capture, store, and access phenotype data.
Together, these objectives are particularly suited for a NESCent-supported hackathon. Phenote currently has limited financial support, and Phenex is outside of GMOD. Also, due to NESCent's involvement in the Phenoscape project, which records phenotype data with ontologies, we will have many of the users and developers at our disposal. And, since many of the users and developers work at different institutions, working face-to-face at the hackathon is an advantage over the struggle of communicating through mailing lists.
Web Interfaces to Evolutionary Data in Chado
The ANISEED project has an atlas/image-based web interface for phenotype, gene expression, and cell fate data. They are currently developing version 3 of this interface, called NISEED, that will be based on Chado for the first time. NISEED adds web-based query and curation interfaces to Chado. The ANISEED team is new to the GMOD community, and having an ANISEED representative attend would both add knowledge of these data types to the mix and help ANISEED integrate with the larger goals of GMOD.
Participants will split into subgroups at the event. The composition and tasks of the subgroups will be guided by the overall objectives, but will otherwise emerge and be self-determined by the participants both prior to and at the event.
After the hackathon, organizers and GMOD staff will follow up with participants to see tasks through to completion. Such follow-up is standard practice in GMOD, and we have been doing it consistently following GMOD Meetings since 2008.
The hackathon concentrates on writing code. All code and documentation will be made available immediately and freely to the community under an OSI-approved open source license.
- Nicole Washington, Chair
- Nicole is currently the Lead Data Manager for the Data Collection Center of the modENCODE project at the Lawrence Berkeley National Lab. She also has experience curating phenotype data with ontologies and was one of the developers of Phenote, an ontology-based phenotype annotation tool.
- Sheldon McKay
- Sheldon (University of Arizona) is the Scientific Lead for the iPlant Collaborative's iPlant Tree of Life project (iPToL) engagement team. Sheldon is also the lead developer for GBrowse_syn, GMOD's most widely used comparative genomics viewer. Sheldon was a participant in the Evolutionary Database Interoperability Hackathon at NESCent in March 2009 and was a co-organizer of and participant in the NESCent Phyloinformatics VoCamp in November 2009.
- Robert Buels
- Rob is the engineering team lead at the Sol Genomics Network (SGN), a clade-oriented database dedicated to the biology of the Solanaceae family, which includes many closely related and agronomically important species such as tomato, potato, tobacco, eggplant, pepper, and the ornamental Petunia hybrida. Rob's (and SGN's) interests include plant adaptation and diversification, and linking the phenome to the genome.
- Scott Cain
- Scott is the GMOD Program Coordinator. Scott organized a previous GMOD hackathon, held at the University of Chicago in 2007.
- Hilmar Lapp
- Hilmar is the Assistant Director of Informatics at NESCent, where he is responsible for implementing NESCent's goal of enabling data interoperability in evolutionary biology. He is also a veteran of many hackathons at NESCent and elsewhere.
Participant Invitation and Selection
Participation will be arranged by invitation and by self-nomination followed by review. If you are interested in participating, please contact one of the organizers. We are asking for support for ~20 participants, including the organizing committee, with a 50/50 split between invited and self-nominated attendees.
For each of the objectives above, the organizing committee has nominated 2-4 people who are experts in their respective areas to guide the hackathon in that area. While assigned to a particular objective, these key persons often have cross-cutting expertise and will be able to influence and contribute to multiple objectives. We believe they will make a significant contribution to the hackathon. A preliminary list is included in this proposal. We propose ~10 persons with a 70/30 split between GMOD and evolutionary community members, respectively.
Selected self-nominated Participants
We expect the rest of the attendees (~10 participants) to be self-nominated, with an emphasis on those from the evolutionary community. Nominations will be solicited on GMOD mailing lists, the GMOD web site, and prominent evolutionary biology mailing lists such as EvolDir. Organizers will also work with NESCent's Education and Outreach group to identify appropriate channels for announcements and to advertise the hackathon as part of that group's activities. People will be encouraged to self-nominate, as well as to nominate others. Depending on the timing of the hackathon, members of the organizing committee and key GMOD community members will solicit nominations at conferences they attend, such as Arthropod Genomics, Evolution, iEvoBio, BOSC, and ISMB.
Participants will be selected from the applicant pool by the organizing committee based on a number of criteria:
- Experience in bioinformatics programming
- The goal of the hackathon is to produce working code. Participants do not need to have a degree in computer science, but they must know how to program.
- Experience with and understanding of evolutionary data types
- Participants must also have some understanding of the domain the hackathon is working in. Participants do not need to have a degree in evolutionary biology, but they must have at least a basic understanding of current issues in evolutionary biology.
- Access to evolutionary data
- Applicants that have access to evolutionary datasets will be favored over those who do not. Having data in hand reveals concrete problems that can be addressed in the context of a week-long hackathon.
- Knowledge of GMOD
- Applicants with GMOD experience will be favored over those without it.
The first two criteria are required for participants. The remaining criteria will be used to build a balanced, complementary, and diverse set of participants.
Organizers will strive to include participants who work on a broad swath of the tree of life and in many different areas of evolutionary biology. We will also strive to nominate and include participants from groups historically under-represented in computational and evolutionary biology.