Difference between revisions of "GMOD Evo Hackathon Proposal"

From GMOD
(Organizers)
m (Text replace - "__NOTITLE__" to "")
 
(81 intermediate revisions by 6 users not shown)
<div style="text-align: center; margin-bottom: 2em; font-size: 200%">NESCent Hackathon on GMOD Tools for Evolutionary Biology</div>
+
<center>
__NOTITLE__
+
{| style="vertical-align: middle; border: 2px solid #A6A6BC" cellpadding="10"
The GMOD Evo Hackathon aims to bring together experts in evolutionary biology, software, and bioinformatics to design and implement enhancements for GMOD tools, improving their support for evolutionary biology.
+
| [[Image:EvoHackathonLogo.png|center|200px]]
 +
| <span style="font-size: 200%; line-height: 120%"><b>[[GMOD Evo Hackathon|Tools for Evolutionary Biology Hackathon]] Proposal</b><br />November 8-12, 2010<br />[http://nescent.org/ NESCent], Durham, North Carolina, USA</span>
 +
|}
 +
</center>
  
  
__TOC__
 
  
 +
<div class="emphasisbox">
 +
This proposal was submitted to, and then approved and [[GMOD Evo Hackathon#Sponsorship|funded]] by [http://nescent.org NESCent] in June 2010.  '''For up-to-date information on the hackathon, see the [[GMOD Evo Hackathon|hackathon's home page]].'''
 +
</div>
  
The GMOD project is a confederation of open-source projects developing software tools for storing, managing, curating, and publishing biological data. GMOD tools are used by many large and small biological databases, and increasingly by individual research labs, for the dissemination of the results of experimental research and curated knowledge.
+
The [[GMOD Evo Hackathon]] aims to bring together key developers of GMOD components with developers from the evolutionary biology community to fill critical gaps in GMOD for evolutionary research.
  
== Need ==
+
__TOC__
  
While GMOD's software tools provide a powerful and feature-rich basis for working with biological data, many GMOD tools still lack features needed to effectively support evolutionary biology.  GMOD's current strengths are with genomic data.  In the past GMOD has not emphasized areas that are traditionally important in evolutionary biology, such as phenotypes, phylogenetics, population genetics and natural diversity.  Historically, evolutionary researchers have not had access to genomics data because of cost issues, and thus GMOD has not had a large presence in that community.
+
== Synopsis ==
  
However, two trends are now bringing the interests of these communities together in ways that can benefit both.  First, high-throughput sequencing technologies have made large-scale, multi-specimen/multi-organism sequencing affordable, even for small labs. This trend has also vastly increased the volume and diversity of public genome and transcriptome data, creating tempting opportunities for evolutionary and comparative analysis.  GMOD provides important tools for working with this data: [[GBrowse]] and [[JBrowse]] for visualization, [[Chado]] for storage, indexing, and as a backend for analyses, [[MAKER]] for high-throughput standardized annotation, and [[Galaxy]] as a comparative genomics workbench.
+
We propose a hackathon to fill critical gaps in the capabilities of the Generic Model Organism Database (GMOD) toolbox that currently limit its utility for evolutionary research. Specifically, we aim to focus on tools for 1) viewing comparative genomics data; 2) visualizing phylogenomic data; and 3) supporting population diversity data and phenotype annotation.
  
Second, GMOD's existing user base is increasingly pursuing research in areas such as phenotypes and population biology. Evolutionary biologists have a great deal of experience working with these types of data.  Making GMOD more useful to evolutionary researchers and bringing those researchers into the GMOD community will also benefit GMOD's traditional user base, which is just starting to use these data types.
+
The event would be hosted at NESCent and bring together a group of about 20 software developers, end-user representatives, and documentation experts who would otherwise not meet. The participants would include key developers of GMOD components that currently lack features critical for emerging evolutionary biology research, developers of informatics tools in evolutionary research that lack GMOD integration, and informatics-savvy biologists who can represent end-user requirements.
  
== Why a NESCent Hackathon? ==
+
The event would hence provide a unique opportunity to infuse the community of GMOD developers with a heightened awareness of unmet needs in evolutionary biology that GMOD components have the potential to fill, and for tool developers in evolutionary biology to better understand how best to extend or integrate with already existing GMOD components.
  
A hackathon is a way to bring these communities together so that GMOD tools can be enhanced to:
+
== Background ==
# better serve the needs of evolutionary biologists for data types GMOD already handles well, and
+
# better support data types that evolutionary biologists have a longstanding interest in, but that are new to GMOD.
+
  
We are seeking NESCent's support and hosting for this event.  NESCent has a commitment to facilitating data sharing in evolutionary research and GMOD is a way to achieve that goal. GMOD is a widely adopted set of tools that facilitate data integration and sharing through common formats and interfaces.  Making GMOD more useful to evolutionary biologists would help the evolutionary community better share their data.
+
The GMOD project is a confederation of intercompatible open-source projects developing software tools for storing, managing, curating, and publishing biological data. Although the GMOD project originated from the goal of developing a generic tool set for common needs among model organism databases, GMOD tools are now used by many large and small, collaborative and single-investigator biological database projects for the dissemination of experimental results and curated knowledge.
  
NESCent is also an excellent place to hold a hackathon. NESCent has significant experience hosting hackathons, and holding the event there would allow us to take advantage of that experience.
+
GMOD's software tools provide a powerful and feature-rich basis for working with biological data, in particular genomic and other molecular data. However, due to GMOD's historical emphasis on single-genome projects, many GMOD tools still lack features that are critical to effectively support the comparative, phylogenetic, and natural diversity-oriented questions frequently asked in evolutionary research.
  
== Organizers ==
+
Recent developments have given rise to a window of opportunity for forging collaborations towards filling this gap. In particular, the cost of collecting comparative molecular data on a large or even genomic scale has recently dropped dramatically, primarily thanks to next-generation high-throughput sequencing technologies. This has enabled evolutionary researchers to bring genome-scale molecular data to bear on key evolutionary questions. It has also allowed single organism-focused molecular biology labs, who represent GMOD's traditional user base, to broaden out to multi-organism comparative approaches.  Bringing these two communities with increasingly shared interests and complementary scientific and technical expertise together offers an opportunity to start filling GMOD's gaps in these areas while building on its existing strengths. In addition, such direct interaction will heighten future awareness of needs of evolutionary researchers among GMOD developers who have so far mostly supported its traditional user base, and can in the long term increase the ranks of GMOD contributors from a field it was not originally designed to serve.
  
The organizing committee includes:
+
The hackathon meeting format is ideally suited to realize this opportunity. Its strengths lie in facilitating face-to-face interaction among people with complementary expertise, and collaborative work on tangible products that can form the basis of continued partnerships long beyond the end of the meeting. This meeting format and the overall goals of the event are closely aligned with NESCent's objectives in promoting collaborative work, data sharing, and interoperability. NESCent's past experience in organizing successful hackathons and its position as a neutral intellectual hub within the evolutionary biology community make it an ideal location for holding the event.
 
+
; Nicole Washington, Chair
+
: Nicole works at the Digital Curation Center for the modENCODE project where she ....  She also works on [[Phenote]], an ontology-based phenotype annotation tool.
+
; [[User:Mckays|Sheldon McKay]]
+
: Sheldon (University of Arizona) is the Scientific Lead for the [http://iplant.org The iPlant collaborative]'s [https://pods.iplantcollaborative.org/wiki/display/iptol/iPToL_Progress iPlant tree of life project] (iPToL) engagement team. Sheldon is also the lead developer for [[GBrowse_syn]], GMOD's most widely used comparative genomics viewer. Sheldon was a participant in the [https://www.nescent.org/wg/evoinfo/index.php?title=Database_Interop_Hackathon Evolutionary Database Interoperability Hackathon] at NESCent in March 2009 and was a co-organizer and participant in the [http://evoio.org/wiki/VoCamp1 NESCent Phyloinformatics VoCamp] in November 2009.
+
; Robert Buels
+
: Rob is the engineering team lead at the Sol Genomics Network (SGN), a clade-oriented database dedicated to the biology of the ''Solanaceae'' family, which includes a large number of closely related and many agronomically important species such as tomato, potato, tobacco, eggplant, pepper, and the ornamental ''Petunia hybrida''. Rob's (and SGN's) interests include plant adaptation and diversification, and linking the phenome to the genome.
+
; Scott Cain
+
: Scott is the GMOD Program Coordinator. Scott organized a previous GMOD hackathon, held at the University of Chicago in 2007.
+
; Hilmar Lapp
+
: Hilmar is the Assistant Director of Informatics at NESCent, where he is responsible for implementing NESCent's goal of enabling data interoperability in evolutionary biology.  He is also a veteran of many hackathons at NESCent and elsewhere.
+
 
+
== Participants ==
+
 
+
Participation will be arranged by invitation and by self-nomination followed by review. If you are interested in participating, please contact one of the organizers.
+
 
+
=== Nominations / Applications ===
+
 
+
We are asking for support for 20-25 participants, including the organizing committee.  Participants will be selected by the organizing committee from a list of nominated candidates.  Nominations will be solicited in several ways:
+
; Nominations from the organizing committee.
+
: People whom the organizing committee believes could make a significant contribution to the hackathon.  A preliminary list is included in this proposal.
+
; Nominations from the GMOD Community
+
: Nominations will be solicited on GMOD mailing lists and on the GMOD web site. People will be encouraged to self-nominate, as well as to nominate others.
+
; Nominations from the evolutionary community.
+
: We will solicit nominations on prominent evolutionary biology mailing lists such as EvolDir.  Organizers will also work with NESCent's Education and Outreach group to identify appropriate resources to send announcements to.
+
; Nominations from conference contacts
+
: This spans communities. Depending on the timing of the hackathon, members of the organizing committee and key GMOD community members will solicit nominations at conferences they attend.  Upcoming conferences include Arthropod Genomics, Evolution, iEvoBio, BOSC, and ISMB.  We will also work with NESCent's Education and Outreach group to advertise the hackathon as part of their activities.
+
 
+
=== Selection Criteria ===
+
 
+
Participants will be selected by the organizing committee based on a number of criteria:
+
; Experience in bioinformatics programming
+
: The goal of the hackathon is to produce working code.  Participants do not need to have a degree in computer science, but they must know how to program.
+
; Experience with and understanding of evolutionary data types
+
: Participants must also have some understanding of the domain the hackathon is working in.  Participants do not need to have a degree in evolutionary biology, but they must have at least a basic understanding of current issues in evolutionary biology.
+
; Access to evolutionary data
+
: Applicants that have access to evolutionary datasets will be favored over those who do not.  Having data in hand reveals concrete problems that can be addressed in the context of a week-long hackathon.
+
; Knowledge of GMOD
+
: Applicants with GMOD experience will be favored over those without it.
+
; Diversity
+
: Organizers will strive to include participants who work on a broad swath of the tree of life and in many different areas of evolutionary biology.  We will also strive to nominate and include participants from historically under-represented groups in evolutionary biology.
+
 
+
The first two criteria are required for participants.  The remaining three criteria will be used to build a balanced, complementary, and diverse set of applicants.
+
 
+
== Hackathon Mechanics ==
+
 
+
Participants will split into subgroups at the event. The composition and tasks of the subgroups will be guided by the overall objectives, but will otherwise emerge and be self-determined by the participants both prior to and at the event.
+
 
+
Organizers will focus participants on objectives for which we have data available, for which we have a very clear idea of the goal, and that we can complete or make significant progress on during the hackathon itself.  After the hackathon, organizers and GMOD staff will follow up with participants to see tasks through to completion.  Such follow-up is standard practice in GMOD, and we have been doing it consistently following [[Meetings|GMOD Meetings]] since 2008.
+
  
 
== Specific objectives ==
 
  
Organizers have identified the following broad objectives for guiding work at the event.  This is based on our own experience, and interaction with others in the GMOD and evolutionary biology communities.  These include insights gained by the recent ''Tools for Emerging Model Systems'' working group (EMS WG) at NESCent.  This group consisted of evolutionary biologists working on non-model organisms and struggling with how best to exploit their data and connect their communities.
+
Organizers have identified the following broad themes for focusing work at the event.  These are based on the organizers' experience, interactions with others in the GMOD and evolution communities, and insights gained by the recent [http://www.nescent.org/cal/calendar_detail.php?id=530 ''Tools for Emerging Model Systems'' working group (EMS WG)] at NESCent.
 
+
During the hackathon, participants will refine and distill these and other options into concrete implementation objectives.
+
 
+
The hackathon concentrates on writing code. All code and documentation will be made available immediately and freely to the community under an [http://www.opensource.org/licenses/alphabetical OSI-approved] open source license.
+
 
+
=== Better GMOD support for alignment metadata ===
+
 
+
{{SAMtoolsLink|Sequence Alignment Map (SAM)}} format has become the de facto standard format for representing short-read genome alignments, but it still has only limited support in GMOD tools. One objective of the proposed hackathon is to design and implement improvements to GMOD tools to give them excellent support for SAM data, ''particularly for cross-species alignments and views''. This includes extending Chado to properly store experimental and alignment metadata, which is vital for identifying source and target genome builds in comparative analysis. We expect the improved analysis and storage tools resulting from this work to make cross-species comparative analysis of large-scale datasets much more accessible.
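
As a rough illustration of the alignment metadata involved, the sketch below reads one <code>@SQ</code> (reference sequence) header line from a SAM file and collects its tags. In the SAM specification, <code>SN</code> and <code>LN</code> give the sequence name and length, and the optional <code>AS</code> and <code>SP</code> tags give the genome assembly identifier and species. The function name <code>parse_sq_header</code> and the sample values are hypothetical, and this is not how any existing GMOD loader works; it only shows what information a Chado extension would need to capture.

```python
def parse_sq_header(line: str) -> dict:
    """Parse one SAM '@SQ' header line into a {tag: value} dictionary.

    Header fields are tab-separated 'TAG:VALUE' pairs. The value may itself
    contain colons (for example a URI), so we split on the first colon only.
    """
    fields = line.rstrip("\n").split("\t")
    if not fields or fields[0] != "@SQ":
        raise ValueError("not an @SQ header line")
    tags = {}
    for field in fields[1:]:
        key, _, value = field.partition(":")
        tags[key] = value
    return tags


# Hypothetical header line; the names and values are made up for illustration.
example = "@SQ\tSN:chr1\tLN:1000\tAS:build1\tSP:Example species"
metadata = parse_sq_header(example)
# metadata["AS"] identifies the genome build the reads were aligned against.
```

Keeping the assembly identifier alongside each stored alignment is what would let a comparative viewer tell which source and target builds a given alignment refers to.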
+
 
+
''Note:'' Next generation sequencing (NGS) data in SAM format is referred to repeatedly in this proposal.  This is a function of both the widespread adoption of SAM as the standard way to represent this data, and of the importance of NGS data in evolutionary biology.  The EMS working group identified 'working with NGS data' as their number one concern.
+
 
+
=== [[GBrowse_syn]] compatibility with SAM data ===
+
 
+
The GBrowse_syn comparative genomics viewer does not currently support SAM data.  GBrowse_syn currently runs on the GBrowse 1.x platform.  It needs to be upgraded to the GBrowse 2.x platform before it can support SAM data.  We may want to also extend basic SAM functionality to show per-base information.
+
 
+
=== [[GBrowse_syn]] database backend scalability ===
+
 
+
Currently, each pairwise comparison between organisms is stored in a separate database to drive GBrowse_syn.  This quickly becomes unwieldy for large numbers of genomes.  A hackathon objective could be to address this scalability issue.
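
To make the scaling concern concrete, the following sketch counts how many databases would be needed under the assumption of one database per unordered pair of genomes, which grows as n(n-1)/2. The function names are hypothetical and are not part of GBrowse_syn.

```python
from itertools import combinations


def pairwise_database_count(n_genomes: int) -> int:
    """Number of unordered genome pairs: n * (n - 1) / 2."""
    if n_genomes < 0:
        raise ValueError("n_genomes must be non-negative")
    return n_genomes * (n_genomes - 1) // 2


def pairwise_comparisons(genomes):
    """List every unordered pair of genomes, one per would-be database."""
    return list(combinations(genomes, 2))


# Assumption: one database per unordered pair of genomes.
assert pairwise_database_count(3) == len(pairwise_comparisons(["a", "b", "c"]))
assert pairwise_database_count(10) == 45
assert pairwise_database_count(100) == 4950
```

Under this assumption, going from 10 to 100 genomes raises the number of databases from 45 to 4950, which is why a single shared backend is attractive.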
+
 
+
=== Whole-Genome Comparison Visualization ===
+
 
+
This functionality could be added ''de novo'' to [[GBrowse_syn]], or we could update GMOD tools to produce output that is compatible with external tools.  [[CMap]] has recently added an "export to Circos" functionality.  We currently don't have an easy way to do whole genome visualization for data, and we could address this during the hackathon.
+
 
+
=== Phylogenetics Visualization ===
+
 
+
[http://solgenomics.net SGN] has a nice [http://solgenomics.net/tools/align_viewer/ web-based multiple alignment and tree browser]; one possible implementation objective could be to extract it as a GMOD component.  Also see the Chado visualization/web front end objectives in this proposal.
+
 
+
=== Evolutionary phenotype data in Chado ===
+
 
+
This task could include several items.  Support for phenotype data in Chado needs to be rationalized, as it currently supports two distinct models (an older prototype, a more robust followup) that use overlapping sets of tables.  Ideally, we will settle on one set of well defined tables to facilitate future work, as well as come up with migration plans for those using the old model.  Included in this would be the ability to support both EAV (Entity-Attribute-Value, used in the "old" schema) and EQ (Entity-Quality), both of which can leverage PATO and other ontologies for phenotype term specificity.  In particular, we will want to make sure that the Phenotype module is congruent with the Natural Diversity module (below), so that proper links are made between the recorded phenotypes, and the environments in which they are observed.
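
The difference between the two annotation styles can be sketched with plain data classes. This is only an illustration of the shape of the statements, not Chado's actual table layout; the class names, the <code>describe</code> helper, and the <code>EX:</code> term identifiers are all hypothetical placeholders rather than real ontology accessions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """An ontology term. The identifiers used below are placeholders."""
    term_id: str
    label: str


@dataclass(frozen=True)
class EAVStatement:
    """Entity-Attribute-Value: the attribute and its value are separate terms."""
    entity: Term
    attribute: Term
    value: Term


@dataclass(frozen=True)
class EQStatement:
    """Entity-Quality: a single quality term describes the entity."""
    entity: Term
    quality: Term


def describe(statement) -> str:
    """Render either kind of statement as a short human-readable string."""
    if isinstance(statement, EAVStatement):
        return (f"{statement.entity.label}: "
                f"{statement.attribute.label} = {statement.value.label}")
    if isinstance(statement, EQStatement):
        return f"{statement.entity.label}: {statement.quality.label}"
    raise TypeError("unsupported statement type")


leaf = Term("EX:0001", "leaf")
eav = EAVStatement(leaf, Term("EX:0002", "color"), Term("EX:0003", "yellow"))
eq = EQStatement(leaf, Term("EX:0003", "yellow"))
```

A unified Phenotype module would need to hold both shapes, or define a migration from one to the other, so that existing EAV annotations remain usable.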
+
 
+
We could also add [[Phenote]] support.  At least one current GMOD user has written data adapters to load Phenote-generated annotations into Chado and retrieve them from it.  This may be a suitable foundation for a general-purpose program for this data transfer.  Similarly, Phenex is a tool for curating evolutionary character trait data across multiple species.  It is built on the same base code as Phenote, although it currently uses a different database backend for storage.  Chado could be enhanced or adapted to support this type of data as well.
+
 
+
=== Population diversity support for Chado and associated application connectivity ===
+
 
+
(a la {{GDPDMLink|GDPDM}})
+
 
+
Phenotypic diversity data is also very useful for evolutionary studies. In-depth analysis of this data requires proper representation, handling, and storage: specific phenotypes, environmental conditions, population details, and other experimental metadata all must be tracked, and more importantly cross-referenced with known genomic and genetic information. Developers at this hackathon will work to add '''...'''  One of the best conceptual tools for representing this type of information in machine-readable form is ontologies, and GMOD's open-source Chado database schema is the most mature, flexible, and feature-rich storage engine for storing ontology-based data. However, it lacks specific support for evolutionary phenotype data or natural diversity data. Earlier this year, a [[Chado Natural Diversity Module Working Group|working group]] was formed to work on the design of a new Natural Diversity module for Chado, and one of the objectives for this hackathon will be to finalize and integrate the group's work into the larger Chado schema, and to make sure it integrates well with the existing or new Phenotype module (discussed in the previous aim).
+
 
+
=== Evo-Devo Support ===
+
 
+
Add the ability to compare the developmental programs between organisms at different levels: anatomy, genomics, expression patterns, gene regulatory networks and their architecture, and phenotypes.  This objective, with the exception of gene regulatory networks, overlaps with several others.
+
<!-- [[NBrowse]] is a network browsing tool.  There have been discussions in the past about integrating NBrowse into GMOD.  Key personnel on the Reactome project are also key personnel on GMOD.  Adding genetic network support to GMOD could also be an objective.  -->
+
 
+
=== Tree / Graph Visualization ===
+
 
+
GMOD's Chado database schema already includes strong support for storing trees and graphs.  This capability has been in Chado since its beginning.  GMOD, however, lacks visualization support for tree and graph based data.  This includes phylogeny, gene orthology, lineage (anatomy and breeding), ontologies, and breeding data.  Several GMOD users have developed their own visualization tools for this type of data.  We could integrate one of those solutions, or an outside solution, for visualization.
+
 
+
=== Web Interfaces to Evolutionary Data in Chado ===
+
 
+
The ANISEED project has an atlas/image-based web interface for phenotype, gene expression, and cell fate data.  They are currently developing version 3 of this interface, called NISEED, that will be based on Chado for the first time.  The hackathon could enhance or help finish this integration.
+
 
+
[[Tripal]] is a [http://drupal.org Drupal]-based web interface to Chado databases. It supports interfaces for several popular data types, but does not currently support phylogenies, phenotypes, expression, or natural diversity data.  We could extend it to evolutionary data types as part of the hackathon.
+
 
+
=== Natural Diversity / Population Genetics / Multidimensional Data Visualization in a Genomic Context ===
+
 
+
The Barley1K project (Eyal Fridman group, The Hebrew University) is an example dataset that should be supportable by GMOD. They gathered a thousand wild samples of barley in a hierarchical mode of collection (51 sites, each including 5 microsites on different slopes or niches within the site). They also recorded many local environmental conditions and collected detailed phenotype data on a portion of this collection, including that of a diverse set of interspecific hybrids derived from a genetically well-defined core collection (by Illumina Golden Gate platform, the Barley OPA array [BOPA1]). The Natural Diversity module will allow us to store this type of data, including accumulated allelic variation obtained from the microsatellites (SSRs) and BOPA1 array, as well as from next generation sequencing of cDNA libraries. However, we lack tools to visualize such multi-dimensional data in a genomic context (e.g., in GBrowse, JBrowse, and GBrowse_syn), including genome-phenotype (phenome) associations. This could be solved either with specific new glyphs and plugins, or with generic interfaces to statistical, geolocation, or image-based visualization packages.
+
 
+
There is also work currently under way to extend [[GFF3]] to handle variant information.  Several existing GMOD tools will need to be modified to recognize this data.
+
  
=== Support for pangenomes and core genomes ===
+
Before and at the hackathon, the participants will refine and distill these and other options into concrete implementation targets.  The participants will develop criteria for prioritization, such as maturity of a target for implementation, availability of test data, and potential for completing or making significant progress towards the target during the hackathon. Further ideas and discussion topics can be found on the [[GMOD_Evo_Hackathon_Proposal_Supplemental_Information | Supplemental Information ]] page.
The concept of the pangenome and core genomes is becoming common in the analysis of bacterial genomes, but is more broadly applicable.  The pangenome is the union of all genes found in all strains of a species, while the core genome is the intersection of those sets.  In both cases, a gene or feature is a generalization of the instances of the feature in multiple genomes. The gene in a pangenome, like a gene in an inferred ancestor, does not have a physical location, but it may have one or more contextual locations in a syntenic block of sequence found in some or all of the strains.
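
The two definitions above map directly onto set operations. The sketch below assumes each strain is represented simply as a set of gene identifiers; the function names and the strain and gene names are hypothetical and chosen only for illustration.

```python
def pangenome(strain_genes: dict) -> set:
    """Union of the gene sets of all strains."""
    return set().union(*strain_genes.values())


def core_genome(strain_genes: dict) -> set:
    """Intersection of the gene sets of all strains (empty if no strains)."""
    gene_sets = [set(genes) for genes in strain_genes.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


# Hypothetical strains and gene identifiers, for illustration only.
strains = {
    "strain_A": {"gene1", "gene2", "gene3"},
    "strain_B": {"gene2", "gene3", "gene4"},
    "strain_C": {"gene2", "gene3"},
}

assert pangenome(strains) == {"gene1", "gene2", "gene3", "gene4"}
assert core_genome(strains) == {"gene2", "gene3"}
```

This treats a gene as a bare identifier. Deciding which features in different strains count as the same gene, and recording their contextual locations, is the harder part that GMOD schemas would need to represent.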
+
  
=== Support for annotation tools based on phylogenetic analysis, such as PAINT ===
+
=== Viewing tools for comparative genomics data ===
The RefGenome project of the GO consortium is working on [http://wiki.geneontology.org/index.php/PAINT PAINT], a system for doing inference of GO annotations based on the distribution of curated annotations within clades and outgroups.  GMOD tools and schemas need to be prepared to handle this kind of annotation. For example, ancestor nodes in PANTHER trees will have accessions; these will require versioning to deal with changes in the analysis as annotations to descendants and the addition/placement of descendants changes with the addition of new genomes or revision of the orthology analysis.
+
  
=== Evolutionary visualization in JBrowse ===
+
[[GBrowse_syn]] is a popular GMOD component for viewing comparative genomics data, particularly for viewing synteny between genomes.  It does not currently support the next-generation sequencing (NGS) data increasingly available for comparative genomics and emerging model systems. Support for NGS data was identified by the EMS working group as a high priority.
  
Visualization of phylogenies in JBrowse (http://jbrowse.org), and improved visualization of alignments.
+
In particular, GBrowse_syn lacks support for the [http://samtools.sourceforge.net/ Sequence Alignment Format (SAM)], its mechanism of storing genome comparisons does not scale beyond a few organisms,  and the means for tracking the necessary alignment metadata in Chado are insufficient.
  
=== Linking xrate (and other phylo-aware annotation tools) to JBrowse ===
+
In addition to filling those gaps, GBrowse_syn would also particularly stand to benefit from the event by gaining a more sustainable developer base.
  
The conservation track is a staple of the UCSC browser. Evofold predictions form another useful track. Tools like xrate (http://biowiki.org/XRATE) allow automation and generalization of these kinds of phylogenetic HMM or SCFG models. Development would focus on linking these into existing GMOD browsers (e.g. JBrowse).
+
=== Visualization of phylogenetic data and trees ===
  
== Discussion / Development Topics ==
+
The GMOD toolkit at present does not include web-based alignment viewers, nor can the increasingly popular [[JBrowse]] genome browser (the designated successor of [[GBrowse]]) display multiple sequence alignments. GMOD also lacks a phylogenetic tree widget.
  
This section contains early-stage ideas that merit discussion and serious consideration by the attendees of the hackathon, but are not yet developed enough for specific implementation objectives.
+
Implementing these from scratch would be far beyond a suitable hackathon target. However, [http://solgenomics.net SGN] has a relatively mature [http://solgenomics.net/tools/align_viewer/ web-based multiple alignment and tree browser] that could be extracted from SGN's codebase and transformed into a GMOD component, an add-on for JBrowse.  Current Java-based tree viewers (such as [http://www.phylosoft.org/atv/ Archaeopteryx] or [http://www.phylowidget.org PhyloWidget]) could be used as the basis for a JavaScript-based tree viewer (or an applet that can be controlled through JavaScript) that integrates with JBrowse.
  
=== Post-Reference Genome Tools ===
+
=== Population Diversity and Phenotype support ===
  
This is a great example of how evolutionary biology can help lead the rest of the GMOD community.
+
GMOD's support for managing phenotype and natural diversity data is scattered across partially redundant and outdated modules, does not support modern ontology-based entity-quality data, and lacks a web interface. The sophisticated phenotype annotation tools that do exist cannot interface with Chado, GMOD's central relational data model. Yet, phenotypic and genetic diversity data are central to many evolutionary research questions.
  
The concept of a reference genome has been an extremely valuable tool in model organisms. The importance of a reference genome is not diminishing, but the need for an additional framework is on the rise.
+
A [[Chado Natural Diversity Module Working Group| Natural Diversity Module initiative]] to address at least the deficiencies within Chado formed earlier this year. Several key developers (one of the original developers of the module, and the developer of Phenex, a phenotype curation tool) are already local to NESCent, so the hackathon provides a unique opportunity to review and refine the natural diversity data model face-to-face, and to integrate it with an updated and reconciled phenotype module. A recently reported prototype of a Chado data adapter for Phenote, GMOD's phenotype annotation tool, could be generalized to become the data persistence interface for such data.
  
To explain this, let's contrast evolutionary and developmental biologists. Developmental biologists embrace and strive for similarity as a means of controlling experimental conditions.  Inbred lines of organisms do not usually occur in nature, but are usually preferred for developmental biology work.  Developmental biology has anatomy ontologies and staging series based on stereotypic progression of anatomical development in single inbred lines. Developmental biologists strive to eliminate genetic and environmental diversity in order to create controlled experimental conditions.  The concept of a reference genome historically has fit very well into this paradigm.
+
Aside from the data model deficiencies, the [http://aniseed-ibdm.univ-mrs.fr/ ANISEED] project has started efforts to generalize its sophisticated atlas/image-based web interface for phenotype data, and to make it operate on top of Chado. The hackathon could harness this synergy to help this effort leap forward, which could ultimately provide GMOD with the currently missing web interface for such data.
  
In contrast, evolutionary biologists embrace and study genetic and environmental diversity.  Evolutionary biologists typically study populations rather than individual lines.  They characterize and analyze differences, rather than eliminate them.  For evolutionary biologists, a reference genome is much less of a central tool than it is for developmental biologists.
+
== Logistics and Participation ==
  
Second-generation sequencing now allows evolutionary biologists to exploit ''genomic'' data for populations or large numbers of individuals.  It also allows every other kind of biologist to do the same.  We currently have tools to show linkage disequilibrium, and genotype and allele frequencies, but these still typically show data in the context of a reference genome.
+
The event will tentatively be held at NESCent in Durham, North Carolina, from November 8-12, 2010.
  
By some estimates, three years from now many projects will have thousands of full genomes. In such an environment, does the concept of a reference genome still remain relevant? How should GMOD tools change, grow, and adapt?
+
Participation will be arranged by invitation and by self-nomination followed by review. If you are interested in participating, please contact one of the organizers. We expect to support about 20 participants, about half of whom will be invited and half self-nominated.
  
=== High-throughput Imaging / Phenotyping ===
+
The objective for direct invitations is to ensure that critical developers for each of the three themes are present. Self-nominations of participants will be solicited through a variety of channels, including GMOD mailing lists, the [http://www.gmod.org GMOD web site], and [http://life.biology.mcmaster.ca/~brian/evoldir.html EvolDir].  In addition, the organizers will announce Calls for Participation at conferences they attend, such as [http://www.k-state.edu/agc/symp2010 Arthropod Genomics], [http://www.evolutionsociety.org/SSE2010/ Evolution], [http://ievobio.org/ iEvoBio], [http://www.open-bio.org/wiki/BOSC_2010 BOSC], and [http://www.iscb.org/ismb2010 ISMB]. Additional targets may be identified by [http://www.nescent.org/eog/AboutEOG.php NESCent's Education and Outreach group].
  
Adoption of high-throughput imaging and phenotyping technologies is increasing.  What software exists for working with this type of data, and how should the GMOD community participate?
+
The organizing committee will select participants from the applicant pool to create a group with balanced, complementary, and diverse sets of expertise, background, and interests, using a number of criteria:
 +
* Experience in bioinformatics programming in general and GMOD in particular;
 +
* Experience with and understanding of evolutionary data types;
 +
* Potential to uniquely benefit from the event;
 +
* Complementarity of expertise and background;
 +
* Achieving critical mass for each of the themes; and
 +
* Availability during the event.
  
=== Third-generation Sequencing ===
+
A hackathon is a working meeting and concentrates on writing code. All code and documentation will be made available immediately and freely to the community under an [http://www.opensource.org/licenses/alphabetical OSI-approved] open source license.
  
What challenges will the GMOD community face in handling third-generation (single-molecule) sequencing data, and how can we prepare for them?  Second-generation sequencing technologies typically produce many short reads at very high coverage.  The high coverage is necessary to compensate for lower accuracy.  Challenges with second-generation data include 1) dealing with the huge amount of data that comes with the high coverage, 2) distinguishing read and amplification errors from signal, and 3) assembling short reads.
+
== Organization and Agenda ==
  
Third-generation technologies will share some features with second-generation technologies and differ in key ways.  First, they will continue to produce large volumes of data.  The nature of the data will change significantly, though. Third-generation technologies are expected to be significantly less error-prone, thus reducing the need for high coverage. This will also reduce cost and turnaround time.  While the average ''depth'' of the data will decrease, the ''width'' of the data will greatly increase. The technology will enable more samples to be sequenced, and at greater accuracy. The improved accuracy and longer read length will also make assembly easier.
+
The following people comprise the organizing committee:
 +
* [[User:NLWashington|Nicole Washington]], Chair (Lawrence Berkeley National Laboratory; [http://www.modencode.org modENCODE] and developer of [[Phenote]])
 +
* [[User:Mckays|Sheldon McKay]] (University of Arizona; [http://www.iplantcollaborative.org/ The iPlant collaborative] and developer of [[GBrowse_syn]])
 +
* [[User:RobertBuels|Robert Buels]] (Cornell University; [http://solgenomics.net/ Solanaceae Genomics Network (SGN)])
 +
* [[User:Scott|Scott Cain]] (Ontario Institute for Cancer Research; GMOD Program Coordinator)
 +
* [[User:Hlapp|Hilmar Lapp]] ([http://www.nescent.org NESCent])
 +
* [[User:Clements|Dave Clements]] ([http://www.nescent.org NESCent]; [[GMOD Help Desk]])
  
Areas of discussion include:
+
The actual agenda will be determined by the participants. At the event, participants will split into subgroups. The composition and tasks of the subgroups will be guided by the overall objectives, but will otherwise emerge and be self-determined by the participants both prior to and at the event.
  
* data modeling and storage
+
After the hackathon, organizers and GMOD staff will follow up with participants to help see unfinished tasks through to completion, as has been done in GMOD following the [[Meetings|GMOD Meetings]].
* graphical visualization
+
* online display and searching
+
  
 
[[Category:GMOD Evo Hackathon]]
 
[[Category:Meetings]]
 

Latest revision as of 18:53, 8 October 2012

Tools for Evolutionary Biology Hackathon Proposal
November 8-12, 2010
NESCent, Durham, North Carolina, USA


This proposal was submitted to, and then approved and funded by NESCent in June 2010. For up-to-date information on the hackathon, see the hackathon's home page.

The GMOD Evo Hackathon aims to bring together key developers of GMOD components with developers from the evolutionary biology community to fill critical gaps in GMOD for evolutionary research.

Synopsis

We propose a hackathon to fill critical gaps in the capabilities of the Generic Model Organism Database (GMOD) toolbox that currently limit its utility for evolutionary research. Specifically, we aim to focus on tools for 1) viewing comparative genomics data; 2) visualizing phylogenomic data; and 3) supporting population diversity data and phenotype annotation.

The event would be hosted at NESCent and bring together a group of about 20 software developers, end-user representatives, and documentation experts who would otherwise not meet. The participants would include key developers of GMOD components that currently lack features critical for emerging evolutionary biology research, developers of informatics tools in evolutionary research that lack GMOD integration, and informatics-savvy biologists who can represent end-user requirements.

The event would hence provide a unique opportunity to infuse the community of GMOD developers with a heightened awareness of unmet needs in evolutionary biology that GMOD components have the potential to fill, and for tool developers in evolutionary biology to better understand how best to extend or integrate with already existing GMOD components.

Background

The GMOD project is a confederation of intercompatible open-source projects developing software tools for storing, managing, curating, and publishing biological data. Although the GMOD project originated from the goal of developing a generic tool set for common needs among model organism databases, GMOD tools are now used by many biological database projects, large and small, collaborative and single-investigator, for the dissemination of experimental results and curated knowledge.

GMOD's software tools provide a powerful and feature-rich basis for working with biological data, in particular genomic and other molecular data. However, due to GMOD's historical emphasis on single-genome projects, many GMOD tools still lack features that are critical to effectively support the comparative, phylogenetic, and natural diversity-oriented questions frequently asked in evolutionary research.

Recent developments have given rise to a window of opportunity for forging collaborations towards filling this gap. In particular, the cost of collecting comparative molecular data on a large or even genomic scale has recently dropped dramatically, primarily thanks to next-generation high-throughput sequencing technologies. This has enabled evolutionary researchers to bring genome-scale molecular data to bear on key evolutionary questions. It has also allowed single organism-focused molecular biology labs, who represent GMOD's traditional user base, to broaden out to multi-organism comparative approaches. Bringing these two communities with increasingly shared interests and complementary scientific and technical expertise together offers an opportunity to start filling GMOD's gaps in these areas while building on its existing strengths. In addition, such direct interaction will heighten future awareness of needs of evolutionary researchers among GMOD developers who have so far mostly supported its traditional user base, and can in the long term increase the ranks of GMOD contributors from a field it was not originally designed to serve.

The hackathon meeting format is ideally suited to realize this opportunity. Its strengths lie in facilitating face-to-face interaction among people with complementary expertise, and collaborative work on tangible products that can form the basis of continued partnerships long beyond the end of the meeting. This meeting format, and the overall goals of the event, are closely aligned with NESCent's objectives in promoting collaborative work, data sharing and interoperability. NESCent's past experience in organizing successful hackathons, and its position as a neutral intellectual hub within the evolutionary biology community, make it an ideal location for holding the event.

Specific objectives

Organizers have identified the following broad themes for focusing work at the event. These are based on the organizers' experience, interactions with others in the GMOD and evolution communities, and insights gained by the recent Tools for Emerging Model Systems working group (EMS WG) at NESCent.

Before and at the hackathon, the participants will refine and distill these and other options into concrete implementation targets. The participants will develop criteria for prioritization, such as maturity of a target for implementation, availability of test data, and potential for completing or making significant progress towards the target during the hackathon. Further ideas and discussion topics can be found on the Supplemental Information page.

Viewing tools for comparative genomics data

GBrowse_syn is a popular GMOD component for viewing comparative genomics data, particularly for viewing synteny between genomes. It does not currently support the next-generation sequencing (NGS) data increasingly available for comparative genomics and emerging model systems. Support for NGS data was identified by the EMS working group as a high priority.

In particular, GBrowse_syn lacks support for the Sequence Alignment/Map (SAM) format, its mechanism of storing genome comparisons does not scale beyond a few organisms, and the means for tracking the necessary alignment metadata in Chado are insufficient.
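
To give a sense of what SAM support involves at the lowest level, the following is a minimal Python sketch of reading one SAM alignment line. It is an illustration only, not GBrowse_syn code: the names SamRecord, parse_sam_line, and is_unmapped are hypothetical, and the sketch assumes only the eleven mandatory tab-separated fields defined by the SAM specification, ignoring header lines and optional tags.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM alignment fields (hypothetical container)."""
    qname: str   # read (query) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII, Phred+33)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; optional tag fields are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]),
        int(fields[4]), fields[5], fields[6], int(fields[7]),
        int(fields[8]), fields[9], fields[10],
    )


def is_unmapped(record: SamRecord) -> bool:
    """Flag bit 0x4 marks a read that did not align to the reference."""
    return bool(record.flag & 0x4)
```

A viewer would still need to group such records by reference sequence and position before drawing them, which is where the scaling and metadata concerns noted above come in.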

In addition to filling those gaps, GBrowse_syn would also particularly stand to benefit from the event by gaining a more sustainable developer base.

Visualization of phylogenetic data and trees

The GMOD toolkit at present does not include web-based alignment viewers, nor can the increasingly popular JBrowse genome browser (the designated successor of GBrowse) display multiple sequence alignments. GMOD also lacks a phylogenetic tree widget.

Implementing these from scratch would be far beyond a suitable hackathon target. However, SGN has a relatively mature web-based multiple alignment and tree browser that could be extracted from SGN's codebase and transformed into a GMOD component, an add-on for JBrowse. Current Java-based tree viewers (such as Archaeopteryx or PhyloWidget) could be used as the basis for a JavaScript-based tree viewer (or an applet that can be controlled through JavaScript) that integrates with JBrowse.

Population Diversity and Phenotype support

GMOD's capabilities in managing phenotype and natural diversity data are scattered across partially redundant and outdated modules, do not support modern ontology-based entity-quality data, and lack a web interface. The sophisticated phenotype annotation tools that do exist cannot interface with Chado, GMOD's central relational data model. Yet, phenotypic and genetic diversity data are central to many evolutionary research questions.

A Natural Diversity Module initiative to address at least the deficiencies within Chado formed earlier this year. Several key developers (one of the original developers of the module, and the developer of Phenex, a phenotype curation tool) are already local to NESCent, and so the hackathon provides a unique opportunity to review and refine the natural diversity data model face-to-face, and to integrate it with an updated and reconciled phenotype module. A recently reported prototype of a Chado data adapter for Phenote, GMOD's phenotype annotation tool, could be generalized to become the data persistence interface for such data.

Aside from the data model deficiencies, the ANISEED project has started efforts to generalize its sophisticated atlas/image-based web interface for phenotype data, and to make it operate on top of Chado. The hackathon could harness this synergy to help this effort leap forward, which could ultimately provide GMOD with the currently missing web interface for such data.

Logistics and Participation

The event will tentatively be held at NESCent in Durham, North Carolina, from Nov 8-12, 2010.

Participation will be arranged by invitation and by self-nomination followed by review. If you are interested in participating, please contact one of the organizers. We expect to support about 20 participants, about half of whom will be invited and half self-nominated.

The objective for direct invitations is to ensure that critical developers for each of the three themes are present. Self-nominations of participants will be solicited through a variety of channels, including GMOD mailing lists, the GMOD web site, and EvolDir. In addition, the organizers will announce Calls for Participation at conferences they attend, such as Arthropod Genomics, Evolution, iEvoBio, BOSC, and ISMB. Additional targets may be identified by NESCent's Education and Outreach group.

The organizing committee will select participants from the applicant pool to create a group with balanced, complementary, and diverse sets of expertise, background, and interests, using a number of criteria:

  • Experience in bioinformatics programming in general and GMOD in particular;
  • Experience with and understanding of evolutionary data types;
  • Potential to uniquely benefit from the event;
  • Complementarity of expertise and background;
  • Achieving critical mass for each of the themes; and
  • Availability during the event.

A hackathon is a working meeting and concentrates on writing code. All code and documentation will be made available immediately and freely to the community under an OSI-approved open source license.

Organization and Agenda

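The following people comprise the organizing committee:

  • Nicole Washington, Chair (Lawrence Berkeley National Laboratory; modENCODE and developer of Phenote)
  • Sheldon McKay (University of Arizona; The iPlant collaborative and developer of GBrowse_syn)
  • Robert Buels (Cornell University; Solanaceae Genomics Network (SGN))
  • Scott Cain (Ontario Institute for Cancer Research; GMOD Program Coordinator)
  • Hilmar Lapp (NESCent)
  • Dave Clements (NESCent; GMOD Help Desk)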

The actual agenda will be determined by the participants. At the event, participants will split into subgroups. The composition and tasks of the subgroups will be guided by the overall objectives, but will otherwise emerge and be self-determined by the participants both prior to and at the event.

After the hackathon, organizers and GMOD staff will follow up with participants to help see unfinished tasks through to completion, as has been done in GMOD following the GMOD Meetings.