
GSOC Project Ideas 2016

There are plenty of challenging and interesting project ideas this year. These projects draw on a broad range of skills, technologies and domains, such as GUIs, database integration and algorithms.

Students are also encouraged to propose their own ideas related to our projects. If you have strong computer skills and an interest in biology or bioinformatics, you should definitely apply! Do not hesitate to propose your own project idea: some of the best applications we see come from students who go this route. As long as it is relevant to one of our projects, we will give it serious consideration. Creativity and self-motivation are great traits for open source programmers.

  • Project Idea Name
    • Brief explanation: Brief description of the idea, including any relevant links, etc.
    • Expected results: describe the outcome of the project idea.
    • Knowledge prerequisites: programming language(s) to be used, plus any other particular computer science skills needed.
    • Skill level: Basic, Medium or Advanced.
    • Mentors: name + contact details of the lead mentor, name + contact details of backup mentor.


Here is a list of the proposed project ideas for 2016:

  • Project Idea 1: Biological Graph Visualization
    • Brief explanation: Tripal (http://tripal.info) is an open-source suite of Drupal modules that allows a scientific research community to more easily set up and manage a data repository for genomic, genetic and related biological data. It provides data pages, data mining tools and visualizations. Tripal is used or in development by over 25 different genome database websites, and is developed by an international group. A Tripal module currently exists for importing, searching and visualizing graph data that models the "network" of interactions of various components of a biological system. However, the module is not complete and requires improvements to the visualizations. The goal of this project would be to complete the remaining work for this module so that it can be shared with others.
    • Expected results: Once completed, a Drupal module will be freely available for Tripal-based sites to use, providing graph visualizations for complex biological systems.
    • Knowledge prerequisites: PHP, Drupal, JavaScript, SQL.
    • Skill level: Medium
    • Mentors: Stephen Ficklin (spficklin@gmail.com)


  • Project Idea 2: GitHub-based revision control of synthetic chromosomes
    • Brief explanation: JBrowse (http://jbrowse.org) is a robust open-source genome visualization tool built around JavaScript and HTML5. It has gained wide acceptance among biologists and bioinformaticians, with thousands of active installations worldwide in genomic research. This project deals with a module that enables management of synthetic DNA sequences in JBrowse. Synthetic biologists design DNA sequences that differ from their analogous sequences in natural organisms. As with code, the differences can be incremental or radical, and can be visualized using “diff”-like tools. And, also as with code, good revision control is fundamentally important. We propose to build a module that enables biologists to manage synthetic sequences. This will contribute toward a broader effort to visualize the results of synthetic biology experiments in the computational design phase, after synthesis (via DNA re-sequencing), and to verify gene expression under various conditions (via RNA sequencing). We propose to store revisions of synthetic chromosome sequences in git (primarily GitHub) repositories. The project would include the development of plugin components on both the server and client sides. The new extensions would detect branch and tag updates on the GitHub repos and provide a means to select and retrieve the synthetic sequences from GitHub. The backend part of the plugin would provide a means to manage multiple synthetic sequences and manipulate associated JBrowse-based datasets. This is a challenging and exciting project at the interface of computational and synthetic biology. You’ll have lots of guidance developing cool science tools that will have a relevant impact on the scientific community.
    • Expected results: Your module will be widely used by a new breed of synthetic biologists.
    • Knowledge prerequisites: Candidates should have good experience with JavaScript, HTML5 & CSS3. Experience with REST, Node.js, Dojo, jQuery, or the GitHub API would be a plus (you’ll be learning it all). If you think biology is cool and would like to learn a lot more about it, that’s a plus too.
    • Skill level: Medium
    • Mentors: Eric Yao (ericiam@berkeley.edu), Lead JBrowse Developer
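
The branch and tag update detection described in Project Idea 2 could be approached by comparing two snapshots of a repository's refs. The following is only a minimal sketch under stated assumptions: the function name `diff_refs`, the snapshot format (a mapping from ref name to commit SHA) and the example values are all hypothetical, and how the snapshots are fetched from GitHub is left out.

```python
def diff_refs(old, new):
    """Compare two {ref_name: commit_sha} snapshots of a repository.

    Hypothetical helper for illustration only; it does not talk to GitHub.
    """
    # Refs present only in the newer snapshot (new branches or tags).
    added = {ref: new[ref] for ref in new if ref not in old}
    # Refs present only in the older snapshot (deleted branches or tags).
    removed = {ref: old[ref] for ref in old if ref not in new}
    # Refs present in both snapshots that now point at a different commit.
    moved = {
        ref: (old[ref], new[ref])
        for ref in new
        if ref in old and old[ref] != new[ref]
    }
    return {"added": added, "removed": removed, "moved": moved}


# Made-up snapshots: short strings stand in for real commit SHAs.
before = {"refs/heads/master": "a1", "refs/tags/v1": "b2", "refs/heads/old": "c3"}
after = {"refs/heads/master": "d4", "refs/tags/v1": "b2", "refs/tags/v2": "e5"}
changes = diff_refs(before, after)
```

A plugin could poll periodically, keep the previous snapshot, and offer the sequences behind any "added" or "moved" refs for retrieval.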


  • Project Idea 2a: Marking up a genome for edits using CRISPR/Cas9
    • Brief explanation: The idea is to develop a system for marking up a genome for edits. The current standard technology for making precise cuts in DNA is the CRISPR/Cas9 system, but it can only cut the DNA at certain positions, and some positions are more reliable than others; the plugin could provide visual feedback for this. This plugin would involve hooking up existing JBrowse views and controls to existing back-end server tools that work on the DNA.
    • Expected results: This module is also targeted at synthetic biologists.
    • Knowledge prerequisites: Candidates should have good experience with JavaScript, HTML5 & CSS3. Experience with REST, Node.js, Dojo, jQuery, or the GitHub API would be a plus.
    • Skill level: Medium
    • Mentors: Ian Holmes (ihh@berkeley.edu), PI of JBrowse


  • Project Idea 2b: Customizable themes and plugins for JBrowse
    • Brief explanation: The idea, motivated by similar extensions to web apps like WordPress, is to develop (a) a theme system for JBrowse so that users could switch out different CSS, image, and dijit themes; and (b) a more user-friendly plugin system (building on the existing plugin API) so that administrators could install a plugin over the web.
    • Expected results: WordPress-style plugins and themes.
    • Knowledge prerequisites: Candidates should have good experience with JavaScript, HTML5 & CSS3. Experience with REST, Node.js, Dojo, jQuery, or the GitHub API would be a plus.
    • Skill level: Medium
    • Mentors: Ian Holmes (ihh@berkeley.edu), PI of JBrowse


  • Project Idea 3: Lightweight chat plugin for the JBrowse genome browser
    • Brief explanation: Increasingly, genome scientists collaborate remotely from multiple sites: at genome centers in academic institutions, from biotech companies, from clinical labs and (with the advent of portable genome sequencing) from field sites. This project idea is to develop a lightweight messaging/chat plugin for JBrowse using OAuth2 and the Faye pub/sub framework. Users will be able to see who else is currently browsing the genome (provided that they have set themselves as visible), to see where they are browsing, and to send and receive messages. A possible extension is to post comments on the genome. The general idea here is to make genomes (and their constituent objects, e.g. gene annotations) into “social objects”. This is in keeping with our vision of JBrowse as not just a tool for genomics, but for social genomics. The availability of thousands of JBrowse instances which could readily incorporate the plugin offers the possibility of quick and deep adoption by the genomics community.
    • Expected results: This module will enable a new way for bioengineers to share and socialize genomic information.
    • Knowledge prerequisites: JavaScript/HTML5/CSS3. Node.js and Dojo a plus. Digging science a plus.
    • Skill level: Medium
    • Mentors: Ian Holmes (ihholmes@gmail.com), Principal Investigator and founder of JBrowse


  • Project Idea 4: Linking Galaxy with Google Drive
    • Brief explanation: The Galaxy application implements the notion of an Object Store, a pluggable file management interface that acts as a layer between Galaxy and any user dataset. This Object Store interface allows datasets to be ‘physically’ disconnected from a particular instance of Galaxy while the application can still access and interact with them. This opens the door to a variety of storage media on which the data is actually stored. Ultimately, this allows a user to associate self-provisioned external storage resources with their Galaxy account and move beyond the imposed quota or limitations of any given Galaxy instance. Thus far, an abstract hierarchical store, Amazon S3, iRODS, and various local disk object stores have been implemented. However, use of an Object Store within Galaxy is currently an application-wide setting rather than a per-user setting that would allow users to specify their own back-end storage medium. Additionally, linking with Google Drive is highly desirable, as it would allow users to leverage the Google Drive for Education program.
    • Expected results: Implement a Galaxy Object Store for Google Drive. Allow per-user specification of a back-end data store.
    • Knowledge prerequisites: Python programming (required); familiarity with Galaxy and/or object store APIs.
    • Skill level: Medium
    • Mentors: Enis Afgan (enis.afgan@jhu.edu)
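
To make the per-user Object Store idea in Project Idea 4 more concrete, here is a minimal sketch of a pluggable storage interface with a per-user router. It is not Galaxy's actual Object Store API: every class and method name below (`ObjectStore`, `InMemoryObjectStore`, `PerUserObjectStoreRouter`, `store_for`) is a hypothetical name chosen for illustration, and an in-memory dictionary stands in for a real back end such as Google Drive.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Hypothetical minimal storage interface (not Galaxy's real API)."""

    @abstractmethod
    def put(self, dataset_id: str, data: bytes) -> None:
        """Store the bytes of a dataset under its identifier."""

    @abstractmethod
    def get(self, dataset_id: str) -> bytes:
        """Return the bytes previously stored for a dataset."""

    @abstractmethod
    def exists(self, dataset_id: str) -> bool:
        """Report whether this store holds the dataset."""


class InMemoryObjectStore(ObjectStore):
    """Stand-in back end; a Google Drive store would implement the same methods."""

    def __init__(self, name: str):
        self.name = name
        self._blobs = {}

    def put(self, dataset_id, data):
        self._blobs[dataset_id] = data

    def get(self, dataset_id):
        return self._blobs[dataset_id]

    def exists(self, dataset_id):
        return dataset_id in self._blobs


class PerUserObjectStoreRouter:
    """Pick a store per user, falling back to the instance-wide default."""

    def __init__(self, default_store: ObjectStore):
        self.default_store = default_store
        self._user_stores = {}

    def register(self, user: str, store: ObjectStore) -> None:
        self._user_stores[user] = store

    def store_for(self, user: str) -> ObjectStore:
        return self._user_stores.get(user, self.default_store)


default = InMemoryObjectStore("instance-default")
router = PerUserObjectStoreRouter(default)
router.register("alice", InMemoryObjectStore("alice-drive"))

# alice's data goes to her own store; bob has none and uses the default.
router.store_for("alice").put("dataset-1", b"ACGT")
router.store_for("bob").put("dataset-2", b"TTGA")
```

The point of the design is that the calling code never needs to know which back end it is talking to; only the router decides.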


  • Project Idea 5: Work with the Dockstore Team and the GA4GH to Enable Cross Docker Repository Sharing
    • Brief explanation: The Dockstore project seeks to create a site where researchers can encapsulate their tools in Docker, a flexible and popular virtualization technology, and describe the tools using the Common Workflow Language and/or the Workflow Definition Language. The benefit is having a programmatic way to then create, share, and run bioinformatics tools. On its own this is cool since it allows scientists to make their tools portable from cloud to cloud, something we saw as key in petabyte-scale projects like PCAWG where the data simply can't be moved around. But an equally important goal is to create a community standard with the GA4GH so that many sites like Dockstore can be created that all share a common API. This project will focus on working with the GA4GH community, which is a huge collaboration between over 300 groups and companies worldwide, to create and implement API standards, ensure Dockstore supports them, and facilitate cross-indexing of tools across all sites that support the standard in order to share tools as seamlessly as possible.
    • Expected results: The API is "approved" by the GA4GH as an official standard, and Dockstore and other Docker repositories support the standard in order to facilitate exchange of tools.
    • Knowledge prerequisites: Dockstore is written in Java and uses AngularJS; experience with the former is important and the latter is nice to have. Ability to work with diverse people from a variety of organizations and companies required.
    • Skill level: Basic to Medium
    • Mentors: Brian O'Connor (Brian.OConnor@oicr.on.ca, OICR & UCSC), Denis Yuen (OICR), and the GA4GH Containers and Workflows interest group (https://groups.google.com/forum/#!forum/ga4gh-dwg-containers-workflows)


  • Project Idea 6: Galaxy Pages Overhaul
    • Brief explanation: Galaxy Pages are a way of communicating Galaxy analyses so that other researchers can easily view, reproduce, or extend an analysis. To build pages, researchers use a WYSIWYG editor to build HTML pages that may contain embedded Galaxy objects such as histories, datasets, workflows, and visualizations. Pages are a powerful concept but are underutilized, and we believe a substantial overhaul could increase their accessibility and usage. The current HTML-based pages contain a number of usability issues. The first step would be to address these and update the embedded WYMeditor to its latest stable version. The embedded HTML approach works well for less technical users, but advanced users would prefer alternatives such as Markdown or IPython Notebooks; extending the framework to allow these is one possibility for the project. Alternatively, extending pages with new features for collaborative editing would make them much more powerful as well.
    • Expected results: Improve Galaxy Pages by addressing existing bugs and switching to a pluggable back-end supporting Markdown.
    • Knowledge prerequisites: Python and JavaScript experience would be useful.
    • Skill level: Medium
    • Mentors: Dannon Baker (dannon.baker@gmail.com) and other core Galaxy developers


  • Project Idea 7: Galaxy Kubernetes Integration
    • Brief explanation: Galaxy supports running jobs in Docker containers on a single node. However, the size of biological datasets and the complexity of the questions being asked are constantly increasing, and this is leading to ever more complex analytics, meaning one container running on one node will become an increasingly problematic limitation. Kubernetes is an exciting project that provides facilities for coordination of many containers. Extending Galaxy and/or the Galaxy remote job submission application Pulsar to interface with Kubernetes would potentially allow Galaxy to run the more complicated multiple-node, multiple-container analysis steps that will be required for future large-scale biological data analysis.
    • Expected results: Implement the ability to annotate Galaxy tools with Kubernetes pods and orchestrate these jobs via Kubernetes orchestration, either in Galaxy directly or via Pulsar. Develop an example pod.
    • Knowledge prerequisites: Python programming, with experience in cloud and/or cluster computing, and with containers.
    • Skill level: Medium
    • Mentors: Dannon Baker (dannon.baker@gmail.com) and other core Galaxy developers


  • Project Idea 8: Galaxy and Bioconductor Tool Integration
    • Brief explanation: The RGalaxy Bioconductor package allows one to mark up R functions and then automatically create Galaxy tools from these functions. This is an exciting project that provides the dual benefits of allowing Galaxy users to easily leverage the complex and cutting-edge tools developed in R while allowing R authors to easily create and disseminate accessible web interfaces for their modules through Galaxy. It would be great to determine whether this idea could be taken further to make this process more seamless, lowering the barrier to entry. Specifically, the project could leverage the existing Bioconductor documentation to build these tools instead of requiring the functions to be built for Galaxy, or at least provide a higher-level intermediary to bridge existing functions more easily. Additionally, once tools are created, the publishing process could be automated as well for maximum impact. An example of a project that leverages existing structures in this way is gxargparse for Python.
    • Expected results: Develop a higher-level intermediary or allow direct translation of Bioconductor tools into Galaxy Tools. Port several Bioconductor tools to Galaxy, and publish them.
    • Knowledge prerequisites: Python programming and R familiarity would be helpful.
    • Skill level: Medium
    • Mentors: Nitesh Turaga (nturaga1@jhu.edu) and other core Galaxy developers


  • Project Idea 9: BLAST visualisation library for BioRuby and SequenceServer
    • Brief explanation: It is now trivial to generate large amounts of DNA sequence data; the challenge lies in making sense of the data. In many cases, researchers can gain important insights by comparing newly obtained sequences to previously known sequences using BLAST (>100,000 citations). This project will focus on creating a reusable visualisation library for BLAST results and integrating it with SequenceServer (http://sequenceserver.com), a popular BLAST server. The student will study techniques to visualise BLAST results as a dot-plot, Circos visualisation, JBrowse track, and visualisations implemented in GeneValidator (a former GSoC project - http://wurmlab.github.io/tools/genevalidator/) and potentially others; decide on an architecture for implementing these (or a subset) as a well-documented and modular BioRuby plugin; and integrate the library into SequenceServer.
    • Expected results: A reusable library for visualising BLAST results. Improved & flexible BLAST visualisation for SequenceServer.
    • Knowledge prerequisites: Working knowledge of Ruby and JavaScript (we use jQuery, React, and SystemJS); students can pick up the rest as they go along.
    • Skill level: Medium
    • Mentors: Anurag Priyam (anurag08priyam@gmail.com), Yannick Wurm (y.wurm@qmul.ac.uk)


  • Project Idea 10: Use Galaxy to run Reactome analysis and processes on genomic data
    • Brief explanation: Reactome is a free, open-source, curated and peer reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modeling, systems biology and education. Galaxy is an open, web-based platform for data intensive biomedical research, which allows users to perform, reproduce, and share complete analyses.
    • Expected results: There are two potential sub-projects. 1) Adding Reactome as a data resource in Galaxy, to enable Galaxy users to use Reactome reaction and pathway annotation data, and 2) Performing identifier mapping and over-representation analysis workflows from Reactome in Galaxy. Reactome Github: https://github.com/reactome/
    • Knowledge prerequisites: Galaxy, Java, web services
    • Skill level: Medium
    • Mentor: Joel Weiser (joel.weiser@oicr.on.ca)


  • Project Idea 11: Reactome Diagrams WebGL
    • Brief explanation: Implementing WebGL support in the renderer layer of Reactome's new DiagramViewer (https://github.com/reactome-pwp/diagram) using the Parallax project (http://parallax3d.org/) or similar.
    • Expected results: Faster rendering, nicer and smoother transitions, the ability to overlay more data at any zoom level, use of textures to make pathway elements more realistic, and multi-platform WebGL support.
    • Knowledge prerequisites: Java, GWT, Git, Maven, HTML5 Canvas, WebGL.
    • Skill level: Medium-Advanced.
    • Mentors: Antonio Fabregat (fabregat@ebi.ac.uk) (lead mentor), Kostas Sidiropoulos (ksidiro@ebi.ac.uk) (backup mentor)


  • Project Idea 12: A personal persistent genome browser
    • Brief explanation: JBrowse (http://jbrowse.org) is a genome browser allowing users to browse specific genomes and visualize qualitative and quantitative data. The client-server architecture of JBrowse offers several advantages over other genome browser solutions: (i) the system is fully compatible with a wide spectrum of data types, including sequence files (fasta format), genomic feature files (gff), alignment files (bam) and quantitative data files (bedGraph, wig, bigWig); (ii) genome browsing is rapid even when multiple users are processing data simultaneously; and (iii) JBrowse provides a user-friendly and highly flexible graphical interface. Our project aims to enhance the JBrowse experience by giving users the ability to easily deploy their own persistent JBrowse session, customize it quickly and share this instance with the whole community. Using Docker technology, users will be able to select a configured JBrowse instance image for a specific genome and deploy it seamlessly for the current session. The session ends when the web page is closed. All user configurations, customizations and tracks will be saved into a database used by all JBrowse images to display features.
    • Expected results: The expected result is a website offering user sessions (using an authentication login protocol) with several images of JBrowse instances, configured for reference genomes.
    • Knowledge prerequisites: Javascript, HTML5, CSS3, REST, Node.js, Dojo, database management system and jQuery.
    • Skill level: Medium
    • Mentors: François Moreews (francois.moreews@irisa.fr), Thomas Darde (thomas.darde@inria.fr)


  • Project Idea 13: Using Go-Docker to enhance Galaxy functionalities
    • Brief explanation: Go-Docker is a cluster management tool that uses Docker as its execution/isolation system. It can be seen as similar to Sun Grid Engine/Torque/… Using a bioinformatics-dedicated Docker hub such as BioShaDock, it allows life scientists to use powerful tools in an easy and secure way while finely managing infrastructure resources. Galaxy is a user-friendly web platform originally dedicated to life sciences data analysis. Although Galaxy provides amazing functionality to enhance the transparency, accessibility and reproducibility of data analysis processes, there are still opportunities to provide more reproducibility and flexibility in the way this web platform is used. This project idea is to integrate the Go-Docker system with Galaxy, and we have so far identified at least two different ways to do so. First, building on a first proof of concept presented at the 2015 French Galaxy day conference (http://goo.gl/zoq7w3), develop a Go-Docker Galaxy tool that will make it possible to create and use new on-demand “tools” in a Galaxy instance without being an admin and without delving into the system internals. Second, as Go-Docker can be used like any other scheduler (e.g. SGE, Torque …), we want to configure Galaxy to use it for dispatching jobs.
    • Expected results: This integration will enable a new way for life scientists and bioinformaticians to interact with Galaxy instances.
    • Knowledge prerequisites: Python, Docker, Galaxy. Interest in life science and bioinformatics a plus.
    • Skill level: Medium
    • Mentors: Olivier Sallou (olivier.sallou@irisa.fr, Go-Docker developer), Yvan Le Bras (yvan.le_bras@irisa.fr), GenOuest core facility


  • Project Idea 14: Interactive visualisation framework for genome mutations in gene and protein networks
    • Brief explanation: Genomes are the source code for creating humans and other organisms. Small changes in genomes called single nucleotide variants (SNVs) make individuals different and sometimes cause disease and cancer. Some SNVs have specific roles in gene and protein networks because they cause "network rewiring" by removing existing connections and creating new connections between genes. Our research has discovered thousands of these SNVs among healthy humans, in cancer samples, and in inherited diseases. We are developing a database and visualisation platform for scientists to interactively explore and analyse these SNVs in human genomes. The goal of this GSoC project is to develop an interactive visualisation platform to better understand biological interaction networks with hundreds of thousands of genome changes. We will build on existing JavaScript resources such as D3.js and Cytoscape.js and use RESTful services to retrieve data for visualisation.
    • Expected results: A web-based, open-source and interactive platform for visualising and exploring genome variants in networks
    • Knowledge prerequisites: web development (JavaScript, CSS, HTML), programming in Python or similar, RESTful web services, understanding of genetics and/or biology is a plus
    • Skill level: Medium
    • Mentors: Jüri Reimand (juri.reimand@utoronto.ca)


  • Project Idea 15: Relational database for genome mutations in gene and protein networks
    • Brief explanation: Genomes are the source code for creating humans and other organisms. Small changes in genomes called single nucleotide variants (SNVs) make individuals different and sometimes cause disease and cancer. Some SNVs have specific roles in gene and protein networks because they cause "network rewiring" by removing existing connections and creating new connections between genes. Our research has discovered thousands of these SNVs among healthy humans, in cancer samples, and in inherited diseases. We are developing a database and visualisation platform for scientists to interactively explore and analyse these SNVs in human genomes. The goal of this GSoC project is to develop a relational database to better understand biological interaction networks with hundreds of thousands of genome changes. The database will provide a REST interface for the front-end.
    • Expected results: An open-source database of genome variants in molecular interaction networks.
    • Knowledge prerequisites: relational databases and SQL, programming language such as Python, RESTful web services, understanding of genetics and/or biology is a plus
    • Skill level: Medium
    • Mentors: Jüri Reimand (juri.reimand@utoronto.ca)


  • Project Idea 16: Interactive pipeline for retrieving, processing and visualising vast public gene expression datasets
    • Brief explanation: Recent biotechnology allows us to simultaneously measure the activation of all genes in a tissue of interest. Consequently, thousands of datasets, millions of measurements, and terabytes of data have been created by scientists and are now stored in public databases such as ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) and GEO (http://www.ncbi.nlm.nih.gov/geo/). Most of the data remains unused after initial deposition; however, it likely hides undiscovered knowledge of genetics, biology, and diseases such as cancer. This knowledge can only be revealed when all data are analysed in a unified model. To enable such analysis, this GSoC project will create a computational toolkit to retrieve these big datasets, perform automated pre-processing and quality control with established approaches, and create summary reports and network visualisations to enable interactive exploratory analysis.
    • Expected results: An open-source software package in R and Shiny for large-scale processing and visualisation of gene expression data.
    • Knowledge prerequisites: programming in R, statistics, understanding of genetics and/or biology is a plus
    • Skill level: Medium
    • Mentors: Jüri Reimand (juri.reimand@utoronto.ca)


  • Project Idea 17: Statistical framework to analyse gene function in multivariate cancer datasets
    • Brief explanation: Genes and proteins never act alone but instead carry out processes in cells by forming groups and interaction networks. We can use statistics to understand how these processes are altered in diseases such as cancer. Cancer researchers use multiple complementary technologies to measure all gene activities in cancer cells at the same time; however, these technologies are subject to biases and errors and are thus better modelled jointly. The goal of this project is to implement a multivariate generalised linear regression model to understand which genes and biological processes are altered in a large collection of cancer patients.
    • Expected results: An open-source software package in R for statistical modeling of multiple genomics datasets with information on gene function.
    • Knowledge prerequisites: programming in R, strong statistical skills, understanding of genetics and/or biology is an important asset
    • Skill level: Advanced
    • Mentors: Jüri Reimand (juri.reimand@utoronto.ca)


  • Project Idea 18: Add support for Google Compute Cloud to CloudBridge
    • Brief explanation: With clouds becoming the standard platform for deploying applications, it is more important than ever to be able to seamlessly utilize resources and services from multiple providers. Proprietary vendor APIs make this challenging and lead to conditional code being written to accommodate the API differences. We have been working on an open source Python library called CloudBridge (http://cloudbridge.readthedocs.org/en/latest/) that provides a simple, uniform, and extensible API for multiple clouds. CloudBridge currently supports AWS and OpenStack clouds. This project would add support for the Google Compute Cloud (GCE).
    • Expected results: A GCE provider implementation in CloudBridge.
    • Knowledge prerequisites: Python programming, cloud computing concepts
    • Skill level: Medium
    • Mentors: Enis Afgan (enis.afgan@jhu.edu), Nuwan Goonasekera (nuwan.goonasekera@gmail.com)
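
The "uniform API over multiple clouds" idea in Project Idea 18 can be illustrated with a small provider abstraction. This is a sketch only and does not use or reproduce the real CloudBridge API: the names `CloudProvider`, `launch_instance`, `get_provider` and the three fake provider classes are hypothetical, and the providers return made-up identifiers instead of calling any cloud service.

```python
from abc import ABC, abstractmethod


class CloudProvider(ABC):
    """Hypothetical uniform interface; not the real CloudBridge API."""

    @abstractmethod
    def launch_instance(self, name: str) -> str:
        """Start a virtual machine and return a provider-specific identifier."""


class FakeAWSProvider(CloudProvider):
    def launch_instance(self, name):
        # A real provider would call the vendor SDK here.
        return f"aws:i-{name}"


class FakeOpenStackProvider(CloudProvider):
    def launch_instance(self, name):
        return f"openstack:{name}"


class FakeGCEProvider(CloudProvider):
    """The kind of class this project would add, behind the same interface."""

    def launch_instance(self, name):
        return f"gce:{name}"


# Registry mapping a provider key to its implementation class.
PROVIDERS = {
    "aws": FakeAWSProvider,
    "openstack": FakeOpenStackProvider,
    "gce": FakeGCEProvider,
}


def get_provider(kind: str) -> CloudProvider:
    """Instantiate a provider by key, rejecting unknown keys explicitly."""
    try:
        return PROVIDERS[kind]()
    except KeyError:
        raise ValueError(f"unsupported cloud provider: {kind}") from None


def launch_everywhere(kinds, name):
    """Application code stays identical regardless of the cloud in use."""
    return [get_provider(kind).launch_instance(name) for kind in kinds]


instance_ids = launch_everywhere(["aws", "gce"], "demo")
```

Because callers depend only on the shared interface, adding a new cloud means adding one class and one registry entry rather than new conditional branches in application code.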