This is the new server for GMOD.org. Please let us know if you notice anything weird while it's getting broken in.

GSOC Project Ideas 2017

From GMOD
Jump to: navigation, search

There are plenty of challenging and interesting project ideas this year. These projects include a broad set of skills, technologies, and domains, such as GUIs, database integration and algorithms.

Students are also encouraged to propose their own ideas related to our projects. If you have strong computer skills and have an interest in biology or bioinformatics, you should definitely apply! Do not hesitate to propose your own project idea: some of the best applications we see are by students that go this route. As long as it is relevant to one of our projects, we will give it serious consideration. Creativity and self-motivation are great traits for open source programmers.

  • Project Idea Name (Project Name/Lab Name)
    • Brief explanation: Brief description of the idea, including any relevant links, etc.
    • Expected results: describe the outcome of the project idea.
    • Knowledge prerequisites: programming language(s) to be used, plus any other particular computer science skills needed.
    • Skill level: Basic, Medium or Advanced.
    • Mentors: name + contact details of the lead mentor, name + contact details of backup mentor.


Here is a list of the proposed project ideas for 2017:

  • 1. Project Publication Reference Tracking (Galaxy/Reactome)
    • Brief explanation: Open source projects need ways to demonstrate relevance and viability to funders, users, and developers. One way to do that is to track publications that use and/or reference a project's products. This is typically done through setting up email alerts or RSS feeds from sources (Google Scholar, Web of Science, ScienceDirect, ...) This effort would create software that helps projects track publications that reference them.
    • Expected results: The software would integrate notifications from many sources into a coherent list of publications, report which ones are not yet known, and provide support for adding new ones to online reference managers such as CiteULike and Mendeley. The software would be extensible to make it easy to add support for new sources of publications and to support many online references managers. The software would be usable by any project to create and maintain publication lists.
    • Knowledge prerequisites: Python or Java experience is preferred, as those are the languages of choice of the two mentor projects.
    • Skill level: Basic
    • Mentors: Dave Clements, Galaxy Project, Johns Hopkins University, clements@galaxyproject.org, Robin Haw, Reactome, Ontario Institute for Cancer Research.


  • Project Idea 2: Reactome Diagrams WebGL (Reactome)
    • Brief explanation: Implementing WebGL support in the renderer layer of Reactome's new DiagramViewer (https://github.com/reactome-pwp/diagram) using the Parallax project (http://parallax3d.org/) or similar.
    • Expected results: Faster renderings, nicer and smoother transitions, overlay more data in any zoom level, use of textures to make pathway elements more realistic and ending up having a multi-platform WebGL support.
    • Knowledge prerequisites: Java, GWT, GIT, MAVEN, HTML5 Canvas, WebGL.
    • Skill level: Medium-Advanced.
    • Mentor: Antonio Fabregat (fabregat@ebi.ac.uk) (lead mentor), Kostas Sidiropoulos (ksidiro@ebi.ac.uk) (backup mentor)


  • Project Idea 3: iOS InterMine App (InterMine)
    • Brief explanation: InterMine already has an Android application that allow users to search for genes across most of the 29 public InterMine instances, with a well documented API (http://iodocs.apps.intermine.org/). We’d love to see this reflected in an iOS application, designed using HTML5 or native technologies. As a minimum we'd like to see the Android app features replicated whilst querying a single re-badgeable InterMine. A great stretch goal would be to query multiple mines simultaneously.
    • Knowledge prerequisites:
      • iOS app development, whether native or HTML5 based.
      • Understanding of working with REST APIs.
      • Biology knowledge an advantage, but not required.
      • Git or other version control
    • Mentors:
      • Yo yo@intermine.org
      • Josh josh@intermine.org
    • Expected results: an iOS application with functionality similar to https://play.google.com/store/apps/details?id=org.intermine.app that is ready to be submitted to the Apple store.
    • Skill level: Medium.
    • Guidance for applying: https://intermineorg.wordpress.com/2017/02/28/google-summer-of-code-at-intermine/


  • Project Idea 4: Similarity project (InterMine)
    • Brief explanation: InterMine is a large graph (matrix) of entities with relationships and this holds potentially valuable data. For instance, entities that share a large number of neighbours in a graph might be biologically similar. Entities in which many other entities pass through it might be biologically important. Precalculating this information and serving it via a web service would greatly enhance InterMine’s discovery potential.
    • Knowledge prerequisites:
      • Development experience; most languages ok
      • Math skills (matrix theory?)
      • Some database experience (ability to query using SQL)
      • Biology knowledge an advantage, but not required.
      • Git or other version control
    • Expected results: A script or program that build statistics about the relationships of objects in our database.
    • Mentors:
      • Josh josh@intermine.org
      • Yo yo@intermine.org
    • Skill level: Medium.
    • Guidance for applying: https://intermineorg.wordpress.com/2017/02/28/google-summer-of-code-at-intermine/


  • Project Idea 5: ElasticSearch and InterMine: (InterMine)
    • Brief explanation: (Project no longer available).


  • Project Idea 6: Create a set of exciting bioinformatics R demos using the InterMineR package (InterMine)
    • Brief explanation: Fan of R and biology? InterMine has recently created an R package (https://github.com/intermine/interminer) to take advantage of InterMine’s biological data warehouse web services, but we could use someone who is familiar with R and biology to create and document/blog some interesting code examples based on use cases we provide, with a focus on well-explained code and thorough documentation. A stretch goal would be to extend the core InterMineR package to provide additional services.
    • Knowledge prerequisites:
      • Knowledge of the R programming language
      • Proven writing / documentation or blogging skills
      • Understanding of biology / bioinformatics a significant advantage
    • Expected results: 3-10 help articles with well documented code samples demonstrating the use of the InterMineR R package.
    • Mentors:
      • Rachel rachel@intermine.org
      • Julie julie@intermine.org
    • Skill level: Easy
    • Guidance for applying: https://intermineorg.wordpress.com/2017/02/28/google-summer-of-code-at-intermine/


  • Project Idea 7: InterMine Registry (InterMine)
    • Brief explanation: Currently there are 29 different instances of InterMine, a bioinformatics data warehouse, available on the web (see footer of http://intermine.org/). We’d like to create a registry for all public instances - essentially an API that exposes the names, URLS, datatypes, and other useful information regarding an InterMine instance. This could be served from a manually curated list of InterMines, but a stretch goal might be to also include API methods to create and administer registry entries.
    • Knowledge prerequisites:
      • Good understanding of RESTful APIs, how they work, and how to implement one in a language of your choice.
      • No biology skills needed.
      • Git or other version control
    • Expected results: A read-only API that provides basic information about existing InterMine instances around the world.
    • Mentors
      • Daniela daniela@intermine.org
    • Guidance for applying: https://intermineorg.wordpress.com/2017/02/28/google-summer-of-code-at-intermine/


  • Project Idea 8: Query Visualiser: (InterMine)
    • Brief explanation: InterMine’s biological data warehouse has an extensible XML data model designed to be heavily queryable. Similar to SQL, most queries have a combination of views and constraints. This can probably be visualised as an interactive network graph and should offer some good opportunities for creative data visualisation, and would complement InterMine’s existing Query Builder (user documentation for the query builder is here: http://flymine.readthedocs.io/en/latest/query-builder/Documentationquerybuilder.html)
    • Knowledge prerequisites:
      • Good understanding of RESTful APIs
      • Client-side dev skills (JS or a language which compiles to JS).
      • No biology skills needed, but advantageous.
      • Git or other version control
    • Expected results: An interactive web-based data visualisation tool to visualise simple InterMine queries.
    • Mentors
      • Yo yo@intermine.org
      • Josh josh@intermine.org
    • Guidance for applying: https://intermineorg.wordpress.com/2017/02/28/google-summer-of-code-at-intermine/


  • Project Idea 9: Prototype a new RESTFul API querying Neo4j database (InterMine)
    • Brief explanation: InterMine is a biological data warehouse, currently running with a PostgreSQL database. We'd like to prototype InterMine with Neo4j replacing PostgreSQL. The task: given an instance of Neo4j loaded with testmodel data, implement a new RESTful API which receives in input a query in Path-Query XML format and returns the result using Neo4j Java API or the traversal framework
    • Knowledge prerequisites:
      • Good understanding of Java
      • Good understanding of RESTful API
      • Basic understanding of graph databases
      • Git or other version control
      • No biology skills needed
    • Expected results:
      • Verify and eventually adapt the xml model to represent the relationships and their properties in Neo4j
      • Prototype a parser which reads from a simple data source and uploads Neo4j
      • Prototype a new RESTful API which returns a query result using Neo4j Java API or the traversal framework
    • Mentors
      • Daniela daniela@intermine.org
    • Guidance for applying: https://intermineorg.wordpress.com/2017/02/28/google-summer-of-code-at-intermine/


  • Project Idea 10: Use Galaxy to run Reactome analysis and processes on genomic data (Reactome)
    • Brief explanation: Reactome is a free, open-source, curated and peer reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modeling, systems biology and education. Galaxy is an open, web-based platform for data intensive biomedical research, which allows users to perform, reproduce, and share complete analyses.
    • Expected results: There are two potential sub-projects. 1) Adding Reactome as a data resource in Galaxy, to enable Galaxy users to use Reactome reaction and pathway annotation data, and 2) Performing identifier mapping and over-representation analysis workflows from Reactome in Galaxy. Reactome Github: https://github.com/reactome/
    • Knowledge prerequisites: Galaxy, Java, web services
    • Skill level: Medium
    • Mentor: Joel Weiser (joel.weiser@oicr.on.ca)


  • Project Idea 11: Stand-alone Reactome server in a Docker image (Reactome)
    • Brief explanation: Reactome is a free, open-source, curated and peer reviewed pathway database. The goal of this project is to produce a Docker image that contains everything that is needed for a user to run Reactome on their own workstation. This includes the web applications, databases, scripts, and other supporting infrastructure components that make up Reactome.
    • Expected results: A Docker image that can be pulled from an image repository such as dockerhub or quay.io which contains the latest Reactome data and software, and can be run on any Docker-capable workstation. A process by which such docker images could be automatically built as a part of the Reactome data-release cycle would also be a goal.
    • Knowledge prerequisites: linux, Docker, Apache web servers, Tomcat, bash
    • Skill level: Medium
    • Mentor: Solomon Shorser (solomon.shorser@oicr.on.ca)


  • Project Idea 12: Pan-Genome Module for the Genome Context Viewer (GMOD)
    • Brief explanation: With the number of sequenced and annotated genomes continuously increasing, there is a need for new algorithms and tools for comparative analyses both at the nucleotide and genic levels. The Genome Context Viewer (GCV, https://goo.gl/trvfg1) is an OSS tool that enables comparative genomics by using gene families as a unit of search and comparison. It currently uses Chado as a reference implementation for its data services and can be integrated with other GMOD components via a service layer/API. This work will create an extension module that will integrate new and existing pan-genomics algorithms into GCV while leveraging the existing UI for visualization purposes. This will help to serve communities facing the challenges of having multiple reference genomes within a single species, as well as improving GCV’s utility for clade-oriented resources.
    • Expected results: The module would implement the Approximate Frequent Subpaths algorithm (Cleary, et al, 2017) for finding candidate GCV search queries and generating context-sensitive chromosome-scale synteny blocks, that is, synteny blocks derived from a chromosome’s gene family content and the GCV search parameters. It would also implement the Frequented Regions algorithm (Cleary, et al, in review) for identifying syntenic regions in pan-genome gene family graphs. As with current GCV algorithms, these implementations would be capable of aggregating data from multiple sources, enabling analyses of data distributed across multiple data-stores. The results of these algorithms would be displayed in the GCV UI, where users can interactively explore the results, perform new searches, and interlink to other relevant tools.
    • Knowledge prerequisites: Python (Django and Spark) and JavaScript (Angular 2 and D3) experience is preferred, as those are the languages used to implement GCV.
    • Skill level: Advanced
    • Mentors: Andrew Farmer, Legume Information System, National Center for Genome Resources, adf@ncgr.org; Steven Cannon, Legume Information System, US Department of Agriculture Agricultural Research Service.


  • Project Idea 13: Peformance and user-centric improvements to Afra's annotation editor
    • Brief explanation: Gene prediction software often make mistakes (e.g., merging adjacent genes, or splitting a long gene into two). Predicted gene models are thus visually inspected and manually corrected before using them in an analysis, which in turn take a lot of time. Inspired by the success of GalxyZoo and Foldit, we created Afra to crowdsoucrce manual curation of predicted gene models. In short, Afra redundantly collects corrections for a given gene model from different user and calls a winner. Conflicts are resolved manually by a senior curator. Broadly, Afra has two modules: an annotation editor (to visualise and correct predicted gene models), and a task processor (to dole out curation tasks and collect submissions from user). Our annotation editor, built using JBrowse and WebApollo, is completely browser based. The goal of this GSoC project is to a) migrate to latest JBrowse for improved performance (and latest features), b) further optimise the annotation editor to be more performant and implement features that will make annotation editing more accessible to non-experts. Afra is one of a kind project and performance and making annotation editing accessible to less experienced users (e.g., undergraduate students, school teachers) are the major hurdles in our way of a bigger rollout, which will contribute immensely to the biocuration community.
    • Expected results: The student will a) migrate annotation editing code and related changes to JBrowse to JBrowse 1.12, including making use of new JBrowse features that will improve performance (test suite should be used for migration), b) propose and implement already proposed features (Github), that will make annotation editor faster and ease the learning curve of manual curation. The work should be done in form of small pull-requests, that will be merged and deployed throughout the course of GSoC.
    • Knowledge prerequisites: JavaScript. JBrowse is written using dojo - should be possible to pick up the basics while drafting the proposal and along the way. Editing aspects make use of jquery-ui - prior knowledge here would be great, because some aspects can be tricky. Basic genomics (gene transcription, splicing, alternative splicing, translation) - very very important, can be picked up while drafting proposal (see corresponding Wikipedia pages). We use Ruby on the server side and Angular in the browser but you probably won't have to work on those parts of the codebase - a basic knowledge there would make it easier to handle the overall codebase.
    • Skill level: Intermediate-Advanced
    • Mentors: Anurag Priyam (firstname.lastname at qmul.ac.uk), Yannick Wurm (Wurmlab, Queen Mary University of London)


  • Project Idea 14: activedriverdb.org (Reimand Lab)
    • Brief explanation:activedriverdb.org is a newly developed interactive database to understand the intricate connections of human disease, genetics and cellular interaction networks. It provides a resource and visualisation platform of hundreds of thousands of known disease and cancer mutations in human genes that affect small protein sites involved in network interactions. The GSoC project is based on an earlier GSoC project and involves development of further functionality of the database.
    • Expected results: A data upload module that allows users to securely upload and store large sets of mutations from their experiments for full-scale analysis and visualisation within the platform.
    • Knowledge prerequisites: relational databases and SQL, programming language such as Python, RESTful web services, understanding of genetics and/or biology is a plus.
    • Skill level: Medium
    • Mentors: Jüri Reimand (Juri.Reimand@oicr.on.ca)