GSOC Project Ideas 2020

From GMOD
Revision as of 15:48, 6 February 2020

Got an idea for GSOC 2020?

Then please post it. You can either

  1. Add it here, by directly editing this page. Just copy, paste and update the template below. This requires that you have or create a GMOD.org login.

Projects can use a broad set of skills, technologies, and domains, such as GUIs, database integration and algorithms.

Students are also encouraged to propose their own ideas related to our projects. If you have strong computer skills and an interest in biology or bioinformatics, you should definitely apply! Do not hesitate to propose your own project idea: some of the best applications we see are from students who go this route. As long as it is relevant to one of our projects, we will give it serious consideration. Creativity and self-motivation are great traits for open-source programmers.


Proposed project ideas for 2020

Be the first to add a project idea.

Template

  • Project Idea Name (Project Name/Lab Name)
    • Brief explanation: Brief description of the idea, including any relevant links, etc.
    • Expected results: describe the outcome of the project idea.
    • Project Home Page URL: if there is one.
    • Project paper reference and URL: Is there a paper about the project this effort will be a part of?
    • Knowledge prerequisites: programming language(s) to be used, plus any other particular computer science skills needed.
    • Skill level: Basic, Medium or Advanced.
    • Mentors: name + contact details of the lead mentor, name + contact details of 1 or 2 backup mentors.


Automated Bioinformatics Help in Galaxy

  • Brief explanation:
    • Galaxy users often encounter errors when trying to run a bioinformatics analysis. These errors may be user or data errors (e.g. a misformatted dataset) or errors due to the underlying computing hardware (e.g. the disk is full). Helping users and Galaxy support staff determine the kind of error they encountered would be useful because a user can likely address the first type of error, while the second type requires expert intervention.
    • This project will improve Galaxy’s error system by using heuristics or machine learning to identify common types of user/data errors and to suggest likely causes of each error and how it might be fixed. This will benefit Galaxy users by giving them clear and actionable error messages, and support staff by reducing the number of reported non-system errors.
  • Expected results:
    • Create a tool for analyzing, identifying, and classifying common error messages from the extensive history of error messages from the main public Galaxy server (https://usegalaxy.org).
      • The diversity and size of this data suggest a machine learning approach, but the specific approach taken would be decided by the student and mentor.
    • Extend Galaxy’s tool definition syntax to support defining common error classes and suggested resolutions.
    • Update Galaxy’s user interface to display potential resolutions and suggested actions based on the types of errors found in an analysis.
  • Project Home Page URL: galaxyproject.org
  • Project paper reference and URL: The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update, Enis Afgan et al., Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W537–W544, https://doi.org/10.1093/nar/gky379
  • Knowledge prerequisites: programming language(s) to be used, plus any other particular computer science skills needed.
  • Skill level: Medium.
  • Mentors:
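
As a rough illustration of the heuristic end of this idea, the sketch below matches a job's error output against a small table of rules and returns an error class with a suggested resolution. The rule names, regular expressions and suggestion texts are hypothetical and are not part of Galaxy; real rules would be derived from the error history mentioned above, and the student and mentor may well choose a machine learning approach instead.

```python
import re
from typing import NamedTuple, Optional, Pattern


class ErrorRule(NamedTuple):
    name: str          # hypothetical error class name
    kind: str          # "user" (fixable by the user) or "system" (needs an expert)
    pattern: Pattern   # regular expression searched for in the error output
    suggestion: str    # text that could be shown to the user


# Hypothetical rules for illustration only.
RULES = [
    ErrorRule(
        "disk_full", "system",
        re.compile(r"no space left on device", re.IGNORECASE),
        "This is a server problem; please report it to the administrators.",
    ),
    ErrorRule(
        "bad_format", "user",
        re.compile(r"(invalid|malformed|unexpected) .*(format|header)", re.IGNORECASE),
        "Check that the input dataset has the datatype the tool expects.",
    ),
]


def classify(error_output: str) -> Optional[ErrorRule]:
    """Return the first rule whose pattern matches, or None if nothing matches."""
    for rule in RULES:
        if rule.pattern.search(error_output):
            return rule
    return None
```

Keeping the rules as data rather than code is one way such classes could later be declared in a tool definition, which is what the second expected result asks for.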

Use Galaxy to run Reactome analysis and processes on genomic data (Reactome)

    • Brief explanation: Reactome is a free, open-source, curated and peer-reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modelling, systems biology and education. Galaxy is an open, web-based platform for data intensive biomedical research, which allows users to perform, reproduce, and share complete analyses.
    • Expected results: There are two potential sub-projects. 1) Adding Reactome as a data resource in Galaxy, to enable Galaxy users to use Reactome reaction and pathway annotation data, and 2) Performing identifier mapping and over-representation analysis workflows from Reactome in Galaxy. Reactome Github.
    • Project Home Page URL: if there is one.
    • Project paper reference and URL: reactome.org, galaxyproject.org
    • Knowledge prerequisites: Galaxy, Java, web services.
    • Skill level: Medium.
    • Mentors: Robin Haw (robin.haw[AT]oicr.on.ca) and Joel Weiser (joel.weiser[AT]oicr.on.ca).

Create a software package for use in R to query Reactome’s Graph Database in Neo4J (Reactome)

  • Brief explanation: The R programming language has an existing package for connecting to Neo4J databases. The purpose of this project is to use that package as a base for a connection that queries Reactome’s Neo4J graph database and returns data structures for manipulating Reactome pathway and reaction data.
  • Expected results: R end-users can retrieve Reactome pathway and reaction data for analysis through both pre-written functions and custom queries. Examples of categories for such functions include pathways and reactions that contain certain genes, proteins, Gene Ontology terms or cross-references to external databases, as well as other queries useful to Reactome end-users.
  • Project Home Page URL: reactome.org.
  • Knowledge prerequisites: R Programming Language, Neo4J.
  • Skill level: Medium.
  • Mentors: Joel Weiser (joel.weiser[AT]oicr.on.ca).

Automating Reactome’s data release post-step QA (Reactome)

  • Brief explanation:
    • Reactome is a free, open-source, curated and peer-reviewed pathway database. Every quarter we complete a data release that contains newly curated information as well as updated data from a variety of resources. Running the data release is time-intensive, in part due to the number of steps involved that require manual inspection to verify it was run correctly.
    • The project will involve the developer building QA steps that can automatically verify the data release steps ran correctly. The data release can be divided into MySQL or Neo4j components, and the student can choose to work on QA for either or both.
  • Expected results: QA code that automatically verifies release steps were correctly run. The new QA tests will encompass data checks within the MySQL and/or Neo4j databases as well as comparisons between data releases.
  • Project Home Page URL: https://reactome.org/
  • Project paper reference and URL:
  • Knowledge prerequisites: Java, MySQL, (optional) Neo4j, AWS
  • Skill level: Medium
  • Mentors: Justin Cook (justin.cook[AT]oicr.on.ca)
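
One kind of check described above, a comparison between data releases, can be sketched as follows. This is a minimal illustration in Python rather than the Java the project calls for; the function name, the 5% threshold and the idea of comparing per-class instance counts are assumptions, and in practice the counts would come from queries against the MySQL or Neo4j release databases.

```python
def compare_release_counts(previous, current, max_drop=0.05):
    """Compare per-class instance counts between two releases.

    `previous` and `current` map a class label (e.g. "Pathway") to a count.
    Returns a list of failure messages; an empty list means the check passed.
    The 5% default threshold is an arbitrary placeholder.
    """
    failures = []
    for label, old in sorted(previous.items()):
        new = current.get(label)
        if new is None:
            failures.append(f"{label}: missing from current release")
        elif old > 0 and (old - new) / old > max_drop:
            failures.append(f"{label}: dropped from {old} to {new}")
    return failures


# Made-up numbers, for illustration only.
previous_release = {"Pathway": 2000, "Reaction": 12000, "Complex": 13000}
current_release = {"Pathway": 2050, "Reaction": 10000}
failures = compare_release_counts(previous_release, current_release)
```

A release step would then be considered verified only when the returned list is empty, replacing the manual inspection with an automatic pass/fail result.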

GraphDB API (Reactome)

  • Brief explanation: Reactome uses both a relational database (MySQL) and a graph database (Neo4j). There is an existing API that uses the relational database, and many Reactome components use this API. To make it easier to transition these components to using the graph database, a new API with equivalent functionality needs to be created.
  • Expected results: A new Java API that interacts with the graph database, with functionality such that it could be used as a drop-in replacement for the relational database API.
  • Project Home Page URL: reactome.org.
  • Project paper reference and URL:
  • Knowledge prerequisites: Java, MySQL. Neo4j would be good, but not necessary.
  • Skill level: Advanced.
  • Mentors: Solomon Shorser (solomon.shorser[AT]oicr.on.ca).

GraphQL interface for querying Reactome data (Reactome)

  • Brief explanation: Reactome currently has a REST-based API that allows end-users to obtain specific data from pre-defined queries. To allow users to customize their queries, explore the Reactome data schema and better understand what data they can obtain, a GraphQL based endpoint could be added to the existing API.
  • Expected results: A publicly accessible GraphQL API that allows Reactome end-users to submit custom data queries to Reactome.
  • Project Home Page URL: reactome.org.
  • Knowledge prerequisites: Java, GraphQL, Neo4j (preferred), Swagger (optional)
  • Skill level: Advanced.
  • Mentors: Joel Weiser (joel.weiser[AT]oicr.on.ca)

Statistics consolidation/display of release data (Reactome)

  • Brief explanation: Reactome has both manual and automated statistical tracking of its quarterly release data. This project would seek to fully automate and consolidate the measurement of release data for metrics such as the number of pathways, reactions, distinct proteins (with and without UniProt isoforms), complexes, small molecules, drugs/therapeutics, literature references, etc. These metrics would be reported for human (curated) and non-human (electronically inferred) species and stratified by normal and disease biology.
  • Expected results: A program that produces a standardized report of statistics for a Reactome release database, with clear, well-designed visuals.
  • Project Home Page URL: reactome.org.
  • Knowledge prerequisites: Java, MySQL and/or Neo4j, creating visuals for statistical data (preferred but not required)
  • Skill level: Medium.
  • Mentors: Joel Weiser (joel.weiser[AT]oicr.on.ca)

Community data submission (WormBase)

  • Brief explanation: WormBase is a comprehensive research knowledgebase on the biology of nematodes. Our database is built by extracting and standardizing information from published literature, which is time consuming and low throughput. Hence, we would like to encourage our users, who also derive the knowledge originally, to submit their findings through our website. This would speed up the integration of knowledge in our database and diversify our data sources.
  • Expected results: Website frontend components and backend mechanisms that allow inline data submission from users, real-time updates of the website, notifications, and a mechanism for reviewing the data and integrating it into the WormBase database.
  • Project Home Page URL: https://wormbase.org
  • Project paper reference and URL: https://academic.oup.com/nar/article/48/D1/D762/5603222
  • Knowledge prerequisites: JavaScript; experience building cloud-native solutions preferred.
  • Skill level: Advanced
  • Mentors: Todd Harris (todd[AT]wormbase.org), Sibyl Gao (sibyl[AT]wormbase.org).

Data Table functionality and performance (WormBase)

  • Brief explanation: WormBase is a comprehensive research knowledgebase on the biology of nematodes. Biologists access our vast information through a web portal, which often presents information in many tables. Here is an example page of a well-studied gene in C. elegans, dat-6, https://wormbase.org/species/c_elegans/gene/WBGene00000912#01347b--10. These tables were developed years ago using HTML and jQuery, with certain features depending on Flash. Their limitations and usability issues are more pronounced now. Hence, we are looking for a new implementation of these tables in React, which is already used in many parts of the site.
  • Expected results: A generic and customizable table component in React for displaying WormBase data, with the ability to search, filter, sort, paginate, and export all or parts of the table.
  • Project Home Page URL: https://wormbase.org
  • Project paper reference and URL: https://academic.oup.com/nar/article/48/D1/D762/5603222
  • Knowledge prerequisites: JavaScript, CSS, React.
  • Skill level: Medium
  • Mentors: Sibyl Gao (sibyl[AT]wormbase.org).
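
The data operations listed in the expected results (search, sort, paginate) are independent of the rendering layer, so they can be sketched on their own. The example below is a language-neutral illustration written in Python; the actual component would be JavaScript/React, and the function name, parameters and sample rows are hypothetical.

```python
def query_rows(rows, search="", sort_by=None, descending=False, page=1, page_size=10):
    """Apply search, sort and pagination to a list of row dictionaries.

    Returns the total number of matching rows plus the rows on the requested page.
    """
    needle = search.lower()
    # Search: keep rows where any cell contains the search text (case-insensitive).
    matched = [
        row for row in rows
        if needle in " ".join(str(value) for value in row.values()).lower()
    ]
    # Sort: optional, by a single column.
    if sort_by is not None:
        matched.sort(key=lambda row: row[sort_by], reverse=descending)
    # Paginate: pages are numbered from 1.
    start = (page - 1) * page_size
    return {"total": len(matched), "rows": matched[start:start + page_size]}


# Sample rows for illustration only.
sample_rows = [
    {"gene": "unc-22", "phenotype": "twitcher"},
    {"gene": "daf-2", "phenotype": "dauer"},
    {"gene": "daf-16", "phenotype": "dauer"},
]
second_page = query_rows(sample_rows, search="dauer", sort_by="gene", page=2, page_size=1)
```

Exporting "all or parts of the table" then amounts to serializing either the full matched list or only the current page.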

GraphQL over Microservice Architecture (WormBase / Alliance of Genome Resources)

  • Brief explanation: As a data resource, we seek to make our data accessible through web APIs. GraphQL, a new web API specification, gives users with programmatic access a powerful way to retrieve our data. However, a single monolithic GraphQL service would be difficult for us to manage due to heterogeneous data types and geographically distributed teams. We would like to allow different teams (each organized around a knowledge domain) to independently develop and deploy GraphQL services, and to combine them into a single GraphQL schema and API endpoint that can easily be queried across domains. Apollo Federation seems to be what we need, and we’d like to assess its viability as a solution to our problem.
  • Expected results: Prototype for expressing multiple independently developed and deployed GraphQL services as a single graph using Apollo Federation.
  • Project Home Page URL: https://wormbase.org
  • Project paper reference and URL: https://academic.oup.com/nar/article/48/D1/D762/5603222
  • Knowledge prerequisites: JavaScript, microservice architecture.
  • Skill level: Advanced
  • Mentors: Sibyl Gao (sibyl[AT]wormbase.org).

Single Sign On (WormBase)

  • Brief explanation: WormBase consists of an ecosystem of tools, some of which we developed ourselves, others we adopted from the community. Several of these tools require a user to be signed in, giving rise to multiple implementations of authentication and unnecessary difficulties in user experience.
  • Expected results: Users only need to sign in once on WormBase. We would like to have a centrally managed authentication service that allows individual services to manage their own authorization. We would also need a migration plan for the tools that currently manage their own authentication.
  • Project Home Page URL: https://wormbase.org
  • Project paper reference and URL: https://academic.oup.com/nar/article/48/D1/D762/5603222
  • Knowledge prerequisites: Authentication and Authorization (JWT, OAuth2).
  • Skill level: Advanced
  • Mentors: Sibyl Gao (sibyl[AT]wormbase.org).
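
The split between central authentication and per-service authorization can be illustrated with a signed token: the central service issues it once at sign-in, and each tool verifies it locally before applying its own access rules. The sketch below uses a simplified, JWT-like format built from the Python standard library; the function names and token layout are assumptions, and a real deployment would use a vetted JWT/OAuth2 library rather than hand-rolled code.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload: str, secret: bytes) -> str:
    return _b64encode(hmac.new(secret, payload.encode("ascii"), hashlib.sha256).digest())


def issue_token(user, secret, ttl_seconds=3600, now=None):
    """Central authentication service: issue a signed token after sign-in."""
    now = time.time() if now is None else now
    claims = {"sub": user, "exp": now + ttl_seconds}
    payload = _b64encode(json.dumps(claims).encode("utf-8"))
    return payload + "." + _sign(payload, secret)


def verify_token(token, secret, now=None):
    """Individual tool: return the user name if the token is valid, else None."""
    now = time.time() if now is None else now
    parts = token.split(".")
    if len(parts) != 2:
        return None
    payload, signature = parts
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(signature, _sign(payload, secret)):
        return None
    claims = json.loads(_b64decode(payload))
    if claims["exp"] < now:
        return None
    return claims["sub"]
```

After `verify_token` returns a user name, each tool is free to decide what that user may do, which keeps authorization decentralized as described above.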

Faster Autocompletion (WormBase Name Service)

  • Brief explanation: As a biological knowledge base, WormBase assigns unique stable identifiers to biological entities described in scientific publications (such as a gene or a variation). The tasks of creating an identifier and looking up information associated with it (such as its common name and metadata) are handled by a tool called the Name Service. The performance of the Name Service is currently hindered by slow autocompletion, which is performed directly on a database that is not optimized for this type of query. We are looking to implement an autocomplete solution using a search engine, such as Elasticsearch.
  • Expected results: A faster REST API for retrieving autocomplete suggestions, and the workflow necessary to keep the search engine highly available and up to date with the database.
  • Project Home Page URL: https://wormbase.org
  • Project paper reference and URL: https://academic.oup.com/nar/article/48/D1/D762/5603222
  • Knowledge prerequisites: Elasticsearch, Clojure (or another LISP)
  • Skill level: Medium
  • Mentors: Sibyl Gao (sibyl[AT]wormbase.org).
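
To show why a structure built for prefix lookups helps, here is a small in-memory stand-in for a search index: names are kept sorted, so a binary search finds the first candidate without scanning every record. This is only an illustration in Python of the underlying idea; the class and method names are hypothetical, and the project itself targets a search engine such as Elasticsearch behind a REST API.

```python
import bisect


class PrefixIndex:
    """Sorted in-memory index supporting case-insensitive prefix suggestions."""

    def __init__(self, names):
        # Sort once up front; each entry is (lowercased key, original name).
        self._entries = sorted((name.lower(), name) for name in names)
        self._keys = [key for key, _ in self._entries]

    def suggest(self, prefix, limit=10):
        """Return up to `limit` names that start with `prefix`."""
        wanted = prefix.lower()
        # Binary search for the first key that is >= the prefix.
        start = bisect.bisect_left(self._keys, wanted)
        suggestions = []
        for key, name in self._entries[start:start + limit]:
            if not key.startswith(wanted):
                break
            suggestions.append(name)
        return suggestions


# Sample names for illustration only.
index = PrefixIndex(["daf-16", "daf-2", "unc-22", "dpy-10", "DAF-12"])
```

The part of the workflow this sketch leaves out is the one the expected results emphasize: keeping the index synchronized whenever identifiers are created or changed in the database.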

Word and sentence completion for curatorial remarks (WormBase Name Service)

  • Brief explanation: As a biological research knowledgebase, WormBase aggregates and standardizes experimental findings reported in research literature. This work is often done with the help of our curators, who, in addition to filling in standardized forms, write free-text remarks providing additional context. We would like a word and sentence completion tool that reduces the amount of repetitive typing that a curator does, without hindering their ability to compose original content if they so choose.
  • Expected results: A performant in-browser word and sentence completion tool, similar to Gmail Smart Compose, that suggests words based on previously composed content and is available through an API.
  • Project Home Page URL: https://wormbase.org
  • Project paper reference and URL: https://academic.oup.com/nar/article/48/D1/D762/5603222
  • Knowledge prerequisites: NLP, JavaScript
  • Skill level: Advanced
  • Mentors: Sibyl Gao (sibyl[AT]wormbase.org).
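
Suggesting words "based on previously composed content" can be approximated, at its simplest, by counting which word most often follows the last word typed. The bigram sketch below is a deliberately small baseline written in Python; the class name and the sample remarks are invented for illustration, and the project is expected to use stronger NLP techniques and to run in the browser.

```python
from collections import Counter, defaultdict


class NextWordModel:
    """Bigram model: suggests the words that most often followed the last word."""

    def __init__(self):
        self._following = defaultdict(Counter)

    def train(self, remark):
        """Learn word pairs from one previously written remark."""
        words = remark.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            self._following[current_word][next_word] += 1

    def suggest(self, text, limit=3):
        """Return up to `limit` likely next words for the text typed so far."""
        words = text.lower().split()
        if not words:
            return []
        counts = self._following.get(words[-1])
        if counts is None:
            return []
        return [word for word, _ in counts.most_common(limit)]


# Invented remarks for illustration only.
model = NextWordModel()
model.train("allele confirmed by sequencing")
model.train("allele confirmed by PCR")
model.train("phenotype confirmed by sequencing")
```

Because suggestions are only offered and never inserted automatically, a curator can ignore them and keep typing original text, which matches the requirement above.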