Welcome to the Genome Informatics Google Summer of Code
From the Google Summer of Code website:
Google Summer of Code is a global program that offers student developers stipends to write code for various open source software projects. We work with many open source, free software, and technology-related groups to identify and fund projects over a three month period. Since its inception in 2005, the program has brought together over 7,500 successful student participants from 97 countries and over 7,000 mentors from over 100 countries worldwide to produce over 50 million lines of code. Through Google Summer of Code, accepted student applicants are paired with a mentor or mentors from the participating projects, thus gaining exposure to real-world software development scenarios and the opportunity for employment in areas related to their academic pursuits. In turn, the participating projects are able to more easily identify and bring in new developers. Best of all, more source code is created and released for the use and benefit of all.
Google Summer of Code has several goals:
- Create and release open source code for the benefit of all
- Inspire young developers to begin participating in open source development
- Help open source projects identify and bring in new developers and committers
- Provide students the opportunity to do work related to their academic pursuits (think "flip bits, not burgers")
- Give students more exposure to real-world software development scenarios (e.g., distributed development, software licensing questions, mailing-list etiquette)
Genome Informatics GSoC
For the past few years, a group of related bioinformatics projects has participated in Google Summer of Code under the umbrella of Genome Informatics. The group includes GMOD and its software projects (GBrowse, JBrowse, etc.), Galaxy, PortEco, Reactome, SeqWare, WormBase, and others.
How GSoC Works
From the GSoC FAQ:
- Open source projects who'd like to participate in Google Summer of Code in 2014 should choose at least two organization administrators to represent them.
- Organization administrators will submit the mentoring organization’s proposal for participation online.
- Google will notify the organization administrators of acceptance, and an account for the accepted organizations will be created on the Google Summer of Code 2014 site.
- Students submit project proposals online to work with particular mentoring organizations.
- Mentoring organizations rank student proposals and perform any other due diligence on their potential students; student proposals are matched with a mentor.
- Google allocates a particular number of student slots to each organization.
- Mentoring organizations make their final decision on which students to accept into the program.
- Students are notified of acceptance.
- Students begin learning more about their mentoring organization and its community before coding work starts.
- Students begin coding work at the official start of the program, provided they've interacted well with their community up until the program start date.
- Mentors and students provide mid-term progress evaluations.
- Mentors provide a final evaluation of student progress at close of program; students submit a final review of their mentor and the program.
- Students upload completed code to Google Summer of Code site.
Contact Us
The organization administrators for the Genome Informatics group are Robin Haw of Reactome and Amelia Ireland of GMOD.
- Email: email@example.com and firstname.lastname@example.org -- find out more about GSoC, a specific project, or your potential mentor(s).
- Discussion mailing lists: Genome Informatics Google Groups - ask about our projects; join the community!
- IRC channel: #genomeinformatics on Freenode.
Students: How to apply
We would like to know who you are and how you think. Incorporate the following into your application:
- Your information
- Name, email, and website (optional)
- Brief background: education and relevant work experience
- Your programming interests and strengths
- What are your languages of choice?
- Any prior experience with open source development?
- Your interest and background in biology or bioinformatics
- Any prior exposure to biology or bioinformatics?
- Your ideas for a project (an original idea or one expanded from our Ideas Page)
- Provide as much detail as possible
- Strong applicants include an implementation plan and timeline (hint!)
- Refer to and link to other projects or products that illustrate your ideas
- Identify possible hurdles and questions that will require more research/planning
- What can you bring to the team?
These projects include a broad set of skills, technologies and domains, such as GUIs, database integration and algorithms.
Students are also encouraged to propose their own ideas related to our projects. If you have strong computer skills and an interest in biology or bioinformatics, you should definitely apply! Do not hesitate to propose your own project idea: some of the best applications we see come from students who go this route. As long as it is relevant to one of our projects, we will give it serious consideration. Creativity and self-motivation are great traits for open source programmers.
If you have any difficulty using the wiki, please email your project proposal to email@example.com and we will add it for you.
See the list of project proposals from 2013 for ideas from last year's GSoC.
For potential mentors: students work remotely and will typically communicate with you electronically. Students are expected to be self-motivated and responsible for getting work done. The average time investment is about five hours per week.
Advice from Google on suitable project ideas
The following information comes from the GSoC manual on what makes a good GSoC project:
There are many ways to define a good GSoC project—probably as many ways as there are student-mentor pairings. Here are just a few:
Low-hanging fruit: These projects require minimal familiarity with the codebase and basic technical knowledge. They are relatively short, with clear goals.
Risky/Exploratory: These projects push the scope boundaries of your development effort. They might require expertise in an area not covered by your current development team. They might take advantage of a new technology. There is a reasonable chance that the project might be less successful, but the potential rewards make it worth the attempt.
Fun/Peripheral: These projects might not be related to the current core development focus, but create new innovations and new perspective for your project.
Core development: These projects derive from the ongoing work from the core of your development team. The list of features and bugs is never-ending, and help is always welcome.
Infrastructure/Automation: These projects are the code that your organization uses to get its development work done; for example, projects that improve the automation of releases, regression tests and automated builds. This is a category in which a GSoC student can be really helpful, doing work that the development team has been putting off while they focus on core development.
source: GSoC manual
Based on the Genome Informatics GSoC experience in 2013, prospective students are interested in "new" technologies and languages, such as iOS and Android apps, and in fancy, flashy, web-based projects.
Project idea format
Example of Idea
Brief description of the idea, including any relevant links, etc.
- Languages and skills: programming language(s) to be used, plus any other particular computer science skills needed
- Idea: name + contact details of the person(s) who thought up the idea
- Mentor(s): name + contact details of the proposed mentor(s)
2014 Project Ideas
Reactome: Visualising Large Diagrams
Reactome is a free, open-source, curated and peer-reviewed database of biomolecular pathways with about 12,000 distinct visitors per month. The Reactome Pathway Diagram viewer was initially developed as a GSoC project and has become part of the Reactome Pathway Browser (http://www.reactome.org/PathwayBrowser/). The widget works well for diagrams of the current size, but larger diagrams will need to be included in the future, so the current implementation needs to be improved using a different approach.
- Languages and skills: Java, GWT, HTML5 Canvas, Data visualisation
- Idea: Henning Hermjakob <firstname.lastname@example.org>, Antonio Fabregat <email@example.com>
- Mentor(s): Antonio Fabregat Mundo <firstname.lastname@example.org>
Description: The current pathway diagram widget works well for the pathways in Reactome, but diagrams with a large number of entities, for example large biomolecular disease maps, slow the widget down unacceptably. A different approach is needed in order to draw larger pathways in the canvas. Techniques used in game development can help here: for example, a quadtree would reduce the number of objects to be drawn in each canvas iteration (depending on the zoom level and the targeted frame) and would also speed up hover detection while the user moves the mouse over the diagram. Another useful improvement could be a multi-layer approach using several canvases to represent different layers of information. In that case exporting the view as an image becomes a little more complicated, but it is a good use case to take into account towards the end of the internship.
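To make the quadtree idea concrete, here is a minimal sketch, written in Python for brevity rather than in the project's Java/GWT stack. The names `Rect`, `QuadTree`, `query` and `hit_test`, and the `capacity` and `max_depth` values, are illustrative assumptions and not part of the Reactome code base. Each diagram object is stored by its bounding box; a viewport query then visits only the quadrants that overlap the visible area, and hover detection is the same query with a zero-size rectangle at the mouse position.

```python
class Rect:
    """Axis-aligned box: top-left corner (x, y) plus width and height."""

    def __init__(self, x, y, w, h):
        self.x, self.y, self.w, self.h = x, y, w, h

    def intersects(self, other):
        # Boxes that merely touch at an edge count as intersecting.
        return not (other.x > self.x + self.w or other.x + other.w < self.x
                    or other.y > self.y + self.h or other.y + other.h < self.y)

    def contains_rect(self, other):
        return (self.x <= other.x and other.x + other.w <= self.x + self.w
                and self.y <= other.y and other.y + other.h <= self.y + self.h)


class QuadTree:
    """Region quadtree; a box that straddles quadrants stays at the parent."""

    def __init__(self, bounds, capacity=4, max_depth=8, depth=0):
        self.bounds = bounds
        self.capacity = capacity    # hypothetical tuning value
        self.max_depth = max_depth  # hypothetical tuning value
        self.depth = depth
        self.items = []             # (Rect, payload) pairs held at this node
        self.children = []          # empty, or exactly four sub-quadrants

    def insert(self, box, payload):
        # Push the box down into a child only if it fits entirely inside it.
        for child in self.children:
            if child.bounds.contains_rect(box):
                child.insert(box, payload)
                return
        self.items.append((box, payload))
        if (not self.children and len(self.items) > self.capacity
                and self.depth < self.max_depth):
            self._split()

    def _split(self):
        b = self.bounds
        hw, hh = b.w / 2, b.h / 2
        self.children = [
            QuadTree(Rect(b.x + dx, b.y + dy, hw, hh),
                     self.capacity, self.max_depth, self.depth + 1)
            for dx in (0, hw) for dy in (0, hh)
        ]
        # Redistribute the items held so far; straddlers remain here.
        pending, self.items = self.items, []
        for box, payload in pending:
            self.insert(box, payload)

    def query(self, area):
        """Return payloads whose boxes overlap `area` (e.g. the viewport)."""
        found = [p for box, p in self.items if box.intersects(area)]
        for child in self.children:
            if child.bounds.intersects(area):
                found.extend(child.query(area))
        return found

    def hit_test(self, px, py):
        """Return payloads under a point, e.g. the current mouse position."""
        return self.query(Rect(px, py, 0, 0))
```

On each redraw only the result of `query(viewport)` would need to be painted, and on each mouse move only the result of `hit_test(x, y)` would need to be checked, instead of every object in the diagram.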
- Languages and skills: Java, Bash/Linux, AWS, Google Cloud, Ansible, Vagrant, HBase/NOSQL, MapReduce+associated Hadoop technologies
- Mentor(s): Brian O'Connor <email@example.com>, Denis Yuen <firstname.lastname@example.org>
There are quite a few projects that I would like to see happen for SeqWare and it would be great to get a student to help on these:
- add hybrid workflow support to SeqWare Pipeline so users can write workflows that include support for Hadoop tools (Pig, Hive, M/R, etc) and traditional command line tools
- push forward the design of our multi-cloud cluster provisioning technology stack based on Vagrant. This includes incorporating provisioning technologies like Ansible.
- leverage Elastic Map Reduce on Amazon's AWS as an environment to run SeqWare
- leverage the Google cloud, add support for spinning up SeqWare clusters in this environment and to interact with their bucket store
- work with the Galaxy tool and finish the compatibility layer that allows SeqWare workflows to run/interact with Galaxy
- write an AngularJS-based web application on top of our HBase variant/read NOSQL database, and write proof-of-concept analytical plugins that use machine learning and other advanced techniques to analyze data stored in this scalable backend
Tripal Pedigree Viewer
- Mentor(s): Lacey-Anne Sanderson <email@example.com>
Description: Development of an interactive, collapsible pedigree diagram to be displayed on Tripal Germplasm pages. The nodes of the diagram need to contain the name of the stock with a link to the page, and the edges of the diagram need to be labelled with the relationship type (e.g., maternal parent of). All of the data is already stored within a PHP tree class with traversal methods. We are therefore looking for a student to use the traversal methods to generate the markup needed for their application and to implement the actual drawing of the pedigree using languages and libraries of their choosing. Here is an example showing the collapsibility desired; however, names within the node circles (as compared to beside in the example) and labelled connector lines (edges) are needed.
Background: Tripal is a Drupal module that implements display and management of biological data within a Drupal site. Drupal is a PHP-based, database-driven content management system used for development of websites (from blogs, to ecommerce sites, and now organism community sites such as KnowPulse: Legume Breeding & Genomics, Citrus Genome and many more). See our website for more Tripal sites as well as additional information. The Tripal Germplasm module provides the ability to display and manage plant/animal breeding programs. Currently the pedigree is displayed in the community standard textual format (e.g., ParentA//ParentB1/ParentB2, which says that the offspring of ParentB1 and ParentB2 was mated with ParentA to produce the current germplasm). Although this is descriptive and common in the community, a graphical diagram showing these relationships would be a lot more intuitive, which is the motivation behind this project.
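As an illustration of the kind of structure a pedigree diagram needs, the sketch below turns the textual format into a nested tree plus a list of labelled edges. It is written in Python for brevity and is not the Tripal PHP tree class. The function names `parse_pedigree` and `edges`, the generic "parent of" label, and the rule that the longest run of slashes marks the most recent cross are all assumptions made for this example; numbered crosses and other extensions of the notation are not handled.

```python
import re


def parse_pedigree(text, name=None):
    """Parse e.g. 'ParentA//ParentB1/ParentB2' into a nested dict.

    Assumption: the longest run of slashes is the most recent cross, so
    the string is split there and each side is parsed recursively. If two
    runs have the same length, the first one is used.
    """
    text = text.strip()
    if not text:
        raise ValueError("empty pedigree or missing parent")
    runs = [(m.start(), m.end()) for m in re.finditer(r"/+", text)]
    if not runs:
        return {"name": text, "parents": []}  # a named stock with no cross
    start, end = max(runs, key=lambda run: run[1] - run[0])
    return {
        # Unnamed intermediate crosses are labelled with their own notation.
        "name": name or "(%s)" % text,
        "parents": [parse_pedigree(text[:start]), parse_pedigree(text[end:])],
    }


def edges(node, label="parent of"):
    """Flatten the tree into (parent, relationship, child) triples."""
    result = []
    for parent in node["parents"]:
        result.append((parent["name"], label, node["name"]))
        result.extend(edges(parent, label))
    return result
```

For the example string, `parse_pedigree("ParentA//ParentB1/ParentB2", name="Offspring")` yields a root named `Offspring` with parents `ParentA` and the intermediate cross `(ParentB1/ParentB2)`. In the real module, node names, links and relationship types would come from the stored data rather than from string parsing.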
JBrowse: REST daemon for Chado
Implement a self-contained server in the language of your choice (such as Python/WSGI, Perl/Plack, node.js, or Java/Jetty) to serve feature data and name completions out of a GMOD Chado database schema according to the JBrowse 1 REST API, enabling an instance of JBrowse 1 to run directly atop a Chado database. Possible addition: implement another daemon in Perl/Plack that does the same thing for a GBrowse 2 installation.
- Skills: server-side language of student's choice
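A rough starting point for the Python/WSGI option might look like the sketch below. It covers only a feature lookup of the form `/features/<refseq>?start=..&end=..` that returns a JSON object with a `features` list. The exact route, the feature keys (`uniqueID`, `start`, `end`, `strand`) and the SQL are assumptions to be checked against the JBrowse REST API documentation and your Chado instance. `make_app`, `in_memory_fetch` and `CHADO_FEATURE_SQL` are hypothetical names, and the in-memory backend stands in for a real database connection so that the example is self-contained.

```python
import json
import re
from urllib.parse import parse_qs

# Assumed query against the Chado feature and featureloc tables; untested
# here, and a real daemon would also need to resolve type_id via cvterm.
CHADO_FEATURE_SQL = """
SELECT f.uniquename, fl.fmin, fl.fmax, fl.strand
FROM featureloc fl
JOIN feature f ON f.feature_id = fl.feature_id
JOIN feature src ON src.feature_id = fl.srcfeature_id
WHERE src.uniquename = %s AND fl.fmin < %s AND fl.fmax > %s
"""


def _respond(start_response, status, payload):
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def make_app(fetch_features):
    """Build a WSGI app; fetch_features(refseq, start, end) returns dicts."""
    route = re.compile(r"^/features/([^/]+)$")

    def app(environ, start_response):
        match = route.match(environ.get("PATH_INFO", ""))
        if environ.get("REQUEST_METHOD") != "GET" or not match:
            return _respond(start_response, "404 Not Found",
                            {"error": "not found"})
        query = parse_qs(environ.get("QUERY_STRING", ""))
        try:
            start = int(query["start"][0])
            end = int(query["end"][0])
        except (KeyError, ValueError):
            return _respond(start_response, "400 Bad Request",
                            {"error": "start and end must be integers"})
        features = list(fetch_features(match.group(1), start, end))
        return _respond(start_response, "200 OK", {"features": features})

    return app


def in_memory_fetch(rows):
    """Stand-in backend: filters a list of dicts instead of running SQL."""
    def fetch(refseq, start, end):
        return [
            {"uniqueID": r["uniquename"], "start": r["fmin"],
             "end": r["fmax"], "strand": r["strand"]}
            for r in rows
            if r["ref"] == refseq and r["fmin"] < end and r["fmax"] > start
        ]
    return fetch
```

To try it locally, the app can be served with the standard library, for example `wsgiref.simple_server.make_server("", 8000, make_app(in_memory_fetch(rows))).serve_forever()`. Swapping `in_memory_fetch` for a function that runs the SQL through a database driver is where the actual project work would begin, along with name completion and the other endpoints.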
JBrowse "regions of interest" lists
Drupal-based GMOD Tool Information Tool
- Mentors: Lacey-Anne Sanderson, Amelia Ireland
Description to be added soon.
- Idea: Enis Afgan (firstname.lastname@example.org)
- Mentor(s): Afgan (email@example.com), Baker (firstname.lastname@example.org)
Galaxy CloudMan (http://usecloudman.org) is a cloud manager that orchestrates all the steps required to provision and manage a set of cloud resources to deliver a functional compute cluster in the cloud. A deployed instance of CloudMan comes preconfigured with the Galaxy application, dozens of bioinformatics tools and gigabytes of genome reference data. The application is used around the world to launch hundreds of clusters per month. The following are suggestions for student improvements that would help the project grow further (each would be a separate project): (1) a new web interface, exposing key application functionality and focusing on scalability and accessibility; (2) an automated process for deploying/replicating Galaxy on the Cloud across all AWS regions; (3) advanced cluster autoscaling (responsive, based on the individual cluster’s workload, taking advantage of different cloud instance types).
- Improving Galaxy Charts, e.g. by adding new visualizations or options to customize visualizations. This is a very confined project. It has the advantage that the student (basically) cannot break existing code and does not have to grasp Galaxy’s inner layers, but would still be able to make a major contribution.
- Something from the Tool requests and Developer ideas lists at Trello, although it is not as much fun as starting something new and one card is probably not enough.