Revision as of 18:41, 10 February 2015 by Robin.haw (Talk | contribs)


Google Summer of Code 2015 @ Genome Informatics

Google Summer of Code is a global program that offers student developers stipends to write code for various open source software projects. We work with many open source, free software, and technology-related groups to identify and fund projects over a three month period. Since its inception in 2005, the program has brought together over 8,500 successful student participants from 101 countries and over 8,300 mentors from over 109 countries worldwide to produce over 50 million lines of code. Through Google Summer of Code, accepted student applicants are paired with a mentor or mentors from the participating projects, thus gaining exposure to real-world software development scenarios and the opportunity for employment in areas related to their academic pursuits. In turn, the participating projects are able to more easily identify and bring in new developers. Best of all, more source code is created and released for the use and benefit of all. (Excerpt from the Google Summer of Code website)

Since 2011, the Genome Informatics group has served as an "umbrella organization" to a variety of bioinformatics projects, including GMOD and its software projects -- GBrowse, JBrowse, etc.; Galaxy; PortEco; Reactome; SeqWare; WormBase; and others. More information about this year's participating bioinformatics groups can be found here.

To learn more about this year's event and how GSoC works, please refer to the GSoC FAQ.

Mailing lists, IRC, and other ways to get in touch

  • Email: and -- find out more about GSoC, a specific project, or your potential mentor(s).
  • Discussion mailing lists: Genome Informatics Google Groups - ask about our projects; join the community!
  • IRC channel: #genomeinformatics on Freenode.
  • Mentors can email both Robin and Scott to get more information about the program and get signed up.

Project Ideas

There are plenty of challenging and interesting project ideas this year. These projects span a broad set of skills, technologies and domains, such as GUIs, database integration and algorithms. If you have strong computer skills and an interest in biology or bioinformatics, you should definitely apply! Students are also encouraged to propose their own ideas: some of the best applications we see come from students who go this route. As long as your idea is relevant to one of our projects, we will give it serious consideration. Creativity and self-motivation are great traits for open source programmers.

Project Idea 1: Using an interpreted language to develop bioinformatics workflows

Brief explanation: SeqWare is a bioinformatics workflow engine that can be used to chain together the analysis of big data in genomics and bioinformatics. The current workflow language is Java, which is rather verbose.

Expected results: Use Groovy to hide the current, rather verbose Java workflow language. Using an interpreted language also enables rapid prototyping of workflows. The goal is to make scripting SeqWare feel more like shell scripting. This is a similar effort to the GATK team’s Queue, but it would leverage SeqWare. Prototype:

Knowledge prerequisites: Java, Groovy, git
Skill level: Medium
Mentors: Lars Jorgensen, Morgan Taschuk, Pipeline team

Project Idea 2: Write a Foreign Data Wrapper for Postgres and BAM/VCF

Knowledge prerequisites: C+Postgres

Brief explanation: SQL is a powerful language that makes querying structured data very straightforward, and genomics produces several types of structured data. Big data from genomics usually comes in two parts: the results, stored in files, and the metadata that describe the results, usually stored in databases. For example, a VCF file might describe a variant in a particular cancer-causing gene, while the metadata describe what the sample was, where it came from, how it was processed, etc. We would like to use SQL to query both results and metadata together.

Expected results: Develop a Foreign Data Wrapper for BAM and VCF in order to query alignment and variant information. There is an existing Foreign Data Wrapper for TSV files. Because VCF and SAM are both tab-delimited text formats, supporting them should be fairly straightforward. Accessing BAM files, which are compressed binary, would be slightly more involved. The result could serve as a good example of making queries against BAM data.

Knowledge prerequisites: PostgreSQL
Skill level: advanced
Mentors: Lars Jorgensen
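To illustrate the kind of row mapping a wrapper for Project Idea 2 would need, here is a minimal Python sketch that turns VCF text into rows keyed by the eight fixed VCF columns and then applies a SQL-like filter. It is not a Foreign Data Wrapper and does not talk to PostgreSQL; the function name `vcf_rows`, the lowercase column names, and the two example records are all made up for illustration.

```python
from typing import Dict, Iterable, Iterator

# The eight fixed, tab-separated columns of a VCF data line.
VCF_COLUMNS = ["chrom", "pos", "id", "ref", "alt", "qual", "filter", "info"]


def vcf_rows(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per VCF data line (hypothetical helper, not a real FDW API)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, "##" meta lines and the "#CHROM" header
        fields = line.split("\t")
        row: Dict[str, object] = dict(zip(VCF_COLUMNS, fields[:8]))
        row["pos"] = int(fields[1])  # expose the position as an integer column
        yield row


# Illustrative records only; they do not describe real variants.
EXAMPLE_VCF = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
17\t7577120\t.\tC\tT\t50\tPASS\tDP=100
13\t32914438\t.\tT\tC\t30\tq10\tDP=12
"""

# Rough equivalent of: SELECT chrom, pos FROM variants WHERE filter = 'PASS'
passing = [
    (row["chrom"], row["pos"])
    for row in vcf_rows(EXAMPLE_VCF.splitlines())
    if row["filter"] == "PASS"
]
```

A real wrapper would expose the same columns as a foreign table so that the filter could be written in SQL and joined against the metadata tables.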

Project Idea 3: Implement a FUSE interface to BAM/CRAM

Brief explanation: Storage of big data is an ongoing problem that will only get worse. As data moves through a processing pipeline in genomics, the output data is often a lossless conversion of data integrating different information (e.g. FASTQ is a listing of all reads; BAM is an alignment of those reads to a reference but still contains all of the reads from the FASTQ). However, data from earlier in the pipeline is often kept so that the analysis can be repeated with different tools. This results in a duplication of data on the order of gigabytes to terabytes.

Expected results: Enable a tool to see the same BAM file as two FASTQs, as an interleaved FASTQ, or as whatever other format it needs (with the same information). This should be easy to prototype in Python, as fuse-python and pysam exist.

Knowledge prerequisites: C and/or Python, POSIX APIs
Skill level: advanced
Mentor: Lars Jorgensen
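The core conversion behind Project Idea 3 can be sketched without any FUSE or pysam code. The example below assumes a read has already been pulled out of a BAM file into a small record (the `Read` tuple and `to_fastq` function are hypothetical names, not part of pysam or fuse-python). It relies on the SAM flag bits 0x10 (read mapped to the reverse strand), 0x40 (first in pair) and 0x80 (second in pair): reverse-strand reads are stored reverse-complemented, so the original FASTQ record is recovered by reversing that transformation.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """Minimal stand-in for an alignment record (illustrative only)."""
    name: str
    flag: int
    seq: str
    qual: str


# SAM flag bits used by this sketch.
REVERSE = 0x10
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def to_fastq(read: Read) -> str:
    """Render one read as a four-line FASTQ record in its original orientation."""
    seq, qual = read.seq, read.qual
    if read.flag & REVERSE:
        # Undo the reverse-complement applied when the read was aligned.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    if read.flag & FIRST_IN_PAIR:
        suffix = "/1"
    elif read.flag & SECOND_IN_PAIR:
        suffix = "/2"
    else:
        suffix = ""
    return f"@{read.name}{suffix}\n{seq}\n+\n{qual}\n"


# Flag 83 = paired + proper pair + reverse strand + first in pair.
record = to_fastq(Read("r1", 83, "AACG", "IIH#"))
```

A FUSE layer would call something like this on demand when a tool reads the virtual FASTQ file, so the FASTQ data never needs to be stored separately.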

Project Idea 4: Use Galaxy to run SeqWare workflows and process data

Brief explanation: SeqWare is a bioinformatics workflow engine that can be used to chain together the analysis of big data in genomics and bioinformatics. SeqWare is currently driven on the command line by skilled users. However, it would be incredibly useful to leverage SeqWare’s robustness and stability for individual non-expert users. Galaxy is a user-friendly mechanism for analysing data that can be used for this task.

Expected results: There are two potential sub-projects. 1) Adding SeqWare metadata and files as a data source in Galaxy, to enable Galaxy users to use SeqWare data, and 2) Launching and monitoring SeqWare workflows with Galaxy.

Knowledge prerequisites: Galaxy, Java, web services, PostgreSQL
Skill level: Medium
Mentor: Morgan Taschuk

Project Idea 5: Barcode scanner using phone or tablet to drive LIMS

Brief explanation: In a typical genomics lab, the Laboratory Information Management System (LIMS) is required to keep track of a lot of people, equipment and samples as they interact. A typical LIMS requires a desktop computer and a lot of drop-down menus to fulfill this task, which takes the technician away from the bench and introduces the potential for error. Large sequencing labs use barcodes instead, but barcode readers are prohibitively expensive for smaller labs.

Cameras on phones are getting quite good, so it should be fairly easy to drive the barcode reading from a mobile device. This would be a low cost way for smaller labs to use barcoding in the lab workflows. Barcode reading library:

Expected results: A mobile LIMS application that stores a particular lab workflow and prompts the user to scan barcodes when they reach a particular step in the workflow. It would also be able to send information back to the central LIMS servers.

Knowledge prerequisites: iOS or Android development, web services, interface design
Mentors: Lars Jorgensen, Timothy Beck and Tony DeBat

Project Idea 6: IPython Notebook on top of our infrastructure

Brief explanation: IPython Notebook is a powerful tool. It enables reproducible science, as people can share their work. It would be interesting to see how IPython Notebook and SeqWare could interact. It would also be useful for OICR’s users if they could query our metadata, and that of others, using Python or R.

Expected result: A Python library that can be used to query SeqWare’s metadata through its RESTful web service.

Knowledge prerequisites: Python, web services
Skill level: basic
Mentors: Timothy Beck, Lawrence Heisler, Yogi Sundaravadanam
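As a rough starting point for Project Idea 6, a thin client could look like the sketch below. Everything specific here is an assumption: the class name `MetadataClient`, the resource names, and the idea that responses are JSON are placeholders, since this page does not describe the SeqWare web service's actual endpoints or formats. The HTTP call is injectable so the client can be exercised in a notebook without a running server.

```python
import json
from typing import Callable, Dict, Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def _http_get(url: str) -> str:
    """Default fetcher: plain HTTP GET returning the body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


class MetadataClient:
    """Hypothetical wrapper around a RESTful metadata service."""

    def __init__(self, base_url: str, fetch: Callable[[str], str] = _http_get):
        self.base_url = base_url.rstrip("/")
        self.fetch = fetch  # swap in a stub for testing or offline use

    def build_url(self, resource: str, params: Optional[Dict[str, str]] = None) -> str:
        """Join the base URL, a resource path and optional query parameters."""
        url = f"{self.base_url}/{resource.strip('/')}"
        if params:
            url += "?" + urlencode(sorted(params.items()))
        return url

    def get(self, resource: str, params: Optional[Dict[str, str]] = None):
        """Fetch a resource and decode it, assuming the service returns JSON."""
        return json.loads(self.fetch(self.build_url(resource, params)))


# Offline usage with a stub fetcher; the URL and resource name are placeholders.
stub = MetadataClient(
    "http://example.org/ws/",
    fetch=lambda url: json.dumps({"requested": url}),
)
result = stub.get("samples", {"name": "a b"})
```

From a notebook, the decoded results could then be loaded into whatever analysis structures the user prefers.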

Preparing for GSoC 2015

Right now it is off-season for GSoC - we won't know if Genome Informatics has been accepted as a GSoC 2015 mentoring organization until March 2nd. The timeline for GSoC 2015 has now been posted here. Nevertheless, it is a perfect time for students to talk to mentors about project ideas. If you are interested in mentoring, please check the Mentors section below, and contact the organization admin.


More information about writing your application will be available closer to the start of the student application period.


We encourage mentors and mentoring organizations to think about new projects year round! If you'd like help with your ideas page or your separate mentoring org application, please feel free to contact the organization admins. Links to advice about mentoring and other resources are available.