GSoC/IDEA 9
Page for project discussion, ideas and some other third thing.
My Concern
I think that this idea is HUGE and would require a dedicated team to pull off. It is probably so nebulous at this point that, even given that team, it would still take quite a while just to nail down specs before development could start. Additionally, GMOD has typically shied away from working with assembly issues. The only foray into assembly that I can think of is CMAE (CMap Assembly Editor), which was built on CMap. A significant number of developer hours went into that project, only to find that the people who actually do assembly didn't like the way it worked. I suspect it is best to drop this idea unless a well-thought-out and doable proposal comes along.
Scott 20:05, 5 April 2011 (UTC)
Similar projects
Correspondence
I found there was a lot of 'invisible' information floating round by email. Here it is for reference:
Email 01
I suggest you start work immediately, researching the common assembly format, ACE, and its implementation in BioPerl (recently overhauled by Rob Buels). We then need a preliminary appraisal of whether this format is suitable for 'community assembly' based on NGS technology, limitations, improvements etc.
Do you know of any other genome assembly formats? I guess AGP is another... there are currently no BioPerl modules for handling AGP, but I have some AGP code that could be added to BioPerl, and I know Rob Buels does too.
A very concrete sub-project would be to refactor the assembly manipulation tools available in BioPerl. I've CC'ed Chris Fields for his input on that.
Also, we need to work out how to integrate structured data with inherently 'flat' version control such as Git.
P.S. You may like
Reply
BioPerl supports the following assembly formats:
-http://www.bioperl.org/wiki/Module:Bio::Assembly::IO
Email 02
I should stress that, if you decided to work on this project, I wouldn't be able to provide much time in the form of mentoring. Sorry about that, but I'm just being realistic. The main requirement, therefore, would be that you could work independently with only 'high level' guidance. This isn't a challenge, it's for you to decide what is best for you.
> I am writing to express my interest in the GSOC 2011 project of GMOD. I am
> particularly interested in the Idea 9: Develop collaborative genome assembly
> tools and databases. I would rather appreciate if you could introduce more
> about this idea. Here are some details I would like to know:
> 1. As far as I know, current sequence databases, such as dbSNP and UCSC
> Genome Browser, do have central databases and versioning (although I do not
> know how they implement the versioning). Why we need another one?
Well... a big organization putting a version on a certain database release is very different from what I'm proposing, which is a version control system specifically for genome assembly. The human genome build version is handled by a large centralised organization with very well laid out policy and guidelines, and they are working on a very important genome. Here there is no problem, and we can all respect their authority and use their version codes. As sequencing gets cheaper, however, specialists get more disparate, and less and less investment goes into generating the data. In such cases, it behoves us to have a good way to coordinate community activities behind the best 'working draft' assembly. There are no organizations ready to step in, no money to spend, and no policy in place. I'm thinking of something like Chado (which is for genome annotation) for genome assembly.
> 2. From my understanding, this tool is designed to be used by a community of
> researchers, who may have access to certain genome assembly. This community
> contributes to the completion and analysis on the same genome. Is it right?
Yes. In fact Chado (and GMOD) does a very good job of housing genome analysis results (such as new metabolomics, proteomics or transcriptomics experiments); however, each genome sits behind an institutional facade. I'm thinking of a community based approach (like Wikipedia).
> So it is not designed to integrate genome assembly from various sources? Am
> I right?
Sounds right. Rather, it hosts the results of such integration. I'm not suggesting that we implement an assembler, just a 'wiki-style' tool to store and edit an assembly.
> 3. From my background, I am strong in Java, as well as the object oriented
> principles. I have also... <snip, to protect the innocent>
> However, I have never used Git, Chado and Catalyst. I am
> confident that I can learn very fast based on my background and capability.
> But I am a little bit concerned that whether these skills are compulsory or
> not.
Certainly nothing is compulsory. Unlike some of the other ideas, this idea is very speculative and open ended. I'm relying on a high level of competence and independence.
> 4. And Finally, how is the current progress of this project? Have you
> designed any model or architecture for the tool? Where should I start if I
> would like to learn more about the implementation details?
Unlike the other ideas, this is currently a pure 'vapour-ware' project. However, I think there are several core components in place that you can begin to research. Firstly, the Chado database system could form the core of an 'assembly data model'. Secondly, many of the Bio* projects (BioPerl, BioJava, BioEtc) have 'assembly' objects. Learning about how to handle assemblies in these languages will surely be important. Thirdly, I think we can build on the MediaWiki, Semantic-MediaWiki and Semantic Forms tools for creating the 'wiki' component. One distinct sub-project would be to focus on an 'annotation wiki', which I think could serve the community very well, if done in a sufficiently generic and useful way (i.e. not just another 'me too' project).
Email 03
OK, think about a genome assembly... how is it done? How may it be edited and improved over time? Think about biologists with specific interests in specific regions of the genome doing work to refine those regions. How will they want to contribute their work back to the 'community genome'?
Think about (and research) what a genome assembly is, how it is built, the important information it carries, and the kinds of ways that people may want to edit it.
This comes down to thinking about data structures, databases and data models and algorithms for editing those data structures.
So far so good... Now we want to layer provenance and version control on top, creating a community maintained structured assembly database.
I think this is a very ambitious proposal, so it will need a huge amount of work to get even close to creating something useful. I don't want to put you off, but we need to be realistic!
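To make the 'edits plus provenance' idea in this email a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Placement` and `Edit` classes, the two operations ('flip' and 'move') and the `replay` function are hypothetical names, not part of Chado, BioPerl or any existing tool. It treats a scaffold as an ordered list of oriented contigs and stores each change as a record that says who made it and why, so that any version can be rebuilt by replaying the history.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Placement:
    """One contig placed on a scaffold (hypothetical minimal model)."""
    contig_id: str
    orientation: str  # '+' or '-'


@dataclass(frozen=True)
class Edit:
    """A recorded change plus its provenance: who made it and why."""
    author: str
    reason: str
    operation: str      # 'flip' or 'move' in this sketch
    contig_id: str
    new_index: int = -1  # only used by 'move'
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def apply_edit(layout: List[Placement], edit: Edit) -> List[Placement]:
    """Return a new layout with the edit applied; the input is not modified."""
    positions = [i for i, p in enumerate(layout) if p.contig_id == edit.contig_id]
    if not positions:
        raise KeyError(f"contig {edit.contig_id!r} is not in the layout")
    index = positions[0]
    result = list(layout)
    if edit.operation == "flip":
        old = result[index]
        flipped = "-" if old.orientation == "+" else "+"
        result[index] = Placement(old.contig_id, flipped)
    elif edit.operation == "move":
        result.insert(edit.new_index, result.pop(index))
    else:
        raise ValueError(f"unknown operation {edit.operation!r}")
    return result


def replay(base: List[Placement], history: List[Edit]) -> List[Placement]:
    """Rebuild a version of the layout by applying the edits in order."""
    layout = list(base)
    for edit in history:
        layout = apply_edit(layout, edit)
    return layout
```

Because the history is an append-only list, answering 'who changed this region, and on what evidence?' becomes a scan over `Edit` records. Whether such a log would live in Git, in a database or somewhere else is left open here, as it is in the email.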
IRC LOG
Edited for clarity
19:36 <@rbuels> mmlevitt: chado and chadoxml probably isn't something that would be good for storing and versioning an assembly
19:37 <@rbuels> mmlevitt: xml isn't really a good format for holding *data*, it's good for *documents* that have kind of a looser structure
19:37 <@rbuels> mmlevitt: dbolser and i have discussed this version-control-for-genomes thing before, i think it's a good idea
19:38 <@rbuels> mmlevitt: and git is probably the right version control system to be looking at
19:38 <@rbuels> mmlevitt: as for file formats ... it depends.
19:39 * rbuels thinks
19:40 <@rbuels> i dunno. there is a lot of variation in how to represent genome assemblies.
19:41 <@rbuels> you would need to have something that almost anything could be stored as ...
19:42 <@rbuels> NCBI, for file formats, seems to be standardizing on AGP for representing the finished assembly of how the contigs fit together
19:42 <@rbuels> as for the contigs themselves, i'm not sure if they want to store how the contigs are assembled from reads
19:43 <@rbuels> lots of file formats for storing that kind of thing, .ace might be the most common
19:44 <@rbuels> AGP is probably the emerging standard for representing how the contigs fit together into the finished assembly
19:44 <@rbuels> mmlevitt: and the contigs themselves, if you are looking at their sub-assemblies, .ace (ACE) is a very common format, but there are lots of other common ones
19:46 <@rbuels> if you're going to make an assembly-versioning system, you probably don't want to use a relational database like chado
19:47 <@rbuels> and certainly not a highly-normalized, super-general relational database schema, which chado also is.
19:47 <@rbuels> performance is going to be a big deal here ... your typical assembly runs into many GB of data.
19:48 <@rbuels> at least eukaryotic ones do
19:48 <@rbuels> i guess if you're doing bacteria genomes and such, it's smaller
19:48 <@rbuels> but still big
19:49 <@rbuels> the key to all of this, as i see it, is *how do you represent and visualize the differences between assemblies*
19:49 <@rbuels> and how do you do that quickly and easily
19:49 <@rbuels> that's a tough one.
19:51 <@rbuels> cause with git, you can make all kinds of diffs between different places in the history
19:51 <@rbuels> git's super powerful for that
19:52 <@rbuels> a colorized text diff like git makes is great for telling differences in source code
19:52 <@rbuels> but an assembly is not source code.
19:53 <@rbuels> you can certainly store the assembly in git, but git isn't going to help that much in visualizing the differences between the assemblies.
19:53 <@rbuels> between the versions of the assemblies, that is.
05:03 < dbolser> I've been using dnadiff within MUMmer to compare assemblies. It's fast, but it fails at visualizing the differences clearly.
05:04 < dbolser> for storing the reads / contigs, I'd recommend BAM
05:04 < dbolser> BAM is web-scale, ace is pre-2007
05:05 < dbolser> Seriously though, the really nice thing about BAM is that you can stream it over an FTP / HTTP connection very efficiently using its index
14:14 < dbolser> I only just got mmlevitt's idea ... BioForge would be like SourceForge, but instead of hosting source code projects, it'd host genome assemblies! Pretty nice idea sir!
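As a rough illustration of the point that an assembly is not source code, the Python sketch below compares two versions of an AGP file by component rather than by text line. It is only a sketch under stated assumptions. The function names `parse_agp` and `diff_assemblies` are made up for this page. Each component ID is assumed to appear at most once per file. Only the standard nine tab-separated AGP columns are read, and gap lines (component type N or U) are skipped. The example records at the bottom are invented.

```python
from typing import Dict, List, NamedTuple, Tuple


class Component(NamedTuple):
    obj: str          # the scaffold or pseudomolecule being built
    obj_beg: int
    obj_end: int
    component_id: str
    orientation: str


def parse_agp(text: str) -> Dict[str, Component]:
    """Index the non-gap lines of an AGP file by component ID."""
    components: Dict[str, Component] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if cols[4] in ("N", "U"):
            continue  # gap lines describe spacing, not a component
        components[cols[5]] = Component(
            cols[0], int(cols[1]), int(cols[2]), cols[5], cols[8]
        )
    return components


def diff_assemblies(
    old: Dict[str, Component], new: Dict[str, Component]
) -> List[Tuple[str, str]]:
    """List (kind, component_id) changes between two parsed AGP versions."""
    changes: List[Tuple[str, str]] = []
    for cid in sorted(old.keys() | new.keys()):
        before, after = old.get(cid), new.get(cid)
        if before is None:
            changes.append(("added", cid))
        elif after is None:
            changes.append(("removed", cid))
        else:
            if before[:3] != after[:3]:  # object name or coordinates differ
                changes.append(("moved", cid))
            if before.orientation != after.orientation:
                changes.append(("flipped", cid))
    return changes


def agp_line(*fields: object) -> str:
    """Join fields with tabs (helper for the invented example data)."""
    return "\t".join(str(f) for f in fields)


# Invented example: version 2 flips ctgB and appends ctgC.
V1 = "\n".join([
    agp_line("chr1", 1, 100, 1, "W", "ctgA", 1, 100, "+"),
    agp_line("chr1", 101, 150, 2, "N", 50, "scaffold", "yes", "paired-ends"),
    agp_line("chr1", 151, 300, 3, "W", "ctgB", 1, 150, "+"),
])
V2 = "\n".join([
    agp_line("chr1", 1, 100, 1, "W", "ctgA", 1, 100, "+"),
    agp_line("chr1", 101, 150, 2, "N", 50, "scaffold", "yes", "paired-ends"),
    agp_line("chr1", 151, 300, 3, "W", "ctgB", 1, 150, "-"),
    agp_line("chr1", 301, 400, 4, "W", "ctgC", 1, 100, "+"),
])

for kind, cid in diff_assemblies(parse_agp(V1), parse_agp(V2)):
    print(kind, cid)
```

A line-based diff of the same two files would show one changed line and one added line without saying what they mean. A report of 'flipped' and 'added' components is closer to what someone curating the assembly would want to see, although it still leaves the visualization question raised above unanswered.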
Email 04
Perhaps it's worth distinguishing between assembly data (AGP for pseudomolecules / genome scale structures, and BAM or ACE for contigs and scaffolds) and assembly metadata that could be used for more rapid assembly-assembly comparison. I recommend that you learn a bit about assembly (how it is done), think a bit about data structures for assembly data (how it could be stored), and read a bit about AGP, ACE and BAM. The next step becomes thinking about how to edit these data structures and how to represent the edits in a biologically meaningful and computationally tractable way. This is where assembly metadata may come into it. I think there are some more fundamental decisions to be made (including simply judging the feasibility of this proposal) before we worry too much about details of Git vs. SVN.
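One possible reading of 'assembly metadata' for rapid comparison is a small fingerprint per scaffold or pseudomolecule, so that two versions can be compared without reading the full data. The Python sketch below is an assumption made for illustration, not something proposed in the email. `layout_fingerprints` and `changed_objects` are hypothetical names, and SHA-1 is used only as a convenient digest. The input is taken to be AGP text, whose first tab-separated column names the object being built.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def layout_fingerprints(agp_text: str) -> Dict[str, str]:
    """Map each AGP object to a digest of the lines that describe its layout."""
    lines_by_object: Dict[str, List[str]] = defaultdict(list)
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        object_name = line.split("\t")[0]
        lines_by_object[object_name].append(line)
    return {
        name: hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()
        for name, lines in lines_by_object.items()
    }


def changed_objects(old_meta: Dict[str, str], new_meta: Dict[str, str]) -> List[str]:
    """Return the objects whose fingerprint differs or exists in only one version."""
    names = old_meta.keys() | new_meta.keys()
    return sorted(n for n in names if old_meta.get(n) != new_meta.get(n))
```

With fingerprints stored alongside each version, a comparison first narrows down which objects changed at all, and only those need a detailed, component-level comparison. The digest says nothing about what changed, so it complements a finer-grained diff and does not replace one.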