Difference between revisions of "GSoC/IDEA 9"
Revision as of 00:02, 23 March 2011
Page for project discussion, ideas and some other third thing.
Similar projects
Correspondence
I found there was a lot of 'invisible' information floating round by email. Here it is for reference:
Email 01
I suggest you start work immediately, researching the common assembly format, ACE, and its implementation in BioPerl (recently overhauled by Rob Buels). We then need a preliminary appraisal of whether this format is suitable for 'community assembly' based on NGS technology: its limitations, possible improvements, etc.
Do you know of any other genome assembly formats? I guess AGP is another... there are currently no BioPerl modules for handling AGP, but I have some AGP code that could be added to BioPerl, and I know Rob Buels does too.
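As a rough illustration of what handling AGP involves, here is a minimal Python sketch of a reader. It is not the BioPerl code mentioned above. It assumes the usual tab-delimited AGP layout, in which five shared columns (object, start, end, part number, component type) are followed by either component fields or gap fields, with component types N and U denoting gaps. The class names, the function name `parse_agp` and the sample scaffold are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union

GAP_TYPES = {"N", "U"}  # component_type codes assumed to denote gaps


@dataclass(frozen=True)
class Component:
    obj: str
    obj_beg: int
    obj_end: int
    part_number: int
    component_type: str
    component_id: str
    comp_beg: int
    comp_end: int
    orientation: str


@dataclass(frozen=True)
class Gap:
    obj: str
    obj_beg: int
    obj_end: int
    part_number: int
    component_type: str
    gap_length: int
    gap_type: str
    linkage: str
    linkage_evidence: str  # empty when the ninth column is absent


def parse_agp(lines: Iterable[str]) -> Iterator[Union[Component, Gap]]:
    """Yield one Component or Gap per data line, skipping comments and blanks."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"too few columns: {line!r}")
        obj, beg, end, part, ctype = (
            cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4])
        if ctype in GAP_TYPES:
            evidence = cols[8] if len(cols) > 8 else ""
            yield Gap(obj, beg, end, part, ctype,
                      int(cols[5]), cols[6], cols[7], evidence)
        else:
            if len(cols) < 9:
                raise ValueError(f"component line needs 9 columns: {line!r}")
            yield Component(obj, beg, end, part, ctype,
                            cols[5], int(cols[6]), int(cols[7]), cols[8])


# Hypothetical scaffold: two contigs joined across a 100 bp gap.
EXAMPLE = "\n".join([
    "# made-up example data",
    "scaffold_1\t1\t1000\t1\tW\tcontig_A\t1\t1000\t+",
    "scaffold_1\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends",
    "scaffold_1\t1101\t1600\t3\tW\tcontig_B\t1\t500\t-",
])

records = list(parse_agp(EXAMPLE.splitlines()))
```

Each record keeps the object coordinates, so a consumer can walk a scaffold in order and tell placed contigs from gaps by type.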
A very concrete sub-project would be to refactor the assembly manipulation tools available in BioPerl. I've CC'ed Chris Fields for his input on that.
Also, we need to work out how to integrate structured data with inherently 'flat' version control such as Git.
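One possible approach to the 'flat' version control question, sketched here as an assumption rather than a settled design, is to serialise the structured assembly into a canonical, line-oriented text form. If each placement occupies exactly one line, with keys and records in a fixed order, a small edit should produce a small line-based diff. The record fields and the function `to_flat_lines` below are hypothetical, and Python's `difflib` stands in for the diff that a tool like Git would compute.

```python
import difflib
import json


def to_flat_lines(records):
    """Serialise records as one canonical JSON object per line.

    Sorting the records and the keys makes the output independent of
    the order in which the data happened to be held in memory.
    """
    ordered = sorted(records, key=lambda r: (r["scaffold"], r["start"]))
    return [json.dumps(r, sort_keys=True) for r in ordered]


before = [
    {"scaffold": "scaffold_1", "start": 1, "end": 1000,
     "contig": "contig_A", "strand": "+"},
    {"scaffold": "scaffold_1", "start": 1101, "end": 1600,
     "contig": "contig_B", "strand": "-"},
]

# A single hypothetical edit: flip the orientation of contig_B.
after = [dict(r) for r in before]
after[1]["strand"] = "+"

diff = list(difflib.unified_diff(
    to_flat_lines(before), to_flat_lines(after), lineterm="", n=0))

# Keep only added/removed content lines, dropping the file headers.
changed = [
    line for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
```

Here the one-field edit shows up as a single removed line and a single added line, which is the property that would let a line-based tool track changes to the assembly sensibly.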
P.S. You may like
Reply
BioPerl supports the following assembly formats:
-http://www.bioperl.org/wiki/Module:Bio::Assembly::IO
Email 02
I should stress that, if you decided to work on this project, I wouldn't be able to provide much time in the form of mentoring. Sorry about that, but I'm just being realistic. The main requirement, therefore, would be that you could work independently with only 'high level' guidance. This isn't a challenge, it's for you to decide what is best for you.

> I am writing to express my interest in the GSOC 2011 project of GMOD. I am
> particularly interested in the Idea 9: Develop collaborative genome assembly
> tools and databases. I would rather appreciate if you could introduce more
> about this idea. Here are some details I would like to know:

> 1. As far as I know, current sequence databases, such as dbSNP and UCSC
> Genome Browser, do have central databases and versioning (although I do not
> know how they implement the versioning). Why we need another one?

Well... a big organization putting a version on a certain database release is very different from what I'm proposing, which is a version control system specifically for genome assembly. The human genome build version is handled by a large centralised organization with very well laid out policy and guidelines, and they are working on a very important genome. Here there is no problem, and we can all respect their authority and use their version codes.

As sequencing gets cheaper, however, specialists get more disparate, and less and less investment goes into generating the data. In such cases, it behoves us to have a good way to coordinate community activities behind the best 'working draft' assembly. There are no organizations ready to step in, no money to spend, and no policy in place. I'm thinking of something like Chado (which is for genome annotation) for genome assembly.

> 2. From my understanding, this tool is designed to be used by a community of
> researchers, who may have access to certain genome assembly. This community
> contributes to the completion and analysis on the same genome. Is it right?

Yes. In fact Chado (and GMOD) does a very good job of housing genome analysis results (such as new metabolomics, proteomics or transcriptomics experiments); however, each genome sits behind an institutional facade. I'm thinking of a community based approach (like Wikipedia).

> So it is not designed to integrate genome assembly from various sources? Am
> I right?

Sounds right. Rather it hosts the results of such integration. I'm not suggesting that we implement an assembler, just a 'wiki-style' tool to store and edit an assembly.

> 3. From my background, I am strong in Java, as well as the object oriented
> principles. I have also... <snip, to protect the innocent>
> However, I have never used Git, Chado and Catalyst. I am
> confident that I can learn very fast based on my background and capability.
> But I am a little bit concerned that whether these skills are compulsory or
> not.

Certainly nothing is compulsory. Unlike some of the other ideas, this idea is very speculative and open ended. I'm relying on a high level of competence and independence.

> 4. And Finally, how is the current progress of this project? Have you
> designed any model or architecture for the tool? Where should I start if I
> would like to learn more about the implementation details?

Unlike the other ideas, this is currently a pure 'vapour-ware' project. However, I think there are several core components in place that you can begin to research.

Firstly, the Chado database system could form the core of an 'assembly data model'.

Secondly, many of the Bio* projects (BioPerl, BioJava, BioEtc) have 'assembly' objects. Learning about how to handle assemblies in these languages will surely be important.

Thirdly, I think we can build on the MediaWiki, Semantic-MediaWiki and Semantic Forms tools for creating the 'wiki' component.

One distinct sub project would be to focus on an 'annotation wiki', which I think could serve the community very well, if done in a sufficiently generic and useful way (i.e. not just another 'me too' project).
Email 03
OK, think about a genome assembly... how is it done? How may it be edited and improved over time? Think about biologists with specific interests in specific regions of the genome doing work to refine those regions. How will they want to contribute their work back to the 'community genome'?
Think about (and research) what a genome assembly is, how it is built, the important information it carries, and the kinds of ways that people may want to edit it.
This comes down to thinking about data structures, databases and data models and algorithms for editing those data structures.
So far so good... Now we want to layer provenance and version control on top, creating a community maintained structured assembly database.
I think this is a very ambitious proposal, so it will need a huge amount of work to get even close to creating something useful. I don't want to put you off, but we need to be realistic!
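To make the ideas in Email 03 a little more concrete, here is a deliberately tiny Python sketch of an editable assembly with provenance and version history layered on top. Everything in it is an assumption made for illustration. A scaffold is reduced to an ordered list of (contig, strand) placements, `flip_contig` is one example of an edit, and each revision records its author, a message and the identifier of its parent revision. A real data model would need coordinates, gaps, sequence and evidence, and would more likely sit in a database such as Chado than in memory.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

Placement = Tuple[str, str]  # (contig_id, strand), a hypothetical minimal model


@dataclass(frozen=True)
class Revision:
    parent: Optional[str]      # rev_id of the previous revision, None for the root
    author: str                # provenance: who made the edit
    message: str               # provenance: why the edit was made
    placements: Tuple[Placement, ...]

    @property
    def rev_id(self) -> str:
        """Content hash covering the state, its provenance and its parent."""
        payload = json.dumps(
            [self.parent, self.author, self.message, self.placements])
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()


class AssemblyHistory:
    """Append-only chain of revisions with a pointer to the latest one."""

    def __init__(self, placements: Sequence[Placement], author: str,
                 message: str = "initial import") -> None:
        root = Revision(None, author, message, tuple(placements))
        self.revisions: Dict[str, Revision] = {root.rev_id: root}
        self.head = root.rev_id

    def commit(self, placements: Sequence[Placement], author: str,
               message: str) -> str:
        rev = Revision(self.head, author, message, tuple(placements))
        self.revisions[rev.rev_id] = rev
        self.head = rev.rev_id
        return rev.rev_id

    def log(self) -> List[Revision]:
        """Return revisions from newest to oldest by following parent links."""
        out, cur = [], self.head
        while cur is not None:
            rev = self.revisions[cur]
            out.append(rev)
            cur = rev.parent
        return out


def flip_contig(placements: Sequence[Placement], contig_id: str) -> List[Placement]:
    """Example edit: reverse the strand of one contig, leaving the rest alone."""
    if all(c != contig_id for c, _ in placements):
        raise KeyError(contig_id)
    flipped = {"+": "-", "-": "+"}
    return [(c, flipped[s] if c == contig_id else s) for c, s in placements]


history = AssemblyHistory([("contig_A", "+"), ("contig_B", "-")], author="importer")
current = history.revisions[history.head].placements
history.commit(flip_contig(current, "contig_B"), author="alice",
               message="contig_B orientation corrected (hypothetical evidence)")
```

Because earlier revisions are never modified, any past state of the 'community genome' can be recovered by following parent links, and every change carries a record of who made it and why.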