Difference between revisions of "GSoC/IDEA 9"


Revision as of 00:02, 23 March 2011


Page for project discussion, ideas and some other third thing.


Similar projects

Correspondence

I found there was a lot of 'invisible' information floating round by email. Here it is for reference:

Email 01

I suggest you start work immediately, researching the common assembly format, ACE, and its implementation in BioPerl (recently overhauled by Rob Buels). We then need a preliminary appraisal of whether this format is suitable for 'community assembly' based on NGS technology, covering its limitations, possible improvements, etc.

Do you know of any other genome assembly formats? I guess AGP is another... there are currently no BioPerl modules for handling AGP, but I have some AGP code that could be added to BioPerl, and I know Rob Buels does too.
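For reference while researching formats, here is a minimal sketch of reading AGP in Python. It assumes the standard tab-delimited AGP layout (nine columns, with component types N and U marking gaps); the `AgpLine` class, the `parse_agp` function and the sample scaffold are hypothetical names and data made up for illustration, not BioPerl code.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

GAP_TYPES = {"N", "U"}  # AGP component types that denote gaps


@dataclass
class AgpLine:
    """One AGP row (hypothetical container, not a BioPerl object)."""
    obj: str
    obj_beg: int
    obj_end: int
    part_number: int
    component_type: str
    component_id: Optional[str] = None
    component_beg: Optional[int] = None
    component_end: Optional[int] = None
    orientation: Optional[str] = None
    gap_length: Optional[int] = None
    gap_type: Optional[str] = None
    linkage: Optional[str] = None

    @property
    def is_gap(self) -> bool:
        return self.component_type in GAP_TYPES


def parse_agp(text: str) -> Iterator[AgpLine]:
    """Yield one AgpLine per data row, skipping comments and blank lines."""
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith("#"):
            continue
        cols = raw.split("\t")
        if len(cols) < 8:
            raise ValueError(f"too few columns in AGP row: {raw!r}")
        obj, beg, end, part, ctype = (
            cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4])
        if ctype in GAP_TYPES:
            # Gap rows: columns 6-8 are gap length, gap type, linkage.
            yield AgpLine(obj, beg, end, part, ctype,
                          gap_length=int(cols[5]), gap_type=cols[6],
                          linkage=cols[7])
        else:
            if len(cols) < 9:
                raise ValueError(f"component row needs 9 columns: {raw!r}")
            # Component rows: columns 6-9 are id, begin, end, orientation.
            yield AgpLine(obj, beg, end, part, ctype,
                          component_id=cols[5], component_beg=int(cols[6]),
                          component_end=int(cols[7]), orientation=cols[8])


# Hypothetical two-contig scaffold, for illustration only.
SAMPLE_AGP = "\n".join([
    "# made-up example data",
    "scaffold_1\t1\t1000\t1\tW\tcontig_A\t1\t1000\t+",
    "scaffold_1\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends",
    "scaffold_1\t1101\t1600\t3\tW\tcontig_B\t1\t500\t-",
])
records = list(parse_agp(SAMPLE_AGP))
```

Each row describes how one component (or gap) is placed on a larger object, which is roughly the information a 'community assembly' tool would need to store and edit.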

A very concrete sub-project would be to refactor the assembly manipulation tools available in BioPerl. I've CC'ed Chris Fields for his input on that.

Also, we need to work out how to integrate structured data with inherently 'flat' version control such as Git.
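One possible approach, sketched here in Python purely as an assumption and not something settled in this correspondence, is to serialise the structured records into a canonical, one-record-per-line text file so that a line-based tool such as Git produces small, meaningful diffs. The record fields and the `to_flat_text` helper are hypothetical.

```python
import difflib
import json


def to_flat_text(placements):
    """Serialise records one per line, in a stable order with sorted keys,
    so the same assembly always yields byte-identical text."""
    ordered = sorted(placements, key=lambda p: (p["scaffold"], p["start"]))
    return "".join(json.dumps(p, sort_keys=True) + "\n" for p in ordered)


# Hypothetical contig placements, for illustration only.
before = [
    {"scaffold": "scaffold_1", "start": 1, "end": 1000,
     "contig": "contig_A", "strand": "+"},
    {"scaffold": "scaffold_1", "start": 1101, "end": 1600,
     "contig": "contig_B", "strand": "-"},
]

# A curator flips the orientation of contig_B.
after = [dict(p) for p in before]
after[1]["strand"] = "+"

# A line-based diff (what Git would see) touches only the edited record.
diff = list(difflib.unified_diff(
    to_flat_text(before).splitlines(),
    to_flat_text(after).splitlines(),
    fromfile="assembly@v1", tofile="assembly@v2", lineterm="", n=0))
```

The design choice is that ordering and key order are fixed by the serialiser, not by whoever last wrote the file, so an edit to one record changes one line.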

P.S. You may like

Reply

BioPerl supports the following assembly formats:

-http://www.bioperl.org/wiki/Module:Bio::Assembly::IO



Email 02

I should stress
that, if you decided to work on this project, I wouldn't be able to
provide much time in the form of mentoring. Sorry about that, but I'm
just being realistic. The main requirement, therefore, would be that
you could work independently with only 'high level' guidance. This
isn't a challenge, it's for you to decide what is best for you.


> I am writing to express my interest in the GSOC 2011 project of GMOD. I am
> particularly interested in the Idea 9: Develop collaborative genome assembly
> tools and databases. I would rather appreciate if you could introduce more
> about this idea. Here are some details I would like to know:

> 1. As far as I know, current sequence databases, such as dbSNP and UCSC
> Genome Browser, do have central databases and versioning (although I do not
> know how they implement the versioning). Why we need another one?

Well... a big organization putting a version on a certain database
release is very different from what I'm proposing, which is a version
control system specifically for genome assembly. The human genome
build version is handled by a large centralised organization with very
well laid out policy and guidelines and they are working on a very
important genome. Here there is no problem, and we all can respect
their authority and use their version codes.

As sequencing gets cheaper, however, specialists get more disparate,
and less and less investment goes into generating the data. In such
cases, it behoves us to have a good way to coordinate community
activities behind the best 'working draft' assembly. There are no
organizations ready to step in, no money to spend, and no policy in
place.

I'm thinking of something like Chado (which is for genome annotation)
for genome assembly.


> 2. From my understanding, this tool is designed to be used by a community of
> researchers, who may have access to certain genome assembly. This community
> contributes to the completion and analysis on the same genome. Is it right?

Yes. In fact Chado (and GMOD) does a very good job of housing genome
analysis results (such as new metabolomics, proteomics or
transcriptomics experiments), however, each genome sits behind an
institutional facade. I'm thinking of a community based approach (like
Wikipedia).


> So it is not designed to integrate genome assembly from various sources? Am
> I right?

Sounds right. Rather it hosts the results of such integration. I'm not
suggesting to implement an assembler, just a 'wiki-style' tool to
store and edit an assembly.


> 3. From my background, I am strong in Java, as well as the object oriented
> principles. I have also...

<snip, to protect the innocent>

> However, I have never used Git, Chado and Catalyst. I am
> confident that I can learn very fast based on my background and capability.
> But I am a little bit concerned that whether these skills are compulsory or
> not.

Certainly nothing is compulsory. Unlike some of the other ideas, this
idea is very speculative and open ended. I'm relying on a high level
of competence and independence.


> 4. And Finally, how is the current progress of this project? Have you
> designed any model or architecture for the tool? Where should I start if I
> would like to learn more about the implementation details?

Unlike the other ideas, this is currently a pure 'vapour-ware'
project. However, I think there are several core components in place
that you can begin to research. Firstly, the Chado database system
could form the core of an 'assembly data model'. Secondly, many of the
Bio* projects (BioPerl, BioJava, BioEtc) have 'assembly' objects.
Learning about how to handle assemblies in these languages will surely
be important. Thirdly, I think we can build on the MediaWiki,
Semantic-MediaWiki and Semantic Forms tools for creating the 'wiki'
component.

One distinct sub project would be to focus on an 'annotation wiki',
which I think could serve the community very well, if done in a
sufficiently generic and useful way (i.e. not just another 'me too'
project).

Email 03

OK, think about a genome assembly... how is it done? How may it be edited and improved over time? Think about biologists with specific interests in specific regions of the genome doing work to refine those regions. How will they want to contribute their work back to the 'community genome'?

Think about (and research) what a genome assembly is, how it is built, the important information it carries, and the kinds of ways that people may want to edit it.

This comes down to thinking about data structures, databases and data models and algorithms for editing those data structures.

So far so good... Now we want to layer provenance and version control on top, creating a community maintained structured assembly database.
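As a rough illustration of what 'provenance and version control on top' could mean, here is a minimal Python sketch. It assumes an assembly is reduced to an ordered list of (contig, strand) placements and that every edit creates a new immutable revision recording its parent, author and reason. All class, method and contig names are hypothetical; a real data model would need far more than this.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional, Tuple

Placement = Tuple[str, str]  # (contig id, strand); a deliberately tiny model


@dataclass(frozen=True)
class Revision:
    parent: Optional[str]        # id of the revision this one was derived from
    author: str                  # who made the edit (provenance)
    message: str                 # why the edit was made (provenance)
    placements: Tuple[Placement, ...]

    @property
    def rev_id(self) -> str:
        # Content-derived id, loosely in the spirit of Git object hashes.
        payload = json.dumps(
            [self.parent, self.author, self.message, self.placements])
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:12]


class AssemblyHistory:
    """Append-only chain of revisions; old states are never modified."""

    def __init__(self):
        self.revisions = {}
        self.head = None

    def commit(self, author, message, placements):
        rev = Revision(self.head, author, message, tuple(placements))
        self.revisions[rev.rev_id] = rev
        self.head = rev.rev_id
        return rev.rev_id

    def flip(self, author, contig, reason):
        """Example edit operation: reverse the strand of one contig."""
        current = self.revisions[self.head].placements
        if contig not in [c for c, _ in current]:
            raise KeyError(contig)
        edited = [(c, ("-" if s == "+" else "+") if c == contig else s)
                  for c, s in current]
        return self.commit(author, f"flip {contig}: {reason}", edited)

    def log(self):
        """Walk from the newest revision back to the first."""
        out, rev_id = [], self.head
        while rev_id is not None:
            rev = self.revisions[rev_id]
            out.append((rev_id, rev.author, rev.message))
            rev_id = rev.parent
        return out


history = AssemblyHistory()
v1 = history.commit("assembler", "initial draft",
                    [("contig_A", "+"), ("contig_B", "-")])
v2 = history.flip("curator_1", "contig_B", "hypothetical mate-pair evidence")
```

Walking `history.log()` gives the chain of who changed what and why, which is the kind of record a community-maintained assembly would need to keep for every edit.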

I think this is a very ambitious proposal, so it will need a huge amount of work to get even close to creating something useful. I don't want to put you off, but we need to be realistic!